
Why are our asset downloads so slow?

At Fliva we render many videos, and some of those videos consist of one or more quite large assets.

Our servers and our assets are - usually - placed in an AWS data center, and we usually have the assets in an S3 bucket in the same data center as the rendering machine.

We have many assets, and some of them are large. Our business - or at least our financial surplus - lives and dies by doing our renderings as fast as possible.

Quick overview of our rendering servers

We spend time doing five things on our rendering machines:

  • downloading assets
  • rendering the video
  • mixing the audio
  • muxing the final video
  • uploading the final video

The middle three are currently three separate processes: the first two happen concurrently, and the last depends on those finishing. At some point, we may combine this into one process. However, for now, this is how we do it - and it is quite fast.

The final video file size is usually within the 50 MB to 150 MB range. Uploading that takes next to no time.

However, the source assets can be extreme. Especially when we have movie clips with transparent regions - in those cases we usually use an uncompressed video format, and those clips can be several gigabytes even for short clips.

A significant portion of render server time is spent downloading files instead of rendering video. The rendering runner software is - currently - written in Ruby. We have gone through a couple of iterations of how we download files.

Some key features that need to be addressed by our download code are:

  • We have multiple assets per rendering
  • We have multiple renderings that can happen at the same time
  • Many renderings use the same assets, so we cache them locally
  • We do not have infinite disk space

Naive approach

Our first naive approach was something like:

  1. check the cache
  2. if present copy to working dir
  3. if not, download to cache and copy to working dir
require 'digest/md5'
require 'fileutils'
require 'net/http'
require 'openssl'
require 'uri'

class AssetDownloader
  def copy(url, filename)
    FileUtils.cp(get_file(url), filename)
  end

  private

  # Returns the path of the asset in the local cache, downloading it first if needed.
  def get_file(url)
    return cache_key_for(url) if cached?(url)
    download(url)
  end

  def cached?(url)
    File.exist?(cache_key_for(url))
  end

  def download(url)
    filename = cache_key_for(url)
    uri = URI.parse(url)
    http = Net::HTTP.new(uri.host, uri.port)
    if url.start_with?('https')
      http.use_ssl = true
      http.verify_mode = OpenSSL::SSL::VERIFY_NONE
    end

    http.start do |h|
      File.open(filename, 'wb') do |f|
        get_path = uri.path
        get_path = "#{get_path}?#{uri.query}" if uri.query
        # Stream the response body to disk chunk by chunk
        h.get(get_path) do |data|
          f.write(data)
        end
      end
    end
    filename
  end

  def cache_key_for(url)
    file_id = Digest::MD5.hexdigest(url)
    extension = File.extname(URI.parse(url).path)
    File.join(ENV.fetch('ASSET_CACHE_PATH', '/tmp'), "#{file_id}#{extension}")
  end
end
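
For context, a call site for this class looks roughly like the following - the URL and working-directory path here are made up for illustration:

    # Hypothetical usage: copy one asset into a rendering's working directory.
    downloader = AssetDownloader.new
    downloader.copy(
      'https://example-bucket.s3.eu-west-1.amazonaws.com/assets/intro_clip.mov',
      '/var/renderings/job-123/intro_clip.mov'
    )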

This code worked great while we were in the development phase, and even in our initial tests with customers. However, a couple of problems are apparent, and we found most of them before going live with this code.

The most obvious may be that if two instances of this downloader are running at the same time on the same server, downloading the same asset, shit will hit the fan.

A couple of nasty things can happen here.

If one process sees an empty cache and begins downloading, the other could see that unfinished download and copy it to its working directory. Now the file we try to render is invalid, and the rendering engine blows up with a non-obvious error.

If both processes see an empty cache, they will both try to write to the same cache file. Depending on the timing, that could cause all sorts of problems as well.

Next iteration - fix concurrency

First, our copy code changed slightly, to ensure that another process does not delete the file before we get to copy it.

      File.open(filename, 'r') do |_f|
        FileUtils.cp filename, to_file
      end

However, our download code is what has changed the most.

    def download(url)
      return '' if url == ''

      uri = URI(url)

      out_filename = file_name_for(url)
      http = Net::HTTP.new(uri.host, uri.port)
      if url.start_with?('https')
        http.use_ssl = true
        http.verify_mode = OpenSSL::SSL::VERIFY_NONE
      end

      tmpfile = Tempfile.new(file_id_for(url))
      tmpfile.binmode
      http.start do |h|
        get_path = uri.path
        get_path = "#{get_path}?#{uri.query}" if uri.query
        response = h.get(get_path) do |str|
          tmpfile.write str
        end

        unless response.is_a? Net::HTTPSuccess
          # close and remove file if download failed
          tmpfile.close
          tmpfile.unlink
          raise "Failed downloading: #{url} to #{out_filename} because #{response.code}: #{response.message}"
        end
      end
      tmpfile.close
      FileUtils.mv(tmpfile.path, out_filename)
      tmpfile.unlink
      out_filename
    end

At its core, all this code boils down to a single change.

Download to a - unique - temp file, then move that temp file into place on success. The move is an atomic rename on disk (it just creates a new name for the same data), as long as the temp file and the destination are on the same filesystem; otherwise FileUtils.mv falls back to copying.
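
Here is a minimal standalone sketch of that pattern - not our production code - with the temp file created inside the cache directory so the final rename is guaranteed to stay on one filesystem (the cache path and target name are placeholders):

    require 'fileutils'
    require 'tempfile'

    cache_dir = ENV.fetch('ASSET_CACHE_PATH', '/tmp')

    # Creating the temp file next to the cache entry keeps the rename on one filesystem.
    tmpfile = Tempfile.new('asset-download', cache_dir)
    tmpfile.binmode

    begin
      # ... stream the HTTP response body into tmpfile here ...
      tmpfile.close
      # Atomic within one filesystem: readers see either no file or the complete file.
      FileUtils.mv(tmpfile.path, File.join(cache_dir, 'final-cache-key.mov'))
    ensure
      tmpfile.close unless tmpfile.closed?
      tmpfile.unlink # no-op if the file has already been moved
    end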

This version has run in production for a long time and has worked pretty well - with one caveat: it is slow! More so for large files, and even more so when we have multiple large files downloading concurrently.

We have a single - rather ambitious - video template with 22 GB of assets, most of which is nine scenes of uncompressed MOV files, plus a couple of smaller MP4 files. Downloading all this footage takes approximately 10-15 minutes. Rendering the video takes about two minutes.

Finding the problem

We started with a set of questions.

  • What is the max throughput we can get when downloading files from s3 to a rendering server in the same data center?
  • What is the max throughput we can get from our Ruby download code?
  • How big is the difference?
  • Why is the Ruby code slow?

Max throughput using wget

We decided to create a little script, and use wget to download. We figured wget had the smallest overhead.

while read -r file
do
  echo "$file"
  wget "$file" &
done < urls.txt
wait
echo "done"

The 22 GB gets downloaded in 3 minutes and 20 seconds on average. That is 22 GB × 8 bits/byte ÷ 200 s ≈ 880 Mbit/s.

The Ruby code does it in 12 minutes and 30 seconds on average. That is 234.67 Mbit/s.

The wget calls are almost four times faster. The Ruby code timing is measured on the production code, meaning that more than just the downloads happens within that time.

Benchmarking different ways of downloading

We went a bit further with this benchmarking and made a small script that could download the same set of files. We used our two current ways of downloading and added a third, because we suspected - based on a hunch - that the time difference might come from not streaming the data to disk, but waiting until it was all downloaded before copying it to the file.

The script separates each download method but uses a thread per file to add concurrency.

require "benchmark"
require "typhoeus"
require "net/http"
require 'openssl'

class Run
  def download(urls)
    urls = [urls] if urls.is_a? String
    Benchmark.bm(7) do |x|
      x.report("wget:")   do
        threads = urls.map{|url|
          filename = File.basename(url)
          Thread.new do
            download_with_wget(url, filename)
          end
        }
        threads.each(&:join)
      end
      x.report("typhoeus:") do
        threads = urls.map{|url|
          Thread.new do
            filename = File.basename(url)
            download_with_typhoeus(url, filename)
          end
        }
        threads.each(&:join)
      end
      x.report("http:")  do
        threads = urls.map{|url|
          Thread.new do
            filename = File.basename(url)
            download_with_http(url, filename)
          end
        }
        threads.each(&:join)
      end
    end
  end

  def download_with_wget(url, filename)
    filename = "wget_#{filename}"
    `wget #{url} -O #{filename} > /dev/null 2>&1`
  end

  def download_with_typhoeus(url, filename)
    filename = "typhoeus_#{filename}"

    downloaded_file = File.open filename, 'wb'
    request = Typhoeus::Request.new(url)
    request.on_headers do |response|
      if response.code != 200
        raise "Request failed"
      end
    end
    request.on_body do |chunk|
      downloaded_file.write(chunk)
    end
    request.on_complete do |response|
      downloaded_file.close
    end
    request.run
  end

  def download_with_http(url, filename)
    filename = "http_#{filename}"
    uri = URI.parse(url)
    http = Net::HTTP.new(uri.host, uri.port)
    if url.start_with?('https')
      http.use_ssl = true
      http.verify_mode = OpenSSL::SSL::VERIFY_NONE
    end

    http.start do |h|
      File.open(filename, 'wb') do |f|
        get_path = uri.path
        if(uri.query)
          get_path = "#{get_path}?#{uri.query}"
        end
        h.get(get_path) do |data|
          f.write(data)
        end
      end
    end
  end
end
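
We drove the benchmark with a small driver along these lines (urls.txt holds the asset URLs; the exact invocation is reconstructed here for illustration):

    # Hypothetical driver: read the asset URLs and benchmark all three methods.
    urls = File.readlines('urls.txt', chomp: true).map(&:strip).reject(&:empty?)
    Run.new.download(urls)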

We ran the benchmark on our production servers, with a subset of our large files. We settled on using four files between 2.5 and 4 GB in size - for a total of 11.7 GB of data per method.

The runtimes do not vary as much as we thought they would.

              user     system      total        real
wget:     0.000000   0.000000 161.940000 (117.036897)
typhoeus: 47.310000  77.290000 124.600000 (142.265417)
http:    55.930000 107.600000 163.530000 (135.841929)

For smaller files, the differences are negligible; for larger files, wget seems to be the fastest, but only about 14% faster than our code.

What the table does not show us is the throughput for each method.

wget: 818.94 Mbit/s
typhoeus: 673.72 Mbit/s
http: 705.57 Mbit/s

This means that our download code in this test is almost exactly three times faster than what we see in production.

So what the hell is going on?

Well, it is quite simple once you realize that the answer is earlier in this post. The third step in our naive approach has not changed: "if not, download to cache and copy to working dir".

We download several large files concurrently to temp files, move each file to the cache directory and then copy it to the working directory.

We cannot omit the download step. The move step just gives the inode a new name. However, the last step - copying 22 GB from the local disk to the local disk in several concurrent jobs - is SLOW. Copying the files from the cache to the working dir takes twice as long as downloading them.

The solution: one line

All we did to speed this up 3x was to change FileUtils.cp to FileUtils.ln:

File.open(filename, 'r') do |_f|
    FileUtils.ln filename, to_file, force: true
end

This creates a hard link to the file's inode, instead of copying the content of the file to another location on the disk. When the last hard link to the inode is deleted, the file is deleted from disk - but not before. This means evicting a file from the on-disk cache does not destroy an ongoing rendering - as it would if we had used a symbolic link. We still only pay for the disk space once - instead of once per concurrent rendering. This is a huge win, since we usually do several concurrent renderings using the same assets.
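
A quick way to see the hard link semantics from Ruby - a small throwaway sketch, with placeholder file names:

    require 'fileutils'

    # Throwaway files, just to illustrate hard link behaviour.
    File.write('cache_copy.bin', 'asset data')
    FileUtils.ln('cache_copy.bin', 'working_copy.bin', force: true)

    # Both names point at the same inode, so the data is only stored once.
    puts File.stat('cache_copy.bin').ino == File.stat('working_copy.bin').ino # => true
    puts File.stat('cache_copy.bin').nlink                                    # => 2

    # Evicting the cache entry removes one name; the data survives under the other.
    File.delete('cache_copy.bin')
    puts File.read('working_copy.bin') # => "asset data"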

On top of that, we changed our asset download code to download directly from S3 via the official API, instead of generating public URLs. We have not seen any speed improvements from this, though. It just seems cleaner.
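
For reference, a minimal sketch of that kind of direct download with the aws-sdk-s3 gem - the bucket, key, and target path here are placeholders, not our actual setup:

    require 'aws-sdk-s3'

    # Placeholder bucket/key/target; credentials and region come from the environment.
    s3 = Aws::S3::Client.new
    s3.get_object(
      bucket: 'example-asset-bucket',
      key: 'assets/intro_clip.mov',
      response_target: '/var/asset-cache/intro_clip.mov' # streams the body straight to disk
    )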

Next we need some better heuristic for evicting the files from cache. I guess that is a later post.
