How to be an idiot at coding part infinite + 1
We had some code that read partial responses from a server. This code was tested thoroughly and worked as intended. We got the number of bytes we needed from the head of a file, and were then able to parse that data as a binary blob.
However, when we put this into production, containers started crashing randomly.
Or rather: the code had been in production for years without a crash, but now we were exercising this code path many times an hour instead of a couple of times per day.
The culprit was hard to find, but it all relates to how Ruby works, how networking works, and how I thought it worked more like C.
The old code did something like this:
bytes = nil
uri = URI(url)
begin
  http = Net::HTTP.new(uri.host, uri.port)
  if url.start_with?('https')
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  end
  http.start do |h|
    request = Net::HTTP::Get.new(uri.request_uri)
    h.request(request) do |response|
      bytes = response.socket.read(count)
    end
  end
rescue IOError => e
  # ignore
end
bytes
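Worth noting: `response.socket` is not part of Net::HTTP's public API, which is one reason this pattern is fragile. For reference, a similar "read only the head, then bail out" behaviour can be sketched with the documented block form of `read_body` — the helper name and the `throw`-based early exit below are my own, not what our code did:

```ruby
require 'net/http'
require 'uri'

# Sketch: read at most `limit` bytes of the response body, then stop.
# Uses the public streaming read_body instead of the private socket.
def read_prefix(url, limit)
  uri = URI(url)
  buf = +''
  begin
    catch(:enough) do
      Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
        http.request(Net::HTTP::Get.new(uri.request_uri)) do |response|
          response.read_body do |chunk|
            buf << chunk
            # Non-local exit; Net::HTTP.start's ensure still closes the socket.
            throw :enough if buf.bytesize >= limit
          end
        end
      end
    end
  rescue IOError, EOFError
    # mirror the original code's "ignore and keep what we have" behaviour
  end
  buf.byteslice(0, limit)
end
```

This avoids buffering the whole body in memory, but the server may still have sent (and the kernel buffered) more than `limit` bytes before the connection is torn down.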
and the new code does something far simpler: it uses the Range header
headers = {'Range' => "bytes=0-#{limit}"}
uri = URI(url)
response = Net::HTTP.get_response(uri, headers)
The former code retrieved the complete file from the server and then read the required number of bytes into a variable; the latter requests only the bytes that are needed from the server.
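Two wrinkles with the Range approach are worth keeping in mind: byte ranges are inclusive, so `bytes=0-#{limit}` actually asks for limit + 1 bytes, and a server without range support simply answers 200 with the full body instead of 206 Partial Content. A defensive sketch (the helper name is mine, and passing a headers hash to `get_response` needs Ruby 3.0+):

```ruby
require 'net/http'
require 'uri'

# Sketch: fetch the first `limit` bytes, tolerating servers without Range support.
def fetch_prefix(url, limit)
  uri = URI(url)
  # bytes=0-(limit-1) is inclusive, i.e. exactly `limit` bytes
  res = Net::HTTP.get_response(uri, { 'Range' => "bytes=0-#{limit - 1}" })
  case res
  when Net::HTTPPartialContent # 206: the server honoured the range
    res.body
  when Net::HTTPSuccess        # 200: full body came back anyway; clamp it ourselves
    res.body.byteslice(0, limit)
  end
end
```

The 200 fallback still downloads the whole file into memory, so it only protects against parsing too much, not against the memory usage described below.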
For small files the difference is negligible, but the larger the file, the larger the problem becomes. With the former code, response times go up and memory usage goes up with file size. And since video files in our usage can be more than 100 MB in size, the slow downloads combined with the large memory usage let the OOM killer destroy the container before the process finishes, which re-enqueues the same job. That increases the chance that multiple jobs of this type get handled by the same container at “the same time”, which in turn increases the chance of the container crashing again.