In December 2017 we changed our metrics collection service at Fliva from being backed by Postgres to being backed by ElasticSearch. Looking back, we learned a lot from this move - but we are now moving back to Postgres.
Why we moved it in the first place
Our video playing platform produces a lot of metrics. A LOT! But we could not find a way to set up indexes and queries that performed well enough to let us show this data to end users as statistics.
At Arnsbo Media we have several online products, one of which uses ElasticSearch, and its developers advocate its use strongly. It seemed like a good match, so we tried it out.
How we moved it to ElasticSearch
We changed the backing to ElasticSearch by creating a background job that was called as soon as a metric landed in our database. This job did two things: upload the metric to ElasticSearch, and enqueue a worker to delete the same metric from our local database at some later point in time.
Later we stopped stuffing the metric into the database at all and just let the background job take the raw data and put it directly into ElasticSearch.
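The forwarding job can be sketched roughly like this. This is a minimal, self-contained illustration: the class name, the in-memory stores, and the `perform` signature are stand-ins for our real job framework, ElasticSearch client, and delete queue.

```ruby
require "securerandom"

ES_INDEX     = {}  # stand-in for the ElasticSearch index: id => document
DELETE_QUEUE = []  # stand-in for the queue of pending local deletes

class MetricForwardJob
  # Called as soon as a metric lands: push it to ElasticSearch and
  # schedule the local copy for deletion at some later point in time.
  def self.perform(metric)
    id = metric[:id] || SecureRandom.uuid
    ES_INDEX[id] = metric              # upload the metric to ElasticSearch
    DELETE_QUEUE << { metric_id: id }  # enqueue the delayed local delete
    id
  end
end

id = MetricForwardJob.perform(name: "PlayStart", session_id: "s1", video_id: "v1")
```

In the later setup described above, the local-database write and the delete queue simply disappear, and only the ElasticSearch write remains.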
Trouble in paradise
The move into ElasticSearch was great: we could do most aggregation queries in subsecond times (usually sub-100ms as well). So if we wanted to know how many PlayStart metrics we had for a given Channel in our player, grouped per day for the last three months, we got an answer in no time.
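A query of that shape might look something like the following body, written here as a Ruby hash of the kind you would pass to an ElasticSearch client. All field names and the channel id are illustrative, not our actual schema, and older ElasticSearch versions use `interval` where newer ones use `calendar_interval`.

```ruby
# Count PlayStart metrics for one channel over the last three months,
# bucketed per day via a date_histogram aggregation.
query_body = {
  query: {
    bool: {
      filter: [
        { term:  { metric_type: "PlayStart" } },    # illustrative field
        { term:  { channel_id: "channel-42" } },    # illustrative id
        { range: { created_at: { gte: "now-3M/d" } } }
      ]
    }
  },
  aggs: {
    per_day: {
      date_histogram: { field: "created_at", calendar_interval: "day" }
    }
  },
  size: 0  # we only want the buckets, not the individual documents
}
```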
BUT: since we had a lot of metrics, and the questions we wanted to ask were along the lines of "what is the viewthrough rate of all videos viewed in this channel in the last two weeks", we had a problem. That information was not in a single metric, but an aggregation of multiple metrics.
The first naive approach was to download all the metrics when we needed to answer this question, and then calculate the answer. That worked great - if you could either wait minutes for the answer or only needed the answer for a very small window of time.
And this was just one of the questions where this approach was needed.
Later we began aggregating events and storing them in ElasticSearch as well. This meant that every time we stored a metric, a metric aggregation background job was enqueued with the session id and video id of that metric.
The worker used this information to fetch all metrics with that same combination of session id and video id, and aggregate the data into a data structure that made it easy for us to answer all these aggregation questions.
The downside to this is that every time we store a metric we now:
- store the metric in ElasticSearch
- fetch all metrics with the same video id and session id
- aggregate the values in code locally
- create (or update) the aggregated data in ElasticSearch
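The local aggregation step can be sketched as a plain function. The metric names, fields, and the exact definition of the viewthrough rate here are made up for illustration; they are not our real schema.

```ruby
# Fold all metrics for one (session_id, video_id) pair into a single
# aggregated document that can answer questions like viewthrough rate
# without touching the raw metrics again.
def aggregate(metrics)
  play_starts = metrics.count { |m| m[:name] == "PlayStart" }
  completes   = metrics.count { |m| m[:name] == "PlayComplete" }
  {
    session_id:  metrics.first[:session_id],
    video_id:    metrics.first[:video_id],
    play_starts: play_starts,
    completes:   completes,
    # viewthrough rate: share of started plays that ran to completion
    viewthrough: play_starts.zero? ? 0.0 : completes.to_f / play_starts
  }
end

metrics = [
  { session_id: "s1", video_id: "v1", name: "PlayStart" },
  { session_id: "s1", video_id: "v1", name: "PlayStart" },
  { session_id: "s1", video_id: "v1", name: "PlayComplete" }
]
agg = aggregate(metrics)
```

The expensive part was never this fold - it was the fetch-everything round trip to ElasticSearch that had to happen before it, on every single metric.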
On top of that, our hosted ElasticSearch instance is not in the same datacenter as the rest of our servers, so the roundtrip latency compounds this problem really quickly.
Why are we moving it back
Well, since the move to ElasticSearch we have improved our Postgres database setup. We moved from Postgres in a container (on a Flynn cluster) to an AWS Aurora instance.
And since we are storing and querying data that is mostly straight relational data with very few relations - and we are enriching the data with database lookups anyway - it seems much easier to reason about than the whole ES layer.
How did we do this?
The original plan was:
- Ensure the database table for metrics is correctly set up, including which fields we should fetch (and which we should not), and which indexes to add.
- Ensure we have a good database table and Rails model for aggregated events. Take care to get the correct indexes set up here.
- Make the metric aggregator put data into the local database instead of into ElasticSearch.
- Make all new metrics go to the database.
- Move existing metrics back to the database, from the newest to the oldest.
- Ensure that our frontend application works with these new local aggregated data.
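The aggregated-events step above could look something like the following Rails migration. This is a hedged sketch: the table, column, and index names are hypothetical, not our actual schema.

```ruby
# Hypothetical migration for the aggregated events table. The composite
# indexes reflect the lookups described above: "the aggregate for this
# session/video pair" and "all aggregates for a channel in a date range".
class CreateAggregatedEvents < ActiveRecord::Migration[5.2]
  def change
    create_table :aggregated_events do |t|
      t.string  :session_id, null: false
      t.string  :video_id,   null: false
      t.string  :channel_id
      t.integer :play_starts, default: 0
      t.integer :completes,   default: 0
      t.float   :viewthrough
      t.timestamps
    end

    add_index :aggregated_events, [:session_id, :video_id], unique: true
    add_index :aggregated_events, [:channel_id, :created_at]
  end
end
```

The unique index doubles as the upsert key for the "create (or update)" step of the aggregation worker.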
How did this go?
Well, it went rather well. We now have all our events back in our Postgres database. Our aggregations run in no time, and our queries respond in reasonable time. The slowest take a couple of seconds - but that is when requesting aggregated data for more than just a couple of months.
All day-to-day statistic views are fast. (It is our view layer, not our database layer, that is the bottleneck right now.)
We followed the plan laid out above almost to the letter. There were no hiccups, no missing pieces, no downtime. The only downside was extra load on the servers while moving data back from ElasticSearch to Postgres. It took a couple of months to move it all - mostly because we tried not to overload our system while doing the move.
Finally, our aggregation background jobs now finish in a couple of milliseconds instead of a couple of seconds. Before the move, the number of jobs in the aggregation queue was constantly in the 5000-10000 range - now it is about zero all the time. (Meaning we process the jobs as fast as they are created, so our dashboard never shows any queued jobs.)