It has been taught “divide and conquer” is a great technique to solve any problem. Well, sometimes in practice it is harder than we, though, here is a case study of how our team apply this concept and still wasn’t enough, we need to divide even more and tweak some other things.
In the beginning, a the developer in charge to create an algorithm for the mean value for some data points did not have in mind the massive data and the required memory to process that piece of code in the future. Maybe he was naive of the computing capacities from the database engine or perhaps in an evil way he got faith in the company failure. What probably happened is with the pace of time developer after developer appended tiny amends for a larger problem. Wherever was the case our current team inherits the problem, and our responsibility is to solve it.
In general, these are the steps followed by the code: * Fetch the IDs from a polymorphic model relation dynamically. * Build a query with the obtained IDs, group them by type and id, to then get the last from each batch. * As the result of that query are all the identifiers for the data points to evaluate. * Calculate with the help of a PostgreSQL method the average value for the points corresponding to the IDs. * Update the result
As you can see the problem is simple, just a mean value, right? Well they were a few things I did not tell: 1) There were more than 150 million data points to evaluate. 2) The data type of the data point was Numeric, in our case, it translates into a BigDecimal. 3) The database RAM available is 3.5 Gb. 4) Everything within a single process.
Solution Zero: just query the IDs
The first naive solution was to pluck the IDs, to fetch the first IDs from the polymorphic relation the original query was using full ORM models. The required memory for the new query will be more efficient right? Well apparently this didn’t work, the background job still got stuck by the reduced RAM. So the problem was somewhere else.
The First Iteration (divide and conquer)
One of the detected problems was that the query is too big to execute with a reduced memory database, so we decide to split into small batches to calculate the mean value. Instead of having a long query for all the points we will have many queries with 1000 records each. This solution sounds reasonable because even when we lose some precision, the render value on the view was rounded to an integer. Again this solution did not work because memory was never released from the cron job, so it didn’t matter if we had one large query or one thousand, all was in the same process.
The Second Iteration (scale horizontally)
Since it was noticed the need for real memory split, we move all the small queries into a background job. Instead, of a single addition to a variable value, we append the results into a Redis array on wich where we could fetch the required values to calculate an average and when the average is asked it only calculate the average from that one. Yes, we lose precision but as I said before we were willing to take that risk knowing that the displayed number wasn’t critical for our business, never the less important.
Now that we have a lot of small jobs, which is fine when your server configuration has a lot of workers, however, the staging server doesn’t have the same capacities, as a consequence, a ton of jobs held in the queue, this is not good if you want to test something else.
Third Iteration (the O(1) solution)
After all these shenanigans, we need it a different aproach, where instead of calculating the average value every single night, we get the required sums for the average when a data point is created. And as a final step we just calculate something like this:
total_points / counter_cache
As you can see one of the values is already generated by our ORM (in this case ActiveRecord) through the counter_cache attribute, that allows us to store the count per association within a column in our database. The other was a little bit more complex because every time we create a point we need to consider the same filter parameters as we used before.
First, to avoid the complex filtering and the calculations over BigDecimal we decide to clone the data into a new table with a lower precision column. Adding columns to the original table, will add 150 million more of data to store, even if is a boolean that won’t be light weight. Also, we only clone a data point when is created, replacing the last saved for the polymorphic association. Ok, now we got a clone table where no matter how large the data points table grows, this will increase at a slower pace. After this stage update, a record from a total sum will be a piece of cake, additions are the most basic operation in computing.
If the NASA could send people to the moon with 15 digits of pi, for sure we could handle some precision loss in our graphs and metrics
Aditional to this we start to purge the legacy data from inactive accounts just to reduce the size of the massive data points table.
Sometimes divide into smaller batches help but not with a colossal amount of data to process. Always to be mindful of the complexity of your algorithm. An O(n) algorithm in a background job or a cron task is harmless when it takes a couple of minutes to complete. But with a large amount of records and a whole night to finish is not enough, then it becomes a real issue. Please do us a favor and think twice before you code.