

#10 150 Successful Machine Learning Models: Lessons Learned at Booking.com

Table of contents
Introduction
Lesson 1: Machine learning as a Swiss Army knife: the two extremes.
Lesson 2: Model performance is not strongly correlated with business performance
Lesson 3: Look for a problem, not for a solution
Lesson 4: Latency matters
Lesson 5: Monitoring for fast iteration with incomplete data
Lesson 6: Experiment design sophistication has good ROI
Closing thoughts
Introduction
This week, I came across a paper [1] from data scientists at Booking.com, describing what they learned about effectively deploying machine learning models to serve business needs.
Let's dive right in!
Lesson 1: Machine learning as a Swiss Army knife: the two extremes.
The first lesson is about applying machine learning models to two extremely different types of product solutions.
The classical one: creating an ML model for a specific business solution, such as:
Optimizing the size of a UI element to increase conversion.
Providing recommendations at a specific step of the funnel experience.
These modelling approaches provide direct business value: they are close to the bottom line. However, the breadth of the application is limited to a single use case by design.
On the other end of the spectrum, there is a different application: creating ML models for semantic understanding.
The idea here is to create an embedding to be used by other product teams as a feature in their own models.
While it's harder to provide supporting business metrics to justify investing in embedding development, an investment like this allows ML applications to scale horizontally across the whole company.
A word of caution: using an ML model output as input for other ML models creates a tight coupling between two systems, which can lead to ML tech debt.
What happens if the embedding is retrained? A metric increase for the team owning the embedding can lead to a metric decrease for the team that uses that output. This can be solved by model versioning, which adds a little overhead.
I believe the trade-off here works in favour of the embedding retraining, especially because such models are quite robust: a model that understands "images" does not need much retraining.
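To make that coupling explicit, downstream teams can pin the embedding version they consume and upgrade deliberately. Here is a minimal sketch of that idea; the registry, versions and function names are hypothetical, not something from the paper:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class EmbeddingModel:
    """A versioned embedding artifact (hypothetical structure)."""
    version: str
    weights: np.ndarray  # e.g. an item-embedding lookup table

    def embed(self, item_ids: np.ndarray) -> np.ndarray:
        return self.weights[item_ids]


# Hypothetical registry: the embedding team publishes new versions here,
# while downstream teams keep reading the version they trained against.
REGISTRY = {
    "v1": EmbeddingModel("v1", np.random.default_rng(0).normal(size=(1000, 32))),
    "v2": EmbeddingModel("v2", np.random.default_rng(1).normal(size=(1000, 32))),
}


def load_embedding(pinned_version: str = "v1") -> EmbeddingModel:
    """Downstream models pin a version instead of silently picking up retrains."""
    return REGISTRY[pinned_version]


features = load_embedding("v1").embed(np.array([42, 7]))
print(features.shape)  # (2, 32)
```

The overhead mentioned above is exactly this: someone has to keep old versions around and coordinate upgrades, but downstream metrics no longer move just because the embedding was retrained.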
I talked extensively about Machine learning tech debt in one of my previous articles:
#3 Technical debt in Machine learning systems has very high interest rate.
Lesson 2: Model performance is not strongly correlated with business performance.
I admit, the title is a little read-baity. (Is that a word? It is now!)
At this point, it is common knowledge that a good offline model does not guarantee good online metrics.
This section, however, is about the business metrics of online model deployments.
There are many nice ideas in this section:
❗
Value Performance Saturation
At a certain point, the "business value vs model performance" curve saturates: a gain in model performance does not lead to a gain in business value that makes it worth the effort.
And that is OK! It just means that the focus shifts to maintaining the ML model, while directing R&D efforts at other problems.
❗
Uncanny Valley effect
As ML models become better and better, they can become unsettling for some users. It certainly spooks me how good some recommendations are. Sometimes good enough is just perfect!
❗
Proxy Over-optimization
When training a model, the metric to be optimized is usually a proxy metric.
For example, even if the final goal is to increase conversion rate, it is much easier to optimize for click-through rate. As models get better at the proxy, they might end up "just driving clicks" without increasing conversions.
In an advertising setting, for example, advertisers would end up paying a lot without getting much value out of it, damaging the bottom line.
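One way to catch proxy over-optimization is to track the proxy and the business metric side by side. A minimal sketch, assuming an impression log with clicked and converted flags (the column names are my own, not from the paper):

```python
import pandas as pd

# Hypothetical impression log: one row per served recommendation/ad.
log = pd.DataFrame({
    "week":      [1, 1, 1, 2, 2, 2, 2],
    "clicked":   [1, 0, 1, 1, 1, 1, 0],
    "converted": [1, 0, 0, 0, 0, 1, 0],
})

weekly = log.groupby("week").agg(
    ctr=("clicked", "mean"),          # the proxy metric the model optimizes
    conversion=("converted", "mean"), # the business metric we actually care about
)
# A rising CTR with a flat (or falling) conversion rate is a hint that the
# model is over-optimizing the proxy.
print(weekly)
```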
Lesson 3: Look for a problem, not for a solution
The problem construction process takes as input the business case and outputs different possible modelling approaches.
It is useful to keep the following in mind:
Learning difficulty: in the real world, target variables are not given: they are constructed. Look for a setup in which simple models work significantly better than random or constant baselines (see the sketch after this list).
Data to concept match: set up the model so that the data is as close as possible to the concept you want to model.
Selection bias: construct ground truth data / labels carefully to avoid skewing the model.
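To gauge learning difficulty, a quick check is to compare a simple model against constant and random baselines on the constructed target. A minimal sketch using scikit-learn on synthetic data (the dataset and model choices are placeholders, not the paper's setup):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for a constructed target variable.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

candidates = {
    "constant": DummyClassifier(strategy="most_frequent"),
    "random": DummyClassifier(strategy="uniform", random_state=0),
    "simple model": LogisticRegression(max_iter=1000),
}

# If the simple model barely beats the dummy baselines, the problem setup
# (target construction, features) probably needs rethinking.
for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:>12}: AUC = {score:.3f}")
```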
Lesson 4: Latency matters
When evaluating a model, there are tradeoffs to be made between performance and latency: the paper shows that a 30% increase in latency leads to roughly a 0.5% loss in conversion rate.
Possible ways to minimize latency are:
Scaling models horizontally as much as possible to deal with large traffic.
Using models that are as sparse as possible.
Caching features to avoid recomputation (see the sketch below).
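As an illustration of the caching point, here is a minimal sketch that memoizes an expensive feature computation; the function name and the simple LRU policy are my own choices, not from the paper:

```python
from functools import lru_cache


@lru_cache(maxsize=100_000)
def compute_user_features(user_id: int) -> tuple:
    # Stand-in for an expensive feature computation (DB lookups, aggregations...).
    # Returning a tuple keeps the result hashable and cache-friendly.
    return (user_id % 7, user_id % 13)


# The first call computes; repeated calls hit memory instead,
# shaving latency off the prediction path.
print(compute_user_features(42))
print(compute_user_features.cache_info())
```

In a real serving system this would typically be an external cache with an expiry policy, so that features do not go stale.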
Lesson 5: Monitoring for fast iteration with incomplete data
Monitoring the quality of a model in production is not as easy as it sounds.
Most of the time, feedback is:
Incomplete: true labels cannot always be observed. Example: adversarial spaces.
Delayed: true labels are sometimes observed only many days later. Example: predicting whether a user will review a given hotel or not.
Still, we want to evaluate models online as fast as possible, so label-dependent metrics are not always the best solution.
A quick and easy way is to plot the predictions' distribution. This can be a simple heuristic to immediately tell if a model is breaking bad.
For example, if a model cannot assign different scores to different classes, it does not have much predictive power.
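A minimal sketch of such a response-distribution check, comparing the score histogram of the latest serving window against a healthy reference window (the data and threshold are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference scores collected when the model was known to be healthy,
# versus scores from the latest serving window (both synthetic here).
reference = rng.beta(2, 5, size=10_000)
latest = rng.beta(2, 5, size=10_000) * 0.2  # e.g. scores collapsing toward 0

bins = np.linspace(0, 1, 21)
ref_hist, _ = np.histogram(reference, bins=bins, density=True)
new_hist, _ = np.histogram(latest, bins=bins, density=True)

# A crude L1 distance between the two binned densities: a spike is a cheap,
# label-free signal that the model's output distribution has shifted.
l1_drift = np.abs(ref_hist - new_hist).sum() * (bins[1] - bins[0])
print(f"L1 drift between score distributions: {l1_drift:.2f}")
if l1_drift > 0.3:  # arbitrary illustrative threshold
    print("prediction distribution looks off, investigate the model")
```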
Lesson 6: Experiment design sophistication has good ROI
Booking.com is well known for its experimentation infrastructure. See [2] and [3].
In the paper, they show how to isolate the causal effect of a specific modelling choice on business metrics.
❗
Selective triggering
In a standard experiment, the population is divided into control and treatment groups: all subjects in the treatment group are exposed to the change, while subjects in the control group are not.
However, in many cases, not all subjects are eligible to be triggered, and the eligibility criteria are unknown at assignment time.
To deal with this situation, only the triggered subjects in both groups are analyzed.
This is quite simple reasoning, but it is often overlooked!
❗
Model-output dependent triggering
It is often the case that the triggering criterion depends on the model output, which is unknown when you set up the experiment!
In such cases, some users are not exposed to any treatment: the observed effect is diluted.
The setup for model output dependent triggering requires an experiment with 3 groups:
Control group C is exposed to no change at all.
Treatment group T1 invokes the model and checks the triggering criterion; only triggered users are exposed to the change.
Treatment group T2 invokes the model, but users are not exposed to any change regardless of the output.
Then, statistical analysis is conducted only on triggered subjects from both T1 and T2.
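A minimal sketch of how that analysis could look, using a two-proportion z-test on triggered subjects only (the dataframe columns, the simulated data and the statsmodels choice are mine, not prescribed by the paper):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
n = 5000

# Hypothetical experiment log. "triggered" means the model's triggering
# criterion fired for that subject; "converted" is the business outcome.
# (In practice the triggering status of C is unknown, which is why T2 exists.)
df = pd.DataFrame({
    "group": rng.choice(["C", "T1", "T2"], size=n),
    "triggered": rng.random(n) < 0.4,
})
# Synthetic outcomes: a small lift only for triggered, exposed subjects (T1).
base_rate = 0.10 + 0.02 * ((df["group"] == "T1") & df["triggered"])
df["converted"] = rng.random(n) < base_rate

# Analyze only triggered subjects: T1 (exposed) vs T2 (model invoked, not exposed).
t1 = df[(df["group"] == "T1") & df["triggered"]]
t2 = df[(df["group"] == "T2") & df["triggered"]]

stat, p_value = proportions_ztest(
    count=[t1["converted"].sum(), t2["converted"].sum()],
    nobs=[len(t1), len(t2)],
)
print(f"T1 conversion: {t1['converted'].mean():.3f}, T2 conversion: {t2['converted'].mean():.3f}")
print(f"z = {stat:.2f}, p = {p_value:.3f}")
```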
❗
Comparing models
When comparing treatments based on models that improve on one another, their outputs are often highly correlated.
In this case, the triggering condition is "models disagree".
Again, three groups are needed:
Control group C uses the baseline/current model.
Treatment group T1 invokes model A.
Treatment group T2 invokes model B.
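The analysis is then restricted to subjects on which the two models disagree. A minimal sketch of that filter, again with hypothetical column names (both models are scored for every subject, but only the group's model is used to serve):

```python
import pandas as pd

# Hypothetical per-subject log of what each model would have recommended,
# plus the observed outcome.
df = pd.DataFrame({
    "group":        ["T1", "T1", "T2", "T2", "T1", "T2"],
    "model_a_pred": [1, 0, 1, 1, 0, 0],
    "model_b_pred": [1, 1, 1, 0, 0, 1],
    "converted":    [0, 1, 0, 1, 0, 0],
})

# Triggering condition: the models disagree. Only these subjects can reveal
# a difference between the two treatments, so the comparison is restricted to them.
disagree = df[df["model_a_pred"] != df["model_b_pred"]]
lift = (
    disagree.loc[disagree["group"] == "T2", "converted"].mean()
    - disagree.loc[disagree["group"] == "T1", "converted"].mean()
)
print(f"conversion difference on disagreements: {lift:+.3f}")
```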
Closing thoughts
This week's article is a bit on the lengthy side. Let me know if you like this format or if you prefer shorter emails :)!
Also, would you like to see more technical content?
Feel free to reach out directly to me on LinkedIn with your feedback!