This is the archive of our weekly newsletter. We are hoping to build the largest ML system design community on the internet! We would love for you to join us.

We provide a weekly round-up of the best in machine learning systems and distributed architecture. Learn how the best engineers scale their ML systems. Delivered every Sunday straight to your inbox!

Machine Learning at Scale: Online serving.
How big tech companies deploy Machine Learning models online.
Machine Learning at Scale: feature stores.
How to design a feature store that satisfies both online and offline Machine Learning needs?
Machine Learning at scale: Batch Serving.
How big tech companies deploy Machine Learning models with batch serving.
Machine learning at scale: On-Device Machine Learning.
Machine Learning at Scale: Challenges and solutions of deploying On-Device Machine Learning models.
#17 Uber’s Offline Platform For Optimal Feature Discovery.
How Uber runs an offline platform for optimal machine learning feature discovery.
#16 Robust machine learning models in an adversarial world.
Robust machine learning models at scale. How to defend your machine learning models from Adversarial attacks? Deep dive.
#15 Near real-time personalization at LinkedIn.
Table of contents 1. Introduction. 2. Past actions affect personalized recommendations with delay. 3. Leveraging actions in near real-time to adapt recommendations. 4. Results. Introduction In today’s article, I will talk about how LinkedIn transitioned from an offline update to a near real-time…
#14 Machine learning models: explainability and interpretability.
Table of contents 1. Introduction. 2. Trading off explainability with other metrics. 3. Inherently explainable models. 4. Improving explainability in non-inherently explainable models. Introduction In this article, I will talk about the need for explainable and interpretable models. E…
#13 How Netflix built a media understanding platform for ML innovations.
Table of contents 1. Introduction. 2. First iteration: on-demand batch processing. 3. Second iteration: enabling online requests with pre-computation. 4. Final system architecture. Introduction In today’s article, I will describe how Netflix built an ML media understanding platform. Let’s st…
#12 The spectrum of Machine Learning roles in industry: different archetypes for different organizations.
Table of contents 1. Introduction. 2. The spectrum of roles. 3. Roles overlap depending on organization’s maturity and scope. 4. Closing thoughts. Introduction Today’s article will be more on the lighter side. I will discuss the different archetypes of Machine learning roles in industry and…
#11 How Pinterest fights harmful content with Machine Learning.
Table of contents 1. Introduction 2. Batch model 3. Online model Introduction This week I came across an interesting architecture from Pinterest to fight abuse [1]. Pins (read: a collection of pictures) with similar images are grouped together and uniquely defined by a hash signature…
#10 150 Successful Machine Learning Models: Lessons Learned at
Table of contents 1. Introduction 2. Lesson 1: Machine learning as a Swiss knife: the two extremes. 3. Lesson 2: Model performance is not strongly correlated with business performance 4. Lesson 3: Look for a problem, not for a solution 5. Lesson 4: Latency matters 6. Lesson 5: Monitoring for
#9 Reddit’s ML Model Deployment and Serving Architecture
Table of contents 1. Introduction 2. The legacy platform: one system on top of the other 3. The new platform: scalable and reliable 4. Closing thoughts Introduction Today’s topic will deal with improving Reddit’s ML platform, from a non-scalable, failure-prone system to a reliable and highly…
#8 Meta’s AI platform for engineers across the company
Table of contents 1. Introduction 2. Making smart strategies available widely in real-time 3. The Looper platform 4. Adoption and impact Introduction In recent years, there has been a big push by tech companies to scale in-house AI platforms to let software engineers from different teams inc…
#7 How DoorDash maintains model accuracy through a monitoring system.
Introduction The main topic of this article is: “How can we fight model drift?” As soon as a model is trained, validated and deployed to production, it begins degrading. Inputs and outputs need to be closely monitored to diagnose and better yet prevent model drift. DoorDash approaches model obse…
#6 How LinkedIn built a Machine Learning system focused on Explainable AI?
Introduction Lately, there has been a strong focus in the tech space on building Machine learning systems that: * Respect privacy. * Avoid harmful bias. * Mitigate unintended consequences. Still, building a transparent and trustworthy system is not an easy feat. The focus of this article is g…
#5 How Uber continuously deploys Machine learning models at scale?
Introduction Today’s topic is continuous model deployment. When working with Machine Learning systems, one must always remember that the world around us is changing fast. Training a model, deploying it online and calling it a day is not a valid option, as a model’s performance degrades quickly. If…
#4 How Yelp predicts Wait Time for your favourite restaurant?
Introduction In this article, I will describe how Yelp predicts the waiting time for restaurants around the world. In this setting, latency of the system is paramount: when users want to know the current waiting time, they expect an immediate answer. However, they don’t particularly care for the s…
#3 Technical debt in Machine learning systems has a very high interest rate.
Introduction The harsh reality for new graduates joining a Machine learning team is that the “model development” part of the job pales in comparison to everything else the team needs to support. The priority of the team is often: improving models’ up-time, optimizing the data pipeline, adding new…
#2 How TikTok’s Real-Time Recommendation algorithm scales to billions?
Introduction In today’s article, I am going to dive deep into the paper [1] published by TikTok engineers describing how TikTok’s real-time recommendation system is built. The paper has many topics that are worth learning in more detail, but I am going to focus on: * How the design supports
#1 What is my plan with Machine learning at scale?
Hey! I am Ludo. I am a machine learning engineer at Google. I have created “Machine learning at scale” to talk about ML systems in the real world. I find that the majority of online ML content is divided into these two groups: * Introduction to model X in Python. * An