Thumbtack helps customers search for the right local professionals to get projects done. Our search product collects project details from customers and matches them against preferences from professionals. Afterwards, our ranking algorithm displays the professionals most likely to result in a job well done. We tackle the search ranking problem by scoring professionals that match the customer’s requirements and then sorting them by score. Earlier this year, we changed our search ranking algorithm from a heuristic scoring system to a machine learning (ML) based scoring system. This change was very challenging but impactful. In this blog post, we’ll discuss why we wanted to transition our search ranking algorithm to use machine learning, how we made that transition, and how that changed the developer experience.
Why Transition to Machine Learning
Using machine learning for search ranking isn’t new by any means. Several other companies like Netflix and Airbnb have written about how they’ve used ML for search ranking or recommendations and seen substantial improvements. Thumbtack gets millions of searches every month. Using all this data could help us score professionals more accurately and improve customer engagement.
Perhaps even more than improving ranking quality, we were excited that ML would help us iterate on our ranking algorithm faster. Before this, we used a hand-tuned heuristic ranking algorithm that combined several ranking signals (e.g., number of reviews, average review rating on a 1-5 star scale, profile quality) into a single score. To combine the signals, we scaled each one into a particular range (usually [0, 1]), scaled it again by a weight reflecting feature importance, and then multiplied the results together.
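To make the heuristic concrete, here is a minimal sketch of that kind of scoring. It is not our production formula: the signal names, ranges, and weights are made up for illustration, and the way a weight is applied (compressing a scaled signal into [1 - w, 1] so that important signals can pull the score down further) is an assumption.

```python
def scale_to_unit(value, lo, hi):
    """Linearly rescale value from [lo, hi] into [0, 1], clamping outliers."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


def heuristic_score(signals, ranges, weights):
    """Combine hand-tuned signals into one score by multiplication.

    Assumption: a weight w in [0, 1] maps a scaled signal x to 1 - w + w * x,
    so w = 0 ignores the signal and w = 1 uses it at full strength.
    """
    score = 1.0
    for name, value in signals.items():
        lo, hi = ranges[name]
        scaled = scale_to_unit(value, lo, hi)
        w = weights[name]
        score *= 1.0 - w + w * scaled
    return score


# Hypothetical signals for one professional.
signals = {"num_reviews": 50, "avg_rating": 4.5}
ranges = {"num_reviews": (0, 100), "avg_rating": (1, 5)}
weights = {"num_reviews": 0.5, "avg_rating": 1.0}
example_score = heuristic_score(signals, ranges, weights)
```

Every new signal in a scheme like this needs its own range and weight, which is exactly the hand-tuning that became hard to maintain.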
Adding new features to this heuristic algorithm was becoming cumbersome and slowing down development. Engineers had to add the new signal and then perform custom, one-off analyses to determine how to transform the new signal and combine it with existing features. These custom analyses were also somewhat error prone. New features were often correlated with existing ones, so ideally the existing weights would have been retuned as well, but manually reweighting every existing feature would have been challenging. Essentially, these custom analyses were not only slow, but also less rigorous than we would have liked.
With an ML-based scoring algorithm, instead of performing custom analyses for every new feature, we could just retrain the model. Evaluating candidate rankers would also be easier, since we could assess the performance of our ML models with metrics like average precision score and ROC AUC score. Thus, one of the most exciting benefits of transitioning to ML-based scoring was a faster, more consistent method to add and reweight features as well as evaluate the resulting rankers.
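For readers unfamiliar with these two metrics, the sketch below computes both from scratch on a toy set of labels and model scores. It is a simplified illustration rather than our evaluation code: the data is invented, and the average precision function assumes no two examples share the same score. In practice a library implementation would be the usual choice.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@k taken at the rank k of each positive example.

    Assumes scores are distinct; tied scores would need extra handling.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return precision_sum / hits


# Toy data: 1 = the customer contacted the professional, 0 = they did not.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

Both metrics depend only on how the model orders examples, not on the raw score values, which makes them a natural fit for comparing rankers.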
Framing the Search Ranking Problem
Before specifying particular machine learning models, we needed to decide what metric to optimize for. Although we’d want to optimize for how many projects our customers complete, measuring that metric is quite difficult. Furthermore, since completing a project is fairly downstream in the customer flow, we’d have fewer positive examples for training. Thus, we decided to use a more upstream metric predictive of projects completed. In particular, we chose to optimize for the number of customers that receive a “positive response” from a professional. This means that a customer finds a professional they are interested in working with, contacts them, and then that professional also expresses an interest in the customer’s project (“negative responses” are when the professional declines the job).
With this framing, we decided to use two logistic regression models. The first predicts whether a customer will contact a given professional that appeared in the search results. The second predicts whether a professional would express interest in the job if they were contacted. Interpreting the outputs of these regressions as probabilities, we multiply the two together to estimate the probability of a match (i.e. both the customer and professional expressing interest).
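The two-model scoring described above can be sketched as follows. The feature names, weights, and the `LogisticModel` container are hypothetical; the only part taken from the approach itself is that each model's output is treated as a probability and the two are multiplied, i.e. P(match) = P(contact) × P(positive response | contact).

```python
import math
from typing import Dict, List, NamedTuple


class LogisticModel(NamedTuple):
    """A fitted logistic regression: one weight per feature plus a bias."""
    weights: Dict[str, float]
    bias: float


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(model: LogisticModel, features: Dict[str, float]) -> float:
    """Predicted probability; features missing from the input count as 0."""
    z = model.bias + sum(
        w * features.get(name, 0.0) for name, w in model.weights.items()
    )
    return sigmoid(z)


def match_probability(contact: LogisticModel, response: LogisticModel,
                      features: Dict[str, float]) -> float:
    """P(customer contacts) * P(professional responds positively | contacted)."""
    return predict(contact, features) * predict(response, features)


def rank(candidates: List[Dict[str, float]], contact: LogisticModel,
         response: LogisticModel) -> List[Dict[str, float]]:
    """Sort candidate feature dicts by estimated match probability, best first."""
    return sorted(
        candidates,
        key=lambda f: match_probability(contact, response, f),
        reverse=True,
    )


# Hypothetical models and candidates, for illustration only.
contact_model = LogisticModel({"avg_rating": 1.0}, bias=-4.0)
response_model = LogisticModel({"avg_rating": 0.5}, bias=-1.0)
ranked = rank([{"avg_rating": 3.0}, {"avg_rating": 5.0}],
              contact_model, response_model)
```

Because the two probabilities stay separate until the final multiplication, each one can be logged and inspected on its own when debugging a ranking.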
Keeping the two models separate, as opposed to having a single logistic regression, gave us two benefits. First, predictions became more interpretable. We can categorize which features predict a professional’s contact rate and which features predict the professional’s positive response rate. Second, the customer contact model has to account for a very powerful confounding effect: that a professional’s rank in a given search greatly affects their contact rate. Isolating this complexity from the positive response model made that model a lot simpler. Indeed, we generally do see more accurate predictions from that model.
The training data for both models comes from historical search event data. Data about customer searches and which professionals were contacted powers the customer contact model. Similarly, data about customer contacts and whether professionals responded positively powers the positive response model. At the time of a search, we log all the features that we could conceivably use in the models. So, to add new features, engineers add those features to the logging and wait for training data to accumulate. The first few feature additions were slow for exactly this reason. However, over time, we’ve gotten better at adding several features at once and backfilling data from other sources when appropriate.
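As a rough illustration of how one stream of logged search events can feed both models, consider the sketch below. The record layout (`features`, `contacted`, `responded_positively`) is a made-up schema, not our actual logging format.

```python
def build_training_sets(impressions):
    """Split logged search impressions into examples for the two models.

    Each impression is one professional shown in one search, with the
    features logged at search time. Hypothetical schema:
      features: dict of feature name -> value
      contacted: bool, whether the customer contacted this professional
      responded_positively: bool, or None if there was no contact
    """
    contact_examples = []   # every impression, label = contacted?
    response_examples = []  # only contacted impressions, label = positive?
    for imp in impressions:
        contact_examples.append((imp["features"], int(imp["contacted"])))
        if imp["contacted"]:
            response_examples.append(
                (imp["features"], int(imp["responded_positively"]))
            )
    return contact_examples, response_examples


# Toy log with three impressions from past searches.
log = [
    {"features": {"num_reviews": 12}, "contacted": True,
     "responded_positively": True},
    {"features": {"num_reviews": 3}, "contacted": True,
     "responded_positively": False},
    {"features": {"num_reviews": 40}, "contacted": False,
     "responded_positively": None},
]
contact_examples, response_examples = build_training_sets(log)
```

Note that the positive response model only ever sees impressions that led to a contact, so it has fewer examples than the contact model.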
Making the Transition
The transition to machine learning was by no means easy. Some of the initial A/B tests didn’t show significant wins over our heuristic scoring algorithm. Others showed wins on upstream engagement metrics but not ones further down the funnel. To debug this, we relied heavily on our ranking evaluation tools to describe what the models were doing differently. These tools include:
- Side-by-Side Ranker Comparison: With our Side-by-Side tool (below), we can see two versions of the search results produced by two different rankers with debugging metadata. This allows us to visualize differences, and examine how various ranking algorithms rank pros differently.
- Offline Simulation: Using an in-house simulation tool, we can replay old requests against experimental rankers and calculate descriptive statistics about their rankings. We can also replay the same requests against the current production ranker. Then, using both outputs, we quantitatively measure how the experimental ranker compares to baseline.
- Live Ranking Experiments: At Thumbtack, we heavily A/B test product changes to understand their impact, and new ranking algorithms are no exception. To further understand a new algorithm’s impact, we built a dashboard that displays slices of our A/B testing data. This lets us see whether an experimental ranker performed better or worse for certain market sizes, categories, or platforms.
Figure 1. A simplified view of the Side-by-Side tool. Between the two rankers, we can see that the 3rd result on the left moved up to become the 1st result on the right. We can also see that it’s likely due to the change in p_contact_model_output.
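To give a flavor of the offline simulation idea, here is a minimal sketch that replays the same logged requests against a baseline and an experimental scoring function and summarizes how the rankings differ. The request format, the two scoring functions, and the statistics chosen are all illustrative assumptions, not a description of our in-house tool.

```python
def rank_ids(candidates, score_fn):
    """Return professional ids ordered by score, best first."""
    ordered = sorted(candidates, key=score_fn, reverse=True)
    return [c["pro_id"] for c in ordered]


def compare_rankers(requests, baseline_fn, experiment_fn):
    """Replay each request against both rankers and summarize differences.

    requests: list of candidate lists, one list per logged search.
    Returns the share of requests whose top result changed and the mean
    number of positions a professional moved between the two rankings.
    """
    if not requests:
        raise ValueError("need at least one request to replay")
    top_changed = 0
    total_shift = 0
    total_candidates = 0
    for candidates in requests:
        base = rank_ids(candidates, baseline_fn)
        exp = rank_ids(candidates, experiment_fn)
        if base[:1] != exp[:1]:
            top_changed += 1
        exp_pos = {pro_id: i for i, pro_id in enumerate(exp)}
        total_shift += sum(abs(i - exp_pos[p]) for i, p in enumerate(base))
        total_candidates += len(base)
    return {
        "top_result_changed_rate": top_changed / len(requests),
        "mean_rank_shift": (
            total_shift / total_candidates if total_candidates else 0.0
        ),
    }


# Two toy requests; the "rankers" are stand-in scoring functions.
requests = [
    [{"pro_id": "a", "reviews": 10, "rating": 4.0},
     {"pro_id": "b", "reviews": 5, "rating": 5.0}],
    [{"pro_id": "c", "reviews": 3, "rating": 3.0},
     {"pro_id": "d", "reviews": 1, "rating": 2.0}],
]
stats = compare_rankers(
    requests,
    baseline_fn=lambda c: c["reviews"],
    experiment_fn=lambda c: c["rating"],
)
```

Statistics like these do not say which ranker is better, but they quickly show how aggressively an experimental ranker reshuffles results before it goes into a live experiment.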
During the transition, these evaluation tools helped us deeply understand and debug our ranking algorithms. In some cases, we fixed bugs in how we sampled event data to obtain our training set. In other cases, we excluded features from the model whose learned weights we couldn’t justify from a product perspective. For instance, the customer contact model initially learned a negative weight for a professional’s number of pictures. Taken at face value, this suggests that customers were less interested in professionals with more pictures. We found this counterintuitive and suspected omitted variable bias: some factor missing from the model was probably correlated with both picture count and contact rate. Hence, at the time, we decided to exclude the feature. One of Thumbtack’s core values is “Know Our Customer”, so if we can’t justify to ourselves why a feature improves the search experience, we won’t ship that feature.
Eventually, we iterated enough that these ML models delivered significant improvements to key engagement metrics, allowing us to ship them confidently. Now these models rank professionals in Thumbtack search requests and give us a systematic way to add new ranking features.
Further Improving the Model
Even though we shipped these machine learning models, we still haven’t fully optimized search ranking on Thumbtack. We’re constantly iterating on these models to make them better. Generally speaking, our improvements fall into one of three buckets:
- Adding new features to the models: Implementing logging for new ranking signals, waiting for data to accumulate, then retraining models.
- Changing how we train the models: For example, changing how we sample training data from event data or adjusting the hyperparameters we specify for model training.
- Changing the type of model we use: We haven’t done this yet, but it’s an exciting possibility for the future.
Our transition to using machine learning models in ranking radically changed how ranking works at Thumbtack. Despite this, we still reason about ranking in largely the same way. We still spend a lot of time brainstorming new ideas for ranking signals. We still depend on product intuition and ranking evaluation tools to convince ourselves that new features are adding useful signal. What’s changed is that our ranking iterations are faster, more reproducible, and more principled. They also adapt better to other changes in the search experience, since newer training data will reflect those changes.
We’re really excited to tackle all the interesting challenges that’ll come with optimizing these ML models and the possibilities that using them will unlock. Beyond search ranking, Thumbtack also uses machine learning in several other product areas, such as spam detection and intent verification. There is still a lot we can do to improve each of these models and become an even better marketplace for local services. Come join us!