
Leonardo Mazzone

Will Langdale

When ordering online groceries, you expect your loyalty points to update, your delivery slot to sync with your calendar, and your payment to go through without retyping your address. Citizens and businesses face a different reality with government services, often re-entering the same information multiple times across disconnected systems.
This fragmentation affects both user experience and government's ability to create effective policies. While solutions like centralised logins have proven challenging, departments have developed various approaches to tackle this.
Some departments have a centralised data matching team. Others leave analysts and services to solve the problem as they encounter it. Government needs a way to consolidate entity resolution in an environment that resists centralisation. That’s why we’ve built Matchbox. Matchbox is a reusable component for record matching. It allows us to deduplicate and link datasets in a way that is measurable, iterative and collaborative. It retains almost no information about the data it matches. We hope our unique approach to entity resolution will mean data is linked robustly enough to power both analysis and services, yet securely enough to defend civil liberties.
Why we built Matchbox
Analysts at the Department for Business and Trade use company data to inform better policy, monitor the department’s impact, and improve our effectiveness. Digital products use company data to capture our relationship with businesses and offer them a more bespoke experience. This improves the decisions we make on anything from grants to export licences.
But company data comes from many different sources, which must be joined to produce powerful insights. That is a difficult task when the identifiers in each data source differ. It’s similarly tricky to join by other attributes such as name and address: all attributes can be ambiguous, missing, include mistakes, or change over time. We need to:
- accept that any approach will have a margin of error, and understand its limitations
- evaluate data quality over time and improve our methodology iteratively
Over the years, different parts of the department have tackled this challenge in many ways. Our objective was to build a tool that would make it easy to deduplicate and link datasets in a reusable and consistent way. It had to be easy to query, protect sensitive data, and straightforward to evaluate and evolve. We can avoid repeated effort and ensure that everyone benefits from the highest quality matches available, shaped by the right domain experts in each area.
In building Matchbox, we benefit from the work of countless others, who have released open-source code on which our own code depends. To benefit from community contributions, encourage reuse across government, and work transparently in the public interest, we also publish our work as an open-source project. We hope that by building an open-source and general solution that is easily customised and adopted, we can multiply our work’s impact, to benefit many government organisations and beyond.
Linking datasets
Linking 2 data sources with Matchbox involves:
- Registering each data source.
- Defining steps to clean the fields of each data source. For example, rules to remove white spaces from postcodes or transform “LTD” to “Limited” in company names.
- Defining a “model” (or methodology) to deduplicate each data source.
- Finally, defining a model to link the 2 data sources.
All these steps are expressed in Python and scheduled to run automatically. These steps can be repeated and stacked to link much more than 2 datasets. For example, the output of a model linking datasets X and Y can be the input to a model adding dataset Z on top of X and Y.
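To make the cleaning step concrete, here is a minimal sketch of the two example rules in plain Python. The function names are hypothetical and are not part of the Matchbox API; real cleaning rules would likely cover many more cases.

```python
import re


def clean_postcode(postcode: str) -> str:
    """Remove every whitespace character from a postcode."""
    return re.sub(r"\s+", "", postcode)


def clean_company_name(name: str) -> str:
    """Collapse repeated spaces and expand the abbreviation LTD to Limited."""
    collapsed = " ".join(name.split())
    # \b stops us rewriting names that merely contain the letters "ltd".
    return re.sub(r"\bLTD\b\.?", "Limited", collapsed, flags=re.IGNORECASE)


cleaned_postcode = clean_postcode(" SW1A 2AA ")
cleaned_name = clean_company_name("Acme   Trading LTD")
```

Keeping each rule as a small function makes it easy to reuse the same cleaning across several data sources and to test rules in isolation.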
Models can be based on simple rules (e.g. “2 companies are the same if their cleaned names and postcodes are the same”), or more sophisticated. For example, “probabilistic models” output confidence scores about 2 records matching, which are fractional values from 0 to 1. These scores express how likely it is that 2 records are the same given that their fields are similar to varying degrees. We use Splink (another open-source government project) to build our probabilistic models.
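The simple rule quoted above can be sketched in a few lines of plain Python. This is an illustration only: the record layout, the function names and the choice to report a score of 1.0 for an exact match are all assumptions, not how Matchbox or Splink represent models.

```python
from collections import defaultdict


def match_key(record: dict) -> tuple[str, str]:
    """Build a comparison key from a record's name and postcode."""
    name = " ".join(record["name"].split()).casefold()
    postcode = "".join(record["postcode"].split()).upper()
    return (name, postcode)


def link_exact(left: list[dict], right: list[dict]) -> list[tuple[str, str, float]]:
    """Return (left_id, right_id, score) for records that share a key."""
    index = defaultdict(list)
    for record in left:
        index[match_key(record)].append(record["id"])

    matches = []
    for record in right:
        for left_id in index.get(match_key(record), []):
            # A rule either fires or it does not, so we assume a score of 1.0.
            matches.append((left_id, record["id"], 1.0))
    return matches


# Made-up records for two hypothetical sources, X and Y.
source_x = [
    {"id": "x1", "name": "Acme Limited", "postcode": "SW1A 2AA"},
    {"id": "x2", "name": "Bolt Limited", "postcode": "M1 1AE"},
]
source_y = [
    {"id": "y1", "name": "ACME LIMITED", "postcode": "sw1a2aa"},
    {"id": "y2", "name": "Cog Limited", "postcode": "M1 1AE"},
]
links = link_exact(source_x, source_y)
```

A probabilistic model would replace the fixed 1.0 with a score between 0 and 1 that reflects how similar the fields are.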
Matchbox components
Matchbox is made up of a client (which users and applications run), and a server. The server stores information about which models have been run and when, their results, and user evaluation. The original data being matched is never sent to the Matchbox server. The server knows how to stitch datasets together, but not what information they contain, to boost data security.
The Matchbox client is a software package for the Python programming language. The client can:
- run matching models and then send the results to the server
- retrieve from the server the results from a previously run model. The client can then use these results to link and deduplicate data sources it can access outside of the Matchbox server
Thus, Matchbox users can access data from the sources if and only if they already have access to it outside of Matchbox.
Evaluating model results
Let’s say that a model thinks that records A and B are the same, and a different model, stacked on top, thinks that records B and C are the same. From the second model’s point of view, A, B and C are all part of the same “cluster” of records. Which model describes the true entity?
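One common way to turn pairwise matches into clusters is to treat them as edges in a graph and find the connected components. The sketch below does this with a small union-find structure. It is a generic illustration of the idea, and we make no claim that Matchbox stores or computes clusters this way.

```python
def build_clusters(pairs: list[tuple[str, str]]) -> list[list[str]]:
    """Group records into clusters, given pairs of records judged the same."""
    parent: dict[str, str] = {}

    def find(item: str) -> str:
        """Return the representative record of the cluster containing item."""
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # shorten the path as we go
            item = parent[item]
        return item

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters: dict[str, set[str]] = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return sorted(sorted(members) for members in clusters.values())


first_model = build_clusters([("A", "B")])
second_model = build_clusters([("A", "B"), ("B", "C")])
```

Here the first model sees one cluster containing A and B, while the stacked model sees a single cluster containing A, B and C.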
At any one point, we name one model as the “point of truth” in terms of which clusters exist in our data. Matchbox helps us discover which point of truth best represents reality. We randomly sample clusters and show them to our users. They can confirm that different clusters look sensible or correct them. This forms our validation data, which we use to assign 2 scores describing a model’s performance:
- precision (of the matches made by the model, what percentage are correct?)
- recall (of the matches the model should have made, how many did it make?)
There is typically a trade-off between precision and recall: a more conservative model will have better precision and worse recall, and vice versa. Based on a suitable compromise between the 2, we can tune a model or compare alternative models.
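As a rough illustration of the two scores, the sketch below computes pairwise precision and recall from a set of predicted matches and a set of matches confirmed by users. The function name and the pairwise framing are assumptions made for this example; Matchbox's own evaluation may be calculated differently.

```python
def precision_recall(predicted: set, validated: set) -> tuple[float, float]:
    """Compare predicted record pairs against user-validated pairs."""
    # frozenset makes ("A", "B") and ("B", "A") count as the same pair.
    predicted_pairs = {frozenset(pair) for pair in predicted}
    validated_pairs = {frozenset(pair) for pair in validated}
    correct = predicted_pairs & validated_pairs

    precision = len(correct) / len(predicted_pairs) if predicted_pairs else 0.0
    recall = len(correct) / len(validated_pairs) if validated_pairs else 0.0
    return precision, recall


# Made-up example: the model made 3 matches, 2 of which users confirmed,
# and users identified 4 true matches in total.
model_matches = {("A", "B"), ("B", "C"), ("D", "E")}
user_matches = {("A", "B"), ("C", "B"), ("A", "C"), ("F", "G")}
precision, recall = precision_recall(model_matches, user_matches)
```

In this example precision is 2 out of 3 and recall is 2 out of 4. A stricter model that dropped the unconfirmed D and E match would raise precision to 100% without changing recall.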
Querying matched data
Once we’re confident with the quality of a model that links multiple datasets, we can serve it to users. There are 2 ways it can be used:
- software can request matches on demand. For example, our business relationship management tool can ask Matchbox which Companies House numbers correspond to its own identifiers
- analysts can use a Matchbox table on a SQL database to join other data sources as required
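To show what joining through a lookup table might look like, here is a self-contained sketch using Python's built-in sqlite3 module. The table names, column names and rows are all invented for this example, and the real Matchbox table will have its own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical source tables and a hypothetical lookup table.
    CREATE TABLE crm (crm_id TEXT, account_manager TEXT);
    CREATE TABLE companies_house (company_number TEXT, registered_name TEXT);
    CREATE TABLE matches (cluster_id INTEGER, source TEXT, source_key TEXT);

    INSERT INTO crm VALUES ('crm-1', 'Sam'), ('crm-2', 'Alex');
    INSERT INTO companies_house VALUES
        ('01234567', 'Acme Limited'), ('07654321', 'Bolt Limited');
    INSERT INTO matches VALUES
        (1, 'crm', 'crm-1'),
        (1, 'companies_house', '01234567'),
        (2, 'crm', 'crm-2');
    """
)

# Join the two sources through the cluster they share in the lookup table.
rows = conn.execute(
    """
    SELECT crm.crm_id, ch.company_number, ch.registered_name
    FROM crm
    JOIN matches AS m_crm
        ON m_crm.source = 'crm' AND m_crm.source_key = crm.crm_id
    JOIN matches AS m_ch
        ON m_ch.source = 'companies_house'
        AND m_ch.cluster_id = m_crm.cluster_id
    JOIN companies_house AS ch ON ch.company_number = m_ch.source_key
    ORDER BY crm.crm_id
    """
).fetchall()
```

Only the record with a counterpart in both sources is returned. Note that the lookup table holds identifiers only, so the analyst still needs their own access to each source table.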
Learning at DBT
On top of the satisfaction of tackling a problem with far-reaching consequences for government, and ultimately for all citizens, working on Matchbox has been an incredible learning experience. It encourages us to stretch across product management, data engineering, DevOps and data science.
There is still much more we want to do, such as allowing internal and external government services to look up and exchange company information resolved by Matchbox.
If you’re interested in helping solve similar challenges, have a look at our jobs board to see current opportunities. Alternatively, if you want to deploy your own Matchbox instance, or contribute features that benefit everyone, get in touch at datamatching@businessandtrade.gov.uk.