Building a reusable data matching product

AI-generated image of a winking matchbox holding up a flaming match, surrounded by data symbols.

Leonardo Mazzone

Will Langdale

When ordering online groceries, you expect your loyalty points to update, your delivery slot to sync with your calendar, and your payment to go through without retyping your address. Citizens and businesses face a different reality with government services, often re-entering the same information multiple times across disconnected systems.

This fragmentation affects both user experience and government's ability to create effective policies. While solutions like centralised logins have proven challenging, departments have developed various approaches to tackle this.

Some departments have a centralised data matching team. Others leave analysts and services to solve the problem as they encounter it. Government needs a way to consolidate entity resolution in an environment that resists centralisation. That’s why we’ve built Matchbox. Matchbox is a reusable component for record matching. It allows us to deduplicate and link datasets in a way that is measurable, iterative and collaborative. It retains almost no information about the data it matches. We hope our unique approach to entity resolution will mean data is linked robustly enough to power both analysis and services, yet securely enough to defend civil liberties.

Why we built Matchbox

Analysts at the Department for Business and Trade use company data to inform better policy, monitor the department’s impact, and improve our effectiveness. Digital products use company data to capture our relationship with businesses and offer them a more bespoke experience. This improves the decisions we make on anything from grants to export licences.

But company data comes from many different sources, which must be joined to produce powerful insights. That is a difficult task when the identifiers in each data source differ. It’s similarly tricky to join by other attributes such as name and address: all attributes can be ambiguous, missing, include mistakes, or change over time. We need to:

accept that any approach will have a margin of error, and understand its limitations
evaluate data quality over time and improve our methodology iteratively

Over the years, different parts of the department have tackled this challenge in many ways. Our objective was to build a tool that would make it easy to deduplicate and link datasets in a reusable and consistent way. It had to be easy to query, protect sensitive data, and be straightforward to evaluate and evolve. With such a tool, we can avoid repeated effort and ensure that everyone benefits from the highest quality matches available, shaped by the right domain experts in each area.

In building Matchbox, we benefit from the work of countless others, who have released open-source code on which our own code depends. To benefit from community contributions, encourage reuse across government, and work transparently in the public interest, we also publish our work as an open-source project. We hope that by building an open-source and general solution that is easily customised and adopted, we can multiply our work’s impact, to benefit many government organisations and beyond.

Linking datasets

Linking 2 data sources with Matchbox involves:

Registering each data source.
Defining steps to clean the fields of each data source. For example, rules to remove white spaces from postcodes or transform “LTD” to “Limited” in company names.
Defining a “model” (or methodology) to deduplicate each data source.
Finally, defining a model to link the 2 data sources.

All these steps are expressed in Python and scheduled to run automatically. These steps can be repeated and stacked to link much more than 2 datasets. For example, the output of a model linking datasets X and Y can be the input to a model adding dataset Z on top of X and Y.
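As an illustration of the cleaning step, here is a minimal sketch of the two example rules mentioned above, written in plain Python. The function names are hypothetical and are not part of the Matchbox client. The sketch also upper-cases values, which is our own assumption to make comparisons case-insensitive.

```python
import re


def clean_postcode(postcode: str) -> str:
    """Remove all white space from a postcode and upper-case it."""
    return re.sub(r"\s+", "", postcode).upper()


def clean_company_name(name: str) -> str:
    """Collapse repeated white space, upper-case, and expand 'LTD' to 'LIMITED'."""
    name = re.sub(r"\s+", " ", name).strip().upper()
    # Match LTD as a whole word, with an optional trailing full stop.
    return re.sub(r"\bLTD\b\.?", "LIMITED", name)


cleaned_postcode = clean_postcode("sw1a 2aa")
cleaned_name = clean_company_name("Acme  Widgets Ltd.")
```

Keeping each rule as a small, separate function makes it easier to reuse the same cleaning across several data sources.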

Models can be based on simple rules (e.g. “2 companies are the same if their cleaned names and postcodes are the same”), or more sophisticated. For example, “probabilistic models” output confidence scores about 2 records matching, which are fractional values from 0 to 1. These scores express how likely it is that 2 records are the same given that their fields are similar to varying degrees. We use Splink (another open-source government project) to build our probabilistic models.
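The simple rule quoted above can be sketched as an exact-match linker. This is an illustrative example only: the function `link_exact`, the record layout and the sample data are hypothetical, and giving every rule-based match a score of 1.0 is our assumption, chosen so that the output has the same shape as a probabilistic model's.

```python
from collections import defaultdict


def link_exact(left, right, key):
    """Return (left_id, right_id, score) for records whose keys are equal."""
    index = defaultdict(list)
    for record in right:
        index[key(record)].append(record["id"])
    return [
        (record["id"], right_id, 1.0)
        for record in left
        for right_id in index.get(key(record), [])
    ]


# Hypothetical, already-cleaned records from 2 data sources.
left = [
    {"id": "L1", "name": "ACME LIMITED", "postcode": "SW1A2AA"},
    {"id": "L2", "name": "BETA LIMITED", "postcode": "M11AA"},
]
right = [
    {"id": "R7", "name": "ACME LIMITED", "postcode": "SW1A2AA"},
    {"id": "R8", "name": "GAMMA LIMITED", "postcode": "M11AA"},
]

# Rule: 2 companies are the same if cleaned name and postcode are the same.
matches = link_exact(left, right, key=lambda r: (r["name"], r["postcode"]))
```

Only L1 and R7 share both a name and a postcode, so they are the only pair returned.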

Matchbox components

Matchbox is made up of a client (which users and applications run), and a server. The server stores information about which models have been run and when, their results, and user evaluation. The original data being matched is never sent to the Matchbox server. The server knows how to stitch datasets together, but not what information they contain, to boost data security.

The Matchbox client is a software package for the Python programming language. The client can:

run matching models and then send the results to the server
retrieve from the server the results from a previously run model. The client can then use these results to link and deduplicate data sources it can access outside of the Matchbox server

Thus, Matchbox users can access data from the sources if and only if they already have access to it outside of Matchbox.
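To make the separation concrete, here is a toy sketch of the idea, not Matchbox's actual implementation. We assume, purely for illustration, that records are referred to by a hash of their source name and primary key. `ToyServer`, `record_key` and the sample data are all hypothetical.

```python
import hashlib


def record_key(source: str, primary_key: str) -> str:
    """Opaque reference to a record, built from its source and primary key only."""
    return hashlib.sha256(f"{source}:{primary_key}".encode()).hexdigest()


class ToyServer:
    """Stores model results as pairs of opaque keys; it never sees field values."""

    def __init__(self):
        self.results = {}

    def upload(self, model_name, pairs):
        self.results[model_name] = list(pairs)

    def download(self, model_name):
        return list(self.results[model_name])


# Client side: the data itself stays wherever the client can already read it.
crm = {"17": {"name": "ACME LIMITED"}}
registry = {"00012345": {"name": "ACME LTD"}}

server = ToyServer()
server.upload(
    "crm_to_registry",
    [(record_key("crm", "17"), record_key("registry", "00012345"), 1.0)],
)

# A client with access to both sources can turn the keys back into records.
lookup = {record_key("crm", pk): rec for pk, rec in crm.items()}
lookup.update({record_key("registry", pk): rec for pk, rec in registry.items()})
linked = [
    (lookup[a]["name"], lookup[b]["name"])
    for a, b, _score in server.download("crm_to_registry")
]
```

A client without access to one of the sources would hold keys it cannot resolve, which is the property described above.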

Evaluating model results

Let’s say that a model thinks that records A and B are the same, and a different model, stacked on top, thinks that records B and C are the same. From the second model’s point of view, A, B and C are all part of the same “cluster” of records. Which model describes the true entity?
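The way stacked matches merge into clusters can be sketched as finding connected components over matched pairs. The union-find approach below is a common technique for this and is our choice for the illustration; it is not necessarily how Matchbox computes clusters internally.

```python
def clusters(pairs):
    """Group records into clusters, treating each matched pair as a link."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the path as we walk up
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return sorted(sorted(group) for group in groups.values())


first_model = [("A", "B")]
second_model = first_model + [("B", "C")]  # stacked on top of the first

first_clusters = clusters(first_model)
second_clusters = clusters(second_model)
```

The first model yields a cluster of A and B, while the second model, which inherits that match and adds B to C, yields a single cluster of A, B and C.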

At any one point, we name one model as the “point of truth” in terms of which clusters exist in our data. Matchbox helps us discover which point of truth best represents reality. We randomly sample clusters and show them to our users. They can confirm that different clusters look sensible or correct them. This forms our validation data, which we use to assign 2 scores describing a model’s performance:

precision (of the matches made by the model, what percentage are correct?)
recall (of the matches the model should have made, how many did it make?)

There is typically a trade-off between precision and recall: a more conservative model will have better precision and worse recall, and vice versa. Based on a suitable compromise between the 2, we can tune a model or compare alternative models.
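As a worked example, precision and recall can be computed by comparing the pairs implied by a model's clusters with the pairs implied by user-validated clusters. Scoring at the level of pairs is one common convention and is an assumption here, as are the function names and sample data.

```python
from itertools import combinations


def pairs_from_clusters(clusters):
    """Every unordered pair of records that share a cluster."""
    return {
        frozenset(pair)
        for cluster in clusters
        for pair in combinations(sorted(cluster), 2)
    }


def precision_recall(predicted, validated):
    """Pairwise precision and recall of predicted clusters against validated ones."""
    predicted_pairs = pairs_from_clusters(predicted)
    true_pairs = pairs_from_clusters(validated)
    correct = predicted_pairs & true_pairs
    precision = len(correct) / len(predicted_pairs) if predicted_pairs else 0.0
    recall = len(correct) / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Hypothetical model output and user-corrected clusters.
predicted = [{"A", "B", "C"}, {"D", "E"}]
validated = [{"A", "B"}, {"C"}, {"D", "E", "F"}]

precision, recall = precision_recall(predicted, validated)
```

Here the model makes 4 pairs, of which 2 are correct, and misses 2 of the 4 pairs it should have made, so both precision and recall are 0.5.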

Querying matched data

Once we’re confident with the quality of a model that links multiple datasets, we can serve it to users. There are 2 ways it can be used:

software can request matches on demand. For example, our business relationship management tool can ask Matchbox which Companies House numbers correspond to its own identifiers
analysts can use a Matchbox table on a SQL database to join other data sources as required
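The second option might look something like the sketch below, which uses an in-memory SQLite database so that it can be run anywhere. The table and column names (`matchbox_lookup`, `cluster_id`, `source`, `source_id` and so on) are hypothetical and do not reflect the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical lookup table: one row per record, labelled with its cluster.
    CREATE TABLE matchbox_lookup (cluster_id INTEGER, source TEXT, source_id TEXT);
    CREATE TABLE crm_companies (crm_id TEXT, account_manager TEXT);
    CREATE TABLE companies_house (company_number TEXT, registered_name TEXT);

    INSERT INTO matchbox_lookup VALUES
        (1, 'crm', '17'),
        (1, 'companies_house', '00012345'),
        (2, 'crm', '18');
    INSERT INTO crm_companies VALUES ('17', 'Sam'), ('18', 'Alex');
    INSERT INTO companies_house VALUES ('00012345', 'ACME LIMITED');
    """
)

# Join the 2 sources through the cluster they share in the lookup table.
rows = conn.execute(
    """
    SELECT crm.crm_id, ch.company_number, ch.registered_name
    FROM crm_companies AS crm
    JOIN matchbox_lookup AS a
      ON a.source = 'crm' AND a.source_id = crm.crm_id
    JOIN matchbox_lookup AS b
      ON b.cluster_id = a.cluster_id AND b.source = 'companies_house'
    JOIN companies_house AS ch
      ON ch.company_number = b.source_id
    ORDER BY crm.crm_id
    """
).fetchall()
```

Only CRM record 17 shares a cluster with a Companies House record, so it is the only row returned. Record 18 has no match and drops out of the inner join.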

Learning at DBT

On top of the satisfaction of tackling a problem with far-reaching consequences for government, and ultimately for all citizens, working on Matchbox has been an incredible learning experience. It encourages us to stretch across product management, data engineering, DevOps and data science.

There is still much more we want to do, such as allowing internal and external government services to look up and exchange company information resolved by Matchbox.

If you’re interested in helping solve similar challenges, have a look at our jobs board to see current opportunities. Alternatively, if you want to deploy your own Matchbox instance, or contribute features that benefit everyone, get in touch at datamatching@businessandtrade.gov.uk.

3 comments

Comment by Thomas Kingston posted on 14 October 2025

This looks really interesting. These products are just what we need to overcome the challenges of linking records from different data sources in the public sector. Just wondering how this tool differs from Splink, a similar product developed by the Ministry of Justice?

- Replies to Thomas Kingston>
  Comment by Leonardo Mazzone posted on 22 October 2025
  
  Hi Thomas, thanks so much for your question.
  
  Splink builds models for deduplicating single datasets or linking pairs of datasets. Matchbox helps you combine many deduplicating or pair-wise linking models to bring many datasets together. In other words, you could build several Splink models, and Matchbox is the tool that helps you connect them all together, store the result in a single place, query it easily, and evaluate it on an ongoing basis. We use Splink ourselves and have built Matchbox to make it really simple to use Splink on many datasets at once.
  
  Within Matchbox, you can also define your own models alternative to Splink if you prefer, which we do for those simpler datasets that don't need the full complexity of probabilistic matching.
  
  Hope this makes sense.
  
  - Replies to Leonardo Mazzone>
    
    Comment by Thomas Kingston posted on 22 October 2025
    
    Hi Leonardo,
    
    Thank you very much for your reply. That is a clear explanation of how the products are different and how they can work together. I will definitely consider both Splink and Matchbox for data matching tasks in my work.
    
    Thanks again
    
