https://digitaltrade.blog.gov.uk/2025/10/15/building-better-data-platforms-our-open-source-approach-to-infrastructure-and-tooling/

Building better data platforms: Our open source approach to infrastructure and tooling

[Image: a simple diagram of a data platform, showing how it ingests, stores and catalogues data from source systems to turn into analytics and visualisations]
[Image: headshots of Soumaya Mauthoor, Michal Charemza and Rachel Price]

This is the second blog in our series reflecting on findings from a research project conducted earlier this year, involving 15 organisations across the civil service, private sector and charity sector. The project explored how large, complex organisations approach the development of data strategies – their ambitions, challenges and practical choices.

Our first blog focused on strategic ambitions. In this post, we turn to the solutions that organisations are using to power their data platforms, and how these compare to the approach we take at the Department for Business and Trade (DBT).

Introduction

At their core, all data platforms have the same aim – they allow organisations to consolidate different data sets and make them accessible for exploration and analysis.   

Yet the way organisations approach this can vary significantly, starting with how the data platform is hosted: on-premises, in the cloud, or through a hybrid model.

Hosting

The discovery revealed that, as mandated, government departments share a common ambition: to fully migrate to the cloud. However, some are still transitioning legacy systems. Among cloud providers, Azure is the most commonly used for hosting data platforms, closely followed by Amazon Web Services (AWS).

At DBT we already use AWS, so it was a natural choice for hosting our data platform, and we are currently migrating the Azure-based CBAS (Cloud-Based Analytical System). Although we don't otherwise actively use Azure, much of our data is stored within Microsoft cloud services like Office 365, and seamlessly integrating this with our AWS-based data platform remains an ongoing challenge.

In contrast, industry participants have largely completed their cloud migration. Many successfully employ a multi-cloud approach, and some retain on-premises infrastructure for high-security or processing-intensive operations.

Infrastructure

Once the hosting strategy is decided, the next critical consideration is the infrastructure that supports the data platform, particularly around how data will be collected and stored. Most organisations we spoke with rely on proprietary data warehouses and lakehouses such as Snowflake and Databricks. These technologies support distributed architectures and can handle data at the petabyte scale. 

At DBT, we’ve taken a different path. We made a deliberate decision to prioritise open source solutions wherever possible, so we built a data platform using PostgreSQL for storage and Airflow for data pipeline orchestration, both open source technologies. These run on AWS services, which simplifies maintainability, scalability and reliability while giving us the flexibility to migrate to another cloud provider if needed. Although native Postgres isn’t typically used for analytics, our initial data volumes were modest enough to justify its use, and advances in server performance over recent years have enabled us to handle increasingly large datasets.
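To make the pattern concrete, here is a minimal sketch of an Airflow DAG that loads a small dataset into PostgreSQL. It is illustrative only: the DAG id, connection id and table are hypothetical examples, not our actual pipeline code.

# Minimal, illustrative Airflow DAG loading rows into PostgreSQL.
# The dag_id, connection id and table below are hypothetical examples.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def load_example_dataset():
    # PostgresHook reads credentials from the Airflow connection store
    hook = PostgresHook(postgres_conn_id="data_platform_db")  # hypothetical connection id
    hook.run("CREATE TABLE IF NOT EXISTS example_dataset (id INT, value TEXT)")
    hook.insert_rows(table="example_dataset", rows=[(1, "example")])


with DAG(
    dag_id="example_ingest",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="load", python_callable=load_example_dataset)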

Looking ahead, we’re exploring how to make the platform multi-modal, supporting a broader range of storage options. This includes investigating how we can better integrate object storage for handling non-tabular and larger-scale data sets. We are also evaluating various graph databases for scenarios like supply chain analysis, where graph analytics could offer significant value.

Tooling

After defining the storage architecture, the next logical step is selecting the tooling layer that enables users to interact with the data and derive insights from it. Proprietary data platforms often include built-in tools for cataloguing, analysis and reporting, which can simplify management. At DBT, we have chosen a custom-built data platform, which gives us the flexibility to select the tools that best fit our users' needs.

Our data catalogue is tailor-made using Django and the GOV.UK Design System, both open source technologies already familiar across DBT services. This helps us meet the GOV.UK Service Standard and ensures consistency for users. Depending on their use case and skill set, users can choose from open source tools like JupyterLab and RStudio, or proprietary tools like Stata and Amazon QuickSight.
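For illustration, a data set entry in a Django-based catalogue could be modelled roughly as follows. This is a hypothetical sketch, not the actual schema of our catalogue.

# Illustrative only: a hypothetical Django model for a catalogue entry,
# not the real schema of DBT's data catalogue.
from django.db import models


class DatasetEntry(models.Model):
    name = models.CharField(max_length=255)
    description = models.TextField(blank=True)
    source_system = models.CharField(max_length=255, blank=True)
    published = models.BooleanField(default=False)
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.name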

We’ve also recently deployed Amazon SageMaker to host Large Language Models (LLMs). Our upcoming blogs will discuss in more detail how we onboard and support users across these tools, including AI and machine learning use cases.
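As a rough sketch of how an application might call a model hosted on a SageMaker endpoint using boto3 (the endpoint name and payload shape below are hypothetical and depend on the model container being served):

# Illustrative call to a SageMaker real-time endpoint via boto3.
# The endpoint name and payload format are hypothetical; the actual
# request/response shape depends on the hosted model container.
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="example-llm-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps({"inputs": "Summarise this trade report ..."}),
)
print(response["Body"].read().decode("utf-8"))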

Tooling access is securely contained and tools can't connect to the internet. Code is managed through a self-hosted GitLab instance, and packages are pulled from internal mirrors. We'll share more on governance and security in a future blog - including how we ensure that users only access the data they’re authorised to see.  
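As an illustration of the package mirror setup, a pip configuration that points installs at an internal mirror might look like the snippet below; the URL is a placeholder, not a real internal address.

# Illustrative pip.conf pointing package installs at an internal mirror.
# The URL below is a placeholder, not a real DBT address.
[global]
index-url = https://packages.example.internal/pypi/simple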

Conclusion

While most organisations in our study rely on proprietary platforms hosted on Azure or AWS, we have taken a different route. By building our data platform from the ground up and prioritising open source solutions, we’ve gained greater flexibility and control, tailoring the platform to our users' needs. This approach may be less common, but we believe the benefits far outweigh the trade-offs. We're committed to open sourcing our platform in the future, so others can build on what we’ve learned and developed.
