High-level Overview of Medallion Architecture
May 7, 2024
Keeping a high volume of data organized is one of the top priorities for a business analyst. You’ve likely heard of (or may already be using) Medallion architecture to achieve these goals. This structure is also considered best practice while designing and using Microsoft Fabric.
I’ve curated an overview of the Medallion Architecture from Microsoft’s resources to help explain the concepts at a high level. Links to the full documentation will be included at the end of this article.
If you have further questions about Medallion architecture or Microsoft Fabric, please contact Superior Consulting Services.
Medallion architecture is a design pattern that's used by organizations to logically organize data. It is comprised of three zones or layers. Each layer indicates the quality of data, with higher levels representing higher quality. This multi-layered approach helps you to build a single source of truth for enterprise data products.
There are three layers that make up the core of this architecture:
- Bronze: Also known as the raw zone, this first layer stores source data in its original format. The data in this layer is typically append-only and immutable.
- Silver: Also known as the enriched zone, this layer stores data sourced from the bronze layer. The raw data has been cleansed and standardized, and it's now structured as tables (rows and columns). It might also be integrated with other data to provide an enterprise view of all business entities, like customer, product, and others.
- Gold: Also known as the curated zone, this final layer stores data sourced from the silver layer. The data is refined to meet specific downstream business and analytics requirements. Tables typically conform to star schema design, which supports the development of data models that are optimized for performance and usability.
Curated Overview of Medallion Architecture in Microsoft Fabric
Store data with OneLake
At the foundation Of Microsoft Fabric is the OneLake data lake. It comes automatically provisioned with every Fabric tenant, and it's designed to be the single location for all your analytics data. Using OneLake allows us to remove silos and reduce management effort. All organizational data is stored, managed, and secured within one data lake resource.
Benefits include:
- Reduce data movement and duplication. The objective of OneLake is to store only one copy of data. Fewer copies of data results in fewer data movement processes, and that leads to efficiency gains and reduction in complexity.
- Use with multiple analytical engines. The data in OneLake is stored in an open format. That way, the data can be queried by various analytical engines, including Analysis Services (used by Power BI), T-SQL, and Spark.
The overall medallian structure within OneLake is represented in this graphic. Below, I’ll go through each of the layers in detail.
1. Ingest raw data to the bronze layer
The bronze layer contains unvalidated data. Data ingested in the bronze layer typically:
- Maintains the raw state of the data source.
- Is appended incrementally and grows over time.
- Can be any combination of streaming and batch transactions.
Retaining the full, unprocessed history of each dataset in an efficient storage format provides the ability to recreate any state of a given data system.
Additional metadata (such as source file names or recording the time data was processed) may be added to data on ingest for enhanced discoverability, description of the state of the source dataset, and optimized performance in downstream applications.
2. Validate and deduplicate data in the silver layer
Recall that while the bronze layer contains the entire data history in a nearly raw state, the silver layer represents a validated, enriched version of our data that can be trusted for downstream analytics.
While Databricks believes strongly in the lakehouse vision driven by bronze, silver, and gold tables, simply implementing a silver layer efficiently will immediately unlock many of the potential benefits of the lakehouse.
For any data pipeline, the silver layer may contain more than one table. The number of required tables will depend on the necessary data.
3. Power analytics with the gold layer
This gold data is often highly refined and aggregated, containing data that powers analytics, machine learning, and production applications. While all tables in the lakehouse should serve an important purpose, gold tables represent data that has been transformed into knowledge, rather than just information.
Analysts largely rely on gold tables for their core responsibilities, and data shared with a customer would rarely be stored outside this level.
Updates to these tables are completed as part of regularly scheduled production workloads, which helps control costs and allows service level agreements (SLAs) for data freshness to be established.
While the lakehouse doesn’t have the same deadlock issues that you may encounter in an enterprise data warehouse, gold tables are often stored in a separate storage container to help avoid cloud limits on data requests.
In general, becuse aggregations, joins, and filtering are handled before data is written to the gold layer, users should see low latency query performance on data in gold tables.
Final Thoughts
Medallion architecture is a proven structure for organizing and validating data. Using the included OneLake data lake within Microsoft Fabric, business analysts can flow data through a three layer architecture to distill raw data into knowledge.
For a more detailed plan regarding your databases, contact a data analyst at SCS to discuss your project.
Further Reading:
What is the medallion lakehouse architecture? - Microsoft Build, 03/01/24
Implement medallion lakehouse architecture in Microsoft Fabric - Microsoft Build, 11/15/23
Superior Consulting Services (SCS) is a Microsoft-centric technology firm providing innovative solutions that enable our clients to solve business problems. We offer full-scale data modeling, analytics, and custom app development.