From Raw Data to Polished Insights: A Guide to the Medallion Architecture
Hearing about a new design pattern or concept can feel overwhelming at first. But once you look at it closely, it is often a simple idea that you were probably already using.
That is how I feel about the Medallion Architecture. Most projects you work on involve cleaning data, usually in the early stages. The very nature of data is that it is not always collected in a uniform way.
This makes having a formalized process for cleaning and structuring the data you are using very important.
The Medallion Architecture is a more formalized and structured approach to what you were already doing, encouraging you to document and iterate on the process.
The Bronze Layer: The Raw Collection
The Bronze layer is where all the raw materials arrive. For our project, we used the GWAS Catalog, a public database of published genetic research findings from the European Bioinformatics Institute and the National Human Genome Research Institute.
The raw data is remarkably versatile and can be used for countless abstractions because it contains the most granular level of detail possible. A single record has a comprehensive set of fields, including the specific genetic variant, the p-value of the association, the disease or trait it is linked to, the journal it was published in, and much more. This raw, detailed information is the foundation for all the insights that follow.
However, because it is sourced from so many different studies, the raw data is full of inconsistencies and needs a rigorous quality control check before any analysis. The Bronze layer ensures we have a complete, untouched copy of the raw data before any cleaning begins, which is the first step toward reproducibility.
The Silver Layer: The First Shine
The Silver layer is where the real work begins. We take the raw data from the Bronze layer and start the first round of polishing. This is where we apply a rigorous process to clean, validate, and structure the data.
For the GWAS Catalog data, this meant:
- Renaming columns: We changed difficult-to-read column names like `DATE ADDED TO CATALOG` to something clean and simple, like `date_added`.
- Converting data types: We converted the `date_added` column from a string to a proper date format, making it usable for time-based analysis.
- Filtering out noise: We can filter out any incomplete or low-quality records that could skew our results (a minimal sketch of these steps follows this list).
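Here is a minimal pandas sketch of those three steps. The column names come from the GWAS Catalog file, but the quality filter (dropping rows with a missing p-value) is just one example of what "noise" could mean for your data:

import pandas as pd

# Read the raw tab-separated GWAS Catalog file from the Bronze layer
df = pd.read_csv("data/bronze/gwas_catalog.tsv", sep="\t", low_memory=False)

# Rename an awkward column to something clean and simple
df = df.rename(columns={"DATE ADDED TO CATALOG": "date_added"})

# Convert the string date to a proper datetime for time-based analysis
df["date_added"] = pd.to_datetime(df["date_added"])

# One possible noise filter: drop rows that are missing a p-value
df = df.dropna(subset=["P-VALUE"])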
By the end of this stage, the data is in a clean, structured format, ready for analysis. It’s no longer just a collection of text; it’s a reliable table that can be trusted.
The Gold Layer: The Final Gleam
The Gold layer is the final stage, where the data is polished to perfection and made ready for scientific inquiry. The goal here is to answer specific questions by aggregating the cleaned data. This layer is optimized for analysis and reporting, ensuring that you can get your answers quickly and reliably.
The real value here is that we can now use this clean data to surface key insights that would have been impossible to see in the raw data. We chose two specific insights to extract from the data:
- Most Studied Traits: We counted the number of studies for each disease or trait. This gives us a high-level view of where research is most concentrated in the field of genomics.
- Most Prolific Journals: We counted the number of studies published in each journal. This tells us which publications are the most influential in this research space.
The Medallion Architecture isn’t just about moving data; it’s about adding a layer of scientific rigor and trust at every stage, turning a chaotic pile of raw data into a set of reliable, valuable insights.
Here’s a sample of what a single record in the raw data looks like:
2016-10-04 26482879 Paternoster L 2015-10-19 Nat Genet www.ncbi.nlm.nih.gov/pubmed/26482879 Multi-ancestry genome-wide association study of 21,000 cases and 95,000 controls identifies new risk loci for atopic dermatitis. Atopic dermatitis 18,900 European ancestry cases, 1,472 Japanese ancestry cases, 422 African American cases, 300 Latino cases, 305 cases, 84,166 European ancestry controls, 7,966 Japanese ancestry controls, 844 African American controls, 1,592 Latino controls, 896 controls 30,588 European ancestry cases, 459 African American cases, 1,012 Chinese ancestry cases, 226,537 European ancestry controls, 729 African American controls, 1,362 Chinese ancestry controls 19p13.2 19 8679458 ADAMTS10, ACTL9 NFILZ ENSG00000268480 rs2918307-G rs2918307 0 2918307 3_prime_UTR_variant 0 0.16 5E-12 11.301029995663981 (EA, fixed effects) 1.12 [1.08–1.16] Illumina [15539996] (imputed) N atopic eczema http://www.ebi.ac.uk/efo/EFO_0000274 GCST003184 Genome-wide genotyping array
This single line of text represents a finding from a scientific study, but it’s almost impossible to read. Now, contrast that with the data from the Gold layer. After being “polished” in the Silver layer and aggregated in the Gold layer, the data is a clean, structured table that is ready for immediate analysis.
Here’s what the final, polished data looks like, showing the top 10 most-studied traits:
| Trait | Study Count |
|---|---|
| body height | 44,485 |
| body mass index | 21,919 |
| fatty acid amount | 18,779 |
| high density lipoprotein cholesterol measurement | 12,244 |
| triglyceride measurement | 10,532 |
| platelet count | 10,213 |
| blood protein amount | 9,757 |
| systolic blood pressure | 9,574 |
| protein measurement | 9,451 |
| erythrocyte volume | 8,875 |
…
You can find all the code for this project on my GitHub repository here: https://github.com/TJAdryan/medallion_project.git
I added the full README here for people who like to do things the pre-git way:
The pipeline processes data through three distinct layers:
Bronze Layer: The raw, ingested data directly from the source.
Silver Layer: The cleaned, structured, and validated data.
Gold Layer: The aggregated, business-ready data optimized for analysis.
The Data: GWAS Catalog

We are using data from the GWAS Catalog maintained by the European Bioinformatics Institute (EBI) and the National Human Genome Research Institute (NHGRI). This catalog is a public database of published genome-wide association studies.
Source: https://www.ebi.ac.uk/gwas/api/search/downloads/alternative
Project Setup
- Prerequisites: You’ll need a Python environment with uv to manage packages. Install uv: https://astral.sh/uv/install
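If you don’t have uv yet, its documented installer one-liner looks like this (check the page above for the current command for your platform):

curl -LsSf https://astral.sh/uv/install.sh | sh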
- Environment Setup: Create a dedicated virtual environment for the project.
uv venv
source .venv/bin/activate
- Install Dependencies: We need a few key libraries for data manipulation and storage (see the command below).
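The scripts in this project import requests, pandas, pyarrow, and matplotlib, so a minimal install with uv would look something like this (pin versions if you want reproducible builds):

uv pip install requests pandas pyarrow matplotlib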
Project Structure

The project is organized to reflect the layers of the Medallion Architecture.
.
├── .venv/ # Virtual environment
├── data/
│ ├── bronze/ # Raw, original data
│ ├── silver/ # Cleaned and enriched data
│ └── gold/ # Aggregated and finalized data
├── scripts/
│ ├── bronze_ingest.py # Ingestion script
│ ├── silver_process.py # Cleaning and structuring script
│ └── gold_transform.py # Aggregation script
├── README.mdx # Project documentation
└── .gitignore # Git ignore file
The Pipeline Steps

Step 1: Bronze Layer - Data Ingestion

This step downloads the raw GWAS Catalog data. It is the entry point for our data.
scripts/bronze_ingest.py
import requests
import os
url = "https://www.ebi.ac.uk/gwas/api/search/downloads/alternative"
filename = "gwas_catalog.tsv"
bronze_path = os.path.join("data", "bronze")
os.makedirs(bronze_path, exist_ok=True)
local_filepath = os.path.join(bronze_path, filename)
def download_file(url, local_filepath):
    # Skip the download if we already have a local copy (keeps ingestion idempotent)
    if os.path.exists(local_filepath):
        print(f"File '{local_filepath}' already exists. Skipping download.")
        return

    print(f"Downloading {url} to {local_filepath}...")
    try:
        response = requests.get(url, stream=True)
        response.raise_for_status()
        # Stream the response to disk in chunks so a large file doesn't exhaust memory
        with open(local_filepath, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print("Download successful!")
    except requests.exceptions.RequestException as e:
        print(f"Error during download: {e}")

if __name__ == "__main__":
    download_file(url, local_filepath)
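Assuming the project layout above, you can run the ingestion from the project root and sanity-check the result (the ls command is just a quick check, not part of the pipeline):

python scripts/bronze_ingest.py
ls -lh data/bronze/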
Step 2: Silver Layer - Data Processing & Cleaning

Here, we take the raw data, clean it, and structure it into a more usable format (Parquet).
scripts/silver_process.py
import pandas as pd
import os
import pyarrow as pa
import pyarrow.parquet as pq
input_dir = os.path.join("data", "bronze")
output_dir = os.path.join("data", "silver")
os.makedirs(output_dir, exist_ok=True)
input_file = os.path.join(input_dir, "gwas_catalog.tsv")
output_file = os.path.join(output_dir, "gwas_catalog_clean.parquet")
def process_gwas_data():
    print("Reading raw data from the Bronze layer...")
    try:
        gwas_df = pd.read_csv(input_file, sep='\t')
    except FileNotFoundError:
        print(f"Error: Raw data file not found at {input_file}. Please ensure the file is in the 'data/bronze' directory.")
        return

    print("Cleaning and structuring the data...")
    # Keep only the columns we need downstream
    cleaned_df = gwas_df[[
        'DATE ADDED TO CATALOG', 'JOURNAL', 'DISEASE/TRAIT', 'MAPPED_TRAIT',
        'SNPS', 'P-VALUE', 'OR or BETA'
    ]].copy()

    # Convert the string date to a proper datetime type
    cleaned_df['DATE ADDED TO CATALOG'] = pd.to_datetime(cleaned_df['DATE ADDED TO CATALOG'])

    # Rename columns to clean snake_case names
    cleaned_df.rename(columns={
        'DATE ADDED TO CATALOG': 'date_added', 'JOURNAL': 'journal',
        'DISEASE/TRAIT': 'disease', 'MAPPED_TRAIT': 'mapped_trait',
        'SNPS': 'snps', 'P-VALUE': 'p_value', 'OR or BETA': 'or_or_beta'
    }, inplace=True)

    # Write the cleaned table to Parquet via Arrow
    arrow_table = pa.Table.from_pandas(cleaned_df)
    print(f"Writing structured data to the Silver layer at {output_file}...")
    pq.write_table(arrow_table, output_file)
    print("Silver layer processing complete!")

if __name__ == "__main__":
    process_gwas_data()
Step 3: Gold Layer - Aggregation & Analysis

This is the final stage where we aggregate the clean data into a business-ready format and prepare it for analysis or visualization.
scripts/gold_transform.py
import pandas as pd
import os
import pyarrow as pa
import pyarrow.parquet as pq
silver_dir = os.path.join("data", "silver")
gold_dir = os.path.join("data", "gold")
os.makedirs(gold_dir, exist_ok=True)
input_file = os.path.join(silver_dir, "gwas_catalog_clean.parquet")
output_journal_file = os.path.join(gold_dir, "journal_study_counts.parquet")
output_disease_file = os.path.join(gold_dir, "disease_study_counts.parquet")
def aggregate_gwas_data():
    print("Reading data from the Silver layer...")
    try:
        clean_df = pd.read_parquet(input_file)
    except FileNotFoundError:
        print(f"Error: Silver layer file not found at {input_file}. Please run the silver_process.py script first.")
        return

    # Count studies per journal, most prolific first
    print("Aggregating study counts by journal...")
    journal_counts_df = clean_df.groupby('journal').size().reset_index(name='study_count')
    journal_counts_df.sort_values(by='study_count', ascending=False, inplace=True)
    journal_table = pa.Table.from_pandas(journal_counts_df)
    pq.write_table(journal_table, output_journal_file)
    print(f"Saved journal study counts to {output_journal_file}")

    # Count studies per mapped disease/trait, most studied first
    print("Aggregating study counts by disease/trait...")
    disease_counts_df = clean_df.groupby('mapped_trait').size().reset_index(name='study_count')
    disease_counts_df.sort_values(by='study_count', ascending=False, inplace=True)
    disease_table = pa.Table.from_pandas(disease_counts_df)
    pq.write_table(disease_table, output_disease_file)
    print(f"Saved disease study counts to {output_disease_file}")

    print("\nGold layer processing complete! Final data is ready for analysis.")

if __name__ == "__main__":
    aggregate_gwas_data()
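With dependencies installed, the whole pipeline runs as three commands from the project root; the order matters because each layer reads the previous layer’s output:

python scripts/bronze_ingest.py
python scripts/silver_process.py
python scripts/gold_transform.py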
Step 4: Final Analysis & Visualization

Finally, you can use the Gold layer data to create charts for your blog post.
scripts/analyze_gold_data.py
import pandas as pd
import os
import matplotlib.pyplot as plt
gold_dir = os.path.join("data", "gold")
input_file = os.path.join(gold_dir, "disease_study_counts.parquet")
def analyze_and_visualize():
    try:
        gold_df = pd.read_parquet(input_file)
    except FileNotFoundError:
        print(f"Error: Gold layer file not found at {input_file}. Please run the gold_transform.py script first.")
        return

    # The Gold table is already sorted, so the head is the top 10
    top_results = gold_df.head(10)
    print("\nTop 10 most studied diseases/traits:")
    print(top_results.to_string(index=False))

    # Simple bar chart of the top traits
    top_results.plot(kind='bar', x='mapped_trait', y='study_count', legend=False)
    plt.title('Top 10 Most Studied Diseases/Traits')
    plt.xlabel('Disease/Trait')
    plt.ylabel('Study Count')
    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    analyze_and_visualize()