From Raw Data to Polished Insights: A Guide to the Medallion Architecture

Hearing about a new design pattern or concept can feel overwhelming at first. But once you look closely, it is often a simple idea you were probably already using.
That is how I feel about the Medallion Architecture. Most projects you work on involve cleaning data, usually in the early stages. By its very nature, data is not always collected in a uniform way, which makes having a formalized process for cleaning and structuring it very important. The Medallion Architecture is a way to take a more formal, structured approach to what you were already doing, encouraging you to document and iterate on the process.


The Bronze Layer: The Raw Collection

The Bronze layer is where all the raw materials arrive. For our project, we used the GWAS Catalog, a public database of published genetic research findings from the European Bioinformatics Institute and the National Human Genome Research Institute.

The raw data is remarkably versatile and can be used for countless abstractions because it contains the most granular level of detail possible. A single record has a comprehensive set of fields, including the specific genetic variant, the p-value of the association, the disease or trait it is linked to, the journal it was published in, and much more. This raw, detailed information is the foundation for all the insights that follow.

However, because it is sourced from so many different studies, the raw data is full of inconsistencies and needs a rigorous quality control check before any analysis. The Bronze layer ensures we have a complete, untouched copy of the raw data before any cleaning begins, which is the first step toward reproducibility.


The Silver Layer: The First Shine

The Silver layer is where the real work begins. We take the raw data from the Bronze layer and start the first round of polishing. This is where we apply a rigorous process to clean, validate, and structure the data.

For the GWAS Catalog data, this meant:

  • Renaming columns: We changed the difficult-to-read column names like DATE ADDED TO CATALOG to something clean and simple, like date_added.
  • Converting data types: We converted the date_added column from a string to a proper date format, making it usable for time-based analysis.
  • Filtering out noise: We can filter out any incomplete or low-quality records that could skew our results (a minimal sketch of this step follows below).
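
The full cleaning script appears in the README below; here is a minimal, hypothetical sketch of just the filtering step, which the script itself leaves as an exercise. It assumes we drop rows missing the fields the analysis relies on (the column names come from the GWAS Catalog export):

import pandas as pd

# Hypothetical filtering step: drop records missing the fields we rely on.
raw_df = pd.read_csv("data/bronze/gwas_catalog.tsv", sep="\t", low_memory=False)
filtered_df = raw_df.dropna(subset=["DISEASE/TRAIT", "P-VALUE", "SNPS"])
print(f"Kept {len(filtered_df)} of {len(raw_df)} records")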

By the end of this stage, the data is in a clean, structured format, ready for analysis. It’s no longer just a collection of text; it’s a table you can trust.


The Gold Layer: The Final Gleam

The Gold layer is the final stage, where the data is polished to perfection and made ready for scientific inquiry. The goal here is to answer specific questions by aggregating the cleaned data. This layer is optimized for analysis and reporting, ensuring that you can get your answers quickly and reliably.

The real value here is that we can now use this clean data to surface key insights that would have been impossible to see in the raw data. We chose two specific insights to extract from the data (a quick sketch follows the list):

  • Most Studied Traits: We counted the number of studies for each disease or trait. This gives us a high-level view of where research is most concentrated in the field of genomics.
  • Most Prolific Journals: We counted the number of studies published in each journal. This tells us which publications are the most influential in this research space.
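
Both insights are simple frequency counts. The gold_transform.py script in the README below computes them with a groupby; an equivalent quick sketch using value_counts, assuming the Silver-layer Parquet file has already been written:

import pandas as pd

# Frequency counts over the cleaned Silver-layer table.
clean_df = pd.read_parquet("data/silver/gwas_catalog_clean.parquet")
print(clean_df["mapped_trait"].value_counts().head(10))  # most studied traits
print(clean_df["journal"].value_counts().head(10))       # most prolific journals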

The Medallion Architecture isn’t just about moving data; it’s about adding a layer of scientific rigor and trust at every stage, turning a chaotic pile of raw data into a set of reliable, valuable insights.

Here’s a sample of what a single record in the raw data looks like:

2016-10-04	26482879	Paternoster L	2015-10-19	Nat Genet	www.ncbi.nlm.nih.gov/pubmed/26482879	Multi-ancestry genome-wide association study of 21,000 cases and 95,000 controls identifies new risk loci for atopic dermatitis.	Atopic dermatitis	18,900 European ancestry cases, 1,472 Japanese ancestry cases, 422 African American cases, 300 Latino cases, 305 cases, 84,166 European ancestry controls, 7,966 Japanese ancestry controls, 844 African American controls, 1,592 Latino controls, 896 controls	 30,588 European ancestry cases, 459 African American cases, 1,012 Chinese ancestry cases, 226,537 European ancestry controls, 729 African American controls, 1,362 Chinese ancestry controls		19p13.2 19		8679458 ADAMTS10, ACTL9 NFILZ					ENSG00000268480					rs2918307-G	rs2918307	0	2918307 3_prime_UTR_variant		0	0.16	5E-12	11.301029995663981	 (EA, fixed effects)	1.12	[1.08–1.16]	Illumina [15539996] (imputed)	N	atopic eczema	http://www.ebi.ac.uk/efo/EFO_0000274	 GCST003184 Genome-wide genotyping array

This single line of text represents a finding from a scientific study, but it’s almost impossible to read. Now, contrast that with the data from the Gold layer. After being “polished” in the Silver layer and aggregated in the Gold layer, the data is a clean, structured table that is ready for immediate analysis.

Here’s what the final, polished data looks like, showing the top 10 most-studied traits:

Trait                                               Study Count
body height                                              44,485
body mass index                                          21,919
fatty acid amount                                        18,779
high density lipoprotein cholesterol measurement         12,244
triglyceride measurement                                 10,532
platelet count                                           10,213
blood protein amount                                      9,757
systolic blood pressure                                   9,574
protein measurement                                       9,451
erythrocyte volume                                        8,875


You can find all the code for this project on my GitHub repository here: https://github.com/TJAdryan/medallion_project.git

I’ve added the full README here for people who like to do things the pre-git way:

The pipeline processes data through three distinct layers:

Bronze Layer: The raw, ingested data directly from the source.

Silver Layer: The cleaned, structured, and validated data.

Gold Layer: The aggregated, business-ready data optimized for analysis.

The Data: GWAS Catalog

We are using data from the GWAS Catalog, maintained by the European Bioinformatics Institute (EBI) and the National Human Genome Research Institute (NHGRI). This catalog is a public database of published genome-wide association studies.

Source: https://www.ebi.ac.uk/gwas/api/search/downloads/alternative

Project Setup

  1. Prerequisites: You’ll need a Python environment with uv to manage packages.

Install uv: https://astral.sh/uv/install

  2. Environment Setup: Create a dedicated virtual environment for the project.

uv venv
source .venv/bin/activate

  3. Install Dependencies: We need a few key libraries for data manipulation and storage (an example install command follows).
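
A sketch of the install step, assuming uv and the libraries the scripts below import (requests, pandas, pyarrow, and matplotlib):

uv pip install requests pandas pyarrow matplotlib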

Project Structure

The project is organized to reflect the layers of the Medallion Architecture.

.
├── .venv/                         # Virtual environment
├── data/
│   ├── bronze/                    # Raw, original data
│   ├── silver/                    # Cleaned and enriched data
│   └── gold/                      # Aggregated and finalized data
├── scripts/
│   ├── bronze_ingest.py           # Ingestion script
│   ├── silver_process.py          # Cleaning and structuring script
│   └── gold_transform.py          # Aggregation script
├── README.mdx                     # Project documentation
└── .gitignore                     # Git ignore file

The Pipeline Steps

Step 1: Bronze Layer - Data Ingestion

This step downloads the raw GWAS Catalog data. It is the entry point for our data.

scripts/bronze_ingest.py

import requests
import os

url = "https://www.ebi.ac.uk/gwas/api/search/downloads/alternative"
filename = "gwas_catalog.tsv"
bronze_path = os.path.join("data", "bronze")
os.makedirs(bronze_path, exist_ok=True)
local_filepath = os.path.join(bronze_path, filename)

def download_file(url, local_filepath):
    if os.path.exists(local_filepath):
        print(f"File '{local_filepath}' already exists. Skipping download.")
        return
    print(f"Downloading {url} to {local_filepath}...")
    try:
        # Stream the response in chunks so large files are not held in memory all at once.
        response = requests.get(url, stream=True)
        response.raise_for_status()
        with open(local_filepath, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print("Download successful!")
    except requests.exceptions.RequestException as e:
        print(f"Error during download: {e}")

if __name__ == "__main__":
    download_file(url, local_filepath)

Step 2: Silver Layer - Data Processing & Cleaning

Here, we take the raw data, clean it, and structure it into a more usable format (Parquet).

scripts/silver_process.py

import pandas as pd
import os
import pyarrow as pa
import pyarrow.parquet as pq

input_dir = os.path.join("data", "bronze")
output_dir = os.path.join("data", "silver")
os.makedirs(output_dir, exist_ok=True)
input_file = os.path.join(input_dir, "gwas_catalog.tsv")
output_file = os.path.join(output_dir, "gwas_catalog_clean.parquet")

def process_gwas_data():
    print("Reading raw data from the Bronze layer...")
    try:
        gwas_df = pd.read_csv(input_file, sep='\t')
    except FileNotFoundError:
        print(f"Error: Raw data file not found at {input_file}. Please ensure the file is in the 'data/bronze' directory.")
        return
    print("Cleaning and structuring the data...")
    cleaned_df = gwas_df[[
        'DATE ADDED TO CATALOG', 'JOURNAL', 'DISEASE/TRAIT', 'MAPPED_TRAIT',
        'SNPS', 'P-VALUE', 'OR or BETA'
    ]].copy()
    cleaned_df['DATE ADDED TO CATALOG'] = pd.to_datetime(cleaned_df['DATE ADDED TO CATALOG'])
    cleaned_df.rename(columns={
        'DATE ADDED TO CATALOG': 'date_added', 'JOURNAL': 'journal',
        'DISEASE/TRAIT': 'disease', 'MAPPED_TRAIT': 'mapped_trait',
        'SNPS': 'snps', 'P-VALUE': 'p_value', 'OR or BETA': 'or_or_beta'
    }, inplace=True)
    arrow_table = pa.Table.from_pandas(cleaned_df)
    print(f"Writing structured data to the Silver layer at {output_file}...")
    pq.write_table(arrow_table, output_file)
    print("Silver layer processing complete!")

if __name__ == "__main__":
    process_gwas_data()

Step 3: Gold Layer - Aggregation & Analysis

This is the final stage, where we aggregate the clean data into a business-ready format and prepare it for analysis or visualization.

scripts/gold_transform.py

import pandas as pd
import os
import pyarrow as pa
import pyarrow.parquet as pq

silver_dir = os.path.join("data", "silver")
gold_dir = os.path.join("data", "gold")
os.makedirs(gold_dir, exist_ok=True)
input_file = os.path.join(silver_dir, "gwas_catalog_clean.parquet")
output_journal_file = os.path.join(gold_dir, "journal_study_counts.parquet")
output_disease_file = os.path.join(gold_dir, "disease_study_counts.parquet")

def aggregate_gwas_data():
    print("Reading data from the Silver layer...")
    try:
        clean_df = pd.read_parquet(input_file)
    except FileNotFoundError:
        print(f"Error: Silver layer file not found at {input_file}. Please run the silver_process.py script first.")
        return
    print("Aggregating study counts by journal...")
    journal_counts_df = clean_df.groupby('journal').size().reset_index(name='study_count')
    journal_counts_df.sort_values(by='study_count', ascending=False, inplace=True)
    journal_table = pa.Table.from_pandas(journal_counts_df)
    pq.write_table(journal_table, output_journal_file)
    print(f"Saved journal study counts to {output_journal_file}")
    print("Aggregating study counts by disease/trait...")
    disease_counts_df = clean_df.groupby('mapped_trait').size().reset_index(name='study_count')
    disease_counts_df.sort_values(by='study_count', ascending=False, inplace=True)
    disease_table = pa.Table.from_pandas(disease_counts_df)
    pq.write_table(disease_table, output_disease_file)
    print(f"Saved disease study counts to {output_disease_file}")
    print("\nGold layer processing complete! Final data is ready for analysis.")

if __name__ == "__main__":
    aggregate_gwas_data()

Step 4: Final Analysis & Visualization

Finally, you can use the Gold layer data to create charts for your blog post.

scripts/analyze_gold_data.py

import pandas as pd
import os
import matplotlib.pyplot as plt

gold_dir = os.path.join("data", "gold")
input_file = os.path.join(gold_dir, "disease_study_counts.parquet")

def analyze_and_visualize():
    try:
        gold_df = pd.read_parquet(input_file)
        top_results = gold_df.head(10)
        print("\nTop 10 most studied diseases/traits:")
        print(top_results.to_string(index=False))

        top_results.plot(kind='bar', x='mapped_trait', y='study_count', legend=False)
        plt.title('Top 10 Most Studied Diseases/Traits')
        plt.xlabel('Disease/Trait')
        plt.ylabel('Study Count')
        plt.tight_layout()
        plt.show()

    except FileNotFoundError:
        print(f"Error: Gold layer file not found at {input_file}. Please run the gold_transform.py script first.")
        return

if __name__ == "__main__":
    analyze_and_visualize()
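
With everything in place, the pipeline runs in layer order. A sketch, assuming the virtual environment is active and you are in the project root:

python scripts/bronze_ingest.py
python scripts/silver_process.py
python scripts/gold_transform.py
python scripts/analyze_gold_data.py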