What it does
A full ETL pipeline that pulls clinical trial metadata from ClinicalTrials.gov, processes and stores it in PostgreSQL, and serves it through Grafana dashboards. Researchers can track drug efficacy, enrollment trends, sponsor performance, and trial outcomes across conditions in one place, instead of digging through the API themselves.
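To make the transform step concrete, here is a minimal sketch of flattening one raw trial record into a fixed-shape row before it is loaded into PostgreSQL. It is an illustration rather than the project's actual code: the nested field names (`protocolSection`, `statusModule`, and so on) assume a v2-style ClinicalTrials.gov response, and the helper names `dig` and `normalize_trial` and the output columns are hypothetical.

```python
from typing import Any


def dig(record: dict[str, Any], *path: str, default: Any = None) -> Any:
    """Walk nested dicts along `path`; return `default` if any key is missing."""
    node: Any = record
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def normalize_trial(raw: dict[str, Any]) -> dict[str, Any]:
    """Flatten one raw trial record (assumed shape) into a fixed set of columns."""
    # Missing or null phases become an empty list.
    phases = dig(raw, "protocolSection", "designModule", "phases") or []
    if isinstance(phases, str):  # tolerate a single phase given as a bare string
        phases = [phases]
    enrollment = dig(raw, "protocolSection", "designModule", "enrollmentInfo", "count")
    status = dig(raw, "protocolSection", "statusModule", "overallStatus") or "UNKNOWN"
    return {
        "nct_id": dig(raw, "protocolSection", "identificationModule", "nctId"),
        "status": status.upper(),
        "phases": sorted(phases),
        "enrollment": int(enrollment) if enrollment is not None else None,
        "sponsor": dig(
            raw, "protocolSection", "sponsorCollaboratorsModule", "leadSponsor", "name"
        ),
    }


# A made-up record that is missing most fields still yields every column.
sparse = {"protocolSection": {"identificationModule": {"nctId": "NCT00000000"}}}
row = normalize_trial(sparse)
```

Every row comes out with the same keys whether or not the source record had them, so downstream SQL and dashboards never have to guard against missing columns.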
Why I built it
Clinical trial data is public but hard to use in practice. The raw API returns dense, inconsistent JSON across thousands of trials, and no one has time to make sense of it manually. I wanted to build something that made the patterns visible: which drugs are completing trials, which conditions are attracting the most research, where trials are actually happening, and which sponsors have the strongest track records.
The interesting bit
The pipeline itself was a good problem: the ClinicalTrials.gov API is inconsistent across trials, so a lot of the work was in the transformer layer, figuring out how to normalize records that don't always follow the same shape. The more interesting part was the analytical layer: building the efficacy and sponsor reputation scores. There's no single "success" metric in clinical research, so I had to make deliberate choices about what completion rate, enrollment rate, and phase progression actually mean when combined. Those design decisions changed the output significantly, which was a good reminder that analytics is never just plumbing.
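As a rough illustration of how much those choices matter, here is a minimal sketch of a composite sponsor score built as a weighted average of the three rates. The weights, the field names, and the definition of each rate are assumptions made for the example, not the project's actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SponsorStats:
    """Hypothetical per-sponsor aggregates; the field names are illustrative."""
    completed: int            # trials that reached completion
    terminated: int           # trials that were terminated or withdrawn
    enrolled: int             # participants actually enrolled
    target_enrollment: int    # participants planned
    phase_advances: int       # trials followed by a later-phase trial
    phase_opportunities: int  # trials that could have advanced


def ratio(numerator: int, denominator: int) -> float:
    """Safe ratio clamped to [0, 1]; an empty denominator scores 0."""
    if denominator <= 0:
        return 0.0
    return max(0.0, min(numerator / denominator, 1.0))


def reputation_score(
    stats: SponsorStats,
    weights: tuple[float, float, float] = (0.5, 0.3, 0.2),  # assumed weights
) -> float:
    """Weighted average of completion, enrollment, and phase-progression rates."""
    w_completion, w_enrollment, w_phase = weights
    completion = ratio(stats.completed, stats.completed + stats.terminated)
    enrollment = ratio(stats.enrolled, stats.target_enrollment)
    progression = ratio(stats.phase_advances, stats.phase_opportunities)
    total = w_completion + w_enrollment + w_phase
    return (
        w_completion * completion
        + w_enrollment * enrollment
        + w_phase * progression
    ) / total


# Same made-up sponsor, two weightings: the score moves with the design choice.
stats = SponsorStats(8, 2, 900, 1000, 1, 4)
completion_heavy = reputation_score(stats)
progression_heavy = reputation_score(stats, weights=(0.2, 0.3, 0.5))
```

A sponsor that finishes most of its trials but rarely advances them looks strong under the first weighting and much weaker under the second, which is exactly the kind of sensitivity that makes the weighting a design decision rather than a detail.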
Stack
Python, PostgreSQL, Grafana, Docker, Pandas, NumPy, SQLAlchemy, Scikit-learn, SciPy, ClinicalTrials.gov REST API.