Associate Principal Scientist, Statistical Programming @ Merck
PhD Candidate in Data Science | AI/LLM Researcher | Clinical Trials Automation
9+ years in clinical trial statistical programming | CV & Oncology therapeutic expertise | NDA/BLA submission leadership
End-to-end expertise in clinical trial programming, from traditional SAS to modern GenAI automation
Key Member of Merck Programming Initiative Teams
SAS · R · Python
LLMs · Automation Agents
Dash · Shiny · HTML/JS
Publications & Presentations
Core Programming Languages
Conference Presentations
GenAI Projects Delivered
Associate Principal Scientist, Statistical Programming at Merck
9+ years of experience in clinical trial statistical programming across Cardiovascular (CV) and Oncology therapeutic areas. Currently pursuing a PhD in Data Science focusing on AI/LLM applications for clinical trial automation.
Years Experience
Therapeutic Areas
Publications
NDA/BLA Submissions
Submission Lead
NDA/BLA regulatory submissions
Key Member
GenAI Code Conversion Team
Lead
Automation GenAI Subteam
Training Contributor
Multilingual Programming (R/Python)
Cardiovascular (CV)
Phase I-III trials, complex endpoints
Oncology
Multiple studies, all phases
Non-Oncology
Cross-therapeutic expertise
Multi-Language Programming
SAS, R, Python, CDISC Standards
GenAI & Automation
LLMs, Autonomous Agents, Code Generation
Interactive Visualization
Dash, Shiny, HTML/JS Applications
Infrastructure & Tools
Neo4j, Docker, CI/CD, Git
Zero Gravity Platform
Synthetic CDISC data contributor for AWS testing
GenAI Code Conversion
Core member of SAS ↔ R conversion initiative
Automation Infrastructure
Leading GenAI subteam for automation
Harrisburg University of Science and Technology
Key contributions and roles at Merck & Co., Inc.
NDA/BLA Submissions & Regulatory Excellence
CV (Cardiovascular) & Oncology Statistical Programming
Founding Team Member
Team Lead
R & Python Training Specialist
Full-Stack Visualization Developer
Dissertation: ClinAgent - Autonomous Clinical Programming Architecture
Merck Programming Infrastructure Initiative
12+ publications including preprints, conference presentations, and archived papers
January 2026
Proposes ClinAgent, an autonomous architecture for automating clinical trial statistical programming, motivated by the 12-24 FTE-months that Phase 3 studies typically require.
December 2025
Authors: Jaime Yan, Jason Zhang, Tingting Tian
Structured review of tables, listings, and figures (TLF) generation, validation frameworks, and AI/ML integration in clinical trial statistical programming.
October 2025
Authors: Jaime Yan, Ming Yang
March 2025
Focuses on using LLMs and knowledge graphs to generate realistic synthetic CDISC ADaM datasets.
March 2025
Presents methodology for fine-tuning small language models to generate R code for the Admiral package.
PhUSE-US 2025
With Tingting Tian
PhUSE-US 2025
With Changhong Shi
PhUSE-US 2024
With Chao Su, Changhong Shi
End-to-end solutions from SAS to GenAI automation
❌ PROBLEM
Manual SAS ↔ R code conversion was time-consuming, error-prone, and inconsistent across studies
✅ SOLUTION
Built LLM-powered automated translation pipeline with validation and CDISC compliance checks
🎯 IMPACT
90% translation accuracy, 10x faster than manual conversion, with CDISC compliance maintained
Merck's foundational system for automated SAS ↔ R code translation, enabling seamless cross-language statistical programming while maintaining CDISC standards compliance.
SAS Code (Input) → LLM Engine (Translation) → Validation (Quality Check) → R Code (Output)
Translation Accuracy
Faster than Manual
Compliant
Secure & Validated
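The translate-then-validate flow described above can be sketched roughly as follows. This is a minimal illustration, not Merck's implementation: `stub_translate` stands in for the LLM engine, `check_r_code` is a hypothetical set of cheap static checks, and the retry-with-feedback loop is an assumption about how a quality check might gate the output. Real validation would also compare the outputs of the SAS and R programs.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical check: SAS statements that should not survive in translated R code.
SAS_RESIDUE = re.compile(
    r"^\s*(proc\s+\w+|data\s+[\w.]+\s*;|run\s*;|%macro\b)",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass
class TranslationResult:
    r_code: str
    passed: bool
    issues: List[str] = field(default_factory=list)


def check_r_code(r_code: str) -> List[str]:
    """Run cheap static checks; an empty list means the code passed."""
    issues = []
    if not r_code.strip():
        issues.append("empty translation")
    match = SAS_RESIDUE.search(r_code)
    if match:
        issues.append(f"SAS statement left in output: {match.group(1)!r}")
    if r_code.count("(") != r_code.count(")"):
        issues.append("unbalanced parentheses")
    return issues


def translate_with_validation(
    sas_code: str,
    translate: Callable[[str, List[str]], str],
    max_attempts: int = 2,
) -> TranslationResult:
    """Translate, validate, and retry with the issues passed back as feedback."""
    issues: List[str] = []
    r_code = ""
    for _ in range(max_attempts):
        r_code = translate(sas_code, issues)
        issues = check_r_code(r_code)
        if not issues:
            return TranslationResult(r_code, True)
    return TranslationResult(r_code, False, issues)


def stub_translate(sas_code: str, feedback: List[str]) -> str:
    """Stand-in for an LLM call: fails on the first attempt, succeeds once given feedback."""
    if not feedback:
        return sas_code  # deliberately wrong: echoes the SAS input
    return "adsl <- dplyr::filter(adsl, SAFFL == 'Y')"


sas = "data adsl;\n  set adsl;\n  where SAFFL = 'Y';\nrun;"
result = translate_with_validation(sas, stub_translate)
print(result.passed, result.r_code)
```

Keeping the translator behind a plain callable means the validation step can be exercised without any model access, which is the main reason for the stub here.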
Autonomous Programming
Problem: Phase 3 studies required 12-24 FTE-months for TLF generation
Solution: Five-layer autonomous architecture with LLM orchestration
Impact: Reduced TLF generation from months to days, maintaining CDISC compliance
Pipeline Flow: Requirements → LLM Planning → Code Gen → Validation
Fine-Tuning Pipeline
End-to-end methodology for fine-tuning small language models to generate executable Admiral R code for statistical programming.
Pipeline Flow: Dataset → Fine-tune → Generate → Debug
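As a rough illustration of the Dataset step, instruction and code pairs can be serialized to JSONL, a common input format for fine-tuning. Everything here is an assumption for illustration: the `prompt`/`completion` field names, the helper functions, and the two example pairs, whose R snippets are illustrative Admiral-style calls rather than excerpts from the actual training set.

```python
import json
from typing import Dict, List, Tuple

# Illustrative (instruction, R code) pairs; not taken from the real dataset.
EXAMPLES = [
    (
        "Derive the analysis start date ASTDT from AESTDTC.",
        'adae <- derive_vars_dt(adae, new_vars_prefix = "AST", dtc = AESTDTC)',
    ),
    (
        "Merge the planned treatment TRT01P from ADSL onto ADAE.",
        "adae <- derive_vars_merged(adae, dataset_add = adsl, "
        "by_vars = exprs(STUDYID, USUBJID), new_vars = exprs(TRT01P))",
    ),
]


def to_training_record(instruction: str, r_code: str) -> Dict[str, str]:
    """Normalize one pair into a prompt/completion record (field names are assumed)."""
    return {"prompt": instruction.strip(), "completion": r_code.strip()}


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def split_train_eval(
    records: List[Dict[str, str]], eval_every: int = 5
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Hold out every eval_every-th record so generated code can be checked later."""
    train, held_out = [], []
    for i, rec in enumerate(records):
        (held_out if i % eval_every == eval_every - 1 else train).append(rec)
    return train, held_out


records = [to_training_record(q, code) for q, code in EXAMPLES]
train, held_out = split_train_eval(records, eval_every=2)
jsonl = to_jsonl(train)
print(jsonl)
```

The held-out split matters for the Generate and Debug steps: generated code can only be judged executable against prompts the model did not see during training.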
Neo4j + LLM
LLM-powered natural language interface to Neo4j graph database for clinical trial data exploration and complex relationship discovery.
Pipeline Flow: Query → LLM Parse → Cypher → Results
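The Query to Cypher step can be sketched as below. This is a sketch under stated assumptions: the graph schema in `SCHEMA_HINT`, the `stub_llm` function, and the read-only guard are all hypothetical, and the execution of the resulting Cypher against Neo4j is left out.

```python
import re
from typing import Callable

# Hypothetical graph schema, included in the prompt so the model knows the labels.
SCHEMA_HINT = "(:Subject {usubjid})-[:HAD_EVENT]->(:AdverseEvent {term, severity})"

# Clauses that modify the graph. The match is deliberately conservative: it also
# rejects queries that merely mention these words inside a string literal.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)


def build_prompt(question: str) -> str:
    """Combine the schema and the user's question into a single prompt."""
    return (
        f"Graph schema: {SCHEMA_HINT}\n"
        "Write one read-only Cypher query that answers the question.\n"
        f"Question: {question}\nCypher:"
    )


def is_read_only(cypher: str) -> bool:
    """Accept only queries that start with MATCH and contain no write clause."""
    starts_ok = cypher.lstrip().upper().startswith("MATCH")
    return starts_ok and WRITE_CLAUSES.search(cypher) is None


def question_to_cypher(question: str, llm: Callable[[str], str]) -> str:
    """Ask the model for Cypher and refuse anything that is not read-only."""
    cypher = llm(build_prompt(question)).strip()
    if not is_read_only(cypher):
        raise ValueError(f"refusing non-read-only query: {cypher!r}")
    return cypher


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same query."""
    return (
        "MATCH (s:Subject)-[:HAD_EVENT]->(e:AdverseEvent) "
        "WHERE e.severity = 'SEVERE' "
        "RETURN s.usubjid AS usubjid, e.term AS term"
    )


cypher = question_to_cypher("Which subjects had severe adverse events?", stub_llm)
print(cypher)
```

Guarding the generated query before it reaches the database is a design choice worth keeping even in a prototype, since model output is otherwise executed unchecked.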
Dash · Shiny · HTML/JS
Full-stack interactive dashboards for clinical data review using Python Dash, R Shiny, and HTML/JavaScript.
Pipeline Flow: Data → Process → Visualize → Interactive
Synthetic Data Contributor
Created high-quality synthetic CDISC data for testing Merck's Zero Gravity platform, supporting AWS team validation efforts.
Contribution Flow: Create → Validate → Provide → Test
Synthetic data contributor for platform testing
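The Create and Validate steps might look something like the sketch below, which builds a small ADSL-like subject-level dataset and checks it before handing it over. Note the swap: this uses plain seeded random sampling, not the LLM and knowledge-graph approach described in the publications. The variable subset, value ranges, and the `SYN-001` study identifier are assumptions chosen for illustration.

```python
import random
from typing import Dict, List

# A small, assumed subset of ADSL variables; a real ADSL carries many more.
REQUIRED_VARS = ("STUDYID", "USUBJID", "AGE", "SEX", "TRT01P", "SAFFL")


def make_adsl(n_subjects: int, study_id: str = "SYN-001", seed: int = 42) -> List[Dict]:
    """Create reproducible synthetic subject-level rows (no real patient data)."""
    rng = random.Random(seed)  # a fixed seed makes the test data repeatable
    rows = []
    for i in range(1, n_subjects + 1):
        rows.append(
            {
                "STUDYID": study_id,
                "USUBJID": f"{study_id}-{i:04d}",
                "AGE": rng.randint(18, 85),
                "SEX": rng.choice(["M", "F"]),
                "TRT01P": rng.choice(["Placebo", "Active"]),
                "SAFFL": "Y",
            }
        )
    return rows


def validate_adsl(rows: List[Dict]) -> List[str]:
    """Return a list of problems; an empty list means the data passed."""
    issues = []
    seen = set()
    for idx, row in enumerate(rows):
        missing = [v for v in REQUIRED_VARS if v not in row]
        if missing:
            issues.append(f"row {idx}: missing {missing}")
        usubjid = row.get("USUBJID")
        if usubjid in seen:
            issues.append(f"row {idx}: duplicate USUBJID {usubjid}")
        seen.add(usubjid)
        if row.get("SEX") not in {"M", "F"}:
            issues.append(f"row {idx}: unexpected SEX {row.get('SEX')!r}")
    return issues


adsl = make_adsl(10)
problems = validate_adsl(adsl)
print(len(adsl), "rows,", len(problems), "issues")
```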
Interested in collaboration or have a question? Feel free to reach out!
Want my full CV with detailed experience?
Download Resume (PDF)
[email protected]
linkedin.com/in/jaime-yan
github.com/yanmingyu92
View Publications
Open to opportunities in: