Jaime Yan

Associate Principal Scientist, Statistical Programming @ Merck

PhD Candidate in Data Science | AI/LLM Researcher | Clinical Trials Automation

9+ years in clinical trial statistical programming | CV & Oncology therapeutic expertise | NDA/BLA submission leadership

Technical Competency Pipeline

End-to-end expertise in clinical trial programming, from traditional SAS to modern GenAI automation

Key Member of Merck Programming Initiative Teams

Multi-Language Programming

SAS · R · Python

CDISC SDTM & ADaM Standards
TLF Generation & Validation
Cross-Language Integration
Key contributor to non-oncology multilingual programming team

GenAI & Automation

LLMs · Automation Agents

GenAI Code Conversion (SAS ↔ R)
Autonomous Programming Agents
LLM-Powered Code Review
Lead of Automation GenAI Subteam
Key member of Merck GenAI Code Conversion initiative

Interactive Visualization

Dash · Shiny · HTML/JS

Python Dash Applications
R Shiny Interactive Dashboards
HTML/JavaScript Data Displays
Build interactive tools for clinical data review and analysis

  • 12+ Publications & Presentations
  • 3 Core Programming Languages
  • 5+ Conference Presentations
  • 10+ GenAI Projects Delivered

Statistical Programming

  • CDISC Standards (SDTM, ADaM)
  • TLF Generation & Validation
  • Macro Development (SAS/R)
  • Clinical Trial Reporting

AI/LLM Expertise

  • GenAI Code Conversion
  • Autonomous Agents (ClinAgent)
  • LLM Fine-tuning (R, SAS)
  • Knowledge Graph Integration

Data Visualization

  • Python Dash Applications
  • R Shiny Dashboards
  • Interactive HTML/JS Tools
  • Clinical Data Review Apps

Infrastructure & Tools

  • Git/GitHub Version Control
  • Neo4j Knowledge Graphs
  • Docker & Containerization
  • CI/CD Pipelines
  • py4csr (PyPI package)

Team Leadership

  • Automation GenAI Subteam Lead
  • Multilingual Training (R/Python)
  • Cross-functional Collaboration
  • Technical Mentoring

Research & Innovation

  • PhD Research (AI/LLM)
  • 12+ Publications
  • Conference Presentations
  • Methodology Development

About Me

Associate Principal Scientist, Statistical Programming at Merck

9+ years of experience in clinical trial statistical programming across Cardiovascular (CV) and Oncology therapeutic areas. Currently pursuing a PhD in Data Science, focused on AI/LLM applications for clinical trial automation.

  • 9+ Years Experience
  • 3 Therapeutic Areas
  • 12+ Publications
  • Lead: NDA/BLA Submissions

Key Leadership

  • Submission Lead

    NDA/BLA regulatory submissions

  • Key Member

    GenAI Code Conversion Team

  • Lead

    Automation GenAI Subteam

  • Training Contributor

    Multilingual Programming (R/Python)

Therapeutic Areas

  • Cardiovascular (CV)

    Phase I-III trials, complex endpoints

  • Oncology

    Multiple studies, all phases

  • Non-Oncology

    Cross-therapeutic expertise

Research Focus

  • AI/LLM applications in clinical trials
  • SAS ↔ R code conversion & automation
  • CDISC ADaM data standards
  • Knowledge graphs for clinical data

Core Competencies

Multi-Language Programming

SAS, R, Python, CDISC Standards

GenAI & Automation

LLMs, Autonomous Agents, Code Generation

Interactive Visualization

Dash, Shiny, HTML/JS Applications

Infrastructure & Tools

Neo4j, Docker, CI/CD, Git

Platform Contributions

  1. Zero Gravity Platform: Synthetic CDISC data contributor for AWS testing
  2. GenAI Code Conversion: Core member of SAS ↔ R conversion initiative
  3. Automation Infrastructure: Leading GenAI subteam for automation

PhD Candidate - Data Science

Harrisburg University of Science and Technology

AI/LLM applications in clinical trial statistical programming
12+ publications (preprints, conferences, archived papers)
Presenting at PhUSE, PharmaSUG & industry conferences

Leadership & Expertise

Key contributions and roles at Merck & Co., Inc.

Regulatory Submission Leadership

Submission Lead

NDA/BLA Submissions & Regulatory Excellence

  • Led statistical programming for NDA/BLA submissions across multiple therapeutic areas
  • Delivered agency-ready submissions with 100% CDISC compliance that met quality standards
  • Supported regulatory agency interactions with timely data-driven responses

Therapeutic Area Expertise

Domain Expert

CV (Cardiovascular) & Oncology Statistical Programming

  • Cardiovascular (CV) trials: Extensive experience in Phase I-III CV studies with complex endpoints
  • Oncology trials: Led statistical programming for multiple oncology studies across different phases
  • Cross-functional collaboration with clinical teams on study design and analysis strategies

GenAI Code Conversion Initiative

Key Member

Founding Team Member

  • Core contributor to Merck's GenAI-powered code conversion platform (SAS ↔ R)
  • Developed and validated LLM-based translation pipelines for statistical programming code
  • Enabled cross-language interoperability while maintaining CDISC standards compliance

Automation GenAI Subteam

Lead

Team Lead

  • Leading the automation subteam focused on GenAI applications for clinical programming
  • Architected autonomous agent systems (ClinAgent) for TLF generation automation
  • Reduced TLF generation time from months to days through AI-driven automation

Non-Oncology Multilingual Programming Team

Training Contributor

R & Python Training Specialist

  • Trained and mentored team members in R and Python for statistical programming
  • Developed training materials and workshops for open-source tools adoption
  • Facilitated SAS to R/Python migration for non-oncology studies

Interactive Data Visualization

Subject Matter Expert

Full-Stack Visualization Developer

  • Build Python Dash applications for clinical data review and analysis
  • Develop R Shiny interactive dashboards for statistical reporting
  • Create HTML/JavaScript tools for real-time data visualization

PhD Candidate - Data Science

Harrisburg University

Dissertation: ClinAgent - Autonomous Clinical Programming Architecture

  • Researching AI/LLM applications in clinical trial statistical programming
  • 12+ publications including preprints, conference presentations, and archived papers
  • Presenting at PhUSE, PharmaSUG and other industry conferences

Zero Gravity Platform

Synthetic Data Contributor

Merck Programming Infrastructure Initiative

  • Created high-quality synthetic CDISC data for testing Merck's Zero Gravity platform
  • Collaborated with AWS team by providing realistic test datasets for platform validation
  • Supported platform testing and validation efforts before production deployment

Publications

12+ publications including preprints, conference presentations, and archived papers

medRxiv Preprints

January 2026

ClinAgent: A Five-Layer Architecture for Autonomous Clinical Trial Statistical Programming

Proposes ClinAgent, a five-layer autonomous architecture for clinical trial statistical programming, motivated by the 12-24 FTE-months that Phase 3 studies typically require.

December 2025

Automation in Clinical Trial Statistical Programming: A Structured Review

Authors: Jaime Yan, Jason Zhang, Tingting Tian

Structured review of TLF generation, validation frameworks, and AI/ML integration in clinical trial statistical programming.

TechRxiv Publications

October 2025

Schema-Preserving Generation of Clinical TLF Templates via Iterative LLM-Guided Debugging

Authors: Jaime Yan, Ming Yang

March 2025

Novel Pipeline for Synthetic CDISC ADaM Datasets Using LLMs and Knowledge Graphs

Focuses on using LLMs and knowledge graphs to generate realistic synthetic CDISC ADaM datasets.

March 2025

End-to-End Fine-Tuning Small LLMs for Generating Admiral R Code

Presents methodology for fine-tuning small language models to generate R code for the Admiral package.

March 2025

Automating SAS and R Code Interpretation and Debugging: A Practical Pipeline

Conference Papers (PhUSE & PharmaSUG)

  • PhUSE-US 2025: End-to-End Fine-Tuning Small LLMs for Generating Admiral R Code (with Tingting Tian)
  • PhUSE-US 2025: Automating SAS and R Code Interpretation and Debugging (with Tingting Tian)
  • PhUSE-US 2025: Enhancing Clinical Trial Data Queries with LLMs and Neo4j (with Changhong Shi)
  • PhUSE-US 2025: Novel Pipeline for Synthetic CDISC ADaM Datasets (with Chao Su)
  • PharmaSUG 2025: GenAI Assisted Code Conversion: SAS to R Standard
  • PharmaSUG 2025: Comparing SQL and Graph Database Query Methods
  • PhUSE-US 2024: AI-Enhanced Chatbot for Clinical Trials Analysis (with Chao Su, Changhong Shi)
  • PharmaSUG 2024: LLM-Enhanced Training Agent for Statistical Programming

Pipeline Projects

End-to-end solutions from SAS to GenAI automation

⭐ KEY INITIATIVE

GenAI Code Conversion Platform

❌ PROBLEM

Manual SAS ↔ R code conversion was time-consuming, error-prone, and inconsistent across studies

✅ SOLUTION

Built LLM-powered automated translation pipeline with validation and CDISC compliance checks

🎯 IMPACT

90% translation accuracy and 10x faster than manual conversion, with CDISC compliance maintained

Merck's foundational system for automated SAS ↔ R code translation, enabling seamless cross-language statistical programming while maintaining CDISC standards compliance.

SAS Code (Input) → LLM Engine (Translation) → Validation (Quality Check) → R Code (Output)

  • 90% Translation Accuracy
  • 10x Faster than Manual
  • CDISC Compliant
  • Secure & Validated

SAS · R · LLM · CDISC · Validation
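
The translate-then-validate flow of this platform can be illustrated with a minimal sketch. It is not the Merck implementation: `convert`, `ConversionResult`, `stub_translate`, and the two checks are hypothetical names chosen for illustration, and the stub stands in for the LLM call, which is not described in this profile.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A check is a (name, predicate) pair applied to the generated R code.
Check = Tuple[str, Callable[[str], bool]]


@dataclass
class ConversionResult:
    r_code: str
    failures: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.failures


def balanced_parens(code: str) -> bool:
    """Cheap structural check: every ')' closes an earlier '('."""
    depth = 0
    for ch in code:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def convert(sas_code: str,
            translate: Callable[[str], str],
            checks: List[Check]) -> ConversionResult:
    """Translate SAS to R, then record the name of every check that fails."""
    r_code = translate(sas_code)
    failures = [name for name, check in checks if not check(r_code)]
    return ConversionResult(r_code=r_code, failures=failures)


def stub_translate(sas_code: str) -> str:
    # Hypothetical stand-in for the LLM call; returns a fixed translation.
    return "adsl <- dplyr::arrange(adsl, USUBJID)"


CHECKS: List[Check] = [
    ("non_empty", lambda code: bool(code.strip())),
    ("balanced_parens", balanced_parens),
]

result = convert("proc sort data=adsl; by usubjid; run;", stub_translate, CHECKS)
print(result.passed, result.failures)
```

Passing the translator and the checks in as callables keeps the validation stage independent of any particular model. A production pipeline would presumably add stronger checks, such as comparing the datasets produced by the SAS and R programs.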

ClinAgent

Autonomous Programming

Problem: Phase 3 studies required 12-24 FTE-months for TLF generation

Solution: Five-layer autonomous architecture with LLM orchestration

Impact: Reduced TLF generation from months to days, maintaining CDISC compliance

Pipeline Flow

1. Requirements → 2. LLM Planning → 3. Code Gen → 4. Validation

LLMs · SAS · R/Admiral · CDISC

LLM Code Generation

Fine-Tuning Pipeline

End-to-end methodology for fine-tuning small language models to generate executable Admiral R code for statistical programming.

Pipeline Flow

1. Dataset → 2. Fine-tune → 3. Generate → 4. Debug

Fine-tuning · R · Admiral · LLM

Clinical Knowledge Graph

Neo4j + LLM

LLM-powered natural language interface to Neo4j graph database for clinical trial data exploration and complex relationship discovery.

Pipeline Flow

1. Query → 2. LLM Parse → 3. Cypher → 4. Results

Neo4j · LLM · Cypher · SQL
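
The Query → LLM Parse → Cypher → Results flow can be sketched as follows. This is an illustrative outline under stated assumptions, not the published system: the graph schema, `answer`, `stub_llm`, and `stub_run_query` are all hypothetical, and the read-only guard is one possible safety step added here for illustration.

```python
import re
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]

# Hypothetical graph schema, for illustration only.
SCHEMA = "(:Subject {usubjid})-[:HAS_AE]->(:AdverseEvent {term})"

# Cypher keywords that modify the graph; used to reject write queries.
WRITE_CLAUSE = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)


def build_prompt(question: str) -> str:
    """Give the model the schema and the user's question."""
    return (
        f"Graph schema: {SCHEMA}\n"
        f"Question: {question}\n"
        "Return one read-only Cypher query."
    )


def is_read_only(cypher: str) -> bool:
    """Conservative guard: reject any query containing a write keyword."""
    return WRITE_CLAUSE.search(cypher) is None


def answer(question: str,
           llm: Callable[[str], str],
           run_query: Callable[[str], Rows]) -> Rows:
    """Question -> LLM-generated Cypher -> guarded execution -> rows."""
    cypher = llm(build_prompt(question)).strip()
    if not is_read_only(cypher):
        raise ValueError(f"refusing to run non-read-only query: {cypher}")
    return run_query(cypher)


def stub_llm(prompt: str) -> str:
    # Hypothetical stand-in for the model call.
    return ("MATCH (s:Subject)-[:HAS_AE]->(a:AdverseEvent) "
            "RETURN s.usubjid AS usubjid, a.term AS term")


def stub_run_query(cypher: str) -> Rows:
    # Hypothetical stand-in for a database call; returns canned rows.
    return [{"usubjid": "SUBJ-001", "term": "Headache"}]


rows = answer("Which subjects had adverse events?", stub_llm, stub_run_query)
print(rows)
```

In a real deployment `run_query` would wrap a graph database driver call and `llm` a model client. Both are injected here so that the flow can be exercised without either service.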

Interactive Visualization

Dash · Shiny · HTML/JS

Full-stack interactive dashboards for clinical data review using Python Dash, R Shiny, and HTML/JavaScript.

Pipeline Flow

1. Data → 2. Process → 3. Visualize → 4. Interactive

Dash · Shiny · HTML/JS · Plotly

Zero Gravity Platform

Synthetic Data Contributor

Created high-quality synthetic CDISC data for testing Merck's Zero Gravity platform, supporting AWS team validation efforts.

Contribution Flow

1. Create → 2. Validate → 3. Provide → 4. Test

Zero Gravity · Synthetic Data · CDISC · AWS

Synthetic data contributor for platform testing

Get In Touch

Interested in collaboration or have a question? Feel free to reach out!

Want my full CV with detailed experience?

Download Resume (PDF)

Open to opportunities in:

Statistical Programming AI/LLM Research Clinical Trials Data Science Analytics Leadership