Author: Ravi Teja Voruganti (Senior Technical Lead)

Introduction

In clinical programming, generating Analysis Data Model (ADaM) datasets is one of the most time‑consuming and repetitive tasks for statistical programmers. Each variable often requires a carefully written algorithm, validated logic, and a corresponding R program. As studies scale, so does the programming burden.

At Excelra, we asked a simple question: Can we automate this process without compromising accuracy or traceability? This aligns with Excelra’s broader AI and Machine Learning solutions designed to accelerate data-driven innovation across life sciences workflows.

The answer led us to build an AI‑powered ADaM Program Automation Tool—a system that uses Hybrid Retrieval‑Augmented Generation (RAG), vector databases, and large language models (LLMs) to generate high‑quality R code directly from a specification sheet. By combining FAISS‑based semantic search, Azure OpenAI embeddings, and a feedback‑driven learning loop, we created a solution that reduces manual effort, improves consistency, and builds a reusable knowledge base over time.

This blog walks you through how we built it, why it works, and how it can transform ADaM programming workflows.

Why automate ADaM programming?

Figure 1: Challenges in manual ADaM programming workflow

ADaM dataset creation is essential for regulatory submissions, but the process is often:

Manual
Repetitive
Dependent on programmer experience
Prone to inconsistencies across studies

A typical workflow requires programmers to interpret specifications, derive algorithms, write R code, validate outputs, and maintain documentation. When multiple studies run in parallel, this becomes a bottleneck.

Automation helps us:

Reduce programming time
Improve consistency across studies
Minimize human error
Free programmers to focus on complex derivations and validation
Build reusable institutional knowledge

Such automation approaches complement Excelra’s Clinical Data Services, which focus on transforming raw clinical information into analysis-ready datasets for regulatory and research workflows. Our goal was not to replace programmers, but to augment them with a system that handles the repetitive parts reliably.

Building the vector database for ADaM knowledge


Figure 2: Challenges in manual ADaM programming workflow

The foundation of our automation tool is a vector database (VDB) built using FAISS, an open‑source library optimized for similarity search. Similar scalable data architectures are central to Excelra’s Scientific Data Management capabilities, enabling efficient handling of complex scientific and clinical datasets. We embedded 195 curated ADaM code library entries—each containing:

Dataset
Variable
Analysis algorithm
Revised algorithm
R code
Tagged source

To generate embeddings, we used the text‑embedding‑3‑large model from Azure OpenAI. Each row was converted into a semantic vector representing its meaning, not just its text.
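To make this step concrete, here is a minimal sketch of how one library entry could be flattened into a single labelled string before it is sent to an embedding model. It is an illustration under stated assumptions, not the tool's actual code: the `LibraryEntry` class, its field names, and the `entry_to_text` helper are hypothetical, and the revised algorithm, R code, and source values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LibraryEntry:
    """One row of the code library (hypothetical field names)."""
    dataset: str
    variable: str
    analysis_algorithm: str
    revised_algorithm: str
    r_code: str
    tagged_source: str


def entry_to_text(entry: LibraryEntry) -> str:
    """Flatten an entry into one labelled string suitable for embedding."""
    parts = [
        f"Dataset: {entry.dataset}",
        f"Variable: {entry.variable}",
        f"Analysis algorithm: {entry.analysis_algorithm}",
        f"Revised algorithm: {entry.revised_algorithm}",
        f"R code: {entry.r_code}",
        f"Tagged source: {entry.tagged_source}",
    ]
    return "\n".join(parts)


# Placeholder values for illustration only.
example = LibraryEntry(
    dataset="ADEG",
    variable="ADY",
    analysis_algorithm="Copy from ADEG.ADY",
    revised_algorithm="(placeholder revised algorithm)",
    r_code="# placeholder R code",
    tagged_source="(placeholder source tag)",
)
text_for_embedding = entry_to_text(example)
```

Labelling each field keeps the dataset and variable names next to the algorithm text, so all three contribute to the resulting vector.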

Why a vector database?

Traditional keyword search fails when:

Variable names differ slightly
Algorithms are phrased differently
Context matters more than exact text

Vector search solves this by retrieving the closest semantic match for any query. For example:

Query: “ADEG ADY Copy from ADEG.ADY”

The VDB returns the most relevant ADaM logic—even if the wording differs.
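The retrieval idea can be sketched in a few lines of standard-library Python. This is a toy under loud assumptions: `toy_embed` is a hypothetical stand-in that hashes character trigrams, so it only captures overlap in spelling, whereas a real embedding model captures meaning. The three corpus strings are invented for illustration. What the sketch does show faithfully is the mechanic of normalizing vectors and ranking by cosine similarity.

```python
import math
import zlib

DIM = 256  # toy dimensionality; real embedding models use far more


def toy_embed(text: str) -> list[float]:
    """Stand-in for an embedding model: hashed character trigrams, L2-normalized."""
    vec = [0.0] * DIM
    cleaned = " ".join(text.lower().split())
    for i in range(len(cleaned) - 2):
        bucket = zlib.crc32(cleaned[i:i + 3].encode("utf-8")) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def top_match(query: str, corpus: list[str]) -> tuple[int, float]:
    """Return (index, score) of the corpus text with the highest cosine similarity."""
    q = toy_embed(query)
    # Vectors are unit length, so the dot product equals cosine similarity.
    scores = [sum(a * b for a, b in zip(q, toy_embed(doc))) for doc in corpus]
    best = max(range(len(corpus)), key=scores.__getitem__)
    return best, scores[best]


# Invented library texts, for illustration only.
corpus = [
    "ADEG ADY: copy the analysis relative day from ADEG.ADY",
    "ADSL AGE: derive age in years at informed consent",
    "ADAE TRTEMFL: flag treatment-emergent adverse events",
]
best_index, best_score = top_match("ADEG ADY Copy from ADEG.ADY", corpus)
```

Because every vector is normalized to unit length, ranking by inner product and ranking by cosine similarity give the same order.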

These intelligent automation strategies are closely aligned with Excelra’s Data Science Services, where advanced analytics and AI models are applied to solve complex biomedical challenges. Semantic retrieval from the VDB forms the backbone of our Hybrid RAG approach, in which we combine:

Retrieval (FAISS + embeddings)
Generation (LLM‑based algorithm and code creation)

The two‑step AI workflow for R code generation

Figure 3: Two-step AI workflow for ADaM R code generation

Our automation pipeline runs in two major steps.

Step 1: generate the revised algorithm using hybrid RAG

For each row in the specification sheet (Dataset, Variable, Analysis Algorithm), we:

Construct a semantic query
Retrieve the closest match from the VDB
Inject the retrieved context into a structured prompt
Ask the LLM to generate a Revised Algorithm

This ensures the algorithm is:

Aligned with ADaM standards
Context‑aware
Consistent with historical logic
Adapted to the specific variable and dataset

This step reduces ambiguity and grounds the generated logic in previously validated code.
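The query and prompt construction in Step 1 can be sketched as follows. The query format mirrors the example query shown earlier (dataset, variable, then algorithm text). Everything else is an assumption made for illustration: the `SpecRow` class, the helper names, and the prompt wording are hypothetical, and the retrieved context is a placeholder string rather than real library content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecRow:
    """One row of the specification sheet (hypothetical field names)."""
    dataset: str
    variable: str
    analysis_algorithm: str


def build_query(row: SpecRow) -> str:
    """Combine the three specification fields into one retrieval query."""
    return f"{row.dataset} {row.variable} {row.analysis_algorithm}"


def build_algorithm_prompt(row: SpecRow, retrieved_context: str) -> str:
    """Assemble a structured prompt that grounds the LLM in retrieved context."""
    return "\n".join([
        "You are assisting with ADaM dataset programming.",
        "",
        "Closest entry from the code library:",
        retrieved_context,
        "",
        "Specification row:",
        f"Dataset: {row.dataset}",
        f"Variable: {row.variable}",
        f"Analysis algorithm: {row.analysis_algorithm}",
        "",
        "Write a revised algorithm for this variable, staying consistent",
        "with the library entry above where it applies.",
    ])


row = SpecRow("ADEG", "ADY", "Copy from ADEG.ADY")
query = build_query(row)
prompt = build_algorithm_prompt(row, "(placeholder: text of the retrieved entry)")
```

Keeping the retrieved entry and the specification row in separate, labelled sections makes it easier to inspect exactly what the model was given for a particular variable.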

Step 2: generate R code from the revised algorithm

Once the revised algorithm is ready, we feed it into a second LLM prompt designed to produce clean, modular, readable, study‑ready R code for the variable.

The generated code is saved as an .R file in the Posit Workbench environment and is ready for integration into the study pipeline. Integrating automated pipelines into enterprise environments reflects Excelra’s expertise in Cloud Enablement, supporting scalable and secure deployment of scientific applications.
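Saving the generated program could look like the sketch below. It rests on several assumptions that are not taken from the tool itself: the file naming pattern (`<dataset>_<variable>.R`), the helper names, and the fence-stripping step, which is included only because LLM responses are often wrapped in a Markdown code fence. The R content is a placeholder comment, and a temporary directory stands in for the real output location.

```python
import re
import tempfile
from pathlib import Path

FENCE = "`" * 3  # Markdown code fence marker
_FENCED = re.compile(
    r"^" + re.escape(FENCE) + r"[a-zA-Z]*\s*\n(.*?)\n" + re.escape(FENCE) + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(llm_output: str) -> str:
    """Return the code inside a single Markdown fence, or the text unchanged."""
    text = llm_output.strip()
    match = _FENCED.match(text)
    return match.group(1) if match else text


def save_r_program(llm_output: str, dataset: str, variable: str, out_dir: Path) -> Path:
    """Write the generated code to <dataset>_<variable>.R (hypothetical naming)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{dataset.lower()}_{variable.lower()}.R"
    path.write_text(strip_code_fence(llm_output) + "\n", encoding="utf-8")
    return path


# Placeholder response standing in for real LLM output.
fake_llm_output = f"{FENCE}r\n# placeholder derivation for ADEG.ADY\n{FENCE}"

with tempfile.TemporaryDirectory() as tmp:
    saved = save_r_program(fake_llm_output, "ADEG", "ADY", Path(tmp))
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")
```

A predictable file name per dataset and variable makes it straightforward to locate and review the program generated for a given specification row.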

Closing the loop: learning from every run

One of the most powerful features of our system is the feedback loop.

After generating the R code:

We embed the new code
Append it to the vector database
Store metadata including function name and timestamp

This means the system learns continuously: each run adds to what later retrievals can draw on.

Over time:

The VDB becomes richer
The LLM relies less on generation and more on retrieval
Costs decrease
Accuracy increases
Study‑specific logic becomes part of the institutional memory

This transforms the tool from a one‑off automation script into a self‑improving knowledge engine.
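The three feedback steps above can be sketched with a small in-memory store. This is an assumption-laden illustration, not the tool's implementation: `KnowledgeStore` is a hypothetical class, plain Python lists stand in for the vector index, and `placeholder_embed` is a dummy function used only so the example runs without an embedding service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class KnowledgeStore:
    """In-memory stand-in for a vector index plus its metadata (hypothetical)."""
    embed: Callable[[str], list[float]]
    vectors: list[list[float]] = field(default_factory=list)
    metadata: list[dict] = field(default_factory=list)

    def add_generated_code(self, function_name: str, r_code: str) -> int:
        """Embed new code, append it, and record function name and timestamp."""
        self.vectors.append(self.embed(r_code))
        self.metadata.append({
            "function_name": function_name,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "r_code": r_code,
        })
        return len(self.vectors) - 1  # position of the new entry


def placeholder_embed(text: str) -> list[float]:
    """Dummy embedding used only to keep the example self-contained."""
    return [float(len(text)), float(text.count("\n"))]


store = KnowledgeStore(embed=placeholder_embed)
position = store.add_generated_code("derive_adeg_ady", "# placeholder R code")
```

Appending the vector and its metadata in the same call keeps the two lists aligned, so the entry at a given position can always be traced back to a function name and a creation time.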

Technical architecture overview

Figure 4: Technical architecture diagram of ADaM automation tool

Here’s a simplified view of the architecture:

1. Input layer

Excel specification sheet
Dataset, Variable, Analysis Algorithm

2. Embedding layer

Azure OpenAI embeddings
FAISS vector normalization
Semantic indexing

3. Retrieval layer

Vector search for closest match
Context extraction

4. Generation layer

LLM prompt for revised algorithm
LLM prompt for R code

5. Output layer

R program saved to Posit Workbench
Updated VDB with new embeddings
Metadata appended with function name and timestamp

6. UI layer

Streamlit interface for user interaction
Real‑time logs and outputs

This modular design ensures scalability, maintainability, and transparency.
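One way to picture how the layers fit together is a single loop that passes each specification row through retrieval, the two generation steps, saving, and the feedback update. The sketch below is a hypothetical wiring, not the tool's code: `run_pipeline` and its parameters are invented names, each layer is injected as a plain function, and the lambdas at the bottom are stubs that return placeholder strings instead of calling a vector index or an LLM.

```python
from typing import Callable, Dict, Iterable, List

SpecRow = Dict[str, str]  # keys: "Dataset", "Variable", "Analysis Algorithm"


def run_pipeline(
    rows: Iterable[SpecRow],
    retrieve: Callable[[str], str],
    generate_algorithm: Callable[[SpecRow, str], str],
    generate_code: Callable[[str], str],
    save: Callable[[SpecRow, str], str],
    learn: Callable[[SpecRow, str], None],
) -> List[str]:
    """Run every specification row through the layers and return log lines."""
    logs = []
    for row in rows:
        query = f'{row["Dataset"]} {row["Variable"]} {row["Analysis Algorithm"]}'
        context = retrieve(query)                     # retrieval layer
        algorithm = generate_algorithm(row, context)  # generation, step 1
        code = generate_code(algorithm)               # generation, step 2
        location = save(row, code)                    # output layer
        learn(row, code)                              # feedback into the store
        logs.append(f'{row["Dataset"]}.{row["Variable"]} -> {location}')
    return logs


learned: List[str] = []  # stands in for the updated vector database

# Stub layers returning placeholder strings, for illustration only.
logs = run_pipeline(
    rows=[{"Dataset": "ADEG", "Variable": "ADY",
           "Analysis Algorithm": "Copy from ADEG.ADY"}],
    retrieve=lambda query: f"(closest library entry for: {query})",
    generate_algorithm=lambda row, context: f"Revised algorithm for {row['Variable']}",
    generate_code=lambda algorithm: f"# R code implementing: {algorithm}",
    save=lambda row, code: f"{row['Dataset'].lower()}_{row['Variable'].lower()}.R",
    learn=lambda row, code: learned.append(code),
)
```

Passing each layer in as a function means any one of them can be swapped or tested in isolation, which is one practical reading of the modularity described above.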

Benefits for statistical programming teams

Figure 5: Technical architecture diagram of ADaM automation tool

Our automation tool delivers value in six areas:

1. Significant time savings

Automating repetitive derivations frees programmers to focus on complex logic and validation.

2. Improved consistency

Algorithms and code follow standardized patterns across studies.

3. Reduced cost

The feedback loop reduces LLM usage over time, lowering operational costs.

4. Enhanced traceability

Every generated function is timestamped and stored with metadata.

5. Scalable knowledge base

The VDB grows with every run, capturing institutional expertise.

6. Seamless integration

Hosted on Posit Workbench with a Streamlit UI, the tool fits naturally into existing workflows.

Conclusion

Automating ADaM programming is no longer a futuristic idea—it’s a practical, scalable solution that brings immediate value to statistical programming teams. By combining Hybrid RAG, vector databases, FAISS search, and Azure OpenAI models, we created a system that generates accurate R code, learns continuously, and reduces manual effort dramatically.

This approach doesn’t replace programmers—it empowers them. It ensures consistency, accelerates delivery, and builds a reusable knowledge base that strengthens over time. Learn how Excelra applies similar AI-driven approaches through its Excelra for AI initiatives focused on accelerating discovery, automation, and digital transformation in life sciences.

As we continue refining this platform, we’re excited about its potential to transform clinical programming workflows and support faster, more reliable study execution.

If you’d like to explore how this solution can be adapted for your organization, we’d be happy to discuss it with you.

Accelerating ADaM Programming with AI: How We Automated R Code Generation Using Hybrid RAG and Vector Databases
