Author: Ravi Teja Voruganti (Senior Technical Lead)

Introduction

In clinical programming, generating Analysis Data Model (ADaM) datasets is one of the most time‑consuming and repetitive tasks for statistical programmers. Each variable often requires a carefully written algorithm, validated logic, and a corresponding R program. As studies scale, so does the programming burden.

At Excelra, we asked a simple question: Can we automate this process without compromising accuracy or traceability? This aligns with Excelra’s broader AI and Machine Learning solutions designed to accelerate data-driven innovation across life sciences workflows.

The answer led us to build an AI‑powered ADaM Program Automation Tool—a system that uses Hybrid Retrieval‑Augmented Generation (RAG), vector databases, and large language models (LLMs) to generate high‑quality R code directly from a specification sheet. By combining FAISS‑based semantic search, Azure OpenAI embeddings, and a feedback‑driven learning loop, we created a solution that reduces manual effort, improves consistency, and builds a reusable knowledge base over time.

This blog walks you through how we built it, why it works, and how it can transform ADaM programming workflows.

Why automate ADaM programming?

Figure 1: Challenges in manual ADaM programming workflow

ADaM dataset creation is essential for regulatory submissions, but the process is often:

  • Manual
  • Repetitive
  • Dependent on programmer experience
  • Prone to inconsistencies across studies

A typical workflow requires programmers to interpret specifications, derive algorithms, write R code, validate outputs, and maintain documentation. When multiple studies run in parallel, this becomes a bottleneck.

Automation helps us:

  • Reduce programming time
  • Improve consistency across studies
  • Minimize human error
  • Free programmers to focus on complex derivations and validation
  • Build reusable institutional knowledge

Such automation approaches complement Excelra’s Clinical Data Services, which focus on transforming raw clinical information into analysis-ready datasets for regulatory and research workflows. Our goal was not to replace programmers, but to augment them with a system that handles the repetitive parts reliably.

Building the vector database for ADaM knowledge

Figure 2: Challenges in manual ADaM programming workflow

The foundation of our automation tool is a vector database (VDB) built using FAISS, an open‑source library optimized for similarity search. Similar scalable data architectures are central to Excelra’s Scientific Data Management capabilities, enabling efficient handling of complex scientific and clinical datasets. We embedded 195 curated ADaM code library entries—each containing:

  • Dataset
  • Variable
  • Analysis algorithm
  • Revised algorithm
  • R code
  • Tagged source

To generate embeddings, we used the text‑embedding‑3‑large model from Azure OpenAI. Each row was converted into a semantic vector representing its meaning, not just its text.
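As a rough illustration of this step, the sketch below flattens one library entry into a single text string and turns it into a unit-length vector. The `LibraryEntry` class, `entry_to_text`, and `toy_embed` are hypothetical names, and the example row is invented. In particular, `toy_embed` is only a stand-in: the tool calls the Azure OpenAI embedding model, which this example replaces with a hashed bag-of-words so that it runs without credentials. The toy vectors capture word overlap, not meaning.

```python
import hashlib
import math
from dataclasses import dataclass

DIM = 64  # toy dimension; real embedding models return much larger vectors


@dataclass
class LibraryEntry:
    dataset: str
    variable: str
    analysis_algorithm: str
    revised_algorithm: str
    r_code: str
    tagged_source: str


def entry_to_text(entry: LibraryEntry) -> str:
    """Flatten one library row into the text that gets embedded."""
    return " | ".join([
        f"Dataset: {entry.dataset}",
        f"Variable: {entry.variable}",
        f"Analysis algorithm: {entry.analysis_algorithm}",
        f"Revised algorithm: {entry.revised_algorithm}",
        f"R code: {entry.r_code}",
        f"Source: {entry.tagged_source}",
    ])


def toy_embed(text: str) -> list:
    """Stand-in for the embedding API call: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# Invented example row, for illustration only.
entry = LibraryEntry(
    dataset="ADEG",
    variable="ADY",
    analysis_algorithm="Copy from ADEG.ADY",
    revised_algorithm="Keep ADY as provided on the source ADEG record.",
    r_code="adeg$ADY <- adeg$ADY",
    tagged_source="example-study",
)
vector = toy_embed(entry_to_text(entry))
```

Normalizing each vector to unit length means a plain inner product between two vectors equals their cosine similarity, which keeps the later search step simple.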

Why a vector database?

Traditional keyword search fails when:

  • Variable names differ slightly
  • Algorithms are phrased differently
  • Context matters more than exact text

Vector search solves this by retrieving the closest semantic match for any query. For example:

Query: “ADEG ADY Copy from ADEG.ADY”

The VDB returns the most relevant ADaM logic—even if the wording differs.
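To make the retrieval idea concrete, here is a minimal sketch of nearest-neighbour search by cosine similarity in plain Python. The three-dimensional vectors and entry labels are invented for illustration. In the tool the vectors come from the embedding model and the search is delegated to FAISS; an inner-product index over L2-normalized vectors would give the same ranking, although the exact index type used is an assumption here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def top_k(query, index, k=1):
    """Return the k (label, score) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = [
        (label, sum(a * b for a, b in zip(q, normalize(vec))))
        for label, vec in index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical embeddings of three library entries.
index = {
    "ADEG.ADY: analysis relative day": [0.9, 0.1, 0.0],
    "ADSL.AGE: age": [0.1, 0.9, 0.1],
    "ADAE.AESEV: severity": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the query "ADEG ADY Copy from ADEG.ADY".
query = [0.8, 0.2, 0.1]
best_label, best_score = top_k(query, index, k=1)[0]
```

With these made-up numbers the ADEG.ADY entry ranks first because its vector points in nearly the same direction as the query, regardless of how either text was worded.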

These intelligent automation strategies are closely aligned with Excelra’s Data Science Services, where advanced analytics and AI models are applied to solve complex biomedical challenges. Semantic retrieval from the VDB forms the backbone of our Hybrid RAG approach, where we combine:

  • Retrieval (FAISS + embeddings)
  • Generation (LLM‑based algorithm and code creation)

The two‑step AI workflow for R code generation

Figure 3: Two-step AI workflow for ADaM R code generation

Our automation pipeline runs in two major steps.

Step 1: Generate the revised algorithm using Hybrid RAG

For each row in the specification sheet (Dataset, Variable, Analysis Algorithm), we:

  1. Construct a semantic query
  2. Retrieve the closest match from the VDB
  3. Inject the retrieved context into a structured prompt
  4. Ask the LLM to generate a Revised Algorithm

This ensures the algorithm is:

  • Aligned with ADaM standards
  • Context‑aware
  • Consistent with historical logic
  • Adapted to the specific variable and dataset

This step reduces ambiguity and grounds the generated logic in prior validated code.
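The four steps above can be sketched as two small helpers: one that builds the semantic query from a specification row, and one that injects the retrieved entry into a prompt. The function names, dictionary keys, and prompt wording are all assumptions made for illustration, not the prompts used in the tool. The query format follows the "ADEG ADY Copy from ADEG.ADY" example given earlier, and the LLM call itself is left out.

```python
def build_query(spec_row: dict) -> str:
    """Join dataset, variable, and algorithm into one semantic query string."""
    return (
        f"{spec_row['dataset']} {spec_row['variable']} "
        f"{spec_row['analysis_algorithm']}"
    )


def build_revised_algorithm_prompt(spec_row: dict, retrieved: dict) -> str:
    """Combine one specification row with its closest library match."""
    return "\n".join([
        "You are assisting with ADaM dataset programming.",
        "",
        "Specification row:",
        f"  Dataset: {spec_row['dataset']}",
        f"  Variable: {spec_row['variable']}",
        f"  Analysis algorithm: {spec_row['analysis_algorithm']}",
        "",
        "Closest previously validated entry:",
        f"  Revised algorithm: {retrieved['revised_algorithm']}",
        f"  R code: {retrieved['r_code']}",
        "",
        "Write a revised algorithm for the specification row, "
        "reusing the validated logic where it applies.",
    ])


spec_row = {
    "dataset": "ADEG",
    "variable": "ADY",
    "analysis_algorithm": "Copy from ADEG.ADY",
}
# Placeholder for whatever the vector search returns.
retrieved = {
    "revised_algorithm": "(revised algorithm of the closest match)",
    "r_code": "(R code of the closest match)",
}

query = build_query(spec_row)
prompt = build_revised_algorithm_prompt(spec_row, retrieved)
```

Keeping the specification row and the retrieved entry in clearly labelled sections of the prompt makes it easier to audit later which prior logic influenced a given output.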

Step 2: Generate R code from the revised algorithm

Once the revised algorithm is ready, we feed it into a second LLM prompt designed to produce clean, modular, readable, study‑ready R code for the variable.

The generated code is saved as an .R file in the Posit Workbench environment and is ready for integration into the study pipeline. Integrating automated pipelines into enterprise environments reflects Excelra’s expertise in Cloud Enablement, supporting scalable and secure deployment of scientific applications.
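The save step might look something like the sketch below. The `derive_<dataset>_<variable>.R` naming convention, the helper names, and the placeholder R function are assumptions for illustration; the output directory here is a temporary folder rather than a real Workbench project path.

```python
import re
import tempfile
from pathlib import Path


def r_file_name(dataset: str, variable: str) -> str:
    """Derive a file name such as 'derive_adeg_ady.R' (hypothetical convention)."""
    slug = re.sub(r"[^a-z0-9]+", "_", f"{dataset}_{variable}".lower()).strip("_")
    return f"derive_{slug}.R"


def save_r_program(code: str, dataset: str, variable: str, out_dir: Path) -> Path:
    """Write generated R code to out_dir and return the file path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / r_file_name(dataset, variable)
    path.write_text(code, encoding="utf-8")
    return path


# Placeholder standing in for the LLM output.
generated_code = "derive_adeg_ady <- function(adeg) {\n  adeg\n}\n"

out_dir = Path(tempfile.mkdtemp())
saved = save_r_program(generated_code, "ADEG", "ADY", out_dir)
```

A predictable file name per dataset and variable lets a rerun overwrite the earlier draft instead of accumulating near-duplicate programs.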

Closing the loop — Learning from every run

One of the most powerful features of our system is the feedback loop.

After generating the R code:

  • We embed the new code
  • Append it to the vector database
  • Store metadata including function name and timestamp

This means the system learns continuously.
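A minimal sketch of that append step, using an in-memory stand-in for the vector database. The `KnowledgeBase` class, its field names, and the two-dimensional example vector are hypothetical; the source only states that the new embedding is appended and that function name and timestamp are stored as metadata.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class KnowledgeBase:
    vectors: list = field(default_factory=list)   # one embedding per entry
    metadata: list = field(default_factory=list)  # same order as vectors

    def append(self, vector, function_name, r_code):
        """Add one generated program so later runs can retrieve it."""
        self.vectors.append(vector)
        self.metadata.append({
            "function_name": function_name,
            "r_code": r_code,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return len(self.vectors) - 1  # position of the new entry


kb = KnowledgeBase()
row_id = kb.append(
    [0.6, 0.8],  # made-up embedding of the generated code
    "derive_adeg_ady",
    "derive_adeg_ady <- function(adeg) adeg",
)
```

Storing vectors and metadata in matching positions means the integer position returned by a similarity search can be used directly to look up the function name and timestamp.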

Over time:

  • The VDB becomes richer
  • The LLM relies less on generation and more on retrieval
  • Costs decrease
  • Accuracy increases
  • Study‑specific logic becomes part of the institutional memory

This transforms the tool from a one‑off automation script into a self‑improving knowledge engine.

Technical architecture overview

Figure 4: Technical architecture diagram of ADaM automation tool

Here’s a simplified view of the architecture:

1. Input layer

  • Excel specification sheet
  • Dataset, Variable, Analysis Algorithm

2. Embedding layer

  • Azure OpenAI embeddings
  • FAISS vector normalization
  • Semantic indexing

3. Retrieval layer

  • Vector search for closest match
  • Context extraction

4. Generation layer

  • LLM prompt for revised algorithm
  • LLM prompt for R code

5. Output layer

  • R program saved to Posit Workbench
  • Updated VDB with new embeddings
  • Metadata appended with function name and timestamp

6. UI layer

  • Streamlit interface for user interaction
  • Real‑time logs and outputs

This modular design ensures scalability, maintainability, and transparency.
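One way to picture the modularity is a driver that receives each layer as a plain function, so any layer can be swapped or tested in isolation. This is a sketch under that assumption, not the tool's actual code; `run_pipeline` and its parameter names are hypothetical, and the lambdas are stubs that stand in for the vector search, the two LLM prompts, and the save-and-append step.

```python
from typing import Callable, Iterable


def run_pipeline(
    spec_rows: Iterable[dict],
    retrieve: Callable[[dict], dict],
    revise: Callable[[dict, dict], str],
    generate_code: Callable[[str], str],
    store: Callable[[dict, str], None],
) -> list:
    """Run retrieval, both generation steps, and the feedback write per row."""
    outputs = []
    for row in spec_rows:
        context = retrieve(row)            # retrieval layer
        algorithm = revise(row, context)   # generation layer, step 1
        code = generate_code(algorithm)    # generation layer, step 2
        store(row, code)                   # output layer and feedback loop
        outputs.append(code)
    return outputs


# Stub layers so the sketch runs without any external service.
stored = []
rows = [{
    "dataset": "ADEG",
    "variable": "ADY",
    "analysis_algorithm": "Copy from ADEG.ADY",
}]
result = run_pipeline(
    rows,
    retrieve=lambda row: {"revised_algorithm": "(closest match)"},
    revise=lambda row, ctx: f"Revised: {row['analysis_algorithm']}",
    generate_code=lambda algo: f"# {algo}",
    store=lambda row, code: stored.append((row["variable"], code)),
)
```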

Benefits for statistical programming teams

Figure 5: Technical architecture diagram of ADaM automation tool

Our automation tool delivers value in six areas:

1. Significant time savings

Automating repetitive derivations frees programmers to focus on complex logic and validation.

2. Improved consistency

Algorithms and code follow standardized patterns across studies.

3. Reduced cost

The feedback loop reduces LLM usage over time, lowering operational costs.

4. Enhanced traceability

Every generated function is timestamped and stored with metadata.

5. Scalable knowledge base

The VDB grows with every run, capturing institutional expertise.

6. Seamless integration

Hosted on Posit Workbench with a Streamlit UI, the tool fits naturally into existing workflows.

Conclusion

Automating ADaM programming is no longer a futuristic idea—it’s a practical, scalable solution that brings immediate value to statistical programming teams. By combining Hybrid RAG, vector databases, FAISS search, and Azure OpenAI models, we created a system that generates accurate R code, learns continuously, and reduces manual effort dramatically.

This approach doesn’t replace programmers—it empowers them. It ensures consistency, accelerates delivery, and builds a reusable knowledge base that strengthens over time. Learn how Excelra applies similar AI-driven approaches through its Excelra for AI initiatives focused on accelerating discovery, automation, and digital transformation in life sciences.

As we continue refining this platform, we’re excited about its potential to transform clinical programming workflows and support faster, more reliable study execution.

If you’d like to explore how this solution can be adapted for your organization, we’d be happy to discuss it with you.