LLM-Powered Multimodal Oncology Data Analysis Platform for High-Impact Cancer Research

Authors: Meeta Sunil (Principal Technical Program Manager) & Debamitra Chakravorty (Technical Program Manager II)

Overview

Excelra partnered with a global biopharma organization to develop an LLM-powered multimodal oncology data analysis platform that transforms fragmented genomic, clinical, and real-world evidence into unified, actionable insights. By integrating diverse data sources such as ctDNA, genomics, methylation profiles, EHRs, and claims data, the platform enables scalable biomarker discovery, patient stratification, and real-time hypothesis generation. Leveraging advanced AI, deep learning, and explainable analytics, Excelra delivered a future-ready solution that accelerates precision oncology research, improves decision-making, and enhances the adoption of AI-driven workflows across translational and clinical programs.

2–3× faster biomarker discovery and 30–50% efficiency gains through a scalable, LLM-powered multimodal oncology platform.

Our client

A global biopharma organization advancing precision oncology programs across solid tumors and hematological malignancies. The client operates large-scale translational research and real-world evidence (RWE) initiatives, leveraging genomics, liquid biopsy (ctDNA), EHR, and claims data to accelerate biomarker discovery and therapeutic decision-making.

Client’s challenge

The client faced significant technical and operational barriers in extracting actionable insights from oncology data:

  • Highly fragmented, siloed multimodal datasets spanning genomics, ctDNA, methylation profiles, EHRs, and insurance claims
  • Lack of harmonization and interoperability across structured, semi-structured, and unstructured data modalities
  • Manual, SME-driven biomarker discovery workflows, resulting in long turnaround times and scalability constraints
  • Black-box ML models with limited interpretability, reducing trust and regulatory confidence
  • Absence of a unified analytics platform capable of real-time, cross-modal exploration and hypothesis testing

These challenges slowed biomarker validation, constrained AI adoption, and limited the scientific and clinical impact of the data.

If left unresolved, the client risked delayed biomarker validation, suboptimal patient stratification, and potential competitive disadvantage in precision oncology programs.

Client’s goals

Why GenAI / LLMs Were Necessary: Traditional ML pipelines struggled to interpret unstructured oncology text (clinical notes, molecular reports, publications) and synthesize multimodal evidence. LLMs were required to enable contextual understanding, hypothesis summarization, and cross-modal reasoning at scale.

The client sought to establish a future-ready foundation for GenAI-driven oncology research by:

  • Building a scalable, cloud-native multimodal integration platform (supported by cloud enablement services)
  • Accelerating AI-driven biomarker discovery
  • Applying LLM-powered and deep learning models with explainability
  • Enabling real-time analytics for researchers and clinical SMEs

What Made This Challenging: Oncology data is inherently sparse, longitudinal, and multimodal, with frequent missing modalities (e.g., incomplete ctDNA or EHR records). Aligning heterogeneous molecular and clinical signals at the patient level while preserving biological relevance required careful architecture design beyond standard AI fusion techniques.

Our Approach

Excelra designed and implemented a GenAI-powered multimodal analytics architecture combining advanced data engineering, deep learning, and explainable AI to deliver scalable, near–real-time oncology insights (aligned with scientific data management best practices).

  • Adopted a modality-aware ingestion and harmonization strategy using AWS-native services
  • Applied domain-specific encoders optimized for biological sequences, imaging-derived features, and clinical text
  • Leveraged contrastive learning–based fusion to unify heterogeneous data into joint latent representations
  • Embedded model interpretability and knowledge graphs to ensure transparency and scientific trust
  • Delivered scalable, near–real-time analytics through intuitive dashboards for rapid insight generation

Our solution

1. Data ingestion & harmonization

  • Ingested genomics, ctDNA, methylation, EHR, and claims data using AWS Glue–based ETL pipelines
  • Standardized data across formats, ontologies, and vocabularies (clinical, molecular, and longitudinal)
  • Enabled continuous ingestion to support near–real-time updates (aligned with healthcare data structuring)
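
The harmonization step above can be sketched in a few lines. This is a minimal illustration, not the AWS Glue pipeline itself: the record layout, the `CODE_MAP` table, and the concept labels are all hypothetical, chosen only to show how codes from different sources can be mapped to one shared vocabulary and grouped per patient.

```python
from collections import defaultdict

# Hypothetical lookup from (source system, source code) to a shared concept.
# A real pipeline would load this from a curated ontology mapping.
CODE_MAP = {
    ("ehr", "DX-LUNG-01"): "lung_cancer",
    ("claims", "CLM-162"): "lung_cancer",
    ("ehr", "DX-BRST-07"): "breast_cancer",
}


def harmonize(records):
    """Group records by patient and map source codes to shared concepts."""
    patients = defaultdict(lambda: {"concepts": set(), "unmapped": []})
    for rec in records:
        key = (rec["source"], rec["code"])
        entry = patients[rec["patient_id"]]
        concept = CODE_MAP.get(key)
        if concept is None:
            # Keep unknown codes for expert review instead of dropping them.
            entry["unmapped"].append(key)
        else:
            entry["concepts"].add(concept)
    return dict(patients)


records = [
    {"patient_id": "P1", "source": "ehr", "code": "DX-LUNG-01"},
    {"patient_id": "P1", "source": "claims", "code": "CLM-162"},
    {"patient_id": "P2", "source": "claims", "code": "CLM-999"},
]
unified = harmonize(records)
print(unified["P1"]["concepts"])
print(unified["P2"]["unmapped"])
```

Two source codes for the same patient collapse into one concept, while the unrecognized code is retained for SME review rather than silently discarded.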

2. Modality-specific encoding

Each data modality was processed using specialized encoders to generate high-quality embeddings optimized for cross-modal fusion and downstream analytics.

Foundation biomedical language models (e.g., BioBERT, BioGPT), pretrained on large-scale public biomedical corpora, were fine-tuned on client-specific oncology datasets to capture tumor-specific molecular and clinical context.

Encoder framework by modality:

  • Genomics & Clinical Text: BioBERT, BioGPT
  • Imaging / Structured Molecular Features: CNN-based architectures
  • Epigenomics: MethylNet

This modality-aware encoding strategy ensured biologically meaningful feature representation while preserving domain specificity across heterogeneous oncology data sources (similar to approaches used in bioinformatics solutions).

3. Fusion & embedding layer

Modality-specific embeddings were unified into joint latent vectors using contrastive learning techniques. This approach ensured meaningful biological alignment across heterogeneous modalities.

Alignment Objective: Contrastive learning was applied at the patient level, aligning molecular and clinical embeddings belonging to the same patient while separating unrelated patient representations. In outcome-driven experiments, outcome-aware contrastive objectives further aligned embeddings based on shared survival or response endpoints.
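
One common way to express a patient-level alignment objective of this kind is an InfoNCE-style loss. The sketch below is an assumption about the general technique, not the platform's actual loss: the function names, the temperature value, and the two-dimensional toy embeddings are all illustrative. Row i of each list is taken to be the same patient, so the matching clinical embedding is the positive and every other patient's embedding is a negative.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(molecular, clinical, temperature=0.1):
    """Mean InfoNCE loss; row i of both inputs belongs to the same patient."""
    mol = [normalize(v) for v in molecular]
    cli = [normalize(v) for v in clinical]
    losses = []
    for i, m in enumerate(mol):
        logits = [sum(a * b for a, b in zip(m, c)) / temperature for c in cli]
        # Numerically stable log-sum-exp over all candidate patients.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the true patient's embedding.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two patients, two dimensions.
molecular = [[1.0, 0.0], [0.0, 1.0]]
clinical = [[0.9, 0.1], [0.1, 0.9]]
mismatched = [clinical[1], clinical[0]]

loss_aligned = info_nce(molecular, clinical)
loss_mismatched = info_nce(molecular, mismatched)
print(loss_aligned, loss_mismatched)
```

The loss is small when each patient's molecular and clinical embeddings point the same way and large when they are paired with another patient's, which is the behavior the alignment objective describes. An outcome-aware variant would change which pairs count as positives, for example treating patients with the same response endpoint as positives.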

  • Preserved cross-modal relationships between molecular, clinical, and real-world signals
  • Enabled robust downstream tasks such as biomarker identification, patient stratification, and outcome prediction

Unlike early fusion (simple feature concatenation) or late fusion (decision-level aggregation), contrastive learning preserved biologically meaningful relationships across modalities, so that molecular alterations, epigenetic signatures, and clinical phenotypes co-localized in latent space when biologically linked.

Early fusion risked noise amplification across heterogeneous scales; late fusion limited cross-modal interaction learning; attention-only fusion lacked explicit alignment objectives. Contrastive learning provided stronger representation alignment while maintaining modality independence.

4. ML modeling & interpretability

  • Built predictive and exploratory ML models on fused embeddings
  • Integrated SHAP and LIME for post-hoc explanations at feature and patient levels
  • Improved scientific interpretability and regulatory readiness (aligned with AI/ML capabilities)
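
To make the idea behind patient-level attributions concrete, the sketch below computes exact Shapley values for a tiny model by averaging each feature's marginal contribution over all feature orderings. It does not call the SHAP or LIME libraries, which approximate this kind of attribution efficiently for real models. The `risk_score` function, its coefficients, and the feature names are hypothetical.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for idx in order:
            # Switch one feature from its baseline value to the patient's value.
            current[idx] = instance[idx]
            value = predict(current)
            totals[idx] += value - previous
            previous = value
    return [total / len(orderings) for total in totals]


def risk_score(features):
    """Hypothetical toy model with one interaction term and one unused feature."""
    mutation_burden, ctdna_fraction, age_scaled = features
    return (0.5 * mutation_burden + 2.0 * ctdna_fraction
            + 0.3 * mutation_burden * ctdna_fraction + 0.0 * age_scaled)


patient = [1.0, 0.4, 0.7]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(risk_score, patient, baseline)
print(phi)
```

The attributions sum to the difference between the patient's score and the baseline score, the interaction term is shared equally between the two interacting features, and the unused feature receives zero. Exact enumeration grows factorially with the number of features, so it is only practical for illustration.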

Model robustness was evaluated using stratified cross-validation and temporal splits that simulate prospective clinical deployment, to assess generalizability across evolving oncology cohorts.
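
The two evaluation schemes can be sketched as follows. This is a simplified illustration under stated assumptions: the cohort, the `diagnosis_date` and `responder` fields, and the cutoff date are hypothetical, and a production setup would more likely use an established library implementation of stratified splitting.

```python
from datetime import date


def temporal_split(patients, cutoff):
    """Train on patients diagnosed before the cutoff, test on the rest."""
    train = [p for p in patients if p["diagnosis_date"] < cutoff]
    test = [p for p in patients if p["diagnosis_date"] >= cutoff]
    return train, test


def stratified_folds(patients, label_key, k):
    """Deal patients into k folds label by label so class balance is kept."""
    by_label = {}
    for p in patients:
        by_label.setdefault(p[label_key], []).append(p)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_label):
        for p in by_label[label]:
            folds[position % k].append(p)
            position += 1
    return folds


# Hypothetical cohort: two patients per diagnosis year, alternating labels.
cohort = [
    {"id": f"P{i}", "diagnosis_date": date(2019 + i // 2, 1, 1), "responder": i % 2}
    for i in range(8)
]

train, test = temporal_split(cohort, cutoff=date(2021, 1, 1))
folds = stratified_folds(train, label_key="responder", k=2)
print([p["id"] for p in train], [p["id"] for p in test])
print([[p["id"] for p in fold] for fold in folds])
```

The temporal split guarantees that no test patient was diagnosed before any training patient, which mimics applying a model to future cohorts, while the stratified folds keep the responder ratio similar across folds.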

5. Knowledge graphs & dashboards

  • Constructed oncology-focused knowledge graphs linking biomarkers, pathways, phenotypes, and outcomes
  • Delivered interactive Streamlit dashboards on AWS for real-time analytics, hypothesis exploration, and SME validation
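
A knowledge graph of this shape can be pictured as a set of subject, relation, object triples with a path query on top. The sketch below is a minimal in-memory stand-in, not the platform's graph store: the `KnowledgeGraph` class, the relation names, and the placeholder node names are all hypothetical.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Minimal directed graph of (subject, relation, object) triples."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self._edges[subject].append((relation, obj))

    def paths(self, start, goal, max_hops=3):
        """Return all paths from start to goal using at most max_hops edges."""
        found = []
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            if node == goal:
                found.append(path)
                continue
            hops = (len(path) - 1) // 2  # path alternates node, relation, node
            if hops >= max_hops:
                continue
            for relation, nxt in self._edges.get(node, []):
                if nxt not in path:  # avoid cycles
                    stack.append((nxt, path + [relation, nxt]))
        return found


graph = KnowledgeGraph()
# Placeholder entities; a real graph would hold curated biomedical facts.
graph.add("BIOMARKER_A", "in_pathway", "PATHWAY_X")
graph.add("PATHWAY_X", "drives", "PHENOTYPE_Y")
graph.add("PHENOTYPE_Y", "associated_with", "OUTCOME_Z")
graph.add("BIOMARKER_A", "associated_with", "OUTCOME_Z")

for path in graph.paths("BIOMARKER_A", "OUTCOME_Z"):
    print(" -> ".join(path))
```

Listing every route from a biomarker to an outcome, including the intermediate pathway and phenotype, is the kind of query that lets an SME see why two entities are linked rather than only that they are.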

6. Role of LLMs in the platform

  • Encoding: Structured transformation of unstructured clinical notes and molecular reports into contextual embeddings
  • Retrieval-Augmented Reasoning: Integrated structured data with literature-derived insights for contextual interpretation
  • Hypothesis Summarization: Generated concise biomarker hypotheses and mechanistic summaries for SME review
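
The retrieval-augmented step can be illustrated with a toy pipeline: rank literature passages against a question, then assemble a prompt that combines structured findings with the retrieved text. Everything here is an assumption made for illustration. The bag-of-words `embed` function stands in for a real LLM encoder, the passages are invented placeholders, and the prompt layout is hypothetical. No model is called.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for an LLM encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, passages, top_k=2):
    """Return the top_k passages most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(query_vector, embed(p)),
                    reverse=True)
    return ranked[:top_k]


def build_prompt(question, structured_findings, passages):
    """Combine structured data and retrieved text into one prompt string."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages))
    return (f"Structured findings:\n{structured_findings}\n\n"
            f"Retrieved literature:\n{context}\n\n"
            f"Question: {question}\n")


# Invented placeholder passages, not real literature.
passages = [
    "methylation of gene a is linked to therapy response",
    "claims data coding practices vary by region",
    "ctdna fraction tracks tumor burden over time",
]
question = "does methylation of gene a predict therapy response"
prompt = build_prompt(question, "gene a promoter: hypermethylated", passages)
print(prompt)
```

The resulting prompt would then be passed to an LLM to draft a hypothesis summary for SME review. Keeping retrieval separate from generation makes it possible to show reviewers which passages informed each summary.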

This solution empowered the client to move from fragmented data exploration to integrated, insight-driven oncology research at scale.

Conclusion

This architecture moved beyond traditional multimodal AI by explicitly aligning biological, clinical, and real-world signals at the patient level while maintaining interpretability and deployment readiness.

Excelra delivered a scalable, LLM-powered multimodal oncology analytics platform that transformed the client’s research workflows:

  • 2–3× faster biomarker discovery
  • 30–50% improvement in research efficiency
  • Improved trust and adoption of AI
  • Unified molecular, clinical, and RWE insights
  • Established a future-ready foundation for GenAI-enabled precision oncology