Development of optimized workflow for viral genome reconstruction and taxonomic classification from RNA-Seq data

Overview

We partnered with a pioneering biotech company focused on bat genomes to develop an automated, scalable workflow for viral genome assembly and taxonomic classification from Illumina RNA-Seq data. By integrating advanced bioinformatics tools and a dual-path assembly strategy, the pipeline achieved over 95% sequence identity and 98% classification accuracy. This solution not only accelerated virus detection and genome reconstruction but also enhanced zoonotic surveillance, supporting early identification of potential spillover threats. The project underscores Excelra’s commitment to enabling faster, data-driven decisions in virology, public health, and vaccine development.

Our client

Our client is a biotech innovator at the forefront of bat genome research, studying the evolutionary adaptations of bats. The client needed Excelra’s bioinformatics support to assemble viral genomes from bat-derived sequencing data and perform taxonomic classification. This involved identifying and assembling viral genomes embedded in host transcriptomic data, with the aim of establishing high-quality reference assemblies.

Client’s challenge

An early-stage biotech company focused on bat genomes aimed to build a framework for automating the assembly of Illumina RNA-Seq reads from host species to construct reliable virus genome assemblies and perform taxonomic classification. Many species, including bats and swine, serve as natural reservoirs for a wide variety of viruses, some of which have the potential to jump to humans — as seen in pandemics like COVID-19. RNA-Seq data from these hosts can reveal both known and previously undiscovered viruses, making it a powerful resource for early virus identification and zoonotic surveillance. 

Client’s goals

By analyzing RNA-Seq data across species, researchers can gain insights into which viruses may pose a risk of spillover, helping pharmaceutical companies and public health organizations prepare for emerging threats. 

The client sought to feed this RNA-Seq data into a virus genome assembly pipeline to create an automated, scalable workflow. The resulting viral genome assemblies would not only support their current taxonomic classification research but also serve as high-quality reference sequences for future studies. The client’s goal was to study virus integration in bats by constructing reliable virus genome assemblies from host-derived RNA-Seq data.

Our approach

Excelra designed a robust and automated pipeline to process Illumina RNA-Seq reads and assemble viral genomes. The workflow leveraged multiple bioinformatics tools and databases to streamline assembly and taxonomic analysis:

Pre-processing

  • Raw sequencing reads were subjected to quality trimming.
  • This step removes low-quality bases and adapter contamination, ensuring that only high-quality, clean reads are used for downstream analysis.
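
As an illustration of what quality trimming does to a single read, the sketch below removes low-quality bases from the 3' end. It assumes Phred+33 encoded quality strings and a hypothetical cutoff of Q20; the function name `trim_3prime` is illustrative only, and a production pipeline would normally rely on a dedicated trimming tool (which also handles adapters) rather than this simplified logic.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim bases from the 3' end while their Phred score is below min_q.

    Assumes Phred+33 quality encoding; min_q=20 is a hypothetical default,
    not the setting used in the project.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    end = len(seq)
    # Walk backwards from the 3' end until a base meets the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
read, quality = trim_3prime("ACGTACGT", "IIIII###")
print(read, quality)
```

Reads that are trimmed to zero length, or below a minimum length, would typically be discarded before assembly.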

Assembly

The cleaned reads were processed through two complementary assembly strategies to maximize the detection and reconstruction of viral genomes:

  • A general-purpose metagenome assembler was employed, ideal for generating a broad and comprehensive assembly of all organisms present in the sample.
  • A virus-focused assembler, optimized for RNA viruses, was used to reconstruct viral genomes with higher sensitivity, particularly those related to coronaviruses and similar viral pathogens.
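
Because both assembly paths can recover the same viral sequence, their contigs may need to be pooled before classification. The sketch below shows one possible way to do that, assuming contigs are held as name-to-sequence dictionaries and that only exact duplicates (including reverse complements) are dropped. The pooling step, the function names, and the toy sequences are assumptions for illustration; the source does not describe how the two contig sets were combined.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def pool_contigs(*assemblies):
    """Merge (label, {name: sequence}) pairs, dropping exact duplicates.

    Two contigs are treated as duplicates when they are identical or
    reverse complements of each other. Near-identical contigs are kept.
    """
    seen = set()
    pooled = {}
    for label, contigs in assemblies:
        for name, seq in contigs.items():
            seq = seq.upper()
            # Strand-independent key: the smaller of the two orientations.
            key = min(seq, reverse_complement(seq))
            if key in seen:
                continue
            seen.add(key)
            pooled[f"{label}|{name}"] = seq
    return pooled


# Toy sequences; "n1" is the reverse complement of "k1".
general = {"k1": "ATGCGT", "k2": "GGGTTT"}
viral = {"n1": "ACGCAT", "n2": "TTTTCC"}
pooled = pool_contigs(("general", general), ("viral", viral))
print(list(pooled))
```

Prefixing each contig name with its assembler label keeps the origin of every sequence traceable in downstream reports.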

Annotation & taxonomic classification

  • A fast and accurate taxonomic classifier was used to identify and classify the microbial and viral components present in the sample, providing taxonomic labels for the assembled sequences.
  • BLAST analysis was then carried out on the contigs assembled in the steps above to confirm their taxonomic identities and to validate the presence of known or novel viral species through sequence similarity against reference databases.
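
To make the BLAST confirmation step concrete, the sketch below picks the best hit per contig from BLAST tabular output. It assumes the default 12-column `-outfmt 6` layout, ranks hits by bit score, and applies a hypothetical minimum percent identity; the contig names, subject IDs, and values in the example are made up for illustration and are not project results.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(blast_tsv, min_pident=0.0):
    """Return the highest-bitscore hit per query contig.

    blast_tsv is any iterable of tab-separated lines. Hits below
    min_pident (a hypothetical filter) are ignored.
    """
    best = {}
    reader = csv.DictReader(blast_tsv, fieldnames=BLAST6_FIELDS, delimiter="\t")
    for row in reader:
        pident = float(row["pident"])
        bitscore = float(row["bitscore"])
        if pident < min_pident:
            continue
        current = best.get(row["qseqid"])
        if current is None or bitscore > current["bitscore"]:
            best[row["qseqid"]] = {
                "sseqid": row["sseqid"],
                "pident": pident,
                "bitscore": bitscore,
            }
    return best


# Made-up example rows; subject IDs are placeholders, not real accessions.
EXAMPLE_BLAST = (
    "contig_1\tref_A\t96.5\t1200\t42\t0\t1\t1200\t1\t1200\t0.0\t2000\n"
    "contig_1\tref_B\t91.0\t1100\t99\t0\t1\t1100\t1\t1100\t0.0\t1500\n"
    "contig_2\tref_C\t88.2\t800\t94\t1\t1\t800\t5\t804\t1e-150\t900\n"
)
hits = best_hits(io.StringIO(EXAMPLE_BLAST), min_pident=90.0)
print(hits)
```

Contigs with no hit above the threshold, such as `contig_2` here, are the ones that would merit closer inspection as divergent or potentially novel sequences.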

Quality analysis & interpretation

The final assemblies were evaluated for:

  • Taxonomic composition—to determine the diversity and abundance of organisms present.
  • Viral genome distribution—to assess the breadth of viral detection across different taxa.
  • Contig assembly quality—to ensure the assembled viral genomes are contiguous, complete, and biologically meaningful.

These evaluations helped to ensure confidence in the detection, classification, and downstream interpretation of viral and microbial sequences within the sample.
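
Contig assembly quality is often summarized with simple contiguity statistics. The sketch below computes contig count, total length, longest contig, and N50 from a list of contig lengths. N50 is used here as a common example metric; the source does not state which metrics were applied, and the lengths shown are illustrative.

```python
def n50(lengths):
    """Return the N50: the contig length at which the cumulative sum of
    lengths, taken from longest to shortest, first reaches half of the
    total assembly length. Returns 0 for an empty assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def summarize(lengths):
    """Collect basic contiguity statistics for one assembly."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Illustrative contig lengths in base pairs.
print(summarize([100, 80, 60, 40, 20]))
```

For viral genomes, comparing the longest contig with the expected genome length of the closest reference gives a quick indication of completeness.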

Our solution

Using this workflow, Excelra assembled virus genomes from Illumina RNA-Seq reads with high percent identity to reference sequences. The workflow enables comprehensive detection and reconstruction of viral sequences from complex sequencing datasets, ensuring both accuracy and biological relevance. The key components of the pipeline include:

High identity assemblies

Successfully reconstructed viral genomes with high sequence identity to known references using Illumina RNA-Seq data.  

Comprehensive detection & reconstruction

Workflow ensures accurate identification and biologically relevant reconstruction of viral sequences from complex datasets. 

Quality trimming & pre-processing

  • Applied stringent quality control. 
  • Removed low-quality bases and adapter sequences using standard tools. 

Dual-path assembly strategy

  • General-purpose assembler (e.g., MEGAHIT): Captured broad sequence diversity. 
  • Virus-specific assembler (e.g., coronaSPAdes): Optimized for sensitive reconstruction of RNA viral genomes. 
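
For readers who want to see how the two assembly paths might be invoked, the sketch below builds command lines for the example tools named above. It assumes paired-end FASTQ inputs, that `megahit` and a SPAdes installation providing `coronaspades.py` are available on the PATH, and hypothetical file and directory names. Only the basic input (`-1`, `-2`), output (`-o`), and thread (`-t`) options are used; the settings used in the actual project are not documented here.

```python
import shlex


def megahit_cmd(r1, r2, out_dir, threads=8):
    """Command line for the general-purpose path (MEGAHIT)."""
    return ["megahit", "-1", r1, "-2", r2, "-o", out_dir, "-t", str(threads)]


def coronaspades_cmd(r1, r2, out_dir, threads=8):
    """Command line for the virus-focused path (coronaSPAdes)."""
    return ["coronaspades.py", "-1", r1, "-2", r2, "-o", out_dir, "-t", str(threads)]


# Hypothetical trimmed read files produced by the pre-processing step.
R1 = "sample_R1.trimmed.fastq.gz"
R2 = "sample_R2.trimmed.fastq.gz"

commands = [
    megahit_cmd(R1, R2, "asm_megahit"),
    coronaspades_cmd(R1, R2, "asm_coronaspades"),
]
for cmd in commands:
    # Printed rather than executed; subprocess.run(cmd, check=True) would run it.
    print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell quoting problems and makes it straightforward to run both paths from a workflow manager.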

Taxonomic classification via Kraken2

  • High-performance k-mer-based classification of contigs. 
  • Enabled detection of both known and novel viral species. 
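
The classification results can be summarized with a few lines of code. The sketch below tallies Kraken2's standard per-sequence output, assuming the default tab-separated layout (classification flag, sequence ID, taxonomy ID, sequence length, k-mer mapping) produced without `--use-names`. The contig names and taxonomy IDs in the example are placeholders, not real lookups or project results.

```python
from collections import Counter


def tally_kraken2(lines):
    """Count classified ('C') and unclassified ('U') sequences and the
    number of classified sequences assigned to each taxonomy ID."""
    status = Counter()
    per_taxid = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        flag, taxid = fields[0], fields[2]
        status[flag] += 1
        if flag == "C":
            per_taxid[taxid] += 1
    return status, per_taxid


# Placeholder rows in Kraken2's standard output layout.
EXAMPLE_KRAKEN = (
    "C\tcontig_1\t100001\t1200\t100001:1166\n"
    "C\tcontig_2\t100001\t900\t100001:866\n"
    "U\tcontig_3\t0\t800\t0:766\n"
)
status, per_taxid = tally_kraken2(EXAMPLE_KRAKEN.splitlines())
print(dict(status), dict(per_taxid))
```

Unclassified contigs are natural candidates for the BLAST validation step, since a k-mer classifier can miss sequences that diverge from its reference database.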

Downstream analysis included

  • BLAST validation for confirming viral identities. 
  • Distribution mapping to assess viral abundance and prevalence. 
  • Species-level detection to resolve closely related strains. 

Final output

Delivered high-quality, assembled viral contigs with validated annotations and taxonomic labels—supporting virome studies, epidemiology, and functional research. 

Key performance outcomes

>95% sequence identity achieved to known reference genomes, ensuring high-confidence detection of both known and novel viruses. 

>98% taxonomic classification accuracy in identifying viral species and strains. 

>3x improvement in assembly turnaround time compared to manual or semi-automated methods. 

Early discovery of potential zoonotic threats, providing a 6–12 month lead time for risk mitigation compared to traditional surveillance.

Up to 70% reduction in manual curation effort, freeing up bioinformatics experts for higher-value tasks.  

Strategic business impacts

  • Enhanced pandemic preparedness through high-throughput zoonotic surveillance 
  • Faster vaccine and antiviral development powered by high-quality reference genomes 
  • Creation of proprietary genomic assets for R&D, licensing, and partnership opportunities 
  • Improved regulatory support via quality-controlled genome assemblies for submissions and surveillance 

Conclusion

Excelra delivered a customized, automated bioinformatics pipeline that enabled accurate and efficient assembly of viral genomes from host-derived RNA-Seq data. The solution significantly improved processing speed, reproducibility, and scalability, while achieving >95% sequence identity to reference genomes and >98% taxonomic classification accuracy. Designed for modular integration, the pipeline is capable of processing >10,000 datasets annually with up to 70% reduction in manual curation. This empowers large-scale viral genomics efforts, particularly in zoonotic reservoirs like bats, and supports early detection of emerging threats with a 6–12 month lead time. Additional value delivered to the client includes accelerated vaccine and antiviral target discovery, proprietary data asset creation, and robust support for regulatory submissions.