QUICK DEFINITION

Whole Genome Sequencing (WGS) is an advanced next-generation sequencing (NGS) technique that determines the complete DNA sequence of an organism’s genome—encompassing all ~3.2 billion base pairs in humans—in a single assay. Unlike targeted methods, WGS comprehensively analyzes all protein-coding regions (exons), non-coding intervening structures (introns), regulatory elements, repetitive regions, and mitochondrial DNA.

Key takeaways

  • Unbiased Global View: WGS offers a comprehensive, hypothesis-free method that looks across the entire genomic landscape simultaneously.
  • Broad Variant Spectrum: It efficiently detects single nucleotide variants (SNVs), small indels, copy number variants (CNVs), and large structural variants like translocations or inversions.
  • WGS vs. WES Scope: While WES restricts its target territory to just the ~1–2% protein-coding area, WGS delivers raw data covering all non-coding regulatory and intronic regions.
  • Application Depth: WGS serves as a primary pillar in drug discovery for novel target identification, pharmacogenomics profiling, population genetics, and tracking resistance mechanisms.
  • Sequencing Depth Standards: Standard germline variants are resolved cleanly at 30× depth, while somatic tumor profiling with higher heterogeneity requires 60–100× or higher.
  • AI Pipeline Integration: Large-scale WGS interpretation increasingly relies on AI and machine learning tools (such as Google’s DeepVariant) for automated variant calling and structural analysis.

What is Whole genome sequencing (WGS)?

Whole genome sequencing (WGS) is a next-generation sequencing (NGS) technique that determines the complete DNA sequence of an organism’s genome in a single experiment. Unlike targeted approaches such as Whole Exome Sequencing (WES) or gene panels, WGS reads across the entire genome: protein-coding exons, non-coding introns, regulatory elements, repeat regions, and mitochondrial DNA. This comprehensive, hypothesis-free approach makes WGS the most complete genomic profiling method in routine use in the life sciences today.

In humans, the genome spans approximately 3.2 billion base pairs across 23 chromosome pairs, and WGS captures all of it in one assay. The same approach applies to any organism, from bacteria and viruses to plants and animals, providing a view of genomic variation that targeted techniques cannot match in scope.

Powered by NGS platforms, whole genome sequencing has been transformed from a 13-year, roughly $3-billion effort (the Human Genome Project, completed in 2003) into a routine analysis achievable in days for under $1,000. This cost revolution is making population-scale and clinical WGS studies increasingly feasible for pharma, biotech, and clinical research organizations worldwide.

WGS detects a wide spectrum of genomic variants, including:

  • Single Nucleotide Variants (SNVs) — single base-pair substitutions
  • Insertions and Deletions (indels) — small insertions or deletions of sequence
  • Copy Number Variants (CNVs) — duplications or deletions of genomic segments
  • Structural Variants (SVs) — large-scale chromosomal rearrangements, inversions, and translocations
  • Short Tandem Repeats (STRs) and other repeat expansions

This breadth of detection is what distinguishes WGS from WES, targeted panels, or microarray-based approaches, all of which capture only a subset of the genomic landscape.
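As a rough illustration of how the variant classes above are told apart in practice, the sketch below labels a VCF-style REF/ALT pair by allele length alone. The function name and the 50 bp boundary between indels and structural variants are assumptions made for this example (50 bp is a common convention, not a rule stated in this article).

```python
def classify_variant(ref: str, alt: str, sv_min_size: int = 50) -> str:
    """Label a REF/ALT allele pair as SNV, MNV, indel, or SV.

    Illustrative only: sv_min_size is an assumed convention, and real
    callers use far more evidence than allele length.
    """
    # Symbolic alleles such as <DEL>, <DUP>, or <INV> describe structural variants.
    if alt.startswith("<") and alt.endswith(">"):
        return "SV"
    # One base replaced by one base.
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    # Same length, more than one base: multi-nucleotide variant.
    if len(ref) == len(alt):
        return "MNV"
    # Length change: small events are indels, large ones count as SVs.
    size = abs(len(ref) - len(alt))
    return "indel" if size < sv_min_size else "SV"
```

Copy number variants and repeat expansions are usually inferred from read depth and repeat-aware callers rather than from REF/ALT strings, so they fall outside this simple rule.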

Whole genome sequencing vs. Whole exome sequencing comparison

WGS and WES are both powerful NGS-based approaches, but they differ significantly in scope, cost, and use case. Understanding these differences is essential for designing the right genomic study.

WGS vs. WES Comparison

Feature | WGS (Whole Genome Sequencing) | WES (Whole Exome Sequencing)
Genomic Coverage | Entire genome (~3.2 Gb in humans) | Protein-coding exons only (~1–2% of genome)
Variant Types Detected | SNVs, indels, CNVs, SVs, non-coding variants | SNVs, indels in coding regions; limited SV/CNV detection
Data Output per Sample | ~100–150 GB (30× coverage) | ~8–15 GB (100× coverage)
Relative Cost | Higher | Lower (~3–5× cheaper than WGS)
Bioinformatics Complexity | High: requires larger compute resources and storage | Moderate: smaller datasets, well-established pipelines
Best For | Novel variant discovery, structural variant detection, population genomics, metagenomics | Clinical diagnostics, Mendelian disease, cost-efficient rare variant studies
Non-Coding Region Analysis | Yes: full regulatory and intronic coverage | No

WGS is the preferred approach for discovery-oriented research — where novel genes, regulatory variants, and structural rearrangements are the primary interest. WES is generally preferred in clinical diagnostics and studies focused on protein-coding mutations, where cost efficiency and faster turnaround matter most.

Step-by-Step whole genome sequencing workflow

A standard whole genome sequencing workflow proceeds through five broad phases, from biological sample to interpreted genomic results. Each step requires rigorous quality control to ensure data accuracy and downstream reliability.

1. Sample collection & DNA extraction

High-molecular-weight (HMW) genomic DNA is extracted from biological material — commonly peripheral blood, saliva, fresh/frozen tissue, or FFPE (Formalin-Fixed Paraffin-Embedded) tissue. DNA quantity and quality are assessed using Qubit fluorometry (for concentration) and gel electrophoresis or Bioanalyzer (for fragment integrity). Degraded or low-input DNA requires specialized library preparation protocols.

2. Library preparation

The extracted DNA is fragmented (typically to 300–500 bp by sonication or enzymatic shearing), end-repaired, and ligated to sequencing adapters that carry sample barcodes and, in some protocols, unique molecular identifiers (UMIs). For short-read platforms (Illumina), this produces a sequencing-ready library that undergoes size selection and, unless a PCR-free protocol is used, amplification. Long-read platforms (Oxford Nanopore, PacBio HiFi) typically use little or no amplification, which preserves native DNA modifications.

3. Sequencing

The prepared library is loaded onto a sequencing instrument. Short-read sequencing (most commonly Illumina) generates paired-end reads of 100–250 bp at depths of 30× (standard germline) to 100× (somatic tumor profiling). Long-read platforms generate reads of thousands to millions of base pairs, which is particularly valuable for resolving complex structural variants, repetitive regions, and phasing haplotypes.
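The relationship between depth, read length, and genome size can be made concrete with a small calculation. The sketch below uses the standard approximation that mean depth equals total sequenced bases divided by genome size; the function names are made up for this example, and the calculation ignores duplicates, unmapped reads, and uneven coverage.

```python
import math


def read_pairs_for_depth(target_depth: int, genome_size_bp: int, read_length_bp: int) -> int:
    """Read pairs needed so that (pairs * 2 * read length) / genome size >= target depth."""
    bases_needed = target_depth * genome_size_bp
    return math.ceil(bases_needed / (2 * read_length_bp))


def mean_depth(read_pairs: int, genome_size_bp: int, read_length_bp: int) -> float:
    """Approximate mean depth delivered by a given number of paired-end reads."""
    return read_pairs * 2 * read_length_bp / genome_size_bp


# 30x over a 3.2 Gb genome with 2 x 150 bp reads
pairs_30x = read_pairs_for_depth(30, 3_200_000_000, 150)  # 320,000,000 pairs
```

In practice, labs usually sequence somewhat more than this estimate to absorb losses from duplicates and low-quality reads.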

4. Raw data processing & quality control

Raw sequencing reads (FASTQ format) are assessed for quality using tools like FastQC. Adapter sequences and low-quality bases are trimmed using Trimmomatic, Cutadapt, or Fastp. At this stage, per-base quality scores, GC content, duplication rates, and read length distributions are evaluated to confirm the library meets quality thresholds.
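To show what these QC metrics measure, here is a minimal sketch that computes read count, GC fraction, and mean Phred quality from FASTQ text. It assumes simple four-line records and Phred+33 encoding, and it is not a substitute for FastQC; the function names are illustrative.

```python
import io


def fastq_records(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def summarize_fastq(handle, phred_offset: int = 33) -> dict:
    """Return read count, GC fraction, and mean per-base Phred quality."""
    n_reads = n_bases = gc_bases = quality_sum = 0
    for _, seq, qual in fastq_records(handle):
        n_reads += 1
        n_bases += len(seq)
        gc_bases += sum(seq.upper().count(base) for base in "GC")
        quality_sum += sum(ord(char) - phred_offset for char in qual)
    if n_bases == 0:
        return {"reads": 0, "gc_fraction": 0.0, "mean_quality": 0.0}
    return {
        "reads": n_reads,
        "gc_fraction": gc_bases / n_bases,
        "mean_quality": quality_sum / n_bases,
    }


example = io.StringIO("@r1\nGATTACA\n+\nIIIIIII\n@r2\nGGCC\n+\n!!!!\n")
stats = summarize_fastq(example)
```

Real FASTQ files are usually gzip-compressed and far too large to hold in memory, which is why the sketch streams records one at a time.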

5. Bioinformatics analysis & interpretation

Clean reads are aligned to the reference genome, variants are called and annotated, and results are interpreted in biological and clinical context. This is covered in detail in the WGS Bioinformatics Pipeline section below.

Whole genome sequencing bioinformatics pipeline

The bioinformatics analysis of whole genome sequencing data is as critical as the sequencing itself. A poorly designed or executed computational pipeline can introduce errors, bias, or missed variants that undermine the scientific and clinical value of the experiment. Excelra’s bioinformatics services and Online Pipeline Platform (OP²) deliver validated, scalable WGS pipelines built on industry-standard best practices.

Read alignment

Trimmed reads are aligned to the reference genome (GRCh38/hg38 for human) using a DNA aligner such as BWA-MEM, BWA-MEM2, or Bowtie2 for short reads, or minimap2 for long reads. Splice-aware aligners such as STAR are designed for RNA-seq and are not used for WGS. The output is a BAM file (Binary Alignment/Map) that records where each read maps in the genome, along with mapping quality scores. Proper alignment is essential: even a small percentage of misaligned reads can produce false-positive variant calls.
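The alignment records described here can be inspected in their text form (SAM). The sketch below parses the first six mandatory SAM fields, decodes three standard FLAG bits, and keeps reads above a mapping-quality threshold. The class and function names, and the MAPQ cutoff of 20, are assumptions for illustration; production code would normally use a library such as pysam rather than hand-parsing.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    read_name: str
    flag: int
    chrom: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)

    @property
    def is_duplicate(self) -> bool:
        return bool(self.flag & 0x400)


def parse_sam_line(line: str) -> Alignment:
    """Parse QNAME, FLAG, RNAME, POS, MAPQ, and CIGAR from one SAM record."""
    fields = line.rstrip("\n").split("\t")
    return Alignment(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), int(fields[4]), fields[5])


def confidently_mapped(lines, min_mapq: int = 20):
    """Yield mapped alignments with MAPQ >= min_mapq, skipping '@' header lines."""
    for line in lines:
        if line.startswith("@"):
            continue
        aln = parse_sam_line(line)
        if not aln.is_unmapped and aln.mapq >= min_mapq:
            yield aln
```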

Post-Alignment processing

The aligned BAM undergoes: (a) duplicate marking using Picard or samtools, to flag PCR duplicates that would otherwise inflate variant allele frequencies; (b) base quality score recalibration (BQSR) using GATK, which corrects systematic errors in base quality scores introduced during sequencing; and (c) depth-of-coverage calculation to confirm that sufficient genomic territory is covered at adequate depth.
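The idea behind duplicate marking can be sketched in a few lines: reads that start at the same position on the same strand are treated as copies of one original fragment, and all but the best-scoring copy are flagged. This is a deliberate simplification (tools such as Picard also consider mate positions and soft-clipping), and the dictionary keys used below are hypothetical.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Each read is a dict with hypothetical keys: name, chrom, pos, reverse,
    and base_quality_sum. Within each (chrom, pos, strand) group, the read
    with the highest summed base quality is kept as the representative.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["reverse"])].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["base_quality_sum"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates
```

Duplicates are flagged rather than deleted, so downstream tools can decide whether to ignore them.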

Variant calling

Variant calling is the core computational step of the pipeline. For germline variants, GATK HaplotypeCaller is a widely used standard tool, operating in GVCF mode for scalable joint genotyping across cohorts. For somatic variants (tumor vs. normal), Mutect2 or Strelka2 are widely used. Structural variant callers such as Manta, Lumpy, or DELLY identify large rearrangements, while CNVkit or CNVnator quantify copy number changes.
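To give a feel for what a germline caller decides at each site, here is a toy genotyper based only on allele counts. It is not how HaplotypeCaller, Mutect2, or any tool named above works (those use local assembly and probabilistic models); the thresholds and function name are assumptions chosen for illustration.

```python
def call_genotype(ref_count: int, alt_count: int,
                  min_depth: int = 10,
                  het_range: tuple = (0.2, 0.8)) -> str:
    """Return a diploid genotype string from reference and alternate read counts.

    Toy rule: the fraction of reads supporting the alternate allele decides
    between homozygous reference (0/0), heterozygous (0/1), and homozygous
    alternate (1/1). Sites below min_depth are left uncalled.
    """
    depth = ref_count + alt_count
    if depth < min_depth:
        return "no-call"
    alt_fraction = alt_count / depth
    if alt_fraction < het_range[0]:
        return "0/0"
    if alt_fraction <= het_range[1]:
        return "0/1"
    return "1/1"
```

The minimum-depth check hints at why 30× is a common germline target: at low depth, a heterozygous site can look homozygous by chance.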

Variant annotation

Called variants are annotated against reference databases to assign biological and clinical context: ClinVar (clinical significance), dbSNP (reference identifiers for known variants), gnomAD (allele frequencies across diverse populations), COSMIC (cancer somatic mutations), and OMIM (Mendelian disease associations). Tools such as ANNOVAR, VEP (Ensembl Variant Effect Predictor), or SnpEff perform functional annotation, predicting whether a variant is synonymous, missense, nonsense, splice-altering, or regulatory.
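The synonymous/missense/nonsense distinction comes down to translating the reference and altered codons and comparing the results. The sketch below does this with the standard genetic code. It handles a single in-frame codon only and is far simpler than VEP, ANNOVAR, or SnpEff, which also resolve transcripts, strands, and splice sites; the function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def coding_consequence(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"      # new stop codon
    if ref_aa == "*":
        return "stop_lost"     # original stop codon removed
    return "missense"
```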

Variant filtering & prioritization

Because WGS generates millions of variants per sample, robust filtering is essential to prioritize those most likely to be biologically relevant. Filtering criteria typically include: variant allele fraction thresholds, variant quality scores, population allele frequency (excluding common polymorphisms for rare disease studies), predicted functional impact scores (CADD, SIFT, PolyPhen), and disease-specific criteria. The output is a refined candidate variant list for downstream validation or reporting.
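A filtering pass of this kind might look like the sketch below, which applies quality, population-frequency, and impact-score cutoffs to annotated variants and ranks the survivors. The field names (`qual`, `gnomad_af`, `cadd`) and all three thresholds are assumptions for this example; appropriate cutoffs depend on the study design.

```python
def prioritize(variants, min_qual: float = 30.0,
               max_pop_af: float = 0.01, min_cadd: float = 20.0):
    """Keep rare, high-quality, predicted-damaging variants, ranked by CADD.

    Each variant is a dict with hypothetical keys: qual, gnomad_af, cadd.
    A missing gnomad_af is treated as 0.0 (never observed in the population),
    and a missing cadd score is treated as 0.0.
    """
    kept = [
        v for v in variants
        if v["qual"] >= min_qual
        and v.get("gnomad_af", 0.0) <= max_pop_af
        and v.get("cadd", 0.0) >= min_cadd
    ]
    return sorted(kept, key=lambda v: v.get("cadd", 0.0), reverse=True)
```

Treating absent population frequencies as zero is itself a design choice: it favors novel variants, which suits rare disease work but can also let through artifacts in poorly covered regions.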

Downstream analysis & reporting

Filtered variants feed into downstream analyses: somatic mutation signature analysis (using COSMIC SBS signatures), tumor mutational burden (TMB) calculation, homologous recombination deficiency (HRD) scoring, microsatellite instability (MSI) detection, clonal evolution tracking, and pharmacogenomic profiling. Results are integrated into clinical reports or research databases, often visualized through data visualization platforms.

Key applications of whole genome sequencing in life sciences

Whole genome sequencing has become foundational across multiple domains of biomedical and life sciences research. Its ability to capture the full genomic landscape — without prior assumptions about where important variants might reside — makes it invaluable for both discovery science and translational medicine.

Rare and undiagnosed disease

For patients with rare or undiagnosed conditions who have exhausted standard diagnostic approaches, WGS offers a diagnostic yield of 25–50%, significantly higher than WES or gene panels in many studies. By covering non-coding regulatory regions, WGS can identify promoter mutations, deep intronic splice variants, and structural rearrangements that would be completely missed by exome-based approaches.

Population genomics

Large-scale population WGS programs — such as the UK Biobank, All of Us, and the 100,000 Genomes Project — are building reference panels that capture the full spectrum of human genetic diversity. These datasets power genome-wide association studies (GWAS), polygenic risk score (PRS) development, and pharmacogenomic research at population scale, directly informing drug target identification and patient stratification strategies.
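A polygenic risk score, in its simplest additive form, is a weighted sum of risk-allele dosages. The sketch below assumes per-variant effect weights (for example, from GWAS summary statistics) and genotype dosages of 0, 1, or 2; the variant identifiers are hypothetical, and real PRS methods add steps such as linkage-disequilibrium adjustment and ancestry calibration.

```python
def polygenic_risk_score(dosages: dict, weights: dict) -> float:
    """Sum dosage * weight over variants present in both inputs.

    dosages: variant id -> number of effect alleles carried (0, 1, or 2)
    weights: variant id -> per-allele effect size
    Variants missing from either mapping are skipped.
    """
    shared = dosages.keys() & weights.keys()
    return sum(dosages[v] * weights[v] for v in shared)


score = polygenic_risk_score(
    {"variant_a": 2, "variant_b": 1, "variant_c": 0},
    {"variant_a": 0.5, "variant_b": 0.25, "variant_d": 1.0},
)
```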

Infectious disease & metagenomics

WGS is the gold standard for pathogen surveillance and outbreak investigation. By sequencing the complete genomes of bacteria, viruses, or fungi, public health agencies can track transmission chains, identify resistance mutations, and monitor the emergence of new strains in near real-time. WGS also underpins metagenomic studies that characterize complex microbial communities — including the human microbiome — without prior culture or species identification.
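Transmission tracking often starts from pairwise SNP distances between isolate genomes: closely related isolates differ at few positions. The sketch below counts differing positions between pre-aligned sequences of equal length, ignoring ambiguous bases and gaps. It assumes the alignment has already been done, and the function names are illustrative.

```python
def snp_distance(seq_a: str, seq_b: str, ignore: str = "N-") -> int:
    """Count positions where two aligned sequences differ, skipping N and gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


def distance_matrix(genomes: dict) -> dict:
    """Return {(name_x, name_y): distance} for every unordered pair of isolates."""
    names = sorted(genomes)
    return {
        (x, y): snp_distance(genomes[x], genomes[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }
```

What counts as "close enough" to suggest direct transmission varies by organism and mutation rate, so any threshold applied to these distances needs pathogen-specific justification.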

Agricultural genomics (Agrigenomics)

In plant and animal genomics, WGS is used to map quantitative trait loci (QTL), identify genes underlying yield, disease resistance, or nutritional content, and guide marker-assisted selection in breeding programs. Excelra’s agrigenomics capabilities leverage WGS workflows to support clients in crop genomics and agricultural biotech.

Whole genome sequencing in drug discovery & target ID

Whole genome sequencing is playing an increasingly central role in the early stages of drug discovery — particularly in target identification, patient stratification, and the development of companion diagnostics.

Identifying novel drug targets

Population-scale WGS studies enable the discovery of rare and common variants associated with disease risk. Loss-of-function variants in specific genes — identified through WGS in large cohorts — have proven to be some of the most compelling drug targets in modern medicine. The PCSK9 inhibitor class of cholesterol-lowering drugs, for example, was validated precisely because rare loss-of-function mutations in PCSK9 were observed to protect against cardiovascular disease in population genomics data.

Pharmacogenomics

WGS provides comprehensive coverage of pharmacogenomically relevant loci — including CYP450 enzyme genes, drug transporter genes, and HLA alleles — enabling precision prediction of drug metabolism, efficacy, and adverse event risk. This informs dose optimization and patient stratification in clinical trials, connecting directly to Excelra’s pharmacogenomics capabilities.

Resistance profiling

In both oncology and infectious disease, WGS is used to characterize resistance mechanisms at the genomic level. In cancer, longitudinal tumor WGS reveals how clonal evolution under selective therapeutic pressure drives acquired resistance. In antimicrobial research, WGS identifies resistance genes and mutations in bacterial and viral genomes, guiding next-generation antibiotic or antiviral design.

Biomarker discovery

WGS data feeds directly into biomarker discovery workflows — identifying genomic signatures that predict treatment response, disease progression, or patient survival. Mutation burden, specific somatic signatures, structural rearrangements, and HRD scores derived from WGS are increasingly used as predictive and prognostic biomarkers in oncology clinical trials.

Whole genome sequencing in oncology & precision medicine

Cancer genomics represents one of the most impactful applications of whole genome sequencing. Tumors accumulate thousands to millions of somatic mutations, and WGS provides a comprehensive catalog of all genomic alterations — enabling a level of biological insight that is impossible with targeted panels alone.

Comprehensive somatic mutation profiling

WGS of matched tumor-normal pairs identifies the full landscape of somatic alterations: single nucleotide substitutions, indels, structural rearrangements, copy number changes, and microsatellite instability (MSI). This comprehensive view captures driver mutations, passenger mutations, and actionable alterations that may inform treatment decisions or clinical trial eligibility.

Mutational signature analysis

The pattern of somatic mutations in a tumor genome reflects the biological processes — including carcinogen exposure, DNA repair defects, and viral integration — that shaped its evolution. COSMIC mutational signature analysis of WGS data can identify defects in homologous recombination (associated with BRCA1/2 dysfunction and PARP inhibitor sensitivity), APOBEC mutagenesis, or UV-induced damage patterns. These signatures can serve as predictive biomarkers for treatment selection.
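Signature analysis begins by sorting each single-base substitution into one of 96 channels, defined by the mutated base (reported on the pyrimidine strand) and its two neighbors. The sketch below performs that classification and tallies a spectrum; fitting the spectrum to COSMIC SBS signatures is a separate step not shown here. Function names are illustrative.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def sbs96_channel(trinucleotide: str, alt: str) -> str:
    """Return a channel label such as 'A[C>T]A' for a substitution.

    trinucleotide: reference base with one flanking base on each side
    alt: the substituted base at the middle position
    Purine references (A or G) are flipped to the opposite strand so that
    the mutated base is always reported as C or T.
    """
    tri, alt = trinucleotide.upper(), alt.upper()
    if len(tri) != 3 or len(alt) != 1 or alt not in "ACGT" or alt == tri[1]:
        raise ValueError("expected a 3-base context and a different alt base")
    if tri[1] in "AG":
        tri = reverse_complement(tri)
        alt = alt.translate(COMPLEMENT)
    return f"{tri[0]}[{tri[1]}>{alt}]{tri[2]}"


def mutation_spectrum(mutations) -> Counter:
    """Tally channels for an iterable of (trinucleotide, alt) pairs."""
    return Counter(sbs96_channel(tri, alt) for tri, alt in mutations)
```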

Tumor mutational burden (TMB) & immunotherapy response

WGS-derived TMB (the number of somatic mutations per megabase of genome sequenced) is an emerging predictive biomarker for response to immune checkpoint inhibitors. Because it counts mutations genome-wide, WGS gives a more comprehensive TMB measurement than targeted panels, which may miss variants in non-covered regions or introduce panel-specific biases.
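The TMB definition above reduces to a single division. The sketch below assumes the somatic mutation count and the number of callable bases (positions with enough coverage in both tumor and normal to make a call) are already known; the function name and the example figures are illustrative, not clinical values.

```python
def tumor_mutational_burden(somatic_mutation_count: int, callable_bases: int) -> float:
    """Return somatic mutations per megabase of callable sequence."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return somatic_mutation_count / (callable_bases / 1_000_000)


# Hypothetical tumor: 28,000 somatic mutations over 2.8 Gb of callable genome
tmb = tumor_mutational_burden(28_000, 2_800_000_000)  # 10.0 mutations/Mb
```

Which mutations are counted (all somatic variants or only coding, non-synonymous ones) differs between assays, so TMB values are only comparable when the counting rules match.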

Liquid biopsy & circulating tumor DNA

Ultra-low-pass WGS of circulating tumor DNA (ctDNA) from plasma — sometimes called liquid biopsy WGS — enables copy number profiling and genome-wide tumor fraction estimation from a minimally invasive blood draw. This application is growing rapidly for early cancer detection, minimal residual disease monitoring, and treatment response assessment.

Challenges & considerations in whole genome sequencing

Despite its power, whole genome sequencing presents several challenges that must be carefully managed to ensure high-quality, interpretable results.

Data volume & storage

A single human genome at 30× coverage generates approximately 100–150 GB of raw sequencing data. At scale — across hundreds or thousands of samples in a clinical or population study — this creates enormous demands on compute infrastructure, storage, and data transfer. Cloud-based solutions and scalable bioinformatics platforms, such as Excelra’s OP² Online Pipeline Platform, are increasingly essential for cost-effective WGS at scale. Excelra also helps teams design cloud-enabled genomic data architectures that are both scalable and cost-efficient.
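The per-sample figure above scales linearly with cohort size, which a short calculation makes tangible. The sketch below uses the 100–150 GB range quoted in this section and decimal units (1 TB = 1,000 GB); it covers raw data only, and the function name is an assumption for this example.

```python
def cohort_storage_tb(n_samples: int,
                      gb_per_sample_low: float = 100.0,
                      gb_per_sample_high: float = 150.0) -> tuple:
    """Return (low, high) raw-data storage estimates in terabytes for a cohort."""
    return (
        n_samples * gb_per_sample_low / 1000,
        n_samples * gb_per_sample_high / 1000,
    )


low_tb, high_tb = cohort_storage_tb(1_000)  # 100.0 to 150.0 TB for 1,000 genomes
```

Aligned BAM/CRAM files, intermediate outputs, and backups come on top of this estimate, so real storage budgets are typically a multiple of the raw-data figure.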

Variant interpretation & clinical reporting

WGS generates millions of variants per sample, but the vast majority are benign. Distinguishing disease-causing variants from incidental findings requires deep domain expertise, curated reference databases, and robust bioinformatics filtering. For clinical WGS, variants of uncertain significance (VUS) present a particular challenge — requiring careful variant classification frameworks aligned with ACMG/AMP guidelines.

Reference genome limitations

Standard WGS relies on alignment to a reference genome, which represents a composite of a small number of individuals and may not capture the full range of human diversity. Newer references, such as the Telomere-to-Telomere (T2T) Consortium’s gapless T2T-CHM13 assembly and the Human Pangenome Reference Consortium’s draft pangenome, are addressing this limitation, but they require updated alignment and variant calling strategies.

Regulatory & ethical considerations

WGS data contains highly sensitive personal information, raising important concerns around data privacy, consent, secondary findings disclosure, and regulatory compliance. Organizations handling WGS data must adhere to frameworks such as GDPR, HIPAA, and ICH E15 guidelines, and implement robust data governance practices. FAIR data principles provide a foundation for ensuring WGS data is managed responsibly and remains reusable across research programs.

AI & machine learning in WGS data analysis

The scale and complexity of WGS datasets make them ideal for machine learning approaches that can identify patterns beyond the reach of traditional bioinformatics tools.

Deep learning for variant calling

Google’s DeepVariant uses deep learning (convolutional neural networks trained on pileup images of aligned reads) to call SNVs and indels with accuracy that exceeds traditional statistical methods, particularly in difficult genomic regions such as homopolymers and repetitive sequences. Similar deep learning approaches are being applied to structural variant detection and copy number analysis.

Predictive modeling from WGS data

Machine learning models trained on WGS data are being used to: predict disease risk from polygenic variant combinations; classify tumor subtypes based on somatic mutation patterns; identify pharmacogenomically relevant genotypes; and predict drug response or adverse event risk from germline variants. Excelra’s AI/ML capabilities are applied to translate complex WGS datasets into actionable biological and clinical insights.

Large language models & WGS interpretation

Emerging applications of large language models (LLMs) in genomics include automated variant interpretation, clinical report generation, and literature-informed annotation of novel variants. These AI-assisted interpretation tools are reducing the time and expertise required to process WGS results, making comprehensive genomic analysis increasingly accessible.

Learn more about how Excelra applies AI to genomics in our NGS Data Analysis glossary page and explore our multi-omics analysis capabilities for integrating WGS with transcriptomic and proteomic data.

How Excelra supports WGS projects

Excelra offers end-to-end WGS bioinformatics capabilities — from raw data processing to biological interpretation — delivered through scalable, cloud-native infrastructure and a team of expert computational biologists.

  • WGS Pipeline Development & Deployment — custom, validated pipelines for germline, somatic, and metagenomic WGS applications, deployable on AWS, Azure, or GCP via Excelra’s Online Pipeline Platform (OP²)
  • Variant Calling & Annotation — GATK-based germline and somatic variant calling with multi-database annotation and clinical interpretation support
  • Genomic Data Management — FAIR-compliant data lake design, metadata management, and long-term storage solutions for large-scale WGS cohorts; see our SDMS glossary for context
  • Bioinformatics-Ready Data — curated, QC-passed, analysis-ready genomic datasets through Excelra’s data services
  • Multi-Omics Integration — integration of WGS data with RNA-seq, proteomics, and clinical data for comprehensive multi-omics analysis
  • AI-Powered Interpretation — machine learning models for variant prioritization, biomarker discovery, and predictive genomics

Explore our fast and cost-effective whole genome analysis case study and our whole exome pipeline deployment case study to see these capabilities in practice.

Conclusion

Whole genome sequencing has fundamentally changed what is possible in genomics — and its impact on life sciences is only deepening. By reading every nucleotide across the complete genome in a single experiment, WGS provides a level of biological resolution and discovery power that no other sequencing approach can match.

From identifying novel drug targets through population-scale genomics, to profiling the complete somatic mutation landscape of a tumor, to delivering molecular diagnoses in patients who have exhausted all other options — whole genome sequencing is now central to drug discovery, clinical genomics, precision oncology, and public health. Its ability to detect not just SNVs and indels, but structural variants, copy number changes, mutational signatures, and non-coding regulatory variants, makes it uniquely suited to the challenges of modern biomedical research.

As sequencing costs continue to fall (from roughly $3 billion for the first human genome in 2003 to under $1,000 today) and as AI-powered analysis tools compress bioinformatics timelines, whole genome sequencing is moving from a specialized research tool into routine clinical practice. Long-read WGS is resolving genomic regions that were previously inaccessible. Pangenome reference frameworks are addressing the diversity limitations of single-reference alignment. Each of these advances makes whole genome sequencing more powerful, more accessible, and more impactful.

For organizations running WGS programs, whether in pharma, biotech, clinical diagnostics, or agricultural genomics, the bottleneck is rarely the sequencer. It is the downstream bioinformatics: building validated pipelines, managing petabyte-scale data, annotating variants in biological context, and translating genomic findings into decisions. Excelra’s end-to-end whole genome sequencing bioinformatics capabilities, delivered through the OP² Online Pipeline Platform and a team of expert computational biologists, are built to support WGS programs at any scale, on any cloud, and for any application.

What is Whole Genome Sequencing (WGS)?

Whole Genome Sequencing (WGS) is an NGS-based technique that reads the complete DNA sequence of an organism’s genome — including all exons, introns, regulatory regions, and mitochondrial DNA — in a single experiment. It detects single nucleotide variants (SNVs), insertions/deletions (indels), copy number variants (CNVs), and structural variants (SVs) across the entire genome.

What is the difference between WGS and WES?

WGS sequences the entire genome (~3.2 billion base pairs in humans), while WES targets only the protein-coding exons (~1–2% of the genome). WGS provides broader variant detection, including structural variants and non-coding regulatory regions, but at higher cost per sample. WES is more cost-efficient for identifying protein-altering mutations. The right choice depends on research goals, budget, and the variant types being investigated.

What are the main applications of WGS in drug discovery?

In drug discovery, WGS is used for: identifying novel drug targets from disease-associated genomic variants; pharmacogenomics and drug response prediction; resistance mechanism profiling in oncology and infectious disease; biomarker discovery for patient stratification; and population genomics studies that inform therapeutic development programs.

How does the WGS bioinformatics pipeline work?

A standard WGS pipeline involves: quality control and trimming of raw reads; alignment to the reference genome; duplicate marking and base quality score recalibration; variant calling; multi-database variant annotation; filtering and prioritization; and downstream biological interpretation or clinical reporting.

What sequencing depth is required for WGS?

Coverage depth depends on application: 30× is standard for germline variant detection; 60–100× is used for somatic cancer profiling; and low-pass WGS (4–8×) is sufficient for population-level copy number and GWAS studies.

How is AI being used in WGS data analysis?

AI is applied to WGS for more accurate variant calling (deep learning-based callers like DeepVariant), automated structural variant detection, somatic mutation signature identification, and predictive models linking genomic profiles to disease phenotypes or drug responses. AI pipelines also reduce turnaround time and improve reproducibility in large-scale WGS studies.
