Machine Learning for Predicting Potential Insecticidal Proteins

Overview

This case study demonstrates how Excelra’s ML-driven bioinformatics pipeline supported the discovery of insecticidal proteins using public datasets. Through homology analysis, data annotation, and predictive modeling, the project enabled identification of high-confidence candidates. The pipeline adhered to FAIR Data principles and leveraged comprehensive data curation to ensure reproducibility and scientific rigor.

Our client

A US-based biotech company specializing in protein discovery for insecticidal applications partnered with Excelra to accelerate its AI/ML research pipeline. The collaboration focused on building a scalable bioinformatics pipeline and providing manuscript support for publication.

Client’s challenge

The client aimed to use machine learning to identify novel insecticidal proteins from high-quality public datasets. A primary challenge was validating its proprietary asset through homology analysis and building a credible scientific foundation for publication. This required deep expertise in bioinformatics pipelines, data curation, and ML modeling.

Client’s goals

  • Predict insecticidal proteins from public datasets
  • Validate their asset using sequence and structure homology analysis
  • Build a scientifically sound manuscript supported by FAIR Data principles
  • Accelerate protein discovery and increase model prediction accuracy

Our approach

Our team utilized a structured ML workflow integrated with our internal bioinformatics pipeline, leveraging principles of data curation and FAIR data to ensure reproducibility. The approach involved:

Training data acquisition and annotation

  • Extracted public datasets (Crickmore, UniProt): Retrieved curated insecticidal protein sequences and associated metadata from trusted public databases such as the Crickmore database and UniProt, ensuring a high-quality, FAIR-compliant foundation for model training.
  • Annotated proteins to prepare ML-ready data: Enriched raw sequences with functional annotations, domain info, and structural features via our bioinformatics pipeline, aligned with FAIR data principles.
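
To make "ML-ready data" concrete, here is a minimal sketch of turning one raw sequence into a feature row. It assumes amino acid composition as the feature set, which is a common baseline and not necessarily what the pipeline used. The function names, the record identifier, and the toy sequence are all hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_features(sequence: str) -> dict[str, float]:
    """Return the fraction of each standard amino acid in a sequence."""
    seq = sequence.upper()
    counts = Counter(res for res in seq if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def annotate(record_id: str, sequence: str, label: int) -> dict:
    """Bundle identifier, label, length, and composition into one row."""
    row = {"id": record_id, "label": label, "length": len(sequence)}
    row.update(composition_features(sequence))
    return row


# Toy sequence, made up for illustration only (not a real protein).
row = annotate("toy_protein_1", "MKTAYIAKQR", label=1)
```

In practice each row would also carry the domain and structural annotations described above, but those depend on external tools and are left out of this sketch.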

Data processing, model building, optimization, and screening

  • Feature selection: Identified relevant features using techniques aligned with computational biology best practices.
  • Model building: Built predictive ML models with over 90% classification accuracy to support high-confidence protein discovery.
  • Screened proteins: Applied models to identify novel insecticidal protein candidates, reinforced by rigorous data curation.
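
The case study does not name the model family, so the following is only a sketch of the train, evaluate, and screen loop. It uses a nearest-centroid classifier on amino acid composition as a stand-in. The class labels, function names, and sequences are hypothetical toy values, not project data.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def featurize(seq):
    """Amino acid composition vector (20 fractions)."""
    counts = Counter(seq.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS) or 1
    return [counts[aa] / total for aa in AMINO_ACIDS]


def train(labelled):
    """labelled: list of (sequence, label) pairs. Returns {label: centroid}."""
    by_label = {}
    for seq, label in labelled:
        by_label.setdefault(label, []).append(featurize(seq))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in by_label.items()
    }


def predict(model, seq):
    """Assign the label whose centroid is closest in Euclidean distance."""
    vec = featurize(seq)
    return min(model, key=lambda label: math.dist(vec, model[label]))


def accuracy(model, labelled):
    """Fraction of (sequence, label) pairs predicted correctly."""
    hits = sum(predict(model, seq) == label for seq, label in labelled)
    return hits / len(labelled)


# Toy training set, invented for illustration only.
train_set = [
    ("KKKKRRKKAA", "insecticidal"),
    ("RKRKKKRRGA", "insecticidal"),
    ("DDEEDDEEAA", "other"),
    ("EEDDEDEDGA", "other"),
]
model = train(train_set)

# Screening step: keep only candidates predicted as insecticidal.
candidates = {"cand_1": "KRKRKRKKAG", "cand_2": "DEDEDEDEAG"}
screened = [cid for cid, seq in candidates.items()
            if predict(model, seq) == "insecticidal"]
```

Accuracy measured on the training set itself says little about a model. A figure such as the one reported above would normally come from held-out data or cross-validation.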

Shortlisting of classified insecticidal proteins

  • Performed conserved domain analysis: Identified and validated functionally significant motifs within the classified insecticidal proteins, supporting the shortlisting of candidates with established or putative insecticidal activity.
  • Shortlisted proteins: Prioritized proteins containing well-characterized motifs known to be associated with insecticidal mechanisms, refining the pool for downstream functional studies.
  • Compiled a final list of potential insecticidal proteins: Consolidated high-confidence candidates into a curated list for experimental validation.
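
As a rough illustration of motif-based shortlisting, the sketch below scans candidates with regular expressions and keeps those carrying a required number of motifs. Conserved domain analysis is normally done with profile-based tools such as NCBI CD-Search or HMMER against Pfam, so the regex scan is a deliberate simplification. The motif patterns and candidate sequences are invented for the example and are not real insecticidal motifs.

```python
import re

# Hypothetical motif patterns, invented for illustration only.
MOTIFS = {
    "toy_motif_A": re.compile(r"G.{2}[ST]W"),
    "toy_motif_B": re.compile(r"C.{3}C"),
}


def find_motifs(sequence):
    """Return {motif_name: [start positions]} for motifs with at least one hit."""
    hits = {}
    for name, pattern in MOTIFS.items():
        positions = [m.start() for m in pattern.finditer(sequence)]
        if positions:
            hits[name] = positions
    return hits


def shortlist(candidates, required=1):
    """Keep candidates that carry at least `required` distinct motifs."""
    kept = []
    for cand_id, seq in candidates.items():
        hits = find_motifs(seq)
        if len(hits) >= required:
            kept.append((cand_id, hits))
    return kept


# Toy candidates, made up for illustration only.
candidates = {
    "cand_1": "MAGKLSWPPCAAACK",
    "cand_2": "MKKLLPPQQ",
}
result = shortlist(candidates)
```

Raising `required` makes the shortlist stricter, which is one simple way to trade candidate count against confidence.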

Justification for client’s asset

  • Sequence homology analysis: Conducted precise comparisons to known insecticidal proteins for functional alignment, applying computational biology techniques.
  • Structure homology analysis: Validated candidates via 3D alignment tools, part of Excelra’s broader computational biology services.
  • Hypothesis formulation: Synthesized findings from homology, annotations, and scoring to position the client’s protein as a validated lead asset in protein discovery.
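
The sequence comparison step can be sketched as a global alignment followed by a percent identity calculation. This is a textbook Needleman-Wunsch implementation with simple match, mismatch, and gap scores chosen for the example. Production work would more likely rely on BLAST or a similar tool with a substitution matrix, and the sequences here are toy values. Structure homology is not covered by this sketch.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment. Returns the two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical aligned columns as a percentage of alignment length."""
    if not a and not b:
        raise ValueError("both sequences are empty")
    aligned_a, aligned_b = global_align(a, b)
    same = sum(x == y for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


# Toy sequences, made up for illustration only.
aligned = global_align("MKTAYIAK", "MKTAIAK")
identity = percent_identity("MKTAYIAK", "MKTAIAK")
```

Percent identity has several definitions that differ in the denominator. This one divides by the full alignment length including gap columns, so the choice should be stated whenever the number is reported.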

Manuscript support

  • Supported high-impact manuscript development with expert scientific writing, visualizations, and fully reproducible code hosted on GitHub, in line with FAIR data principles.
  • Implemented version control and metadata tagging for structured data curation, enabling transparent peer review and publication.

Our solution

Through a structured bioinformatics pipeline, Excelra accelerated the client’s ML workflow and enabled rigorous homology analysis for validation. This process supported data-backed prioritization of high-confidence insecticidal protein candidates. Adhering to FAIR Data principles and leveraging domain-specific data curation, we provided comprehensive support for publication.

Conclusion

Excelra’s machine learning–based workflow reduced protein screening time by over 70% and achieved prediction accuracy above 90%. The client’s proprietary protein ranked in the top 5% of predicted candidates. Our expertise in protein discovery, homology analysis, and bioinformatics pipelines not only validated the client’s asset but also laid a strong foundation for peer-reviewed publication. The streamlined manuscript support process, rooted in FAIR Data and high-quality data curation, reduced preparation time by 50%, amplifying the client’s impact in the scientific community.