Machine Learning for Predicting Potential Insecticidal Proteins

Overview

This case study demonstrates how Excelra’s ML-driven bioinformatics pipeline supported the discovery of insecticidal proteins using public datasets. Through homology analysis, data annotation, and predictive modeling, the project enabled identification of high-confidence candidates. The pipeline adhered to FAIR Data principles and leveraged comprehensive data curation to ensure reproducibility and scientific rigor.

Our client

A US-based biotech company specializing in protein discovery for insecticidal applications partnered with Excelra to accelerate its AI/ML research pipeline. The collaboration focused on building a scalable bioinformatics pipeline and manuscript support for publication.

Client’s challenge

The client aimed to harness machine learning to identify novel insecticidal proteins using high-quality public datasets. A primary challenge was validating their proprietary asset using homology analysis and developing a credible scientific foundation for publication. This required deep expertise in bioinformatics pipelines, data curation, and ML modeling.

Client’s goals

Predict insecticidal proteins from public datasets
Validate their asset using sequence and structure homology analysis
Build a scientifically sound manuscript supported by FAIR Data principles
Accelerate protein discovery and increase model prediction accuracy

Our approach

Our team utilized a structured ML workflow integrated with our internal bioinformatics pipeline, leveraging principles of data curation and FAIR data to ensure reproducibility. The approach involved:

Training data acquisition and annotation

Extracted public datasets (Crickmore, UniProt): Retrieved curated insecticidal protein sequences and associated metadata from trusted public databases such as the Crickmore database and UniProt, ensuring a high-quality, FAIR-compliant foundation for model training.
Annotated proteins to prepare ML-ready data: Enriched raw sequences with functional annotations, domain info, and structural features via our bioinformatics pipeline, aligned with FAIR data principles.
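
The annotation step above can be pictured with a small sketch. The case study does not say which features were derived, so the amino acid composition used here is only one common choice, and the function names and toy sequences are assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def parse_fasta(text):
    """Parse FASTA-formatted text into a {identifier: sequence} dict."""
    records, current_id = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first token after '>'.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line.upper()
    return records


def composition_features(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequences invented for this example, not real insecticidal proteins.
example_fasta = """>toy_protein_1 hypothetical example
MKKAAL
LAAK
>toy_protein_2
MWWY
"""

records = parse_fasta(example_fasta)
features = {pid: composition_features(seq) for pid, seq in records.items()}
```

Each protein becomes a fixed-length vector of 20 fractions, which is the kind of ML-ready representation a downstream classifier can consume regardless of sequence length.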

Data processing, model building, optimization, and screening

Feature selection: Identified relevant features using techniques aligned with computational biology best practices.
Model building: Built predictive ML models with over 90% classification accuracy to support high-confidence protein discovery.
Screened proteins: Applied models to identify novel insecticidal protein candidates, reinforced by rigorous data curation.
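
As a rough illustration of the build-and-evaluate loop described above, the sketch below trains a nearest-centroid classifier and measures accuracy on held-out samples. The case study does not name the algorithm that was used, so nearest-centroid is a stand-in chosen for brevity, and the two-feature toy data and labels are invented.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_nearest_centroid(X, y):
    """Compute one centroid per class label."""
    by_label = {}
    for features, label in zip(X, y):
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], features))


def accuracy(model, X, y):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(model, f) == label for f, label in zip(X, y))
    return correct / len(y)


# Toy two-feature data; real inputs would be annotated protein features.
X_train = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y_train = ["insecticidal", "insecticidal", "other", "other"]
X_test = [[0.7, 0.3], [0.3, 0.7]]
y_test = ["insecticidal", "other"]

model = fit_nearest_centroid(X_train, y_train)
held_out_accuracy = accuracy(model, X_test, y_test)
```

Accuracy is computed on samples the model never saw during fitting, which is the property that makes a figure such as "over 90% classification accuracy" meaningful for screening unseen proteins.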

Shortlisting of classified insecticidal proteins

Performed conserved domain analysis: Identified and validated functionally significant motifs within the classified insecticidal proteins, supporting the shortlisting of candidates with established or putative insecticidal activity.
Shortlisted proteins: Prioritized proteins containing well-characterized motifs associated with insecticidal mechanisms, refining the pool for downstream functional studies.
Compiled a final list of potential insecticidal proteins: Consolidated high-confidence candidates into a curated list for experimental validation.
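
The shortlisting logic above can be sketched as a motif filter. This is a deliberate simplification: conserved domain analysis is more commonly done with profile-based searches, and the case study does not name the tool or the motifs involved. The regular expressions, motif names, and candidate sequences below are placeholders invented for illustration.

```python
import re

# Placeholder patterns for illustration only; they are NOT real
# insecticidal motifs and are not taken from the case study.
HYPOTHETICAL_MOTIFS = {
    "motif_A": re.compile(r"GD[ST]W"),
    "motif_B": re.compile(r"C..C.{2,4}H"),
}


def find_motifs(sequence):
    """Return {motif_name: [start positions]} for every motif that matches."""
    hits = {}
    for name, pattern in HYPOTHETICAL_MOTIFS.items():
        starts = [match.start() for match in pattern.finditer(sequence)]
        if starts:
            hits[name] = starts
    return hits


def shortlist(candidates, min_motifs=1):
    """Keep candidates carrying at least `min_motifs` distinct motifs."""
    kept = {}
    for protein_id, sequence in candidates.items():
        hits = find_motifs(sequence)
        if len(hits) >= min_motifs:
            kept[protein_id] = hits
    return kept


# Invented candidate sequences standing in for model-classified proteins.
candidates = {
    "cand_1": "MKGDSWLLCAACPPH",
    "cand_2": "MKLLVVAA",
    "cand_3": "AAGDTWAA",
}
shortlisted = shortlist(candidates)
```

Raising `min_motifs` tightens the filter, which mirrors the idea of prioritizing candidates with more supporting evidence before experimental validation.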

Justification for client’s asset

Sequence homology analysis: Conducted precise comparisons to known insecticidal proteins for functional alignment, applying computational biology techniques.
Structure homology analysis: Validated candidates via 3D alignment tools, part of Excelra’s broader computational biology services.
Hypothesis formulation: Synthesized findings from homology, annotations, and scoring to position the client’s protein as a validated lead asset in protein discovery.
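
To make the sequence homology step concrete, the sketch below computes percent identity from a Needleman-Wunsch global alignment. The case study does not specify the alignment tool or scoring scheme, so the simple match/mismatch/gap scores are assumptions (production searches typically rely on substitution matrices), and the two short sequences are invented.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns the two aligned strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_a.append(a[i - 1])
                aligned_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


def percent_identity(a, b):
    """Identical aligned positions as a percentage of alignment length."""
    aligned_a, aligned_b = global_alignment(a, b)
    if not aligned_a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


# Invented toy sequences: the reference lacks one residue of the query.
query = "MKTAYIAK"
reference = "MKTAIAK"
identity = percent_identity(query, reference)
```

Percent identity against known insecticidal proteins gives a single comparable number per candidate, which is one simple input to the kind of evidence synthesis described under hypothesis formulation.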

Manuscript support

Supported high-impact manuscript development with expert scientific writing, visualizations, and fully reproducible code hosted on GitHub, in line with FAIR data principles.
Implemented version control and metadata tagging for structured data curation, enabling transparent peer review and publication.
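
One way to picture metadata tagging for reproducibility is a small record that ties a dataset file to its source, version, and checksum. The field names, the version string, and the helper functions below are hypothetical; the case study does not describe the actual metadata schema.

```python
import hashlib
import json


def build_metadata(dataset_name, content, source, version):
    """Build a small metadata record with a SHA-256 checksum of the content."""
    return {
        "name": dataset_name,
        "source": source,
        "version": version,
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
    }


def verify(content, record):
    """Check that the content still matches the checksum in its record."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]


# Invented file content standing in for a curated sequence file.
content = b">toy_protein_1\nMKKAALLAAK\n"
record = build_metadata(
    dataset_name="toy_training_set",
    content=content,
    source="UniProt",
    version="example-v1",  # hypothetical version label
)

# Serialize the record so it can be committed alongside the data.
record_json = json.dumps(record, indent=2, sort_keys=True)
```

A reviewer who downloads the data can recompute the checksum and confirm it matches the committed record, which supports the transparent peer review mentioned above.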

Our solution

Through a structured bioinformatics pipeline, Excelra accelerated the client’s ML workflow and enabled rigorous homology analysis for validation. This process supported data-backed prioritization of high-confidence insecticidal protein candidates. Adhering to FAIR Data principles and leveraging domain-specific data curation, we provided comprehensive support for publication.

Conclusion

Excelra’s machine learning-based workflow reduced protein screening time by over 70% and raised prediction accuracy above 90%. The client’s proprietary protein ranked in the top 5% of predicted candidates. Our expertise in protein discovery, homology analysis, and bioinformatics pipelines not only validated their asset but also laid a strong foundation for peer-reviewed publication. The streamlined manuscript support process, rooted in FAIR data principles and high-quality data curation, reduced preparation time by 50%, amplifying the client’s impact in the scientific community.

