Overview
This case study shows how Excelra’s ML-driven bioinformatics pipeline supported the discovery of insecticidal proteins using public datasets. Through homology analysis, data annotation, and predictive modeling, the project identified high-confidence candidates. The pipeline followed the FAIR data principles (Findable, Accessible, Interoperable, Reusable) and relied on comprehensive data curation to ensure reproducibility and scientific rigor.

Our client
A US-based biotech company specializing in protein discovery for insecticidal applications partnered with Excelra to accelerate its AI/ML research pipeline. The collaboration focused on building a scalable bioinformatics pipeline and manuscript support for publication.

Client’s challenge
The client aimed to use machine learning to identify novel insecticidal proteins from high-quality public datasets. A primary challenge was validating its proprietary asset through homology analysis and building a credible scientific foundation for publication. This required deep expertise in bioinformatics pipelines, data curation, and ML modeling.

Client’s goals
- Predict insecticidal proteins from public datasets
- Validate their asset using sequence and structure homology analysis
- Build a scientifically sound manuscript supported by FAIR Data principles
- Accelerate protein discovery and increase model prediction accuracy
Our approach
Our team utilized a structured ML workflow integrated with our internal bioinformatics pipeline, leveraging principles of data curation and FAIR data to ensure reproducibility. The approach involved:
Training data acquisition and annotation
- Extracted public datasets (Crickmore, UniProt): Retrieved curated insecticidal protein sequences and associated metadata from trusted public databases such as the Crickmore database and UniProt, ensuring a high-quality, FAIR-compliant foundation for model training.
- Annotated proteins to prepare ML-ready data: Enriched raw sequences with functional annotations, domain info, and structural features via our bioinformatics pipeline, aligned with FAIR data principles.
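As a rough illustration of what "ML-ready" can mean here, the sketch below parses FASTA-formatted text and turns each sequence into a 20-value amino-acid composition vector. It is a minimal example under stated assumptions, not Excelra’s pipeline: the function names (`parse_fasta`, `composition`) and the toy sequences are invented for illustration, and real annotation would also carry domain and structural features.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def parse_fasta(text):
    """Parse FASTA-formatted text into a {record_id: sequence} dict."""
    chunks = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first token of the header as the record id.
            current = line[1:].split()[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line.upper())
    return {name: "".join(parts) for name, parts in chunks.items()}


def composition(seq):
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(seq)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy, made-up sequences for illustration only (not real database entries).
toy_fasta = """>toy_protein_1 illustrative
MKTAYIAKQR
QISFVKSHFS
>toy_protein_2 illustrative
GGGGAAAA
"""

records = parse_fasta(toy_fasta)
features = {name: composition(seq) for name, seq in records.items()}
```

Composition vectors are only one simple feature family. They are used here because they need no external tools, which keeps the example self-contained.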
Data processing, model building, optimization, and screening
- Feature selection: Identified relevant features using techniques aligned with computational biology best practices.
- Model building: Built predictive ML models with over 90% classification accuracy to support high-confidence protein discovery.
- Screened proteins: Applied models to identify novel insecticidal protein candidates, reinforced by rigorous data curation.
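The train, evaluate, and screen loop described above can be sketched as follows. The case study does not name the model family, so this example substitutes a nearest-centroid classifier over amino-acid composition purely as a stand-in. All names, the 0.5 threshold, and the toy sequences are assumptions made for illustration, and the toy accuracy says nothing about the figures reported in this case study.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
POSITIVE = "insecticidal"  # hypothetical label name


def composition(seq):
    """Fraction of each standard amino acid in the sequence."""
    total = sum(seq.count(aa) for aa in AMINO_ACIDS) or 1
    return [seq.count(aa) / total for aa in AMINO_ACIDS]


def fit(labelled):
    """Compute one mean feature vector (centroid) per class label."""
    grouped = {}
    for seq, label in labelled:
        grouped.setdefault(label, []).append(composition(seq))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def score(model, seq):
    """Score in [0, 1]; higher means closer to the positive centroid."""
    x = composition(seq)
    d_pos = math.dist(x, model[POSITIVE])
    d_neg = min(math.dist(x, c) for label, c in model.items() if label != POSITIVE)
    total = d_pos + d_neg
    return 0.5 if total == 0 else d_neg / total


def predict(model, seq, threshold=0.5):
    return POSITIVE if score(model, seq) > threshold else "other"


def accuracy(model, labelled):
    """Share of held-out sequences whose predicted label is correct."""
    correct = sum(predict(model, seq) == label for seq, label in labelled)
    return correct / len(labelled)


def screen(model, candidates, threshold=0.5):
    """Return (name, score) pairs above the threshold, best first."""
    scored = [(name, score(model, seq)) for name, seq in candidates.items()]
    hits = [(name, s) for name, s in scored if s > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Toy, made-up sequences; they carry no biological meaning.
train = [
    ("WWYYWKWY", POSITIVE), ("YWWYYWLW", POSITIVE),
    ("GGAAGAGS", "other"), ("AGGAAGGT", "other"),
]
held_out = [("WYWYWWYA", POSITIVE), ("GAGAGGAW", "other")]
candidates = {"cand_1": "WWWYYYGA", "cand_2": "GGGAAAWY", "cand_3": "WYWYWYWY"}

model = fit(train)
held_out_accuracy = accuracy(model, held_out)
hits = screen(model, candidates)
```

Evaluating on sequences that were not used for fitting, as `held_out` does here, is what makes an accuracy figure meaningful. A production workflow would typically use cross-validation and a stronger model.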
Shortlisting of classified insecticidal proteins
- Conserved domain analysis: Identified and validated functionally significant motifs within the classified insecticidal proteins, supporting the shortlisting of candidates with established or putative insecticidal activity.
- Shortlisted proteins: Prioritized proteins containing well-characterized motifs associated with insecticidal mechanisms, refining the pool for downstream functional studies.
- Compiled a final list of potential insecticidal proteins: Consolidated high-confidence candidates into a curated list for experimental validation.
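The shortlisting logic above can be illustrated with a simplified motif scan. Real conserved domain analysis generally relies on profile-based searches against curated domain databases. The regular expressions below are a deliberately simple substitute, and every pattern, name, and sequence is hypothetical.

```python
import re

# Hypothetical motif patterns, invented for illustration only.
# They do not correspond to any real insecticidal domain signature.
MOTIFS = {
    "motif_A": re.compile(r"C..C"),
    "motif_B": re.compile(r"G[ST]G.K"),
}


def find_motifs(seq):
    """Map each matching motif name to its 1-based start positions."""
    hits = {}
    for name, pattern in MOTIFS.items():
        positions = [m.start() + 1 for m in pattern.finditer(seq)]
        if positions:
            hits[name] = positions
    return hits


def shortlist(classified, required=1):
    """Keep proteins with at least `required` distinct motifs, most motifs first."""
    counted = [(name, len(find_motifs(seq))) for name, seq in classified.items()]
    kept = [(name, n) for name, n in counted if n >= required]
    # sorted() is stable, so ties keep their original input order.
    return [name for name, _ in sorted(kept, key=lambda item: item[1], reverse=True)]


# Toy, made-up sequences standing in for model-classified candidates.
classified = {
    "cand_1": "MKCAACLLGSGAKT",
    "cand_2": "MKLLVVAAPP",
    "cand_3": "AACTTCQQ",
}

cand_1_hits = find_motifs(classified["cand_1"])
final_list = shortlist(classified)
```

Raising `required` tightens the filter. With `required=2`, only candidates carrying both hypothetical motifs would remain.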
Justification for client’s asset
- Sequence homology analysis: Conducted precise comparisons to known insecticidal proteins for functional alignment, applying computational biology techniques.
- Structure homology analysis: Validated candidates via 3D alignment tools, part of Excelra’s broader computational biology services.
- Hypothesis formulation: Synthesized findings from homology, annotations, and scoring to position the client’s protein as a validated lead asset in protein discovery.
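To make the sequence homology step concrete, the sketch below computes percent identity from a Needleman-Wunsch global alignment. It assumes a simple match/mismatch score and a linear gap penalty, whereas dedicated homology tools typically use substitution matrices and affine gap costs. The scoring values and toy sequences are illustrative choices, not parameters taken from this project.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical aligned positions as a percentage of alignment length."""
    aligned_a, aligned_b = global_align(a, b)
    if not aligned_a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


# Toy, made-up sequences differing at a single position.
identity = percent_identity("MKTAYIAK", "MKTAHIAK")
```

Percent identity depends on the denominator chosen. This sketch divides by the full alignment length including gaps, which is one common convention among several.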
Manuscript support
- Supported high-impact manuscript development with expert scientific writing, visualizations, and fully reproducible code hosted on GitHub, in line with FAIR data principles.
- Implemented version control and metadata tagging for structured data curation, enabling transparent peer review and publication.
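One way to picture metadata tagging for a curated dataset is a small manifest record that stores provenance fields alongside a content checksum, so reviewers can confirm they are looking at the same file. The sketch below is an assumption about how such tagging could look. The field names, the `tag_dataset` and `verify` helpers, and the toy content are all hypothetical.

```python
import hashlib
import json


def tag_dataset(name, content, source, version):
    """Build a metadata record with a SHA-256 checksum of the file content."""
    return {
        "name": name,
        "source": source,
        "version": version,
        "size_bytes": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def verify(record, content):
    """Check that content still matches the checksum stored in the record."""
    return record["sha256"] == hashlib.sha256(content).hexdigest()


# Toy, made-up file content; not a real dataset.
toy_content = b">toy_protein_1\nMKTAYIAKQR\n"
record = tag_dataset(
    name="toy_training_set.fasta",
    content=toy_content,
    source="illustrative example, not a real download",
    version="0.1.0",
)

# Sorted keys keep the serialized manifest stable across runs, which helps
# when the manifest itself is kept under version control.
manifest = json.dumps(record, indent=2, sort_keys=True)
```

Committing a manifest like this next to the analysis code lets a later `verify` call detect any silent change to the underlying data.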
Our solution
Through a structured bioinformatics pipeline, Excelra accelerated the client’s ML workflow and enabled rigorous homology analysis for validation. This process supported data-backed prioritization of high-confidence insecticidal protein candidates. Adhering to FAIR Data principles and leveraging domain-specific data curation, we provided comprehensive support for publication.

Conclusion
Excelra’s machine learning–based workflow reduced protein screening time by over 70% and raised prediction accuracy above 90%. The client’s proprietary protein ranked in the top 5% of predicted candidates. Our expertise in protein discovery, homology analysis, and bioinformatics pipelines not only validated the client’s asset but also laid a strong foundation for peer-reviewed publication. The streamlined manuscript support process, rooted in FAIR data principles and high-quality data curation, reduced preparation time by 50%, amplifying the client’s impact in the scientific community.