What are structured datasets for AI/ML in drug discovery?

Structured datasets are high-quality, normalized data that have been processed from heterogeneous sources into a relational format, with fields categorized and standardized using controlled vocabularies. This preparation is essential to enable reliable training and use of AI/ML algorithms for drug discovery and prediction.

How does GOSTAR™ support AI/ML-based drug discovery?

GOSTAR™ provides a vast, manually curated database containing chemistry, biological, pharmacological, and therapeutic data on millions of compounds. This data is transformed into a structured, analysis-ready format and can be directly integrated into a client's AI/ML platform, serving as the essential, high-quality training material for virtual hit identification and lead optimization models.

What type of data is included in GOSTAR™ for AI/ML modeling?

GOSTAR™ datasets include Chemistry (structural representations, physicochemical properties), Biological (target names, mechanism of action, binding affinity), Pharmacological (ADMET data, in-vitro/in-vivo assay results), and Therapeutic (indication names, clinical status, adverse events) information.

Case studies

Structured and analysis-ready data for AI/ML-based drug discovery

Overview

A biotech company in the US was seeking to employ AI/ML technologies to identify potential small molecules for therapeutic development, specifically focusing on the critical areas of oncology and renal fibrosis. Excelra addressed this need by transforming vast amounts of heterogeneous and unstructured data, captured from various sources, into a highly structured relational database format within the GOSTAR™ platform.

Our client

Our client is a US-based biotech company that applies AI/ML technologies to identify small molecules for therapeutic development, with a focus on oncology and renal fibrosis.

Client’s challenge

The client’s main challenge was the need for high-quality, harmonized, and structured datasets of small molecules. They required comprehensive chemical, biological, and pharmacological data to be standardized and ready for immediate use in their internal platforms for virtual hit-identification.

Client’s goals

The primary objective for the client was to integrate these standardized small molecule datasets into their internal AI/ML platform for algorithm training. This integration was critical for model building, activity/property prediction, and ultimately supporting hit identification and lead optimization in oncology and renal fibrosis.

Our approach

Excelra’s approach centered on leveraging the Global Online Structure Activity Relationship Database (GOSTAR™). The methodology involved a multi-step process to ensure data quality and integrity:

Data transformation

Heterogeneous and unstructured data gathered from various sources were transformed into a structured, relational database format within GOSTAR™.
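As a rough illustration of what a structured, relational layout for compound, target, and activity data can look like, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample values are hypothetical and are not the GOSTAR™ schema; the accession and activity value are placeholders, not real measurements.

```python
import sqlite3

# Hypothetical, simplified schema: all table and column names are
# illustrative assumptions, not the actual GOSTAR schema.
SCHEMA = """
CREATE TABLE compound (
    compound_id INTEGER PRIMARY KEY,
    smiles      TEXT NOT NULL
);
CREATE TABLE target (
    target_id   INTEGER PRIMARY KEY,
    uniprot_id  TEXT NOT NULL UNIQUE,
    name        TEXT NOT NULL
);
CREATE TABLE activity (
    activity_id INTEGER PRIMARY KEY,
    compound_id INTEGER NOT NULL REFERENCES compound(compound_id),
    target_id   INTEGER NOT NULL REFERENCES target(target_id),
    endpoint    TEXT NOT NULL,
    value_nm    REAL NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding one placeholder record per table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Placeholder rows: the accession and value below are made up.
    conn.execute("INSERT INTO compound VALUES (1, 'CCO')")
    conn.execute("INSERT INTO target VALUES (1, 'P00000', 'Example target')")
    conn.execute("INSERT INTO activity VALUES (1, 1, 1, 'IC50', 250.0)")
    conn.commit()
    return conn


def activities_for_target(conn: sqlite3.Connection, uniprot_id: str) -> list:
    """Return (smiles, endpoint, value_nm) tuples for one target accession."""
    query = """
        SELECT c.smiles, a.endpoint, a.value_nm
        FROM activity AS a
        JOIN compound AS c ON c.compound_id = a.compound_id
        JOIN target AS t ON t.target_id = a.target_id
        WHERE t.uniprot_id = ?
    """
    return conn.execute(query, (uniprot_id,)).fetchall()
```

Keeping compounds, targets, and activities in separate tables linked by keys means each entity is stored once, so a model-training query can pull every measurement for a target with a single join.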

Quality control

All content in GOSTAR™ is captured manually and must pass through a rigorous 3-step quality control process.

Data normalization

The normalized and structured datasets, including Structure Activity Relationship (SAR), physicochemical properties, and ADMET parameters, were then prepared for integration.
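One common form of normalization for activity data is converting potency values reported in mixed units onto a single scale. The sketch below assumes IC50 values arrive with a unit label and converts them to nanomolar and to pIC50 (the negative base-10 logarithm of the molar concentration). The function names and unit labels are illustrative assumptions; the source does not state which conventions GOSTAR™ uses.

```python
import math

# Conversion factors from common concentration units to molar (M).
# The unit labels are an assumed convention for this sketch.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_nanomolar(value: float, unit: str) -> float:
    """Convert a concentration in the given unit to nanomolar."""
    if unit not in UNIT_TO_MOLAR:
        raise ValueError(f"Unsupported unit: {unit!r}")
    return value * UNIT_TO_MOLAR[unit] / UNIT_TO_MOLAR["nM"]


def pic50(value: float, unit: str) -> float:
    """Return pIC50 = -log10(IC50 expressed in molar)."""
    if unit not in UNIT_TO_MOLAR:
        raise ValueError(f"Unsupported unit: {unit!r}")
    if value <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])
```

For example, an IC50 of 1 uM corresponds to a pIC50 of 6, and 10 nM corresponds to 8, so values from different papers become directly comparable regression targets.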

Integration

The final analysis-ready data was integrated directly into the client’s internal platform for training the AI/ML algorithms.

Our solution

Excelra provided high-quality, annotated datasets tagged to standard identifiers (such as Entrez Gene IDs or UniProt protein identifiers) and used controlled vocabularies to simplify data integration. The delivered data fell into four dataset types, enabling the client to build effective AI/ML models:

Biological datasets

Provided insights into disease mechanisms and potential target proteins, including target names, target family information, mechanism of action, and binding affinity data.

Chemistry datasets

Facilitated the design of high-throughput screening libraries, including chemical structural representations, line notations (SMILES & InChI), molecular property descriptors, and compound-specific biological data. This aligns directly with advanced applications in cheminformatics.

Pharmacological datasets

Included ADMET (absorption, distribution, metabolism, excretion, and toxicity) data, along with functional in vitro and in vivo assay properties.

Therapeutic datasets

Offered patient-related information such as indication names, safety and efficacy data, clinical/drug status, and adverse events or side-effects information. These datasets feed into higher-level services like Data Science.
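To make the idea of tagging records to standard identifiers through a controlled vocabulary concrete, the following minimal sketch maps free-text target names to a preferred name and a UniProt accession. The synonym table is a tiny hand-written assumption for illustration (EGFR, UniProt P00533, with two of its common synonyms), and the function and field names are hypothetical; a curated vocabulary would be far larger.

```python
from typing import Optional, Tuple

# Hypothetical controlled vocabulary: normalized synonym -> (preferred name,
# UniProt accession). Only a single target is listed, for illustration.
TARGET_SYNONYMS = {
    "egfr": ("EGFR", "P00533"),
    "erbb1": ("EGFR", "P00533"),
    "her1": ("EGFR", "P00533"),
}


def normalize_target(raw_name: str) -> Optional[Tuple[str, str]]:
    """Map a free-text target name to (preferred name, accession), or None."""
    key = raw_name.strip().lower().replace("-", "").replace(" ", "")
    return TARGET_SYNONYMS.get(key)


def annotate(records: list) -> list:
    """Add standardized target fields; unmapped names are flagged for review."""
    annotated = []
    for record in records:
        match = normalize_target(record["target"])
        annotated.append({
            **record,
            "target_name": match[0] if match else None,
            "uniprot_id": match[1] if match else None,
            "needs_review": match is None,
        })
    return annotated
```

Once every record carries the same accession for the same protein, data from different sources can be joined on that identifier rather than on inconsistent free-text names.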

Flexible data delivery

GOSTAR™ data can be delivered in various formats, including relational database formats (Oracle, PostgreSQL, MySQL), flat-file formats (CSV, XML, XLS), chemistry-specific formats (SDF, RDfile), and semantic web formats (RDF, Turtle).
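As an example of consuming a flat-file delivery, the sketch below reads a CSV export into typed rows ready for feature generation. The column names and sample rows are assumptions made for this illustration and do not reflect the actual GOSTAR™ export layout; the accession and values are placeholders.

```python
import csv
import io

# Hypothetical CSV layout with placeholder values; a real export's
# columns may differ.
SAMPLE_CSV = """compound_id,smiles,target_uniprot,endpoint,value_nm
1,CCO,P00000,IC50,250.0
2,CCN,P00000,IC50,
"""


def load_activity_rows(handle) -> list:
    """Parse activity rows from a CSV file handle, skipping rows with no value."""
    rows = []
    for record in csv.DictReader(handle):
        if not record["value_nm"]:
            continue  # no measured value, so the row is unusable for training
        rows.append({
            "compound_id": int(record["compound_id"]),
            "smiles": record["smiles"],
            "target_uniprot": record["target_uniprot"],
            "endpoint": record["endpoint"],
            "value_nm": float(record["value_nm"]),
        })
    return rows
```

The same function works on a file opened with `open(path, newline="")`; `io.StringIO(SAMPLE_CSV)` can stand in for a file when trying it out.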

GOSTAR™ curation process showing the transformation of unstructured data through quality control into structured datasets for AI/ML model training.

Structured data for AI/ML workflow illustrating data preparation for drug discovery models.

Conclusion

By providing these manually curated, structured, and analysis-ready datasets, Excelra enabled the client to overcome the challenge of data heterogeneity. The delivered data was immediately usable for training the client’s AI/ML algorithms, supporting model building and activity/property prediction for virtual hit identification and lead optimization in drug discovery.
