“Big Data” refers to any data that requires investment in technology and infrastructure before meaningful insights can be drawn from it. Two main factors contribute to Big Data: the exponential growth in data volume due to increased usage, and the need to integrate these datasets to gain valuable insights. The data generated in drug discovery processes is a good example.
This blog describes the main types of Big Data in drug discovery and highlights how Big Data and AI, through machine learning (ML) approaches, can fast-track the drug discovery process.
What is Big Data in drug discovery?
Big Data in drug discovery refers to the data collected from the biological, chemical, pharmacological and clinical domains. These datasets are characterized by rapid generation, large size, complexity, heterogeneity and high value with commercial opportunities.
Some of the large datasets used in drug discovery processes are highlighted below:
Biology datasets:
Biological data helps researchers understand the mechanisms underlying a disease state, predict and validate potential target proteins for therapeutics, develop new bioassay techniques for identifying treatment modalities associated with those targets, predict how treatments will interact with the body when given to a patient and, finally, design effective clinical trials.
The data types that define biological data are: drug target data, omics data (genomic, transcriptomic, proteomic and metabolomic data), exome data, GWAS data, gene expression data, disease-relevant animal and cellular model data, and gene knockout or knockdown data, among others.
Chemistry datasets:
Chemistry datasets are useful in the design of high-throughput screening libraries, which assist in identifying and validating therapeutic targets in silico. These datasets support the prediction of the molecular properties required for drug compounds and help explain how those molecules interact with biological macromolecules.
The data types that define chemistry data are: chemical structural representations, chemical line notations or identifiers (SMILES and InChI), molecular property descriptors, topological descriptors, topographical descriptors, structure-activity relationship (SAR) data and compound-specific biological data.
Pharmacology datasets:
Pharmacological data in drug discovery provides information about the compounds or drugs tested in animal models, in combination with assay data on protein targets in cell- or tissue-based models. Together, these allow the effects of compounds to be investigated at different levels of biological complexity.
The data types that define pharmacological data are: absorption, distribution, metabolism, elimination and toxicity (ADMET) data, and functional in vitro and in vivo assay properties.
Clinical datasets:
Clinical datasets in drug discovery provide valuable information derived from patient data.
The data types that define clinical datasets are: safety and efficacy data, treatment response and side-effect profiles, patient stratification data, competitive landscape data, and trial design data.
The information contained in all the aforementioned large and complex datasets offers opportunities to explore and understand mechanisms associated with a disease state, and provides the possibility to prevent and treat such conditions.
What is artificial intelligence and what are its applications in drug discovery?
Scientists working globally in drug discovery research generate voluminous pharmaceutical Big Data, which is by nature multisource and multidimensional. It is becoming increasingly difficult not only to stay informed on all the available literature, but also to properly parse and integrate this Big Data into one’s own workflows within various research projects.
To overcome the hurdles associated with Big Data in drug discovery, pharmaceutical and information technology companies have adopted artificial intelligence (AI) technologies to provide robust solutions that could fast-track the drug discovery process.
When a machine exhibits human cognitive skills, such as the ability to learn and to solve problems, its behavior is described as artificial intelligence (AI). AI comprises technologies such as Machine Learning and Deep Learning methods.
Machine Learning methods are well established for learning and prediction of novel properties, while Deep Learning methods show great prospects in drug design owing to their powerful generalization and feature extraction capability. Both these methods offer opportunities across all stages of drug discovery.
Some of the applications of artificial intelligence in drug discovery include:
- Protein design and function
- Prediction of protein folding
- Prediction of protein-protein interactions
- Hit discovery
- Generation of chemical libraries or new molecule fingerprints
- Virtual screening
- Drug repurposing
- Hit to lead optimization
- Generating models for de novo design of drugs
- QSAR model prediction
- Prediction of molecular descriptors
- Prediction of topological & topographical descriptors
- Prediction of ADMET properties
- Prediction of pharmacokinetic parameters like ADME properties
- Prediction of toxicity properties
- Pharmacodynamics modeling
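To make one of the applications listed above more concrete, virtual screening often ranks library compounds by how similar their fingerprints are to that of a known active compound. The following is a minimal sketch in Python, not a production method. It assumes fingerprints are already available as sets of "on" bit positions; the compound names, bit values and function names are hypothetical, and in practice a cheminformatics toolkit would generate the fingerprints from chemical structures.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_similarity(query: set, library: dict) -> list:
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: each set lists the bit positions that are switched on.
query_fp = {1, 4, 7, 9, 15}
library = {
    "compound_A": {1, 4, 7, 9, 15},
    "compound_B": {1, 4, 8, 20},
    "compound_C": {2, 3, 5},
}

ranked = rank_by_similarity(query_fp, library)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Compounds at the top of the ranking share the most structural features with the query and would be the first candidates for follow-up; the similarity cut-off used to select hits is a project-specific choice.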
Challenges and limitations associated with Big Data & AI in drug discovery
Some of the major challenges associated with Big Data in drug discovery include data generation, data integration, data quality, and data storage and management. Further critical challenges are poor reproducibility and standardization of data, format difficulties for chemical structure representations, missing original data, lack of contextual information, insufficient availability of disease-relevant human data, bias in data, gaps in the fundamental understanding of many diseases, and issues in managing ontologies.
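As a small illustration of the standardization challenge, potency values for comparable assays are often reported in different units. The sketch below, in Python, converts IC50 values to nanomolar and then to pIC50, and sets aside records with missing values or unrecognized units rather than guessing. The choice of IC50 as the example measurement, the record layout, the field names and the compound identifiers are all assumptions made for illustration.

```python
import math

# Conversion factors from a reported molar unit to nanomolar (nM).
TO_NANOMOLAR = {"M": 1e9, "mM": 1e6, "uM": 1e3, "nM": 1.0, "pM": 1e-3}


def to_pic50(value, unit):
    """Convert an IC50 value to pIC50 (-log10 of the molar concentration).

    Returns None when the value is missing, non-positive, or the unit is unknown.
    """
    if value is None or value <= 0 or unit not in TO_NANOMOLAR:
        return None
    nanomolar = value * TO_NANOMOLAR[unit]
    return 9.0 - math.log10(nanomolar)  # 1 M = 1e9 nM


# Hypothetical activity records collected from different sources.
records = [
    {"compound": "cmpd_1", "ic50": 50.0, "unit": "nM"},
    {"compound": "cmpd_2", "ic50": 0.05, "unit": "uM"},
    {"compound": "cmpd_3", "ic50": None, "unit": "nM"},     # missing value
    {"compound": "cmpd_4", "ic50": 12.0, "unit": "ug/mL"},  # mass unit, not convertible here
]

clean, rejected = [], []
for rec in records:
    pic50 = to_pic50(rec["ic50"], rec["unit"])
    if pic50 is None:
        rejected.append(rec["compound"])
    else:
        clean.append((rec["compound"], round(pic50, 2)))

print("standardized:", clean)
print("needs curation:", rejected)
```

Once both values are on the same scale, the first two records turn out to report the same potency, while the last two are flagged for manual curation instead of being silently dropped or converted.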
Although artificial intelligence technologies are promising, limitations still exist. Processing and analyzing large volumes of data can affect the performance and reliability of AI methods, while the interpretation of complex biological data remains challenging.
How Excelra can support your AI-based drug discovery programs
Standardized and high-quality datasets are essential for AI/ML-based drug discovery programs. Excelra’s GOSTAR® is the world’s largest medicinal chemistry intelligence database providing comprehensive and structured SAR data for more than 8 million compounds.
Available as a one-stop data source for in silico drug discovery, GOSTAR® captures small molecule activities encompassing SAR, physicochemical, metabolic, ADME and toxicological profiles into a relational database format.
GOSTAR® datasets are created with industry-accepted ontologies and can be delivered in flexible file formats such as:
- Flat files
- Hierarchical files
- Databases (Oracle, MySQL, etc.)
- Semantic format
“10 of the Top 20 pharma companies utilize GOSTAR® to support their drug discovery programs”
