Immune or “convalescent” plasma is plasma collected from infected individuals after resolution of infection; these individuals develop antibodies against the pathogen. Passive antibody administration through transfusion of convalescent plasma may offer a short-term strategy to confer immediate immunity to patients with severe COVID-19. Antibodies generated against the SARS-CoV spike protein may offer a certain degree of protection against SARS-CoV-2: sera from convalescent SARS patients may cross-neutralize SARS-CoV-2 spike-driven virus entry into host cells.

There are around 23 clinical trials (the most advanced in Phase 3) studying the safety and efficacy of convalescent plasma therapy in COVID-19 patients. Interaction between the spike protein and the ACE-2 cellular receptor is required for membrane fusion and entry into the target cell. A more refined strategy could be the development of monoclonal antibodies targeting the spike protein of SARS-CoV-2; hence, the development and use of monoclonal neutralizing antibodies may be a viable approach for treating COVID-19.

Excelra’s open-access COVID-19 Drug Repurposing Database is a synoptic compilation of ‘Approved’ small molecules and biologics, which can rapidly enter either Phase 2 or 3, or may even be used directly in clinical settings against COVID-19. The database additionally includes information on promising drug candidates that are in various clinical, pre-clinical and experimental stages of drug discovery and development.

Supported with referenced literature, we provide mechanistic insights into SARS-CoV-2 biology and disease pathogenesis. We hope that these drug repositioning approaches can help the global biotech and pharma community develop treatments to combat COVID-19.


Although three human coronaviruses (SARS-CoV, MERS-CoV, and SARS-CoV-2) have hit mankind in the last two decades, no vaccines have yet been developed against them. Vaccines are the most effective strategy for preventing infectious diseases, including viral diseases like COVID-19. Specialized antigen-presenting cells ingest the virus and present the viral antigens to activate helper T-cells.

These Helper T-cells can function in the following two ways:

1. Activate B-cells to produce large quantities of anti-viral antibodies

2. Activate Cytotoxic T-cells to identify and clear the virus infected cells.

In this process, long-lived memory B-cells and T-cells are generated in the body, conferring long-term immunity against the virus. Of the two mechanisms, T-cell-mediated immunity is the more important for tackling viral infections.

In the current global scenario, a range of technologies is being used to develop COVID-19 vaccines. Approaches include nucleic acid (DNA and RNA), virus-like particle, peptide, viral vector (replicating and non-replicating), recombinant protein, live attenuated virus, and inactivated virus platforms. Moderna, a US vaccine company, began clinical testing of its mRNA-based vaccine mRNA-1273 within two months of sequence identification. Vaccine candidates in Phase I testing include mRNA-1273 (Moderna), Ad5-nCoV (CanSino Biologicals), INO-4800 (Inovio Pharmaceuticals), and LV-SMENP-DC (Shenzhen Geno-Immune Medical Institute).

Whatever approach is adopted for vaccine development, the vaccine must be safe and must confer long-term immunity against SARS-CoV-2.



Growth in technology has been driven by the strong association between computational power and Big Data, and data management has been the cornerstone of this evolution. Effective management of data saves time, money, and effort, freeing attention for more complex problems.

How to build a predictive analytics engine?

Building a predictive analytical algorithm depends entirely on its foundation, i.e., the data and the subsequent processes involved. Because the process is sequential, it is imperative to be cautious at each step: an error at any point degrades the overall prediction quality. Whether at the data level, in selecting the right algorithm, or in defining the front-end experience a user journeys through while using the platform, multiple factors play important roles in meeting the objective. This article therefore highlights certain rules one must consider while building a predictive engine.

Figure 1: Key aspects of building a Predictive Analytical Platform

Data Handling:

Internet freedom has undoubtedly played a catalytic role in a fast-developing world, but it also evokes skepticism about the genuineness of circulating information. Identifying reliable data sources is therefore the first important step. Rather than extracting information from unreviewed public records, it is advisable to get data from reputed organizations or institutions dedicated to maintaining the required information. Once sources are decided, the following steps come into effect:

  • Extraction: The extent of data within a specific source depends entirely on the interests and purpose of the holding organization, which may or may not align with yours. In the former case, chances are you get the entire longitudinal data; in the latter, the data can serve only part of your need. It is therefore better to tap various sources to gain assurance about data size and quality.
    The process of extraction also differs between sources. Where the holding source does not provide easily downloadable data, techniques like web scraping, data mining, and parsing come into use. Depending on the web architecture, manual effort is at times also needed to mine the data meticulously.
  • Integration: Bringing data from multiple sources under one roof can be challenging. This step should preserve the patterns coming from each source, ensuring that the relations between data and variables are not lost during integration. Integration also provides a broader sense of the data's overall dimensions.
  • Prioritization/Feature selection: While extracting data, it is quite possible that the source does not allow downloading only the specific information required; many repositories only allow downloading the entire data dump. Once downloaded, only the variables specific to your needs should be retained, excluding the rest. This avoids noise affecting the platform's performance.
  • Standardization: The objective of this step is to analyze the distribution and determine whether the data is normally distributed, skewed towards specific variables, or contains outliers. Fine-tuning the data to correct skewness is important, as it prevents the platform from becoming biased towards any variable. It also keeps the UI experience uniform, with no randomness arising from the nature of the data.
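
The standardization step above can be sketched in a few lines: inspect a numeric feature for skew and flag outliers before modelling. This is an illustrative sketch only; the function name, sample data, and z-score cutoff are hypothetical, not taken from any specific library.

```python
import statistics

def summarize_feature(values, z_cutoff=2.5):
    """Return mean, stdev, sample skewness, and indices of |z| > cutoff outliers."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    n = len(values)
    # Adjusted Fisher-Pearson sample skewness; near zero for symmetric data
    skew = (n / ((n - 1) * (n - 2))) * sum(((v - mean) / stdev) ** 3 for v in values)
    outliers = [i for i, v in enumerate(values) if abs((v - mean) / stdev) > z_cutoff]
    return {"mean": mean, "stdev": stdev, "skewness": skew, "outliers": outliers}

data = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1, 9.5]  # index 8 is an obvious outlier
stats = summarize_feature(data)
```

A strongly positive skewness or a non-empty outlier list signals that the feature needs transformation or cleaning before it reaches the model.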

Algorithm selection:

In today's world, smart technologies are driven by series of well-scripted codes: algorithms. One should choose the algorithm that fits the purpose, since an algorithm that performs well in one setting may fail to produce good results in another. Nor is it always the case that bootstrapping existing techniques meets the objective; sometimes an algorithm must be built from the ground up by combining multiple statistical methods. While doing so, the training set must be selected cautiously: it should be free of bias, should represent all realistic scenarios well, and should be sized so that a portion of the data remains available for proof-of-concept validation.
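
A minimal sketch of the training-set caution described above: a stratified split that preserves class proportions and holds out a slice for proof-of-concept validation. The labels and holdout fraction are invented for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, holdout_frac=0.3, seed=7):
    """Split sample indices per class so the holdout mirrors the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, holdout = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = int(len(idxs) * holdout_frac)
        holdout.extend(idxs[:k])   # held out for concept proofing
        train.extend(idxs[k:])
    return sorted(train), sorted(holdout)

labels = ["toxic"] * 10 + ["safe"] * 20
train_idx, holdout_idx = stratified_split(labels)
# The holdout keeps the 1:2 class ratio: 3 toxic and 6 safe samples.
```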

Dashboard development:

Choosing the right interface for the platform is the last but most important step towards completing the goal. While deciding the design, one must consider the intuitive experience a user should have while handling the platform. It should be interactive, allowing users to take control as per their requirements; this not only gives users confidence but also enhances their trust in the platform's back-end functioning. Colour schemes, animations, and fonts should be well balanced, as they add to the visual aesthetics. Finally, testing the application at multiple resolutions and in multiple environments provides a fool-proof check on how the platform adapts to various screen sizes.

Case Study


The manufacture and use of chemicals provides significant support to modern society, but maintaining a safe workplace environment is extremely important, especially in manufacturing industries where workers are constantly exposed to all sorts of chemicals and their health and safety should be the utmost priority. Aligned with this concern, we helped a big pharma company build an algorithm to predict Occupational Exposure Limits (OEL) and hazard information for known as well as novel compounds. Such an application safeguards the occupational health of employees and helps the client maintain proper safety standards, freeing the allotted budget for other important tasks.



With this in mind, we developed a web-based application by integrating multiple datasets containing OEL values and related hazard information for a variety of chemicals. Because OEL data is sparse in the public space and acceptable values vary by region, special attention was paid to harmonizing the data while preserving the original data connections. Moreover, since the information was pooled from various data sources, values were carefully normalized to consistent Units of Measurement (UoMs) while preserving interdependent relations with other variables.


A given chemical may have different OELs depending on geographical location; the region was therefore retained, allowing the user to see associated OEL/hazard descriptors region by region. Compound-centric details were also collected from public sources as well as our proprietary SAR database. Several techniques, ranging from automation to manual intervention, were implemented, resulting in a comprehensive, uniform data structure.


All this data became the foundation for the application, which comprises a structural-similarity algorithm able to infer OEL values for new chemical substances or categorize them into one of three categories per industrial standards (highly toxic; handle in a contained environment; safe to handle in an open environment). Prior to deploying the algorithms to the front-end, thorough validation checks on their functioning were performed, and certain parameters were deliberately tuned to avoid false-positive and false-negative predictions.
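
The structural-similarity algorithm itself is proprietary, but the underlying idea can be sketched generically with the Tanimoto coefficient over substructure fingerprints. Here plain Python sets stand in for real chemical fingerprints, and the compounds, OEL values, and similarity cutoff are all invented for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: |A & B| / |A | B| of two fingerprint sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def infer_oel(query_fp, knowns, min_sim=0.7):
    """Borrow the OEL of the most similar known compound above a cutoff."""
    best = max(knowns, key=lambda k: tanimoto(query_fp, k["fp"]))
    sim = tanimoto(query_fp, best["fp"])
    return (best["oel"] if sim >= min_sim else None), sim

knowns = [
    {"name": "compound_A", "fp": {1, 2, 3, 4, 5}, "oel": 0.5},  # mg/m3
    {"name": "compound_B", "fp": {10, 11, 12}, "oel": 5.0},
]
oel, sim = infer_oel({1, 2, 3, 4, 5, 6}, knowns)  # a close analogue of A
```

Returning `None` below the cutoff mirrors the validation concern above: refusing to predict is safer than a confident false negative on a toxic compound.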



Following the above set of principles, we built the application from the ground up. Starting with the data and its processing, followed by algorithm development and dashboard creation, the OEL application ensures the safety and occupational health of client employees and helps the company channel time, money, and effort towards other important tasks. The user can query the application with the SMILES representation of a chemical substance or by drawing the structure with the ChemSketch tool. The OEL value predicted by the platform helps chemists understand the appropriate handling environment for a given chemical when little information is available in the public domain. The application additionally suggests proper measures to maintain workplace safety.

The Excelra Edge

Strongly believing in the key aspects discussed above, we build and deliver applications that add value to our clients. Starting with the data and its processing, through algorithm development and front-end creation, our team of SMEs takes proper measures to ensure overall build quality. Moreover, Excelra has the advantage of bringing a workforce from different professional backgrounds to work seamlessly on a given objective: based on the nature of the task, we pool experience from computational experts, data analysts, chemists, biologists, and front-end UI developers.


Question: You joined Excelra in 2014. Tell us a little about yourself, your background and how you became involved with Excelra?

Anandbir: I have been involved with Excelra right from its inception in 2014 as part of the team that identified the original opportunity. Big data technology and analytics were starting to emerge as high growth areas, so we wanted to leverage these trends to serve the global life sciences industry. In my initial years, I worked as Director – focused on New Business Development & Initiatives driving service line strategy, expansion and innovation. I then took over the reins of the company as CEO in 2016.


Prior to joining Excelra, I held key leadership positions at a renewable energy start-up where I focused on business development, capitalization, growth & scaling; and prior to that I was at GE Capital in the Energy Financial Services business. I hold an MBA from INSEAD and a BA in Economics and International Relations from Tufts University in Massachusetts.

Question: What was the initial vision behind the launch of Excelra and how would you say that vision has changed?

Anandbir: Excelra was created with the purpose of transforming life science data into actionable insights for our R&D clients. A two-pronged approach is used to achieve this. First, structuring and unifying the data from disparate sources into analysis-ready scientific datasets, and second, employing high-end data analytics to build predictive models for unlocking insights to accelerate drug research and development.


The original vision that started the company has undergone some evolution. Firstly, the analytics team has graduated into a high-end data science team. Secondly, we have made strong forays into clinical pharmacology and value evidence areas. Thirdly, we have created the foundation of a rapidly growing R&D BIO-IT team, which has itself expanded from a support function into a flourishing independent business unit. This evolution into specific domains and into pure tech has been driven by strong customer demand.


Excelra’s vision today is to be a data structuring, analytics and engineering/data science partner, empowering innovation in the global life sciences industry. Our offerings run from discovery to market, essentially catering to the entire pharma R&D value chain.

Question: What would you say sets Excelra apart from your competitors?

Anandbir: The Excelra edge comes from a seamless unification of data, deep domain expertise and data engineering.  We have put together ‘bilingual resources’ who have a blend of science + technology.  These resources are very hard to find, and through various training programs, a wide variety of projects and experience, we have been able to cultivate a team that is versatile in skill and agile in practice.  Our people were, are, and will remain our biggest asset!


Further, Excelra has deep experience and trusted relationships in the pharma industry, some dating back nearly two decades (such as GVK Informatics). These long-standing relationships have helped develop an image of being less a vendor and more a trusted ally committed to our clients' success. Nurturing these relationships is itself a big competitive advantage.


Lastly, what defines us is our values and culture. We are obsessed with delivering the highest quality solutions with personal attention to each client’s specific needs while maintaining trust, transparency, and integrity every step along the way. This may sound basic, but we take it to another level!

Question: Can you tell us more about who your clients are, and how you work with them?

Anandbir: Excelra is the preferred data and analytics partner to over 90 global biopharma companies, including 17 of the top 20 pharmaceutical companies in the world and some of the leading drug discovery focused AI companies. We have multiple business models that include providing high end consulting services for specific objectives or providing highly skilled resources as an extension to the partner’s teams. We also offer subscription products and platforms, as well as long term data and technology partnerships.

Question: What are the most notable client success stories?

Anandbir: There are many, but two recent examples come to mind.


Excelra teamed up with one of the fastest growing clinical stage biotech companies that was creating an internal bioinformatics function. Excelra provided critical, multi-skilled resources to act as an extended team to the client, so they could hit the ground running. It’s been 2 years since the relationship began and the team so far has delivered ~10 diverse projects ranging from response predictive models, standard bioinformatics analysis for drug treated data, project metadata management, to OMICs pipeline development. This analytics support was fundamental in helping their lead candidate to transition into the clinical phase with a very robust recruitment and stratification strategy.


Another success story is with a large top 10 pharma company for whom we have been providing high-fidelity curated clinical trial datasets for the last few years. Impressed by the quality and turnaround time of service delivery as well as our responsive engagement model, we have recently been made their dedicated data partner to deliver curation services across all indications in their clinical pipeline! We have become a seamless extended arm of their global statistical sciences and advanced analytics groups.


These datasets are carefully extracted and normalized into analysis-ready datasets for Model-Based Meta-Analysis (MBMA), QSP and other statistical models. Through these applications, our client gains valuable intelligence into a clinical program’s competitive landscape and comparative effectiveness. They can benchmark dose-response, time-course, placebo effects, and outcome heterogeneity for a given drug class and indication. The insights gained are used to optimize clinical trial designs and marketing/commercial strategies among other use cases.

Question: What do you think is the greatest opportunity for pharma to improve the speed and efficiency of discovering and developing new therapies?

Anandbir: McKinsey estimates the size of the digital opportunity in life sciences to be $50–150 billion of EBITDA across the industry, and BCG analysis reckons that a data science driven re-engineering of the pharma value chain can unlock ~2X economic value.


However, Pharma has been relatively late in embracing the data and digital revolution compared to other industries such as retail, media, financial services and transportation. One key reason is the siloed nature of R&D and the traditional regulatory caution that drives the mindset. There is also a bit of cultural reluctance in perceiving data as an asset to be leveraged, rather than an output to be guarded.


Having said that – these barriers are rapidly breaking down and there is a real momentum shift across the industry to adopt the latest data and digital initiatives across R&D. Most large pharma companies now have a ‘CDO’ function driving a range of use cases such as: predictive modelling & AI for drug discovery; smart clinical trial design, biomarkers for patient stratification, RWE driven clinical trial feasibility & patient recruitment, ‘Beyond the pill’ approaches using wearables, mobile apps for digital patient engagement etc. The Covid-19 pandemic has only accelerated all these trends.

Question: The pharma business model is under significant strain at the moment, with the average cost of getting a drug to market doubling every 9 years. How can Excelra help with this?

Anandbir: Every aspect of the pharmaceutical value-chain involves generation of voluminous data. Excelra turns these vast data-pools into meaningful insights that unlock operational efficiencies, and increase the speed and accuracy of drug discovery, clinical development, regulatory approval, reimbursement, and market access.


To take a case study in point, Excelra has meticulously curated and structured valuable proprietary datasets. One such intelligence platform is GOSTAR, the largest Structure Activity Relationship (SAR) database for drug discovery. Our scientists have curated a 360° view of close to 8 million small molecule drug compounds and more than 23 million SAR property data points! From target profiling to hit identification and lead optimization, GOSTAR is a critical resource for medicinal and computational chemists capturing granular assay data across chemical, biological, pharmacological and therapeutic dimensions. Some of the top AI drug discovery biotech companies train their algorithms on GOSTAR data. Similarly, we offer a range of data products and services including – translational biomarkers (GOBIOM), clinical trial design/outcomes and real-world data. With the likes of GOSTAR and other data assets, Excelra is providing scientists a starting point by offering deeper insights and accuracy which in turn helps shorten that 9-year window.


On the other hand, as a scientific consultancy, we partner with companies in data science driven drug discovery, target and biomarker identification and validation. We help them in bioinformatics, drug repurposing, indication prioritization and drug combination prediction. Once the molecules hit Phase II in clinical development, we help generate and communicate the value and evidence of the molecule through pharmacoeconomic modelling, value communication reimbursement and RWE analytics.  Again, all these consulting services help in either progressing a faster clinical program or getting the “Phase IV” approval in a shorter timeframe.


Finally, as a data engineering and technology partner, we provide expertise in semantic enrichment, ontology mapping, data harmonization, data ingestion pipelines, meta/master data management, semantic and knowledge graph solutions; as well as solving life science R&D technology needs in application development, workflow and process integration and  data visualization.


All these offerings help shorten the time to market, either fail fast/fail early or help get to the market quicker. Ultimately, it’s a collaborative approach with our clients to prioritize areas of the most critical need where Excelra can bring a holistic approach combining data, analytics and technology.

Question: How have you noticed perspectives and engagement levels changing over the last few years, with respect to adoption of more data-driven approaches for discovery and development?

Anandbir: As I mentioned earlier, we have noticed a distinct uptake in the interest and appetite for data-driven approaches both at a functional level as well as at an enterprise level. Especially in the Covid-19 environment, our relationships have seen a deeper level of engagement, increased budgets and demand for outsourced digital solutions.

Question: What would you say is the single biggest untapped opportunity for pharma companies to improve the effectiveness and outcomes from their R&D programs?

Anandbir: The interlinking between the world of drug research and R&D (Pharma) and clinical practice (Healthcare) is best exemplified by Real World Evidence (RWE), which has the potential to enhance the efficiency of drug development and provide new evidence on risks and benefits of medical products. While there has been a lot of development in this area, there remains an untapped potential to maximize the use of RWE.


It is still a big challenge to connect the dots and stitch together a full longitudinal picture of a patient’s experience. If ongoing clinical practice can generate regulatory grade data, this can transform the way clinical trials are conducted in the future. There’s still a long way to go but we are on the way and we will get there.

Question: You’ve recently released some open access resources to support scientific efforts to combat COVID-19. Can you tell us more about those resources?

Anandbir: That’s right. We have launched a biomarker database and a drug repurposing database related to Covid-19: Excelra’s Covid-19 Resource Hub.


The biomarker database is excerpted from our biomarker intelligence platform – GOBIOM. The database is a compilation of manually curated biomarkers from published clinical trials, evaluating potential drugs or biologics for the treatment of SARS-CoV-2. The database additionally includes information on FDA-NIH recommended BEST classification of biomarkers, supported with direct links to referenced literature.


The drug repurposing database is a compilation of ‘Approved’ small molecules and biologics, which can rapidly enter either Phase 2 or 3, or may even be used directly in clinical settings against Covid-19. The database additionally includes information on promising drug candidates that are in various ‘clinical, pre-clinical and experimental’ stages of drug discovery and development. Supported with referenced literature, we provide mechanistic insights into SARS-CoV-2 biology and disease pathophysiology.


Both the databases are provided as ‘Open Access’ resources to the research community – it’s a small contribution to the global efforts towards finding a medical breakthrough against Covid-19.


Text mining - Unlocking intelligent insights from immense data

The conversion of unstructured data into high-quality actionable insights using information retrieval, data mining, machine learning, statistics, and computational linguistics is commonly referred to as text mining. Access to such organized information enables data-driven decisions, extraction of previously unknown knowledge, and discovery of new data patterns. News articles, plain text, technical papers, books, digital libraries, emails, blogs, and internet pages are a few sources of such data [1]. From academia and healthcare to businesses and social media platforms, text mining has found utility in web mining, risk management, cybercrime prevention, business intelligence, content enrichment, customer care, and knowledge management, generating value and ROI from unstructured data [2]. In this regard, the availability of large volumes of unstructured biological and biomedical data necessitates integrating manual curation with computational methods for literature mining, to drive meaningful data-centric outcomes in academia, pharma, and healthcare.

The 5-Steps of text mining

Text mining in any domain broadly constitutes the following 5 steps:

  • Data acquisition
  • Text pre-processing and transformation (cleansing)
  • Text selection, extraction and organization
  • Text evaluation and validation
  • Application
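
The five steps above can be sketched as a minimal pipeline; each function here is a hypothetical stand-in for a real implementation, and the documents and gold set are invented.

```python
import re

def acquire():                       # 1. Data acquisition
    return ["TP53  is degraded by MDM2. ", "Sign up for our newsletter!"]

def preprocess(docs):                # 2. Pre-processing and cleansing
    return [re.sub(r"\s+", " ", d).strip().lower() for d in docs]

def select(docs, keywords):          # 3. Selection and extraction
    return [d for d in docs if any(k in d for k in keywords)]

def evaluate(selected, gold):        # 4. Evaluation against a gold set
    return sum(d in gold for d in selected) / len(selected) if selected else 0.0

docs = preprocess(acquire())
selected = select(docs, keywords={"degraded"})
precision = evaluate(selected, gold={"tp53 is degraded by mdm2."})
# 5. Application: hand `selected` to curators or a downstream model.
```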

Text mining for biology & biomedical data - Need for integrating automation with manual expertise

A biology or biomedical text mining tool requires a combination of computational and manual curation approaches for optimal performance and the timely development of comprehensive databases. Often, data-centric organizations that use computational text mining to develop domain-specific services or products depend on outsourced help to manually curate and evaluate data. It is therefore imperative to have data curation, automation, and evaluation under one roof to support the end-to-end functioning of an in-house text mining tool.

Fig.1: Multiple steps associated with text mining processes and steps that require both computational and manual curation interventions.

Biology curation services & text mining at Excelra

Biomedical text mining requires expert curation of data and projection of meaningful data insights. Excelra is a hub of domain experts in the field of biomedical data curation and provides a repertoire of computational biology services, bioinformatics services and biology curation services. Our experts use computational biology tools like NLP, coupled with manual data analysis interventions to process unique requirements of clients. The outcome is clean, organized and structured data with a range of applications:

  • Extraction of previously unknown knowledge
  • Deriving new data patterns with analysis-ready data for training AI/ML pipelines
  • Enabling informed data-driven decisions to accelerate Biopharma R&D

Excelra's text mining approaches

Our text mining approaches are tailored to the needs of customers and depend on the type of input data.

  • A traditional ‘keyword-based approach’ may only discover relationships at a shallow level, while ‘tagging-based approaches’ may rely on tags obtained manually or by automated algorithms.
  • A more advanced information-extraction approach requires ‘semantic analysis of text’ by NLP and machine learning methods.
  • At Excelra, we have provided all three kinds of services with varying input data: canonical search engines, scientific literature search engines such as PubMed, news reports, client-specific articles, and scientific databases. Our years of experience, coupled with domain expertise, position us as a formidable player able to meet client-specific requirements in critical data curation projects.
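
A toy illustration of the first, shallow keyword-based approach from the list above; the keyword lexicon and sentences are invented for the example.

```python
KEYWORDS = {"degradation", "cleaves", "proteolysis"}

def keyword_hits(sentence):
    """Shallow keyword match: return the lexicon terms found in a sentence."""
    tokens = set(sentence.lower().rstrip(".").split())
    return KEYWORDS & tokens

hit = keyword_hits("Caspase-3 cleaves PARP during apoptosis.")
miss = keyword_hits("PARP is a DNA repair enzyme.")
```

The second sentence is relevant to PARP biology yet matches nothing, which is exactly the shallowness that motivates the tagging-based and semantic NLP approaches.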

Fig.2: Schematic representation of Excelra’s three approaches for text mining of biomedical data

Case Study

Excelra has been providing biomedical data curation and text mining services to multiple clients, each customized to meet specific needs. To underscore the utility of our in-house text mining tool, a client case study is presented here.


The project aimed to develop datasets of experimentally proven protein-degrading enzyme-substrate pairs. The scope was to identify relevant enzyme-substrate databases, extract data with pre-defined features using the text mining pipeline, and finally deliver structured, curated data. These outcomes were further used in machine learning (ML) enabled data analysis and in user interface (UI) based data visualizations.

Methodology for literature mining

The entire process was carried out in three phases.

  • In phase 1, PubMed and PMC were identified as suitable data sources containing experimentally proven protein-degrading enzyme-substrate associations. A sentence-construction exercise was conducted to identify salient features representing the data to be extracted.
  • In phase 2, a lexicon of enzymes and degradation keywords was developed, and a text mining tool based on Elasticsearch, integrated with the lexicon and the identified features, was built and applied to the processed data. Manual evaluation of the articles extracted by the tool confirmed an accuracy above 90%. In total, the tool identified 2,000 relevant PubMed articles and 9,500 PMC articles.
  • Finally, the full text articles with experimentally proven protein degrading enzyme-substrate links were manually curated based on 25 pre-defined variables in phase 3. A spreadsheet with curated data was provided for use in machine learning, and a UI was developed for data visualization.
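The filtering step in phase 2 can be illustrated with a minimal sketch. The production tool was Elasticsearch-based; the plain-Python version below only approximates the lexicon-and-keyword matching idea, and the enzyme and keyword lists are hypothetical stand-ins for the expert-curated lexicon.

```python
import re

# Hypothetical mini-lexicon; the real project used an expert-curated lexicon
# of enzyme names and degradation keywords (entries here are illustrative).
ENZYMES = {"trypsin", "caspase-3", "matrix metalloproteinase-9"}
DEGRADATION_KEYWORDS = {"cleaves", "cleavage", "degrades", "degradation", "proteolysis"}

def is_relevant(abstract: str) -> bool:
    """Flag an abstract that mentions at least one lexicon enzyme
    together with at least one degradation keyword."""
    text = abstract.lower()
    words = set(re.findall(r"[a-z0-9-]+", text))
    has_enzyme = any(enzyme in text for enzyme in ENZYMES)
    has_keyword = any(keyword in words for keyword in DEGRADATION_KEYWORDS)
    return has_enzyme and has_keyword

abstracts = [
    "Caspase-3 cleaves PARP during apoptosis.",
    "We report a new crystal structure of hemoglobin.",
]
# Only the first abstract pairs an enzyme with a degradation keyword
hits = [a for a in abstracts if is_relevant(a)]
```

In the actual pipeline, articles passing such a filter would then go to manual curation against the 25 pre-defined variables.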

Fig. 3: The 3-phased solution strategy for developing a dataset with experimentally proven protein degrading enzyme-substrate relations using text mining and manual curation

The Excelra Edge

The key differentiators that set Excelra apart from other domain experts can be summarized in the following points:

  • Excelra has been providing expert support to 15 of the top 20 pharma companies, with 90+ clients spread across the globe
  • It holds 18+ years of experience in data sciences with 60+ PhDs among a 600+ talent pool
  • The organization is equipped with data, deep domain expertise and data science capabilities
  • Excelra has a scalable multilingual techno-human curation engine
  • A customized ontology-based approach with expert curated relevant scientific terms specific for each text mining project is another differentiator
  • Domain experts who are well-experienced in handling versatile text mining projects
  • Finally, presence of a multidisciplinary blend of Math, Computation, and Life Sciences under one roof with proprietary text mining algorithms enables Excelra to develop text mining tools with optimal performance in quick turnaround time

Fig. 4: The Excelra Edge with key differentiators

Conclusion: Advantages & the imperative need for text mining solutions to derive actionable insights from biomedical literature

Text mining of biomedical and scientific literature can provide significant benefits in discovering new data patterns and in knowledge extraction and management. Today, large volumes of biological and biomedical data are generated at an exponential rate through multi-experimental methods such as omics technologies. These investigational data are published in peer-reviewed journals and indexed in public literature repositories such as PubMed or MEDLINE. Researchers query and use this primary literature to generate new hypotheses or validate their own results, while scientific curators manually survey and curate such data to populate databases with varied functionalities. Manual efforts, however, are labor-intensive, time-consuming, and require extensive searches to obtain relevant information. Extracting and processing research outcomes from literature repositories using AI-enabled technologies such as NLP and ML, implemented in languages such as Python or R, can offer new and valuable biological insights. Applying text mining methods to clinical data can also aid the understanding of disease pathogenesis, the development of new diagnostics and drugs, efficient patient data management systems, and the implementation of precision medicine, among a repertoire of probable clinical solutions. Text mining of biomedical data thus holds immense promise and could pave the way to new innovations across domains such as academia, pharma, and healthcare.


Reports suggest that the SARS-CoV-2 virus can directly infect human blood vessel and kidney organoids. The use of clinical-grade recombinant human ACE2 (rhACE2) can reduce SARS-CoV-2 infection in cells and in multiple human organoid models. The same study also suggested that applying soluble human ACE2 in the early stages of infection can block SARS-CoV-2 entry into host cells. Currently, one phase-2 clinical trial sponsored by Apeiron Biologics is testing rhACE2 for the treatment of COVID-19 patients.



A brief review of GPCR family: The largest family of druggable targets

G protein-coupled receptors (GPCRs) have become a hot frontier in basic life-science research and in the discovery of translational medicines, and they are widely pursued by both academic and industrial drug discovery. They represent an important opportunity for both small molecule-based and antibody-based therapeutics and are the largest family of targets for approved drugs. Discovering a diverse set of molecules targeting this family could yield valuable assets by opening unexplored avenues, such as establishing target biological functions and disease relevance.

GPCR structures and families

GPCRs are the largest family of proteins involved in membrane signal transduction and are also the most intensively studied drug targets, largely due to their substantial involvement in human pathophysiology. The pharmacological modulation of GPCRs provides leverage for the treatment of diseases of the central nervous system (CNS), cancer, viral infections, inflammatory disorders, metabolic disorders, and more.


The superfamily is classified into six classes based on amino acid sequence similarities: Class A (rhodopsin-like family), Class B (secretin receptor family), Class C (glutamate receptor family), Class D (fungal mating pheromone receptors), Class E (cAMP receptors), and Class F (frizzled/smoothened receptors), of which only four (A, B, C, and F) are found in humans.

Figure 1: Target classes in GPCR super family

GPCRs are involved in various biological processes and disease indications and they make excellent drug targets (Fig 2). Some GPCRs have been linked to cancer development and progression, based on their overexpression and/or up-regulation by diverse factors. A higher expression of GPR49 was found to be involved in the formation and proliferation of basal cell carcinoma, the glycine receptor GPR18 was found to be associated with melanoma metastases, and high levels of GPR87 were found to be associated with lung, cervix, skin, urinary bladder, testis, head and neck squamous cell carcinomas.

Figure 2: Some of the indications linked to members of GPCR family

Recently, orphan GPCRs have emerged as potentially novel targets for the treatment of a diverse set of indications: GPR119 for diabetes, leucine-rich repeat-containing G protein-coupled receptors 4 and 5 (LGR4/5) for gastrointestinal disease, GPR35 for allergic inflammatory conditions, GPR55 as an antispasmodic target, the proto-oncogene Mas for thrombocytopenia, and GPR84 for ulcerative colitis.

Landscape of GPCR research and drug development

GPCRs are the largest ‘target’ class of the ‘druggable genome’, representing approximately 19% of currently available drug targets. In humans, the GPCR superfamily consists of 827 distinct members, of which 406 are non-olfactory. However, current therapeutics target only about 25% of potentially druggable GPCRs: 103 of a possible 403 GPCR targets have at least one marketed drug.
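As a quick arithmetic check, the roughly 25% coverage figure quoted above follows directly from the counts given (103 targeted out of 403 potentially druggable GPCR targets):

```python
# Counts from the text: 103 GPCR targets with at least one marketed drug,
# out of 403 potentially druggable GPCR targets
targeted, druggable = 103, 403
coverage_pct = 100 * targeted / druggable  # just over 25%
```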

Figure 3: Classification tree of drugs targeting GPCR family members

Current literature analysis shows that GPCRs have traditionally been regarded as the domain of small-molecule drugs and that very few targets are well studied. More than 30% of US Food and Drug Administration (FDA) approved drugs target GPCRs, which makes them the largest druggable class of biomolecules (Fig 4).

Figure 4: More than 30% FDA approved drugs target members of GPCR family

Enormous efforts have been expended to find relevant and potent GPCR ligands as lead compounds. Non-olfactory GPCRs constitute more than half of the targets encoded by the human genome that are not yet exploited for any therapeutic use, and knowledge about them is disproportionately concentrated in the scientific literature. Preliminary studies highlight that these receptors have functions in genetic and immune-system disorders.

While the drugs that currently target GPCRs are primarily small molecules and peptides, GPCRs also recognize diverse ligands, including inorganic ions, amino acids, proteins, steroids, lipids, nucleosides, nucleotides, and small molecules (Fig 5).

Figure 5: FDA approved drug types across GPCR classes

The latest trends in GPCR research indicate that modalities other than small molecules are becoming more popular as GPCR-targeting agents, with monoclonal antibodies, peptide drugs, and allosteric modulators entering early-stage clinical trials. For instance, GLP-1 receptor-targeting biologics such as exenatide, liraglutide, and dulaglutide have been approved for type 2 diabetes, the CGRP receptor-targeting antibody erenumab has been approved for chronic migraine, and many other peptide drugs targeting various GPCRs are in development.

Current trends in GPCR research

In recent years, there has been a significant increase in the information available about the sequences, structures, and signaling networks of GPCRs and G proteins, driven by breakthroughs in X-ray crystallography and cryo-electron microscopy (cryo-EM) and leading to a much better understanding of GPCR-G protein interactions. This wealth of information is being explored using several bioinformatics resources and software tools, including the Protein Data Bank, GPCRdb, gpDB, human-gpDB, and many more.

Owing to the limitations and high cost of experimental studies, computational modeling techniques such as bioinformatics, protein-protein docking, and molecular dynamics simulations are playing an important role in exploring GPCR-G protein interactions. Determining the three-dimensional structural features of various unexplored orphan receptors and their ligand-associated complexes has become an exciting avenue in GPCR research, improving our understanding of molecular recognition and activation mechanisms and aiding the pharmaceutical investigation of new diseases across a variety of therapeutic areas.

As current human therapeutics cover only 25% of potentially druggable GPCRs, a relatively large fraction of GPCRs still remains ‘orphan’ and therapeutically unexploited. The prediction and identification of ligands for these orphan receptors is an active area of research and of considerable interest to the pharmaceutical industry.


  • Hutchings CJ. A review of antibody-based therapeutics targeting G protein-coupled receptors: an update. Expert Opin Biol Ther. 2020 Aug;20(8):925-935
  • Ellaithy A, Gonzalez-Maeso J, Logothetis DA, Levitz J. Structural and Biophysical Mechanisms of Class C G Protein-Coupled Receptor Function. Trends Biochem Sci. 2020 Dec;45(12):1049-1064
  • Sriram K, Insel PA. G Protein-Coupled Receptors as Targets for Approved Drugs: How Many Targets and How Many Drugs? Mol Pharmacol. 2018 Apr;93(4):251-258
  • Hauser AS, Attwood MM, Rask-Andersen M, Schioth HB, Gloriam DE. Trends in GPCR drug discovery: new agents, targets and indications. Nat Rev Drug Discov. 2017 Dec;16(12):829-842
  • Rask-Andersen M, Masuram S, Schioth HB. The druggable genome: evaluation of drug targets in clinical trials suggests major shifts in molecular class and indication. Annu Rev Pharmacol Toxicol. 2014;54:9-26.
  • Lu S, Zhang J. Small molecule allosteric modulators of G-protein-coupled receptors: drug-target interactions. J Med Chem. 2018
  • Gugger M, White R, Song S, Waser B, Cescato R, Rivière P, Reubi JC. GPR87 is an overexpressed G-protein coupled receptor in squamous cell carcinoma of the lung. Dis Markers. 2008;24(1):41-50
  • Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Research. 2000;28:235-242
  • Theodoropoulou MC, Bagos PG, Spyropoulos IC, Hamodrakas SJ. gpDB: a database of GPCRs, G-proteins, effectors and their interactions. Bioinformatics. 2008 Jun 15;24(12):1471-2
  • Satagopam VP, Theodoropoulou MC, Stampolakis CK, Pavlopoulos GA, Papandreou NC, Bagos PG, Schneider R, Hamodrakas SJ. GPCRs, G-proteins, effectors and their interactions: human-gpDB, a database employing visualization tools and data integration techniques. Database (Oxford). 2010;baq019
  • Kooistra AJ, Mordalski S, Pándy-Szekeres G, Esguerra M, Mamyrbekov A, Munk C, Keserű GM, Gloriam DE. GPCRdb in 2021: integrating GPCR sequence, structure and function. Nucleic Acids Research, 2020;49:D335-D343

GOSTAR is the largest manually annotated structure-activity relationship (SAR) database of small molecules published in leading medicinal chemistry journals and patents. Compounds from both discovery and development stages, targeting all target families, are covered. Along with SAR, key properties such as ADME and toxicity are captured. This relational database enables users to navigate and analyze its massive small-molecule content to make informed decisions in the design and discovery of novel compounds.

Content coverage

The GOSTAR database content is compiled from various sources, which include:

  • MedChem Journals
  • Patents
  • FDA/EMEA/PMDA Reports
  • Clinical Trial Registries
  • Scientific Reviews
  • Company Websites
  • Books
  • Conferences
  • Public Sources

Figure 1. A quick view of content covered and sources of the content.

Patents covered in 2020

The patent coverage in the GOSTAR database is very comprehensive: content was indexed from more than 2,900 patents in the year 2020. GOSTAR avoids duplication and redundancy in the database by not capturing the same patent multiple times, i.e., a patent published in multiple patent offices.

Table 1: Number of patents (patent office wise) covered in 2020 updates.

Preclinical candidates covered in 2020

In the year 2020, the GOSTAR database was enriched with 1,500+ preclinical compounds acting against various indications such as COVID-19, non-alcoholic steatohepatitis (NASH), hepatitis virus infections, HIV infection, cardiovascular diseases, and various cancers.


A few significant drug inclusions in 2020 were:

  • EPV-COV19
  • FT-8225
  • VNRX-9945
  • CARG-201
  • S-540956
  • BMS-818251
  • BRII-732
  • CR-13626
  • NAB815
  • CV730
  • GLPG-4124
  • IDG-16177

Target space covered in 2020 updates

New content was added for more than 2,500 protein targets in 2020. Content for EGFR was updated from 200+ references, the adenosine A2A receptor from 86 references, and KRAS from 54 references, while NOTCH made it into the top 20 with around 4.7K compounds covered from a single reference (Table 2).

Table 2: List of top 20 targets covered in 2020 updates

Distribution of SAR content

Figure 2. Assay wise distribution of SAR content covered 2020.

Of the 1.2 million SAR rows added to GOSTAR in 2020, functional in-vitro and in-vivo assays contribute 41.25% of the data, binding assays 32.28%, and ADME properties 6.69%. Approximately 2% of the content covers toxicity properties of the compounds covered in 2020, and the remaining ~17% represents other property types, including physicochemical properties.
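Treating the toxicity share as exactly 2%, the stated percentages can be cross-checked; the residual share for the other property types comes out to 17.78%, in line with the roughly 17% quoted above:

```python
# Assay-type shares of the 1.2 million SAR rows added in 2020, as stated above
shares = {
    "functional (in-vitro + in-vivo)": 41.25,
    "binding": 32.28,
    "ADME": 6.69,
    "toxicity (approximate)": 2.00,
}
other_pct = 100.0 - sum(shares.values())  # remaining "other property types"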


In medicinal chemistry, the relationship between the molecular structure of a compound and its biological activity is referred to as the structure-activity relationship (SAR). Medicinal chemists modify molecules by inserting new chemical groups into a compound and then test those modifications for their biological effects. Determining and identifying SARs is key to many aspects of the drug discovery process, ranging from hit identification to lead optimization.

Although information on millions of compounds and their bioactivities, e.g. reactivity, solubility, target activity, etc., is freely available to the public, it is very challenging to infer meaningful and novel SAR from that information. The underlying problem is the unstructured and heterogeneous nature of these datasets, contributed by the scientific and research community through journals, scientific articles, patents, regulatory documents, and various secondary sources. Owing to the increasing structural diversity among hit compounds and their potency distribution, analyzing SAR information is becoming a challenge. If these relationships are properly extracted, associated, and analyzed, they provide valuable information that supports drug discovery and development. To this end, there has been increasing need and interest in mining and structuring SAR information from bioactivity data available in the public domain.

Global Online Structure Activity Relationship Database (GOSTAR)

Excelra, a leading global biopharma data and analytics company, has responded to this pertinent need by developing a knowledge repository, Global Online Structure Activity Relationship Database (GOSTAR), which provides a 360-degree view of millions of compounds linking their chemical structure to the biological, pharmacological and therapeutic information. GOSTAR contains high-quality, manually annotated and very well-structured SAR data captured from various primary sources (patents and top journals of medicinal chemistry) and secondary sources (conference meetings & abstracts, company drug development pipelines, company annual reports, clinical registries and drug approval reports).

Who can use GOSTAR and how?

The main objective in creating GOSTAR is to assist medicinal chemists, computational chemists, and cheminformaticians in their quest to identify potential small molecules that have a desirable biological effect and could be of specific therapeutic use. GOSTAR enables users to quickly visualize, explore, analyze, and evaluate SAR data based on their project requirements. Users can explore SAR associations by searching identifiers such as drug names, chemical structures, bibliography, compound development stage, and activity endpoints.

What are the applications of GOSTAR?

A better understanding of SAR data enables users to make sound decisions when exploring the chemical space while designing a drug.

Following are the applications of GOSTAR:

  • Target profiling – GOSTAR enables a holistic exploration of the chemical space around a target of interest & enables the users to understand the pathways and indications in which a given target is implicated
  • Structure based drug design – GOSTAR can be used as a compound library to perform virtual screening and hit identification in traditional structure-based drug design methodologies
  • Lead optimization – GOSTAR supports lead optimization by suggesting structure-activity relationships with improved potency, reduced off-target activity, and better physicochemical/metabolic properties
  • Assay validation – GOSTAR suggests the right functional assays for secondary validation for the chemical modifications while involved in the tuning of the hit molecule
  • Drug repurposing and Translational science – GOSTAR data can be mined to interrogate diverse targets with a compound of interest to understand the feasibility and viability for drug rescue or for label expansion
  • Competitive intelligence and Novelty analysis – GOSTAR captures drug lifecycle information such as indication, phase of development, sponsor and recruitment/approval status including suspended trials along with the reason for discontinuation that can be used for building the competitive landscape around the drug/target/indication.


Currently, there are hundreds of thousands of chemical classes, and it often becomes a daunting task to identify potential candidates for therapeutic use. In such cases, knowledge repositories like GOSTAR allow us to rapidly characterize data points that help to efficiently capture and encode specific SAR. Below are the key features that showcase why GOSTAR is an ideal and simple solution for the complex task of gathering SAR data.

  • Reachability – Easy content accessibility to a wide and diverse user community
  • Utility – Maximize the utilization of content to create insights/concepts
  • Applicability – Selective utilization of content in diverse early discovery programs targeting unmet medical needs
  • Reliability – Standardized and normalized content to support traditional as well as AI/ML driven discovery programs

GOSTAR is the largest manually annotated structure-activity relationship (SAR) database of small molecules published in mainstream medicinal chemistry journals and patents. Compounds from both discovery and development stages, targeting all target families, are covered. Along with SAR, key properties such as ADME and toxicity are captured. This relational database enables users to navigate and analyze its massive small-molecule content to make informed decisions in the design and discovery of novel compounds.

Content Coverage

The GOSTAR database is composed of many different types of content, from scientific literature to publicly available material.

  • MedChem Journals
  • Patents
  • FDA/EMEA/PMDA Reports
  • Clinical Trial Registries
  • Scientific Reviews
  • Company websites
  • Books
  • Conferences
  • Public Sources

Fig 1. A quick view of content covered and sources of the content

Preclinical Candidates Covered in 2021 (until July 2021)

In the year 2021, the GOSTAR database has been enriched with various preclinical compounds acting against indications such as COVID-19, non-alcoholic steatohepatitis (NASH), hepatitis virus infections, HIV infection, cardiovascular diseases, and various cancers.


A few significant drug inclusions until July 31, 2021:

  • Synflorix
  • AZD1222
  • Benaglutide
  • GSK-1557484A
  • MRNA-1273

Target Space Covered in 2021 Updates

New content was added for more than 2,400 protein targets in the GOSTAR database up to July 31, 2021.

Table 2: List of top 20 targets covered

Type of Content

A deeper analysis of the content covered in 2021 is shown in Figure 2. Of the 1.2 million SAR rows added to GOSTAR, functional in-vitro and in-vivo assays contribute 41% of the data, binding assays 33%, and ADME properties 5%. Toxicity properties account for 2% of the content covered in 2021, and the remaining 19% represents other property types, including physicochemical properties.

Fig 2. Assay wise distribution of SAR content