Large language models are reshaping how scientists work with chemistry data. But put them in front of real patents and real assay archives, and the failure modes are surprisingly consistent. Here is what the field is learning.
Large language models have moved from novelty to near-default infrastructure across drug discovery in just two years. They mine patents, summarise literature, draft SAR interpretations, and let researchers query complex databases in plain English. The productivity story is real.
The story has a quieter second half. When LLMs are pointed at real chemistry data — the kind that lives inside working drug discovery organizations — the failure modes are not random. They are structural, and they tend to be invisible in the output.
Understanding where these failure modes originate requires understanding what makes chemistry data fundamentally different from the text-heavy domains where LLMs were originally developed. Excelra’s blog on Unravelling the Path to Drug Discovery through Cheminformatics provides essential context on why chemical structure representation, SAR data organization, and assay metadata demand a level of precision that general-purpose LLMs are not inherently trained to respect.
Challenge 1: Patent extraction: when valid SMILES is still wrong
Automated patent extraction is one of the most active LLM use cases in chemistry. On clean, isolated structures, performance is genuinely strong. However, real patents are messier: multiple structures per page, low-resolution images, Markush representations with R-group definitions, and stereochemical annotations that are easy to miss.
In that environment, three reliability problems recur. SMILES hallucination, where models generate syntactically valid but chemically wrong structures. An independent 2024 study reported only 73.5% accuracy on LLM-extracted SMILES from patents [1]. Chirality inversion, where a dropped or swapped stereo indicator silently erases or flips configuration. And Markush misinterpretation, where R-group claims get collapsed into specific structures the patent never claimed.
The 73.5% SMILES accuracy figure understates the practical risk, because the errors that remain are not evenly distributed — they cluster around the most complex and most therapeutically interesting structures: chiral centers, Markush claims covering large chemical space, and fused heterocyclic scaffolds. For organizations whose competitive intelligence depends on complete and accurate patent structure extraction, the difference between curated medicinal chemistry data and raw LLM extraction is not incremental — it is structural. Excelra’s ChEMBL vs. GOSTAR™ — Data Diversity and Compound Coverage blog examines how curated, expert-validated compound data compares to automated extraction across dimensions that matter for AI and SAR workflows.
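The gap between "parses" and "correct" can be shown with a minimal sketch. It assumes a curated reference SMILES is available and that the extracted string keeps the same atom order, so a plain text comparison is meaningful; a production pipeline would canonicalise both strings with a cheminformatics toolkit instead. The function names are hypothetical, and alanine is used only as a familiar example.

```python
import re

# Bracket atoms that carry a tetrahedral chirality mark, e.g. [C@H] or [C@@H].
STEREO_ATOM = re.compile(r"\[[^\]]*@[^\]]*\]")


def looks_plausible(smiles: str) -> bool:
    """Crude syntax check (balanced brackets and parentheses). Not a real parser."""
    return (smiles.count("(") == smiles.count(")")
            and smiles.count("[") == smiles.count("]"))


def count_stereocentres(smiles: str) -> int:
    """Number of bracket atoms with an explicit '@' or '@@' mark."""
    return len(STEREO_ATOM.findall(smiles))


def strip_stereo(smiles: str) -> str:
    """Drop chirality marks so only the rest of the string is compared."""
    return smiles.replace("@", "")


def compare_to_reference(extracted: str, reference: str) -> str:
    """Heuristic triage of an extracted SMILES against a curated reference."""
    if extracted == reference:
        return "match"
    if strip_stereo(extracted) == strip_stereo(reference):
        # Same text apart from chirality marks: stereo was changed or erased.
        return "stereo_mismatch"
    if count_stereocentres(extracted) < count_stereocentres(reference):
        return "fewer_stereocentres"
    return "needs_toolkit_comparison"


L_ALANINE = "N[C@@H](C)C(=O)O"   # curated reference
D_ALANINE = "N[C@H](C)C(=O)O"    # one '@' fewer: the mirror image
FLAT_ALANINE = "NC(C)C(=O)O"     # stereo indicator dropped entirely

for candidate in (L_ALANINE, D_ALANINE, FLAT_ALANINE):
    print(candidate, looks_plausible(candidate),
          compare_to_reference(candidate, L_ALANINE))
```

All three strings pass the syntax check, which is the point: only the comparison against a trusted reference separates the correct structure from the other two.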
Challenge 2: Natural language querying: more than language understanding
Conversational interfaces over scientific databases work well for simple lookups. As queries get more complex, three patterns appear. Hallucinated schema relationships, where SQL built on joins the schema never defined still executes and returns plausible-looking results. Loss of context, where, for example, biochemical and cellular readouts get merged into the same result set. And missing implicit scientific reasoning, where selectivity or cross-target questions need domain knowledge the schema does not encode.
Fluent SQL is not, by itself, a sign of reliable scientific interpretation.
The schema hallucination problem is particularly insidious in drug discovery databases because the outputs often look scientifically plausible. A query that merges biochemical and cellular IC50 values into the same result set does not produce an error message — it produces numbers that appear internally consistent. The only defence is a database schema designed with scientific context built in from the start, not retrofitted as metadata. Excelra’s Structured and Analysis-Ready Data for AI/ML-Based Drug Discovery case study demonstrates what this looks like in practice — a data architecture where assay context, endpoint type, and experimental conditions are structured fields, not free text that an LLM must interpret.
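A small sketch makes the silent-merge problem concrete. The table, column names, compound, and values below are all hypothetical, and the arithmetic mean is used only for illustration. The first query runs without any error and returns a single tidy number; the second keeps the assay format as a structured field and shows that the number was blending two different measurements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE activity (
           compound_id  TEXT,
           target       TEXT,
           assay_format TEXT,   -- structured field, not free text
           ic50_nm      REAL
       )"""
)
# Hypothetical records: the same compound measured in two assay formats.
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?, ?)",
    [
        ("CMPD-1", "KinaseX", "biochemical", 10.0),
        ("CMPD-1", "KinaseX", "cellular", 1000.0),
    ],
)

# Fluent, executable, and scientifically wrong: both formats are averaged together.
naive_mean = conn.execute(
    "SELECT AVG(ic50_nm) FROM activity WHERE compound_id = 'CMPD-1'"
).fetchone()[0]

# Context-aware version: one value per assay format.
by_format = dict(
    conn.execute(
        """SELECT assay_format, AVG(ic50_nm)
           FROM activity
           WHERE compound_id = 'CMPD-1'
           GROUP BY assay_format"""
    ).fetchall()
)

print(naive_mean)
print(by_format)
```

Nothing in the first result signals a problem, which is why the assay format has to be a field a query can filter or group on.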
Challenge 3: Scientific reliability: more than model intelligence
Across every workflow, the pattern is the same. LLMs produce fluent narratives, executable queries, and confident summaries. Scientific reliability depends on contextual information the model cannot validate on its own. A 2025 review of LLMs in chemistry reached the same conclusion: data quality and integration are now among the central challenges in the field [2].
The centre of gravity is moving. For five years, AI progress was a model-intelligence story. The harder problem in 2026 is the data layer: completeness of context, clarity of schema, calibrated uncertainty, and validation around the model.
Figure 1. A working architecture for reliable LLM-driven chemistry workflows. The curated data layer carries the scientific context the model cannot infer on its own.
The shift from model-intelligence to data-layer challenges is not unique to chemistry — but chemistry makes it visible faster than most domains, because chemical structures are unambiguous. A molecule is either right or wrong. There is no partial credit for a SMILES string with the wrong chirality. This unforgiving precision requirement is why the field is converging on the conclusion that curated, context-rich chemical data is the limiting factor in reliable AI-driven drug discovery — not model architecture. For a broader perspective on how AI and data quality interact across drug discovery workflows, see Excelra’s blog on Empowering Drug Discovery with Big Data and Artificial Intelligence.
Building reliable scientific AI
Four areas matter operationally. Transparent benchmarking on production-grade data, not vendor demos. Uncertainty-aware systems that can say “I do not have enough context” instead of generating fluent text anyway. Human validation workflows, because expert review catches errors automated checks cannot see. And better contextual data representation, so experimental conditions and assay metadata live inside the data structure rather than in free text.
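The uncertainty-aware behaviour described above can be sketched as a simple gate in front of the answer step. This is a minimal illustration under stated assumptions: the record type, the list of required context fields, and the function names are hypothetical, and a real system would apply richer checks than field presence.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed minimum context needed before a value can be interpreted.
REQUIRED_CONTEXT = ("assay_format", "endpoint", "units")


@dataclass
class AssayRecord:
    compound_id: str
    value: float
    assay_format: Optional[str] = None
    endpoint: Optional[str] = None
    units: Optional[str] = None


def missing_context(record: AssayRecord) -> List[str]:
    """Names of required context fields that are absent or empty."""
    return [name for name in REQUIRED_CONTEXT if not getattr(record, name)]


def answer_or_abstain(record: AssayRecord) -> str:
    """Report the value only when its context is complete; otherwise abstain."""
    gaps = missing_context(record)
    if gaps:
        return "I do not have enough context: missing " + ", ".join(gaps)
    return (f"{record.compound_id}: {record.endpoint} = "
            f"{record.value} {record.units} ({record.assay_format} assay)")


complete = AssayRecord("CMPD-1", 10.0, "biochemical", "IC50", "nM")
incomplete = AssayRecord("CMPD-2", 50.0, endpoint="IC50")

print(answer_or_abstain(complete))
print(answer_or_abstain(incomplete))
```

The design choice is that abstention is decided by the data layer before any text is generated, so a fluent answer is never produced from a record that cannot support it.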
In scientific AI workflows, reliable outputs depend on more than language generation. They depend on trusted scientific data foundations.
At Excelra, this is precisely the gap GOSTAR™ has been built to address: curated, context-rich medicinal chemistry data designed to make AI and ML workflows reliable.
GOSTAR™ covers small molecules, large molecules, and targeted protein degraders, each with the same emphasis on curated, context-rich data designed for computational workflows.
For teams building or evaluating LLM-powered chemistry workflows, the practical first step is rarely a better model — it is a better data foundation. Excelra’s whitepaper on Transforming Unstructured Data into Actionable Insights Using AI examines how unstructured chemistry and life sciences data — including patent literature and assay records — can be systematically converted into structured, AI-ready assets that eliminate the failure modes described in this blog.
References
[1] Patiny L. et al. (2024). Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature. Journal of Cheminformatics, 16, Article 131. https://doi.org/10.1186/s13321-024-00928-8
[2] Ramos M.C., Collison C.J., White A.D. (2025). A review of large language models and autonomous agents in chemistry. Chemical Science. PMC11739813. https://pmc.ncbi.nlm.nih.gov/articles/PMC11739813/
[3] Mol-Hallu (2025). How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension. arXiv:2504.12314. https://arxiv.org/abs/2504.12314
[4] ReactionSeek (2026). LLM-powered literature data mining and knowledge discovery in organic synthesis. Nature Communications.
What is SMILES hallucination and why does it matter in drug discovery?
SMILES hallucination refers to the tendency of large language models to generate SMILES strings — the text-based notation used to represent chemical structures — that are syntactically valid but chemically incorrect. A hallucinated SMILES can parse correctly, pass basic validity checks, and even render as a plausible-looking molecular structure in a chemistry software tool, while representing a compound that does not match the original source. In drug discovery, this matters because downstream workflows — including ADMET prediction, docking, SAR analysis, and patent freedom-to-operate assessments — all depend on the accuracy of the input structure. A chirally inverted SMILES represents a different stereoisomer with potentially different biological activity and toxicity. An independent 2024 study reported only 73.5% accuracy on LLM-extracted SMILES from chemistry patents, and the errors cluster disproportionately around the most complex and therapeutically important structural features.
Why do LLMs fail at chemistry database querying even when their SQL is correct?
LLMs can generate syntactically correct SQL that nonetheless returns scientifically wrong answers — a failure mode called schema hallucination. This happens because chemistry databases encode scientific context in their schema design: biochemical IC50 values and cellular IC50 values are different measurements that should never be aggregated without explicit filtering, selectivity data requires cross-target joins that are domain-specific, and assay conditions like concentration units and assay format are fields that change the interpretation of a result. When an LLM generates a query, it can correctly identify table names and column headers while completely missing the scientific logic that governs how those fields should be combined. The resulting output looks like a proper database response but contains scientifically invalid aggregations. Fluent SQL is not a reliable proxy for scientific correctness in complex chemistry databases, and this is why schema design — with scientific context built in rather than stored as free text — is a prerequisite for reliable LLM querying.
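One hedged way to catch hallucinated joins is to compare each join a generated query relies on with the foreign keys the database actually declares. The sketch below uses SQLite's `PRAGMA foreign_key_list`; the schema and helper names are hypothetical, and it assumes the join column pairs have already been extracted from the generated SQL by some upstream step that is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE compound (compound_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE target   (target_id   INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE activity (
        activity_id INTEGER PRIMARY KEY,
        compound_id INTEGER REFERENCES compound(compound_id),
        target_id   INTEGER REFERENCES target(target_id),
        ic50_nm     REAL
    );
    INSERT INTO compound VALUES (1, 'CMPD-1');
    INSERT INTO target   VALUES (1, 'KinaseX');
    INSERT INTO activity VALUES (1, 1, 1, 10.0);
    """
)


def declared_joins(connection):
    """Set of unordered {(table, column), (table, column)} pairs backed by a foreign key."""
    joins = set()
    tables = [row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # Row layout: id, seq, parent table, child column, parent column, ...
        for fk in connection.execute(f"PRAGMA foreign_key_list({table})"):
            joins.add(frozenset({(table, fk[3]), (fk[2], fk[4])}))
    return joins


def join_is_declared(left, right, connection):
    """True if joining left=(table, column) to right=(table, column) follows a declared key."""
    return frozenset({left, right}) in declared_joins(connection)


# A join the schema never defined: it still executes and returns a plausible row.
bogus_sql = ("SELECT c.name, t.name FROM compound c "
             "JOIN target t ON c.compound_id = t.target_id")
bogus_rows = conn.execute(bogus_sql).fetchall()

print(bogus_rows)
print(join_is_declared(("compound", "compound_id"), ("target", "target_id"), conn))
print(join_is_declared(("activity", "compound_id"), ("compound", "compound_id"), conn))
```

A check like this covers only structural validity. It cannot tell whether a declared join is the scientifically appropriate one for the question, which still needs domain review.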
What is Markush misinterpretation in LLM patent extraction?
Markush representations are a specialized notation used in chemistry patents to describe families of related compounds with a single generic structure: a core scaffold with variable R-group substituents defined separately in the patent text. A single Markush claim can represent hundreds or thousands of specific compounds simultaneously. When LLMs process Markush representations during patent extraction, a common failure mode is collapsing the generic Markush structure into one or a few specific compounds that the patent never explicitly claimed, losing the breadth of the actual IP coverage. Alternatively, models may misassign R-group definitions, applying the substituent set defined for one R-group position to another, and so generate structures that appear chemically valid but misrepresent the patent’s actual claims. This is particularly consequential for freedom-to-operate assessments and competitive intelligence, where an incomplete or incorrect chemical structure extraction can lead to a materially wrong conclusion about patent coverage.
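The scale of what a collapsed Markush claim loses can be illustrated with a toy enumeration. The core template and R-group lists below are hypothetical and deliberately tiny; real claims involve nested definitions and far larger substituent sets, and real enumeration would use a cheminformatics toolkit rather than string templates.

```python
from itertools import product

# Hypothetical generic structure: a para-disubstituted benzene with two R-groups.
CORE_TEMPLATE = "{R1}c1ccc({R2})cc1"
R_GROUPS = {
    "R1": ["F", "Cl", "Br"],       # halogens
    "R2": ["C", "CC", "OC", "N"],  # methyl, ethyl, methoxy, amino
}


def enumerate_markush(template, r_groups):
    """Expand every combination of R-group choices into a specific SMILES string."""
    names = list(r_groups)
    return [
        template.format(**dict(zip(names, choice)))
        for choice in product(*(r_groups[name] for name in names))
    ]


claimed = enumerate_markush(CORE_TEMPLATE, R_GROUPS)

# What a collapsed extraction might keep: a single exemplified compound.
extracted = ["Fc1ccc(C)cc1"]
coverage = len(set(extracted) & set(claimed)) / len(claimed)

print(len(claimed))
print(coverage)
```

Even in this toy case, one extracted compound stands in for a twelve-member family. The claimed space grows multiplicatively with each additional R-group position.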
What does 'data layer' mean in the context of AI-driven drug discovery?
The data layer in AI-driven drug discovery refers to the infrastructure of scientific data that feeds AI and ML models — including the databases, schemas, curation standards, metadata frameworks, and validation processes that determine what information the model actually receives as input. For five years, progress in AI drug discovery was primarily a model-intelligence story: better architectures, more parameters, more training data. The emerging consensus in 2025-2026 is that the binding constraint has shifted to the data layer — specifically, whether the data reaching the model is complete, contextually rich, and structurally consistent enough for the model to produce reliable scientific outputs. In chemistry, this means chemical structures with validated stereochemistry, assay results annotated with experimental conditions and endpoint type, and patent data curated beyond raw text extraction. The data layer is where hallucination, context loss, and schema drift originate — and where they must be fixed.
How does curated chemistry data reduce LLM hallucination?
Curated chemistry data reduces LLM hallucination by providing the model with explicit, validated scientific context that it cannot reliably infer from raw text alone. When chemical structures are stored as validated, stereochemically correct SMILES with expert review rather than as raw patent extracts, the model retrieves a pre-validated structure instead of generating one, which removes the step where a wrong structure would be hallucinated. When assay results are annotated with endpoint type, assay format, and experimental conditions as structured database fields rather than free text, queries can filter and group on those fields explicitly, so incompatible measurements are far harder to merge unnoticed. When compound metadata includes biological activity, selectivity profiles, and physicochemical properties as linked records, the model has the scientific context needed to answer selectivity or cross-target questions correctly. Curation does not eliminate all LLM failure modes, and hallucination at the reasoning layer is a separate problem, but it addresses the most common and most consequential failure modes that originate in inadequate data representation.
What are the most important requirements for a chemistry database to support reliable LLM workflows?
A chemistry database designed to support reliable LLM workflows needs five properties. First, validated chemical structures — every SMILES and InChI should be validated against a chemical structure toolkit, with stereochemistry explicitly encoded rather than left implicit. Second, structured scientific context — assay type, endpoint, experimental conditions, species, and cell line should be structured fields, not free text, so LLMs cannot misinterpret them. Third, explicit uncertainty representation — data quality flags, assay reliability scores, and confidence levels allow the LLM to know when to hedge rather than generating fluent but unreliable outputs. Fourth, canonical identifiers — consistent compound, target, and assay identifiers across records enable reliable joins and cross-references without schema hallucination. Fifth, regular expert curation — automated ingestion without expert validation accumulates errors that compound as the database is used in training or querying. These properties describe what separates a production-grade medicinal chemistry database from a raw data aggregation.
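Two of these properties, structured scientific context and explicit uncertainty representation, can be enforced directly in the schema. The sketch below is a minimal illustration in SQLite with hypothetical table, column, and vocabulary choices; it does not attempt structure validation or expert curation, which need a chemistry toolkit and human review.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE activity (
        activity_id  INTEGER PRIMARY KEY,
        compound_id  TEXT NOT NULL,   -- canonical compound identifier
        smiles       TEXT NOT NULL,   -- assumed validated upstream
        assay_format TEXT NOT NULL CHECK (assay_format IN ('biochemical', 'cellular')),
        endpoint     TEXT NOT NULL CHECK (endpoint IN ('IC50', 'EC50', 'Ki', 'Kd')),
        value_nm     REAL NOT NULL CHECK (value_nm > 0),
        confidence   TEXT NOT NULL CHECK (confidence IN ('high', 'medium', 'low'))
    )
    """
)

INSERT_SQL = (
    "INSERT INTO activity "
    "(compound_id, smiles, assay_format, endpoint, value_nm, confidence) "
    "VALUES (?, ?, ?, ?, ?, ?)"
)


def try_insert(row):
    """Insert a record; return False if the schema rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(INSERT_SQL, row)
        return True
    except sqlite3.IntegrityError:
        return False


# Controlled vocabulary respected: accepted.
accepted = try_insert(
    ("CMPD-1", "N[C@@H](C)C(=O)O", "biochemical", "IC50", 10.0, "high"))
# Free-text assay description: rejected by the CHECK constraint.
rejected = try_insert(
    ("CMPD-2", "NC(C)C(=O)O", "cell-based, see notes", "IC50", 50.0, "high"))

row_count = conn.execute("SELECT COUNT(*) FROM activity").fetchone()[0]
print(accepted, rejected, row_count)
```

Rejecting free text at write time means a downstream query, whether written by a person or generated by an LLM, only ever sees values from a known vocabulary.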
Is Your Chemistry Data Ready for AI?
GOSTAR® from Excelra provides curated, context-rich medicinal chemistry data — small molecules, large molecules, and targeted protein degraders — designed specifically to make LLM and ML workflows reliable in drug discovery. If your AI chemistry workflows are producing fluent but scientifically unreliable outputs, the data layer is where to look first.
