Authors: Dr Bindu Ajithkumar (PhD)
The data infrastructure work that defensible AI rests on — FAIR data management, ontologies, controlled vocabularies, and scientific data curation at scale — is increasingly funded under the regulatory submissions banner. The organisations doing it well are not always the ones the AI press writes about.
The Observation from the Pistoia Alliance
At the Pistoia Alliance’s London meeting in April, the sessions drawing the deepest engagement were not on generative chemistry or LLM tooling. They were on ontologies, FAIR data, and the unglamorous work of making data mean the same thing across systems that were never built to talk to each other. The chemistry sessions had thinner rooms. So did the model-architecture sessions. The crowd, fairly uniformly, was in the rooms where people were trying to solve interoperability.
That observation cuts against the dominant narrative outside the room. The headlines this year have been about AI-generated chemistry, multi-agent discovery platforms, and language models for protocol writing. Those are real, and some of them are working. But the projects I have seen delivering in production — faster regulatory submissions, accelerated CMC transfers, more reliable preclinical synthesis — all sit on top of substantial, expensive, often invisible data infrastructure work that the AI demo does not show.
The Question Pharma Has Been Asking
For years the conversation in life sciences AI went something like: we have the models, now how do we get the data ready? Data quality was treated as cleanup work, scheduled after model selection and before deployment.
That framing has shifted, and I would argue more decisively than most external commentary suggests. The pharma teams I am speaking with are treating semantic foundations — controlled vocabularies, shared ontologies, FAIR-compliant pipelines, audit-grade provenance — as a precondition for AI investment rather than a downstream task. Several have told me, in fairly direct terms, that their AI roadmap is gated on their ontology work, not on model availability.
What is more interesting is where the funding is coming from.
The Submissions Banner
A lot of FAIR data and ontology work in pharma is now funded under the regulatory submissions banner: CMC data standardisation, eCTD module authoring, substance registration, IDMP readiness. These are data infrastructure investments dressed up as compliance projects, and they are the same investments that make downstream AI defensible at scale.
The AI function in many large pharma organisations has been requesting data infrastructure budget for years, with mixed success. Submissions teams have been quietly pulling the same budget under a different name, because the regulator was asking for it. The two conversations have often run in parallel, funded out of different P&Ls, sometimes by teams that do not speak to each other.
The organisations that have noticed are starting to consolidate. They are treating their submissions data work and their AI data work as a single programme, on the basis that the underlying ontologies, vocabularies and provenance requirements are the same whether the consumer is a regulator or a model.
What Defensible Means in This Context
I use the word defensible deliberately. There is a difference between AI that performs well in a demonstration and AI that holds up in production — in front of a regulator, a scientific review board, or a cross-functional leadership team that needs to sign off on a clinical decision.
The difference is rarely the model itself. By the time a model is being deployed in a regulated context, the candidates are usually adequate. What separates the AI you can defend from the AI you can only demo is the architecture around it: where the training data came from and whether it can be retraced; whether the input data is grounded in a shared ontology with controlled vocabularies; whether human experts have validated the outputs at the steps that matter; whether you have an audit trail you would put in front of a CMC reviewer.
These are data and process characteristics, and they take time to build. The teams that have invested in them are now finding their AI roadmap unblocked in ways that AI-first teams are still struggling with.
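To make those four characteristics concrete, here is a minimal Python sketch of how a single curated data point might carry them. Everything in it is an illustrative assumption rather than a description of any real system: the `CuratedRecord` class, the `defensibility_gaps` helper, the `EX:` term identifiers and the hash-chained audit log are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional
import hashlib
import json

# Hypothetical controlled vocabulary. The identifiers and labels are
# placeholders for illustration, not terms from a real ontology.
CONTROLLED_VOCAB: Dict[str, str] = {
    "EX:0001": "solubility",
    "EX:0002": "melting point",
}


@dataclass
class CuratedRecord:
    """One data point plus the context needed to defend it later."""
    term_id: str                       # ontology term the value is grounded in
    value: float
    source_document: str               # where the value came from (provenance)
    reviewed_by: Optional[str] = None  # expert who validated it, if any
    audit_trail: List[dict] = field(default_factory=list)

    def log(self, action: str, actor: str) -> None:
        """Append an audit entry whose hash covers the previous entry's hash."""
        prev_hash = self.audit_trail[-1]["hash"] if self.audit_trail else ""
        entry = {
            "action": action,
            "actor": actor,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev_hash": prev_hash,
        }
        payload = json.dumps(entry, sort_keys=True).encode("utf-8")
        entry["hash"] = hashlib.sha256(payload).hexdigest()
        self.audit_trail.append(entry)


def defensibility_gaps(record: CuratedRecord) -> List[str]:
    """Return the checks this record fails; an empty list means none failed."""
    gaps = []
    if record.term_id not in CONTROLLED_VOCAB:
        gaps.append("term not in controlled vocabulary")
    if not record.source_document:
        gaps.append("no traceable source")
    if record.reviewed_by is None:
        gaps.append("no expert validation")
    if not record.audit_trail:
        gaps.append("no audit trail")
    return gaps
```

A record that is grounded in the vocabulary and has a source, but has not yet been reviewed or logged, would come back with the last two gaps. The point of the sketch is that each check inspects the data and the process around it, and none of them inspects the model.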
Who Is Building This
The phrase “data foundations” covers a lot of work that is being done by different groups, often in parallel and often without enough coordination. Three categories are worth distinguishing.
Community Standards
The first is the community standards work: the Pistoia Alliance’s FAIR Implementation programme, CDISC, OHDSI, IDMP, the IMI/IHI consortia. This is the foundational layer that nobody owns and everybody depends on. It is also chronically underfunded relative to its importance, and the people doing it are not the people who get written up in the AI press.
Internal Pharma Teams
The second is the internal teams inside large pharma — the bioinformatics functions, the regulatory data offices, the CMC standardisation programmes — that translate the community standards into something operable inside a specific organisation’s data estate. This is where most of the budget actually sits, and where the AI/submissions consolidation I described above is starting to happen.
Specialist Data Partners
The third is a small number of specialist data partners that do FAIR-compliant scientific data curation at scale — the work of taking unstructured scientific content (papers, trial protocols, regulatory documents, lab notebooks, CRO outputs) and turning it into ontology-grounded, provenance-tracked, AI-ready datasets. Excelra sits in this third category. Most of my work involves either delivering this scientific data curation directly or advising on the data architecture decisions that determine whether downstream AI is defensible.
These three groups are not substitutes for each other. The standards work without operational translation produces nothing usable inside a pharma data estate. The internal teams without specialist partners are often building from scratch what could be sourced from groups that have already done it across multiple sponsors. And specialist partners without the standards work are building without the shared semantic layer that interoperability depends on.
What Follows for AI Strategy
The AI roadmap and the data infrastructure roadmap should not be separate documents. If they are — and in many organisations they still are — the AI roadmap is downstream of work that is happening on a different clock under a different sponsor, and the alignment is incidental.
The AI investment case should also acknowledge that timelines are set by the upstream work. The headline use cases (generative chemistry, multi-agent planning, regulatory writing automation) all share a dependency on semantic infrastructure that most organisations are still building. Funding the use cases without funding the infrastructure tends to produce demos that do not survive contact with production.
My read is that the next eighteen months of pharma AI progress will come disproportionately from teams that have already consolidated their submissions and AI data work into a single programme, and that are working with specialist partners on the curation layer instead of rebuilding it internally. The teams still running these workstreams in parallel will spend most of that period discovering why their AI roadmap is stuck.
