Development of a hit-calling algorithm on DEL selection data

Our client’s requirement

Our client is a European pharma company focused on discovering and developing small-molecule medicines with novel modes of action. A key stage of their research requires processing sequencing data through a next-generation sequencing (NGS) pipeline.

The client’s DNA-encoded library (DEL) produced a large volume of data but low coverage per compound. A pragmatic statistical approach, borrowing methods from differential gene expression analysis, was required to reliably detect true positives despite the sparse counts.

The ultimate goal was to develop a hit-calling algorithm to find candidates for testing.

Our approach

To ensure the algorithm was optimized for the specific data involved, we started with a thorough analysis of data from 72 samples, with raw counts for each of the 5 million compounds in the client’s library. Because low raw counts increase the risk of both selecting false positives and missing true hits (false negatives), we put particular emphasis on normalizing the data set so that trends were preserved across all the generated data.

We reviewed various algorithms for detecting differentially expressed genes (DEGs) before selecting DESeq2. We then built an effective and efficient automated pipeline, implemented as a containerized Nextflow workflow.
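
To make the normalization step concrete, here is a minimal Python sketch of median-of-ratios size-factor estimation, the default normalization idea in DESeq2. It is an illustration under assumptions, not the pipeline built for the client: the function names (size_factors, normalize) and the toy count matrix are hypothetical, and the real analysis ran on 72 samples and 5 million compounds.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample by the median-of-ratios method.

    counts: list of rows (one per compound), each a list of raw counts,
    one per sample. Hypothetical helper for illustration only.
    """
    # Only compounds observed in every sample are usable: log(0) is undefined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no compound has a non-zero count in every sample")

    n_samples = len(usable[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in usable:
        logs = [math.log(c) for c in row]
        # Log of the geometric mean of this compound across samples.
        log_geo_mean = sum(logs) / n_samples
        for j, value in enumerate(logs):
            log_ratios[j].append(value - log_geo_mean)

    # Size factor = median ratio of a sample's counts to the geometric means.
    return [math.exp(median(ratios)) for ratios in log_ratios]


def normalize(counts):
    """Divide every raw count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Toy data (made up): sample 2 is sequenced twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [20, 40],
    [30, 60],
    [0, 5],  # contains a zero, so it is excluded from factor estimation
]
factors = size_factors(toy_counts)
normalized = normalize(toy_counts)
```

After normalization, the first three toy compounds have equal values in both samples, so the difference in sequencing depth no longer looks like enrichment. With low-coverage DEL data, few compounds may have non-zero counts in every sample, which makes this estimator less stable; DESeq2 provides alternative estimators for that situation (for example, type = "poscounts" in estimateSizeFactors).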