Flow cytometry (FC) is a crucial technique used in biotech and biomedical laboratories for characterizing and sorting individual cells, and machine learning is becoming increasingly important in analyzing the data it produces. The vast quantity and complexity of that data, however, make the results challenging to analyze and interpret. To address this problem, researchers can apply machine learning (ML) algorithms. This blog post outlines four primary ways in which ML aids the processing of flow cytometry data.
Flow cytometry (FC) is a technique that enables high-throughput analysis and sorting of individual cells based on their physical and biochemical characteristics. Using this method, researchers can characterize a large set of parameters for millions of single cells per sample in a matter of seconds. This remarkable throughput, combined with the relatively low cost of FC, makes it an essential technique in academic and industrial labs for gaining deeper insights into cellular biology and advancing research in areas such as cancer diagnostics and therapies, immunology, and stem cell research.
Example applications of flow cytometry
- Cell counting and viability assessment
- Identification and characterization of cell populations
- Analysis of cell cycle and apoptosis
- Detection and quantification of cell surface markers and intracellular proteins
- Sorting cells for downstream applications, such as cell culture and gene expression analysis
The “big data” challenge in flow cytometry
Flow cytometry generates data at a rate of approx. 1 Tbit/s (1). Moreover, the data is high-dimensional. With traditional two-dimensional flow cytometry plots, which show only two parameters at a time, it is often impossible to capture all the patterns and relationships in this data space.
Manually setting gates, that is, drawing regions on these plots to select cell populations, is also labor-intensive, significantly slows down data processing, and is prone to human subjectivity and error (2). To unlock the full potential of flow cytometry data and access the complete information it contains in a time-efficient, more objective, and more consistent way, researchers should consider using machine learning in their data analysis.
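To make the limits of two-dimensional plots concrete, the short sketch below counts how many distinct two-parameter plots exist for a panel of a given size. The panel sizes are illustrative assumptions, not values taken from the cited sources, and `pairwise_plot_count` is a hypothetical helper name.

```python
from math import comb


def pairwise_plot_count(n_parameters: int) -> int:
    """Number of distinct 2D scatter plots for n measured parameters."""
    return comb(n_parameters, 2)


# Hypothetical panel sizes, chosen only for illustration.
for n in (4, 10, 20, 40):
    print(f"{n} parameters -> {pairwise_plot_count(n)} two-dimensional plots")
```

The count grows quadratically with the number of parameters, which is one reason why inspecting and gating every pair by hand quickly becomes impractical.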
ML as a powerful aid in flow cytometry data analysis
Machine learning is a “set of computational and statistical methods that learn patterns from the data with minimal input from humans” (2). The ability of ML to leverage large-scale data to improve performance on a specified set of tasks makes it a powerful tool for FC data processing, analysis, and interpretation.
Below, we highlight the most notable ML use cases that aid flow cytometry and support bioinformatics-driven data analysis.
ML in dimensionality reduction of the flow cytometry data
One of the important steps in FC data analysis is creating a visual representation of the results in the form of two- or three-dimensional plots. These visuals help researchers explore and communicate the results.
Many machine learning algorithms can compress high-dimensional FC data into the desired number of dimensions. Primary examples include principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection (UMAP), and multidimensional scaling (MDS).
It is important to keep in mind that the reduction of dimensionality inevitably leads to the loss of some information in data. What is lost and what is preserved will depend on the chosen algorithm.
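As a minimal sketch of this step, the example below applies PCA from scikit-learn to synthetic data that stands in for FC events. The data, the number of markers, and the two simulated populations are assumptions made purely for illustration. Real measurements would first need to be loaded from the instrument's output and preprocessed.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for FC events: 5,000 cells x 8 markers, two populations.
pop_a = rng.normal(loc=0.0, scale=1.0, size=(2500, 8))
pop_b = rng.normal(loc=3.0, scale=1.0, size=(2500, 8))
events = np.vstack([pop_a, pop_b])

# Put all markers on a comparable scale before projecting.
scaled = StandardScaler().fit_transform(events)

# Compress the 8 marker dimensions into 2 for plotting.
pca = PCA(n_components=2)
embedding = pca.fit_transform(scaled)

# Fraction of the total variance that survives the projection.
retained = pca.explained_variance_ratio_.sum()
print(embedding.shape)
print(f"variance retained by 2 components: {retained:.2f}")
```

The retained-variance figure is one simple way to quantify how much information a linear projection keeps. Nonlinear methods such as t-SNE and UMAP preserve different aspects of the data (mainly local neighborhoods) and have no direct equivalent of this number.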
ML in clustering and classification of different cell types
A common goal in flow cytometry experiments is to classify cells into different groups based on their physical and biochemical characteristics. In this way, researchers can, for example, profile the composition of healthy tissues and characterize how cells change in disease.
Many machine learning algorithms are available that can be used to cluster and classify cell populations in the high-dimensional space of the flow cytometry data (2). Researchers can choose from various supervised or unsupervised ML methods, or combinations of the two, depending on the FC experiment setting, objectives, and prior knowledge.
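The unsupervised route can be sketched with k-means clustering from scikit-learn. This is only one of many possible algorithms and is not singled out by the sources cited here. The three simulated populations, the four markers, and the choice of three clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Three simulated cell populations with distinct marker profiles (4 markers).
centers = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [6.0, 0.0, 6.0, 0.0],
    [0.0, 6.0, 0.0, 6.0],
])
true_labels = np.repeat(np.arange(3), 1000)
events = centers[true_labels] + rng.normal(scale=0.5, size=(3000, 4))

# Group cells by marker similarity without using any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(events)

# Number of cells assigned to each cluster.
sizes = np.bincount(cluster_ids, minlength=3)
print(sizes)
```

In a supervised setting, the same event matrix would instead be paired with expert-assigned labels and passed to a classifier.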
ML in anomaly detection
In some cases, the aim of an FC analysis is to detect rare cell types or cell types that may have a pathological function. To identify cells that differ significantly from most of the sample, researchers can use, for example, Decision Tree or Random Forest algorithms.
These are supervised algorithms, so they must be trained on labeled datasets before use. In contrast to clustering approaches that rely on unsupervised methods, anomaly detection carried out this way depends on labeled data to identify deviations effectively.
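The supervised approach described here might look like the following sketch, which trains a Random Forest from scikit-learn to flag a rare population. The simulated data, the 1% rare-cell fraction, and the use of `class_weight="balanced"` to compensate for the class imbalance are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Simulated labeled events: 9,900 common cells and 100 rare cells, 6 markers.
common = rng.normal(loc=0.0, scale=1.0, size=(9900, 6))
rare = rng.normal(loc=4.0, scale=1.0, size=(100, 6))
X = np.vstack([common, rare])
y = np.concatenate([np.zeros(9900, dtype=int), np.ones(100, dtype=int)])

# Stratify so that the rare class appears in both training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" upweights the rare class during training.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
clf.fit(X_train, y_train)

# Recall on the rare class: the share of true rare cells that were found.
rare_recall = recall_score(y_test, clf.predict(X_test))
print(f"recall on the rare class: {rare_recall:.2f}")
```

Overall accuracy is a poor metric when one class makes up 99% of the events, which is why the sketch reports recall on the rare class instead.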
ML in predictive modeling
After the cells in each sample have been characterized through clustering and classification, summary statistics of the biological characteristics of the different cell groups can be used to assign samples to categories such as “healthy” versus “diseased”.
These statistics can subsequently be fed into ML algorithms to discover biomarkers associated with diseased cells and to analyze clinical effects, such as response to therapy or vaccination. Machine learning algorithms used in predictive modeling include neural networks and gradient-boosting machines, among others.
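One way to picture this workflow is sketched below: each sample is summarized by the frequencies of its cell clusters, and a gradient-boosting classifier from scikit-learn learns to separate the two sample groups. The cohort size, the five clusters, and the simulated expansion of one cluster in diseased samples are assumptions for illustration only, not results from any study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_samples = 120
labels = np.repeat([0, 1], n_samples // 2)  # 0 = healthy, 1 = diseased

# Per-sample cluster frequencies; each row sums to 1 (Dirichlet draws).
alpha_healthy = np.array([20.0, 20.0, 20.0, 20.0, 20.0])
alpha_diseased = np.array([20.0, 20.0, 20.0, 20.0, 60.0])  # cluster 4 expanded
freqs = np.vstack([
    rng.dirichlet(alpha_healthy, size=n_samples // 2),
    rng.dirichlet(alpha_diseased, size=n_samples // 2),
])

# Estimate how well cluster frequencies predict the sample label.
model = GradientBoostingClassifier(random_state=0)
accuracy = cross_val_score(model, freqs, labels, cv=5).mean()

# Refit on all samples and rank clusters by their contribution to the model.
model.fit(freqs, labels)
top_cluster = int(np.argmax(model.feature_importances_))
print(f"cross-validated accuracy: {accuracy:.2f}")
print(f"most informative cluster: {top_cluster}")
```

Feature importances of this kind only point to candidate biomarkers. Any cluster highlighted this way would still need biological validation.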
Conclusion
To sum up, researchers who want to enhance the accuracy, speed, and scalability of their flow cytometry data analysis workflows should consider including ML algorithms in their pipelines.
By doing so, they can gain a more comprehensive understanding of cellular biology, which can in turn guide the development of more effective diagnoses and treatments of diseases through data-driven biomedical research.
