Flow cytometry (FC) is a crucial technique used in biotech and biomedical laboratories to characterize and sort individual cells. However, the sheer volume and complexity of the data it generates make the results challenging to analyze and interpret. To address this problem, researchers can apply machine learning (ML) algorithms. This blog post outlines four primary ways in which ML aids the processing of flow cytometry data.
Flow cytometry (FC) is a technique that enables high-throughput analysis and sorting of individual cells based on their physical and biochemical characteristics. Using this method, researchers can characterize a large set of parameters for millions of single cells per sample in a matter of seconds. This remarkable throughput, combined with the relatively low cost of FC, makes it an essential technique in academic and industrial labs for gaining deeper insights into cellular biology and advancing research in areas such as cancer diagnostics and therapies, immunology, and stem cell research.
Example applications of FC include:
- Cell counting and viability assessment
- Identification and characterization of cell populations
- Analysis of cell cycle and apoptosis
- Detection and quantification of cell surface markers and intracellular proteins
- Sorting cells for downstream applications, such as cell culture and gene expression analysis
The “big data” challenge in flow cytometry
Flow cytometry generates data at a rate of approx. 1 Tbit/s (1). The data are also high-dimensional, so traditional two-dimensional flow cytometry plots often cannot capture all the patterns and relationships in this data space. Moreover, setting gates manually is labor-intensive, significantly slows down data processing, and is prone to human subjectivity and even errors (2).
To unlock the potential of the flow cytometry data and access the complete information they contain in a time-efficient and more objective and consistent way, researchers should consider the use of machine learning (ML) algorithms.
ML as a powerful aid in flow cytometry data analysis
Machine learning is a “set of computational and statistical methods that learn patterns from the data with minimal input from humans” (2). The ability of ML to leverage large-scale data to improve performance on a specified set of tasks makes it a powerful tool for FC data processing, analysis, and interpretation. Below, we highlight the most notable ML use cases that aid flow cytometry.
ML in dimensionality reduction of the flow cytometry data
One of the important steps in FC data analysis is creating a visual representation of the results in the form of two- or three-dimensional plots. These visuals help researchers explore and communicate the results. Many machine learning algorithms can compress high-dimensional FC data into the desired number of dimensions. Primary examples include principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection (UMAP), and multidimensional scaling (MDS). Keep in mind that reducing dimensionality inevitably loses some of the information in the data. What is lost and what is preserved depends on the chosen algorithm.
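As a rough illustration of dimensionality reduction, the sketch below applies PCA to synthetic data standing in for FC events (rows are cells, columns are markers). The data, the helper name `pca_project`, and the choice of eight markers are assumptions made for this example only; real FC data would normally be compensated and transformed first.

```python
import numpy as np


def pca_project(events, n_components=2):
    """Project events onto the top principal components (hypothetical helper).

    Returns the low-dimensional embedding and the fraction of variance
    explained by each retained component.
    """
    centered = events - events.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    embedding = centered @ vt[:n_components].T
    return embedding, explained[:n_components]


# Synthetic stand-in for FC data: two cell populations, 8 markers each.
rng = np.random.default_rng(0)
population_a = rng.normal(loc=0.0, scale=1.0, size=(500, 8))
population_b = rng.normal(loc=4.0, scale=1.0, size=(500, 8))
events = np.vstack([population_a, population_b])

embedding, explained = pca_project(events, n_components=2)
print(embedding.shape)  # one 2-D point per cell, ready for a scatter plot
print(explained)        # share of variance kept by each component
```

The explained-variance values make the information loss concrete: whatever share of variance is not covered by the retained components is discarded by the projection. PCA is linear, so nonlinear methods such as t-SNE or UMAP may preserve different aspects of the data.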
ML in clustering and classification of different cell types
A common goal in flow cytometry experiments is to classify cells into different groups based on their physical and biochemical characteristics. In this way, researchers can, for example, profile the composition of healthy tissues and characterize how cells change in disease. Many machine learning algorithms are available for clustering and classifying cell populations in the high-dimensional space of flow cytometry data (2). Researchers can choose from various supervised or unsupervised ML methods, or combinations of the two, depending on the FC experiment setting (e.g. comparing data from a single experiment vs comparing multiple datasets), objectives (e.g. identifying known cell populations vs discovering novel cell populations or subpopulations within a sample), and prior knowledge.
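To make the unsupervised case concrete, here is a minimal sketch of k-means clustering on synthetic events. K-means is used only as a simple, well-known example and is not a method named in this post; the function names, the six-marker data, and the choice of two clusters are all assumptions for illustration.

```python
import numpy as np


def nearest_center(events, centers):
    """Return, for each cell, the index of the closest cluster center."""
    dists = np.linalg.norm(events[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(events, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means (hypothetical helper): returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct cells picked at random.
    centers = events[rng.choice(len(events), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(events, centers)
        # Move each center to the mean of its cells; keep it if it has none.
        new_centers = np.array([
            events[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return nearest_center(events, centers), centers


# Synthetic stand-in for FC data: two well-separated populations, 6 markers.
rng = np.random.default_rng(1)
events = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(300, 6)),
    rng.normal(loc=4.0, scale=1.0, size=(300, 6)),
])

labels, centers = kmeans(events, k=2)
print(np.bincount(labels))  # number of cells assigned to each cluster
```

No labels are supplied here: the algorithm groups cells purely by similarity, which suits the discovery of unknown populations. When known populations must be identified, a supervised classifier trained on annotated cells would take the place of this step.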
ML in anomaly detection
In some cases, the aim of FC analysis is to detect rare cell types or cell types that may have a pathological function. To identify cells that differ significantly from most of the sample, researchers can use Decision Tree or Random Forest algorithms. These are supervised algorithms: before they can be used in the analysis, they need to be trained on separate, labeled data sets. This contrasts with the algorithms used in clustering, which are usually unsupervised and do not require a training data set. Supervised algorithms are, however, also used in classification.
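The supervised workflow described above might look like the following sketch, which trains a Random Forest from scikit-learn to flag a rare population. The synthetic data, the 2% rare-cell fraction, and all hyperparameter values are assumptions chosen for the example, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, labeled stand-in for FC data: 2,000 common cells and 40 rare
# cells that differ in marker intensity (assumed values, 6 markers).
rng = np.random.default_rng(0)
common = rng.normal(loc=0.0, scale=1.0, size=(2000, 6))
rare = rng.normal(loc=3.0, scale=1.0, size=(40, 6))
X = np.vstack([common, rare])
y = np.concatenate([np.zeros(2000, dtype=int), np.ones(40, dtype=int)])

# Hold out part of the labeled data to check the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" counteracts the strong class imbalance.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
# Recall on the rare class: the share of true rare cells that were flagged.
rare_recall = (pred[y_test == 1] == 1).mean()
print(f"rare-cell recall: {rare_recall:.2f}")
```

Recall on the rare class is reported rather than overall accuracy, because with so few rare cells a model that labels every cell as common would still score about 98% accuracy.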
ML in predictive modeling
After the cells in each sample have been characterized (clustering and classification), summary statistics of the biological characteristics of the different cell groups (cell type, cell cycle stage, response to a particular treatment, etc.) can be used to group the samples into categories (sample classification), for example, “healthy” vs “diseased”. This input can subsequently be used in ML algorithms to discover biomarkers associated with “diseased” cells and/or to analyze clinical effects, such as the response to therapy or vaccination. Machine learning algorithms used in predictive modeling include neural networks, gradient-boosting machines, and others.
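A minimal sketch of sample-level prediction, assuming each sample has already been summarized as the fraction of its cells in each of five populations. The data are synthetic, the "diseased" group is simulated with an expanded population 0, and the model settings are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# One row per sample, one column per cell population; each row sums to 1.
# Assumption: diseased samples have a larger share of population 0.
rng = np.random.default_rng(0)
healthy = rng.dirichlet([5, 5, 5, 5, 5], size=60)
diseased = rng.dirichlet([25, 5, 5, 5, 5], size=60)
X = np.vstack([healthy, diseased])
y = np.array([0] * 60 + [1] * 60)  # 0 = healthy, 1 = diseased

model = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)

# 5-fold cross-validated accuracy estimates how well the model generalizes.
scores = cross_val_score(model, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")

# Feature importances hint at which population drives the prediction,
# i.e. a candidate biomarker to investigate further.
model.fit(X, y)
importances = model.feature_importances_
print(importances)
```

In this simulated setup the importance scores should point to population 0, the one that was deliberately altered. With real data, such a ranking is a starting point for biomarker discovery and would need independent validation.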
To sum up, researchers who want to enhance the accuracy, speed, and scalability of their flow cytometry data analysis should consider including machine learning algorithms in their workflows. By doing so, they can gain a more comprehensive understanding of cellular biology, which can in turn guide the development of more effective diagnoses and treatments of diseases.