# Can Clinical Flow Cytometry Gating Analysis Be Automated?

### Unsupervised and supervised algorithms can improve the efficiency and reproducibility of flow cytometric analysis

A major challenge in clinical flow cytometry is reliable data analysis. Although flow cytometry analysis is a multistep process, considerable interest and effort have focused on the gating step, motivating researchers to develop and implement various algorithms. Using cell markers, the gating step identifies cell populations, enabling the quantification of cell counts and mean fluorescence intensity. Traditionally, gating has been performed manually, an approach reported to yield a mean interlaboratory coefficient of variation (CV) of 17–44 percent, depending on the sample type and analysis method. Automated gating can decrease such variation while also reducing demands on staff and the time spent on analysis. Broadly, gating algorithms can be classified as unsupervised or supervised, depending on whether they extract features in a data-driven way or rely on previous knowledge about the data, respectively.
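To make the CV figure concrete, here is a minimal sketch of how an interlaboratory CV could be computed. The five lab values are hypothetical and chosen only for illustration, and the function name `coefficient_of_variation` is not taken from any flow cytometry package.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the CV (sample standard deviation / mean) as a percentage."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical: percentage of CD4+ T cells reported by five labs for the same sample
lab_results = [40.0, 45.0, 50.0, 55.0, 60.0]
cv = coefficient_of_variation(lab_results)
print(f"Interlaboratory CV: {cv:.1f}%")  # prints "Interlaboratory CV: 15.8%"
```

A lower CV means the labs agree more closely, which is the kind of improvement automated gating is expected to deliver.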

## Unsupervised gating algorithms

Unsupervised algorithms extract features from the dataset to group cells into different cell populations. They can be divided into three main computational approaches:

- Dimensionality reduction techniques represent high-dimensional data in a lower-dimensional (typically two- or three-dimensional) space for ease of visualization and classification.
- Clustering-based analyses group similar data points together.
- A trajectory detection algorithm, Wanderlust, orders cells based on their developmental relationships, i.e., differentiation from parent to daughter cells.

### Dimensionality reduction techniques

#### Principal component analysis

Principal component analysis (PCA) is a technique implemented across various fields, where multivariate data is condensed into its principal components (PC) that explain the variance of the original dataset and are orthogonal to each other.

If you imagine your cell data as an ellipsoid cloud of data points, the first PC, which explains the largest share of the variance in your sample, can be visualized as running through the longest axis of the ellipsoid. The second PC explains the largest share of the variance not explained by the first PC and runs perpendicular to the first PC, forming the second longest axis. The third PC explains the largest share of what then remains and runs orthogonal to the first and second PC, and so forth, for higher-dimensional data.

PCA does not explicitly assign cells to discrete clusters, but if the first two or three PC of a given dataset account for most of the variance, then these PC are sufficient to help you differentiate cell subsets, for example, for immunophenotypic profiling of neoplastic B cells.

However, PCA depends on linear transformations to reduce dimensionality, so if the first two or three PC do not account for most of the variance, it fails to separate cell subsets. Thus, although PCA is easy to implement, its utility for gating depends on the statistical features of the dataset.
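The "longest axis" picture and the share of variance explained can be sketched for the simplest case of two markers, where the covariance matrix has closed-form eigenvalues. The marker values and the function name `pca_2d` are hypothetical. Real panels have many more markers and would normally be analyzed with an established numerical library rather than this hand-rolled version.

```python
import math
from statistics import mean


def pca_2d(points):
    """Return ((lam1, lam2), first_pc) for 2-D points via the covariance matrix."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    # Sample covariance matrix [[a, b], [b, c]] of the centered data
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix
    mid = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    lam1, lam2 = mid + half_gap, mid - half_gap
    # Eigenvector for lam1: the direction of the longest axis of the cloud
    if b != 0:
        vx, vy = lam1 - c, b
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)


# Hypothetical intensities of two strongly correlated markers for five cells
cells = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
(lam1, lam2), first_pc = pca_2d(cells)
explained = lam1 / (lam1 + lam2)  # share of total variance captured by the first PC
print(f"First PC direction: {first_pc}, variance explained: {explained:.3f}")
```

When `explained` is close to 1, as it is for this toy data, a single PC summarizes the cloud well. When the variance is spread evenly across PCs, the low-dimensional view loses the structure needed to separate subsets.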

#### Stochastic neighbor embedding

As an alternative to linear transformation, stochastic neighbor embedding (SNE) depends on nonlinear dimensionality reduction. Here, the distance between points in high-dimensional space is converted into a probability distribution that can be embedded on a two-dimensional plot. SNE maps are useful for visualizing clusters, which can be colored by cell marker expression. An example of its application is the discovery of a novel T cell subpopulation when researchers used the SNE-based ACCENSE (Automatic Classification of Cellular Expression by Nonlinear Stochastic Embedding) algorithm to study mouse CD8+ T cells.
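The first step described above, turning distances into a probability distribution, can be sketched as follows. This is a simplified illustration: it uses one fixed Gaussian bandwidth `sigma` for every cell (an assumption made here, since SNE implementations typically tune the bandwidth per point), and it omits the later optimization that positions points on the two-dimensional map. The marker values are hypothetical.

```python
import math


def conditional_probabilities(points, sigma=1.0):
    """Row i holds the probability that cell i would pick each other cell as its neighbor."""
    n = len(points)
    probs = []
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                weights.append(0.0)  # a cell is never its own neighbor
            else:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                # Gaussian kernel: nearby cells get high weight, distant cells low weight
                weights.append(math.exp(-sq_dist / (2 * sigma ** 2)))
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs


# Hypothetical three-marker intensities: two similar cells and one distinct cell
cells = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 3.0, 3.0)]
probs = conditional_probabilities(cells)
for row in probs:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and almost all of the probability mass of the first cell falls on its close neighbor rather than on the distant cell.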

### Clustering approaches

#### Spanning-tree progression analysis of density-normalized events

Spanning-tree progression analysis of density-normalized events (SPADE) is a clustering technique where each point in multidimensional space is merged with a similar point to form a parent cluster. This process is repeated until it reaches a target number of clusters, which you specify beforehand. In SPADE, the size of the cluster corresponds to the number of cells, and clusters are connected based on their phenotypic similarity. This approach helps preserve rare cell populations in the dataset. Like SNE maps, SPADE output can be color-coded by marker expression to visualize cellular phenotypes. One example where SPADE yielded results similar to manual gating was in identifying cell populations in mouse bone marrow.
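The merge-until-target idea can be illustrated with a small agglomerative sketch. It covers only the repeated merging described above, uses centroid distance as the similarity measure (an assumption for this example), and leaves out the other parts of SPADE such as connecting clusters into a tree. The function name `agglomerate` and the cell values are hypothetical.

```python
import math


def centroid(cluster):
    """Mean position of a cluster's cells in marker space."""
    return [sum(dim) / len(cluster) for dim in zip(*cluster)]


def agglomerate(points, target_clusters):
    """Merge the two closest clusters (by centroid distance) until target_clusters remain."""
    clusters = [[p] for p in points]  # start with every cell as its own cluster
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical two-marker data: an abundant population, a smaller one, and one rare cell
cells = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (5.0, 5.1), (5.2, 4.9), (9.0, 0.5)]
clusters = agglomerate(cells, target_clusters=3)
for c in clusters:
    print(len(c), c)
```

With a target of three clusters, the single distinct cell stays in its own cluster instead of being absorbed into a larger population. Choosing too small a target would force it to merge, which is why the user-specified cluster count matters.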

#### FlowSOM

FlowSOM is similar to SPADE but relies on an artificial neural network (a self-organizing map) for greater computational efficiency, using successive iterations of training to perform hierarchical clustering. FlowSOM has been tested on several flow cytometry datasets, including one for primary immune deficiencies. FlowSOM assigns different numbers of cells to each node, which helps preserve rare cell populations while also identifying larger cell populations (by grouping nodes). Like SPADE, FlowSOM can be used for visualization, and its detection of populations depends on the user-defined number of clusters.
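The training iterations can be sketched with a tiny one-dimensional self-organizing map. This is an illustrative toy, not the FlowSOM package: the grid size, learning-rate schedule, neighborhood width, and the function names `train_som` and `best_matching_unit` are all assumptions made for this example, and the later step of grouping nodes into larger populations is not shown.

```python
import math
from collections import Counter


def best_matching_unit(nodes, cell):
    """Index of the node closest to the cell in marker space."""
    return min(range(len(nodes)), key=lambda k: math.dist(nodes[k], cell))


def train_som(cells, n_nodes=4, epochs=20, lr0=0.5, sigma0=1.0):
    dims = len(cells[0])
    lo = [min(c[d] for c in cells) for d in range(dims)]
    hi = [max(c[d] for c in cells) for d in range(dims)]
    # Deterministic start: nodes evenly spaced between the per-marker minimum and maximum
    nodes = [[lo[d] + (hi[d] - lo[d]) * k / (n_nodes - 1) for d in range(dims)]
             for k in range(n_nodes)]
    for epoch in range(epochs):
        decay = 1 - epoch / epochs
        lr = lr0 * decay                  # learning rate shrinks each epoch
        sigma = max(sigma0 * decay, 0.1)  # neighborhood narrows each epoch
        for cell in cells:
            bmu = best_matching_unit(nodes, cell)
            for k, node in enumerate(nodes):
                # Nodes near the winner on the grid are pulled toward the cell as well
                h = math.exp(-((k - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(dims):
                    node[d] += lr * h * (cell[d] - node[d])
    return nodes


# Hypothetical two-marker data: four cells from each of two populations
pop_a = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
pop_b = [(5.0, 5.0), (5.2, 4.9), (4.9, 5.3), (5.1, 5.1)]
cells = pop_a + pop_b
nodes = train_som(cells)
assignments = [best_matching_unit(nodes, c) for c in cells]
print(Counter(assignments))  # number of cells per node
```

Because each node can hold a different number of cells, a small population can keep a node of its own while abundant populations spread across several nodes that are grouped afterward.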

#### Cluster identification, characterization, and regression

Cluster identification, characterization, and regression (Citrus) combines clustering of cells based on marker expression with data on experimental endpoints of interest, such as good versus bad patient outcomes or patient survival time. Citrus has been applied to various publicly available datasets, including a study of stimulated and unstimulated peripheral blood mononuclear cells exposed to different drug treatments.

Citrus works by doing the following:

- Unsupervised hierarchical clustering of phenotypically similar cells, as in SPADE and FlowSOM
- Characterizing the behavior of identified clusters with biologically interpretable metrics by calculating features such as proportion of cells and median marker expression level in a cluster
- Supervised learning to identify clusters predictive of a sample’s endpoint
- Plotting predictive subset features as a function of the experimental endpoint, alongside the corresponding marker expression phenotype of relevant cell clusters
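The characterization step in the list above can be sketched as computing, for each cluster in a sample, the proportion of cells and the median marker expression. The input format, the cluster labels, and the function name `cluster_features` are hypothetical, and the supervised learning and plotting steps are not shown.

```python
from collections import defaultdict
from statistics import median


def cluster_features(sample):
    """sample: list of (cluster_id, marker_intensity) pairs, one per cell."""
    by_cluster = defaultdict(list)
    for cluster_id, intensity in sample:
        by_cluster[cluster_id].append(intensity)
    total = len(sample)
    return {
        cid: {"proportion": len(vals) / total, "median_expression": median(vals)}
        for cid, vals in by_cluster.items()
    }


# Hypothetical sample: three cells in cluster "A" and one cell in cluster "B"
sample = [("A", 1.0), ("A", 3.0), ("A", 2.0), ("B", 10.0)]
features = cluster_features(sample)
print(features["A"])  # {'proportion': 0.75, 'median_expression': 2.0}
print(features["B"])  # {'proportion': 0.25, 'median_expression': 10.0}
```

Computing these features for every sample yields one row per sample, which is the kind of table a supervised model could then relate to each sample's endpoint.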

## Supervised gating algorithms

In contrast to unsupervised algorithms, supervised algorithms rely on expert knowledge. Based on the known unique characteristics of cell populations, particularly the shape and distribution of a given cell population relative to other populations, users can train and parameterize the supervised algorithm. Thus, when replacing manual gating, supervised methods tend to outperform unsupervised methods and generate more accurate cell population statistics.

For example, when analyzing a T cell panel, researchers at the Providence Portland Medical Center found a significant correlation between manual gating and automated gating with the flowDensity software package, a freely available, open-source supervised gating algorithm.
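One way prior knowledge about a population's distribution can drive a gate is to place a threshold in the low-density valley between a negative and a positive peak for a single marker. The sketch below illustrates that general idea only. It is not the flowDensity API or its algorithm, it assumes the marker is clearly bimodal, and the function name `valley_threshold` and the intensity values are hypothetical.

```python
def valley_threshold(values, n_bins=10):
    """Place a 1-D gate at the emptiest histogram bin between the two main peaks."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum value goes in the last bin
        counts[idx] += 1
    # Assumption: the tallest bin and the tallest bin at least two bins away mark the two peaks
    first = max(range(n_bins), key=counts.__getitem__)
    second = max((b for b in range(n_bins) if abs(b - first) >= 2),
                 key=counts.__getitem__)
    left, right = sorted((first, second))
    valley = min(range(left + 1, right), key=counts.__getitem__)
    return lo + (valley + 0.5) * width  # center of the valley bin


# Hypothetical intensities for one marker: a negative and a positive population
negatives = [0, 4, 6, 8, 12, 14, 16, 22]
positives = [78, 84, 86, 88, 92, 94, 96, 100]
values = negatives + positives
threshold = valley_threshold(values)
gated_positive = [v for v in values if v > threshold]
print(f"Threshold: {threshold}, positive fraction: {len(gated_positive) / len(values):.2f}")
```

In a supervised workflow, the expert supplies what the code here takes for granted: which marker to gate on, how many peaks to expect, and which side of the threshold holds the population of interest.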

## Limitations and future directions of automated gating algorithms

Unsupervised algorithms can identify cell populations in biological samples without requiring much parametrization or prior training. This makes unsupervised algorithms easy to use but also limits their performance given the complexity of cell populations within a sample. Supervised algorithms can mitigate some of these limitations by using existing knowledge regarding cell populations, including their marker expression, statistical features, and even their functional outcomes.

The performance of algorithms is further constrained by confounding factors, including the following:

- Elevated background staining
- Artifacts, such as doublets
- Broad or narrow signal intensity coefficients of variation
- Instrument variability

These factors can be addressed through developing and implementing rigorous flow cytometry protocols, as well as standardizing protocols across operators and clinical laboratories to ensure robust datasets prior to analysis.

Despite the use of automated gating algorithms in academic and research settings, the biological complexity of cell phenotypes, the lack of standardization practices, and the need for bioinformatics and computational expertise have prevented these algorithms from being adopted in clinical settings. Investing in automated analysis pipelines could significantly boost the efficiency and reproducibility of clinical flow cytometry analysis. However, this would be a huge undertaking for individual clinical labs. Therefore, initiatives by manufacturers to design clinical flow cytometers with built-in algorithms are promising for the broad clinical adoption of automated analysis, which would enable clinical flow cytometry labs to guide patient care more confidently.