scFseCluster: a feature selection-enhanced clustering for single-cell RNA-seq data

Data collection and preprocessing

We collected six publicly available scRNA-seq datasets containing cell-type annotations and gene expression values from various scRNA-seq platforms. They can be downloaded from the Gene Expression Omnibus (62) and BioStudies (63). The datasets come from different species, including mouse and human, and from different organs, such as the brain, pancreas, and embryo. Detailed information on the datasets is summarized in Table 1.

Let $X$ be a single-cell read count matrix, where $X_{ij}$ is the count of the $j$-th gene in the $i$-th cell. The same preprocessing procedure was applied to all datasets. First, genes with no counts in any cell were filtered out. Because an expression matrix is considered more suitable than raw counts for clustering analysis of single-cell transcriptomes (64), we converted the read count matrix into an expression matrix by normalization followed by a $\log_{2}$ transformation. The resulting expression matrix is still high-dimensional and sparse, and lowly expressed genes carry too little information to distinguish cell types. Accordingly, we selected the top $D$ highly variable genes (HVGs) with the Scanpy package (65) (default $D = 2000$) and passed the HVG matrix to the feature selection algorithm FSQSSA.
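The preprocessing steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the normalization is taken to be library-size scaling to a hypothetical target of 10,000 counts per cell, the transform is $\log_{2}(1 + x)$, and HVG ranking is replaced by a simple ranking on per-gene variance rather than Scanpy's own procedure. The function name `preprocess` and its parameters are illustrative and do not come from the scFseCluster code.

```python
import math

def preprocess(counts, n_top=2000, scale=1e4):
    """counts: list of cells, each a list of per-gene read counts."""
    n_genes = len(counts[0])
    # Step 1: drop genes that have zero counts in every cell.
    kept = [j for j in range(n_genes) if any(cell[j] > 0 for cell in counts)]
    # Step 2: library-size normalization, then log2(1 + x) (assumed scheme).
    expr = []
    for cell in counts:
        total = sum(cell[j] for j in kept)
        factor = scale / total if total > 0 else 0.0
        expr.append([math.log2(1.0 + cell[j] * factor) for j in kept])
    n_cells = len(expr)

    def variance(col):
        vals = [row[col] for row in expr]
        mean = sum(vals) / n_cells
        return sum((v - mean) ** 2 for v in vals) / n_cells

    # Step 3: keep the n_top most variable genes (stand-in for Scanpy's HVG ranking).
    order = sorted(range(len(kept)), key=variance, reverse=True)[:n_top]
    order.sort()  # restore the original gene order
    genes = [kept[c] for c in order]
    matrix = [[row[c] for c in order] for row in expr]
    return genes, matrix

# Toy input: 3 cells x 4 genes; gene 0 is never expressed.
counts = [[0, 5, 1, 0], [0, 1, 5, 0], [0, 3, 3, 2]]
genes, hvg = preprocess(counts, n_top=2)
```

In this toy input the never-expressed gene 0 is removed in step 1, and the returned matrix has one row per cell and one column per retained gene.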

Methodology of scFseCluster

The scFseCluster framework consists of three steps (Fig 6). First, HVG selection is performed on the normalized gene expression matrix. Second, FSQSSA selects the optimally selected genes (OSGs) from the HVG expression matrix. Finally, the OSG expression matrix is passed to the clustering module for cell type detection. If the number of clusters ($K$) is given, the K-means algorithm is called; otherwise, the Louvain algorithm (66) is run and a suitable value of $K$ is estimated.

Figure 6. Diagram of the proposed scFseCluster framework.

HVG denotes a highly variable gene. OSG means optimally selected gene.

FSQSSA for feature selection

In this section, we introduce the proposed feature selection algorithm FSQSSA (Fig S3), which is inspired by the Squirrel Search Algorithm (52). Each feature is indicated by a “1” or a “0,” signifying that the feature is selected or unselected, respectively. In quantum-based optimization (67), each feature is represented by a quantum bit, or Q-bit ($q$). A Q-bit is a superposition of “0” and “1,” expressed in Dirac notation as $q = α | 0 \rangle + β | 1 \rangle$ (68). The squared magnitudes ${| α |}^{2}$ and ${| β |}^{2}$ give the probabilities that the Q-bit takes the value “0” and “1,” respectively, so they must satisfy ${| α |}^{2} + {| β |}^{2} = 1$. Because Q-bits are linear superpositions of states, they can represent a more diverse population (69). A Q-bit in Dirac notation cannot take part in the update operations directly, so each feature is represented by the angle θ of its Q-bit (70). The angle θ is related to α and β as follows: $θ = \tan^{- 1} (β / α)$, $α = \cos θ$, $β = \sin θ$.
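As a small numerical check of the angle encoding, the sketch below computes the amplitudes for one angle and confirms that the two probabilities sum to 1. The helper name `qbit_amplitudes` is hypothetical, and real-valued amplitudes are assumed.

```python
import math

def qbit_amplitudes(theta):
    """Return (alpha, beta) for a Q-bit encoded by the angle theta."""
    return math.cos(theta), math.sin(theta)

alpha, beta = qbit_amplitudes(math.pi / 3)
p0, p1 = alpha ** 2, beta ** 2        # probabilities of observing "0" and "1"
recovered = math.atan2(beta, alpha)   # angle recovered from the two amplitudes
```

For θ = π/3 the probability of “0” is 0.25 and the probability of “1” is 0.75, so this Q-bit leans toward selecting its feature.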

Each flying squirrel position represents an individual consisting of $D$ Q-bits, where $D$ is the total number of features. Each individual ($Q_{i}$) can therefore be represented by Equation (1): $Q_{i} = [q_{i1}, q_{i2}, \ldots, q_{iD}] = [θ_{i1}, θ_{i2}, \ldots, θ_{iD}]$ (1)

The state of the $j$-th element of $Q_{i}$ is derived using Equation (2). A value of $x_{j}^{i} = 1$ denotes that the feature is included in the feature subset; otherwise, it is not selected. $x_{j}^{i} = \begin{cases} 1, & \text{if } {| α |}^{2} \leq {| β |}^{2} \\ 0, & \text{otherwise} \end{cases}$ (2)

A uniform distribution (Equation (3)) is used to assign the initial position of each flying squirrel. $Q_{i} = θ_{L} + \mathrm{random}(0,1) \times (θ_{U} - θ_{L})$ (3) where $θ_{L}$ and $θ_{U}$ are the lower and upper bounds of the $i$-th flying squirrel in the $j$-th dimension, and $\mathrm{random}(0,1)$ is a uniformly distributed random number in the range $[0,1]$.
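Equations (1) to (3) can be illustrated with a short sketch. The bounds $θ_{L} = 0$ and $θ_{U} = π/2$ are an assumption made here so that both amplitudes stay non-negative; the text does not state the bounds. The function names are hypothetical.

```python
import math
import random

def init_population(n_squirrels, n_features,
                    theta_l=0.0, theta_u=math.pi / 2, rng=random):
    """Equation (3): draw every angle uniformly from [theta_l, theta_u]."""
    return [[theta_l + rng.random() * (theta_u - theta_l)
             for _ in range(n_features)]
            for _ in range(n_squirrels)]

def decode(thetas):
    """Equation (2): select feature j when |alpha|^2 <= |beta|^2."""
    return [1 if math.cos(t) ** 2 <= math.sin(t) ** 2 else 0 for t in thetas]

random.seed(0)
population = init_population(n_squirrels=50, n_features=8)
subsets = [decode(q) for q in population]       # one binary mask per squirrel
example = decode([0.0, math.pi / 2, math.pi / 3])
```

Each decoded mask has one entry per feature, and the indices holding a 1 form the candidate feature subset of that squirrel.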

The fitness function in FSQSSA measures the quality of each individual in the population, that is, how well each candidate solution (feature subset) fits the objective. FSQSSA addresses a multi-objective problem: it simultaneously minimizes the size of the selected feature subset and maximizes the clustering accuracy obtained with that subset. The fitness function that balances these two objectives is defined in Equation (4). $\mathrm{Fitness}(S_{i}) = w \times \mathrm{SC}(\hat{y}_{i}) + (1 - w) \times \left(1 - \frac{| S_{i} |}{D}\right)$ (4) where $S_{i}$ is the feature subset obtained by the $i$-th squirrel. For each feature subset, this study uses the K-means model for clustering, and $\hat{y}_{i}$ denotes the cluster labels output for the $i$-th subset. The function $\mathrm{SC}(\hat{y}_{i})$ denotes the silhouette coefficient of the candidate feature subset, and $| S_{i} |$ is the number of selected features. The parameter $w$ balances clustering accuracy against the feature selection rate. To ensure that the primary objective of maintaining accuracy is met, we set $w = 0.9$ in our study (53, 71, 72, 73).
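A minimal sketch of Equation (4) follows. The clustering score is passed in as a number, on the assumption that it has already been computed from the K-means labels (for instance with scikit-learn's `silhouette_score`, which is only one possible choice). The function name and arguments are illustrative.

```python
def fitness(cluster_score, n_selected, n_features, w=0.9):
    """Equation (4): weighted sum of clustering quality and subset sparsity."""
    sparsity = 1.0 - n_selected / n_features
    return w * cluster_score + (1.0 - w) * sparsity

# Worked example: a score of 0.5 with 500 of 2000 HVGs selected.
value = fitness(0.5, n_selected=500, n_features=2000)
# 0.9 * 0.5 + 0.1 * (1 - 0.25) = 0.45 + 0.075 = 0.525
```

With $w = 0.9$, halving the number of selected genes changes the fitness far less than a comparable change in the clustering score, which reflects the stated priority on accuracy.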

As mentioned earlier, the three types of trees in the forest represent different classes of food resources. To help FSQSSA balance exploration and exploitation, we assume that the forest contains 50 trees: 1 top-ranked hickory tree, 3 second-ranked acorn trees, and 46 lowest-ranked normal trees. The number of squirrels matches the number of trees, with exactly one squirrel per tree.
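The 1/3/46 split can be sketched as a ranking by fitness. This assumes, as in the standard Squirrel Search Algorithm, that the best individual occupies the hickory tree and the next three occupy acorn trees; the function name `assign_trees` is hypothetical.

```python
def assign_trees(fitness_values, n_acorn=3):
    """Rank squirrels by fitness (higher is better) and assign tree types."""
    order = sorted(range(len(fitness_values)),
                   key=lambda i: fitness_values[i], reverse=True)
    return {
        "hickory": order[0],               # single best squirrel
        "acorn": order[1:1 + n_acorn],     # next-best squirrels
        "normal": order[1 + n_acorn:],     # everyone else
    }

# Toy fitness values for 50 squirrels: squirrel 49 is the best.
trees = assign_trees([i / 50 for i in range(50)])
```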

Squirrels constantly search the forest for better food resources. The dynamic foraging process of a flying squirrel leads to three scenarios: (1) a squirrel flies from an acorn tree to a hickory tree; (2) a squirrel flies from a normal tree to an acorn tree; and (3) a squirrel flies from a normal tree directly to a hickory tree. It is assumed that, in the absence of natural predators, a squirrel glides throughout the forest and searches for food efficiently, whereas the presence of a predator alarms it and forces it to flee to a random location. This random escape makes FSQSSA less likely to become trapped in a local optimum. We define the probability of the presence of a predator as $P_{dp}$, which is equal to 0.1 by default. The squirrel’s foraging process can be modeled mathematically as follows.

Case 1

A squirrel flies from an acorn tree ($θ_{at}$) to a hickory tree ($θ_{ht}$). In this case, the new location of the squirrel is obtained as follows: $θ_{at}^{t+1} = \begin{cases} θ_{at}^{t} + d_{g} \times G_{c} \times (θ_{ht}^{t} - θ_{at}^{t}), & R_{1} \geq P_{dp} \\ \text{random location}, & \text{otherwise} \end{cases}$ (5) where $d_{g}$ is the random gliding distance, which by default lies between 0.3 and 0.7, $R_{1}$ is a random number in the range $[0,1]$, and $t$ denotes the current iteration. The gliding constant $G_{c}$ controls the balance between exploration and exploitation, and its value significantly affects the algorithm’s performance; we use 1.9, the default value in the standard Squirrel Search Algorithm.

Case 2

A flying squirrel moves from a normal tree ($θ_{nt}$) to an acorn tree. In this case, the new location of the squirrel is obtained as follows: $θ_{nt}^{t+1} = \begin{cases} θ_{nt}^{t} + d_{g} \times G_{c} \times (θ_{at}^{t} - θ_{nt}^{t}), & R_{2} \geq P_{dp} \\ \text{random location}, & \text{otherwise} \end{cases}$ (6) where $R_{2}$ is a random number in the range $[0,1]$.

Case 3

Some of the squirrels on normal trees fly directly to the hickory tree. In this case, the new location of the squirrel is obtained as follows: $θ_{nt}^{t+1} = \begin{cases} θ_{nt}^{t} + d_{g} \times G_{c} \times (θ_{ht}^{t} - θ_{nt}^{t}), & R_{3} \geq P_{dp} \\ \text{random location}, & \text{otherwise} \end{cases}$ (7) where $R_{3}$ is a random number in the range $[0,1]$.
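The three cases share one update rule and differ only in which squirrel moves and which tree it targets, so they can be sketched with a single function. Several details below are assumptions because the text does not fix them: $d_{g}$ is drawn uniformly from $[0.3, 0.7]$ once per move, the "random location" is drawn uniformly within hypothetical bounds $[0, π/2]$, and no clipping to the bounds is applied after a glide.

```python
import math
import random

def glide(current, target, p_dp=0.1, g_c=1.9, d_g_range=(0.3, 0.7),
          bounds=(0.0, math.pi / 2), rng=random):
    """Equations (5)-(7): glide toward `target`, or flee if a predator appears."""
    lo, hi = bounds
    if rng.random() >= p_dp:                      # no predator: glide
        d_g = rng.uniform(*d_g_range)             # assumed uniform glide distance
        return [c + d_g * g_c * (t - c) for c, t in zip(current, target)]
    return [rng.uniform(lo, hi) for _ in current]  # predator: random location

random.seed(0)
acorn, hickory = [0.0, 1.0], [1.0, 1.0]
moved = glide(acorn, hickory, p_dp=0.0)   # Case 1 with the predator switched off
fled = glide(acorn, hickory, p_dp=2.0)    # predator always present
```

Case 2 corresponds to `glide(normal, acorn)` and Case 3 to `glide(normal, hickory)`. Because $d_{g} \times G_{c}$ can exceed 1, a squirrel may overshoot its target, which is one way the update keeps exploring.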

Seasonal changes can significantly affect the foraging activity of squirrels (74). They suffer substantial heat loss at low temperatures, and weather conditions force them to be less active in winter than in fall (75). Because squirrel movements are affected by changes in weather, the seasonal monitoring condition is retained in this study. It prevents FSQSSA from becoming trapped in a local optimum and enhances the exploratory ability of the squirrels. The behavior is modeled in the following steps.

(a) Calculate the seasonal constant $S_{c}^{t}$ using Equation (8).

$S_{c}^{t} = \sqrt{\sum_{k = 1}^{d} {(θ_{at,k}^{t} - θ_{ht,k})}^{2}}$ (8) where $t = 1, 2, 3$.

(b) Check the seasonal monitoring condition $S_{c}^{t} < S_{\min}$, where $S_{\min}$ is the minimum value of the seasonal constant, computed using Equation (9).

$S_{\min} = \frac{10\mathrm{E}{-6}}{{(365)}^{t / (t_{m} / 2.5)}}$ (9) where $t$ and $t_{m}$ are the current and maximum iteration values, respectively. The value of $S_{\min}$ affects the exploration and exploitation capabilities of the proposed method. Larger values of $S_{\min}$ promote exploration, whereas smaller values of $S_{\min}$ enhance the exploitation capability of the algorithm. Any effective metaheuristic requires a proper balance between these two phases (76).

(c) If the seasonal monitoring condition is true, the flying squirrels that were unable to explore the forest for optimal winter food sources are randomly relocated. Because normal trees have the lowest level of food resources, FSQSSA assumes that only squirrels on normal trees are forced to migrate randomly in search of better food sources. The random migration of squirrels is given by Equation (10).

$θ_{nt}^{\mathrm{new}} = θ_{L} + \text{Lévy}(n) \times (θ_{U} - θ_{L})$ (10) Lévy flight is a powerful mathematical tool for improving the global exploration capability of various metaheuristic algorithms (77). It helps to find new candidate solutions far from the current best solution.
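The seasonal monitoring steps and Equations (8) to (10) can be sketched as follows. The text does not specify how the Lévy step is generated, so the sketch assumes Mantegna's algorithm with exponent 1.5 and a scale factor of 0.01, one independent draw per dimension, bounds of $[0, π/2]$, and clipping of the result to the bounds. It also reads `10E-6` as $10 \times 10^{-6}$ and treats the condition as met when the seasonal constant falls below $S_{\min}$, as in the standard Squirrel Search Algorithm. All function names are hypothetical.

```python
import math
import random

def seasonal_constant(acorn, hickory):
    """Equation (8): Euclidean distance between an acorn and the hickory squirrel."""
    return math.sqrt(sum((a - h) ** 2 for a, h in zip(acorn, hickory)))

def s_min(t, t_max):
    """Equation (9): threshold that shrinks as iterations progress."""
    return 10e-6 / 365 ** (t / (t_max / 2.5))

def levy_step(beta=1.5, scale=0.01, rng=random):
    """One Levy-distributed step via Mantegna's algorithm (assumed form)."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma = (num / den) ** (1 / beta)
    u = rng.gauss(0.0, sigma)
    v = rng.gauss(0.0, 1.0)
    return scale * u / max(abs(v), 1e-12) ** (1 / beta)

def relocate(n_features, theta_l=0.0, theta_u=math.pi / 2, rng=random):
    """Equation (10): Levy-flight relocation, clipped to the bounds (added safeguard)."""
    new = []
    for _ in range(n_features):
        theta = theta_l + levy_step(rng=rng) * (theta_u - theta_l)
        new.append(min(max(theta, theta_l), theta_u))
    return new

def winter_is_over(acorn, hickory, t, t_max):
    """Seasonal monitoring condition: S_c < S_min."""
    return seasonal_constant(acorn, hickory) < s_min(t, t_max)

random.seed(1)
new_position = relocate(20)
```

When `winter_is_over` returns true for an acorn squirrel, the squirrels on normal trees would each be replaced by a fresh `relocate` draw.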

Performance metrics

We aggregate six quality metrics (78) to assess the clustering performance of the scFseCluster model: Rand Index ($RI$), Adjusted Rand Index ($ARI$), Normalized Mutual Information ($NMI$), Adjusted Mutual Information ($AMI$), Accuracy ($ACC$), and Fowlkes–Mallows Index ($FMI$). These metrics are defined as follows:

$RI = \frac{\text{Number of pair-wise correct predictions}}{\text{Total number of possible pairs}}$ (11)

$ARI = \frac{\text{Number of pair-wise true positive predictions} - E[RI]}{\text{Average number of pairs in same cluster for actual and predicted} - E[RI]}$ (12)

$NMI = \frac{MI}{[H(U) + H(V)] / 2}$ (13)

$AMI = \frac{MI - E[MI]}{[H(U) + H(V)] / 2 - E[MI]}$ (14)

$ACC = \frac{TP + TN}{TP + FP + TN + FN}$ (15)

$FMI = \frac{TP}{\sqrt{(TP + FP) \times (TP + FN)}}$ (16)

$RI$ and $ARI$ measure the similarity between the cluster assignments by making pair-wise comparisons (79). $NMI$ and $AMI$ measure the agreement between the cluster assignments (53). $H(U)$ and $H(V)$ denote the entropy of the actual and predicted cluster assignments, respectively. $MI$ is equal to $\sum_{i = 1}^{| U |} \sum_{j = 1}^{| V |} P(i, j) \log \left[ \frac{P(i, j)}{P(i) P^{'}(j)} \right]$, where $P(i)$ and $P^{'}(j)$ are the probabilities that a sample falls in Cluster $i$ (actual) and Cluster $j$ (predicted), respectively. $FMI$ measures the correctness of the cluster assignments using pairwise precision and recall (53). $TP$, $TN$, $FP$, and $FN$ are obtained by counting the sample pairs that are placed in the same or in different clusters under the predicted and actual labels.
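The pair-counting definitions can be made concrete with a short sketch that implements $RI$, $FMI$, and $NMI$ directly from Equations (11), (13), and (16). It is an illustration only: the quadratic loop over pairs is impractical for large datasets, natural logarithms are assumed, and returning 1.0 when both entropies are zero is a convention chosen here. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pair_counts(actual, predicted):
    """Count sample pairs as TP, FP, FN, TN by same/different cluster membership."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(actual)), 2):
        same_actual = actual[i] == actual[j]
        same_pred = predicted[i] == predicted[j]
        if same_actual and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def rand_index(actual, predicted):
    tp, fp, fn, tn = pair_counts(actual, predicted)
    return (tp + tn) / (tp + fp + fn + tn)

def fowlkes_mallows(actual, predicted):
    tp, fp, fn, _ = pair_counts(actual, predicted)
    return tp / math.sqrt((tp + fp) * (tp + fn)) if tp else 0.0

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def mutual_info(actual, predicted):
    n = len(actual)
    joint = Counter(zip(actual, predicted))
    ca, cp = Counter(actual), Counter(predicted)
    return sum(c / n * math.log((c / n) / ((ca[a] / n) * (cp[b] / n)))
               for (a, b), c in joint.items())

def nmi(actual, predicted):
    denom = (entropy(actual) + entropy(predicted)) / 2
    return mutual_info(actual, predicted) / denom if denom > 0 else 1.0

actual = [0, 0, 1, 1]
predicted = [0, 0, 1, 2]
counts = pair_counts(actual, predicted)   # (TP, FP, FN, TN) = (1, 0, 1, 4)
```

For this toy labeling, five of the six pairs are treated consistently, so $RI = 5/6$, while $FMI = 1/\sqrt{2}$ because one same-cluster pair was split.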

Comparison analysis

To demonstrate the effectiveness of our approach, we carried out a comparison analysis from two aspects. First, we compared FSQSSA with other metaheuristic methods, including the standard Squirrel Search Algorithm (Squirrel) (52), Enhanced Salp Swarm Algorithm (Salp) (53), ABC (54), and Genetic Algorithm (GA) (55). All the compared algorithms share the same fitness function (Equation (4)), and all of them were run for 100 iterations with a population of 50 individuals. Note that the standard Squirrel and ABC algorithms can only solve continuous optimization problems, whereas feature selection is typically a discrete optimization problem. We therefore apply the sigmoid function to the positions produced by these two algorithms to obtain the feature subset.
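One common way to turn a continuous position into a feature subset with a sigmoid is sketched below. The text does not say whether the sigmoid output is compared with a fixed threshold or with a random number, so the fixed threshold of 0.5 used here is an assumption, as are the function names.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def binarize(position, threshold=0.5):
    """Map a continuous position vector to a 0/1 feature mask."""
    return [1 if sigmoid(x) >= threshold else 0 for x in position]

mask = binarize([-2.0, 0.1, 3.0])   # -> [0, 1, 1]
```

A stochastic variant would replace `threshold` with a fresh uniform random number for each dimension.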

Second, we compared our scFseCluster algorithm with seven state-of-the-art methods for scRNA-seq data clustering: two deep learning approaches, scDeepCluster (35) and DESC (33), and five machine learning methods, Seurat (28), CIDR (56), SINCERA (13), SC3 (29), and SIMLR (57). We followed the published steps of each method without adding any extraneous operations. Table S2 summarizes the details of these methods.
