Single-Molecule Clustering for Super-Resolution Optical Fluorescence Microscopy

Characterizing molecular assembly in a complex cellular environment is vital for understanding the underlying biological mechanisms. Biophysical parameters related to molecular clusters (such as single-molecule cluster density, cluster area, pairwise distance, and number of molecules per cluster) are directly associated with the physiological state (healthy/diseased) of a cell. Using super-resolution imaging along with powerful clustering methods (K-means, Gaussian mixture, and point clustering), we estimated these critical biophysical parameters for dense and sparse molecular clusters. We investigated Hemagglutinin (HA) molecules in an Influenza type A disease model. Subsequently, clustering parameters were estimated for transfected NIH3T3 cells. Investigations on a test sample (randomly generated clusters) and on NIH3T3 cells (expressing photoactivable Dendra2-Hemagglutinin (Dendra2-HA) molecules) show a significant disparity among the existing clustering techniques. We observe that a single method is inadequate for estimating all relevant biophysical parameters accurately. Thus, a multi-model approach is necessary to characterize molecular clusters and determine critical parameters. The proposed study, involving optical system development, photoactivable sample synthesis, and advanced clustering methods, may facilitate a better understanding of single-molecule clusters. Potential applications are in the emerging fields of cell biology, biophysics, and fluorescence imaging.

In general, super-resolution imaging involves recording single-molecule data and subsequently estimating key features such as the location and the number of photons detected per molecule. Physical models are developed to determine the parameters of interest (such as number of molecules per cluster, cluster density, cluster area, and pairwise distance). These parameters form the basis for understanding the underlying biological mechanism at the single-molecule level in a cellular system. Molecular clustering is a process that represents an accumulation of proteins/single molecules (here, Hemagglutinin) as determined by local cell physiology. Specifically, single-molecule clusters are known to form during the onset of diseases and viral infection (Influenza A) [4,35]. Hence, it becomes vital to understand clustering and to devise methods to disrupt the process. Hemagglutinin is an antigenic glycoprotein found on the surface of influenza viruses. It is responsible for binding the virus to the cell that is being infected. The protein has the ability to cause cells to clump together in vitro. In our study, we examined NIH3T3 cells transfected with the Dendra2-HA plasmid. At 24 h post-transfection, hemagglutinin proteins were found to form small clusters, which in turn merge to form large clusters [36]. HA clustering has a direct bearing on the infectivity rate.
Because different physical models employ different assumptions and specific statistical distributions, no single model can be expected to estimate all the relevant parameters accurately. Moreover, much depends on the natural biological processes that produce a specific distribution of single molecules based on local cell physiology. The complex nature of biological processes necessitates multi-model analysis [4,35]. The K-means clustering method is often used for single-molecule analysis [37-39]. While K-means is an efficient technique, it has its limitations. Recently, a non-parametric descriptor has been used for quantifying cluster density [40]. Traditionally, Ripley's K-test, histogram analysis, and point clustering methods are used for understanding single-molecule clustering in cellular systems [2,4]. Most of these clustering techniques perform well for determining specific parameters, particularly when the data are complete. Our investigation shows that none of these techniques is accurate for multi-parameter estimation. Since single-molecule data are known to be incomplete (containing only a subset of the single molecules present) and to have a complex distribution, a single model may not adequately estimate all relevant parameters. Moreover, existing clustering methods are not equipped to handle multi-parameter estimation; thus, they are not readily extendable to single-molecule imaging. This bottleneck remains a challenge. Thus, there is a continuing need for a reliable clustering method that can be used for characterizing molecular clusters [4,35].
Existing methods used for analysing the clustering process in single-molecule localization microscopy (SMLM) include Ripley's K-function [41,42] and density-based spatial clustering of applications with noise (DBSCAN) [43]. Ripley's K-function is essentially a second-moment property that describes the relationship between two or more patterns. Mathematically, it is given by [41] $K(r) = \frac{1}{N} \sum_{i=1}^{N} N_{p_i}(r)/\lambda$, where $p_i$ is the i-th point, $N_{p_i}(r)$ is the number of points within a distance $r$ of $p_i$, $N$ is the total number of points, and $\lambda$ is the number of points per unit area. On the other hand, DBSCAN is a density-based method that primarily uses two parameters, the neighbourhood radius (say $r_{min}$) and the minimum number of points ($n_{min}$), for identifying a cluster. The technique looks for a minimum density of molecules within $r_{min}$. Although effective, the technique is not robust and is prone to imaging artifacts. Moreover, the subjectivity and ambiguity in selecting the algorithm parameters make it more complex and affect its performance. Both techniques are supervised and, thus, can be used in combination with other analysis methods.
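As an illustration of the K-function above, the following is a minimal Python sketch rather than the implementation used in this work. It counts neighbours by brute force and applies no edge correction; the function name `ripley_k`, the `area` argument, and the four example points are assumptions introduced here. For a completely random (Poisson) pattern, K(r) is expected to be close to πr², so values well above πr² suggest clustering at scale r.

```python
import math

def ripley_k(points, r, area):
    """Naive Ripley's K estimate at radius r (no edge correction)."""
    n = len(points)
    lam = n / area  # lambda: number of points per unit area
    total = 0
    for i, (xi, yi) in enumerate(points):
        # N_{p_i}(r): points within distance r of p_i, excluding p_i itself
        total += sum(
            1
            for j, (xj, yj) in enumerate(points)
            if j != i and math.hypot(xi - xj, yi - yj) <= r
        )
    return total / (n * lam)

# Hypothetical example: four points on the corners of a unit square,
# observed inside a region of area 4 (arbitrary units).
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
k_value = ripley_k(pts, r=1.0, area=4.0)
```

The brute-force neighbour count scales quadratically with the number of points; for data sets with tens of thousands of localizations, a spatial index would normally be used instead.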
Here, we report the co-development of a single-molecule imaging system and advanced clustering methods to estimate critical biophysical parameters in a cellular system. A multi-model clustering approach is employed for estimating multiple parameters. This is predominantly because single-molecule dynamics in a complex cellular system are driven by interlinked biophysical processes. We investigate the Influenza A disease model, which involves clustering of the Hemagglutinin (HA) protein in NIH3T3 cells. Clustering is an important step in the cycle from HA entry to its maturation during infection. In short, the hemagglutinin (HA) viral glycoproteins are used by the virus for fusion, viral budding, and infection [44]. Once inside the cell, HA dynamics and its clustering in the plasma membrane are critical for budding and fusion. The biophysical parameters (number of molecules per cluster, density, and cluster size) are critical indicators of HA dynamics and overall infectivity (release of bud-particles from infected cells) [45,46]. Hence, there is an urgent need for techniques that analyse clusters and track HA dynamics over time in a cellular system. Figure 1A shows the schematic diagram of the single-molecule imaging system developed for cluster analysis. Two lasers are employed: one for activation and the second for excitation of the conjugated photoactivable molecule (Dendra2-HA). The beams are combined using a dichroic mirror (DM 505) and directed towards the high-NA objective lens (Olympus UPlanFL N 100X, 1.30 NA, oil immersion), which focuses the combined beam onto transfected cells cultured on a coverslip. To study the dynamics of HA molecules during viral infection in a cellular environment, a photoactivable probe (Dendra2) is conjugated with the protein of interest, Hemagglutinin (HA). This enables both photoactivation and photoexcitation of single molecules.
Upon activation and subsequent excitation, the target molecules (Dendra2-HA) blink with a duration of ∼30 ms (triplet state lifetime). The fluorescence emitted during this time window is collected by the same objective and directed towards the EMCCD camera (iXon 897 Ultra). The actual optical setup is detailed in Supplementary S1. Images containing single-molecule signatures are recorded at a rate of 33 Hz, and a total of 5000 frames were collected with an average of 17.1 molecules per frame. The images are processed and rendered using custom-developed MATLAB scripts [9,14,25]. Subsequently, a super-resolved image is constructed from the single-molecule data. The high-intensity spots (on the recorded images) representing single molecules were extracted and approximated by a Gaussian function.

Results
In the Gaussian fit, the spatial variable is the distance from the centroid ($\rho_0 = (x_0, y_0)$), $A_0$ is the background pixel value, and $I_0$ is the peak pixel value. Subsequently, the centroid and standard deviation of each single-molecule fit are extracted to determine, respectively, its location (centroid, $\rho_0$) and localization precision ($r_d/\sqrt{N}$) in the reconstructed super-resolved image. Here, $r_d$ is the width of the diffraction-limited PSF and $N$ is the number of photons detected per molecule. We have neglected the background and the insignificant pixel-size effect to arrive at the above relation from Thompson's well-known relation [47].
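The fitted Gaussian expression is not reproduced in this text. A form consistent with the definitions above, together with the reduction of Thompson's relation that the text alludes to, could be written as follows. This is a sketch under stated assumptions: the isotropic 2D form, the fitted width $\sigma$, the pixel size $a$, the background noise $b$, and the label $\sigma_{\mathrm{loc}}$ are symbols introduced here for illustration and do not appear in the source.

```latex
% Assumed isotropic 2D Gaussian model of a single-molecule spot
I(\rho) = A_0 + I_0 \exp\!\left( -\frac{\lVert \rho - \rho_0 \rVert^{2}}{2\sigma^{2}} \right)

% Thompson's relation for the localization precision (PSF width r_d,
% pixel size a, background noise b, detected photons N)
\sigma_{\mathrm{loc}}^{2} = \frac{r_d^{2}}{N} + \frac{a^{2}}{12N} + \frac{8\pi r_d^{4} b^{2}}{a^{2} N^{2}}

% Neglecting the pixel-size (second) and background (third) terms
\sigma_{\mathrm{loc}} \approx \frac{r_d}{\sqrt{N}}
```

Under this approximation the precision improves only as the square root of the photon count, so a four-fold increase in detected photons halves the localization uncertainty.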
NIH3T3 cells were chosen for the present study following similar studies in the literature [2,4]. NIH3T3 cells are standard fibroblast cells extracted from Swiss albino mouse embryo tissue. The cells are well suited for transfection studies using plasmid DNA (Dendra2-HA). 3T3 cells are receptive to transformation and are capable of undergoing spontaneous transformation in culture. The lipid-based transfection reagents used in the present study are specifically optimized for transfection of DNA and RNA into NIH3T3 cells. Altogether, the efficiency of transfection makes 3T3 cell lines suitable for single-molecule studies. For the present study, the cells were thawed and cultured using the standard protocol [9,29]. To ensure healthy growth before transfection, the cells were passaged a few times. Subsequently, the cells were washed with PBS and transfected with the Dendra2-HA plasmid using a standard Lipofectamine 3000 (Thermo Fisher Scientific, L3000001) based protocol. A confluence of 75% was ensured before transfection was carried out. The cells were incubated overnight and fixed using 3.7% paraformaldehyde after 24 h [6]. Post fixation, widefield inverted fluorescence microscopy was carried out to identify transfected cells and estimate the transfection efficiency. Figure 1B shows a transfection efficiency of about 20% as observed in an inverted fluorescence microscope equipped with a 20X objective lens (Meiji 20X, 0.4 NA). Subsequently, single-molecule studies were carried out on the transfected cells, as shown in Figure 1B. It may be noted that HA is known to exist as a trimer, but here we are dealing with the HA monomer. This has been verified, and the details can be found in Refs. [48,49]. The monomers are known to form compact and elongated clusters. Specifically, HA has basic residues and palmitoylation sites in the cytoplasmic tail (CT). It contains at least two highly conserved, positively charged amino acids and three acylation sites per monomer [48].
A detailed description of HA glycoprotein can be found in Ref. [49].
The transmission images of the cells chosen for the study are shown in Figure 2A. We carefully chose strongly and weakly transfected cells. This selection helps in determining the tolerance limit of the clustering methods and the related parameter estimation. The corresponding super-resolution images are shown in Figure 2B for both cases, along with a few enlarged clusters (see green and red boxes). The centroid and localization precision of the detected molecules are determined as shown in Figure 2C. These parameters are estimated using custom-developed MATLAB scripts [4,9] and subjected to further analysis.
To determine the performance of the single-molecule clustering methods, we computationally generated random clusters with known parameters such as the number of clusters, the number of molecules per cluster, and the inter-cluster distance. The generation of test data (random clusters) is discussed in Supplementary S3. Subsequently, three different clustering methods (point clustering, Gaussian mixture, and K-means) were employed for analysing the clusters. The first method is point clustering, which uses the Euclidean distance to identify the next point in the cluster and, depending on the distance cut-off, determines whether the point belongs to the cluster [50,51]. We chose j = 30 nm as the cut-off and required more than 200 points for a group to be considered a valid cluster. This technique has the distinct advantage of leaving out unclustered points, which is consistent with the observation that not all molecules in a real biological sample form clusters. The second method is based on a Gaussian mixture, which identifies each cluster (i) by fitting a Gaussian function with mean $\mu_i$ and standard deviation $\sigma_i$ [52]. Often, expectation-maximization-based iterations are carried out for convergence to a stable solution [53-56]. Owing to its iterative nature, the Gaussian mixture method benefits from the feedback mechanism and converges to the most likely solution. The third method is K-means, in which cluster centroids/means are randomly placed and each point is assigned to the nearest mean based on the least-squared Euclidean distance, $J = \sum_{j=1}^{k} \sum_{i=1}^{n} \lVert p_i - \nu_j \rVert^2$, where $\lVert p_i - \nu_j \rVert$ is the Euclidean distance between $p_i$ and the centroid $\nu_j$ (calculated using the k-means++ algorithm in MATLAB) [57,58]. Iteratively, the centroid for each cluster is recalculated until convergence. Finally, the simulated data are subjected to the cluster-analysis algorithms.
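The point-clustering rule described above can be sketched in Python as follows. This is one plausible reading of the description, not the scripts used in this work: points are grouped whenever a chain of neighbours closer than the cut-off connects them, and groups below a size threshold are left unclustered. The function name, the label convention (-1 for unclustered points), the inclusive `>=` size test, and the toy coordinates are all assumptions made for illustration.

```python
import math
from collections import deque

def point_clustering(points, cutoff, min_points):
    """Group points linked by chains of neighbours closer than `cutoff`.

    Groups with fewer than `min_points` members stay unclustered (label -1).
    """
    n = len(points)
    labels = [-1] * n
    visited = [False] * n
    next_label = 0
    for seed in range(n):
        if visited[seed]:
            continue
        # Breadth-first search over the distance cut-off neighbourhood graph
        visited[seed] = True
        members = [seed]
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if not visited[j] and math.dist(points[i], points[j]) <= cutoff:
                    visited[j] = True
                    members.append(j)
                    queue.append(j)
        # Keep the group only if it is large enough to count as a cluster
        if len(members) >= min_points:
            for m in members:
                labels[m] = next_label
            next_label += 1
    return labels

# Hypothetical toy data (coordinates in nm): two small groups and one isolated point
toy = [(0, 0), (10, 0), (20, 0), (500, 500), (510, 500), (1000, 0)]
toy_labels = point_clustering(toy, cutoff=30, min_points=2)
```

Note that the number of clusters is never supplied; it emerges from the cut-off and the size threshold, which is why isolated molecules can be excluded from the analysis.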
The algorithms and data processing are detailed in Supplementary S2.
Figure 3A shows the performance of the clustering methods on the generated test clusters. We adopted the phantom-based approach often used in microscopy, optical tomography, astronomy, and data science to evaluate different methods [59-62]. For quantification of a specific parameter, we computationally generated random clusters (a phantom) that serve as a ground truth (see Figure 3) and used this as a guide to calculate biophysical parameters, first in the simulated clusters and then in real data. Comparing the estimated value with the known value gives an accurate picture of the efficiency of each clustering method. Biophysical parameters relevant to single-molecule clustering (cluster area, cluster density, number of molecules per cluster, and pairwise distance) are considered. We observe that K-means and point clustering perform better than GMM in determining the cluster area. This is quantified using the coefficient of variation for each parameter, $\mathrm{CoV} = \sigma/\mu$ where $\mu = \sum_i \mu_i$, and $\mu$ and $\sigma$ correspond to the mean and standard deviation, respectively. All techniques performed equally well in determining the number of molecules per cluster (0.05 < CoV < 0.1). Point clustering (CoV = 0.061) and K-means (CoV = 0.016) are better suited for estimating cluster density than GMM (CoV = 0.1492) (see Figure 3B). This is broadly because K-means uses a centroid-to-molecule distance-based search, while GMM is a probabilistic model. However, GMM (CoV = 0.0468) and K-means (CoV = 0.015) performed better than point clustering (CoV = 0.0558) for assessing pairwise distance. Once the characteristics of the clustering methods are determined, they are used to investigate real data (Dendra2-HA transfected cells).
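The coefficient-of-variation measure can be illustrated with a short Python sketch. It assumes that µ is taken as the arithmetic mean of the per-cluster estimates and that σ is the population standard deviation; the text does not state which convention was used, and the sample values below are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """CoV = sigma / mu, using the population standard deviation (an assumption)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return sigma / mu

# Hypothetical per-cluster estimates of one parameter (e.g. molecules per cluster)
estimates = [98.0, 100.0, 102.0]
cov = coefficient_of_variation(estimates)
```

A smaller CoV means the estimates scatter less around their mean, which is how the methods are ranked for each parameter in the comparison above.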
The analysis was carried out on a large number of cells, and the data were segregated into weakly (sparse clusters) and strongly (dense clusters) transfected cells. This is because weak transfection is known to occur in a variety of biological specimens. Typical cells with dense and sparse HA clustering are shown in Figure 4. For both cases, three different methods were employed to determine HA clusters and their characteristics. It is evident that sparse clusters are well recognized by all the clustering methods (5.3 µm² < Area < 6.9 µm²), whereas compact clusters are better categorized by point clustering (with a minimum average cluster area, A = 1.4848 µm²). This is in line with the known test data. Point clustering has the added advantage of excluding unclustered HA molecules. This is because point clustering requires neither the initialization of the number of clusters nor the assignment of every molecule to a cluster, which is predominantly due to its point-to-point search. On the other hand, Gaussian and K-means clustering need prior knowledge of the number of clusters. In addition, point clustering is found to be the fastest among the clustering methods (Figure 4A,B). K-means assumes symmetric clusters, whereas point and Gaussian clustering methods are capable of handling random clusters. This is beneficial for identifying clusters of arbitrary shape and for calculating biophysical parameters of unknown random clusters. Visualization of the clusters shows a few large-area clusters (5-15 µm²), with the majority in the moderate range (2-4 µm²). This observation is consistent with point clustering, whereas GMM and K-means suggest a large proportion of larger-area clusters (>3 µm² for dense clustering and >6 µm² for sparse clustering, respectively) (Figure 4C-F). This is in line with the observation on the test sample (Figure 3).
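The need for prior knowledge of the number of clusters can be seen in a minimal K-means sketch. The Python code below implements plain Lloyd iterations from explicitly supplied starting centroids, so the number of clusters is fixed by the caller; it is not the MATLAB k-means++ routine used in this work, and the function name, the empty-cluster handling, and the toy points are assumptions for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's iteration from given starting centroids.

    Returns (labels, centroids, J), where J is the sum of squared distances
    of every point to its assigned centroid.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: every point goes to its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid as is
        if new_centroids == centroids:
            break
        centroids = new_centroids
    J = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, J

# Hypothetical toy data: two well-separated pairs, k = 2 fixed by the two seeds
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
km_labels, km_centroids, km_J = kmeans(toy_points, [(0, 0), (10, 0)])
```

Every point receives a label here, including points that a distance cut-off would have left unclustered, which illustrates the contrast with point clustering drawn above.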
Moreover, density analysis indicates the existence of a few high-density clusters (>2000 µm⁻²). This observation is consistent with point clustering and K-means, whereas GMM does not fare well. The number of molecules per cluster is consistent across all clustering methods, showing the existence of a few clusters with a large number of HA molecules. These observations indicate that none of the methods is well equipped for determining all the relevant parameters and that a suitable combination of them is necessary. Thus, a multi-model approach seems essential.
Figure caption (fragment): Panels show dense and weak molecular clusters, respectively. Three different clustering methods (point clustering, Gaussian mixture model, and k-means clustering) were employed to determine biophysical parameters (cluster area, cluster density, number of molecules per cluster, and pairwise distance). The colors are used to discern clusters.
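Once molecules are assigned to a cluster, the biophysical parameters discussed in this section can be computed per cluster. The Python sketch below is illustrative only: the text does not specify how cluster area or pairwise distance were calculated, so the convex-hull area and the mean over all molecule pairs used here are assumptions, as are the function names and the example coordinates.

```python
import math
from itertools import combinations

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2

def cluster_parameters(points):
    """Molecule count, hull area, density, and mean pairwise distance of one cluster."""
    area = polygon_area(convex_hull(points))
    pair = [math.dist(p, q) for p, q in combinations(points, 2)]
    return {
        "molecules": len(points),
        "area": area,
        "density": len(points) / area if area > 0 else float("inf"),
        "mean_pairwise_distance": sum(pair) / len(pair) if pair else 0.0,
    }

# Hypothetical cluster (coordinates in µm): four corners of a square plus its centre
params = cluster_parameters([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
```

With coordinates in µm, the area is in µm² and the density in µm⁻², matching the units quoted in the text. A different area definition (for example, one based on the fitted Gaussian width) would change the density accordingly.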

Discussion
A single-molecule localization microscope was developed and integrated with advanced clustering methods for the assessment of molecular clusters. In the influenza type A disease model, HA molecules are known to assemble and form HA assemblies 24 h post-transfection (see Figure 2B). A conjugated photoactivable probe (Dendra2-HA) was used to study clustering in transfected NIH3T3 cells. The dynamics of these clusters (virion assembly), including their shape and size, are known to play critical roles that ultimately result in the maturation of the virus [4,63].
The clustering methods identify single-molecule clusters, from which critical biophysical parameters are estimated. It is apparent that parameter estimation needs a multi-model approach and that a single model is incapable of evaluating all parameters consistently with observation. In order to assess the performance of the clustering methods, test clusters with known biophysical parameters of interest (such as the number of clusters, number of molecules per cluster, cluster density, and pairwise distance between two nearby molecules) are computationally generated (see Figure 3). All clustering methods (point clustering, Gaussian mixture model, and K-means) are then used to classify clusters and estimate the parameters. The comparison shows point clustering and K-means as the preferred methods for estimating cluster size (area), whereas GMM and K-means are found to be suitable for determining pairwise distance. Both K-means and point clustering show better estimation of cluster density, and all methods perform equally well for estimating the number of molecules per cluster (see Figure 3). Encouraged by the performance of the clustering methods on simulated test data, we anticipate a similar trend for actual data.
The clustering methods are applied to real data (Dendra2-HA transfected NIH3T3 cells) and analyzed. The results show that cluster density is better estimated by point clustering and K-means, whereas cluster area is better estimated by point clustering (see Figure 4). All clustering techniques perform equally well as far as the number of molecules per cluster is concerned. None of the methods is found to be suitable for determining pairwise distance accurately (see Figure 4). We found a disparity between dense and sparse clusters. Particularly for sparse clusters, cluster density and the number of molecules are better represented by K-means. This shows that accurate estimation of biophysical parameters depends on the strength of transfection. Parameter estimation on dense clusters is found to be in line with the computational test data.
Overall, the proposed study provides a deeper understanding of the single-molecule clustering process in a complex cellular system. It contributes to a better estimation of key biophysical parameters critical for understanding single-molecule complexes, resulting in a better determination of the underlying biophysical mechanisms with single-molecule precision.

Cell Culture
NIH3T3 fibroblast cells were thawed and resuspended in complete media (10% FBS + 89% DMEM + 1% penicillin-streptomycin). The resuspended cells were centrifuged to form a pellet, which was resuspended in PBS (1X) to remove the toxicity caused by the freezing medium (90% complete cell medium + 20% DMSO). Then, the solution was centrifuged to obtain a DMSO-free pellet, which was resuspended in complete media, and 10⁵ cells were plated in a 35 mm dish with 2 mL of media (the cells were counted using a hemocytometer). The cells were kept undisturbed for 2 days in a CO₂ incubator (at 37 °C) so that the confluence reached 70-80% before the split. The cells were split over three passages, followed by transfection with Dendra2-HA plasmid DNA.

Cell Transfection and Fixing
NIH3T3 cells were used for the study. The cells were seeded in a 35 mm dish containing a coverslip. The coverslip was washed with ethanol prior to cell seeding to remove contamination and then washed with PBS to remove leftover ethanol. The cells were transfected with Dendra2-HA 12-14 h after seeding. It should be noted that the cell count should not exceed 1-1.5 × 10⁵ cells per 35 mm dish at seeding, so that the confluency after 12-14 h is around 60-70%. Lipofectamine 3000 (Life Technologies, Invitrogen) was used for transfecting NIH3T3 cells with Dendra2-HA as per the Lipofectamine 3000 protocol. The cells were incubated (37 °C, 5% CO₂) for 24 h in antibiotic-free complete media (only DMEM and FBS). Subsequently, the cells were washed with PBS and then fixed with 4% PFA. After fixing, the cells were sealed on a glass slide using Fluorosave solvent (Invitrogen, Carlsbad, CA, USA) to preserve them for a long duration. The cells were observed under white light to confirm that they retained their shape and subsequently visualized under blue light to observe green fluorescence from transfected cells. Then, the brightest (transfected) cells were chosen for single-molecule imaging.

Optical Setup
The optical system designed for visualizing molecular clusters consists of three optical arms: (1) a blue laser (470-490 nm) for visualizing transfected cells, (2) a violet laser (405 nm) for single-molecule activation, and (3) a green laser (561 nm) for excitation of activated molecules (see Figure 1). A dichroic mirror (DM 505) was used to combine the 405 nm and 561 nm laser beams, which were then directed to the dichroic mirror of the inverted fluorescence microscope. Along the path, a flip mirror is used to select either the combined beam (405 nm + 561 nm) or the blue light. The blue light is used for identifying transfected cells. Subsequently, the combined beam is used for selectively exciting Dendra2-HA single molecules. A high-NA objective (Olympus, 100X, 1.3 NA) is used to tightly focus the light onto the sample in an inverted microscope system (Olympus, IX81 Inverted, Tokyo, Japan). The emitted fluorescence (570-600 nm) is collected by the same objective, transmitted through the dichroic mirror DM 570, and filtered by a set of filters before being focused onto the EMCCD camera (Andor, iXon 897 Ultra, Belfast, UK). The details of the actual setup and the components are discussed in Supplementary S1.

Superresolution Imaging and Image Reconstruction
The process begins by recording a large number of images, followed by image processing to reconstruct a super-resolved image. For the present study, we recorded 5000 images for a set of cells that express Dendra2-HA. We considered both weakly and strongly transfected cells. Images are recorded at 33 Hz with an EM gain of 257. A total of 29,463 and 14,499 single molecules were recorded for the strongly and weakly transfected cells, respectively. A series of processing steps was carried out to isolate the single-molecule blinks from the background. These include particle-size filtering to remove false counts and photon-count filtering to remove random background. Subsequently, the bright spots are identified and fitted with a Gaussian function. Further analysis is carried out to determine the centroid and variance of the fitted Gaussian. This provides the location and localization precision of the detected single molecules. The single molecules from all 5000 frames are superimposed to reconstruct a super-resolved image.
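The filtering and superposition steps above can be sketched in a few lines of Python. This is a simplified illustration, not the MATLAB pipeline used in this work: it assumes each localization is already available as an (x, y, photons) tuple, applies only a photon-count threshold, and renders the image as a 2D histogram. The function name, the pixel size, the threshold, and the example localizations are hypothetical.

```python
def render_localizations(localizations, pixel_nm, width_px, height_px, min_photons):
    """Bin localizations (x_nm, y_nm, photons) into a 2D count image.

    Localizations below `min_photons` or outside the field of view are dropped.
    Returns (image, kept), where image[row][col] counts molecules per pixel.
    """
    image = [[0] * width_px for _ in range(height_px)]
    kept = 0
    for x, y, photons in localizations:
        if photons < min_photons:
            continue  # photon-count filtering against random background
        col = int(x // pixel_nm)
        row = int(y // pixel_nm)
        if 0 <= row < height_px and 0 <= col < width_px:
            image[row][col] += 1
            kept += 1
    return image, kept

# Hypothetical localizations pooled from all frames (positions in nm)
locs = [(5, 5, 500), (12, 3, 800), (14, 8, 50), (25, 25, 900)]
sr_image, n_kept = render_localizations(
    locs, pixel_nm=10, width_px=3, height_px=3, min_photons=100
)
```

Because the rendering pixel can be much smaller than the camera pixel, the resulting image resolves detail below the diffraction limit, bounded by the localization precision of the individual molecules.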