Deep Learning and Transfer Learning for Automatic Cell Counting in Microscope Images of Human Cancer Cell Lines

Abstract: In biology and medicine, cell counting is one of the most important elements of cytometry, with applications to research and clinical practice. For instance, the complete cell count could help to determine conditions under which cancer cells could grow or not. However, cell counting is a laborious and time-consuming process, and its automation is in high demand. Here, we propose the use of a Convolutional Neural Network-based regressor, a regression model trained end-to-end, to provide the cell count. First, unlike most of the related work, we formulate the problem of cell counting as a regression task rather than a classification task. This not only reduces the required annotation information (i.e., the number of cells instead of pixel-level annotations) but also removes the burden of segmenting potential cells and then classifying them. Second, we propose the use of xResNet, a successful convolutional architecture with residual connections, together with transfer learning (using a pretrained model) to achieve human-level performance. We demonstrate the performance of our approach on real-life data of two cell lines, human osteosarcoma and human leukemia, collected at the University of Amsterdam (133 training images and 32 test images). We show that the proposed method (deep learning and transfer learning) outperforms currently used machine learning methods. It achieves a test mean absolute error of 12 (±15), against 32 (±33) obtained by deep learning without transfer learning, and 41 (±37) of the best-performing machine learning pipeline (Random Forest Regression with Histogram of Oriented Gradients features).


Introduction
Automatically analyzing microscope images is an important and challenging computer vision problem that can be done for a wide range of tasks. These include but are not limited to: classifying cell types [1,2], extracting cell shapes [3,4], identifying the position of cells [5][6][7], and counting the number of cells [8][9][10][11]. Counting cells in particular serves an important function for several different biomedical tasks. It has been used to aid in estimating microbial content [12,13], measuring cytotoxicity [14,15] and discovering the role of particular genes in cell biology, microbiology, and immunology [16][17][18]. For example, when conducting cancer research, researchers can investigate the effects of radiation using the clonogenic colony formation assay [19][20][21] or the neurosphere formation assay for measuring the proportion of brain tumor-initiating cells [22][23][24]. Manually counting cells is still conducted in some laboratories, frequently using hemocytometers as a visual aid [25]. However, this method is prone to interobserver variations, and it is arduous, time-consuming, and not feasible when using high-throughput assays. Therefore, there is a need to count cells automatically.
In the last couple of decades, many computer-vision-based approaches have been suggested for automatic cell counting. The structure of these frameworks can be roughly categorized into two sections. First, hand-crafted feature representations are extracted that are further fed into a classifier. For instance, features are extracted using Histogram of Oriented Gradients (HOG) [26], Laplacian of Gaussian operation (LoG) [27], or Local Binary Patterns (LBP) [28] to represent the input images. Then the image representation is fed to a classification model such as a Support Vector Machine (SVM) or a Random Forest (RF) to detect the cells. However, approaches with hand-crafted features as input to classifiers could suffer from the following limitations: (i) Selecting a suitable feature extractor is a non-trivial task. Frequently, this requires specific knowledge of the cells in an image and the context wherein those cells are present. (ii) Feature extraction methods can be highly coupled with a specific type of cell, and, therefore, they do not generalize well to different images containing other cell types. (iii) Many feature extraction methods contain a large number of tunable parameters that highly affect the final performance. This makes the optimization process time-consuming and a matter of trial-and-error. (iv) The performance of such methods reaches a plateau that it cannot surpass, even with large datasets.
Deep Neural Networks (DNN) have been applied to a wide range of computer vision tasks and have set new state-of-the-art results on several benchmarks related to classification, object detection, and segmentation [29][30][31][32][33]. The biggest advantage of using deep neural networks is that there is no need to rely on hand-crafted features anymore. Instead, a deep neural network is optimized using (mini-batch) gradient descent over large amounts of data and automatically learns problem-specific features directly from training data. With regard to the problem of counting cells in images, this task can be approached from two distinct directions: (i) developing a cell detector, and (ii) developing a cell counter.
Cell detection approaches consider the problem as a matter of identifying cell instances in an image first and then counting the number of instances found. Frequently, these approaches rely on each image containing annotation information, consisting of either dot annotation localizing the centroids of cells [34][35][36], or bounding box annotations indicating the borders of cells [37]. The problem with these approaches is that using them in real-world applications is challenging because cell densities can be extremely high, ranging in the thousands per image, and cells can have wide varieties regarding shapes, sizes, and colors. A different approach relies on applying semantic segmentation and predicting the spatial density [38] or density maps [39]. However, these methods require pixel-level annotations of the images during training, which is typically hard to acquire due to high costs.
Counting by regression, on the other hand, directly predicts the number of target instances in an image rather than detects the number of objects present. Various machine learning and deep-learning approaches are available for a regression task, such as Support Vector Regression (SVR) [40], Random Forest Regression (RF) [41], Nearest-neighbor regression (NNR) [42], and Ridge Regression (RR) [43]. Counting by segmentation can be considered to be a mix of the counting by detection and counting by regression method since cells are first segmented before they are regressed. For example, in [44] a Feature Pyramid Network was used to segment the images by first building a ground truth feature mask, before regressing on that feature mask. Although this method is less labor-intensive than methods that require annotating, it cannot overcome limitations such as cell clumping and overlap.
More generally, all methods that require obtaining annotations for the entire training data are time-consuming and expensive, especially if we are only interested in the global count of cells, and not in additional information such as their shapes, sizes, or locations. When it comes to developing a cell counter, the problem could be considered to be a regression task, whereby we are interested in learning the relationship between image representations (which are usually image global features) and the number of cells present [45,46]. The problem is, however, that many of these approaches treat the regression problem as a multi-class classification task. This means that the number of cells is considered to be a class ID, and images with the same number of cells belong to the same class. As a result, a model will view images with 2 cells or 500 cells just as far apart as images containing 50 and 51 cells. In other words, they are unable to properly learn ordering. Thus, it is important to model this task as a regression problem rather than a multi-class classification problem.
Here, we propose the use of a Convolutional Neural Network (CNN) that directly predicts the number of cells, i.e., it is a regression model. As a result, our approach does not require pixel annotations, which are typically very costly to obtain; this drastically improves its applicability. We will demonstrate that, in comparison to approaches with handcrafted features as input to a regressor, our method not only provides better performance but is also easy to use. Moreover, we show that the application of transfer learning is essential to achieve high performance when only a small amount of data is available.
The contribution of this paper is three-fold:
• We develop a pipeline for automatically counting cells using a convolutional neural network-based regressor.
• We indicate a specific architecture of a neural network that, in combination with transfer learning, allows the achievement of very promising performance.
• We provide baseline results for the newly collected data. Moreover, we show that the proposed approach achieves human-level performance.

Problem Statement
We consider the problem of counting cells as a regression task. Let x ∈ R^(C×H×W) denote an image with C channels (e.g., RGB), height H and width W. Furthermore, let y ∈ R be the number of cells present in x. The regression model F transforms the extracted features from x, φ(x), to the number of cells, i.e.,:

ŷ = F(φ(x; a); θ),

where a ∈ R^D denotes the parameters of a feature extractor, and θ stands for the parameters of the regression model F. For given N datapoints, D = {(x_n, y_n)}_{n=1}^{N}, we aim to minimize a given loss function L with respect to a and θ, i.e.,:

L(a, θ; D) = (1/N) Σ_{n=1}^{N} ℓ(a, θ; x_n, y_n),

where, typically, ℓ(a, θ; x_i, y_i) = (y_i − F(φ(x_i; a); θ))^2, and L is then known as the Mean Squared Error (MSE) loss. We consider two approaches for automatic cell counting, namely a machine learning pipeline and a CNN-based regressor. In the following sections, we explain the components of these two approaches. For the machine learning pipeline, we outline different feature extractors and nonlinear regressors. Next, we present our proposition of the CNN-based regressor with a specific architecture of the CNN, namely xResNet [47,48].
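As an illustration, the MSE objective above can be computed in a few lines of NumPy (a minimal sketch; the function name is ours, and the predictions stand in for the model outputs F(φ(x; a); θ)):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean Squared Error between true and predicted cell counts:
    (1/N) * sum_i (y_i - F(phi(x_i)))^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))
```

Minimizing this quantity over the training set with respect to both the feature-extractor parameters a and the regressor parameters θ is exactly the end-to-end objective used by the CNN-based regressor below.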

Machine Learning Pipeline
A crucial component for a successful regressor is the feature extraction. There are many feature extraction methods available for image processing [49]. Here we focus on two feature extractors: Histogram of Oriented Gradients (HOG) and the Frangi filter. Next, once features are extracted, a regression model is used to predict cell counts. The simplest approach would be to apply linear regression. However, nonlinear models, e.g., Support Vector Regression or XGBoost, are preferred due to their flexibility. As a result, we obtain a two-stage pipeline (see Figure 1A) that consists of: (i) a feature extraction method, (ii) a regression method. Each stage is optimized separately. We refer to this approach as the machine learning pipeline.

Feature Extractors
In this paper, we consider two feature extractors, namely Histogram of Oriented Gradients (HOG) and the Frangi Filter. HOG is considered one of the best-performing feature extractors in computer vision, while the Frangi Filter is widely used in medical imaging.

HOG
A Histogram of Oriented Gradients [26] is a commonly used feature descriptor for extracting features from image data. It works by extracting the distribution of orientations of gradients of an image. Intuitively, these gradients are useful because the magnitude is large around edges and corners, since these regions contain abrupt intensity changes, and could be indicative of a cell being present. An example of the output of the HOG feature extractor is presented in Figure 2 (Middle).
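As a sketch of how such a descriptor can be obtained in practice (using scikit-image; the parameter values below are common defaults, not those tuned in this work):

```python
import numpy as np
from skimage.feature import hog

# Compute a HOG descriptor for a single-channel image: 9 orientation
# bins, 8x8-pixel cells, and 2x2-cell blocks, flattened into a vector.
image = np.random.rand(64, 64)
features = hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)
```

The resulting 1-D feature vector can then be fed to any of the regressors described below.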

Frangi Filter
The Frangi filter was originally proposed for medical images [50]. Although it was designed for, and is still applied to, detecting vessel-like or tube-like structures and fibers [51,52], some preliminary exploratory research demonstrated that this filter is also effective for our task. An example of the output of the Frangi filter is presented in Figure 2 (Right).
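A minimal usage sketch with scikit-image (the random input is a stand-in for a microscope image):

```python
import numpy as np
from skimage.filters import frangi

# The Frangi vesselness filter returns a per-pixel response map of the
# same shape as the input, emphasizing elongated, tube-like structures.
image = np.random.rand(64, 64)
response = frangi(image)
```

The response map can be flattened and used as the feature representation φ(x) in the pipeline.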

Support Vector Regression
One of the most popular machine learning tools for classification is the Support Vector Machine (SVM), first introduced in [57]. Many variations have been introduced since then; we focus on Epsilon Support Vector Regression (ε-SVR), whereby the goal is to find a function F that does not deviate more than ε from y_n for each training point x_n, and is as flat as possible. The resulting regressor takes the following form:

F(x) = Σ_{n=1}^{N} (θ_n − θ̂_n) k(φ(x_n), φ(x)) + b,

where k(φ(x_n), φ(x)) is a kernel function, b is a bias term, and θ_n and θ̂_n follow from the solution of the dual formulation [57].
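A minimal ε-SVR usage sketch with scikit-learn (the toy 1-D feature, standing in for extracted image features, and the hyperparameters are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# Toy data: a single feature with the target (cell count) growing
# linearly in that feature, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(0, 0.5, size=100)

# Epsilon-SVR with an RBF kernel: fit within an epsilon-tube
# around the targets while keeping the function flat.
model = SVR(kernel="rbf", C=10.0, epsilon=0.5)
model.fit(X, y)
pred = model.predict([[5.0]])
```

In our setting, X would hold the HOG or Frangi features of each image and y the manual cell counts.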

Gradient Tree Boosting
Gradient boosting is a common machine learning method where multiple weak prediction models (usually CART tree learners) are combined into one predictive model.
The regression gradient boosting algorithm was first introduced in [56]; it modifies the generic algorithm such that it chooses a separate optimal value γ_jm for each of the tree's regions R_jm, instead of a single γ_m for the whole tree. Then, the update rule for the model becomes:

F_m(x) = F_{m−1}(x) + Σ_{j=1}^{J_m} γ_jm I(x ∈ R_jm),

where I(·) is the indicator function and J_m is the number of terminal regions of the m-th tree.
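A minimal usage sketch with scikit-learn's gradient boosting regressor (toy data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.3, size=200)

# Each boosting stage fits a small regression tree to the residuals
# of the current ensemble and adds it with per-region step sizes.
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=3)
gbr.fit(X, y)
pred = gbr.predict([[4.0]])
```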

XGBoost
XGBoost is an improved version of previously developed gradient boosted models (GBM) [54]. While GBM builds trees strictly sequentially, XGBoost parallelizes parts of tree construction, making the algorithm much faster. In addition, XGBoost includes regularization in the objective function, which penalizes more complex models and prevents overfitting.

Ridge Regression
Ridge regression is a popular variation of linear regression that adds L2 regularization to the loss function, thereby preventing overfitting to the training data. Let λ > 0 be a regularization factor. Then the penalized loss function of ridge regression can be formulated as follows:

L(θ) = Σ_{n=1}^{N} (y_n − θ^T φ(x_n))^2 + λ ||θ||_2^2.
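The penalized objective has the closed-form minimizer θ = (Φ^T Φ + λI)^(−1) Φ^T y, which can be sketched directly in NumPy (the toy data and function name are ours):

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution: (Phi^T Phi + lam*I)^{-1} Phi^T y."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

# Toy check: recover weights close to the generating [2, -1],
# slightly shrunk toward zero by the L2 penalty.
rng = np.random.default_rng(3)
Phi = rng.normal(size=(50, 2))
y = Phi @ np.array([2.0, -1.0]) + rng.normal(0, 0.1, size=50)
theta = ridge_fit(Phi, y, lam=1.0)
```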

Nearest-Neighbor Regression
The Nearest-Neighbor algorithm is a non-parametric regression introduced in [58]. It works by predicting target values through interpolation of the targets of k nearest neighbors in the training set, scaled by a weighting factor. Most commonly they are weighted by the inverse of their distance.
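A minimal usage sketch with scikit-learn's k-nearest-neighbor regressor using inverse-distance weighting (toy data, illustrative k):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Predict by interpolating the targets of the k nearest neighbors,
# each weighted by the inverse of its distance to the query point.
knn = KNeighborsRegressor(n_neighbors=3, weights="distance")
knn.fit(X, y)
pred = knn.predict([[2.5]])
```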

Remarks
The machine learning pipeline works fairly well in practice; however, there are still some issues with it. First, the range of possible methods for extracting features is broad, and the same could be said about choosing a regressor. In [59] alone, they mention 35 different feature extractors, while a plethora of regression techniques have been described in the literature. Second, even with a perfect understanding of a feature extractor or a regressor's inner workings, it is not possible to predict exactly how well it will perform on a dataset for a given task. In general, it is very difficult to predict which combination of a feature extraction method and a regressor will perform well. Taking into account these issues, neural networks seem to constitute a more suitable class of models due to their flexibility and adaptivity to a broad class of problems, including our task of cell counting.

CNN-Based Regressors
Alternatively, we can consider a deep-learning-based approach. Instead of selecting various methods and optimizing them separately, we could train a single model end-to-end, in which a feature extractor and a regressor are modeled by deep-learning building blocks (e.g., convolutional and fully connected layers, residual blocks). Such an approach has proven successful for counting cells, but framed as a classification task, see [60]. Here, we propose the combination of a convolutional neural network (CNN) with a regressor (a fully connected neural network, FCN) to predict the cell count directly (see Figure 1B), without segmenting cells and then counting them. We refer to this approach as a CNN-based regressor.

Neural Network Regression
The main advantage of neural networks is that instead of selecting a pair of a feature extractor and a regression model, we select an architecture that allows learning features and predicting cell counts simultaneously. Moreover, the learning process is end-to-end, which significantly simplifies the optimization and increases the adaptive capabilities to a specific problem or dataset. Furthermore, a neural network regression relies only on pixel intensities, whereas other cell counting approaches commonly require additional information, such as the cells' shapes, sizes, and locations.
As mentioned previously, manually extracting features is time- and labor-intensive and its performance is hard to predict. Therefore, deep-learning-based models and especially Convolutional Neural Networks (CNN) have become more popular among researchers and practitioners. Because their architecture is intrinsically hierarchical, CNNs can automatically and adaptively learn multi-level hierarchies of features that are translation-equivariant [61]. This means that inside the CNN, early layers will learn low-level features such as edges, textures, and boundaries, while deeper layers can learn high-level semantic features that are close to the target labels.
The CNN architecture usually consists of several building blocks, such as convolutional layers, pooling layers, and fully connected layers, whereby the first two layers are repeated multiple times, followed by one or more fully connected layers. In the convolutional layers, features are extracted through a combination of linear and nonlinear operations, i.e., a convolutional operation whereby a kernel is applied across the input and nonlinear activation functions such as the rectified linear unit (ReLU) and its variants. In the pooling layers, feature maps are typically downsampled to introduce translation invariance to small shifts and distortions and to decrease the number of learnable parameters. Max pooling and global average pooling are most commonly used. The output feature map of the last convolutional layer is commonly flattened and connected to one or more fully connected layers.
Because obtaining large amounts of high-quality labeled data in medical imaging is seldom possible, given the cost and the necessary workload of radiology experts, transfer learning is often applied. With transfer learning, a network is pretrained on an extremely large dataset, such as ImageNet, and then applied to a different task in a different domain [62]. The underlying notion is that the early layers of this network are already trained to recognize low-level features and that these features can be shared across unrelated datasets. This makes transfer learning particularly useful for small datasets.

Our CNN-Based Regressor
In this paper, we propose the use of the deep residual network architecture (ResNet) as introduced in [63]. In contrast to other CNN architectures, the ResNet consists of several residual blocks, which are composed of a convolutional layer, a batch normalization layer, and a shortcut that connects the original input to the output of the residual block. Figure 3a shows schematically how a Residual Block with Identity Shortcut (RB-IS) and a Residual Block with Projection Shortcut (RB-PS) operate. Mathematically, the residual block can be summarized as follows:

h_l = F(h_{l−1}; W_l) + G(h_{l−1}),

where h_{l−1} is the output of the (l−1)-th layer, F is the residual function (e.g., a composition of convolutional layers, batch-norm layers, pooling layers, and nonlinear activations such as ReLU), and G corresponds to the shortcut connection, which could be either the identity map, G(h_{l−1}) = h_{l−1}, or a projection, G(h_{l−1}) = W_p h_{l−1}; here, W_l denotes the weights of a residual block, and W_p stands for the weights of the projection. If the dimensions of h_{l−1} and h_l are the same, identity shortcuts are used; otherwise, a projection mapping is applied, meaning a linear projection W_p is performed on the shortcut connection to match the dimensions. The main idea of the ResNet is to learn the additive residual function F with respect to G(h_{l−1}), with the choice to use an identity mapping and/or a projection mapping. In our implementation, we changed the last fully connected layer of the ResNet to output 1 scalar instead of the original 1000-dimensional vector. In addition, the SoftMax loss function was replaced with the MAE loss.
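A minimal PyTorch sketch of the residual computation h_l = F(h_{l−1}) + G(h_{l−1}), with a 1 × 1 projection shortcut when dimensions differ (the class, layer sizes, and head width are illustrative, not those of xResNet50):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """h_l = ReLU(F(h_{l-1}) + G(h_{l-1})): F is a small conv stack,
    G is the identity or a 1x1 projection when channel counts differ."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
        )
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.residual(x) + self.shortcut(x))

block = ResidualBlock(16, 32)          # projection shortcut (16 -> 32)
out = block(torch.randn(1, 16, 8, 8))

# Turning a classifier into a regressor: replace the final 1000-way
# fully connected layer with a single-output linear layer.
head = nn.Linear(2048, 1)
```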
In this paper, we used xResNet, a modified version of the original ResNet architecture that has been shown to increase accuracy and to provide improved performance for transfer learning tasks compared to the original ResNet architectures [47]. xResNet applies a couple of small tweaks to the original architecture. The first one is shown in Figure 3b as ResNet-B: path A of the downsampling block is altered by moving the stride of 2 to the second convolution and keeping a stride of 1 for the first layer.
The second modification can be seen in ResNet-C (Figure 3b): the 7 × 7 convolution in the input stem of the network is replaced with three consecutive 3 × 3 convolutions, whereby the first one has a stride of 2 and the other two a stride of 1. The third modification is present in ResNet-D (Figure 3b): path B of the downsampling layer is modified by adding average pooling with a stride of 2 before a convolution with a stride of 1 (instead of 2).
For our purposes, we propose the use of xResNet50, which consists of an input stem, four stages, each with a varying number of xResNet blocks, and an output stem. The input stem contains three convolutional layers, each followed by batch normalization and a ReLU. Between the input stem and the first of the four stages, there is a max pooling layer. The first stage consists of three xResNet blocks, the second stage of four, the third stage of six, and the fourth stage of three xResNet blocks again. Between the fourth stage and the output stem, there is an average pooling layer. In the output stem, the input is first flattened and dropout is applied, before the final fully connected linear layer, which has 1 output. In total, there are 23,529,313 trainable parameters.

Dataset Data Information
For cell counting, two cell lines were used, namely a human osteosarcoma cell line (U2OS) and a human leukemia cell line (HL-60). Osteosarcoma is a primary malignant form of bone cancer. Generally, it affects children and adolescents, where it represents the eighth most common form of childhood cancer. Considering its poor prognosis, it is the second most important cause of deaths related to cancer in both children and adolescents [64,65]. Cells from the U2OS cell line originate from bone tissue, from a differentiated sarcoma of the tibia. Moreover, U2OS cells are epithelial adherent cells and exhibit a fast growth rate. HL-60 is a cell line that was originally derived from a woman suffering from acute promyelocytic leukemia. This type of blood cancer is a subtype of acute myeloid leukemia. Untreated, the median survival is less than one month [66].
The U2OS and HL60 cell lines were cultured in Roswell Park Memorial Institute 1640 medium (RPMI; GibcoBRL, Grand Island, NY, USA) and Dulbecco's Modified Eagle Medium (DMEM; GibcoBRL, Grand Island, NY, USA), respectively, supplemented with 10% Fetal Bovine Serum (FBS) and 1% antibiotics (penicillin and streptomycin). To determine the number of cells growing in medium, hemocytometer counting was performed according to the ABCAM protocol 'Counting Cells Using a Hemocytometer' with minor modifications. Visualization and counting of the cells was done using microscope Axiovert S100 from Zeiss, with a 10X objective. The mean number of cells was determined by repeatedly counting cells in four non-overlapping sets of 16 corner squares, indicated on the hemocytometer, selected at random. A camera and computer program from Zeiss, AxioCam Cm1 and ZEN 2 respectively, were used to capture and save snapshots of the hemocytometer containing the cells. These microscope images could then be used for experiments.

Data Preparation
The images were originally in the Carl Zeiss Images (.CZI) format, a microscope image file format commonly containing image stacks, time-lapse series, and tile images captured from a Carl Zeiss microscope. In contrast with regular image formats, CZI can combine imaging data with additional metadata about the image itself, the microscope, and the camera used. The files in this dataset, however, only contained metadata on the images, i.e., its size (1388 by 1038 pixels with 1 channel), and its pixel type ('gray16'). In the filename, the number of cells present in the inner 16 squared grid was given. Using the AICSPYLIBCZI library (https://pypi.org/project/aicspylibczi/ (accessed on 26 May 2021)) the images were loaded as NUMPY arrays and saved in the TIFF format using the PILLOW Image module (https://pillow.readthedocs.io/ (accessed on 26 May 2021)). Next, the grid of 16 inner squares was manually cropped because only cells in this area were counted and represented in its label. This resulted in images of 700 by 700 pixels. The OpenCV library was used to handle images in the tiff format.
When looking at the original images in Figure 4, the following challenges can be observed:
• The cells have varying circularity ratios, ranging from elongated ellipses to round circles. Therefore, the round-shape assumption that some algorithms rely on does not hold.
• Some cells have differing levels of staining intensity. Therefore, algorithms that require homogeneous intensity distributions of the object will not perform well.
• Since cells were manually counted using counting chambers, the grid lines are still present in the image. These will interfere with algorithms that separate foreground from background.
Altogether, we obtained 165 images that we randomly split into the training set (133 images) and the test set (32 images). To compare all methods in a fair manner, we kept this training-test split fixed. Moreover, we decided to keep this split fixed for future comparisons and reproducibility. The data are available online at: https://doi.org/10.5281/zenodo.4428844 (accessed on 26 May 2021).

Data Augmentation
Since we have a low number of images relative to the image size (700 × 700 pixels), we used data augmentation [67,68]. We used transformations that manipulate the orientation, brightness, and contrast of the cropped cell images. Each image was randomly mirror-flipped along the vertical and horizontal axes, and the brightness and contrast were randomly adjusted (by a 20% strength, with a probability of 75% of occurring). In our preliminary experiments, training without data augmentation resulted in overfitted models, a well-known phenomenon in the computer vision field [68].

Details of the Machine Learning Pipeline
In this work, we implemented and verified five different machine learning techniques, namely: Support Vector Regression (SVR), Ridge Regression (RR), Nearest-Neighbor Regression (NNR), XGBoost, and Gradient Tree Boosting (GTB). For each of these approaches, in the preliminary experiments, we analyzed the following feature extractors: multi-dimensional Gaussian filtering, Frangi filtering, hybrid Hessian filtering, Laplace operator filtering, local median filtering, Meijering filtering, Roberts' filtering, and ISODATA, Li's, local mean, Niblack, Otsu, Sauvola, and Yen threshold-based filtering. Since Frangi filtering gave the best preliminary results, further experiments were conducted using that method only. Additionally, we used the Histogram of Oriented Gradients since it is one of the state-of-the-art feature extractors in computer vision.

Details of Our Approach
A pretrained model was used to initialize the xResNet (we used the network xResNet50 available at: https://docs.fast.ai/vision.models.xresnet.html (accessed on 26 May 2021), which is pretrained on ImageNet data). The entire network was trained for 400 epochs, where an epoch represents one training pass through all training instances. Training employed momentum-based gradient optimization, with the momentum linearly increasing from 0.85 to 0.95 in the first 30% of epochs (133 epochs); for the remaining 70% (267 epochs), the momentum decreased from 0.95 to 0.85. Furthermore, learning rates were scheduled using cosine annealing with a base learning rate of 0.002 [69]. No weight decay was applied to bias and batch normalization layers. For the first epoch, all but the last layer were frozen, meaning that only the weights of the last layer were updated, with a fixed learning rate of 0.002. Then, for the remaining 399 epochs, the entire network was unfrozen and trained using learning rate scheduling with cosine annealing and the one-cycle training policy introduced by Smith [70].
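A minimal PyTorch sketch of such a one-cycle schedule with cosine annealing (the tiny linear model is a stand-in for xResNet50; the momentum range and base learning rate follow the text, while the remaining settings are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # stand-in for xResNet50
opt = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.95)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=0.002, total_steps=399,
    pct_start=0.3,                 # warm-up fraction of the cycle
    anneal_strategy="cos",         # cosine annealing
    cycle_momentum=True, base_momentum=0.85, max_momentum=0.95)

lrs = []
for _ in range(399):
    opt.step()                     # no gradients here; schedule demo only
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])
```

The learning rate rises toward `max_lr` over the first part of the cycle and then anneals down, while the momentum is cycled in the opposite direction between 0.95 and 0.85.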
The proposed CNN-based regressor was implemented in Python using PyTorch and Fastai package [48]. All the experiments are run on a machine with Intel Xeon CPU @ 2.2 GHz × 2 and GPU Nvidia Tesla V100-SXM2-16GB.

Evaluation Metric
In all the experiments, we used the Mean Absolute Error (MAE) as the metric to quantitatively evaluate performance. It is defined as follows:

MAE = (1/N) Σ_{i=1}^{N} |t_i − y_i|,

where N is the total number of images, and t_i and y_i are the true and predicted numbers of cells in the i-th test image, respectively. We prefer MAE over MSE because MAE is easily interpreted (it has the same unit as the original cell count, while MSE reports the squared error).
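The metric is a one-liner in NumPy (the function name is ours):

```python
import numpy as np

def mean_absolute_error(t, y):
    """MAE = (1/N) * sum_i |t_i - y_i|; same unit as the cell count."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(np.abs(t - y)))

# e.g., true counts [100, 250] vs predictions [110, 240]
mean_absolute_error([100, 250], [110, 240])  # -> 10.0
```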

Results and Discussion
We present the results for machine learning-based regressors in Figure 5 and for our proposed CNN-based regressor in Figures 6 and 7. We gather all results (average ± standard deviation) in Table 1, where raw images are used as inputs (IMG), or images are processed using HOG or Frangi filtering. We notice that there are large differences in performance among the machine learning approaches with or without feature extraction. The best-performing regressor is the Ridge Regression with the HOG features. The performance of RR + Frangi is slightly worse, but it seems to be statistically indistinguishable. The NNR performed the worst, with an MAE of 107 using the raw images as features, and 98 and 111 using the HOG and Frangi features, respectively.
Its standard deviation was close to 100 for all three feature extractors, thereby making this approach too unreliable to use in a practical setting. Interestingly, for some regressors (SVR, NNR) applying a feature extractor does not decrease the error, or it could even increase the MAE value. In some cases (NNR, XGB) the performance is almost the same. When looking at the average error of all ML-based regressors for each feature extractor, we see that using just the images gives an MAE of 88, using HOG 81, and using Frangi 78. So, although the best-performing model (RR) uses HOG features, the average performance of all ML-based regressors was best when using Frangi features. These results clearly indicate how difficult it is to select a proper pair of a feature extraction method and a regressor.
The performance of the machine learning-based regressors is rather mediocre. First, the average error of the best-performing model is around 40, which is rather high from a practical point of view. Second, the standard deviation is also high, at around 35. This means that in many cases the error is above 50 or even higher. Moreover, all other regressors achieved errors above 70, which makes them completely unreliable and impractical. A possible explanation of this outcome is the small training sample.
Our deep-learning-based approach with transfer learning, on the other hand, achieved an error around 12, with the standard deviation equal to 15 (see Table 1). Only in a few cases, the error was large (>30). Interestingly, the model performed worst when there were more than 300 cells present in an image. In that case the error was more than 60, more than 3 standard deviations away from the mean absolute error of 12. That being said, the MAE of 12 indicates the great practical potential of the proposed CNN-based regressor. This result is especially promising because even human-based counting is a noisy process that could result in a discrepancy of a similar magnitude (∼10).
To confirm whether transfer learning is indeed useful, we also trained the presented CNN-based regressor without pretraining. Consequently, the MAE increased to 33, with a standard deviation of 34 (see Table 1). This result indicates that using a pretrained neural network is indeed beneficial, especially in the biomedical domain, as highlighted in the past [62]. Nevertheless, even though the performance of the CNN-based regressor without transfer learning dropped, it was still better than that of most of the machine-learning-based regressors. This is another indicator of the superiority of end-to-end training of a single model over hand-crafted feature engineering with an adaptive model on top.
Analyzing Figure 6 (the model without transfer learning) and Figure 7 (the model with transfer learning), it is apparent that transfer learning has a huge impact on the final performance. Interestingly, both models make their worst prediction on the image containing 323 cells (the model with transfer learning achieved an error of 70, and the model without transfer learning an error of 135). More generally, both models' accuracy decreases for images with more cells, although this effect is stronger for the CNN without transfer learning, as can be seen in the scatterplots in Figures 6 and 7.
Figure 8 shows the test-set images with the highest cell counts, together with the corresponding average loss over all models. Some images contain artifacts, which may have affected the models' performance. For instance, image A shows a black artifact in the top right corner, while image B shows a light-colored circular line and a dark-colored circle in the top right corner. More importantly, all images show at least some degree of cell clumping, and some show overlapping cells. It is most likely not a coincidence that the images with the highest average error contain clumped and overlapping cells (Figure 8, images F and H). Apparently, these issues were hard to deal with for both the ML pipeline and the CNN regressor, but mostly for the former. Since manual feature extraction methods usually represent images using a limited number of feature types (such as edge, corner, or shape detection), problems may arise when the input changes such that the feature extractor's requirements are no longer satisfied. In the case of the HOG, choosing the right kernel size may be tightly coupled with the cell size, and obtaining large gradients may require strong differences in pixel intensity between the inside and outside of cell areas (which are less pronounced with cell clumping and overlap). The Frangi filter may also perform less well when artifacts are present [71] and, similar to the HOG, it performs better with stronger intensity differences at cell borders [51]. The CNN, on the other hand, did not require any assumptions about the data and was able to learn features automatically, and the CNN with transfer learning (TL) did so even better than the version without. Whereas the latter had to spend training epochs learning low-level features from scratch, the CNN with TL could transfer them from its ImageNet pretraining across domains to our task. This head start meant that the deeper layers were able to learn more abstract features, such as entire cells, possibly making the model better at distinguishing cells from one another, even when cells were clumped or overlapping.
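The Frangi filter's sensitivity to intensity contrast described above can be illustrated directly. The following toy example uses `skimage.filters.frangi` on a synthetic bright ring (a crude stand-in for a cell border, at two contrast levels); the image construction and parameters are our own illustrative assumptions, not the filter configuration used in the paper.

```python
import numpy as np
from skimage.draw import disk
from skimage.filters import frangi

def ring_image(contrast):
    # A bright ring on a dark background: a crude stand-in for a cell border.
    img = np.zeros((64, 64))
    rr, cc = disk((32, 32), 15)
    img[rr, cc] = contrast
    rr, cc = disk((32, 32), 11)
    img[rr, cc] = 0.0  # hollow interior leaves only the ring
    return img

# Vesselness response for a high-contrast vs. a low-contrast border.
high = frangi(ring_image(1.0), black_ridges=False)
low = frangi(ring_image(0.2), black_ridges=False)
```

Comparing `high.max()` with `low.max()` shows the filter responding more strongly to the high-contrast border, consistent with its weaker performance on clumped, low-contrast cell boundaries.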

Conclusions
In this paper, we investigated two supervised learning approaches to the task of counting cells. In the first pipeline, several feature extractors were combined with commonly used machine learning regression models. Since relying on hand-crafted feature extraction methods is arduous and domain-specific, and their performance is rather unpredictable, we suggested an end-to-end approach based on deep learning. We proposed the use of a Convolutional Neural Network architecture and considered the problem of cell counting as a regression task, whereby the image's cell count serves as the annotation supervising training. To further improve the model's performance, a specific Deep Residual Network architecture, xResNet, was used in combination with transfer learning (using a pretrained model).
Importantly, the proposed approach can handle dense cell microscope images of two cell lines, namely human osteosarcoma (U2OS) and human leukemia (HL-60), that is, real-life data where differing levels of illumination, occlusion, variations in appearance, and other challenges are present. We have demonstrated that the proposed approach achieves better performance than the other machine learning methods. Moreover, the error obtained by our method (12 ± 15) is on par with that of a human lab worker. Additionally, the proposed CNN-based regressor provides an answer in milliseconds, while the human counting process takes minutes. These two facts indicate the great practical potential of our approach.