# The Impact of Preprocessing Methods for a Successful Prostate Cell Lines Discrimination Using Partial Least Squares Regression and Discriminant Analysis Based on Fourier Transform Infrared Imaging

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. Cell Culture

#### 2.2. Sample Preparation for Spectroscopic Studies

_{2} optical windows at a density of 40,000 cells/mL. At 48 h after seeding, cells were washed for 15 min in HBSS supplemented with calcium chloride and magnesium chloride (Gibco, Invitrogen, #14025) at 37 °C and fixed in 4% PFA solution (Affymetrix, Cleveland, OH, USA) in PBS for 20 min at 37 °C. Cells were then washed in a graded HBSS series from 100% to 0% (100%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10%, and 0%, the last step being ddH2O) and air-dried. Each step was performed once for 2 min. All samples were washed and dried at room temperature.

#### 2.3. FT-IR Measurements

_{2} (13 × 1 mm) were acquired in transmission mode in the range from 3800 cm⁻¹ to 900 cm⁻¹. The spectral resolution was 4 cm⁻¹ (with a zero-filling factor of 1, giving rise to 1582 spectral points), and the number of scans per spectrum was 256. Spectra from each of the cells were averaged to a single spectrum, giving a set of 48 spectra across the 5 cell lines: RWPE-1 (14 spectra), 22Rv1 (7 spectra), PC3 (7 spectra), Du145 (10 spectra), and LNCaP (10 spectra). This approach provides data with the dimensionality of 48 objects (cells) by 1582 variables (frequencies).
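The data layout described above can be sketched as follows. This is only an illustration of the stated dimensions: the variable names are hypothetical, the matrix is a zero-filled placeholder rather than measured data, and the uniform wavenumber grid is an assumption (the text gives only the range and the number of points).

```python
import numpy as np

# Spectra per cell line, as stated in the text.
SPECTRA_PER_LINE = {"RWPE-1": 14, "22Rv1": 7, "PC3": 7, "Du145": 10, "LNCaP": 10}
N_POINTS = 1582  # spectral points per spectrum, as stated in the text

# One class label per row of the data matrix.
labels = np.repeat(list(SPECTRA_PER_LINE), list(SPECTRA_PER_LINE.values()))

# Assumption: a uniform wavenumber grid spanning the stated range (cm^-1).
wavenumbers = np.linspace(3800.0, 900.0, N_POINTS)

# Placeholder matrix with the stated dimensionality (objects x variables).
X = np.zeros((labels.size, N_POINTS))
```

Keeping the label vector aligned with the rows of `X` is what later allows the spectra to be split into calibration and test sets by cell line.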

#### 2.4. Noise Addition and Preprocessing

#### 2.5. Regression and Classification

#### 2.6. Model Calibration and Validation

#### 2.7. Description of Methods

#### 2.7.1. Baseline Correction

cm⁻¹, 1280 cm⁻¹, 1302 cm⁻¹, 1761 cm⁻¹, 1977 cm⁻¹, 2412 cm⁻¹, 2825 cm⁻¹, 2997 cm⁻¹, and 3519 cm⁻¹.

_{i} calculations (introducing this parameter gives higher weights to negative residuals $\left({y}_{i}-{z}_{i}\right)$ and lower weights to positive ones). The above approach is expressed by the equation:

$$S=\sum_{i} w_{i}\left(y_{i}-z_{i}\right)^{2}+\lambda \sum_{i}\left(\Delta^{2} z_{i}\right)^{2}$$

where $w_i$ are the weights ($w_i = p$ if $y_i > z_i$, and $w_i = 1-p$ if $y_i \le z_i$), $y_i$ is the signal, and $z_i$ is the rough signal [49]. The parameters chosen after optimization were $p = 0.1$ and $\lambda$ in the range $10^{6}$ to $10^{8}$.
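The asymmetric weighting scheme can be illustrated with a short sketch. This is a minimal example and not the implementation used in this study: the function name `als_baseline`, the fixed iteration count `n_iter`, and the dense linear solve are assumptions chosen for brevity, and `z` is treated as the estimated baseline. The defaults follow the values quoted in the text ($p = 0.1$, $\lambda = 10^{6}$).

```python
import numpy as np

def als_baseline(y, lam=1e6, p=0.1, n_iter=10):
    """Asymmetric least squares baseline estimate (dense solve, short signals)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-order difference operator D with shape (n - 2, n).
    D = np.diff(np.eye(n), n=2, axis=0)
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    z = y.copy()
    for _ in range(n_iter):
        # Minimise sum_i w_i (y_i - z_i)^2 + lam * sum_i (second difference of z)_i^2,
        # which leads to the linear system (W + lam * D'D) z = W y.
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        # Weights as defined in the text: p where y_i > z_i, 1 - p otherwise.
        w = np.where(y > z, p, 1.0 - p)
    return z
```

Subtracting the returned estimate from `y` gives the baseline-corrected signal. For full-length spectra a sparse solver would typically be preferable to the dense matrices used here.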

#### 2.7.2. Normalization

^{−1}) was the most stable and was chosen for normalization.

#### 2.7.3. Denoising

## 3. Results and Discussion

#### 3.1. Spectral Changes

#### 3.2. PLS Discriminant Analysis

cm⁻¹), Amide A and Amide B (3000–3400 cm⁻¹); nucleic acids (944–1140 cm⁻¹) and CH₂/CH₃ (2800–3000 cm⁻¹) spectral regions; an in-depth discussion of these can be found elsewhere [23]. Following this, the accuracy values for external and internal validation obtained at the chosen number of LVs were compared for each combination. A large number of method combinations gave accuracy values above 0.8 for both external and internal validation. However, some combinations reached internal accuracy above 0.8 but predicted the independent test set poorly, with accuracy below 0.2. These models were overfitted, which resulted from the small number of samples. For example, the PQN method uses the mean spectrum as a standard; it is therefore sensitive to spectral distortion, which can occur after the first preprocessing step, baseline correction. Given that the test set contains a small number of samples, if certain spectra are distorted, the majority of the ratios between the individual spectra and the standard can become unstable. The combination that gave the worst external accuracy and the best internal accuracy for the original data was FT/POL with PQN (marked with a green circle in Figure 3c). Spectra and beta coefficients for this combination are shown in the Supplementary Materials (Figure S4a). Denoising methods had no significant influence on classification stability for this dataset.

As expected, classification performance for the data with added noise was worse than for the original data (Figure 3b,d). Three clusters of method combinations could now be observed. In Figure 3b, most combinations had accuracies between 0.4 and 0.85, whereas for the original data the accuracies ranged from 0.8 to 0.94. For the noisy data, 13 combinations gave the best accuracy value with a reasonable number of LVs, equal to six (marked with a circle in Figure 3b and listed in Table 1). They consisted of SG (the most frequent method), EIL, or FT denoising, combined with the ALS baseline correction method and the CON normalization method. One of the combinations with the highest internal and external accuracy values (marked in green in Table 1) is presented in the Supplementary Materials (Figure S3b). Beta coefficients for the data with added noise differ from those for the original data, although the same spectral regions are involved (Figure S3b in the Supplementary Materials). The comparison of accuracy values for external and internal validation (Figure 3d) shows two clusters of combinations, but accuracy was below 0.8 for both validations. Compared with the original data (Figure 3c), a smaller number of method combinations provided external and internal accuracy values higher than 0.8. The most overfitted model was the combination of FT/RB and PQN (Supplementary Materials, Figure S4b).

Furthermore, the number of combinations with accuracy above 0.8 for both external and internal validation was inspected (Figure 3e) to find the most robust combinations. Baseline correction methods had a crucial impact on accuracy values. For the original data, the DER method was the most stable and achieved accuracy above 0.8 most frequently, whereas ALS was the most robust baseline correction method for the noise-added data. This change in the best-performing baseline correction method arises because each differentiation of the signal adds a certain amount of noise. For high SNR this is acceptable; however, if the starting SNR is low, differentiating the added noise lowers the capability of the models. Normalization methods had a smaller impact on PLS-DA classification. Combinations with CON and TSN were the most stable, while PQN normalization was the most unstable.
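The sensitivity of PQN to distorted spectra follows from how the method is built, which a short sketch makes explicit. The code below is a simplified illustration under stated assumptions, not the procedure used in this study: the function name `pqn_normalize` is hypothetical, the mean spectrum serves as the standard (as described above), and the preliminary total-area normalization that usually precedes PQN is omitted.

```python
import numpy as np

def pqn_normalize(spectra, reference=None):
    """Probabilistic quotient normalization of row-wise spectra (simplified)."""
    X = np.asarray(spectra, dtype=float)
    # Standard spectrum: the mean spectrum unless another one is supplied.
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    # Quotient of every spectrum against the standard, variable by variable.
    # Values of the standard close to zero (e.g., after baseline correction)
    # make these quotients unstable.
    quotients = X / ref
    # The most probable quotient of each spectrum is taken as the median.
    factors = np.median(quotients, axis=1, keepdims=True)
    return X / factors, factors.ravel()
```

Because the standard is computed from the set being normalized, a few distorted spectra in a small set shift the mean spectrum and, through it, the quotients of every other spectrum.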

#### 3.3. PLS Regression

## 4. Conclusions

## Supplementary Materials

_{2} removal for (a) original data; (b) data with added noise; Figure S2: Principal Component Analysis exploration of the data with added noise after application of the following preprocessing approach: denoising → baseline correction → normalization; Figure S3: Spectra for the test set and beta coefficients for the best combination of methods, which gave very high internal accuracy with the smallest reasonable number of LVs (marked with a red circle in the left panel in Figure 3): (a) original data and (b) noise-added data; Figure S4: Spectra for the test set and beta coefficients for the worst combination of methods, which gave very high internal accuracy and the worst external accuracy (marked with a green circle in the left panel in Figure 3): (a) original data and (b) noise-added data; Figure S5: Comparison of internal and external accuracy values for the best LVs for (a) original data and (b) noise-added data, for the baseline correction methods ALS, polynomial, and rubber band; Figure S6: Comparison of RMSECV and RMSEP values for the best LVs for (a) original data and (b) noise-added data. Each dot on the plot represents a value for one combination; Figure S7: Comparison of PLS-DA accuracies calculated for all methods at each preprocessing step (the model with the optimal number of LVs allowed by CV was chosen) for (a) original data; (b) data with added noise.

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Siegel, R.L.; Miller, K.D.; Jemal, A. Cancer statistics. CA Cancer J. Clin. **2016**, 66, 7–30.
2. Ackerstaff, E.; Pflug, B.R.; Nelson, J.B.; Bhujwalla, Z.M. Detection of increased choline compounds with proton nuclear magnetic resonance spectroscopy subsequent to malignant transformation of human prostatic epithelial cells. Cancer Res. **2001**, 61, 3599–3603.
3. Augustyniak, K.; Chrabaszcz, K.; Jasztal, A.; Smeda, M.; Quintas, G.; Kuligowski, J.; Marzec, K.M.; Malek, K. High- and Ultra-High definition of IR spectral histopathology gives an insight into chemical environment of lung metastases in breast cancer. J. Biophotonics **2018**, e201800345.
4. Quaroni, L.; Zlateva, T. Infrared spectromicroscopy of biochemistry in functional single cells. Analyst **2011**, 136, 3219–3232.
5. Majzner, K.; Kaczor, A.; Kachamakova-Trojanowska, N.; Fedorowicz, A.; Chlopicki, S.; Baranska, M. 3D confocal Raman imaging of endothelial cells and vascular wall: Perspectives in analytical spectroscopy of biomedical research. Analyst **2013**, 138, 603–610.
6. Wrobel, T.P.; Mateuszuk, L.; Chlopicki, S.; Malek, K.; Baranska, M. Imaging of lipids in atherosclerotic lesion in aorta from ApoE/LDLR−/− mice by FT-IR spectroscopy and Hierarchical Cluster Analysis. Analyst **2011**, 136.
7. Wrobel, T.P.; Marzec, K.M.; Chlopicki, S.; Maślak, E.; Jasztal, A.; Franczyk-Zarów, M.; Czyzyńska-Cichoń, I.; Moszkowski, T.; Kostogrys, R.B.; Baranska, M. Effects of Low Carbohydrate High Protein (LCHP) diet on atherosclerotic plaque phenotype in ApoE/LDLR−/− mice: FT-IR and Raman imaging. Sci. Rep. **2015**, 5.
8. Marzec, K.M.; Wrobel, T.P.; Rygula, A.; Maslak, E.; Jasztal, A.; Fedorowicz, A.; Chlopicki, S.; Baranska, M. Visualization of the biochemical markers of atherosclerotic plaque with the use of Raman, IR and AFM. J. Biophotonics **2014**, 7, 744–756.
9. Baker, M.J.; Trevisan, J.; Bassan, P.; Bhargava, R.; Butler, H.J.; Dorling, K.M.; Fielden, P.R.; Fogarty, S.W.; Fullwood, N.J.; Heys, K.A.; et al. Using Fourier transform IR spectroscopy to analyze biological materials. Nat. Protoc. **2014**, 9, 1771–1791.
10. Wrobel, T.P.; Bhargava, R. Infrared Spectroscopic Imaging Advances as an Analytical Technology for Biomedical Sciences. Anal. Chem. **2018**, 90, 1444–1463.
11. Wrobel, T.P.; Piergies, N.; Pieta, E.; Kwiatek, W.; Paluszkiewicz, C.; Fornal, M.; Grodzicki, T. Erythrocyte heme-oxygenation status indicated as a risk factor in prehypertension by Raman spectroscopy. Biochim. Biophys. Acta Mol. Basis Dis. **2018**, 1864, 3659–3663.
12. Pięta, E.; Petibois, C.; Pogoda, K.; Suchy, K.; Liberda, D.; Wróbel, T.P.; Paluszkiewicz, C.; Kwiatek, W.M. Assessment of cellular response to drug/nanoparticles conjugates treatment through FTIR imaging and PLS regression study. Sens. Actuators B Chem. **2020**, 313, 1–9.
13. Paluszkiewicz, C. SR-FTIR spectroscopic preliminary findings of non-cancerous, cancerous, and hyperplastic human prostate tissues. Vib. Spectrosc. **2007**, 43, 237–242.
14. Taleb, A.; Diamond, J.; Mcgarvey, J.J.; Beattie, J.R.; Toland, C.; Hamilton, P.W. Raman Microscopy for the Chemometric Analysis of Tumor Cells. J. Phys. Chem. B **2006**, 110, 19625–19631.
15. Nicholson, J.M.; Lyng, F.M.; Byrne, H.J.; Hart, C.A.; Brown, M.D.; Clarke, N.W.; Gardner, P. An investigation of the RWPE prostate derived family of cell lines using FTIR spectroscopy. Analyst **2010**, 135, 887–894.
16. Corsetti, S.; Rabl, T.; Mcgloin, D.; Nabi, G. Raman spectroscopy for accurately characterizing biomolecular changes in androgen-independent prostate cancer cells. J. Biophotonics **2018**, 11, 1–8.
17. Crow, P.; Barrass, B.; Kendall, C.; Wright, M.; Persad, R.; Stone, N. The use of Raman spectroscopy to differentiate between different prostatic adenocarcinoma cell lines. Br. J. Cancer **2005**, 92, 2166–2170.
18. Gazi, E.; Dwyer, J.; Gardner, P.; Wade, A.P.; Miyan, J.; Lockyer, N.P.; Vickerman, J.C.; Clarke, N.W.; Shanks, J.H.; Scott, L.J.; et al. Applications of Fourier transform infrared microspectroscopy in studies of benign prostate and prostate cancer. A pilot study. J. Pathol. **2003**, 201, 99–108.
19. Henderson, A.; Brown, M.D.; Snook, R.D.; Faria, E.C.; Gardner, P.; Harvey, T.J.; Clarke, N.W.; Ward, A.D.; Gazi, E. Spectral discrimination of live prostate and bladder cancer cell lines using Raman optical tweezers. J. Biomed. Opt. **2008**, 13, 064004.
20. Harvey, T.J.; Gazi, E.; Henderson, A.; Snook, R.D.; Clarke, N.W.; Brown, M.; Gardner, P. Factors influencing the discrimination and classification of prostate cancer cell lines by FTIR microspectroscopy. Analyst **2009**, 134, 1083–1091.
21. Harvey, T.J.; Henderson, A.; Gazi, E.; Clarke, N.W.; Brown, M.; Faria, C.; Snook, R.D.; Gardner, P. Discrimination of prostate cancer cells by reflection mode FTIR photoacoustic spectroscopy. Analyst **2007**, 132, 292–295.
22. He, D.; Guan, Z.; Fan, J.; Cao, P.; Zhang, G.; Wang, J.; Dang, Q.; Wang, X.; Huang, L.; Wang, L.; et al. Raman spectroscopy, a potential tool in diagnosis and prognosis of castration-resistant prostate cancer. J. Biomed. Opt. **2013**, 18, 087001.
23. Pogoda, K.; Pięta, E.; Roman, M.; Piergies, N.; Liberda, D.; Wróbel, T.P.; Janmey, P.A.; Paluszkiewicz, C.; Kwiatek, W.M. In search of the correlation between nanomechanical and biomolecular properties of prostate cancer cells with different metastatic potential. Arch. Biochem. Biophys. **2021**, 697.
24. Mukherjee, P.; Lim, S.J.; Wrobel, T.P.; Bhargava, R.; Smith, A.M. Measuring and Predicting the Internal Structure of Semiconductor Nanocrystals through Raman Spectroscopy. J. Am. Chem. Soc. **2016**, 138.
25. Wrobel, T.P.; Kwak, J.T.; Kadjacsy-Balla, A.; Bhargava, R. High-definition Fourier transform infrared spectroscopic imaging of prostate tissue. In Proceedings of the Progress in Biomedical Optics and Imaging - Proceedings of SPIE, San Francisco, CA, USA, 13 February 2016; Volume 9791.
26. Pérez-Guaita, D.; Kuligowski, J.; Garrigues, S.; Quintás, G.; Wood, B.R. Assessment of the statistical significance of classifications in infrared spectroscopy based diagnostic models. Analyst **2015**, 140, 2422–2427.
27. Pérez-Guaita, D.; Kuligowski, J.; Lendl, B.; Wood, B.R.; Quintás, G. Assessment of discriminant models in infrared imaging using constrained repeated random sampling - Cross validation. Anal. Chim. Acta **2018**, 1–9.
28. Koziol, P.; Raczkowska, M.K.; Skibinska, J.; Urbaniak-Wasik, S.; Paluszkiewicz, C.; Kwiatek, W.; Wrobel, T.P. Comparison of spectral and spatial denoising techniques in the context of High Definition FT-IR imaging hyperspectral data. Sci. Rep. **2018**, 8, 1–11.
29. Koziol, P.; Raczkowska, M.K.; Skibinska, J.; McCollum, N.J.; Urbaniak-Wasik, S.; Paluszkiewicz, C.; Kwiatek, W.M.; Wrobel, T.P. Denoising influence on discrete frequency classification results for quantum cascade laser based infrared microscopy. Anal. Chim. Acta **2018**.
30. Gorrochategui, E.; Jaumot, J.; Lacorte, S.; Tauler, R. Data analysis strategies for targeted and untargeted LC-MS metabolomic studies: Overview and workflow. TRAC Trends Anal. Chem. **2016**, 82, 425–442.
31. Singh, R.; Wrobel, T.P.; Mukherjee, P.; Gryka, M.; Kole, M.; Harrison, S. Bulk Protein and Oil Prediction in Soybeans Using Transmission Raman Spectroscopy: A Comparison of Approaches to Optimize Accuracy. Appl. Spectrosc. **2019**.
32. Engel, J.; Gerretzen, J.; Szymańska, E.; Jansen, J.J.; Downey, G.; Blanchet, L.; Buydens, L.M.C. Breaking with trends in pre-processing? TRAC Trends Anal. Chem. **2013**, 50, 96–106.
33. Zimmermann, B.; Kohler, A. Optimizing savitzky-golay parameters for improving spectral resolution and quantification in infrared spectroscopy. Appl. Spectrosc. **2013**, 67, 892–902.
34. Filzmoser, P.; Walczak, B. What can go wrong at the data normalization step for identification of biomarkers? J. Chromatogr. A **2014**, 1362, 194–205.
35. Mishra, P.; Biancolillo, A.; Roger, J.M.; Marini, F.; Rutledge, D.N. New data preprocessing trends based on ensemble of multiple preprocessing techniques. TRAC Trends Anal. Chem. **2020**, 132, 116045.
36. Lee, L.C.; Liong, C.Y.; Jemain, A.A. A contemporary review on Data Preprocessing (DP) practice strategy in ATR-FTIR spectrum. Chemom. Intell. Lab. Syst. **2017**, 163, 64–75.
37. Oliveri, P.; Malegori, C.; Simonetti, R.; Casale, M. The impact of signal pre-processing on the final interpretation of analytical outcomes - A tutorial. Anal. Chim. Acta **2019**, 1058, 9–17.
38. Martyna, A.; Menżyk, A.; Damin, A.; Michalska, A.; Martra, G.; Alladio, E.; Zadora, G. Improving discrimination of Raman spectra by optimising preprocessing strategies on the basis of the ability to refine the relationship between variance components. Chemom. Intell. Lab. Syst. **2020**, 202.
39. Bassan, P.; Kohler, A.; Martens, H.; Lee, J.; Jackson, E.; Lockyer, N.; Dumas, P.; Brown, M.; Clarke, N.; Gardner, P. RMieS-EMSC correction for infrared spectra of biological cells: Extension using full Mie theory and GPU computing. J. Biophotonics **2010**, 3, 609–620.
40. Solheim, J.; Gunko, E.; Petersen, D.; Großerüschkamp, F.; Gerwert, K.; Kohler, A. An open source code for Mie Extinction EMSC for infrared microscopy spectra of cells and tissues. J. Biophotonics **2019**, 10–16.
41. Wrobel, T.P.; Liberda, D.; Koziol, P.; Paluszkiewicz, C.; Kwiatek, W.M. Comparison of the new Mie Extinction Extended Multiplicative Scattering Correction and Resonant Mie Extended Multiplicative Scattering Correction in transmission infrared tissue image scattering correction. Infrared Phys. Technol. **2020**, 107, 103291.
42. Eilers, P.H.C. A Perfect Smoother. Anal. Chem. **2003**, 75, 3631–3636.
43. Dieterle, F.; Ross, A.; Senn, H. Probabilistic Quotient Normalization as Robust Method to Account for Dilution of Complex Biological Mixtures. Application in 1H NMR Metabonomics. Anal. Chem. **2006**, 78, 4281–4290.
44. Kohl, S.M.; Klein, M.S.; Hochrein, J.; Oefner, P.J.; Spang, R.; Gronwald, W. State-of-the art data normalization methods improve NMR-based metabolomic analysis. Metabolomics **2012**, 8, 146–160.
45. Bassan, P.; Kohler, A.; Martens, H.; Lee, J.; Byrne, H.J.; Dumas, P.; Gazi, E.; Brown, M.; Clarke, N.; Gardner, P. Resonant Mie scattering (RMieS) correction of infrared spectra from highly scattering biological samples. Analyst **2010**, 135, 268–277.
46. Kennard, R.W.; Stone, L.A. Computer Aided Design of Experiments. Technometrics **1969**, 11, 137–148.
47. Savitzky, A.; Golay, M.J.E. Smoothing and Differentiation of Data by Simplified Least Squares Procedure. Anal. Chem. **1964**, 36, 1627–1639.
48. Hen, X.I.S.; Iang, L.X.U.; Hubin, S.Y.E.; Ong, R.H.U.; In, L.I.N.G.J.; Anyang, H.X.U.; Iu, W.E.L. Automatic baseline correction method for the open-path Fourier transform infrared spectra by using simple iterative averaging. Opt. Express **2018**, 26, 609–614.
49. Eilers, P.H.C. Baseline Correction with Asymmetric Least Squares Smoothing. Anal. Chem. **2005**, 75, 3631–3636.
50. Bassan, B.P.; Kohler, A.; Byrne, H.J.; Martens, H.; Lee, J.; Bassan, P.; Kohler, A.; Martens, H.; Lee, J.; Byrne, H.J.; et al. Resonant Mie Scattering (RMieS) EMSC correction guide. Analyst **2010**.
51. Konevskikh, T.; Lukacs, R.; Kohler, A. An improved algorithm for fast resonant Mie scatter correction of infrared spectra of cells and tissues. J. Biophotonics **2017**, 1–10.
52. Bylesjo, M.; Cloarec, O.; Rantalainen, M. Normalization and Closure. In Comprehensive Chemometrics; Brown, S., Tauler, R., Walczak, B., Eds.; Elsevier: Amsterdam, The Netherlands, 2009; Volume 2, pp. 109–127.
53. Reis, M.S.; Saraiva, P.M.; Bakshi, B.R. Denoising and Signal-to-Noise Ratio Enhancement: Wavelet Transform and Fourier Transform. In Comprehensive Chemometrics; Brown, S., Tauler, R., Walczak, B., Eds.; Elsevier: Amsterdam, The Netherlands, 2009; Volume 2, pp. 25–55. ISBN 9780444527011.
54. Wrobel, T.P.; Mukherjee, P.; Bhargava, R. Rapid visualization of macromolecular orientation by discrete frequency mid-infrared spectroscopic imaging. Analyst **2017**, 142, 75–79.

**Figure 1.** A scheme of the preprocessing steps. Five prostate cell lines were imaged with FT-IR, then white noise was added to the original spectra. Raw data and data with added noise were preprocessed in the following order: denoising → baseline correction → normalization. Each method from one preprocessing step was combined with each method from the remaining two preprocessing steps. Taking into account the above and the number of parameters adjusted for each method, the number of combinations (data sets) was equal to 2835. All of these combinations were then used to create a classifier discriminating cell lines and a regression model of class assignments, giving more detail about the relative importance of different preprocessing factors and parameters.

**Figure 2.** Principal component analysis exploration of the original data structure after application of the three preprocessing steps. Each point corresponds to an individual spectrum from the original dataset to which a unique combination of the three steps was applied. Panels a, b, and c present the same PC projection but are colored according to a single preprocessing type: (**a**) denoising, (**b**) baseline correction, and (**c**) normalization. For clarity, the set of spectra for which the DER baseline correction method was combined with other preprocessing steps is marked with a circle.

**Figure 3.** Results of PLS-DA classification. Accuracy values for each combination of methods (marked with dots, with yellow corresponding to a high number of models and blue to a low number) calculated for up to 30 LVs for (**a**) original data and (**b**) data with added noise. The red circle indicates the best combinations of methods, which gave very high internal accuracy with the smallest reasonable number of LVs. Comparison of internal and external validation accuracy values (for the best LVs chosen based on internal validation) for (**c**) original data and (**d**) data with added noise. The green circle indicates the worst combination of methods, which gave very high internal validation and low external validation accuracy values. (**e**) Number of combinations giving accuracy higher than 0.8 for internal and external validation (marked with a red frame in the right figure panel) for original and noise-added data, divided into baseline/normalization categories.

**Figure 4.** Comparison of internal and external accuracy values for the best LVs for (**a**) original data and (**b**) noise-added data, for the baseline correction methods DER, RMie-EMSC, and ME-EMSC.

**Figure 5.** Comparison of RMSECV and RMSEP values for (**a**) original data and (**b**) noise-added data. Each dot on the plot represents a value for one combination of preprocessing methods. (**c**) Histogram of the 10% of all combinations giving the lowest RMSECV and RMSEP for the original and noise-added data.

**Figure 6.** Comparison of PLSR mean RMSECV and RMSEP errors calculated for all methods at each preprocessing step (the model with the optimal number of LVs allowed by CV was chosen) for (**a**) original data and (**b**) data with added noise. The error bars mark the standard deviation of all models that used a given method (from the current preprocessing step) in combination with other methods (from the other preprocessing steps).

**Table 1.** The best combinations of methods, which gave very high internal accuracy with the smallest reasonable number of LVs (marked with the red circle in the left panel in Figure 3). Methods giving the best external validation values for original and noise-added data were marked in green.

| Denoising | Adjusted Parameter | Value | Baseline | Adjusted Parameter | Value | Normalization | Internal Accuracy | External Accuracy |
|---|---|---|---|---|---|---|---|---|
| **Original Data** | | | | | | | | |
| Fourier | frame | 100 | Second derivative | Poly, frame | 2, 27 | CONSTANT | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 2, 29 | CONSTANT | 0.94 | 0.79 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 3, 27 | CONSTANT | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 3, 29 | CONSTANT | 0.94 | 0.79 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 2, 23 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 2, 25 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 2, 27 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 2, 29 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 3, 23 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 3, 25 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 3, 27 | TSN | 0.94 | 0.86 |
| Fourier | frame | 100 | Second derivative | Poly, frame | 3, 29 | TSN | 0.94 | 0.86 |
| **Noise Added Data** | | | | | | | | |
| Fourier | frame | 140 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.86 |
| Fourier | frame | 220 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| Eilers | λ | 6 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.86 |
| SavitzkyG | Poly, frame | 2, 15 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 2, 17 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.86 |
| SavitzkyG | Poly, frame | 2, 19 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 2, 21 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 2, 23 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 3, 15 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 3, 17 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.86 |
| SavitzkyG | Poly, frame | 3, 19 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 3, 21 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
| SavitzkyG | Poly, frame | 3, 23 | ALS | λ, p | 6, 0.1 | CONSTANT | 0.91 | 0.93 |
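The noise-added rows of Table 1 can be tallied with a few lines of code, for instance to see which denoising method appears most often and how far external accuracy falls below internal accuracy. The sketch below is only an illustration: the values are copied from Table 1, while the variable names and the use of the internal-minus-external gap as a simple overfitting indicator are assumptions made for this example.

```python
from collections import Counter

# Noise-added rows of Table 1: (denoising method, external accuracy).
# Every row uses ALS baseline correction and CONSTANT normalization,
# and has an internal accuracy of 0.91.
NOISE_ADDED_ROWS = [
    ("Fourier", 0.86), ("Fourier", 0.93), ("Eilers", 0.86),
    ("SavitzkyG", 0.93), ("SavitzkyG", 0.86), ("SavitzkyG", 0.93),
    ("SavitzkyG", 0.93), ("SavitzkyG", 0.93), ("SavitzkyG", 0.93),
    ("SavitzkyG", 0.86), ("SavitzkyG", 0.93), ("SavitzkyG", 0.93),
    ("SavitzkyG", 0.93),
]
INTERNAL_ACCURACY = 0.91

# How often each denoising method occurs among the listed combinations.
method_counts = Counter(method for method, _ in NOISE_ADDED_ROWS)

# Internal minus external accuracy; a large positive gap hints at overfitting.
gaps = [round(INTERNAL_ACCURACY - ext, 2) for _, ext in NOISE_ADDED_ROWS]

# Rows where the independent test set is predicted at least as well as in CV.
n_external_ge_internal = sum(ext >= INTERNAL_ACCURACY for _, ext in NOISE_ADDED_ROWS)
```

For these rows the largest gap is small, in contrast to the overfitted combinations discussed in Section 3.2, where internal accuracy exceeded 0.8 while external accuracy fell below 0.2.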

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Liberda, D.; Pięta, E.; Pogoda, K.; Piergies, N.; Roman, M.; Koziol, P.; Wrobel, T.P.; Paluszkiewicz, C.; Kwiatek, W.M.
The Impact of Preprocessing Methods for a Successful Prostate Cell Lines Discrimination Using Partial Least Squares Regression and Discriminant Analysis Based on Fourier Transform Infrared Imaging. *Cells* **2021**, *10*, 953.
https://doi.org/10.3390/cells10040953
