Psoriatic Arthritis (PsA) Clinical Lipidomics Dataset with Hidden Laboratory Workflow Artifacts: A Benchmark Dataset for Data Processing Quality Control in Lipidomics
Abstract
1. Summary
2. Data Description
2.1. General Properties of the Dataset
2.2. Description of the Single Data Files
2.2.1. “PsA_lipids_BC.csv”
- Propensity score matching (PSM) [10] for age, using logistic regression with 1:1 nearest-neighbor matching within the region of common support (psmatch2, Stata/IC 15.1 for Linux, StataCorp, College Station, TX, USA), reduced confounding between PsA patients (n = 81) and controls (n = 26); post-matching balance of age, sex, and BMI was confirmed by ANOVA and chi-square tests.
- Box–Cox transformation to normalize distributions [11], implemented via the “ABCstats” R package (https://github.com/Waddlessss/ABCstats, accessed on 12 December 2025) [12].
- The original lipidomics matrix was complete, with no missing values. Outliers were removed using the boxplot method, which targets extreme values, defined as values above the third quartile plus three times the interquartile range (IQR) or below the first quartile minus three times the IQR. This method is implemented in the R package “rstatix” (https://CRAN.R-project.org/package=rstatix, accessed on 12 December 2025) [13]. This step removed 0.017% of the lipidomics variable values and thereby created missingness.
- Imputation of missing and outlier values (14 values in total) was performed using multivariate random forest imputation from the R package “miceRanger” (https://cran.r-project.org/package=miceRanger, accessed on 12 December 2025) [14] with parameters m = 1 (one imputation), maxiter = 1 (one iteration), valueSelector = “meanMatch”, meanMatchCandidates = max(5, 1% of samples), and returnModels = TRUE. Additionally, 60 highly correlated lipid variables (r > 0.9) were removed during preprocessing.
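The extreme-value rule described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function names and the toy numbers are hypothetical, and the quartiles use the "inclusive" method of the standard library, which is assumed here to correspond to the default quantile definition in R.

```python
import math
import statistics


def extreme_value_mask(values, k=3.0):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN entries are never flagged.

    Assumption: 'inclusive' quartiles mirror R's default quantile type.
    """
    observed = [v for v in values if not math.isnan(v)]
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(not math.isnan(v)) and (v < lower or v > upper) for v in values]


def mask_extremes(values, k=3.0):
    """Replace flagged values with NaN so that they can be imputed afterwards."""
    flags = extreme_value_mask(values, k)
    return [math.nan if flagged else v for v, flagged in zip(values, flags)]


# Toy values for one hypothetical lipid variable (not taken from the dataset).
toy = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 9.5]
cleaned = mask_extremes(toy)
n_removed = sum(math.isnan(v) for v in cleaned)
print(f"removed {n_removed} of {len(toy)} values")
```

Applied column by column to the transformed matrix, the share of flagged cells gives the fraction of values that would subsequently need imputation.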
2.2.2. “PsA_lipids_orig_values.csv”
2.2.3. “PsA_classes.csv”
- “PsA”: Original clinical classification distinguishing psoriatic arthritis patients from controls.
- “ESOM”: An unsupervised classification derived from projecting the transformed dataset onto a self-organizing feature map of artificial neurons (ESOM/U-matrix) [17], revealing a distinct subset (class “2”) associated with laboratory workflow differences.
- “gender”: Biological sex of subjects (“Male” or “Female”).
- Columns 5 through 10 document analytical batch identifiers for each assay type: “BatchID_Endocannabinoids”, “BatchID_Ceramides”, “BatchID_LPA”, “BatchID_Lipidomics”, “BatchID_Oxylipins”, and “BatchID_Sphingolipids”. Most samples (PsA patients) were analyzed in five batches (BatchID prefix “R_1” through “R_5”), while 16 control samples were processed in three batches (BatchID prefix “C_1” through “C_3”) approximately one month earlier (June vs. July 2021) (Figure 2). Both disease groups appear intermingled across multiple analytical runs within each batch type. These temporal batch differences among controls were subsequently identified by unsupervised ESOM analysis as the primary driver of control sample bifurcation (forthcoming analysis paper). “BatchID_Lipidomics” contains additional lipids not included in other batches.
- Columns 11 and 12 provide processing metadata specific to control subjects: “Sampling_date” (June–July 2021) and “Sampling_weekday” (day of the week on which the control sample was collected). The complete batch processing strategy is visualized in Figure 2.
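As a hedged illustration of how the class and batch columns listed above might be inspected together, the following Python sketch cross-tabulates the “ESOM” class against the batch prefix. The column names come from the description above; the helper names, the toy rows, and the assumption that each batch identifier begins with a prefix such as “R_1” or “C_1” followed by an underscore-separated suffix are illustrative only.

```python
import csv
import io
from collections import Counter


def batch_prefix(batch_id):
    """Reduce a batch identifier to its leading two tokens, e.g. 'R_1' or 'C_2'.

    Assumption: identifiers are underscore-separated and start with the prefix.
    """
    return "_".join(batch_id.split("_")[:2])


def crosstab(rows, row_key, col_key, col_transform=lambda x: x):
    """Count how often each pair of (row_key value, transformed col_key value) occurs."""
    counts = Counter()
    for row in rows:
        counts[(row[row_key], col_transform(row[col_key]))] += 1
    return counts


# In-memory stand-in for "PsA_classes.csv"; all values are invented.
toy_csv = io.StringIO(
    "PsA,ESOM,BatchID_Lipidomics\n"
    "1,1,R_1_a\n"
    "1,1,R_2_a\n"
    "0,2,C_1_a\n"
    "0,2,C_1_b\n"
    "0,1,R_3_a\n"
)
rows = list(csv.DictReader(toy_csv))
table = crosstab(rows, "ESOM", "BatchID_Lipidomics", batch_prefix)
for (esom, batch), n in sorted(table.items()):
    print(f"ESOM={esom} batch={batch} n={n}")
```

With the real file, the `io.StringIO` object would be replaced by `open("PsA_classes.csv", newline="")`; the actual coding of the class columns should be checked against the file before interpreting the counts.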
2.2.4. “readme.csv”
- “variable_name”: The specific lipid variable (analyte) name, consistent with how it appears in the dataset’s raw and processed files.
- “unit”: The unit of measurement for the variable. Lipid screening results are expressed in arbitrary units, calculated as the ratio of the analyte's chromatographic peak area to the chromatographic peak area of the corresponding lipid-class-specific internal standard.
- “class_name”: The lipid class to which the variable belongs (e.g., “Phosphatidylcholines,” “Triacylglycerols”).
- “class_code”: Abbreviations used to identify the corresponding lipid class (e.g., “PC” for Phosphatidylcholines, “TG” for Triacylglycerols).
- “analytical_method_category”: The analytical method applied to quantify the lipid variable, either “Lipid screening” (broad profiling) or “Lipid targeted” (target-specific assays).
- “LLOQ”: Lower limit of quantification (NA for screening lipids; numeric values such as “3” ng/mL or “25” pg/mL for targeted lipids).
- “ULOQ”: Upper limit of quantification (NA for screening lipids; numeric values such as “250” ng/mL or “6250” pg/mL for targeted lipids).
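The quantification limits described above lend themselves to a simple range check. The sketch below is one possible way to read the “LLOQ” and “ULOQ” columns and classify a measured value; it assumes that the limit cells hold either a plain number or “NA”, and the variable names in the toy table are hypothetical placeholders rather than analytes from the dataset.

```python
import csv
import io


def parse_limit(text):
    """Convert an LLOQ/ULOQ cell to float; 'NA' or an empty cell becomes None."""
    text = text.strip()
    return None if text in ("", "NA") else float(text)


def load_limits(handle):
    """Map each variable_name to its (LLOQ, ULOQ) pair from a readme-style CSV."""
    return {
        row["variable_name"]: (parse_limit(row["LLOQ"]), parse_limit(row["ULOQ"]))
        for row in csv.DictReader(handle)
    }


def quantification_status(value, lloq, uloq):
    """Classify a value on the original scale relative to its quantification range."""
    if lloq is None and uloq is None:
        return "no limits"
    if lloq is not None and value < lloq:
        return "below LLOQ"
    if uloq is not None and value > uloq:
        return "above ULOQ"
    return "in range"


# In-memory stand-in for "readme.csv"; both rows are invented.
toy_readme = io.StringIO(
    "variable_name,unit,class_name,class_code,analytical_method_category,LLOQ,ULOQ\n"
    "targeted_example,ng/mL,Ceramides,Cer,Lipid targeted,3,250\n"
    "screening_example,arbitrary units,Phosphatidylcholines,PC,Lipid screening,NA,NA\n"
)
limits = load_limits(toy_readme)
print(quantification_status(1.5, *limits["targeted_example"]))
print(quantification_status(0.4, *limits["screening_example"]))
```

Such a check is only meaningful for values on the original measurement scale, since transformed values are no longer expressed in the units of the limits.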
3. Clustering Analysis as a Demonstration of Dataset Complexity
4. Methods
4.1. Patients and Study Design
4.2. Blood Sample Collection
4.3. Lipidomics Analysis
4.4. Batch Structure and Quality Control Considerations
5. Recommended Use
- Benchmarking batch effect detection algorithms: The dataset contains documented temporal variation across collection periods and analytical batches, enabling researchers to test the sensitivity of their quality control frameworks to real-world laboratory processing patterns.
- Evaluating dimensionality reduction and exploratory analysis methods: Researchers can assess how different analytical approaches reveal or obscure systematic technical variation relative to biological signal in complex lipidomics data.
- Testing classification and variable selection strategies: The dataset structure supports comparative evaluation of how supervised and unsupervised methods perform when biological and technical sources of variation coexist at different magnitudes.
- Developing robust preprocessing pipelines: The inclusion of both transformed and original-scale data, along with comprehensive metadata, allows methodologists to test normalization, scaling, and imputation strategies under realistic conditions where quality issues are subtle rather than obvious.
- Training and validation of integrative analytical frameworks: The dataset’s complexity and documented characteristics make it suitable for testing multi-method approaches to identifying samples or variables that may be affected by uncontrolled technical factors.
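As one illustrative starting point for the batch-effect benchmarking use case, the sketch below ranks variables by the proportion of their variance that is explained by batch membership (eta-squared from a one-way decomposition). This is a generic screening statistic chosen for the example, not a method prescribed by the dataset's authors; the function names, variable names, and toy numbers are hypothetical.

```python
from collections import defaultdict


def eta_squared(values, groups):
    """Proportion of total variance explained by group membership (one-way layout)."""
    grand_mean = sum(values) / len(values)
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in by_group.values()
    )
    return ss_between / ss_total if ss_total > 0 else 0.0


def rank_by_batch_effect(data, batches):
    """Return (variable, eta-squared) pairs sorted from strongest to weakest batch effect.

    data maps each variable name to a list of values aligned with `batches`.
    """
    scores = {name: eta_squared(vals, batches) for name, vals in data.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: one variable shifted between batches, one unaffected (invented values).
batches = ["C_1", "C_1", "C_1", "R_1", "R_1", "R_1"]
data = {
    "var_shifted": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],
    "var_stable": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
}
ranked = rank_by_batch_effect(data, batches)
for name, score in ranked:
    print(f"{name}: eta^2 = {score:.3f}")
```

Because batch and disease group are partly confounded in this dataset, a high score indicates only that a variable differs between batches; it does not by itself separate technical from biological variation.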
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Sens, A.; Rischke, S.; Hahnefeld, L.; Dorochow, E.; Schäfer, S.M.G.; Thomas, D.; Köhm, M.; Geisslinger, G.; Behrens, F.; Gurke, R. Pre-analytical sample handling standardization for reliable measurement of metabolites and lipids in LC-MS-based clinical research. J. Mass. Spectrom. Adv. Clin. Lab. 2023, 28, 35–46.
- Rischke, S.; Hahnefeld, L.; Burla, B.; Behrens, F.; Gurke, R.; Garrett, T.J. Small molecule biomarker discovery: Proposed workflow for LC-MS-based clinical research projects. J. Mass. Spectrom. Adv. Clin. Lab. 2023, 28, 47–55.
- Surace, A.E.A.; Hedrich, C.M. The Role of Epigenetics in Autoimmune/Inflammatory Disease. Front. Immunol. 2019, 10, 1525.
- Alinaghi, F.; Calov, M.; Kristensen, L.E.; Gladman, D.D.; Coates, L.C.; Jullien, D.; Gottlieb, A.B.; Gisondi, P.; Wu, J.J.; Thyssen, J.P.; et al. Prevalence of psoriatic arthritis in patients with psoriasis: A systematic review and meta-analysis of observational and clinical studies. J. Am. Acad. Dermatol. 2019, 80, 251–265.e219.
- McGonagle, D.; McDermott, M.F. A proposed classification of the immunological diseases. PLoS Med. 2006, 3, e297.
- Scher, J.U.; Ogdie, A.; Merola, J.F.; Ritchlin, C. Preventing psoriatic arthritis: Focusing on patients with psoriasis at increased risk of transition. Nat. Rev. Rheumatol. 2019, 15, 153–166.
- Ogdie, A.; Schwartzman, S.; Husni, M.E. Recognizing and managing comorbidities in psoriatic arthritis. Curr. Opin. Rheumatol. 2015, 27, 118–126.
- Lembke, S.; Macfarlane, G.J.; Jones, G.T. The worldwide prevalence of psoriatic arthritis—A systematic review and meta-analysis. Rheumatology 2024, 63, 3211–3220.
- Kugler, S.; Hahnefeld, L.; Kloka, J.A.; Ginzel, S.; Nürenberg-Goloub, E.; Zinn, S.; Vehreschild, M.J.; Zacharowski, K.; Lindau, S.; Ullrich, E.; et al. Short-term predictor for COVID-19 severity from a longitudinal multi-omics study for practical application in intensive care units. Talanta 2024, 268, 125295.
- Rosenbaum, P.R.; Rubin, D.B. The central role of the propensity score in observational studies for causal effects. Biometrika 1983, 70, 41–55.
- Box, G.E.; Cox, D.R. An analysis of transformations. J. R. Stat. Soc. Ser. B 1964, 26, 211–252.
- Yu, H.; Sang, P.; Huan, T. Adaptive Box-Cox Transformation: A Highly Flexible Feature-Specific Data Transformation to Improve Metabolomic Data Normality for Better Statistical Analysis. Anal. Chem. 2022, 94, 8267–8276.
- Kassambara, A. rstatix: Pipe-Friendly Framework for Basic Statistical Tests. R package version 0.7.3. CRAN Contrib. Packages 2025.
- Wilson, S. miceRanger: Multiple Imputation by Chained Equations with Random Forests. R package version 1.5.0. CRAN Contrib. Packages 2021.
- Conroy, M.J.; Andrews, R.M.; Andrews, S.; Cockayne, L.; Dennis, E.A.; Fahy, E.; Gaud, C.; Griffiths, W.J.; Jukes, G.; Kolchin, M.; et al. LIPID MAPS: Update to databases and tools for the lipidomics community. Nucleic Acids Res. 2024, 52, D1677–D1682.
- Liebisch, G.; Fahy, E.; Aoki, J.; Dennis, E.A.; Durand, T.; Ejsing, C.S.; Fedorova, M.; Feussner, I.; Griffiths, W.J.; Köfeler, H.; et al. Update on LIPID MAPS classification, nomenclature, and shorthand notation for MS-derived lipid structures. J. Lipid Res. 2020, 61, 1539–1555.
- Ultsch, A.; Lötsch, J. Machine-learned cluster identification in high-dimensional data. J. Biomed. Inform. 2017, 66, 95–104.
- Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 498–520.
- Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572.
- Wold, S.; Ruhe, A.; Wold, H.; Dunn, W.J., III. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM J. Sci. Stat. Comput. 1984, 5, 735–743.
- Rischke, S.; Poor, S.M.; Gurke, R.; Hahnefeld, L.; Köhm, M.; Ultsch, A.; Geisslinger, G.; Behrens, F.; Lötsch, J. Machine learning identifies right index finger tenderness as key signal of DAS28-CRP based psoriatic arthritis activity. Sci. Rep. 2023, 13, 22710.
- Hahnefeld, L.; Gurke, R.; Thomas, D.; Schreiber, Y.; Schäfer, S.M.G.; Trautmann, S.; Snodgrass, I.F.; Kratz, D.; Geisslinger, G.; Ferreirós, N. Implementation of lipidomics in clinical routine: Can fluoride/citrate blood sampling tubes improve preanalytical stability? Talanta 2020, 209, 120593.
- Kopczynski, D.; Hoffmann, N.; Peng, B.; Ahrends, R. Goslin: A Grammar of Succinct Lipid Nomenclature. Anal. Chem. 2020, 92, 10957–10960.



| Lipid Class | Abbreviation | Number of Variables | Analytical Method Category |
|---|---|---|---|
| Acylcarnitines | CAR | 6 | Lipid screening |
| Ceramides | Cer | 9 + 11 | Lipid targeted/Lipid screening |
| Diacylglycerols | DG | 7 | Lipid screening |
| Endocannabinoids | AEA, LEA, etc. | 9 | Lipid targeted |
| Free fatty acids | FA | 5 | Lipid screening |
| Gangliosides | Hex2NeuAcCer | 1 | Lipid screening |
| Hexosylceramides | HexCer | 8 + 7 | Lipid targeted/Lipid screening |
| Hormones and Derivatives | Thy | 1 | Lipid screening |
| Lysophosphatidic Acid | LPA | 4 | Lipid targeted |
| Lysophosphatidylethanolamine | LPE | 16 | Lipid screening |
| Lysophosphatidylcholine | LPC | 27 | Lipid screening |
| Phosphatidylcholine | PC | 59 | Lipid screening |
| Phosphatidylethanolamine | PE | 29 | Lipid screening |
| Sphingomyelins | SM | 27 | Lipid screening |
| Sphingosine-Based Phosphonates | SPBP | 3 | Lipid targeted |
| Steroids | ST | 2 | Lipid screening |
| Sterol Esters | SE | 9 | Lipid screening |
| Triacylglycerols | TG | 29 | Lipid screening |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Lötsch, J.; Gurke, R.; Hahnefeld, L.; Behrens, F.; Geisslinger, G. Psoriatic Arthritis (PsA) Clinical Lipidomics Dataset with Hidden Laboratory Workflow Artifacts: A Benchmark Dataset for Data Processing Quality Control in Lipidomics. Data 2026, 11, 32. https://doi.org/10.3390/data11020032

