Article

Classification Methods for the Serological Status Based on Mixtures of Skew-Normal and Skew-t Distributions

by Tiago Dias-Domingues 1,*,†, Helena Mouriño 1,† and Nuno Sepúlveda 2,†
1 Centro de Estatística e Aplicações, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
2 Faculty of Mathematics and Information Science, Warsaw University of Technology, 00-662 Warsaw, Poland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2024, 12(2), 217; https://doi.org/10.3390/math12020217
Submission received: 25 November 2023 / Revised: 31 December 2023 / Accepted: 6 January 2024 / Published: 9 January 2024
(This article belongs to the Special Issue Advances in Biostatistics and Applications)

Abstract

Gaussian mixture models are widely employed in serological data analysis to discern between seropositive and seronegative individuals. However, serological populations often exhibit significant skewness, making symmetric distributions like Normal or Student-t distributions unreliable. In this study, we propose finite mixture models based on Skew-Normal and Skew-t distributions for serological data analysis. Although these distributions are well established in the literature, their application to serological data needs further exploration, with emphasis on the determination of the threshold that distinguishes seronegative from seropositive populations. Our previous work proposed three methods to estimate the cutoff point when the true serological status is unknown. This paper aims to compare the three cutoff techniques in terms of their reliability to estimate the true threshold value. To attain this goal, we conducted a Monte Carlo simulation study. The proposed cutoff points were also applied to an antibody dataset against four SARS-CoV-2 virus antigens where the true serological status is known. For this real dataset, we also compared the performance of our estimated cutoff points with the ROC curve method, commonly used in situations where the true serological status is known.

1. Introduction

Mixture models allow one to describe the distribution of a random variable as a mixture of various distributions. For a long time, mixture models have captured the attention of researchers primarily due to their flexibility in describing data from a non-homogeneous population, as they can reveal latent subgroups within the overall population. Hence, the heterogeneity comes from the situation where one knows (or suspects) that the observations arise from $G$ ($G \geq 2$) distinct subpopulations, but no mechanism accurately distinguishes between these subpopulations [1]. This makes finite mixture models a very important tool to handle special features of the density functions under consideration, such as multimodality, skewness, and heavy tails [1].
Nowadays, finite mixture models have experienced significant breakthroughs and are employed in various domains of science, from medicine and biology to social and actuarial sciences, among others. The widespread use of this approach can be attributed to the versatility of finite mixture models, allowing them to effectively tackle diverse challenges in statistical modeling, including classification, clustering, density estimation, and pattern recognition problems [2]. To provide one relevant example, model-based clustering relies on mixture models, and it is considered a classical and powerful approach to address the unsupervised learning problem of accurately grouping observations into clusters [3]. One of the most recent improvements in this field was made by Melnykov and Wang [4], who addressed the matter of the lack of parsimony when studying clustering analysis in the mixture modeling framework. In the last few years, mixture models have also been extended to network data through stochastic blockmodels, opening a new avenue of extensions and novel models for future research (see [5]).
Serology is a branch of medicine that classically studies proteins, encompassing mainly antibodies found in blood and secretions such as saliva [6]. In this work, we focus on serological data, mainly from serological tests: blood tests that evaluate the presence of a specific antibody in the blood against a particular pathogen. Therefore, serological data can be described as a mixture of serological status distributions: seronegative (antibody-negative) or seropositive (antibody-positive) populations.
An individual is seropositive to a pathogen if they have detectable antibodies specific to that pathogen due to a previous infection or vaccination. However, individuals who have never been infected (or vaccinated) with the pathogen under consideration might have non-zero antibody responses due to cross-reactivity with other pathogens or background noise [7]. The detection of antibodies in the serum samples is classically conducted via enzyme-linked immunosorbent assays (ELISA), where the resulting data are light intensities, also called optical density, which reflects the underlying antibody concentration in the samples [8]. The analysis of serological data proceeds by dichotomizing the amount of antibodies in an individual’s serum using a predefined cutoff point in the antibody probability density function. This procedure allows the classification of individuals into the seronegative (with antibody levels below the cutoff point) or seropositive (with antibody levels above the cutoff point) category [7].
Different criteria for seropositivity determination (which means choosing distinct cutoff points) have a direct impact on the sensitivity and specificity of the respective serological classification [9]. In addition, it might also impact the estimation of the seroprevalence [10] and the following (epidemiological) decision that can be taken when facing a given estimate of this epidemiological parameter. This aspect means that when determining the cutoff point for a serological test, one should consider the benefit of the test, the economic and social consequences of serological misclassification, and the prevalence of the disease in the population. Unfortunately, it turns out that these aspects are often ignored in practice [11].
Considering the serological assays, one of the traditional methods to establish the cutoff point is to consider the logarithmic transformation of the antibody concentration of a known seronegative population and proceed to calculate the mean plus two or three standard deviations [11,12,13,14]. This method is adequate when the antibody distribution of the seronegative population is normally distributed [14]. However, previous studies of different serological data [8] showed evidence against the normality assumption for the antibody levels associated with a putative seronegative population. It has been demonstrated that the antibody concentration of a known seronegative (or seropositive) population is highly skewed [8,15,16], which invalidates the use of the Normal distribution in this context.
Recent literature has shown that mixtures of skewed distributions, such as Skew-Normal or Skew-t distributions, accurately model serological data [8,16], since these distributions can accommodate the skewness of the underlying distribution of the antibody concentration.
Our previous work used three methods based on mixtures of Skew-Normal or Skew-t distributions to empirically determine the seropositivity cutoff points [8]. In this paper, we will analyze the performance of the above methods through simulation studies. Additionally, we will apply these methods to the freely available serological data concerning the SARS-CoV-2 virus [7]. In this dataset, there is information about the actual infection status of the individuals, which takes a relevant role in evaluating the accuracy of the methods developed by [8] to establish the cutoff points. Therefore, we will compare the performance of the above techniques with ROC curve-based methods, which are commonly used to determine the cutoff point for defining seropositivity when the true infection (or disease) status is known [17,18,19,20,21,22,23].

2. Modeling Antibody Data: Skew-Normal and Skew-t Distributions

Serological data can be viewed as arising from two or more latent populations; each population is assumed to represent different levels of exposure to a given antigen. For simplicity, individuals who were never exposed or exposed a long time ago to an infectious agent are considered seronegative. In contrast, individuals exposed to the same infectious agent are considered seropositive. In this scenario, the antibody distribution can be described by a mixture of two or more probability distributions [16]. However, the true serological state of an individual is unknown; therefore, it needs to be estimated.
In many serological studies, it is usual to assume a Normal distribution as the basis of the mixture models. However, the antibody distribution is not constant over time: antibody concentration decreases after infection [7]. This fact makes the distribution of the seropositive population skewed to the left [24]. To accommodate the possible skewness in the seropositive population, we use the scale mixtures of Skew-Normal (SMSN) class of distributions, which includes the Skew-Normal and the Skew-t distributions, the focus of our study. A brief description of these distributions is presented below.

2.1. Skew-Normal Distribution

Let $W \sim SN(\mu, \sigma^2, \alpha)$ be a random variable (r.v.) with a Skew-Normal distribution, where the parameters $\mu$, $\sigma^2$, and $\alpha$ can be seen as the location, scale, and shape parameters, respectively. Then the probability density function (pdf) is given by
$$f_W(w) = 2\,\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{w-\mu}{\sigma}\right)^2} \times \int_{-\infty}^{\alpha\left(\frac{w-\mu}{\sigma}\right)} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, dx = \frac{2}{\sigma}\,\phi\!\left(\frac{w-\mu}{\sigma}\right) \Phi\!\left(\alpha\left(\frac{w-\mu}{\sigma}\right)\right), \tag{1}$$
where $w, \mu \in \mathbb{R}$, $\sigma \in \mathbb{R}^+$; $\phi(\cdot)$ and $\Phi(\cdot)$ are the pdf and the cumulative distribution function of the standard Normal distribution, respectively [8,25,26]. The Skew-Normal distribution is part of a family of distributions called the Scale Mixtures of Skew-Normal distributions (SMSN), of which the Skew-t distribution is also a particular case [8].
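The density above can be evaluated with nothing more than the error function. The following is a minimal sketch in Python (the article itself relies on the R function dsn from the package sn); the function names and the parameter values are illustrative assumptions, not taken from the article.

```python
import math

def norm_pdf(x):
    """Standard Normal density phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    """Standard Normal distribution function Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def skew_normal_pdf(w, mu=0.0, sigma=1.0, alpha=0.0):
    """Density of SN(mu, sigma^2, alpha): (2/sigma) * phi(u) * Phi(alpha * u)."""
    u = (w - mu) / sigma
    return 2.0 / sigma * norm_pdf(u) * norm_cdf(alpha * u)

def integrate(f, a, b, steps=20000):
    """Composite trapezoidal rule, used here only as a sanity check."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

# Hypothetical parameter values: the density should integrate to about 1.
area = integrate(lambda w: skew_normal_pdf(w, mu=1.0, sigma=2.0, alpha=-3.0), -30.0, 30.0)
```

With alpha = 0 the second factor equals 1/2, so the function returns the ordinary Normal density, which is a quick way to check an implementation.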

2.2. Skew-t Distribution

A random variable W is said to have a Skew-t distribution, $W \sim ST(\mu, \sigma^2, \alpha, v)$, if the pdf is given by
$$f_W(w) = 2\, f_T(w; \mu, \sigma^2, v+1)\; F_T\!\left(A(w)\sqrt{\frac{v+1}{d(w)+v}};\; v+1\right), \tag{2}$$
where $w, \mu, v \in \mathbb{R}$, $\sigma^2 \in \mathbb{R}^+$; $f_T(\cdot; \mu, \sigma^2, v+1)$ and $F_T(\cdot; \mu, \sigma^2, v+1)$ represent the pdf and the cumulative distribution function of the generalized Student's t distribution with $v+1$ degrees of freedom, $A(w) = \alpha\,\frac{w-\mu}{\sigma}$ and $d(w) = \left(\frac{w-\mu}{\sigma}\right)^2$ [8,25,26].
Considering the skewness parameter $\alpha$: when $\alpha = 0$, the $SN(\alpha)$ reduces to the $N(0,1)$ and the $ST(\alpha)$ reduces to the symmetric Student's t distribution. When $\alpha \to +\infty$, the distribution under analysis shows a positive skew, whereas when $\alpha \to -\infty$ it shows a negative skew (Figure 1). This aspect of heavy tails to the left or right is important as it impacts the estimation of the cutoff point for defining serological subpopulations.

3. Finite Mixture Models to Describe Serological Data: Estimation of the Parameters

Finite mixture models are very flexible models used to model data from heterogeneous populations, allowing for the capture of population characteristics such as multimodality, skewness, and kurtosis [1]. The rationale behind these types of models is that, given a population, it is possible to consider subpopulations in a finite number, in different proportions, with each subpopulation characterized by a probability density function, with the respective parameter space [27].
In general, let $G_1, \ldots, G_g$ be a partition of a superpopulation $G$ (sample space), and let $\pi_1, \ldots, \pi_g$ be the probabilities of sampling an individual belonging to each latent population (with the usual restrictions $\sum_{k=1}^{g} \pi_k = 1$ and $0 \leq \pi_k \leq 1$). A random variable Z is a finite mixture of independent random variables $Z_1, Z_2, \ldots, Z_g$ if the probability density function (pdf) of Z is given by
$$f(z) = \sum_{k=1}^{g} \pi_k\, f_{Z_k}(z; \theta_k), \tag{3}$$
where $f_{Z_k}(z; \theta_k)$ is the mixing probability density function (pdf) of $Z_k$ associated with the k-th latent population and parameterized by the vector $\theta_k$ [27].
In the specific case of serological data, let Z be the random variable that represents the antibody level, and let $\pi_1$ and $\pi_2 = 1 - \pi_1$ be the probabilities of sampling a seronegative and a seropositive individual, respectively. Then, the marginal probability density function of Z is given by
$$f(z; \theta) = \sum_{k=1}^{2} \pi_k\, f_k(z \mid Y = k; \theta_k), \tag{4}$$
where $f_k(\cdot \mid Y = k; \theta_k)$, or simply $f_k(\cdot; \theta_k)$, is the mixing probability density function of Z associated with the kth latent population; the latent (unobservable) random variable $Y \in \{1, 2\}$ represents the mixture component for Z, and thus $\pi_k = P(Y = k)$, $k = 1, 2$; $\theta$ is the vector of all unknown parameters of the mixture model, i.e., $\theta = (\pi_1, \pi_2, \theta_1, \theta_2)^T$. In our application, $f_k(\cdot; \theta_k)$ is given by the Skew-Normal or the Skew-t distributions.
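As an illustration of the two-component mixture density, the sketch below evaluates a weighted sum of two Skew-Normal components in Python. The weights and component parameters are hypothetical values chosen only for the example, and the function names are assumptions rather than part of any package used in the article.

```python
import math

def skew_normal_pdf(w, mu, sigma, alpha):
    """Skew-Normal density (2/sigma) * phi(u) * Phi(alpha * u), u = (w - mu) / sigma."""
    u = (w - mu) / sigma
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(alpha * u / math.sqrt(2.0)))
    return 2.0 / sigma * phi * big_phi

def mixture_pdf(z, weights, params):
    """Mixture density: sum over k of pi_k * f_k(z; theta_k)."""
    return sum(p * skew_normal_pdf(z, *theta) for p, theta in zip(weights, params))

# Hypothetical values: (pi_1, pi_2) and (mu, sigma, alpha) for each component.
weights = (0.6, 0.4)                              # seronegative, seropositive
params = ((0.0, 1.0, 2.0), (5.0, 1.5, -2.0))
density_at_2 = mixture_pdf(2.0, weights, params)
```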
Consider the n-dimensional vector $z = (z_1, z_2, \ldots, z_n)^T$ of the observed antibody sample of size n; to estimate the parameters of the model defined by Equation (4), one needs to organize the data to take into account the population from which $z_i$ comes, which is to say the pair $(y_i, z_i)^T$, with $y_i = (y_{i1}, y_{i2})^T$, where $y_{i2} = 1$ if $z_i$ comes from the second (seropositive) population, or 0 otherwise; $y_{i1} = 1 - y_{i2}$, $i = 1, \ldots, n$. As a result, one obtains the complete dataset. The complete log-likelihood is thus given by
$$\log L_C\big((y, z); \theta\big) = \sum_{i=1}^{n} \sum_{k=1}^{2} y_{ik} \left[\log \pi_k + \log f_k(z_i; \theta_k)\right], \tag{5}$$
where $y = (y_1, \ldots, y_n)$ is considered a vector of missing values.
Due to the missing structure of the complete dataset from the latent variable, it is crucial to use the EM algorithm to obtain the maximum likelihood estimates for the model’s parameters. The EM algorithm is an iterative method widely used in incomplete data problems where the maximum likelihood estimators (MLE) have no closed expression [28].
In this work, we use the Expectation/Conditional Maximization (ECM) algorithm instead of the classical EM algorithm because mixtures of Skew-Normal or Skew-t distributions lead to a very complex complete-data maximum likelihood estimation [26]. Roughly, the ECM algorithm replaces a complicated M-step of the EM algorithm with several computationally simpler conditional or constrained maximization, or CM steps, each of which maximizes the expected complete-data log-likelihood found in the preceding E-step subject to constraints on the unknown parameters, θ ; the collection of all constraints is such that the maximization is over the entire space of θ . Maximizations that take part in the CM-step are over smaller dimensional spaces, which means they are simpler and more reliable than the corresponding full maximization underlying the M-step of the EM algorithm [29,30].
In brief, the sth iteration of the ECM algorithm proceeds as follows:
  • E-step:
    The random variable $Y_{ik}$ takes the value 1 if the ith observation belongs to population k, and zero otherwise; thus, $Y_{ik} \sim Ber(p_{ik})$, with $p_{ik} = P(Y_{ik} = 1 \mid Z, \theta)$ and $E(Y_{ik}) = p_{ik}$.
    In this step, one estimates the unobserved component membership, $\hat{p}_{ik}$, i.e., the estimated probability that the ith observation comes from the kth population, $k = 1, 2$, given the vector of the antibody levels, $z$, and the current values for the unknown parameters:
    $$\hat{p}_{ik}^{(s+1)} = \frac{\pi_k^{(s)}\, f_k\big(z_i; \theta_k^{(s)}\big)}{\sum_{l=1}^{2} \pi_l^{(s)}\, f_l\big(z_i; \theta_l^{(s)}\big)}, \quad k = 1, 2.$$
    Afterwards, it estimates the probability of sampling from a seronegative or seropositive population, $\hat{\pi}_k^{(s+1)}$:
    $$\hat{\pi}_k^{(s+1)} = \frac{1}{n} \sum_{i=1}^{n} \hat{p}_{ik}^{(s+1)}, \quad k = 1, 2.$$
  • M-step:
    In this step, one maximizes the weighted log-likelihood function (derived from Equation (5)), denoted by $Q(\theta; \theta^{(s)})$, with respect to $\theta$:
    $$Q\big(\theta; \theta^{(s)}\big) = \sum_{i=1}^{n} \sum_{k=1}^{2} \hat{p}_{ik}^{(s+1)} \left[\log \hat{\pi}_k^{(s+1)} + \log f_k(z_i; \theta_k)\right].$$
    Therefore,
    $$\theta_k^{(s+1)} = \underset{\theta_k}{\operatorname{argmax}}\; Q\big(\theta; \theta^{(s)}\big), \quad k = 1, 2.$$
It should be stressed that the M-step involves the maximization of two weighted likelihoods separately, one for each component under consideration (seropositive and seronegative populations), which reduces the overall complexity of the computations involved in this step.
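The E-step and the weight update are simple enough to sketch directly. The Python code below is an illustrative assumption of how the membership probabilities and the updated mixing weights could be computed for two Skew-Normal components; it does not reproduce the smsn.mix implementation, and the conditional maximization over the component parameters is omitted because it requires a numerical optimizer. All parameter values and antibody levels are hypothetical.

```python
import math

def skew_normal_pdf(w, mu, sigma, alpha):
    """Skew-Normal density (2/sigma) * phi(u) * Phi(alpha * u), u = (w - mu) / sigma."""
    u = (w - mu) / sigma
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(alpha * u / math.sqrt(2.0)))
    return 2.0 / sigma * phi * big_phi

def e_step(z, weights, params):
    """Membership probabilities p_ik = pi_k f_k(z_i) / sum_l pi_l f_l(z_i)."""
    resp = []
    for zi in z:
        joint = [p * skew_normal_pdf(zi, *theta) for p, theta in zip(weights, params)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp

def update_weights(resp):
    """Updated mixing weights: column means of the membership probabilities."""
    n = len(resp)
    return [sum(row[k] for row in resp) / n for k in range(len(resp[0]))]

# Hypothetical current parameter values and a toy sample of antibody levels.
weights = (0.6, 0.4)
params = ((0.0, 1.0, 2.0), (5.0, 1.5, -2.0))   # (mu, sigma, alpha) per component
z = [-1.0, 0.5, 3.0, 6.0]

resp = e_step(z, weights, params)
new_weights = update_weights(resp)
```

Each row of resp sums to one, so the updated weights also sum to one without any explicit normalization.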
The process iterates between the E-step and M-step until the difference between two consecutive weighted log-likelihoods is smaller than a prefixed value, which means that convergence has been attained. The ECM algorithm has been proved to share the same appealing convergence properties as the EM [29,30,31].
Considering the SMSN family of distributions, namely the Skew-Normal and the Skew-t distributions, the application of the ECM algorithm in the context of mixtures can be found in [2,26]. The initial values for the parameters are discussed in detail in [26].
To decide which model is the best one among all the models fitted to the same data, we used the Bayesian Information Criterion (BIC) [8].
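For reference, a common form of the criterion is BIC = -2 log L + p log n, where p is the number of free parameters and n the sample size, with lower values preferred. Some software reports the criterion with the opposite sign, so the convention should be checked before comparing values. The sketch below uses hypothetical log-likelihoods and parameter counts, not results from the article.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """BIC = -2 * logL + p * log(n); lower is better under this convention."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Hypothetical fits to the same n = 300 observations: (log-likelihood, free parameters).
fits = {"Normal": (-520.3, 5), "Skew-Normal": (-505.1, 7)}
scores = {name: bic(ll, p, 300) for name, (ll, p) in fits.items()}
best = min(scores, key=scores.get)
```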

3.1. Definition of Seropositivity: Methods to Estimate the Cutoff Points in the Mixture Models

Seroprevalence is an epidemiological measure defined by the proportion of seropositive individuals in the sample. For its estimation, it is then necessary to define the serological status of the i-th individual by dichotomising the variable $Z_i$, which represents the antibody concentration of the individual. This dichotomization is performed by determining a value c such that the individual is classified as seropositive for antibody values equal to or greater than c, and as seronegative otherwise. Thus, let Y be the r.v. representing the number of seropositive individuals in a sample of size n, so that
$$Y = \sum_{i=1}^{n} I\{Z_i \geq c\} \sim Binomial(n, \pi_2),$$
where $\pi_2$ represents the seroprevalence, i.e., $\pi_2 = P[Z_i \geq c]$, and $I\{\cdot\}$ is the indicator variable. Considering that the r.v. representing the antibody levels $Z_i$ is modeled by a finite mixture of distributions, the way to estimate the cutoff c from the observed data is nonstandard. To determine this cutoff value, we present three estimation methods below.
- Method 1 (M1): It is based on the 99.9%-quantile associated with the estimated seronegative population. This method is the most popular in sero-epidemiology [32,33]. It is often called the $3\sigma$ rule because, for a normally distributed seronegative population, the 99.9%-quantile is approximately the mean plus three times the standard deviation;
- Method 2 (M2): It relies on the minimum of the mixture density function. In the case of two latent populations, the cutoff corresponds to the absolute minimum. For three or more latent populations, the cutoff corresponds to the lowest relative minimum. This point can be calculated using Dekker's algorithm [34]. It should be noted that the minimum of the mixing function is not expected to coincide with the point of intersection of the probability densities of each subpopulation;
- Method 3 (M3): It imposes a threshold in the so-called conditional classification curves [32]. Under the assumption that all components but the first one refer to seropositive individuals, the conditional classification curve for the i-th individual given the antibody level $Z_i = x$ is defined as
$$p_{+ \mid Z_i = x} = \frac{\pi_2\, f_2(Z_i = x; \theta_2)}{\sum_{k=1}^{2} \pi_k\, f_k(Z_i = x; \theta_k)}.$$
In turn, the classification curve of seronegative individuals is simply given by
$$p_{- \mid Z_i = x} = 1 - p_{+ \mid Z_i = x}.$$
After calculating these curves, one can impose a minimum value for the classification of each individual. In this case, two cutoff values arise in the antibody distribution, one for the seronegative individuals and another for seropositive individuals. Mathematically, the classification rule is given as follows:
$$C_i = \begin{cases} \text{seronegative}, & \text{if } x_i \leq c_- \\ \text{equivocal}, & \text{if } c_- < x_i < c_+ \\ \text{seropositive}, & \text{if } x_i \geq c_+ \end{cases}$$
where $c_-$ and $c_+$ are the cutoff values in the antibody distribution that ensure a minimum classification probability (say 90%). In practice, one can use the bisection method to calculate these cutoff values, providing an initial interval where they might be located [32].
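The three cutoff definitions can be sketched on a toy example. The Python code below makes several simplifying assumptions that differ from the article: the components are Normal rather than Skew-Normal or Skew-t (so that the standard-library class statistics.NormalDist suffices), the mixture parameters are hypothetical, and the minimum for M2 is located by golden-section search instead of Dekker's algorithm. M3 uses bisection with a 90% classification probability.

```python
from statistics import NormalDist

# Hypothetical fitted mixture (Normal components for simplicity).
PI = (0.6, 0.4)
NEG = NormalDist(mu=0.0, sigma=1.0)   # seronegative component
POS = NormalDist(mu=5.0, sigma=1.0)   # seropositive component

def mixture_pdf(x):
    return PI[0] * NEG.pdf(x) + PI[1] * POS.pdf(x)

def p_positive(x):
    """Conditional classification curve p_{+|Z=x}."""
    return PI[1] * POS.pdf(x) / mixture_pdf(x)

def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def golden_min(f, lo, hi, tol=1e-10):
    """Minimiser of f on [lo, hi] by golden-section search."""
    r = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c, d = b - r * (b - a), a + r * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - r * (b - a)
        else:
            a, c = c, d
            d = a + r * (b - a)
    return 0.5 * (a + b)

# M1: 99.9% quantile of the seronegative component.
m1 = NEG.inv_cdf(0.999)
# M2: minimum of the mixture density between the two component means.
m2 = golden_min(mixture_pdf, NEG.mean, POS.mean)
# M3: cutoffs giving at least 90% classification probability on each side.
c_plus = bisect(lambda x: p_positive(x) - 0.90, NEG.mean, POS.mean)
c_minus = bisect(lambda x: p_positive(x) - 0.10, NEG.mean, POS.mean)

def classify(x):
    if x <= c_minus:
        return "seronegative"
    if x >= c_plus:
        return "seropositive"
    return "equivocal"
```

In this toy setting NormalDist().cdf(3) is about 0.9987, which is why the mean plus three standard deviations is only approximately the 99.9%-quantile.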

3.2. Software

For this study, we used the R software version 4.2.3. In particular, we used the package mixsmsn to fit different mixture models based on SMSN [35]. To estimate the model parameter via the EM algorithm, we used the function smsn.mix. For fitting the Student’s t-distribution, we considered the R package extraDistr [36]; namely, the function dlst to calculate their density, the function plst to define the cumulative distribution function, and the function rlst to generate random samples in the simulation study. The fitting of the Skew-Normal distributions was performed with the package sn [37]. The functions dsn, psn, and rsn were used to calculate the probability density function, the cumulative distribution function, and generate random samples of the Skew-Normal distribution, respectively. In the case of the Skew-t distribution, the functions dst, pst, and rst were used to calculate the probability density function, the cumulative distribution function, and generate random samples, respectively.

4. Simulation Study

We used Monte Carlo simulation to assess the performance of the cutoff points techniques (see Section 3.1). We based our analyses on the usual mixture distributions used in serological data (Normal and Student-t distributions) and their skewed versions proposed in this article. Overall, we want to check whether the performance of the proposed techniques is worse in symmetrical distributions (usually considered when analysing this type of data). Then, four simulation scenarios regarding the mixture models were considered (Table 1).
Another goal of the simulation was to evaluate how well the fitted mixture models can distinguish between the two populations under study, seropositive or seronegative. Accordingly, we considered different proportions of seropositive (and consequently seronegative) individuals in the dataset. That is, varying the weights assigned to the populations in the mixture model allowed us to check the ability of the model to identify the seropositive component even when the weight assigned to that component was low. The practical implication of varying the weights of the seronegative and seropositive populations is the identification of false negative and false positive individuals. In addition, when the proportion of seronegative individuals is very high relative to seropositive individuals, more effective decisions can be made to control the number of infections in the population. On the other hand, the opposite scenario is essential for the effectiveness of vaccination in the population, particularly for individuals who may have lost immunity.
To proceed with the simulation study, we randomly selected an antigen from the practical case presented in Section 5 and fitted mixtures of Normal, Skew-Normal, Student-t and Skew-t distributions to these data. The parameter estimates of each fitted model were considered the true parameters for the simulation study (Table 1). To assess the impact of sample size and the percentage of seronegative (or seropositive) individuals in the mixture model on the performance of the methods in identifying the threshold value for distinguishing seronegative from seropositive individuals, we considered sample sizes (n) between 50 and 500, in steps of 50. We set $\pi_1 = 0.3, 0.6, 0.9$ for the probability of a seronegative individual in the mixture model. In each simulation scenario, $N = 1000$ replicate samples were drawn. For each simulated sample, the parameters of the two-component mixture model were estimated by maximum likelihood (via the ECM algorithm described in Section 3).
The primary goal of the simulation study is to assess the differences between the cutoff values obtained by the three methods under study and the true cutoff points. Therefore, the evaluation criteria must focus on quantifying these differences. The two-component mixture models defining the various simulation scenarios vary based on the weights assigned to the seropositive/seronegative components and the underlying probability distribution. Consequently, the true cutoff values also vary according to the scenario under consideration. Table 2 provides the true cutoff values for each situation considered in the simulation study.
Next, we report the following performance measures for the three methods under consideration: relative bias (RB) and mean squared error (MSE). Let $\delta_{ij}^*$ be the estimated cutoff point based on the i-th simulated sample ($i = 1, \ldots, 1000$) and for the jth method ($j = 1, 2, 3$, representing methods M1, M2, and M3, respectively); $\delta_j$ is the theoretical cutoff point based on the jth method. Then, the Relative Bias (RB) and the estimated Mean Squared Error (MSE) for the jth method based on $N = 1000$ replicate samples are given by:
$$RB_j = \frac{1}{N} \sum_{i=1}^{N} \frac{\delta_{ij}^* - \delta_j}{\delta_j} = \frac{\overline{\delta_j^*}}{\delta_j} - 1, \qquad MSE_j = \frac{1}{N} \sum_{i=1}^{N} \left(\delta_{ij}^* - \delta_j\right)^2, \quad j = 1, 2, 3, \tag{9}$$
where $\overline{\delta_j^*} = \frac{1}{N} \sum_{i=1}^{N} \delta_{ij}^*$ is the average of the cutoff points obtained from the jth method, $j = 1, 2, 3$, over all simulations performed.
We also compute a distribution-free approximate $100(1 - \alpha)\%$ confidence interval for each cutoff value derived from the three methods under study using the empirical percentiles method, $\left[\hat{F}_{\delta^*}^{-1}(\alpha/2),\; \hat{F}_{\delta^*}^{-1}(1 - \alpha/2)\right]$, where $\hat{F}_{\delta^*}(\cdot)$ is the empirical cumulative distribution function of the cutoff points' sample, $(\delta_1^*, \delta_2^*, \ldots, \delta_N^*)$, with N representing the number of replicates.
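These performance measures translate into a few lines of code. The Python sketch below is illustrative only: the function names are assumptions, and the percentile interval uses the inverse of the empirical distribution function (the smallest order statistic whose rank k satisfies k/N >= p), which is one of several quantile conventions and may differ from the one used in the article.

```python
import math

def relative_bias(estimates, true_value):
    """RB = mean(estimates) / true_value - 1."""
    return sum(estimates) / len(estimates) / true_value - 1.0

def mean_squared_error(estimates, true_value):
    """MSE = average squared distance between estimates and the true value."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)

def percentile_interval(estimates, alpha=0.05):
    """Empirical-percentile interval based on the inverse empirical CDF."""
    xs = sorted(estimates)
    n = len(xs)

    def quantile(p):
        k = max(1, math.ceil(p * n))   # rank of the order statistic
        return xs[k - 1]

    return quantile(alpha / 2), quantile(1 - alpha / 2)

# Toy cutoff estimates around a hypothetical true cutoff of 3.0.
toy = [2.9, 3.0, 3.1, 3.2]
rb = relative_bias(toy, 3.0)
mse = mean_squared_error(toy, 3.0)
interval = percentile_interval(list(range(1, 101)), alpha=0.05)
```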
Next, we outline the algorithm of the simulation procedure used in this article.
Simulation Procedure
Let Z be the random variable representing the antibody level, which is described by a two-component mixture model with probability density function given by expression (4). The mixture probability density functions analyzed here are the Normal, Skew-normal, Student-t, and Skew-t distributions.
For each combination of two-component mixture distribution with fixed weight from the set $\pi_1 \in \{0.3, 0.6, 0.9\}$ (and thus $\pi_2 = 1 - \pi_1$) and the theoretical vector of parameters $\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \alpha_1, \alpha_2, \nu_1, \nu_2)^T$ given by Table 1, we proceed as follows:
1. For $i = 1$ to N (run N Monte Carlo simulations):
   S.1 Simulate a sample of size n of antibody concentrations:
     (i) Generate $m = n\pi_1$ seronegative individuals using $Bernoulli(1, \pi_1)$.
     (ii) The remaining $n - m$ individuals of the sample of size n are seropositive.
     (iii) Based on the theoretical model under consideration, generate a random sample of antibody concentrations of size n: the m observations from (i) are drawn from the seronegative population, whereas the $n - m$ observations from (ii) come from the seropositive population.
   S.2 Fit a two-component mixture model to the simulated sample using the ECM algorithm described in Section 3.
   S.3 Estimate the cutoff points based on the three methods under study, $\delta_{ij}^*$, where i denotes the ith simulated sample and $j = 1, 2, 3$ represents the method under consideration (M1, M2, and M3, respectively).
2. Store the estimated cutoff values in an $N \times 3$ matrix, $\delta^*$, where the jth column contains the cutoff points' sample of size N for the jth method ($j = 1, 2, 3$), i.e., the N-dimensional column vector $\delta_j^* = (\delta_{1j}^*, \delta_{2j}^*, \ldots, \delta_{Nj}^*)^T$, and $\delta^* = [\delta_1^* \; \delta_2^* \; \delta_3^*]$.
3. Calculate the RB and the estimated MSE according to (9) for each cutoff points' sample stored in the N-dimensional column vector $\delta_j^*$, $j = 1, 2, 3$.
4. Determine the empirical cumulative distribution function from the N-dimensional column vector $\delta_j^*$, $j = 1, 2, 3$, of the estimated cutoff points; then, construct a distribution-free approximate $100(1 - \alpha)\%$ confidence interval for the true cutoff point from method j, $j = 1, 2, 3$, based on the percentile method [38].
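The overall loop can be sketched in a heavily simplified form. The Python code below is not the article's procedure: it assumes Normal components with hypothetical "true" parameters, replaces the ECM fit of skewed mixtures with a plain EM for a two-component Normal mixture, draws each individual's status from a Bernoulli trial, evaluates only the M1 cutoff, and uses a small number of replicates to keep the run short.

```python
import math
import random
from statistics import NormalDist

def simulate_sample(n, pi1, neg, pos, rng):
    """S.1 (simplified): Bernoulli(pi1) status per individual, then the antibody level."""
    sample = []
    for _ in range(n):
        comp = neg if rng.random() < pi1 else pos
        sample.append(rng.gauss(comp.mean, comp.stdev))
    return sample

def fit_normal_mixture(z, iters=200, tol=1e-8):
    """S.2 stand-in: plain EM for a two-component Normal mixture (not ECM)."""
    zs = sorted(z)
    half = len(zs) // 2
    mu = [sum(zs[:half]) / half, sum(zs[half:]) / (len(zs) - half)]
    sd = [max(1e-3, (zs[-1] - zs[0]) / 4.0)] * 2
    pi = [0.5, 0.5]
    prev = -math.inf
    for _ in range(iters):
        comps = [NormalDist(mu[k], sd[k]) for k in (0, 1)]
        resp, loglik = [], 0.0
        for zi in z:                                   # E-step
            joint = [pi[k] * comps[k].pdf(zi) for k in (0, 1)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        for k in (0, 1):                               # M-step (closed form)
            wk = sum(r[k] for r in resp)
            pi[k] = wk / len(z)
            mu[k] = sum(r[k] * zi for r, zi in zip(resp, z)) / wk
            var = sum(r[k] * (zi - mu[k]) ** 2 for r, zi in zip(resp, z)) / wk
            sd[k] = max(1e-3, math.sqrt(var))
        if abs(loglik - prev) < tol:
            break
        prev = loglik
    return pi, mu, sd

def run_simulation(n=200, pi1=0.6, n_rep=30, seed=1):
    rng = random.Random(seed)
    neg, pos = NormalDist(0.0, 1.0), NormalDist(5.0, 1.0)   # hypothetical true model
    true_cutoff = neg.inv_cdf(0.999)                        # true M1 cutoff
    estimates = []
    for _ in range(n_rep):
        z = simulate_sample(n, pi1, neg, pos, rng)
        _, mu, sd = fit_normal_mixture(z)
        estimates.append(NormalDist(mu[0], sd[0]).inv_cdf(0.999))  # S.3, M1 only
    rb = sum(estimates) / n_rep / true_cutoff - 1.0
    mse = sum((e - true_cutoff) ** 2 for e in estimates) / n_rep
    return rb, mse

rb, mse = run_simulation()
```

Extending the sketch to M2 and M3, or to skewed components, only changes the fitting and cutoff functions; the replicate loop and the summary statistics stay the same.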
The main results from the simulation study are provided in Figures 2-7 and in Appendix C, Tables A3-A6. We start by analyzing the bias properties. Considering the balanced scenario ($\pi_1 = 0.6$), we observe that, for both the mixture of Normal distributions and the respective skewed variant (Figure 3A,B), the cutoff estimates exhibit moderate positive bias across all considered methods. Additionally, there is a stabilizing pattern in bias behavior, indicating that the bias remains almost constant as the sample size increases. However, the cutoff derived from the M1 method is the one that shows the worst behavior in the framework of the mixture of Normal distributions. Concerning the mixture of Student-t distributions (Figure 3C,D), the M1 method has a large positive bias, stabilizing at $n = 100$. Methods M2 and M3 demonstrate comparable performances, displaying a generally unbiased pattern. Finally, the mixture of Skew-t distributions reveals an erratic behavior of the M1 method, exhibiting an initial positive bias for $n = 50$ and decreasing to negative biases as the sample size increases. Cutoff estimates linked to methods M2 and M3 consistently display stable moderate negative bias patterns.
When $\pi_1 = 0.3$ (indicating a predominant population of seropositive individuals, $\pi_2 = 0.7$), the cutoff estimates based on method M1 exhibit the poorest performance (Figure 2), although its behavior remains stable as the sample size increases. The cutoff estimates from the other two methods share this regular pattern as the sample size increases. Moreover, cutoff estimates derived from M2 and M3 demonstrate similar performance, displaying a moderate positive bias for the mixtures of Normal and Skew-Normal distributions and a moderate negative bias for the mixtures of Student-t and Skew-t distributions, with the bias being more pronounced in the latter case.
Figure 4 illustrates the relative bias of the cutoff estimates obtained from methods M1, M2, and M3 when $\pi_1 = 0.9$, reflecting a scenario where only 10% of the population is seropositive, indicating high susceptibility to the considered virus. For the mixture of Normals, all three methods yield cutoff estimates with similar performance in terms of bias, displaying moderate positive bias (Figure 4A). Surprisingly, Figure 4B reveals an unexpected behavior in the cutoff estimates from the M2 method, exhibiting fluctuations as the sample size increases and showing a larger relative bias than its competitors. This characteristic might be linked to the method's incapacity to precisely model a mixture of Skew-Normal distributions, particularly in cases where one of the components has a very small weight in the mixture. Methods M1 and M3 produce cutoff estimates with stable patterns. For the mixture of Student-t and the mixture of Skew-t distributions (Figure 4C,D), cutoff estimates from methods M2 and M3 exhibit good performance, showing small positive biases in the former and small negative biases in the latter. Conversely, method M1 generates cutoff estimates with large and erratic biases.
The estimated MSE (Figures 5–7) measures the accuracy of the cutoff point estimates. To avoid misinterpretation, note first that the y-axis ranges for the mixtures of Normal distributions (symmetric and skewed versions) are considerably narrower than those for the mixtures of Student-t distributions (symmetric and skewed versions). Overall, for medium sample sizes, the M1 method consistently produces the cutoff estimates with the largest MSE, irrespective of the probability of seropositivity in the underlying population. As the sample size increases, the MSE of the cutoff estimates based on the M1 method decreases rapidly toward zero, indicating a substantial reduction in the variability of the estimates. The MSE of the cutoff estimates derived from the M2 and M3 methods also converges to values near zero as the sample grows. These characteristics hold for the different values of $\pi_1$, that is, the weight of the seronegative population in the mixture. A word of caution is warranted regarding the cutoff estimates based on the M2 method for $\pi_1 = 0.9$ under the Skew-Normal distribution (Figure 7B), whose MSE appears to increase with the sample size. This impression might be partly due to the scale of the y-axis, which spans values very close to zero. Nevertheless, a similar pattern appeared in the corresponding relative bias (Figure 4B).
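The summary measures discussed above can be illustrated with a minimal Python sketch. It assumes that the relative bias is defined as 100 × (mean estimate − theoretical cutoff) / theoretical cutoff and that the interval is an empirical percentile interval; both definitions, the function names, and the simulated values are illustrative assumptions rather than the authors' code or data.

```python
import random
import statistics


def relative_bias(estimates, true_value):
    """Relative bias in percent: 100 * (mean estimate - true) / true."""
    return 100.0 * (statistics.fmean(estimates) - true_value) / true_value


def mean_squared_error(estimates, true_value):
    """Mean of the squared deviations of the estimates from the true value."""
    return statistics.fmean([(c - true_value) ** 2 for c in estimates])


def percentile_interval(estimates, level=0.95):
    """Empirical interval between the lower and upper percentiles of the estimates."""
    ordered = sorted(estimates)
    last = len(ordered) - 1
    lower = ordered[int((1 - level) / 2 * last)]
    upper = ordered[int((1 + level) / 2 * last)]
    return lower, upper


# Hypothetical stand-in for N = 1000 simulated cutoff estimates (not the paper's data).
rng = random.Random(2024)
true_cutoff = 2.33
simulated_cutoffs = [rng.gauss(2.65, 0.10) for _ in range(1000)]

print(f"relative bias: {relative_bias(simulated_cutoffs, true_cutoff):.2f}%")
print(f"MSE: {mean_squared_error(simulated_cutoffs, true_cutoff):.3f}")
print("95% interval:", percentile_interval(simulated_cutoffs))
```

Under this assumed definition, a mean estimate of 2.68 against a theoretical cutoff of 2.33 (first row of Table A3) gives a relative bias of about 15%, close to the tabulated 14.92; the small gap would be consistent with rounding of the tabulated mean.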

5. Applications to SARS-CoV-2 Real Data

We analysed IgG antibody responses against four SARS-CoV-2 spike or nucleoprotein antigens: RBD (glycoprotein receptor-binding domain), $S_{tri}$ (trimeric spike protein), S1 (spike glycoprotein S1 domain), and S2 (spike glycoprotein S2 domain). Antibodies were measured in serum samples collected up to 39 days after symptom onset from 215 adults in four French hospitals (53 patients and 162 healthcare workers) with quantitative RT-PCR-confirmed SARS-CoV-2 infection. Three hundred and thirty-five negative control serum samples were collected from France, Thailand, and Peru before the COVID-19 pandemic [7]. A detailed description of lab procedures can be found in the original study [7].
SARS-CoV-2 infection, which causes the devastating and often lethal COVID-19 disease, was first detected in the Chinese city of Wuhan in December 2019 [7]. The infection rapidly spread worldwide, and COVID-19 was declared a pandemic by the World Health Organization.
So far, the virus has been detected by quantitative reverse transcription PCR (RT-qPCR) on samples from nasopharyngeal or throat swabs [7]. However, in general, only symptomatic individuals or people in close contact with detected cases are tested, which might lead to overestimating the proportion of individuals infected with SARS-CoV-2 [39]. Alternatively, serological testing allows for detecting asymptomatic individuals exposed to the infection. In addition, serological testing can quantify the degree of exposure to the infection in the population. In this context, it is crucial to estimate seroprevalence at the population level, i.e., the proportion of seropositive individuals that show antibodies against any SARS-CoV-2 antigen [40].

5.1. Patients’ Characteristics

For this study, data from 549 individuals were analysed. Serum samples were collected from individuals with PCR-confirmed SARS-CoV-2 infection in French hospital units, namely: 4 (0.7%) from the Hôpital Bichat, 49 (9.0%) from the Hôpital Cochin, and 161 (29.3%) from the Nouvel Hôpital (Strasbourg). Regarding the negative controls, 68 (12.4%) are from the Thai Red Cross (TRC), 90 (16.4%) from Peruvian donors (NHP), and 177 (32.2%) from French blood donors (Établissement Français du Sang). For each antigen under analysis, the antibody concentration was transformed using the base-10 logarithm.
Antibody levels differed significantly between individuals who tested negative and those who tested positive for SARS-CoV-2 by PCR, according to the Mann-Whitney test (RBD: 1.64 versus 3.48, $p < 0.001$; S1: 1.72 versus 2.59, $p < 0.001$; S2: 1.79 versus 2.99, $p < 0.001$; $S_{tri}$: 1.59 versus 3.43, $p < 0.001$) (Figure 8). Such differences were expected given what is known about infection status: individuals who have already been exposed to the virus have a higher concentration of antibodies than those who are still susceptible.
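To make the group comparison concrete, the following Python sketch computes the Mann-Whitney U statistic by direct pair counting and a two-sided p-value from the large-sample normal approximation. The function names and the toy antibody values are hypothetical, and the approximation omits tie and continuity corrections, so it is an illustration of the idea rather than the procedure used for the reported p-values.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(positives, negatives):
    """Count pairs where the positive value exceeds the negative one (ties count 0.5)."""
    return sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in positives
        for q in negatives
    )


def normal_approx_p_value(u, n_pos, n_neg):
    """Two-sided p-value from the normal approximation (no tie or continuity correction)."""
    mean_u = n_pos * n_neg / 2
    sd_u = sqrt(n_pos * n_neg * (n_pos + n_neg + 1) / 12)
    z = (u - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical log10 antibody concentrations (not the study data).
infected = [2.1, 3.0, 3.4, 2.4]
controls = [1.2, 1.8, 2.6, 1.5]

u = mann_whitney_u(infected, controls)
auc = u / (len(infected) * len(controls))  # U rescaled to [0, 1] equals the empirical AUC
p_value = normal_approx_p_value(u, len(infected), len(controls))
print(u, auc, round(p_value, 4))
```

Dividing U by the number of pairs gives the empirical area under the ROC curve, which links this test to the ROC-based analysis reported later in this section.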

5.2. Mixture Model Approach and Cutoff Points

We fitted the different mixture models analyzed in this paper to the SARS-CoV-2 datasets considering two subpopulations (seronegative and seropositive). Specifically, we fitted two-component mixture models based on the Normal, Skew-Normal, Student-t, and Skew-t distributions. Table A1 provides a summary of the main results.
According to the BIC values, the model based on the Skew-Normal distribution was considered the best fit for the following antigens: RBD (BIC = 852.25), S1 (BIC = 561.63), and S2 (BIC = 775.29). For the $S_{tri}$ antigen, the best model was found to be the one based on the Skew-t distribution (BIC = 915.82) (Table A1). Table 3 displays the parameter estimates for the optimal mixture models determined by the BIC criterion. Additionally, graphical representations of the estimated densities for each antigen are shown in Figure 9.
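The BIC-based selection can be reproduced from the values reported in Table A1 with a short Python sketch. The dictionary layout and variable names are illustrative choices; the numbers are copied from the table.

```python
# BIC values transcribed from Table A1 (lower is better).
bic = {
    "RBD": {"Normal": 953.00, "Skew-Normal": 852.25, "Student-t": 959.60, "Skew-t": 854.78},
    "S1": {"Normal": 561.81, "Skew-Normal": 561.63, "Student-t": 568.98, "Skew-t": 568.27},
    "S2": {"Normal": 778.76, "Skew-Normal": 775.29, "Student-t": 785.73, "Skew-t": 781.75},
    "Stri": {"Normal": 1010.18, "Skew-Normal": 916.15, "Student-t": 1016.84, "Skew-t": 915.82},
}

# Model with the smallest BIC for each antigen.
best = {antigen: min(models, key=models.get) for antigen, models in bic.items()}

# Gap between the best and the second-best BIC, to show how close the runner-up is.
margin = {}
for antigen, models in bic.items():
    ordered = sorted(models.values())
    margin[antigen] = ordered[1] - ordered[0]

for antigen in bic:
    print(antigen, best[antigen], round(margin[antigen], 2))
```

The margins show that the choice is clear-cut against the symmetric models for RBD and $S_{tri}$, whereas for S1 the Normal and Skew-Normal mixtures differ by only 0.18 BIC units.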
In line with results from previous studies, the seronegative population is skewed to the right, whereas the seropositive population reveals skewness to the left; this feature is not very pronounced in the case of the S1 ($\hat{\alpha}_{S1} = 1.062$) and S2 ($\hat{\alpha}_{S2} = 0.450$) antigens (Table 3).
After identifying the mixture model that best fits each dataset, we categorized the antibody concentration for each antigen by estimating the respective cutoff point. To achieve this goal, we employed the methods M1, M2, and M3 as previously described in Section 3.1. The results are comprehensively detailed in Appendix A Table A1. In addition to the estimated cutoff points, Table A1 also shows a few performance measures, namely sensitivity (sen), specificity (spec), and accuracy (ACC). Graphical representations of the cutoff points for the three methods under consideration are displayed in Figure 9, where these points are indicated by vertical dotted lines.
It is important to emphasize that, when fitting a mixture of Student-t distributions (either symmetric or skewed) to the $S_{tri}$ antigen, the sensitivity and accuracy of the M1 method (99.9%-quantile of the seronegative population) could not be calculated because the respective quantile was too high. As a result, the seropositive population was completely absorbed by its seronegative counterpart (Table A1).
Considering the best-fitted models, one can conclude that estimating the cutoff point from the minimum of the mixture density (M2 method) yields the highest sensitivity for classifying seropositive individuals, regardless of the antigen under consideration: RBD: cutoff = 2.49, sens = 86.45%; S1: cutoff = 2.27, sens = 71.03%; S2: cutoff = 2.39, sens = 83.64%; $S_{tri}$: cutoff = 2.53, sens = 89.25% (Table A1). Method M2 also yields the highest proportion of correct results (accuracy) for the RBD antigen (cutoff = 2.49, ACC = 92.89%), S1 (cutoff = 2.27, ACC = 86.89%), and S2 (cutoff = 2.39, ACC = 90.89%). Regarding the $S_{tri}$ antigen, both methods M2 and M3 achieve the same accuracy of 93.44% (Table A1).
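The performance measures quoted above can be sketched in Python as follows. The classification rule (seropositive when the value is at or above the cutoff), the function names, and the toy data are assumptions made for illustration; the group sizes of 214 infected and 335 control individuals are those implied by the counts in Section 5.1.

```python
def classification_metrics(values, infected, cutoff):
    """Sensitivity, specificity, and accuracy (in %) for a given cutoff.

    Assumes an individual is classified as seropositive when value >= cutoff.
    """
    tp = fn = tn = fp = 0
    for value, is_infected in zip(values, infected):
        predicted_positive = value >= cutoff
        if is_infected and predicted_positive:
            tp += 1
        elif is_infected:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


def accuracy_from_rates(sensitivity, specificity, n_infected, n_controls):
    """Accuracy (in %) as the group-size-weighted average of sensitivity and specificity."""
    correct = sensitivity * n_infected + specificity * n_controls
    return correct / (n_infected + n_controls)


# Hypothetical log10 antibody values and true infection status (not the study data).
toy_values = [1.2, 1.8, 2.6, 2.1, 3.0, 3.4, 2.4, 1.5]
toy_infected = [False, False, False, True, True, True, True, False]
print(classification_metrics(toy_values, toy_infected, cutoff=2.3))

# Consistency check on the RBD values reported for method M2 in Table A1.
rbd_accuracy = accuracy_from_rates(86.45, 97.01, n_infected=214, n_controls=335)
print(round(rbd_accuracy, 2))
```

The second call recombines the reported RBD sensitivity and specificity and lands within rounding of the reported accuracy of 92.89%, which illustrates why accuracy stays high even when sensitivity drops: the larger control group dominates the weighted average.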
Since the true infection status of the individuals is known in this case study, and to further assess the performance of the proposed methods, we also applied a ROC curve-based method (hereafter designated as the M4 method) through univariable logistic regression analysis. We considered the disease status as the binary outcome variable and the $\log_{10}$(antibody concentration) for each antigen as the covariate. The ROC curve is commonly used to evaluate the performance of biomarkers, and the area under the ROC curve (AUC) summarizes this performance. To estimate the optimal cutoff point, and consequently the sensitivity and specificity, we used the R package OptimalCutpoints with the method that minimizes the misclassification rate, which is described in detail in [41].
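The criterion of minimizing the misclassification rate can be illustrated with a brute-force Python sketch that scans every observed value as a candidate cutoff. This is a simplified stand-in written for illustration, not the implementation in the R package; the function name, the classification rule (positive when the value is at or above the cutoff), and the toy data are assumptions.

```python
def min_misclassification_cutoff(values, infected):
    """Return the candidate cutoff with the fewest misclassified individuals.

    Every distinct observed value is tried as a cutoff; an individual is
    classified as positive when value >= cutoff. Ties keep the smallest cutoff.
    """
    best_cutoff, best_errors = None, None
    for candidate in sorted(set(values)):
        errors = sum(
            (value >= candidate) != is_infected
            for value, is_infected in zip(values, infected)
        )
        if best_errors is None or errors < best_errors:
            best_cutoff, best_errors = candidate, errors
    return best_cutoff, best_errors / len(values)


# Hypothetical log10 antibody values and true infection status (not the study data).
toy_values = [1.2, 1.8, 2.6, 2.1, 3.0, 3.4, 2.4, 1.5]
toy_infected = [False, False, False, True, True, True, True, False]

cutoff, error_rate = min_misclassification_cutoff(toy_values, toy_infected)
print(cutoff, error_rate)
```

Unlike methods M1 to M3, this criterion needs the true infection status of every individual, which is why it can only serve as a benchmark in datasets such as the one analysed here.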
Results from the ROC curve-based methods are presented in Appendix B Table A2, where the estimated cutoff points for each antigen are reported, along with the same performance measures used to evaluate methods M1, M2, and M3, namely sensitivity, specificity, and accuracy. Additionally, we calculated the Area Under the Curve (AUC) and the respective 95 % confidence interval. A graphical representation of the main results, facilitating the comparison of methods M1 to M4 in terms of performance measures, is displayed in Figure 10. The source data are stored in Appendix A Table A1 (results from M1 to M3 methods) and Appendix B Table A2 (ROC curve-based method, M4).
Regarding sensitivity, the M1 method performs worse than the other methods under consideration (Figure 10A, Appendix A Table A1). The worst case occurs for the antigen S1, where approximately half of the infected individuals are misclassified by M1: only 51% of infected individuals are correctly classified as seropositive. For the antigen S2, the sensitivity of M1 is 5.6 percentage points higher than for S1; however, both values remain low. Methods M2 and M3 behave similarly in terms of sensitivity for all the antigens analysed. Method M4 (ROC curve-based method) has the highest true positive rate for the RBD, S1, and S2 antigens, whereas for $S_{tri}$ method M2 attains a slightly higher sensitivity (89.25% versus 86.92%) (Figure 10A, Appendix A Table A1, and Appendix B Table A2).
In terms of specificity (Figure 10B), all four methods exhibit similar behavior across all antigens, with specificities above 93% (above 95% for methods M1, M2, and M3 under the best-fitting models). This indicates that the methods accurately identify seronegative individuals in the subpopulation not infected by SARS-CoV-2.
Concerning accuracy, method M1 again exhibits the worst behavior, as it does for sensitivity. However, its accuracy values are higher than its sensitivity values, with a minimum of 80.15% for the S1 antigen (Figure 10B, Appendix A Table A1, and Appendix B Table A2).
In summary, method M1 exhibits the lowest performance in accurately classifying individuals as seronegative or seropositive. These findings align with the conclusions drawn from the simulation study.

6. Discussion and Conclusions

Serological data can be described as a mixture of serological status distributions: seronegative (antibody-negative) or seropositive (antibody-positive) populations. In this framework, one must thus use a mixture model with two components. Therefore, it is crucial to accurately calculate the threshold that distinguishes between seropositive and seronegative populations.
This study aimed to evaluate the performance of three cutoff point estimation methods proposed in [8] for defining the seropositivity of an individual using mixtures of Skew-Normal and Skew-t distributions. The 99.9%-quantile method (or 3-$\sigma$ rule, if a Normal distribution is assumed) is commonly used in practice as the gold standard to estimate the cutoff point in serological tests, assuming a Normal distribution for the components of the mixture [11,12,13,14]. This method (M1) is compared with two other methods, M2 and M3. Method M2 relies on the minimum of the density function of a two-component mixture model, whereas the cutoff point obtained from method M3 is based on conditional classification curves.
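As a rough illustration of the first two methods, the Python sketch below computes an M1-type and an M2-type cutoff for a two-component Normal mixture. The component parameters, the function names, and the grid search between the component means are assumptions made for this example; it does not cover the skewed families or method M3, whose definition is given in Section 3.1.

```python
from statistics import NormalDist


def cutoff_m1(seronegative, prob=0.999):
    """M1-type cutoff: the 99.9%-quantile of the seronegative component."""
    return seronegative.inv_cdf(prob)


def mixture_pdf(x, weight_neg, seronegative, seropositive):
    """Density of the two-component mixture at x."""
    return weight_neg * seronegative.pdf(x) + (1 - weight_neg) * seropositive.pdf(x)


def cutoff_m2(weight_neg, seronegative, seropositive, steps=20_000):
    """M2-type cutoff: grid point between the component means with minimal mixture density."""
    low, high = seronegative.mean, seropositive.mean
    grid = (low + (high - low) * i / steps for i in range(steps + 1))
    return min(grid, key=lambda x: mixture_pdf(x, weight_neg, seronegative, seropositive))


# Hypothetical components (not estimates from the paper).
negative = NormalDist(mu=0.0, sigma=1.0)
positive = NormalDist(mu=4.0, sigma=1.0)

print(round(cutoff_m1(negative), 2))                 # quantile of the seronegative component only
print(round(cutoff_m2(0.5, negative, positive), 2))  # equal weights
print(round(cutoff_m2(0.9, negative, positive), 2))  # seronegative component dominates
```

For a Normal component, the 99.9%-quantile equals the mean plus about 3.09 standard deviations, which is why it is referred to as the 3-$\sigma$ rule. The sketch also shows that the M1-type cutoff ignores the seropositive component and the mixing weight entirely, whereas the M2-type cutoff shifts toward the smaller component as the seronegative weight grows, in line with the increase of $opt_{M2}$ with $\pi_1$ in Table A3.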
A Monte Carlo simulation study was conducted to evaluate the performance of the cutoffs obtained by each method based on the mixture of two Normal or two Student-t distributions or their skewed variants. The relative bias, estimated mean square error, and confidence intervals for the true cutoff values based on the three methods under study were calculated.
When a new virus appears in the population, there is a natural tendency for the proportion of susceptible individuals (seronegative individuals) to be higher than seropositive individuals. This context corresponds to the phase in which early identification of the infected people is essential for pandemic control. However, total control of the spread of the virus only occurs when there is vaccination or eradication of the virus. Due to these complex dynamics, in the simulation study we decided to evaluate the effect of different percentages of seropositive (and obviously, seronegative) individuals in the overall population on the determination of cutoff points as well as the ability of the modeling procedure to correctly capture two components in the mixture model. In addition to evaluating the effect of different analytic expressions to describe the two-component mixture model, the simulation study allowed us to study distinct pandemic evolution scenarios due to varying the probability of seropositivity (or seronegativity) in the population.
For most of the two-component mixture models studied in the simulation, we conclude empirically that the traditional method (M1) has the poorest performance in terms of bias and estimated MSE compared with methods M2 and M3. This low performance can be explained by the use of heavy-tailed distributions (such as the Skew-Normal or Skew-t distributions) to fit serological data. In fact, the calculation of the 99.9%-quantile (3-$\sigma$ rule for the Normal distribution) relies exclusively on the population of seronegative individuals. Because these heavy-tailed distributions are skewed to the right, the seropositive population is absorbed by the 99.9%-quantile of the seronegative population.
Methods M2 and M3 were both found to be moderately biased, with small variability. As expected, the larger the sample size, the smaller the estimated mean squared error of the cutoff point estimates.
Lastly, we applied these methods to real data on SARS-CoV-2 infections and evaluated their performance. Since the true disease status of the individuals was known in advance, we also computed the ROC curve-based method, which is a standard procedure to evaluate the performance of biomarkers. In line with the conclusions from the simulation study, method M1 showed the lowest performance in identifying the cutoff point that distinguishes seronegative from seropositive individuals. Therefore, methods M2 and M3 are preferable to method M1.
A limitation of this study is that the different mixture models were fitted using the same distribution for the two components. If the components of the mixture model were distinct, this would directly affect the estimated cutoff points and might increase the performance of the methods under consideration. Future research on this topic should be carried out.
In conclusion, we recommend using mixture models based on distributions of the SMSN family to analyze serological data, given the flexibility of these models, together with the proposed M2 or M3 methods for determining cutoff points. These methods proved to be a reliable alternative to the gold standard method based on the 99.9%-quantile (or 3-$\sigma$ rule for the Normal distribution).

Author Contributions

Conceptualization, T.D.-D., H.M. and N.S.; methodology, T.D.-D. and N.S.; software, T.D.-D. and H.M.; validation, T.D.-D., H.M. and N.S.; formal analysis, T.D.-D. and H.M.; investigation, T.D.-D., H.M. and N.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by FCT—Fundação para a Ciência e Tecnologia, Portugal, under the project UIDB/00006/2020. DOI: https://doi.org/10.54499/UIDB/00006/2020.

Data Availability Statement

Data are available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Bayesian Information Criteria (BIC), Sensitivity, Specificity, and Accuracy by Method for Each Antigen

Table A1. BIC values, cutoff value estimates, sensitivity, specificity, and accuracy for each method under study. C denotes the cutoff point estimate.
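| Antigen | Distribution | BIC | M1: C | M1: Sens (%) | M1: Spec (%) | M1: ACC (%) | M2: C | M2: Sens (%) | M2: Spec (%) | M2: ACC (%) | M3: C | M3: Sens (%) | M3: Spec (%) | M3: ACC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RBD | Normal | 953.00 | 2.65 | 84.11 | 97.61 | 92.35 | 2.33 | 90.18 | 95.52 | 93.44 | 2.37 | 88.79 | 95.82 | 93.08 |
| RBD | Skew-Normal | 852.25 | 2.83 | 79.91 | 98.21 | 91.07 | 2.49 | 86.45 | 97.01 | 92.89 | 2.56 | 85.05 | 97.01 | 92.35 |
| RBD | Student t | 959.60 | 4.16 | 0.09 | 100 | 61.38 | 2.34 | 90.18 | 95.52 | 93.44 | 2.38 | 88.79 | 96.42 | 93.44 |
| RBD | Skew-t | 854.78 | 4.80 |  | 100.00 |  | 2.60 | 84.58 | 97.61 | 92.53 | 2.89 | 78.97 | 98.51 | 90.89 |
| S1 | Normal | 561.81 | 2.43 | 63.08 | 97.91 | 84.34 | 2.13 | 81.31 | 95.52 | 89.98 | 2.12 | 82.71 | 95.52 | 90.53 |
| S1 | Skew-Normal | 561.63 | 2.58 | 50.93 | 98.81 | 80.15 | 2.27 | 71.03 | 97.01 | 86.89 | 2.30 | 69.63 | 97.31 | 86.52 |
| S1 | Student t | 568.98 | 3.15 | 15.42 | 100.00 | 67.03 | 2.14 | 80.37 | 95.52 | 89.62 | 2.12 | 82.71 | 95.52 | 90.53 |
| S1 | Skew-t | 568.27 | 3.27 | 10.28 | 100.00 | 65.03 | 2.27 | 71.03 | 97.01 | 86.89 | 2.31 | 69.16 | 97.31 | 86.34 |
| S2 | Normal | 778.76 | 2.66 | 72.89 | 98.51 | 88.52 | 2.23 | 89.72 | 92.23 | 91.26 | 2.24 | 88.32 | 92.84 | 91.07 |
| S2 | Skew-Normal | 775.29 | 2.86 | 56.54 | 99.10 | 82.51 | 2.39 | 83.64 | 95.52 | 90.89 | 2.49 | 80.84 | 96.72 | 90.53 |
| S2 | Student t | 785.73 | 3.51 | 9.35 | 100.00 | 64.66 | 2.24 | 88.32 | 92.84 | 91.07 | 2.25 | 87.38 | 93.13 | 90.89 |
| S2 | Skew-t | 781.75 | 3.72 | 4.21 | 100.00 | 62.66 | 2.39 | 83.64 | 95.52 | 90.89 | 2.50 | 80.37 | 97.01 | 90.53 |
| $S_{tri}$ | Normal | 1010.18 | 2.75 | 87.85 | 97.91 | 93.98 | 2.37 | 91.12 | 94.63 | 93.26 | 2.47 | 90.17 | 94.93 | 93.08 |
| $S_{tri}$ | Skew-Normal | 916.15 | 2.98 | 79.44 | 99.40 | 91.62 | 2.46 | 90.19 | 94.93 | 93.08 | 2.58 | 89.25 | 96.12 | 93.44 |
| $S_{tri}$ | Student t | 1016.84 | 4.34 |  | 100.00 |  | 2.39 | 90.65 | 94.63 | 93.08 | 2.48 | 89.72 | 95.22 | 93.08 |
| $S_{tri}$ | Skew-t | 915.82 | 5.49 |  | 100.00 |  | 2.53 | 89.25 | 96.12 | 93.44 | 2.84 | 85.51 | 98.51 | 93.44 |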

Appendix B. Performance Measures for the Estimated Cutoff Point for Each Antigen

Table A2. SARS-CoV-2 virus antigens: Cutoff point estimates, sensitivity, specificity, accuracy, and area under the curve (AUC) for the empirical ROC curve method.
| Antigen | Cutoff | Sensitivity (%) | Specificity (%) | Accuracy (%) | AUC (CI 95%) |
|---|---|---|---|---|---|
| RBD | 2.15 | 94.39 | 94.33 | 94.35 | 98.50 (97.80, 99.30) |
| S1 | 2.07 | 86.92 | 93.73 | 91.07 | 96.10 (94.60, 97.60) |
| S2 | 2.33 | 86.92 | 94.63 | 91.62 | 94.90 (92.80, 97.00) |
| $S_{tri}$ | 2.81 | 86.92 | 98.51 | 93.98 | 98.30 (97.40, 99.20) |

Appendix C. Simulation Results

Table A3. Relative bias, Mean Squared Error (MSE), and 95% confidence interval (CI) of the 99.9%-quantile method (M1); minimum of mixture densities method (M2); and conditional probability method (M3) considering a mixture of Normal distributions. $opt_{M1}$, $opt_{M2}$, and $opt_{M3}$ denote the theoretical cutoff points for the M1, M2, and M3 methods, respectively. $\pi_1$ denotes the weight of the seronegative population; $c_{M1}$, $c_{M2}$, and $c_{M3}$ denote the cutoffs estimated by the M1, M2, and M3 methods after $N = 1000$ simulations.
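Normal Distribution

| Sample size | $c_{M1}$ | 95% CI (M1) | $c_{M2}$ | 95% CI (M2) | $c_{M3}$ | 95% CI (M3) | R.bias (M1) | MSE (M1) | R.bias (M2) | MSE (M2) | R.bias (M3) | MSE (M3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\pi_1 = 0.3$, $opt_{M1} = 2.33$, $opt_{M2} = 2.24$, $opt_{M3} = 2.30$ | | | | | | | | | | | | |
| n = 50 | 2.68 | (2.05–3.92) | 2.37 | (1.97–2.97) | 2.51 | (1.99–3.43) | 14.92 | 0.34 | 5.56 | 0.07 | 9.12 | 0.17 |
| n = 100 | 2.64 | (2.19–3.34) | 2.34 | (2.08–2.65) | 2.47 | (2.10–2.93) | 13.57 | 0.18 | 4.49 | 0.03 | 7.39 | 0.07 |
| n = 150 | 2.65 | (2.29–3.09) | 2.35 | (2.14–2.56) | 2.48 | (2.19–2.77) | 13.94 | 0.15 | 4.71 | 0.02 | 7.63 | 0.05 |
| n = 200 | 2.66 | (2.35–3.03) | 2.35 | (2.18–2.52) | 2.48 | (2.23–2.73) | 14.20 | 0.14 | 4.79 | 0.02 | 7.81 | 0.05 |
| n = 250 | 2.65 | (2.37–2.97) | 2.35 | (2.19–2.49) | 2.48 | (2.26–2.69) | 13.84 | 0.13 | 4.70 | 0.02 | 7.55 | 0.04 |
| n = 300 | 2.65 | (2.39–2.95) | 2.35 | (2.12–2.49) | 2.47 | (2.29–2.69) | 13.81 | 0.12 | 4.65 | 0.02 | 7.46 | 0.04 |
| n = 350 | 2.65 | (2.42–2.91) | 2.35 | (2.22–2.47) | 2.48 | (2.29–2.65) | 14.01 | 0.12 | 4.72 | 0.02 | 7.58 | 0.04 |
| n = 400 | 2.64 | (2.43–2.88) | 2.34 | (2.23–2.46) | 2.47 | (2.30–2.63) | 13.54 | 0.11 | 4.49 | 0.01 | 7.26 | 0.03 |
| n = 450 | 2.66 | (2.45–2.88) | 2.35 | (2.24–2.47) | 2.48 | (2.33–2.65) | 14.18 | 0.12 | 4.76 | 0.01 | 7.69 | 0.04 |
| n = 500 | 2.65 | (2.48–2.88) | 2.35 | (2.25–2.45) | 2.48 | (2.34–2.64) | 13.92 | 0.12 | 4.71 | 0.01 | 7.54 | 0.04 |
| $\pi_1 = 0.6$, $opt_{M1} = 2.33$, $opt_{M2} = 2.33$, $opt_{M3} = 2.37$ | | | | | | | | | | | | |
| n = 50 | 2.63 | (2.26–3.10) | 2.51 | (2.22–2.84) | 2.59 | (2.23–2.99) | 13.11 | 0.14 | 8.03 | 0.06 | 9.35 | 0.09 |
| n = 100 | 2.64 | (2.35–2.94) | 2.50 | (2.30–2.72) | 2.58 | (2.32–2.86) | 13.23 | 0.12 | 7.66 | 0.04 | 8.94 | 0.06 |
| n = 150 | 2.64 | (2.41–2.89) | 2.51 | (2.35–2.67) | 2.58 | (2.38–2.79) | 13.46 | 0.11 | 7.79 | 0.04 | 9.03 | 0.06 |
| n = 200 | 2.65 | (2.47–2.83) | 2.51 | (2.38–2.64) | 2.58 | (2.41–2.75) | 13.68 | 0.11 | 7.87 | 0.04 | 9.11 | 0.05 |
| n = 250 | 2.65 | (2.48–2.82) | 2.51 | (2.39–2.63) | 2.58 | (2.42–2.74) | 13.67 | 0.11 | 7.80 | 0.04 | 9.12 | 0.05 |
| n = 300 | 2.65 | (2.50–2.81) | 2.51 | (2.40–2.61) | 2.59 | (2.44–2.72) | 13.72 | 0.11 | 7.88 | 0.04 | 9.18 | 0.05 |
| n = 350 | 2.65 | (2.51–2.80) | 2.51 | (2.41–2.61) | 2.59 | (2.46–2.73) | 13.77 | 0.11 | 7.88 | 0.04 | 9.23 | 0.05 |
| n = 400 | 2.65 | (2.53–2.78) | 2.51 | (2.42–2.59) | 2.58 | (2.47–2.71) | 13.65 | 0.11 | 7.75 | 0.03 | 9.06 | 0.05 |
| n = 450 | 2.65 | (2.53–2.78) | 2.51 | (2.42–2.59) | 2.59 | (2.47–2.70) | 13.81 | 0.11 | 7.81 | 0.03 | 9.17 | 0.05 |
| n = 500 | 2.65 | (2.54–2.76) | 2.51 | (2.43–2.59) | 2.59 | (2.48–2.69) | 13.89 | 0.11 | 7.89 | 0.04 | 9.23 | 0.05 |
| $\pi_1 = 0.9$, $opt_{M1} = 2.33$, $opt_{M2} = 2.43$, $opt_{M3} = 2.46$ | | | | | | | | | | | | |
| n = 50 | 2.61 | (2.34–2.89) | 2.75 | (2.22–3.59) | 2.79 | (2.32–3.59) | 12.28 | 0.10 | 13.01 | 0.19 | 13.62 | 0.20 |
| n = 100 | 2.64 | (2.45–2.82) | 2.72 | (2.53–2.99) | 2.76 | (2.50–3.03) | 13.56 | 0.11 | 12.09 | 0.10 | 12.24 | 0.11 |
| n = 150 | 2.64 | (2.49–2.79) | 2.71 | (2.55–2.93) | 2.74 | (2.53–2.96) | 13.49 | 0.10 | 11.47 | 0.09 | 11.46 | 0.09 |
| n = 200 | 2.65 | (2.51–2.78) | 2.71 | (2.57–2.89) | 2.74 | (2.55–2.91) | 13.63 | 0.11 | 11.47 | 0.08 | 11.48 | 0.09 |
| n = 250 | 2.65 | (2.53–2.77) | 2.71 | (2.59–2.86) | 2.74 | (2.58–2.89) | 13.69 | 0.11 | 11.56 | 0.08 | 11.51 | 0.08 |
| n = 300 | 2.65 | (2.54–2.76) | 2.71 | (2.60–2.83) | 2.74 | (2.59–2.87) | 13.64 | 0.10 | 11.45 | 0.08 | 11.38 | 0.08 |
| n = 350 | 2.65 | (2.54–2.75) | 2.71 | (2.61–2.83) | 2.74 | (2.61–2.87) | 13.77 | 0.11 | 11.41 | 0.08 | 11.43 | 0.08 |
| n = 400 | 2.65 | (2.55–2.74) | 2.70 | (2.61–2.81) | 2.73 | (2.61–2.85) | 13.64 | 0.10 | 11.25 | 0.08 | 11.28 | 0.08 |
| n = 450 | 2.65 | (2.56–2.73) | 2.70 | (2.61–2.81) | 2.73 | (2.62–2.85) | 13.71 | 0.10 | 11.26 | 0.08 | 11.31 | 0.08 |
| n = 500 | 2.65 | (2.57–2.73) | 2.70 | (2.62–2.80) | 2.73 | (2.63–2.84) | 13.75 | 0.10 | 11.29 | 0.08 | 11.30 | 0.08 |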
Table A4. Relative bias, Mean Squared Error (MSE), and 95% confidence interval (CI) of the 99.9%-quantile method (M1); minimum of mixture densities method (M2); and conditional probability method (M3) considering a mixture of Skew-Normal distributions. $opt_{M1}$, $opt_{M2}$, and $opt_{M3}$ denote the theoretical cutoff points for the M1, M2, and M3 methods, respectively. $\pi_1$ denotes the weight of the seronegative population; $c_{M1}$, $c_{M2}$, and $c_{M3}$ denote the cutoffs estimated by the M1, M2, and M3 methods after $N = 1000$ simulations.
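Skew-Normal Distribution

| Sample size | $c_{M1}$ | 95% CI (M1) | $c_{M2}$ | 95% CI (M2) | $c_{M3}$ | 95% CI (M3) | R.bias (M1) | MSE (M1) | R.bias (M2) | MSE (M2) | R.bias (M3) | MSE (M3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\pi_1 = 0.3$, $opt_{M1} = 2.60$, $opt_{M2} = 2.33$, $opt_{M3} = 2.44$ | | | | | | | | | | | | |
| n = 50 | 3.14 | (2.20–4.51) | 2.52 | (2.15–2.93) | 2.78 | (2.18–3.48) | 20.83 | 0.68 | 8.35 | 0.08 | 14.05 | 0.24 |
| n = 100 | 3.13 | (2.29–4.28) | 2.50 | (2.23–2.76) | 2.76 | (2.24–3.33) | 20.32 | 0.55 | 7.37 | 0.05 | 13.04 | 0.18 |
| n = 150 | 3.09 | (2.37–4.14) | 2.49 | (2.26–2.73) | 2.73 | (2.28–3.27) | 19.10 | 0.45 | 7.13 | 0.04 | 12.13 | 0.15 |
| n = 200 | 3.08 | (2.39–3.94) | 2.49 | (2.28–2.67) | 2.72 | (2.30–3.16) | 18.56 | 0.39 | 6.98 | 0.04 | 11.74 | 0.13 |
| n = 250 | 3.04 | (2.44–3.83) | 2.48 | (2.29–2.64) | 2.69 | (2.32–3.11) | 16.85 | 0.32 | 6.57 | 0.03 | 10.66 | 0.11 |
| n = 300 | 3.05 | (2.47–3.81) | 2.48 | (2.29–2.64) | 2.70 | (2.36–3.12) | 17.39 | 0.32 | 6.62 | 0.03 | 10.89 | 0.11 |
| n = 350 | 3.05 | (2.51–3.72) | 2.48 | (2.30–2.63) | 2.70 | (2.36–3.07) | 17.29 | 0.29 | 6.46 | 0.03 | 10.78 | 0.10 |
| n = 400 | 3.02 | (2.51–3.72) | 2.48 | (2.32–2.62) | 2.68 | (2.37–3.04) | 16.19 | 0.27 | 6.35 | 0.03 | 10.10 | 0.09 |
| n = 450 | 3.04 | (2.56–3.64) | 2.48 | (2.33–2.61) | 2.69 | (2.40–3.02) | 16.79 | 0.27 | 6.36 | 0.03 | 10.41 | 0.09 |
| n = 500 | 3.04 | (2.56–3.58) | 2.48 | (2.34–2.60) | 2.69 | (2.39–2.99) | 16.77 | 0.26 | 6.31 | 0.03 | 10.37 | 0.09 |
| $\pi_1 = 0.6$, $opt_{M1} = 2.60$, $opt_{M2} = 2.48$, $opt_{M3} = 2.56$ | | | | | | | | | | | | |
| n = 50 | 2.85 | (2.27–3.56) | 2.75 | (2.32–3.19) | 2.76 | (2.25–3.26) | 9.49 | 0.17 | 10.89 | 0.12 | 7.77 | 0.11 |
| n = 100 | 2.85 | (2.39–3.37) | 2.74 | (2.39–3.08) | 2.74 | (2.38–3.09) | 9.67 | 0.12 | 10.38 | 0.10 | 7.32 | 0.07 |
| n = 150 | 2.85 | (2.48–3.24) | 2.75 | (2.44–3.05) | 2.74 | (2.45–3.02) | 9.41 | 0.09 | 10.91 | 0.10 | 6.97 | 0.05 |
| n = 200 | 2.85 | (2.51–3.23) | 2.75 | (2.45–3.01) | 2.73 | (2.46–3.00) | 9.47 | 0.09 | 10.78 | 0.10 | 6.92 | 0.05 |
| n = 250 | 2.84 | (2.55–3.14) | 2.74 | (2.48–2.99) | 2.73 | (2.50–2.96) | 9.08 | 0.08 | 10.44 | 0.09 | 6.56 | 0.04 |
| n = 300 | 2.84 | (2.58–3.12) | 2.75 | (2.49–2.98) | 2.73 | (2.52–2.93) | 9.22 | 0.08 | 10.75 | 0.09 | 6.61 | 0.04 |
| n = 350 | 2.85 | (2.61–3.11) | 2.75 | (2.49–2.98) | 2.73 | (2.53–2.93) | 9.57 | 0.08 | 10.69 | 0.09 | 6.82 | 0.04 |
| n = 400 | 2.84 | (2.60–3.08) | 2.74 | (2.49–2.97) | 2.72 | (2.53–2.90) | 9.02 | 0.07 | 10.49 | 0.09 | 6.37 | 0.04 |
| n = 450 | 2.84 | (2.63–3.08) | 2.74 | (2.50–2.97) | 2.72 | (2.55–2.91) | 9.25 | 0.07 | 10.59 | 0.09 | 6.51 | 0.04 |
| n = 500 | 2.85 | (2.65–3.06) | 2.75 | (2.51–2.96) | 2.73 | (2.56–2.89) | 9.39 | 0.07 | 10.85 | 0.09 | 6.60 | 0.04 |
| $\pi_1 = 0.9$, $opt_{M1} = 2.60$, $opt_{M2} = 2.67$, $opt_{M3} = 2.71$ | | | | | | | | | | | | |
| n = 50 | 2.81 | (2.14–5.07) | 2.85 | (1.64–3.70) | 2.71 | (1.17–3.34) | 8.16 | 0.38 | 6.66 | 0.33 | 0.003 | 0.29 |
| n = 100 | 2.82 | (2.48–3.14) | 3.04 | (2.36–3.69) | 2.86 | (2.49–3.21) | 8.27 | 0.12 | 13.68 | 0.29 | 5.32 | 0.08 |
| n = 150 | 2.79 | (2.55–3.02) | 3.08 | (2.51–3.97) | 2.87 | (2.59–3.17) | 7.35 | 0.05 | 15.41 | 0.32 | 5.81 | 0.08 |
| n = 200 | 2.81 | (2.62–2.99) | 3.07 | (2.59–3.90) | 2.88 | (2.64–3.12) | 7.98 | 0.05 | 14.89 | 0.28 | 6.24 | 0.07 |
| n = 250 | 2.82 | (2.67–2.99) | 3.14 | (2.68–3.98) | 2.91 | (2.68–3.09) | 8.27 | 0.05 | 17.69 | 0.36 | 7.22 | 0.05 |
| n = 300 | 2.81 | (2.66–2.96) | 3.07 | (2.69–3.56) | 2.89 | (2.73–3.07) | 8.01 | 0.05 | 14.87 | 0.25 | 6.62 | 0.04 |
| n = 350 | 2.83 | (2.69–2.96) | 3.13 | (2.73–3.89) | 2.92 | (2.77–3.10) | 8.71 | 0.06 | 17.07 | 0.32 | 7.46 | 0.05 |
| n = 400 | 2.82 | (2.67–2.95) | 3.14 | (2.72–3.96) | 2.90 | (2.73–3.06) | 8.31 | 0.05 | 17.52 | 0.35 | 6.88 | 0.04 |
| n = 450 | 2.81 | (2.67–2.95) | 3.12 | (2.69–3.97) | 2.89 | (2.72–3.05) | 8.16 | 0.05 | 16.88 | 0.34 | 6.60 | 0.04 |
| n = 500 | 2.81 | (2.67–2.94) | 3.14 | (2.71–4.00) | 2.89 | (2.74–3.05) | 8.08 | 0.05 | 17.72 | 0.37 | 6.55 | 0.04 |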
Table A5. Relative bias, Mean Squared Error (MSE), and 95% confidence interval (CI) of the 99.9%-quantile method (M1); minimum of mixture densities method (M2); and conditional probability method (M3) considering a mixture of Student-t distributions. $opt_{M1}$, $opt_{M2}$, and $opt_{M3}$ denote the theoretical cutoff points for the M1, M2, and M3 methods, respectively. $\pi_1$ denotes the weight of the seronegative population; $c_{M1}$, $c_{M2}$, and $c_{M3}$ denote the cutoffs estimated by the M1, M2, and M3 methods after $N = 1000$ simulations.
Student t Distribution

| Sample size | $c_{M1}$ | 95% CI (M1) | $c_{M2}$ | 95% CI (M2) | $c_{M3}$ | 95% CI (M3) | R.bias (M1) | MSE (M1) | R.bias (M2) | MSE (M2) | R.bias (M3) | MSE (M3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\pi_1 = 0.3$, $opt_{M1} = 2.34$, $opt_{M2} = 2.25$, $opt_{M3} = 2.31$ | | | | | | | | | | | | |
| n = 50 | 2.51 | (1.79–4.55) | 2.15 | (1.79–2.67) | 2.24 | (1.78–3.13) | 7.19 | 1.06 | −4.15 | 0.05 | −3.02 | 0.09 |
| n = 100 | 2.47 | (1.98–3.69) | 2.16 | (1.95–2.36) | 2.24 | (1.95–2.54) | 5.36 | 0.19 | −4.09 | 0.02 | −3.20 | 0.03 |
| n = 150 | 2.45 | (2.05–3.19) | 2.16 | (1.99–2.32) | 2.24 | (2.02–2.48) | 4.63 | 0.11 | −3.96 | 0.02 | −3.09 | 0.02 |
| n = 200 | 2.46 | (2.08–3.14) | 2.15 | (2.02–2.28) | 2.24 | (2.04–2.44) | 4.92 | 0.09 | −4.16 | 0.01 | −3.22 | 0.02 |
| n = 250 | 2.48 | (2.13–3.09) | 2.16 | (2.04–2.29) | 2.25 | (2.08–2.46) | 6.00 | 0.08 | −3.80 | 0.01 | −2.66 | 0.01 |
| n = 300 | 2.48 | (2.17–2.97) | 2.16 | (2.05–2.28) | 2.25 | (2.09–2.42) | 5.86 | 0.06 | −3.73 | 0.01 | −2.57 | 0.01 |
| n = 350 | 2.47 | (2.18–2.92) | 2.16 | (2.07–2.25) | 2.24 | (2.11–2.39) | 5.25 | 0.05 | −3.97 | 0.01 | −2.92 | 0.01 |
| n = 400 | 2.48 | (2.18–2.95) | 2.16 | (2.07–2.26) | 2.25 | (2.11–2.39) | 5.93 | 0.06 | −3.85 | 0.01 | −2.69 | 0.009 |
| n = 450 | 2.48 | (2.20–2.87) | 2.16 | (2.08–2.25) | 2.25 | (2.13–2.38) | 5.71 | 0.05 | −3.89 | 0.01 | −2.74 | 0.008 |
| n = 500 | 2.48 | (2.23–2.85) | 2.16 | (2.08–2.24) | 2.25 | (2.13–2.28) | 5.82 | 0.04 | −3.83 | 0.01 | −2.64 | 0.007 |
| $\pi_1 = 0.6$, $opt_{M1} = 2.34$, $opt_{M2} = 2.33$, $opt_{M3} = 2.38$ | | | | | | | | | | | | |
| n = 50 | 3.21 | (1.98–7.92) | 2.29 | (2.01–2.60) | 2.43 | (1.97–2.95) | 36.95 | 4.25 | −1.41 | 0.02 | 2.26 | 0.07 |
| n = 100 | 2.91 | (2.10–5.01) | 2.30 | (2.11–2.49) | 2.44 | (2.08–2.79) | 24.15 | 0.92 | −1.35 | 0.01 | 2.47 | 0.04 |
| n = 150 | 2.93 | (2.13–4.49) | 2.30 | (2.12–2.46) | 2.45 | (2.10–2.73) | 25.03 | 0.81 | −1.31 | 0.008 | 2.86 | 0.03 |
| n = 200 | 2.92 | (2.22–4.15) | 2.31 | (2.17–2.44) | 2.46 | (2.20–2.70) | 24.72 | 0.59 | −1.05 | 0.005 | 3.39 | 0.02 |
| n = 250 | 2.90 | (2.29–3.87) | 2.31 | (2.18–2.43) | 2.46 | (2.23–2.68) | 23.88 | 0.49 | −1.03 | 0.004 | 3.47 | 0.02 |
| n = 300 | 2.90 | (2.30–3.89) | 2.31 | (2.20–2.43) | 2.46 | (2.26–2.68) | 23.79 | 0.47 | −1.02 | 0.004 | 3.49 | 0.02 |
| n = 350 | 2.90 | (2.38–3.78) | 2.31 | (2.21–2.41) | 2.47 | (2.28–2.65) | 23.83 | 0.46 | −1.12 | 0.02 | 3.53 | 0.03 |
| n = 400 | 2.89 | (2.39–3.73) | 2.31 | (2.22–2.39) | 2.47 | (2.29–2.64) | 23.73 | 0.43 | −1.04 | 0.03 | 3.54 | 0.01 |
| n = 450 | 2.88 | (2.39–3.68) | 2.31 | (2.22–2.39) | 2.46 | (2.29–2.62) | 22.88 | 0.39 | −1.04 | 0.002 | 3.52 | 0.01 |
| n = 500 | 2.89 | (2.42–3.63) | 2.31 | (2.23–2.39) | 2.47 | (2.32–2.62) | 23.57 | 0.39 | −0.97 | 0.002 | 3.69 | 0.01 |
| $\pi_1 = 0.9$, $opt_{M1} = 2.34$, $opt_{M2} = 2.44$, $opt_{M3} = 2.48$ | | | | | | | | | | | | |
| n = 50 | 3.72 | (2.09–11.21) | 2.57 | (1.92–3.27) | 2.77 | (1.90–3.61) | 58.71 | 7.92 | 5.10 | 0.18 | 11.94 | 0.34 |
| n = 100 | 3.98 | (2.11–8.04) | 2.63 | (2.20–3.12) | 2.89 | (2.16–3.61) | 69.90 | 8.40 | 7.72 | 0.09 | 16.72 | 0.31 |
| n = 150 | 3.73 | (2.21–6.92) | 2.59 | (2.29–2.94) | 2.87 | (2.22–3.37) | 58.98 | 3.35 | 6.13 | 0.05 | 15.87 | 0.23 |
| n = 200 | 3.87 | (2.35–7.11) | 2.61 | (2.37–2.94) | 2.92 | (2.37–3.39) | 65.12 | 3.67 | 7.09 | 0.06 | 18.01 | 0.28 |
| n = 250 | 3.83 | (2.39–6.06) | 2.60 | (2.38–2.86) | 2.91 | (2.38–3.34) | 63.32 | 3.16 | 6.66 | 0.05 | 17.74 | 0.25 |
| n = 300 | 3.81 | (2.45–5.85) | 2.60 | (2.41–2.85) | 2.92 | (2.39–3.29) | 62.54 | 2.99 | 6.67 | 0.04 | 17.82 | 0.24 |
| n = 350 | 3.87 | (2.62–5.71) | 2.60 | (2.45–2.82) | 2.93 | (2.59–3.27) | 65.22 | 4.56 | 6.65 | 0.07 | 18.26 | 0.27 |
| n = 400 | 3.79 | (2.69–5.49) | 2.60 | (2.44–2.80) | 2.93 | (2.58–3.27) | 62.09 | 2.65 | 6.66 | 0.04 | 18.29 | 0.24 |
| n = 450 | 3.78 | (2.75–5.29) | 2.59 | (2.46–2.77) | 2.92 | (2.62–3.20) | 61.28 | 2.51 | 6.43 | 0.03 | 18.06 | 0.22 |
| n = 500 | 3.76 | (2.81–5.22) | 2.59 | (2.47–2.77) | 2.93 | (2.67–3.23) | 60.43 | 2.41 | 6.46 | 0.03 | 18.21 | 0.23 |
Table A6. Relative bias, Mean Squared Error (MSE), and 95% confidence interval (CI) of the 99.9%-quantile method (M1); minimum of mixture densities method (M2); and conditional probability method (M3) considering a mixture of Skew-t distributions. $opt_{M1}$, $opt_{M2}$, and $opt_{M3}$ denote the theoretical cutoff points for the M1, M2, and M3 methods, respectively. $\pi_1$ denotes the weight of the seronegative population; $c_{M1}$, $c_{M2}$, and $c_{M3}$ denote the cutoffs estimated by the M1, M2, and M3 methods after $N = 1000$ simulations.
Skew-t Distribution
| Sample Size | cM1 | 95% CI (M1) | cM2 | 95% CI (M2) | cM3 | 95% CI (M3) | R.bias (M1) | MSE (M1) | R.bias (M2) | MSE (M2) | R.bias (M3) | MSE (M3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| π1 = 0.3, optM1 = 3.71, optM2 = 2.38, optM3 = 2.64 | | | | | | | | | | | | |
| n = 50 | 3.08 | (1.93–5.14) | 2.37 | (1.93–2.93) | 2.55 | (1.93–3.36) | −16.90 | 2.51 | −8.96 | 0.11 | −11.85 | 0.24 |
| n = 100 | 3.09 | (2.11–5.81) | 2.34 | (2.04–2.68) | 2.52 | (2.07–3.14) | −16.41 | 1.56 | −9.94 | 0.09 | −12.89 | 0.22 |
| n = 150 | 3.05 | (2.16–5.73) | 2.33 | (2.09–2.60) | 2.51 | (2.12–3.05) | −17.78 | 1.27 | −10.42 | 0.09 | −13.51 | 0.21 |
| n = 200 | 3.05 | (2.19–5.49) | 2.32 | (2.09–2.57) | 2.49 | (2.13–2.98) | −17.65 | 1.11 | −10.69 | 0.09 | −13.81 | 0.21 |
| n = 250 | 3.01 | (2.24–5.15) | 2.31 | (2.12–2.55) | 2.48 | (2.16–2.95) | −18.84 | 1.05 | −11.21 | 0.09 | −14.53 | 0.21 |
| n = 300 | 3.03 | (2.28–5.12) | 2.32 | (2.13–2.53) | 2.49 | (2.18–2.91) | −18.21 | 0.94 | −11.02 | 0.09 | −14.21 | 0.20 |
| n = 350 | 3.02 | (2.28–4.93) | 2.31 | (2.14–2.52) | 2.48 | (2.18–2.89) | −18.58 | 0.93 | −11.12 | 0.09 | −14.34 | 0.20 |
| n = 400 | 3.02 | (2.30–4.75) | 2.31 | (2.14–2.51) | 2.48 | (2.19–2.86) | −18.58 | 0.89 | −11.10 | 0.09 | −14.30 | 0.20 |
| n = 450 | 3.06 | (2.34–4.86) | 2.32 | (2.17–2.50) | 2.49 | (2.22–2.86) | −17.36 | 0.82 | −10.9 | 0.09 | −13.93 | 0.19 |
| n = 500 | 3.03 | (2.37–4.83) | 2.31 | (2.17–2.49) | 2.49 | (2.24–2.85) | −18.18 | 0.81 | −11.12 | 0.09 | −14.23 | 0.19 |
| π1 = 0.6, optM1 = 3.71, optM2 = 2.58, optM3 = 2.87 | | | | | | | | | | | | |
| n = 50 | 3.84 | (2.09–9.23) | 2.49 | (2.10–2.93) | 2.69 | (2.10–3.36) | 3.57 | 17.61 | −3.46 | 0.06 | −6.28 | 0.15 |
| n = 100 | 3.52 | (2.17–8.77) | 2.47 | (2.16–2.77) | 2.67 | (2.17–3.23) | −4.89 | 2.89 | −4.39 | 0.04 | −7.05 | 0.12 |
| n = 150 | 3.49 | (2.23–7.43) | 2.47 | (2.21–2.73) | 2.68 | (2.24–3.18) | −5.88 | 2.02 | −4.24 | 0.03 | −6.72 | 0.10 |
| n = 200 | 3.43 | (2.31–6.72) | 2.47 | (2.26–2.70) | 2.68 | (2.29–3.14) | −7.52 | 1.41 | −4.17 | 0.02 | −6.60 | 0.08 |
| n = 250 | 3.41 | (2.31–6.31) | 2.47 | (2.26–2.66) | 2.68 | (2.29–3.07) | −7.99 | 1.21 | −4.09 | 0.02 | −6.49 | 0.08 |
| n = 300 | 3.44 | (2.37–6.06) | 2.47 | (2.29–2.66) | 2.69 | (2.34–3.08) | −7.14 | 1.07 | −4.97 | 0.03 | −7.16 | 0.08 |
| n = 350 | 3.43 | (2.39–5.78) | 2.48 | (2.30–2.64) | 2.69 | (2.35–3.03) | −7.45 | 0.82 | −4.95 | 0.02 | −7.10 | 0.07 |
| n = 400 | 3.45 | (2.41–5.58) | 2.48 | (2.31–2.64) | 2.71 | (2.36–3.04) | −6.91 | 0.77 | −4.64 | 0.02 | −6.62 | 0.07 |
| n = 450 | 3.44 | (2.47–5.55) | 2.48 | (2.33–2.63) | 2.71 | (2.41–3.03) | −7.08 | 0.67 | −4.61 | 0.02 | −6.53 | 0.06 |
| n = 500 | 3.42 | (2.47–5.38) | 2.48 | (2.32–2.63) | 2.70 | (2.39–3.01) | −7.68 | 0.66 | −4.79 | 0.02 | −6.83 | 0.06 |
| π1 = 0.9, optM1 = 3.71, optM2 = 2.88, optM3 = 3.23 | | | | | | | | | | | | |
| n = 50 | 4.15 | (2.16–10.27) | 2.74 | (2.06–3.38) | 2.96 | (2.09–3.64) | 12.05 | 15.18 | −4.97 | 0.15 | −8.42 | 0.28 |
| n = 100 | 4.08 | (2.25–9.52) | 2.75 | (1.98–3.33) | 2.98 | (2.09–3.69) | 10.36 | 3.27 | −4.61 | 0.11 | −7.88 | 0.45 |
| n = 150 | 4.21 | (2.25–10.14) | 2.75 | (2.21–3.24) | 3.04 | (2.16–3.73) | 13.53 | 3.55 | −4.56 | 0.09 | −6.02 | 0.29 |
| n = 200 | 4.09 | (2.35–8.18) | 2.76 | (2.36–3.25) | 3.04 | (2.32–3.72) | 10.53 | 2.20 | −4.06 | 0.07 | −5.89 | 0.32 |
| n = 250 | 4.32 | (2.39–8.52) | 2.81 | (2.35–3.13) | 3.07 | (2.38–3.68) | 16.64 | 2.19 | −2.61 | 0.04 | −4.86 | 1.58 |
| n = 300 | 4.52 | (2.38–8.19) | 2.82 | (2.42–3.17) | 3.19 | (2.41–3.68) | 22.05 | 3.12 | −1.94 | 0.04 | −1.36 | 0.16 |
| n = 350 | 4.49 | (2.46–7.49) | 2.82 | (2.45–3.10) | 3.07 | (2.37–3.62) | 21.15 | 2.45 | −2.06 | 0.03 | −4.87 | 6.03 |
| n = 400 | 4.52 | (2.53–8.22) | 2.88 | (2.46–3.08) | 3.22 | (2.45–3.65) | 22.01 | 3.31 | −0.18 | 3.42 | −0.29 | 3.77 |
| n = 450 | 4.57 | (2.64–8.01) | 2.84 | (2.52–3.06) | 3.21 | (2.54–3.62) | 23.27 | 2.41 | −1.56 | 0.02 | −0.64 | 0.23 |
| n = 500 | 4.50 | (2.56–7.14) | 2.83 | (2.48–3.06) | 3.21 | (2.48–3.61) | 21.54 | 2.07 | −1.85 | 0.02 | −0.62 | 0.14 |
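
As a reading aid for the R.bias and MSE columns, the following minimal sketch computes both summaries from a set of simulated cutoff estimates. It assumes the usual definitions (relative bias expressed as a percentage of the theoretical cutoff, and MSE as the mean squared deviation from it), which may differ in detail from those used in the simulation study. The function name `relative_bias_and_mse` and the toy estimates are hypothetical and are not taken from the article.

```python
from statistics import fmean


def relative_bias_and_mse(estimates, true_cutoff):
    """Summarise Monte Carlo cutoff estimates against a theoretical cutoff.

    Assumed definitions (not quoted from the article):
      relative bias (%) = 100 * (mean(estimates) - true_cutoff) / true_cutoff
      MSE               = mean((estimate - true_cutoff) ** 2)
    """
    mean_estimate = fmean(estimates)
    rel_bias = 100.0 * (mean_estimate - true_cutoff) / true_cutoff
    mse = fmean([(c - true_cutoff) ** 2 for c in estimates])
    return rel_bias, mse


# Toy example: four hypothetical replicate estimates of a cutoff whose
# theoretical value is 2.0 (in the study, N = 1000 replicates were used).
toy_estimates = [1.8, 2.0, 2.2, 2.4]
rel_bias, mse = relative_bias_and_mse(toy_estimates, true_cutoff=2.0)
print(f"relative bias = {rel_bias:.2f}%, MSE = {mse:.3f}")
```

A negative relative bias under these definitions means that the method tends to place the cutoff below its theoretical value, while a large MSE can arise from either bias or high variability across replicates.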

References

  1. Dávila, V.H.L.; Cabral, C.R.B.; Zeller, C.B. Finite Mixture of Skewed Distributions; Springer: Cham, Switzerland, 2018. [Google Scholar]
  2. Lin, T.I.; Lee, J.C.; Yen, S.Y. Finite mixture modelling using the Skew-Normal distribution. Stat. Sin. 2007, 17, 909–927. [Google Scholar]
  3. Govaert, G.; Nadif, M. Clustering with block mixture models. Pattern Recognit. 2003, 36, 463–473. [Google Scholar] [CrossRef]
  4. Melnykov, V.; Wang, Y. Conditional mixture modeling and model-based clustering. Pattern Recognit. 2023, 133, 108994. [Google Scholar] [CrossRef]
  5. De Nicola, G.; Sischka, B.; Kauermann, G. Mixture models and networks: The stochastic blockmodel. Stat. Model. 2022, 22, 67–94. [Google Scholar] [CrossRef]
  6. Wine, Y.; Horton, A.P.; Ippolito, G.C.; Georgiou, G. Serology in the 21st Century: The Molecular-Level Analysis of the Serum Antibody Repertoire. Curr. Opin. Immunol. 2015, 35, 89–97. [Google Scholar] [CrossRef] [PubMed]
  7. Rosado, J.; Pelleau, S.; Cockram, C.; Merkling, S.H.; Nekkab, N.; Demeret, C.; Meola, A.; Kerneis, S.; Terrier, B.; Fafi-Kremer, S.; et al. Multiplex assays for the identification of serological signatures of SARS-CoV-2 infection: An antibody-based diagnostic and machine learning study. Lancet Microbe 2020, 2, E60–E69. [Google Scholar] [CrossRef] [PubMed]
  8. Domingues, T.; Mouriño, H.; Sepúlveda, N. Analysis of antibody data using Finite Mixture Models based on Scale Mixtures of Skew-Normal distributions. medRxiv 2021. [Google Scholar] [CrossRef]
  9. Parker, R.A.; Erdman, D.D.; Anderson, L.J. Use of mixture models in determining laboratory criterion for identification of seropositive individuals: Application to parvovirus B19 serology. J. Virol. Methods 1990, 27, 135–144. [Google Scholar] [CrossRef]
  10. Kafatos, G.; Andrews, N.J.; McConway, K.J.; Maple, P.A.; Brown, K.; Farrington, C.P. Is it appropriate to use fixed assay cut-offs for estimating seroprevalence? Epidemiol. Infect. 2016, 144, 887–895. [Google Scholar] [CrossRef]
  11. Ridge, S.E.; Vizard, A.L. Determination of the optimal cutoff value for a serological assay: An example using the Johne’s Absorbed EIA. J. Clin. Microbiol. 1993, 31, 1256–1261. [Google Scholar] [CrossRef]
  12. Maple, P.A.C.; Simms, I.; Kafatos, G.; Solomou, M.; Fenton, K. Application of a noninvasive oral fluid test for detection of treponemal IgG in a predominantly HIV-infected population. Eur. J. Clin. Microbiol. Infect. Dis. 2006, 25, 743–749. [Google Scholar] [CrossRef] [PubMed]
  13. Tong, D.D.; Buxser, S.; Vidmar, T.J. Application of a mixture model for determining the cutoff threshold for activity in high-throughput screening. Comput. Stat. Data Anal. 2007, 51, 4002–4012. [Google Scholar] [CrossRef]
  14. Baughman, A.L.; Bisgard, K.M.; Lynn, F.; Meade, B.D. Mixture model analysis for establishing a diagnostic cut-off point for pertussis antibody levels. Stat. Med. 2006, 25, 2994–3010. [Google Scholar] [CrossRef] [PubMed]
  15. Silva, J.; Prata, S.; Domingues, T.D.; Leal, R.O.; Nunes, T.; Tavares, L.; Almeida, V.; Sepúlveda, N.; Gil, S. Detection and modeling of anti-Leptospira IgG prevalence in cats from Lisbon area and its correlation to retroviral infections, lifestyle, clinical and hematologic changes. Vet. Anim. Sci. 2020, 10, 100144. [Google Scholar] [CrossRef] [PubMed]
  16. Domingues, T.D.; Mouriño, H.; Sepúlveda, N. A statistical analysis of serological data from the UK myalgic encephalomyelitis/chronic fatigue syndrome biobank. AIP Conf. Proc. 2020, 2293, 420099. [Google Scholar]
  17. Hasibi, M.; Jafari, M.S.; Mortazavi, H.; Asadollahi, M.; Djavid, G.E. Determination of the accuracy and optimal cut-off point for ELISA test in diagnosis of human brucellosis in Iran. Acta Medica Iran. 2013, 51, 687–692. [Google Scholar]
  18. Rota, M.M.; Antolini, L. Finding the optimal cut-point for Gaussian and Gamma distributed biomarkers. Comput. Stat. Data Anal. 2014, 69, 1–14. [Google Scholar] [CrossRef]
  19. Habibzadeh, F.; Habibzadeh, P.; Yadollahie, M. On determining the most appropriate test cut-off value: The case of tests with continuous results. Biochem. Medica 2016, 26, 297–307. [Google Scholar] [CrossRef]
  20. Blacksell, S.; Lim, C.; Tanganuchitcharnchai, A.; Jintaworn, S.; Kantipong, P.; Richards, A.L.; Paris, D.H.; Limmathurotsakul, D.; Day, N. Optimal cutoff and accuracy of an IgM enzyme-linked immunosorbent assay for diagnosis of acute scrub typhus in northern Thailand: An alternative reference method to the IgM immunofluorescence assay. J. Clin. Microbiol. 2016, 54, 1472–1478. [Google Scholar] [CrossRef]
  21. Perkins, N.J.; Schisterman, E.F. The inconsistency of “optimal” cut-points using two ROC based criteria. Am. J. Epidemiol. 2006, 163, 670–675. [Google Scholar] [CrossRef]
  22. Unal, I. Defining an optimal cut-point value in ROC analysis: An alternative approach. Comput. Math. Methods Med. 2017, 2017, 3762651. [Google Scholar] [CrossRef] [PubMed]
  23. Migchelsen, S.J.; Martin, D.L.; Southisombath, K.; Turyaguma, P.; Heggen, A.; Rubangakene, P.P.; Joof, H.; Makalo, P.; Cooley, G.; Gwyn, S.; et al. Defining Seropositivity Thresholds for Use in Trachoma Elimination Studies. PLoS Neglected Trop. Dis. 2017, 11, e0005230. [Google Scholar] [CrossRef] [PubMed]
  24. Gay, N.J. Analysis of serological surveys using mixture models: Application to a survey of parvovirus B19. Stat. Med. 1996, 15, 1567–1573. [Google Scholar] [CrossRef]
  25. Azzalini, A. The Skew-Normal and Related Families; Cambridge University Press: Cambridge, UK, 2014. [Google Scholar]
  26. Basso, R.M.; Lachos, V.H.; Cabral, C.R.B.; Gosh, P. Robust mixture modelling based on scale mixtures of skew-normal distributions. Comput. Stat. Data Anal. 2010, 54, 2926–2941. [Google Scholar] [CrossRef]
  27. Domingues, T.; Mouriño, H.; Sepúlveda, N. Analysis of antibody data using Skew-Normal and Skew-t mixture models. REVSTAT-Stat. J. (Fourthcoming) 2022. Available online: https://revstat.ine.pt/index.php/REVSTAT/article/view/455 (accessed on 24 November 2023).
  28. Dempster, A.P.; Rubin, D.B. Maximum likelihood estimation from incomplete data via the EM algorithm. J. R. Stat. Soc. 1977, 39, 1–38. [Google Scholar]
  29. Meng, X.L.; Rubin, D.B. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 1993, 80, 267–278. [Google Scholar] [CrossRef]
  30. Liu, C.; Rubin, D.B. The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 1994, 81, 633–648. [Google Scholar] [CrossRef]
  31. McLachlan, G.J.; Krishnan, T. The EM Algorithm and Extensions; John Wiley & Sons: Hoboken, NJ, USA, 2008. [Google Scholar]
  32. Sepúlveda, N.; Stresman, G.; White, M.T.; Drakeley, C.J. Current Mathematical Models for Analyzing Anti-Malarial Antibody Data with an Eye to Malaria Elimination and Eradication. J. Immunol. Res. 2015, 10, 738030. [Google Scholar] [CrossRef]
  33. Saraswati, K.; Phanichkrivalkosil, M.; Day, N.; Blacksell, S.D. The validity of diagnostic cut-offs for commercial and in-house scrub typhus IgM and IgG ELISAs: A review of the evidence. PLoS Neglected Trop. Dis. 2019, 13, e0007158. [Google Scholar] [CrossRef]
  34. Brent, R.P. Algorithms for Minimization Without Derivatives; Prentice-Hall: Hoboken, NJ, USA, 1973; pp. 73–76. [Google Scholar]
  35. Prates, M.O.; Lachos, V.H.; Cabral, C. Fitting finite mixture of scale mixture of skew-normal distributions. J. Stat. Softw. 2013, 54, 1–20. [Google Scholar] [CrossRef]
  36. Wolodzko, T. Additional Univariate and Multivariate Distributions. R CRAN. 2020. Available online: https://github.com/twolodzko/extraDistr (accessed on 24 November 2023).
  37. Azzalini, A. The Skew-Normal and Related Distributions Such as the Skew-t. R CRAN. 2020. Available online: http://azzalini.stat.unipd.it/SN/ (accessed on 24 November 2023).
  38. Meeker, W.Q.; Han, G.J.; Escobar, L.A. Statistical Intervals: A Guide for Practitioners and Researchers; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2017. [Google Scholar]
  39. Stringhini, S.; Wisniak, A.; Piumatti, G.; Azman, A.; Lauer, S.; Baysson, H.; Ridder, D.; Petrovic, D.; Schrempft, S.; Marcus, K.; et al. Seroprevalence of anti-SARS-CoV-2 IgG antibodies in Geneva, Switzerland (SEROCoV-POP): A population-based study. Lancet 2020, 396, 313–319. [Google Scholar] [CrossRef] [PubMed]
  40. Larremore, D.; Fosdick, B.; Bubar, K.; Zhang, S.; Kissler, S.; Metcalf, C.; Buckee, C.; Grad, Y. Estimating SARS-CoV-2 seroprevalence and epidemiological parameters with uncertainty from serological surveys. Elife 2021, 10, e64206. [Google Scholar] [CrossRef] [PubMed]
  41. López-Ratón, M.; Rodríguez-Álvarez, M.X.; Cadarso-Suárez, C.; Gude-Sampedro, F. OptimalCutpoints: An R Package for Selecting Optimal Cutpoints in Diagnostic Tests. J. Stat. Softw. 2014, 61, 1–36. [Google Scholar] [CrossRef]
Figure 1. (A) Skew-Normal distribution considering different values for the skewness parameter, α = 3, 5, 10, 50, showing positive skew. (B) Skew-t distribution considering different values for the skewness parameter, α = 3, 5, 10, 50, showing negative skew.
Figure 2. Results from the simulation study: Relative bias of the cutoff points for methods M1, M2, and M3 considering π1 = 0.3; sample sizes vary between n = 50 and 500, in steps of 50. (A) Mixture of Normal distributions. (B) Mixture of Skew-Normal distributions. (C) Mixture of Student-t distributions. (D) Mixture of Skew-t distributions.
Figure 3. Results from the simulation study: Relative bias of the cutoff points for methods M1, M2, and M3 considering π1 = 0.6; sample sizes vary between n = 50 and 500, in steps of 50. (A) Mixture of Normal distributions. (B) Mixture of Skew-Normal distributions. (C) Mixture of Student-t distributions. (D) Mixture of Skew-t distributions.
Figure 4. Results from the simulation study: Relative bias of the cutoff points for methods M1, M2, and M3 considering π1 = 0.9; sample sizes vary between n = 50 and 500, in steps of 50. (A) Mixture of Normal distributions. (B) Mixture of Skew-Normal distributions. (C) Mixture of Student-t distributions. (D) Mixture of Skew-t distributions.
Figure 5. Results from the simulation study: Estimated MSE of the cutoff points for methods M1, M2, and M3 considering π1 = 0.3; sample sizes vary between n = 50 and 500, in steps of 50. (A) Mixture of Normal distributions. (B) Mixture of Skew-Normal distributions. (C) Mixture of Student-t distributions. (D) Mixture of Skew-t distributions.
Figure 6. Results from the simulation study: Estimated MSE of the cutoff points for methods M1, M2, and M3 considering π1 = 0.6; sample sizes vary between n = 50 and 500, in steps of 50. (A) Mixture of Normal distributions. (B) Mixture of Skew-Normal distributions. (C) Mixture of Student-t distributions. (D) Mixture of Skew-t distributions.
Figure 7. Results from the simulation study: Estimated MSE of the cutoff points for methods M1, M2, and M3 considering π1 = 0.9; sample sizes vary between n = 50 and 500, in steps of 50. (A) Mixture of Normal distributions. (B) Mixture of Skew-Normal distributions. (C) Mixture of Student-t distributions. (D) Mixture of Skew-t distributions.
Figure 8. Violin plots of the antibody concentration by infection status. (A) RBD antigen. (B) S1 antigen. (C) S2 antigen. (D) S_tri antigen. Number of negative individuals: 335; number of positive individuals: 214. The antibody concentration on the y-axis is given in log10 units.
Figure 9. Histogram of the antibody concentration data by antigen. Overlaid on the histograms are the seropositive estimated density functions (red lines), the seronegative estimated density functions (green lines), and the estimated two-component mixture density (blue lines). The vertical dotted lines correspond to the cutoff points based on methods M1 (gray), M2 (black), and M3 (orange). (A) RBD antigen. (B) S1 antigen. (C) S2 antigen. (D) S_tri antigen. The antibody concentration on the x-axis is given in log10 units.
Figure 10. Classification performance of the methods M1 to M4 by antigen: (A) Sensitivity. (B) Specificity. (C) Accuracy. The measures for methods M1 to M3 are based on the best-fitted model for each antigen and are detailed in Appendix A Table A1; method M4 relies on information in Appendix B Table A2.
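
The three performance measures reported in Figure 10 can be illustrated with a short sketch. It assumes that an individual is classified as seropositive when the (log10) antibody concentration is at or above the cutoff, which is a convention adopted here for illustration. The function name `classification_metrics` and the toy data are hypothetical and unrelated to the SARS-CoV-2 dataset.

```python
def classification_metrics(values, is_positive, cutoff):
    """Sensitivity, specificity and accuracy of the rule `value >= cutoff`.

    values      : antibody concentrations (e.g., in log10 units)
    is_positive : known serological status (True = seropositive)
    cutoff      : threshold; values at or above it are classified as positive
                  (assumed convention)
    """
    tp = fp = tn = fn = 0
    for value, truth in zip(values, is_positive):
        predicted = value >= cutoff
        if predicted and truth:
            tp += 1          # true positive
        elif predicted and not truth:
            fp += 1          # false positive
        elif not predicted and truth:
            fn += 1          # false negative
        else:
            tn += 1          # true negative
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Hypothetical toy data: three known negatives followed by three known positives.
toy_values = [1.2, 1.8, 2.1, 2.9, 3.4, 2.0]
toy_status = [False, False, False, True, True, True]
sens, spec, acc = classification_metrics(toy_values, toy_status, cutoff=2.05)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}, accuracy = {acc:.3f}")
```

Raising the cutoff in such a rule can only keep or increase specificity and keep or decrease sensitivity, which is the trade-off to bear in mind when comparing the cutoffs produced by the different methods.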
Table 1. Simulation scenarios and theoretical parameter values for seronegative and seropositive populations. Columns μ1 to ν1 refer to the seronegative population; columns μ2 to ν2 refer to the seropositive population.
| Distribution | μ1 | σ1 | α1 | ν1 | μ2 | σ2 | α2 | ν2 |
|---|---|---|---|---|---|---|---|---|
| Normal | 1.72 | 0.30 | 0.00 | | 3.35 | 0.60 | 0.00 | |
| Skew-Normal | 1.41 | 0.40 | 5.77 | | 4.09 | 0.90 | −9.12 | |
| Student-t | 1.65 | 0.10 | 0.00 | 3.00 | 3.35 | 0.60 | 0.00 | 3.00 |
| Skew-t | 1.46 | 0.20 | 3.64 | 2.91 | 4.08 | 0.90 | −7.93 | 18.07 |
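
To make the role of the skewness parameter α concrete, the sketch below evaluates the Skew-Normal density in Azzalini's form, f(x) = (2/σ) φ(z) Φ(αz) with z = (x − μ)/σ, for the seronegative Skew-Normal scenario. It assumes that μ, σ, and α in Table 1 are the location, scale, and shape parameters of this direct parameterization; the function name `skew_normal_pdf` and the integration grid are hypothetical choices made for illustration.

```python
import math


def skew_normal_pdf(x, mu, sigma, alpha):
    """Skew-Normal density: (2 / sigma) * phi(z) * Phi(alpha * z), z = (x - mu) / sigma."""
    z = (x - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)    # standard normal pdf at z
    Phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))   # standard normal cdf at alpha * z
    return 2.0 / sigma * phi * Phi


# Seronegative Skew-Normal scenario (assumed to be location, scale, shape).
mu1, sigma1, alpha1 = 1.41, 0.40, 5.77

# Crude Riemann sum over [mu1 - 2, mu1 + 4] to check that the density
# integrates to approximately 1.
step = 0.001
grid = [mu1 - 2.0 + i * step for i in range(6001)]
area = sum(skew_normal_pdf(x, mu1, sigma1, alpha1) for x in grid) * step

# With alpha > 0, most of the probability mass lies to the right of mu.
right_mass = sum(skew_normal_pdf(x, mu1, sigma1, alpha1) for x in grid if x > mu1) * step
print(f"area = {area:.3f}, mass to the right of mu = {right_mass:.3f}")
```

Setting α = 0 in this function recovers the Normal density, which is why the Normal and Student-t rows of the table list α = 0.00.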
Table 2. Simulation study: Theoretical cutoff values considered for each method by mixture distribution and seronegative weight.
| Distribution | π1 | M1 | M2 | M3 |
|---|---|---|---|---|
| Mixture of Normals | 0.3 | 2.33 | 2.24 | 2.30 |
| Mixture of Normals | 0.6 | 2.33 | 2.33 | 2.37 |
| Mixture of Normals | 0.9 | 2.33 | 2.43 | 2.46 |
| Mixture of Student-t | 0.3 | 2.34 | 2.25 | 2.31 |
| Mixture of Student-t | 0.6 | 2.34 | 2.33 | 2.38 |
| Mixture of Student-t | 0.9 | 2.34 | 2.44 | 2.48 |
| Mixture of Skew-Normals | 0.3 | 2.60 | 2.33 | 2.44 |
| Mixture of Skew-Normals | 0.6 | 2.60 | 2.48 | 2.56 |
| Mixture of Skew-Normals | 0.9 | 2.60 | 2.67 | 2.71 |
| Mixture of Skew-t | 0.3 | 3.71 | 2.38 | 2.64 |
| Mixture of Skew-t | 0.6 | 3.71 | 2.58 | 2.87 |
| Mixture of Skew-t | 0.9 | 3.71 | 2.88 | 3.23 |
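
The idea behind the minimum of mixture densities method (M2) can be sketched as a search for the lowest point of the two-component mixture density between the two component modes. The example below is only an illustration under stated assumptions: it uses a plain grid search rather than the optimization routine of the original study, it treats the Skew-Normal parameters of Table 1 as location, scale, and shape, and all function names are hypothetical. Its output should therefore not be expected to coincide exactly with the tabulated values.

```python
import math


def skew_normal_pdf(x, mu, sigma, alpha):
    """Skew-Normal density: (2 / sigma) * phi(z) * Phi(alpha * z)."""
    z = (x - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))
    return 2.0 / sigma * phi * Phi


def mixture_pdf(x, pi1, neg, pos):
    """Two-component mixture density; pi1 is the seronegative weight."""
    return pi1 * skew_normal_pdf(x, *neg) + (1.0 - pi1) * skew_normal_pdf(x, *pos)


def density_minimum_cutoff(pi1, neg, pos, n_grid=5001):
    """Grid-search sketch of an M2-type cutoff.

    Assumes the seronegative component lies to the left of the seropositive
    one and that both component modes fall between the two location
    parameters (true for a right-skewed negative and a left-skewed positive
    component).
    """
    lo, hi = neg[0], pos[0]
    step = (hi - lo) / (n_grid - 1)
    grid = [lo + i * step for i in range(n_grid)]
    mode_neg = max(grid, key=lambda x: skew_normal_pdf(x, *neg))
    mode_pos = max(grid, key=lambda x: skew_normal_pdf(x, *pos))
    valley = [x for x in grid if mode_neg <= x <= mode_pos]
    return min(valley, key=lambda x: mixture_pdf(x, pi1, neg, pos))


# Skew-Normal scenario of Table 1: (mu, sigma, alpha) for each population.
neg = (1.41, 0.40, 5.77)
pos = (4.09, 0.90, -9.12)
cutoffs = {pi1: density_minimum_cutoff(pi1, neg, pos) for pi1 in (0.3, 0.6, 0.9)}
for pi1, cutoff in cutoffs.items():
    print(f"pi1 = {pi1}: density minimum near {cutoff:.2f}")
```

The sketch reproduces the qualitative pattern of the M2 column, in which the cutoff moves to the right as the seronegative weight π1 increases, whereas a quantile of the seronegative component alone (as in M1) does not depend on π1.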
Table 3. Results from fitting a two-component mixture model to the antibody concentration by antigen: Estimated parameters for the best model according to the BIC.
| Antigen | Distribution | Seronegative μ | Seronegative σ² | Seronegative α | Seronegative ν | Seropositive μ | Seropositive σ² | Seropositive α | Seropositive ν |
|---|---|---|---|---|---|---|---|---|---|
| RBD | Skew-Normal | 1.435 | 0.125 | 6.318 | | 4.077 | 0.767 | −7.634 | |
| S1 | Skew-Normal | 1.569 | 0.062 | 2.687 | | 2.339 | 0.321 | 1.062 | |
| S2 | Skew-Normal | 1.583 | 0.096 | 2.804 | | 2.817 | 0.212 | 0.450 | |
| S_tri | Skew-t | 1.352 | 0.121 | 5.751 | 4.873 | 3.885 | 0.367 | −6.482 | 4.873 |
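
Model selection by the BIC can be illustrated with the standard formula BIC = k ln(n) − 2 ln(L), where k is the number of free parameters, n the sample size, and L the maximized likelihood; the model with the lowest BIC is preferred. In the sketch below, the log-likelihood values and the parameter counts (7 for a two-component Skew-Normal mixture, 8 for a Skew-t mixture with a common ν) are hypothetical assumptions made for illustration and are not results from the article; n = 549 is assumed to be the total of the 335 negative and 214 positive individuals.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L); lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical maximised log-likelihoods for two candidate mixtures fitted to
# the same n = 549 observations (assumed: 335 negative + 214 positive).
n_obs = 549
candidates = {
    "Skew-Normal mixture": {"loglik": -410.0, "n_params": 7},
    "Skew-t mixture": {"loglik": -408.5, "n_params": 8},
}
scores = {name: bic(c["loglik"], c["n_params"], n_obs) for name, c in candidates.items()}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: BIC = {score:.2f}")
print("preferred model:", best)
```

In this toy comparison the extra parameter of the Skew-t mixture improves the log-likelihood by less than the penalty ln(n)/2 per parameter, so the simpler Skew-Normal mixture is retained.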
