Serology Assays Used in SARS-CoV-2 Seroprevalence Surveys Worldwide: A Systematic Review and Meta-Analysis of Assay Features, Testing Algorithms, and Performance

Background: Many serological assays to detect SARS-CoV-2 antibodies were developed during the COVID-19 pandemic. Differences in the detection mechanism of SARS-CoV-2 serological assays limited the comparability of seroprevalence estimates for populations being tested. Methods: We conducted a systematic review and meta-analysis of serological assays used in SARS-CoV-2 population seroprevalence surveys, searching for published articles, preprints, institutional sources, and grey literature between 1 January 2020, and 19 November 2021. We described features of all identified assays and mapped performance metrics by the manufacturers, third-party head-to-head, and independent group evaluations. We compared the reported assay performance by evaluation source with a mixed-effect beta regression model. A simulation was run to quantify how biased assay performance affects population seroprevalence estimates with test adjustment. Results: Among 1807 included serosurveys, 192 distinctive commercial assays and 380 self-developed assays were identified. According to manufacturers, 28.6% of all commercial assays met WHO criteria for emergency use (sensitivity [Sn.] >= 90.0%, specificity [Sp.] >= 97.0%). However, manufacturers overstated the absolute values of Sn. of commercial assays by 1.0% [0.1, 1.4%] and 3.3% [2.7, 3.4%], and Sp. by 0.9% [0.9, 0.9%] and 0.2% [−0.1, 0.4%] compared to third-party and independent evaluations, respectively. Reported performance data was not sufficient to support a similar analysis for self-developed assays. Simulations indicate that inaccurate Sn. and Sp. can bias seroprevalence estimates adjusted for assay performance; the error level changes with the background seroprevalence. Conclusions: The Sn. and Sp. of the serological assay are not fixed properties, but varying features depending on the testing population. To achieve precise population estimates and to ensure the comparability of seroprevalence, serosurveys should select assays with high performance validated not only by their manufacturers and adjust seroprevalence estimates based on assured performance data. More investigation should be directed to consolidating the performance of self-developed assays.


Introduction
Serosurveys have been foundational to emergency pandemic surveillance and evidenceguided public health policy during the COVID-19 pandemic. These studies help map the true extent of SARS-CoV-2 infection, indicators of population humoral immunity, and other measures of disease risk [1]. Serological assays, the laboratory tools for detecting antibodies produced after SARS-CoV-2 infection or vaccination, are a critical methodological step in serosurvey design and result interpretation. In response to expanding demand for serosurveys, many SARS-CoV-2 serological assays were developed, mobilized, and adopted since the beginning of the pandemic.
The breadth of available serological assays since the beginning of the pandemic is large and diverse, with over hundreds of serological assays currently commercially available. Most serological assays target antibodies against the spike (S) and/or nucleocapsid (N) proteins [2] of the SARS-CoV-2 virus and detect a variety of antibody isotypes (IgG, IgM, IgA, or all-Total Ab). To date, several types of analyte binding methods and virological techniques have been applied to SARS-CoV-2 serology-the most common being neutralization assays, lateral flow immunoassays [LFIAs], immunofluorescence assays [IFAs], enzyme-linked immunosorbent assays [ELISAs], and chemiluminescence assays [CLIAs].
An important consideration during serosurvey study design is assay performance. Assay performance has direct consequences on the validity of a study, where the sensitivity (Sn.) and specificity (Sp.) reflect whether a given seroprevalence result is accurately reflective of the sample group's true antibody positivity. Sn. and Sp. are not fixed properties of an assay-they are dependent on the panel of samples they were tested with. Manufacturers, third-party sources, and other independent groups conduct performance evaluations on the Sn. and Sp. of assays to ensure the reliability and comparability of seroprevalence results. These evaluations use panels with different compositions of samples, some of which are likely to produce high estimates. Thus, the evaluation performance of assays varies considerably. Recently, a review compared serological assay performance against RT-PCR results for 58 studies [3]. The authors found that among ELISAs, CLIAs, and LFIAs, the pooled assay Sn. and Sp. ranged from 75-91% (Sn.) and 92-100% (Sp.). This broadly varying assay performance raises the concern that SARS-CoV-2 seroprevalence estimates may be biased by imperfect or inconsistent assay performance, especially in cases where no statistical adjustments are made to account for test performance.
Validation from different sources is often in disagreement and results in varied intraassay performance data especially compared to manufacturer-certified evaluations, as supported by several head-to-head laboratory assay comparison studies [4][5][6][7][8]. Commercial assays constitute the vast majority of assays used in serosurveys, and manufacturers of these commercial assays self-certify their testing products with in-lab evaluations [9]. Such evaluations were usually done in the early pandemic using small true positive samples drawn from patients with confirmed symptomatic COVID-19 and no co-infection of other viruses [10]. The lack of endemic samples representing the demographics and endemic pathogens in a study area introduces spectrum bias [11]. There is also a lack of standardization between the methodology for manufacturer evaluations, and key factors such as the time of post-symptom onset that sampling was performed vary.
There is uncertainty in the extent to which mis-specified assay performance will introduce bias to results in unadjusted and adjusted seroprevalence estimates. This issue is further exacerbated by the discordant validation data between sources and the unavailability of third-party evaluations for certain assays. For this reason, there is a need to synthesize assay performance data for use in both the design and interpretation of serosurveys. In particular, how these sources of validation data differ and what Sn. and Sp. of an assay are needed to minimize bias in seroprevalence estimates given the true background prevalence. These results have important implications for public health policy and resource mobilization through the interpretation of seroprevalence data: especially critical for the future course of the pandemic and advising serosurveillance for future infectious disease threats.
Our group maintains a living systematic review of SARS-CoV-2 seroprevalence [12]. We sought to (1) describe features and usage of serological assays, as well as the implementation of testing algorithms employing multiple tests in SARS-CoV-2 serosurveys during the COVID-19 pandemic; (2) comprehensively compare the performance of these assays across manufacturers, third-party reference labs, and independent investigator evaluations; and (3) quantitatively assess the influence of assay performance on seroprevalence estimates. To our knowledge, this is the first large-scale evaluation of discrepancies between validation sources and intra-assay performance for serological assay targeting SARS-CoV-2 antibodies.

Materials and Methods
This study is registered as a part of a living systematic review of global SARS-CoV-2 seroprevalence studies in PROSPERO (CRD42020183634 [12]), which is also accessible on the open-access web dashboard, SeroTracker [13]. Detailed methods and results from this review have previously been published [14,15].

Data Sources and Search Strategy
We developed a search strategy intending to be as thorough as possible in comprehending the immunoassays utilized in seroprevalence studies. The search strategy identified published literature, and preprints were created in collaboration with a health sciences librarian. We sought to reduce any potential publishing bias by adding a range of sources besides peer-reviewed publications, including institutional reports, media sources, and grey literature. Experts who collaborated with us and the users of the SeroTracker website recommended grey literature [14]. All identified articles were recorded in the SeroTracker database. This is a database that contains the most complete source of seroprevalence research ever made available. From 1 January 2020 to 19 November 2021, we searched for articles on Medline, EMBASE, and Web of Science preprints on Europe PMC. Our secondary search included Google News, articles submitted to SeroTracker.com, and studies submitted to us by expert recommendations. Two reviewers independently screened titles/abstracts and full texts. Data were extracted and critically appraised in duplicate [16].

Inclusion and Exclusion Criteria
We included all SARS-CoV-2 seroprevalence studies on humans which reported a sample size, sampling date and locale, and prevalence estimate. We excluded studies conducted only on people with SARS-CoV-2 infection or vaccination and online public dashboard estimates that were not associated with a defined serology study [15]. We adapted an automated appraisal tool based on the Joanna Briggs Institute critical checklist to evaluate the risk of bias in included seroprevalence studies [17]. Full details of the assessment process can be accessed from this preprinted work [16].

Serological Assay Data Extraction
We extracted all data for serological assays to develop an independent database linked to the master seroprevalence study database. It included assay-related information reported in individual seroprevalence studies. For each assay, we identified product name, manufacturer or developer, country, WHO geographical region of development, antibody isotypes detected (IgG, IgM, IgA, total Ab), test type (ELISA, LFIA, IFA, CLIA, neutralization assay, others; Table S1), antibody target (Spike, Nucleocapsid, others), multiplex detection (detecting more than one antibody targets), time to result (Rapid Diagnostic Tests [RDT]/non-RDT), and test Sn. and Sp. as reported by manufacturers or developers. For commercial assays, we validated and complemented details on assays using reference links provided by authors. These links directed us to manufacturer's websites or user's guides which contained detailed information on the given assay. For self-developed assays, reference links pointed to the original research article with comprehensive development details.
Many studies cited serological assay validation results to corroborate the performance of the assay they selected. However, given that the testing environment, validation procedure, and reference panel varied across groups conducting validation, we categorized assay validation as either (1) third-party lab validation or (2) independent group field validation. We linked commercial assays with their performance in five large third-party lab performance evaluations and defined these as third-party lab validations. These five labs conducted large-scale head-to-head evaluations under controlled and reproducible conditions, including NRL (WHO sponsored [4]), the US FDA [5], Netherland CIDC [6], The Doherty Institute [7], and FIND Diagnostics [8] (Table S2). Independent field validation results were defined as performance validation data extracted from individual seroprevalence studies. These studies reported pretest results with a smaller sample in addition to population prevalence. Where available, assays' Sn. and Sp. for all isotypes and total antibodies were extracted from third-party lab evaluations and independent evaluations. The WHO has set performance criteria for the emergency use of Sn. >= 90.0% and Sp. >= 97.0% [18]. We applied these thresholds to categorize commercial assays based on performance in manufacturer, third-party, and independent evaluations.
Evaluation data was not very available for self-developed assays as for commercial assays, in the assay description of which concentrated on the steps of developing such an assay with performance matrices provided randomly. Therefore, corresponding performance analysis was not conducted for self-developed assays.

Analysis
Data extraction, cleaning, and management were performed in a collaborative data collection platform (Airtable.com, accessed on 20 November 2022). Data analysis was performed using R 4.0.2 [19]. We first summarized basic study characteristics, seroprevalence estimates, and serological assay features stratified by the WHO region at the study level.
At the assay level, we described the distribution of test usage, initial adoption, test type, region of development, test features, test evaluation states, and eligibility for emergency use by commercial and self-developed assays. We collected Sn./Sp. data to show the difference in reported performance for the top 50 assays and the top 20 assays by evaluation sources (manufacturers, third parties, and independent groups). The median Sn. and Sp. values for the top 50 assays were extracted from three evaluation sources and plotted on a panel against the WHO criteria. Bland-Altman plots were created to compare manufacturerreported Sn./Sp. with a third party's lab and independently evaluated Sn./Sp. in pairs.
For studies that used a testing algorithm involving multiple assays, we examined the combination of assays used (commercial/self-developed), how results from assays were combined (e.g., either test positive for a specimen to be positive vs. both tests positive), and whether the study reported a combined Sn. and Sp. for the testing algorithm. Many studies used multiple assay testing algorithms and also reported seroprevalence derived from using individual assays on the same set of samples. For these studies, we generated another set of Bland-Altman plots to show the discrepancy of estimates between testing algorithms. Seroprevalence estimates given by multiple assay algorithms and seroprevalence given by individual assays were compared in pairs.

Modeling Analysis
In examining whether assay performance differs by evaluation sources, we developed separate mixed-effect beta regression models for Sn. and Sp. with random effects specified for individual serological assays. Given that data with high heterogeneity, a diagonal heterogeneous variance-covariance structure was finally selected when estimating the assay performance by evaluation source. Assay features of isotype, test type, antibody targets, multiplex detection, and time to result were fitted as covariates to adjust outputs. Raw log odds obtained from models were converted to percentage for ease of interpretation. Difference in performance matrix against manufacturer values with 95% Confidence Interval (95% CI) by evaluation sources was derived using bootstrapping with 10,000 iterations. This modeling analysis enables us to determine discordance between evaluation sources and how inherent assay features may affect performance metrics.
We then performed a simulation. We simulated 1000 scenarios in which observed seroprevalence ranged from 0.0-99.9%. We adjusted assay performance [20] on observed prevalence to answer the third question we asked-to what extent a misreported assay performance value will bias the adjusted estimates from the 'true' prevalence estimate. The precise prevalence estimate intervals were defined by specifying error levels at ±5% [21] to the true prevalence. We simulated adjusted seroprevalence for assays at three accuracy levels-(1) high: Sn. = 95.0%, Sp. = 99.0%; (2) good: Sn. = 90.0%, Sp. = 97.0%; (3) moderate: Sn. = 87.0%, Sp. = 90.0%, with different levels of error of performance misspecification.

Included Studies
We screened 72,799 titles and abstracts and 4876 full texts published between 1 January 2020, and 19 November 2021. This represents the pre-booster vaccine time window before Omicron where most qualitative tests were introduced. We extracted data from 2069 articles-262 of these were identified as preprints, overlapped by subsequent full articles. A total of 1807 serosurveys were included for final analysis ( Figure S1).

Assay Use in Seroprevalence Studies
Among these 1807 serosurveys, 80.7% of studies used a single serological assay (73.1% commercial assays, 18.2% self-developed assays, 8.7% unable to specify), while 19.3% used a testing algorithm involving multiple assays (Tables S3 and S5); 248 adjusted seroprevalence estimates for assay performance. Overall, global usage of commercial serology assays follows a power-law distribution, with the top 25 assays accounting for 67.0% of total commercial assay use in seroprevalence studies ( Figure S2) and the top 50 assays accounting for 91.4% of use.

Characteristics of Identified Assays
Among 1807 serosurveys, we identified 192 commercial serology assays and 380 selfdeveloped serology assays (Table 1). A full list of identified commercial serology assays can be found in Supplemental Files (Table S6). Of the 192 identified commercial assays, 31.3% were ELISAs, 39.1% were LFIAs, 15.6% were CLIAs, 2.6% were IFA assays, and 15.6% were other types or not able to specify (Table 1). Of the 380 studies using self-developed assays, most used ELISAs (68.7%, Table 1). Product information was limited for many assays, most notably LFIAs: up to 32.6% and 42.6% of studies did not mention details about targeted antigen(s) and antibody isotypes, respectively. A total of 45.0% of studies using self-developed assays used multiplex detection to recognize multiple antibody targets. RDTs (types including LFIA, and IFA) accounted for 53.6% (103/192) of all commercial assays, while only 4.5% (17/192) of self-developed assays were developed as RDTs.

Reporting of Assay Performance
Manufacturer data could be searched from publicly available online sources or manufacturerled research papers for 91/192 (47.4%) commercial assays; 61.5% of these were subsequently either assessed in the five third-party evaluations or independent group evaluations (Table 2). Based on manufacturer data, the mean Sn. was 97.8 (95% CI: 93.9-100)% and the mean Sp. was 99.7 (95% CI: 97.8-100)%; 55/192 (28.6%) met the 90.0% Sn. and 97.0% Sp. WHO criteria for emergency use (Figures 1 and 2); of the 50 most frequently used assays, 76.9% met the WHO criteria. In contrast, only 46.1% and 53.7% met these criteria based on third-party and independent evaluations, respectively ( Figure 2). Note: Assay features of isotype, test type, antibody targets, multiplex detection, and time to result were fitted as covariates to adjust outputs. a Coefficients derived from the model were in log-odds formats, which were then converted to percentages for absolute performance value estimates. b Difference in performance matrix against manufacturer value was estimated using bootstrapping with 10,000 iterations. c corresponds to the significant level of p < 0.05; p attaches to absolute performance values. The mixed beta regression model bounded between 0-100%, clustering assay performance data in serological assays groups. * corresponds to the significant level of p < 0.05. Figure 1. The difference in reported assay performance among manufacturer evaluation, third party evaluation, and independent evaluation. Note: the figure shows the side-by-side comparison of assay performance for the top 20 assays. Performance evaluations came from three sources: manufacturer reports, third-party labs, and independent groups. Intervals show the range of performance values for a certain assay derived from the given evaluation source. CLIAs demonstrated higher and more reliable performance across all three evaluation sources than ELISAs, LFIAs, and IFAs among the top 50 assays (Figure 1 and Figure  S3). The pairing comparison of manufacturer-reported figures of merit against five thirdparty lab and independent group evaluations indicated manufacturers systematically overstated the Sn. and Sp. of the assays they developed (Figures S4 and S5). After adjusting for assay features, Sn. and Sp. were considerably lower by 1.0% (95% CI: 0.1-1.4)% (p = 0.289) and 0.9% (95% CI: 0.9-0.9)% (p < 0.001) according to third parties and by 3.3 (95% CI: 2.7-3.4)% (p = 0.001) and 0.2 (95% CI: -0.1, 0.4)% (p = 0.247, Table 2) according to independent evaluations.
We conducted a simulation to examine the impact of incorrect Sn. and Sp. estimates on estimated seroprevalence, using a threshold of ±5% between true and adjusted prevalence to define substantial effects. Falsely specifying Sn. 5% higher than its true value will not affect population prevalence estimates for any assay with a higher than moderate performance (Sn. >= 80%, Sp. >= 87%). However, if Sn. is falsely specified by 10% higher and Sp. by 3% higher, population prevalence estimates are inaccurate for true prevalence below 18.3% or above 38.7% (assays with moderate performance), or inaccurate for true prevalence below 17.5% or above 41.5% (assays with good performance, i.e., Sn = 90%, Sp = 97%). Falsely specifying assay Sn. 10% lower and Sp. 5% lower than their true values lead to substantial deviations between estimated and true population seroprevalence at all seroprevalence values (Figure 3a-c, Table S4). Figure 3. Consequences of correcting seroprevalence estimates using biased estimates of sensitivity (Sn.) and specificity (Sp.): simulation-based analysis. Serological assay with (a) high: a true Sn. at 95.0% and a true Sp. at 99.0%, (b) good: a true Sn. at 90.0% and a true Sp. at 97.0%, (c) and moderate performance: a true Sn. at 80.0% and a true Sp. at 87.0%. Note: The dot-dash lines provide an interval which indicates the seroprevalence adjusted for the mis-specified assay performance at a given error level was still within ±5% deviation of the true seroprevalence. The prevalence adjustment was performed using the formula by Sempos and Tian [20]. Notice that an assay with underestimated Sn. and Sp. is unable to provide prevalence estimates after adjustment at a low prevalence setting: (a) 5.3% and (b) 5.6%. An assay with overestimated Sn. and Sp. tends to inflate seroprevalence after adjustment when the true prevalence is low: (b) 3.0%.
Of the 349 multiple-testing studies, 42.4% of these tested the same sample on multiple assays concurrently ("parallel testing"); among these, 68.2% defined seropositivity as a positive result on at least one assay, and 31.8% defined this as a positive result on all assays. A total of 31.8% used one assay first for screening, followed by another for confirmation ("sequential testing"). While having the combined Sn. and Sp. for a testing algorithm is important to interpret seroprevalence estimates, this was only reported in 9.5% of seroprevalence studies using multiple testing algorithms (Table S5). A subset of samples from 167 studies tested on parallel or sequential multiple algorithms were identified to interpret seroprevalence estimates derived from these algorithms. These studies also have estimates provided by a single assay. We found parallel and sequential testing algorithms were potentially effective in ruling out false-positive cases given by RDTs (−7.8% in prevalence estimates using a single assay) and recognizing positive cases missed by ELISAs (+4.4%, Figure S6).

Discussion
In examining 1807 global serosurveys published between 1 January 2020 and 19 November 2021, we found that 192 unique commercial and 380 unique self-developed serological assays were used. A total of 50 commercial assays are used across 91% of SARS-CoV-2 seroprevalence studies. We found that intra-assay performance evaluations varied widely according to evaluation method and source. This variation in assay evaluations may have an impact on seroprevalence estimate validity and bias by under-or over-estimating estimates by up to 9.5%.
Serological assay performance is context dependent. Previous literature did not focus on assessing intra-assay consistency across different sources of validation for assays but put more emphasis on inter-assay comparisons [22][23][24][25][26]. Our study reveals that manufacturer evaluations of assay Sn. and Sp. were overestimations compared to independent and third-party head-to-head validations. Our pooled analysis found that Sn. on average was lower by 1.0% and 3.3% in third party and independent group evaluations, respectively. Likewise, Sp. on average was lower by 0.9% and 0.2% in third party and independent group evaluations, respectively. These results imply there may be more false positives and negatives than would be expected given manufacturer-verified test evaluations, which may impact result adjustment and interpretation.

Third-Party Evaluation Validates Manufacturer Data
Third-party evaluations are essential for more objective estimates of Sn. and Sp., enabling retrospective adjustment of seroprevalence data and selecting candidate assays for new studies. The five third-party labs included in our study all disclosed the reference panel they used (Table S2). The composition of samples in reference panels is consistent across the evaluation of each individual assay, including testing materials consisting of combinations of high-titer, mid-titer, and low-titer samples on N-and S-antibody targets. Reference panels reflect the full-time course of infection (past infections, and waning antibodies). It also mirrors the complexity of antibody detection in real settings [27,28], as cross-reactivity to other viral infections (such as HIV, Dengue, Malaria, and Middle East Respiratory Syndrome) was also consistently assessed in negative panels. Third-party evaluations are of value in retrospectively adjusting data or selecting and adjusting for assays in new studies. However, these evaluations typically only target frequently used commercial ELISA and CLIA assays, which were less distributed in low-income regions like Africa (Table S3).

Independent Evaluation Reflects Regional Population Characteristics
This situation necessitates that study investigators validate assays not included in these third-party evaluations. These independent evaluations better reflect the study geography, demographic context, epidemiological time course, and variant landscape, minimizing spectrum bias. Of note, studies have demonstrated loss of Sn. over time as antibodies wane, and incorporating performance based on time since the infection will gain further importance as the pandemic progresses [10,29]. Moreover, viral mutations may result in decreased assay performance [30]. Additionally, studies have shown differences in antibody dynamics in specific populations such as those from sub-Saharan Africa, young adults, and pregnant women that may impact test performance [31][32][33][34]. This step is not always feasible for all research settings, as we found only a small proportion (6.9%) of independent author groups conducted their own assay pre-study validation before rolling out their serosurvey. Fewer described the evaluation panels and methods they used. We encourage future studies to integrate assay evaluation more into a serosurvey design as a pre-step. Independent evaluations targeted toward the intended study population will update the understanding of serological assay performance, and conversely, accurate seroprevalence estimates.

Correct Seroprevalence Estimate for Assay Performance
Seroprevalence estimates can vary considerably based on the assay used, even in the same population and based on the same samples [23]. For instance, low-Sp. assays can lead to overinflated seroprevalence estimates, creating misleading results-particularly in settings with low true prevalence [35]. Moreover, Sn. and Sp. are not true parameters of the assays but can vary for the same assay depending on the reference panel or population used. Overall, our findings caution against accepting aggregate Sn. and Sp. reported by assay manufacturers, favoring independent or third-party evaluations on representative populations. Sn. and Sp. should be stratified by disease severity and time since infection, and the characteristics of the positive and negative reference panels should be reported at a minimum. The chance of biased estimates can be substantially minimized with proper adjustment. Our finding implies that statistically adjusting for test validity may be an essential step, particularly in low prevalence settings where a small absolute difference in seroprevalence can produce a massive relative difference in the understanding of case ascertainment, and/or where assay performance values are low (as seen with some rapid test assays).

Multiple Testing
Another option to minimize bias in seroprevalence studies was to use a multipletesting strategy. Although findings should be further validated due to the heterogeneity of data, we noticed that pairing RDT with other assays could minimize false-positive rates by using RDTs only. RDT as a preliminary screening test suggests whether the test recipient produces any antibodies against SARS-CoV-2 in general. However, series confirmation tests identify the source of antibodies (infection/vaccination) and helps determine a much more precise timepoint of infection, see article [36]. Using just one RDT should be avoided. Moreover, multiple testing algorithms could also increase the Sn. of laboratory binding assays such as ELISAs and CLIAs and rule out false negatives, especially in low prevalence settings. Requiring a positive result on multiple assays in parallel and sequential testing improves the overall Sp. of the testing algorithm compared with the individual assays alone [37], but sometimes at the expense of Sn. [38]; conversely, requiring a positive result on just one of multiple assays improves Sn. at the expense of Sp. Sn. and Sp. should be taken as a whole to improve the positive predicted value of a testing algorithm to truly identify positive cases among all positive tests. Rational deployment of these algorithms should also consider contextual factors such as background prevalence in the population being studied, as positive predictive values are substantially lower in low prevalence settings [27]. Additionally, for accurate interpretation, reporting the details of the assays used and how they were combined with one another is important.
The combined Sn. and Sp. is calculatable for multiple testing algorithms based on individual performance features under either rule [36,39], but reporting a combined Sn. and Sp. at the point of completing all steps for a multiple testing algorithm on a regional sample is more preferable.

Limitations
This study has some limitations. While the living review from which our data is drawn captures all seroprevalence studies, we have not captured all applications of serological assays. For example, we excluded studies performed exclusively in confirmed COVID-19 cases and vaccinated individuals, and our findings may not apply to these areas of serological research. Additionally, our findings apply to population-based contexts and may not translate to the patient or clinical level, where serological assays are used to guide patient care. When collating third-party, independent, and manufacturer data on assay performance, we extracted the overall Sn./Sp. on total antibodies whenever available. We performed an empirical synthesis, making the best effort to collate all assay performance data accessible from online dashboards, preprints, institutional reports, and academic journals by identified sources. We extracted performance data collected from the far-most day from symptom onset. Finally, while we made our best effort to identify and summarize the use of serological assays in each serosurvey and the performance of assays from different sources, we saw people miss reporting performance matrices for self-developed assays. Therefore, we did not proceed with analyses for self-developed assays on performance comparison. Studies released as conference abstracts (48/1807, 2.7%) did not have enough space to describe the type of test used in a serosurvey in detail. However, given that the number of conference abstracts is small, it did not contribute a major result bias.

Conclusions
In conclusion, we found a large and diverse number of assays used in seroprevalence studies. This diverse selection of serological assays may impact the interpretation and reliability of seroprevalence estimates by up to ±9.5%, as Sn. and Sp. are not fixed properties of a serological assay but varying features depending on the reference panel or population on which is tested. We strongly recommend that: (1) authors conducting seroprevalence studies should consider adopting third-party or independently evaluated assays, which inform assay properties in a particular context; (2) statistical test adjustments on population seroprevalence should be employed using validated assay performance data; and (3) utilizing multiple testing strategies where possible (reporting a combined overall Sn. and Sp.) to minimize the risk of bias in seroprevalence estimates.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/vaccines10122000/s1, Figure S1: PRISMA diagram; Figure S2. Pareto chart of serological assay use; Figure S3: Difference in reported assay performance among independent evaluation, third party evaluation, and manufacturer evaluation among top 50 assays (ELISA, CLIA, and LFIA/IFA); Figure S4: Pairwise Bland-Altman plots on the agreement of Sn. and Sp. between a source of assay evaluation against the manufacturer evaluation; Figure S5: Pairwise Bland-Altman plots on the agreement of Sn. and Sp. between a source of lab evaluation against independent group evaluations; Figure S6: Bland-Altman plot on the agreement between serological testing results with multiple assays and with a single assay; Table S1: The major categories of assay; Table S2: Reference panels of five third-party lab evaluations; Table S3: Testing characteristics of included seroprevalence studies; Table S4: Categorizing the top 50 assays based on the median performance value given by third-party and independent evaluations; Table S5: Testing algorithms involving multiple serological assays; Table S6: List of identified commercial serology assays from systematic review. References [5][6][7][40][41][42][43] are cited in the supplementary materials.  Institutional Review Board Statement: Ethical review and approval were waived for this study, due to only second-hand synthesized data were used.

Informed Consent Statement: Not applicable.
Data Availability Statement: Data from seroprevalence studies and the serological assays used therein are available from: https://serotracker.com/en/Explore (accessed on 20 November 2022).