A Systematic Review of Methodology Used in Studies Aimed at Creating Charts of Fetal Brain Structures

Ultrasound-based assessment of the fetal nervous system is routinely recommended at the time of the mid-trimester anatomy scan or at different gestations based on clinical indications. This review evaluates the methodological quality of studies aimed at creating charts for fetal brain structures obtained by ultrasound, as poor methodology could explain substantial variability in percentiles reported. Electronic databases (MEDLINE, EMBASE, Cochrane Library, and Web of Science) were searched from January 1970 to January 2021 to select studies on singleton fetuses, where the main aim was to construct charts on one or more clinically relevant structures obtained in the axial plane: parieto-occipital fissure, Sylvian fissure, anterior ventricle, posterior ventricle, transcerebellar diameter, and cisterna magna. Studies were scored against 29 predefined methodological quality criteria to identify the risk of bias. In total, 42 studies met the inclusion criteria, providing data for 45,626 fetuses. Substantial heterogeneity was identified in the methodological quality of included studies, and this may explain the high variability in centiles reported. In 80% of the studies, a high risk of bias was found in more than 50% of the domains scored. In conclusion, charts to be used in clinical practice and research should have an optimal study design in order to minimise the risk of bias and to allow comparison between different studies. We propose to use charts from studies with the highest methodological quality.


Introduction
Ultrasound-based assessment of the fetal nervous system is routinely recommended in most settings at the time of the mid-trimester fetal anatomy scan or at different gestations based on clinical indications [1,2]. This usually includes routine measurement of the lateral ventricle anteriorly (AV) and posteriorly (PV), the transcerebellar diameter (TCD), and the cisterna magna (CM). Additional measurements, such as the parieto-occipital fissure (POF) and the Sylvian fissure (SF), have been proposed as part of an extended neurosonography examination in order to assess gyration and sulcation disorders.
In previous systematic reviews of studies aimed at creating fetal and neonatal biometry charts, many studies were found to have high risks of bias. Such shortcomings of methodological design can become a source of substantial variability in percentiles reported, with differences in interpretation of the same measurement; ultimately, this can adversely influence clinical decision making [3,4]. Over the last five years, international prescriptive standards have been published in order to overcome the limitations inherent in such descriptive reference charts [5].
The objective of this systematic review was to evaluate the methodological quality of studies aimed at developing charts of fetal brain structures measured by ultrasound.

Materials and Methods
We conducted a systematic review of observational studies following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) IPD statement [6]. We searched the major electronic databases (MEDLINE, EMBASE, Cochrane Library, and Web of Science) and secondary reference sources from January 1970 to January 2021 to select studies on singleton fetuses aimed at creating growth charts of fetal brain structures.
Inclusion criteria for each study were (1) a main aim of constructing charts of one or more of the POF, SF, AV, PV, TCD, and CM; (2) publication in English; (3) selection of normal singleton pregnancies; (4) image acquisition on routinely acquired transverse axial planes (transthalamic, transventricular, transcerebellar) [1]; and (5) growth charts developed beyond 14 weeks of gestation. No restriction on the ultrasound acquisition technique was applied (2D images or images derived from 3D volumes, obtained with a transvaginal or transabdominal probe). Studies aimed at comparing different population groups or methods of imaging were excluded from the review.
The keyword search strategy was formulated in collaboration with a professional information specialist (NWR) and is presented in Table S1. Two reviewers (VD and RN) independently undertook a two-stage process to select the studies. In the first stage, they assessed abstracts and titles of all identified citations and selected potentially eligible studies. In the second stage, they obtained and assessed the full texts of the studies that fulfilled the inclusion criteria for evaluation. Disagreements regarding inclusion were resolved by consensus or by consultation with a third author (ATP). Reference lists of retrieved full-text articles were examined for additional, relevant citations.
Methodological quality criteria were defined a priori, using modified versions of criteria previously used to evaluate studies aimed at creating fetal growth charts and crown-rump length dating charts [3,4]. Table S2 reports the set of 29 quality criteria. These criteria refer to three domains: (1) study design, (2) statistical methods, and (3) reporting methods.
All included studies were then scored against each criterion. The level of bias was defined as a dichotomous variable: 0 referred to a 'high risk', and 1 referred to a 'low risk'. The overall quality score was obtained by adding the scores across the whole set of criteria. Thus, the quality score for each study in the review could range from 0 (highest risk of bias) to 29 (lowest risk of bias). The methodological quality of each study was assessed by two reviewers (RN and AC). Disagreements were resolved through consultation with a third reviewer (ATP).

Statistical Analysis
Data from the review were coded and transferred to an Excel spreadsheet (Microsoft Corporation 2007, Redmond, WA, USA). The quality score (0-29) was reported as a percentage, obtained by dividing the actual score by 29 and multiplying by 100. The distribution of the 5th and the 95th centile in studies with a low and high risk of bias was evaluated.
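The scoring arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the example inputs are invented rather than taken from any included study, and the review itself used an Excel spreadsheet, not this code.

```python
N_CRITERIA = 29  # number of predefined quality criteria (Table S2)


def quality_score(criteria):
    """Sum dichotomous criterion scores (0 = high risk, 1 = low risk)."""
    scores = list(criteria)
    if len(scores) != N_CRITERIA:
        raise ValueError(f"expected {N_CRITERIA} criteria, got {len(scores)}")
    if any(s not in (0, 1) for s in scores):
        raise ValueError("each criterion must be scored 0 or 1")
    return sum(scores)


def quality_percentage(score):
    """Express a 0-29 quality score as a percentage of the maximum."""
    return score / N_CRITERIA * 100


def high_risk_in_majority(score):
    """True when more than 50% of the criteria were scored as high risk."""
    return (N_CRITERIA - score) / N_CRITERIA > 0.5


# Hypothetical study meeting 27 of the 29 criteria (illustrative input only).
example = [1] * 27 + [0] * 2
score = quality_score(example)
print(score, round(quality_percentage(score), 1), high_risk_in_majority(score))
```

Because 29 is odd, no study can sit exactly on the 50% boundary: a score of 14 or less means a high risk of bias in more than half of the criteria, and a score of 15 or more does not.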
We also evaluated the impact of poor study methodology on the heterogeneity of centiles for the most commonly measured brain structure, the TCD.

Results
From a total of 1005 records identified by the database search, 73 were considered for potential inclusion (Figure 1). Excluded studies and reasons for exclusion are reported in Table S3. Finally, 42 studies, reported between January 1970 and 2021, met the inclusion criteria and provided data for 45,626 fetuses included in the final analysis (Table 1). The median sample size was 372.5 fetuses (range: 50 to 8313; 25th percentile: 175.8; 75th percentile: 709.3). Most studies created charts that covered a range of gestation; coverage was counted in inclusive weeks, so that, for example, a chart from 20 to 40 weeks covered 21 weeks. The median coverage was 24 weeks (range: 7 to 30 weeks; 25th percentile: 18; 75th percentile: 27). Nine studies reported more than one fetal brain structure [9,10,17,32,34,36,41-43].
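The inclusive week count used for chart coverage can be made explicit with a small sketch. The function name and the list of chart ranges below are hypothetical and serve only to illustrate the counting rule; they are not data from the included studies.

```python
from statistics import median


def gestational_coverage(start_week: int, end_week: int) -> int:
    """Number of gestational weeks covered by a chart, counting both ends."""
    if end_week < start_week:
        raise ValueError("end_week must not precede start_week")
    return end_week - start_week + 1


# Hypothetical chart ranges (start week, end week), for illustration only.
charts = [(20, 40), (14, 40), (18, 24)]
coverages = [gestational_coverage(start, end) for start, end in charts]
print(coverages, median(coverages))
```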
In Table 2, we present the centiles of each structure at three relevant gestational ages for those studies where this was possible (either reported by the authors or calculated when a relevant equation was reported). Figure 3 shows the distribution of the 5th and the 95th centile for the TCD, comparing studies with a high risk of bias in more than 50% of the quality criteria with studies with a high risk of bias in fewer than 50%. Studies with a lower risk of bias had a narrower distribution of centiles than studies with a higher risk of bias at each of the three gestational ages analysed. The same analysis could not be reported for other structures because of the low number of data points.
Centiles highlighted in white are reported in the respective study; centiles highlighted in grey are calculated. Studies are reported in descending order of quality score. SD = standard deviation; POF = parieto-occipital fissure; SF = Sylvian fissure; AV = anterior ventricle; PV = posterior ventricle; TCD = transcerebellar diameter; CM = cisterna magna; GA = gestational age.

Discussion
The aim of this systematic review was to evaluate the methodology used in studies aimed at creating charts on specific fetal brain structures measured by ultrasound. Using a set of 29 predefined quality criteria on study design, statistical methods, and reporting methods, studies were scored as having a low or high risk of bias. This approach has been previously proposed in order to evaluate the quality of existing charts on fetal biometry and first-trimester dating [3,4].
In 34 out of 42 studies (80%), a high risk of bias was found in >50% of the domains scored (Figure 2). Only the studies by Napolitano et al. and Rodriguez-Sibaja et al. were at low risk of bias for a high proportion of the quality criteria (93% and 86%, respectively); all other studies were below 60% [36,39].
The highest potential for bias was noted in most of the criteria regarding 'study design'. Specifically, only two studies were at low risk of bias in the definition of inclusion/exclusion criteria. In addition, only two studies described detailed neonatal and developmental outcomes [36,39]. Two other studies described neurological follow-up, but this was not assessed with a standardised approach in all fetuses included. In the study by Hilpert et al., telephone follow-up was obtained at a minimum of 2 years of age in fetuses with a PV measurement of 10 mm or more [26], and in the study by Farrell et al., most fetuses with PV measurements of more than 8 mm had an unspecified follow-up that varied from 2 days to 12 months [19]. We believe that infant follow-up is essential if the aim is to create charts of brain structures. This is because pathological conditions may be prevalent, possibly affecting the resulting charts, and because many cases of developmental delay cannot currently be predicted by antenatal ultrasound. The proportion of infants with abnormal development in a study should confirm that the study population is representative of a healthy population. The high risk of bias identified in these fields is probably also related to the retrospective design used in around 30% of the studies. Furthermore, sample size estimation was performed in only four studies [18,36,39,44], and only three studies had population-based sample selection [36,39,47]; all other studies used convenience sampling or arbitrary recruitment, or did not report their sampling methods (Figure 2).
One area of significant bias in fetal biometry is calliper placement that is not performed in a blinded fashion [49]; blinding was considered in only two studies [36,39]. It is difficult to quantify the magnitude of this factor, but one might assume that, for values close to cut-offs for referral or investigation (e.g., 10 mm for the PV), such lack of blinding may have a relevant impact on the resulting charts. The sonographer tends to over- or underestimate in order to generate a referral or to avoid diagnosing an abnormality, respectively. In addition, 80% of the studies lacked ultrasound quality control, which has previously been shown to be useful in reducing measurement variability [50].
The high risk of bias in most of the criteria assessed may explain the substantial heterogeneity in the resulting centiles. For example, in the case of TCD at 32 weeks of gestation, the 50th centile according to Vinkesteijn et al. was equivalent to the 95th centile according to Hayata et al. and Hata et al. [24,25,48]. Likewise, the 95th centile according to Smith et al. was smaller than the 5th centile according to Serhatlioglu et al. (Table 2) [41,42]. In Figure 3, we show that the higher the risk of bias, the greater the variability in the reported TCD centiles at the three gestational ages considered. Such differences are clearly undesirable, since they can lead to false-positive or false-negative results. They also make different studies difficult to compare in research.
Only nine studies reported the regression equation of the standard deviation; the remaining studies either did not report an equation or reported only the mean standard deviation throughout gestation. This is an important limitation, since visual assessment of the scatter diagrams in those studies showed a clear increase in variability with advancing gestation. Hence, the change in standard deviation with advancing gestation should be taken into account, as it is needed for the accurate calculation of centiles.
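To illustrate why a gestational-age-specific standard deviation matters, the sketch below computes centiles as mean(GA) + z × SD(GA), a common approach that assumes measurements are normally distributed at each gestational age. The regression coefficients and function names are hypothetical placeholders, not values from any study in this review.

```python
from statistics import NormalDist


def centile(ga_weeks, p, mean_fn, sd_fn):
    """Centile p (0 < p < 1) at a given gestational age, assuming normality."""
    z = NormalDist().inv_cdf(p)  # standard normal quantile, e.g. about 1.645 for p = 0.95
    return mean_fn(ga_weeks) + z * sd_fn(ga_weeks)


# Hypothetical regression equations (illustrative coefficients only).
def example_mean(ga):
    return -5.0 + 1.2 * ga  # mean measurement in mm


def example_sd(ga):
    return 0.5 + 0.05 * ga  # SD in mm, increasing with gestation


def constant_sd(ga):
    return example_sd(28)  # a single "mean SD" applied at every gestational age


for ga in (20, 28, 36):
    varying = centile(ga, 0.95, example_mean, example_sd)
    fixed = centile(ga, 0.95, example_mean, constant_sd)
    print(ga, round(varying, 2), round(fixed, 2))
```

Under these assumed equations, a single mean SD places the 95th centile too high early in gestation and too low late in gestation, which is the kind of error that a reported SD regression equation avoids.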

Strengths and Limitations of the Review
This is the first systematic review on the topic and it includes all currently published charts on POF, SF, AV, PV, TCD, and CM. A rigorous methodology based on predetermined quality criteria was applied and was based on previously published quality checklists used to evaluate studies on other aspects of fetal size [3,4]. We had no limitations on the year of inclusion of studies; it could be argued that, in a rapidly emerging field such as prenatal imaging, older studies should not be subjected to the same rigorous quality assessment as more recent ones. It is also fair to assume a gradual improvement in both the ultrasound technology and the statistical methods of data analysis over the decades. In fact, there was some evidence of improving study quality over time; the median quality score for the first half of the studies (between 1984 and 2008) was 9, while the median score for the latter half of the studies was 13. Nevertheless, the high risk of bias was time independent for many criteria such as inclusion/exclusion criteria, neonatal and infant outcome, sample selection, characteristics of the study population, measurement acquired blindly, and standardisation of the sonographers.

Conclusions
The use of a wide range of reference charts can affect both clinical assessment and research on the development of new technologies associated with antenatal ultrasound [51,52]. We have shown a lack of methodological quality in most existing studies aimed at creating fetal brain charts. Most studies have significant risks of bias, leading to large differences in size charts of normal brain structures. This is the context in which studies suggesting that fetal brain measurements differ according to, for example, maternal ethnicity, country of origin, or fetal sex should be interpreted: it is not possible to ascribe such differences to biological causes when basic methods of study design, statistical analysis, and reporting are not optimised and when different studies use widely different methods.
This review of the literature has shown that 40 out of 42 studies had a low risk of bias in no more than 60% of the quality criteria. In order to allow a unified approach to clinical practice and research, we suggest using charts that have an optimal study design, use a prescriptive approach and a methodology at low risk of bias, include fetuses from low-risk populations worldwide, and follow infants up for developmental assessment to confirm that the population was appropriate for the construction of these charts.
Author Contributions: R.N., V.D. and A.T.P. designed the study and defined the quality criteria a priori. N.W.R. performed the literature search. R.N., V.D. and A.C. extracted the data and scored the studies. R.N., V.D., A.C., C.I. and A.T.P. analysed the data, interpreted the results, drafted the manuscript, and made the decision to submit. All authors assisted in drafting the article submitted and revising it for important intellectual content, and all edited and approved the final, submitted version. All authors have read and agreed to the published version of the manuscript.