4.1. Methods
Though far from comprehensive, literacy and numeracy rates remain the most readily available outcome data. Under the Every Student Succeeds Act (ESSA), standardized testing is required in grades three through eight and once again in high school. The U.S. Department of Education’s EDFacts initiative provides the most comprehensive dataset on school performance nationwide. EDFacts primarily covers public schools that are eligible for federal funding and participate in federally mandated assessments. To protect student privacy and comply with federal laws such as the Family Educational Rights and Privacy Act (FERPA), EDFacts suppresses data for small subgroups by reporting proficiency in ranges rather than exact values. To address suppressed data, we take the midpoint of each range as the estimated proficiency score. We then multiply each midpoint by the number of participants in the corresponding subgroup, sum these products across subgroups, and divide by the total number of participants to obtain a weighted average. This ensures that subgroups with more participants contribute proportionally to the overall proficiency score.
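For illustration, the following sketch shows one way such midpoint estimates and participation-weighted averages might be computed; the range format, column names, and values are hypothetical and do not reflect the EDFacts file layout.

```python
import pandas as pd

def range_midpoint(value):
    """Convert a suppressed proficiency range such as '20-29' to its midpoint;
    exact values are passed through unchanged."""
    if isinstance(value, str) and "-" in value:
        low, high = (float(x) for x in value.split("-"))
        return (low + high) / 2.0
    return float(value)

# Hypothetical subgroup records for a single school (not the EDFacts schema).
subgroups = pd.DataFrame({
    "subgroup": ["Black", "Hispanic", "White"],
    "proficiency": ["20-29", "35-39", "52"],   # suppressed ranges or exact values
    "participants": [18, 42, 130],
})

subgroups["estimate"] = subgroups["proficiency"].map(range_midpoint)

# Participation-weighted average: larger subgroups contribute proportionally more.
weighted_avg = (
    (subgroups["estimate"] * subgroups["participants"]).sum()
    / subgroups["participants"].sum()
)
```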
Because each state employs distinct standardized assessments, with differing content, scoring systems, and proficiency thresholds, direct comparisons of raw performance metrics across states are inherently problematic. To facilitate cross-state analysis, we first standardized each school’s average proficiency score into a Z-score, calculated relative to the mean and standard deviation of all schools within the same state. A Z-score represents a school’s relative position in its state distribution; for example, a value of +1 indicates a performance that is one standard deviation above the state mean. This within-state normalization enables researchers to evaluate schools on a common relative scale, abstracting from the idiosyncrasies of individual state assessments.
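A minimal sketch of this within-state standardization, assuming a hypothetical school-level table with state and weighted-average columns, might look as follows:

```python
import pandas as pd

# Hypothetical school-level table; column names are illustrative, not EDFacts fields.
schools = pd.DataFrame({
    "school_id": ["A", "B", "C", "D"],
    "state": ["AZ", "AZ", "OH", "OH"],
    "weighted_avg": [48.0, 61.0, 55.0, 70.0],
})

# Standardize each school against the mean and standard deviation of its own state.
state_stats = schools.groupby("state")["weighted_avg"].agg(["mean", "std"])
schools = schools.join(state_stats, on="state")
schools["z_state"] = (schools["weighted_avg"] - schools["mean"]) / schools["std"]
```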
However, while Z-scores enable within-state comparisons, they do not fully resolve cross-state comparability. The same Z-score in two different states does not guarantee an equivalent absolute level of student proficiency, because the state means themselves differ. For example, a school performing one standard deviation above the mean in a low-performing state may not reach the same absolute proficiency level as a school at the same relative position in a high-performing state.
To address this, we incorporate data from the National Assessment of Educational Progress (NAEP), a nationally representative, standardized assessment administered uniformly across states. The NAEP provides state-level benchmarks for core subjects such as reading and mathematics, enabling us to estimate each state’s average proficiency level relative to the national average. We convert state-level NAEP scores into standardized effect sizes (i.e., mean differences expressed in national standard deviation units), which are then used as additive adjustments to the school-level Z-scores. This “NAEP-adjusted Z-score” approach situates each school’s performance relative to its state peers while also accounting for systematic differences in state-level academic performance.
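Continuing the hypothetical example, the adjustment can be sketched as follows; the NAEP values below are placeholders rather than published scores.

```python
import pandas as pd

# Within-state Z-scores from the previous step (hypothetical values).
schools = pd.DataFrame({
    "school_id": ["A", "B", "C", "D"],
    "state": ["AZ", "AZ", "OH", "OH"],
    "z_state": [-1.0, 1.0, -1.0, 1.0],
})

# Placeholder NAEP benchmarks (not published scores).
naep_state_mean = {"AZ": 258.0, "OH": 267.0}
naep_national_mean, naep_national_sd = 263.0, 38.0

# Express each state's NAEP gap as an effect size in national SD units,
# then apply it as an additive adjustment to the within-state Z-score.
schools["naep_effect"] = schools["state"].map(
    lambda s: (naep_state_mean[s] - naep_national_mean) / naep_national_sd
)
schools["z_naep_adjusted"] = schools["z_state"] + schools["naep_effect"]
```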
We acknowledge that this method assumes that the distributional properties (e.g., variance, shape) of state tests are sufficiently aligned with the NAEP to support linear translation via mean-shifting. Furthermore, this approach does not account for potential within-state heterogeneity, whereby the factors contributing to state-level NAEP performance may not be uniformly distributed across all schools. Nonetheless, in the absence of student-level crosswalks or harmonized state assessments, NAEP-based adjustment offers a pragmatic and empirically grounded means of enhancing cross-state comparability. It mitigates a key source of bias in national analyses of school performance while maintaining interpretability and statistical rigor.
By integrating Z-scores adjusted for state effects with weighted margins of error, researchers can generate robust comparisons of school performance. Weighted margins of error account for variability in subgroup sizes and ensure that aggregated scores are statistically reliable. For instance, in this paper, we exclude scores with margins of error exceeding 10 points to maintain analytical integrity. In total, this represents about 2% of students. This methodology leverages EDFacts data to create standardized, comparable measures of school performance across states and sectors, and provides valuable context for understanding the role of AICS nationally (Huang & White, 2023).
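As a brief sketch of this exclusion rule, assuming a hypothetical margin-of-error column expressed in percentage points:

```python
import pandas as pd

# Hypothetical school-level margins of error; field names are illustrative.
schools_moe = pd.DataFrame({
    "school_id": ["A", "B", "C", "D"],
    "z_naep_adjusted": [0.4, -0.2, 1.1, 0.7],
    "moe": [5.0, 12.0, 3.0, 9.5],
})

# Drop schools whose margin of error exceeds 10 points before aggregation.
analysis_sample = schools_moe[schools_moe["moe"] <= 10.0]
```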
To better understand the effects of the COVID-19 pandemic and to evaluate AICS more generally, we examine performance data from the pre-pandemic (2018–2019) and post-pandemic (2021–2022) periods. Our goal is to evaluate whether suppression-related uncertainty meaningfully affects the observed differences in proficiency between school types. In other words, we are not conducting traditional hypothesis tests on group means, but rather testing whether data suppression introduces enough uncertainty to undermine the observed descriptive differences. This distinction is important: we are working with population-level data rather than a sample, so formal statistical inference is not required to estimate central tendencies. Instead, we use standard errors to quantify the uncertainty arising from data suppression and assess whether the differences between group means are significant relative to this uncertainty. To do this, we calculate a group-level standard error (SE) based on the margins of error (MoEs) provided by EDFacts for schools with suppressed data. Specifically, for each school with a range-based estimate, we define participation (P) as the number of students tested and compute a scaled uncertainty term MoE × P to represent that school’s potential contribution to group-level error. The group-level SE is then calculated as follows:
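One formalization consistent with these definitions, assuming the scaled uncertainty terms of suppressed schools are combined in quadrature and normalized by total group participation, is:

\[
\mathrm{SE}_{\text{group}} \;=\; \frac{\sqrt{\displaystyle\sum_{s \,\in\, \text{suppressed}} \left( \mathrm{MoE}_s \times P_s \right)^{2}}}{\displaystyle\sum_{i \,\in\, \text{group}} P_i},
\]

where \(P_i\) denotes the number of students tested at school \(i\).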
This formula estimates the standard error of the group mean under the assumption that uncertainty is concentrated in schools with suppressed data. For schools that report exact proficiency values (i.e., no suppression), we assume zero measurement uncertainty in their contribution to the group score. While this assumption likely underestimates the true variability for non-suppressed schools, it allows us to focus specifically on whether suppression-related uncertainty is large enough to explain away the observed group differences. This directly answers the following core question: are descriptive differences between school types an artifact of data suppression?
We acknowledge that an alternative approach would be to conduct a standard difference-in-means test assuming a common underlying variance structure for each group. However, that method requires either (a) access to raw student-level data to compute within-group variances, or (b) strong distributional assumptions (e.g., homoskedasticity) that are difficult to verify given the public reporting format of EDFacts. In contrast, our approach provides a conservative, transparent method for bounding the impact of missingness in the data.
To interpret significance, we compare observed differences in average proficiency between school types to the group-level standard error. We use critical values of 1.645, 1.96, and 2.576 to represent 90%, 95%, and 99% confidence thresholds, respectively.
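The following sketch puts these pieces together; the group data, the treatment of exactly reported values as carrying zero uncertainty, and the quadrature combination of the two groups’ standard errors are assumptions of the illustration rather than reproductions of the EDFacts pipeline.

```python
import math

def weighted_mean(schools):
    """Participation-weighted mean proficiency for a list of school records."""
    total_p = sum(s["participants"] for s in schools)
    return sum(s["proficiency"] * s["participants"] for s in schools) / total_p

def group_standard_error(schools):
    """Group-level SE driven by suppressed schools' scaled uncertainty (MoE x P);
    schools reporting exact values (moe=None) contribute zero uncertainty."""
    total_p = sum(s["participants"] for s in schools)
    suppressed_var = sum(
        (s["moe"] * s["participants"]) ** 2 for s in schools if s["moe"] is not None
    )
    return math.sqrt(suppressed_var) / total_p

def significance_level(difference, se):
    """Compare an observed difference in group means to the suppression-driven SE."""
    z = abs(difference) / se
    if z >= 2.576:
        return "99%"
    if z >= 1.96:
        return "95%"
    if z >= 1.645:
        return "90%"
    return "not significant"

# Hypothetical groups; proficiency values, counts, and MoEs are placeholders.
aics = [
    {"proficiency": 62.0, "participants": 120, "moe": None},
    {"proficiency": 54.5, "participants": 25, "moe": 4.5},
]
district = [
    {"proficiency": 51.0, "participants": 400, "moe": None},
    {"proficiency": 44.5, "participants": 30, "moe": 4.5},
]

diff = weighted_mean(aics) - weighted_mean(district)
# Combine the two groups' SEs in quadrature (an assumption of this sketch).
se_combined = math.sqrt(
    group_standard_error(aics) ** 2 + group_standard_error(district) ** 2
)
print(significance_level(diff, se_combined))
```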
Finally, we emphasize that all findings are descriptive rather than causal. We do not attempt to explain why differences exist between school types, only whether they are likely to be robust to the uncertainty introduced by suppression in the data. Factors such as student demographics, school funding, or instructional design may underlie observed differences, but are not tested in this analysis. For full tables of standard errors and school-type comparisons, please refer to Appendix A.
4.2. Results: AICS Excel Academically Across Time Periods and Demographic Groups
As Table 8 shows, pre-COVID-19 proficiency data using NAEP-adjusted Z-scores, with district schools set as the baseline, show that in 2018–19 AICS outperformed the district baseline across all populations in English Language Arts (ELA), though the differences were not significant for Asian students and only marginally significant (p = 0.05) for Black students. AICS also outperformed other charter school models in most categories, though not in all grades. As Table 9 shows, three years later, following the COVID-19 pandemic, AICS outperformed both district schools and other charter school models in ELA across all populations, generally with higher levels of significance. Again, it is important to note that these are not causal claims, but descriptive observations based on standardized test data.
As Table 10 and Table 11 show, we see a similar trend for mathematics. As Table 10 shows, before the COVID-19 pandemic, AICS outperformed both district and other charter schools across all populations, though again the differences were not statistically significant for Asian students. As Table 11 shows, after the pandemic AICS outperformed both district schools and other charter school models in mathematics across all populations, generally with higher levels of significance than before the pandemic.