Evaluating Development of Empirical Estimates Using Two Top-Down Methods at Midstream Natural Gas Facilities

To align with climate initiatives, multiple reporting programs are transitioning from generic activity-based emission factors to site-specific measured emissions data to estimate greenhouse gas emissions at oil and gas facilities. This study contemporaneously deployed two top-down (TD) aerial methods across 14 midstream facilities, building upon previous research in the field. The methods produced multiple whole-facility estimates at each facility, resulting in 773 individual paired estimates (same facility, same day), and robust mean estimates for each facility. Mean estimates for each facility, aggregated across all facilities, differed by nearly 2:1 (49% [32% to 69%]). At 6 of 14 facilities, the methods produced mean estimates that differed by more than a factor of two. These data suggest that one or both methods did not produce accurate facility-level estimates at a majority of facilities and in aggregate across all facilities. The overall results are augmented with two case studies where TD estimates at two pre-selected facilities were coupled with comprehensive onsite measurements to understand the factors driving the divergence between TD and bottom-up (BU) emissions estimates. In 3 of 4 paired comparisons between the intensive onsite estimates and one of the TD methods, the intensive onsite surveys did not conclusively diagnose the difference in estimates. In these cases, our work suggests that the TD methods mis-estimate emissions an unknown fraction of the time, for unknown reasons. While two methods were selected for this study, it is unlikely that the issues identified here are confined to these two methods; similar issues may exist for other similar whole-facility methods on midstream and/or other facility types.
These findings have important implications for the construction of voluntary and regulatory reporting programs that rely on emission estimates for reporting fees or penalties, or for studies using whole-facility estimates to aggregate TD emissions to basin or regional estimates.


Introduction
The global initiative to reduce methane emissions from the energy sector has gained substantial momentum because of its pivotal role in combating global warming [1]. Methane is the second-most prevalent greenhouse gas (GHG) after CO2, with a global warming potential (GWP) 86 times that of CO2 over a 20-year time frame [2]. Accurately quantifying these emissions is a fundamental prerequisite for effective methane emission reduction policies. The natural gas system, spanning production and processing through transmission and distribution, is the second largest anthropogenic source of methane in the U.S. [3]. This study focuses on the midstream sector, which includes the gathering and processing (G&P) and transmission and storage (T&S) segments. Midstream facilities are complex and typically include gas treating and compression equipment; the largest sources of methane emissions are compressors and compressor drivers, particularly reciprocating engines [4,5]. This study enrolled five G&P and nine T&S facilities operated by six different midstream companies.
Methane emissions at midstream facilities are typically divided into three categories [5–7]. Fugitive emissions, at times called 'leaks', are unintended, unknown releases of unburned gas. Venting is the release of unburned gas for known and planned reasons, such as maintenance activities or to drive valve controllers or pumps. Combustion, or combustion slip, refers to the fraction of fuel gas that remains uncombusted during any combustion process, such as in engine or turbine exhaust, heaters, or other process equipment. All combustion processes have some slip.
For the last decade, the U.S. Environmental Protection Agency (EPA) has required annual reporting of GHG emissions through the Greenhouse Gas Reporting Program (GHGRP) [7], which relies on activity-based, or bottom-up (BU), inventory methods. Policymaking decisions frequently hinge on these reported inventory estimates [8]. Recent regulations are shifting toward requiring empirical data at facilities. In 2022, the EPA proposed incorporating empirical data to improve the accuracy of inventories, recognizing advances in top-down (TD) technologies for reliable methane emission monitoring and quantification [9].
In the past, there was no direct means to validate the accuracy of the BU inventory methodology. A major investment by the Advanced Research Projects Agency-Energy (ARPA-E) has stimulated substantial commercial investment over the last decade in measurement methods that assess emissions using remote sensing techniques, including drones, manned aircraft, and satellite technologies. The increase in investment and method development has been accompanied by extensive testing programs using a wide variety of protocols [10,11], as well as a major Department of Energy-funded program [12,13] to develop standardized test protocols.
These methods are often contracted to conduct emission assessments at facilities, providing a comprehensive "snapshot" of whole-facility emissions. It is now widely recognized that traditional BU inventories often underestimate emissions relative to TD methods [14–21]. While it is evident that BU inventories might miss or underestimate certain emission sources, a critical question arises: Can we place full confidence in the accuracy of TD technologies in capturing the emissions profile of a facility? Many of these technologies have undergone controlled testing [22–24], but these tests typically use single point sources under near-ideal conditions, which may not faithfully replicate the complexities of midstream facilities.
Other relevant studies have also explored the variability of methane emission estimation methodologies. For instance, Daniels et al. utilized multi-scale technologies in the production segment and found substantial discrepancies in TD methods, with variations spanning over three orders of magnitude [25,26]. Similarly, Stokes et al. [27] deployed two distinct aircraft-based emission measurement systems at tank battery sites and observed discrepancies in their respective estimates of total emissions. However, a notable gap remains in the literature when it comes to extensive comparisons between different TD methods. In-depth analyses directly comparing the outcomes of various TD approaches are limited, especially in terms of investigating the factors contributing to the observed differences. This research aims to bridge that gap by conducting a detailed examination of two widely used TD methods, aerial Light Detection and Ranging (LiDAR) (Solution 1) [28] and drone-mounted flux plane mass balance (Solution 2) [29], and systematically dissecting the underlying sources of variation between their estimates.
While there are no limits on emissions, recent regulatory moves have established penalties for exceptionally large emissions [30]. With these regulations mandating empirical data and imposing methane fees, the accuracy of methane estimates gains heightened significance. Disagreement between TD methods measuring the same facility at the same time indicates inaccuracy in one or both of the TD methods. Resolving this issue is paramount to ensure that a site's emission estimates are both accurate and comparable across different methods, for all types of facilities.
This study is part of the Quantification, Monitoring, Reporting and Verification (QMRV) research and development (R&D) program [31], which is structured into three distinct phases: baseline, enhanced monitoring, and End-of-Project (EOP) verification. This program deploys multiscale measurement technologies to quantify emissions and spans all sectors of the natural gas supply chain. Recent studies, Daniels et al. [25] and Wang et al. [26], present results from the QMRV program focused on the production sector.
In the baseline phase, operators estimated facility emissions (for a 24-h period) using BU inventory methodologies, while two TD technologies simultaneously conducted whole-facility emissions estimates. This setup facilitated a TD-BU comparison per facility. Data were collected at 15 facilities, all within the midstream sector. The results from the baseline phase are presented in Brown et al. [32], and a short summary is provided below: There was systematic disagreement between the TD and BU methods in 40 of the 43 comparisons, with the TD estimates being statistically higher than inventory estimates.
Despite the relatively small sample size of 28 paired measurements, the prior baseline study provided sufficient data to identify a significant and non-trivial issue: the substantial disagreement between the two popular, whole-facility, TD methods for these midstream facilities. This research aims to examine the divergence between these TD methods. To achieve this, we engaged the same two TD methods to perform more extensive repeat measurements at each facility during the EOP project phase. Additional repeat estimates address a fundamental experimental question from the prior study: Is the inconsistency observed between the TD methods a consequence of limited sample size, or a more substantial disagreement between the methods? It is worth highlighting that one facility was sold during the QMRV project term; thus, the facility did not participate in EOP measurements, leading to a reduction in the total number of facilities from 15 to 14.

Methods
This analysis focuses on two datasets. First, the two TD methods were deployed contemporaneously and provided multiple whole-facility estimates of emissions. Throughout this analysis, we refer to two separate deployments of these (and auxiliary) measurement methods. The baseline deployment was conducted early in the QMRV project. The results of this phase were presented in Brown et al. [32] and will be utilized here as well. The second deployment was the EOP, 9 months after the baseline. The EOP deployment made adjustments relative to the baseline deployment to support the study presented here.
The second data source is an operator-provided BU inventory estimate that corresponds to the facility's operational state during the measurement period, using either the same methodology as in the baseline deployment or an updated methodology including more measurement data. The inventory estimate included information about the facility operations during the measurement day: the operational configuration for the 24-h period and the timing of any episodic or upset events.
Although both methane and carbon dioxide are important to the overall study and QMRV program, this study (EOP estimates) focuses on methane emissions only; unless specifically indicated, this paper uses the term emission to mean methane emissions.

Field Deployment for EOP
All methods were deployed simultaneously on a single day, with two exceptions: adverse weather conditions prevented Solution 1 from conducting simultaneous flights at two facilities, and these were completed on the following day without any noted changes in facility operations. Other exceptions are noted in SI Section S2. A typical measurement day follows Brown et al. [32] and is outlined in SI Section S1.

Bottom-Up Estimate
Operators were instructed to operate the facility as normal, so that the EOP estimates would provide a "snapshot" of methane emissions during typical operations. Operators of the enrolled assets estimated the facility emissions for the 24-h day using BU methodologies. Typically, these estimates were based on equipment count and GHGRP factors plus supplemental emissions from either a calculation or additional measurements [32].
Although these inventory methods often result in underestimations of emissions compared to TD methods, they serve as a valuable baseline for understanding a facility's emissions profile. Given that these facilities had 1 to 15 compressors, combustion slip is a substantial emission source, particularly if reciprocating engines are used to drive operational compressors [4,19]. When a compressor is in a pressurized standby mode (prepared for operation but not actively running), compressor vents may also be a large source of methane emissions. Except in rare cases where there is an unusual process failure or large leak, emissions from compressors are often a more substantial contributor of emissions at midstream facilities than fugitive emission sources [6]. Midstream facilities may undergo operational state changes in response to varying demands throughout the day, underscoring the significance of monitoring compressor states for the inventory during the measurement period. A significant effort was made to prevent the TD technologies from measuring temporary blowdown events.
The prior study discusses a comparison between the BU estimate and TD estimates [32]. For this study, the primary use of the BU estimate was to estimate the change in facility emissions due to operational state differences between the baseline and EOP deployments. These estimates illuminate whether emissions should have increased or decreased between field deployments, which is critical for interpreting the results from the TD methods.

Top-Down Methods
Two aerial methods, Solution 1 and Solution 2, were contracted to estimate a snapshot of emissions at the enrolled facilities during both baseline and EOP deployments. This study did not assess the methodologies of the two technologies; instead, it simply evaluates the output of these solutions as an operator would in real-world scenarios.
Solution 1 operates a remote sensing technology that is deployed via aircraft for methane emission detection and quantification. Using a laser, it measures path-integrated methane gas concentration between the aircraft and the ground. This technology differentiates emission plumes from background methane levels to detect and quantify concentrated point source emissions. Solution 1's proprietary algorithms utilize wind data from a nearby weather station to compute emission rates.
Solution 1 was instructed to perform as many whole-facility surveys as possible in a single day, resulting in an increased number of estimates in the EOP relative to the baseline deployment. The method for grouping data into whole-facility estimates is described in SI Section S1.1. The uncertainty for Solution 1 was estimated using the relative error in 857 controlled release tests by Bell et al. [22]. See SI Section S1.2.
Solution 1 states an emission rate detection sensitivity of 3 kg/h with 90% probability of detection under typical conditions [22]. Solution 1 estimates were therefore adjusted to include emissions that were known to exist but were below the detection sensitivity and may not have been detected by Solution 1's measurement technology. Emissions from a range of sources, including fugitive leaks <1 kg/h, combustion slip from turbines, and non-compressor sources such as heaters and flares, typically register below Solution 1's detection sensitivity and were added to the facility-scale estimates from Solution 1. These additional sources were quantified using the BU methodology, following the same approach as in Brown et al. [32]. Added sources varied between facilities (SI Section S2). All references to Solution 1 results in this paper imply that these adjustments have supplemented Solution 1's estimates. The exact uncertainty related to these adjustments was not available in the BU estimate, and these adjustments were made without uncertainty estimates.
Solution 2 was implemented as described in Corbett and Smith [23]. The method uses a drone-mounted spectroscopy sensor to fly downwind passes through emission plumes at varying heights, a technique commonly known as a 'flux plane'. The flux plane starts at the lowest altitude, flying horizontally until the methane concentration reaches background levels. Subsequently, it repeats this curtain pattern at incremental altitudes until reaching the highest altitude. Estimated emission rates for all sources upwind of the flux plane are calculated by multiplying concentration estimates throughout the flux plane by the normal wind speed through the flux plane and integrating all estimates across the flux plane. Wind data originated from a portable onsite anemometer with additional data from corrections applied by the drone's autopilot system. In principle, Solution 2 detects emissions from all sources located upwind of the flux plane that were active within the time frame of emission transport. The uncertainty for Solution 2 was estimated using the relative error in 12 controlled release tests [23]. See SI Section S1.2.
In a typical deployment, Solution 2 estimates emissions from individual "zones" or equipment groups, which would then be summed to create a whole-facility estimate; this method was utilized for the baseline deployment. For the EOP estimates, Solution 2 was instructed to make more comprehensive whole-facility surveys that capture all facility emissions, rather than summing individual zones. In some cases, the meteorological conditions and/or the footprint of the enrolled facility prevented Solution 2 from estimating the entire facility with one flux plane. It was left to the discretion of the Solution 2 field team whether they could fly a whole-facility estimate in one flux plane flight, or if multiple flights (all downwind of the facility) were needed to capture the entire facility. Therefore, at some facilities (SI Section S2), a single Solution 2 whole-facility estimate may include multiple flux plane flights.
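The flux plane calculation described above can be illustrated with a minimal sketch. This is not Solution 2's implementation, which is not described in detail here; the grid-cell layout, the function names, and the default pressure and temperature are all assumptions chosen for illustration. The sketch converts each methane enhancement above background to a mass density with the ideal gas law, multiplies by the wind component normal to the plane and by the cell area, and sums over the plane.

```python
# Illustrative flux-plane mass balance; names and defaults are hypothetical.
R = 8.314462618   # J/(mol K), universal gas constant
M_CH4 = 0.01604   # kg/mol, molar mass of methane


def ppm_to_kg_per_m3(ppm, pressure_pa=101325.0, temperature_k=288.15):
    """Convert a methane mole fraction (ppm) to mass density via the ideal gas law."""
    mol_air_per_m3 = pressure_pa / (R * temperature_k)
    return ppm * 1e-6 * mol_air_per_m3 * M_CH4


def flux_plane_rate_kg_per_h(cells, background_ppm):
    """Integrate (enhancement x normal wind x cell area) over the plane.

    cells: iterable of (ch4_ppm, normal_wind_m_s, width_m, height_m) tuples,
           one per grid cell of the vertical plane (assumed layout).
    """
    total_kg_per_s = 0.0
    for ch4_ppm, wind_normal, width, height in cells:
        enhancement = ppm_to_kg_per_m3(ch4_ppm - background_ppm)
        total_kg_per_s += enhancement * wind_normal * width * height
    return total_kg_per_s * 3600.0  # kg/s -> kg/h


# Hypothetical plane: one cell inside the plume, one at background.
example_cells = [(12.0, 3.0, 10.0, 5.0), (2.0, 3.0, 10.0, 5.0)]
example_rate = flux_plane_rate_kg_per_h(example_cells, background_ppm=2.0)
```

Because the estimate is linear in the normal wind speed, any error in the wind measurement carries directly into the emission rate in a calculation of this form.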

Results and Discussion
We address the comparison of the TD methods in four analyses:
1. Method consistency: Do the methods exhibit similar changes between baseline and EOP deployments?
2. Per-facility results: How do the methods compare when evaluated at each facility? Did the methods produce repeatable results during the EOP?
3. Aggregate comparison: If all 14 facilities were owned by one operator, what would be the difference between the TD methods?
4. Analyzing disagreement: We present two case studies of intensive contemporaneous ground estimates to analyze disagreements between the methods.
A key issue that emerged from the first paper, Brown et al. [32], was the disagreement between the two TD methods. Consequently, this paper specifically tackles the challenges associated with TD methods during a day of measurements, where operational states can be controlled through onsite observations.

Method Consistency
Table 1 provides information about the facilities enrolled, including the operational state and the BU inventory estimate for the baseline and EOP measurement days. This insight is crucial for interpreting the expected difference in emissions between the two field deployments. While the daily BU estimates may not accurately reflect emissions at the facility (see Brown et al. [32]), changes in the BU estimate provide an estimate of how emissions likely differed between baseline and EOP based on operational conditions, which are independent of either TD method. Most BU estimates in this study utilized prior onsite direct measurements of emissions (stack tests, high-flow sampler, etc.) rather than emission factors, and are therefore site-specific estimates. As a result, facilities may have the same number of compressors operating in both baseline and EOP deployments but have different compressors running, creating substantially different BU emissions estimates.
We used the change in BU estimates (Delta BU) between baseline and EOP to improve comparisons; subtracting Delta BU from the EOP estimates corrects for expected differences in emissions between the two deployments. In the baseline, the per-facility estimate from Solution 2 is the sum of the zones, and in the EOP it is the average of all whole-facility estimates. In both baseline and EOP cases, the per-facility mean estimate from Solution 1 is the average of all whole-facility estimates. The difference between the baseline and EOP estimates for each TD method will be referred to as the 'delta difference'. Figure 1 displays the delta difference for estimates from both TD methods. At 11 of the 14 facilities, the methods closely followed the parity line (within ±20%), indicating that both methods exhibited a similar delta difference, i.e., both methods displayed similar changes in emissions baseline-to-EOP. The three outliers, labeled on Figure 1, draw attention to instances where one method exhibited minimal change baseline-to-EOP, while the other experienced a substantial change.
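The Delta BU correction can be sketched in a few lines. The sign convention below (EOP minus baseline) and the function names are assumptions made for illustration, and the numbers in the example are hypothetical rather than study data.

```python
def delta_bu(bu_baseline, bu_eop):
    """Expected change in emissions from operational state alone (kg/h)."""
    return bu_eop - bu_baseline


def delta_difference(td_baseline, td_eop, bu_baseline, bu_eop):
    """Baseline-to-EOP change in a TD estimate after removing Delta BU.

    Assumed convention: (EOP estimate - Delta BU) - baseline estimate.
    """
    corrected_eop = td_eop - delta_bu(bu_baseline, bu_eop)
    return corrected_eop - td_baseline


def mean_of_estimates(estimates):
    """Per-facility mean of repeated whole-facility estimates (kg/h)."""
    return sum(estimates) / len(estimates)


# Hypothetical facility: the BU inventory rises by 40 kg/h between deployments,
# while one TD method's mean estimate rises by 50 kg/h.
example = delta_difference(
    td_baseline=100.0,
    td_eop=mean_of_estimates([140.0, 150.0, 160.0]),
    bu_baseline=80.0,
    bu_eop=120.0,
)
```

A delta difference near zero means the TD method changed by about as much as the operational state change predicts; plotting one method's delta difference against the other's gives the parity comparison of Figure 1.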
The most significant disagreement in delta difference occurred at facilities D and L:

• Facility D: Solution 2 estimated an increase of 624 kg/h in emissions baseline-to-EOP. During the EOP, the onsite observer noted that wind speeds picked up early in the morning, with Solution 2 managing to complete two estimates before the winds became too strong. Additionally, the wind direction shifted from due south to due west between the two estimates. Results changed by 356 kg/h between the two estimates, while no noticeable changes occurred in the facility operations during this period. These factors suggest that wind conditions may impact TD results.

• Facility L: At this T&S facility, fewer compressors were in operation during the EOP than during the baseline deployment. Solution 1 estimated a delta difference in emissions of −556 kg/h. In contrast, Solution 2's estimates remained consistent between the two days, with a delta difference of less than 50 kg/h. During the baseline measurement survey, Solution 1 detected a substantial emitter located near a tank, with an average emissions rate of 410 kg/h. This same source location was detected during the EOP survey at a mean rate of 200 kg/h. Additionally, another significant emitter located on the compressor building was detected during the baseline survey, with an estimated emission rate of approximately 300 kg/h, but it was not detected during the EOP survey. These results reflect either marked changes in Solution 1's detection and quantification or changes in actual emission rates, which did not impact Solution 2's estimates for unknown reasons.

These two cases suggest that the delta difference in measured emissions may be due to (a) changes in facility operations or emission sources, (b) variations in the performance of the TD methods in varying environmental conditions, or (c) inherent differences between the two TD methods regardless of wind conditions. Explanation (a) requires coordinated observation, likely with robust measurement capability, on the facility to identify changes in emission sources or rates. Explanation (b) would require robust quality indicators incorporated into reported results from the TD methods; in this QMRV deployment, neither solution provided these indicators. A quality indicator could identify measurements with elevated uncertainty levels due to sensor and/or wind speed conditions outside a specified range. Explanation (c) would highlight a need to improve method performance, including corrections for conditions that are poorly estimated by a solution.
More generally, an operator deploying only one TD method will have no information to determine if a measured facility is one of the 11 (79% of facilities) that showed similar delta differences between methods, or one of the three (21%) that showed dissimilar delta differences between methods, and therefore would have few clues to differentiate between a decrease in the quality of TD method results and a major change in facility emissions.

Per-Facility Analysis
During the baseline phase, the methods statistically agreed in 2 out of 28 comparisons using the 2-sided Kolmogorov-Smirnov (KS) test, α = 0.05. The first analysis compares the mean of all measurements by each method, during the EOP, at each facility. SI Table S1 presents all statistical test results; highlights are below:

• The methods agree by the 2-sided KS test (distribution shape) and t-test (distribution mean) at 1 out of 14 facilities, and they agree by the Wilcoxon test (distribution median) at 3 facilities.

• The mean values from the methods overlap in the 95% confidence interval (CI) at 6 out of 14 facilities; however, it is important to note that this comparison does not imply equality; see SI Section S3.

Outside of research projects, it is improbable that any operator would engage a TD method to conduct multiple measurements at a single facility, as observed in this study. Therefore, it is crucial to focus on comparing the mean values of individual estimates, each of which would represent a single measurement at a facility. For regulatory or voluntary reporting, operators will likely use only one TD estimate and report one mean value. To explore agreement between estimates, the same statistical tests were performed comparing each individual estimate at a facility from one method to all individual estimates from the other method, resulting in 773 pairwise comparisons. In the pairwise testing, using the 2-sided KS test (α = 0.05) as an example, no paired tests showed agreement at half (7) of all facilities; greater than 15% of paired tests showed agreement at 4 facilities; and 40% of paired tests agreed at one facility. See SI Table S2 for the number of comparisons and results for each facility.
Had an operator deployed either method at any given facility, the mean of any one of the estimates from either TD method would represent a valid number to report under regulatory reporting. To assess the range of reported estimates, we simulated 1000 possible field campaigns. In each simulation, one estimate from each method was randomly selected, and the difference in the mean estimates between the two methods was calculated for each iteration. The resulting 95% CI of these differences provides an estimate of how much the reported emissions could vary, based solely upon the choice of measurement vendor ('Reporting Range' in Table 2). The reporting range at individual facilities varies widely, from as low as 0 to 7 kg/h up to as high as 53 to 687 kg/h. This range is substantial, particularly considering that a single mean value from either TD method has significant implications for methane fee calculations under future U.S. regulations [8]. Additionally, should one of these facilities be included in a supply route calculation for life cycle or supply-chain assessments, changes in reported emissions could have a material impact on the resulting emission intensity, with resultant contractual or financial issues.
Given the observed disagreement between the methods, it is also interesting to look at the agreement within one method's estimates. For this analysis, we consider the set of all estimates made by one method at one facility. If the estimates in this set agree with one another, the method produces repeatable results with relatively small variations in the mean. When averaged, the mean will exhibit reduced uncertainty bounds relative to the individual estimates. Excluding the facilities that underwent operational changes during the measurement period (3 facilities), this is true at 11 out of 11 facilities for Solution 1 and 9 out of 11 facilities for Solution 2; see SI Table S3. At the remaining facilities (2 for Solution 2), relative uncertainty of the average exceeds relative uncertainty of the individual estimates. This result is counter to conventional expectations that increased measurements will reduce uncertainty.
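The campaign simulation described above can be sketched as a simple resampling loop. This is an illustration under stated assumptions, not the study's code: the function name, the use of the absolute difference, the nearest-rank percentile rule, and the example estimates are all hypothetical.

```python
import random


def reporting_range(estimates_a, estimates_b, n_campaigns=1000, seed=0):
    """Approximate 95% interval of |b - a| when one estimate per method is drawn.

    estimates_a, estimates_b: whole-facility estimates (kg/h) from each method
    at one facility. Taking the absolute difference is an assumption here.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = sorted(
        abs(rng.choice(estimates_b) - rng.choice(estimates_a))
        for _ in range(n_campaigns)
    )
    # Simple nearest-rank percentiles over the sorted differences.
    low = diffs[int(0.025 * (n_campaigns - 1))]
    high = diffs[int(0.975 * (n_campaigns - 1))]
    return low, high


# Hypothetical estimates for one facility (kg/h), not study data.
low, high = reporting_range([10.0, 20.0, 30.0], [100.0, 200.0])
```

Each simulated campaign corresponds to an operator who hired one vendor once; the width of the resulting interval reflects how strongly the reported number depends on which single estimate happened to be taken.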

Aggregate Method Comparison
The extended measurements conducted during the EOP deployment provided a more robust comparison between the two TD methods. At a single facility, Solution 1 produced anywhere from 2 to 31 whole-facility emissions estimates. Solution 1 produced more than 10 whole-facility emissions estimates at 12 facilities, while Solution 2 produced 2 to 6 whole-facility emissions estimates per facility. In contrast, during the baseline period, Solution 1 produced one or two whole-facility estimates, while Solution 2 produced one.
Figure 2 presents a Bland-Altman difference plot, illustrating a comparison between the two TD methods. The measurement emissions check (MEC) represents the average of the mean values from the TD methods and is utilized as the X-axis and to normalize the relative difference on the Y-axis. Each data point corresponds to a mean estimate made at one facility, and error bars denote a 95% empirical confidence interval associated with each estimate.
The concurrent deployment of the two methods at the same facilities assumes that they (a) captured the same emissions profile and (b) should produce the same estimated emissions. Solution 2 exhibited a 28% [5.7% to 52%] higher mean compared to Solution 1 during the baseline deployment, and in the EOP Solution 2 displayed a 49% [32% to 69%] higher mean relative to Solution 1, as seen in Figure 2. These data indicate that the difference seen in the baseline was not due to a limited number of samples. Instead, the increased number of measurements from both TD methods demonstrates that with repeat measurements there is a larger disagreement between methods.
The aggregate of all 14 facilities is indicative of what an operator would report, had this been one operator's asset base. In aggregate, the discrepancy between the two methods increased from the baseline (966 [197 to 1756] kg/h over 15 facilities, or 64 [13 to 117] kg/h per facility) to the EOP (1470 [972 to 2029] kg/h over 14 facilities, or 105 [69 to 145] kg/h per facility). In total, in both project deployments Solution 1 consistently estimated lower emissions than Solution 2. During the EOP, Solution 2's mean estimates were two times or greater than Solution 1's mean estimates at 6 out of 14 facilities.
Given the magnitude of disagreement when comparing pairs of single estimates, the means of multiple estimates at a facility, and the aggregate mean estimates across multiple facilities, one or both methods cannot provide accurate estimates of emissions at these midstream facilities. While this study focused on two methods, few similar comparisons have been completed with other methods in the field or in controlled test conditions. Therefore, this type of unpredictable inaccuracy, in either single estimates or aggregate results, may occur with other solutions, at different types of facilities, and/or under varied environmental conditions.
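The coordinates of each point on a plot of this kind can be sketched as follows. The function name and the sign convention (Solution 2 minus Solution 1) are assumptions for illustration, and the inputs in the example are hypothetical rather than values read from Figure 2.

```python
def bland_altman_point(mean_solution_1, mean_solution_2):
    """Return (MEC, relative difference) for one facility.

    MEC is the average of the two per-facility means (kg/h); the relative
    difference is the between-method difference normalized by the MEC.
    The sign convention (Solution 2 - Solution 1) is assumed.
    """
    mec = (mean_solution_1 + mean_solution_2) / 2.0
    relative_difference = (mean_solution_2 - mean_solution_1) / mec
    return mec, relative_difference


# Hypothetical per-facility means (kg/h), not study data.
mec, rel_diff = bland_altman_point(100.0, 200.0)
```

Normalizing by the MEC rather than by either method's own mean treats the two methods symmetrically, since neither is assumed to be the reference.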

Identifying Sources of Disagreement
To understand the disagreement between TD methods, the EOP deployments included enhanced on-ground emissions detection and quantification efforts, as suggested in Brown et al. [32].
Two facilities (E & N) were volunteered by their operators for enhanced diagnostics after the baseline deployment and before the start of the EOP deployment.At these facilities, intensive BU screening was performed to supplement the TD methods.Facility N is a simple transmission site featuring two turbines (32,000 HP/24 MW) driving centrifugal compressors, along with minimal additional equipment, including inlet scrubbers and yard piping.Facility E is a complex gas processing plant with 15 compressors (total of >40,000 horsepower/35 MW) and processing equipment to upgrade gas to pipeline quality, including dehydrators, heaters, acid gas removal units, and associated flares and tanks.These two facilities offer a unique perspective, representing opposite ends of complexity within the midstream sector.In both cases, the enhanced surveillance measured all known sources, using a ground team performing optical gas imaging (OGI) and persource measurements of all detected sources using high-flow samplers.Additionally, the compressor engines at facility E had stack testing and crankcase vent measurements conducted on the measurement day.These two facilities provide compact case studies of issues when using TD methods to estimate emissions for midstream facilities.
At facility N, the two turbine-driven compressors were operating during measurements. Major expected emissions were centrifugal compressor venting and other fugitive leaks (e.g., isolation valve leaks); see SI Section S5.1. Turbine exhaust has minimal combustion slip [6].
On the compressor building, Solution 1 detected emissions in a location suggestive of emissions from the centrifugal compressors' seal vents. The average Solution 1 estimate for each seal vent was 6 kg/h, in comparison to the BU estimate of 5.15 kg/h for each seal vent. Solution 1 detected emissions in the area of the blowdown stacks and reported an average of 2 kg/h, roughly four times the high-flow sampler estimate of 0.45 kg/h for each of the two compressor blowdown stacks. Combined, the BU estimate for total facility emissions was 12 kg/h, which is in relatively good agreement with Solution 1's mean emission estimate of 17 [15.2 to 19.2] kg/h, but still outside the CI of the TD method. In contrast, Solution 2 conducted three separate downwind flux measurements at this facility, yielding a mean estimate of 130 [88.7 to 180] kg/h, over seven times the Solution 1 and BU estimates.
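The facility N comparison can be checked with a few lines of arithmetic. The sketch below uses only the values quoted in the preceding paragraph; the variable and function names are illustrative and do not come from the study.

```python
# Facility N estimates quoted in the text (kg/h). Names are illustrative.
BU_TOTAL = 12.0
SOL1 = {"mean": 17.0, "lo": 15.2, "hi": 19.2}
SOL2 = {"mean": 130.0, "lo": 88.7, "hi": 180.0}


def within_ci(value, est):
    """Return True if value lies inside the estimate's confidence interval."""
    return est["lo"] <= value <= est["hi"]


def intervals_overlap(a, b):
    """Return True if the two confidence intervals share any value."""
    return a["lo"] <= b["hi"] and b["lo"] <= a["hi"]


bu_in_sol1_ci = within_ci(BU_TOTAL, SOL1)
ratio_sol2_to_sol1 = SOL2["mean"] / SOL1["mean"]
ratio_sol2_to_bu = SOL2["mean"] / BU_TOTAL
td_cis_overlap = intervals_overlap(SOL1, SOL2)

print(f"BU inside Solution 1 CI: {bu_in_sol1_ci}")
print(f"Solution 2 / Solution 1: {ratio_sol2_to_sol1:.1f}x")
print(f"Solution 2 / BU:         {ratio_sol2_to_bu:.1f}x")
print(f"TD intervals overlap:    {td_cis_overlap}")
```

The BU total falls below Solution 1's interval, and the two TD intervals do not overlap, which matches the qualitative description in the text.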
We consider two possibilities for the disagreement between the two TD methods.

1. Solution 2 either detected emissions transported from upwind, or encountered difficulties in capturing downwind emissions on multiple occasions. The probability and impact of these issues remain unknown.

2. Both Solution 1 and on-ground teams missed onsite sources totaling 113 to 118 kg/h (mean). Solution 1 would miss sources if they were (a) exceptionally diffuse and did not form visible plumes (sources of this type were not observed by the ground-based monitoring methods or crew) or (b) all below Solution 1's emission rate detection sensitivity (1-3 kg/h). Case (b) would require between roughly 40 sources of 3 kg/h each and 120 sources of 1 kg/h each, none of which were detected by on-ground teams. Both of these scenarios appear improbable.
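The source counts in the second possibility follow from dividing the unexplained emissions by the detection sensitivity. A minimal sketch, assuming every missed source emits exactly at one end of the 1-3 kg/h sensitivity range (names are illustrative):

```python
import math

# Facility N mean estimates quoted in the text (kg/h).
SOL2_MEAN = 130.0
SOL1_MEAN = 17.0
BU_TOTAL = 12.0


def sources_needed(gap_kg_h, per_source_kg_h):
    """Smallest whole number of equal-sized sources that covers the gap."""
    return math.ceil(gap_kg_h / per_source_kg_h)


gap_vs_sol1 = SOL2_MEAN - SOL1_MEAN  # emissions Solution 1 would have missed
gap_vs_bu = SOL2_MEAN - BU_TOTAL     # emissions the ground team would have missed

# Fewest sources: smaller gap, every source at the 3 kg/h upper sensitivity.
fewest = sources_needed(gap_vs_sol1, 3.0)
# Most sources: larger gap, every source at the 1 kg/h lower sensitivity.
most = sources_needed(gap_vs_bu, 1.0)

print(f"Unexplained emissions: {gap_vs_sol1:.0f} to {gap_vs_bu:.0f} kg/h")
print(f"Undetected sources required: {fewest} to {most}")
```

The computed range is consistent with the rounded figures of about 40 and 120 sources given in the text.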
At facility E, 12 compressors were operating during the EOP deployment. While many emission sources could exist, major sources from compressors include compressor driver exhaust, compressor rod packing vents, and crankcase vents (SI Section S5.2). The BU estimate utilized extensive onsite measurements rather than emission factors.
Given the large size of the facility, Solution 2 utilized three flux planes due to drone battery limitations. These zones were designated as follows: Zone 1 encompassing six compressors (five operating), Zone 2 comprising nine compressors (seven operating), and Zone 3 housing a flare and stabilizing equipment. SI Figure S2 represents the layout of facility E with major equipment outlined and Solution 2's flux planes for each zone.
Solution 1 estimated emissions on the 12 operating units; we assume these estimates include emissions from compressor combustion exhaust, nearby compressor rod packing vents, and crankcase vents; see SI Table S4. Solution 1 and the BU estimate's results were partitioned according to the specific zones flown by Solution 2 in order to make zonal comparisons. SI Figure S3 compares each method's estimate by zones; key results are noted here:

• In zone 1, there was agreement overall between Solution 1 and the BU estimate, with the estimates differing by 12 kg/h. Solution 1 estimated 78 kg/h, while the BU estimate was 66 kg/h. Solution 2's estimate was on average 274 kg/h higher than Solution 1's estimate and 286 kg/h higher than the BU estimate.

• In zone 3, where there were no compressors, Solution 1 and the BU estimates agree to within 2 kg/h of each other. Solution 2 on average was 95 kg/h higher than both Solution 1 and the BU estimates.

• In zone 2, which includes seven operating compressors, a significant discrepancy is evident, where Solution 1's estimate surpasses the BU estimates by more than 450 kg/h. Solution 2's and Solution 1's estimates relatively agree in this zone, with a difference of 50 kg/h between the two methods.
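The zone 1 figures above imply a mean Solution 2 estimate that is not stated directly. The sketch below reconstructs it from the two quoted differences as a consistency check; it uses only the zone 1 numbers in the text, and the names are illustrative.

```python
# Zone 1 values quoted in the text (kg/h). Names are illustrative.
SOL1_ZONE1 = 78.0
BU_ZONE1 = 66.0
SOL2_ABOVE_SOL1 = 274.0
SOL2_ABOVE_BU = 286.0

# Reconstruct the implied Solution 2 mean from each quoted difference.
sol2_from_sol1 = SOL1_ZONE1 + SOL2_ABOVE_SOL1
sol2_from_bu = BU_ZONE1 + SOL2_ABOVE_BU

# Both routes should give the same value if the quoted figures are consistent.
consistent = sol2_from_sol1 == sol2_from_bu
ratio_to_sol1 = sol2_from_sol1 / SOL1_ZONE1
ratio_to_bu = sol2_from_bu / BU_ZONE1

print(f"Implied Solution 2 estimate: {sol2_from_sol1:.0f} kg/h")
print(f"Quoted differences consistent: {consistent}")
print(f"Solution 2 / Solution 1: {ratio_to_sol1:.1f}x")
print(f"Solution 2 / BU:         {ratio_to_bu:.1f}x")
```

Both routes give the same implied value, so the quoted zone 1 differences are internally consistent.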
The same supplemental onsite measurements of all known emission sources were made in all three zones. The bulk of the BU-Solution 1 disagreement in zone 2 originates with four compressors estimated at 120 kg/h each. These estimates were substantially in excess of stack tests and nearby point-source measurements, which agreed with Solution 1 estimates in zone 1 on similar equipment. No other large sources are in this vicinity. Given this extensive on-ground measurement work, it is unlikely that any persistent emitter or emitters large enough to account for the BU-Solution 1 disagreement near these compressors would have gone undetected by the BU team making supplemental measurements. The onsite observer noted that zone 2 was in a depression relative to the other two zones. Topography may have impacted wind speeds and emission transport, but there is insufficient data to validate this hypothesis.
Stack testing and extensive ground measurements occurred concurrently. During this time, Solution 1 flew 18 overpasses and produced consistent estimates with a mean of 764 [+8.3%/−6.9%] kg/h. No large transient emitters, which could have skewed the TD results, were identified by Solution 1, and none were noted by the ground team. Therefore, a large, transient emitter(s) does not explain the disagreement.
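For comparison with the other estimates, which are reported with absolute bounds, the relative interval on the Solution 1 mean can be converted to kg/h. This sketch assumes the bracketed percentages are relative upper and lower bounds on the mean, as the notation suggests; the function name is illustrative.

```python
def relative_ci_to_absolute(mean, plus_pct, minus_pct):
    """Convert a mean with +x%/-y% bounds into absolute (lower, upper) bounds."""
    lower = mean * (1.0 - minus_pct / 100.0)
    upper = mean * (1.0 + plus_pct / 100.0)
    return lower, upper


# Solution 1 mean over 18 overpasses at facility E, as quoted in the text.
lower, upper = relative_ci_to_absolute(mean=764.0, plus_pct=8.3, minus_pct=6.9)

print(f"Solution 1 interval: {lower:.0f} to {upper:.0f} kg/h")
```

Under that assumption the interval is roughly 711 to 827 kg/h.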
In summary, this analysis suggests that the discrepancy in measured emissions, particularly in zone 2, is more likely attributable to inaccuracies in the TD method(s) rather than the omission of an emission source by the onsite measurement teams. However, this cannot be conclusively proven.
The above analysis indicates that, for unknown reasons, a whole-facility method may fail, and current quality-control measures are not mature enough to detect the failure. While this study deployed and compared two specific methods, it is highly unlikely that this type of issue is confined to any one solution; any solution could have similar issues under currently unknown conditions. Therefore, the data and analysis suggest that further development of TD analysis algorithms is needed.
Additionally, the extensive onsite measurement work performed at these two facilities is not characteristic of normal operating practice or required by any regulatory or voluntary programs, including the daily BU inventory estimations at the other 12 facilities in this study. This type of diagnostic is both expensive and cumbersome to perform. In Brown et al. [32], the authors suggested that intensive onsite measurements were needed to diagnose disagreements between BU and TD estimates of emissions. Experimentally conducting this type of intensive onsite work at two facilities resulted in 3 of 4 BU-TD comparisons (BU-Solution 2 at both facilities, BU-Solution 1 at Facility E) not conclusively diagnosing disagreements. This illustrates the complexity of using measurements to inform inventories.

Implications
While inventories are evolving to incorporate empirical data, it is important that the data being used are defensible. The results from this study confirm our previous finding that the TD estimates disagree. As the sample size increased through repeated measurements, the disagreement between the methods grew relative to the baseline. Across all 15 facilities in the baseline deployment, the methods exhibited a relative difference of 28% [5.7% to 52%] in the mean, which increased to 49% [32% to 69%] over 14 facilities during the EOP, despite a dramatic increase in the number of estimates by both methods (from 28 total estimates in the baseline to 773 estimates). This becomes particularly significant when considering methane or other GHG fees, as the choice of TD measurement method can lead to substantial variations in the fees assessed.
Increasing the number of TD measurements at a facility presented both methods with more opportunities to capture facility emissions accurately. It also allowed for better control of the analysis to account for changes in operations at the facility and/or transient events, compared to what could be done during the baseline analysis [32]. At 6 out of the 14 facilities, the mean of the TD method estimates differed by at least 2:1; i.e., one or both methods did not provide accurate estimates for nearly half of these midstream facilities. This result has a direct impact on any voluntary or regulatory reporting program focused on per-facility reporting. For example, the EPA's recently proposed 'super-emitter' reporting program and the 'other large release event' reporting [33] would rely on approved anonymous surveillance by third parties to detect and report a "large release event of at least 250 mt CO2e per event or have a methane emission rate of 100 kg/h or greater at any point in time" [9] at oil and gas facilities, including midstream facilities such as a compressor station or natural gas processing plant. Reported emission rates for these large emitters would be used to estimate total emissions, with emissions duration assumed to be a default of 182 days unless the operator can prove otherwise [9]. Over- or underestimation of emissions by tens to hundreds of kg/h, a difference seen multiple times in this analysis, would result in substantial errors in quantification and, given methane fees of USD 1500 per tonne of methane, significant financial impacts for individual facilities and their operators. Similarly, the European Union (EU) is considering methane emission rules covering operators within the EU, and also additional reporting requirements on emissions data for importers of natural gas and liquefied natural gas (LNG) from the exporting suppliers. Voluntary programs do not address the inter-measurement technology considerations presented in this work.
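The scale of the financial impact can be illustrated with the figures quoted above. This is a simplified sketch, not a reproduction of any program's fee calculation: it assumes a constant rate error sustained over the full 182-day default duration and a fee applied to every tonne, with no thresholds or exemptions. The function and constant names are illustrative.

```python
HOURS_PER_DAY = 24
DEFAULT_DURATION_DAYS = 182   # default event duration quoted in the text
FEE_USD_PER_TONNE = 1500.0    # methane fee quoted in the text
KG_PER_TONNE = 1000.0


def fee_impact_usd(rate_error_kg_h, duration_days=DEFAULT_DURATION_DAYS):
    """Fee change caused by a constant emission-rate error (simplified)."""
    hours = duration_days * HOURS_PER_DAY
    mass_error_tonnes = rate_error_kg_h * hours / KG_PER_TONNE
    return mass_error_tonnes * FEE_USD_PER_TONNE


# Rate errors spanning the "tens to hundreds of kg/h" range in the text.
for rate_error in (10.0, 100.0):
    impact = fee_impact_usd(rate_error)
    print(f"{rate_error:>5.0f} kg/h error -> USD {impact:,.0f}")
```

Under these assumptions, a 10 kg/h error corresponds to roughly USD 65,500 and a 100 kg/h error to roughly USD 655,000 per event.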
Given longstanding issues with inventory estimates [14,17,34,35], there is a strong case for utilizing measurements to inform, supplement, or potentially replace inventory estimates.However, in order to rely on measurements, the results of this study indicate methods must improve quantification accuracy, produce more representative uncertainty estimates, and provide robust quality control indicators to identify when estimates are suspect or potentially in error.

† Supply chain sector of the facility: G&P = Gas Processing, T&S = Transmission and Storage.
‡ The type of mover driving the compressor(s) at the facility: Recip = reciprocating (piston) engine, Turbine = combustion turbine.
* For facilities with reciprocating (piston) engines, code indicates the type of engine: 2SLB = two-stroke, lean-burn, 4SLB = four-stroke, lean-burn, 4SRB = four-stroke, rich-burn.
|| If there are two numbers, this indicates the facility changed from state 1 to state 2 during the measurement period.

Figure 1. The horizontal axis is the difference in Solution 1 estimates from the EOP to the baseline; the vertical axis is the difference in Solution 2 estimates from the EOP to the baseline. Both axes are corrected by the difference in the BU estimate (Delta BU). The blue points represent G&P facilities and the orange points represent T&S facilities. Error bars indicate the 95% confidence interval for each estimate.

Figure 2. Comparison of Solution 1 and Solution 2 estimates. The left plot displays baseline data; the right plot displays EOP data. The horizontal axis is the MEC, an average of the TD estimates. The vertical axis is the relative difference between Solution 1's and Solution 2's estimates for each facility. The gray box displays the 95% confidence interval over all facilities. The dashed line at y = 0 indicates perfect agreement.

Table 2. EOP facility estimates and variability between methods.