2.2. Pipeline
The Bifrost ISBAS pipeline (BISBAS; https://github.com/ProfundityOfScope/BISBAS, accessed on 5 July 2024) developed for this work takes the unwrapped interferograms as input and runs through the steps defined in the ISBAS analysis:
Masking of the interferograms based on a minimum coherence or presence of water;
Referencing interferograms to a provided reference point;
Time-series inversion, where the time series is solved for from the interferograms;
Removal of residual “ramps” present in the individual date images;
Estimation of ground velocities from the time series.
The pipeline is assembled from a series of Bifrost ‘blocks’ which perform one or some combination of the above steps. These blocks are a central component of creating a Bifrost pipeline, with a given block typically receiving data from one or more ring buffers, performing some sort of transformation, and outputting the result to another ring buffer.
Figure 1 illustrates how the blocks are connected, with nearly all mathematical operations happening on the GPU side. Data flow through the pipeline in ‘gulps’, which represent all interferogram information for a given set of pixels. The number of pixels, and thus the size of the gulp, can be set manually or determined automatically from the available memory on the GPU. Phase data are read alongside the corresponding coherence and water mask, which are then used to mask pixels where coherence is less than 0.3 or water is present in the NASA SWBD dataset [9].
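The masking rule described above can be sketched in a few lines. This is a minimal CPU illustration with NumPy, not the BISBAS block itself; the function name `mask_gulp` and the array layout (interferograms along the first axis, pixels along the second) are assumptions made for the example.

```python
import numpy as np

COHERENCE_MIN = 0.3  # minimum coherence quoted in the text


def mask_gulp(phase, coherence, water, coherence_min=COHERENCE_MIN):
    """Return a copy of `phase` with masked samples set to NaN.

    phase, coherence: arrays of shape (n_ifg, n_pix) (assumed layout)
    water: boolean array of shape (n_pix,), True where water is present
    """
    # A sample is masked if its coherence is too low or its pixel is water.
    bad = (coherence < coherence_min) | water[np.newaxis, :]
    return np.where(bad, np.nan, phase).astype(np.float32)
```

Marking masked samples as NaN, rather than dropping them, keeps every gulp the same shape, which is what the later GPU steps rely on.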
These masked data are then passed to the GPU, where more math-intensive steps can be performed. The first of these involves referencing the interferograms, effectively subtracting a value from each interferogram pixel such that the phase data near the reference point will be zero in all interferograms. The bulk of computation time during this data reduction is used during the step of time-series inversion, where we solve for the model rates which are used to generate the time-series data from the base interferograms. This typically takes the form of the equation
$$\mathbf{A}\mathbf{x} = \mathbf{b},$$
where the matrix $\mathbf{A}$ represents the difference in dates between the measurements used to generate the interferogram, $\mathbf{x}$ is the model rates and $\mathbf{b}$ is the interferogram data. Typically, solving for $\mathbf{x}$ would be straightforward, since the problem is over-determined, so a least-squares approach would suffice. Complications arise when we introduce masking into the data such that NaN values appear. A typical solution for this would be to remove the offending NaN values from $\mathbf{b}$ and their corresponding rows from $\mathbf{A}$ and then proceed with your solution method of choice (assuming that the resulting matrix is still full rank; otherwise, the pixel is skipped).
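For comparison with what follows, the referencing step and the conventional per-pixel approach (drop NaN observations, check rank, solve by least squares) can be sketched as below. This is an illustrative NumPy version, not the original ISBAS code; the function names, the argument `ref_values` (the phase near the reference point in each interferogram), and the array shapes are assumptions.

```python
import numpy as np


def reference_gulp(phase, ref_values):
    """Subtract each interferogram's reference value from all of its pixels.

    phase: (n_ifg, n_pix); ref_values: (n_ifg,)
    """
    return phase - ref_values[:, np.newaxis]


def solve_pixel_dropping_nans(A, b):
    """Least-squares solve for one pixel after removing NaN observations.

    A: (n_ifg, n_model) design matrix; b: (n_ifg,) data for one pixel.
    Returns None when the truncated system is rank deficient (pixel skipped).
    """
    good = ~np.isnan(b)
    A_good, b_good = A[good], b[good]
    n_model = A.shape[1]
    # Too few observations, or observations that do not constrain every rate.
    if good.sum() < n_model or np.linalg.matrix_rank(A_good) < n_model:
        return None
    x, *_ = np.linalg.lstsq(A_good, b_good, rcond=None)
    return x
```

Note that `A_good` changes shape from pixel to pixel, which is the property that makes this formulation awkward to batch on a GPU.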
As this solve must be performed for every single pixel in a given dataset, the added time required for these on-the-fly modifications becomes significant. Beyond the time requirement, the size of the problem changing during each step is problematic if the goal is GPU processing, which would generally require problems to remain consistent in the usage of memory. With this in mind, we modify the problem somewhat to avoid this issue and also optimize toward the types of operations GPUs are well suited for.
To handle the NaN values, we introduce a masking matrix, $\mathbf{M}$, which is a diagonal matrix generated from the data vector $\mathbf{b}$. If a value $b_i$ is present in the data, then the corresponding element $M_{ii} = 1$, while if the value is missing (NaN), then $M_{ii} = 0$. If we wish to insert this masking matrix, then our modified problem takes the form
$$\mathbf{A}^{T}\mathbf{M}\mathbf{A}\,\mathbf{x} = \mathbf{A}^{T}\mathbf{M}\,\mathbf{b},$$
where $\mathbf{A}$ is the design matrix of the original problem.
The resultant problem features a new left-hand matrix, which we may call $\tilde{\mathbf{A}} = \mathbf{A}^{T}\mathbf{M}\mathbf{A}$, and a new data vector $\tilde{\mathbf{b}} = \mathbf{A}^{T}\mathbf{M}\mathbf{b}$. The size of these new matrices and vectors will remain constant in memory for every pixel, allowing the problem to be moved more easily to the GPU. Once the relevant broadcasting and matrix multiplications are performed, we are left with a few relatively small matrices ($\tilde{\mathbf{A}}$ is $N \times N$, with $N$ being the number of observation dates), and so it is possible to perform this solve for all pixels in a gulp simultaneously, effectively as a tensor problem, via the linalg.solve function in CuPy.
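The fixed-size formulation above can be sketched as a batched computation. The example below uses NumPy so that it runs without a GPU; since CuPy mirrors the NumPy interface for `einsum` and `linalg.solve`, a GPU version would presumably look similar, but that is an assumption rather than a description of the BISBAS code. The function names and the `(n_pix, n_ifg)` data layout are also assumptions.

```python
import numpy as np


def build_masked_normal_equations(A, data):
    """Build (A^T M A) and (A^T M b) for every pixel in a gulp.

    A: (n_ifg, n_model) design matrix shared by all pixels
    data: (n_pix, n_ifg) interferogram values, NaN where masked
    Returns lhs (n_pix, n_model, n_model) and rhs (n_pix, n_model).
    """
    valid = ~np.isnan(data)                 # diagonal of M, one row per pixel
    weights = valid.astype(A.dtype)
    filled = np.where(valid, data, 0.0)     # masked entries contribute zero
    lhs = np.einsum("ki,pk,kj->pij", A, weights, A)
    rhs = np.einsum("ki,pk->pi", A, filled)
    return lhs, rhs


def solve_gulp(A, data):
    """Solve all pixels at once; assumes every pixel is still full rank."""
    lhs, rhs = build_masked_normal_equations(A, data)
    # Trailing axis makes each right-hand side an explicit column vector.
    return np.linalg.solve(lhs, rhs[..., np.newaxis])[..., 0]
```

Every pixel yields arrays of identical shape regardless of how many interferograms are masked, so the whole gulp can be handed to a single stacked solve. Singular pixels would make that solve fail and need to be screened out first.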
The matrix $\tilde{\mathbf{A}} = \mathbf{A}^{T}\mathbf{M}\mathbf{A}$ is also functionally equivalent to the Gram matrix of the masked design matrix $\mathbf{M}\mathbf{A}$, since $\mathbf{M}^{T}\mathbf{M} = \mathbf{M}$ for a diagonal matrix of ones and zeros. We can leverage this in checking if the resulting problem is still full rank. This check in the original code is made via a singular value decomposition, which checks the rank of the truncated $\mathbf{A}$ for each pixel. Instead, we can check whether the Gramian (the determinant of the Gram matrix) is non-zero. The mostly diagonal composition of $\tilde{\mathbf{A}}$ in most time-series inversion problems means that the determinant will be significantly faster to compute than a typical rank check, even if the said rank check could be performed on the GPU. As the condition number of the Gram matrix is the square of that of the original matrix, we opt to use a slogdet function for this, which is better in cases of potential numerical instability, as it calculates the logarithm of the determinant, which is less prone to under- or overflow. For singular pixels, the resultant model is simply set to all-NaN.
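A determinant-based screen of this kind might look like the following sketch, again written with NumPy for portability. The function name and the optional `min_logdet` cut-off are hypothetical; the text only states that a slogdet function is used and that singular pixels are set to all-NaN.

```python
import numpy as np


def solve_with_gram_check(lhs, rhs, min_logdet=-np.inf):
    """Solve stacked systems, returning all-NaN models for singular pixels.

    lhs: (n_pix, n, n) Gram matrices; rhs: (n_pix, n) float right-hand sides.
    min_logdet: hypothetical cut-off for rejecting near-singular systems.
    """
    sign, logdet = np.linalg.slogdet(lhs)
    # A Gram matrix has a non-negative determinant, so a positive sign
    # means the determinant is non-zero and the system can be solved.
    ok = (sign > 0) & (logdet > min_logdet)
    x = np.full(rhs.shape, np.nan, dtype=rhs.dtype)
    if ok.any():
        x[ok] = np.linalg.solve(lhs[ok], rhs[ok][..., np.newaxis])[..., 0]
    return x
```

In floating point, a matrix that is singular in exact arithmetic can still yield a tiny non-zero determinant, which is one reason a finite cut-off and the later threshold check on the solutions may both be useful.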
As a final measure toward catching these numerical instabilities, we check the new time-series solutions against a threshold which can either be supplied by the user or precomputed from a random set of pixels throughout the entire image. Pixels containing time-series values larger than this threshold (typically 10 times the standard deviation) are discarded, as they likely suffered from these numerical instability effects. We note that this affects only a small percentage of pixels, fewer than 1 in 10,000 in this dataset using this hardware (see Section 2.3).
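One way such a threshold check could be written is sketched below. The text does not specify exactly how the standard deviation is computed, so taking it over all values of a random sample of pixels is an assumption, as are the function names and the parameters `n_sample`, `n_sigma`, and `seed`.

```python
import numpy as np


def estimate_threshold(ts, n_sample=1000, n_sigma=10.0, seed=0):
    """Threshold from the standard deviation of a random sample of pixels.

    ts: (n_pix, n_dates) time series, NaN where no solution exists.
    """
    rng = np.random.default_rng(seed)
    n_pick = min(n_sample, ts.shape[0])
    idx = rng.choice(ts.shape[0], size=n_pick, replace=False)
    return n_sigma * np.nanstd(ts[idx])


def discard_outliers(ts, threshold):
    """Set every pixel with a value beyond `threshold` to all-NaN."""
    # Comparisons against NaN are False, so missing values never trigger this.
    bad = np.any(np.abs(ts) > threshold, axis=1)
    out = ts.copy()
    out[bad] = np.nan
    return out
```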
Here, it is worth taking a brief aside to discuss the complexity and how it scales with the number of interferograms, $K$, and the number of dates, $N$, that those interferograms were calculated from. The steps described above, specifically the time-series inversion and the relevant checks involved, dominate the time complexity of the entire process for both methodologies. The original CPU-bound code’s time complexity is largely dictated by singular value decomposition, an $O(KN^2)$ operation for each pixel, performed in both the matrix rank check and (if the matrix is full rank) the calculation of the pseudo-inverse. For each pixel in the new implementation, we expect the construction of the matrices to scale as $O(KN^2)$ and the solves to scale as $O(N^3)$, but as $K > N$ for nearly all conceivable cases, we can say that both implementations scale as $O(KN^2)$ for each pixel. The relative difference in speed comes from being able to dramatically parallelize these operations on the GPU. In both cases, having more masked data will mean faster processing times, as the old implementation skips solves and the new one trivializes them, with the overall speed-up factor decreasing slightly as more masked pixels are present.
After converting the time series into the appropriate units (typically millimeters), the data are split: one copy is taken off the GPU to be written to disk, and the other is sent to a block which accumulates the solution to the detrending step. This removal of residual linear or quadratic trends across the image is typically performed sequentially over each image, raveling the data and removing any NaN values on the fly. We instead opt to accumulate the matrices which can be solved for these trend parameters on the fly during our pipeline, as the math involved is similar to that described above and also benefits in speed from GPU processing. After the entire dataset has been processed, we can quickly perform the detrending solve (this typically involves a few hundred matrices no larger than 6 × 6), then read the data back in, copy them to the GPU, and subtract the model to remove these residual trends. The detrended data are split again, with one copy sent to be written to disk and the other used to calculate rates for each pixel, which is another solve that benefits dramatically from the GPU acceleration. Finally, the rates are copied out to be written to disk.
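The accumulate-then-solve pattern for detrending can be illustrated as follows. This is a NumPy sketch under several assumptions: the trend is modelled as a polynomial in pixel coordinates (3 terms for a linear ramp, 6 for a quadratic surface, consistent with the 6 × 6 size mentioned above), and the class and function names are hypothetical.

```python
import numpy as np


def design_terms(x, y, order=2):
    """Polynomial basis per pixel: 3 columns (linear) or 6 (quadratic)."""
    cols = [np.ones_like(x), x, y]
    if order == 2:
        cols += [x * x, x * y, y * y]
    return np.stack(cols, axis=1)                    # (n_pix, n_terms)


class TrendAccumulator:
    """Accumulate per-date normal equations for the trend fit, gulp by gulp."""

    def __init__(self, n_dates, order=2):
        n_terms = 6 if order == 2 else 3
        self.order = order
        self.lhs = np.zeros((n_dates, n_terms, n_terms))
        self.rhs = np.zeros((n_dates, n_terms))

    def add_gulp(self, x, y, ts):
        """x, y: (n_pix,) coordinates; ts: (n_pix, n_dates), NaN if missing."""
        P = design_terms(x, y, self.order)
        valid = ~np.isnan(ts)
        self.lhs += np.einsum("pi,pd,pj->dij", P, valid.astype(float), P)
        self.rhs += np.einsum("pi,pd->di", P, np.where(valid, ts, 0.0))

    def solve(self):
        """Trend coefficients of shape (n_dates, n_terms)."""
        return np.linalg.solve(self.lhs, self.rhs[..., np.newaxis])[..., 0]


def remove_trend(x, y, ts, coef, order=2):
    """Subtract the fitted trend surface from each date image."""
    return ts - design_terms(x, y, order) @ coef.T
```

Because only the small per-date matrices are carried between gulps, the trend fit never needs a whole image in memory at once; the data are revisited only for the final subtraction.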
2.3. Benchmarking
To measure the increase in processing speed, we ran our pipeline, as well as the original ISBAS code, on three subsets of the dataset. For benchmark purposes, we report only the time to complete the time-series inversion, as the other steps typically do not impact the overall processing time significantly. The minimum coherence used for masking in all benchmarks is set to 0.3. All benchmarks and tests were performed on a local compute node running Ubuntu 20.04.6 LTS and CUDA version 12.4, equipped with dual AMD EPYC 7313 16-core processors, 512 GiB of memory, and an NVIDIA RTX A4000 series GPU. Summary information on our benchmarks for these subsets can be found in Table 1.
The first of these subsets (denoted as “Small Images” in Table 1) used all 2195 interferograms of a small region (200 × 200 pixels) around a trial point in the full data. This subset was most commonly used for testing, as the total processing time for both codes was manageable and allowed for quick iteration or debugging. Choosing trial points randomly such that at least 95% of the pixels in the region were land according to the water mask, we generated 50 such datasets to evaluate the consistency of both implementations. The original CPU-bound code had a range of times from 2770 to 4362 s with a median of 3806 s. For the Bifrost implementation, this range of times was from 9.173 to 9.541 s with a median of 9.459 s.
One such Small Image dataset was used to generate Figure 2, which illustrates the strong agreement between the two methods. We use the maximum mean change to quantify this agreement, as it generally works well when comparing floating point numbers and can be thought of as somewhat analogous to a percent error. The median value for the maximum mean change in this region is 0.0042 with an interquartile range of 0.0023 to 0.0097. These quite small errors probably originate from small numerical differences between the NumPy and CuPy implementations of the same algorithm, combined with the 32-bit precision used to match the original implementation. They typically arise only when a larger number of interferograms are missing, so that the resulting problem is less well constrained. Some degree of this effect could be mitigated by moving to 64-bit numbers at the cost of some performance.
The second subset was chosen to be somewhat more representative of real-world use cases for the original code, which was intended to operate on data taken by satellites such as ERS or ENVISAT. While the data from these could feature a similar number of pixels in each interferogram, a given dataset would more typically include only a few tens of dates, producing a somewhat smaller number of interferograms. As such, our “Fewer Dates” subset used only interferograms from the first 25 dates in the dataset, producing a new dataset that had 195 full-sized interferograms. From this dataset, we can see a more typical timescale on which the ISBAS code was expected to run, here 14 h, as well as our accelerated processing time of less than 3 min. The original processing time becomes cumbersome if any sort of iteration is required in the processing of the data, and it becomes even more problematic as the field moves toward longer timescales with more observation dates.
Finally, we test our pipeline against the entirety of the dataset. This “Full Data” subset represents a modern stress test that is meant to demonstrate the capabilities afforded by GPU acceleration via Bifrost. The utility of the ISBAS schema is that it allows one to process a given pixel even with some number of dates missing, but on modern datasets where the number of dates is fairly large, the time cost of such an algorithm can be prohibitive. Based on the time it took to complete each pixel, we estimated that the original ISBAS code would take approximately 51 days to process the full dataset; meanwhile, our Bifrost-accelerated version can process it in around 5 h.
From these tests, we can see a marked performance gain. The computation time in both cases scales strongly with the number of interferograms and to a lesser extent with the number of pixels. Larger amounts of data require more transfers to and from the GPU, but the overall speed-up is significant enough that these effects are fairly small and could be remedied via hardware upgrades or more thorough optimization of the data transfer and, particularly, the gulp size.
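As a quick arithmetic check on the size of this gain, the speed-up factors implied by the timings quoted in the text can be computed directly. The inputs below are the rounded figures given in the prose (the median times for Small Images, 14 h against an upper bound of 3 min for Fewer Dates, and roughly 51 days against roughly 5 h for Full Data), so the ratios are approximate and are not a substitute for Table 1.

```python
# Timings in seconds, taken from the rounded values quoted in the text.
HOUR = 3600.0
DAY = 24 * HOUR

timings = {
    # subset: (original CPU time, Bifrost GPU time)
    "Small Images": (3806.0, 9.459),       # medians over 50 datasets
    "Fewer Dates": (14 * HOUR, 3 * 60.0),  # GPU time is quoted as < 3 min
    "Full Data": (51 * DAY, 5 * HOUR),     # CPU time is an estimate
}

speedups = {name: cpu / gpu for name, (cpu, gpu) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: ~{factor:.0f}x")
```

With these inputs the ratios come out at roughly 400 for Small Images, at least 280 for Fewer Dates (since the GPU time is an upper bound), and roughly 245 for Full Data.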