Article

Compilation of a Nationwide River Image Dataset for Identifying River Channels and River Rapids via Deep Learning

1 Department of Mathematics and Statistics, Utah State University, Logan, UT 84322, USA
2 Department of Statistics, Brigham Young University, Provo, UT 84602, USA
3 U.S. Geological Survey, Observing Systems Division, Denver, CO 80225, USA
4 National Park Service, Water Resources Division, Fort Collins, CO 80525, USA
5 Department of Statistical Science, Duke University, Durham, NC 27708, USA
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(2), 375; https://doi.org/10.3390/rs18020375
Submission received: 29 August 2025 / Revised: 13 January 2026 / Accepted: 21 January 2026 / Published: 22 January 2026
(This article belongs to the Section Environmental Remote Sensing)

Highlights

What are the main findings?
  • A new dataset of 281,024 river images from across the United States, with metadata and labeled subsets to support hydrologic research, is made publicly available.
  • Segmentation and classification models demonstrated strong performance for detecting rivers and rapids, which could enable expansion of existing inventories of these key geomorphic features.
What is the implication of the main finding?
  • Establishes a hydrologic dataset that enables new machine learning approaches for characterizing rivers via remote sensing, including advanced river segmentation and detection of rapids.
  • Provides a framework to support a range of hydrologic applications including discharge estimation, habitat assessment, resource management, and recreation planning.

Abstract

Remote sensing enables large-scale, image-based assessments of river dynamics, offering new opportunities for hydrological monitoring. We present a publicly available dataset consisting of 281,024 satellite and aerial images of U.S. rivers, constructed using an Application Programming Interface (API) and the U.S. Geological Survey’s National Hydrography Dataset. The dataset includes images, primary keys, and ancillary geospatial information. We use a manually labeled subset of the images to train models for detecting rapids, defined as areas where high velocity and turbulence lead to a wavy, rough, or even broken water surface visible in the imagery. To demonstrate the utility of this dataset, we develop an image segmentation model to identify rivers within images. This model achieved a mean test intersection-over-union (IoU) of 0.57, with performance rising to an actual IoU of 0.89 on the subset of predictions with high confidence (predicted IoU > 0.9). Following this initial segmentation of river channels within the images, we trained several convolutional neural network (CNN) architectures to classify the presence or absence of rapids. Our selected model reached an accuracy and F1 score of 0.93, indicating strong performance for the classification of rapids that could support consistent, efficient inventory and monitoring of rapids. These data provide new resources for recreation planning, habitat assessment, and discharge estimation. Overall, the dataset and tools offer a foundation for scalable, automated identification of geomorphic features to support riverine science and resource management.


1. Introduction

Rivers are vital elements of the landscape, conveying water, sediment, and organic materials through the organized channel networks that comprise Earth’s watersheds. Characterizing rivers is of immense practical and scientific significance, with applications ranging from water supply planning and infrastructure design to in-stream flow monitoring and habitat assessment. Although information on channel form and behavior has traditionally been acquired through field surveys, this laborious approach is often limited to short, isolated reaches [1]. As a consequence, remote sensing has emerged as a more efficient, alternative means of mapping river corridors across an expanded range of spatial and temporal scales [2,3,4,5].
Whether observed directly from the bank or from the aerial perspective afforded by an imaging system, the texture of the water surface is a particularly important river attribute. Distinguishing among various surface flow types can provide a basis for categorizing distinct hydraulic units, often referred to as mesohabitats [1]. For example, Zavadil et al. [6] suggested that observations of water surface texture could play a key role in rapid assessments of channel morphology and fluvial processes. However, field surveys of this kind are inherently qualitative and can also be somewhat subjective. Milan et al. [7] introduced a more quantitative approach that involved measuring water surface roughness with a terrestrial laser scanner to delineate features such as boils, unbroken standing waves, riffles, runs, rapids, and cascades. Woodget et al. [8] also called into question the accuracy and reliability of traditional surface flow type mapping and argued that remote sensing could provide a more rigorous basis for evaluating in-stream habitat. Hedger and Gosselin [9] took an additional step in this direction by demonstrating the potential to automate the process of mapping fluvial hydromorphology by applying deep learning methods to aerial orthophotos. Similar studies include recent efforts to leverage artificial intelligence tools to not only detect rivers in images [10,11] but also identify specific kinds of features, such as dams [12], in-stream wood [13], riparian woodlands [14], and Arctic beaded streams [15].
In addition to providing insight on habitat conditions, the ability to detect a specific type of surface flow—rapids, which we define as areas where the water surface is wavy, irregular, or even broken (i.e., whitewater), presumably due to high flow velocities and turbulence, that stand out from adjacent areas of smoother flow and are thus visible in satellite and aerial imagery—would be valuable in at least two additional contexts. First, Legleiter et al. [16] showed that in rivers with well-defined standing waves, basic measurements of the distance between wave crests and the width of the channel can be used to calculate discharge based on critical flow theory. This type of non-contact streamflow measurement could facilitate hydrologic monitoring in remote, inaccessible locations. The critical flow-based approach could also reduce risk for equipment and personnel by obviating the need for direct field measurements in the steep, swiftly flowing streams where standing waves are most often found. Second, rivers are a highly valued recreational resource enjoyed by the public. For example, in the United States (US), millions of people each year visit rivers managed by the National Park Service (NPS) to take part in activities such as rafting, kayaking, and canoeing. For the Colorado River in the Grand Canyon alone, the number of rafters is roughly 27,000 [17]; another 60,000 float down the Gauley River [18]. The number of visitors to river-centric National Parks, National Rivers, and National Wild and Scenic Rivers in the year 2024 was on the order of 16.3 million people [19]. Tools for detecting rapids would allow agencies like the NPS to monitor these natural resources and thus help to inform their management. In addition, information on rapids would facilitate studies on the extent of critical flow conditions and various other topics, such as river-atmosphere gas exchange [20].
To the best of our knowledge, a comprehensive, focused, image-based database on the occurrence of rapids is lacking. Although the existing inventories described in Section 2.1 specify the location of some known rapids, they were not acquired systematically, do not provide information on the extent of the rapids, and might be incomplete. Moreover, these datasets consist only of vector features delineating the locations of rapids but do not include images of these areas. As a consequence, these data cannot be used to develop automated techniques for using vast, growing archives of aerial and satellite images to identify additional rapids that are not already included in existing databases. In this study, our objective is to address this void by helping to advance the incorporation of remote sensing methods into river characterization. More specifically, this paper outlines an end-to-end workflow for obtaining river images, identifying active stream channels, and detecting rapids. Currently, these tasks are accomplished through a time-consuming, subjective process that involves visually inspecting images and searching for features of interest in a mapping application like Google Earth, often based on a user’s a priori knowledge of rapid locations. As an initial proof-of-concept, we demonstrate the potential of our new, more automated and objective approach by (1) compiling a massive image database for rivers throughout the US; (2) applying a segmentation model to produce channel masks; (3) developing a series of convolutional neural network (CNN) models for classifying the presence or absence of rapids; and (4) assessing the accuracy of the resulting classifications.

2. Data Construction

2.1. Google Maps Application Programming Interface (API)

To assemble a large set of candidate images for modeling and detecting river rapids across the US, we evaluate multiple sources of high-resolution satellite and aerial imagery. The Google Maps Static API is selected because it provides scalable, programmatic access to the same imagery available in Google Earth, with broad spatial coverage across the US. Side-by-side comparisons show that images obtained via the API match the resolution of those manually obtained from Google Earth. However, images retrieved programmatically via the API do not include metadata on when an image was acquired, what type of platform (satellite or aerial) it was acquired from, or its spatial resolution. Whereas the Google Earth desktop application (Version 7.3.6.10441) allows the user to interactively select the level of spatial detail by zooming in and out, the spatial resolution is set by a single, fixed parameter included in the API query. The relationship between this parameter, called the zoom level z, and the ground sampling distance (GSD) of the image is given by
GSD (m) = 156,543.03392 × cos(lat × π/180) / 2^z,
where z is the zoom level and lat is the latitude of the location. For the zoom levels and range of latitudes used in this study (z ∈ {18, 19}), the API returned images with GSDs ranging from 0.10 to 0.55 m. Larger GSDs occur at lower latitudes and zoom levels, and smaller GSDs are associated with higher values of these two parameters. For the most common zoom level (z = 19), training and validation images have GSDs between 0.19 and 0.28 m, whereas the test images range from 0.10 to 0.18 m. The Google Maps Static API provided images with typical GSDs on the order of 0.25 m, whereas programmatic extraction from the National Agriculture Imagery Program via Google Earth Engine produced lower-resolution images with GSDs of approximately 1 m for conterminous US images.
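As an illustration, the GSD formula above can be evaluated directly (a minimal sketch; the function name and the example latitude are ours, not part of the dataset tooling):

```python
import math

def ground_sampling_distance(lat_deg: float, zoom: int) -> float:
    """Approximate ground sampling distance (m/pixel) for a given
    latitude and zoom level, per the Web Mercator formula above."""
    return 156543.03392 * math.cos(math.radians(lat_deg)) / (2 ** zoom)

# Example: a mid-latitude CONUS location at the most common zoom level,
# which falls in the 0.19-0.28 m range reported for training images
gsd = ground_sampling_distance(40.0, 19)
```

Note that each unit decrease in zoom level doubles the GSD, consistent with the 2^z term in the denominator.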
We use the API to retrieve images at 3-km intervals along flowlines in the National Hydrography Dataset (NHD) Plus HR (high-resolution) dataset within the contiguous US [21] and from the 3D Hydrography Program (3DHP) in Alaska [22]. Given the nationwide extent of the dataset, obtaining images spaced every 3 km provides systematic coverage that does not yield overlapping images or exceed our allotted quota of free images. We exclude all flowlines with a stream order less than 4 (based on the StreamOrder attribute in the NHD Plus HR dataset), a stream length under 30 km (as specified by the LengthKm attribute), or those identified as headwaters (defined by the StartFlag attribute), ensuring that extracted points generally correspond to rivers large enough to be clearly visible in the satellite and aerial imagery. For Alaska, the 3DHP flowlines lack stream order or size information, so we include only streams with a unique geographic place name (i.e., gnisid) for this portion of the dataset.
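The exclusion criteria above amount to a simple attribute filter, sketched here on invented records (the real workflow operates on the full NHD Plus HR attribute tables):

```python
# Illustrative flowline records using the attribute names from the text
flowlines = [
    {"StreamOrder": 5, "LengthKm": 120.0, "StartFlag": 0},  # kept
    {"StreamOrder": 3, "LengthKm": 45.0,  "StartFlag": 0},  # order < 4
    {"StreamOrder": 6, "LengthKm": 12.0,  "StartFlag": 0},  # under 30 km
    {"StreamOrder": 7, "LengthKm": 300.0, "StartFlag": 1},  # headwater
]

def keep(f: dict) -> bool:
    """Retain flowlines meeting all three inclusion criteria."""
    return f["StreamOrder"] >= 4 and f["LengthKm"] >= 30 and f["StartFlag"] == 0

retained = [f for f in flowlines if keep(f)]
```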
In addition to the above flowline datasets, we also used the API to pull images from the locations of known rapids, compiled in three publicly available datasets: one from the NHD; one from OpenStreetMap (OSM), a collaborative, crowd-sourced mapping platform; and one from the data release [23] associated with [16]. These datasets provide readily available examples of rapids and are distributed as KML files, each containing spatial features marking rapid locations. We implement code in R version 4.4.0 [24] to read these files and extract coordinate pairs for each of the rapids included in these existing inventories. However, rather than assuming that the images from these locations include rapids, we use this set of images as input to the annotation phase of the workflow, just like the images we extract at a regular interval along NHD flowlines.
To systematically collect satellite and aerial imagery for the candidate river locations, we develop a flexible, high-throughput pipeline leveraging the Google Maps Static API. We implement an R-based, multi-threaded image downloader using the future_walk function from the furrr package [25], enabling the efficient retrieval of hundreds of images per minute across multiple processing cores. The script accepts a data frame containing latitude, longitude, and river name or watershed identifier fields. API credentials are securely managed via ENV files with the dotenv package [26], and API request signing is implemented using SHA-1 (Secure Hash Algorithm-1) hashing of the full request URL (Uniform Resource Locator) and a secret key from the Google Cloud dashboard to comply with usage limits and allow authenticated access beyond 25,000 daily requests per user.
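The request-signing step can be sketched with Python's standard library, following Google's documented URL-signing scheme (an HMAC-SHA1 over the path and query using a url-safe base64 secret). The URL and key below are placeholders, and the paper's actual pipeline is implemented in R:

```python
import base64
import hashlib
import hmac
from urllib.parse import urlparse

def sign_request(url: str, secret_key: str) -> str:
    """Append an HMAC-SHA1 signature to a Maps Static API request URL:
    decode the url-safe base64 secret, sign the path + query string,
    and append the url-safe base64 signature as a query parameter."""
    parsed = urlparse(url)
    to_sign = f"{parsed.path}?{parsed.query}".encode("utf-8")
    key = base64.urlsafe_b64decode(secret_key)
    digest = hmac.new(key, to_sign, hashlib.sha1).digest()
    encoded = base64.urlsafe_b64encode(digest).decode("utf-8")
    return f"{url}&signature={encoded}"

# Placeholder key and request; a real key comes from the Cloud dashboard
demo_key = base64.urlsafe_b64encode(b"demo-secret-key!").decode("utf-8")
signed = sign_request(
    "https://maps.googleapis.com/maps/api/staticmap"
    "?center=36.1,-112.1&zoom=19&size=640x640&maptype=satellite&key=API_KEY",
    demo_key,
)
```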
The downloader constructs properly formatted API requests with geographic coordinates and the desired zoom level, submits them to Google, and saves successful responses locally. Using this pipeline, we collect 281,024 images across the US, creating a large and diverse database that provides a foundation for robust training and evaluation of models for detecting rapids [27]. Because the Google Maps Static API returns only a single image for a given location and does not provide information on the date the image was acquired, we made no attempt to examine changes in rapids over time.

2.2. Metadata

To facilitate modeling, enable reproducibility, and effectively manage the dataset, we construct a comprehensive metadata framework to accompany each image. Although we initially considered using individual JSON files for metadata storage, this approach proved inefficient at scale. We therefore transitioned to a unified CSV-based metadata file, which provides a reasonably scalable and accessible format for managing large volumes of annotations and image attributes.
Each row in the CSV file corresponds to a single image, uniquely identified by the image field that serves as the primary key for the dataset. This centralized structure allows for streamlined integration with training pipelines and annotation tools, and supports consistent tracking of image-specific metadata across tasks. A detailed description of the metadata fields is provided in Table 1. This metadata structure supports a range of downstream tasks, including data cleaning, model evaluation, and iterative annotation.

2.3. Annotation Process

Following image acquisition and metadata processing, we developed a suite of annotation tools to support the creation of high-quality labeled data for training machine learning models. These tools include a Python (Version 3.12.3) script that enables both segmentation of river channels and labeling of rapid presence, two important applications for our new dataset. In addition, an R script facilitates labeling rapid presence for large batches of images. Rapids were labeled visually by inspecting the imagery to search for areas where the water surface was wavy, irregular, or even broken (i.e., whitewater) and thus distinct from the smoother surface flow types that comprise the majority of most rivers [1,6,8]. To enhance labeling reliability, we implemented a two-annotator consensus approach: a classification label was accepted only when both annotators agreed. Although not infallible, this method limits subjectivity and improves consistency in the dataset [28,29]. Additional information on channel slopes and/or flow velocities might allow for a more precise definition of what constitutes “a rapid,” but such data are not widely available and our annotations were thus based solely on visual interpretation of the images themselves. Examples of annotated images with and without rapids are shown in Figures 7 and 8, where each image is displayed along with the consensus label assigned by our annotators and a model-predicted label.
Given the presence of visual noise (e.g., riverbanks, shadows, and surrounding vegetation), we create high-resolution segmentation masks to isolate river pixels within images and guide model fine-tuning. To develop such masks, our Python script was designed as an interactive, command-line tool that leverages and iteratively fine-tunes Meta’s Segment Anything Model 2 (SAM2 [30]) to generate image masks. The interface allows annotators to guide the SAM2 model using positive (left-click) and negative (right-click) point prompts directly on the image to specify the location of the river channel within the image based on visual interpretation by the analyst. This provided a substantial improvement upon earlier methods, which required manual entry of point coordinates into arrays and involved considerable trial-and-error.
The segmentation tool offers keyboard shortcuts for efficient labeling: for example, users can indicate that an image contains only water (t key) or no visible water (f key). Additional features include undoing the last point (z key), saving and exiting (q key), and automatic updates to the associated metadata file. If the automatically generated mask is unsatisfactory, users can interactively refine the mask or opt not to save the mask at all, with the program automatically recording each decision. All masks are stored as NumPy arrays (npy files), and their paths, along with automatically defined river presence labels, are recorded in the metadata CSV.
Both the Python and R scripts include a module for binary image labeling based on the presence (1) or absence (0) of river rapids, streamlining the classification process. These tools are intentionally designed to facilitate efficient annotation and minimize the manual effort involved in dataset preparation. The annotation interface defaults to an NA (Not Available) value, which can be quickly overridden with a single keystroke.
For broader scalability, the full toolset is available as part of the data release that provides access to the images we compiled [27]. Using these resources, we manually classify 4058 images as rapids or non-rapids and generate 885 segmentation masks distributed across data splits (Section 3.4, Table 2). These annotated datasets provide the foundation for training and evaluating the segmentation and convolutional neural network models described in Section 3 and Section 4.
Beyond initial dataset creation, we adopt an active learning approach that is amenable to iterative application to maximize annotation efficiency. By selecting images where model predictions are most uncertain, we expand rapid label coverage by 407 in undersampled regions and minimize redundant effort. This strategy not only reduces the labeling burden, but could also yield improvements in classifier performance, as detailed in Section 3.3.3.

3. Experiments

To illustrate the utility of the dataset constructed in Section 2, we conduct two sets of experiments: (1) river segmentation and (2) rapid classification. Both experiments follow the same data-splitting strategy to ensure fair evaluation and reproducibility.

3.1. Segmentation Model

Accurate segmentation of river surfaces is a foundational task in remote sensing, enabling a wide range of analyses such as monitoring hydrological features over time, quantifying riverbank erosion, and assessing in-stream and riparian habitat. In our work, we fine-tune SAM2 [30] using annotated masks (Section 2.3) to isolate pixels from extraneous context. Segmentation not only supports downstream tasks such as rapid classification but also serves as a standalone tool for characterizing river morphology and dynamics. By masking out non-river regions, we ensure that subsequent models learn features intrinsic to the water surface. Below we summarize our approach.

3.1.1. Model and Losses

For our experiments, we employ the lightweight tiny variant of SAM2, which offers a balance between efficiency and performance. Input images (JPEG) are passed through the model to produce binary masks (NumPy NPY) alongside predicted intersection-over-union (IoU) confidence scores (PRED_IoU).
Before mask generation, each pixel prediction p_j^(i) for the jth pixel of image i is passed through a sigmoid activation and clamped to the interval [ε, 1 − ε] to ensure numerical stability, where ε is a small constant (e.g., 10^−5). Binary masks (MASK_PRED^(i)) are then obtained by thresholding the processed probabilities: each pixel is assigned a value of 1 if p_j^(i) > 0.5 and 0 otherwise.
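These activation, clamping, and thresholding steps can be sketched in NumPy (a minimal stand-in for the per-pixel operations; the real model operates on full-resolution logit maps):

```python
import numpy as np

EPS = 1e-5  # small clamping constant, as in the text

def logits_to_mask(logits: np.ndarray, threshold: float = 0.5):
    """Convert raw per-pixel logits to clamped probabilities and a
    binary 0/1 mask, mirroring the steps described above."""
    probs = 1.0 / (1.0 + np.exp(-logits))        # sigmoid activation
    probs = np.clip(probs, EPS, 1.0 - EPS)       # numerical stability
    mask = (probs > threshold).astype(np.uint8)  # hard binary mask
    return probs, mask

# Toy 2x2 logit map for illustration
logits = np.array([[-3.0, 0.2], [4.0, -0.1]])
probs, mask = logits_to_mask(logits)
```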
The fine-tuning objective consists of two components. First, the pixel-wise cross-entropy segmentation loss (L_SEG) is defined as
L_SEG = −(1 / (B · M_i)) Σ_{i=1}^{B} Σ_{j=1}^{M_i} [ t_j^(i) log(p_j^(i)) + (1 − t_j^(i)) log(1 − p_j^(i)) ],
where t_j^(i) ∈ {0, 1} denotes the ground-truth label for pixel j in image i, p_j^(i) ∈ [ε, 1 − ε] is the clamped predicted probability, M_i is the number of pixels in image i, and B is the batch size.
Second, an IoU alignment loss encourages the model’s reported confidence to reflect actual mask quality and is defined as
INTER_i = Σ_j MASK_TRUE,j^(i) × MASK_PRED,j^(i),
UNION_i = Σ_j MASK_TRUE,j^(i) + Σ_j MASK_PRED,j^(i) − INTER_i,
IoU_i = 1 if UNION_i = 0, and IoU_i = INTER_i / UNION_i otherwise,
L_IoU = (1/B) Σ_{i=1}^{B} | IoU_i − PRED_IoU_i |,
where MASK_TRUE^(i) and MASK_PRED^(i) are binary masks for image i, indexed over pixels j. There were no cases where MASK_TRUE^(i) and MASK_PRED^(i) were both empty in any of our dataset images.
The final training objective is
L_TOTAL = L_SEG + 0.1 × L_IoU.
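The IoU alignment term and its role in the total objective can be illustrated on a toy batch (NumPy sketch; mask values and confidences are invented for illustration):

```python
import numpy as np

def iou_alignment_loss(true_masks, pred_masks, pred_ious):
    """Mean absolute gap between each mask's true IoU and the model's
    reported confidence (PRED_IoU), following the definitions above."""
    gaps = []
    for t, p, c in zip(true_masks, pred_masks, pred_ious):
        inter = np.sum(t * p)                     # pixels in both masks
        union = np.sum(t) + np.sum(p) - inter     # pixels in either mask
        iou = 1.0 if union == 0 else inter / union
        gaps.append(abs(iou - c))
    return float(np.mean(gaps))

# Toy batch of two 2x2 masks with invented confidence scores
true_masks = [np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]])]
pred_masks = [np.array([[1, 0], [0, 0]]), np.array([[1, 0], [0, 0]])]
l_iou = iou_alignment_loss(true_masks, pred_masks, pred_ious=[0.9, 1.0])
# the total objective would then be L_SEG + 0.1 * l_iou
```

In the first pair the true IoU is 0.5 but the reported confidence is 0.9, so the alignment loss penalizes the 0.4 gap; the second pair is perfectly calibrated.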

3.1.2. Implementation Details

The segmentation model is fine-tuned using PyTorch (Version 2.6.0) [31]. We use a learning rate of 10^−6 and a batch size of 32. The model is trained for up to 100 epochs with early stopping based on the following criterion: if the average IoU score does not reach a new high for five consecutive epochs, training is stopped. The model weights from the epoch with the highest average IoU score are saved. These hyperparameter choices are conservative defaults grounded in established practices for fine-tuning and stability: a small initial learning rate when fine-tuning pretrained networks, a mini-batch size of 32 as a robust default for generalization and Graphics Processing Unit (GPU) efficiency, and a maximum of 100 epochs with validation-based early stopping to avoid overfitting [32,33,34,35,36]. Training and validation runs are executed remotely on a server equipped with an NVIDIA GeForce RTX 3090 GPU (Santa Clara, CA, USA).

3.2. Classification of River Rapids

3.2.1. Data Preprocessing

All images are read via PIL and converted to PyTorch tensors on-the-fly within a custom RapidsDataset subclass of torch.utils.data.Dataset. During training, we apply the following transformations in sequence with torchvision.transforms.Compose:
1.
Resize: Each image is rescaled to 480 × 480 pixels.
2.
Data Augmentation (training only): random vertical and horizontal flips and color jitter (brightness, contrast, saturation).
3.
Normalization: Standardize RGB (red-green-blue) channels using default ResNet means and standard deviations.
We did not apply rotation or scaling as part of this data augmentation process. Because the presence/absence labels we created indicate whether any portion of an image contains rapids, an arbitrary rotation and/or scaling transformation could remove the rapids of interest from the image and thus lead to inaccuracies in the training data. Moreover, our dataset already includes substantial variation in river orientation and scale, so these additional transformations would offer limited benefit relative to the risk of omitting rapids from images labeled as including rapids. For validation and test sets, we only apply the resize and normalization steps.
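The normalization step can be sketched in NumPy; the channel statistics shown are the widely used ImageNet defaults, which we assume is what “default ResNet means and standard deviations” refers to:

```python
import numpy as np

# Widely used ImageNet channel statistics (an assumption; the text only
# says "default ResNet means and standard deviations")
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def normalize_rgb(image: np.ndarray) -> np.ndarray:
    """Standardize an HxWx3 RGB image with values in [0, 1] channel-wise."""
    return (image - MEAN) / STD

img = np.full((480, 480, 3), 0.5)  # a resized, mid-gray placeholder image
out = normalize_rgb(img)
```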

3.2.2. Model Architecture

The final models are built on a pretrained ResNetv2_152x2 backbone from the timm library. Residual networks such as ResNet have shown strong performance in a remote sensing context [37,38]. ResNetv2 builds upon the original version of ResNet and offers improved optimization and generalization. We therefore selected the well-established, efficient ResNetv2 architecture for our analysis of satellite and aerial images [39].
For the ResNetv2_152x2 backbone, we replace its original classification head with a standardized three-layer MLP:
  • Hidden layer: 1024 units, ReLU activation.
  • Hidden layer: 512 units, ReLU activation, dropout p = 0.5 .
  • Output layer: linear projection to K = 2 rapid classes.
The final activation step performs the softmax operation to obtain probabilities for each class.
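The head’s forward pass can be sketched in NumPy (randomly initialized stand-in weights; the 2048-dimensional input feature size is an assumption, and dropout is omitted because it is active only during training):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

# Stand-in weights; the real head is trained end-to-end
W1, b1 = rng.normal(0, 0.01, (2048, 1024)), np.zeros(1024)
W2, b2 = rng.normal(0, 0.01, (1024, 512)), np.zeros(512)
W3, b3 = rng.normal(0, 0.01, (512, 2)), np.zeros(2)

def head_forward(features):
    h1 = relu(features @ W1 + b1)   # hidden layer 1: 1024 units, ReLU
    h2 = relu(h1 @ W2 + b2)         # hidden layer 2: 512 units, ReLU
    return softmax(h2 @ W3 + b3)    # linear projection + softmax over K=2

probs = head_forward(rng.normal(size=2048))
```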

3.2.3. Classifier Model Evaluation

To evaluate the performance of our rapid classifier, we report three standard metrics: accuracy, F1 score, and area under the receiver operating characteristic curve (AUC). These metrics are all derived from a model confusion matrix, which summarizes the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) produced by the classifier.
Accuracy measures the overall proportion of correctly classified images and is defined as
Accuracy = (TP + TN) / (TP + TN + FP + FN).
The F1 score balances the trade-off between false positives and false negatives. Because our task is binary classification (rapids vs. non-rapids), we report the macro-F1, which averages this trade-off for both classes:
F1 = (1/2) [ 2TP / (2TP + FP + FN) + 2TN / (2TN + FN + FP) ].
Finally, AUC measures the classifier’s ability to discriminate between rapids and non-rapids across all decision thresholds:
AUC = ∫₀¹ TPR d(FPR),
where TPR = TP / (TP + FN) is the true positive rate and FPR = FP / (FP + TN) is the false positive rate. High values of these metrics indicate strong classification performance, reflecting accurate and reliable detection of river rapids.
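The accuracy and macro-F1 formulas above can be checked with a small helper (the confusion counts shown are hypothetical, not results from the paper):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy and macro-F1 from confusion-matrix counts, matching the
    formulas above (macro-F1 averages the per-class F1 scores)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1_pos = 2 * tp / (2 * tp + fp + fn)  # F1 for the "rapids" class
    f1_neg = 2 * tn / (2 * tn + fn + fp)  # F1 for the "non-rapids" class
    return accuracy, (f1_pos + f1_neg) / 2

# Hypothetical confusion counts for illustration
acc, macro_f1 = classification_metrics(tp=90, tn=95, fp=5, fn=10)
```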

3.2.4. Training Procedure

All models share the same training regimen, managed by a custom RiverClassifier wrapper:
  • Optimizer: AdamW with decoupled weight decay
    - Backbone learning rate: 5 × 10^−5
    - Classifier learning rate: 1 × 10^−3
    - Weight decay: 1 × 10^−5
  • Scheduler: ReduceLROnPlateau (factor = 0.1; patience = 5 epochs on validation F1)
  • Mixed-precision: enabled via torch.cuda.amp
  • Early stopping: patience = 10 epochs monitored on validation F1
  • Batch size: 32
  • Epochs: up to 50
Pretrained CNNs are generally used “as-is,” with only the final classifier head trained and the feature-extraction layers remaining frozen. However, when the target imagery differs substantially from the imagery used to train the CNN, fine-tuning the last few layers with a small learning rate can improve performance. In our case, the ResNetv2 model we use is pretrained on ImageNet [40], a large collection of natural images of everyday objects, animals, and scenes, whereas our task involves specialized satellite and aerial imagery with very different visual structures. To account for this difference in image types, we unfreeze the final model block and normalization head, training these layers with a reduced learning rate of 5 × 10^−5 while using 10^−3 for the classifier head.
For training the rapids classifier, we optimize binary cross-entropy loss on the rapid_class label while monitoring F1 and accuracy on the validation split. We save the checkpoint with the best overall F1 and report final performance on the held-out test set (Section 4.2). The models were trained using PyTorch (Version 2.7.1) on an NVIDIA GeForce RTX 3090 GPU.

3.3. Classification Input Architectures

3.3.1. Baseline

As a reference point, we first train rapid classification models using only the initially labeled set of 4058 unmasked images. No additional labels were provided, and no segmentation masks were applied. These baseline models are trained on normalized RGB input alone, providing a foundation for evaluating the potential to improve classification performance by using both masked inputs and active learning.

3.3.2. Masked Inputs

Upon obtaining segmentation masks, we evaluate rapid detection performance when using both masked and unmasked images for training. In addition to the baseline images mentioned in Section 3.3.1, masked versions of 510 images with rapid class labels are included in the training dataset. Comparative results are presented in Table 3.

3.3.3. Active Learning

We next apply a hybrid active learning strategy that combines uncertainty sampling with spatial coverage to reduce geographic bias and improve training discrimination. Starting with the initial labeled set of 4058 images, we train a baseline classifier and generate predictions for the remaining 276,966 unlabeled images. Substantial variation in label representation is observed across four-digit hydrologic units (HUC4s), with several regions severely underrepresented. To address these imbalances, we prioritize uncertain samples (p_j ≈ 0.5) from underrepresented HUC4 regions. Uncertainty sampling efficiently targets potentially informative instances near the decision boundary [41,42], and taking spatial coverage into account reduces the risk of reinforcing geographic sampling bias [43]. Approximately 10% of the labeled dataset (407 images) is added via manual consensus annotation before retraining. This strategy balances informativeness and representativeness [44], ensuring that new labels both challenge the model and expand coverage of previously neglected regions.
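The hybrid selection strategy can be sketched as a filter-then-sort over model predictions (all records, probabilities, and HUC4 label counts below are invented for illustration):

```python
# Illustrative records: predicted rapid probability and HUC4 per image
unlabeled = [
    {"image": "a.jpg", "p": 0.51, "huc4": "1902"},
    {"image": "b.jpg", "p": 0.97, "huc4": "1902"},
    {"image": "c.jpg", "p": 0.49, "huc4": "0303"},
    {"image": "d.jpg", "p": 0.05, "huc4": "1002"},
]
labeled_counts = {"1902": 3, "0303": 150, "1002": 40}  # labels per HUC4

# Keep only uncertain predictions near the decision boundary (p ~ 0.5),
# then queue underrepresented HUC4 regions for annotation first
candidates = [r for r in unlabeled if abs(r["p"] - 0.5) <= 0.1]
queue = sorted(candidates, key=lambda r: labeled_counts.get(r["huc4"], 0))
to_annotate = [r["image"] for r in queue]
```

Here the confident predictions (b, d) are skipped, and of the two uncertain images, the one from the sparsely labeled HUC4 is annotated first.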

3.4. Data Splitting

To prevent spatial data leakage, we partition the database into training, validation, and test subsets based on HUC4s. Avoiding spatially proximate images across subsets is particularly important in machine learning, because overlap between training and validation sets can lead to overfitting and inflated performance estimates [45,46,47]. Specifically, we partition our data as follows with exact counts shown in Table 2:
  • Test set: All images associated with Alaskan HUCs.
  • Validation set: A random sample of HUC4s across the conterminous US selected via a constrained random walk, optimized to ensure that approximately 20% of the remaining dataset was held out for validation.
  • Training set: All other HUC4s not allocated to test or validation.
Grouping by the HUC4 level ensures that geographically proximate images do not appear in multiple splits. The test set is drawn from a distinct geographical region relative to the training data, providing a reasonable surrogate for evaluating model generalization to new areas. Figure 1 presents a complete visualization of our HUC-based train and validation partitions.
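The HUC4-grouped split can be sketched as follows (toy image table with invented HUC4 codes; we use a simple random draw of validation HUC4s rather than the constrained random walk described above):

```python
import random

# Toy image table; each image carries its HUC4 code, so grouping by
# HUC4 keeps geographically proximate images in a single split
images = [(f"img{i:03d}", huc4) for i, huc4 in enumerate(
    ["1902"] * 5 + ["0303"] * 5 + ["1002"] * 5 + ["1710"] * 5)]

alaska_hucs = {"1902"}  # Alaskan HUCs form the test set
conus_hucs = sorted({h for _, h in images} - alaska_hucs)

# Hold out ~20% of CONUS HUC4s for validation (simple random draw)
rng = random.Random(42)
val_hucs = set(rng.sample(conus_hucs, k=max(1, len(conus_hucs) // 5)))

test = [n for n, h in images if h in alaska_hucs]
val = [n for n, h in images if h in val_hucs]
train = [n for n, h in images if h not in alaska_hucs | val_hucs]
```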

4. Results

4.1. Results for Segmentation Model

Using the SAM2 model described in Section 3.1 with the data partitioning from Section 3.4, we fine-tuned and assessed a preliminary segmentation model with 885 manually masked river reach images created as described in Section 2.3. The model achieves overall validation and test IoU values of 0.42 and 0.57, respectively. Notably, when we restrict to images with predicted IoU above 0.9, which represent 6.5% and 12% of the validation and test images (to the right of the red dashed line in Figure 2), respectively, we observe a substantial gain in segmentation accuracy. Within this high-confidence subset, the corresponding true validation and test IoU values increase to 0.99 and 0.89, respectively. These results suggest that high-confidence predictions could already be of sufficient quality for use in automated river detection.
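The IoU metric and the high-confidence filtering used here can be made concrete with a short sketch. The function names and the list-based interface are our own for illustration; SAM2 emits a per-mask predicted-IoU score, but its exact API is not reproduced here.

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection-over-union for two boolean masks of the same shape."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return float(inter / union) if union > 0 else 1.0

def high_confidence_mean_iou(pairs, predicted_ious, threshold=0.9):
    """Mean true IoU over the subset the model itself scores above the threshold."""
    kept = [iou(p, t) for (p, t), s in zip(pairs, predicted_ious) if s > threshold]
    return sum(kept) / len(kept) if kept else float("nan")
```

Restricting evaluation to masks whose predicted IoU exceeds 0.9 is what produces the high-confidence subsets reported above.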
To better understand how segmentation performance varies across image characteristics, we evaluated model IoU as a function of the proportion of river pixels within each image. As shown in Figure 3, true IoU generally increases with the proportion of river pixels, indicating that the model performs substantially better when rivers occupy a larger fraction of the scene. Images with a small proportion of river pixels exhibit the highest error rates. These cases are often prone to false positives, because weak textures such as shadows or dark vegetation can be mistakenly detected as rivers when the river signal is limited [48,49]. False negatives also occur more frequently in images with a small fraction of river pixels, typically those containing narrow tributaries or partially occluded reaches [50,51]. In contrast, images with larger proportions of river pixels provide stronger spatial cues and more continuous coverage of the channel, supporting more reliable boundary localization and improved IoU. This analysis indicates a moderate correlation (ρ = 0.44) between model performance and river proportion, suggesting that river prevalence is an important consideration for future dataset design and model evaluation in fluvial remote sensing.
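The text does not state which correlation coefficient ρ denotes; assuming Spearman's rank correlation, a minimal sketch (ignoring ties) applied to per-image river proportions and true IoU values would be:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks (no tie handling)."""
    rank = lambda v: np.argsort(np.argsort(np.asarray(v)))
    return float(np.corrcoef(rank(x), rank(y))[0, 1])
```

For tied values or a significance test, a library routine such as `scipy.stats.spearmanr` would be preferable to this bare-bones version.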
Figure 4 and Figure 5 illustrate this performance visually, showing examples of predicted masks from the validation and test sets. In most cases, the river channel is clearly delineated and surrounding areas are masked, indicating that the model can effectively identify rivers in this type of image data. The main exceptions are the first validation image (Figure 4-V1 and Figure 5-V1), where the model captures only a small portion of the river channel while mistakenly including adjacent trees and shadows, and the second test image (Figure 4-T2 and Figure 5-T2), where an exposed sandbar was also included in the mask. Although the first case represents a more problematic failure, the second is a minor error that would likely still be sufficient for many applications.

4.2. Results for Classification of River Rapids

Table 3 summarizes the performance of our rapid classifiers under three conditions when applied to the test dataset. The baseline model achieves strong overall performance, indicating that the ResNetv2 architecture we selected balanced precision and recall well, detecting rapids with relatively few missed cases. However, our results also indicate that incorporating masked inputs did not appreciably change either the F1-score or the accuracy. Similarly, the model trained with additional active learning labels did not yield substantive gains or losses in F1-score or accuracy, suggesting that in this case targeted labeling in undersampled regions did not translate into measurable performance improvements. A detailed analysis of these results is provided in the following three subsections.

4.2.1. Baseline Rapid Model Results

The confusion matrix in Figure 6 summarizes the classification performance of the baseline-input model on the Alaska test set, with rounded model predictions on the x-axis and manual labels on the y-axis. Strong performance is reflected by high values along the diagonal, where correct classifications appear. The model accurately classified 886 images, resulting in an overall accuracy of 93%. The model produced 17 false positives (non-rapid images predicted as rapid) and 52 false negatives (rapid images predicted as non-rapid). The predominance of false negatives implies that the model is conservative in its positive predictions, with a slight tendency to label more ambiguous images as non-rapid.
The F1 scores for each class—0.92 for Non-Rapid and 0.93 for Rapid—indicate consistent performance across both classes, with no substantive imbalance in precision-recall trade-offs. The overall F1 score of 0.93 and a high AUC of 0.98 further confirm that the model distinguishes well between classes, with robust discriminative ability.
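These metrics follow the standard definitions from a 2×2 confusion matrix, as in the sketch below. The counts in Figure 6 (17 false positives, 52 false negatives, 886 correct overall) could be substituted once split into true positives and true negatives, which the text does not restate; the numbers in the test below are therefore illustrative only.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1}
```

Reporting per-class F1, as done above for Non-Rapid and Rapid, amounts to evaluating this function once with each class treated as the positive label.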
Figure 7 shows example images from the Alaska test set that were correctly classified as true positives or true negatives by the baseline model. These cases appear to highlight the model’s robustness across a range of conditions, including variation in rapid strength (i.e., presence and extent of visually distinct, pronounced white water), river color, river size, flow direction, and surrounding terrain.
Figure 8 presents some comparatively rare instances of misclassification. False negatives suggest that the model might struggle with extreme river hues, highly heterogeneous rapid texturing, or complex features such as channel bifurcations. False positives, on the other hand, often appear to arise from shadows, surface glare, or snow on the water, as well as patterned ripple effects in adjacent vegetation. Although perfect classification is always the ideal, these example misclassifications seem reasonable, given the visual complexity of the task, and reflect conditions that even human observers might find challenging.
To provide some geographic context for these results and identify conditions under which the classifier might perform poorly, we produced a map of accuracy scores based on the test set from Alaska. Figure 9a shows consistently strong model performance across the HUC4 regions in Alaska. We observed a general spatial trend in which predictive performance tended to be stronger in the center of the state, with slightly lower accuracy along the northern and western shores and eastern boundary. These deviations might be a consequence of sampling variability, given the relatively small number of observations in the border regions (Figure 9b). However, no concerning outliers are present.

4.2.2. Mask and Active Learning Results

We observe essentially no overall gains or losses in accuracy when applying masking (Figure 10) or active learning (Figure 11), indicating that these strategies on their own did not surpass the strong performance of our baseline model. However, shifts in false positive and false negative counts suggest that the mask-input model may be preferable in scenarios where correctly identifying rapids is prioritized over avoiding false detections, but the active learning model may be more advantageous when minimizing false positives is the greater concern.

5. Discussion

5.1. Limitations

The results presented in Section 4 demonstrate that our models can accurately detect the presence or absence of rapids across a geographically diverse range of river systems. Our baseline model achieved an overall accuracy of 93% on an independent test set drawn from Alaska HUC4 regions. This sampling strategy avoided spatial data leakage and thus enabled a stringent evaluation of model performance. The study most closely related to ours is that of Hedger and Gosselin [9], who used a convolutional neural network to classify individual pixels in river images into two categories of surface flow: (1) smooth or rippled; or (2) standing waves. These authors reported accuracies near 99%, but their formulation of the problem and approach to accuracy assessment were fundamentally different from ours. Hedger and Gosselin [9] focused on pixel-by-pixel classification within a single image, whereas we developed a workflow for detecting, at the image level, the presence of rapids across large river networks. In addition, Hedger and Gosselin [9] based their evaluation of classifier performance on a random split of pixels from the same image into training (70%) and validation (30%) subsets, without any kind of independent test set; the data used for accuracy assessment was thus very similar to that used for training. In contrast, our geographically disjoint sampling strategy provides a more conservative and realistic indication of the generalizability of our rapids classifier for large-scale river monitoring applications.
However, several limitations constrain the broader applicability of our findings. First, our compilation of river images does not provide exhaustive coverage of the entire river network, nor is it a catalog of rapids. Instead, we assembled the database to provide a large sample to use in developing and testing algorithms for identifying rivers and rapids in images. Most importantly, this study is essentially a non-inferiority experiment in which we test how well an artificial intelligence-based approach can reproduce classifications of river channels and rapids made by human interpreters. In other words, we evaluate whether and to what extent SAM2 and CNN models could obviate the need for analysts to manually digitize specific features in images. We acknowledge that the people annotating the images are by no means perfect and that the labeling of rapids in particular involves subjective decision-making and thus could vary among observers. We have no means of testing the accuracy of these annotators and can only assess the performance of the computer-based approach relative to the manually created data, not some independent, presumably more reliable source of information.
With this important caveat in mind, the segmentation results from Section 4.1 provide a promising start for automated river detection. However, only 6.5% of the validation images and 12% of the test images met the predicted IoU > 0.9 criterion, which corresponded to true IoU values of 0.99 and 0.89, respectively. In addition, the broader use of these segmentation models is limited by the relatively small number of labeled masks available for training. Expanding the annotated dataset would improve robustness, generalizability, and consistency at scale. Nonetheless, these results serve as a proof-of-concept and provide a foundation for further methodological development.
Building on this foundation, the mask-aided classifier results in Section 4.2.2 are similarly promising yet remain exploratory. Only a few hundred masked images could be included in training due to the relatively limited number of images in which the river channel had been manually labeled, and masks were not systematically applied to the test set. However, evaluation of the small subset of 185 test images with available masks yielded exceptionally strong performance (97% accuracy, 0.97 F1, 0.99 AUC). Because the ease of generating high-quality masks might be correlated with the ease of discriminating rapids, these results should not be interpreted as definitive evidence of performance gains. Instead, they highlight the potential for substantial improvements when high-quality masks can be applied systematically.
A key limitation of the results in Section 4.2 is the assumption that the hyperparameters we used for the baseline model were also appropriate in the two alternative models we considered, which could have constrained their effectiveness. Rather than discourage the use of masks and active learning, we view these results as motivation to explore these approaches further—particularly in combination with one another and with tailored hyperparameter tuning—to more fully realize their potential for advancing rapid classification.
Together, these findings underscore both the promise and the current limitations of our segmentation and classifier approaches, suggesting that, whether in their current form or with further development, these approaches could be valuable across a wide range of river monitoring and rapid detection applications.

5.2. Applications and Extensions

In this work, we have demonstrated several applications of our newly curated river imagery dataset, which we refer to as the Compilation of Images from River Reaches across the United States (CIRRUS). These include providing a small training set for developing and fine-tuning segmentation models, applying such models for river monitoring, and achieving strong rapid detection through image classification with the aid of segmentation masks and targeted active learning. Our experiments show accurate and robust classifier performance, providing a proof of concept for automated river feature identification. By releasing this dataset alongside our analysis [27], we provide a resource to the broader research community, enabling further development of machine learning models for hydrologic studies and related environmental and water resource applications.
In the previous subsection, we identified several limitations of our approach, but these issues could be addressed through further work. For example, the subjectivity inherent to manual labeling could be assessed by having multiple sets of annotators label an identical set of images so that the agreement between them can be quantified (e.g., Cohen’s κ [52] or Krippendorff’s α [53]). Comparing labels across groups of annotators would provide a measure of consistency, identify sources of ambiguity, and refine guidance for the labeling process. Similarly, increasing the number of labeled masks could help improve the performance of the segmentation model across a broader range of river conditions. Although using masked inputs and active learning did not lead to substantial gains in performance relative to the baseline rapids classifier in this initial study, further exploration of these approaches could yield gains in classification accuracy. Applying the classifier to masked images that include only the river channel itself could help reduce errors driven by image textures in terrestrial areas adjacent to the channel, such as the examples shown in Figure 8. Performing more extensive hyperparameter tuning and sensitivity analysis could also lead to further improvements in performance. Such an effort could be made on a model-by-model basis, rather than assuming that parameters for the baseline model are also optimal for the masked input- and active learning-based approaches. Finally, exploring spatial variations in model performance could provide information on which types of rivers are conducive to deep learning-based segmentation of rivers and identification of rapids and which kinds of channels are less well-suited to this approach.
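For two annotators, Cohen's κ can be computed as in the sketch below; Krippendorff's α, which handles more than two raters and missing labels, is not shown. The function and its interface are our own illustration, not part of the released code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (binary or categorical labels)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled independently with their own marginals.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0
```

A κ near 1 would indicate that rapid/non-rapid labels are reproducible across annotators, whereas a low κ would flag the subjectivity discussed above.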
Our results highlight that large-scale annotated river imagery and ongoing data curation can support robust, generalizable models for hydrologic research. Such efforts could improve the efficiency and reproducibility of surface flow type observations used to assess in-stream habitat [8]. The ability to detect rapids could also facilitate practical applications like recreation planning and natural resource management. For instance, automated detection of in-channel hazards following events such as floods or post-wildfire debris flows could enhance safety and situational awareness. Similarly, non-contact estimates of river discharge based on critical flow theory could help to expand streamflow monitoring into regions underrepresented in current gage networks [16].

6. Conclusions

Rivers play many important roles, both in the natural environment and for human society. Remote sensing can be an effective tool for characterizing rivers and monitoring water resources, particularly with the recent advent of deep learning-based approaches. In this study, we present an end-to-end workflow for efficiently obtaining images of rivers, generating segmentation-based masks to highlight the stream channels present within these images, and then using a CNN model to detect a particular type of geomorphic feature: rapids. This initial proof-of-concept investigation led to the following research highlights and main findings:
1. An API-based approach allowed for highly automated retrieval of images extracted at a regular interval along flowlines throughout the conterminous US and Alaska and from the locations of known rapids. The resulting database consists of 281,024 images obtained from the Google Maps Static API and made publicly available as the Compilation of Images from River Reaches across the United States (CIRRUS) [27]. CIRRUS includes a subset of images with manually annotated river masks and labels for the presence or absence of river rapids that could support further development and application of approaches for characterizing rivers via remote sensing. As a potential starting point for such efforts, we make the workflow developed for this study accessible by including our code in the data release along with the images.
2. To rigorously evaluate model performance, we split the data at the watershed level into not only training and validation sets but also an independent test set from a completely distinct area. This approach avoided spatial data leakage and allowed for a more robust assessment of model generalizability across geographic domains.
3. The segmentation model we developed for identifying river pixels within images led to a mean test IoU of 0.57, which increased to 0.89 when only those images with high confidence were considered. These results suggest that highly automated extraction of rivers from standard, readily available satellite and aerial images is not only feasible but also potentially promising.
4. Our baseline CNN model for detecting rapids yielded overall accuracy and F1 scores of 0.93, implying that this approach could facilitate more extensive, less labor-intensive inventory and monitoring of river rapids.
5. The framework established herein could help to support numerous hydrologic applications, including non-contact streamflow measurement, habitat mapping, water resource management, and river-oriented recreation.

Author Contributions

Conceptualization, C.J.L., C.L.S. and B.L.B.; methodology, T.K. and K.K.B.; software, N.B., C.S., H.F. and K.H.; validation, C.J.L., M.R., T.K., N.B. and C.S.; formal analysis, N.B., K.H., C.S., T.K. and K.K.B.; investigation, N.B., C.S., H.F., M.R. and K.H.; resources, C.J.L., J.B., C.L.S., T.K., K.K.B., B.L.B. and K.R.M.; data curation, C.J.L., N.B., K.H., H.F., M.R., C.S. and J.B.; writing—original draft preparation, C.J.L., T.K. and K.K.B.; writing—review and editing, C.J.L., B.L.B., C.L.S. and K.R.M.; visualization, K.K.B. and C.S.; supervision, C.J.L., B.L.B. and K.R.M.; project administration, B.L.B., T.K. and K.K.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The image database and code described above are available from [27].

Conflicts of Interest

The authors declare no conflicts of interest. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

Abbreviations

The following abbreviations are used in this manuscript:
API: Application Programming Interface
AUC: Area Under the Receiver Operating Characteristic Curve
CIRRUS: Compilation of Images from River Reaches across the United States
CNN: Convolutional Neural Network
GPU: Graphics Processing Unit
GSD: Ground Sampling Distance
HR: High Resolution
HUC: Hydrologic Unit Code
NHD: National Hydrography Dataset
NPS: National Park Service
OSM: OpenStreetMap
RGB: red-green-blue
SAM2: Segment Anything Model 2
SHA-1: Secure Hash Algorithm-1
URL: Uniform Resource Locator
US: United States
USGS: U.S. Geological Survey

References

  1. Newson, M.D.; Newson, C.L. Geomorphology, Ecology and River Channel Habitat: Mesoscale Approaches to Basin-Scale Challenges. Prog. Phys. Geogr. 2000, 24, 195–217. [Google Scholar] [CrossRef]
  2. Fausch, K.D.; Torgersen, C.E.; Baxter, C.V.; Li, H.W. Landscapes to Riverscapes: Bridging the Gap between Research and Conservation of Stream Fishes. Bioscience 2002, 52, 483–498. [Google Scholar] [CrossRef]
  3. Marcus, W.A.; Fonstad, M.A. Remote Sensing of Rivers: The Emergence of a Subdiscipline in the River Sciences. Earth Surf. Processes Landforms 2010, 35, 1867–1872. [Google Scholar] [CrossRef]
  4. Carbonneau, P.; Fonstad, M.A.; Marcus, W.A.; Dugdale, S.J. Making Riverscapes Real. Geomorphology 2012, 137, 74–86. [Google Scholar] [CrossRef]
  5. Piégay, H.; Arnaud, F.; Belletti, B.; Bertrand, M.; Bizzi, S.; Carbonneau, P.; Dufour, S.; Liébault, F.; Ruiz-Villanueva, V.; Slater, L. Remotely Sensed Rivers in the Anthropocene: State of the Art and Prospects. Earth Surf. Processes Landforms 2020, 45, 157–188. [Google Scholar] [CrossRef]
  6. Zavadil, E.A.; Stewardson, M.J.; Turner, M.E.; Ladson, A.R. An Evaluation of Surface Flow Types as a Rapid Measure of Channel Morphology for the Geomorphic Component of River Condition Assessments. Geomorphology 2012, 139–140, 303–312. [Google Scholar] [CrossRef]
  7. Milan, D.J.; Heritage, G.L.; Large, A.R.G.; Entwistle, N.S. Mapping Hydraulic Biotopes Using Terrestrial Laser Scan Data of Water Surface Properties. Earth Surf. Processes Landforms 2010, 35, 918–931. [Google Scholar] [CrossRef]
  8. Woodget, A.S.; Visser, F.; Maddock, I.P.; Carbonneau, P.E. The Accuracy and Reliability of Traditional Surface Flow Type Mapping: Is It Time for a New Method of Characterizing Physical River Habitat? River Res. Appl. 2016, 32, 1902–1914. [Google Scholar] [CrossRef]
  9. Hedger, R.D.; Gosselin, M.P. Automated Fluvial Hydromorphology Mapping from Airborne Remote Sensing. River Res. Appl. 2023, 39, 1889–1901. [Google Scholar] [CrossRef]
  10. Valman, S.J.; Boyd, D.S.; Carbonneau, P.E.; Johnson, M.F.; Dugdale, S.J. An AI approach to operationalise global daily PlanetScope satellite imagery for river water masking. Remote Sens. Environ. 2024, 301, 113932. [Google Scholar] [CrossRef]
  11. Lee, S.; Kong, Y.; Lee, T. Development of Deep Intelligence for Automatic River Detection (RivDet). Remote Sens. 2025, 17, 346. [Google Scholar] [CrossRef]
  12. Zhang, X.; Liu, Q.; Gui, D.; Zhao, J.; Chen, Y.; Liu, Y.; Martínez-Valderrama, J. Enhanced River Connectivity Assessment Across Larger Areas Through Deep Learning with Dam Detection. Hydrol. Processes 2025, 39, e70063. [Google Scholar] [CrossRef]
  13. Grimmer, G.; Wenger, R.; Forestier, G.; Chardon, V. Automatic detection of in-stream river wood from random forest machine learning and exogenous indices using very high-resolution aerial imagery. Environ. Model. Softw. 2025, 190, 106460. [Google Scholar] [CrossRef]
  14. Dawson, M.; Dawson, H.; Gurnell, A.; Lewin, J.; Macklin, M.G. AI-assisted interpretation of changes in riparian woodland from archival aerial imagery using Meta’s segment anything model. Earth Surf. Processes Landforms 2025, 50, e6053. [Google Scholar] [CrossRef]
  15. Harlan, M.E.; Gleason, C.J.; Flores, J.A.; Langhorst, T.M.; Roy, S. Mapping and Characterizing Arctic Beaded Streams through High Resolution Satellite Imagery. Remote Sens. Environ. 2023, 285, 113378. [Google Scholar] [CrossRef]
  16. Legleiter, C.J.; Grant, G.; Bae, I.; Fasth, B.; Yager, E.; White, D.C.; Hempel, L.; Harlan, M.E.; Leonard, C.; Dudley, R. Remote Sensing of River Discharge Based on Critical Flow Theory. Geophys. Res. Lett. 2025, 52, e2025GL114851. [Google Scholar] [CrossRef]
  17. James, K. Rafting the Grand Canyon. 2025. Available online: https://www.visitarizona.com/like-a-local/grand-canyon-river-trip (accessed on 19 August 2025).
  18. National Park Service. Whitewater—Gauley River National Recreation Area (U.S. National Park Service)—nps.gov. 2025. Available online: https://www.nps.gov/gari/planyourvisit/whitewater.htm (accessed on 19 August 2025).
  19. National Park Service. Visitor Use Data—Social Science (U.S. National Park Service)—nps.gov. 2025. Available online: https://www.nps.gov/subjects/socialscience/visitor-use-statistics-dashboard.htm (accessed on 19 August 2025).
  20. Brinkerhoff, C.B.; Gleason, C.J.; Zappa, C.J.; Raymond, P.A.; Harlan, M.E. Remotely Sensing River Greenhouse Gas Exchange Velocity Using the SWOT Satellite. Glob. Biogeochem. Cycles 2022, 36, e2022GB007419. [Google Scholar] [CrossRef]
  21. U.S. Geological Survey. National Hydrography Dataset Plus High Resolution; U.S. Geological Survey Data Release; U.S. Geological Survey: Reston, VA, USA, 2022. [Google Scholar] [CrossRef]
  22. U.S. Geological Survey. 3D Hydrography Dataset Program; U.S. Geological Survey Data Release; U.S. Geological Survey: Reston, VA, USA, 2025. [Google Scholar] [CrossRef]
  23. Legleiter, C.J.; Bae, I.; Fasth, B.; Grant, G.; Yager, E.; White, D.; Hempel, L.; Harlan, M.; Leonard, C. Image-Based Measurements and Gage Records Used to Test a Method for Inferring River Discharge from Remotely Sensed Data Based on Critical Flow Theory; U.S. Geological Survey Data Release; U.S. Geological Survey: Reston, VA, USA, 2024. [Google Scholar] [CrossRef]
  24. R Core Team. R: A Language and Environment for Statistical Computing; Version 4.4.0; R Core Team: Vienna, Austria, 2025; Available online: https://www.R-project.org (accessed on 2 October 2025).
  25. Vaughan, D.; Dancho, M. furrr: Apply Mapping Functions in Parallel Using Futures, R package version 0.3.1; R Core Team: Vienna, Austria, 2022. Available online: https://furrr.futureverse.org/ (accessed on 10 January 2026).
  26. Csárdi, G. dotenv: Load Environment Variables from ’.env’, R package version 1.0.3, 2021. Available online: https://cran.r-project.org/web/packages/dotenv/index.html (accessed on 10 January 2026).
  27. Legleiter, C.J.; Bladen, K.; Brimhall, N. Compilation of Images from River Reaches Across the United States (CIRRUS); U.S. Geological Survey Data Release; U.S. Geological Survey: Reston, VA, USA, 2025. [Google Scholar] [CrossRef]
  28. McHugh, M.L. Interrater reliability: The kappa statistic. Biochem. Medica 2012, 22, 276–282. [Google Scholar] [CrossRef]
  29. Gwet, K.L. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters, 4th ed.; Advanced Analytics, LLC: Gaithersburg, MD, USA, 2014. [Google Scholar]
  30. Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. arXiv 2024, arXiv:2408.00714. [Google Scholar] [PubMed]
  31. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035. [Google Scholar]
  32. Smith, L.N. Cyclical Learning Rates for Training Neural Networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 464–472. [Google Scholar] [CrossRef]
  33. Howard, J.; Ruder, S. Universal Language Model Fine-tuning for Text Classification. arXiv 2018, arXiv:1801.06146. [Google Scholar] [CrossRef]
  34. Masters, D.; Luschi, C. Revisiting small batch training for deep neural networks. arXiv 2018, arXiv:1804.07612. [Google Scholar] [CrossRef]
  35. Prechelt, L. Early Stopping—But When? In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 1998; pp. 55–69. [Google Scholar] [CrossRef]
  36. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  37. Zhou, Z.; Zheng, Y.; Ye, H.; Pu, J.; Sun, G. Satellite image scene classification via convnet with context aggregation. In Proceedings of the Pacific Rim Conference on Multimedia, Hefei, China, 21–22 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 329–339. [Google Scholar]
  38. Bilotta, G.; Bibbò, L.; Meduri, G.M.; Genovese, E.; Barrile, V. Deep Learning Innovations: ResNet Applied to SAR and Sentinel-2 Imagery. Remote Sens. 2025, 17, 1961. [Google Scholar] [CrossRef]
  39. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 630–645. [Google Scholar]
  40. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  41. Lewis, D.D.; Gale, W.A. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3–6 July 1994; Springer: Berlin/Heidelberg, Germany, 1994; pp. 3–12. [Google Scholar]
  42. Settles, B. Active Learning Literature Survey; Computer Sciences Technical Report 1648; University of Wisconsin-Madison: Madison, WI, USA, 2009. [Google Scholar]
  43. Santos-Fernandez, E.; Huser, R.; Castruccio, S.; Stephenson, A.G.; Sisson, S.A. Spatially balanced sampling via Gaussian processes. Bayesian Anal. 2021, 16, 1179–1207. [Google Scholar]
  44. Konyushkova, K.; Sznitman, R.; Fua, P. Learning active learning from data. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30.
  45. Roberts, D.R.; Bahn, V.; Ciuti, S.; Boyce, M.S.; Elith, J.; Guillera-Arroita, G.; Hauenstein, S.; Lahoz-Monfort, J.J.; Schröder, B.; Thuiller, W.; et al. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 2017, 40, 913–929.
  46. Hartmann, D.; Gravey, M.; Price, T.D.; Nijland, W.; de Jong, S.M. Surveying Nearshore Bathymetry Using Multispectral and Hyperspectral Satellite Imagery and Machine Learning. Remote Sens. 2025, 17, 291.
  47. Meyer, H.; Reudenbach, C.; Wöllauer, S.; Nauss, T. Importance of spatial predictor variable selection in machine learning applications: Moving from data reproduction to spatial prediction. Ecol. Model. 2019, 411, 108815.
  48. Marmanis, D.; Schindler, K.; Wegner, J.D.; Galliani, S.; Datcu, M.; Stilla, U. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens. 2018, 135, 158–172.
  49. Bischke, B.; Helber, P.; Folz, J.; Borth, D.; Dengel, A. Multi-task learning for segmentation of building footprints with deep neural networks. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1480–1484.
  50. Kervadec, H.; Bouchtiba, J.D.I.; Desrosiers, C.; Ayed, I.B. Boundary loss for highly unbalanced segmentation. arXiv 2018, arXiv:1812.07032.
  51. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 565–571.
  52. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46.
  53. Krippendorff, K. Estimating the reliability, systematic error, and random error of interval data. Educ. Psychol. Meas. 1970, 30, 61–70.
Figure 1. Map of HUC4 regions showing the training and validation splits for river basins in the conterminous United States. Test images are in Alaska, which is shown in Figure 9.
Figure 2. Histograms of predicted Intersection over Union (IoU) for the validation and test datasets. The red dotted line indicates a high-confidence threshold of 0.9. Predicted IoU values above this cutoff generally correspond with improved true IoU performance.
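The predicted and true IoU values shown in Figures 2 and 3 are based on the standard intersection-over-union between a predicted river mask and its manually annotated counterpart. A minimal sketch of the metric (the function and example arrays are illustrative, not the paper's code):

```python
import numpy as np

def iou(pred, truth):
    """Intersection over union of two binary masks of equal shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, truth).sum() / union

# Toy 2x3 masks: 2 pixels agree, 4 pixels appear in either mask.
pred = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
print(iou(pred, truth))  # -> 0.5
```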
Figure 3. Scatter plot illustrating the relationship between segmentation accuracy and the proportion of river pixels within an image. Each point represents an image, plotted by the proportion of river pixels in the image (x-axis) and true IoU score (y-axis). A positive performance trend is observed (ρ = 0.44), with lower IoU in images with smaller proportions of river pixels and higher IoU in images with larger proportions of river pixels.
Figure 4. River images with predicted masks applied. Images were randomly selected from the validation (denoted V1–V4) and test (denoted T1–T4) sets with a predicted IoU > 0.9. Non-river areas are generally suppressed, illustrating the model’s ability to isolate river features. For comparison with the raw river images, see Figure 5. The width of each validation image is approximately 300 m for z18 images (V1) and 150 m for z19 images (V2, V3, V4). The test image widths are approximately 180 m for z18 images (T1, T2, T4) and 90 m for z19 images (T3).
Figure 5. Raw river images for comparison with the predicted masks in Figure 4. Images were randomly selected from the validation (denoted V1–V4) and test (denoted T1–T4) sets with a predicted IoU > 0.9. The width of each validation image is approximately 300 m for z18 images (V1) and 150 m for z19 images (V2, V3, V4). The test image widths are approximately 180 m for z18 images (T1, T2, T4) and 90 m for z19 images (T3).
Figure 6. Confusion matrix and metrics for baseline-input model (Section 3.3.1) performance on Alaska test images.
Figure 7. Examples from the Alaska test images that are correctly classified by the baseline model. The top row shows true positive images, and the bottom row shows true negative images. The width of each image is approximately 180 m for z18 images (a,b,d,f,g) and 90 m for z19 images (c,e,h).
Figure 8. Examples from the Alaska test images that are incorrectly classified by the baseline model. The top row shows false negative images, and the bottom row shows false positive images. The width of each image is approximately 180 m for z18 images (a,c) and 90 m for z19 images (b,d,e–h).
Figure 9. Maps of the Alaskan watersheds used as an independent test dataset colored and labeled by (a) accuracy and (b) the number of labeled images in each basin.
Figure 10. Confusion matrix for masked-input model (Section 3.3.2) performance on Alaska test images.
Figure 11. Confusion matrix for active learning-input model (Section 3.3.3) performance on Alaska test images.
Table 1. Metadata fields associated with each image.
Field          Description
image          Image filename (serves as Primary Key)
name           River or watershed identifier
latitude       Latitude coordinate of the image center in decimal degrees
longitude      Longitude coordinate of the image center in decimal degrees
zoom           Zoom level at image capture (18 or 19)
huc2           The distinct major hydrological region for an image
huc4           The distinct hydrological sub-region for an image
api_timestamp  Timestamp when the image was retrieved from the Maps API
mask           Label denoting existence of a manually annotated segmentation mask (0 = no, 1 = yes)
river_class    Manually annotated label indicating river presence (0 = no, 1 = yes)
rapid_class    Manually annotated label indicating rapid presence (0 = no, 1 = yes)
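Because the fields above form a flat metadata table keyed by the image filename, labeled subsets can be selected with simple filters. A sketch using pandas, with two fabricated rows standing in for the released metadata file (all values below are illustrative):

```python
import pandas as pd

# Toy rows mimicking the metadata schema in Table 1 (values are illustrative).
meta = pd.DataFrame([
    {"image": "img_000001.png", "name": "Colorado River", "latitude": 36.10,
     "longitude": -112.10, "zoom": 18, "huc2": "14", "huc4": "1407",
     "api_timestamp": "2023-05-01T12:00:00Z", "mask": 1,
     "river_class": 1, "rapid_class": 1},
    {"image": "img_000002.png", "name": "Yukon River", "latitude": 64.00,
     "longitude": -150.00, "zoom": 19, "huc2": "19", "huc4": "1904",
     "api_timestamp": "2023-05-02T12:00:00Z", "mask": 0,
     "river_class": 1, "rapid_class": 0},
])

# Select labeled images that contain a river but no rapid.
calm = meta[(meta["river_class"] == 1) & (meta["rapid_class"] == 0)]
print(calm["image"].tolist())  # -> ['img_000002.png']
```

The same pattern extends to spatial subsetting, e.g., filtering on `huc2` or `huc4` to restrict a split to particular hydrologic regions.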
Table 2. Overall data partition counts for each of the different input architectures outlined in Section 3.1 and Section 3.3. Values in parentheses denote exact class distributions (absence:presence) for classification models of Section 3.3.
Model                 Train              Validation       Test
SAM2                  555                138              192
Baseline CNN          2485 (899:1586)    618 (293:325)    955 (423:532)
Masked Inputs CNN     2995 (1197:1798)   618 (293:325)    955 (423:532)
Active Learning CNN   2759 (1087:1672)   751 (385:367)    955 (423:532)
Table 3. Test-set performance metrics for each of the model architectures outlined in Section 3.3. Higher values of each metric indicate better performance.
Model             Accuracy   F1-Score   AUC
Baseline          0.93       0.93       0.98
Masked Inputs     0.93       0.93       0.97
Active Learning   0.92       0.92       0.98
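The accuracy and F1 scores reported in Table 3 follow directly from the binary confusion-matrix counts of the kind shown in Figures 6, 10 and 11. A sketch of the computation (the counts in the example are hypothetical, not the paper's):

```python
def accuracy_f1(tp, fp, fn, tn):
    """Accuracy and F1 score from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Hypothetical counts for a 100-image test set.
acc, f1 = accuracy_f1(tp=50, fp=5, fn=5, tn=40)
print(round(acc, 3), round(f1, 3))  # -> 0.9 0.909
```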
Share and Cite

MDPI and ACS Style

Brimhall, N.; Bladen, K.K.; Kerby, T.; Legleiter, C.J.; Swapp, C.; Fluckiger, H.; Bahr, J.; Roberts, M.; Hart, K.; Stegman, C.L.; et al. Compilation of a Nationwide River Image Dataset for Identifying River Channels and River Rapids via Deep Learning. Remote Sens. 2026, 18, 375. https://doi.org/10.3390/rs18020375
