Applying Artificial Intelligence Methods to Detect and Classify Fish Calls from the Northern Gulf of Mexico

Abstract: Passive acoustic monitoring is a method that is commonly used to collect long-term data on soniferous animal presence and abundance. However, these large datasets require substantial effort for manual analysis; therefore, automatic methods are a more effective way to conduct these analyses and extract points of interest. In this study, an energy detector and subsequent pre-trained neural network were used to detect and classify six fish call types from a long-term dataset collected in the northern Gulf of Mexico. The development of this two-step methodology and its performance are the focus of this paper. The energy detector by itself had a high recall rate (>84%), but very low precision; however, a subsequent neural network was used to classify detected signals and remove noise from the detections. Image augmentation and iterative training were used to optimize classification and compensate for the low number of training images for two call types. The classifier had a relatively high average overall accuracy (>87%), but classifier average recall and precision varied greatly for each fish call type (recall: 39–91%; precision: 26–94%). This coupled methodology expedites call extraction and classification and can be applied to other datasets that have multiple, highly variable calls.


Introduction
Passive acoustics are increasingly being used as a tool for population management and assessment [1,2]. Passive acoustic monitoring (PAM) is a relatively low-cost way to collect long-term datasets of animal occurrence, which is particularly effective when the animals are not constantly present and calling [3]. PAM systems are fairly non-invasive, require relatively little maintenance, are deployable in remote and extreme locations, and can record continuously or on a pre-set schedule for months [4,5]. A plethora of information can be extracted from the extensive recordings, such as specific call characteristics at the individual and species level [1], diel and seasonal calling patterns [1,6,7], habitat use [8][9][10], biological processes (e.g., mating, spawning, feeding, social interactions, and competition) [11][12][13], species abundance and/or composition [14][15][16][17], and ecosystem health [18,19].
PAM has been applied in both freshwater and marine environments [4,6] and proven to be a useful method in detecting and recording the calls of a variety of aquatic animals, such as whales [10,20], dolphins and porpoises [6,21,22], seals [23,24], and fish [4,25,26]. There are over 800 species of soniferous fishes [27][28][29][30][31][32], so PAM is a very useful method for collecting spatial and temporal data on many of those species in a non-invasive, continuous way. For example, PAM has been used to characterize oyster toadfish (Opsanus tau) boat whistle activity (daily, seasonal, and geographical), characteristics (amplitude, waveforms, and spectra), and propagation [1], to determine distinct diel and seasonal calling patterns of white-spotted damselfish (Dascyllus albisella) [33], and to locate grouper, Epinephelidae [34], and red drum (Sciaenops ocellatus) [35] spawning sites. Fish calls are commonly identified in long-term datasets based on their low frequency (<1 kHz) [25,36], chorusing at dawn, dusk, and/or overnight [36,37], and a peak in calling during summer months or known spawning times [38,39]. However, due to the large datasets that are often created using PAM, manual analysis may not be feasible, so automatic detection and classification methods are preferred to expedite the data analysis process by extracting and identifying signals of interest [40][41][42][43][44].
To date, at least 15 studies have used automatic analysis methods to detect and/or classify fish calls [45]. Some studies focused only on automatic call detection without classification [46,47], whereas others applied an automatic pattern recognition method that both detected and classified the target call [48,49]. The detection methods commonly used in these fish call studies were a matched filter, spectrogram correlation, or an energy threshold to find and extract the target fish calls in the dataset. These are supervised detection methods because they are built on call characteristics that the researcher specifies. For example, Ruiz-Blais et al. [46] created a kernel to detect Jamaica weakfish (Cynoscion jamaicensis) calls based on four call features; a call was detected if every feature exceeded its threshold, which was predetermined by the researchers. Ricci et al. [47] used a multikernel approach based on the two lowest harmonic frequencies of oyster toadfish calls to identify their calls within the recordings. Other detection methods include both supervised and unsupervised machine learning algorithms, such as Gaussian mixture models, k-nearest neighbors, support vector machines, and neural networks, which are capable of pattern recognition and of extracting relevant information to not only detect, but also classify calls [45][50][51][52][53]. Most studies had an average detection or classification accuracy between 85% and 93%. Previous studies reported that accuracy depended on a variety of factors. The size of the training set was important, with larger training sets leading to higher accuracy [51,53]. Data with louder ambient noise led to decreased accuracy [53,54]. Finally, call type or fish species also affected the performance of detectors and classifiers [45]. Meagre (Argyrosomus regius) pulses had a much lower identification rate (6.6%) than long grunts (26.4%), intermediate grunts (93.2%), and short grunts (96%) [45], while the identification accuracy for grouper depended on species [49]. Additionally, Vieira et al. [52] and Monczak et al. [50] observed that their models identified longer-duration fish calls with prominent harmonics more accurately than shorter-duration, pulsed calls. Even though detection and classification accuracy is not high for every fish call or species, all acoustic studies of fish that have used automatic analysis methods concluded that these methods provide the most efficient way to analyze long-term PAM datasets [50,52].
In this study, we used a novel two-step analysis method to automatically detect and then classify six fish call types, which were manually identified from a long-term dataset collected in the northern Gulf of Mexico. We discuss the development of the detector and classifier, as well as report on the accuracy, precision, and recall of each step. This two-step methodology expedites call extraction and identification processes, can be used when the target calls vary in both duration and frequency, and presents a new, effective approach to study soniferous fish abundance and presence, which can help ecologists and managers estimate population size and health, leading to improved management decisions. Consequently, this method will be applied to a seven-year dataset collected from the northern Gulf of Mexico to assess the impact of the 2010 Deepwater Horizon oil spill on the local fish community based on fish calling patterns and call abundance. Lastly, this automatic, two-step analysis approach could be used by biologists and ecologists with limited programming skills to detect and classify many different and variable call types in a dataset, whereas previous machine learning algorithms were often created based on a specific call type or a few calls that have relatively similar frequency and duration.

Data Collection
Data used for training, testing, and evaluation of the detector and classifier were collected using a High-frequency Acoustic Recording Package (HARP), a bottom-mounted, autonomous, calibrated recorder containing a two-channel hydrophone, with one channel to record high frequencies and one to record low frequencies (sensitivity: −200 dB re Vrms/µPa and −187 dB re Vrms/µPa, respectively; flat frequency response (±1.5 dB): 1-100 Hz and 1-10,000 Hz, respectively) [55]. The HARP was deployed in the northern Gulf of Mexico, approximately 60 km north of the 2010 Deepwater Horizon oil spill site, at ~90 m depth (Figure 1). Recordings were collected between 2010 and 2012 continuously, with short gaps for recovery and redeployment. The HARP recorded at a sample rate of 200 kHz with 16-bit quantization.

The data were pre-processed by converting the compressed binary files into WAV files. All WAV files were decimated from the initial sampling frequency of 200 kHz to 2 kHz, reducing the data to a 1 kHz bandwidth (0-1000 Hz). This allowed for faster computational analysis because the fish calls of interest have energy content below 1 kHz [25,30]. Long-Term Spectral Averages (LTSAs), with a frequency and temporal resolution of 1 Hz and 5 s, respectively, were calculated from the data using Triton, a MATLAB-based acoustic analysis software package [56]. A year's worth of data was manually analyzed to determine potential fish call types in the dataset. Six likely fish calls were identified in these data and were the main target of this analysis. They are likely fish calls because of their low frequency (<1 kHz) and drumming or vibratory, pulsed sound [29][30][31]. The six calls varied in frequency and duration (Figure 2, Table 1). Five of the six calls (Beats, Buzz, Croak, Downsweep, and Pulse train) have not been documented before and are named based on the way they sound. It is possible that the sixth call, Jetski, is the same call as the 300 Hz frequency-modulated harmonic call described by Wall et al. [57]; however, this is difficult to discern from the limited call description and low-resolution spectrogram.
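The sampling arithmetic behind the decimation step can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the relation between window length and frequency resolution (N = fs / df) is the generic Fourier relation, not a statement about how Triton is configured internally.

```python
# Sampling arithmetic for the pre-processing step (rates taken from the text).
ORIGINAL_RATE_HZ = 200_000  # HARP sample rate
DECIMATED_RATE_HZ = 2_000   # sample rate after decimation


def decimation_factor(original_hz: int, target_hz: int) -> int:
    """Integer factor by which the sample rate is reduced."""
    if original_hz % target_hz != 0:
        raise ValueError("target rate must divide the original rate evenly")
    return original_hz // target_hz


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2


def window_length_samples(sample_rate_hz: int, freq_resolution_hz: float) -> int:
    """Window length that yields the requested bin spacing (N = fs / df)."""
    return round(sample_rate_hz / freq_resolution_hz)


factor = decimation_factor(ORIGINAL_RATE_HZ, DECIMATED_RATE_HZ)
bandwidth = nyquist_hz(DECIMATED_RATE_HZ)
# Assumption: a 1 Hz bin spacing computed on the decimated (2 kHz) data.
window_1hz = window_length_samples(DECIMATED_RATE_HZ, 1.0)
print(factor, bandwidth, window_1hz)
```

Decimating by a factor of 100 leaves a 1 kHz usable bandwidth, which matches the 0-1000 Hz band quoted above.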

Train/Test and Evaluation Datasets
When developing artificial intelligence methods to detect and classify calls, it is necessary to have two datasets: a train/test dataset and an evaluation dataset. The train/test dataset was established from one month of data (August 2010) and was used to develop and modify the automatic detector and classifier. This month was selected because all six fish call types were present, as well as many different noise types, and each call occurred relatively often compared with other months that were manually analyzed. The evaluation dataset was composed of the first seven days of June, September, and December 2011 and March 2012, as well as 18-24 July 2012, to cover all four seasons and two different years and to provide a good number of all six call types. These data were used to assess the performance of the detector and classifier on a new, diverse subset of the long-term dataset.
Triton was used to manually create the groundtruth, a log of the six fish calls of interest, for both datasets. When signals of interest were visually detected in the Triton-generated LTSA (plot length: 1 h), a short, higher-resolution spectrogram (plot length: 30-120 s, Hanning window, 90% overlap, frequency resolution: 2 Hz, time resolution: 0.5 s) was used to identify the fish call and record the call's start and end time. The average maximum and minimum frequencies and the duration of each call type were also measured (Table 1). Based on the analyst's manual analysis, the train/test dataset contained a total of 834 calls of the six fish call types and the evaluation dataset contained a total of 1503 calls of the six fish call types.
Data analysis occurred in two steps: detection followed by classification. These two methods are explained in more detail in Sections 2.3 and 2.4.

Call Detection: Energy Detector
To automatically determine the occurrence of signals of interest, fish calls were detected and extracted from the recording's waveforms using the energy detector feature of Ishmael, a bioacoustics analysis software package [58]. An energy detector was used to detect the six fish calls of interest because it is a general, broad detector, capable of detecting any signal or noise within a certain frequency band and above a threshold that the user specifies. The fish call characteristics needed to create the Ishmael energy detector were average minimum call frequency, average maximum call frequency, and average call duration (Table 1). The specified detector parameters included the frequency band (maximum and minimum frequency), call duration, threshold, and call detection neighborhood, which is how soon a subsequent call can be detected after an initial detection. These parameters were iteratively adjusted until a detector with a high recall (>85%) was established for the train/test dataset. Because the energy detector did not rely on other call characteristics, such as amplitude, number of pulses, or interpulse interval, these were not measured for each call type; the focus of this paper is the application of automatic analysis methods to these data, not the description of each call.
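The thresholding logic of a generic energy detector can be sketched as follows. This is not Ishmael's implementation: the function name `detect_events`, the assumption that band energy has already been computed per time frame, and the choice to measure the detection neighborhood from the end of the previous detection are all illustrative assumptions. Only the default values (threshold 0.054, 15 s maximum duration, 4 s neighborhood) come from the text.

```python
from typing import List, Tuple


def detect_events(
    band_energy: List[float],
    frame_step_s: float,
    threshold: float = 0.054,
    max_duration_s: float = 15.0,
    neighborhood_s: float = 4.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times of frames runs whose band energy exceeds threshold.

    band_energy: hypothetical per-frame energy in the detector band.
    A new event may only start once neighborhood_s has elapsed since the
    previous event ended (an assumed interpretation of "neighborhood").
    """
    events: List[Tuple[float, float]] = []
    start = None
    last_end = float("-inf")
    for i, value in enumerate(band_energy):
        t = i * frame_step_s
        if start is None:
            if value > threshold and t - last_end >= neighborhood_s:
                start = t
        elif value <= threshold or t - start >= max_duration_s:
            # Close the event when energy drops or the duration cap is reached.
            events.append((start, t))
            last_end = t
            start = None
    if start is not None:
        # Close an event that is still open at the end of the recording.
        events.append((start, len(band_energy) * frame_step_s))
    return events


energy = [0, 0.1, 0.1, 0, 0, 0.1, 0, 0, 0, 0, 0.1, 0.1, 0]
events = detect_events(energy, frame_step_s=1.0)
print(events)
```

In this toy input the burst at 5 s is ignored because it falls inside the 4 s neighborhood of the first detection, which illustrates how the neighborhood parameter limits closely spaced detections.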
To begin, a single, broadband energy detector (frequency range: 100-800 Hz) was applied to the dataset. The frequency range, threshold, and call detection neighborhood were then gradually decreased to 500 Hz (spanning from 150 to 650 Hz), 0.054, and 4 s, respectively. Recall is the number of true positives divided by the sum of true positives and false negatives (i.e., the number of detections that are also found in the groundtruth divided by the total number of calls in the groundtruth), and precision is the number of true positives divided by the sum of true positives and false positives (i.e., the number of detections that are also found in the groundtruth divided by all detections). Even though recall was relatively high with these parameters, precision was low (Figure S1), so modifications were made to the energy detector. Instead of one single broadband (150 to 650 Hz) energy detector, a detector with three smaller bands was applied to the train/test dataset. However, after iteratively adjusting the three frequency bands and settling on 200 to 240 Hz (captured Beats and Buzz calls), 450 to 600 Hz (captured Downsweep, Jetski, and Pulse train calls), and 870 to 950 Hz (captured Croak calls), detector precision and recall both decreased (Figure S2). The three-band energy detector was therefore not tested on the evaluation dataset, and the single broadband energy detector was determined to be the best detector to extract the six fish call types in this dataset.
The most efficient detector at extracting the six fish call types from the train/test dataset operated over 150 to 650 Hz and had a maximum call duration of 15 s, a 4 s detection neighborhood, and a threshold of 0.054. The detector was then applied to the evaluation dataset to check its performance on a new, diverse dataset on which it had not been developed. Recall and precision were calculated for different buffer lengths (tested between 3 and 6 s) to assess detector performance (Figure S1) [59]. Buffer length defines how close a detection must be to a manually selected call to count as a match; with a buffer length of 3 s, a detection that occurred within 3 s of a manually logged call start time was considered a true positive. Therefore, a longer buffer length results in a higher recall rate because it increases the probability that a manually logged call will be matched by a detection (Figure S1). In this study, recall was prioritized over precision because the overall purpose of the detector was to extract the majority of calls of interest. Each energy detection was saved as an individual WAV file with the detection start time as the file's name and the detection centered in the file. The second step of the process, classification, enabled rejection of the false detections and retention of signals of interest.
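The buffer-based scoring described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `score_detections` is hypothetical, times are start times in seconds, and a groundtruth call counts as found if any detection start lies within the buffer (the authors' exact matching rules are not given in the text).

```python
from typing import List, Tuple


def score_detections(
    detections: List[float], groundtruth: List[float], buffer_s: float
) -> Tuple[float, float]:
    """Return (recall, precision) for detection start times vs. logged call starts."""
    # Groundtruth calls matched by at least one detection within the buffer.
    found_calls = sum(
        any(abs(d - g) <= buffer_s for d in detections) for g in groundtruth
    )
    # Detections that fall within the buffer of at least one groundtruth call.
    true_detections = sum(
        any(abs(d - g) <= buffer_s for g in groundtruth) for d in detections
    )
    recall = found_calls / len(groundtruth) if groundtruth else 0.0
    precision = true_detections / len(detections) if detections else 0.0
    return recall, precision


# Hypothetical start times (seconds) chosen only to show the buffer effect.
logged_calls = [10.0, 50.0, 100.0]
detected = [12.0, 30.0, 55.0, 70.0, 96.0]
recall_3s, precision_3s = score_detections(detected, logged_calls, buffer_s=3.0)
recall_6s, precision_6s = score_detections(detected, logged_calls, buffer_s=6.0)
print(recall_3s, precision_3s, recall_6s, precision_6s)
```

With these made-up times, widening the buffer from 3 s to 6 s raises recall from one call in three to all three calls, the same qualitative trend reported for the detector.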

Call Classification: ResNet-50 Convolutional Neural Network
Before the classification step could be conducted, the automatic detections had to be converted into images. A custom-built MATLAB code was created to convert each Ishmael energy detection WAV file into an image (JPG file) by performing a short-time Fourier transform on each detection audio file, resulting in images with 2 Hz and 0.5 s resolution. Each detection WAV file was read into MATLAB and then a spectrogram of the detection file was created and changed to gray scale; the spectrogram was then filtered with a 2D Gaussian smoothing kernel with standard deviation of 1 and saved as an image.
Transfer learning was used to classify all of the detected images [60,61]. The pretrained convolutional neural network used was ResNet-50 [62], chosen because of its efficiency, accuracy, and simplicity to retrain for other classification purposes [63][64][65]. The classifier was trained with an unbalanced dataset since some fish calls were less common than others (Figure 3). The ~2200-image dataset (train/test image dataset) used to train and test ResNet-50 consisted of images of the manually detected six fish call types from the August 2010 train/test dataset, as well as images of five noise types (Disk write, Click train, Blank/noise (Blank), Low frequency noise (LF Noise), and Random noise (Ra Noise)), which were commonly observed in the dataset (Figure S3). Disk write is self-made noise by the HARP that occurs every 75 s. Click train most likely represents sounds produced by dolphins. LF Noise was most often airgun noise. Only 26 Downsweep fish calls were manually detected in the train/test dataset, so 27 additional Downsweep images, which were manually detected in the long-term dataset, were added to increase the number of Downsweep images to 53, and 32 manually detected Beats images from the evaluation dataset were added to increase the number of Beats images to 151.

Figure 3. Number of images of each sound type in the train/test image dataset: six fish call types and five noise types (Disk write, Click train, Blank/noise, Low frequency noise, and Random noise). The total number of images for training and testing was 2231, with more than half of the images representing a noise type. * The majority of Beats calls were from the August 2010 train/test dataset, but 32 were from the evaluation dataset. ** Approximately half of the Downsweep images were from the August 2010 train/test dataset and the other half were from other periods in the long-term dataset.

ResNet-50's accuracy was optimized by experimenting with data augmentation and adjusting hyperparameters, such as the number of epochs and mini-batch size. Data augmentation involved iteratively adjusting the scaling and translation range of images [66]. When augmentation was used to train the classifier in this study, images were randomly scaled and translated by any value in the range specified by the user. The scale range was specified as ±10% in the Y dimension and the translation range was specified as ±50 pixels in the X dimension, meaning each individual training image was randomly scaled in just the Y dimension by any percentage between −10% and +10% and shifted to the left or right by up to 50 pixels. However, because augmentation was random, there were probably instances where no scaling, no translation, or neither was applied. An image would only be translated if the random scaling value was 0%, would only be scaled if the random translation value was 0 pixels, and would remain "unaugmented" if both values were 0; otherwise, it was both scaled and translated. Therefore, classification accuracy can differ slightly between trials depending on how the training images are randomly augmented.
Images of the six fish call and five noise types were augmented (i.e., scaled and translated). The hyperparameters were set to 6 epochs and a mini batch size of 10.
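The random draw behind this kind of augmentation can be illustrated with a short sketch. It only samples the two augmentation parameters; it does not touch image data, and it is not the authors' MATLAB code. The function name `sample_augmentation`, the use of a uniform distribution, and integer pixel shifts are assumptions; only the ±10% and ±50 pixel ranges and the six epochs come from the text.

```python
import random
from typing import List, Tuple


def sample_augmentation(
    rng: random.Random, scale_range: float = 0.10, shift_range_px: int = 50
) -> Tuple[float, int]:
    """Draw one (Y scale factor, X shift in pixels) pair for a training image."""
    scale_y = 1.0 + rng.uniform(-scale_range, scale_range)  # 0.9 to 1.1
    shift_x = rng.randint(-shift_range_px, shift_range_px)  # -50 to +50, inclusive
    return scale_y, shift_x


def sample_for_training(n_images: int, n_epochs: int, seed: int) -> List[Tuple[float, int]]:
    """Each image receives a fresh random draw in every epoch."""
    rng = random.Random(seed)
    return [sample_augmentation(rng) for _ in range(n_images * n_epochs)]


# Hypothetical example: 53 images seen over 6 epochs gives 318 augmented views.
draws = sample_for_training(n_images=53, n_epochs=6, seed=0)
print(len(draws), draws[0])
```

Because every epoch redraws the parameters, a rare call type with few source images is still presented to the network in many slightly different positions and sizes, although the class proportions themselves do not change.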
Data augmentation was not used to adjust the level of background noise in the images. The images already contained the various levels of background noise found in the data, ensuring the classifier was not trained only on the most intense and clear signals with low background noise. Lastly, the actual training dataset remained unbalanced after augmentation. Multiple epochs increased the number of training images for all sound types, so the classifier was trained on more images of the rarer call types (e.g., Downsweep and Pulse train), but these rarer call types were not resampled more often than other call types to equalize the number of images per type during training.
To examine classifier consistency and accuracy, the detection images from both datasets (train/test and evaluation) were run through the classifier three different times, giving three classification trials for each dataset. These trials were used to calculate average recall and precision for each fish call type and average overall accuracy (total correctly labeled images/total number of images) for each dataset [53]. A confusion matrix of classifier performance was used to evaluate accuracy for each trial, and results were averaged for each dataset (Table 2, Table 3, Tables S1 and S2) [59]. For classification, recall is the number of correctly labeled images of one sound type divided by the total number of images of that sound type, and precision is the number of correctly labeled images of one sound type divided by the total number of images that are labeled as that sound type. Accuracy is the sum of true positives and true negatives divided by the total (i.e., the total number of correctly labeled images in the dataset divided by the total number of images in the dataset). Lastly, to see if classifier performance could increase and to further observe whether the classifier was consistent in labeling images, detection images labeled as the six fish call types were re-classified because many images labeled as a fish call were images of noise.

Table 2. Overall classifier accuracy (total correctly labeled detection images/total detection images) for the three different times (i.e., three trials) all the detection images from both datasets (train/test and evaluation) were classified by the data-augmented, trained classifier, as well as the average overall classifier accuracy for each dataset.

Dataset | Trial # | Overall Classifier Accuracy (%) | Average Overall Accuracy (%)

Finally, MATLAB was used to calculate the signal-to-noise ratio (SNR) of 100 randomly selected classified images of each fish call type (50 correctly classified and 50 incorrectly classified) to evaluate whether classifier performance depended on the intensity of the signal compared to ambient background noise. To calculate SNR, the WAV file of each image was used, and the frequency band and duration were adjusted for each call to capture where the most energy was present and to avoid including too much noise in the signal's sound pressure level (SPL) calculation (Figure S4). Two SNR calculations were made for each Downsweep call because a high-intensity version of the call would have multiple downsweeps present, but a low-intensity call, which was more commonly observed in the data, had one or two downsweeps; therefore, a full-band SNR calculation would be quite low compared to the SNR calculated using just the strongest downsweep. The SPL of the background noise, which was computed over the same frequency band and duration as the signal SPL but immediately before the signal start time, was then subtracted from the signal SPL to compute the SNR. To analyze the SNR and classifier performance, binomial logistic regression was used (0 = call was misclassified, 1 = call was correctly classified) to fit a probability of correct classification curve. A probability of 0.5 was used to determine the SNR threshold value, which is the minimum SNR value beyond which calls have a better-than-random chance of being correctly classified (Figure S5). The area under the receiver operating characteristic curve (AUC) was then calculated to evaluate the predictive performance of the logistic regression model (i.e., classifier performance based on the SNR of a call), where anything under 0.7 was considered poor, 0.7-0.8 adequate, 0.8-0.9 good, and >0.9 excellent.
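The SNR subtraction and the 0.5-probability threshold of a fitted logistic curve can be written out as a short sketch. The function names and the coefficient values are hypothetical; the study's fitted coefficients are not given in the text. For a logistic model p = 1 / (1 + exp(-(b0 + b1 * SNR))), the probability equals 0.5 where b0 + b1 * SNR = 0, so the threshold is SNR = -b0 / b1.

```python
import math


def spl_db(rms_pressure_upa: float, reference_upa: float = 1.0) -> float:
    """Sound pressure level in dB re 1 uPa from an RMS pressure."""
    return 20.0 * math.log10(rms_pressure_upa / reference_upa)


def snr_db(signal_spl_db: float, noise_spl_db: float) -> float:
    """SNR as the signal SPL minus the SPL of the noise preceding the call."""
    return signal_spl_db - noise_spl_db


def p_correct(snr: float, intercept: float, slope: float) -> float:
    """Logistic probability that a call with this SNR is correctly classified."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * snr)))


def snr_threshold(intercept: float, slope: float, probability: float = 0.5) -> float:
    """SNR at which the fitted curve reaches the requested probability."""
    logit = math.log(probability / (1.0 - probability))
    return (logit - intercept) / slope


# Hypothetical coefficients, for illustration only.
b0, b1 = -2.0, 0.5
threshold = snr_threshold(b0, b1)
example_snr = snr_db(signal_spl_db=100.0, noise_spl_db=94.0)
print(threshold, p_correct(threshold, b0, b1), example_snr)
```

With these made-up coefficients the threshold is 4 dB, so a call with the example SNR of 6 dB would have a better-than-even chance of being labeled correctly.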

Energy Detector Performance
The energy detector recall ranged from 79 to 92% depending on the buffer length; as buffer length increased, so did recall (Figure S1a). Because detector recall was >85% when the buffer length was 4 s for the train/test dataset, the detector was also run on the evaluation dataset, where recall decreased only slightly to 84.1% at the same 4 s buffer length (Figure S1b).
Detector precision was very low (Figure S1). The detector extracted 91,387 and 128,938 signals of interest when applied to the train/test and evaluation datasets, respectively. Regardless of buffer length, detector precision was <1.2% for both the train/test dataset and the evaluation dataset. Because the energy detector threshold was quite low, the detector selected all types of noise, including ships and boats, disk write, airguns, and a large variety of unidentified random noises (Figure S6).
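The reported precision is consistent with a simple bound that follows from the counts given in the text. The sketch below assumes that each manually logged call can account for at most one true-positive detection, which is an assumption about the scoring and is not stated by the authors; under it, precision cannot exceed the number of groundtruth calls divided by the number of detections.

```python
def max_precision(n_groundtruth_calls: int, n_detections: int) -> float:
    """Upper bound on precision if every logged call matched exactly one detection."""
    return n_groundtruth_calls / n_detections


# Call and detection counts as reported for the two datasets.
train_bound = max_precision(834, 91_387)
evaluation_bound = max_precision(1_503, 128_938)
print(f"{100 * train_bound:.2f}% {100 * evaluation_bound:.2f}%")
```

Both bounds fall below 1.2%, which shows why a second classification step is needed to separate the calls from the large volume of noise detections.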

ResNet-50 Classifier Performance
Varying the training and testing set ratio (training:test ratio) of the train/test image dataset indicated that a higher training:test ratio increased classifier accuracy the most when labeling Downsweep images; for other call types, increasing the training:test ratio only slightly improved classifier accuracy (Figure S7). However, overall classification accuracy of the train/test image dataset, defined as the rate of correct classifications for all 11 sound types (not just the six fish calls), increased when the training:test ratio increased. When the training:test ratio was 70:30, 80:20, and 90:10, overall accuracy was 90.5%, 91.4%, and 95.1%, respectively. Therefore, a training:test ratio of 90:10 was used for the remainder of the study to train and then test ResNet-50's performance prior to classifying the thousands of detection images from each dataset.
When augmentation was not used to train ResNet-50, average classifier precision was low for the six fish call types when classifying the train/test dataset (<45%) and evaluation dataset (<46%) (Figure 4). Because the six fish call types were not always in the same location in the detection image and call size (call duration and frequency band) varied slightly, image augmentation was used to increase classifier precision. After iteratively adjusting the scaling factor and pixel translation range, classifier performance was highest when the scale range in only the Y dimension was ±10% and the translation range in only the X dimension was ±50 pixels. When these image augmentation settings were applied to the images used to train ResNet-50 and the number of epochs was increased to six, so that the classifier looped through the train/test image dataset six times instead of just once, average classifier precision increased for all six fish call types (Figure 4). Augmentation also increased average classifier recall for all six fish call types when classifying the evaluation dataset detection images (Figure 4b). Even though average recall and precision varied greatly depending on fish call and noise type, average overall classifier accuracy with augmentation, when all 11 sound types were considered, was 87.97% and 93.02% for the train/test and evaluation dataset detection images, respectively (Table 2 and Tables S1 and S2).
Classifier performance varied among call types. The Buzz fish call had the lowest average recall rate (39.00%); most of the Buzz images were labeled as LF Noise, Blank, or Ra Noise (Table 3). For example, in one trial of the evaluation dataset, 61 of the 213 Buzz images were labeled correctly, 8 were labeled as another fish call type (Beats), and 144 were labeled as LF Noise, Blank, or Ra Noise.
The other five fish call types had much higher average recall rates than the Buzz call, ranging from 62% for the Pulse train to 91% for the Croak (Table 3). As with the Buzz images, however, most of the other fish call images that were mislabeled were labeled as LF Noise, Blank, or Ra Noise rather than as another fish call type (Tables S1 and S2).
A similar trend was observed for precision for all six fish call types: if images labeled as a fish call were misclassified, they were usually a noise type rather than a different fish call type. For example, in one trial of the evaluation dataset, 1148 images were labeled as "Beats", of which 448 were Beats, 663 were LF Noise, Blank, or Ra Noise, 5 were Disk Write, 6 were Click train, and 26 were other fish call types (21 of which were Buzz). The fish call type with the lowest average precision was Downsweep (26.50%), and the Croak fish call type had the second lowest average precision (37.33%) (Table 3). The four other fish call types had higher average precision (Beats: 56.67%, Buzz: 50.00%, Jetski: 94.17%, Pulse train: 59.50%). Interestingly, the average precision for each of the six call types was lower for the evaluation dataset than for the train/test dataset.
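The single-trial counts quoted in this section can be turned into per-class recall and precision directly. The Python sketch below is only illustrative; the dictionary layout and function names are assumptions, while the counts are those given in the text for one evaluation-dataset trial each.

```python
def recall(true_label, labeled_as):
    """Fraction of images of a given true type that received the correct label."""
    return labeled_as[true_label] / sum(labeled_as.values())

def precision(assigned_label, true_types):
    """Fraction of images given a label that truly belong to that type."""
    return true_types[assigned_label] / sum(true_types.values())

# What the 213 true Buzz images were labeled as (one evaluation trial).
buzz_labeled_as = {"Buzz": 61, "Beats": 8, "LF Noise/Blank/Ra Noise": 144}

# What the 1148 images labeled "Beats" actually were (one evaluation trial).
beats_true_types = {
    "Beats": 448,
    "LF Noise/Blank/Ra Noise": 663,
    "Disk Write": 5,
    "Click train": 6,
    "Other fish calls": 26,
}

buzz_recall = recall("Buzz", buzz_labeled_as)
beats_precision = precision("Beats", beats_true_types)
print(f"Buzz recall: {buzz_recall:.1%}")          # 28.6%
print(f"Beats precision: {beats_precision:.1%}")  # 39.0%
```

These are single-trial values, so they differ from the three-trial averages reported in Table 3.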
Re-running the images labeled as any of the six fish calls through the classifier increased classifier precision by amounts ranging from slight (<0.5%) to large (>19%) (Table 4). One train/test trial and one evaluation trial each had their percentage of correctly labeled images increase by <0.5%. On the other hand, two evaluation trials and one train/test trial showed increases in precision between 19 and 34%, further indicating a substantial level of inconsistency in this classifier.
For all six fish call types, the SNR was not directly related to classifier performance (Figure S5). The selected SNR threshold value was greater than 0 for four of the calls (Beats, Buzz, Jetski, and Pulse train) and less than 0 for the remaining three measurements (Croak, Downsweep full sweep, and Downsweep strongest sweep). For all fish calls, however, many calls were not correctly classified even when the SNR value was greater than the SNR threshold value. The area under the receiver operating characteristic curve (AUC) values ranged from 57.1% (Buzz) to 86.2% (Beats), indicating the logistic regression model ranged from "bad" to "good" (i.e., classifier performance was variable when labeling images correctly based on the SNR value), depending on the call type (Figure S5). Based on the model, classifier performance was "bad" at correctly labeling images of Buzz (57.1%), Croak (63.5%), Jetski (67.0%), and Pulse train (66.9%) calls based on their SNR, "adequate" at correctly labeling images of Downsweep calls when the full or strongest downsweep was measured (full: 76.6%; strongest: 71.8%), and "good" at correctly labeling images of Beats calls (86.2%).

Analysis Time: Manual vs. Automatic Methods
Using the energy detector and a pre-trained neural network sped up the call detection and identification process (Table 5). These automatic methods, implemented on a midrange desktop computer with 8.00 GB RAM, a 64-bit operating system, an Intel Core i3-8100 CPU, and no GPU, processed a month of data in approximately 8 and 10 h for the train/test and evaluation datasets, respectively. Manually going through a day of data takes 15 to 45 min, depending on the number of calls present and the amount of background noise. It took the analyst more than double the time (~2.6 times) to go through each dataset compared with the two-step automatic detection and classification methods applied in this study.
Table 5. Statistics on detector and classifier performance including the number of recording days, the number of detections selected by the energy detector, the amount of time (hr:min) it took the detector to run, the average amount of time it took the classifier to label the detection images from each dataset, and the amount of time it took the analyst to manually annotate each dataset.


Discussion
We used an effective automatic detector and classifier system to efficiently extract and label six fish call types from a long-term passive acoustic monitoring dataset from the northern Gulf of Mexico. This approach offers the potential to expedite the analysis process for multiyear datasets from this region. However, understanding the caveats and potential pitfalls of the process is important before applying it to long-term data.

Automatic Energy Detector
The energy detector function in Ishmael was chosen since it is a general, broad detector capable of extracting the six fish call types, which vary in duration and frequency. The recall rate in this study (~85%) was similar to the recall rate in another fish acoustic study [46]. Ruiz-Blais et al. [46] used four Jamaica weakfish call features to create their detector, which had an accuracy of 96% and recall of 81%, and Monczak et al. [50] had a signal detector with an identification rate (((# of files - # of files with a false negative - # of files with a false positive)/# of files) × 100) > 80% in most cases, depending on species, recorder location, and call type. Wall et al. [67] developed an automatic detection algorithm to extract red grouper (Epinephelus morio) calls within a year-long dataset and its recall rate was 44%; however, they prioritized precision over recall because they were more interested in the timing of sound production than in call abundance. Ricci et al. [47] also used a detector to extract fish calls of interest, but they measured detector performance by the false detection rate (= 1 - precision); their false detection rate was ~1%, which is far lower than the rate observed for the energy detector in this study (precision < 1.2%, i.e., a false detection rate > 98.8%). Additionally, two other studies [48,49] each developed their own algorithm that both detected and identified the target fish call(s) in their respective datasets. Kottege et al. [48] used spectro-temporal features to successfully identify tilapia calls with ~98% accuracy and a recall rate of 94%, using discriminant analysis methods. Chérubin et al. [49] did not report detector recall, but their algorithm had an overall identification accuracy of 87.5%; however, it varied based on species.
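Because the studies compared above report different detector metrics, it can help to put them on a common footing. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the identification rate follows the definition attributed to Monczak et al. [50] as read from the text, and the file counts in the example are invented purely for illustration.

```python
def false_detection_rate(precision_percent):
    """False detection rate (%) = 100 - precision (%)."""
    return 100.0 - precision_percent

def identification_rate(n_files, n_false_negative_files, n_false_positive_files):
    """Percentage of files with neither a false negative nor a false positive."""
    return (n_files - n_false_negative_files - n_false_positive_files) / n_files * 100.0

# Precision below 1.2% (this study) corresponds to a false detection rate above 98.8%.
print(round(false_detection_rate(1.2), 1))

# Hypothetical counts: 100 files, 10 with a false negative, 5 with a false positive.
print(round(identification_rate(100, 10, 5), 1))
```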
The majority of detections in our data were noise. The high false detection rate was not surprising because the Gulf of Mexico has one of the loudest US water soundscapes due to shipping and seismic exploitation and exploration, including airgun presence [68][69][70]. Manual review of detections revealed that many were airguns, which is expected since seismic surveys occur all year in the northern Gulf of Mexico. Another frequent detection was Disk write, which is self-noise produced by the HARP; this noise occurs every 75 s and was often detected by the energy detector we developed.
Despite the low precision, the energy detector by itself was deemed a good detector since it had a relatively high recall rate and it resulted in more total detections of the six fish call types than the manual observer analysis. This indicates that even though the detector did not detect all of the same fish calls as the analyst, it was capable of extracting calls at other times that the analyst missed. It should be noted that in our analysis, when the detector extracted a signal of interest and the signal was a fish call that the manual observer had not marked, the detection was considered a false positive, even though a fish call was present. Any other treatment would have resulted in a bias in our analysis.
Lastly, the train/test (August 2010) and evaluation datasets were manually analyzed by three different people to reduce subjectivity and bias because there is often inconsistency in marking calls among multiple analysts [71]. Rather than combine the manual detections from all three observers, we chose to use only one analyst's log as our ground truth for both datasets because that log had more total fish calls marked and at the same time contained 80% of the calls marked by the other two analysts. This approach led to a larger dataset for evaluation and, thus, higher detector precision and recall.

ResNet-50 Classifier
ResNet-50 is an efficient and accurate pre-trained neural network that can easily be retrained for other classification purposes [63,64]. Based on the literature review for this study, this is the only fish acoustic study to date that has used transfer learning to classify fish calls. Other fish acoustic studies used machine learning algorithms that were created and trained specifically to detect and classify their fish calls of interest [45,49,[51][52][53][54]72]. Pre-trained neural networks that can be retrained for other classification purposes, such as AlexNet [73], GoogLeNet [74], and ResNet-50, provide scientists who have a background in biology and ecology rather than signal processing or artificial intelligence with a ready-to-use model that can be easily modified based on their classification goals and target signal. Zhang et al. [75] observed that classification accuracy was >20% higher when they used transfer learning compared to training a neural network from the beginning on the 16 target whale calls in their dataset.
In this study, the classifier ResNet-50 was retrained with an unbalanced dataset of images of the six fish call types and five noise types that were commonly detected by the energy detector. The higher the training:test set ratio, the higher the classification accuracy for each of the six fish call types, especially Downsweep, which only had a total of 53 images for training and testing. It is well-known that classifier performance is frequently dependent on the size and quality of the training dataset [76]. Two fish acoustic studies that used automatic classification methods noted the impact of training set size. Harakawa et al. [51] used sequential machine learning algorithms to classify Sciaenidae calls and recall increased from 80.8% to 92.2% when the percentage of training data increased from 1% to 80%. Noda et al. [53] observed that median classification accuracy increased from 81.77% to 95.58% when the training dataset size increased from 5% to 50%. Both studies used a smaller training set percentage (<80% of the train/test image dataset) than what we used (90%), but the number of images they had for their calls of interest was greater than the number of images we had for each category, which is why our train:test ratio was larger than normal.
Image augmentation, together with a mini-batch size of 10 and 6 epochs, increased classifier precision across all fish call types (Figure 4). Interestingly, none of the fish acoustic studies to date (according to our literature search) mentioned using image augmentation to train their classifier; however, image augmentation has been used to increase the training data size and classification performance when labeling other animal calls [75,77,78]. For example, Padovese et al. [79] used image augmentation to generate synthetic calls to increase training data size, resulting in increased classifier recall and precision for labeling North Atlantic right whale (Eubalaena glacialis) upcalls. Rasmussen and Širović [80] used scaling and translation augmentation to prevent their classifier from overfitting during the training process. Image augmentation was beneficial in this study because the number of images for training and testing was relatively small (<400 images) for each sound type; in fact, two of the six fish calls had <100 images each for training and testing (Figure 3). Therefore, increasing the number of epochs and using augmentation to scale and translate images increased the training set size and ensured the classifier was trained on images where the call or noise was not always in the same location in the image and the size of each call and noise was slightly variable.
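The scaling and translation augmentation described above can be sketched as follows. This pure-Python example is an assumption-laden illustration, not the software used in the study: the function names, the nearest-neighbour resampling, the zero fill value, and the tiny 3 × 3 "image" are all hypothetical. Only the ranges (vertical scaling of ±10% and horizontal translation of ±50 pixels) come from the text.

```python
import random

def sample_augmentation(rng, y_scale_range=0.10, x_shift_range=50):
    """Draw a random vertical scale factor (1 +/- 10%) and horizontal shift (+/- 50 px)."""
    scale_y = rng.uniform(1.0 - y_scale_range, 1.0 + y_scale_range)
    shift_x = rng.randint(-x_shift_range, x_shift_range)
    return scale_y, shift_x

def augment(image, scale_y, shift_x, fill=0):
    """Scale rows about the image centre (nearest neighbour) and shift columns by shift_x."""
    height, width = len(image), len(image[0])
    centre = (height - 1) / 2.0
    out = []
    for r in range(height):
        # Source row for this output row after vertical scaling about the centre.
        src_r = int(round(centre + (r - centre) / scale_y))
        row = []
        for c in range(width):
            src_c = c - shift_x  # a positive shift moves content to the right
            if 0 <= src_r < height and 0 <= src_c < width:
                row.append(image[src_r][src_c])
            else:
                row.append(fill)
        out.append(row)
    return out

rng = random.Random(0)
scale_y, shift_x = sample_augmentation(rng)
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # rows = frequency (Y), columns = time (X)
shifted = augment(image, 1.0, 1)            # shift one pixel to the right, no scaling
print(shifted)
```

Drawing new parameters for every image on every epoch means that, over six epochs, the classifier sees six slightly different versions of each training image, which is how augmentation effectively enlarges a small training set.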
The fish call types with the lowest (39.00%) and highest (91.00%) average recall were Buzz and Croak, respectively. Usually, classifier performance is lower for short-duration, pulsed calls [45,52]; however, classifier recall did not appear to depend on call duration or frequency band in our study. For the two pulsed calls, Pulse train and Downsweep, average recall was 62.17% (second lowest, but much higher than average recall for Buzz) for Pulse train and 83.83% (third highest) for Downsweep. Additionally, it was surprising that the Buzz had a low average recall because calls with distinct harmonics had a higher identification rate (i.e., were classified correctly more often) than shorter duration, pulsed calls in another fish acoustic study [50]. It is likely that Buzz calls were misclassified because they slightly resemble airguns (LF Noise; Figure S6c) and Ra Noise, which have more energy (i.e., higher intensity) at very low frequencies (<100 Hz), similar to the Buzz call (Figure 2b). Average classifier recall was lower than might be expected for relatively long-duration Jetski calls (64.83%). They were often labeled as Blank, which was surprising because upon reviewing the images, the call was always present and visible.
Average precision of the classifier, on the other hand, generally appeared related to call duration or frequency band for most of the fish calls. Even though average recall was not the highest for the Jetski call (third lowest, 64.83%), average precision was much higher for the Jetski call (94.17%) than for the other five fish call types (second highest, Pulse train: 59.50%). This is likely because the Jetski call looks unique and does not resemble any of the other calls or noise, making it easy to distinguish from the other call and noise types (Figure 2e). The Downsweep call had the lowest average precision (26.50%); most images labeled as "Downsweep" were noise, but Jetski and Pulse train images were also labeled as Downsweep occasionally because some Jetski calls are short in duration and may resemble a Downsweep. Pulse train calls are pulsed calls that are in the same frequency band as the strongest sweep in a Downsweep call (Figure 2d,f), which could also have contributed to the confusion. Croak had the second lowest average precision (37.33%), with images of broadband noise and Disk write, which loosely look like the Croak call, commonly labeled as "Croak." The presence of airguns also impacted precision. Airguns produce low-frequency noise that covers the same frequency band as Beats and Buzz [69,70] and could have been the cause of reduced precision for those calls, because a large amount of LF Noise was labeled as Beats and Buzz. Further, because there was greater airgun presence in the evaluation dataset than in the train/test dataset, average precision for Beats and Buzz was further lowered in the evaluation dataset. Lastly, average recall was higher than average precision for four of the six calls (Beats, Croak, Downsweep, and Pulse train). Harakawa et al. [51] similarly noted that overall classifier recall was greater than overall classifier precision, regardless of training set size, when classifying Sciaenidae calls.
Finally, average overall classifier accuracy in this study (~90%) was similar to classifier accuracies observed in fish acoustic studies that used automatic analysis methods, such as matched filters and machine learning algorithms [49,50,52,53,72,81]. Some studies with higher performance trained their models on fewer classes (i.e., sound types). ResNet-50 was trained on 11 sound types (six fish calls and five common noises) and other studies have shown that more classes (referred to as sound types in this study) often result in lower classification accuracy [45,82]. When Vieira et al. [45] used four sound types of meagre calls, the overall mean identification rate was 43.3%, but when they reduced the number of sound types to two, the overall mean identification rate increased to 78.8%. Maintaining a large number of classification labels can be useful for datasets with a large variety of signals and when the focus of the study is not a single species or sound source. ResNet-50 performed well overall in this study and was capable of labeling 11 different sound types and thus could be a good option for other studies of multiple sound sources as well.
Interestingly, the "SNR" of each image (each spectrogram had to be converted into an image to be processed by the ResNet-50 classifier) did not appear to affect classifier accuracy (Figure S5). The relatively gradual logistic curve, low AUC values, and many calls that were not correctly classified even though they had a high SNR indicate the poor predictive performance of the logistic regression model and that SNR does not appear to affect classification performance (i.e., accuracy) in this study. This was unexpected since the SNR affected automatic detection and classification of fish chorusing and individual calls in other studies [53,54]. However, both Noda et al. [53] and Lin et al. [54] ran their detection and classification algorithm on spectrograms instead of images, which means that signal intensity and frequency influence detection and classification differently than 2D images [83]. Overall, though, ResNet-50 classified images of the six fish call types well, regardless of the SNR.

Considerations for Application of This Approach to Long-Term Datasets
There is no ideal or standard way to divide all available data between training/testing and evaluation. However, typical studies split the data so that 70-80% is used for training and testing the model and 20-30% is used for evaluating how the model performs on data it has not seen before. In this study, though, we used a much larger evaluation dataset than typical (~55% of the data). To apply this two-step methodology to real data, we thought it was important to evaluate the detector and classifier performance across a variety of situations, including different months, seasons, years, and noise levels (e.g., airgun and shipping presence). Therefore, a large evaluation dataset was specifically used to ensure that the energy detector and ResNet-50 could perform as well across a large variety of sound conditions that represent the diverse soundscape of the northern Gulf of Mexico.
The coupled automatic detector and machine learning methodology used in this study expedites the call detection and identification process when analyzing a long-term passive acoustic monitoring dataset. The analyst took ~2.6 times longer to analyze each dataset than the automatic methods used in this study, and this difference in analysis time could be further increased if a supercomputer with more specialized and powerful hardware, such as RAM, GPU, and FPGA, is used [84]. Even without such hardware, the methods employed here could produce results on fish call presence from a full year of acoustic data in about five days of processing time. The energy detector feature in Ishmael and ResNet-50 are relatively simple programs that can be easily modified to detect and classify any signal, or signals, of interest without requiring extensive programming abilities or equipment. Further, this study is proof of concept that transfer learning, which has not been applied in any previous fish acoustic study, can be used in fish vocalization studies.
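The "about five days" estimate follows from the per-month processing times reported in the Results (approximately 8 to 10 h per month of data). The short Python sketch below reproduces that arithmetic and, for comparison, the manual review time implied by 15 to 45 min per day of data; the function names and the choice of a 31-day month are assumptions made for illustration.

```python
def yearly_processing_days(hours_per_month, months=12):
    """Days of continuous computing needed to process a year of recordings."""
    return hours_per_month * months / 24.0

def manual_hours(recording_days, minutes_per_day):
    """Analyst hours needed to review a given number of recording days."""
    return recording_days * minutes_per_day / 60.0

print(yearly_processing_days(8), yearly_processing_days(10))  # 4.0 5.0
print(manual_hours(31, 15), manual_hours(31, 45))             # 7.75 23.25
```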
Even though the classifier performed well overall (~90% accuracy), it is important to note the inconsistency in classifier performance. The accuracies of the three trials for each dataset were similar to each other (Table 2); however, the recall and precision for each fish call varied among trials for each dataset, indicating that the classifier was not consistent at labeling images (Table 3). Further, when we re-ran the images labeled as any of the six fish calls, the percentage of correctly labeled images increased between 0.2% and 34.3% for the six trials (Table 4). This broad range in percentage of correctly re-classified images originally labeled as a fish call shows the inconsistency in classifier performance. However, seeing an increase of >19% in the percentage of correctly re-labeled images for three of the six trials also suggests that using the classifier multiple times can ultimately lead to improvement in the overall output (i.e., accuracy and precision). However, this multiple-classification process may need to be coupled with a manual review step to validate fish detections and remove misclassified noise from the final output.
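One way to picture the multiple-classification process is as a filter that keeps only the images labeled as a fish call on every pass. The Python sketch below is a hypothetical illustration: the function name, the idea of passing one classifier callable per pass, and the toy labels are assumptions, and the text does not specify how the authors combined the outputs of successive passes.

```python
FISH_CALLS = {"Beats", "Buzz", "Croak", "Downsweep", "Jetski", "Pulse train"}

def keep_fish_detections(image_ids, classifiers):
    """Apply each classifier in turn, keeping only images labeled as a fish call."""
    kept = list(image_ids)
    labels = {}
    for classify in classifiers:
        labels = {img: classify(img) for img in kept}
        kept = [img for img in kept if labels[img] in FISH_CALLS]
    # Return the surviving images with the label from the final pass.
    return [(img, labels[img]) for img in kept]

# Toy stand-ins for two classification passes (hypothetical labels).
pass1 = {"a.png": "Beats", "b.png": "LF Noise", "c.png": "Buzz", "d.png": "Croak"}
pass2 = {"a.png": "Beats", "c.png": "LF Noise", "d.png": "Croak"}

result = keep_fish_detections(list(pass1), [pass1.get, pass2.get])
print(result)  # [('a.png', 'Beats'), ('d.png', 'Croak')]
```

In this toy case the second pass removes one noise image that the first pass had labeled as Buzz; the surviving detections would still go to the manual review step mentioned above.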
This methodology could be applied to datasets with multiple call types that have diverse, variable temporal features. In our study, temporal variations in calls did not matter because the energy detector will pick up any signal if it is above the specified threshold and within the specified frequency band; the number of pulses and interpulse interval did not affect detection recall and precision. Similarly, since our training dataset represented the variability of calls' temporal features, the success of the classifier was not affected by those variations. For example, images labeled as Beats had from one to more than five Beats present with calls separated or overlapping. Additionally, images with three to seven pulses for a Pulse train were reliably identified as a Pulse train due to their short duration and relative y-position in the image (i.e., frequency), regardless of the interpulse interval. Therefore, as long as the training set is diverse and fairly large (or diversified and expanded with the use of data augmentation), the methods used in this study can detect and classify calls well, even when there are temporal and frequency variations. The energy detector is broad and can work for any signals of interest, while the pre-trained ResNet-50 classifier could work with any call type. However, for different call types, a separate library of training and test images will have to be developed and ResNet-50 will have to be retrained for those particular calls for the classifier to work on the new call types.
The methods developed in this study have the potential to be applied to soundscapes and signals from a variety of ecosystems. By decimating the data, we were able to remove from the energy detection most of the higher frequency biological noise (snapping shrimp and cetaceans) that is also abundant in the region. A similar decimation approach might be of use to others wishing to reduce false detections in the first step of the process. Our study region, the Gulf of Mexico, is one of the loudest U.S. water soundscapes [69] with high levels of low-frequency anthropogenic noise (e.g., commercial shipping and airguns). These sounds could not be removed by decimation since they generally overlap in frequency with our calls of interest. Even with these noises, however, our methods performed well because the six fish call types were common in the dataset, allowing development of a robust training and testing library, and distinct from other sounds in the environment. In most instances, automated detection and classification should be used on signals that are relatively common and easy to distinguish. In cases of sounds that are rare or whose features are challenging to distinguish, it is not likely any automated classification methods will perform satisfactorily. Therefore, the methodology presented in this paper should be well suited for relatively common signals, regardless of habitat type and levels of noise present, be they biophonic, anthrophonic, or geophonic.
In conclusion, this two-step process presents a novel, effective method to study soniferous fish abundance and presence and can help ecologists and managers understand potential population recovery based on call numbers and change in occurrence patterns over the years. As the next step, the energy detector and ResNet-50 classifier will be applied to the long-term PAM dataset collected between August 2010 and August 2017 at this location in the Gulf of Mexico. Those results will enable us to estimate changes in call abundance and observe daily, monthly, and annual calling patterns in the area, and possibly assess the impact of the 2010 Deepwater Horizon oil spill on the local fish community.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/jmse9101128/s1, Table S1: Average confusion matrix of the three train/test dataset trials, Table S2: Average confusion matrix of the three evaluation dataset trials, Figure S1: Energy detector precision and recall dependent on various buffer lengths (3 s, 4 s, 5 s, and 6 s) for: (a) the train/test dataset and (b) the evaluation dataset, Figure S2: Comparison of precision and recall between the three-band detector (circles) and single, broadband (stars) energy detector based on various buffer lengths of 3 s, 4 s, 5 s, and 6 s, Figure S3: Images (created from spectrograms) of the five noise types that the classifier was trained with: (a) Disk write, (b) Click train, (c) Blank/noise, (d) Low frequency noise, (e) Low frequency noise, (f) Random noise, and (g) Random noise, Figure S4: Images (created from spectrograms) of the six fish call types: (a) Beats, (b) Buzz, (c) Croak, (d) Downsweep full sweep, (e) Downsweep strongest sweep, (f) Jetski, and (g) Pulse train, with shaded bands (yellow) representing the frequency band and duration over which the signal sound pressure level was calculated for each call type. Two SNR calculations were made for each Downsweep call (indicated by d and e) because a high intensity Downsweep call had multiple downsweeps, but a less intense Downsweep call, which was more commonly observed in the data, had only one or two downsweeps, Figure S5: Binomial logistic regression plots with a 0.5 threshold (black, dotted horizontal line; 0 = incorrectly classified, 1 = correctly classified) to determine the signal-to-noise ratio (SNR) threshold value (vertical blue line in each subplot), the SNR value above which a call should be correctly classified, for each fish call type:

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to their large size.