Proposal for a System Model for O ﬄ ine Seismic Event Detection in Colombia

: This paper presents an integrated model for seismic events detection in Colombia using machine learning techniques. Machine learning is used to identify P-wave windows in historic records and hence detect seismic events. The proposed model has ﬁve modules that group the basic detection system procedures: the seeking, gathering, and storage seismic data module, the reading of seismic records module, the analysis of seismological stations module, the sample selection module, and the classiﬁcation process module. An explanation of each module is given in conjunction with practical recommendations for its implementation. The resulting model allows understanding the integration of the phases required for the design and development of an o ﬄ ine seismic event detection system.


Introduction
Earthquakes have been one major concern to societies around the world. Earthquakes are a consequence of earth tectonics, which cause intercontinental plate drifts. Deformation energy is stored along the plates. Once one or more fault lines exhaust their elastic deformation capacity, rupture occurs and the stored energy is released as seismic waves, propagating along with the earth's crust. Depending on the amount of energy released and the depth of rupture, seismic waves can hit civil infrastructure, causing major impacts. Such events as in Sumatra (Indonesia, 2004), Haiti (2010), and Tohoku (Japan, 2011) are proof of how devastating can earthquakes be over human infrastructure and society as well. In Colombia, the Armenia earthquake (1999, Mw 6.2) is referenced as the worst seismic event for the country, which forced the government to call for an updating of the existing design and construction code. The resulting document was introduced into the Colombian legislation, making it a mandatory practice among civil infrastructure designers and constructors [1,2].
Earthquake engineering is a branch of engineering born to reduce the effects of earth seismicity. The approaches that earthquake engineering take can be seen from two perspectives: a study of the seismic phenomena and a study of the structural response after the seismic event. Research in earthquake engineering has increased in depth as new materials and computational power have been conceived. Simple techniques for the characterization of earthquake events can be used intensively in an attempt to formulate methodologies that provide people with a time frame to evacuate civil infrastructure during significant seismic events. However, current methodologies involving computational power deal with limitations in storage capacity (storage of seismic traces and raw data), processing capabilities (multichannel seismic acquisition), and lack of compatibility and integration of software resources, adding difficulties for the implementation of a successful seismic event detection technique [3][4][5].
Few academic research groups in Colombia, including the Colombian Geological Service, dedicate their efforts to boost techniques and methodologies focused on the seismic phenomena. Most of the research efforts aim toward a better understanding of both site seismic and structural responses, while some research has been carried on the understanding of local seismicity. Countries such as Mexico and Chile, both with similar seismic characteristics as Colombia, dedicate their research efforts to improve cities' structural resilience and to improve the social response during seismic events as well. The proposal of this research paper is a system model for offline seismic event detection in Colombia, where a set of integrated modules for reading and processing historical seismic raw data deals with the reduction of the computational costs for the successful detection of seismic events.
This article structures the proposal of a machine learning-based model for the detection of seismic events as follows: (a) problem statement (seismology and seismic data recording), (b) earthquake detection methodologies (traditional vs. current approaches), (c) seismic detection model proposal (model architecture and modules description), and (d) article conclusions.

Problem Statement
Countries over the Pacific coast of Southern America have a long history of catastrophic earthquake events. According to the US Geological Survey, five of the top 20 largest earthquake events occurred since the earlier 1900s, including the largest, have occurred along the fault line traced by the borders of the Nazca Plate that subducts below the South American Plate [6]. The tectonic environment in Colombia can be described by its two main fault zones: (a) Romeral zone (intraplate seismic zone that runs from north to south of the country's Pacific coast with an approximate length of 1200 km) and (b) the frontal fault of Eastern Cordillera (fault system that divides the Colombian Andean territory from its eastern great plains, most likely a southern border of the Caribbean Plate) [7]. One local seismic zone on the Colombian northeast of great activity is the Bucaramanga Seismic Nest, where at least eight events with a magnitude Mw > 4.7 occur each year [8].
Typical geological and seismic observation services such as SGC in Colombia provide seismic analysis in a two-stage fashion: first, by acquiring and storing seismic records (which can include up to three spatial components of accelerations, velocities, and ground displacements) and second, by performing seismic event recognition by looking for particular seismic characteristics within the stored data. Then, the geological service within a short time frame reports the occurrence of a seismic event, which is usually information that commonly contains the event magnitude and its approximate geolocation. [9][10][11]. This two-step procedure is complex, since it involves algorithms to read, synchronize, and process seismic data information. Geological services usually rely on black-box software that performs these tasks, closing the door to monitor sub-stages and therefore not letting the user integrate alternative algorithms that could eventually improve and/or fit specific site characteristics to the seismic data analysis sub-stage [12,13].
To establish methods for the detection and analysis of seismic events, the disposition of a set of historical seismic records that can be stored, read, and processed is essential to develop an accurate detection of future events (classic approach of learning from data). However, it is difficult to find a seismic dataset that fits the requirements for later processing stages, and when retrieved, the seismic files are not easy to interpret, as they include specific seismic parameters contained in legible formats that only specialized software such as SEISAN and SeisComP can process [14,15].
Moreover, several factors make an integral analysis of these Colombian historical records unfeasible: (a) limitations on the storage capacity, (b) limitations on the compatibility between current software resources, (c) limitations on the processing power required, (d) use of techniques that are not integrated within the detection models, and (e) low flexibility of the existing tools for modifications and adequations of storage and processing algorithms [16,17].
Furthermore, the online detection of coming earthquakes can be done by picking the seismic phases manually or by establishing a fixed-threshold approach; these techniques are statistically earthquake-proven for significant earthquakes and signals with low sampling rates and few numbers of components, given a higher signal-to-noise ratio. When high samples rates are considered from multiple three-dimensional seismological stations, the phases may be picked differently, introducing bias into the detection [18].
In this sense, a system model for offline seismic event detection for the Colombian region is proposed, which allows the identification of patterns and dynamics in historical records, using machine learning techniques.
The following sections present a theoretical basis and the description of the model for seismic event detection.

Earthquake Detection Methodologies
Seismic detection algorithms are used by public and/or private services dedicated to monitor and study seismic activity. Several agencies dedicate efforts to maintain an updated database of information that can help scientists and engineers analyze any activity that could represent a hazard to the infrastructure and population, including volcanic and seismic activities [19]. Data collected include ground motion records (accelerations, velocities, and/or displacements), which are used by detection algorithms as input data.
Several approaches have been conceived to perform seismic events detection. In the seismic signal, amplitude, shape, power, or several other time-domain characteristics can be used to formulate a detection procedure [19,20], depending on the desired purpose of the outcome. In practical terms, seismic signals are identified by monitoring isolated ground vibrations, which under changes in amplitude, frequency content, or motion direction indicate the arrival of seismic waves [20]. Current developments on earthquake signals monitoring aim to provide faster and more reliable detection algorithms for warning systems [21].

Traditional Approaches
Detection algorithms for earthquake detection assume that seismic signals correspond to ground vibrations isolated from human activity. Only stationary background noise is registered prior to earthquake waves' arrival. To automate the process of identifying the arrival of earthquake waves, specialized detection algorithms are required. These algorithms deal with the task of effectively discriminating background noise from seismic events, to avoid the recording of unnecessary data or the loss of actual seismic signals. Then, detection algorithms require having a high rate of positive event identification, which is easily achieved when strong motions occur (e.g., triggers such as signal's threshold can discriminate noise from strong seismic signals). However, if a seismic event is detected far from the causative fault, a decrease in the signal's amplitude can be expected, making it harder for the algorithm to perform a positive event detection.
The simplest approach for an earthquake detection system is a front detection system, which consists of the direct monitoring of a given seismic source. Monitoring the signal's amplitude allows a central managing system to perform pre-defined tasks such as shutting down the power on certain areas or generating alerts of populated areas far enough from the strong ground motion epicenter. Mexico's earthquake monitoring system (SASMEX) implements front detection for this purpose. The front detection approach requires the analysis of most of the seismic signal to validate the trigger, dismissing valuable time that can be used to alert a wider area in case of a strong ground motion event. To deal with this, further approaches attempt to wider the alert time window by analyzing a shorter segment of the seismic signal, requiring more elaborated metrics that can be positively correlated to a significant earthquake event. Some of those metrics include the average noise level, predominant signal period, or cumulative energy [22]. However, these algorithms have a high rate of false alarms when dealing with weak-motion earthquake events [23]. Detection triggers are also specialized to work with frequency-domain data. In this case, signal energy metrics are used as thresholds (e.g., average power). Transform methods such as Fourier or Walsh and signal filtering have been used to provide faster and more reliable detection algorithms [5]. Time-frequency domain techniques such as the wavelet transform have been used to track the initiation of ground motion [24,25]. A technique found more reliable and widely used is based on the short-time average through the long-time average ratio (STA/LTA). The technique is based on the fact that when seismic events occur, the current signal average (STA) is different from the long-term signal average (LTA) where no events occurred [5,26]. Figure 1 shows the implementation of the STA/LTA algorithm over a strong-motion record. The seismic record (top figure) shows the arrival of P-waves in the interval 5-10 s. P-waves (P for primary) travel across the earth's mantle in tension-compression mode. Rocks have their highest stiffness (force to deformation ratio) for compression forces, and thus, compression waves can travel the fastest across the earth's mantle, arriving at the surface prior to secondary waves. After 10 s, the seismogram on the top figure shows the strongest acceleration recorded by the seismic station for the event. Secondary and surface waves arrive at the seismic station seconds later than P-waves. Cities with poor seismic resilient infrastructure usually take the highest toll on human and economic losses when they experiment strong ground accelerations. of false alarms when dealing with weak-motion earthquake events [23]. Detection triggers are also specialized to work with frequency-domain data. In this case, signal energy metrics are used as thresholds (e.g., average power). Transform methods such as Fourier or Walsh and signal filtering have been used to provide faster and more reliable detection algorithms [5]. Time-frequency domain techniques such as the wavelet transform have been used to track the initiation of ground motion [24,25]. A technique found more reliable and widely used is based on the short-time average through the long-time average ratio (STA/LTA). The technique is based on the fact that when seismic events occur, the current signal average (STA) is different from the long-term signal average (LTA) where no events occurred [5,26]. Figure 1 shows the implementation of the STA/LTA algorithm over a strong-motion record. The seismic record (top figure) shows the arrival of P-waves in the interval 5-10 s. P-waves (P for primary) travel across the earth's mantle in tension-compression mode. Rocks have their highest stiffness (force to deformation ratio) for compression forces, and thus, compression waves can travel the fastest across the earth's mantle, arriving at the surface prior to secondary waves. After 10 s, the seismogram on the top figure shows the strongest acceleration recorded by the seismic station for the event. Secondary and surface waves arrive at the seismic station seconds later than P-waves. Cities with poor seismic resilient infrastructure usually take the highest toll on human and economic losses when they experiment strong ground accelerations.  It can be seen that the short-time average is useful to indicate the arrival of the seismic event as it suddenly arises from a very low value (assumed to be noise). The long-time average is less sensitive to the arrival of the seismic signal but keeps track of the signal's duration. The STA/LTA ratio (Figure 1, bottom) points out the location of the seismic event's start point, which is one of the most important features required by any detection methodology.
The STA/LTA algorithm has shown to be very effective due to its simplicity [5,27,28], but it requires the optimization of user-defined parameters to obtain a high rate of positive-event detection.  It can be seen that the short-time average is useful to indicate the arrival of the seismic event as it suddenly arises from a very low value (assumed to be noise). The long-time average is less sensitive to the arrival of the seismic signal but keeps track of the signal's duration. The STA/LTA ratio (Figure 1, bottom) points out the location of the seismic event's start point, which is one of the most important features required by any detection methodology.
The STA/LTA algorithm has shown to be very effective due to its simplicity [5,27,28], but it requires the optimization of user-defined parameters to obtain a high rate of positive-event detection.
Parameters such as the sampling rate, detection threshold, or even, pre-event and post-event parameters are required to achieve a desired positive detection with the algorithm.

Current Approaches
Actual developments of detection algorithms take advantage of current technological advances that allow the capture, processing, and storage of data with high resolution. A large amount of available data today is used for advanced and still rarely used techniques: local similarity (quantifies consistency of data between neighboring stations) [29], probability (parameters such as distance to the seismogenic zone or signal phases are treated as random variables with an associated probability) [30], data mining (establishes a fingerprint of seismic waves for later comparison) [31], neural networks (neural network-based algorithms are trained to identify several characteristics of seismic waveforms) [19], and social sensing (based on trending hashtags or key words on social media, algorithms can trigger responses on alert systems) [32][33][34]. These approaches are all computational powerful and help identify waveforms on large historic seismic arrays that were not processed so far or that could have been processed by techniques with less accuracy. The cost of these techniques is the computational time.
In terms of computational efficiency, the STA/LTA concept [26,35] arises as a traditional and yet highly efficient parameter for earthquake detection.

Seismic Detection Model Proposal
This research paper proposes the model presented in Figure 2-a set of modules in which seismic signals can be processed-from the seismic data collection to the detection mechanism expressed in the classification process module. The applicability of the model is directly related to the selection of the geographical zone whose seismicity is to be studied-in this case, the northeastern region of Colombian. The size of the study area, the rate of occurrence of events, and the homogeneity of the subsoil are some of the variables that directly influence the number of observations to be analyzed and the performance of each of the modules, which is why a careful selection of the region of interest must be first carried out. Applicability of the model on a different geographical region would require historic seismic arrays specific to the location. Parameters such as the sampling rate, detection threshold, or even, pre-event and post-event parameters are required to achieve a desired positive detection with the algorithm.

Current Approaches
Actual developments of detection algorithms take advantage of current technological advances that allow the capture, processing, and storage of data with high resolution. A large amount of available data today is used for advanced and still rarely used techniques: local similarity (quantifies consistency of data between neighboring stations) [29], probability (parameters such as distance to the seismogenic zone or signal phases are treated as random variables with an associated probability) [30], data mining (establishes a fingerprint of seismic waves for later comparison) [31], neural networks (neural network-based algorithms are trained to identify several characteristics of seismic waveforms) [19], and social sensing (based on trending hashtags or key words on social media, algorithms can trigger responses on alert systems) [32][33][34]. These approaches are all computational powerful and help identify waveforms on large historic seismic arrays that were not processed so far or that could have been processed by techniques with less accuracy. The cost of these techniques is the computational time. In terms of computational efficiency, the STA/LTA concept [26,35] arises as a traditional and yet highly efficient parameter for earthquake detection.

Seismic Detection Model Proposal
This research paper proposes the model presented in Figure 2-a set of modules in which seismic signals can be processed-from the seismic data collection to the detection mechanism expressed in the classification process module. The applicability of the model is directly related to the selection of the geographical zone whose seismicity is to be studied-in this case, the northeastern region of Colombian. The size of the study area, the rate of occurrence of events, and the homogeneity of the subsoil are some of the variables that directly influence the number of observations to be analyzed and the performance of each of the modules, which is why a careful selection of the region of interest must be first carried out. Applicability of the model on a different geographical region would require historic seismic arrays specific to the location. The modules are described below as follows: Seismic data seeking and gathering, Reading and interpretation of the seismic data, Analysis of seismological stations, Sample selection, and Classification process.

Seismic Data Seeking and Gathering
To analyze seismic signals by algorithmic means, it is necessary to have a set of records that can be manipulated. In this first step, a download of the historical seismic records is made, filtering by the chosen region. A set of seismological records of each station is obtained.
Depending on the region of interest, it is possible to obtain seismological records through web services, such as the United States Geological Survey (USGS), the European-Mediterranean Seismological Centre (EMSC), the National Earthquake Information Center (NEIC), the National Institute of Seismology, Volcanology, Meteorology, and Hydrology of Guatemala (INSIVUMEH), the The modules are described below as follows: Seismic data seeking and gathering, Reading and interpretation of the seismic data, Analysis of seismological stations, Sample selection, and Classification process.

Seismic Data Seeking and Gathering
To analyze seismic signals by algorithmic means, it is necessary to have a set of records that can be manipulated. In this first step, a download of the historical seismic records is made, filtering by the chosen region. A set of seismological records of each station is obtained.
Depending on the region of interest, it is possible to obtain seismological records through web services, such as the United States Geological Survey (USGS), the European-Mediterranean Seismological Centre (EMSC), the National Earthquake Information Center (NEIC), the National Institute of Seismology, Volcanology, Meteorology, and Hydrology of Guatemala (INSIVUMEH), the Mexican National Seismological Service (SSN), the National Seismological Center of Chile (CSN), or the Colombian National Seismological Network (RSNC by its Spanish acronym), among others. These web services allow downloading data of seismic events one by one, although it is usually necessary to provide searching filters, generally concerning magnitudes, depths, and dates.
The seismic records are usually of public access, so it is possible to request the set of desired data directly to the seismology agency responsible for its storage. Another easier way to do this data-gathering process is to use automated interaction web tools to download the requested files, such as the creation of web snippets using the web-scraping technique, which allows interacting with the web resources of the web service. This technique is usually legally authorized by the RSNC and other services, since all the information downloaded is public access and no intromission for non-authorized domains or web resources is made. Although the scraping procedure is legally accepted, it is recommended to inform the geological services about this practice when executed.
The technological infrastructure needed to download and to store seismic files depends on the volume of data to be processed. In Colombia, the RSNC gathers the seismic records into two categories:

•
Trace files (Waveforms), which contain the seismic samples taken by all seismological stations available around the region of interest.

•
Parameter files (Sfiles), which provide detailed information about the seismic events, such as the longitude and latitude of the epicenter and the P-wave and the S-wave arrival times, among others.
The seismic traces recorded in the Waveform files are usually a large size because they record non-event samples that occur before and after the seismic event picking. Their content is dependent on the duration of the recorded earthquake. Each trace file can have a storage size from approximately 5 MB if it corresponds to a microseism that has been registered by one or few seismological stations, or a specific seismic event registered by a couple of stations, and up to approximately 120 MB, if it is registered by most seismological stations with a duration close to five to ten minutes. The RSNC registers up to 10,000 seismic events per year, with an average of 60 MB of storage per seismic record.
The seismic records may not be stored completely. The storage of all the records allows faster access for later processing stages; however, as has already been shown, the computational load applied to the data storage is high. On the other hand, storing portions of data that are processed and then erasing the unrequired portions, or processing the records one by one so that the results are stored and the records are erased are two recommended procedures to save storage space. Nevertheless, if the data are processed one by one, any subsequent processes to be done or corrections to previous processes will force to access the data sources again, which will hinder processing.

Reading and Interpretation of the Seismic Records
Once the seismic records are obtained, it is necessary to understand the format in which the data are presented, to establish the mechanisms by which they will be read. After reading the data, a sample selection is proposed, which depends on the characteristics found, and a set of memory instances that represent the trace files and read parameters is obtained as an output.
Among the most common international formats for the Sfiles [36] are HYPO71, HYPOINVERSE, and Nordic formats. The most used international formats for Waveform files are SEED, miniSEED, and SimpleASCII [37]. There are comprehensive seismological analysis tools capable of reading a wide range of formats of these two types of files, such as SEISAN and SeisComP. It is also possible to read the files through native programming languages or using libraries linked to these languages, such as SEISPP for C++ [38], Obspy for Python [39], or the open-source tools made available by the USGS for Java [40].
The implementation of specific architectures for reading multiple seismic records is of utmost importance. It is vital to analyze the computational capacity in terms of volatile memory, mainly by considering the techniques exposed and the data stored during the seeking and gathering of seismic records. Therefore, a sub-module can be incorporated for balancing the computational load, which includes techniques for transforming volatile information into non-volatile (hard disk storage), as well as considering parallel processing mechanisms to facilitate the processing of large sets of seismic records.
Additionally, it is pertinent to include file selection algorithms that can discard repeated files or easily identifiable irregularities in both parameter files and trace files to prevent their storage. This process is called data wrangling, in which a data-cleaning procedure is required. Among the irregularities found in the seismic records from the RSNC are inconsistencies in the format, absence of trace files that correspond to existing parameter files, lack of start and end times of the event in the records, as well as non-existent P-wave and S-wave arrival times in some of the records recorded by the seismological stations. A key process to clean the seismic data from the RSNC is shown in Figure 3. includes techniques for transforming volatile information into non-volatile (hard disk storage), as well as considering parallel processing mechanisms to facilitate the processing of large sets of seismic records. Additionally, it is pertinent to include file selection algorithms that can discard repeated files or easily identifiable irregularities in both parameter files and trace files to prevent their storage. This process is called data wrangling, in which a data-cleaning procedure is required. Among the irregularities found in the seismic records from the RSNC are inconsistencies in the format, absence of trace files that correspond to existing parameter files, lack of start and end times of the event in the records, as well as non-existent P-wave and S-wave arrival times in some of the records recorded by the seismological stations. A key process to clean the seismic data from the RSNC is shown in Figure  3. The data cleaning process is executed from three approaches: (1) the validation of the sought and gathered files, (2) the validation of the seismic samples in these files, and (3) the validation of the stations that register these samples.
In the first stage of this cleaning process in the file validation approach, it is checked that the sfile and waveform files can be read correctly for processing. First, if the files cannot be read, the inconvenience may be the source of the data. Secondly, it is verified that there are no repeated files, because the complexity of processing grows and there is no compensation to the investigation for processing the same data more than once. Third, it is verified that the general sfiles data files have a correct record of the corresponding waveform file. If in any of these stages there are inconsistencies, it is recommended that the file(s) are discarded, as they may create a bias in the general seismic analysis.
In the second stage of the cleaning process that focuses on the validation of the samples, it is convenient to check first if there is a record of the start and end time of the events that is homogeneous between each sfile and waveform files, secondly, that the P-waves have been recorded in both files.
Finally, in the third stage of the cleaning process, changes in the seismological station sensors, their relocation, or periods in which they have stopped operating must be considered. This permits the definition of a time interval in which the analysis will be executed, with the certainty that the dynamics of the waves will not be altered by external changes that do not concern the merely seismological field.

Analysis of Seismological Stations
Once the historical seismic archives are read, an analysis of the seismological stations that have recorded the events of interest is carried out, so that those that best represent the events and allow a reduction of the computational load in the processing of the data can be selected. The selected seismological stations are obtained as an output of this process.
Each seismic station provides a specific recording pattern that is dependent on several factors: The data cleaning process is executed from three approaches: (1) the validation of the sought and gathered files, (2) the validation of the seismic samples in these files, and (3) the validation of the stations that register these samples.
In the first stage of this cleaning process in the file validation approach, it is checked that the sfile and waveform files can be read correctly for processing. First, if the files cannot be read, the inconvenience may be the source of the data. Secondly, it is verified that there are no repeated files, because the complexity of processing grows and there is no compensation to the investigation for processing the same data more than once. Third, it is verified that the general sfiles data files have a correct record of the corresponding waveform file. If in any of these stages there are inconsistencies, it is recommended that the file(s) are discarded, as they may create a bias in the general seismic analysis.
In the second stage of the cleaning process that focuses on the validation of the samples, it is convenient to check first if there is a record of the start and end time of the events that is homogeneous between each sfile and waveform files, secondly, that the P-waves have been recorded in both files.
Finally, in the third stage of the cleaning process, changes in the seismological station sensors, their relocation, or periods in which they have stopped operating must be considered. This permits the definition of a time interval in which the analysis will be executed, with the certainty that the dynamics of the waves will not be altered by external changes that do not concern the merely seismological field.

Analysis of Seismological Stations
Once the historical seismic archives are read, an analysis of the seismological stations that have recorded the events of interest is carried out, so that those that best represent the events and allow a reduction of the computational load in the processing of the data can be selected. The selected seismological stations are obtained as an output of this process.
Each seismic station provides a specific recording pattern that is dependent on several factors:

•
The distance from the hypocenter and the epicenter (hypocenter and epicenter distances) to the geographic position of the station defines the amount of attenuation of the seismic wave.

•
The geomorphology to which the seismic waves are exposed on the way to the station defines the propagation pattern and the attenuation of the seismic waves.

•
The natural and artificial noise sources demean the seismic records due to the loss of quality regarding the content associated with seismic information, adding sources of information that concern other events that are not from a seismic nature.

•
The technical parameters of the stations such as measurement channels, signal-to-noise ratio, analog-to-digital conversion, sampling rates, sensitivity, and dynamic range define how the seismic event is perceived from an analog source to a digital environment.
Then, these factors influence the composition and patterns of the traces that are transmitted to the monitoring site and stored for further offline processing. Each Colombian station records the seismic events individually by considering the named factors. The more stations that detect the event, the more data that can be transmitted, processed, and stored.
When a microseism occurs, usually few stations record it, since it can be a noise event due to a local disturbance on the surface or a seismic event of very low magnitude and/or considerable depth. In this case, the amount of data that contains useful information is not extensive and can be analyzed quickly and stored without major physical space costs.
However, when there are long-term, shallow, or large-scale seismic events that are perceived in various regions, the computational capacity for analysis and storage is high. Therefore, selecting a set of stations to analyze the traces that are recorded allows the dimensionality reduction of the data and reduction of the computational load.
Defining the stations to be studied is a process that requires a strong seismological criterion; however, the use of algorithms for statistical analysis facilitates the discernment between station selection criteria. For example, using libraries for the geographical mapping of stations and seismic epicenters, clustering and sampling the data exposed in the parameter files, among other procedures, allow contrasting the information stored and decide about the stations that best represent the events analyzed. Figure 4 proposes a general procedure for the selection of the seismological stations used to identify the presence of seismic events in the Colombian monitored signals. The geomorphology to which the seismic waves are exposed on the way to the station defines the propagation pattern and the attenuation of the seismic waves.

•
The natural and artificial noise sources demean the seismic records due to the loss of quality regarding the content associated with seismic information, adding sources of information that concern other events that are not from a seismic nature.

•
The technical parameters of the stations such as measurement channels, signal-to-noise ratio, analog-to-digital conversion, sampling rates, sensitivity, and dynamic range define how the seismic event is perceived from an analog source to a digital environment.
Then, these factors influence the composition and patterns of the traces that are transmitted to the monitoring site and stored for further offline processing. Each Colombian station records the seismic events individually by considering the named factors. The more stations that detect the event, the more data that can be transmitted, processed, and stored.
When a microseism occurs, usually few stations record it, since it can be a noise event due to a local disturbance on the surface or a seismic event of very low magnitude and/or considerable depth. In this case, the amount of data that contains useful information is not extensive and can be analyzed quickly and stored without major physical space costs.
However, when there are long-term, shallow, or large-scale seismic events that are perceived in various regions, the computational capacity for analysis and storage is high. Therefore, selecting a set of stations to analyze the traces that are recorded allows the dimensionality reduction of the data and reduction of the computational load.
Defining the stations to be studied is a process that requires a strong seismological criterion; however, the use of algorithms for statistical analysis facilitates the discernment between station selection criteria. For example, using libraries for the geographical mapping of stations and seismic epicenters, clustering and sampling the data exposed in the parameter files, among other procedures, allow contrasting the information stored and decide about the stations that best represent the events analyzed. Figure 4 proposes a general procedure for the selection of the seismological stations used to identify the presence of seismic events in the Colombian monitored signals. At first sight, it is necessary to identify the geo-referenced position of the stations, so that there is a spatial perspective of their distribution. Depending on the area under monitoring, the stations of interest will be those closer to this area. This is because the dynamics of local earthquakes are more marked and detectable than the dynamics of regional earthquakes and teleseisms, which have attenuated and difficult to model features. It is recommended that more than two stations be selected, as events that have a non-seismic nature can be detected as such if only the activity is monitored in one or two stations. At first sight, it is necessary to identify the geo-referenced position of the stations, so that there is a spatial perspective of their distribution. Depending on the area under monitoring, the stations of interest will be those closer to this area. This is because the dynamics of local earthquakes are more marked and detectable than the dynamics of regional earthquakes and teleseisms, which have attenuated and difficult to model features. It is recommended that more than two stations be selected, as events that have a non-seismic nature can be detected as such if only the activity is monitored in one or two stations.
In the second stage, it becomes clear to calculate the identification rate of seismic events labeled by each station, according to the monitoring region. This supports the selection of the stations concerning their geo-referenced position, as it is an indicator of the proximity of the stations to the epicenters (epicentral distance) and the signal attenuation index when arriving at the stations.
The stations with the highest identification rate of seismic events must be checked against their epicentral distance. A good relationship for the choice, as the third stage in this process, is to select the stations that have identified the most earthquakes, with a short epicentral distance. It is advisable to include the processing capacity as a third attribute in the selection process, since an additional station can signify the processing of 210,000 additional samples, on average.

Sample Selection
The sample selection process can be carried out in parallel to the station analysis process, since these two processes are independent of each other. There are several factors to consider when selecting a Colombian seismic event sample, as it was shown in Section 4.2: • Inconsistencies in the file formats: There are different formats in which a seismic file can be structured, as SEED and miniSEED. During the processing and storage stages, the data are susceptible to be modified or lost, since there are multiple sources of information. Sometimes, these modifications alter the file formats, making them inconsistent. The files that present inconsistencies in the format and cannot be read correctly must be discarded.

•
Absence of trace files that correspond to SEED and SAC existing parameter files: As part of the data storage process, the seismic information extracted from the seismic events (Sfiles) and the seismic samples (Waveforms) are recorded in separate files, as described in previous sections. Some of them are stored as part of the dataset without being associated. In this way, cases in which seismic information is recorded and samples were lost and vice versa can be found. Those files where the description data do not correspond to the seismic traces must be discarded.

•
Lack of start and end times and/or inexistence of P-wave and S-wave arrival times in the events recorded: when a seismic event is recorded, some variables are measured, among which are the start time and end time of the event and the P-wave and S-wave arrival times. These values are very important to train classification algorithms, as some specific samples can be extracted from the seismograms, knowing when the earthquake began and when it finished. Unfortunately, some files can be well stored but lacking one or more of these four key parameters. In this case, it should be analyzed whether it is possible to determine the start or end date of the event by processing the seismic traces. If this is not possible, the files must be discarded.
It is also important to consider the structural changes of the stations, such as changes in the sampling frequency, sensors, digitizers, and number of spatial components, among others. These variations, although some of them are subtle, represent substantial alterations in the seismic patterns that might not be detected, since the detecting algorithms learn from specific patterns shown in the learning stages.
The inconsistencies in the files may be a consequence of the wrong acquisition, processing, and storage processes that are sometimes attributable to the algorithms that execute those processes for the seismology entities of each region or country.
In Colombia, between 20% and 30% on average of the selected files within the initial population of seismic events are propense to be discarded due to these inconsistencies, although some regions that are affected by very strong seismic events have accurate and well-stored files [41]. Other less frequent irregularities that may occur are (a) recording of the seismic data from stations that are inactive, (b) recording of the seismic data from components that the seismological station does not have (e.g., the registration of three components for stations with monoaxial sensors), (c) recording seismic traces with a different sampling frequency from that described in the sensor datasheet, (d) recording of the seismic traces with different sampling frequencies among the components from the same station, either by components or by events, (e) the annotation of the P-wave in the traces is outside the measurement period of the seismic events, and (f) recording of all relevant seismic attributes described of the seismic signals with a magnitude of zero.

Classification Process
The analysis of seismological stations and sample selection processes provide the input datasets to the classification process, regarding seismological stations and seismic record datasets, respectively. With these inputs, the classification process implements supervised learning strategies using the selected seismological stations to detect the seismic event. The outputs of the classification process are (a) the average performance metrics of the classification approach and (b) the classification model, which is trained and validated, as described in Figure 5. The performance metrics are related to the ability of the classifier to differentiate between a seismic event and a non-seismic event, i.e., a binary classification, and the classification model corresponds to the implementation of the classifier, and it is able to classify new signals and provide the event detection output.
Future Internet 2020, 12, x FOR PEER REVIEW 10 of 17 outside the measurement period of the seismic events, and (f) recording of all relevant seismic attributes described of the seismic signals with a magnitude of zero.

Classification Process
The analysis of seismological stations and sample selection processes provide the input datasets to the classification process, regarding seismological stations and seismic record datasets, respectively. With these inputs, the classification process implements supervised learning strategies using the selected seismological stations to detect the seismic event. The outputs of the classification process are (a) the average performance metrics of the classification approach and (b) the classification model, which is trained and validated, as described in Figure 5. The performance metrics are related to the ability of the classifier to differentiate between a seismic event and a nonseismic event, i.e., a binary classification, and the classification model corresponds to the implementation of the classifier, and it is able to classify new signals and provide the event detection output. The identification and classification of seismic events can be done using different techniques, as described in Section 3. For instance, the phase picking of seismic waves is widely used for real-time monitoring, detection, and localization with the P-wave picking is the main method for detection in early warning systems. Several algorithms for P-wave picking have been proposed in the timedomain [42], whereas the STA/LTA and its variations are the most implemented algorithm in observation and detection networks [43]. In Colombia, the SeisComP3 software is currently used by the RSNC for the acquisition of seismic data from stations located throughout the national territory. With the STA/LTA-based AutoPick module, the P-wave is detected by a SeisComP3 implementation [44,45].
Traditional approaches such as STA/LTA are suitable options for the classification algorithm. Nevertheless, these approaches have limitations regarding their adaptability to the behavior of seismic waves [5,46,47]. As [48] state, the automatic picking of seismic waves can remove the ambiguity derived from the lack of synchronization between channels and signals proceeding from the seismological stations. Furthermore, [48][49][50] have shown that STA/LTA and cross-correlation approaches have a high rate of Types I and II errors (namely false negative and false positive) due to excessive noise that cannot be removed from the source and very low-frequency components that might not be enhanced. These factors can be handled accurately (as far as possible) using machine learning techniques.
Considering that the dynamic behavior of seismic signals recorded by sensors is subject to many factors that influence the signal, as denoted in Section 4.2, machine learning algorithms represent an appropriate alternative for the development of classification models, since they enable the abstraction of attributes associated with the signals, based on the modeling of large training datasets. Machine learning algorithms rely on the quantity and quality of data and, as described in Sections 3.2 and 4.1, current seismological services can provide huge amounts of data that can be processed to obtain datasets to classify seismic records such as the ones provided as output in the Sample Selection The identification and classification of seismic events can be done using different techniques, as described in Section 3. For instance, the phase picking of seismic waves is widely used for real-time monitoring, detection, and localization with the P-wave picking is the main method for detection in early warning systems. Several algorithms for P-wave picking have been proposed in the time-domain [42], whereas the STA/LTA and its variations are the most implemented algorithm in observation and detection networks [43]. In Colombia, the SeisComP3 software is currently used by the RSNC for the acquisition of seismic data from stations located throughout the national territory. With the STA/LTA-based AutoPick module, the P-wave is detected by a SeisComP3 implementation [44,45].
Traditional approaches such as STA/LTA are suitable options for the classification algorithm. Nevertheless, these approaches have limitations regarding their adaptability to the behavior of seismic waves [5,46,47]. As [48] state, the automatic picking of seismic waves can remove the ambiguity derived from the lack of synchronization between channels and signals proceeding from the seismological stations. Furthermore, [48][49][50] have shown that STA/LTA and cross-correlation approaches have a high rate of Types I and II errors (namely false negative and false positive) due to excessive noise that cannot be removed from the source and very low-frequency components that might not be enhanced. These factors can be handled accurately (as far as possible) using machine learning techniques.
Considering that the dynamic behavior of seismic signals recorded by sensors is subject to many factors that influence the signal, as denoted in Section 4.2, machine learning algorithms represent an appropriate alternative for the development of classification models, since they enable the abstraction of attributes associated with the signals, based on the modeling of large training datasets. Machine learning algorithms rely on the quantity and quality of data and, as described in Sections 3.2 and 4.1, current seismological services can provide huge amounts of data that can be processed to obtain datasets to classify seismic records such as the ones provided as output in the Sample Selection process.
As [51] state, machine learning algorithms are particularly well-used in seismology due to their facility to model complex relationships of a wide range of variables. Since the majority of tasks in this context are normally targeting classification problems, machine learning drives a well-structured solution, since it can build a predictive environment in which a model is trained over sample data and tested over unseen data, guaranteeing the generalization of the solution against the data and the context. Sometimes, this procedure can be harmful if there is not enough data to proceed with the training procedure or when the representative descriptors (features or covariates) are almost the same as the number of samples. In these cases, additional machine learning approaches can be implemented, such as feature selection using forward or backward procedures [52].
With the selection of a machine learning algorithm as the classification algorithm, the classification process can be configured with a set of subprocesses depending on the specific machine learning algorithm requirements, which may be feature-based or time-series-based. The proposed set of subprocesses for the classification process is shown in the subprocesses diagram of Figure 6. The sequence of subprocesses can fulfill the classification capabilities for attribute-based or time series machine learning algorithms. context. Sometimes, this procedure can be harmful if there is not enough data to proceed with the training procedure or when the representative descriptors (features or covariates) are almost the same as the number of samples. In these cases, additional machine learning approaches can be implemented, such as feature selection using forward or backward procedures [52].
With the selection of a machine learning algorithm as the classification algorithm, the classification process can be configured with a set of subprocesses depending on the specific machine learning algorithm requirements, which may be feature-based or time-series-based. The proposed set of subprocesses for the classification process is shown in the subprocesses diagram of Figure 6. The sequence of subprocesses can fulfill the classification capabilities for attribute-based or time series machine learning algorithms. To set up the subprocesses, it is necessary to determine the number of classification models desired to represent the dynamics perceived by the stations. A single classification model might represent the events linking all the stations throughout a centralized fusion, or multiple classification models might correspond to each station separately throughout decentralized functions (such as the ones proposed by bagging techniques). In the case of decentralized fusion, each station has its classification model; hence, the Pre-processing and the Model definition stages need to be done separately for each selected station. The results provided by the per-station models can be composed into a combined result in the Testing subprocess using logical functions or more complex integration functions, producing a single classification model based on individual station analysis.
When the selected stations and the Seismic events dataset enter the classification process, the first subprocess that manipulates the datasets is the Seismic dataset slicing subprocess. This subprocess oversees linking the events to the selected stations and splits the Seismic events dataset into datasets per station if a decentralized schema is going to be used. In the Pre-processing stage, six subprocesses oversee formatting the dataset into a scheme ready to be used as input for a learning algorithm in the Model definition stage. Each subprocess in the Pre-processing stage is applied to every component that the signal may contain.
After the Seismic dataset slicing, the seismic signals associated with each event in the dataset are filtered in the Signals filtering subprocess. This subprocess considers the influence of noise sources that affect the seismic signals. The considered noise sources are the soil vibrations produced by natural causes and the instrumental noise associated with the measurement equipment hardware [41]. The filter choice is crucial for the classification model performance, since the quality of the signal depends largely on the quality of the filter; therefore, the ability of the learning algorithm to generalize P-wave dynamics relies on the filter. It is recommended to make a frequency analysis to obtain the frequency components of the signals and the noise. The suggested frequency band is between 1 and 12 Hz [53], and different filter techniques can be applied to this subprocess [54]. To set up the subprocesses, it is necessary to determine the number of classification models desired to represent the dynamics perceived by the stations. A single classification model might represent the events linking all the stations throughout a centralized fusion, or multiple classification models might correspond to each station separately throughout decentralized functions (such as the ones proposed by bagging techniques). In the case of decentralized fusion, each station has its classification model; hence, the Pre-processing and the Model definition stages need to be done separately for each selected station. The results provided by the per-station models can be composed into a combined result in the Testing subprocess using logical functions or more complex integration functions, producing a single classification model based on individual station analysis.
When the selected stations and the Seismic events dataset enter the classification process, the first subprocess that manipulates the datasets is the Seismic dataset slicing subprocess. This subprocess oversees linking the events to the selected stations and splits the Seismic events dataset into datasets per station if a decentralized schema is going to be used. In the Pre-processing stage, six subprocesses oversee formatting the dataset into a scheme ready to be used as input for a learning algorithm in the Model definition stage. Each subprocess in the Pre-processing stage is applied to every component that the signal may contain.
After the Seismic dataset slicing, the seismic signals associated with each event in the dataset are filtered in the Signals filtering subprocess. This subprocess considers the influence of noise sources that affect the seismic signals. The considered noise sources are the soil vibrations produced by natural causes and the instrumental noise associated with the measurement equipment hardware [41]. The filter choice is crucial for the classification model performance, since the quality of the signal depends largely on the quality of the filter; therefore, the ability of the learning algorithm to generalize P-wave dynamics relies on the filter. It is recommended to make a frequency analysis to obtain the frequency components of the signals and the noise. The suggested frequency band is between 1 and 12 Hz [53], and different filter techniques can be applied to this subprocess [54].
When the Signals filtering subprocess is performed successfully, the dataset is relocated to the Signals normalization subprocess, whose objective is to standardize the scale of the signals to the [−1, 1] interval according to the minimum and maximum value of each signal and remove the direct current component that is commonly added by the instrumentation. Then, the normalized signals are re-sampled in the Signals re-sampling subprocess. The definition of a common sampling rate for the signals is necessary due to the variety of the sampling rates that the signals may have, which are the outcome of diverse sampling properties that the acquisition devices and the digitalization algorithms present in the seismological stations.
The seismic signals associated with each record are commonly signals that contain information before the arrival of the P-wave and after the arrival of the S wave. To focus on the P-wave dynamics, in the P-wave picking subprocess, the identification of the time where the P-wave arrived is performed. This picking time annotation is commonly found in the Sfiles. Then, in the Signal synchronization subprocess, the picking time obtained in the P-wave picking is used to determine the exact sample where the P-wave arrived, relying on the defined sampling rate used in the Signals re-sampling subprocess, and a standard amount of samples is selected for a homogeneous duration time for each signal component. Therefore, all the signals representing an event start at the same time according to their P-wave time and end at the same time depending on the selected duration time.
The last Pre-processing subprocess is the Windows extraction subprocess whose purpose is to extract segments from the signals where the P-wave dynamics are contained and segments where there is no P-wave. The window length in samples is subject to the P-wave picking time and the window duration. A 2-s window is a recommended length [55]. It is necessary to obtain non-P-wave windows as well, since the classification algorithm learns to distinguish between the attributes of a P-wave and the attributes of a non-P-wave signal. Therefore, it is recommended to have the same number of windows concerning P-waves and non-P-waves to avoid class imbalance issues. With the Window extraction subprocess, the dataset of filtered, re-sampled, and synchronized signals associated with the events turns into a windows dataset with two classes, P-wave and non-P-wave, which is a common approach for the binary classification of seismic events.
With the window dataset, it is possible to generate a feature dataset that is commonly used by some feature-oriented machine learning techniques. A feature is a description of a record; the seismic events can be statistically described in terms of time, frequency, and non-linearity, among others. The features selection has a huge impact on the classification model's metrics, since they define how the classification algorithm perceives the seismic events. Each event is represented by a set of features in a matrix, and each feature acts as an input to the classification algorithm.
The Pre-processing stage differs from feature-based learning algorithms and time-series learning algorithms. The Signals filtering, Signals re-sampling, and Signals normalization subprocesses may be skipped depending on the time series technique, and some extra processes must be carried out, such as checks for stationarity, correlation, and autocorrelation. It is necessary to indicate the location of the P-wave in the signals (P-wave picking) to synchronize the signal's arrival and duration throughout the dataset (Signals synchronization) and to determine the characteristics of the moving window (Window extraction).
The input dataset for a feature-based technique is a set of single values that describes the original P-waves and non-P-waves in the seismic signals. Among the most commonly used machine learning feature-based algorithms applied to P-wave detection are Hidden Markov Models [56], Bayesian Networks [57], Support Vector Machines [58][59][60], Logistic Regression [53,61,62] and Artificial Neural Networks (ANN) [41,53,[63][64][65][66][67]. Conversely, the input dataset for a time-series technique is a set of signals split into P-wave signals and non-P-wave signals. Among the most used time-series forecasting techniques (TTF) for P-wave detection are Autoregressive Integrated Moving Average (ARIMA) [68], Seasonal ARIMA (SARIMA) [69], and ARIMA with Exogenous Regressors (ARIMAX) [70]. Some time-series forecasting recent methods used for P-wave detection include Pure Linear Neural Networks (PLNN) [71] and Polynomial Neural Networks (PNN) [72].
By following the suggested set of subprocesses and using a machine learning algorithm as the ones previously described, the classification process outputs a classification model and a set of metrics associated with that classification model. The produced model is suitable for performing the offline detection of seismic signals through the identification of P-waves. The mentioned metrics indicate the performance of the classification algorithm with the best set of hyperparameters scored on the test set records. The classification model, as denoted in the graph of the resulting classification model (Figure 7), can be interpreted as a system with testing preprocessed signals as the input to a function that contains the resultant algorithm responsible for the event detection. The output of this algorithm is a value that indicates whether the input signal contains a P-wave or not, in case of being a binary classification process.
Future Internet 2020, 12, x FOR PEER REVIEW 13 of 17 metrics associated with that classification model. The produced model is suitable for performing the offline detection of seismic signals through the identification of P-waves. The mentioned metrics indicate the performance of the classification algorithm with the best set of hyperparameters scored on the test set records. The classification model, as denoted in the graph of the resulting classification model (Figure 7), can be interpreted as a system with testing preprocessed signals as the input to a function that contains the resultant algorithm responsible for the event detection. The output of this algorithm is a value that indicates whether the input signal contains a P-wave or not, in case of being a binary classification process. The classification model is the result of the System Model for Offline detection. It is recommended to generate several classification models by testing different approaches. In this sense, depending on the performance of the classification model reflected on the metrics, a change in the selection of classification process parameters, such as the filter type and length of the window, among others, may be considered to improve the classification performance. In the same way, other parameters that belong to macro processes, such as the number of stations in the Analysis of seismological stations or filtering the seismic records by magnitude range in Seismic data seeking and gathering, may have a huge influence on the classification performance.
Some studies have been carried out using the proposed System Model to perform the identification of P-waves in Colombia. In [41], a dataset of seismological records of events with epicenter in the department of Santander, Colombia between 2010 and 2017 was selected in the Seismic data seeking and gathering process. In the Reading and interpretation of the seismic records process, the records were described in the Nordic format. Then, in the Analysis of seismological stations, four stations were selected according to the epicenter distance. In the Sample Selection process, 20% of the downloaded records gathered were discarded due to inconsistencies in seismic attributes. Finally, in the Classification Process, the selected classification algorithm was logistic regression. The classification model was composed of four logistic regressors, one per each station, and a decentralized voting function that applied a logical function to the output of each regressor, to produce a binary output (P-wave or not P-wave). The obtained classification model produced an accuracy of 98.26% for the detection of the P-wave.
Similarly, in [41], the use of the System Model was applied to a dataset of events with an epicenter in Santander, using four stations. In the binary classification process, the selected classification algorithm was a feed-forward back-propagation Artificial Neural Network (ANN) after being cleaned and the missing values handled appropriately. Unlike the voting function applied in [53], the classification algorithm relied on the behavior associated with all the stations, which were The classification model is the result of the System Model for Offline detection. It is recommended to generate several classification models by testing different approaches. In this sense, depending on the performance of the classification model reflected on the metrics, a change in the selection of classification process parameters, such as the filter type and length of the window, among others, may be considered to improve the classification performance. In the same way, other parameters that belong to macro processes, such as the number of stations in the Analysis of seismological stations or filtering the seismic records by magnitude range in Seismic data seeking and gathering, may have a huge influence on the classification performance.
Some studies have been carried out using the proposed System Model to perform the identification of P-waves in Colombia. In [41], a dataset of seismological records of events with epicenter in the department of Santander, Colombia between 2010 and 2017 was selected in the Seismic data seeking and gathering process. In the Reading and interpretation of the seismic records process, the records were described in the Nordic format. Then, in the Analysis of seismological stations, four stations were selected according to the epicenter distance. In the Sample Selection process, 20% of the downloaded records gathered were discarded due to inconsistencies in seismic attributes. Finally, in the Classification Process, the selected classification algorithm was logistic regression. The classification model was composed of four logistic regressors, one per each station, and a decentralized voting function that applied a logical function to the output of each regressor, to produce a binary output (P-wave or not P-wave). The obtained classification model produced an accuracy of 98.26% for the detection of the P-wave.
Similarly, in [41], the use of the System Model was applied to a dataset of events with an epicenter in Santander, using four stations. In the binary classification process, the selected classification algorithm was a feed-forward back-propagation Artificial Neural Network (ANN) after being cleaned and the missing values handled appropriately. Unlike the voting function applied in [53], the classification algorithm relied on the behavior associated with all the stations, which were analyzed as a group in a centralized function represented by the ANN. The degree of polarization, the ratio of vertical power to total power, skewness, and kurtosis of the three-component seismic data for each station were used as the selected features to feed the training process of the named ANN binary classifier. All the input features were extracted from observations whose classes were balanced and equally distributed in the datasets. With the described settings, the obtained classification model produces an accuracy of 99.24% for the detection of the P-wave.

Conclusions
The proposed five modules of the seismic detection model facilitated the comprehension of the integration of the phases of an offline detection system. A set of historical seismic records is first downloaded and read (depending on the format of the data). The data can be filtered to make it easy to process by the subsequent phases. A selection of seismological stations that recorded an event of interest may allow the reduction of the computational load. This selection can be made based on the distances, geomorphology, noise sources, and technical parameters of the stations. The identification of a seismic event is a binary classification task, i.e., the presence/absence of a P-wave on the seismic signal.
The proposed model allows specifying detection and classification tasks for seismic events, which is applicable to the Colombian region. However, it can be extrapolated to other regions, as the detailed procedures are general enough to be applied in the local seismological networks that have monitoring stations with data formats and equivalent measurements.