Machine Learning to Improve the Sensing of Biomolecules by Conical Track-Etched Nanopore

Single nanopores are a powerful platform to detect, discriminate and identify biomacromolecules. Among the different devices, conical nanopores obtained by the track-etched technique on a polymer film are stable and easy to functionalize. However, these advantages are hampered by their high aspect ratio, which prevents the discrimination of similar samples. Using machine learning, we demonstrate an improved resolution that allows the identification of short single- and double-stranded DNA (10- and 40-mers). We characterized each current blockade event by its relative intensity, dwell time, surface area and both its right and left slopes. We show an overlap of the relative current blockade amplitude and dwell time distributions that prevents their identification. We define the different parameters that characterize the events as features and the type of DNA sample as the target. By applying support-vector machines to discriminate each sample, we show an accuracy between 50% and 72% when two features are used to classify the data points. Finally, we achieved an increased accuracy (up to 82%) when five features were implemented.


Introduction
Over the past three decades, single-nanopore technology has emerged as a single-molecule sensing approach with many practical uses, such as long-read DNA sequencing [1,2]. This was achieved by engineering biological nanopores combined with biological machines to control the DNA translocation speed [3-7]. Besides sequencing, biological nanopores provide a convenient platform to analyze DNA substructures such as hairpins [8], hybridization [9,10], zipping [11] or interactions with proteins [12]. At the beginning of the 2000s, the idea of mimicking biological nanopores was demonstrated using different types of thin film. First, thin films of semiconductors (SiN) drilled by transmission electron microscopy or focused ion beam were used to provide nanopores with a low aspect ratio [13,14]. Next, polymer nanopores obtained by the track-etched technique provided long, high-aspect-ratio nanochannels [15,16]. More recently, 2D materials with a thickness reduced to a couple of angstroms, such as metal nitrides or oxides, were developed to improve the noise and/or wettability of these low-aspect-ratio nanopores [17,18].

Material
The A10, A40, T10 and T40 oligonucleotides were obtained as previously reported [40]. Briefly, they were synthesized from commercially available phosphoramidite building blocks (Link Technologies Ltd., Bellshill, Scotland) on a 1 µmol scale using an ABI 381A DNA synthesizer by standard phosphoramidite chemistry. They were then purified by RP-HPLC and characterized by MALDI-TOF MS.

Track-Etched Nanopore Design
A single conical nanopore was obtained by the track-etched method under asymmetric conditions as previously reported [56]. Briefly, single tracks were produced by Xe irradiation (8.98 MeV u⁻¹) (GANIL, SME line, Caen, France) of a polyethylene terephthalate (PET) film (thickness 13 µm, biaxial orientation ES30 10 61 Goodfellow). The tracks were activated by UV exposure for 12 h per side (Fisher Bioblock; VL215.MC, λ = 312 nm) before the chemical etching process. The conical nanopore was etched under asymmetric conditions (etchant solution: 9 M NaOH; stop solution: 1 M KCl and 1 M acetic acid) using the electrostopping method (1 V). After nanopore opening, the tip diameter (d_t) of the conical nanopore was calculated from the conductance G (measured from −100 mV to 100 mV) at a KCl concentration of 1 M, assuming bulk-like ionic conductivity inside the nanopore, using Equation (1):
$$G = \frac{\kappa \pi d_t d_b}{4L} \qquad (1)$$
where κ is the conductivity of the solution, L the nanopore length (13 µm) and d_b the diameter of the base side. d_b is calculated from the total etching time t using the relationship d_b = 2.5t. The factor 2.5 was determined in our laboratory using multipore track-etched membranes. The pore dimensions used here
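As a hedged illustration of how a tip diameter can be derived from a conductance measurement, the sketch below inverts the relation commonly used for conical track-etched pores, G = κπ·d_t·d_b/(4L). That relation, the function names and all numerical values are assumptions chosen for illustration and are not taken from this section.

```python
import math


def base_diameter(etching_time, factor=2.5):
    """Base diameter from the total etching time, d_b = factor * t.

    The factor 2.5 comes from the text. Its units are not stated in this
    section, so the result is in whatever units the calibration uses.
    """
    return factor * etching_time


def conductance(kappa, d_tip, d_base, length):
    """Conductance of a conical pore, assuming bulk-like conductivity.

    Assumed relation: G = kappa * pi * d_tip * d_base / (4 * length),
    with all quantities in SI units.
    """
    return kappa * math.pi * d_tip * d_base / (4.0 * length)


def tip_diameter(g, kappa, d_base, length):
    """Invert the assumed relation to obtain the tip diameter from G."""
    return 4.0 * length * g / (kappa * math.pi * d_base)


# Illustrative numbers only (not measured values from the paper).
KAPPA = 10.0      # S/m, hypothetical electrolyte conductivity
LENGTH = 13e-6    # m, film thickness given in the text
D_BASE = 500e-9   # m, hypothetical base diameter

g = conductance(KAPPA, 3e-9, D_BASE, LENGTH)
d_tip = tip_diameter(g, KAPPA, D_BASE, LENGTH)
print(f"G = {g:.3e} S, recovered tip diameter = {d_tip * 1e9:.2f} nm")
```

Because the tip diameter scales linearly with G, any error in κ or d_b propagates directly into the estimate of d_t.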

DNA Detection and Analysis
The DNA strands were detected using resistive pulse methods [57-59]. Briefly, the single conical nanopore was mounted between two Teflon chambers containing the same electrolyte solution (NaCl 3 M, EDTA 1 mM, PBS 50 mM, pH 7.2 or KCl 2 M, EDTA 1 mM, PBS 50 mM, pH 7.2). The current was measured by Ag/AgCl, 1 M KCl electrodes connected to the cell chambers by agar-agar bridges. The working electrode and the ground electrode were located in the trans chamber (base side of the nanopore) and in the cis chamber (tip side of the nanopore), respectively. Electrical measurements were performed using a patch-clamp amplifier (EPC10 HEKA electronics, Lambrecht, Germany).
The polynucleotide samples were added to the cis chamber (tip side of the nanopore) to reach a final concentration of 10 nM. A positive bias (250 mV or 500 mV) was then applied to the trans chamber. The ion current was recorded at a sampling frequency of 100 kHz (for T40 and A40/T40) or 200 kHz (for A10/T10), with a Bessel filter at 10 kHz. These experiments were repeated at least 10 times over 8 successive days for each nanopore. The data analysis was performed using custom-made LabVIEW software with a second-order Butterworth filter at 2.5 kHz. The baseline fluctuation was corrected using a first-order Savitzky-Golay filter with 2400 side points. Events were detected using a threshold of 3σ (where σ is the standard deviation of the signal). Each event was characterized by the relative current blockade (ΔI/I0), the dwell time (Δt), the area (AUC), and the right (RS) and left (LS) slopes. The parameters of the current blockades were analyzed using Matlab and its Statistics and Machine Learning Toolbox.
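The detection and characterization steps described above can be sketched as follows. This is a minimal illustration and not the authors' LabVIEW code: it assumes the trace is already filtered and baseline-corrected, and the function names, the slope definition (deepest point to the nearest sample outside the event) and the toy trace are all hypothetical.

```python
def detect_events(current, dt, baseline, sigma, n_sigma=3.0):
    """Find blockades where the current drops below baseline - n_sigma * sigma.

    current  : list of current samples (any consistent unit)
    dt       : sampling interval in seconds (1e-5 s at 100 kHz)
    baseline : open-pore current I0, assumed already drift-corrected
    sigma    : standard deviation of the baseline noise
    """
    threshold = baseline - n_sigma * sigma
    events, start = [], None
    for k, value in enumerate(current):
        below = value < threshold
        if below and start is None:
            start = k                      # first sample under the threshold
        elif not below and start is not None:
            if start > 0:                  # skip events cut by the trace start
                events.append(event_features(current, start, k, baseline, dt))
            start = None
    return events                          # an event still open at the end is dropped


def event_features(current, start, end, baseline, dt):
    """Five features of one event spanning samples start .. end - 1."""
    segment = current[start:end]
    k_min = start + segment.index(min(segment))    # deepest point of the blockade
    amplitude = baseline - current[k_min]
    return {
        "rel_blockade": amplitude / baseline,              # relative blockade
        "dwell_time": (end - start) * dt,                  # dwell time
        "auc": sum(baseline - v for v in segment) * dt,    # area of the event
        # Slope definitions below are an assumption, not the paper's.
        "left_slope": (current[k_min] - current[start - 1])
                      / ((k_min - start + 1) * dt),
        "right_slope": (current[end] - current[k_min]) / ((end - k_min) * dt),
    }


# Toy trace: flat baseline at 100 with one three-sample blockade.
trace = [100.0] * 5 + [90.0, 80.0, 90.0] + [100.0] * 5
events = detect_events(trace, dt=1e-5, baseline=100.0, sigma=1.0)
print(events)
```

Each detected event becomes a dictionary of five numbers, which is the form used when the events are later treated as feature vectors.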

Results and Discussion
The experimental detection of all DNA samples (A10/T10, A40/T40 and T40) was performed from the tip side to the base side under two different electrolyte conditions (NaCl 3 M, EDTA 1 mM, PBS 50 mM, pH 7.2 or KCl 2 M, EDTA 1 mM, PBS 50 mM, pH 7.2) (Figure 1a). Figure 1b-g shows examples of current traces recorded at 250 mV and 500 mV for all samples. From the current traces, the events related to DNA translocation through the nanopore were detected. These current blockades are usually described by the relative current blockade (ΔI/I0), which is the ratio between the amplitude of the current blockade and the baseline current, and by the dwell time. These two parameters were first extracted to characterize all the events recorded during our experiments.
Biosensors 2020, 10, x FOR PEER REVIEW 4 of 13
Figure 2 reports the distribution histograms of ΔI/I0 obtained for A10/T10, A40/T40 and T40 at a voltage of 500 mV using pore 1 in 2 M KCl. These distributions are centered at similar values for the three samples: 0.085, 0.093 and 0.066 (another center of distribution is observed at 0.128) for A10/T10, A40/T40 and T40, respectively. The ΔI/I0 distributions of the current blockades recorded at 250 mV are centered on 0.15, 0.09 and 0.12 for A10/T10, A40/T40 and T40, respectively (Figure S1). These values are slightly higher for pore 2 (in 3 M NaCl) at 250 mV, where the distributions are centered on 0.27, 0.25 and 0.18 for A10/T10, A40/T40 and T40, respectively (Figure S2). A similar observation can be made for the experiments performed at 500 mV, where the centers of distribution are 0.31, 0.28 and 0.36 for A10/T10, A40/T40 and T40, respectively (Figure S3).
The distributions of Δt for the three samples at 500 mV recorded for pore 1 (and pore 2) are reported in Figure 2b and Figure S2b. We observe that the distributions are centered close to the same value: 1.07 ms (1.48 ms), 0.95 ms (1.24 ms) and 1.29 ms (1.23 ms) for A10/T10, A40/T40 and T40, respectively. We notice that the time scale (about 1 ms) is in the same range as the one reported for 50 bp DNA [41]. Under 250 mV, the Δt values do not significantly decrease (Figures S2b and S4b). Indeed, the centers of distribution for pore 1 (and pore 2) are found to be 1.07 ms (1.39 ms), 0.82 ms (1.25 ms) and 0.87 ms (1.30 ms) for A10/T10, A40/T40 and T40, respectively.
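This section does not state how the centers of the distributions were determined. As one simple possibility, the sketch below takes the center of the most populated bin of a fixed-width histogram. The function name, the bin width and the sample values are assumptions made for illustration only.

```python
import math
from collections import Counter


def histogram_center(values, bin_width):
    """Return the center of the most populated bin of a fixed-width histogram."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    best_bin, _ = max(counts.items(), key=lambda item: item[1])
    return (best_bin + 0.5) * bin_width


# Made-up relative blockade values, not data from the paper.
rel_blockades = [0.081, 0.085, 0.089, 0.086, 0.123, 0.052]
center = histogram_center(rel_blockades, bin_width=0.01)
print(f"distribution centered near {center:.3f}")
```

The result depends on the bin width, so a fit of the histogram (for example with a Gaussian) would usually give a more robust center than this estimate.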
Usually, ΔI/I0 and Δt are the main parameters used to discriminate the samples analyzed by nanopore sensors. Here, we observe a large overlap between the different distributions (Figure 2 and Figures S1-S3), which prevents sample discrimination. The results also indicate that there is no preferential voltage or pore to discriminate the samples with only one parameter. Figure 3 reports two event maps representing ΔI/I0 vs. Δt of the translocation events for the three samples at two different voltages, 250 mV (Figure 3a) and 500 mV (Figure 3b), for pore 1. We observe that the clouds of events overlap due to the similar distributions of ΔI/I0 and Δt. This overlap makes it impossible to discriminate the DNA samples by a simple clustering analysis. The same trend is observed for pore 2 at the same two voltages (250 and 500 mV, see Figure S4). These observations are not surprising given the low resolution of track-etched nanopores.
To go further, we attempted to define each current blockade with additional parameters (Figure 4a). First, we considered the surface area of the event (AUC) because it takes into account possible current fluctuations during the DNA translocation. We could expect this parameter to be strongly correlated with Δt and ΔI/I0. The conical shape of the nanopore can generate asymmetric current blockade events. In that case, the event's right and left slopes (noted RS and LS, respectively) are expected to differ, as previously reported for spherical objects [60,61]. We then evaluated the degree of correlation of these five parameters (Δt, ΔI/I0, AUC, RS and LS).
Usually, a positive correlation between Δt and ΔI/I0 can be observed if the length of the pore is close to that of the analyte. Indeed, this correlation has been reported for protein detection using SiN nanopores [62]. Conversely, in the case of long DNA strands, the amplitudes of the relative current blockade are not correlated with the dwell time since the nanopore is filled with the polymer strand [63]. Here, we report the correlation heat maps of the various parameters for the three samples at a voltage of 250 mV (top row) and 500 mV (bottom row) for pore 1 (Figure 4b) and pore 2 (Figure S5). For all samples and regardless of the pore or the applied voltage, we observe a strong correlation between Δt and the surface area (~0.90 on average). Conversely, the correlation between Δt and ΔI/I0 is low (<0.75). This low degree of correlation is also observed between the surface area and ΔI/I0. This could be explained by the current fluctuation during the blockade due to the DNA motion inside the pore. Interestingly, the right and left slopes do not appear to be correlated with each other (~−0.20 on average) nor with the other parameters.
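A correlation heat map of this kind is built from pairwise correlation coefficients between the event parameters. The sketch below computes a Pearson correlation matrix in plain Python. The choice of the Pearson coefficient, the function names and the toy values are assumptions, since the text does not specify how the correlations were computed.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(features):
    """features maps a parameter name to its list of values (one per event)."""
    names = list(features)
    return {a: {b: pearson(features[a], features[b]) for b in names}
            for a in names}


# Made-up events: the area is nearly proportional to the dwell time.
events = {
    "dwell_time": [1.0, 2.0, 3.0, 4.0],
    "auc": [2.1, 3.9, 6.2, 7.8],
    "rel_blockade": [0.09, 0.12, 0.08, 0.11],
}
matrix = correlation_matrix(events)
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Values close to 1 or −1 indicate that two parameters carry largely redundant information, while values near 0 suggest that each parameter contributes something distinct.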
We then attempted to improve sample discrimination using machine learning algorithms. The simplest model involves establishing a linear correlation between two parameters. First, we examined whether ΔI/I0 and Δt are correlated. Figure 5 reports the linear regression analysis performed with ΔI/I0 as the response variable and Δt as the predictor at a voltage of 500 mV for pore 1. We observe a low correlation between these two parameters, with an R² of about 0.25 on average. The same analysis for the events recorded at 250 mV and with pore 2 (Figure S6) also provides a low R² value (about 0.21). This is in good agreement with the heat map and confirms the absence of a linear relationship between ΔI/I0 and Δt.
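For readers unfamiliar with the R² criterion used here, the sketch below fits an ordinary least-squares line and computes its coefficient of determination. It is a generic illustration with made-up numbers. The function name is hypothetical and the code does not reproduce the authors' Matlab analysis.

```python
def linear_fit_r2(x, y):
    """Least-squares fit y = slope * x + intercept and its R^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up predictor (dwell time) and response (relative blockade) values.
dwell = [1.0, 2.0, 3.0, 4.0]
blockade = [1.0, 3.0, 2.0, 4.0]
slope, intercept, r2 = linear_fit_r2(dwell, blockade)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r2:.2f}")
```

An R² near 0.2 to 0.25, as reported in the text, means that a straight line explains only about a quarter of the variance of the response.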
The support vector machine is a class of machine learning algorithms used to solve classification problems. Training involves finding a way to separate the different samples by using the different parameters that characterize the events. In our case, we defined the sample (A10/T10, A40/T40 or T40) as the target of the algorithm and the different parameters that characterize the events (ΔI/I0, Δt, AUC, RS and LS) as the features. As previously mentioned, ΔI/I0 and Δt are the most commonly used. Thus, we trained the algorithm with these two features and then with five features in order to demonstrate that the added parameters help to improve the discrimination and classification of the different samples.
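The effect of adding features can be illustrated with a small, self-contained example. The sketch below is not the Matlab toolbox used by the authors, whose kernel and hyperparameters are not given in this section. It assumes a linear soft-margin SVM trained by sub-gradient descent on the hinge loss, restricted to two classes for brevity, and uses synthetic data in which the first two features overlap completely while a third one separates the classes. All names and values are hypothetical.

```python
import random


def train_linear_svm(X, y, lam=0.01, eta=0.01, epochs=50, seed=0):
    """Linear soft-margin SVM via sub-gradient descent; labels are -1 or +1."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [wj * (1.0 - eta * lam) for wj in w]    # L2 shrinkage
            if margin < 1.0:                            # margin violated
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def accuracy(X, y, w, b):
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)


# Synthetic events: features 0 and 1 are pure noise, feature 2 is informative.
rng = random.Random(1)
X, y = [], []
for label in (-1, 1):
    for _ in range(100):
        X.append([rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(2.0 * label, 0.3)])
        y.append(label)

# Hold out 30% of the events to estimate the accuracy.
indices = list(range(len(X)))
rng.shuffle(indices)
train_idx, test_idx = indices[:140], indices[140:]


def evaluate(columns):
    """Train on the selected feature columns and return the test accuracy."""
    def select(idx):
        return [[X[i][c] for c in columns] for i in idx], [y[i] for i in idx]
    X_train, y_train = select(train_idx)
    X_test, y_test = select(test_idx)
    w, b = train_linear_svm(X_train, y_train)
    return accuracy(X_test, y_test, w, b)


acc_two = evaluate([0, 1])        # overlapping features only
acc_three = evaluate([0, 1, 2])   # with the additional informative feature
print(f"two features: {acc_two:.2f}, three features: {acc_three:.2f}")
```

A three-class problem like the one in the text would require a multi-class scheme such as one-vs-rest or one-vs-one on top of this binary classifier.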
Figure 6 reports the confusion matrix for pore 1. Using two features (ΔI/I0 and Δt), the accuracy is 72.6% and 53% for the events recorded at 250 mV and 500 mV, respectively. First, we observe that with machine learning the accuracy is better at 250 mV, making this voltage more relevant for discriminating the samples. Regardless of the voltage, the best predictions were found for A10/T10. As expected, using five features improves the accuracy to 82.5% and 66.76% at 250 mV and 500 mV, respectively. This improvement in accuracy is also observed for pore 2 (Figure S7). The small difference in accuracy between the two pores is likely due to their different geometries. The best results were obtained with the smaller nanopore. Using five features, the true-positive rate is higher for A10/T10 and T40 than for A40/T40. This is observed for all experiments except for pore 2 at 250 mV. However, this exception could be explained by a low number of events (n = 80).
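To make the quantities read from a confusion matrix explicit, the sketch below builds one from true and predicted labels and derives the overall accuracy and the per-class true-positive rate. The labels are the sample names from the text, but the predictions are invented for illustration and the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of events on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def true_positive_rates(matrix):
    """Per-class fraction of events assigned to their true class."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


classes = ["A10/T10", "A40/T40", "T40"]
# Invented labels for eight events, not results from the paper.
true = ["A10/T10"] * 3 + ["A40/T40"] * 3 + ["T40"] * 2
pred = ["A10/T10", "A10/T10", "A40/T40",
        "A40/T40", "T40", "A40/T40",
        "T40", "T40"]

matrix = confusion_matrix(true, pred, classes)
acc = overall_accuracy(matrix)
tpr = true_positive_rates(matrix)
print(matrix, acc, tpr)
```

With few events per class, as in the n = 80 case mentioned above, a handful of misclassified events shifts these rates noticeably.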

Conclusions
In summary, three DNA samples (A10/T10, A40/T40 and T40) were detected using conical nanopores with a tip diameter of about 3 nm. The classical parameters used to characterize the events (ΔI/I0 and Δt) do not allow one to discriminate the samples due to a large overlap of their distributions. In addition, the linear regression analysis shows only a weak linear correlation between these two parameters. Using support vector machines with these two features, the different samples were discriminated with an accuracy between 50% and 72.6%. The events were then characterized by five parameters that are not correlated with each other, except for Δt and the surface area. The introduction of three additional parameters as features (AUC, LS, RS) in the support vector machine algorithm improved the accuracy by about 10%, up to 82.5% for the smallest nanopore at 250 mV. Among the three samples, the best prediction was found for A10/T10.
More generally, our work was motivated by the need to improve the resolution of conical track-etched nanopores. The combination of additional parameters and support vector machine algorithms was found to be a relevant solution to reach this goal. We expect that this analysis methodology could be applied to single-molecule sensing using track-etched nanopores, especially in fields where these nanopores could bring fundamental insights, such as amyloid detection and characterization.
Supplementary Materials: The following are available online at http://www.mdpi.com/2079-6374/10/10/140/s1. Figure S1: Distribution histograms of ΔI/I0 and Δt recorded with pore 1 at 250 mV, Figure S2: Distribution histograms of ΔI/I0 and Δt recorded with pore 2 at 500 mV, Figure S3: Distribution histograms of ΔI/I0 and Δt recorded with pore 2 at 250 mV, Figure S4: Scatter plot representing Δt versus ΔI/I0 recorded with pore 2, Figure S5: Heat map representing the correlation between the variables characterizing events obtained under 250 mV, Figure S6: Linear regression performed with ΔI/I0 (response variable) and Δt (predictor), Figure S7
Funding: This research was funded by Agence Nationale de la Recherche (ANR-19-CE42-0006, NanoOligo). Single tracks have been produced at GANIL (Caen, France) in the framework of an EMIR project.