3.3.1. Application of the New Intensity Profile on the Complete Gel Image
The IPBBIS method was used to identify GPN protein overexpression in the gel shown in Figure 8A. In this figure, lane 1 contains the ladder (molecular weight control), lane 2 is the negative control, lane 3 is the positive control for GPN protein expression, lane 4 is the concentrated cell extract control, and lanes 5 to 15 contain extracts with the same amount of endogenous proteins to which different concentrations of recombinant protein were added. The thickness of the bands indicates a higher concentration of GPN in lanes 5, 6, 11, 12, and 14, while lanes 9 and 10 show a lower concentration of the protein. These differences are reflected in the plot of the new image intensity profile (Figure 8B,C), where the highest peaks are observed in lanes 5, 6, 11, 12, and 14, and lower peaks correspond to the bands with lower concentrations of recombinant protein, i.e., lower overexpression (lanes 9 and 10). The image was analyzed both with equalization (Figure 8B) and without it (Figure 8C).
The averages of the multiple maxima obtained for the different samples in the plot of Figure 8C are shown in Table 1. They confirm that the maximum intensity value was obtained in lane 5, indicating the highest overexpression of GPN protein in that sample. The lowest expression was recorded in lanes 9 and 10, with maximum values of 23 and 29, respectively. The maxima assigned to lanes 1 to 4 are not shown because these lanes correspond to the ladder, the controls for GPN, and GPN with total protein extract.
3.3.3. Effectiveness of the IPBBIS Method Using Known Concentrations
An experiment was carried out to evaluate the effectiveness of the method. Samples with known GPN protein concentrations (Table 2) were prepared, and diverse cell extracts were added to introduce other proteins and increase the background noise. The different concentrations were loaded in randomly chosen positions, producing the gel image in Figure 10A.
Samples with different concentrations of recombinant GPN protein are distributed on the polyacrylamide gel in Figure 10A.
The results of applying the IPBBIS method to the image in Figure 10A (GPN protein at different concentrations) indicated that samples with a higher protein concentration had peaks with higher intensity values (Figure 10B). However, lane 3, which contains no GPN protein (Figure 10A), showed a higher peak than lanes 5, 7, 9, and 11, which contain GPN protein at different concentrations; taken at face value, this would indicate a higher overexpression of the protein in lane 3. This does not agree with the prepared concentrations: 0.0 mg/mL in lane 3, 9.5 mg/mL in lane 5, 18 mg/mL in lane 7, 18.5 mg/mL in lane 9, and 14 mg/mL in lane 11 (see Table 2).
This inconsistency arises because lane 3 had more proteins in the total extract than samples 5, 7, 9, and 11. This behavior confirmed that the IPBBIS method can only calculate the overexpression of the protein of interest when the background noise is reduced, i.e., the level of overexpression can be detected only when the samples contain the same amount of proteins in the total extract. Proteins that are not of interest, those above and below the recombinant protein (GPN) in the lane, can be considered contaminants and should therefore be removed.
3.3.4. Elimination of Impurities through the Determination of the Molecular Weight of the Target Protein
A procedure is proposed to eliminate contaminants. To this end, it is necessary to find the molecular weight of the protein of interest (GPN) and separate it from the rest of the proteins present in the same sample to apply the IPBBIS method and find its level of overexpression.
Because the molecular weight control does not present background noise, the ladder in Figure 10A (lane 1) was selected, separated from the rest of the gel, and rotated to a horizontal orientation (see Figure 11A). The IPBBIS method was then applied, producing the graph shown in Figure 11B. The multiple maxima represent the position of each protein in the ladder (Figure 11C).
A relationship was established between the molecular weights of the ladder (provided by the manufacturer, Bio-Rad S.A. Mexico D.F.) and the multiple maxima obtained by the IPBBIS method. Owing to the absence of background noise, the positions of the control proteins were detected automatically and marked with blue lines for identification, as shown in the gel image of Figure 11C. At the same time, the positions of the control proteins and their molecular weights were stored in an array. These data were processed with linear, nearest-neighbor, and cubic interpolation to obtain an equation, which was used to predict the molecular weight of proteins at any position in the remaining samples of the gel image (see Figure 12 and Table 3).
Figure 12 shows the interpolation methods used to predict the molecular weight of the detected proteins. The X-axis corresponds to the pixel positions of the bands, while the Y-axis represents the molecular weights.
The results in Table 3 indicate that linear interpolation showed the lowest error, 3.35%, when comparing the actual molecular weight of the GPN protein (34.56 kilodaltons, kDa) with the value predicted by each interpolation method.
Therefore, the formula used to identify the molecular weight of the proteins detected in the rest of the samples corresponds to the linear interpolation method and is defined by Equation (1):
where 0 ≤ xi ≤ 400 corresponds to the interval of pixel positions along the image height of each sample, and 0 ≤ yi ≤ 250 corresponds to the molecular weight of the proteins.
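As an illustration of how a ladder can be used to predict molecular weight from band position, the sketch below applies generic piecewise-linear interpolation between adjacent ladder bands. It is the textbook form of the technique and not necessarily the exact form of Equation (1). The function names `interpolate_mw` and `percent_error`, as well as the ladder positions and weights, are hypothetical values chosen only for the example.

```python
from bisect import bisect_right


def interpolate_mw(x, positions, weights):
    """Piecewise-linear estimate of molecular weight (kDa) at pixel position x.

    positions: ladder band positions in pixels, strictly increasing.
    weights:   molecular weight (kDa) of each ladder band, same length.
    """
    if not positions[0] <= x <= positions[-1]:
        raise ValueError("x lies outside the range covered by the ladder")
    # Index of the ladder band at or just below x, clamped to the last segment.
    i = min(bisect_right(positions, x) - 1, len(positions) - 2)
    x0, x1 = positions[i], positions[i + 1]
    y0, y1 = weights[i], weights[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def percent_error(predicted, actual):
    """Relative error of a prediction, in percent."""
    return abs(predicted - actual) / actual * 100.0


# Hypothetical ladder: pixel positions and the weights assigned to them.
ladder_px = [40, 90, 150, 220, 300, 380]
ladder_kda = [250, 100, 50, 37, 25, 10]

# Hypothetical band position for the protein of interest.
mw = interpolate_mw(235, ladder_px, ladder_kda)
error = percent_error(mw, 34.56)  # 34.56 kDa is the GPN weight given in the text
```

The same `percent_error` helper could be used to compare the linear, nearest-neighbor, and cubic predictions against the known weight of the protein.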
3.3.6. Analysis of the Region of Interest Using the IPBBIS Methods, Manual Area Calculation, Area Calculation by K-Means Segmentation, and Area Calculation by Otsu Segmentation
A random sample was selected on the gel to repeat the molecular weight detection process (lane 5, Figure 14A). The developed IPBBIS method was applied to detect the bands present per sample, as indicated in Figure 14B. Given the known molecular weight of the GPN protein, approximately 34 kDa, the interpolation method was applied to locate it automatically within the gel, as indicated in Figure 14C.
Once the positions separating the samples in the gel were detected (Figure 14B) and the molecular weight of the GPN protein and its place within the gel were identified (Figure 14C), the region of interest (ROI) containing the recombinant protein expressed at different concentrations across all samples (red box in Figure 14A, covering lanes 2 to 11) was selected and isolated, as shown in Figure 14D. Analysis was then performed to compare protein overexpression and to determine whether image profiling based on binarized image segmentation can detect proteins with higher or lower overexpression. Before this analysis, the image in Figure 14D was converted from the RGB to the HSV color space. The result is shown in Figure 14E, where the height of the peaks is related to the level of overexpression of the protein analyzed. Each sample containing protein in the ROI of Figure 14D (lanes 2, 4, 5, 6, 7, 8, 9, 10, and 11) was isolated to obtain higher precision in the analysis, and IPBBIS was applied separately to obtain one graph per sample, as shown in Figure 15. The average of the multiple maxima was calculated to produce a single maximum value per sample (Figure 15A–I), which was related to the amount of protein in each of the samples analyzed; these values are collected in the ROI-GPN row of Table 4 for future reference. In this way, image profiling based on binarized image segmentation was applied to each of the nine samples containing GPN protein at different concentrations in the ROI of Figure 14D. As shown in Table 4, additional data were obtained by analyzing the samples with different concentrations of recombinant GPN protein distributed on the polyacrylamide gel in Figure 14D.
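The step of reducing the multiple maxima of a per-sample profile to a single value can be sketched as follows. This is a minimal illustration, assuming that a maximum is any interior point higher than its left neighbor and not lower than its right neighbor; the text does not specify the peak criterion, and the profile values below are hypothetical.

```python
def local_maxima(profile):
    """Return the values of the interior local maxima of a 1-D profile.

    A point counts as a maximum when it is strictly higher than its left
    neighbor and not lower than its right neighbor, so a flat-topped peak
    is counted once.
    """
    peaks = []
    for i in range(1, len(profile) - 1):
        if profile[i - 1] < profile[i] >= profile[i + 1]:
            peaks.append(profile[i])
    return peaks


def mean_of_maxima(profile):
    """Average the local maxima to obtain one value per sample."""
    peaks = local_maxima(profile)
    return sum(peaks) / len(peaks) if peaks else 0.0


# Hypothetical white-pixel profile for one isolated sample.
profile = [0, 3, 12, 20, 17, 19, 22, 9, 2, 0]
peaks = local_maxima(profile)
sample_value = mean_of_maxima(profile)
```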
The array data obtained (ROI-GPN row in Table 4) revealed that the lane 2 sample (maximum value of 10.38, concentration of 2 mg/mL) shows the lowest overexpression, followed in increasing order of overexpression by lane 5 (18.67, 9.5 mg/mL), lane 11 (20.11, 14 mg/mL), lane 4 (20.55, 14.5 mg/mL), lane 7 (22.87, 18 mg/mL), lane 9 (23.08, 18.5 mg/mL), lane 10 (27.26, 26 mg/mL), lane 6 (27.34, 27 mg/mL), and lane 8 (28.82, 30 mg/mL). These results show that the values of the new image profile calculated by the IPBBIS method for each sample are related to the protein concentration of that sample and can be used to estimate the level of GPN overexpression by comparing the samples present in the gel image.
3.3.7. IPBBIS Study on the Image Dataset Using the Confusion Matrix
After verifying that the IPBBIS method can automatically identify the level of overexpression in each sample, the polyacrylamide gel image dataset was divided into homogeneous and heterogeneous gels to test the efficiency of the method on each group.
The homogeneous gels (a dataset of 44 gels with a total of 669 samples) showed similar characteristics, such as the same color and quality, and no imperfections such as breaks or distortions due to incorrect preparation. In the confusion matrix used in this research, true positives (TP) are cases where IPBBIS correctly detected the lane or protein band; false negatives (FN) are cases where the lane existed but was not found; false positives (FP) are cases where the lane did not exist but was detected; and true negatives (TN) are cases where the lane did not exist and was not detected. The confusion matrix obtained after analyzing the 669 samples is presented in Table 5. The precision obtained was 0.985052 (see Table 6, the precision of homogeneous gels).
Then, the analysis was repeated using the IPBBIS method on the heterogeneous gels, a total of 90 gels with different conditions, including distorted (smiley face effect), broken, or incorrectly stained gels. In total, 1561 samples were analyzed for GPN protein overexpression. The performance of this analysis was measured with the same confusion matrix, using the TP, FN, TN, and FP definitions given above for the homogeneous gels. The confusion matrix is shown in Table 7, and the precision obtained was 0.91736 (see Table 6, the precision of heterogeneous gels). This precision was lower than that obtained with homogeneous gels because the heterogeneous SDS-PAGE gels present defects such as breaks, distortions due to incorrect preparation, or insufficient Coomassie blue staining.
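To make the TP, FP, FN, and TN definitions concrete, the sketch below computes the standard metrics derived from a confusion matrix. The text does not state which formula produced the reported precision values, so the usual textbook definitions are assumed here, and the counts passed to the hypothetical helper `confusion_metrics` are invented for illustration; they are not the entries of Table 5 or Table 7.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts (textbook definitions)."""
    total = tp + fp + fn + tn
    return {
        # Fraction of detected lanes/bands that really exist.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Fraction of existing lanes/bands that were detected.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Fraction of all decisions that were correct.
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Hypothetical counts, for illustration only.
metrics = confusion_metrics(tp=90, fp=5, fn=3, tn=2)
```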
3.3.8. Functionality of the Methods Analyzed to Find GPN Protein Overexpression: IPBBIS, Manual Area Calculation, Area Calculation by K-Means Segmentation, and Area Calculation by Otsu Segmentation
To calculate protein overexpression, the area of each GPN protein band expressed at different concentrations in Figure 14D was measured manually, by outlining the contour of the band, and with the K-means and Otsu segmentation techniques (see Figure 16). These measurements were then compared with those obtained by the IPBBIS method.
For the manual, K-means segmentation, and Otsu segmentation measurements, each band had to be cut out and separated from the gel image, as none of these methods can analyze the whole gel. The measurements of each band at the different concentrations are collected in Table 4, where each row indicates the methodology used.
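For readers unfamiliar with area calculation by Otsu segmentation, the following sketch shows the standard Otsu criterion (choosing the threshold that maximizes the between-class variance) applied to a cropped band, followed by a pixel count. It is a generic illustration and not the implementation used in this work. It assumes 8-bit grayscale values and that stained bands are darker than the background, and the crop values are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    Pixels with value <= t form the dark class; pixels > t the light class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class
    sum0 = 0  # sum of intensities in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def band_area(pixels, threshold):
    """Area of the band, counted as pixels at or below the threshold."""
    return sum(1 for p in pixels if p <= threshold)


# Hypothetical flattened grayscale crop: a dark band on a light background.
crop = [200, 198, 205, 42, 38, 40, 41, 199, 202, 39]
t = otsu_threshold(crop)
area = band_area(crop, t)
```

A K-means variant would replace `otsu_threshold` with a two-cluster assignment of the pixel intensities; the area count would stay the same.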
The data in Table 4 were normalized to verify how well each methodology assesses protein overexpression in the different samples of the gel (Figure 14D). The data in Table 4, corresponding to the intensity values in Figure 14D, were sorted by the amount of expressed protein from lowest to highest overexpression (Table 8), normalized (Table 9), and used to check whether each method measured the level of GPN protein overexpression correctly, by the following operation:
Let xn be the n-th value of the measured area, and xm be the value of the measured area in a band with a higher protein concentration.
If xm > xn, then xm − xn > 0. A positive value indicates that the overexpression level was measured correctly, since the measured values are expected to increase with concentration. If instead xm < xn, then xm − xn < 0, which indicates that the overexpression level was miscalculated: the method presents a measurement error, because the protein concentration is increasing and not decreasing.
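The sign test can be sketched as a check on differences between neighboring measurements. This sketch assumes that xm is taken as the measurement for the next sample in order of increasing concentration, which is one reading of the definition. The `roi_gpn` values are the ROI-GPN values quoted in the text, sorted by prepared concentration; the `other_method` values are hypothetical and only show what a failure looks like.

```python
def consecutive_differences(values):
    """Return x_m - x_n for each pair of neighboring measurements."""
    return [b - a for a, b in zip(values, values[1:])]


def is_order_preserved(values):
    """True when every difference is positive (no miscalculated level)."""
    return all(d > 0 for d in consecutive_differences(values))


# ROI-GPN values from the text, ordered by concentration:
# 2, 9.5, 14, 14.5, 18, 18.5, 26, 27, 30 mg/mL (lanes 2, 5, 11, 4, 7, 9, 10, 6, 8).
roi_gpn = [10.38, 18.67, 20.11, 20.55, 22.87, 23.08, 27.26, 27.34, 28.82]

# Hypothetical measurements from another method, for illustration only.
other_method = [5.0, 7.0, 6.5, 9.0]

roi_ok = is_order_preserved(roi_gpn)
other_ok = is_order_preserved(other_method)
```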
As seen in Table 10, the results of the above analysis, applied to each of the measurements, indicate that the proposed image profiling method based on binarized image segmentation (ROI-GPN) does not show any negative values. In contrast, the manual method shows two negative values (lanes 4 and 8 in Table 10), the K-means segmentation shows two negative values (lanes 4 and 8 in Table 10), and the Otsu segmentation shows two negative values (lanes 8 and 9 in Table 10).
These results demonstrate that the developed IPBBIS profiling makes it possible to detect overexpression and to correctly identify the level of overexpression related to GPN protein concentration. In contrast, the manual method, K-means segmentation, and Otsu segmentation presented errors in the measurements.
In addition, the IPBBIS method can analyze the ROI without cutting out each of the samples present in the gel. In contrast, the manual, K-means segmentation, and Otsu segmentation techniques require the samples to be separated, as they cannot analyze the whole gel.
The ROI-GPN values are array data containing the number of white pixels counted through the binary mask, whereas the other methods yield areas in pixels, so the data were normalized to allow a comparison. The four methods were applied to identify the overexpression manually (since there is no automatic one), and for each method every value was divided by the maximum value measured by that method. Thus, the maximum value for each method is unity. When the samples are sorted by prepared concentration (lowest to highest), the normalized values should increase and never decrease; a decrease was observed for all of the methods except the IPBBIS method developed in this work. The increase only occurs if the method correctly calculates the size of the bands (by area or by intensity).
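The normalization can be sketched as a division of each measurement by the largest measurement of the same method, so that methods reporting different units share a scale from 0 to 1. The `roi_gpn` values are those quoted in the text for the ROI-GPN row; the `area_px` values and the helper name `normalize_by_max` are hypothetical.

```python
def normalize_by_max(values):
    """Divide every value by the largest one, so the maximum becomes 1.0."""
    peak = max(values)
    if peak == 0:
        raise ValueError("cannot normalize when the maximum is zero")
    return [v / peak for v in values]


# ROI-GPN values from the text, ordered by prepared concentration.
roi_gpn = [10.38, 18.67, 20.11, 20.55, 22.87, 23.08, 27.26, 27.34, 28.82]

# Hypothetical band areas in pixels from an area-based method.
area_px = [410, 650, 700, 690, 820]

roi_norm = normalize_by_max(roi_gpn)
area_norm = normalize_by_max(area_px)
```

After normalization, both series can be compared directly even though one counts white pixels in a profile and the other measures band area.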
These results indicate that traditional methods can identify the position of proteins. However, they cannot identify a particular protein band or determine its concentration or molecular weight, and the rest of the intensity profile plot cannot be related to any of the proteins present in the same sample. The IPBBIS method identified the least and the most overexpressed GPN protein samples and even recovered the order of overexpression.
These results indicate that the IPBBIS method can be used to identify GPN protein overexpression related to IDC and ILC Her2+ breast cancer and can also be applied to identify overexpression of other proteins of biological interest and to detect the progression of cancer stages in different samples from the same patient.
In summary, the IPBBIS method applies a binary mask pixel by pixel, selecting the white intensity values and storing them in an array. The array contains multiple maxima and multiple minima. When a full gel is analyzed, the minima are related to the separation between samples and therefore to the number of samples. A decrease in the number of white pixels indicates the separation between proteins. When proteins are analyzed per sample in SDS-PAGE gels, the values of the multiple maxima of the new image profile quantify the level of overexpression of the proteins present in each sample.
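The summary above can be illustrated with a small sketch that builds a white-pixel profile from a binary mask and splits it where the profile drops to its minima. This is a simplified reading of the method and not the authors' implementation. It assumes that dark (stained) pixels are mapped to white in the mask, that a fixed threshold is used, and that a gap is any column with no white pixels; the image values and function names are hypothetical.

```python
def binarize(gray, threshold):
    """Build a binary mask: dark (stained) pixels become 1 (white), others 0."""
    return [[1 if px <= threshold else 0 for px in row] for row in gray]


def white_pixel_profile(mask):
    """Count the white pixels in each column of the mask."""
    return [sum(col) for col in zip(*mask)]


def split_at_minima(profile, floor=0):
    """Return (start, end) column ranges where the profile stays above floor."""
    segments, start = [], None
    for i, v in enumerate(profile):
        if v > floor and start is None:
            start = i
        elif v <= floor and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1))
    return segments


# Hypothetical 3 x 7 grayscale image with two stained regions.
gray = [
    [250, 40, 45, 250, 250, 60, 250],
    [250, 42, 250, 250, 55, 58, 250],
    [250, 250, 250, 250, 50, 250, 250],
]
mask = binarize(gray, threshold=128)
profile = white_pixel_profile(mask)
lanes = split_at_minima(profile)
```

Applied along the other axis of a single isolated sample, the same profile would separate individual bands instead of lanes.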
Current methods for finding proteins in SDS-PAGE gels rely on image profiling, image processing techniques, thresholding, and brightness adjustments, but require the analyst to select the region of interest. The IPBBIS method automatically identifies the number of samples in the gel and the number of proteins in a sample. It also detects the level of overexpression based on molecular weight.