Particles of cosmic origin that reach the Earth are known as Cosmic Rays (CR). Many unknowns remain in their study, but there are two major aspects that, once resolved, could provide useful information for astrophysicists. The first is their origin: how they were produced, how they were accelerated, and how they propagate through the galactic and extra-galactic medium. The second aspect that makes their study so important is that the energy density of cosmic radiation is of the same order of magnitude as that found in magnetic fields and stars, so they could give a hint about the total energy balance of our universe. There is a third aspect, known as the Greisen-Zatsepin-Kuzmin (GZK) limit [1], which is an abrupt drop in the cosmic ray flux at energies above approximately 5 × 10^19 eV; from this, it seems that the universe is "opaque" to events with energies above this limit.
For energies above the solar modulation spectrum (10 GeV/nucleon) [2], cosmic rays are called "high energy", and, for energies above 10^18 eV, they are called "ultra high energy" (UHE) cosmic rays. Three quantities can be used to describe a cosmic ray event caused by an incident particle (called the primary) in the Earth's atmosphere: the energy, the angle of arrival (θ), and the type (mass and charge). All cosmic-radiation observatories can measure the energy and θ for each event but, up to now, there is no way to measure the mass directly. Using data extrapolated from particle accelerators, the particle astrophysics community has developed models and simulators that only allow us to propose probabilities of mass compositions, in well-defined distributions for bins in energy and θ [3]. The classification of the particle composition is a crucial question that should be answered in order to better understand the three aspects described above.
Beyond that, knowing the composition of each event will make it possible to search for a flux of protons at the highest energies [4]. It could therefore improve previous particle-physics studies at 10 EeV and extend them to energies as high as 200 TeV (center of mass). Extending composition sensitivity to all possible energy ranges and a larger range of zenith angles will provide almost an order of magnitude increase in statistics to resolve the question of the origin of the flux suppression (GZK limit).
The observation of ultra high energy cosmic rays (UHECR) is based on the detection of the cascades of secondary particles produced in the atmosphere by the initial collision of the primary with a molecule of the air (usually nitrogen) at the top of the atmosphere (around 35 km altitude). A large number of secondary particles arrive at ground level, distributed within a radius of up to 3000 meters. This phenomenon is known as an extensive air shower (EAS), and it was first measured by Pierre Auger in 1939 [5]. The detectors are not able to discriminate the individual secondary particles; they measure a signal that relates the cascade's energy deposit and its evolution over time. The analysis of this signal is the source of the information about the cascade and the primary particle.
This paper tackles the problem of using Machine Learning (ML) to identify the type of particle that generated the cascade, given the importance of this information, which can be extracted from an EAS. A recent work performed a first approach to the composition identification problem by estimating the muon number in simulated traces [6]. Monte Carlo models predict that heavier primaries (such as nitrogen or iron) produce more muons than lighter primaries (such as protons or helium). However, even knowing this quantity accurately, the mapping of this feature to a particle type remains unsolved.
A preliminary work [7] first dealt with this question by tackling the problem with two simple deep learning models, approaching it both as a classical classification and as a continuous regression-like output. However, due to the limitations of the dataset, the limits on the classification accuracy attainable by ML models were not verified, and a wider and more thorough study was needed. Thus, the aim of this work is to study in depth the limits on the possibility of identifying the type of particle that generated an EAS from a set of ideal measurements at ground level. Moreover, the importance of the different factors involved in an EAS for primary identification will be assessed by a specific, effective feature-selection technique.
For that, this work uses simulated ground truth for several features (such as the muon and electromagnetic numbers) using a data set generated with the CORSIKA (COsmic Ray SImulations for KAscade) simulator [8]. Five different types of particles have been considered: photons, protons, helium, nitrogen, and iron. Four different machine learning classifiers have been trained and analyzed under a Python implementation: XGBoost, K-NN, Deep Neural Networks, and Support Vector Machines. This comparison allows these alternatives to be assessed both from the performance and from the computational cost point of view, identifying the best alternative for the given problem. Moreover, a modification of the Markov Blanket Mutual Information Feature Selection (MBFS) algorithm [10], adapted for classification, has been applied in order to identify the relevance of the features involved. The importance of this type of comparative analysis of ML techniques is corroborated by the extensive recent literature on other problems from a wide range of fields [13].
The rest of the paper is organized as follows: Section 2 presents the data used in the experiments. Section 3 introduces the classifiers and the feature selection algorithm proposed for this work. Section 4 presents the experiments and shows the results obtained for the problem. Section 5 discusses the results. Finally, conclusions are drawn in Section 6.
2. Data Description
The data used in this research were generated by the CORSIKA Monte Carlo code, a particle interaction simulator designed to extrapolate hadronic interactions (hadrons are particles with internal structure, such as protons, helium, carbon, etc.) to center-of-mass energies above 100 TeV. To get an idea of the importance of this simulator, consider that the LHC-CERN collider has a maximum energy of 6.5 TeV per beam (by the end of 2018) [17], and this is the limit (until now) of experiments in particle physics. There are no actual data describing interactions above 100 TeV, which is the typical collision energy of cosmic particles with our atmosphere. This is where the need for a simulator with extrapolated hadronic interaction models comes from.
The simulations are done by tracking the particles through the atmosphere until they undergo reactions with the air nuclei and produce a cascade of secondary particles. These cascades can be described in a simplified way as the composition of three components: a hadronic cascade (heavier particles, such as pions, neutrons, and protons), a muonic cascade (muons are produced by pion decay, and their mass is about 200 times that of the electron), and an electromagnetic cascade (photons, electrons, and positrons). The output of the program is a dataset with the information of all the particles of the cascade. Each particle is assigned seven attributes: position (x, y, z), momentum (px, py, pz), and type.
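As an illustration, the per-event aggregation of such particle records can be sketched as follows. The record layout (x, y, z, px, py, pz, type) follows the description above, and the numeric type codes (1 = γ, 2/3 = e±, 5/6 = μ±) follow CORSIKA's conventions, but the function itself is a simplified assumption, not CORSIKA's actual output reader:

```python
# Sketch: aggregate CORSIKA-style particle records into per-event counts.
# Each record is (x, y, z, px, py, pz, ptype); ptype uses CORSIKA codes:
# 1 = gamma, 2 = e+, 3 = e-, 5 = mu+, 6 = mu- (simplified subset).

EM_CODES = {1, 2, 3}   # electromagnetic component
MUON_CODES = {5, 6}    # muonic component

def summarize_event(particles):
    """Count ground-level particles by component for one shower."""
    n_total = len(particles)
    n_mu = sum(1 for p in particles if p[6] in MUON_CODES)
    n_em = sum(1 for p in particles if p[6] in EM_CODES)
    return {"N_total": n_total, "N_mu": n_mu, "N_em": n_em}

# Toy event: two photons, one electron, one muon.
event = [
    (0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1),
    (1.0, 2.0, 0.0, 0.1, 0.0, 0.9, 1),
    (3.0, 1.0, 0.0, 0.0, 0.2, 0.8, 3),
    (5.0, 4.0, 0.0, 0.0, 0.0, 2.0, 5),
]
print(summarize_event(event))  # {'N_total': 4, 'N_mu': 1, 'N_em': 3}
```

Counts of this kind are exactly the per-event quantities used as classifier inputs later in the paper.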
The Monte Carlo code divides the development of the cascade among three types of interaction models to describe the cascade particles: high energy (above 100 TeV), low energy (below 100 TeV), and electromagnetic interactions. The code chooses one of these three models based on the energy and type of each particle over the course of the development.
The code also provides several options for each type of interaction model. For high energies, the models are based on the calculation of the cross section of secondary-particle scattering, the hadron mini-jets. Each model considers a different treatment of the partons (the fundamental particles that constitute a hadron) and a distinct phase space. All of the models use the Gribov-Regge quantum field theory, which describes the interaction between hadrons. The models QGSJetII-04 (Quark Gluon String model with JETs) [18], SIBYLL [19], and EPOS(LHC) (Energy conserving quantum mechanical multiple scattering approach, based on Partons, Off-shell remnants and Splitting parton ladders) [20] are the options that can be used to describe high energy collisions. At lower energies, the models GHEISHA (Gamma Hadron Electron Interaction SHower code) [21], FLUKA (FLUktuierende KAskade) [22], or the microscopic URQMD (Ultra-relativistic Quantum Molecular Dynamics) [23] can be used. For electromagnetic (EM) interactions, a version of the EGS4 (Electron Gamma Shower) code [24] or the analytical NKG (Nishimura-Kamata-Greisen) formulas [25] may be used. For this work, we use the QGSJetII-04 model at higher energies, combined with FLUKA2011.2c at lower energies, and EGS4 for EM interactions.
We simulated a set of events for each primary particle mass (photon (no mass), proton, helium, nitrogen, and iron) and, within each set, we randomized, for each event, the values of the energy (up to the maximum simulated energy), the angle of entry into the atmosphere (θ: 0 to 60 degrees), and the mean free path before the first collision. The errors are related to the systematics of this randomization, which was performed using a Monte Carlo procedure.
Some of the factors that can be extracted from the output dataset of each simulation, such as X_max (the atmospheric depth [g/cm²] at which the cascade reaches its maximum number of particles) and the first-interaction altitude [m] (the altitude at which the primary starts to interact with the atmosphere), are difficult to measure. There are few real measurements of X_max and, to date, it is not possible to measure it at ground level, especially for events at high energies. Therefore, these two features were discarded in order to provide a more realistic, yet still optimistic, definition of the type of data that can be measured at ground level. The features considered for this work were therefore:
N_total: total number of particles generated by the event at ground level.
N_mu: total number of muons at ground level.
N_em: total number of electromagnetic particles at ground level.
θ: zenith angle of the primary particle [degrees].
E: primary particle energy [GeV].
To be precise about the information provided by the CORSIKA simulator: in practice, there is no way to know with accuracy the total number of particles reaching the Earth's surface. However, an estimation can be provided that could be accurate enough to use those values for the classification [7]. The same can be said for the energy and for the muonic and electromagnetic signals. The dataset and code used for the work presented in this paper can be downloaded from https://github.com/aguillenATC/Entropy-CompositionClassificationUHECR.
This section presents the results obtained for the classification of the type of primary from the available dataset.
All of the features were normalized to have zero mean and unit standard deviation. As mentioned above, the dataset was first randomly shuffled and subdivided into 80% of the data for training and validation purposes (hyperparameter optimization under a five-fold cross-validation scheme) and the remaining 20% for test. Then, the whole dataset was repeatedly validated in a different five-fold cross-validation scheme for performance assessment (see Section 3.1.5), providing mean and standard deviation over training and test performances.
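A minimal sketch of this evaluation protocol with scikit-learn, using synthetic data and a K-NN classifier as placeholders (the real features, hyperparameter grids, and models are those described in Section 3):

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the 5-feature, 5-class shower dataset.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=5, random_state=0)

# Shuffled 80/20 split; hyperparameters tuned by 5-fold CV on the 80% part.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          shuffle=True, random_state=0)
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": [1, 5, 9]}, cv=5)
grid.fit(X_tr, y_tr)
test_acc = grid.score(X_te, y_te)

# Separate 5-fold CV over the whole dataset for mean/std performance.
scores = cross_val_score(grid.best_estimator_, X, y, cv=5)
print(f"test acc {test_acc:.2f}, CV {scores.mean():.2f} +/- {scores.std():.2f}")
```

Note that the scaler sits inside the pipeline, so normalization statistics are re-estimated on each training fold rather than leaking from the test data.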
All of the methods were implemented in Python, using the Keras, XGBoost, and Sklearn libraries, and executed on an Intel Core i7 PC with 32 GB RAM and an NVIDIA GeForce GTX 1080 GPU.
Two main feature settings were evaluated according to the demands of the theoretical physics experts. Later, results using the obtained feature ranking were assessed in order to provide this information to the experts. The two sets were:
5 features: N_total, N_mu, N_em, θ, E
3 features: N_mu, θ, E
The results obtained by the four classifiers are shown in Table 1. Both the five-feature and the three-feature sets were assessed. The training times reported for each classification method include the tuning of the hyper-parameters of the model. The hyperparameters selected for each classification model, using a first training-test subdivision of the dataset, are shown in Table 2. The confusion matrices obtained by XGBoost for both settings on that initial subdivision (highest accuracy, as seen in Table 1) are shown in Figure 1 and Figure 2.
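The following sketch reproduces this kind of per-class analysis on synthetic data; scikit-learn's GradientBoostingClassifier stands in for XGBoost here, and the class ordering is illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

LABELS = ["Photon", "Proton", "Helium", "Nitrogen", "Iron"]  # assumed order

# Synthetic 5-class stand-in (real inputs: N_total, N_mu, N_em, theta, E).
X, y = make_classification(n_samples=1000, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# GradientBoostingClassifier stands in for xgboost.XGBClassifier.
clf = GradientBoostingClassifier(max_depth=3, n_estimators=50, random_state=1)
clf.fit(X_tr, y_tr)

# Row-normalized confusion matrix: entry [i, j] is the fraction of true
# class i predicted as class j (per-class recall sits on the diagonal).
cm = confusion_matrix(y_te, clf.predict(X_te), normalize="true")
for name, row in zip(LABELS, cm):
    print(name, np.round(row, 2))
```

Row normalization is what makes statements such as "76% of correct labelling for helium" directly readable off the matrix diagonal.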
4.2. Feature Ranking
The MBFS algorithm, using the Kraskov mutual information estimation algorithm, was executed on the training set, returning the following feature ranking (from most relevant to least relevant): N_em, N_mu, E, θ, N_total. Figure 3 shows the evolution of the test performance as from one to five features are considered, using the XGBoost algorithm. As can be seen, with only two features, the classification results surpass 0.9 of test performance. Nevertheless, with only one feature, accuracy is surprisingly low (0.24). It is also important to highlight that the MI of the N_em feature with respect to the class label turned out to be very similar to the MI between the N_mu feature and the class label. Moreover, tests performed using N_mu as the single feature for classification showed results (0.24) comparable to those using N_em as the single feature. Furthermore, Figure 3 shows that, with the first three features, the performance is similar to that obtained using all of them. This implies that the information in θ and N_total becomes irrelevant once N_em, N_mu, and E are considered.
It is important to highlight that the mRMR and NMIFS algorithms failed to reproduce the optimal ranking returned by the MBFS algorithm. NMIFS identified the two most relevant features (N_em + N_mu), but failed to identify the third one (E, identified in fifth position, thus only attaining 97% of the accuracy when using five features). mRMR, on the other hand, misidentified the essential relationship between N_em and N_mu, identifying the angle θ as the second most relevant feature and reaching 0.9 accuracy only after using three features.
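As a rough illustration of this kind of MI-based relevance scoring, the sketch below ranks synthetic features with scikit-learn's `mutual_info_classif` (a Kraskov-style nearest-neighbor estimator); it is a plain univariate ranking, not the full MBFS algorithm, and the feature labels are illustrative only:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 5, size=n)               # 5 primary classes
informative = y + 0.3 * rng.normal(size=n)   # strongly class-dependent feature
noise1 = rng.normal(size=n)                  # irrelevant features
noise2 = rng.normal(size=n)
X = np.column_stack([noise1, informative, noise2])
names = ["theta", "N_em", "N_total"]         # illustrative labels only

# Estimate MI between each feature and the class label (kNN estimator).
mi = mutual_info_classif(X, y, random_state=0)
ranking = [names[i] for i in np.argsort(mi)[::-1]]
print(ranking)  # the class-dependent feature should rank first
```

MBFS-style methods go beyond this univariate score by also penalizing redundancy between already-selected features, which is precisely what separates their rankings from mRMR and NMIFS in the experiments above.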
After showing the results obtained in the comparison, this section discusses them, considering both classification accuracy and computational cost.
Regardless of the type of particle, XGBoost presents an outstanding performance, being very precise in classifying all of them (as Figure 1 and Figure 2 show). This methodology seems superior to the other alternatives, not only because it obtains the best results but also because the model is trained in a shorter time (just behind KNN). The next suggested methodology is SVM, very close to XGBoost in classification metrics but with executions 20 times slower. DNNs are next in classification accuracy, but they are the slowest technique by far (100 times slower than XGBoost). Finally, the KNN algorithm presents disappointing classification scores, but it is clearly the fastest methodology (100 times faster than XGBoost). The performance ranking obtained by the four methodologies for five features is similar to that obtained for three features. For the latter, XGBoost and SVM achieved comparatively better results than the other two methodologies, undergoing a smaller performance decrease when two of the input features were removed.
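The kind of cost comparison made here can be sketched with a simple wall-clock benchmark on synthetic data (GradientBoostingClassifier stands in for XGBoost; absolute times depend entirely on hardware and hyperparameter grids, so only the relative ordering is indicative):

```python
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost

X, y = make_classification(n_samples=2000, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=5, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "GB (XGBoost stand-in)": GradientBoostingClassifier(n_estimators=50),
}
timings = {}
for name, model in models.items():
    t0 = time.perf_counter()
    model.fit(X, y)            # training only; hyperparameter search excluded
    timings[name] = time.perf_counter() - t0
    print(f"{name}: {timings[name]:.3f} s")
```

KNN is fastest to fit because training only stores (or indexes) the samples, which matches the ordering reported above; its cost is paid at prediction time instead.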
Comparing these results with the preliminary work in [7], in which DNNs were applied, the best accuracy reached there was 0.94 for five variables and 0.82 for three variables, using a DNN model optimized after several tests and a single training-test subdivision. Thus, the results are similar, taking into account the mean and standard deviation shown in Table 1. The optimal model there attained the best results using four layers, while the CV scheme in this work leads to a reduction in the complexity of the network by selecting a two-layered one. In any case, as shown, SVM and XGBoost attained faster training and better performance than DNNs for the problem tackled.
For the sake of fairness, the comparison was designed so that, for all the models, the number of hyperparameter combinations was similar, as explained in Section 3. Table 2 shows the optimal hyperparameter configurations for the four techniques and for the two feature subsets considered. SVM and KNN obtained the same hyperparameter values for both feature subsets. XGBoost maintained the maximum depth, but its regularization parameter presented a larger value for five features. This seems expected, as the weights of the features can shrink more when their number is smaller. Analogously, the DNN preserves the two-layered configuration (we should keep in mind that, even with one layer, such networks can be universal approximators), but with a lower number of units per layer for three input features. In the latter case, the architecture of the network becomes much simpler.
The reason why the DNNs consumed much more time than XGBoost is that their training considered up to 80 different architectures, with a maximum of 500 epochs each. This is one of the main drawbacks of these models in comparison with the other approaches: the cost of finding the right set of hyperparameters may be too high. Considering also that SVMs are restricted by the number of samples, from the computational cost perspective the best choice is XGBoost.
In relation to KNN, it is important to highlight that this work utilized its simplest version in the optimization. KNN may suffer from the presence of noisy features or from differences in the relevance of the features involved. However, even though the features were equally normalized, its performance was lower than that of the other methods. Although the KNN model optimization was simpler than for the other methods (only k was optimized, for a single distance metric) and could have included, for instance, the optimization of feature weights, in this work a specific feature selection process was performed as a separate step, whose results are shown in Section 4.2.
Once the best technique has been identified, it is possible to take a closer look at the results of XGBoost. Figure 1 and Figure 2 show the confusion matrices obtained by the algorithm when using five and three features, respectively. When using all the features (N_total, N_mu, N_em, θ, E), the capability of separating photons from the rest of the particles is perfect. The accuracy remains outstanding even within the subgroup of hadrons. As the matrix shows, very light and very heavy particles, such as protons and iron respectively, are classified perfectly, but some misclassifications appear for the particles in between (helium and nitrogen in this work). This last observation becomes even more dramatic when the number of features is reduced to three (N_mu, θ, E). Photons, protons, and iron are quite well classified, but there is an important source of error coming from the helium and nitrogen classification, which falls to 76% of correct labelling.
The results obtained with the MBFS ranking are surprising because, using only two features, the results are better than those obtained with the three features suggested by the experts. This is interesting because it motivates research that considers the electromagnetic part of the signal instead of only the muonic component. Additionally, using the three most relevant features, N_em, N_mu, and E, the results attained are similar to those using all five of them, implying the irrelevancy of θ and N_total once the first three are considered. Analyzing the data coming from the simulator, it is observed that almost all of the information about the cascade development is contained in the electromagnetic (N_em) and muonic (N_mu) components of the shower. With the additional information on the primary energy (E), it is possible to obtain the three most relevant pieces of information about the event: energy, mass composition, and direction of arrival. Therefore, the results obtained by MBFS corroborate the results obtained with the cascade simulator.