Outliers in spectral time lag selected gamma-ray bursts

It is possible that the astrophysical {samples} are polluted by some outliers, which might belong to a different sub-class. By removing the outliers, the underline statistical feature may be revealed. {A more reliable correlation can be used as a standard candle relation for the cosmological study.} We present outlier searching for gamma-ray bursts with Partitioning Around Medoids (PAM) method. In this work, we choose three parameters from the sample, while all of them having rest-frame spectral time lag ($\tau_{\rm lag,i}$). In most cases, the outliers are GRBs 980425B and 030528A. Linear regression is carried out for the sample without the outliers. Some of them have passed hypothesis testing, while others have not. However, even for the passed sample, the correlation is not very significant. More parameter combinations should be considered in the future work.


Introduction
Gamma-ray bursts (GRBs) are astronomical phenomena detected at high energies. Classification and correlation seeking may reveal the underlying physics. GRBs can be classified as long GRBs (LGRBs) and short GRBs (SGRBs). SGRBs are likely associated with compact binary coalescence events involving at least one neutron star [1,2]. Massive stellar collapse is believed to generate the LGRBs [3][4][5]. However, there is no clear criterion for the classification. Some GRBs could not be decisively classified in either class. On the other hand, it is still possible that GRBs can be classified into more groups, such as intermediate class, giant flare from soft gamma-ray repeaters. The outlier detection belongs to classification analysis. The outliers may belong to an individual group.
Cluster analysis techniques are various, such as k-means, K-medoids, hierarchical clustering, neural network clustering, kernel principal component analysis [6][7][8]. Modak [9] conducted the fuzzy clustering on the GRBs from the final Burst and Transient Source Experiment (BATSE) catalog, and confirmed three groups. Partitioning Around Medoids (PAM) is the most prominent method of K-medoids. We use the euclidean distance as the similarity measurement. The outliers are far away from the other data. We analyze the outliers, and seek the underlying linear correlation between different properties.
The data is based on the collection of Wang et al. [10]. We collect a full sample including prompt emission, afterglow and host galaxy properties. We choose arbitrary three parameters based on the rest-frame spectral time lag τ lag,i to find the outliers, i.e., each combination of the physical parameters contains τ lag,i , trying to find any possible clues for the classification.
Spectral time lag is the time arrival difference between different energy bands for the prompt emission light curves of GRBs. Spectral lag was first introduced by Norris et al. [11]. A cross-correlation function (CCF) can be used to quantify such an effect since the pulse peaks at different energy bands are delayed. The method is widely used to calculate spectral lag [12]. A positive spectral lag is when the high-energy photons arrive before the low-energy ones, a negative spectral lag is on theopposite. Prompt intrinsic spectral evolution or the curvature arXiv:2211.07255v1 [astro-ph.HE] 14 Nov 2022 effect of relativistically moving shocked shells is used to explain the observed spectral lag [13,14].
Spectral time lag has been found correlated to several other quantities. The bolometric peak luminosity and spectral lag have an anti-correlation found by Norris et al. [15], later confirmed by Ukwatta et al. [12], Norris [16], Gehrels et al. [17]. Therefore, the lag is an indicator of both GRB peak luminosity and time history morphology, with short-lag and variable bursts having greater luminosities than long-lag and smooth bursts [16,18,19]. The peak luminosity and spectral lag relation can be viewed as being closely related to the peak luminosity and the variability relation. Variability should be inversely proportional to spectral lag, i.e., larger variability jets exhibit shorter spectral lags [20]. Using the internal shock model, the peak luminosity -spectral lag relation and the peak luminosity -variability relation can be caused by changes in observer's viewing angle with respect to the jet axis [21]. Bright GRBs are expected to have larger Lorentz factor and smaller viewing angles, meaning that the observer is viewing the GRB jet on axis. The spectral lag should be small due to the smaller emitting region [21]. Chen et al. [22] showed the distribution of spectral time lags in GRBs. The distribution of spectral time lags in LGRBs is apparently different from SGRBs, which implies different physical mechanism. Yi et al. [23] confirmed this result. Zhang et al. [24] also studied the spectral time lag of SGRBs, and found the lags of the majority of SGRBs are so small that they are negligible or not measurable. Shao et al. [19] carried out a systematic study of the spectral time lag properties of 50 single-pulsed GRBs detected by Fermi. Shao et al. [19] provided a new measurement which is independent on energy channel selections, and the new results would favor the relativistic geometric effects for the origin of spectral time lag. Lu et al. [25] found the spectral time lags are closely related to spectral evolution within the pulse. However, all of the statistics related to the lags do not have very high significance. In this work, we try to find if there are outliers, and to see whether the statistics become more significant without the outliers. We want to find more reliable correlations than the previous works. We hope to use the reliable correlations as standard candle and to constrain the cosmological parameters.
This paper is organized as follows. Section 2 outlines the statistical methods. Section 3 discusses the PAM results. Section 4 is the conclusion and discussions.

Statistical methods
The parameters and samples are from our previous work [10]. We collected all the possible data for 6289 GRBs in a big catalog [10], of which 165 GRBs have been selected as they contain the required parameters. The parameters we used in this work include τ lag,i (rest-frame spectral time lag, in unit of ms MeV −1 ), T 50,i (duration of 25% to 75% γ-ray fluence in rest-frame), T 90,i (duration of 5% to 95% γ-ray fluence in rest-frame), T R45,i [defined in 26], variability 2 (light curve variability from the definition of Reichart et al. [26]), L pk,52 (peak luminosity of 1 s time bin in rest-frame 1-10 4 keV energy band, in unit of 10 52 erg s −1 ), E iso,52 (isotropic γ-ray energy in rest-frame 1-10 4 keV energy band, in unit of 10 52 ergs), α Band (low energy spectral index of Band model), β Band (high energy spectral index of Band model), E p,Band,i (rest-frame spectral peak energy of Band model, keV), E p,i (rest-frame spectral peak energy of Band model and cutoff power law model, keV), α cpl (low energy spectral index of cutoff power law model), log t burst,i (rest-frame central engine active duration in logarithm, in unit of s) [defined in 27], t radio,pk,i (rest-frame peak time in radio band, in unit of s), β X11hr (index in X-ray band at 11 hours related to the trigger time), Age ( in unit of Myr), A V (dust extinction), host galaxy offset (the distance from GRB location to the centre of its host galaxy, in unit of kpc), Mag (absolute magnitude in AB system at rest 3.6 µm wavelength), log SSFR (specific star formation rate in logarithm, in unit of Gyr −1 ), N H (column density of hydrogen, in unit of 10 21 cm −2 ). We use label "i" to mark the parameters in rest-frame. We choose three parameters including τ lag,i for PAM analysis. For the spectral time lag, because the energy band is different for different instrument, we divide the spectral time lag by the difference of two energy band central values to get a unified quantity, which can be seen in table 1 of Wang et al. [10]. For example, the spectral time lag for GRB 980425B is 1.46 ± 0.18 s between 50-100 keV and 25-50 keV [20], and the redshift is 0.0085. The central value of τ lag,i is 1.46 1+0.0085 · ( 50+100 Same as the suggestion in Foley et al. [28], we removed the τ lag,i for GRB 060218, because GRB 060218 has extremely large lag, and it is an X-ray flash rather than a typical GRB. For the spectral parameters, the spectra are mainly fitted by three models: Band model, cutoff power law (CPL) model and simple power law (SPL) model [29]. Band model is a smoothly joint broken power law with the definition [30]: where α is low energy photon index, β is high energy photon index, A is the coefficient for normalization, E is the energy of the photons, and E 0 is the break energy. Mostly we used E p instead of E 0 . E p is the peak energy in spectrum of E 2 N, and E p = (2 + α)E 0 .
In the previous work [10], we use α Band , β Band and E p,Band as Band function spectral parameters. α cpl and E p,cpl to mark CPL model spectrum parameters. The formula of SPL model is N(E) = AE α spl . In this paper, we use −α Band , −β Band and −α cpl to stand for the opposite of spectral indices, as they become mostly positive numbers by adding the '−' sign.
Cluster analysis divides data into clusters that are meaningful, useful, or both. Cluster analysis is the study of techniques for finding the most representative cluster prototypes. A cluster is a set of objects in which each object is more similar to the prototype that defines the cluster than to the prototype of any other clusters. There are a number of such techniques, but two of the most prominent are K-means and K-medoids. K-medoids method is more robust than others in terms of outliers. The outliers are the smallest groups with a few points in this paper. Partitioning Around Medoids (PAM) is the most prominent method of K-medoids. In this paper, we use PAM method for outlier detection. In order to avoid the influence of different unit, we apply data standardization to all the parameters. We use the R language function Nbclust to calculate the best cluster number. The detailed processes are the following: • We calculate the best cluster number K and choose K initial centroids, where K is a userspecified parameter, namely, the number of clusters desired. In this paper ,the K is 2 or 3; • We use the Euclidean distance as similarity measurement. We calculate the Euclidean distance to the initial centroids of each point; • Each point is then assigned to the closest centroid, and each collection of points assigned to a centroid is a cluster; • For every cluster, we calculate the euclidean distance sum to the centroid of each point, which are assigned to this centroid. Then we get the sum of each cluster; • For every cluster, we choose one point to update the centroid; • Points are assigned to the updated centroids; • For every cluster, we calculate the euclidean distance sum to the updated centroid of each point, which are assigned to this updated centroid. Then we get the sum of all the clusters; • If the sum of all the clusters in the seventh step is smaller than the fourth step, we update the centroids; • We repeat the assignments and update steps until no point changes clusters, or equivalently, until the centroids remain the same.
We tried all the three parameters combinations including τ lag,i . For every combination, we use PAM method for outlier detection firstly. If we find outliers, we do the linear regression among the three parameters including τ lag,i without outliers. Some combinations have obvious outliers and significant correlations as shown in Section 3.1. Some combinations just have obvious outliers, but no significant correlations can be found, as shown in Section 3.2. Some combinations have no obvious outliers and no significant correlations, and we don't show these results. We also consider all the error bars using MC method [31]. The results are shown in the next section.

PAM results
We tried all the three combinations including τ lag,i with PAM method. The data selection criterion is that the sample should have at least 10 GRBs. We use the R language Nbclust package to calculate the best cluster number. Nbclust package provides 30 indices for determining the number of clusters, like Krzanowski-Lai (KL) index [32], Davies-Bouldin (DB) index [33], and so on. Charrad et al. [34] showed more details about Nbclust package. As shown in the upper panel of Figure 1, number 2 has the maximum criteria. Therefore, the cluster number should be 2. However, the cluster number of Figures 10 and 9 is 3, as shown in Figure 9. Except Figures 10 and 9, the cluster number of other figures is 2 with Nbclust package. We did not show the others in this paper. We found 191 combinations with obvious outliers. Though PAM method can find the outliers, the rest of the sample in the PAM plot does not show any meaningful message on the correlation or the clustering. Therefore, some independent statistics should be applied to check the statistical property for the rest of the sample Linear regression is carried out with outliers removed. Only 8 combinations have passed the hypothesis testing, which are displayed in Figures 1 to 8. However for the passed combinations, the correlation is not very significant. We use the adjusted R 2 to measure the goodness of the regression model. It means the percentage of variance explained considering the parameter freedom. We also showed a small part of cluster plots with obvious outliers, but not passing the hypothesis testing in Figures 9 to 14. These figures include all the outliers. In other words, the cluster plots with repeated outliers are not displayed in this paper.

Remarkable linear regression results without outliers
The outlier analysis can be apparently shown in figures. We list the results with remarkable linear regression results in Figures 1-8. Figures 1 and 2 show the outliers in the diagram of τ lag,i , T 90,i (and T 50,i ) and β Band , respectively. Both T 90,i and T 50,i indicate the duration of the GRBs. Therefore, these two analyses have similar behavior. GRBs 980425B and 030528A are the two outliers, which can be clearly seen from these two figures. The spectral time lag for GRB 030528A is 12.5 ± 0.5 s between 100-300 keV and 25-50 keV [35]. The spectral time lag for GRB 980425B is 1.46 ± 0.18 s between 50-100 keV and 25-50 keV [20]. Because the energy band is different, we divide the spectral time lag by the difference of two energy band central values. τ lag,i for these two GRBs are 43215 ± 1729 ms MeV −1 and 38605 ± 4760 ms MeV −1 , respectively, which are much larger than other GRBs. The detailed data for all the samples can be seen in [10]. Notice GRB 980425B is the familiar burst which is generally called GRB 980425 [see 10, for the explanation]. From the right panels of these two figures, one can see the two outliers have extraordinarily large lags, which should be the reason why they are classified as outliers. From the linear regression results, one can see the spectral time lag is proportional to duration of GRBs, and anti-correlated to the spectral index β.
Without the two outliers, one finds τ lag,i = (−1091 +152 −164 ) × log L pk,52 + (684 +118 −129 ) × log E iso,52 + (703 +37 −85 ). Considering L pk,52 ∼ E iso,52 /T 90 , this relation is actually showing the lag is anticorrelated to the total energy, i.e., with higher total energy, the lag is shorter.  The indication from Figure 3 is identified in Figure 4. It directly shows the proportional correlation to T 90 and the anti-correlation to E iso,52 . Figure 5 introduces a new parameter, the spectral index α. The lag is positively correlated to α, i.e., with bigger α, the lag is larger. Remembering the anti-correlation to β as shown in Figures 1 and 2, i.e., with smaller β, the lag is larger. Notice that in general, α > 0, and β < 0, that means the sharper the slope of the spectrum, the larger the lag. This is confirmed in Figure  6, where again τ lag,i is anti-correlated to β. Figure 7 shows a positive correlation between the lag and the peak energy of GRBs. Another new correlation is shown in Figure 8, which is the anti-correlation between the lag and the variability. Because GRB 030528A does not have the value of variability 2 , we just have one outlier as shown in Figure 8. Both of them have larger adjusted R 2 , indicating the correlation is more reliable. Figures 9 to 14 show the remarkable outliers without significant linear regression (with p-value greater than 0.05), which are clearly obvious. In some of these figures (e.g., right panel of Figure 10, and Figure 11), one can see the sample without outliers shows apparent linear correlations. However, one cannot deduce an intrinsic correlation of the physical parameters, as PAM method only identifies the outliers into apparent large distances.       From these figures, we found more outliers for different combinations of three parameters. The outliers are GRBs 980425B, 010921A, 030528A, 031203A, 050416A, 060502A, 060604A, 060605A, 060607A, 080319B, 080721A, 081221A, 090926A, 091127A, 100621A, and 130408A respectively in different cases.

Remarkable outliers without significant linear regression
We also show the distribution of τ lag,i in Figure 15, the outliers are not obvious. Though the quantities are clearly larger for these GRBs, from the distribution itself, they could be inside a single distribution function. Only when they are combined with other parameters, the outliers become obvious. To show the distribution more obviously, we choose the logarithmic scale. However, to avoid the negative numbers, we put them in an independent panel as shown in Figure 15. We also did not consider the error bars in the distribution, as that may rise the positive-negative value problem.

Discussion and conclusions
By performing PAM method onto the GRB data, we expect to find some outliers. By removing the outliers, we then expect to find inner correlations with the remaining sample. In this work, we chose the spectral time lag selected sample, together with other parameters, and we found that, in most cases the outliers are GRBs 980425B and 030528A. This is mainly because they are the two GRBs with the largest spectral lags. However, there are also other outliers in other combinations. That means the value of the lag is not the only criterion. In some combinations of other two parameters, there are weak correlations by removing the outliers, while there are also others having no correlations. As there are no strong correlations, we can not get very conclusive results from those correlations yet. We expect more combinations can reveal some underline correlations. Once we find tighter correlations, some of those correlations could be used as standard candle relations and could be used for more reliable cosmological studies.
It is interesting that the outliers are always GRBs 980425B and 030528A for those combinations having correlations (see Figures 1-8), while the others having more outliers. The reason is not clear yet. GRBs 980425B and 030528A both have very large τ lag,i . In Figures 9-14, most outliers are not GRBs 980425B and 030528A. It is due to the reason that not all the combinations include GRBs 980425B and 030528A. On the other hand, GRB 980425B is a low-luminosity burst [36], and GRB 030528A is an X-ray rich burst [37]. They are both special GRBs.
With the accumulated data, one can perform both the classification and correlation analysis on the data. One can also first classify the GRBs into several subgroups and look for the correlation for each subgroup. These processes may reveal the underlying nature. Outlier analysis is similar to the classification, while to find the outliers, which is the minority in the whole sample. Those outliers may indicate some special feature of the selected GRBs, which may reveal an independent origin (but much smaller samples), or a very different radiation mechanism. More outlier criteria (more parameter combinations) may reveal that in different aspects. For example, the soft γ−ray repeaters, may lie in the sample of conventional GRBs from compact binary mergers or the collapse of massive stars, and may be classified as outliers with a certain set of correctly selected parameters. If the outliers are real strangers, one may want to remove them from the whole sample. By omitting the outliers, the remaining sample may obey some laws, which can also be used to study the physics of GRBs. With the remaining sample, one may want to do a similar the classification and correlation analysis. In the future more comprehensive study, new patterns might be revealed. Data Availability Statement: The data used in this work are publicly available in the machine readable table in [10].

Conflicts of Interest:
The authors declare no conflict of interest.