Rice Seed Purity Identification Technology Using Hyperspectral Image with LASSO Logistic Regression Model

Hyperspectral technology is used to obtain spectral and spatial information of samples simultaneously and demonstrates significant potential for use in seed purity identification. However, it has certain limitations, such as high acquisition cost and massive redundant information. This study integrates the advantages of the sparse feature of the least absolute shrinkage and selection operator (LASSO) algorithm and the classification feature of the logistic regression model (LRM). We propose a hyperspectral rice seed purity identification method based on the LASSO logistic regression model (LLRM). The feasibility of using LLRM for the selection of feature wavelength bands and seed purity identification are discussed using four types of rice seeds as research objects. The results of 13 different adulteration cases revealed that the value of the regularisation parameter was different in each case. The recognition accuracy of LLRM and average recognition accuracy were 91.67–100% and 98.47%, respectively. Furthermore, the recognition accuracy of full-band LRM was 71.60–100%. However, the average recognition accuracy was merely 89.63%. These results indicate that LLRM can select the feature wavelength bands stably and improve the recognition accuracy of rice seeds, demonstrating the feasibility of developing a hyperspectral technology with LLRM for seed purity identification.


Introduction
Paddy is one of the most important food sources worldwide, and its large number of varieties is continuing to increase. The phenomenon of seed mixing exists inevitably in each stage of paddy planting, production, and circulation, which affects the yield, nutrient composition, taste, and price of paddy rice. Therefore, the identification of rice seed purity is crucial in the grain industry. However, the identification task is primarily conducted manually with the naked eye [1] and can thus be easily affected by subjective factors such as experience and tiredness. Certain chemical or biological methods, such as phenol staining and protein electrophoresis, can accurately identify seed purity [2,3]; however, these methods are destructive, time-consuming, require large equipment or expertise, and cannot be batch processed. Therefore, it is vital to develop a rapid and accurate automatic identification method for rice seed purity.
Recently, nondestructive testing methods based on optical technology, such as infrared spectroscopy technology [4,5], machine vision, and hyperspectral technology [6][7][8][9][10][11][12][13], have received significant attention. Infrared spectroscopy technology identifies the spectral features of seeds [14] and analyses their internal chemical compositions (such as protein and starch) to classify them. However, this technology can only acquire a small area of spectral information from the seed samples at a time, and many repetitive measurements in different places are needed to represent the entire scenario of adulteration [15]. Machine vision technology captures colour RGB images of seeds and analyses their external physical features to classify them. Several reports on relevant machine vision technology have been published [16,17]. For example, Chaugule et al. (2014) classified four types of rice seeds based on their colour, shape, and texture with a recognition accuracy of 90. .00% [18]. Additionally, they used the angle of seeds as a new identification feature in 2016, improving the identification accuracy to 97.6% [19]. In 2017, Huang et al. analysed the key details of both ends of rice seeds and found that the identification accuracies of three rice seeds with similar appearances were 92.68%, 97.35%, and 96.57%, respectively [20]. Furthermore, in 2021, Ansari et al. adopted machine vision technology combined with a multivariate analysis method to establish a rapid detection method of rice seeds based on different purities, obtaining the highest accuracy of 93.9% [21]. However, the phenomena of overlapping and superposition of external physical features are often observed owing to the similar appearances of many rice seeds. Therefore, it is difficult to obtain a stable classification effect using spatial information alone [22].
Hyperspectral technology can obtain the spectral and spatial information of seed samples, and it has attracted significant attention in the field of seed identification [23][24][25].
There have been many reports on the use of hyperspectral technology to identify corn, wheat, soybean, rice, and other grain particles. For example, in 2013, Silvia et al. used partial least squares discriminant analysis to identify hyperspectral images of groats and oats with an accuracy of approximately 100% [26]. In 2015, Yang et al. used a support vector machine (SVM) to identify waxy corn seeds, achieving the highest identification accuracy of 98.2% [27]. In 2020, Zhu et al. used a convolutional neural network to identify hyperspectral images of six types of soybean seeds with an average recognition rate of 91% [28]. The identification methods of rice can be broadly classified into two categories. The first incorporates the popular deep learning algorithms for classification and recognition. For example, in 2018, Qiu et al. used a convolutional neural network to identify four types of rice; they achieved a recognition rate of 87% using 12,000 training samples [29]. Furthermore, in 2020, Weng et al. used a deep learning network, namely principal component analysis (PCA) network, to train 4320 samples, obtaining the best recognition rate of over 98% for 10 rice varieties [30]. However, although the deep learning method does not require feature extraction, it requires a large number of training samples, leading to relatively high costs of hyperspectral image collection and data processing. The second category uses the traditional machine learning methods for identification. The most effective method is to use PCA dimension reduction and SVM classification. In 2015, Sun et al. used a combination of spectral, image, and textural features to identify four types of rice using SVM with a recognition accuracy of 91.6% [31]. Furthermore, in 2020, Liu et al. used a combination of spectral, image, and textural features to identify four types of rice seeds with an average identification accuracy of 98% [32]. These methods only require a small number of training samples to achieve a satisfactory recognition rate. However, hyperspectral spatial information contains a significant amount of redundant information and unpredictable noise, leading to high costs of hyperspectral image acquisition and processing. The cost of data acquisition and processing can be further reduced if the feature wavelength bands can be selected via some mathematical methods such as PCA and analysis of variance [33].
There are several methods to reduce the dimension of hyperspectral data [34,35], such as correlation analysis [36], beta coefficient of partial least squares regression analyses [37], and PCA [38]. PCA is widely used in the field of grain recognition [39][40][41][42]. However, it is essentially a dimensionality reduction method. Feature wavelength bands selected by PCA methods are not suitable for physical interpretation; it is easy to ignore those wavelength bands with a small contribution but contain more useful information for classification. Moreover, feature band selection and classification are two independent algorithms. However, there exists a complex interaction between feature band selection and classification modelling. The algorithm featuring simultaneous selection and classification is rarely used for the identification of rice seed purity.
In 1996, Tibshirani et al. proposed LASSO regression, a data sparse method [43] often used for hyperspectral data unmixing [44][45][46]. For the first time in 2017, Yang et al. introduced the LASSO method into hyperspectral images for band selection by conducting tests on public datasets. Band selection based on LASSO regression is more explanatory in physics and helps reveal the physical meaning [47]. The logistic regression model (LRM) is a probability classification method [48,49]. As a classifier, it is widely used in hyperspectral image processing, deep learning and other fields, such as medicine, social science, along with others [50][51][52][53]. This study combines the advantages of the sparse feature of the LASSO algorithm and the classification feature of the logistic regression model. Then, an identification method for hyperspectral rice seed purity based on the LASSO logistic regression model (LLRM) is proposed. We selected feature wavelength bands from four types of rice seeds according to different adulteration conditions. Thereafter, we established the prediction model and compared it with the full-band LRM. Finally, we discuss the feasibility of LLRM to select the feature bands and detect the purity of rice seeds.

Sample Preparation and Data Acquisition
We selected the following four types of rice seeds with plump grains: Huguangxiang, Xiangwan Japonica, Huanghuazhan, and Japonica 530. Each type contained 72 grains, which were naturally air-dried to be considered as experimental samples. Different types of seeds were neatly placed on the sample table with a spacing of approximately 5 mm to ensure that all the seeds had the same collection environment. Hyperspectral image data of all the seeds were collected four times owing to a large number of seeds.
A schematic of the hyperspectral imaging system for rice seeds is illustrated in Figure 1. The rice seed samples were placed directly on a standard whiteboard. Two bilinear light sources, including six halogen bulbs (H8, OSRAM light AG; Munich, Germany), were symmetrically placed on both sides of the whiteboard for illumination. Furthermore, the hyperspectral camera (Specim IQ, Specim, Spectral Imaging Ltd.; Oulu, Finland) was fixed on the top with a bracket [54]. Specim IQ is a handheld hyperspectral camera, which integrates a processor (NVIDIA Tegra K1, NVIDIA Corporation; Santa Clara, CA, USA), spectral camera (CMOS technology), viewfinder camera (5 Mpix), focus camera, and a scanner with the motor for optics movement. Its spectral resolution is 7 nm with 204 spectral bands across the wavelength range 400-1000 nm, and the image resolution is 512 × 512 px. The original data collected by the hyperspectral camera were light intensity information. These data need to be converted into reflectivity data for subsequent processing. The formula for reflection spectrum conversion is as follows: where D Raw , D Dark , and D White denote the light intensity of the received sample, dark current intensity, and light intensity reflected by the whiteboard, respectively. Figure 2 depicts the visible image of four rice seed samples obtained through the viewfinder of the hyperspectral camera; they are Huguangxiang, Xiangwan Japonica, Huanghuazhan, and Japonica 530 from the first (top) to fourth (bottom) rows, respectively. Huangxiang and Huanghuazhan are indica rice with a long and narrow shape, whereas Xiangwan Japonica and Japonica 530 are japonica rice with a short and wide shape. Although there are apparent differences in appearance between the japonica and indica rice, the shape features of the two japonica (or indica) varieties are significantly similar. Thus, it is difficult to identify the four varieties based on their image features alone owing to their biological diversity.  Figure 2 depicts the visible image of four rice seed samples obtained through the viewfinder of the hyperspectral camera; they are Huguangxiang, Xiangwan Japonica, Huanghuazhan, and Japonica 530 from the first (top) to fourth (bottom) rows, respectively. Huangxiang and Huanghuazhan are indica rice with a long and narrow shape, whereas Xiangwan Japonica and Japonica 530 are japonica rice with a short and wide shape. Although there are apparent differences in appearance between the japonica and indica rice, the shape features of the two japonica (or indica) varieties are significantly similar. Thus, it is difficult to identify the four varieties based on their image features alone owing to their biological diversity. The original hyperspectral image also contained other objects, such as the reference whiteboard, in addition to the seed samples. We cut out the irrelevant parts to reduce the amount of data and save data processing resources, retaining only the region of interest (ROI). The original hyperspectral image contained 204 bands of information. The first and last few bands contained no useful information owing to significant noise. Therefore, only the data from the 15th to 190th bands (450-950 nm) were selected for subsequent processing. Figure 3 depicts the data processing flowchart of the hyperspectral rice seed purity identification technology. The first step was data pre-processing. The reflectivity data of the ROI were processed by standard normal variate (SNV) transformation and derivative methods. The second step was background segmentation, which involved segmenting the rice seed pixels from the background image and obtaining and analysing the spectral features of each rice seed. The third step involved preparing the training and the test set. The spectral features of 72 pure seeds of each type and the other three hybrid seeds were grouped together to simulate the purity of the seed, producing 13 datasets. These datasets were divided into training and test sets. The fourth step was modelling. LLRM was built based on the training set, and the selected spectral features were used to identify seeds.   Figure 2 depicts the visible image of four rice seed samples obtained through the viewfinder of the hyperspectral camera; they are Huguangxiang, Xiangwan Japonica, Huanghuazhan, and Japonica 530 from the first (top) to fourth (bottom) rows, respectively. Huangxiang and Huanghuazhan are indica rice with a long and narrow shape, whereas Xiangwan Japonica and Japonica 530 are japonica rice with a short and wide shape. Although there are apparent differences in appearance between the japonica and indica rice, the shape features of the two japonica (or indica) varieties are significantly similar. Thus, it is difficult to identify the four varieties based on their image features alone owing to their biological diversity. The original hyperspectral image also contained other objects, such as the reference whiteboard, in addition to the seed samples. We cut out the irrelevant parts to reduce the amount of data and save data processing resources, retaining only the region of interest (ROI). The original hyperspectral image contained 204 bands of information. The first and last few bands contained no useful information owing to significant noise. Therefore, only the data from the 15th to 190th bands (450-950 nm) were selected for subsequent processing. Figure 3 depicts the data processing flowchart of the hyperspectral rice seed purity identification technology. The first step was data pre-processing. The reflectivity data of the ROI were processed by standard normal variate (SNV) transformation and derivative methods. The second step was background segmentation, which involved segmenting the rice seed pixels from the background image and obtaining and analysing the spectral features of each rice seed. The third step involved preparing the training and the test set. The spectral features of 72 pure seeds of each type and the other three hybrid seeds were grouped together to simulate the purity of the seed, producing 13 datasets. These datasets were divided into training and test sets. The fourth step was modelling. LLRM was built based on the training set, and the selected spectral features were used to identify seeds. The original hyperspectral image also contained other objects, such as the reference whiteboard, in addition to the seed samples. We cut out the irrelevant parts to reduce the amount of data and save data processing resources, retaining only the region of interest (ROI). The original hyperspectral image contained 204 bands of information. The first and last few bands contained no useful information owing to significant noise. Therefore, only the data from the 15th to 190th bands (450-950 nm) were selected for subsequent processing. Figure 3 depicts the data processing flowchart of the hyperspectral rice seed purity identification technology. The first step was data pre-processing. The reflectivity data of the ROI were processed by standard normal variate (SNV) transformation and derivative methods. The second step was background segmentation, which involved segmenting the rice seed pixels from the background image and obtaining and analysing the spectral features of each rice seed. The third step involved preparing the training and the test set. The spectral features of 72 pure seeds of each type and the other three hybrid seeds were grouped together to simulate the purity of the seed, producing 13 datasets. These datasets were divided into training and test sets. The fourth step was modelling. LLRM was built based on the training set, and the selected spectral features were used to identify seeds. The last step was data output, wherein the accuracy of seed purity detection was determined based on the classification results.

Data Processing Flow
The last step was data output, wherein the accuracy of seed purity detection was determined based on the classification results. Figure 3. Flowchart of hyperspectral image processing, including data pre-processing, background segmentation, data set preparation, modelling, and identification accuracy output.

Data Pre-Processing
Various noises exist in the original hyperspectral data due to the limitations of the acquisition environment and the instrument, requiring that all data need to be pre-processed [55]. There are many methods suitable for pre-processing hyperspectral data, among which SNV and derivative methods are the most used. SNV is often used to eliminate the influence of sample surface scattering and optical path change on the spectrum [56,57]. Derivative processing includes first-order derivatives (FDs) and second-order derivatives (SDs) to eliminate other background interference and improve spectral resolution. Hence, this study uses SNV, FD, and SD methods to pre-process the data.
The SNV transformation formula is as follows: where of represents the spectral features before and after the transformation of the pixels in row and column , respectively. Furthermore, ̅ denotes the average reflectivity of , which is provided by: Figure 3. Flowchart of hyperspectral image processing, including data pre-processing, background segmentation, data set preparation, modelling, and identification accuracy output.

Data Pre-Processing
Various noises exist in the original hyperspectral data due to the limitations of the acquisition environment and the instrument, requiring that all data need to be preprocessed [55]. There are many methods suitable for pre-processing hyperspectral data, among which SNV and derivative methods are the most used. SNV is often used to eliminate the influence of sample surface scattering and optical path change on the spectrum [56,57]. Derivative processing includes first-order derivatives (FDs) and second-order derivatives (SDs) to eliminate other background interference and improve spectral resolution. Hence, this study uses SNV, FD, and SD methods to pre-process the data.
The SNV transformation formula is as follows: where S ij of x ij represents the spectral features before and after the transformation of the pixels in row i and column j, respectively. Furthermore, x ij denotes the average reflectivity of x ij , which is provided by: FDs and SDs are fulfilled by the Savitzky-Golay smoothing and differentiation filter [58,59], and the calculation formula as follows: where the polynomial order is 2 and smoothing point is 11, and R(λ) represents the reflectance data for wavelength λ, and ∆λ = 2.941 nm is the wavelength interval.

Image Segmentation
The rice seed pixels must be separated from the background before the extraction of the spectral features. The traditional grains segmentation method selects a hyperspectral image of a certain wavelength band [60]. The adaptive threshold method was used to obtain the binarised image, which was used as a template. Thereafter, the hyperspectral data were multiplied to obtain image segmentation.
The traditional method only uses the spatial information of hyperspectral data, and it is difficult to obtain an accurate binarised image using only single wavelength band information due to the significant amount of noise in the hyperspectral data. However, the difference in spectral slope between the rice seeds and background was noticeable. The reflectance of rice seeds increases gradually in the 400-1000 nm region, and the spectral curve is relatively smooth, but the spectrum of the standard whiteboard is flat at 400-1000 nm. Therefore, we can use the slope information of the spectrum to synthesise a new greyscale image. The formula of a synthesised greyscale image is as follows: where g ij represents the synthesised grey value of the pixel at the ith row and jth column, and x ij denotes the spectrum of pixel points in the ith row and jth column. The arctan function ensures that the value remains between 0 and 1. Thereafter, the synthesised greyscale image was binarised by Otsu's method to obtain a more accurate sample image boundary. This method uses spectral slope as an effective discriminator, which is equivalent to using both the spatial and spectral information of hyperspectral data to fulfil image segmentation, so it has a better segmentation effect. Figure 4 depicts the effect of the image segmentation after using the adaptive threshold method to binarise the synthesised greyscale image. Figure 4a depicts a colour image of a rice seed with a cast shadow due to an illumination problem, significantly affecting the image segmentation. Figure 4b depicts the binarised image using a hyperspectral image at 601.55 nm. Figure 4c showcases the result of segmentation using 601.55 nm wavelength band information. Some background pixels are incorrectly segmented due to the influence of the shadow, thereby affecting the spectral features of the rice seeds. Figure 4d shows the synthesised greyscale image using spectral slope. The spectral information helps overcome the effect of the shadow. Figure 4e represents the binarised spectral slope greyscale image. Figure 4f depicts the segmentation result using spectral slope information. Thus, the boundary of the rice seeds can be accurately determined using spectral slope information, yielding a considerably better result than that depicted in Figure 4c.

Spectral Feature
The pixels from each rice seed formed a connected region after the rice seeds were segmented from the background. Thereafter, all the connected areas in the image were counted. The pixels in each connected area belonged to the same rice seed sample. The spectral features of each rice seed were obtained by averaging the spectral data of the same connected area. Figure 5 depicts the average spectral features of four types of rice seeds. Figure 5a illustrates the reflectance spectrum after correction. Figure 5b depicts the spectrum after SNV processes. The average spectral differences among different paddy varieties were not significant for both raw and SNV data. in the spectra of different varieties after the FD and SD processes. Therefore, it is necessary to establish a mathematical classification model to distinguish these bands.

Spectral Feature
The pixels from each rice seed formed a connected region after the rice seeds were segmented from the background. Thereafter, all the connected areas in the image were counted. The pixels in each connected area belonged to the same rice seed sample. The spectral features of each rice seed were obtained by averaging the spectral data of the same connected area. Figure 5 depicts the average spectral features of four types of rice seeds. Figure 5a illustrates the reflectance spectrum after correction. Figure 5b depicts the spectrum after SNV processes. The average spectral differences among different paddy varieties were not significant for both raw and SNV data. Figure 5c,d show some differences in the spectra of different varieties after the FD and SD processes. Therefore, it is necessary to establish a mathematical classification model to distinguish these bands.

Spectral Feature
The pixels from each rice seed formed a connected region after the rice seeds were segmented from the background. Thereafter, all the connected areas in the image were counted. The pixels in each connected area belonged to the same rice seed sample. The spectral features of each rice seed were obtained by averaging the spectral data of the same connected area. Figure 5 depicts the average spectral features of four types of rice seeds. Figure 5a illustrates the reflectance spectrum after correction. Figure 5b depicts the spectrum after SNV processes. The average spectral differences among different paddy varieties were not significant for both raw and SNV data. Figure 5c,d show some differences in the spectra of different varieties after the FD and SD processes. Therefore, it is necessary to establish a mathematical classification model to distinguish these bands.

Data Grouping
The spectral features of 72 pure seeds of each seed type and the other three adulterated seeds were grouped to simulate different cases of seed purity detection, producing 13 datasets. Table 1 presents the detailed information. Groups 1-10 considered Huguangxiang as the research object. Groups 1-4 were adulterated with the other three types of seeds in equal amounts. However, their proportions were different. Groups 5-7 com- The spectral features of 72 pure seeds of each seed type and the other three adulterated seeds were grouped to simulate different cases of seed purity detection, producing 13 datasets. Table 1 presents the detailed information. Groups 1-10 considered Huguangxiang as the research object. Groups 1-4 were adulterated with the other three types of seeds in equal amounts. However, their proportions were different. Groups 5-7 comprised a mix of the other three seeds in different amounts. Groups 8-10 were adulterated with one other seed. Groups 11-13 considered Xiangwan Japonica, Huanghuazhan, and Japonica 530 as the research objects, respectively. Each group was adulterated with three other seeds, and each of them had 12 grains.

Partition of Sample Sets
Sample set partitioning based on joint x-y distance (SPXY) algorithm was used to randomly divide the data into training and test sets, increasing the difference and representativeness between the samples and improving the stability of the model sets [61,62]. The formula to calculate the distance between two samples is as follows: where Presented here, (p, q) ∈ [1, N], p and q denote the numbers of two different samples, x p and x q represent the spectral features of samples p and q, respectively. Furthermore, d x and d y denote the spectral distance and category distance of two different samples, respectively.

LLRM
The identification of rice seed purity can be considered a binary classification problem because it does not require the identification of non-target seed varieties. Let us assume that N represents the number of samples of rice seeds with a P-dimensional feature. Thus, the input feature, X, is an N × P matrix, and the output Y = [y 1 , y 2 , · · · y N ] is a binary variable. If the sample is of the target seed variety, y i = 1; else, y i = 0. The logarithmic probability of the ith sample belonging to target seed variety is: where X i = [x i1 , x i2 , · · · x iP ] represents the feature of the ith sample, β = [β 0 , β 1 , β 2 , · · · β P ] denotes the regression coefficient of LLRM, and p y i = 1|X i , β represents the probability of the ith sample, which belongs to the target seed variety when the regression coefficient is β.
The logarithmic maximum likelihood function of LLRM is as follows: LASSO regression minimises the sum of squares of residuals when the sum of absolute values of regression coefficients is lesser than a normal constraint. Some regression coefficients were gradually compressed during the parameter estimation; when the regression coefficient was lower than the threshold, it was set to 0 to achieve variable screening. The L1 norm was introduced into LLRM as the punishment term. Furthermore, the objective of variable sparsity was achieved. Its regression cost function is as follows: The estimated regression coefficient,β(λ), was obtained by minimising Equation (7) in the model.
where λ represents the regularisation parameter. The smaller the value of λ, the smaller the punishment, retaining more variables. However, the larger the value of λ, the larger the punishment, retaining fewer variables. Therefore, the selection of an appropriate λ has a significant impact on the classification accuracy of LLRM. This study uses a ten-fold cross-validation method to select the optimal value of λ and can be calculated as follows: where K = 10, CV k (λ) represents an error in fitting the kth subset when the optimal value of λ is used. The detailed steps of the LLRM algorithm implementation are as follows: (1) Data pre-processing: The data collected by hyperspectral technology were preprocessed. Then, they were grouped, as presented in Table 1. (2) Dividing the training and test sets: The SPXY algorithm was used to randomly divide the data into training and test sets with a 1:1 ratio. (3) Selecting the regularisation parameter: The training samples were divided into 10 parts using the cross-validation method, taking turns using 9 of their train, and then calculating the error of fitting the other one. The mean square error of the 10 prediction results was used to estimate the accuracy of the algorithm. Moreover, the optimal value of λ was selected based on the accuracy of the model prediction and by combining the number of feature bands.
(4) Selecting the feature wavelength bands: The coordinate descent method was used to calculate the regression coefficient at an optimal value of λ. Then, the feature band was selected based on the regression coefficient. (5) Modelling: The feature wavelength bands from the test set were selected. Moreover, the regression model was considered the foundation. (6) Calculation of classification accuracy: The function is calculated as follows: If the probability is greater than 0.5, the sample represents the target seed variety. However, if the probability is less than 0.5, the sample represents a non-target seed variety. Finally, they were compared with the labels and the classification accuracy was determined.

Optimum λ Selection
The value of λ has a considerable influence in solving the LLRM. An optimum value of λ can establish a balance between the fitting error and model complexity. If λ is significantly large, the degree of model compression increases, the number of selected feature wavelength bands decreases, and the fitting error increases, consequently affecting classification. If λ is significantly small, the variable compression effect is constrained, and the number of selected feature bands increases significantly, wasting computing resources. Therefore, the value of λ must be optimised carefully.
According to the number of feature wavelengths and recognition accuracy, the optimisation range of λ is set to 0.0001-0.2 in this study. We used cross-validation to determine the relatively small range of λ [63,64]. Thereafter, the number of feature bands and the accuracies for different λ values were analysed, and the optimal λ value was selected. Figure 6a depicts the change in the regression coefficient with λ. The relative value of the regression coefficient gradually decreases, and the coefficient is gradually compressed with increasing λ. Figure 6b depicts variation in the data fitting error of Group 2 with the change in λ. The blue curve represents the fitting error, and the thick red line indicates the standard deviation of different fitting errors. The red vertical line corresponds to the value of λ with the smallest fitting error. In general, the smallest fitting error does not imply the best prediction effect because the data contain unknown noise. Therefore, the optimal value of λ was expected to be found near the red vertical line. Figure 6c,d show that the number of feature wavelength bands is 10 when the fitting error is the smallest with a prediction accuracy of 92.57%. However, the prediction accuracy does not decrease with the decreasing number of feature wavelength bands. This indicates the need to further optimise the model. The highest prediction accuracy of 95.37% is obtained below the red vertical line. There are two positions with the highest prediction rate of λ = 0.0270 and λ = 0.0668, which correspond to 10 and 7 feature wavelength bands, respectively. Moreover, the green vertical line corresponds to λ = 0.0668, which requires the minimum number of feature wavelength bands while exhibiting the same prediction accuracy, thereby effectively reducing the cost of data processing. Other data groups can be optimised similarly to obtain the best LLRM.  Table 2 presents the wavelength bands selected using the SNV data under different adulteration groups. Groups 1-10 indicate that the coincidence degree of the selected wavelength bands is significantly higher when the same variety of rice seeds is considered despite the different number and variety of doped seeds. Among them, wavelength bands of 678.71 nm, 810.86 nm, and 871.60 nm appeared in all the data groups. Furthermore, the appearance frequencies of 548.55 nm, 798.77 nm, and 847.25 nm were also significantly higher, indicating the feasibility of using LLRM for band selection.  show that the coincidence degree of the selected wavelength bands is significantly smaller, implying that different LLRMs must be established for different rice varieties.   Table 2 presents the wavelength bands selected using the SNV data under different adulteration groups. Groups 1-10 indicate that the coincidence degree of the selected wavelength bands is significantly higher when the same variety of rice seeds is considered despite the different number and variety of doped seeds. Among them, wavelength bands of 678.71 nm, 810.86 nm, and 871.60 nm appeared in all the data groups. Furthermore, the appearance frequencies of 548.55 nm, 798.77 nm, and 847.25 nm were also significantly higher, indicating the feasibility of using LLRM for band selection.  show that the coincidence degree of the selected wavelength bands is significantly smaller, implying that different LLRMs must be established for different rice varieties. Tables 3 and 4 list the selected wavelength bands using FD and SD data under different adulteration groups, respectively. The coincidence degree of the selected wavelength bands is significantly higher for the same variety of rice seeds. In contrast, it is significantly smaller for different varieties of rice seeds. We compared the results of band selection using SNV, FD, and SD data for the same type of target seeds. The results indicate that the band coincidence is extremely low, implying that the band selection of LLRM is highly sensitive to the data pre-processing procedure. Under the optimal LLRM, the number of feature wavelength bands selected by the SNV, FD, and SD data were 5-17, 3-11, and 6-12, respectively. Moreover, the FD data required the least number of bands among the three.  Figure 7 depicts the wavelength band selection using the SNV data for different rice varieties. The red lines indicate the average spectra of the target rice seeds, and the blue lines represent the average spectra of the adulterated seeds. The positions indicated by the green lines represent the best bands based on the data of Group 2. Figure 7a shows that the wavelength bands of 548.55 nm, 714.55 nm, and 810.86 nm correspond to small reflectance peaks of the target seeds. The reflectance of the target seed spectra at wavelengths of 643.01 nm, 678.71 nm, and 871.60 nm are significantly different from those of the adulterated seeds. The difference in spectra between the target rice seeds and adulterated seeds are not noticeable at 762.98 nm, 798.77 nm, and 847.25 nm. Certain wavelength bands, such as 896.01 nm and 920.48 nm, are not selected despite the noticeable difference in their spectra, implying that greater spectral differences do not contain more useful information for classification. The number and location of the selected bands in Figure 7b-d are significantly different, indicating that different LLRMs must be established for different varieties.  Figure 7 depicts the wavelength band selection using the SNV data for different rice varieties. The red lines indicate the average spectra of the target rice seeds, and the blue lines represent the average spectra of the adulterated seeds. The positions indicated by the green lines represent the best bands based on the data of Group 2. Figure 7a shows that the wavelength bands of 548.55 nm, 714.55 nm, and 810.86 nm correspond to small reflectance peaks of the target seeds. The reflectance of the target seed spectra at wavelengths of 643.01 nm, 678.71 nm, and 871.60 nm are significantly different from those of the adulterated seeds. The difference in spectra between the target rice seeds and adulterated seeds are not noticeable at 762.98 nm, 798.77 nm, and 847.25 nm. Certain wavelength bands, such as 896.01 nm and 920.48 nm, are not selected despite the noticeable difference in their spectra, implying that greater spectral differences do not contain more useful information for classification. The number and location of the selected bands in Figure 7bd are significantly different, indicating that different LLRMs must be established for different varieties.     Figure 8a shows that there are significant differences in the data at 466.77 nm, 583.85 nm, 678.71 nm, and 835.11 nm. The greater spectral differences do not imply a greater variety of samples. Comparing Figure 8b-d, we find that the selected wavelength bands are mainly concentrated in the infrared wavelengths of 750-900 nm. Moreover, the spectral differences cannot be observed directly in the hyperspectral data, indicating that the infrared bands contain more useful information than visible light bands. Thus, we conclude that more attention should be allotted to the infrared bands for wavelength band selection. wavelength bands are mainly concentrated in the infrared wavelengths of 750-900 nm. Moreover, the spectral differences cannot be observed directly in the hyperspectral data, indicating that the infrared bands contain more useful information than visible light bands. Thus, we conclude that more attention should be allotted to the infrared bands for wavelength band selection.  Figure 9 depicts the wavelength band selection using SD data for different rice varieties. It is difficult to notice the spectral differences in different wavelength bands because the SD data is significantly rough. Hence, the wavelengths with large peaks are not selected. Comparing Figures 7-9, we find that the selected feature wavelength bands are almost different under different pre-processing conditions for the same type of target seeds, indicating that the data pre-processing method significantly influences the LLRM. The pre-processing using the FD data exhibited the fastest convergence speed and required the least number of bands; hence, it is the best pre-processing method.  Figure 9 depicts the wavelength band selection using SD data for different rice varieties. It is difficult to notice the spectral differences in different wavelength bands because the SD data is significantly rough. Hence, the wavelengths with large peaks are not selected. Comparing Figures 7-9, we find that the selected feature wavelength bands are almost different under different pre-processing conditions for the same type of target seeds, indicating that the data pre-processing method significantly influences the LLRM. The pre-processing using the FD data exhibited the fastest convergence speed and required the least number of bands; hence, it is the best pre-processing method.

Band Selection
The coincidence degrees of the feature bands selected for different data groups were significantly higher in the Huguangxiang case. The appearance frequencies of 548.55 nm, 643.01 nm, 678.71 nm, 714.55 nm, 810.86 nm, and 871.60 nm were the highest. The coincidence degrees of the feature bands were significantly smaller for different types of rice seeds, indicating that a specific LLRM must be established for different rice varieties. The band selection results of the LLRM were highly sensitive to data pre-processing. It was found that pre-processing with FD was more suitable for LLRM than with SNV or SD.  The coincidence degrees of the feature bands selected for different data groups were significantly higher in the Huguangxiang case. The appearance frequencies of 548.55 nm, 643.01 nm, 678.71 nm, 714.55 nm, 810.86 nm, and 871.60 nm were the highest. The coincidence degrees of the feature bands were significantly smaller for different types of rice seeds, indicating that a specific LLRM must be established for different rice varieties. The band selection results of the LLRM were highly sensitive to data pre-processing. It was found that pre-processing with FD was more suitable for LLRM than with SNV or SD.

Comparison of Model Accuracy
Tables 5-7 show the recognition accuracies between LLRM and LRM using SNV, FD an SD data, respectively. As shown in Table 5, the recognition accuracy of LLRM is 98.77-100%, with an average recognition accuracy of 99.64%. The recognition accuracy of the full variable LRM is 71.60-100%, with an average recognition rate of 93.66%. However, the results of LRM are considerably lower than those of LLRM. The recognition accuracies of Group 3 and 4 are particularly more prominent at less than 80%; this indicates that LLRM selects the key bands steadily and eliminates unnecessary noise, thereby improving the signal-to-noise ratio of the hyperspectral data and obtaining a better rice seed recognition rate.  Tables 5-7 show the recognition accuracies between LLRM and LRM using SNV, FD an SD data, respectively. As shown in Table 5, the recognition accuracy of LLRM is 98.77-100%, with an average recognition accuracy of 99.64%. The recognition accuracy of the full variable LRM is 71.60-100%, with an average recognition rate of 93.66%. However, the results of LRM are considerably lower than those of LLRM. The recognition accuracies of Group 3 and 4 are particularly more prominent at less than 80%; this indicates that LLRM selects the key bands steadily and eliminates unnecessary noise, thereby improving the signal-to-noise ratio of the hyperspectral data and obtaining a better rice seed recognition rate. Table 6 presents the comparison of the recognition accuracies between LLRM and LRM using FD data. The recognition accuracy of LLRM is 95.37-100%, with an average recognition accuracy of 99.18%. However, the recognition accuracy of the full variable LRM is 73.33-95.83%, with an average recognition rate of merely 88.63%. Table 7 shows that the recognition accuracy of the full variable LRM using SD data is 77.78-95.61%, with an average recognition rate of merely 86.60%. However, the recognition accuracy of LLRM is 91.67-100%, with an average recognition rate of 96.59%. The average recognition rate of LLRM is higher than the highest rate of LRM. Comparing Table 5 with Tables 6 and 7 shows that the recognition accuracy of the full variable LRM is 78.89-95.61%, with an average recognition rate of merely 86.60%. However, the recognition accuracy of LLRM is 91.67-100%, with an average recognition rate of 96.59%. This result indicates that it is feasible to develop a hyperspectral technology with LLRM to identify seed purity.

Conclusions
In this study, we combined the sparse characteristics of the LASSO algorithm and the classification characteristics of the LRM and proposed a method to identify the hyperspectral rice seed purity based on the LLRM. We tested 13 different adulterated groups of four varieties of rice seeds. The results of different data groups indicated that the value of the regularisation parameter λ was different for different adulteration cases. Moreover, the number of selected feature wavelength bands also varied. The best balance between the number of feature wavelength bands and recognition accuracy was obtained by adjusting λ, thereby establishing the optimal LLRM.
The recognition accuracy of the LLRM was 91.67-100% in 13 rice seed adulteration cases, with an average recognition rate of 98.47%. However, the recognition accuracy of the LRM was 71.60-100%, with an average accuracy of merely 89.63%. This result indicates that LLRM selects the key bands steadily and improves the signal-to-noise ratio of the hyperspectral data, obtaining better accuracy in rice seed recognition. The average number of selected feature wavelength bands is 11.3 for SNV data, 8.5 for FD data, and 9.5 for SD data. SNV data selected relatively more feature wavelength bands, which led to better accuracy than FD and SD data. By comparing all the LLRMs, we found that the accuracy of three pre-processing methods was improved by 5.98%, 10.55%, and 9.99%, respectively, and the probability of achieving a 100% recognition accuracy increased after FD processing, implying that FD processing is more suitable than other approaches for LLRM.
These results reveal that the hyperspectral detection technology of rice seed purity based on LLRM can reduce the collection cost and improve the processing speed while ensuring satisfactory recognition accuracy. Thus, LLRM demonstrates significant potential for application in the field of seed purity identification. In future research, we will further analyse and select these candidate feature wavelength bands and add information on the texture and shape of samples to establish a more robust hyperspectral rice seed purity identification technology.