Proceeding Paper

Predicting Big Mart Sales with Machine Learning †

1 Department of Software Engineering, University of Sialkot, Sialkot 51040, Pakistan
2 Informatics Engineering, Faculty of Engineering, Computer and Design, Nusa Putra University, Sukabumi 43152, West Java, Indonesia
* Author to whom correspondence should be addressed.
Presented at the 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society, Aizuwakamatsu City, Japan, 20–26 January 2025.
Eng. Proc. 2025, 107(1), 95; https://doi.org/10.3390/engproc2025107095
Published: 16 September 2025

Abstract

Supermarket-run shopping centers, known as “Big Marts,” now track sales information for every individual item in order to predict customer demand and update inventory management. Anomalies and general trends are commonly discovered by mining the data warehouse with a range of machine learning techniques, and businesses such as Big Marts can use the resulting data to forecast future sales volumes. In this study, sales were forecast using machine learning models including KNN (K-Nearest Neighbors), Naïve Bayes, and Random Forest, with higher accuracy than that reported in other research publications. To adapt the proposed business model to anticipated outcomes, the sales forecast is based on Big Mart sales for various stores. The projected cost of the suggested system includes the following identifiers: price, outlet, and outlet location. This study emphasizes the importance of incorporating cutting-edge machine learning approaches in order to facilitate data-driven decision-making in retail operations and to help Big Marts optimize their business models and effectively satisfy anticipated demand.

1. Introduction

A crucial field of research in retail analytics is Big Mart sales prediction, which uses cutting-edge data-driven techniques to improve demand forecasts, pricing tactics, and inventory management. Traditional techniques like linear regression and Decision Trees frequently find it difficult to capture the complexity of customer behavior, product features, and store-specific aspects due to the rapid expansion of data available in the retail industry. Gradient-Boosted Trees, Random Forest, and XGBoost are a few examples of machine learning algorithms that have become effective tools for increasing prediction efficiency, scalability, and accuracy. By using large datasets to find patterns and trends, these techniques help firms make better decisions and increase profitability. Notwithstanding these developments, there are still issues with model interpretability, real-time adaptability, and incorporating outside variables like economic data.

2. Literature Review

The progress of methods for enhancing precision and effectiveness in retail forecasting is highlighted in the literature on Big Mart sales prediction. Conventional models, including Decision Trees and linear regression, frequently fall short in addressing scalability and data complexity problems. In contrast to previous techniques like Random Forest and neural networks, refs. [1,2] highlighted the superiority of XGBoost in terms of handling huge datasets and offering higher accuracy. Studies have indicated that polynomial regression has the potential to identify complex patterns in data [3,4]. Other studies demonstrate how well Ridge Regression and Random Forest work for demand forecasting and inventory management [5,6]. In complicated circumstances, advanced techniques like artificial neural networks have performed better than older models, achieving improved scalability and lower error metrics [7]. Alternative methods, such as KNN and Gradient-Boosted Trees, were investigated in [8,9], which showed their effectiveness in some situations. Despite advancements in this field, there are still issues with the real-time adaptation of models, integrating hybrid models involving random forest [10], and enhancing model interpretability for business applications. These findings highlight the increasing significance of machine learning in enhancing retail operations’ ability to predict sales. Ref. [11] tackles the issue of sales prediction for retailers like Big Marts. The results show that Random Forest gives higher accuracy with lower RMSE values compared to other models. Compared to methods used in prior studies [12], which either struggled with accuracy or computational overhead, XGBoost has emerged as a superior, scalable solution for sales prediction in retail environments [10]. 
Compared with the Random Forest and neural network models used in prior studies, XGBoost demonstrated superior scalability, efficiency, and practicality, making it highly effective for retail sales forecasting [2]. Using a Big Mart dataset, one study compares different machine learning algorithms and shows that polynomial regression (degree 5) achieves the best performance, with the lowest Root Mean Squared Error (RMSE) (969.4954) and the highest R² score (67.24). These findings are supported by those of related work [4]. The findings of [5] improve on earlier works, such as [10], by emphasizing the effectiveness of Random Forest in handling complex retail datasets for demand forecasting. Using a dataset from Kaggle, another study compared an Artificial Neural Network (ANN) with other models and found that the ANN achieved the lowest MSE (691,652.94) and RMSE (831.66), outperforming all others. The findings of [6] support prior research, such as [2], affirming ANN’s superior accuracy in complex sales prediction tasks. Another study applies different algorithms to the “Big Mart Sales” dataset and finds that the XGBoost method exhibits the highest accuracy and efficiency. Compared to a prior study [13], ref. [7] emphasizes the efficiency and precision of XGBoost for complex sales prediction tasks. Compared to works by [13], emphasizing Random Forest, and Das et al., focusing on neural networks, [8] establishes XGBoost as a more effective solution for retail sales forecasting. A further study evaluates models like Regression, linear regression, Random Forest, and M5P, finding M5P to perform best, with the highest correlation coefficient and lowest error metrics. Compared to [11] and other works focusing on simpler models, [9] emphasizes M5P’s superior performance for sales forecasting. Using historical e-commerce sales data, another study evaluates models including linear regression, Random Forest, and Gradient-Boosted Trees (GBT), with GBT achieving the best performance (98% accuracy).
Compared to [7], ref. [10] confirms GBT’s superiority while providing a comprehensive comparison framework for forecasting methods. Using the 2018 Big Mart dataset, the study employed data cleaning, visualization, and regression analysis, achieving a predictive accuracy of 75.7% (R² = 0.757). Compared to [13], which highlighted Random Forest, [11] demonstrates the effectiveness of regression techniques for improving model interpretability and reducing multicollinearity. The authors of [12] demonstrate XGBoost’s capacity to manage intricate datasets and non-linear interactions, making it a strong option for sales prediction when compared to the methods used in [6,11,12]. The authors of [13] applied the KNN method to the Big Mart dataset and achieved an accuracy of 84.7%. Unlike [8], which favored XGBoost and LightGBM, this research highlights KNN’s effectiveness for retail sales forecasting. The authors of [14] highlight the efficiency of simpler regression methods, and their results align with those of [14] in showcasing regression’s effectiveness for sales forecasting. The authors of [15] highlight the superior predictive performance of Decision Trees with AdaBoost for retail sales forecasting.

3. Proposed Methodology

The proposed system forecasts sales using machine learning algorithms. The prior sales data of a mart are read from a CSV file and used to train a prediction model with a variety of machine learning algorithms [16,17], including Random Forest, Naïve Bayes, and KNN. Once the prediction model has been created, a CSV file containing the most recent data is used to test it, and a predicted value is returned.
The data are pre-processed as soon as the data source is provided; this involves clustering and filling in all zeros and empty cells with the column mean value. A range of prediction models, including machine learning algorithms, are then trained and tested, and the resulting prediction is provided to the user.
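The mean-filling step can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `fill_with_column_mean`, the toy `item_weight` column, and the choice to compute the mean from the non-zero, non-empty entries only are all assumptions, since the paper does not specify them.

```python
from statistics import mean

def fill_with_column_mean(values):
    """Replace empty (None) and zero entries with the column mean.

    Assumption: the mean is computed from the remaining (non-zero,
    non-empty) entries; the paper does not state this detail.
    """
    observed = [v for v in values if v is not None and v != 0]
    if not observed:
        return list(values)  # nothing to estimate the mean from
    column_mean = mean(observed)
    return [column_mean if (v is None or v == 0) else v for v in values]

# Hypothetical numeric column with one empty cell and one zero.
item_weight = [9.3, None, 17.5, 0, 8.2]
filled = fill_with_column_mean(item_weight)
print(filled)
```

Both the empty cell and the zero are replaced by the same value, the mean of the three observed entries, while the observed entries are left unchanged.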

3.1. KNN, or K-Nearest Neighbor

The K-Nearest Neighbor (KNN) algorithm is a supervised machine learning technique used to address classification and regression problems. Because of its simplicity and ease of use, KNN is a popular and adaptable machine learning technique [18,19]. Using a distance metric, such as the Euclidean distance, the KNN algorithm determines a given data point’s K nearest neighbors. The class of the data point is then established by a majority vote among the K neighbors (classification), or its value is established by their average (regression). This method enables the algorithm to adjust to various patterns and generate predictions according to the data’s local structure. The accuracy of the KNN model when used to predict Big Mart sales was 86.68%.
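The neighbor search and the two prediction rules described above can be sketched in a few lines of Python. This is a minimal illustration on made-up data, not the model trained in this study; the feature tuples, labels, and function names are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train_X, query, k):
    """Indices of the k training points closest to the query."""
    order = sorted(range(len(train_X)),
                   key=lambda i: euclidean(train_X[i], query))
    return order[:k]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: average target value of the k nearest neighbors."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / len(idx)

# Hypothetical two-feature items with a sales band and a sales value.
X = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.1), (3.2, 2.9), (0.9, 1.1)]
bands = ["low", "low", "high", "high", "low"]
sales = [100.0, 120.0, 300.0, 320.0, 90.0]

label = knn_classify(X, bands, (1.1, 1.0), k=3)
value = knn_regress(X, sales, (3.1, 3.0), k=2)
print(label, value)
```

In practice the features would usually be scaled before computing distances, since a feature with a large numeric range otherwise dominates the Euclidean distance.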

3.2. Naïve Bayes

A Naïve Bayes classifier belongs to a family of algorithms based on Bayes’ Theorem. These classifiers are commonly used in machine learning because of their ease of use and effectiveness, despite the “naive” assumption of feature independence. The Naïve Bayes method is employed to solve classification problems and is particularly well suited to text classification, where the data are high-dimensional (as each word represents one feature in the data). The classifier is probabilistic: given a set of feature values, the model estimates the probability that an instance belongs to each class. The process is as follows: (1) Firstly, import the dataset into RapidMiner. (2) Then split the data into two parts. (3) Apply the Naïve Bayes algorithm to the split data. (4) Finally, apply the model and show the output.
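To make the probabilistic rule concrete, the following Python sketch implements a categorical Naïve Bayes classifier with add-one (Laplace) smoothing. It is not the RapidMiner workflow used in this study; the smoothing choice, the function names, and the toy outlet data are assumptions introduced for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(j, label)][v] += 1
            feature_values[j].add(v)
    return class_counts, value_counts, feature_values

def predict_nb(model, row):
    """Return the class with the highest log posterior under independence."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for j, v in enumerate(row):
            count = value_counts[(j, label)][v]
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score += math.log((count + 1) / (n_c + len(feature_values[j])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical (outlet location, outlet type) rows with a sales band.
rows = [("Tier 1", "Supermarket"), ("Tier 1", "Supermarket"),
        ("Tier 3", "Grocery"), ("Tier 3", "Grocery"), ("Tier 1", "Grocery")]
bands = ["high", "high", "low", "low", "low"]

model = train_nb(rows, bands)
print(predict_nb(model, ("Tier 1", "Supermarket")))
print(predict_nb(model, ("Tier 3", "Grocery")))
```

Working in log space turns the product of per-feature likelihoods into a sum, which keeps the scores numerically stable when there are many features.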

3.3. Random Forest

The Random Forest algorithm is a powerful tree-based ensemble learning technique in machine learning. It works by creating a number of Decision Trees during the training phase. Each tree is constructed from a random subset of the dataset and considers a random subset of features at each split. This randomness introduces variability among the individual trees, reducing the risk of overfitting and improving overall prediction performance. A previous study applied different methods, including KNN, Decision Tree, and AdaBoost models, and the Decision Tree and AdaBoost models achieved the highest accuracy of 100% on the training and testing datasets [15]. The process was as follows: a. From the training set, choose K data points at random. b. Construct the Decision Tree linked to the chosen subset of data. c. Select the number N of Decision Trees you wish to construct. d. Repeat Steps a and b until N trees have been built. e. For new data points, obtain each Decision Tree’s prediction and allocate the point to the category with the most votes.
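The steps above (random sampling of data points, random feature subsets, and majority voting) can be sketched as follows. To keep the example short, depth-one trees (decision stumps) stand in for full Decision Trees, so this is a simplified illustration of the ensemble idea rather than a complete Random Forest or the model used in this study; all names, parameters, and data are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature_ids):
    """Pick the single-threshold split with the fewest training errors."""
    best = None
    for j in feature_ids:
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(y != l_lab for y in left)
                      + sum(y != r_lab for y in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    if best is None:  # no valid split: predict a constant label
        lab = majority(labels)
        return (0, float("inf"), lab, lab)
    return best[1:]

def predict_stump(stump, row):
    j, t, l_lab, r_lab = stump
    return l_lab if row[j] <= t else r_lab

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]        # sample with replacement
        feats = rng.sample(range(len(rows[0])), n_features)   # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Majority vote over the predictions of all trees."""
    return majority([predict_stump(s, row) for s in forest])

# Hypothetical (price, visibility score) rows with a sales band.
rows = [(1.0, 10.0), (1.5, 12.0), (2.0, 11.0),
        (8.0, 30.0), (8.5, 32.0), (9.0, 31.0)]
bands = ["low", "low", "low", "high", "high", "high"]

forest = fit_forest(rows, bands)
print(predict_forest(forest, (0.5, 9.0)), predict_forest(forest, (9.5, 33.0)))
```

An odd number of trees is used so that a two-class vote cannot end in a tie.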

3.4. Methodology

Several methodical procedures were used in the Big Mart sales prediction approach to preprocess data, create predictive models, and assess how well they work. The following outlines the key stages: Regarding data collection, the study’s dataset comprised sales information for several stores, including item type, outlet type, outlet location, item price, and sales numbers. This information, which formed the basis for model development, was obtained from a Big Mart sales dataset that is accessible to the public. Figure 1 shows our proposed methodology flow.

3.5. Data Preprocessing

Data preprocessing included encoding the categorical variables. Figure 2 shows the details of the dataset.
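The paper does not state which encoding scheme was used, so the following Python sketch shows two common options, label encoding and one-hot encoding, on a hypothetical outlet-type column. The function names and example values are assumptions for illustration only.

```python
def label_encode(values):
    """Map each category to an integer code (categories sorted alphabetically)."""
    categories = sorted(set(values))
    mapping = {c: i for i, c in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector, one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

# Hypothetical categorical column.
outlet_type = ["Grocery Store", "Supermarket Type1",
               "Grocery Store", "Supermarket Type2"]

codes, mapping = label_encode(outlet_type)
one_hot, columns = one_hot_encode(outlet_type)
print(codes)
print(columns, one_hot)
```

Label encoding is compact but implies an ordering among categories, which can mislead distance-based models such as KNN; one-hot encoding avoids this at the cost of more columns.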

3.6. Data Analysis

We analyzed the dataset using different techniques, such as scatter plots and box plots, to identify outliers and anomalies in the dataset.
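A box plot conventionally flags points lying more than 1.5 interquartile ranges beyond the quartiles. The Python sketch below applies that rule numerically; the 1.5 multiplier, the function name, and the sample values are assumptions, as the paper does not report the thresholds it used.

```python
from statistics import quantiles

def iqr_outliers(values, whisker=1.5):
    """Return the values outside the box-plot whiskers (Q1/Q3 -/+ whisker * IQR)."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles; the data are sorted internally
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sales figures with one anomalous value.
item_sales = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(item_sales)
print(outliers)
```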

3.6.1. Feature Selection

Feature selection is very important for machine learning algorithms because it identifies the features most relevant to the prediction of sales.

3.6.2. Model Deployment

Once the aforementioned features were selected, we applied various machine learning algorithms, including KNN and Naïve Bayes, to our dataset.

3.6.3. Model Training and Testing

The dataset was split into two parts, for training and testing, in a 70:30 ratio to ensure unbiased model evaluation.
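A 70:30 split can be sketched in Python as below. The shuffling, the fixed seed, and the function name `train_test_split` are assumptions made for reproducibility of the illustration; the paper does not describe how rows were assigned to each part.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle row indices and hold out a fraction of the rows for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test

# Ten hypothetical records, identified here by their row number.
rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Holding the test rows out before any model fitting ensures that the reported scores reflect data the model has not seen.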

3.6.4. Model Evaluation

Models were evaluated on the testing dataset using measures such as Mean Squared Error, Mean Absolute Error, and R².
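These measures follow their standard definitions, sketched below in Python together with RMSE, which appears in the literature review. The sample values are made up for illustration and are not results from this study.

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical true and predicted sales values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(mse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```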

3.6.5. Implementation and Prediction

To optimize the sales management system, the final model was put into practice to generate forecasts for new input data.

4. Results and Discussion

The performance of the KNN, Naïve Bayes, and Random Forest models on unseen data was assessed after they had been trained on the Big Mart dataset. Every model was assessed by previous users. Several metrics were used to evaluate model performance, as shown in Table 1. The results for the models’ accuracy are shown in Table 2. Figure 3 shows the classification report from the KNN model. Figure 4 shows the output tree of predictions from the ensemble Random Forest model.

5. Conclusions

In conclusion, machine learning may be used to anticipate sales in a variety of industries and can give firms important information about customer behavior, industry trends, and sales projections. Businesses may build precise models that accurately forecast future sales by using previous sales data and machine learning algorithms. This will allow them to streamline their processes, increase their revenue, and gain a competitive edge. Machine learning has enormous potential for sales prediction, and as technology develops, firms may use this tool to make data-driven decisions and adjust to shifting market conditions. Overall, machine learning-based sales prediction is a potent tool that may assist companies in staying competitive and ahead of the curve in today’s market. Three machine learning algorithms were trained in this study to forecast Big Mart sales, and they all achieved high accuracy.

Author Contributions

M.H. conceived the study, designed the methodology, and supervised the overall research process. A.M. performed the data collection, preprocessing, and carried out the experiments. I.Y. contributed to data analysis, interpretation of results, and validation of the findings. M.H. and A.M. drafted the initial manuscript, while I.Y. reviewed and refined the paper to improve clarity and technical depth. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Swetha, T.; Roopa, R.; Sajitha, T.; Vidhyashree, B.; Sravani, J.; Praveen, B. Forecasting online shoppers purchase intentions with cat boost classifier. In Proceedings of the 2024 International Conference on Distributed Computing and Optimization Techniques (ICDCOT), Bengaluru, India, 15–16 March 2024; pp. 1–6.
  2. Ibrahim, A.; Irawan, E.; Dewi, N.K.; Filaresy, R. The implementation of supply chain management and big data to accelerate stock order in mega drug store. J. Phys. Conf. Ser. 2019, 1196, 012005.
  3. Kilimci, Z.H.; Akyuz, A.O.; Uysal, M.; Akyokus, S.; Uysal, M.O.; Atak Bulbul, B.; Ekmis, M.A. An improved demand forecasting model using deep learning approach and proposed decision integration strategy for supply chain. Complexity 2019, 2019, 9067367.
  4. Gurnani, M.; Korke, Y.; Shah, P.; Udmale, S.; Sambhe, V.; Bhirud, S. Forecasting of sales by using fusion of machine learning techniques. In Proceedings of the 2017 International Conference on Data Management, Analytics and Innovation (ICDMAI), Pune, India, 24–26 February 2017; pp. 93–101.
  5. Ali, N.; Shah, W. Predicting Retail Sales for Walmart: A Comprehensive Study Integrating Machine Learning and Time Series Models. 2024. Available online: https://www.researchgate.net/profile/Faisal-Zaheer/publication/385518162_Predicting_Retail_Sales_for_Walmart_A_Comprehensive_Study_Integrating_Machine_Learning_and_Time_Series_Models/links/6728e1a277f274616d5c39f8/Predicting-Retail-Sales-for-Walmart-A-Comprehensive-Study-Integrating-Machine-Learning-and-Time-Series-Models.pdf (accessed on 15 September 2025).
  6. Kadam, V.S.; Kanhere, S.; Mahindrakar, S. Regression techniques in machine learning & applications: A review. Int. J. Res. Appl. Sci. Eng. Technol. 2020, 8, 826–830.
  7. Cheriyan, S.; Ibrahim, S.; Mohanan, S.; Treesa, S. Intelligent sales prediction using machine learning techniques. In Proceedings of the 2018 International Conference on Computing, Electronics & Communications Engineering (iCCECE), Southend, UK, 16–17 August 2018; pp. 53–58.
  8. Feng, T.; Niu, C.; Song, Y. Short Term E-commerce Sales Forecast Method Based on Machine Learning Models. In Proceedings of the 2022 6th International Seminar on Education, Management and Social Sciences (ISEMSS 2022), Chongqing, China, 15–17 July 2022; pp. 1020–1030.
  9. Jiang, K. Effective e-commerce price prediction with machine learning technologies. In Proceedings of the International Conference on Algorithms, Software Engineering, and Network Security, Nanchang, China, 26–28 April 2024; pp. 603–608.
  10. Chen, I.F.; Lu, C.J. Sales forecasting by combining clustering and machine-learning techniques for computer retailing. Neural Comput. Appl. 2017, 28, 2633–2647.
  11. Punam, K.; Pamula, R.; Jain, P.K. A two-level statistical model for big mart sales prediction. In Proceedings of the 2018 International Conference on Computing, Power and Communication Technologies (GUCON), Greater Noida, India, 28–29 September 2018; pp. 617–620.
  12. Thivakaran, T.K.; Ramesh, M. Exploratory Data analysis and sales forecasting of bigmart dataset using supervised and ANN algorithms. Meas. Sens. 2022, 23, 100388.
  13. Mondal, S.; Debbarma, A.; Prakash, B. Big mart sales prediction using machine learning. In Proceedings of the 2024 10th International Conference on Communication and Signal Processing (ICCSP), Melmaruvathur, India, 12–14 April 2024; pp. 742–747.
  14. Wen, K.Y.; Joseph, M.H.; Sivakumar, V. Big Mart Sales Prediction using Machine Learning. EAI Endorsed Trans. Internet Things 2024, 10.
  15. Agbonlahor, O.V. A Comparative Study on Machine Learning and Deep Learning Techniques For predicting Big Mart Item Outlet Sales. Doctoral Dissertation, Dublin Business School, Dublin, Ireland, 2020.
  16. Ashfaq, F.; Jhanjhi, N.Z.; Khan, N.A.; Das, S.R. Synthetic crime scene generation using deep generative networks. In Proceedings of the International Conference on Mathematical Modeling and Computational Science, Madurai, India, 24–25 February 2023; Springer Nature: Singapore, 2023; pp. 513–523.
  17. Diwaker, C.; Tomar, P.; Solanki, A.; Nayyar, A.; Jhanjhi, N.Z.; Abdullah, A.; Supramaniam, M.A. A New Model for Predicting Component-Based Software Reliability Using Soft Computing. IEEE Access 2019, 7, 147191–147203.
  18. Kok, S.H.; Abdullah, A.; Jhanjhi, N.Z.; Supramaniam, M.A. A Review of Intrusion Detection System Using Machine Learning Approach. Int. J. Eng. Res. Technol. 2019, 12, 8–15.
  19. Ahmed, S.; Hossain, M.A.; Bhuiyan, M.M.I.; Ray, S.K. A Comparative Study of Machine Learning Algorithms to Predict Road Accident Severity. In Proceedings of the 2021 International Conferences: IUCC, CIT, DSCI, SmartCNS, London, UK, 20–22 December 2021; pp. 390–397.
Figure 1. Methodology.
Figure 2. Datasets.
Figure 3. Classification Report for KNN.
Figure 4. Random Forest.
Table 1. Evaluation of models’ performance.
| Models | Model Accuracy |
|---|---|
| KNN | 86.68% |
| Naïve Bayes | 95.56% |
| Random Forest | 100% |
Table 2. All attributes of PhiUSIIL phishing URL (website).
| Authors | Years | Dataset | Classifier | Accuracy |
|---|---|---|---|---|
| Gopal Behera and Neeta Nain | 2020 | Big Mart sales dataset | XGBoost | - |
| Rao Faizan Ali and Amgad Muneer | 2023 | Big Mart sales dataset | RF, RR, LR, DT | Outperformed |
| Punam Kumari, Rakesh Pamula, and Praveen Kumar Jain | 2018 | Big Mart sales data | RF, MLR | - |
| Imran Bin Ibrahim and Syed Adnan | 2023 | Big Mart sales data | XGBoost, LR, PR, RR | Outperformed |
| Artika Arista, Theresiawati Theresiawati, and Henki Bayu Seta | 2024 | Big Mart sales data | XGBoost, LR, PR, RR | - |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Husban, M.; Mir, A.; Yustiana, I. Predicting Big Mart Sales with Machine Learning. Eng. Proc. 2025, 107, 95. https://doi.org/10.3390/engproc2025107095
