Machine Learning in CNC Machining: Best Practices

Abstract: Building machine learning (ML) tools, or systems, for use in manufacturing environments is a challenge that extends far beyond the understanding of the ML algorithm. Yet, these challenges, outside of the algorithm, are less discussed in literature. Therefore, the purpose of this work is to practically illustrate several best practices, and challenges, discovered while building an ML system to detect tool wear in metal CNC machining. Namely, one should focus on the data infrastructure first; begin modeling with simple models; be cognizant of data leakage; use open-source software; and leverage advances in computational power. The ML system developed in this work is built upon classical ML algorithms and is applied to a real-world manufacturing CNC dataset. The best-performing random forest model on the CNC dataset achieves a true positive rate (sensitivity) of 90.3% and a true negative rate (specificity) of 98.3%. The results are suitable for deployment in a production environment and demonstrate the practicality of the classical ML algorithms and techniques used. The system is also tested on the publicly available UC Berkeley milling dataset. All the code is available online so others can reproduce and learn from the results.


Introduction
Machine learning (ML) is proliferating throughout society and business. However, much of today's published ML research is focused on the machine learning algorithm. Yet, as Chip Huyen notes, the machine learning algorithm "is only a small part of an ML system in production" [1]. Building and then deploying ML systems (or applications) into complex real-world environments requires considerable engineering acumen and knowledge that extend far beyond the machine learning code, or algorithm, as shown in Figure 1 [2].
The machine learning system is also tested on the widely used UC Berkeley milling dataset [14]. All the code is made publicly available so that others can reproduce the results. However, due to the proprietary nature of the CNC dataset, we have only made the CNC feature dataset publicly available.
Undoubtedly, there are many more "best practices" relevant to deploying machine learning systems within manufacturing. In this work, we share our learnings, failures, and the best practices that were discovered while building ML tools within the important manufacturing domain.

Dataset Descriptions

UC Berkeley Milling Dataset
The UC Berkeley milling dataset contains 16 cases of milling tools performing cuts in metal [14]. Three cutting parameters were varied in the creation of the data: the metal type (either cast iron or steel), the feed rate (either 0.25 mm/rev or 0.5 mm/rev), and the depth of cut (either 0.75 mm or 1.5 mm). Each case is a unique combination of the cutting parameters (for example, case one has a depth of cut of 1.5 mm, a feed rate of 0.5 mm/rev, and is performed on cast iron). The cases progress from individual cuts representing the tool when healthy, to degraded, and then worn. There are 165 cuts amongst all 16 cases. Two additional cuts are not considered due to data corruption. Table A1, in Appendix A, shows the cutting parameters used for each case.
Figure 2 illustrates a milling tool and its cutting inserts working on a piece of metal. A measure of flank wear (VB) on the milling tool inserts was taken for most cuts in the dataset. Figure 3 shows the flank wear on a tool insert. Interested readers are encouraged to consult the Modern Tribology Handbook for more information [15].

Figure 3. Flank wear on a tool insert (perspective and front view). VB is the measure of flank wear. (Image from author.)

Six signal types were collected during each cut: acoustic emission (AE) signals from the spindle and table; vibration from the spindle and table; and AC/DC current from the spindle motor. The signals were collected at a sampling rate of 250 Hz, and each cut has 9000 sample points, for a total signal length of 36 s. All the cuts were organized in a structured MATLAB array, as described by the authors of the dataset. Figure 4 shows a representative sample of a single cut. Each cut has a region of stable cutting, that is, where the tool is at its desired speed and feed rate and fully engaged in cutting the metal. For the cut in Figure 4, the stable cutting region begins at approximately 7 s and ends at approximately 29 s, when the tool leaves the metal it is machining.

CNC Industrial Dataset
Industrial CNC data, from a manufacturer involved in the metal machining of small ball-valves, were collected over a period of 27 days. The dataset represents the manufacturing of 5600 parts across a wide range of metal materials and cutting parameters. The dataset was also accompanied by tool change data, annotated by the operator of the CNC machine. These annotations indicated the time the tools were changed, along with the reason for the tool change (either the tool broke, or the tool was changed due to wear).
A variety of tools were used in the manufacturing of the parts. Disposable tool inserts, such as that shown in Figure 3, were used to make the cuts. The roughing tool, and its insert, was changed most often due to wear, and thus is the focus of this study.
The CNC data, like the milling data, can also be grouped into different cases. Each case represents a unique roughing tool insert. Of the 35 cases in the dataset, 11 terminated in a worn tool insert as identified by the operator. The remaining cases had the data collection stopped before the insert was worn, or the insert was replaced for another reason, such as breakage.
Spindle motor current was the primary signal collected from the CNC machine. Using motor current within machinery health monitoring (MHM) is widespread and has been shown to be effective in tool condition monitoring [16,17]. In addition, monitoring spindle current is a low-cost and unobtrusive method, and thus ideal for an active industrial environment.
Finally, the data were collected from the CNC machine's control system using software provided by the equipment manufacturer. For the duration of each cut, the current, the tool being used, and the times when the tool was engaged in cutting the metal were recorded. The data were collected at 1000 Hz. Figure 5, below, is an example of one such cut from the roughing tool. The shaded area in the figure represents the approximate time when the tool was cutting the metal. We refer to each shaded area as a sub-cut.

Milling Data Preprocessing
Each of the 165 cuts from the milling dataset was labeled as healthy, degraded, or failed, according to its health state (amount of wear) at the end of the cut. The labeling schema is shown in Table 2 and follows the labeling strategy of other researchers in the field [18]. For some of the cuts, a flank wear value was not provided. In such cases, a simple interpolation was made between the nearest cuts with defined flank wear values. Next, the stable cutting interval for each cut was selected. The interval varies based on when the tool engages with the metal. Thus, visual inspection was used to select the approximate region of stable cutting.
For each of the 165 cuts, a sliding window of 1024 data points, or approximately 4 s of data, was applied. The stride of the window was set to 64 points as a simple data-augmentation technique. Each windowed sub-cut was then appropriately labeled (either healthy, degraded, or failed). These data preprocessing steps were implemented with the open-source PyPHM package and can be readily reproduced [19].
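The windowing step can be sketched in a few lines; the helper below is our own illustration (not the PyPHM implementation), using the window and stride values from the text:

```python
import numpy as np

def window_signal(signal, window=1024, stride=64):
    """Split a 1-D signal into overlapping windows.

    A stride smaller than the window length yields overlapping sub-cuts,
    multiplying the number of training samples (simple data augmentation)."""
    n_windows = (len(signal) - window) // stride + 1
    return np.stack([signal[i * stride : i * stride + window]
                     for i in range(n_windows)])

# Stand-in for one milling cut: 9000 points sampled at 250 Hz (36 s).
cut = np.random.default_rng(0).normal(size=9000)
sub_cuts = window_signal(cut)
print(sub_cuts.shape)  # (125, 1024)
```

Each row of the result is one windowed sub-cut that then receives the label of its parent cut.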
In total, 9040 sub-cuts were created. Table 2 also shows the percentage of sub-cuts by label. The healthy and degraded labels were merged into a single "healthy" class label (with a value of 0) in order to create a binary classification problem.

CNC Data Preprocessing
As noted in Section 2.2, each part manufactured is made from multiple cuts across different tools. Here, we only considered the roughing tool for further analysis. The roughing tool experienced the most frequent tool changes due to wear.
Each sub-cut, as shown in Figure 5, was extracted and given a unique identifier. The sub-cuts were then labeled either healthy (0) or failed (1). If a tool was changed due to wear, the prior 15 cuts were labeled as failed. Cuts with tool breakage were removed from the dataset. Table 3, below, shows the cut and sub-cut count and the percentage breakdown by label. In total, there were 5503 complete cuts performed by the roughing tool.

Feature Engineering
Automated feature extraction was performed using the tsfresh open-source library [20]. The case for automated feature extraction continues to grow as computing power becomes more abundant [21]. In addition, the use of an open-source feature extraction library, such as tsfresh, saves time by removing the need to re-implement code for common feature extraction or data-processing techniques.
The tsfresh library comes with a wide variety of time-series feature engineering techniques, and new techniques are regularly added by the community. The techniques vary from simple statistical measures (e.g., standard deviations) to Fourier analysis (e.g., FFT coefficients). The library has been used for feature engineering across industrial applications. Unterberg et al. utilized tsfresh in an exploratory analysis of tool wear during sheet-metal blanking [22]. Sendlbeck et al. built a machine learning model to predict gear wear rates using the library [23]. Gurav et al. also generated features with tsfresh in their experiments mimicking an industrial water system [24].
In this work, 38 unique feature methods from tsfresh were used to generate features. Table 4 lists a selection of these features. In total, 767 features were created on the CNC dataset, and 4530 features, across all six signals, were created on the milling dataset.
After feature engineering, and the splitting of the data into training and testing sets, the features were scaled using the minimum and maximum values from the training set. Alternatively, standard scaling was applied, whereby the mean of a feature, across all samples, was subtracted and the result then divided by the feature's standard deviation.
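The key detail is that the scaler is fit on the training set only; a minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(2)
X_train, X_test = rng.normal(size=(80, 5)), rng.normal(size=(20, 5))

# Fit on the training set ONLY; the test set is transformed with the
# training set's minimum/maximum to avoid leaking test information.
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
# Test-set values may fall outside [0, 1]; that is expected, not a bug.
```

The same pattern applies to StandardScaler for the standard-scaling alternative.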

Feature Selection
The large number of features generated through automated feature extraction necessitates a method of feature selection. Although it is possible to use all the features for training a machine learning model, it is highly inefficient. Features may be highly correlated with others, and some features will contain minimal informational value. Even more, in a production environment, it is unrealistic to generate hundreds, or thousands, of features for each new sample. This is particularly important if one is interested in real-time prediction.
Two types of feature selection were used in this work. First, and most simply, a certain number of features were selected at random. These features were then used in a random search process (discussed further in Section 4) for the training of machine learning models. Through this process, only the most beneficial features would yield suitable results.
The second type of feature selection leverages the inbuilt selection method within tsfresh. The tsfresh library implements the "FRESH" algorithm, standing for feature extraction based on scalable hypothesis tests. In short, a hypothesis test is conducted for each feature to determine if the feature has relevance in predicting a value. In our case, the predicted value is whether the tool is in a healthy or failed state. Following the hypothesis testing, the features are ranked by p-value, and only those features below a certain p-value are considered useful. A random subset of these useful features is then selected. Full details of the FRESH algorithm are provided in the original paper [20].
Finally, feature selection must only be conducted on the training dataset, as opposed to the full dataset. This is done to avoid data leakage, which is further discussed in Section 6.

Over and Under-Sampling
Both the CNC and milling datasets are highly imbalanced; that is, there are far more "healthy" samples in the dataset than "failed" ones. The class imbalance can lead to problems in training machine learning models when there are not enough examples of the minority (failed) class.
Over- and under-sampling are used to address class imbalance and improve the performance of machine learning models trained on imbalanced data. Over-sampling is when examples from the minority class (the failed samples in the CNC and milling datasets) are copied back into the dataset to increase the size of the minority class. Under-sampling is the reverse: examples from the majority class are removed from the dataset.
Nine different variants of over- and under-sampling were tested on the CNC and milling datasets and were implemented using the imbalanced-learn (https://github.com/scikit-learn-contrib/imbalanced-learn, accessed on 21 July 2022) software package [25]. The variants, with a brief description, are listed in Table 5. Generally, over-sampling was performed, followed by under-sampling, to achieve a relatively balanced dataset. As with the feature selection, the over- and under-sampling was only performed on the training dataset.

Table 5. The over- and under-sampling variants tested, with brief descriptions.

• Random Over-sampling (over-sampling): Samples from the minority class are randomly duplicated.
• Random Under-sampling (under-sampling): Samples from the majority class are randomly removed.
• SMOTE (Synthetic Minority Over-sampling Technique) [26] (over-sampling): Synthetic samples are created from the minority class by interpolating between nearby data points.
• ADASYN (Adaptive Synthetic sampling approach for imbalanced learning) [27] (over-sampling): Similar to SMOTE, but the number of samples generated is proportional to the data distribution.
• SMOTE-ENN [28] (over- and under-sampling): SMOTE is performed for over-sampling. Majority-class data points are then removed if n of their neighbours are from the minority class.
• SMOTE-TOMEK [29] (over- and under-sampling): SMOTE is performed for over-sampling. Two data points, from differing classes, that are nearest to each other form a Tomek link; Tomek-link data points are then removed for under-sampling.
• Borderline-SMOTE [30] (over-sampling): Like SMOTE, but only samples near the class boundary are over-sampled.
• K-Means SMOTE [31] (over-sampling): Clusters of minority samples are identified with k-means; SMOTE is then used for over-sampling within the identified clusters.
• SVM SMOTE [32] (over-sampling): The class boundary is determined through the SVM algorithm; new samples are generated by SMOTE along the boundary.

Machine Learning Models
Eight classical machine learning models were tested in the experiments, namely: the Gaussian naïve-Bayes classifier, the logistic regression classifier, the linear ridge regression classifier, the linear stochastic gradient descent (SGD) classifier, the support vector machine (SVM) classifier, the k-nearest-neighbors classifier, the random forest (RF) classifier, and the gradient boosted machines classifier.
The models range from simple, such as the Gaussian naïve-Bayes classifier, to more complex, such as the gradient boosted machines. All these models can be readily implemented on a desktop computer. Further benefits of these models are discussed in Section 6.
These machine learning models are commonplace, and as such, the algorithm details are not covered in this work. All the algorithms, except for gradient boosted machines, were implemented with the scikit-learn machine learning library in Python [33]. The gradient boosted machines were implemented with the Python XGBoost library [34].

Experiment
The experiments on the CNC and milling datasets were conducted using the Python programming language. Many open-source software libraries were used, in addition to the tsfresh, scikit-learn, and XGBoost libraries listed above. NumPy [35] and SciPy [36] were used for data preprocessing and the calculation of evaluation metrics. Pandas, a tool for manipulating numerical tables, was used for recording results [37]. PyPHM, a library for accessing and preprocessing industrial datasets, was used for downloading and preprocessing the milling dataset [19]. Matplotlib was used for generating figures [38].
The training of the machine learning models in a random search, as described below, was performed on a high-performance computer (HPC). However, training of the models can also be performed on a local desktop computer. To that end, all the code from the experiments is available online. The results can be readily reproduced, either online through GitHub, or by downloading the code to a local computer. The raw CNC data are not available due to their proprietary nature. However, the generated features, as described in Section 3, are available for download.

Random Search
As noted, a random search was conducted to find the best model, and parameters, for detecting failed tools on the CNC and milling datasets. A random search is generally more effective at finding good parameters than a deterministic grid search [39].
Figure 6 illustrates the random search process on the CNC dataset. After the features are created, as seen in step one, the parameters for a random search iteration are randomly selected. A more complete list of parameters, used for both the CNC and milling datasets, is found in Appendix A. The parameters are then used in a k-folds cross-validation process in order to minimize over-fitting, as seen in steps three through six. Thousands of random search iterations can be run across a wide variety of models and parameters. For the milling dataset, seven folds were used in the cross-validation. To ensure independence between samples in the training and testing sets, the dataset was grouped by case (16 cases total). Stratification was also used to ensure that, in each of the seven folds, at least one case where the tool failed was in the testing set. Only seven cases had a tool failure (where the tool was fully worn out), and thus the maximum number of folds for the milling dataset is seven.
Ten-fold cross-validation was used on the CNC dataset. As with the milling dataset, the CNC dataset was grouped by case (35 cases) and stratified.
As discussed above, in Section 3, data preprocessing, such as scaling or over-/under-sampling, was conducted after the data were split, as shown in steps three and four. Training of the model was then conducted, using the split and preprocessed data, as shown in step five. Finally, the model could be evaluated, as discussed below.

Metrics for Evaluation
A variety of metrics can be used to evaluate the performance of machine learning models. Measuring the precision-recall area under curve (PR-AUC) is recognized as a suitable metric for binary classification on imbalanced data and, as such, is used in this work [40,41]. In addition, the PR-AUC is agnostic to the final decision threshold, which may be important in applications where the recall is much more important than the precision, or vice versa. Figure 7 illustrates how the precision-recall curve is created.
After each model is trained in a fold, the PR-AUC is calculated on that fold's hold-out test data. The PR-AUC scores can then be averaged across each of the folds. In this work, we also rely on the PR-AUC from the worst-performing fold. The worst-performing fold can provide a lower bound of the model's performance, and as such, a more realistic impression of the model's performance.
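A small sketch of computing a PR-AUC-style score with scikit-learn's average_precision_score (a common single-number summary of the precision-recall curve), on synthetic scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, size=500)
# Scores correlated with the label stand in for model probabilities;
# here the two classes' score ranges do not overlap at all.
y_score = y_true * 0.6 + rng.uniform(size=500) * 0.4

# Threshold-agnostic summary of the precision-recall curve.
pr_auc = average_precision_score(y_true, y_score)
print(round(pr_auc, 3))  # 1.0 for these perfectly separated scores
```

In the cross-validation loop, this score would be computed once per fold on the hold-out data and then averaged, with the minimum kept as the lower bound.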

Results
In total, 73,274 and 230,859 models were trained on the milling and CNC datasets, respectively. The top-performing models, based on average PR-AUC, were selected and then analyzed further.
Figures 8 and 9 show the ranking of the models for the milling and CNC datasets, respectively. In both cases, the random forest (RF) model outperformed the others. The parameters of these RF models are also shown, below, in Tables 6 and 7. The features used in each RF model are displayed in Figures 10 and 11. The figures also show the relative feature importance by F1 score decrease. Figure 12 shows how the top six features from the CNC model trend over time. Clearly, the top-ranked feature (the index mass quantile on sub-cut 4) has the strongest trend. The full details, for all the models, are available in Appendix A and in the online repository.
The PR-AUC score is an abstract metric that can be difficult to translate into real-world performance. To provide additional context, we took the worst-performing model in the k-fold and selected the decision threshold that maximized its F1 score. The formula for the F1 score is as follows:

F1 = 2TP / (2TP + FP + FN)

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.
The true positive rate (sensitivity), the true negative rate (specificity), the false negative rate (miss rate), and the false positive rate (fall-out) were then calculated with the optimized threshold. Table 7 shows these metrics for the best-performing random forest model, using its worst k-fold.
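Selecting the F1-maximizing threshold can be done by sweeping the precision-recall curve; the snippet below is a generic sketch on synthetic scores, not the authors' exact procedure:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(6)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(y_true * 0.5 + rng.normal(scale=0.3, size=500), 0, 1)

# Sweep all thresholds along the PR curve and keep the F1-maximizing one.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = np.argmax(f1[:-1])  # the last PR point has no associated threshold
print(f"threshold={thresholds[best]:.3f}, F1={f1[best]:.3f}")
```

The chosen threshold is then used to compute the sensitivity, specificity, miss rate, and fall-out reported in Table 7.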
To further illustrate, consider 1000 parts manufactured on the CNC machine. We know, from Table 3, that approximately 27 (2.7%) of these parts will be made using worn (failed) tools. The RF model will properly classify 24 of the 27 cuts as worn (the true positive rate). Of the 973 parts manufactured using healthy tools, 960 will be properly classified as healthy (the true negative rate).

Analysis, Shortcomings, and Recommendations
Figure 13 shows the precision-recall (PR) and receiver operating characteristic (ROC) curves for the milling random forest model. These curves help in understanding the results shown in the dot plots of Figure 8. The precision-recall curve for the milling RF model shows that the models across all 7 k-folds give strong results. Each of the curves from the k-folds is pushed to the top right, and as shown in Table 8, even the worst-performing fold achieves a true positive rate of 97.3%. The precision-recall curve from the CNC RF model, as shown in Figure 14, shows greater variance between the models trained in the 10 k-folds. The worst-performing fold obtains a true positive rate of 90.3%.
There are several reasons for the difference in model performance between the milling and CNC datasets. First, each milling sub-cut has six different signals available for use (AC/DC current, vibration from the spindle and table, and acoustic emissions from the spindle and table). Conversely, the CNC model can only use the current from the CNC spindle. The additional signals in the milling data provide increased information for machine learning models to learn from.
Second, the CNC dataset is more complicated. The tools on the CNC machine are changed when the operator notices a degradation in part quality. However, individual operators will have different thresholds, and cues, for changing tools. In addition, there are multiple parts manufactured in the dataset, across a wide variety of metals and dimensions. In short, the CNC dataset reflects the conditions of a real-world manufacturing environment, with all the "messiness" that entails. As such, the models trained on the CNC data cannot achieve results as high as those on the milling dataset. In contrast, the milling dataset is from a carefully controlled laboratory environment. Consequently, there is less variety between cuts in the milling dataset than in the CNC dataset. The milling dataset is more homogeneous, and the homogeneity allows the models to understand and model the data distribution more easily.
Third, the milling dataset is smaller than the CNC dataset. The milling dataset has 16 different cases, but only 7 of the cases have a tool that becomes fully worn. The CNC dataset has 35 cases, and of those cases, 11 contain a fully worn tool. The diminished size of the milling dataset, again, makes it easier for the models to fit the data. As noted by others, many publicly available industrial datasets are small, thus making it difficult for researchers to produce generalizable results [42,43]. The UC Berkeley milling dataset suffers from similar problems.
Finally, models trained on small datasets, even with cross-validation, can be susceptible to overfitting [44]. Furthermore, high-powered models, such as random forests or gradient boosted machines, are more likely to exhibit a higher variance. The high variance, and overfitting, may give the impression that the model is performing well across all k-folds, but if the data are changed, even slightly, the model performs poorly.
Overall, the CNC dataset is of higher quality than the milling dataset; however, it too suffers from its relatively small size. We posit that similar results could be achieved with only a few cuts from each of the 35 cases. In essence, the marginal benefit of additional cuts in a case rapidly diminishes past the first few, since they are all similar. This hypothesis would be of interest for further research.
The results from the CNC dataset are positive, and the lower bound of the model's performance approaches acceptability. We believe that collecting more data will greatly improve results. Ultimately, the constraint to creating production-ready ML systems is not the type of algorithm, but rather, the lack of data. We further discuss this in the Best Practices section below.

Focus on the Data Infrastructure First
In 2017, Monica Rogati coined the "data science hierarchy of needs" as a play on the well-known Maslow's hierarchy of needs. Rogati details how the success of a data science project, or ML system, is predicated on a strong data foundation. Having a data infrastructure that can reliably collect, transform, and store data is a prerequisite to upstream tasks, such as data exploration or machine learning [45]. Figure 15 illustrates this hierarchy.
Within the broader machine learning community, there is a growing acknowledgment of the benefits of a strong data infrastructure. Andrew Ng, a well-known machine learning educator and entrepreneur, has expressed the importance of data infrastructure through his articulation of "data-centric AI" [46]. Within data-centric AI, there is a recognition that outsized benefits can be obtained by improving the data quality first, rather than improving the machine learning model. As an example of this data-centric approach, consider the OpenAI research team. Recently, they made dramatic advances in speech recognition that were predicated on the data infrastructure. They used simple heuristics to remove "messy" samples, all the while using off-the-shelf machine learning models. More broadly, the nascent field of machine learning operations (MLOps) has arisen as a means of formalizing the engineering acumen required in building ML systems. The data infrastructure is a large part of MLOps [1,7].
In this work, we built the top four tiers of the data science hierarchy pyramid, as shown in Figure 15. However, although part of the data infrastructure was built (the extract-transform-load (ETL) portion), much of the data infrastructure was outside of the research team's control. A system to autonomously collect CNC data was not implemented, and as such, far less data were collected than desired. Over a one-year period, data were manually collected for 27 days, which led to the recording of 11 roughing tool failures. Yet, over that same one-year period, there were an additional 79 cases where the roughing tool failed but no data were collected.
Focusing on the data infrastructure first (the bottom two layers of the pyramid) builds for future success. In a real-world setting, as in manufacturing, the quality of the data will play an outsized role in the success of the ML application being developed. As shown in the next section, even simple models, coupled with good data, can yield excellent results.

Start with Simple Models
The rise of deep learning has led to much focus, from researchers and industry, on its application in manufacturing. However, as shown in the data science hierarchy of needs in Figure 15, it is best to start with "simple", classical ML models. The work presented here relied on these classical ML models, from naïve Bayes to random forests. These models still achieved positive results.
There are several reasons to start with simple models. Chief among them, simple models allow for quicker iteration time. This allows users to rapidly "demonstrate [the] practical benefits" of an approach, and subsequently, avoid less-productive approaches [7].
The benefits of, and even preference for, simple models are becoming recognized within the research and MLOps communities. Already in 2006, David Hand noted that "simple methods typically yield performance almost as good as more sophisticated methods" [47]. In fact, more complicated methods can lead to over-optimization. Others have shown that tree-based models still outperform deep-learning approaches on tabular data [48,49]. Tabular data and tree-based models were both used in this study.
Finally, Shankar et al. recently interviewed 18 machine learning engineers, across a variety of companies, in an insightful study on operationalizing ML in real-world applications. They noted that most of the engineers prefer the use of simple machine learning algorithms over more complex approaches [7].

Beware of Data Leakage
Data leakage occurs when information from the target domain (such as the label information on the health state of a tool) is introduced, often unintentionally, into the training dataset. The data leakage produces results that are far too optimistic, and ultimately, useless. Unfortunately, data leakage is difficult to detect for those who are unwary or untrained. Kaufman et al. summarized the problem succinctly: "In practice, the introduction of this illegitimate information is unintentional, and facilitated by the data collection, aggregation and preparation process. It is usually subtle and indirect, making it very hard to detect and eliminate" [50]. We observed many cases of data leakage in peer-reviewed literature, both from within manufacturing and more broadly. Data leakage, sadly, is too common across many fields where machine learning is employed [50].
Introducing data leakage into a real-world manufacturing environment will cause the ML system to fail. As such, individuals seeking to employ ML in manufacturing should be cognizant of the common data leakage pitfalls. Here, we explore several of these pitfalls with examples from manufacturing. We adopted the taxonomy from Kapoor et al. and encourage interested readers to view their paper on the topic [51].
• Type 1 (preprocessing on the training and test set): Preprocessing techniques, like scaling, normalization, or under-/over-sampling, must only be applied after the dataset has been split into training and testing sets. In our experiment, as noted in Section 3, these preprocessing techniques were performed after the data were split in the k-fold.
• Type 2 (feature selection on the training and test set): This form of data leakage occurs when features are selected using the entire dataset at once. By performing feature selection over the entire dataset, additional information will be introduced into the testing set that should not be present. Feature selection should only occur after the train/validation/testing sets are created.
• Type 3 (temporal leakage): Temporal data leakage occurs, on time-series data, when the training set includes information from a future event that is to be predicted. As an example, consider case 13 in the milling dataset. Case 13 consists of 15 cuts: ten of these cuts are when the tool is healthy, and five are when the tool is worn. If the cuts from the milling dataset (165 cuts in total) are randomly split into the training and testing sets, then some of the "worn" cuts from case 13 will be in both the training and testing sets. Data leakage will occur, and the results from the experiment will be too optimistic. In our experiments, we avoided this by splitting the datasets by case, as opposed to individual cuts.
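One practical way to guard against Type 1 and Type 2 leakage is to wrap preprocessing, feature selection, and the model in a single scikit-learn Pipeline, so each step is re-fit inside every fold; grouped splitting then guards against Type 3. A sketch on synthetic data (the group structure is hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=30, random_state=0)
groups = np.repeat(np.arange(15), 20)  # e.g. 15 cases, 20 sub-cuts each

# Scaling (Type 1) and feature selection (Type 2) live inside the
# Pipeline, so both are re-fit on the training portion of every fold;
# splitting by group avoids Type 3 (temporal) leakage across cases.
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", RandomForestClassifier(random_state=0)),
])
scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5),
                         groups=groups)
print(scores.round(2))
```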

Use Open-Source Software
The open-source software movement has consistently produced "category-killing software" across a broad spectrum of fields [52]. Open-source software is ubiquitous in all aspects of computing, from mobile phones to web browsers, and certainly within machine learning.
Table 9, below, lists several of these open-source software packages that are relevant to building modern ML systems.These software packages are also, predominantly, built using the open-source Python programming language.Python, as a general-purpose language, is easy to understand and is one of the most popular programming languages in existence [53].
The popularity of Python, combined with high-quality open-source software packages, such as those in Table 9, only attracts more data scientists and ML practitioners.Some of these individuals, in the ethos of open-source, improve the software further.Others create instructional content, share their code (such as we have with this research), or simply discuss their challenges with the software.All this creates a dominant network effect; that is, the more users that adopt the open-source Python ML software, the more attractive these tools become to others.Today, Python, and its open-source tools, are dominant within the machine learning space [54].
Table 9. Several popular open-source machine learning, and related, libraries. All these applications are written in Python.
NumPy [35]: Comprehensive mathematical software package. Supports large multidimensional arrays and matrices.
TensorFlow [56]: Popular deep learning framework, originally created by Google.
Ultimately, using these open-source software packages greatly improves productivity. In our work, we began building our own feature engineering pipeline. However, we soon realized the complexity of that task. As a result, we utilized the open-source tsfresh library to implement the feature engineering pipeline, thus saving countless hours of development time. Individuals looking to build ML systems should consider open-source software first before building their own tools or using proprietary software.

Leverage Advances in Computational Power
The rise of deep learning has coincided with a dramatic increase in computational power. Rich Sutton, a prominent machine learning researcher, argued in 2019 that "the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin" [57]. Fortunately, it is easier than ever for those building ML systems to tap into the increasing computational power available.
In this work, we utilized a high-performance computer (HPC) to perform an extensive parameter search. Such HPCs are common in academic environments and should be taken advantage of when possible. However, individuals without access to an HPC can also train many classical ML models on regular consumer GPUs. Using GPUs parallelizes the model training process. The XGBoost library allows training on GPUs, which can be integrated into a parameter search. RAPIDS has also developed a suite of open-source libraries for data analysis and training of ML models on GPUs.
Compute power will continue to increase and drop in price. This trend presents opportunities for those who can leverage it. Accelerating data preprocessing, model training, and parameter searches allows teams to iterate faster through ideas and, ultimately, build more effective ML applications.
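Even without a GPU or an HPC, a parameter search parallelizes readily across CPU cores. A minimal sketch with scikit-learn's RandomizedSearchCV on synthetic data (the parameter grid is illustrative, not the one used in this work):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative search space; the actual grids for the CNC and milling
# experiments are given in Tables 6 and 7.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

# n_jobs=-1 spreads the candidate models across all available CPU cores;
# on a multi-core HPC node this alone cuts wall-clock search time sharply.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,
    cv=3,
    n_jobs=-1,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

The same pattern scales up: GPU-capable estimators (e.g., XGBoost's GPU training, or RAPIDS cuML models) can be dropped in as the estimator while the search logic stays unchanged.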

Conclusions and Future Work
Machine learning is becoming increasingly integrated into manufacturing environments. In this work, we demonstrated an ML system used to predict tool wear on a real-world CNC machine and on the UC Berkeley milling dataset. The best-performing random forest model on the CNC dataset achieved a true positive rate (sensitivity) of 90.3% and a true negative rate (specificity) of 98.3%. Moreover, we used the results to illustrate five best practices, and learnings, gained during the construction of the ML system. Namely, one should focus on the data infrastructure first; begin modeling with simple models; be cognizant of data leakage; use open-source software; and leverage advances in computational power.
A productive direction for future work is the further build-out of the data infrastructure. Collecting more data, as noted in Section 5, would improve results and build confidence in the methods developed here. In addition, the ML system should be deployed in the production environment and iterated upon there. Finally, the sharing of challenges, learnings, and best practices should continue, and we encourage others within manufacturing to do the same. Ultimately, understanding these broader challenges and best practices will enable the efficient use of ML within the manufacturing domain.
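For reference, the sensitivity and specificity figures quoted above are derived from a confusion matrix. A small sketch with toy labels (not the paper's data), where 1 denotes a worn tool:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (1 = worn tool).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# sklearn's binary confusion matrix unravels as (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate: 2/3 here
specificity = tn / (tn + fp)  # true negative rate: 4/5 here
```

Reporting both rates, rather than raw accuracy, matters on imbalanced tool-wear data, where a model that always predicts "healthy" can score high accuracy while missing every worn tool.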

Figure 2. A milling tool is shown moving forward and cutting into a piece of metal. (Image modified from Wikipedia, public domain.)

Figure 4. The six signals from the UC Berkeley milling dataset (from cut number 146).

Figure 5. A sample cut of the roughing tool from the CNC dataset. The shaded sub-cut indices are labeled from 0 through 8 in this example. Other cuts in the dataset can have more, or fewer, sub-cuts.

Figure 6. An illustration of the random search process on the CNC dataset. (Image from author.)

Figure 7. Explanation of how the precision-recall curve is calculated. (Image from author.)

Figure 11. The 10 features used in the CNC random forest model. The features are ranked from most important to least by how much their removal would decrease the model's F1 score.

Figure 13. The PR and ROC curves for the random forest milling dataset model. The no-skill model is shown on the plots by a dashed line. The no-skill model classifies samples at random.

Figure 15. The data science hierarchy of needs. The hierarchy illustrates the importance of data infrastructure. Before more advanced methods can be employed in a data science or ML system, the lower levels, such as data collection, ETL, data storage, etc., must be satisfied. (Image used with permission from Monica Rogati at aipyramid.com, www.aipyramid.com, accessed 9 September 2022 [45].)

Table 2. The distribution of sub-cuts from the milling dataset.
State Label | Flank Wear (mm) | Number of Sub-Cuts | Percentage of Sub-Cuts

Table 3. The distribution of cuts, and sub-cuts, from the CNC dataset.

Table 4. Examples of features extracted from the CNC and milling datasets using tsfresh.

Table 5. The methods of over- and under-sampling tested in the experiments.

Table 6. The parameters used to train the RF model on the milling data.

Table 7. The parameters used to train the RF model on the CNC data.

Figure 8. The top-performing models for the milling data. The x-axis is the precision-recall area-under-curve score. The following are the abbreviations of the model names: XGBoost (extreme gradient boosted machine); KNN (k-nearest-neighbors); SVM (support vector machine); and SGD linear (stochastic gradient descent linear classifier).

Figure 9. The top-performing models for the CNC data. The x-axis is the precision-recall area-under-curve score. The following are the abbreviations of the model names: XGBoost (extreme gradient boosted machine); KNN (k-nearest-neighbors); SVM (support vector machine); and SGD linear (stochastic gradient descent linear classifier).
The 10 features used in the milling random forest model. The features are ranked from most important to least by how much their removal would decrease the model's F1 score.

Table 8. The results of the best-performing random forest models, after threshold tuning, for both the milling and CNC datasets.