For most of the twentieth century, chemistry was a data-poor discipline that relied on well-thought-out hypotheses and carefully planned experiments to develop solutions to real-world problems. With the advent of computerized multichannel instrumentation, chemistry in the twenty-first century is evolving into a data-rich field. According to Lavine and Workman [1],
this has led to a new approach to chemical problem solving: (1) measure a phenomenon or process using instrumentation to generate multivariate data inexpensively, (2) create, test, and validate models that describe the data, (3) iterate steps 1 and 2 if necessary, and (4) interpret the results to develop a fundamental understanding of, and insights into, complex multivariate phenomena or processes. Framing chemical problem solving within this paradigm capitalizes on the synergy between instruments and advanced algorithms for model development to capture the world from a multivariate perspective.
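The minimal sketch below illustrates how steps 1 through 4 might be realized in practice, using simulated multivariate data and scikit-learn's partial least squares (PLS) regression as the modeling step; the data set, model choice, and figure of merit are illustrative assumptions rather than a prescription.

```python
# A minimal sketch of the measure-model-iterate paradigm, assuming simulated
# "spectra" and scikit-learn's PLS regression as the modeling step.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Step 1: "measure" -- simulate 50 samples x 200 channels driven by one latent factor
concentration = rng.uniform(0.0, 1.0, 50)
peak = np.exp(-0.5 * ((np.arange(200) - 100) / 15.0) ** 2)
X = np.outer(concentration, peak) + 0.01 * rng.standard_normal((50, 200))

# Step 2: create, test, and validate a model that describes the data
model = PLSRegression(n_components=2)
cv_r2 = cross_val_score(model, X, concentration, cv=5, scoring="r2")
print(f"Cross-validated R^2: {cv_r2.mean():.3f}")

# Step 3: iterate steps 1 and 2 (e.g., adjust n_components) if validation is poor
# Step 4: interpret -- the first loading vector recovers the latent spectral feature
model.fit(X, concentration)
print("Dominant channel in first loading:", int(np.argmax(np.abs(model.x_loadings_[:, 0]))))
```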
Chemists have relied on inductive learning [2] for model formulation.
Inductive learning develops a universal model for all samples using a training set. Each individual prediction sample is treated as an isolated random observation that either resides inside the training set or is excluded from consideration. Under inductive learning, prediction samples do not influence model generation. However, inductive learning strategies fail when samples lie outside the training-set boundary. This problem has been described in the context of ‘uncalibrated interferents’, i.e., sample matrices and environmental effects that distort the instrumental signatures of observed samples. Multivariate methods, such as partial least squares regression or discriminant analysis, can indicate when a sample lies outside of the training set but cannot make reliable predictions for such samples. Strategies that identify local windows in the data free of these interferences, or that transform the data to isolate the signal of interest, have been investigated but show only modest improvements in prediction.
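The sketch below illustrates this limitation under stated assumptions: a PLS model trained on interferent-free simulated spectra can flag a prediction sample distorted by an uncalibrated interferent through its Q residual (squared spectral reconstruction error), but its prediction for that sample remains unreliable. The spectra, the interferent shape, and the flagging criterion are hypothetical.

```python
# A hedged sketch of the inductive-learning limitation: a PLS model trained on
# interferent-free spectra can flag an out-of-domain prediction sample through
# its Q residual, but its prediction for that sample is unreliable.
# Data, interferent shape, and threshold are illustrative assumptions.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
channels = np.arange(200)
analyte = np.exp(-0.5 * ((channels - 80) / 10.0) ** 2)
interferent = np.exp(-0.5 * ((channels - 120) / 10.0) ** 2)  # absent from training

# Training set: analyte signal only (the inductive model is built once, from these data)
y_train = rng.uniform(0.0, 1.0, 40)
X_train = np.outer(y_train, analyte) + 0.01 * rng.standard_normal((40, 200))
model = PLSRegression(n_components=1).fit(X_train, y_train)

# Prediction sample distorted by the uncalibrated interferent
x_new = 0.5 * analyte + 0.8 * interferent + 0.01 * rng.standard_normal(200)

def q_residual(x, pls):
    """Squared reconstruction error of a spectrum from its PLS scores."""
    scores = pls.transform(x.reshape(1, -1))
    x_hat = pls.inverse_transform(scores)
    return float(np.sum((x.reshape(1, -1) - x_hat) ** 2))

q_train = np.array([q_residual(x, model) for x in X_train])
print("Training Q residual, 95th percentile:", np.percentile(q_train, 95))
print("New-sample Q residual (flagged if much larger):", q_residual(x_new, model))
print("New-sample prediction (unreliable):", float(np.ravel(model.predict(x_new.reshape(1, -1)))[0]))
```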
Consequently, chemists and other physical scientists turned to transductive learning [3]
to build models around both the training and prediction sets, ensuring that the model addresses the challenges posed by both data sets. Transduction defines the learning task as predicting the correct labels for specified unlabeled test data, not for all possible future data. The model is built to be valid for the specific data it is tasked to predict; this simpler task can yield theoretically tighter bounds on the prediction error. The concept underlying transduction therefore offers a promising framework for addressing the problem of predicting samples that fall outside the training set. The key hypothesis in transductive learning is that making the specific prediction data available, though unlabeled, at the time of training improves model performance by reoptimizing, i.e., updating, the model for every future prediction sample. When the training data consist of relatively few labeled points in a high-dimensional space, the information in the unlabeled data helps prevent the classification or regression model from overfitting the training data.
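A minimal transductive sketch is given below, using scikit-learn's LabelSpreading as one example of a transductive learner; the data are simulated, and the point is only that the specific unlabeled prediction samples (marked with the label -1) are supplied at training time and thereby shape the model.

```python
# A minimal transductive sketch: the specific unlabeled prediction samples
# (labeled -1) are supplied at training time and shape the model through the
# similarity graph built over labeled and unlabeled points. Data are illustrative.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(2)

# Relatively few labeled training samples in a 10-dimensional measurement space
X_labeled = np.vstack([rng.normal(0.0, 0.3, (5, 10)), rng.normal(1.0, 0.3, (5, 10))])
y_labeled = np.array([0] * 5 + [1] * 5)

# The specific prediction set: unlabeled, but available when the model is built
X_unlabeled = np.vstack([rng.normal(0.0, 0.3, (20, 10)), rng.normal(1.0, 0.3, (20, 10))])

X_all = np.vstack([X_labeled, X_unlabeled])
y_all = np.concatenate([y_labeled, -np.ones(len(X_unlabeled), dtype=int)])

model = LabelSpreading(kernel="rbf", gamma=2.0).fit(X_all, y_all)

# transduction_ holds the labels inferred for every sample, including the
# unlabeled prediction set that influenced model construction
print(model.transduction_[len(X_labeled):])
```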
Recently, transfer learning [4],
a particular application of transductive learning, has been investigated as a way to construct better machine learning models from less training data. Specifically, transfer learning aims to improve a model's predictions on a primary task by leveraging data from one or more related auxiliary tasks. The only requirement is that the primary and auxiliary tasks be drawn from related domains. For example, predicting the enantiomeric excess of a cross-coupling reaction on a new substrate could be a representative primary task. By using the correlations learned from a previously measured set of substrates (i.e., the auxiliary tasks), the machine learning model may perform better with limited data on the new substrate (i.e., the primary task). Transfer learning has shown great promise in a variety of process research and development tasks. In addition, transfer learning in the form of model updating offers a potential solution to the computationally expensive problem of retraining deep learning models. The upcoming publication entitled “The Future of Molecular-Scale Measurements Enabled by Chemical Data Science” includes a report describing the 2022 NSF Workshop “Envisioning Data Driven Advances in Measurement and Instrumentation for Chemical Discovery”, which provides a comprehensive view of transfer learning in the context of several outstanding research questions in chemical data science.
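One simple way to realize transfer learning as model updating is sketched below: a small neural network is first trained on abundant data from auxiliary substrates and then updated with a handful of samples from the new substrate. The descriptors, targets, and the relationship between the tasks are simulated assumptions, not a description of any published workflow.

```python
# A hedged sketch of transfer learning as model updating: a network pretrained
# on auxiliary substrates is updated with a few samples from the new substrate.
# Descriptors, targets, and the relationship between tasks are simulated.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
n_descriptors = 12

# Auxiliary tasks: abundant (descriptor, enantiomeric-excess) pairs from
# previously measured substrates sharing the underlying correlations
w_aux = rng.standard_normal(n_descriptors)
X_aux = rng.standard_normal((300, n_descriptors))
y_aux = X_aux @ w_aux + 0.1 * rng.standard_normal(300)

# Primary task: only ten labeled samples for the new substrate, governed by a
# related (slightly perturbed) structure-property relationship
w_new = w_aux + 0.2 * rng.standard_normal(n_descriptors)
X_new = rng.standard_normal((10, n_descriptors))
y_new = X_new @ w_new + 0.1 * rng.standard_normal(10)

# Pretrain on the auxiliary data, then update (fine-tune) on the primary data
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model.fit(X_aux, y_aux)
for _ in range(200):
    model.partial_fit(X_new, y_new)

print("Updated-model prediction for one new-substrate sample:",
      float(model.predict(X_new[:1])[0]))
```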