Article

Prediction of Shampoo Formulation Phase Stability Using Large Language Models

Erwan Bigan and Stéphane Dufour
TinyPred, 26 Rue de Constantinople, 75008 Paris, France
* Author to whom correspondence should be addressed.
Cosmetics 2025, 12(4), 145; https://doi.org/10.3390/cosmetics12040145
Submission received: 22 May 2025 / Revised: 24 June 2025 / Accepted: 3 July 2025 / Published: 10 July 2025
(This article belongs to the Section Cosmetic Formulations)

Abstract

Predictive formulation can help reduce the number of experiments required to reach a target cosmetic product. The performance of Large Language Models from the open source Llama family is compared with that of conventional machine learning to predict the phase stability of shampoo formulations using a recently published dataset. The predictive strength is assessed for various train dataset sizes (obtained by stratified sampling of the full dataset) and for various Large Language Model sizes (3, 8, and 70B parameters). The predictive strength is found to increase on increasing the model size, and the Large-Language-Model-based approach outperforms conventional machine learning when the train dataset is small, delivering Area Under the Receiver Operating Curve above 0.7 with as few as 20 train samples. This work illustrates the potential of Large Language Models to further reduce the number of experiments required to reach a target cosmetic formulation.

1. Introduction

Predictive models hold the potential to accelerate the design of new formulations and to reduce its experimental cost; formulation design still remains, to a large extent, a trial-and-error process [1,2], with applications not only in cosmetics [3,4], but also in pharmaceuticals [5,6] or oil and gas [7]. Phase stability is a primary requirement for cosmetic formulations, and beyond stability there are numerous other desirable physical [8], biological [9], and sensorial [10] properties. For the most part, there is no known mechanistic model that accounts for experimental results, and predictive models are thus based on statistical learning [11], also known as machine learning (ML), which requires training data and the choice of an algorithm.
Regarding training data, there is no golden rule to determine how much data is actually required to train robust ML models because the predictive strength depends on the strength of statistical associations between features (e.g., ingredients and their concentrations) and the target to be predicted (e.g., phase stability), which is not known a priori. However, all else equal, the higher the combinatorial complexity, the higher the amount of required data, and the combinatorial complexity increases with the number of ingredients mixed together in each formulation, and with the size of the library of candidate ingredients. In addition, when introducing new cosmetic ingredients (e.g., replacement of polymers derived from fossil fuels driven by the green transition [12], or replacement of controversial ingredients driven by regulatory guidance [13]), there may simply be no historical data available at all. Design-of-Experiment (DOE) approaches combined with a high level of laboratory automation can be used to generate data to train predictive models. Chitre et al. [8,14] have recently applied such an approach to generate a publicly available dataset for several hundred shampoo formulations. Even with a high level of automation, such data acquisition campaigns can prove lengthy and costly, so any modeling approach that could reduce the amount of required training data would be beneficial.
Regarding the algorithm, there is a wide range of candidates, with popular ones including Logistic Regression, Random Forest, and gradient-boosted decision trees such as the Light Gradient-Boosting Machine (LGBM) [15]. Besides such algorithms, which will collectively be referred to as conventional ML in the remainder of this article, a new approach based on Large Language Models (LLMs) has recently been proposed [16]. LLMs are a class of models underlying so-called generative AI. This new approach involves converting numerical tabular data into contextualized text, which is then fed into an LLM to make predictions on new samples. This new method has been shown to outperform conventional ML when the training data is small (typically a few tens of samples) [16,17], because it leverages not only statistical patterns in the training data, but also prior expert knowledge acquired through pre-training on massive data (typically, the whole Internet).
The present work evaluates the potential of this new LLM-based approach to predict the phase stability of cosmetic formulations using the above-mentioned shampoo formulation dataset from Chitre et al. [8,14]. State-of-the-art open source LLMs from the Llama family from Meta, with sizes of 3, 8, or 70B parameters [18], are selected. The predictive strength is assessed for various train dataset sizes (obtained by stratified sampling [19] of the full dataset) using the Receiver Operating Curve (ROC) Area Under the Curve (AUC) metric and is compared with that of conventional ML algorithms. The predictive strength is found to increase on increasing the LLM size, and the LLM-based approach outperforms conventional ML when the train dataset is small (fewer than 50 samples) and can deliver AUC above 0.7 with as few as 20 train samples. While this performance gain already illustrates the potential of this new LLM-based approach to reduce time and cost in the development of new cosmetic formulations, avenues for further improvement are discussed.

2. Materials and Methods

2.1. Data

The dataset published in [8] and further described in [14] was used in this study. It gives the composition and phase stability for 812 shampoo formulations, 294 of which were found to be stable. The original dataset also gives the viscosity and turbidity, as well as rheological properties for stable formulations, but only the phase stability was considered in the present work. Each formulation consists of a mixture of four ingredients in a base of water. The four ingredients are two surfactants (from a library of 12), one polyelectrolyte (from a library of four), and one thickener (from a library of two). The trade and INCI names of these ingredients along with their key chemical properties, as well as the full dataset, can be found on the figshare repository [20] associated with [8].
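For orientation, a minimal loading sketch is given below. The file name and column names are placeholders, since the actual schema is documented on the figshare repository [20] rather than reproduced here.

# Minimal sketch of loading the dataset (hypothetical file and column names;
# see the figshare repository [20] for the actual schema).
import pandas as pd

df = pd.read_csv("shampoo_formulations.csv")          # local copy of the figshare table
y = (df["phase_stability"] == "stable").astype(int)   # 1 = stable, 0 = unstable
X = df.drop(columns=["phase_stability"])              # ingredient concentration columns
print(len(df), int(y.sum()))                          # expected: 812 formulations, 294 stable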

2.2. Data Sampling

One of the challenges of predictive modeling with very small datasets (i.e., in the range of a few tens of samples) is the limited ability to test trained models in a reliable manner. Standard cross-validation only uses a fraction of the data for each test round, and performance estimates are noisy because of the very small number of test cases. This limitation can be circumvented in the present case because the starting point is a much larger dataset with several hundreds of samples, so larger test sets can be used to evaluate trained models.
A generic method satisfying the following criteria was adopted in order to generate train/test splits: (i) compatibility with small train sizes in the range of 10–100 samples; (ii) use of test sizes that are larger than train sizes for reliable assessment of predictive model performance, but not as large as the complement of the full dataset, in order to reduce the number of inferences and associated GPU usage when using LLMs. The adopted method consists of the following steps: First, the train size T_R and test size T_S are chosen so that the latter is a multiple of the former: T_S = k × T_R. Second, a stratified sample [19] (stratified on the outcome to be predicted, which is phase stability) of size T_R + T_S = (k + 1) × T_R is randomly drawn from the full dataset. Third, this sample is randomly split into k + 1 stratified folds, out of which n_splits (n_splits ≤ k + 1) are randomly chosen as train folds, and for each such fold, the concatenation of the remaining k folds is used as a test set. This process results in an inverted cross-validation, and it was designed to work specifically with very small train sizes (smaller than the test sizes). Standard scikit-learn and numpy libraries, initialized with the same random seed, were used for all random processes (sklearn.model_selection.train_test_split for the initial stratified sampling of the full dataset, sklearn.model_selection.StratifiedKFold for splitting this sample into k + 1 folds, and numpy.random.choice for choosing n_splits of these as train folds). The following train and test size combinations were used: T_R = 10, T_S = 50; T_R = 20, T_S = 80; and T_R = 50, T_S = 100. For each train size/test size combination, n_splits = 3 was used, and the entire process was repeated for three different random seeds, resulting in nine different train/test splits for each train size/test size combination.
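A minimal sketch of this inverted cross-validation is given below. It follows the steps described above using the scikit-learn and numpy utilities named in the text, although the function signature, variable names, and use of numpy's Generator interface are illustrative choices rather than the exact original implementation.

# Sketch of the inverted cross-validation: a stratified sample of size (k + 1) * T_R
# is drawn, split into k + 1 stratified folds, and n_splits of these folds are used
# in turn as train sets, with the remaining k folds concatenated as the test set.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def inverted_cv_splits(y, tr_size, k, n_splits=3, seed=0):
    """Return n_splits (train_idx, test_idx) pairs with test size = k * train size."""
    y = np.asarray(y)
    idx = np.arange(len(y))
    # Step 1: stratified sample of size (k + 1) * tr_size from the full dataset.
    sample_idx, _ = train_test_split(
        idx, train_size=(k + 1) * tr_size, stratify=y, random_state=seed)
    # Step 2: split this sample into k + 1 stratified folds.
    skf = StratifiedKFold(n_splits=k + 1, shuffle=True, random_state=seed)
    folds = [sample_idx[te] for _, te in skf.split(sample_idx.reshape(-1, 1), y[sample_idx])]
    # Step 3: pick n_splits folds as train sets; the other k folds form each test set.
    rng = np.random.default_rng(seed)
    train_fold_ids = rng.choice(k + 1, size=n_splits, replace=False)
    return [(folds[f], np.concatenate([folds[g] for g in range(k + 1) if g != f]))
            for f in train_fold_ids]

# Example: T_R = 20, T_S = 80 (k = 4), repeated over three seeds -> nine splits.
# splits = [s for seed in range(3) for s in inverted_cv_splits(y, 20, 4, seed=seed)]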

2.3. Choice of LLM Models and Deployment

LLMs from the Llama family from Meta were selected. These open source models benefit from a fairly permissive license, including for commercial use [21], and can be accessed from the HuggingFace model repository [18]. The most recent models from the Llama 3 series were selected, with sizes of 3B (Llama-3.2-3B-Instruct [22]), 8B (Llama-3.1-8B-Instruct [23]), and 70B (Llama-3.3-70B-Instruct-quantized.w8a8 [24]). Their knowledge cutoff date is December 2023, which means that during pre-training they were never exposed to the dataset used here, which was published in July 2024. The original models from Meta were used for the 3B and 8B sizes, while an 8-bit quantized version of the 70B model was preferred for easier deployment and faster inference. These LLMs were deployed on Nvidia GPUs (a single A40 for the 3B and 8B models, dual A100 for the 70B model) using the vLLM open source library [25,26].
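As an illustration, the sketch below loads one of these checkpoints with vLLM's offline inference engine; the tensor-parallel and sampling settings are assumptions for the sketch, not the exact deployment configuration used in this work.

# Sketch of loading a Llama 3 checkpoint with vLLM for offline batched inference.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # or the 3B / quantized 70B checkpoints
    tensor_parallel_size=1,                    # e.g., 2 for the 70B model on dual A100
)
# Greedy decoding, short completion, with log-probabilities kept for later scoring.
sampling = SamplingParams(temperature=0.0, max_tokens=4, logprobs=1)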

2.4. LLM Prompting

There are two possible ways to use LLMs for predictions on tabular data. The first involves using the contextualized text for train examples to fine-tune the LLM weights, either all of them (full fine-tuning) or only a small subset using so-called Parameter-Efficient Fine-Tuning (PEFT) such as Low-Rank Adaptation (LoRA) [27] or (IA)3 [28]. The latter approach was adopted in the original proposal [16]. The second way, named In-Context Learning (ICL), involves using the LLM as is and including all training data in the prompt for every new prediction [29]. Fine-tuning requires fewer tokens per new prediction (a token is the basic unit into which words are split, with approximately 0.75 words per token for most tokenizers used in current LLMs), which lowers the computational cost and time, at the expense of the prior fine-tuning step and associated hyper-parameter optimization. ICL requires more tokens per new prediction, which increases inference time, but it does not require any prior fine-tuning or hyper-parameter optimization. In the present study, ICL was used because even for the largest train size (T_R = 100), the number of tokens in the prompt remained well below the maximum context window of all Llama 3 LLMs used, which is 128k tokens.
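As a sanity check against this 128k-token limit, the prompt length can be measured with the model's own tokenizer. The sketch below assumes a messages variable holding the chat-formatted prompt described in Section 2.4.1.

# Sketch: counting the tokens of an ICL chat prompt once the Llama 3 chat
# template is applied (`messages` is built as in Section 2.4.1).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def prompt_length(messages):
    token_ids = tokenizer.apply_chat_template(messages, tokenize=True,
                                              add_generation_prompt=True)
    return len(token_ids)

# prompt_length(messages) should remain well below the 128k-token context window.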

2.4.1. Prompting with Full Context

The LLMs were prompted as shown in Table 1. This prompt follows the standard chat structure for instruction-following LLMs (common abbreviated designation: instruct LLMs), which distinguishes between system, user, and assistant messages. Instruct LLMs are models that, after an initial pre-training phase which makes them strong next-token predictors, have been further trained to follow specific instructions using the chat format structure [29,30]. All messages shown in Table 1 are submitted to the LLM in a single prompt. The system message gives the general context and is followed by the training examples, each consisting of a user message giving the formulation composition followed by an assistant message giving the result of the phase stability test; the last message is a single user message giving the composition of the formulation test sample, the phase stability of which is to be predicted by the LLM. The outcome for stable (resp. unstable) formulations was encoded as “high” (resp. “low”) to avoid having to feed in the system message all experimental details of the phase stability assessment (performed after leaving samples 36 h in ambient lab conditions; see [8]).
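The sketch below assembles such a message list programmatically. The in-memory representation of train examples (ingredient/concentration pairs and a "low"/"high" label) is an assumption made for the sketch, while the message wording follows Table 1.

# Sketch of assembling the chat prompt of Table 1 for one test sample.
SYSTEM_MESSAGE = (
    "You will be given weight concentrations of cosmetic ingredients from BASF mixed in water. "
    "Classify the phase stability of the mixture into one of the following categories: low, or high. "
    "Return only the name of the category, and nothing else. "
    "MAKE SURE your output is one of the two categories stated."
)

def formulation_text(ingredients):
    """ingredients: list of (trade_name, concentration in w/w%) pairs."""
    return ", ".join(f"{name}: {conc} w/w%" for name, conc in ingredients) + ". Phase stability is ->"

def build_messages(train_examples, test_ingredients):
    """train_examples: list of (ingredients, label) pairs with label in {"low", "high"}."""
    messages = [{"role": "system", "content": SYSTEM_MESSAGE}]
    for ingredients, label in train_examples:
        messages.append({"role": "user", "content": formulation_text(ingredients)})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": formulation_text(test_ingredients)})
    return messages

One such message list is submitted per test sample, so each train/test split requires as many LLM inferences as there are test samples.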

2.4.2. Prompting Without Context

In order to differentiate predictive strength arising from prior expert knowledge from predictive strength arising from the inference of statistical patterns only, prompts were also run keeping the same statistical patterns in the train data but removing all contextual information. This was achieved by (i) using a system message that sets the problem as a pure statistical inference task, (ii) replacing ingredient names with generic column codes, (iii) normalizing ingredient concentrations so that the average ingredient concentration over train samples is unity, and applying this normalization factor to the test data, (iv) removing all concentration units, and (v) replacing the “low” and “high” outcomes with “0” and “1”. The resulting prompt is shown in Table 2 for the same example as shown with full context in Table 1.
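A sketch of this transformation is given below. The DataFrame layout is an assumption, and the normalization is implemented as described in the text (per-ingredient average over train samples set to unity), which may differ in detail from the convention behind the exact values shown in Table 2.

# Sketch of the context-stripping transform: generic column codes, train-based
# concentration normalization, and 0/1 outcomes (assumed DataFrame layout).
import pandas as pd

def strip_context(train: pd.DataFrame, test: pd.DataFrame, target: str = "stability"):
    features = [c for c in train.columns if c != target]
    codes = {c: f"C{i}" for i, c in enumerate(features)}    # (ii) generic column codes
    scale = train[features].replace(0, pd.NA).mean()        # per-ingredient average over train samples
    stripped = []
    for df in (train, test):
        out = (df[features] / scale).round(2).rename(columns=codes)  # (iii)-(iv) normalize, drop units
        out[target] = (df[target] == "high").astype(int)              # (v) "high" -> 1, "low" -> 0
        stripped.append(out)
    return stripped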

2.5. Processing of the LLM Response

The response object returned by the LLM through the vLLM Application Programming Interface (API) is then parsed to extract (i) the prediction (“low” or “high” with context, “0” or “1” without context) and (ii) the corresponding logprob (the logarithm of its probability), which, in the most general case, is the sum of the logprobs of the individual tokens making up the full returned message, but reduces to a single token for the specific outcome wording used in the present work. This logprob is then exponentiated into a probability p = exp(logprob), and the probability of the complement outcome is assigned as 1 − p. For all tested LLMs and all train/test folds, the returned message always exactly matched one of the two possible outcomes. This is consistent with the use of instruct LLMs (i.e., LLMs that have been trained to follow instructions) [29,30].
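A sketch of this post-processing is shown below; it assumes vLLM request outputs produced with logprobs enabled, as in the deployment sketch of Section 2.3.

# Sketch: converting one vLLM completion into a probability of the "high" class
# and scoring a test fold with the ROC AUC.
import math
from sklearn.metrics import roc_auc_score

def probability_high(request_output):
    completion = request_output.outputs[0]
    text = completion.text.strip().lower()        # expected to be exactly "low" or "high"
    p = math.exp(completion.cumulative_logprob)   # probability of the returned message
    return p if text == "high" else 1.0 - p       # complement outcome gets 1 - p

# outputs: list of vLLM RequestOutput objects, one per test sample
# auc = roc_auc_score(y_test, [probability_high(o) for o in outputs])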

2.6. Benchmarking Against Conventional ML

Three popular conventional ML classification algorithms were selected: Logistic Regression (LR), Random Forest (RF), and the Light Gradient-Boosting Machine (LGBM). For Logistic Regression and Random Forest, their scikit-learn implementation [31] was used with default hyper-parameters, except for the class_weight parameter set to balanced to automatically adjust weights inversely proportional to class frequencies in the input data. For LGBM, the native implementation available as a Python package (lightgbm version 4.6.0) was used, also with default hyper-parameters. The rationale for not optimizing hyper-parameters is that such optimization is prone to overfitting, especially when handling such small train sizes. Following [14], two data representations were tested: (i) one-hot encoding for all ingredients; and (ii) one-hot encoding for polymers only (polyelectrolytes and thickeners) and featurization for surfactants, using as features the relative amounts of 13 chemical functional groups in the two-surfactant mixture, obtained by weighting the number of functional groups per surfactant molecule by the surfactant concentrations. LGBM failed to converge for the smallest train sizes (T_R ≤ 20) for all train/test folds. The Receiver Operating Curve (ROC) Area Under the Curve (AUC) was used as the predictive strength metric for both the LLM-based approach and conventional ML, and it was averaged over the nine folds (three folds per random seed times three different seeds).
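A sketch of this benchmark loop is given below (default hyper-parameters apart from class_weight, AUC averaged over folds). X and y are assumed to be the featurized design matrix and the 0/1 stability labels as numpy arrays, and the fixed random_state is an illustrative choice.

# Sketch of the conventional ML benchmark averaged over the inverted-CV splits.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

models = {
    "LR": LogisticRegression(class_weight="balanced"),
    "RF": RandomForestClassifier(class_weight="balanced", random_state=0),
}

def benchmark(X, y, splits):
    """splits: (train_idx, test_idx) pairs from the inverted cross-validation."""
    scores = {name: [] for name in models}
    for train_idx, test_idx in splits:
        for name, model in models.items():
            model.fit(X[train_idx], y[train_idx])
            proba = model.predict_proba(X[test_idx])[:, 1]   # probability of the stable class
            scores[name].append(roc_auc_score(y[test_idx], proba))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}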

3. Results

3.1. Conventional ML Benchmark

Figure 1 shows the average AUC as a function of train size for the different conventional ML algorithms (Logistic Regression, Random Forest, and LGBM) and for the different surfactant featurization schemes (one-hot encoding, or featurization of surfactants). The following can be seen:
  • For each algorithm/featurization scheme combination, the AUC increases on increasing the train size, as expected.
  • Regardless of the featurization scheme, LGBM always performs worse than Logistic Regression or Random Forest, which may be explained by (i) the absence of hyper-parameter optimization and (ii) the small train sizes used here compared to the reference benchmarks reporting an advantage of LGBM over Random Forest, which typically involve thousands or tens of thousands of train samples.
  • Random Forest is better than Logistic Regression when T_R ≤ 20, and the opposite holds when T_R ≥ 50 (which suggests a crossover for some T_R value between 20 and 50; this was not investigated further).
  • Overall performance is somewhat comparable for the two surfactant featurization schemes, with a slight advantage for “featurization” over “one-hot encoding” (yielding higher AUCs for 6 out of the 10 working algorithm/train size combinations).
Based on the above, LGBM is excluded from the benchmark, and Logistic Regression and Random Forest are kept, using the surfactant featurization scheme, for the comparison with the LLM-based approach presented in the next subsection. This combination will be referred to as the conventional ML benchmark in the remainder of this article.
Figure 1. Average Area Under the Receiver Operating Curve (AUC) for conventional ML algorithms, using one-hot encoding or featurization of surfactants. LGBM failed to converge for the smallest train sizes (T_R ≤ 20) and is not shown.

3.2. Benchmarking the LLM-Based Approach Against Conventional ML

Figure 2 shows the average AUC as a function of train size for the three tested LLM sizes (3B, 8B, and 70B) as well as for the conventional ML benchmark. Up to a train size of 50 samples, the LLM performance increases on increasing the model size, and the largest LLM (i.e., Llama 3 70B) performs better than conventional ML: Llama 3 70B with only T_R = 10 (respectively, T_R = 20) train samples yields approximately the same AUC as conventional ML with T_R = 20 (respectively, T_R = 50). This means that for these smallest train sizes, on average, the LLM-based approach requires ≈2× fewer train samples to achieve the same predictive strength as conventional ML. For the larger train sizes (T_R = 50 or T_R = 100), the LLM performance reaches a plateau, or even decreases for the smaller LLMs (Llama 3 3B and 8B). The reason for this behavior is unclear: on the one hand, it may be hypothesized that the LLM is less capable of capturing the full prompt information for such long and repetitive prompts; on the other hand, even for T_R = 100, the prompt length remains below 8k tokens, which is well below the Llama 3 maximum context length (128k tokens for all three LLM sizes).

3.3. Investigating the Origin of the LLM Advantage

To determine whether the LLM advantage arises from prior expert knowledge or from a stronger ability to infer statistical patterns, the same LLM experiments were conducted with the data stripped of all context (see Section 2). Figure 3 shows the average AUC for the largest LLM (i.e., Llama 3 70B) with or without context, as well as for the conventional ML benchmark, as a function of train size. Removing the context lowers the AUC and brings it even below that of conventional ML. This suggests that the LLM advantage does not arise from being a better statistical learner, but rather from prior expert knowledge, as context is key to performance.

4. Discussion

First, the question of whether the performance advantage is sufficient to justify using the LLM-based approach is discussed. Conventional ML should be preferred whenever its predictive strength is deemed sufficient for the targeted application, because (i) it is simpler and cheaper to deploy, with no need for a GPU, (ii) inference is faster, and (iii) it is significantly less complex. However, the ≈2× advantage in the train size required to achieve the same predictive strength may be sufficient to trigger industrial interest, given the potential for formulation cost reduction and time-to-market advantage (owing to the reduced number of laboratory trials and errors to reach a target product). There may also be ways to further improve the performance of the LLM-based approach beyond this initial work. Fine-tuning could be investigated as an alternative to the In-Context Learning (ICL) used in the present work. Fine-tuning may also be used in combination with In-Context Learning using the method proposed by Chen et al. [32], which involves using the full ICL prompts (such as shown in Table 1) as training examples for fine-tuning. The ICL prompt itself might be optimized, either manually or automatically using the Declarative Self-Improving Python (DSPy) method [33]. Further extending this ICL prompt optimization path, Chain of Thought (CoT) is a method where the LLM is prompted to solve specific problems in steps, with a number of manually curated reference examples provided as demonstrations [34]. The LLM can then be presented with new, yet-unseen questions and instructed to deliver not just the answer, but also the reasoning behind it. CoT has been shown to substantially improve the quality of the answers. While CoT might not be directly applicable to the present formulation classification task, its zero-shot version might be: Kojima et al. [35] have shown that simply prompting the LLM to provide not only an answer but also a reasoning (without any specific prior training on such reasoning examples) could significantly improve the LLM response.
Second, the possible origins of the prior expert knowledge are discussed, given the scarcity of published data in cosmetic formulation, especially compared to other scientific domains such as clinical science or biology. There is abundant ingredient information from suppliers, including some example formulations, and there also exist handbooks discussing formulation and phase stability [36], as well as do-it-yourself online resources, some of which include formulation examples with quantitative compositions (rather than mere ingredient lists) [37]. Such resources may find their way into the LLM’s prior knowledge and contribute to better predictions for new formulations. One way for cosmetic companies to go beyond this relatively scarce public knowledge about cosmetic formulation could be to develop their own LLM incorporating their proprietary know-how, by starting from an existing LLM and further training it on their entire formulation history, ideally including not just product information corresponding to successful formulations, but also all failed formulation attempts. Such modified LLMs should perform even better on the type of task described in the present work.
Third and last, a parallel is drawn between the LLM-based approach used in the present work and other pre-trained approaches, as the advantage over conventional ML lies in the abilities acquired through the pre-training process. There have been several proposals of pre-trained foundation numerical models [38,39,40,41], i.e., deep learning models that handle numbers and not text, and that have been pre-trained on a wide variety of tabular datasets, either generated synthetically [40,41] or corresponding to real-world cases [39]. On the one hand, tabular foundation models present the advantage of natively handling numbers and thus being able to better capture all statistical patterns in the data, whereas LLMs are prone to making calculation mistakes or to returning different results depending on the nature of the prompt [42,43]. On the other hand, they cannot benefit from contextual information (which can only be conveyed in text) or from a specific domain expertise, which is typically acquired through pre-training on mostly unstructured data from various sources. In the future, mixed text–quantitative models that could benefit from both advantages may emerge.

Author Contributions

Conceptualization, E.B. and S.D.; methodology, E.B.; software, E.B.; validation, E.B.; formal analysis, E.B. and S.D.; investigation, E.B.; data curation, E.B.; writing—original draft preparation, E.B.; writing—review and editing, S.D.; visualization, E.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://springernature.figshare.com/collections/Accelerating_Formulation_Design_via_Machine_Learning_Generating_a_High-throughput_Shampoo_Formulations_Dataset/7132624, accessed on 2 July 2025.

Acknowledgments

We thank Alain Hamel for helpful comments on the initial version of the manuscript.

Conflicts of Interest

Erwan Bigan and Stéphane Dufour are shareholders of TinyPred, which develops predictive models for various applications. The scientific content of the manuscript was not influenced by the authors’ positions.

Abbreviations

The following abbreviations are used in this manuscript:
API: Application Programming Interface
AUC: Area Under the Curve
CoT: Chain of Thought
DOE: Design of Experiment
GPU: Graphics Processing Unit
ICL: In-Context Learning
LGBM: Light Gradient-Boosting Machine
LLM: Large Language Model
LoRA: Low-Rank Adaptation
LR: Logistic Regression
ML: Machine Learning
PEFT: Parameter-Efficient Fine-Tuning
RF: Random Forest
ROC: Receiver Operating Curve

References

  1. Conte, E.; Gani, R.; Ng, K.M. Design of formulated products: A systematic methodology. AIChE J. 2011, 57, 2431–2449. [Google Scholar] [CrossRef]
  2. McDonagh, J.L.; Swope, W.C.; Anderson, R.L.; Johnston, M.A.; Bray, D.J. What can digitisation do for formulated product innovation and development? Polym. Int. 2021, 70, 248–255. [Google Scholar] [CrossRef]
  3. Kamairudin, N.; Abd Gani, S.S.; Fard Masoumi, H.R.; Basri, M.; Hashim, P.; Mokhtar, N.M.; Lane, M.E. Modeling of a natural lipstick formulation using an artificial neural network. RSC Adv. 2015, 5, 68632–68638. [Google Scholar] [CrossRef]
  4. Cao, L.; Russo, D.; Felton, K.; Salley, D.; Sharma, A.; Keenan, G.; Mauer, W.; Gao, H.; Cronin, L.; Lapkin, A.A. Optimization of Formulations Using Robotic Experiments Driven by Machine Learning DoE. Cell Rep. Phys. Sci. 2021, 2, 100295. [Google Scholar] [CrossRef]
  5. Bao, Z.; Bufton, J.; Hickman, R.J.; Aspuru-Guzik, A.; Bannigan, P.; Allen, C. Revolutionizing drug formulation development: The increasing impact of machine learning. Adv. Drug Deliv. Rev. 2023, 202, 115108. [Google Scholar] [CrossRef] [PubMed]
  6. Hornick, T.; Mao, C.; Koynov, A.; Yawman, P.; Thool, P.; Salish, K.; Giles, M.; Nagapudi, K.; Zhang, S. In silico formulation optimization and particle engineering of pharmaceutical products using a generative artificial intelligence structure synthesis method. Nat. Commun. 2024, 15, 9622. [Google Scholar] [CrossRef] [PubMed]
  7. Saldana, D.; Starck, L.; Mougin, P.; Rousseau, B.; Creton, B. On the rational formulation of alternative fuels: Melting point and net heat of combustion predictions for fuel compounds using machine learning methods. SAR QSAR Environ. Res. 2013, 24, 259–277. [Google Scholar] [CrossRef] [PubMed]
  8. Chitre, A.; Querimit, R.C.; Rihm, S.D.; Karan, D.; Zhu, B.; Wang, K.; Wang, L.; Hippalgaonkar, K.; Lapkin, A.A. Accelerating Formulation Design via Machine Learning: Generating a High-throughput Shampoo Formulations Dataset. Sci. Data 2024, 11, 728. [Google Scholar] [CrossRef] [PubMed]
  9. Bashir, A.; Lambert, P. Microbiological study of used cosmetic products: Highlighting possible impact on consumer health. J. Appl. Microbiol. 2020, 128, 598–605. [Google Scholar] [CrossRef] [PubMed]
  10. Pensé-Lhéritier, A.M. Recent developments in the sensorial assessment of cosmetic products: A review. Int. J. Cosmet. Sci. 2015, 37, 465–473. [Google Scholar] [CrossRef] [PubMed]
  11. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning. In Data Mining, Inference, and Prediction, 2nd ed.; Springer Series in Statistics; Springer: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
  12. Kelly, C.L. Addressing the sustainability challenges for polymers in liquid formulations. Chem. Sci. 2023, 14, 6820–6825. [Google Scholar] [CrossRef] [PubMed]
  13. Vieira, D.; Duarte, J.; Vieira, P.; Gonçalves, M.B.S.; Figueiras, A.; Lohani, A.; Veiga, F.; Mascarenhas-Melo, F. Regulation and Safety of Cosmetics: Pre-and Post-Market Considerations for Adverse Events and Environmental Impacts. Cosmetics 2024, 11, 184. [Google Scholar] [CrossRef]
  14. Chitre, A.; Semochkina, D.; Woods, D.C.; Lapkin, A.A. Machine learning-guided space-filling designs for high throughput liquid formulation development. Comput. Chem. Eng. 2025, 195, 109007. [Google Scholar] [CrossRef]
  15. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Long Beach, CA, USA, 4–9 December 2017; pp. 3149–3157. [Google Scholar]
  16. Hegselmann, S.; Buendia, A.; Lang, H.; Agrawal, M.; Jiang, X.; Sontag, D. TabLLM: Few-shot classification of tabular data with large language models. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 25–27 April 2023; Volume 206, pp. 5549–5581. [Google Scholar]
  17. Wen, X.; Zhang, H.; Zheng, S.; Xu, W.; Bian, J. From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD’24, Barcelona, Spain, 25–29 April 2024; pp. 3323–3333. [Google Scholar] [CrossRef]
  18. Meta Llama on Huggingface. Available online: https://huggingface.co/meta-llama (accessed on 5 May 2025).
  19. Parsons, V.L. Stratified Sampling. In Wiley StatsRef: Statistics Reference Online; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2017; pp. 1–11. [Google Scholar] [CrossRef]
  20. Lapkin, A.A.; Chitre, A.; Querimit, R.C.; Rihm, S.D.; Karan, D.; Zhu, B.; Wang, K.; Wang, L.; Hippalgaonkar, K. Accelerating Formulation Design via Machine Learning: Generating a High-throughput Shampoo Formulations Dataset. figshare. Collection. Available online: https://springernature.figshare.com/collections/Accelerating_Formulation_Design_via_Machine_Learning_Generating_a_High-throughput_Shampoo_Formulations_Dataset/7132624 (accessed on 5 May 2025). [CrossRef]
  21. Meta Llama 3 Community License Agreement. Available online: https://www.llama.com/llama3/license/ (accessed on 5 May 2025).
  22. Available online: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct (accessed on 5 May 2025).
  23. Available online: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct (accessed on 5 May 2025).
  24. Available online: https://huggingface.co/RedHatAI/Llama-3.3-70B-Instruct-quantized.w8a8 (accessed on 5 May 2025).
  25. Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C.H.; Gonzalez, J.; Zhang, H.; Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, Koblenz, Germany, 23–26 October 2023; pp. 611–626. [Google Scholar] [CrossRef]
  26. vLLM Home Page. Available online: https://docs.vllm.ai/en/stable/# (accessed on 5 May 2025).
  27. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the Tenth International Conference on Learning Representations (ICLR 2022), Virtual Event, 25–29 April 2022. [Google Scholar]
  28. Liu, H.; Tam, D.; Muqeeth, M.; Mohta, J.; Huang, T.; Bansal, M.; Raffel, C.A. Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. In Proceedings of the Thirty-Sixth Annual Conference on Advances in Neural Information Processing Systems, NIPS’22, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 1950–1965. [Google Scholar]
  29. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, NIPS’20, Virtual Event, 6–9 December 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  30. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS’22, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 27730–27744. [Google Scholar]
  31. Scikit-Learn: Machine Learning in Python; Scikit-Learn 1.6.1 Documentation. Available online: https://scikit-learn.org/stable/ (accessed on 5 May 2025).
  32. Chen, Y.; Zhong, R.; Zha, S.; Karypis, G.; He, H. Meta-learning via Language Model In-context Tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; pp. 719–730. [Google Scholar]
  33. Khattab, O.; Singhvi, A.; Maheshwari, P.; Zhang, Z.; Santhanam, K.; A, S.V.; Haq, S.; Sharma, A.; Joshi, T.T.; Moazam, H.; et al. DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines. In Proceedings of the Twelfth International Conference on Learning Representations ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  34. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS’22, New Orleans, LA, USA, 28 November–9 December 2022; pp. 24824–24837. [Google Scholar]
  35. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS’22, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 22199–22213. [Google Scholar]
  36. Barel, A.O.; Paye, M.; Maibach, H.I. Handbook of Cosmetic Science and Technology; CRC Press: Boca Raton, FL, USA, 2014. [Google Scholar] [CrossRef]
  37. Homemade Shampoo Formulas|Make Your Own DIY Shampoo with Recipes|MakingCosmetics—makingcosmetics.com. Available online: https://www.makingcosmetics.com/Shampoo-Formulas_ep_90.html?lang=en_US (accessed on 5 May 2025).
  38. Van Breugel, B.; Van Der Schaar, M. Position: Why tabular foundation models should be a research priority. In Proceedings of the 41st International Conference on Machine Learning, ICML’24, Vienna, Austria, 21–27 July 2024; pp. 48976–48993. [Google Scholar]
  39. Tran, Q.M.; Hoang, S.N.; Nguyen, L.M.; Phan, D.; Lam, H.T. TabularFM: An Open Framework For Tabular Foundational Models. In Proceedings of the 2024 IEEE International Conference on Big Data, Washington, DC, USA, 15–18 December 2024; pp. 1694–1699. [Google Scholar] [CrossRef]
  40. Hollmann, N.; Müller, S.; Purucker, L.; Krishnakumar, A.; Körfer, M.; Hoo, S.B.; Schirrmeister, R.T.; Hutter, F. Accurate predictions on small data with a tabular foundation model. Nature 2025, 637, 319–326. [Google Scholar] [CrossRef] [PubMed]
  41. Qu, J.; Holzmüller, D.; Varoquaux, G.; Morvan, M.L. TabICL: A Tabular Foundation Model for In-Context Learning on Large Data. arXiv 2025, arXiv:cs.LG/2502.05564. [Google Scholar]
  42. Muffo, M.; Cocco, A.; Bertino, E. Evaluating Transformer Language Models on Arithmetic Operations Using Number Decomposition. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 291–297. [Google Scholar]
  43. Yan, Y.; Lu, Y.; Xu, R.; Lan, Z. Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models. arXiv 2025, arXiv:cs.CL/2504.05262. [Google Scholar]
Figure 2. Average Area Under the Receiver Operating Curve (AUC) for the three LLMs as well as for the conventional ML benchmark.
Figure 3. Average Area Under the Receiver Operating Curve (AUC) for the largest LLM (Llama 3 70B), with or without context, as well as for the conventional ML benchmark.
Table 1. Example of a prompt submitted to the LLM, for T_R = 10. It consists of a series of messages: first, the system message giving the general context; then, a series of alternating user and assistant messages corresponding to the train examples; and finally a user message corresponding to the test sample, for which the LLM is invited to make a prediction.
Role | Message
System | You will be given weight concentrations of cosmetic ingredients from BASF mixed in water. Classify the phase stability of the mixture into one of the following categories: low, or high. Return only the name of the category, and nothing else. MAKE SURE your output is one of the two categories stated.
User | Plantapon ACG 50: 12.73 w/w%, Dehyton MC: 13.56 w/w%, Dehyquart CC6: 1.16 w/w%, Arlypon TT: 4.94 w/w%. Phase stability is ->
Assistant | low
User | Dehyton PK 45: 9.15 w/w%, Dehyquart A-CA: 9.38 w/w%, Salcare Super 7: 2.24 w/w%, Arlypon F: 4.14 w/w%. Phase stability is ->
Assistant | low
User | Plantapon ACG 50: 13.68 w/w%, Plantapon LC 7: 8.79 w/w%, Salcare Super 7: 1.35 w/w%, Arlypon F: 4.95 w/w%. Phase stability is ->
Assistant | high
User | Plantacare 818: 11.7 w/w%, Plantacare 2000: 8.48 w/w%, Luviquat Excellence: 1.22 w/w%, Arlypon TT: 3.95 w/w%. Phase stability is ->
Assistant | high
User | Plantacare 2000: 8.45 w/w%, Plantapon Amino KG-L: 8.76 w/w%, Luviquat Excellence: 1.7 w/w%, Arlypon TT: 3.06 w/w%. Phase stability is ->
Assistant | low
User | Plantapon ACG 50: 7.7 w/w%, Plantapon LC 7: 9.39 w/w%, Salcare Super 7: 2.71 w/w%, Arlypon TT: 2.59 w/w%. Phase stability is ->
Assistant | high
User | Dehyton MC: 11.07 w/w%, Dehyton PK 45: 12.64 w/w%, Salcare Super 7: 1.68 w/w%, Arlypon F: 4.88 w/w%. Phase stability is ->
Assistant | low
User | Texapon SB 3 KC: 4.18 w/w%, Dehyton PK 45: 10.24 w/w%, Dehyquart CC6: 1.26 w/w%, Arlypon TT: 4.09 w/w%. Phase stability is ->
Assistant | low
User | Plantapon ACG 50: 11.39 w/w%, Plantapon Amino KG-L: 9.27 w/w%, Dehyquart CC6: 2.03 w/w%, Arlypon TT: 4.14 w/w%. Phase stability is ->
Assistant | low
User | Dehyton MC: 9.13 w/w%, Dehyton ML: 9.8 w/w%, Luviquat Excellence: 3.08 w/w%, Arlypon TT: 2.59 w/w%. Phase stability is ->
Assistant | low
User | Plantapon ACG 50: 11.91 w/w%, Dehyton ML: 9.94 w/w%, Luviquat Excellence: 1.1 w/w%, Arlypon TT: 0.94 w/w%. Phase stability is ->
Table 2. Prompt used when stripping away all context information, for the same example as shown in Table 1.
Role | Message
System | You will be given characteristics of a sample. Classify the sample outcome into one of the following categories: 0, or 1. Return only the name of the category, and nothing else. MAKE SURE your output is one of the two categories stated.
User | C1 is 2.21, C5 is 2.58, C13 is 1.63, C17 is 2.75. Outcome is ->
Assistant | 0
User | C6 is 1.85, C11 is 3.33, C15 is 2.17, C16 is 1.93. Outcome is ->
Assistant | 0
User | C1 is 2.38, C2 is 2.42, C15 is 1.31, C16 is 2.31. Outcome is ->
Assistant | 1
User | C3 is 3.33, C4 is 2.5, C12 is 1.2, C17 is 2.2. Outcome is ->
Assistant | 1
User | C4 is 2.5, C10 is 2.43, C12 is 1.68, C17 is 1.7. Outcome is ->
Assistant | 0
User | C1 is 1.34, C2 is 2.58, C15 is 2.63, C17 is 1.44. Outcome is ->
Assistant | 1
User | C5 is 2.11, C6 is 2.55, C15 is 1.63, C16 is 2.28. Outcome is ->
Assistant | 0
User | C0 is 3.33, C6 is 2.07, C13 is 1.77, C17 is 2.27. Outcome is ->
Assistant | 0
User | C1 is 1.98, C10 is 2.57, C13 is 2.85, C17 is 2.3. Outcome is ->
Assistant | 0
User | C5 is 1.74, C7 is 3.33, C12 is 3.04, C17 is 1.44. Outcome is ->
Assistant | 0
User | C1 is 2.07, C7 is 3.38, C12 is 1.09, C17 is 0.52. Outcome is ->
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

