
Return on Advertising Spend Prediction with Task Decomposition-Based LSTM Model

Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
Human Inspired Artificial Intelligence Research (HIAI), Korea University, Seoul 02841, Korea
AI Data Business Operation, Bizspring, Seoul 04788, Korea
Author to whom correspondence should be addressed.
Mathematics 2022, 10(10), 1637;
Received: 17 March 2022 / Revised: 18 April 2022 / Accepted: 24 April 2022 / Published: 11 May 2022


Return on advertising spend (ROAS) is the ratio of the revenue generated by an advertising project to its expense. It is used to assess the effectiveness of advertising marketing. Several simulation-based controlled experiments, such as geo experiments, have been proposed recently. These calculate ROAS by dividing a geographic region into a control group and a treatment group and comparing the ROAS generated in each group. However, the data collected through these experiments can only be used to analyze previously constructed data, making them difficult to use in an inductive process that predicts future profits or costs. Furthermore, to obtain ROAS for each advertising group, data must be collected under a new experimental setting each time, which limits the reuse of previously collected data. Considering these limitations, we present a method for predicting ROAS that does not require controlled experiments in data acquisition and validate its effectiveness through comparative experiments. Specifically, we propose a task decomposition method that divides the end-to-end prediction task into a two-stage process: occurrence prediction and occurred ROAS regression. Through comparative experiments, we show that these approaches can effectively deal with advertising data in which most labels are zero.

1. Introduction

Return on advertising spend (ROAS) refers to the ratio of the revenue to the expense incurred by an advertising project. It is a quantitative measure of profitability for a specific advertising project. ROAS is currently utilized as an indicator for estimating the performance of advertising projects [1,2]. Considering its importance, numerous studies on computing ROAS are being conducted across the industry, including at Google [3,4,5,6].
A representative example is the geo experiment [3], which was proposed to obtain ROAS for a specific advertising project. This is a controlled experiment that analyzes the revenue and cost incurred in two groups: a treatment group and a control group. Specifically, a specific advertisement is exposed to the treatment group and not to the control group [7]. Although these studies are effective in assessing historical advertising accomplishments, they are not suitable for the inductive process that predicts future advertising impacts [4]. Most research on forecasting ROAS relies on statistics-based models [8], and to the best of our knowledge, ROAS prediction methods applying deep learning models do not exist.
Furthermore, earlier studies in ROAS analysis have only examined revenues and costs, disregarding additional elements that have the potential to be effective in predicting revenue or cost [6,9]. This implies that implicitly helpful information, such as the number of clicks or type of advertising program, is not being utilized [10].
The high cost and several other difficulties in data acquisition also impose limitations. Following conventional approaches, obtaining the ROAS of a specific advertising project requires controlled experiments [6,7,8]. This means that every estimation of the ROAS for a specific advertising project involves an individual simulation. As obtaining data through simulation requires running the advertisement, collecting each data point incurs an advertisement fee. This makes constructing a large dataset costly.
Considering these limitations, we present a deep learning-based ROAS prediction model that can overcome the limitations of previous studies. This model can forecast future ROAS using only a database collected while executing an advertisement project. By utilizing an existing database, the model removes the need for the additional controlled experiments that ROAS estimation otherwise requires for data acquisition. Specifically, we adopt the LSTM model structure for predicting future ROAS values by referring to previous input features [11].
Subsequently, we present a two-stage framework for improving prediction performance. In the database collected through advertising projects, most of the data queries have zero revenue. This indicates that prediction models trained on this data in an end-to-end framework may be overly biased toward zero; the model may return zero regardless of its input features. To alleviate this limitation and deal with these data characteristics, we divide the prediction task into two sub-tasks [12,13]: occurrence prediction and occurred ROAS regression. To further improve the prediction performance, we apply an up-scaling strategy that resolves the unbalanced data distribution. Through comparative analysis, we determined that our approaches lead to higher performance than the conventional end-to-end framework.

2. Related Work

Research related to ROAS measurement and prediction has focused on attempts to estimate accurate ROAS for specific advertisements. Typically, geo experiments [3] propose randomized settings in implementing controlled experiments. Corresponding experiments focus on measuring the difference in the cost and revenue incurred in the treatment and control groups. In this setting, an advertisement is only exposed to the treatment group, and thereby, the difference between these two groups implies the advertisement’s implicit effect. In particular, in assigning treatment and control groups, non-overlapping geographic regions are randomly selected to attain objectivity. This aims to measure accurate ROAS through differences in revenues and costs generated from the two groups, and subsequent ROAS-related experiments are also being conducted in such a fashion [8].
In recent years, more generalized control experiments have been proposed by further developing these geo experiment settings [4,6,7]. They focus on dealing with issues that can arise from geo-heterogeneity in geo experiments and designing reliable and cost-effective experiments. Since considerably different responses (e.g., revenue or cost) can be generated depending on the characteristics of each treatment and control group, dealing with these issues is considered necessary in related studies [6]. Kerman et al. [4] propose time-based regression (TBR) to solve the problem by aggregating the geo-level data into the group. Furthermore, Chen and Au [7] and Chen et al. [6] suggest a trimmed match based on a randomized paired experiment to obtain robust estimation results.
As obtaining the ROAS of a specific advertising project requires its own controlled experiment, the above approaches share a fundamental issue: new data must be collected each time. Another limitation is that, beyond analyzing ROAS for existing advertising projects, there has been little research on predicting future ROAS. Kerman et al. [4] and Chan et al. [9] attempted to predict incremental ROAS, but these studies only focused on conventional statistical frameworks, and prediction systems based on deep learning frameworks have not been studied as far as we are aware.
Considering the return on investment (ROI), a broader concept than ROAS, several studies have attempted to predict future information by applying deep learning technology based on the collected information [10,14,15,16]. In most studies, long short-term memory (LSTM) [17] is adopted. This is a deep learning model specialized in dealing with sequential data that works effectively in ROI studies that require data processing by time series [18]. In this study, we apply the previous perspective to design the ROAS prediction model using the LSTM model, offering an effective model to overcome the limitations of previous studies.

3. Proposed Method

We propose a deep learning-based framework that forecasts ROAS based on user behavior data obtained during advertising projects. As the existing dataset can be utilized in training model structures, the need for extra simulation-based control experiments, such as geo-experiments, can be obviated. We used various input features, such as the number of clicks, ad platform information, and the cost and revenue generated by clients.
Specifically, we adopted the LSTM model structure for predicting future ROAS, together with a newly proposed data pre-processing method for composing the input sequence of the prediction model. As the data acquired during an advertising project may be overly abundant, we select essential features and compose an input sequence from them. In particular, we used the keywords that each user entered to reach the advertisement as an input feature, encoded through a pre-trained language model [19].
For encoding the keyword information, we adopted a pre-trained language model BERT [19] and utilized a multi-layer perceptron for encoding other numeric features. The encoded feature representation obtained through this process is used as an input to the LSTM-based prediction model designed to predict the ROAS. Specific encoding processes and utilized model structure are described in Figure 1.
Each query in the database records user behavior features, such as the number of orders placed and the incurred costs. To reduce bias toward individual users, we grouped the queries in the existing database by the date of data collection and the ad group of each query. To express the individual characteristics of the features, we divided them into three classes and encoded each class with BERT or a multi-layer perceptron, according to its attributes [20]. By concatenating these features and processing them with the pooling layers, we generated the final input feature to feed into the LSTM-based prediction model.

3.1. Data Pre-Processing

3.1.1. Feature Extraction

We accumulated behavioral data from customers through various advertising projects and used it to generate input sequences for the prediction model. To be more specific, we classified input characteristics to better structure the encoding process. Descriptions of utilized customer behavioral data features are detailed in Table 1.
The class for each feature was utilized as a criterion to determine the encoding process. In this experiment, ROAS was calculated as the ratio of revenue to cost. Because revenue and costs generated by advertising were included in the collected data, their ratio reveals the advertisement’s ROAS. If revenue is generated but the cost is zero, we consider the corresponding advertising irrelevant to the revenue incurrence and set the ROAS value to zero.
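The labeling rule described above can be sketched in a few lines. This is a minimal illustration, assuming revenue and cost are available as plain numbers; the function name `compute_roas` is hypothetical and not taken from the paper.

```python
def compute_roas(revenue: float, cost: float) -> float:
    """Return revenue / cost, with ROAS set to 0 when no cost was incurred."""
    if cost == 0:
        # Zero-cost rule from the text: revenue that appears without any
        # spend is treated as unrelated to the advertisement.
        return 0.0
    return revenue / cost
```

For example, `compute_roas(300.0, 100.0)` gives 3.0, while any record with zero cost is labeled 0 regardless of its revenue.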

3.1.2. Data Clustering

As an advertising project proceeds, the behavior data created by each user is accumulated as queries. When used directly for training, these data carry a significant risk of strong bias toward each customer's individual inclinations and traits. Because the goal of ROAS prediction is to determine the impact of each advertisement project, the queries generated by individual users must be clustered by advertisement group. This reduces dependency on individual properties and provides a more generalized perspective on an advertising project. Furthermore, in order to forecast ROAS given prior properties, input features should be generated by arranging these clustered data as a time series [11,18]. With this technique, the vast quantity of data acquired for each user is pre-processed into sequential time-series data for each ad group. We denote the features that serve as the standard for clustering as the cluster class.
In this experiment, we denote features consisting of continuous values with an integer data type as the numeric class and features restricted to categorical values with a string data type as the categorical class. In clustering the numeric class, we add all values generated by the users in a group and use the sum as the representative feature of the group. Since the numeric class values differ substantially among ad groups, they are all re-scaled by taking the $\log_{10}$ value. For the categorical class, the most frequent category in a cluster is chosen as the representative feature of the group. In processing keywords, all the keywords in a group are concatenated into a single string without duplication and designated as the keywords for the group. This feature is referred to as the keyword class.
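The clustering rules above can be sketched with the Python standard library. This is a rough illustration under stated assumptions, not the authors' implementation: the record layout (a list of dictionaries) and the `platform` and `keyword` field names are hypothetical, while `stat_date`, `adgroup`, and `impr` follow names mentioned in the text. The `+ 1` offset before the logarithm is also an assumption, added only so that a zero sum stays finite.

```python
import math
from collections import Counter, defaultdict


def cluster_queries(queries, numeric_keys, categorical_keys):
    """Group raw queries by (stat_date, adgroup) and aggregate each class."""
    groups = defaultdict(list)
    for q in queries:
        groups[(q["stat_date"], q["adgroup"])].append(q)

    clustered = {}
    for key, rows in groups.items():
        feat = {}
        for k in numeric_keys:
            # Numeric class: sum over the group, then log10 re-scaling.
            # The +1 offset is an assumption (keeps log10 defined for 0).
            total = sum(r[k] for r in rows)
            feat[k] = math.log10(total + 1)
        for k in categorical_keys:
            # Categorical class: most frequent category in the group.
            feat[k] = Counter(r[k] for r in rows).most_common(1)[0][0]
        # Keyword class: concatenate without duplicates, first-seen order.
        unique_keywords = dict.fromkeys(r["keyword"] for r in rows)
        feat["keyword"] = " ".join(unique_keywords)
        clustered[key] = feat
    return clustered
```

Each resulting entry corresponds to one (date, ad group) pair, which is the unit the later encoding step consumes.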

3.2. Prediction Model

Input Sequence Encoding

To construct input features for the prediction model, we encode the clustered features with the following process. The process uses a multi-layer perceptron for encoding features in the numeric and categorical classes, and a pre-trained language model for the keyword class.
The detailed encoding processes are as follows. First, we denote the numeric class and categorical class as $NUM$ and $CAT$, respectively, and the corresponding features as $x_{num} \in \mathbb{Z}^{|NUM|}$ and $x_{cat} \in \mathbb{Z}^{|CAT|}$, where $|NUM|$ and $|CAT|$ indicate the number of features in each class. In generating $x_{cat}$, for each feature in $CAT$, every category in the corresponding feature is mapped to an integer index, so that $x_{cat} \in \mathbb{Z}^{|CAT|}$ [21].
Then, we encode keyword features through the pre-trained language model, BERT [19]. BERT is composed of a transformer [22] encoder structure, and is trained with self-supervised learning that utilizes large-scale unlabeled mono-corpus. This pre-training process enables BERT to contain a general understanding of the language information [23]. Particularly in this study, we leverage BERT to encode the keyword string to a specific embedding that reflects its semantic information.
Specifically, we denote the keyword class as $KEY$ and the corresponding features as $x_{keyword} \in \mathbb{Z}^{maxlen}$, where the concatenated keyword string is tokenized and mapped into integer indices. The whole length of the tokenized feature is confined to the model's max sequence length $maxlen$. Then, through BERT, the hidden representation of $x_{keyword}$, denoted as $b_{keyword} \in \mathbb{R}^{maxlen \times d_{bert}}$, where $d_{bert}$ indicates the dimension of the BERT model, can be obtained. We define the keyword representation of the whole keyword string as the hidden representation of the first token, $b^{0}_{keyword} \in \mathbb{R}^{d_{bert}}$, and, overloading notation, also denote it as $x_{keyword}$ in what follows.
Given $x_{num}$, $x_{cat}$, and $x_{keyword}$, we generate the output feature $h$ for each clustered group and each date with the following process.
$$
\begin{aligned}
h_{num} &= W_{num} \cdot \log(x_{num}) \\
h_{cat} &= W_{cat} \cdot x_{cat} \\
h_{keyword} &= W_{keyword} \cdot x_{keyword} \\
h &= W_{h} \cdot [h_{num}; h_{cat}; h_{keyword}]
\end{aligned}
\tag{1}
$$
In this equation, $W_{num} \in \mathbb{R}^{hidden \times |NUM|}$, $W_{cat} \in \mathbb{R}^{hidden \times |CAT|}$, $W_{keyword} \in \mathbb{R}^{hidden \times d_{bert}}$, and $W_{h} \in \mathbb{R}^{hidden \times 3\,hidden}$ indicate trainable linear layers, where $hidden$ is the hidden size of the prediction model. To improve generalization, we applied log scaling to every feature in $NUM$. Note that the features in $NUM$ differ considerably depending on the ad group the data belong to. Specifically, for the test dataset in this experiment, the minimum value of impr is 1, whereas its maximum is 29,650. Log scaling relieves this problem and improves the stability of the model output. Through the overall process in Equation (1), we can obtain input features for each adgroup and stat_date. Finally, we construct the sequential input features $h$ of the prediction model by concatenating 20-day features. Specifically, $h$ is defined by $[h_1; h_2; \ldots; h_{20}] \in \mathbb{R}^{20 \times hidden}$.
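The construction of 20-day sequences can be sketched as a sliding window over the per-day features of one ad group. This is an illustration only: the text does not state the exact prediction horizon, so pairing each window with the ROAS of the day immediately after it is an assumption, and `make_windows` is a hypothetical helper name.

```python
def make_windows(daily_features, daily_roas, window=20):
    """Pair each run of `window` consecutive daily features with a target.

    daily_features: list of per-day feature vectors for ONE ad group,
                    ordered by date.
    daily_roas:     list of per-day ROAS values aligned with daily_features.
    The target is the ROAS of the day right after the window (assumption).
    """
    samples = []
    for start in range(len(daily_features) - window):
        sequence = daily_features[start:start + window]
        target = daily_roas[start + window]
        samples.append((sequence, target))
    return samples
```

An ad group with fewer than 21 days of data yields no samples under this scheme.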

3.3. LSTM-Based Prediction Model

For the ROAS prediction model, we adopt the LSTM structure [17] as our baseline model framework. LSTM is a type of recurrent neural network model structure that deals with sequential input structure, where the long-term dependency problem of the conventional RNN is relieved through its elaborate model structures, including forget gate and cell state [18]. In utilizing this, we construct an ROAS prediction model that takes a 20-day sequential input h and returns the future ROAS.
In detail, the forward propagation process of the LSTM, which generates the final prediction result from the input $h$, is as follows. The LSTM mainly comprises a forget gate, an input gate, a cell state, and an output gate. In processing $h_t$, the $t$-th time step of the sequential input $h$, the operations in Equation (2) are performed.
$$
\begin{aligned}
i_t &= \sigma(W_{xi} h_t + W_{hi} l_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} h_t + W_{hf} l_{t-1} + W_{cf} c_{t-1} + b_f) \\
\tilde{c}_t &= \tanh(W_{xc} h_t + W_{hc} l_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_{xo} h_t + W_{ho} l_{t-1} + W_{co} c_{t-1} + b_o) \\
l_t &= o_t \odot \tanh(c_t)
\end{aligned}
\tag{2}
$$
In this equation, $\sigma$ indicates the sigmoid function, and $i_t$, $f_t$, $c_t$, and $o_t$ are the results computed by the input gate, forget gate, cell state, and output gate, respectively. The calculations for the input gate and forget gate are similar to the procedures in the conventional RNN structure. Each token in the encoded input sequence, represented as $h_t$, is processed with the trainable dense matrices $W$. Based on the input gate and forget gate, the cell state, which determines the output representation of each input token, is estimated. By combining these gate values with the encoded representation of the previous token, we obtain the output gate value $o_t$, which is directly related to the final output of the LSTM model $l_t$.
Unlike the conventional RNN model structure, LSTM additionally contains the forget gate structure. This enables the conservation of the previous state and can thereby alleviate the long-term dependency problem, which is regarded as the chronic problem of the recurrent model structure.
The encoded representation of the LSTM model for each time step $t$ is denoted as $l_t$, where $c_0$ and $l_0$ are initialized as zero vectors. Finally, the future ROAS $\hat{y}$ predicted by the LSTM model is estimated from the last hidden state of the sequential input, $l_{20}$, as in Equation (3).
$$
\hat{y} = W_2 \cdot \mathrm{ReLU}(W_1 \cdot l_{20}) \tag{3}
$$
In this equation, $W_1 \in \mathbb{R}^{d_{lstm} \times d_{lstm}}$ and $W_2 \in \mathbb{R}^{1 \times d_{lstm}}$ indicate trainable parameters, where $d_{lstm}$ is the hidden size of the LSTM prediction model. The detailed training procedures are described in the following sections.
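To make the recurrence concrete, the following is a minimal pure-Python sketch of one LSTM step as written in Equation (2), followed by a loop that starts from zero states. It is for illustration only and is not the authors' implementation; a practical model would rely on a deep learning library. The parameter dictionary `p` and its key names (`W_xi`, `b_i`, and so on) are hypothetical, and the cell-to-gate weights are treated as full matrices, which is an assumption.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(h_t, l_prev, c_prev, p):
    """One step of Equation (2); returns (l_t, c_t)."""
    def gate(name, act, use_cell=True):
        z = [a + b + bias for a, b, bias in zip(
            matvec(p["W_x" + name], h_t),
            matvec(p["W_h" + name], l_prev),
            p["b_" + name])]
        if use_cell:
            # Cell-to-gate term on c_{t-1}, as the equation is written.
            z = [zi + ci for zi, ci in zip(z, matvec(p["W_c" + name], c_prev))]
        return [act(zi) for zi in z]

    i_t = gate("i", sigmoid)                         # input gate
    f_t = gate("f", sigmoid)                         # forget gate
    c_tilde = gate("c", math.tanh, use_cell=False)   # candidate cell state
    c_t = [f * c + i * ct
           for f, c, i, ct in zip(f_t, c_prev, i_t, c_tilde)]
    o_t = gate("o", sigmoid)                         # output gate
    l_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return l_t, c_t


def run_lstm(sequence, p, hidden):
    """Unroll over a sequence with l_0 = c_0 = 0 and return the last l_t."""
    l, c = [0.0] * hidden, [0.0] * hidden
    for h_t in sequence:
        l, c = lstm_step(h_t, l, c, p)
    return l
```

With a 20-step input, the value returned by `run_lstm` plays the role of $l_{20}$ in Equation (3).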

3.4. One-Stage Framework

The most straightforward training strategy is the end-to-end method, in which we directly estimate the future ROAS through an LSTM model by feeding it a sequential input. We denote this approach as the one-stage framework. In this setting, the ROAS prediction model $\theta_{e2e}$ is trained with the objective in Equation (4) for each input $x$ and label ROAS $y$ in a given training dataset $D$.
$$
\min_{\theta_{e2e}} \frac{1}{|D|} \sum_{(x,y) \in D} \left(\theta_{e2e}(x) - y\right)^2 \tag{4}
$$
This means that for a given input, the model directly predicts the ROAS that will occur in the future and is optimized to minimize the mean squared loss between the predicted ROAS and the actual ROAS. However, the data distribution poses a problem when this approach is applied to data collected through an advertisement project. Note that, in general, most of the collected data show zero revenue; that is, most of the label ROAS values are zero. This is because, in most cases, exposure to an advertisement does not directly lead to revenue. The majority of customers who access the advertisements do not generate any revenue, and only a tiny percentage of consumers buy the corresponding merchandise and generate revenue.
Specifically, in our dataset, zero-label data account for more than 90% of the whole dataset, as shown in Table 2. Since most of the training data carry the zero label, a prediction model trained with such data may return a zero value regardless of the input data [24]. This implies that the one-stage framework is limited in dealing with data obtained through advertising projects.

3.5. Two-Stage Framework

To alleviate these limitations of the one-stage framework, we propose a two-stage framework that divides the prediction process into two phases: revenue occurrence prediction and occurred ROAS regression [13]. Unlike the one-stage framework, which adopts a single model structure, the two-stage framework utilizes two models, $\theta_{cls}$ and $\theta_{reg}$. We denote these models as the occurrence prediction model and the occurred ROAS regression model, respectively. The main objective of $\theta_{cls}$ is to predict whether the future ROAS will be zero or not, given the previous sequential input features, and $\theta_{reg}$ aims to quantitatively estimate the ROAS only when the future ROAS is judged to be non-zero. These models are trained individually, following their respective training objectives, and are combined in the inference stage. Detailed training procedures are shown in the following sections.

3.5.1. Occurrence Prediction Model

The main objective of $\theta_{cls}$ is to predict the occurrence of future revenue, which coincides with determining whether the predicted ROAS will be zero or non-zero. This means that $\theta_{cls}$ is trained with the same kind of objective as a binary classification task.
Prior to training $\theta_{cls}$, the label ROAS must be pre-processed by mapping all non-zero label ROAS values in the training data to 1. The mapping function $map$ that implements this can be expressed as Equation (5).
$$
map(y) =
\begin{cases}
1 & \text{if } y > 0 \\
0 & \text{if } y = 0
\end{cases}
\tag{5}
$$
For the given training data consisting of input features $x$ and label ROAS $y$, $\theta_{cls}$ is trained to predict $map(y)$. $\theta_{cls}$ returns two outputs that indicate the probabilities of the input being classified as 0 or 1. Specifically, in contrast to Equation (3), the model output of $\theta_{cls}$ is defined by Equation (6).
$$
\theta_{cls}(x) = \mathrm{Softmax}\left(W_2 \cdot \mathrm{ReLU}(W_1 \cdot l_{20})\right) = [\hat{y}^{cls}_0, \hat{y}^{cls}_1] \tag{6}
$$
In this equation, all processes are the same as in Equation (3), except that $W_2 \in \mathbb{R}^{2 \times hidden}$. The classification model $\theta_{cls}$ is then optimized with the binary cross-entropy loss. The training objective for this model can be defined by Equation (7).
$$
\min_{\theta_{cls}} -\frac{1}{|D|} \sum_{(x,y) \in D} \left[ map(y) \log(\hat{y}^{cls}_1) + (1 - map(y)) \log(\hat{y}^{cls}_0) \right] \tag{7}
$$
Through this process, $\theta_{cls}$ can predict whether future revenue will be gained (positive ROAS) or not, given the input features.
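The label mapping and the classification loss can be sketched as follows. This is a minimal standard-library illustration, assuming the classifier produces two raw scores (logits) per sample that are normalized by a softmax; the function names are hypothetical.

```python
import math


def map_label(y: float) -> int:
    """Binary mapping: 1 for positive ROAS, 0 for zero ROAS."""
    return 1 if y > 0 else 0


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def occurrence_loss(batch_logits, batch_roas):
    """Mean binary cross-entropy between softmax outputs and map(y)."""
    loss = 0.0
    for logits, y in zip(batch_logits, batch_roas):
        p0, p1 = softmax(logits)
        t = map_label(y)
        loss += -(t * math.log(p1) + (1 - t) * math.log(p0))
    return loss / len(batch_roas)
```

A classifier that is undecided on every sample (equal logits) incurs a loss of log 2 per sample, which gives a simple reference point when monitoring training.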

3.5.2. Occurred ROAS Regression Model

We then construct the occurred ROAS regression model $\theta_{reg}$, which aims to quantify the future ROAS value accurately. This model quantitatively predicts what the ROAS value will be under the assumption that the ROAS has been determined to be non-zero. That is, only training data with non-zero labeled ROAS values are used for training $\theta_{reg}$. The specific training process is as follows. First, from the original dataset $D$, a dataset $D_{nz}$ containing only non-zero labels is created. This can be defined as Equation (8):
$$
D_{nz} = \{ (x, y) \mid (x, y) \in D \wedge y > 0 \} \tag{8}
$$
Using $D_{nz}$, $\theta_{reg}$ is trained to minimize the mean squared error (MSE) loss between the model output and the actual ROAS value. It has the same training objective as Equation (4), and through this process, $\theta_{reg}$ can quantitatively predict the future ROAS value, given that the ROAS is determined to be non-zero.

3.5.3. Why Task Decomposition?

The two models $\theta_{cls}$ and $\theta_{reg}$ are trained in different ways, with different training objectives, while sharing the common purpose of predicting the future ROAS value based on the given input features. To this end, the outputs of $\theta_{cls}$ and $\theta_{reg}$ are combined to obtain the final prediction output $\hat{y}$ in the inference stage.
To do this, first, $ind$ is generated from the prediction result of $\theta_{cls}$. This value determines whether the predicted ROAS value will be 0 or greater than 0, and is computed as in Equation (9).
$$
ind(x) =
\begin{cases}
0 & \text{if } \hat{y}^{cls}_0 > \hat{y}^{cls}_1 \\
1 & \text{if } \hat{y}^{cls}_0 < \hat{y}^{cls}_1
\end{cases}
, \quad \text{where } [\hat{y}^{cls}_0, \hat{y}^{cls}_1] = \theta_{cls}(x) \tag{9}
$$
$ind(x)$ for each input $x$ is multiplied by the output of $\theta_{reg}$. That is, $ind(x)$ acts as an indicator that determines whether the future ROAS is zero or non-zero. In sum, the final output $\hat{y}$ of the whole model framework is estimated by Equation (10):
$$
\hat{y} = ind(x) \cdot \theta_{reg}(x) \tag{10}
$$
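The inference-stage combination can be sketched as below. The two models are passed in as plain callables, which is an assumption made only for illustration; the text does not define the tie case where both class probabilities are equal, so sending ties to 0 here is also an assumption.

```python
def ind(cls_probs):
    """Indicator from the classifier output [p_zero, p_nonzero]."""
    p_zero, p_nonzero = cls_probs
    # Ties are sent to 0 (assumption; the tie case is not specified).
    return 1 if p_nonzero > p_zero else 0


def predict_roas(x, theta_cls, theta_reg):
    """Final prediction: the regressor output gated by the classifier."""
    return ind(theta_cls(x)) * theta_reg(x)
```

For instance, with a classifier that returns `[0.8, 0.2]` the prediction is 0 whatever the regressor outputs, whereas `[0.3, 0.7]` passes the regressor output through unchanged.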
Through this process, the overall framework can predict the occurrence of future revenue and then regress the ROAS value. One significant benefit of this task decomposition is that up-scaling the non-zero data becomes possible. Up-scaling is a common way to resolve data imbalance in deep learning [25]. When non-zero data are up-scaled in an end-to-end framework, a severe bias interferes with the prediction of the regression model. However, in the binary classification of 0 and 1, up-scaling can be a very effective way to solve the data distribution problem and deal with zero data. When the up-scaling method is applied, the training data of $\theta_{cls}$ is defined as Equation (11).
$$
D_{ups} = [(x, y) \mid (x, y) \in D \wedge y = 0] + [(x, y) \mid (x, y) \in D \wedge y > 0] \times n \tag{11}
$$
This shows that $D_{ups}$ is composed of the original zero data and the up-scaled non-zero data. Precisely, the non-zero data are duplicated $n$ times, where $n$ is the scaling factor that determines the composition of the training data. The training data of $\theta_{cls}$ then has a more even distribution between non-zero-label and zero-label data. $n$ is a hyper-parameter that should be set prior to the training process and is determined empirically in our experiments. By training $\theta_{cls}$ on the up-scaled dataset $D_{ups}$, we can effectively reduce the bias in the distribution of the training data. As a result, $\theta_{cls}$ can distinguish non-zero labels more precisely than in the original setting, in which up-scaling is not applied.
Beyond dealing with the unbalanced data distribution, task decomposition can also improve the performance of $\theta_{reg}$. Since $\theta_{reg}$ is trained with the non-zero dataset $D_{nz}$, under the assumption that the value to predict is non-zero, its predictions are not overly biased toward 0 by the zero data, as they are in the one-stage framework. These points indicate that adopting the two-stage framework can considerably enhance the final performance of the whole prediction model.
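The two training sets used by the two-stage framework can be sketched as follows, following the description above: zero-label samples are kept once and non-zero samples are repeated n times for the classifier, while the regressor sees only non-zero samples. The dataset is assumed to be a list of (x, y) pairs, and `build_training_sets` is a hypothetical helper name.

```python
def build_training_sets(dataset, n):
    """Return (d_ups, d_nz) for the classifier and the regressor."""
    zero = [(x, y) for x, y in dataset if y == 0]
    non_zero = [(x, y) for x, y in dataset if y > 0]
    # Classifier data: original zero samples plus n copies of non-zero ones.
    d_ups = zero + non_zero * n
    # Regressor data: non-zero samples only, without duplication.
    d_nz = non_zero
    return d_ups, d_nz
```

Duplication by list repetition keeps the sketch simple; in practice the same effect could be obtained by sampling non-zero examples more often, which is a different technique from the one described in the text.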

4. Experimental Results

4.1. Data Details

We performed the experiment using data provided by BizSpring. The data consist of user behavior records acquired for each query through ad projects from various organizations. The database was collected from 7046 ad groups from 1 August 2021 to 26 October 2021. It consists of 32,498,758 queries, which we pre-processed using the approach described in Section 3 to generate a dataset of 349,750 queries clustered by date and ad group. We generate sequential data so that ROAS is predicted from 20 days of data. After that, we randomly split the data by ad group to create training, validation, and test datasets. Data statistics are given in Table 2.
We denote the data with an ROAS value of 0 calculated through the data as a zero label, and data with an ROAS value greater than 0 as a non-zero label. As can be seen from the table, most of the data have a label value of 0.

4.2. Training Details

All training and testing were based on pytorch-lightning (accessed on 26 October 2021). For the language model that encodes the input data, we used the KoBERT model released by SKT-Brain (accessed on 26 October 2021). Since the dataset used in the experiment was mostly collected in Korea and consists of Korean keywords, we utilized a language model specialized for Korean. The model is a transformer architecture with 12 encoder layers, a hidden size of 768, and a vocabulary size of 8002.
We set both the hidden size $hidden$ of the LSTM used for the prediction model and the hidden size $d$ of the multi-layer perceptron utilized for the numeric feature encoding to 64. For training, we applied early stopping within 50,000 training steps with a batch size of 32. We adopted the AdamW optimizer [26] with a linear warmup and linear decay [27]. Specifically, the betas and epsilon values for the AdamW optimizer were set to (0.9, 0.999) and $1 \times 10^{-6}$, respectively. The weight decay ratio was set to 0.01, and the learning rate was warmed up for the first 5000 steps. The learning rate was heuristically set to $3 \times 10^{-5}$. All training and inference processes were performed on two RTX A6000s. Each model was trained for 50 h.
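The learning-rate schedule described above can be sketched as a simple function of the training step. This is an illustration under assumptions: the decay is taken to reach zero at step 50,000 (the text gives 50,000 as the early-stopping limit but does not state where the decay ends), and the function name `learning_rate` is hypothetical.

```python
def learning_rate(step, peak_lr=3e-5, warmup_steps=5000, total_steps=50000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        # Warmup phase: grow linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: shrink linearly; total_steps as end point is an assumption.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

Under these assumptions the rate peaks at step 5000 and is half of its peak at step 27,500, the midpoint of the decay phase.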

4.3. Evaluation Details

To evaluate model performance, we employed the mean squared error (MSE), mean absolute error (MAE), and a revised F1-score as evaluation metrics. The MSE and MAE were adopted for assessing the performance of the "Occurred ROAS Prediction Model" and the F1-score for measuring the accuracy of the "Occurrence Prediction model". In this paper, we employed non-zero metrics and mapped metrics to assess aspects of the prediction model that are not clearly captured by conventional evaluation metrics.
The mapped F1-score refers to the accuracy of discriminating between zero ROAS and non-zero ROAS. We measured the agreement between $map(y)$ and $map(\hat{y})$ for the true ROAS value $y$ and the predicted result $\hat{y}$. Note that $map$ in Equation (5) is a binary mapping function that returns 1 for positive inputs and 0 for zero inputs. Specifically, for a given model output $\hat{y}$ and a label $y$, we can estimate the mapped metrics as follows:
$$
\begin{aligned}
\mathrm{TP} &= \sum_{s \in D} \mathbb{1}\left(map(\hat{y}) = 1 \wedge map(y) = 1\right) \\
\mathrm{FP} &= \sum_{s \in D} \mathbb{1}\left(map(\hat{y}) = 1 \wedge map(y) = 0\right) \\
\mathrm{FN} &= \sum_{s \in D} \mathbb{1}\left(map(\hat{y}) = 0 \wedge map(y) = 1\right)
\end{aligned}
$$
$$
\text{Mapped Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \text{Mapped Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}
$$
$$
\text{Mapped F1-score} = \frac{2 \cdot \text{Mapped Precision} \cdot \text{Mapped Recall}}{\text{Mapped Precision} + \text{Mapped Recall}}
$$
In these equations, $\mathbb{1}$ is an indicator function that returns 1 if its argument is true and 0 otherwise. These performance measures serve as indicators of how precisely the model predicts the occurrence of future revenue. In addition, we adopted MSE and MAE for measuring how accurately the model can predict the ROAS value. Beyond the conventional MSE and MAE, we additionally propose non-zero metrics that estimate the error of the model only on the non-zero label data. For each input $x$ and label ROAS $y$ in the whole dataset $D$, and the corresponding model output $\hat{y}$, we can estimate these metrics by the following equations:
$$
D' = \{ (x, y) \mid ((x, y) \in D) \wedge \neg (y = 0) \}
$$
$$
\text{Non-zero MSE} = \frac{1}{|D'|} \sum_{s \in D'} |y - \hat{y}|^2, \qquad \text{Non-zero MAE} = \frac{1}{|D'|} \sum_{s \in D'} |y - \hat{y}|
$$
As most of the label ROAS values $y$ in the whole dataset $D$ are zero, the conventional MSE over the whole test set may be biased toward zero. By excluding zero-labeled values, the non-zero metrics reflect the performance of the prediction model more realistically.
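The mapped and non-zero metrics can be sketched with the standard library as follows. This is a minimal illustration, assuming labels and predictions are given as aligned lists of numbers; treating a non-positive prediction as class 0 and returning 0.0 when a ratio is undefined are assumptions made only to keep the sketch total.

```python
def map_label(y):
    """1 for positive values, 0 otherwise (non-positive -> 0 is assumed)."""
    return 1 if y > 0 else 0


def mapped_scores(y_true, y_pred):
    """Return (mapped precision, mapped recall, mapped F1-score)."""
    tp = fp = fn = 0
    for y, y_hat in zip(y_true, y_pred):
        t, p = map_label(y), map_label(y_hat)
        tp += int(p == 1 and t == 1)
        fp += int(p == 1 and t == 0)
        fn += int(p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def non_zero_errors(y_true, y_pred):
    """Return (non-zero MSE, non-zero MAE) over samples with y != 0."""
    pairs = [(y, y_hat) for y, y_hat in zip(y_true, y_pred) if y != 0]
    if not pairs:
        raise ValueError("no non-zero labels to evaluate")
    mse = sum((y - y_hat) ** 2 for y, y_hat in pairs) / len(pairs)
    mae = sum(abs(y - y_hat) for y, y_hat in pairs) / len(pairs)
    return mse, mae
```

On toy inputs, a predictor that always outputs zero obtains a mapped F1-score of 0.0 even when most labels are zero, which is the failure mode these metrics are meant to expose.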

4.4. Main Results

To determine the effectiveness of the two-stage framework and its up-scaling method, we compare its performance to that of the one-stage framework. In this experiment, we set the up-scaling ratio n empirically to 30. We present the experimental results in Table 3.
As seen from Table 3, the mapped F1-score is almost zero when the one-stage model is used, which indicates that a prediction model trained in an end-to-end fashion cannot predict the occurrence of future profits. This result implies that the end-to-end framework is ill-suited to designing a ROAS prediction model. Although the one-stage framework shows the lowest MSE and MAE, we speculate that the overall MSE is low because most of the labeled ROAS values are zero and the model pushes all predictions toward zero.
We resolve this issue to a degree by applying a two-stage framework. By separating the revenue occurrence prediction model from the occurred revenue regression model, the mapped F1-score rises to 0.27578. This demonstrates that the task decomposition method, which separates the prediction process into two stages, is effective in constructing prediction models. In particular, the two-stage framework with up-scaling of the non-zero data reported the best performance in non-zero MSE, non-zero MAE, and mapped F1-score. This suggests that the performance of the $\theta_{cls}$ model can be greatly improved by applying up-scaling. In addition, as the accuracy of revenue occurrence prediction increased, the model achieved better non-zero MSE and non-zero MAE.
However, applying the up-scaling approach to the one-stage framework alone does not yield significant performance gains. The mapped F1-score of the one-stage model trained with up-scaled data was 0.06116, showing that the zero-label data could not be properly distinguished. The mapped recall of 1.0 indicates that the model learned to output positive, non-zero values for all inputs. These experimental results indicate that the task decomposition-based two-stage framework and the up-scaling proposed in this study are meaningful approaches to building a ROAS prediction model. In particular, they can be used to create predictive models that forecast future ROAS using only existing data.
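The speculation about the one-stage model can be illustrated with a toy calculation. The sketch below uses made-up labels (97 zeros and 3 non-zero values, which are not taken from the paper's dataset) and a hypothetical predictor that outputs zero for every input. It shows how such a predictor can have a small overall MSE while its error on non-zero labels is large and it never detects a revenue occurrence.

```python
# Hypothetical, heavily imbalanced labels: 97 zero-ROAS examples and 3 non-zero ones.
labels = [0.0] * 97 + [2.0, 1.0, 3.0]
# A degenerate predictor that outputs zero for every example.
preds = [0.0] * len(labels)


def mse(y_true, y_pred, keep=lambda y: True):
    """Mean squared error over the examples whose label passes the `keep` filter."""
    errs = [(t - p) ** 2 for t, p in zip(y_true, y_pred) if keep(t)]
    return sum(errs) / len(errs)


overall_mse = mse(labels, preds)                            # averaged over all 100 examples
non_zero_mse = mse(labels, preds, keep=lambda y: y != 0)    # averaged over the 3 non-zero labels

# Count of correctly detected revenue occurrences (true positives after binary mapping).
true_positives = sum(1 for t, p in zip(labels, preds) if t > 0 and p > 0)

print(overall_mse, non_zero_mse, true_positives)
```

With no true positives, the mapped precision, recall, and F1-score of this predictor are all zero (or undefined), even though its overall MSE looks favorable. This is the kind of behavior the mapped and non-zero metrics are intended to expose.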
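A minimal sketch of how the two stages and the up-scaling might fit together is given below. It rests on several assumptions that this section does not confirm: up-scaling is taken to mean repeating each non-zero-label example $n$ times in the training data, the occurrence model is taken to output a probability, and the 0.5 decision threshold is arbitrary. The function names `upscale_non_zero` and `two_stage_predict` are hypothetical.

```python
def upscale_non_zero(examples, n):
    """Repeat each non-zero-label example n times; keep zero-label examples once.

    Assumption: this is one plausible reading of "up-scaling with ratio n".
    `examples` is a list of (features, label) pairs.
    """
    upscaled = []
    for features, label in examples:
        copies = n if label != 0 else 1
        upscaled.extend([(features, label)] * copies)
    return upscaled


def two_stage_predict(occurrence_prob, occurred_roas, threshold=0.5):
    """Combine the two stages into one ROAS prediction.

    Stage 1 (occurrence prediction) decides whether any revenue occurs;
    stage 2 (occurred ROAS regression) supplies the value only when it does.
    The threshold is an assumed hyperparameter.
    """
    return occurred_roas if occurrence_prob >= threshold else 0.0


# Toy training set (hypothetical): two zero-label examples, one non-zero.
train = [("query_a", 0.0), ("query_b", 0.0), ("query_c", 1.5)]
upscaled_train = upscale_non_zero(train, n=30)
print(len(upscaled_train))

# Toy stage outputs (hypothetical) for two test examples.
print(two_stage_predict(occurrence_prob=0.9, occurred_roas=2.5))
print(two_stage_predict(occurrence_prob=0.1, occurred_roas=2.5))
```

Under this reading, the final prediction is exactly zero whenever the occurrence model predicts no revenue, so the regression model never has to learn the zero-label majority.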
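As a quick check on the reported numbers, the mapped precision implied by a mapped F1-score of 0.06116 and a mapped recall of 1.0 can be recovered by inverting the F1 formula. The helper name `precision_from_f1` is hypothetical, and the calculation assumes the reported values are exact.

```python
def precision_from_f1(f1, recall):
    """Invert F1 = 2PR / (P + R) for P, which gives P = F1 * R / (2R - F1)."""
    return f1 * recall / (2 * recall - f1)


implied_precision = precision_from_f1(f1=0.06116, recall=1.0)
print(round(implied_precision, 4))

# Round trip: plugging the implied precision back in reproduces the reported F1.
recomputed_f1 = 2 * implied_precision * 1.0 / (implied_precision + 1.0)
print(round(recomputed_f1, 5))
```

The implied precision is a little over 3%, which is of the same order as the share of non-zero labels in Table 2 (roughly 3 to 4% in each column). This is consistent with a model that marks nearly every example as revenue-generating.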

5. Conclusions

This paper proposed a model that eliminates the need for the controlled experiments used in existing ROAS analysis research and predicts ROAS using data gathered in the course of advertising projects. A two-stage framework and an up-scaling approach for non-zero labeled data were proposed to address the imbalanced label distribution that arises because most of the generally collected data queries generate no revenue. Experiments have shown that the two-stage framework, combined with up-scaling, can effectively mitigate the data imbalance problem and that it performs much better than a predictive model trained with the end-to-end framework. However, despite these improvements, the performance of the final model remains weak: around 0.4 F1-score. In the future, we plan to analyze the effect of input features on the prediction model’s performance and to extract features suited to enhancing it. Additionally, we plan to investigate the impact of the data size and to train a bigger model on a larger dataset.

Author Contributions

Funding acquisition, K.P.; investigation, T.L. and A.S.; methodology, H.M.; project administration, C.P. and J.P.; conceptualization, C.P. and J.S.; software, H.M. and T.L.; validation, S.E. and J.S.; formal analysis, K.P. and J.P.; writing (review and editing), C.P. and J.S.; supervision, K.P. and I.D.A.; data curation, K.O. All authors have read and agreed to the published version of the manuscript.


Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2021-0-01081, Prediction system for real time online marketing perform based on AI), and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1A6A1A03045425).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.


Acknowledgments

Many thanks to the KU NMT Group for taking the time to proofread this article.

Conflicts of Interest

The authors declare no conflict of interest.


  1. Orzan, M.C.; Zara, A.I.; Căescu, Ş.C.; Constantinescu, M.E.; Orzan, O.A. Social Media Networks as a Business Environment, During COVID-19 Crisis. Rev. Manag. Comp. Int. 2021, 22, 64–73. [Google Scholar] [CrossRef]
  2. Krasnov, S.; Sergeev, S.; Titov, A.; Zotova, Y. Modelling of Digital Communication Surfaces for Products and Services Promotion. In IOP Conference Series: Materials Science and Engineering; IOP Publishing: Bristol, UK, 2019; Volume 497, p. 012032. [Google Scholar]
  3. Vaver, J.; Koehler, J. Measuring Ad Effectiveness Using Geo Experiments; Technical Report; Google Inc.: Mountain View, CA, USA, 2011. [Google Scholar]
  4. Kerman, J.; Wang, P.; Vaver, J. Estimating Ad Effectiveness Using Geo Experiments in a Time-Based Regression Framework; Working Paper; Google Inc.: Mountain View, CA, USA, 2017. [Google Scholar]
  5. Blake, T.; Nosko, C.; Tadelis, S. Consumer heterogeneity and paid search effectiveness: A large-scale field experiment. Econometrica 2015, 83, 155–174. [Google Scholar] [CrossRef]
  6. Chen, A.; Longfils, M.; Remy, N. Trimmed Match Design for Randomized Paired Geo Experiments. arXiv 2021, arXiv:2105.07060. [Google Scholar]
  7. Chen, A.; Au, T.C. Robust Causal Inference for Incremental Return on Ad Spend with Randomized Paired Geo Experiments. arXiv 2019, arXiv:1908.02922. [Google Scholar] [CrossRef]
  8. Barajas, J.; Zidar, T.; Bay, M. Advertising Incrementality Measurement Using Controlled Geo-Experiments: The Universal App Campaign Case Study; ACM: Washington, DC, USA, 2020. [Google Scholar]
  9. Chan, D.; Ge, R.; Gershony, O.; Hesterberg, T.; Lambert, D. Evaluating online ad campaigns in a pipeline: Causal models at scale. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–28 July 2010; pp. 7–16. [Google Scholar]
  10. Ravichandran, K.; Thirunavukarasu, P.; Nallaswamy, R.; Babu, R. Estimation of return on investment in share market through ANN. J. Theor. Appl. Inf. Technol. 2005, 3, 44–54. [Google Scholar]
  11. Selvin, S.; Vinayakumar, R.; Gopalakrishnan, E.; Menon, V.K.; Soman, K. Stock price prediction using LSTM, RNN and CNN-sliding window model. In Proceedings of the 2017 International Conference on Advances in Computing, Communications and Informatics (Icacci), Mangalore, India, 13–16 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1643–1647. [Google Scholar]
  12. Sun, C.; Liu, W.; Dong, L. Reinforcement learning with task decomposition for cooperative multiagent systems. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 2054–2065. [Google Scholar] [CrossRef] [PubMed]
  13. Vallon, C.; Borrelli, F. Task Decomposition for Iterative Learning Model Predictive Control. In 2020 American Control Conference (ACC); IEEE: Piscataway, NJ, USA, 2020; pp. 2024–2029. [Google Scholar]
  14. Sen, J.; Dutta, A.; Mehtab, S. Stock portfolio optimization using a deep learning LSTM model. In 2021 IEEE Mysore Sub Section International Conference (MysuruCon); IEEE: Piscataway, NJ, USA, 2021; pp. 263–271. [Google Scholar]
  15. Sen, J.; Dutta, A.; Mehtab, S. Profitability analysis in stock investment using an LSTM-based deep learning model. In Proceedings of the 2021 2nd International Conference for Emerging Technology (INCET), Belgaum, India, 21–23 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–9. [Google Scholar]
  16. Xue, H.; Huynh, D.Q.; Reynolds, M. SS-LSTM: A hierarchical LSTM model for pedestrian trajectory prediction. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1186–1194. [Google Scholar]
  17. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  18. Michańków, J.; Sakowski, P.; Ślepaczuk, R. LSTM in Algorithmic Investment Strategies on BTC and S&P500 Index. Sensors 2022, 22, 917. [Google Scholar] [CrossRef] [PubMed]
  19. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  20. Ramchoun, H.; Ghanou, Y.; Ettaouil, M.; Janati Idrissi, M.A. Multilayer Perceptron: Architecture Optimization and Training. Int. J. Interact. Multimed. Artif. Intell. 2016, 4, 26–30. [Google Scholar] [CrossRef]
  21. Ontoum, S.; Chan, J.H. Personality Type Based on Myers-Briggs Type Indicator with Text Posting Style by using Traditional and Deep Learning. arXiv 2022, arXiv:2201.08717. [Google Scholar]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  23. Jawahar, G.; Sagot, B.; Seddah, D. What does BERT learn about the structure of language? In Proceedings of the ACL 2019-57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
  24. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  25. Yeung, T.S.A.; Chung, E.T.; See, S. A deep learning based nonlinear upscaling method for transport equations. arXiv 2020, arXiv:2007.03432. [Google Scholar] [CrossRef]
  26. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  27. Gotmare, A.; Keskar, N.S.; Xiong, C.; Socher, R. A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. arXiv 2018, arXiv:1810.13243. [Google Scholar]
Figure 1. Overall structure of the whole process.
Table 1. Types of input features. In our experiments, the class for each feature was established following our standard.

| Feature Name | Description | Data Type | Class |
|---|---|---|---|
| stat_date | Dates of data collection | string | cluster |
| adgroup | Ad group id of the data query | string | cluster |
| ad_platform | Ad platform id of the data query | string | categorical |
| ad_program | Ad program id of the data query | string | categorical |
| device | Device that data is collected (Mobile or PC) | string | categorical |
| impr | Ad dwell time of the customer | integer | numeric |
| click | Number of clicks occurred by the customer | integer | numeric |
| rgr | Number of “Sign in” occurred by the customer | integer | numeric |
| odr | Number of orders | integer | numeric |
| cart | Number of “Add to cart” occurred by the customer | integer | numeric |
| conv | Number of conversion occurred by the customer | integer | numeric |
| cost | Cost occurred by the customer | integer | numeric |
| rvn | Revenue occurred by the customer | integer | numeric |
| keyword | Keyword used in searching ad | string | keyword |
Table 2. Data statistics. Each data point was preprocessed by the method suggested in Section 3.

| | | | |
|---|---|---|---|
| # of data points | 334,963 | 7807 | 6980 |
| # of zero labels | 324,672 | 7483 | 6745 |
| # of non-zero labels | 10,291 | 324 | 235 |
| # of ad groups | 6764 | 141 | 141 |
Table 3. Experimental results. Bold values indicate the best performance among the different methodologies.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Moon, H.; Lee, T.; Seo, J.; Park, C.; Eo, S.; Aiyanyo, I.D.; Park, J.; So, A.; Ok, K.; Park, K. Return on Advertising Spend Prediction with Task Decomposition-Based LSTM Model. Mathematics 2022, 10, 1637.

AMA Style

Moon H, Lee T, Seo J, Park C, Eo S, Aiyanyo ID, Park J, So A, Ok K, Park K. Return on Advertising Spend Prediction with Task Decomposition-Based LSTM Model. Mathematics. 2022; 10(10):1637.

Chicago/Turabian Style

Moon, Hyeonseok, Taemin Lee, Jaehyung Seo, Chanjun Park, Sugyeong Eo, Imatitikua D. Aiyanyo, Jeongbae Park, Aram So, Kyoungwha Ok, and Kinam Park. 2022. "Return on Advertising Spend Prediction with Task Decomposition-Based LSTM Model" Mathematics 10, no. 10: 1637.
