Developing Data-Conscious Deep Learning Models for Product Classification

Abstract: In online commerce systems that trade in many products, it is important to classify the products accurately according to the product description. As may be expected, the recent advances in deep learning technologies have been applied to automatic product classification. The efficiency of a deep learning model depends on the training data and the appropriateness of the learning model for the data domain. This is also applicable to deep learning models for automatic product classification. In this study, we propose deep learning models that are conscious of input data comprising text-based product information. Our approaches exploit two well-known deep learning models and integrate them with the processes of input data selection, transformation, and filtering. We demonstrate the practicality of these models through experiments using actual product information data. The experimental results show that the models that systematically consider the input data may differ in accuracy by approximately 30% from those that do not. This study indicates that input data should be sufficiently considered in the development of deep learning models for product classification.


Introduction
Recently, internet commerce has become more active as internet distribution networks have grown. Because of the resulting increase in the volume and variety of products, classification methods have become important. Traditional electronic commerce (e-commerce) is based on business-to-consumer (B2C) transactions, in which classification tasks are straightforward because companies have complete information on the products they sell to customers. In contrast, the recently emerged used-product trading is consumer-to-consumer (C2C) trading, in which any entity can input and register product information: a private seller can arbitrarily enter the product name, detailed description, price, and product category of a product they are selling. Unlike companies, however, private sellers may not have the correct information on the products sold, and the product information entered may be inaccurate.
Inaccurate product classification is disadvantageous for both the seller and the buyer. The appropriate product category can be identified if text data extracted from past used-product transactions are refined and used to train a deep learning model; this is a technical solution to the problem of inaccurate classification. Deep learning models have been developed mainly by two approaches. The first is to apply a well-known deep learning model as-is, tuning its hyperparameters; this approach sets the weights of particular layers appropriately in typical models such as a convolutional neural network (CNN) [1,2]. The second is to develop a technically new model [3,4].
This study found that deep learning model development and hyperparameter tuning should be integrated for the resulting model to be conscious of the product information data.

Related Work
Product classification can be treated as a text classification problem if only the text data of the product information are considered. Early studies on automatic product classification exploited traditional text classifiers such as Naïve Bayes, SVM (support vector machine), and KNN (k-nearest neighbors) [13][14][15]. Aanen, S. et al. [16] exploited lexical similarity to improve the disambiguation of product taxonomies and suggested an ontology-based approach using WordNet. Lee, T. et al. [17] also suggested an ontological approach to classifying products, exploiting a set of ontological data dictionaries for keyword-based product classification.
Recently, various deep learning models have been applied in studies related to product classification. Table 1 summarizes representative related works and the deep learning models or classifiers used therein. It also summarizes whether the approach of each work was to develop a new model or to tune the hyperparameters of an existing model, as mentioned in the previous section.
Krishnan, A. et al. [3] proposed an approach based on multi-CNN and multi-LSTM models (flat models). Hierarchical and flat classification schemes were compared, and the results showed that the multi-CNN and multi-LSTM were more robust than the bag-of-words model (hierarchical model). The paper also proposed a novel method that combines structured attributes with the deep learning models.
Xia, Y. et al. [4] classified product categories using product title data in Japanese, which is an agglutinative language. An attention convolutional neural network (ACNN) was proposed using a conventional CNN model and a gradient boosted tree (GBT) [18]. In the ACNN, a context encoder and an attention mechanism exist in the left and right modules, respectively. Moreover, higher weights are assigned to certain parts of the inputs through embedding, which emphasizes tokens that are semantically highly correlated.
Das, P. et al. [1] discussed data preprocessing to improve classification performance by increasing the quality of the data. In that study, noise data were removed using various methods. The imbalance of the data was quantified, and experiments were conducted using a basic algorithm such as Naïve Bayes (NB) [13] and by additionally applying CNN and GBT. They indicated that performance could be improved further using a method that deletes noise data as stopwords. The related works mentioned above regard automatic product classification as a classification problem [19]. Furthermore, a commonality between these works is that they discussed the effect of various input data on classification models. As deep learning technology has been actively researched in image processing and recognition, some studies suggested exploiting product image data for product classification [20,21]. Ha, J. et al. [20] proposed an RNN-based DeepCNN model for product classification using real data from their own shopping mall. They used text data such as product names and brand names, and the use of image data was also suggested. Their data were imbalanced across the categories, like ours. Note that data imbalance problems are frequently observed in other deep learning-based studies as well [2,8]. Zahavy, T. et al. [21] proposed a multi-modality framework using VGG, a kind of CNN model, for image data as well as text data of product information.

Model Architecture and Dataset Details
Our proposed product classification model is based on an approach that integrates the data selection, transformation, and filtering phases with the inputs of the deep learning model used. Figure 1 illustrates an overview of this architecture. The selection phase involves the selection of appropriate input data as training data, and the transformation phase involves the structural (rather than semantic) transformation of the selected data for their convenient application in training. For example, we (1) removed null data, duplicate data, etc.; (2) removed special characters, emojis, etc., except letters; and (3) used a morphological analyzer to extract only meaningful words or characters as outputs.
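The selection and transformation steps above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the regex and helper names are our own, and the simple whitespace split merely stands in for the morphological analyzer used in the study.

```python
import re

def transform(record):
    """Structural cleaning sketch: strip non-letter symbols, then tokenize.

    A real pipeline would run a morphological analyzer (e.g., a Korean
    tokenizer) to keep only meaningful words; a whitespace split stands
    in for that step here.
    """
    # (2) remove special characters, emojis, etc., keeping letters/digits
    cleaned = re.sub(r"[^0-9A-Za-z가-힣\s]", " ", record)
    # (3) stand-in for morphological analysis: split into word tokens
    return [t for t in cleaned.split() if t]

def select_and_transform(records):
    """(1) Remove null and duplicate records, then transform each one."""
    seen, out = set(), []
    for r in records:
        if r is None or r in seen:
            continue
        seen.add(r)
        out.append(transform(r))
    return out

print(select_and_transform(["hand-made candle!!", "hand-made candle!!", None]))
```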
Appl. Sci. 2021, 11, x FOR PEER REVIEW

Table 1. Studies related to deep learning models for product classification.

After the transformation phase, the filtering process is performed to filter out and remove the stopwords (which are considered unnecessary for training) from among the words. Upon completion of the transformation phase, ancillary tasks such as tokenization, padding, and tensorization are performed. Finally, the results are used as the input data to train the deep learning models. Various deep learning models can be employed; the models we used are presented in the next section.
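A minimal sketch of the ancillary tokenization and padding tasks (vocabulary construction, index encoding, and padding to a fixed length). The helper names and the special tokens are illustrative assumptions; tensorization would simply wrap the resulting integer lists in the framework's tensor type.

```python
def build_vocab(token_lists):
    """Map each token to an integer id; reserve ids for padding and
    out-of-vocabulary tokens."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode_and_pad(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, and right-pad."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

vocab = build_vocab([["candle", "soy"], ["soy", "wax"]])
print(encode_and_pad(["soy", "candle", "new"], vocab, 5))
```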
In this study, we used real-world data provided by a South Korean used-product trading platform company [12]. The data contained the information of approximately 2.9 million products from the second half of 2019. The product information data comprised the attributes "name", "keyword", and "description" in text format, and "category" in integer format. The attributes "name", "keyword", and "description" provided each product's name, keyword, and description, respectively, whereas the attribute "category" provided the product classification number to which each product belonged. For the product classification, we used the second-level classification of a commonly used three-level hierarchy system [1,4]. The total number of second-level categories was around 120. Note that there are seven ways of simply combining the training data for the selection process, as there are three attributes of data: "name", "keyword", and "description".

Figure 2 shows the distribution of the number of product data by category in descending order. The data are distributed disproportionately across the categories, similar to other datasets [1,2,4]. Figure 3 shows the distribution of the number of words for the name, keyword, and description attributes. In the figure, the distributions before and after the transformation phase are represented by blue and orange, respectively. The blue and red vertical lines represent the average number of words per product data for each attribute before and after the transformation phase, respectively. After the process, the average numbers of words for the name, keyword, and description attributes changed to 5.2, 6.6, and 39.9, respectively. In the case of the name and keyword attributes, the number of words increased because these were split into nouns through the transformation process. However, in the case of description, the number of words decreased because words such as postpositions were removed.
In Figure 3, we omit the parts that are difficult to display on the graph because of low word-frequencies.
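The seven attribute combinations follow from choosing any non-empty subset of the three attributes (2^3 - 1 = 7), which can be enumerated directly:

```python
from itertools import combinations

# The three text attributes available for the selection phase.
attributes = ["name", "keyword", "description"]

# Every non-empty subset of the attributes: 2^3 - 1 = 7 combinations.
combos = [c for r in range(1, len(attributes) + 1)
          for c in combinations(attributes, r)]
print(len(combos))
```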
Table 2 describes the difference in the data before and after the transformation phase and shows an example of the product data. The data were originally in Korean. Although English translations are provided, there is a limitation: the characteristics of Korean cannot be presented.

An Example of Data Transformation (for the Product Description Attribute)
Before transformation: hand-made candle wt na~tu~ral soywax pre-mium fragrant oils
After transformation: handmade candle soy wax premium fragrance oil

In the filtering phase, stopwords were selected based on various criteria to conduct a case study on the influence of the stopwords (see Section 4.2). Figure 4 shows the results of an analysis of the frequencies of all the words appearing in the experimental data, sorted in descending order of frequency. Figure 4a shows the frequency (in logarithmic form) of words combining all the attributes. The vertical lines in (a) represent the proportion of the top 10% frequency words among all the words for description, keyword, and name (left to right): approximately 13.5 K, 15.2 K, and 17.4 K words, respectively. The top 10% of words of the three attributes exist within approximately 13.5% of all of the words (129 K words). Figure 4b shows the frequency (in logarithmic form) of words for each attribute. Among these words, those corresponding to the top 1% (1.3 K words) were selected; these are stopwords corresponding to six criteria and were determined to be non-influential in the data training. Furthermore, after selecting various stopwords according to the criteria, we continued the study following the procedure shown in Figure 1.
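A frequency-based stopword selection like the one described (taking the most frequent words up to a given ratio of the vocabulary) can be sketched as follows; the function names and the ratio parameter are illustrative, not the authors' implementation.

```python
from collections import Counter

def top_frequency_stopwords(token_lists, top_ratio=0.01):
    """Return the set of words in the top `top_ratio` of the vocabulary
    by frequency (e.g., 0.01 selects the top 1%, 1.3 K words in our data)."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    k = max(1, int(len(counts) * top_ratio))
    return {w for w, _ in counts.most_common(k)}

def filter_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]

docs = [["a", "a", "a", "b"], ["a", "c", "b"]]
sw = top_frequency_stopwords(docs, top_ratio=0.34)
print(sw, filter_stopwords(["a", "b"], sw))
```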

Models for Baseline
In this study, we used NB, CNN, and Bi-LSTM as models for product classification. Naïve Bayes (NB). We applied NB as a baseline model for comparison with the deep learning models proposed in this paper. We chose it as the baseline non-deep learning model, similar to the approach described in [13,15]. In the experiments, NB was applied after computing TF-IDF (term frequency-inverse document frequency) features [22].
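As an illustration only, a TF-IDF + NB baseline of this kind could be realized with scikit-learn's `TfidfVectorizer` and `MultinomialNB`; the toy texts and labels below are invented, and the paper does not specify its exact implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy product names and category labels (illustrative only).
texts = ["handmade candle soy wax", "premium fragrance oil",
         "used laptop 15 inch", "gaming laptop keyboard"]
labels = ["home", "home", "electronics", "electronics"]

# TF-IDF features feed directly into the Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
pred = model.predict(["soy wax candle"])[0]
print(pred)
```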
Convolutional Neural Network (CNN). The CNN model was designed primarily to process images effectively. A convolutional layer that extracts feature values and a pooling layer that reduces the layer size are configured repeatedly. At the end of the model, the resulting features are flattened into a one-dimensional array in which all the nodes are connected; this is called a fully-connected layer. Figure 5a shows the architecture of the CNN model employed in this paper.
After the filtering method is applied to an original image, the classification operation is performed on the filtered image. In this study, among various CNN configurations, the channel size was set to one to make the model suitable for text. The width was set to 300, which is equal to the embedding size. The embedding layer was applied, and the dropout function was executed with a value of 0.3. The inputs based on the set attribute values were combined into a single layer. After transforming the three-dimensional tensor into a four-dimensional tensor, the dropout function was executed. After the convolution layers with the kernel size list of the input attributes were developed, the tensor was reduced to two dimensions. After inserting max pooling layers, the resulting tensors were combined into a single layer. The linear layers were processed repeatedly. Finally, the Softmax function was applied for normalization and output.
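A sketch of a text CNN matching the settings described in the text (embedding width 300, a single input channel, convolutions over a list of kernel sizes followed by max pooling, dropout of 0.3, and a Softmax output), written in PyTorch. The vocabulary size, kernel sizes, filter count, and the 120-category output are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Illustrative text CNN: one input channel, embedding width 300,
    per-kernel convolution + max pooling, dropout 0.3, Softmax output."""
    def __init__(self, vocab_size=1000, embed_dim=300,
                 kernel_sizes=(3, 4, 5), num_filters=100, num_classes=120):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.dropout = nn.Dropout(0.3)
        # One Conv2d per kernel size; each spans the full embedding width.
        self.convs = nn.ModuleList(
            nn.Conv2d(1, num_filters, (k, embed_dim)) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                        # x: (batch, seq_len)
        e = self.dropout(self.embedding(x))      # (batch, seq, 300)
        e = e.unsqueeze(1)                       # add channel dim -> 4-D
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(e)).squeeze(3)       # (batch, filters, seq-k+1)
            pooled.append(F.max_pool1d(c, c.size(2)).squeeze(2))
        out = self.fc(self.dropout(torch.cat(pooled, dim=1)))
        return F.softmax(out, dim=1)             # normalized class scores

logits = TextCNN()(torch.randint(0, 1000, (2, 20)))
```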



Bidirectional Long Short-Term Memory (Bi-LSTM). LSTM is a type of recurrent neural network (RNN) [23]. It was designed to solve the long-term dependency problem, which occurs when the time gap for obtaining necessary information is large. It has an architecture wherein a cell state is added to the hidden state of an RNN. The limitation of LSTM is that it is executed based only on the immediately preceding pattern; Bi-LSTM was designed to overcome this limitation by performing forward and backward training through the RNN. One of its advantages is that the performance does not decrease even when the input sequences are long. Figure 5b shows the architecture of the Bi-LSTM model employed in this paper.
In this study, the Bi-LSTM is configured to process the input in both directions. The hidden size is set to 64, and embedding is performed. The inputs based on the set attribute values are combined into a single layer, which is entered into the LSTM layer and processed. Two pooling layers are then passed through, outputting the mean and the maximum, respectively, and these are combined back into one layer. This linear layer is processed repeatedly. Subsequently, the dropout function is executed with a value of 0.3 [5]. Finally, the linear layer is passed through one more time, which completes the process.
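A sketch of the Bi-LSTM described above (hidden size 64, bidirectional processing, mean and max pooling over time combined into one layer, dropout of 0.3, and final linear layers), again in PyTorch. The vocabulary size, embedding width, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Illustrative Bi-LSTM: hidden size 64, bidirectional, mean+max
    pooling concatenated, dropout 0.3, then a final linear layer."""
    def __init__(self, vocab_size=1000, embed_dim=300, hidden=64,
                 num_classes=120):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        # Mean and max pooling over both directions -> 4 * hidden features.
        self.linear = nn.Linear(4 * hidden, 2 * hidden)
        self.dropout = nn.Dropout(0.3)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                       # x: (batch, seq_len)
        h, _ = self.lstm(self.embedding(x))     # (batch, seq, 2*hidden)
        mean_pool = h.mean(dim=1)               # average over time
        max_pool, _ = h.max(dim=1)              # maximum over time
        z = torch.cat([mean_pool, max_pool], dim=1)
        z = self.dropout(torch.relu(self.linear(z)))
        return self.out(z)                      # one more linear layer

scores = BiLSTMClassifier()(torch.randint(0, 1000, (2, 20)))
```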

Performance Results
Based on the data-conscious method proposed in this paper, we compared and evaluated the performance of each model trained by plugging in the CNN and Bi-LSTM models presented in the previous section. The NB model was considered as the baseline. We measured the accuracy as the performance metric. All the experiments were conducted using a GPU (Tesla P100-PCIE-16 GB RAM) provided by Google Colab Pro (Google Colaboratory, https://colab.research.google.com/).
We performed the experiment to determine the performance of each model for the selection, transformation, and filtering processes. Table 3 shows the accuracy for each combination method and the training time required for each method. For example, the second row of Table 3 shows that the accuracy of the model trained using the CNN deep learning model up to the selection phase is 48.67% in the worst case, 74.25% in the best case, and 67.74% on average. If this CNN deep learning model is used in the selection and transformation phases, the accuracy is 58.92% in the worst case, 78.69% in the best case, and 73.14% on average. These results demonstrate that if the CNN deep learning model is used indiscriminately without considering the data as a product classification model, its accuracy can differ by up to approximately 30% from that of the CNN model that effectively considers the data in both the selection and transformation phases. The average accuracy increases by 7.4% after the selection-transformation phases compared with that after only the selection phase. Similar results are obtained when the Bi-LSTM deep learning model is used: the maximum and average differences are 29% and 7.3%, respectively.

Table 3. Comparison of accuracy and training time between deep learning models (time: hh:mm:ss).

The final average accuracy is 52.28% in the case wherein the NB model is used. However, it is 73.45% and 72.32% in the cases where the proposed CNN and Bi-LSTM models, respectively, are used. Furthermore, when the baseline and deep learning models are compared, the minimum and maximum differences in accuracy are 10.79% and 31.09%, respectively. These results verify that the method of training the deep learning models is effective for increasing the classification accuracy. Furthermore, it was observed that the two models are highly suitable for the methodology and product data used in this study.
A comparison of the accuracy of the deep learning models reveals that the CNN model shows an accuracy that is marginally higher (an average difference of 1.12%) than that of the Bi-LSTM model. A performance comparison between the selection-transformation phases and the selection-transformation-filtering phases for each model reveals that the Bi-LSTM shows an increase, although it is marginal: for example, the average accuracy increases from 71.91% to 72.32%. However, in the case of the CNN model, the increase is smaller than that for the Bi-LSTM and negligible. These results indicate that attention should be paid to the selection of stopwords for removal in the filtering phase. The next section presents a case study of this.

The deep learning models were also examined in terms of training time. The training time for each case is shown in the cell below that with the relevant accuracy in Table 3. For example, the training time required for the CNN model that showed the best accuracy after the filtering phase was approximately 6 min shorter than that of the Bi-LSTM for the same case. Although we could not present all the cases in the table, the time required for training after completing the filtering phase was shorter in the CNN model than in the Bi-LSTM model. The difference in training time is likely to be larger if the data size increases beyond that in this experiment.

Case Study of Filtering Efficiency
In this section, we present a case study conducted to examine the impact of the filtering process on the performance of the product classification models. As in the previous section, accuracy was used as the performance metric.
For the top 1% of the most frequent words, i.e., 1.3 K words, we increased the number of filtered-out stopwords in increments of the top 10% of the highest-frequency words and examined the accuracy for each case. Figure 6 shows the result of using the CNN model; a similar trend is displayed when the Bi-LSTM model is used. As shown in the figure, the accuracy decreases as the number of stopwords increases by the top 10% each time. It can also be observed that after the 70% setting is exceeded, the rate of decrease is relatively flat compared with the 10-70% settings. When words with high frequencies are filtered out as stopwords, the impact on the accuracy should be more significant; in our data, we observed that the words displaying those frequencies account for approximately 70% of the distribution.
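The incremental filtering experiment above can be sketched as a schedule of growing stopword sets, each step adding the next 10% slice of the top-frequency pool; the function name and parameters are illustrative.

```python
from collections import Counter

def stopword_schedule(counts, top_ratio=0.01, steps=10):
    """Yield growing stopword sets: step i contains the first i slices
    (each 1/steps of the pool) of the top `top_ratio` most frequent words
    (the 1.3 K-word pool in our experiment). Each set would then be used
    to filter the training data before measuring accuracy."""
    pool = [w for w, _ in
            counts.most_common(max(1, int(len(counts) * top_ratio)))]
    step = max(1, len(pool) // steps)
    for i in range(1, steps + 1):
        yield set(pool[:min(i * step, len(pool))])

# Toy word counts: 100 distinct words with strictly decreasing frequency.
counts = Counter({f"w{i}": 100 - i for i in range(100)})
sets = list(stopword_schedule(counts, top_ratio=0.1, steps=10))
print([len(s) for s in sets])
```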
Figure 6 shows a case where the experiment considers only the frequency of the filtered-out stopwords, irrespective of the syntactic structure or semantics of those words. This leads to the following question: would the performance improve if we filtered out words that are syntactically malformed, composed of unnecessary characters, or only weakly semantically related to the category that the product belongs to? We therefore conducted experiments to determine the performance of the models when words considered syntactically or semantically unnecessary were removed using heuristics from among the top 1% of words by frequency, similar to the previous experiment. Table 4 shows the types of heuristics, stopword examples, and the number and proportion of words removed as stopwords. The data refining process was conducted with the list of stopwords created for each criterion, and the training process was then performed. Table 5 shows the accuracy measured after filtering with each stopword criterion for the NB, CNN, and Bi-LSTM models. Without loss of generality, in this experiment we used only the attributes "name" and "keyword". In Table 5, the average accuracy of the CNN and Bi-LSTM models for criteria (1)-(6) is 0.7799 and 0.7756, respectively.
The stopword filtering criterion with the highest prediction accuracy is criterion (4), i.e., words composed of a number followed by a unit. In general, the words corresponding to criterion (4) convey detailed product descriptions such as size and price (e.g., "100 cm" and "1000 won").
These words are generally not significant for learning the products and are of low importance. Furthermore, a comparison between the results in Table 5 and the accuracy before the filtering phase in Table 3 does not reveal a significant difference. From training with various criteria, we conclude that improving training performance through stopword removal is challenging; it requires more extensive analysis and experiments.
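A heuristic such as criterion (4) can be expressed as a simple pattern match. The sketch below assumes that a number and its unit appear as a single token after tokenization, and the unit list is illustrative, not the paper's exact rule set.

```python
import re

# Hypothetical pattern for criterion (4): a token consisting of a number
# immediately followed by a unit string (e.g., "100cm", "1000won").
# The set of units shown here is an assumption for illustration.
NUMBER_UNIT = re.compile(
    r"^\d+(\.\d+)?(cm|mm|m|kg|g|ml|l|won|inch)$", re.IGNORECASE
)

def is_number_unit(token: str) -> bool:
    """Return True if the token matches the number+unit heuristic."""
    return bool(NUMBER_UNIT.match(token))
```

Tokens flagged by such a predicate would be added to the stopword list before the filtering phase.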

Conclusions
A substantial amount of data is produced every day at actual industrial sites. Problem solving based on deep learning is favorable in terms of both improved analysis performance and reduced time. The online commerce domain, which requires rapid and precise automatic product classification, is no exception. The efficiency of a deep learning model depends on the training data and the appropriateness of the learning model for the product data. In this study, we proposed deep learning models that are conscious of input data comprising text-based product information.
Our approaches exploit two well-known deep learning models and integrate them with processes of input data selection, transformation, and filtering. Through experiments using actual product information data, we demonstrated that models that systematically consider the input data may differ in accuracy by approximately 30% from those that do not. The model architecture proposed in this study consists of selection, transformation, and filtering phases. Other ways of integrating these approaches are possible, in which some phases are further decomposed or merged with each other. For example, in this study we conducted a case study to examine how the filtering phase affects the overall accuracy of product classification. Developing other data-conscious architectures for deep learning-based product classification is a direction for future research.
The results of this study can be applied to methods for refining text data and classifying categories in e-commerce as well as other fields. In the case of social media, deep learning models can be trained on user-composed texts to classify the category of those texts or to determine whether they require filtering for pornography, violence, racism, etc. In future work, the characteristics of the text data used in this study should be subcategorized further, and the data cleaning process should be further refined. Apart from the two deep learning models used in this study, other models can be implemented to expand follow-up studies.

Conflicts of Interest:
The authors declare no conflict of interest.