Machine Learning-Based Malicious X.509 Certificates’ Detection

Abstract: X.509 certificates play an important role in encrypting data transmitted between both parties over HTTPS. With the popularization of X.509 certificates, more and more criminals leverage certificates to prevent their communications from being exposed by malicious traffic analysis tools. Phishing sites and malware are good examples. X.509 certificates found in phishing sites or malware are called malicious X.509 certificates. This paper applies different machine learning models, including classical machine learning models, ensemble learning models, and deep learning models, to distinguish between malicious and benign certificates with Verification for Extraction (VFE). The VFE is a system we design and implement to obtain plentiful characteristics of certificates. The results show that ensemble learning models are the most stable and efficient, with an average accuracy of 95.9%, which outperforms many previous works. In addition, we obtain an SVM-based detection model with an accuracy of 98.2%, the highest among all models. The outcome indicates that the VFE is capable of capturing essential characteristics of malicious X.509 certificates.

on HTTPS reaches approximately 74%, which is high enough to warrant more attention. In addition to phishing sites, the communication between malware and its Command and Control (CC) servers frequently leverages X.509 certificates to avoid detection by traffic analysis tools. The SSL Blacklist (SSLBL) project exposes many X.509 certificates used by botnet CC servers to exchange information with malware installed on remote bots. The number of such certificates is increasing over time. We call the certificates found during malicious activities such as phishing and malware communication malicious certificates or abnormally used certificates. Shielded by malicious certificates, phishing sites are more likely to be trusted by web browsers and can even impersonate popular websites and steal private information, which causes enormous damage to the Internet. X.509 certificates were indeed created to safeguard against malicious network impersonators. However, a trusted Certificate Authority (CA) may reissue X.509 certificates for a website by accident, which provides a chance to impersonate this website with the reissued certificate. For example, in 2011, the CA Comodo was compromised and reissued certificates for Microsoft and Google (Source: https://blog.comodo.com/other/the-recent-ra-compromise/). Suppose an attacker holds a reissued certificate for Google. In that case, the attacker can set up a phishing website with the same interface as Google and use the reissued certificate during the HTTPS connection, which is trusted by browsers. If users log in to this phishing website with their accounts, their information is stolen. The details of the HTTPS connection with the X.509 certificate are described in RFC 5280 [1]. In addition, the communications between malware and CC servers are unobservable, which is destructive because it enables snooping on private data and stealing important information.
Detecting those malicious certificates is therefore imperative and significant work.
There are many solutions for detecting malicious certificates. The most straightforward method is filtering against known malicious ones. For example, Ghafir et al. [2] proposed a Malicious SSL certificate Detection (MSSLD) model that detects Advanced Persistent Threat (APT) communications by filtering connections against a blacklist of malicious SSL certificates. In addition, many researchers have applied machine learning methods to detect malicious certificates. Categorized by the type of detection model, machine learning-based methods fall into two groups: those that use classical machine learning models and those that apply neural networks. For example, Mishari et al. [3] applied classical machine learning models to train separate models for distinguishing phishing and typosquatting with gathered certificates. Dong et al. [4] used deep neural networks to train rogue certificate detection models with artificial rogue certificates. The details of previous methods are described in Section 2.
Although those previous methods achieve high accuracy in detecting malicious certificates, they fall short in settings that demand very strict detection accuracy. Moreover, filtering against known malicious certificates performs poorly when new malicious certificates are encountered. Both phishing certificates and malware-used certificates are malicious certificates. A detection method that can handle those two types of malicious certificates together saves time and effort. Therefore, we combine phishing certificates and malware-used certificates as malicious certificates to train models. Furthermore, ensemble learning models have shown excellent performance in many competitions. To improve the detection accuracy as much as possible, we leverage ensemble learning models to detect malicious certificates. In addition, the features selected for training models in previous work vary widely, and most feature sets are insufficient.
To solve these problems and achieve our objectives, we apply machine learning models, including classical models, ensemble learning models, and deep learning models, to detect malicious certificates, which combine phishing certificates and malware-used certificates. Features play a crucial role in training different models. Because previous methods lack comprehensive and fine-grained feature extraction for X.509 certificates, we design and implement Verification for Extraction (VFE). The VFE analyzes the basic fields of certificates, checks certificates' conformance to the criteria defined in RFC 5280, constructs and verifies the certificate chain, and records information for feature extraction. The results show that the VFE can capture crucial traits of certificates. The best model achieves an accuracy of 98.2% in distinguishing malicious from benign certificates, which is higher than the state of the art. Torroledo et al. [5] implemented a system capable of identifying malware certificates with an accuracy of 94.87%. Fasllija et al. [6] built a phishing detection system with an accuracy of approximately 91%. Our contributions can be summarized as follows:
1. Design and implement the VFE for verifying certificates and extracting their characteristics. The results show that features extracted by the VFE are good at distinguishing malicious certificates from benign certificates.
2. Apply and compare various machine learning models for malicious certificate detection. We tested the performance of different models in malicious certificate detection and gained a good grasp of their advantages and disadvantages.
3. Find the best model for detecting malicious certificates, with an accuracy of 98.2%. After trying and optimizing different models, we found the model with the highest accuracy.
The remainder of this paper is organized as follows. In Section 2, we review related work in this field. Then, in Section 3, we describe the design and implementation of the VFE in detail. We analyze all the features extracted by the VFE and explain the reasons for choosing them in Section 4. In Section 5, we describe the selected machine learning models and their structures. Our experiments, including data collection, experimental design, and experimental results, are presented in Section 6. In Section 7, we conclude the paper.

Related Work
Malicious certificate detection is a complementary and effective method for identifying phishing websites and communications between malware and their CC servers. There have been many studies in this field. Apart from filtering against known malicious certificates, many researchers address this problem with machine learning. Those machine learning methods differ in data sources, feature engineering, model selection, and the certificates they focus on.
Ghafir et al. [2] proposed a Malicious SSL certificate Detection (MSSLD) model that detects Advanced Persistent Threat (APT) communications by filtering connections against a blacklist of malicious SSL certificates. To enhance detection ability, they updated the blacklist of malicious SSL certificates from different sources each day at 3:00 am. It is the most direct way to find APT communications. However, this model relies on malicious certificates from other sources, so it cannot ensure timely detection when new malicious certificates are encountered.
Mishari et al. [3] trained Random Forest, Decision Tree, and Nearest Neighbor models to detect web-fraud domains. They collected nine certificate features and some sub-features to train the models, and they analyzed the selected features in detail. The results show that the highest phishing detection accuracy is 88%.
Xianjing et al. [7] analyzed attribute correction of certificates in the certificate chain and built up a probabilistic model SSLight to model attribute correction. Training the SSLight with a large number of regular certificates, they applied it to detect fake certificates. The model built in this paper is comprehensive and demands extensive data for training.
Dong et al. [8] designed and implemented a real-time detection system consisting of a certificate downloader, a feature extractor, a classification executor, and a decision-maker. In addition, they collected 95,490 phishing certificates and 113,156 non-phishing instances to train Decision Tree, Random Forest, Naive Bayes Tree, Logistic Regression, Decision Table, and K-Nearest Neighbors models. One advantage of this work is the timeliness of detection.
Fasllija et al. [6] made use of Certificate Transparency (CT) logs to extract features and train classical models to detect phishing certificates. One innovation of this research was that their models divided certificates into five categories.
In another paper, Dong et al. [4] applied deep neural networks to train detection models for rogue certificates. To address the scarcity of rogue certificates, they constructed rogue certificates by slightly modifying the content of benign certificates, and the resulting models performed outstandingly. Their experiments obtained a highest accuracy of 97.7%.
Torroledo et al. [5] leveraged long short-term memory (LSTM) to extract features from subject and issuer information. They used those features combined with other numerical features to train models for detecting phishing certificates and malware-used certificates, respectively. This work inspires our research, and we change the method to deal with subject information.
However, all these previous works on detecting malicious certificates suffer from the disadvantages introduced in Section 1. To cope with these drawbacks, we design and implement the VFE to analyze the basic fields of certificates, check certificates' conformance to the criteria defined in RFC 5280, and construct and verify the certificate chain. During this process, we collect plentiful features of certificates, which constitutes our feature engineering. With sufficient features, we select classical machine learning models, ensemble learning models, and deep learning models to detect malicious certificates. In addition, we propose some novel tricks for extracting features from the subject and issuer information of certificates.
The detection of malicious certificates is beneficial for detecting phishing sites and malware. Other studies detect phishing sites or malware without X.509 certificates. Hutchinson et al. [9] proposed a method that uses features of the URL to detect phishing URLs. They considered different feature sets with Random Forest (RF), and the best model achieved an accuracy of 96.5%. Kulkarni et al. [10] implemented four classifiers in MATLAB to detect phishing websites with a dataset from the machine learning repository of the University of California, Irvine. Among the four classifiers (Decision Tree, Naïve Bayesian classifier, SVM, and neural network), the Decision Tree reached the highest accuracy of 91.5%.

Verification for Extraction (VFE)
The VFE is implemented to analyze certificates' basic fields, check whether certificates conform to the constraints specified in RFC 5280, and construct and verify the certificate chain. During those processes, we capture as many characteristics of certificates as possible for feature extraction, which is consistent with our goal of covering comprehensive aspects of X.509 certificates. The overall architecture of the VFE includes four parts: basic analysis, standard checking, certificate chain construction, and certificate chain validation. The basic analysis part parses the primary fields in the content of certificates for feature use. The standard checking part examines certificates' conformance to RFC 5280, covering not only the types of fields but also the presence of some fields. The certificate chain construction part builds the certificate chain using two methods. One is searching a certificate database, and the other is obtaining upper-level certificates with the help of Authority Information Access (AIA: information for obtaining the issuer's certificate). The certificate chain validation part verifies the certificate chain in multiple dimensions, including certificate policy mapping, path length constraints, name constraints, etc. The rest of this section describes the VFE in detail.

Basic Analysis
The basic analysis part parses the basic fields of a certificate, including certificate version, validity period, serial number, public key, subject information, issuer information, extensions, error information, existence information, etc. Table 1 shows the details of the basic fields. Those basic fields are the contents of the certificate itself; they provide basic information about certificates and play an essential role in the following parts. We accomplish this part with the help of the Python package of OpenSSL (OpenSSL: https://www.openssl.org/). The basic analysis part takes DER (DER: binary encoding scheme of X.509 certificates) or PEM (PEM: Base64 encoding scheme of X.509 certificates) certificates as inputs and outputs the contents of certificates with OpenSSL's help.
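As a minimal sketch of the input handling only, the following shows one way to accept either encoding and normalize it to DER before parsing. It uses Python's standard `ssl` and `hashlib` modules rather than the OpenSSL package used in the paper, and the function names `load_certificate_der` and `sha256_fingerprint` are hypothetical, introduced for illustration.

```python
import hashlib
import ssl

PEM_HEADER = b"-----BEGIN CERTIFICATE-----"


def load_certificate_der(raw: bytes) -> bytes:
    """Return DER bytes for a certificate supplied as PEM or DER."""
    text = raw.lstrip()
    if text.startswith(PEM_HEADER):
        # PEM is Base64-encoded DER wrapped in BEGIN/END marker lines.
        return ssl.PEM_cert_to_DER_cert(text.decode("ascii"))
    # Assumption: anything without a PEM header is already DER.
    return raw


def sha256_fingerprint(der: bytes) -> str:
    """Hex SHA-256 digest of the DER encoding, usable as a lookup key."""
    return hashlib.sha256(der).hexdigest()
```

Normalizing to DER first means that every later step (field parsing, fingerprinting, chain lookups) can work on a single representation.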

Standard Checking
The standard checking part checks certificates against the criteria of RFC 5280. For example, it checks whether the serial number is a positive integer no longer than 20 octets, whether certificates of versions 1 and 2 have extensions, and whether the length of the explicit text in the certificate policy exceeds 200. Table 2 shows the details of the checked items. In short, this part collects the restrictions of RFC 5280 and checks whether certificates conform to them. This is a necessary step because RFC 5280 is the common agreement on X.509 certificates.

| Name | Description | Value Type |
| --- | --- | --- |
| serial_number_not_conforming | Whether serial number is positive and no longer than 20 octets | integer |
| after_smaller_than_before | Whether after-time is smaller than before-time in the validity period | integer |
| extension_exist_in_wrong_version | Whether extensions match the certificate version | integer |
| decipher_and_encipher_error | Whether the appearance of decipher and encipher is conforming | integer |
| explicit_text_exceed | Whether the length of explicit text exceeds 200 | integer |
| only_consist_reasons_error | Whether cRLDistributionPoint consists only of reasons | integer |
| keycertsign_not_conform_ca | Whether the setting of keycertsign conforms with the ca field | integer |
| cA_with_empty_subject | Whether subject information is empty and the subject is ca | integer |
| CRLissuer_with_empty_subject | Whether subject information is empty and the subject is CRLissuer | integer |
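A few of these checks can be sketched as follows. This is only an illustration under stated assumptions: `ParsedCert` is a hypothetical container for already-parsed fields, a flag value of 1 is assumed to mean "violation found", and the serial number length is measured as the content octets of its DER INTEGER encoding.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ParsedCert:
    """Hypothetical container for fields produced by the basic analysis."""
    version: int              # 1, 2, or 3
    serial_number: int
    not_before: datetime
    not_after: datetime
    has_extensions: bool
    explicit_text: str = ""   # explicitText of a policy qualifier, if any


def der_integer_octets(value: int) -> int:
    """Content octets of a DER INTEGER for a positive value (sign bit included)."""
    return value.bit_length() // 8 + 1


def standard_checks(cert: ParsedCert) -> dict:
    """Return integer flags; 1 is assumed to mean the rule is violated."""
    serial_bad = cert.serial_number <= 0 or der_integer_octets(cert.serial_number) > 20
    return {
        "serial_number_not_conforming": int(serial_bad),
        "after_smaller_than_before": int(cert.not_after < cert.not_before),
        # Extensions are only allowed in version 3 certificates.
        "extension_exist_in_wrong_version": int(cert.has_extensions and cert.version != 3),
        "explicit_text_exceed": int(len(cert.explicit_text) > 200),
    }
```

Returning plain integers keeps the output directly usable as numerical features, matching the "integer" value type in the table.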

Certificate Chain Construction
The certificate chain construction part locates the issuers of certificates until it reaches their root certificates. Constructing the certificate chain for an end certificate is complete once its root certificate is found. There are two ways to locate the issuer's certificate. One is searching the certificate database CCADB (CCADB: a common certificate authority database with root and intermediate certificates from https://www.ccadb.org/ for searching). The other is retrieving it according to the access information in AIA. To raise the rate of building a complete certificate chain, we integrate the two methods. Algorithm 1 describes the overall process of forming the certificate chain. Once the root certificate is found, or no issuer's certificate is obtained by either method, the certificate chain construction is finished. To reduce the time of building a certificate chain, we give priority to the database when searching for the issuer's certificate, since leveraging AIA information to obtain the issuer's certificate is time-consuming.
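The control flow described above can be sketched roughly as follows. This is not the paper's Algorithm 1: certificates are modeled as hypothetical dictionaries with `subject`, `issuer`, and optional `aia_url` keys, the database is a plain mapping keyed by subject, `fetch_from_aia` is a caller-supplied function, and a root is recognized simply by subject equal to issuer, which is a simplification of real issuer matching.

```python
from typing import Callable, Dict, List, Optional, Tuple

Cert = Dict[str, str]  # hypothetical: {"subject": ..., "issuer": ..., "aia_url": ...}


def build_chain(
    end_cert: Cert,
    database: Dict[str, Cert],
    fetch_from_aia: Callable[[str], Optional[Cert]],
    max_depth: int = 10,
) -> Tuple[List[Cert], bool]:
    """Return (chain, complete); complete is True when a root was reached."""
    chain = [end_cert]
    current = end_cert
    while len(chain) < max_depth:
        if current["issuer"] == current["subject"]:
            return chain, True  # self-issued certificate treated as the root
        # Fast path first: look the issuer up in the local database.
        issuer = database.get(current["issuer"])
        if issuer is None and current.get("aia_url"):
            # Slow path: retrieve the issuer via the AIA access location.
            issuer = fetch_from_aia(current["aia_url"])
        if issuer is None:
            return chain, False  # neither method found the issuer
        chain.append(issuer)
        current = issuer
    return chain, current["issuer"] == current["subject"]
```

The `max_depth` guard is an added safety measure against issuer loops; the returned flag corresponds to recording whether a complete chain was built.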

Certificate Chain Validation
The certificate chain validation part examines fields of related certificates in the certificate chain, for instance certificate policy mapping, path length constraints, name constraints, etc. Any implementation of certificate chain validation is required to conform to RFC 5280. Considering the time cost of implementing a tool for verifying the certificate chain, we accomplish the certificate chain validation part with the help of OpenSSL. Many trusted certificates and Certificate Revocation Lists (CRLs) must be added to the validation context during the validation process. It is worth mentioning that we validate the certificate chain of each certificate in the certificate chain to find more possible errors. The details of certificate chain validation are obtained from the checking results of OpenSSL. The validation values are documented on the official website of OpenSSL (OpenSSL verify: https://www.openssl.org/docs/man1.0.2/man1/verify.html).
We record the information needed to extract as many features as possible during the processing flow of the four parts. With the help of the VFE, comprehensive and useful traits of certificates are collected.

Feature Engineering
Feature engineering is based on the certificate characteristics obtained from the basic analysis, standard checking, certificate chain construction, and certificate chain validation of the VFE. In addition, some outputs of the packages and tools we use are taken into consideration. We also derive six extra features from what we have already obtained. Because the number of features we extract is comparatively large, we upload a description of all the features to our website (Our website: https://github.com/fight-think/features-extraction). The rest of this section explains the features of a certificate in more detail.

Features from Cryptography
Cryptography is a Python package that helps us parse the basic fields of certificates. The package has a few limitations regarding the versions, algorithms, and extensions it supports, and sometimes inputs cannot be parsed by it at all. When one of those four situations occurs, the corresponding feature is stored. In addition, the parsing process does not always go well, so 26 features about parsing errors are extracted. Table 3 illustrates some sample features from Cryptography. The full list of features from Cryptography is available on our website (Our website: https://github.com/fight-think/features-extraction).

Features from Basic Analysis
The basic fields of certificate contents are essential features, for example the signature algorithm, public key algorithm, version, country of the issuer, and country of the subject. Therefore, we collect 16 features about the basic contents of certificates and 11 features that are easily computed from the contents of certificates. Extensions of certificates are supplementary and explanatory information about certificates, so not all extensions are essential in a certificate. The presence of extensions reflects the importance and completeness of certificates in some way. Each extension has a property indicating whether it is critical or not. The difference is that critical extensions must be parsed and verified during certificate chain validation. To capture all possible signals, we derive 14 existence features and 14 criticality features. Table 4 shows some sample features from the basic analysis. The full list of features from the basic analysis is available on our website (Our website: https://github.com/fight-think/features-extraction).

Features from Standard Checking
RFC 5280 documents the common agreement among CAs, certificate users, governments, and several organizations on X.509 certificates. Any implementation of RFC 5280 should conform to its criteria. Kumar et al. [11] implemented a framework for checking whether certificates issued by CAs complied with the standards defined by RFC 5280, CAs, and browser vendors. The results showed that the number of non-conforming certificates was decreasing and that their percentage had dropped to 0.02% in 2017. Although the rate of non-conforming certificates is low, it is still necessary to check certificates' conformance to RFC 5280. Therefore, we extract 10 features to represent the checking results. Table 5 illustrates the details of the features from standard checking.

Features from Certificate Chain Construction
During certificate chain construction, we care about the result of construction, the occurrence of errors, and the subject information of certificates in the certificate chain. Drury et al. [12] compared phishing certificates with regular certificates in many dimensions. The results indicated that the subject and issuer information of certificates was vital for distinguishing them. Fadai et al. [13] analyzed the trusted SSL root CAs of different modern browsers and operating systems. Their analysis shows that the trustworthiness of SSL root CAs is related to their countries of origin. Torroledo et al. [5] analyzed issuer and subject information in detail while carrying out feature engineering. Inspired by previous work, we concatenate the subject information of each certificate in the certificate chain into a single text fragment, which serves as a trick for discovering the relationship between issuer and subject. We apply two ways to extract features from the text of subject information. One is the bag of words (BOW) [14]; the other uses pre-trained fastText [15] to build word embedding vectors of the text information. We apply two types of neural networks to process the word embedding vectors [16]. Apart from the text of subject information, we record six more features about errors and the results of certificate chain construction. Table 6 illustrates the details of features from certificate chain construction.

Features from Certificate Chain Validation
Akhawe et al. [17] described the procedure by which browsers check server certificates and analyzed the reasons why many Transport Layer Security (TLS) warnings were wrongly issued. Certificate chain validation is an essential means of identifying certificates' authenticity. Therefore, we obtain 74 related features, comprising 71 validation flags of an end certificate from OpenSSL and three features indicating the validation of intermediate certificates in the certificate chain. Because the number of features is relatively large, we illustrate some sample features from certificate chain validation in Table 7. The full list of features from certificate chain validation is available on our website (Our website: https://github.com/fight-think/features-extraction).

Extra Features
The six extra features are derived from what we have already obtained. The first one indicates whether the certificate is an Extended Validation (EV) certificate. EV certificates are issued with more checking and at greater expense, which makes them more credible. Torroledo et al. [5] also used whether the certificate was an EV certificate as a feature. We determine it with a list of certificate policies that indicate whether a certificate is an EV certificate or not. In addition, whether the root certificate of an end certificate is trusted by hardware and software companies, including Microsoft, Apple, Cisco, and Mozilla, is an important mark of credibility. Therefore, we obtain four features indicating the trust of Microsoft (Microsoft: https://ccadb-public.secure.force.com/microsoft/IncludedCACertificateReportForMSFT), Apple (Apple: https://support.apple.com/en-us/HT209143), Cisco (Cisco: https://www.cisco.com/security/pki/), and Mozilla (Mozilla: https://ccadb-public.secure.force.com/mozilla/CACertificatesInFirefoxReport) by comparing certificates' fingerprints with those collected from these websites. Finally, the Alexa rank of the subject domain name is looked up in the ranking file (Alexa rank: https://www.alexa.com/topsites), and the results show that the Alexa rank is a useful feature. Table 8 illustrates the details of the extra features.
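The fingerprint comparison behind the four trust features might look like the sketch below. It assumes SHA-256 fingerprints stored as uppercase hexadecimal strings; the hash algorithm, the feature names, and the shape of the `trusted` mapping are illustrative assumptions, not details given in the paper.

```python
import hashlib
from typing import Dict, Set


def trust_features(root_der: bytes, trusted: Dict[str, Set[str]]) -> Dict[str, int]:
    """Return one 0/1 feature per vendor whose trust list contains the root.

    `trusted` maps a vendor name to the set of root fingerprints collected
    from that vendor's published list (hypothetical structure).
    """
    fingerprint = hashlib.sha256(root_der).hexdigest().upper()
    return {
        f"trusted_by_{vendor}": int(fingerprint in fingerprints)
        for vendor, fingerprints in trusted.items()
    }
```

Using sets for the vendor lists keeps each lookup constant-time, which matters little for four vendors but keeps the function cheap when applied to every certificate in the dataset.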

Model Illustration
We apply classical machine learning models, ensemble machine learning models, and deep learning models to detect malicious certificates. Among the three types of models, the classical and ensemble machine learning models were proposed by previous researchers and are described in the following parts. We apply them to malicious certificate detection without modification. The structures of the deep learning models are designed and implemented by us with the help of third-party packages including PyTorch (PyTorch: https://pytorch.org/), Sklearn (Sklearn: https://scikit-learn.org/stable/), NumPy (NumPy: https://numpy.org/), nltk (nltk: https://www.nltk.org/), and Pandas (Pandas: https://pandas.pydata.org/). By comparing different models, we gain a thorough understanding of malicious certificate detection. The rest of this section describes the specific models and their structures.

Classical Machine Learning
The classical machine learning models we select include Logistic Regression (LR), Decision Tree (DT), and Support Vector Machine (SVM). Logistic Regression, proposed by Joseph Berkson [18], uses a sigmoid function [19] to introduce nonlinearity and constrain the output to the range 0 to 1, which represents the probability of the output being 1. The Support Vector Machine looks for a hyperplane that separates two types of data, with the help of a kernel trick if a linear classifier does not work [20]. Training the Decision Tree model is a process of building a tree with the best decision rules by deciding which feature to use as the test at each node [21]. Those classical models have a long history and perform well on many problems, especially where the data are not very large and the relationships in the data are relatively simple. We apply classical machine learning models to the detection of malicious certificates as reference experiments.
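For reference, the standard formulation of the sigmoid and of the logistic regression output can be written as follows, where the feature vector $\mathbf{x}$, weight vector $\mathbf{w}$, and bias $b$ are notation introduced here for illustration rather than taken from the paper:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
P(y = 1 \mid \mathbf{x}) = \sigma\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right)
```

Since $\sigma(z) \in (0, 1)$ for every real $z$, the output can be read as the probability of the positive class, which here would be the malicious label.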

Ensemble Machine Learning
Ensemble machine learning models [22] combine multiple machine learning algorithms to predict the result rather than relying on any one of the constituent learning algorithms alone, which often yields better performance. In several machine learning competitions, ensemble methods have achieved relatively high scores. Considering the different characteristics and performance of various ensemble machine learning models, we select four models, Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Category Boosting (CatBoost), to build four classifiers. Random Forest combines multiple decision trees to make the final prediction, which usually performs better than a single decision tree [23]. XGBoost is a scalable end-to-end tree boosting system with a novel sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning, which has achieved great success in several machine learning challenges [24]. LightGBM improves the efficiency and scalability of Gradient Boosting Decision Tree (GBDT) implementations when handling high-dimensional and large-scale data, using Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) [25]. Compared with XGBoost and LightGBM, CatBoost focuses on prediction shift with two algorithmic advances: ordered boosting and an innovative algorithm for processing categorical features [26]. The data processing for the classical and ensemble models is displayed in Figure 2. After splitting and cleaning the text information of subject_infor_chain, we count the 30 most frequent words and label their presence in each subject_infor_chain as features. Combining those features with 182 numerical features, we train several classical and ensemble models.
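The word-presence features described above can be sketched as follows. The tokenization rule (lowercasing and keeping alphanumeric runs) is an assumption, since the paper does not specify how the text is split and cleaned, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Assumed cleaning rule: lowercase, keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_vocabulary(texts: List[str], top_k: int = 30) -> List[str]:
    """Return the top_k most frequent words over all subject chains."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(top_k)]


def presence_features(text: str, vocabulary: List[str]) -> List[int]:
    """One 0/1 feature per vocabulary word: is it present in this chain?"""
    words = set(tokenize(text))
    return [int(word in words) for word in vocabulary]
```

In practice the vocabulary would be fitted on the training portion only, and the resulting 30 binary values would be concatenated with the numerical features before training.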

Deep Learning
Since Krizhevsky et al. [27] applied a convolutional neural network (CNN) to an image classification challenge in 2012 and won the competition, deep learning [28] has swept the world and achieved outstanding results in many fields. To further process the word embedding vectors of subject information and obtain models with the highest possible accuracy, we use CNN and LSTM to extract features from the word embedding vectors and merge those new features with the previous features to train classifiers. The critical operations of the convolutional neural network are convolution and pooling, which are highly efficient at capturing critical features [27]. Long short-term memory is an improved version of the recurrent neural network (RNN) with a cell, an input gate, an output gate, and a forget gate in each unit. It reduces the likelihood of vanishing and exploding gradients, which are frequent in RNNs [29]. The architectures of the networks with CNN and LSTM used in our experiments are displayed in Figure 3. The overall architecture handles the word embedding vectors and combines the previous numerical features with the features extracted by the neural networks to train a classifier. The word embedding vectors are obtained from a pre-trained model with an input of at most 30 words from one subject_infor_chain. The difference between the CNN-based classification model and the LSTM-based classification model is the neural network used to process the word embedding vectors. In addition, we describe the hidden layers of the CNN-based and LSTM-based models in detail in Appendix A.

Data Collection
To obtain models with better generalization performance and stability, we feed a large amount of data to the models, especially the deep learning models. In reality, the number of malicious certificates is far smaller than the number of benign certificates. Therefore, our principle is to collect plentiful malicious certificates and a corresponding number of benign certificates. We acquire malicious certificates from two sources. One is the Uniform Resource Locators (URLs) of phishing websites provided by PhishTank (PhishTank: https://www.phishtank.com/). The other is the fingerprints of malicious certificates collected by the SSLBL project (SSLBL: https://sslbl.abuse.ch/). By connecting to the URLs over HTTPS on port 443 and obtaining the certificates sent by servers during the ServerHello process, we collect 1711 malicious certificates. We use the fingerprints provided by the SSLBL project as search keys at crt.sh (Crt.sh: https://crt.sh/) and censys.io (Censys: https://censys.io/) to get certificates in PEM format, and we obtain 611 malicious certificates in this way. The total number of malicious certificates we collect is 2322. Compared with malicious certificates, benign certificates are plentiful and easier to obtain. We use part of the domains of the Alexa top 1 million sites (Alexa rank: https://www.alexa.com/topsites) as seeds to get their certificates over the HTTPS protocol. We collect 11,909 benign certificates, which makes the data imbalanced. To make fair use of the malicious certificates, we select 9288 (4 × 2322) benign certificates whose domain names have a high Alexa rank for different experiments. To train models with good generalization ability, we also resample the malicious certificates, which balances the data in training and validation. To make reproduction easier, we release the features of all malicious and benign certificates on our website (Our website: https://github.com/fight-think/features-extraction).

Experimental Design
All experiments are run on a ThinkPad Carbon X1 2019 with an Intel(R) Core(TM) i5-8265U CPU, 8 GB of RAM, and a 512 GB SSD. We divide the collected malicious and benign certificates into different datasets, either according to the Alexa rank of the domain names or by re-sampling the malicious certificates. As Figure 4 shows, we select the 9288 (4 × 2322) benign certificates with the highest Alexa rank and divide them into four parts, b1, b2, b3, and b4, each containing the same number of certificates as the malicious part, m1. We then combine m1 with b1, b2, b3, and b4 to form Dataset1, Dataset2, Dataset3, and Dataset4, respectively. Three more datasets (Dataset5, Dataset6, and Dataset7) are formed by re-sampling the malicious certificates and combining them with the same number of benign certificates. We fit nine machine learning models to each of the seven datasets, namely LR, DT, SVM, RF, XGBoost, CatBoost, LightGBM, a CNN-based model, and an LSTM-based model, and compare their performance across the datasets. When training a specific model on one dataset, we use eight-fold cross-validation to find the best model, for two reasons. First, the number of certificates in the experiment is relatively small. If the data are divided into ten folds, each held-out fold contains few certificates, which makes the evaluation metric less reliable. Second, if the data are divided into five folds, fewer certificates are available for training, which degrades the performance of the model. Our experiments with five-fold and ten-fold cross-validation confirm both effects. Therefore, we use eight-fold cross-validation.
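The dataset construction above can be sketched as follows. The sketch assumes the benign certificates are already sorted by Alexa rank. Because the re-sampling procedure is not specified here, sampling with replacement and the `factor` parameter are illustrative assumptions, and all function names are hypothetical.

```python
import random


def split_benign(benign_by_rank, part_size, parts=4):
    """Cut the top-ranked benign certificates into equal parts b1..b4."""
    if len(benign_by_rank) < part_size * parts:
        raise ValueError("not enough benign certificates")
    return [benign_by_rank[i * part_size:(i + 1) * part_size]
            for i in range(parts)]


def rank_datasets(m1, benign_by_rank):
    """Dataset1..Dataset4: the malicious part m1 paired with b1..b4."""
    return [(m1, b) for b in split_benign(benign_by_rank, len(m1))]


def resampled_dataset(m1, benign_by_rank, factor, seed=0):
    """Oversample m1 to factor * len(m1) items (with replacement, an
    assumption) and pair it with the same number of benign items."""
    rng = random.Random(seed)
    size = factor * len(m1)
    malicious = list(m1) + rng.choices(m1, k=size - len(m1))
    return malicious, benign_by_rank[:size]
```

With 2322 malicious and 9288 benign certificates, `rank_datasets` would return four balanced pairs of 2322 certificates each.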
During training, we hold out 15% of the data as testing data and apply eight-fold cross-validation to the remaining data to find the best model, so that in each round seven folds are used for training and one fold for validation. The model with the highest evaluation-metric score on the validation data is regarded as the best model. For the classical and ensemble machine learning models, GridSearchCV (GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) is used to search for the best model. For the deep learning models, we obtain the best model while updating the parameters and changing the training and validation folds. After obtaining the best model, we evaluate it on the testing data and record the time cost from training to testing.
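The splitting scheme above (a 15% hold-out followed by eight-fold cross-validation on the rest) can be sketched with index lists only. This is a simplified illustration: the function name is hypothetical, and shuffling with a fixed seed and no class stratification are assumptions not stated in the text.

```python
import random


def holdout_and_folds(n_samples, test_fraction=0.15, n_folds=8, seed=0):
    """Return (test_indices, rounds), where rounds is a list of
    (train_indices, validation_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test, rest = indices[:n_test], indices[n_test:]
    # Deal the remaining indices round-robin into n_folds folds.
    folds = [rest[i::n_folds] for i in range(n_folds)]
    rounds = []
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        rounds.append((train, folds[k]))
    return test, rounds
```

In practice, scikit-learn offers the same behaviour through `train_test_split` and `GridSearchCV(estimator, param_grid, cv=8)`; the sketch only makes the seven-to-one fold structure explicit.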
We select accuracy as the evaluation metric to measure the performance of the different models. Accuracy is computed from the numbers of true positives, true negatives, false positives, and false negatives. In our experiments, benign certificates are considered positive items, and malicious certificates are considered negative items. There are two main reasons for choosing accuracy as the evaluation metric. First, finding benign certificates is as important as detecting malicious certificates in our experiments. Compared with precision, recall, or F1 score, accuracy weighs these two abilities equally. Second, many previous researchers evaluate their models with accuracy, for example, Dong et al. [30], Torroledo et al. [5], and Fasllija et al. [6]. Selecting accuracy as the evaluation metric therefore makes our results easier to compare with theirs.
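With benign certificates as the positive class, accuracy is (TP + TN) / (TP + TN + FP + FN). The sketch below computes it from label lists; the label strings and function names are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive="benign"):
    """Count (tp, tn, fp, fn) with `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all items that were classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("empty confusion matrix")
    return (tp + tn) / total
```

Because both correct outcomes enter the numerator with equal weight, the value is unchanged if the positive and negative roles are swapped, which matches the motivation given above.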

Results
For each dataset, we train several models and record the testing accuracy of the best model found by eight-fold cross-validation, together with the validation accuracy of the eight models trained on the corresponding training folds. We also measure the time cost from eight-fold cross-validation through testing the best model. Finally, we examine the importance values of the features in the DT, RF, XGBoost, CatBoost, and LightGBM models.

Accuracy
During eight-fold cross-validation, eight models are trained, each on seven folds of data, and the validation accuracy of each model on the remaining fold is recorded. Among these eight models, the one with the highest validation accuracy is selected as the best model. We then obtain the testing accuracy by evaluating the best model on the testing data. Table 9 shows the validation accuracy of eight-fold cross-validation with different models on Dataset6. The last column of the table is the standard deviation (std) of the validation accuracy values, which reflects the corresponding model's stability. The mean and standard deviation over the eight folds show that SVM, XGBoost, and LightGBM have higher validation accuracy and lower std than the other models. In addition, compared with classical and deep learning models, ensemble models have higher average validation accuracy and lower average std. Therefore, ensemble learning models are the most efficient and stable models. In addition to eight-fold cross-validation, we apply five-fold and ten-fold cross-validation to find the best model. Table 10 and Table 11 show the validation accuracy of five-fold and ten-fold cross-validation, respectively, with different models on Dataset6. We compare the mean validation accuracy and the std of the different models. The validation accuracy of eight-fold cross-validation is higher than that of five-fold cross-validation and, for some models, higher than that of ten-fold cross-validation. The std of eight-fold cross-validation is lower than that of ten-fold cross-validation and not much higher than that of five-fold cross-validation. This shows that eight-fold cross-validation is the more suitable choice. After finding the best model with eight-fold cross-validation, we feed the testing data into it to obtain the testing accuracy. The results are displayed in Table 12.
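The per-model summary described above (mean, std, and the fold that yields the best model) can be sketched as follows. The function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant is reported.

```python
from statistics import mean, pstdev


def summarize_folds(fold_accuracies):
    """Summarize one model's validation accuracies across folds."""
    best_fold = max(range(len(fold_accuracies)),
                    key=fold_accuracies.__getitem__)
    return {
        "mean": mean(fold_accuracies),
        "std": pstdev(fold_accuracies),  # population std (assumption)
        "best_fold": best_fold,          # index of the best model
    }
```

A lower `std` at a comparable `mean` is what the text reads as a more stable model.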
First, comparing the results on Dataset1, Dataset2, Dataset3, and Dataset4, we find that training on benign certificates whose subject domain names have a higher Alexa rank yields a slight improvement in testing accuracy. Second, as the number of certificates increases in Dataset5, Dataset6, and Dataset7, the improvement in testing accuracy varies across models. SVM shows the largest improvement, whereas ensemble learning models gain only slightly from more training data. The highest testing accuracy is obtained by SVM on Dataset7. Third, compared with classical and deep learning models, ensemble learning models have higher average testing accuracy, which indicates that ensemble models are the most efficient. Finally, the average testing accuracy of the different models over the seven datasets is 92.7%, and all models achieve a relatively high testing accuracy, which shows the effectiveness and necessity of the features extracted by the VFE. Moreover, the ANalysis Of Variance (ANOVA) results in Table 13, with a high F-statistic, show that the choice of model has a significant influence on testing accuracy. In this calculation, each group consists of the testing accuracies of the best models obtained with one algorithm on the different datasets. The details of the ANOVA calculation are illustrated on the website (Calculation of ANOVA: https://goodcalculators.com/one-way-anova-calculator/). In addition to the validation and testing accuracy of the different models, we analyze the importance values of the features in the best models of DT, RF, XGBoost, LightGBM, and CatBoost. Figure 5 shows the top 10 important features in XGBoost and their scores. The score is the improvement in accuracy that a feature brings to the branches it is on. Figure 6 shows the top 10 important features in LightGBM and their scores. The score is the relative number of times a particular feature occurs in all splits of the model's trees. Figure 7 shows the top 10 important features in CatBoost and their scores.
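For readers who wish to recompute the one-way ANOVA without the online calculator, the F-statistic is the ratio of the between-group mean square to the within-group mean square. The sketch below is a plain implementation of that standard definition; the function name is hypothetical, and each inner list would hold one algorithm's testing accuracies across datasets.

```python
def one_way_anova_f(groups):
    """F-statistic of a one-way ANOVA over a list of groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation explained by group membership.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Residual variation inside each group.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large value means the accuracies differ more between algorithms than they vary across datasets within one algorithm.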
According to the official document (CatBoost document: https://catboost.ai/docs/concepts/fstr.html#fstr__regular-feature-importance), the score is computed with Formulas (1) and (2), where c1 and c2 are the numbers of objects in the left and right leaves, and v1 and v2 are the formula values in the left and right leaves. Figure 8 shows the top 10 important features in DT and their scores. The score is computed as the (normalized) total reduction of the criterion brought by that feature, which is known as the Gini importance (Gini importance: https://medium.com/the-artificial-impostor/feature-importance-measures-for-tree-models-part-i-47f187c1a2c3). Figure 9 shows the top 10 important features in RF and their scores. The score is the average feature importance over the trees in the forest, where the feature importance of each tree is computed as the Gini importance.
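For reference, the CatBoost score mentioned above can be written out as follows. This is a reconstruction based on our reading of the CatBoost documentation for its default (prediction values change) importance, not a formula copied from this paper; the assignment of the numbers (1) and (2) and the symbol avr for the weighted average of the two leaf values are assumptions.

```latex
\begin{align}
\text{feature\_importance}_F &= \sum_{\text{trees},\ \text{leafs}_F}
  \left[ (v_1 - avr)^2 \cdot c_1 + (v_2 - avr)^2 \cdot c_2 \right] \tag{1} \\
avr &= \frac{v_1 \cdot c_1 + v_2 \cdot c_2}{c_1 + c_2} \tag{2}
\end{align}
```

The sum runs over every split on feature F in every tree, so a feature scores highly when its splits separate leaves whose values differ strongly and which contain many objects.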
Because not every model implementation exposes a feature importance property, we cannot obtain the feature importance of all models. Therefore, we analyze the feature importance of XGBoost, LightGBM, CatBoost, DT, and RF. From the top 10 important features of each model, we can see that "valid_time_so_far" is the most important feature in three out of the five models. If the certificate has not expired, "valid_time_so_far" is the number of hours from the issue date to the feature extraction date. If the certificate has expired, "valid_time_so_far" is the number of hours for which it was valid before expiring. The result indicates that "valid_time_so_far" is the most important feature in our experiments. In addition, we count how often each feature occurs in the top 10 important features of the five models. Figure 10 shows the results. The top 10 lists of the five models share many features, which indicates that the models agree on the important characteristics to some extent. Moreover, each feature in Figure 10 appears in the top 10 list of more than one of the five models. Therefore, we regard these features as vital features. Among these features, "20" represents the appearance of "ou" in the subject information of certificates, and the other features are described on our website (Our website: https://github.com/fight-think/features-extraction).
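The definition of "valid_time_so_far" given above can be sketched as follows. The function signature is hypothetical, and clamping to zero for a certificate that is not yet valid at extraction time is an assumption that the text does not cover.

```python
from datetime import datetime, timezone


def valid_time_so_far(not_before, not_after, extracted_at):
    """Hours the certificate has been valid as of the extraction date.

    Not expired: hours from issue date to extraction date.
    Expired:     hours from issue date to expiry date.
    """
    end = min(extracted_at, not_after)
    hours = (end - not_before).total_seconds() / 3600.0
    return max(hours, 0.0)  # assumption: never report negative hours
```

For example, a certificate issued at midnight on 1 January 2020 and examined one day later yields 24 hours, regardless of how long its validity period is.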

Time Consumption
Time consumption is an important metric of a model, as it reflects the computational resources consumed by training. For each dataset, we record the time cost of eight-fold cross-validation plus the time cost of testing the best model on the testing data, as shown in Formula (3).

$$\mathrm{Total\_time} = \sum_{i=1}^{n} \mathrm{Time}(i) + \mathrm{Test\_time} \tag{3}$$

In the formula, n is the number of folds in the experiments. Time(i) is the time cost of training a model on n-1 folds of data and validating it on the remaining fold. Test_time is the time cost of feeding the testing data into the best model of cross-validation. In our experiments, we use eight-fold cross-validation, so n is eight. Table 14 shows the total time consumption of training on Dataset6. It shows that training deep neural network models takes far more time than training classical models and ensemble models. Among the classical models, SVM is the most time-consuming. Among the ensemble learning models, XGBoost takes more time than the other two ensemble models. Moreover, the average time consumption of ensemble models is 74.47 s, which is lower than the 90.51 s of classical models and the 10,502.52 s of deep models. The result shows that ensemble models demand less training time, which means lower computational resource consumption.
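The measurement in Formula (3) can be sketched as follows. The function names are hypothetical, and the callables stand in for the training-plus-validation rounds and the final test run, which are not reproduced here.

```python
import time


def total_cost(fold_times, test_time):
    """Formula (3): sum of the per-fold times plus the testing time."""
    return sum(fold_times) + test_time


def measure(fold_jobs, test_job):
    """Time each fold job (train on n-1 folds, validate on one fold)
    and the final test job, then combine them as in Formula (3)."""
    fold_times = []
    for job in fold_jobs:
        start = time.perf_counter()
        job()
        fold_times.append(time.perf_counter() - start)
    start = time.perf_counter()
    test_job()
    test_time = time.perf_counter() - start
    return total_cost(fold_times, test_time)
```

With eight-fold cross-validation, `fold_jobs` would contain eight callables, one per training and validation round.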

Conclusions
In this paper, we design and implement a system called VFE for obtaining and recording essential characteristics of X.509 certificates. With the help of the VFE, we extract a large number of features for model training. Furthermore, we train different types of models to distinguish between malicious and benign certificates. All the models achieve relatively high validation accuracy and testing accuracy, which indicates the robustness of the VFE. In addition, the average testing accuracy of the different models over all datasets is 92.7%, and the average validation accuracy of the different models on Dataset6 is 93.8%, which indicates that the features extracted by the VFE are essential and crucial. By analyzing the top 10 important features of five models, we find common features that are vital for detecting malicious certificates. The ensemble learning models have higher average testing accuracy and lower average standard deviation of testing accuracy than classical and deep models, which indicates that ensemble models are the most stable and efficient models. Furthermore, ensemble models reach an average testing accuracy of 95.9%. Finally, we obtain an SVM-based detection model with a testing accuracy of 98.2%, which is the highest accuracy.

Conflicts of Interest:
The authors have no conflict of interest concerning this manuscript.

Appendix A. Detail of CNN-Based and LSTM-Based Models
We construct the CNN-based and LSTM-based models with PyTorch. The details of the CNN-based model are illustrated in Figure A1, and the details of the LSTM-based model are illustrated in Figure A2. To make this work easier to reproduce, we use PyTorch functions to describe the hidden layers of each deep learning model.