# An Accuracy-Maximization Approach for Claims Classifiers in Document Content Analytics for Cybersecurity


## Abstract


## 1. Introduction

#### 1.1. Motivation

#### 1.2. Goals and Contributions

- We introduce ClaimsBERT, a new model that significantly enhances the baseline BertForSequenceClassification by integrating CNN feature maps to improve performance and accuracy;
- We present findings from extensive experiments for optimizing our model’s overall architecture and its hyperparameters;
- We discuss the effectiveness of suitable feature map parameter selection in our proposed ClaimsBERT architecture;
- We show that our model achieves a high classification accuracy of 97%;
- We discuss the results of our extensive evaluation of ClaimsBERT and compare its performance with other network-based BERT models. The results indicate that significantly higher accuracy is obtained by integrating a CNN into the original BERT classifier and fine-tuning all layers.

## 2. Related Works

- **BERT-Base**: consists of 12 encoder layers, utilizes an embedding size of 768 dimensions and 12 attention heads, and comprises 110M tunable parameters in total;
- **BERT-Large**: consists of 24 encoder layers, utilizes an embedding size of 1024 dimensions and 16 attention heads, and comprises a total of 340M tunable parameters.
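These layer counts and embedding sizes can be tied back to the quoted parameter totals. The following back-of-the-envelope sketch is ours, not from the paper; it assumes the standard 30,522-token WordPiece vocabulary, 512 position embeddings, and the usual 4× feed-forward expansion, which together reproduce the well-known BERT-Base total of roughly 110M parameters:

```python
def bert_param_count(vocab=30522, hidden=768, layers=12,
                     intermediate=3072, max_pos=512, type_vocab=2):
    """Rough parameter count for a BERT encoder (embeddings + encoder layers + pooler)."""
    ln = 2 * hidden                                     # LayerNorm gain + bias
    emb = (vocab + max_pos + type_vocab) * hidden + ln  # token, position, segment embeddings
    attn = 4 * (hidden * hidden + hidden) + ln          # Q, K, V and output projections
    ffn = ((hidden * intermediate + intermediate)       # expansion layer
           + (intermediate * hidden + hidden) + ln)     # projection back down
    pooler = hidden * hidden + hidden
    return emb + layers * (attn + ffn) + pooler

print(bert_param_count())  # 109,482,240 for BERT-Base (~110M)
print(bert_param_count(hidden=1024, layers=24, intermediate=4096))  # ~335M for BERT-Large
```

The BERT-Base result matches the 109,482,240 trainable parameters reported for the plain BERTSequenceClassifier later in the paper; the add-on classifier heads account for the differences between models.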

#### Language Models in Cybersecurity

## 3. Dataset Curation

- 3% of the documents were unreadable;
- 5% were scanned documents;
- 29% were not related to ICS products;
- 63% of the downloaded documents were ICS product-related documents.

- 25% were classified as “manuals”;
- 69% were classified as “brochures”;
- 6% were classified as “catalogs”.

Text content was extracted from each PDF using the `PyMuPDF` Python package, and we leveraged Python’s `Pytesseract` package for performing optical character recognition (OCR) on any scanned PDFs. With this approach, we managed to extract 2,160,517 sequences with 41,073,376 words across our curated dataset of ICS documents [8].
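A minimal sketch of this extraction pipeline is shown below. The helper names (`extract_pdf_text`, `to_sequences`), the blank-text heuristic for detecting scanned pages, and the 300 dpi OCR rendering are our illustrative assumptions, not the authors’ exact implementation:

```python
import re

def to_sequences(text):
    """Split raw page text into candidate sentence-like sequences."""
    # naive splitter: break after sentence-ending punctuation followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

def extract_pdf_text(path):
    """Return the text of a PDF, falling back to OCR for pages without a text layer."""
    import fitz  # PyMuPDF
    doc = fitz.open(path)
    pages = []
    for page in doc:
        text = page.get_text()
        if not text.strip():  # no embedded text layer -> likely a scanned page
            import io
            import pytesseract
            from PIL import Image
            pix = page.get_pixmap(dpi=300)
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            text = pytesseract.image_to_string(img)
        pages.append(text)
    return "\n".join(pages)
```

In practice the sequence splitter would need to handle the abbreviations, part numbers, and tables common in ICS datasheets; the regex above is only the simplest possible starting point.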

## 4. NLP Model Optimization for ClaimsBERT

#### 4.1. BERT Baseline (BertForSequenceClassification)

#### 4.2. Fine-Tuning BERT

#### Hyperparameter Selection

**Catastrophic Forgetting**: The cyclic method we utilized was first presented by Smith in [53], and it allows us to determine the optimal learning rate for model training while avoiding catastrophic forgetting. This method starts from a low learning rate, which is then increased exponentially for each subsequent batch. The LrFinder function [53] was used to determine the best learning rate for each architecture.

**Overfitting**: A common problem when training a neural network is determining an appropriate number of training epochs. Too many epochs can overfit the training dataset, while too few may cause underfitting. If the monitored metric does not improve after a certain number of epochs, we can stop the training process through an appropriate early-stopping method. This data-driven automation eliminates the need to manually select the number of epochs. Our model monitors the validation loss, and if it shows no improvement after two epochs, we stop training.
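Both mechanisms can be sketched in a few lines of pure Python. These are illustrative versions under the assumptions stated in the text (exponential learning-rate sweep per batch; early stopping on validation loss with a patience of two epochs); the names `lr_range_schedule` and `EarlyStopping` are ours, not from the paper’s code:

```python
def lr_range_schedule(lr_min, lr_max, num_batches):
    """Exponentially increase the LR each batch, as in Smith's range test [53]."""
    ratio = (lr_max / lr_min) ** (1 / (num_batches - 1))
    return [lr_min * ratio ** i for i in range(num_batches)]

class EarlyStopping:
    """Stop training when the monitored validation loss stops improving."""
    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience   # epochs to wait after the last improvement
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience

stopper = EarlyStopping(patience=2)
losses = [0.61, 0.43, 0.40, 0.41, 0.42, 0.39]  # hypothetical validation losses
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        break
# training halts at epoch 4: no improvement over 0.40 for two consecutive epochs
```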

#### 4.3. Classification Optimization Using Feature Map

- **Reshaping**: This involves reshaping the output of the transformers’ NSP layer from BERT, so that the batch size and the sequence length are compatible with the convolutional layer’s input;
- **Convolution**: This involves connecting multiple convolutional layers with different filter sizes, and then using max pooling to learn higher-order representations of the data while reducing the number of parameters;
- **Flattening**: This involves converting the matrix from the final pooling layer into a single array. This flattened vector is then connected to a fully connected neural network for the classification task.
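The three steps above can be sanity-checked with a small shape calculation. This sketch assumes a kernel size of 2 and stride-2 max pooling (our inference, since the paper does not state them here), which reproduce the layer shapes and parameter counts reported for the final ClaimsBERT architecture:

```python
def conv1d(length, channels_in, filters, kernel=2):
    """'valid' 1-D convolution: output length, output channels, parameter count."""
    out_len = length - kernel + 1
    params = kernel * channels_in * filters + filters  # weights + biases
    return out_len, filters, params

def max_pool1d(length, pool=2):
    """Non-overlapping max pooling halves the sequence length."""
    return length // pool

# Reshaping: NSP output (768,) -> (768, 1) so Conv1D sees a sequence
length, channels = 768, 1

# Convolution block 1: 256 filters
length, channels, p1 = conv1d(length, channels, 256)  # -> (767, 256), 768 params
length = max_pool1d(length)                           # -> (383, 256)

# Convolution block 2: 128 filters
length, channels, p2 = conv1d(length, channels, 128)  # -> (382, 128), 65,664 params
length = max_pool1d(length)                           # -> (191, 128)

# Flattening: single vector feeding the dense classifier head
flat = length * channels                              # 191 * 128 = 24,448
```

These values match the NSP-Dense → Conv1D → MaxPooling1D → Flatten shapes listed in the layer table later in the paper.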

## 5. Comparative Analysis Results and Discussion

#### 5.1. Comparison of ClaimsBERT against the Pre-Trained BertForSequenceClassification

#### 5.2. Comparison of ClaimsBERT against Other Network-Based Classifier Models

#### 5.2.1. LSTM

#### 5.2.2. BiLSTM

#### 5.2.3. Multilayer Perceptron

#### 5.2.4. Neural Network

#### 5.3. Hyperparameter Finetuning Results

#### 5.3.1. Convolution Layer

#### 5.3.2. Max Pooling

#### 5.3.3. Activation Function

#### 5.3.4. BERT Classifier Layer

#### 5.3.5. Randomness Impact

#### 5.4. Performance Comparison for All Models

## 6. Conclusions

## 7. Future Work

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition
---|---
ASGD | Averaged Stochastic Gradient Descent
AUC | Area Under the ROC Curve
BERT | Bidirectional Encoder Representations from Transformers
BiLSTM | Bidirectional LSTM
CI | Confidence Interval
CKG | Cybersecurity Knowledge Graph
CNN | Convolutional Neural Network
CR | Cybersecurity Requirements
CyBERT | Cybersecurity BERT
CYVET | Cyber-physical security assurance
ELMo | Embeddings from Language Models
FFN | Feed-Forward Neural Network
GCN | Graph Convolutional Network
GPT | Generative Pre-Training
ICS | Industrial Control Systems
LR | Learning Rate
LSTM | Long Short-Term Memory
MLP | Multi-Layer Perceptron
MViT | Multiscale Vision Transformers
NER | Named Entity Recognition
NLP | Natural Language Processing
NN | Neural Network
NSP | Next Sentence Prediction
OCR | Optical Character Recognition
OT | Operational Technology
ROC | Receiver Operating Characteristic
ULMFiT | Universal Language Model Fine-Tuning
VSF | Vendor-Supplied Features

## References

1. Perumalla, K.; Lopez, J.; Alam, M.; Kotevska, O.; Hempel, M.; Sharif, H. A Novel Vetting Approach to Cybersecurity Verification in Energy Grid Systems. In Proceedings of the 2020 IEEE Kansas Power and Energy Conference (KPEC), Manhattan, KS, USA, 13–14 July 2020; pp. 1–6.
2. Ameri, K.; Hempel, M.; Sharif, H.; Lopez Jr., J.; Perumalla, K. Smart Semi-Supervised Accumulation of Large Repositories for Industrial Control Systems Device Information. In Proceedings of the ICCWS 2021 16th International Conference on Cyber Warfare and Security, Cookeville, TN, USA, 25–26 February 2021; pp. 1–11.
3. Zheng, X.; Burdick, D.; Popa, L.; Zhong, X.; Wang, N.X.R. Global Table Extractor (GTE): A framework for joint table identification and cell structure recognition using visual context. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 697–706.
4. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. arXiv 2018, arXiv:1802.05365.
5. Howard, J.; Ruder, S. Universal language model fine-tuning for text classification. arXiv 2018, arXiv:1801.06146.
6. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
7. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
8. Ameri, K.; Hempel, M.; Sharif, H.; Lopez Jr., J.; Perumalla, K. CyBERT: Cybersecurity Claim Classification by Fine-Tuning the BERT Language Model. J. Cybersecur. Priv. 2021, 1, 615–637.
9. Akbik, A.; Blythe, D.; Vollgraf, R. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 1638–1649.
10. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144.
11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 5998–6008.
12. Wang, Y.; Huang, M.; Zhu, X.; Zhao, L. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 606–615.
13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
14. Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 1691–1703.
15. Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; Feichtenhofer, C. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 6824–6835.
16. Atienza, R. Vision transformer for fast and efficient scene text recognition. In Proceedings of the International Conference on Document Analysis and Recognition, Lausanne, Switzerland, 5–10 September 2021; pp. 319–334.
17. Hong, Y.; Wu, Q.; Qi, Y.; Rodriguez-Opazo, C.; Gould, S. VLN BERT: A recurrent vision-and-language BERT for navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 1643–1653.
18. Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: Learning universal image-text representations. arXiv 2019, arXiv:1909.11740.
19. Liu, H.; Xu, S.; Fu, J.; Liu, Y.; Xie, N.; Wang, C.C.; Wang, B.; Sun, Y. CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification. arXiv 2021, arXiv:2112.03562.
20. Li, G.; Duan, N.; Fang, Y.; Gong, M.; Jiang, D. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11336–11344.
21. Chen, S.; Guhur, P.L.; Schmid, C.; Laptev, I. History aware multimodal transformer for vision-and-language navigation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; Volume 34, pp. 5834–5847.
22. Dou, Z.Y.; Xu, Y.; Gan, Z.; Wang, J.; Wang, S.; Wang, L.; Zhu, C.; Zhang, P.; Yuan, L.; Peng, N.; et al. An Empirical Study of Training End-to-End Vision-and-Language Transformers. arXiv 2021, arXiv:2111.02387.
23. Zhai, X.; Wang, X.; Mustafa, B.; Steiner, A.; Keysers, D.; Kolesnikov, A.; Beyer, L. LiT: Zero-Shot Transfer with Locked-image Text Tuning. arXiv 2021, arXiv:2111.07991.
24. Wang, Z.; Shan, X.; Yang, J. N15News: A New Dataset for Multimodal News Classification. arXiv 2021, arXiv:2108.13327.
25. Oyegoke, T.O.; Akomolede, K.K.; Aderounmu, A.G.; Adagunodo, E.R. A Multi-Layer Perceptron Model for Classification of E-mail Fraud. Eur. J. Inf. Technol. Comput. Sci. 2021, 1, 16–22.
26. Su, X.; You, S.; Xie, J.; Zheng, M.; Wang, F.; Qian, C.; Zhang, C.; Wang, X.; Xu, C. Vision transformer architecture search. arXiv 2021, arXiv:2106.13700.
27. Parisotto, E.; Song, F.; Rae, J.; Pascanu, R.; Gulcehre, C.; Jayakumar, S.; Jaderberg, M.; Kaufman, R.L.; Clark, A.; Noury, S.; et al. Stabilizing transformers for reinforcement learning. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual Event, 18–24 July 2020; pp. 7487–7498.
28. Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. MLP-Mixer: An all-MLP architecture for vision. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; Volume 34, pp. 24261–24272.
29. Liu, H.; Dai, Z.; So, D.; Le, Q. Pay attention to MLPs. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; Volume 34, pp. 9204–9215.
30. Jwa, H.; Oh, D.; Park, K.; Kang, J.M.; Lim, H. exBAKE: Automatic fake news detection model based on Bidirectional Encoder Representations from Transformers (BERT). Appl. Sci. 2019, 9, 4062.
31. Vogel, I.; Meghana, M. Detecting Fake News Spreaders on Twitter from a Multilingual Perspective. In Proceedings of the 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), Sydney, Australia, 6–9 October 2020; pp. 599–606.
32. Liu, C.; Wu, X.; Yu, M.; Li, G.; Jiang, J.; Huang, W.; Lu, X. A two-stage model based on BERT for short fake news detection. In Proceedings of the International Conference on Knowledge Science, Engineering and Management, Athens, Greece, 28–30 August 2019; pp. 172–183.
33. Sun, C.; Qiu, X.; Xu, Y.; Huang, X. How to fine-tune BERT for text classification? In Proceedings of the China National Conference on Chinese Computational Linguistics, Kunming, China, 18–20 October 2019; pp. 194–206.
34. Khetan, V.; Ramnani, R.; Anand, M.; Sengupta, S.; Fano, A.E. Causal BERT: Language models for causality detection between events expressed in text. arXiv 2020, arXiv:2012.05453.
35. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
36. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676.
37. Edwards, A.; Camacho-Collados, J.; De Ribaupierre, H.; Preece, A. Go simple and pre-train on domain-specific corpora: On the role of training data for text classification. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 13–18 September 2020; pp. 5522–5529.
38. Safaya, A.; Abdullatif, M.; Yuret, D. KUISAIL at SemEval-2020 Task 12: BERT-CNN for offensive speech identification in social media. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, Barcelona, Spain, 12–13 December 2020; pp. 2054–2059.
39. Rodrigues Makiuchi, M.; Warnita, T.; Uto, K.; Shinoda, K. Multimodal fusion of BERT-CNN and gated CNN representations for depression detection. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, Nice, France, 21 October 2019; pp. 55–63.
40. He, C.; Chen, S.; Huang, S.; Zhang, J.; Song, X. Using convolutional neural network with BERT for intent determination. In Proceedings of the 2019 International Conference on Asian Language Processing (IALP), Shanghai, China, 15–17 November 2019; pp. 65–70.
41. Rahali, A.; Akhloufi, M.A. MalBERT: Using transformers for cybersecurity and malicious software detection. arXiv 2021, arXiv:2103.03806.
42. Zhou, S.; Liu, J.; Zhong, X.; Zhao, W. Named Entity Recognition Using BERT with Whole World Masking in Cybersecurity Domain. In Proceedings of the 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA), Xiamen, China, 5–8 March 2021; pp. 316–320.
43. Chen, Y.; Ding, J.; Li, D.; Chen, Z. Joint BERT Model based Cybersecurity Named Entity Recognition. In Proceedings of the 2021 4th International Conference on Software Engineering and Information Management, Yokohama, Japan, 16–18 January 2021; pp. 236–242.
44. Gao, C.; Zhang, X.; Liu, H. Data and knowledge-driven named entity recognition for cyber security. Cybersecurity 2021, 4, 1–13.
45. Ranade, P.; Piplai, A.; Mittal, S.; Joshi, A.; Finin, T. Generating fake cyber threat intelligence using transformer-based models. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–9.
46. Tikhomirov, M.; Loukachevitch, N.; Sirotina, A.; Dobrov, B. Using BERT and augmentation in named entity recognition for cybersecurity domain. In Proceedings of the International Conference on Applications of Natural Language to Information Systems, Saarbrücken, Germany, 24–26 June 2020; pp. 16–24.
47. Oliveira, N.; Sousa, N.; Praça, I. A Search Engine for Scientific Publications: A Cybersecurity Case Study. In Proceedings of the International Symposium on Distributed Computing and Artificial Intelligence, Salamanca, Spain, 6–8 October 2021; pp. 108–118.
48. Ranade, P.; Piplai, A.; Joshi, A.; Finin, T. CyBERT: Contextualized Embeddings for the Cybersecurity Domain. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 3334–3342.
49. Nguyen, C.M. A Study on Graph Neural Networks and Pretrained Models for Analyzing Cybersecurity Texts. Master’s Thesis, Japan Advanced Institute of Science and Technology, Nomi, Japan, 2021.
50. Xie, B.; Shen, G.; Guo, C.; Cui, Y. The Named Entity Recognition of Chinese Cybersecurity Using an Active Learning Strategy. Wirel. Commun. Mob. Comput. 2021, 2021, 6629591.
51. Pal, K.K.; Kashihara, K.; Banerjee, P.; Mishra, S.; Wang, R.; Baral, C. Constructing Flow Graphs from Procedural Cybersecurity Texts. arXiv 2021, arXiv:2105.14357.
52. Yin, J.; Tang, M.; Cao, J.; Wang, H. Apply transfer learning to cybersecurity: Predicting exploitability of vulnerabilities by description. Knowl.-Based Syst. 2020, 210, 106529.
53. Smith, L.N. Cyclical learning rates for training neural networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 464–472.
54. Shrestha, A.; Mahmood, A. Review of deep learning algorithms and architectures. IEEE Access 2019, 7, 53040–53065.
55. Fahad, S.A.; Yahya, A.E. Inflectional review of deep learning on natural language processing. In Proceedings of the 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE), Shah Alam, Malaysia, 11–12 July 2018; pp. 1–4.
56. Yin, W.; Kann, K.; Yu, M.; Schütze, H. Comparative study of CNN and RNN for natural language processing. arXiv 2017, arXiv:1702.01923.
57. Batbaatar, E.; Li, M.; Ryu, K.H. Semantic-emotion neural network for emotion recognition from text. IEEE Access 2019, 7, 111866–111878.
58. Holland Computing Center (HCC) at University of Nebraska-Lincoln. Available online: https://hcc.unl.edu/ (accessed on 1 February 2022).
59. Zhou, C.; Sun, C.; Liu, Z.; Lau, F. A C-LSTM neural network for text classification. arXiv 2015, arXiv:1511.08630.
60. Liu, G.; Guo, J. Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 2019, 337, 325–338.
61. Liu, P.; Qiu, X.; Huang, X. Recurrent neural network for text classification with multi-task learning. arXiv 2016, arXiv:1605.05101.
62. Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537.
63. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
64. Cui, Y.; Zhou, F.; Wang, J.; Liu, X.; Lin, Y.; Belongie, S. Kernel pooling for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2921–2930.
65. Wang, Y.; Li, Y.; Song, Y.; Rong, X. The influence of the activation function in a convolution neural network model of facial expression recognition. Appl. Sci. 2020, 10, 1897.
66. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
67. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; Adaptive Computation and Machine Learning Series; MIT Press: Cambridge, MA, USA, 2017; pp. 321–359.
68. Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; Smith, N. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv 2020, arXiv:2002.06305.
69. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.

**Figure 6.** Filter size effect on training time and accuracy when fine-tuning BERT with different numbers of convolution and dense layers.

**Figure 9.** Activation function effect on training time and accuracy when fine-tuning BERT with different numbers of convolution and dense layers.

**Figure 10.** Kernel density estimation (**a**) and box plot (**b**) for the accuracy of all sets; histogram for the validation set accuracy only (**c**).

Layer | Input Shape | Output Shape | Activation Function | Parameters
---|---|---|---|---
NSP-Dense | (None, 768) | (None, 768) | – | 590,592
Reshape | (None, 768) | (None, 768, 1) | – | 0
Convolution1D | (None, 768, 1) | (None, 767, 256) | ReLU | 768
MaxPooling1D | (None, 767, 256) | (None, 383, 256) | – | 0
Convolution1D | (None, 383, 256) | (None, 382, 128) | ReLU | 65,664
MaxPooling1D | (None, 382, 128) | (None, 191, 128) | – | 0
Flatten | (None, 191, 128) | (None, 24448) | – | 0
Dropout | (None, 24448) | (None, 24448) | – | 0
Dense | (None, 24448) | (None, 64) | ReLU | 1,564,736
Dropout | (None, 64) | (None, 64) | – | 0
Dense | (None, 64) | (None, 2) | Softmax | 130

Model | Architecture | Accuracy | F1 Score | Precision | Recall
---|---|---|---|---|---
BERT + CNN (ClaimsBERT) | 12 Encoder, 2 Convolution, 2 Dense | 0.973 | 0.96 | 0.963 | 0.966
BERTSequenceClassifier | 12 Encoder, 1 Dense | 0.764 | 0.751 | 0.743 | 0.741

**Table 3.** Comparing BERT + LSTM classifier results. (Bold text indicates the model parameters achieving the highest accuracy.)

Architecture | | LR | Accuracy | F1 Score
---|---|---|---|---
BERT + 1 LSTM | 1 Dense | $7\times {10}^{-5}$ | 0.92 | 0.9
 | **2 Dense** | **$2\times {10}^{-5}$** | **0.94** | **0.93**
 | 3 Dense | $4\times {10}^{-5}$ | 0.93 | 0.91
BERT + 2 LSTM | 1 Dense | $3\times {10}^{-5}$ | 0.92 | 0.91
 | 2 Dense | $7\times {10}^{-5}$ | 0.94 | 0.92
 | 3 Dense | $4\times {10}^{-5}$ | 0.93 | 0.91
BERT + 3 LSTM | 1 Dense | $2\times {10}^{-5}$ | 0.93 | 0.91
 | 2 Dense | $1\times {10}^{-4}$ | 0.92 | 0.91
 | 3 Dense | $7\times {10}^{-5}$ | 0.91 | 0.9

**Table 4.** Comparing BERT + BiLSTM classifier results. (Bold text indicates the model parameters achieving the highest accuracy.)

Architecture | | LR | Accuracy | F1 Score
---|---|---|---|---
BERT + 1 BiLSTM | 1 Dense | $5\times {10}^{-5}$ | 0.94 | 0.93
 | **2 Dense** | **$6\times {10}^{-5}$** | **0.95** | **0.94**
 | 3 Dense | $4\times {10}^{-5}$ | 0.94 | 0.93
BERT + 2 BiLSTM | 1 Dense | $8\times {10}^{-5}$ | 0.92 | 0.91
 | 2 Dense | $5\times {10}^{-5}$ | 0.93 | 0.92
 | 3 Dense | $4\times {10}^{-5}$ | 0.91 | 0.91
BERT + 3 BiLSTM | 1 Dense | $1\times {10}^{-4}$ | 0.91 | 0.91
 | 2 Dense | $8\times {10}^{-5}$ | 0.94 | 0.92
 | 3 Dense | $4\times {10}^{-5}$ | 0.91 | 0.9

**Table 5.** Comparing BERT + MLP classifier results. (Bold text indicates the model parameters achieving the highest accuracy.)

Architecture | | LR | Accuracy | F1 Score
---|---|---|---|---
BERT + 1 MLP | 1 Dense | $4\times {10}^{-5}$ | 0.936 | 0.925
 | 2 Dense | $6\times {10}^{-5}$ | 0.942 | 0.932
 | 3 Dense | $3\times {10}^{-6}$ | 0.941 | 0.935
BERT + 2 MLP | 1 Dense | $4\times {10}^{-6}$ | 0.943 | 0.935
 | 2 Dense | $3\times {10}^{-6}$ | 0.932 | 0.921
 | 3 Dense | $2\times {10}^{-5}$ | 0.945 | 0.931
BERT + 3 MLP | 1 Dense | $4\times {10}^{-6}$ | 0.93 | 0.923
 | **2 Dense** | **$5\times {10}^{-6}$** | **0.952** | **0.942**
 | 3 Dense | $3\times {10}^{-5}$ | 0.925 | 0.911

**Table 6.** Comparing classification performance of ClaimsBERT with other network-based models. (The bold row indicates the best model parameters with the highest accuracy.)

Model | Architecture | Accuracy | F1 Score | Precision | Recall
---|---|---|---|---|---
**BERT + CNN (ClaimsBERT)** | **12 Encoder, 2 Convolution, 2 Dense** | **0.973** | **0.96** | **0.963** | **0.966**
BERT + NN (CyBERT [8]) | 12 Encoder, 3 Dense | 0.954 | 0.93 | 0.914 | 0.943
BERT + BiLSTM | 12 Encoder, 1 BiLSTM, 2 Dense | 0.951 | 0.941 | 0.951 | 0.949
BERT + LSTM | 12 Encoder, 1 LSTM, 2 Dense | 0.947 | 0.937 | 0.947 | 0.947
BERT + MLP | 12 Encoder, 3 MLP, 2 Dense | 0.952 | 0.942 | 0.947 | 0.938

**Table 7.** The impact of the number of convolution and dense layers on fine-tuning ClaimsBERT. (The bold row indicates the best model parameters with the highest accuracy.)

Architecture | | Filter Size | LR | Accuracy | F1-Score
---|---|---|---|---|---
BERT + 1 Convolution | 1 Dense | (256) | $6\times {10}^{-5}$ | 0.95 | 0.94
 | 2 Dense | (256) | $2\times {10}^{-7}$ | 0.94 | 0.93
 | 3 Dense | (256) | $3\times {10}^{-5}$ | 0.94 | 0.92
 | 4 Dense | (256) | $7\times {10}^{-5}$ | 0.94 | 0.93
BERT + 2 Convolution | 1 Dense | (256,128) | $3\times {10}^{-5}$ | 0.94 | 0.93
 | **2 Dense** | **(256,128)** | **$9\times {10}^{-5}$** | **0.97** | **0.96**
 | 3 Dense | (256,128) | $2\times {10}^{-7}$ | 0.93 | 0.93
 | 4 Dense | (256,128) | $3\times {10}^{-7}$ | 0.91 | 0.89
BERT + 3 Convolution | 1 Dense | (256,128,64) | $7\times {10}^{-5}$ | 0.93 | 0.91
 | 2 Dense | (256,128,64) | $2\times {10}^{-6}$ | 0.94 | 0.92
 | 3 Dense | (256,128,64) | $5\times {10}^{-5}$ | 0.92 | 0.91
 | 4 Dense | (256,128,64) | $1\times {10}^{-6}$ | 0.94 | 0.92

Model | Dataset | SD | Mean | CI (95%) | Margin of Error
---|---|---|---|---|---
BERT + CNN (ClaimsBERT) | Training | 0.007 | 0.991 | 0.99 to 0.993 | 0.00142
 | Validation | 0.009 | 0.953 | 0.952 to 0.956 | 0.0018
 | Testing | 0.009 | 0.951 | 0.95 to 0.953 | 0.00183

Model | Training Time * | Classification Time * | Trainable Parameters
---|---|---|---
BERT + CNN (ClaimsBERT) | 12,873 | 727 | 110,769,474
BERT + NN (CyBERT [8]) | 32,970 | 708 | 108,647,026
BERT + BiLSTM | 51,211 | 1125 | 112,915,970
BERT + LSTM | 47,176 | 989 | 109,482,240
BERT + MLP | 10,478 | 821 | 109,380,482
BERTSequenceClassifier | 7832 | 335 | 109,483,778

Model | Best Architecture | Accuracy | Macro Weighted F1 | AUC | Training Time * | Testing Time * | Trainable Parameters
---|---|---|---|---|---|---|---
**BERT + CNN (ClaimsBERT)** | **12 Encoder, 2 Convolution, 2 Dense** | **0.97** | **0.96** | **0.968** | **12,873** | **108** | **110,769,474**
BERT + NN (CyBERT [8]) | 12 Encoder, 3 Dense | 0.954 | 0.93 | 0.948 | 32,970 | 97 | 108,647,026
BERT + BiLSTM | 12 Encoder, 1 BiLSTM, 2 Dense | 0.951 | 0.941 | 0.937 | 51,211 | 142 | 116,693,570
BERT + LSTM | 12 Encoder, 1 LSTM, 2 Dense | 0.947 | 0.937 | 0.929 | 47,176 | 135 | 112,915,970
BERT + MLP | 12 Encoder, 3 MLP, 2 Dense | 0.952 | 0.942 | 0.940 | 10,487 | 85 | 109,380,482
BERTSequenceClassifier | 12 Encoder, 1 Dense | 0.76 | 0.72 | 0.773 | 7832 | 77 | 109,482,240

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ameri, K.; Hempel, M.; Sharif, H.; Lopez Jr., J.; Perumalla, K.
An Accuracy-Maximization Approach for Claims Classifiers in Document Content Analytics for Cybersecurity. *J. Cybersecur. Priv.* **2022**, *2*, 418-443.
https://doi.org/10.3390/jcp2020022
