Article

A Sensitive Information Masking-Based Data Security Auditing Method for Chinese Linux Operating System

1 School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450000, China
2 Beijing Shenzhou Aerospace Software Technology Co., Ltd., Beijing 100000, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 86; https://doi.org/10.3390/electronics15010086
Submission received: 31 October 2025 / Revised: 13 December 2025 / Accepted: 19 December 2025 / Published: 24 December 2025

Abstract

With the rapid development of information technology and the deepening of digitalization, operating systems are increasingly applied in critical information infrastructure, making data security issues particularly important. Traditional cloud storage auditing models based on third-party auditing authorities (TPA) face trust risks and potential data leakage during data integrity verification, which makes them inadequate to meet the dual requirements of high security and local controllability in the current information technology environment. To address this, this paper proposes a system-wide data security auditing method for the Chinese Linux operating system, constructing a lightweight and localized framework for sensitive information protection and auditing. By dynamically intercepting system calls and performing real-time content analysis, the method achieves accurate identification and visual masking of sensitive information, while generating corresponding audit logs. To overcome the efficiency bottleneck of traditional pattern matching in high-concurrency environments, this paper introduces a Chinese Aho-Corasick (AC) automaton-based character matching algorithm using a hash table to enhance the rapid retrieval capability of sensitive information. Experimental results demonstrate that the proposed method not only ensures controllable and auditable sensitive information but also maintains low system overhead and good adaptability, thereby providing a feasible technical path and implementation scheme for data security.

1. Introduction

With the continuous deepening of global digital transformation and the increasing complexity of cyber threats, data security has gradually become a core element of national security and industrial competitiveness [1]. Against the backdrop of the widespread application of emerging technologies such as cloud computing, big data, and artificial intelligence, how to effectively ensure the confidentiality, integrity, and availability of data in open environments has become a key issue in the informationization strategies of various countries. Global practices in data security are continuously evolving, covering the entire data lifecycle, including collection, storage, transmission, usage, sharing, and destruction, with the goal of building a secure and reliable digital environment [2]. Data security is not only a critical measure to defend against cyber threats and prevent information leakage but also an essential infrastructure for the healthy development of the digital economy.
In recent years, the development of domestic operating systems in China has accelerated significantly, with overall capabilities transitioning from merely “usable” to increasingly “user-friendly.” A number of system products offering improved user experience and free availability have emerged [3]. As the core component of information systems, the security of operating systems is directly linked to the protection of national critical infrastructure and sensitive data. However, the rapid adoption of domestic operating systems has also introduced new cybersecurity challenges, as inherent vulnerabilities in operating systems give rise to various security threats. First, data privacy leakage [4,5,6]: according to Verizon’s 2020 Data Breach Investigations Report (DBIR), there were 157,525 network incidents, of which 3950 resulted in data breaches. Second, increasing cyber risks, such as attacks that disseminate misinformation and malware: in August 2022, the Centre Hospitalier Sud Francilien (CHSF), a medium-to-large hospital near Paris, France, suffered a ransomware attack demanding 10 million USD. Third, attacks on critical information infrastructure: for instance, in January 2022, Belarusian railway critical infrastructure was attacked. Therefore, operating system hardening should support high integrity, reliability, availability, privacy, scalability, and confidentiality, while achieving the highest organizational objectives (benefits) of critical IT infrastructure at the lowest possible risk level [7].
At present, almost all aspects of our daily lives are recorded and stored in digital form [8], thereby constructing a vast and complex electronic world that encompasses multiple types of data, including business, finance, healthcare, multimedia, the Internet of Things (IoT), and social media [9]. The data lifecycle can be divided into three stages: data generation, data storage, and data processing [10]. Each stage has corresponding methods to ensure data security. In the data generation stage, actively generated data are protected from sensitive information leakage through access control. For instance, Di Francesco Maesa et al. [11] proposed an access control system supporting private attributes by leveraging blockchain, smart contracts, and zero-knowledge proofs. Passively generated data, on the other hand, are protected through obfuscation techniques. In the data storage stage, traditional approaches rely on file-level and database-level security solutions, which are relatively mature. However, as an emerging paradigm, cloud storage has raised significant data security concerns and has become a research hotspot. Li et al. [12] proposed a publicly verifiable public key encryption with equality test (PVPKEET) scheme, enabling users to verify whether encrypted replicas are securely stored in the cloud without decrypting the data. In the data processing stage, two primary requirements exist: first, preventing unauthorized disclosure when data may contain sensitive information; and second, extracting valuable information while ensuring privacy.
As cloud storage technologies continue to evolve, an increasing volume of data is being migrated to the cloud, and it is foreseeable that nearly all enterprises will transition to cloud-based infrastructures in the near future [13]. This trend has motivated researchers to integrate security auditing with various technical approaches—such as data encryption [14], access control mechanisms [15], and differential privacy [16]—to verify data integrity in cloud storage environments. Within enterprises and organizations, artificial intelligence techniques are also being adopted for security monitoring and data leakage prevention, enabling real-time detection of anomalous access behaviors, unauthorized file transfers, and policy violations [17]. However, these approaches predominantly focus on technical or physical storage layers and largely overlook the impact of external environmental factors.
In fact, throughout the entire data lifecycle—generation, storage, and usage—human involvement in operational processes is unavoidable, and individuals often access and acquire data directly through visual perception. Compared with traditional network attacks or system intrusions, data leakage based on visual acquisition is characterized by high stealthiness, difficulty in forensic investigation, and limited preventive measures. Attackers may exfiltrate sensitive information without triggering conventional security mechanisms through methods such as covert screen recording, photographing displays, visual memorization, or remote screen capture. Therefore, implementing effective masking and auditing of sensitive information at the level of visual perception constitutes an indispensable complementary approach to ensuring data security.
Against the backdrop of data-driven digital transformation, the security protection of sensitive information and the establishment of trustworthy auditing mechanisms have become a focal point of concern in both academia and industry. To address the trust deficiencies and risks of data leakage caused by the reliance of current cloud storage auditing on third-party authorities (TPA), this study proposes an innovative system-wide data security auditing scheme based on the Chinese Linux operating system. The proposed scheme constructs a localized and autonomously controllable auditing architecture to achieve the core objective of “raw data remaining within the domain, data being usable but not visible.” This effectively avoids the integrity verification challenges and potential leakage risks associated with data transmission in traditional cloud auditing models.
The main contributions of this paper are as follows:
  • A lightweight sensitive information protection and auditing framework tailored for Chinese Linux operating systems is proposed. By incorporating dynamic interception and real-time analysis modules, the framework enables precise masking of sensitive screen information and full-chain tracking of sensitive data violations. Compared with existing TPA-based cloud auditing solutions, this method strictly confines data interaction boundaries within a trusted execution environment through localized processing, thereby fundamentally eliminating the trust gap caused by cross-domain data transmission.
  • A multi-pattern matching acceleration engine based on a hash table–optimized Chinese Aho-Corasick (AC) automaton is designed. To address the performance bottlenecks of traditional regular expression matching in high-concurrency scenarios, a deterministic finite automaton (DFA) together with a failure pointer jumping mechanism is introduced, thereby constructing a rapid retrieval model tailored for sensitive feature libraries.
  • Construct a scenario-oriented dynamic auditing mechanism. To accommodate the diversity of sensitive information characteristics across different application scenarios, a configurable sensitive word template library is constructed, thereby overcoming the generality limitations of traditional approaches. In addition, the method supports real-time statistics of sensitive word masking occurrences and generates timestamped audit reports to meet relevant data security compliance requirements.
The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 provides a detailed description of the proposed method; Section 4 presents the effectiveness analysis and performance evaluation; and Section 5 concludes the paper and discusses future research directions.

2. Related Work

In 2007, addressing the problem of remote data integrity verification, Ateniese et al. [18] were the first to propose a public auditing scheme for provable data possession (PDP). This scheme enables any public verifier to check data integrity without retrieving the data. In the same year, Juels and Kaliski [19] proposed the proof of retrievability (POR) model, in which sampling verification and error-correcting codes were employed to ensure the security of stored data. However, this scheme could only verify the integrity of static data. Later, Ateniese et al. [20] proposed another scheme based on symmetric key PDP for auditing dynamic data on cloud servers. This scheme supported dynamic modification and deletion operations but did not support insertion. Erway et al. [21] introduced authenticated skip lists in their dynamic provable data possession (DPDP) scheme. Wang et al. [22] proposed a new PDP scheme based on a rank-based Merkle hash tree (RMHT). By improving the original RMHT and using a non-leaf node sampling strategy, the scheme reduced the number of sampling nodes during the auditing phase and the number of update nodes during the update phase. Guo et al. [23] proposed an outsourced dynamic provable data possession (ODPDP) scheme built on DPDP, in which the integrity of all data block hashes is protected by a rank-based Merkle tree (RBMT) [24], while data block integrity is ensured through the combination of hashes and tags. This scheme allows multiple update operations to be executed and verified simultaneously while also supporting outsourced auditing. Yang et al. [25] designed a novel authenticated data structure, namely the numerical rank-based Merkle hash tree (NRMHT), which supports dynamic data operations. Rao et al. [26] proposed a batch-based dynamic outsourced auditing scheme in which neither the data owner nor the third-party auditor needs to store the Merkle hash tree, thus reducing the burden on the user side. Wang et al. 
[27] proposed a PDP protocol based on algebraic identities, which reduced the computational overhead of the verification algorithm. In addition, more schemes supporting dynamic updates have been proposed [28,29]. Some studies have also introduced auditing methods supporting privacy preservation [30,31], prevention of key leakage [32,33], multi-cloud storage [34,35], and data sharing [36,37]. All of these works fall under auditing methods for third-party auditing in cloud storage, focusing on verifying the integrity of outsourced data.
Some scholars have also combined security auditing with blockchain to verify data integrity, proposing different auditing schemes. Nakamoto [38] introduced the concept of Bitcoin, which marked the birth of blockchain. Blockchain is a shared, tamper-resistant distributed ledger that records the state of all entities in a blockchain network. With the development of Web 3.0, such decentralized and trustworthy systems have been widely applied in various fields. In data integrity auditing schemes based on third-party auditing, replacing the third-party auditor with blockchain technology can effectively address the trust issues between data owners and cloud storage providers, thereby improving the security and transparency of data storage and management. Yue and Li [39] proposed a blockchain-based framework for peer-to-peer (P2P) cloud storage data integrity verification. They developed a sampling strategy to enhance the efficiency of verification. Subsequently, Zhang et al. [40] introduced a blockchain-based time-encapsulated data auditing scheme, which allows users to simultaneously verify the integrity of outsourced data and timestamps in P2P outsourced storage systems. Zhang et al. [41] further proposed a blockchain-based delayed auditing scheme for public integrity verification, where the main idea is to require auditors to record each verification result on the blockchain. Chen et al. [42] proposed a blockchain-based dynamic PDP scheme for smart city data, which adopts authenticated data structures (ADS) to support data update operations. Miao et al. [43] proposed a blockchain-based shared data integrity auditing and deduplication scheme. This scheme employs an identity-based broadcast encryption deduplication protocol without a key server and implements key deduplication on the user side.
By leveraging the characteristics of convergent encryption, they introduced a data integrity auditing protocol to enable authenticated deduplication on the cloud service provider side. Wang et al. [44] proposed a blockchain-based provable data possession scheme for private data, which achieves data integrity verification using RSA signatures. This scheme imposes no restrictions on file block sizes, making it suitable for large files. Later, Dhinakaran et al. [45] proposed a comprehensive framework that integrates quantum key distribution (QKD), lattice-based cryptography (CRYSTALS-Kyber), and zero-knowledge proofs (ZKP) to ensure data security in blockchain systems for cloud computing. The main goal of this framework is to enhance resistance against quantum threats through QKD, provide quantum attack resistance using CRYSTALS-Kyber, and strengthen data privacy protection and verification processes using ZKP technology. All of the above schemes are blockchain-based auditing approaches.
Some scholars have further combined security auditing with practical application scenarios such as healthcare. Li and Tang [46] proposed a blockchain-based auditing scheme for distributed cloud medical data, in which each hospital server can act as a blockchain node, participate in consensus, and maintain a distributed ledger to perform data integrity auditing. Kumar et al. [47] proposed an identity-based integrity auditing scheme that ensures the privacy of user-sensitive information stored in multi-cloud environments while providing protection against replacement attacks. In this scheme, users’ personal (sensitive) information is protected, while general information is distributed across multiple clouds, thereby preventing sensitive data from being exposed to other shared users. They applied this scheme to the integrity auditing of sensitive medical information in multi-cloud storage. Tian and Ye [48], in the context of Healthcare 4.0, proposed a new certificateless public auditing scheme for cloud medical data (CPAMD). This scheme eliminates the need for complex certificate management and key escrow, enables efficient batch auditing, and allows stakeholders to access medical data and interact effectively.

3. Scheme Design

3.1. Threat Model

To clearly delineate the security boundaries and protection scope of the proposed method, this section defines the system’s threat model, including protected assets, underlying assumptions, and the attacker’s capabilities and attack vectors.
(1) Protected Assets
The primary assets protected by the system are highly sensitive textual data rendered on terminals of the Chinese Linux operating system. Such data may include, but are not limited to, state secrets, personal privacy information, medical records, or commercial confidential information. The goal is to ensure that these sensitive characters are identified and sanitized before being rendered into visible glyphs, while preserving the traceability of user operations.
(2) Trust Assumptions
  • Trusted Computing Base: We assume that the operating system kernel and underlying hardware are trustworthy. If the kernel is compromised—for example, if an attacker obtains root privileges and loads a malicious kernel module—any security mechanism built upon the kernel will fail.
  • Integrity of the Auditing Module: It is assumed that the system-wide auditing module proposed in this work is neither bypassed nor tampered with.
(3) Attacker Definition and Capabilities
This model primarily focuses on the following two types of attackers:
  • Insider threats. Over-privileged viewers: users who possess legitimate login credentials and pass identity authentication and file-system access control, yet encounter sensitive information beyond their “need-to-know” scope during business operations; Unintentional leakers: users who operate on sensitive data in public or uncontrolled environments and inadvertently expose it to bystanders.
  • Visual side-channel attackers. Shoulder surfing and photography: attackers who physically approach the screen or use imaging devices (e.g., mobile phones, pinhole cameras) to directly capture sensitive content displayed on the monitor. This represents the “last-mile” leakage risk that traditional encrypted transmission or storage encryption cannot mitigate; Screen-recording software: malicious user-space programs attempting to obtain sensitive content via screenshot or screen-recording APIs. Since the proposed solution intercepts at the low-level font-rendering layer (Freetype), sensitive content is sanitized before it enters the graphical buffer, thereby effectively defending against certain image-based capture attacks.

3.2. Overview and Background Knowledge

To address data security issues at the visual presentation layer, it is first necessary to clarify fundamental concepts in the operating system, such as characters, fonts, and glyphs, and to understand their interrelationships and roles in text processing and rendering workflows.
  • Character: A character is the basic unit of text and an element of a language, such as the letter “A,” the digit “1,” or the punctuation mark “!”.
  • Font: A font is a collection of character graphics with similar design and style, providing different appearances for characters. Fonts define the visual attributes of characters, including size, style (e.g., bold, italic), and strokes.
  • Glyph: A glyph is the actual graphical unit within a font used to render the appearance of a character. A single character can be represented by multiple glyphs. Glyphs are designed based on characters and can be described using curves and lines to form the shape of a character. Each glyph corresponds to different character styles and contextual forms.
As shown in Figure 1, the Chinese text rendering process in the operating system converts abstract character information into visible images, representing a critical procedure within the OS graphics subsystem. This process involves the coordination of multiple subsystems, covering key stages such as character encoding parsing, font matching, text layout, glyph generation, image rasterization, and final graphic composition and display output, and it exhibits high complexity and system dependency. Initially, the operating system receives text input from applications or external devices and parses it into a unified Unicode sequence to ensure consistent character handling in multilingual environments. Subsequently, the text layout engine analyzes and arranges the character sequence, handling bidirectional text, character ligatures, ligature substitution, kerning adjustments, and other language- and layout-specific features, generating glyphs along with their corresponding geometric positioning information. Once glyph positioning is determined, the operating system performs font matching and loading, locating the requested font or selecting a substitute, and mapping Unicode characters to specific glyph indices using the glyph mapping tables in font files (e.g., TrueType or OpenType). The font engine (such as FreeType) then extracts contour information based on the glyph indices and combines it with font metrics to execute precise typesetting. Next, glyph contours are rasterized into bitmap data. Anti-aliasing and subpixel rendering techniques may be employed to enhance display clarity, ensuring that characters remain legible across different display densities. After rasterization, the generated bitmap glyphs are drawn into the target graphic context by the graphics composition engine. Layer blending and clipping are applied in conjunction with background, color, transparency, and other attributes. 
Finally, the rendered result is written to the frame buffer and transmitted to the display device via the operating system’s graphics output subsystem (e.g., X11), completing the presentation of text on the screen. This workflow not only ensures the integrity and accuracy of text information from character encoding to visual presentation but also plays a key role in performance optimization, font compatibility, and cross-platform consistency.
In summary, characters constitute the basic elements of text, fonts define the visual appearance and style of characters, and glyphs are the actual graphical representations within a font used to display characters on the screen. To prevent the leakage of sensitive information from the perspective of visual information processing, it is crucial to intervene in the presentation of glyphs on the computer screen. Specifically, this involves masking the glyphs corresponding to sensitive information during the rendering process, thereby enabling the auditing of sensitive content.

3.3. Confidentiality Mechanism: Character Interception and Masking

Based on the text rendering workflow illustrated in Figure 1, this paper proposes a mechanism that embeds a security control module into the text rendering engine, enabling real-time inspection and masking of terminal display information, thereby achieving confidentiality at the visual layer. Specifically, the mechanism integrates character-level interception and replacement functionality deeply within the rendering engine, allowing semantic-level identification, analysis, and processing of character information before it is fully rendered graphically. This design provides fine-grained security control over output content without interfering with the logic of upper-layer applications, ensuring minimal visibility of sensitive information along the client-side display path. The core architecture, as shown in Figure 2, is based on a system-level hook mechanism. By dynamically injecting specific interception logic, the mechanism effectively captures low-level system calls or function interfaces invoked by applications when submitting character or glyph data to the graphics subsystem.
The confidentiality mechanism is implemented through dynamic library injection, leveraging the LD_PRELOAD environment variable provided by the Chinese Linux operating system. By configuring LD_PRELOAD, the compiled security auditing module (in the form of a shared object, .so) is forcibly loaded into the address space of the target application before any standard libraries. The injected module employs Procedure Linkage Table (PLT) hooking to intercept and override the core function FT_Load_Glyph in the widely used font rendering library libfreetype. As a powerful text rendering engine, FreeType is relied upon by most Chinese Linux operating systems to support multilingual text, complex typesetting, and internationalized text rendering, making it a widely representative component in practical applications.
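As an illustrative sketch only (the paper does not list the module’s source), LD_PRELOAD-based symbol interposition of FT_Load_Glyph might look as follows. The mask_index_for helper is hypothetical and stands in for the sensitive-word matching engine; the FreeType types are declared opaquely so the fragment stays self-contained, whereas a real module would include the FreeType headers.

```c
/* Illustrative LD_PRELOAD interposition sketch for FT_Load_Glyph.
 * Build: gcc -shared -fPIC -o mask.so mask.c
 * Run:   LD_PRELOAD=./mask.so target-application
 * FreeType types are declared opaquely here to keep the sketch
 * self-contained; a real module would include <ft2build.h>. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

typedef void *FT_Face;          /* opaque stand-ins for FreeType types */
typedef unsigned int FT_UInt;
typedef int FT_Int32;
typedef int FT_Error;

/* Hypothetical bridge to the matching engine: returns the glyph index
 * of the masking symbol if this glyph must be hidden, 0 otherwise. */
static FT_UInt mask_index_for(FT_Face face, FT_UInt glyph_index);

typedef FT_Error (*load_glyph_fn)(FT_Face, FT_UInt, FT_Int32);

/* Same name and signature as the real FreeType function, so the
 * dynamic linker binds calls from the application to this wrapper. */
FT_Error FT_Load_Glyph(FT_Face face, FT_UInt glyph_index, FT_Int32 load_flags)
{
    static load_glyph_fn real = NULL;
    if (real == NULL)            /* resolve the genuine symbol lazily */
        real = (load_glyph_fn)dlsym(RTLD_NEXT, "FT_Load_Glyph");

    FT_UInt mask = mask_index_for(face, glyph_index);
    if (mask != 0)
        glyph_index = mask;      /* render the masking symbol instead */

    return real(face, glyph_index, load_flags);
}

/* Stub decision logic for this sketch: treat glyph 0xFFFF as sensitive
 * and substitute glyph 42. A real module consults the AC automaton. */
static FT_UInt mask_index_for(FT_Face face, FT_UInt glyph_index)
{
    (void)face;
    return (glyph_index == 0xFFFFu) ? 42u : 0u;
}
```

Because the wrapper forwards to the original symbol via dlsym(RTLD_NEXT, ...), all FreeType behavior other than the glyph substitution is preserved, which is what keeps upper-layer application logic unaffected.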
By decoupling the logical processing layer from the visual presentation layer, the proposed mechanism achieves the security objective of “usable but not visible.” On the one hand, interception occurs at the FT_Load_Glyph font rendering stage, while complete and correct textual data are still preserved within the application. As a result, normal functionalities such as text editing, file saving, searching, spell checking, and data transmission remain unaffected. On the other hand, when text is rendered to the screen, the confidentiality mechanism intercepts the rendering process and replaces the glyph bitmaps corresponding to detected sensitive words with masking characters.
After the confidentiality mechanism successfully intercepts the FT_Load_Glyph function, it employs an efficient sensitive word matching algorithm to detect sensitive terms in the text in real time. Once a match is identified, the system immediately replaces the corresponding substring with predefined masking symbols, effectively concealing the original sensitive content at the visual layer. This process occurs before the characters are passed into the typesetting and rendering logic, ensuring that end users cannot perceive the presence of sensitive information in the graphical interface.
Because the interception layer operates beneath the graphical display protocols—directly at the rasterization-library level—the proposed method is inherently compatible with both the X11 and Wayland display architectures. For multi-process rendering scenarios, the module adopts a thread-safe design: the AC automaton is shared in memory as a read-only structure, while each thread maintains its own matching-state context. This design effectively eliminates lock contention and ensures stable performance under high-concurrency workloads. Consequently, the mechanism enhances the security granularity and response efficiency along the information output path while preserving system functionality and business logic consistency. It offers strong practical value and deployment flexibility, making it particularly suitable for terminal software systems and scenarios that require high levels of information display security and protection of sensitive data.

3.4. Audit Mechanism: Sensitive-Word Repository Management and Logging

Building on the protection of sensitive information, achieving traceability and effective handling of sensitive content is a critical task for domestic operating systems in terms of content security and risk management. Therefore, the construction of a security auditing mechanism for sensitive information holds significant practical and operational value. The designed auditing mechanism is illustrated in Figure 3, showing the complementary relationship between the confidentiality mechanism and the auditing mechanism. The confidentiality mechanism relies on the sensitive word library maintained by the auditing mechanism to identify and protect sensitive information in text. Simultaneously, the auditing mechanism records the operation logs generated by the confidentiality mechanism, ensuring that each action can be traced and audited. This bidirectional dependency guarantees that the operating system can effectively prevent the leakage of sensitive information during text processing while providing clear operational records and audit trails.
The design of the auditing mechanism is developed along two core dimensions: the construction of the sensitive word library and the design of the sensitive information logs. The sensitive word library must be highly customizable to flexibly adapt to diverse application scenarios and complex business requirements. Typically, the sensitive word library is stored in a dictionary structure, with the specific organization illustrated in Figure 4. Given the significant variation in the definition of sensitive information across different application environments, sensitive words are highly domain-dependent. Therefore, the library should be dynamically extended and adjusted according to the content security requirements of specific scenarios to enhance its applicability, coverage, and recognition accuracy, thereby effectively supporting subsequent auditing and control operations. Once the auditing mechanism is loaded into the operating system memory, it reads the dictionary pairs in the sensitive word library, which include a series of sensitive word entries along with their corresponding masking symbol configurations. The dictionary pairs are then parsed into a set of sensitive words and a set of masking symbols. Based on the sensitive words, the auditing mechanism subsequently constructs an Aho-Corasick (AC) automaton.
The design of the sensitive information log aims to comprehensively record and manage key operations and state changes during the filtering of sensitive information, serving as a crucial component for implementing content security auditing and traceability. The log system primarily stores multidimensional information related to sensitive word masking, covering the entire filtering process to ensure auditability, traceability, and controllability of sensitive information handling. Specifically, the sensitive information log should record the following details: the timestamp of masking operations, which allows reconstruction of the processing timeline for sensitive information; the cumulative count of masking actions, reflecting the frequency of sensitive content occurrences; the number and reasons for masking failures, which assist in identifying potential deficiencies or anomalies in the filtering mechanism; and the text forms after replacement, enabling accurate evaluation and visualization of the filtering effects. This approach effectively enhances transparency in sensitive word management, optimizes filtering strategies, and provides robust support for content security.
To enable scenario-oriented dynamic auditing without disrupting system operation, a hot-reload mechanism is introduced. The system implements a file-monitoring thread based on inotify; when changes to the sensitive-word configuration file are detected, the auditing module atomically updates the corresponding AC automaton structure in memory. This allows new auditing policies to take effect immediately without requiring application or operating system restarts, thereby ensuring system continuity and real-time responsiveness.
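The essence of the hot-reload step is the atomic swap: the new automaton is built fully off to the side, then published with a single reference assignment, so concurrent readers never observe a half-built structure. The sketch below illustrates this idea with a portable modification-time poll standing in for the inotify watch thread (Python has no stdlib inotify binding); the class and callback names are assumptions.

```python
# Stdlib-only sketch of the hot-reload idea. The paper's mechanism uses an
# inotify watch thread; here a modification-time poll stands in for inotify.
# The key point is the atomic swap: the new structure is built completely
# before the single reference assignment publishes it.
import os

class HotReloader:
    def __init__(self, path, build):
        self.path, self.build = path, build
        self.automaton = build(path)           # initial build
        self._mtime = os.path.getmtime(path)

    def poll_once(self):
        """Rebuild and atomically swap if the dictionary file changed."""
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:
            new_automaton = self.build(self.path)  # build off to the side
            self.automaton = new_automaton         # single reference swap
            self._mtime = mtime
            return True
        return False
```

With inotify, `poll_once` would instead run inside the watcher thread whenever a close-write event arrives on the configuration file.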
At the security control level, technical enforcement of role separation is applied. The sensitive-word repository and audit logs are stored in a protected directory with ownership assigned to root, which technically prevents ordinary users from tampering with auditing policies or deleting audit records. This design ensures the reliability and traceability of the auditing mechanism.

3.5. Hash Table-Based Chinese AC Automaton Character Matching Algorithm

When integrating the security auditing mechanism into the text rendering workflow, scenarios involving long text or high-frequency rendering requests can impose significant pressure on the system, as the detection and replacement of sensitive words may increase operating system resource usage, prolong application response times, and negatively affect user interaction fluidity and overall experience. To mitigate these issues and improve the efficiency of sensitive word recognition, this study introduces an Aho-Corasick (AC) automaton algorithm based on a deterministic finite automaton (DFA) within the confidentiality mechanism. The construction of this automaton is illustrated in Figure 5. The AC automaton is a multi-pattern string matching algorithm with linear time complexity, capable of matching multiple keywords simultaneously in a single text scan, thereby significantly reducing the time overhead of sensitive word matching. By constructing an efficient keyword matching state machine, the confidentiality mechanism can rapidly identify and process sensitive information in long texts without adding extra computational complexity. This approach ensures real-time and accurate detection while minimizing the impact on system performance, thereby maintaining efficient text rendering and a stable user experience.
In an English-language environment, the storage structure of the AC automaton is illustrated in Figure 6. Each node in the AC automaton stores only a single character and its associated state information, providing good storage efficiency. However, in the Chinese-language context, the character set is much larger; the number of commonly used Chinese characters exceeds 20,000, and the full Unicode character set contains over 60,000 characters. If a fixed-length array structure is still employed—allocating over 60,000 pointers for all possible child characters at each node—this would result in severe memory wastage. This issue is particularly pronounced in a Trie structure, which already contains numerous sparse nodes, and such a design would significantly increase memory overhead, severely constraining system performance and scalability.
To address the aforementioned issues, this study introduces a hash table as the child node storage structure in the design of AC automaton nodes. This approach leverages the space-saving capability and lookup performance of hash tables in sparse data access scenarios, effectively mitigating the resource wastage associated with traditional fixed-length array structures caused by sparse character encoding. Specifically, the child node set of each automaton node is maintained using a hash table constructed via the chaining method. The hash keys use the full UTF-8 encoding of characters, with each key representing a valid UTF-8 character, and the corresponding hash value pointing to the child node pointer. To improve query efficiency and reduce collision rates, the classical BKDR Hash function is employed, which is defined as follows:
$$\mathrm{hash}(s) = \sum_{i=1}^{n} s_i \times \mathrm{seed}^{\,n-i},$$
Here, $s_i$ denotes the $i$-th byte of the string $s$, and $n$ represents the total length of the string.
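The closed-form summation above is exactly Horner's recurrence `h = h * seed + byte` evaluated over the UTF-8 bytes. A minimal sketch follows; the seed value 131 is the conventional BKDR choice and an assumption here, since the paper does not state its seed, and the 32-bit mask merely keeps the hash bounded.

```python
# BKDR hash over the UTF-8 bytes of a string. The Horner form below equals
# the closed form hash(s) = sum_i s_i * seed^(n-i). seed = 131 is the
# conventional BKDR choice; the paper does not specify its seed value.

def bkdr_hash(s: str, seed: int = 131, mask: int = 0xFFFFFFFF) -> int:
    h = 0
    for byte in s.encode("utf-8"):    # s_i ranges over the UTF-8 bytes
        h = (h * seed + byte) & mask  # Horner's rule for the summation
    return h

bucket = bkdr_hash("足") % 1024       # e.g., index into a 1024-slot table
```

In the automaton, such a bucket index would select the chain holding the child-node pointer for that character.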
Based on the above design, the structure of each automaton node can be formally defined as follows:
$$\mathrm{Node} = \{\ \mathrm{Next}: \mathrm{HashMap}(\Sigma, \mathrm{Node}),\ \mathrm{fail}: \mathrm{Node},\ \mathrm{isEnd}: \mathrm{bool},\ \mathrm{word}: \Sigma^{*}\ \},$$
Here, Next represents the hash-mapped set of child nodes for the current node, fail denotes the failure pointer, isEnd indicates whether the node corresponds to the end of a pattern string, and word stores the complete pattern string text.
This structure not only preserves the core characteristics of efficient pattern matching in the AC automaton but also fully leverages the sparse storage and constant-time access properties of hash tables. In multi-byte character environments such as Chinese, it demonstrates superior space utilization and operational efficiency, offering excellent scalability and practical adaptability.
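The node structure defined above maps almost one-to-one onto the following Python sketch, where a `dict` plays the role of the chained hash table keyed by UTF-8 characters; the field names mirror the formal definition, and the class itself is illustrative rather than the paper's C implementation.

```python
# One-to-one Python rendering of the node structure defined above; the dict
# stands in for the chained hash table keyed by UTF-8 characters.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    next: Dict[str, "Node"] = field(default_factory=dict)  # hash-mapped children
    fail: Optional["Node"] = None                          # failure pointer
    is_end: bool = False                                   # ends a pattern string?
    word: str = ""                                         # full pattern text

root = Node()
root.next["足"] = Node()   # only branches that actually occur are stored
```

Because only existing branches are stored, a node with two children costs two dict entries rather than a 60,000-slot array.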
To provide a clearer illustration of the algorithmic procedure, Algorithm 1 presents the detailed construction and matching workflow of the hash-based Chinese AC automaton. The entire process is divided into three phases:
(1)
Trie Construction (Phase 1): All sensitive terms are inserted into the Trie. Each node manages its outgoing transitions using a hash-table-based child-node structure.
(2)
Failure Link Construction (Phase 2): A breadth-first search (BFS) is performed over the Trie to establish failure links, ensuring efficient state recovery upon mismatches.
(3)
Online Matching (Phase 3): The input text stream is scanned sequentially. Matches are reported in real-time through state transitions facilitated by the failure links.
Algorithm 1: Hash-Table-Based AC Automaton for UTF-8 Chinese Characters
Input:
W = {w1, w2, …, wk} (the set of sensitive-word patterns)
T = t1 t2 … tn (the input text stream)
Output: M = list of matched sensitive-word occurrences
Phase 1: Build Trie with Hash-Table Nodes
1: Initialize root node r and result list M ← ∅
2: for each word w in W do
3:          node ← r
4:          for each UTF-8 character c in w do
5:                  if c not in node.children then
6:                          node.children[c] ← new Node()
7:                  end if
8:                  node ← node.children[c]
9:          end for
10:        node.output.add(w)
11: end for
Phase 2: Build Failure Pointers using BFS
12: Initialize queue Q
13: for each child u of r do
14:          u.fail ← r
15:          Q.enqueue(u)
16: end for
17: while Q is not empty do
18:          v ← Q.dequeue()
19:          for each (char c, node u) in v.children do
20:                  f ← v.fail
21:                  while f ≠ r and c not in f.children do
22:                          f ← f.fail
23:                  end while
24:                  if c in f.children then
25:                          u.fail ← f.children[c]
26:                  else
27:                          u.fail ← r
28:                  end if
29:                  u.output.addAll(u.fail.output)
30:                  Q.enqueue(u)
31:          end for
32: end while
Phase 3: Online Matching
33: v ← r
34: for each UTF-8 character c in T do
35:          while v ≠ r and c not in v.children do
36:                  v ← v.fail
37:          end while
38:          if c in v.children then
39:                  v ← v.children[c]
40:          else
41:                  v ← r
42:          end if
43:          if v.output is not empty then
44:                  Append all matches in v.output to M
45:          end if
46: end for
47: return M
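Algorithm 1 translates directly into the following runnable Python sketch, with `dict` objects standing in for the per-node hash tables; class and method names are illustrative, and `match` reports each occurrence as a (start index, word) pair.

```python
# Runnable sketch of Algorithm 1: dicts serve as the per-node hash tables,
# and the three phases (Trie build, BFS failure links, online matching)
# follow the pseudocode above.
from collections import deque

class ACAutomaton:
    class _Node:
        __slots__ = ("children", "fail", "output")
        def __init__(self):
            self.children = {}   # hash-table child map: char -> node
            self.fail = None     # failure pointer
            self.output = []     # patterns reported at this state

    def __init__(self, words):
        self.root = self._Node()
        for w in words:                        # Phase 1: Trie construction
            node = self.root
            for c in w:
                node = node.children.setdefault(c, self._Node())
            node.output.append(w)
        q = deque()
        for u in self.root.children.values():  # Phase 2: BFS failure links
            u.fail = self.root
            q.append(u)
        while q:
            v = q.popleft()
            for c, u in v.children.items():
                f = v.fail
                while f is not self.root and c not in f.children:
                    f = f.fail                 # backtrack along failure links
                u.fail = f.children[c] if c in f.children else self.root
                u.output += u.fail.output      # inherit suffix matches
                q.append(u)

    def match(self, text):                     # Phase 3: online matching
        matches, v = [], self.root
        for i, c in enumerate(text):
            while v is not self.root and c not in v.children:
                v = v.fail
            v = v.children.get(c, self.root)
            for w in v.output:                 # report (start index, word)
                matches.append((i - len(w) + 1, w))
        return matches
```

Note that overlapping and nested patterns (e.g., "足球" and "球") are both reported, because each state's output list absorbs the output of its failure target during Phase 2.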

4. Experimental Analysis

4.1. Effectiveness Analysis

To verify the effectiveness of the proposed security auditing mechanism in practical systems, a set of functional tests based on typical input scenarios was designed and implemented. A formal model and theoretical foundation were constructed, and combined with the improved AC automaton algorithm and replacement strategies, the system’s capability for sensitive word detection and masking was systematically evaluated.
All experiments are conducted on a UOS-22.0 computer with 16 GB memory and an Intel(R) Core(TM) i5-13500 CPU @ 2.50 GHz. The security auditing mechanism is injected into the system as a dynamic link library (.so file), which intercepts and redirects the FreeType core function FT_Load_Glyph to perform real-time analysis and processing of character-level rendering requests.

4.1.1. Functional Verification

The test data consisted of a 100,000-character Chinese input text, encompassing 15 sets of controlled experiments. These experiments covered:
(1)
Single-character sensitive words (e.g., “红”, “土”, “早”);
(2)
Two-character sensitive words (e.g., “足球”, “美丽”, “商人”);
(3)
Compound phrases (e.g., “多余的”, “就是足球”, “一个小女孩”);
(4)
Benign text without sensitive content (e.g., “德国的球迷”, “那么美好”);
(5)
Mixed Chinese-English terms (e.g., “zu球”, “hong色”).
The control group represents rendering results with the security auditing module disabled, while the experimental group shows the results with the module enabled. Table 1 presents the test results, and Table 2 displays the corresponding audit logs.
The experimental results show that the proposed security auditing scheme can reliably intercept and replace multiple categories of sensitive words, while maintaining correctness even under complex conditions such as repeated occurrences and nested terms. It introduces no interference to normal text rendering, thereby avoiding false replacements or degradation of readability. Meanwhile, the audit logs record all intercepted and replaced sensitive contents along with their occurrence frequencies, providing strong traceability. Since the scheme adopts a deterministic string-matching algorithm—the Aho-Corasick automaton—rather than a probabilistic model, both precision and recall can theoretically reach 100% within the scope of the predefined sensitive-word repository. Overall, the proposed mechanism demonstrates high stability and accuracy at the functional level, as well as strong practical value in terms of security assurance and auditability.

4.1.2. Formal Definition of Sensitive Information Masking

To rigorously describe the proposed sensitive information masking mechanism, this subsection presents a formal definition of the input representation, masking rules, and audit log generation process.
Let the input text be a sequence of characters:
$$T = \{c_1, c_2, \ldots, c_n\},$$
where $c_i$ denotes the $i$-th character.
Let the sensitive lexicon be defined as:
$$S = \{s_1, s_2, \ldots, s_m\},$$
where $s_j$ denotes the $j$-th sensitive term or sensitive pattern, and $m$ is the total number of entries in the sensitive lexicon.
Define a rendering function $R(c)$ and a masking function $M(c)$. For any character $c_i \in T$, the output rule is given by:
$$R_{\mathrm{output}}(c_i) = \begin{cases} \mathrm{Mask}(c_i), & \exists\, s_j \in S \ \text{such that}\ c_i \in \mathrm{Match}(T, s_j) \\ \mathrm{Glyph}(c_i), & \text{otherwise} \end{cases}$$
where $\mathrm{Match}(T, s_j)$ denotes the set of character positions in $T$ that correspond to a matched occurrence of the sensitive term $s_j$. Characters belonging to any sensitive match are masked, while all other characters are rendered normally.
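The output rule above can be exercised with a short stdlib-only sketch; plain substring search enumerates the position set $\mathrm{Match}(T, s_j)$ here purely for self-containment, whereas the paper's system obtains the same positions from the AC automaton.

```python
# Illustration of the output rule: every character position covered by a
# match of some sensitive term is masked, all others render normally.
# Substring search replaces the AC automaton here only for brevity.

def mask_text(text, lexicon, mask_char="*"):
    covered = set()
    for s in lexicon:                    # enumerate Match(T, s_j) positions
        start = text.find(s)
        while start != -1:
            covered.update(range(start, start + len(s)))
            start = text.find(s, start + 1)
    return "".join(mask_char if i in covered else c   # Mask vs. Glyph
                   for i, c in enumerate(text))

masked = mask_text("德国的足球", ["足球"])   # → "德国的**"
```

Because masking is decided per position, benign text adjacent to a sensitive term is left untouched, matching the rule's otherwise-branch.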
Meanwhile, an audit log is generated to record sensitive term detection:
$$\mathrm{Log}(T) = \bigcup_{j=1}^{m} \{\, (s_j,\ t_{\mathrm{stamp}},\ \mathrm{count})\ \mid\ s_j\ \text{is detected in}\ T \,\},$$
where $t_{\mathrm{stamp}}$ denotes the detection timestamp and $\mathrm{count}$ represents the number of occurrences of the sensitive term.

4.1.3. Sensitive Information Masking Model

The model is first formally defined as follows:
  • Let the text string be $T = t_1 t_2 \cdots t_n$, where $t_i \in \Sigma$, and $\Sigma$ represents the set of UTF-8 encoded characters.
  • Let the set of sensitive words be $P = \{P_1, P_2, \ldots, P_k\}$, where $P_j = p_{j1} p_{j2} \cdots p_{j m_j}$ and $p_{ji} \in \Sigma$.
  • Let the matching function be $M(T, P) = \{(i, j) \mid T[i:j] = P_j,\ P_j \in P\}$.
  • Let the replacement function be $R(T, M)$, which replaces all matched positions with the masking string $R_j$.
We define output security as follows:
Definition 1 (Output Security). 
The system output is said to satisfy output security if, for any $P_j \in P$, it holds that $P_j \notin R(T, M)$, i.e., no sensitive word occurs as a substring of the replaced output.

4.1.4. Correctness Theorem of Matching

We first present a theorem regarding the correctness of the matching function M T , P :
Theorem 1 (Completeness). 
If the pattern set $P$ is constructed into an AC automaton and a linear scan is performed over the input string $T$, then the matching function $M(T, P)$ returns all positions of every $P_j \in P$ occurring in $T$, without omissions or false matches.
Proof of Theorem 1. 
The AC automaton consists of a deterministic finite automaton (DFA) combined with a failure pointer mechanism. Its structure is built as a prefix tree based on the pattern set P , and each node is assigned a failure pointer to the longest matching suffix. During the matching process, whenever the current character fails to match, the failure pointer guides the automaton to backtrack to the longest acceptable suffix state and continue matching, thereby preventing omissions. Each time a valid state is reached, all matched pattern strings can be accurately reported through the output set of that node. Therefore, the entire process covers all valid matches without matching non-pattern strings, satisfying the completeness requirement. □

4.1.5. Correctness Theorem of Replacement

Based on the completeness of matching, we further present a theorem guaranteeing the confidentiality of the replacement logic:
Theorem 2 (Confidentiality). 
If the matching function $M(T, P)$ can enumerate all occurrences of sensitive words, and the replacement function $R(T, M)$ performs irreversible masking at each position, then for any $P_j \in P$ it holds that $\Pr\big[P_j \in R(T, M)\big] = 0$.
Proof of Theorem 2. 
The completeness of the matching function $M(T, P)$ (established in Theorem 1) ensures that all positions of pattern strings in the text are accurately identified. The replacement function $R$ replaces each matched interval $T[i : i + |P_j| - 1]$ with an irreversible masking string $R_j$ satisfying $R_j \neq P_j$ for every $P_j$, where $R_j$ contains no substring matching any $P_k \in P$. Therefore, no substring $P_j \in P$ exists in the replaced text, and all sensitive words are completely removed from the output, satisfying the definition of output security. □
Through the above analysis, the effectiveness of the confidentiality mechanism is thus demonstrated.

4.2. Complexity Analysis

We analyzed the time and space complexity of the security auditing mechanism. The primary contributions to time and space complexity in the confidentiality mechanism arise from the construction of the AC automaton, the matching phase, and the sensitive word replacement phase. Reading the sensitive word dictionary and recording the audit logs mainly involve file I/O operations.

4.2.1. AC Automaton Construction Phase

The construction of the AC automaton involves two steps: first, the construction of the Trie tree, in which multiple pattern strings are inserted into the Trie to form the basic multi-pattern storage structure; second, the construction of failure pointers, which are established using breadth-first search (BFS) to ensure rapid backtracking upon a match failure. Let the pattern set be $S = \{s_1, s_2, s_3, \ldots, s_j\}$, where the total length of all patterns is $L$, i.e., $L = \sum_{i=1}^{j} |s_i|$.
For a traditional array-based AC automaton, each node allocates an array for the entire character set, and inserting each character has a time complexity of O ( 1 ) . Therefore, the overall time complexity for constructing the Trie tree is O ( L ) . However, in a Chinese-language environment, the character set is very large, and initializing the array at each node incurs significant overhead, resulting in wasted space within nodes. In contrast, for the hash table-based AC automaton, each node allocates hash table entries only for characters that actually appear. The average time complexity for inserting a character remains O ( 1 ) , so the average time complexity for constructing the Trie tree is still O ( L ) . Nevertheless, because nodes store only the existing branches, the construction process is more efficient in Chinese-language environments, and the space overhead is significantly reduced.
Time Complexity of Failure Pointer Construction: For both the traditional array-based AC automaton and the hash table-based AC automaton, the construction of failure pointers is performed using a BFS traversal of the Trie tree. Each node's failure pointer is computed only once, and each character backtracking operation is executed at most once. Let the Trie tree contain $N$ nodes, and model the BFS process as traversing the node set $V$, with each node $v \in V$ undergoing one failure pointer computation. Let $r_v$ denote the average number of backtracking steps per node. The total backtracking complexity can then be expressed as $T_{\mathrm{fail}} = \sum_{v \in V} (1 + r_v)$, where the 1 accounts for visiting the node itself and $r_v$ for the backtracking operations. Since $r_v \le \mathrm{depth}(v)$, and the maximum depth of the Trie does not exceed the length of the longest pattern string, we have $r_v \le \log_{\sigma} N$ (where $\sigma$ is the size of the character set). In practice, each failure backtrack occurs at most once per node, so we can approximate $T_{\mathrm{fail}} = O(N) \le O(L)$. Therefore, the time complexity of this process remains $O(L)$.
Space Complexity of the Trie Tree: For the traditional array-based AC automaton, each node allocates an array for the entire character set. In a Chinese-language environment, this corresponds to the Unicode character set, resulting in a space complexity of $O(N \cdot \sigma)$, where $N$ is the number of nodes and $\sigma$ is the size of the character set. In contrast, for the hash table-based AC automaton, each node stores only the branches that actually appear. The space complexity is $O(N \cdot k)$, where $k$ is the average out-degree of a node and $k \ll \sigma$. Therefore, the hash table implementation significantly reduces memory usage.
In summary, during the construction phase, both methods have a theoretical time complexity of O ( L ) . However, in a Chinese-character environment, the hash table-based AC automaton shows clear advantages in terms of space utilization and construction efficiency.

4.2.2. AC Automaton Matching and Replacement Phases

The core of the matching phase is to efficiently locate all pattern strings in the input text $T$ using the Trie tree and failure pointers. Let the text length be $n = |T|$. During matching, the automaton scans the text $T$ character by character from the beginning, attempting state transitions in the Trie tree. For both the traditional array-based AC automaton and the hash table-based AC automaton, in the ideal case each character directly follows a transition from the current state to the next state, yielding a time complexity of $O(n)$. When a mismatch occurs, the automaton backtracks along the failure pointers to find the longest feasible suffix state and continues matching. Since each character undergoes at most one failure-pointer backtrack on average, the amortized time complexity of this process is $O(1)$ per character, and the expected overall time complexity of the matching phase remains $O(n)$. In extreme cases, for example when malicious input is constructed such that each character triggers the deepest backtracking, a character may experience up to $O(M)$ backtracks, where $M$ is the length of the longest pattern string in the set. Consequently, the worst-case time complexity becomes $O(M \cdot n)$. After a sensitive word is successfully matched, it is replaced using direct substitution, which has a time complexity of $O(1)$ per match.
In summary, in the matching and replacement phases, the additional time overhead for sensitive word replacement is usually negligible. Therefore, the expected overall time complexity is $O(n)$, while in extreme input cases the worst-case complexity may degrade to $O(M \cdot n)$.

4.3. Performance Analysis

This subsection presents a comparative evaluation of the proposed scheme across four experimental scenarios to demonstrate its feasibility and practicality while ensuring system security. The experiments are designed as follows:
(1)
Impact of Input Text Size: The first experiment examines the effect of varying Chinese text input sizes on application startup time, CPU utilization, and memory overhead, while keeping the sensitive word dictionary size constant. This comparison assesses the scheme’s impact on system performance relative to data volume.
(2)
Impact of Dictionary Size: The second experiment analyzes how different sensitive word counts affect the same performance metrics (startup time, CPU, and memory), with the input text size held fixed.
(3)
Combined Variable Analysis: The third experiment investigates startup time trends by simultaneously adjusting both the size of the sensitive word corpus and the volume of the input text.
(4)
Latency during Normal Browsing: The fourth experiment simulates a standard usage scenario to measure the average rendering latency per 1000 consecutive characters and evaluates performance overhead under multi-window concurrent conditions.

4.3.1. Performance Impact of Adding the Security Auditing Mechanism Under a Fixed Sensitive Word Scale

In Comparative Experiment I, four scenarios were designed to evaluate the impact of varying Chinese text scales on application startup time, system CPU usage, and memory overhead under the condition of a fixed sensitive word scale. The four scenarios, labeled A, B, C, and D, correspond to approximately 2000, 10,000, 50,000, and 100,000 Chinese characters of input text, respectively. To enhance the stability and reliability of the experimental results, each scenario was subjected to repeated startup tests of 10, 50, 100, and 500 iterations. The average values of these repeated tests were then calculated to eliminate the influence of random errors.
Figure 7 presents the impact of the security auditing mechanism on file-opening latency. In Scenario A, compared with the baseline without the auditing mechanism, the average startup time decreases by 0.628 ms at 10 repetitions. However, for 50, 100, and 500 repetitions, the average startup time increases by 0.52 ms, 0.721 ms, and 0.925 ms, respectively. At 500 repetitions, both configurations (with and without auditing) exhibit a standard deviation of approximately σ ≈ 2.15 ms, and their 95% confidence intervals overlap substantially.
In Scenario B, compared with the baseline, the auditing-enabled configuration reduces the average startup time by 0.219 ms at 10 repetitions. For 50 repetitions, the latency increases by 0.223 ms. For 100 and 500 repetitions, the latency decreases by 0.727 ms and 0.185 ms, respectively. At 500 repetitions, both configurations exhibit a standard deviation of approximately σ ≈ 1.98 ms, and the confidence interval width is significantly larger than the observed mean difference.
In Scenario C, compared with the baseline, the auditing mechanism reduces the startup time by 0.395 ms at 10 repetitions, increases it by 1.054 ms at 50 repetitions, and decreases it by 0.968 ms and 1.032 ms at 100 and 500 repetitions, respectively. At 500 repetitions, the standard deviation of both configurations is approximately σ ≈ 2.85 ms. Statistical analysis shows that the observed differences fall within one standard deviation (1σ) and therefore lack statistical significance.
In Scenario D, the auditing-enabled configuration reduces startup time by 0.748 ms and 0.837 ms at 10 and 50 repetitions, respectively; increases it by 1.968 ms at 100 repetitions; and decreases it by 0.640 ms at 500 repetitions. At 500 repetitions, both configurations have a standard deviation of approximately σ ≈ 2.30 ms, and the confidence intervals fully cover zero.
Across all four scenarios, the mean differences between the baseline and auditing-enabled configurations consistently fall within the range of [−1.0 ms, +0.9 ms], with alternating positive and negative signs, indicating no systematic performance bias. Furthermore, the absolute mean differences in all scenarios remain smaller than their corresponding standard deviations (|ΔMean| < σ), confirming that the observed fluctuations are indistinguishable from normal stochastic noise. Based on these results, we conclude that the proposed security auditing mechanism introduces negligible overhead to the startup path, achieving the design goal of lightweight operation.
Figure 8 illustrates the impact of introducing the security auditing mechanism on overall system performance. In Scenario A, after adding the security auditing mechanism, the average CPU usage increased by only 0.77%, and the average memory overhead increased by 18.74 MB; in Scenario B, the average CPU usage increased by 0.35%, and the average memory overhead increased by 17.69 MB; in Scenario C, the average CPU usage increased by 0.39%, and the average memory overhead increased by 13.3 MB; in Scenario D, the average CPU usage increased by 0.2%, and the average memory overhead increased by 25.22 MB. In summary, introducing the security auditing mechanism had a relatively limited impact on system resource consumption across all scenarios.

4.3.2. Performance Impact of Adding the Security Auditing Mechanism Under a Fixed Chinese Text Input Size

In Comparative Experiment II, four scenarios were set up using Chinese text input of the same size (10,000 characters) while progressively increasing the number of sensitive words, in order to evaluate the effect of sensitive word scale on application startup time and CPU usage. Scenarios E, F, G, and H corresponded to 10, 100, 500, and 1000 sensitive words, respectively. To improve the stability and reliability of the results, each scenario was repeated 500 times, and the average values of the collected data were analyzed. The experimental results are shown in Figure 9. The system CPU average usage remained relatively stable at approximately 13%. Meanwhile, the application’s average startup time exhibited a gradual increase with the growth of the sensitive word set. These results indicate that the security auditing mechanism maintains good stability when processing sensitive word sets of varying sizes, with minimal impact on system CPU load. Although the number of internal read and write operations within the auditing mechanism increases as the number of sensitive words grows, the associated performance overhead remains within an acceptable range.

4.3.3. Performance Impact of Adding the Security Auditing Mechanism Under Different Scenarios

In Comparative Experiment III, three scenarios were designed to evaluate the combined effect of text size and sensitive word count on application startup time in practical usage environments. Scenario I corresponds to an input text of 100,000 Chinese characters with 50 sensitive words, representing a large-text scenario with a relatively small sensitive-word set. Scenario J uses an input text of 10,000 characters with 500 sensitive words, representing a small-text but large–sensitive-word scenario. Scenario K combines an input text of 100,000 characters with 500 sensitive words, representing a composite scenario with both large text size and a large sensitive-word set.
To ensure the stability and reliability of the results, each scenario was tested with 500 repeated startups, and the results were averaged to minimize the impact of sporadic anomalies. The experimental results are shown in Figure 10. Compared with the baseline without the security mechanism, the average startup time increased by 1.043 ms and 0.811 ms in Scenarios I and J after enabling the security auditing mechanism, respectively, while Scenario K showed a decrease of 1.073 ms in average startup time. The data in Figure 10 indicate that introducing the security auditing mechanism has a minimal impact on application startup time across different scenarios, essentially posing no effect on practical operations. This result further confirms that the proposed scheme provides strong system security while maintaining excellent performance stability and scenario adaptability.

4.3.4. Normal Text Rendering Latency Test

The previous experiments primarily evaluated the impact of the security auditing mechanism during the application startup phase. However, in real-world usage, users are more concerned with the smoothness of text browsing. To address this, we designed Comparison Experiment IV to simulate a typical “normal usage scenario” and assess the runtime overhead introduced by the auditing mechanism.
In this experiment, we measured the average rendering time for every 1000 continuously input characters and additionally simulated a multi-window concurrent rendering scenario. For both configurations—with and without the security auditing mechanism—we recorded the average rendering latency per 1000 characters, CPU load, and the average window-opening time under five concurrent windows.
The results are shown in Table 3. Compared with the baseline without the security auditing mechanism, the average rendering time per 1000 characters increased by 0.64 ms after enabling the mechanism, while CPU load increased by 0.15%. In the multi-window rendering scenario, enabling the mechanism introduced an additional 0.01 ms latency. These results indicate that the proposed security auditing mechanism remains highly efficient when processing continuous text streams and imposes no significant computational burden. The experiment demonstrates that the mechanism provides strong security guarantees while fully supporting multi-window and normal browsing workloads.

4.4. Comparison with Operating-System-Level Approaches

To highlight the uniqueness and advantages of the proposed scheme in data leakage prevention, this section compares our method with several mainstream Linux security mechanisms, including SELinux MLS, AppArmor, and eBPF. The comparison dimensions cover control granularity, protection mechanisms, and visual protection capability. The detailed comparison is presented in Table 4.
Traditional MAC mechanisms such as SELinux MLS and AppArmor primarily enforce access control policies that determine who can open a file, but they lack visibility control over what part of the content can be displayed. As a result, data visibility remains an “all-or-nothing” model. Although eBPF provides flexible programmability within the kernel, it is not well suited for handling complex user-space text rendering logic, and direct intervention in user-space data may introduce risks.
In contrast, the proposed method operates directly on the character rendering pipeline, enabling fine-grained, content-level “usable-but-invisible” control. It delivers higher stability and performance in large-scale Chinese text scenarios and effectively compensates for the lack of visual side-channel protection in traditional operating-system-level mechanisms. Therefore, it serves as a key enhancement to data leakage prevention in domestic Chinese Linux operating systems.

4.5. Security Analysis

Based on the threat model defined in Section 3.1 and the experimental results presented above, this section analyzes the security properties of the proposed auditing method. The analysis evaluates how the system defends against data leakage risks while ensuring operational integrity and availability.
(1) Defense Against Visual Data Leakage: The core security feature of the proposed method is the interception of sensitive data at the rendering stage. By sanitizing sensitive characters before they are converted into visible glyphs, the system mitigates "shoulder surfing" and unauthorized screen captures: even if an attacker gains physical or visual access to the terminal, the specific content of state secrets or private data remains obscured.
(2) Fine-Grained Content Control: Unlike traditional access control (DAC/MAC) mechanisms such as SELinux, which operate at the file or process level, this scheme provides security at the semantic and character level. It can distinguish between sensitive and benign text within the same application window, allowing precise blocking of confidential information without hindering legitimate application usage.
(3) Robustness Against Obfuscation: The improved AC automaton algorithm makes the system resilient against common text obfuscation techniques. As demonstrated in the effectiveness analysis (Section 4.1), the system correctly identifies and masks sensitive words even in complex scenarios involving nested phrases, overlapping keywords, or repeated sequences, preventing attackers from bypassing the filter through sentence restructuring.
(4) Traceability and Non-Repudiation: The integrated logging mechanism provides a strong audit trail. Every attempt to render sensitive information is recorded with timestamps and content details, ensuring non-repudiation and allowing security administrators to trace the source of potential leaks.
(5) Resistance to Denial-of-Service (DoS) via Rendering: The performance analysis indicates minimal overhead, which prevents the security mechanism itself from becoming a bottleneck. An attacker cannot easily exhaust system resources or freeze the desktop environment simply by flooding the terminal with text input, so system availability remains high.
(6) Prevention of False Positives: The replacement strategy is strictly bounded by the sensitive keyword dictionary. Benign text adjacent to sensitive words remains unaffected (as shown in the functional tests). This preserves the integrity of normal business operations and prevents the security auditing mechanism itself from acting as a denial-of-service vector against legitimate work.
(7) Robust Handling of Multi-Byte Characters: The proposed AC automaton-based algorithm is specifically optimized for the multi-byte character sets used in Chinese Linux operating systems. Sensitive information containing Chinese characters is accurately identified and masked, addressing the character matching and segmentation complexity that generic or non-localized auditing tools often handle incorrectly.
(8) Real-Time Interception Capability: The system operates synchronously with the text rendering pipeline. This eliminates any "time-of-check to time-of-use" gap in which sensitive data could briefly flash on the screen before being masked, providing continuous protection.
(9) Immunity to Application-Layer Evasion: Because the mechanism is embedded within the underlying OS rendering libraries, it functions independently of upper-layer applications. A malicious or compromised text editor, browser, or chat client cannot bypass the auditing simply by ignoring standard API calls, since rendering is handled by the OS.
(10) Stability Under High Load: The formal model guarantees deterministic state transitions in the detection algorithm. Under high-load conditions with multiple windows rendering text simultaneously, the system maintains its security posture without race conditions that could lead to leakage, as evidenced by the negligible multi-window latency increase (+0.01 ms).

4.6. Ethical and Privacy Considerations

The proposed system-wide text rendering interception mechanism involves real-time analysis of terminal display content, which inevitably raises ethical concerns regarding user privacy and the boundaries of monitoring. To ensure responsible deployment and usage, the system design adheres to the following three principles.
First, purpose limitation and transparency. The sole intention of this technique is to prevent highly sensitive data—such as state secrets, commercial confidential information, or personally identifiable information (PII)—from being visually exposed in uncontrolled environments. It is not designed for comprehensive monitoring of user activities. In real-world deployments, organizations should follow transparency principles by clearly informing end users about the existence of the auditing mechanism, its monitoring scope, and the conditions under which it is activated, thus avoiding covert intrusion into user privacy.
Second, data minimization. The auditing mechanism enforces strict limitations on the granularity of collected information. The audit logs record only the triggered sensitive word entries, timestamps, and occurrence counts. They do not capture ordinary text, non-matching content, or any form of screen snapshots or contextual data. This design ensures that, even if audit logs are exposed, an attacker cannot reconstruct the user’s full operational context or private communication. Consequently, the system balances traceability with protection of user privacy.
Third, access control and regulatory compliance. Since the configuration of the sensitive-word repository is managed by system administrators, there exists potential for privilege misuse. To mitigate this risk, we recommend role separation and auditing of administrative operations themselves. Moreover, the deployment of this technology must strictly comply with relevant data protection laws and regulations to ensure that the technical measures operate within legal boundaries.
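As a concrete reading of the data-minimization principle, the sketch below shows the kind of record such an audit log would contain: only the triggered dictionary entry, a timestamp, and counters, never surrounding text or screen content. The field names are our own illustration of the columns shown in Table 2, not the system's exact schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class AuditRecord:
    """One audit-log entry. Deliberately minimal: no surrounding text,
    no screenshots, no context -- only the kinds of fields listed in
    Table 2. Field names are illustrative, not the paper's schema."""
    timestamp: str        # e.g. "26 March 2025"
    sensitive_word: str   # the dictionary entry that was triggered
    mask_count: int       # successful maskings
    fail_count: int       # masking failures

    def to_log_line(self) -> str:
        d = asdict(self)
        d["total"] = self.mask_count + self.fail_count
        return json.dumps(d, ensure_ascii=False)
```

Even if such a log leaks, an attacker learns only that a given keyword was masked a certain number of times, not the user's surrounding text or activity, which is the privacy balance the principle aims for.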

5. Conclusions

The Chinese Linux operating system has been widely deployed across multiple sectors in China, yet it remains exposed to data leakage risks. To address this issue, this paper proposes a data leakage prevention security auditing mechanism suitable for general-purpose operating systems. By dynamically intercepting and analyzing application output at the visual information processing layer, the mechanism enables real-time monitoring and effective prevention of sensitive data leakage. Unlike traditional static policies or coarse-grained filtering methods, this scheme embeds the auditing logic at the architectural level, offering strong transparency and extensibility, and achieves fine-grained security control without disrupting the normal user experience.
Furthermore, this paper introduces an improved Aho-Corasick (AC) automaton-based sensitive information recognition algorithm, which supports rapid matching of large-scale sensitive word sets and is suitable for multi-byte character sets such as Chinese. This enhancement significantly improves the detection efficiency and processing capability of the proposed scheme across diverse application scenarios.
Performance evaluations and comparative experiments conducted under multiple scenarios demonstrate that the proposed security auditing mechanism ensures data security while imposing minimal startup time overhead, maintaining strong operational performance and system compatibility. The experimental results indicate that the scheme not only effectively meets practical security requirements but also offers generality and customizability for multi-scenario deployment, highlighting its high potential for practical application and wider adoption.

Author Contributions

Conceptualization, W.M. and H.G.; methodology, W.M.; software, H.G. and A.X.; validation, A.X. and X.M.; formal analysis, X.M.; writing—original draft preparation, W.M. and H.G.; writing—review and editing, A.X.; visualization, H.G. and A.X.; supervision, X.M.; project administration, W.M. and X.M.; funding acquisition, W.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Henan Department of Science and Technology [grant numbers 252300421873 and 252102210030].

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

Author Xuegang Mao is employed by Beijing Shenzhou Aerospace Software Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Text Rendering Workflow in the Operating System.
Figure 2. Core Architecture of the Confidentiality Mechanism. The red crosses indicate points where the confidentiality mechanism takes effect; the asterisks ("*") mark masked sensitive words.
Figure 3. Auditing Mechanism Architecture. The asterisks ("*") mark masked sensitive words.
Figure 4. Structure of the Sensitive Word List and Audit Log. The asterisks ("*") mark masked sensitive words.
Figure 5. Construction Workflow of the Aho-Corasick (AC) Automaton.
Figure 6. Physical Storage Structure of the AC Automaton.
Figure 7. Impact of the Security Auditing Mechanism on Startup Time under a Fixed Sensitive Word Scale.
Figure 8. CPU and Memory Overhead under a Fixed Sensitive Word Scale.
Figure 9. Impact of Different Sensitive Word Scales on Performance under a Fixed Text Input Size.
Figure 10. Performance Impact under Varying Sensitive Word Scales.
Table 1. Results of the Controlled Experiments. The asterisks ("*") mark masked sensitive words.

| Index | Text Segment | Expected Output | Control Group | Experimental Group |
|---|---|---|---|---|
| 1 | 红色 | *色 | 红色 | *色 |
| 2 | 土地 | *地 | 土地 | *地 |
| 3 | 早晨 | *晨 | 早晨 | *晨 |
| 4 | 中国足球 | 中国** | 中国足球 | 中国** |
| 5 | 那么美丽 | 那么** | 那么美丽 | 那么** |
| 6 | 商人的儿子 | **的儿子 | 商人的儿子 | **的儿子 |
| 7 | 几乎是多余的 | 几乎是*** | 几乎是多余的 | 几乎是*** |
| 8 | 这就是足球 | 这**** | 这就是足球 | 这**** |
| 9 | 有一个小女孩 | 有***** | 有一个小女孩 | 有***** |
| 10 | 德国的球迷 | 德国的球迷 | 德国的球迷 | 德国的球迷 |
| 11 | 那么美好 | 那么美好 | 那么美好 | 那么美好 |
| 12 | zu球 | zu球 | zu球 | zu球 |
| 13 | hong色 | hong色 | hong色 | hong色 |
Table 2. Audit Log.
| Index | Timestamp | Sensitive Information | Masking Count | Failure Count | Total Occurrences |
|---|---|---|---|---|---|
| 1 | 26 March 2025 | | 78 | 0 | 78 |
| 2 | 26 March 2025 | | 50 | 0 | 50 |
| 3 | 26 March 2025 | | 34 | 0 | 34 |
| 4 | 26 March 2025 | 足球 | 15 | 0 | 15 |
| 5 | 26 March 2025 | 美丽 | 72 | 0 | 72 |
| 6 | 26 March 2025 | 商人 | 12 | 0 | 12 |
| 7 | 26 March 2025 | 多余的 | 3 | 0 | 3 |
| 8 | 26 March 2025 | 就是足球 | 3 | 0 | 3 |
| 9 | 26 March 2025 | 一个小女孩 | 2 | 0 | 2 |
Table 3. Normal Text Rendering Latency Test Results.
| Metric | Without Security Auditing | With Security Auditing | Difference |
|---|---|---|---|
| Average rendering time per 1000 chars | 2.06 ms | 2.70 ms | +0.64 ms |
| CPU load | 11.16% | 11.31% | +0.15% |
| Multi-window rendering latency | 0.37 ms | 0.38 ms | +0.01 ms |
Table 4. Comparison of Operating-System-Level Security Mechanisms.
| Feature | SELinux MLS | AppArmor | eBPF | Proposed Method |
|---|---|---|---|---|
| Control granularity | File/process-level (coarse-grained) | Path/program-level (coarse-grained) | Syscall/network-packet level | Character/content-level (fine-grained) |
| Core mechanism | Mandatory Access Control (MAC) | Mandatory Access Control (MAC) | Observability and filtering | Rendering interception and dynamic desensitization |
| Data visibility | Full visibility once access is granted | Full visibility once access is granted | Observable but generally unmodifiable | Data remains usable but visually hidden |
| Visual protection | None | None | Weak | Strong (FreeType-level rendering interception) |