1. Introduction
Online assessments are widely used in real-life applications to survey and collect information on how satisfied customers or users are with the services, products, and applications provided to them. An online assessment typically consists of two types of questions, i.e., multiple-choice questions (MCQ) and open-ended questions (OEQ), and its questions are generally divided into three groups, i.e., demographic questions, behavioral and satisfaction questions, and source questions. The collected information is generally used to improve customer services, organizational policies, or marketing policies. For this reason, the collected data are often released to data analysts, which raises concerns about privacy violations. To address these concerns, the authors of k-Anonymity [1] and its extended versions [2,3,4,5,6,7] recommend that all explicit identifier values of users (e.g., citizen ID, SSN, and student code) be removed and that all unique quasi-identifier values be suppressed or generalized to less specific values so that they become indistinguishable.
An illustrative example of privacy preservation using k-Anonymity [1] is presented for the case where the anonymity parameter k is set to 2. Let Table 1 denote the original dataset. In this dataset, SSN and Name serve as explicit identifiers, while Gender, Education, and Position are quasi-identifier attributes. Salary is treated as a sensitive attribute. To achieve privacy preservation, all values corresponding to the explicit identifiers (SSN and Name) are removed. In addition, the values of the quasi-identifier attributes (Gender, Education, and Position) are generalized such that each combination appears in at least two indistinguishable tuples. Under these transformations, the released version of Table 1 satisfies the requirements of 2-Anonymity, as shown in Table 2. From Table 2, it can be observed that any query on the quasi-identifier attributes that matches a tuple matches at least two tuples, thereby preventing unique tuple re-identification. As a result, the released dataset in Table 2 provides stronger privacy protection against re-identification attacks than the original dataset shown in Table 1. Unfortunately, k-Anonymity and its extended versions still exhibit notable limitations, including degraded data utility and vulnerability to privacy attack techniques introduced after their deployment.
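The 2-Anonymity condition described above can be checked mechanically. The following is a minimal sketch, assuming a released table stored as Python tuples; the rows, the column positions, and the function name `is_k_anonymous` are hypothetical and are not taken from Table 2.

```python
from collections import Counter

# Hypothetical generalized rows (Gender, Education, Position, Salary);
# these values are illustrative and are not taken from Table 2.
released = [
    ("*", "Bachelor", "Staff", 30000),
    ("*", "Bachelor", "Staff", 32000),
    ("Female", "Graduate", "Manager", 55000),
    ("Female", "Graduate", "Manager", 61000),
]

QI_COLUMNS = (0, 1, 2)  # assumed positions of Gender, Education, Position


def is_k_anonymous(rows, qi_columns, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[i] for i in qi_columns) for row in rows)
    return all(count >= k for count in groups.values())


print(is_k_anonymous(released, QI_COLUMNS, 2))  # each combination appears twice
print(is_k_anonymous(released, QI_COLUMNS, 3))  # no combination appears three times
```

The check counts only quasi-identifier combinations, so it says nothing about the sensitive attribute, which is one reason the extended models were introduced.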
In addition to datasets in which each attribute is designed to collect atomic values (e.g., Table 1), certain datasets are constructed to capture information in the form of user-generated textual content, such as feedback, opinions, or discussion statements. Such datasets are referred to as content-based datasets. An example of a content-based dataset is presented in Table 3. In this dataset, Student Code serves as the explicit identifier, while Datetime, Education, and Gender are treated as quasi-identifier attributes. The Opinion attribute contains sensitive information that must be protected when Table 3 is released beyond the scope of the data-collecting organization. Notably, beyond explicitly sensitive content, the Opinion attribute itself could also facilitate re-identification of the data owner. This is because user-generated textual content often embeds distinctive linguistic styles, personal viewpoints, and affective expressions that can uniquely characterize its author. As a result, even after removing the Student Code and generalizing the quasi-identifier attributes (Datetime, Education, and Gender), privacy concerns may persist. Therefore, merely applying k-Anonymity and its well-known extensions, including l-Diversity and t-Closeness [1,2,3,4,5,6,7], is insufficient to fully mitigate privacy violation concerns in content-based datasets. While these models protect against identity disclosure based on structured attributes, they fail to address privacy leakage arising from the semantic and stylistic characteristics inherent in textual content.
An illustrative case highlighting privacy violation concerns in content-based datasets is presented using Table 3 as the specified original dataset. This table is assumed to collect students’ comments directed toward Bob. To preserve privacy using k-Anonymity, the anonymity parameter is set to k = 2. Accordingly, a released version of Table 3 that satisfies the 2-Anonymity constraint is shown in Table 4. As observed from Table 4, each tuple is indistinguishable from at least k − 1 other tuples with respect to the quasi-identifier attributes. Under this condition, the dataset appears to provide adequate privacy protection and to satisfy the requirements of 2-Anonymity. However, despite conforming to the 2-Anonymity constraint, the released dataset still exhibits privacy violation concerns arising from its content-based attributes. These residual privacy concerns are demonstrated in Example 1, indicating that k-Anonymity alone is insufficient for guaranteeing privacy protection in content-based datasets.
Example 1 (Privacy violation issues in content-based datasets). Consider Table 4, which represents the 2-Anonymous release of the original content-based dataset shown in Table 3. Suppose Bob is a lecturer, and Alice and John are two of his students. One tuple in Table 4 corresponds to Alice’s opinion about Bob, and Alice is the target individual whose opinion Bob attempts to infer from the released dataset. Assume that Alice and John are involved in a conflict, and Bob has determined that Alice initiated the dispute. Furthermore, assume that Bob does not have conflicts with any other students. Under these circumstances, Bob can leverage this background knowledge to infer that the tuple in Table 4 is Alice’s opinion. This inference is possible because the content of the tuple is highly consistent with the known conflict among Alice, John, and Bob. This example demonstrates that, even though Table 4 satisfies the 2-Anonymity constraint, privacy violations could still occur due to semantic inference enabled by external background knowledge. Consequently, k-Anonymity alone is insufficient to prevent privacy breaches in content-based datasets.
To address the limitations of k-Anonymity and its extended variants, numerous privacy preservation models have been proposed to mitigate privacy violation concerns in content-based datasets, particularly within the domains of Natural Language Processing (NLP) [8,9,10,11] and Information Retrieval (IR) [12,13,14,15]. Representative examples of such privacy preservation models for content-based datasets are summarized in Table 5. These models have evolved from early privacy-aware query frameworks [16,17,18] and traditional data anonymization techniques [19,20,21,22], as well as TF-IDF-based text clustering approaches [23,24,25,26,27], to more advanced mechanisms such as homomorphic encryption [28,29,30,31], differential privacy [32,33,34,35], and federated learning frameworks [36,37,38]. These approaches enhance sensitivity detection by incorporating semantic relevance and contextual rarity within textual content.
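To make the notion of contextual rarity concrete, the following minimal sketch computes a plain TF-IDF score, tf × log(N / df), over a few tokenized opinions. The documents, the unsmoothed weighting variant, and the function name `tf_idf` are assumptions made for illustration; the cited approaches may use different weighting schemes.

```python
import math
from collections import Counter

# Hypothetical tokenized opinions; not taken from Table 3.
docs = [
    ["lecture", "was", "clear"],
    ["lecture", "was", "long"],
    ["lecture", "ignored", "my", "conflict", "with", "john"],
]


def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(docs)
# "lecture" occurs in every document, so its score is zero, while
# "conflict" occurs in only one document and therefore scores higher.
print(scores[2]["lecture"], scores[2]["conflict"])
```

Terms that are rare across the collection, such as a reference to a specific conflict, receive higher scores, which is the property that makes such weights a plausible signal for potentially identifying content.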
In addition, several paradigms have been pursued to address the privacy concerns described above, each adopting a different strategy for how the data are ultimately released. The first is anonymization-based publishing, which, as discussed above, includes k-Anonymity, l-Diversity, and t-Closeness; this paradigm releases a single sanitized table by generalizing or suppressing the original tuples, but the constraints are primarily defined over atomic quasi-identifier values and become difficult to enforce when sensitive information is expressed within free-form textual content. A second paradigm is the anatomy method [39,40,41], which releases the quasi-identifier and sensitive values as two separate tables linked by a group identifier so that individual-level correlations are obscured while aggregate-level analysis remains accurate; however, the split-table format breaks tuple-level readability and is therefore unsuited to downstream natural language processing or information retrieval analysis. A third paradigm operates through aggregate and query-answering frameworks, in which the data owner exposes only aggregated statistics, encrypted indices, or controlled query interfaces over the underlying tuples rather than releasing the dataset itself [16,17,18,28,29]; while this protects the raw tuples, it prevents third parties from inspecting or auditing the content of individual documents. A fourth paradigm is differential privacy and its variants [32,33,34,35], which inject calibrated random noise into query outputs or learned statistics; this provides rigorous probabilistic guarantees but tends to distort the semantic meaning and contextual integrity of textual content, making it less effective for content-based settings.
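As an illustration of the noise injection used in the fourth paradigm, the following minimal sketch applies the standard Laplace mechanism to a counting query. The privacy budget `epsilon`, the true count, and the function names are hypothetical, and the pseudo-random generator is used only to keep the sketch repeatable; it is not a production-grade implementation.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise; a counting query has sensitivity 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
samples = [noisy_count(100, epsilon=0.5, rng=rng) for _ in range(10000)]
print(sum(samples) / len(samples))  # averages close to the true count
```

The noise is well defined for numeric outputs such as counts, whereas there is no equally direct way to add it to a free-text opinion without altering its meaning, which is the limitation noted above.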
Table 5. Related work on privacy preservation for content-based datasets, organized by approach, key idea, and reference.
| Model | Approach | Key Idea | Ref. |
|---|---|---|---|
| Using TF-IDF to hide sensitive itemsets | SIF-IDF Algorithm and Greedy Approach | Sanitizes sensitive items while maintaining data utility using text mining concepts. | [42] |
| TF-IDF and KDC-based Privacy Search | KDC and TF-IDF over Encrypted Data | Enables secure, ranked multi-keyword search on cloud-stored encrypted data. | [43] |
| Differential privacy protection via SVD | SVD-based Disturbance and Differential Privacy | Protects high-dimensional network data by disturbing singular values and spectral vectors. | [44] |
| Privacy-Preserving Collaborative Clustering | Privacy-Preserving Distributed Clustering | Multi-party document clustering without revealing raw sensitive text to the server. | [45] |
| Automatic Anonymization of Textual Documents | Word Embedding-based Semantic Detection | Detects sensitive information using semantic similarity in word embeddings. | [46] |
In parallel with these data-centric models, a substantial body of recent work has emerged around AI-based privacy preservation, in which deep learning, large language models (LLMs), and federated architectures are used either as the protection mechanism itself or as the adversary that motivates new defenses. Representative directions include differentially private deep learning and fine-tuning of pre-trained language models, language model-based text anonymization, deep generative models for privacy-preserving synthetic data, and federated learning frameworks with differential privacy or homomorphic encryption. These directions, together with their representative models and references, are summarized in Table 6. Collectively, AI-based methods extend the privacy landscape from data-centric anonymization toward model-centric and learning-centric protections; however, their output is typically a privatized model or a synthetic dataset rather than a human-readable release of the original tuples, and they therefore do not address the problem of publishing real content-based datasets in a form that remains directly readable and analyzable by downstream users.
To address these limitations, this work proposes (d, c, l)-Privacy: Privacy Preservation Models for Content-Based Datasets Using Information Retrieval Techniques. The proposed model directly releases a sanitized version of the original content-based tuples under explicit structural constraints on equivalence-class size (d), re-identification confidence (c), and sensitive-term occurrence (l), and is supported by three algorithms (FCFS, Greedy, and Optimal) that trade off computational efficiency and data utility.
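To give a rough intuition for order-based equivalence-class construction, the following sketch groups tuples in arrival order into classes of at least d members. It is only an assumed reading of a first-come-first-serve strategy: the function name `fcfs_groups` and the tuple identifiers are hypothetical, the constraints on c and l are not modeled, and the algorithm defined in this work may differ.

```python
def fcfs_groups(tuples, d):
    """Partition tuples, in arrival order, into groups of at least d members.

    If fewer than d tuples are supplied, a single short group is returned.
    """
    if d < 1:
        raise ValueError("d must be at least 1")
    groups = [list(tuples[i:i + d]) for i in range(0, len(tuples), d)]
    # Fold a short trailing group into the previous one so that
    # every group reaches the minimum size d.
    if len(groups) > 1 and len(groups[-1]) < d:
        tail = groups.pop()
        groups[-1].extend(tail)
    return groups


records = ["t1", "t2", "t3", "t4", "t5"]  # hypothetical tuple identifiers
print(fcfs_groups(records, 2))
```

Because the grouping depends only on position, tuples with dissimilar values can land in the same class and then require heavy generalization, which is consistent with the utility loss attributed to order-based construction.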
The organization of this work is as follows. This section provides an introduction to this work. Section 2 introduces the proposed privacy preservation model, including the formal problem definition, term document measurement approaches, i.e., expert-based and mechanism-based measurements, data distortion techniques, and the (d, c, l)-Privacy model. Section 3 presents the experimental evaluation of the proposed approach under various settings. Finally, Section 4 and Section 5 summarize the conclusions and outline directions for future work, respectively.
4. Conclusions
This paper investigates (d, c, l)-Privacy as an effective privacy preservation model for releasing datasets containing sensitive information, particularly in content-based and text-rich domains. By jointly enforcing constraints on equivalence class size d, re-identification confidence c, and sensitive value diversity l, the (d, c, l)-Privacy model provides stronger protection against both identity disclosure and attribute inference compared with traditional privacy models. To evaluate its practical applicability, three equivalence-class construction algorithms, namely First Come First Serve (FCFS), Greedy, and Optimal, are developed using a combination of data generalization and data suppression. Experimental results demonstrate that all three algorithms successfully satisfy the (d, c, l)-Privacy constraints, thereby preventing privacy breaches in released datasets. However, the findings clearly indicate that the choice of equivalence-class construction strategy has a significant impact on data utility. The FCFS algorithm achieves the highest computational efficiency due to its order-based and low-complexity design, making it suitable for scenarios requiring rapid anonymization, but its strict reliance on tuple order often leads to substantial information loss. The Greedy algorithm offers a more balanced approach, preserving data semantics while maintaining acceptable computational costs, and consistently achieves higher data utility than FCFS. In contrast, the Optimal algorithm maximizes data utility by globally minimizing information loss, thereby establishing an upper bound on achievable utility, albeit at the expense of increased computational complexity. Overall, the experimental results confirm that (d, c, l)-Privacy is a robust and practical framework for privacy-preserving data publishing, capable of effectively balancing privacy protection and data utility in real-world data-sharing scenarios.
While the FCFS algorithm is well suited for time-critical applications, the Greedy and Optimal algorithms are more appropriate for contexts where preserving data utility is paramount. These findings underscore the importance of optimization-aware equivalence-class construction and demonstrate that (d, c, l)-Privacy can be effectively applied across diverse datasets to support secure and meaningful data release.