4. Discussion
The Masketeer algorithm represents an efficient tool that pseudonymizes unstructured free text written in colloquial German by authors of different professional backgrounds in a medical context, including heavy use of abbreviations and nicknames as well as frequent typing errors. The evaluation of the algorithm yielded high performance (Table 8), outperforming both an earlier version (1.0) and a third-party tool (spaCy). The low sensitivity observed for spaCy can likely be attributed to the fact that its model was trained on texts from a foreign domain (news articles) and was therefore not adapted to the nature of clinical notes. Salutations were the most important factor (see
Table 7), and spaCy was not optimized for their use in colloquial medical language. However, specificity was less affected in the spaCy algorithm, which even showed a better false-negative rate than version 1.0 of
Masketeer. The most common sources of false negatives in the current version (Masketeer 2.0) were extremely rare names that were not in any dictionary and occurred without a salutation, and typing mistakes in names, which were consequently not found by any Masker. Naturally, manually written text, especially under the conditions in which our medical texts were written (e.g., time constraints, transcription of verbally transmitted names, diverse educational backgrounds, diverging first languages of authors, and no implemented grammar/spelling checks), is imperfect, and thus algorithms are unlikely to mask all identifiable PHI perfectly. In fact, assessing correct de-identification was, in rare cases, challenging even for human observers during the evaluation. Therefore, setting a general performance threshold to determine whether a pseudonymization algorithm is sufficient is infeasible and depends on the underlying problem. For our purposes, Masketeer’s performance was satisfactory.
De-identifying the clinical notes opens up possibilities for secondary analyses of the texts. For example, the de-identified note texts could be processed without privacy concerns by NLP and AI experts to extract information valuable for developing predictive models that anticipate major adverse events from text data. Furthermore, de-identifying the notes could enable the application of popular, publicly available, and powerful LLMs (e.g., ChatGPT, Gemini), which cannot be used on notes containing personal data due to privacy regulations. These LLMs could, for example, derive patient summaries from the notes for the time-efficient rotation of healthcare personnel, ultimately leaving more time for patient care. LLMs could also extract information from the notes to fill gaps in electronic medical records (e.g., a missing medication list) or extract and store in a structured form data that is often exchanged only as free text (e.g., laboratory results).
In general, two main approaches to removing identifiable text from medical free text can be found in the literature. The first is to apply regular expressions and other handcrafted rules to remove references, which requires manual effort and is unlikely to transfer to different contexts. The second is to use machine learning and train generalized models that remove references based on their own decision-making. However, this approach is often challenging to execute because it requires a large database of labeled medical text data, which is difficult and time-consuming to compile. The recent publications of two open-source German medical text databases (CARDIO:DE [27] and GGPONC 2.0 [28]) are a crucial step forward for model development and also for model evaluation. With more databases like these, a gold-standard AI model for German medical free-text de-identification could emerge in the future.
In the current absence of such a model, the Masketeer algorithm constitutes an example of how handcrafted rules—albeit highly time-consuming—and domain expert knowledge can be used to remove identifiable data effectively and efficiently from unstructured German medical texts.
Although the algorithm was fine-tuned for the application on “HerzMobil” clinical notes, it was designed to remain flexible in other scenarios. For example, the algorithm can be configured with any kind of specialized name dictionary, as sketched below. Although the manual additions and handcrafted exceptions (e.g., a blacklist of names, medical sites added by hand) were time-consuming, they ensured that the Masketeer algorithm could handle colloquial language, nicknames, and non-standard abbreviations correctly.
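To illustrate this kind of configuration, a minimal sketch is given below. It assumes a hypothetical configuration object; the class, field, and file names (e.g., PseudonymizerConfig, name_blacklist) are illustrative and do not reflect Masketeer’s actual API.

from dataclasses import dataclass, field
from pathlib import Path


def load_wordlist(path: Path) -> set[str]:
    """Read one dictionary entry per line, ignoring blanks and comments."""
    return {
        line.strip()
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip() and not line.startswith("#")
    }


@dataclass
class PseudonymizerConfig:
    name_dictionaries: list[set[str]] = field(default_factory=list)
    name_blacklist: set[str] = field(default_factory=set)  # words that look like names but are not
    medical_sites: set[str] = field(default_factory=set)   # hand-curated site additions


# File paths are placeholders for the specialized dictionaries described above.
config = PseudonymizerConfig(
    name_dictionaries=[
        load_wordlist(Path("dicts/first_names_de.txt")),
        load_wordlist(Path("dicts/last_names_regional.txt")),
    ],
    name_blacklist=load_wordlist(Path("dicts/name_blacklist.txt")),
    medical_sites=load_wordlist(Path("dicts/medical_sites.txt")),
)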
Developing the de-identification logic around a masking ensemble had a range of advantages from a software design point of view.
Efficiency: The Masketeer algorithm called the individual Masker subclasses in the order displayed in Table 3. Because processing stopped as soon as the first class voted for token removal, subsequent Maskers could be skipped, saving computational resources. The results depicted in Figure 3 confirm a linear runtime complexity in terms of the number of tokens and corpus size.
Testability: Splitting the logic into multiple smaller units allowed for more convenient development. Debugging individual errors was significantly easier because the logic checks were compartmentalized; it is easier to debug five small algorithms with four logic checks each than one large algorithm with twenty. Furthermore, this separation simplified test writing because individual unit tests could be written for each Masker class.
Scalability: For similar reasons, the ensemble made Masketeer easier to scale. If new de-identification rules or logic were developed, they could be inserted into the ensemble independently, without the risk of breaking the logic of other Maskers. Overall, the ensemble made complex logic checks clearer and more manageable.
However, as a consequence of using multiple Maskers, the ensemble’s calling order mattered. After experimentation, the order shown in Table 3 was found to work best but was not flawless in all cases.
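The sketch below condenses this ensemble pattern: Masker subclasses are queried in a fixed order, and evaluation stops at the first positive vote, so later Maskers are skipped. The Masker interface, the example rules, and the placeholder label are simplified assumptions and do not mirror Masketeer’s actual implementation.

from abc import ABC, abstractmethod


class Masker(ABC):
    """One compartmentalized masking rule that can be unit-tested in isolation."""

    @abstractmethod
    def vote(self, token: str, context: list[str]) -> str | None:
        """Return an entity label (e.g., 'PERSON') if the token should be masked."""


class SalutationMasker(Masker):
    SALUTATIONS = {"hr", "hr.", "herr", "fr", "fr.", "frau", "dr", "dr."}

    def vote(self, token: str, context: list[str]) -> str | None:
        # Mask capitalized tokens that directly follow a salutation.
        if context and context[-1].lower() in self.SALUTATIONS and token[:1].isupper():
            return "PERSON"
        return None


class DictionaryMasker(Masker):
    def __init__(self, names: set[str]):
        self.names = names

    def vote(self, token: str, context: list[str]) -> str | None:
        return "PERSON" if token.lower() in self.names else None


def mask_token(token: str, context: list[str], ensemble: list[Masker]) -> str:
    # Maskers are queried in a fixed order; the first positive vote wins,
    # so later (often more expensive) Maskers are skipped entirely.
    for masker in ensemble:
        label = masker.vote(token, context)
        if label is not None:
            return f"[{label}]"
    return token


ensemble = [SalutationMasker(), DictionaryMasker({"maier", "huber"})]
tokens = "Frau Maier ruft morgen an".split()
print(" ".join(mask_token(t, tokens[:i], ensemble) for i, t in enumerate(tokens)))
# -> Frau [PERSON] ruft morgen an

Because each Masker is a small, self-contained class, every rule can be covered by its own unit tests, which reflects the testability and scalability arguments above.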
In most cases, increasing patient privacy comes at the cost of reducing the utility of the data. Text pseudonymization is also subject to this dilemma, as redacting elements from the text is equivalent to removing information. A previous study based on the same corpus as the one used in the present study found that pseudonymization impacts classification performance [25]. Therefore, it is critical to strike a balance between patient privacy and data utility. This was also considered during Masketeer’s development, for example by applying pseudonymization instead of anonymization, although the latter would offer even higher levels of privacy. In the same spirit, corpus-wide pseudonyms allowed readers to follow communication pathways across multiple notes even in pseudonymized form. The same consideration applies to NER, as the differentiation between HCP-, patient-, and person-specific pseudonyms also technically reduces privacy. The texts included information about medical conditions, procedures, and hospital admissions with corresponding dates and medication lists. Such information could potentially be used to re-identify patients, especially in rural areas with low population densities. However, such details are crucial for HCPs to make informed decisions in primary use. Therefore, the Masketeer algorithm intentionally keeps this information, albeit at the cost of privacy. Norgeot et al. opted for a similar rationale in their Philter algorithm [9]. However, to at least partially address this issue in Masketeer, geographical references to medical sites and doctor’s offices are removed, limiting the risk of re-identification.
On the other hand, implementing the FullName Masker and the DoubleMasker improved privacy at the cost of a small number of false positives (FP rate = 0.067). For example, in the phrase “am Nachmittag macht Frau Maier Spaziergang” (meaning “during the afternoon, Mrs. Maier goes for walk”, including a missing article before “walk”), the word “Spaziergang” (“walk”) was removed by the FullName Masker, which wrongly interpreted “Spaziergang” as a name because it is a capitalized word following a name. Such cases were rare and mostly occurred in notes containing typing errors.
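To illustrate how this false positive arises, the sketch below implements a simplified capitalized-word-after-a-name heuristic; it is an assumption for illustration, not Masketeer’s actual FullName Masker.

import re

KNOWN_SURNAMES = {"maier"}


def mask_after_name(text: str) -> str:
    tokens = text.split()
    out, mask_next = [], False
    for tok in tokens:
        word = re.sub(r"\W", "", tok)
        if mask_next and word[:1].isupper():
            # Any capitalized word directly after a recognized name is treated
            # as a potential second name part -- the source of the FP below.
            out.append("[NAME]")
            mask_next = False
            continue
        mask_next = word.lower() in KNOWN_SURNAMES
        out.append("[NAME]" if mask_next else tok)
    return " ".join(out)


print(mask_after_name("am Nachmittag macht Frau Maier Spaziergang"))
# -> am Nachmittag macht Frau [NAME] [NAME]   ("Spaziergang" wrongly masked)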
Although the tool’s performance was satisfactory for our use case, opportunities for further research present themselves. Future work might include the compilation of additional publicly available sources for HCP names by web scraping, which would improve Masketeer in other contexts out of the box. The name dictionaries could also be cross-referenced with lists of syndromes named after people (e.g., Marfan syndrome, Austin–Flint syndrome, Dressler syndrome). Currently, such terms would be removed if the eponym occurs in any dictionary, which could be addressed by extending the name whitelist described in Section 2.2.4 (NameDictionary Masker). Furthermore, the masking ensemble’s voting logic could be improved at the cost of execution speed: by querying all Maskers instead of stopping at the first one to vote, a decision could be made based on which Masker fits the NER best, eliminating the influence of the voting order. Analyses concerning the effect of this approach on runtime and pseudonymization performance are pending.
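As a sketch of what such full-query voting could look like, the snippet below collects all Masker votes and resolves them by vote count and label specificity; the label set, priority ranking, and function names are illustrative assumptions that have not been evaluated.

from collections import Counter

# More specific labels outrank the generic fallback.
LABEL_PRIORITY = {"HCP": 3, "PATIENT": 2, "PERSON": 1}


def resolve_votes(votes: list[str | None]) -> str | None:
    """Pick the best-fitting NER label from all Masker votes."""
    cast = [v for v in votes if v is not None]
    if not cast:
        return None
    counts = Counter(cast)
    # Prefer the label with the most votes; break ties by specificity.
    return max(counts, key=lambda label: (counts[label], LABEL_PRIORITY.get(label, 0)))


# Example: three Maskers vote PERSON, PATIENT, and None for the same token.
print(resolve_votes(["PERSON", "PATIENT", None]))
# -> PATIENT (tie broken in favour of the more specific label)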
Using LLMs not only to interpret but also to de-identify medical free text was successfully demonstrated in a recent study in 2023 [29]. Engineering prompts for an LLM to de-identify our corpus and comparing the results to Masketeer’s performance is also considered a matter for future research. However, LLMs have been shown to leak private information from their training sets [30,31], which must be considered to protect patient PHI.
Limitations
While the Masketeer algorithm can be initialized with different name dictionaries, most rules (e.g., regular expressions, manual corrections) were designed according to local and context-specific conditions. This required the developers to be familiar with the entire “HerzMobil” DMP, which was time-consuming, and the context-specific rules would require adaptation before the Masketeer algorithm could be applied in different contexts. Furthermore, applying the algorithm to other languages would require additional adaptations.
Masketeer uses a list of salutations to recognize names and a list of common abbreviations to avoid accidental sentence breaks when encountering a full stop (i.e., “.”). As seen in Table 7, the Salutation Masker was the most active masking logic. Therefore, applying the algorithm to a new language requires compiling a salutation list and adapting the logic of how salutations are used in the respective language. Furthermore, the regular expression rules (e.g., for addresses and phone numbers) would have to be changed to follow local conventions. The complexity of these changes increases with linguistic distance from German, meaning that adapting the algorithm to other Germanic languages (e.g., English, Swedish) is significantly easier than adapting it to more distant ones (e.g., Mandarin or Japanese). The Masketeer algorithm is also currently not suited for other writing systems (e.g., Cyrillic letters, Chinese logograms). Tailoring an algorithm to the local context and geographical customs is not unusual; examples found in the literature were similarly fine-tuned to local specifics (e.g., masking small towns (<2000 inhabitants) or whitelisting common terms that can occur as names, such as “Field” or “May” in English) [20].
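The sketch below illustrates these two language-dependent resources, a salutation pattern for name recognition and an abbreviation list that prevents sentence breaks at abbreviations; the concrete entries and the regular expression are simplified assumptions for German and not Masketeer’s actual rule set.

import re

SALUTATIONS = r"(?:Hr\.|Herr|Fr\.|Frau|Dr\.|Prof\.|OA|DGKP)"
ABBREVIATIONS = {"pat.", "z.b.", "bzw.", "ca.", "evtl.", "hr.", "fr.", "dr."}

# Capitalized (possibly hyphenated) word directly following a salutation.
NAME_AFTER_SALUTATION = re.compile(SALUTATIONS + r"\s+([A-ZÄÖÜ][\w\-]+)")


def split_sentences(text: str) -> list[str]:
    """Naive splitter that does not break after known abbreviations."""
    parts, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            parts.append(" ".join(current))
            current = []
    if current:
        parts.append(" ".join(current))
    return parts


note = "Pat. hat ca. 2kg zugenommen. Fr. Huber wurde informiert."
print(NAME_AFTER_SALUTATION.findall(note))  # -> ['Huber']
print(split_sentences(note))
# -> ['Pat. hat ca. 2kg zugenommen.', 'Fr. Huber wurde informiert.']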
During the pseudonymization of a corpus, Masketeer compiles a linkage table between individuals and pseudonyms to ensure coherent, corpus-wide pseudonymization. Since persistently storing such a reference table would pose a risk of re-identification, the table is discarded after completion. Consequently, whenever new notes are added, the algorithm must pseudonymize the entire corpus anew to restore consistent pseudonymization throughout the corpus. To address this, development also focused on improving runtime performance, and new notes are typically added in batches to reduce the frequency of pseudonymization runs over the entire corpus.
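A minimal sketch of such a transient linkage table is shown below; the class, pseudonym format, and entity types are illustrative assumptions rather than Masketeer’s actual implementation.

from itertools import count


class LinkageTable:
    """Maps (entity type, real name) to a corpus-wide pseudonym for one run."""

    def __init__(self):
        self._table: dict[tuple[str, str], str] = {}
        self._counters = {"HCP": count(1), "PATIENT": count(1), "PERSON": count(1)}

    def pseudonym(self, entity_type: str, name: str) -> str:
        key = (entity_type, name.lower())
        if key not in self._table:
            self._table[key] = f"[{entity_type}_{next(self._counters[entity_type])}]"
        return self._table[key]


table = LinkageTable()  # exists only for the duration of one corpus run
print(table.pseudonym("HCP", "Huber"))      # -> [HCP_1]
print(table.pseudonym("PATIENT", "Maier"))  # -> [PATIENT_1]
print(table.pseudonym("HCP", "huber"))      # -> [HCP_1]  (same person, same pseudonym)
del table  # the mapping is never persisted; new notes trigger a full re-run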
The evaluation did not consider individual entity types. Therefore, although Table 8 provides insights into the overall de-identification capabilities, no comparison of performance across different entity types has been conducted so far.
Since the performance evaluation was a laborious and time-consuming task, only a small subsample (n = 200) was selected and annotated for assessment, representing 0.6% of the entire corpus. Although the sample was stratified for pseudonymization rate, it was not stratified for note length. A larger evaluation sample including stratification for note length might provide a more comprehensive performance assessment.
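As an illustration of what such a two-way stratification (pseudonymization rate by note length) could look like, a sketch is given below; the binning thresholds, field names, and sample sizes are illustrative assumptions.

import random
from collections import defaultdict


def stratified_sample(notes, n_per_stratum=10, seed=42):
    """notes: iterable of dicts with 'text' and 'pseudonymization_rate' keys."""
    strata = defaultdict(list)
    for note in notes:
        rate_bin = "high" if note["pseudonymization_rate"] > 0.05 else "low"
        length_bin = "long" if len(note["text"].split()) > 50 else "short"
        strata[(rate_bin, length_bin)].append(note)
    rng = random.Random(seed)
    return [
        note
        for group in strata.values()
        for note in rng.sample(group, min(n_per_stratum, len(group)))
    ]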
5. Conclusions
Medical free-text data can hold critical information about patients and diseases from which AI applications could benefit. However, due to regulatory and ethical considerations to protect patient privacy, de-identification is required, which is challenging due to the unstructured format in which medical text can be stored. Additionally, as in our case, manually written text can include informal language, typing errors, and abbreviations and can be written by authors of diverse educational backgrounds. In this paper, an ensemble-based de-identification tool was presented that achieved high performance on such texts written in German, for which examples in the literature are comparatively sparse.
Interest in medical free text is on the rise, especially considering the widespread availability of powerful LLM applications. To fully utilize their capabilities while protecting patient privacy and complying with regulatory frameworks, tools like the presented Masketeer algorithm are required. The presented methods and results aim to fill gaps in the literature concerning the application of NLP to German medical texts and their peculiarities (e.g., umlauts, salutations, hyphenated names). Furthermore, by presenting the advantages of an ensemble-based algorithm and analyzing its performance, the study also intends to provide practical implementation ideas for other researchers and engineers addressing similar issues, and the results could serve as a baseline for performance comparisons. Lastly, the ever-present dilemma of privacy versus utility was addressed by including an NER system that could further serve as inspiration for best practices in the de-identification of medical free text.
The answer to the question of whether rule-based systems or ML-based tools will prevail as the de-identification standard remains open. Successful examples of both are found in the literature. However, research interest in LLMs is high for a reason: their native ability to process human language makes them more flexible than handcrafted rules, which will always be at risk of missing edge cases. Ultimately, in many situations, the choice of technique likely comes down to the type of application. If transparency (e.g., in certified applications) and performance (e.g., on massive datasets) are key, rule-based systems might still outperform LLMs. Improvements in explainability and efficiency are certainly valuable directions for future LLM research. Furthermore, measures to prevent LLMs from leaking private training data into their answers should be explored more deeply to make them more feasible for medical texts.