RDBAlert: An AI-Driven Automated Tool for Effective Identification of Victims’ Personal Information in Ransomware Data Breaches
Abstract
1. Introduction
1. detects and maintains links of ransomware-related DLSs in an autonomous and dynamic way over time;
2. allows users (individuals or organizations) to provide potentially stolen personal data, so that it
   - (a) collects data from the DLS repositories;
   - (b) automatically analyzes the associated documents by using AI-based tools to extract the personal information specified by the user.
2. Background
- Most of them are not specifically focused on ransomware-related leaks, which is a major issue because users whose information is exposed through this type of attack also have the right to know about it promptly.
- The information publicly exposed in DLSs is highly heterogeneous, encompassing a variety of formats such as text, PDFs, images, and more. As a result, personal data often remains inaccessible to most existing automated analysis tools.
- It features a flexible, modular architecture, allowing each module to be replaced or upgraded with potentially more effective solutions without affecting the overall service.
- Although some of the modules are specifically developed by the authors (e.g., the crawler and the Kibana-based visualization module), our solution leverages public AI-powered tools to analyze the variety of information typically found in ransomware-related DLSs. In particular, several well-known tools for optical character recognition (OCR), image analysis, and LLM-based text analysis are used to confirm the existence of personal data in these data leaks.
- This core functionality is enhanced by a web-based interface that allows users to input their personal data for searches within DLSs.
3. RDBAlert: A Novel Tool to Identify Personal Data in Ransomware Leaks
3.1. RDBAlert Architecture
- A crawler, to automatically navigate platforms that host data leaks, such as forums on the dark web or .onion sites, employing anonymization technologies such as Tor or distributed proxies to ensure secure and ethical access to these environments.
- An automated search and classification module, which is responsible for decompressing the retrieved files and applying structured inspection techniques to identify PII within documents, databases, and other formats contained within them.
- A flexible storage architecture and advanced query module, in charge of identifying sensitive data along with contextual metadata such as the original file name, its hierarchical location within the folder structure, the originating platform, and timestamps. All of them are integrated into a NoSQL database designed to support multidimensional queries.
- A reporting module, for structured and actionable insights, designed to transform query results from the database into structured, visually accessible, and technically detailed outputs.

3.2. RDBAlert Implementation
3.2.1. Crawler Module
3.2.2. Classification and Search Module
1. Plain Text Files: The analysis of PII in plain text files (e.g., .txt, .log) relies on regular expressions (regex) designed in accordance with international standards. For instance, email addresses are detected using patterns based on the RFC 2822 specification [37], which defines the syntactic structure of valid email formats. Each identified email address is normalized into a JSON record that includes contextual metadata, specifically a 100-character snippet before and after the match. This context allows the email’s usage within the text to be inferred (e.g., in forms or internal communications). To prevent redundancy in the database, a SHA-1 hash is generated for each record from both the content and its surrounding context, so that identical records are stored only once.
2. PDF Files: PDFs are classified into two categories based on their content structure:
   - Digitally generated PDFs (containing embedded text) are processed similarly to plain text files. Regex-based PII detection is applied, and the results are structured as JSON entities for standardized representation.
   - Scanned PDFs (either partially or entirely composed of images) require a computer vision-based workflow. When textual content is not directly accessible, the system leverages YOLO (“You Only Look Once” [38]; see paragraph “Multimodal Recognition Tools” below), a convolutional neural network (CNN) architecture optimized for real-time object detection. YOLO is used to identify visual elements such as faces, passports, or regional identity documents within the scanned content. Any PDF containing such elements is converted into individual image files (in PNG or JPEG format) and then forwarded for advanced processing.
3. Documents and Images: For file formats such as DOC, JPG, or PNG, the same approach used for scanned PDFs is applied. YOLO is employed to scan these files and detect three critical categories: faces, passports, and identification documents. Images that yield positive matches are tagged and prepared for the text extraction phase.
4. Databases and Structured Formats: For files such as CSV, XML, SQL, or JSON, the analysis follows a dual-strategy approach:
   - Header-based detection: If column names (e.g., ‘email’, ‘ID number’, ‘phone number’) indicate the presence of PII, the corresponding fields are automatically extracted.
   - Heuristic search: When headers are ambiguous or absent, the first 30 lines of the file are scanned using targeted regex patterns. Sampling the head of the file in this way enables the system to infer sensitive data patterns within unlabeled columns.
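
The plain-text and structured-format strategies described above can be illustrated with a minimal sketch. It is not the RDBAlert implementation: the email pattern is a simplified subset of the RFC 2822 address syntax, and the record keys, the function names `extract_email_records` and `find_pii_columns`, and the header set `PII_HEADERS` are hypothetical names chosen for illustration.

```python
import csv
import hashlib
import io
import re

# Simplified email pattern: a practical subset of RFC 2822, not the full grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
CONTEXT_CHARS = 100  # snippet length before and after each match
SAMPLE_LINES = 30    # leading lines scanned when headers are uninformative
PII_HEADERS = {"email", "id number", "phone number"}  # illustrative header names


def extract_email_records(text, source_file):
    """Return one JSON-serializable record per distinct (email, context) pair."""
    records = {}
    for match in EMAIL_RE.finditer(text):
        start, end = match.span()
        record = {
            "email": match.group(0).lower(),
            "context_before": text[max(0, start - CONTEXT_CHARS):start],
            "context_after": text[end:end + CONTEXT_CHARS],
            "source_file": source_file,
        }
        # Hash the content together with its context so repeated hits collapse.
        key = "\x1f".join(
            (record["email"], record["context_before"], record["context_after"])
        )
        record["sha1"] = hashlib.sha1(key.encode("utf-8")).hexdigest()
        records.setdefault(record["sha1"], record)
    return list(records.values())


def find_pii_columns(csv_text):
    """Header-based detection first; otherwise a regex scan of the first lines."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header = [cell.strip().lower() for cell in rows[0]]
    by_header = [i for i, name in enumerate(header) if name in PII_HEADERS]
    if by_header:
        return by_header
    hits = set()
    for row in rows[:SAMPLE_LINES]:
        for i, cell in enumerate(row):
            if EMAIL_RE.search(cell):
                hits.add(i)
    return sorted(hits)
```

In this sketch the hash covers only the email and its two context windows, so the same hit found in two copies of a file produces a single record; whether the source file should also enter the hash is a design choice the text leaves open.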
- Multimodal Recognition Tools
1. Hierarchical Detection Phase: YOLOv11 follows a two-level detection strategy. First, it identifies the entire document within the image using global bounding boxes. Then, it segments the critical subregions (such as the name field, identification number, and photograph) using normalized relative coordinates. This process is supported by a spatial reference system that preserves the geometric proportions of the document, regardless of its orientation or scale.
2. Advanced Text Extraction Phase: Each detected subregion is processed through MiniCPM, which applies the following:
   - (a) Spatial text alignment: By estimating homographies based on key points, perspective distortions inherent to documents captured at non-orthogonal angles are corrected. This transformation converts skewed regions into normalized frontal views, facilitating precise character recognition.
   - (b) Multimodal contextual recognition: MiniCPM combines visual embeddings (extracted via convolutional layers) with linguistic embeddings, enabling the resolution of ambiguities in deformed or partially occluded characters.
3. Rule-Based Syntactic Post-Processing: Extracted text undergoes structured validation, in which syntactic and semantic constraints are applied based on the document type. For instance:
   - For identification numbers (ID cards), the system verifies the expected length (8–10 digits), the presence of control letters (in alphanumeric systems), and consistency with regional prefixes.
   - For proper names, recognized strings are cross-checked against standardized lexical databases to discard OCR-generated artifacts.
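
As an illustration of the rule-based post-processing step, the following sketch checks the 8–10 digit length rule and, as one concrete example of a control-letter system, the Spanish DNI, whose letter is determined by the number modulo 23. The choice of the DNI and the function names are assumptions made for illustration; the text does not state which national schemes RDBAlert validates.

```python
import re

# Spanish DNI control letters: letter = DNI_LETTERS[number mod 23].
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"
NUMERIC_ID_RE = re.compile(r"\d{8,10}")  # length window quoted in the text
DNI_RE = re.compile(r"(\d{8})([A-Z])")
SEPARATORS_RE = re.compile(r"[\s.-]")    # spacing and punctuation OCR may leave


def is_plausible_numeric_id(value):
    """Accept purely numeric identifiers of 8 to 10 digits."""
    cleaned = SEPARATORS_RE.sub("", value)
    return NUMERIC_ID_RE.fullmatch(cleaned) is not None


def is_valid_dni(value):
    """Check the structure and the control letter of a Spanish DNI."""
    cleaned = SEPARATORS_RE.sub("", value).upper()
    match = DNI_RE.fullmatch(cleaned)
    if match is None:
        return False
    number, letter = match.groups()
    return DNI_LETTERS[int(number) % 23] == letter
```

A string that fails such checks is more likely an OCR artifact than a real identifier, which is why validation of this kind is a natural filter before indexing.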
3.2.3. Flexible Storage and Advanced Query Module
- Text fields: For raw PII (e.g., email: “john.doe@example.com”), analyzed using specialized tokenizers (e.g., domain-based segmentation for email addresses).
- Geospatial fields: For documents linked to locations (extracted from file metadata or textual references).
- Nested fields: To store 100-character contextual windows around each PII, allowing for searches based on surrounding phrases.
- Embedding vectors: Generated using Sentence-BERT [41] for textual PII, allowing semantic searches (e.g., matching misspelled names through similarity analysis).
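
The four field categories above map naturally onto Elasticsearch field types. The sketch below shows one possible index mapping as a Python dictionary. Only the field names `emails` and `domain` appear in this paper; `location`, `pii_context`, `pii_embedding`, and the 384-dimension vector size are assumptions, since the dimension depends on the Sentence-BERT model chosen.

```python
import json

EMBEDDING_DIMS = 384  # assumption: depends on the Sentence-BERT model used

# Hypothetical mapping covering the four field categories described above.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "emails": {"type": "text"},          # raw PII, analyzed text
            "domain": {"type": "keyword"},       # exact-match domain lookups
            "location": {"type": "geo_point"},   # geospatial field
            "pii_context": {                     # nested contextual windows
                "type": "nested",
                "properties": {
                    "pii_type": {"type": "keyword"},
                    "before": {"type": "text"},
                    "after": {"type": "text"},
                },
            },
            "pii_embedding": {"type": "dense_vector", "dims": EMBEDDING_DIMS},
        }
    }
}

# The serialized body is what would be sent when creating the index.
print(json.dumps(INDEX_MAPPING, indent=2))
```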
3.2.4. Reporting Module
- Visualization of spatio-temporal correlations: Heatmaps overlaid with geospatial layers, where PII instances are plotted based on coordinates extracted from metadata (e.g., physical addresses in documents).
- Multivariable statistical graphs: Temporal histograms showing data exposure trends, network diagrams linking compromised entities, and stacked bar charts breaking down PII types (e.g., email addresses vs. identification numbers).
- Advanced aggregations: Complex query processing, such as tracking the recurrence of a phone number across multiple leaks or calculating the percentage distribution of confidential documents by economic sector.
- Executive summaries, highlighting macro trends such as the total volume of exposed PII per region or affected entity.
- Detailed forensic analyses, incorporating screenshots of visualizations, contextual excerpts from original documents, and links to source files on the dark web.
- Mitigation recommendations, derived from identified patterns (e.g., correlations between PII types and recurrent attack vectors).
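
The recurrence aggregation mentioned above (tracking a phone number across multiple leaks) can be sketched in plain Python. In a deployment this would more likely be expressed as a database aggregation; the record keys `phone` and `leak` and the function name `recurring_values` are hypothetical.

```python
from collections import defaultdict


def recurring_values(records, field="phone", min_leaks=2):
    """Map each value of `field` to the sorted leaks it appears in,
    keeping only values seen in at least `min_leaks` distinct leaks."""
    leaks_by_value = defaultdict(set)
    for record in records:
        value = record.get(field)
        if value:
            leaks_by_value[value].add(record["leak"])
    return {
        value: sorted(leaks)
        for value, leaks in leaks_by_value.items()
        if len(leaks) >= min_leaks
    }


# Made-up records for illustration only.
sample = [
    {"phone": "+34 600 000 001", "leak": "leak-A"},
    {"phone": "+34 600 000 001", "leak": "leak-B"},
    {"phone": "+34 600 000 002", "leak": "leak-A"},
]
print(recurring_values(sample))
```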
3.3. RDBAlert’s Operation
- First, as illustrated in Figure 3, a dedicated web service (available at https://ransomdbalert.com) enables users to check whether their email address or other data points appear in the leaks indexed by the system. This service offers a user-friendly interface for performing quick and easy checks.
- Alternatively, users can conduct a more detailed analysis locally by downloading the relevant leak data and following the setup instructions available in the RDBAlert GitHub repository [42]. Once an Elasticsearch instance is running and the leaked data has been indexed, users can perform detailed queries directly within Elasticsearch to analyze and retrieve relevant information.
3.3.1. Ransomware Data Leak Monitoring
3.3.2. Data Leak Analysis
- To search for email addresses, phone numbers, names, or other information directly associated with an email, the emails field is used.
- For broader contextual information linked to an email, the email_context field is used.
- To search for data related to specific domains, the domain field is used.
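
A query over the three fields listed above can be assembled as an Elasticsearch bool query body. This is a minimal sketch: the helper `build_leak_query` is hypothetical, and it assumes that `emails` and `email_context` are analyzed text fields and that `domain` is indexed as a lowercase keyword, none of which is confirmed by the text.

```python
def build_leak_query(email=None, context_phrase=None, domain=None):
    """Combine the searchable fields into one Elasticsearch query body."""
    clauses = []
    if email:
        clauses.append({"match": {"emails": email}})
    if context_phrase:
        clauses.append({"match_phrase": {"email_context": context_phrase}})
    if domain:
        # Assumes `domain` is a keyword field holding lowercase values.
        clauses.append({"term": {"domain": domain.lower()}})
    if not clauses:
        raise ValueError("at least one search criterion is required")
    return {"query": {"bool": {"must": clauses}}}


query_body = build_leak_query(email="john.doe@example.com", domain="Example.com")
```

The resulting dictionary can be submitted as the request body of a search against the index that holds the leak data.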

4. Experimental Results
- Honeywell: In May 2023, this Fortune 100 company, which specializes in aerospace and energy equipment, experienced a data exfiltration incident involving 233.45 GB of data, including 22.2 GB of personally identifiable information (PII) [43].
- Pension Benefit Information (pbInfo): In May 2024, the Cl0p ransomware group targeted this US provider of population data management solutions, exfiltrating 33.21 GB of confidential information, including data on pension and insurance beneficiaries [44].
- Pioneer Electronics: In July 2023, the same group also exfiltrated 114.14 GB of data from Pioneer Electronics [45].
- Philippine Health Insurance (PhilHealth): In September 2023, the Philippine Health Insurance Corporation released a comprehensive report on a data breach incident that compromised the personal information of 42 million individuals [46]. This incident stands as one of the most significant cases of mass data exfiltration to date, with an unprecedented volume of personal records compromised.
- Targa Viasat Spain: In 2024, the Medusa ransomware group targeted Targa Viasat Spain, a company specializing in satellite communications and vehicle tracking solutions, exfiltrating 87.56 GB of sensitive data [47].
4.1. Initial Training Stage
1. Face Detection: For this task, the WIDER FACE dataset, available at http://shuoyang1213.me/WIDERFACE (accessed on 14 October 2014) was utilized. This dataset is a benchmark in the scientific community, comprising 32,203 images with 393,703 manually annotated faces. The dataset covers the following:
   - Variable scales (from close-up portraits to dense crowds).
   - Extreme lighting conditions (overexposure and shadows).
   - Partial occlusions (accessories and hair).

   Each annotation includes precisely adjusted bounding box coordinates, along with difficulty labels (easy, medium, hard), allowing the model to distinguish between straightforward and challenging cases during training.
2. Personal Identification Document Detection: The dataset used to train the detection of personal identification documents in RDBAlert was developed using an innovative approach based on real-world data extracted from historical ransomware leaks, ensuring its relevance to real-world scenarios. This corpus includes images of identity documents (such as ID cards, passports, and residence permits) sourced directly from previously processed leaks. To ensure diversity, the dataset covers variations in the following:
   - Capture quality: Ranging from high-resolution scans to photographs taken with mobile devices under suboptimal conditions (e.g., blurriness and reflections).
   - Regional formats: Documents issued in different countries, incorporating variations in design, color schemes, and security features (e.g., holograms and microtext).
   - Exposure contexts: Documents that are partially obscured, folded, or overlaid with other objects in the image.
1. Hierarchical Detection Phase: YOLOv11 executes a two-tiered detection strategy: initially, it identifies the entire document within the image using global bounding boxes; subsequently, it segments critical subregions (such as the name area, identification number, and photograph), employing normalized relative coordinates. This process is supported by a spatial reference system that preserves the document’s geometric proportions, irrespective of its orientation or scale.
2. Advanced Textual Extraction Phase: Each detected subregion is processed through MiniCPM, which applies the following:
   - Spatial Text Alignment: By estimating homographies based on key points, it corrects the perspective distortion inherent in documents captured at non-orthogonal angles. This transformation converts skewed regions into normalized frontal views, facilitating accurate character recognition.
   - Multimodal Contextual Recognition: MiniCPM integrates visual embeddings (extracted via convolutional layers) with linguistic embeddings, enabling the resolution of ambiguities in deformed or partially occluded characters.
3. Rule-Based Syntactic Postprocessing: The extracted texts undergo structured validation, wherein specific syntactic and semantic constraints are applied based on the document type. For instance:
   - For identification numbers (e.g., IDs), the system verifies the expected length (8–10 digits), the presence of control letters (in alphanumeric systems), and consistency with regional prefixes.
   - For proper names, recognized strings are cross-referenced with normalized lexical databases to eliminate OCR artifacts.
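
The normalized relative coordinates used in the hierarchical detection phase can be sketched as follows. The box layout `(x_min, y_min, x_max, y_max)` in pixels and the function name `to_relative_box` are assumptions for illustration. The sketch covers only translation and scale invariance; handling rotation or perspective would additionally require the homography-based alignment described in the extraction phase.

```python
def to_relative_box(document_box, region_box):
    """Express a subregion as fractions of the enclosing document box.

    Both boxes are (x_min, y_min, x_max, y_max) in image pixels.
    """
    dx0, dy0, dx1, dy1 = document_box
    width, height = dx1 - dx0, dy1 - dy0
    if width <= 0 or height <= 0:
        raise ValueError("document box must have positive width and height")
    rx0, ry0, rx1, ry1 = region_box
    return (
        (rx0 - dx0) / width,
        (ry0 - dy0) / height,
        (rx1 - dx0) / width,
        (ry1 - dy0) / height,
    )


# The same field on the same card photographed at two different scales.
small = to_relative_box((100, 50, 500, 300), (200, 100, 400, 150))
large = to_relative_box((200, 100, 1000, 600), (400, 200, 800, 300))
```

Because the coordinates are expressed relative to the document rather than to the image, `small` and `large` describe the same subregion even though the pixel boxes differ.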
4.2. Exfiltrated Data’s Analysis Results
4.2.1. Statistical Overview of Exfiltrated Data Across Case Studies
- In the Honeywell case, RDBAlert identified 1714 PDF files containing PII (payrolls, contracts, tax records) and 2542 databases with 165,343 internal emails and 24,893 external emails. The data includes employee numbers, demographic information (name, birth date, federal taxpayer registry), salary details (daily and monthly), and corporate information (email and company).
- Regarding pbInfo, we identified 3.9 GB of data related to PII, distributed across 783 databases, 702 PDF files, and 72 text documents. The automated analysis revealed the exposure of 79,554 full names, 8674 internal emails, 72,695 external emails, and 354 identification documents (ID cards/passports).
- In the Pioneer Electronics case, the tool accurately correlated the 1361 PDF files analyzed with the 1352 JSON files generated. Notably, 93% of the files contained exportable PII, highlighting the prevalence of clerical documents as a critical attack vector for data exfiltration.
- For the Philippine Health Insurance dataset, 3.35 TB of data was identified, including 58.9 GB containing PII. The automated analysis revealed 5098 text files, 7082 PDF documents, 3476 Word files, 8059 image files, and 9523 databases containing PII. The data exposure involved 51,083,051 full names, 4201 internal emails, 304,104 external emails, and 3729 identification documents (ID cards/passports). All of this highlights the critical vulnerability of healthcare systems, which continue to be prime targets for large-scale, coordinated cyberattacks.
- Finally, in the Viasat case, we identified 11.3 GB of PII distributed across 3173 databases, 1136 PDF files, and 98 text documents. The automated analysis revealed the exposure of 138,749 full names, 38,715 internal emails, and 327 identification documents (ID cards/passports). Additionally, the data included sensitive corporate information, such as client contracts, vehicle tracking logs, and financial records, underscoring the critical nature of operational data in such attacks.
| Company | Country | Leaked Data | PII | Files with PII: .txt | Files with PII: .pdf | Files with PII: .doc | Files with PII: Images | Files with PII: Databases | #Names | #Internal Emails | #External Emails | #ID Cards/Passports |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Honeywell | USA | 233.45 GB | 22.2 GB | 159 | 1714 | 13 | 145 | 2542 | 115,961 | 165,343 | 24,893 | 653 |
| pbInfo | USA | 33.21 GB | 3.9 GB | 72 | 702 | 36 | 23 | 783 | 79,554 | 8674 | 72,695 | 354 |
| Pioneer Electronics | Japan | 114.14 GB | 14.6 GB | 233 | 1361 | 84 | 106 | 1352 | 195,443 | 104,875 | 95,471 | 582 |
| PhilHealth | Philippines | 3.35 TB | 58.9 GB | 5098 | 7082 | 3476 | 8059 | 9523 | 51,083,051 | 4201 | 304,104 | 3729 |
| Viasat | Spain | 97.66 GB | 11.3 GB | 98 | 1136 | 141 | 63 | 3173 | 138,749 | 38,715 | 88,451 | 327 |
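
To put the sizes in the table in proportion, the share of each leak that contains PII can be computed directly from the two size columns. The snippet below is a small worked calculation using the table values; it assumes 1 TB = 1000 GB for the PhilHealth entry, and the names `CASES_GB` and `pii_share` are illustrative.

```python
# (leaked data in GB, PII in GB), copied from the table above.
CASES_GB = {
    "Honeywell": (233.45, 22.2),
    "pbInfo": (33.21, 3.9),
    "Pioneer Electronics": (114.14, 14.6),
    "PhilHealth": (3.35 * 1000, 58.9),  # assumption: 1 TB = 1000 GB
    "Viasat": (97.66, 11.3),
}


def pii_share(leaked_gb, pii_gb):
    """Percentage of the leaked volume that contains PII."""
    return 100.0 * pii_gb / leaked_gb


for company, (leaked, pii) in CASES_GB.items():
    print(f"{company}: {pii_share(leaked, pii):.1f}% of the leaked volume contains PII")
```

Under these figures, PII makes up roughly a tenth of the leaked volume in four of the cases and under 2% in the PhilHealth case, where the total volume is far larger.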
4.2.2. Correlation of Exfiltration Patterns with Specific Ransomware Groups
- Honeywell: Correlation with groups specializing in industrial espionage and double extortion (e.g., BlackCat/ALPHV). The exfiltration of 233.45 GB, including 2542 databases (see Figure 9) containing internal emails (82.44% of the total) and sensitive corporate documents (payrolls, contracts, tax information), points to a sophisticated group with an interest in trade secrets and corporate intelligence. Groups like BlackCat (ALPHV) are known for conducting thorough analyses of stolen data prior to encryption to enable more effective extortion. The geographical diversity of the identity documents (with significant percentages from India, China, and Mexico) suggests that the target was a multinational corporation with a global workforce, a common target for such actors (see Figure 10).
- pbInfo and Viasat: Correlation with mid-scale Ransomware-as-a-Service (RaaS) groups (e.g., Phobos and Snatch). These cases present smaller but highly specific data volumes (3.9 GB and 11.3 GB of PII, respectively). The attack on pbInfo (see Figure 11) shows a strong focus on the U.S. (93.46% of IDs) and a high proportion of internal communications (89.01% of emails). Conversely, the attack on Spain-based Viasat (see Figure 17) reveals a clear regional target (74.42% of Spanish IDs) that includes critical operational information (vehicle tracking logs and financial records). These patterns are consistent with RaaS groups that enable their affiliates to select lower-profile targets with valuable data, using standardized tools for exfiltration and subsequent blackmail.
- Pioneer Electronics: Correlation with pure extortion groups (e.g., Ragnar Locker and Babuk). The profile of this attack, with a substantial data volume (114.14 GB) and an almost exclusive focus on employees and operations within the United States (100% of identity documents), correlates with groups seeking direct financial impact and streamlined extortion (see Figure 13). The nature of the data (93% of files containing exportable PII, clerical documents) indicates a broad-spectrum exfiltration without a highly specialized search. Groups like Ragnar Locker have demonstrated a similar pattern, attacking large corporations to exfiltrate data and then threaten its publication to cause reputational and financial harm.
- PhilHealth: Correlation with high-impact ransomware groups (e.g., Clop and LockBit). The massive volume of exfiltrated data (3.35 TB, with 58.9 GB of PII), affecting over 51 million individuals, is characteristic of large-scale attacks against the healthcare sector, a primary target for groups like Clop and LockBit. The data composition, with an overwhelming majority of Philippine identification documents (92.83%) and the prevalence of external (see Figure 15) and personal email domains (e.g., yahoo.com at 48.92%; see Figure 14), indicates the exfiltration of patient and employee records on a national scale. This pattern aligns with these groups’ strategy of maximizing pressure and extortion payments by threatening to expose highly sensitive health data of a vast population.
4.2.3. Geographical Targeting Trends of Major Ransomware Groups
4.3. Computational Results
4.4. Discussion and Ethical Considerations
- Privacy concerns. Data may be “public”, but people rarely expect their aggregated records to be indexed, correlated, and queried at scale. Hence, availability does not mean consent. Moreover, aggregation and linking increase identifiability; that is, many datasets that are harmless in isolation can become sensitive when combined.
- Harm and discrimination. Exposed sensitive attributes like health, ethnicity, or criminal history can enable discrimination, doxxing, extortion, or reputational and financial harm. Moreover, automated profiling can produce biased or incorrect inferences leading to unfair decisions (credit, hiring, insurance).
- Misuse and dual-use. Legitimate uses (research and incident response) sit beside illegitimate ones (stalking, fraud, targeted harassment). Easy access lowers the barrier for abuse.
- Transparency and accountability. Subject individuals typically have no visibility into who queried their data or why. Lack of auditability erodes trust.
- Legal/compliance exposure. Jurisdictions differ (GDPR, CCPA, ePrivacy, sector rules). Indexing or republishing sensitive data can create legal liabilities even if data was public.
- Security risks. The tool itself becomes a high-value target. A breach of the tool or its logs multiplies harm by exposing query histories and aggregated views.
5. Conclusions and Future Work
- Conducting temporal analyses to map the evolution of a data breach, identifying critical time windows during which data was accessed or modified.
- Integrating geolocation techniques into indexed documents to associate data leaks with specific regions or infrastructures, and linking with external threat intelligence feeds (such as indicators of compromise or malicious file hashes) to enrich the analytical context.
- Data masking and end-to-end encryption, ensuring that identified PII is not exposed during queries or exports.
- Consideration of a security architecture for the system to strengthen confidence and ensure legal compliance.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
DURC Statement
References
- Sophos. The State of Ransomware 2024. Report, 2025. Available online: https://www.sophos.com/en-us/content/state-of-ransomware (accessed on 14 October 2025).
- Cisco Umbrella. From Trojan Takeovers to Ransomware Roulette. Cisco Cyber Threat Trends Report, 2024. Available online: https://umbrella.cisco.com/info/cyber-threat-trends-report?utm_medium=search-paid&utm_source=google&utm_campaign=UMB_EMEA_EU_EN_GS_Nonbrand_Security_T1&utm_content=DNS-FY24-Q4-Content-Ebook-Cyber-Threat-Trends-Report&_bt=712318013894&_bk=latest%20cybersecurity%20trends&_bm=p&_bn=g&_bg=158065449997&gad_source=1&gclid=Cj0KCQiA_NC9BhCkARIsABSnSTb55LcWHiMIvNpFjTWsYi9lii816iuEfAPYavGB3EXZL_U8nzlgEx4aAhMaEALw_wcB (accessed on 14 October 2025).
- Cyber Management Alliance. Top 10 Biggest Cyber Attacks of 2024 & 25 Other Attacks to Know About! Available online: https://www.cm-alliance.com/cybersecurity-blog/top-10-biggest-cyber-attacks-of-2024-25-other-attacks-to-know-about (accessed on 14 October 2025).
- Check Point. Ransomware Annual Report 2024. Report 2024. Available online: https://cyberint.com/blog/research/ransomware-annual-report-2024/#:~:text=In%202024%2C%20the%20ransomware%20landscape,the%20remainder%20of%20the%20year (accessed on 14 October 2025).
- Home Office. Ransomware Legislative Proposals: Reducing Payments to Cyber Criminals and Increasing Incident Reporting; Government Consultation, 2025. Available online: https://assets.publishing.service.gov.uk/media/67864097c6428e013188175a/Consultation-Document-Proposals-v2.pdf (accessed on 14 October 2025).
- Hassan, N.A. Ransomware Revealed. A Beginner’s Guide to Protecting and Recovering from Ransomware Attacks; Apress: New York, NY, USA, 2019; ISBN 978-1484242544.
- Aggarwal, M. Ransomware Attack: An Evolving Targeted Threat. In Proceedings of the 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 6–8 July 2023; pp. 1–7.
- MacColl, J.; Husch, P.; Mott, G.; Sullivan, J.; Nurse, J.R.C.; Turner, S.; Pattnaik, N. The Scourge of Ransomware. Victim Insights on Harms to Individuals, Organisations and Society; Royal United Services Institute: London, UK, 2024; Available online: https://www.rusi.org/explore-our-research/publications/occasional-papers/ransomware-victim-insights-harms-individuals-organisations-and-society (accessed on 14 October 2025).
- EU. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation). Available online: https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng (accessed on 14 October 2025).
- EDPB. Guidelines 9/2022 on Personal Data Breach Notification Under GDPR. Version 2.0, 2023. Available online: https://www.edpb.europa.eu/system/files/2023-04/edpb_guidelines_202209_personal_data_breach_notification_v2.0_en.pdf (accessed on 14 October 2025).
- State of California. California Consumer Privacy Act (CCPA). Available online: https://oag.ca.gov/privacy/ccpa (accessed on 14 October 2025).
- Government of Canada. Personal Information Protection and Electronic Documents Act (S.C. 2000, c. 5). Available online: https://laws-lois.justice.gc.ca/eng/acts/p-8.6 (accessed on 14 October 2025).
- Government of Japan. Act on the Protection of Personal Information (Act No. 57 of 2003). Available online: https://www.cas.go.jp/jp/seisaku/hourei/data/APPI.pdf (accessed on 14 October 2025).
- Government of Brazil. General Personal Data Protection Act (LGPD). Available online: https://lgpd-brazil.info (accessed on 14 October 2025).
- Roth, J. Data Exfiltration in Ransomware Attacks: Digital Forensics Primer for Lawyers; Kroll: New York, NY, USA, 2025; Available online: https://www.kroll.com/en/insights/publications/cyber/data-exfiltration-ransomware-attacks (accessed on 14 October 2025).
- Fuentes, M.; Hacquebord, F.; Hilt, S.; Kenefick, I.; Kropotov, V.; McArdle, R.; Mercês, F.; Sancho, D. Modern Ransomware’s Double Extortion Tactics and How to Protect Enterprises Against Them; Trend Micro, 2021. Available online: https://documents.trendmicro.com/assets/white_papers/wp-modern-ransomwares-double-extortion-tactics.pdf (accessed on 14 October 2025).
- Imperva. More Lessons Learned from Analyzing 100 Data Breaches; Whitepaper, 2022. Available online: https://www.imperva.com/resources/whitepapers/More-Lessons-Learned-from-Analyzing-100-Data-Breaches_WP.pdf (accessed on 14 October 2025).
- Arctic Wolf. Arctic Wolf 2025 Ransomware Report; Arctic Wolf Networks Inc.: Eden Prairie, MN, USA, 2025; Available online: https://cybersecurity.arcticwolf.com/2025-Threat-Report-v1.html (accessed on 14 October 2025).
- Price, A. Data-Leak Site Emergence Continues to Increase; Cyjax, London, UK, August 2024. Available online: https://www.cyjax.com/resources/blog/data-leak-site-emergence-continues-to-increase (accessed on 14 October 2025).
- Center for Internet Security. Ransomware: The Data Exfiltration and Double Extortion Trends; Part 3; Center for Internet Security: East Greenbush, NY, USA; Available online: https://www.cisecurity.org/insights/blog/ransomware-the-data-exfiltration-and-double-extortion-trends (accessed on 14 October 2025).
- Fisher, W.; Craft, R.E.; Ekstrom, M.; Sexton, J.; Sweetnam, J. Data Confidentiality: Detect, Respond to, and Recover from Data Breaches. NIST Special Publication 1800-29, February 2024. Available online: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1800-29.pdf (accessed on 14 October 2025).
- Wikipedia. List of Data Breaches. Available online: https://en.wikipedia.org/wiki/List_of_data_breaches (accessed on 14 October 2025).
- Hylender, C.D.; Langlois, P.; Pinto, A.; Widup, S. 2025 Data Breach Investigations Report. Verizon report, 2025. Available online: https://www.verizon.com/business/resources/reports/dbir (accessed on 14 October 2025).
- Breachsense. The Most Recent Data Breaches in 2025. Report, 2025. Available online: https://www.breachsense.com/breaches (accessed on 14 October 2025).
- Drapkin, A. Data Breaches That Have Happened in 2022, 2023, 2024 and 2025 so far. Tech.co report. Available online: https://tech.co/news/data-breaches-updated-list (accessed on 14 October 2025).
- Bonta, R. Search Data Security Breaches. Available online: https://oag.ca.gov/privacy/databreach/list (accessed on 24 February 2025).
- America’s Cyber Defense Agency. StopRansomware: BianLian Ransomware Group. Available online: https://www.cisa.gov/news-events/cybersecurity-advisories/aa23-136a (accessed on 14 October 2025).
- Rubin, K. Ransomware and Healthcare: Why Hackers Target the Industry and How to Combat Attacks. Available online: https://www.linkedin.com/pulse/ransomware-healthcare-why-hackers-target-industry-how-kevin-rubin--efzhc (accessed on 14 October 2025).
- HC3. New Threat Brief on Ransomware and Healthcare. Available online: https://dhinsights.org/news/new-threat-brief-on-ransomware-and-healthcare (accessed on 14 October 2025).
- Autoriteit Persoonsgegevens. Report Data Breaches 2023. Report April 2024. Available online: https://www.autoriteitpersoonsgegevens.nl/en/system/files?file=2024-10/Report%20data%20breaches%202023.pdf (accessed on 14 October 2025).
- Dalvi, A.; Kulkarni, P.; Kore, A.; Bhirud, S.G. Dark Web Crawling for Cybersecurity: Insights into Vulnerabilities and Ransomware Discussions. In Proceedings of the 2nd International Conference for Innovation in Technology (INOCON), Bangalore, India, 3–5 March 2023; pp. 1–6.
- Dalvi, A.; Bhirud, S. Dark web monitoring as an emerging cybersecurity strategy for businesses. Int. J. Inf. Eng. Electron. Bus. (IJIEEB) 2024, 16, 54–67.
- Recorded Future. Get Ahead of Present and Future Attacks with Recorded Future. Available online: https://www.recordedfuture.com (accessed on 14 October 2025).
- GitHub. Aria2-Onion-Downloader. Available online: https://github.com/sn0b4ll/aria2-onion-downloader (accessed on 14 October 2025).
- GitHub. Torget. Available online: https://github.com/pmorissette/torget (accessed on 14 October 2025).
- GitHub. Torboost. Available online: https://github.com/tasooshi/torboost (accessed on 14 October 2025).
- Network Working Group. Internet Message Format. RFC 2822. Available online: https://www.rfc-editor.org/rfc/rfc2822.html (accessed on 14 October 2025).
- Ultralytics. Ultralytics YOLO11. Available online: https://docs.ultralytics.com/models/yolo11 (accessed on 14 October 2025).
- Huang, D.; Yan, C.; Li, Q.; Peng, X. From Large Language Models to Large Multimodal Models: A Literature Review. Appl. Sci. 2024, 14, 5068.
- OpenBMB. MiniCPM: A Multimodal Large Language Model. 2024. Available online: https://github.com/OpenBMB/MiniCPM-o (accessed on 14 October 2025).
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992.
- GitHub. The Source Code of RansomDBAlert. Available online: https://github.com/juanmill4/RansomDBAlert/tree/main (accessed on 14 October 2025).
- Petkauskas, V. Honeywell Hack Exposed Nearly 120 K People; Cybernews, November 2023. Available online: https://cybernews.com/news/honeywell-breach-exposed-thousands (accessed on 14 October 2025).
- Petkauskas, V. Breach of Death Auditor PBI Exposes Details of 370,000 People; Cybernews, July 2023. Available online: https://cybernews.com/news/pbi-data-breach-moveit (accessed on 14 October 2025).
- Schappert, S. TomTom, Pioneer Electronics, Autozone Latest to Navigate MOVEit Attacks; Cybernews, November 2023. Available online: https://cybernews.com/news/tomtom-navigates-moveit-attacks-clop (accessed on 14 October 2025).
- Insurance Business. PhilHealth Hack Potentially Exposes 42 Million People. Available online: https://www.insurancebusinessmag.com/asia/news/cyber/philhealth-hack-potentially-exposes-42-million-people-496453.aspx (accessed on 14 October 2025).
- EuropaPress. Targa Viasat Suffers Cyberattack Compromising Nearly 100 GB of Financial and Personal Documents. Available online: https://www.europapress.es/motor/sector-00644/noticia-targa-viasat-sufre-ciberataque-compromete-casi-100-gb-documentos-financieros-personales-20240704185418.html (accessed on 14 October 2025).
- UIDAI. Unique Identification Authority of India. Available online: https://www.uidai.gov.in/en/about-uidai/unique-identification-authority-of-india.html (accessed on 14 October 2025).
- Cevallos-Salas, D.; Estrada-Jiménez, J.; Guamán, D.S.; Urquiza-Aguiar, L. Ransomware dynamics: Mitigating personal data exfiltration through the SCIRAS lens. Comput. Secur. 2025, 157, 104583.















| Platform | Description | Main Features |
|---|---|---|
| AmILeaked | Similar to “Have I Been Pwned” below, this service allows users to verify whether their email or password has appeared in a known data breach. | - Strengths: simple UI for quick checks; personal alerts/monitoring options. - Access: free checks + paid monitoring/business plans. - Caveat: smaller footprint than large commercial engines. |
| DeHashed | A search engine specialized in compromised datasets. It enables users to check if their personal information, such as emails or passwords, has been exposed in data breaches. | - Strengths: powerful filtering, programmatic API, monitoring/alerts; commonly used in incident response. - Access: freemium (limited free lookups) with paid tiers and API. - Caveat: paywall for full results and commercial use; always handle PII legally. |
| GhostProject | An online dataset that allows users to search for compromised passwords and data using email addresses or usernames. It helps determine if personal information has been exposed in data leaks. | - Strengths: very large record counts (advertises billions of records); quick credential/password lookup. - Access: web UI; access model varies (some features gated). - Caveat: the provenance and legality of some indexed dumps are unclear; treat outputs carefully. |
| Have I Been Pwned? | A free service that allows users to check whether their email address or phone number has been involved in a known security breach. It provides alerts and advice to protect their accounts. | - Strengths: authoritative, transparent methodology, free checks, enterprise/notification APIs, strong privacy-aware APIs (password API uses k-anonymity). - Access: free for casual queries; paid API/notify options for enterprise. - Caveat: HIBP only indexes breaches it has validated/ingested (not every private dump). |
| IntelX | A data intelligence and search platform that allows users to explore a wide variety of sources, including data breaches, WHOIS records, documents, and more. IntelX aids in digital investigations and in retrieving hard-to-find information. | - Strengths: selectors-based search (email, IP, Bitcoin, IPFS), historical archive, darknet coverage—useful for deep OSINT and historical tracing. - Access: free basic search; PRO features for darknet/history/advanced filters require subscription. - Caveat: the breadth is powerful but may surface outdated/duplicated items; requires experienced filtering. |
| Leak Lookup | A service that provides access to multiple leaked datasets to check if personal information has been compromised. It is useful for security investigations and identity protection. | - Strengths: large record counts, domain monitoring and business-focused subscriptions. - Access: account/login required; paid monitoring tiers. - Caveat: commercial service—depth and freshness depend on subscription level. |
| Ransomwhere | A platform that tracks and aggregates data on ransom payments related to ransomware attacks. It helps understand the financial impact of these attacks and promotes transparency in cybersecurity. | - Strengths: focused ransomware dataset, downloadable data and reporting options for researchers. - Access: web access; data export options on site. - Caveat: narrow scope (ransomware-related)—not a general credential search engine. |
| Recon-ng | A complete web reconnaissance framework that includes modules to search for information in public datasets and third-party services, useful for security professionals and pentesters. | - Strengths: modular, scriptable, integrates with APIs/services for enrichment; great for structured investigations and pivoting. - Access: open-source (run locally), requires modules/API keys for some external services. - Caveat: not a raw breach dump index—it aggregates OSINT and can query breach services when configured. |
| Scylla | A search engine for compromised information that enables users to search across multiple leaked datasets simultaneously. It offers an API for custom integrations. | - Strengths: username/social-profile pivoting, Shodan integration, geolocation features—useful for profile enumeration. - Access: open-source GitHub projects or custom installs. - Caveat: name ambiguity—be sure you mean the OSINT tool, not ScyllaDB. Handle results ethically. |
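
The k-anonymity lookup mentioned for Have I Been Pwned in the table above can be sketched as follows. This is a minimal illustration, not part of RDBAlert: it assumes the public Pwned Passwords range endpoint, which receives only the first five hexadecimal characters of a SHA-1 digest and returns `SUFFIX:COUNT` lines that are matched locally. The helper names (`split_hash`, `count_in_range`, `pwned_count`) and the `User-Agent` value are hypothetical.

```python
import hashlib
import urllib.request

# Public range endpoint of the Pwned Passwords service.
RANGE_URL = "https://api.pwnedpasswords.com/range/{prefix}"


def split_hash(password: str) -> tuple[str, str]:
    """Return (5-char prefix, 35-char suffix) of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(suffix: str, range_body: str) -> int:
    """Look up a suffix in a 'SUFFIX:COUNT' response body; return 0 if absent."""
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


def pwned_count(password: str) -> int:
    """Query the service; only the 5-character prefix leaves the machine."""
    prefix, suffix = split_hash(password)
    request = urllib.request.Request(
        RANGE_URL.format(prefix=prefix),
        headers={"User-Agent": "k-anonymity-sketch"},  # hypothetical client name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return count_in_range(suffix, response.read().decode("utf-8"))
```

Because the comparison against the returned suffixes happens on the client, the service never sees the full hash; `count_in_range` can also be exercised offline against a saved response body.
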

| Name | Description | Link |
|---|---|---|
| Breach house | An advanced ransomware site monitoring system that collects information about active groups, recent attacks, and extortion trends on the dark web. Its focus is to provide real-time data to researchers and cybersecurity professionals. | https://breach.house (accessed on 14 October 2025) |
| Ransomlook.io | A web-based ransomware tracker that facilitates the observation of ransomware group activity in real time. It enables researchers and analysts to gain insights into new leaks and attack patterns on the dark web. | https://www.ransomlook.io/recent (accessed on 14 October 2025) |
| Ransomware.live | A ransomware monitoring portal that tracks the activities of major extortion groups. It is constantly updated and allows visualization of trends in the growth of these groups and their victims. | https://github.com/JMousqueton/ransomware.live (accessed on 14 October 2025) |
| Ransomwatch | An open-source tool that tracks and archives posts from ransomware groups on the dark web. It monitors multiple known group sites and provides real-time alerts on new data leaks. | https://github.com/joshhighet/ransomwatch (accessed on 14 October 2025) |

| Name | Description | Link |
|---|---|---|
| ABBYY FineReader | A leading OCR software offering high text recognition accuracy. It supports a wide range of languages and document formats, allowing scanned PDFs to be converted into searchable and editable documents. | https://pdf.abbyy.com/es (accessed on 14 October 2025) |
| Amazon Textract | A machine learning service provided by AWS that automatically extracts text, forms, and tables from scanned documents and images. | https://aws.amazon.com/textract/?nc1=h_ls (accessed on 14 October 2025) |
| Detectron2 | An object detection and recognition system developed by Facebook AI Research. It is flexible and delivers high performance on segmentation and object detection tasks. | https://github.com/facebookresearch/detectron2 (accessed on 14 October 2025) |
| EasyOCR | An easy-to-use OCR library supporting over 80 languages. It is based on PyTorch and ideal for projects requiring fast implementation. | https://github.com/JaidedAI/EasyOCR (accessed on 14 October 2025) |
| Google Cloud Vision API | A cloud service offering OCR and image analysis capabilities. It can extract text from images and PDFs and is scalable for large volumes of data. | https://cloud.google.com/vision?hl=en (accessed on 14 October 2025) |
| H2O-VL Mississippi-2B | A vision–language model (VLM) optimized for text recognition and document-oriented visual question answering (VQA). It is designed for advanced multimodal reasoning over documents and images. | https://h2o.ai/platform/mississippi (accessed on 14 October 2025) |
| InternVL2-5-MPO | A powerful multimodal AI model with enhanced OCR and scene text recognition capabilities. It provides high accuracy in extracting structured and unstructured text from complex visual documents. | https://internvl.github.io/blog/2024-12-20-InternVL-2.5-MPO (accessed on 14 October 2025) |
| Keras OCR | Provides a set of tools to build OCR systems using Keras and TensorFlow. It includes pre-trained models and simplifies the training of custom models. | https://keras-ocr.readthedocs.io/en/latest (accessed on 14 October 2025) |
| LLaVA (Large Language and Vision Assistant) | A state-of-the-art model that integrates vision and language understanding, allowing for document interpretation, OCR, and VQA. It is widely used for AI-driven document processing and multimodal learning. | https://llava-vl.github.io (accessed on 14 October 2025) |
| Mediapipe (Google) | In addition to OCR, it offers advanced image analysis capabilities, including object detection, content tagging, facial recognition, and logo detection. | https://github.com/google-ai-edge/mediapipe (accessed on 14 October 2025) |
| Microsoft Azure Computer Vision | A unified service offering computer vision capabilities. It lets applications analyze images, read text, and detect faces through prebuilt image tagging, text extraction with optical character recognition (OCR), and responsible facial recognition. | https://azure.microsoft.com/en-us/products/ai-services/ai-vision (accessed on 14 October 2025) |
| MiniCPM | A large multimodal model (LMM) designed for document and image understanding. MiniCPM incorporates OCR capabilities and can perform complex reasoning over textual and visual data. | https://github.com/OpenBMB/MiniCPM-o (accessed on 14 October 2025) |
| MiniMonkey | A multimodal model trained for vision–language tasks, including OCR, document understanding, and scene text recognition. It integrates with various AI pipelines to extract and analyze information from images. | https://huggingface.co/mx262/MiniMonkey (accessed on 14 October 2025) |
| Tesseract OCR | An open-source OCR engine originally developed at Hewlett-Packard and later sponsored by Google. It supports over 100 languages, can be integrated into custom applications, and is well suited to developers. | https://github.com/tesseract-ocr/tesseract (accessed on 14 October 2025) |
| YOLO (You Only Look Once) | A real-time object detection algorithm that utilizes convolutional neural networks. It is highly efficient and used to identify and locate multiple objects within an image with high precision. | https://docs.ultralytics.com (accessed on 14 October 2025) |

| Name | Description | Link |
|---|---|---|
| Apache Cassandra | A high-performance, distributed NoSQL database designed for handling massive amounts of structured and semi-structured data across multiple nodes. It provides high availability, fault tolerance, and linear scalability. | https://cassandra.apache.org/_/index.html (accessed on 14 October 2025) |
| Apache HBase | A column-family NoSQL database built on top of Hadoop, designed for processing large amounts of sparse data. It is well-suited for analytical workloads and time-series data storage. | https://hbase.apache.org/ (accessed on 14 October 2025) |
| Elasticsearch | A distributed search and analytics engine optimized for full-text search, log analysis, and real-time data indexing. It enables scalable querying across large datasets and integrates well with the ELK stack (Elasticsearch, Logstash, Kibana). | https://www.elastic.co (accessed on 14 October 2025) |
| Neo4j | A graph database that specializes in handling highly connected data. It is widely used for applications that require complex relationship modeling, such as recommendation systems and fraud detection. | https://neo4j.com (accessed on 14 October 2025) |
| Redis | An in-memory key-value store known for its speed and efficiency. It is commonly used for caching, real-time data processing, and message queuing in high-performance applications. | https://redis.io (accessed on 14 October 2025) |

| Case Study | Key Patterns and Observed TTPs | Correlated Ransomware Group(s) |
|---|---|---|
| PhilHealth | Massive data volume; healthcare sector target; predominantly local PII; extensive use of personal email. | Clop/LockBit (specialized in high-impact, large-volume targets). |
| Honeywell | Sensitive corporate data (internal emails and finances); multinational target; geographically diverse PII. | BlackCat/ALPHV (focus on industrial espionage and double extortion). |
| Pioneer Electronics | Highly concentrated PII (U.S.); broad-spectrum exfiltration without a specific technical focus. | Ragnar Locker/Babuk (direct financial extortion and attacks on large corporations). |
| pbInfo/Viasat | Moderate data volume; specific regional or sectoral target; operational data and localized PII. | Phobos/Snatch (mid-scale RaaS for more focused attacks). |

| Group\Country | USA | France | Japan | Brazil | Canada | United Kingdom | Germany | Other |
|---|---|---|---|---|---|---|---|---|
| Clop | 5 | 1 | 1 | 0 | 0 | 0 | 0 | 2 |
| Akira | 33 | 0 | 0 | 4 | 3 | 2 | 2 | 18 |
| Play | 467 | 0 | 0 | 0 | 50 | 6 | 12 | 35 |
| Medusa | 110 | 0 | 0 | 0 | 17 | 11 | 2 | 51 |

| Data Lake | Process | Resources Involved |
|---|---|---|
| Honeywell (233.45 GB) | Complete pipeline | - Time: 4 h 12 m - CPU: average: 30%, range: 12–54% - RAM: average: 15 GB, peak: 29 GB |
| Honeywell (233.45 GB) | YOLO | - Time: 754,098 images/21 m (→598.49 img/s) - 12,915 images filtered - GPU: 80% - VRAM: 1 GB |
| Honeywell (233.45 GB) | MiniCPM | - Time: 12,915 images/1 h 12 m (→2.99 img/s) - 653 images filtered - GPU: 98% - VRAM: 21 GB |
| PhilHealth (3.35 TB) | Complete pipeline | - Time: 7 h 39 m - CPU: average: 40%, range: 12–63% - RAM: average: 20 GB, peak: 29.5 GB |
| PhilHealth (3.35 TB) | YOLO | - Time: 2,084,578 images/1 h 2 m (→554.32 img/s) - 35,478 images filtered - GPU: 80% - VRAM: 1 GB |
| PhilHealth (3.35 TB) | MiniCPM | - Time: 35,478 images/2 h 49 m (→3.49 img/s) - 3729 images filtered - GPU: 98% - VRAM: 21 GB |

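
The throughput figures in the resource table above follow from dividing the number of images by the elapsed time. The sketch below recomputes them for the Honeywell data lake and also derives the share of images each stage forwards to the next one; the `Stage` class and its field names are hypothetical and only serve this illustration. Durations are taken as reported, rounded to whole minutes, which presumably explains why recomputing the PhilHealth YOLO rate from "1 h 2 m" does not reproduce the reported value exactly.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    images_in: int     # images received by the stage
    images_kept: int   # images "filtered", i.e. passed on to the next stage
    minutes: float     # wall-clock duration as reported in the table

    @property
    def throughput(self) -> float:
        """Images processed per second."""
        return self.images_in / (self.minutes * 60)

    @property
    def pass_rate(self) -> float:
        """Fraction of input images forwarded to the next stage."""
        return self.images_kept / self.images_in


# Honeywell figures copied from the table: YOLO pre-filter, then MiniCPM.
honeywell = [
    Stage("YOLO", 754_098, 12_915, 21),
    Stage("MiniCPM", 12_915, 653, 72),
]

for stage in honeywell:
    print(f"{stage.name}: {stage.throughput:.2f} img/s, "
          f"{stage.pass_rate:.2%} of images kept")
```

The two rates differ by roughly two orders of magnitude, which is consistent with the table's design of running the lightweight YOLO detector first so that the multimodal model only sees a small fraction of the images.
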

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Tejada-Triviño, J.M.; Castillo-Fernández, E.; García-Teodoro, P.; Gómez-Hernández, J.A. RDBAlert: An AI-Driven Automated Tool for Effective Identification of Victims’ Personal Information in Ransomware Data Breaches. Electronics 2025, 14, 4327. https://doi.org/10.3390/electronics14214327

