Theft Address Extraction and Classification from Chinese Judicial Documents Based on Large Language Model
Round 1
Reviewer 1 Report
Comments and Suggestions for AuthorsBased on the issue of sparse, inconsistent, and incomplete address information in judicial documents, this paper develops the CAEC_LLM. The research findings are quite interesting, enabling precise extraction and classification of addresses. The paper is well-structured, detailed in presentation, and highly complete. This is a well-written article. However, it should be noted that the description in Section 5, "Experimental Analysis," does not correspond to the figures. Figures 6 and 7 seem to be swapped—please ask the author to double-check this. There are no further suggestions beyond this.
Author Response
Dear Reviewer:
Thank you for your thoughtful review and the constructive feedback provided on our manuscript. We appreciate the time effort you dedicated to helping us improve this work. We have carefully addressed each of your suggestions, and we believe these revisions have significantly strengthened the paper.
Detailed responses to each comment are provided below. We hope the revised manuscript now meets the standards for publication and look forward to your further assessment.
Reviewer 1, Comment Q1:
However, it should be noted that the description in Section 5, "Experimental Analysis," does not correspond to the figures. Figures 6 and 7 seem to be swapped。
Response to Reviewer 1, Comment Q1:
There was an inconsistency where the textual descriptions of the results for Figures 6 and 7 in Section 5, "Experimental Analysis," were inadvertently swapped, while the figures themselves were in the correct order. We have now carefully revised the text in Section 5 to ensure that all descriptions accurately correspond to their respective figures (Figures 6 and 7). The corrections have been made in the revised manuscript (please see lines 477-512;519-560 in the revised text). At the same time, we have highlighted the revised sections in the manuscript. We appreciate the reviewer's attention to this issue, which has helped improve the clarity of our presentation.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for AuthorsThis study leverages LLM to extract and classify theft addresses in text based on spatial scales. It can make practical contributions to data collection in the spatial analysis of crime. For this manuscript, many clarifications and explanations are needed to improve the interpretation of the study design, workflow, and results. See my comments below.
Section 2.1. From lines 149 to 160, the authors introduced multiple studies that applied LLMs to analyze judicial documents. These studies have made methodological improvements, but what research gaps still exist? In the entire section 2.1, the authors summarized multiple research gaps from previous studies in each paragraph except for the last one. It is unclear which research gaps, among all those mentioned, can be specifically addressed by this study. Please add more explanations at the end of section 2.1.
A paper named “An LLM driven dataset on the spatiotemporal distributions of street and neighborhood crime in China” is closely related to this study. Authors can relate that paper to your work, summarize their methodological advances in extracting crime events from China’s court decision documents using LLM, and explain what improvements your study made compared to that paper.
Zhang, Y., Kwan, MP. & Fang, L. An LLM driven dataset on the spatiotemporal distributions of street and neighborhood crime in China. Sci Data 12, 467 (2025). https://doi.org/10.1038/s41597-025-04757-8
Section 3.1. Add justification and supporting references for deciding those eight categories of crime addresses. Can a crime address be attributed to more than one category and how is that enforced in LLM? Add explanations.
Figure 2. Are crime address and the category of crime address the output of CACE_LLM? In this sense, should the arrow point from CAEC_LLM to crime address and its category?
Lines 228-229. Explain “the test set includes all output content of the model.” Model results cannot generally be used as a test set, as they create a biased evaluation. A test set should be a held-back, separate portion of data.
Figure 3 and section 3.2.1. It is unclear what the output is from LoRA Fine-Tuning to CAEC_LLM. Authors stated that “Incorporating legal domain knowledge in this way improves the model's ability to accurately locate and extract addresses in judicial documents.” What is legal domain knowledge? How did those technical parameters work to incorporate the legal domain knowledge? Explain all details and provide examples. Though the authors explained technical parameters, I cannot see how they contribute to locating and extracting crime addresses from judicial documents.
Section 4.1. Lines 293-295. Explain why selecting 1,889 judicial documents and why between 2011 and 2021. What is the total number of judicial documents?
Section 5.1. Add a description of the validation data set and the rationale for choosing Qwen2.5-7B, GLM4-6B, and Qwen3-8B for comparison to CAEC_LLM. There are 972 test set addresses. It is unclear why it is larger than 500 (1,889 - 1,389), from which data sets they came, and how they were selected. Explain the input of each model and how confounding factors of the experiment were controlled.
Section 5.2. Explain what prompt/input was entered into each model for evaluation.
The authors can discuss how the text of crime addresses, extracted and classified by CAEC_LLM, can be used properly in the spatial analysis of crime, which usually requires geographic coordinates of a crime event.
Author Response
Dear Reviewer:
Thanks for your insightful and constructive feedback on our manuscript. We appreciate the opportunity to refine our work based on your suggestions.
Below, we have provided a point-by-point response to each comment. For clarity, blue text indicates the original manuscript content, while red text highlights the revised wording. Corresponding changes have also been highlighted directly within the revised manuscript.
We believe these revisions have significantly strengthened the paper and look forward to your further evaluation.
Reviewer 2, Comment Q1:
Section 2.1. From lines 149 to 160, the authors introduced multiple studies that applied LLMs to analyze judicial documents. These studies have made methodological improvements, but what research gaps still exist? In the entire section 2.1, the authors summarized multiple research gaps from previous studies in each paragraph except for the last one. It is unclear which research gaps, among all those mentioned, can be specifically addressed by this study. Please add more explanations at the end of section 2.1.
Response to Reviewer 2, Comment Q1:
We agree that the connection between the identified research gaps and the specific objectives of our study required further clarification. We have now revised the Introduction to more explicitly align our goals with the limitations found in existing literature.
At the end of Section 2.1 (lines 165–181), we have added a new summary paragraph that not only synthesizes the key research gaps highlighted in the preceding discussion but also explicitly contrasts our work with existing LLM-based judicial text studies. Specifically, while prior LLM applications in the legal domain, such as Lawyer-LLaMa, Chat-Law, and LawLLM, have focused on tasks like legal reasoning, statute retrieval, judgment prediction, or document summarization, they have largely overlooked the fine-grained, spatially-aware extraction and classification of crime addresses from lengthy and procedurally complex judicial narratives. Moreover, even studies that utilize LLMs for location extraction (e.g., Zhang et al., 2025) often treat addresses merely as atomic points for geocoding, neglecting their varied spatial semantics, contextual roles, and hierarchical scale characteristics that are critical for accurate crime geography analysis.
The revised paragraph:
Despite the aforementioned progress in adapting LLMs for judicial text analysis, there remains a research gap concerning the extraction and nuanced interpretation of crime-specific location information. While models such as Lawyer-LLaMa and LawLLM have demonstrated strong performance in general legal reasoning, statute retrieval, or judgment prediction, their capability to accurately identify and semantically classify sparse, variable, and context-dependent crime addresses from lengthy procedural narratives has not been sufficiently explored or validated. Furthermore, the inherent challenges of judicial documents exacerbate this difficulty, including non-standardized address formats, the intermingling of multiple address types (e.g., residence, crime scene, apprehension location), and addresses expressed across different spatial scales. Consequently, our work makes distinct methodological contributions: (1) proposing and implementing a fine-grained, spatially-aware address classification scheme tailored for crime analysis; (2) fine-tuning an open-source LLM using domain-specific judicial data to enhance the accuracy and controllability of the extraction and classification process, thereby reducing reliance on "black-box" API calls; and (3) providing an end-to-end, reproducible model capable of not only extracting ad-dresses but also semantically classifying them, enabling more granular and spatially appropriate subsequent analysis.
Reviewer 2, Comment Q2:
A paper named “An LLM driven dataset on the spatiotemporal distributions of street and neighborhood crime in China” is closely related to this study. Authors can relate that paper to your work, summarize their methodological advances in extracting crime events from China’s court decision documents using LLM, and explain what improvements your study made compared to that paper.
Response to Reviewer 2, Comment Q2:
Following the suggestion, we have expanded the discussion of this work to better clarify its relevance to our study and elaborate on our methodological contributions. Specifically, we have revised and supplemented the descriptions in Section 2.1 (lines 156-164 of the revised manuscript).
The revised paragraph:
Zhang et al. [36] employed a pre-trained general-purpose LLM (ChatGPT) to identify street-level crime locations spanning multiple years, with the primary objective of constructing a nationwide spatiotemporal dataset of crime at the street and community levels. This represents a significant step forward in utilizing LLMs for large-scale crime data mining from judicial documents. However, their approach views address recognition solely as a means to an end for data preparation. It treats addresses as atomic points for geocoding, overlooking the complex linguistic patterns and nested spatial relationships involved. These legal-specific models underscore a growing effort to harness domain-specific knowledge, enabling more robust and contextually accurate text extraction.
Reference:
Zhang, Y, M-P Kwan, and L Fang, An LLM driven dataset on the spatiotemporal distributions of street and neighborhood crime in China Scientific Data, 2025 12(1): p 467
Reviewer 2, Comment Q3:
Section 3.1. Add justification and supporting references for deciding those eight categories of crime addresses. Can a crime address be attributed to more than one category and how is that enforced in LLM? Add explanations.
Response to Reviewer 2, Comment Q3:
In the revised manuscript, we have thoroughly addressed the two points you raised.
First, as noted, the theoretical framework for the eight-category classification scheme and the relevant references are detailed in Section 2.2 (e.g., [32]). Furthermore, we have revised the opening of Section 3.1 (lines 232–267) to more clearly articulate the rational behind these eight categories.
The revised paragraph:
As shown in Table 1, each category corresponds to a distinct spatial semantic and geometric representation. C1 (Administrative Units) denotes polygonal regions, suitable for macro-level crime mapping and socioeconomic analysis. C2 (House Number Addresses) represents precise point-of-interest (POI) locations, which often require interpolation for geocoding. C3 (Road/Street Segments) inherently reflects linear features, making it particularly suitable for network-based crime analysis and hotspot identification. C4 (Transportation Hubs), although geometrically point-like, possesses a unique functional significance distinct from ordinary streets. C5 (Open Areas) and C6 (Institutions/Facilities/Residential Areas) are typically represented as POIs on maps and may include references to internal structures (e.g., dormitory buildings within a university). C7 (Vaguely Location Descriptions) constitutes a unique category of crime addresses that requires directional interpretation and specialized geocoding methods.
The classification of crime addresses is essential for transforming unstructured textual descriptions into spatially meaningful units to support accurate geocoding and multi-scale crime analysis. Our eight-category classification scheme is based on three core principles derived from spatial referencing theory, the linguistic characteristics of judicial texts, and the practical requirements of crime geography research. First, addresses in judicial documents exhibit a natural hierarchical spatial structure, ranging from macro-level administrative units (e.g., provinces, cities) to micro-level specific locations (e.g., building numbers). Our classification reflects this continuum: C1 represents areal administrative units; C2 represents precise point locations; C3 captures linear features; C4 to C6 denote functionally distinct point-like entities (transportation hubs, open areas, institutions); and C7 accommodates vague or relative spatial descriptions. Second, each category corresponds to common lexical and syntactic markers in judicial narratives. For example, C2 addresses typically contain numeric identifiers (e.g., "No.," "Room," "Building"); C3 addresses end with road-type terms (e.g., "Road," "Avenue"); and C7 includes proximity words (e.g., "near," "beside"). This linguistic alignment facilitates robust model learning and consistent classification. Third, based on a systematic induction of theft-related judgments, crime scenes are frequently described not only as precise addresses but also as transportation hubs (C4), open public areas (C5), and facilities (C6)—all of which are high-frequency locations for theft incidents. Including these categories ensures that our scheme comprehensively covers the various types of crime locations recorded in judicial practice.
To prevent category overlap, we employ a deterministic priority rule that resolves multi-category cases by assigning each address to the highest-priority matching category. This ensures consistency and eliminates ambiguity in downstream geospatial processing.
Second, regarding the specific application of the scheme and the execution of the mutually exclusive classification, we have added a comprehensive explanation in Section 3.2 (lines 348–375). This new text clarifies our priority-based classification rule, which ensures that each address is assigned to one and only one category. Additionally, we describe how this rule is directly integrated into our CAEC_LLM through instruction fine-tuning, rather than applied as a separate post-processing step. This design ensures the output of deterministic and actionable results for subsequent geospatial analysis.
The revised paragraph:
To ensure that each extracted crime address is assigned to one and only one category, thereby facilitating clear downstream geospatial processing, we established a deterministic, priority-based classification rule. This rule establishes a hierarchy among eight categories based on the presence of specific textual tokens, resolving potential overlaps by assigning the address to the highest-priority category matching its defining characteristics. This hierarchy is also summarized visually in the system prompt (see Figure 5) and is presented below in order of decreasing priority: C7 (Vaguely Location Descriptions): Applied if the address contains relative or proximity terms (e.g., “nearby,” “next to,” “opposite,” “around,” “roadside”). C2 (House Number Addresses): Applied if the address contains precise numerical identifiers (e.g., “No.,” “building,” “room”) and C7 is not triggered. C4 (Transportation Hubs): Applied if the address contains transportation node terminology (e.g., “station,” “intersection,” “entrance”). C5 (Open Areas): Applied if the address contains open space descriptors (e.g., “parking lot,” “plaza,” “park”). C6 (Institutions/Facilities/Residential Areas): Applied if the address contains terms for institution or facility types (e.g., “university,” “mall,” “hospital,” “community”). C3 (Road/Street Segments): Applied if the address ends with road type terminology (e.g., “Road,” “Avenue,” “Street”). C1 (Administrative Units): Applied if the address consists solely of administrative division names (e.g., “Province,” “City,” “District”). C8 (Other Addresses): A catch-all category for addresses that do not meet any of the above conditions.
The integration logic of this rule with the LLM is not implemented as a separate post-processing script but is directly encoded into CAEC_LLM through our instruction fine-tuning framework. First, the rule is explicitly stated in the system prompt provided to the model, defining the task constraints. More importantly, the entire instruction dataset used for fine-tuning is constructed strictly according to this priority order. Each (query, response) pair in the training data reflects the outcome of applying this rule. Consequently, during the fine-tuning process, CAEC_LLM learns and internalizes this decision hierarchy, enabling it to perform both extraction and rule-based classification simultaneously in a single, end-to-end forward pass.
Reviewer 2, Comment Q4:
Figure 2. Are crime address and the category of crime address the output of CACE_LLM? In this sense, should the arrow point from CAEC_LLM to crime address and its category?
Response to Reviewer 2, Comment Q4:
Thanks for this suggestion. There was an error in the original version of Figure 2. The crime address and its category are indeed the outputs of the CAEC_LLM model, not the inputs. Therefore, the direction of the arrows indicating data flow should point from the CAEC_LLM box to the “Crime Address” and “Category of Crime Address” boxes. We have corrected this error in the revised manuscript.
The updated Figure 2:
Reviewer 2, Comment Q5:
Lines 228-229. Explain “the test set includes all output content of the model.” Model results cannot generally be used as a test set, as they create a biased evaluation. A test set should be a held-back, separate portion of data.
Response to Reviewer 2, Comment Q5:
The misleading and inaccurate statement in the original manuscript (Lines 228–229) failed to accurately describe our actual experimental setup. To clarify and correct the record, we utilized a manually annotated, independent, held-out test set that has not been used during model training or fine-tuning. The 500 test documents mentioned in Section 4.1 (containing 972 ground truth crime address entities) constitute this independent evaluation set. Model performance was assessed by comparing the model's predictions on this unseen data against the manually annotated ground truth. We have revised the ambiguous wording in Lines 272–278 to accurately reflect this correct methodology and eliminate any confusion.
The original paragraph:
As shown in Figure 3, we will select a portion of the cleaned and annotated data from the judicial document dataset for the creation of the instruction fine-tuning dataset and subsequent model testing. Although the test set includes all output content of the model, to more intuitively demonstrate the model's performance in each category, we divide the test set into two parts for discussion: the model's address extraction performance and the model's address classification performance.
The revised paragraph:
As shown in Figure 3, we will select a portion of the cleaned and annotated data from the judicial document dataset for the creation of the instruction fine-tuning dataset and subsequent model testing dataset. Although the model's output simultaneously includes both the extraction and classification results of addresses, in order to more intuitively demonstrate the model's performance across different tasks, we will present and discuss the results in two steps: the model's address extraction performance and the model's address classification performance.
Reviewer 2, Comment Q6:
Figure 3 and section 3.2.1. It is unclear what the output is from LoRA Fine-Tuning to CAEC_LLM. Authors stated that “Incorporating legal domain knowledge in this way improves the model's ability to accurately locate and extract addresses in judicial documents.” What is legal domain knowledge? How did those technical parameters work to incorporate the legal domain knowledge? Explain all details and provide examples. Though the authors explained technical parameters, I cannot see how they contribute to locating and extracting crime addresses from judicial documents.
Response to Reviewer 2, Comment Q6:
Regarding the composition of "legal domain knowledge" in this study, specifically, the linguistic patterns, contextual cues, and spatial semantics unique to crime addresses in judicial texts, we have added a more detailed discussion in the second paragraph of Section 3.2.1 (lines 289–297 of the revised manuscript).
The revised paragraph:
To adapt the general-purpose GLM4-9B model to the specialized task of crime address handling in judicial documents, we employ Low-Rank Adaptation (LoRA) fine-tuning. The “legal domain knowledge” refers to the specific linguistic patterns, contextual dependencies, and spatial semantics associated with crime addresses as they appear in Chinese judicial narratives. This includes, but is not limited to: (1) the typical lexical markers for different address categories (e.g., “Road” for C3, “near” for C7); (2) the common syntactic structures surrounding crime locations (e.g., “[at] + [Location] + [committed theft]”); and (3) the ability to distinguish a crime scene address from other co-occurring addresses (e.g., defendant’s domicile) based on narrative context.
The instruction dataset facilitates the transfer of domain-specific knowledge during the fine-tuning process. Throughout LoRA fine-tuning, the model learns from this data, with gradient updates applied specifically to the task-aligned LoRA parameters. These parameters then encode the necessary domain adaptations. To address your query, we have added a discussion on this aspect in lines 337–341 of Section 3.2.2 in the revised manuscript.
The revised paragraph:
The instruction-tuning dataset serves as the carrier of such knowledge. Each (query, answer) pair provides an example illustrating how the original judicial documents (query) are mapped to structured address information (answer) according to our defined rules. During LoRA fine-tuning, the model processes these examples and computes the loss between its predictions and the ground-truth answers.
Meanwhile, in the fifth and sixth paragraphs of Section 3.2.1 (lines 312–327 of the revised manuscript), we provide a non-mathematical description of the mechanism, explaining how these parameter updates alter the model's internal behavior (e.g., by adjusting attention to location tokens near crime verbs). We also include a specific example comparing the model's understanding of administrative divisions like "Gulou District, Nanjing City" before and after fine-tuning, illustrating how it shifts from a generic location to a categorized spatial reference for crimes.
The revised paragraph:
This update process effectively "programs" domain-specific knowledge into the LoRA parameters. For instance, the model learns to adjust the attention mechanism within its Transformer layers, enabling tokens indicating locations (e.g., "road", "building", "district") to receive higher attention scores when they appear near crime-related verbs. Concurrently, the model's internal representations of these location-related to-kens and their contextual combinations are refined to better align with our classification system. Specifically, before fine-tuning, the model might primarily treat "Gulou District, Nanjing City" as a generic location. After fine-tuning, the LoRA-adjusted model associates it more closely with the "C1_Administrative Units" category and understands that, within a judicial context, it typically serves as a macro-level spatial reference for criminal incidents.
Therefore, the technical contribution of the LoRA matrices extends beyond mere parameter-efficient tuning. They encode a task-specific "shift" in linguistic understanding required by this domain. By integrating legal domain knowledge in this manner, the fine-tuned CAEC_LLM acquires an enhanced capability to accurately locate crime addresses within lengthy procedural texts and extract them with correct boundaries and categorical interpretations.
Reviewer 2, Comment Q7:
Section 4.1. Lines 293-295. Explain why selecting 1,889 judicial documents and why between 2011 and 2021. What is the total number of judicial documents?
Response to Reviewer 2, Comment Q7:
We have added a description of the total volume of "first-instance theft" judicial documents in the second paragraph of Section 4.1 (lines 393–394 of the revised manuscript).
The revised paragraph:
For this study, we filtered and retained only first-instance criminal judgment documents in which the crime type was recorded as “theft”. A preliminary screening of this massive database yielded 835,693 first-instance judicial documents related to theft. This selection ensured uniformity in both legal and narrative expression, reducing semantic variability in crime description. The resulting subset served as the foundation for both the model training and evaluation processes.
We selected 1,889 judicial documents in Section 4.1 and limited the time range to 2011–2021, primarily based on the following considerations:
The period from 2011 to 2021 covers a stable phase of sustained growth in the online publication of documents after the launch of the judicial document disclosure platform, ensuring data richness and coverage.
We adopted a stratified sampling method by year, randomly selecting approximately 200 documents annually between 2011 and 2021 for manual verification and annotation. During the annotation process, we excluded documents for which reliable address recognition was not possible (such as cybercrimes and crimes occurring on public transportation vehicles), ultimately retaining 1,889 valid documents. This sample size represents a balance between the cost of manual annotation and the requirements for model training: it satisfies the data volume requirements for training and robustly evaluating complex LLMs while achieving high-quality annotation within feasible resource constraints.
To this end, we have revised and expanded the detailed discussion regarding the dataset in the third paragraph of Section 4.1 (lines 398–423 of the revised manuscript).
The original paragraph:
To establish benchmark evaluation criteria for the extraction and classification performance of the model, we constructed a manually annotated benchmark dataset. Specifically, we randomly selected 1,889 judicial documents from judgments issued by courts at various levels across mainland China between 2011 and 2021. Among them, 1,389 documents were used in the training set. These documents were manually annotated to identify crime locations and classify each address into its corresponding category. The annotated data is stored in a structured text format, facilitating its submission to large language models for training and accuracy assessment.
The revised paragraph:
To construct a high-quality manually annotated dataset for model development and evaluation under feasible resource constraints, we adopted a stratified sampling strategy. We limited the time range to the period from 2011 to 2021. This 11-year window was selected because it represents a period of sustained growth in the online publication of judicial documents after the platform's launch, ensuring data availability and relative format consistency. Within this period, we aimed to extract representative samples for annotation. To obtain cross-year samples, we conducted sampling by year, randomly selecting 200 judicial documents from each year between 2011 and 2021 for manual verification and annotation. We recruited several volunteers from the field of geographic information to carry out this work. During this process, we removed documents from the sampled dataset that could not undergo reliable address recognition (such as cybercrimes involving account transactions and crimes occurring on public transportation vehicles). Ultimately, we identified 1,889 judicial documents. This sample size represents a balance between the high cost of manual annotation and the need to build a sufficiently large dataset for training and robustly evaluating com-plex LLMs. These documents originate from courts at various levels across mainland China. From the 1,889 annotated documents, we performed a fixed split: 1,389 documents were used for training CAEC_LLM, and the remaining 500 were reserved as an independent test set. These documents were manually annotated to identify crime lo-cations and categorize each address into the corresponding category. The annotated data is stored in a structured text format, facilitating its submission to large language models for training and accuracy evaluation. For simplicity, and given the relative stability of LoRA performance under standard hyperparameters, we did not use a separate validation set for hyperparameter tuning; the reported results are derived from model checkpoints obtained after fine-tuning on the entire training set. After manual annotation, these 500 test documents contain a total of 972 distinct crime location address entities, which constitute the address-level ground truth for evaluation.
Reviewer 2, Comment Q8:
Section 5.1. Add a description of the validation data set and the rationale for choosing Qwen2.5-7B, GLM4-6B, and Qwen3-8B for comparison to CAEC_LLM. There are 972 test set addresses. It is unclear why it is larger than 500 (1,889 - 1,389), from which data sets they came, and how they were selected. Explain the input of each model and how confounding factors of the experiment were controlled.
Response to Reviewer 2, Comment Q8:
As stated in the newly added content in Section 4.1 (Lines 398–423), our test set consists of 500 reserved judicial documents (since a single judicial document often describes multiple crime events or contains multiple location references, the number of address entities (972) exceeds the number of documents (500). These addresses constitute the address-level ground truth for evaluating extraction and classification performance).
Additionally, we have expanded Section 4.2 (Lines 444–451) to explain the selection of baseline models. To ensure fair comparisons and control for confounding factors, we kept all other factors (input data, evaluation pipeline, environment) constant (Lines 451–461), thereby carefully isolating the effect of the core variables.
The revised paragraph:
We compared the performance of the proposed CAEC_LLM with three powerful open-source Chinese LLMs to establish competitive baselines and highlight the effect of domain-specific fine-tuning. GLM4-6B belongs to the same series as our base model, sharing the identical GLM architecture. Qwen2.5-7B and Qwen3-8B are widely recognized and high-performing general-purpose Chinese LLMs. We included both to represent popular baselines of different parameter scales. Qwen3-8B is particularly noted for its enhanced reasoning capabilities, allowing us to investigate whether such architectural optimization benefits this task without domain adaptation. To ensure a fair comparison, all models were evaluated under identical conditions. For each document in the test set of 500 judicial documents, the input was formatted using the exact same prompt template as shown in Figure 5, which includes system instructions and the original document text. The model’s input, output, and evaluation process were all conducted in the Chinese context. The baseline models (GLM4-6B, Qwen2.5-7B, Qwen3-8B) were evaluated in their original, un-fine-tuned state (zero-shot). In contrast, CAEC_LLM is a LoRA fine-tuned version of GLM4-9B. All experiments were run on the same hardware and software stack. A consistent post-processing script was used to parse model outputs, extracting address strings and predicted categories, which were then compared against the same manually annotated ground truth to calculate precision, recall, and F1 scores.
Reviewer 2, Comment Q9:
Section 5.2. Explain what prompt/input was entered into each model for evaluation.
Response to Reviewer 2, Comment Q9:
We have clarified the evaluation procedure in Section 4.2 (Lines 444-461) of the revised manuscript.
To ensure a fair and controlled comparison, the exact same prompt/input was entered into each model for evaluation. Specifically, for every document in the test set of 500 judicial documents, we used the unified prompt template shown in Figure 5. This template includes the system instructions defining the task rules and output format, followed by the raw text of the judicial document as the query.
Reviewer 2, Comment Q10:
The authors can discuss how the text of crime addresses, extracted and classified by CAEC_LLM, can be used properly in the spatial analysis of crime, which usually requires geographic coordinates of a crime event.
Response to Reviewer 2, Comment Q10:
As noted, we have added a dedicated discussion on this topic in Section 6.2 (Lines 589-600) of the revised manuscript.
The revised paragraph:
A reasonable question arising from our work is how to integrate text-based addresses into spatial crime analysis, which typically requires geographic coordinates. We argue that the primary value of this classification lies not in replacing geocoding, but in enabling a semantically aware and scale-appropriate geocoding strategy, thereby significantly improving the quality of the generated spatial data. By moving away from a “one-size-fits-all” point-based geocoding approach, our framework helps mitigate the substantial positional errors that arise from forcing linear, areal, or vague locations into point representations. This, in turn, leads to more accurate crime maps, more reliable hotspot detection, and ultimately, more effective spatial analysis. Therefore, our work does not merely stop at text classification; it provides the necessary spatial semantic metadata aimed at transforming unstructured judicial text into a structured, geographically intelligent database, primed for rigorous crime pattern analysis and evidence-based urban safety planning.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for AuthorsThe paper presents an LLM-based method for crime address attraction from Chinese judicial documents. While the topic is timely and relevant, the manuscript lacks sufficient clarity regarding data accessibility, language processing, and reproducibility. Addressing these issues, particularly the public nature of the source data, the language of the documents, and the status of the released dataset, is necessary for a thorough evaluation of the contribution. Please read the attached file for detailed comments.
Comments for author File:
Comments.pdf
Minor revisions are needed at few instances.
Author Response
Dear Reviewer:
Thank you for your insightful and constructive feedback on our manuscript. We appreciate the opportunity to revise our work based on your suggestions.
Below, we have provided a point-by-point response to each of your comments. For clarity, blue text indicates the original content, while red text highlights the revised wording. Additionally, all changes have been highlighted directly within the revised manuscript.
We believe these revisions have significantly strengthened the paper and look forward to your further evaluation.
Reviewer 3, Comment Q1:
The manuscript states that the complete judicial document dataset and fine-tuned model cannot be released due to data privacy and confidentiality agreements. However, the source platform used in this study (China Judgments Online / Wenshu) is an official public repository operated by the Supreme People’s Court of China, where judicial decisions are intended to be publicly accessible after user registration. The authors should clearly explain what specific legal, contractual, or ethical constraints apply to their processed dataset that go beyond the access restrictions of the original public source. Without such clarification, the justification for withholding the dataset and model remains unclear.
Response to Reviewer 3, Comment Q1:
Our initial statement regarding data restrictions was not sufficiently clear and has now been revised. The primary restrictions are not legal or contractual in nature, but rather logistical and practical. While the original documents are indeed publicly accessible, the dataset used in this study, comprising over 800,000 filtered, complete documents, is massive in scale. More importantly, this processed dataset represents a significant investment of effort and computational resources on our part regarding manual annotation, screening, and cleaning. To address this issue and ensure full transparency and reproducibility, we have updated the data availability statement (Lines 694–698 in the revised manuscript).
The revised paragraph:
Data Availability Statement: The model code and partial sample datasets used for training and testing in this study have been deposited in a public GitHub repository and can be accessed at:
https://github.com/letdo1945/Crime_Address_Extraction_and_Classification_Based_on_LLM_Data. The complete original judicial document dataset and the final trained model (CAEC_LLM) can be provided upon request to the author.
Reviewer 3, Comment Q2:
The manuscript indicates that a sample of the dataset is available on a public GitHub repository; however, the provided link is currently non-functional. This prevents independent verification of the data characteristics and preprocessing steps. The authors should provide a working link or an alternative means to access a representative sample, along with clear documentation of data selection, preprocessing, and annotation procedures to support reproducibility. As the authors mentioned that a benchmark dataset has been curated with manual annotation process, but the paper lacks all the details on this part.
Response to Reviewer 3, Comment Q2:
We appreciate the reviewer's careful attention to the reproducibility of our work. Regarding the GitHub link, we have tested the provided URL at our end on multiple networks and devices, and it resolves correctly to the intended public repository (https://github.com/letdo1945/Crime_Address_Extraction_and_Classification_Based_on_LLM_Data). The repository is active and accessible.
Reviewer 3, Comment Q3:
The manuscript does not clearly specify the language of the judicial documents used in the experiments. Given that China Judgments Online predominantly publishes judicial decisions in Chinese, it is important to clarify whether the models were trained and evaluated on original Chinese texts, translated English versions, or a mixture of both. It is mentioned at one instance that “GLM4-9B as the foundational 235 model due to its demonstrated efficiency in processing Chinese text”, but then, please read the next comment:
All illustrative examples in the manuscript are presented in English, without clarification as to whether these are verbatim translations of real judicial documents or synthetic examples created for explanatory purposes. This ambiguity makes it difficult to assess the realism and linguistic complexity of the proposed crime address extraction task. Including at least one anonymized example in the original language, or clearly labeling translated or illustrative examples, would improve transparency.
Response to Reviewer 3, Comment Q3:
We have provided additional explicit explanations to address the ambiguities present in the original manuscript. All models, including our fine-tuned CAEC_LLM, were trained and evaluated exclusively using the raw Chinese text of judicial documents. No translation was involved at any stage of model development or testing. All address examples presented in the paper's tables, figures, and main text are English translations of real addresses extracted from Chinese judicial documents, not synthetic examples. This critical point has now been explicitly stated in Section 4.1 (Lines 384-387) and Section 4.2 (Lines 454-455) of the revised manuscript. As for the original document examples of the judicial documents, detailed information is provided in the file data/Test_Set_500_json.csv in the GitHub repository at
https://github.com/letdo1945/Crime_Address_Extraction_and_Classification_Based_on_LLM_Data
The revised paragraph(Section 4.1):
The dataset used in this study was collected from the China Judgments Online platform (https://wenshu.court.gov.cn/). This dataset comprises a series of judicial documents from 1986 to 2021, covering various criminal and civil cases ranging from traffic accidents, fraud, to theft. These cases were adjudicated by different courts across China, amounting to millions of judicial documents in total. Each document provides a complete record of a case and the associated trial proceedings. The format of these judicial documents is detailed in Table 2. It should be noted that all content related to judicial documents presented in this paper is a translation of the original Chinese text into an English context. All corpora used in this study consist of judicial documents and their corresponding original Chinese texts. The judicial document dataset includes a range of basic information, such as case type, presiding court, involved parties, disclosure date, prosecuting authority, judgment date, document content, procedural stage of the trial, and case identifier. The location of the crime is contained within the " Main Text " section of the dataset.
The revised paragraph(Section 4.2):
We compared the performance of the proposed CAEC_LLM with three powerful open-source Chinese LLMs to establish competitive baselines and highlight the effect of domain-specific fine-tuning. GLM4-6B belongs to the same series as our base model, sharing the identical GLM architecture. Qwen2.5-7B and Qwen3-8B are widely recognized and high-performing general-purpose Chinese LLMs. We included both to represent popular baselines of different parameter scales. Qwen3-8B is particularly noted for its enhanced reasoning capabilities, allowing us to investigate whether such architectural optimization benefits this task without domain adaptation. To ensure a fair comparison, all models were evaluated under identical conditions. For each document in the test set of 500 judicial documents, the input was formatted using the exact same prompt template as shown in Figure 5, which includes system instructions and the original document text. The model’s input, output, and evaluation process were all conducted in the Chinese context. The baseline models (GLM4-6B, Qwen2.5-7B, Qwen3-8B) were evaluated in their original, un-fine-tuned state (zero-shot). In contrast, CAEC_LLM is a LoRA fine-tuned version of GLM4-9B. All experiments were run on the same hardware and software stack. A consistent post-processing script was used to parse model outputs, extracting address strings and predicted categories, which were then compared against the same manually annotated ground truth to calculate precision, recall, and F1 scores.
Reviewer 3, Comment Q4:
The evaluation compares several multilingual large language models (e.g., Qwen2.5-7B, Qwen3-8B, GLM4-6B), which are capable of processing both Chinese and English text. However, the manuscript does not clarify which language(s) were used as input during evaluation. The authors should specify whether all models were evaluated on Chinese text, English translations, or both, as this has important implications for performance interpretation and fairness of comparison. The reason for choosing Qwen models for comparison should also be discussed.
Response to Reviewer 3, Comment Q4:
All models, including the baseline LLMs (GLM4-6B, Qwen2.5-7B, Qwen3-8B) and CAEC_LLM, were evaluated exclusively on raw Chinese text. The model input (original judicial documents), output, and the subsequent comparison with ground truth were all conducted within the Chinese context. No translation was involved during the evaluation phase. This ensures a direct and fair comparison of the models' capabilities in processing the target domain language. This clarification has been added to Section 4.2 (Lines 452-461). Additionally, we have expanded the discussion in Section 4.2 (Lines 444–452) to justify the selection of these baseline models.
Reviewer 3, Comment Q5:
It is not clear how LoRA fine-tuning has been applied as the details in Section 3.2.1 only provide definitional details of LoRA. Then, looking at Section 3.2.2, it looks like a simple case of prompt engineering for address extraction.
Response to Reviewer 3, Comment Q5:
In this study, we developed CAEC_LLM, a specialized LLM tailored for extracting and classifying crime addresses from Chinese judicial documents. This model is not merely a combination of fine-tuning and prompt engineering; rather, it is an integrated system where LoRA-based fine-tuning and structured prompt design work synergistically to embed domain-specific spatial and legal knowledge into the model.
LoRA fine-tuning serves as the core mechanism for domain adaptation. It enables the model to learn and internalize the linguistic patterns, contextual dependencies, and spatial semantics unique to crime addresses in judicial documents. Through low-rank updates to the attention layers, CAEC_LLM acquires the ability to distinguish crime scene addresses from other address types (e.g., the defendant’s residence) and to recognize lexical markers indicative of different spatial categories (e.g., “Road” for C3, “near” for C7). To further elucidate this point, we have added the second, fifth, and sixth paragraphs to Section 3.2.1 (Lines 289–297 and 312–327 in the revised manuscript) to reinforce this explanation.
The revised paragraph(Section3.2.1,Passage2):
To adapt the general-purpose GLM4-9B model to the specialized task of crime address handling in judicial documents, we employ Low-Rank Adaptation (LoRA) fine-tuning. In our context, the “legal domain knowledge” refers to the specific linguistic patterns, contextual dependencies, and spatial semantics associated with crime addresses as they appear in Chinese judicial narratives. This includes, but is not limited to: (1) the typical lexical markers for different address categories (e.g., “Road” for C3, “near” for C7); (2) the common syntactic structures surrounding crime locations (e.g., “[at] + [Location] + [committed theft]”); and (3) the ability to distinguish a crime scene address from other co-occurring addresses (e.g., defendant’s domicile) based on narrative context.
The revised paragraph(Section3.2.1,Passage5、Passage5):
This update process effectively "programs" domain-specific knowledge into the LoRA parameters. For instance, the model learns to adjust the attention mechanism within its Transformer layers, enabling tokens indicating locations (e.g., "road", "building", "district") to receive higher attention scores when they appear near crime-related verbs. Concurrently, the model's internal representations of these location-related to-kens and their contextual combinations are refined to better align with our classification system. Specifically, before fine-tuning, the model might primarily treat "Gulou District, Nanjing City" as a generic location. After fine-tuning, the LoRA-adjusted model associates it more closely with the "C1_Administrative Units" category and understands that, within a judicial context, it typically serves as a macro-level spatial reference for criminal incidents.
Therefore, the technical contribution of the LoRA matrices extends beyond mere parameter-efficient tuning. They encode a task-specific "shift" in linguistic understanding required by this domain. By integrating legal domain knowledge in this manner, the fine-tuned CAEC_LLM acquires an enhanced capability to accurately locate crime addresses within lengthy procedural texts and extract them with correct boundaries and categorical interpretations.
Meanwhile, the structured prompt engineering mentioned in this study functions as a knowledge-guiding framework. The prompt template (system + query + answer) explicitly defines the classification rules and output format, ensuring that the model operates within a semantically constrained space. More importantly, these rules are not applied post-hoc; they are embedded into the instruction-tuning dataset, enabling the model to learn the classification hierarchy during the fine-tuning process. Therefore, the prompt design is not a “simple case of prompt engineering,” but a structured knowledge injection strategy that works in tandem with LoRA to ensure consistent and interpretable outputs. Corresponding to the revisions mentioned above, we have provided further explanation on this part in Lines 337–341 of Section 3.2.2 in the revised manuscript.
The revised paragraph:
The instruction-tuning dataset serves as the carrier of such knowledge. Each (query, answer) pair provides an example illustrating how the original judicial documents (query) are mapped to structured address information (answer) according to our defined rules. During LoRA fine-tuning, the model processes these examples and computes the loss between its predictions and the ground-truth answers.
Reviewer 3, Comment Q6:
The challenges associated with judicial documents processing have been described at several instances in Introduction and Review sections that gave a sense of redundancy. Moreover, not even a single work has been cited in the domain of address extraction from criminal records/judicial documents in particular. Is the proposed work first experimental work in this domain? All the statements below from the manuscript show either redundancy or claims made without references and challenges described without specific examples either from texts or by citing other papers relevant to the crime address extraction domain.
Response to Reviewer 3, Comment Q6:
In the Introduction (Lines 156–164 in the revised manuscript), we now cite and discuss the work by Zhang et al. (2025), which utilizes LLMs to identify crime locations from legal documents to construct spatiotemporal datasets.
The revised paragraph:
Driven by the impressive success of LLMs, there has been growing interest in adapting these models for judicial document analysis. Pioneering studies have introduced large-scale models such as Lawyer-LLaMa [33] and Chat-Law [34]. These models are built upon and fine-tuned using open-source platforms like LLaMa and ChatGLM on datasets comprising legal dialogues and judicial documents. Similarly, Shen et al. [35] fine-tuned the Baichuan-13B-Base architecture to develop LawLLM, a legal LLM specifically designed for comprehensive legal applications. This specialized system demonstrates robust performance in critical legal tasks including statutory information extraction and judicial decision prediction. Li [28] further advanced this area by integrating LLMs with prompt engineering techniques, enabling the generation of summaries from judicial documents. Zhang et al. [36] employed a pre-trained general-purpose LLM (ChatGPT) to identify street-level crime locations spanning multiple years, with the primary objective of constructing a nationwide spatiotemporal dataset of crime at the street and community levels. This represents a significant step forward in utilizing LLMs for large-scale crime data mining from judicial documents. However, their approach views address recognition solely as a means to an end for data preparation. It treats addresses as atomic points for geocoding, overlooking the complex linguistic patterns and nested spatial relationships involved. These legal-specific models underscore a growing effort to harness domain-specific knowledge, enabling more robust and contextually accurate text extraction.
Reference:
Zhang, Y, M-P Kwan, and L Fang, An LLM driven dataset on the spatiotemporal distributions of street and neighborhood crime in China Scientific Data, 2025 12(1): p 467
We position our work as an extension of such efforts, introducing a formal classification scheme and a fine-tuned model for controllable extraction. In Section 2.1 (Lines 74、113-116 and 141 in the revised manuscript), we have revised the text to more clearly distinguish between prior work on general legal NLP, general address extraction, and the specific gap that our work aims to fill. Redundant statements have been removed or consolidated to improve flow and focus.
The original paragraph(Section1,Passage4):
Second, judicial documents often reference multiple types of addresses, including defendants’ registered address, permanent residence, arrest location, and the crime scene (see Figure 1). These address types are contextually distinct but lexically similar, which requires models to capture deep contextual semantics to correctly interpret the address and its role in the case.
The revised paragraph(Section1,Passage4):
Second, judicial documents often reference multiple types of addresses, including defendants’ registered address, permanent residence, arrest location, and the crime scene (see Figure 1). These address types are contextually distinct but lexically highly similar.
The original paragraph(Section2.1,Passage1):
Extensive scholarly work exists on extracting address information from textual content, traditionally considered a subset of Named Entity Recognition (NER) [19, 20]. Early research focused on simultaneously identifying spatial and organizational entities within entire documents [21-23]. Yet, when it comes to judicial documents, this task becomes more challenging [24]. In contrast to social media content where geolocation information is concisely integrated, judicial documents frequently contain extended textual segments spanning hundreds to thousands of words, and the collection comprises a diverse array of address types.
The revised paragraph(Section2.1,Passage1):
Extensive scholarly work exists on extracting address information from textual content, traditionally considered a subset of named entity recognition (NER) [19, 20]. Early research focused on simultaneously identifying spatial and organizational entities within entire documents [21-23]. Yet, when it comes to judicial documents, this task becomes more challenging. Unlike social media content, which integrates geolocation information concisely, judicial documents typically consist of ultra-long text segments ranging from hundreds to thousands of words, and their textual descriptions exhibit the specificity of vocabulary and language patterns in the legal domain [24].
The original paragraph(Section2.1,Passage3):
Further fine-tuning experiments by Hu and colleagues, using ChatGPT and GPT-4 on the Harvey Hurricane Twitter dataset, improved entity-recognition accuracy by over 40%, highlighting the value of integrating geographical knowledge. However, while pre-trained LLMs have proven effective in many contexts, their ability to reliably recognize address details within specialized legal domains still warrants further validation, particularly when faced with the challenges of minimal address content, non-standard formatting, and a high volume of extraneous address information [32].
The revised paragraph(Section2.1,Passage3):
Further fine-tuning experiments by Hu and colleagues, using ChatGPT and GPT-4 on the Harvey Hurricane Twitter dataset, improved entity-recognition accuracy by over 40%, highlighting the value of integrating geographical knowledge. However, while pre-trained LLMs have proven effective in many contexts, their ability to reliably recognize address details within specialized legal domains still warrants further improvement [32].
Reviewer 3, Comment Q7:
Abbreviations should be used properly. NLP, LLMs should be defined, or if already defined then they should be used correctly. For instance, NER has been defined but again full form has been used.
Response to Reviewer 3, Comment Q7:
We have checked and corrected the usage of abbreviations, including NLP, LLMs, and NER.
Reviewer 3, Comment Q8:
Authors should also look for some minor grammatical issues with writing.
Response to Reviewer 3, Comment Q8:
We have proofread the entire manuscript and corrected grammatical errors.
Reviewer 3, Comment Q9:
BERT is an LLM but it has been referenced in general neural networks discussion and then later again with LLMs.
Response to Reviewer 3, Comment Q9:
In the broad sense of pre-trained Transformer models, BERT is indeed a LLM. Our original intention was to draw a functional and architectural distinction, which may not have been clearly articulated initially. BERT is essentially an encoder-only model, optimized for understanding and representation tasks. In contrast, the LLMs discussed in our text are primarily decoder or encoder-decoder models designed for generative tasks. To eliminate ambiguity, we have adjusted the examples in the Introduction (Line 85 in the revised manuscript) to refer specifically to generative LLMs as the context for introducing our method. Additionally, we have revised the text and references in Section 2.1 (Lines 121–123 in the revised manuscript).
The original paragraph(Section1,Passage6):
Recent advances in Large Language Models (LLMs), such as BERT [11] and GPT variants [12, 13], have proven effective at extracting sparse and semantically complex information from long texts [14]. However, judicial documents present domain-specific linguistic structures, including highly formalized expressions, nested clauses, and regionally nuanced address descriptions [15]. These characteristics often cause off-the-shelf models to misinterpret the contextual roles of location entities or to overlook crime-related addresses embedded within lengthy procedural narratives.
The revised paragraph(Section1,Passage6):
Recent advances in LLMs, such as Deepseek [11] and GPT variants [12, 13], have proven effective at extracting sparse and semantically complex information from long texts [14]. However, judicial documents present domain-specific linguistic structures, including highly formalized expressions, nested clauses, and regionally nuanced address descriptions [15]. These characteristics often cause off-the-shelf models to misinterpret the contextual roles of location entities or to overlook crime-related addresses embedded within lengthy procedural narratives.
Reference:
Guo, D, et al, Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning arXiv preprint arXiv:250112948, 2025
The original paragraph(Section2.1,Passage2):
Advances in neural networks have improved the precision of address extraction. Researchers have applied deep learning architectures, such as BiLSTM, to domain-specific tasks using annotated corpora. For instance, Qi et al. [25] extracted ad-dress information from user-generated content on platforms like X and Weibo, by employing BERT and an improved IDCNN-BiLSTM-CRF model. Similarly, Ke et al. [26] enhanced a Chinese text NER model on the People’s Daily dataset by combining BERT with BiLSTM. Despite these advances, challenges persist due to issues such as limited training data, inherent noise, and significant variations within specific domain-specific. In judicial documents, these challenges are compounded by the scarcity of address details, the prevalence of non-standard address formats, and the frequent occurrence of redundant address strings.
The revised paragraph(Section2.1,Passage2):
Advances in neural networks have improved the precision of address extraction. Researchers have applied deep learning architectures, such as BiLSTM, to domain-specific tasks using annotated corpora. For instance, Qi et al. [25] extracted ad-dress information from user-generated content on platforms like X and Weibo, by employing an improved IDCNN-BiLSTM-CRF model. Similarly, Shao et al. [26] enhanced the BiLSTM-CRF architecture by incorporating a multi-layer perceptron approach, which significantly improved the model's capability for address entity recognition on the CCKS2021 dataset. Despite these advances, challenges persist due to issues such as limited training data, inherent noise, and significant variations within specific domain-specific. In judicial documents, these challenges are compounded by the scarcity of address details, the prevalence of non-standard address formats, and the frequent occurrence of redundant address strings.
Reference:
Shao, W, et al, Address entity recognition based on multi-layer knowledge perception Journal of Chinese Information Processing, 2025 39(06): p 110-118
Reviewer 3, Comment Q10:
Authors have focused on reviewing address extraction, however, the following work “Ke et al. [24] enhanced a Chinese text NER model on the People’s Daily dataset by combining BERT with BiLSTM” it is unclear whether the cited work is a general NER case of location/address entities?
Response to Reviewer 3, Comment Q10:
The work by Ke et al. [24] originally cited focuses on general NER on a news corpus (People's Daily dataset), and is not specifically tailored to address or location entities. Citing it in the context of address extraction was misleading. We have corrected this in the revised manuscript. In Section 2.1 (Lines 121–123 in the revised manuscript), we have replaced the citation of Ke et al. [24] with the work of Shao et al. [26].
Reviewer 3, Comment Q11:
No results have been reported either in terms of visuals or tables for qualitative analysis of misclassified cases.
Response to Reviewer 3, Comment Q11:
Upon careful consideration, we recognized that the brief qualitative discussion of misclassifications in the original Section 5.1 was superficial and lacked substantive visualization or tabular analysis. In the revised manuscript, we have removed the previous preliminary and insufficient qualitative discussion regarding misclassified cases from Section 5.1.
Reviewer 3, Comment Q12:
Caption for Figure 2 needs to be redefined as it looks similar to caption for Figure 3
Response to Reviewer 3, Comment Q12:
The caption of Figure 2 has been revised to distinguish it from the caption of Figure 3 while accurately reflecting the content of the figure.
The original caption:
Figure 2. Experimental workflow diagram.
The revised caption:
Figure 2. Schematic diagram of CAEC_LLM based on LoRA fine-tuning.
Author Response File:
Author Response.pdf
Round 2
Reviewer 3 Report
Comments and Suggestions for AuthorsDespite the authors’ revisions, several core methodological and reproducibility concerns
remain insufficiently addressed. These issues affect the internal validity, interpretability, and
reproducibility of the study. Please check the attached file for the details.
Comments for author File:
Comments.pdf
Author Response
Dear Reviewer:
Thank you for your insightful and constructive feedback on our manuscript. We appreciate the opportunity to revise our work based on your suggestions.
Below, we have provided a point-by-point response to each of your comments. For clarity, blue text indicates the original content, while red text highlights the revised wording. Furthermore, regarding the revisions for round 2, all changes have been highlighted in blue in the revised manuscript.
We believe these revisions have significantly strengthened the paper and look forward to your further evaluation.
Reviewer 3, Round2, Comment Q1:
The manuscript claims that domain-specific LoRA fine-tuning of GLM4-9B leads to performance improvements. However, the experimental results do not include a comparison between the base (non-fine-tuned) GLM4-9B and the fine-tuned GLM4-9B. Instead, the comparison is conducted against GLM4-6B. This does not isolate the effect of fine-tuning, as performance differences may be attributable to model scale rather than domain adaptation. A direct ablation comparison between base GLM4-9B and fine-tuned GLM4-9B is necessary to substantiate the claimed contribution of LoRA-based domain adaptation.
Response to Reviewer 3, Round2, Comment Q1:
We have replaced the GLM4-6B model used in the manuscript with GLM4-9B, which serves as the base model for CAEC_LLM, and have re-validated the experimental results accordingly. The contents involving this model in Table 3, as well as in Sections 5.1 and 5.2, have been revised accordingly (Lines 531–561, and lines 568-614 in the revised manuscript). Detailed modifications to the manuscript will be presented in the response to Comment Q2.
Reviewer 3, Round2, Comment Q2:
The manuscript explicitly states that no separate validation set was used, and that the reported results were derived from checkpoints obtained after fine-tuning on the entire training set. Even when using LoRA with standard hyperparameters, a validation set is typically required to select checkpoints, monitor overfitting, and ensure stability. Without a validation procedure, it is unclear how model selection was performed and whether test data may have indirectly influenced training decisions. This represents a significant methodological weakness.
Response to Reviewer 3, Round2, Comment Q2:
In our previous model training work, we indeed did not set aside a separate validation set. We have now addressed this issue by explicitly stating in the fourth paragraph of Section 4.1 (lines 438-439 of the revised manuscript) that 30% of the training data was partitioned as a validation set during model training. Accordingly, we have retrained the model and re-validated the experimental results under this new data split, and have updated Table 3, Figures 7 and 8, as well as the corresponding paragraphs in Section 5.1 (Lines 531–561, and lines 568-614 in the revised manuscript).
The original table:
Table 3. Comparison of different models for crime address extraction.
|
Metrics |
CAEC_LLM |
GLM4 |
Qwen2.5 |
Qwen3 |
|
Total Test-Set Addresses |
972 |
972 |
972 |
972 |
|
Total Model-Predicted Addresses |
1003 |
993 |
829 |
898 |
|
Incorrectly Identified Addresses |
218 |
251 |
245 |
274 |
|
Correctly Identified Addresses |
785 |
742 |
584 |
624 |
|
Under-identified Addresses |
187 |
230 |
388 |
348 |
|
Over-identified Addresses |
218 |
251 |
245 |
274 |
|
Precision |
0.78 |
0.75 |
0.7 |
0.69 |
|
Recall |
0.81 |
0.76 |
0.6 |
0.64 |
|
F1 Score |
0.79 |
0.76 |
0.65 |
0.67 |
The revised table:
Table 3. Comparison of different models for crime address extraction.
|
Metrics |
CAEC_LLM |
GLM4 |
Qwen2.5 |
Qwen3 |
|
Total Test-Set Addresses |
972 |
972 |
972 |
972 |
|
Total Model-Predicted Addresses |
1003 |
993 |
829 |
898 |
|
Incorrectly Identified Addresses |
218 |
251 |
245 |
274 |
|
Correctly Identified Addresses |
785 |
742 |
584 |
624 |
|
Under-identified Addresses |
187 |
230 |
388 |
348 |
|
Over-identified Addresses |
218 |
251 |
245 |
274 |
|
Precision |
0.78 |
0.75 |
0.7 |
0.69 |
|
Recall |
0.81 |
0.76 |
0.6 |
0.64 |
|
F1 Score |
0.79 |
0.76 |
0.65 |
0.67 |
The original paragraph(Section5.1,Passage3-6,and Figure 6):
Figure 6 illustrates the performance of various models in recognizing different types of addresses. At the overall performance level, the CAEC_LLM model demonstrated its capability in address recognition tasks, achieving the highest F1 scores in more than half of the categories (C2, C3, C4, C5), with near-perfect performance particularly in C2 and C3. In contrast, the GLM4 and Qwen series models exhibited a dis-tinct "specialization" characteristic, achieving top-tier performance in certain specific categories while showing significant shortcomings in others.
Figure 6. Address extraction results of different types of addresses.
Analyzing classification performance across different categories, C2 (House Number Addresses) and C4 emerged as the domains where all models performed most excellently. In the C2 category, all models demonstrated exceptional precision, with CAEC_LLM, GLM4, and Qwen2.5 all achieving a perfect score of 1.00, indicating they almost never misclassified non-house-number texts. Although differences existed in recall rates, all models achieved F1 scores above 0.75 in this category, suggesting that recognizing clearly formatted numerical identifiers is a strength of these LLMs. Similarly, in the C4 category, CAEC_LLM and GLM4 also achieved perfect precision scores, further validating the models' capability in classifying addresses with distinct characteristics.
In comparison, C3 (Road/Street Segments) and C8 categories remain highly challenging domains, though the nature of challenges differs. In the C3 category, CAEC_LLM simultaneously achieved both the highest precision (1.00) and recall (0.83), significantly outperforming other models. Meanwhile, GLM4 and the Qwen series collectively struggled in this category, with both precision and recall rates remaining low, indicating their general lack of effective capability in distinguishing road information. For the C8 category, the results are more complex: all models maintained a constant recall rate of 0.29, meaning a substantial number of such addresses were systematically misclassified. Simultaneously, except for Qwen2.5, all other models showed extremely low precision (all below 0.25), revealing a common issue: models tend to hastily assign a small number of difficult-to-judge samples to C8, but with very poor judgment accuracy.
For the C7 category, all three models except CAEC_LLM (GLM4, Qwen2.5, Qwen3) achieved perfect precision (1.00). However, their recall rates showed significant differences, with GLM4 achieving the highest recall (0.77) while maintaining high precision, thus obtaining the highest F1 score (0.87) in this category. Similarly, in the C6 category, GLM4 achieved perfect precision (1.00) but moderate recall (0.71), while CAEC_LLM adopted a more balanced strategy, trading a slight precision concession (0.73) for the highest recall (0.84). These two strategies ultimately resulted in similar F1 performances.
The revised paragraph(Section5.1,Passage7-10,and Figure 7):
Figure 7 illustrates the performance of various models in recognizing different types of addresses. At the overall performance level, the CAEC_LLM model demonstrated strong capability, achieving the highest F1 scores in categories C2, C3, and C5. However, GLM4 exhibited superior overall performance, attaining the highest F1 scores in C4, C6, and C7, while also leading in precision across multiple categories. In contrast, the Qwen series models showed a distinct "specialization" characteristic, with Qwen2.5 achieving the top F1 score in C1 and Qwen3 showing competitive precision in some instances, though both exhibited significant shortcomings in other categories.
Figure 7. Address extraction results of different types of addresses.
Analyzing classification performance across different categories, C2 (House Number Addresses) and C4 emerged as domains where models performed relatively well. In the C2 category, CAEC_LLM led with the highest F1 score (0.90), while GLM4 and Qwen2.5 achieved perfect precision (1.00) in the C4 category, validating the models' capability in classifying addresses with distinct characteristics.
In comparison, C3 (Road/Street Segments) and C8 remain highly challenging do-mains, though the nature of challenges differs. In the C3 category, CAEC_LLM significantly outperformed other models, achieving both the highest precision (0.84) and F1-Score (0.78). Meanwhile, GLM4 and the Qwen series collectively struggled, with precision and recall remaining notably low. For the C8 category, results were mixed: Qwen2.5 achieved notably high precision (0.67) but moderate recall (0.29), while other models showed both low precision and recall, indicating systematic difficulty in accurately identifying these addresses
In the C7 category, all three models except CAEC_LLM (GLM4, Qwen2.5, Qwen3) achieved perfect precision (1.00). Their recall rates showed significant differences, with GLM4 achieving the highest recall (0.79) while maintaining high precision, thus obtaining the highest F1 score (0.89) in this category. Similarly, in the C6 category, GLM4 again achieved perfect precision (1.00) but moderate recall (0.73), while CAEC_LLM adopted a more balanced strategy, trading a slight precision concession (0.73) for the highest recall (0.82). This trade-off resulted in GLM4 ultimately achieving a higher F1 score (0.85) compared to CAEC_LLM (0.77).
The original paragraph(Section5.2,Passage2-6,and Figure 7):
Figure7 illustrates the performance of various models in classifying different types of addresses. In terms of overall performance, the CAEC_LLM model demonstrates a significant advantage, achieving the highest or nearly highest F1 scores in most categories (C2, C4, C5, C6, C7), highlighting its robust comprehensive address recognition capability. In contrast, GLM4 and the two Qwen models exhibit considerable performance fluctuations and instability across different tasks.
Analyzing the models' performance across categories, all models face substantial challenges in C3 (Road/Street Segments) and C8. The C3 category is one of the most unstable across all models; For instance, GLM4 achieves a recall of 0.42, significantly higher than its precision of 0.12, while both Qwen2.5 and Qwen3 record precision values below 0.18 in this category, indicating the models' difficulty in consistently and accurately classifying road information. The results for the C8 category are even more severe: Qwen2.5 and Qwen3 completely fail to effectively recognize addresses in this category (F1 = 0), and although CAEC_LLM and GLM4 manage to identify some, their extremely low precision values (0.25 and 0.04, respectively) suggest a tendency to mis-classify a large number of ambiguous addresses into this category, leading to severe over-identification.
In comparison, C2 (House Number Addresses) and C6 categories are domains where models perform relatively stable and excellently. In the C2 category, CAEC_LLM achieves the highest F1 score (0.75), significantly outperforming other models, which validates the strong capability of all models in recognizing addresses with explicit numerical identifiers. Similarly, CAEC_LLM leads in the C6 category (F1 = 0.72), while other models show significant divergence in recall rates (GLM4 only 0.21 compared to CAEC_LLM's 0.78).
Figure 7. Address classification results of different types of addresses.
For the C7 category, the models' performance reveals their limitations in handling ambiguous samples. CAEC_LLM maintains high precision (0.67) and recall (0.63), demonstrating robust performance. However, Qwen2.5 completely fails to recognize addresses in this category, and although Qwen3 achieves the highest precision (0.69), its recall is extremely low (0.04). This indicates that, aside from CAEC_LLM, other models either entirely fail to capture the characteristics of the C7 category or become overly conservative to ensure correctness, missing the vast majority of true samples.
Similarly, in the C1 and C5 categories, GLM4 and Qwen3 achieve the highest re-call rates (0.67) in C5, but their precision values are notably low (0.17 and 0.28, respectively). However, this strategy does not prove effective in the C1 category, where all models fail to exceed an F1 score of 0.32, suggesting that C1 itself poses an inherent challenge, and the models have yet to identify an effective recognition pattern
Overall, the results indicate that our proposed address extraction and classification model CAEC_LLM, which combines rule-guided category systems with structured and example-enhanced knowledge injection, significantly enhances the spatial interpretability and analytical value of judicial documents data. This innovation enables LLMs to progress from simple text recognition to reliable, scale-aware spatial reasoning, bridging the gap between NLP and geospatial crime analysis.
The revised paragraph(Section5.2,Passage2-7,and Figure 8):
Figure 8 illustrates the performance of various models in classifying different types of addresses. In terms of overall performance, the CAEC_LLM model demonstrates a significant advantage, achieving the highest or nearly highest F1 scores in most categories (C2, C4, C5, C6, C7), highlighting its robust comprehensive address recognition capability. In contrast, GLM4 and the two Qwen models exhibit considerable performance fluctuations and instability across different tasks.
Analyzing the models' performance across categories, all models face substantial challenges in C3 (Road/Street Segments) and C8. The C3 category is one of the most unstable across all models. For instance, CAEC_LLM achieves a recall of 0.36 and precision of 0.42, while GLM4 shows a precision of only 0.11 despite a recall of 0.31. Both Qwen2.5 and Qwen3 record precision values below 0.18 in this category (0.11 and 0.18, respectively), indicating the models' difficulty in consistently and accurately classifying road information. The results for the C8 category are even more severe: Qwen2.5 and Qwen3 completely fail to effectively recognize addresses in this category (F1 = 0), and although CAEC_LLM and GLM4 manage to identify some (recalls of 0 and 0.29, respectively), their extremely low precision values (0 for CAEC_LLM and 0.05 for GLM4) suggest a tendency to misclassify a large number of ambiguous addresses into this category, leading to severe over-identification. Notably, CAEC_LLM fails to correctly identify any C8 instance, resulting in zero precision and recall.
In comparison, C2 (House Number Addresses) and C6 categories are domains where models perform relatively stable and excellently. In the C2 category, CAEC_LLM achieves the highest F1 score (0.74), significantly outperforming other models (GLM4: 0.50, Qwen2.5: 0.32, Qwen3: 0.56), which validates the strong capability of all models in recognizing addresses with explicit numerical identifiers. Similarly, CAEC_LLM leads in the C6 category (F1 = 0.69), while other models show significant divergence in recall rates (GLM4 only 0.23 compared to CAEC_LLM's 0.73).
Figure 8. Address classification results of different types of addresses.
For the C7 category, the models' performance reveals their limitations in handling ambiguous samples. CAEC_LLM maintains high precision (0.66) and recall (0.58), demonstrating robust performance. However, Qwen2.5 completely fails to recognize addresses in this category (precision = 0, recall = 0), and although Qwen3 achieves the highest precision (0.69), its recall is extremely low (0.04). This indicates that, aside from CAEC_LLM, other models either entirely fail to capture the characteristics of the C7 category or become overly conservative to ensure correctness, missing the vast majority of true samples.
Similarly, in the C1 and C5 categories, GLM4 and Qwen3 achieve relatively high recall rates (0.80 for GLM4 in C5 and 0.67 for Qwen3 in C5), but their precision values are notably low (0.23 and 0.28, respectively). However, this strategy does not prove effective in the C1 category, where all models fail to exceed an F1 score of 0.33, suggesting that C1 itself poses an inherent challenge, and the models have yet to identify an effective recognition pattern.
Overall, the results indicate that our proposed address extraction and classification model CAEC_LLM, which combines rule-guided category systems with structured and example-enhanced knowledge injection, significantly enhances the spatial interpretability and analytical value of judicial documents data. This innovation enables LLMs to progress from simple text recognition to reliable, scale-aware spatial reasoning, bridging the gap between NLP and geospatial crime analysis.
Reviewer 3, Round2, Comment Q3:
The annotation protocol lacks key details necessary to assess dataset reliability. Specifically, the manuscript does not report:
The number of annotators involved
Whether annotation was performed independently or collaboratively
Any inter-annotator agreement metrics (e.g., Cohen's κ or F1)
Annotation guidelines or quality control procedures
Given that the fine-tuned model depends entirely on these 1,889 annotated documents, the absence of reliability metrics raises concerns about the validity of the ground truth labels.
Response to Reviewer 3, Round2, Comment Q3:
The annotation of the dataset was independently conducted by four undergraduate students majoring in geographic information science. For each judicial document, annotators first extracted paragraphs describing the criminal act, then identified the corresponding crime time and location, and determined the crime type (e.g., theft, robbery) based on the narrative context. Following the priority-based classification rule proposed in this paper(Lines 348-366), each address was assigned to one of eight spatial categories. All extracted information was then structured into JSON format. The annotation process relied primarily on keyword identification and descriptive logic, requiring no complex subjective judgment or domain expertise. Therefore, individuals with basic spatial comprehension skills can perform this task after brief training, eliminating the need for elaborate quality control procedures.
We have strengthened the discussion on this aspect in the fourth paragraph of Section 4.2 (Lines 413-420 of the revised manuscript).
The annotation process was conducted by four undergraduate students majoring in geographic information science. Before commencing the work, we first defined a "crime address" as the specific geographical location where the criminal act was explicitly described as occurring in the judicial document narrative. This definition helped annotators distinguish the crime scene from other mentioned locations, such as the defendant's registered residence, arrest location, or court address. Prior to starting, all annotators participated in a training session. We provided a detailed annotation guideline covering: (1) this definition; (2) the eight-category classification scheme with examples (as shown in Table 1); (3) the deterministic, priority-based rule for resolving category ambiguity (Lines 348-366 of the manuscript); and (4) practice on 50 sample documents, followed by a group discussion to align their understanding. For each judicial document, annotators first extracted paragraphs describing the criminal act, then identified the corresponding crime time and location, and determined the crime type (e.g., theft, robbery) based on the narrative context. Following the priority-based classification rule proposed in this paper (Lines 348-366), each address was assigned to one of eight spatial categories. All extracted information was then structured into JSON format. The annotation process relied primarily on keyword identification and descriptive logic, requiring no complex subjective judgment or domain expertise. Therefore, individuals with basic spatial comprehension skills can perform this task after brief training, eliminating the need for elaborate quality control procedures. Finally, we conducted a comprehensive review of all data, including checking for logical consistency and ensuring that extracted addresses correctly matched the crime narrative.
We have strengthened the discussion on this aspect in the fourth paragraph of Section 4.2 (Lines 413-431 of the revised manuscript).
The original paragraph:
To construct a high-quality manually annotated dataset for model development and evaluation under feasible resource constraints, we adopted a stratified sampling strategy. We limited the time range to the period from 2011 to 2021. This 11-year window was selected because it represents a period of sustained growth in the online publication of judicial documents after the platform's launch, ensuring data availability and relative format consistency. Within this period, we aimed to extract representative samples for annotation. To obtain cross-year samples, we conducted sampling by year, randomly selecting 200 judicial documents from each year between 2011 and 2021 for manual verification and annotation. We recruited several volunteers from the field of geographic information to carry out this work. During this process, we removed documents from the sampled dataset that could not undergo reliable address recognition (such as cybercrimes involving account transactions and crimes occurring on public transportation vehicles). Ultimately, we identified 1,889 judicial documents. This sample size represents a balance between the high cost of manual annotation and the need to build a sufficiently large dataset for training and robustly evaluating complex LLMs. These documents originate from courts at various levels across mainland China. From the 1,889 annotated documents, we performed a fixed split: 1,389 documents were used for training CAEC_LLM, and the remaining 500 were reserved as an independent test set. These documents were manually annotated to identify crime locations and categorize each address into the corresponding category. The annotated data is stored in a structured text format, facilitating its submission to LLMs for training and accuracy evaluation. For simplicity, and given the relative stability of LoRA performance under standard hyperparameters, we did not use a separate validation set for hyperparameter tuning; the reported results are derived from model checkpoints o-tained after fine-tuning on the entire training set. After manual annotation, these 500 test documents contain a total of 972 distinct crime location address entities, which constitute the address-level ground truth for evaluation.
The revised paragraph:
To construct a high-quality manually annotated dataset for model development and evaluation under feasible resource constraints, we adopted a stratified sampling strategy. We limited the time range to the period from 2011 to 2021. This 11-year window was selected because it represents a period of sustained growth in the online publication of judicial documents after the platform's launch, ensuring data availability and relative format consistency. Within this period, we aimed to extract representative samples for annotation. To obtain cross-year samples, we conducted sampling by year, randomly selecting 200 judicial documents from each year between 2011 and 2021 for manual verification and annotation. The annotation process was conducted by four undergraduate students majoring in geographic information science. To ensure the consistency and reliability of the annotations, a structured workflow with clear guidelines was established. First, during a pre-annotation training session, the annotators were provided with a detailed annotation guideline. This guideline explicitly defined a "crime address" as the specific geographical location where the criminal act itself occurred, as described in the narrative, distinguishing it from other locations such as the defendant's residence or the court. The guideline also elaborated on the eight-category classification scheme (Table 1) and the deterministic, priority-based rule for resolving potential category ambiguity (Lines 348-366). Annotators then practiced on a set of 50 sample documents to align their understanding and address any questions. Second, following the training, each of the 1,889 sampled judicial documents was independently annotated by one of the four students. For each document, the annotator was required to: (i) locate and read the paragraph(s) describing the criminal act; (ii) identify and extract the exact text span corresponding to the crime address; (iii) assign the extracted address to one of the eight spatial categories by applying the priority-based classification rule; and (iv) structure the extracted information into JSON format. Finally, all annotation results underwent a comprehensive review. This review involved cross-checking each extracted address against the original document's narrative to ensure logical consistency and accuracy. During this process, we removed documents from the sampled dataset that could not undergo reliable address recognition (such as cyber-crimes involving account transactions and crimes occurring on public transportation vehicles). Ultimately, we identified 1,889 judicial documents. This sample size represents a balance between the high cost of manual annotation and the need to build a sufficiently large dataset for training and robustly evaluating complex LLMs.
Reviewer 3, Round2, Comment Q4:
The authors state that documents "that could not undergo reliable address recognition" (e.g., cybercrimes, crimes occurring on public transportation) were removed after sampling. However, no quantitative breakdown is provided regarding how many documents were excluded or how this affected year-wise balance. Excluding more complex or ambiguous cases may artificially inflate performance and reduce generalizability. The implications of this post-sampling exclusion should be explicitly addressed.
Response to Reviewer 3, Round2, Comment Q4:
In response to the reviewer's suggestion, we have provided a detailed quantitative analysis of the sampling and exclusion processes, accompanied by comprehensive distribution comparisons to demonstrate the representativeness of the final dataset.
From an initial pool of 835,693 first-instance theft-related judicial documents (2011–2021), we employed a stratified sampling strategy by year, randomly selecting 200 documents per year (2,200 documents in total) for manual annotation. During the annotation process, we removed documents that lacked conventional, geocodable crime scene addresses (e.g., cybercrimes, crimes occurring on moving vehicles). Table 1 presents the year-wise distribution of the documents retained after exclusion.
Table 1. Annual distribution of the final dataset (1,889 documents)
|
Year |
Count |
Percent (%) |
|
2011 |
177 |
9.37 |
|
2012 |
177 |
9.37 |
|
2013 |
174 |
9.21 |
|
2014 |
174 |
9.21 |
|
2015 |
173 |
9.16 |
|
2016 |
167 |
8.84 |
|
2017 |
172 |
9.11 |
|
2018 |
167 |
8.84 |
|
2019 |
168 |
8.89 |
|
2020 |
176 |
9.32 |
|
2021 |
164 |
8.68 |
As shown in Table 1, the final dataset maintains a relatively balanced annual distribution (164–177 documents per year), ensuring temporal representativeness across the entire 11-year study period.
To demonstrate that the sampled dataset retains the geographic characteristics of the population, we compared the spatial distribution of the full dataset (835,693 documents) with the final annotated dataset (1,889 documents) based on the ‘Region’ field in the document dataset. Table 2 presents the distribution of the top cities by document count in both datasets.
Table 2. Geographic distribution comparison: Full dataset and annotated dataset
|
Full Dataset (835,693 documents) |
Annotated Dataset (1,889 documents) |
||||
|
City |
Count |
Percent (%) |
City |
Count |
Percent (%) |
|
上海市/Shanghai |
29928 |
3.67 |
杭州市/Hangzhou |
85 |
4.50 |
|
重庆市/Chongqing |
23289 |
2.86 |
上海市/Shanghai |
66 |
3.49 |
|
杭州市/Hangzhou |
11944 |
1.47 |
慈溪市/Cixi |
62 |
3.28 |
|
成都市/Chengdu |
10942 |
1.34 |
重庆市/Chongqing |
58 |
3.07 |
|
北京市/Beijing |
10091 |
1.24 |
宁波市/Ningbo |
51 |
2.70 |
|
佛山市/Foshan |
8733 |
1.07 |
成都市/Chengdu |
48 |
2.54 |
|
东莞市/Dongguan |
8549 |
1.05 |
广州市/Guangzhou |
41 |
2.17 |
|
广州市/Guangzhou |
7922 |
0.97 |
深圳市/Shenzhen |
30 |
1.59 |
|
宁波市/Ningbo |
7297 |
0.9 |
西安市/Xi’an |
27 |
1.43 |
|
天津市/Tianjin |
7227 |
0.89 |
北京市/Beijing |
26 |
1.38 |
|
深圳市/Shenzhen |
7121 |
0.87 |
南京市/Nanjing |
25 |
1.32 |
|
苏州市/Suzhou |
6867 |
0.84 |
桂林市/Guilin |
16 |
0.85 |
|
西安市/Xi’an |
6237 |
0.77 |
苏州市/Suzhou |
15 |
0.79 |
|
武汉市/Wuhan |
6193 |
0.76 |
南宁市/Nanning |
14 |
0.74 |
|
南京市/Nanjing |
5466 |
0.67 |
常州市/Changzhou |
14 |
0.74 |
|
南宁市/Nanning |
5310 |
0.65 |
武汉市/Wuhan |
13 |
0.69 |
|
昆明市/Kunming |
4925 |
0.6 |
唐山市/Tangshan |
13 |
0.69 |
|
温州市/Wenzhou |
4853 |
0.6 |
绍兴市/Shaoxing |
13 |
0.69 |
|
长沙市/Changsha |
4531 |
0.56 |
太原市/Taiyuan |
12 |
0.64 |
|
柳州市/Liuzhou |
4445 |
0.55 |
柳州市/Liuzhou |
12 |
0.64 |
|
常州市/Changzhou |
4441 |
0.55 |
东莞市/Dongguan |
11 |
0.58 |
|
中山市/Zhongshan |
4372 |
0.54 |
厦门市/Xiamen |
10 |
0.53 |
|
绍兴市/Shaoxing |
4306 |
0.53 |
佛山市/Foshan |
10 |
0.53 |
|
义乌市/Yiwu |
4014 |
0.49 |
湖州市/Huzhou |
10 |
0.53 |
The comparison reveals that the geographic distribution of the annotated dataset is largely consistent with that of the full dataset, with major cities (Shanghai, Chongqing, Hangzhou, Chengdu, and Beijing) being well-represented in both. However, despite our stratified sampling preserving the geographic characteristics of the full dataset, the annotated dataset contains a relatively higher proportion of samples from large metropolitan areas. We acknowledge that this over-representation of major cities may introduce systematic bias. Consequently, CAEC_LLM may perform better on standardized urban address formats but could be less robust when applied to documents from smaller cities or rural areas, where address descriptions often follow different regional conventions. To address this, we have added a discussion of this limitation in Section 6.4 (Lines 716-732 of the revised manuscript).
The revised paragraph:
Furthermore, a specific sampling bias warrants discussion. Although our stratified sampling strategy preserved the rank-order of major cities from the full dataset, the final annotated dataset contains a relatively higher proportion of samples from large metropolitan areas such as Shanghai, Hangzhou, and Chongqing. This over-representation of major cities may introduce a systematic bias into the model. Consequently, CAEC_LLM may be more adept at recognizing and classifying address patterns commonly found in densely populated urban centers, which tend to have more standardized and structured address formats (e.g., precise house numbers, well-defined residential compounds). Conversely, its performance could be less robust when applied to judicial documents from smaller cities, towns, or rural areas, where address descriptions may be less formal, rely more on local landmarks, or follow different regional conventions. To mitigate this limitation in future work, we plan to employ a more geographically stratified sampling strategy that explicitly balances representation across different city tiers and rural areas, or to collect targeted data from underrepresented regions to fine-tune a more robust and generalizable model.
The test set was created through completely random sampling of the annotated documents to ensure that its distribution remains similar to that of the training set and the original data. Table 3 presents the address category distribution in the training and test sets.
Table 3. Address category distribution: Training set and Test set
|
Category |
Trainset (Count/Precent%) |
Testset (Count/Precent%) |
|
C6_Institutions/Facilities/Residential Areas |
1052 (40.34%) |
383 (39.40%) |
|
C2_House Number Addresses |
654 (25.08%) |
245 (25.21%) |
|
C7_Vaguely Location Descriptions |
521 (19.98%) |
209 (21.50%) |
|
C4_Transportation Hubs |
114 (4.37%) |
40 (4.12%) |
|
C1_Administrative Units |
108 (4.14%) |
37 (3.81%) |
|
C5_Open Areas |
76 (2.91%) |
15 (1.54%) |
|
C3_Road/Street Segments |
59 (2.26%) |
36 (3.70%) |
|
C8 |
24 (0.92%) |
7 (0.72%) |
As shown in Table 3, the training and test sets exhibit highly consistent category distributions across all eight address types. Minor variations fall within the normal range of fluctuation expected for randomly sampled subsets and do not indicate systematic bias.
These data confirm that, although we excluded approximately 14% of the sampled documents, the retained 1,889 documents maintain strong representativeness across temporal, geographic, and categorical dimensions. Therefore, the model’s performance metrics should be interpreted as valid evaluation results for the domain of conventional, geocodable crime addresses in Chinese judicial documents.
As supporting materials for the dataset in this article, we have organized this portion of the data and discussion into Supplementary Material 1, which is submitted along with the manuscript.
Reviewer 3, Round2, Comment Q5:
Although a GitHub repository has been provided, it contains inference scripts and small annotated subsets but does not include:
The fine-tuned GLM4-9B checkpoint or LoRA adapter files
The training script used for fine-tuning
Hyperparameter configurations or training logs
Without access to the fine-tuned weights or complete training details, the claimed improvements cannot be independently verified.
Response to Reviewer 3, Round2, Comment Q5:
We have uploaded the fine-tuned model checkpoints to the GitHub repository accompanying this paper:
https://github.com/letdo1945/Crime_Address_Extraction_and_Classification_Based_on_LLM_Data
Specifically, the files training_args.yaml and llamaboard_config.yaml located in the directory GLM4-9B/lora/train_2026-02-19-16-37-12 contain the detailed training configurations and hyperparameter settings used in the fine-tuning process.
The LoRA fine-tuning scripts used in this study are based on the LLaMa Factory framework, as noted in the final paragraph of Section 4.1.
Reviewer 3, Round2, Comment Q6:
The original manuscript included a qualitative discussion of misclassified cases (e.g., missing address elements, semantic segmentation errors, hallucinations). In response to a request for supporting visuals or tabular evidence, this section has been removed entirely. While acknowledging the insufficiency of the prior discussion is appreciated, eliminating the section weakens interpretability. For LLM-based extraction tasks, systematic qualitative error analysis is important for understanding model behavior and practical deployment risks. Strengthening rather than removing this analysis would improve the manuscript.
Response to Reviewer 3, Round2, Comment Q6:
We have supplemented the discussion on misclassification cases in paragraphs 3–6 of Section 5.1 (lines 498–530 of the revised manuscript, along with Table 4).
The revised paragraph:
To further evaluate the performance of CAEC_LLM in address extraction, we conducted an in-depth analysis of error cases based on the precision results. These errors can be broadly categorized into the following types: missing address elements, semantic segmentation errors, and hallucinations of LLMs.
Missing Address Elements: As shown in the table 4, this type of error is common in the LLMs’ performance on the dataset. Specifically, the LLM tends to omit certain elements of an address during recognition. These omissions may include information irrelevant to the standard address format or, more critically, the absence of key address components.
Table 4. Examples of missing address elements
|
Correct Addresses |
CAEC_LLM Extracted Addresses |
|
江苏省南京市浦口区××号凤悦天晴花园小区××室 |
南京市浦口区凤悦天晴花园小区×号楼×室 |
|
Room ××, Fengyue Tianqing Garden, ××, Pukou District, Nanjing City, Jiangsu Province |
Room ×, Building ×, Fengyue Tianqing Garden, Pukou District, Nanjing City |
|
江苏省南京市六合区程桥街道XX小区XX幢XX室 |
南京市六合区程桥街道XX小区XX幢XX室丁某的毛坯房 |
|
Room XX, Building XX, XX Residential Quarter, Chengqiao Subdistrict, Luhe District, Nanjing City, Jiangsu Province |
Unfinished apartment of Ding Mou, Room XX, Building XX, XX Residential Quarter, Chengqiao Subdistrict, Luhe District, Nanjing City |
|
南京市六合区横梁街道新篁中心社区XX号 |
南京市六合区横梁街道XX号 |
|
XX, Xinhang Central Community, Hengliang Subdistrict, Luhe District, Nanjing City |
XX, Hengliang Subdistrict, Luhe District, Nanjing City |
Semantic Segmentation Errors: Format-related errors often occur because judicial documents may simplify the description of multiple addresses on the same street due to a suspect’s repeated criminal activities. For example, in the sentence: "The defendant Li committed theft at the Shaxuan Hair Salon on Wenhong Road, Jiangning District, Nanjing, and at Room 1101, Building 9, Shimao Dream Home," the model may incorrectly interpret "Shaxuan Hair Salon, Shimao Dream Home, Building 9, Room 1101" as a single address. This type of error stems from the failure of ambiguity resolution in natural language processing. Legal language and address description methods in judicial documents (such as using commas to separate multiple addresses) often fall out-side the scope of the model’s conventional training data, leading to parsing inaccuracies.
Hallucinations in LLMs: In the test set, a small number of errors were caused by model hallucinations. Although such issues are infrequent, they remain noteworthy. For instance, in the text: "At around 21:00 on January 31, 2020, the defendant Xiang entered a store in Nanjing by using a key hidden by an employee under the roller shutter door and stole five marinated pig trotters worth RMB 200," the only clue regarding the crime location is "Nanjing." However, influenced by its training data, the model tends to generate more complete addresses and thus outputs "Nanjing Xinjiekou 'Jiujiuya' Duck Neck Shop" as the crime location, despite the absence of any related text in the document. Hallucinations in LLMs primarily stem from their generative nature. By learning statistical patterns from massive text data, the model predicts the most probable next word or phrase. When encountering ambiguous, incomplete, or creatively fillable information, the model "guesses" based on learned patterns and generates seemingly plausible but factually incorrect content.
Reviewer 3, Round2, Comment Q7:
Although the authors have clarified that all experiments were conducted on original Chinese judicial texts and that the examples shown are English translations of real cases, the manuscript still does not provide a concrete in-context example from an actual judicial document. Given that the task depends heavily on the linguistic structure and legal phrasing of Chinese text, it would substantially improve transparency to include at least one anonymized excerpt showing the original Chinese sentence, the annotated address span, and the model's predicted output. Referring readers to a file in the repository does not adequately substitute for a clearly presented illustrative example within the manuscript itself.
Response to Reviewer 3, Round2, Comment Q7:
We have enriched the description of judicial text content in the second paragraph of Section 4.1 (lines 392–396 of the revised manuscript). Additionally, Figure 6 now presents an anonymized Chinese excerpt from a judicial document alongside its English translation. As for the model's prediction output, corresponding examples have already been illustrated in Figure 5. Meanwhile, we have also modified the prompt engineering example in Figure 5; now, both the original Chinese text and its corresponding English translation are available for viewing.
The revised paragraph:
Figure 6 presents an anonymized excerpt from the core content of the "Main Text" section in a judicial document, along with its English translation. This example is derived from a typical pickpocketing case judgment in the dataset used in this study. The original text is in Chinese, with the crime address entity that the model is expected to identify underlined.
Figure 6. Example of a document summary in the "Main Text" section.
The revised Figure5:
Figure 5. Prompt template structure with sample input-output pair.
Author Response File:
Author Response.pdf

