1. Introduction
In the context of globalization, the increasing frequency of international trade has positioned China Customs as a critical regulatory authority for import and export goods, tasked with safeguarding national security, maintaining social stability, and promoting economic development. The inspection of imported food and pharmaceuticals is particularly vital because it directly affects public health. However, disparities in the legal status of certain substances across countries pose significant challenges to customs inspections. For instance, Canada legalized recreational cannabis on 17 October 2018 under the Cannabis Act, becoming the second country globally to do so [1], whereas in China, cannabis and its derivatives (e.g., tetrahydrocannabinol, THC) are classified as strictly controlled substances under the Regulations on the Control of Narcotic Drugs and Psychotropic Substances, which prohibit their import and use [2]. Similarly, melatonin is widely available as an over-the-counter dietary supplement in the United States, but in China, it is categorized as a pharmaceutical requiring approval from the National Medical Products Administration (NMPA) for importation. These legal discrepancies are reflected in real-world cases. For example, in 2022, Hong Kong, China intercepted a shipment of imported food from Canada containing cannabidiol (CBD), a substance legally used as a health supplement in Canada but classified as a narcotic in China [3]. These incidents underscore the complex dilemmas customs face in addressing regulatory differences between nations.
Customs can identify prohibited substances in food when the substances’ characteristics are known and detection methods are established. For emerging or unknown substances, however, limited understanding of their functions and mechanisms makes determining their legality exceedingly difficult. This challenge arises for three reasons. First, the variety and rapid evolution of illicit substances complicate detection. With advances in chemical synthesis, new psychoactive substances (NPSs) continue to emerge: the United Nations Office on Drugs and Crime (UNODC) reported in its 2020 World Drug Report that over 1000 NPSs have been identified globally, and their complex and ever-changing chemical structures outpace customs’ capacity to keep up [4]. Second, the limitations of detection methods exacerbate the issue. Identifying different substances relies on specialized technologies and equipment, such as high-performance liquid chromatography–mass spectrometry (HPLC-MS) for cannabis components or the more sensitive gas chromatography–mass spectrometry (GC-MS) for fentanyl derivatives, which are costly to operate and maintain and require trained personnel, increasing the burden of inspection. Third, illicit substances are easy to conceal, as offenders often mix them into legitimate goods to evade scrutiny; for example, the EU Drug Markets Report notes that drugs are frequently hidden in food or cosmetics, rendering them difficult to detect through routine checks [5].
Under these circumstances, we aim to develop a product-function-to-chemical mapping database to assist customs in rapidly identifying potentially non-compliant additives in cross-border goods by analyzing their marketed claims. For example, if a product claims to “whiten skin rapidly”, the database would map this function to substances like hydroquinone (a banned skin-lightening agent), allowing customs to prioritize lab testing for such chemicals. Chemical text data contain a wealth of knowledge, such as drug–drug interactions, diagnostic criteria for diseases, treatment protocols, and gene functions, offering irreplaceable value for clinical decision support, drug development, and medical research. In recent years, the rapid expansion of literature databases like PubMed and the widespread adoption of electronic health records (EHRs) have significantly increased the scale and complexity of unstructured chemical text. Attribute extraction, a core task in natural language processing (NLP), aims to extract specific information fragments, such as entities (e.g., genes, diseases, drugs) and their relationships, from unstructured text, providing structured support for downstream tasks like knowledge graph construction and information retrieval. Deep learning methods have been widely applied in the field of NLP [6]. However, the specialized terminology, dense jargon, structural variability, and implicit information in chemical text pose significant challenges to traditional attribute extraction methods. For instance, inferring potential drug side effects from clinical trial reports or extracting implicit gene–disease associations from research papers often requires deep semantic understanding and reasoning beyond simple pattern matching or entity recognition.
Building on this customs-oriented motivation, this study formalizes chemical-attribute extraction as a generative QA task. We adopt BioBART as the base model and optimize it via IRL. Specifically, we first fine-tune BioBART on public biomedical QA datasets to adapt it to QA tasks and then further train it on a custom attribute extraction QA dataset. The primary contributions of this study are as follows: (1) by reframing attribute extraction as a QA task, we leverage generative models to enhance implicit attribute extraction and address the limitations of extractive methods in handling distributed information; and (2) we design domain-specific sub-reward components for QA and employ IRL to automatically weight them using expert demonstrations, thereby improving answer quality for chemical-attribute extraction.
2. Related Work
Traditional attribute extraction methods primarily rely on rule-based matching or feature-engineered machine learning models. While effective for explicit information, these approaches struggle with implicit attributes. Rule-based and feature-engineered methods usually depend on manually designed dictionaries, trigger words, or shallow syntactic patterns; therefore, they are sensitive to terminology variation and often fail when an answer must be inferred from multiple sentences rather than copied from a single span. The advent of deep learning has improved performance through neural network-based methods, such as Bidirectional Encoder Representations from Transformers (BERT)-based sequence labeling models. Zhang et al. proposed a method for biochemical text information extraction in few-shot scenarios [7]. However, the high computational cost and complexity of sequence labeling frameworks have prompted researchers to reframe information extraction as a question answering (QA) paradigm, leveraging the semantic comprehension capabilities of QA models to handle complex extraction tasks. For example, Qiu et al. proposed an extraction framework using QA templates [8], and Du and Cardie demonstrated effective event extraction by answering natural questions [9]. In the chemical domain, this methodology has gained traction, with studies such as Wang et al. proposing a multi-turn QA framework for biomedical event extraction [10], and Chen et al. developing a knowledge-enhanced reading comprehension framework for biomedical relation extraction [11], both achieving significant improvements over traditional methods. For hazardous-chemical identification, Cheng et al. proposed a domain-knowledge-graph-embedded text feature extraction model for hazardous-chemical recovery identification and attribute classification [12]. Their work emphasizes classification and feature extraction with domain knowledge, whereas our study focuses on open-ended QA-style attribute extraction and multi-objective reward optimization. These efforts highlight the potential of QA paradigms in addressing domain-specific linguistic challenges, particularly in the scientific literature, where studies by Sipila et al. and Dagdelen et al. have successfully extracted structured information from materials science texts using large language models (LLMs) [13,14].
To meet the unique demands of biomedical texts, specialized pretrained models have been developed. Lee et al. introduced BioBERT, a BERT-based model pretrained on the biomedical literature, significantly enhancing performance in tasks such as QA, named-entity recognition, and relation extraction [15]. Building on this, Yuan et al. proposed BioBART, a biomedical pretrained model based on the Bidirectional and Auto-Regressive Transformers (BART) architecture, optimized for generative tasks [16]. Pretrained on large-scale biomedical corpora, BioBART integrates bidirectional encoding and autoregressive decoding, making it well-suited for applications like QA, text summarization, and clinical report generation. While extractive QA models, such as BioBERT, excel at identifying explicit answer spans, they struggle to integrate fragmented answers or infer implicit attributes distributed across sentences. In contrast, generative QA models, leveraging language generation capabilities and contextual understanding, offer a promising solution to these challenges. However, generative models face critical issues, including hallucination (generating plausible but factually incorrect content, such as fabricating non-existent drug side effects) and exposure bias caused by teacher forcing during pretraining [17,18]. To mitigate these, reinforcement learning (RL) has been explored to optimize long-term rewards. Yet single-reward RL fails to capture multidimensional evaluation criteria, and manually balancing sub-reward weights in multi-reward RL becomes intractable as the number of components increases.
Inverse reinforcement learning (IRL) offers a novel solution by automatically inferring reward weights from expert demonstrations, circumventing manual design limitations. Recent studies have demonstrated IRL’s efficacy in multi-objective optimization for NLP tasks. For instance, Shi et al. introduced IRL to address reward sparsity and mode collapse in text generation [19], while Fu et al. applied IRL to text summarization, leveraging expert demonstrations to balance sub-rewards and improve sequence quality [20]. Ghosh et al. further extended IRL to tasks like Table-to-Text and program generation, demonstrating its versatility [21,22]. These advancements underscore IRL’s potential to enhance generative models by mitigating hallucination and optimizing multidimensional objectives.
3. Materials and Methods
The general workflow of CAESAR is shown in Figure 1. First, we fine-tune the BioBART model on public biomedical QA datasets to adapt it to QA tasks. Then, we build a custom attribute extraction QA dataset by crawling a large number of biochemistry texts from the web, cleaning them, constructing attribute-related questions, and annotating the answers. Subsequent training on this dataset is divided into two stages. In the first stage, the reward update stage, the weights of the QA model are fixed; the QA model takes the context and question as input and outputs predicted answers. Both the predicted answers and the annotated answers are fed into the reward model, whose weights are automatically adjusted by the IRL agent. In the second stage, the policy update stage, the reward model scores the predicted answers, and these scores guide the update of the QA model’s parameters.
3.1. Problem Formulation
The attribute extraction task in the chemical field aims to extract target attributes (such as genes, diseases, drugs, and their relationships) from unstructured text. To achieve this goal, we restructure it as a generative QA problem, leveraging the reasoning and generation capabilities of QA models to process the information.
Let the input chemical text be $D$ (a document composed of a sequence of words), and the target attribute be $A$ (the information to be extracted). We transform the task into a QA pair $(Q, A)$, where $Q$ is the natural language question designed for the attribute $A$ and $A$ is the reference answer provided by experts; the goal of the generative QA model is to generate the answer based on $D$ and $Q$.
Formally, the task of the generative QA model is defined as
$$\hat{A} = f_{\theta}(D, Q)$$
Here, $f$ is the generative model, $\theta$ denotes the model parameters, and $\hat{A}$ is the generated answer sequence. We aim for $\hat{A}$ to be as semantically close to $A$ as possible, that is
$$\theta^{*} = \arg\min_{\theta} L(\hat{A}, A)$$
Here, $L$ is the loss function (such as cross-entropy loss), which measures the difference between the generated answer and the reference answer.
In the chemical context, $A$ may be implicit. For example, given the text $D$ as shown in Table 1, if the target attribute $A$ is “the mechanism of action of Rosiglitazone Hydrochloride”, then $Q$ = “How does the drug mentioned in the text exert its effect?” and $A$ = “By upregulating the transcriptional level of PPAR$\gamma$ to improve pulmonary vascular remodeling”. The model needs to generate $\hat{A}$ based on contextual reasoning.
It should be noted that the proposed task is not formulated as a conventional closed-set classification problem. The attribute type determines the question template, but the answer is generated as a natural-language sequence and is not restricted to a predefined class label. In this study, the dataset contains ten attribute types, including four explicit attributes and six implicit attributes. Therefore, exact-match metrics are suitable for explicit attributes with clear answer boundaries, while generation-oriented metrics and human evaluation are required for implicit attributes involving cross-sentence reasoning or professional inference.
3.2. Fine-Tuning BioBART
To achieve high-quality generative QA for chemical-attribute extraction tasks, we selected BioBART as the base model. BioBART is a pretrained language model based on the BART architecture, specifically designed for the biomedical domain and pretrained on large-scale biomedical corpora, enabling it to capture domain-specific language patterns and semantic relationships. Since BioBART is pretrained using only a text-infilling task, we fine-tuned it with a QA dataset to adapt it to the QA task. To ensure performance, we chose the BioBART-large version, which contains approximately 400M parameters, providing sufficient expressive power.
We fine-tuned the model on two public datasets:
BioASQ [23]: It contains approximately 5000 QA pairs annotated by experts, covering factual questions (“What are the symptoms of a certain disease?”), list-based questions (“List the side effects of a drug”), and inferential questions (“How does Drug X treat Disease Y?”). These question types are highly relevant to the needs of the attribute extraction task.
PubMedQA [24]: It includes 1000 QA pairs based on PubMed abstracts, primarily consisting of “yes/no/maybe” questions (e.g., “Does Drug A cause side effect B?”) and also featuring long answers based on context.
Fine-tuning is supervised: at this stage, the model parameters are updated by Maximum Likelihood Estimation (MLE).
3.3. Question Construction for Attributes
In the research of transforming attribute extraction tasks into the QA paradigm, the quality of question construction directly affects the model’s performance and the accuracy of the answers. The primary task of constructing high-quality questions is to ensure that they can clearly guide the model to complete the task. The construction of questions should have clarity (the question must clearly point to the required attribute and avoid ambiguity), professionalism (the question should use standardized terminology from the chemical field), conciseness (the question should be as concise as possible and avoid redundancy), targetedness (each question should focus on a single attribute to avoid confusion), and answerability (the question should be based on information in the text or content that can be inferred). When constructing questions, attention should be paid to the expression of implicit attributes, guiding the model’s answers through questions. Given the professionalism of the chemical field, for synonyms or abbreviations that may appear in the text, standardized professional terminology should be used. At the same time, overly rigid question templates may limit the model’s performance, and the questions should be diversified. When the text does not mention certain attributes, “none” should be allowed as the model’s output answer.
Therefore, we adopt the following methods to construct questions to guide the model’s output:
1. Template Design
For exact attribute matches, use a fixed template: “What is the [attribute] of [drug]?”
Example: “What is the molecular formula of penicillin?”
For inferential attributes, use a flexible template: “Based on the text, what is the possible [attribute] of [drug]?”
Example: “Based on the text, what are the possible side effects of fluoxetine?”
2. Terminology Standardization
Leverage the Unified Medical Language System (UMLS) to ensure consistency of terminology, and include synonyms when necessary.
3. Diverse Expression
Design variants for each attribute and randomly assign them. For example, for “side effects”:
“What are the side effects of omeprazole?”
“What adverse reactions may occur after taking omeprazole?”
“Based on the text, what side effects might omeprazole cause?”
4. Handling of Negative Samples
Use qualifiers to handle cases with no information, for example, “Based on the text, what are the contraindications of chloroquine? (Answer ‘none’ if not mentioned)” to prevent the model from generating incorrect content.
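The four construction rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the names `build_question`, `VARIANTS`, and `EXPLICIT_ATTRIBUTES` are hypothetical, the template wording is copied from the examples in the text, and UMLS-based terminology standardization is assumed to have been applied to the drug and attribute names beforehand.

```python
import random

# Attribute names taken from the paper; the grouping set is a hypothetical helper.
EXPLICIT_ATTRIBUTES = {"molecular formula", "category", "dosage", "indication"}

# Rule 1: fixed template for exact-match attributes, flexible one for inferential ones.
FIXED_TEMPLATE = "What is the {attribute} of {drug}?"
FLEXIBLE_TEMPLATE = "Based on the text, what is the possible {attribute} of {drug}?"

# Rule 3: per-attribute paraphrases (only the "side effect" example from the text).
VARIANTS = {
    "side effect": [
        "What are the side effects of {drug}?",
        "What adverse reactions may occur after taking {drug}?",
        "Based on the text, what side effects might {drug} cause?",
    ],
}

# Rule 4: qualifier that lets the model answer 'none' for unmentioned attributes.
NONE_QUALIFIER = " (Answer 'none' if not mentioned)"


def build_question(attribute, drug, rng, allow_none=True):
    """Return one question string for an (attribute, drug) pair."""
    if attribute in VARIANTS:
        template = rng.choice(VARIANTS[attribute])  # random variant assignment
    elif attribute in EXPLICIT_ATTRIBUTES:
        template = FIXED_TEMPLATE
    else:
        template = FLEXIBLE_TEMPLATE
    # str.format ignores unused keyword arguments, so every template is safe here.
    question = template.format(attribute=attribute, drug=drug)
    return question + NONE_QUALIFIER if allow_none else question


rng = random.Random(0)  # seeded only to make the sketch reproducible
print(build_question("molecular formula", "penicillin", rng, allow_none=False))
print(build_question("side effect", "omeprazole", rng))
```

Keeping the variants in a plain dictionary makes it easy for annotators to review and extend the paraphrases without touching the selection logic.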
The constructed questions and answers were independently reviewed by three annotators with biochemical or related backgrounds before being included in the final dataset. The review focused on whether each question was professionally expressed, whether the answer was supported by the source text, and whether the QA pair was answerable. Samples with inconsistent or unclear judgments were revised or removed.
3.4. Reward Model
To optimize model performance and mitigate hallucination, we introduce IRL to learn a reward function $R_{\phi}$, which automatically adjusts the weights of sub-rewards through expert demonstrations. The reward model consists of multiple sub-reward components and is mathematically formulated as follows:
Accuracy Reward: To measure the semantic relevance between $\hat{A}$ and $A$,
$$R_{\mathrm{acc}}(\hat{A}, A) = \mathrm{BLEU}(\hat{A}, A)$$
The BLEU score assesses the degree of text overlap, ensuring that the generated content accurately reflects the target attribute [25].
Professionalism Reward: To assess whether $\hat{A}$ conforms to professional expressions in the chemical field,
$$R_{\mathrm{pro}}(\hat{A}) = \frac{1}{|\hat{A}|}\sum_{w \in \hat{A}} s(w)$$
The term $s(w)$ is the professionalism score of the word $w$, which is evaluated by matching with the terms in an external knowledge base. The term $|\hat{A}|$ is the length of the answer.
Completeness Reward: To ensure that $\hat{A}$ covers all key information points of $A$,
$$R_{\mathrm{comp}}(\hat{A}, A) = \frac{|K(\hat{A}) \cap K(A)|}{|K(A)|}$$
The term $K(\cdot)$ denotes the key information points (such as entities or relations) extracted from the answer, calculated through named entity recognition or relation extraction tools.
Function Match Reward: Our goal is to provide customs with a function–substance correlation database that helps them quickly identify illegally added substances in pre-screened suspicious cross-border goods based on the goods’ advertised functionalities. We therefore define a function match reward that measures the similarity between the functional attribute extracted by the model and the functions of suspicious cross-border goods stored in our database. Ranking database entries by this reward allows substances with similar functions to be filtered out quickly for subsequent operations. Specifically, we compare the extracted functional attribute with each function in the database using the ROUGE-L metric [26]. The reward obtained by the model is proportional to the ROUGE-L score; a higher ROUGE-L score indicates that the extracted function matches a database function more closely. The calculation formula is as follows:
$$R_{\mathrm{func}}(\hat{F}) = \max_{1 \le i \le n} \mathrm{ROUGE\text{-}L}(\hat{F}, F_i)$$
Here, $\hat{F}$ represents the functional attribute generated by the model, $F_i$ represents the $i$-th function in the database, and $n$ represents the number of functions in the database.
For example, given the generated functional answer “improves pulmonary vascular remodeling by upregulating PPAR”, the function match module compares it with candidate functional descriptions in the external database. Candidate entries may include “improves pulmonary vascular remodeling”, “activates PPAR to regulate glucose metabolism”, and “reduces airway inflammation”. ROUGE-L scores are calculated between the generated functional attribute and each database entry, and the maximum score is used as the function match reward. In this example, the entry “improves pulmonary vascular remodeling” obtains the highest similarity and therefore contributes most to the function match reward. The reliability of this reward depends on the quality and coverage of the external database: standardized and complete functional descriptions provide useful guidance, whereas missing or inconsistent database entries may introduce noisy reward signals.
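The worked example above can be reproduced with a short, dependency-free sketch. It is only an illustration under stated assumptions: ROUGE-L is computed here as the balanced F-measure (equal weight on precision and recall) over lowercased whitespace tokens, whereas the paper does not state which ROUGE-L variant or tokenizer was used; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure with equal precision/recall weight (an assumption)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def function_match_reward(generated, database):
    """Return (best score, best entry): the maximum ROUGE-L over all entries."""
    scored = [(rouge_l_f1(generated, entry), entry) for entry in database]
    return max(scored, key=lambda pair: pair[0])


generated = "improves pulmonary vascular remodeling by upregulating PPAR"
database = [
    "improves pulmonary vascular remodeling",
    "activates PPAR to regulate glucose metabolism",
    "reduces airway inflammation",
]
score, entry = function_match_reward(generated, database)
print(round(score, 3), entry)
```

Under these assumptions the first entry shares a four-token subsequence with the seven-token generated answer, so it receives the highest score and determines the reward, in line with the example in the text.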
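The comprehensive reward function is defined as
$$R_{\phi}(\hat{A}) = \sum_{k} \phi_{k} R_{k}(\hat{A}, A), \quad k \in \{\mathrm{acc}, \mathrm{pro}, \mathrm{comp}, \mathrm{func}\}$$
where $\phi = (\phi_{\mathrm{acc}}, \phi_{\mathrm{pro}}, \phi_{\mathrm{comp}}, \phi_{\mathrm{func}})$ denotes the weights of the reward components.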
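As a rough sketch of how the sub-rewards could be combined, the snippet below implements a professionalism score, a completeness score, and their weighted sum. Several parts are assumptions rather than the authors' definitions: professionalism is taken as the mean per-token score from a hypothetical term dictionary, completeness as the fraction of reference key points recovered, and the weight values are placeholders (in CAESAR they are learned by IRL, not set by hand).

```python
def professionalism_reward(answer_tokens, term_scores):
    """Mean per-token score looked up in a (hypothetical) domain term dictionary."""
    if not answer_tokens:
        return 0.0
    return sum(term_scores.get(tok, 0.0) for tok in answer_tokens) / len(answer_tokens)


def completeness_reward(predicted_points, reference_points):
    """Fraction of reference key points that also appear in the prediction."""
    reference = set(reference_points)
    if not reference:
        return 1.0  # nothing to cover; treated as fully complete (assumption)
    return len(reference & set(predicted_points)) / len(reference)


def combined_reward(sub_rewards, weights):
    """Weighted sum of sub-rewards; both arguments are dicts keyed by component."""
    return sum(weights[name] * value for name, value in sub_rewards.items())


sub_rewards = {
    "accuracy": 0.40,  # e.g. a BLEU score computed elsewhere
    "professionalism": professionalism_reward(
        ["upregulates", "ppar", "transcription"], {"ppar": 1.0, "transcription": 1.0}
    ),
    "completeness": completeness_reward(
        {"PPAR"}, {"PPAR", "pulmonary vascular remodeling"}
    ),
    "function_match": 0.70,  # e.g. a maximum ROUGE-L score computed elsewhere
}
# Placeholder weights for illustration only.
weights = {"accuracy": 0.25, "professionalism": 0.25,
           "completeness": 0.25, "function_match": 0.25}
print(combined_reward(sub_rewards, weights))
```

Keeping each component as a separate function with a score in [0, 1] makes the learned weights directly comparable across components.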
3.5. Training with IRL
We use the maximum-entropy IRL algorithm to learn the reward function from expert-annotated answers [27].
The training of the model operates in two alternating phases: the reward update phase and the policy update phase.
Reward Model Update Phase: In this phase, the QA model uses the fixed, learned policy to generate an answer. The reward model then updates the weights of different sub-reward components by considering the reference answer in the training pair.
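Assume that the reference answers are sampled from the distribution $p_{\phi}(A)$:
$$p_{\phi}(A) = \frac{1}{Z}\exp\left(R_{\phi}(A)\right)$$
Here, $R_{\phi}$ is the reward function, and $Z$ is the normalization constant. The objective is to maximize the log-likelihood of the expert answers:
$$J_{r}(\phi) = \frac{1}{N}\sum_{n=1}^{N}\log p_{\phi}(A_{n} \mid Q_{n})$$
where $N$ is the number of QA pairs annotated by experts. $A_{n}$ is the expert-annotated answer in the $n$-th sample, which is used to guide the model to learn to generate high-quality answers. $Q_{n}$ is the question in the $n$-th sample, that is, the question associated with the expert-annotated answer $A_{n}$, and it is used to guide the model to generate answers.
The reward weights $\phi$ are updated by computing the gradient of the log-likelihood objective:
$$\nabla_{\phi} J_{r}(\phi) = \mathbb{E}_{A \sim p_{\mathrm{data}}}\left[\nabla_{\phi} R_{\phi}(A)\right] - \mathbb{E}_{A \sim p_{\phi}}\left[\nabla_{\phi} R_{\phi}(A)\right]$$
To compute the gradient $\nabla_{\phi} J_{r}(\phi)$, we use importance sampling to handle expectations over all possible answers. Specifically, we estimate it by sampling $N$ answers from the expert demonstrations distribution $p_{\mathrm{data}}$ and $M$ answers from the policy distribution $q_{\theta}$:
$$\nabla_{\phi} J_{r}(\phi) \approx \frac{1}{N}\sum_{n=1}^{N}\nabla_{\phi} R_{\phi}(A_{n}) - \frac{1}{\sum_{m=1}^{M}\beta_{m}}\sum_{m=1}^{M}\beta_{m}\nabla_{\phi} R_{\phi}(\hat{A}_{m})$$
where the importance weights $\beta_{m}$ are proportional to
$$\beta_{m} \propto \frac{\exp\left(R_{\phi}(\hat{A}_{m})\right)}{q_{\theta}(\hat{A}_{m} \mid D_{m}, Q_{m})}$$
Here, $A_{n}$ and $\hat{A}_{m}$ are answers drawn from $p_{\mathrm{data}}$ and $q_{\theta}$, respectively, corresponding to questions $Q_{n}$ and $Q_{m}$.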
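The reward update can be made concrete with a small sketch of the importance-sampled gradient. It assumes that the reward is linear in the weights (a weighted sum of sub-reward values), so the gradient of the reward with respect to the weights is simply the sub-reward vector of each answer; the function names and the toy numbers are hypothetical, and a real implementation would operate on batched tensors.

```python
import math


def linear_reward(phi, features):
    """Reward as a weighted sum of sub-reward values (linear form is assumed)."""
    return sum(p * c for p, c in zip(phi, features))


def irl_weight_gradient(phi, expert_features, sample_features, sample_log_probs):
    """Importance-sampled estimate of the max-entropy IRL gradient w.r.t. phi.

    expert_features: sub-reward vectors of the N expert answers
    sample_features: sub-reward vectors of the M policy-sampled answers
    sample_log_probs: log q_theta of each sampled answer under the policy
    """
    k = len(phi)
    expert_mean = [sum(f[i] for f in expert_features) / len(expert_features)
                   for i in range(k)]
    # log beta_m = R_phi(sample_m) - log q_theta(sample_m); shift for stability.
    log_w = [linear_reward(phi, f) - lp
             for f, lp in zip(sample_features, sample_log_probs)]
    shift = max(log_w)
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    sample_mean = [sum(wi * f[i] for wi, f in zip(w, sample_features)) / total
                   for i in range(k)]
    return [e - s for e, s in zip(expert_mean, sample_mean)]


def update_weights(phi, grad, lr):
    """One gradient-ascent step on the expert log-likelihood."""
    return [p + lr * g for p, g in zip(phi, grad)]


phi = [0.0, 0.0]
grad = irl_weight_gradient(
    phi,
    expert_features=[[1.0, 0.0]],
    sample_features=[[0.0, 1.0], [0.0, 1.0]],
    sample_log_probs=[-1.0, -1.0],
)
print(grad, update_weights(phi, grad, lr=0.1))
```

In this toy case the expert answers score high on the first component and the policy samples on the second, so the step raises the first weight and lowers the second, which is the intended behavior of the update.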
QA Model Update Phase: In this phase, the IRL module fixes the reward function and uses it to update the policy gradients, refining the model’s performance. The goal is to maximize the reward function, thereby improving the model’s ability to generate high-quality answers.
We update the model’s parameters using the Self-Critical Sequence Training (SCST) method [28].
For each input $(D_{i}, Q_{i})$, $D_{i}$ represents the $i$-th source document in the dataset, and $Q_{i}$ represents the question designed for it. We sample a response from BioBART’s policy $q_{\theta}$ (through random sampling or beam search). At the same time, a baseline response is generated using greedy decoding as a reference for self-criticism.
- Sampled output: $\hat{A}^{s} \sim q_{\theta}(\cdot \mid D_{i}, Q_{i})$
- Baseline output: $\hat{A}^{g}$, obtained by greedy decoding from $q_{\theta}(\cdot \mid D_{i}, Q_{i})$
Using the learned comprehensive reward function $R_{\phi}$, we calculate the rewards for the sampled output and the baseline output, respectively:
$$r^{s} = R_{\phi}(\hat{A}^{s}), \qquad r^{g} = R_{\phi}(\hat{A}^{g})$$
where $R_{\phi}$ includes four reward components (accuracy, professionalism, completeness, function match). The reward difference is defined as
$$\Delta = r^{s} - r^{g}$$
This difference serves as an optimization signal to encourage the model to generate higher-quality answers than the baseline.
SCST is based on the REINFORCE algorithm, using the reward difference to update the model parameters $\theta$. The objective function is to maximize the expected reward:
$$J(\theta) = \mathbb{E}_{\hat{A}^{s} \sim q_{\theta}}\left[R_{\phi}(\hat{A}^{s})\right]$$
The gradient is estimated as
$$\nabla_{\theta} J(\theta) \approx \left(r^{s} - r^{g}\right)\nabla_{\theta}\log q_{\theta}(\hat{A}^{s} \mid D_{i}, Q_{i})$$
where $\log q_{\theta}(\hat{A}^{s} \mid D_{i}, Q_{i})$ is the log probability of the sampled answer. Through the self-criticism mechanism, the baseline reward $r^{g}$ reduces the gradient variance, making training more stable.
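A minimal sketch of the self-critical loss for a single training example is given below. It is written in plain Python for readability and only computes the scalar surrogate loss; in practice the token log-probabilities would be differentiable tensors produced by the model, and the reward values would come from the learned reward function. The function name and the numbers are hypothetical.

```python
def scst_loss(sample_token_log_probs, sample_reward, baseline_reward):
    """Surrogate loss whose gradient matches the SCST policy-gradient estimate.

    sample_token_log_probs: log-probabilities of the tokens in the sampled answer
    sample_reward: reward of the sampled answer
    baseline_reward: reward of the greedy-decoded answer (the self-critical baseline)
    """
    advantage = sample_reward - baseline_reward
    sequence_log_prob = sum(sample_token_log_probs)
    # Minimizing this loss raises the probability of samples that beat the
    # greedy baseline and lowers it for samples that fall short.
    return -advantage * sequence_log_prob


# Sampled answer scores 0.8, greedy baseline scores 0.6: positive advantage.
print(scst_loss([-0.5, -0.5], sample_reward=0.8, baseline_reward=0.6))
# Sampled answer is worse than the baseline: the sign of the loss flips.
print(scst_loss([-0.5, -0.5], sample_reward=0.4, baseline_reward=0.6))
```

The advantage must be treated as a constant during backpropagation, so that gradients flow only through the log-probability of the sampled answer.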
3.6. Attribute Types
To ensure the comprehensiveness and representativeness of the experiment, the dataset in this study includes two types of attributes: explicit (exact-match) attributes and implicit (inference) attributes. We provide a general description of the attribute types in Table 2. These attributes are extracted from chemical texts, covering key information in drug development and clinical practice, and are used to evaluate the model’s performance under different task difficulties.
In Table 2, explicit attributes include molecular formula, category, dosage, and indication. These attributes can often be extracted from short-context sentences and have clear boundaries, typically requiring exact matching. In contrast, implicit attributes include the mechanism of action, side effect, therapeutic target, function, drug interaction, and biological pathway. These attributes usually span longer contexts in chemical texts and cannot be directly copied from the original text.
The selection of these attributes reflects the practical needs of the chemical field while providing diverse challenges for generative QA models. By combining these two types of attributes, we can comprehensively validate the performance of the proposed method.
3.7. Experiment Details
3.7.1. Dataset
The custom attribute extraction QA dataset was built on the biochemical text resources used in our previous entity-extraction experiments. The raw materials mainly came from the MedChemExpress (MCE) database and the Pharmacopoeia of the People’s Republic of China, supplemented by expert-checked biochemical descriptions. Different from the entity-extraction setting, which focuses on entity boundaries and entity categories, this dataset further organizes the same raw biochemical resources into document–question–answer triples around substance properties, functional descriptions, and potential risk information.
After raw data collection, the texts were preprocessed before QA construction. The cleaning procedure included removing HTML tags, special control characters, duplicated spaces, and invalid symbols; normalizing English capitalization, Greek letters, superscripts/subscripts, CAS numbers, and common chemical symbols; deleting overly short texts or texts without substantive semantic information; and merging duplicated descriptions of the same substance from different sources while retaining the version with more complete information and clearer attribute descriptions. For semi-structured MCE records, field names were further normalized. For pharmacopoeia-derived texts, substance names, categories, properties, functions, and indications were organized through rule matching followed by manual checking.
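Part of the cleaning procedure described above can be sketched with the Python standard library. This is an illustrative approximation, not the authors' pipeline: Unicode NFKC normalization is used as a stand-in for superscript/subscript normalization, text length is used as a proxy for "more complete information" when merging duplicates, the 20-character threshold is arbitrary, and CAS-number and field-name normalization are omitted.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")                              # HTML tags
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # control characters
SPACE_RE = re.compile(r"\s+")                                # runs of whitespace


def clean_text(raw, min_chars=20):
    """Return the cleaned text, or None if it is too short to be informative."""
    text = html.unescape(raw)                   # decode entities such as &nbsp;
    text = TAG_RE.sub(" ", text)                # drop HTML tags
    text = CONTROL_RE.sub("", text)             # drop special control characters
    text = unicodedata.normalize("NFKC", text)  # e.g. subscript digits -> digits
    text = SPACE_RE.sub(" ", text).strip()      # collapse duplicated spaces
    return text if len(text) >= min_chars else None


def merge_duplicates(records):
    """Keep one description per substance, preferring the longest one."""
    best = {}
    for name, text in records:
        key = name.casefold()
        if key not in best or len(text) > len(best[key][1]):
            best[key] = (name, text)
    return list(best.values())


print(clean_text("<p>H\u2082O  is\twater,   a common solvent.</p>"))
print(merge_duplicates([
    ("Aspirin", "An analgesic."),
    ("aspirin", "An analgesic and antipyretic used to reduce pain and fever."),
]))
```

Rule-based cleaning of this kind still needs the manual checking step mentioned in the text, since length is only a rough indicator of how complete a description is.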
The final dataset covers 10 attribute types, including four explicit attributes and six implicit attributes. Explicit attributes, including molecular formula, category, dosage, and indication, account for approximately 40% of the QA pairs. Implicit attributes, including mechanism of action, side effect, biological pathway, function, therapeutic target, and drug interaction, account for approximately 60% of the QA pairs. The annotated QA pairs were split into training, validation, and test sets at a ratio of 8:1:1. Detailed dataset statistics are reported in Table 3.
The distribution of QA pairs across the ten attribute types is further reported in Table 4.
A potential bias of the dataset is that public biochemical databases and pharmacopoeia-style documents tend to contain more complete descriptions for well-studied substances, whereas newly emerging or poorly documented non-edible additives may be underrepresented. Therefore, the proposed framework is intended for expert-assisted risk pre-screening and candidate prioritization rather than independent regulatory decision-making.
3.7.2. Model Configuration
BART-large pretrained on general-domain text and BioBART-large pretrained on biomedical corpora were used as the primary generative baselines. BioBART-large contains approximately 400 M parameters. Both models were first fine-tuned on public biomedical QA datasets (BioASQ and PubMedQA) using Maximum Likelihood Estimation (MLE) with cross-entropy loss. The fine-tuning process used a batch size of 16 and a learning rate of
. After public QA adaptation, BioBART was further fine-tuned on the custom attribute extraction QA dataset. The experiments were implemented in Python using the PyTorch and Hugging Face Transformers libraries; the cited model names and dataset sources specify the pretrained model and data resources used in the study.
Table 5 reports the loss statistics of this second fine-tuning stage.
The fine-tuning loss was calculated as the token-level cross-entropy loss under the maximum-likelihood objective. Given a document–question pair and the reference answer sequence, the model was optimized to maximize the likelihood of each target token conditioned on the previous target tokens and the input context. As shown in Table 5, both training loss and validation loss decrease steadily over the 20 training epochs and then stabilize. The moderate gap between them suggests that no severe overfitting occurs.
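For reference, the token-level cross-entropy described above reduces to a short computation once the model's probabilities for the reference tokens are available. The sketch below uses made-up probabilities and hypothetical function names; it also shows perplexity computed as the exponential of the mean loss, which is the standard definition and is assumed here rather than taken from the paper.

```python
import math


def token_cross_entropy(reference_token_probs):
    """Mean negative log-likelihood of the reference tokens.

    reference_token_probs: probability the model assigns to each reference
    token, conditioned on the input and the previous reference tokens.
    """
    return -sum(math.log(p) for p in reference_token_probs) / len(reference_token_probs)


def perplexity(reference_token_probs):
    """Perplexity as exp of the mean token-level cross-entropy (assumed)."""
    return math.exp(token_cross_entropy(reference_token_probs))


probs = [0.9, 0.6, 0.8, 0.7]  # illustrative values, not taken from the paper
print(token_cross_entropy(probs), perplexity(probs))
```

Lower loss means the model places more probability mass on the expert answer tokens, which is why loss and perplexity move together.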
3.7.3. Training Parameters
The CAESAR framework’s training involved two alternating phases, namely reward update and policy update. The learning rates of both the policy model and the reward model were set to . The batch size was set to 16. For importance sampling in IRL, the number of expert demonstrations N and policy-generated samples M were both set to 50. The reward model’s weights were updated once per epoch based on expert annotations. The total number of training epochs was set to 20. The validation loss was monitored during training to check convergence and potential overfitting.
The main training settings were selected according to commonly used BioBART fine-tuning and self-critical sequence training practices and were kept fixed across all compared models to ensure fair comparison. In particular, batch size and learning rate were chosen to ensure stable optimization under the available GPU memory, and the number of sampled answers in the IRL stage was fixed for all experiments. It should be noted that the reward-component weights in CAESAR are not manually tuned hyperparameters. Instead, they are automatically learned by maximum-entropy IRL from expert-annotated answers. Therefore, we did not perform a manual sensitivity analysis on reward weights, because manually sweeping these weights would change the proposed inverse reward learning mechanism into a manually weighted multi-reward optimization scheme.
3.7.4. Evaluation Metrics
We used different evaluation metrics for explicit and implicit attributes. For explicit attributes, Precision, Recall, and F1 score were used because these attributes usually have clear textual boundaries and can be evaluated by exact or normalized matching:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Here, $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively. For implicit attributes, the answer may not be a direct span copied from the text; therefore, ROUGE-L and BLEU were used to evaluate the overlap and generation quality between generated answers and expert references. Perplexity (PPL) was also reported to measure the confidence of the generative model. In addition, human evaluation was conducted for implicit attributes because automatic metrics cannot fully capture factual correctness, professional appropriateness, and completeness.
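The explicit-attribute metrics can be illustrated with a small normalized exact-match scorer. The counting convention is an assumption, since the paper does not spell it out: a correct non-"none" answer is a true positive, a wrong non-"none" prediction is a false positive, and a missed non-"none" reference is a false negative. Normalization here is limited to lowercasing and whitespace collapsing.

```python
def normalize(answer):
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(answer.lower().split())


def exact_match_prf(predictions, references, none_label="none"):
    """Precision, recall, and F1 under normalized exact matching."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred, ref = normalize(pred), normalize(ref)
        if pred == ref:
            if ref != none_label:
                tp += 1            # correct, non-empty answer
            continue               # matching 'none' answers are not counted
        if pred != none_label:
            fp += 1                # the model produced a wrong answer
        if ref != none_label:
            fn += 1                # the reference answer was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


predictions = ["C16H18N2O4S", "none", "tablet"]
references = ["c16h18n2o4s", "antibiotic", "capsule"]
print(exact_match_prf(predictions, references))
```

A wrong non-"none" prediction counts as both a false positive and a false negative under this convention, which penalizes fabricated answers more heavily than abstentions.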
4. Results
4.1. Overall Model Performance
We evaluated the compared methods on the test set. Since explicit and implicit attributes have different output characteristics, we report Precision, Recall, and F1 score for explicit attributes and ROUGE-L, BLEU, and PPL for implicit attributes. The results are shown in Table 6.
Table 6 adds comparisons with domain-specific generative, extractive-QA, retrieval-augmented, and manually weighted RL baselines. PubMedBERT-QA obtains a high F1 score for explicit attributes, indicating that extractive QA is effective when answer spans are clearly present in the source text. However, it performs poorly on implicit attributes because it cannot freely generate integrated answers. RAG-BioBART achieves the highest ROUGE-L on implicit attributes, suggesting that retrieved external evidence can improve information coverage. Nevertheless, retrieval quality may introduce irrelevant or weakly related evidence, and its PPL remains higher than that of CAESAR. CAESAR obtains the highest BLEU and the lowest PPL while maintaining competitive explicit-attribute extraction performance, indicating that IRL-based multi-reward optimization helps balance accuracy, completeness, professionalism, and function matching.
Compared with BioBART-large, CAESAR increases explicit-attribute F1 from 76.23 to 77.82, improves ROUGE-L from 41.84 to 43.08, improves BLEU from 39.54 to 44.46, and reduces PPL from 14.6 to 11.3. These results indicate moderate but consistent gains rather than dramatic improvements. Therefore, we interpret the performance enhancement jointly with validation-loss statistics, ablation results, and human evaluation, instead of relying on a single metric.
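The size of these gains can be checked with a short calculation. The snippet below uses only the four pairs of values quoted above; the helper name `gains` is hypothetical.

```python
# Values quoted in the text for BioBART-large and CAESAR
baseline = {"F1": 76.23, "ROUGE-L": 41.84, "BLEU": 39.54, "PPL": 14.6}
caesar = {"F1": 77.82, "ROUGE-L": 43.08, "BLEU": 44.46, "PPL": 11.3}


def gains(before, after):
    """Return {metric: (absolute change, relative change in percent)}."""
    result = {}
    for name, old in before.items():
        absolute = after[name] - old
        relative = 100.0 * absolute / old
        result[name] = (round(absolute, 2), round(relative, 2))
    return result


summary = gains(baseline, caesar)
```

In relative terms, this corresponds to roughly +2.1% for F1, +3.0% for ROUGE-L, +12.4% for BLEU, and a 22.6% reduction in PPL, so the largest changes are in BLEU and PPL.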
The values in Table 6 should be interpreted in the context of expert-assisted customs pre-screening rather than fully automated regulatory enforcement. The F1 score of 77.82 for explicit attributes indicates that the model can provide useful structured clues for attributes with clear textual boundaries. For implicit attributes, ROUGE-L and BLEU values reflect the model’s ability to generate answers that are semantically close to expert references, but they do not guarantee complete factual correctness. Therefore, the proposed framework is suitable for prioritizing suspicious substances, narrowing the range of candidate additives, and assisting expert review. Final regulatory decisions still require professional inspection and laboratory confirmation.
4.2. Human Evaluation
For explicitly defined attributes, such as molecular formulas or dosages, the evaluation criteria introduced earlier (e.g., Precision, Recall, and F1 scores) are sufficient due to their reliance on direct text matching. However, implicit attributes, which often require cross-sentence integration or semantic inference (e.g., mechanism of action or side effects), pose a greater challenge for automated metrics alone. These attributes demand a nuanced understanding of context and professional accuracy that automated metrics like ROUGE-L or BLEU may not fully capture. To address this, we invited five evaluators with backgrounds in biomedicine and chemical text analysis to assess the quality of the model-generated attributes from different perspectives. Each expert evaluated the answers on a scale from 0 to 100, considering factors such as factual correctness, completeness, and professional appropriateness. Their scores were then averaged to provide a comprehensive measure of overall quality.
Table 7 presents the human evaluation scores for different implicit attribute types, comparing BioBART (without IRL) and CAESAR.
The human evaluation scores in Table 7 demonstrate a consistent improvement in the quality of answers generated by CAESAR compared to BioBART across all implicit attribute types. This improvement aligns with the automated metrics (e.g., ROUGE-L and BLEU) reported earlier, yet provides additional insight into aspects such as semantic coherence and domain-specific accuracy that automated metrics may overlook. The human evaluation and generated outputs indicate that incorporating IRL improves the accuracy and quality of the model’s generated answers.
Table 8 provides a sample of model outputs with and without IRL training.
To further analyze the limitations of the proposed framework, we examined representative incorrect or incomplete outputs for implicit attributes.
Table 9 shows an adapted failure case involving drug interaction extraction. In this example, the source text indicates that rifampicin induces CYP3A4 expression and significantly reduces the plasma exposure of midazolam by accelerating its metabolism. However, the generated answer only states that rifampicin accelerates the metabolism of midazolam. Although this answer captures part of the interaction, it omits the CYP3A4-mediated mechanism and does not explicitly describe the resulting reduction in midazolam exposure. This suggests that the model can sometimes identify the general biomedical relation but still fail to preserve specific mechanism-level and relation-level information.
In addition to the representative case, we manually inspected 100 generated answers for implicit attributes and categorized the observed errors, as shown in Table 10. Among the 100 inspected samples, 19 outputs contained clear errors. The most frequent error type was the over-generalized answer, accounting for 31.6% of all errors. This indicates that the model sometimes replaces a specific mechanism, pathway, or interaction with a broader biomedical description. Incomplete answers accounted for 26.3%, suggesting that the model may capture the main attribute but omit key supporting information. Unsupported inference accounted for 21.1%, showing that the generative model may still produce plausible but insufficiently grounded biomedical expressions when the source text lacks direct evidence. Outcome–mechanism confusion and relation-direction errors occurred less frequently, but they remain important in high-stakes customs inspection scenarios because they may affect the interpretation of substance functions or interactions. These observations indicate that CAESAR improves implicit attribute extraction, but its outputs should still be verified by domain experts before being used for final regulatory decisions.
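As a consistency check on these percentages, the counts they imply can be recovered from the 19 erroneous outputs. The counts below are inferred from the reported shares rather than copied from Table 10, and the names are hypothetical.

```python
TOTAL_INSPECTED = 100
TOTAL_ERRORS = 19

# Shares of all errors reported in the text (percent)
reported_share = {
    "over_generalized": 31.6,
    "incomplete": 26.3,
    "unsupported_inference": 21.1,
}


def implied_counts(shares, total):
    """Round each percentage share of `total` to the nearest whole count."""
    return {name: round(pct / 100 * total) for name, pct in shares.items()}


counts = implied_counts(reported_share, TOTAL_ERRORS)
# Errors left for outcome-mechanism confusion and relation-direction errors combined
remaining = TOTAL_ERRORS - sum(counts.values())
error_rate = TOTAL_ERRORS / TOTAL_INSPECTED
```

Under this reading, the three named categories account for 6, 5, and 4 errors, leaving 4 errors for the two less frequent categories combined, and the overall error rate among inspected answers is 19%.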
4.3. Contribution Analysis of Reward Components
To quantify the contribution of each reward component, we conducted a reward ablation study by removing one component from the reward function at a time while keeping the BioBART backbone, training data, and other training settings unchanged. The experiment focuses on the implicit-attribute generation task, where multi-objective reward design is most relevant.
As shown in Table 11, the full CAESAR model achieves the best overall performance. Removing the accuracy reward causes the largest decline in BLEU and increases PPL substantially, indicating that this reward is important for constraining the generated answer to remain semantically consistent with the expert reference. Removing the completeness reward leads to the largest ROUGE-L decline, suggesting that it helps the model cover key information points and avoid overly short or incomplete answers. Removing the professionalism reward and the function match reward leads to smaller declines in automatic metrics, but these components still contribute to domain-specific terminology normalization and customs-oriented functional matching.
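The leave-one-out design of this ablation can be sketched as follows. The component names follow the text, while the helper names and the linear combination of component scores are illustrative assumptions.

```python
COMPONENTS = ("accuracy", "completeness", "professionalism", "function_match")


def ablation_variants(components=COMPONENTS):
    """Full model plus one variant per removed reward component."""
    variants = {"full": tuple(components)}
    for removed in components:
        variants[f"w/o {removed}"] = tuple(c for c in components if c != removed)
    return variants


def combined_reward(scores, weights, active):
    """Weighted sum over the active components only (assumed linear form)."""
    return sum(weights[name] * scores[name] for name in active)
```

Each variant would be trained with the same backbone, data, and settings, so that any change in ROUGE-L, BLEU, or PPL can be attributed to the removed component.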
4.4. Reward Component Weight Change
In this subsection, we examine how the IRL agent adjusts the weights of the sub-reward components during training, as shown in Figure 2. The vertical axis represents the weight values of the reward components, and the horizontal axis represents the training epochs.
At the beginning of training, the weights of different sub-rewards are relatively balanced because the reward model has not yet formed a stable preference among the reward dimensions. As training proceeds, the weights of the accuracy reward and completeness reward gradually increase and become stable. This trend is consistent with the nature of attribute extraction, where factual consistency and coverage of key information are the most fundamental requirements. By contrast, professionalism and function matching remain at relatively lower but stable levels. These two components are still useful, but they mainly refine terminology normalization and customs-oriented functional association after the core requirements of accuracy and completeness are satisfied. The automatic convergence of these weights also supports the role of maximum-entropy IRL in learning reward preferences from expert demonstrations without manually tuning reward weights.
5. Conclusions
This study improved the extraction of both implicit and explicit attributes by reconfiguring the chemical-attribute extraction task into a generative QA problem, combining BioBART’s domain adaptation capabilities with the multi-reward optimization mechanism of inverse reinforcement learning. Experiments demonstrated that IRL could effectively balance reward weights such as accuracy and professionalism, mitigating the hallucination issues of generative models while enhancing the semantic coherence and domain relevance of the generated answers. Human evaluation further validated the model’s reasoning capabilities on complex attributes (such as biological pathways and therapeutic targets), with score improvements reflecting the practicality of the generated content in customs inspection and research settings.
Future work could extend to multimodal biomedical data, optimize reward function design to cover more evaluation dimensions, and explore transfer learning strategies in low-resource scenarios. In addition, the application scope of the proposed framework can be further expanded. Beyond chemical-attribute extraction for customs inspection, the approach of recasting attribute extraction as generative question answering and optimizing it via inverse reinforcement learning holds potential in other domains that involve complex text analysis with implicit information. For instance, in environmental science, identifying hazardous components in industrial waste discharge reports often requires extracting both explicit attributes (e.g., concentration of heavy metals) and implicit attributes (e.g., potential ecological-chain impact mechanisms inferred from reaction conditions). By adapting the question templates, reward components, and expert demonstration data, our framework can be extended to such texts. For environmental-monitoring texts, the QA templates can be redesigned to ask about pollutant concentration, emission source, exposure route, ecological impact, and treatment strategy. For medical texts, the templates can be adapted to symptoms, diagnosis, treatment response, contraindications, and adverse events. In such applications, the reward components should also be adjusted: the professionalism reward should rely on the corresponding domain terminology, and the function match reward should be replaced or extended by a domain-specific risk or mechanism matching database. Nevertheless, cross-domain transfer is not automatic; it still requires expert demonstrations, terminology normalization, and data-quality control.
This expansion will not only verify the generalizability of the method but also provide practical tools for cross-domain text-based attribute extraction tasks that face similar challenges of professional jargon and implicit reasoning.
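As a minimal sketch of the template adaptation described above, the attribute lists for the two example domains can be kept in a small configuration from which questions are generated. The question wording and the names `DOMAIN_ATTRIBUTES` and `build_questions` are hypothetical; only the attribute lists come from the text.

```python
# Attribute lists taken from the text; the question wording below is an assumption
DOMAIN_ATTRIBUTES = {
    "environmental_monitoring": [
        "pollutant concentration",
        "emission source",
        "exposure route",
        "ecological impact",
        "treatment strategy",
    ],
    "medical": [
        "symptoms",
        "diagnosis",
        "treatment response",
        "contraindications",
        "adverse events",
    ],
}


def build_questions(domain: str, subject: str) -> list[str]:
    """Generate one question per attribute for the given domain and subject."""
    return [
        f"What does the text report about the {attribute} of {subject}?"
        for attribute in DOMAIN_ATTRIBUTES[domain]
    ]
```

Changing the domain then only requires a new attribute list, while the reward components and expert demonstrations still need to be rebuilt for that domain.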