Chemical-Attribute Extraction via Inverse Reinforcement Learning with Sub-Reward Matching for Question Answering

Zhang, Taiyu; Ni, Yuqing; Yang, Xicheng; Xu, Congyuan; Liu, Xiaochen

doi:10.3390/app16115598


by Taiyu Zhang ¹, Yuqing Ni ¹,*, Xicheng Yang ², Congyuan Xu ³ and Xiaochen Liu ⁴

¹ Key Laboratory of Advanced Control for Light Industry Processes (Ministry of Education), School of Automation and Intelligent Science, Jiangnan University, Wuxi 214122, China

² Zhejiang Uniview Technologies Co., Ltd., Hangzhou 310051, China

³ College of Artificial Intelligence, Jiaxing University, Jiaxing 314001, China

⁴ Department of Radiology, Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5598; https://doi.org/10.3390/app16115598

Submission received: 1 May 2026 / Revised: 30 May 2026 / Accepted: 2 June 2026 / Published: 3 June 2026

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

Globalization and international trade have increased the importance of customs authorities in ensuring national security. However, regulatory differences regarding substances such as cannabis derivatives, the emergence of new psychoactive substances (NPSs), and the limitations of detection technology challenge customs in identifying suspicious cross-border goods. Traditional attribute extraction methods struggle with professional terminology and cross-sentence reasoning, making it difficult to regulate unknown or emerging substances. To address this, we propose a generative question answering (QA) framework based on inverse reinforcement learning (IRL) that converts attribute extraction into natural language QA tasks. Our approach, CAESAR (Chemical-Attribute Extraction with Sub-rewArd Reinforcement), uses a customs database to match known profiles and cross-references extracted attributes with benchmarks to enhance detection. It integrates the BioBART model with multi-objective reward optimization, using QA templates to capture implicit attributes. IRL automates the learning of reward weights from expert annotations. Experiments show that CAESAR achieves a competitive F1 score of 77.82 on explicit attributes and obtains the highest BLEU score and the lowest perplexity among the compared generative methods. For implicit attributes, ROUGE-L and BLEU scores are 43.08 and 44.46, respectively, with a perplexity of 11.3. These results are obtained in an open-ended generative QA setting rather than a closed-set classification setting, indicating that the proposed framework can provide practically useful attribute-level evidence for customs-oriented risk pre-screening and expert-assisted prioritization. This study offers an efficient solution for mining implicit knowledge in chemical texts and provides insights into multi-objective generative tasks.

Keywords: natural language processing; inverse reinforcement learning; attribute extraction; question answering

1. Introduction

In the context of globalization, the increasing frequency of international trade has positioned China Customs as a critical regulatory authority for import and export goods, tasked with safeguarding national security, maintaining social stability, and promoting economic development. The inspection of imported food and pharmaceuticals is particularly vital due to its direct impact on public health. However, disparities in the legal status of certain substances across different countries pose significant challenges to customs inspections. For instance, Canada legalized recreational cannabis on 17 October 2018, under the Cannabis Act, becoming the second country globally to do so [1], whereas in China, cannabis and its derivatives (e.g., tetrahydrocannabinol, THC) are classified as strictly controlled substances under the Regulations on the Control of Narcotic Drugs and Psychotropic Substances, prohibiting their import and use [2]. Similarly, melatonin is widely available as an over-the-counter dietary supplement in the United States, but in China, it is categorized as a pharmaceutical requiring approval from the National Medical Products Administration (NMPA) for importation. These legal discrepancies are reflected in real-world cases. For example, in 2022, Hong Kong of China intercepted a shipment of imported food from Canada containing cannabidiol (CBD), a substance legally used as a health supplement in Canada but classified as a narcotic in China [3]. These incidents underscore the complex dilemmas customs face in addressing regulatory differences between nations.

Customs can identify prohibited substances in food when the substances’ characteristics are known and detection methods are established. For emerging or unknown substances, however, little is known about their functions and mechanisms, which makes determining their legality exceedingly difficult. This challenge arises for three reasons. First, the variety and rapid evolution of illicit substances complicate detection. With advances in chemical synthesis, new psychoactive substances (NPSs) continue to emerge: the United Nations Office on Drugs and Crime (UNODC) reported in its 2020 World Drug Report that over 1000 NPSs have been identified globally, and their complex, ever-changing chemical structures outpace customs’ capacity to keep up [4]. Second, detection methods are limited. Identifying different substances relies on specialized technologies and equipment, such as high-performance liquid chromatography–mass spectrometry (HPLC-MS) for cannabis components or the more sensitive gas chromatography–mass spectrometry (GC-MS) for fentanyl derivatives, which are costly to operate and maintain and require trained personnel, increasing the burden of inspection. Third, illicit substances are easy to conceal, as offenders often mix them into legitimate goods to evade scrutiny; for example, the EU Drug Markets Report notes that drugs are frequently hidden in food or cosmetics, rendering them difficult to detect through routine checks [5].

Under these circumstances, we aim to develop a product-function-to-chemical mapping database to assist customs in rapidly identifying potentially non-compliant additives in cross-border goods by analyzing their marketed claims. If a product claims to “whiten skin rapidly”, the database would map this function to substances like hydroquinone (a banned skin-lightening agent), allowing customs to prioritize lab testing for such chemicals. Chemical text data contain a wealth of knowledge, such as drug–drug interactions, diagnostic criteria for diseases, treatment protocols, and gene functions, offering irreplaceable value for clinical decision support, drug development, and medical research. In recent years, the rapid expansion of literature databases like PubMed and the widespread adoption of electronic health records (EHRs) have significantly increased the scale and complexity of unstructured chemical text. Attribute extraction, a core task in natural language processing (NLP), aims to extract specific information fragments—such as entities (e.g., genes, diseases, drugs) and their relationships—from unstructured text, providing structured support for downstream tasks like knowledge graph construction and information retrieval. Deep learning methods have been widely applied in the field of NLP [6]. However, the specialized terminology, dense jargon, structural variability, and implicit information in chemical text pose significant challenges to traditional attribute extraction methods. For instance, inferring potential drug side effects from clinical trial reports or extracting implicit gene–disease associations from research papers often requires deep semantic understanding and reasoning beyond simple pattern matching or entity recognition.

Building on this customs-oriented motivation, this study formalizes chemical-attribute extraction as a generative QA task. We adopt BioBART as the base model and optimize it via IRL. Specifically, we first fine-tune BioBART on public biomedical QA datasets to adapt it to QA tasks and then further train it on a custom attribute extraction QA dataset. The primary contributions of this study are as follows: (1) by reframing attribute extraction as a QA task, we leverage generative models to enhance implicit attribute extraction and address the limitations of extractive methods in handling distributed information; and (2) we design domain-specific sub-reward components for QA and employ IRL to automatically weight them using expert demonstrations, thereby improving answer quality for chemical-attribute extraction.

2. Related Work

Traditional attribute extraction methods primarily rely on rule-based matching or feature-engineered machine learning models. While effective for explicit information, these approaches struggle with implicit attributes. Rule-based and feature-engineered methods usually depend on manually designed dictionaries, trigger words, or shallow syntactic patterns; therefore, they are sensitive to terminology variation and often fail when an answer must be inferred from multiple sentences rather than copied from a single span. The advent of deep learning has improved performance through neural network-based methods, such as Bidirectional Encoder Representations from Transformers (BERT)-based sequence labeling models. Zhang et al. proposed a method for biochemical text information extraction in few-shot scenarios [7]. However, the high computational cost and complexity of sequence labeling frameworks have prompted researchers to reframe information extraction as a question answering (QA) paradigm, leveraging the semantic comprehension capabilities of QA models to handle complex extraction tasks. For example, Qiu et al. proposed an extraction framework using QA templates [8], and Du and Cardie demonstrated effective event extraction by answering natural questions [9]. In the chemical domain, this methodology has gained traction, with studies such as Wang et al. proposing a multi-turn QA framework for biomedical event extraction [10], and Chen et al. developing a knowledge-enhanced reading comprehension framework for biomedical relation extraction [11], both achieving significant improvements over traditional methods. For hazardous-chemical identification, Cheng et al. proposed a domain-knowledge-graph-embedded text feature extraction model for hazardous-chemical recovery identification and attribute classification [12]. 
Their work emphasizes classification and feature extraction with domain knowledge, whereas our study focuses on open-ended QA-style attribute extraction and multi-objective reward optimization. These efforts highlight the potential of QA paradigms in addressing domain-specific linguistic challenges, particularly in the scientific literature, where studies by Sipila et al. and Dagdelen et al. have successfully extracted structured information from materials science texts using large language models (LLMs) [13,14].

To meet the unique demands of biomedical texts, specialized pretrained models have been developed. Lee et al. introduced BioBERT, a BERT-based model pretrained on the biomedical literature, significantly enhancing performance in tasks such as QA, named-entity recognition, and relation extraction [15]. Building on this, Yuan et al. proposed BioBART, a biomedical pretrained model based on the Bidirectional and Auto-Regressive Transformers (BART) architecture, optimized for generative tasks [16]. Pretrained on large-scale biomedical corpora, BioBART integrates bidirectional encoding and autoregressive decoding, making it well-suited for applications like QA, text summarization, and clinical report generation. While extractive QA models, such as BioBERT, excel at identifying explicit answer spans, they struggle to integrate fragmented answers or infer implicit attributes distributed across sentences. In contrast, generative QA models, leveraging language generation capabilities and contextual understanding, offer a promising solution to these challenges. However, generative models face critical issues, including hallucination—generating plausible but factually incorrect content, such as fabricating non-existent drug side effects—and exposure bias caused by teacher forcing during pretraining [17,18]. To mitigate these, reinforcement learning (RL) has been explored to optimize long-term rewards. Yet single-reward RL fails to capture multidimensional evaluation criteria, and manually balancing sub-reward weights in multi-reward RL becomes intractable as the number of components increases.

Inverse reinforcement learning (IRL) offers a novel solution by automatically inferring reward weights from expert demonstrations, circumventing manual design limitations. Recent studies have demonstrated IRL’s efficacy in multi-objective optimization for NLP tasks. For instance, Shi et al. introduced IRL to address reward sparsity and mode collapse in text generation [19], while Fu et al. applied IRL to text summarization, leveraging expert demonstrations to balance sub-rewards and improve sequence quality [20]. Ghosh et al. further extended IRL to tasks like Table-to-Text and program generation, demonstrating its versatility [21,22]. These advancements underscore IRL’s potential to enhance generative models by mitigating hallucination and optimizing multidimensional objectives.

3. Materials and Methods

The general workflow of CAESAR is shown in Figure 1. First, we fine-tune the BioBART model using public biomedical QA datasets to adapt it for QA tasks. Then, we use a custom attribute extraction QA dataset, which is constructed by crawling a large number of biochemistry texts from the web and cleaning them. We build questions related to attributes and annotate the answers to create a QA-formatted attribute extraction dataset for subsequent training. The subsequent training is divided into two stages. In the first stage, the weights of the QA model are fixed, and the QA model takes the context and question as input and outputs predicted answers. Both the predicted answers and the annotated answers are fed into the reward model, and the weights of the reward model are automatically adjusted by the IRL agent. This stage is called the reward update stage. In the second stage, the reward model evaluates the quality of the predicted answers and guides the training of the model by outputting scores, thereby updating the model parameters. This stage is called the policy update stage.

3.1. Problem Formulation

The attribute extraction task in the chemical field aims to extract target attributes (such as genes, diseases, drugs, and their relationships) from unstructured text. To achieve this goal, we restructure it as a generative QA problem, leveraging the reasoning and generation capabilities of QA models to process the information.

Let the input chemical text be $D$ (a document composed of a sequence of words) and the target attribute be $A$ (the information to be extracted). We transform the task into a QA pair $(Q, A)$, where $Q$ is the natural language question designed for the attribute $A$ and $A$ is the reference answer provided by experts; the goal of the generative QA model is to generate the answer $\hat{A}$ based on $D$ and $Q$.

Formally, the task of the generative QA model is defined as

$\hat{A} = f (D, Q; θ)$

Here, $f$ is the generative model, $θ$ denotes the model parameters, and $\hat{A}$ is the generated answer sequence. We aim for $\hat{A}$ to be as semantically close to $A$ as possible, that is,

$\min_{θ} L (\hat{A}, A)$

Here, $L$ is the loss function (such as cross-entropy loss), which measures the difference between the generated answer and the reference answer.

In the chemical context, $A$ may be implicit. For example, given the text $D$ as shown in Table 1, if the target attribute $A$ is “the mechanism of action of Rosiglitazone Hydrochloride”, then $Q =$ “How does the drug mentioned in the text exert its effect?” and $A =$ “By upregulating the transcriptional level of PPARγ to improve pulmonary vascular remodeling”. The model needs to generate $\hat{A}$ based on contextual reasoning.

It should be noted that the proposed task is not formulated as a conventional closed-set classification problem. The attribute type determines the question template, but the answer is generated as a natural-language sequence and is not restricted to a predefined class label. In this study, the dataset contains ten attribute types, including four explicit attributes and six implicit attributes. Therefore, exact-match metrics are suitable for explicit attributes with clear answer boundaries, while generation-oriented metrics and human evaluation are required for implicit attributes involving cross-sentence reasoning or professional inference.

3.2. Fine-Tuning BioBART

To achieve high-quality generative QA for chemical-attribute extraction tasks, we selected BioBART as the base model. BioBART is a pretrained language model based on the BART architecture, specifically designed for the biomedical domain and pretrained on large-scale biomedical corpora, enabling it to capture domain-specific language patterns and semantic relationships. Since BioBART is pretrained using only a text-infilling task, we fine-tuned it with a QA dataset to adapt it to the QA task. To ensure performance, we chose the BioBART-large version, which contains approximately 400M parameters, providing sufficient expressive power.

We fine-tuned the model on two public datasets:

- BioASQ [23]: contains approximately 5000 expert-annotated QA pairs covering factual questions (“What are the symptoms of a certain disease?”), list-based questions (“List the side effects of a drug”), and inferential questions (“How does Drug X treat Disease Y?”). These question types are highly relevant to the needs of the attribute extraction task.
- PubMedQA [24]: includes 1000 QA pairs based on PubMed abstracts, primarily “yes/no/maybe” questions (e.g., “Does Drug A cause side effect B?”), together with long answers based on context.

Fine-tuning is supervised: at this stage, the model parameters are updated by Maximum Likelihood Estimation (MLE) on the reference answers.
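For reference, the MLE objective for a sequence-to-sequence model is commonly written as a token-level negative log-likelihood. The form below is a sketch in the notation of Section 3.1, not a formula taken from the original text; $a_{t}$ (the t-th token of the reference answer $A$), $T$ (its length), and $p_{θ}$ (the model's next-token distribution) are symbols introduced here for illustration:

$$L_{\mathrm{MLE}} (θ) = - \sum_{t = 1}^{T} \log p_{θ} (a_{t} \mid a_{< t}, D, Q)$$

Minimizing this quantity under teacher forcing is equivalent to minimizing the cross-entropy between the predicted token distributions and the reference tokens.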

3.3. Question Construction for Attributes

When attribute extraction is recast in the QA paradigm, the quality of question construction directly affects model performance and answer accuracy. A well-constructed question must clearly guide the model toward the task and should satisfy five criteria: clarity (the question points unambiguously to the required attribute), professionalism (it uses standardized terminology from the chemical field), conciseness (it avoids redundancy), targetedness (it focuses on a single attribute to avoid confusion), and answerability (it can be answered from information stated in, or inferable from, the text). Questions for implicit attributes need particular care, because the wording must steer the model toward the required inference. Synonyms and abbreviations that appear in the text should be mapped to standardized professional terminology. At the same time, overly rigid templates may limit the model’s performance, so question wording should be diversified. When the text does not mention an attribute, “none” should be allowed as the model’s output.

Therefore, we adopt the following methods to construct questions to guide the model’s output:

1. Template Design
   - For exact attribute matches, use a fixed template: “What is the [attribute] of [drug]?” Example: “What is the molecular formula of penicillin?”
   - For inferential attributes, use a flexible template: “Based on the text, what is the possible [attribute] of [drug]?” Example: “Based on the text, what are the possible side effects of fluoxetine?”
2. Terminology Standardization
   - Leverage the Unified Medical Language System (UMLS) to ensure consistency of terminology, and include synonyms when necessary.
3. Diverse Expression
   - Design variants for each attribute and randomly assign them. For example, for “side effects”:
     - “What are the side effects of omeprazole?”
     - “What adverse reactions may occur after taking omeprazole?”
     - “Based on the text, what side effects might omeprazole cause?”
4. Handling of Negative Samples
   - Use qualifiers to handle cases with no information, for example, “Based on the text, what are the contraindications of chloroquine? (Answer “none” if not mentioned)” to prevent the model from generating incorrect content.
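The four construction rules above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the template strings are copied from the examples in the text, while the names `build_question`, `VARIANTS`, and `NONE_QUALIFIER` are hypothetical, and the UMLS-based terminology normalization is omitted.

```python
import random

# Explicit attribute names as listed in the paper; everything else is treated as implicit.
EXPLICIT_ATTRIBUTES = {"molecular formula", "category", "dosage", "indication"}

FIXED_TEMPLATE = "What is the {attribute} of {drug}?"
FLEXIBLE_TEMPLATE = "Based on the text, what is the possible {attribute} of {drug}?"

# Hand-written paraphrases per attribute (only the "side effect" example from the text).
VARIANTS = {
    "side effect": [
        "What are the side effects of {drug}?",
        "What adverse reactions may occur after taking {drug}?",
        "Based on the text, what side effects might {drug} cause?",
    ],
}

# Qualifier that lets the model answer 'none' instead of inventing content.
NONE_QUALIFIER = " (Answer 'none' if not mentioned)"


def build_question(attribute, drug, rng, allow_none=True):
    """Pick a template for the attribute, fill it in, and optionally add the qualifier."""
    templates = VARIANTS.get(attribute)
    if templates is None:
        base = FIXED_TEMPLATE if attribute in EXPLICIT_ATTRIBUTES else FLEXIBLE_TEMPLATE
        templates = [base]
    # str.format ignores unused keyword arguments, so both names can always be passed.
    question = rng.choice(templates).format(attribute=attribute, drug=drug)
    if allow_none:
        question += NONE_QUALIFIER
    return question


if __name__ == "__main__":
    rng = random.Random(0)
    print(build_question("molecular formula", "penicillin", rng, allow_none=False))
    print(build_question("side effect", "omeprazole", rng))
```

Passing an explicit `random.Random` instance keeps the random variant assignment reproducible across dataset builds.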

The constructed questions and answers were independently reviewed by three annotators with biochemical or related backgrounds before being included in the final dataset. The review focused on whether each question was professionally expressed, whether the answer was supported by the source text, and whether the QA pair was answerable. Samples with inconsistent or unclear judgments were revised or removed.

3.4. Reward Model

To optimize model performance and mitigate the issue of hallucination, we introduce IRL to learn a reward function

R (\hat{A})

, which automatically adjusts the weights of sub-rewards through expert demonstrations. The reward model consists of multiple sub-reward components and is mathematically formulated as follows:

Accuracy Reward: To measure the semantic relevance between $\hat{A}$ and A,

$\begin{matrix} R_{accuracy} (\hat{A}, A) = BLEU (\hat{A}, A) \end{matrix}$

(1)

The BLEU score assesses the degree of text overlap, ensuring that the generated content accurately reflects the target attribute [25].
Professionalism Reward: To assess whether $\hat{A}$ conforms to professional expressions in the chemical field,

$\begin{matrix} R_{professionalism} (\hat{A}) = \frac{1}{| \hat{A} |} \sum_{w \in \hat{A}} s (w) \end{matrix}$

(2)

The term $s (w)$ is the professionalism score of the word w, which is evaluated by matching with the terms in an external knowledge base. The term $| \hat{A} |$ is the length of the answer.
Completeness Reward: To ensure that $\hat{A}$ covers all key information points of A,

$\begin{matrix} R_{completeness} (\hat{A}, A) = \frac{| E (\hat{A}) \cap E (A) |}{| E (A) |} \end{matrix}$

(3)

The term $E (\cdot)$ denotes the key information points (such as entities or relations) extracted from the answer, calculated through named entity recognition or relation extraction tools.
Function Match Reward: Due to the specific application scenario of our task, our goal is to provide customs with a function–substance correlation database to help them quickly identify illegally added substances in pre-screened suspicious cross-border goods based on their advertised functionalities. Therefore, we have defined a function match reward to measure its similarity to the functions of suspicious cross-border goods in our database and to provide rewards based on this. By ranking database entries according to this reward, we can quickly filter out substances with similar functions, facilitating subsequent operations. Specifically, we compare the extracted functional attributes with each function in the database using the ROUGE-L metric [26]. The reward obtained by the model is proportional to the ROUGE-L score. A higher ROUGE-L score indicates that the extracted function is more matched with the function in the database. The calculation formula is as follows:

$R_{fm} (\hat{A}) \propto \max_{i = 1, \ldots, n} \text{ROUGE-L} ({\hat{A}}_{f}, C_{i})$

(4)

Here, ${\hat{A}}_{f}$ represents the functional attribute generated by the model, $C_{i}$ represents the i-th function in the database, and n represents the number of functions in the database.
For example, given the generated functional answer “improves pulmonary vascular remodeling by upregulating PPAR $γ$ ”, the function match module compares it with candidate functional descriptions in the external database. Candidate entries may include “improves pulmonary vascular remodeling”, “activates PPAR $γ$ to regulate glucose metabolism”, and “reduces airway inflammation”. ROUGE-L scores are calculated between the generated functional attribute and each database entry, and the maximum score is used as the function match reward. In this example, the entry “improves pulmonary vascular remodeling” obtains the highest similarity and therefore contributes most to the function match reward. The reliability of this reward depends on the quality and coverage of the external database: standardized and complete functional descriptions provide useful guidance, whereas missing or inconsistent database entries may introduce noisy reward signals.
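The worked example above can be reproduced with a small ROUGE-L sketch. The code below is an illustration under stated assumptions rather than the authors' implementation: it tokenizes on whitespace after lower-casing, uses the balanced F-measure form of ROUGE-L, and writes PPARγ as a single token; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 on whitespace tokens (assumed tokenization, beta = 1)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def function_match_reward(generated_function, database_functions):
    """Return the maximum ROUGE-L score over database entries and the best entry."""
    if not database_functions:
        return 0.0, None
    scores = [rouge_l(generated_function, entry) for entry in database_functions]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], database_functions[best]


if __name__ == "__main__":
    database = [
        "improves pulmonary vascular remodeling",
        "activates PPARγ to regulate glucose metabolism",
        "reduces airway inflammation",
    ]
    answer = "improves pulmonary vascular remodeling by upregulating PPARγ"
    print(function_match_reward(answer, database))
```

Under these assumptions the first entry wins: its four tokens all appear in order in the seven-token answer, giving precision 4/7, recall 1, and an F1 of 8/11.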

The comprehensive reward function is defined as

$R (\hat{A}, A) = ω_{1} \cdot R_{accuracy} (\hat{A}, A) + ω_{2} \cdot R_{professionalism} (\hat{A}) + ω_{3} \cdot R_{completeness} (\hat{A}, A) + ω_{4} \cdot R_{fm} (\hat{A})$

(5)

where $ω = \{ω_{1}, ω_{2}, ω_{3}, ω_{4}\}$ denotes the weights of the reward components.
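To make Equations (2), (3), and (5) concrete, the following sketch computes the professionalism and completeness rewards and combines four sub-rewards into a weighted sum. It is a simplified illustration: the term lexicon `term_scores` stands in for the external knowledge base, key information points are passed in as plain sets instead of being produced by an NER or relation extraction tool, and all names and numbers are hypothetical.

```python
def professionalism_reward(answer_tokens, term_scores):
    """Eq. (2): mean per-token professionalism score s(w); unknown tokens score 0."""
    if not answer_tokens:
        return 0.0
    return sum(term_scores.get(w.lower(), 0.0) for w in answer_tokens) / len(answer_tokens)


def completeness_reward(predicted_points, reference_points):
    """Eq. (3): fraction of reference key points that the prediction covers."""
    reference = set(reference_points)
    if not reference:
        return 0.0
    return len(set(predicted_points) & reference) / len(reference)


def combined_reward(sub_rewards, weights):
    """Eq. (5): weighted sum of the four sub-rewards."""
    names = ("accuracy", "professionalism", "completeness", "function_match")
    return sum(weights[name] * sub_rewards[name] for name in names)


if __name__ == "__main__":
    tokens = ["upregulates", "PPARγ", "expression"]
    lexicon = {"pparγ": 1.0, "upregulates": 0.5}  # hypothetical term scores
    sub_rewards = {
        "accuracy": 0.4,         # e.g. a BLEU score computed elsewhere
        "professionalism": professionalism_reward(tokens, lexicon),
        "completeness": completeness_reward(
            {"PPARγ", "pulmonary vascular remodeling"},
            {"PPARγ", "pulmonary vascular remodeling", "transcription"},
        ),
        "function_match": 0.8,   # e.g. the maximum ROUGE-L score from Eq. (4)
    }
    weights = {name: 0.25 for name in sub_rewards}  # placeholder; IRL learns these
    print(combined_reward(sub_rewards, weights))
```

The uniform weights here are only a placeholder; in the framework described in this paper they are the quantities learned by IRL.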

3.5. Training with IRL

We use the maximum-entropy IRL algorithm to learn the reward function from expert-annotated answers [27].

The training of the model operates in two alternating phases: the reward update phase and the policy update phase.

Reward Model Update Phase: In this phase, the QA model uses the fixed, learned policy to generate an answer. The reward model then updates the weights of different sub-reward components by considering the reference answer in the training pair.
Assume that the reference answers are sampled from the distribution $p_{ω} (A | Q)$ :

$p_{ω} (A | Q) = \frac{1}{Z} \exp (R_{ω} (A, Q))$

(6)

$R_{ω} (A, Q)$ is the reward function, and $Z = \int_{A} \exp (R_{ω} (A, Q))$ is the normalization constant. The objective is to maximize the log-likelihood of the expert answers:

$J (ω) = \frac{1}{N} \sum_{n = 1}^{N} \log p_{ω} (A^{n} | Q^{n})$

(7)

where $N$ is the number of QA pairs annotated by experts, $A^{n}$ is the expert-annotated answer in the n-th sample, and $Q^{n}$ is the question associated with that answer.
The reward weights $ω$ are updated by computing the gradient of the log-likelihood objective:

$\nabla_{ω} J (ω) = E_{A \sim p_{data} (A | Q)} [\nabla_{ω} R_{ω} (A, Q)] - E_{A \sim p_{ω} (A | Q)} [\nabla_{ω} R_{ω} (A, Q)]$

(8)

To compute the gradient $\nabla_{ω} J (ω)$ , we use importance sampling to handle expectations over all possible answers. Specifically, we estimate it by sampling N answers from the expert demonstrations distribution $p_{data}$ and M answers from the policy distribution $p_{θ} (A | Q)$ :

$\nabla_{ω} J (ω) = \frac{1}{N} \sum_{n = 1}^{N} \nabla_{ω} R_{ω} (A^{n}, Q^{n}) - \frac{1}{\sum_{m = 1}^{M} β_{m}} \sum_{m = 1}^{M} β_{m} \nabla_{ω} R_{ω} (A^{m}, Q^{m})$

(9)

where the importance weights $β_{m}$ satisfy

$β_{m} \propto \frac{\exp (R_{ω} (A^{m}, Q^{m}))}{p_{θ} (A^{m} | Q^{m})}$

Here, $A^{n}$ and $A^{m}$ are answers drawn from $p_{data}$ and $p_{θ} (A | Q)$ respectively, corresponding to questions $Q^{n}$ and $Q^{m}$ .
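Because the comprehensive reward is linear in $ω$, the gradient $\nabla_{ω} R_{ω}$ of any answer is simply its vector of sub-reward values, which makes Equation (9) easy to sketch. The code below is a minimal illustration under that linearity assumption, not the authors' implementation; the function names, the two-component toy features, and the learning rate are hypothetical.

```python
import math


def reward(weights, features):
    """Linear reward: dot product of weights and sub-reward values."""
    return sum(w * f for w, f in zip(weights, features))


def reward_weight_gradient(weights, expert_features, policy_features, policy_log_probs):
    """Eq. (9): expert feature mean minus importance-weighted policy feature mean."""
    k = len(weights)
    expert_mean = [
        sum(f[j] for f in expert_features) / len(expert_features) for j in range(k)
    ]
    # log beta_m = R_w(A^m, Q^m) - log p_theta(A^m | Q^m), up to an additive constant
    log_beta = [reward(weights, f) - lp for f, lp in zip(policy_features, policy_log_probs)]
    shift = max(log_beta)  # subtract the maximum for numerical stability
    beta = [math.exp(v - shift) for v in log_beta]
    total = sum(beta)
    policy_mean = [
        sum(b * f[j] for b, f in zip(beta, policy_features)) / total for j in range(k)
    ]
    return [e - p for e, p in zip(expert_mean, policy_mean)]


def ascent_step(weights, gradient, lr=0.1):
    """One gradient-ascent update of the reward weights."""
    return [w + lr * g for w, g in zip(weights, gradient)]


if __name__ == "__main__":
    weights = [0.0, 0.0]
    expert = [[1.0, 0.0], [1.0, 0.0]]        # sub-rewards of expert answers
    sampled = [[0.0, 1.0], [1.0, 0.0]]       # sub-rewards of policy samples
    log_probs = [math.log(0.5), math.log(0.5)]
    grad = reward_weight_gradient(weights, expert, sampled, log_probs)
    print(grad, ascent_step(weights, grad))
```

In this toy case the expert answers score only on the first component, so the update raises $ω_{1}$ and lowers $ω_{2}$.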
QA Model Update Phase: In this phase, the IRL module fixes the reward function and uses it to update the policy gradients, refining the model’s performance. The goal is to maximize the reward function, thereby improving the model’s ability to generate high-quality answers.
We update the model’s parameters using the Self-Critical Sequence Training (SCST) method [28].
For each input $(D_{i}, Q_{i})$ , $D_{i}$ represents the i-th source document in the dataset, and $Q_{i}$ represents the question designed for it. We sample a response ${\hat{A}}^{s}$ from BioBART’s policy $π_{θ} (\hat{A} ∣ D_{i}, Q_{i})$ (through random sampling or beam search). At the same time, a baseline response ${\hat{A}}^{b}$ is generated using greedy decoding as a reference for self-criticism.
- Sampled output: ${\hat{A}}^{s} \sim π_{θ} (\hat{A} ∣ D_{i}, Q_{i})$
- Baseline output: ${\hat{A}}^{b} = \arg \max P (\hat{A} ∣ D_{i}, Q_{i}; θ)$
Using the learned comprehensive reward function $R (\hat{A})$ , we calculate the rewards for the sampled output and the baseline output, respectively:

$\begin{matrix} r^{s} = R ({\hat{A}}^{s}), r^{b} = R ({\hat{A}}^{b}) \end{matrix}$

(10)

where $R (\hat{A}) = \sum_{i = 1}^{4} ω_{i} \cdot R_{i} (\hat{A})$ includes four reward components (accuracy, professionalism, completeness, function match). The reward difference is defined as

$\begin{matrix} Δ r = r^{s} - r^{b} \end{matrix}$

(11)

This difference serves as an optimization signal to encourage the model to generate higher-quality answers than the baseline.
SCST is based on the REINFORCE algorithm, using the reward difference to update the model parameters $θ$ . The objective function is to maximize the expected reward:

$\begin{matrix} J (θ) = E_{{\hat{A}}^{s} \sim π_{θ}} [R ({\hat{A}}^{s})] \end{matrix}$

(12)

The gradient is estimated as

$\begin{matrix} \nabla_{θ} J (θ) \approx Δ r \cdot \nabla_{θ} \log π_{θ} ({\hat{A}}^{s} ∣ D_{i}, Q_{i}) \end{matrix}$

(13)

where $\log π_{θ} ({\hat{A}}^{s} ∣ D_{i}, Q_{i})$ is the log probability of the sampled answer. Through the self-criticism mechanism, the baseline reward $r^{b}$ reduces the gradient variance, making training more stable.
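The self-critical update in Equations (10) to (13) can be illustrated on a toy policy. The sketch below replaces BioBART with a softmax distribution over three fixed candidate answers, so that $\nabla_{θ} \log π_{θ}$ has the closed form "one-hot minus probabilities"; this substitution, the reward values, and the learning rate are assumptions made purely for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scst_step(logits, rewards, rng, lr=0.5):
    """One SCST update: sample an answer, compare it with the greedy baseline."""
    probs = softmax(logits)
    sampled = rng.choices(range(len(logits)), weights=probs, k=1)[0]   # \hat{A}^s
    baseline = max(range(len(logits)), key=logits.__getitem__)         # \hat{A}^b (greedy)
    delta_r = rewards[sampled] - rewards[baseline]                     # Eq. (11)
    # Gradient of log pi(sampled) w.r.t. the logits of a softmax policy.
    grad_log_pi = [(1.0 if i == sampled else 0.0) - p for i, p in enumerate(probs)]
    # Eq. (13): gradient ascent with the reward difference as the scaling signal.
    new_logits = [x + lr * delta_r * g for x, g in zip(logits, grad_log_pi)]
    return new_logits, sampled, baseline, delta_r


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.0, 0.0, 0.0]
    rewards = [0.2, 0.9, 0.1]  # hypothetical comprehensive rewards R(\hat{A})
    for _ in range(300):
        logits, *_ = scst_step(logits, rewards, rng)
    print(softmax(logits))
```

When the sampled answer coincides with the greedy one, the reward difference is zero and the parameters do not move; only samples that beat or fall short of the baseline produce an update.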

3.6. Attribute Types

To ensure the comprehensiveness and representativeness of the experiment, the dataset in this study includes two types of attributes: exact match attributes and implicit inference attributes. We provide a general description of the attribute types in Table 2. These attributes are extracted from chemical texts, covering key information in drug development and clinical practice, and are used to evaluate the model’s performance under different task difficulties.

In Table 2, explicit attributes include molecular formula, category, dosage, and indication. These attributes can often be extracted from short-context sentences and have clear boundaries, typically requiring exact matching. In contrast, implicit attributes include the mechanism of action, side effect, therapeutic target, function, drug interaction, and biological pathway. These attributes usually span longer contexts in chemical texts and cannot be directly copied from the original text.

The selection of these attributes reflects the practical needs of the chemical field while providing diverse challenges for generative QA models. By combining these two types of attributes, we can comprehensively validate the performance of the proposed method.

3.7. Experiment Details

3.7.1. Dataset

The custom attribute extraction QA dataset was built on the biochemical text resources used in our previous entity-extraction experiments. The raw materials mainly came from the MedChemExpress (MCE) database and the Pharmacopoeia of the People’s Republic of China, supplemented by expert-checked biochemical descriptions. Different from the entity-extraction setting, which focuses on entity boundaries and entity categories, this dataset further organizes the same raw biochemical resources into document–question–answer triples around substance properties, functional descriptions, and potential risk information.

After raw data collection, the texts were preprocessed before QA construction. The cleaning procedure included removing HTML tags, special control characters, duplicated spaces, and invalid symbols; normalizing English capitalization, Greek letters, superscripts/subscripts, CAS numbers, and common chemical symbols; deleting overly short texts or texts without substantive semantic information; and merging duplicated descriptions of the same substance from different sources while retaining the version with more complete information and clearer attribute descriptions. For semi-structured MCE records, field names were further normalized. For pharmacopoeia-derived texts, substance names, categories, properties, functions, and indications were organized through rule matching followed by manual checking.
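Part of the cleaning procedure can be sketched in a few lines of Python. The snippet covers only HTML-tag removal, control-character removal, whitespace collapsing, and the filter for overly short texts; the regular expressions, the function name `clean_text`, and the 20-character threshold are assumptions, and the chemical-symbol normalization and duplicate merging described above are not reproduced here.

```python
import html
import re
from typing import Optional

TAG_RE = re.compile(r"<[^>]+>")                             # HTML tags
CTRL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")   # control characters
SPACE_RE = re.compile(r"\s+")                               # runs of whitespace


def clean_text(raw: str, min_chars: int = 20) -> Optional[str]:
    """Return a cleaned string, or None if too little text remains."""
    text = TAG_RE.sub(" ", raw)       # drop tags before decoding entities
    text = html.unescape(text)        # decode entities such as &amp;
    text = CTRL_RE.sub("", text)      # remove special control characters
    text = SPACE_RE.sub(" ", text).strip()  # collapse duplicated spaces
    # Discard overly short texts (the threshold is a hypothetical choice).
    return text if len(text) >= min_chars else None
```

Tags are stripped before entities are decoded so that an escaped angle bracket in the source text is not mistaken for markup.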

The final dataset covers 10 attribute types, including four explicit attributes and six implicit attributes. Explicit attributes, including molecular formula, category, dosage, and indication, account for approximately 40% of the QA pairs. Implicit attributes, including mechanism of action, side effect, biological pathway, function, therapeutic target, and drug interaction, account for approximately 60% of the QA pairs. The annotated QA pairs were split into training, validation, and test sets at a ratio of 8:1:1. Detailed dataset statistics are reported in Table 3.
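As a quick check of the 8:1:1 split, the following sketch shuffles a list of item indices and cuts it into three parts. The random shuffle, the seed, and the function name are assumptions; the paper does not state whether the split was stratified by attribute type.

```python
import random


def split_811(items, seed=0):
    """Shuffle and split items into train/validation/test at an 8:1:1 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = len(shuffled) // 10
    n_test = len(shuffled) // 10
    n_train = len(shuffled) - n_val - n_test
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# 12,480 QA pairs in total, as reported in Table 3.
train_set, val_set, test_set = split_811(range(12480))
```

With 12,480 items the three parts contain 9984, 1248, and 1248 items, which agrees with the totals in Table 3.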

The distribution of QA pairs across the ten attribute types is further reported in Table 4.

The constructed QA pairs were independently reviewed by three annotators with biochemical or related backgrounds. The review focused on whether each question was professionally expressed, whether the answer was supported by the source text, and whether the QA pair was answerable. Samples with inconsistent or unclear judgments were revised or removed before being included in the final dataset.

A potential bias of the dataset is that public biochemical databases and pharmacopoeia-style documents tend to contain more complete descriptions for well-studied substances, whereas newly emerging or poorly documented non-edible additives may be underrepresented. Therefore, the proposed framework is intended for expert-assisted risk pre-screening and candidate prioritization rather than independent regulatory decision-making.

3.7.2. Model Configuration

BART-large pretrained on general-domain text and BioBART-large pretrained on biomedical corpora were used as the primary generative baselines. BioBART-large contains approximately 400 M parameters. Both models were first fine-tuned on public biomedical QA datasets (BioASQ and PubMedQA) using Maximum Likelihood Estimation (MLE) with cross-entropy loss. The fine-tuning process used a batch size of 16 and a learning rate of $2 \times 10^{-5}$. After public QA adaptation, BioBART was further fine-tuned on the custom attribute extraction QA dataset. The experiments were implemented in Python using the PyTorch and Hugging Face Transformers libraries; the cited model names and dataset sources specify the pretrained model and data resources used in the study. Table 5 reports the loss statistics of this second fine-tuning stage.

The fine-tuning loss was calculated as the token-level cross-entropy loss under the maximum-likelihood objective. Given a document–question pair and the reference answer sequence, the model was optimized to maximize the likelihood of each target token conditioned on the previous target tokens and the input context. As shown in Table 5, the training loss falls from 3.05 to 2.18 and the validation loss from 3.18 to 2.43 over the 20 training epochs, with most of the decrease occurring in the first 10 epochs before both curves level off. The gap between the two losses remains moderate (0.25 at epoch 20), which suggests that no severe overfitting occurred.

3.7.3. Training Parameters

The CAESAR framework’s training involved two alternating phases, namely reward update and policy update. The learning rates of both the policy model and the reward model were set to $1 \times 10^{-5}$. The batch size was set to 16. For importance sampling in IRL, the number of expert demonstrations N and policy-generated samples M were both set to 50. The reward model’s weights were updated once per epoch based on expert annotations. The total number of training epochs was set to 20. The validation loss was monitored during training to check convergence and potential overfitting.
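The reward-update phase can be illustrated with a minimal sketch of one maximum-entropy IRL step for a reward that is linear in the four sub-rewards (accuracy, completeness, professionalism, function match). This is a generic textbook-style formulation, not the authors' code: the linear reward form, the self-normalized importance weights, the function names, and the toy sub-reward vectors are all assumptions.

```python
import math


def reward(weights, features):
    """Linear reward: weighted sum of sub-reward scores."""
    return sum(w * f for w, f in zip(weights, features))


def irl_weight_update(weights, expert_feats, policy_feats, policy_logps, lr=1e-5):
    """One gradient-ascent step on a max-entropy IRL objective.

    Gradient = mean expert sub-rewards
               - importance-weighted mean of policy-sample sub-rewards.
    """
    k = len(weights)
    # Sub-reward expectation over the N expert demonstrations.
    expert_mean = [sum(f[i] for f in expert_feats) / len(expert_feats) for i in range(k)]
    # Self-normalized importance weights: beta_j proportional to exp(R(a_j)) / q(a_j).
    log_beta = [reward(weights, f) - lp for f, lp in zip(policy_feats, policy_logps)]
    shift = max(log_beta)  # subtract the maximum for numerical stability
    beta = [math.exp(b - shift) for b in log_beta]
    z = sum(beta)
    policy_mean = [sum(b * f[i] for b, f in zip(beta, policy_feats)) / z for i in range(k)]
    return [w + lr * (e - p) for w, e, p in zip(weights, expert_mean, policy_mean)]


# Toy data (hypothetical): N = M = 50, four sub-rewards per answer.
w0 = [0.25, 0.25, 0.25, 0.25]
expert = [[0.9, 0.8, 0.6, 0.5]] * 50   # expert answers
sampled = [[0.6, 0.6, 0.6, 0.5]] * 50  # policy-generated answers
new_w = irl_weight_update(w0, expert, sampled, policy_logps=[-3.0] * 50)
```

In this toy setting the weights of the first two components increase because expert answers score higher on them than policy samples, while the other two stay unchanged.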

The main training settings were selected according to commonly used BioBART fine-tuning and self-critical sequence training practices and were kept fixed across all compared models to ensure fair comparison. In particular, batch size and learning rate were chosen to ensure stable optimization under the available GPU memory, and the number of sampled answers in the IRL stage was fixed for all experiments. It should be noted that the reward-component weights in CAESAR are not manually tuned hyperparameters. Instead, they are automatically learned by maximum-entropy IRL from expert-annotated answers. Therefore, we did not perform a manual sensitivity analysis on reward weights, because manually sweeping these weights would change the proposed inverse reward learning mechanism into a manually weighted multi-reward optimization scheme.

3.7.4. Evaluation Metrics

We used different evaluation metrics for explicit and implicit attributes. For explicit attributes, Precision, Recall, and F1 score were used because these attributes usually have clear textual boundaries and can be evaluated by exact or normalized matching:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{14}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{15}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{16}$$

Here, $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively. For implicit attributes, the answer may not be a direct span copied from the text; therefore, ROUGE-L and BLEU were used to evaluate the overlap and generation quality between generated answers and expert references. Perplexity (PPL) was also reported to measure the confidence of the generative model. In addition, human evaluation was conducted for implicit attributes because automatic metrics cannot fully capture factual correctness, professional appropriateness, and completeness.
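The metrics above can be sketched in plain Python. The precision, recall, and F1 functions follow Equations (14)–(16) directly. The ROUGE-L function is a simplified variant that assumes whitespace tokenization and an unweighted F-measure over the longest common subsequence, and the perplexity function assumes the standard definition as the exponential of the mean token-level negative log-likelihood; none of these function names come from the paper, and the authors may have used library implementations with different settings.

```python
import math


def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from counts, following Equations (14)-(16)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, harmonic_mean(precision, recall)


def harmonic_mean(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def rouge_l_f(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F-measure over the longest common subsequence."""
    c, r = candidate.lower().split(), reference.lower().split()
    # Dynamic-programming table for the LCS length of the two token lists.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, c_tok in enumerate(c, 1):
        for j, r_tok in enumerate(r, 1):
            if c_tok == r_tok:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    return harmonic_mean(lcs / len(c), lcs / len(r))


def perplexity(token_nlls) -> float:
    """Perplexity as exp of the mean token-level negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Consistency check against Table 6: CAESAR precision 77.50, recall 78.15.
caesar_f1 = round(harmonic_mean(77.50, 78.15), 2)
```

Applying the harmonic mean to the CAESAR precision and recall in Table 6 gives 77.82, which matches the reported F1 score.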

4. Results

4.1. Overall Model Performance

We conducted a performance evaluation of different methods on the test set. Since explicit and implicit attributes have different output characteristics, we report Precision, Recall, and F1 score for explicit attributes and ROUGE-L, BLEU, and PPL for implicit attributes. The results are shown in Table 6.

Table 6 compares CAESAR with domain-specific generative, extractive-QA, retrieval-augmented, and manually weighted RL baselines. PubMedBERT-QA obtains the highest F1 score for explicit attributes (77.89), indicating that extractive QA is effective when answer spans are clearly present in the source text. However, it performs poorly on implicit attributes because it cannot freely generate integrated answers. RAG-BioBART achieves the highest ROUGE-L on implicit attributes, suggesting that retrieved external evidence can improve information coverage. Nevertheless, retrieval may introduce irrelevant or weakly related evidence, and its PPL remains higher than that of CAESAR. CAESAR obtains the highest BLEU and the lowest PPL while maintaining competitive explicit-attribute extraction performance, indicating that IRL-based multi-reward optimization helps balance accuracy, completeness, professionalism, and function matching.

Compared with BioBART-large, CAESAR increases explicit-attribute F1 from 76.23 to 77.82, improves ROUGE-L from 41.84 to 43.08, improves BLEU from 39.54 to 44.46, and reduces PPL from 14.6 to 11.3. These results indicate moderate but consistent gains rather than dramatic improvements. Therefore, we interpret the performance enhancement jointly with validation-loss statistics, ablation results, and human evaluation, instead of relying on a single metric.
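The size of these gains can be made explicit with a short calculation on the Table 6 values. The dictionary and variable names are assumptions chosen for illustration; the numbers are those reported for BioBART-large and CAESAR.

```python
# Values from Table 6 (explicit-attribute F1; implicit-attribute ROUGE-L, BLEU, PPL).
biobart = {"F1": 76.23, "ROUGE-L": 41.84, "BLEU": 39.54, "PPL": 14.6}
caesar = {"F1": 77.82, "ROUGE-L": 43.08, "BLEU": 44.46, "PPL": 11.3}

# Absolute change (CAESAR minus BioBART-large); negative is better for PPL.
deltas = {k: round(caesar[k] - biobart[k], 2) for k in biobart}
# Relative change in percent of the BioBART-large value.
relative = {k: round(100 * (caesar[k] - biobart[k]) / biobart[k], 1) for k in biobart}
```

The absolute changes are +1.59 (F1), +1.24 (ROUGE-L), +4.92 (BLEU), and −3.3 (PPL), so BLEU and PPL move considerably more than F1 and ROUGE-L.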

The values in Table 6 should be interpreted in the context of expert-assisted customs pre-screening rather than fully automated regulatory enforcement. The F1 score of 77.82 for explicit attributes indicates that the model can provide useful structured clues for attributes with clear textual boundaries. For implicit attributes, ROUGE-L and BLEU values reflect the model’s ability to generate answers that are semantically close to expert references, but they do not guarantee complete factual correctness. Therefore, the proposed framework is suitable for prioritizing suspicious substances, narrowing the range of candidate additives, and assisting expert review. Final regulatory decisions still require professional inspection and laboratory confirmation.

4.2. Human Evaluation

For explicitly defined attributes, such as molecular formulas or dosages, the evaluation criteria introduced earlier (e.g., Precision, Recall, and F1 scores) are sufficient due to their reliance on direct text matching. However, implicit attributes, which often require cross-sentence integration or semantic inference (e.g., mechanism of action or side effects), pose a greater challenge for automated metrics alone. These attributes demand a nuanced understanding of context and professional accuracy that automated metrics like ROUGE-L or BLEU may not fully capture. To address this, we invited five evaluators with backgrounds in biomedicine and chemical text analysis to assess the quality of the model-generated attributes from different perspectives. Each expert evaluated the answers on a scale from 0 to 100, considering factors such as factual correctness, completeness, and professional appropriateness. Their scores were then averaged to provide a comprehensive measure of overall quality. Table 7 presents the human evaluation scores for different implicit attribute types, comparing BioBART (without IRL) and CAESAR.

The human evaluation scores in Table 7 demonstrate a consistent improvement in the quality of answers generated by CAESAR compared to BioBART across all implicit attribute types. This improvement aligns with the automated metrics (e.g., ROUGE-L and BLEU) reported earlier, yet provides additional insight into aspects such as semantic coherence and domain-specific accuracy that automated metrics may overlook. The human evaluation and generated outputs indicate that incorporating IRL improves the accuracy and quality of the model’s generated answers. Table 8 provides a sample of model outputs with and without IRL training.
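The per-attribute gains in Table 7 can be summarized with a brief calculation. The variable names are assumptions; the scores are the averaged expert ratings reported in Table 7.

```python
# (BioBART, CAESAR) human evaluation scores from Table 7.
scores = {
    "Mechanism of action": (56.7, 62.3),
    "Side effect": (76.8, 80.2),
    "Therapeutic target": (66.4, 70.3),
    "Function": (68.3, 72.4),
    "Drug interaction": (62.6, 67.2),
    "Biological pathway": (60.7, 63.9),
}

# Gain of CAESAR over BioBART for each implicit attribute type.
gains = {name: round(c - b, 1) for name, (b, c) in scores.items()}
mean_biobart = round(sum(b for b, _ in scores.values()) / len(scores), 2)
mean_caesar = round(sum(c for _, c in scores.values()) / len(scores), 2)
```

The gains range from 3.2 points (biological pathway) to 5.6 points (mechanism of action), and the unweighted mean score rises from 65.25 to 69.38.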

To further analyze the limitations of the proposed framework, we examined representative incorrect or incomplete outputs for implicit attributes. Table 9 shows an adapted failure case involving drug interaction extraction. In this example, the source text indicates that rifampicin induces CYP3A4 expression and significantly reduces the plasma exposure of midazolam by accelerating its metabolism. However, the generated answer only states that rifampicin accelerates the metabolism of midazolam. Although this answer captures part of the interaction, it omits the CYP3A4-mediated mechanism and does not explicitly describe the resulting reduction in midazolam exposure. This suggests that the model can sometimes identify the general biomedical relation but still fail to preserve specific mechanism-level and relation-level information.

In addition to the representative case, we manually inspected 100 generated answers for implicit attributes and categorized the observed errors, as shown in Table 10. Among the 100 inspected samples, 19 outputs contained clear errors. The most frequent error type was over-generalized answer, accounting for 31.6% of all errors. This indicates that the model sometimes replaces a specific mechanism, pathway, or interaction with a broader biomedical description. Incomplete answers accounted for 26.3%, suggesting that the model may capture the main attribute but omit key supporting information. Unsupported inference accounted for 21.1%, showing that the generative model may still produce plausible but insufficiently grounded biomedical expressions when the source text lacks direct evidence. Outcome–mechanism confusion and relation-direction errors occurred less frequently, but they remain important in high-stakes customs inspection scenarios because they may affect the interpretation of substance functions or interactions. These observations indicate that CAESAR improves implicit attribute extraction, but its outputs should still be verified by domain experts before being used for final regulatory decisions.
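The proportions quoted above follow directly from the counts in Table 10, as the following short calculation shows. The variable names are assumptions; the counts are those reported in the table.

```python
# Error counts from Table 10 (100 inspected implicit-attribute samples).
error_counts = {
    "Over-generalized answer": 6,
    "Incomplete answer": 5,
    "Unsupported inference": 4,
    "Outcome-mechanism confusion": 2,
    "Relation-direction error": 2,
}
n_inspected = 100

total_errors = sum(error_counts.values())
# Share of each error type among all errors, in percent.
proportions = {k: round(100 * v / total_errors, 1) for k, v in error_counts.items()}
# Share of inspected samples that contained a clear error.
error_rate = total_errors / n_inspected
```

Note that the percentages are taken over the 19 erroneous outputs, not over all 100 inspected samples; relative to the full sample, the overall error rate is 19%.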

4.3. Contribution Analysis of Reward Components

To quantify the contribution of each reward component, we conducted a reward ablation study by removing one component from the reward function at a time while keeping the BioBART backbone, training data, and other training settings unchanged. The experiment focuses on the implicit-attribute generation task, where multi-objective reward design is most relevant.

As shown in Table 11, the full CAESAR model achieves the best overall performance. Removing the accuracy reward causes the largest decline in BLEU and increases PPL substantially, indicating that this reward is important for constraining the generated answer to remain semantically consistent with the expert reference. Removing the completeness reward leads to the largest ROUGE-L decline, suggesting that it helps the model cover key information points and avoid overly short or incomplete answers. Removing the professionalism reward and the function match reward leads to smaller declines in automatic metrics, but these components still contribute to domain-specific terminology normalization and customs-oriented functional matching.
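The ablation pattern can be checked by computing the change of each configuration relative to the full model. The dictionary keys and variable names are assumptions; the numbers are taken from Table 11.

```python
# (ROUGE-L, BLEU, PPL) from Table 11.
full = (43.08, 44.46, 11.3)
ablations = {
    "accuracy": (42.12, 40.86, 13.8),
    "completeness": (41.96, 42.28, 13.1),
    "professionalism": (42.55, 43.12, 12.4),
    "function match": (42.68, 43.35, 12.1),
}

# Drop in ROUGE-L and BLEU, and rise in PPL, when one reward is removed.
effects = {
    name: (round(full[0] - rl, 2), round(full[1] - bleu, 2), round(ppl - full[2], 1))
    for name, (rl, bleu, ppl) in ablations.items()
}
largest_bleu_drop = max(effects, key=lambda name: effects[name][1])
largest_rouge_drop = max(effects, key=lambda name: effects[name][0])
```

Removing the accuracy reward costs 3.60 BLEU points and raises PPL by 2.5, while removing the completeness reward costs 1.12 ROUGE-L points, in line with the description above.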

4.4. Reward Component Weight Change

In this subsection, we examine how the IRL agent regulates the sub-reward components during training by tracking the changes in their weights, which are shown in Figure 2.

In Figure 2, the vertical axis represents the weight values of different reward components, and the horizontal axis represents the training epochs.

At the beginning of training, the weights of different sub-rewards are relatively balanced because the reward model has not yet formed a stable preference among the reward dimensions. As training proceeds, the weights of the accuracy reward and completeness reward gradually increase and become stable. This trend is consistent with the nature of attribute extraction, where factual consistency and coverage of key information are the most fundamental requirements. By contrast, professionalism and function matching remain at relatively lower but stable levels. These two components are still useful, but they mainly refine terminology normalization and customs-oriented functional association after the core requirements of accuracy and completeness are satisfied. The automatic convergence of these weights also supports the role of maximum-entropy IRL in learning reward preferences from expert demonstrations without manually tuning reward weights.

5. Conclusions

This study improved the extraction of both implicit and explicit attributes by reconfiguring the chemical-attribute extraction task into a generative QA problem, combining BioBART’s domain adaptation capabilities with the multi-reward optimization mechanism of inverse reinforcement learning. Experiments demonstrated that IRL could effectively balance reward weights such as accuracy and professionalism, mitigating the hallucination issues of generative models while enhancing the semantic coherence and domain relevance of the generated answers. Human evaluation further validated the model’s reasoning capabilities on complex attributes (such as biological pathways and therapeutic targets), with score improvements reflecting the practicality of the generated content in customs inspection and research settings.

Future work could extend to multimodal biomedical data, optimize reward function design to cover more evaluation dimensions, and explore transfer learning strategies in low-resource scenarios. In addition, the application scope of the proposed framework can be further expanded. Beyond chemical-attribute extraction for customs inspection, this method, which transforms attribute extraction into generative question answering tasks and is optimized via inverse reinforcement learning, holds significant potential in other domains involving complex text analysis with implicit information. For instance, in the field of environmental science, identifying hazardous components in industrial waste discharge reports often requires extracting both explicit attributes (e.g., concentration of heavy metals) and implicit attributes (e.g., potential ecological-chain impact mechanisms inferred from reaction conditions). By adapting the question templates, reward components, and expert demonstration data, our framework can be extended to efficiently mine critical information from environmental-monitoring texts. For environmental-monitoring texts, the QA templates can be redesigned to ask about pollutant concentration, emission source, exposure route, ecological impact, and treatment strategy. For medical texts, the templates can be adapted to symptoms, diagnosis, treatment response, contraindications, and adverse events. In such applications, the reward components should also be adjusted: the professionalism reward should rely on the corresponding domain terminology, and the function match reward should be replaced or extended by a domain-specific risk or mechanism matching database. Nevertheless, cross-domain transfer is not automatic; it still requires expert demonstrations, terminology normalization, and data-quality control. 
Such extensions would not only test the generalizability of the method but also provide practical tools for text-based attribute extraction in other domains that face similar challenges of specialized terminology and implicit reasoning.

Author Contributions

Conceptualization, Y.N. and X.L.; methodology, T.Z., Y.N., C.X. and X.L.; validation, T.Z.; investigation, T.Z., Y.N. and X.Y.; writing—original draft preparation, T.Z.; writing—review and editing, Y.N., X.Y., C.X. and X.L.; supervision, Y.N.; project administration, Y.N.; funding acquisition, Y.N. and C.X. All authors have read and agreed to the published version of the manuscript.

Funding

The work by T. Zhang and Y. Ni was financially supported by the National Key R&D Program of China (2023YFF1104900), the National Natural Science Foundation of China (62303196), the Basic Research Program of Jiangsu (BK20231036), the Open Research Project of the State Key Laboratory of Industrial Control Technology, China (ICT2026B97), and the 111 Project (B23008). The work by C. Xu was financially supported by the National Natural Science Foundation of China (62302197), and the Jiaxing City Science and Technology Project (2024AY40010).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The custom dataset used in this study is available at https://github.com/TaiyuZhang/custom-dataset (accessed on 1 June 2026). The dataset construction procedure, data split, attribute distribution, model configuration, training parameters, and evaluation metrics are described in the manuscript to support reproducibility.

Acknowledgments

The authors would like to thank all colleagues who provided helpful suggestions for this study.

Conflicts of Interest

Author Xicheng Yang was employed by Zhejiang Uniview Technologies Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Government of Canada. Cannabis Act; Government of Canada: Ottawa, ON, Canada, 2018.
State Council of the People’s Republic of China. Regulations on the Administration of Narcotic Drugs and Psychotropic Substances; State Council of the People’s Republic of China: Beijing, China, 2005.
Chen, X. Hong Kong Customs Seize 25,000 Goods with Traces of Psychoactive Substances, Arrest 9 Suspects. South China Morning Post, 22 January 2022. Available online: https://www.scmp.com/news/hong-kong/hong-kong-economy/article/3164384/hong-kong-customs-seize-25000-goods-traces (accessed on 1 June 2026).
United Nations Office on Drugs and Crime. World Drug Report 2020; United Nations Office on Drugs and Crime: Vienna, Austria, 2020.
Europol and European Monitoring Centre for Drugs and Drug Addiction. EU Drug Markets Report 2019; Technical Report; Europol and European Monitoring Centre for Drugs and Drug Addiction: Lisbon, Portugal, 2019.
Lauriola, I.; Lavelli, A.; Aiolli, F. An introduction to deep learning in natural language processing: Models, techniques, and tools. Neurocomputing 2022, 470, 443–456.
Zhang, T.; Ni, Y.; Guo, Z. A hybrid model for few-shot attribute extraction using prototypical networks and k-nearest neighbors. In Proceedings of the 36th Chinese Process Control Conference (CPCC), Yibin, China, 25–27 July 2025.
Qiu, L.; Zhou, H.; Qu, Y.; Zhang, W.; Li, S.; Rong, S.; Ru, D.; Qian, L.; Tu, K.; Yu, Y. QA4IE: A question answering based framework for information extraction. In Proceedings of the 17th International Semantic Web Conference (ISWC), Monterey, CA, USA, 8–12 October 2018; pp. 198–216.
Du, X.; Cardie, C. Event extraction by answering (Almost) natural questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 671–683.
Wang, X.D.; Weber, L.; Leser, U. Biomedical event extraction as multi-turn question answering. In Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, Online, 20 November 2020; pp. 88–96.
Chen, J.; Hu, B.; Peng, W.; Chen, Q.; Tang, B. Biomedical relation extraction via knowledge-enhanced reading comprehension. BMC Bioinform. 2022, 23, 20.
Cheng, Q.; Zhang, S.; Yang, L. A text feature extraction model for hazardous chemical recovery identification and attribute classification embedded in domain knowledge graph. Environ. Monit. Assess. 2025, 197, 415.
Sîpilă, M.; Mehryary, F.; Pyysalo, S.; Ginter, F.; Todorović, M. Question answering models for information extraction from perovskite materials science literature. arXiv 2024, arXiv:2405.15290.
Dagdelen, J.; Dunn, A.; Lee, S.; Walker, N.; Rosen, A.S.; Ceder, G.; Persson, K.A.; Jain, A. Structured information extraction from scientific text with large language models. Nat. Commun. 2024, 15, 1418.
Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
Yuan, H.; Yuan, Z.; Gan, R.; Zhang, J.; Xie, Y.; Yu, S. BioBART: Pretraining and evaluation of a biomedical generative language model. In Proceedings of the 21st Workshop on Biomedical Language Processing, Dublin, Ireland, 26 May 2022; pp. 97–109.
Williams, R.J.; Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1989, 1, 270–280.
Bengio, S.; Vinyals, O.; Jaitly, N.; Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS), Montréal, QC, Canada, 7–12 December 2015; pp. 1171–1179.
Shi, Z.; Chen, X.; Qiu, X.; Huang, X. Toward diverse text generation with inverse reinforcement learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), 13–19 July 2018; pp. 4361–4367.
Fu, Y.; Xiong, D.; Dong, Y. Inverse reinforcement learning for text summarization. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 6559–6570.
Ghosh, S.; Qi, Z.; Chaturvedi, S.; Srivastava, S. How helpful is inverse reinforcement learning for table-to-text generation? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online, 1–6 August 2021; pp. 71–79.
Ghosh, S.; Srivastava, S. Mapping language to programs using multiple reward components with inverse reinforcement learning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 1449–1462.
Krithara, A.; Nentidis, A.; Bougiatiotis, K.; Paliouras, G. BioASQ-QA: A manually curated corpus for biomedical question answering. Sci. Data 2023, 10, 170.
Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W.; Lu, X. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 2567–2577.
Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318.
Lin, C.Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
Ziebart, B.D.; Maas, A.; Bagnell, J.A.; Dey, A.K. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, Chicago, IL, USA, 13–17 July 2008; pp. 1433–1438.
Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024.

Figure 1. The general workflow of CAESAR.

Figure 2. Trends in sub-reward weights during IRL training.

Table 1. An adapted example of source text D used to illustrate QA-based attribute extraction.

Source Text D
Objective: To investigate the effects of Rosiglitazone Hydrochloride on vascular remodeling in COPD-related pulmonary hypertension rats and its mechanisms.
Methods: Thirty-six Wistar rats were divided into three groups: pulmonary hypertension group (A), treatment group (B), and control group (C). Groups A and B were used to create COPD pulmonary hypertension models via smoke exposure, hypoxia, and LPS. From week 3, Group B received Rosiglitazone Hydrochloride, while Group C received normal saline.
Results: Group A had significantly higher mPAP, RVSP, and MA% than Group C (F = 53.188∼61.666, q = 3.556∼13.769, p < 0.01) and Group B (q = 3.854∼15.345, p < 0.01). PPARγ mRNA levels in lung tissues increased sequentially in Groups A, B, and C (F = 95.719, q = 3.854∼18.924, p < 0.01). Pearson analysis showed that in Group B, PPARγ mRNA was negatively correlated with mPAP, RVSP, and MA% (r = −0.791∼−0.760, p < 0.01).

Note: The example is adapted from the expert-checked annotation corpus and is used only to demonstrate how source text is transformed into a QA-style attribute extraction sample. Bold text denotes section labels in the source text.

Table 2. Attribute types and their definitions.

Attribute Type	Definition
Molecular formula	The chemical composition of a drug.
Category	The classification of a drug.
Dosage	Precise dosage of medication, such as “500 mg”.
Indication	Therapeutic uses of drugs, such as “hypertension”.
Mechanism of action	The explanation of how the drug works, such as “inhibits gene B”.
Side effect	Adverse reactions to the drug, such as “dizziness”.
Therapeutic target	The biological target of the drug.
Function	The integration of the indications and side effects of the drug.
Drug interaction	Synergistic or antagonistic effects between drugs.
Biological pathway	The signaling pathways involved with the drug.

Table 3. Statistics of the constructed QA dataset.

Split	Total QA Pairs	Explicit Attributes	Implicit Attributes
Training set	9984	4010	5974
Validation set	1248	501	747
Test set	1248	501	747
Total	12,480	5012	7468

Table 4. Distribution of QA pairs across attribute types.

Attribute Type	Category	Number of QA Pairs
Molecular formula	Explicit	1080
Category	Explicit	1214
Dosage	Explicit	1120
Indication	Explicit	1598
Mechanism of action	Implicit	1512
Side effect	Implicit	1398
Therapeutic target	Implicit	1241
Function	Implicit	1358
Drug interaction	Implicit	976
Biological pathway	Implicit	983

Table 5. Fine-tuning loss statistics of BioBART on the custom attribute extraction QA dataset.

Epoch	Training Loss	Validation Loss
1	3.05	3.18
5	2.68	2.83
10	2.39	2.58
15	2.24	2.47
20	2.18	2.43

Table 6. Model performance comparison with alternative extraction approaches.

Model	Precision (explicit)	Recall (explicit)	F1 (explicit)	ROUGE-L (implicit)	BLEU (implicit)	PPL (implicit)
BART-large	74.50	77.10	75.78	40.63	36.29	14.4
BioGPT	75.20	76.03	75.61	41.20	40.75	13.2
BioBART-large	75.80	76.67	76.23	41.84	39.54	14.6
PubMedBERT-QA	78.60	77.20	77.89	37.42	33.86	–
RAG-BioBART	76.64	77.32	76.98	43.35	42.60	12.9
BioBART+RL	76.92	77.56	77.24	42.18	43.11	12.1
CAESAR	77.50	78.15	77.82	43.08	44.46	11.3

Note: Bold values indicate the best performance in each metric column. For PPL, a lower value is better.

Table 7. Human evaluation scores for implicit attributes.

Attribute Type	BioBART (Human Evaluation Score)	CAESAR (Human Evaluation Score)
Mechanism of action	56.7	62.3
Side effect	76.8	80.2
Therapeutic target	66.4	70.3
Function	68.3	72.4
Drug interaction	62.6	67.2
Biological pathway	60.7	63.9

Table 8. Examples of answers generated by BioBART-large and CAESAR (based on the content in Table 1).

Category	BioBART-Large	CAESAR
Function	Rosiglitazone hydrochloride improves pulmonary vascular remodeling.	Rosiglitazone hydrochloride improves pulmonary vascular remodeling by upregulating the transcription level of PPARγ.

Table 9. A representative failure case of CAESAR.

Text and Question	Reference Answer	Generated Answer	Error Type
The source text reports that rifampicin induces CYP3A4 expression and significantly reduces the plasma exposure of midazolam by accelerating its metabolism. Question: What drug interaction is described?	Rifampicin may reduce the exposure of midazolam by inducing CYP3A4-mediated metabolism.	Rifampicin accelerates the metabolism of midazolam.	Incomplete answer

Table 10. Error type distribution of CAESAR on 100 manually inspected implicit-attribute samples.

Error Type	Number of Cases	Proportion Among Errors (%)
Over-generalized answer	6	31.6
Incomplete answer	5	26.3
Unsupported inference	4	21.1
Outcome–mechanism confusion	2	10.5
Relation-direction error	2	10.5
Total errors	19	100.0

Table 11. Ablation study of reward components.

Model Configuration	ROUGE-L (%)	BLEU (%)	PPL
CAESAR	43.08	44.46	11.3
$- R_{acc}$	42.12	40.86	13.8
$- R_{comp}$	41.96	42.28	13.1
$- R_{prof}$	42.55	43.12	12.4
$- R_{fm}$	42.68	43.35	12.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
