Search Results (4)

Search Parameters:
Keywords = passage rerank

18 pages, 2102 KiB  
Article
Context-Aware Search for Environmental Data Using Dense Retrieval
by Simeon Wetzel and Stephan Mäs
ISPRS Int. J. Geo-Inf. 2024, 13(11), 380; https://doi.org/10.3390/ijgi13110380 - 30 Oct 2024
Viewed by 1955
Abstract
The search for environmental data typically involves lexical approaches, where query terms are matched with metadata records based on measures of term frequency. In contrast, dense retrieval approaches employ language models to comprehend the context and meaning of a query and provide relevant search results. However, dense retrieval has not been researched for environmental data, and there are no corpora or evaluation datasets for fine-tuning the models. This study demonstrates the adaptation of dense retrievers to the domain of climate-related scientific geodata. Four corpora containing text passages from various sources were used to train different dense retrievers. The domain-adapted dense retrievers are integrated into the search architecture of a standard metadata catalogue. To further improve the search results, we propose a spatial re-ranking stage after the initial retrieval phase. The evaluation demonstrates superior performance compared to the baseline model commonly used in metadata catalogues (BM25). No clear trends in performance were discovered when comparing the results of the dense retrievers, so directions for further investigation are identified that should eventually enable a recommendation of the most suitable corpus composition.
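The two-stage architecture this abstract describes (a dense first-pass retrieval followed by a spatial re-ranking step) can be sketched in a few lines. The following is an illustrative sketch, not the authors' implementation: it assumes the sentence-transformers library, an off-the-shelf model as a stand-in for their domain-adapted retriever, a hypothetical record layout with a "bbox" metadata extent, and an assumed IoU-based spatial score blended with the semantic score by a weight alpha.

from sentence_transformers import SentenceTransformer, util

# Stand-in model; the paper fine-tunes domain-adapted retrievers instead.
model = SentenceTransformer("all-MiniLM-L6-v2")

def bbox_overlap(a, b):
    """Intersection-over-union of two (min_lon, min_lat, max_lon, max_lat) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def search(query, query_bbox, records, alpha=0.7, top_k=10):
    """Dense retrieval, then a spatial re-ranking stage over the hits."""
    corpus = [r["abstract"] for r in records]
    scores = util.cos_sim(model.encode(query, convert_to_tensor=True),
                          model.encode(corpus, convert_to_tensor=True))[0]
    # Re-rank: blend semantic similarity with spatial overlap of metadata extents.
    blended = [(alpha * float(s) + (1 - alpha) * bbox_overlap(query_bbox, r["bbox"]), r)
               for s, r in zip(scores, records)]
    return sorted(blended, key=lambda t: t[0], reverse=True)[:top_k]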

14 pages, 520 KiB  
Article
SS-BERT: A Semantic Information Selecting Approach for Open-Domain Question Answering
by Xuan Fu, Jiangnan Du, Hai-Tao Zheng, Jianfeng Li, Cuiqin Hou, Qiyu Zhou and Hong-Gee Kim
Electronics 2023, 12(7), 1692; https://doi.org/10.3390/electronics12071692 - 3 Apr 2023
Cited by 3 | Viewed by 2560
Abstract
Open-Domain Question Answering (Open-Domain QA) aims to answer any factoid question from users. Recent progress in Open-Domain QA adopts the “retriever-reader” structure, which has proven effective. Retriever methods are mainly categorized as sparse retrievers and dense retrievers; in recent work, dense retrievers have shown stronger semantic interpretation than sparse retrievers. Training a dual-encoder dense retriever for document retrieval and re-ranking faces two challenges: negative selection and a lack of training data. In this study, we make three major contributions to this topic: negative selection by query generation, data augmentation from negatives, and a passage evaluation method. We show that focusing on false negatives and data augmentation improves performance on the Open-Domain QA passage re-ranking task. Our model outperforms other single dual-encoder re-rankers built on BERT-base and BM25 by 0.7 in MRR@10, achieving the highest Recall@50 and the maximum attainable Recall@1000, which is bounded by the BM25 retrieval results.
(This article belongs to the Special Issue Intelligent Big Data Analytics and Knowledge Management)
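For context on the metrics quoted above: MRR@10 and Recall@k are simple to compute, and the cap on a re-ranker's Recall@1000 follows directly from their definitions, since a re-ranker only reorders the first-stage (here, BM25) candidate list and cannot recover passages that list missed. A small sketch with hypothetical passage IDs, not the paper's evaluation code:

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage; 0 if none in the top k."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

# Averaged over queries. Note: a re-ranker's Recall@1000 is capped by the
# BM25 candidate list it reorders.
queries = {"q1": (["p9", "p2", "p7"], {"p2"})}  # hypothetical data
print(sum(mrr_at_k(r, rel) for r, rel in queries.values()) / len(queries))  # 0.5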

19 pages, 2349 KiB  
Article
Integrate Candidate Answer Extraction with Re-Ranking for Chinese Machine Reading Comprehension
by Junjie Zeng, Xiaoya Sun, Qi Zhang and Xinmeng Li
Entropy 2021, 23(3), 322; https://doi.org/10.3390/e23030322 - 8 Mar 2021
Viewed by 2582
Abstract
Machine Reading Comprehension (MRC) research concerns how to endow machines with the ability to understand given passages and answer questions, which is a challenging problem in the field of natural language processing. To solve the Chinese MRC task efficiently, this paper proposes an Improved Extraction-based Reading Comprehension method with Answer Re-ranking (IERC-AR), consisting of a candidate answer extraction module and a re-ranking module. The candidate answer extraction module uses an improved pre-trained language model, RoBERTa-WWM, to generate precise word representations, which address polysemy and capture Chinese word-level features well. The re-ranking module re-evaluates candidate answers based on a self-attention mechanism, which improves the accuracy of answer prediction. Traditional machine-reading methods generally integrate different modules into a pipeline system, which leads to re-encoding problems and inconsistent data distributions between the training and testing phases; therefore, this paper proposes an end-to-end model architecture for IERC-AR that reasonably integrates the candidate answer extraction and re-ranking modules. Experimental results on the Les MMRC dataset show that IERC-AR outperforms state-of-the-art MRC approaches.
(This article belongs to the Special Issue Methods in Artificial Intelligence and Information Processing)
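The re-ranking idea here (re-scoring extracted candidates with self-attention, so each candidate is judged in the context of the others) can be illustrated generically. The sketch below is not the IERC-AR architecture: the hidden size, head count, and linear scoring head are all assumptions, and in the paper the candidate vectors would come from the RoBERTa-WWM extraction module.

import torch
import torch.nn as nn

class CandidateReRanker(nn.Module):
    """Scores candidate answers jointly: self-attention lets each candidate's
    representation attend to the other candidates before a scoring head."""
    def __init__(self, hidden=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, cand_vecs):              # (batch, n_candidates, hidden)
        ctx, _ = self.attn(cand_vecs, cand_vecs, cand_vecs)
        return self.score(ctx).squeeze(-1)     # (batch, n_candidates)

# Hypothetical usage with random vectors in place of extractor outputs.
reranker = CandidateReRanker()
scores = reranker(torch.randn(1, 5, 768))      # 5 candidate answers
best = scores.argmax(dim=-1)                   # index of the top re-ranked answer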

10 pages, 2299 KiB  
Article
Document Re-Ranking Model for Machine-Reading and Comprehension
by Youngjin Jang and Harksoo Kim
Appl. Sci. 2020, 10(21), 7547; https://doi.org/10.3390/app10217547 - 27 Oct 2020
Cited by 1 | Viewed by 2844
Abstract
Recently, the performance of machine-reading and comprehension (MRC) systems has been significantly enhanced. However, MRC systems require high-performance text retrieval models because text passages containing answer phrases must be prepared in advance. To improve the performance of the text retrieval models underlying MRC systems, we propose a re-ranking model, based on artificial neural networks, that is composed of a query encoder, a passage encoder, a phrase modeling layer, an attention layer, and a similarity network. The proposed model learns degrees of association between queries and text passages through dot products between the phrases that constitute questions and passages. In experiments with the MS-MARCO dataset, the proposed model demonstrated mean reciprocal ranks (MRRs) 0.8 to 13.2 percentage points higher than most previous models, except for models based on BERT (a pre-trained language model). Although the proposed model demonstrated lower MRRs than the BERT-based models, it was approximately 8 times lighter and 3.7 times faster than the BERT-based models.
(This article belongs to the Special Issue Knowledge Retrieval and Reuse)
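The phrase-level dot-product matching described here is easy to picture in code. Below is a rough sketch under assumed shapes, not the authors' model: phrase vectors for one query and one passage (which in the paper would come from the query/passage encoders and the phrase modeling layer) interact through a dot-product matrix, an attention step pools the passage phrases for each query phrase, and a small similarity network emits a single relevance score.

import torch
import torch.nn as nn

class PhraseMatchReRanker(nn.Module):
    """Sketch of query-passage association via phrase-level dot products:
    an interaction matrix, attention pooling, and a similarity network."""
    def __init__(self, hidden=128):
        super().__init__()
        self.sim_net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))

    def forward(self, q_phrases, p_phrases):    # (nq, h), (np, h)
        interact = q_phrases @ p_phrases.T      # (nq, np) phrase-level dot products
        attn = torch.softmax(interact, dim=-1)  # attend over passage phrases
        matched = attn @ p_phrases              # (nq, h) query-aligned passage view
        pooled = matched.mean(dim=0)            # (h,)
        return self.sim_net(pooled)             # scalar relevance score

# Hypothetical usage: 6 query phrases against 40 passage phrases.
score = PhraseMatchReRanker()(torch.randn(6, 128), torch.randn(40, 128))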
