Special Issue "Selected Papers from Text Mining Workshop at the 2012 SIAM International Conference on Data Mining"

A special issue of Algorithms (ISSN 1999-4893).

Deadline for manuscript submissions: closed (30 June 2012)

Special Issue Editors

Guest Editor
Prof. Dr. Michael W. Berry

Department of Electrical Engineering and Computer Science, The University of Tennessee, Min H. Kao Building, Suite 401, 1520 Middle Drive, Knoxville, TN 37996, USA
Phone: +1 865 974 3838
Fax: +1 865 974 4404
Interests: information retrieval, data and text mining, computational science, bioinformatics, and parallel computing
Guest Editor
Dr. Jacob Kogan

Department of Mathematics and Statistics, University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, Maryland 21250, USA
Fax: +1 410 455 1066
Interests: optimal control theory; finite dimensional optimization; robust stability of control systems; computational information retrieval

Special Issue Information

Dear Colleagues,

The proliferation of digital computing devices and their use in communication continues to increase the demand for systems and algorithms capable of mining textual data. The development of techniques for mining unstructured, semi-structured, and fully structured textual data has therefore become important in both academia and industry. In response, a one-day workshop on text mining was held on April 28, 2012, in conjunction with the SIAM Twelfth International Conference on Data Mining, bringing together researchers from a variety of disciplines to present their current approaches and results in text mining. The workshop surveyed the emerging field of text mining: the application of machine learning techniques in conjunction with natural language processing, information extraction, and algebraic/mathematical approaches to computational information retrieval. The field addresses many issues, ranging from the development of new document classification and clustering models to novel approaches for topic detection, tracking, and visualization.

Prof. Dr. Michael W. Berry
Dr. Jacob Kogan
Guest Editors

Keywords

  • document ranking and representation
  • document classification and clustering
  • text summarization and anomaly detection

Published Papers (6 papers)

Research

Open Access Article: Extracting Hierarchies from Data Clusters for Better Classification
Algorithms 2012, 5(4), 506-520; doi:10.3390/a5040506
Received: 2 July 2012 / Revised: 24 September 2012 / Accepted: 17 October 2012 / Published: 23 October 2012
Cited by 1 | PDF Full-text (1199 KB) | HTML Full-text | XML Full-text
Abstract
In this paper we present the PHOCS-2 algorithm, which extracts a “Predicted Hierarchy Of ClassifierS”. The extracted hierarchy helps us enhance the performance of flat classification. Nodes in the hierarchy contain classifiers. Each intermediate node corresponds to a set of classes and each leaf node corresponds to a single class. In PHOCS-2 we perform estimation for each node and achieve more precise computation of false positives, true positives and false negatives. Stopping criteria are based on the results of the flat classification. The proposed algorithm is validated against nine datasets.
Open Access Article: The Effects of Tabular-Based Content Extraction on Patent Document Clustering
Algorithms 2012, 5(4), 490-505; doi:10.3390/a5040490
Received: 1 July 2012 / Revised: 16 August 2012 / Accepted: 9 October 2012 / Published: 22 October 2012
PDF Full-text (1308 KB) | HTML Full-text | XML Full-text
Abstract
Data can be represented in many different ways within a particular document or set of documents. Hence, attempts to automatically process the relationships between documents or determine the relevance of certain document objects can be problematic. In this study, we have developed software to automatically catalog objects contained in HTML files for patents granted by the United States Patent and Trademark Office (USPTO). Once these objects are recognized, the software creates metadata that assigns a data type to each document object. Such metadata can be easily processed and analyzed for subsequent text mining tasks. Specifically, document similarity and clustering techniques were applied to a subset of the USPTO document collection. Although our preliminary results demonstrate that tables and numerical data do not provide quantifiable value to a document’s content, the stage for future work in measuring the importance of document objects within a large corpus has been set.

Open Access Article: Contextual Anomaly Detection in Text Data
Algorithms 2012, 5(4), 469-489; doi:10.3390/a5040469
Received: 20 June 2012 / Revised: 10 October 2012 / Accepted: 11 October 2012 / Published: 19 October 2012
Cited by 2 | PDF Full-text (3733 KB) | HTML Full-text | XML Full-text
Abstract
We propose using side information to further inform anomaly detection algorithms of the semantic context of the text data they are analyzing, thereby considering both divergence from the statistical pattern seen in particular datasets and divergence seen from more general semantic expectations. Computational experiments show that our algorithm performs as expected on data that reflect real-world events with contextual ambiguity, while replicating conventional clustering on data that are either too specialized or generic to result in contextual information being actionable. These results suggest that our algorithm could potentially reduce false positive rates in existing anomaly detection systems.

Open Access Article: Better Metrics to Automatically Predict the Quality of a Text Summary
Algorithms 2012, 5(4), 398-420; doi:10.3390/a5040398
Received: 2 July 2012 / Revised: 5 September 2012 / Accepted: 7 September 2012 / Published: 26 September 2012
PDF Full-text (401 KB) | HTML Full-text | XML Full-text
Abstract
In this paper we demonstrate a family of metrics for estimating the quality of a text summary relative to one or more human-generated summaries. The improved metrics are based on features automatically computed from the summaries to measure content and linguistic quality. The features are combined using one of three methods: robust regression, non-negative least squares, or canonical correlation (an eigenvalue method). The new metrics significantly outperform the previous standard for automatic text summarization evaluation, ROUGE.
Open Access Article: Monitoring Threshold Functions over Distributed Data Streams with Node Dependent Constraints
Algorithms 2012, 5(3), 379-397; doi:10.3390/a5030379
Received: 19 June 2012 / Revised: 8 September 2012 / Accepted: 11 September 2012 / Published: 18 September 2012
PDF Full-text (683 KB) | HTML Full-text | XML Full-text
Abstract
Monitoring data streams in a distributed system has attracted considerable interest in recent years. The task of feature selection (e.g., by monitoring the information gain of various features) requires a very high communication overhead when addressed using straightforward centralized algorithms. While most of the existing algorithms deal with monitoring simple aggregated values such as frequency of occurrence of stream items, motivated by recent contributions based on geometric ideas we present an alternative approach. The proposed approach enables monitoring values of an arbitrary threshold function over distributed data streams through stream dependent constraints applied separately on each stream. We report numerical experiments on real-world data that detect instances where communication between nodes is required, and compare the approach and the results to those recently reported in the literature.
Open Access Article: Incremental Clustering of News Reports
Algorithms 2012, 5(3), 364-378; doi:10.3390/a5030364
Received: 29 June 2012 / Revised: 13 August 2012 / Accepted: 15 August 2012 / Published: 24 August 2012
Cited by 3 | PDF Full-text (224 KB) | HTML Full-text | XML Full-text
Abstract
When an event occurs in the real world, numerous news reports describing this event start to appear on different news sites within a few minutes of the event occurrence. This may result in a huge amount of information for users, and automated processes may be required to help manage this information. In this paper, we describe a clustering system that can cluster news reports from disparate sources into event-centric clusters, i.e., clusters of news reports describing the same event. A user can identify any RSS feed as a source of news he/she would like to receive, and our clustering system can cluster reports received from the separate RSS feeds as they arrive without knowing the number of clusters in advance. Our clustering system was designed to function well in an online incremental environment. In evaluating our system, we found that it performs very well at fine-grained clustering but rather poorly at coarser-grained clustering.

Journal Contact

MDPI AG
Algorithms Editorial Office
St. Alban-Anlage 66, 4052 Basel, Switzerland
algorithms@mdpi.com
Tel. +41 61 683 77 34
Fax: +41 61 302 89 18