Open Access Article

The Effects of Tabular-Based Content Extraction on Patent Document Clustering

EECS Department, Min H. Kao Building Suite 401, University of Tennessee, 1520 Middle Drive, Knoxville, TN 37996, USA
Catalyst Repository Systems, 1860 Blake Street, 7th Floor, Denver, CO 80202, USA
Author to whom correspondence should be addressed.
Algorithms 2012, 5(4), 490-505
Received: 1 July 2012 / Revised: 16 August 2012 / Accepted: 9 October 2012 / Published: 22 October 2012


Data can be represented in many different ways within a particular document or set of documents. Hence, attempts to automatically process the relationships between documents or determine the relevance of certain document objects can be problematic. In this study, we have developed software to automatically catalog objects contained in HTML files for patents granted by the United States Patent and Trademark Office (USPTO). Once these objects are recognized, the software creates metadata that assigns a data type to each document object. Such metadata can be easily processed and analyzed for subsequent text mining tasks. Specifically, document similarity and clustering techniques were applied to a subset of the USPTO document collection. Although our preliminary results demonstrate that tables and numerical data do not provide quantifiable value to a document’s content, this work sets the stage for future efforts to measure the importance of document objects within a large corpus.
Keywords: text mining; patent documents; table data
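
The pipeline summarized in the abstract (catalog the objects in an HTML patent file, assign each a data type, then compare documents) can be illustrated with a minimal sketch. This is not the authors' software: the class `ObjectCataloger`, the helper functions, and the type labels `text`, `table_text`, and `table_numeric` are all hypothetical names chosen for illustration, and cosine similarity over raw term counts is assumed here as one simple similarity measure. Only the Python standard library is used.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: a table cell counts as numeric if it looks like a plain number.
NUMERIC = re.compile(r"^[-+]?\d[\d,]*(\.\d+)?%?$")


class ObjectCataloger(HTMLParser):
    """Hypothetical cataloger: records (data_type, text) for each text node."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0   # > 0 while inside one or more <table> elements
        self.objects = []      # list of (data_type, text) metadata records

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.table_depth > 0:
            kind = "table_numeric" if NUMERIC.match(text) else "table_text"
        else:
            kind = "text"
        self.objects.append((kind, text))


def catalog(html):
    """Return the typed object list for one HTML document."""
    parser = ObjectCataloger()
    parser.feed(html)
    parser.close()
    return parser.objects


def term_vector(objects, keep_types):
    """Count terms, using only objects whose data type is in keep_types."""
    words = []
    for kind, text in objects:
        if kind in keep_types:
            words.extend(re.findall(r"[a-z0-9]+", text.lower()))
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

Under these assumptions, the effect of tabular content could be probed by building each document's vector twice, once with `keep_types={"text"}` and once with all three types, and comparing the resulting pairwise similarities before clustering.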

This is an open access article distributed under the Creative Commons Attribution License (CC BY 3.0).

MDPI and ACS Style

Koessler, D.R.; Martin, B.W.; Kiefer, B.E.; Berry, M.W. The Effects of Tabular-Based Content Extraction on Patent Document Clustering. Algorithms 2012, 5, 490-505.

Algorithms, EISSN 1999-4893, published by MDPI AG, Basel, Switzerland.