
Content Sharing Graphs for Deduplication-Enabled Storage Systems

IBM Research at Almaden, 650 Harry Road, San Jose, CA 95120, USA
Author to whom correspondence should be addressed.
Algorithms 2012, 5(2), 236-260
Received: 30 December 2011 / Revised: 28 March 2012 / Accepted: 29 March 2012 / Published: 10 April 2012
(This article belongs to the Special Issue Data Compression, Communication and Processing)


Deduplication in storage systems has gained momentum recently for its ability to reduce the data footprint. However, deduplication complicates storage management because storage objects (e.g., files) are no longer independent of each other once they share content. In this paper, we present a graph-based framework to address these storage management challenges. Specifically, we model content sharing among storage objects with content sharing graphs (CSGs) and apply graph-based algorithms to two real-world storage management use cases for deduplication-enabled storage systems. First, we develop a quasi-linear algorithm that partitions deduplication domains with a minimal amount of deduplication loss (i.e., data replicated across partitioned domains) in commercial deduplication-enabled storage systems, whereas the partitioning problem is NP-complete in general. For a real-world trace of 3 TB of data with 978 GB of removable duplicates, the proposed algorithm partitions the data into 15 balanced partitions with only 54 GB of deduplication loss, that is, a 5% deduplication loss. Second, we develop a quick and accurate method to query the deduplicated size of a subset of objects in a deduplicated storage system. For the same 3 TB trace, the optimized graph-based algorithm completes the query in 2.6 s, which is less than 1% of the time taken by the traditional algorithm based on the deduplication metadata.
Keywords: deduplication; storage systems; graph models; graph partitioning; k-core; subset query
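
The two quantities the abstract refers to, the deduplicated size of a subset of objects and the deduplication loss of a partitioning, can be illustrated with a minimal sketch. This is not the paper's graph-based algorithm; it is a naive metadata scan of the kind the abstract calls the traditional approach, included only to make the definitions concrete. The metadata layout (object name mapped to a list of chunk fingerprints and sizes), the function names, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical metadata layout: object name -> [(chunk fingerprint, chunk size in bytes)].
Metadata = Dict[str, List[Tuple[str, int]]]


def deduplicated_size(meta: Metadata, subset: Iterable[str]) -> int:
    """Bytes needed to store `subset` when every distinct chunk is kept once."""
    distinct: Dict[str, int] = {}
    for name in subset:
        for fingerprint, size in meta[name]:
            distinct[fingerprint] = size  # a repeated fingerprint is counted once
    return sum(distinct.values())


def deduplication_loss(meta: Metadata, partitions: Sequence[Sequence[str]]) -> int:
    """Extra bytes stored because chunks shared across partitions are replicated."""
    all_objects = [name for part in partitions for name in part]
    single_domain = deduplicated_size(meta, all_objects)
    per_partition = sum(deduplicated_size(meta, part) for part in partitions)
    return per_partition - single_domain


# Toy data (assumed): objects "a" and "b" share chunk "c2"; "c" shares nothing.
example_meta: Metadata = {
    "a": [("c1", 4), ("c2", 4)],
    "b": [("c2", 4), ("c3", 4)],
    "c": [("c4", 8)],
}

if __name__ == "__main__":
    print(deduplicated_size(example_meta, ["a", "b"]))             # 12
    print(deduplication_loss(example_meta, [["a", "b"], ["c"]]))   # 0
    print(deduplication_loss(example_meta, [["a", "c"], ["b"]]))   # 4
```

In the toy data, keeping the two objects that share a chunk in the same partition gives zero loss, while separating them replicates the shared 4-byte chunk. A scan like this touches every chunk record of every queried object, which is the cost a graph-based summary of content sharing aims to avoid.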


MDPI and ACS Style

Lu, M.; Constantinescu, C.; Sarkar, P. Content Sharing Graphs for Deduplication-Enabled Storage Systems. Algorithms 2012, 5, 236-260.

