Storage Space Allocation Strategy for Digital Data with Message Importance
Abstract
1. Introduction
2. System Model
2.1. Modeling Weighted Reconstruction Error Based on Message Importance
2.2. Modeling Distortion between the Raw Data and the Compressed Data
2.3. Problem Formulation
2.3.1. General Storage System
2.3.2. Ideal Storage System
2.3.3. Quantification Storage System
3. Optimal Allocation Strategy with Limited Storage Space
3.1. Optimal Allocation Strategy in General Storage System
3.2. Optimal Allocation Strategy in Ideal Storage System
 For the data with extremely small message importance, $\frac{\ln (1/{W}_{i})}{\ln r}$ is so large that the bottom of the pool lies above the water surface. Thus, the storage size of this kind of data is zero.
 For the data with small message importance, $\frac{\ln (1/{W}_{i})}{\ln r}$ is large, and therefore the bottom of the pool is high. Thus, the storage size of this kind of data is small.
 For the data with large message importance, $\frac{\ln (1/{W}_{i})}{\ln r}$ is small, and therefore the bottom of the pool is low. Thus, the storage size of this kind of data is large.
 For the data with extremely large message importance, $\frac{\ln (1/{W}_{i})}{\ln r}$ is so small that the water depth would exceed the original storage size; the storage size of this kind of data is therefore truncated to L.
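The four regimes above are the clamped water-filling picture; they can be summarized in one expression. The water-level symbol $\nu$ is introduced here only for illustration (it does not appear in this excerpt), and the budget constraint follows the definition of T as the maximum available average storage size:

```latex
l_i^* \;=\; \min\!\left\{ L,\; \max\!\left\{ 0,\; \nu - \frac{\ln(1/W_i)}{\ln r} \right\} \right\},
\qquad \text{where } \nu \text{ is chosen so that } \sum_{i=1}^{n} p_i\, l_i^* = T .
```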
3.3. Optimal Allocation Strategy in Quantification Storage System
Algorithm 1 Storage Space Allocation Algorithm 
Require: 
The message importance, $\mathit{W}=\{{W}_{i},i=1,2,\dots ,n\}$ (sorted so that ${W}_{1}\ge {W}_{2}\ge \dots \ge {W}_{n}$) 
The probability distribution of source, $\mathit{P}=\{{p}_{i},i=1,2,\dots ,n\}$ 
The original storage size, L and $\mathit{L}=\{{L}_{i}=L,i=1,2,\dots ,n\}=\{L,\dots ,L\}$ 
The maximum available average storage size, T 
The radix, r 
The auxiliary variables, ${K}_{min},{K}_{max}$ (Let ${K}_{min}=1,{K}_{max}=n$ as the original values) 
Ensure: 
The compressed storage size, $\mathit{l}=\{{l}_{i},i=1,\dots ,n\}$ 
Denote this recursive algorithm by $\varphi (\mathit{W},\mathit{P},L,T,r,{K}_{min},{K}_{max})$ 

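The body of the recursive procedure $\varphi$ is not reproduced in this excerpt. As an illustration only, the allocation it computes can be sketched by bisecting on the water level of the pool analogy in Section 3.2; the function name `allocate`, the bisection approach, and the water-level variable `nu` are assumptions for this sketch, not the paper's recursion:

```python
import math

def allocate(W, P, L, T, r, tol=1e-12):
    """Sketch: clamped water-filling allocation.

    Each class i gets l_i = clip(nu - ln(1/W_i)/ln(r), 0, L), where the
    water level nu is chosen so that the average storage sum_i p_i * l_i
    meets the budget T (assuming 0 <= T <= L).
    """
    bottoms = [math.log(1.0 / w) / math.log(r) for w in W]

    def avg_storage(nu):
        # Average storage used at water level nu (nondecreasing in nu).
        return sum(p * min(L, max(0.0, nu - b)) for p, b in zip(P, bottoms))

    # Bisection: at nu = min(bottoms) nothing is stored; at
    # nu = max(bottoms) + L every class is saturated at L.
    lo, hi = min(bottoms), max(bottoms) + L
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if avg_storage(mid) < T:
            lo = mid
        else:
            hi = mid
    nu = 0.5 * (lo + hi)
    return [min(L, max(0.0, nu - b)) for b in bottoms]
```

For example, with equal weights the budget is split evenly: `allocate([1.0, 1.0], [0.5, 0.5], L=8, T=4, r=2)` returns `[4.0, 4.0]` up to tolerance, while a larger weight on the first class shifts storage toward it.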
4. Property of Optimal Storage Strategy Based on Message Importance Measure
4.1. Normalized Message Importance Measure
4.1.1. Positive Importance Coefficient
4.1.2. Negative Importance Coefficient
4.2. Optimal Storage Size for Each Class
 (1) ${l}_{i}\ge {l}_{j}$ if ${p}_{i}<{p}_{j}$ for all $i,j\in \{1,2,\dots ,n\}$ when $\varpi >0$;
 (2) ${l}_{i}\le {l}_{j}$ if ${p}_{i}<{p}_{j}$ for all $i,j\in \{1,2,\dots ,n\}$ when $\varpi <0$.
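Both orderings follow from the pool bottoms $\ln (1/{W}_{i})/\ln r$: taking MIM-based weights of the assumed form ${W}_{i}={e}^{\varpi (1-{p}_{i})}$ (consistent with the definition of $L(\varpi ,\mathit{p})$ in the notation table, but an assumption of this sketch), the bottom for class $i$ is $-\varpi (1-{p}_{i})/\ln r$, which decreases in ${p}_{i}$ for $\varpi <0$ and increases for $\varpi >0$; a lower bottom means more storage. A minimal numeric check:

```python
import math

def pool_bottom(p_i, varpi, r=2):
    # Bottom of the water-filling pool for class i under the
    # assumed MIM weight W_i = exp(varpi * (1 - p_i)).
    w = math.exp(varpi * (1.0 - p_i))
    return math.log(1.0 / w) / math.log(r)

# Rarer class (smaller p_i) has a lower bottom, hence more storage, when varpi > 0 ...
assert pool_bottom(0.1, 2.0) < pool_bottom(0.4, 2.0)
# ... and a higher bottom, hence less storage, when varpi < 0.
assert pool_bottom(0.1, -2.0) > pool_bottom(0.4, -2.0)
```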
4.3. Relative Weighted Reconstruction Error
 (1) ${D}_{r}(\mathit{x},\varpi )$ is monotonically decreasing in $\varpi$ on $(0,+\infty )$;
 (2) ${D}_{r}(\mathit{x},\varpi )$ is monotonically increasing in $\varpi$ on $(-\infty ,0)$;
 (3) ${D}_{r}(\mathit{x},\varpi )\le {D}_{r}(\mathit{x},0)=({r}^{L-T}-1)/({r}^{L}-1)$.
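Reading property (3) with the minus signs restored, the $\varpi =0$ bound is $({r}^{L-T}-1)/({r}^{L}-1)$; under that assumed closed form, a quick sketch shows the bound vanishes when no compression is needed ($T=L$) and shrinks monotonically as the budget $T$ grows:

```python
def rwre_bound(L, T, r=2):
    # RWRE at varpi = 0, assuming the closed form
    # (r**(L - T) - 1) / (r**L - 1) from property (3).
    return (r ** (L - T) - 1) / (r ** L - 1)

# No compression (T = L) leaves no reconstruction error ...
assert rwre_bound(8, 8) == 0.0
# ... and the bound decreases monotonically as the budget T grows.
vals = [rwre_bound(8, T) for T in range(0, 9)]
assert all(a > b for a, b in zip(vals, vals[1:]))
```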
5. Property of Optimal Storage Strategy Based on Non-Parametric Message Importance Measure
6. Numerical Results
6.1. Success Rate of Compressed Storage in General Storage System
6.2. Optimal Storage Size Based on Message Importance Measure in Ideal Storage System
6.3. The Property of the RWRE Based on MIM in Ideal Storage System
6.4. The Property of the RWRE Based on Non-Parametric MIM in a Quantification Storage System
7. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
IoT  Internet of things 
MIM  message importance measure 
RWRE  relative weighted reconstruction error 
NMIM  non-parametric message importance measure 
Appendix A. Proof of Theorem 2
Appendix B. Proof of Lemma 1
Appendix C. Proof of Theorem 3
Appendix D. Proof of Theorem 4
References
Notation  Description 

$\mathit{x}={x}_{1},{x}_{2},\dots ,{x}_{k},\dots ,{x}_{K}$  The sequence of raw data 
$\widehat{\mathit{x}}={\widehat{x}}_{1},{\widehat{x}}_{2},\dots ,{\widehat{x}}_{k},\dots ,{\widehat{x}}_{K}$  The sequence of compressed data 
${S}_{x}$  The storage size of x 
${D}_{f}({S}_{x1},{S}_{x2})$  The distortion measure function between ${S}_{x1}$ and ${S}_{x2}$ in data reconstruction 
n  The number of event classes 
$\{{a}_{1},{a}_{2},\dots ,{a}_{n}\}$  The alphabet of raw data 
$\{{\widehat{a}}_{1},{\widehat{a}}_{2},\dots ,{\widehat{a}}_{n}\}$  The alphabet of compressed data 
$\mathit{W}=\{{W}_{1},{W}_{2},\dots ,{W}_{n}\}$  The error cost for the reconstructed data 
$\mathit{P}=\{{p}_{1},{p}_{2},\dots ,{p}_{n}\}$  The probability distribution of data class 
$D(\mathit{x},\mathit{W})$  The weighted reconstruction error 
${D}_{r}(\mathit{x},\mathit{W}),{D}_{r}(\mathit{W},\mathit{L},\mathit{l})$  The relative weighted reconstruction error 
$\mathit{L}={L}_{1},{L}_{2},\dots ,{L}_{n}$  The storage size of raw data 
$\mathit{l}={l}_{1},{l}_{2},\dots ,{l}_{n}$  The storage size of compressed data 
${l}_{i}^{*}$  The rounded optimal storage size of the data belonging to the ith class 
T  The maximum available average storage size 
$\varpi $  The importance coefficient 
${\gamma}_{p}$  ${\gamma}_{p}={\sum}_{i=1}^{n}{p}_{i}^{2}$ 
${\alpha}_{1}$, ${\alpha}_{2}$  ${\alpha}_{1}=arg{min}_{i}{p}_{i}$ and ${\alpha}_{2}=arg{max}_{i}{p}_{i}$ 
$L(\varpi ,\mathit{p})$  The message importance measure, which is given by $L(\varpi ,\mathit{p})=ln{\sum}_{i=1}^{n}{p}_{i}{e}^{\varpi (1-{p}_{i})}$ 
$\Delta $  The average compressed storage size of each data item, which is given by $\Delta =L-T$ 
${\Delta}^{*}(\delta )$  The maximum available $\Delta $ for the given supremum of the RWRE $\delta $ 
$\mathcal{L}(\mathit{P})$  The non-parametric message importance measure, which is given by $\mathcal{L}(\mathit{P})=ln{\sum}_{i=1}^{n}{p}_{i}{e}^{(1-{p}_{i})/{p}_{i}}$ 
Variable  Probability Distribution  $\mathit{\varpi}({\mathit{\gamma}}_{\mathit{p}}-{\mathit{p}}_{{\mathit{\alpha}}_{1}})/ln\mathit{r}$  $\mathit{\varpi}({\mathit{\gamma}}_{\mathit{p}}-{\mathit{p}}_{{\mathit{\alpha}}_{2}})/ln\mathit{r}$  $\mathit{L}(\mathit{\varpi},\mathit{P})+\mathit{\varpi}{\mathit{e}}^{-{\mathit{H}}_{2}(\mathit{P})}$ 

${P}_{1}$  $(0.01,0.02,0.03,0.04,0.9)$  5.7924  −0.6276  6.7234 
${P}_{2}$  $(0.003,0.007,0.108,0.132,0.752)$  4.2679  −1.1350  6.1305 
${P}_{3}$  $(0.001,0.001,0.001,0.001,0.996)$  7.1487  −0.0287  5.4344 
${P}_{4}$  $(0.021,0.086,0.103,0.378,0.412)$  2.2367  −0.5838  5.2530 
${P}_{5}$  $(0.2,0.2,0.2,0.2,0.2)$  0  0  5 
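The last three columns of the table can be reproduced numerically. The sketch below assumes $\varpi =5$ and $r=2$ (values inferred from the tabulated numbers, not stated in this excerpt) and uses the identity ${e}^{-{H}_{2}(\mathit{P})}={\gamma}_{p}$ for the Rényi entropy of order 2:

```python
import math

def mim(varpi, P):
    # Message importance measure L(varpi, p) = ln sum_i p_i * e^{varpi (1 - p_i)}.
    return math.log(sum(p * math.exp(varpi * (1.0 - p)) for p in P))

def table_columns(P, varpi=5.0, r=2.0):
    # Assumed parameter values varpi = 5, r = 2, inferred from the table.
    gamma_p = sum(p * p for p in P)          # gamma_p = sum_i p_i^2
    c3 = varpi * (gamma_p - min(P)) / math.log(r)
    c4 = varpi * (gamma_p - max(P)) / math.log(r)
    c5 = mim(varpi, P) + varpi * gamma_p     # e^{-H_2(P)} = gamma_p
    return c3, c4, c5

P1 = (0.01, 0.02, 0.03, 0.04, 0.9)
c3, c4, c5 = table_columns(P1)
# Matches the P1 row: 5.7924, -0.6276, 6.7234.
```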
Variable  Probability Distribution  ${\mathit{p}}_{{\mathit{\alpha}}_{1}}$  $\mathcal{L}(\mathit{P})$ 

${P}_{1}$  $(0.007,0.24,0.24,0.24,0.273)$  0.007  136.8953 
${P}_{2}$  $(0.007,0.009,0.106,0.129,0.749)$  0.007  136.8953 
${P}_{3}$  $(0.01,0.02,0.03,0.04,0.9)$  0.01  94.3948 
${P}_{4}$  $(0.014,0.086,0.113,0.375,0.412)$  0.014  66.1599 
${P}_{5}$  $(0.2,0.2,0.2,0.2,0.2)$  0.2  4.0000 
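The $\mathcal{L}(\mathit{P})$ column follows directly from the NMIM definition in the notation table; since the exponent $(1-{p}_{i})/{p}_{i}$ is largest for the rarest class, ${p}_{{\alpha}_{1}}$ dominates the sum (which is why ${P}_{1}$ and ${P}_{2}$ share the same value). A quick check:

```python
import math

def nmim(P):
    # Non-parametric message importance measure:
    # L(P) = ln sum_i p_i * exp((1 - p_i) / p_i).
    return math.log(sum(p * math.exp((1.0 - p) / p) for p in P))

# Uniform distribution: every term is 0.2 * e^4, so L(P) = 4.
assert round(nmim((0.2,) * 5), 4) == 4.0
# The rarest class dominates: p_alpha1 = 0.007 gives 136.8953 for P1.
assert round(nmim((0.007, 0.24, 0.24, 0.24, 0.273)), 4) == 136.8953
```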
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Liu, S.; She, R.; Zhu, Z.; Fan, P. Storage Space Allocation Strategy for Digital Data with Message Importance. Entropy 2020, 22, 591. https://doi.org/10.3390/e22050591