# Continuous Learning Graphical Knowledge Unit for Cluster Identification in High Density Data Sets


## Abstract


… $2^{24}$ (over 16 million) overlaps. A higher number of overlaps at the same location makes the colour of this area identical, which can be identified by the naked eye. A bitmap is a matrix of colour values that can be represented as integers. The proposed method updates this matrix while adding new points. Thus, this matrix can be considered an up-to-date knowledge unit of the processed data. The results show the cluster generation, cluster identification, missing and out-of-range data visualization, and outlier detection capabilities of the proposed method.

## 1. Introduction

… $2^{n}$ colours. Thus, the 8-bit RGB and 8-bit RGBA formats can represent $2^{24}$ and $2^{32}$ different colours, respectively. If the whole bit series of a pixel is considered as a single channel, then each pixel is a base-256 number (e.g., dark green in RGB: (0, 100, 0) = 25,600). Thus, a bitmap is a matrix that contains numerical values. The visual representations of these numbers indicate different colours. If these numbers are used to represent up-to-date processed data, the bitmap becomes an up-to-date knowledge unit. This is a different usage of bitmaps for representing data.

#### 1.1. Related Work

## 2. Methodology

… $2^{n}$ different values; thus, a pixel is a memory cell or a knowledge cell. Therefore, we introduce a bitmap that forms a graphical knowledge unit (GKU) out of knowledge cells to represent data and information.
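The idea of a pixel as a knowledge cell can be illustrated with a minimal sketch. It assumes that each new data point adds the marker's integer colour value to every pixel the marker covers, so that a pixel's value grows with the number of overlaps. The update rule, the function names `new_gku` and `add_point`, and the canvas size are illustrative assumptions, not the authors' implementation.

```python
def new_gku(width, height):
    """Return a height x width matrix of integer colour values, all 0 (black)."""
    return [[0] * width for _ in range(height)]


def add_point(gku, cx, cy, radius=10, marker_value=1):
    """Add marker_value to every pixel inside a circle centred on (cx, cy).

    Assumption: overlapping markers accumulate by integer addition.
    """
    height, width = len(gku), len(gku[0])
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                gku[y][x] += marker_value


# Points can be added one at a time; the matrix always reflects all data seen so far.
gku = new_gku(64, 64)
for point in [(30, 30), (32, 30), (34, 31)]:
    add_point(gku, *point, radius=5, marker_value=1)
```

Because the matrix is updated in place, no earlier data point has to be revisited when a new one arrives.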

#### 2.1. Colour Coding Method

If $C_V$ is the single integer colour value of a pixel, $R_V$ is the red colour value, $G_V$ is the green colour value and $B_V$ is the blue colour value, then

$$C_V = R_V \times 256^{2} + G_V \times 256^{1} + B_V \times 256^{0}. \qquad (1)$$

If $C_2 = R_V$, $C_1 = G_V$ and $C_0 = B_V$, then …

… ($1 \times 256^{2} + 2 \times 256^{1} + 3 \times 256^{0}$). In addition, when $C_V$ = 66,051, according to Equation (2), $C_2 = 1$, $C_1 = 2$ and $C_0 = 3$. Because $C_2 = R_V$, $C_1 = G_V$ and $C_0 = B_V$, the RGB representation of 66,051 is (1, 2, 3). Note that only these two equations are used for density calculation and density cluster formation.
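The two conversions can be sketched in a few lines of Python. The encoding follows the base-256 sum given above. Equation (2) is not reproduced in this text, so the decoding below assumes the standard base-256 digit extraction, which yields the values quoted for 66,051. The function names are illustrative.

```python
def rgb_to_value(r, g, b):
    """Combine three 8-bit channels into one integer colour value (base 256)."""
    return r * 256**2 + g * 256**1 + b * 256**0


def value_to_rgb(value):
    """Split an integer colour value into (C2, C1, C0) = (R, G, B).

    Assumption: plain base-256 digit extraction.
    """
    c2, rest = divmod(value, 256**2)
    c1, c0 = divmod(rest, 256)
    return c2, c1, c0


example_value = rgb_to_value(1, 2, 3)      # 66051
example_rgb = value_to_rgb(example_value)  # (1, 2, 3)
```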

#### 2.2. Data Preparation

#### 2.3. Visualization of Missing and Out of Range Values

#### 2.4. Embed GKU Specific Information into Bitmap

… $10^{-2}$). Thus, any number can be represented with two pixels (one for the integer part and the other for the power of ten) in the signed 24-bit single pixel format. Finally, the structure of the GKU specific data is designed as shown in Table 3 using relevant number representation techniques.
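The two-pixel representation can be sketched as follows. The split into an integer part and a power of ten follows the text. Storing a signed value in a 24-bit pixel as two's complement is an assumption made here for illustration, as are all function names.

```python
from decimal import Decimal

INT24_MIN, INT24_MAX = -(2**23), 2**23 - 1


def to_pixel_pair(text):
    """Split a decimal string into (integer part, power of ten)."""
    number = Decimal(text)
    if not number.is_finite():
        raise ValueError("only finite numbers can be stored")
    sign, digits, exponent = number.as_tuple()
    integer_part = int("".join(map(str, digits)))
    if sign:
        integer_part = -integer_part
    for part in (integer_part, exponent):
        if not INT24_MIN <= part <= INT24_MAX:
            raise ValueError("value does not fit a signed 24-bit pixel")
    return integer_part, exponent


def from_pixel_pair(integer_part, exponent):
    """Rebuild the number as integer_part x 10**exponent."""
    return Decimal(integer_part).scaleb(exponent)


def int24_to_pixel(value):
    """Assumed storage: two's complement in 24 bits."""
    return value & 0xFFFFFF


def pixel_to_int24(pixel):
    """Inverse of int24_to_pixel."""
    return pixel - 2**24 if pixel >= 2**23 else pixel
```

With this sketch, `to_pixel_pair("2.23")` gives `(223, -2)`, the same pair that appears in the example column of Table 3.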

#### 2.5. GKU Evaluation Method

## 3. Results and Discussion

#### 3.1. Reading GKUs

… ($C_V$) of RGB colours (0, 0, 255) and (0, 1, 0) are 255 and 256, respectively. RGB colour (0, 0, 255) is blue, whereas RGB colour (0, 1, 0), the next colour value after (0, 0, 255), is visually black. This creates a sudden change in colour from blue to black, even though the difference between the two colour values is 1. The table in Figure 7 shows the RGB values of the inner border (blue side) of each contour line. Usually, all colour borders are visually the same. However, the green colour values ($G_V$) of those lines maintain a constant difference of one between adjacent colour borders, which resembles contour lines (Figure 7). We numbered the contour lines from the outside to the inside (i.e., 1, 2, 3, …) and observed that contour lines with the same number have nearly the same colour values (same green value and nearly the same blue value) (Figure 7). Thus, it is possible to compare the density of different clusters even without a colour scale or legend. This is another advantage of the GKU over existing clustering methods.
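The jump from blue to black and the role of the green channel can be checked with a short sketch. It assumes the base-256 colour coding of Section 2.1; reading the contour number directly from the integer colour value is an illustrative interpretation, and the function names are hypothetical.

```python
def channels(value):
    """Return (R, G, B) for a 24-bit integer colour value."""
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


def contour_level(value):
    """Number of colour borders crossed so far.

    Each time the blue channel wraps past 255, the value crosses a multiple
    of 256. While the red channel is still 0 this equals the green channel.
    """
    return value // 256


blue_side = channels(255)    # (0, 0, 255): bright blue
black_side = channels(256)   # (0, 1, 0): visually black
```

Two pixels with the same `contour_level` lie between the same pair of colour borders, which is why clusters can be compared by counting borders.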

#### 3.2. Anytime Cluster Formation

#### 3.3. Representation of Missing and Out of Range Values and GKU Specific Data

#### 3.4. GKU as an Outlier Detection Method

… $2^{n}$, where $n$ is the total bit length of the colour format. This drawback can be overcome by using colour formats with higher bit length.

… $2^{64}$ overlapping incidents.
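The effect of the bit length on capacity can be estimated with a small calculation. This is a sketch under the assumption that every overlap adds the marker's integer colour value to a pixel, so the largest storable value, $2^{n} - 1$, bounds the number of overlaps. The function name is illustrative.

```python
def max_overlaps(bit_length, marker_value=1):
    """Largest number of overlaps a pixel can hold before its value overflows.

    Assumption: each overlap adds marker_value to the pixel.
    """
    return (2**bit_length - 1) // marker_value


capacity_24_bit = max_overlaps(24)          # marker colour (0, 0, 1)
capacity_24_bit_254 = max_overlaps(24, 254) # marker colour (0, 0, 254)
capacity_64_bit = max_overlaps(64)
```

A brighter marker is easier to see after a few overlaps but uses up the available range faster.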

## 4. Conclusions

## Supplementary Materials

## Acknowledgments

## Author Contributions

## Conflicts of Interest


**Figure 1.** Definition of marker in graphical knowledge unit (GKU): The marker is a circle (radius = 10 pixels), and the RGB colour value of the circle is (0, 0, X), where 0 ≤ X ≤ 255. The centre of the circle represents the data point (highlighted).

**Figure 2.** (**A**) Two overlapped markers; and (**B**) overlapped markers. The data point is represented by the pixel in the centre of the marker (the data point is highlighted in orange).

**Figure 3.** GKU with borders to record missing and out of range values: (1) $y > y_{max}$ at x = x; (2) $x > x_{max}$ and $y > y_{max}$; (3) $x > x_{max}$ at y = y; (4) $x > x_{max}$ and $y < y_{min}$; (5) $y < y_{min}$ at x = x; (6) $x < x_{min}$ and $y < y_{min}$; (7) $x < x_{min}$ at y = y; (8) $x < x_{min}$ and $y > y_{max}$; (9) y is missing and $x < x_{min}$; (10) y is missing at x = x; (11) y is missing and $x > x_{max}$; (12) both x and y are missing; (13) x is missing and $y > y_{max}$; (14) x is missing at y = y; and (15) x is missing and $y < y_{min}$. * Shading is used in the figure to highlight different areas. In the real GKU, there will be no shading.
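The fifteen border regions listed in the caption can be expressed as a lookup on the state of each coordinate. The sketch below is an illustration only: it uses `None` for a missing value, returns 0 for a point inside the valid range, and reads "at x = x" as "x is within range". The function name is hypothetical.

```python
def border_region(x, y, x_min, x_max, y_min, y_max):
    """Return the Figure 3 region number (1-15), or 0 for an in-range point."""

    def state(value, low, high):
        if value is None:
            return "missing"
        if value < low:
            return "low"
        if value > high:
            return "high"
        return "ok"

    # Keys are (state of x, state of y), numbered as in the caption.
    regions = {
        ("ok", "ok"): 0,
        ("ok", "high"): 1, ("high", "high"): 2, ("high", "ok"): 3,
        ("high", "low"): 4, ("ok", "low"): 5, ("low", "low"): 6,
        ("low", "ok"): 7, ("low", "high"): 8,
        ("low", "missing"): 9, ("ok", "missing"): 10, ("high", "missing"): 11,
        ("missing", "missing"): 12,
        ("missing", "high"): 13, ("missing", "ok"): 14, ("missing", "low"): 15,
    }
    return regions[(state(x, x_min, x_max), state(y, y_min, y_max))]


limits = dict(x_min=65, x_max=90, y_min=2.23, y_max=9.055)
region = border_region(100, None, **limits)  # y missing and x > x_max
```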

**Figure 4.** GKUs for the same data set with 35,620 data points using a circle as the marker with different sizes and colours. The correct selection of shape, size and initial colour of the data point will produce clusters that are visually clear and separated by colour borders similar to contour lines. For the data set of the plots in this figure, see Supplementary Materials, File S1. Marker diameter and RGB colour per panel: (**A**) 5 pixels, (0, 0, 1); (**B**) 5 pixels, (0, 0, 5); (**C**) 5 pixels, (0, 0, 10); (**D**) 10 pixels, (0, 0, 1); (**E**) 10 pixels, (0, 0, 5); (**F**) 10 pixels, (0, 0, 10); (**G**) 20 pixels, (0, 0, 1); (**H**) 20 pixels, (0, 0, 5); (**I**) 20 pixels, (0, 0, 10).

**Figure 5.** GKUs for the same data set with 35,620 data points using a square as the marker with different sizes and colours. The correct selection of shape, size and initial colour of the data point will produce clusters that are visually clear and separated by colour borders similar to contour lines. For the data set of the plots in this figure, see Supplementary Materials, File S1. Marker side length and RGB colour per panel: (**A**) 10 pixels, (0, 0, 1); (**B**) 10 pixels, (0, 0, 5); (**C**) 10 pixels, (0, 0, 10); (**D**) 20 pixels, (0, 0, 1); (**E**) 20 pixels, (0, 0, 5); (**F**) 20 pixels, (0, 0, 10); (**G**) 40 pixels, (0, 0, 1); (**H**) 40 pixels, (0, 0, 5); (**I**) 40 pixels, (0, 0, 10).

**Figure 6.** Relation between bitmap and matrix versions of a GKU. A GKU matrix is a simple way to represent the same GKU. Marker: circle, radius: 10 pixels, marker colour: (0, 0, 254). The table shows the colour values of 10 × 10 pixels in the bitmap, which is a portion of the GKU matrix.

**Figure 7.** Contour lines in a GKU. Representation of 35,620 data points; marker: circle, radius: 20 pixels, marker colour: (0, 0, 1). This shows clear colour borders that can be considered as contour lines. Contour lines are numbered from the outside to the inside of the cluster. Contour lines with the same contour line number have the same green channel value. For such contour lines, blue channel values are in the same range. The higher the number of contour lines, the higher the data density. Therefore, it is possible to understand cluster density without a colour scale or legend. For the data set of the plots in this figure, see Supplementary Materials, File S1.

**Figure 8.** Development of a GKU over time. Bitmaps (**A**–**D**) show GKUs with 5000, 10,000, 20,000 and 35,620 data points, respectively. Marker: circle, radius: 10 pixels, initial colour of the data point: (0, 0, 1). For the data set of the plots in this figure, see Supplementary Materials, File S1.

**Figure 9.** Representation of 35,864 data points in a GKU with borders to record missing values, out of range values and GKU specific information. Marker: circle, radius: 20 pixels, colour of the data point: (0, 0, 50). * Refer to Figure 3 for structure information and usage. ** Refer to Table 3 for structure information about the GKU specific information. For the data set of the plots in this figure, see Supplementary Materials, File S2.

**Figure 10.** Outlier identification using GKU by defining a border manually. Areas with low colour values are defined as outliers (noise) and vice versa. Shape of the data point: circle, radius: 20 pixels, colour of the data point: (0, 0, 254). For the data set of the plots in this figure, see Supplementary Materials, File S1.

**Figure 11.** Visualization of 35,620 data points with: (**A**) scatter plot; (**B**) heat map; and (**C**) contour plot. The scatter plot shows the distribution of the data, whereas the heat map and the contour plot show density clusters. However, compared to the GKU, the heat map and contour plot do not show the density clusters as clearly. For the data set of the plots in this figure, see Supplementary Materials, File S1.

**Table 1.** Transformation rules used to convert numbers into integers. Depending on the nature of the data, a combination of two or more techniques may be required to achieve a data set suitable to plot on a bitmap.

| Value Type | Transformation Technique |
|---|---|
| Negative integer values | Baseline correction. This will convert all negative values to positive values while maintaining the same regression. |
| Very large values | Baseline correction. This will convert large numbers to small numbers while maintaining the same regression. |
| Decimal values | Multiplication by $10^{d}$ ($d \in \{1, 2, 3, \ldots\}$). This will convert decimal values to integers (we named $d$ the “decimal to integer factor”). |
| Small or large range | Scale up or down. This will change the range. |
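The transformations in Table 1 can be combined into one small preparation step. The sketch below is one possible ordering (decimal to integer, then baseline correction, then scaling); the ordering, the use of the minimum as the baseline, and the function name `prepare` are assumptions for illustration.

```python
def prepare(values, decimals=0, scale=1):
    """Turn raw values into non-negative integers suitable as pixel coordinates.

    decimals: the "decimal to integer factor" d (values are multiplied by 10**d).
    scale:    scale up/down factor applied after baseline correction.
    """
    # Decimal values -> integers.
    as_integers = [round(v * 10**decimals) for v in values]
    # Baseline correction: shift so the smallest value becomes 0 (assumed baseline).
    baseline = min(as_integers)
    shifted = [v - baseline for v in as_integers]
    # Scale up or down to fit the bitmap size.
    return [round(v * scale) for v in shifted]


coordinates = prepare([-1.5, 0.0, 2.25], decimals=2)  # [0, 150, 375]
```

The factors used here would need to be stored with the GKU so that pixel coordinates can be mapped back to the original values.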

**Table 2.** Bitmap header (example of an m × n pixel bitmap with a 24-bit red, green, and blue (RGB) colour scheme). * This unused slot is used to store the offset of the graphical knowledge unit (GKU) specific data. BM: a value in the bitmap header.

| Header Section | Offset | Size (Bytes) | Value | Description |
|---|---|---|---|---|
| Bitmap (BMP) header (14 Bytes) | 0 | 2 | “BM” | Identification (ID) field |
| | 2 | 4 | Size of BMP header, DIB header, and image | Size of the BMP file |
| | 6 | 2 | Unused * | Application specific |
| | 8 | 2 | Unused | Application specific |
| | 10 | 4 | 54 Bytes (14 + 40) | Offset where the pixel array (bitmap data) can be found |
| Device-independent bitmap (DIB) header | 12 | 40 Bytes | | |
| | … | | | |
| | 50 | | | |
| Bitmap data | 51 | m × n × 4 Bytes | | |
| | … | | | |
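Storing the GKU offset in the first unused slot can be sketched with Python's `struct` module. The 14-byte field layout follows the BMP header rows of Table 2 (little-endian, as in the BMP format). The function names are illustrative, and because the slot is only 2 bytes wide, this sketch can hold offsets up to 65,535.

```python
import struct

# ID field, file size, unused slot 1, unused slot 2, pixel-array offset.
BMP_HEADER = struct.Struct("<2sIHHI")


def build_header(file_size, gku_offset, pixel_offset=54):
    """Pack a BMP header with the GKU data offset K in the first unused slot."""
    return BMP_HEADER.pack(b"BM", file_size, gku_offset, 0, pixel_offset)


def read_gku_offset(header_bytes):
    """Read K back from the first unused slot (bytes 6-7)."""
    magic, _size, gku_offset, _unused, _pixels = BMP_HEADER.unpack(header_bytes[:14])
    if magic != b"BM":
        raise ValueError("not a BMP header")
    return gku_offset


header = build_header(file_size=1000, gku_offset=300)
stored_offset = read_gku_offset(header)  # 300
```

Standard image viewers ignore the two unused slots, so a file written this way should still open as an ordinary bitmap.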

**Table 3.** Example of GKU specific data layout. The value K is the starting location (offset) of the GKU specific data area. The value of K is stored in the first unused slot of the bitmap header.

| GKU Specific Data | Offset of Pixels | No. of Pixels | Content in the Pixels, According to the Order | Pixel Format Used to Store Information | Example |
|---|---|---|---|---|---|
| Properties of point marker | K | 3 | Data point = Circle (1 = circle, 2 = square, …), radius of the circle, colour of the circle. | unsigned 24-bit pixel format | 1, 10, 1 |
| Border widths | K + 1 | 5 | Out of range border, missing value border, GKU specific data border, border padding, offset. | unsigned 24-bit pixel format | 10, 10, 10, 1, 10 |
| X value information | K + 2 | 8 | Minimum value, maximum value, decimal to integer factor, scale up/down factor. | two signed 24-bit pixel format | (65, 0), (90, 0), (10, 0), (2, 0) |
| Y value information | K + 3 | 8 | Minimum value, maximum value, decimal to integer factor, scale up/down factor. | two signed 24-bit pixel format | (223, −2), (9055, −3), (10, 0), (3, 0) |

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Adikaram, K.K.L.B.; Hussein, M.A.; Effenberger, M.; Becker, T.
Continuous Learning Graphical Knowledge Unit for Cluster Identification in High Density Data Sets. *Symmetry* **2016**, *8*, 152.
https://doi.org/10.3390/sym8120152
