# What Is (Not) Big Data Based on Its 7Vs Challenges: A Survey

## Abstract

## 1. Introduction

## 2. Data Mining versus KDD versus Big Data

#### 2.1. Data Mining

- Apply clustering to group data into groups of similar elements.
- Search for an explanatory or predictive pattern for a target attribute in terms of other attributes.
- Search for frequent patterns and sub-patterns.
- Search for trends, deviations and interesting correlations between attributes.
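
As an illustration of the first task in the list above, the following is a minimal sketch of clustering on two-dimensional points. It uses the k-means procedure (Lloyd's iterations), which is only one of many clustering techniques and is chosen here as an assumption for the example; the function name `kmeans`, its parameters, and the sample points are hypothetical and do not come from the surveyed works.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Group 2-D points into k clusters of similar elements (Lloyd's iterations)."""
    rng = random.Random(seed)
    # Start from k distinct points picked at random as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins the cluster of its nearest centroid.
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical sample: two visibly separated groups of points.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(sample, k=2)
```

Similarity is measured here with the squared Euclidean distance; other distance measures would lead to different groupings.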

#### 2.1.1. Data Analysis versus Data Mining

#### 2.1.2. Patterns in Data Mining

#### 2.1.3. Classification in the Methods of Data Mining

#### 2.1.4. Data Mining Applications

#### 2.2. Knowledge Discovery in Databases

#### 2.2.1. The Term KDD

#### 2.2.2. Phases of KDD

- **Data pre-processing.** It has three steps [6].
  - **Reduction** of the dimension through feature selection and the sampling of data useful for the intended purpose, which reduces the number of variables to be considered [26].
  - **Cleaning** of data to eliminate noise generated by different data types, extreme values, and missing values caused by default or non-compulsory values [7].
- **Choose the right task or Data Mining method.** This can be classification, regression, clustering or summarisation [6].
- **Choose the Data Mining algorithm** by selecting the specific method to be used for pattern searching [6]. A data mining algorithm is a set of heuristic calculations and rules that allow a model to be created from data [31]. Examples include artificial neural networks, support vector machines, Bayesian networks, decision trees, and different clustering or regression algorithms. This phase is difficult because different algorithms can be used to perform the same work, but each will give a different output [31].
- **Use the chosen Data Mining algorithm** [6].
- **Evaluation and interpretation of the extracted patterns.** This may mean having to iterate again over the previous phases. In addition, this phase may involve visualising the extracted patterns or data [26].
- **Display the pattern found in another dataset** for use and testing, and/or document the pattern [6].
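
The phases described above can be sketched end to end on a toy example. This is only an illustrative sketch under stated assumptions: the records, the attribute names (`id`, `temp`, `alarm`), and the helper functions are hypothetical, the task is assumed to be classification, and the "algorithm" is a deliberately simple threshold rule (midpoint between the class means) standing in for any of the algorithms named in the text.

```python
from statistics import mean


def reduce_attributes(rows, keep):
    """Reduction: keep only the attributes useful for the intended purpose."""
    return [{name: row[name] for name in keep} for row in rows]


def clean(rows):
    """Cleaning: drop records that contain missing values."""
    return [row for row in rows if None not in row.values()]


def fit_threshold(rows, feature, label):
    """Algorithm (assumed stand-in): midpoint between the two class means."""
    positives = [row[feature] for row in rows if row[label]]
    negatives = [row[feature] for row in rows if not row[label]]
    return (mean(positives) + mean(negatives)) / 2


def accuracy(threshold, rows, feature, label):
    """Evaluation: share of records whose predicted label matches the real one."""
    hits = sum((row[feature] > threshold) == row[label] for row in rows)
    return hits / len(rows)


# Hypothetical raw records; the record with id 3 has a missing value.
training = [
    {"id": 1, "temp": 10, "alarm": False},
    {"id": 2, "temp": 12, "alarm": False},
    {"id": 3, "temp": None, "alarm": False},
    {"id": 4, "temp": 30, "alarm": True},
    {"id": 5, "temp": 32, "alarm": True},
]
# A second hypothetical dataset, used to test the pattern that was found.
other = [
    {"id": 6, "temp": 15, "alarm": False},
    {"id": 7, "temp": 28, "alarm": True},
]

prepared = clean(reduce_attributes(training, ["temp", "alarm"]))   # pre-processing
threshold = fit_threshold(prepared, "temp", "alarm")               # use the algorithm
score = accuracy(threshold, reduce_attributes(other, ["temp", "alarm"]),
                 "temp", "alarm")                                  # another dataset
```

A different algorithm applied to the same prepared data would generally produce a different model, which is why the choice of algorithm is treated as a separate phase.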

#### 2.2.3. KDD Applications

#### 2.3. Big Data

#### 2.3.1. Definitions of Companies and Academics

#### 2.3.2. Big Data Applications

## 3. Literature Review

#### 3.1. Methodology

- IEEE: IEEE Xplore (https://ieeexplore.ieee.org/Xplore/home.jsp, accessed on 29 November 2022).
- SD: ScienceDirect–Elsevier (http://www.elsevier.com, accessed on 29 November 2022).
- Wiley (https://onlinelibrary.wiley.com/, accessed on 29 November 2022).

#### 3.2. Articles by Database and Type

IEEE:

- 3V: 2013 to 2022: 27 Conferences, 3 Magazines, 1 Journal.
  - 2 conferences have been removed because they are about other fields.
- 4V: 2014 to 2022: 25 Conferences, 1 Magazine, 1 Book, 2 Journals.
  - 3 conferences have been removed because they are about other fields, and 1 more for being an editorial.
- 5V: 2014 to 2022: 18 Conferences, 3 Journals.
  - 3 conferences, 2 journals, and 1 magazine have been removed because they are about other fields.
- 6V: 2020: 1 Magazine.
  - 1 conference talks about 10 ‘Vs’.
- 7V: 2014 to 2021: 3 Conferences.
- 8V: 2019: 1 Conference.
  - 3 have been removed because they are about other fields.
- 9V: 0.
- 10V: 2021: 1 Conference.
  - It has appeared in the 6 ‘Vs’ search.

SD:

- 3V: 2018 to 2019: 3 Research articles, 1 Book chapter.
- 4V: 2016 to 2019: 1 Review article, 5 Research articles.
- 5V: 2016 to 2022: 1 Review article, 1 Research article.
- 6V, 7V, 8V, and 9V: 0.

Wiley:

- 3V: 2017 to 2022: 2 Books and 1 Journal.
- 4V: 2019: 1 Journal.
  - 1 has been removed because it is about other fields.
- 5V: 2017: 1 Journal.
  - 3 have been removed: 1 is about Big Data but not about the ‘Vs’, and 2 are about other fields.
- 6V: 2022: 1 Book.
  - 1 has been removed because it is about Big Data but not about the ‘Vs’.
- 7V: 0.
  - 2 have been removed because they are Issues and not about Big Data.
- 8V: 0.
  - 1 has been removed because it is an Issue and not about Big Data.
- 9V: 0.
  - 1 has been removed because it is about Big Data but not about the ‘Vs’.

#### 3.3. Results and Discussion

## 4. Challenges According to the ‘7Vs’ of Big Data

#### 4.1. Volume

[…] (10^{12}) daily to obtain useful information for their businesses and about their users. The amount of data keeps growing because every day there is more information and there are more users, which, as some have already estimated, implies exponential growth. It is estimated that there will be 500 times more data in 2020 than in 2011 [102].
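
To make the "500 times more data in 2020 than in 2011" estimate concrete, the short sketch below derives the yearly growth factor and the doubling time it implies. It assumes, purely for illustration, that growth is a constant exponential over the nine years between 2011 and 2020; the function names are hypothetical.

```python
import math


def annual_growth_factor(total_factor, years):
    """Constant yearly multiplier g such that g ** years == total_factor."""
    return total_factor ** (1 / years)


def doubling_time_years(total_factor, years):
    """Time needed to double under the same constant exponential growth."""
    return years * math.log(2) / math.log(total_factor)


# 500-fold growth between 2011 and 2020, i.e. over 9 years (assumed constant rate).
yearly = annual_growth_factor(500, 2020 - 2011)
doubling = doubling_time_years(500, 2020 - 2011)
```

Under that assumption the yearly factor is slightly below 2, so a 500-fold increase in nine years corresponds to the data volume roughly doubling every year.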

[…] (10^{50}) per second or 562 terabytes (10^{12}) per second. As can be seen, the future will bring even more data, with huge amounts to analyse. Some estimates indicate that the amount of new information will double every three years [18].

[…] (10^{21}) of electronic data are created. The equivalent in terabytes is 1,200,000,000, ranging from scientific experiments to telescope data and tweets [16]. This is corroborated by other estimates made in 2012 [1], which predicted the creation of 2.5 exabytes (10^{18}) each day, equivalent to 2,500,000 terabytes, with this creation capacity doubling approximately every 40 months. Thus, this prediction is quite similar to that made by John Paul Holdren.
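
The unit conversions and the 40-month doubling quoted above can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) prefixes, as used in the text, and a 365-day year; the constant and function names are hypothetical.

```python
TERABYTE = 10 ** 12
EXABYTE = 10 ** 18
ZETTABYTE = 10 ** 21


def to_terabytes(n_bytes):
    """Convert a byte count to terabytes using decimal (SI) prefixes."""
    return n_bytes / TERABYTE


def projected_daily_volume(initial, months_elapsed, doubling_months=40):
    """Daily volume after some months if it doubles every `doubling_months`."""
    return initial * 2 ** (months_elapsed / doubling_months)


daily_2012 = 2.5 * EXABYTE          # 2.5 exabytes created each day [1]
yearly_2012 = 1.2 * ZETTABYTE       # 1.2 zettabytes created each year [16]

daily_in_tb = to_terabytes(daily_2012)      # 2,500,000 terabytes
yearly_in_tb = to_terabytes(yearly_2012)    # 1,200,000,000 terabytes

# Spreading the yearly estimate over 365 days gives a daily figure in exabytes
# that can be compared with the 2.5 exabytes per day of the other estimate.
yearly_as_daily_eb = yearly_2012 / 365 / EXABYTE

# With a doubling every 40 months, the daily volume is 5 exabytes after
# 40 months and 10 exabytes after 80 months.
after_40_months_eb = projected_daily_volume(daily_2012, 40) / EXABYTE
after_80_months_eb = projected_daily_volume(daily_2012, 80) / EXABYTE
```

Dividing the yearly estimate by 365 gives about 3.3 exabytes per day, the same order of magnitude as the 2.5 exabytes per day of the second estimate, which is what makes the two predictions comparable.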

[…] (10^{15}) of information per day. This meant that if a photo were taken every second and a person were asked to analyse all these photos, even working nights and weekends, it would take many years to analyse all the photos of a single day [7]. This helps to emphasise that now, 26 years later and with improved hardware and software, we can perform faster, automatic analyses, even on images with better resolution and more data.

#### 4.2. Velocity: Reading and Processing

#### 4.3. Variety

#### 4.4. Veracity: Origin, Veracity and Validity

#### 4.5. Variability: Structure, Time Access and Format

#### 4.6. Value

#### 4.7. Visualisation

## 5. Conclusions and Future Work

Reference | 1 Domain | 2 Selection/Data Collection | 3 Pre-Processing: Data Reduction | 3 Pre-Processing: Cleaning | 3 Pre-Processing: Transformation | 4 Data Mining: Method | 5 Data Mining: Algorithm | 6 Data Mining: Use | 7 Interpretation and Evaluation | 8 Another Dataset |
---|---|---|---|---|---|---|---|---|---|---|
[87] | ? | | | | | | | | | |
[88] | ? | X | X | X | X | X | X | X | | |
[89] | ? | | | | | | | | | |
[90] | ? | X | X | X | X | X | X | X | X | ? |
[91] | ? | X | X | X | X | X | | | | |
[92] | ? | X | X | X | X | X | | | | |
[94] | ? | X | X | | | | | | | |

Database\Vs | 3Vs | 4Vs | 5Vs | 6Vs | 7Vs | 8Vs | 9Vs | 10Vs | Total |
---|---|---|---|---|---|---|---|---|---|
IEEE | 31 | 29 | 21 | 1 | 3 | 1 | 0 | 1 | 87 |
SD | 4 | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 12 |
Wiley | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 6 |
Total | 38 | 36 | 24 | 2 | 3 | 1 | 0 | 1 | 105 |
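
The totals in the table above can be cross-checked against the per-database counts with a short script. This is only a sanity-check sketch: the figures are copied from the table, and the variable names are hypothetical.

```python
# Number of selected works per database, for 3Vs to 10Vs (copied from the table).
labels = ["3Vs", "4Vs", "5Vs", "6Vs", "7Vs", "8Vs", "9Vs", "10Vs"]
counts = {
    "IEEE": [31, 29, 21, 1, 3, 1, 0, 1],
    "SD": [4, 6, 2, 0, 0, 0, 0, 0],
    "Wiley": [3, 1, 1, 1, 0, 0, 0, 0],
}

# Row totals: works per database across all 'Vs' searches.
row_totals = {database: sum(values) for database, values in counts.items()}

# Column totals: works per 'Vs' search across all databases.
column_totals = dict(zip(labels, (sum(column) for column in zip(*counts.values()))))

grand_total = sum(row_totals.values())
```

Both the row sums and the column sums add up to the same grand total, which is the consistency the "Total" row and column are expected to show.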

Type\Vs | 3Vs | 4Vs | 5Vs | 6Vs | 7Vs | 8Vs | 9Vs | 10Vs | Total |
---|---|---|---|---|---|---|---|---|---|
IEEE | | | | | | | | | |
Conferences | 27 | 25 | 18 | | 3 | 1 | | 1 | 75 |
Magazines | 3 | 1 | | 1 | | | | | 5 |
Articles | 1 | 2 | 3 | | | | | | 6 |
Books | | 1 | | | | | | | 1 |
SD | | | | | | | | | |
Articles | 3 | 6 | 2 | | | | | | 11 |
Books | 1 | | | | | | | | 1 |
Wiley | | | | | | | | | |
Articles | 1 | 1 | 1 | | | | | | 3 |
Books | 2 | | | 1 | | | | | 3 |
Total | 38 | 36 | 24 | 2 | 3 | 1 | 0 | 1 | 105 |

Authors\Vs | Volume | Velocity | Variety | Veracity | Variability | Value | Visibility | Vulnerability | Visualisation | Validity | Volatility | Viscosity | Virality | Vincularity | Valence | Vitality |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
[1,107,108,109,118,119] | X | X | X | | | | | | | | | | | | | |
[123] | X | X | X | X | | | | | | | | | | | | |
[2] | X | X | X | X | | | | | | | | | | | | |
[102] | X | X | X | X | | | | | | | | | | | | |
[8] | X | X | X | X | X | | | | | | | | | | | |
[22,125,126] | X | X | X | | | | | | | | | | | | | |
[127] | X | X | X | X | X | X | | | | | | | | | | |
[128] | X | X | X | X | X | X | | | | | | | | | | |
[129,134] | X | X | X | X | X | X | X | | | | | | | | | |
[130,131] | X | X | X | X | X | X | X | | | | | | | | | |
[128] | X | X | X | X | X | X | X | | | | | | | | | |
[132] | X | X | X | X | X | X | X | X | | | | | | | | |
[135] | X | X | X | X | X | X | X | X | X | X | | | | | | |

Type | Quantity | Commentary/Year |
---|---|---|
Earth Satellites [7] | 1 terabyte (10^{12}) | In 1 day in 1990 |
Websites indexed by Google [8] | 1 million | 1998 |
Websites indexed by Google [8] | 1000 million | 2000 |
Computing [138] | 320 terabytes | 2 h of human genome study in 2008 |
Websites indexed by Google [8] | 1 trillion (10^{12}) | 2008 |
Astronomy or particle physics experiment [137] | 1 petabyte | In 1 year in 2009 |
Facebook [15] | 30 billion pieces of content shared each month | 2010 |
Photos per second on Facebook [139] | 1 million | 2010 |
Photos stored on Facebook [139] | 260 billion = 20 petabytes | 2010 |
Photos uploaded per week on Facebook [139] | 1 billion = 60 terabytes | 2010 |
Library of Congress of the United States of America [15] | 235 terabytes of data collected | April 2011 |
Large Hadron Collider at the discovery of the Higgs boson [97] | 1 petabyte (10^{15}) | Per second in 2012 |
Human race [11] | 2.5 quintillion (10^{18}) bytes of data | Every day in 2012 |
Walmart user information every day [1] | 2.5 petabytes (10^{15}) | 2012 |
Multi-Media Messages (MMS) [140] | 28,000 per second | 2012 |
New data [1] | 2.5 exabytes | New data every day since 2012 and doubling every 40 months |
Electronic data [16] | 1.2 zettabytes | Every year in 2012 |
Twitter [141] | 10,300,000 tweets in 1 h 30 min | Presidential debate in 2012 |
GitHub [142] | 550,000 repositories | Q2 2012 |
Creators of social content [143] | 600 million (10^{6}) | 33% of Internet users in 2013 |
Other periodical publications [143] | 10,000 | Newspapers and others in 2013 |
Blogs [143] | 70 million (10^{6}) | 2013 |
Google queries per day [8] | More than 1000 million | 2013 |
Tweets per day [8] | +250 million | 2013 |
Facebook updates per day [8] | +800 million | 2013 |
YouTube views per day [8] | +4000 million | 2013 |
Jet engine [2] | 10 terabytes | 30 min in 2013 |
Internet [143] | 20 exabytes (10^{18}) of information | 2013 |
Internet [144] | 40.7% of the population used it in 2014 = 2954 million | 7,259,691,769 people in 2014 |
Web pages [143] | 1.5 trillion (10^{12}) | 2013 |
Twitter [145] | 310 million active monthly users | 2013 |
Twitter [145] | 500 million tweets per day | August 2013 |
Tweets [143] | 20 thousand million (10^{9}) | 50 million users/2013 |
Tweets [145] | 143,199 per second | 3 August 2013 |
GitHub [142] | 1,300,000 repositories | Q4 2013 |
GitHub [142] | 2,200,000 repositories | Q4 2014 |
Sequencing of human gene [103] | 600 Gb | 2014 |
Flickr [146] | Almost 70 million public photos uploaded monthly | 2015 |
YouTube [147] | More than 1000 million users (10^{9}) = 1/3 of Internet users | 2015 |
YouTube [147] | +100 million hours of video views daily | 2015 |
Hospital data [103] | 167 Tb to 665 Tb | 2015 |
Emails [148] | 204 million | In 1 min in 2016 |
Pandora: hours of music heard [148] | 61,000 h | In 1 min in 2016 |
Flickr [148] | 3 million uploads | In 1 min in 2016 |
Flickr [148] | 20 million photos viewed | In 1 min in 2016 |
Google [148] | 2 million searches | In 1 min in 2016 |
Google Photos [149] | 200 million users | In its first year in 2016 |
Google Photos [149] | 1.6 billion (10^{9}) | In its first year in 2016 |
Google Photos [149] | 2 trillion (10^{12}) tags | In its first year in 2016 |
Google Photos [149] | 24 billion (10^{9}) selfies | In its first year in 2016 |
Facebook [150] | 1650 million (10^{6}) users | 31 March 2016 |
Annual Internet traffic [151] | 1 zettabyte (10^{21}) | 2016 |
Facebook [133] | +500 terabytes of data per day | 2017 |
GitHub [152] | 100,000,000 repositories | 2018 |
ELMo [153] | 94 million parameters | 2018 |
BERT-Large | 340 million parameters | 2018 |
GPT [154] | 110 million parameters | 2018 |
GPT-2 [153] | 1.5 billion parameters | 2019 |
Megatron-LM [153] | 8.3 billion parameters | 2019 |
T5 [153] | 11 billion parameters | 2019 |
Annual Internet traffic [151] | 2.3 zettabytes (10^{21}) | 2020 |
Square Kilometre Array [155] | 524 terabytes per second (estimated) | Will be produced in 2020 (postponed to 2027) |
Turing-NLG [153] | 17.2 billion parameters | 2020 |
GPT-3 [153] | 175 billion parameters | 2020 |
Daily generated data [12] | 56 zettabytes | 16 December 2020 |
Megatron-Turing [153] | 15 datasets with a total of 339 billion tokens | 2021 |
Megatron-Turing [153] | 530 billion parameters | 2021 |
Daily generated data [12] | Estimated 149 zettabytes | 2024 |

