4.2.3. Data Management and Analysis of Big Data
Before starting the process of data analysis and management, the work must be aligned with the regulatory framework that applies to the use of data in the smart campus. The review of previous works did not locate a specific framework for environments such as the one proposed; at this level, the laws enacted by each state also govern. These problems are better known in the context of internet use, where new factors that undermine the right to privacy create serious personal and commercial security risks. For the development of this work, the right to privacy of each person has been considered in accordance with the law established by the country where the study originates [65]. In addition, the data is duly protected during the process and is used only for educational analysis, maintaining the right to privacy of the individual as protected by the legal tradition that preserves the inviolability of the home, papers, and documents. This ensures that none of these elements can be used by others without the consent of the individual to whom they belong. During the execution of the method, sensitive data that affects the most intimate details of a person's privacy has not been used, eliminating the possibility of creating an ideological, racial, sexual, health, economic, or any other profile that could become a threat to the individual. Therefore, the regulations on data protection do not apply to information such as files kept by individuals in the exercise of exclusively personal or domestic activities, nor to data anonymized in such a way that it is no longer possible to identify the interested party [66]. In a smart campus like the one proposed, there are processes where the identification of individuals is not necessary. The opposite happens in ecosystems such as medicine, where the application of big data requires identifying each individual in order to associate an illness or disease. In those scenarios, it is important that the data of each person can be associated as belonging to him or her.
The existing data sources in a smart campus directly influence data management. For this reason, it is important to use tools capable of performing a quality process in the extraction, transformation, and loading of data (ETL), considering adequate processing times. Studies on educational data analysis apply BI techniques to educational data in order to discover patterns in students that allow the detection of how they learn [67]. The results are used to make corrective decisions in teaching methods, as well as to make projections and anticipate possible events such as student dropout [68]. These studies use ETL processes: the processed data is stored in a data warehouse, where it is queried through data mining algorithms and a conclusion is reached. These techniques generally take as a guide the data analysis processes of a company and, when transferred to a wider environment, they present certain technical and knowledge difficulties. On the one hand, a BI system is developed in an environment where the analysis objective is unique and, in many cases, can focus on one part of the business, for example, the detection of variables affecting sales in certain quarters. If the study needs more scalability, it is necessary to go back to the design process, add the variables to generate online analytical processing (OLAP) cubes, and design the dashboards to present the information. For the development of these processes, both commercial and open-source tools can be used; the decision between them depends on the economic limitations of the organization and the level of knowledge of the tools.
Another important factor in the use of BI platforms is the data sources they can handle, which are usually structured databases [69]. When processed by the ETL, the data is stored in a multidimensional database. The management of these platforms is increasingly versatile, and staff with the required knowledge are not difficult to find. If we compare the needs of the aforementioned environment with what a smart campus implies, good results cannot be expected immediately given the volume of data. The first difference is that, when extracting data from the different sources, an ETL does so on an individual basis according to the variables needed for the analysis; if these variables change, it is necessary to create a new connection to the corresponding source and start the process again. In a smart campus model, the idea is to consider all sources and let the platform in charge of data analysis manage the variables regardless of their origin. Having a large number of variables exponentially increases the volume of data considered for the analysis.
Another point to consider is the ability to use data that is not in a specific format: an ETL can work with several formats, extracting them and converting them to a target one. Although this seems an advantage when dealing with a large volume of data in a variety of formats, it can become a problem because it consumes considerable storage and processing resources. In a smart campus, the management of several systems is considered, drawing on structured databases as well as unstructured sources.
The analysis of the possible usefulness of a data analysis tool based on BI platforms confirms that the conditions of the data in a smart campus exceed their functionalities. In an environment such as the one proposed, it is necessary to have processing and storage that guarantee high availability, safety, and quality of the data. For this reason, in a smart campus, it is important to consider platforms that have been used in large companies working with large volumes of data, such as Hadoop [70]. Companies of the scope of Google, Yahoo, Amazon, and so forth have used this tool, which guarantees the treatment of the data at the level required within the guidelines established in the study. Another advantage of Hadoop is the ease with which it manages any type of file or format, something that is very difficult to obtain with a traditional BI approach [71]. Other architectures, such as Apache Spark, also apply to big data platforms. Compared with Hadoop, Spark presents better response times in processing and is used in real-time analysis; however, the infrastructure Spark needs makes its deployment more expensive than Hadoop's. Although this architecture is novel and has been put on par with Hadoop in the market, it does not meet the conditions presented in this study [72].
The method we used, based on cost and the availability of the infrastructure, is Hadoop, an open-source framework for storing data and running applications in clusters [73]. Hadoop provides massive storage for any type of data, has enormous processing capacity, and is able to handle a virtually unlimited number of concurrent tasks or jobs.
The architecture of Hadoop allows an effective analysis of large volumes of data, and the results can strengthen decision-making and improve educational processes. This architecture also allows monitoring of the opinions of students, as well as the ability to draw conclusions about learning problems presented by certain groups of students. With Hadoop, universities can exploit complex data, analyze it, and customize results by adapting the process to the needs of the university and the students.
Hadoop rests on three fundamental pillars: versatility, flexibility, and fault tolerance. Among the components that allow the execution of the architecture is the Hadoop distributed file system (HDFS). The Hadoop engine consists of a MapReduce job scheduler, as well as a series of nodes responsible for executing the jobs [74]. These characteristics are presented as a set of utilities that enable the integration of subprojects. It is important to consider that MapReduce also provides retrospective and complex analysis capabilities that can touch most or all of the data [75]. MapReduce provides a method of data analysis that is complementary to the functions provided by SQL.
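As a minimal sketch of this model, the map, shuffle, and reduce phases can be simulated in plain Java without the Hadoop libraries; the word-count-style aggregation below is illustrative only, not the authors' production job:

```java
import java.util.*;
import java.util.stream.*;

// Conceptual simulation of MapReduce in plain Java (no Hadoop dependency):
// map emits (key, 1) pairs, the shuffle groups them by key, and reduce
// sums each group.
public class MapReduceSketch {

    // Map phase: one (token, 1) pair per input token.
    static List<Map.Entry<String, Integer>> map(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            for (String token : record.split("\\s+")) {
                pairs.add(Map.entry(token.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey, TreeMap::new,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> logs = List.of("coffee cola coffee", "water coffee");
        System.out.println(reduce(map(logs))); // {coffee=3, cola=1, water=1}
    }
}
```

In a real Hadoop job, the shuffle and the distribution of map and reduce tasks across nodes are handled by the framework; only the two phase functions are written by the developer.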
The main problems resolved in Hadoop concern the storage and processing of large volumes of data.
These problems are classic in a traditional campus, and they are usually solved through client-server storage methods in which the user interacts with an application that, in turn, controls storage and analysis through relational databases. This method works well for applications that do not generate a large volume of data and can be processed by traditional servers without exceeding their processing limits. In summary, this method depends on the applications and the available computing resources; the process used by BI is an example. However, when dealing with huge amounts of scalable data, processing the data through a single database is a daunting task and makes the process a bottleneck.
MapReduce divides the task into small parts and assigns them to many machines, then collects the results and forms the resulting data set. Figure 5 shows the traditional method of data analysis in a university campus, where the user interacts with an application and the data is stored in relational databases. In contrast, MapReduce manages the data generated in several processes; the centralization system processes this data and interacts with the user when presenting the results.
The architecture shown in the previous figure represents an improvement in the management and processing of the data; Hadoop integrates the MapReduce algorithm, which is responsible for processing the data in parallel [76]. Figure 6 presents the architecture of Hadoop, whose core has two main layers: the first is responsible for computation and the second for distributed storage. The base Apache Hadoop framework is composed of the Hadoop Common module, which contains the Java libraries and utilities needed by the other Hadoop modules. These libraries provide file-system and OS-level abstractions and contain the Java files and scripts required to start Hadoop. The next module is the YARN framework, a resource-management platform responsible for managing computing resources in clusters and providing very high aggregate bandwidth across the cluster.
According to the characteristics and needs presented in a smart campus, Hadoop performs best when applied in a fully distributed mode [77]. This mode of operation requires deploying a defined number of clusters that are responsible for processing all assigned work. The management cluster carries out the assignment of tasks, and both the management and the processing nodes are virtual machines assigned in the data center of the campus. The advantage of having the architecture in the intranet is network management, which leads to the best use of resources. Being integrated with the internal network, communication is transparent and there are no critical problems, as can be the case when creating the clusters in an external cloud [78].
The installation of the architecture is done on a Linux platform, and then the functionality of the workflows is checked. MapReduce plans the tasks through a JobTracker that is responsible for sending the jobs to the nodes. MapReduce sends the incoming workflow to the available TaskTracker nodes in the cluster, handling the map and reduce functions in each node [79]. The planner keeps each task as close as possible to the machine that holds its data. If the work cannot be placed on the node where the data resides, the nodes in the same rack are given priority. This allocation reduces traffic on the cluster's core network. If a TaskTracker fails or suffers a timeout, that part of the work is rescheduled. Hadoop follows a master–slave structure, where the JobTracker is located in the master and there is a TaskTracker on each slave machine, as shown in Figure 7. The JobTracker records the pending jobs, which reside in the file system. When a JobTracker starts, it looks for that information so that it can resume the work from the point where it was left [80].
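The locality rule described above can be sketched as follows; the method and node names are hypothetical stand-ins for the JobTracker's real scheduling logic, which also weighs load and task history:

```java
import java.util.*;

// Hypothetical sketch of the locality preference: choose the node that
// holds the data block, then any free node in the same rack, and only
// then an arbitrary free node.
public class LocalityScheduler {

    // rackOf maps node -> rack; free lists idle TaskTracker nodes.
    static String pickNode(String dataNode, Map<String, String> rackOf,
                           List<String> free) {
        if (free.contains(dataNode)) return dataNode;          // data-local
        String rack = rackOf.get(dataNode);
        for (String n : free) {                                // rack-local
            if (rackOf.get(n).equals(rack)) return n;
        }
        return free.isEmpty() ? null : free.get(0);            // off-rack
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = Map.of(
                "n1", "rackA", "n2", "rackA", "n3", "rackB");
        // n1 holds the data but is busy, so the rack-mate n2 is chosen.
        System.out.println(pickNode("n1", rackOf, List.of("n2", "n3"))); // n2
    }
}
```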
The HDFS handles two fundamental elements in the architecture: the NameNode and the DataNode [82]. The NameNode is found only in the master node and is responsible for keeping all the stored data indexed; that is, it informs the application where the searched data is located. The DataNodes are found on the slave machines and are responsible for storing the information.
With the architecture deployed and the data acquisition process executed, the analysis begins. This is done through Hadoop, which allows visualization of the different nodes in a graphical interface. In the analysis of the data, the project is divided into several subprojects, which facilitates obtaining information for each system included in Hadoop. For example, it is possible to analyze which beverages have the highest consumption on certain dates or in certain seasons. The skill of the data scientist lies in posing the right questions to help control the parameters of a specific event. The Hadoop interface contains all the jobs in process and stores the corresponding information in files. To verify the functioning of the architecture in this research, the following questions guide the analysis:
Which beverages show the highest consumption rates during examination periods on the campus?
Which places on the campus have the highest population density in winter and in summer?
Which activities generate the most learning among the students on the campus?
These questions seek to solve common problems in a university and help improve the use of resources and the understanding of university trends. For the first question, the information generated by the automatic dispensing system enters the analysis process. The information is sent from the dispensing machine to a virtual server. Figure 8
details the architecture of data acquisition for the different sensor and actuator systems. In the specific case of the dispensers, they contain different sensors that allow the actuators to generate a specific event that is translated into data, which is sent to the information layer. In this layer, the data is stored in relational or non-relational databases, depending on the application. The communication protocols of commercial sensors and actuators are varied; however, the university that participated in the study, with the purpose of designing a scalable architecture and guaranteeing communication, standardized the technologies and, with them, the communication protocol within the smart campus, using TCP. In the knowledge layer, the data sent by the dispensing machines is acquired through the different data mining processes, or whatever is defined within the big data platform. The knowledge generated is applied, in this first case, to the good use of resources and the improvement of services. Hadoop stores the data, and the administration cluster assigns small analysis processes to the 120 nodes implemented within the architecture. The storage format of the data is simple and contains fields such as date, time, type of beverage, location within the campus, and machine identifier. This data arrives in plain text and in real time; therefore, the analysis process has accurate information. The Hadoop analysis is based on applications developed in Java, where the Hadoop libraries are imported and a .jar file is generated that, when executed, starts the analysis process.
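Assuming the plain-text record layout named above (date, time, beverage type, campus location, machine identifier), the per-record parsing and aggregation step of such a Java application might look like this simplified, Hadoop-free sketch; the concrete field values are illustrative:

```java
import java.util.*;

// Illustrative parsing step for the dispenser records described above.
// Each plain-text line is assumed to be: date,time,beverage,location,id.
public class DispenserRecords {

    // Count dispensed units per beverage type from raw text lines.
    static Map<String, Integer> countByBeverage(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split(",");   // date,time,beverage,location,id
            counts.merge(f[2].trim(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "2018-06-11,08:00,coffee,library,VM-07",
                "2018-06-11,08:04,cola,library,VM-07",
                "2018-06-11,08:10,coffee,cafeteria,VM-12");
        System.out.println(countByBeverage(lines)); // {coffee=2, cola=1}
    }
}
```

In the deployed architecture, this aggregation runs as the map/reduce phases of a packaged .jar distributed across the 120 nodes rather than in a single process.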
presents the results of the first analysis, where the study period consisted of two weeks, the usual examination period. Samples were taken every four hours, and the drinks available in each machine were classified into four classes. The percentage relates the number of units dispensed to the total capacity of a vending machine, which is 288 units. The table shows a high consumption of coffee, followed by cola drinks, which in one way or another contain caffeine. With the results obtained, several adjustments can be made to optimize resources. On the one hand, it is possible to project the number of beverages required, according to their classification, in different seasons. On the other hand, the results of the analysis can complement a study on student stress during exams. The study makes it possible to create awareness campaigns on caffeine dependence and the treatment of stress in the student population.
The analysis of the places with the highest population density within the campus requires accurate information on the location of the students. In order to comply with the requirements of the analysis, the data generated by the wireless local area network (WLAN) is considered. The university campus has an integrated WLAN system that handles load balancing and device identification, enabling this type of study. Access point (AP) devices provide information about the number of hosts that are connected, and the controller that manages the APs can emit traces of these hosts that include the time they connected to the network, as well as the identification of the AP to which they are connected [83]. The potential of wireless systems promotes the delivery of information to the inhabitants of the smart campus based on the conditions of the environment. This consideration is supported by work carried out in urban sectors, which can be adapted to the needs presented in this study [84]. This monitoring is continuous; therefore, the volume of data is high, and consequently the processing of the trace file takes more time. To reduce the processing time, we work with algorithms developed in Java that perform a pre-filtering, whereby the sample is segmented from months into weeks or days.
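A minimal sketch of such a pre-filtering step, assuming a simple `date,apId,hostCount` trace layout (the real controller trace format is not specified in the text), could narrow a month-long sample to a chosen window before the heavy analysis:

```java
import java.time.LocalDate;
import java.util.*;
import java.util.stream.*;

// Sketch of the Java pre-filtering step: keep only WLAN trace lines whose
// date falls inside a target window, segmenting months into weeks or days.
// The line layout date,apId,hostCount is assumed for illustration.
public class TracePrefilter {

    // Keep only trace lines whose date falls inside [from, to].
    static List<String> filterWindow(List<String> traces,
                                     LocalDate from, LocalDate to) {
        return traces.stream().filter(line -> {
            LocalDate d = LocalDate.parse(line.split(",")[0]);
            return !d.isBefore(from) && !d.isAfter(to);
        }).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> traces = List.of(
                "2018-01-03,AP-21,34",
                "2018-01-15,AP-21,51",
                "2018-02-02,AP-07,12");
        // Restrict the sample to January: two of the three lines survive.
        System.out.println(filterWindow(traces,
                LocalDate.parse("2018-01-01"),
                LocalDate.parse("2018-01-31")).size()); // 2
    }
}
```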
shows the results obtained in the density analysis, in which the raw data of four random weeks of both the summer period and the winter period are considered. The distribution of the APs within the smart campus is by area, and the amount of equipment assigned depends on the analysis of population density and the optimization methods that the wireless system allows. The APs, in addition to providing access to the network, generate important information such as the number and details of the hosts that connect to the network in a specific period. This information is useful to determine in which areas more infrastructure resources or bandwidth are needed, with the objective of optimizing these resources. Another derivative of this analysis is that academic authorities can provide relevant information about the university at the points where students usually meet. Talks or informative activities can be held by academics, taking advantage of the places preferred by the students. The results shown in the table are as expected; however, the tool enables a quantitative analysis, which allows the use of resources to be improved and brings objectivity to their use.
The implementation of a data analysis architecture such as Hadoop allows for better decision making regarding the management of natural resources. Through information on the places where there is a greater concentration of students, it is possible to deliver information about the proper use of resources. Awareness campaigns become a means of education, even more so when the smart campus has a system that learns from the data generated by each user. For the deployment of network equipment or APs, sectors that do not reach a minimum number of users are restructured immediately to take advantage of resources and promote use in the places that need it.
To answer the question of which academic activities generate learning in students, the structure changes from a single system providing data to multiple systems doing so. This coupling of more data sources is required because analyzing people's performance demands greater effort, as well as a greater number of variables. The variables contain the student's socio-academic information, academic record, financial situation, interaction with learning management tools, and so forth. There are several methods through which the analysis can be approached: one is to create a subprocess that filters the information of each system and then unites the results to present the information; another is to perform the sequential analysis of each system, where the results are stored in variables and then presented as a common result.
Creating subprocesses in charge of analyzing each system optimizes the processing time. This functionality lies in the capacity of Hadoop to assign and couple a certain number of cores to each task. The inclusion of data mining algorithms allows the analysis to be deepened in order to create clusters and identify the patterns of each student. This analysis makes it possible to determine how students learn, as well as to evaluate the teaching methods of each teacher. This type of study is a true contribution to a smart campus, since it allows the improvement of learning, which is the main objective of a university. By improving the quality of learning, the results of the environment improve, as this conveys a vision of excellence internally and externally. A good image as a university helps improve student enrollment rates, and this goes hand in hand with the economic growth of the campus. Economic well-being, in turn, improves the quality of life of those involved, and budgets are increased in all areas.
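The first method, one subprocess per system whose partial results are later merged, can be sketched with a standard Java thread pool; the source names and the summing "analysis" below are placeholders for the real per-system jobs:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the subprocess strategy: each data source is analyzed by its
// own concurrent task, and the partial results are merged into one summary.
public class SubprocessMerge {

    // Run one analysis task per source concurrently, then merge results.
    static Map<String, Integer> analyzeAll(Map<String, List<Integer>> sources) {
        ExecutorService pool = Executors.newFixedThreadPool(sources.size());
        Map<String, Future<Integer>> futures = new HashMap<>();
        sources.forEach((name, data) -> {
            // Placeholder "analysis": sum the source's values.
            Callable<Integer> task =
                    () -> data.stream().mapToInt(Integer::intValue).sum();
            futures.put(name, pool.submit(task));
        });
        Map<String, Integer> merged = new TreeMap<>();
        for (var e : futures.entrySet()) {
            try {
                merged.put(e.getKey(), e.getValue().get());
            } catch (InterruptedException | ExecutionException ex) {
                throw new RuntimeException(ex);
            }
        }
        pool.shutdown();
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> sources = Map.of(
                "lms_interactions", List.of(3, 5),
                "academic_record", List.of(7));
        System.out.println(analyzeAll(sources));
    }
}
```

In Hadoop itself this parallelism is obtained by assigning cores and nodes to each subproject rather than threads in one JVM, but the merge-after-parallel-analysis pattern is the same.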
Mobility is another important point that directly affects sustainability within the smart campus, and one of the objectives of this architecture is to reduce the CO2 emissions that mobility causes. In order to meet this objective, it is important to analyze the information that exists within the campus and why the problem arises. As a starting point, we must consider that a university campus can be geographically as large as a small town; therefore, mobility must be considered in a sustainable environment. There are options to solve the problem, such as implementing an internal transport system that runs its routes at regular intervals. This option partly solves the problem, since the inhabitants of the campus stop using their own vehicles, thereby reducing gas emissions. However, it is not an optimal solution, because its implementation is based only on the experience of the administrative staff of that area.
The architecture proposed in this work covers these needs by defining the times, units and routes of each transport based on the analysis of the students’ existing data. Figure 9
shows the flow diagram under which the process to solve this problem is conceived. The first stage is responsible for collecting the data that comes from systems such as the wireless location system, the video surveillance systems, and the academic management systems. If the data exist and are appropriate, the big data architecture assigns the necessary nodes to perform the processing of the information. The flow then leads to the storage of the data and its analysis through Hadoop, which performs the process based on the research parameters. For example, when identifying patterns in students, the most frequented places are determined; the big data process even takes data from previous processes, such as the location of students through the wireless system. This information is selected according to the students who have traveled the greatest distance, making it possible to determine whether students move over short or long distances. As a fixed variable, the system detects students who travel more than two kilometers; with this data, there is enough information to define the places to which, and the times at which, buses should be sent.
The video surveillance systems, as an integral part of the proposal, send information about bus stop points. If there is saturation, the analysis system detects it and alerts the administrator to send buses. Academic management systems provide information about student schedules, so that the big data process identifies all the variables that contribute to solving the problem. Once the transport administrator is notified, they arrange the driver and the route. If the transport dispatch is not made, the cycle returns to the data collection phase and the process is repeated. If the data does not exist or is not sufficient for the analysis, the process stops and looks for information that helps to solve the problem in external sources, such as alternative systems or spreadsheets. If the data found is sufficient, it is added to the process and the Hadoop nodes are implemented according to the request made. Otherwise, the data is entered manually; this corresponds to scenarios where the system does not find enough data to make a decision and an operator has to join the process to verify or correct it as needed.
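The dispatch rule above can be sketched as follows; the 2 km threshold comes from the text, while the minimum demand of 10 students per bus and the zone names are assumed parameters for illustration:

```java
import java.util.*;

// Hypothetical sketch of the bus dispatch rule: students whose travel
// distance exceeds the fixed 2 km variable are grouped by pickup zone,
// and a bus is suggested for any zone reaching a minimum demand.
public class BusDispatch {

    record Student(String zone, double distanceKm) {}

    static final double MIN_DISTANCE_KM = 2.0; // fixed variable from the text
    static final int MIN_DEMAND = 10;          // assumed per-bus threshold

    // Return the zones to which a bus should be sent.
    static List<String> zonesNeedingBus(List<Student> students) {
        Map<String, Integer> demand = new TreeMap<>();
        for (Student s : students) {
            if (s.distanceKm() > MIN_DISTANCE_KM) {
                demand.merge(s.zone(), 1, Integer::sum);
            }
        }
        List<String> zones = new ArrayList<>();
        demand.forEach((zone, n) -> { if (n >= MIN_DEMAND) zones.add(zone); });
        return zones;
    }

    public static void main(String[] args) {
        List<Student> students = new ArrayList<>();
        for (int i = 0; i < 12; i++) students.add(new Student("north_gate", 3.5));
        students.add(new Student("south_gate", 1.2)); // below 2 km: ignored
        System.out.println(zonesNeedingBus(students)); // [north_gate]
    }
}
```

In the full flow, the saturation alerts from the video surveillance system and the schedule data from academic management would adjust the demand counts before this rule is applied.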