Communication networks are perhaps the technological breakthrough with the greatest impact on the worldwide socio-economic structure in modern times. As an object of scientific curiosity, communication networks constantly generate overwhelming volumes of data that can be analyzed and studied. When this investigation focuses on the network level, we refer to it as Network Traffic Analysis (NTA). Taken as an applied science, NTA research is relevant to improve, optimize, and reduce failures in communication infrastructures and services; nevertheless, security aspects have clearly stood out as the principal focus of NTA research from the very beginning [1]. RFC 3917 [2] lists usage-based accounting, traffic profiling, traffic engineering, attack/intrusion detection, and QoS monitoring as key application fields for traffic capturing. While accounting and QoS monitoring are now largely covered by mature standards, research works tend to focus on traffic classification, anomaly detection, or specific attack identification. Beyond its practical usefulness, the attention of the research community is fully justified since, from a data science perspective, NTA is one of the most challenging fields due to its intrinsic peculiarities: big data, a high variety of feature representations, evolving scenarios, stream data, adversarial environments, encryption, and limitations imposed by privacy concerns.
In this regard, the compendium of publications that tackle NTA increases year by year. A topic search on the Web of Science (https://apps.webofknowledge.com/) using “network traffic” as keywords found 12,085 publications (consulted on 5 February 2020), of which 21.8% were published from 2017 onward. Specifically, NTA at the network and transport layer attracts a considerable part of the research attention as it is:
low-intrusive (i.e., privacy respectful),
fast and lightweight,
applicable to big volumes of traffic,
suitable for embedding in network middle-boxes.
Surprisingly, in spite of the high number of related publications, there are no standardized methods, algorithms, or steps to dig into network traffic data from analytical perspectives. This deficiency has been emphasized several times. For instance, Kim et al. [3] claimed that “recent research on Internet traffic classification algorithms has yielded a flurry of proposed approaches for distinguishing types of traffic, but no systematic comparison of the various algorithms”. On this matter, we hypothesize that the fast evolution of communications and the push for new applications have complicated the characterization of network traffic and the achievement of unified criteria about how to analyze it. However, a significant part of the research corpus shows repeated structures, i.e., many papers follow the recurrent scheme summarized in Figure 1, which is also supported by several field surveys [1
In this work, we present the Network Traffic Analysis Research Curation (NTARC) model. NTARC is a data model that stores key aspects of NTA research publications. A database of NTARC objects is intended to increase the value of past research as it enables the automated retrieval, reuse, comparison, and analysis of published papers. Additionally, it facilitates reproducibility and the consolidation of standardized methodologies and best practices. NTARC emerges because the current way of reusing past field research is obsolete, manual, and subjective, does not facilitate reproducibility, and misses opportunities opened by modern advances in data sharing. This problem is not specific to NTA, but widespread in science and receiving more and more attention. There are no similar approaches to automate the study of past research in NTA with the depth pursued by NTARC, which also plays the role of a methodology template. Its adoption can help to settle best practices since researchers become aware of methodology deficiencies and errors and are encouraged to create more reproducible works. Curating research and creating NTARC objects require a significant effort, but science demands it, just as it demands methods to increase credibility, quality, and efficiency in knowledge accumulation [7]. NTARC is designed specifically for research works that propose analysis methods for network traffic captured at the network level, but it could be similarly developed for other scientific fields provided they share extended, common experimental and methodological structures like the one shown in Figure 1, which is characteristic of the field under review here.
Finally, we show the potential of NTARC by revisiting the NTA research conducted during the last two decades. To this end, we explore the papers included in the latest release of the NTARC database [8]. We aim to offer a representative, overall snapshot of the main trends observed in the field, also emphasizing the drawbacks and reasons that could be hampering research efficacy and leaving novel proposals unfeasible and far from satisfying the requirements of real-life applications. Our field review focuses on NTA’s main goals, research foci, selected features, datasets used, analysis approaches, evaluation methods, claimed contributions, and reproducibility.
The rest of this paper is organized as follows: Section 2 explores previous data sharing initiatives that have been developed to improve scientific research. Section 3 provides a detailed description of NTARC’s internal structures (fields and subfields). Section 4 lists and explains a set of tools developed for the editing, revision, verification, sharing, and analysis of NTARC objects and databases. Section 5 introduces early initiatives to expand NTARC and encourage its use in the research community. In Section 6, we present the current release of the NTARC database with a collection of 80 curated papers from 2002 to 2020. Section 7 elaborates on a systematic review of the NTA research embraced by the NTARC database. Conclusions are provided in Section 8.
2. Scientific Data Repositories and Data Sharing Initiatives
There are several projects aiming to host and promote research data repositories with a general, non-field-specific purpose; for example: B2SHARE [9], Figshare [10], the Globus Data Publication service [11], 4TU.Researchdata [12], the Zenodo platform [13], Dataverse [14], and Dryad [15].
Assante et al. [16] asked whether generalist scientific data repositories were able to cope with the requirements of research data publishing and concluded that generalist repositories struggle with a multiplicity of data formats and topologies, a highly varied and multidisciplinary community of data owners and consumers, and a lack of consolidated and shared practices. Repositories were found viable, but conservative and in need of evolving. Such intrinsic heterogeneity seems to be a pressing problem to tackle in the near future to prevent underused repositories. When the research scope of a given repository is more specific, the heterogeneity problem is minimized. For instance, NASA’s Common Metadata Repository (CMR) for Earth Science Data Information is a good example of a system developed to standardize data retrieval and resolve a previously inefficient situation. The use of Earth science data involves a community of Earth scientists, educators, government agencies, decision makers, and the general public. The CMR initiative has greatly increased the value of past research and metadata [17].
The success of data repositories partially lies in creating metadata structures able to categorize and identify datasets and research objects effectively. In [18], Devarakonda et al. defined metadata as structured information that describes data content. Metadata explains the definition of measured and collected variables, their units, precision, accuracy, layout, transformations, limitations, etc. In addition, it should also clarify the data lineage, i.e., how data are measured, acquired, and preprocessed. Hence, metadata facilitates data sharing, access, and reuse. Moreover, as claimed in [18], metadata must be accessible in a format that is easily adaptable to technology changes, e.g., XML and JSON (the latter used in NTARC).
However, the adoption of data sharing practices and detailed metadata descriptions is still immature in the scientific community. Harrison et al. [19] recognized several challenges to face, mainly concerning the evolution of researchers’ mindsets and habits. Nevertheless, the authors foresaw a near future in which data resources would be assessed similarly to journal publications in a scientist’s portfolio, thereby increasing both the quality and quantity of published data. Within the context of environmental sciences, Harrison et al. also presented a workflow to publish datasets, models, and model outputs, enabling the access, reuse, and citation of data products [19].
Scientific journals play an important role and endorse or directly develop platforms to improve how publishable scientific material is shared, managed, and reused. Dryad [15] publishes datasets related to peer-reviewed journal articles and scientifically reputable sources. In [20], Bardi and Manghi explored the concept of enhanced publications, meaning digital publications that incorporate ways to access and disseminate research materials beyond papers. They found that such initiatives are hindered by the difficulties that researchers must face, e.g., manual effort in curating data, preparing the material, or acquiring new skills to adapt their data, while obtaining no obvious, direct benefits. Therefore, enhanced publications are more common when scientific journals demand such materials with clear policies. Similar conclusions and claims were presented in the data journal survey in [21]. In [22], tools and digital environments were proposed to reduce costs and facilitate the creation of enhanced publications. In this line, Science Object Linking and Embedding (SOLE) [23] is a tool intended to enable reproducible research by linking research papers with associated science objects. Here, science objects are linked with tags in a bibliography-like form, making them easy to reference. In this respect, Scientific Data, launched by the Nature Publishing Group in 2014, is a peer-reviewed journal that focuses on data descriptors, defined as “a new type of publication that combines the narrative content characteristic of traditional journal articles with structured, curated metadata that outline experimental workflows and point to publicly archived data records” [24].
Making research data, results, and materials more profitable is a necessary process that involves manual effort. By analyzing the relationship of institutional repositories with small science, Cragin et al. [25] claimed the necessity of redefining and standardizing the understanding of “data sharing”, as well as promoting the establishment of data curation policies to empower the use of data repositories and protect against data misuse. In a similar line, the USA National Research Council recently published a study about digital curation with the title Preparing the Workforce for Digital Curation [26]. In this work, digital curation was analyzed in depth, discussing the current status and practices, society requirements, career paths, professional opportunities, derived benefits, and importance for scientific advancement. Digital curation is defined as “the active management and enhancement of digital information assets for current and future use”. In the conclusions, the authors first emphasized the limitations and missed opportunities due to the current immaturity and ad hoc nature of digital curation. Digital curation is not well understood, but its application in organizational practices is expected to reduce costs and increase benefits. Some examples of organizations focused on promoting and developing digital curation are: the Digital Curation Center (DCC), the National Digital Stewardship Alliance (NDSA), the Research Data Alliance (RDA), and the Committee on Data for Science and Technology (CODATA).
Several examples directly show how using data curation and metadata models can benefit science; for instance, the Linking Open Drug Data (LODD) project, which is a task force within the World Wide Web Consortium’s (W3C) Health Care and Life Sciences Interest Group (HCLS IG). LODD gathered and connected reliable information about drugs that is publicly available on the Internet, uncovering relevant questions for science and industry, and providing recommendations for best practices [27]. As for the NTA field, the main precedents have been developed by the Center for Applied Internet Data Analysis (CAIDA), i.e., DatCat, an Internet Measurement Data Catalog (IMDC), which is “a searchable registry of information about network measurement datasets” [28], and the Internet Traffic Classification [29], which is a collection of 68 curated metadata records of NTA papers published between 1994 and 2009. Also worth mentioning, although from a wider perspective, is the IMPACT project [30].
The NTARC model goes a step further and proposes a detailed collection of metadata that fits the structure shown in Figure 1. The goal is to improve the reuse of previous research by enabling the use of statistics on data summaries and metadata or meta-analysis (understanding meta-analysis in a broad sense). Meta-analysis consists of bringing together different studies about the same research question and applying statistics and analysis methods to obtain global conclusions and general perspectives. Meta-analysis is a well-suited procedure to glue together small science in a global context and transform independent works into more profitable parts of the overall edifice of science. This is especially true when the same research question is repeatedly faced by different teams in different places. Meta-analysis has been decisive in fields like medicine, pharmacology, epidemiology, education, psychology, business, or ecology. A well-known introduction to meta-analysis was offered by Borenstein et al. [31]. Specifically for medical research, we refer the reader to [32]. Meta-analysis can also be satisfactorily applied in technical research and engineering, but detailed data models must be created beforehand. Such models will pave the way toward standardized procedures, which are required for reliable meta-analyses.
3. NTARC Data Structures
An NTARC object is a digital summary of a peer-reviewed NTA scientific publication. NTARC publications are required to fit the scheme shown in Figure 1. Additionally, every NTARC object must be compliant with the NTARC model, which follows the structure depicted in Figure 2.
NTARC uses the JSON format [33], which can be easily parsed and written by computers while still being human readable. Creating, sharing, and distributing NTARC data are simple and straightforward since JSON files are text-based, and each file addresses only one scientific publication. A first, minimal prototype was used for the research conducted in [34].
The readers will notice that fields defined in the NTARC structure are exhaustive. For the sake of flexibility and time optimization, some fields in the structure are mandatory, and other fields are optional. Furthermore, contributors are free to define their own fields that might be added to future NTARC versions or will simply remain as notes.
The NTARC model consists of six main blocks: reference, data, preprocessing, analysis_method, evaluation, and results. Additionally, a version field stores the NTARC version that corresponds to the JSON object. The version field helps automated tools during parsing processes and makes the format future-proof. The main blocks are described as follows:
The reference block:
This block collects information that identifies the scientific work, the publication media, and the curation process itself.
The data block:
This block stores information about the network traffic data used. It is not intended to refer to the original dataset version, but the version retrieved by the paper authors, which might have been modified. Here, we find one of the anchors that facilitates comparative analysis, since the scope of the data is always NTA and is provided in the shape of either packet captures, flow records, or preprocessed data derived from packet captures or flow records.
The data block consists of one or several datasets. By definition, two datasets must be reported separately when they clearly come from different setups, projects, or sensor environments; otherwise, they must be defined as subsets.
The preprocessing block:
This block summarizes all transformation and modification processes that datasets underwent previous to the main analysis (e.g., normalization, dimensionality reduction, feature extraction, filtering). The stored information is limited to the preprocessing specifically mentioned in the paper as a part of the presented methodology. This block also captures the set of network traffic features and/or flow keys that were used to represent traffic during subsequent analysis.
Most fields in this block are binary and mandatory, allowing a fast curation of relevant preprocessing aspects. Subsequent blocks (e.g., feature_selections, packets, flows, and flow_aggregations) are optional, being suitable for cases where a more detailed, fine-grained definition is desired. Specifically, packets, flows, and flow_aggregations are blocks that indicate the type of traffic objects to analyze during experiments. The habitual trend is focusing on only one of these traffic objects.
The analysis_method block:
This block depicts the analysis. It captures relevant details of the analysis methodology and identifies the algorithms used. Note that here, tools are repeated at two context levels: general for the analysis method and specific for algorithms.
The evaluation block:
This block gathers information to understand how analysis outcomes were validated, evaluated, and interpreted. It basically registers the metrics used and the perspectives that the authors found relevant to assess the analysis success or failure.
The results block:
In this block, goals, sub-goals, and improvements claimed by the authors are collected. It also defines the focus of the paper and whether the published work meets reproducibility standards [35].
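The top-level layout described above can be sketched in Python as follows. This is only an illustration: the six block names and the version field come from the text, while the version string and the nested keys (`datasets`, `name`, `subsets`) are hypothetical placeholders and not the actual NTARC field names.

```python
import json

# Block names as listed in the text; everything nested inside them
# below is a hypothetical placeholder, not the real NTARC specification.
MAIN_BLOCKS = ("reference", "data", "preprocessing",
               "analysis_method", "evaluation", "results")


def empty_ntarc_object(version):
    """Return a skeleton object with one empty dict per main block."""
    obj = {"version": version}
    for block in MAIN_BLOCKS:
        obj[block] = {}
    return obj


def missing_blocks(obj):
    """List the top-level names (version plus main blocks) absent from obj."""
    expected = ("version",) + MAIN_BLOCKS
    return [name for name in expected if name not in obj]


skeleton = empty_ntarc_object("hypothetical-version")
# Hypothetical nesting: one dataset entry with an empty list of subsets.
skeleton["data"]["datasets"] = [{"name": "example-dataset", "subsets": []}]

# One JSON text file per publication, as described in the text.
text = json.dumps(skeleton, indent=2)
restored = json.loads(text)
```

Since each file addresses a single publication, a round trip through `json.dumps` and `json.loads` is all that is needed to store and reload such an object.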
4. Tools Developed for NTARC
NTARC formats, structures, and libraries are openly available [36]. In addition to the format specification, we provide broad, complete documentation with examples, editing rules, and explanations. Thus, contributors are guided in the process of curating data and creating NTARC objects, and users are guided in the process of exploiting NTARC datasets. Being built on top of a standard format like JSON, NTARC benefits from all existing tools already developed by third parties. In addition, we developed several tools to facilitate the curation process and the interaction with the NTARC database. We mention here some of these tools, which are openly available in [37].
4.1. JSON Schema
JSON Schema is a format that allows the formal specification of what constitutes a valid JSON file for a particular application [38]. We maintain a JSON Schema that formalizes our description of NTARC files [36]. This schema helps verification tools to validate NTARC files and paves the way for the development of additional tooling. The periodic review of mismatches between dataset objects and the JSON Schema enables the detection of errors, ambiguities, new trends, or missing values, as well as the updating of the Schema itself.
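To illustrate how a schema can drive such checks, the following Python sketch validates an object against a small schema fragment. It is a hand-written stand-in that understands only the `type`, `required`, and `properties` keywords; a real setup would rely on a full JSON Schema validator. The schema fragment is an assumption made for this example (in particular, treating `curated_revision_number` as a required integer) and is not the published NTARC schema.

```python
# Maps the few JSON Schema type names handled here to Python types.
TYPE_MAP = {
    "object": dict, "array": list, "string": str,
    "integer": int, "boolean": bool,
}

# Hypothetical schema fragment, for illustration only.
SCHEMA = {
    "type": "object",
    "required": ["version", "reference"],
    "properties": {
        "version": {"type": "string"},
        "reference": {
            "type": "object",
            "required": ["curated_revision_number"],
            "properties": {"curated_revision_number": {"type": "integer"}},
        },
    },
}


def validate(instance, schema, path="$"):
    """Return a list of mismatch messages (empty when the instance fits)."""
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for "integer".
        bad_bool = expected == "integer" and isinstance(instance, bool)
        if not isinstance(instance, TYPE_MAP[expected]) or bad_bool:
            return [f"{path}: expected {expected}"]
    errors = []
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required field '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], subschema, f"{path}.{key}"))
    return errors


good = {"version": "v", "reference": {"curated_revision_number": 1}}
bad = {"reference": {"curated_revision_number": "one"}}
good_report = validate(good, SCHEMA)
bad_report = validate(bad, SCHEMA)
```

Collecting such reports over a whole database is one way to surface the recurring mismatches mentioned above, which may point either to curation errors or to parts of the schema that need updating.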
4.2. NTARC Editor
NTARC objects can be directly edited with common text editors. However, the NTARC structure includes many attributes and is based on JSON, a language that makes extensive use of punctuation; therefore, errors can easily occur during direct manual editing. A custom editor was developed to ease the generation of NTARC files. Rather than implementing each field of the specification by hand, the editor creates the user interface from the JSON Schema described above.
The editor incorporates the specification and includes pointers to the documentation at the appropriate places. Additionally, fields not conforming to the specification are marked, and error messages are displayed. To further ease editing NTARC objects, the editor implements an interface for specifying network traffic features in a more formula-based language. Finally, to accommodate heterogeneous computing environments and boost the adoption of the NTARC format, the NTARC editor was built with cross-platform capabilities in mind. This was achieved by using the Electron framework [39].
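The idea of deriving an interface from a schema, rather than coding each field by hand, can be sketched as follows. The actual editor is built on Electron; this Python fragment only illustrates the principle, and the schema fragment and its field names (`normalization`, `notes`) are hypothetical.

```python
# Hypothetical schema fragment; field names are placeholders.
SCHEMA = {
    "type": "object",
    "required": ["version", "preprocessing"],
    "properties": {
        "version": {"type": "string"},
        "preprocessing": {
            "type": "object",
            "required": ["normalization"],
            "properties": {
                "normalization": {"type": "boolean"},
                "notes": {"type": "string"},
            },
        },
    },
}


def form_fields(schema, prefix="", mandatory=True):
    """Flatten a schema into (path, type, mandatory) rows for a form.

    A field counts as mandatory only if it and all of its enclosing
    objects are listed as required.
    """
    if schema.get("type") != "object":
        return [(prefix, schema.get("type", "any"), mandatory)]
    rows = []
    needed = set(schema.get("required", []))
    for name, subschema in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        rows.extend(form_fields(subschema, path, mandatory and name in needed))
    return rows


rows = form_fields(SCHEMA)
for path, kind, mandatory in rows:
    label = "mandatory" if mandatory else "optional"
    print(f"{path}: {kind} input ({label})")
```

With this approach, adding a field to the schema is enough for it to appear in the generated form, which keeps the editor and the specification from drifting apart.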
4.3. Verification Tool
To minimize human errors during curation processes, we developed a verification tool that automatically checks NTARC objects when submitted to databases. This tool is freely available in [40]. The verification tool uses the JSON Schema as a first step to assert that the submitted file is compatible with the NTARC specification; afterwards, a second analysis is performed by parsing the file with the NTARC Extraction Library (Section 4.4). The first step checks grammar and NTARC structural consistency; the second step goes deeper and checks syntactic and semantic aspects (for instance, a defined division operation must necessarily come with two terms: a numerator and a denominator).
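The two-step procedure can be sketched in Python as shown below. This is a simplified illustration under stated assumptions, not the actual tool: the first step is reduced to checking that the file parses and contains the top-level names, and the second step assumes a hypothetical encoding of feature expressions as nested dictionaries such as `{"divide": [numerator, denominator]}` stored under a hypothetical `features` key.

```python
import json

REQUIRED_KEYS = ("version", "reference", "data", "preprocessing",
                 "analysis_method", "evaluation", "results")


def structural_errors(text):
    """Step 1 (stand-in for the schema check): parse and look for blocks."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return None, ["top level must be a JSON object"]
    missing = [key for key in REQUIRED_KEYS if key not in obj]
    return obj, [f"missing block '{key}'" for key in missing]


def semantic_errors(expr, path):
    """Step 2: walk a (hypothetical) feature expression and check arity."""
    if not isinstance(expr, dict):
        return []  # a leaf such as a feature name or a constant
    errors = []
    for op, args in expr.items():
        if not isinstance(args, list):
            errors.append(f"{path}.{op}: operands must be a list")
            continue
        if op == "divide" and len(args) != 2:
            errors.append(f"{path}.{op}: needs a numerator and a denominator")
        for i, arg in enumerate(args):
            errors.extend(semantic_errors(arg, f"{path}.{op}[{i}]"))
    return errors


def verify(text):
    """Run step 1, and step 2 only if the structure is sound."""
    obj, errors = structural_errors(text)
    if errors:
        return errors
    pre = obj["preprocessing"]
    features = pre.get("features", []) if isinstance(pre, dict) else []
    for i, expr in enumerate(features):
        errors.extend(semantic_errors(expr, f"features[{i}]"))
    return errors


base = {key: {} for key in REQUIRED_KEYS}
base["version"] = "hypothetical-version"
base["preprocessing"] = {"features": [{"divide": ["bytes", "packets"]}]}
ok_report = verify(json.dumps(base))

base["preprocessing"] = {"features": [{"divide": ["bytes"]}]}
bad_report = verify(json.dumps(base))
```

Running the semantic pass only on structurally sound files keeps its error messages focused on content rather than on malformed input.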
4.4. NTARC Extraction Library
The NTARC Extraction Library is a Python library that enables information extraction from NTARC objects and datasets. This library implements methods and classes linked to the multiple defined blocks and fields. Therefore, it is possible to perform deep searches and queries in databases (i.e., metadata analysis) by using keys with any combination of fields and values. Additionally, as previously mentioned, the library allows the verification of NTARC files. The library also supports calls to external APIs to augment the information available in the NTARC files. For example, by querying the Microsoft Academic Services API [41], the library can collect additional relevant information that does not appear in the paper, e.g., number of citations and authors’ affiliations, and store it in a local cache. The extraction library is freely available from [42].
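The kind of field-and-value query described here can be illustrated with a short, self-contained sketch. It does not use the NTARC Extraction Library or reproduce its API; the helper functions, the dotted-key convention, and the field names in the toy database (`year`, `algorithms`, `main_goal`) are all assumptions made for this example.

```python
from collections import Counter


def get_path(obj, dotted_key):
    """Follow a dotted key such as 'results.main_goal' through nested dicts."""
    current = obj
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def matches(obj, criteria):
    """True if every criterion holds; list fields match by membership."""
    for key, wanted in criteria.items():
        found = get_path(obj, key)
        if isinstance(found, list):
            if wanted not in found:
                return False
        elif found != wanted:
            return False
    return True


def query(database, criteria):
    """Return the objects that satisfy all field/value criteria."""
    return [obj for obj in database if matches(obj, criteria)]


def tally(database, dotted_key):
    """Count how often each value of a field occurs across the database."""
    counts = Counter()
    for obj in database:
        value = get_path(obj, dotted_key)
        values = value if isinstance(value, list) else [value]
        counts.update(v for v in values if v is not None)
    return counts


# Toy database with hypothetical field names and values.
DATABASE = [
    {"reference": {"year": 2010},
     "analysis_method": {"algorithms": ["kmeans"]},
     "results": {"main_goal": "anomaly_detection"}},
    {"reference": {"year": 2015},
     "analysis_method": {"algorithms": ["svm", "kmeans"]},
     "results": {"main_goal": "traffic_classification"}},
    {"reference": {"year": 2018},
     "analysis_method": {"algorithms": ["random_forest"]},
     "results": {"main_goal": "anomaly_detection"}},
]

hits = query(DATABASE, {"results.main_goal": "anomaly_detection",
                        "analysis_method.algorithms": "kmeans"})
algorithm_counts = tally(DATABASE, "analysis_method.algorithms")
```

Combining `query` with `tally` gives simple descriptive statistics over the curated metadata, which is the broad sense of meta-analysis the text refers to.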
4.5. Content Validation
The tools presented above are used to ensure that new NTARC objects are compliant with the specifications. Therefore, all files included in the dataset are previously verified and consistent in terms of grammar, syntax, and structure, meaning that they can be used for analysis and information extraction. However, the curation of paper information is manual in essence and requires experts reading papers and abstracting contents according to the NTARC structure. Therefore, errors and subjectiveness are possible and happen often. In spite of the efforts for creating supporting tools, documentation, and reducing ambiguities in the file format, some issues are impossible to control automatically; for instance: data curators’ misinterpretations, uncommon terminology, fundamental methodology aspects that are missing or unclear in the paper, etc. In such cases, field values might be compliant with the NTARC specification, but wrong with regard to the research under curation. In this respect, the reference block includes a curated_revision_number field that shows the number of times an NTARC object has been reviewed by curators, enabling the control of data revisions.
5. Dissemination Initiatives
Initiatives to make NTARC fully accessible for the scientific community include the publication of the NTARC database in generalist repositories, the open availability of NTARC documentation and tools for creating, accessing, analyzing, and updating content, fostering the growth of NTARC in academic centers, and directly contacting authors and encouraging the inclusion of NTARC data objects for participation in related conferences and workshops.
Regular citable releases of NTARC databases are provided via Zenodo [43]. Additionally, the whole project, including databases, tools, documentation, and specification, is fully accessible through GitHub [8]. Therefore, external curators and users have a complete environment to submit contributions, download research resources, and obtain feedback.
As a part of the NTARC project, Master’s and Bachelor’s degree students, in addition to using the NTARC database for their respective research, also review papers as part of their academic portfolio. This initiative helps the NTARC database to grow and promotes the critical reading of scientific publications among students, who become familiar with the state of the art in the field and are additionally trained in methodologies of scientific experimentation and dissemination. Students are also encouraged to contact the original authors during the paper curation process, thus extending the NTARC network and creating links between researchers and students.
Finally, an ongoing plan is to require NTARC objects as additional publishable material for papers accepted in related conferences, workshops, and journals. This initiative aims to increase the number of potential contributors and users of NTARC and, at the same time, to raise awareness among scientists of the importance of data sharing, reproducibility, and the consolidation of best practices.
NTARC is a data model for storing relevant information related to network traffic research. We widely described NTARC structures and introduced the tools developed for its creation, validation, sharing, and deployment. NTARC databases are expected to grow with the curation of new and old published papers, whereas NTARC structures are expected to be progressively refined with usage. Overall, NTARC is devised to improve how science is done in the field, and this is achieved by enhancing how research material and information is reused.
By using the “NTARC Database” (a release of NTARC objects containing the principal field investigations of recent years), we reviewed the trends and characteristics of NTA research from a critical and systematic perspective. NTA is particularly focused on attack detection, anomaly detection, and traffic classification, and the standard profile for a research paper is the proposal of a method based on machine learning that claims to improve detection accuracy. However, as posed by Sommer and Paxson in [131], machine learning has been widely used for security and NTA research for the last few decades, but its presence in commercial and real-world solutions has been almost non-existent. This conclusion draws an incongruous picture in which research and application seem to live in distant worlds.
We also identified some undesired trends to avoid. In summary: (a) a lack of accurate descriptions of NTA problem spaces, (b) an insufficient discussion about the traffic classes aimed at, (c) omitting post-analysis, (d) inaccurate, vague, or undefined descriptions of the targeted anomalies, (e) inappropriate, unrealistic data setups for unsupervised analysis, (f) the use of obsolete, irrelevant datasets, (g) monolithic approaches for overly complex problems, (h) neglecting encryption and streaming characteristics, and (i) non-replicable experiments or non-public experimental resources.
Such undesired traits are partially caused by researchers’ limited access to valuable network data (especially labeled data) and by a lack of realistic test environments and methods for evaluating new proposals. NTA is therefore tackled under laboratory conditions that do not properly consider the constraints, peculiarities, and limitations of final implementations and might not cover practical requirements in many cases. As a consequence, the relevance and efficacy of expert research are severely reduced.