FIViz: Forensics Investigation through Visualization for Malware in Internet of Things

: Adoption of the Internet of Things for the realization of smart cities in various domains has been pushed by the advancements in Information Communication and Technology. Transportation, power delivery, environmental monitoring, and medical applications are among the front runners when it comes to leveraging the beneﬁts of IoT for improving services through modern decision support systems. Though with the enormous usage of the Internet of Medical Things, security and privacy become intrinsic issues, thus adversaries can exploit these devices or information on these devices for malicious intents. These devices generate and log large and complex raw data which are used by decision support systems to provide better care to patients. Investigation of these enormous and complicated data from a victim’s device is a daunting and time-consuming task for an investigator. Different feature-based frameworks have been proposed to resolve this problem to detect early and effectively the access logs to better assess the event. But the problem with the existing approaches is that it forces the investigator to manually comb through collected data which can contain a huge amount of irrelevant data. These data are provided normally in textual form to the investigators which are too time-consuming for the investigations even if they can utilize machine learning or natural language processing techniques. In this paper, we proposed a visualization-based approach to tackle the problem of investigating large and complex raw data sets from the Internet of Medical Things. Our contribution in this work is twofold. Firstly, we create a data set through a dynamic behavioral analysis of 400 malware samples. Secondly, the resultant and reduced data set were then visualized most feasibly. This is to investigate an incident easily. The experimental results show that an investigator can investigate large amounts of data in an easy and time-efﬁcient manner through the effective use of visualization techniques.


Introduction
The Internet of Things (IoT) helps form a sustainable ecosystem for realizing the smart cities vision. With the help of sensors and connected devices, most often ubiquitous, IoT offers seamless capabilities in continuous monitoring of various environments [1]. Healthcare is one of the domains where a huge increase in IoT has been witnessed. From remote patient monitoring to timely delivery of healthcare services Internet of Medical Things (IoMT) has occupied a large chunk of the IoT ecosystem [2]. Driven by the capabilities of Artificial Intelligence, IoMT can leverage the full potential of high-performance computing to provide better assessment through decision support systems [3]. Wireless Sensor Networks, mobile devices, wearable computing nodes, and Big Data enabled future internet architectures to provide IoMT the rapid enhancements for efficiency in seamless patient monitoring, early disease diagnosis, and effective treatment [4].
With the increase in the number of internet-connected medical devices, a corresponding spike has been observed in cyber-attacks (theft of critical information, theft of identities, credit cards, etc.) regardless of security measures in place. An alarming increase has been observed especially in IoMT infrastructures due to the sensitive nature of sensitive healthcare patient data [5]. Attackers use novel yet sophisticated methods and techniques to carry out their malicious intentions which often result in financial harm to the service. Attacks may be in the form of malware or others but usually they manipulate system services and resources to achieve their intended goals [6]. The attack scenario is shown in Figure 1. For investigators, these manipulated services provide useful forensic evidence to analyze the whole incident. This fact emphasizes the need for various digital forensics tools, methods, and techniques to investigate such destructive cyber attacks and their whole kill chain. Kill chain is a complete set of attack steps to achieve their goal [8]. These methods and techniques help in making threat intelligence, intrusion detection systems (IDS), and other security applications to cope with such cyber attacks. Digital forensics is the branch of forensic sciences that deals with the investigation and recovery of evidence from digital devices victimized in criminal activities. Forensics investigation is always carried out in four main steps, i.e., collection, examination, analysis, and reporting [9], as shown in Figure 2.
It is becoming too difficult for an investigator to gather evidence from these devices because of the large volume of variable and complex data that need to be analyzed. Different researchers, scholars, and scientists are seeking solutions to handle this enormous data and extract meaningful information from it. They proposed different digital forensics frameworks [10,11], machine learning techniques [12], artificial intelligence [13], and visualization [14,15], etc., to better detect and investigate such attacks (malware) and to present them in a court of law. This kind of technique is also important to feed already existing mechanisms and security applications. As it is said quotes, a picture is worth a thousand words. Visualization plays an important role in the easy and effective identification of crucial information regarding security, forensics (incident response), big data, business productivity and human perception [16]. Visualization allows non-technical personnel i.e., police, and lawmakers etc. to grasp the severity of an attack without having the required technical expertise [17].
According to Lowman et. al [18] and Nordbø et. al [19], the investigators should make sense of large volumes of data and find what they are looking for, and then visualize the resultant data timely. The existing work are lack of considering feature and information i.e., Dynamic-Link Library (DLL), handles, session ID etc., that are useful to reveal certain kind of attacks which are, ransoms, root kits, spy-wares and remote access tools (RATs). These attacks stimulates/exploits specific files of windows i.e., DLL, handles, session ID etc., to accomplish their work. Considering such scenarios we propose and create data set of 400 malwares/ransomes/spywares belong to different families. Then visualize the data set (specific artifacts/features/informations) that reveals the kind of attack or give an insight to the investigator [20].
In this paper, we propose to build the data set with ample information which can be well perceive and understand. The visualization of such data set, make it easy to distinguish between normal and abnormal activity of a malware, instead of traditional inadequate data sets [21]. The advance and latest malware samples (e.g., fund transfer, Korplug, CTB locker, etc.) are executed in clean virtual environment to create the data set. The samples are gathered from the Threat intelligence based honey pots placed by the honeypots we placed in the Application Security Research Lab at COMSATS University Islamabad Campus. Different forensics tools i.e., Log2timeline, sysinternal and cuckoo etc., are used to capture and logged the malware activities. The useful features are extracted and considered for the visualization (e.g., DLL manipulation is considered as a useful artifact for separating benign and malicious processes).
Distinct steps are performed in our proposed framework. Controlled clean virtual environment is created using open source virtualization tool, Virtual Box [7], dynamic behavioral analysis activities are monitored where we have considered the usage of dynamically linked libraries (DLL) as the identification mechanism. Finally,the details captured into the dataset are logged through forensic tools such as Cuckoo for sandboxing, Python Pandas for data cleaning and IDA Pro for analysis of DLL manipulation. After logging raw data are retrieved. Features extraction and selection are performed. weights and labels are also assigned to create concise data set. In the end, the resultant concise data are visualized in away that investigator can easily analyze different processes to distinguish benign and malicious ones [22].
The rest of the paper is sorted as follows: Section 2 Literature Review presents the state of the art in forensic visualization. Next we proposed the fiviz framework in Section 3, which is presented in Proposed Framework and Working, Section 4. Similarly, next Section 4.2 explains and discusses the implementation and results. Then, Section 5 Experimental Analysis presents validation of our experimental results in Results and Discussion section and lastly, Section 6, Conclusion concludes the paper.

Literature Review
Silpa et al. [23] and Akbal et al. [24] explain different potential locations to investigate evidences of a user's activity on the web. The web browser is used for many activities, such as emails, searches, online banking, blogging and social interaction, etc. They discussed how, where and what type of information web browsers log that could help the investigators to investigate browser-related malicious activities. They used digital forensics tools to retrieve useful information from the specified locations.
Fletcher [25] performed analysis of time-line data through a network analyzer tool called Wireshark. The time-line was created through a forensics investigation tool called Log2timeline which extracts time stamps from different sources. This time-line was then converted to Wireshark supported format packet capture (Pcap). This particular time-line was then analyzed with the built-in features of Wireshark, i.e., analysis profile, filtering, colorization, marking and annotation. Time stamp values, i.e., modification, accessed, creation, birth (MACB) were considered from the time-line for analyzing an activity.
Shafqat et al. [23], investigate activities on a very popular browser Google Chrome, in three different modes i.e., regular mode, private mode and portable mode. Chrome logs rich information about user activity on the web. Tools were used to retrieve different kinds of activities such as websites visited, visit count, visit time, bookmarks, searched words and downloads etc. Berggren [26] presents a self-developed forensics analysis tool called Time Sketch. Time Sketch is a collaborative tool with GUI. It adds meaning to raw data with rich annotations, comments, tags, and stars.
Khatik et al. [27] proposed a Server Timeline Tool for analyzing web servers on the basis of log files visualization for forensics analysis. The proposed algorithm named "forensic digger algorithm" takes log files of a web server and analyzes it in seven steps, as follows: 1.
Analyze the date and day based on the occurrence of an activity 2.
Integrate log files as per need.

3.
Identify the desired parameters for the analysis that assist in the collection of evidence for a malicious activity.

4.
Search for frequent users activity in integrated log files with their IP addresses, file access, bytes transferred.

5.
For the confirmation of maliciousness, analyze the behavior of their activity and IP address etc. 6.
Reconstruct the whole event timeline hypothesis by analyzing log files and evidence. 7.
Generate report.
Nordbø et al. [19], explain why visualizations are effective and important in digital forensics. They discuss different methods and techniques, used for visualizing large volumes of data. It also reported some challenges regarding forensics visualization. Hargreaves et al. [10], proposed a framework for low-level event extraction. These low-level events and some regular expression, which authors called test events are used to generate high-level events. On matching of one or multiple low-level events with these test events Python, digital forensics tool (PyDFT) creates high-level events, which are then used for investigation purposes. Kalber et al. [28], propose a novel approach which automatically reconstructs events that previously occurred on the system. These events provide initial information for further investigation. The main idea is to extract timestamps and the paths of different sources and then cluster them. These events are labeled on the basis of applications used and MACB values of files. Visualization of events is applied to support investigators. Inglot et al. [29], propose an improved version of a timeline analysis tool, Zeitline. These improvements are to be made in existing features such as filtering, searching, visualization (GUI), grouping. They also added some new features like plugins, reporting, improved designing, and interactivity. Esposito [30], in his research, developed reusable queries. His research objective is divided into two parts.
Super timeline creation was carried out through existing tool, log2timeline. Then this super timeline was imported into Microsoft Access Database. Here, reusable queries were developed in Structured Query Language (SQL) language to make a concise listing from that log2timeline data set. These queries were used to narrow down large data sets from thousands of records to hundreds. Searches were carried out for user program activities, user web history and recent document accessed.The main purpose of this approach was to shorten investigation time.
Osborne et al. [31] discussed the development process and complexities of forensic software for information visualization. This work is mainly split into three main phases.
Correlation phase Edwards [32], in his work proposes an approach for data reduction and summarization. Data collection through log2timeline provides a large amount of data to an investigator for investigation. Analysis of big data is a time consuming and complex issue for digital forensics investigators. To reduce and manage such large data sets, the author used Tapestry. It provides many features like filtering, grouping and searching through these large data sets on the basis of specific attributes. It has a GUI which facilitates investigators with some visualization.
Schrenk et al. [33], review some state of the art data visualization tools for manipulation of frequency, timelines, emails and logged data. The effectiveness of visualization in digital forensics and Comparison between these visualization techniques were also studied.
Gudjonsson [34] proposed and developed a Log2timeline named framework, which took traditional timeline to the super timeline analysis. Log2timeline parses different log files and artifacts deeper in an automatic fashion and produce a super timeline to assist investigators in their timeline analysis.
Olsson et al. [35] create a novel prototype of a tool named Cyber Forensics TimeLab (CFTL). This tool could index evidence by their time variables and produce a timeline. This timeline is then visualized to find coherent evidence faster and with accuracy.
Zeitline is a cross-platform forensics timeline editor by Florian Buchholz et al. [36], for digital investigation. It has a GUI to generate and locate specific events from arbitrary data sources. Its visualization also provides the ability for filtering, search and group together events recursively. 4n6time no ported to Log2Timeline [26], is an excellent visualizing tool by David Nides for searching and sorting of a super timeline. It does a great job of filtering and shows results(frequency) in different bar graphs.
Xie et al. [37] authors proposed monitoring of the execution states of DLLs. This work just focused on certain parameters rather than complete DLL file contents. In this way, it is useful to detect malware in a light and more feasible way. They have carried out their methodology step-wise. Firstly, they collect data through memory introspection of a clean virtual machine from hypervisor level, outside of the virtual environment. Secondly, they analyzed the benign behavior of the applications and extracted information to use it in detection phase. And lastly, in the online detection phase, they compare these execution states to detect DLL's malicious manipulation.
Ahmadi et al. [38] propose a novel approach for the classification of a large malware data set from Microsoft to their actual family groups. They proposed a learning-based system that would classify malware to their actual family by their different characteristics. This technique was carried out on obfuscated and unpacked malware to extract content and structural based features and achieved 99.77% accuracy.
Veeramani et al. [39] propose automated malware detection framework based on relevant API calls. Samples were deobfuscated through famous disassembler called IDA Pro to extract API calls. These API calls were stored in MySQL database and another technique known as Document Class-Wise Frequency Feature Selection (DCFS) was applied to get relevant API calls. SVM classifier was trained on this data set and achieved the average result of 95%.
Mosli et al. [40] discuss various features selection and their importance in classification. Mosli et al. selected three features i.e., registry activity, imported DLL and called API functions. Different tools were used to extract and select features. In the end, they trained different machine learning models on all three features to classify files. The Support Vector Machine (SVM) model achieved the highest accuracy rate of 96% on registry activity data. Case et al. [41] develop a novel approach to detect objective-c malware. They deeply analyze objective-c language which provides a wide range of facilities to users that are misused by malware developers to carry out their malicious intents. Plugins were also created to detect popular activities i.e., keystroke logging, swizzling and named ports.
Somarriba et al. [14] propose a framework for monitoring of Android applications to detect malicious activities at runtime. They monitor traces related to function calls and visualize them through a tree structure called a dendrogram. A dendrogram shows relationship from root to parent and then to child nodes. When applications are executed, they call a different set of APIs, so in this paper, they monitor this function calling and hooking of functions through this framework. They first logged these calls in a database, applied some rules and visualized these calls from root application execution to their last activity of function calls.
Fraud et al. [42] present API call patterns to detect malware. They log and collect relevant API calls data of executables through a tool called WINAPIOverride32. This tool monitor API calls and internal called functions. Selected features are based on the iterative pattern in these APIs. These iterative patterns are discriminative feature between benign and malicious processes. They applied classifiers on these features to classify normal and malicious processes.
Grégio et al. [17], display malware dynamic behavior in a simple way through activity graph to identify and classify malicious files. They monitored high level system and other activities i.e., system calls through slight modified version of behEMOT. Cluster graph which have vertices and edges were showed with respect to labels from different antiviruses. These clusters have same malware binaries from similar families. Monitored high level time series activities were visualized through activity graph where x-axis is labeled as time and y-axis is as different activities. They also distinguish these samples on basis of different system calls called during execution, which is not that successful. Trinius et al. [43], present behavior based malware analysis through visualization graphs. System level activity reports from CWsandbox were analyzed. Abstraction level technique is applied to these reports to transform it into more condense form. Two types of visualization techniques were created i.e., tree maps and thread graph Tree maps present summarized overview of activities with their frequency while thread graph shows behavior of the individual thread of a process. Table 1 shows that we have critically analyzed existing approaches considering methodology, the data set used, data set features, features and limitations. Lots of researchers have concluded that visualization is very effective in analyzing a large volume of data. Data visualization is important to find relevant information. After a critical analysis of existing approaches about malware analysis and forensics techniques, we concluded that there is space for some well-defined data set, according to that we could create a more effective visualization that could reveal a better picture of benign and malicious activities.

Proposed Framework and Working
After reviewing many existing works on malware investigation, visualization, and computer forensics, we concluded that there is a lack of more interactive, informative visualizations and creation of data set regarding forensic investigation. This work focuses on malware data set creation and effective visualization which is explained step-wise in Figure 3. We used different tools and methods at different stages of our proposed methodology.

System Assumptions and Design Goal
First, we assume that an attacker has installed a malware on the system being investigated, which could access any resource and has full privileges to carry out their malicious intents. Secondly, this malware is unaware about monitoring or logging of their activities. We also assume that attacker has full access to the internet that if he wants to download another malware from the internet or wanted to communicate to an external server or command and control (c & c), etc.
The design goal is to separate normal and abnormal (malicious) activities based on some features, i.e., imported libraries (DLLs). By running every individual malware in a clean virtual environment with full privileges, monitored its activity through different forensics tools and data is collected. These tools collect base level activities and interaction with the operating system (file system activity, registry activities and threads etc.), hard wares, soft wares and online internet communications.

Malware Samples Collection
There are many online sources to get malware samples but to analyze and study modern and most recent samples, these were collected from T-Eye honeypots installed in cyber security lab COMSATS. Collected samples belonged to different families and from the same family with different characteristics.

Controlled Virtual Environment Creation
Executing these samples on a real system is dangerous and not feasible. It is necessary to provide the simulated system with all facilities in place like internet access, administrator privileges etc. so to execute and log dynamic behaviors virtual environment was created through VMware. Windows 7 was installed with other components in place like internet connection, software to execute malware with different kind of extensions, and disabled default Windows security techniques such firewall, Windows Defender etc.

Monitoring and Logging Activity
To monitor and log malware activity, we executed each and every malware and logged their dynamic activity through different forensics tools which is described below. To get a clean environment each and every time, we took snapshots of a clean environment and then execute malware and then reversed it back to its original state and repeated it each time.
To log and monitor their runtime interaction with the operating system and system resources, Microsoft Sysinternal tools were used that log different kind of activities such as registry, filesystem, process, DLLs and threads etc.

Process Explorer
This tool in Figure 4 from Microsoft Sysinternals toolkit lines up all running processes and handles and which DLLs are opened and accessed by these processes. It has the fast searching capability to show which processes are active with handles and DLLs opened [44].

Process Monitor
Process monitor is an advanced monitoring and logging tool which combines capabilities from two legacy tools known as Filemon and Regmon. It shows real-time file system, registry and threads stacks etc. it has many powerful features like rich and non-destructive filtering, comprehensive event properties such as session ID's and user names, reliable processes information, simultaneous logging to a file and much more. It is the best and easily usable tool for troubleshooting and malware hunting [44].

Portable Executable Explorer (PE Explorer)
PE Explorer is a good tool for static analysis. It shows inner structure and data of a file or software. It is a kind of disassembler which shows file headers and other instructions needed to run that file such as, DLL, API, and other program accesses [44].

TCP View
It lists all TCP and UDP connections including local and remote addresses and the state of TCP connections. On starting, it enumerates all active TCP and UDP endpoints on a system, resolving all IP addresses to their domain name versions [44].

Wireshark
Wireshark is a widely used network protocol analyzer. It allows users to analyze network traffic and protocols at the microscopic levels. It is a de facto and de jure standard among industry and academia [45].

Cuckoo Sandbox
Cuckoo sandbox is an open source automatic malware analysis system. It executes malware in a controlled environment and provides detailed information regarding that malware. This is beneficial for threat intelligence to cope with future threats and attacks because it allows analysts to understand the context, motivation and the goals of a breach [46].
In the end, a large amount of raw data was collected from these three sources with almost 400 malware samples from different families and 100 benign samples which were cleaned using a thorough process as defined in Figure 5. We wrote our code in Python and used the following libraries:

1.
To manipulate the data we used Pandas.

2.
Numpy was used to create the data array.

3.
For splitting data into training and test set Skicit-learn was used.

4.
We needed to separate year, month and date, for this purpose Datetime library was used.
Seaborn library to find and remove the missing values With the help of this chronologically ordered data, we came up with high-level execution states/activities such as process/thread activities, network connections, file system (opened, accessed, modified and created) and registry activities.

Data Analysis Data Analysis Goals
Malwares make up around 70% [47], of cyber-attacks today and if a security analyst needs to investigate an incident, there is a high probability that they will need to perform malware analysis.

Static Analysis
Basically static analysis [38,48] constitutes analyzing files, executables etc., without actually executing them. It is usually performed through bisecting and disassembling files through different tools such as IDA Pro [49], etc. It is analyzing file internal structure, header, dependencies, strings, and instructions, etc.

Dynamic Analysis
This method involves the actual execution of files to observe their real-time activities. Files are executed in a virtual environment to monitor their behavior and how they interact with the operating system and other resource and services. It is normally carried out on clean virtual machine providing all resource and services to interact with and log their activities with tools like Sysinternals (process explorer, process monitor, process hacker, tcp view etc.), in real time [48].
Dynamic data analysis is a complex task but it yields fruitful results. Analyzing around 400 samples dynamically is very complex and time-consuming job. Each and every file is executed in a clean virtual environment and logged their behaviors with above-mentioned methods and tools. Then the malware activity is analyzed for further process.

Features Extraction and Selection
The malware analysis provides detailed information about malware behavior, yet raw reports are not suitable for data classification [50]. Raw data is collected through above different forensics tools given in Figure 6. Our raw data contains artifacts of registry files, file system activities, paths, network connections, processes and threads, DLLs, and time-stamps etc. The features are critically extracted, selected and analyzed to better classify normal and abnormal files or activities. The features is related to maliciousness or malware behavior that occurs during their different activities [38] So before applying effective visualization, we sorted and transformed our data into the more meaningful form. Labels and weights were assigned to the data to differentiate between benign and malicious activities. Furthermore, the assigned labels and weights are cross-validated from reputed third party sources [40].
Scripts were written to automatically categorize the data for further processing. Data is categorized into six different attributes which contains time, path, operation, result, weight, and classifier.

DLL (Dynamic-Link Library)
Dynamic-Link Libraries are shared libraries in Microsoft operating systems. They consist of code, data, and resources and are similar to EXE files with a different extension. It is used to carry out almost all kind of jobs in the computer system and different processes access it simultaneously according to their need. Different DLLs are for different purposes and are used at different times i.e., kernel32. DLL is involved in memory management and is loaded into a protected memory space normally at Windows start up [51].

Time-Stamp
Time-stamp means the exact time of event occurrence and it retrieved and stored in different formats. Chronological data are very important to reconstruct or find related artifacts of an incident.

Weight-Age
Weight-age is a security rate which is assigned on the basis of analyzing approximately 400 malware samples and hundred benign files. These were also validated by a reputed third party. Weights play an important role in distinguishing normal and abnormal behavior [47].

Phase 2: Data Set Creation
Data set creation is our first contribution, the most complex and time-consuming task. Data collection is performed at run-time through forensic tools. We used DLLs as a artifact for separating malicious and benign processes.Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) are used to extract and select features. Removes irrelevant data, weights and labels were assigned to the data, which is the requirement of the data set.
Data set creation steps are describe briefly as follows: After receiving data from forensics tools in raw form, data analysis was performed firstly.

Data Analysis
When we executed each and every malware sample, their dynamic activities were logged through forensics tools which produces large volume of sophisticated raw data. To make it useful and informative, we deeply analyzed that large amount of data.

Feature Extraction
After collection of data, it contains different kind of artifacts i.e., DLLs, registry, file system, networking and process thread etc. Our model mainly based on manipulation of DLLs to distinguish between benign and malicious processes. Before going to visualize our data, we need to preprocess it because it contains large amount of irrelevant data. So we first need to transform our data and extract features. To do so, we used Term Frequency (TF) approach, which calculate occurrence of term in a document. This occurrence of terms are considered as a feature and feature is obtain for the DLL.

Feature Selection
In this section, irrelevant data were removed. Features that provide more information and features with less were discarded. Terms that occurred more frequent in each document provided less information and it was better to remove them. This process is carried out through Term Frequency-Inverse Document Frequency (TF-IDF). Inverse Document Frequency (IDF) for term t was calculated as follows: id f t = log N id f t where in the above formula, df t is the number of samples contains term t and N is the total number of samples. Weights were calculated by multiplying TF with IDF calculated values. These updated weights were then assigned to each DLLs. After assigning weights to these DLLs, all the selected DLLs were tagged as normal and abnormal. Conclusive median was calculated to separate normal and abnormal DLLs, which had a value of 50. Above 50 weight was considered abnormal and below was normal.

Concise Data Set
Scripts are created to transform above data into concise data set. After extracting and selecting features, it needs to be transform into proper format. This resultant data set was formatted properly for visualization.

Data Preprocessing
In this step, we removed irrelevant data, sorted data and gave it some proper shape to process it further. The simple but yet effective formula was deduced to add missing data i.e., weights and labels etc., and performed sampling for implementing our proposed scheme. Formula:

1.
DLL that is present only in malicious file, then assign weight is equal and above 50.

2.
DLL that is present only in benign file, then assign weight below fifty (50).

3.
If DLL is present in both malicious and benign, then check count and if count is greater than 10, assign weight according to formula which is greater or equal to 50 but if this count is less than 10, assign weight less than 50.
Consider W= weightage and C = counts; If no. of counts ≥ 10, then value of k ≥ 0.5 and, If no. of counts < 10, then value of k < 0.5.

Data Transformation
The important attributes were converted into meaningful sorting and data labeling was performed. Figure 7, depicts the step-wise creation of data set, from the environment setup to resultant concise data set creation.   Table 2 shows comparison of our data set with existing approaches. Different features were compared like date-time, labeling, weights, DLL's, and applications etc. Our data set was better than other existing approaches in terms of taking less time to distinguish malicious and normal processes.

Phase 3: Visualization Creation
Visualization of the resultant data was the last component of our framework. To better visual textual data graph selection was the most important aspect. Graph selection depends on dimensions of data, data types and what kind of data you actually want to visualize. The timeline graph from D3.js (Data-Driven Documents) a popular JavaScript library available on (https://d3js.org/) is exploited for some of the visualization. D3 is an interactive and a browser based data visualizations library to create simple line chart to more complex info graphics. In this case, it stored and deployed graph oriented data on a time wise line chart. This chart consisted of time in seconds on the x-axis and weights on the y-axis. Other dimensions of data were presented on the tool-tip and median was calculated to discriminate two kind of labeled data, i.e., normal and abnormal DLLs.
The charts show the process behavior simple yet in an illustrative manner. The graphs are build according to the schema shown in Figure 8. The central name shows unique process name and line proceeds with weights and time of each DLL manipulated. Normally above the median shows malicious space with an abnormal label and below shows benign with a normal label. The tooltip presents the DLL name, their operation, results and assigned label. Spikes also shows DLL severity with the assigned weight.

Experimental Setup
For the proof of concept of our proposed methodology, we have acquired malware samples belonging to a different malware family. The malware collection process was carried out through threat intelligence platform T-Eye which is a prominent program by Applied Security Engineering and Research Group COMSATS University and Trillium information security systems [52]. A total of 400 malware and 100 benign including executables were executed in a clean virtualized environment to create data sets [53]. The environment was mainly created in Virtual Box, every executed file (malware) activity being monitored and logged through forensic tools which included Microsoft sysinternal tools, Cuckoo, and Log2timeline. Feature selection and data categorization are the most daunting task which was done by analyzing every single malware sample.

Normal vs Abnormal Activity Visualization
The benign and malicious executable programs were executed. Once the execution process was complete and they start consuming resources, the test-bed started logging the resources utilized. These can be both the physical resources as well as network connectivity. This log was monitored in order to better assess the programs activities. These activities were analyzed and useful features were determined. Below are the visualization of the some of the malware activities which distinguished normal and malicious activity. Among other criteria, we chose the Dynamic Linked Library (DLL) access and manipulation as a metric for classification of benign or malicious activities in programs Figure 9, presents DLLs manipulation by fund transfer malware. In this figure the x-axis shows time in seconds and the y-axis shows weight-age that were assigned during data-set creation. Calculated mean discriminated between normal and abnormal DLLs and tool-tip showed other features like operation, results, and class etc. So through this time-line visualization investigator can easily distinguish between benign and malicious processes to forward it for further investigation.  Figure 10 shows normal file iso-to-usb activity. The visualization clearly show time wise manipulation of DLLs, in which most of the DLLs' activity was below the mean point. It depicts that DLLs which were labeled normal were largely manipulated by this benign process.

Results and Discussion
This section provides details about results and validation of the proposed approach. Given that our methodology is based on data set creation and visualization of data in a way that investigator can clearly distinguish between normal and malicious processes. The questionnaire was performed on the basis of our visualization that we have created according to our data set. In this survey, the academia and industry experts were asked about our approach and results. The reactions were pretty positive with respect to approach knowing the complexity of the problem. The summary of our survey is given below.     Visualization and textual data evaluation:

Visual Benign vs Malicious Process Analysis Evaluation
1.
Can you identify any malicious DLL? 2.
What type of operation was performed on DLLs and their result? 3.
On the basis of different activity, can we distinguish between benign and malicious files? 4.
According to assigned weights which DLL got maximum weight and what was his operation? 5.
According to assigned weights which DLL got minimum weight and what was his operation? Figure 13 provides detailed survey from laymen with respect to visualization and textual representation.

T-Test
T test applies in statistics to two samples to demonstrate that whether there is a significant difference between those samples over a population. Such evaluation should proceed with an argument or hypothesis.

Hypothesis
Visualization allows user to analyze malwares in less time than user without visualization. Table 3 depicts the T-Test for performance evaluation.  Table 4 shows comparison of our approach with existing visualization techniques. This comparison was carried out in terms of functionality and information they provide. Existing work does not provide that much information that feasibly distinguish between normal and abnormal activities.

Conclusions and Future Works
In this modern era, there is much need of security mechanisms in place for big businesses and national defense organizations i.e., banks, federated identity management and security and intelligence agencies etc. Adversaries always look for new methods and techniques to carry out their illegitimate intentions. According to survey, malware attacks are always on the top. From the literature, it was concluded that much work has been already been performed on malware analysis. This paper presents a forensic investigation mechanism to better distinguish between normal and malicious processes through effective visualization. The main purpose of this research is to cope with future malicious disastrous attacks that ends up with financial loss, reputation damage and loss of useful information etc. This work also aims to feed already existing approaches to enrich their rules and policies. Making use of large amount of data from victimized devices is important for investigation of attack incidents. Our research also aims to remove irrelevant data and put intelligence to the large amount of data. Feature extraction and selection is one of the important and daunting aspect in creating data set from enormous and sophisticated data that has been achieved in this work in most feasible manner. The data set is rendered and visualized in such a way that investigator could easily differentiate malicious and normal activity or consider it for further investigation and processing.