An Intelligent Improvement of Internet-Wide Scan Engine for Fast Discovery of Vulnerable IoT Devices

: Since 2016, Mirai and Persirai malware have infected hundreds of thousands of Internet of Things (IoT) devices and created a massive IoT botnet, which caused distributed denial of service (DDoS) attacks. IoT malware targets vulnerable IoT devices, which are vulnerable to security risks. Techniques are needed to prevent IoT devices from being exploited by attackers. However, unlike high-performance PCs, IoT devices are lightweight, low-power, and low-cost, having performance limitations regarding processing and memory, which makes it difﬁcult to install security and anti-malware programs. Recently, several studies have been attempted to quickly search for vulnerable internet-connected devices to solve this real issue. Issues yet to be studied still exist regarding these types of internet-wide scan technologies, such as ﬁltering by security devices and a shortage of collected operating system (OS) information. This paper proposes an intelligent internet-wide scan model that improves IP state scanning with advanced internet protocol (IP) randomization, reactive protocol (port) scanning, and OS ﬁngerprinting scanning, applying k* algorithm in order to ﬁnd vulnerable IoT devices. Additionally, we describe the experiment’s results compared to the existing internet-wide scan technologies, such as ZMap and Shodan. As a result, the proposed model experimentally shows improved performance. Although we improved the ZMap, the throughput per minute (TPM) performance is similar to ZMap without degrading the IP scan throughput and the performance of generating a single IP address is about 118% better than ZMap. In the protocol scan performance experiments, it is about 129% better than the Censys based ZMap, and the performance of OS ﬁngerprinting is better than ZMap, with about 50% accuracy.


Introduction
Gartner, Inc. forecasts that 8.4 billion connected things will be in use, worldwide, in 2017, up 31% from 2016, and will reach 20.4 billion by 2020.The total spending on endpoints and services will reach nearly $2 trillion in 2017 [1].Meanwhile, it has become a reality that vulnerable IoT devices (CCTV, etc.) are frequently involuntarily involved in DDoS (distributed denial of service) attacks.In October 2016, the DNS service provider Dyn took down hundreds of websites-including Twitter, Netflix, and The New York Times-for several hours, due to IoT devices being infected with Mirai malware.As shown in Figure 1, Mirai malware primarily spreads by first infecting devices such as webcams, DVRs, and routers.It then deduces the administrative vulnerabilities of other IoT devices by means of brute force attack.Mirai mutations, such as Persia Lee, Ripper, and Bricker are generated daily [2,3].
The cause of IoT device infection by malware is mainly from security vulnerability management.This means IoT devices are used as cut-down OS, with no security functions, or are simply operated with a default ID and password.Cisco (2016) reported that more than 90% of network devices were running with known vulnerabilities, and there were 28 vulnerabilities per device on average [4].
In addition, according to a report by HP (2015), 80% of IoT devices have a default password that is vulnerable to privacy leakage, 70% of them are not encrypted during communication, and 60% of them have a weak web interface, and have not been updated for security [5].
Symmetry 2018, 10, x FOR PEER REVIEW 2 of 15 running with known vulnerabilities, and there were 28 vulnerabilities per device on average [4].In addition, according to a report by HP (2015), 80% of IoT devices have a default password that is vulnerable to privacy leakage, 70% of them are not encrypted during communication, and 60% of them have a weak web interface, and have not been updated for security [5].Today, there have been various "security by design" approaches for design-secure IoT devices and applications [6], and intelligent and secure monitoring of hierarchical topology smart home appliances and environments [7,8] by applying lightweight encryption, authentication, secure communication, and intrusion detection [9][10][11][12].The most general approach is to install an antivirus program in an IoT device.However, since IoT devices are not high-performance devices, like PCs, or mobile devices, it is not easy to eliminate security vulnerabilities and manage periodic updates by installing an antivirus program or other malware detection technology.Additionally, low-end devices are incapable of performing heavier, conventional cryptographic algorithms, due to their constrained resources [13,14].
From the viewpoint of the security vulnerability management of IoT devices, there are various reasons as to why it is difficult to eliminate security vulnerabilities.First, since IoT devices have been operating for a long period of time by using open source code and old communication technologies, they are running with old vulnerabilities.Second, since IoT devices have hard-coded software or have no automatic update features, like PCs, when vulnerabilities in the embedded OS and firmware are detected, it is difficult to patch them quickly.Third, since IoT device manufacturers are small enterprises, if the company goes down, it is difficult to provide follow-up services, such as a security update, and security vulnerabilities are left unresolved [15].
Recently, there have been many studies on internet-wide vulnerability scanning, such as ZMap [16] and Shodan [17], where connected vulnerable IoT devices are scanned in real time, and stored in a database, and their search results are shared.In this study, we propose a model designed to collect IoT device information in order to detect vulnerabilities in IoT devices connected to the internet, and compare its performance with existing technology.The structure of this paper involves Section 2 explaining the concept and existing studies on passive vulnerability scanning technology.Section 3 presents the passive vulnerability scanning engine model, which has been improved from the open source ZMap, and explains the scanning algorithm and OS fingerprinting identification and classification algorithm.Section 4 explains the method of measuring its performance and the results.Section 5 presents a comparison with existing technologies, such as ZMap and Shodan, and explains future study.Today, there have been various "security by design" approaches for design-secure IoT devices and applications [6], and intelligent and secure monitoring of hierarchical topology smart home appliances and environments [7,8] by applying lightweight encryption, authentication, secure communication, and intrusion detection [9][10][11][12].The most general approach is to install an antivirus program in an IoT device.However, since IoT devices are not high-performance devices, like PCs, or mobile devices, it is not easy to eliminate security vulnerabilities and manage periodic updates by installing an antivirus program or other malware detection technology.Additionally, low-end devices are incapable of performing heavier, conventional cryptographic algorithms, due to their constrained resources [13,14].
From the viewpoint of the security vulnerability management of IoT devices, there are various reasons as to why it is difficult to eliminate security vulnerabilities.First, since IoT devices have been operating for a long period of time by using open source code and old communication technologies, they are running with old vulnerabilities.Second, since IoT devices have hard-coded software or have no automatic update features, like PCs, when vulnerabilities in the embedded OS and firmware are detected, it is difficult to patch them quickly.Third, since IoT device manufacturers are small enterprises, if the company goes down, it is difficult to provide follow-up services, such as a security update, and security vulnerabilities are left unresolved [15].
Recently, there have been many studies on internet-wide vulnerability scanning, such as ZMap [16] and Shodan [17], where connected vulnerable IoT devices are scanned in real time, and stored in a database, and their search results are shared.In this study, we propose a model designed to collect IoT device information in order to detect vulnerabilities in IoT devices connected to the internet, and compare its performance with existing technology.The structure of this paper involves Section 2 explaining the concept and existing studies on passive vulnerability scanning technology.Section 3 presents the passive vulnerability scanning engine model, which has been improved from the open source ZMap, and explains the scanning algorithm and OS fingerprinting identification and classification algorithm.Section 4 explains the method of measuring its performance and the results.Section 5 presents a comparison with existing technologies, such as ZMap and Shodan, and explains future study.

The Concept of Internet-Wide Vulnerability Scanning
IoT malware targets vulnerable IoT devices for the purposes of hacking and malware infection.To avoid such malware infection, the technique of scanning plays a key role in eliminating vulnerabilities of low-end IoT devices with constrained resources.Until now, vulnerability scanning techniques [18] have developed from network scanning in the late 1990s, which scans local network and system vulnerabilities, to internet-wide scanning, which detects vulnerable internet-connected devices regularly.
Figure 2 and Table 1 show the concept and features of traditional network scanning and internet-wide scanning [19].The biggest difference of the two approaches is in the scanning method used to collect device information and scan vulnerabilities.In this paper, the viewpoint of distinguishing the network and internet-wide scanning is whether the technique uses a crafted packet to collect a device's information and find vulnerabilities.In other words, we defined that a scanner using a crafted packet is a network scan, since it is very intrusive, while using banner grabbing (or OS fingerprint) data is an internet-wide scan (non-intrusive internet-wide scan), since it is relatively less intrusive.

The Concept of Internet-Wide Vulnerability Scanning
IoT malware targets vulnerable IoT devices for the purposes of hacking and malware infection.To avoid such malware infection, the technique of scanning plays a key role in eliminating vulnerabilities of low-end IoT devices with constrained resources.Until now, vulnerability scanning techniques [18] have developed from network scanning in the late 1990s, which scans local network and system vulnerabilities, to internet-wide scanning, which detects vulnerable internet-connected devices regularly.
Figure 2 and Table 1 show the concept and features of traditional network scanning and internetwide scanning [19].The biggest difference of the two approaches is in the scanning method used to collect device information and scan vulnerabilities.In this paper, the viewpoint of distinguishing the network and internet-wide scanning is whether the technique uses a crafted packet to collect a device's information and find vulnerabilities.In other words, we defined that a scanner using a crafted packet is a network scan, since it is very intrusive, while using banner grabbing (or OS fingerprint) data is an internet-wide scan (non-intrusive internet-wide scan), since it is relatively less intrusive.The traditional network scan technique checks IPs for local networks to identify OS types, and collects the service type and version information through a port scan to detect vulnerabilities.To identify vulnerabilities, Nmap, Nessus, and other tools are used in network scanning.This technique attacks the device to find its vulnerabilities.For example, it uses crafted packets-well-known ID and password combinations-to check the vulnerability of the password, or directly attacks the device using an exploit code.However, since this technique collects internet-connected device information  The traditional network scan technique checks IPs for local networks to identify OS types, and collects the service type and version information through a port scan to detect vulnerabilities.To identify vulnerabilities, Nmap, Nessus, and other tools are used in network scanning.This technique attacks the device to find its vulnerabilities.For example, it uses crafted packets-well-known ID and password combinations-to check the vulnerability of the password, or directly attacks the device using an exploit code.However, since this technique collects internet-connected device information remotely, it could be filtered out by security equipment, such as a firewall.This is because the network scan techniques have the risk of shutting down the service or device to be inoperable by generating attacking patterns or traffic to the inspected device.
Therefore, many studies focus on internet-wide scanning techniques to collect a large volume of device data quickly from a remote location.Internet-wide scanning does not perform an attack to collect device information, but communicates regularity messages to collect necessary information.Furthermore, this technique covers all internet-connected devices, rather than only the inner network, and collects service access banners and communication traffic headers.

Preliminary Research Related to Internet-Wide Scanning
John Matherly [17,20] developed the Shodan search engine in order to search for information of internet-connected devices through a non-intrusive scan technique.Shodan collects the data of more than 500 million devices and systems on the internet and provides the collected data each month, but it is difficult to analyze the scan processing speed and method, since it is not disclosed as an open source, except for the list of application programming interfaces (APIs).Its main procedure is the scanning of information for HTTP, FTP, TELNET, and other open ports via handshaking.Device information is identified via keywords included in banners, and various protocols are supported to collect as much information as possible from a single device.Additionally, SSL encryption and version information are collected to discover heart bleed, poodle, and other vulnerabilities.
Shovat [21] is a passive vulnerability analysis tool developed by Petru Maior University in Romania with the Shodan engine.Shovat takes the output of traditional Shodan queries and performs an in-depth analysis of service-specific data, such as service banners.It embodies specially crafted algorithms that rely on novel in-memory data structures to automatically reconstruct common platform enumeration (CPE) names, and to proficiently extract vulnerabilities from the national vulnerability database (NVD) [22].
ZMap [16] is open source software, suggested by the University of Michigan, that can scan all internet-connected IPv4 address spaces, and applies random algorithm and sharing techniques to IP addresses for rapid searching.In addition, it is faster than Nmap, since it does not go through TCP/IP stacks.It can reportedly scan the 10 G IPv4 address space in 4 min and 29 s.
Censys [23] is based on the open source ZMap, and collects port information for 23 protocols.It provides banner information, protocol header information, other device information, and encrypted communication protocol vulnerabilities, just as Shodan does, but it does not have a technique for identifying OS information.The main procedure of Censys is to check whether the device IP is enabled, or the port is opened using a TCP SYN scan from a ZMap module.The ZGrab module, which is a plug application scanner, performs a handshake for banner grabbing if there is a response from a scanned device.It collects the data of the application of the port.Censys then extracts the relevant fields from the collected banner data, adds the metadata as a comment, and saves it in the database.

Proposed Model
This section describes the proposed model to improve the device collecting performance of ZMap and Shodan, which are some of the most effective internet-wide scanning techniques.
As shown in Figure 3, this model consists of three modules-IP alive scan module, reactive handshake scan module, and OS fingerprinting module.The IP alive scan module collects IoT device status information.The reactive protocol scan module collects network and service information, and the OS fingerprinting scan module collects OS information.To create a non-intrusive scan effect, the reactive handshake module does not store the TCP connection status data (no per-connection state) to find the protocol (port) data of devices.In the OS fingerprinting module, we applied the passive fingerprinting technique of using the TTL and window size values of the TCP/IP header.After the scanning finished, the collected data was stored in the DB for the device's information, and accumulated about 950 Gbytes per scan.Therefore, we used an Amazon cloud server to efficiently handle network traffic, instead of a single server to handle large amounts of device scan data.We also used Hadoop's HDFS and NoSQL (MogoDB) to multiplex the DB servers and distribute the data across multiple servers using "sharding" technology.In more detail, this method used the shard key to partition the collected data based on ranges of the ke,y in order to store very large data sets.Each range of the shard key values is assigned to a specific chunk on the cluster.The collected data is stored in the JSON file format per IP address.The collected data are split over the shards using a defined range-based shard key, in order to store very large data sets.Here, the shard key is defined as protocol_id.

IP Alive Scan Module
The "IP alive scan module" performs functions to check whether the device is in active status by generating TCP SYN or ICMP packets.It consists of a random IP address generator and an IP scan scheduler.To improve the hit rate of response packets, the white list/black list is applied to the scheduler.An IP alive scan scheduler generates an address list to scan about 4.3 billion IPv4 addresses.
The technology of generating IP addresses is very important in internet-wide scan technology.When scan traffic is generated sequentially, the scan function is not available, due to the detection by security devices, such as firewalls, IDS, and IPS.Therefore, it is necessary to have a technology that randomly generates lists of IP addresses for scanning.Therefore, we used an algorithm that converts IP addresses to decimal numbers, and then circulates them to generate a randomized IP address list.Randomized IP addresses make them look random in the network domain.
As shown in algorithm 1, This algorithm selects an initial decimal number from 2 32 numbers, which is the size of the IPv4 address domain.A prime number between 2 16 and 2 32 is selected to After the scanning finished, the collected data was stored in the DB for the device's information, and accumulated about 950 Gbytes per scan.Therefore, we used an Amazon cloud server to efficiently handle network traffic, instead of a single server to handle large amounts of device scan data.We also used Hadoop's HDFS and NoSQL (MogoDB) to multiplex the DB servers and distribute the data across multiple servers using "sharding" technology.In more detail, this method used the shard key to partition the collected data based on ranges of the key in order to store very large data sets.Each range of the shard key values is assigned to a specific chunk on the cluster.The collected data is stored in the JSON file format per IP address.The collected data are split over the shards using a defined range-based shard key, in order to store very large data sets.Here, the shard key is defined as protocol_id.

IP Alive Scan Module
The "IP alive scan module" performs functions to check whether the device is in active status by generating TCP SYN or ICMP packets.It consists of a random IP address generator and an IP scan scheduler.To improve the hit rate of response packets, the white list/black list is applied to the scheduler.An IP alive scan scheduler generates an address list to scan about 4.3 billion IPv4 addresses.
The technology of generating IP addresses is very important in internet-wide scan technology.When scan traffic is generated sequentially, the scan function is not available, due to the detection by security devices, such as firewalls, IDS, and IPS.Therefore, it is necessary to have a technology that randomly generates lists of IP addresses for scanning.Therefore, we used an algorithm that converts IP addresses to decimal numbers, and then circulates them to generate a randomized IP address list.Randomized IP addresses make them look random in the network domain.
As shown in Algorithm 1, This algorithm selects an initial decimal number from 2 32 numbers, which is the size of the IPv4 address domain.A prime number between 2 16 and 2 32 is selected to generate an IP address that has the difference of B class or more.The prime number is subtracted from the initial number to generate a decimal number.The value 2 32 is added to the initial number to generate a number between 1 and 2 32 if the initial number is smaller than the prime number.The generated decimal number is converted to an IP address and used as the scan address.This algorithm can reduce duplication of IP addresses in B class level.Its purpose is to bypass scan traffic detection by a security system when the highest level bandwidth of the agency allocated of the IP is B class, and it is assumed that the security system is positioned upstream of the network.As another technique to avoid filtering by security devices, we modified the scan scheduler to use WHOIS lookup information as a whitelist of IP address ranges for each domain assigned by ICANN (Internet Corporation for Assigned Names and Numbers).The scan scheduler generates IP lists based on the IP range of each domain, selects one of the IP lists, and scans the IP lists once in a specific time window.At this time, the representative IP of each domain is retrieved by searching the Whois lookup table, and the sub IP addresses of the corresponding domain are extracted.Then, the extracted IP list is shuffled using a randomized algorithm to prevent sequential scanning.

Reactive Handshake Protocol Scan Module
The "reactive protocol handshake scan" module collects the service (port) information running on its open ports through the IP list of alive states.This module scans information about major ports such as FTP, Telnet, SSH, and HTTP, that are used for communication in the device.Information collected from the major ports generates information by going through the process of extracting scan information, including the system connection banner, encrypted communication information, packet header, and HTTP header/body information.
Also, this technique is similar to the ZGrab module of the ZMap.The collection scope of this module identifies a total of 65,568 ports, of which 168 reserved ports and 65,400 unreserved dynamic ports were added to the fifteen basic protocol scan functions provided by ZGrab.Unreserved dynamic ports can be defined by the user or the input of extracted traffic data, such as PCAP.In addition, since most IoT devices connected to the wireless AP have a private IP, the m-search message of the UPnP protocol (1900 port), and response information from NAS servers (9000 port) are used to collect the information of IoT devices connected to the wireless AP.

OS Fingerpriting Scan Module
The "OS fingerprinting scan module" is used to identify the OS and firmware information of internet-wide IoT devices.Our approach uses remote TCP/IP stack fingerprinting using TCP/IP headers and HTTP banner grabbing techniques.This module can be identified by comparing the OS matching rule with information from ten fields related to the TCP/IP packet as TCP Headers (don't_fragment, Window Size, MSS, Window Scale, Timestamps, NOP, sackOK) and IP Headers (IPID, Total Length, TTL).We applied the k* classier algorithm to improve the accuracy of OS fingerprinting.
The k* algorithm [24,25] is a simple instance-based classifier similar to the K-nearest neighbor (K-NN) classifier.A new data instance x is allocated to a class that occurs the most frequently in the k-nearest data point y j , where j = 1,2 ... k [26].Additionally, the entropy distance algorithm is used to search for the most similar instance in the data set.When the entropy distance is used as the metric, actual attributes and omissions can be processed along with other benefits [27].The k* algorithm is shown below.

Summary of Experiment Methods
In these experiments, two approaches were applied for testing the performance of the proposed model.First, we experimented with the scan throughput of the IP alive scan module in a test environment with no real traffic.Second, the rest of experiment was conducted by installing our model in Amazon cloud server.We collected 226 million pieces of real information of internet-connected devices from September to November 2017, and then measured performances which are the information collection rate and the OS identification rate.The experiment results were compared with various existing techniques, i.e., ZMap-based and Nmap-based Masscan (mass IP port scanner [28]) and Shodan.

Throughput per Minute of IP Alive Scan
This technology focusing on public IPv4 addresses and collecting information about internet-connected devices requires high-speed traffic processing.The performance metric is TPM (throughput per minute) which generates 3.68 billion IP alive packets in the 1 Gigabit Ethernet environment, excluding private and reserved IPv4 addresses without system failure or errors.TPM = PPS(Generated Packets per Second) * 60 (2) For the performance of the IP alive scan module, we used the BigTao tester, IP/ethernet testing equipment that can generate L2/L3 layer packets and measure the throughput, frame loss rate, and latency.The testing network environment is constructed by connecting the BigTao tester and DUT (device under test), as shown in Figure 4. We installed the proposed model (IP alive scan module), ZMap (version 2.1.1),and Masscan (1.0.4) in the DUT system.The DUT is configured to allow packet forwarding, and the network transmission and receipt interfaces of the DUT system are connected to the traffic receipt and transmission interfaces of the test system, respectively.These experiments also measure the performance of TPM according to the IEEE RFC 3511 procedure [29] in the environment that generates the background traffic.The BigTao tester is configured to receive the IP alive scan packets generated by the DUT, and the detailed procedure is as follows.This result shows that our model produces almost similar result as ZMap.This means that there was no performance degradation of IP alive scan, even after an enhancement (IP address randomization, etc.) of the ZMap.Our model and ZMap show better performance compared with Masscan (55.92 million TPM).Additionally, it was 72.75 billion TPM in the real internet environment.However, we did not perform a comparison experiment with existing techniques in this real internet environment.This result shows that our model produces almost similar result as ZMap.This means that there was no performance degradation of IP alive scan, even after an enhancement (IP address randomization, etc.) of the ZMap.Our model and ZMap show better performance compared with Masscan (55.92 million TPM).Additionally, it was 72.75 billion TPM in the real internet environment.However, we did not perform a comparison experiment with existing techniques in this real internet  This result shows that our model produces almost similar result as ZMap.This means that there was no performance degradation of IP alive scan, even after an enhancement (IP address randomization, etc.) of the ZMap.Our model and ZMap show better performance compared with Masscan (55.92 million TPM).Additionally, it was 72.75 billion TPM in the real internet environment.However, we did not perform a comparison experiment with existing techniques in this real internet environment.

Randomization of IP Address Generation
In this experiment, the performance metric is a randomization of IP address generation, in order to measure how many single IP addresses without duplicate IPs in the B/C IP classes were generated.

Randomization of IP address generation
As shown in Table 3 and Figure 6, this result indicates that 33,597 single IP addresses were generated (67.8%) for B class IPv4 addresses, and 65,530 single IP addresses were generated (99.99%) for C class IPv4 addresses.This means, in theory, that it took 14 min and 44 s to generate 3.67 billion IPv4 addresses and, therefore, 4,858,560 IP addresses could be generated per second.In addition, the randomization of the proposed model was 68%, which is an improvement of 118% compared to ZMap (58%) and 126% compared to Masscan (54%).

Randomization of IP Address Generation
In this experiment, the performance metric is a randomization of IP address generation, in order to measure how many single IP addresses without duplicate IPs in the B/C IP classes were generated.

Randomization of IP address generation
As shown in Table 3 and Figure 6, this result indicates that 33,597 single IP addresses were generated (67.8%) for B class IPv4 addresses, and 65,530 single IP addresses were generated (99.99%) for C class IPv4 addresses.This means, in theory, that it took 14 min and 44 s to generate 3.67 billion IPv4 addresses and, therefore, 4,858,560 IP addresses could be generated per second.In addition, the randomization of the proposed model was 68%, which is an improvement of 118% compared to ZMap (58%) and 126% compared to Masscan (54%).The reactive scan module identifies the protocol service information from the opened ports of the activated IPs collected by the IP alive scan.In this experiment, the performance metric is the average protocol (port) information collection rate (PICR) and compared it with ZMap and Censys, which is shown in Figure 7.In this experiment, we used 195.68 million pieces of IP data (excluding

Average Protocol (Port) Information Collection Rate (%)
The reactive scan module identifies the protocol service information from the opened ports of the activated IPs collected by the IP alive scan.In this experiment, the performance metric is the average protocol (port) information collection rate (PICR) and compared it with ZMap and Censys, which is shown in Figure 7.In this experiment, we used 195.68 million pieces of IP data (excluding the duplicates) collected through the IP alive scan between September and November 2017.On the other   As shown in Table 4, The test results showed that the proposed model collected a total of 223,559,189 IPs (including duplicates) from 34 ports, while Shodan collected 212,428,355 IPs from 34 ports, and Censys collected 173,496,395 IPs from 13 ports.The average PICR of the proposed module was 105.24% and 128.86% compared to Shodan and Censys, respectively.

Accuracy of OS Fingerprinting Identificatiion
In this experiment, the performance metric is the accuracy of OS fingerprinting identification, whether the OS was identified correctly.As explained in Section 4.1, we used real data (TCP/IP packets and http banner grabbing) collected from the real internet environment for performance measurement.
As shown in Figure 8, the datasets for training consist of {X i , Y i } pairs, as follow in Equation ( 5).Here, Let N be the number of training input data, d be the number of features, and L be the label of OS information.
Training Datasets = {X i , Y i }, i = 1, . . ., N, X i = {f 1 , f 2 , . . ., f d }, Y i ∈ {L 1 , L 2 , . . .,L m } (5) (1) Each X i is a d dimensional feature vector extracted from TCP/IP headers, such as Window Size, Timestamps, NOP, sackOK, IPID, Total Length, TTL, etc. (2) Each Y i is the corresponding label information which uses OS information extracted from HTTP banner grabbing on the same IP.(3) N is the total number of training input data, m is the number of OS types.
For the training datasets, we trained input data consisted of 21,000 TCP/IP raw packets and HTTP information (3000 samples each for seven OSs).In the test stage, the actual class (correct answer) to measure the accuracy of the OS identification used 7000 pieces of banner information (1000 samples each for seven OSs) for verification.We utilized a set of algorithms in the WEKA tool and trained 4 classifiers: k* classifier, BF Tree, Random Tree, C4.5 for OS fingerprinting.
(3) N is the total number of training input data, m is the number of OS types.
For the training datasets, we trained input data consisted of 21,000 TCP/IP raw packets and HTTP information (3000 samples each for seven OSs).In the test stage, the actual class (correct answer) to measure the accuracy of the OS identification used 7000 pieces of banner information (1000 samples each for seven OSs) for verification.We utilized a set of algorithms in the WEKA tool and trained 4 classifiers: k* classifier, BF Tree, Random Tree, C4.5 for OS fingerprinting.Figure 9 shows the OS identification accuracy results of the four classifiers.The accuracy (here means TPR) of the classifiers is in the order of k*, c4.5, BFTree, and Random Tree, ranging from 0.477 to 0.491.Table 5 shows accuracy of each OS.Based on the k* classifier, the average accuracy was 0.491, Figure 9 shows the OS identification accuracy results of the four classifiers.The accuracy (here means TPR) of the classifiers is in the order of k*, c4.5, BFTree, and Random Tree, ranging from 0.477 to 0.491.Table 5 shows accuracy of each OS.Based on the k* classifier, the average accuracy was 0.491, and the accuracy by OS was in order of FreeBSD (0.961), Gento (0.652), and Ubuntu (0.532), and Fedora had the lowest accuracy.According to previous research results [30,31], the average accuracy of exact matching without using machine learning is 15.87~31.25%,and the average accuracy of approximate matching is 2.87~3.1%.On the other hand, when applying machine learning, the accuracy was improved to almost 50~60%.This results are better than the pattern matching results, but they are different from previous studies using machine learning.This is because it is difficult to make a relative comparison and analysis, because the performance results depend on the experimental method, the characteristics of the learning dataset (number of collected training data, the collection method, bias of the training data, range of OS identification, etc.).

Conclusions
When vulnerable IoT devices are infected to become zombies, this can cause DDoS attacks.Therefore, it is important to scan vulnerable internet-connected IoT devices and take appropriate measures.There are many approaches to this, but this study proposed an internet-wide scan engine that can detect vulnerable IoT devices.
In this paper, we presented an improved internet-wide scanning model with various approaches, including the advanced algorithm of IP address generation, the extension of the range of protocols collection, and more accurate OS fingerprinting modules with machine learning.We conducted various experiments to measure the effectiveness of the proposed model and summarize the results, as shown in Table 6.

Figure 1 .
Figure 1.Operation of Mirai malware and DDoS (distributed denial of service) attacks by infected IoT botnet.

Figure 1 .
Figure 1.Operation of Mirai malware and DDoS (distributed denial of service) attacks by infected IoT botnet.

Figure 2 .
Figure 2. Differences between traditional network scanning and internet-wide scanning techniques.

Figure 2 .
Figure 2. Differences between traditional network scanning and internet-wide scanning techniques.

Figure 3 .
Figure 3. Proposed model of vulnerability scanning engine.

( 1 ) 15 ( 4 )
BigTao tester generates 500 Mbps traffic to the DUT.(2)DUT generates 3.67 billion scan packets (TCP SYN/ICMP) for each of the three technologies.(3)BigTao tester checks the total number of frames collected during a period of 10 min.(4) TPM is calculated based on the total number of frames.Symmetry 2018, 10, x FOR PEER REVIEW 8 of TPM is calculated based on the total number of frames.

Figure 4 .
Figure 4. Experimental environment for measuring the throughput of scanning.

Figure 5 .
Figure 5.Comparison of the scan performance of existing tools.

Figure 4 .
Figure 4. Experimental environment for measuring the throughput of scanning.

Figure 5 .
Figure 5.Comparison of the scan performance of existing tools.

Figure 5 .
Figure 5.Comparison of the scan performance of existing tools.

Figure 6 .
Figure 6.Result of single IP address generation.

Figure 6 .
Figure 6.Result of single IP address generation.

4 ) 15 Figure 7 .
Figure 7. Measurement method of port information collection rate with scanning techniques.

Figure 7 .
Figure 7. Measurement method of port information collection rate with scanning techniques.

Figure 8 .
Figure 8. Experiment procedure of performance OS fingerprinting identification rate.

Figure 8 .
Figure 8. Experiment procedure of performance OS fingerprinting identification rate.

Figure 10 .
Figure 10.Confusion matrix for classification model performance.Figure 10.Confusion matrix for classification model performance.

Figure 10 .
Figure 10.Confusion matrix for classification model performance.Figure 10.Confusion matrix for classification model performance.

Table 3 .
Compared with existing techniques related to single IP generation rate.

Table 3 .
Compared with existing techniques related to single IP generation rate.

Table 4 .
Comparison of average PICR with existing techniques.The symbol "-" means that the service protocol information is not provided by Censys API.

Table 4 .
Comparison of average PICR with existing techniques.The symbol "-" means that the service protocol information is not provided by Censys API.