Discovering Land Cover Web Map Services from the Deep Web with JavaScript Invocation Rules

Automatic discovery of isolated land cover web map services (LCWMSs) can potentially help in sharing land cover data. Currently, various search engine-based and crawler-based approaches have been developed for finding services dispersed throughout the surface web. In fact, with the prevalence of geospatial web applications, a considerable number of LCWMSs are hidden in JavaScript code, which belongs to the deep web. However, discovering LCWMSs from JavaScript code remains an open challenge. This paper aims to solve this challenge by proposing a focused deep web crawler for finding more LCWMSs from deep web JavaScript code and the surface web. First, the names of a group of JavaScript links are abstracted as initial judgements. Through name matching, these judgements are utilized to judge whether or not the fetched webpages contain predefined JavaScript links that may prompt JavaScript code to invoke WMSs. Secondly, some JavaScript invocation functions and URL formats for WMS are summarized as JavaScript invocation rules from prior knowledge of how WMSs are employed and coded in JavaScript. These invocation rules are used to identify the JavaScript code for extracting candidate WMSs through rule matching. The above two operations are incorporated into a traditional focused crawling strategy situated between the tasks of fetching webpages and parsing webpages. Thirdly, LCWMSs are selected by matching services with a set of land cover keywords. Moreover, a search engine for LCWMSs is implemented that uses the focused deep web crawler to retrieve and integrate the LCWMSs it discovers. In the first experiment, eight online geospatial web applications serve as seed URLs (Uniform Resource Locators) and crawling scopes; the proposed crawler addresses only the JavaScript code in these eight applications. All 32 available WMSs hidden in JavaScript code were found using the proposed crawler, while not one WMS was discovered through the focused crawler-based approach. This result shows that the proposed crawler has the ability to discover WMSs hidden in JavaScript code. The second experiment uses 4842 seed URLs updated daily. The crawler found a total of 17,874 available WMSs, of which 11,901 were LCWMSs. Our approach discovered a greater number of services than those found using previous approaches. This indicates that the proposed crawler has a large advantage in discovering LCWMSs from the surface web and from JavaScript code. Furthermore, a simple case study demonstrates that the designed LCWMS search engine represents an important step towards realizing land cover information integration for global mapping and monitoring purposes.


Introduction
A web map service (WMS) is an international standard protocol for publishing and accessing geo-referenced maps on the web [1-3]. This standard has facilitated the integration, access and value-added applications of geospatial information [1,4,5]. In the past few years, an increasing volume of land cover data and maps has been made available through WMSs for facilitating on-line open data access [6,7], supporting collaborative data production [8,9], and assisting in crowd-sourcing sampling and validation [10-13]. One example is the WMS-based information service system, which enables the open access and sharing of one of the world's first 30 m Earth land cover maps, called GlobeLand30 (www.globeland30.org) [14]. There are many other land cover web map services (LCWMSs) that provide global, national or local land cover data/maps. Some of these LCWMSs are registered in catalogues (e.g., the Catalogue Service for the Web, CSW) on the basis of Service-Oriented Architecture [15,16]. These LCWMSs can be discovered easily by matching keywords in corresponding catalogues. Some LCWMSs are dispersed in the surface web, which refers to content stored in static pages [17]. These can be accessed directly by visiting the hyperlinks associated with these static pages [18]. However, others are hidden in the deep web, which refers to content hidden in dynamic pages (often behind query interfaces), in script code, and so on [17,19]. In particular, with the rapid adoption of the Open Geospatial Consortium (OGC) standards, such services are increasingly employed by geospatial web applications that use JavaScript code supplemented by third-party JavaScript libraries (e.g., OpenLayers and ArcGIS for JavaScript) [20,21]. This JavaScript code and the libraries are often referenced in the form of JavaScript links, such as "<script src='../OpenLayers.js'></script>". Such deep web LCWMSs are difficult to discover by simply visiting hyperlinks. For example, the LCWMSs for GlobeLand30 can be discovered only by analysing the JavaScript code exposed by the service system (www.globeland30.org).
Recently, discovery and integration of these dispersed land cover data services have been stimulated by a wide range of development agendas, research programmes and practical applications [10,22-24]. One example is the United Nations' 2030 sustainable development agenda, which critically depends on the availability and utilization of land cover information at global, national and local scales. Unfortunately, no single land cover data set or service can currently meet the needs of all users. Moreover, many LCWMSs exist as isolated "information islands" and are not well connected [25]; therefore, it is natural to consider linking all the land cover information services scattered around the world to provide a more reliable land cover information service. This issue arose and was discussed at the 9-10 June 2015 international workshop organized by the International Society for Photogrammetry and Remote Sensing (ISPRS) and the Group on Earth Observations (GEO) [22]. It was proposed to design and develop a Collaborative Global Land information service platform (CoGland); however, doing so faces a number of technical challenges [22]. One of these challenges is to automate the discovery and connection of the existing but isolated LCWMSs to form a "one stop" portal for an integrated information service [25].
Automatic discovery of LCWMSs can be realized in either a passive or active manner. The passive method uses a keyword-matching approach to discover services in registered catalogues [15,16,26]. The success of this method depends largely on the willingness of service providers to register their WMSs and make them available [15,16,26]. However, numerous services published on the web are not registered in any catalogues, and thus cannot be found through catalogue searches [2,27]. For the active method, various search engines, which use web crawlers to continuously traverse static pages [28-30], have been developed for finding services dispersed in the surface web [31,32]. General-purpose search engines (such as Google and Bing), customized search engines with general crawlers and focused crawlers are the three most commonly used approaches [16,27,31-34]. In essence, however, these active approaches can only find geospatial web services that reside in static pages, whereas a considerable number of WMSs exist behind query interfaces and are hidden within JavaScript code [16,20,21]. The key challenge here is to detect and understand the deep web query interfaces and JavaScript code that signify WMSs and to extract them. The detection and understanding of query interfaces has received some attention [35]; however, few efforts have been devoted to the detection and understanding of JavaScript code specifically for discovering WMSs. Therefore, discovering WMSs from JavaScript code in geospatial web applications remains an open question [21].
This paper aims to solve this problem. It proposes a focused deep web crawler that can find more LCWMSs from both the deep web's JavaScript code and from the surface web. First, a group of JavaScript link names are abstracted as initial judgements. Through name matching, these judgements are used to judge whether a fetched webpage contains a predefined JavaScript link that may execute other JavaScript code to potentially invoke WMSs. Secondly, some JavaScript invocation functions and URL formats for WMS are summarized as rules from prior knowledge of how WMSs are employed and presented in JavaScript code. Through rule matching, the rules can be used to understand how geospatial web applications invoke WMSs using JavaScript code for extracting candidate WMSs. The above operations are fused into a traditional focused crawling strategy situated between the tasks of fetching webpages and parsing webpages. Thirdly, LCWMSs are selected from the list of extracted candidate WMSs through WMS validation and matching land cover keywords. Finally, an LCWMS search engine (LCWMS-SE) is designed and implemented that uses a rule-based approach to assist users in retrieving the discovered LCWMSs.
The remainder of this paper is organized as follows. Section 2 reviews related work on the discovery of both surface and deep geospatial web services, as well as the discovery of other deep web resources from JavaScript code. Section 3 outlines the active discovery approach, including a description of the initial judgements, the JavaScript invocation rules and the proposed focused deep web crawler. The design and implementation of the search engine that retrieves the discovered LCWMSs is described in Section 4. Preliminary experiments and analysis are presented in Section 5, and Section 6 concludes the paper.

Related Work
A WMS provides three operations: GetCapabilities, GetMap, and GetFeatureInfo to support requests for metadata, static maps and feature information about the corresponding georeferenced maps, respectively [1,16]. The GetCapabilities operation is intended to request service metadata, such as service abstract, reference coordinate system, the format, etc. [16]. The GetMap operation enables users to obtain static maps from multiple servers by submitting parameters that include layer names, bounding boxes, format, styles, etc. [3,16]. The GetFeatureInfo operation requests metadata for the selected features [16]. Because metadata can help users find the service layers they need and determine how best to use them [36], in general, discovering WMSs means finding the URLs of the GetCapabilities operation. Other OGC geospatial services such as the Web Feature Service (WFS) and the Web Coverage Service (WCS) also have GetCapabilities operations. Therefore, discovery methods for those services are analogous to the discovery methods for WMSs.
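To make the notion of "the URL of the GetCapabilities operation" concrete, the following minimal Python sketch builds such a request from a service base URL using the standard SERVICE, VERSION and REQUEST parameters of the OGC WMS specification. The function name `capabilities_url`, the default version and the example host are illustrative assumptions, not part of the approach described in this paper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def capabilities_url(base_url: str, version: str = "1.3.0") -> str:
    """Return a GetCapabilities request URL for a WMS base URL (illustrative helper)."""
    parts = urlsplit(base_url)
    # Keep vendor-specific parameters (e.g. a map file), but drop any
    # SERVICE/REQUEST/VERSION values already present in the base URL.
    reserved = {"service", "request", "version"}
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in reserved
    ]
    query += [("SERVICE", "WMS"), ("VERSION", version), ("REQUEST", "GetCapabilities")]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


if __name__ == "__main__":
    # example.org is a placeholder host, not a real service.
    print(capabilities_url("http://example.org/cgi?map=lc&REQUEST=GetMap"))
```

A candidate URL would typically be considered an available WMS only if a request of this form returns a valid capabilities document.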

Surface Geospatial Web Services Discovery
The most notable research for actively discovering surface geospatial web services can be divided into two types of approaches. The first type of discovery approach utilizes the application programming interfaces (APIs) of general-purpose search engines, employing predefined queries to search the Internet for discovering OGC geospatial web services [27,32,33,37,38]. For example, the Refractions Research OGC Survey [37] used the Google web API with two queries "request = getcapabilities" and "request = capabilities" to extract the results and then used a number of Perl regular expressions to find a complete capabilities URL in each returned page. Lopez-Pellicer et al. [33] employed Bing, Google and Yahoo! APIs with two sets of 1000 queries to measure the performances of the three search engines in discovering OGC geospatial web services. They finally identified Yahoo! as the best performer. In the last two years, Bone et al. [32] designed a geospatial search engine for discovering multi-format geospatial data using the Google search engine with both user-specified and predefined terms. Kliment et al. [27,39] used the advanced query operator "inurl:" of the Google search engine to discover specific OGC geospatial services stored in Google's indexes. This type of discovery approach is easily implemented, is low cost and can avoid the performance issues of systems that attempt to crawl the entire web [32]. However, a general-purpose search engine is intended to cover all topics without considering the characteristics of OGC geospatial web services; thus, with this approach, actual OGC geospatial services will be inundated with a large amount of search results [16]. Moreover, although general-purpose search engines (e.g., Google) have already attempted to incorporate deep web resources hidden behind hypertext markup language (HTML) forms [35], they still ignore some resources hidden in JavaScript code, especially OGC geospatial web services invoked by JavaScript libraries. Furthermore, the public APIs of general-purpose search engines have some calling limitations imposed through permissions, allowed numbers of calls, special policies, and so on. For instance, the API of the Google custom search engine provides only 100 free search queries per day [40].
The second type of discovery approach is to utilize customized search engines. Customized search engines can be categorized into general-purpose and topic-specific search engines [29]. Customized general-purpose search engines are equipped with a web crawler, which can continually collect large numbers of webpages by starting from a series of seed URLs (a list of URLs that the crawler starts with) and without a specific topic [32,41,42]. For example, Sample et al. [43] utilized Google search APIs to generate a set of candidate seed URLs for GIS-related websites and developed a web crawler to discover WMSs. Shen et al. [15] developed a WMS crawler based on the open source software NwebCrawler to support active service evaluation and quality monitoring. Patil et al. [44] proposed a framework for discovering WFSs using a spatial web crawler. Compared to general-purpose search engines, customized general-purpose search engines can easily control the process for discovering geospatial web services and can dramatically scale down the set of URLs to crawl. However, the crawled URLs still go far beyond only geospatial pages that contain geospatial web services [16].
To solve the above problems, topic-specific search engines with focused crawlers were developed. With a series of seed URLs and a given topic, focused crawlers aim to continually crawl as many webpages relevant to the given topic as possible while keeping the number of irrelevant webpages crawled to a minimum [41]. For example, Lopez-Pellicer et al. [26,45] developed a focused crawler for geospatial web services. This crawler utilized Bing, Google and Yahoo! search APIs to generate seed URLs and used best-first and shark-search strategies to assign priority to these URLs. Li et al. [16], Wu et al. [46] and Shen et al. [34] also developed focused crawlers based on different URL priority assignment approaches to discover OGC geospatial web services. Because they further scale down the set of URLs that must be crawled, these methods can improve discovery performance from both technical and economic perspectives [16]. However, the above crawlers in the customized search engines parse webpages solely through HTML parsing engines, which have no ability to identify and submit HTML forms and do not address JavaScript code. Therefore, these customized search engines are not able to discover WMSs in the deep web.

Deep Geospatial Web Services Discovery
The deep web refers to data hidden behind query interfaces, scripted code and other types of objects [17,47]. From this perspective, the underlying geospatial web services registered in catalogues are opaque because they are hidden behind catalogue query interfaces [47,48]; as such, geospatial web services registered in catalogues form part of the deep web. In general, searching services from known catalogues is implemented by keyword matching approaches [16]. However, a number of catalogues are unknown to the public. Therefore, discovering services from these unknown catalogues depends on successfully detecting and understanding the query interfaces of these unknown catalogues [47,48]. In the past few years, several works have been performed to address the query interface detection and understanding problem. For example, Dong et al. [49] proposed a novel deep web query interface matching approach based on evidence theory and task assignment. These works can be used for detecting and understanding the query interfaces of unknown catalogues, but their original purposes were not focused on discovering geospatial web services from the deep web.
Recently, some efforts have been made to discover geospatial web services from the deep web. For example, Lopez-Pellicer et al. [20,21] analysed the characteristics of the deep web geospatial content in detail. They pointed out that disconnected content, opaque content, ignored content, dynamic content, real-time content, contextual content, scripted content and restricted content are all part of the deep web geospatial content. Furthermore, they summarized six heuristics on the characteristics of the deep web content in the form of texts. Then, they designed an advanced geospatial crawler with plugin architecture and introduced several extension plugins for implementing some of the heuristics to discover part of the geospatial content hidden in the deep web. Finally, they indicated that discovering services in geospatial web applications based on JavaScript libraries (e.g., OpenLayers) is still one of the open questions that require further research. This was one motivation behind this study's effort to develop a new active discovery approach. Accordingly, our contribution lies in the development of a method to discover WMSs in geospatial web applications through focused crawling by inspecting the JavaScript code using a rule-matching approach.

Use of JavaScript Code in Deep Web Resources Discovery
Various resources hidden in JavaScript code comprise part of the deep web. Discovery of these resources has been the subject of research in the past few years. For example, Bhushan and Kumar [50] developed a deep web crawler by utilizing some regular expressions (Match/Replace) to extract clear-text links hidden in JavaScript code. Hou et al. [42] summarized matching rules formalized by a regular expression to extract the geographical coordinates of built-up areas from JavaScript code. Some resources dynamically generated by JavaScript code can be extracted using interpreters (i.e., the JavaScript parsing engine and WebKit). For instance, Hammer et al. [51] developed a WebKit-based web crawler to extract online user discussions of news. In geospatial web applications, WMSs are rendered as maps for visualization using geospatial JavaScript libraries; hence, the WMSs are hidden in JavaScript code. In general, these JavaScript libraries have concrete functions used to invoke WMSs. Learning from the discovery experiences of other deep web resources, this paper summarizes the invocation rules to extract candidate WMSs from JavaScript code.

Methodology
The previous discussions and analyses conclude that numerous LCWMSs exist in the JavaScript code of geospatial web applications and must be discovered. Discovering these LCWMSs requires addressing two challenges: detecting JavaScript code and understanding what that code does. Figure 1 represents the conceptual framework for discovering LCWMSs hidden in JavaScript code and the surface web. Compared with other focused crawler-based discovery methods, the proposed approach involves three major tasks. The first task is detecting predefined JavaScript links presented in the fetched webpages. This task is responsible for judging whether other JavaScript code potentially employs WMSs by matching the code against predefined judgements, which are composed of some JavaScript link names. The second task involves understanding the detected JavaScript code. It is the responsibility of this task to extract and validate candidate WMSs from JavaScript code through rule matching. Achieving this goal depends largely on JavaScript invocation rules, which are composed of WMS functions and their URL formats. The third task is to select available LCWMSs from the extracted candidate WMSs using a land cover keyword-matching approach. In Figure 1, the tasks in the dashed box are similar to other focused crawler-based discovery methods. More details are specified in the following subsections.
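The third task, land cover keyword matching, can be illustrated with a short Python sketch that scans the textual metadata of a capabilities document for land cover terms. The keyword list, the choice of metadata elements (Title, Abstract, Keyword, Name) and the function names are hypothetical assumptions made for illustration; the paper's actual keyword set is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword list for illustration only.
LAND_COVER_KEYWORDS = ("land cover", "landcover", "land use", "cropland", "forest")

# Metadata elements whose text is inspected (an assumption of this sketch).
TEXT_ELEMENTS = {"Title", "Abstract", "Keyword", "Name"}


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://www.opengis.net/wms}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def is_land_cover_wms(capabilities_xml: str, keywords=LAND_COVER_KEYWORDS) -> bool:
    """Return True if any inspected metadata text contains a land cover keyword."""
    root = ET.fromstring(capabilities_xml)
    texts = [
        element.text.lower()
        for element in root.iter()
        if local_name(element.tag) in TEXT_ELEMENTS and element.text
    ]
    return any(keyword in text for text in texts for keyword in keywords)
```

Substring matching of this kind is deliberately simple; a stricter variant might tokenize the metadata or weight layer-level matches differently from service-level ones.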

Detection of JavaScript Links Using Judgements
Most geospatial web applications use third-party JavaScript libraries to invoke WMSs for map visualization. The four best-known JavaScript libraries are OpenLayers [52], ArcGIS API for JavaScript [53], Leaflet [54] and Mapbox.js [55]. For example, the GlobeLand30 [9,14] and CropScape [6] geospatial web applications were developed using OpenLayers. It is standard syntax to enclose a reference to a JavaScript library in a "<script>" tag when it is used to develop a web application with HTML [56], as shown in Table 1. The "src" property value of the "<script>" tag is a JavaScript link to specific external JavaScript code or library. Therefore, a webpage that refers to any of the four WMS-related JavaScript libraries is a reasonable candidate for potentially containing WMSs. Based on this finding, this paper summarizes judgements to determine whether the fetched webpages have predefined JavaScript links that execute JavaScript code known to be related to WMSs. The judgements are composed of some JavaScript link names that are predetermined based on the reference formats of OpenLayers, ArcGIS API for JavaScript, Leaflet and Mapbox.js as they appear in webpages, as shown in Table 1. These judgements include "openlayers," "ol.js," "ol-debug.js," "arcgis," "leaflet," and "mapbox." The judgements are performed by matching names with the JavaScript links in fetched webpages. When a JavaScript link in fetched webpages contains one of the predefined names, it indicates that other JavaScript code in the webpages may employ WMSs. Therefore, the other JavaScript code will be addressed by the JavaScript invocation rules described in this paper.
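The name-matching step described above can be sketched as follows in Python. The judgement names are those listed in the text; the class and function names, and the use of the standard-library HTML parser, are assumptions of this sketch rather than details of the proposed crawler.

```python
from html.parser import HTMLParser

# Judgement names as listed in the text.
JUDGEMENTS = ("openlayers", "ol.js", "ol-debug.js", "arcgis", "leaflet", "mapbox")


class ScriptLinkCollector(HTMLParser):
    """Collect the 'src' values of all <script> tags in a webpage."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.links.append(src.strip())


def script_links(html: str) -> list:
    """Return the JavaScript links referenced by a webpage."""
    collector = ScriptLinkCollector()
    collector.feed(html)
    return collector.links


def fulfils_judgements(html: str) -> bool:
    """Return True if any JavaScript link contains one of the judgement names."""
    return any(
        name in link.lower() for link in script_links(html) for name in JUDGEMENTS
    )
```

Because this is plain substring matching, a short name such as "ol.js" can also match unrelated file names (for example "tool.js"), so the check is best read as a cheap pre-filter rather than proof that a page uses a mapping library.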
Moreover, two additional measures are adopted to avoid traversing all JavaScript links in a geospatial web application. The first is to include some WMS-related keywords extracted from the naming schemes of many actual JavaScript links known to launch WMSs. These keywords include "map," "initial," "wms," "layer," "conus," "capabilities," "demo," "query," and "content." Only when the name of a JavaScript link contains at least one of the above WMS-related keywords will the JavaScript link itself be addressed by the two subsequent JavaScript invocation rules. The second limiting measure is to summarize some keywords not related to WMSs from the naming schemes of numerous well-known JavaScript libraries, such as JQuery, ExtJS, Proj4js and AngularJS. The keywords mainly consist of "jquery," "ext-," "proj4js," "angularjs," "openlayers," "ol.js," "ol-debug.js," "leaflet," "mapbox" and so on. When the name of a JavaScript link contains one of the above WMS-unrelated keywords, the JavaScript link will be ignored.
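A minimal Python sketch of these two limiting measures is given below. The keyword lists come from the text. Two points are assumptions of the sketch: the "name" of a link is taken to be the file-name part of its path, and the exclusion list is checked first, so that a name such as "mapbox.js" (which also contains "map") or "jquery.js" (which also contains "query") is ignored.

```python
import posixpath
from urllib.parse import urlsplit

# Keyword lists as given in the text.
WMS_RELATED = ("map", "initial", "wms", "layer", "conus",
               "capabilities", "demo", "query", "content")
WMS_UNRELATED = ("jquery", "ext-", "proj4js", "angularjs", "openlayers",
                 "ol.js", "ol-debug.js", "leaflet", "mapbox")


def link_name(link: str) -> str:
    """Return the lower-cased file name of a JavaScript link (assumed 'name')."""
    return posixpath.basename(urlsplit(link).path).lower()


def should_analyse(link: str) -> bool:
    """Decide whether a JavaScript link is passed on to the invocation rules."""
    name = link_name(link)
    # Assumption: WMS-unrelated keywords take precedence over WMS-related ones.
    if any(keyword in name for keyword in WMS_UNRELATED):
        return False
    return any(keyword in name for keyword in WMS_RELATED)
```

For example, "js/initial_map.js" would be analysed, whereas "lib/jquery-1.11.min.js" and "js/utils.js" would be skipped.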

Understanding of JavaScript Code Using Invocation Rules
Only a discovery approach that understands how geospatial web applications invoke WMSs through JavaScript libraries will be able to discover the WMSs in these applications [57]. Therefore, two JavaScript invocation rules are summarized, based on knowledge of how geospatial web applications are developed to use WMSs, in order to understand what such JavaScript code does.
As shown in Table 2, the first JavaScript invocation rule is derived from the OpenLayers [52], ArcGIS API for JavaScript [53], Leaflet [54] and Mapbox.js[55], all of which are JavaScript libraries that provide support for OGC's web mapping standards [58].The listed total of seven common functions are often directly used to invoke WMSs in a large number of geospatial web applications.Therefore, to make the discovery method understand how geospatial web applications invoke WMSs, the seven functions collectively compose the first JavaScript invocation rule, which is formalized by the regular expression shown in Figure 2. The rule obtains JavaScript code fragments containing the URLS of candidate WMSs using a simple rule-matching approach.For example, the JavaScript code fragment "OpenLayers.Layer.WMS.Untiled ("Rivers," "http://129.174.131.7/cgi/wms_conuswater.cgi",...)" in the CropScape geospatial web application [6] can be extracted by simply matching the first rule.
JavaScript invocation rules are summarized based on development knowledge of geospatial web applications about WMSs to understand what such JavaScript code does.
As shown in Table 2, the first JavaScript invocation rule is derived from the OpenLayers [52], ArcGIS API for JavaScript [53], Leaflet [54] and Mapbox.js[55], all of which are JavaScript libraries that provide support for OGC's web mapping standards [58].The listed total of seven common functions are often directly used to invoke WMSs in a large number of geospatial web applications.Therefore, to make the discovery method understand how geospatial web applications invoke WMSs, the seven functions collectively compose the first JavaScript invocation rule, which is formalized by the regular expression shown in Figure 2. The rule obtains JavaScript code fragments containing the URLS of candidate WMSs using a simple rule-matching approach.For example, the JavaScript code fragment "OpenLayers.Layer.WMS.Untiled ("Rivers," "http://129.174.131.7/cgi/wms_conuswater.cgi",...)" in the CropScape geospatial web application [6] can be extracted by simply matching the first rule.The seven common functions are also deeply encapsulated in other functions that simply invoke WMSs.For example, in the GeoNetwork site [59], the function "OpenLayers.Layer.WMS" is encapsulated as the function "createWMSLayer," which takes only an array parameter as an argument to specify the base URLs of the WMS.It is impossible to exhaustively list all the encapsulated functions, because different geospatial web applications may adopt different encapsulated function names.Therefore, to make the discovery method understand how encapsulated functions invoke WMSs, the second JavaScript invocation rule is composed of the base URL formats of the WMSs and is also formalized by a regular expression, as shown in Figure 3.The regular expressions for the two rules are compiled by following the rules of the C# language.Through a simple rule-matching approach, the second rule is used to extract http or https URLs that cannot be extracted by the first rule from JavaScript code.For example, no URL can be extracted by 
The seven common functions are also deeply encapsulated in other functions that simply invoke WMSs. For example, in the GeoNetwork site [59], the function "OpenLayers.Layer.WMS" is encapsulated as the function "createWMSLayer," which takes only an array parameter as an argument to specify the base URLs of the WMS. It is impossible to exhaustively list all the encapsulated functions, because different geospatial web applications may adopt different encapsulated function names. Therefore, to make the discovery method understand how encapsulated functions invoke WMSs, the second JavaScript invocation rule is composed of the base URL formats of the WMSs and is also formalized by a regular expression, as shown in Figure 3. The regular expressions for the two rules are compiled by following the rules of the C# language. Through a simple rule-matching approach, the second rule is used to extract http or https URLs that cannot be extracted by the first rule from JavaScript code. For example, no URL can be extracted by the first rule from the GeoNetwork site [59]; however, a URL can be extracted by matching the second rule from the JavaScript code fragment "GeoNetwork.WMSList.push("Demo Cubewerx (WMS)-2," "http://demo.cubewerx.com/demo/cubeserv/cubeserv.cgi")" in the GeoNetwork site [59].
The two JavaScript invocation rules should be applied in a specific sequence, as shown in Figure 4. Only when JavaScript code referenced by a fetched webpage is identified as the potential source of a WMS will the first rule be executed. Then, only if the first rule does not yield any URLs containing candidate WMSs will the second rule be executed. This sequence is necessary because the second rule is more general than the first rule; therefore, it acts to complement the first rule to capture all the URLs in the identified JavaScript code.
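The ordering of the two rules can be illustrated with a minimal Python sketch. The regular expressions in Figures 2 and 3 are not reproduced here, so `RULE_1` and `RULE_2` below are simplified, hypothetical stand-ins (the first covers only one of the seven common functions), and the flag `is_wms_source` stands for the outcome of the judgements. The original system compiles its expressions in C#.

```python
import re

# Hypothetical, simplified stand-ins for the expressions in Figures 2 and 3.
# Rule 1: a known invocation function whose second argument is a quoted base URL.
RULE_1 = re.compile(
    r"""OpenLayers\.Layer\.WMS\s*\(\s*["'][^"']*["']\s*,\s*["'](https?://[^"']+)["']"""
)
# Rule 2: any quoted http or https URL, whatever function it is passed to.
RULE_2 = re.compile(r"""["'](https?://[^"'\s]+)["']""")


def extract_candidate_urls(js_code: str, is_wms_source: bool) -> list[str]:
    """Apply rule 1 first; fall back to the more general rule 2 only if rule 1 finds nothing."""
    if not is_wms_source:
        # The judgements were not fulfilled, so the JavaScript code is not analysed.
        return []
    urls = RULE_1.findall(js_code)
    if not urls:
        urls = RULE_2.findall(js_code)
    return urls


direct = 'var l = new OpenLayers.Layer.WMS("Base", "http://example.org/wms", {layers: "lc"});'
# Adapted from the GeoNetwork fragment quoted in the text; the array brackets are an assumption.
wrapped = ('GeoNetwork.WMSList.push(["Demo Cubewerx (WMS)-2", '
           '"http://demo.cubewerx.com/demo/cubeserv/cubeserv.cgi"]);')

print(extract_candidate_urls(direct, True))
print(extract_candidate_urls(wrapped, True))
```

Running rule 2 only as a fallback keeps the number of candidate URLs passed to validation small whenever a known invocation function is present.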

Discovery of Land Cover Web Map Services (LCWMSs) with a Focused Deep Web Crawler
A focused deep web crawler is proposed to discover LCWMSs from JavaScript code and the surface web. The focused deep web crawler mainly relies on the summarized judgements, JavaScript invocation rules and a JavaScript parsing engine for handling the JavaScript code. Moreover, as mentioned in Section 2.1, a focused crawler is able to efficiently traverse the web to find WMSs in the surface web. Therefore, the proposed crawler uses the focused crawler to traverse the web for finding WMSs in the surface web. In the proposed crawler, the judgements serve as a bridge between the JavaScript invocation rules and the focused crawler.
Figure 6 illustrates the framework for the focused deep web crawler. The proposed crawler starts with a series of seed URLs. Next, it begins to fetch the webpages corresponding to the seed URLs or to crawled URLs. Then, the proposed crawler executes one of two branches according to certain judgements. When the judgements are fulfilled, the first branch executes. Otherwise, the second branch executes. The first branch mainly analyses any JavaScript code in the page with the Jurassic JavaScript parsing engine and the two JavaScript invocation rules, according to the pseudocode detailed above. It aims to obtain candidate WMS URLs from JavaScript code, while the second branch is intended to obtain candidate WMS URLs from the surface web. The second branch starts by parsing the HTML of webpages to extract their titles and contents. Then, a relevance calculation is executed using the traditional vector space model and the cosine formula. In this step, the vector space model is an algebraic model that represents webpages and the given topic as two vectors of keywords. The cosine function measures the similarity between each webpage vector and the topical vector by comparing the angle between the two vectors. This calculation is responsible for measuring the degree of similarity between the extracted webpages and the given topical keywords.
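The relevance calculation can be sketched as follows. The paper does not state how the keyword vectors are weighted, so raw term frequencies are assumed here, and the function names and the threshold value of 0.2 are purely illustrative.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Represent a text as a sparse term-frequency vector (assumed weighting)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_text: str, topic_keywords: str, threshold: float = 0.2) -> bool:
    """Keep a page when its similarity to the topic reaches the (hypothetical) threshold."""
    return cosine_similarity(term_vector(page_text), term_vector(topic_keywords)) >= threshold


print(is_relevant("Global land cover web map service", "land cover map"))
```

A page that shares no term with the topic scores 0 and is discarded, while a page whose terms all coincide with the topic scores 1.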
If the relevance value is smaller than a given threshold, the webpage is discarded. Otherwise, when the relevance value is equal to or greater than the given threshold, the URLs in the webpage are extracted and each URL is assigned a priority score based on the text in the URL, its parent webpage and its anchor text. Furthermore, to more precisely target candidate WMS URLs, any extracted URLs that end with common file extensions such as ".pdf," ".tif," ".js," ".doc," ".zip" and so on are discarded.
After the candidate WMS URLs are obtained from these two branches, they are submitted to the component that performs WMS validation. The WMS validation component submits a GetCapabilities request to check whether the potential WMS URL corresponds to an available WMS. When the URL is not a valid WMS, it is discarded. When the URL is an available WMS, its capability file will be parsed to obtain metadata, such as the service name, service abstract, service keywords, bounding boxes, layer names, layer titles, layer abstracts and so on. Then, the available WMSs will be classified to identify the LCWMSs through matching land cover keywords. When the WMS metadata of service/layer names, titles, abstracts and keywords contain one of the land cover keywords, the WMS is classified as a LCWMS and stored in a LCWMS catalogue. Otherwise, the WMS is stored into a separate WMS catalogue. Finally, URLs in the URL priority queue will continue to be submitted for fetching webpages until the URL priority queue is empty or other conditions are fulfilled.
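The validation and classification steps can be illustrated with a short Python sketch that works offline on a capabilities document. The keyword set is a hypothetical subset of the land cover keywords, the helper names are illustrative, and the actual HTTP request is left out. Collecting every `Name`, `Title`, `Abstract` and `Keyword` element is a simplification of the metadata fields listed above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical subset; the full list of land cover keywords is given in the paper's tables.
LAND_COVER_KEYWORDS = {"land cover", "cropland", "forest"}


def get_capabilities_url(base_url: str) -> str:
    """Build a GetCapabilities request, keeping vendor parameters such as 'map'."""
    parts = urlsplit(base_url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in {"service", "request"}]
    query += [("SERVICE", "WMS"), ("REQUEST", "GetCapabilities")]
    return urlunsplit(parts._replace(query=urlencode(query)))


def metadata_text(capabilities_xml: str) -> str:
    """Concatenate names, titles, abstracts and keywords at service and layer level."""
    wanted = {"Name", "Title", "Abstract", "Keyword"}
    texts = []
    for element in ET.fromstring(capabilities_xml).iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if local_name in wanted and element.text:
            texts.append(element.text.strip())
    return " ".join(texts).lower()


def is_lcwms(capabilities_xml: str) -> bool:
    """A WMS is classified as an LCWMS if its metadata contains any land cover keyword."""
    text = metadata_text(capabilities_xml)
    return any(keyword in text for keyword in LAND_COVER_KEYWORDS)


sample = (
    '<WMS_Capabilities version="1.3.0" xmlns="http://www.opengis.net/wms">'
    "<Service><Name>WMS</Name><Title>Demo</Title></Service>"
    "<Capability><Layer><Title>Global Land Cover 2010</Title></Layer></Capability>"
    "</WMS_Capabilities>"
)
print(get_capabilities_url("http://example.org/wms?map=lc.map"))
print(is_lcwms(sample))
```

In a crawler, the document returned for the URL built by `get_capabilities_url` would be passed to `is_lcwms`, and a response that is not a parsable capabilities document would mark the candidate as invalid.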

LCWMS-SE Architecture
A working prototype of the LCWMS-SE was implemented based on the Microsoft .NET Framework 3.5. The LCWMS-SE can automatically find LCWMSs from both the surface and deep webs. Moreover, it enables users to retrieve their required services among the discovered LCWMSs. The LCWMS-SE is available, along with documentation, at [69]. The URL is temporary, but users will be able to track the progress of LCWMS discovery from the GlobeLand30 service system in future.
The LCWMS-SE comprises a focused deep web crawler, an indexing module, a retrieval module and a user query interface, as shown in Figure 7. The focused deep web crawler and indexing module are a desktop application based on C# WinForms. The information retrieval module and user query interface are a web application based on ASP.NET.

The focused deep web crawler is responsible for fetching and discovering LCWMSs from the internet, as discussed in Section 3.3. The discovered LCWMSs are stored in an LCWMS catalogue. The indexing module is responsible for indexing the discovered LCWMSs in spatial and textual forms to generate an index database. The retrieval module, which allows users to enter a bounding box and keywords, provides search and rank functionalities. The user query interface allows users to submit queries, which are addressed through stop word filtering, synonym expansion and so on. Finally, the user query interface submits these queries to the retrieval module and presents ranked results to the users.

Indexing and Retrieving Modules
A bounding box is a mandatory parameter for users, who need to execute GetMap operations to obtain their required maps [70]. These users usually know the bounding boxes in advance when searching for their desired LCWMSs. When the bounding box is provided as a query parameter, the returned LCWMSs will better conform to the users' requirements. Therefore, the working prototype of the LCWMS-SE takes a user-specified bounding box into account, treating it as a major query parameter.
Because one goal of the LCWMS-SE is to retrieve LCWMSs with respect to keywords and the user-specified bounding box, both textual and spatial information need to be indexed. The textual information was indexed as an inverted file structure using Lucene.Net 3.0.3 [71]. The indexed textual information contains the service name, service title, service abstract, service keywords, service URL, layer names, layer titles and layer abstracts. The spatial information was indexed using the BboxStrategy in Lucene.Net Contrib Spatial.NTS 3.0.3 [72]. The BboxStrategy is a spatial strategy for indexing and searching bounding boxes. In the BboxStrategy, the bounding box is stored as four numeric fields that represent the minimum and maximum X and Y coordinates, respectively. In the search engine, only LCWMSs that have at least one bounding box using the EPSG:4326 (European Petroleum Survey Group) geographic coordinate reference system are spatially indexed.
On the basis of these textual and spatial indexes, this search engine implements three search modes. The first mode is based on a keyword-matching approach. This mode searches only for the specified keywords in the title, name, abstract, and keywords at both the service level and the map layer level. The second search mode uses the bounding box-based method. This mode returns only LCWMSs whose spatial extent has the specified spatial relationship with the specified bounding box. These spatial relationships are "contain," "contained" and "overlap." The third mode is a combination of the first two modes. When users select this query mode, the search engine returns only LCWMSs that contain the specified keywords and whose spatial extent has the given spatial relationship with the specified bounding box.
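The three search modes can be sketched in Python over an in-memory list of services. This is not the Lucene.Net implementation: the `BBox` class, the dictionary layout and the requirement that every keyword must occur in the text are assumptions made for illustration. "Contain" is read as the service extent containing the query box, and "contained" as the reverse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """A bounding box held as four numbers, mirroring the four numeric index fields."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def contains(outer: BBox, inner: BBox) -> bool:
    return (outer.min_x <= inner.min_x and outer.min_y <= inner.min_y
            and outer.max_x >= inner.max_x and outer.max_y >= inner.max_y)


def overlaps(a: BBox, b: BBox) -> bool:
    return (a.min_x <= b.max_x and b.min_x <= a.max_x
            and a.min_y <= b.max_y and b.min_y <= a.max_y)


def spatial_match(service: BBox, query: BBox, relation: str) -> bool:
    if relation == "contain":      # service extent contains the query box
        return contains(service, query)
    if relation == "contained":    # service extent lies inside the query box
        return contains(query, service)
    if relation == "overlap":
        return overlaps(service, query)
    raise ValueError(f"unknown relation: {relation}")


def search(services, keywords=(), bbox=None, relation="overlap"):
    """Keyword mode, bounding box mode, or both, depending on which arguments are given."""
    titles = []
    for service in services:
        text = service["text"].lower()
        if keywords and not all(k.lower() in text for k in keywords):
            continue
        if bbox is not None and not spatial_match(service["bbox"], bbox, relation):
            continue
        titles.append(service["title"])
    return titles


services = [
    {"title": "South America land cover 2010", "text": "South America land cover 2010",
     "bbox": BBox(-82.0, -56.0, -34.0, 13.0)},
    {"title": "Europe forest", "text": "Europe forest map",
     "bbox": BBox(-10.0, 35.0, 30.0, 70.0)},
]
brazil = BBox(-74.0, -34.0, -34.0, 5.0)  # rough, illustrative extent
print(search(services, ["land cover", "2010"], brazil, "contain"))
```

The sketch ignores extents that cross the antimeridian, which a production spatial index would have to handle.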

User Query Interface
The user query interface aims to both convey the users' requirements to the system and display the system's responses to the user. In other words, it is a bridge between the users and the system [73]. The user query interface of the search engine is composed of seven parts: a text input field for entering query terms, an interactive graphical interface, a bounding box interface, a retrieval mode interface, a relationships interface, a search button and a textual list, as shown in Figure 8. Specifically, the text input field enables users to specify what their desired LCWMSs should be about in the form of a list of keywords. Meanwhile, the bounding box interface and interactive graphical interface allow users to specify the area that their desired LCWMSs should cover in the form of coordinates or a map. The interactive graphical interface was implemented using the OpenLayers API [52] and includes some basic map tools, such as pan, zoom, rectangle selection, etc. Its base map is the GlobeLand30 dataset [14] for the year 2010. The retrieval mode interface allows users to interact with the search engine in any of the three different modes described in Section 4.2. The relationships interface enables users to select among three different spatial relationships to restrict the search results spatially. After entering keywords and/or the bounding box and assigning the retrieval mode and spatial relationship, the user clicks the search button and the matching LCWMSs are displayed in the textual list. Every result contains the service title, service abstract, layer titles, layer number and bounding box. Service requests such as GetCapabilities, GetMap and AddToLayer are also provided if the LCWMS has a bounding box that uses the EPSG:4326 geographic coordinate reference system. The AddToLayer request allows users to integrate the returned LCWMSs into the interactive graphical interface.

Experiments and Analysis
This section compares the performance of the proposed focused deep web crawler to a focused crawler-based approach, evaluates the enumeration of the LCWMSs that these crawlers could locate on the web, and demonstrates the LCWMS-SE's abilities concerning the retrieval and integration of discovered LCWMSs. The experiments are carried out using an Intel Pentium 4 CPU, running at 3.20 GHz, with 1 GB of RAM, and 6 M of bandwidth.

Experiment 1: Discovering WMSs from JavaScript Code
This section presents an experiment carried out on 15 March 2015 that demonstrates the ability of the proposed focused deep web crawler to discover WMSs hidden in JavaScript code compared to a focused crawler-based approach. Table 4 lists the eight seed URLs used in the first experiment. The eight seed URLs correspond to eight online geospatial web applications that invoke a total of 43 unique WMSs. The 43 unique WMSs are all hidden in JavaScript code; of these, only 32 WMSs are available. In this experiment, the crawling scope, which refers to where the crawler can traverse the internet and retrieve webpages by hyperlinks, is restricted to only these eight online geospatial web applications. The proposed focused deep web crawler discovered all 32 available WMSs, while the focused crawler-based approach did not discover any WMSs. This indicates that the proposed focused deep web crawler has the ability to discover WMSs hidden in JavaScript code. Its success largely depends on the judgements, the JavaScript invocation rules and the JavaScript parsing engine discussed earlier in this paper. Using these, the focused deep web crawler knows how geospatial web applications invoke WMSs. In contrast, the focused crawler-based approach uses only an HTML parsing engine and, therefore, has no ability to investigate WMSs hidden in JavaScript code; consequently, it cannot discover the WMSs hidden in these URLs.
Section 2.3 discussed some studies that use rules to extract URLs from JavaScript code concerning the general domains. In addition, the open source crawler Heritrix has developed an extended module (ExtractorJS) to extract URLs from JavaScript code. To compare the efficiency between the existing crawlers with a JavaScript module and our approach, a supplementary experiment was conducted on 25 April 2016. The baseline for comparison is the Heritrix ExtractorJS module. The seed URLs are the URLs numbered 2, 3, 5, 6 and 7 in Table 4 (the remaining three URLs are invalid). The crawling scope is restricted to only these five online geospatial web applications. Both approaches discovered all the available WMSs hidden in JavaScript code. However, the approach using ExtractorJS extracted and validated 3266 URLs to discover those available WMSs, while our approach extracted and validated only 58 URLs to discover them, indicating that our approach has a higher efficiency than the approach with ExtractorJS. This is because the criteria, the two additional measures, and the first JavaScript invocation rule let our approach understand how those geospatial web applications invoke WMSs. In contrast, the approach using ExtractorJS utilizes a more general rule than the second JavaScript invocation rule in our approach to extract URLs from JavaScript code.

Experiment 2: Enumerating LCWMSs
The second experiment was carried out to enumerate the LCWMSs that the proposed crawler discovered on the web. In this experiment, there are two sets of seed URLs. The first set uses the same eight geospatial web URLs as Experiment 1. The second set is obtained by submitting a set of queries to the Bing Search API. These queries include mandatory keywords, land cover keywords and land cover product names. The mandatory keywords are associated with three operations of a WMS and the names of related JavaScript libraries, as shown in Table 5. The land cover keywords are listed in Table 2 in Section 3.2. A total of 31 land cover product names are reported in Table 6. Each Bing query is composed of a mandatory keyword and a land cover keyword or product name. Executing these queries against the Bing Search API resulted in a total of 4834 URLs. Moreover, additional queries are dynamically generated from the fetched webpages using a maximum-frequency method to perform daily updates to the list of seed URLs.
Table 7 indicates that the proposed approach discovers a considerably larger number of WMSs than those found by the previous works of Li et al. [16], Lopez-Pellicer et al. [26], Kliment et al. [27] and two well-known websites [80,81]. These previous works did not focus on a particular domain and they all searched for general WMSs. In addition, the discovered WMSs and LCWMSs have a total of 1886 unique WMS hosts, while Bone et al. [32] reported finding only 823 unique WMS hosts. The superior performance of the proposed approach is largely because it finds extra WMSs in the deep web based on the JavaScript invocation rules and reduces the influence of topic drift to discover WMSs quickly and accurately by updating the seed URLs regularly. However, as time goes on, the number of WMSs naturally increases, which affects the credibility of the comparison. Therefore, the number of services found by Kliment et al. [27] and the two well-known websites [80,81] was tracked again on 27 April 2016. A total of 4233 WMSs with 2922 available WMSs were found in the website [82] provided by Kliment et al. [27]. In addition, a total of 8618 and 4029 WMSs were found by the two well-known websites [80,81], respectively. However, the number of WMSs discovered by our proposed approach is still larger than the number of WMSs found by the above three methods.
Moreover, seed URLs and crawling scope may also affect the credibility of the comparison. In Experiment 2, the seed URLs used in Li et al. [16] and the two well-known websites [80,81] are unknown. Similar to our approach, the seed URLs in Lopez-Pellicer et al. [26] were supplied by submitting a set of queries to the Bing Search API and the Google Web Search API. Kliment et al. [27] used a Google search engine-based method to discover services. This method has no need of seed URLs, but the Google search engine has more seed URLs than our approach. Furthermore, none of these approaches imposed any limits on the crawling scope. In this experiment, the seed URLs had only a small effect on the final results because all the approaches executed over a long period of time (in fact, some approaches are still running today). Therefore, seed URLs and crawling scope have minimal effect on the credibility of the comparison in this experiment.
Although our approach has discovered more WMSs and LCWMSs than the compared approaches, those approaches are still valid and necessary because they are complementary to each other and can be utilized to create a good balance between the costs required for WMS discovery and efficiency. For example, the Google search engine-based method used in Kliment et al. [27], which has low development costs, could be combined with our approach to improve discovery efficiency.
Figure 9 represents the spatial distribution and numbers of the discovered LCWMSs in our approach. The spatial location of a LCWMS can be obtained through its IP (Internet Protocol) address. The most important locations of LCWMSs (where 5 or more were found) are labelled in Figure 9. The numerators and denominators of these labels represent the numbers of LCWMSs and WMSs, respectively.
The LCWMSs discovered above can be subdivided into real LCWMSs and generalized LCWMSs based on their original data types. Real LCWMSs are published directly from land cover data and maps, while generalized LCWMSs are released from geographic data and maps related to land cover. To estimate the number of real LCWMSs in the discovered LCWMSs, a split-and-random sampling strategy is carried out. First, the number of LCWMSs containing data about each class except vegetation in Table 3 is calculated through matching the land cover keywords of each class in a prioritized sequence. For example, a LCWMS is classified as cropland if its metadata contains one of the cropland keywords in Table 3. Then, ten random LCWMSs for each class are selected as samples to manually determine whether they are real LCWMSs. This process was carried out five times (only one time for the shrubland and tundra classes). Table 8 represents the number of LCWMSs containing data about each class and the average ratio of real LCWMSs.
It can be observed from Table 8 that the ratio of real LCWMSs in the discovered LCWMSs varies with different classes. Up to 90% of the real LCWMSs contain data about the land cover class. These real LCWMSs include the US Geological Survey's NLCD, land cover data of South America for the year 2010, GlobeLand30 data, and so on. However, only approximately 30% of the real LCWMSs contain data about the barren land class. This is because many LCWMSs in this class are published from geological survey data, whose metadata often contains one or more of the barren land keywords in Table 3.
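The split-and-random sampling strategy can be sketched in Python as follows. The manual inspection of each sampled service is represented by a caller-supplied function `is_real`, the class keywords are placeholders rather than the entries of Table 3, and the priority order is assumed to be the order of the dictionary.

```python
import random
from statistics import mean
from typing import Callable, Optional


def classify(metadata: str, class_keywords: dict) -> Optional[str]:
    """Return the first class, in priority (dictionary) order, with a keyword in the metadata."""
    text = metadata.lower()
    for class_name, keywords in class_keywords.items():
        if any(keyword in text for keyword in keywords):
            return class_name
    return None


def estimate_real_ratio(services: list, is_real: Callable[[object], bool],
                        rounds: int = 5, sample_size: int = 10, seed: int = 0) -> float:
    """Average, over several rounds, the share of real LCWMSs in a random sample of one class."""
    if not services:
        return 0.0
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    ratios = []
    for _ in range(rounds):
        sample = rng.sample(services, min(sample_size, len(services)))
        ratios.append(sum(1 for s in sample if is_real(s)) / len(sample))
    return mean(ratios)


# Placeholder keywords, listed in an assumed priority order.
class_keywords = {"cropland": ["cropland", "farmland"], "forest": ["forest"]}
print(classify("Corn cropland next to a forest", class_keywords))
```

The defaults of five rounds and ten samples follow the procedure described above; classes sampled only once would use `rounds=1`.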

Case Study for Retrieval and Integration
Additionally, a case study is presented to demonstrate the retrieval and integration of the discovered LCWMSs in the LCWMS-SE. Assume that a user wants to search for new land cover datasets around Brazil to compare them with the GlobeLand30 dataset for the year 2010. Accordingly, to match the query's intent, the keywords "land cover 2010" and a bounding box around Brazil should be submitted to the designed search engine. As shown in Figure 10, a total of 803 LCWMSs were found among the discovered LCWMSs. The query takes approximately 0.148 s. The bounding boxes of all the returned LCWMSs contain the specified bounding box. Figure 11 represents the integration between the land cover data from the GlobeLand30-2010 dataset and that of South America for the year 2010 with 30 m resolution after a user clicked the "AddToLayer" link (red arrow). This demonstrates that the designed search engine has the ability to retrieve and integrate LCWMSs.

Table 8 (excerpt). Class: Snow; number of LCWMSs: 4074; average ratio of real LCWMSs: 40%.

Conclusions
Automatic discovery of, and connection to, isolated land cover web map services (LCWMSs) can help in sharing land cover data, which can support the United Nations' 2030 development agenda and conform to the goals of CoGLand. Previous active discovery approaches have been aimed primarily at finding LCWMSs located in the surface web and behind query interfaces in the deep web. However, very few efforts have been devoted to discovering LCWMSs from deep web JavaScript code, due to the challenges involved in web map service (WMS)-related JavaScript code detection and understanding. In this paper, a focused deep web crawler was proposed to solve these problems and discover additional LCWMSs. First, some judgements and two JavaScript invocation rules were created to detect and understand WMS-related JavaScript code for extracting candidate WMSs. These are composed of some names of predefined JavaScript links and regular expressions that match the invocation functions and URL formats of WMSs, respectively. Then, identified candidates are incorporated into a traditional focused crawling strategy, situated between the tasks of fetching webpages and parsing webpages. Thirdly, LCWMSs are selected by matching with land cover keywords. In addition, a LCWMS search engine was implemented based on the focused deep web crawler to assist users in retrieving and integrating the discovered LCWMSs.
Experiments showed that the proposed focused deep web crawler has the ability to discover LCWMSs hidden in JavaScript code. By searching the worldwide web, the proposed crawler found a total of 17,874 available WMSs, of which 11,901 were LCWMSs. The results of a case study for retrieving land cover datasets around Brazil indicate that the designed LCWMS search engine represents an important step towards realizing land cover information integration for global mapping and monitoring purposes.
Despite the advantages of the proposed crawler, much work remains to improve the effectiveness of discovering LCWMSs. First, the proposed crawler considers only one script type (JavaScript). However, other script types (e.g., ActionScript) can also invoke LCWMSs. In the future, additional rules will be summarized to help the proposed crawler discover LCWMSs invoked from other scripting languages. Secondly, some currently available LCWMSs may become unavailable in the future. Future work will include a monitoring mechanism to monitor and assess the quality of the discovered LCWMSs. Thirdly, the proposed crawler utilized the cosine function to measure topical relevance and to assign priorities to URLs. Moreover, it adopted a keyword-matching approach to classify each WMS. In the future, machine-learning approaches such as support vector machines will be applied to measure topical relevance, assign URL priorities and classify WMSs.

Figure 2. A regular expression representing the first rule.

Figure 3. A regular expression of the second rule.

Figure 4. Instructions of the two JavaScript invocation rules.

Figure 5 presents the pseudocode to illustrate how to use the two JavaScript invocation rules. Steps 1-45 use the first JavaScript invocation rule to extract the URLs of potential WMSs. Steps 8-15 obtain a string for the URL of a potential WMS by splitting the matched string between the first two commas, because the second argument represents the base URL for a WMS when calling functions in version 2 of the OpenLayers API (OpenLayers 2.x). Steps 16-21 also obtain a string for the URL of a potential WMS by splitting the matched string between the string "url:" and the first comma, because the value of the key "url" represents the base URL for a WMS when calling functions in version 3 of the OpenLayers API (OpenLayers 3.x). Similarly, a string for the URL of a potential WMS is obtained by splitting the matched string between the first left parenthesis and the first comma in Steps 22-27, because the first parameter represents the base URL for a WMS in ArcGIS API for JavaScript functions, Leaflet, and Mapbox.js. Steps 28-31 indicate that the extracted parameter is the URL of a potential WMS if it matches a URL format. When the extracted parameter is a JavaScript variable, a new function that potentially returns the URL of the WMS will be generated and executed by the Jurassic JavaScript parsing engine by calling the CallGlobalFunction function [60], as shown in Steps 32-43. In these steps, the syntax for the new function is composed of the JavaScript variable and a return statement, as shown in Step 38. Steps 47-53 illustrate how to apply the second JavaScript invocation rule to extract the URLs of potential WMSs.
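The argument-splitting steps can be illustrated with a small Python sketch. It is not the pseudocode of Figure 5: the function names and the `style` labels are hypothetical. Where the paper executes a generated function with the Jurassic engine to resolve a JavaScript variable, the sketch substitutes a plain search for a string assignment, which handles only the simplest case.

```python
import re
from typing import Optional

URL_FORMAT = re.compile(r"^https?://\S+$")


def raw_argument(matched: str, style: str) -> str:
    """Cut the base-URL argument out of a string matched by the first rule."""
    if style == "openlayers2":
        # Second argument: the text between the first two commas.
        return matched.split(",")[1]
    if style == "openlayers3":
        # Value of the "url" key: the text between "url:" and the next comma.
        return matched.split("url:", 1)[1].split(",", 1)[0]
    if style == "first_param":
        # First argument (ArcGIS API for JavaScript, Leaflet, Mapbox.js):
        # the text between the first left parenthesis and the first comma.
        return matched.split("(", 1)[1].split(",", 1)[0]
    raise ValueError(f"unknown style: {style}")


def resolve(argument: str, js_code: str) -> Optional[str]:
    """Return the URL if the argument is a URL literal, else try to resolve it as a variable."""
    value = argument.strip().strip("\"'")
    if URL_FORMAT.match(value):
        return value
    # Stand-in for evaluating the variable with a JavaScript engine:
    # look for a direct assignment of a quoted URL to that variable name.
    assignment = re.search(
        r"\b" + re.escape(value) + r"""\s*=\s*["'](https?://[^"']+)["']""", js_code
    )
    return assignment.group(1) if assignment else None


ol2 = 'OpenLayers.Layer.WMS("Land cover", "http://example.org/wms", {layers: "lc"})'
ol3 = 'new ol.source.TileWMS({url: "http://example.org/ows", params: {LAYERS: "lc"}})'
leaflet = 'L.tileLayer.wms(wmsUrl, {layers: "lc"})'
page_js = 'var wmsUrl = "https://example.org/geoserver/wms";\n' + leaflet

print(resolve(raw_argument(ol2, "openlayers2"), page_js))
print(resolve(raw_argument(ol3, "openlayers3"), page_js))
print(resolve(raw_argument(leaflet, "first_param"), page_js))
```

Splitting on commas, as described for Figure 5, is fragile when a layer name or URL itself contains a comma, and a variable whose value is assembled at run time can only be resolved by actually executing the code, which is the reason a JavaScript engine is used in the paper.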

Figure 5. Pseudocode of the use of the JavaScript invocation rules.

Figure 6. The framework of the focused deep web crawler for active discovery of LCWMSs.

Figure 7. The LCWMS-SE architecture showing modules and linkages.

Figure 8. The user query interface to display land cover web map services.

Figure 9. Spatial distribution and numbers of the discovered LCWMSs.

Figure 10. The user interface that displays LCWMSs for the case study.

Figure 11. Interface displaying the integration of land cover web map services.

Table 1. Reference formats of the four WMS-related JavaScript libraries.

Table 2. Invocation functions and the formalization of the first rule.

Table 4. The eight seed URLs used for the first experiment.

Table 6. Land cover product names.
This experiment ran intermittently from 16 June 2015 to 11 November 2015. A total of 17,874 available WMSs, including 11,901 LCWMSs, were discovered, as shown in Table

Table 7. Comparison of the number of web map services (WMSs) found by various methods.

Table 8. Number of LCWMSs for each class and the average ratio of real LCWMSs.
