A Parallel and Optimization Approach for Land-Surface Temperature Retrieval on a Windows-Based PC Cluster

Abstract: Land-surface temperature (LST) is a very important parameter in the geosciences. Conventional LST retrieval is based on large-scale remote-sensing (RS) images, where split-window algorithms are usually employed via a traditional stand-alone method. When using the Environment for Visualizing Images (ENVI) software to carry out LST retrieval on large time-series datasets of infrared RS images, the processing time taken by traditional stand-alone servers becomes untenable. To address this shortcoming, cluster-based parallel computing is an ideal solution. However, traditional parallel computing is mostly based on the Linux environment, while the LST algorithm developed within the ENVI Interactive Data Language (IDL) can only be run in the Windows environment in our project. To address this problem, we combine the characteristics of LST algorithms with parallel computing, and propose the design and implementation of a parallel LST-retrieval algorithm using the message-passing interface (MPI) parallel-programming model on a Windows-based PC cluster platform. Furthermore, we present our solutions to the performance-bottleneck and fault-tolerance problems encountered during the deployment stage. Our results show that, by improving the storage system and network of the parallel environment, one can effectively solve the stability issues of the parallel environment for large-scale RS data processing.


Introduction
Land-surface temperature (LST) is a very important parameter in the geosciences: it plays a fundamental role in land-atmosphere interaction and is a key parameter in global environmental change in terms of its effect on global hydrology, ecology and biogeochemical processes [1][2][3]. LSTs also play an important role in the fields of surface radiation and energy balance, drought monitoring, global and regional climate-change analysis, ecosystem modelling, surface-water thermal simulation, etc. [4][5][6][7]. In addition, remotely sensed LST is widely used in studying the surface urban heat island (SUHI): it can be used to assess the SUHI effect [8] and to find correlations between LST and SUHI, e.g., the exact impact of temporal aggregation on LST and SUHI [9]. Liu et al. demonstrated that both biophysical and building-wall characteristics significantly influence the spatiotemporal variations of LST [10]. As a result, the question of how to obtain LST data economically and efficiently is of great interest to many disciplines in the natural sciences.
LST retrieval via remote-sensing (RS) methods can greatly improve the range of measurement and reduce the amount of work otherwise involved. Indeed, LST retrieval is a popular focus of current RS research. Spaceborne RS instruments that obtain high-precision LST data have been used in quantitative RS research since the 1980s [11]. At present, the RS data applied to LST retrieval are provided by several instruments, including the Thematic Mapper/Enhanced Thematic Mapper Plus (TM/ETM+), the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), the Moderate-Resolution Imaging Spectroradiometer (MODIS), and the Advanced Very High Resolution Radiometer (AVHRR).
At the same time, split-window algorithms (SWAs) for retrieving LSTs from satellite thermal infrared RS data, single-window/single-channel algorithms, and temperature/emissivity separation algorithms have been established [12][13][14][15]. However, most split-window algorithms are aimed at processing National Oceanic and Atmospheric Administration (NOAA) AVHRR data, and a long time series of LST products is lacking. In addition, the coefficients of the split-window algorithm are mostly "local", i.e., the algorithm coefficients depend on specific research areas and sensors. Therefore, our team members performed a large-scale radiation-transmission simulation in which they fitted and tested more than 10 kinds of commonly used split-window algorithms. In their analysis, they selected nine split-window algorithms with high precision, low sensitivity and practicability, and built an integrated algorithm that used Bayesian model averaging (BMA) [16].
Incorporating global LST retrieval with an integrated algorithm based on the BMA model often involves a large number of long time-scale RS data calculations. To perform these calculations, the Interactive Data Language (IDL) is used, and several components have been specifically developed for the Environment for Visualizing Images (ENVI) by the Exelis Visual Information Solutions (VIS) corporation. The traditional stand-alone IDL program is not capable of such tasks [17,18], and as the amount of RS data increases, it is difficult to process the data quickly with IDL programs in stand-alone environments. Thus, Exelis VIS offers the ENVI Services Engine (ESE) to provide IDL and ENVI image-processing services in a cluster or cloud-computing environment [19]. However, for real-world applications, an RS image-processing program usually contains one or more algorithms; implementing a parallel algorithm within the IDL program takes a lot of development time. Using ESE does save on development time, but commercial products are expensive and can be a financial burden. Fortunately, the emergence and application of different high-performance computing (HPC) technologies, e.g., cluster-based parallel computing [20][21][22][23][24], graphics processing unit (GPU) based computing [25][26][27][28][29], and cloud computing [30][31][32][33][34][35][36], makes large-scale data processing possible, which can greatly improve processing efficiency.
HPC systems are usually based on Linux, whereas existing LST-retrieval software needs IDL and other related, dependent environments. In upgraded versions of the IDL language environment, the latest version of ENVI is either unavailable or troublesome to deploy on Linux platforms. On the other hand, ENVI works well under the Windows operating system, and the widespread use of Windows makes it easier to build clusters from existing decentralized resources, which saves costs and improves resource utilization.
There are a few existing studies that have used Windows-based clusters to develop parallel algorithms or for parallel data processing. For example, Pan et al. used spare computers to form a "private cloud" in order to achieve a multi-computer parallel-computing framework for application in geophysical exploration [37]. Zheng used a Windows-based Beowulf cluster and the message-passing interface (MPI) to study the application of parallel computing to seismic-damage analysis with RS images [38]. Moreover, Tie et al. applied parallel-computing technology and MPI in a Windows cluster to batch-process RS images, and proposed a simple parallel scheme for LST retrieval [39].
Unfortunately, that investigation only outlined the basics of the parallel-processing method, leaving issues such as storage and performance bottlenecks and fault tolerance unaddressed.
In summary, to meet the demands of LST retrieval from thermal infrared data and improve the efficiency of the retrieval, our aim is to use MODIS time-series RS data for LST retrieval. We propose a parallel algorithm for LST retrieval at the process level using an MPI parallel-programming model, which enables us to perform high-performance LST retrieval in a distributed-memory environment under the Windows operating system. Furthermore, our research aims to solve the aforementioned, unaddressed problems present in previous studies.
In addition to the LST parameter in global environmental and climate change, there are other surface-temperature (ST) parameters such as sea-surface temperature (SST), lake-surface temperature, etc. The large-scale retrieval of these ST parameters also involves the kind of large time-series RS data considered here. Compared with the retrieval of LSTs, research on SST retrieval has existed for a few decades. In 1980, McMillin pioneered the split-window approach to calculate SSTs on the basis of the 4th and 5th channels of AVHRR [40]. In 1985, McClain et al. proposed a linear multi-channel sea-surface temperature method [41]. In 1999, Kumar et al. proposed the Miami Pathfinder SST algorithm, which is suitable for processing MODIS data [42]. In 2004, Qin proposed a single-window method, specifically for the TM6 thermal infrared channel, which can also be applied to SST retrieval [43]. In recent years, many scholars have also studied the various factors that influence the SST-retrieval procedure [44,45].
Meanwhile, lake-surface temperatures are monitored by satellites, and numerous related works have been published in recent years. For instance, Livingstone used an RS image-retrieval method to reveal the relationship between the temperature of a lake in Australia and the local climate over a period of 80 years [46]. Pour et al. used MODIS images to retrieve the temperature of frozen Arctic lakes [47]. Moreover, Woolway et al. showed the relevance of lake-surface temperatures to LST retrieval, as the former can be compared with LSTs, which is common in climatology [48]. In fact, all these algorithms have been improved and applied to lake-surface temperature retrieval on the basis of LST- or SST-retrieval algorithms [49]. However, the parameters of different types of minerals vary from region to region and from lake to lake. Thus, the parameters required to model one type of environment are often incompatible with modelling similar environments in different areas. Therefore, how to select more suitable parameters, and hence make lake-surface temperature retrieval more accurate, has been a difficult obstacle in these studies. As such, our designed approach, which adopts HPC for global LST retrieval using large-scale RS data processing, will hopefully be beneficial to the field, and can be applied to these and other ST-retrieval applications.

Background and Experimental Data Issue
This work was supported by a National High-Technology Research and Development Program (863) project entitled "Generation and application of global products of essential land variables of global ecological system and surface-energy balance". The entire project aims to use multi-source RS data to produce global products that represent essential land variables, which in turn can provide a database and technical support for researchers making decisions about global change. Our results include: (1) global products of essential land variables for 33 years (from 1982 to 2014), including the leaf-area index, emissivity, surface albedo, and photosynthetically active radiation; and (2) another eight products for four representative years (1983, 1993, 2003 and 2013), which involve downward shortwave radiation, photosynthetically active radiation, LSTs, net long-wave radiation, net radiation (daytime), vegetation coverage, gross primary productivity, and latent heat [24]. Within this big project, our work mainly focuses on creating a new, integrated LST-retrieval algorithm suitable for generating long time series of LST products from RS data, because these products have extremely important practical value for climate-change modelling, surface radiation and regional/global energy balance. In our sub-project, we are required to generate day-to-day LST products for a total of four periods (the years 1983, 1993, 2003 and 2013). The spatial resolution of the data in 1983 and 1993 is 5 km, while that of the others (2003, 2013) is 1 km. To carry out the task, we took the following steps: (1) based on a large-scale radiation-transmission simulation, we selected several LST-retrieval algorithms from the existing split-window algorithms; these selected algorithms have advantages such as high precision, low sensitivity to the initial values of the input parameters, and high practicability; (2) we then built an integrated algorithm with a BMA model, as the BMA method has several advantages related to the integration of surface long-wave radiation models [50] and evapotranspiration model integration [51].
We note that, in the data-processing stage, we need to process four years' worth of data, with ~10 scenes of data for each day. To verify the feasibility of our approach quickly, we needed to experiment with part of the data in advance in order to obtain a complete and reliable workflow for our proposed HPC method. In this study, we chose only one scene per day of data; therefore, we used 64 days of data for the experiments. Because this paper focuses on verifying the feasibility of the method, and it is impractical to demonstrate using the entire project dataset here, the experimental data investigated are only a small part of the data collected by the whole project.

Fundamentals of LST Retrieval
At present, for spaceborne thermal infrared RS data with high temporal resolution that provide two or more thermal infrared bands, the most suitable choice of algorithm for LST retrieval is the split-window algorithm [14]. This algorithm can be used for many different types of data, including NOAA AVHRR, Environmental Satellite Advanced Along-Track Scanning Radiometer (ENVISAT AATSR), Terra/Aqua MODIS, Suomi National Polar-orbiting Partnership Visible Infrared Imaging Radiometer Suite (NPP VIIRS), and so on. The SWA is mainly based on the difference in atmospheric effects between two adjacent thermal infrared bands. Since the 1970s, the academic community has proposed dozens of SWAs. These algorithms have great similarity in form, and their general form can be summarized as:

T_s = A_0 + A_1 T_11 + A_2 (T_11 − T_12),

where T_s is the land-surface temperature; T_11 and T_12 are the channel brightness temperatures in channels 11 and 12, respectively; and A_i (i = 0, 1, 2) are the algorithm coefficients.
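The general form above is a simple linear combination, which can be sketched as a small C helper. The coefficient values used in the example below are arbitrary placeholders for illustration, not the fitted coefficients from our algorithm library:

```c
#include <assert.h>
#include <math.h>

/* Generalized split-window form: Ts = A0 + A1*T11 + A2*(T11 - T12).
 * t11, t12 are the channel brightness temperatures (K); the a0..a2
 * coefficients are placeholders, not fitted values from the paper. */
static double swa_lst(double t11, double t12,
                      double a0, double a1, double a2)
{
    return a0 + a1 * t11 + a2 * (t11 - t12);
}
```

In a real retrieval, the coefficients are selected per algorithm (and per sensor/region), which is exactly why the "local" coefficient problem motivates the BMA integration described earlier.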
The key to successfully using an SWA is to determine the appropriate input parameters and their values. The input parameters vary, and they have different levels of complexity. Some algorithms only require the channel brightness temperatures, while other algorithms need external parameters such as air vapour content, surface emissivity, near-surface temperature and land-cover type.
The strategy of building the LST-retrieval algorithm with different data sets is as follows: (1) select a variety of widely used SWAs to construct an algorithm library, where the chosen algorithms have their own forms and different input parameters; (2) build a representative atmospheric-profile database, carry out a radiation-transmission simulation, build training data sets, and then check the sensitivity of the data sets and analyze them; (3) determine the algorithm coefficients by using the simulated data set, determine the optimal algorithm (or combination of algorithms), and carry out the sensitivity analysis of the algorithm; and (4) based on real RS data, determine the calculation method for the input parameters of the algorithm.
In this paper, the global LST-retrieval algorithm, which is based on an SWA, was applied to MODIS data as an example; the entire technical roadmap of LST retrieval is shown in Figure 1. In addition, Figure 2 provides the detailed processing steps of the SWA.
From Figure 2, the processing procedure can be divided into three steps: (1) data indexing; (2) data matching; and (3) data inversion. The entire program starts with the MOD02KM data, and then assigns the data to be processed in turn. To match the other data sets against the MOD02KM data, the program obtains information from the data-matching process, such as the radiance brightness, longitude and latitude, and so on. After the SWA finishes running, we write the retrieved LSTs into hierarchical data file (HDF) format.

Parallel Design and Implementation of the Global LST-Retrieval Algorithm
As IDL does not support distributed computing, in order to implement cluster-based parallel computing we used other languages to enable inter-process communication. One approach is to use IDL programs to call encapsulated MPI communication functions. Another is to call IDL programs directly from other languages; the latter is currently common practice in the field. In our implementation, we used the C language to develop MPI programs for the underlying communication, while invoking IDL programs to process the data. The parallel program uses the MPICH2 library [52], which is a popular implementation of the MPI standard, to support multi-node parallel processing. By separating the MPI-based communication programs from the IDL-based data-processing programs, we make the whole development process more suitable for multidisciplinary collaboration and easier to manage.

Parallelization of the Serial Algorithm
As the serial algorithm was designed for stand-alone environments, its design and operation were optimized for the working environment of a single computer. It is therefore necessary to refactor the serial program and adapt it to the distributed computing environment. This transformation includes path modification of the original IDL program and its input data, as well as the modularization of the IDL program itself. These steps are described as follows.
(1) Path Modification
For LST retrieval, the absolute paths of the input and output data need to be reconstructed first. By taking advantage of the application-programming interface of the MPI, we modified the input and output paths of the data and made the parameters suitable for that interface.

(2) Serial Algorithm Modularization
To run LST retrieval in parallel on a Windows-based cluster, we eliminated the traditional IDL graphical interaction required at run time, as well as the IDL console interface, to improve the parallelization efficiency. We packaged the IDL programs in the "sav" file format and utilized the "cmd" command to call the IDL runtime executable "idlrt.exe" to run the programs in the "sav" file. Thus, the packaged programs run without the interactive IDL environment. By calling only one executable via "cmd", together with the settings of the relevant environment and parameters, this approach greatly reduces the burden involved in running IDL programs. This lays the groundwork for the parallel optimization of the serial algorithm.
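As a sketch, the launch command for a packaged "sav" routine can be assembled in the C driver before being passed to the Windows shell. The install path and "sav" file name below are hypothetical examples, and the exact invocation syntax may vary by IDL version:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the "cmd"-style command line used to invoke the IDL runtime
 * on a packaged "sav" file. Both paths are quoted so that Windows
 * paths containing spaces survive shell parsing. The resulting string
 * would typically be handed to system() or CreateProcess(). */
static int build_idl_cmd(char *buf, size_t n,
                         const char *idlrt_path, const char *sav_path)
{
    return snprintf(buf, n, "\"%s\" \"%s\"", idlrt_path, sav_path);
}
```

In our setup, each worker process constructs one such command per assigned data slice, so the MPI layer never needs to link against IDL itself.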

Design and Implementation of the Parallel Algorithm
The primary parallelization strategy for IDL programs is data parallelism. One can divide the RS data to be processed into units of days. By organizing the configuration (input) file of the daily data, the master process can divide the data and assign them to individual worker processes. In general, the master process obtains the overall information about the data to be processed, and then assigns the input file for the sliced daily data to each worker process. Each worker process then obtains its respective data file, reads the data that need to be processed, processes them, and obtains the result. The process is shown in Figure 3. The detailed parallelization procedure is elaborated as follows: (1) the front node (Master for short) initializes the running environment, obtains the path of the data, and resolves the specific tasks that will be distributed to the computing nodes (also called workers). In the implementation, we assign one Master process and one or more worker processes depending on the runtime option from the user. Only one worker process is shown in the diagram.
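The master's day-slicing step can be sketched as a contiguous block assignment of days to workers. This is an illustrative version of the idea (with any remainder days spread over the first workers), not the paper's exact implementation:

```c
#include <assert.h>

/* Contiguous block assignment of n_days among n_workers.
 * Worker w (0-based) receives days [*start, *start + *count).
 * The first (n_days % n_workers) workers each take one extra day,
 * so the load differs by at most one day between workers. */
static void assign_days(int n_days, int n_workers, int w,
                        int *start, int *count)
{
    int base = n_days / n_workers;   /* days every worker gets */
    int rem  = n_days % n_workers;   /* leftover days */
    *count = base + (w < rem ? 1 : 0);
    *start = w * base + (w < rem ? w : rem);
}
```

With 64 days and six workers, for example, the first four workers process 11 days each and the last two process 10 each; the master only needs to send each worker its (start, count) pair and the path of the corresponding daily configuration files.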

System Environment Configuration
We used PCs hosted in our school's laboratory and connected them into a Windows PC cluster. We shared a disk partition at the front node for storage and provided data access through the shared folder. The hardware configuration of each node is shown in Table 1. The software environment is primarily configured for running MPI and ENVI IDL programs; the software configuration of the specific nodes is shown in Table 2.

Performance Evaluation Metrics
To evaluate the performance of a parallel algorithm, the speed-up metric is usually used [53]:

S = T_s / T_p,

where T_s is the run time of the serial program and T_p is the run time of the parallel program.
The speed-up metric indicates how fast the parallel program is compared to the serial one. The greater the speed-up, the better the parallel program.
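The metric is trivial to compute; as a minimal sketch, a serial run of 70 time units reduced to 10 units in parallel gives a speed-up of 7 (dividing the speed-up by the process count gives the parallel efficiency):

```c
#include <assert.h>

/* Speed-up S = Ts / Tp, where Ts is the serial run time and Tp the
 * parallel run time measured on the same workload. */
static double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

/* Parallel efficiency E = S / p for p processes. */
static double efficiency(double s, int n_procs)
{
    return s / (double)n_procs;
}
```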

Experimental Results and Analysis
In our experiments, we selected 64 days of MODIS data from 2013. Based on the number of cores and the memory limitation per node, we were limited to four processes per computing node, and we recorded the total time for different numbers of machines and processes. Table 3 shows the time consumed by single and multiple nodes, where each node contains two and four MPI processes, respectively. When we further increased the number of computing nodes, or the number of MPI processes per node, the processing procedure became unstable. It is possible that errors occur when too many MPI processes simultaneously read the same file from the storage node across the network. As shown in Table 4, we calculated the achieved speed-up for each case. From Tables 3 and 4, we can see that when the number of MPI processes on a single node is the same, the processing time gradually decreases and the acceleration ratio increases as the number of computing nodes increases. However, in the four-nodes-four-processes case, the improvement in elapsed time is reduced. By comparing the speed-up of the four-node case with the five-node and six-node cases, we can observe that when the total number of processes exceeds eight, simply increasing the number of processes per node has less effect on the speed-up. Considering the hardware used in the tests, we expect that the network bandwidth was saturated in cases with more than 12 MPI processes. The results are also shown in Figure 5. An increase in the number of MPI processes causes an increase in the number of reads and writes to the storage node, which decreases the computing performance of the nodes when the total available bandwidth cannot meet the algorithm's requirements.
From the overall experimental results, this method has been demonstrated to work in a full Windows 7 operating-system environment. When one computer is used as a data-storage node and six more as computing nodes (two MPI processes per node) for distributed computing, the overall performance is improved by almost a factor of seven compared to the traditional serial program running on a single machine.

Optimization Approaches of the Parallel LST Retrieval Algorithm
Based on the foregoing basic experiments, we achieved very good performance with a PC cluster running the Microsoft Windows 7 operating system. However, we noticed two problems. (1) When we increased the number of computers for parallel LST retrieval, we found that the entire system's operating environment became less stable, and the computers were unable to find the assigned task files. As Windows 7 was used in all the experiments, data transfer between the storage node and the computing nodes relied on Windows 7 network file sharing over the local area network. Such data sharing can only support four or five computing nodes (four MPI processes per node) for parallel computing; once the number of computing nodes goes beyond this, the entire system environment will either stall or collapse. This constraint fundamentally limits the number of computers involved in the deployment of parallel-computing applications on PC clusters. (2) According to the performance analysis of the basic experiment, more MPI processes (e.g., four nodes with four processes) produced a lower speed-up than fewer MPI processes (e.g., five nodes with two processes). Meanwhile, when the number of MPI processes reached 12, increasing the number of MPI processes further had little effect on the speed-up. As explained at the end of the previous section, we expect that the network bandwidth was saturated in those cases, and that the total available bandwidth could not meet the requirements of our application.
Based on the foregoing analysis, in order to resolve these problems, we modified the computing environment, especially the storage node, and improved the network connectivity.

Improvement Approach One: Modifying the Software Configuration of the Storage Node
As the Windows 7 operating system restricts the maximum number of connections to a storage node, we decided to use a different operating system. One option was a Windows server operating system, which is often used for enterprise-class management systems and can serve many users at the same time for data requests, sends, retrievals, and hard-disk access [54]. On the other hand, the Linux operating system is the de facto standard for high-performance computing. Since Linux cannot share data directly with Windows, we used the Samba file service in Linux to enable data exchange with Windows [55].
We tested the Windows server system and the Samba server in Linux, respectively, on the storage node using 64 days of data. In the experiments, we set the maximum number of MPI processes per compute node to four, tried 32-process and 64-process runs, and recorded the total time consumed. Figure 6 shows a diagram of the connections among the nodes. The number of computing nodes was either eight or 16 in our experiments. The node configurations of the two cases are shown in Tables 5 and 6, respectively.
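For the Linux storage node, a minimal Samba share definition of the kind used for this purpose is sketched below; the share name, path and user are hypothetical examples, and the real deployment would also tune connection and locking options:

```ini
; Hypothetical smb.conf fragment exposing the RS data partition
; to the Windows computing nodes over the LAN.
[rsdata]
   path = /srv/rsdata
   browseable = yes
   read only = no
   guest ok = no
   valid users = lstuser
```

Windows computing nodes can then map the share (e.g., to a drive letter) so that the IDL programs read input and write HDF output through the same paths as before.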

In Case 1, we tested the performance of the Windows server installed on the storage node, while in Case 2 we tested the Linux Samba server. The software environment was primarily configured for running MPI and IDL programs. The configurations shown in Table 6 were used, and our experimental results are displayed in Table 7. From Table 7, it can be seen that the system-crash issue caused by excessive file-server visits has been solved: all the experiments finished successfully. By comparing the results for the two storage configurations, we observed that the Linux Samba server outperformed the Windows server when the number of MPI processes is large, whereas the Windows server was slightly better when fewer processes were used.

Improvement Approach Two: Effects of Network Transmission Rates
In our parallelization method, an increase in the number of MPI processes results in increased synchronization of data acquisition and transmission. We wished to verify that the amount of data that needs to be transferred per unit of time increases with the number of MPI processes, and to determine whether the network bandwidth of our previous configuration failed to meet the computational requirements, thus limiting the increase in the speed-up. In this experiment, we used 100 Mbps and 1 Gbps network configurations to carry out the comparison.
The configuration of the experimental setup is the same as that given in Section 4.1. No other configurations were changed in this experiment, except for the network. The system connection diagram is shown in Figure 7. We used the same PC cluster with different networking devices (see Tables 8 and 9) to process the data from 32 days, with four MPI processes per node. In Case 3 we used the 100 Mbps network, while in Case 4 we used the 1 Gbps network. Consistent with the aforementioned experiments, we obtained the experimental results of these cases (Tables 10 and 11; Figures 8 and 9, respectively). The performance of the parallel implementation is greatly improved when the network configuration is upgraded (Figure 8). Combined with Figure 9, we find that a faster network transmission speed results in a better speed-up. This also explains why the obtained speed-up of four nodes with four MPI processes is much lower than that of five nodes with two MPI processes (i.e., Table 4). The network transmission speed affects the program's ability to process data in real time; especially when a large amount of data needs to be handled, the impact of network speed cannot be ignored. However, the achieved speed-up will change only slightly with increased bandwidth once the parallel algorithm reaches its optimal theoretical performance.
In summary, the experimental results show that either the Windows server or the Linux Samba server can be a good solution to resolve the limitation in the number of computing nodes in a Windows-based PC cluster with a shared file server.Additionally, an optimized and upgraded network configuration can also speed up the system.

Conclusions and Future Directions
Considering that most of the coefficients of SWAs have "local" characteristics, this paper uses a BMA model to build an integrated algorithm on top of the traditional serial-retrieval program, which forms the basis of an MPI-based parallel LST-retrieval algorithm. Our experimental results show that the parallel algorithm can effectively shorten the LST retrieval time: by selecting a reasonable number of processes and using four computers, the maximum acceleration ratio is close to five. To optimize the parallel algorithm, we improved the network transmission and the data storage in the implementation by: (1) analyzing the bottleneck (i.e., the network configuration) in the workflow and upgrading the networking devices; and (2) analyzing the system configuration of the data-storage node through a comparison between a Windows server and a Linux Samba server as replacements for the Windows 7 operating system on the storage node. Our experimental results show that parallel-task processing time can be significantly reduced by upgrading the networking devices, and the acceleration ratio is significantly enhanced. Using either the Windows server or the Linux Samba server to replace the existing storage-node operating system solves the crashing problem caused by excessive connections from the computing nodes. However, the parallel performance still leaves plenty of room for improvement. Based on this study, further improvements can be made in the following areas: (1) In the experiments, we observed that the speed-up in the 1 Gbps network case is much lower than that in the 100 Mbps case when there are more than 12 MPI processes. The lower speed-up could be caused by the dramatic improvement in the performance of the serial program in the 1 Gbps case.
The greatly reduced running time of the serial program, which serves as the numerator in the speed-up calculation, lowers the speed-up computed for the 1 Gbps case. Further investigation is necessary. (2) In the comparison test with the replaced storage node in Section 5.1, we found that the speed-up tails off as the number of computing nodes increases. We suspect that a single storage node cannot meet the read/write demands of a large number of computing nodes. A distributed file system could replace the single data-storage node to further improve the overall performance of the system.
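Point (1) can be illustrated with a small numeric sketch. The timings below are hypothetical, chosen only to show how a shrinking serial running time (the numerator) depresses the computed speed-up:

```python
def speedup(t_serial, t_parallel):
    """Speed-up with the serial running time as the numerator."""
    return t_serial / t_parallel

# Hypothetical timings: the faster network shortens both runs, but the
# serial run benefits proportionally more, so the reported speed-up drops
# even though the parallel run itself became faster.
print(speedup(2000.0, 400.0))  # slower network
print(speedup(900.0, 250.0))   # faster network: smaller ratio
```

In other words, a lower speed-up after the upgrade need not mean the parallel program got worse; it can simply reflect a much faster serial baseline.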

Figure 2. Flowchart of the LST retrieval algorithm.

Figure 3. Schematic of the parallel algorithm.

(2) The Master broadcasts (assigns) data paths to the computing nodes, and the computing nodes locate their respective task data on the storage nodes according to these paths.
(3) The computing nodes process the data; each computing node executes its own subtask simultaneously. After data processing completes, the computing nodes write the results to the storage nodes and send their status to the Master.
(4) The Master analyzes the messages it receives from the computing nodes and decides whether to continue assigning tasks, based on the status of the computing nodes and the number of remaining tasks.
(5) Steps (2), (3), and (4) are repeated until all tasks are accomplished.
(6) The Master shuts down the running environment and ends the program.

The MPI part of the parallel algorithm contains the following steps: (1) initialize the MPI environment; (2) obtain the rank (identification number) of each process; (3) send a message; (4) receive a message; and (5) terminate the MPI program. Developers can add their own functionality at each of these steps. A flowchart of the implementation is shown in Figure 4.
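The Master/worker loop above can be sketched in Python, using threads and a shared task queue in place of MPI message passing; `process_fn` stands in for the IDL retrieval step. This is a minimal illustration of the scheduling pattern, not the paper's implementation:

```python
import queue
import threading

def master_worker_schedule(task_paths, n_workers, process_fn):
    """Hand out task paths to n_workers workers and collect their results,
    mimicking the Master's assign/collect loop described above."""
    tasks = queue.Queue()
    for path in task_paths:
        tasks.put(path)
    results = queue.Queue()

    def worker():
        while True:
            try:
                path = tasks.get_nowait()          # step (2): receive an assigned path
            except queue.Empty:
                return                             # no tasks remain: worker stops
            results.put((path, process_fn(path)))  # step (3): process, report back

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                                   # steps (4)-(6): wait for completion

    out = {}
    while not results.empty():                     # gather (path, result) reports
        path, result = results.get_nowait()
        out[path] = result
    return out
```

In the real system, each "path" points at one day's infrared RS scene on the storage node, and the status messages let the Master keep every computing node busy until the task pool is empty.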

Figure 4. Workflow of the parallel implementation.

Figure 5. Time consumption versus the number of MPI processes.

Figure 6. Diagram of the optimized parallel environment.

Figure 7. System diagram for testing with different network configurations.

Figure 8. Time consumption in Case 3 and Case 4 for different network configurations.

Figure 9. Speed-up of Case 3 and Case 4 for different network configurations.

Table 2. Software configuration on different nodes.

Table 3. Time consumption in seconds. n is the number of nodes, and P is the number of message-passing interface (MPI) processes running on each node; "null" means the experiment was not completed due to errors.

Table 7. Experimental results for Case 1 and Case 2. Two identical tests (Test #1 and Test #2) were carried out to ensure that no extreme cases appeared in the tests.

Table 10. Case 3 and Case 4 experimental results in seconds.

Table 11. Speed-up for the different network configurations.