Improved Bound Fit Algorithm for Fine Delay Scheduling in a Multi-Group Scan of Ultrasonic Phased Arrays

Multi-group scanning of ultrasonic phased arrays (UPAs) is a research field in distributed sensor technology. Interpolation filters used in fine delay modules can provide high-accuracy time delays during the multi-group scanning of arrays with a large number of elements in UPA instruments. However, increasing focus precision requires a large increase in the number of fine delay modules. In this paper, an architecture in which fine delay modules are shared through time-division scheduling is explained in detail. An improved bound fit (IBF) algorithm is proposed, and an analysis of its mathematical model and time complexity is provided. The IBF algorithm was verified by experiment, wherein the performances of the list (LIST), longest processing time (LPT), bound fit (BF), and IBF algorithms were compared in terms of frame data scheduling in the multi-group scan. The experimental results show that the scheduling algorithm decreased the makespan by 8.76–21.48% and achieved a frame rate of 78 fps. The architecture reduced resource consumption by 30–40%. Therefore, the proposed architecture, model, and algorithm can reduce makespan, improve real-time performance, and decrease resource consumption.


Introduction
Ultrasonic phased array (UPA) technology is an important nondestructive testing method that is widely used in aerospace, shipbuilding, port machinery, and nuclear energy. With its multiple-group scanning functionality and a large number of elements, the multi-group scan UPA system can provide extended scanning flexibility and image contrast, increased focal law diversification, and a high signal-to-noise ratio (SNR). Within the system, the number of filters in a given module determines the precision of the fine delay. The higher the precision, the better the image resolution. Classical all-parallel fine delay modules require a lot of hardware resources, i.e., multipliers, look-up tables (LUTs), and registers (Regs) in a field-programmable gate array (FPGA). Synchronization and integration difficulty need to be considered in multi-chip schemes, while hardware resources are limited in single-chip schemes. Therefore, an architecture with time-division multiplexing is used to schedule frame tasks between fine delay modules in a single chip. This method can significantly improve resource utilization and reduce the number of resources used. However, when the sampling depth or the number of focal laws is large, the frame rate (frames per second, fps) decreases, leading to worse real-time performance of the distributed UPA instrument and a greatly reduced application scope. Therefore, it is necessary to coordinate fine delay modules and frame tasks for multi-group scanning through scheduling algorithms, minimize idle time slots in the fine delay modules, and reduce the makespan of all frame tasks to improve time performance.
In this paper, a fine delay scheduling architecture is analyzed considering the diversity of multi-group-scan echo data. The scheduling problem is modeled as non-preemptive, and the IBF algorithm is proposed for its optimization.
The paper is organized as follows. In Section 2, the architecture of the fine delay module scheduling for the multi-group scanning of UPA systems is presented, and the multi-group scan problem is explained. In Section 3, the IBF algorithm is proposed and an analysis of its performance and time complexity is provided. LIST, LPT, BF, and IBF algorithms are compared in Section 4. Finally, a conclusion is provided in Section 5.

Fine Delay Scheduling Principle
The delay method and focus scheduling based on different UPA instrument focal parameters (e.g., number of apertures, sending and receiving time, and data amount), which control the pulse repetition frequency (PRF) and frame formation, are used for scheduling in multi-group scans. The delay precision is 1.25 ns. Due to the limited resources of the FPGA in our experiments, the system architecture is designed with four groups and two fine delay modules. Each group has eight channels, and each channel has a 10-bit analog-to-digital converter (ADC). In each group, the sampling depth is 2–8 K, the number of focal laws is ≤128, and the read parameter length is 1024. The design frame rate is not less than 24 fps, which meets the requirements of real-time display.
A diagram of the fine delay module for multi-group scanning is shown in Figure 1; labels ①–⑤ in Figure 1 are described below.
Sensors 2019, 19 FOR PEER REVIEW
Figure 1. Diagram of the fine delay module for multi-group scanning.
The presented block diagram includes the following parts:
(1) High-speed multi-channel ADC module (HADC): Ultrasonic echo signals are subjected to high-speed multi-channel ADC acquisition, conditioning conversion, and transformation into low-voltage differential signaling (LVDS) serial signals. They are then fed to the FPGA for further processing. ADCs are divided into groups according to the probe socket and multi-group scan.
(2) Fine delay scheduling module (FDS): The LVDS serial signal is first converted into a parallel signal, then the parallel signal generated by the IP core is sent to the multi-channel first-in first-out memory (FIFO), which is used for buffering and scheduling. The scheduling module consists of several fine delay modules. The signal buffered in the FIFO is then fed to the scheduling module, where it is forwarded to different fine delay modules. Thus, time-division multiplexing is achieved.
The fine-delay module used in this study contains the multi-level half-band filter that was proposed by Liu and Tang [17]. A diagram of the multi-level half-band fine delay filter is presented in Figure 2, whereas its simulation diagram created in ModelSim (Mentor Co., Ltd., Wilsonville, OR, USA) is shown in Figure 3.

times Interpolation
Multi-level half-band filter . . .  The multi-level half-band fine delay filter uses the interpolation method with eight time intervals to design a half-band filter. The implementation of synthetic technology in the multi-level half-band interpolation filter results in filter decomposition into eight sub-filters. Simultaneously, interpolation with poly-phase decomposition is achieved. The eight filters delay the original signal for 0, 1.25, 2.5, 3.75, 5, 6.25, 7.5, and 8.75 ns. The data samples have a 10-bit length, and thus two 9-bit multipliers are needed for multiplications. However, the multi-level half-band filter uses six 9-bit multipliers. In addition, each channel has eight fine delay channels, so there are 96 (i.e., 6 × 2 × 8 = 96) 9-bit multipliers. If all parallel delay is used in a 256-element UPA system, then 24,576 multipliers would be needed. Given such large resource consumption, the integration of a single FPGA in the multigroup scan module of a UPA system would be difficult.
(3) Coarse delay and sum module (CDS): Coarse delay is based on counter clock delay technology. All the relative delay parameters of focal laws, calculated by a PC, can be loaded from the "delay and scheduling parameters storage" block in Figure 1. The double data rate 3 (DDR3) synchronous dynamic random access memory input signal addresses the corresponding coarse delay parameter counted by the clock, and thus a fixed integer coarse delay is achieved. The sum module merges signals processed by the fine delay and coarse delay blocks into an ultrasonic digital beam, which represents the complete beamform of the focal laws. All signals of the ultrasonic digital beam are stored in memory, and all signal groups form a corresponding beamform. In other words, each focal law forms a digital beamform, and all the beamforms of the same group generate the initial image information of that group.
(4) External DDR3: Since the internal RAM capacity of the FPGA is insufficient, a DDR3 controller with two DDR3 memories is used for coarse delay data storage. The DDR3 memory holds the coarse delay data, which are read by the group focus module according to the group.
(5) Delay and scheduling parameters storage (DSPS): The DSPS is a large-scale storage block in the FPGA. The delay and scheduling parameters are calculated using a focal law calculator in the PC, corresponding to the input data entered by the user. The DSPS contains a scheduling table, the pulse repetition frequency of each group, and the time delay parameters for both fine and coarse delays according to the focal laws. It also includes algorithmic control for scheduling the Mux and Demux based on the above parameters.
A fine delay scheduling model diagram in the multi-scan group is presented in Figure 4.
Figure 4. Fine delay scheduling model diagram in the multi-scan group.
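The multiplier arithmetic quoted for the all-parallel design can be checked with a short sketch. The constant and function names below are illustrative assumptions and do not come from the paper; the figure of 532 available 9-bit multipliers is the value stated for the FPGA in the resource consumption section.

```python
# Multiplier budget of an all-parallel fine delay design, using the figures
# quoted in the text. All names here are illustrative, not from the paper.
MULTIPLICATIONS_PER_FILTER = 6   # multiplications in one multi-level half-band filter
MULTS_PER_MULTIPLICATION = 2     # two 9-bit multipliers per 10-bit sample product
FINE_DELAY_CHANNELS = 8          # sub-filters (fine delay taps) per channel
DELAY_STEP_NS = 1.25             # fine delay precision
AVAILABLE_9BIT_MULTS = 532       # 9-bit multipliers stated for the FPGA


def multipliers_per_channel() -> int:
    """9-bit multipliers needed by one fully parallel channel."""
    return (MULTIPLICATIONS_PER_FILTER
            * MULTS_PER_MULTIPLICATION
            * FINE_DELAY_CHANNELS)


def multipliers_all_parallel(num_elements: int) -> int:
    """9-bit multipliers needed when every element has its own fine delay."""
    return num_elements * multipliers_per_channel()


def fine_delay_taps_ns() -> list:
    """Delays produced by the eight sub-filters, in nanoseconds."""
    return [k * DELAY_STEP_NS for k in range(FINE_DELAY_CHANNELS)]


if __name__ == "__main__":
    print(multipliers_per_channel())        # 96
    print(multipliers_all_parallel(256))    # 24576
    print(fine_delay_taps_ns())
    print(multipliers_all_parallel(256) > AVAILABLE_9BIT_MULTS)  # True
```

The last line illustrates why a 256-element all-parallel design does not fit in the single device used in the experiments, which motivates sharing fine delay modules by scheduling.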

Fine Delay Scheduling Problem in Multi-Group Scanning
The parameters of the fine delay module for multi-group scanning of UPAs are presented in Table 1. Here, we represent the symbols used in the scheduling problems with brackets.
Table 1. Parameters of the fine delay module for multi-group scanning of an ultrasonic phased array (UPA) system (columns: Symbol, Parameter).
Fine-delay scheduling for multi-group scanning of UPAs must satisfy four conditions:
(1) Each focal law must be separately processed in fine delay modules. In other words, one fine delay module must process only one focal law datum.
(2) The process cannot be interrupted or preempted, i.e., a no-interrupt non-preemptive (NINP) model is adopted.
(3) There is no time gap between the start time of the focal law and the start time of the pulse repetition period.
(4) The sample depth is less than the pulse repetition period.
Condition (1) avoids timing confusion, condition (2) avoids interruption of the fine delay signal processing, and condition (3) compacts the frame task for scheduling and decreases the time slot waste. Condition (4) ensures that the fine delay processing does not exceed its capacity, which would otherwise lead to echo data overlap.
Before a description of the fine delay scheduling problem is presented, some parameters must be defined:
Definition 1. Frame task.
If it is assumed that the $i$th scan has focal law frame $N^{i}_{FocalLaw}$ and sample depth $D^{i}_{Sample}$, then the frame task is the time needed to complete all beamforms (or focal laws) of the image.
Definition 2. Frame task deadline.
The frame task deadline represents the time the system needs to generate a complete image for all groups, and it must be less than 1/24 s for real-time applications.
Schematic diagrams of the frame task and frame task deadline are presented in Figure 5a,b, respectively.
A processing time, $t^{i}_{p}$, and an end time, $t^{i}_{d}$, are defined for the frame task of the $i$th scan. Therefore, the problem can be stated as $P_m \| C_{\max}$. The scheduling model minimizes $C_{\max}$ (Equation (4)) subject to the constraints in Equations (5)–(8), which include the assignment constraints

$$\sum_{i=1}^{m} x_{ij} = 1, \quad j = 1, 2, \ldots, n,$$

$$x_{ij} \in \{0, 1\}, \quad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, n.$$

Equation (4) refers to the scheduling goal of minimizing the project's maximum completion time, which represents the time needed for the completion of all project tasks. In this paper, we consider the frame task as the job or task of the scheduling problem. According to Equation (5), the time allocation of each fine delay module cannot be greater than $t_d$. Equations (6) and (7) show that any task can be assigned only to one processor, and $x_{ij}$ is an assignment variable that is equal to zero or one. Equation (8) states that all frame tasks must be finished before the frame task deadline.
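As a minimal sketch of how the model above can be evaluated for a given assignment, the following function checks the assignment constraints and computes the module loads and the makespan. The function name, the list-of-lists layout of `x`, and the example numbers are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def evaluate_assignment(x: Sequence[Sequence[int]],
                        p: Sequence[float],
                        t_d: float) -> Tuple[List[float], float, bool]:
    """Evaluate an assignment matrix x (m modules by n tasks).

    x[i][j] == 1 means task j runs on fine delay module i.
    Returns (module loads, makespan, deadline_met).
    """
    m, n = len(x), len(p)
    if any(len(row) != n for row in x):
        raise ValueError("every row of x must have one entry per task")
    for j in range(n):
        column = [x[i][j] for i in range(m)]
        # Each x_ij is 0 or 1, and each task is placed on exactly one module.
        if any(v not in (0, 1) for v in column) or sum(column) != 1:
            raise ValueError(f"task {j} is not assigned to exactly one module")
    loads = [sum(p[j] * x[i][j] for j in range(n)) for i in range(m)]
    c_max = max(loads)
    # The deadline is met when the busiest module finishes no later than t_d.
    return loads, c_max, c_max <= t_d


if __name__ == "__main__":
    processing_times = [5, 3, 4, 2]          # hypothetical frame task times
    assignment = [[1, 0, 0, 1],              # module 0 gets tasks 0 and 3
                  [0, 1, 1, 0]]              # module 1 gets tasks 1 and 2
    print(evaluate_assignment(assignment, processing_times, t_d=8))
```

Because the model is non-preemptive and tasks are independent, the completion time of a module is simply the sum of the processing times assigned to it, so the makespan is the largest of these sums.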

IBF Algorithm
Since there is no dependency between tasks, the fine delay scheduling problem in multi-group scanning can be considered as an independent, parallel processor scheduling task.
The IBF algorithm parameters are defined as follows. The input is the set of tasks $T = \{t_i, i = 1, 2, \ldots, n\}$, the number of fine-delay modules $m$, and the number of tasks $n$. The output is the maximal processing time, $C^{IBF}_{\max}$. The IBF algorithm steps are as follows:
Step 1. Sort the tasks $T$ in descending order according to the task processing time $p_i$, $i = 1, 2, \ldots, n$;
Step 2. Let $A = \frac{1}{m}\sum_{i=1}^{n} p_i$, and let $L_j$, $j = 1, 2, \ldots, m$, be the focus and delay module pointers;
Step 3. Use the LPT algorithm to obtain the maximal processing time $C^{LPT}_{\max}$. Let $l = 1$ and $B(1) = C^{LPT}_{\max}$;
Step 4. If $A < \max(L_j) < B(l)$, go to step 5; otherwise, go to step 8;
Step 5. Let $l = l + 1$, $i = 1$, and $B(l) = \min(\max(L_j), B(l - 1) - 1)$;
Step 6. If there is at least one $j$ that satisfies the condition $L_j + p_i \le B(l)$, then allocate task $t_i$ to a focus and delay module that satisfies $L_j + p_i \le B(l)$. Otherwise, allocate the task to the focus and delay module that provides the minimal value of $L_j + p_i$;
Step 7. Set $i = i + 1$. If $i \le n$, go back to step 6; otherwise, go back to step 4;
Step 8. $C^{IBF}_{\max} = \min(B(1), B(2), \ldots, B(l - 1))$.
In step 3, the LPT algorithm is used to calculate the initial processing time in order to better approximate the initial conditions. Steps 4–8 represent the prepare algorithm (PA). Thus, the IBF algorithm is a combination of LPT and PA that improves the boundary and convergence of iteration, and achieves better performance in terms of local search and iterative progression. The IBF flowchart is shown in Figure 6.
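The steps above can be sketched in Python as follows. This is one possible reading of the procedure rather than the authors' implementation: it assumes integer processing times, it lowers the bound to one time unit below the best makespan found so far (a simplification of the update in step 5), and it stops as soon as a bounded pass fails or the average load $A$ is reached. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def lpt_loads(p_sorted: Sequence[int], m: int) -> List[int]:
    """LPT: give each task (longest first) to the least loaded module."""
    loads = [0] * m
    for p in p_sorted:
        j = loads.index(min(loads))
        loads[j] += p
    return loads


def bounded_pass(p_sorted: Sequence[int], m: int, bound: int) -> List[int]:
    """One pass of step 6: first module that fits under the bound,
    otherwise the module that gives the smallest L_j + p_i."""
    loads = [0] * m
    for p in p_sorted:
        fits = [j for j in range(m) if loads[j] + p <= bound]
        j = fits[0] if fits else min(range(m), key=lambda k: loads[k])
        loads[j] += p
    return loads


def ibf_makespan(p: Sequence[int], m: int) -> Tuple[int, int]:
    """Return (makespan, number of bounded passes) for integer task times."""
    tasks = sorted(p, reverse=True)          # step 1
    lower = sum(tasks) / m                   # step 2: average load A
    best = max(lpt_loads(tasks, m))          # step 3: B(1) = LPT makespan
    passes = 0
    while best > lower:                      # no schedule can beat A
        bound = best - 1                     # assumed bound update
        achieved = max(bounded_pass(tasks, m, bound))
        passes += 1
        if achieved > bound:                 # pass failed: keep previous best
            break
        best = achieved
    return best, passes


if __name__ == "__main__":
    times = [3, 3, 2, 2, 2]                  # hypothetical frame task times
    print(max(lpt_loads(sorted(times, reverse=True), 2)))  # 7
    print(ibf_makespan(times, 2))                           # (6, 1)
```

In this small example LPT alone yields a makespan of 7, while one bounded pass with the bound lowered to 6 packs the tasks as 3 + 3 and 2 + 2 + 2, which equals the average load and therefore cannot be improved further.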
The IBF analysis starts from $B(1) = C^{LPT}_{\max}$. If the iteration stops at $l = 2$, the output is $C^{IBF}_{\max} = C^{LPT}_{\max}$. If the iteration stops at $l = 3$, the output is $C^{IBF}_{\max} = C^{PA(B(0))}_{\max}$, and that will be the makespan. If the iteration stops at $l > 3$, then by induction $B(l) \le B(1) - (l - 1)$, which bounds the absolute performance of the IBF algorithm. If the iteration number is equal to one, only the LPT stage runs, and it determines the IBF time complexity. If the number of iterations is greater than one, IBF employs the PA, which represents the first fit decreasing (FFD) algorithm used in the bin-packing problem.

Time Performance
In order to determine the real-time performance of the IBF algorithm, a randomly generated set of tasks was used. The set and real-time deadline were used to simulate a UPA multi-group fine delay scheduling problem. The specific task generation process was as follows. First, m time blocks were generated. The length of each time block was equal to the deadline $t_d$. Then, each time block was divided into h = n/m + 1 parts, and thus h × m tasks were obtained in m time blocks. Afterward, n tasks were chosen from the h × m tasks generated in the previous step to create a set of tasks, and all task lengths were multiplied by 0.99. Thus, a randomly generated set of tasks was produced. The whole experiment was run in MATLAB 2016a on an Intel i7-4850HQ processor (Intel Corporation, Santa Clara, CA, USA) with 8 GB RAM.
This process was conducted to ensure that the processing time of each generated task was not greater than the real-time deadline, so that no generated task exceeded the processing ability of a fine-delay module. In other words, a feasible solution always existed for a given scheduling problem with the given number of modules. The generated task lengths followed a uniform random distribution and covered a wide range of values.
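The task generation process described above can be sketched as follows. The paper does not state how each time block is cut, so this sketch assumes cut points drawn uniformly at random, integer division for h = n/m + 1, and selection of the n tasks without replacement; the function name is hypothetical, and the original experiments were run in MATLAB rather than Python.

```python
import random
from typing import List


def generate_tasks(m: int, n: int, t_d: float, rng: random.Random) -> List[float]:
    """Generate n task lengths from m time blocks of length t_d."""
    h = n // m + 1                       # parts per time block (assumed integer division)
    pool: List[float] = []
    for _ in range(m):
        # Cut one block of length t_d into h parts at uniform random points.
        cuts = sorted(rng.uniform(0.0, t_d) for _ in range(h - 1))
        edges = [0.0] + cuts + [t_d]
        pool.extend(b - a for a, b in zip(edges, edges[1:]))
    # Choose n of the h * m parts and shrink every length by the 0.99 factor.
    chosen = rng.sample(pool, n)
    return [0.99 * length for length in chosen]


if __name__ == "__main__":
    rng = random.Random(0)               # fixed seed for repeatability
    tasks = generate_tasks(m=4, n=16, t_d=1000.0, rng=rng)
    print(len(tasks), max(tasks) <= 990.0)
```

Since every part comes from a block of length t_d, no task is longer than 0.99 t_d, and the total length of the chosen tasks never exceeds m t_d, which is consistent with the feasibility argument in the text.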
Five tests were conducted with the following parameters: the number of fine-delay modules m, the ratio of number of tasks and fine delay modules k = n/m, the real-time deadline d, the number of iterations K, and makespan C max . Each test was generated 100 times, and the average result was calculated. The LIST, LPT, BF, and IBF algorithms were compared.
Test 1 compared the LPT, BF, and IBF algorithms in terms of makespan. In Figure 7a, the parameter settings were: m = 4, k = 2–10, and d = 1000. Note that each curve had a peak value at k = 3, because when k = 3, the method generating the problem reduced the number of tasks and increased their length. Under this condition, the problem was difficult to schedule. With gradually increasing k, all curves gradually declined. IBF had the smallest makespan at k < 8, and when k ≥ 8, IBF and BF had almost the same makespan performance. This is because with the increase in k, the problem produced more tasks and their length decreased. That is, the smaller the granularity of the tasks, the greater the role of the scheduling algorithm. In Figure 7b, the parameter settings were: m = 2–10, k = 4, and d = 1000. We can see that the IBF algorithm still had the smallest makespan, but with the increase in m, the gap between BF and IBF continued to narrow. Although k was unchanged, the larger the value of m, the greater the number of permutations and combinations available to the scheduling algorithm. In makespan comparisons, IBF always had the best performance, but, as parameters k and m increased, the performance of BF and IBF gradually approached each other.
Test 2 compared LPT, BF, and IBF in terms of the missed deadline rate (MDR) with variables k and m. The parameter settings in Figure 8a were the same as in Figure 7a, and those in Figure 7b were applied to Figure 8b. The MDR is defined as the number of times a deadline was missed when a scheduling problem was generated randomly 100 times. Figure 8a shows that all curves had a peak value at k = 3, and then gradually decreased with increasing k. The reason is similar to that in test 1. Note that in Figure 8b, IBF had the smallest MDR, but when m > 9, the values of BF and IBF were basically the same. IBF was still the best in MDR performance, and with the increase in k, the scheduling performance improved as well. When k > 8, IBF was not significantly superior to BF.
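The MDR measurement used in test 2 can be illustrated with a small helper. The function name is an assumption, and the result is returned as a fraction of runs rather than as a raw count out of 100, so it also applies to a different number of runs.

```python
from typing import Sequence


def missed_deadline_rate(makespans: Sequence[float], deadline: float) -> float:
    """Fraction of generated problems whose makespan exceeds the deadline."""
    if not makespans:
        raise ValueError("at least one run is required")
    missed = sum(1 for c_max in makespans if c_max > deadline)
    return missed / len(makespans)


if __name__ == "__main__":
    # Hypothetical makespans of four runs with deadline d = 1000.
    print(missed_deadline_rate([990, 1005, 1000, 1200], 1000))  # 0.5
```

A run whose makespan equals the deadline is counted as meeting it here; whether the original experiments treat that boundary case the same way is not stated in the text.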
Test 3 compared LPT, BF, and IBF using statistical plots. Parameter settings were m = 4, k = 4, and calculation was run 100 times to obtain the makespan. Figure 9a shows the box plot. Note that the IBF algorithm had the lowest median and upper limits and the narrowest interquartile range (IQR). This shows that IBF scheduling had the best overall performance and the most centralized data. In the 95% confidence interval (CI) plot in Figure 9b, IBF had the lowest mean and the narrowest 95% CI. The IBF algorithm outperformed the BF and LPT algorithms in terms of statistical performance.
Test 4 compared the performance of the LIST, LPT, BF, and IBF algorithms (Table 2). The test parameter settings were m = 4, k = 4, and d = 1000, and the average of 100 runs was taken. The LIST algorithm had the worst performance, which affected the display of the figures. In order to clearly compare BF and IBF, which was not done in the previous experiments, the ratio $R_{IBF/LIST}$ was defined from the average makespans, where $C^{LIST}_{\max}$, $C^{LPT}_{\max}$, $C^{BF}_{\max}$, and $C^{IBF}_{\max}$ represent the average makespans of LIST, LPT, BF, and IBF obtained from 100 runs, respectively. In addition, $K_{BF}$ and $K_{IBF}$ represent the average number of iterations for BF and IBF. As shown in Table 2, IBF had the lowest average makespan, but its average number of iterations was slightly greater than that of the BF algorithm. This was also reflected in the elapsed time.
In the worst case of our experiment, the average elapsed times at m = 10, k = 4 for the LIST, LPT, BF, and IBF algorithms were 2.70, 2.63, 40.61, and 55.21 ms, respectively. The elapsed time of IBF was greater than that of BF by about 35.95%. However, as shown in the last column of Table 2, IBF improved performance by 8.76-21.48% compared to the LIST algorithm.
Test 5 was used to examine the number of iterations required by IBF. In Figure 10a, all curves peaked at k = 3-5 and then slowly declined. This occurred because when k = 3-5, the generated tasks had large granularity, which facilitated iteration without satisfying the conditions, so the number of iterations was greater. The number of iterations with larger m was greater than that with smaller m, because a large m leads to more permutations and combinations. When k > 8, the number of iterations decreased gradually and converged to similar values: because the tasks were small, the initial LPT schedule was more effective, so fewer iterations were needed. In Figure 10b, except for the case of k = 2, the curves increased gradually with m, and the larger the value of k, the smaller the number of iterations. Therefore, the greater the task granularity and the value of m, the greater the number of iterations.
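The 35.95% overhead quoted for the worst case follows directly from the reported average elapsed times of BF (40.61 ms) and IBF (55.21 ms):

```latex
\[
\frac{55.21 - 40.61}{40.61} = \frac{14.60}{40.61} \approx 0.3595 = 35.95\%
\]
```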

Resource Consumption
In the experiment, an Altera Cyclone IV EP4CE115F29C8 and Quartus II 13.0 (Intel Corporation, Santa Clara, CA, USA) were used to compare all parallel and 1/2 scheduling for 32-channel and 64-channel architectures. Then, the TimeQuest Timing Analyzer in Quartus II was used to determine the maximal clock frequency for the listed architectures. The clock frequency was set to 100 MHz. The obtained resource consumption and maximal frequencies of all architectures are presented in Table 3, wherein "number of groups" represents the number of scan groups in the multi-group UPA system; "number of modules" represents the number of fine delay modules in the system; "Total LUT" (LUT: look-up table), "Total Reg.", and "Total 9-bit Mult." refer to the total numbers of logic units, registers, and 9-bit multipliers consumed, respectively; and Fmax represents the maximum clock frequency. Percentages in brackets in the Total LUT and Total 9-bit Mult. columns represent the share of that resource type in the entire FPGA. Table 3 shows that all parallel architectures demand more resources and have lower maximal frequencies than 1/2 scheduling architectures. The 1/2 scheduling architecture could save about 57.06-58.84% in LUTs and 30-40% in 9-bit multipliers. Table 3 also demonstrates that the maximum frequency decreased as the number of channels increased. Bold values in the Fmax column mark the best Fmax for each number of channels. Therefore, based on the premise of guaranteeing real-time performance, the proposed architecture and IBF algorithm can reduce resource consumption, shorten timing, and increase the maximum clock frequency.
1 Due to resource limitations, the total number of 9-bit multipliers in the FPGA was 532.

Table 3. Resource consumption and max frequency of all parallel and 1/2 scheduling for 32-channel and 64-channel architectures.
eight clock-cycles has been taken into account and combined into time of read parameter. Units are clock cycles of the FPGA in Table 4 columns 2-4.

Real-Time Verification
In Figure 11, tasks T0-T3 correspond to the frame tasks of Groups 0-3, and FD0 and FD1 are the fine delay modules. The upper FD0 and FD1 were scheduled by LIST, and the lower FD0 and FD1 were scheduled by IBF. In the case of the maximum 8 K sampling depth and 128 focal laws (Group 3), the makespan of LIST was 13.86 ms, whereas the makespan of IBF was 11.82 ms, so IBF is superior to LIST. With a waiting time of more than 1 ms between frames, the frame periods of LIST and IBF were 14.86 and 12.82 ms, which correspond to frame rates of 67 and 78 fps, respectively. Therefore, the IBF algorithm reduced the makespan of the frame tasks, increased the frame rate, and improved the real-time performance of the multi-group scan UPA instrument.
Figure 11. Four groups scheduled in two fine delay modules, simulated by ModelSim.
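The frame-rate figures can be checked with a short calculation. This is a minimal sketch that takes the inter-frame waiting time as exactly 1 ms, which is what the reported frame periods imply; the function name `frame_rate_fps` is hypothetical.

```python
def frame_rate_fps(makespan_ms, wait_ms=1.0):
    """Frame rate from makespan plus inter-frame wait (both in milliseconds)."""
    period_ms = makespan_ms + wait_ms   # one frame period
    return 1000.0 / period_ms           # frames per second


# Makespans reported for the Figure 11 simulation.
for name, makespan_ms in (("LIST", 13.86), ("IBF", 11.82)):
    fps = frame_rate_fps(makespan_ms)
    print(f"{name}: period {makespan_ms + 1.0:.2f} ms, {fps:.1f} fps")
```

With the reported makespans, this gives frame periods of 14.86 and 12.82 ms and frame rates of about 67 and 78 fps, in agreement with the values stated in the text.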

Conclusions
In this paper, a fine delay scheduling architecture for the multi-group scanning of a UPA system was presented. The diversity of echo data in multi-group scanning and the number of focal laws were considered, and the multi-group scan problem was modelled by a linear equation. The IBF algorithm was proposed, and its time complexity and absolute performance were analyzed. The experimental results showed that the IBF algorithm outperformed the LIST, LPT, and BF algorithms, decreasing the makespan by 8.76-21.48% relative to LIST, while the frame rate reached 78 fps and the architecture reduced the consumption of FPGA 9-bit multipliers by 30-40%. The IBF algorithm was superior to BF when the task-to-module ratio was small. The proposed algorithm and mathematical model were applied to a UPA. Using the proposed architecture effectively improved integration, increased the maximum frequency, improved real-time performance, and decreased resource consumption. Therefore, the instrument's flexibility and performance were improved. The next step is to study the scheduling of other processing modules and multi-FPGA configurations integrated in a distributed environment.