A Reconfigurable Visual–Inertial Odometry Accelerated Core with High Area and Energy Efficiency for Autonomous Mobile Robots

With the wide application of autonomous mobile robots (AMRs), the visual inertial odometer (VIO) system that realizes the positioning function through the integration of a camera and inertial measurement unit (IMU) has developed rapidly, but it is still limited by the high complexity of the algorithm, the long development cycle of the dedicated accelerator, and the low power supply capacity of AMRs. This work designs a reconfigurable accelerated core that supports different VIO algorithms and has high area and energy efficiency, precision, and speed processing characteristics. Experimental results show that the loss of accuracy of the proposed accelerator is negligible on the most authoritative dataset. The on-chip memory usage of 70 KB is at least 10× smaller than the state-of-the-art works. Thus, the FPGA implementation’s hardware-resource consumption, power dissipation, and synthesis in the 28 nm CMOS outperform the previous works with the same platform.


Introduction
In recent years, AMRs have achieved rapid development driven by practical application demands. As the core technology of AMRs, simultaneous localization and mapping (SLAM) include five parts as Figure 1 shows: sensor data reading, front-end visual odometry, backend optimization, loop closure, and mapping. As a single sensor cannot cope with all situations, the most effective approach is to fuse data from both the camera and IMU, an algorithm called VIO. Aside from the computational complexity of loop closure and global optimization, the VIO system can estimate the position and trajectory of a moving object in real time with lower power consumption, which is very important for lightweight AMRs with high processing speed requirements. The visual front-end of VIO mainly estimates the robot's motion through the direct method or feature point extraction. In 2008, the direct method proposed by Silveira G. et al. [1] used the difference in light intensity of each pixel in adjacent frame images to estimate the camera's motion. At present, representative open-source projects are DTAM [2], LSD-SLAM [3], and DSO [4], etc. Compared with feature point extraction, the direct method saves time for calculating features and can maintain certain robustness when the texture is scarce, but it is difficult to work in scenes with drastic changes in light. Feature point extraction uses image features instead of image intensity, which overcomes the The visual front-end of VIO mainly estimates the robot's motion through the direct method or feature point extraction. In 2008, the direct method proposed by Silveira G. et al. [1] used the difference in light intensity of each pixel in adjacent frame images to estimate the camera's motion. At present, representative open-source projects are DTAM [2], LSD-SLAM [3], and DSO [4], etc. Compared with feature point extraction, the direct method saves time for calculating features and can maintain certain robustness when the texture is scarce, but it is difficult to work in scenes with drastic changes in light. Feature point extraction uses image features instead of image intensity, which overcomes the shortcomings of the direct method but relies on the feature extraction results. In recent years, classic features have included SIFT [5], SURF [6], FAST [7], and so on.
Backend optimization is to optimize the motion pose estimated by the front end to minimize accumulated errors. The main backend algorithms are divided into filtering methods using extended Kalman filter (EKF) or other filters and optimization methods using bundle adjustment (BA) or graph optimization methods. In the case of limited computing resources and relatively simple quantities to be estimated, the filtering method represented by EKF is very effective; however, because the storage capacity and state quantity are in a quadratic growth relationship, there are many feature data, and the filtering method is less efficient. In addition to KF [8] and EKF [9], there are also information filters [10,11] and particle filters [12][13][14]. Contrary to the filtering method, the optimization method no longer relies on the information at a specific moment but obtains the optimal global estimation of all landmarks by optimizing the joint error function of all poses and landmarks. Existing optimization methods include that by Di, K. et al. [15], who studied an extended BA algorithm that can utilize both 2D and 3D information and is suitable for RGB-D cameras. Alismail, H. et al. [16] abandoned the minimization of reprojection error and achieved a significant improvement in accuracy with a photometric BA algorithm based on maximizing photometric continuity. The incremental, consistent, and efficient BA proposed by Liu, H. et al. [17] adopts incremental technology, which consumes only about 1/10 of the computing resources of traditional BA under the premise of ensuring accuracy.
In addition to improving algorithms, using SLAM systems also relies on hardware acceleration. Compared with CPU, GPU has powerful floating-point computing capability, so GPU is generally used in the early stage of hardware acceleration design research, such as feature extraction accelerator based on SIFT [18] and SURF [19] features. However, the GPU system consumes a lot of power. With the development of field programmable gate arrays (FPGAs), hardware acceleration relies more on FPGA. Yum, J. et al. [20] designed a complete SIFT hardware accelerator, Wilson, C. et al. [21] proposed a complete FPGA implementation of the SURF accelerator, and Ulusel, O. [22] showed that feature extraction on the FPGA platform has great advantages over CPU and GPU.
Moreover, accelerators suitable for backend optimization have also been proposed recently. For example, Tertei, D. T. R. et al. [23] proposed an efficient FPGA SoC hardware structure using systolic array matrix multiplication to accelerate the EKF-SLAM algorithm. Wang, J. et al. [24] designed a reconfigurable matrix multiplication coprocessor for accelerating matrix multiplication in visual navigation algorithms. However, the hardware accelerators can only accelerate specific algorithms, or they are incomplete, only accelerating the feature extraction or matrix multiplication part and lacking pose estimation and trajectory output.
Currently, the relatively complete acceleration design includes Liu, R. et al. [25] implementing a complete ORB-SLAM system on FPGA. The feature extraction and feature matching parts are accelerated by FPGA, but the trajectory estimation and pose optimization are completed by ARM. Li, Z. et al. [26] presented an accurate, low-power, real-time CNN-SLAM processor that implements full-visual SLAM on a single chip. Suleiman, A. et al. designed a chip that can perform VIO [27], reducing on-chip storage to 1/4 the size through image compression techniques. Zhang, Z. et al. designed a VIO system that uses Kintex-7 XC7K355T (an FPGA) [28], and Wang, C. et al. designed a visual SLAM system that uses UltraScale+ XCZU7EV (an FPGA) [29].
This paper proposes a reconfigurable, real-time, low-area, and energy-efficient VIO accelerator implemented on an FPGA with the following major design features: (i) a reconfigurable accelerator architecture that adapts to different Kalman-filter-based VIO algorithms; (ii) an optimized instruction-based structure supporting the simultaneous workflow of fixed-point and floating-point units to accelerate VIO algorithms for real-time usage. This design can support the post estimation and trajectory output of real-time > 60 Hz frame input and > 200 Hz of IMU input; and (iii) a computing core with shared memory and the memory reuse strategy. This works only consumes 70 KB of on-chip memory, which is at least 10× lower than the previous works. To our knowledge, this is the first integrated reconfigurable architecture that supports multiple VIO algorithms implemented in FPGA.

Algorithms
2.1. Overall Procedure of the Visual-Inertial Odometry Figure 2 shows the overall procedure of the visual-inertial odometry, consisting of IMU pre-integration, feature coordinates optimization, and extended Kalman filter (EKF). IMU pre-integration computes the prior estimate of pose and the position and other state variables, such as velocity and feature location of the AMR concerning the IMU. Feature coordinates optimization optimizes the pre-estimated pixel location of the stored features by minimizing the 2D reprojection error. Then, measured state variables are estimated through the location difference between the optimized results and the prior estimated ones. At last, EKF combines the prior estimation of state variables and the measured state variables according to the covariance matrix and outputs the post estimation of the state variables, where pose and position represent the trajectory. Each floating-point function circuit is assigned a priority level to ensure that each floating-point computation circuit can be accessed by only one floating-point function circuit at the same clock.
CNN-SLAM processor that implements full-visual SLAM on a single chip. Suleiman, A. et al. designed a chip that can perform VIO [27], reducing on-chip storage to 1/4 the size through image compression techniques. Zhang, Z. et al. designed a VIO system that uses Kintex-7 XC7K355T (an FPGA) [28], and Wang, C. et al. designed a visual SLAM system that uses UltraScale+ XCZU7EV (an FPGA) [29]. This paper proposes a reconfigurable, real-time, low-area, and energy-efficient VIO accelerator implemented on an FPGA with the following major design features: (i) a reconfigurable accelerator architecture that adapts to different Kalman-filter-based VIO algorithms; (ii) an optimized instruction-based structure supporting the simultaneous workflow of fixed-point and floating-point units to accelerate VIO algorithms for realtime usage. This design can support the post estimation and trajectory output of real-time > 60 Hz frame input and > 200 Hz of IMU input; and (iii) a computing core with shared memory and the memory reuse strategy. This works only consumes 70 KB of on-chip memory, which is at least 10× lower than the previous works. To our knowledge, this is the first integrated reconfigurable architecture that supports multiple VIO algorithms implemented in FPGA.

IMU Pre-Integration
IMU can measure acceleration and angular velocity through the accelerometer and gyroscope. Through integration, the rotation and displacement between two frames of images can be obtained. That is to say, if the position, velocity, and rotation at time k are known, these values at time k + 1 can be obtained. However, in the optimization algorithm, these values at each moment are estimated. When optimizing them, the data between two moments must be re-integrated, which makes the calculation requirements large. Lupton T. et al. [30] proposed IMU pre-integration to avoid repeated integration, and Forster, C. [31] made it better.
A schematic diagram of the IMU model is shown in Figure 3, and its formula is shown in (1). The measured values of linear acceleration and angular velocity are represented by( ·). Linear acceleration is the resultant vector of gravitational acceleration and object acceleration, b a t and b ω t are offsets, n a and n ω are Gaussian noise. tween two moments must be re-integrated, which makes the calculation requirements large. Lupton T. et al. [30] proposed IMU pre-integration to avoid repeated integration, and Forster, C. [31] made it better. A schematic diagram of the IMU model is shown in Figure 3, and its formula is shown in (1). The measured values of linear acceleration and angular velocity are represented by (⋅) . Linear acceleration is the resultant vector of gravitational acceleration and object acceleration, and are offsets, and are Gaussian noise.
The representation of the position, velocity, and rotation (quaternion form) of + 1 frame can be obtained from frame, as shown in (2). Here, represents the body coordinate system, represents the world coordinate system, represents the rotation matrix from the body coordinate system to the world coordinate system at time , represents the rotation of the body coordinate system at time relative to the body coordinate system at the time , and represents the rotation of the body coordinate system relative to the world coordinate system at the time . where In (2), PVQ of the body coordinate system at time + 1 depends on . If the PVQ of the body coordinate system is directly used as a variable to optimize and iteratively update, it will lead to a large amount of calculation. The idea of IMU pre-integration is to adjust the reference coordinate system from the world coordinate system to the body coordinate system of frame so that the integration result becomes the relative change of to . It is realized by multiplying , as shown in (4). The detailed derivation process is omitted [32]. The representation of the position, velocity, and rotation (quaternion form) of k + 1 frame can be obtained from k frame, as shown in (2). Here, b represents the body coordinate system, w represents the world coordinate system, R w t represents the rotation matrix from the body coordinate system to the world coordinate system at time t, q b k t represents the rotation of the body coordinate system at time t relative to the body coordinate system at the time b k , and q w b k represents the rotation of the body coordinate system relative to the world coordinate system at the time b k . where In (2), PVQ of the body coordinate system at time k + 1 depends on b k . If the PVQ of the body coordinate system is directly used as a variable to optimize and iteratively update, it will lead to a large amount of calculation. The idea of IMU pre-integration is to adjust the reference coordinate system from the world coordinate system w to the body coordinate system b k of k frame so that the integration result becomes the relative change of b k+1 to b k . It is realized by multiplying R b k w , as shown in (4). The detailed derivation process is omitted [32].
In the feature extraction technology, in addition to FAST [7], there are SIFT [5], SURF [6], and other algorithms. The features they extracted have strong invariance, but the time consumption is relatively large. In a system, feature extraction is only a part, and subsequent algorithms such as registration, purification, and fusion are also performed. This makes the real-time performance not suitable and reduces the system performance. The Fast detector is very fast because it does not involve complex operations such as scale and gradient. It uses the gray value of the pixel in a specific neighborhood to compare the size with the center point to determine whether it is a corner point.
The proposed accelerated core adapts the FAST-9 feature extraction [7] in the vision pipeline, a trade-off between accuracy and efficiency since the proposed accelerated core targets high-performance and low-power design.
Here, S stands for the intensity relationship of the pixels P with the center C, and T is the user-defined threshold. If there are consecutive nine pixels darker or brighter, the center pixel is a FAST feature in the accelerated core.
This paper uses pipeline stages for hardware implementation since the FAST algorithm requires counters, which should be accumulated by clock cycles. However, it is not enough to implement it in 16 clock cycles, as shown in Figure 4, because a true circle path to access all pixels in the FAST algorithm should be half a circle. For example, if pixels P 1 , P 2, and P 10 -P 16 are brighter while others are not, the center pixel should be a FAST feature. However, the first round only considers seven brighter pixels (P 10 -P 16 ) which makes an incorrect judgment. The one-and-a-half circles for detecting solves this problem by redundantly calculating eight more pixels.

Gradient-based Score Calculation
This paper adapts a gradient-based score (Grad score in short) to indicate the quality of a feature after FAST feature detection, as shown below: Here, the input patch is size 8 × 8, and P stands for the pixel values. Therefore, the Grad score is calculated with a 6 × 6 window inside the patch. The score will be used in the rest process in the core.

Algorithm of Feature Coordinates Alignment
Feature coordinates calculated from IMU pre-integration and feature prediction are corrected by photometric error, which is based on the difference in pixel data between the old feature patches in the last frame and the new feature patches extracted in the current frame. The photometric error is defined by:

Gradient-Based Score Calculation
This paper adapts a gradient-based score (Grad score in short) to indicate the quality of a feature after FAST feature detection, as shown below: Here, the input patch is size 8 × 8, and P stands for the pixel values. Therefore, the Grad score is calculated with a 6 × 6 window inside the patch. The score will be used in the rest process in the core.

Algorithm of Feature Coordinates Alignment
Feature coordinates calculated from IMU pre-integration and feature prediction are corrected by photometric error, which is based on the difference in pixel data between the where the scalar factor s l = 0.5 l stands for the down-sampling number of layers of the image pyramid. W stands for the warping transform matrix in [33], p j is the patch pixel in the patch P l , and p is the feature coordinate. This paper adapts the QR-decomposition method in [33], which stacks all error terms together for given estimated coordinatesp: Here, A(p) is based on the patch intensity gradients along the X and Y axes. The QR-decomposition of A(p) obtains an equivalent reduced linear equation system: The iteration will be user-defined and ten times in the proposed core. If the photometric error is still higher than a certain threshold, the feature will be marked as bad and abandoned, while others could be used in the EKF update.

Supported Operations in SLAM Algorithms
Most SLAM algorithms require not only basic operations, such as the addition or multiplication of scalars, but also operations of rotation dynamics, including rotation matrix, quaternion, and lie algebra. The proposed VIO accelerated core provides full functionality to support the abovementioned operations, shown in Table 1.

Hardware Design
This section introduces the overall hardware architecture of the proposed VIO accelerated core and details four important submodules and techniques adapted in the design. Figure 5 shows the proposed accelerator's overall architecture constructed by four sub-modules, including a fixed-point vision pipeline, a memory interface, a programmable computation core, and layers for the EKF engine. The vision pipeline receives input data from the image sensor to perform new and predicted feature extraction. The memory interface uses a shared memory strategy to store intermediate values and output data. The programmable computation core pre-loads three programs to perform the computation of IMU pre-integration and the computation of the Jacobian matrix. This core can satisfy the pose estimation of a sample class of Kalman-filter-based SLAM algorithms. In the EKF engine, layer 1 contains a finite state machine in charge of the whole process. The layer of the feature processing engine scores and sorts the features received from the feature extraction engine then compares the best 25 of them with the features stored in the feature manager and replaces the old ones with them if the new features are better. The feature processing engine also computes the 2D location difference of the same feature in the last and current frames. Then, the EKF engine completes the update process of the EKF process with the help of a specific mission layer and FP arithmetic. The specific mission layer shares the floating-point arithmetic with the computation core for higher hardware utilization. The vision pipeline receives input data from the image sensor to perform new and predicted feature extraction. The memory interface uses a shared memory strategy to store intermediate values and output data. The programmable computation core pre-loads three programs to perform the computation of IMU pre-integration and the computation of the Jacobian matrix. This core can satisfy the pose estimation of a sample class of Kalmanfilter-based SLAM algorithms. In the EKF engine, layer 1 contains a finite state machine in charge of the whole process. The layer of the feature processing engine scores and sorts the features received from the feature extraction engine then compares the best 25 of them with the features stored in the feature manager and replaces the old ones with them if the new features are better. The feature processing engine also computes the 2D location difference of the same feature in the last and current frames. Then, the EKF engine completes the update process of the EKF process with the help of a specific mission layer and FP arithmetic. The specific mission layer shares the floating-point arithmetic with the computation core for higher hardware utilization.   Figure 6 details the structure of the fixed-point vision pipeline, including a Gaussian pyramid-image half sample engine, a FAST feature identifier, a Grad score-based patch extractor, and a patch controller. The architecture workflow is pipelined with line buffers, thus reducing memory consumption and increasing processing throughput. The operation provides a foundation for the following process by interpreting images at multiple resolutions and different scales with a Gaussian sampling pyramid of the input grayscale image. The half-sampled pixels are transferred through an asynchronous FIFO, followed by a group of 11 line buffers. The rest processes are based on an area-reused 12 × 12 sliding window where pixel data can be available for three layers.

Fixed-Point Vision Pipeline
The FAST identifier is a feature extractor adapting the FAST-16 algorithm where if a pixel differs greatly from enough pixels in its surrounding neighborhood, the pixel may be a corner. Adapting the FAST algorithm for every pixel indicates a continually analyzed 16 pixels, i.e., whether the pixel is brighter/darker than the center pixel as the last pixel does. In the hardware implementation, the algorithm is processed in a pipeline which takes 24 periods. At every stage, the circuit will detect the brightness for both the current The architecture workflow is pipelined with line buffers, thus reducing memory consumption and increasing processing throughput. The operation provides a foundation for the following process by interpreting images at multiple resolutions and different scales with a Gaussian sampling pyramid of the input grayscale image. The half-sampled pixels are transferred through an asynchronous FIFO, followed by a group of 11 line buffers. The rest processes are based on an area-reused 12 × 12 sliding window where pixel data can be available for three layers.
The FAST identifier is a feature extractor adapting the FAST-16 algorithm where if a pixel differs greatly from enough pixels in its surrounding neighborhood, the pixel may be a corner. Adapting the FAST algorithm for every pixel indicates a continually analyzed 16 pixels, i.e., whether the pixel is brighter/darker than the center pixel as the last pixel does. In the hardware implementation, the algorithm is processed in a pipeline which takes 24 periods. At every stage, the circuit will detect the brightness for both the current and last pixels and increase or reset the counters. In Figure 6, BrightCnt means the number of continuing pixels with a grey value more significant than the grey value of the center plus threshold, while DarkCnt stands for the opposite. If BrightCnt or DarkCnt is larger than 9, the pixel is identified as a FAST feature, and the result will be passed to the patch controller by a line buffer.
The FAST identifier is optional, considering the patch extractor has already been attached with a Grad score comparator which adapts only the front-computing. Input pixel data from the 8 × 8 window generates their gradient values which are used to calculate the Grad score according to (6).
The patch controller temperately stores the information of a maximum of 100 patches in sub-patch controllers, where a patch is defined by the feature pixel and its surrounding 12 × 12 window of pixels. In order to avoid the extracted feature points being too concentrated, this paper proposed a patch extract strategy, which constructs banned rows and banned columns. When a new feature/patch is extracted, its surrounding 16 rows and columns will be banned for the subsequent extraction unless the new feature has a higher Grad score. The 100 sub-patch controllers are awakened one by one to reduce power consumption and send write to enable signal to the patch controller for patch storing, which is implemented by dual-port rams. If the maximum feature limit (100) is reached, a special sub-patch controller will be used to cover the patch with the least Grad score if the new feature has a higher score, ensuring the extracted patches are high quality for the rest process. At the end of a frame, the vision pipeline sends a signal to the top state machine, indicating the available feature patches. Figure 7 details the structure of the programmable computation core realized by instruction control, which is an extension of the previous work [34]. A matrix computation core performs the matrix operation and controls the program workflow, and a special computation core performs quaternion, angle-axis, and scalar operations. The computation core is controlled by a proposed instruction set architecture (ISA). Operation instructions store the offset address of operands, source, and target address, transpose or generate antisymmetric matrix, and the operation types. The PC reads instructions from IRMEM and sends them to the corresponding instruction decoder for operation instructions. Each core has its own 38-bit instruction set and an 8-bit address instruction memory but shares a 4 × 7-bit address, 32-bit per data, memory for operators. Each core has and only has access to its own nine adders, nine multipliers, one floating sqrt, and one reciprocal computation circuit. Compared with the matrix core, the special core has some unique computation circuits, such as cosine and sine. Two instructions for matrix core, bubble and sync, and one for special core, bubble, are used to synchronize the cores and secure the computation order. An interface and matched instructions for 16-bit per address, 38-bit per data memory are kept for the potential memory. The operations required by different Kalman-filter-based algorithms can be programmed into the instruction and operator memory blocks. What is more, Table 2 shows the supported operations by the computation core, including normal scalar operations and special operations required by most SLAM algorithms.   Transform lie algebra to a rotation matrix 73 clock cycles Li2Q

Programmable Computation Core
Transform lie algebra to quaternion 65 clock cycles R2Q Transform rotation matrix to quaternion 53 clock cycles Q2R Transform quaternion to a rotation matrix 14 clock cycles Q_q Quaternion multiplication 14 clock cycles Figure 8 shows the structure of the feature processing engine, which includes a memory interface, a feature manager, a floating-point patch workspace, a coordinates alignment circuit, and interaction with the EKF engine. During the process of VIO, after IMU pre-integration, the pose of the feature concerning the last frame is exported out of the operator memory.  Figure 8 shows the structure of the feature processing engine, which includes a memory interface, a feature manager, a floating-point patch workspace, a coordinates alignment circuit, and interaction with the EKF engine. During the process of VIO, after IMU pre-integration, the pose of the feature concerning the last frame is exported out of the operator memory.

Feature Processing Engine
The patch workspace requests the 12 × 12 feature patch in a fixed-point format whose center locates where the feature is in the last frame from the feature extraction engine in the fixed-point vision pipeline, converts it to the floating-point format, and stores it temporarily into the dual-port rams. A bilinear interpolation is performed to sample an 8 × 8 patch from the original 12 × 12 patches according to the warping matrix and the updated patch coordinates [33]. Then, the patch workspace measures the difference between the old and the new patches, which are the gray value difference and grad difference along the X and Y axes. The information will be output to the coordinate alignment circuit.
The coordinates alignment circuit generates matrix A and vector b according to (7) and (9). The least-square equation optimizes the feature location and output to patch the workspace for resampling. The process ends when the iteration reaches the time limit or the 2D reprojection error is lower than a specific threshold. If the reprojection error is acceptable, the feature processing engine computes and stores the Jacobian matrix of measured results and noise in memory. EKF engine performs the update of the Kalman filter. After finishing the update, the feature processing engine sorts and compares the newly extracted features with the old ones. The feature processing engine replaces the old feature information and patches with the new one if the new feature is better. The patch workspace requests the 12 × 12 feature patch in a fixed-point format whose center locates where the feature is in the last frame from the feature extraction engine in the fixed-point vision pipeline, converts it to the floating-point format, and stores it temporarily into the dual-port rams. A bilinear interpolation is performed to sample an 8 × 8 patch from the original 12 × 12 patches according to the warping matrix and the updated patch coordinates [33]. Then, the patch workspace measures the difference between the old and the new patches, which are the gray value difference and grad difference along the X and Y axes. The information will be output to the coordinate alignment circuit.
The coordinates alignment circuit generates matrix A and vector b according to (7) and (9). The least-square equation optimizes the feature location and output to patch the workspace for resampling. The process ends when the iteration reaches the time limit or the 2D reprojection error is lower than a specific threshold. If the reprojection error is acceptable, the feature processing engine computes and stores the Jacobian matrix of measured results and noise in memory. EKF engine performs the update of the Kalman filter. After finishing the update, the feature processing engine sorts and compares the newly extracted features with the old ones. The feature processing engine replaces the old feature information and patches with the new one if the new feature is better.
In order to save resources and reduce power consumption, 3 × 3 operation cores are used for matrix-vector operations of any size. To this end, this paper innovatively adopts the idea of vectorization, which is in matrix multiplication, the left matrix is input to the column from left to right, and the right matrix is input to the row from top to bottom. Each set of operations completes an M33 matrix, calculates a new M33 matrix every three rows and three columns, as shown in Figure 9, and finally, calculates the entire large matrix with low resource and power consumption. In order to save resources and reduce power consumption, 3 × 3 operation cores are used for matrix-vector operations of any size. To this end, this paper innovatively adopts the idea of vectorization, which is in matrix multiplication, the left matrix is input to the column from left to right, and the right matrix is input to the row from top to bottom. Each set of operations completes an M33 matrix, calculates a new M33 matrix every three rows and three columns, as shown in Figure 9, and finally, calculates the entire large matrix with low resource and power consumption.

Accuracy of Datasets Compared with Software Platform
A comparison to the software platform on a dataset is performed to verify the functionality and accuracy of the proposed VIO accelerator. The following were the main parameters of the PC: Ubuntu 18.04 (64-bit), Intel(R) Core (TM) i5-8265U CPU with 8G RAM. Moreover, we translate the ROVIO [33] algorithm into the ISA proposed by this paper. The dataset used in the evaluation is EuRoC, which is one of the most authoritative and

Accuracy of Datasets Compared with Software Platform
A comparison to the software platform on a dataset is performed to verify the functionality and accuracy of the proposed VIO accelerator. The following were the main parameters of the PC: Ubuntu 18.04 (64-bit), Intel(R) Core (TM) i5-8265U CPU with 8G RAM. Moreover, we translate the ROVIO [33] algorithm into the ISA proposed by this paper. The dataset used in the evaluation is EuRoC, which is one of the most authoritative and challenging datasets. We translate the EKF process of ROVIO into the proposed ISA and evaluate them on EuRoC. As shown in Table 3, the accuracy loss is neglectable, indicating the high accuracy of the proposed accelerated core.

Evaluation Platform and Experiments
The evaluation platform is constructed by a four-wheeled platform that can move smoothly, carrying the Xilinx XCVU 440 FPGA evaluation board with monocular camera mt9v034 and MPU9250, as shown in Figure 10. Figure 11 illustrates the measured experience map over frames during a typical SLAM operation in a flat corridor, which is about 70 m in total. The evaluation platform starts from one of the corners and moves smoothly at a speed of 0.35 m/s on average. After a loop of testing, we find that the trajectory is only slightly shifted, indicating the robust functionality and relatively high accuracy of the proposed accelerated core. Table 4 illustrates the implementation results of the proposed reconfigurable VIO accelerator. ASIC synthesis is performed to better compare with state-of-the-art ASIC implementation results. This work operates in the highest frequency both in FPGA and ASIC synthesis. Nevertheless, the on-chip memory usage is the lowest among these works for occupying only 70 KB when [29] is lower but with the usage of build-in SoC. Compared with FPGA implementations, slice LUTs and FFs consumed by the accelerator are 2.05× and 2.35× less than [28]. DSP consumption in this work is 8.03× and 1.8× less than [28] and [29]. As illustrated in Table 4, the core area in 28 nm CMOS technology is only 3 mm 2 , which is 6.67× and 3.64× smaller in comparison to [27] and [26]. This work supports reconfigurable SLAM architecture, while others do not. smoothly at a speed of 0.35 m/s on average. After a loop of testing, we find that the trajectory is only slightly shifted, indicating the robust functionality and relatively high accuracy of the proposed accelerated core.     Table 4 illustrates the implementation results of the proposed reconfigurable VIO accelerator. ASIC synthesis is performed to better compare with state-of-the-art ASIC implementation results. This work operates in the highest frequency both in FPGA and ASIC synthesis. Nevertheless, the on-chip memory usage is the lowest among these works for occupying only 70 KB when [29] is lower but with the usage of build-in SoC. Compared with FPGA implementations, slice LUTs and FFs consumed by the accelerator are 2.05× and 2.35× less than [28]. DSP consumption in this work is 8.03× and 1.8× less than [28] and [29]. As illustrated in Table 4, the core area in 28 nm CMOS technology is only 3 mm 2 , which is 6.67× and 3.64× smaller in comparison to [27] and [26]. This work supports reconfigurable SLAM architecture, while others do not.  The power consumption on the FPGA platform is about 0.683 W, which is lower than the 1.46 W on the Kintex-7 XC7K355T [28]. The power proportion of each module is shown in Figure 12. Based on the synthesis results in the 28 nm CMOS process, the power consumption is only 179.52 mW, which is much better than that in [26] with the same 28 nm CMOS technology, but still worse than that in [27] with 65 nm CMOS technology.

Discussions
Due to the lightweight characteristic of the EKF optimization and FAST-based features, the proposed accelerated core essentially reduces the consumption of LUTs, FFs, and DSPs. Secondly, compared with other backend optimization techniques, such as bundle adjustment, EKF does not require storing the entire image, significantly reducing a large amount of on-chip memory. Finally, the delicate design improves the peak performance of the proposed accelerated core. the 1.46 W on the Kintex-7 XC7K355T [28]. The power proportion of each module is shown in Figure 12. Based on the synthesis results in the 28 nm CMOS process, the power consumption is only 179.52 mW, which is much better than that in [26] with the same 28 nm CMOS technology, but still worse than that in [27] with 65 nm CMOS technology. Due to the lightweight characteristic of the EKF optimization and FAST-based features, the proposed accelerated core essentially reduces the consumption of LUTs, FFs,

Conclusions
This paper proposed a reconfigurable visual inertial odometer (VIO) for the simultaneous localization and mapping (SLAM) application in autonomous mobile robots. The proposed accelerated core leveraged a lightweight feature extraction algorithm with the extended Kalman filter (EKF) update algorithm to provide high energy efficiency while achieving appropriate SLAM results. Furthermore, the reconfigurability of the accelerated core brings more potential downstream exploration for AMR applications.
The main hardware architecture of the VIO accelerator consisted of four sub-modules: (1) fixed-point vision pipeline, (2) memory interface, (3) programmable computation core, and (4) layers for the EKF engine. We first evaluated the accuracy both on benchmark datasets and real experimental tests. As shown in the implementation results, the on-chip memory usage of 70 KB was the lowest among the standalone works for SLAM. Meanwhile, the hardware-resource usage and power dissipation on the FPGA implementation and the synthesis of 28 nm CMOS technology also outperformed the state-of-the-art works on the same condition.