Deep-E Enhanced Photoacoustic Tomography Using Three-Dimensional Reconstruction for High-Quality Vascular Imaging

Linear-array-based photoacoustic computed tomography (PACT) has been widely used in vascular imaging due to its low cost and high compatibility with current ultrasound systems. However, linear-array transducers have inherent limitations for three-dimensional imaging due to the poor elevation resolution. In this study, we introduced a deep learning-assisted data process algorithm to enhance the image quality in linear-array-based PACT. Compared to our earlier study where training was performed on 2D reconstructed data, here, we utilized 2D and 3D reconstructed data to train the two networks separately. We then fused the image data from both 2D and 3D training to get features from both algorithms. The numerical and in vivo validations indicate that our approach can improve elevation resolution, recover the true size of the object, and enhance deep vessels. Our deep learning-assisted approach can be applied to translational imaging applications that require detailed visualization of vascular features.


Introduction
Over the past two decades, photoacoustic computed tomography (PACT), a hybrid imaging modality, has gained increasing interest within the field of biomedical imaging. PACT combines the rich contrast from optical absorption with the high resolution and deep imaging depth of ultrasound [1]. PACT relies on the photoacoustic effect, in which short-pulsed light beams induce local thermoelastic expansion that generates ultrasonic acoustic waves [2]. In PA, biomolecules such as melanin, lipids, and hemoglobin are used as endogenous contrasts due to their high optical absorption coefficients at specific wavelengths. Acoustic signals are detected via ultrasound transducer arrays, where they are then digitized to be used in image reconstruction [2]. Based on optical absorption of hemoglobin, PACT can visualize deep underlying vascular structures [3]. Various types of transducer arrays have been used in PACT systems. Linear-array transducers remain as one of the most popular choices due to easy integration with different light sources and compatibility with existing ultrasound systems [4]. Recently, linear transducer arrays have been used in breast imaging [5], biometrics [6], neural imaging [7], and oncology [8]. Linear arrays provide nearly isotropic resolution in the axial and lateral directions to form a B-mode image. However, these arrays have an inherently poor elevation direction resolution, which is determined by the fixed cylindrical focus of the transducer element. To maintain good sectioning capability, most linear arrays have a long elevation focus, rendering an elevation resolution that is typically a few times worse than that of their lateral and axial counterparts. This limitation directly impacts the three-dimensional performance of a PACT system [9].
Many studies have attempted to improve the elevation resolution in linear-arraybased PACT through either hardware-based or software-based solutions [9]. For hardwarebased methods, modifications are made to either the scanning trajectory or the detection configuration [10][11][12]. However, complicated scanning geometries lead to a longer scanning time and more data generation when compared to traditional configurations. Softwarebased approaches optimized the image quality through advanced image reconstruction algorithms. One example is our earlier research based on coherent weighting and focal-line 3D image reconstruction (CWFL) [13]. It computes the coherence of PA signals in 3D space and assigns the back-projected signals with a weighting factor. However, the intensity of each voxel was altered by the weighting factor, affecting the quantitative accuracy of the reconstruction. Recently, deep learning has been widely used in photoacoustic imaging studies to improve image quality and/or reduce electric noise in the raw data [14][15][16]. However, very few have attempted to improve the elevation resolution of linear-arraybased PACT, due to the complexity of 3D simulation and training. Our group recently introduced Deep-E, a UNet-based method for improving the elevation resolution [17]. In that study, we converted the 3D problem into 2D space by training data only on the axialelevation plane. Experimental validation indicated that Deep-E improved the elevation resolution by over four times. However, the input for Deep-E is still 2D reconstructed images (in the axial-lateral plane) that are stacked into 3D. As the 2D reconstruction does not consider signal divergence along the elevation direction, the quality of reconstructed images is inherently limited.
To overcome this issue, we propose a revised Deep-E network that utilizes 3D reconstructed data as input. 3D image reconstruction is achieved by the 3D focal-line reconstruction algorithm, which considers the elevation aperture and precisely back-projects the PA signals into the 3D space [13]. As deeper signals can be seen at more scanning locations, 3D focal-line reconstruction allows better visualization of deeper structures. However, 3D focal-line reconstruction is subject to motion artifacts as it assumes that the object remains stationary during the scanning, which may last 1-2 min. Slight movements in the object might degrade the image quality. Two-dimensional reconstruction is less sensitive to motion artifacts as reconstruction is done at each imaging plane. Therefore, integrating results from both reconstruction algorithms could achieve better image quality. In the following sections, we lay out the steps to obtaining high-resolution data from both 3D and 2D reconstructions and combining them effectively. Our results indicated that, in addition to the elevation resolution improvement, the combined image contained more vascular structures and the deeper vessels were more visible.

System Setup
Our network was designed for the dual-scan mammoscope (DSM) photoacoustic imaging system, which was used for human breast imaging [5,18]. The transducer used in DSM was a customized 128-element linear-array transducer (IMASONIC SAS) with 2.25 MHz central frequency and >65% bandwidth. Each element in the array is arc-shaped with a 15 mm elevation length and focus at 40 mm. Light illumination is provided by a 10 Hz Nd: YAG laser (Continuum Inc., Blvd Austin, TX, USA) with 1064 nm wavelength and <10 ns pulse width. PA signals are acquired by a 256-channel data acquisition unit from Verasonics and are transmitted to the host computer for further processing. The light delivery and signal receiving are synchronized with the Q-switch output of the laser.

Simulation of PA Signal
We utilized the MATLAB-based photoacoustic simulation toolbox (K-wave) to generate the photoacoustic sinograms [19]. To improve the efficiency of dataset generation, we only simulated the acoustic wave propagation in the 2D space along the axial and elevation directions. This simplification is valid when the lateral resolution is much higher than the elevation resolution. We first generated the 3D vasculature matrix using the Insight Segmentation and Registration Toolkit (ITK) [20]. The toolkit generates vessels in 3D space with different diameters. We then cut the 3D vasculature matrix into 2D images. The crosssectional images were considered as the input source for PA simulation and the ground truth for the neural network training. Figure 1 shows the workflow of 2D simulation along the axial-elevation plane. PA source matrices were resized into 50 × 50 mm with 0.1 mm pixel size and placed 30 mm away from the transducer element to mimic the experimental setup. The arc-shape transducer moved along the elevation direction at 0.1 mm step size for 500 steps to acquire acoustic signals originating from predefined PA sources. The received A-line signals at different transducer locations were stacked into one image, which preserved the primary features along the elevation direction in the raw data. only simulated the acoustic wave propagation in the 2D space along the axial and elevation directions. This simplification is valid when the lateral resolution is much higher than the elevation resolution. We first generated the 3D vasculature matrix using the Insight Segmentation and Registration Toolkit (ITK) [20]. The toolkit generates vessels in 3D space with different diameters. We then cut the 3D vasculature matrix into 2D images. The cross-sectional images were considered as the input source for PA simulation and the ground truth for the neural network training. Figure 1 shows the workflow of 2D simulation along the axial-elevation plane. PA source matrices were resized into 50 × 50 mm with 0.1 mm pixel size and placed 30 mm away from the transducer element to mimic the experimental setup. The arc-shape transducer moved along the elevation direction at 0.1 mm step size for 500 steps to acquire acoustic signals originating from predefined PA sources. The received A-line signals at different transducer locations were stacked into one image, which preserved the primary features along the elevation direction in the raw data. The simulated raw data were then combined with electromagnetic interference (EMI) noise acquired from the experimental system. The EMI noises are caused by electromagnetic waves emitted from external sources (e.g., laser, stepper motor) and are shown as strips along the lateral direction. Five hundred frames of EMI noise were acquired in our experimental setup and were then randomly combined with the simulated data.

Image Reconstruction Algorithms
The experimental data can be reconstructed using either the 2D stack or the 3D focal-line algorithms. For the 2D stack, we first reconstructed each PA B-mode image individually and then stacked them along the elevation direction based on their scanning positions. In 3D focalline reconstruction, the data is reconstructed in 3D by considering the elevation aperture [21]. To properly back project the data in 3D, we used the focal line (a line at the linear array focus) as an auxiliary line to find out the receiving path of the photoacoustic signal. As the acoustic wavefront originating from the focal line hits the transducer surface simultaneously, for each point in 3D, we find the shortest path that goes past the focal line before reaching the transducer. The signals from each element were back-projected to a 3D space based on this principle [21].
Image reconstruction for the simulation data is a little bit different. Because we only simulated data along the axial-elevation plane, we do not need to consider reconstruction along the lateral direction. In the Deep-E study, we simply stacked envelop-detected A-line signals along the elevation direction to mimic 2D stack-reconstructed images in the axialelevation plane [17]. This approach cannot mimic 3D focal-line reconstruction. As we only The simulated raw data were then combined with electromagnetic interference (EMI) noise acquired from the experimental system. The EMI noises are caused by electromagnetic waves emitted from external sources (e.g., laser, stepper motor) and are shown as strips along the lateral direction. Five hundred frames of EMI noise were acquired in our experimental setup and were then randomly combined with the simulated data.

Image Reconstruction Algorithms
The experimental data can be reconstructed using either the 2D stack or the 3D focalline algorithms. For the 2D stack, we first reconstructed each PA B-mode image individually and then stacked them along the elevation direction based on their scanning positions. In 3D focal-line reconstruction, the data is reconstructed in 3D by considering the elevation aperture [21]. To properly back project the data in 3D, we used the focal line (a line at the linear array focus) as an auxiliary line to find out the receiving path of the photoacoustic signal. As the acoustic wavefront originating from the focal line hits the transducer surface simultaneously, for each point in 3D, we find the shortest path that goes past the focal line before reaching the transducer. The signals from each element were back-projected to a 3D space based on this principle [21].
Image reconstruction for the simulation data is a little bit different. Because we only simulated data along the axial-elevation plane, we do not need to consider reconstruction along the lateral direction. In the Deep-E study, we simply stacked envelop-detected A-line signals along the elevation direction to mimic 2D stack-reconstructed images in the axial-elevation plane [17]. This approach cannot mimic 3D focal-line reconstruction. As we only simulate the raw data in the axial-elevation plane, the focal line turns into a focal point. Therefore, we can use the virtual point concept to reconstruct the axial-elevation raw data [22]. More specifically, for every reconstruction point in the cross-section, we found the shortest path that goes across the focal point before reaching the transducer and used this path for the time-of-flight calculation. Figure 2 compares the axial-elevation images obtained from the 2D stack and virtual-point reconstruction algorithms. It can be seen that the virtual-point reconstruction recovers features better along the elevation direction. simulate the raw data in the axial-elevation plane, the focal line turns into a focal point. Therefore, we can use the virtual point concept to reconstruct the axial-elevation raw data [22]. More specifically, for every reconstruction point in the cross-section, we found the shortest path that goes across the focal point before reaching the transducer and used this path for the time-offlight calculation. Figure 2 compares the axial-elevation images obtained from the 2D stack and virtual-point reconstruction algorithms. It can be seen that the virtual-point reconstruction recovers features better along the elevation direction.

Input Dataset and Neural Network Parameters
As mentioned above, we used the images generated by the Insight Segmentation and Registration Toolkit as the ground truth. For training, we generated six 3D vascular matrices, and each was cut into 250 images, totaling 1500. We then added four different levels of white noise to the data (signal-to-noise ratios at 6 dB, 9 dB, 12 dB, and noise-free), generating 6000 (6 × 250 × 4) images in total. We did not apply rotation to the data as we wanted to preserve the depth-dependent elevation resolution. These cross-sectional vascular data were simulated into PA raw data and reconstructed using either the 2D stack or focal-point (line) algorithms mentioned in the section above. The 2D stack data was trained using the Deep-E network mentioned in reference [17]. The focal-point data was trained using a similar fully-dense UNet (FD-UNet).
Details of the FD-UNet were mentioned in [17]. Briefly speaking, the FD-UNet is based on a CNN-based residual architecture by adding a skip connection between the input and the output [23]. First, the spatial dimensions of the feature maps are repeatedly decreased to learn local and global features related to artifact removal. Then, the learned feature maps are spatially upsampled and combined to generate an output image of the same size as the input image.
Specifically, the input image size is 256 × 256. It first passes through a 3 × 3 convolution layer to obtain 16 feature maps. Then, it undergoes the following blocks:

•
Dense block: It consists of a sequence of a 1 × 1 and 3 × 3 convolution with batch normalization and ReLU activation function. The outputs from earlier convolutional layers are concatenated together as the input to the subsequent layer. • Down block: It is a learned downsampling operation that consists of a 1 × 1 convolution block with a stride of 1, and a 3 × 3 convolution block with a stride of 2. It gradually reduces the feature map size and increases the channel number. In the last layer of the downsampling path, we can obtain 512 feature maps with the size of 8 × 8.

•
Up block: It consists of a 3 × 3 transposed convolution block with a stride of 2 followed by ReLU activation function and batch normalization to expand the feature map size.
Finally, the feature map goes through a 1 × 1 convolution layer to obtain a residual image with a size of 256 × 256. By adding the residual image with the input image, we obtain the

Input Dataset and Neural Network Parameters
As mentioned above, we used the images generated by the Insight Segmentation and Registration Toolkit as the ground truth. For training, we generated six 3D vascular matrices, and each was cut into 250 images, totaling 1500. We then added four different levels of white noise to the data (signal-to-noise ratios at 6 dB, 9 dB, 12 dB, and noise-free), generating 6000 (6 × 250 × 4) images in total. We did not apply rotation to the data as we wanted to preserve the depth-dependent elevation resolution. These cross-sectional vascular data were simulated into PA raw data and reconstructed using either the 2D stack or focal-point (line) algorithms mentioned in the section above. The 2D stack data was trained using the Deep-E network mentioned in reference [17]. The focal-point data was trained using a similar fully-dense UNet (FD-UNet).
Details of the FD-UNet were mentioned in [17]. Briefly speaking, the FD-UNet is based on a CNN-based residual architecture by adding a skip connection between the input and the output [23]. First, the spatial dimensions of the feature maps are repeatedly decreased to learn local and global features related to artifact removal. Then, the learned feature maps are spatially upsampled and combined to generate an output image of the same size as the input image.
Specifically, the input image size is 256 × 256. It first passes through a 3 × 3 convolution layer to obtain 16 feature maps. Then, it undergoes the following blocks:

•
Dense block: It consists of a sequence of a 1 × 1 and 3 × 3 convolution with batch normalization and ReLU activation function. The outputs from earlier convolutional layers are concatenated together as the input to the subsequent layer. • Down block: It is a learned downsampling operation that consists of a 1 × 1 convolution block with a stride of 1, and a 3 × 3 convolution block with a stride of 2. It gradually reduces the feature map size and increases the channel number. In the last layer of the downsampling path, we can obtain 512 feature maps with the size of 8 × 8.

•
Up block: It consists of a 3 × 3 transposed convolution block with a stride of 2 followed by ReLU activation function and batch normalization to expand the feature map size.
Finally, the feature map goes through a 1 × 1 convolution layer to obtain a residual image with a size of 256 × 256. By adding the residual image with the input image, we obtain the targeted artifact-free image. To differentiate the FD-UNet used in 2D stack data training and focal-point data training, we named the first as 2D Deep-E and the latter as 3D Deep-E.

Image Co-Registration
After training, both the 2D Deep-E and 3D Deep-E networks can be used to improve the experimental data. However, due to the different time-of-flight calculation between 3D focal-line and 2D stack reconstruction, the two enhanced matrices cannot be perfectly overlaid and need to be co-registered to eliminate the mismatch. A MATLAB built-in function, imregdemons, is utilized to estimate the 3D displacement field and align the 2D Deep-E matrix with respect to the 3D Deep-E matrix [24,25]. After that, the aligned 2D Deep-E matrix and 3D Deep-E matrix are sliced and fused (using MATLAB function imfuse) along the axial-elevation plane to preserve features from both outputs. The fused 2D images are then stacked along the lateral direction to form the final 3D output.

Summary
In summary, we first simulated raw photoacoustic data in 2D space along the axial and elevation directions and combined them with the EMI noise (Step 1 in Figure 3). Second, we used envelop detection and virtual point reconstruction algorithms to reconstruct the raw data for 2D Deep-E and 3D Deep-E training, respectively. Subsequently, different levels of Gaussian noise were applied to the reconstructed data to increase the training size. We then input the images to the FD-UNet and trained the two networks separately (Step 2 in Figure 3). The total training time for each network is approximately 2 h in a workstation with AMD Ryzen 9 3950X CPU and NVIDIA GeForce RTX 2080 Ti GPU. After obtaining two trained networks, we used 2D stack and 3D focal-line algorithms to reconstruct the in vivo experimental data. The two reconstructed matrices were cut into 2D images and fed into the trained 2D deep-E and 3D deep-E networks, respectively. The output matrices from both networks were co-registered to combine the features from each algorithm (Step 3 in Figure 3).

Results
We evaluated our approach using simulation and human imaging results. First, we used simulated data with different noise levels to validate the networks' performance in noise removal, object size recovery, and elevation resolution improvement. Then, we used the in vivo experimental data acquired from our DSM system to validate the improved imaging depth and elevation resolution.

Validation with Simulated Vasculature
As mentioned above, the proposed algorithm aims to improve the elevation resolution and reduce noise in the image. These can be validated through simulation data, where the ground truth is available. Figure 4a shows the max amplitude projected (MAP) ground truth, where the image is color encoded by depth ranging from blue to red (representing shallow to deep). The red lines shown in Figure 4a indicate the locations of the cross-section images

Results
We evaluated our approach using simulation and human imaging results. First, we used simulated data with different noise levels to validate the networks' performance in noise removal, object size recovery, and elevation resolution improvement. Then, we used the in vivo experimental data acquired from our DSM system to validate the improved imaging depth and elevation resolution.

Validation with Simulated Vasculature
As mentioned above, the proposed algorithm aims to improve the elevation resolution and reduce noise in the image. These can be validated through simulation data, where the ground truth is available. Figure 4a shows the max amplitude projected (MAP) ground truth, where the image is color encoded by depth ranging from blue to red (representing shallow to deep). The red lines shown in Figure 4a indicate the locations of the cross-section images in Figure 4b. Figure 4c shows the input at different white noise levels. Here, the noise level was quantified based on the Gaussian white noise only.
Output from the network is shown in Figure 4d. It can be seen that most vessels can be clearly recovered when the SNR ratio is 12 dB or higher. When the SNR is reduced to 9 dB, some vessels with small diameters cannot be extracted. However, in most of our experimental data, the SNR is much higher than 9 dB. Figure 4e,f represent the crosssectional images extracted from the red line in Figure 4c,d, respectively. The features shown in Figure 4e exhibited poor axial and elevation resolution. By contrast, the images shown in Figure 4f possessed high resolution in both directions. Table 1 denotes the comparison of the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) index of input and output at different Gaussian noise levels. Our proposed method improves the PSNR in all cases. The SSIM is also significantly improved, indicating high similarity to the ground truth. To validate whether our method can recover the true size of the object, we compared the axial and elevation lengths of objects 1 and 2 in Figure 4b. Table 2 shows the comparison results. The object sizes quantified from the output images are very close to the ground truth.
in Figure 4b. Figure 4c shows the input at different white noise levels. Here, the noise level was quantified based on the Gaussian white noise only. Output from the network is shown in Figure 4d. It can be seen that most vessels can be clearly recovered when the SNR ratio is 12 dB or higher. When the SNR is reduced to 9 dB, some vessels with small diameters cannot be extracted. However, in most of our experimental

Validation with Phantom Data
To further validate whether our algorithm can recover the true object size in an experimental environment, we imaged two phantoms to evaluate the performance. The first phantom was inserted with 0.5 mm pencil leads at 8 mm depth increments. The pencil leads were located 30 to 60 mm away from the transducer surface, mimicking the imaging distance of an in vivo experiment. Figure 5a,b show the depth-encoded MAP images for the input and output combined with 2D and 3D Deep-E, respectively. Here, the depth indicates the distance to the surface of the phantom. Figure 5c,d denote the cross-sections indicated by the red line in Figure 5a,b, respectively. Compared to the input (Figure 5a,c), the object in the output (Figure 5b,d) exhibited a much finer structure in both axial and elevation directions. It can also be seen that our approach performs well on objects at different depths, which is essential for deep tissue imaging. Table 3 lists the average diameter of all pencil leads quantified based on the full width at half maximum (FWHM). The error bar was quantified based on the diameters at different locations. The diameters obtained from the output image are very close to the ground truth in both directions. For the second phantom, three pencil leads with different diameters of 0.5, 0.9, and 2 mm were inserted at the same depth, which is 40 mm away from the transducer. Figure 6a,b show the MAP images for the input and output combined with 2D and 3D Deep-E, respectively. The red line in Figure 6a,b indicates the location of cross-sectional images shown in Figure 6c,d. Table 4 lists the quantified diameters of pencil leads along the elevation direction. While there are some variations, overall, the recovered diameters are closer to the ground truths (in comparison to the input). As will be mentioned later in the discussion, the accuracy can be further improved by training a larger dataset with a broader range of diameters. In addition, we also observed higher contrast-to-noise ratio due to reduced background noise in the output images. Table 3. Diameter quantification along axial and elevation directions, and the contrast-to-noise ratio of pencil leads.

Averaged Pencil Lead Diameter (mm)
Contrast-to-Noise Ratio Axial Elevation

Validation with In Vivo Data
Subsequently, we evaluated the performance of our approach using in vivo data obtained from human breast imaging experiments. Images from three subjects were selected. Similar to the simulation data, all the images are depth-encoded from blue to red. Figure 7a,c,e denote the 2D-stack reconstructed images from three subjects, respectively. In all cases, vessels from deep regions cannot be clearly revealed. Figure 7b,d,f denote the images reconstructed using the 3D focal-line algorithm. As expected, more vessels can be visualized from deep regions (colored in red). However, some vessels near the skin surface were not clearly visible. Figure 7g,i,k denote the outputs from the 2D Deep-E, and Figure 7h,j,l represent the outputs from 3D Deep-E. The elevation resolution has been significantly improved in all images. However, similar to data from the input, some vessels (arrows 1 and 2) are more visible in the 2D stack reconstructed result, while some (arrows 3 and 4) are more visible in 3D reconstructed images. The final co-registered images are shown in Figure 7m,n,o. It can be seen that these three images maintain the high elevation resolution from 2D/3D Deep E processing, and vessels extracted from both networks are clearly visible.

Discussion
Following the success of 2D Deep-E, we further developed 3D Deep-E and combine outputs from both 2D Deep-E and 3D Deep-E images to get the best from both algorithm In 2D Deep-E, features reconstructed from 2D stack reconstruction are limited as it does n consider the receiving angle along the elevation direction. This inherent limitation leads

Discussion
Following the success of 2D Deep-E, we further developed 3D Deep-E and combined outputs from both 2D Deep-E and 3D Deep-E images to get the best from both algorithms. In 2D Deep-E, features reconstructed from 2D stack reconstruction are limited as it does not consider the receiving angle along the elevation direction. This inherent limitation leads to poor visualization of vessels extended along the lateral direction as well as features from deep regions. To address this issue, we combined reconstruction from both 2D stack and 3D focal-line reconstruction algorithms. In theory, applying 3D focal-line reconstruction alone would be sufficient to recover all the features [21]. However, 3D reconstruction is sensitive to the motion artifact induced by body movement and breathing. As a result, some vessels cannot be correctly reconstructed due to motion. By comparison, a 2D stack only reconstructs one frame at a time, mitigating the influence of motion. While 2D stack reconstruction is less sensitive to motion artifacts, it only reconstructs data in the axial and lateral plane, degrading image quality in deeper regions. Our approach effectively combines the benefits of both 2D and 3D reconstruction. Experimental results indicate that our method improves elevation resolution, recovers deeper vessels, and maintains the detailed vasculature from both reconstruction results. In addition, as we accounted for both Gaussian noise and EMI noise in training, the signal-to-noise ratio of the output image was significantly enhanced.
A detailed comparison between the two approaches can be seen in Figure 8. Compared to Figure 8a,c, Figure 8b,d possess more vessels from deep regions, such as regions marked by dashed boxes 1, 2, and 3 (more red-colored vessels). In the meanwhile, vessels extended along the lateral direction are displayed in a more continuous manner (such as the region marked by dashed box 4). Figure 8e,f show the cross-sectional MAP images of Figure 8a,b, respectively. It can be seen that Figure 8f clearly contains more vessels in the deep region. Similar outcomes can be seen in Figure 8g,h. respectively. It can be seen that Figure 8f clearly contains more vessels in the deep region. Similar outcomes can be seen in Figure 8g,h. Despite the encouraging results, future studies are still needed to optimize our method. First, we did not quantify whether the quantitative information is preserved, especially after image co-registration. This issue could be addressed by studying how the co-registration process changes the image intensity and whether proper weighting can be applied to preserve Despite the encouraging results, future studies are still needed to optimize our method. First, we did not quantify whether the quantitative information is preserved, especially after image co-registration. This issue could be addressed by studying how the co-registration process changes the image intensity and whether proper weighting can be applied to preserve the original signal distribution. Second, 3D focal-line reconstruction is more sensitive to motion artifacts, which cannot be avoided in the current setup. We can resolve this issue by using a higher pulse repetition frequency laser to shorten the scanning time and minimize the motion artifact. Third, vessels with large diameters in the input image might not be accurately recovered to the true size. This is due to the limited training dataset, which does not contain sufficient samples of large-diameter vessels. In the future, vessel samples with a broader diameter range could be utilized to address this issue. Finally, the vessel continuity can be further improved. As our approach optimized the image quality based on the cross-sectional image from the reconstructed matrix, connections with adjacent frames were not considered. The extracted features from each cross-section might have slight misalignment, degrading the continuity in the final 3D image. In the future, we can develop an algorithm to consider the connection between frames to improve continuity.
It should be noted that, similar to other software-based approaches, our algorithm cannot achieve real super-resolution imaging as the hardware remains the same as in the conventional imaging [26]. However, compared to other approaches, such as deconvolution, our method has the following advantages. First, the processing speed is fast. After training, it takes less than 10 s to process reconstructed data. Second, it can recover the true diameter of the object, as shown in Figure 5b,d. In comparison, deconvolution based on a single point spread (PSF) function cannot recover the true size of objects at different depths as the PSF would change. Third, the system noise is built into the training. Therefore, our method is less affected by noise than deconvolution.

Conclusions
In this study, we developed a combined 2D Deep-E and 3D Deep-E approach to improve linear-array-based PACT. For 3D Deep-E training, we applied the virtual-point detector concept to generate the input data, mimicking images reconstructed from the 3D focal-line reconstruction algorithm. After obtaining output from 2D Deep-E and 3D Deep-E, we co-registered the two matrices to remove mismatches caused by the two reconstruction algorithms. The numerical validation demonstrated the method's capability in elevation resolution enhancement, true feature size recovery, and noise reduction. In vivo validation further proved that our method can better recover deep vascular features and maintain detailed vasculature from both reconstruction algorithms. In summary, we further enhanced the performance of the Deep-E network by combining it with 3D focalline reconstruction. Our results demonstrated improved image resolution and deeper vessel recovery.  Institutional Review Board Statement: The study was conducted in accordance with the protocol (STUDY00000171) approved by the Institutional Review Board of the University at Buffalo on 2 August 2022.
Informed Consent Statement: All experimental protocols were approved by the University at Buffalo Institutional Review Board (IRB) and informed consent was obtained from all subjects (adults).

Data Availability Statement:
Publicly available scripts and datasets are uploaded to Github and the link is attached. https://github.com/wzheng26/3D-Deep-E.git. Accessed on: 16 September 2022. Data available upon request from authors.

Conflicts of Interest:
The authors declare no competing interests. Jun Xia is the founder of Sonioptix, LLC., which, however, did not support this work.