3.1. Computer Vision Pipeline
The image capture process in HANNDI is designed to produce a single, high-quality, all-in-focus image of a hole’s interior for downstream processing. An Arducam 64 MP HawkEye camera is connected to a Raspberry Pi 5, providing high-resolution imaging with full software control over lens position. A NeoPixel LED light ring is mounted concentrically around the lens to ensure consistent illumination of the inspection area regardless of ambient lighting conditions. The LEDs are activated immediately prior to image capture and turned off afterward to conserve power and prevent glare.
The design of HANNDI enforces a minimum standoff distance of 3.15″ (8.0 cm), corresponding to the HawkEye’s closest in-focus range. With the camera head geometry fixed, the effective detection range extends from the hole entrance down to a depth of approximately 0.75″ (1.9 cm). To cover this range, the system captures a focal sweep of sixteen images at incrementally increasing lens positions. Features beyond this depth, such as through-hole backgrounds, remain out of focus and are suppressed in the stacked image.
Although all experiments used the HawkEye module, the pipeline is generalizable to other cameras. Using a different lens or sensor would require recalibration of the minimum in-focus distance and an adjustment of the focal sweep increments, but the registration, stacking, and CNN inference stages remain unchanged.
The camera automatically handles exposure and white balance via the Picamera2 driver. Because each focal sweep is illuminated entirely by the fixed-intensity LED ring, the auto exposure and white balance routines converge quickly and produce stable results. Preliminary tests showed that locking these parameters provided no measurable improvement, so the simpler auto-adjust mode was adopted.
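As a minimal sketch, the focal sweep can be driven through Picamera2’s manual focus controls. The lens positions below (in dioptres) are illustrative assumptions derived from the 3.15″ (8.0 cm) standoff and 0.75″ (1.9 cm) depth range rather than the deployed calibration, and the settling delay is likewise assumed.

```python
# Focal-sweep capture sketch (Picamera2). LensPosition is in dioptres
# (1 / focus distance in metres); the sweep limits and settling delay
# below are illustrative assumptions, not the deployed calibration.
import time
import numpy as np
from picamera2 import Picamera2
from libcamera import controls

NUM_STEPS = 16
NEAR = 1.0 / 0.080   # ~12.5 dioptres: focus at the 8.0 cm hole entrance
FAR = 1.0 / 0.099    # ~10.1 dioptres: focus ~1.9 cm deeper into the hole

picam2 = Picamera2()
picam2.configure(picam2.create_still_configuration())
picam2.start()

frames = []
for pos in np.linspace(NEAR, FAR, NUM_STEPS):
    # Step the lens manually; auto exposure/white balance remain enabled.
    picam2.set_controls({"AfMode": controls.AfModeEnum.Manual,
                         "LensPosition": float(pos)})
    time.sleep(0.3)                       # allow the lens to settle (assumed delay)
    frames.append(picam2.capture_array("main"))   # frame as a numpy array

picam2.stop()
```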
Each frame is saved at
resolution to balance processing efficiency and image detail. Prior to stacking, frames are registered using ORB feature matching with affine warping. Registration accuracy was quantified using the root-mean-square centroid radius, which measures the pixel spread of keypoint centroids across the focal sweep. In two representative sweeps, this metric ranged from 60 to 70 px before alignment and was reduced to 1–2 px after alignment, effectively eliminating ghosting and preserving fine detail as seen in
Figure 1. This step is critical in handheld use, where small operator motions can otherwise introduce misalignments between slices.
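The registration step can be sketched with OpenCV as follows. The feature count and use of RANSAC are assumptions rather than the deployed settings, and the RMS centroid-radius helper reflects our reading of the metric described above.

```python
# Registration sketch: align every slice to the first frame with ORB feature
# matching and a RANSAC-estimated affine warp. Parameter values are assumptions.
import cv2
import numpy as np

def register_to_reference(frames):
    orb = cv2.ORB_create(nfeatures=2000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    ref_gray = cv2.cvtColor(frames[0], cv2.COLOR_RGB2GRAY)
    ref_kp, ref_des = orb.detectAndCompute(ref_gray, None)
    h, w = ref_gray.shape

    aligned = [frames[0]]
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        kp, des = orb.detectAndCompute(gray, None)
        matches = matcher.match(des, ref_des)
        src = np.float32([kp[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([ref_kp[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        M, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)
        aligned.append(cv2.warpAffine(frame, M, (w, h)))
    return aligned

def rms_centroid_radius(keypoint_sets):
    # Per-frame keypoint centroids, then the RMS distance of those centroids
    # from their mean: the alignment-quality metric reported above.
    c = np.array([np.mean([k.pt for k in kps], axis=0) for kps in keypoint_sets])
    return float(np.sqrt(np.mean(np.sum((c - c.mean(axis=0)) ** 2, axis=1))))
```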
Following alignment, the sixteen RGB images are fused into a single all-in-focus image via focal stacking. Each frame is converted to grayscale, and a variance-based sharpness map is computed to highlight areas of high local detail. The sharpness maps are smoothed to reduce noise, and for each pixel location, the frame with the highest sharpness value is selected. The corresponding RGB pixel is then inserted into the output image. This process not only brings all surfaces within the 0.75″ depth into focus but also suppresses through-hole clutter, since distant backgrounds remain blurred and are excluded during pixel selection.
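A compact sketch of this per-pixel selection rule is shown below; the local-variance and smoothing window sizes are illustrative assumptions.

```python
# Focal-stacking sketch: smoothed local-variance sharpness maps, then per-pixel
# selection of the sharpest slice. Window sizes are illustrative assumptions.
import cv2
import numpy as np

def focus_stack(aligned_frames, var_win=9, smooth_win=15):
    sharpness = []
    for frame in aligned_frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY).astype(np.float32)
        mean = cv2.blur(gray, (var_win, var_win))
        mean_sq = cv2.blur(gray * gray, (var_win, var_win))
        local_var = mean_sq - mean * mean                       # variance-based sharpness map
        sharpness.append(cv2.GaussianBlur(local_var, (smooth_win, smooth_win), 0))

    best = np.argmax(np.stack(sharpness), axis=0)               # winning slice per pixel
    fused = np.zeros_like(aligned_frames[0])
    for i, frame in enumerate(aligned_frames):
        fused[best == i] = frame[best == i]                     # copy the sharpest RGB pixel
    return fused
```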
The stacked image is then cropped to a 640 × 640
pixel region centered on the hole. This cropping reduces background clutter while preserving sufficient resolution for accurate analysis. The resulting image serves as the input to the Hole Identifier convolutional neural network described later, which detects and localizes the hole for subsequent foreign object classification. The combination of controlled lighting, fixed standoff distance, focal stepping, feature-aligned stacking, and targeted cropping ensures that the network consistently receives high-quality, information-rich inputs suitable for robust inference under realistic handheld operating conditions. An example final image generated from this pipeline can be seen in
Figure 2.
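Assuming the fixed camera head geometry keeps the hole near the image centre, the crop reduces to a simple centre crop; the 640 × 640 size is taken from the Hole Identifier input described in Section 3.3.

```python
# Centre-crop sketch; assumes the hole sits near the image centre due to the
# fixed standoff geometry, and uses the assumed 640 x 640 Hole Identifier input size.
def center_crop(image, size=640):
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]
```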
To justify the choice of 16 focal steps, a trade-off study compared 12, 16, and 20 slices. The median Laplacian variance of the final stacked image, a measure of sharpness, improved from ≈425 at 12 steps to ≈435 at 16 steps, with only marginal gains up to ≈445 at 20 steps. Median runtime for image acquisition, feature matching, and stacking scaled nearly linearly: 5.9 s at 12 steps, 7.1 s at 16 steps, and 8.7 s at 20 steps. Sixteen steps provided the best balance of sharpness and runtime, while also avoiding the variability observed at 20 steps. Moreover, Lockheed Martin defined a target of under 10 s for the scanning stage; 20 steps at 8.7 s left little margin for hole detection and FOd classification, whereas the 7.1 s runtime at 16 steps provided comfortable headroom. These factors together motivated the selection of 16 steps as the operational setting, as shown in
Figure 3.
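For reference, the sharpness figures above follow the variance-of-Laplacian measure; a minimal implementation in its common OpenCV form is shown below (the reported values are medians over stacked images).

```python
# Variance-of-Laplacian sharpness metric (higher = sharper), in its common
# OpenCV formulation; assumes an RGB input image.
import cv2

def laplacian_variance(rgb_image):
    gray = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())
```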
3.2. Physical System Design
The HANNDI device is designed for portable, real-world deployment, with an emphasis on reliable operation, ergonomic use, and clean integration of all imaging and control components. At the core of the power system is a 7.4 V, 5000 mAh, 50 C LiPo battery that enables untethered operation. A toggle switch is mounted inline to allow the user to fully power down the system between uses, conserving battery life. A 5 A inline fuse protects downstream hardware in the event of a short circuit or sudden current surge. Power is regulated through a buck converter that steps the voltage down to a stable 5 V and supplies up to 5 A, ensuring the Raspberry Pi 5 and its peripherals receive consistent and adequate power during operation.
The Raspberry Pi 5 serves as the central controller, managing both image processing and peripheral communication. Connected peripherals include a pushbutton for initiating image capture, a NeoPixel LED light ring for consistent illumination, an Arducam 64 MP camera for high-resolution imaging, and an LCD screen for providing real-time feedback to the user. Together, these components draw moderate current, and under typical load conditions, the system achieves approximately six hours of continuous operation per full battery charge.
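As a rough consistency check on the quoted endurance, the implied average load can be estimated from the battery capacity; the converter efficiency below is an assumed figure, not a measured one.

```python
# Implied average load from the stated 7.4 V, 5000 mAh pack and ~6 h runtime.
# The 90% buck-converter efficiency is an assumption for illustration.
battery_wh = 7.4 * 5.0            # 37 Wh of stored energy
runtime_h = 6.0
efficiency = 0.90
avg_load_w = battery_wh * efficiency / runtime_h
print(f"~{avg_load_w:.1f} W average, ~{avg_load_w / 5.0:.1f} A at the regulated 5 V rail")
```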
All electrical and mechanical components are housed in a purpose-built smart tool that supports field use by manufacturing personnel. The CAD model design for HANNDI is shown in
Figure 4. The handle section houses the LiPo battery, power switch, inline fuse, buck converter, and image capture button. Above the handle is an electronics enclosure that contains the Raspberry Pi 5, internal wiring, and the LCD display mounted for easy visibility during operation.
Mounted to the front of the electronics box is the camera head, which houses both the Arducam camera and the LED ring light. The head is mechanically connected via a spring that allows for positional adjustment, enabling users to manually rotate the camera to align it with a hole or fastener site of interest. This feature is especially important when working on complex geometries or hard-to-reach areas.
The camera head encloses both the camera and LED ring in a barrel-like geometry, leaving only a 0.25″ (6.3 mm) mechanical gap between the head and the inspected surface. This gap is set by the compression limits of the shocks in the suspension mechanism. As a result, nearly all ambient light is excluded from the inspection region, and illumination is dominated by the integrated LED ring. The suspension design, combined with the fixed 3.15″ (8.0 cm) standoff enforced by the camera head geometry, keeps the lens axis perpendicular to the inspected surface, including on gently curved aerospace components. This compliance preserves the required standoff distance across sloped or contoured features, keeping the imaging geometry consistent across parts.
The full device measures approximately 18″ (46 cm) in length when the camera head is not rotated. It is intended for two-handed operation, with one hand gripping the handle for control and the other adjusting the camera head for alignment. This combination of stability and adjustability allows for accurate image capture in a wide range of aerospace inspection environments. The completed prototype can be seen in
Figure 5, where the adjustable camera head and suspension mechanism are also shown in action on a curved surface.
For hole illumination, the system uses a concentric NeoPixel RGBW LED ring with 12 integrated 5050 LEDs arranged at a 36.8 mm outer diameter and 23.3 mm inner diameter. Each LED includes 8-bit PWM control of the R, G, B, and W channels. In this work, equal RGB values of (5, 5, 5) were applied, corresponding to a roughly 2% duty cycle per channel. This provided sufficient brightness for inspection without saturating the sensor.
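A minimal illumination-control sketch is shown below, assuming the Adafruit CircuitPython NeoPixel library; the data pin, pixel order, and the exact driver used on the Raspberry Pi 5 are assumptions rather than the deployed configuration.

```python
# LED ring control sketch using the Adafruit CircuitPython NeoPixel library.
# The data pin, pixel order, and driver choice are assumptions; (5, 5, 5) gives
# roughly a 2% duty cycle per colour channel, with the white channel left off.
import board
import neopixel

RING_PIXELS = 12
ring = neopixel.NeoPixel(board.D18, RING_PIXELS, bpp=4,
                         pixel_order=neopixel.GRBW, auto_write=False)

def ring_on(level=5):
    ring.fill((level, level, level, 0))   # equal R, G, B; W channel off
    ring.show()

def ring_off():
    ring.fill((0, 0, 0, 0))
    ring.show()
```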
Illuminance tests confirmed that the LED ring overwhelmingly dominated residual ambient factory lighting. In a test environment, under only ambient illumination, the hole surface measured about 455 lux. With the camera head in place and LEDs off, this dropped below 5 lux due to the head blocking stray light. With the LEDs active, the level rose to approximately 300 lux. This corresponds to a flash-to-residual-ambient ratio greater than 60:1, since the camera head effectively reduced ambient light at the surface to negligible levels while the LEDs provided controlled illumination. As a result, the illuminance in captured images was determined almost entirely by the LED system, independent of external lighting variability.
3.3. CNN Methodology and Training
The HANNDI device uses two CNNs based on the YOLOv7-Tiny architecture. A primary motivation of this work was the ability to deploy a single, unified backbone across both pipeline stages. YOLOv7-Tiny could be used without modification as a detector for the Hole Identifier CNN and, in single-bounding-box mode, as a classifier for the FOd Classifier CNN [
16]. By contrast, adopting base CNNs such as MobileNet or EfficientNet would have required using one variant for classification and a modified version such as MobileNet-SSD or EfficientDet for detection. That approach would have introduced additional training procedures and integration overhead. The ability to reuse a single, unaltered YOLO model streamlined implementation and reduced complexity on the embedded platform.
YOLOv7-Tiny also delivered strong out-of-the-box performance on both hole detection and FOd classification. Prior literature shows that MobileNet-SSD and EfficientDet variants achieve accuracy broadly comparable to YOLO models on small-object detection, without clear gains [
17,
18]. A similar conclusion applies to base MobileNet or EfficientNet classifiers, which are unlikely to yield significant improvements. Given the strong baseline achieved by our trained CNNs, as discussed later in this subsection, additional benchmarking against alternative backbones was unlikely to provide meaningful benefit in the present context.
The architecture further provides extensibility and practical embedded performance. Although the current FOd stage performs classification only, the same YOLO framework can readily be extended to localize debris within holes, enabling a future transition from classification to detection with minimal modification. By contrast, an EfficientNet-based classifier would require a switch to a different architecture, such as EfficientDet, complicating deployment.
The model is also designed for low-latency inference and runs natively on lightweight hardware such as the Raspberry Pi 5. While EfficientDet and MobileNet-SSD have been reported to achieve slightly faster inference times [
19], these differences are negligible in our context. Image acquisition, registration, and stacking require approximately seven seconds per scan as discussed in
Section 3.1, whereas CNN inference is on the order of milliseconds. We can therefore conclude that the total scan time is dominated by operations other than CNN inference, making any minor gains in inference speed irrelevant.
Moreover, the novelty of this work lies not in CNN benchmarking, but in the design, integration, and deployment of a unified inspection system for aerospace use on resource-constrained embedded hardware. This study demonstrates that a single lightweight object detector can serve dual roles, both detection and classification, without compromising accuracy, enabling real-time operation within a handheld inspection device.
Having established the rationale for the YOLO backbone in our CNN pipeline, we now describe the dataset and training procedures used for both models. All training images were collected directly with the HANNDI prototype on Lockheed Martin aerospace assets, ensuring consistent imaging geometry and illumination between training and deployment. The proprietary dataset contained about 3700 images, of which approximately 400 contained no hole. The remaining roughly 3300 images contained holes staged on parts in a simulated aerospace manufacturing environment; some of these holes were intentionally left clean, while the rest contained one of five FOd types: blue masking tape, metallic burrs, composite shreds, sealant flakes, and dust bunnies.
In the CNN pipeline, two YOLOv7-Tiny models were used. The first, the Hole Identifier CNN, detects and localizes holes in 640 × 640 stacked images. If no hole is detected, the pipeline terminates. If a hole is found, the region is cropped to 320 × 320 pixels and passed to the second model, the FOd Classifier CNN, which determines whether the hole is clean or contains debris.
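The two-stage flow can be sketched as follows. The sketch assumes both models have been exported to ONNX with non-maximum suppression folded into the detector, so that it returns rows of [x1, y1, x2, y2, score, class]; the file names, confidence threshold, preprocessing, and class ordering are all assumptions rather than the deployed configuration.

```python
# Two-stage inference sketch with ONNX Runtime. The detector output format,
# file names, threshold, and class order are assumptions for illustration.
import cv2
import numpy as np
import onnxruntime as ort

hole_detector = ort.InferenceSession("hole_identifier.onnx")
fod_classifier = ort.InferenceSession("fod_classifier.onnx")

def to_input(img, size):
    x = cv2.resize(img, (size, size)).astype(np.float32) / 255.0
    return x.transpose(2, 0, 1)[None]                 # NCHW tensor

def inspect(stacked_640):
    det_in = {hole_detector.get_inputs()[0].name: to_input(stacked_640, 640)}
    dets = hole_detector.run(None, det_in)[0]         # assumed [x1, y1, x2, y2, score, class] rows
    dets = dets[dets[:, 4] > 0.5]                     # confidence threshold (assumed)
    if len(dets) == 0:
        return "no hole detected"                     # pipeline terminates

    x1, y1, x2, y2 = dets[np.argmax(dets[:, 4]), :4].astype(int)
    crop = stacked_640[max(y1, 0):y2, max(x1, 0):x2]
    cls_in = {fod_classifier.get_inputs()[0].name: to_input(crop, 320)}
    scores = fod_classifier.run(None, cls_in)[0].squeeze()
    classes = ["clean", "masking tape", "metallic burr",
               "composite shred", "sealant flake", "dust bunny"]   # order is an assumption
    return classes[int(np.argmax(scores))]
```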
For the Hole Identifier CNN, the full dataset of 3700 images was split 80–20% into training and validation sets. The network was trained for 30 epochs using standard YOLOv7 procedures. Validation metrics initially rose rapidly and stabilized after approximately 20 epochs. At convergence, the model consistently achieved >95% precision and recall, with mAP@0.5 ≈ 0.98. Training and validation loss curves decreased smoothly with no signs of overfitting, with final validation losses stabilizing at around 0.03. These results, as shown in
Figure 6 and
Figure 7, confirm accurate and stable hole localization across varied imagery.
The 3300 hole images were cropped using the Hole Identifier CNN to form the 320 × 320 dataset for the FOd Classifier CNN. This dataset was split into 70% training, 20% validation, and 10% testing. The network was trained for 75 epochs, with validation metrics plateauing after 60 epochs. Final validation performance reached 95% precision and recall across all six classes. Training and validation loss curves tracked closely, with final validation losses stabilizing at around 0.01, again showing no evidence of overfitting. The 10% test set was held out for independent evaluation, which is discussed later. Training dynamics are shown in
Figure 8 and
Figure 9.
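For reference, a reproducible 70/20/10 split of the cropped images can be generated as below; the directory layout, file extension, and random seed are assumptions.

```python
# Reproducible 70/20/10 train/validation/test split of the cropped FOd images.
# The class-per-folder layout, file extension, and seed are assumptions.
import random
from pathlib import Path

files = sorted(Path("fod_crops").glob("*/*.png"))
random.Random(42).shuffle(files)

n = len(files)
train = files[:int(0.70 * n)]
val = files[int(0.70 * n):int(0.90 * n)]
test = files[int(0.90 * n):]
print(len(train), len(val), len(test))
```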