2. Materials and Methods
2.1. Framework Overview and Problem Definition
This study addresses end-to-end structural parsing of camera-acquired images of hand-drawn circuit diagrams. More specifically, we formulate the task as topology recovery with semantic completion: given an acquired circuit image, the system should recover not only component identities, but also text-linked attributes, electrically valid connectivity, and terminal-role semantics needed for downstream interpretation. Framed this way, the problem is not limited to isolated symbol recognition; it is a vision-based image-acquisition and image-processing problem in which structurally meaningful circuit recovery must remain robust to realistic hand-drawing variability and acquisition degradation.
To solve this task, we implement a fixed ten-stage topology-consistent structural parsing framework: (1) multi-class object detection, (2) text recognition and text–component association, (3) keypoint detection, (4) keypoint aggregation and stabilization, (5) wire-structure enhancement, (6) wire connected-component (CC)-guided connectivity reasoning, (7) endpoint semantic inference, (8) multi-source information fusion, (9) SPICE netlist generation, and (10) downstream circuit simulation. All reported experiments use this fixed execution order.
Figure 1 summarizes the full processing chain from acquired hand-drawn input to structured output. The framework combines local visual perception of components, text, and terminals with wire-supported topology recovery and endpoint-semantic completion so that the final representation remains directly exportable to SPICE-compatible netlists. Algorithm 1 condenses the fixed execution order used throughout the study and shows how the released system moves from acquired image evidence to structured circuit output and downstream simulator-ready export.
To concretely illustrate the stage-wise behavior of the proposed framework, the coupled RLC circuit in
Figure 1,
Figure 2 and
Figure 3 is used as an illustrative workflow example.
Figure 2 highlights representative intermediate outputs from local perception, wire CC-guided connectivity reasoning, and endpoint semantic inference, while
Figure 3 shows the fused structured result that is subsequently exported for SPICE netlist generation and downstream simulation. This example is included for methodological illustration of the processing chain, not as quantitative benchmark evidence.
| Algorithm 1: Top-Level Pipeline for Topology-Consistent Structural Parsing. |
| Input: Camera-acquired hand-drawn circuit image I |
| Output: Structured circuit graph and SPICE netlist |
| 1 detect components, text, terminals, and crossovers in I |
| 2 recognize cropped text and associate it to nearby components using an adaptive scale-aware rule |
| 3 predict candidate wire-junction and component-terminal keypoints |
| 4 aggregate and stabilize keypoints into circuit nodes and component terminals |
| 5 suppress component interiors and enhance wire structure in I |
| 6 recover electrical connectivity by wire CC-guided reasoning on |
| 7 infer endpoint semantics for supported direction-sensitive and multi-terminal components |
| 8 fuse detections, text-linked attributes, connectivity, and endpoint semantics into a unified circuit graph |
| 9 export as a SPICE-compatible netlist |
| 10 optionally execute in downstream ngspice validation |
| 11 return |
2.2. Training Data Sources
Different stages of the framework are trained using task-specific datasets rather than a single shared annotation source.
Table 1 summarizes the main training sources, their roles in the pipeline, and the evaluation benchmark used only for final testing.
For component/text detection, the training data were formed by merging the publicly available CGHD dataset [
31] and Digitize-HCD dataset [
32] with additional hand-drawn circuit samples collected and annotated in this study. The resulting unified label system contains 57 classes, including 54 electronic component classes and three auxiliary non-component classes (
text,
terminal, and
crossover). The OCR dataset was created by cropping text instances from the detection training pool. The node/terminal dataset contains manually annotated wire-junction and component-terminal keypoints together with line annotations for HAWP-based training. The endpoint-semantic dataset was derived from Digitize-HCD terminal-semantic annotations [
32] and then manually verified and corrected before training.
2.3. Independent Benchmark, Annotation Protocol, and Subgroup Definitions
The independent benchmark used for final evaluation contains 1317 hand-drawn circuit diagrams and was kept fully separate from all training and validation data. All benchmark images were acquired by camera. The benchmark contains two mutually exclusive diagram-origin groups: 972 photographs of pre-existing hand-drawn diagrams and 345 photographs of newly drawn hand-drawn diagrams created by researchers to enrich coverage of common component categories and circuit structures. Across both groups, natural hand-drawing artifacts such as stroke-width variation, symbol distortion, irregular spacing, and layout inconsistency were intentionally retained. Camera-acquisition effects such as illumination variation, paper texture, shadows, and resolution differences were also preserved.
Ground-truth labels were produced under a predefined annotation protocol. Component classes, text strings, circuit nodes, connectivity relations, and endpoint-semantic labels were annotated using consistent rules. Wire crossings and junctions were annotated according to explicit visual cues. Arc- or bridge-style crossings were labeled as non-junction crossover structures, meaning that the wires visually cross but are not electrically connected. In contrast, intersections marked with a junction or solder dot were labeled as electrical junctions, meaning that the incident wires belong to the same circuit node. Thus, visually crossing wires were not automatically treated as electrical connections unless explicit node evidence was present. Text was linked to the owning component for OCR/association evaluation, and terminal-role labels were recorded for supported direction-sensitive or multi-terminal component families. Benchmark annotations were generated and cross-checked by two annotators. The first annotator produced the initial structured annotations, and the second annotator independently reviewed the complete benchmark under the same annotation guideline. To quantify annotation reliability, we report image-level exact inter-annotator agreement, where an image was counted as agreed only when the two annotators produced the same structured annotation, including component labels, text ownership, circuit connectivity, junction/crossover interpretation, and endpoint-semantic labels where applicable. The two annotators reached complete agreement on 1294 of the 1317 benchmark images, corresponding to an initial exact inter-annotator agreement of 98.25%. The remaining 23 images were reviewed case by case, and final labels were assigned only after consensus adjudication. Typical disagreements involved dense crossings, faint strokes, unclear handwritten marks, or component–text ownership in crowded regions. Rule-based consistency checks were applied before final release. The overall composition of the independent benchmark, together with the predefined subgroup definitions used in subsequent analyses, is summarized in
Table 2.
These statistics indicate that the benchmark is structurally nontrivial and source-diverse, with frequent crossover patterns, widespread endpoint-semantic applicability, and a broad component-count range. Importantly, the subgroup definitions used later in the Results section were fixed from benchmark composition in advance rather than introduced after inspection of model performance.
2.4. Pipeline Modules
The released system was implemented as a fixed staged workflow covering detection, OCR, node/terminal prediction, wire enhancement, connectivity reasoning, endpoint semantic inference, structured fusion, and netlist generation/simulation. Across all reported experiments, the execution order was kept unchanged and the module-specific settings were frozen in the version-locked configuration bundle associated with the manuscript snapshot.
2.4.1. Input Acquisition and Preprocessing
The pipeline operates on RGB or grayscale input images; in the independent benchmark evaluated in this study, all inputs are camera photographs of hand-drawn circuit diagrams. Before inference, each image is resized while preserving aspect ratio so that the longer side equals 1024 pixels, followed by intensity normalization and 8-bit conversion. This preprocessing standardizes computational input scale and intensity range but does not remove genuine acquisition-related degradation; no manual cleaning, stroke repair, or redrawing is performed before inference.
2.4.2. Component and Text Detection
Electronic components, text regions, terminal markers, and crossover markers are jointly detected using a You Only Look Once version 10 (YOLOv10)-based multi-class detector [
33]. A unified detection formulation is adopted because handwritten text and component symbols often overlap or appear in close proximity, making a decoupled detection sequence less reliable for downstream parsing. During inference, low-confidence detections are discarded and non-maximum suppression is applied to overlapping predictions. The component taxonomy also accounts for common resistor-symbol conventions: zigzag resistor symbols and rectangular resistor symbols are both mapped to the same semantic class,
resistor, whereas fuse symbols are retained as a separate component class. Thus, the detector distinguishes supported resistor styles from fuse symbols at the component-detection stage rather than treating all visually rectangular or resistor-like symbols as the same class.
2.4.3. Text Recognition and Text–Component Association
Detected text regions are cropped and recognized using a PARSeq-based OCR model trained on 95,544 cropped text samples [
34]. The recognized strings are then associated with nearby detected components to recover labels and handwritten parameter values required for structured circuit representation.
This association step is necessary because hand-drawn circuit text varies markedly in size, placement, and writing style. A fixed global pixel threshold is therefore unreliable: a threshold suitable for a large handwritten value may reject a nearby small identifier, whereas a more permissive threshold may attach unrelated text in dense layouts. The released pipeline instead uses a nearest-candidate plus adaptive acceptance rule. For each recognized text instance, the nearest candidate component is first identified by spatial proximity. The match is accepted only when the text–component distance is below a threshold proportional to the text-box scale. In this way, larger text regions are allowed a proportionally larger matching radius, while smaller text regions are constrained more tightly.
Algorithm 2 summarizes this scale-aware association procedure. This simple design improves the stability of text–component matching across different handwriting scales and local drawing densities.
| Algorithm 2: Adaptive Text–Component Association. |
|
2.4.4. Keypoint Prediction, Aggregation, and Stabilization
Candidate wire-junction and component-terminal points are predicted using a Holistically Attracted Wireframe Parsing (HAWP)-based wireframe parsing model [
35]. In preliminary experiments, both HAWP and LCNN [
36] were trained on the same task. Direct line-level outputs from both models were unstable under hand-drawn conditions because wires are often curved, fragmented, weakly connected, and inconsistent in stroke width. For this reason, the released system does
not directly trust raw wireframe segments as the final topology substrate. Instead, it retains the more stable HAWP point predictions as candidate circuit keypoints and lets the subsequent wire-enhancement and connectivity stage decide which point-to-point relations are actually supported by wire evidence. This point-first design is important because it separates
where potentially meaningful junctions or terminals are from
how they should be connected.
Because multiple nearby predictions may correspond to the same physical junction or terminal, the candidate points are merged using Density-Based Spatial Clustering of Applications with Noise (DBSCAN)-based clustering [
37]. Cluster centers are then partitioned into circuit nodes and component terminals according to their spatial relation to detected component boxes. This aggregation and stabilization step reduces duplicate predictions, suppresses jitter from local heatmap offsets, and improves the reliability of the subsequent connectivity stage. Algorithm 3 summarizes the operational logic of this stage.
| Algorithm 3: Adaptive Keypoint Aggregation and Node/Terminal Partitioning. |
|
2.4.5. Wire Enhancement and Wire CC-Guided Connectivity Reasoning
Detected component regions are suppressed, and the remaining image content is enhanced to recover weak or fragmented wire strokes. The resulting wire mask provides the structural evidence for subsequent connectivity reasoning.
The reasoning procedure contains five components. (A) Wire CC construction with light repair: before connected-component labeling, a mild repair step bridges only small local gaps to reduce unnecessary fragmentation of weak strokes. (B) Point-to-wire snapping: stabilized circuit nodes and component terminals are snapped to the nearest valid wire pixels so that graph reasoning is anchored to observed wire structure. (B0) Explicit crossover straight-through pairing: detected arc- or bridge-style crossover markers are treated as non-junction events, and opposite local directions are paired to preserve straight-through continuity without introducing spurious electrical nodes. By contrast, junction-dot or solder-dot evidence is treated as an electrical node cue, so the incident wires are fused into the same circuit node. (C) Within-CC sparse graph selection: candidate edges are first resolved inside each retained wire CC under wire-support and degree constraints. (D) Short-range inter-CC bridge validation: only after within-CC reasoning are limited cross-CC repairs considered, and they are accepted only when local dilation and sampled wire support jointly indicate a plausible short bridge.
This design uses wire CC membership as the primary structural prior and applies inter-CC repair only when necessary. Algorithm 4 summarizes the procedure.
| Algorithm 4: Wire CC-Guided Connectivity Reasoning. |
|
2.4.6. Endpoint Semantic Inference, Structured Fusion, and Netlist Export
Direction-sensitive and multi-terminal devices require more than connectivity alone, because their terminal roles affect electrical interpretation. In this sense, endpoint semantics is not an auxiliary embellishment of the graph; it is part of what makes the recovered structure electrically meaningful. Unlike prior systems that typically stop at terminal localization, component orientation, or coarse connection reconstruction [
18,
19,
22,
23,
24], the released system explicitly performs endpoint semantic inference. Fine-grained subtype refinement is first applied where needed using a DINOv2-based classifier for fine-grained subtype refinement. Endpoint semantics is then obtained in two steps. First, a heatmap-based endpoint predictor, instantiated with a Vision Transformer Pose Estimation (ViTPose)-style top-down architecture, localizes ordered terminal keypoints for supported two-terminal and three-terminal component families [
38,
39]. Second, component-family-specific semantic mapping assigns electrical terminal roles, such as anode/cathode, gate/drain/source, or base/collector/emitter, using the predicted endpoint order, component subtype, local orientation, and recovered connectivity. The ViTPose-style model is therefore used as a generic keypoint-heatmap estimator trained on circuit-terminal annotations, rather than as a human-skeleton prior. The inferred roles, together with component classes, recognized text, stabilized points, and recovered connectivity, are fused into a unified structured circuit representation.
The final structured representation is exported as a SPICE-compatible netlist for downstream simulation. In the present work, ngspice is used as the simulator-side validation environment [
40].
2.5. Evaluation Protocol
The primary endpoint was the strict image-level end-to-end success rate under a strict image-level binary criterion. A test image was counted as successful only when component classification, connectivity reconstruction, text recognition with text-to-component association, and endpoint semantic inference were all correct. The independent 1317-image benchmark was strictly excluded from model training, validation, hyperparameter tuning, ablation selection, and checkpoint selection throughout all reported experiments.
Secondary outcomes included dimension-specific image-level accuracies for the same four dimensions, evaluated against manually annotated ground truth. For OCR/association and endpoint semantics, applicability coverage and conditional image-level accuracies were additionally reported so that the effective denominators for non-applicable cases were explicit.
To provide a finer-grained evaluation of the connectivity-focused contribution, connectivity was additionally evaluated at the terminal-pair graph level. For each circuit image, the ground-truth and predicted topologies were converted into sets of unordered terminal-pair connectivity relations:
where
and
denote component-terminal instances and
denotes the electrical net assignment. Terminal-pair precision, recall, and
were then computed from the aggregated true-positive, false-positive, and false-negative terminal-pair relations:
This metric is invariant to arbitrary net naming and directly evaluates electrical topology rather than raw wire-pixel overlap or intermediate junction placement. We also report the connectivity edit count, defined as
, to quantify how many terminal-pair relations would need to be added or removed to match the ground truth.
Downstream utility was assessed from the full 1317-image benchmark through a four-step funnel: SPICE netlist exportability, simulation eligibility, successful ngspice execution, and manually inspected downstream simulation correctness.
For proportion-based results reported with confidence intervals, two-sided 95% CIs were computed using the Wilson score interval [
41]. Runtime was measured under single-image inference (batch size = 1) and excluded disk input/output and simulator execution time.
2.6. Implementation Environment and Reproducibility
All reported results were generated from the version-locked public repository snapshot associated with this manuscript, together with the released pretrained weights.
Table 3 summarizes the main resources needed to recreate the execution environment and run the pipeline.
This version-locked release fixes the code snapshot, execution environment, and pretrained weights used for the reported experiments. Module-specific training details and configuration files are provided in the public repository.
3. Results
This section summarizes the end-to-end, graph-level, robustness, ablation, and SPICE executability results on the independent benchmark.
3.1. Overall End-to-End Performance
On the independent benchmark, the proposed method achieves a strict image-level end-to-end success rate of 95.14% (1253/1317; 95% confidence interval (CI), 93.84–96.18%, Wilson), as summarized in
Table 4.
Under this strict criterion, 64 of the 1317 test images fail on at least one evaluated dimension. The error distribution is uneven across dimensions: component classification, OCR/association, and endpoint semantics all remain above 98% on the all-image denominator, whereas connectivity inference is lower at 96.89% and contributes the largest residual error pool.
To further characterize connectivity quality beyond binary image-level success,
Table 5 reports terminal-pair graph-level metrics. The recovered topologies contain 25,158 true-positive terminal-pair relations, 103 false-positive terminal-pair relations, and 50 false-negative terminal-pair relations over the full 1317-image benchmark. This corresponds to 99.59% terminal-pair precision, 99.80% terminal-pair recall, and 99.70% terminal-pair F1. The mean connectivity edit count is 0.12 terminal-pair edits per image, and the median edit count is 0, indicating that the strict image-level failures are concentrated in a small subset of cases rather than reflecting widespread graph corruption.
A residual-error breakdown across the failed cases further supports this pattern. Among the 64 failed images, 41 contain connectivity errors, compared with 23 involving OCR/association errors, 15 involving component-classification errors, and 8 involving endpoint-semantic errors. These error categories overlap and therefore do not sum to the total number of failed images. The image-level and terminal-pair metrics should therefore be interpreted together: the former measures whether an entire circuit is reconstructed without any topological error, whereas the latter distinguishes near-correct graphs from cases with many missing or spurious connectivity relations.
3.2. Conditional and Dimension-Wise Results
Because OCR/association and endpoint semantics are not applicable to every image, their conditional results are reported separately in
Table 6. Endpoint semantic completion is not a marginal corner case in this benchmark: 1261/1317 images (95.75%) contain endpoint-semantic-sensitive components, and OCR/association is applicable in 1278/1317 images (97.04%).
3.3. Endpoint-Semantic Predictor Ablation
To examine whether terminal-role inference can be solved by simple orientation-template mapping, we compared the proposed ViTPose-style heatmap endpoint predictor with two simpler alternatives. The rule-based orientation mapping baseline does not use a trainable model; it assigns terminal roles from the component category, bounding-box geometry, and the relative positions of detected terminals. The lightweight orientation classifier predicts a discrete rotation/mirroring configuration for each component crop and then assigns terminal roles using component-family-specific templates. Thus, this ablation directly tests whether endpoint semantics in camera-acquired hand-drawn diagrams can be recovered by idealized orientation and mirroring rules alone.
As shown in
Table 7, the rule-based orientation mapping baseline achieves only 81.76% accuracy, indicating that idealized rotation/mirroring templates are insufficient for many camera-acquired hand-drawn diagrams. The lightweight orientation classifier improves the accuracy to 96.27%, but still produces 47 endpoint-semantic errors. In contrast, the ViTPose-style heatmap endpoint predictor reaches 99.37%, reducing the number of errors from 47 to 8 compared with the lightweight classifier. These results support the use of a heatmap-based endpoint formulation under irregular terminal placement, hand-drawn deformation, and imperfect component localization. At the same time, the comparison shows that ViTPose is not claimed to be uniquely necessary; rather, it is the high-accuracy implementation used in the present pipeline, and lighter endpoint predictors remain promising for resource-constrained deployment.
3.4. Robustness Across Diagram Origins and Structural Complexity
Figure 4 summarizes subgroup performance across diagram origin, presence or absence of crossover structures, and component-count complexity. These subgroup definitions were fixed by benchmark composition rather than being introduced after inspection of the results.
Across diagram-origin groups, the framework achieves 95.37% strict image-level end-to-end success and 97.43% connectivity accuracy on photographs of pre-existing hand-drawn diagrams (927/972 and 947/972, respectively), and 94.49% strict image-level end-to-end success and 95.36% connectivity accuracy on photographs of newly drawn researcher-created diagrams (326/345 and 329/345). These results indicate that high performance is maintained across both source-origin groups rather than being confined to either legacy diagrams or newly created benchmark material.
A similar pattern is observed for structural difficulty. When crossover structures are present, the method reaches 95.01% strict image-level end-to-end success and 96.49% connectivity accuracy (894/941 and 908/941), compared with 95.48% and 97.87% (359/376 and 368/376) on images without crossovers. Across component-count strata, strict image-level end-to-end success decreases from 96.08% in the low-complexity subgroup to 95.25% in the medium-complexity subgroup and 93.99% in the high-complexity subgroup, while connectivity accuracy decreases from 98.04% to 96.83% and 95.67%, respectively. Even in the highest-complexity subgroup, the end-to-end success rate remains above 93%.
3.5. Residual Connectivity Error Analysis
To better understand the remaining topology errors, we manually reviewed all 41 test images that failed on connectivity inference and recorded whether each image exhibited one or more recurring local contributing factors.
Table 8 summarizes this descriptive analysis. Because the factors are not mutually exclusive, the counts are presence-based rather than single-label assignments, and the corresponding percentages do not sum to 100%.
Weak or thin wire traces are the most frequent contributing factor (22/41, 53.66%), followed by crowded nearby keypoints (17/41, 41.46%) and incomplete component suppression (13/41, 31.71%). In addition, 8 of the 41 connectivity-error images (19.51%) exhibit multiple contributing factors simultaneously.
Figure 5 provides representative examples of these residual cases. Panel (a) shows a successful example with nontrivial local structure, whereas panels (b)–(d) show representative failures associated with incomplete component suppression, crowded nearby keypoints, and weak wire traces, respectively. Matching blue boxes highlight the erroneous local regions in the connectivity reasoning results and their corresponding regions in the original images, enabling direct comparison between the local visual ambiguity and its topological consequence.
3.6. Ablation of the Connectivity Reasoning Module
Because connectivity remains the main bottleneck of the full pipeline, we further examined the contribution of the wire CC-guided reasoning stage.
Figure 6 removes one connectivity element at a time while keeping all other stages unchanged, and reports both connectivity accuracy and strict image-level end-to-end accuracy.
All five elements are beneficial, but their effects are clearly unequal. The largest degradations occur when snapping or crossover straight-through pairing is removed. Removing snapping substantially increases broken or false attachments under the primary strict image-level criterion. Removing crossover straight-through pairing causes the most severe degradation: connectivity inference becomes incorrect on 947 of the 1317 test images, and the full pipeline fails on 953 images. The graph-level terminal-pair results in
Table 5 complement this ablation by showing that the full connectivity module achieves 99.70% terminal-pair F1 with only 153 total terminal-pair edits over the complete benchmark. Together, these results show that the observed end-to-end performance depends strongly on the explicit topology-reasoning design rather than on recognition modules alone.
3.7. Downstream Simulation Validation and Representative Case
Figure 7 summarizes the predefined four-stage SPICE validation funnel.
Downstream utility was evaluated using the predefined four-stage SPICE validation funnel. This funnel was designed to distinguish simulator executability from result-level simulation correctness. In this study, a syntactically complete exported netlist indicates that the recovered structure can be represented in SPICE form. A simulation-eligible netlist indicates that ngspice can parse and execute the requested analysis after the required external device models, indispensable component parameters, and analysis commands are available. By contrast, simulation correctness was assessed only after execution, by inspecting whether the exported topology, node mapping, component values, device models, terminal roles, and resulting operating-point values or waveforms were consistent with the source diagram and with the expected behavior of the circuit. Therefore, successful execution was treated as a necessary but not sufficient condition for downstream correctness.
Of the 1317 test images, 1273 were converted automatically into syntactically complete SPICE netlists. Under the baseline simulator environment, 760 exported netlists were immediately simulation-eligible, whereas 513 were initially non-executable because required external device or subcircuit models were unavailable. After the required vendor or public model libraries were supplemented, 497 of those 513 cases were recovered, yielding 1257/1273 exported netlists (98.74%) that were simulation-eligible. The remaining 16 exported but non-simulatable cases were due to missing indispensable handwritten parameters in the original diagrams rather than export failure.
The 1257 executable cases were then examined at the result level. Manual post-execution inspection judged 1250 of the 1257 executed cases (99.44%) to be simulation-correct. A case was counted as correct only when the generated netlist preserved the intended circuit topology and device semantics and when the resulting ngspice output was consistent with the expected circuit behavior. The remaining seven cases executed but were not counted as correct. Among them, five were already counted as end-to-end structured parsing failures, and the other two reflected downstream exporter/simulator-side deviations introduced after successful image-level structural parsing. This distinction shows that the downstream validation did not equate “executable” with “correct”.
Table 9 summarizes the causes of initial non-executability among exported netlists.
To connect the benchmark-level funnel with a concrete circuit instance, we further traced a representative hand-drawn NE555-based PWM motor-control circuit through the downstream validation workflow.
Figure 8 shows the original input diagram,
Figure 9 shows a representative excerpt of the generated SPICE netlist after simulator-side model and parameter completion, and
Figure 10 shows representative transient outputs from the resulting ngspice simulation. In this example, the handwritten “4A” marking in the upper-right branch denotes the rated current of a resistance-wire/fusible protective element rather than a resistance value; it is therefore represented by a dedicated fuse/resistance-wire model instance rather than by a 4
resistor. Some passive-component values are not explicitly visible in the hand-drawn figure; for this representative simulation case, these indispensable values were completed during simulator-side model and parameter preparation and are explicitly listed in the SPICE netlist excerpt. Components whose required resistance or capacitance values are absent and cannot be reliably completed are not treated as directly simulation-ready. This example illustrates how a realistically acquired hand-drawn circuit can be parsed into a simulator-readable representation containing sources, device models, subcircuit instantiation, component instances with explicit parameter values, and transient-analysis directives, and how the resulting execution output can be checked against the intended circuit behavior.
Taken together, the benchmark-level funnel and the representative NE555 case show that the recovered outputs are not only structurally interpretable at the image level, but also usable in a downstream SPICE validation workflow that separates netlist exportability, simulator executability, and result-level correctness.
3.8. Runtime Characteristics and Complexity Consistency
Table 10 reports the average per-image runtime of each stage up to generation of the machine-readable structured representation, together with the dominant operation type and expected scaling behavior. Since circuit simulation is treated as a downstream validation step rather than as part of the core parsing pipeline, simulator time is not included.
Let denote the number of pixels after image resizing, D the number of detected components and text regions, T the number of OCR crops, V the number of stabilized circuit nodes and terminal endpoints, K the number of wire connected components, the number of candidate nodes/endpoints in the k-th connected component, and E the number of candidate graph edges. In this implementation, images are processed at a fixed inference scale, so P is bounded across the benchmark. The practical runtime is therefore governed mainly by the constant factors of the full-image pixel-level stages and the graph-construction stages.
The timing distribution in
Table 10 is consistent with this complexity analysis. The two slowest stages are component suppression and wire enhancement (1.012 s) and connectivity reasoning (1.141 s). Together, they account for 2.153 s of the 2.9285 s total runtime, or approximately 73.5% of the structured-parsing time. This dominance is expected because both stages require full-image geometric operations, connected-component analysis, and graph construction, whereas recognition and fusion stages operate on fixed-size network inputs, localized crops, or comparatively small sets of detected components and endpoints. Thus, the measured runtime is not determined only by the number of neural-network modules, but by the combination of pixel-level processing, connected-component labeling, local endpoint snapping, and sparse graph selection. The reported timings correspond to single-image offline inference (batch size = 1, excluding disk input/output and simulator execution) on a platform equipped with one NVIDIA RTX 4090D GPU (24 GB), 15 allocated CPU cores on an Intel Xeon Platinum 8474C platform, and 80 GB RAM.
3.9. Subtask-Level Comparisons on Public Benchmarks
To provide external reference points on shared public data, we evaluate aligned subtasks on JUHCCR-v1 and CGHD [
21,
25,
31]. These results are reported at the subtask level and are intended to complement, rather than replace, the stricter end-to-end evaluation on the independent 1317-image benchmark. We include two CGHD detection references. The first follows the CGHD dataset-paper reference setting, whereas the second follows the Bayer et al. validation drafter split, namely drafters 21–22, which is the closest publicly reported modular graph-extraction baseline for handwritten circuit diagram images [
25].
Table 11 reports the aligned public-subtask results on JUHCCR-v1 and CGHD.
On JUHCCR-v1, the framework achieves 98.03% component-classification accuracy (33,329/34,000), exceeding the published 91.15% reference by 6.88 percentage points. On CGHD under the dataset-paper reference setting, it achieves mAP@0.5:0.95 = 0.559 and mAP@0.5 = 0.707, compared with the dataset paper’s Faster R-CNN reference of mAP = 52% [
21,
31]. More importantly for comparison with the closest modular graph-extraction baseline, under the Bayer et al. validation drafter split, namely drafters 21–22, the detector achieves mAP@0.5:0.95 = 0.348 and mAP@0.5 = 0.537, compared with the 0.180 validation-set mAP reported by Bayer et al. for Faster R-CNN ResNet-152 [
25]. This CGHD detection-level comparison provides a directly aligned quantitative reference to Bayer et al., while the independent 1317-image benchmark remains the primary evidence for end-to-end topology-consistent parsing.
For graph connectivity, Bayer et al. [
25] present graph assembly and rectification qualitatively through a sample application, but do not provide a dataset-level quantitative endpoint for full topology recovery, such as graph-connectivity accuracy, edge-level precision/recall, graph edit distance, or netlist-equivalence metrics. Therefore, a full end-to-end numerical comparison would not be methodologically aligned: the present framework outputs topology-consistent structured circuit representations with text–component association, endpoint-semantic inference, SPICE-compatible netlist generation, and downstream simulation validation, whereas the publicly reported Bayer et al. results provide a directly reportable object-detection endpoint but not matched annotations or metrics for these later stages. Accordingly, we limit the directly matched quantitative comparison to the shared CGHD object-detection subtask, and use the independent 1317-image benchmark for strict end-to-end evaluation of topology-consistent structural parsing.
4. Discussion
The results indicate that the main remaining difficulty in hand-drawn circuit parsing is topology recovery rather than local symbol recognition. In the present benchmark, component recognition, text recovery, and endpoint-semantic inference were already sufficiently strong that the dominant residual failures were concentrated in connectivity reconstruction under fragmented strokes, local geometric ambiguity, and drawing noise. The added terminal-pair graph-level evaluation further qualifies this result: although strict image-level connectivity accuracy is 96.89%, the recovered topologies achieve 99.59% terminal-pair precision, 99.80% terminal-pair recall, and 99.70% terminal-pair F1, with a mean connectivity edit count of 0.12 per image. This pattern suggests that, for realistic hand-drawn circuit images, further progress is more likely to come from stronger connectivity reasoning than from marginal gains in already strong local classifiers.
From a
Sensors perspective, the relevance of this work lies in robust processing of camera-acquired engineering images. The pipeline is designed for hand-drawn circuit photographs affected by illumination variation, blur, shadows, perspective distortion, paper texture, and stroke degradation, and converts these imperfect inputs into machine-readable circuit structures suitable for netlist export and downstream simulation [
4,
5,
6,
26,
27]. In this sense, the contribution is not only circuit parsing, but also a vision-based sensing and image-understanding framework for structurally recovering engineering diagrams from real captured images [
10,
11].
Regarding circuit-domain applicability, the present task should be understood as engineering-diagram sensing, structural parsing, and circuit digitization, rather than as a method specific to either strong-current or weak-current circuit design. The visual parsing stage is mainly determined by the supported symbol vocabulary, drawing conventions, stroke quality, OCR/parameter completeness, recoverable connectivity, and endpoint-semantic definitions, rather than directly by the operating power level of the circuit. Therefore, power-electronic circuits, measurement circuits, educational circuits, maintenance sketches, and early-stage design diagrams are within the intended application scope when their component symbols, terminal semantics, handwritten parameters, and required SPICE models are covered by the trained taxonomy and exporter. Conversely, circuits containing unseen symbols, highly specialized integrated modules, unconventional notation, missing indispensable parameters, or unavailable device models require additional labeled samples, terminal-role definitions, and exporter/model-library support before the same level of performance can be expected. Thus, the current results demonstrate applicability to the circuit categories represented in the training data and independent benchmark, and support extensibility to other circuit families, but should not be interpreted as universal coverage of all possible electrical, power-electronic, or measurement-circuit drawings.
4.1. Comparison with Existing Methods
The closest modular graph-extraction baseline is compared at the CGHD object-detection subtask level using the Bayer et al. drafter split (
Table 12). At the full graph-extraction level, directly matched numerical comparison remains limited because prior studies differ in reconstruction target, circuit-family scope, and reported validation endpoint, and Bayer et al. do not report dataset-level connectivity or graph-equivalence metrics [
18,
19,
22,
25]. The discussion below therefore separates the directly comparable detection-level evidence from the broader system-level positioning of end-to-end topology-consistent parsing.
Table 13 summarizes only those capabilities and validation endpoints that are explicitly reported in the cited studies and are relevant to this comparison. Here, “Family restr.” indicates whether a method is tied to a limited circuit family or template regime. “Validation level” distinguishes the strict image-level protocol adopted here from studies whose reported endpoints are not directly matched. “NR” denotes items not explicitly reported, and “NDA” denotes settings that are not directly aligned.
Within this comparison frame, the present study is best understood as extending hand-drawn circuit parsing toward stricter structural and semantic completeness rather than claiming a universal cross-paper rank. A particularly distinctive aspect is the explicit endpoint-semantic layer: to the best of our knowledge, prior publicly reported systems for general hand-drawn circuit parsing have not modeled terminal-role semantics for direction-sensitive and multi-terminal devices as an explicit output within an integrated end-to-end pipeline [
18,
19,
22,
23,
24]. Relative to earlier systems that were family-restricted, logic-specific, or evaluated under different endpoints [
18,
19,
22,
42], the present framework combines broader hand-drawn circuit coverage, connectivity reconstruction, endpoint-semantic completion, netlist generation, and downstream executable validation under a strict image-level criterion.
4.2. Study Limitations
Several limitations remain. First, although the residual failures are now concentrated mainly in connectivity recovery, the current pipeline still depends on sufficiently recoverable wire evidence. As indicated by the failure-mode analysis, weak or thin wire traces, crowded nearby keypoints, and incomplete component suppression remained the dominant contributors in the connectivity-error subset, especially when multiple factors co-occurred in the same image. This means that topology recovery under severely degraded local evidence is still the main unresolved challenge. In addition, the current connectivity module uses interpretable wire-CC constraints and geometric graph selection rather than a learned edge-scoring model. This choice improves controllability and physical interpretability, but a GNN-based edge scorer or graph-refinement module could further improve ambiguous cases when enough graph-level training labels become available.
Second, although the system supports a broad set of component categories, its generalization to unseen component types or drawing conventions outside the current training distribution remains limited. Third, endpoint semantic inference is currently implemented for the direction-sensitive and multi-terminal components covered in the present benchmark, rather than for all possible symbol families. Finally, the downstream simulation funnel was designed to separate parser-side structured-output completeness, simulator-side executability, and result-level correctness. Simulator execution can additionally depend on external device libraries, indispensable passive-component values, active-device supply definitions, device or subcircuit models, and analysis commands that may not be fully specified in the source diagram. Therefore, the reported downstream simulation results should be interpreted as simulator-side validation of recognized and completed SPICE representations, rather than as an unconditional guarantee that every arbitrary hand-drawn circuit can be fully and automatically behavior-verified without additional model, parameter, supply, or analysis-command information.