3.1. Reliability Block Diagrams
In the field of reliability engineering, the configuration of components within a system significantly influences the overall system reliability. One of the most fundamental configurations is the series connection, where components are arranged such that the failure of any single component results in the failure of the entire system. This configuration is commonly encountered in many engineering systems, particularly in power transmission, aerospace systems, and manufacturing processes [38].
In a series system, the reliability of the entire system is the product of the reliabilities of its individual components. The series model assumes statistical independence between component failures and no repair during operation. While simplistic, this assumption provides a foundational understanding of system behavior and is widely used in early-stage design and risk assessments [39].
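As a minimal sketch of the series rule (the component reliabilities below are illustrative, not values from the paper, and the function name is ours), the product R = ∏ Rᵢ can be computed as:

```python
# Series system: every component must survive, so reliabilities multiply.
# The example reliabilities below are illustrative, not values from the paper.

def series_reliability(reliabilities):
    """Reliability of a series system of independent, non-repairable components."""
    r = 1.0
    for r_i in reliabilities:
        r *= r_i
    return r

# Three hypothetical components: the product is below every individual value
print(series_reliability([0.99, 0.98, 0.97]))  # ≈ 0.9411
```

Note that the system reliability is lower than that of any single component, which is the analytical root of the single-point-of-failure concern discussed next.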
However, the inherent vulnerability of series systems—where a single point of failure leads to total system failure—has prompted the development of more robust configurations and the inclusion of redundancy in practical applications [40]. Despite this, series models remain critically important for understanding baseline system reliability and for evaluating the impact of component quality on overall performance.
Parallel system configurations are a key strategy in reliability engineering for enhancing system robustness and minimizing the risk of total system failure. In contrast to series systems, parallel systems are designed so that the failure of one or more components does not necessarily lead to overall system failure, as long as at least one component remains functional. This architectural approach is widely applied in critical systems where uninterrupted operation is essential, such as in aerospace, nuclear power plants, and data centers [38,39].
In a parallel configuration, system reliability increases with the number of redundant components. As a result, parallel arrangements are often used to introduce fault tolerance and ensure high availability, especially in mission-critical applications [40].
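A complementary sketch for the parallel rule (illustrative values; the function name is ours): the system fails only when every redundant unit fails, so R = 1 − ∏(1 − Rᵢ).

```python
# Parallel (active redundancy): the system fails only if ALL units fail.
# Example values are illustrative, not taken from the paper.

def parallel_reliability(reliabilities):
    """Reliability of a parallel system of independent components."""
    unreliability = 1.0
    for r_i in reliabilities:
        unreliability *= (1.0 - r_i)
    return 1.0 - unreliability

# Each added redundant unit multiplies the remaining unreliability by (1 - R)
print(parallel_reliability([0.90, 0.90]))        # ≈ 0.99
print(parallel_reliability([0.90, 0.90, 0.90]))  # ≈ 0.999
```

This illustrates why redundancy is so effective: two mediocre 0.90 units already outperform a single 0.98 component.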
Parallel systems may be implemented as active redundancy—where all units operate simultaneously—or as standby redundancy, where backup units are activated only upon failure of the primary ones. Each approach has distinct implications for system reliability and maintenance strategy [41]. Moreover, advanced reliability modeling often incorporates non-identical component reliabilities, dependencies, and repair policies, allowing for more accurate assessments of system performance in real-world environments [42].
To construct the RBD of the dual-axis solar tracking prototype, the system components must first be categorized based on their role within the overall architecture. Based on its functionality, the solar tracking system can be divided into two primary subsystems: the control subsystem and the data transfer subsystem, as depicted in Figure 1. The control subsystem is responsible for executing key operations such as the homing function, the flat position function, and the optimal sunlight positioning function. In contrast, the data transfer subsystem handles the acquisition of sensor data and its transmission to the virtual IoT platform for remote monitoring and analysis. This classification, presented in Table 1, allows for a structured representation of how each component contributes to the system’s functionality and reliability. The failure rates of each element are expressed in Failures Per Million Hours (FPMH) according to the international standard presented in [43].
The Arduino Mega 2560 MCU (Arduino S.r.l., Monza, Italy) is commonly housed in a plastic enclosure when installed indoors, but can also be exposed in outdoor setups. Failure modes primarily include corrosion of contacts and degraded solder joints due to humidity and condensation. In outdoor use, UV radiation and temperature cycling can cause plastic warping and micro-cracking of solder joints, significantly increasing the failure rate. The TB6560 Motor Driver, typically packaged as a heatsinked PCB-mounted module, handles stepper motor control. Its primary vulnerability is overheating, especially when ventilation is poor. Outdoor conditions such as rain ingress and high humidity can lead to short circuits or corrosion of exposed terminals, drastically reducing reliability.
The Tongling 5 V module, enclosed in a semi-sealed plastic case, is an electromechanical relay that can suffer from arcing, contact pitting, and coil degradation. High outdoor failure rates are due to moisture penetration, which leads to corrosion or even coil failure.
The Weidmuller 24 V Relay is an industrial-grade relay with relatively better sealing. Nonetheless, oxidation of contacts and thermal fatigue due to outdoor temperature fluctuations can cause operational failure over time, especially in less-protected outdoor installations.
The Astrosyn Stepper Motor is usually mounted without full environmental sealing. Dust ingress and water exposure can lead to bearing failure or internal corrosion. Over time, these stressors result in increased resistance or stalling. The Superior Electric Slo-Syn represents another stepper motor variant, vulnerable to similar issues as the Astrosyn motor. Wind-blown particles, thermal cycling, and humidity can reduce the insulation resistance, potentially causing shorts or excessive wear in the gear mechanism.
Among the sensor components is the TEMT6000 module, a light sensor exposed to ambient light and UV radiation. In outdoor conditions, degradation of the lens material and solder fatigue are common, and moisture ingress can cause measurement drift or total failure.
Another sensor component, the ACS172, is an analog current sensor in a plastic DIP or SOIC package. While relatively robust, long-term outdoor use may cause degradation of the epoxy encapsulant, leading to pin corrosion or erratic readings due to electromagnetic interference amplified by weather changes. The ML8511 is a UV sensor that is itself sensitive to environmental damage: outdoor use exposes it to the very UV radiation it measures, which can paradoxically degrade the sensor. PCB corrosion and encapsulation failure are also concerns under high humidity and rain.
The BH1750, packaged in a small IC form, is a digital light sensor that can degrade in performance due to condensation, lens fogging, or PCB corrosion. Even indoors, fluctuating humidity can cause internal oxidation over time. The DHT22 is a digital temperature and humidity sensor that is notoriously sensitive to condensation. In outdoor setups, if not well-sealed, it suffers from accuracy drift, rust on pins, and eventual sensor failure due to exposure to rain or frost.
Mechanical encoders are vulnerable to dust and moisture. Outdoor conditions may lead to rusting of mechanical parts and misreading due to signal bounce or degradation of optical elements in some variants. On the other hand, limit switches are typically mechanical and enclosed in plastic or metal housings. However, they can still fail from moisture ingress, leading to contact corrosion or mechanical jamming due to dirt buildup.
The SIM800L V2 module, with its compact design, is sensitive to temperature extremes and condensation. Failures can result from corrosion of antenna connectors, solder fatigue on small pins, and Electrostatic Discharge (ESD) events during storms. The Solar Charge Controller is critical for battery health and is often mounted near the panels. Outdoor usage may lead to degradation of connectors, internal MOSFET failure due to heat, or board-level corrosion, especially if the housing is not IP-rated. The LM2596 buck converter, when exposed to outdoor elements, may fail due to overheating or corrosion of its inductor or capacitors; electrolytic capacitors are particularly prone to drying out in heat or swelling due to moisture. Finally, the Varta 12 V·44 Ah battery is a sealed lead-acid battery that performs well under moderate conditions but suffers under high temperatures, which accelerate electrolyte evaporation and plate degradation. Cold temperatures can reduce capacity and cause internal pressure buildup, leading to case rupture in extreme cases.
Based on the above-mentioned system components and their respective failure rates, the RBD of the dual-axis solar tracker is illustrated in Figure 2.
The first step in computing the reliability of the entire system is to convert the failure rates from FPMH to failures per hour (λ in 1/h). The subsystem λ values are calculated for both indoor and outdoor conditions for comparison purposes, in order to determine whether the solar tracker’s subsystem components can withstand the relevant stress factors and weather conditions. As depicted in Figure 2, the reliability values are calculated independently for the control circuits, motors, sensors, and power supply components. For this calculation, only the midpoint value of each failure rate interval is considered.
For the control circuits, which encompass the first four components, the associated failure rate exhibits a significant difference based on the operating environment. Specifically, the indoor failure rate is λcontrol, indoor = 2.35 × 10⁻⁶ failures/h, whereas the outdoor rate is substantially higher at λcontrol, outdoor = 8 × 10⁻⁶ failures/h. The reliability of the control subsystem, denoted as R(t), was subsequently computed using the exponential reliability function R(t) = e^(−λt) for a time interval of t = 1000 h. The resulting values are R(1000)control, indoor ≈ 0.9977 for indoor usage, and R(1000)control, outdoor ≈ 0.9920 for outdoor usage.
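These figures can be reproduced with the exponential model and the FPMH-to-1/h conversion; the following is a verification sketch using the rates quoted above (the helper name is ours):

```python
import math

def reliability_from_fpmh(fpmh, hours):
    """Exponential reliability R(t) = exp(-λ·t), with λ supplied in FPMH."""
    lam = fpmh / 1e6  # failures per million hours -> failures per hour
    return math.exp(-lam * hours)

# Control-circuit midpoint rates from the text: 2.35 FPMH indoor, 8 FPMH outdoor
print(round(reliability_from_fpmh(2.35, 1000), 4))  # ≈ 0.9977
print(round(reliability_from_fpmh(8.0, 1000), 4))   # ≈ 0.992
```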
The automation circuits, encompassing the following four components, exhibit distinct failure rates based on the operational environment: λautomation, indoor = 1.975 × 10⁻⁷ failures/h and λautomation, outdoor = 5.965 × 10⁻⁶ failures/h. Consequently, the reliability of the automation and gear subsystem after t = 1000 h is R(1000)automation, indoor ≈ 0.9998, and R(1000)automation, outdoor ≈ 0.9941 for indoor and outdoor usage, respectively.
The RBD for the data transfer subsystem is structured as a combination of series (GSM module) and parallel configurations (sensors). This reflects the system’s dependency on multiple components for successful data transmission. Specifically, sensor data will fail to be transmitted if (a) all sensors simultaneously fail to capture environmental data, or (b) the GSM/GPRS module becomes faulty, preventing communication with the ThingSpeak server. This configuration, illustrated in Figure 2, highlights critical points of failure that must be addressed to ensure the reliability of the data transfer process. Therefore, the reliability of the following five components (9 through 13), connected in a parallel configuration, under indoor and outdoor conditions for a time interval of t = 1000 h is extremely high, computed as approximately R(1000)data transfer, indoor, outdoor ≈ 1.0 (specifically, 1 − 5.46 × 10⁻²⁰). In reliability engineering, a value this close to unity indicates a system where the probability of all components failing simultaneously is negligible.
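The near-unity figure follows from multiplying five already-small failure probabilities. A sketch with hypothetical per-sensor rates (the actual Table 1 values are not reproduced here) illustrates the effect:

```python
import math

def parallel_unreliability(lams, t):
    """Probability that ALL parallel units have failed by time t (exponential model)."""
    q = 1.0
    for lam in lams:
        q *= 1.0 - math.exp(-lam * t)
    return q

# Five hypothetical sensor failure rates in failures/h (illustrative only)
sensor_lams = [5e-6, 8e-6, 1e-5, 2e-5, 1.2e-5]
q = parallel_unreliability(sensor_lams, 1000)
print(q)        # a product of five small terms: vanishingly small
print(1.0 - q)  # subsystem reliability, numerically indistinguishable from 1.0
```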
Regarding the GSM/GPRS module, comprising the next two components in a series connection, the computed lambda values are λdatatransfer, indoor = 6.25 × 10⁻⁷ failures/h for indoor usage and λdatatransfer, outdoor = 2.10 × 10⁻⁶ failures/h for outdoor usage. Substituting these values for t = 1000 h results in reliability values of R(1000)datatransfer, indoor ≈ 0.999375 and R(1000)datatransfer, outdoor ≈ 0.9979.
Concerning the last three components, which are responsible for the power supply unit, the corresponding lambda values are λpowersupply, indoor = 7.7 × 10⁻⁷ failures/h and λpowersupply, outdoor = 1.8 × 10⁻⁶ failures/h. The associated reliability values, computed for t = 1000 h, are R(1000)powersupply, indoor ≈ 0.99923 and R(1000)powersupply, outdoor ≈ 0.9982.
Finally, the reliability of the entire system (components 1 through 18) over a time interval of t = 1000 h is approximately R(1000)system, indoor ≈ 0.9961 for indoor usage and R(1000)system, outdoor ≈ 0.9823 for outdoor usage.
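Since the subsystems are themselves in series, these system figures can be cross-checked by summing the subsystem λ values (the parallel sensor block contributes effectively zero) and applying the exponential model once more. A verification sketch, with the rates copied from the text:

```python
import math

# Midpoint subsystem failure rates from the text, in failures/h:
# control, automation, parallel sensor block (≈ 0), GSM data transfer, power supply
indoor_lams  = [2.35e-6, 1.975e-7, 0.0, 6.25e-7, 7.7e-7]
outdoor_lams = [8.0e-6, 5.965e-6, 0.0, 2.10e-6, 1.8e-6]

def system_reliability(lams, t=1000):
    """Series combination: the equivalent failure rate is the sum of the λ values."""
    return math.exp(-sum(lams) * t)

print(round(system_reliability(indoor_lams), 4))   # ≈ 0.9961
print(round(system_reliability(outdoor_lams), 4))  # ≈ 0.9823
```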
The reliability of the entire system is highly sensitive to the environmental transition from indoor to outdoor conditions. While the total system reliability only decreased by approximately 1.38% over 1000 h, this small change is underpinned by a 351.1% increase in the system’s equivalent failure rate. This difference is largely driven by the extreme sensitivity of electromechanical components (such as the Anemometer) and the cumulative, additive effect of increased failure rates within the system’s series architecture. Consequently, long-term operational success for the outdoor application is critically dependent on focused design efforts, such as isolating the most sensitive components (e.g., through robust enclosures) or implementing redundancy, as demonstrated by the resilient parallel subsystem.
3.2. Fault Tree Analysis
FTA is a systematic, deductive methodology employed to evaluate the reliability and safety of complex systems. Originally developed in 1962 by H.A. Watson at Bell Laboratories for the U.S. Air Force’s Minuteman ICBM program, FTA has since become a cornerstone in reliability engineering across various high-risk industries, including aerospace, nuclear energy, and chemical processing [44].
The essence of FTA lies in constructing a graphical representation—a fault tree—that maps the logical relationships between system failures and their root causes. This tree begins with a “top event,” representing the undesired system failure, and branches downward through intermediate events to basic events, which are the fundamental causes of failure. Logical gates such as AND and OR are used to depict how these events combine to lead to the top event [45,46].
FTA serves both qualitative and quantitative purposes. Qualitatively, it helps identify minimal cut sets—the smallest combinations of basic events that can cause the top event—thereby highlighting critical vulnerabilities within the system. Quantitatively, it allows for the calculation of the probability of the top event occurring, based on the probabilities of the basic events and the logical structure of the fault tree [20].
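For independent basic events, the two gate types reduce to simple probability rules: an AND gate multiplies the event probabilities, while an OR gate combines them as 1 − ∏(1 − pᵢ). A small sketch (the basic-event probabilities here are hypothetical, not values from the solar tracker analysis):

```python
# Gate rules for independent basic events in a quantitative fault tree.

def and_gate(probs):
    """All inputs must occur: P = Π p_i."""
    p = 1.0
    for p_i in probs:
        p *= p_i
    return p

def or_gate(probs):
    """At least one input occurs: P = 1 - Π (1 - p_i)."""
    q = 1.0
    for p_i in probs:
        q *= 1.0 - p_i
    return 1.0 - q

# Hypothetical basic-event probabilities over a mission interval
battery_failure    = or_gate([0.002, 0.001])    # internal fault OR controller fault
conversion_failure = or_gate([0.0015, 0.0005])  # converter fault OR downstream overload
top_event          = or_gate([battery_failure, conversion_failure])
print(top_event)  # ≈ 0.005 for these inputs
```

Nesting these calls mirrors the tree structure: each gate's output feeds the gate above it, up to the top event.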
The versatility of FTA makes it applicable in various domains. For instance, in the energy sector, it aids in assessing the reliability of power systems and identifying potential points of failure. In the context of control automation, FTA is instrumental in analyzing complex systems like nuclear plants and water distribution networks, where it helps in designing robust systems by identifying and mitigating potential faults during the design phase [47].
FTA can also be utilized to classify system failures based on their severity. In the context of the solar tracking system, three distinct levels of failure criticality can be identified: (a) critical—failures that cause complete system shutdown or pose safety risks, (b) less critical (malfunction)—faults that impair performance but do not halt operation entirely, and (c) non-critical—minor faults with negligible impact on system functionality. To illustrate each level of severity, an individual FTA will be constructed for each corresponding failure scenario.
A critical failure in the solar tracking system typically results from an unexpected power supply outage occurring during execution cycles. This type of failure can lead to a complete system shutdown, interrupting all ongoing operations. The corresponding FTA illustrating this scenario is presented in Figure 3. The FTA diagram systematically depicts the sequence of failures that can lead to a power supply outage within the solar tracking system. At the top of the tree, the undesired event—power supply outage—is broken down into two main contributory paths: battery failure and power conversion failure. The battery failure branch includes internal faults in the Varta 12 V 44 Ah battery as well as malfunctions in the solar charge controller, both of which are influenced by underlying stressors such as long-term usage, thermal cycling, and moisture ingress. On the other hand, the power conversion failure branch considers the malfunction of the LM2596 converter and the overloading of downstream components, including the SIM800L GSM module and relay circuitry.
These failures may arise independently or in combination, as represented by OR logic gates. The analysis emphasizes how environmental and operational stress factors at the component level can propagate upward through the system architecture, ultimately resulting in a complete power disruption.
A less critical failure scenario in the solar tracking system is tracking misalignment. This event occurs when the solar panel is no longer accurately oriented toward the sun, resulting in reduced energy harvest. It does not completely disable the system but significantly lowers performance. The corresponding FTA is illustrated in Figure 4.
This FTA illustrates a secondary-level failure scenario in the solar tracking system, focusing on tracking misalignment as the top event. The misalignment may arise from one of three primary causes: sensor failure, actuator malfunction, or control signal error. Sensor failure is further traced to faults in components such as the TEMT6000 and ML8511 modules, influenced by environmental aging and UV-induced degradation. Actuator malfunction is attributed to the failure of the stepper motor, often caused by wind-blown dust or corrosion. Control signal errors originate from faults in the rotary encoder and misreadings by the Arduino, with mechanical debris and noise acting as the triggering factors. The hierarchical structure captures how less-critical, yet impactful, faults can reduce system efficiency without leading to a total power outage.
The FTA in Figure 5 models a non-critical failure scenario in the solar tracking system, namely a data communication failure, which impacts remote monitoring and data logging functionalities. The top-level event is decomposed into three main contributing branches: GSM module fault, signal loss or network issues, and microcontroller communication error. The GSM module fault centers on the SIM800L V2, which may fail due to electrical defects or environmental stressors such as humidity and corrosion. Signal loss arises from weak cellular reception, electromagnetic interference, or antenna malfunction, all of which can interrupt data transmission. Microcontroller-related errors are linked to the Arduino Mega 2560, including UART protocol issues and command parsing faults, often caused by firmware glitches or transient electrical noise. These contributing factors are modeled using OR gates, highlighting that the failure of any single component or condition can lead to communication loss without affecting energy generation.
Additionally, the FTA incorporates AND gates to emphasize that a data communication failure also arises when all sensor modules fail simultaneously. In such a case, the system is unable to collect any environmental data, making it impossible to transmit information to the ThingSpeak server.
In summary, FTA is a vital tool in reliability engineering, offering a structured approach to identifying and mitigating potential system failures. Its ability to provide both a visual representation of failure pathways and quantitative risk assessments makes it indispensable in the design and analysis of complex, safety-critical systems.
3.3. Failure Mode and Effects Analysis
FMEA is a structured, inductive methodology utilized in reliability engineering to proactively identify and mitigate potential failure modes within systems, products, or processes. The FMEA process involves a systematic examination of components and subsystems to determine how they might fail (failure modes), the causes of these failures, and the potential effects on system performance. Each identified failure mode is assessed based on three criteria: Severity (S), which measures the impact of the failure; Occurrence (O), which estimates the likelihood of the failure; and Detection (D), which evaluates the probability of detecting the failure before its effects manifest. These factors are multiplied to calculate a Risk Priority Number (RPN), guiding engineers in prioritizing corrective actions [48]. An FMEA is conducted on the solar tracking prototype, as presented in Figure 6. This analysis provides a detailed overview of the most common potential failure modes within the system, their root causes, and the corresponding impact on system performance and overall operation.
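A sketch of the RPN calculation and the resulting prioritization; the failure modes and S/O/D ratings below are hypothetical placeholders, not the ratings from the actual FMEA of the prototype:

```python
# RPN = Severity × Occurrence × Detection, each rated on a 1-10 scale.
# The failure modes and ratings below are illustrative placeholders.

def rpn(severity, occurrence, detection):
    for v in (severity, occurrence, detection):
        if not 1 <= v <= 10:
            raise ValueError("S, O, and D are rated on a 1-10 scale")
    return severity * occurrence * detection

failure_modes = {
    "LDR degradation (UV, moisture)": (6, 7, 4),
    "SIM800L communication loss":     (4, 5, 3),
    "Stepper motor stall":            (8, 3, 5),
}

# Rank failure modes by RPN, highest priority first
for name, (s, o, d) in sorted(failure_modes.items(),
                              key=lambda kv: rpn(*kv[1]), reverse=True):
    print(f"{name}: RPN = {rpn(s, o, d)}")
```

The ranking, not the absolute RPN value, is what drives the allocation of corrective actions.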
Most of the solar tracker’s electrical components, listed in Table 1, are housed within a metallic enclosure that offers additional protection against environmental stressors. However, several sensors essential for automating the control subsystem and enabling remote monitoring within the data transfer subsystem are mounted externally, leaving them directly exposed to varying environmental conditions.
As shown in Figure 6, one such component is the Light Dependent Resistor (LDR), which is used to measure light distribution across different corners of the PV panel. According to the first layer, if an LDR in a solar tracking system is directly exposed to environmental stressors like sunlight, moisture, dust, and temperature swings, several potential failure root causes arise. First, prolonged exposure to UV rays can chemically degrade the plastic encapsulation and even the sensing material itself (typically cadmium sulfide in many LDRs), resulting in shifted sensitivity or permanent loss of responsiveness to light over time. A second root cause is moisture, which can oxidize the metallic contacts and the sensitive material inside the LDR, leading to increased resistance, intermittent operation, or complete failure. A third root cause is repeated heating (from the sun) and cooling (at night) throughout the summer months, which creates thermal expansion and contraction; this produces microcracks in the internal structure, causing fatigue or delamination of internal layers. A fourth root cause is contamination due to dust and pollution, which results in incorrect readings and delayed or inaccurate tracking. A fifth root cause is mechanical damage due to rain, hail, and wind-borne particles, which usually leads to physical destruction or altered optical properties.
The second layer (bottom-up approach) in Figure 6 represents the failure mode, which most commonly involves the malfunction or complete failure of an LDR. Given that four LDRs are utilized to orient the solar panel toward the Sun, four individual points of potential failure can be identified. Two of these failure points are associated with the West–East rotation along the horizontal axis, while the other two are linked to the North–South rotation along the vertical axis.
For the azimuth rotation, if the West LDR, shown in layer 3, becomes faulty, the Microcontroller Unit (MCU) will behave as described in layer 4. A malfunctioning LDR typically leads to a sudden decrease in resistance, causing the Arduino Mega MCU to continuously register a low voltage value on the A0 analog input. Normally, a damaged photoresistor is interpreted as shading on one side of the PV panel, prompting the solar tracking system to rotate the payload until the West–East sensors read equal values. However, with the West LDR malfunctioning, the system continuously detects an imbalance, causing the solar tracker to keep moving the PV panel until it reaches the sunset position, marked by the maximum horizontal limit switch, where it ultimately becomes stuck, as depicted in layer 5 of Figure 6. Similarly, if the East LDR malfunctions, the MCU will continuously receive a low voltage reading on input A1. This will cause the solar tracking system to rotate in the opposite direction, ultimately becoming stuck at the homing (sunrise) position, triggered by the activation of the lower horizontal limit switch.
For the elevation rotation, the MCU’s behavior mirrors the previous scenarios. If the North LDR malfunctions, the Arduino Mega board will consistently detect a low voltage value on input A2, prompting the solar tracking system to search for the optimal position by moving the PV panel upward. As a result, the solar tracker will eventually become stuck in the flat position, triggered by the activation of the maximum vertical limit switch. If the South LDR malfunctions, the MCU will continuously read a low voltage value on pin A3. This will cause the solar tracking system to remain stuck at the initial homing position, marked by the activation of the lower vertical limit switch, as shown in layer 5.
The FTA diagram for the previously described failure scenario involving all four LDRs is presented in Figure 7. The FTA highlights critical failure pathways that lead to the system becoming immobilized in specific operational states—sunset, homing, or flat. Immobilization at the sunset position occurs when a low voltage is detected at sensor A0 concurrently with a failure of the West Light Dependent Resistor (LDR). Similarly, the system may remain stuck in the homing position due to either a combination of low voltage at A1 and a faulty East LDR, or low voltage at A3 paired with a malfunctioning South LDR. In the case of the flat position, the system becomes fixed when low voltage at A2 is accompanied by failure of the North LDR. These individual fault conditions collectively define a broader category of LDR malfunction, which can be traced to either environmental stressors or degradation from age and usage. Environmental stress is further broken down into contributing factors such as ultraviolet radiation, high humidity, thermal cycling, dust and pollution, and mechanical damage—any of which can independently impose detrimental effects on the LDR sensors.