Robotic Arm Control Using a Q-Learning Reinforcement Algorithm

Timóteo, Afonso M.; Barbosa, Ramiro S.; Jesus, Isabel S.

doi:10.3390/robotics15030050



by Afonso M. Timóteo ¹, Ramiro S. Barbosa ¹,²,* and Isabel S. Jesus ¹,²

¹

Department of Electrical Engineering, Institute of Engineering—Polytechnic of Porto (ISEP/IPP), Rua Dr. António Bernardino de Almeida, 431, 4249-015 Porto, Portugal

²

GECAD—Research Group on Intelligent Engineering and Computing for Advanced Innovation and Development, Institute of Engineering—Polytechnic of Porto (ISEP/IPP), 4249-015 Porto, Portugal

* Author to whom correspondence should be addressed.

Robotics 2026, 15(3), 50; https://doi.org/10.3390/robotics15030050

Submission received: 11 December 2025 / Revised: 16 February 2026 / Accepted: 25 February 2026 / Published: 27 February 2026

(This article belongs to the Section Sensors and Control in Robotics)


Abstract

This paper presents the design and implementation of an integrated robotic system capable of detecting objects through computer vision and making decisions based on logic strategies to perform physical tasks. To this end, the system uses a robotic arm to play the Tic-Tac-Toe game, with a Q-learning algorithm used to determine optimal moves. The system is controlled through a graphical interface that enables real-time monitoring and facilitates interaction between the user and the robotic arm. Three algorithms with different decision strategies were developed: a random decision algorithm, the MiniMax algorithm, and Q-learning, a reinforcement-learning algorithm. The results highlight the control of the robotic arm using kinematic equations, the training of a robust YOLOv5 model, and the effective learning capability of the Q-learning algorithm. The proposed system is a practical implementation that can be used as a basis for further projects and for teaching robotics.

Keywords:

artificial intelligence; robotics; robotic arm; direct kinematics; inverse kinematics; computer vision; YOLO; Q-learning; MiniMax; Tic-Tac-Toe game

1. Introduction

Today, the landscape of technology and innovation is driven by unprecedented advances in the field of artificial intelligence (AI), which are producing a significant impact on the integration of computational systems across various sectors [1,2,3,4,5]. This emerging technology is present in nearly every aspect of our lives, reshaping industries, transforming businesses, and redefining how we approach everyday situations.

AI can be defined as the ability of a digital computer to perform tasks typically associated with human intelligence. Systems that exhibit mental processes such as reasoning, discovery of meanings, generalization, and learning from past experiences are referred to as artificially intelligent systems. These systems are capable of recognizing patterns, learning from data, and making intelligent decisions without being explicitly programmed. The advancement of technology, driven by AI, has revolutionized society and required constant human adaptation. The integration of AI across diverse domains has enabled achievements that were once considered unthinkable.

In the field of robotics, AI has significantly expanded the capabilities of robots and related technologies [6,7,8,9,10]. Once confined to repetitive tasks on assembly lines, robots have undergone a remarkable evolution. They are now capable of performing complex and dynamic tasks, such as delicate surgical procedures, precise manufacturing processes, and even ambitious space exploration missions [11,12,13,14,15]. Integration of AI algorithms is essential for robots to interpret and understand signals from their sensors, allowing them to interact with the surrounding environment in a safe and intelligent manner [16,17,18,19]. The field of robotics powered by AI-driven control algorithms has advanced rapidly and consistently in recent years. Researchers and international organizations are constantly exploring innovative ways to integrate intelligent control systems into small machines, with the aim of enhancing various aspects of daily life. Despite its exciting potential, this domain is often complex and not easily accessible. AI models typically require extensive training and careful implementation to ensure that they are equipped to handle the wide range of situations they may encounter. This study seeks to explore some of this current context in order to apply the concepts investigated in the development and implementation of a robotic arm controlled by artificial intelligence algorithms to perform interactive tasks.

The work involves the design and implementation of an integrated robotic system capable of detecting objects using computer vision, making decisions based on strategies and logic, and performing physical tasks with a robotic arm. The main focus is on developing an intelligent module capable of performing an interactive task such as playing the Tic-Tac-Toe game. Although each component used in our system, YOLO for detection, Q-learning for decision-making, and a 5-DOF manipulator controlled through kinematic equations, is individually well established, our contribution lies in the development of a unified framework that seamlessly integrates real-time computer vision, decision-making strategies, and manipulator control into a single integrated system, demonstrated experimentally on a physical platform. Specifically, most existing studies on reinforcement learning applied to Tic-Tac-Toe are carried out in simulation [20] or through grid-based symbolic representations [21]. However, our work demonstrates that a reinforcement-learning agent can interact with the physical world through visual perception under real-world constraints and still achieve good performance. This addresses a gap in the literature, as few studies evaluate how classical learning algorithms perform when embedded in a full perception–decision–action cycle on a real manipulator. Furthermore, this application can be extended as a benchmark for tasks such as assembly, inspection, packaging, and sorting in industrial manipulation scenarios. YOLOv5 provides fast detections of board cells, parts, and defects, which can be mapped to symbolic states (such as cell occupancy, part type, or defect presence) or to metric goals like pick coordinates, following the same perception-to-state conversion used in the game. 
A Q-learning agent that selects discrete actions (place X or O in a cell) extends naturally to discrete industrial actions such as picking or placing, inspecting a region, or accepting or rejecting products [22,23]. Platforms such as DeepClaw [24] demonstrate this by using a simple Tic-Tac-Toe setup as an initial benchmark and then extending the same hardware and control pipeline to more realistic tasks, including bin clearing, jigsaw assembly, and sorting. Moreover, the proposed framework can serve as an educational and experimental platform for robotics, illustrating the integration of perception, decision-making, control, and actuation at both software and hardware levels, and as a basis for further projects.

The system is operated through a graphical interface that enables real-time monitoring and interaction, enhancing the user’s ability to control and observe the robotic arm’s actions. This interface provides a comprehensive environment for seamless interaction. Users can control the angles of the servomotor, specify the final position of the actuator, and visualize object detections in real-time. In addition, the interface presents the optimal move for the current state of the game board, ensuring an intuitive and efficient user experience.

The main contributions of the developed work are as follows:

Design and implementation of an integrated robotic system that combines computer vision, decision-making strategies, and robotic arm control to perform interactive tasks.
Real-time control of a five-degree-of-freedom robotic arm by applying kinematic equations for robot movements.
Development and comparison of three different decision-making algorithms (random, MiniMax and Q-learning) applied to a unified robotic framework, providing valuable insights into their comparative strengths and practical performance in a physical environment.
Application of a Q-learning reinforcement algorithm to control a real robotic arm in an interactive environment, highlighting the practical effectiveness of RL in real-world dynamic scenarios.

The article is structured as follows. Section 2 provides an overview of recent works in the field of AI applied to robotics and related areas. Section 3 describes the architecture of the system, focusing on the structure and main components of the robotic arm, the direct and inverse kinematics required for robot control, the object detection algorithm and the Q-learning algorithm. The implementation of the system is given in Section 4. Here, the system architecture, as well as the training and application of the AI algorithms are detailed. Section 5 presents the results of the test games played between a human player and the trained Q-learning player, together with a discussion of the experiments conducted. Finally, Section 6 draws the main conclusions and addresses future developments of the work.

2. Related Work

This section reviews key studies that provided the foundational information and methodologies for this work.

In the robotics area, the work by [25] traces the key inventions underlying the robotic concept, demonstrating that the idea of robots predates the study and discovery of electrical systems. Similarly, the survey by [26] offers a comprehensive overview of the evolution of robotic arms over the past 20 years, focusing on the parameters and characteristics that influence their performance. The book by [7] provides an in-depth examination of robotic systems, covering their essential components, control architectures, and functionalities. The authors in [27] focus on the development of industrial robots, with a historical analysis spanning the 1950s to the early 1990s. In [16], the authors introduce key concepts in robotics, addressing current and emerging topics such as machine learning, ethics, human–robot interaction, and design thinking.

In the domain of artificial intelligence and machine learning, the review by [28] explores frequently used algorithms and methods, providing fundamental information on the capabilities and limitations of AI techniques. They evaluated the integration of machine learning algorithms with traditional methodologies, providing valuable guidance on applying AI in practical scenarios. In [29], an efficient lightweight Convolutional Neural Network (CNN) model is presented for the detection of surface defects in industrial products, specifically designed to overcome the high computational requirements of conventional CNNs. The proposed Coordinate Attention Mobile (CAM) backbone network uses inverse residual structures and the Coordinate Attention (CA) mechanism for efficient feature extraction. Multi-scale strategies are used to improve the detection of small objects, improving both accuracy and robustness. A novel Bidirectional Weighted Feature Pyramid Network (BWFPN) is introduced for feature fusion. The proposed model achieves a detection accuracy comparable to that of the state-of-the-art approaches. The work of [4] compares supervised machine learning algorithms, evaluating their efficiency in different datasets. The book by [30] is a comprehensive resource on deep learning, explaining the core concepts and architectures in the design of AI models. The foundational reference [31] provides the fundamental ideas and algorithms of reinforcement learning (RL), covering both the theoretical and practical aspects of this field.

Several studies have explored the integration of AI into robotics to perform specialized tasks. These studies offer valuable insight into the design and implementation of robotic systems. The work of [32] presents the design and implementation of a robotic arm for playing chess with a pick-and-drop mechanism. This study introduces a kinematic calculation framework for the robotic arm and a smart chessboard that assists the control algorithm by providing the precise positions of each chess piece on the board. Another study by [33] describes a real-time autonomous chess robotic system designed to compete against human opponents. Their system includes a computer vision module for detecting chess pieces, a popular chess engine for selecting optimal moves, and a grid-based position calculation system to guide the robotic arm’s movements accurately. An earlier but foundational study by [34] explores the application of artificial neural networks to control a robotic arm in a Tic-Tac-Toe game. Their work highlights the potential of neural networks to enable strategic decision-making and precise control in simple game-playing scenarios. The study [35] proposes a fast, learning-based algorithm to efficiently solve larger Tic-Tac-Toe boards, overcoming the slow Min-Max approach, by generalizing beyond the traditional 3 × 3 board and achieving strong results that could extend to other strategy games such as Minesweeper, Chess and Go. In [36], the authors present an XY-plotter that plays Tic-Tac-Toe using stepper motors controlled by a microcontroller. A vision algorithm detects human moves, and MiniMax provides optimal game decisions. The system combines robotics with human–computer interaction for educational and interactive uses. The experiments show accurate and efficient real-time gameplay. The study highlights the consistency of deterministic MiniMax compared to probabilistic AI methods. 
The work of [37] introduces SwarmPlay, where a swarm of nano-quadcopters plays a Tic-Tac-Toe board game against a human. The system aims to create more tangible and interactive human–machine gameplay than single-robot setups. A drone swarm, workstation, and computer vision system enable real-time participation. User studies show high engagement and a more natural experience than traditional computer games. The results indicate a strong potential for SwarmPlay to extend to many games, enhancing human–drone interaction through a novel game-theory-based algorithm.

In [38], a new robot-assisted laparoscopy training system is introduced, which uses deep reinforcement learning (DRL) agents such as PPO and GAIL to learn from both simulations and expert demonstrations. The system incorporates real laparoscopic instruments, allowing RL agents to provide trainees with hands-on, tactile learning experiences. The experimental results show that the system can successfully integrate simulation and expert data to improve training outcomes. Statistical analysis confirms that the skill improvements achieved with this training system are significant. In the study [39], a novel architecture was designed to improve coordination of motion control using reinforcement learning. The proposed CoordiGraph framework utilizes the subequivariant property to deal with weak inter-joint coupling in high-dimensional tasks. The method specifically addresses the shortcomings of Graph Neural Networks (GNNs) and equivariant techniques in coordinating motion control tasks within RL. The results show that CoordiGraph outperforms several baseline methods in complex motion control scenarios. Moreover, the findings suggest that subequivariance is a promising strategy to improve motion coordination. The study by [40] addresses the problem of grasping moving objects in unstructured environments. It proposes a DRL-based system using a Kinect depth sensor and an improved Soft Actor–Critic (SAC) algorithm. The system follows an approach–track–grasp pipeline, enabling real-time tracking and grasping. The experimental results show high grasp success rates on objects moving along various trajectories, demonstrating the effectiveness of the method. In [41], the problem of achieving high-precision robotic grasping using only visual input is studied. It introduces QT-Opt, a scalable off-policy DRL algorithm using distributed Q-learning. The approach learns end-to-end control policies directly from visual observations, avoiding the need for explicit object modeling. 
The results achieve a grasp success rate of 96% on unseen objects, outperforming previous methods. The work of [42] identifies a limitation in vision-based RL for robotic manipulation, where static cameras struggle under occlusions and limited space. It proposes a Dual-Arm Active Visual-Guided Manipulation Model (DAVMM), with one arm handling vision (“eye”) and the other performing manipulation (“hand”), enabling active perception and interaction. Residual-RL and curriculum learning are used to improve sample efficiency and training stability. Experiments on three occluded, narrow-space tasks show DAVMM significantly outperforms strong baselines, achieving higher success rates and faster learning. Another study by [43] introduces a Multi-Actor–Critic Deep Deterministic Policy Gradient (M2ACD) algorithm for robotic manipulator trajectory planning in complex environments. A Two-Stage Reward (TSR) strategy guides safe and precise motion, and NURBS (Non-Uniform Rational B-Splines) curves smooth trajectories to solve the position-hopping jitter. Results show M2ACD outperforms TD3, DARC, and DDPG, achieving superior curve smoothness, stability, and convergence speed for collaborative robot trajectory planning. The study in [44] proposes a new 3D path planning method for robot arms using computer vision, Q-learning, and neural networks to overcome problems related to object localization, computational efficiency, and 2D workspace limitations. The Q-learning algorithm selects optimal movement actions in 3D space, while a neural network translates these actions into robot joint angles. Simulations and experiments show the approach significantly improves accuracy, efficiency, and real-time performance over previous methods.

YOLO edge deployment and physical RL and manipulation benchmarks have become active, rapidly evolving research areas. The work of [45] provides a systematic review of deep learning deployment on embedded hardware, including YOLO-based object detection for edge processing, and emphasizes that successful deployment in resource-constrained environments depends on model-optimization strategies, lightweight architectures, and appropriate hardware selection. Similarly, study [46] evaluates inference workflows and the performance of YOLO models across multiple edge platforms, reporting empirical latency and throughput results on resource-constrained devices such as the Raspberry Pi 4B and NVIDIA Jetson systems (Santa Clara, CA, USA). In the context of physical reinforcement learning, the work in [47] presents one of the first extensive experimental benchmarks of multiple policy-learning algorithms, namely TRPO, PPO, DDPG, and Soft-Q, on commercially available physical robots. The study highlights the robustness of structured task setups and demonstrates the applicability of these algorithms across diverse physical environments. The work of [48] investigates direct training of RL algorithms in controlled yet realistic real-world environments for dexterous manipulation tasks, addressing limitations of simulation-based training. The study presents benchmarking results for three RL algorithms applied to complex in-hand manipulation on physical robotic systems. The results show that TD3 consistently outperforms DDPG and SAC, demonstrating superior robustness in continuous real-world tasks. Overall, the work highlights the practicality of real-world RL training and its effectiveness in reducing the simulation-to-real gap.

Although more recent architectures and advanced learning methods are available, the use of well-established models and algorithms remains justified in experimental robotics research. These approaches offer a mature deployment ecosystem that facilitates reproducibility and simplifies system-level validation.

3. Materials and Methods

This section provides a description of the robotic arm, the derivation of the robot kinematic equations, the computer vision algorithm, and the reinforcement Q-learning strategy used in this work.

3.1. Robotic Arm

The core element of this system is the robotic arm, which is responsible for interacting with the environment in a precise and controlled way. Its fundamental characteristics must therefore be carefully analyzed to avoid damage or additional difficulties in the execution of the work. The robot adopted in this study is the Adeept 5-DOF Robotic Arm Kit [49]. It is mainly composed of acrylic parts and offers five degrees of freedom: base rotation ($θ_{1}$), shoulder ($θ_{2}$), elbow ($θ_{3}$), wrist ($θ_{4}$) and claw opening ($θ_{5}$), as designated by the robot manufacturer. Figure 1 illustrates the robot configuration. The robotic arm is paired with the Raspberry Pi 3B controller (Cambridge, UK), which features a 64-bit, 1.2 GHz Quad-Core processor, 400 MHz Videocore IV graphics and 1 GB of native RAM.

The architecture of the robotic system is shown in Figure 2. The robot joints are driven by five servomotors, each with a rotation range of 0 to 180° [49]. The Arm HAT board serves as the servo driver between the controller and the servomotors: it drives the motors through PWM signals and communicates with the controller over I2C [50].

3.2. Robot Kinematics

The controlled movements of the robotic arm are crucial for performing tasks with precision and accuracy. To achieve these movements, kinematic equations are used to describe the relationships between the arm’s components and the three-dimensional space. These equations are categorized into two main types: forward kinematics and inverse kinematics, which are derived for the robot under study in the following subsections.

Although the robotic arm is designed with five degrees of freedom, in this configuration the position of the end-effector is determined exclusively by the first three joints (base rotation, shoulder, and elbow). The remaining joints, corresponding to wrist rotation and gripper actuation, do not influence it. The system is therefore simplified to a three-degree-of-freedom model by holding the wrist orientation and gripper actuation at fixed angles throughout operation.

3.2.1. Forward Kinematics

Forward kinematics involves calculating the position and orientation of the end-effector of the arm from the joint angles ($θ_{1}$, $θ_{2}$, $θ_{3}$). To simplify the kinematic calculations, the system is treated as having three degrees of freedom, with the final position coinciding with the wrist joint.

The forward kinematics of the robot is derived using the Denavit–Hartenberg (DH) convention [51], which defines the relationships between the links of a robotic arm using four key parameters:

$a_{i}$ —Distance from $z_{i}$ to $z_{i + 1}$ measured along $x_{i}$ .
$α_{i}$ —Angle from $z_{i}$ to $z_{i + 1}$ measured about $x_{i}$ .
$d_{i}$ —Distance from $x_{i - 1}$ to $x_{i}$ measured along $z_{i}$ .
$θ_{i}$ —Angle from $x_{i - 1}$ to $x_{i}$ measured about $z_{i}$ .

The first step in applying these parameters is to define the axes of each joint. Taking the base plate as the fixed reference, the axes of all other joints are assigned relative to the base axes with the joints in their resting position. The resulting axes are shown in Figure 3. The parameter $L_{0}$ is the distance from the base to the first joint. The link lengths are $L_{1}$, $L_{2}$, and $L_{3}$.

Once the axes are defined, the DH parameters for each link can be determined, as shown in Table 1.

The transformation matrix for each link is calculated according to the DH convention, using

T_{i}^{i - 1} = \begin{bmatrix} \cos θ_{i} & - \sin θ_{i} \cos α_{i} & \sin θ_{i} \sin α_{i} & a_{i} \cos θ_{i} \\ \sin θ_{i} & \cos θ_{i} \cos α_{i} & - \cos θ_{i} \sin α_{i} & a_{i} \sin θ_{i} \\ 0 & \sin α_{i} & \cos α_{i} & d_{i} \\ 0 & 0 & 0 & 1 \end{bmatrix}

(1)
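As a hedged illustration of Equation (1), the sketch below builds one DH transform and composes several of them by matrix multiplication, which is how link transforms are combined under the DH convention. The function names and the two-link planar parameters in `demo_rows` are hypothetical and chosen only for demonstration; they are not the parameters of Table 1.

```python
import math

def dh_transform(theta, alpha, a, d):
    """4x4 homogeneous transform of Equation (1); angles in radians."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul4(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def chain(rows):
    """Multiply the DH transforms of successive links, base to tip."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for theta, alpha, a, d in rows:
        T = matmul4(T, dh_transform(theta, alpha, a, d))
    return T

# Hypothetical planar arm: two unit-length links, rows are (theta, alpha, a, d).
demo_rows = [(math.pi / 2, 0.0, 1.0, 0.0), (-math.pi / 2, 0.0, 1.0, 0.0)]
T_demo = chain(demo_rows)
position = (T_demo[0][3], T_demo[1][3], T_demo[2][3])  # last column holds x, y, z
```

The last column of the composed matrix holds the tip position and the upper-left 3 × 3 block holds its orientation.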

The resulting matrix, which gives the pose of the end-effector relative to the base frame, is obtained by multiplying the individual transformation matrices:

T_{3}^{0} = T_{1}^{0} \times T_{2}^{1} \times T_{3}^{2}

(2)

which represents the orientation and position of the end-effector. From (2), the forward kinematic equations (3), (4) and (5) are obtained for the position coordinates x, y and z, respectively.

x = L_{3} cos (θ_{1}) cos (θ_{2}) cos (θ_{3} - \frac{π}{2}) + L_{2} cos (θ_{1}) cos (θ_{2}) + L_{3} cos (θ_{1}) sin (θ_{2}) sin (θ_{3} - \frac{π}{2})

(3)

y = L_{3} sin (θ_{1}) cos (θ_{2}) cos (θ_{3} - \frac{π}{2}) + L_{2} sin (θ_{1}) cos (θ_{2}) + L_{3} sin (θ_{1}) sin (θ_{2}) sin (θ_{3} - \frac{π}{2})

(4)

z = L_{3} sin (θ_{2}) cos (θ_{3} - \frac{π}{2}) - L_{3} cos (θ_{2}) sin (θ_{3} - \frac{π}{2}) + L_{2} sin (θ_{2}) + L_{1} + L_{0}

(5)

Table 2 and Figure 4 display some of the angles tested and their corresponding positions. For the structure of the robotic arm, the distance $L_{0}$ is 4.5 cm, $L_{1}$ is 5.5 cm, $L_{2}$ is 6.5 cm and $L_{3}$ is 11 cm.
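Equations (3) to (5) can be evaluated directly once the link lengths are known. The following is a minimal sketch that transcribes them term by term, assuming angles in radians; the function name `forward_kinematics` is illustrative, and any servo-angle offsets applied on the real arm are not modeled here.

```python
import math

# Link lengths in cm, as stated in the text.
L0, L1, L2, L3 = 4.5, 5.5, 6.5, 11.0

def forward_kinematics(theta1, theta2, theta3):
    """Return (x, y, z) in cm from Equations (3)-(5); angles in radians."""
    phi = theta3 - math.pi / 2
    # Bracketed term shared by Equations (3) and (4): horizontal reach.
    reach = (L3 * math.cos(theta2) * math.cos(phi)
             + L2 * math.cos(theta2)
             + L3 * math.sin(theta2) * math.sin(phi))
    x = math.cos(theta1) * reach
    y = math.sin(theta1) * reach
    z = (L3 * math.sin(theta2) * math.cos(phi)
         - L3 * math.cos(theta2) * math.sin(phi)
         + L2 * math.sin(theta2) + L1 + L0)
    return x, y, z
```

For example, with θ1 = 0, θ2 = 0 and θ3 = π/2 the equations reduce to x = L2 + L3 = 17.5 cm, y = 0 and z = L0 + L1 = 10 cm.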

3.2.2. Inverse Kinematics

Inverse kinematics calculates the joint angles needed to reach a desired final position (x, y, z) and orientation. In the presented implementation, the robotic arm is operated with three degrees of freedom: the end-effector position coincides with the wrist joint, and only the base, shoulder, and elbow joints are actuated and controlled. The orientation of the end-effector is fixed by construction and does not require active pose control. The joint angles can therefore be calculated with simple trigonometric relations, as illustrated in Figure 5.

The base angle $θ_{1}$ corresponds to the rotation of the base around the z axis. A triangle can be formed between the center of the base, the abscissa of the position, and the projection of the position onto the horizontal plane (triangle 0-x-P′). With the coordinates of the final position denoted by $P \equiv (x, y, z)$, the angle $θ_{1}$ is calculated with the two-argument arctangent as

θ_{1} = \operatorname{atan2} (y, x)

(6)

The angles $θ_{2}$ and $θ_{3}$ require more attention, as they depend on each other, so two triangles can be drawn, as shown in Figure 6. The first has its vertices at the shoulder joint, the elbow joint, and the actuator end position (triangle Q-P-R); the second, adjacent to the first, has its vertices at the shoulder, the end position, and the projection of the end position onto the horizontal plane shifted towards the shoulder joint (triangle Q-P-T).

Applying the law of cosines to the triangle Q-P-R, we obtain the angle of the elbow joint, $θ_{3}$, as

θ_{3} = \arccos \left[ \frac{x^{2} + y^{2} + (z - L_{1} - L_{0})^{2} - L_{2}^{2} - L_{3}^{2}}{2 L_{2} L_{3}} \right]

(7)

Analysing Figure 6, $θ_{2}$ is determined using the following relationships:

\begin{aligned} θ_{2} &= α - β \\ α &= \arctan \left( \frac{z - L_{1} - L_{0}}{\sqrt{x^{2} + y^{2}}} \right) \\ β &= \arctan \left( \frac{L_{3} \sin θ_{3}}{L_{2} + L_{3} \cos θ_{3}} \right) \\ θ_{2} &= \arctan \left( \frac{z - L_{1} - L_{0}}{\sqrt{x^{2} + y^{2}}} \right) - \arctan \left( \frac{L_{3} \sin θ_{3}}{L_{2} + L_{3} \cos θ_{3}} \right) \end{aligned}

(8)
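The relations (6) to (8) can be sketched in a few lines. This is an illustrative transcription, not the authors' code: the function name is hypothetical, angles are in radians, `math.atan2` is used in place of the plain arctangent so that the quadrant is preserved, and the reachability check is an added safeguard. The returned angles follow the geometric convention of Figure 6; any offset between these angles and the servo commands is outside the scope of the sketch.

```python
import math

# Link lengths in cm, as stated in the text.
L0, L1, L2, L3 = 4.5, 5.5, 6.5, 11.0

def inverse_kinematics(x, y, z):
    """Return (theta1, theta2, theta3) in radians from Equations (6)-(8)."""
    r = math.hypot(x, y)   # horizontal distance from the base axis
    s = z - L1 - L0        # height relative to the shoulder joint
    cos_t3 = (r * r + s * s - L2 ** 2 - L3 ** 2) / (2 * L2 * L3)
    if not -1.0 <= cos_t3 <= 1.0:
        raise ValueError("target is outside the reachable workspace")
    theta1 = math.atan2(y, x)                        # Equation (6)
    theta3 = math.acos(cos_t3)                       # Equation (7)
    alpha = math.atan2(s, r)                         # Equation (8), first angle
    beta = math.atan2(L3 * math.sin(theta3),
                      L2 + L3 * math.cos(theta3))    # Equation (8), second angle
    theta2 = alpha - beta
    return theta1, theta2, theta3
```

A quick consistency check is to rebuild the target from the two-link geometry: the horizontal reach should equal L2 cos θ2 + L3 cos(θ2 + θ3) and the height above the shoulder should equal L2 sin θ2 + L3 sin(θ2 + θ3).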

3.3. YOLO

YOLO (You Only Look Once) is an object detection algorithm that, unlike traditional methods, treats the task as a single regression problem. It uses a single neural network to predict the bounding boxes of objects and classify them from complete images in a single evaluation [52].

Compared to algorithms prior to the introduction of YOLO, which required two separate neural networks to detect and classify objects, YOLO uses only one, ensuring faster processing speed. Furthermore, the ability to analyze an entire image at once allows YOLO to capture the context of detected objects, improving its accuracy and reducing the false positive rate [53].

The core concept of YOLO is to divide the input image into an $S \times S$ grid, where each cell is responsible for predicting $B$ bounding boxes and the probability $C$ for each class within the cell, as illustrated in Figure 7 [54].

Each bounding box is represented by five values: x, y, width, height, and a confidence score. The first four values define the bounding box in the image space, while the confidence score reflects the Intersection over Union (IoU) between the predicted box and the true box, expressed as

\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}

(9)
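Equation (9) is straightforward to compute for axis-aligned boxes. The sketch below assumes each box is given as corner coordinates `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`; this corner format and the function name are illustrative choices, since YOLO itself reports boxes as center, width and height.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two 2 × 2 boxes offset by one unit in each direction overlap in a 1 × 1 square, giving an IoU of 1/7.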

Figure 8 shows the network architecture of YOLOv5. It consists of three main parts: Backbone, Neck, and Head. The Backbone, called CSPDarknet, integrates Cross Stage Partial Network (CSPNet) to reduce redundant gradient information, decrease the number of parameters and FLOPs (floating-point operations), and improve speed and accuracy while keeping the model lightweight, making it suitable for resource-limited devices. The Neck uses a Path Aggregation Network (PANet) combined with an enhanced Feature Pyramid Network (FPN) to improve information flow, feature fusion, and localization accuracy by efficiently propagating low- and high-level features. The Head, referred to as the YOLO layer, generates feature maps at three different scales to enable multi-scale prediction, which helps detect small, medium, and large objects effectively. In general, the structure of YOLOv5 ensures high precision, fast inference, and adaptability to real-time detection tasks [55,56].

3.4. Q-Learning

The Q-learning algorithm falls into the category of RL algorithms, where an agent can learn the best actions to take in a given state based on its interaction with the environment [57].

The core of this algorithm is the “Q-Table”, a matrix indexed by environment state and candidate action. The stored values, known as “Q-values”, represent the reward the agent expects when taking a given action in a given state. These values, initially set arbitrarily, are updated iteratively through the agent’s experience according to the temporal difference rule [58]:

Q (S, A) \leftarrow Q (S, A) + α [R + γ max_{A^{'}} Q (S^{'}, A^{'}) - Q (S, A)]

(10)

$Q (S, A)$ —“Q-Value” for state S and action A.
$Q (S^{'}, A^{'})$ —“Q-Value” for the future state $S^{'}$ and the best future action $A^{'}$ .
$α$ —Learning rate of the model.
$γ$ —Discount factor for future rewards.
R—Reward calculated for the current state and for the action taken.
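The update rule (10) can be sketched with a dictionary standing in for the Q-Table. Everything named here is an assumption for illustration: the `(state, action)` key layout, the default Q-value of zero, and the values of `ALPHA` and `GAMMA` are hypothetical and are not the settings reported for the trained agent.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (hypothetical value)
GAMMA = 0.9  # discount factor (hypothetical value)

def make_q_table():
    """Q-Table keyed by (state, action); unseen entries default to 0.0."""
    return defaultdict(float)

def q_update(q, state, action, reward, next_state, next_actions,
             alpha=ALPHA, gamma=GAMMA):
    """Apply Equation (10) once and return the new Q(S, A)."""
    # A terminal next state has no actions, so its best value is taken as 0.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]
```

For Tic-Tac-Toe, a state could be encoded as a nine-character string of board cells and an action as a cell index from 0 to 8, so that the table grows only with the positions actually visited.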

The process of this algorithm is characterized by five main phases, as described in Figure 9 [59].

During the learning process, the agent can either try new actions to update its table, known as Exploration, or take advantage of already known actions that produce high rewards, known as Exploitation. This choice is governed by strategies such as the $ϵ$-greedy policy, where the agent generally selects the action with the highest known “Q-value” but, with probability $ϵ$, takes a random exploratory action instead [19].
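A minimal sketch of ϵ-greedy action selection follows, assuming the Q-values are stored in a mapping keyed by `(state, action)` with missing entries treated as zero. The function name, the key layout and the optional `rng` argument are illustrative assumptions.

```python
import random

def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if not actions:
        raise ValueError("no legal actions available")
    if rng.random() < epsilon:
        return rng.choice(actions)  # exploration
    # Exploitation: action with the highest stored Q-value (default 0.0).
    return max(actions, key=lambda a: q.get((state, a), 0.0))
```

A common design choice, not confirmed by the text, is to start with a large ϵ and reduce it as training progresses so that the agent explores early and exploits later.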

Because Q-learning does not use a model of the environment in which it is implemented, the algorithm is suitable for a wide range of applications in unpredictable or complex dynamic environments [4].

However, it has drawbacks that make it unsuitable for certain situations. In large or continuous state spaces, the size of the “Q-Table” can render the algorithm impractical. Additionally, in environments where rewards can be delayed over multiple states, propagating useful information back to earlier states becomes more difficult. In some contexts, it can also be difficult to explore the environment efficiently, which can lead to slow learning or suboptimal results [58].

Given these challenges, the algorithm requires careful tuning of its learning rate, its discount factor, and the value of $ϵ$ used by the $ϵ$-Greedy policy.

4. Implementation

This section describes the implementation of the proposed system architecture, presenting and explaining the development of the stages outlined in the previous section. As already mentioned, the system is designed to play the Tic-Tac-Toe game using a robotic arm with five degrees of freedom.

4.1. System Architecture

The proposed architecture consists of four main interconnected components designed to facilitate the gameplay of Tic-Tac-Toe, as shown in Figure 10. As can be seen, the system designed to play Tic-Tac-Toe controls a robotic arm with five degrees of freedom using a Raspberry Pi 3B. The control of the robotic arm involves a YOLO computer vision model for real-time object identification, trained to recognize the various elements of the Tic-Tac-Toe game, and a decision-making algorithm, the Q-learning RL algorithm, which uses YOLO’s results to autonomously control the robotic arm during the game. Finally, an interactive interface facilitates control and enables real-time monitoring of the entire control process.

In more detail, the system uses a camera to capture real-time images of the game board. These images are sent to a Raspberry Pi 3B controller, which applies a YOLO computer vision model to identify and locate the various elements of the game.

The camera used for this system features a 48-MP sensor with an f/2.0 aperture and a 26 mm wide-angle lens, equipped with autofocus capabilities. This configuration provides a sufficiently high spatial resolution and adequate light sensitivity to capture detailed images of the game board under typical indoor lighting conditions. However, in the present work, the visual component was designed exclusively for coarse localization of Tic-Tac-Toe cells in a tightly controlled setup rather than for general-purpose 3D pose estimation. Because the camera was rigidly mounted above the board at a constant height and orientation, the projection of the board on the image remained stable throughout all trials. Under these constrained conditions, the physical dimensions of the board (25.5 cm × 33.5 cm) and the YOLO input resolution (640 × 640 pixels) allowed us to apply a uniform linear pixel-to-centimeter scaling model. The origin of the robot was manually aligned with a fixed pixel coordinate, and the center of each detected bounding box was mapped to real-world coordinates through this scaling.
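The linear scaling described above can be illustrated with a short Python sketch. The board dimensions and image resolution come from the text; everything else is an assumption: that the image covers exactly the board, that the 33.5 cm side lies along the image x-axis, and the pixel chosen as the robot origin (`ORIGIN_PX`). The function names are hypothetical.

```python
BOARD_W_CM, BOARD_H_CM = 33.5, 25.5   # board size from the text; axis assignment is assumed
IMG_W_PX, IMG_H_PX = 640, 640         # YOLO input resolution
ORIGIN_PX = (320.0, 640.0)            # hypothetical pixel aligned with the robot origin

def bbox_center(x_min, y_min, x_max, y_max):
    """Center of a detected bounding box, in pixels."""
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def pixel_to_cm(u, v, origin_px=ORIGIN_PX):
    """Map an image point (u, v) to board coordinates in centimeters."""
    sx = BOARD_W_CM / IMG_W_PX        # cm per pixel along x
    sy = BOARD_H_CM / IMG_H_PX        # cm per pixel along y
    x_cm = (u - origin_px[0]) * sx
    y_cm = (origin_px[1] - v) * sy    # image v grows downward, so it is flipped
    return x_cm, y_cm

u, v = bbox_center(300, 280, 340, 360)   # hypothetical detection
print(pixel_to_cm(u, v))
```

Because the model is purely linear, any lens distortion or camera tilt translates directly into a positioning error, which is acceptable only under the fixed-camera setup described in the text.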

The output of this module is a map of detections that enables the decision-making algorithm to assess the current state of the game and determine the best move. Subsequently, the Raspberry Pi translates the coordinates of the chosen move into joint angles and sends commands to the robotic arm’s servomotors via I2C to execute the move.

4.2. Computer Vision

The task of the computer vision subsystem is to interpret the input images and to understand the state of the game in real time. The system is based on a YOLO model, an object detection algorithm, which identifies the position of the game pieces on the board. The information is then sent to the controller to make an informed decision.

The computer vision system focuses on identifying and classifying the Tic-Tac-Toe game pieces and determining their positions on the board. This enables the system to understand its current state and move the robotic arm to the correct position to make the next move. Figure 11 illustrates the results of identifying and classifying the elements of the Tic-Tac-Toe game.

The success of computer vision is largely dependent on the quality and preparation of the dataset it uses. Therefore, we collected datasets that cover a wide range of different situations and contexts, providing the necessary diversity for robust object detection.

In total, six annotated datasets were collected from the Roboflow public repository [60], summing up 4207 images. Figure 12 shows some images from these datasets.

After the processing steps (annotating missing objects and removing duplicate or contextually irrelevant images), the final dataset comprises 3000 images. These are divided into 2500 images for training, 375 for validation, and 125 for testing.

The model applied in this project needs to identify and locate objects in real time, be computationally lightweight, and provide high accuracy. The YOLOv5s model was selected from the available pre-built YOLO architectures [61]. YOLOv5 was chosen instead of more recent versions because of its simplicity and its optimization for speed, which make it particularly suitable for deployment on hardware with limited resources, such as the Raspberry Pi used in this study. The smaller YOLOv5s model is especially efficient for real-time applications where low latency is crucial and available resources are constrained, ensuring efficient performance without compromising detection accuracy [62].

In this case, the YOLOv5 model was trained for 300 epochs on the custom dataset. Table 3 lists the network hyperparameters used in the training process, while Figure 13 shows the corresponding performance results [54,55,56]. A deeper analysis of the training results shows that the model achieved its highest accuracy in epoch 284, reaching 98.7%, indicating that it correctly predicted most of the cases. At this epoch, the training loss values ranged from 1.1% for classes to 2.5% for bounding boxes; for the validation phase, the corresponding values are 1.2% and 3.5%, respectively. The recall at this epoch is 95.5%, suggesting that the model is capable of detecting most of the objects in the images.

The mAP values for this epoch reflect the model’s overall accuracy in locating objects. The model reaches an mAP of 98.1% at an IoU threshold of 50%. Under the stricter criterion, with IoU thresholds up to 95%, the value drops to 66.7%. This indicates that the model detects objects well when a loose overlap between the predicted and true bounding boxes is tolerated, but localizes them less precisely when a tight overlap is required. The main performance metrics are shown in Table 4.
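Since the mAP figures depend on the IoU between predicted and true bounding boxes, a minimal Python sketch of the IoU computation may help interpret them. The box format `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions for illustration; this is not the evaluation code used by YOLOv5.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap; zero when the boxes do not intersect
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction shifted by half its width overlaps only one third of the union
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

In this example the prediction would fail a 50% IoU threshold even though it covers half of the true box, which shows how quickly stricter thresholds penalize small localization offsets.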

4.3. Decision-Making

The core of this system is the decision-making algorithm, which is responsible for analyzing the state of the game and determining the optimal move. It uses various strategies, with a particular focus on the Q-learning algorithm, a reinforcement-learning technique that continuously improves its performance by learning from past experiences.

The decision-making process can be described in four steps. First, the algorithm receives information from the computer vision system about the current state of the game board, including the positions of all pieces in play. Next, it determines the optimal move. Third, it issues movement commands to the robotic arm. Finally, it updates its knowledge base on the current outcome.

For comparison purposes, two additional algorithms with different decision processes were developed: a random decision algorithm and the MiniMax algorithm, which evaluates all possibilities and selects the one that produces the highest reward.

4.3.1. Board Movements

The Tic-Tac-Toe game class serves as the foundation of this project and provides the training environment for the RL model. The game board is represented as a grid where each cell starts empty and is updated as moves are made.

Players take turns placing their symbol (‘X’ or ‘O’) in a chosen position, with a function ensuring that moves are only made in unoccupied spaces. If a move is valid, the board updates accordingly, and the turn switches to the next player. If a player attempts to place a symbol in an already occupied cell, the move is rejected, and they must try again. Figure 14 illustrates this logical decision process.

After each move, the game logic checks for a winner by evaluating all possible winning combinations: rows, columns, and diagonals. If a player has filled one of them with his symbol, he is declared the winner. If the board is full and no winner is found, the game ends in a draw. Otherwise, the game continues with the next turn.
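The game logic described in this subsection can be sketched as a small Python class. The class and method names below (`TicTacToe`, `make_move`, `winner`, `is_draw`) and the flat nine-cell board are illustrative assumptions, not the interface of the authors’ implementation.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

class TicTacToe:
    def __init__(self):
        self.board = [" "] * 9    # every cell starts empty
        self.current = "X"        # 'X' moves first

    def available_moves(self):
        return [i for i, c in enumerate(self.board) if c == " "]

    def make_move(self, cell):
        """Place the current symbol; reject the move if the cell is occupied."""
        if self.board[cell] != " ":
            return False          # the same player must try again
        self.board[cell] = self.current
        self.current = "O" if self.current == "X" else "X"
        return True

    def winner(self):
        for a, b, c in WIN_LINES:
            if self.board[a] != " " and self.board[a] == self.board[b] == self.board[c]:
                return self.board[a]
        return None

    def is_draw(self):
        return self.winner() is None and not self.available_moves()

game = TicTacToe()
for cell in [0, 3, 1, 4, 2]:      # X completes the top row
    game.make_move(cell)
print(game.winner())              # X
```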

4.3.2. Random Decision Algorithm

With respect to the player algorithms, the random decision is the simplest. This type of player selects a random action from the set of available moves. It has no strategy for choosing movements and does not consider the current state of the game.

4.3.3. MiniMax Algorithm

The MiniMax algorithm is a more sophisticated approach. The algorithm aims to minimize the opponent’s rewards, while trying to maximize the player’s own rewards. To do this, it assumes that the opponent will always make the best move and seeks to minimize his chances of winning. Algorithm 1 describes the main steps of the MiniMax algorithm.

Algorithm 1 MiniMax Chooses Move

BestMove ⇐ None
BestResult, BestMove ⇐ Minimax(Game, True)
if BestMove Exists then
    PlayMove
end if

Procedure Minimax(Board, Maximizing)
    if Game is in an End State then
        Score ⇐ EvaluateMove
    end if
    if Player is Maximizing then
        BestScore ⇐ −infinity
    else
        BestScore ⇐ +infinity
    end if
    for Cells in Board do
        if Cell is Empty then
            PlayMove
            Minimax(Game, Not Maximizing)
        end if
        if Player is Maximizing then
            if Score > BestScore then
                BestScore ⇐ Score
                BestMove ⇐ Move
            end if
        else
            if Score < BestScore then
                BestScore ⇐ Score
                BestMove ⇐ Move
            end if
        end if
    end for
    return BestScore, BestMove
End Procedure

Thus, this algorithm involves the recursive exploration, in a tree-like structure, of all possible future moves, aiming to continue maximizing the player’s score and minimizing the opponent’s score. The algorithm simulates the state of the game, where a new move implies a new branch of the tree; it alternates the turns of each player and evaluates the outcome of all possible moves until the game reaches a terminal state.

In this state, a score is assigned to the move that led to the outcome:

Victory: The algorithm assigns a positive score if the winning move belongs to the player whose rewards are being maximized, or a negative score if it belongs to the opponent, whose rewards are being minimized.
Tie: The algorithm returns a null score, reflecting that no player has an advantage.

These values are passed back to the tree or branch that initiated the move and used to evaluate which move provides the best result.

In this way, at each level of the game tree, if it is the turn of the player who wants to maximize the rewards, the move with the highest score is selected. On the other hand, if it is the turn of the player who minimizes the rewards, the algorithm chooses the move with the lowest score.

This pattern continues recursively until the algorithm has explored all possible moves and determined the best move to make.

As shown in Figure 15, the MiniMax algorithm decides that the best move for player ‘X’ is in the middle of the right column, since the best decisions in that branch end up in a situation where, in the worst-case scenario, the game ends in a draw [63].
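The recursive search described above can be written compactly in Python. This is a generic sketch of MiniMax for Tic-Tac-Toe under stated assumptions: the board is a flat list of nine cells, the scores are +1, 0 and −1, and the function signature is hypothetical. It is not the authors’ implementation.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player, maximizing, me):
    """Return (score, move) with `player` to move; scores are seen from `me`."""
    won = winner(board)
    if won is not None:                       # terminal state: victory
        return (1 if won == me else -1), None
    moves = [i for i, c in enumerate(board) if c == " "]
    if not moves:                             # terminal state: tie
        return 0, None
    best_score = float("-inf") if maximizing else float("inf")
    best_move = None
    other = "O" if player == "X" else "X"
    for move in moves:
        board[move] = player                  # play the move on the board
        score, _ = minimax(board, other, not maximizing, me)
        board[move] = " "                     # undo it before trying the next one
        if (maximizing and score > best_score) or \
           (not maximizing and score < best_score):
            best_score, best_move = score, move
    return best_score, best_move

# 'X' must block cell 5; every other move lets 'O' complete the middle row
board = ["X", " ", " ",
         "O", "O", " ",
         " ", " ", "X"]
print(minimax(board, "X", True, "X"))   # (0, 5): blocking leads to a draw
```

In the example position, blocking is the only move that avoids a loss, and the score of 0 reflects that the best achievable result against a perfect opponent is a draw.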

4.3.4. Q-Learning

The class implementing the Q-learning algorithm requires three key parameters that control the agent’s learning process: the model’s learning rate, the discount factor, and the value of $ϵ$ used by the $ϵ$-Greedy policy.

Taking into account how this algorithm works, a main function, represented in Algorithm 2, was developed to decide the best move for the present state of the board. This function calls a few supporting functions that generate the available moves, decide between exploiting and exploring according to the $ϵ$-Greedy policy, and calculate the “Q-value” for the state–action pair.

After deciding and making the move, the function also checks whether the game has reached a terminal state (win, lose, or draw) and assigns a corresponding reward: positive for a win, negative for a loss, and slightly positive for a draw.

Next, the function simulates the opponent’s best responses and penalizes the agent if the opponent is likely to win on the next move, encouraging the agent to strategically block the opponent.

Finally, this function updates the “Q-values” in the table according to the temporal difference rule (Equation (10)).

Algorithm 2 Q-learning Chooses Move

AvailableMoves ⇐ Empty Cells in Board
State ⇐ get_gameState(board)
if RandomNumber < Epsilon then
    Action ⇐ Random Available Move
else
    Action ⇐ Available Move With Highest Q-Table Value
end if
Execute Action in board
if Game is in an End State then
    Distribute Rewards
else [Simulate Opponent Move]
    OpponentMoves ⇐ Empty Cells in Board
    for Move in OpponentMoves do
        SimulatedBoard ⇐ Copy(Board)
        Execute Move in SimulatedBoard
        if Game is in an End State then
            Distribute Rewards
        end if
    end for
end if
Final_Reward ⇐ Highest Opponent Reward
Apply the Temporal Difference Rule with the Final Reward
Save the New Q Value into the Q-Table
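A runnable Python sketch in the spirit of Algorithm 2 is given below. It is not the authors’ implementation: the class `QAgent`, its method names, the zero reward for non-terminal moves, the choice of a random opponent, and the decision to apply the update after the opponent’s reply (so that $S'$ is the next board on which the agent moves) are all assumptions made to obtain a self-contained example. The reward values and the default parameters are those reported in the text.

```python
import random
from collections import defaultdict

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def empty_cells(board):
    return [i for i, c in enumerate(board) if c == " "]

class QAgent:
    def __init__(self, symbol="X", alpha=0.1, gamma=0.9, epsilon=0.9):
        self.symbol = symbol
        self.opponent = "O" if symbol == "X" else "X"
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)            # (state, action) -> Q-value

    def choose(self, board):
        state, moves = "".join(board), empty_cells(board)
        if random.random() < self.epsilon:
            return random.choice(moves)                        # explore
        return max(moves, key=lambda m: self.q[(state, m)])    # exploit

    def reward_after(self, board):
        """Reward for the board right after the agent's own move."""
        if winner(board) == self.symbol:
            return 1.0                         # win
        if not empty_cells(board):
            return 0.5                         # draw
        for reply in empty_cells(board):       # simulate each opponent reply
            sim = board.copy()
            sim[reply] = self.opponent
            if winner(sim) == self.opponent:
                return -1.0                    # the opponent could win next move
        return 0.0                             # assumed neutral reward otherwise

    def update(self, state, action, reward, next_board):
        """Temporal difference rule of Equation (10)."""
        moves = empty_cells(next_board)
        terminal = winner(next_board) is not None or not moves
        next_state = "".join(next_board)
        best_next = 0.0 if terminal else max(self.q[(next_state, m)] for m in moves)
        key = (state, action)
        self.q[key] += self.alpha * (reward + self.gamma * best_next - self.q[key])

def play_episode(agent):
    """One training game against a random opponent; the agent moves first."""
    board = [" "] * 9
    while True:
        state, action = "".join(board), agent.choose(board)
        board[action] = agent.symbol
        reward = agent.reward_after(board)
        if winner(board) is None and empty_cells(board):
            board[random.choice(empty_cells(board))] = agent.opponent
            if winner(board) == agent.opponent:
                reward = -1.0                  # the opponent actually won
        agent.update(state, action, reward, board)
        if winner(board) is not None or not empty_cells(board):
            return winner(board)

random.seed(0)
agent = QAgent()
results = [play_episode(agent) for _ in range(200)]
```

The look-ahead in `reward_after` plays the role of the opponent simulation in Algorithm 2: a move that leaves an immediate winning reply open is penalized even if the opponent does not take it, which pushes the agent toward blocking.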

To develop an algorithm capable of winning the Tic-Tac-Toe game, the model was trained for a total of 310,000 games against the various possible opponents, using a decreasing value of $ϵ$ against each opponent. The Q-learning parameters are:

Learning Rate: 0.1.
Discount Factor: 0.9.
Epsilon ($ϵ$): Starting at 0.9 and decreasing to 0.01, with a decay rate of 0.1 for every batch of 10% of total games.
Rewards: +1 for a win, +0.5 for a draw, and −1 for a loss.

The selection of Q-learning parameters follows standard RL principles and was adjusted and validated during the initial experiments. A learning rate of 0.1 was chosen to ensure stable but sufficiently responsive updates to the Q-table, avoiding oscillation while leaving enough margin to incorporate new information. The discount factor of 0.9 encourages the agent to prefer long-term strategic advantages over immediate but potentially suboptimal moves, which is relevant even in a short episodic task. The exploration parameter $ϵ$ starts at 0.9 to promote extensive exploration of the state space during the early training stages, then decays to 0.01 (with a 0.1 decay step every 10% of the total games) to progressively shift the agent towards exploitation of the learned policy. Finally, the reward structure reflects the asymmetry between desirable and undesirable outcomes, providing the agent with clear guidance to avoid losing states, while still recognizing the value of forcing a draw when a win is not possible.
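One possible reading of this schedule is a stepwise decay, sketched below in Python. The interpretation that $ϵ$ drops by 0.1 at the start of each batch and is clamped at 0.01 is an assumption, as are the function name `epsilon_for_game` and its parameters; the text does not specify how the final value of 0.01 is reached.

```python
def epsilon_for_game(game_index, total_games, start=0.9, end=0.01, step=0.1):
    """Stepwise exploration rate: one decay step per batch of 10% of the games."""
    batch_size = max(1, total_games // 10)
    batch = min(game_index // batch_size, 9)    # batch number, 0 to 9
    # Rounding removes floating-point noise such as 0.09999999999999998
    return max(end, round(start - step * batch, 10))

# Exploration rate at the start of each of the ten batches
schedule = [epsilon_for_game(b * 100, 1000) for b in range(10)]
print(schedule)
```

Under this reading, the first nine batches use values from 0.9 down to 0.1, and the last batch uses the floor of 0.01, so the final 10% of the games are played almost entirely greedily.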

The results obtained at the end of training are shown in Table 5. Figure 16 and Figure 17 illustrate the progression of the win/draw/loss rates for the trained Q-learning player (as Player ‘X’ and Player ‘O’, respectively) during training against the different predefined opponents: random, new Q-learning, trained Q-learning, and MiniMax. When the Q-learning agent plays as Player ‘X’, it exhibits unstable performance at the beginning of training, because the high exploration rate ($ϵ$ = 0.9) leads to largely random action selection. This is visible in the initial segments of the curves, where win and loss rates fluctuate significantly. As $ϵ$ gradually decreases, the agent begins to exploit the accumulated Q-values more consistently, and the curves start to stabilize. Against the weaker opponents, the random and the new Q-learning players, the win rate increases rapidly, indicating that the agent quickly learns to avoid losing positions and converges toward the optimal Tic-Tac-Toe policy. Against stronger or deterministic opponents, such as MiniMax, the learning curve shows a slower improvement. In these cases, the model gradually reduces the loss rate but converges to a high proportion of draws, which aligns with the theoretical outcome of optimal Tic-Tac-Toe play.

On the other hand, when the agent plays as Player ‘O’, the learning curve differs noticeably. Because Tic-Tac-Toe favors the first player, the Q-learning agent has limited opportunity to achieve non-losing outcomes when facing a perfect opponent. This is reflected in the learning curves: even after extensive training, the agent’s draw rate increases only marginally, while the loss rate remains high. This behavior is consistent with game theory: an optimal first player can always force at least a draw, and a suboptimal move by the second player can be punished with a forced loss. Despite this, the slight improvement in draw rate observed late in training indicates that the agent learns to avoid the most immediate losing responses but cannot fully overcome the first-player advantage against MiniMax-level play.

5. Results

This section presents the main results obtained for the individual components and for a test game played between a human player and the trained Q-learning player.

5.1. Game Flow

The backend of this system operates through a structured series of steps, ensuring smooth transitions between the different phases of the game. From the initial setup, through each turn, to the final outcomes, the control logic ensures that all eventualities are accounted for. The main components include game setup, player turn switching, integration of YOLO detections, and control of the robot’s physical movements.

When the play button is pressed on the graphical interface, the system retrieves the user-configured settings, including player types and algorithm parameters. These configurations are then applied to initialize the environment and prepare the game for execution.

Once the setup is complete, the system enters the main game loop (Figure 18), alternating turns between Player 1 and Player 2. After each move, the state of the game board is updated and the system checks whether a winning or draw condition has been met.

For human players, moves performed in real life are registered through the graphical interface. For nonhuman players, the system determines the optimal move based on the selected algorithm and updates the board accordingly.

After this decision, the chosen move is mapped to real-world coordinates, allowing the robotic arm to perform the action. The process of robotic arm movement involves:

Hover over the detected object—The end effector moves to the center x and y of the detection with a predefined elevated z.
Lower the end effector—The end effector remains over the center x and y of the detection, but lowers the z to mark the cell where the move will be made.
Return to base—The end effector returns to the predefined initial position.
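The three-step motion above can be expressed as a short list of Cartesian targets. The following Python sketch is purely illustrative: the heights `HOVER_Z_CM` and `TOUCH_Z_CM`, the home position `HOME_XYZ_CM`, and the function name are hypothetical values and names, not parameters reported in the text.

```python
HOVER_Z_CM = 8.0                   # hypothetical elevated height above the board
TOUCH_Z_CM = 1.0                   # hypothetical height used to mark the cell
HOME_XYZ_CM = (0.0, 10.0, 12.0)    # hypothetical initial position

def move_waypoints(x_cm, y_cm):
    """Return the hover, lower, and return-to-base targets for one move."""
    return [
        (x_cm, y_cm, HOVER_Z_CM),  # 1. hover over the center of the detection
        (x_cm, y_cm, TOUCH_Z_CM),  # 2. lower the end effector onto the cell
        HOME_XYZ_CM,               # 3. return to the initial position
    ]

waypoints = move_waypoints(5.0, 15.0)
print(waypoints)
```

Each target in the list would then be converted to joint angles before being sent to the servos.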

To ensure precise movements, the system converts the detected image coordinates from pixels to real-world measurements. The center of the detected object is identified in the image, scaled accordingly, and mapped to the board’s physical dimensions. Using inverse kinematics, the necessary angles are computed to guide the robotic arm accurately. These computed angles are then sent to the servo control module, which executes the movements in sequence.
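To illustrate the inverse kinematics step, the following Python sketch solves a generic three-degree-of-freedom arm (base rotation plus a two-link planar chain) for a target position. It is a textbook formulation under stated assumptions, not the kinematic model derived in this paper: the link lengths, the shoulder height, and the function names are hypothetical.

```python
import math

L1_CM, L2_CM = 10.0, 12.0    # hypothetical upper-arm and forearm lengths
SHOULDER_Z_CM = 7.0          # hypothetical height of the shoulder joint

def inverse_kinematics(x, y, z):
    """Return (base, shoulder, elbow) angles in radians for a target in cm."""
    base = math.atan2(y, x)                  # rotate the arm plane toward the target
    r = math.hypot(x, y)                     # horizontal distance from the base axis
    dz = z - SHOULDER_Z_CM                   # height relative to the shoulder
    cos_elbow = (r * r + dz * dz - L1_CM ** 2 - L2_CM ** 2) / (2 * L1_CM * L2_CM)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target outside the reachable workspace")
    elbow = -math.acos(cos_elbow)            # always pick the same (elbow-up) branch
    shoulder = math.atan2(dz, r) - math.atan2(L2_CM * math.sin(elbow),
                                              L1_CM + L2_CM * math.cos(elbow))
    return base, shoulder, elbow

def forward_kinematics(base, shoulder, elbow):
    """Recompute the end-effector position, used here to check the solution."""
    r = L1_CM * math.cos(shoulder) + L2_CM * math.cos(shoulder + elbow)
    z = SHOULDER_Z_CM + L1_CM * math.sin(shoulder) + L2_CM * math.sin(shoulder + elbow)
    return r * math.cos(base), r * math.sin(base), z

angles = inverse_kinematics(12.0, 9.0, 3.0)
print([round(math.degrees(a), 1) for a in angles])
print(forward_kinematics(*angles))           # should reproduce the target position
```

Selecting a single branch for the elbow angle keeps the arm in one consistent configuration, and the explicit workspace check rejects targets that the two links cannot reach.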

Throughout the game, the system continuously monitors for a terminal state. Once a winner is determined or a draw is reached, the game ends and the interface presents an option to restart, reset the board, and prepare the system for a new round.

5.2. Final Tests

During the final tests, the system obtained very positive results, managing to fulfill the proposed objective and effectively play the Tic-Tac-Toe game.

For the experimental setup, the robotic arm was physically configured to operate with three effective degrees of freedom. As a result, the system does not perform pick-and-place manipulation. Instead, the robotic arm executes point-reaching motions to indicate the selected board cell corresponding to the decision made by the control algorithm.

To evaluate the precision of the robotic arm during positioning tasks, a set of predefined target coordinates was selected on the game board. The arm was commanded to reach each position under identical conditions in five independent trials, and the end-effector coordinates were measured in each trial. All measured coordinates are defined with respect to the reference frame of the robot base: the X-Y plane is aligned with the game board, while the Z-axis represents vertical displacement. The origin of the reference frame is located at the center of the robot base. Position measurements were obtained manually using a ruler-based method, with an estimated precision of ±0.5 mm.

Table 6 summarizes the recorded data, showing that the robot consistently reached positions within a small spatial variance, with typical deviations less than 1 cm. Analysis of the measurements allows the repeatability of the robot to be quantified relative to its own biased mean position. The results show that the arm repeatedly returns to within approximately ±0.3 cm along the X-axis and ±0.1–0.2 cm along the Y-axis, Z-axis and radial distance. The maximum errors obtained were 1.5 cm in X, 0.8 cm in Y and 0.6 cm in Z. The mean ± standard deviation of the errors was $-0.02 \pm 0.57$ cm in X, $0.34 \pm 0.29$ cm in Y and $-0.38 \pm 0.14$ cm in Z, indicating a negligible bias in X, a small positive bias in Y and a small negative bias in Z, with Z being the most consistent axis.
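The bias and spread figures reported above are of the kind produced by the following Python sketch, which summarizes the errors of repeated measurements along one axis. The measurement values are invented for illustration and are not taken from Table 6, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev

def error_summary(measured, target):
    """Return (mean error, sample standard deviation, maximum absolute error)."""
    errors = [m - target for m in measured]
    return mean(errors), stdev(errors), max(abs(e) for e in errors)

# Hypothetical five trials for a single target coordinate of 10.0 cm
trials_cm = [10.2, 9.8, 10.1, 9.9, 10.0]
bias, spread, worst = error_summary(trials_cm, target=10.0)
print(f"{bias:.2f} ± {spread:.2f} cm, max {worst:.2f} cm")
```

A mean error close to zero with a nonzero spread, as in this invented example, corresponds to random rather than systematic error, whereas a nonzero mean with a small spread indicates a calibration offset.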

Figure 19 provides additional information on the accuracy and behavior of the system. The scatter plot in Figure 19a shows that the measured points form tightly defined clusters, demonstrating strong repeatability across trials. Most clusters are shifted upward, revealing a consistent positive bias in the Y-direction. The deviations along X are small and symmetric, suggesting that the errors along the X-axis are primarily random rather than systematic. Because all target positions exhibit similar patterns, the accuracy characteristics appear uniform throughout the workspace. Overall, the distribution of points indicates that the dominant error source is a stable calibration offset rather than measurement noise. Moreover, the box plot in Figure 19b shows that the X-axis exhibits the largest variability, indicating reduced measurement stability, although its median remains close to zero, implying that there is no significant systematic bias. The Y-axis is more stable but consistently overestimates the true values, confirming the positive Y-bias observed in the scatter plot. The Z-axis shows the smallest spread, making it the most precise of the three axes; however, it systematically underestimates the true value by approximately 0.4 cm.

Analyzing the computer vision results through the precision–recall curve, the recall–confidence curve, the F1-confidence curve, and the confusion matrix of the trained YOLO model (Figure 20), we can see that the model can detect most classes with high accuracy, 85% and above. However, it can be seen that the model has difficulty distinguishing the class of the central cell of the playing field (‘22’) from that of the field (‘Field’). This confusion is understandable, as the central cell can be misinterpreted as a smaller field.

To validate the Q-learning model, the exploration rate $ϵ$ was reduced to its minimum of 0.01 and the algorithm was tested again against its opponents. The results in Table 7 show that the trained model effectively seeks positions in which it wins or, at worst, draws the game, unlike a new untrained Q-learning algorithm (Table 8).

The inclusion of average move counts quantifies the efficiency of the trained model against an untrained algorithm. The trained agent generally completes the games in fewer moves than the untrained agent, particularly in winning scenarios. This difference suggests that the learned “Q-values” successfully guide the agent toward earlier forced wins or earlier detection of forced-draw positions.

Figure 21 and Figure 22 show an example of snapshots of a test game between a human player and the Q-learning player.

5.3. Discussion

The Q-learning algorithm proved effective in making decisions and in adapting to the context of the task: the system consistently sought the outcome with the greatest reward, a win or, in the worst case, a draw, and thus avoided losing the game whenever possible.

Similarly, the computer vision module, the trained YOLO model, allowed the robotic arm to interact with the surrounding environment and physically execute the moves, producing minimal positioning errors and deviations within acceptable limits for the task (up to 1 cm). Although the accuracy of the computer vision model is high, it is occasionally affected by image conditions. Optimization of this component by implementing different models with greater accuracy may lead to a more robust system.

However, in order to achieve these results, it was necessary to adapt the primary implementation of this system several times to solve the various problems that arose, from the less accurate results of the YOLO model, to the high defeat rate of the Q-learning algorithm, to the mechanical problems of the robotic arm.

In addition, the Q-learning algorithm, although effective in simple contexts, may not adapt well to more complex environments without modifications to its basic logic. The integration of more sophisticated decision-making algorithms to allow the system to handle other types of tasks is one of the aspects to be investigated.

In this work, inverse kinematics is applied to a simplified three-degree-of-freedom manipulator, with the objective of reaching target positions on a 2D game board rather than controlling a full end-effector orientation. Taking this into consideration, the task and workspace were intentionally constrained, and all target positions lie well within the reachable region of the manipulator. During experimental operation, no singular configurations were encountered, so singularity detection or avoidance was not explicitly implemented. When multiple inverse kinematic solutions existed, a single consistent configuration was selected to ensure repeatability and stable motion. Future extensions of this work will consider more general manipulation tasks, where explicit singularity handling will be required.

Although the AI-driven robotic control techniques developed in this work have been applied to the Tic-Tac-Toe game, the underlying principles are applicable to a wide range of real-world scenarios. The integration of computer vision and RL enables robotic manipulators to identify, pick, and assemble parts, which is useful in various industrial operations, such as on manufacturing lines. As an example, the YOLOv5-based perception system can be trained to detect objects, parts, or defects instead of game symbols. The bounding boxes obtained from detection can be directly mapped to robot coordinates, exactly as performed for the board cells in the Tic-Tac-Toe setup. The state extraction mechanism used to interpret the board configuration can be reformulated to represent object presence, orientation, or classification labels in a workspace. The Q-learning algorithm can be adapted to select discrete industrial actions such as pick, place, sort, reject, or reorient, rather than selecting a game move, while reinforcement learning can optimize task sequencing, grasp selection, or sorting strategies.

The platform can also be used for educational purposes to teach programming, logic, and AI concepts in games or other complex dynamic environments. In addition, the use of a physical robotic system as a benchmark to compare AI decision-making strategies, such as MiniMax and Q-learning, provides a valuable testing environment for implementing and evaluating new algorithms in realistic scenarios.

5.4. Future Directions

Future research will focus on improving spatial accuracy when transforming computer vision results into robot movements. The current implementation relies on fixed scaling factors from pixels to real-world coordinates, which, although functional under controlled conditions, lack the robustness of a fully calibrated geometric model. Future work should incorporate a complete camera calibration process, including intrinsic parameter estimation, lens distortion correction, extrinsic calibration relative to the board, and the computation of a board-plane homography or PnP-based pose estimation. Establishing well-defined coordinate frames and performing an error-propagation analysis would further increase reproducibility and ensure precise end-effector positioning under varying environmental conditions.

Another important direction involves expanding the evaluation of the reinforcement-learning model. Although the present results compare trained and untrained Q-learning agents and report win/draw/loss rates, additional analysis would strengthen the characterization of agent behavior. This includes measuring the rate at which the agent forces a draw under different initial configurations, conducting statistical significance tests to quantify the advantages of the first player versus the second player, and studying the effects of alternative $ϵ$-schedules, learning rates, and reward structures. A learning-curve and sample-efficiency analysis would also help establish the minimal training requirements to achieve strong performance.

6. Conclusions

This work developed a system that integrated computer vision and decision-making algorithms to enable a robotic arm to perform an interactive task: playing the game of Tic-Tac-Toe autonomously.

The computer vision module applied a trained YOLOv5 model to accurately detect and locate the symbols on the game board. Decision-making based on a Q-learning algorithm allowed the robot to select the best moves to make based on the current state. Finally, the control of the robotic arm converted the system’s decisions into movements, allowing the robot to interact with the game board in real time.

The results obtained are very promising and show that the developed system is indeed capable of playing the game autonomously. Moreover, this project allowed us to combine computer vision techniques with reinforcement learning to make autonomous decisions in real scenarios. Our approach seamlessly combines all of these methods to operate in real time, enabling more robust and adaptive interactions in dynamic environments.

Future developments of the work could investigate the use of newer architectures, including more recent YOLO versions and transformer-based models, and assess their suitability for real-time operation on resource-constrained hardware. Also, the system can be expanded to perform tasks other than the Tic-Tac-Toe game, such as more complex board games or industrial applications that require precision and adaptability. The ability of the system to learn and adapt through past experiences introduces numerous potential applications.

Author Contributions

Conceptualization, A.M.T. and R.S.B.; methodology, A.M.T. and R.S.B.; software, A.M.T.; validation, A.M.T. and R.S.B.; formal analysis, A.M.T. and R.S.B.; investigation, A.M.T.; writing—original draft preparation, A.M.T.; writing—review and editing, A.M.T., R.S.B. and I.S.J.; visualization, A.M.T. and R.S.B.; supervision, R.S.B. and I.S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The source code for this study is publicly available at: https://github.com/Afonso-Timoteo/RoboticArm-ReinforcementLearning (accessed on 16 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Robotic arm structure.

Figure 2. Robotic arm architecture.

Figure 3. Axis configuration.

Figure 4. Forward kinematics, simulation and real tests: (a) Simulation of angles (0°, 0°, 0°), (b) real angle position (0°, 0°, 0°), (c) simulation of angles (0°, 90°, 0°), (d) real angle position (0°, 90°, 0°), (e) simulation of angles (0°, 90°, 90°), (f) real angle position (0°, 90°, 90°).

Figure 5. Angles and trigonometry of the model.

Figure 6. Shoulder and elbow triangles.

Figure 7. YOLO model in operation.

Figure 8. Network architecture of YOLOv5.

Figure 9. Q-learning algorithm.

Figure 10. System architecture.

Figure 11. Computer vision results.

Figure 12. Examples of the dataset.

Figure 13. Overall performance of the trained model.

Figure 14. Function to make board moves.

Figure 15. Logic of the MiniMax algorithm.

Figure 16. Win/draw/loss rates of the trained Q-learning player, playing as ‘X’, against various opponents: (a) Random, (b) new Q-learning, (c) trained Q-learning, (d) MiniMax.

Figure 17. Win/draw/loss rates of the trained Q-learning player, playing as ‘O’, against various opponents: (a) Random, (b) new Q-learning, (c) trained Q-learning, (d) MiniMax.

Figure 18. Main game loop.

Figure 19. Analysis of measurement data: (a) Measured vs. target positions (X-Y), (b) box plot.

Figure 20. Results of trained model: (a) Precision–recall curve, (b) recall–confidence curve, (c) F1–confidence curve, (d) confusion matrix.

Figure 21. First turn of the Q-learning player: (a) Player 2’s move choice, (b) first movement of the robotic arm, (c) second move of the robotic arm, (d) third move of the robotic arm.

Figure 22. Second turn of the human player: (a) Player 1’s move, (b) board state analysis.

Table 1. Denavit–Hartenberg parameters for the model.

Link	$a_{i}$	$α_{i}$	$d_{i}$	$ϕ_{i}$
1	0	$\frac{π}{2}$	$L_{0} + L_{1}$	$θ_{1}$
2	$L_{2}$	$π$	0	$θ_{2}$
3	$L_{3}$	0	0	$θ_{3} - \frac{π}{2}$
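To illustrate how the parameters in Table 1 translate into an end-effector position, the sketch below chains the three link transforms. It assumes the standard Denavit–Hartenberg convention (rotation about z by $ϕ_{i}$, translation $d_{i}$ along z, translation $a_{i}$ along x, rotation about x by $α_{i}$). The default link lengths (`base_height` for $L_{0} + L_{1}$, `L2`, `L3`) are not taken from the text; they are values inferred so that the sketch reproduces the positions listed in Table 2, and may differ from the dimensions of the actual arm.

```python
import math


def dh_transform(a, alpha, d, phi):
    """4x4 homogeneous transform of one link (standard DH convention, assumed)."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cp, sp = math.cos(phi), math.sin(phi)
    return [
        [cp, -sp * ca, sp * sa, a * cp],
        [sp, cp * ca, -cp * sa, a * sp],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(A, B):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def forward_kinematics(theta1, theta2, theta3,
                       base_height=10.0, L2=6.5, L3=11.0):
    """Return (x, y, z) of the end effector; angles in radians.

    The default lengths are inferred, hypothetical values (see lead-in).
    """
    rows = [  # (a_i, alpha_i, d_i, phi_i), following Table 1
        (0.0, math.pi / 2, base_height, theta1),
        (L2, math.pi, 0.0, theta2),
        (L3, 0.0, 0.0, theta3 - math.pi / 2),
    ]
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for a, alpha, d, phi in rows:
        T = matmul(T, dh_transform(a, alpha, d, phi))
    return T[0][3], T[1][3], T[2][3]


for angles in [(0, 0, 0), (0, 90, 0), (0, 90, 90)]:
    x, y, z = forward_kinematics(*(math.radians(v) for v in angles))
    print(angles, round(x, 2), round(z, 2))
```

Under these assumptions, the three printed configurations can be compared directly with the x and z columns of Table 2.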

Table 2. Forward kinematics test.

$θ_{i}$	x	z
(0°, 0°, 0°)	6.5	21.0
(0°, 90°, 0°)	−11.0	16.5
(0°, 90°, 90°)	0.0	27.5

Table 3. Model training hyperparameters.

Parameter	Value
Epochs	300
Optimizer	SGD
Initial Learning Rate	0.01
Final Learning Rate	0.01
Momentum	0.937
Weight Decay	0.0005
IoU Threshold	0.20

Table 4. Model performance metrics.

Epoch	Precision	Recall	mAP@0.5	mAP@0.5:0.95
284	98.7%	95.5%	98.1%	66.7%

Table 5. Q-learning training results.

Opponent	Victory (as ‘X’)	Draw (as ‘X’)	Victory (as ‘O’)	Draw (as ‘O’)
Random	77.38%	13.19%	32.73%	13.57%
Q-Learning New	56.36%	19.56%	32.28%	15.42%
Q-Learning Trained	55.90%	19.95%	21.19%	19.59%
MiniMax	0.00%	52.19%	0.00%	1.16%

Table 6. Position results of robotic arm (in cm).

Target Position (x, y, z)	1st Try	2nd Try	3rd Try	4th Try	5th Try
(3.0, −13.0, 11.5)	(2.7, −12.3, 11.3)	(3.0, −12.2, 11.5)	(3.0, −12.4, 11.0)	(2.9, −12.5, 11.0)	(2.9, −12.3, 11.1)
(−3.0, −13.0, 11.5)	(−2.2, −12.2, 11.1)	(−2.7, −12.2, 11.0)	(−2.5, −12.4, 10.9)	(−2.4, −12.4, 11.1)	(−2.8, −12.2, 11.1)
(6.0, −13.0, 11.5)	(5.2, −13.0, 11.1)	(4.5, −13.1, 11.0)	(5.0, −13.0, 10.9)	(5.6, −12.7, 11.1)	(5.5, −13.0, 11.0)
(−6.0, −13.0, 11.5)	(−5.0, −13.0, 11.1)	(−5.5, −12.8, 11.0)	(−5.9, −12.8, 11.1)	(−5.8, −12.7, 10.9)	(−5.2, −13.1, 11.0)
(5.0, −16.0, 11.5)	(4.9, −15.8, 11.3)	(4.5, −15.7, 11.2)	(4.9, −15.8, 11.3)	(4.6, −15.8, 11.2)	(4.2, −15.9, 11.3)
(−5.0, −16.0, 11.5)	(−4.5, −15.8, 11.2)	(−4.5, −15.8, 11.2)	(−4.8, −15.7, 11.3)	(−5.0, −15.6, 11.2)	(−5.2, −15.5, 11.2)
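One possible way to summarize a row of Table 6 is the mean Euclidean distance between the measured and target positions. The sketch below does this for the first target, using the five measurements from the table; the choice of Euclidean distance as the error metric is an assumption made here for illustration and is not necessarily the analysis behind Figure 19.

```python
import math

# First row of Table 6 (values in cm)
target = (3.0, -13.0, 11.5)
tries = [
    (2.7, -12.3, 11.3),
    (3.0, -12.2, 11.5),
    (3.0, -12.4, 11.0),
    (2.9, -12.5, 11.0),
    (2.9, -12.3, 11.1),
]


def mean_position_error(target, measurements):
    """Mean Euclidean distance (cm) between each measurement and the target."""
    errors = [math.dist(m, target) for m in measurements]
    return sum(errors) / len(errors)


mean_error = mean_position_error(target, tries)
print(f"mean position error: {mean_error:.2f} cm")
```

For this row the mean error comes out at roughly 0.78 cm, most of it along the y axis.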

Table 7. Final results of a trained Q-learning player.

Opponent	Victory (as ‘X’)	Draw (as ‘X’)	Avg. Moves (as ‘X’)	Victory (as ‘O’)	Draw (as ‘O’)	Avg. Moves (as ‘O’)
Random	96.0%	3.0%	5.75	65.0%	22.0%	7.33
Q-Learning New	100.0%	0.0%	6.98	2.0%	2.0%	7.06
Q-Learning Trained	100.0%	0.0%	7.0	0.0%	2.0%	7.04
MiniMax	0.0%	99.0%	8.99	0.0%	0.0%	7.0

Table 8. Final results of an untrained Q-learning player.

Opponent	Victory (as ‘X’)	Draw (as ‘X’)	Avg. Moves (as ‘X’)	Victory (as ‘O’)	Draw (as ‘O’)	Avg. Moves (as ‘O’)
Random	79.0%	4.0%	6.59	47.0%	3.0%	7.15
Q-Learning New	71.0%	28.0%	8.27	2.0%	36.0%	8.42
Q-Learning Trained	99.0%	0.0%	7.03	1.0%	0.0%	7.01
MiniMax	0.0%	98.0%	8.94	0.0%	1.0%	7.02

Share and Cite

MDPI and ACS Style

Timóteo, A.M.; Barbosa, R.S.; Jesus, I.S. Robotic Arm Control Using a Q-Learning Reinforcement Algorithm. Robotics 2026, 15, 50. https://doi.org/10.3390/robotics15030050
