1. Introduction
Today, the landscape of technology and innovation is driven by unprecedented advances in the field of artificial intelligence (AI), which are producing a significant impact on the integration of computational systems across various sectors [1, 2, 3, 4, 5]. This emerging technology is present in nearly every aspect of our lives, reshaping industries, transforming businesses, and redefining how we approach everyday situations.
AI can be defined as the ability of a digital computer to perform tasks typically associated with human intelligence. Systems that exhibit mental processes such as reasoning, discovery of meanings, generalization, and learning from past experiences are referred to as artificially intelligent systems. These systems are capable of recognizing patterns, learning from data, and making intelligent decisions without being explicitly programmed. The advancement of technology, driven by AI, has revolutionized society and required constant human adaptation. The integration of AI across diverse domains has enabled achievements that were once considered unthinkable.
In the field of robotics, AI has significantly expanded the capabilities of robots and related technologies [6, 7, 8, 9, 10]. Once confined to repetitive tasks on assembly lines, robots have undergone a remarkable evolution. They are now capable of performing complex and dynamic tasks, such as delicate surgical procedures, precise manufacturing processes, and even ambitious space exploration missions [11, 12, 13, 14, 15]. Integration of AI algorithms is essential for robots to interpret and understand signals from their sensors, allowing them to interact with the surrounding environment in a safe and intelligent manner [16, 17, 18, 19]. The field of robotics powered by AI-driven control algorithms has advanced rapidly and consistently in recent years. Researchers and international organizations are constantly exploring innovative ways to integrate intelligent control systems into small machines, with the aim of enhancing various aspects of daily life. Despite its exciting potential, this domain is often complex and not easily accessible. AI models typically require extensive training and careful implementation to ensure that they are equipped to handle the wide range of situations they may encounter. This study explores this context and applies the concepts investigated to the development and implementation of a robotic arm controlled by artificial intelligence algorithms to perform interactive tasks.
The work involves the design and implementation of an integrated robotic system capable of detecting objects using computer vision, making decisions based on strategies and logic, and performing physical tasks with a robotic arm. The main focus is on developing an intelligent module capable of performing an interactive task such as playing the Tic-Tac-Toe game. Although each component used in our system, YOLO for detection, Q-learning for decision-making, and a 5-DOF manipulator controlled through kinematic equations, is individually well established, our contribution lies in the development of a unified framework that seamlessly integrates real-time computer vision, decision-making strategies, and manipulator control into a single integrated system, demonstrated experimentally on a physical platform. Specifically, most existing studies on reinforcement learning applied to Tic-Tac-Toe are carried out in simulation [
20] or through grid-based symbolic representations [
21]. However, our work demonstrates that a reinforcement-learning agent can interact with the physical world through visual perception under real-world constraints and still achieve good performance. This addresses a gap in the literature, as few studies evaluate how classical learning algorithms perform when embedded in a full perception–decision–action cycle on a real manipulator. Furthermore, this application can be extended as a benchmark for tasks such as assembly, inspection, packaging, and sorting in industrial manipulation scenarios. YOLOv5 provides fast detections of board cells, parts, and defects, which can be mapped to symbolic states (such as cell occupancy, part type, or defect presence) or to metric goals like pick coordinates, following the same perception-to-state conversion used in the game. A Q-learning agent that selects discrete actions (place X or O in a cell) extends naturally to discrete industrial actions such as picking or placing, inspecting a region, or accepting or rejecting products [
22, 23]. Platforms such as DeepClaw [24] demonstrate this by using a simple Tic-Tac-Toe setup as an initial benchmark and then extending the same hardware and control pipeline to more realistic tasks, including bin clearing, jigsaw assembly, and sorting. Moreover, the proposed framework can serve as an effective educational and experimental platform for robotics, illustrating the integration of perception, decision-making, control, and actuation at both software and hardware levels. It presents a practical implementation of the robotic system, which can be used as a basis for further projects and for teaching robotics.
The system is operated through a graphical interface that enables real-time monitoring and interaction, enhancing the user’s ability to control and observe the robotic arm’s actions. Users can control the angles of the servomotors, specify the final position of the actuator, and visualize object detections in real time. In addition, the interface presents the optimal move for the current state of the game board, ensuring an intuitive and efficient user experience.
The main contributions of the developed work are as follows:
Design and implementation of an integrated robotic system that combines computer vision, decision-making strategies, and robotic arm control to perform interactive tasks.
Real-time control of a five-degree-of-freedom robotic arm by applying kinematic equations for robot movements.
Development and comparison of three different decision-making algorithms (random, MiniMax and Q-learning) applied to a unified robotic framework, providing valuable insights into their comparative strengths and practical performance in a physical environment.
Application of a Q-learning reinforcement algorithm to control a real robotic arm in an interactive environment, highlighting the practical effectiveness of RL in real-world dynamic scenarios.
The article is structured as follows. Section 2 provides an overview of recent works in the field of AI applied to robotics and related areas. Section 3 describes the architecture of the system, focusing on the structure and main components of the robotic arm, the direct and inverse kinematics required for robot control, the object detection algorithm, and the Q-learning algorithm. The implementation of the system is given in Section 4. Here, the system architecture, as well as the training and application of the AI algorithms, are detailed. Section 5 presents the results of the test games played between a human player and the trained Q-learning player, together with a discussion of the experiments conducted. Finally, Section 6 draws the main conclusions and addresses future developments of the work.
2. Related Work
This section reviews key studies that provided the foundational information and methodologies for this work.
In the robotics area, the work by [25] traces the key inventions underlying the robotic concept, demonstrating that the idea of robots predates the study and discovery of electrical systems. Similarly, the survey by [
26] offers a comprehensive overview of the evolution of robotic arms over the past 20 years, focusing on the parameters and characteristics that influence their performance. The book by [
7] provides an in-depth examination of robotic systems, covering their essential components, control architectures, and functionalities. The authors in [
27] focus on the development of industrial robots, with a historical analysis spanning the 1950s to the early 1990s. In [
16], the authors introduce key concepts in robotics, addressing current and emerging topics such as machine learning, ethics, human–robot interaction, and design thinking.
In the domain of artificial intelligence and machine learning, the review by [
28] explores frequently used algorithms and methods, providing fundamental information on the capabilities and limitations of AI techniques. They evaluated the integration of machine learning algorithms with traditional methodologies, providing valuable guidance on applying AI in practical scenarios. In [
29], an efficient lightweight Convolutional Neural Network (CNN) model is presented for the detection of surface defects in industrial products, specifically designed to overcome the high computational requirements of conventional CNNs. The proposed Coordinate Attention Mobile (CAM) backbone network uses inverse residual structures and the Coordinate Attention (CA) mechanism for efficient feature extraction. Multi-scale strategies are used to improve the detection of small objects, improving both accuracy and robustness. A novel Bidirectional Weighted Feature Pyramid Network (BWFPN) is introduced for feature fusion. The proposed model achieves a detection accuracy comparable to that of the state-of-the-art approaches. The work of [
4] compares supervised machine learning algorithms, evaluating their efficiency in different datasets. The book by [
30] is a comprehensive resource on deep learning, explaining the core concepts and architectures in the design of AI models. The foundational reference [
31] provides the fundamental ideas and algorithms of reinforcement learning (RL), covering both the theoretical and practical aspects of this field.
Several studies have explored the integration of AI into robotics to perform specialized tasks. These studies offer valuable insight into the design and implementation of robotic systems. The work of [
32] presents the design and implementation of a robotic arm for playing chess with a pick-and-drop mechanism. This study introduces a kinematic calculation framework for the robotic arm and a smart chessboard that assists the control algorithm by providing the precise positions of each chess piece on the board. Another study by [
33] describes a real-time autonomous chess robotic system designed to compete against human opponents. Their system includes a computer vision module for detecting chess pieces, a popular chess engine for selecting optimal moves, and a grid-based position calculation system to guide the robotic arm’s movements accurately. An earlier but foundational study by [
34] explores the application of artificial neural networks to control a robotic arm in a Tic-Tac-Toe game. Their work highlights the potential of neural networks to enable strategic decision-making and precise control in simple game-playing scenarios. The study [
35] proposes a fast, learning-based algorithm that generalizes beyond the traditional 3 × 3 board to solve larger Tic-Tac-Toe boards efficiently, overcoming the slowness of the MiniMax approach and achieving strong results that could extend to other strategy games such as Minesweeper, Chess, and Go. In [
36], the authors present an XY-plotter that plays Tic-Tac-Toe using stepper motors controlled by a microcontroller. A vision algorithm detects human moves, and MiniMax provides optimal game decisions. The system combines robotics with human–computer interaction for educational and interactive uses. The experiments show accurate and efficient real-time gameplay. The study highlights the consistency of deterministic MiniMax compared to probabilistic AI methods. The work of [
37] introduces SwarmPlay, where a swarm of nano-quadcopters plays a Tic-Tac-Toe board game against a human. The system aims to create more tangible and interactive human–machine gameplay than single-robot setups. A drone swarm, workstation, and computer vision system enable real-time participation. User studies show high engagement and a more natural experience than traditional computer games. The results indicate a strong potential for SwarmPlay to extend to many games, enhancing human–drone interaction through a novel game-theory-based algorithm.
In [
38], a new robot-assisted laparoscopy training system is introduced, which uses deep reinforcement learning (DRL) agents such as PPO and GAIL to learn from both simulations and expert demonstrations. The system incorporates real laparoscopic instruments, allowing RL agents to provide trainees with hands-on, tactile learning experiences. The experimental results show that the system can successfully integrate simulation and expert data to improve training outcomes. Statistical analysis confirms that the skill improvements achieved with this training system are significant. In the study [
39], a novel architecture was designed to improve coordination of motion control using reinforcement learning. The proposed CoordiGraph framework utilizes the subequivariant property to deal with weak inter-joint coupling in high-dimensional tasks. The method specifically addresses the shortcomings of Graph Neural Networks (GNNs) and equivariant techniques in coordinating motion control tasks within RL. The results show that CoordiGraph outperforms several baseline methods in complex motion control scenarios. Moreover, the findings suggest that subequivariance is a promising strategy to improve motion coordination. The study by [
40] addresses the problem of grasping moving objects in unstructured environments. It proposes a DRL-based system using a Kinect depth sensor and an improved Soft Actor–Critic (SAC) algorithm. The system follows an approach–track–grasp pipeline, enabling real-time tracking and grasping. The experimental results show high grasp success rates on objects moving along various trajectories, demonstrating the effectiveness of the method. In [
41], the problem of achieving high-precision robotic grasping using only visual input is studied. It introduces QT-Opt, a scalable off-policy DRL algorithm using distributed Q-learning. The approach learns end-to-end control policies directly from visual observations, avoiding the need for explicit object modeling. The results achieve a grasp success rate of 96% on unseen objects, outperforming previous methods. The work of [
42] identifies a limitation in vision-based RL for robotic manipulation, where static cameras struggle under occlusions and limited space. It proposes a Dual-Arm Active Visual-Guided Manipulation Model (DAVMM), with one arm handling vision (“eye”) and the other performing manipulation (“hand”), enabling active perception and interaction. Residual-RL and curriculum learning are used to improve sample efficiency and training stability. Experiments on three occluded, narrow-space tasks show DAVMM significantly outperforms strong baselines, achieving higher success rates and faster learning. Another study by [
43] introduces a Multi-Actor–Critic Deep Deterministic Policy Gradient (M2ACD) algorithm for robotic manipulator trajectory planning in complex environments. A Two-Stage Reward (TSR) strategy guides safe and precise motion, and NURBS (Non-Uniform Rational B-Splines) curves smooth trajectories to solve the position-hopping jitter. Results show M2ACD outperforms TD3, DARC, and DDPG, achieving superior curve smoothness, stability, and convergence speed for collaborative robot trajectory planning. The study in [
44] proposes a new 3D path planning method for robot arms using computer vision, Q-learning, and neural networks to overcome problems related to object localization, computational efficiency, and 2D workspace limitations. The Q-learning algorithm selects optimal movement actions in 3D space, while a neural network translates these actions into robot joint angles. Simulations and experiments show the approach significantly improves accuracy, efficiency, and real-time performance over previous methods.
YOLO edge deployment and physical RL and manipulation benchmarks have become active, rapidly evolving research areas. The work of [
45] provides a systematic review of deep learning deployment on embedded hardware, including YOLO-based object detection for edge processing, and emphasizes that successful deployment in resource-constrained environments depends on model-optimization strategies, lightweight architectures, and appropriate hardware selection. Similarly, study [
46] evaluates inference workflows and the performance of YOLO models across multiple edge platforms, reporting empirical latency and throughput results on resource-constrained devices such as the Raspberry Pi 4B and NVIDIA Jetson systems (Santa Clara, CA, USA). In the context of physical reinforcement learning, the work in [
47] presents one of the first extensive experimental benchmarks of multiple policy-learning algorithms, namely TRPO, PPO, DDPG, and Soft-Q, on commercially available physical robots. The study highlights the robustness of structured task setups and demonstrates the applicability of these algorithms across diverse physical environments. The work of [
48] investigates direct training of RL algorithms in controlled yet realistic real-world environments for dexterous manipulation tasks, addressing limitations of simulation-based training. The study presents benchmarking results for three RL algorithms applied to complex in-hand manipulation on physical robotic systems. The results show that TD3 consistently outperforms DDPG and SAC, demonstrating superior robustness in continuous real-world tasks. Overall, the work highlights the practicality of real-world RL training and its effectiveness in reducing the simulation-to-real gap.
Although more recent architectures and advanced learning methods are available, the use of well-established models and algorithms remains justified in experimental robotics research. These approaches offer a mature deployment ecosystem that facilitates reproducibility and simplifies system-level validation.
4. Implementation
This section describes the implementation of the proposed system architecture, presenting and explaining the development of the stages outlined in the previous section. As already mentioned, the system is designed to play the Tic-Tac-Toe game using a robotic arm with five degrees of freedom.
4.1. System Architecture
The proposed architecture consists of four main interconnected components designed to facilitate the gameplay of Tic-Tac-Toe, as shown in Figure 10. The system controls a robotic arm with five degrees of freedom using a Raspberry Pi 3B. Control of the arm relies on a YOLO computer vision model for real-time object identification, trained to recognize the various elements of the Tic-Tac-Toe game, and on a decision-making algorithm, the Q-learning RL algorithm, which uses YOLO’s results to autonomously control the robotic arm during the game. Finally, an interactive interface facilitates control and enables real-time monitoring of the entire control process.
More specifically, the system uses a camera to capture real-time images of the game board. These images are sent to the Raspberry Pi 3B controller, which applies the YOLO model to identify and locate the various elements of the game.
The camera used for this system features a 48-MP sensor with an f/2.0 aperture and a 26 mm wide-angle lens, equipped with autofocus capabilities. This configuration provides a sufficiently high spatial resolution and adequate light sensitivity to capture detailed images of the game board under typical indoor lighting conditions. However, in the present work, the visual component was designed exclusively for coarse localization of Tic-Tac-Toe cells in a tightly controlled setup rather than for general-purpose 3D pose estimation. Because the camera was rigidly mounted above the board at a constant height and orientation, the projection of the board on the image remained stable throughout all trials. Under these constrained conditions, the physical dimensions of the board (25.5 cm × 33.5 cm) and the YOLO input resolution (640 × 640 pixels) allowed us to apply a uniform linear pixel-to-centimeter scaling model. The origin of the robot was manually aligned with a fixed pixel coordinate, and the center of each detected bounding box was mapped to real-world coordinates through this scaling.
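The scaling model described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the board fills the whole 640 × 640 frame, that the 25.5 cm side lies along the image x-axis, and that the robot origin sits at pixel (320, 640); the class name `PixelToBoardMap` and all numeric choices other than the board size and input resolution are hypothetical and are not taken from the implemented system.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PixelToBoardMap:
    """Uniform linear pixel-to-centimeter mapping for a fixed overhead camera."""

    cm_per_px_x: float               # horizontal scale (cm per pixel)
    cm_per_px_y: float               # vertical scale (cm per pixel)
    origin_px: Tuple[float, float]   # pixel aligned with the robot origin

    @staticmethod
    def bbox_center(bbox: Tuple[float, float, float, float]) -> Tuple[float, float]:
        """Center of a bounding box given as (x_min, y_min, x_max, y_max)."""
        x_min, y_min, x_max, y_max = bbox
        return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0

    def to_cm(self, bbox: Tuple[float, float, float, float]) -> Tuple[float, float]:
        """Map a bounding-box center to centimeters relative to the robot origin."""
        cx, cy = self.bbox_center(bbox)
        return ((cx - self.origin_px[0]) * self.cm_per_px_x,
                (cy - self.origin_px[1]) * self.cm_per_px_y)


# Assumed geometry: board spans the full frame, origin at the bottom center.
mapping = PixelToBoardMap(
    cm_per_px_x=25.5 / 640,
    cm_per_px_y=33.5 / 640,
    origin_px=(320.0, 640.0),
)
x_cm, y_cm = mapping.to_cm((300.0, 600.0, 340.0, 640.0))
```

Because the mapping is purely linear, it is valid only while the camera height and orientation stay fixed, which is the constraint stated above.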
The output of this module is a map of detections that enables the decision-making algorithm to assess the current state of the game and determine the best move. Subsequently, the Raspberry Pi translates the coordinates of the chosen move into joint angles and sends commands to the robotic arm’s servomotors via I2C to execute the move.
4.2. Computer Vision
The task of the computer vision subsystem is to interpret the input images and to understand the state of the game in real time. The system is based on a YOLO model, an object detection algorithm, which identifies the position of the game pieces on the board. The information is then sent to the controller to make an informed decision.
The computer vision system focuses on identifying and classifying the Tic-Tac-Toe game pieces and determining their positions on the board. This enables the system to understand its current state and move the robotic arm to the correct position to make the next move.
Figure 11 illustrates the results of identifying and classifying the elements of the Tic-Tac-Toe game.
The success of computer vision is largely dependent on the quality and preparation of the dataset it uses. Therefore, we collected datasets that cover a wide range of different situations and contexts, providing the necessary diversity for robust object detection.
In total, six annotated datasets were collected from the Roboflow public repository [60], totaling 4207 images. Figure 12 shows some images from these datasets.
After the processing steps (annotating missing objects and removing duplicate or contextually irrelevant images), the final dataset comprises 3000 images. These are divided into 2500 images for training, 375 for validation, and 125 for testing.
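As an illustration of the 2500/375/125 partition, the following sketch shuffles a list of image identifiers and cuts it by count. The random shuffle, the seed, and the identifier format are assumptions made for the example; the text does not state how the images were assigned to each subset.

```python
import random


def split_dataset(items, n_train, n_val, n_test, seed=0):
    """Shuffle items reproducibly and split them into three disjoint subsets."""
    if n_train + n_val + n_test != len(items):
        raise ValueError("subset sizes must add up to the dataset size")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers standing in for the 3000 curated images.
image_ids = [f"img_{i:04d}" for i in range(3000)]
train_ids, val_ids, test_ids = split_dataset(image_ids, 2500, 375, 125)
```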
The model applied in this project needs to identify and locate objects in real time, be computationally lightweight, and provide high accuracy. The YOLOv5s model was selected from the available pre-built YOLO architectures [61]. YOLOv5 was chosen over more recent versions because of its simplicity and speed-oriented optimization, which make it particularly suitable for deployment on resource-limited hardware such as the Raspberry Pi used in this study. The smaller YOLOv5s model is especially efficient for real-time applications where low latency is crucial and available resources are constrained, ensuring efficient performance without compromising detection accuracy [62].
In this case, the YOLOv5 model was trained for 300 epochs on the custom dataset. Table 3 lists the network hyperparameters used in the training process, while Figure 13 shows the corresponding performance results [54, 55, 56]. A deeper analysis of the training results shows that the model achieved its highest accuracy value in epoch 284, reaching 98.7%, indicating that it correctly predicted most of the cases. This epoch is marked by training loss values that ranged from 1.1% for classes to 2.5% for bounding boxes. For the validation phase, these values are 1.2% and 3.5%, respectively. It also has a recall value of 95.5%, suggesting that the model is capable of detecting most of the objects in the images.
The mAP values for this epoch reflect the model’s overall accuracy in locating objects. The model reaches an mAP of 98.1% at an IoU threshold of 50%. Under the stricter criterion that extends to an IoU of 95%, the value drops to 66.7%. This indicates that the model detects objects reliably when a loose overlap between the predicted and true bounding boxes is tolerated, but localizes them less precisely when a tight overlap is required. The main performance metrics are shown in Table 4.
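To make the role of the IoU threshold concrete, the sketch below computes the intersection over union of two axis-aligned boxes and checks it against a loose and a strict threshold. The box coordinates and the helper names `iou` and `is_match` are illustrative assumptions, not values or functions from the trained model.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero when the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold):
    """A prediction counts as correct when its IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


truth = (0.0, 0.0, 100.0, 100.0)
pred = (10.0, 10.0, 110.0, 110.0)   # same size, shifted by 10 pixels
overlap = iou(pred, truth)          # 8100 / 11900, about 0.68
```

In this example the shifted box counts as a correct detection at a threshold of 0.5 but not at 0.95, which is the kind of gap that separates the two mAP figures.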
4.3. Decision-Making
The core of this system is the decision-making algorithm, which is responsible for analyzing the state of the game and determining the optimal move. It uses various strategies, with a particular focus on the Q-learning algorithm, a reinforcement-learning technique that continuously improves its performance by learning from past experiences.
The decision-making process can be described in four steps. First, the algorithm receives information from the computer vision system about the current state of the game board, including the positions of all pieces in play. Next, it determines the optimal move. Third, it issues movement commands to the robotic arm. Finally, it updates its knowledge base according to the outcome.
For comparison purposes, two additional algorithms with different decision processes were developed: a random decision algorithm and the MiniMax algorithm, which evaluates all possibilities and selects the one that produces the highest reward.
4.3.1. Board Movements
The Tic-Tac-Toe game class serves as the foundation of this project and provides the training environment for the RL model. The game board is represented as a grid where each cell starts empty and is updated as moves are made.
Players take turns placing their symbol (‘X’ or ‘O’) in a chosen position, with a function ensuring that moves are only made in unoccupied spaces. If a move is valid, the board updates accordingly, and the turn switches to the next player. If a player attempts to place a symbol in an already occupied cell, the move is rejected, and they must try again.
Figure 14 illustrates this logical decision process.
After each move, the game logic checks for a winner by evaluating all possible winning combinations: rows, columns, and diagonals. If a player has filled one of them with their symbol, that player is declared the winner. If the board is full and no winner is found, the game ends in a draw. Otherwise, the game continues with the next turn.
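The board logic described in this subsection can be sketched as a small class. This is a minimal illustration under assumed names (`TicTacToe`, `make_move`, `winner`, `is_draw`) and an assumed flat nine-cell representation; it is not the class used in the implemented system.

```python
class TicTacToe:
    """Minimal 3 x 3 board with move validation and win/draw detection."""

    # Index triples for rows, columns, and diagonals of a flat 9-cell board.
    WIN_LINES = [
        (0, 1, 2), (3, 4, 5), (6, 7, 8),
        (0, 3, 6), (1, 4, 7), (2, 5, 8),
        (0, 4, 8), (2, 4, 6),
    ]

    def __init__(self):
        self.board = [" "] * 9   # every cell starts empty
        self.current = "X"       # 'X' moves first

    def available_moves(self):
        return [i for i, cell in enumerate(self.board) if cell == " "]

    def make_move(self, position):
        """Place the current symbol; reject the move if the cell is occupied."""
        if self.board[position] != " ":
            return False         # same player must try again
        self.board[position] = self.current
        self.current = "O" if self.current == "X" else "X"
        return True

    def winner(self):
        for a, b, c in self.WIN_LINES:
            if self.board[a] != " " and self.board[a] == self.board[b] == self.board[c]:
                return self.board[a]
        return None

    def is_draw(self):
        return self.winner() is None and not self.available_moves()
```

A rejected move leaves both the board and the turn unchanged, so the same player is asked again, as described above.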
4.3.2. Random Decision Algorithm
With respect to the player algorithms, the random decision is the simplest. This type of player selects a random action from the set of available moves. It has no strategy for choosing movements and does not consider the current state of the game.
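A random player of this kind takes only a few lines. The sketch below assumes a flat nine-cell list in which a blank string marks an empty cell; the function name `random_move` is hypothetical.

```python
import random


def random_move(board, rng=random):
    """Return a uniformly random empty cell index, ignoring the game state."""
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        raise ValueError("no moves available on a full board")
    return rng.choice(moves)


# Example: only the center cell is still empty.
board = ["X", "O", "X",
         "O", " ", "X",
         "O", "X", "O"]
move = random_move(board)
```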
4.3.3. MiniMax Algorithm
The MiniMax algorithm is a more sophisticated approach. The algorithm aims to minimize the opponent’s rewards, while trying to maximize the player’s own rewards. To do this, it assumes that the opponent will always make the best move and seeks to minimize his chances of winning. Algorithm 1 describes the main steps of the MiniMax algorithm.
| Algorithm 1 MiniMax Chooses Move |
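| Procedure ChooseMove(board, player) |
|   (bestScore, bestMove) ← Minimax(board, player, maximizing = true) |
|   if bestMove exists then place the player's symbol at bestMove end if |
| End Procedure |
| Procedure Minimax(board, player, maximizing) |
|   if board is in a terminal state then return the score of that state end if |
|   if maximizing then best ← −∞ else best ← +∞ end if |
|   for each cell of the board do |
|     if the cell is empty then |
|       score ← Minimax(board with the cell played, opponent, not maximizing) |
|       if maximizing then |
|         if score > best then best ← score end if |
|       else |
|         if score < best then best ← score end if |
|       end if |
|     end if |
|   end for |
|   return best |
| End Procedure |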
Thus, this algorithm involves the recursive exploration, in a tree-like structure, of all possible future moves, aiming to continue maximizing the player’s score and minimizing the opponent’s score. The algorithm simulates the state of the game, where a new move implies a new branch of the tree; it alternates the turns of each player and evaluates the outcome of all possible moves until the game reaches a terminal state.
In this state, rewards are assigned to the move that led to this outcome:
Victory: The algorithm assigns a positive score if the winning move belongs to the maximizing player, or a negative score if it belongs to the opponent, whose rewards are being minimized.
Tie: The algorithm returns a null score, reflecting that no player has an advantage.
These values are passed back to the tree or branch that initiated the move and used to evaluate which move provides the best result.
In this way, at each level of the game tree, if it is the turn of the player who wants to maximize the rewards, the move with the highest score is selected. On the other hand, if it is the turn of the player who minimizes the rewards, the algorithm chooses the move with the lowest score.
This pattern continues recursively until the algorithm has explored all possible moves and determined the best move to make.
As shown in Figure 15, the MiniMax algorithm decides that the best move for player ‘X’ is in the middle of the right column, since the best decisions in that branch end up in a situation where, in the worst-case scenario, the game ends in a draw [63].
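The recursive evaluation described in this subsection can be written compactly in Python. The sketch below is a generic MiniMax for a flat nine-cell board with scores of +1, 0, and −1 for a win, tie, and loss; the function names and the scoring magnitudes are assumptions, and the code is not the implementation used in the system.

```python
import math

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def minimax(board, to_move, me):
    """Score the position for `me` (+1 win, 0 tie, -1 loss), `to_move` plays next."""
    won = winner(board)
    if won is not None:
        return 1 if won == me else -1
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0
    maximizing = to_move == me
    nxt = "O" if to_move == "X" else "X"
    best = -math.inf if maximizing else math.inf
    for m in moves:
        board[m] = to_move                 # simulate the move
        score = minimax(board, nxt, me)
        board[m] = " "                     # undo it
        best = max(best, score) if maximizing else min(best, score)
    return best


def best_move(board, me):
    """Pick the move with the highest MiniMax score for `me`."""
    opponent = "O" if me == "X" else "X"
    best_score, choice = -math.inf, None
    for m in [i for i, cell in enumerate(board) if cell == " "]:
        board[m] = me
        score = minimax(board, opponent, me)
        board[m] = " "
        if score > best_score:
            best_score, choice = score, m
    return choice
```

Called on a position where the opponent has two symbols in a row, `best_move` returns the blocking cell, since every other branch ends with a score of −1.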
4.3.4. Q-Learning
The class implementing the Q-learning algorithm requires three key parameters that control the agent’s learning process: the model’s learning rate, the discount factor, and the value of ε used by the ε-greedy policy.
Taking into account how this algorithm works, a main function, represented in Algorithm 2, was developed to decide the best move for the present state of the board. This function calls a few supporting functions that generate the available moves, decide between exploiting and exploring according to the ε-greedy policy, and calculate the “Q-value” for the state–action pair.
After deciding and making the move, the function also checks whether the game has reached a terminal state (win, lose, or draw) and assigns a corresponding reward: positive for a win, negative for a loss, and slightly positive for a draw.
Next, the function simulates the opponent’s best responses and penalizes the agent if the opponent is likely to win on the next move, encouraging the agent to strategically block the opponent.
Finally, this function updates the “Q-values” in the table according to the temporal-difference rule (Equation (10)).
| Algorithm 2 Q-learning Chooses Move |
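| if a random number is below ε then action ← random available move else action ← available move with the highest Q-value end if |
| [Make Move] |
| if the game is in a terminal state then reward ← terminal reward else |
|   [Simulate Opponent Move] |
|   for each available opponent move do |
|     if the opponent wins with this move then reward ← penalty end if |
|   end for |
| end if |
| update Q(state, action) with the temporal-difference rule |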
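The two core operations of Algorithm 2, ε-greedy action selection and the temporal-difference update, can be sketched as follows. The update uses the standard tabular Q-learning rule, Q(s, a) ← Q(s, a) + α [r + γ max Q(s′, a′) − Q(s, a)], which is assumed here to correspond to the rule referenced in the text. The class name, the dictionary-based table, and the string state encoding are illustrative choices, and the opponent-simulation penalty is deliberately left out.

```python
import random
from collections import defaultdict


class QLearningAgent:
    """Tabular Q-learning with an epsilon-greedy policy (illustrative sketch)."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.9, rng=None):
        self.alpha = alpha          # learning rate
        self.gamma = gamma          # discount factor
        self.epsilon = epsilon      # exploration probability
        self.q = defaultdict(float)  # (state, action) -> Q-value, default 0.0
        self.rng = rng or random.Random()

    def choose(self, state, moves):
        """Explore with probability epsilon, otherwise exploit the best Q-value."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(moves)
        return max(moves, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, next_moves):
        """Apply the temporal-difference update for one transition."""
        # A terminal next state has no moves, so its future value is zero.
        best_next = max((self.q[(next_state, a)] for a in next_moves), default=0.0)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


agent = QLearningAgent(epsilon=0.0)
empty = " " * 9                                   # assumed state encoding
agent.update(empty, 4, 1.0, None, [])            # winning transition, reward +1
```

With a learning rate of 0.1, a single rewarded transition moves the Q-value from 0 to 0.1, and repeated visits move it progressively closer to the target.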
To develop an algorithm capable of winning the Tic-Tac-Toe game, the model was trained for a total of 310,000 games against the various possible opponents, using a decreasing value of ε against each opponent. The Q-learning parameters are:
Learning Rate: 0.1.
Discount Factor: 0.9.
Epsilon (ε): Starting at 0.9 and decreasing to 0.01, with a decay rate of 0.1 for every batch of 10% of total games.
Rewards: +1 for a win, +0.5 for a draw, and −1 for a loss.
The selection of Q-learning parameters follows standard RL principles and was adjusted and validated during the initial experiments. A learning rate of 0.1 was chosen to ensure stable but sufficiently responsive updates to the Q-table, trying to avoid oscillation, but providing enough margin to incorporate new information. The discount factor of 0.9 encourages the agent to prefer long-term strategic advantages over immediate but potentially suboptimal moves, which is relevant in a short episodic task. The exploration parameter starts at 0.9 to promote extensive exploration of the state space during the early training stages, then decays to 0.01 (with a 0.1 decay step every 10% of the total games) to progressively shift the agent towards exploitation of the learned policy. Finally, the reward structure reflects the asymmetry between desirable and undesirable outcomes, providing the agent with clear guidance to avoid losing states, while still recognizing the value of forcing a draw when a win is not possible.
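The exploration schedule can be illustrated with a short function. The sketch assumes that the 0.1 decay is subtracted once per batch of 10% of the games and that the result is clipped at 0.01; the function name `epsilon_for_game` and the example of 1000 games are hypothetical.

```python
def epsilon_for_game(game_index, total_games, start=0.9, floor=0.01,
                     step=0.1, batches=10):
    """Exploration rate for a 0-based game index under a stepwise linear decay."""
    batch = min(game_index * batches // total_games, batches - 1)
    # Rounding removes floating-point residue such as 0.6000000000000001.
    return max(floor, round(start - step * batch, 10))


# Exploration rate at the start of each 10% batch of a 1000-game run.
schedule = [epsilon_for_game(g, 1000) for g in range(0, 1000, 100)]
```

Under these assumptions the rate steps from 0.9 down to 0.1 over the first nine batches and sits at the 0.01 floor for the last one.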
At the end of the training, the results obtained are shown in Table 5.
Figure 16 and Figure 17 illustrate the progression of the win/draw/loss rates for the trained Q-learning player (as Player ‘X’ and Player ‘O’, respectively) during training against the different predefined opponents: random, new Q-learning, trained Q-learning, and MiniMax. When the Q-learning agent plays as Player ‘X’, it exhibits unstable performance at the beginning of training because the high exploration rate (ε = 0.9) leads to largely random action selection. This is visible in the initial segments of the curves, where win and loss rates fluctuate significantly. As ε gradually decreases, the agent begins to exploit the accumulated Q-values more consistently, and the curves start to stabilize. Against the weaker opponents, the random and the new Q-learning players, the win rate increases rapidly, indicating that the agent quickly learns to avoid losing positions and converges toward the optimal Tic-Tac-Toe policy. Against stronger or deterministic opponents, such as MiniMax, the learning curve shows a slower improvement. In these cases, the model gradually reduces the loss rate but converges to a high proportion of draws, which aligns with the theoretical outcome of optimal Tic-Tac-Toe play.
On the other hand, when the agent plays as Player ‘O’, the learning curve differs noticeably. Because Tic-Tac-Toe favors the first player, the Q-learning agent has limited opportunity to achieve non-losing outcomes when facing a perfect opponent. This is reflected in the learning curves: even after extensive training, the agent’s draw rate increases only marginally, while the loss rate remains high. This behavior is consistent with game theory: an optimal first player secures at least a draw, and a suboptimal move by the second player can lead to a forced loss. Nevertheless, the slight improvement in draw rate observed late in training indicates that the agent learns to avoid the most immediate losing responses but cannot fully overcome the first-player advantage against MiniMax-level play.
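The game-theoretic statements above can be checked with an exhaustive search. The sketch below is a generic minimax evaluation written for illustration only (it is not the MiniMax opponent used in this work); the board encoding as a 9-character string and the function names are assumptions.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),     # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),     # columns
         (0, 4, 8), (2, 4, 6)]                # diagonals


def winner(board):
    """Return 'X' or 'O' if a line is complete, otherwise None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax_value(board, to_move):
    """Game value from X's point of view: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0
    nxt = "O" if to_move == "X" else "X"
    values = [minimax_value(board[:i] + to_move + board[i + 1:], nxt)
              for i, cell in enumerate(board) if cell == " "]
    return max(values) if to_move == "X" else min(values)


EMPTY = " " * 9
print(minimax_value(EMPTY, "X"))          # value of the empty board
print(minimax_value("XO       ", "X"))    # X in a corner, O replies on an adjacent edge
```

The empty board evaluates to a draw, whereas the second position (a weak reply by the second player) evaluates to a forced win for ‘X’, which illustrates why a second player facing MiniMax has little margin for error.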
5. Results
This section presents the main results obtained for the individual components and for a test game played between a human player and the trained Q-learning player.
5.1. Game Flow
The backend of this system operates through a structured series of steps, ensuring smooth transitions between the different phases of the game. From the initial setup, through each turn, to the final outcomes, the control logic ensures that all eventualities are accounted for. The main components include game setup, player turn switching, integration of YOLO detections, and control of the robot’s physical movements.
When the play button is pressed on the graphical interface, the system retrieves the user-configured settings, including player types and algorithm parameters. These configurations are then applied to initialize the environment and prepare the game for execution.
Once the setup is complete, the system enters the main game loop (Figure 18), alternating turns between Player 1 and Player 2. After each move, the state of the game board is updated and the system checks whether a winning or draw condition has been met.
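The turn-switching logic described above can be sketched as follows. This is a simplified, software-only illustration: the players are hypothetical callables that return a cell index, and the vision and robot-motion steps of the real system are omitted.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def game_status(board):
    """Return 'X' or 'O' for a win, 'draw' for a full board, None if the game continues."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return "draw" if " " not in board else None


def play_game(player_x, player_o):
    """Alternate turns until a terminal state is reached; returns (status, board)."""
    board = [" "] * 9
    players = {"X": player_x, "O": player_o}
    mark = "X"
    while True:
        move = players[mark](board, mark)      # each player picks a cell index 0..8
        if board[move] != " ":
            raise ValueError(f"cell {move} is already occupied")
        board[move] = mark                     # update the board state
        status = game_status(board)            # check win/draw after every move
        if status is not None:
            return status, board
        mark = "O" if mark == "X" else "X"     # switch turns


def first_free_cell(board, mark):
    """Trivial placeholder player: always takes the first empty cell."""
    return board.index(" ")
```

In the full system, a human move would be read from the graphical interface and a non-human move would come from the selected algorithm, but both can be wrapped behind the same callable signature.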
For human players, moves performed in real life are registered through the graphical interface. For nonhuman players, the system determines the optimal move based on the selected algorithm and updates the board accordingly.
After this decision, the chosen move is mapped to real-world coordinates, allowing the robotic arm to perform the action. The process of robotic arm movement involves:
Hover over the detected object—The end effector moves to the center x and y of the detection with a predefined elevated z.
Lower the end effector—The end effector remains over the center x and y of the detection, but lowers the z to mark the cell where the move will be made.
Return to base—The end effector returns to the predefined initial position.
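The three-step motion above can be expressed as an ordered list of Cartesian waypoints. The sketch below is only illustrative: the `Waypoint` type, the function name, and the numeric heights are hypothetical and are not taken from the described implementation.

```python
from typing import List, NamedTuple


class Waypoint(NamedTuple):
    x: float
    y: float
    z: float


def plan_marking_motion(cx: float, cy: float, hover_z: float,
                        touch_z: float, home: Waypoint) -> List[Waypoint]:
    """Hover over the cell center, lower the end effector, then return to base."""
    return [
        Waypoint(cx, cy, hover_z),   # 1. hover above the detected center
        Waypoint(cx, cy, touch_z),   # 2. lower z to mark the cell
        home,                        # 3. return to the predefined initial position
    ]


# Hypothetical values in cm, chosen only for the example.
HOME = Waypoint(0.0, 12.0, 10.0)
plan = plan_marking_motion(10.0, 5.0, hover_z=8.0, touch_z=1.0, home=HOME)
```

Each waypoint would then be converted to joint angles and sent to the servo control module in sequence.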
To ensure precise movements, the system converts the detected image coordinates from pixels to real-world measurements. The center of the detected object is identified in the image, scaled accordingly, and mapped to the board’s physical dimensions. Using inverse kinematics, the necessary angles are computed to guide the robotic arm accurately. These computed angles are then sent to the servo control module, which executes the movements in sequence.
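The conversion chain described above (pixel scaling followed by inverse kinematics) can be sketched as follows. The arm geometry is an assumption made for illustration: a base yaw joint followed by two links of lengths `l1` and `l2` moving in a vertical plane, with the shoulder at the origin. The actual link lengths, offsets, and joint conventions of the arm used in this work are not specified here.

```python
import math


def pixel_to_board(u, v, img_w, img_h, board_w_cm, board_h_cm):
    """Fixed-scale mapping from a pixel (u, v) to board coordinates in cm."""
    return u * board_w_cm / img_w, v * board_h_cm / img_h


def inverse_kinematics(x, y, z, l1, l2):
    """Return (base, shoulder, elbow) in radians for the assumed 3-DOF geometry."""
    base = math.atan2(y, x)                      # yaw toward the target
    r = math.hypot(x, y)                         # horizontal reach
    cos_elbow = (r * r + z * z - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target is outside the reachable workspace")
    elbow = -math.acos(cos_elbow)                # one consistent branch (elbow up)
    shoulder = math.atan2(z, r) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return base, shoulder, elbow


def forward_kinematics(base, shoulder, elbow, l1, l2):
    """Inverse of the function above, used here to verify the solution."""
    r = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    z = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return r * math.cos(base), r * math.sin(base), z


angles = inverse_kinematics(10.0, 5.0, 3.0, l1=10.0, l2=10.0)
print(forward_kinematics(*angles, l1=10.0, l2=10.0))
```

Always choosing the same sign for the elbow angle corresponds to selecting a single consistent configuration when two solutions exist. A real setup would also add the offset between the board origin and the robot base before calling the inverse kinematics.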
Throughout the game, the system continuously monitors for a terminal state. Once a winner is determined or a draw is reached, the game ends and the interface presents an option to restart, reset the board, and prepare the system for a new round.
5.2. Final Tests
During the final tests, the system fulfilled the proposed objective and played the Tic-Tac-Toe game effectively.
For the experimental setup, the robotic arm was physically configured to operate with three effective degrees of freedom. As a result, the system does not perform pick-and-place manipulation. Instead, the robotic arm executes point-reaching motions to indicate the selected board cell corresponding to the decision made by the control algorithm.
To evaluate the positioning precision of the robotic arm, a set of predefined target coordinates was selected on the game board. The arm was commanded to reach each position under identical conditions in five independent trials, and the end-effector coordinates were measured on each attempt. All measured coordinates are defined with respect to the reference frame of the robot base: the X-Y plane is aligned with the board, while the Z-axis represents vertical displacement. The origin of the reference frame is located at the center of the robot base. Position measurements were obtained manually using a ruler-based method, with an estimated precision of ±0.5 mm.
Table 6 summarizes the recorded data, showing that the robot consistently reached positions within a small spatial variance, with typical deviations of less than 1 cm. Analysis of the measurements allows the repeatability of the robot to be quantified relative to its own biased mean position. The results show that the arm repeatedly returns to within approximately ±0.3 cm along the X-axis and ±0.1–0.2 cm along the Y-axis, Z-axis and radial distance. The maximum errors obtained were 1.5 cm in X, 0.8 cm in Y and 0.6 cm in Z. The mean ± standard deviation of the errors on each axis indicates a negligible bias in X, a small positive bias in Y and a small negative bias in Z, with Z being the most consistent axis.
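For reference, the quantities discussed above (bias, spread, maximum error, and repeatability about the mean reached position) can be computed per axis as in the sketch below. The measurements in the example are invented purely to make the code runnable; they are not the values of Table 6.

```python
from statistics import mean, stdev


def axis_error_stats(target, measured):
    """Per-axis accuracy statistics for repeated reaches of one target (same units)."""
    errors = [m - target for m in measured]
    center = mean(measured)
    return {
        "bias": mean(errors),                                   # systematic offset
        "std": stdev(errors),                                   # sample standard deviation
        "max_abs_error": max(abs(e) for e in errors),           # worst-case accuracy
        "repeatability": max(abs(m - center) for m in measured) # spread about own mean
    }


# Hypothetical example: five trials toward a 10.0 cm target on one axis.
stats = axis_error_stats(10.0, [10.2, 10.3, 10.1, 10.2, 10.2])
print(stats)
```

Separating the bias from the repeatability in this way makes it clear whether an error is a stable calibration offset, which can be corrected, or random scatter, which cannot.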
Figure 19 provides additional information on the accuracy and behavior of the system. The scatter plot in Figure 19a shows that the measured points form tightly defined clusters, demonstrating strong repeatability across trials. Most clusters are shifted upward, revealing a consistent positive bias in the Y-direction. The deviations along X are small and symmetric, suggesting that the errors along the X-axis are primarily random rather than systematic. Because all target positions exhibit similar patterns, the accuracy characteristics appear uniform throughout the workspace. Overall, the distribution of points indicates that the dominant error source is a stable calibration offset rather than measurement noise. Moreover, the box plot in Figure 19b shows that the X-axis exhibits the largest variability, indicating reduced measurement stability, although its median remains close to zero, implying that there is no significant systematic bias. The Y-axis is more stable but consistently overestimates the true values, confirming the positive Y-bias observed in the scatter plot. The Z-axis shows the smallest spread, making it the most precise of the three axes; however, it systematically underestimates the true value by approximately 0.4 cm.
Analyzing the computer vision results through the precision–recall curve, the recall–confidence curve, the F1-confidence curve, and the confusion matrix of the trained YOLO model (Figure 20), we can see that the model detects most classes with an accuracy of 85% or above. However, the model has difficulty distinguishing the class of the central cell of the playing field (‘22’) from that of the field itself (‘Field’). This confusion is understandable, as the central cell can be misinterpreted as a smaller field.
To validate the Q-learning model, the exploration rate ε was set to its minimum of 0.01 and the algorithm was tested again against its opponents. The results in Table 7 show that the trained model effectively looks for positions in which it wins or, at worst, draws the game, unlike a new untrained Q-learning algorithm (Table 8).
The inclusion of average move counts quantifies the efficiency of the trained model against an untrained algorithm. The trained agent generally completes the games in fewer moves than the untrained agent, particularly in winning scenarios. This difference suggests that the learned “Q-values” successfully guide the agent toward earlier forced wins or earlier detection of forced-draw positions.
Figure 21 and Figure 22 show snapshots of a test game between a human player and the Q-learning player.
5.3. Discussion
The Q-learning algorithm proved effective at decision-making and adapted well to the task: the system consistently tried to avoid losing, seeking the outcome with the greatest reward, namely a win or, in the worst case, a draw.
Similarly, the computer vision module, based on the trained YOLO model, allowed the robotic arm to interact with the surrounding environment and physically execute the moves, with positioning errors and deviations within acceptable limits for the task (typically below 1 cm). Although the accuracy of the computer vision model is high, it is occasionally affected by image conditions. Optimizing this component by implementing different models with greater accuracy may lead to a more robust system.
However, in order to achieve these results, it was necessary to adapt the primary implementation of this system several times to solve the various problems that arose, from the less accurate results of the YOLO model, to the high defeat rate of the Q-learning algorithm, to the mechanical problems of the robotic arm.
In addition, the Q-learning algorithm, although effective in simple contexts, may not adapt well to more complex environments without modifications to its basic logic. The integration of more sophisticated decision-making algorithms to allow the system to handle other types of tasks is one of the aspects to be investigated.
In this work, inverse kinematics is applied to a simplified three-degree-of-freedom manipulator, with the objective of reaching target positions on a 2D game board rather than controlling a full end-effector orientation. Taking this into consideration, the task and workspace were intentionally constrained, and all target positions lie well within the reachable region of the manipulator. During experimental operation, no singular configurations were encountered, so singularity detection or avoidance was not explicitly implemented. When multiple inverse kinematic solutions existed, a single consistent configuration was selected to ensure repeatability and stable motion. Future extensions of this work will consider more general manipulation tasks, where explicit singularity handling will be required.
Although the AI-driven robotic control techniques developed in this work have been applied to the Tic-Tac-Toe game, the underlying principles are applicable to a wide range of real-world scenarios. The integration of computer vision and RL could enable robotic manipulators to identify, pick, and assemble parts, which is useful in various industrial operations, such as on manufacturing lines. As an example, the YOLOv5-based perception system can be trained to detect objects, parts, or defects instead of game symbols. The bounding boxes obtained from detection can be directly mapped to robot coordinates, exactly as performed for the board cells in the Tic-Tac-Toe setup. The state extraction mechanism used to interpret the board configuration can be reformulated to represent object presence, orientation, or classification labels in a workspace. The Q-learning algorithm can be adapted to select discrete industrial actions such as pick, place, sort, reject, or reorient, rather than selecting a game move, while reinforcement learning can optimize task sequencing, grasp selection, or sorting strategies.
The platform can also be used for educational purposes to teach programming, logic, and AI concepts in games or other complex dynamic environments. In addition, the use of a physical robotic system as a benchmark to compare AI decision-making strategies, such as MiniMax and Q-learning, provides a valuable testing environment for implementing and evaluating new algorithms in realistic scenarios.
5.4. Future Directions
Future research will focus on improving spatial accuracy by transforming computer vision results into robot movements. The current implementation relies on fixed scaling factors from pixels to real-world coordinates, which, although functional under controlled conditions, lack the robustness of a fully calibrated geometric model. Future work should incorporate a complete camera calibration process, including intrinsic parameter estimation, lens distortion correction, extrinsic calibration relative to the board, and the computation of a board-plane homography or PnP-based pose estimation. Establishing well-defined coordinate frames and performing an error-propagation analysis would further increase reproducibility and ensure precise end-effector positioning under varying environmental conditions.
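As an indication of what a board-plane homography would look like in practice, the sketch below estimates it from four point correspondences using the direct linear transform (DLT) with NumPy. The pixel and board coordinates are hypothetical, and lens distortion is ignored; libraries such as OpenCV offer more complete routines (for example `cv2.findHomography` and `cv2.solvePnP`) that would normally be preferred.

```python
import numpy as np


def estimate_homography(pixel_pts, board_pts):
    """DLT estimate of the 3x3 matrix H such that board ~ H @ [u, v, 1]."""
    rows = []
    for (u, v), (x, y) in zip(pixel_pts, board_pts):
        rows.append([-u, -v, -1, 0, 0, 0, x * u, x * v, x])
        rows.append([0, 0, 0, -u, -v, -1, y * u, y * v, y])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]


def apply_homography(h, u, v):
    """Map a pixel to board-plane coordinates (perspective division included)."""
    x, y, w = h @ np.array([u, v, 1.0])
    return x / w, y / w


# Hypothetical board corners seen in the image (pixels) and on the board (cm).
PIXELS = [(100, 80), (540, 90), (560, 400), (90, 410)]
BOARD = [(0.0, 0.0), (30.0, 0.0), (30.0, 30.0), (0.0, 30.0)]
H = estimate_homography(PIXELS, BOARD)
print(apply_homography(H, 100, 80))
```

Unlike a fixed pixel-to-centimeter scale, this mapping accounts for perspective when the camera is not perfectly perpendicular to the board.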
Another important direction involves expanding the evaluation of the reinforcement-learning model. Although the present results compare trained and untrained Q-learning agents and report win/draw/loss rates, additional analyses would strengthen the characterization of agent behavior. These include measuring the rate at which the agent forces a draw under different initial configurations, conducting statistical significance tests to quantify the advantages of the first player versus the second player, and studying the effects of alternative ε-schedules, learning rates, and reward structures. Learning-curve and sample-efficiency analyses would also help establish the minimal training requirements to achieve strong performance.
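One possible form for such a significance test is a two-proportion z-test comparing, for instance, the non-loss rate when the agent plays first against the rate when it plays second. The sketch below uses only the standard library; the game counts are invented for the example and do not correspond to any result reported in this work.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions (pooled variance)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / se                               # undefined if pooled is 0 or 1
    p_value = math.erfc(abs(z) / math.sqrt(2.0))       # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts: non-losing games out of 100 as first and as second player.
z, p = two_proportion_z_test(90, 100, 50, 100)
print(f"z = {z:.2f}, p = {p:.2e}")
```

The normal approximation behind this test is reasonable for large samples; with few games or proportions near 0 or 1, an exact test would be a safer choice.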
6. Conclusions
This work developed a system that integrated computer vision and decision-making algorithms to enable a robotic arm to perform an interactive task: playing the game of Tic-Tac-Toe autonomously.
The computer vision module applied a trained YOLOv5 model to accurately detect and locate the symbols on the game board. Decision-making based on a Q-learning algorithm allowed the robot to select the best moves to make based on the current state. Finally, the control of the robotic arm converted the system’s decisions into movements, allowing the robot to interact with the game board in real time.
The results obtained are very promising and show that the developed system is indeed capable of playing the game autonomously. Moreover, this project allowed us to combine computer vision techniques with reinforcement learning to make autonomous decisions in real scenarios. Our approach seamlessly combines all of these methods to operate in real time, enabling more robust and adaptive interactions in dynamic environments.
Future developments of the work could investigate the use of newer architectures, including more recent YOLO versions and transformer-based models, and assess their suitability for real-time operation on resource-constrained hardware. Also, the system can be expanded to perform tasks other than the Tic-Tac-Toe game, such as more complex board games or industrial applications that require precision and adaptability. The ability of the system to learn and adapt through past experiences introduces numerous potential applications.