Black-Box-Based Mathematical Modelling of Machine Intelligence Measuring

Current machine intelligence metrics rely on different philosophies, hindering their effective comparison. There is no standardization of what machine intelligence is and what should be measured to quantify it. In this study, we investigate the measurement of intelligence from the viewpoint of real-life difficult-problem-solving abilities, and we highlight the importance of being able to make accurate and robust comparisons between multiple cooperative multiagent systems (CMASs) using a novel metric. A recent metric presented in the scientific literature, called MetrIntPair, is capable of comparing the intelligence of only two CMASs at an application. In this paper, we propose a generalization of that metric, called MetrIntPairII. MetrIntPairII is based on pairwise problem-solving intelligence comparisons (for the same problem, the problem-solving intelligence of the studied CMASs is evaluated experimentally in pairs). The pairwise intelligence comparison is proposed to decrease the necessary number of experimental intelligence measurements. MetrIntPairII has the same properties as MetrIntPair, with the main advantage that it can be applied to any number of CMASs while conserving the accuracy of the comparison, and it exhibits enhanced robustness. An important property of the proposed metric is its universality: it can be applied as a black-box method to intelligent agent-based systems (IABSs) in general, regardless of the IABS architecture. To demonstrate the effectiveness of the MetrIntPairII metric, we provide a representative experimental study, comparing the intelligence of several CMASs composed of agents specialized in solving an NP-hard problem.


Introduction
Computer systems encounter various issues during problem-solving, including high computational complexity, especially in the case of NP-hard problems, and the presence of different types of uncertainties, e.g., due to missing or erroneous data. Usually, computationally complex problems can be effectively solved by intelligent agent-based systems (IABSs), ranging from individual agents (IAGs) to cooperative multiagent systems (CMASs).
There are diverse CMASs specialized in difficult-problem-solving that are considered intelligent [1][2][3][4]. Applications of agent-based systems (ABSs) include the following: diverse problem solving in Industry 4.0 [5]; adaptive clustering [6]; modeling strategic interactions in diverse democratic systems [7]; investigation of the supply chain of product recycling [8]; detecting the proportion of traders in the stock market [9]; control design in the presence of actuator saturation [10]; agent-based simulation of environmental land change [11]; distributed intrusion detection [12]; investigations of complex information systems [13]; discovering Semantic Web services through process similarity matching [14]; power system control and protection [15]; and studies on task type and critic information. The remainder of the paper discusses aspects related to the considered metric; in the last section, the conclusions of this research are summarized.

Metrics for Measuring Machine Intelligence
Many of the IABSs are CMASs. Even in very simple CMASs, an increased intelligence can emerge at the system's level. For instance, Yang et al. [1] proposed an ICMAS composed of simple reactive mobile agents that are able to mimic the behavior of a human network administrator.
One of the earliest famous definitions of machine intelligence was presented by Alan Turing [30] in 1950. A computing system was considered intelligent if a human assessor could not decide the nature of the system as being human or artificial based on questions asked from a hidden room. The definition is based on the idea of an artificial cognitive system that is able to mimic the cognition of a human being.
Expert systems (ESs) can be mentioned as examples of systems that could best fit Turing's criteria of intelligence appreciation from a specialty-knowledge point of view, since their knowledge is similar to that of human specialists. ESs can solve specialty problems, similarly to human specialists, and have a diversity of applications [31].
Several studies have been performed related to machine intelligence, human intelligence, and the analogies and differences between them, analyzing different aspects of Turing's test proposal. A relatively recent study on Turing's test was presented by Sterret [32], analyzing how Watson, an IBM-developed question-answering computer, could compete against humans in the Jeopardy game. There is also the famous competition between the chess-playing machine named Deep Blue and the chess grandmaster Kasparov [33]. Besold et al. [34] studied diverse problems that are difficult for humans and could be used as benchmark problems for IABSs. Detterman [35] proposed an interesting challenge regarding the Machine Intelligence Quotient (MIQ) measured by well-known IQ tests developed for humans. Sanghi and Dowe [36] presented an intelligent computer program that was evaluated successfully on some standard human IQ tests. According to the authors, it surpassed the average human intelligence level (an IQ of 100) on some tests [36]. Even in this situation, successful from the side of the computing system, we consider that artificial and human intelligence cannot be directly compared at the general level.
MetrIntSimil [37] represents an accurate and robust metric that can be applied for a comparison of similarity in intelligence of any number of cooperative multiagent systems.
An interesting study realized at the US National Institute of Standards and Technology was presented by Schreiner [38], aiming to create standard measures for intelligent systems (ISs). Schreiner studied the research question of how precisely ISs are defined, and analyzed how to measure and compare the intelligence capabilities of ISs. Park et al. [39] introduced the notion of an intelligence task graph to study the measurement of machine intelligence of human-machine cooperative systems. Anthon and Jannett [40] analyzed the ABS intelligence considering the ability to compare alternatives with different levels of complexity. In their research, an agent-based distributed sensor network system was considered. The proposal was tested by comparing MIQs in diverse scenarios. Hernández-Orallo and Dowe [41] proposed a general test called the universal anytime intelligence test. That study considered that such a test should be able to measure the intelligence level, which could be very low or very high in diverse situations. The presented approach was based on the C-tests and compression-enhanced Turing tests developed in the late 1990s. Different tests were discussed, highlighting their limitations, and some new ideas were introduced and need to be studied for the development of a universal intelligence test.
ExtrIntDetect [42] represents a universal method that can be applied for the identification of intelligent cooperative multiagent systems with extreme intelligence.
Legg and Hutter defined an intelligence measure [43], presuming that the performance in easy environments counts less toward an agent's intelligence than does the performance in difficult environments. Hibbard [44] proposed a metric for intelligence measurement based on a hierarchy of sets of increasingly difficult environments. Hibbard considers an agent's intelligence measurement according to diverse considerations related to difficult-problem-solving ability.
In [28], a novel metric called MetrIntComp for the comparison of two CMASs' intelligence was proposed. Intelligence measuring was considered based on the principle of difficult-problem-solving abilities. MetrIntComp is able to make a differentiation in intelligence between the two CMASs even if the numerical difference between the measured intelligence values is low. MetrIntComp also makes a classification of the considered CMASs based on their intelligence. According to this classification, two CMASs with the same intelligence (from a statistical point of view) can be included in the same class.
Liu et al. [45] presented a recent complex study on the analysis of the intelligence quotient and the grade of artificial intelligence.
In [46], a metric that is able to compare the intelligence of a system with a reference intelligence is presented. The designed metric is also able to measure the evolution in intelligence of swarm systems.
In [47], a novel universal metric called MeasApplInt, able to measure and compare the real-life problem-solving machine intelligence of two cooperative multiagent systems at an application, is proposed. The studied intelligent systems are classified in intelligence classes. Systems classified in the same class can solve problems at the same level of intelligence.
Usually, intelligence is required when the problems to be solved are characterized by different kinds of difficulties. In this sense, the main purpose of endowing systems with intelligence is to obtain improvements in solving difficult problems. Machine intelligence must be considered based on difficult-problem-solving abilities. Measuring machine intelligence is important to develop highly intelligent problem-solving approaches. At the same time, it should enable the selection of the most appropriate systems, based on their intelligence.
Each of the metrics/methods presented in the scientific literature is based on some specific ideology of intelligence measuring. Based on this fact, most of them cannot be compared. There is no standardization of intelligence measuring, nor is there a universal vision of what an intelligence metric should measure. One type of difficult problem could be solved by IABSs with a large diversity of architectures whose intelligence must be measured. This motivates the necessity to design universal metrics. One of the main drawbacks of current metrics is their limitation in universality.

The Proposed MetrIntPairII Metric
In the following, we present a novel metric called Pairwise Machine Intelligence Measuring and Comparison of Multiple Intelligent Systems (MetrIntPairII). The metric is described in the form of an algorithm, likewise denoted MetrIntPairII.

Description of the Proposed MetrIntPairII Metric
This subsection introduces the notion of an intelligence indicator for solving difficult problems. An intelligence indicator is established by an Ha in order to obtain a quantitative measure of the type of problem-solving intelligence that is of interest.
The following notations are used: the number of problems used for the intelligence measurements is represented by |Probl| = k; |Int 1 | = |Int 2 | = . . . = |Int m | = k represents the cardinality of the intelligence indicator datasets (one measured intelligence indicator value per testing problem).
It must be noted that different sets of experimental intelligence evaluations could give slightly different MIQ values. This phenomenon is similar to the case of human intelligence tests, where a human obtains an evaluation result of his/her Intelligence Quotient (IQ), but at another evaluation, a slightly different IQ could be obtained. The proposed metric takes this aspect, called variability in intelligence, into consideration.
The central intelligence tendency of an ICMAS is described by the MIQ value and some additional indicators: the mean; the standard deviation (SD); the confidence level of the mean (CL) (the use of a 95% CL is recommended in most cases, though other values such as 90% or 99% can be used); the lower confidence interval of the mean (LCI) and the upper confidence interval of the mean (UCI), both calculated at the established CL level; and the coefficient of variation (CV), defined as CV = 100 × (SD/mean) and expressed as a percentage to be easier for an Ha to interpret. These indicators enable a statistical characterization of intelligence that allows for the formulation of diverse conclusions. For instance, CV is used as an indicator of homogeneity-heterogeneity; homogeneous intelligence means that there is no significant variation in the problem-solving intelligence.
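The central-tendency indicators above can be computed directly. Below is a minimal Python sketch; the function name `intelligence_summary` is illustrative, not part of the metric, and the confidence limits use the large-sample normal quantile, whereas the Student t quantile is preferable for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def intelligence_summary(int_values, cl=0.95):
    # Mean, SD, confidence limits of the mean (LCI/UCI) at the chosen CL,
    # and the coefficient of variation CV = 100 * (SD / mean).
    m = mean(int_values)
    sd = stdev(int_values)
    z = NormalDist().inv_cdf(0.5 + cl / 2)   # ~1.96 for a 95% CL
    half_width = z * sd / sqrt(len(int_values))
    return {
        "mean": m,
        "SD": sd,
        "LCI": m - half_width,   # lower confidence limit of the mean
        "UCI": m + half_width,   # upper confidence limit of the mean
        "CV": 100 * sd / m,      # in percent, easier for an Ha to interpret
    }
```

For example, ten repeated intelligence measurements of one ICMAS yield one summary dictionary per intelligence indicator dataset.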
The MetrIntPairII algorithm compares the intelligence of Coop on the Probl testing problem set. It verifies whether the intelligence of Coop 1 , Coop 2 , . . . , Coop m is statistically equal and makes a classification of the studied CMASs into intelligence classes. In the following, the null hypothesis, denoted as H I0 , is the statement that the intelligence of Coop 1 , Coop 2 , . . . , Coop m is equal from a statistical point of view (the difference is not statistically significant), meaning that all the analyzed CMASs should be included in the same intelligence class. H I1 denotes the alternative hypothesis, which indicates that the intelligences of Coop 1 , Coop 2 , . . . , Coop m are not all equal from a statistical point of view, and there is a difference in intelligence between at least two of them; it can then be concluded that the analyzed CMASs cannot all be included in the same class. MetrIntPairII uses as input Int 1 , Int 2 , . . . , Int m . The "@" symbol specifies the performance of a specific set of processing steps. For example, "@Apply the Friedman test with the αMore significance level." specifies that the Friedman test is applied, as presented in the scientific literature, with the αMore significance level. Figure 1 briefly presents the main processing steps performed by the MetrIntPairII: Intelligence Comparison Algorithm.
A dataset is called homogeneous if CV < CV 1 , relatively homogeneous if CV ∈ [CV 1 , CV 2 ), relatively heterogeneous if CV ∈ [CV 2 , CV 3 ), and heterogeneous if CV ≥ CV 3 . Recommended values are CV 1 = 10, CV 2 = 20, and CV 3 = 30, as they are usually the most appropriate.
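The CV-based classification above is a straightforward threshold rule; a sketch with the recommended thresholds as defaults (`cv_class` is an illustrative name):

```python
def cv_class(cv, cv1=10.0, cv2=20.0, cv3=30.0):
    # Homogeneity class of a dataset from its coefficient of variation
    # (CV, in percent), using the recommended thresholds 10/20/30.
    if cv < cv1:
        return "homogeneous"
    if cv < cv2:
        return "relatively homogeneous"
    if cv < cv3:
        return "relatively heterogeneous"
    return "heterogeneous"
```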
Some studies [48,49] compare the most frequently used tests for verification of the normality assumption, such as the One-Sample Kolmogorov-Smirnov test (KS test), the Shapiro-Wilk test (SW test), and the Lilliefors test (Lill test). The Lill test is an adaptation of the KS test. Among the studied tests, the SW test was proved to have the most statistical power. It was noted in [48,49] that powerful normality tests could have disadvantages that must also be considered in decisions. For instance, the SW test was proved not to work well with many identical values.
Step 1. Preliminary analysis. @Establish CL; @Make a statistical analysis of Int 1 , Int 2 , . . . , Int m by obtaining the mean, LCI, UCI, minimum, maximum, median, SD, variance, and CV; @Analyze the homogeneity/heterogeneity of Int 1 , Int 2 , . . . , Int m based on the CV values; @Verify whether Int 1 , Int 2 , . . . , Int m pass the normality assumption.
Step 2. Hybrid human-computing system decision. @Based on the homogeneity/heterogeneity and normality of the Int 1 , Int 2 , . . . , Int m data, ask the Ha whether he/she decides on the application of an extreme detection test or of a data transformation. @Set the PreProcessing decision. If (PreProcessing="YesOutl") Then @Apply an outlier detection test; ElseIf (PreProcessing="YesTransf") Then @Apply a transformation to the intelligence indicator data. EndIf
Step 4. Classification based on the intelligence measurement.
If (Passed="Yes") Then @Apply the Repeated Measure Anova with αMore; @Obtain the p-value.
Else @Apply the Friedman test with αMore; @Obtain the p-value.
EndIf
If (p-value > αMore) Then @Accept H I0 . EndIf
EndMetrIntPairIIAlgorithm
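The nonparametric branch of Step 4 can be sketched with a statistics library. The following assumes SciPy is available and covers only the Friedman branch; in the parametric case, a repeated-measures ANOVA (e.g., from another library) would replace the test call. The function name `nonparametric_comparison` is illustrative.

```python
from scipy import stats

def nonparametric_comparison(int_samples, alpha_more=0.05):
    # Friedman test over the paired intelligence indicator datasets
    # Int_1 ... Int_m; H_I0 is accepted when p-value > alpha_more.
    stat, p_value = stats.friedmanchisquare(*int_samples)
    return "accept H_I0" if p_value > alpha_more else "accept H_I1"
```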
In our algorithm, for the normality verification, we have chosen the KS test [50][51][52], the Lill test [51][52][53], and the SW test [54]. The Quantile-Quantile plot (QQ plot) is a scatterplot appropriate for the visual appreciation of normality. From an interpretation point of view, a QQ plot includes a plotted reference line. In the case of normally distributed data, the points should fall approximately along this reference line. The greater the departure from the reference line, the greater the evidence for the conclusion that the data fail the normality assumption. The joint use of the QQ plot with the SW test is suggested for accurate verification of the normality assumption.
In usual applications, it is sufficient to use the SW test jointly with the QQ plot. The SW test is appropriate even in the case of a normality evaluation of smaller sets of data.
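In practice, the SW check is a one-liner with a statistics library. Here is a sketch assuming SciPy is available; `passes_normality` is an illustrative helper, and the QQ plot itself would be produced separately (e.g., with `scipy.stats.probplot`).

```python
from scipy import stats

def passes_normality(sample, alpha=0.05):
    # Shapiro-Wilk test: the null hypothesis is that the data come from
    # a normal distribution; a large p-value means normality is not rejected.
    stat, p_value = stats.shapiro(sample)
    return p_value > alpha
```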
Step 2 of the algorithm includes a hybrid human-computing system decision. This is based on the consideration that a human has problem- and domain-specific knowledge that, in some situations, can support more efficient decisions than those of computing systems. In some situations, this step could be implemented as an automatic decision.
In the nonparametric case (the samples of at least one of the intelligence indicators are not normally distributed), to obtain normally distributed data, an Ha could decide to eliminate extreme (outlier) intelligence indicator values. An intelligence indicator sample could contain extreme intelligence values: an extreme intelligence indicator value is a very high or very low intelligence value that is statistically significantly different from the other intelligence indicator values. For the detection of extreme values, we used a statistical method called the Grubbs test for outlier detection, also called the ESD (extreme studentized deviate) method [55][56][57][58]. The Grubbs test assumes normality: the dataset must be reasonably approximated by a normal distribution, which must be verified or known before its application. It is proposed that the Grubbs test be used with the significance level αGrubbs = 0.05, though other values, such as 0.01 or 0.1, can be considered. At the first application, the Grubbs test can identify a single extreme value, if there is any. If an extreme value is detected, then it is the value that is statistically most significantly different from the other measured intelligence indicator values. If an extreme is detected, one may consider applying the extreme detection test again; it can be applied recursively several times until no further extremes are identified.
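A single-outlier, two-sided Grubbs test can be sketched as follows, assuming NumPy and SciPy are available; `grubbs_extreme` is an illustrative name, and the data are assumed approximately normal, which must be checked beforehand.

```python
import numpy as np
from scipy import stats

def grubbs_extreme(values, alpha=0.05):
    # Two-sided Grubbs (ESD) test for one outlier: G = max|x - mean| / SD,
    # compared against the critical value built from the Student t quantile.
    x = np.asarray(values, dtype=float)
    n = len(x)
    idx = np.argmax(np.abs(x - x.mean()))
    g = abs(x[idx] - x.mean()) / x.std(ddof=1)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return int(idx) if g > g_crit else None   # index of the extreme, if any
```

Applying the function repeatedly (removing the detected value each time) mirrors the recursive use described above.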
Whenever the Grubbs test detects an extreme intelligence indicator value in an intelligence indicator dataset, this value is removed together with the corresponding paired values from the other intelligence indicator datasets. For instance, if, in the case of the problem denoted Prl v , the corresponding intelligence indicator value In1 v (from In 1 ) is identified as an extreme, then In2 v , . . . , Inm v are also removed from the samples In 2 , . . . , In m .
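The paired removal keeps the datasets aligned problem-by-problem; a minimal sketch (`remove_paired_extreme` is an illustrative name, and `v` is the index of the problem Prl v whose measurement was flagged as extreme):

```python
def remove_paired_extreme(datasets, v):
    # Drop the measurement for problem Prl_v from every intelligence
    # indicator dataset, preserving the pairwise alignment.
    return [ds[:v] + ds[v + 1:] for ds in datasets]
```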
Alternatively, if the sample intelligence data do not pass the normality assumption, a transformation can be applied to obtain normally distributed data. Some of the most commonly used normalizing transformations are indicated in Table 2 [59]; IN denotes an arbitrary dataset. At Step 1 of the algorithm, a statistical analysis is performed, and the results are used to make intermediary decisions on the intelligence indicator data and to decide on further processing steps. It is also appropriate to make some additional characterizations of the intelligence variability of the studied CMASs.
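As a sketch, three commonly used normalizing transformations can be applied elementwise; the exact set in Table 2 is not reproduced here, and the data are assumed strictly positive for the logarithmic and reciprocal cases.

```python
import math

def transform(data, kind):
    # Commonly used normalizing transformations applied to a dataset IN.
    funcs = {
        "log": math.log,            # logarithmic transformation
        "sqrt": math.sqrt,          # square-root transformation
        "reciprocal": lambda x: 1.0 / x,
    }
    f = funcs[kind]
    return [f(x) for x in data]
```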
Step 4 of the algorithm performs the classification of the studied CMASs based on their intelligence. For effective comparison of the intelligence of the CMASs, the Repeated Measure Anova test [60,61] is used in the parametric case, when Int 1 , Int 2 , . . . , Int m pass the normality assumption. In the nonparametric case, when not all of Int 1 , Int 2 , . . . , Int m are normally distributed, the Friedman Two-Way Analysis of Variance by Ranks test (the Friedman test) is used [62,63]. When using the Friedman test, it is important to use a sample size of at least 12 in order to obtain an accurate p-value. In choosing each of the tests, the αMore significance level value should be established. It is suggested that a value of αMore = 0.05 be used (other values, such as 0.01 and 0.1, can also be used), as it is frequently the most appropriate. αMore denotes the probability of making a type I error, that is, of rejecting H I0 when it is true. As a motivation for choosing this significance level value, it must be mentioned that the smaller the significance level is, the less likely it is to make a type I error, and the more likely it is to make a type II error.
In the proposed MetrIntPairII metric algorithm, a p-value > αMore implies that H I0 can be accepted at the established significance level. In this case, it can be concluded that, even if there is a numerical difference between the calculated MIQ 1 , MIQ 2 , . . . , MIQ m values, there is no statistical difference in the intelligence of the studied CMASs. The numerical difference is the result of the variability in the intelligence of the CMASs. From a classification point of view, Coop 1 , Coop 2 , . . . , Coop m can be classified in the same class of intelligence, in the sense that they can solve the considered class of problems with the same level of intelligence.
If H I1 is accepted, then it can be concluded that the intelligence levels MIQ 1 , MIQ 2 , . . . , MIQ m are statistically significantly different (there is a significant difference between at least two of them). Accordingly, from the point of view of classification, Coop 1 , Coop 2 , . . . , Coop m cannot all be classified in the same intelligence class. To distribute Coop 1 , Coop 2 , . . . , Coop m into intelligence classes, the following post-tests are used: the Tukey-Kramer Multiple Comparisons test [64,65] (in the parametric case, when all the samples from Int pass the normality assumption) and the Dunn test [66,67] (in the nonparametric case, when not all the samples from Int pass the normality assumption). Additional explanations related to how the classification is accomplished based on the Tukey-Kramer post-test and the Dunn post-test are provided at the end of the experimental study presented in the next section. It is recommended that both tests be applied at the significance level 0.05; other significance levels, such as 0.01 and 0.1, can be applied as well.
Dunn's test compares the difference in the sum of ranks between two intelligence indicator sets with the expected average difference. For each pair of intelligence indicators, the p-value is obtained.
The Tukey-Kramer post-test is a single-step multiple-comparison statistical test. It can be used separately or in conjunction with ANOVA, as a post-hoc test, to find means that are significantly different from each other. It compares all possible pairs of means and is based on a studentized range distribution.
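A mean-rank variant of Dunn's post-test after the Friedman test can be sketched with the standard library only. This is a simplified version: ties are not corrected for, the p-values are Bonferroni-adjusted over all pairs, and `dunn_posthoc` is an illustrative name.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

def dunn_posthoc(int_samples, alpha=0.05):
    # Rank the m systems within each test problem (block), then compare
    # mean ranks pairwise. True in the result means the pair of systems
    # falls into different intelligence classes.
    m, k = len(int_samples), len(int_samples[0])
    mean_ranks = [0.0] * m
    for block in zip(*int_samples):          # one tuple per test problem
        order = sorted(range(m), key=lambda i: block[i])
        for rank, i in enumerate(order, start=1):
            mean_ranks[i] += rank / k
    se = sqrt(m * (m + 1) / (6.0 * k))       # SE of a mean-rank difference
    n_pairs = m * (m - 1) // 2
    results = {}
    for i, j in combinations(range(m), 2):
        z = abs(mean_ranks[i] - mean_ranks[j]) / se
        p_adj = min(1.0, 2 * (1 - NormalDist().cdf(z)) * n_pairs)
        results[(i, j)] = p_adj < alpha
    return results
```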

Intelligence Indicator Calculus Based on Multiple Intelligence Component Values
Choosing the most appropriate intelligence indicator is the responsibility of the Ha who wishes to measure the intelligence and to compare the problem-solving intelligence of two or more CMASs. He/she could choose the most preferable one based on what he/she indicates as problem-solving intelligence, with respect to the type of intelligence that he/she would like to measure.

An Illustrative Example for the Notion: Type of Intelligence
Consider the scenario of an intelligent transporting agent that is able to autonomously pilot a car with a passenger.
In the considered Travelling Salesman Problem (TSP) type of problem, a passenger starting from a city located in a country with a certain number of cities would like, with the help of a transporting agent, to visit each city once and return to the starting city at the smallest cost. Some examples of intelligence types that the pilot agent could use are enumerated below:

1. Type 1 : Communication intelligence. The capacity to communicate with the passenger in the car. This may be implemented to work via voice commands.
2. Type 2 : Intelligence in avoiding static objects that might appear on the road.
3. Type 3 : Intelligence in avoiding collisions with other cars.
4. Type 4 : Intelligence in avoiding humans who cross the road irregularly.
5. Type 5 : Intelligence in avoiding animals that cross the road irregularly.
In this scenario, the Ha should establish the type of intelligence that he/she would like to measure at a specific moment in time. The Ha can choose, for instance, Type 1 , which could be measured as the percentage of correctly recognized voice commands and the accuracy of their execution. Type 6 could be measured, for instance, based on how close the obtained route is to the shortest possible route.
A pilot agent AG I may be more intelligent than another pilot agent AG II based on a specific type of intelligence, while for another type of intelligence, the situation could be the opposite. For instance, AG I could be more intelligent than AG II in avoiding humans who cross the road, while AG II could be more intelligent than AG I in communication with the passenger.

An Illustrative Example for the Notion: Intelligence Components
The scenario of a hybrid CMAS composed of flying agent-based drones and terrestrial mobile robotic ants.
The scenario of a CMAS composed of intelligent flying agent-based drones and intelligent mobile robotic ants (which operate like agents) able to move on a certain type of land is considered. The mobile robotic ants are specialized in collecting different types of soil samples for analysis. The flying drones have a high-altitude view of the land and can analyze it using techniques such as image analysis. Using the available information about the robotic ants (e.g., their position and motion), obtained by inspecting the land from the air, the drones can indicate to the robotic ants the most appropriate places to go in order to efficiently collect representative (diverse in type) soil samples. The robotic ants are also able to cooperate with each other during operation. For instance, a robotic ant might find more soil samples than it can transport; in this case, it can request the help of another nearby robotic ant, which is able to transport a part of those samples. The Ha should establish the most appropriate intelligence indicator, based on the types of intelligence relevant to his/her assignment, and choose the intelligence components that contribute to the considered intelligence measurement.
The Ha considers the following three (z = 3) components of the intelligence:
1. Comp 1 : the added value of the new information obtained by processing the data that can be extracted from the collected samples. The weight wgh 1 corresponds to Comp 1 .
2. Comp 2 : used resources. The fuel consumed by the robotic agents. The weight wgh 2 corresponds to Comp 2 .
3. Comp 3 : problem-solving time. The time to obtain all the samples. The weight wgh 3 corresponds to Comp 3 .
If the added value of the obtained information and the time of collection are considered, the added value component could be more important and must therefore have a higher weight (wgh 1 > wgh 3 ). The weights of the components should be established by the Ha. Sometimes it may be necessary to apply a transformation to some components of the intelligence measure (e.g., to their units) before performing the intelligence indicator calculus based on them. The necessary types of transformations should be established by the Ha. For instance, in the case of Comp 2 , a lower value is better (lower resource consumption is preferable); in the case of Comp 3 , a lower value is better (collecting the samples in a smaller amount of time is preferable); in the case of Comp 1 , a higher value is better (obtaining a higher amount of new information is preferable).
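A weighted aggregation of the components can be sketched as follows. The function name `intelligence_indicator` is illustrative, and the reciprocal inversion of lower-is-better components is only one simple choice of transformation; the Ha may prefer another (e.g., min-max normalization).

```python
def intelligence_indicator(values, weights, higher_is_better):
    # Weighted aggregation of z intelligence components: lower-is-better
    # components (resources, time) are inverted via 1/value so that a
    # larger indicator always means "more intelligent".
    total = 0.0
    for value, weight, hib in zip(values, weights, higher_is_better):
        total += weight * (value if hib else 1.0 / value)
    return total
```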

The Cooperative Multiagent Systems Used in the Study
Dorigo [68][69][70][71] introduced the concept of problem-solving based on simple computing agents that mimic the generic behavior of natural ants. In an Ant System (AtS), initially, each agent (artificial ant) is placed on a randomly chosen node. An agent agent k currently at node i chooses to move to node j by applying the probabilistic transition rule (2):
p k ij = ([τ ij ] α × [η ij ] β ) / Σ l∈allowed k ([τ il ] α × [η il ] β ) if j ∈ allowed k , and p k ij = 0 otherwise.
After each agent completes its tour, the pheromone amount on each path is adjusted according to (3)-(5):
τ ij ← (1 − ρ) × τ ij + ∆τ ij , where ∆τ ij = Σ m k=1 ∆τ k ij , and ∆τ k ij = Q/L k if agent k used the edge (i, j) in its tour, and 0 otherwise.
In Equations (2)-(5), ρ, α, and β are parameters whose values should be set. α and β control the relative weights of the heuristic visibility and the pheromone trail; they establish the necessary trade-off between edge length and pheromone intensity. ρ, 0 < ρ < 1, represents the evaporation factor. Q denotes an arbitrary constant, usually Q = 1. The variables η gh (η gh = 1/d g,h ) stand for the heuristic visibility of the edge (g, h); d g,h represents the distance between the nodes g and h. The number of agents is denoted by m. L k stands for the length of the tour performed by the agent k .
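The AtS probabilistic transition rule can be sketched in a few lines; `next_node` is an illustrative helper, with `tau` and `eta` as pheromone and visibility matrices.

```python
import random

def next_node(tau, eta, i, allowed, alpha=1.0, beta=2.0):
    # AtS transition rule: the probability of moving from node i to node j
    # is proportional to tau[i][j]^alpha * eta[i][j]^beta over the allowed
    # (not yet visited) nodes.
    weights = [tau[i][j] ** alpha * eta[i][j] ** beta for j in allowed]
    return random.choices(allowed, weights=weights, k=1)[0]
```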
There are many types of cooperation in CMASs that could lead to intelligent behavior at the system's level. A fundamental component of cooperation is communication. There are many types of communication that can be implemented in multiagent systems. One example is communication in which a transmitted message has no designated destination agent; it is received by all the agents that are nearby (within a certain distance). The studied CMASs whose intelligence is compared in this study are composed of simple software mobile agents (artificial ants) that mimic the operation of natural ants. They are considered to have so-called Swarm Intelligence (SI); the expression SI was introduced by Beni and Wang [72]. The mobile agents operate in an environment represented by a graph of connected nodes and are able to move from node to node during problem-solving. The communication of the agents is realized using signs, similar to the communication of natural ants using chemical pheromones. Though this is a simple form of communication, it allows for efficient (efficiency in problem-solving), robust (even if some agents fail, the problem can be successfully solved), and scalable (the CMAS can be extended, if necessary, with new agents) cooperation in solving the undertaken problems. Many of the CMASs that operate by mimicking natural ants are considered intelligent in the scientific literature [73][74][75].
The Best-Worst Ant System (BW AS ) was proposed in [76]. Coop 1 operated as a BW AS [76][77][78]. The Min-Max Ant System (MM AS ) was proposed by Stützle and Hoos [79]. Coop 2 operated as an MM AS [79,80]. The first modified version of the AtS was the Ant Colony System (AC S ), introduced by Dorigo and Gambardella [81]. Coop 3 operated as an AC S [69,70,81]. Coop 1 , Coop 2 , and Coop 3 were applied in solving the TSP [81].
TSP can be defined as follows: Given M cities (nodes of an undirected weighted graph), a salesman who starts from a given node should visit each node exactly once and then return to the starting node. The salesman would like to choose the route that minimizes either the traveled distance, or the travelling time, or the travelling energy.

Presentation of the Coop 1 Intelligent System's Operation
In the operation of Coop 1 , (2) represents the solution construction, and (6) represents the evaporation rule:
τ ij ← (1 − ρ) × τ ij , ∀ i, j, where ρ ∈ [0, 1] is the pheromone decay parameter.
Only the best-to-date agent and the worst-to-date agent update the pheromones. The best-to-date agent update is indicated in (7):
τ ij ← τ ij + ∆τ bs ij , where ∆τ bs ij = Q/L bs if the path (i, j) belongs to T bs ; T bs is the best-to-date agent round trip, and L bs is the length of the performed trip.
The paths of the round trip of the worst agent for the current iteration that are not in the best-to-date solution receive an additional evaporation, as indicated in (8):
τ rs ← (1 − ρ w ) × τ rs for all (r, s) ∈ T w with (r, s) ∉ T bs , where ρ w is a supplementary evaporation factor, T w is the worst solution for the given iteration, and T bs is the best-to-date solution.
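The three BW AS update rules can be sketched together; `bwas_update` is an illustrative helper, with tours given as lists of directed edges `(i, j)`.

```python
def bwas_update(tau, best_tour, best_len, worst_tour,
                rho=0.1, rho_w=0.1, Q=1.0):
    # Best-Worst Ant System sketch: global evaporation, reinforcement of
    # the best-to-date tour, and extra evaporation on worst-iteration
    # edges that are absent from the best-to-date tour.
    n = len(tau)
    for i in range(n):
        for j in range(n):
            tau[i][j] *= (1.0 - rho)          # evaporation, rule (6)
    best_edges = set(best_tour)
    for (i, j) in best_edges:
        tau[i][j] += Q / best_len             # best-to-date deposit, rule (7)
    for (i, j) in worst_tour:
        if (i, j) not in best_edges:
            tau[i][j] *= (1.0 - rho_w)        # extra evaporation, rule (8)
    return tau
```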

Presentation of the Coop 2 System's Operation
Coop 2 is based on an MM AS . The MM AS differs from a conventional AtS in some aspects. An MM AS imposes dynamically evolving bounds on the pheromone trail intensities, in such a way that the pheromone intensity on all paths is always within a specified limit of the path with the greatest pheromone intensity. All the paths permanently have a non-trivial probability of being selected; this way, a wider exploration of the search space is assured. The MM AS uses lower and upper pheromone bounds to ensure that all of the pheromone intensities lie between these two bounds.
In an MMAS, the solution construction follows (2). There are variants in the selection of the agents allowed to update pheromones: the best-for-current-iteration agent, the best-to-date agent, the best-after-latest-reset agent, or the best-to-date agent for even (or odd) iterations. There are minimal and maximal limits on the quantity of pheromone on the paths between nodes, denoted τ_min and τ_max. The evaporation on the graph can be expressed as (9), and (10) denotes the pheromone update based on the selected agent's round trip.
In (10), ∆τ_ij^sel(t) = Q/L_sel if the edge (i, j) belongs to T_sel, where T_sel is the selected agent's round trip and L_sel is the length of that trip. The initial pheromone can be set as τ_0 = 1/nc (nc denotes the number of cities); another possibility is τ_0 = τ_max. Which of the two is the more appropriate initialization should be established experimentally.
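The MMAS update with pheromone bounds can be sketched as follows. This is an illustrative sketch; the values of τ_min, τ_max, Q, and ρ are invented for the example, and the τ_0 = 1/nc initialization follows the text:

```python
import numpy as np

def mmas_update(tau, sel_tour, L_sel, tau_min, tau_max, rho=0.1, Q=1.0):
    """One MMAS pheromone update: evaporate as in (9), deposit Q / L_sel along
    the selected agent's tour T_sel as in (10), then clamp every trail into
    the interval [tau_min, tau_max]."""
    tau *= (1.0 - rho)                      # evaporation
    n = len(sel_tour)
    for k in range(n):
        i, j = sel_tour[k], sel_tour[(k + 1) % n]
        tau[i, j] += Q / L_sel              # deposit along the selected tour
        tau[j, i] = tau[i, j]               # symmetric TSP graph
    np.clip(tau, tau_min, tau_max, out=tau) # enforce the pheromone bounds
    return tau

nc = 4
tau = np.full((nc, nc), 1.0 / nc)           # tau0 = 1/nc initialization
tau = mmas_update(tau, [0, 1, 2, 3], L_sel=4.0, tau_min=0.01, tau_max=0.3)
```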

Presentation of the Coop 3 Intelligent System's Operation
Coop 3 is based on an ACS, itself derived from the AtS. The difference between the ACS and the AtS lies in the decision rule used by the agents during their operation (the solution construction process). The agents in an ACS use the following rule: the probability of an agent moving from node i to node j depends on a random variable q uniformly distributed over [0, 1] and a parameter q0. If the condition q ≤ q0 is satisfied, then, among the feasible edges, the edge that maximizes the product τ_il × η_il is chosen; otherwise, the same probabilistic rule as in the AtS is used. This is a kind of greedy rule that favors the exploitation of pheromone information. To counterbalance this, a local pheromone update is performed by all agents after each construction step, and each agent applies it only to the last edge traversed. In (11), the following notations are used: ϕ ∈ (0, 1] is the pheromone decay coefficient, and τ_0 is the initial pheromone value. The local pheromone update is intended to increase the chance that subsequent agents visit promising itineraries. Decreasing the pheromone concentration on edges as they are traversed during a single iteration signals to subsequent agents that they should choose other edges, which results in different solutions; this makes it less probable that several agents obtain identical solutions during a single iteration. Because of the local pheromone update, the minimum pheromone values are bounded.
Similarly to the AtS, in the ACS, a pheromone update is performed at the end of the construction process. It is realized only by the best-performing agent, which updates the edges it visited. In (12), the following notations are used: ∆τ_ij^bs = 1/L_bs if the best agent traversed edge (i, j) in its tour, and ∆τ_ij^bs = 0 otherwise. For the calculation of L_bs, two choices are common: the iteration-best value L_ib, the length of the best tour found in the current iteration, or the best-so-far value, the length of the best solution found since the start of the problem-solving process.
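The ACS decision rule and its two pheromone updates can be sketched as follows. This is an illustrative sketch: the values of q0, ϕ, τ0, and ρ are invented, and the (1 − ρ)/ρ weighting in the global update follows the standard ACS formulation rather than being spelled out in the text above:

```python
import random

def acs_next_node(tau, eta, current, feasible, q0=0.9):
    """ACS pseudo-random proportional rule: with probability q0, exploit the
    feasible edge maximizing tau * eta; otherwise fall back to AtS-style
    roulette-wheel selection over the same scores."""
    scores = {j: tau[current][j] * eta[current][j] for j in feasible}
    if random.random() <= q0:                    # q <= q0: greedy exploitation
        return max(scores, key=scores.get)
    total = sum(scores.values())
    r, acc = random.random() * total, 0.0
    for j, s in scores.items():                  # probabilistic exploration
        acc += s
        if acc >= r:
            return j
    return j

def acs_local_update(tau, i, j, phi=0.1, tau0=0.01):
    """Local update (11) on the last traversed edge: pull the trail toward tau0."""
    tau[i][j] = (1.0 - phi) * tau[i][j] + phi * tau0
    tau[j][i] = tau[i][j]

def acs_global_update(tau, best_tour, L_bs, rho=0.1):
    """Global update (12): only the best agent reinforces the edges it visited."""
    n = len(best_tour)
    for k in range(n):
        i, j = best_tour[k], best_tour[(k + 1) % n]
        tau[i][j] = (1.0 - rho) * tau[i][j] + rho * (1.0 / L_bs)
        tau[j][i] = tau[i][j]
```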

Experimental Results
For the intelligence measurement, a particular experimental setup was considered. The experiments were performed on a computing system with a quad-core Intel i7 2.6 GHz processor and 8 GB of RAM. Maps with nr = 100 randomly placed cities were considered. The most appropriate parameter values were chosen based on preliminary experimental evaluations. For all the CMASs, the following settings were used: number-of-steps = 1000; α = 1 (power of the pheromone); β = 1 (power of the distance/edge weight); ρ = 0.1 (the evaporation factor). Table 3 presents a part of the obtained experimental intelligence evaluation results, and Figure 2 provides a graphic representation of Int 1, Int 2, and Int 3. In the simulations, the best-to-date travel value obtained at the end of problem-solving was taken as the intelligence indicator; a smaller value of the global best signifies higher intelligence. Probl = {Prl 1, Prl 2, ..., Prl 36} represents the set of problems used in the problem-solving intelligence evaluations. Table 4 presents the descriptive characterization of the Int 1, Int 2, and Int 3 intelligence indicator samples, where mean denotes the sample mean; sample size represents the calculated sample size; LowerCI and UpperCI represent the lower and upper confidence limits of the sample mean; SD denotes the sample standard deviation; CV represents the coefficient of variation of the sample; variance represents the sample variance (calculated as SD^2); min and max represent the smallest and largest values in the sample; and median represents the median of the sample. Table 5 presents the results of the normality tests performed for Int 1, Int 2, and Int 3. Table 6 presents the descriptive characterization of the Int 1*, Int 2*, and Int 3* intelligence indicator samples, and Table 7 presents the results of the normality tests performed for Int 1*, Int 2*, and Int 3*.
To check the normality of the data, the Kolmogorov-Smirnov (KS), Lilliefors (Lill), and Shapiro-Wilk (SW) tests were applied, all at the significance level αNorm = 0.05. Tables 5 and 7 present, for each performed normality test, the obtained test statistic and p-value. For the interpretation, only the p-value was used: p-value > αNorm indicates that the normality assumption passes at the considered significance level.
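The normality-screening step can be sketched with SciPy. This is an illustrative sketch: Shapiro-Wilk is available directly, while the KS test is run here against a normal distribution fitted to the sample, which approximates Lilliefors without its corrected p-value (the exact Lilliefors test lives in the third-party statsmodels package):

```python
import numpy as np
from scipy import stats

def passes_normality(sample, alpha_norm=0.05):
    """Return True if the sample passes both normality checks at alpha_norm."""
    _, sw_p = stats.shapiro(sample)                 # Shapiro-Wilk test
    # KS against a normal with the sample's own mean and SD (Lilliefors-style,
    # but without the Lilliefors p-value correction).
    _, ks_p = stats.kstest(sample, "norm",
                           args=(np.mean(sample), np.std(sample, ddof=1)))
    # Interpretation uses only the p-values: p > alpha_norm passes.
    return bool(sw_p > alpha_norm and ks_p > alpha_norm)

rng = np.random.default_rng(0)
print(passes_normality(rng.exponential(1.0, size=500)))  # False: clearly non-normal
```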
Tables 4 and 5 present the results obtained by analyzing the Int 1, Int 2, and Int 3 data: Table 4 gives a descriptive characterization, and Table 5 the results of the normality testing. Not all of the intelligence indicator data pass the normality assumption. For this reason, the Friedman test was applied at the chosen significance level αMore = 0.05. The obtained Friedman test p-value ≈ 0.0001 (Friedman statistic Fr = 72), with p-value < αMore, indicates a statistically significant difference among the intelligence of Coop 1, Coop 2, and Coop 3; the three CMASs cannot all be included in the same class of intelligence. For further processing, the Dunn test at the significance level αPost = 0.05 (Table 8, the column labeled "Dunn test") was used to compare all pairs of CMASs. For the interpretation of the results, the p-value of the Dunn test is compared with αPost: p-value ≤ αPost indicates a statistically significant difference between the compared pair of CMASs, meaning that the two cannot be classified in the same class. The obtained results indicate that no pair of the studied CMASs can be included in the same class; hence, three intelligence classes were identified.
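The omnibus Friedman step can be sketched with SciPy on synthetic intelligence-indicator samples (the data below are invented, not the paper's measurements; the Dunn post hoc test is not in SciPy and would require a package such as scikit-posthocs):

```python
import numpy as np
from scipy import stats

# Synthetic samples for three CMASs, measured pairwise on the same 36 problems
# (rows align across systems, as the pairwise design requires).
rng = np.random.default_rng(42)
difficulty = rng.uniform(1500, 1600, size=36)    # shared per-problem effect
int1 = difficulty + rng.normal(60, 8, size=36)   # longest tours: least intelligent
int2 = difficulty + rng.normal(0, 8, size=36)    # shortest tours: most intelligent
int3 = difficulty + rng.normal(30, 8, size=36)   # intermediate

fr_stat, p_value = stats.friedmanchisquare(int1, int2, int3)
alpha_more = 0.05
print(p_value < alpha_more)  # True: the three CMASs are not in one class
```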
Applying the second approach of the MetrIntPairII algorithm, based on the fact that Int 3 does not pass the normality assumption, extreme values were identified using the Grubbs test on the Int 3 intelligence indicator data. At the first application of the test, Int3 23 = 1727, corresponding to Prl 23, was identified as an outlier and was removed from the sample. The obtained Int 1*, Int 2*, and Int 3* data passed the normality assumption (see Table 7 for the normality test results and Table 6 for the descriptive characterization of the intelligence indicators). QQ plots for Int 1* (Figure 3), Int 2* (Figure 4), and Int 3* (Figure 5) were constructed. The visual interpretation of Figures 3-5 leads to the same conclusion, namely that Int 1*, Int 2*, and Int 3* pass the normality assumption (the points fall approximately along the reference line). Based on this, according to the MetrIntPairII algorithm, the Repeated-Measures ANOVA test was applied at the significance level αMore = 0.05. The obtained p-value ≈ 0.0001 (p-value < αMore) indicates that the intelligence of the three studied CMASs presents significant differences; hence, Coop 1, Coop 2, and Coop 3 cannot be included in the same class of intelligence. The fact that p-value < αMore and that all the intelligence indicator data Int 1*, Int 2*, and Int 3* passed the normality tests justifies the application of the Tukey-Kramer multiple comparisons test at the significance level αPost = 0.05 for the comparison of all pairs of CMASs (Table 8, the column labeled "Tukey-Kramer test"). If p-value < αPost, then the two compared CMASs cannot be classified in the same class of intelligence. The final decision based on the results presented in Table 8 indicates that all three studied CMASs should be assigned to different classes. Table 8 also shows that the parametric (Tukey-Kramer test) and non-parametric (Dunn test) analyses produced the same results.
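SciPy does not ship a Grubbs test, so the outlier-screening step can be sketched with a small hand-rolled implementation of the standard two-sided Grubbs formula (illustrative; the sample below is invented):

```python
import numpy as np
from scipy import stats

def grubbs_outlier(sample, alpha=0.05):
    """Two-sided Grubbs test: return the index of the most extreme value if it
    is a significant outlier at level alpha, otherwise None."""
    x = np.asarray(sample, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    idx = int(np.argmax(np.abs(x - mean)))       # most extreme observation
    g = abs(x[idx] - mean) / sd                  # Grubbs statistic
    # Critical value from the t distribution (standard Grubbs formula).
    t2 = stats.t.ppf(1 - alpha / (2 * n), n - 2) ** 2
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t2 / (n - 2 + t2))
    return idx if g > g_crit else None

# Invented sample: nine values near 10 and one extreme value.
data = [10.1, 9.9, 10.0, 10.2, 9.8, 10.05, 9.95, 10.15, 9.85, 50.0]
print(grubbs_outlier(data))  # 9: the extreme value is flagged as an outlier
```

In the paper's workflow, the test would be reapplied to the reduced sample until no further outlier is found.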
Based on the pairwise comparisons, it can be concluded that the difference in intelligence between any two of the three studied CMASs is statistically significant. Hence, Coop 1, Coop 2, and Coop 3 should be assigned to separate classes of intelligence: Coop 2 belongs to the most intelligent class, denoted IntClass 1; Coop 3 belongs to the second intelligence class, denoted IntClass 2; and Coop 1 belongs to the third intelligence class, denoted IntClass 3. The intelligence of CMASs in IntClass 1 is higher than that of CMASs in IntClass 2, which in turn is higher than that of CMASs in IntClass 3.

Discussion
Frequently, in CMASs, intelligence can be considered at the system level. Although the intelligence of CMASs cannot be defined universally, it is useful to measure it. This is similar to the nature of human intelligence: nobody fully understands what human intelligence is, yet there are intelligence tests that can measure it. The outcomes of human intelligence tests are useful for applicative purposes, such as comparing the effectiveness of two books used by students preparing for the same exam, each student learning from one of the two books. If the IQ of the students is taken into consideration in the comparison, more accurate results are obtained, in the sense that a more intelligent student can learn more easily even from a poorly written educational book.
Few metrics are presented in the scientific literature, and most of them are based on a specific philosophy of intelligence measurement; these differing philosophies do not allow a direct comparison of the metrics. In our study, intelligence measurement was based on the ability to solve difficult problems. The designed MetrIntPairII metric is appropriate for CMASs in which the problem-solving intelligence indicator of each CMAS can be expressed as a single value. If necessary (in the case of highly complex systems), this value can be computed as a weighted sum of other values that measure different aspects of the system's intelligence. MetrIntPairII takes into consideration the variability in the intelligence of the compared CMASs: a CMAS may exhibit different intelligence values in different situations, and for a specific problem its intelligence may be higher or lower. Extreme (very high and very low) intelligence values were also considered; if such extreme intelligence indicator values are retained for a CMAS, they might strongly influence its measured machine intelligence.
The MetrIntPairII algorithm, based on a mathematically grounded analysis, chooses between the parametric Single-Factor ANOVA test with replications [60,61] and the nonparametric Friedman test [62,63]. The principal properties of the MetrIntPairII metric are therefore accuracy and robustness in comparison and classification. In the case of our metric, if the intelligence indicator data pass the normality assumption, the mean is chosen as the representative statistical indicator of the central intelligence tendency; if they do not, the median is chosen instead, because it is more robust than the mean. This robustness follows from the fact that an extreme value (very high or very low) influences the median to a lesser degree than it influences the mean.
Another strength of the MetrIntPairII metric is the reduced number of necessary intelligence measurements, which results from using pairwise intelligence evaluations, analogous to the two-sample paired versus unpaired tests for verifying a null hypothesis of equality of the means or medians of two samples. For example, an application can be considered in the following context: tails = 2 (in statistical analysis, the two-tailed test is almost always chosen instead of the one-tailed test); α = 0.05; β = 0.2; Power = 1 − β = 0.8; effect size = 0.5. β is the type II error, the probability of failing to reject a false null hypothesis. Generally speaking, a type I error is the detection of an effect that is not present, while a type II error is the failure to detect an effect that is present. An effect size is a quantitative measure of the strength of a phenomenon, and the power represents the probability of detecting a true effect. Based on these data, using an a priori calculation (two samples) in a parametric case (normally distributed data), the sample size of each sample should be 34 in the matched-pairs case, but 64 in the non-matched-pairs case.
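The a priori sample sizes quoted above (34 matched pairs versus 64 per group unpaired) can be reproduced with a noncentral-t power calculation. This is a sketch: the iterative search below is a standard approach to power analysis, not the paper's own procedure:

```python
from math import sqrt
from scipy import stats

def paired_t_sample_size(effect_size=0.5, alpha=0.05, power=0.8):
    """Smallest n for a two-tailed paired t-test reaching the target power."""
    n = 2
    while True:
        df = n - 1
        nc = effect_size * sqrt(n)                  # noncentrality parameter
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        achieved = stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)
        if achieved >= power:
            return n
        n += 1

def independent_t_sample_size(effect_size=0.5, alpha=0.05, power=0.8):
    """Smallest per-group n for a two-tailed two-sample (unpaired) t-test."""
    n = 2
    while True:
        df = 2 * n - 2
        nc = effect_size * sqrt(n / 2)              # per-group noncentrality
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        achieved = stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)
        if achieved >= power:
            return n
        n += 1

print(paired_t_sample_size())       # 34 (matched pairs)
print(independent_t_sample_size())  # 64 (per group, non-matched)
```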
A comparable metric, called MetrIntPair, was presented in [27]. MetrIntPair uses difficult-problem-solving intelligence measurement data to make a mathematically grounded comparison of the intelligence of two CMASs at an application, and finally classifies the compared systems into intelligence classes. MetrIntPair is based on the same pairwise measurement of difficult-problem-solving intelligence as the MetrIntPairII metric introduced in this paper, and MetrIntPairII conserves the properties of MetrIntPair. The increased generality of MetrIntPairII over MetrIntPair is that MetrIntPairII is able to simultaneously compare and classify a large number of CMASs at an application. MetrIntPairII at some point uses the One-way Repeated-Measures ANOVA test, a generalization of the Two-sample Paired T-test [60,61] used by the MetrIntPair metric. For two samples, the Repeated-Measures ANOVA test yields the same results as the Two-sample Paired T-test, provided both tests are applied with all the required assumptions satisfied; in this case, the p-value of the One-way Repeated-Measures ANOVA test is mathematically identical to the p-value of the Two-sample Paired T-test.
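The claimed equivalence can be checked numerically: for two paired samples of size n, the repeated-measures ANOVA F statistic equals the squared paired-t statistic, so the F(1, n − 1) tail probability is exactly the two-tailed t p-value (the data below are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.normal(1500, 40, size=20)        # system A on 20 shared problems
b = a + rng.normal(10, 15, size=20)      # system B, paired measurements

t_stat, p_t = stats.ttest_rel(a, b)      # Two-sample Paired T-test
f_stat = t_stat ** 2                     # repeated-measures ANOVA F, df = (1, n-1)
p_f = stats.f.sf(f_stat, 1, len(a) - 1)  # ANOVA p-value
print(np.isclose(p_t, p_f))  # True: the two p-values are identical
```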
We applied the MetrIntPairII metric to the intelligence indicator data reported in [27]. The main purpose of the experimental comparison was to prove that the MetrIntPairII metric yields the same results as the MetrIntPair metric when comparing the intelligence of two CMASs. In [27], two CMASs specialized in solving a class of NP-hard problems were considered: one operated similarly to a Rank-Based Ant System (RBAS) [80,82], and the other operated as a Min-Max Ant System (MMAS) [79,80]. The two experimentally compared metrics led to the same decision regarding the intelligence of the studied CMASs.
The family-wise error rate (FWER) is the probability of making one or more type I errors when performing multiple hypothesis tests [29]. If m independent comparisons are performed, the FWER is calculated according to (13), where αComp denotes the type I error of a single comparison and αOv denotes the overall type I error resulting from the m comparisons.
MetrIntPair could in principle be applied to the comparison of more than two CMASs, but this approach would not be appropriate: the probability of making a type I error increases as the number of tests increases. Even if the significance level of each individual test is set at α, the overall probability of a type I error grows with the number of pairwise comparisons. For instance, if the probability of a type I error for each analysis is set at α = 0.05 and four two-sample tests (T-tests, for example) are performed, the overall probability of a type I error for the set of tests increases substantially, to αOv = 1 − 0.95^4 ≈ 0.186 (0.185494). In the case of MetrIntPairII, for the same four intelligence samples, the type I error does not change; its value remains at 0.05. With MetrIntPairII, the probability of making a type I error does not increase as the number of compared systems increases.
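The FWER arithmetic of Eq. (13) is easy to verify with a one-line sketch:

```python
def fwer(alpha_comp, m):
    """Family-wise error rate for m independent comparisons: 1 - (1 - a)^m."""
    return 1.0 - (1.0 - alpha_comp) ** m

print(round(fwer(0.05, 4), 6))  # 0.185494: four pairwise tests at alpha = 0.05
print(round(fwer(0.05, 1), 6))  # 0.05: a single omnibus test keeps the overall rate
```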
The extension of the MetrIntPairII metric over MetrIntPair consists in using a non-parametric statistical test for intelligence indicator data that do not pass the normality assumption; in this case, it uses the Friedman test, known as a robust nonparametric test [62,63]. Based on the obtained intelligence indicators, MetrIntPairII performs a mathematically grounded analysis and applies the most appropriate statistical tests.

Conclusions
In this paper, a novel intelligence metric called MetrIntPairII was proposed. MetrIntPairII is able to effectively measure and compare the intelligence of several CMASs. Based on their difficult-problem-solving intelligence, the studied CMASs are classified into intelligence classes. MetrIntPairII is accurate and robust because it takes into account the variability in the intelligence of the compared CMASs and the occurrence of extreme (low and high) intelligence measurement results. MetrIntPairII is a generalization and extension of the MetrIntPair metric presented in the scientific literature.
For validation purposes, we performed experimental difficult-problem-solving intelligence evaluations on a set of CMASs. Each CMAS was composed of simple computing agents specialized in solving an NP-hard problem, such that increased intelligence emerged at the system level.
The most important property of the proposed metric, and the one that suggests its broad applicability, is its universality. MetrIntPairII was presented here as applied to CMASs; this decision was based on the fact that measuring the intelligence of a CMAS is usually more difficult than measuring the intelligence of a system that operates individually. However, MetrIntPairII can be applied to intelligent agent-based systems generally, even to systems that operate in isolation without cooperating with other systems during problem-solving. Prospective applications could include measuring the intelligence of robotic swarms. The metric can provide a reliable comparison, for instance, between the intelligence of a set of agents with different architectures that solve problems in isolation and the intelligence of a cooperative coalition of agents solving the same type of problem. Based on the comprehensive scientific literature review performed in this study, the metric proposed in this paper is original, and we estimate that it will serve as a foundation for the intelligence measurement of IABSs in many future studies.