A Nonparametric Statistical Approach to Content Analysis of Items

To assess a multidimensional construct with psychometric instruments, we may decompose it into dimensions and develop a set of items for each dimension, so that the construct as a whole is assessed through its dimensions. In this scenario, the content analysis of items aims to verify whether the developed items assess the dimension they are supposed to. In the content analysis process, it is customary to request the judgement of specialists in the studied construct about the dimension that each developed item assesses, which makes it a subjective process, as it relies upon the personal opinion of the specialists. This paper develops a nonparametric statistical approach to the content analysis of items in order to present a practical method to assess the consistency of the content analysis process, through a statistical test that seeks to determine whether all the specialists have the same capability to judge the items. A simulation study is conducted to assess the consistency of the test, and the test is applied to a real validation process.


Introduction
Psychometric instruments play an important role in research in psychology and education; thus, they must be thoroughly developed and validated, so that no erroneous results are obtained by their application. Psychometric instruments are developed in order to assess psychological constructs that cannot be operationally defined and, consequently, cannot be objectively assessed. According to [5], a construct is said to be multidimensional when it consists of a number of interrelated attributes or dimensions and exists in multidimensional domains. In order to develop a psychometric instrument to assess a multidimensional construct, a set of items is developed for each of its dimensions, so that the construct is assessed as a whole. The validation process of an instrument must guarantee that each item assesses its dimension correctly according to the desirable characteristics of a psychometric instrument, e.g., reliability and trustworthiness [4].
The validity of an instrument is divided into four categories: predictive validity, concurrent validity, content validity and construct validity. The first two may be considered together as criterion-oriented validation processes [3]. Predictive validity is studied when the instrument aims to predict a criterion. The instrument is applied and a construct correlated with the criterion is assessed, providing a prediction for the criterion of interest. Concurrent validity is studied when the instrument is proposed as a substitute for another [3]. The study of the construct validity of a psychometric instrument is necessary when the result of the instrument is the measure of an attribute or a characteristic that is not operationally defined. In terms of construct validity, an instrument is valid when it is possible to determine which construct accounts for the variance of the instrument performance. Content validity is established by showing that the instrument items are a sample of a universe in which the investigator is interested. Content validity is ordinarily established deductively, by defining a universe of items and sampling systematically within this universe to establish the instrument [3]. Another definition of content validity is the degree to which elements of an assessment instrument are relevant to and representative of the targeted construct for a particular assessment purpose [4].
A list of thirty-five procedures for content validation was proposed by [4]. Among these procedures are matching each item to the dimension of the construct that it assesses and requesting the judgement of specialists in the construct, also called judges, about the developed items. Carrying out these procedures is imperative to verify whether the developed items are a sample of the universe that the instrument aims to assess. These procedures, which are components of the theoretical analysis of items, are subjective, for they rely upon the personal opinions of specialists and researchers. Indeed, the theoretical analysis of items is done by judges and aims to establish the comprehension of the items (semantic analysis) and their pertinence to the attribute that they propose to assess. This paper proposes a nonparametric statistical approach to the content analysis of items in order to assess its consistency and reliability. Therefore, our approach does not seek to establish the validity of the instrument, but rather to assess the consistency of the content analysis process, so that its ruling about the instrument may be trusted. Thus, this approach must be applied along with other instrument validation methods, quantitative and qualitative, e.g., semantic analysis, pretrial and factorial analysis, in order to ensure the reliability, consistency, validity and trustworthiness of the psychometric instrument.

Method
The researcher, supported by the theory of the construct that the instrument aims to assess, develops m items and for each item assigns a theoretical dimension according to the theory and/or his opinion about which dimension the item assesses. Although the items and their dimensions have theoretical foundations, it is necessary to test them in order to determine if every item is indeed assessing the dimension it is supposed to.
To carry out this test, the items are sent to s specialists in the construct, so that they may judge the items according to the dimension they assess. The items should be sent to at least six specialists and should be presented to them in a random order and without their theoretical dimensions, so that their judgement is not biased.
A condition for an item to be excluded from the instrument is determined based on the judgement of the specialists. This condition must exclude the items that do not belong to the universe that the instrument aims to assess, so that the items that are not excluded are a sample of that universe. A possible way to proceed is to determine a Concordance Index (CI) stating that all items for which fewer than c% of the specialists agree on the dimension that they assess must be excluded. One may also take the Content Validity Ratio (CRV), as proposed by [6], as a condition to exclude items that do not belong to the universe that the instrument aims to assess.
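The exclusion rule based on the CI can be illustrated with a short sketch. It is only one possible reading of the rule above: the function name `concordance_index_filter`, the data layout (one row per item, one entry per specialist) and the example judgements are hypothetical and are not taken from the paper.

```python
from collections import Counter

def concordance_index_filter(judgements, c=50):
    """Return the indices of the items kept by a Concordance Index of c%.

    judgements[l][j] is the dimension (any hashable label) that
    specialist j assigned to item l. An item is excluded when fewer
    than c% of the specialists agree on its dimension.
    """
    kept = []
    for item, row in enumerate(judgements):
        s = len(row)
        # Size of the largest group of specialists that chose the same dimension.
        top_count = Counter(row).most_common(1)[0][1]
        # Integer arithmetic avoids rounding issues in the percentage.
        if 100 * top_count >= c * s:
            kept.append(item)
    return kept

# Hypothetical data: 3 items judged by 6 specialists into dimensions P, J, T.
example = [
    list("PPPPPJ"),  # 5 of 6 agree
    list("PJTPJT"),  # at most 2 of 6 agree
    list("TTTTJJ"),  # 4 of 6 agree
]
kept_items = concordance_index_filter(example, c=50)
print(kept_items)
```

Raising c makes the rule stricter: with c = 80 only the first item of this hypothetical example would be kept.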
The method to be developed in this paper aims to determine if all specialists have the same capability to judge the items according to their dimensions, through the analysis of the judgement of the specialists about the items that were not excluded by the established condition. However, the method does not rank the specialists according to their capabilities, but only determines if all specialists have the same capability. Therefore, it is not possible to identify the specialists with low capability.
If there is no evidence that the capabilities of the specialists are different, their judgement is accepted and the items not excluded by the established condition are used in the next steps of the instrument validation process. Indeed, if all the specialists have the same capability, it may happen that they are all highly capable or all barely capable of judging the items, and the proposed method will not be able to differentiate between the two cases. Nevertheless, the two scenarios may be differentiated by a qualitative analysis of the specialists' judgements, by observing if they agree with the theoretical dimension of the items and, when they do not agree, if there is some theory that supports their choice. Therefore, if their judgements are consistent with some theory, then the specialists may be regarded as being all highly capable of judging the items, given that they all have the same capability to judge them.
On the other hand, if it is determined that the specialists do not all have the same capability to judge the items, then at least one specialist is less capable of judging them than the others, which may bias the validation of the instrument. Therefore, in such a scenario, we propose two approaches in order to avoid a biased validation process. First, we propose that the specialists' judgement be disregarded and a new group of specialists be requested to judge the items. However, this approach may be impractical in some cases, as time and resources may be too limited to repeat the cycle of specialists' judgements more than once. Therefore, we also propose a much more practical approach, which consists in applying the proposed method to all subgroups of specialists of size $s^*$, $6 \leq s^* < s$, of the original group of specialists, and then choosing the judgement of a subgroup whose specialists all have the same capability to judge the items. This approach will be presented in more detail in the application section.
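The enumeration of subgroups in the second approach can be sketched as follows. The function name `candidate_subgroups` and the labels `e1`, ..., `e9` are hypothetical; the sketch only lists the subgroups, and the test itself would then be applied to each of them.

```python
from itertools import combinations
from math import comb

def candidate_subgroups(specialists, min_size=6):
    """Yield every proper subgroup with min_size <= size < len(specialists)."""
    for size in range(min_size, len(specialists)):
        yield from combinations(specialists, size)

# Nine hypothetical specialist labels.
specialists = [f"e{j}" for j in range(1, 10)]
groups = list(candidate_subgroups(specialists))
n_groups = len(groups)

# Number of subgroups of sizes 6, 7 and 8 out of nine specialists.
print(n_groups, comb(9, 6), comb(9, 7), comb(9, 8))
```

For nine specialists this enumeration gives 84 + 36 + 9 = 129 proper subgroups of sizes 6 to 8, or 130 groups if the full group of nine is counted as well. Since many tests are run on overlapping subgroups, the resulting p-values are best read as a descriptive aid to the qualitative analysis.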

Notation and Definitions
Let $C = \{C_1, \dots, C_n\}$ be a construct divided into $n$ dimensions and $U$ be the universe of all the items that assess the dimensions of $C$. A set $I = \{i_1, \dots, i_m\}$ of items is developed based on the theory about $C$, and then a subset $I^* \subset I$ of items, which we believe to be a subset of $U$, is determined by the following process.
Denote by $E = \{e_1, \dots, e_s\}$ a set of $s$ specialists and let $C_{c(i_l)} \in C$ be the dimension that the item $i_l \in I^*$ assesses. Let the random variables $\{X_{i_l}(e_j) : i_l \in I, e_j \in E\}$, defined on $(\Omega, \mathcal{F}, P)$, be such that $X_{i_l}(e_j) = k$ if the specialist $e_j$ judged the item $i_l$ at the $k$th dimension of $C$. Note that if $i_l \in I^*$ and $X_{i_l}(e_j) = c(i_l)$, then the specialist $e_j$ judged the item $i_l$ correctly.
The capability of the specialist $e_j$ to judge the items is defined as
$$P(e_j) = \big(P_{i_l}(e_j) : i_l \in I^*\big),$$
in which $P_{i_l}(e_j) = P\{X_{i_l}(e_j) = c(i_l)\}$, $\forall i_l \in I^*$ and $\forall e_j \in E$. In the proposed approach, we are interested in developing a hypothesis test to determine if $P(e_j) = p \in [0, 1]^{|I^*|}$, $\forall e_j \in E$, i.e., if all specialists have the same capability to judge the items.
For this purpose, let a random sample of the judgement of the specialist $e_j$ about the items of $I$ be given by $x_{e_j} = \{x_{i_1}(e_j), \dots, x_{i_m}(e_j)\}$ and let $\mathcal{X}$ be the space of all possible random samples $\{x_{e_j} : e_j \in E\}$. Define the random sets
$$M_{i_l} = \arg\max_{k \in \{1, \dots, n\}} \sum_{e_j \in E} \mathbb{1}\{X_{i_l}(e_j) = k\}, \quad i_l \in I,$$
in which $\mathbb{1}\{A\}$ is the indicator function of the set $A$. Note that $M_{i_l}$ is the set containing the number of the dimensions in which the majority of the specialists judged the item $i_l \in I$. Given a random sample $\{x_{e_j} : e_j \in E\} \in \mathcal{X}$ and a subset $I^* \subset I$ of items, the set $\{m_{i_l} : i_l \in I^*\}$, determined from the sample values $\{x_{e_j} : e_j \in E\}$, is a random sample of $\{M_{i_l} : i_l \in I^*\}$.
The subset $I^*$ may be defined by a condition function, a function of the sample $\{x_{e_j} : e_j \in E\}$, given by $f : \mathcal{X} \to \mathcal{P}(I)$, in which $\mathcal{P}(\cdot)$ is the power set operator. The condition function must be such that if $\{m_{i_l} : i_l \in I^*\}$ is determined from $\{x_{e_j} : e_j \in E\} \in \mathcal{X}$ and $I^* = f(x_{e_j} : e_j \in E)$, then $|m_{i_l}| = 1$, $\forall i_l \in I^*$. The CI for $c > 50$ and the CRV are condition functions. From now on, it will be supposed that the condition function may be expressed as a CI.
The condition function is based on the assumption that an item is in the universe of items that assess the construct of interest if the majority of specialists agree on the dimension it assesses. Of course, one may take a different criterion to exclude the items that do not assess the construct of interest, although our method may be applied only if the criterion can be expressed as a condition function, for it is based on the fact that $M_{i_l}$ is a univariate random variable.
Finally, define
$$W_{i_l}(e_j) = \mathbb{1}\{X_{i_l}(e_j) \in M_{i_l}\}$$
as the random variable that indicates if the specialist $e_j$ judged the item $i_l$ at the same dimension as the majority of the specialists. Given a random sample $\{x_{e_j} : e_j \in E\} \in \mathcal{X}$ and a subset $f(x_{e_j} : e_j \in E) = I^* \subset I$ of items, the set $\{w_{i_l}(e_j) : i_l \in I^*, e_j \in E\}$, determined from the sample values $\{x_{e_j} : e_j \in E\}$, is a random sample of $\{W_{i_l}(e_j) : i_l \in I^*, e_j \in E\}$.
On the one hand, whilst we observe the values of the random variables $\{X_{i_l}(e_j) : i_l \in I, e_j \in E\}$, we do not know if the specialists judged the items correctly or not, for the dimension that an item really assesses (if any) is unknown. Therefore, it is not possible to differentiate the specialists by the number of items they judged correctly, for example.
On the other hand, from the random variables $\{W_{i_l}(e_j) : i_l \in I, e_j \in E\}$, we know the concordance of the specialists on the judgement of the items, which gives us a relative measure of the capability of the specialists to judge the items. Therefore, we are able to test if all the specialists have the same capability to judge the items, although we cannot determine the capability of each one.
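As an illustration of the quantities defined in this section, the sketch below computes the majority sets and the agreement indicators from a small matrix of judgements. The function names `majority_sets` and `agreement_matrix` and the example data are hypothetical, and ties are kept in the set, as in the definition of M as a set of dimensions.

```python
from collections import Counter

def majority_sets(judgements):
    """For each item, the set of dimensions with the highest number of votes."""
    result = []
    for row in judgements:
        counts = Counter(row)
        top = max(counts.values())
        result.append({k for k, v in counts.items() if v == top})
    return result

def agreement_matrix(judgements):
    """w[l][j] = 1 if specialist j chose a dimension in the majority set of item l."""
    return [
        [1 if x in m else 0 for x in row]
        for row, m in zip(judgements, majority_sets(judgements))
    ]

# Hypothetical judgements of 2 items by 6 specialists into dimensions 1, 2, 3.
x = [
    [1, 1, 1, 2, 1, 3],
    [2, 2, 3, 2, 2, 2],
]
m = majority_sets(x)
w = agreement_matrix(x)
print(m)
print(w)
```

When the condition function holds, each majority set has a single element, so each row of `w` simply marks the specialists who agreed with the majority dimension.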

Assumptions
The development of the items and the judgement of the specialists must satisfy two assumptions so that the method to be presented below may be applied:
1. Each item $i_l \in I^*$ assesses one, and only one, dimension $C_{c(i_l)} \in C$.
2. The random variables $\{X_{i_l}(e_j) : i_l \in I^*, e_j \in E\}$ are independent.
Assumption 1 establishes that the items that were not excluded by the condition function, i.e., the items in $I^*$, are well constructed and assess only one dimension of $C$, while assumption 2 imposes that the specialists judge the items independently of each other and that the judgement of a specialist about one item does not depend on the same specialist's judgement about any other item. These assumptions are not strong, for it is expected that they will be satisfied if the items were well constructed. Indeed, the better the condition function is at determining which items are not in $U$, the better will be the quality of the items in $I^*$. Therefore, the assumptions above are closely related to the condition function. If, in fact, $I^* \subset U$, then the first assumption is immediately satisfied, for there is no intersection between two dimensions of a construct, and the second assumption may also hold, for the items are well defined.

Mathematical Deduction
Given a random sample $\{x_{e_j} : e_j \in E\} \in \mathcal{X}$, it is not trivial to estimate the capabilities $\{P(e_j) : e_j \in E\}$, for the dimension that each item assesses is unknown. Examining such a random sample, it is known that the specialist $e_j$ judged the item $i_l$ at the dimension $C_k$, but it is not possible to determine, with probability 1, whether the item was judged correctly. Therefore, the problem is, given a random sample $\{x_{e_j} : e_j \in E\} \in \mathcal{X}$, to determine random variables that allow us to test if the capability of all the specialists is the same. It will be shown that if the random variables $\{W_{i_l}(e_j) : e_j \in E\}$ are not identically distributed $\forall i_l \in I^*$, then the specialists do not all have the same capability to judge the items. Indeed, in order to test if the capability of all specialists is the same, we will consider a null hypothesis $H_0$ formed by two parts, the first of which states that $P(e_j) = p$, $\forall e_j \in E$. Of course, we are only interested in testing the first part of $H_0$, which refers to the capability of the specialists, i.e., that all specialists have the same capability to judge the items. However, the second part is needed to develop a test statistic for $H_0$. It will be argued that for large values of $p(i_l)$ the hypothesis that is actually being tested is the first one. The propositions below set the scenario for the nonparametric test that will be used to test $H_0$.

Proposition 1
The random variables $\{W_{i_l}(e_j) : i_l \in I^*\}$ are independent $\forall e_j \in E$, but the random variables $\{W_{i_l}(e_j) : e_j \in E\}$ are dependent $\forall i_l \in I^*$.
Proof: On the one hand, the random variables $\{W_{i_l}(e_j) : i_l \in I^*\}$ are each, by assumption 2, functions of independent random variables, therefore they are independent. On the other hand, note that $\sum_{e_j \in E} W_{i_l}(e_j) \geq \lceil cs/100 \rceil$, for at least c% of the specialists must agree on the dimension that an item in $I^*$ assesses, which establishes a dependence.

Proposition 2
Under $H_0$, the random variables $\{W_{i_l}(e_j) : e_j \in E\}$ are identically distributed $\forall i_l \in I^*$.
Proof: We have that . . . , $n-1\}$, be independent random variables, and let $f^* = \lfloor cs/100 \rfloor$, in which $c$ is the CI. Then, . . . Hence, . . . that does not depend on $e_j$ and the result follows.
It is important to note that if all $p(i_l)$ are approximately 1, then $P\{W_{i_l}(e_j) = 1\} \approx p(i_l) P\{X \geq f^*\}$ and the hypothesis that is really being tested is the first part of $H_0$. Therefore, it is reasonable to test $H_0$ in order to determine if the specialists all have the same capability to judge the items, for, if this is indeed true, we expect all $p(i_l)$ to be large, so that the second part of $H_0$ will hardly lead to the rejection of $H_0$ when the capability is the same.
This test may be used as a diagnostic for the content analysis of items. If $H_0$ is not rejected, then there is no evidence that the capabilities of the specialists are different. However, if $H_0$ is rejected, we do not know if it is the first or the second part (or both) of $H_0$ that is not being satisfied by the judgement of the specialists. Nevertheless, we may disregard the judgement of those specialists in any case, for either their capabilities are not the same or they are the same but some $p(i_l)$ are small, which led to the rejection of $H_0$ by its second part.

Hypothesis Testing
Cochran's Q test may be applied to the random sample $\{w_{i_l}(e_j) : i_l \in I^*, e_j \in E\}$ determined from $\{x_{e_j} : e_j \in E\}$ as a way to test $H_0$ [1]. The assumptions of Cochran's Q test, using the notation of this paper, are: (a) The items of $I^*$ were randomly selected from the items that form the universe $U$ that the instrument aims to assess. In the usual scenario in which such a test is applied, the items may be seen as the blocks and the specialists as the treatments. What Cochran's Q test evaluates is whether the random variables $\{W_{i_l}(e_j) : e_j \in E\}$ are identically distributed for all $i_l \in I^*$. Therefore, if we reject the null hypothesis of the test, we conclude that $\{W_{i_l}(e_j) : e_j \in E\}$ are not identically distributed for all $i_l \in I^*$ and, by Proposition 2, $H_0$ is also rejected. Thus, the hypothesis tested by Cochran's Q test is indeed $H_0$.
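The statistic of the test is calculated from Table 1, in which $I^* = \{i^*_1, \dots, i^*_v\}$, and may be expressed as
$$Q = s(s-1)\,\frac{\sum_{j=1}^{s} \left(C_j - \frac{N}{s}\right)^2}{\sum_{l=1}^{v} R_l\,(s - R_l)},$$
in which $R_l$ is the total of the row of item $i^*_l$, $C_j$ is the total of the column of specialist $e_j$ and $N$ is the grand total.

Table 1: Sample values arranged for Cochran's Q test.

| Item | $e_1$ | $\cdots$ | $e_s$ | Total |
|---|---|---|---|---|
| $i^*_1$ | $w_{i^*_1}(e_1)$ | $\cdots$ | $w_{i^*_1}(e_s)$ | $R_1$ |
| $\vdots$ | $\vdots$ | | $\vdots$ | $\vdots$ |
| $i^*_v$ | $w_{i^*_v}(e_1)$ | $\cdots$ | $w_{i^*_v}(e_s)$ | $R_v$ |
| Total | $C_1$ | $\cdots$ | $C_s$ | $N$ |

The exact distribution of the Q statistic may be calculated by the method presented by [7], although a large sample approximation may be used instead. If $|I^*|$ is large, then the distribution of Q is approximately $\chi^2$ with $(s - 1)$ degrees of freedom [2].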
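A minimal sketch of the computation of the Q statistic is given below, using the standard textbook form of Cochran's Q with items as blocks (rows) and specialists as treatments (columns). The function name `cochran_q`, the example matrix and the convention of returning 0 when every row is constant are assumptions made for illustration only.

```python
def cochran_q(w):
    """Cochran's Q for a 0/1 matrix w: rows are items, columns are specialists."""
    s = len(w[0])
    col_totals = [sum(row[j] for row in w) for j in range(s)]
    row_totals = [sum(row) for row in w]
    n = sum(row_totals)
    # Rows in which all specialists agree contribute nothing to the denominator.
    denom = sum(r * (s - r) for r in row_totals)
    if denom == 0:
        # Assumption of this sketch: with no disagreement at all,
        # report Q = 0 instead of dividing by zero.
        return 0.0
    numer = s * (s - 1) * sum((c - n / s) ** 2 for c in col_totals)
    return numer / denom

# Hypothetical agreement indicators for 4 items and 3 specialists.
w = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
q = cochran_q(w)
print(q)
```

In this hypothetical example the column totals are 4, 3 and 1, the row totals are 2, 2, 1 and 3, and Q = 28/6, to be compared with a chi-square distribution with s − 1 = 2 degrees of freedom when the number of items is large.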
It is worth mentioning that the random variables $\{W_{i_l}(e_j) : e_j \in E\}$ being identically distributed for all $i_l \in I^*$ does not imply that the specialists all have the same capability to judge the items, although there is no evidence that their capabilities are different. If there is no evidence that the capabilities of the specialists to judge the items are different, their judgement may be accepted.
If it is determined that the random variables $\{W_{i_l}(e_j) : e_j \in E\}$ are not identically distributed for all $i_l \in I^*$, then the judgement of the specialists is disregarded, for $H_0$ is rejected. The items may be judged by different groups of specialists until they are judged by a group in which all the specialists have the same capability to judge the items. Those groups may be formed by new specialists or may be subgroups of size $s^*$, $6 \leq s^* < s$, of the specialists for which $H_0$ was rejected.

Simulation Study
As Cochran's Q test is not a powerful one, i.e., its Type II error may be too great, a simulation study will be conducted to estimate the power of the test in some specific cases. The power of a statistical test is defined as the probability of $H_0$ being rejected when it is false and depends on the real scenario, i.e., on the real values of the parameters considered in $H_0$. Therefore, the power of Cochran's Q test in testing $H_0$ depends on the real capability of each specialist in judging the items, so the simulation study considers 10 distinct scenarios and is conducted as follows.
For each scenario, we will simulate 50,000 judgements of the same items by the judges and then determine the proportion of the simulations in which $H_0$ was rejected at a significance level, i.e., Type I error, of 5%. This proportion will be regarded as an estimate of the power of the test in the considered scenario. A CI of 50% will be used to determine $I^*$ in each simulation. Analysing the results of all 10 scenarios, we will have a wide picture of the power of the test and will know for which scenarios it is more powerful.
In all scenarios we will consider nine specialists judging 30 items into three dimensions, which is the framework of the application in the next section. We will also consider that the capability of each specialist is the same for all items, i.e., that $P\{X_{i_l}(e_j) = c(i_l)\} = p_j$ for all $j \in \{1, \dots, 9\}$ and $l \in \{1, \dots, |I^*|\}$. Finally, we will assume that $P\{X_{i_l}(e_j) = k\} = (1 - p_j)/2$ for all $k \neq c(i_l)$, $j \in \{1, \dots, 9\}$ and $l \in \{1, \dots, |I^*|\}$. The scenarios and their estimated test power are displayed in Table 2.
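The simulation design described above can be sketched in a few lines. This is not the R code of the supplementary material: the function names, the two capability vectors and the reduced number of simulations (500 instead of 50,000) are hypothetical choices made to keep the example short, and the chi-square approximation with 8 degrees of freedom is used in place of the exact distribution.

```python
import random
from collections import Counter

CHI2_CRIT_DF8 = 15.507  # 95% quantile of the chi-square distribution with 8 df

def simulate_judgements(capabilities, n_items, n_dims, rng):
    """Each specialist picks the true dimension w.p. p_j, else a wrong one uniformly."""
    rows = []
    for _ in range(n_items):
        true_dim = rng.randrange(n_dims)
        wrong = [k for k in range(n_dims) if k != true_dim]
        rows.append([true_dim if rng.random() < p else rng.choice(wrong)
                     for p in capabilities])
    return rows

def agreement_rows(judgements, c=50):
    """Apply the CI and return the 0/1 agreement rows of the items kept."""
    kept = []
    for row in judgements:
        dim, count = Counter(row).most_common(1)[0]
        # With nine specialists and c = 50 this requires at least five
        # votes, so the majority dimension of a kept item is unique.
        if 100 * count >= c * len(row):
            kept.append([1 if x == dim else 0 for x in row])
    return kept

def cochran_q(w):
    s = len(w[0])
    cols = [sum(row[j] for row in w) for j in range(s)]
    rows = [sum(row) for row in w]
    n = sum(rows)
    denom = sum(r * (s - r) for r in rows)
    if denom == 0:
        return 0.0  # assumption: no disagreement means no rejection
    return s * (s - 1) * sum((c - n / s) ** 2 for c in cols) / denom

def estimate_power(capabilities, n_sims=500, n_items=30, n_dims=3, seed=0):
    """Proportion of simulated data sets in which H0 is rejected at 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        w = agreement_rows(simulate_judgements(capabilities, n_items, n_dims, rng))
        if w and cochran_q(w) > CHI2_CRIT_DF8:
            rejections += 1
    return rejections / n_sims

# Hypothetical scenarios (not those of Table 2).
power_mixed = estimate_power([0.9] * 8 + [0.3])  # one specialist with low capability
power_equal = estimate_power([0.9] * 9)          # all specialists equally capable
print(power_mixed, power_equal)
```

In this sketch the rejection rate under equal capabilities plays the role of the empirical Type I error, while the rate in the mixed scenario estimates the power against one clearly less capable specialist.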
On the one hand, we see in Table 2 that the power of the test is high when the majority of the specialists have the same high capability, while a few specialists have a low capability, as is the case in scenarios 1, 2 and 3. On the other hand, the power of the test is quite low when some of the specialists have the same high capability, but the specialists with lower capability are almost as capable as them, as is the case in scenarios 4, 5 and 6.
In scenarios 7 and 8 we see that the power of the test is low when there are specialists with capability less than 0.5. This happens because the specialists hardly agree on the dimension that each item assesses (as some of them are not capable), so that many items are excluded by the CI and, on the items that remain, the specialists that are not capable agree with the highly capable ones, so it seems that they have high capability. Indeed, in scenarios 7 and 8, the mean number of items that are not excluded is the lowest of all scenarios, so that a low concordance among the specialists is evidence of the existence of specialists with low capability, given that the items were well constructed.
Finally, as pointed out in the Mathematical Deduction section, we see in scenarios 9 and 10 that the hypothesis that is actually being tested when all the specialists are highly and equally capable is the first part of $H_0$, as the power of the test is close to the Type I error, which must be the case if the hypothesis is true.
The simulation study sheds light on some interesting facts about the proposed method in the considered scenarios. On the one hand, if the majority of the specialists have a homogeneous high capability, and a few specialists have a very low capability, then the power of the test is high. However, if the specialists all have high, but different, capabilities, then the power of the test is low. On the other hand, if the majority of the specialists have a low capability, then a great number of items will be excluded by the CI and, given that the items were well constructed, we may conclude that the specialists have low capability of judging the items, even though the power of the test is low. Finally, if only the first part of $H_0$ is being satisfied, and the capability of the specialists is high, then the power of the test is low and, therefore, the hypothesis that is really being tested is the first part of $H_0$.

Application: Perception About the Evaluation of the Teaching-Learning
In this section we will apply the developed method to a real validation process, in order to analyse the content of the items of an instrument that aims to assess the perception of teachers and students of higher education institutions about the evaluation of the teaching-learning process, which is a construct that may be divided into three dimensions: process (P), judgement (J) and teaching-learning (T). The evaluation of the teaching-learning has a process dimension, as it must have a well defined beginning, middle and end and must have a continuous, cumulative and systematic character. Indeed, it is a systematic mechanism for gathering information over time, with well defined levels, which characterizes it as a process. Also, the evaluation of the teaching-learning has a judgement dimension because it must issue a judgement of value or assign a score through the analysis of educational results obtained from the information gathered over time. Finally, the evaluation of the teaching-learning has a teaching-learning dimension for, as its own name says, it must evaluate not only the learning, but also the teaching: it should evaluate not only what the student has learnt, but also what the teacher has taught. Therefore, the evaluation of the teaching-learning is a process of data gathering, in which an individual will judge or be judged according to the teaching-learning.
In order to develop an instrument to assess this construct, 30 items were developed and sent to nine specialists, so that they would assign each item to the dimension that, in their opinion, it assesses. The condition defined for excluding an item is the CI with c = 50. The judgements of the specialists are presented in Table 3, the table for Cochran's Q test is displayed in Table 4 and a translation of the items, which were originally constructed in Portuguese, is presented in the Appendix.
The statistic of Cochran's Q test for the data in Table 4 is Q = 8.7 and the p-value of the test is 0.36, so there is no evidence against $H_0$ at a significance level of 5%. Furthermore, as the majority of the specialists agreed on the dimension that 24 out of 30 (80%) items assess, we also do not have evidence that the capability of the specialists is low. Therefore, based on the proposed method, there is no reason to disregard the judgement of the specialists.
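The reported p-value can be checked against the large sample approximation with a short sketch. The function `chi2_sf` below is a hypothetical helper that evaluates the upper tail of the chi-square distribution through its closed-form finite series for integer degrees of freedom; it assumes that the p-value comes from the chi-square approximation with s − 1 = 8 degrees of freedom, which the text does not state explicitly.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(x).
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(chi / math.sqrt(2)) + 2 * pdf * total

p_value = chi2_sf(8.7, 8)  # Q = 8.7 with 9 - 1 = 8 degrees of freedom
print(round(p_value, 3))
```

Under these assumptions the approximation gives a p-value of about 0.37, close to the value of 0.36 reported above; the small difference may be due to the rounding of Q or to the use of the exact distribution.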
Nevertheless, in order to illustrate the proposed approach for the case in which $H_0$ is rejected, we will apply the test to every subgroup of specialists of size $6 \leq s^* < 9$, which amounts to 130 subgroups, and see for which subgroups the capability of the specialists is the same. For 13 of the 130 subgroups, $H_0$ was rejected at a significance level of 5%. The Q statistic and the p-value for the 10 groups with the greatest p-values are displayed in Table 5. If $H_0$ had been rejected for the group of nine specialists, we could now look for a subgroup of those specialists for which $H_0$ is not rejected and, with the help of a qualitative analysis, we could choose a subgroup of those specialists instead of disregarding their judgements as a whole and sending the items to other specialists to judge.
Table 3: Judgement of the specialists about each item, i.e., the sample $\{x_{e_j} : e_j \in E\}$.

Final Remarks
Cochran's Q test is not a powerful one, thus the method must be used with caution. The validation of a psychometric instrument is a process formed by various procedures, therefore it must not be restricted to the content analysis of items and the method developed in this paper. It is important to apply other validation techniques, both qualitative and quantitative, to the instrument so that it may be properly validated. The method may be improved in order to further decrease the subjectivity of the content analysis of items, especially by the development of more powerful tests than the one established and the definition of other random variables that enable the comparison between the judgements of the specialists. This paper does not exhaust the subject, but presents a nonparametric statistical approach that aims to decrease the subjectivity of a subjective process and that may be applied not only to the content analysis of items, but also to any statistical application that enables the definition of variables such as those of this paper.

Supplementary Materials
The R [8] code used in the simulation study and in the application section is available as supplementary material for this paper and can be accessed at www.ime.usp.br/~dmarcondes.

Appendix
22. The evaluation is a process with continuous, cumulative and systematic, but not episodic, character (Process).
23. Evaluate means to provide a judgement of value or to assign a score to whom is being evaluated (Judgement).
24. The evaluation is a tool that makes it possible to inquire to what extent the defined objectives are being achieved (Judgement).
25. The evaluation has an authoritarian and classificatory role inside the process of teaching-learning (Judgement).
26. The evaluation is an educational component that can facilitate the teaching-learning (Teaching-Learning).
27. The teaching-learning and the evaluation are not isolated parts of the education process (Teaching-Learning).
28. The evaluation is the most adequate path to make an excellent teaching-learning feasible (Teaching-Learning).
29. The evaluation stimulates the acts of teaching and learning as a simultaneous process (Teaching-Learning).
30. The evaluation involves the intentional judgement of a process developed by an individual during their learning (Judgement).