1. Introduction
In recent years, data analysis methods that require significant computing resources have become viable and now represent an important alternative. This is the case with Principal Component Analysis (PCA), which can be used to analyze data with tens to hundreds of attributes. Much has been written about PCA, a technique that dates back to the early 1900s [1] and has been developed intensely in recent years thanks to increased computing power; some previous works present theoretical and practical aspects [2,3,4,5,6,7], and others are more practically oriented, including Python code [8].
Unlike other data analysis methods [9], PCA is a method that involves a loss of what is called “information” about the analyzed data, as a “price to pay” for observing the data in a simplified way, through descriptive data matrices and special diagrams such as bar charts and scatter plots, in order to facilitate drawing conclusions.
Principal component analysis (PCA) is applied to real-valued data. This data consists of the attribute values of several individuals, all with the same attributes but with values specific to each individual. The data is stored in a matrix of real numbers, denoted $X$, which in this work has $n$ rows and $m$ columns. Each row contains the attribute values of one individual, and each column contains the values of a specific attribute common to all individuals. Therefore, the data corresponds to $n$ individuals and $m$ attributes. $X$ is represented in Equation (1), where each row $x_i$, $i = 1, \dots, n$, represents the attribute data of individual $i$.
$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix} \quad (1)$$
To graphically visualize the data, represented as Equation (1), the data in each row are represented as a point (a vector) $x_i$, drawn in a Euclidean space with Cartesian coordinates, which has $m$ perpendicular coordinate axes, each axis corresponding to an attribute.
When m is large, it becomes difficult to observe and draw conclusions about the points, specifically about the relationships between the attributes. In this case, PCA can be helpful because PCA is a dimensionality reduction method that facilitates the visualization of the points by finding the most representative projection of the points onto a lower-dimensional subspace, generally having two or three dimensions.
Regarding the data used for the PCA application example provided in the present work, the data comes from the results of the Youth Risk Behavior Surveillance System (YRBSS) [10], a program for obtaining data on health risk behaviors among young adults in the United States. For the YRBSS, young adults are represented by students in grades 9 through 12; these school years encompass students aged 12 to 18, so students in their final year of this stage would be eligible to apply for college admission. One limitation is that those attending school do not fully represent all individuals in this age group; in 2009, 4% of 16- to 17-year-olds were not enrolled in school.
The YRBSS tracks various categories of risk behaviors among young adults, including tobacco, alcohol, and other drug use, physical inactivity, and other risks. It is important to note that the YRBSS does not focus on the effects (outcomes) of these behaviors, such as school absenteeism or poor academic performance; nor does it consider underlying factors, such as attitudes, beliefs, or skills.
The YRBSS is conducted by the CDC (Centers for Disease Control and Prevention) and other educational agencies in the United States, to generate “information” to help evaluate the effect of broad national, state, territorial, tribal, and local policies and programs; and is not intended to evaluate the effectiveness of specific interventions.
The present work focuses on analyzing data from the Youth Risk Behavior Survey (YRBS), administered in 2023 to a nationally representative sample of students in grades 9 through 12.
The YRBSS also examines, in less detail, aspects of obesity, asthma, anxiety, eating, and exercise.
The contents of the present work are the following:
Section 2, Methods and Materials, describes the theoretical and operational aspects of PCA and presents an application example.
Section 3, Results, presents the results of the example.
Section 4, Discussion, includes a discussion of the results and more explanation about the methods used.
Section 5, Conclusions, follows. A brief Acknowledgments Section follows, along with Appendix A, which provides further explanation about the application example data. Finally, the References Section is included.
2. Methods and Materials
2.1. The PCA Method
In the initial situation (before knowing that we need PCA), we have data from $n$ individuals with $m$ attributes, stored in the data matrix $X$ (Equation (1)). To analyze this data, we represent it as $n$ points located in an $m$-dimensional Euclidean space with perpendicular coordinate axes, referred to as “original” axes, where each dimension corresponds to one of the attributes.
When $m$ is large, visualizing the points becomes difficult. To facilitate observing the points, PCA finds an “alternative” set of $m$ orthogonal coordinate axes, referred to as “alternative” axes and individually named PC1, PC2, …, PC$m$.
PCA aims to minimize the average “distance” (the error) between the location of the points and the location of their projections onto each one of the “alternative” axes (it is said that the residuals are minimized [5]).
No attribute, by its very nature, should have average values bigger than any other. This allows for an unbiased comparison of the average proximity of points to the various “alternative” axes. To achieve this, the attribute values (the values in each column of the data matrix $X$) must be normalized. Normalization does not affect the objective of discovering the relationship between the attributes.
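The normalization step can be sketched as follows. This is only an illustration, assuming that "normalized" means standardizing each column to mean 0 and variance 1 (the convention of Scikit-learn's StandardScaler); the small matrix `X_raw` is made-up data, not YRBSS data.

```python
import numpy as np

# Hypothetical raw data: 5 individuals (rows) x 3 attributes (columns).
X_raw = np.array([
    [1.0, 10.0, 3.0],
    [2.0, 14.0, 1.0],
    [3.0, 12.0, 4.0],
    [4.0, 18.0, 1.0],
    [5.0, 16.0, 5.0],
])

# Standardize each column: subtract its mean, divide by its standard deviation.
# ddof=0 (population standard deviation) is an assumption; it matches the
# default behaviour of scikit-learn's StandardScaler.
col_mean = X_raw.mean(axis=0)
col_std = X_raw.std(axis=0, ddof=0)
X = (X_raw - col_mean) / col_std

print(X.mean(axis=0))  # approximately 0 for every column
print(X.var(axis=0))   # approximately 1 for every column
```

After this step no attribute dominates the others merely because of its scale.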
Let $v_1, v_2, \dots, v_m$ be the direction vectors of the “alternative” axes (PC1, PC2, …, PC$m$); we call these vectors the “alternative” direction vectors. The minimization yields the direction and sense of $v_1, \dots, v_m$ with regard to the direction vectors of the “original” coordinate axes, which we represent as $e_1, e_2, \dots, e_m$; we call these vectors the “original” direction vectors. The minimization is done in steps; first, the minimization is performed with respect to $v_1$, then minimization with respect to $v_2$ is performed, considering that $v_2$ must be perpendicular to the already determined $v_1$, and so on, to find all the direction vectors $v_1, \dots, v_m$, where the last direction vector, $v_m$, turns out to be the one with the largest residue (the greatest difference or “distance”).
It can be shown [5] that these minimizations are equivalent to maximizing the variance of the projections of the points $x_i$ onto the “alternative” axes. This results in the variance of the projections of the points onto PC1 being the greatest; the variance of the projections of the points onto PC2 being the next greatest; and so on. The variability of the distances of these projections on an “alternative” axis defines the quality with which these projections describe the points.
The “alternative” axes end up appearing as a set of perpendicular axes rotated with respect to the “original” axes, both sets of axes sharing the same center. The change from the “original” axes to the “alternative” axes, which serve as a new reference for locating points, is called a “transformation”—in this case, a “rotation” transformation.
After performing the maximization (using Lagrange multipliers) [5], it turns out that the vectors $v_1, \dots, v_m$ are the eigenvectors of the matrix $S = \frac{1}{n-1}X^{T}X$ (called the covariance matrix of $X$), such that $S v_j = \lambda_j v_j$, where $v_j$ is written as a column, and where the eigenvalues are such that $\lambda_1$ is the eigenvalue of $v_1$, which is the maximum eigenvalue for that matrix; $\lambda_2$ is the eigenvalue of $v_2$, which is the second largest eigenvalue, and so on.
One theoretical aspect that reinforces the existence of eigenvectors for the covariance matrix is that, since it is a real and symmetric matrix, it can be shown that it has real eigenvalues and a full set of real eigenvectors that can be chosen to be linearly independent. Given that the covariance matrix is positive semidefinite, its eigenvalues are non-negative. From these eigenvectors, a set of orthonormal eigenvectors can be obtained with the eigenvalues already determined. Some of these eigenvalues could be equal, and others could be different, but all are non-negative.
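The properties just described can be checked numerically. The following is a minimal sketch on randomly generated data (not the YRBSS data); the names `S`, `lam` and `V` are illustrative, and the covariance matrix is assumed to be scaled by 1/(n - 1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 individuals, 4 attributes, two of them correlated.
A = rng.normal(size=(200, 4))
A[:, 1] += 0.8 * A[:, 0]
X = (A - A.mean(axis=0)) / A.std(axis=0)   # standardized columns

n = X.shape[0]
S = X.T @ X / (n - 1)                      # covariance matrix of the centered data

# eigh is intended for symmetric matrices: it returns real eigenvalues
# (in ascending order) and orthonormal eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(S)

# Reorder so that the largest eigenvalue (PC1) comes first.
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
V = eigvecs[:, order]

print(lam)                                  # non-negative, decreasing
print(np.allclose(V.T @ V, np.eye(4)))      # eigenvectors are orthonormal
print(np.allclose(S @ V, V * lam))          # S v_j = lambda_j v_j for every j
```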
Figure 1 illustrates both sets of direction vectors in a Euclidean space, where the intention is to represent that one set of orthonormal direction vectors is rotated with respect to another. For ease of representation,
is assumed.
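The eigenvectors $v_1, v_2, \dots, v_m$ can be placed as columns of a matrix called $V$, as follows:
$$V = \begin{bmatrix} v_1 & v_2 & \cdots & v_m \end{bmatrix} \quad (2)$$
It can be said that $V$ is the main result of applying PCA. It is observed that $V^{T}V = VV^{T} = I$. $V$ is said to be an orthonormal matrix.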
The elements of each point $x_i$, with respect to the “alternative” axes, can be obtained with the projections of the points onto these axes, that is, using the inner products (dot products) as Equation (3) shows (with the symbol ≜ indicating definition).
$$z_{ij} \triangleq x_i \cdot v_j, \quad i = 1, \dots, n, \quad j = 1, \dots, m \quad (3)$$
Such that the transformation can be obtained as:
$$Z = XV \quad (4)$$
The transformation $Z$ views the points referred to the “alternative” axes. Each column $z_j$, $j = 1, \dots, m$, of $Z$ contains the values of the projections of the points onto PC$j$.
It can be shown that the mean and variance of the values in each column of projections, $z_j$, are equal to $0$ and $\lambda_j$ (the eigenvalue associated with PC$j$), respectively. The relative variances of these distances are calculated as $\lambda_j / \sum_{k=1}^{m} \lambda_k$. These relative variances are usually given as a percentage value. Relative-variance values are associated with what is called the amount of “information” [7]; thus, PC1 is the “alternative” axis that has the most “information” about the points, followed by PC2, and so on.
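The statements about the projected columns can be verified with a short sketch. It assumes centered data and a covariance matrix scaled by 1/(n - 1); the data are randomly generated and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 300 individuals, 3 attributes; centering is enough here.
A = rng.normal(size=(300, 3))
A[:, 2] += A[:, 0]
X = A - A.mean(axis=0)
n = X.shape[0]

# Eigen-decomposition of the covariance matrix, largest eigenvalue first.
lam, V = np.linalg.eigh(X.T @ X / (n - 1))
lam, V = lam[::-1], V[:, ::-1]

# Rotation: coordinates of every point with respect to the "alternative" axes.
Z = X @ V

col_means = Z.mean(axis=0)            # each column has mean 0
col_vars = Z.var(axis=0, ddof=1)      # each column has variance lambda_j
relative = 100 * lam / lam.sum()      # relative variance ("information") in percent

print(col_means, col_vars, relative)
```

The percentages in `relative` are the heights of the bars that a scree plot would display.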
This rotation transformation technique is what PCA does. It is important to note that the term “principal component” can have different meanings in the literature; for details, please consult [2] (p. 3).
2.2. Method: The Loss of “Information” from Observing Only a Projection
Instead of projecting the points onto the $m$ “alternative” axes, only $p$ of these axes can be selected for projection (from PC1 to PC$p$), those “alternative” axes that retain a desired minimum amount of “information,” forming with them a projection subspace of dimension $p$. The greater simplicity achieved in the projection comes at the cost of some “information.” This means using only the projections included in the matrix $Z_p$, formed by the first $p$ columns of the transformation $Z = XV$.
There is a theory of PCA that shows how these projections, viewed again from the reference of the “original” axes, would form the matrix here called $X_{rd}$ (where rd stands for “reduction”), of rank at most $p$; in other words, $X_{rd}$ is the projection of $X$ onto the first $p$ “alternative” axes, PC1, …, PC$p$. The “information” loss when considering $X_{rd}$ instead of $X$ is: $\sum_{j=p+1}^{m} \lambda_j / \sum_{j=1}^{m} \lambda_j$, where $\lambda_j$ are the eigenvalues of the covariance matrix $\frac{1}{n-1}X^{T}X$ in decreasing order.
The “misfit” between $X$ and $X_{rd}$, which is the difference between the sum of the squares of all the elements of the matrix $X$ and the sum of the squares of the matrix $X_{rd}$, can be calculated to be $(n-1)\sum_{j=p+1}^{m} \lambda_j$. So the smaller the eigenvalues $\lambda_{p+1}, \dots, \lambda_m$, the smaller the misfit.
It can be shown that $X_{rd} = Z_p V_p^{T}$, where $V_p$ comes from taking the first $p$ columns of $V$ ($V_p$ contains only the first $p$ “alternative” direction vectors of $V$). Therefore, considering Equation (4), we can write $X_{rd} = X V_p V_p^{T}$.
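A minimal numerical sketch of the reduction follows. It assumes centered data and a covariance matrix scaled by 1/(n - 1), so the misfit carries a factor (n - 1); the data are random and the names `V_p`, `Z_p` and `X_rd` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 150 individuals, 5 attributes with some correlation.
A = rng.normal(size=(150, 5))
A[:, 1] += A[:, 0]
A[:, 3] -= 0.5 * A[:, 2]
X = A - A.mean(axis=0)
n, m = X.shape

# Eigen-decomposition of the covariance matrix, largest eigenvalue first.
lam, V = np.linalg.eigh(X.T @ X / (n - 1))
lam, V = lam[::-1], V[:, ::-1]

p = 2
V_p = V[:, :p]        # first p "alternative" direction vectors
Z_p = X @ V_p         # projections onto PC1..PCp
X_rd = Z_p @ V_p.T    # the same projections seen from the "original" axes

# Misfit: difference between the sums of squares of X and of X_rd.
misfit = np.sum(X**2) - np.sum(X_rd**2)
lost_share = lam[p:].sum() / lam.sum()   # relative "information" loss

print(misfit, (n - 1) * lam[p:].sum(), lost_share)
```

Because the projection is orthogonal, the same misfit is obtained as the sum of squared elements of `X - X_rd`.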
In the literature on PCA, a quantity of 70% [2] of “information” retained, as a subjective cut-off point, is recommended for the first $p$ axes used to form the projection subspace; however, the desire for a simple graphical representation often leads to the use of $p = 2$, which means that the projection subspace is the plane formed by the “alternative” axes PC1 and PC2, a plane that is here called PC1-PC2. These two axes are less likely to retain 70% of the “information.” If this were to occur, the graphical representation on PC1-PC2 would lack “quality.”
It is important to say that the recommendation of using PC1 and PC2 (the “alternative” axes with the most “information”) is not always the best choice (it depends on the data and the needs of the analysis), as can be observed in [11].
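The 70% cut-off can be applied mechanically to the relative variances. The sketch below uses made-up percentages (not the results of this work) to find the smallest number of axes whose cumulative “information” reaches the threshold.

```python
import numpy as np

# Hypothetical relative variances (percent), sorted from PC1 downwards.
relative = np.array([27.0, 16.0, 12.0, 10.0, 9.0, 8.0, 7.0, 5.0, 4.0, 2.0])

cumulative = np.cumsum(relative)
threshold = 70.0

# Index of the first axis at which the cumulative share reaches the threshold;
# +1 converts the zero-based index into a number of axes.
p = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative[:p])   # cumulative share retained by PC1..PCp
print(p)                # number of axes needed for the 70% cut-off
```

In this made-up case PC1 and PC2 together retain 43%, so a two-dimensional plot falls well short of the recommended cut-off.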
2.3. Method: The Biplot and Zoning Biplot
Later in this same section, an example is presented, where $n = 2461$, $m = 10$ and $p = 2$, with the results shown in Section 3, Results. The PCA results of the example are shown in two graphs (of different types). The first is a bar chart, called a Screeplot, which represents the “information” associated with each “alternative” coordinate axis.
The second is a Biplot, which is the graphical representation of the $p$-dimensional projection subspace, with $p = 2$. This graph includes the following two sets of projections onto the subspace formed by the first $p$ “alternative” axes: (1) The projections of the “original” direction vectors, $e_1, \dots, e_m$, remembering that these correspond to the “original” axes which, in turn, represent the attributes; (2) The projections of the points $x_i$.
In this type of graph, the projections of the “original” direction vectors are used more for the interpretation of the results than the projections of the points, but it is important to note that these vectors are a consequence of the location of the points (so the points were already taken into account). The angles between the projections of the “original” direction vectors indicate the relationship between the associated attributes. The details of this are given later when explaining the results of the example.
In the example results, PC1 and PC2 did not retain more than 45% of the total “information”, so, as mentioned, the graphical representation in PC1-PC2 would lack “quality”.
This means that the standard Biplot is not used, but rather the Zoning Biplot (an enhanced version of the Biplot). This type of Biplot can expand the possibilities for leveraging PCA. The explanation of the Zoning Biplot is left until Section 3, Results, which is based on the practical result of the example presented here.
2.4. Materials: About the Data Used for the Example Included in This Work
Data collection from YRBS applied in 2023 was carried out via questionnaires administered to a sample of randomly selected schools, chosen with a probability proportional to the number of students enrolled in grades 9 through 12. Within the selected classes, all students are eligible to participate. The questionnaire was self-administered to students, either on a computer or paper, and consisted of multiple-choice questions oriented toward ranges of values, for example: How old are you? The answer options are age ranges.
The national data is not an aggregate of the state datasets; rather, the national version uses an independent sample.
One hundred and forty-one formulas were provided for applying logical corrections to each student’s answers. For example, when there is a contradiction in some answers, these are marked as missing data.
In case of contradiction between any two answers of a student, all of that student’s answers are invalidated. The corrections are referred to as Data Edits. There are also other formulas for applying logical corrections to weight and height, but these are not considered in the present work. In Section 4 (Discussion), we have included explanations to support our handling of missing data.
Response files can be obtained from the YRBSS website and converted to various formats. For 2023, there are 107 questions (q1 to q107), and 20,103 student records.
2.5. Data Processing
After applying the 141 Data Edit formulas to the 20,103 records, 3975 records were discarded. Subsequently, all records with blank responses were discarded, leaving “only” 4972 records. Then, of the 107 questions (q1 to q107), only those whose answers correspond to the risk themes selected in this study were kept (as explained below). Finally, this study only examines the records of men. Men were selected because the study includes the risks we group as Gun Violence, as explained below. The rationale for selecting men is given in Section 4, Discussion. This leaves a total of 2461 records.
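The record-filtering steps can be sketched with Pandas. The mini-dataset, the `sex` column and its letter codes are hypothetical and serve only to illustrate discarding incomplete records and then keeping one group; they are not the actual YRBSS coding.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset; column names and codes are illustrative only.
df = pd.DataFrame({
    "sex": ["M", "F", "M", "M", "F"],
    "q12": [1, 2, np.nan, 3, 1],
    "q13": [2, 2, 1, np.nan, 1],
    "q84": [3, 1, 2, 5, 4],
})

# Discard every record that has at least one blank response.
complete = df.dropna()

# Keep only the records of men (assumed code "M" in this sketch).
men = complete[complete["sex"] == "M"]

print(len(df), len(complete), len(men))
```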
All questions are multiple choice with answers given in whole numbers, implying risk ranges. An example of a question and response range is: q84. “During the past 30 days, how often was your mental health not good? (Poor mental health includes stress, anxiety, and depression.)”. The choice answers are: 1. Never; 2. Rarely; 3. Sometimes; 4. Most of the time; 5. Always.
All answers should indicate low risks at low values and high risks at high values, that is, in ascending order according to “risk level”. The original range values from the questionnaires were retained as risk indicators, except when these values were not arranged in ascending order according to risk level (but rather descending). In that case, the response numbering was reversed.
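Renumbering a descending scale so that it ascends with risk can be done with one subtraction. This is a sketch under the assumption that the scale is a run of consecutive integers; the answers shown are made up.

```python
import pandas as pd

# Hypothetical answers to a question whose options run from 1 to 5,
# originally ordered from high risk (1) to low risk (5).
answers = pd.Series([1, 2, 5, 4, 3], name="q_example")

low, high = 1, 5
# Mirror the scale: 1 <-> 5, 2 <-> 4, 3 stays 3.
reversed_answers = (low + high) - answers

print(reversed_answers.tolist())
```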
The themes selected for the present work, with the questions in parentheses are: Self-security (q8); Gun violence (q12, q13); Fighting (q16, q17); Sex and dating violence (q19, q20, q21, q22); Race, school and Web violence (q23, q24, q25); Tobacco use (q31, q32, q33, q34, q35, q36, q38, q39); Alcohol use (q41, q42, q43, q44); Exercise (q76, q77, q78); Mental health (q84), Home violence (q89, q90, q91).
A total score is calculated for each theme, resulting in 10 themes that, in this study, correspond to attributes. These attributes are shown in Table 1, along with their acronyms.
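A theme score can be sketched as follows, assuming that the “total score” is the plain sum of the answers to the questions of a theme (the text does not spell out the formula). Only three of the ten themes are shown and the answers are made up.

```python
import pandas as pd

# Subset of the theme-to-question mapping listed above.
themes = {
    "Gun violence": ["q12", "q13"],
    "Fighting": ["q16", "q17"],
    "Mental health": ["q84"],
}

# Hypothetical answers of two students.
answers = pd.DataFrame({
    "q12": [1, 3], "q13": [1, 2],
    "q16": [2, 4], "q17": [1, 1],
    "q84": [3, 5],
})

# One column per theme: the row-wise sum of that theme's questions.
scores = pd.DataFrame(
    {theme: answers[cols].sum(axis=1) for theme, cols in themes.items()}
)

print(scores)
```

Each column of `scores` then plays the role of one attribute (one column of the data matrix) in the PCA.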
2.6. Limitations
The responses published by the CDC are weighted to ensure the data are representative of the population from which they were drawn, based on the student’s sex, grade level, and race/ethnicity. In the present work, weighting is not applied; instead, the responses are used as they appear in the response data.
It should be noted that YRBSS also formulates other questions: qN1 to qN87, whose responses are mathematically dependent on combinations of the responses to questions q1 to q107, allowing for various analyses. These questions are not considered in the present work.
2.7. Tools Used
The Python language (Python v3.13.9 distributed through Anaconda Navigator v2.7.0) is used for the computational implementation of the operations, specifically the NumPy (v2.3.5) [12] and Pandas (v2.3.3) [13] packages, and the Scikit-learn (Sklearn v1.7.2) library [14]. From Scikit-learn, the following packages are used: (a) Preprocessing, using the StandardScaler class, and (b) Decomposition, using the PCA class, which contains unsupervised machine learning algorithms for dimensionality reduction. The CSV module and other packages such as Seaborn (v0.13.2) [15] and Matplotlib (v3.10.6) [16] are also used for graphical representation. Functions developed by the authors are used to create the Zoning Biplots (explained below).
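How the two Scikit-learn classes named above fit together can be sketched as follows. This is an illustration on random integer scores, not the authors' actual script; the array `scores` is made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical theme scores: 100 individuals, 4 attributes, values 1..5.
scores = rng.integers(1, 6, size=(100, 4)).astype(float)

# Standardize every attribute (mean 0, variance 1), then fit PCA.
X = StandardScaler().fit_transform(scores)
pca = PCA(n_components=2)
Z = pca.fit_transform(X)            # projections of the points onto PC1 and PC2

# Share of the total variance retained by each of the two axes.
print(pca.explained_variance_ratio_)

# components_ stores one direction vector per row; transposing gives one
# column per "alternative" axis.
print(pca.components_.T)
```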
3. Results
Figure 2 shows the resulting Screeplot, which indicates the relative variance (in percentage), equivalent to the amount of “information”, of each of the “alternative” coordinate axes. It can be observed that PC1 and PC2 together do not even reach 45% of the aggregate “information”.
Given the small amount of “information” retained by PC1 and PC2, the conventional Biplot is not used; instead, the Zoning Biplot is used.
Figure 3 shows the resulting Zoning Biplot, which includes the following two sets of projections on plane PC1-PC2: (1) The projections of the “original” direction vectors, $e_j$ ($j = 1, \dots, m$); (2) The projections of the points, $x_i$, where $i = 1, \dots, n$, drawn with shapes according to the values of the main attribute (HoV).
Since the “original” direction vectors, $e_1, \dots, e_m$, correspond to the “original” coordinate axes, which in turn represent the attributes that have already been assigned acronyms (Table 1), the projections of these “original” direction vectors are identified by the acronyms of the attributes.
The scales shown on the bottom and left sides of the graph correspond to the projections of the points onto PC1 and PC2, respectively. Similarly, the scales on the top and right sides correspond to the projections of the “original” direction vectors (marked with the acronyms of the attributes) onto PC1 and PC2, respectively.
The Zoning Biplot is the same as the conventional Biplot, with the difference that the projections of the points in the Zoning Biplot are shown with shapes according to a classification by the value of the main attribute, in this case HoV, so that the shapes correspond to four ranges of values, from smallest to biggest, with corresponding shapes: ×, ▲, +, ▼. A fairly clear zoning pattern is expected in the location of the point projections based on their shape. If this occurs, the main attribute is considered to be well represented on the PC1-PC2 plane.
With this, the authors argue that from the Zoning Biplot, it is valid to obtain the relationship between this attribute and the others. The graphical explanation of the conception and use of the Zoning Biplot is given in the Discussion Section. In the example, the HoV values used for zoning are the standardized values (mean 0 and variance 1), but unstandardized values can be used. Visually, there is a clear zoning of the central parts of the zones, but there are overlaps, for example, the HoV zone is practically being overlapped.
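The assignment of a shape to each point can be sketched as below. It assumes that the four ranges are the quartiles of the main attribute (the Discussion states that quartiles are used for HoV); the values are random, and the marker codes are the Matplotlib equivalents of ×, ▲, +, ▼.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical standardized values of the main attribute for 20 points.
main_attr = rng.normal(size=20)

# Quartile boundaries (25%, 50%, 75%) split the values into four ranges.
edges = np.quantile(main_attr, [0.25, 0.5, 0.75])

# zone is 0 for the lowest range and 3 for the highest range.
zone = np.digitize(main_attr, edges)

# Matplotlib marker codes corresponding to the shapes x, up-triangle, +, down-triangle.
markers = np.array(["x", "^", "+", "v"])
point_marker = markers[zone]

print(list(zip(np.round(main_attr, 2), point_marker)))
```

Each projected point would then be drawn on the PC1-PC2 plane with its own marker, for instance with one scatter call per zone.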
In a conventional Biplot and in the Zoning Biplot, the relationship between pairs of attributes can be visually established from the angle between the projections of the direction vectors associated with those attributes. If the projections of two “original” direction vectors, corresponding to two attributes, are highly collinear (the angle between them is small), it can be concluded that there is a reinforcing relationship between the two associated attributes. If the projections are nearly perpendicular, it can be concluded that there is no relationship.
The angle values also contain subjectivity regarding the “amount” of the “degree” of relationship between attributes. These angles can be obtained numerically from the matrix $V$ of eigenvectors (the main result of applying PCA). Visual interpretation, although subjective, is usually the one most often used.
As a side note, sometimes, depending on the data, PC1 and PC2 are not the “alternative” axes where a clear zoning for a specific attribute is observed, but a clear zoning can be seen on other “alternative” axes, even though they retain less “information” [11]. In this article, PC1 and PC2 are used, as described in the example.
When a conventional Biplot is used, relations between all pairs of attributes are drawn. In Zoning Biplots (and only if a zoning is observed), the authors suggest drawing conclusions only about the relationship between the main attribute (the one whose values were used for zoning) and the other attributes, but not between the other attributes (although such relationships can be obtained).
In the example, from the observation of the angles, it is clear that the main attribute, HoV, is closely related to SDV, meaning that young men who experience violence at home also tend to experience sex and dating violence. HoV is also related (not as closely) to RSV and MHe. HoV is less related to the other attributes. No relationships between the other attributes are formulated.
In addition to observing the angles of the projections of the “original” direction vectors, the direction of the cluster centers appears to coincide with the direction of the HoV’s “original” direction vector (which is desirable). The increasing senses also coincide.
Regarding the matrix
(
) obtained in the example,
Table 2 presents the values of
. It can be seen that the sum of the squares of any row or column is equal to 1 because the table is made up of direction vectors, which are unit vectors. The projection of any “original” direction vector (corresponding to an attribute) onto the PC1−PC2 plane is formed with the values relative to the
and
columns; for example, the projection of the HoV direction vector onto this plane is formed with the values
and
. The absolute value of the projection of HoV is
, as can be seen in
Figure 3.
As an example of how to use the matrix, the angle between the projections of the “original” direction vectors (HoV) and (SDV) can be calculated as follows. The projection of on plane PC1-PC2 is . These values are taken from the last row of , which corresponds to HoV, and from the first two to the columns of , those that correspond to and . The angle between this projection vector and PC1 is . Similarly, for the projection of on plane PC1-PC2, this is . The angle between this projection vector and PC1 is . Then, the angle between these two projection vectors is .
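The angle calculation described above can be sketched as follows. The two projection vectors are made-up numbers, not the values of Table 2; each one stands for the PC1 and PC2 components of one row of the eigenvector matrix.

```python
import numpy as np

# Hypothetical PC1 and PC2 components of two "original" direction vectors.
proj_a = np.array([0.30, 0.40])
proj_b = np.array([0.10, 0.45])

# Length (absolute value) of the first projection on the PC1-PC2 plane.
length_a = np.linalg.norm(proj_a)

# Angle of each projection with respect to PC1, in degrees.
angle_a = np.degrees(np.arctan2(proj_a[1], proj_a[0]))
angle_b = np.degrees(np.arctan2(proj_b[1], proj_b[0]))

# Angle between the two projections; if the difference exceeds 180 degrees,
# the smaller angle on the other side of the circle is taken.
angle_between = abs(angle_a - angle_b)
angle_between = min(angle_between, 360.0 - angle_between)

print(length_a, angle_a, angle_b, angle_between)
```

A small `angle_between` corresponds to the “reinforcing relationship” described in the text, and a value near 90 degrees to no relationship.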
4. Discussion
4.1. Graphical Explanation of the Zoning Biplot Concept
The following is a graphical heuristic that explains why the Zoning Biplot can be more illustrative than a Biplot.
Figure 4 represents a scenario of a plane formed by two “original” axes associated with Attributes
and
, which have direction vectors
and
. On the plane, there are two points:
and
.
In the same plane lies the “alternative” axis PC1, with direction vector
, which is most collinear with the axis with direction vector
. It can be observed that:
Regarding the projections of points onto PC1, it is observed that , so the same order is maintained compared to that of the projections of points onto the axis with direction vector (Attribute ).
If the projection of points on PC1 is drawn with shapes corresponding to the magnitude of the attribute , a correct ordering is more likely to be observed, compared to if the points on PC1 are drawn with shapes corresponding to the attribute . This is not always the case, as seen when considering the point , since exceeds to a greater extent. All this means that the projections in PC1 better represent the attribute .
This is the idea that leads to considering the Zoning Biplot as an aid, where a well-defined zoning indicates that an attribute is well represented.
4.2. About the Handling of Missing Data
Many algorithms, such as PCA, cannot handle missing values; Data Edits can generate them. We removed an individual’s entire response if they had a blank response for any attribute, without imputing any values. The reason for this is as follows.
In the Methodology of the Youth Risk Behavior Surveillance System [17], it is stated, verbatim: “Responses that conflict in logical terms are both set to missing, and data are not imputed. For example, if a student responds to one question that he or she has never smoked, but then responds to a subsequent question that he or she has smoked two cigarettes during the previous 30 days, the processing system sets both responses to missing. Neither response is assumed to be the correct response.” They explain that they did this in terms of what they call quality-control checks, where questionnaires with fewer than 20 valid responses remaining after editing are deleted from the dataset. They explain that the number of questionnaires eliminated represents a small percentage of the total.
We removed 3975 records out of a total of 20,103, a considerable percentage. However, in any case, we removed an individual’s entire response if it was blank for any attribute, without imputing data, because:
(1) The attribute values are related, so imputation could introduce a significant error into the dataset.
(2) Blank responses due to data editing can be considered missing completely at random (MCAR), that is, there is no relationship between the missing data and any other measured or unmeasured variable. In Ref. [18] it is stated verbatim: “By far the most common approach to the missing data is to simply omit those cases with the missing data and analyze the remaining data. This approach is known as the complete case (or available case) analysis or listwise deletion,” and it adds that if the assumption of MCAR is satisfied, a listwise deletion is known to produce unbiased estimates and conservative results.
Regarding Data Edits: From a complete file with all attributes, q1 to q107, Data Edits removed 3975 rows of data from a total of 20,103 rows (leaving 16,128 rows of data). From a data file with only the columns to be worked on at the end, q02, q04, q08, q12, q13, q16, q17, q19, q20, q21, q22, q23, q24, q25, q31, q32, q33, q34, q35, q36, q38, q39, q41, q42, q43, q44, q76, q77, q78, q84, q89, q90, q91, Data Edits again removed 3975 rows of data. In both cases, the only formula that eliminated data was #15, which is (Q22 = A AND Q21 = B, C, D, E, F), meaning that Q22 cannot be A while Q21 is B, C, D, E, or F. The questions are:
Q22. During the past 12 months, how many times did someone you were dating or going out with physically hurt you on purpose?
Q21. During the past 12 months, how many times did someone you were dating or going out with force you to do sexual things that you did not want to do? (Count such things as kissing, touching, or being physically forced to have sexual intercourse.) B. 0 times; C. 1 time; D. 2 or 3 times; E. 4 or 5 times; F. 6 or more times;
It appears there was confusion among the students because there were a lot of mixed responses: Q22: A (I did not date or go out with anyone during the past 12 months); and Q21: B (0 times), instead of answering A: (I did not date or go out with anyone during the past 12 months). Therefore, the complete response is removed.
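Data Edit #15 can be sketched in Pandas as follows. The four answer pairs are hypothetical, and the letter coding follows the answer options quoted above; both conflicting responses are set to missing, and the whole record is then dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical answers, coded with the letters of the questionnaire options.
df = pd.DataFrame({
    "q21": ["A", "B", "C", "B"],
    "q22": ["A", "A", "B", "B"],
})

# Rule #15: Q22 = A (did not date) conflicts with Q21 in B..F.
conflict = (df["q22"] == "A") & df["q21"].isin(["B", "C", "D", "E", "F"])

# Both conflicting responses are set to missing; neither is assumed correct.
df.loc[conflict, ["q21", "q22"]] = np.nan

# Removing every record with a missing value drops the conflicting student.
kept = df.dropna()

print(conflict.tolist(), len(kept))
```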
4.3. About Selecting Men from the Dataset
We selected only men for this study because it includes the risks we classify as Gun Violence. Men’s behavior is very different from women’s. In the United States, the vast majority of gun violence perpetrated by teenagers is committed by males. Ref. [19] states verbatim: “Men commit the vast majority of gun violence in the U.S. According to FBI data from 2017, men were responsible for 88.1 percent of all homicides, and firearms were used in 72.6 percent of homicides”.
Ref. [20] states verbatim: “Gun violence, and gun culture more broadly, are gendered phenomena in America. Empirically, the relationship between gun violence and gender is clear: men are more likely to own, use, kill with, and die by the gun. Gun ownership is disproportionately male: 62% of gun owners in the United States are men. Only 22% of women report that they own a gun compared to approximately 40% of men, which means that nearly twice as many men than women own guns”.
4.4. Regarding the Use of Quantiles with Our Dataset
We have attributes whose values come from a small set of integers, for example, between 1 and 5. For an attribute like this, a single value, such as 2, is often repeated across much of the sample. Therefore, when we divide this set into quartiles, quintiles, or sextiles, the value 2 fills one or more quantiles entirely, and the quantile boundaries coincide. In that situation, dividing into quantiles does not help with zoning. Due to the values of some of our attributes, it is not possible to divide all of them effectively into quantiles, but it is possible for the following attributes: Exc into quartiles and quintiles; MHe into quintiles; and HoV into quartiles and quintiles. Here we use HoV with quartiles.
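The problem with repeated values can be seen in a small sketch. The ten answers are made up; they only illustrate how a dominant value makes the quartile boundaries coincide.

```python
import numpy as np

# Hypothetical answers on a 1-to-5 scale in which the value 2 dominates.
values = np.array([1, 2, 2, 2, 2, 2, 2, 2, 3, 5])

# Quartile boundaries at 25%, 50% and 75%.
quartile_edges = np.quantile(values, [0.25, 0.5, 0.75])

# Number of distinct boundaries; four usable zones would need three.
distinct_edges = np.unique(quartile_edges)

print(quartile_edges, len(distinct_edges))
```

With a single distinct boundary, the points cannot be split into four meaningful zones, which is why only some attributes are suitable for zoning.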
5. Conclusions
PCA facilitates the visualization of multi-attribute data, but it is important that the projection plane used is representative of the attributes. The Zoning Biplot can provide better certainty regarding the validity of the PCA results. As there is a loss of “information” in the projections obtained from PCA, it is important to be cautious when drawing conclusions.
Contrary to what is commonly believed, attribute selection and data preparation can be as intensive as data analysis itself.
The analysis of data related to education can greatly benefit from data analysis methods, such as PCA.
Regarding PCA:
A major drawback of PCA is its interpretability. Methods that help with this drawback include: (1) Utilizing rotations like Varimax for better component interpretation; (2) Applying Sparse PCA for clearer variable selection, and (3) Using Non-linear PCA.
PCA is a linear method. If the relationship between attributes is not linear, PCA may fail to find the correct relationships. For such cases, nonlinear methods exist, such as auto-associative neural networks (autoencoders) and the alternating least squares method.
Also, there are methods that do not compete with PCA, such as Factor analysis, used to describe variability among observed, correlated attributes, in terms of a smaller number of unobserved theoretical variables called factors.
There are other dimensionality reduction methods, such as t-Distributed Stochastic Neighbor Embedding (t-SNE), for data with hundreds or thousands of attributes. t-SNE is not deterministic and is computationally expensive.