Article

PCA Applied to YRBSS 2023 Data to Help Assess Health Risk Behaviors

by Juana Ambrosio-Lucas, Héctor Jiménez-Salazar, Christian Sánchez-Sánchez and Alfredo Piero Mateos-Papis *
Department of Information Technologies, Division of Communication Sciences and Design, Unidad Cuajimalpa, Universidad Autónoma Metropolitana, Mexico City 05348, Mexico
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(11), 5507; https://doi.org/10.3390/app16115507
Submission received: 31 March 2026 / Revised: 8 May 2026 / Accepted: 11 May 2026 / Published: 1 June 2026
(This article belongs to the Special Issue Artificial Intelligence in Education: Latest Advances and Prospects)

Abstract

Automated data exploration is very useful for evaluating key aspects of populations such as young adults (which here refers to the youth population in the United States represented by students in grades 9 through 12). This article shows how Principal Component Analysis (PCA) can be used for this exploration. PCA is applicable to data analysis situations with data from $n$ individuals and $m$ attributes (generally $n \gg m$). For analytical purposes, the data can be visualized as $n$ points in a Euclidean space with Cartesian coordinates, with $m$ perpendicular coordinate axes, where each axis corresponds to an attribute. When $m$ is large, the points become difficult to visualize, so PCA is useful, as it is a dimensionality reduction method that facilitates the visualization of the points. The objective of this article is to identify relationships between attributes, where there is a primary attribute of interest. The present work describes some of the main theoretical aspects of PCA and then uses PCA to analyze data, as a practical example. The data come from the publicly available results of a 2023 survey administered to a nationally representative sample of students in the United States, to assess health risk behaviors among young adults (students in grades 9 through 12), which was conducted by the Youth Risk Behavior Surveillance System (YRBSS), managed by the Centers for Disease Control and Prevention (CDC). The results of this work reveal graphically the relationships between specific data attributes. The reliability of the results is then discussed, considering: (1) recommendations taken from PCA literature, and (2) the use of a graphical tool called a Zoning Biplot, an improved form of displaying PCA results.
This work is relevant because it uses the Zoning Biplot, proposed by the authors, which shows more detail in the results compared to a conventional Biplot; the authors argue that this detail allows for valid results across a larger number of datasets, such as the dataset in the example presented. The authors present a graphical development to support the concept and advantage of a Zoning Biplot.

1. Introduction

In recent years, data analysis methods that require significant computing resources have become viable and represent an important alternative for work. This is the case with Principal Component Analysis (PCA), which can be used to analyze data with a number of attributes from tens to hundreds. Much has been written about PCA, which is a technique that dates back to 1900 [1], and has been developed intensely recently due to the power of computing; some of the previous works present theoretical and practical aspects [2,3,4,5,6,7], and others are more practically oriented, including Python code [8].
Unlike other data analysis methods [9], PCA is a method that involves a loss of what is called “information” about the analyzed data, as a “price to pay” for observing the data in a simplified way, through descriptive data matrices and special diagrams such as bar charts and scatter plots, in order to facilitate drawing conclusions.
Principal component analysis (PCA) is applied to real-valued data. This data consists of the attribute values of several individuals, all with the same attributes but with values specific to each individual. The data is stored in a matrix of real numbers, denoted $Y_{n \times m}$, which in this work has $n$ rows and $m$ columns. Each row contains the attribute values of one individual, and each column contains the values of a specific attribute common to all individuals. Therefore, the data corresponds to $n$ individuals and $m$ attributes.
$Y_{n \times m}$ is represented in Equation (1), where each row $y^{(i)}_{1 \times m}$, $i = 1, \ldots, n$, represents the attribute data of individual $i$.
$$Y_{n \times m} = \begin{bmatrix} y^{(1)}_{1 \times m} \\ y^{(2)}_{1 \times m} \\ \vdots \\ y^{(n)}_{1 \times m} \end{bmatrix} \qquad (1)$$
To graphically visualize the data in Equation (1), the data in each row are represented as a point (a vector) $y_i$, $i = 1, \ldots, n$, drawn in a Euclidean space with Cartesian coordinates, which has $m$ perpendicular coordinate axes, each axis corresponding to an attribute.
When m is large, it becomes difficult to observe and draw conclusions about the points, specifically about the relationships between the attributes. In this case, PCA can be helpful because PCA is a dimensionality reduction method that facilitates the visualization of the points by finding the most representative projection of the points onto a lower-dimensional subspace, generally having two or three dimensions.
Regarding the data used for the PCA application example provided in the present work, the data comes from the results of the Youth Risk Behavior Surveillance System (YRBSS) [10], which is a program for obtaining data on health risk behaviors among young adults in the United States. According to the YRBSS, what they consider young adults may be represented by students in grades 9 through 12, clarifying that these school years encompass students aged 12 to 18, meaning that in their final year of this stage, they would be eligible to apply for college admission. One limitation is that those attending school do not fully represent all individuals in this age group; in 2009, 4% of 16- to 17-year-olds were not enrolled in school.
The YRBSS tracks various categories of risk behaviors among young adults, including tobacco, alcohol, and other drug use, physical inactivity, and other risks. It is important to note that the YRBSS does not focus on the effects (outcomes) of these behaviors, such as school absenteeism or poor academic performance; nor does it consider underlying factors, such as attitudes, beliefs, or skills.
The YRBSS is conducted by the CDC (Centers for Disease Control and Prevention) and other educational agencies in the United States, to generate “information” to help evaluate the effect of broad national, state, territorial, tribal, and local policies and programs; and is not intended to evaluate the effectiveness of specific interventions.
The present work focuses on analyzing data from the Youth Risk Behavior Survey (YRBS), administered in 2023 to a nationally representative sample of students in grades 9 through 12.
The YRBSS also examines, in less detail, aspects of obesity, asthma, anxiety, eating, and exercise.
The contents of the present work are the following: Section 2, Methods and Materials, describes the theoretical and operational aspects of PCA and presents an application example. Section 3, Results, presents the results of the example. Section 4, Discussion, includes a discussion of the results and more explanation about the methods used. Section 5, Conclusions, follows. A brief Acknowledgments Section follows, along with Appendix A, which provides further explanation about the application example data. Finally, the References Section is included.

2. Methods and Materials

2.1. The PCA Method

In the initial situation (before knowing that we need PCA), we have data from $n$ individuals with $m$ attributes, stored in $Y_{n \times m}$ (Equation (1)). To analyze this data, we represent it as $n$ points located in an $m$-dimensional Euclidean space (with $m$ perpendicular coordinate axes referred to as “original” axes), where each dimension corresponds to one of the $m$ attributes.
When $m$ is large, visualizing the points becomes difficult. To facilitate observing the points, PCA finds an “alternative” set of $m$ orthogonal coordinate axes, referred to as “alternative” axes and individually named PC1, PC2, …, PC$m$.
PCA aims to minimize the average “distance” (the error) between the location of the points and the location of their projections onto each one of the “alternative” axes (it is said that the residuals are minimized [5]).
No attribute, by its very nature, should have average values bigger than any other. This allows for an unbiased comparison of the average proximity of points to the various “alternative” axes. To achieve this, the attribute values must be normalized (the values in each column of Y n × m ). Normalization does not affect the objective of discovering the relationship between the attributes.
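The normalization step can be sketched as follows. This is a minimal illustration, assuming that "normalized" means column-wise standardization to mean 0 and variance 1 (consistent with the standardized values mentioned later in the Results); the function name `standardize_columns` and the toy matrix are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def standardize_columns(Y):
    """Center each column to mean 0 and scale it to unit variance."""
    Y = np.asarray(Y, dtype=float)
    mean = Y.mean(axis=0)
    std = Y.std(axis=0)  # population standard deviation (ddof=0)
    # Assumption: a constant column (std == 0) is centered but left unscaled
    std_safe = np.where(std == 0, 1.0, std)
    return (Y - mean) / std_safe

# Hypothetical toy data: 5 individuals, 3 attributes on very different scales
Y_raw = np.array([
    [1.0, 100.0, 0.1],
    [2.0, 300.0, 0.4],
    [3.0, 200.0, 0.2],
    [4.0, 500.0, 0.5],
    [5.0, 400.0, 0.3],
])
Y_std = standardize_columns(Y_raw)
```

After this step, every attribute contributes on the same scale, so no column dominates the search for the "alternative" axes merely because of its units.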
Let $q_1, \ldots, q_m$ be the direction vectors of the “alternative” axes (PC1, PC2, …, PC$m$); we call these vectors the “alternative” direction vectors. The minimization yields the direction and sense of $q_1, \ldots, q_m$ with regard to the direction vectors of the “original” coordinate axes, which we represent as $e_1, \ldots, e_m$; we call these vectors the “original” direction vectors. The minimization is done in steps: first, the minimization is performed with respect to $q_1$; then the minimization with respect to $q_2$ is performed, requiring $q_2$ to be perpendicular to the already determined $q_1$; and so on, to find all the direction vectors $q_j$, $j = 1, \ldots, m$, where the last direction vector, $q_m$, turns out to be the one with the largest residual (the greatest difference or “distance”).
It can be shown [5] that these minimizations are equivalent to maximizing the variance of the projections of the points $y_1, \ldots, y_n$ onto the “alternative” axes. This results in the variance of the projections of the points onto PC1 being the greatest; the variance of the projections of the points onto PC2 being the next greatest; and so on. The variability of the distances of these projections on an “alternative” axis defines the quality with which these projections describe the points.
The “alternative” axes end up appearing as a set of perpendicular axes rotated with respect to the “original” axes, both sets of axes sharing the same center. The change from the “original” axes to the “alternative” axes, which serve as a new reference for locating points, is called a “transformation”—in this case, a “rotation” transformation.
After performing the maximization (using Lagrange multipliers) [5], it turns out that the vectors $q_1, \ldots, q_m$ are the eigenvectors of the matrix $Y^{T}_{m \times n} Y_{n \times m}$ (which, for column-centered data, is proportional to the covariance matrix of $Y_{n \times m}$), such that $Y^{T}_{m \times n} Y_{n \times m} \, q_{j,\,m \times 1} = \lambda_j \, q_{j,\,m \times 1}$, $j = 1, \ldots, m$, where $q_{j,\,m \times 1}$ is the representation of $q_j$ as a column, and where the eigenvalues are such that $\lambda_1$ is the eigenvalue of $q_1$, which is the maximum eigenvalue for that matrix; $\lambda_2$ is the eigenvalue of $q_2$, which is the second largest eigenvalue, and so on.
One theoretical aspect that reinforces the existence of $m$ eigenvectors for $Y^{T}_{m \times n} Y_{n \times m}$ is that, since it is a real and symmetric matrix, it can be shown that it has $m$ linearly independent real eigenvectors. Given that $Y^{T}_{m \times n} Y_{n \times m}$ is positive semidefinite, its eigenvalues are non-negative. From these $m$ eigenvectors, a set of $m$ orthonormal eigenvectors can be obtained with the eigenvalues already determined. Some of these eigenvalues could be equal, and others could be different, but all are non-negative.
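The eigenvector properties described above can be checked numerically. The following is a minimal sketch using synthetic data (the random matrix and the seed are assumptions made only for illustration); it uses `numpy.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, so they are reordered here so that the first eigenvalue is the largest.

```python
import numpy as np

# Synthetic stand-in for Y (assumption): 200 individuals, 4 attributes
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 4))
Y[:, 1] += 0.8 * Y[:, 0]                     # induce a relationship between attributes
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)     # standardize the columns

S = Y.T @ Y                                  # real, symmetric, positive semidefinite
eigvals, eigvecs = np.linalg.eigh(S)         # eigenvalues returned in ascending order

# Reorder so that lambda_1 >= lambda_2 >= ... >= lambda_m
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
Q = eigvecs[:, order]                        # columns are q_1, ..., q_m

# Each column satisfies S q_j = lambda_j q_j
residual = S @ Q - Q * lam
```

Here `residual` is numerically zero, the eigenvalues in `lam` are non-negative, and `Q.T @ Q` is the identity, which matches the orthonormality stated in the text.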
Figure 1 illustrates both sets of direction vectors in a Euclidean space, where the intention is to represent that one set of orthonormal direction vectors is rotated with respect to another. For ease of representation, m = 3 is assumed.
The eigenvectors $q_1, \ldots, q_m$ can be placed as columns of a matrix called $Q_{m \times m}$, as follows:
$$Q_{m \times m} = \begin{bmatrix} q_{1,\,m \times 1} & q_{2,\,m \times 1} & \cdots & q_{m,\,m \times 1} \end{bmatrix} \qquad (2)$$
It can be said that $Q_{m \times m}$ is the main result of applying PCA. It is observed that $Q^{T}_{m \times m} Q_{m \times m} = I_m$; $Q_{m \times m}$ is said to be an orthonormal matrix.
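The elements of each point $y_i$, $i = 1, \ldots, n$, with respect to the “alternative” axes, can be obtained with the projections of the points onto these axes, that is, using the inner products (dot products) as Equation (3) shows (with the symbol $\triangleq$ indicating definition).
$$y_i \cdot q_j \triangleq t_j^{\,i} \qquad (3)$$
The transformation can then be obtained as:
$$Y_{n \times m} Q_{m \times m} = \begin{bmatrix} y^{(1)}_{1 \times m} \\ y^{(2)}_{1 \times m} \\ \vdots \\ y^{(n)}_{1 \times m} \end{bmatrix} \begin{bmatrix} q_{1,\,m \times 1} & q_{2,\,m \times 1} & \cdots & q_{m,\,m \times 1} \end{bmatrix} = T_{n \times m} = \begin{bmatrix} t_{1,\,n \times 1} & t_{2,\,n \times 1} & \cdots & t_{m,\,n \times 1} \end{bmatrix} = \begin{bmatrix} t_1^{\,1} & t_2^{\,1} & \cdots & t_m^{\,1} \\ t_1^{\,2} & t_2^{\,2} & \cdots & t_m^{\,2} \\ \vdots & \vdots & & \vdots \\ t_1^{\,n} & t_2^{\,n} & \cdots & t_m^{\,n} \end{bmatrix} \qquad (4)$$
The transformation views the points $y_1, \ldots, y_n$ referred to the “alternative” axes. Each column $t_{j,\,n \times 1}$, $j = 1, \ldots, m$, of $T_{n \times m}$ contains the values of the projections of the points $y_1, \ldots, y_n$ onto PC$j$, $j = 1, \ldots, m$.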
It can be shown that the mean and variance of the values in each column, $t_{j,\,n \times 1}$, $j = 1, \ldots, m$, are equal to 0 and $\lambda_j / n$, respectively. The relative variances of these distances are calculated as $\lambda_j / \sum_{l=1}^{m} \lambda_l$, $j = 1, \ldots, m$. These relative variances are usually given as a percentage value. Relative-variance values are associated with what is called the amount of “information” [7]; thus, PC1 is the “alternative” axis that has the most “information” about the points, followed by PC2, and so on.
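The statements about the projection matrix can be verified on synthetic data. The sketch below is only illustrative (the random data, seed, and variable names are assumptions): it builds $T = YQ$ from standardized data and compares the column means and variances with 0 and $\lambda_j / n$.

```python
import numpy as np

# Synthetic stand-in for the data (assumption): n = 300 individuals, m = 5 attributes
rng = np.random.default_rng(1)
n, m = 300, 5
Y = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))   # correlated attributes
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)                # standardize the columns

lam, Q = np.linalg.eigh(Y.T @ Y)     # ascending eigenvalues
lam, Q = lam[::-1], Q[:, ::-1]       # reorder: largest eigenvalue first

T = Y @ Q                            # projections of the points onto PC1..PCm
col_mean = T.mean(axis=0)            # expected: 0 for every column
col_var = T.var(axis=0)              # expected: lambda_j / n (population variance)
rel_var = lam / lam.sum()            # relative variance ("information") per axis
rel_var_percent = 100.0 * rel_var    # as plotted in a Screeplot
```

The values in `rel_var_percent` sum to 100 and are already sorted in decreasing order, which is the ordering a Screeplot displays.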
This rotation transformation technique is what PCA does. It is important to note that the term “principal component” can have different meanings in the literature, but for details, please consult [2] (p. 3).

2.2. Method: The Loss of “Information” from Observing Only a Projection

Instead of projecting the points onto all $m$ “alternative” axes, only $r < m$ of these axes can be selected for projection (PC1 to PC$r$), namely those $r$ “alternative” axes that retain a desired minimum amount of “information,” forming with them a projection subspace of dimension $r$. The greater simplicity achieved in the projection comes at the cost of some “information.” This means using only the projections included in the matrix $T_{n \times r} = \begin{bmatrix} t_{1,\,n \times 1} & t_{2,\,n \times 1} & \cdots & t_{r,\,n \times 1} \end{bmatrix}$, $r < m$.
PCA theory shows that these projections, viewed again from the reference of the “original” axes, form a matrix, here called $Y^{(rd)}_{n \times m}$ (where rd stands for “reduction”), of rank at most $r$; in other words, $Y^{(rd)}_{n \times m}$ is the projection of $Y_{n \times m}$ onto the first $r$ “alternative” axes, PC1, …, PC$r$. The “information” loss when considering $Y^{(rd)}_{n \times m}$ instead of $Y_{n \times m}$ is $\sum_{j=r+1}^{m} \lambda_j / \sum_{l=1}^{m} \lambda_l$.
The “misfit” between $Y_{n \times m}$ and $Y^{(rd)}_{n \times m}$, which is the difference between the sum of the squares of all the elements of the matrix $Y_{n \times m}$ and the sum of the squares of all the elements of the matrix $Y^{(rd)}_{n \times m}$, can be calculated to be $\sum_{l=r+1}^{m} \lambda_l$. So the smaller the eigenvalues $\lambda_l$, $l = r+1, \ldots, m$, the smaller the misfit.
It can be shown that $Y_{n \times m} Q_{m \times r} = Y^{(rd)}_{n \times m} Q_{m \times r}$, where $Q_{m \times r}$ comes from taking the first $r$ columns of $Q_{m \times m}$ ($Q_{m \times r}$ contains only the first $r$ “alternative” direction vectors of $Q_{m \times m}$). Therefore, considering Equation (4), we can write $T_{n \times r} = Y_{n \times m} Q_{m \times r} = Y^{(rd)}_{n \times m} Q_{m \times r}$.
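The reduction to $r$ axes and the resulting misfit can be illustrated numerically. In this sketch the reduced matrix is built as $Y Q_r Q_r^T$, which is the standard expression for projecting onto the first $r$ eigenvectors; the synthetic data and the names `Q_r`, `Y_rd`, `misfit` and `info_loss` are assumptions made for illustration.

```python
import numpy as np

# Synthetic stand-in for the data (assumption): n = 250, m = 6, keep r = 2 axes
rng = np.random.default_rng(2)
n, m, r = 250, 6, 2
Y = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

lam, Q = np.linalg.eigh(Y.T @ Y)
lam, Q = lam[::-1], Q[:, ::-1]           # largest eigenvalue first

Q_r = Q[:, :r]                           # first r "alternative" direction vectors
T_r = Y @ Q_r                            # projections onto PC1..PCr
Y_rd = T_r @ Q_r.T                       # the projection seen from the "original" axes

# Misfit: difference between the sums of squared elements of Y and Y_rd
misfit = np.sum(Y**2) - np.sum(Y_rd**2)
# Relative "information" loss from discarding PC(r+1)..PCm
info_loss = lam[r:].sum() / lam.sum()
```

In this example `misfit` equals the sum of the discarded eigenvalues, and `Y @ Q_r` equals `Y_rd @ Q_r`, in agreement with the identities stated above.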
In the literature on PCA, retaining 70% [2] of the “information” is recommended, as a subjective cut-off point, for the first $r$ axes used to form the projection subspace; however, the desire for a simple graphical representation often leads to the use of $r = 2$, which means that the projection subspace is the plane formed by the “alternative” axes PC1 and PC2, a plane that is here called PC1-PC2. These two axes alone often do not retain 70% of the “information.” When they do not, the graphical representation on PC1-PC2 lacks “quality.”
It is important to say that the recommendation of using PC1 and PC2 (the “alternative” axes with the most “information”) is not always the best choice (it depends on the data and on the needs of the analysis), as can be observed in [11].

2.3. Method: The Biplot and Zoning Biplot

Later in this same section, an example is presented, where the raw dataset has $n = 20{,}103$ records ($n = 2461$ after the processing described in Section 2.5), $m = 10$ and $r = 2$, with the results shown in Section 3, Results. The PCA results of the example are shown in two graphs (of different types). The first is a bar chart, called a Screeplot, which represents the “information” associated with each “alternative” coordinate axis.
The second is a Biplot, which is the graphical representation of the $r$-dimensional projection subspace, with $r = 2$. This graph includes the following two sets of projections onto the subspace formed by the first $r = 2$ “alternative” axes: (1) The projections of the “original” direction vectors, $e_1, \ldots, e_m$, remembering that these correspond to the “original” axes which, in turn, represent the attributes; (2) The projections of the points $y_1, \ldots, y_n$.
In this type of graph, the projections of the “original” direction vectors are used more for the interpretation of the results than the projections of the points, but it is important to note that these vectors are a consequence of the location of the points (so the points were already taken into account). The angles between the projections of the “original” direction vectors indicate the relationship between the associated attributes. The details of this are given later when explaining the results of the example.
In the example results, PC1 and PC2 did not retain more than 45% of the total “information”, so, as mentioned, the graphical representation in PC1-PC2 would lack “quality”.
This means that the standard Biplot is not used, but rather the Zoning Biplot (an enhanced version of the Biplot). This type of Biplot can expand the possibilities for leveraging PCA. The explanation of the Zoning Biplot is left until Section 3, Results, which is based on the practical result of the example presented here.

2.4. Materials: About the Data Used for the Example Included in This Work

Data collection from YRBS applied in 2023 was carried out via questionnaires administered to a sample of randomly selected schools, chosen with a probability proportional to the number of students enrolled in grades 9 through 12. Within the selected classes, all students are eligible to participate. The questionnaire was self-administered to students, either on a computer or paper, and consisted of multiple-choice questions oriented toward ranges of values, for example: How old are you? The answer options are age ranges.
The national data was not an aggregate of the state datasets, but rather the national version uses an independent sample.
One hundred and forty-one formulas were provided for applying logical corrections to each student’s answers. For example, when there is a contradiction in some answers, these are marked as missing data.
In case of contradiction between any two answers of a student, all of that student’s answers are invalidated. The corrections are referred to as Data Edits. There are also other formulas for applying logical corrections to weight and height, but these are not considered in the present work. In Section 4 (Discussion), we have included explanations to support our handling of missing data.
Response files can be obtained from the YRBSS website and converted to various formats. For 2023, there are 107 questions (q1 to q107), and 20,103 student records.

2.5. Data Processing

After applying the 141 Data Edit formulas to the 20,103 records, 3975 records were discarded. Subsequently, all records with blank responses were discarded, leaving “only” 4972 records. Later, of the 107 questions (q1 to q107), only some were selected, whose answers correspond to risk themes selected in this study (as explained below). Finally, this study only examines the records of men. Men were selected because the study includes the risks we group as Gun Violence, as explained below. The rationale for selecting men is given in Section 4, Discussion. This leaves a total of 2461 records.
All questions are multiple choice with answers given in whole numbers, implying risk ranges. An example of a question and response range is: q84. “During the past 30 days, how often was your mental health not good? (Poor mental health includes stress, anxiety, and depression.)”. The choice answers are: 1. Never; 2. Rarely; 3. Sometimes; 4. Most of the time; 5. Always.
All answers should indicate low risks at low values and high risks at high values, that is, in ascending order according to “risk level”. The original range values from the questionnaires were retained as risk indicators, except where these values were arranged in descending rather than ascending order of risk level. In those cases, the response numbering was reversed.
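The renumbering of descending-risk answers can be sketched as below. This is a hypothetical illustration: it assumes that an item's options are coded 1 to k and that the renumbering is a simple reversal, which the text does not spell out; the function name `reverse_code` is not from the authors' code.

```python
import numpy as np

def reverse_code(responses, n_options):
    """Map option codes 1..n_options to n_options..1 so higher means riskier."""
    responses = np.asarray(responses)
    return (n_options + 1) - responses

# Hypothetical 5-option item whose option 1 denotes the highest risk
original = np.array([1, 2, 3, 5, 4])
recoded = reverse_code(original, n_options=5)
```

Applying the function twice returns the original codes, which is a convenient sanity check when deciding which questions need reversal.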
The themes selected for the present work, with the questions in parentheses are: Self-security (q8); Gun violence (q12, q13); Fighting (q16, q17); Sex and dating violence (q19, q20, q21, q22); Race, school and Web violence (q23, q24, q25); Tobacco use (q31, q32, q33, q34, q35, q36, q38, q39); Alcohol use (q41, q42, q43, q44); Exercise (q76, q77, q78); Mental health (q84), Home violence (q89, q90, q91).
A total score is calculated for each theme, resulting in 10 themes that, in this study, correspond to attributes. These attributes are shown in Table 1, along with their acronyms.
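One way to compute a per-theme total is sketched below with Pandas. This is an assumption-laden illustration: the text does not state how the total score is formed, so a plain sum of the theme's questions is assumed here, only three of the ten themes are included, and the response values are invented.

```python
import pandas as pd

# Subset of the theme-to-question mapping listed above (three themes only)
THEMES = {
    "Gun violence": ["q12", "q13"],
    "Fighting": ["q16", "q17"],
    "Mental health": ["q84"],
}

def theme_scores(responses: pd.DataFrame, themes: dict) -> pd.DataFrame:
    """Return one column per theme holding the row-wise total of its questions."""
    # Assumption: the total score of a theme is the plain sum of its answers
    return pd.DataFrame(
        {name: responses[cols].sum(axis=1) for name, cols in themes.items()}
    )

# Invented responses for two students
responses = pd.DataFrame(
    {"q12": [1, 3], "q13": [1, 2], "q16": [2, 1], "q17": [1, 1], "q84": [3, 5]}
)
scores = theme_scores(responses, THEMES)
```

Each column of `scores` then plays the role of one attribute (one column of the data matrix) in the PCA.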

2.6. Limitations

The responses released by the CDC are weighted to ensure the data are representative of the population from which they were drawn, based on the student’s sex, grade level, and race/ethnicity. In the present work, weighting is not applied; instead, the responses are used as they appear in the response data.
It should be noted that YRBSS also formulates other questions: qN1 to qN87, whose responses are mathematically dependent on combinations of the responses to questions q1 to q107, allowing for various analyses. These questions are not considered in the present work.

2.7. Tools Used

The Python language (Python v3.13.9 distributed through Anaconda Navigator v2.7.0) is used for the computational implementation of the operations, specifically the NumPy (v2.3.5) [12] and Pandas (v2.3.3) [13] packages, and the Scikit-learn (Sklearn v1.7.2) library [14]. From Scikit-learn, the following packages are used: (a) Preprocessing, using the StandardScaler class, and (b) Decomposition, using the PCA class, which contains unsupervised machine learning algorithms for dimensionality reduction. The CSV module and other packages such as Seaborn (v.0.13.2) [15] and Matplotlib (v 3.10.6) [16] are also used for graphical representation. Functions developed by the authors are used to create the zoning Biplots (explained below).
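A minimal sketch of how the two Scikit-learn classes named above fit together is given next. It runs on synthetic data (an assumption; the survey data are not reproduced here) and is not the authors' script. `PCA.components_` stores one direction vector per row, so its transpose corresponds to the matrix whose columns are the “alternative” direction vectors, up to the sign of each column.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the attribute table (assumption): 500 rows, 10 attributes
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 10))
X[:, 1] += X[:, 0]                       # make two attributes related

X_std = StandardScaler().fit_transform(X)    # mean 0, variance 1 per column

pca = PCA(n_components=10)
T = pca.fit_transform(X_std)                 # projections onto PC1..PC10

Q = pca.components_.T                        # columns: "alternative" direction vectors
ratio = pca.explained_variance_ratio_        # relative variance per axis
first_two = ratio[:2].sum()                  # "information" retained by PC1-PC2
```

The value `first_two` is the quantity compared against the 70% guideline when deciding whether a conventional Biplot on PC1-PC2 is adequate.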

3. Results

Figure 2 shows the resulting Screeplot, which indicates the relative variance (in percentage), equivalent to the amount of “information” of the m = 10 “alternative” coordinate axes. It can be observed that PC1 and PC2 do not even reach 45% aggregate “information”.
Given the small amount of “information” retained by PC1 and PC2, the conventional Biplot is not used; instead, the Zoning Biplot is used.
Figure 3 shows the resulting Zoning Biplot, which includes the following two sets of projections on plane PC1-PC2: (1) The projections of the “original” direction vectors, $e_1, \ldots, e_{10}$ ($m = 10$); (2) The projections of the points, $y_1, \ldots, y_n$, where $n = 2461$, drawn with shapes according to the values of the main attribute (HoV).
Since the “original” direction vectors, $e_1, \ldots, e_{10}$, correspond to the “original” coordinate axes, which in turn represent the attributes that have already been assigned acronyms (Table 1), the projections of these “original” direction vectors are identified by the acronyms of the attributes.
The scales shown on the bottom and left axes correspond to the projections of the points onto PC1 and PC2, respectively. Similarly, the scales on the top and right axes correspond to the projections of the “original” direction vectors (marked with the acronyms of the attributes) onto PC1 and PC2, respectively.
The Zoning Biplot is the same as the conventional Biplot, with the difference that the projections of the points in the Zoning Biplot are shown with shapes according to a classification by the value of the main attribute, in this case HoV, so that the shapes correspond to four ranges of values, from smallest to biggest, with corresponding shapes: ×, ▲, +, ▼. A fairly clear zoning pattern is expected in the location of the point projections based on their shape. If this occurs, the main attribute is considered to be well represented on the PC1-PC2 plane.
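The zoning step, which assigns one of four shapes to each point according to the value of the main attribute, might be sketched as follows. The authors' own functions are not shown in the text, so everything here is an assumption: the zone boundaries are taken as quartiles of the main attribute, the data are synthetic, and the shapes ×, ▲, +, ▼ are written as the Matplotlib marker codes `"x"`, `"^"`, `"+"`, `"v"`.

```python
import numpy as np

# Matplotlib marker codes for the four zones, from smallest to largest values
MARKERS = ["x", "^", "+", "v"]

def assign_zones(main_attribute):
    """Return a zone index (0..3) per point plus the boundaries used."""
    values = np.asarray(main_attribute, dtype=float)
    # Assumption: the four ranges are delimited by the quartiles of the attribute
    edges = np.quantile(values, [0.25, 0.5, 0.75])
    zones = np.digitize(values, edges)   # 0 below the first edge, 3 above the last
    return zones, edges

# Synthetic standardized values of a main attribute (assumption)
rng = np.random.default_rng(4)
hov = rng.normal(size=100)
zones, edges = assign_zones(hov)
markers = [MARKERS[z] for z in zones]
```

Each point projection on PC1-PC2 would then be drawn with its entry in `markers`, for instance with one scatter call per zone, so that any spatial separation of the four shapes becomes visible.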
With this, the authors argue that it is valid to obtain from the Zoning Biplot the relationship between this attribute and the others. The graphical explanation of the conception and use of the Zoning Biplot is given in the Discussion Section. In the example, the HoV values used for zoning are the standardized values (mean 0 and variance 1), but unstandardized values can also be used. Visually, the central parts of the zones are clearly separated, but there are overlaps; for example, the HoV < 0.834 zone is almost entirely overlapped.
In a conventional Biplot and in the Zoning Biplot, the relationship between pairs of attributes can be visually established from the angle between the projections of the direction vectors associated with those attributes. If the projections of two “original” direction vectors, corresponding to two attributes, are highly collinear (the angle between them is small), it can be concluded that there is a reinforcing relationship between the two associated attributes. If the projections are nearly perpendicular, it can be concluded that there is no relationship.
Interpreting an angle value as a “degree” of relationship between attributes also involves subjectivity. These angles can be obtained numerically from the $Q$ matrix (the main result of applying PCA). Visual interpretation, although subjective, is the approach most often used.
As a side note, sometimes, depending on the data, PC1 and PC2 are not the “alternative” axes where a clear zoning for a specific attribute is observed, but a clear zoning can be seen on other “alternative” axes, even though they retain less “information” [11]. In this article, PC1 and PC2 are used, as described in the example.
When a conventional Biplot is used, relations between all pairs of attributes are drawn. In Zoning Biplots (and only if a zoning is observed), the authors suggest drawing conclusions only about the relationship between the main attribute (the one whose values were used for zoning) and the other attributes, but not between the other attributes (although such relationships can be obtained).
In the example, from the observation of the angles as explained, it is clear that the main attribute, HoV, is closely related to SDV, meaning that a young man experiencing violence at home also tends to experience sex and dating violence. HoV is also related (not as closely) to RSV and MHe. HoV is less related to the other attributes. No relationships between the other attributes are formulated.
In addition to observing the angles of the projections of the “original” direction vectors, the direction of the cluster centers appears to coincide with the direction of the HoV’s “original” direction vector (which is desirable). The increasing senses also coincide.
Regarding the matrix $Q_{10 \times 10}$ ($m = 10$) obtained in the example, Table 2 presents the values of $Q$. It can be seen that the sum of the squares of any row or column is equal to 1 because the table is made up of direction vectors, which are unit vectors. The projection of any “original” direction vector (corresponding to an attribute) onto the PC1-PC2 plane is formed with the values in the $q_1$ and $q_2$ columns; for example, the projection of the HoV direction vector onto this plane is formed with the values $q_1 = 0.378214$ and $q_2 = 0.306069$. The absolute value of the projection of HoV is $\sqrt{0.378214^2 + 0.306069^2} = 0.486542839$, as can be seen in Figure 3.
As an example of how to use the $Q$ matrix, the angle between the projections of the “original” direction vectors $e_{10}$ (HoV) and $e_4$ (SDV) can be calculated as follows. The projection of $e_{10}$ on plane PC1-PC2 is $0.378214 \times q_1 + 0.306069 \times q_2$. These values are taken from the last row of $Q$, which corresponds to HoV, and from the first two columns of $Q$, those that correspond to $q_1$ and $q_2$. The angle between this projection vector and PC1 is $\mathrm{ATAN}(0.306069 / 0.378214) = 38.98°$. Similarly, the projection of $e_4$ on plane PC1-PC2 is $0.321971 \times q_1 + 0.287771 \times q_2$. The angle between this projection vector and PC1 is $\mathrm{ATAN}(0.287771 / 0.321971) = 41.79°$. Then, the angle between these two projection vectors is $2.808°$.
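The arithmetic of this worked example can be reproduced with a few lines of Python. The four coefficients are the values quoted above from Table 2; the helper names are hypothetical, and `math.atan2` is used instead of a plain arctangent so that the sketch would also handle projections lying in other quadrants.

```python
import math

# Coefficients quoted in the text: (q1, q2) entries of the HoV and SDV rows of Q
hov = (0.378214, 0.306069)
sdv = (0.321971, 0.287771)

def angle_with_pc1(projection):
    """Angle in degrees between a projected direction vector and the PC1 axis."""
    pc1, pc2 = projection
    return math.degrees(math.atan2(pc2, pc1))

def length(projection):
    """Length of the projected direction vector on the PC1-PC2 plane."""
    return math.hypot(projection[0], projection[1])

angle_hov = angle_with_pc1(hov)          # about 38.98 degrees
angle_sdv = angle_with_pc1(sdv)          # about 41.79 degrees
angle_between = angle_sdv - angle_hov    # about 2.81 degrees
hov_length = length(hov)                 # about 0.486543
```

The small angle between the two projections is what supports reading HoV and SDV as closely related on this plane.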

4. Discussion

4.1. Graphical Explanation of the Zoning Biplot Concept

The following is a graphical heuristic that explains why the Zoning Biplot can be more illustrative than a Biplot. Figure 4 represents a scenario of a plane formed by two “original” axes associated with Attributes $U$ and $V$, which have direction vectors $e_U$ and $e_V$. On the plane, there are two points: $a = a_U e_U + a_V e_V$ and $b = b_U e_U + b_V e_V$.
In the same plane lies the “alternative” axis PC1, with direction vector $q_1$, which is most collinear with the axis with direction vector $e_V$. It can be observed that:
$$a_U < b_U, \qquad a_V > b_V$$
Regarding the projections of points onto PC1, it is observed that $a_{PC1} > b_{PC1}$, so the same order is maintained compared to that of the projections of points onto the axis with direction vector $e_V$ (Attribute $V$).
If the projection of points on PC1 is drawn with shapes corresponding to the magnitude of the attribute $V$, a correct ordering is more likely to be observed, compared to if the points on PC1 are drawn with shapes corresponding to the attribute $U$. This is not always the case, as seen when considering the point $b$, since $b_U$ exceeds $a_U$ to a greater extent. All this means that the projections in PC1 better represent the attribute $V$.
This is the idea that leads to considering the Zoning Biplot as an aid, where a well-defined zoning indicates that an attribute is well represented.

4.2. About the Handling of Missing Data

Many algorithms, such as PCA, cannot handle missing values; Data Edits can generate them. We removed an individual’s entire response if they had a blank response for any attribute, without imputing any values. The reason for this is as follows.
In the Methodology of the Youth Risk Behavior Surveillance System [17], it is stated, verbatim: “Responses that conflict in logical terms are both set to missing, and data are not imputed. For example, if a student responds to one question that he or she has never smoked, but then responds to a subsequent question that he or she has smoked two cigarettes during the previous 30 days, the processing system sets both responses to missing. Neither response is assumed to be the correct response.” They explain that they did this in terms of what they call quality-control checks, where questionnaires with fewer than 20 valid responses remaining after editing are deleted from the dataset. They explain that the number of questionnaires eliminated represents a small percentage of the total.
We removed 3975 records out of a total of 20,103, a considerable percentage (about 19.8%). Nevertheless, we removed an individual’s entire response if it was blank for any attribute, without imputing data, because:
(1) The attribute values are related, so imputation could introduce a significant error into the dataset.
(2) Blank responses due to data editing can be considered missing completely at random (MCAR), that is, there is no relationship between the missing data and any other measured or unmeasured variable. Ref. [18] states verbatim: “By far the most common approach to the missing data is to simply omit those cases with the missing data and analyze the remaining data. This approach is known as the complete case (or available case) analysis or listwise deletion,” and adds that, if the MCAR assumption is satisfied, listwise deletion is known to produce unbiased estimates and conservative results.
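Listwise deletion as described above can be sketched in a few lines. This is a minimal illustration on made-up toy records and not the code used in this work; the column names and the helper `listwise_delete` are assumptions. The counts at the end are the ones reported in the text.

```python
# Toy records (hypothetical); None marks a response that was set to missing.
records = [
    {"q12": 1, "q21": 2, "q22": 2},
    {"q12": 1, "q21": None, "q22": None},  # incomplete: will be dropped
    {"q12": 3, "q21": 1, "q22": 1},
]

def listwise_delete(rows):
    """Keep only complete cases: rows where every attribute has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]

complete = listwise_delete(records)

# Counts reported in the text.
total, removed = 20103, 3975
remaining = total - removed
removed_share = removed / total

print(len(complete), remaining, f"{removed_share:.1%}")  # prints: 2 16128 19.8%
```

No value is imputed at any point; a record is either kept whole or dropped whole.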
Regarding Data Edits: from the complete file with all attributes, q1 to q107, Data Edits removed 3975 rows of data from a total of 20,103 rows (leaving 16,128 rows of data). From a data file containing only the columns used in the final analysis (q02, q04, q08, q12, q13, q16, q17, q19, q20, q21, q22, q23, q24, q25, q31, q32, q33, q34, q35, q36, q38, q39, q41, q42, q43, q44, q76, q77, q78, q84, q89, q90, q91), Data Edits again removed 3975 rows of data. In both cases, the only formula that eliminated data was #15, which is (Q22 = A AND Q21 = B, C, D, E, F), meaning that Q22 cannot be A while Q21 is B, C, D, E, or F. The questions are:
Q22. During the past 12 months, how many times did someone you were dating or going out with physically hurt you on purpose?
Q21. During the past 12 months, how many times did someone you were dating or going out with force you to do sexual things that you did not want to do? (Count such things as kissing, touching, or being physically forced to have sexual intercourse.) B. 0 times; C. 1 time; D. 2 or 3 times; E. 4 or 5 times; F. 6 or more times;
It appears there was confusion among the students, because many responses were mixed: Q22: A (I did not date or go out with anyone during the past 12 months) together with Q21: B (0 times), instead of Q21: A (I did not date or go out with anyone during the past 12 months). In these cases, the complete response is removed.
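The logic of formula #15 can be sketched as follows. This is an illustrative reading of the rule as stated above, not the CDC processing code; the function name `apply_edit_15` and the dictionary layout are assumptions. In line with the methodology quoted earlier, both conflicting responses are set to missing.

```python
# Answers to Q21 that conflict with Q22 = A ("I did not date or go out...").
CONFLICTING_Q21 = {"B", "C", "D", "E", "F"}

def apply_edit_15(row):
    """Return a copy of `row`; if Q22 = A and Q21 is B..F, blank both."""
    if row["q22"] == "A" and row["q21"] in CONFLICTING_Q21:
        return {**row, "q21": None, "q22": None}
    return dict(row)

consistent = apply_edit_15({"q21": "A", "q22": "A"})  # did not date: consistent
mixed = apply_edit_15({"q21": "B", "q22": "A"})       # the mixed case described above
dated = apply_edit_15({"q21": "C", "q22": "B"})       # dated: consistent

print(consistent, mixed, dated)
```

Because incomplete records are then removed in full, every row flagged by this rule leaves the dataset.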

4.3. About Selecting Men from the Dataset

We selected only men for this study because it includes the risks we classify as Gun Violence, and men’s behavior regarding these risks is very different from women’s. In the United States, the vast majority of gun violence perpetrated by teenagers is committed by males. Ref. [19] states verbatim: “Men commit the vast majority of gun violence in the U.S. According to FBI data from 2017, men were responsible for 88.1 percent of all homicides, and firearms were used in 72.6 percent of homicides”.
Ref. [20] states verbatim: “Gun violence, and gun culture more broadly, are gendered phenomena in America. Empirically, the relationship between gun violence and gender is clear: men are more likely to own, use, kill with, and die by the gun. Gun ownership is disproportionately male: 62% of gun owners in the United States are men. Only 22% of women report that they own a gun compared to approximately 40% of men, which means that nearly twice as many men than women own guns”.

4.4. Regarding the Use of Quantiles with Our Dataset

We have attributes whose values come from a small set of integers, for example, between 1 and 5. For an attribute like this, a single value, such as 2, is often repeated across a large part of the sample. Therefore, when we divide the sample into quartiles, quintiles, or sextiles, the value 2 alone can make up all the values of one or more quantiles, and the division into quantiles does not help with zoning. Because of the values of some of our attributes, it is not possible to divide all of them effectively into quantiles, but it is possible for the following attributes: Exc into quartiles and quintiles; MHe into quintiles; and HoV into quartiles and quintiles. Here we use HoV with quartiles.
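The effect of repeated values on quantiles can be seen in a small sketch. The sample below is hypothetical (a 1-to-5 score in which the value 2 dominates) and is not drawn from the YRBSS data; `statistics.quantiles` from the Python standard library is used here only as one convenient way to compute cut points.

```python
from statistics import quantiles

# Hypothetical sample of 20 scores between 1 and 5, dominated by the value 2.
scores = [1, 1] + [2] * 14 + [3, 3, 5, 5]

# Quartiles: three cut points that should split the sample into four groups.
quartile_cuts = quantiles(scores, n=4)
distinct_quartile_cuts = len(set(quartile_cuts))

# The same check for quartiles, quintiles and sextiles.
for n in (4, 5, 6):
    cuts = quantiles(scores, n=n)
    print(n, cuts, "distinct cut points:", len(set(cuts)))
```

For this sample the three quartile cut points all fall on the value 2, so the four quartile groups cannot be told apart and a zoning based on them would carry no information.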

5. Conclusions

PCA facilitates the visualization of multi-attribute data, but it is important that the projection plane used is representative of the attributes. The Zoning Biplot can provide greater confidence in the validity of the PCA results. As there is a loss of “information” in the projections obtained from PCA, it is important to be cautious when drawing conclusions.
Contrary to what is commonly believed, attribute selection and data preparation can be as intensive as data analysis itself.
The analysis of data related to education can greatly benefit from data analysis methods, such as PCA.
Regarding PCA:
A major drawback of PCA is the difficulty of interpreting its components. Approaches that help with this drawback include: (1) using rotations such as Varimax for better component interpretation; (2) applying Sparse PCA for clearer variable selection; and (3) using nonlinear PCA.
PCA is a linear method. If the relationship between attributes is not linear, PCA may fail to find the correct relationships. For such cases there are nonlinear methods, such as auto-associative neural networks (autoencoders) and the alternating least squares method.
There are also methods that do not compete with PCA, such as factor analysis, which is used to describe variability among observed, correlated attributes in terms of a smaller number of unobserved theoretical variables called factors.
There are other dimensionality reduction methods, such as t-Distributed Stochastic Neighbor Embedding (t-SNE), for data with hundreds or thousands of attributes. t-SNE is not deterministic and is computationally expensive.

Author Contributions

Conceptualization, H.J.-S., C.S.-S. and A.P.M.-P.; methodology, A.P.M.-P.; software, H.J.-S., C.S.-S. and A.P.M.-P.; validation, J.A.-L. and A.P.M.-P.; formal analysis and investigation, A.P.M.-P.; resources, A.P.M.-P.; data curation, J.A.-L. and A.P.M.-P.; writing—original draft preparation and review and editing, A.P.M.-P.; project administration, A.P.M.-P.; funding acquisition, H.J.-S., C.S.-S. and A.P.M.-P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data is available from the corresponding author.

Acknowledgments

The authors are extremely grateful to the peer reviewers for their constructive comments and suggestions, as well as to the editors of this journal for their attention to the authors’ questions and communications. The resources supporting this research work and the related publication were granted by the Division of Communication Sciences and Design, and the Department of Information Technologies, of the Universidad Autónoma Metropolitana, Unidad Cuajimalpa, Mexico.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

The selected questions are given in Table A1.
Table A1. Selected Questions.
Self-security
Q8. How often do you wear a seat belt when riding in a car driven by someone else? 1. Never, …, 5. Always
Gun Violence
Q12. During the past 30 days, on how many days did you carry a weapon, such as a gun, knife, or club on school property? 1. 0 days; 2. 1 day; 3. 2 or 3 days; 4. 4 or 5 days; 5. 6 or more days
Q13. During the past 12 months, on how many days did you carry a gun? (Do not count the days when you carried a gun only for hunting or for a sport, such as target shooting.) 1. 0 days; 2. 1 day; 3. 2 or 3 days; 4. 4 or 5 days; 5. 6 or more days
Fighting
Q16. During the past 12 months, how many times were you in a physical fight? 1. 0 times; 2. 1 time; 3. 2 or 3 times; 4. 4 or 5 times; 5. 6 or 7 times; 6. 8 or 9 times; 7. 10 or 11 times; 8. 12 or more times
Q17. During the past 12 months, how many times were you in a physical fight on school property? 1. 0 times; 2. 1 time; 3. 2 or 3 times; 4. 4 or 5 times; 5. 6 or 7 times; 6. 8 or 9 times; 7. 10 or 11 times; 8. 12 or more times
Sex and dating violence
Q19. Have you ever been physically forced to have sexual intercourse when you did not want to? 1. Y; 2. N
Q20. During the past 12 months, how many times did anyone force you to do sexual things that you did not want to do? (Count such things as kissing, touching, or being physically forced to have sexual intercourse.) 1. 0 times; 2. 1 time; 3. 2 or 3 times; 4. 4 or 5 times; 5. 6 or more times
Q21. During the past 12 months, how many times did someone you were dating or going out with force you to do sexual things that you did not want to do? (Count such things as kissing, touching, or being physically forced to have sexual intercourse.) 1. I did not date or go out with anyone during the past 12 months; 2. 0 times; 3. 1 time; 4. 2 or 3 times; 5. 4 or 5 times; 6. 6 or more times
Q22. During the past 12 months, how many times did someone you were dating or going out with physically hurt you on purpose? 1. I did not date or go out with anyone during the past 12 months; 2. 0 times; 3. 1 time; 4. 2 or 3 times; 5. 4 or 5 times; 6. 6 or more times
Race, School and Web violence
Q23. During your life, how often have you felt that you were treated badly or unfairly in school because of your race or ethnicity? 1. Never, …, 5. Always
Q24. During the past 12 months, have you ever been bullied on school property? 1. Y; 2. N
Q25. During the past 12 months, have you ever been electronically bullied? (Count being bullied through texting, Instagram, Facebook, or other social media.) 1. Y; 2. N
Tobacco use
Q31. Have you ever smoked a cigarette, even one or two puffs? 1. Y; 2. N
Q32. How old were you when you first smoked a cigarette, even one or two puffs? 1. I have never smoked a cigarette, not even one or two puffs; 2. 8 years old or younger; 3. 9 or 10 years old; 4. 11 or 12 years old; 5. 13 or 14 years old; 6. 15 or 16 years old; 7. 17 years old or older
Q33. During the past 30 days, on how many days did you smoke cigarettes? 1. 0 days; 2. 1 or 2 days; 3. 3 to 5 days; 4. 6 to 9 days; 5. 10 to 19 days; 6. 20 to 29 days; 7. All 30 days
Q34. During the past 30 days, on the days you smoked, how many cigarettes did you smoke per day? 1. I did not smoke cigarettes during the past 30 days; 2. Less than 1 cigarette per day; 3. 1 cigarette per day; 4. 2 to 5 cigarettes per day; 5. 6 to 10 cigarettes per day; 6. 11 to 20 cigarettes per day; 7. More than 20 cigarettes per day
Q35. Have you ever used an electronic vapor product? 1. Y; 2. N
Q36. During the past 30 days, on how many days did you use an electronic vapor product? 1. 0 days; 2. 1 or 2 days; 3. 3 to 5 days; 4. 6 to 9 days; 5. 10 to 19 days; 6. 20 to 29 days; 7. All 30 days
Q38. During the past 30 days, on how many days did you use chewing tobacco, snuff, dip, snus, or dissolvable tobacco? 1. 0 days; 2. 1 or 2 days; 3. 3 to 5 days; 4. 6 to 9 days; 5. 10 to 19 days; 6. 20 to 29 days; 7. All 30 days
Q39. During the past 30 days, on how many days did you smoke cigars, cigarillos, or little cigars, such as Swisher Sweets, Middleton’s (including Black & Mild), or Backwoods? 1. 0 days; 2. 1 or 2 days; 3. 3 to 5 days; 4. 6 to 9 days; 5. 10 to 19 days; 6. 20 to 29 days; 7. All 30 days
Alcohol use
Q41. How old were you when you had your first drink of alcohol other than a few sips? 1. I have never had a drink of alcohol other than a few sips; 2. 8 years old or younger; 3. 9 or 10 years old; 4. 11 or 12 years old; 5. 13 or 14 years old; 6. 15 or 16 years old; 7. 17 years old or older
Q42. During the past 30 days, on how many days did you have at least one drink of alcohol? 1. 0 days; 2. 1 or 2 days; 3. 3 to 5 days; 4. 6 to 9 days; 5. 10 to 19 days; 6. 20 to 29 days; 7. All 30 days
Q43. During the past 30 days, on how many days did you have 4 or more drinks of alcohol in a row, that is, within a couple of hours (if you are female) or 5 or more drinks of alcohol in a row, that is, within a couple of hours (if you are male)? 1. 0 days; 2. 1 day; 3. 2 days; 4. 3 to 5 days; 5. 6 to 9 days; 6. 10 to 19 days; 7. 20 or more days
Q44. During the past 30 days, what is the largest number of alcoholic drinks you had in a row, that is, within a couple of hours? 1. I did not drink alcohol during the past 30 days; 2. 1 or 2 drinks; 3. 3 drinks; 4. 4 drinks; 5. 5 drinks; 6. 6 or 7 drinks; 7. 8 or 9 drinks; 8. 10 or more drinks
Exercise
Q76. During the past 7 days, on how many days were you physically active for a total of at least 60 min per day?
Q77. In an average week when you are in school, on how many days do you go to physical education (PE) classes? 1. 0 days; 2. 1 day; 3. 2 days; 4. 3 days; 5. 4 days; 6. 5 days
Q78. During the past 12 months, on how many sports teams did you play? (Count any teams run by your school or community groups.) 1. 0 teams; 2. 1 team; 3. 2 teams; 4. 3 or more teams
Mental health
Q84. During the past 30 days, how often was your mental health not good? (Poor mental health includes stress, anxiety, and depression.) 1. Never; 2. Rarely; 3. Sometimes; 4. Most of the time; 5. Always
Home violence
Q89. During your life, how often has a parent or other adult in your home insulted you or put you down? 1. Never; 2. Rarely; 3. Sometimes; 4. Most of the time; 5. Always
Q90. During your life, how often has a parent or other adult in your home hit, beat, kicked, or physically hurt you in any way? 1. Never; 2. Rarely; 3. Sometimes; 4. Most of the time; 5. Always
Q91. During your life, how often have your parents or other adults in your home slapped, hit, kicked, punched, or beat each other up? 1. Never; 2. Rarely; 3. Sometimes; 4. Most of the time; 5. Always

References

  1. Stewart, G.W. On the Early History of the Singular Value Decomposition. SIAM Rev. 1993, 35, 551–566.
  2. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150202.
  3. Greenacre, M. Biplots in Practice; Foundation BBVA: Bilbao, Spain, 2010.
  4. Shlens, J. A Tutorial on Principal Component Analysis. arXiv 2014, arXiv:1404.1100.
  5. Shalizi, C.R. Advanced Data Analysis from an Elementary Point of View (Principal Component Analysis Chapter), Carnegie Mellon University. 2021. Available online: https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ (accessed on 18 October 2022).
  6. Gabriel, K.R. The biplot graphic display of matrices with application to principal component analysis. Biometrika 1971, 58, 453–467.
  7. Borg, I.; Groenen, P.J.F. Modern Multidimensional Scaling; Theory and Applications; Springer: New York, NY, USA, 2005.
  8. Bedre, R. Principal Component Analysis (PCA) and Visualization Using Python (Detailed Guide with Example). Creative Commons. 2021. Available online: https://www.reneshbedre.com/blog/principal-component-analysis.html (accessed on 7 July 2022).
  9. Young, E.; McCain, J.L.; Mercado, M.C.; Ballesteros, M.F.; Moore, S.; Licitis, L.; Stinson, J.; Jones, S.E.; Wilkins, N.J. Frequent Social Media Use and Experiences with Bullying Victimization, Persistent Feelings of Sadness or Hopelessness, and Suicide Risk Among High School Students—Youth Risk Behavior Survey, United States, 2023. Morb. Mortal. Wkly. Rep. (MMWR) 2024, 73, 23–30.
  10. Centers for Disease Control and Prevention (CDC). 1991–2023 High School Youth Risk Behavior Survey Data, United States; U.S. Department of Health and Human Services: Atlanta, GA, USA, 2025. Available online: http://yrbs-explorer.services.cdc.gov/ (accessed on 29 March 2026).
  11. Mateos-Papis, A.P.; Sánchez-Sánchez, C.; Jiménez-Salazar, H.; Guerrero-Vargas, N.; Ángeles-Castellanos, A.M.; Escobar, C. Exploración de representaciones para identificar relaciones entre atributos. Res. Comput. Sci. 2022, 151, 165–187.
  12. Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362.
  13. McKinney, W. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010.
  14. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  15. Waskom, M.L. Seaborn: Statistical data visualization. J. Open Source Softw. 2021, 6, 3021.
  16. Hunter, J.D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 2007, 9, 90–95.
  17. Brener, N.D.; Kann, L.; Shanklin, S.; Kinchen, S.; Eaton, D.K.; Hawkins, J.; Flint, K.H. Methodology of the Youth Risk Behavior Surveillance System—2013. Morb. Mortal. Wkly. Rep. Recomm. Rep. 2013, 62, 1–20.
  18. Kang, H. The prevention and handling of the missing data. Korean J. Anesthesiol. 2013, 64, 402–406.
  19. Addressing Gun Violence by Reimagining Masculinity and Protection. Available online: https://genderpolicyreport.umn.edu/addressing-gun-violence-by-reimagining-masculinity-and-protection/ (accessed on 8 May 2026).
  20. Lawrence, H. Toxic Masculinity and Gender-Based Gun Violence in America: A Way Forward. J. Gend. Race Justice 2023, 26, 33.
Figure 1. Image of the set of “original” direction vectors and the set of “alternative” direction vectors, in a 3-dimensional Euclidean space.
Figure 2. Screeplot obtained.
Figure 3. Zoning Biplot obtained.
Figure 4. Graphic to explain the Zoning Biplot concept.
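Table 1. Attributes and acronyms.
| Group | Attribute | Acronym |
|---|---|---|
| Self-inflicted risks | Self-security score | SSc |
| Self-inflicted risks | Gun violence score | GuV |
| Self-inflicted risks | Fighting score | Fgh |
| Self-inflicted risks | Tobacco use score | Tob |
| Self-inflicted risks | Alcohol use score | Alc |
| Exercise | Exercise score | Exc |
| Risks of external violence | Sex and dating violence score | SDV |
| Risks of external violence | Race, School and Web violence score | RSV |
| Risks of external violence | Home violence score | HoV |
| Wellbeing risks | Mental health score | MHe |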
Table 2. Representation of the matrix $Q_{10 \times 10}$, obtained in the example.
| | $q_1$ (PC1) | $q_2$ (PC2) | $q_3$ (PC3) | $q_4$ (PC4) | $q_5$ (PC5) | $q_6$ (PC6) | $q_7$ (PC7) | $q_8$ (PC8) | $q_9$ (PC9) | $q_{10}$ (PC10) |
|---|---|---|---|---|---|---|---|---|---|---|
| $e_1$ (SSc) | 0.232352 | −0.33224 | 0.153853 | −0.0848 | 0.80861 | 0.316539 | 0.214722 | −0.03331 | 0.037722 | −0.04568 |
| $e_2$ (GuV) | 0.278394 | −0.24415 | −0.07146 | 0.569285 | −0.34371 | 0.569024 | 0.056104 | −0.2537 | −0.10402 | 0.1159 |
| $e_3$ (Fgh) | 0.372283 | −0.15084 | −0.07054 | 0.301991 | 0.055432 | −0.186 | −0.26218 | 0.755253 | −0.23808 | −0.09478 |
| $e_4$ (SDV) | 0.321971 | 0.287771 | −0.22396 | 0.249956 | 0.028316 | −0.35892 | 0.749186 | −0.03229 | 0.082493 | −0.04619 |
| $e_5$ (RSV) | 0.290591 | 0.385438 | −0.3379 | 0.094257 | 0.316738 | −0.1286 | −0.46671 | −0.45685 | −0.31419 | −0.04277 |
| $e_6$ (Tob) | 0.4235 | −0.25757 | 0.169983 | −0.28315 | −0.12551 | −0.30084 | −0.03295 | −0.13315 | −0.05081 | 0.719434 |
| $e_7$ (Alc) | 0.396778 | −0.32764 | 0.161842 | −0.28679 | −0.2827 | −0.17392 | −0.02371 | −0.25125 | −0.01673 | −0.67279 |
| $e_8$ (Exc) | 0.023704 | 0.320727 | 0.860817 | 0.341047 | 0.055155 | −0.12077 | −0.04455 | −0.08942 | −0.10209 | −0.03497 |
| $e_9$ (MHe) | 0.247819 | 0.453485 | 0.074851 | −0.47508 | −0.15552 | 0.473683 | 0.187928 | 0.231716 | −0.40501 | 0.005646 |
| $e_{10}$ (HoV) | 0.378214 | 0.306069 | 0.014679 | −0.04157 | −0.03023 | 0.190641 | −0.25291 | 0.108739 | 0.805132 | 0.006963 |
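As a sanity check on Table 2, the columns of the matrix are expected to be approximately orthonormal, since the principal directions obtained by PCA form an orthonormal set. The following minimal sketch checks this for the first two columns only, using the values as printed in the table; the helper `dot` and the tolerance of 1e-3 are assumptions made to allow for rounding in the published figures.

```python
# First two columns of Table 2 (PC1 and PC2), in the row order SSc..HoV.
q1 = [0.232352, 0.278394, 0.372283, 0.321971, 0.290591,
      0.4235, 0.396778, 0.023704, 0.247819, 0.378214]
q2 = [-0.33224, -0.24415, -0.15084, 0.287771, 0.385438,
      -0.25757, -0.32764, 0.320727, 0.453485, 0.306069]

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))

TOL = 1e-3  # assumed tolerance for rounding in the printed values

unit_q1 = abs(dot(q1, q1) - 1.0) < TOL   # q1 has (approximately) unit length
unit_q2 = abs(dot(q2, q2) - 1.0) < TOL   # q2 has (approximately) unit length
orthogonal = abs(dot(q1, q2)) < TOL      # q1 and q2 are (approximately) perpendicular

print(unit_q1, unit_q2, orthogonal)
```

The same check can be repeated for any other pair of columns in the table.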
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
