1. Introduction
The rapid development of data mining and text analytics, especially in the fields of communications, linguistics, sociology, and psychology, has significantly contributed to the evolution of various families of textual analysis techniques in social science research [1,2,3,4,5,6]. In its early stages, a generic form of communication content analysis was used in academic works [7,8] that focused on the comparative or controversial narratives of historical incidents [9], the expressions of nationalism in children’s books [10], inaugural presidential speeches [11,12], the topic discovery of published academic articles [13,14,15,16], and more recently the analysis of textbooks via natural language processing [17]. Quantitative textual analysis was often identified as the most prominent content analysis technique in testing empirical hypotheses using textual data to evaluate social theories [5,18,19,20,21].
The main function of textual content analysis is to systematically analyze the content of different types of communication [22,23]. The conceptual and operational challenges of the technique have been explicitly discussed in the methodological literature. Indicatively, Shapiro and Markoff [24] outlined the framework of definitional peculiarities of content analysis, classifying them based on (a) the scientific purpose (descriptive vs. inferential vs. taxonomic), (b) the methodological orientations (quantitative vs. qualitative), (c) the extraction of contextual meaning (manifest vs. latent), (d) the unit of analysis (words, sentences, etc.), and (e) measurement quality (issues of validity and reliability).
Content analysis is a family of formal methodological techniques that systematically convert textual materials into numeric representations of manifest and latent meanings. In the last few decades, despite the obstacles posed by the social context of digital texts and the reliability issues of computer-assisted coding schemes, there has been an increasing interest in examining the applicability of text mining in sociological studies [25,26,27,28]. The need for hybridization of content analysis techniques was extensively discussed by Roberts [19], who recognized the need for a methodological synthesis ascribing cultural meanings derived from textual data and validating coding schemes suitable for statistical analysis.
The scope of this paper is to trace the historical origins of traditional content analysis and to exemplify the importance of epistemological and methodological syntheses that combine traditional text analytics with mainstream multivariate techniques, which have significantly contributed to socio-cognitive and socio-cultural studies. Specifically, I conceptually streamline the operations of the foundational models of latent semantic analysis (LSA), Entity–Aspect Sentiment Analysis (EASA), and Correspondence Analysis (CA) and assert that such an operational synthesis not only detects topics, describes sentiments, and classifies sentiments within the context of topic domains, but also assesses the degree of dependence of sentiments across topic domains. Overall, this paper provides insightful information about the origins and evolution of content analysis and, grounded in socio-cognitive frameworks, presents the operational structure of a streamlined integrating model.
2. Origins of Content Analysis
Text analysis was developed as an integrated formal methodological technique in the early 1950s. It was defined as “a research technique for the objective, systematic, and quantitative description of the manifest content of communication” [23]. Several social researchers and methodologists have argued over time that the origins of content analysis are linked to the philosophical foundations of logic, cognition, conscious understanding [3], and rhetoric [4]. Krippendorff [3] pointed out the distinctions between the philosophical and empirical notions of communication content analysis. He stated that content analysis is not just a postulate but has an empirical orientation designed to explore, describe, and predict social phenomena.
Content analysis’ origins can be traced back to the philosophical postulates of communication content, semiotics, and rhetorical practices [4]. Aristotle explained the art of communication as a conscious sensory process of objectification and interpretation of signs (semiology). More advanced postulates describing the communication process added morphemes and symbols to the content of communication analysis [29]. Aristotle conceptualized signs as products of reflexive awareness of the environment. Signs take the form of words, images, and human artifacts shaping cognitive schemas. Such schemas ascribe manifest and covert meanings to observations, thoughts, feelings, opinions, sentiments, and aspirations [29]. Aristotle’s postulate on communication content was based on the study of signs, the first known type of semiotic analysis. Succeeding Aristotle’s philosophical claims on the interpretative understanding of signs, Augustine classified signs into two unique categories: the natural and the conventional. Natural signs do not intentionally signify something beyond their occurrence, but they naturally indicate an underlying condition (latency). On the other hand, conventional signs are linked to living entities that use words to comprehend and express sensory conceptions (manifestation). Many centuries passed until these theoretical postulates on communication content were used in empirical inquiries.
In the late 1600s, the first known empirical inquiry into communication content was developed in response to the Church’s concerns about the extent of non-religious content in newspaper articles. Technological advancements that occurred during the Industrial Revolution in the eighteenth and nineteenth centuries had a major impact on the intellectual growth of content analysis. The large volumes of textual data available in the press drew the attention of journalists and academics, who utilized them to analyze topics appearing in the newspapers.
The first officially recorded empirical study using quantitative textual analysis occurred at the end of the 19th century [30], when Speed [31] conducted a quantitative newspaper text analysis demonstrating the reluctance of New York newspapers to cover news associated with religious, scientific, and literary topics. Instead, they favored topics related to sports and gossip for extended periods [3,31,32]. The first sociological study using quantitative newspaper content analysis was conducted by an African American journalist, Ida B. Wells, who analyzed several newspaper articles reporting incidents of rape of white women by African American men [33]. For the most part, early forms of textual content analysis were utilized to analyze the content of the newspapers’ articles [3].
A more sophisticated technique, predictive textual analysis, was developed during the Great War. Predictive textual analysis was widely used as a tool of propaganda analysis during World War I, World War II, and the Korean War. One of the first reported applications of the technique took place in the United States in 1916. According to Chomsky [34], Woodrow Wilson’s victorious 1916 campaign was associated with the slogan “Peace without Victory”. The message of this slogan was meticulously adopted by millions of American pacifists who consistently stated their opposition to America’s involvement in World War I. Wilson’s administration, the Creel Commission, and the John Dewey Circle strategically employed media content as propaganda to turn pacifists into pro-war fanatics. Experts on propaganda analysis introduced a very sophisticated and innovative technique, known to this day as content analysis. Content analysis rapidly grew to be one of the prominent analytical styles of political discourse, media, and propaganda.
Without a doubt, textual content analysis was recognized by numerous scholars as a distinct methodology sharing identical properties with other popular quantitative methodologies of the time. It has been approximately 75 years since textual content analysis penetrated the traditional modules of scientific design. With the employment of sophisticated sampling techniques, well-established metrics, and cautiously constructed reliability and validity tests, content analysis appealed to a variety of academic disciplines [5,35]. After all, quantitative textual analysis constituted the most prominent methodological technique in formulating empirical hypotheses using solely non-numeric data. However, its variant applications across different academic disciplines have generated long-lasting debates regarding its definitional and operational structure, forming two distinct camps: quantitative and qualitative text analysis.
3. Qualitative vs. Quantitative Text Analysis
Textual content analysis aims to systematically describe the manifest content of communication [5,19]. The conceptual and operational challenges of the technique have been explicitly discussed in the methodological literature by a large group of social researchers. Most experts in the field have agreed that it constitutes a designated method for the analysis of texts, symbols, images, linguistics, semantics, syntax, and pragmatics [24]. Textbook definitions of content analysis extensively cover the deterministic and interpretive significance of the technique. The definitional peculiarities of content analysis are based on (a) the scientific purpose (descriptive vs. inferential vs. taxonomic), (b) the methodological orientations (quantitative vs. qualitative), (c) the extraction of contextual meaning (manifest vs. latent), (d) the unit of analysis (words, texts, individuals, etc.), and (e) its measurement quality (validity and reliability).
The rise of the symbolic interactionist perspective impacted the interpretation of words, symbols, and mental images, treating such artifacts as a reliable data source suitable for analyzing socio-cognitive schemas. Textual content analysis became the foundation of qualitative methodologies such as critical analysis, interpretative analysis, ethical analysis, and more. It was not until the late 1960s that more social scientists started analyzing social realities informed by cultural aspirations. Structural perspectives—complemented by cultural explanations of social issues—penetrated the traditional sociological theoretical lens. Cultural theorists such as Michelle Lamont and Fournier [36], Pierre Bourdieu [37], and Jeffrey Alexander [38] illuminated the importance of cultural interpretations in making sense of social reality. Since the mid-1990s, qualitative content analysis has been consolidated as a suitable methodology for exploring and describing socio-cultural phenomena.
Qualitative text analysis prompts analytical induction and theory formulation. In the early 1970s, content analysis entered the qualitative methodological domains and was established in the fields of ethnomethodology [39], grounded theory [40], critical social research [36], historical research [37], ethical research [38], and phenomenological research [39]. There are seven broad categories of qualitative content analysis, classified as follows: (a) rhetorical narrative analysis, which emphasizes the way a textual or symbolic message is presented; (b) narrative analysis, which describes characteristics of the main actors presented in the text; (c) discourse analysis, which focuses on the manifest meaning of the words and the themes created by the segments of the words (terms) appearing in texts; (d) semiotic analysis, which focuses on the meaning of textual and symbolic material; (e) interpretive analysis, which involves theory formulation via observation, conceptualization, and coding of texts; (f) conversational analysis, which is part of ethnomethodology, analyzing conversation content from a naturalistic perspective; and (g) critical analysis, which examines how the characters within a given context are presented [41].
However, the shortfall in the applicability of qualitative methodologies in theory-driven studies was recognized by McCroskey [42]. The strength and depth of qualitative text analysis techniques in the cultural interpretation of social phenomena remain undisputed in the circles of social scientists. However, from an epistemological standpoint, there has been a strain between qualitative and quantitative text analysis experts. Various methodologists strongly assert that content analysis primarily belongs to the quantitative methodological families [5,21,23,43].
Quantitative text analysis uncovers the co-occurrence of words (or word frequencies) in the process of identifying thematic entities (topics) in large unstructured textual corpora [24,44]. Two vanguard approaches dominated the field of quantitative textual analysis: (1) representational text analysis and (2) instrumental text analysis. Representational text analysis occurs when researchers aim to analyze the meaning of textual corpora, while instrumental textual analysis is used when the researchers apply theories to explain the content of the text itself. Should the research objective be to classify and interpret the textual content itself based on a theoretical framework, an instrumental text analysis technique is most suited (see Figure 1).
Figure 1. Generic types of textual content analysis styles.
Further, quantitative and qualitative approaches to text analysis can be classified into two analytical domains: thematic and semantic [19,45]. With thematic text analysis, the researcher aims to identify and categorize the group of words under a theme or topic, while semantic analysis purports to exemplify the relationships and interpret the contextual meaning of themes within texts [44]. Broad text analysis is applied in explanatory, theory-driven studies, while specific text analysis serves an exploratory research purpose. When the focus of a study is what a participant is saying, then thematic analysis is ideal, yet when the intent is to inquire about the participants’ expressions, then the focus lies on language styles or semantics [44]. Sociologists often use text analysis methods to extract and assign meanings to the textual content, while psychologists focus more on investigating the underlying messages of texts.
4. Computational Text Analytics
Social researchers recognize the potential of computational text analysis specifically in the identification and interpretation of cognitive schemas, texts, and symbols in the process of developing conceptual frameworks that shape institutional meanings [35]. The expansion of text analytics facilitates textual information retrieval and maximizes the hermeneutic utility of web archival data. Text analytics is based on the execution of a series of algorithms, which automatically mine, collect, and analyze textual data. Sebeok and Zeps [46] performed a computer-based information retrieval and text analysis of 4000 Cheremis folktales. A few years later, Hays [47], with his paper Automatic Content Analysis, proposed a semi-automatic computational model with which he analyzed political documents relying on the raw frequency distributions (counts) of words drawn from populations or samples of texts. By the late 1960s, a group of researchers in linguistics, psychology, social sciences, and the humanities, known as “computational stylists”, worked on developing computer programs to manage, process, and analyze texts from historical archives and other domains. Computer-aided textual analysis appears advantageous compared to manual text analysis [48,49,50]. That is, computational text analysis can analyze large collections of textual data quickly, easily, and at a lower cost, while developing explicit conceptual coding schemes that avoid ambiguities and bias. It also facilitates lexicon construction based on empirical, rather than interpretive, measures that contribute to higher content validity. One of the greatest uses of computational analysis is to mine large volumes of textual content.
Text mining involves information retrieval (e.g., document matching, search optimization, etc.), information extraction (e.g., entity extraction, co-reference of words, relationship extraction), concept extraction (e.g., collocation, word association, etc.), natural language processing, web mining (web content mining, web analytics, and web structure analysis), web classification (e.g., websites, blogs, social media, etc.), document classification (document ranking and categorization), and clustering (document similarity).
Depending on research objectives, text mining techniques provide efficient, automated, or semi-automated topic modeling techniques for encoding and analyzing textual data. The rapid evolution of topic modeling has led to the development of more nuanced models widely known as language models (LMs). Grounded in the traditional topic modeling techniques, which will be discussed in a subsequent section of this paper, LMs set the foundations for the development and swift advancement of transformer-based large language models (LLMs) such as the generative pre-trained transformer (GPT) model, the bidirectional encoder representations from transformers (BERT) model [51], the text-to-text transfer transformer (T5) [52], the bidirectional, auto-regressive transformer (BART) [53], prompt topic [54], and the Robustly Optimized BERT Pretraining Approach (RoBERTa) [55]. In brief, LLMs are artificial intelligence (AI) models that identify the patterns and relationships of words in large corpora. They involve the processes of tokenization (where words are reduced to tokens), embedding (where tokens are transformed into numerical values ascribing the meaning of the words), network (layer) transformation (which identifies and weighs the importance of words and detects the context), and prediction (which predicts the sequence of tokens and generates sentences based on the relationships between words and constructs). The choice of modern LLMs strongly depends on the study’s purpose or the task to be performed.
The technical description and discussion of automated LLMs are beyond the scope of this paper. However, below I outline some of the key features and functions of the mainstream LLMs: (1) GPT is an autoregressive generative model that is particularly useful for creative writing and language modeling; (2) BERT focuses on understanding the meaning of the text based on text classification and answering specific questions about the text content; (3) T5 is widely used for text-to-text tasks, showing great versatility across various natural language processing tasks; (4) BART performs both text generation and text interpretation, and it is widely used in text summarization and translation; (5) the prompt topic technique is widely used for tuning the text; and (6) RoBERTa is essentially an improved version of the BERT model that focuses on text classification and understanding.
Traditional Topic Models
LLMs are grounded in the operational framework of traditional text analysis techniques. Traditional topic models generate unique dictionaries and coding categories that detect the underlying themes of large unstructured textual datasets. This operation increases the text’s accuracy and substantive interpretability [56]. Several traditional topic modeling techniques are widely used in social science research, such as Probabilistic LSA (PLSA) [57], Latent Dirichlet Allocation (LDA) [58], and non-negative matrix factorization (NMF) [59].
PLSA employs maximum likelihood estimation to minimize the perplexity of words, solving the problem of polysemy. It also employs statistical techniques to identify the best-fitting model for the distribution of terms across large volumes of documents. PLSA was the first probabilistic text analysis technique and the predecessor of Latent Dirichlet Allocation [58], which estimates topic parameters through Markov chain Monte Carlo simulations. LDA assumes that the distribution of a document D is not dependent on the distribution of words W. Finally, NMF employs factor analysis, representing the terms and the documents in the latent space under the assumption that the results must be centered on factor analysis [59]. This technique dismisses the different scales of textual information, increasing term selection bias [60]. This paper focuses on the foundation of topic models and gives a detailed overview of LSA, which constitutes the origin of topic models that subsequently set the foundations of LLMs.
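To make these operations more concrete, the following minimal sketch, assuming scikit-learn and a small, made-up four-document corpus, fits an LDA model on raw term counts and an NMF model on tf-idf weights and prints the top terms per topic. The corpus, the choice of two topics, and all variable names are illustrative assumptions rather than part of the cited models' original formulations.

```python
# Minimal sketch: fitting LDA and NMF topic models with scikit-learn.
# The corpus and the number of topics (n_components=2) are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [
    "the senate debated the new education budget",
    "teachers protested cuts to school funding",
    "the team won the championship game last night",
    "fans celebrated the victory at the stadium",
]

# LDA is fit on raw term counts.
counts = CountVectorizer(stop_words="english").fit(docs)
X_counts = counts.transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_counts)

# NMF is usually fit on tf-idf weights.
tfidf = TfidfVectorizer(stop_words="english").fit(docs)
X_tfidf = tfidf.transform(docs)
nmf = NMF(n_components=2, random_state=0).fit(X_tfidf)

# Inspect the top terms per topic for each model.
for name, model, vocab in [("LDA", lda, counts), ("NMF", nmf, tfidf)]:
    terms = vocab.get_feature_names_out()
    for k, weights in enumerate(model.components_):
        top = [terms[i] for i in weights.argsort()[::-1][:4]]
        print(f"{name} topic {k}: {top}")
```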
5. LSA
LSA was developed as an information retrieval technique aiming to improve the organization and indexing of libraries’ archives, analyze sociocultural phenomena, and assess public opinion [60,61,62,63,64,65]. It is a linear algebra-based methodology and is classified as a descriptive rather than a probabilistic text analysis technique, employed in a wide range of disciplines such as psychology [44], the theory of learning and meaning [61], operations management [66], business [67], and sociology [25,26,27,28]. LSA involves operations of developing automated dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, or morphologies. It transforms unstructured textual content into meaningful clusters of words, sentences, or paragraphs that reveal the semantic structure of a corpus. To that end, grounded in a least-squares approach, LSA classifies textual data in a document and, based on term-by-term simulations and document-to-term lexical priming data, detects the topic structure of large corpora [57,68,69]. The computational process of LSA involves reducing unstructured textual content into a meaningful number of underlying dimensions (factors) aiming to identify and classify documents into thematic entities (topics). The operations of LSA are summarized in Figure 2.
The generalized model can be expressed as

X = ΛF + ε,

where X is the collection of documents, ΛF denotes the linear combination of topics in large volumes of text data, and ε is the error term.
The LSA first delivers normalized weighted terms by utilizing a technique known as vector space modeling (VSM), an information retrieval technique that presents terms and documents as vectors [70] and ranks individual documents against a query [71,72,73]. The terms (words) and documents form matrix X, where the counts of terms and documents appear in the rows and columns of the matrix. An important function of vector space modeling is that it weighs the term frequency (words), representing the documents in the terms’ space. The weighting method is accomplished by a series of iterations involving the calculation of term frequencies (tf) and inverse document frequencies (idf). Inverse document frequency transformation (tf-idf transformation) is suitable for the pre-processing of unstructured textual content with complex latent structures [69,74]. Briefly, tf-idf devalues the significance of common terms appearing across large corpora, while increasing the importance of terms corresponding to a smaller group of documents [72]. Overall, VSM can be described as a classification and weighting method involving term filtering, stemming, lemmatization, and term weighting; however, it does not involve any data normalization procedure. Normalization of documents is very important because every document must have equal significance within the textual content. The weighted terms across documents can be denoted as

w_ij = tf_ij × idf_i,

where w_ij refers to the weighted term value for each term–document combination.
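As an illustration of this VSM weighting step, the following sketch, assuming scikit-learn and four short made-up documents, builds a tf-idf-weighted matrix. Note that scikit-learn arranges the matrix as documents by terms, the transpose of the term-by-document orientation used above.

```python
# Minimal sketch of the VSM weighting step: build a tf-idf-weighted matrix.
# The four short documents are illustrative placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "religion and science in the daily press",
    "sports and gossip dominate the daily press",
    "science coverage declines in newspapers",
    "newspapers favor sports and gossip columns",
]

# sublinear_tf applies 1 + log(tf); norm="l2" length-normalizes each document
# so that every document carries comparable weight, as discussed above.
vectorizer = TfidfVectorizer(sublinear_tf=True, norm="l2", stop_words="english")
X = vectorizer.fit_transform(docs)          # documents x terms (sparse matrix)

terms = vectorizer.get_feature_names_out()
print(X.shape)                              # (n_documents, n_terms)
print(dict(zip(terms, np.round(X.toarray()[0], 3))))  # weights for document 0
```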
Upon the completion of VSM, the procedures of SVD are initiated to reduce the dimensionality of the textual content. Technically, SVD decomposes the term-by-document matrix X into the product of three matrices: U, Σ, and V^T. The U matrix is a term-by-factor matrix containing the eigenvectors of the XX^T matrix (a term-by-term covariance matrix revealing the dimensions of terms). The term loadings (U) represent the dimensions (factors) revealing the latent meanings, or simply the topics, derived from the text. V^T is a factor-by-document matrix containing the eigenvectors of X^TX, which is a document-by-document covariance matrix showing the loading of each document on each factor. Further, the model employs a unique form of truncated singular value decomposition (TSVD) [75], maintaining only the term frequencies of great significance. TSVD is one of the dimensionality reduction techniques used in LSA procedures to remove dimensions (factors) of low importance. It is based on principal component analysis (PCA), which involves an orthogonal transformation of the correlated factors, converting them into a set of principal components that reveal the lowest possible dimensional space. Simply, TSVD discards the dimensions with smaller singular values and filters out less significant variations. Also, it effectively deals with the problem of polysemy of words and significantly affects inquiry performance. In brief, TSVD reduces the dimensions of the textual content of large corpora [76].
In mathematical notation, TSVD can be expressed as

X_k = U_k Σ_k V_k^T,

where X is the t × d term-by-document matrix, U_k is the truncated term-by-factor matrix, V_k^T is the truncated factor-by-document matrix, Σ_k is the truncated factor-by-factor (diagonal) matrix of singular values, and k denotes the rank of the retained singular values. Finally, using the property of orthonormality, U^TU = I and V^TV = I, where I is a k × k identity matrix, we attain the term loadings (T) and document loadings (D) expressed as follows:

T = U_k Σ_k and D = V_k Σ_k,

where T is a term-by-factor matrix grounded in the association between terms and latent topics, while D is a document-by-factor matrix showing the relationship between documents and the clusters of terms that reveal latent topics.
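A minimal sketch of this TSVD step follows, assuming the X matrix and vectorizer from the previous sketch and an illustrative choice of k = 2 factors. Because scikit-learn’s matrix is documents by terms, fit_transform yields the document loadings and components_ yields the term directions, reversing the U/V roles relative to the term-by-document notation above.

```python
# Minimal sketch of TSVD for LSA: retain only the k largest singular values.
# X and vectorizer are assumed to come from the tf-idf sketch above; k = 2 is illustrative.
import numpy as np
from sklearn.decomposition import TruncatedSVD

k = 2
svd = TruncatedSVD(n_components=k, random_state=0)
doc_loadings = svd.fit_transform(X)                         # documents x k
term_loadings = svd.components_.T * svd.singular_values_    # terms x k

# Top terms per latent factor, prior to rotation and labeling.
terms = vectorizer.get_feature_names_out()
for j in range(k):
    top = np.argsort(np.abs(term_loadings[:, j]))[::-1][:4]
    print(f"factor {j}: {[terms[i] for i in top]}")
```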
Further, the analysis of textual data involves the identification of the optimal number of dimensions, given that humans understand oral or written speech at different layers of abstraction. Selecting too few dimensions may discard important information from the textual content, while too many dimensions would increase noise and may generate overfitting. Indicatively, Deerwester et al. [60] suggested that 70 to 100 dimensions offer an ideal solution when 5000 terms are extracted from 1000 documents. In most LSA studies, the decisions concerning the number of dimensions—representing the entire textual content—appear to be random, inconsistent, and widely dispersed [69,77].
Gombay and Horváth [78] significantly contributed to the change-point literature by finding that the null asymptotic distribution is of a double-exponential extreme-value type, while Zhu and Ghodsi [79] developed a dimensionality detection design using the Profile Likelihood Function (PLF). PLF detects the change point of the eigenvalues, signifying a cluster of terms dividing the textual content into groups of distinct dimensions. Their method consists of explicitly constructing a model for the dimensions d_j (j = 1, 2, …, n) and estimating the position of the “gap” or the “elbow” in a scree plot. The model assumes an ordered sequence of eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_n, where the first q and the remaining n − q eigenvalues are modeled as normally distributed with means μ_1 and μ_2 and a common variance, and tests the following hypotheses:

H0. μ_1 = μ_2 (there is no change point in the sequence of eigenvalues).

H1. There is an unknown 1 ≤ q ≤ n − 1 such that μ_1 ≠ μ_2, with the shift occurring after position q.

Rejection of the null hypothesis H0 indicates the optimal candidate number of dimensions q.
Figure 3 abstractly illustrates the locus of the change point in a given distribution.
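The following sketch illustrates the profile-likelihood idea under simplifying assumptions: for each candidate change point q, the ordered eigenvalues are split into two groups modeled as normal with a common pooled variance, and the q that maximizes the profile log-likelihood is retained. The eigenvalue vector shown is an illustrative placeholder, not output from the cited studies.

```python
# Sketch of the profile-likelihood criterion for choosing the number of dimensions:
# split the ordered eigenvalues at each candidate point q, model both groups as
# normal with a pooled variance, and keep the q with the highest profile log-likelihood.
import numpy as np
from scipy.stats import norm

def best_change_point(eigvals):
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    n = len(vals)
    scores = []
    for q in range(1, n):                       # candidate change points
        g1, g2 = vals[:q], vals[q:]
        mu1, mu2 = g1.mean(), g2.mean()
        # pooled (common) variance across the two groups
        s2 = (((g1 - mu1) ** 2).sum() + ((g2 - mu2) ** 2).sum()) / n
        s = np.sqrt(max(s2, 1e-12))
        ll = norm.logpdf(g1, mu1, s).sum() + norm.logpdf(g2, mu2, s).sum()
        scores.append(ll)
    return int(np.argmax(scores)) + 1           # best q (1-based)

eigvals = [9.1, 7.8, 6.9, 1.2, 1.0, 0.9, 0.8, 0.7]   # illustrative scree values
print(best_change_point(eigvals))                     # -> 3
```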
Once the optimal number of dimensions is detected and the textual data are condensed (through TSVD), the next step of the LSA is to improve the interpretability of the initial results with the use of varimax rotations [66]. The function of varimax rotations improves the explanatory ability of the term and document loadings. Even though many techniques of factor rotation can be performed, varimax rotations significantly increase the interpretability of the results.
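For completeness, a compact sketch of the standard iterative, SVD-based varimax algorithm that could be applied to the term-loading matrix obtained above; this generic implementation is my own illustration and is not drawn from the paper or from reference [66].

```python
# Generic varimax rotation sketch for a (terms x factors) loading matrix.
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Return the varimax-rotated loading matrix."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)            # rotation matrix, refined iteratively
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag(np.diag(Lr.T @ Lr)))
        )
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):   # stop when the criterion no longer improves
            break
        d = d_new
    return L @ R

# e.g., rotated_terms = varimax(term_loadings)  # term_loadings from the TSVD sketch
```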
The last step of the LSA procedure is to label the factors based on their highly loaded terms. Factor labeling involves the interpretive understanding of human subjects (i.e., researchers, a panel of experts) who are assigned to label the extracted topics. For instance, the labels of the extracted topics could be based on the development of a Thurstone scale [33,80] or the Delphi method [81,82]. To develop Thurstone scales, researchers initially assign preliminary labels to each topic based on the cluster of highly loaded terms (precoding), and then a group of experts is surveyed to evaluate the level of their agreement with the initially assigned labels. The aggregated scores assigned by the panelists determine whether the labels should be retained or dismissed. Similarly, the Delphi method relies on a panel of experts who participate in multiple reiterative rounds of developing and refining the coding schemes (topic labels) based on the clusters of words indicating unique dimensions.
In sum, Landauer and Dumais [61] illuminated the strengths of the LSA model as a descriptive tool that classifies and thematically attributes textual data with minimal human intervention. It is a technique that increases the analytical velocity of large volumes of textual data, presenting them in low-dimensional spaces, enhancing the interpretive understanding of semantic relationships, and generating topic structures of unstructured texts. Several limitations appear in the LSA procedures. For instance, LSA does not capture the context where similar words appear; therefore, polysemy may taint the interpretability of similar word occurrences that may not have the same meaning in different contexts. Also, due to a heavy reliance on statistical procedures focusing on the co-occurrence of words, LSA may generate inaccurate results if the sample of words or documents is not representative. Further, the dimensionality reduction process can dismiss important semantic information. Moreover, it does not account for the syntactic structure of the sentences, which again can generate inaccurate interpretations of meanings. The optimization and accuracy of the LSA procedures rely heavily on preprocessing (i.e., tokenization, stemming, stop words) and the transformation of textual information into numeric values that are used for the analytical stages of dimensionality detection (i.e., SVD, TSVD).
Figure 4 summarizes the procedures involved in LSA as described in this section.
6. Sentiment Analysis
Besides its use in detecting the themes or topics of large corpora, topic modeling has been extensively used to understand the themes of online interactions and opinion mining [83]. Semantic and sentiment analyses have rarely been used as a joint topic–sentiment modeling approach [84]. Without a doubt, textual content can be seen as organized thematic domains representing facts, sentiments, or a combination of both. Texts represent mental images, as well as opinions or judgments over factual or emotional realizations [85]. Recognizing the importance of the distinction between objective and subjective textual expressions, Liu [86] proposed an analytic instrument for scaling textual data of subjective expressions, called sentiment analysis (SA).
There are three levels at which scientists perform SA for opinion mining: (1) document-level sentiment classification [84,87]; (2) sentence (feature)-based sentiment analysis [88]; and (3) entity- and aspect-level classification [89]. Document-level sentiment classification categorizes a set of opinionated documents based on the identification of positive and negative words revealing the general sentiments represented in textual contents. The procedure involves ordering subjective and objective sentences and generating scales of positive, negative, or neutral opinions. Sentence- or feature-based sentiment analysis is a model extracting the sentiments over targeted words or clauses within a sentence. Finally, entity- and aspect-level sentiment analysis focuses on counts of words and clauses revealing positive, negative, or neutral opinions toward specific entities and their aspects.
Overall, SA is an instrumental textual analytic technique designed to assess the manifest opinions expressed in verbal and non-verbal content [87]. Unlike topic extraction and topic modeling techniques, sentiment analysis focuses on opinion retrieval and sentiment extraction, rather than classifying and reducing the dimensions of the textual content. It is suitable for the analysis of opinionated textual frames [86,88,90], and it can be performed at different levels of analysis given the unit of granularity used to extract information from different domains [91] that could be detected by a topic extraction or topic modeling technique (i.e., LSA).
EASA
EASA classifies the sentiments of predefined topics [92]. EASA was first known as feature-based opinion mining and summarization [89]. Lexicon-based sentiment classification is based on dictionaries containing lists of words, idioms, metaphors, and clauses expressing positive or negative sentiments [93,94]. The main elements of EASA involve the entity (what is being discussed), the aspect term (words or phrases linked to the aspect), the aspect category (higher-level classification and grouping of words or phrases), the opinion term/expression (words conveying a sentiment), and sentiment polarity (positive, negative, and neutral sentiments). It also involves a unique procedure that identifies sentiment shifters within a sentence. Following Ding’s [94] lexicon-based approach, EASA consists of four operational steps: (1) mark sentiment words and phrases, (2) apply sentiment shifters, (3) handle but-clauses, and (4) aggregate opinions.
The first step involves assessing words and phrases that express a sentiment; the model assigns scores to words or clauses within a sentence. Each word revealing a positive sentiment receives a score above zero (+), while each word revealing a negative sentiment is rated below zero (−). Further, sentences lacking sentiment words receive a score of zero, indicating sentiment neutrality. Sentiment words such as “love, like, favor, good, fabulous, etc.” are classified at different scales of positive opinion on a topic. On the other hand, words such as “hate, horrible, dislike, awful, etc.” reveal a scale of negative sentiments. For instance, in simplistic terms, the sentence “I love this movie” signifies a stronger positive sentiment than the sentence “I like this movie”; therefore, the first sentence receives a higher sentiment score than the latter. The same applies to sentences revealing negative sentiments; “I hate this movie” receives a lower (more strongly negative) sentiment score than the sentence “I dislike this movie”. The second step involves the process of sentiment or valence shifters [93,95]. This procedure focuses on words or expressions changing the direction of sentiments within the same sentence. For instance, words such as “not, but, never, neither, etc.” are considered sentiment shifters. The third step deals with the but-clauses that appear to boost or contrast the sentiment expressed in a sentence. If there is a conjunction “but” between two sentiment words, the algorithm divides the sentence into two parts of opposing sentiments. Finally, the fourth step aggregates the opinions found in a sentence.
The simplest way to find the total sentiment score of a sentence is to add up all the scores of the opinion words appearing in the sentence [89]. A more sophisticated technique calculates the sentiments for each aspect of the textual content. In mathematical notation, Liu’s [89] model is expressed as

score(A_i, S) = Σ_j SW_j.SO / dist(SW_j, A_i),

where S is the sentence, A_i is the aspect, SW_j is a word expressing a sentiment, dist(SW_j, A_i) is the distance between the sentiment word and the aspect in the sentence, and SW_j.SO is the sentiment score of SW_j. The advantage of this technique is that it jointly weighs the sentiment words based on their distance from the aspects presented in the textual content. SA procedures can be summarized in seven steps (see Figure 5).
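A minimal sketch of this lexicon-based, distance-weighted aspect scoring follows, assuming a tiny hand-made sentiment lexicon and shifter list; the but-clause step is omitted for brevity, and the lexicon, example sentences, and function name are illustrative assumptions rather than the published dictionaries or code.

```python
# Sketch of distance-weighted aspect scoring with a toy lexicon and shifter handling.
SENTIMENT = {"love": 2, "like": 1, "good": 1, "dislike": -1, "hate": -2, "awful": -2}
SHIFTERS = {"not", "never", "no"}

def aspect_score(sentence, aspect):
    tokens = sentence.lower().split()
    aspect_pos = tokens.index(aspect)
    score = 0.0
    for j, tok in enumerate(tokens):
        if tok in SENTIMENT:
            s = SENTIMENT[tok]
            if j > 0 and tokens[j - 1] in SHIFTERS:     # step 2: sentiment shifters
                s = -s
            score += s / max(1, abs(j - aspect_pos))    # weight by distance to the aspect
    return score

print(aspect_score("the plot was good but the acting was awful", "acting"))  # approx -0.67
print(aspect_score("i do not like this movie", "movie"))                     # -0.5
```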
As discussed in the previous section, LSA lacks specific contextual understanding, as it performs broader classifications and only detects document-level topic domains. Performing EASA on the documents attributed to specific topics—detected by the LSA operations—could provide more guided insights into the text and measure the sentiment of sub-topical domains (entities). This synthesis could reveal evidence of opinion polarization or consensus within the contexts (entities) of larger thematic entities. Such an incorporated semi-automated approach could be beneficial in discourse analysis, and especially in large texts generated by unstructured interviews or semi-structured focus interviews, conversational analysis, and more.
EASA has various limitations due to the absence of universally accepted reliable metrics. Also, there are challenges in accurately evaluating the sentiments in large texts due to contextuality and bias [96]. The complexity of differentiating sentiments expressed in different aspects of the text generates several limitations regarding the accuracy of the results. EASA may produce ambiguous interpretations in texts where sentiments are implied rather than manifested in a given aspect. For instance, the sentence “I watched better movies before” carries an implicit negative sentiment. Yet, EASA cannot accurately ascribe a negative score due to the absence of a negative word (construct) in the sentence. The accuracy of lexicon-based EASA is a crucial component that typically relies on syntactic dependency parsing (analysis of the structure of the sentence to detect sentiment words) and lexicons tailored for aspect-level analysis. Attributing the correct sentiment scores remains challenging, as the summation of sentiment scores in sentences containing multiple contextual domains may be misleading. Also, the lexicon-based approach of EASA is static, offering limited coverage and presenting the issue of bias. Finally, expressions that have a sarcastic or ironic tone in textual form cannot be fully ascribed accurate sentiment scores using EASA.
The reliability and accuracy of sentiment analysis models are traditionally measured by the F1 score [97]. The F1 score evaluates the performance of sentiment analysis models by considering the harmonic (balanced) mean of precision and recall.
Precision measures the accuracy of the sentiment model based on the proportion of correct predictions (true positives) over the sum of all predicted entity sentiments (true positives + false positives). Simply, precision takes the form of

Precision = TP / (TP + FP),

where TP is the true positive, referring to the correct identification of the entity and sentiment, while FP is the false positive, referring to the incorrect assignment of the wrong sentiment to a correctly identified entity in the textual content.
On the other hand, recall captures the proportion of correct identifications of entities and sentiments (true positives) out of all entities and sentiments that should be identified in the text (sum of true positives and false negatives). In mathematical notation, recall can be stated as follows:

Recall = TP / (TP + FN),

where TP is the true positive, referring to the correct identification of the entity and sentiment, while FN is the false negative, referring to the failure of the model to detect an entity and sentiment that exists in a given text.
The F1 score balances the precision and recall metrics. It is derived from the harmonic mean of precision and recall, taking the following form:

F1 = 2 × (Precision × Recall) / (Precision + Recall).
F1 score ranges from 0 to 1, where 0 indicates no accuracy in predictions and a lack of inclusiveness of entities and sentiments, while 1 ascribes the highest degree of accuracy and precision of entities and sentiments in a given text. Overall, F1 is a significant evaluation tool that provides a balanced measure to correctly detect the aspects and precisely assign the sentiment scores associated with them.
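A minimal sketch of the precision, recall, and F1 calculations defined above, using illustrative counts of true positives, false positives, and false negatives:

```python
# Precision, recall, and F1 from raw counts; tp/fp/fn values are illustrative.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0, precision, recall
    f1 = 2 * precision * recall / (precision + recall)
    return f1, precision, recall

f1, p, r = f1_score(tp=80, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f1, 3))   # 0.889 0.8 0.842
```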
7. Topic by Sentiment Maps
Topic mapping is the third and final stage of the proposed integrated model. Upon the completion of LSA and EASA, the frequency (count) of positive, negative, and neutral sentiment scores across the topic categories is organized in a contingency table. Following the data organization and management process, CA can be employed to graphically display and test for the dependence of sentiment scores across topical domains. In brief, CA is a statistical technique designed to create conceptual maps and reveal associations (dependence) across categorical variables [98,99]. The main function of CA is to graphically display categorical data in low-dimensional spaces [100] and test for statistical dependence. CA relies on the χ² statistic to examine whether there is a significant difference (dependence) in frequencies across the attributes of the categorical variables in a contingency table. CA relies heavily on SVD to identify and display categorical data in a low-dimensional space (two to three dimensions). CA constitutes an ideal technique to conceptually map textual data and display the disparities of proximities of sentiment categories (positive, negative, neutral) across different clusters of topics displayed in geometrical spaces.
The first step of CA involves the transformation of the raw frequency table N into a matrix of proportions (correspondence matrix P). The correspondence matrix represents the joint probability distribution of the sentiment-by-topic contingency table and takes the following form:

p_ij = n_ij / n,

where n_ij is the observed frequency in cell (i, j) and n is the grand total of the table. Based on the correspondence matrix P, the marginal probabilities (masses) are calculated for the topic categories (row masses) as r_i = Σ_j p_ij and for the sentiment scores (column masses) as c_j = Σ_i p_ij. These operations of CA produce weights ensuring that sentiment-by-topic categories with higher frequencies have a proportionately higher impact on the analysis. The following step involves the calculation of row and column profiles as vectors generated by simply dividing the topic (row) elements by the topic mass (row mass) and the sentiment (column) elements by the sentiment mass (column mass), respectively. Row and column profiles represent the conditional distribution of sentiment categories across topic categories, generating the centroids in the geometric space of the topic-by-sentiment maps.
Further, in CA, the dimensionality reduction process relies on SVD operations (I described the mathematical operations of SVD in the LSA section of this paper). The appropriate number of dimensions is determined by the inertia (φ), which decomposes the aggregate Pearson chi-square (χ²), attributing the data to lower dimensions by using only the first two principal components that explain the highest percentage of the total variability of sentiments across topics. In mathematical notation, the inertia takes the form of

φ = χ² / N,

where φ denotes the inertia, χ² is the statistic’s obtained value, and N is the total (sum of all row and column frequencies in the contingency table). Typically, dimensions that explain 70% (or more) of the inertia are retained and graphically displayed as a biplot that identifies the patterns and associations between the row variable (detected topics) and column variables (identified sentiments).
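A minimal sketch of these CA computations, assuming NumPy/SciPy and an illustrative three-topic by three-sentiment count table: it forms the correspondence matrix and masses, runs the chi-square test of dependence, computes the total inertia, and extracts two-dimensional coordinates from the SVD of the standardized residuals.

```python
# Sketch of correspondence analysis on an illustrative topic-by-sentiment table.
import numpy as np
from scipy.stats import chi2_contingency

N = np.array([[40, 10, 15],     # rows: topics, columns: positive / negative / neutral
              [12, 35, 18],
              [20, 22, 30]], dtype=float)

n = N.sum()
P = N / n                                   # correspondence matrix
r = P.sum(axis=1)                           # row masses (topics)
c = P.sum(axis=0)                           # column masses (sentiments)

chi2, pval, dof, _ = chi2_contingency(N)    # test of dependence
inertia = chi2 / n                          # total inertia (phi)

# Standardized residuals; their SVD yields the principal coordinates for the map.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S)
row_coords = (U[:, :2] * sv[:2]) / np.sqrt(r)[:, None]      # topic points (2-D map)
col_coords = (Vt.T[:, :2] * sv[:2]) / np.sqrt(c)[:, None]   # sentiment points

explained = sv[:2] ** 2 / (sv ** 2).sum()   # share of inertia per retained dimension
print(round(chi2, 2), round(pval, 4), round(inertia, 3), np.round(explained, 3))
```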
Overall, CA is a powerful visualization tool suitable for the graphical representation of categorical data in two- to three-dimensional spaces. Among other advantages, CA discovers patterns in categorical data without requiring distributional assumptions, provides a unified framework for testing statistical dependence, and measures the degree of association between row and column variables. However, the main limitations of CA are the interpretative complexity of the results, data sensitivity issues (i.e., outliers, sparsity of categories, etc.), and its descriptive nature, which restrains its suitability for explanatory (theory-driven) research.
8. Conclusions
In the last thirty years, interdisciplinary research has contributed to the development of sophisticated techniques performing topic extraction, topic modeling, opinion mining, and more. Collaborative works between social scientists, linguists, computer scientists, and information technology experts have shown great promise in utilizing and making sense of the context of large volumes of textual data. A plethora of data mining techniques have been widely used to attribute textual content in ways that can facilitate both exploratory and theory-driven explanatory studies analyzing social phenomena. Sociologists often analyze cultural frames relying on the basic mechanism of text analysis, “get the text, find the use, and map the meaning” [25,27,35], to analyze historical texts, policies, political discourse, the interaction of digital communities, and more. This paper provided a thorough overview of distinctive as well as intersecting traditional practices of qualitative and quantitative text analysis and conceptualized a streamlined incorporated style consisting of a combination of foundational styles of topic extraction and opinion mining complemented by a multivariate technique for the analysis of categorical data: LSA, EASA, and CA (see Figure 6).
The three distinct traditional techniques can be sequentially used to form a cohesive epistemological model generated by a methodological synthesis that detects the structure of large textual corpora, assesses the sentiments corresponding to the context (aspect) of each detected topic, performs tests of dependence, and graphically displays sentiment categories across topic domains. Specifically, for any form of unstructured data, the integrating model first classifies the text into topical domains (the topics discussed) through the LSA operations; then it assesses the sentiments (opinions) within the subtopics (entities) through EASA; and, upon the generation of contingency tables containing the distribution of positive and negative sentiments across topical domains, CA tests for statistical dependence and displays a topic-by-sentiment map revealing evidence of statistical association (difference) or the lack thereof (no difference). A synthesis of topic models and multivariate statistical techniques may increase the sophistication of developing interpretative coding schemes, as it may improve the ascription of elusive socio-cognitive schemas and support the development of metrics on perceptions, emotions, and sentiments that are occasionally delimited by issues such as discerning bias. Discerning bias is often attributed to hermeneutic discrepancies and operational pitfalls subject to structural and cultural influences in societal settings.
Overall, the ongoing development of various text analytic models has shown great promise in improving the utility and reliability of textual data, increasing the content validity of measures of meaning that enhance the interpretative understanding of social attitudes, opinions, and beliefs. The development of highly advanced LLMs has overshadowed the use of traditional techniques, as they have shown remarkable capabilities far exceeding traditional text analytics. However, traditional text analysis models still hold their position in social science research. Traditional topic models perform specific operational tasks in the research process (i.e., interview coding) that could be enhanced by a streamlined synthesis that could generate reliable data and valid measures of textual content at a much lower cost. Finally, the human intervention in the interpretation of outcomes generated by the semi-automated structure of streamlined topic models could be seen as their main strength, contributing to the dynamic nature of the cultural production of meanings that are deeply grounded in value-oriented, rather than deterministic, frameworks widely used in LLMs. Complementary sequential hybrid techniques could be used to analyze culturally informed convergent and divergent opinions in public discourse, which have traditionally afflicted the notion of social cohesion and solidarity.