1. Introduction
Social media platforms have grown from being entertainment tools to an institutional means of communication, especially in an educational context [1, 2, 3]. TikTok [4] is one of the key platforms used by institutions to reach the younger generation and connect with them, forming a new marketing channel for higher education institutions [5, 6]. With over 1 billion users, predominantly young people between 16 and 24, TikTok offers an opportunity to understand what these users (the new generation of learners) prefer and identify with [7].
The existing literature on social media in the context of higher education institutions focuses on unimodal analysis, primarily text scraping. An important research gap exists in the study of cross-national behavioral trends using a multimodal approach that incorporates video, audio, and text. Contrary to previous research that treated regional user groups as culturally unique [8, 9], the proposed study uses a unified deep learning (DL) model across four countries to validate the hypothesis of algorithmic homogeneity. Furthermore, the proposed method uses offline local LLMs and knowledge graphs to circumvent the privacy and granularity issues faced by API-based studies.
In this work, we aimed to study similarities between people from different European countries by employing a multimodal DL framework that combines natural language processing (NLP), computer vision (CV), and statistical methods and techniques. This is done by designing a system able to process and analyze the public reposts of followers of the official and public TikTok accounts related to technical universities in four European nations, namely Romania, Germany, Italy, and Russia. To achieve this objective, we design and describe the Media Information Processing System (MIPS).
Although there is no precise target audience (a person following a university account might be a student, professor, stakeholder, and so on; moreover, there is no evidence regarding the person’s birthplace, residence, and nationality), our analysis revealed a surprising degree of similarity in persona preferences for university content across Europe.
However, we assumed that most of the followers of a university account are young and have some connection with the university being studied [7, 10]. We also assumed that most of the followers of a university account are from the same region, since large universities are poles of attraction for students from neighboring areas [11].
Our claim that most university account followers are from the same region and have some connection to the university is supported by several studies on social media followership, which indicate a locality effect. In particular, the followers of an account tend to cluster geographically near the origin of the account holder, influenced by shared language and cultural context (see [12, 13, 14]). The same conclusion is also endorsed by a Eurostat study from 2023, which states that about 91.6% of students were studying in their own country rather than internationally (see [15]). This fact, corroborated by the findings in [16], which show that most of the students (about 90%) are not long-distance movers (more than 50% of students are within a distance of 91 km), constitutes empirical evidence for our assumption. Correspondingly, people generally remain within their home region for education, which suggests that a similar local bias may be present among university followers on social media. Nevertheless, even though these arguments provide robust evidence for a predominant locality effect, they constitute only indirect evidence because students’ demographic data from the studied universities are not publicly available. It is also worth mentioning that social media dynamics might transcend regional boundaries due to viral content or targeted international campaigns; hence, the results might be biased to a certain degree.
Online platforms create a shared space that ignores geographical limits and leads to convergence in preferences [17, 18]. Conversely, according to a broad understanding within the social sciences, regional differentiation is commonly highlighted [19]. However, STEM education creates a shared culture among young people in technical universities, regardless of the distance [20].
However, the existing empirical gaps prevent a precise clarification of these aspects. In particular, the link between the characteristics of the institutions (size, degree of specialization, and national framework) and the content preferences of followers has never been addressed within a unified framework of analysis. Two research questions drive this study:
RQ1: Are the content preferences of followers of European technical universities homogeneous despite geographical, linguistic, and institutional differences?
RQ2: Which content clusters show most consistency among universities?
This paper contributes in three ways. First, it provides the first large-scale empirical evidence of preference homogeneity among European technical universities, analyzing 15,520 videos from 2359 sampled followers. Second, the MIPS architecture, a methodological improvement, is a repeatable template for future social media studies. Third, the results challenge existing beliefs formed from the typical social science literature on regional differentiation. Contrary to most previous research, which argues for the existence of a strong degree of cultural differentiation among European regions (see [21]), the present study finds a significant level of homogenization with regard to content engagement levels.
2. Media Information Processing System
This section describes the architecture of the Media Information Processing System (MIPS), a system that aims to extract, analyze, and provide insights from multimedia content.
In this regard, we assume the reader to be familiar with the main concepts related to large language models (LLMs). For more information about this topic, we refer readers to [22, 23, 24]. Interacting with LLMs to maximize their performance and elicit different generative capabilities (e.g., creative, analytical, and so on) has recently emerged as a new field of research. In this respect, prompt patterns and templates are extensively used to calibrate the model’s responses to specific requests (see [25, 26, 27, 28]). Both proprietary, closed-source LLMs accessed as online services (such as OpenAI’s ChatGPT [29], Google’s Gemini [30], and Microsoft Copilot [31]) and open-weight models running locally (such as Llama [32] and Gemma [33]) are commonly used for prompt testing and output quality answers (with regard to translation into plain English, summarization, knowledge extraction, and so on). However, for our implementation, we selected offline local LLMs, namely Llama v3.3, Llama v4, and Gemma v3 (via Ollama [34]), along with the Whisper automatic speech recognition system [35], to ensure data privacy. Prompt optimization tools [36, 37] were employed as well to improve and test prompts. The extracted knowledge was stored in the Neo4j (Kernel 2025.09.0) graph database [38], which provides an effective and scalable solution to store and query complex relationships between entities.
Last but not least, we refer readers to [39, 40] for the early versions of the MIPS framework, which provided the foundation for the architecture of the system used in our study.
The novelty of the MIPS framework lies in the privacy-preserving and neuro-symbolic approach that it employs. Unlike other tools that rely on cloud-based APIs, the proposed framework leverages LLMs (such as Llama and Gemma) in an offline manner to process the scraped data without the need to rely on third parties, thereby adhering strictly to the GDPR. Also, the proposed framework incorporates a knowledge graph layer, which allows the unification of fragmented entities before statistical processing.
MIPS was designed as a multimodal DL pipeline to process various types of media content, such as text (transcripts), images (frame descriptions), and audio (if applicable), and to extract meaningful information from social media platforms like TikTok. In MIPS, each stage considers the result of the preceding stage’s processing and thus forms a continuous flow of information from data capture to prediction.
Media Information Gatherer (MIG) captures information from publicly available domains or repositories or from qualified sources.
Content Analyzer (CA) uses artificial intelligence (AI) to analyze media content. This includes image and video analysis through CV in the extraction of relevant information, as well as analysis of transcripts through speech recognition and NLP in the extraction of information from audio or video material.
Knowledge Graph Designer (KGD) builds the knowledge graph from analyzed data, identifying significant individuals, organizations, and concepts in the data. Moreover, this involves the extraction of relationships in mapping entities and relationships among those entities.
Statistics Profiler (SP) extracts profiles from data queried by KGD.
Predictive Profiler (PP) uses machine learning (ML) and statistical modeling to predict future trends or behaviors based on the profiles developed using SP.
The overall architecture discussed above is represented in Figure 1.
Descriptions of each part of the MIPS processor are elaborated in the next sections.
For this research, publicly available data on the TikTok social media platform was used, including not only the videos themselves but also the other data parameters accompanying them, thus providing a vast database for analysis. Specifically, we used data from reposts made by followers of technical universities in Europe. By examining how these users engage with content, our aim was to identify relevant patterns related to subject interests and engagement parameters that may influence specialized academic social groups. Note that this module focuses primarily on data from the TikTok platform; however, data from other platforms, such as other social networks or specialized academic platforms, would also be useful for understanding users and would provide more varied data from different users on different platforms.
This module takes a video clip as input; however, the underlying process pipeline can handle many different types of input, such as text, audio, or sensor information. The specifics of this module are detailed below.
Pre-processing and Chunking
A video clip is first checked against a minimum duration threshold to determine its suitability for analysis (for example, a clip shorter than 10 s can be assumed not to contain rich information). If the clip exceeds the threshold, the soundtrack is extracted and split into discrete chunks. Each chunk is then subjected to speech detection using a parallel processing approach that assigns each chunk to an available graphics processing unit (GPU). This enables efficient concurrent processing of individual chunks.
The chunk length is fixed at 10 s: assuming that the average person utters 2–3 words per second, a 10 s chunk contains 20–30 words, i.e., approximately three sentences. The transcription of each chunk is performed using the Whisper automatic speech recognition (ASR) system (see [41]) with the medium model. The system first determines the speech segments in each audio recording and discards those that represent noise or music (hence facilitating a further analysis of spoken content that represents opinions, facts, and so on). The identified speech segments are then processed by the Whisper ASR engine, and the transcriptions obtained for the chunks are combined in the original order.
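The chunking and ordered reassembly described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MIPS implementation: the function names are hypothetical, a thread pool stands in for the per-GPU dispatch, and `fake_transcriber` is a placeholder for the speech detection and Whisper call (it returns an empty string for a chunk without speech).

```python
from concurrent.futures import ThreadPoolExecutor

MIN_DURATION = 10.0   # minimum clip duration in seconds (example value from the text)
CHUNK_SECONDS = 10.0  # chunk length in seconds (value from the text)


def is_suitable(duration, min_duration=MIN_DURATION):
    """A clip is analyzed only if it is longer than the duration threshold."""
    return duration > min_duration


def chunk_bounds(duration, chunk=CHUNK_SECONDS):
    """Return (start, end) times in seconds covering the whole soundtrack."""
    bounds = []
    start = 0.0
    while start < duration:
        bounds.append((start, min(start + chunk, duration)))
        start += chunk
    return bounds


def transcribe_clip(duration, transcribe_chunk, workers=4):
    """Transcribe chunks concurrently and join the texts in the original order."""
    bounds = chunk_bounds(duration)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, whatever the completion order.
        texts = list(pool.map(lambda b: transcribe_chunk(*b), bounds))
    # Chunks without detected speech (empty text) are dropped.
    return " ".join(t for t in texts if t)


def fake_transcriber(start, end):
    """Hypothetical stand-in: the second chunk contains only music."""
    return "" if start == 10.0 else f"[{start:.0f}-{end:.0f}]"


if __name__ == "__main__":
    print(chunk_bounds(25.0))
    print(transcribe_clip(25.0, fake_transcriber))
```

Because the results are collected in input order, the combined transcript preserves the original sequence of chunks even when the workers finish out of order.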
The prompt used to perform the translation and summarization is given below.
Translate the text into English and build a very brief and concise bullet list version, including only the essential and relevant information. You must use the complete names for the identified entities such that each statement in the list could be self-understandable and will refer to the same entities as the previous ones. The output must have around 150 words (plus or minus 10 words). Use simple telegraphic and logically connected sentences in plain English. Obey these requirements strictly and provide only the telegraphic sentences without any additional text. Here is the text:
Multimodal Contextualization
During the process of transcription and summarization, frames of the video clip are extracted and analyzed through the LLM. This step provides additional information, thus improving the interpretation of summarized text information. The result obtained here helps in moving towards the next step of analysis: knowledge extraction.
For this task, several frames are extracted from the clip, resized, and then fed to the LLM gemma3:12b-it-qat [43] to obtain a meaningful description of the context in which the speech took place. The prompt used is given below:
Describe the following sequence of images extracted from a video clip in at most three sentences.
This entire pipeline (from chunking the clip’s audio track to generating a concise summary and getting the descriptions of frames) ensures that even long video clips are efficiently processed into actionable insights.
The module uses machine learning and natural language processing algorithms to structure data into entities, relationships, and attributes (a similar approach was employed in [40]).
Entity Disambiguation and Knowledge Graph Construction
Entities and relationships are identified, then LLMs and ML are used to disambiguate entities and construct a knowledge graph. This process resolves ambiguities and inconsistencies in entity identification. The result is a structured representation of the multimedia content.
The LLM is given a prompt to extract assertions, summarize them, and recognize unique entities and relationships in a structured JSON format (the exact prompt structure can be seen in Appendix B). As a result, the LLM outputs a JSON for each chunk. However, due to the probabilistic nature of the LLM, entities representing the same concept might be named differently. Consequently, in order to reduce possible ambiguities, the LLM is invoked once more with the JSON created as above and with the following prompt:
The following text is a JSON containing ’assertions’, each of them containing ’knowledge’ information. The ’knowledge’ contains ’entities’. Identify potential ambiguities among all entities, considering contextual clues such as their relationships with other nodes. Use your understanding of domain-specific knowledge to disambiguate any ambiguous entities. Unify the name and type of entities across the entire text. Produce only the JSON output following the same structure strictly. Here is the JSON:
Once each video chunk has been processed to generate its corresponding JSON output, one can notice that the raw outputs are subject to variability. Since the LLM operates probabilistically, the same underlying concept may be rendered under different names or with slight differences in type. This variability can lead to fragmented representations in which a single real-world entity is split into multiple nodes, each potentially with its own set of properties.
Hence, to address these possible inconsistencies, the above-defined disambiguation step is invoked and used to obtain a proper representation of knowledge data.
The response from the model, which is inserted into the knowledge graph stored in a Neo4j graph database, has to obey a strict JSON structure given by a precise schema. The response is first sanitized, and then possible minor errors are automatically corrected based on the provided schema (e.g., using json-corrector) prior to loading.
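As an illustration of the sanitization step, the sketch below strips surrounding chatter and code-fence markers from a model reply, parses the JSON object, and checks for a required top-level key. It is a simplified stand-in under stated assumptions: only the `assertions` key is taken from the disambiguation prompt, the full schema (Appendix B) is not reproduced, schema-based auto-correction is not attempted, and the function name is hypothetical.

```python
import json
import re

# Only 'assertions' is known from the prompt; the full schema is an assumption left out here.
REQUIRED_KEYS = ("assertions",)


def extract_json(response):
    """Remove code-fence markers and chatter, then parse the outermost {...} object."""
    cleaned = re.sub(r"`{3}(?:json)?", "", response)
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model response")
    data = json.loads(cleaned[start:end + 1])
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data


if __name__ == "__main__":
    fence = "`" * 3
    reply = f'Here is the result:\n{fence}json\n{{"assertions": []}}\n{fence}'
    print(extract_json(reply))
```

A reply that cannot be parsed or lacks the required key raises `ValueError`, so malformed output can be retried or repaired before anything is loaded into the graph database.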
It is worth mentioning that the prompt itself also asks the model to perform entity disambiguation, using contextual relations and domain knowledge to distinguish same-named items. Apart from this, entity unification is also done by a deterministic, threshold-based pass that fuses near-duplicates. The similarity between entity names and types is computed from the Levenshtein and Jaro–Winkler distances and the Word2Vec cosine similarity, and entities whose similarity exceeds a given threshold are merged (see [40] for more details). This hybrid approach involving LLM-guided standardization and fixed similarity thresholds proved to be robust for the dataset under study.
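The deterministic, threshold-based pass can be illustrated with a reduced sketch that uses only a normalized Levenshtein similarity and a union-find structure to fuse near-duplicate names. The Jaro–Winkler and Word2Vec components are omitted, and the threshold of 0.85 as well as all function names are assumptions made for illustration (the actual values are given in [40]).

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def name_similarity(a, b):
    """Similarity in [0, 1]: 1 minus the edit distance normalized by the longer name."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def merge_entities(names, threshold=0.85):
    """Group names whose pairwise similarity reaches the (assumed) threshold."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if name_similarity(names[i], names[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


if __name__ == "__main__":
    print(merge_entities(["Young Woman", "young women", "vlog", "Vlog"]))
```

Union-find makes the merge transitive: if A is close to B and B is close to C, all three end up in one group even when A and C fall below the threshold.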
An example of the knowledge data extracted from an analyzed video clip is presented in Figure 2. The arcs between entities also store additional information. For example, the arc in the knowledge graph between the entities Young Women (type: Person, Student) and vlog (type: Content) describes the relation plans to post and stores searchable data extracted from both video frames and transcripts:
domain: social media, content creation, time management
keywords: vlog posting, social media, content creation, time management
meta_source: PUB
meta_timestamp: 1757040133.0
metadata_rel_order: 20
str: the young woman with a gray hoodie plans to post her vlog the next day after editing it in the evening
Based on the above structuring, it is possible to restore the original information with high accuracy and to support relevant insights through contextual relationship analysis.
After the multimedia content has been processed and represented in our JSON-based structure, each assertion includes two important metadata components: domains and keywords. These (comma-separated) lists provide high-level semantic markers that describe the context and content of each assertion.
Combining all domain lists (or the keyword lists) into a single set containing all unique elements from all lists gives the input for the SP module.
The SP module uses this set to analyze the frequency and distribution of topics within statements and to identify dominant themes by calculating the frequency of occurrences.
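The frequency analysis performed on the domain (or keyword) lists can be sketched as below. This is a minimal illustration that assumes each assertion is available as a dictionary with a comma-separated `domain` field, as in the arc example shown earlier; the function name and the second sample assertion are hypothetical.

```python
from collections import Counter


def domain_frequencies(assertions, field="domain"):
    """Count in how many assertions each domain (or keyword) occurs."""
    counts = Counter()
    for assertion in assertions:
        raw = assertion.get(field, "")
        # Normalize and de-duplicate within one assertion before counting.
        items = {part.strip().lower() for part in raw.split(",") if part.strip()}
        counts.update(items)
    return counts


if __name__ == "__main__":
    sample = [
        {"domain": "social media, content creation, time management"},
        {"domain": "social media, education"},  # hypothetical second assertion
    ]
    freq = domain_frequencies(sample)
    print(sorted(freq))          # the set of unique domains
    print(freq.most_common(1))   # the dominant theme
```

The keys of the counter form the set of unique domains passed to the SP module, and the counts give the frequency distribution used to identify dominant themes.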
4. Results and Discussion
For each university considered in our study, the number of distinct domains identified is reported in Table 3.
Since TikTok data are varied but influenced by emerging trends, we performed multiple experiments in order to capture the users’ preference profiles with better accuracy. More precisely, to cover multiple viewpoints, we considered different numbers of clusters when grouping domain embeddings computed using RoBERTa [46]. Additionally, we explored different sizes for the vector embeddings, as we aimed to reduce noise and irrelevant dimensions and hence obtain more accurate, higher-quality clusterings.
By adjusting the number of clusters, we aimed to record relevant changes when the level of detail changed. Our hypothesis was as follows: if a slight change in the number of considered clusters determines observable differences in university-specific preference profiles, it indicates that the data have complexities that cannot be covered by a single clustering solution. Moreover, this approach allowed us to obtain different perspectives, ranging from specificity to generality. By considering many clusters, one can uncover subtle semantic differences, while fewer clusters provide the main subjects or themes. Furthermore, testing different grouping settings assesses the reliability of our analysis, as consistent results indicate underlying patterns. For this reason, we performed the analysis considering two clustering schemes: one that involves 10 to 15 clusters and another that involves 20 to 25 clusters.
Running K-means with different random states yields different clusterings, because the result depends on the initial placement of the centroids, to which the nearest data points are then assigned to form clusters. For a given random state, however, the K-means algorithm is deterministic and produces the specified number of clusters.
To overcome this limitation and extract stable and relevant preference themes from noisy data, we performed multiple runs of K-means, considering different numbers of clusters and different random states. Subsequently, the consensus clustering method was used to identify clusters that occur most of the time across the K-means instances. In particular, in our experiment, we used a number of clusters in the intervals [10,15] and [20,25] and compared the findings. We considered this large number of clusters because of the nature of social media data, which is variable and noisy; this approach allowed the identification and extraction of meaningful patterns.
We conducted 100 runs of the K-means algorithm initialized with different random states for each given number of clusters.
We then built the consensus matrix $C$, which is a square, symmetric matrix whose entry $C_{ij}$ is the frequency with which the two domains $i$ and $j$ are assigned by the K-means algorithm to the same cluster across all 100 runs (that is, the value is close to 1 if the domains $i$ and $j$ are almost always clustered together and close to 0 otherwise). Once this matrix is computed, we obtain the distance matrix $D = \mathbf{1} - C$, where $\mathbf{1}$ is an all-ones square matrix. The distance matrix measures the dissimilarities between any two domains, and it is used to perform agglomerative clustering. More precisely, the agglomerative clustering initially places each domain in its own cluster and then iteratively merges the closest clusters until the specified number of clusters is reached. In this case, the distance between two clusters is obtained as the average of the distances between all pairs of data points from the two clusters. Consequently, the consensus clusters are obtained. To assess each consensus cluster, we compute its stability score by measuring the average co-association similarity among its members. The results obtained after performing these steps are as follows:
For example, the consensus cluster data look like
The stability score for each cluster is as follows:
Taking all clusters that exceed the stability threshold of 0.75 (a score that indicates a true semantic theme and yields statistically meaningful confidence), we were able to identify the most impactful domains.
In our case, the output shows 32 stable clusters:
The SP module names each stable cluster, and in case the same name is found for different clusters, they are merged. This approach reduces noise and creates meaningful semantic categories. However, there might be cases where very similar (but distinct) names are not merged (these cases happen because the LLM might give slightly distinct names even for very similar clusters). For example, in a certain instance (the 10–15 clusters case, PCA size = 768), we obtained the cluster names “Social Interactions and Community” and “Social Interactions Community”. Although these cases appeared, given the small number of resulting clusters, we did not consider further merging very similar clusters (e.g., by employing a string similarity algorithm), as the main themes could be easily extracted.
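The consensus procedure described earlier in this section (co-association frequencies, distance taken as one minus the consensus value, average-linkage agglomeration, per-cluster stability, and the 0.75 stability threshold) can be sketched with NumPy and SciPy. This is an illustrative sketch, not the authors' code: the K-means label arrays are assumed to be given, the function names are hypothetical, and assigning a stability of 1.0 to a singleton cluster is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def consensus_matrix(label_runs):
    """C[i, j] = fraction of runs in which items i and j share a cluster."""
    runs = np.asarray(label_runs)                     # shape (n_runs, n_items)
    same = runs[:, :, None] == runs[:, None, :]       # shape (n_runs, n, n)
    return same.mean(axis=0)


def consensus_clusters(C, n_clusters):
    """Average-linkage agglomerative clustering on the distance D = 1 - C."""
    D = 1.0 - C
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


def stability_scores(C, labels):
    """Mean co-association over all member pairs of each consensus cluster."""
    scores = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if len(idx) < 2:
            scores[int(c)] = 1.0  # assumption: a singleton is trivially stable
            continue
        sub = C[np.ix_(idx, idx)]
        upper = np.triu_indices(len(idx), k=1)
        scores[int(c)] = float(sub[upper].mean())
    return scores


if __name__ == "__main__":
    # Four hypothetical K-means runs over four domains.
    runs = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [1, 1, 0, 0]]
    C = consensus_matrix(runs)
    labels = consensus_clusters(C, n_clusters=2)
    scores = stability_scores(C, labels)
    stable = [c for c, s in scores.items() if s > 0.75]
    print(labels, scores, stable)
```

Because the consensus matrix depends only on whether two items share a cluster, the arbitrary renumbering of clusters between K-means runs does not affect the result.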
Remark 1.
In this study, we generated representative cluster names using multiple large language model configurations—including Llama 3.3, Llama 4, and Gemma 3—each evaluated under five temperature settings () to capture a broad spectrum of generative variability. The semantic similarity between generated names was measured using cosine similarity over RoBERTa embeddings. In particular, using a 20-cluster schema extracted from the PUB university dataset as a representative baseline, every model–temperature combination produced candidate names for the same underlying clusters. Beyond evaluating cross-configuration consistency, we performed an extensive Monte Carlo robustness analysis, in which we repeatedly sampled 70%, 80%, and 90% of each cluster’s word set and generated 30 independent naming runs per condition. This allowed us to examine the stability of the naming mechanism when the input information was deliberately perturbed.
Across all experiments, naming remained highly consistent: even with heterogeneous LLM architectures and temperature values, most generated names were semantically close to one another and to the baseline (full-information) name. While cross-temperature comparisons for a given model (e.g., Llama 3.3) showed very high agreement (mean cosine similarity ), the Monte Carlo results across all clusters, models, and temperatures further strengthen this conclusion. The mean semantic similarity between subsampled and baseline names remained high and increased with sampling coverage—0.744 at 70%, 0.762 at 80%, and 0.772 at 90%, each with narrow 95% confidence intervals of —based on 30 Monte Carlo sampling runs per condition. This monotonic rise confirms that the naming process remains stable even when up to 30% of the cluster vocabulary is randomly dropped. However, it is worth noting that some clusters are inherently easier to name consistently (several have mean subsample-to-baseline similarity >0.85 at 90%), while others induce paraphrase changes (the lowest score is at 90%). In this respect, a small number of low-scoring clusters decrease the overall mean, leading to semantic-similarity averages in the range.
Nevertheless, the consistency observed across all experiments suggests that—even with different LLMs and temperature settings—there is an inherent semantic structure in the data that guides the generation of representative names. Importantly for the argument, the topics identified through this algorithmic approach directly map onto well-established thematic domains in sociology, psychology, public health, communication and media studies, political science, anthropology, and cultural studies (which reinforce the external validity of the generated labels). Taken together, the strong alignment between automatically generated cluster names and recognized domains of human interest strengthens confidence in using these names as reliable indicators of cluster content (even without manual expert review). Overall, our multi-run experiments confirm that the naming mechanism is stable across configurations and that the identified themes have solid grounding in established research, thereby validating both the robustness and the interpretability of the resulting cluster names.
For example, Figure 7 reports the named stable clusters discovered by the SP when considering the 20–25 cluster scheme and a PCA dimensionality of 160.
Taking into account all stable clustering results for all clustering schemes (10 to 15 and 20 to 25 cluster schemes) and for all PCA sizes, we obtain the following main common themes:
Social Interactions and Community (Social Interactions and Dynamics, Social Interaction Concepts, Social Interaction Context, Social Interactions and Community, Social Interactions Community, Social Community Dynamics, Social Community Interactions, Social Interactions and Society).
Food and Cuisine (Food and Cuisine, Food and Dining, Food and Cooking, Food and Beverage, Food and Beverage Industry).
Personal Identity and Characteristics (Personal Identity Traits, Personal Characteristics, Personal Characteristics Traits).
Human Concepts and Behavior (Human Nature Concepts, Human-Related Concepts, Human Behavior Traits, Human Interactions and Behavior).
Youth and Development (Youth and Education, Youth and Child Welfare, Youth and Child Development).
Infrastructure and Systems (Road Infrastructure Systems).
Digital Media Technology.
We may conclude that all these major themes represent the main interest streams of the followers of the studied universities.
However, the above approach represents a broader view of the user themes and their engagement with the universities’ TikTok accounts.
Yet another meaningful view of the data comes from selecting the K-means runs with the highest scores according to the Calinski–Harabasz (CH) metric. For a given K-means run (for 10 to 15 clusters/20 to 25 clusters and 100 runs for each) and the corresponding clustering, this score is defined as follows:
$$CH = \frac{BCSS/(K-1)}{WCSS/(N-K)},$$
where BCSS (between-cluster sum of squares) measures the separation between clusters, WCSS (within-cluster sum of squares) measures the compactness or dispersion within clusters, K is the number of clusters, and N is the total number of data points.
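For concreteness, the score can be computed directly from its standard definition, CH = [BCSS/(K − 1)] / [WCSS/(N − K)], as in the NumPy sketch below. The function name and the toy data are illustrative assumptions; in practice a library routine such as scikit-learn's `calinski_harabasz_score` would normally be used.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the score requires 2 <= K < N")
    overall_mean = X.mean(axis=0)
    bcss = 0.0
    wcss = 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Between-cluster term: cluster size times squared centroid offset.
        bcss += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Within-cluster term: squared distances of members to their centroid.
        wcss += np.sum((members - centroid) ** 2)
    return float((bcss / (k - 1)) / (wcss / (n - k)))


if __name__ == "__main__":
    # Toy one-dimensional data: BCSS = 100, WCSS = 4, K = 2, N = 4.
    X = [[0.0], [2.0], [10.0], [12.0]]
    print(calinski_harabasz(X, [0, 0, 1, 1]))  # (100 / 1) / (4 / 2) = 50.0
```

Higher values indicate clusters that are compact and well separated, which is why the run with the maximum score is selected for each clustering scheme.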
The results for the Calinski–Harabasz (CH) scores are summarized in Table 4.
As reported in Table 4, for the clustering scheme that involves 10–15 clusters, the maximum CH score (52.22) is obtained for k = 10 clusters and a PCA dimension of 160. Figure 8 displays the identified clusters and their representativeness for the followers of each university’s account.
The correlation matrix presented in Table 5 provides insight into whether a theme of interest for one university is also found in another. Higher correlation values indicate that the followers of two universities have similar interests and behaviors, whereas lower values reveal significantly different profiles. Since the values are more than
in the presented case, we may conclude that there exists a strong thematic consistency across the universities’ follower bases. In fact, taking into account all the K-means runs for all PCA sizes, we found that the average of the group profile correlation values is 0.96 (the minimum value being 0.74 and the maximum being 0.99), which confirms our findings and underscores the thematic coherence among (different) university users.
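A profile correlation matrix of this kind can be sketched as follows, assuming each university is described by a vector of theme shares (one value per cluster) and that the Pearson correlation is used; the paper does not restate the coefficient here, so this choice is an assumption. The university labels U1 to U3 and all numbers are made-up placeholders, not results from the study.

```python
import numpy as np


def profile_correlations(profiles):
    """Pairwise Pearson correlations between theme-share profiles."""
    names = list(profiles)
    matrix = np.array([profiles[name] for name in names], dtype=float)
    return names, np.corrcoef(matrix)  # rows are treated as variables


def mean_offdiagonal(R):
    """Average correlation over all distinct pairs of universities."""
    upper = np.triu_indices(R.shape[0], k=1)
    return float(R[upper].mean())


if __name__ == "__main__":
    # Hypothetical share of reposts per thematic cluster (each row sums to 1).
    profiles = {
        "U1": [0.40, 0.30, 0.20, 0.10],
        "U2": [0.38, 0.32, 0.19, 0.11],
        "U3": [0.10, 0.20, 0.30, 0.40],
    }
    names, R = profile_correlations(profiles)
    print(names)
    print(np.round(R, 3))
    print(round(mean_offdiagonal(R), 3))
```

In this toy example U1 and U2 have nearly identical profiles (correlation close to 1), whereas U3 has the reverse ordering of themes (correlation of −1 with U1).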
The high values of correlation suggest that a student from Timişoara consumes and shares almost the same content mix as a student from Berlin or Turin. Despite their language and geographical differences, the online repertoires of these students show a common interest in universal topics such as academic anxiety, lifestyle hacks, and technology. The algorithmic effect of these platforms and the student identity in relation to STEM could be viewed as a form of cultural equalizer. An interesting implication of these findings is that universities may consider dropping their content strategy in favor of a strategy that generates high-quality content related to universal topics.
The heatmap presented in Figure 9 illustrates the percentage distribution of user preferences across thematic clusters and compares the interests of users from different universities. A comparison between the largest effect sizes (Cohen’s h) per thematic cluster is shown in Figure 10.
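Cohen’s h compares two proportions through the arcsine transformation, h = 2·arcsin(√p1) − 2·arcsin(√p2). The sketch below computes it and picks the pair of universities with the largest absolute effect for a single thematic cluster; the helper name, the university labels, and the proportions are hypothetical.

```python
import math
from itertools import combinations


def cohens_h(p1, p2):
    """Effect size between two proportions in [0, 1]."""
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


def largest_effect(shares):
    """Return ((a, b), |h|) for the pair of universities that differs most."""
    pairs = (
        ((a, b), abs(cohens_h(shares[a], shares[b])))
        for a, b in combinations(shares, 2)
    )
    return max(pairs, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical share of one thematic cluster in each university's profile.
    shares = {"U1": 0.10, "U2": 0.12, "U3": 0.25}
    pair, h = largest_effect(shares)
    print(pair, round(h, 3))
```

By Cohen’s conventional benchmarks, |h| around 0.2 is considered small, 0.5 medium, and 0.8 large, which gives a scale for reading the per-cluster maxima.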
Similar to the previous analysis, we considered the clustering scheme involving 20–25 clusters. In this case, the maximum Calinski–Harabasz score (38.04) was obtained for k = 20 clusters and a PCA dimension of 768. Figure 11 displays the identified clusters and their representativeness for each university’s followers.
The charts from
Figure 8 and
Figure 11 represent the highest
scores for the different clustering schemes and reveal interesting information:
Human Identity and Characteristics: Both charts include categories describing who people are as individuals; they relate to personality, traits, and individual qualities: Personal Identity Traits (in
Figure 8), Human Characteristics Analysis (in
Figure 11), and Human Nature Concepts (in
Figure 11).
Human Life Experiences and Wellbeing Both charts include fields relating to lived human experiences, covering physical and emotional experiences: Human Life Concepts (in
Figure 11), Human Life Experiences (in
Figure 8), Health and Wellbeing (in
Figure 11), Human Health and Wellbeing (in
Figure 8), and Emotional Distress Themes (in
Figure 11).
Social Interaction, Community, and Relationships: Both charts include categories on how people interact, representing social behavior, family life, and group dynamics: Family and Relationships (in Figure 11), Social Group Interactions (in Figure 11), Social Interaction Concepts (in Figure 11), Human Interaction Dynamics (in Figure 8), Social Interactions and Community (in Figure 8), Community and Family Life (in Figure 8), and Social Community Development (in Figure 8).
Media, Technology, and Information: Both charts include topics relating to communication and technology; they are related to information handling and digital domains: Media and Video Production (in Figure 11), Information Processing Services (in Figure 11), and Media and Technology (in Figure 8).
Work, Employment, and Public Services: Both charts contain fields related to public life and professional domains, covering societal infrastructure and labor: Work and Employment (in Figure 11) and Public Services and Leisure (in Figure 8).
Personal Growth, Education, and Self-Development: Both charts show themes of self-improvement and learning, reflecting personal development: Personal Growth Concepts (in Figure 11), Youth and Education (in Figure 11), and Human Life Experiences (in Figure 8).
Social Justice, Global Issues, and Public Perception: Both charts include topics that concern society at large and relate to public attitudes and societal problems: Social Justice Issues (in Figure 11), Global Social Issues (in Figure 11), Public Perception and Preferences (in Figure 11), Public Services and Leisure (in Figure 8), and Social Community Development (in Figure 8).
Interests, Leisure, and Culture: Both charts cover what people enjoy and consume; they are related to hobbies, curiosity, and lifestyle content: Random Knowledge Topics (in Figure 11), Food and Cuisine (in Figure 11), and Diverse Human Interests (in Figure 8).
We also computed the silhouette score, but it proved to be unreliable for high-dimensional embeddings and for the noisy, incomplete, and inconsistent text found in the analyzed social media data. Figure 12 reports the silhouette scores across several PCA dimensions and numbers of clusters. It is worth mentioning that, in our case, the silhouette score was not used as the primary measure of cluster validity. Moreover, this criterion identifies the mathematically optimal clusters, which are not necessarily the business-relevant ones.
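To make the silhouette criterion concrete: for each point, with a the mean distance to the other members of its own cluster and b the mean distance to the nearest other cluster, the coefficient is (b − a)/max(a, b), and the score is the average over all points. The following pure-Python sketch assumes Euclidean distance and uses toy two-dimensional points; it does not reproduce the embeddings, PCA projections, or K-means labels of the study.

```python
import math
from collections import defaultdict

def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data (illustrative only): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]
print(silhouette_score(points, good_labels))
```

Because the coefficient depends entirely on pairwise distances, it becomes less informative when distances concentrate, as they tend to do in high-dimensional embedding spaces.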
From a statistical point of view, to validate whether university affiliation influences content preference, we computed the chi-squared ($\chi^2$) statistic for each clustering iteration. For the large majority of the tests (clustering scheme covering 10 to 15 clusters, with 100 K-means applications using distinct random states), we found that the p-value is, in general, greater than 0.85, which is well above the significance level ($\alpha$ = 0.05). This finding implies that there is no evidence of an association between clusters and groups. For each run, we also compared the chi-squared values with the degrees of freedom, and the results showed that they are lower than the degrees of freedom. This suggests that the observed frequencies are very close to the expected frequencies. The Cramér’s V value indicates that the association between clusters and groups is not only statistically non-significant but also negligible in magnitude. The results are summarized in Table 6 and Table 7.
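The test described above can be sketched as follows. The example computes the Pearson chi-squared statistic, the degrees of freedom, and Cramér’s V for a groups-by-clusters contingency table. The counts are hypothetical and chosen only to illustrate a near-independent table, and the sketch assumes that every expected count is positive.

```python
import math

def chi_square_independence(table):
    """Pearson chi-squared statistic, degrees of freedom, and sample size
    for an r x c contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof, n

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: chi-squared rescaled to the range [0, 1]."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))

# Hypothetical counts: rows are follower groups, columns are thematic clusters.
table = [
    [30, 50, 20],
    [31, 49, 20],
    [29, 51, 20],
]
chi2, dof, n = chi_square_independence(table)
v = cramers_v(chi2, n, len(table), len(table[0]))
print(f"chi2 = {chi2:.4f}, dof = {dof}, Cramér's V = {v:.4f}")
```

A statistic below the degrees of freedom, as in this toy table, is what one expects when observed and expected frequencies nearly coincide. In practice, a library routine such as `scipy.stats.chi2_contingency` can be used to obtain the p-value as well.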
Performing the same tests for the clustering scheme covering 20 to 25 clusters, again with 100 K-means applications using distinct random states, we obtain similar results (see Table 8 and Table 9).
The experiments in our study compare 4800 experimental scenarios with various embedding sizes and clustering levels. The study can be regarded as a robustness test of the proposed method. The basic statistics remain stable: the mean p-values are consistently above 0.90, and all the values of the mean Cramér’s V are lower than 0.05. This verifies that the homogenization effect is indeed embedded in the data and is not due to the adjustment of parameters.
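The figure of 4800 scenarios follows from the experimental grid: two clustering schemes with six values of k each, four PCA sizes, and 100 K-means runs per configuration. A short enumeration, assuming seeds 0 to 99 (the actual random states are not reported), confirms the count.

```python
from itertools import product

# Two clustering schemes, each spanning six values of k.
schemes = {"10-15": range(10, 16), "20-25": range(20, 26)}
pca_dims = (120, 160, 200, 768)
random_states = range(100)  # assumption: seeds 0..99; actual seeds are not given

runs = [
    (scheme, k, dim, seed)
    for scheme, ks in schemes.items()
    for k, dim, seed in product(ks, pca_dims, random_states)
]
print(len(runs))  # 2 schemes x 6 values of k x 4 PCA sizes x 100 seeds
```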
These findings show that each group (set_PTU, set_PUB, set_PUT, set_TUC, etc.) has nearly the same distribution across clusters. As can be noted in the example presented in Figure 8 and Figure 11, the bars for each theme are nearly overlapping. Correspondingly, the groups do not specialize in certain clusters, and clusters contain about the same proportions of each group. These results suggest several possible causes, a few of which are discussed below:
Social media themes are universal across university account audiences.
The groups represent demographically similar populations (presumably students) sharing the same interests.
Social media platforms (like TikTok, for instance) homogenize the discourse; hence, the users tend to develop the same habits or interests.
Social media platforms’ algorithms seem to promote certain “recipes”; hence, users attempt to conform to the globally successful content format. In the long run, this implies a homogeneous feed for all users.
Social platform algorithms expose users to the same type of videos; hence, they are more likely to redistribute the same content.
The psychographic and developmental characteristics of university students create a general life-stage homogeneity, where shared global goals and interests override national differences.
Finally, as can be observed in the correlation matrix (Table 5), the off-diagonal correlations are as follows: min 0.9234, max 0.9927, mean 0.9711, median 0.9753, and standard deviation 0.022. These statistics, as well as the matrix itself, suggest homogeneity among the profiles; the subject areas’ preferences are highly aligned for all universities considered in our study. However, to determine how different universities group together in terms of content preferences, we performed hierarchical clustering on the group profile correlation matrix, taking into account all runs, that is, both clustering schemes (10–15 and 20–25) and all PCA sizes (120, 160, 200, and 768).
The data collected from all 4800 runs (two clustering schemes covering 10 to 15 clusters and 20 to 25 clusters, four PCA dimensions for each scheme (120, 160, 200, and 768), and 100 K-means runs for each clustering value) show the following:
The majority (3056) of inferred topologies place set_RTU in a singleton branch.
set_TUC is placed in a singleton branch 2698 times.
set_PTU and set_PUT are placed together 1755 times.
set_RTU and set_PUB are placed together 1244 times.
set_TUB and set_TUC are placed together 848 times.
In particular, for the case presented in Figure 8, we obtained the dendrogram shown in Figure 13, which illustrates how the universities considered in our study are grouped based on their similarities.
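As an illustration of how a correlation matrix of group profiles can be turned into a tree, the sketch below performs agglomerative clustering with average linkage on the distance 1 − r. Both the linkage rule and the distance are assumptions made for this example, since the text does not specify them, and the group names and correlation values are hypothetical.

```python
def average_linkage(names, corr):
    """Agglomerative clustering with average linkage on distance 1 - r.

    Returns the merges as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of member names.
    """
    index = {name: i for i, name in enumerate(names)}
    clusters = [(name,) for name in names]
    merges = []

    def distance(c1, c2):
        # Average of 1 - r over all cross-cluster pairs of members.
        pairs = [(index[a], index[b]) for a in c1 for b in c2]
        return sum(1.0 - corr[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        d, a, b = min(
            (distance(c1, c2), c1, c2)
            for i, c1 in enumerate(clusters)
            for c2 in clusters[i + 1:]
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical correlation matrix between four group profiles (illustrative only).
names = ["A", "B", "C", "D"]
corr = [
    [1.00, 0.99, 0.97, 0.93],
    [0.99, 1.00, 0.98, 0.94],
    [0.97, 0.98, 1.00, 0.95],
    [0.93, 0.94, 0.95, 1.00],
]
merges = average_linkage(names, corr)
for a, b, d in merges:
    print(a, "+", b, f"at distance {d:.3f}")
```

In this toy example, group D joins the tree last and therefore sits on a singleton branch, which is the kind of topology counted in the list above. For real analyses, library routines such as `scipy.cluster.hierarchy.linkage` and `dendrogram` provide the same functionality.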
It should be noted that, while content preference is highly homogenized across the studied universities, the persistent isolation of one particular university profile in a single branch in the clustering suggests that non-homogenizing factors maintain a distinct structural identity for that outlier institution.
Last but not least, we wanted to determine whether there is a potential algorithmic effect on the interests intrinsic to the STEM community. To this end, we proceeded as follows. Given the data collected for a university (say, PUB) and the corresponding clustering of subjects (say, the 20-cluster scheme), we scanned every JSON file (corresponding to the posts of users of that university) for cluster terms and recorded the timestamp of each matching post. Using these timestamps, we computed three temporal-dispersion metrics for each cluster over a common global window defined by the earliest and latest timestamps found in the dataset (for PUB, from 20 December 2021 to 26 August 2025).
The metrics used in our research were as follows:
The Kolmogorov–Smirnov distance ($D_{KS}$), which measures the maximum deviation between the empirical distribution of timestamps and a perfectly uniform distribution over the global window (lower values indicate that occurrences are more evenly spread, while higher values indicate clustering or inactivity gaps);
The coefficient of variation of inter-arrival times ($CV$), which captures how irregular the gaps between consecutive timestamps are ($CV \approx 1$ is Poisson-like, whereas $CV \gg 1$ indicates strong burstiness);
Normalized Shannon entropy, which captures how evenly timestamps fill equally spaced time bins; values close to 1 indicate broad temporal coverage, whereas lower values indicate concentration in a few periods. In our experiment, we used 12 equal-width bins spanning the global time range in order to balance resolution against statistical stability (avoiding too many empty or low-count bins); for the mentioned time span, this gives approximately 112 days per bin (roughly a third of a year), which is a reasonable scale to capture semester and seasonal patterns without fragmenting the data.
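The three metrics can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions: timestamps are expressed in days since the start of the global window, the coefficient of variation uses the population standard deviation, and the timestamps themselves are synthetic (three short bursts), not data from the study. Only the window boundaries are taken from the text.

```python
import math
from datetime import datetime

def ks_distance_uniform(times, start, end):
    """Maximum gap between the empirical CDF of `times` and Uniform(start, end)."""
    span = end - start
    u = sorted((t - start) / span for t in times)
    n = len(u)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(u))

def inter_arrival_cv(times):
    """Coefficient of variation (std / mean) of gaps between consecutive events."""
    ordered = sorted(times)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)  # population variance
    return math.sqrt(var) / mean

def normalized_entropy(times, start, end, n_bins=12):
    """Shannon entropy of the bin occupancy, divided by log(n_bins)."""
    counts = [0] * n_bins
    for t in times:
        k = min(int((t - start) / (end - start) * n_bins), n_bins - 1)
        counts[k] += 1
    total = sum(counts)
    probs = [c / total for c in counts if c]
    return -sum(p * math.log(p) for p in probs) / math.log(n_bins)

# Global window reported for PUB, in days.
window_days = (datetime(2025, 8, 26) - datetime(2021, 12, 20)).days
print("bin width in days:", window_days / 12)

# Synthetic timestamps (illustrative only): three bursts of ten posts each.
bursty = [day + 0.1 * i for day in (100, 600, 1200) for i in range(10)]
print("KS distance:", ks_distance_uniform(bursty, 0, window_days))
print("CV of gaps:", inter_arrival_cv(bursty))
print("normalized entropy:", normalized_entropy(bursty, 0, window_days))
```

With events concentrated in three bursts, the coefficient of variation is well above 1 and the entropy stays far below 1, which is the qualitative signature of wave-like rather than uniform activity.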
Across the 20 clusters, the results show moderate temporal spread but pronounced burstiness: Kolmogorov–Smirnov distances range from 0.559 to 0.623 (mean ), entropy values cluster around 0.60, and coefficients of variation of the inter-arrival times range from 2.35 to 4.65 (mean ), none of which are compatible with an evenly distributed temporal pattern. The findings indicate that topics are present across the entire multi-year window but appear in waves rather than uniformly, suggesting temporal clustering, potentially driven by platform trends, event cycles (exams, holidays, politics), or algorithmic reinforcement, rather than purely intrinsic steady-state interests.
From a broad perspective, the findings contradict previous suppositions about differences between regions or countries relating to persona preferences on social media. The uniformity of content affinities can be explained both by the algorithms of the social media platform and by the psychographic characteristics of the audience consuming the short-form video content. This, in turn, underlines the role of global digital platforms such as TikTok in the formation of a unified cultural and educational space for students, implying the existence of a “digital student persona” beyond national boundaries, despite possible differences at the local level.
Finally, the following limitations of the study are acknowledged. First, the homogeneity of the content may, at least to some extent, be augmented by the platform’s algorithmic bias towards promoting the most globally successful content formats. Second, with regard to user demographics, although the study of the locality effect and the statistics of student mobility support the assumption that most of the followers are local students, the study’s inference must rely on indirect evidence due to the non-public nature of specific statistical information from the universities. Third, although we used a low temperature to increase determinism, the inherently probabilistic nature of LLMs introduces a degree of variability in semantic summarization that we attempted to overcome through consensus clustering of multiple experimental runs.
5. Conclusions
This study aims to determine the level of homogenization in TikTok user preferences among technical university students in Europe. Using the proposed multimodal approach within the MIPS framework, this study presents empirical evidence of a high level of homogenization in user content preferences among these technical university students, thus calling into question existing theories of cultural regional differentiation.
The main contribution of this study is the empirical evidence of a high level of thematic homogenization in user content preferences, as shown by high levels of clustering and correlation among different follower cohorts. Such a finding supports the emergence of a “digital student persona” transcending national boundaries, as influenced by algorithmic trends and the STEM academic lifestyle. Instead of focusing on cultural regional differentiation in social media communication strategies, universities should now recognize the existence of a globalized algorithmic trend and universal student interests, such as those related to academics, technology, and health, as a cultural equalizer. As such, a more appropriate social media communication strategy should focus on generating high-quality content related to these universally relevant themes.
It is worth noting that although the study focused on the TikTok platform, a more exhaustive approach could include additional social media platforms (like Instagram, Facebook, and X) or academic forums. This extensive data collection might capture a larger range of user behaviors and content preferences, thereby making the analysis more detailed.
With regard to methodological enhancements, it may be possible to optimize the MIPS pipeline. For example, different LLMs might be better suited for identifying separate entities. Moreover, incorporating reasoning with assertion extraction could improve interpretation. It would be possible to upgrade the frame description component with multi-modal models capable of understanding video context at significant points.
Nevertheless, it is worth noting that summaries generated by the LLMs are subject to prompt design, and there might exist variations due to the probabilistic nature of LLM operation. Similarly, assigning semantic labels to clusters is also a potential source of variability (or even of errors); moreover, as shown in the paper, LLMs may produce slightly different names for similar themes, which may require manual review or other NLP techniques to merge those with the same semantics. However, even if LLMs’ outputs are (possibly) biased, incomplete, or even incorrect, in this particular case of analyzing social media data, they provided some valuable insights about users’ preferences. To overcome this limitation comprehensively in the future, a human-in-the-loop validation process will be included. For instance, the plan is to conduct expert human evaluations on randomly selected subsamples of the AI-generated labels and the video content to qualitatively evaluate the robustness and cultural accuracy of the clustering process.
Although the usefulness of these insights was empirically demonstrated throughout the paper (given the large number of individuals studied and the number of test cases performed), it should be noted that there might be other important factors that were underrepresented (for example, user demographics or online behaviors) that might influence the accuracy and reliability of the study. This represents a future line of research that could enhance the presented methodology.
Within the same context, a further potential development is a stability study. For example, it would be interesting to see whether slight changes in the input (frame extraction, noise added to the input audio/video) produce a bounded change in the output, that is, whether the developed system has a bounded sensitivity, characterized by a finite Lipschitz constant. This can be assessed because the MIPS sequential pipeline can be intuitively modeled as a composition of functions
$$F = f_3 \circ f_2 \circ f_1,$$
where
$f_1$ maps raw videos to structured data;
$f_2$ maps multimedia data to text representations (transcripts, summaries);
$f_3$ maps text to a graph (nodes = entities, edges = relations).
Hence, by considering
$x$ = the original input video,
$x'$ = the perturbed input video clip,
$F(\cdot)$ = output eigenvector of the resulting knowledge graph produced by the LLM,
$d(\cdot, \cdot)$ = distance metric (cosine distance),
one can define the stability condition
$$d\big(F(x), F(x')\big) \le L \, d(x, x'), \tag{1}$$
where $L$ is a Lipschitz constant. If Relation (1) holds, then one could say that the pipeline is stable under small perturbations.
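One way such a stability study could be approached empirically is to perturb an input many times and record the largest observed ratio between the output distance and the input distance. The sketch below does this with cosine distance on plain feature vectors. The function `toy_pipeline` is a hypothetical stand-in (a fixed linear map), not the MIPS pipeline, and the noise level and number of perturbations are arbitrary choices made for illustration.

```python
import math
import random

def cosine_distance(u, v):
    """Cosine distance, 1 - cos(angle), between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def empirical_lipschitz(pipeline, x, perturbations):
    """Largest observed ratio of output distance to input distance."""
    y = pipeline(x)
    ratios = []
    for x_prime in perturbations:
        d_in = cosine_distance(x, x_prime)
        if d_in > 0:  # skip perturbations that leave the direction unchanged
            ratios.append(cosine_distance(y, pipeline(x_prime)) / d_in)
    return max(ratios)

def toy_pipeline(x):
    """Hypothetical stand-in for the pipeline: a fixed linear map."""
    return [2.0 * x[0] + x[1], x[1] + 0.5 * x[2], x[2]]

rng = random.Random(0)  # fixed seed for reproducibility
x = [1.0, 2.0, 3.0]
perturbed = [[xi + rng.gauss(0, 0.01) for xi in x] for _ in range(50)]
L_hat = empirical_lipschitz(toy_pipeline, x, perturbed)
print("empirical lower bound on L:", L_hat)
```

Note that the maximum over a finite sample only gives a lower bound on any valid Lipschitz constant; a bounded estimate across many inputs and perturbation types would be evidence of, not proof of, stability.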
Other new lines of research might include alternative clustering methods such as spectral clustering or density-based methods (DBSCAN). Additionally, the analysis could include time-series clustering aimed at capturing changes in content preferences over time. Another idea to explore would be to use quantitative metrics in the knowledge graph (such as node counts and/or centrality measures) to validate the findings: for example, if the pattern of related statements (or nodes) around a given input keyword is similar in number and structure across different university-affiliated video clip sets, this supports the notion that content preferences are homogenized regardless of regional or institutional differences.
Lastly, MIPS was specifically designed as a privacy-preserving architecture that uses local LLMs for large-scale social media analytics within the context of public technical higher education. The present study uses publicly available TikTok videos associated with the universities under investigation. In accordance with the General Data Protection Regulation (GDPR), no personally identifiable information was collected, processed, or stored during the research. All data analyzed originated exclusively from publicly accessible content, and no private, restricted, or sensitive material was used. The analysis focuses on aggregated patterns reflecting the general interests of the universities’ followers, without reference to individual users.
All multimedia content (videos, images, and soundtracks) was accessed solely for analytical purposes and was deleted immediately after use, ensuring compliance with the GDPR principles of data minimization and storage limitation. The output knowledge graph was built by enforcing robust anonymization measures that prevent singling-out, linkability, and inference.
Access to publicly available data was facilitated through TikTok’s Research Account. There was no collaboration with TikTok beyond the use of this account, and the platform had no material interest or influence on the study design, data analysis, or research findings.
The findings of this study are published openly, with the aim of promoting transparency and contributing to the common and public good. The present research was conducted without commercial purposes and follows the principles of European copyright and research exceptions.