A Multimodal Deep Learning Approach for Analyzing Content Preferences on TikTok Across European Technical Universities Using Media Information Processing System

Sburlan, Dragoş-Florin; Bucos, Marian

doi:10.3390/electronics15061288

by Dragoş-Florin Sburlan ¹,* and Marian Bucos ²,*

¹ Faculty of Mathematics and Informatics, Ovidius University of Constanta, 900470 Constanta, Romania
² Communications Department, Politehnica University Timisoara, 300223 Timisoara, Romania
* Authors to whom correspondence should be addressed.

Electronics 2026, 15(6), 1288; https://doi.org/10.3390/electronics15061288

Submission received: 6 January 2026 / Revised: 9 March 2026 / Accepted: 10 March 2026 / Published: 19 March 2026

(This article belongs to the Special Issue Feature Papers in "Computer Science & Engineering", 3rd Edition)

Abstract

Social media platforms have become primary communication channels for European technical universities. However, the extent to which global platform algorithms homogenize individual preferences across cultures remains underexplored. Although the current literature offers insights into the topic, no existing work considers the cross-national and multimodal nature of the phenomenon. In the current paper, we introduce the Media Information Processing System (MIPS), a privacy-preserving multimodal deep learning (DL) framework that incorporates large language models (LLMs), computer vision (CV), and knowledge graphs. We analyze data from 15,520 public videos shared by 2359 followers of six top technical universities from Romania, Germany, Italy, and Russia. The results of the study suggest that the degree of homogeneity of the followers’ interest profiles is markedly high. Statistical profiling of the data indicates that the interest profiles of the followers from different countries are strongly and positively correlated (mean Pearson r = 0.96; p > 0.90). Consensus clustering of the data reveals stable clusters of themes with high stability scores (>0.75), such as “Human Interaction Dynamics”. The results of the study contradict the traditional theory of regional cultural differentiation. Instead, they suggest the existence of a new “digital student persona” that is characteristic of the academic lifestyle of students from different countries.

Keywords:

multimodal deep learning; TikTok content analysis; preference homogenization; higher education social media

1. Introduction

Social media platforms have grown from being entertainment tools to an institutional means of communication, especially in an educational context [1,2,3]. TikTok [4] is one of the key platforms used by institutions to reach the younger generation and connect with them, forming a new means for marketing in the field of higher education institutions [5,6]. With over 1 billion users, predominantly young people between 16 and 24, TikTok offers an opportunity to understand what these users—the new generation of learners—prefer and identify with [7].

The existing literature on social media in the context of higher education institutions focuses on unimodal analysis, primarily text scraping. An important research gap exists in the study of cross-national behavioral trends using a multimodal approach that incorporates videos, audio, and text. Contrary to previous research that treated regional user groups as culturally unique [8,9], the proposed study uses a unified DL model across four countries to validate the hypothesis of algorithmic homogeneity. Furthermore, the proposed method uses offline local LLMs and knowledge graphs to circumvent the privacy and granularity issues faced by API-based studies.

In this work, we aimed to study similarities between people from different European countries by employing a multimodal DL framework, combining natural language processing (NLP), CV, and statistical methods and techniques. This is done by designing a system able to process and analyze the public reposts of followers of the official and public TikTok accounts related to technical universities in four European nations, namely Romania, Germany, Italy, and Russia. For the purpose of achieving this objective, we design and describe the Media Information Processing System (MIPS).

Although there is no precise target audience (a person following a university account might be a student, professor, stakeholder, and so on; moreover, there is no evidence regarding the person’s birthplace, residence, and nationality), our analysis revealed a surprising degree of similarity in persona preferences for university content across Europe.

However, we assumed that most of the followers of a university account are young and have some connection with the university being studied [7,10]. We also assumed that most of the followers of a university account are from the same region, since large universities are poles of attraction for students from neighboring areas [11].

Our claim that most university account followers are from the same region and have some connection to the university is supported by several studies on social media followership, which indicate a locality effect. In particular, the followers of an account tend to cluster geographically near the account holder, influenced by shared language and cultural context (see [12,13,14]). The same conclusion is also supported by a Eurostat study from 2023, which states that about 91.6% of students were studying in their own country rather than abroad (see [15]). This fact, corroborated by the findings in [16], which show that most students (about 90%) are not long-distance movers (more than 50% of students are within a distance of 91 km), constitutes empirical evidence for our assumption. People generally remain within their home region for education, which suggests that a similar local bias may be present among university followers on social media. Nevertheless, even though these arguments provide robust support for a predominant locality effect, they constitute indirect evidence because students’ demographic data from the studied universities are not publicly available. It is also worth mentioning that social media dynamics might transcend regional boundaries due to viral content or targeted international campaigns; hence, the results might be biased to a certain degree.

Online platforms create a shared space that ignores geographical limits and leads to convergence in preferences [17,18]. Conversely, according to a broad understanding within the social sciences, regional differentiation is commonly highlighted [19]. However, STEM education creates a shared culture among young people in technical universities, regardless of the distance [20].

However, the existing empirical gaps prevent a precise clarification of these aspects. In particular, the link between the characteristics of the institutions (size, degree of specialization, and national framework) and the content preferences of followers has never been addressed within a unified framework of analysis. Two research questions drive this study:

RQ1: Are the content preferences of followers of European technical universities homogeneous despite geographical, linguistic, and institutional differences?
RQ2: Which content clusters show most consistency among universities?

This paper makes three contributions. First, it provides the first large-scale empirical evidence of preference homogeneity among followers of European technical universities, analyzing 15,520 videos from 2359 sampled followers. Second, the MIPS architecture offers a methodological improvement and a repeatable template for future social media studies. Third, the results challenge existing beliefs drawn from the social science literature on regional differentiation. Contrary to most previous research, which argues for a strong degree of cultural differentiation among European regions (see [21]), the present study finds a significant level of homogenization in content engagement.

2. Media Information Processing System

This section describes the architecture of the Media Information Processing System (MIPS), a system that aims to extract, analyze, and provide insights from multimedia content.

In this regard, we assume the reader to be familiar with the main concepts related to large language models (LLMs). For more information about this topic, we refer readers to [22,23,24]. Interacting with LLMs to maximize their performance and elicit different generative capabilities (e.g., creative, analytical, and so on) has recently emerged as a new field of research. In this respect, prompt patterns and templates are extensively used to calibrate the model’s responses to specific requests (see [25,26,27,28]). Both proprietary, closed-source LLMs accessed as online services (such as OpenAI’s ChatGPT [29], Google’s Gemini [30], and Microsoft Copilot [31]) and open-weight models running locally (such as Llama [32] and Gemma [33]) are commonly used for prompt testing and produce quality answers (with regard to translation into plain English, summarization, knowledge extraction, and so on). However, for our implementation, we selected offline local LLMs, namely Llama v3.3, v4 and Gemma v3 (via Ollama [34]), along with the Whisper automatic speech recognition system [35], to ensure data privacy. Prompt optimization tools ([36,37]) were also employed to improve and test prompts. The extracted knowledge was stored in the Neo4j (Kernel 2025.09.0) graph database ([38]), which provides an effective and scalable solution to store and query complex relationships between entities.

Finally, we refer readers to [39,40] for the early versions of the MIPS framework, which provided the foundation for the architecture of the system used in our study.

The novelty of the MIPS framework lies in the privacy-preserving and neuro-symbolic approach that it employs. Unlike other tools that rely on cloud-based APIs, the proposed framework leverages LLMs (such as Llama and Gemma) in an offline manner to process the scraped data without the need to rely on third parties, thereby adhering strictly to the GDPR. Also, the proposed framework incorporates a knowledge graph layer, which allows the unification of fragmented entities before statistical processing.

MIPS was designed as a multimodal DL pipeline to process various types of media content, such as text (transcripts), images (frame descriptions), and audio (if applicable), and to extract meaningful information from social media platforms like TikTok. In MIPS, each stage considers the result of the preceding stage’s processing and thus forms a continuous flow of information from data capture to prediction.

Media Information Gatherer (MIG) captures information from publicly available domains or repositories or from qualified sources.
Content Analyzer (CA) uses artificial intelligence (AI) to analyze media content. This includes image and video analysis through CV in the extraction of relevant information, as well as analysis of transcripts through speech recognition and NLP in the extraction of information from audio or video material.
Knowledge Graph Designer (KGD) builds the knowledge graph from analyzed data, identifying significant individuals, organizations, and concepts in the data. Moreover, this involves the extraction of relationships in mapping entities and relationships among those entities.
Statistics Profiler (SP) extracts profiles from data queried by KGD.
Predictive Profiler (PP) uses machine learning (ML) and statistical modeling to predict future trends or behaviors based on the profiles developed using SP.

The overall architecture discussed above is represented in Figure 1.

Descriptions of each part of the MIPS processor are elaborated in the next sections.

Media Information Gatherer

For this research, publicly available data on the TikTok social media platform was used, including not only visual data in the form of videos but also other data parameters accompanying these videos, thus providing a vast database for analysis. Data from reposts made by followers of technical universities in Europe were used. By examining how these users engage with content, our aim was to identify relevant patterns related to subject interests and engagement parameters that may influence specialized academic social groups. It should be noted here that this module focuses primarily on data from the TikTok platform; however, other data parameters from other platforms, such as other social networks or specialized academic platforms, would also be useful for understanding user data. This would create more varied data from different users on different platforms.

Content Analyzer

This module takes a video clip as input; however, the underlying process pipeline can handle many different types of input, such as text, audio, or sensor information. The specifics of this module are detailed below.

Pre-processing and Chunking
A video clip is first assessed against a minimum duration threshold to determine its suitability for analysis (for example, a clip shorter than 10 s could be assumed not to contain rich information). If the clip exceeds the threshold, the soundtrack is extracted and split into discrete chunks. Each chunk is then subjected to speech detection using a parallel processing approach that assigns each chunk to an available graphics processing unit (GPU). This enables efficient concurrent processing of individual chunks.

The chunk length is fixed at 10 s: assuming that the average person utters 2–3 words per second, a 10 s chunk contains approximately three sentences. Each chunk is transcribed using the Whisper automatic speech recognition (ASR) system (see [41]) with the medium model. The system first determines the speech segments in each audio recording, discarding those that represent noise or music (hence facilitating further analysis of spoken content that expresses opinions, facts, and so on). The identified speech segments are then processed by the Whisper ASR engine, and the transcriptions obtained for the chunks are combined in the original order.
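The chunking step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the round-robin GPU assignment, and the use of 10 s as the suitability threshold are assumptions, and the actual Whisper call is left out.

```python
from typing import Dict, List, Tuple

CHUNK_SECONDS = 10.0      # chunk length stated in the text
MIN_CLIP_SECONDS = 10.0   # assumed suitability threshold (given only as an example)


def is_suitable(duration: float, minimum: float = MIN_CLIP_SECONDS) -> bool:
    """A clip qualifies for analysis only if it is longer than the threshold."""
    return duration > minimum


def chunk_bounds(duration: float, chunk: float = CHUNK_SECONDS) -> List[Tuple[float, float]]:
    """Return (start, end) pairs that cover [0, duration] in fixed-size chunks."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)
        bounds.append((start, end))
        start = end
    return bounds


def assign_round_robin(n_chunks: int, n_gpus: int) -> List[int]:
    """Map each chunk index to a GPU index (round-robin is an assumption)."""
    return [i % n_gpus for i in range(n_chunks)]


def merge_in_order(pieces: Dict[int, str]) -> str:
    """Join per-chunk transcripts by chunk index, skipping empty (non-speech) chunks."""
    return " ".join(pieces[i] for i in sorted(pieces) if pieces[i])
```

Because chunks may finish out of order when processed in parallel, the merge step sorts by chunk index before joining.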

Transcription, Translation, and Summarization
Upon completion of the chunking process, the merged transcript is corrected, translated into English, and summarized using an LLM. The objective of this step is to achieve linguistic uniformity, regardless of dialects or variations in the audio language. In terms of linguistic bias, although translating the transcripts into English results in the loss of cultural nuances at the local level, the essential high-level semantics required for macro-level thematic clustering, which is the objective of this study, are retained. The method ensures uniform data from various sources. The result is presented in a clear format to identify entities and their relationships. The LLM used is Llama 3.3 (see [42] for more details), with temperature = 0 and num_ctx = 4096, relying on Ollama’s default sampler under a zero-temperature regime (effectively minimizing randomness and approximating greedy decoding); no top_p, top_k, or repetition penalties are specified beyond this.

The prompt used to perform the translation and summarization is given below.

Translate the text into English and build a very brief and concise bullet list version, including only the essential and relevant information. You must use the complete names for the identified entities such that each statement in the list could be self-understandable and will refer to the same entities as the previous ones. The output must have around 150 words (plus or minus 10 words). Use simple telegraphic and logically connected sentences in plain English. Obey these requirements strictly and provide only the telegraphic sentences without any additional text. Here is the text:
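As a hedged illustration of the settings described above, the request body for Ollama's `/api/generate` endpoint could be assembled as follows. The model tag `llama3.3`, the helper name, and the way the instruction and transcript are concatenated are assumptions; the sketch only builds the payload and does not contact a server.

```python
import json


def build_generate_request(instruction: str, transcript: str,
                           model: str = "llama3.3") -> str:
    """Serialize a non-streaming generation request with the stated options."""
    payload = {
        "model": model,                       # assumed Ollama model tag
        "prompt": f"{instruction}\n{transcript}",
        "stream": False,                      # ask for one complete response
        "options": {
            "temperature": 0,                 # zero-temperature regime from the text
            "num_ctx": 4096,                  # context window from the text
        },
    }
    return json.dumps(payload, ensure_ascii=False)
```

Leaving `top_p`, `top_k`, and repetition penalties out of `options` mirrors the text, which relies on Ollama's defaults for those parameters.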

Multimodal Contextualization
During the process of transcription and summarization, frames of the video clip are extracted and analyzed through the LLM. This step provides additional information, thus improving the interpretation of summarized text information. The result obtained here helps in moving towards the next step of analysis: knowledge extraction.

For this task, several frames are extracted from the clip, resized, and then fed to the LLM gemma3:12b-it-qat [43] to obtain a meaningful description of the context in which the speech took place. The prompt used is given below:

Describe the following sequence of images extracted from a video clip in at most three sentences.

This entire pipeline (from chunking the clip’s audio track to generating a concise summary and getting the descriptions of frames) ensures that even long video clips are efficiently processed into actionable insights.

Knowledge Graph Designer

The module uses machine learning and natural language processing algorithms to structure data into entities, relationships, and attributes (a similar approach was employed in [40]).

Entity Disambiguation and Knowledge Graph Construction
Entities and relationships are identified, then LLMs and ML are used to disambiguate entities and construct a knowledge graph. This process resolves ambiguities and inconsistencies in entity identification. The result is a representation of multimedia content.

The LLM is given a prompt to extract assertions, summarize them, and recognize unique entities and relationships in a structured JSON format (the exact prompt structure can be seen in Appendix B). As a result, the LLM outputs a JSON for each chunk. However, due to the probabilistic nature of the LLM functioning, entities representing the same concept might be named differently. Consequently, in order to reduce possible ambiguities, the LLM is invoked once more with the JSON created as above and with the following prompt:

The following text is a JSON containing ’assertions’, each of them containing ’knowledge’ information. The ’knowledge’ contains ’entities’. Identify potential ambiguities among all entities, considering contextual clues such as their relationships with other nodes. Use your understanding of domain-specific knowledge to disambiguate any ambiguous entities. Unify the name and type of entities across the entire text. Produce only the JSON output following the same structure strictly. Here is the JSON:

Once each video chunk has been processed to generate its corresponding JSON output, the raw outputs are subject to variability. Since the LLM operates probabilistically, the same underlying concept may be rendered under different names or with slight differences in type. This variability can lead to fragmented representations in which a single real-world entity is split into multiple nodes, each potentially with its own set of properties.

Hence, to address these possible inconsistencies, the above-defined disambiguation step is invoked and used to obtain a proper representation of knowledge data.

The response from the model, which is to be inserted into the knowledge graph stored in a Neo4j graph database, has to obey a strict JSON structure given by a precise schema. The response is first sanitized, and possible minor errors are then automatically corrected based on the provided schema (e.g., using json-corrector) prior to loading.
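A minimal sketch of the sanitization and structure check is shown below. It is not the schema from Appendix B: it only assumes the nesting named in the disambiguation prompt (`assertions`, each with `knowledge`, which holds `entities`), and it assumes `knowledge` is an object and `entities` a list. Automatic correction with json-corrector is not reproduced.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models often wrap around JSON


def sanitize(raw: str) -> str:
    """Drop code fences and any text outside the outermost braces."""
    text = raw.replace(FENCE + "json", "").replace(FENCE, "")
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model response")
    return text[start:end + 1]


def has_expected_structure(doc: dict) -> bool:
    """Check the assumed nesting: assertions -> knowledge -> entities."""
    assertions = doc.get("assertions")
    if not isinstance(assertions, list):
        return False
    for assertion in assertions:
        knowledge = assertion.get("knowledge") if isinstance(assertion, dict) else None
        if not isinstance(knowledge, dict):
            return False
        if not isinstance(knowledge.get("entities"), list):
            return False
    return True
```

A response that fails the check would be repaired or rejected before anything is written to the graph database.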

It is worth mentioning that the prompt itself also asks the model to perform entity disambiguation, using contextual relations and domain knowledge to distinguish same-named items. In addition, entity unification is performed by a deterministic, threshold-based pass that fuses near-duplicates. The similarity between entity names and types is computed from the Levenshtein and Jaro–Winkler distances and Word2Vec cosine similarity, and entity pairs whose similarity exceeds a given threshold are merged (see [40] for more details). This hybrid approach, combining LLM-guided standardization with fixed similarity thresholds, proved to be robust for the dataset under study.
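The deterministic pass can be illustrated with a simplified sketch that uses only a normalized Levenshtein similarity; the Jaro–Winkler and Word2Vec components are omitted, and the threshold of 0.85 and the function names are hypothetical rather than the values used in [40].

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest


def unify(names: List[str], threshold: float = 0.85) -> Dict[str, str]:
    """Map each name to a representative, merging pairs above the threshold."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                parent[find(j)] = find(i)
    return {name: names[find(i)] for i, name in enumerate(names)}
```

The union-find structure makes merging transitive, so a chain of near-duplicates collapses into a single node even when its two ends fall below the threshold.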

An example of the knowledge data extracted from an analyzed video clip is presented in Figure 2. The arcs between entities also store additional information. For example, the arc from the knowledge graph between entities Young Women (type: Person, Student) and vlog (type: Content) describes the relation plans to post and stores searchable data extracted from both video frames and transcripts:

domain: social media, content creation, time management
keywords: vlog posting, social media, content creation, time management
meta_source: PUB
meta_timestamp: 1757040133.0
metadata_rel_order: 20
str: the young woman with a gray hoodie plans to post her vlog the next day after editing it in the evening

Based on the above structuring, it is possible to restore the original information with high accuracy and to support relevant insights through contextual relationship analysis.

Statistics Profiler

After the multimedia content has been processed and represented in our JSON-based format, each assertion includes two important metadata components: domains and keywords. These (comma-separated) lists provide high-level semantic markers that describe the context and content of each assertion.

Combining all domain lists (or the keyword lists) into a single set containing all unique elements from all lists gives the input for the SP module.

The SP module uses this set to analyze the frequency and distribution of topics within statements and to identify dominant themes by calculating the frequency of occurrences.
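The set construction and frequency count described above could look roughly like this. The field name `domain` follows the example properties listed earlier; lowercasing and whitespace stripping are assumptions about normalization.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def split_labels(value: str) -> List[str]:
    """Split a comma-separated metadata string into clean labels."""
    return [part.strip().lower() for part in value.split(",") if part.strip()]


def unique_domains(assertions: Iterable[Dict[str, str]]) -> Set[str]:
    """Union of all domain lists: the input set for the SP module."""
    return {label for a in assertions for label in split_labels(a.get("domain", ""))}


def domain_frequencies(assertions: Iterable[Dict[str, str]]) -> Counter:
    """Count how often each domain label occurs across assertions."""
    counts: Counter = Counter()
    for a in assertions:
        counts.update(split_labels(a.get("domain", "")))
    return counts
```

Calling `most_common()` on the returned counter lists the dominant themes first.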

3. Methods

3.1. Sampling Strategy

This study used a quantitative design to explore the TikTok content preferences of followers of six European technical universities. The target populations (N) were derived from follower information gathered in August 2025 via the TikTok research API and the universities’ TikTok accounts.

The list of universities considered for this analysis is as follows: PUT, Politehnica University Timisoara; PUB, Polytechnic University of Bucharest; TUC, Technical University of Cluj-Napoca; TUB, Berlin Institute of Technology; PTU, Polytechnic University of Turin; and RTU MIREA (Russian Technological University). The principal data of the mentioned universities, along with their countries, TikTok handle names, and number of followers, are shown in Table 1 below.

The selected universities consist of three universities from Romania, one university from Germany, one from Italy, and one from Russia. This ensures that a comparable group of technical universities with a strong presence on social media sites is obtained.

A statistically valid sampling design was required to enable generalization of results from the sample of users to the population of followers for each university. The sample sizes (n) represent the number of observations for a 95% confidence level with a 5% margin of error (e). A two-step approach was used to calculate the sample sizes. Initially, the formula for deriving sample sizes for a population considered infinite was used, assuming maximum variability for a binomial probability (p = 0.5). This was followed by a modification using the finite population correction (FPC). This correction reduced the sample sizes for universities with smaller counts of followers, such as TUC, while having a minimal effect on larger populations, such as PTU.

Figure 3 demonstrates how the sample size correlates with the resulting error margin. The blue line for N = 2000 and the green line for N = 5000 show that as the sample size increases, the margin of error decreases rapidly until it passes below the standard ±5% reference line. Past this point, these curves flatten out to demonstrate the law of diminishing returns. This was the necessary statistical analysis to justify the sample sizes selected, thus validating our methodology.
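The curves discussed above follow from the standard margin-of-error formula for a proportion. The sketch below assumes z = 1.96 for the 95% confidence level, p = 0.5, and the usual finite population correction factor sqrt((N − n)/(N − 1)); the function name is illustrative.

```python
import math
from typing import Optional


def margin_of_error(n: float, population: Optional[int] = None,
                    z: float = 1.96, p: float = 0.5) -> float:
    """Margin of error for a proportion, optionally with finite population correction."""
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # The correction shrinks the error when the sample is a large share of N.
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error
```

Evaluating the function over a range of n for N = 2000 and N = 5000 gives curves of the kind described above: a steep initial drop followed by diminishing returns.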

In API-based social media research, data losses produced by accounts that are not accessible due to their private status, deletion, or platform rate limits are expected. These issues reduce both the volume and completeness of data accessible for analysis and require adjustments in study population estimates [44,45]. Based on the pilot test, we estimated that there would be a 25% data loss rate. The detailed mathematical formulas for these calculations (initial sample size, FPC, and loss adjustments) are provided in Appendix A.
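The exact formulas are in Appendix A; as a hedged sketch of the same three steps, the code below uses the standard expressions with z = 1.96, p = 0.5, and e = 0.05. Inflating the corrected size by multiplying by (1 + loss rate) is an assumption; dividing by (1 − loss rate) is a common alternative that gives larger samples.

```python
import math


def initial_sample_size(z: float = 1.96, p: float = 0.5, e: float = 0.05) -> float:
    """Sample size for a population treated as infinite."""
    return z ** 2 * p * (1 - p) / e ** 2


def apply_fpc(n0: float, population: int) -> float:
    """Finite population correction of the initial sample size."""
    return n0 / (1 + (n0 - 1) / population)


def adjust_for_loss(n: float, loss_rate: float = 0.25) -> float:
    """Inflate the sample to absorb expected data loss (assumed multiplicative form)."""
    return n * (1 + loss_rate)


def required_sample(population: int, loss_rate: float = 0.25) -> int:
    """Chain the three steps and round up to a whole number of followers."""
    corrected = apply_fpc(initial_sample_size(), population)
    return math.ceil(adjust_for_loss(corrected, loss_rate))
```

Under these assumptions, `required_sample(790)` returns 324; other rounding conventions can shift results by one.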

Figure 4 presents the outcome of the sampling strategy. The theoretical sample size for each university is plotted against its adjusted sample size. The result confirms the non-linear relationship; the populations range from 790 to 8867, and the required sample sizes are grouped between 324 and 460.

The effect of this non-linearity is further shown in Figure 5, which plots the population against the percentage of that population required for the sample. The smallest population, TUC, with a required sample of n = 324, was sampled at 41.10%. As the size of the population increases, the required sampling fraction asymptotically reduces to stabilize at about 5–6% for the largest universities, showing efficiency in the FPC-adjusted model.

As evident from Table 2, the margin of error for all six universities remains under the 5% target. Data indicate that within the scope of this study, sample sizes larger than 300 will consistently yield a margin of error in the range of about 4.5%.

3.2. Data Collection

The MIPS pipeline is described in Algorithm A1. In this setup, the MIG module is responsible for finding the followers of the university’s TikTok account. Once the list is obtained, the public reposts of these users are downloaded in the same module.

Subsequently, every downloaded video clip becomes an input to the CA module, which has the task of extracting transcripts, translating, generating frame-wise descriptive information, and developing a representative, anonymized text form of representation in order to comply with GDPR. At this point, data related to that particular video clip, including its video file and audio, are deleted.

The next task in the pipeline is performed by the KGD, which builds the knowledge by analyzing each text representation:

Entity Extraction: Using NLP, KGD picks out important entities in a summarized text. These can range from names of individuals, organizations (e.g., universities), events, concepts, and so on.
Relationship Identification: The KGD evaluates a text-based context for the identification of links between entities, which involves establishing associations, for instance, between a user and a topic/event.
Correction and Disambiguation: Because the output of some preceding stages, especially those relying on probabilistic models like LLMs, might be noisy, this module provides corrections. An example of this is disambiguating different names of entities in a text using clues like co-occurrences. This step ensures that all links to a particular real-world object are merged under a single node in the graph.
Graph Construction: Identified entities as well as their connections are then structured in nodes, which correspond to distinct entities, and edges that signify discovered relations. Every edge contains properties like domain metadata and keywords.
Database Upload: The knowledge graph is uploaded to a graph database such as Neo4j. This database stores complex relations for querying, so the knowledge graph can then be used to extract relevant data.

Once the knowledge graph is built, it can be queried for relevant information. In particular, we wanted to identify, for any given university being considered, how many times a certain domain appears. When comparing the domains (for example, “travel”, “religion”, “cinematography”, and so on) across all six universities, Figure 6 displays the count of each domain for each university, divided by that university’s total number of domains. More precisely, we built a mapping from each domain value (e.g., “travel”, “religion”, “cinematography”, and so on) to a second mapping that links each university to its count for that domain.
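The nested mapping can be sketched as follows. The input is assumed to be a flat list of (university, domain) pairs returned by the graph query, and "total number of domains" is interpreted as the total number of domain occurrences for that university; both are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def domain_shares(records: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Build domain -> {university: share of that university's domain occurrences}."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    totals: Dict[str, int] = defaultdict(int)
    for university, domain in records:
        counts[domain][university] += 1
        totals[university] += 1
    return {
        domain: {u: c / totals[u] for u, c in per_university.items()}
        for domain, per_university in counts.items()
    }
```

Normalizing by each university's own total makes the shares comparable across accounts with very different follower counts.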

We encoded each domain label into a dense vector using the RoBERTa model [46], a transformer pretrained on a large text corpus; the resulting vector captures the semantic features of the label.

To reduce the dimensionality of our embeddings while preserving the global variance necessary for our K-means clustering algorithm, we used principal component analysis (PCA). We did not use any non-linear manifold methods, such as t-SNE or UMAP, since they preserve local neighborhood relationships but do not faithfully represent global distance.

The embeddings determined in this way were grouped into clusters using the K-means algorithm, which measured the cosine of the angle between any two vectors. A value close to 1 meant similar semantics, while a value near -1 meant different semantics. Consequently, words with related semantics are grouped together.
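For reference, the cosine measure and a nearest-centroid assignment based on it can be written as below. This is only the similarity step, not a full K-means implementation, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def nearest_centroid(vec: Sequence[float], centroids: List[Sequence[float]]) -> int:
    """Index of the centroid with the highest cosine similarity to vec."""
    return max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))
```

Since cosine similarity ignores vector length, only the direction of an embedding affects its cluster assignment.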

To capture meaningful information out of unstructured and variable data, we studied the dynamics of cluster aggregation when the number of clusters was set to the intervals [10, 15] and [20, 25]. The aim was to perform an unsupervised classification of the areas of interest for users who follow a specific university on the TikTok platform; in this framework, the areas (domains) with similar underlying meaning will fall into the same category.

As social media data encompasses a broad list of areas (which are represented by an even larger number of terms), in our study, we used a relatively large number of clusters for an effective analysis, regardless of inconsistencies in data embeddings (for instance, due to dimensionality reduction of vector embeddings).

The SP module used an LLM to assign a generic name to each cluster. The ratio of domain occurrences in a cluster to the total number of domains of a university gave us raw data that could be processed statistically. For example, a JSON fragment of the data is presented below.

The raw domain frequencies were grouped to determine the level of user engagement for each newly identified thematic cluster. A part of the aggregated data structure can be seen in Appendix B.

In particular, the SP module computed the following:

Correlation Matrix of Preference Profiles across Universities

For each thematic cluster, the SP module extracted the university-specific preference proportions (e.g., for “Human Behavior Analysis” there was a set of six proportion values corresponding to the analyzed universities) and computed the pairwise correlations between these sets of proportions using statistical measures such as Pearson or Spearman correlation. This methodology seeks to evaluate how similarly users from each university emphasize each thematic cluster. The obtained correlation matrix is meant to reveal whether, for example, a high interest in “Human Behavior Analysis” at one university is also found in others. Consequently, inter-university similarities or differences in users’ preferences might be revealed.
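One way to read the procedure above is to treat each university as a vector of preference proportions over the thematic clusters and to correlate those vectors pairwise. The sketch below follows that reading with a plain Pearson coefficient; the interpretation, the function names, and the sample values in the usage note are assumptions, not data from the study.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(profiles: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise correlations between university profiles (one value per cluster)."""
    names = sorted(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in names} for a in names}
```

For instance, two hypothetical profiles such as `[0.30, 0.20, 0.50]` and `[0.28, 0.22, 0.50]` are strongly positively correlated, which is the kind of pattern a homogeneous audience would produce.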

Pairwise Effect Sizes (Cohen’s h) per Thematic Cluster

The comparison of universities in terms of users’ preferences within each thematic cluster was performed using a pairwise analysis. This was performed using Cohen’s h as a metric to discover which followers of universities show a significant difference in interest for each identified area.
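Cohen's h compares two proportions after an arcsine transformation: h = 2·arcsin(√p₁) − 2·arcsin(√p₂). A minimal sketch of the pairwise computation for one thematic cluster follows; the function names and the dictionary layout are illustrative.

```python
import math
from itertools import combinations
from typing import Dict, Tuple


def cohens_h(p1: float, p2: float) -> float:
    """Signed effect size between two proportions in [0, 1]."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def pairwise_h(shares: Dict[str, float]) -> Dict[Tuple[str, str], float]:
    """Cohen's h for every unordered pair of universities within one cluster."""
    return {
        (a, b): cohens_h(shares[a], shares[b])
        for a, b in combinations(sorted(shares), 2)
    }
```

By Cohen's conventional benchmarks, absolute values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects, respectively.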

4. Results and Discussion

For each university considered in our study, the number of distinct domains identified is reported in Table 3.

Since TikTok data are varied but influenced by emerging trends, we performed multiple experiments in order to capture the users’ preference profiles with better accuracy. More precisely, to cover multiple viewpoints, we considered different numbers of clusters when grouping domain embeddings computed using RoBERTa [46]. Additionally, we explored different sizes for the vector embeddings as we aimed to reduce noise and irrelevant dimensions, hence generating accurate and quality clustering.

By adjusting the number of clusters, we aimed to record relevant changes when the level of detail changed. Our hypothesis was as follows: if a slight change in the number of considered clusters determines observable differences in university-specific preference profiles, it indicates that the data have complexities that cannot be covered by a single clustering solution. Moreover, this approach allowed us to obtain different perspectives, ranging from specificity to generality. By considering many clusters, one can uncover subtle semantic differences, while fewer clusters provide the main subjects or themes. Furthermore, testing different grouping settings assesses the reliability of our analysis, as consistent results indicate underlying patterns. For this reason, we performed the analysis considering two clustering schemes: one that involves 10 to 15 clusters and another that involves 20 to 25 clusters.

Running K-means with different random states yields different clusterings, because the result depends on the initial placement of the centroids, to which nearby (similar) data points are then assigned to form clusters. It is also worth noting that, for a given random state, the K-means algorithm is deterministic and produces the specified number of clusters.

To overcome this limitation and extract stable and relevant preference themes from noisy data, we performed multiple runs of K-means, considering different numbers of clusters and different random states. Subsequently, consensus clustering was used to identify the clusters that recur across the K-means runs. In particular, in our experiment, we used numbers of clusters in the intervals [10,15] and [20,25] and compared the findings. We considered these relatively large numbers of clusters because social media data are variable and noisy; this approach allowed meaningful patterns to be identified and extracted.

We conducted 100 runs of the K-means algorithm initialized with different random states for each given number of clusters.

We then built the consensus matrix $C$, which is a square, symmetric matrix whose entries $C_{i,j}$ are the frequencies with which the two domains $i$ and $j$ are assigned by the K-means algorithm to the same cluster across all 100 runs (that is, the value $C_{i,j}$ is close to 1 if the domains $i$ and $j$ are almost always clustered together and close to 0 otherwise). Once this matrix is computed, we obtain the distance matrix $D = J_{n} - C$, where $J_{n}$ is the all-ones $n \times n$ square matrix. The distance matrix measures the dissimilarities between any two domains, and it is used to perform agglomerative clustering. More precisely, the agglomerative clustering initially sets all domains into their own clusters and then, iteratively, combines them based on their vicinity until the specified number of clusters is met. In this case, the distance between two clusters is obtained as the average of distances between all pairs of data points from the two clusters. Consequently, the consensus clusters are obtained. To assess each consensus cluster, we compute its stability score by measuring the average co-association similarity among its members. The results obtained after performing these steps are as follows:

For example, the consensus cluster data look like

The stability score for each cluster is as follows:

{0 : 0.88, 1 : 0.93, 2 : 0.71, 3 : 0.79, \dots}

Taking all clusters that exceed the stability threshold of 0.75 (a score that indicates a true semantic theme and yields statistically meaningful confidence), we were able to identify the most impactful domains.
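The consensus procedure described above can be sketched in a few lines. The example below is a simplified illustration, not the SP module's implementation: it assumes that the cluster labels of each K-means run are already available (the toy list `runs` is hypothetical and contains only four runs over six domains), uses a naive average-linkage merge, and treats a singleton cluster as having stability 1.0, which is our own convention.

```python
from itertools import combinations


def consensus_matrix(runs):
    """C[i][j] = fraction of runs in which items i and j share a cluster."""
    n = len(runs[0])
    counts = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(runs) for c in row] for row in counts]


def average_linkage(dist, k):
    """Naive agglomerative clustering with average linkage down to k clusters."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            avg = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or avg < best[0]:
                best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters


def stability(C, members):
    """Average co-association among distinct members (1.0 for a singleton)."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0
    return sum(C[i][j] for i, j in pairs) / len(pairs)


# Hypothetical labels from four K-means runs over six domains.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition, permuted label ids
    [0, 0, 1, 1, 1, 1],  # domain 2 switches cluster in this run
    [0, 0, 0, 1, 1, 1],
]

C = consensus_matrix(runs)
D = [[1.0 - c for c in row] for row in C]  # D = J_n - C
clusters = average_linkage(D, k=2)
scores = {tuple(m): stability(C, m) for m in clusters}
stable = [m for m, s in scores.items() if s > 0.75]
print(scores)
```

Because the consensus matrix only records whether two domains share a cluster, it is insensitive to the arbitrary label identifiers that K-means assigns in each run.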

In our case, the output shows 32 stable clusters:

The SP module names each stable cluster, and in case the same name is found for different clusters, they are merged. This approach reduces noise and creates meaningful semantic categories. However, there might be cases where very similar (but distinct) names are not merged (these cases happen because the LLM might give slightly distinct names even for very similar clusters). For example, in a certain instance (the 10–15 clusters case, PCA size = 768), we obtained the cluster names “Social Interactions and Community” and “Social Interactions Community”. Although these cases appeared, given the small number of resulting clusters, we did not consider further merging very similar clusters (e.g., by employing a string similarity algorithm), as the main themes could be easily extracted.

Remark 1.

In this study, we generated representative cluster names using multiple large language model configurations (including Llama 3.3, Llama 4, and Gemma 3), each evaluated under five temperature settings ($\text{temperature} \in \{0, 0.25, 0.5, 0.75, 1\}$) to capture a broad spectrum of generative variability. The semantic similarity between generated names was measured using cosine similarity over RoBERTa embeddings. In particular, using a 20-cluster schema extracted from the PUB university dataset as a representative baseline, every model–temperature combination produced candidate names for the same underlying clusters. Beyond evaluating cross-configuration consistency, we performed an extensive Monte Carlo robustness analysis, in which we repeatedly sampled 70%, 80%, and 90% of each cluster’s word set and generated 30 independent naming runs per condition. This allowed us to examine the stability of the naming mechanism when the input information was deliberately perturbed.

Across all experiments, naming remained highly consistent: even with heterogeneous LLM architectures and temperature values, most generated names were semantically close to one another and to the baseline (full-information) name. While cross-temperature comparisons for a given model (e.g., Llama 3.3) showed very high agreement (mean cosine similarity $\approx 0.91$), the Monte Carlo results across all clusters, models, and temperatures further strengthen this conclusion. The mean semantic similarity between subsampled and baseline names remained high and increased with sampling coverage (0.744 at 70%, 0.762 at 80%, and 0.772 at 90%, each with narrow 95% confidence intervals of $\approx \pm 0.016$, based on 30 Monte Carlo sampling runs per condition). This monotonic rise confirms that the naming process remains stable even when up to 30% of the cluster vocabulary is randomly dropped. However, it is worth noting that some clusters are inherently easier to name consistently (several have mean subsample-to-baseline similarity >0.85 at 90%), while others induce paraphrase changes (the lowest score is $\approx 0.605$ at 90%). In this respect, a small number of low-scoring clusters decrease the overall mean, leading to semantic-similarity averages in the $(0.7, 0.8)$ range.

Nevertheless, the consistency observed across all experiments suggests that—even with different LLMs and temperature settings—there is an inherent semantic structure in the data that guides the generation of representative names. Importantly for the argument, the topics identified through this algorithmic approach directly map onto well-established thematic domains in sociology, psychology, public health, communication and media studies, political science, anthropology, and cultural studies (which reinforce the external validity of the generated labels). Taken together, the strong alignment between automatically generated cluster names and recognized domains of human interest strengthens confidence in using these names as reliable indicators of cluster content (even without manual expert review). Overall, our multi-run experiments confirm that the naming mechanism is stable across configurations and that the identified themes have solid grounding in established research, thereby validating both the robustness and the interpretability of the resulting cluster names.

For example, Figure 7 reports the named stable clusters discovered by the SP when considering the 20–25 cluster scheme and a PCA dimensionality of 160.

Taking into account all stable clustering results for all clustering schemes (10 to 15 and 20 to 25 cluster schemes) and for all PCA sizes, we obtain the following main common themes:

Social Interactions and Community (Social Interactions and Dynamics, Social Interaction Concepts, Social Interaction Context, Social Interactions and Community, Social Interactions Community, Social Community Dynamics, Social Community Interactions, Social Interactions and Society).
Food and Cuisine (Food and Cuisine, Food and Dining, Food and Cooking, Food and Beverage, Food and Beverage Industry).
Personal Identity and Characteristics (Personal Identity Traits, Personal Characteristics, Personal Characteristics Traits).
Human Concepts and Behavior (Human Nature Concepts, Human-Related Concepts, Human Behavior Traits, Human Interactions and Behavior).
Youth and Development (Youth and Education, Youth and Child Welfare, Youth and Child Development).
Infrastructure and Systems (Road Infrastructure Systems).
Digital Media Technology.

We may conclude that all these major themes represent the main interest streams of the followers of the studied universities.

However, the above approach represents a broader view of the user themes and their engagement with the universities’ TikTok accounts.

Yet another meaningful view on the data relates to the selection of the K-means runs with the highest scores according to the Calinski–Harabasz ($CH$) metric. For a given K-means run (for 10 to 15 clusters/20 to 25 clusters and 100 runs for each) and the corresponding clustering, this score is defined as follows:

CH = \frac{BCSS / (K - 1)}{WCSS / (N - K)}

where BCSS (between-cluster sum of squares) measures the separation between clusters, WCSS (within-cluster sum of squares) measures the compactness or dispersion within clusters, K is the number of clusters, and N is the total number of data points.
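To make the definition concrete, the following sketch evaluates the $CH$ score directly from the formula for a small two-dimensional toy dataset. The points and labelings are hypothetical; in the study the inputs would be the (PCA-reduced) domain embeddings and the K-means assignments.

```python
def calinski_harabasz(points, labels):
    """CH = (BCSS / (K - 1)) / (WCSS / (N - K)) for a labelled point set."""
    n = len(points)
    dim = len(points[0])
    cluster_ids = sorted(set(labels))
    k = len(cluster_ids)
    overall = [sum(p[d] for p in points) / n for d in range(dim)]

    bcss = 0.0  # between-cluster sum of squares
    wcss = 0.0  # within-cluster sum of squares
    for cid in cluster_ids:
        members = [p for p, lab in zip(points, labels) if lab == cid]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        bcss += len(members) * sum((c - o) ** 2 for c, o in zip(centroid, overall))
        wcss += sum(
            sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members
        )
    return (bcss / (k - 1)) / (wcss / (n - k))


# Two well-separated groups of three points each (toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible structure
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

ch_good = calinski_harabasz(points, good_labels)
ch_bad = calinski_harabasz(points, bad_labels)
print(f"CH (good) = {ch_good:.2f}, CH (bad) = {ch_bad:.2f}")
```

A labeling that respects the structure of the data yields compact, well-separated clusters and hence a much higher score, which is why the run with the maximum $CH$ value is selected.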

The results for the Calinski–Harabasz ($CH$) scores are summarized in Table 4.

As reported in Table 4, for the clustering scheme that involves 10–15 clusters, the maximum $CH$ score (52.22) is obtained for k = 10 clusters and a PCA dimension of 160. Figure 8 displays the identified clusters and their representativeness for the followers of each university’s account.

The correlation matrix presented in Table 5 provides insight into whether a theme of interest for one university is also found in another. Higher correlation values indicate that the followers of two universities have similar interests and behaviors, whereas lower values reveal markedly different profiles. Since all off-diagonal values are at least 0.9234 in the presented case, we may conclude that there exists a strong thematic consistency across the universities’ follower bases. In fact, taking into account all the K-means runs for all PCA sizes, we found that the average of the group profile correlation values is 0.96 (the minimum value being 0.74 and the maximum being 0.99), which supports our findings and underscores the thematic coherence among the followers of different universities.

The high values of correlation suggest that a student from Timişoara consumes and shares almost the same content mix as a student from Berlin or Turin. Despite their language and geographical differences, the online repertoires of these students show a common interest in universal topics such as academic anxiety, lifestyle hacks, and technology. The algorithmic effect of these platforms and the student identity in relation to STEM could together be viewed as a form of cultural equalizer. An interesting implication of these findings is that universities may consider replacing a regionally tailored content strategy with one that generates high-quality content related to universal topics.

The heatmap presented in Figure 9 illustrates the percentage distribution of user preferences across thematic clusters and compares the interests of users from different universities. A comparison between the largest effect sizes (Cohen’s h) per thematic cluster is shown in Figure 10.

Similar to the previous analysis, we considered the clustering scheme involving 20–25 clusters. In this case, the maximum $CH$ score (38.04) was obtained for k = 20 clusters and a PCA dimension of 768. Figure 11 displays the identified clusters and their representativeness for each university’s followers.

The charts from Figure 8 and Figure 11 represent the highest $CH$ scores for the different clustering schemes and reveal interesting information:

Human Identity and Characteristics: Both charts include categories describing who people are as individuals; they relate to personality, traits, and individual qualities: Personal Identity Traits (in Figure 8), Human Characteristics Analysis (in Figure 11), and Human Nature Concepts (in Figure 11).
Human Life Experiences and Wellbeing: Both charts include fields relating to lived human experiences, covering physical and emotional experiences: Human Life Concepts (in Figure 11), Human Life Experiences (in Figure 8), Health and Wellbeing (in Figure 11), Human Health and Wellbeing (in Figure 8), and Emotional Distress Themes (in Figure 11).
Social Interaction, Community, and Relationships: Both charts include categories on how people interact, representing social behavior, family life, and group dynamics: Family and Relationships (in Figure 11), Social Group Interactions (in Figure 11), Social Interaction Concepts (in Figure 11), Human Interaction Dynamics (in Figure 8), Social Interactions and Community (in Figure 8), Community and Family Life (in Figure 8), and Social Community Development (in Figure 8).
Media, Technology, and Information: Both charts include topics relating to communication and technology; they are related to information handling and digital domains: Media and Video Production (in Figure 11), Information Processing Services (in Figure 11), and Media and Technology (in Figure 8).
Work, Employment, and Public Services: Both charts contain fields related to public life and professional domains, covering societal infrastructure and labor: Work and Employment (in Figure 11) and Public Services and Leisure (in Figure 8).
Personal Growth, Education, and Self-Development: Both charts show themes of self-improvement and learning, reflecting personal development: Personal Growth Concepts (in Figure 11), Youth and Education (in Figure 11), and Human Life Experiences (in Figure 8).
Social Justice, Global Issues, and Public Perception: Both charts include topics concerning society at large and which relate to public attitudes and societal problems: Social Justice Issues (in Figure 11), Global Social Issues (in Figure 11), Public Perception and Preferences (in Figure 11), Public Services and Leisure (in Figure 8), and Social Community Development (in Figure 8).
Interests, Leisure, and Culture: Both charts discuss what people enjoy and consume. They are related to hobbies, curiosity, and lifestyle content: Random Knowledge Topics (in Figure 11), Food and Cuisine (in Figure 11), and Diverse Human Interests (in Figure 8).

We also computed the silhouette score, but this proved to be unreliable for high-dimensional embeddings and noisy, incomplete, inconsistent, and inaccurate text in the analyzed social media data. Figure 12 reports the silhouette scores across several PCA dimensions and the number of clusters. It is worth mentioning that in our case, the silhouette score was not used as the primary measure of cluster validity. Moreover, this method gives the mathematically optimal clusters and not necessarily the business-relevant ones.

From a statistical point of view, to validate whether university affiliation influences content preference, we computed the chi-squared ($\chi^{2}$) statistic for each clustering iteration. For the large majority of the tests (clustering scheme covering 10 to 15 clusters and 100 distinct random-state K-means runs), the p-value was greater than 0.85, which is well above the significance level ($\alpha = 0.05$). This finding implies that there is no evidence of an association between clusters and groups. For each run, the chi-squared values were also lower than the corresponding degrees of freedom, which suggests that the observed frequencies are very close to the expected ones. The Cramér’s V values indicate that the association between clusters and groups is not only statistically non-significant but also negligible in magnitude. The results are summarized in Table 6 and Table 7.
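The quantities involved in this test can be illustrated with a small sketch. It computes the chi-squared statistic, the degrees of freedom, and Cramér's V for a group-by-cluster contingency table; the counts are hypothetical, and the p-value is deliberately omitted because it requires the chi-squared survival function, which is not part of the Python standard library (for instance, `scipy.stats.chi2.sf` could be used for that step).

```python
from math import sqrt


def chi2_and_cramers_v(table):
    """Return (chi2, degrees of freedom, Cramér's V) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    n_rows, n_cols = len(table), len(col_totals)
    dof = (n_rows - 1) * (n_cols - 1)
    v = sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return chi2, dof, v


# Hypothetical counts: rows = follower groups, columns = thematic clusters.
table = [
    [40, 35, 25],
    [42, 33, 25],
    [39, 36, 25],
]

chi2, dof, v = chi2_and_cramers_v(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, Cramér's V = {v:.3f}")
```

When the groups have nearly identical distributions over the clusters, as in this toy table, the statistic falls below the degrees of freedom and Cramér's V is close to zero, which is the pattern reported for the study data.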

Performing the same tests for the clustering scheme covering 20 to 25 clusters, again with 100 distinct random-state K-means runs, we obtain similar results (see Table 8 and Table 9).

The experiments in our study compare 4800 experimental scenarios (2 clustering schemes × 6 cluster counts × 4 PCA sizes × 100 random states) with various embedding sizes and clustering levels. The study can thus be regarded as a robustness test of the proposed method. The basic statistics remain stable: the mean p-values are consistently above 0.90, and all the values of the mean Cramér’s V are lower than 0.05. This indicates that the homogenization effect is embedded in the data and is not an artifact of parameter tuning.

These findings show that each group (set_PTU, set_PUB, set_PUT, set_TUC, etc.) has nearly the same distribution across clusters. As can be noted in the example presented in Figure 8 and Figure 11, the bars for each theme are nearly overlapping. Correspondingly, the groups do not specialize in certain clusters, and clusters contain about the same proportions of each group. These results suggest several possible causes, a few of which are discussed below:

Social media themes are universal across university account audiences.
The groups represent demographically similar populations (presumably students) sharing the same interests.
Social media platforms (like TikTok, for instance) homogenize the discourse; hence, the users tend to develop the same habits or interests.
Social media platforms’ algorithms seem to promote certain “recipes”; hence, users attempt to conform to the globally successful content format. In the long run, this implies a homogeneous feed for all users.
Social platform algorithms expose users to the same type of videos; hence, they are more likely to redistribute the same content.
The psychographic and developmental characteristics of university students create a general life-stage homogeneity, where shared global goals and interests override national differences.

Finally, as can be observed in the correlation matrix in Table 5, the off-diagonal correlations are as follows: min 0.9234, max 0.9927, mean 0.9711, median 0.9753, and standard deviation 0.022. These statistics, as well as the matrix itself, suggest homogeneity among the profiles; the subject areas’ preferences are highly aligned for all universities considered in our study. However, to determine how different universities group together in terms of content preferences, we performed hierarchical clustering on the group profile correlation matrix, taking into account all possible runs, for both clustering schemes (10–15 and 20–25) and all PCA sizes (120, 160, 200, and 768).

The results are summarized in Table 10.

The data collected from all 4800 runs (two clustering schemes covering 10 to 15 clusters and 20 to 25 clusters, four PCA dimensions for each scheme (120, 160, 200, and 768), and 100 K-means runs for each clustering value) show the following:

The majority (3056) of inferred topologies place set_RTU in a singleton branch.
set_TUC is placed in a singleton branch 2698 times.
set_PTU and set_PUT are placed together 1755 times.
set_RTU and set_PUB are placed together 1244 times.
set_TUB and set_TUC are placed together 848 times.

In particular, for the case presented in Figure 8, we obtained the following dendrogram, which illustrates how the universities considered in our study are grouped based on their similarities (Figure 13).

It should be noted that, while content preference is highly homogenized across the studied universities, the persistent isolation of one particular university profile in a single branch in the clustering suggests that non-homogenizing factors maintain a distinct structural identity for that outlier institution.

Last but not least, we wanted to determine if there is a potential algorithmic effect on the interests intrinsic to the STEM community. To this aim, we performed the following steps: Given the data collected for a university (say, PUB) and the corresponding clustering of subjects (say, the 20 cluster scheme), we scanned every JSON (corresponding to the posts of users of that university) for cluster terms. Then, we recorded the timestamp (the moment of the post). Using these timestamps, we computed three temporal-dispersion metrics for each cluster over a common global window defined by the earliest and latest timestamps found in the dataset (for PUB, from 20 December 2021 to 26 August 2025).

The metrics used in our research were as follows:

The Kolmogorov–Smirnov distance ($KS$), which measures the maximum deviation between the empirical distribution of timestamps and a perfectly uniform distribution over the global window (lower $KS$ indicates that occurrences are more evenly spread, while higher $KS$ indicates clustering or inactivity gaps);
The coefficient of variation of inter-arrival times ($CV$), which captures how irregular the gaps between consecutive timestamps are ($CV \approx 0$ indicates nearly regular spacing, $CV \approx 1$ is Poisson-like, and $CV \gg 1$ indicates strong burstiness);
Normalized Shannon entropy, which captures how evenly timestamps fill equally spaced time bins; values close to 1 indicate broad temporal coverage, whereas lower values indicate concentration in a few periods. In our experiment, we used 12 equal-width bins spanning the global time range in order to balance resolution against statistical stability (avoiding too many empty/low-count bins); in particular, for the mentioned time span, we had approximately 112 days per bin (roughly a third of a year), which is a reasonable scale to capture semester/seasonal patterns without fragmenting the data.
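The three metrics listed above can be sketched as follows. This is a minimal illustration under stated assumptions: timestamps are plain numbers on a common window, the $CV$ uses the population standard deviation of the gaps, and the two toy series (`even` and `bursty`) are hypothetical. The last lines reproduce the bin-width arithmetic for the PUB window.

```python
from datetime import date
from math import log, sqrt


def ks_uniform(times, start, end):
    """Max deviation between the empirical CDF and a uniform CDF on [start, end]."""
    u = sorted((t - start) / (end - start) for t in times)
    n = len(u)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(u))


def cv_interarrival(times):
    """Coefficient of variation (std / mean) of gaps between sorted timestamps."""
    ts = sorted(times)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    mean = sum(gaps) / len(gaps)
    std = sqrt(sum((g - mean) ** 2 for g in gaps) / len(gaps))
    return std / mean


def normalized_entropy(times, start, end, bins=12):
    """Shannon entropy of the bin occupancy, divided by log(bins)."""
    counts = [0] * bins
    for t in times:
        idx = min(int((t - start) / (end - start) * bins), bins - 1)
        counts[idx] += 1
    n = len(times)
    h = -sum(c / n * log(c / n) for c in counts if c)
    return h / log(bins)


# Toy series on the window [0, 12]: one evenly spread, one bursty.
even = [i + 0.5 for i in range(12)]
bursty = [0.1, 0.11, 0.12, 0.13, 11.9]

for name, series in (("even", even), ("bursty", bursty)):
    print(
        name,
        round(ks_uniform(series, 0, 12), 3),
        round(cv_interarrival(series), 3),
        round(normalized_entropy(series, 0, 12), 3),
    )

# Bin width for the PUB window (20 December 2021 to 26 August 2025).
window_days = (date(2025, 8, 26) - date(2021, 12, 20)).days
days_per_bin = window_days / 12
print(window_days, round(days_per_bin, 1))
```

The evenly spread series gives a small $KS$ distance, a $CV$ of zero, and an entropy of one, whereas the bursty series moves all three metrics in the opposite direction.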

Across the 20 clusters, the results show moderate temporal spread but pronounced burstiness: $KS$ values range from 0.559 to 0.623 (mean $\approx 0.583$), entropy values cluster around 0.60, and $CV$ values range from 2.35 to 4.65 (mean $\approx 3.47$), none of which are compatible with an evenly distributed temporal pattern. The findings indicate that topics are present across the entire multi-year window but appear in waves rather than uniformly, suggesting temporal clustering, potentially driven by platform trends, events, event cycles (exams/holidays/politics), or algorithmic reinforcement, rather than purely intrinsic steady-state interests.

From a broad perspective, the findings contradict previous suppositions about differences between regions or countries relating to persona preferences on social media. The uniformity of content affinities can be explained both by the algorithms of the social media platform and by the psychographic characteristics of the audience consuming the short-form video content. This, in turn, underlines the role of global digital platforms such as TikTok in the formation of a unified cultural and educational space for students, implying the existence of a “digital student persona” beyond national boundaries, despite possible differences at the local level.

Finally, the following limitations of the study are acknowledged. First, the homogeneity of the content may, at least to some extent, be augmented by the platform’s algorithmic bias towards promoting the most globally successful content formats. Second, with regard to user demographics, although the study of the locality effect and the statistics of student mobility support the assumption that most of the followers are local students, the study’s inference must rely on indirect evidence due to the non-public nature of specific statistical information from the universities. Third, although we used a low temperature to increase determinism, the inherently probabilistic nature of LLMs introduces a degree of variability in semantic summarization that we attempted to overcome through consensus clustering of multiple experimental runs.

5. Conclusions

This study aims to determine the level of homogenization in TikTok user preferences among technical university students in Europe. Using the proposed multimodal approach within the MIPS framework, this study presents empirical evidence of a high level of homogenization in user content preferences among these technical university students, thus calling into question existing theories of cultural regional differentiation.

The main contribution of this study is the empirical evidence of a high level of thematic homogenization in user content preferences, as shown by high levels of clustering and correlation among different follower cohorts. Such a finding supports the emergence of a “digital student persona” transcending national boundaries, as influenced by algorithmic trends and the STEM academic lifestyle. Instead of focusing on cultural regional differentiation in social media communication strategies, universities should now recognize the existence of a globalized algorithmic trend and universal student interests, such as those related to academics, technology, and health, as a cultural equalizer. As such, a more appropriate social media communication strategy should focus on generating high-quality content related to these universally relevant themes.

It is worth noting that although the study focused on the TikTok platform, a more exhaustive approach could include additional social media platforms (like Instagram, Facebook, and X) or academic forums. This extensive data collection might capture a larger range of user behaviors and content preferences, thereby making the analysis more detailed.

With regard to methodological enhancements, it may be possible to optimize the MIPS pipeline. For example, different LLMs might be better suited for identifying separate entities. Moreover, incorporating reasoning with assertion extraction could improve interpretation. It would be possible to upgrade the frame description component with multi-modal models capable of understanding video context at significant points.

Nevertheless, it is worth noting that summaries generated by the LLMs are subject to prompt design, and there might exist variations due to the probabilistic nature of LLM operation. Similarly, assigning semantic labels to clusters is also a potential source of variability (or even of errors); moreover, as shown in the paper, LLMs may produce slightly different names for similar themes, which may require manual review or other NLP techniques to merge those with the same semantics. However, even if LLMs’ outputs are (possibly) biased, incomplete, or even incorrect, in this particular case of analyzing social media data, they provided some valuable insights about users’ preferences. To overcome this limitation comprehensively in the future, a human-in-the-loop validation process will be included. For instance, the plan is to conduct expert human evaluations on randomly selected subsamples of the AI-generated labels and the video content to qualitatively evaluate the robustness and cultural accuracy of the clustering process.

Although the usefulness of these insights was empirically demonstrated throughout the paper (given the large number of individuals studied and the number of test cases performed), it should be noted that there might be other important factors that were underrepresented (for example, user demographics or online behaviors) that might influence the accuracy and reliability of the study. This represents a future line of research that could enhance the presented methodology.

Within the same context, a further potential development is a stability study. For example, it would be interesting to see whether slight changes in the input (frame extraction/noise addition to input audio/video) determine a bounded change in the output (in particular, to determine if the developed system has a bounded sensitivity, characterized by a finite Lipschitz constant). This can be achieved because the MIPS sequential pipeline can be intuitively modeled as a composition of functions

F = f_{KGD} \circ f_{CA} \circ f_{MIG}

where

$f_{MIG} : V \to D$, which maps raw videos $V$ to structured data $D$;
$f_{CA} : D \to T$, which maps multimedia data to text representations $T$ (transcripts, summaries);
$f_{KGD} : T \to G$, which maps text to a graph $G$ (nodes = entities, edges = relations).

Hence, by considering

$x$ = the original input video,
$x'$ = the perturbed input video clip,
$f(x)$ = output eigenvector of the resulting knowledge graph produced by the LLM,
$d(\cdot, \cdot)$ = distance metric (cosine distance),

one can define the stability condition

\lvert f(x) - f(x') \rvert \leq L \, \lvert x - x' \rvert

(1)

where L is a Lipschitz constant. If Relation (1) holds, then one could say that the pipeline is stable under small perturbations.
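One possible way to probe Relation (1) empirically is to evaluate the ratio between output and input distances over a set of perturbed inputs. The sketch below is purely illustrative: `toy_pipeline` is a hypothetical stand-in for the MIPS pipeline, the inputs are short numeric vectors rather than videos, and Euclidean distance is used as a placeholder for the distance metric (the distance functions are passed as parameters so that another metric, such as cosine distance, could be substituted).

```python
from math import sqrt


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def empirical_lipschitz(f, pairs, d_in=euclidean, d_out=euclidean):
    """Largest observed ratio d_out(f(x), f(x')) / d_in(x, x') over the pairs."""
    ratios = [
        d_out(f(x), f(xp)) / d_in(x, xp)
        for x, xp in pairs
        if d_in(x, xp) > 0  # skip identical inputs to avoid division by zero
    ]
    return max(ratios)


def toy_pipeline(x):
    """Hypothetical stand-in for F: doubles every coordinate."""
    return [2.0 * a for a in x]


# Hypothetical (original, perturbed) input pairs.
pairs = [
    ([1.0, 0.0], [1.1, 0.0]),
    ([0.0, 1.0], [0.0, 1.3]),
    ([1.0, 1.0], [0.9, 1.2]),
]

l_hat = empirical_lipschitz(toy_pipeline, pairs)
print(f"empirical Lipschitz estimate: {l_hat:.3f}")
```

Such an estimate is only a lower bound on the true Lipschitz constant, since it reflects the perturbations that were actually sampled; a bounded value across many perturbations would nevertheless be consistent with the stability condition.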

Other new lines of research might include alternative clustering methods such as spectral clustering or density-based methods (DBSCAN). Additionally, the analysis could include time-series clustering aimed at capturing changes in content preferences over time. Another idea to explore would be to use quantitative metrics in the knowledge graph (such as node counts and/or centrality measures) to validate the findings: for example, if the pattern of related statements (or nodes) around a given input keyword is similar in number and structure across different university-affiliated video clip sets, this supports the notion that content preferences are homogenized regardless of regional or institutional differences.

Lastly, MIPS was specifically designed as a privacy-preserving architecture that uses local LLMs for large-scale social media analytics within the context of public technical higher education. The present study uses publicly available TikTok videos associated with the universities under investigation. In accordance with the General Data Protection Regulation (GDPR), no personally identifiable information was collected, processed, or stored during the research. All data analyzed originated exclusively from publicly accessible content, and no private, restricted, or sensitive material was used. The analysis focuses on aggregated patterns reflecting the general interests of the universities’ followers, without reference to individual users.

All multimedia content (videos, images, and soundtracks) was accessed solely for analytical purposes and was deleted immediately after use, ensuring compliance with the GDPR principles of data minimization and storage limitation. The output knowledge graph was built by enforcing robust anonymization measures that prevent singling-out, linkability, and inference.

Access to publicly available data was facilitated through TikTok’s Research Account. There was no collaboration with TikTok beyond the use of this account, and the platform had no material interest or influence on the study design, data analysis, or research findings.

The findings of this study are published openly, with the aim of promoting transparency and contributing to the common and public good. The present research was conducted without commercial purposes and follows the principles of European copyright and research exceptions.

Author Contributions

Conceptualization, D.-F.S. and M.B.; Methodology, D.-F.S. and M.B.; Software, D.-F.S.; Formal analysis, D.-F.S. and M.B.; Investigation, D.-F.S. and M.B.; Resources, M.B.; Data curation, D.-F.S. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw media (videos, images, audio) used to generate the knowledge graph were processed locally and deleted immediately after extraction, in line with our privacy-by-design approach and GDPR compliance. The processed, anonymized JSON outputs (containing no personally identifiable information) may be made available to reviewers upon request solely for verification purposes.

Acknowledgments

D. Sburlan acknowledges the collaboration and expertise provided by the European university alliance—Artemis (Alliance for Regional Transition, Equality, Mobility, Inclusion, and Sustainability, Erasmus+, ERASMUS-EDU-2024-EUR-UNIV-1), which enabled some of the research presented here.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Sampling Formulas

We considered maximum variability for a binomial probability ($p = 0.5$) and a 95% confidence level ($Z = 1.96$) with a 5% margin of error ($e = 0.05$). The initial sample size $n_{0}$ was calculated as

$$n_{0} = \frac{Z^{2} \cdot p \cdot (1 - p)}{e^{2}} \tag{A1}$$

This calculation provides a reference value of $n_{0} \approx 385$. This baseline $n_{0}$ is then adjusted for the specific population, $N$, using the finite population correction (FPC) formula to obtain the final sample size required, $n$:

$$n = \frac{n_{0}}{1 + \frac{n_{0} - 1}{N}} \tag{A2}$$

Because of the anticipated data loss (estimated at 25%), the calculated target sample size ($n$, written $n_{\mathrm{theoretical}}$ below) was inflated by 25% to make sure the final achieved sample ($n_{\mathrm{adjusted}}$) meets or exceeds the statistically required number for 95% confidence:

$$n_{\mathrm{adjusted}} = n_{\mathrm{theoretical}} \cdot 1.25 \tag{A3}$$
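The three formulas above can be chained in a few lines of Python. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the rounding rules (nearest integer for the theoretical sample, rounding up for the adjusted sample, and using the unrounded value of Equation (A1) rather than 385) are assumptions chosen because they appear to reproduce the sample columns of Table 2. Populations are taken from Table 1.

```python
from math import ceil


def initial_sample_size(z=1.96, p=0.5, e=0.05):
    """Eq. (A1): sample size for an effectively infinite population."""
    return (z ** 2) * p * (1 - p) / (e ** 2)


def fpc_sample_size(n0, population):
    """Eq. (A2): finite population correction applied to n0."""
    return n0 / (1 + (n0 - 1) / population)


def adjusted_sample_size(n_theoretical, loss_factor=1.25):
    """Eq. (A3): inflate for expected data loss (rounding up is an assumption)."""
    return ceil(n_theoretical * loss_factor)


def plan(population):
    """Return (theoretical, adjusted) sample sizes for one follower population."""
    n0 = initial_sample_size()  # unrounded value, about 384.16 (assumption)
    n_theoretical = round(fpc_sample_size(n0, population))
    return n_theoretical, adjusted_sample_size(n_theoretical)


# Follower populations as listed in Table 1.
populations = {
    "PUT": 1716, "PUB": 7058, "TUC": 790,
    "TUB": 1168, "PTU": 8867, "RTU": 7225,
}

for code, size in populations.items():
    theoretical, adjusted = plan(size)
    print(f"{code}: theoretical={theoretical}, adjusted={adjusted}")
```

Because the correction in Equation (A2) shrinks the required sample only when $N$ is small relative to $n_{0}$, the smallest population (TUC) needs to sample a much larger share of its followers than the largest one (PTU).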

Appendix B. Technical Details of the MIPS Pipeline

This appendix provides the technical artifacts and algorithmic structures used within the Media Information Processing System (MIPS) to ensure reproducibility.

Appendix B.1. Algorithmic Pipeline

The pipeline of the MIPS architecture is described by the following pseudocode.

Algorithm A1: Media Information Processing Pipeline

// Media Information Processing Pipeline
university_list ← getUniversityList() ▹ Obtain list of universities
for each university in university_list do
    followers ← MediaInformationGatherer.getFollowers(university)
    for each user in followers do
        video_clips ← MediaInformationGatherer.getVideoClips(user)
        for each videoclip in video_clips do
            raw_transcript ← ContentAnalyzer.getTranscript(videoclip)
            translated_text ← ContentAnalyzer.translate(raw_transcript)
            images ← videoclip.getFrames()
            images_desc ← ContentAnalyzer.analyzeImages(images)
            ContentAnalyzer.gdprDeleteVideoclip(videoclip) ▹ Raw media is deleted once features are extracted
            ContentAnalyzer.gdprDeleteImages(images)
            summary ← ContentAnalyzer.summarize(translated_text, images_desc)
            json_knowledge ← KnowledgeGraphDesigner.buildKnowledge(summary)
            KnowledgeGraphDesigner.uploadIntoGraphDB(json_knowledge)
        end for
    end for
end for
all_domains ← [ ] ▹ List of (domain, university)
for each university in university_list do
    domains_for_univ ← MediaInformationGatherer.getUniversityDomains(university)
    all_domains.extend(domains_for_univ)
end for
domains_vectors ← StatisticsProfiler.embedDomains(all_domains, model="Roberta")
clusters ← StatisticsProfiler.KMeansClustering(domains_vectors, num_clusters=20)
cluster_names ← StatisticsProfiler.nameClusters(clusters)
for each cluster_name in cluster_names do
    universities_in_cluster ← cluster_names.getUniversities(cluster_name)
    for each university in universities_in_cluster do
        profile_data ← StatisticsProfiler.countAndNormalize(university, cluster_name)
        StatisticsProfiler.saveProfileData(profile_data)
    end for
end for
comparison_results ← PredictiveProfiler.compareResults()
// End of Media Information Processing Pipeline
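To make the last stage of the pseudocode more concrete, the sketch below illustrates one plausible reading of countAndNormalize and compareResults: each university's cluster assignments are turned into a relative-frequency profile, and two profiles are compared with the Pearson correlation coefficient. This is an illustration under assumptions, not the MIPS implementation. The function names, the cluster names, the university labels, and the toy data are all hypothetical.

```python
from collections import Counter
from math import sqrt


def count_and_normalize(cluster_labels, cluster_names):
    """Convert a list of cluster labels into relative frequencies per cluster."""
    counts = Counter(cluster_labels)
    total = sum(counts[name] for name in cluster_names)
    if total == 0:
        raise ValueError("no labels fall into the given clusters")
    return [counts[name] / total for name in cluster_names]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: one cluster label per extracted domain mention.
cluster_names = ["music", "sports", "tech"]
labels_by_university = {
    "UNIV_A": ["music", "music", "tech", "sports"],
    "UNIV_B": ["music", "tech", "tech", "sports"],
}

profiles = {
    univ: count_and_normalize(labels, cluster_names)
    for univ, labels in labels_by_university.items()
}
r = pearson(profiles["UNIV_A"], profiles["UNIV_B"])
print(profiles)
print(f"Pearson r = {r:.2f}")
```

Normalizing before comparing matters because the universities contribute different numbers of domains (see Table 3); correlating raw counts would mix differences in volume with differences in interest.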

Appendix B.2. Knowledge Graph Designer Prompts

The primary extraction prompt used to generate the JSON output for each chunk is as follows:

Analyze the given schematic text and break it down into its constituent assertions to build a JSON file with a ’brief’ key summarizing the text, and an ’assertions’ key. The ’assertions’ represents a list of elements corresponding to the identified statements. Each element must have the keys: ’stm’ is a string stating the fact or opinion, ’id’ is string representing a unique ID, ’summary’ is a string with a brief summary, ’domain’ is a string representing a listing of relevant comma-separated domains, ’keywords’ is a string representing four relevant comma-separated keywords, and ’knowledge’ is an object in structured JSON format with the following components: ’entities’—an array of objects, each representing a distinct real-world concept (each entity is defined only by: ’name’—a string denoting the proper name of the entity; ’type’—a string describing the entity category; ’prop’—a JSON array of strings representing relevant characteristics) and ’relationships’: an array of objects (with at least two distinct items), each expressing a directed relation between two previously defined entities. Each relationship must include: ’source’—an object with ’name’ and ’type’ corresponding to one of the defined entities; ’relation’: a string describing the nature of the relationship from source to destination; ’destination’: an object with ’name’ and ’type’ corresponding to another defined entity. In relationships, you have to only use the discovered entities and nothing else. Produce only the JSON output following this structure strictly.
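Since LLM output does not always follow a requested structure, a consumer of this prompt would plausibly check each response before loading it into the graph database. The sketch below shows one way such a check could look. It is an assumption-laden illustration rather than part of MIPS: the function names and the sample record are hypothetical, and only the key names and constraints are taken from the prompt text.

```python
import json

REQUIRED_KEYS = ("stm", "id", "summary", "domain", "keywords", "knowledge")


def node_key(node):
    """Identify an entity by its (name, type) pair; None if malformed."""
    if isinstance(node, dict):
        name, kind = node.get("name"), node.get("type")
        if isinstance(name, str) and isinstance(kind, str):
            return (name, kind)
    return None


def validate_output(raw_text):
    """Return a list of problems; an empty list means the structure looks valid."""
    try:
        doc = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]
    problems = []
    if not isinstance(doc.get("brief"), str):
        problems.append("'brief' must be a string")
    assertions = doc.get("assertions")
    if not isinstance(assertions, list):
        return problems + ["'assertions' must be a list"]
    for i, item in enumerate(assertions):
        if not isinstance(item, dict):
            problems.append(f"assertion {i}: must be an object")
            continue
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            problems.append(f"assertion {i}: missing keys {missing}")
        keywords = item.get("keywords")
        if isinstance(keywords, str) and len(keywords.split(",")) != 4:
            problems.append(f"assertion {i}: expected four keywords")
        knowledge = item.get("knowledge")
        if not isinstance(knowledge, dict):
            if "knowledge" in item:
                problems.append(f"assertion {i}: 'knowledge' must be an object")
            continue
        entities = knowledge.get("entities")
        relationships = knowledge.get("relationships")
        if not isinstance(entities, list) or not isinstance(relationships, list):
            problems.append(f"assertion {i}: 'entities' and 'relationships' must be arrays")
            continue
        known = {node_key(entity) for entity in entities} - {None}
        if len(relationships) < 2:
            problems.append(f"assertion {i}: fewer than two relationships")
        for j, rel in enumerate(relationships):
            for end in ("source", "destination"):
                node = rel.get(end) if isinstance(rel, dict) else None
                if node_key(node) not in known:
                    problems.append(
                        f"assertion {i}, relationship {j}: {end} is not a defined entity"
                    )
    return problems


# Hypothetical record, invented for illustration only.
student = {"name": "student", "type": "person"}
robot = {"name": "robot", "type": "device"}
sample = {
    "brief": "A student demonstrates a robot.",
    "assertions": [{
        "stm": "A student demonstrates a wheeled robot in a lab.",
        "id": "a1",
        "summary": "Robot demo",
        "domain": "robotics, education",
        "keywords": "robot, student, demo, lab",
        "knowledge": {
            "entities": [dict(student, prop=["presenter"]), dict(robot, prop=["wheeled"])],
            "relationships": [
                {"source": student, "relation": "demonstrates", "destination": robot},
                {"source": robot, "relation": "is operated by", "destination": student},
            ],
        },
    }],
}
print(validate_output(json.dumps(sample)))
```

Returning a list of problems instead of raising on the first one makes it easier to log every deviation of a response, or to feed the list back to the model in a retry prompt.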

Appendix B.3. Data Structure Examples

Example of the aggregated raw data format prior to statistical processing:


Figure 1. The MIPS architecture pipeline includes five main stages: Media Information Gatherer (handling various media formats), Content Analyzer (involving natural language processing, automatic speech recognition, computer vision, and machine learning tasks), Knowledge Graph Designer (involving large language models), Statistics Profiler, and Predictive Profiler.

Figure 2. The knowledge graph associated with a video clip (the NEO4J view—the truncated texts of nodes are detailed nearby).

Figure 3. Relationship between sample size and margin of error with FPC.

Figure 4. Theoretical vs. adjusted sample size by university.

Figure 5. Relationship between population and sampling percentage.

Figure 6. A Python dictionary representing the domain counts for each university. Here, all double-precision floating-point values are truncated to four decimal places.

Figure 7. Stable clusters obtained for the 20–25 cluster scheme and PCA = 160.

Figure 8. The clustering achieving the highest CH score across multiple runs and different PCA sizes.

Figure 9. The heatmap serves as a visual representation of the comparative analysis of user preferences among different universities.

Figure 10. A visual representation comparing the largest effect sizes (Cohen’s H) per thematic cluster.

Figure 11. The clustering achieving the highest CH score across multiple runs and different PCA sizes (768).

Figure 12. Silhouette scores chart for different numbers of clusters.

Figure 13. Dendrogram illustrating the hierarchical clustering of universities based on similarity.

Table 1. Follower population by university.

University	Code	Country	Username	Population
Politehnica University Timisoara	PUT	Romania	upt.ro	1716
Polytechnic University of Bucharest	PUB	Romania	upb1818	7058
Technical University of Cluj-Napoca	TUC	Romania	ut.cluj	790
Berlin Institute of Technology	TUB	Germany	tu_berlin	1168
Polytechnic University of Turin	PTU	Italy	politecnicotorino	8867
Russian Technological University	RTU	Russia	rtu.mirea	7225

Table 2. Determination of sample sizes and margins of error per university.

Code	Population	Theoretical Sample	Adjusted Sample	Margin of Error (±%)
PUT	1716	314	393	4.34
PUB	7058	364	455	4.44
TUC	790	259	324	4.18
TUB	1168	289	362	4.28
PTU	8867	368	460	4.45
RTU	7225	365	457	4.44

Table 3. Number of identified distinct domains per university.

University:	RTU	PTU	TUB	PUB	PUT	TUC
No. of domains:	1295	1109	1240	1443	1438	1197

Table 4. Maximum and minimum CH scores for different numbers of clusters and various PCA dimensionalities. The number of clusters at which each value occurs is given in parentheses.

Cluster range	CH score	PCA 120	PCA 160	PCA 200	PCA 768
10–15 clusters	max value (nr. clusters)	52.09 (10)	52.22 (10)	52.19 (10)	52.15 (10)
10–15 clusters	min value (nr. clusters)	41.51 (15)	41.63 (15)	40.80 (15)	40.85 (15)
20–25 clusters	max value (nr. clusters)	38.01 (20)	38.02 (20)	38.03 (20)	38.04 (20)
20–25 clusters	min value (nr. clusters)	32.69 (24)	32.64 (25)	32.61 (25)	32.59 (25)

Table 5. Group profile correlations when the number of clusters is 10 and CH = 52.22 is the maximum.

	RTU	PTU	TUB	PUB	PUT	TUC
RTU	1.0000	0.9505	0.9380	0.9617	0.9358	0.9234
PTU	0.9505	1.0000	0.9790	0.9806	0.9927	0.9866
TUB	0.9380	0.9790	1.0000	0.9646	0.9714	0.9827
PUB	0.9617	0.9806	0.9646	1.0000	0.9865	0.9559
PUT	0.9358	0.9927	0.9714	0.9865	1.0000	0.9716
TUC	0.9234	0.9866	0.9827	0.9559	0.9716	1.0000

Table 6. A summary of chi-squared ($χ^{2}$) and p-values indicating the level of significance in the analysis.

	Mean $χ^{2}$	Max $χ^{2}$	Min $χ^{2}$	Mean p	Max p	Min p
PCA120	41.43	65.63	21.89	0.91	0.99	0.17
PCA160	40.83	64.44	20.76	0.92	0.99	0.30
PCA200	40.69	69.05	20.41	0.91	0.99	0.27
PCA768	40.55	65.94	20.29	0.92	0.99	0.38

Table 7. A measure of the correlation between variables, with Cramer’s V values indicating the degree of association.

	Mean Cramer’s V	Max Cramer’s V	Min Cramer’s V
PCA120	0.0325	0.041	0.024
PCA160	0.0323	0.041	0.023
PCA200	0.0322	0.042	0.023
PCA768	0.0322	0.041	0.023

Table 8. A summary of chi-squared ($χ^{2}$) and p-values indicating the level of significance in the analysis.

	Mean $χ^{2}$	Max $χ^{2}$	Min $χ^{2}$	Mean p	Max p	Min p
PCA120	79.56	114.45	51.90	0.95	0.99	0.44
PCA160	79.92	118.84	51.05	0.94	0.99	0.31
PCA200	80.00	115.15	53.45	0.95	0.99	0.50
PCA768	80.56	114.77	50.53	0.94	0.99	0.38

Table 9. A measure of the correlation between variables, with Cramer’s V values indicating the degree of association.

	Mean Cramer’s V	Max Cramer’s V	Min Cramer’s V
PCA120	0.045	0.054	0.037
PCA160	0.045	0.055	0.036
PCA200	0.045	0.055	0.037
PCA768	0.045	0.055	0.036

Table 10. The first five topologies in terms of occurrences across all runs.

Topology	Cases
((set_RTU,set_PUB),((set_PTU,set_PUT),(set_TUB,set_TUC)))	168
(set_RTU,(((set_PTU,set_PUT),(set_TUB,set_PUB)),set_TUC))	145
(set_RTU,((set_PTU,set_PUT),((set_TUB,set_PUB),set_TUC)))	120
(set_RTU,(set_PTU,(((set_TUB,set_PUT),set_PUB),set_TUC)))	92
((set_RTU,set_PUB),(((set_PTU,set_PUT),set_TUB),set_TUC))	78

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
