3.2. Vocabulary, Morphology, and Orthography
The cleaned vocabulary extracted from the manuscript consists of unique tokens, which were ranked by frequency to reveal the most common words. This ranking serves the broader aim of the study, namely to characterize the predominant themes of the text. The 50 most frequent words are shown in
Figure 5. This distribution can be described as having three main segments: the plot shows an extremely steep initial drop among the first few high-frequency words, with the most frequent word (
Raja) reaching approximately 400, rapidly falling to around 200 to 250 by the fourth or fifth ranked word.
Following this head, the curve enters a moderate decline across the mid-frequency range, which spans roughly from 200 down to 100 (e.g., from bernama down to perempuan), where the frequency decreases steadily but at a much slower rate. Finally, the distribution ends in a long, flat tail where many words (e.g., tuan, membawa, tinggal) cluster at similar, low frequencies of around 50, indicating that most of the vocabulary contributes little to the total word count.
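The ranking and the three-segment reading described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the token list is a hypothetical toy sample, and the cut-offs `head_min=200` and `mid_min=100` are assumptions read off the approximate values quoted in the text.

```python
from collections import Counter

def top_words(tokens, n=50):
    """Return the n most frequent tokens as (word, count) pairs, highest first."""
    return Counter(tokens).most_common(n)

def segment(count, head_min=200, mid_min=100):
    """Label a frequency as head, mid, or tail (thresholds are assumptions)."""
    if count >= head_min:
        return "head"
    if count >= mid_min:
        return "mid"
    return "tail"

# Hypothetical toy tokens; the real input would be the cleaned manuscript tokens.
tokens = ["raja", "saudara", "raja", "tuan", "raja", "saudara", "perahu"]
ranked = top_words(tokens, n=3)
print(ranked)
print([segment(c) for c in (400, 150, 50)])
```

Plotting the counts returned by `top_words` against rank would give a curve of the kind summarized above.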
Raja (king) indicates that monarchy or royal figures are central to the saga, which involves topics related to the kingdom, rulers, and governance. The next most frequent words, saudara (brother/sibling/relative), tuan (sir/master), and bala (troops), suggest formal address or social hierarchy and point to structured interpersonal and societal roles. Other commonly used terms, such as tanah (land), foreground territory, while belanda (Dutch), kapitan (captain), and syahbandar (harbor master) hint at a colonial or maritime context, emphasizing the period of European influence in the Banda region. Anak (child), putri (princess), and perempuan (woman) show family and gender elements in the narrative. Perahu (boat), laut (sea), and banda (Banda Islands) reinforce the idea that the setting involves coastal or island communities, typical of the Maluku archipelago, specifically Banda. Words such as Allah (God) and malam (night) suggest spiritual or poetic undertones, possibly reflecting the cultural and religious environment of the time.
In morphology, affix analysis shows that prefixation dominates the corpus, with action-oriented morphemes such as
me- and
di- most salient, signaling active and passive voice in narrative clauses (e.g.,
mendengar, membawa, menjawab versus
disebut, dibawa, diangkat, dijawab), shown in
Figure 6. The bar chart shows that prefixes are more frequent than suffixes in the corpus, with 906 prefixed words and 616 suffixed words in total. The three most frequent prefixes are
me- with 290 tokens (32.0%),
ber- with 179 tokens (19.8%), and
di- with 172 tokens (19.0%), followed by
ke- (73 tokens; 8.1%),
ter- (52 tokens; 5.7%),
pe- (47 tokens; 5.2%),
se- (45 tokens; 5.0%),
per- (42 tokens; 4.6%), and
ku- (6 tokens; 0.7%). In contrast, suffix usage is more concentrated in a few forms. The most dominant suffix is -an with 324 tokens (52.6%), followed by -nya with 114 tokens (18.5%), -lah with 107 tokens (17.4%), -i with 51 tokens (8.3%), and the less frequent -ku and -mu, each appearing 10 times (1.6%).
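The reported shares follow directly from the token counts. The short check below uses only the counts stated above; the helper name `shares` is introduced here for illustration.

```python
# Token counts per affix as reported in the text.
prefix_counts = {"me-": 290, "ber-": 179, "di-": 172, "ke-": 73, "ter-": 52,
                 "pe-": 47, "se-": 45, "per-": 42, "ku-": 6}
suffix_counts = {"-an": 324, "-nya": 114, "-lah": 107, "-i": 51,
                 "-ku": 10, "-mu": 10}

def shares(counts):
    """Return each affix's percentage of the group total, rounded to 0.1."""
    total = sum(counts.values())
    return {affix: round(100 * c / total, 1) for affix, c in counts.items()}

prefix_shares = shares(prefix_counts)
suffix_shares = shares(suffix_counts)
print(sum(prefix_counts.values()), prefix_shares)
print(sum(suffix_counts.values()), suffix_shares)
```

The totals come to 906 and 616, and the rounded shares match the percentages given for each affix.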
Lexemes with ber- foreground intransitive or middle actions (bernama, berjalan, berbicara), while pe- and per- nominalize processes and roles (pekerjaan, penduduk, pemimpin; perkataan, perjanjian), helping build the social and institutional texture of the story world. Quantifying and delimiting meanings surface through se- (sebelah, sebesar, seekor, segenap), and ke- frequently marks ordinals and abstract states (keempat, ketiga, keesokan, kesusahan), structuring chronology and circumstance; ter- highlights stative or superlative readings and suddenness/emergence (tentulah, terkejut, terdengar), fitting moments of discovery or affect.
Suffixation is also prominent, led by -nya, which encodes definiteness and third-person possession (saudaranya, istrinya), indicating referential continuity across scenes; -an productively derives nouns and collectives (perkataan, perjanjian, pakaian), while -i marks applicative or locative verbal nuance (mengetahui, mengikuti). Possessive closeness and address appear in -ku and -mu (mulutku, milikku, kepadamu), and the pragmatic particle -lah (datanglah, baiklah) punctuates directive or emphatic turns. Together, these affix patterns show a narrative that is densely verbal, agentive, and referentially cohesive: voice alternations drive event progression, nominalizations scaffold themes and institutions, quantifiers order time and space, and possession/definiteness maintain character continuity.
To identify orthographic patterns, the tokens were checked against the reference dictionaries.
Figure 7 shows the frequency of selected orthographic sequences (sy, kh, ny, ng, ai, au, dz) per 10,000 characters across the dictionary categories, indicating whether sy/kh are skewed toward the PRPM-only category and whether ny/ng or certain diphthongs are skewed elsewhere. Word length can provide valuable insights into linguistic patterns and help us understand the nature of the vocabulary within different categories. In the “Shared” category, shorter words (typically 4–6 letters) dominate, indicating that these words likely represent core vocabulary with common roots shared between Malay and Indonesian.
In contrast, the “KBBI-only” category contains longer words, which are often affixed or derived forms of Indonesian, such as menyampaikan or peraturan. These longer words suggest a higher degree of morphological complexity in Indonesian vocabulary. The “PRPM-only” category, with its relatively longer words, seems to reflect classical or Arabic-influenced spellings, seen in examples like syahbandar or khidmat. This points to the historical influence of Arabic and classical Malay on the lexicon of PRPM. Lastly, the presence of very long or short words in the “Neither” category could indicate the inclusion of archaic terms, compound words, or possibly misspelt tokens, reflecting the complexity or inconsistency in word usage that falls outside both dictionaries. These insights reveal how word length and structure can reflect the underlying linguistic, historical, and cultural influences on the language.
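A minimal sketch of the dictionary bucketing and the per-10,000-character rate might look as follows. The two word sets are tiny hypothetical stand-ins for the KBBI and PRPM word lists, and the function names are introduced here for illustration only.

```python
def bucket(word, kbbi, prpm):
    """Assign a word to Shared, KBBI-only, PRPM-only, or Neither."""
    in_kbbi, in_prpm = word in kbbi, word in prpm
    if in_kbbi and in_prpm:
        return "Shared"
    if in_kbbi:
        return "KBBI-only"
    if in_prpm:
        return "PRPM-only"
    return "Neither"

def per_10k_chars(words, pattern):
    """Occurrences of a letter sequence per 10,000 characters of the word list."""
    total_chars = sum(len(w) for w in words)
    hits = sum(w.count(pattern) for w in words)
    return 10_000 * hits / total_chars if total_chars else 0.0

# Hypothetical miniature dictionaries (assumptions, not the real word lists).
kbbi = {"raja", "laut", "menyampaikan", "peraturan"}
prpm = {"raja", "laut", "syahbandar", "khidmat"}

tokens = ["raja", "laut", "menyampaikan", "syahbandar", "khidmat", "boelan"]
groups = {}
for tok in tokens:
    groups.setdefault(bucket(tok, kbbi, prpm), []).append(tok)

for name, words in groups.items():
    mean_len = sum(len(w) for w in words) / len(words)
    print(name, round(mean_len, 1), round(per_10k_chars(words, "sy"), 1))
```

Normalizing by character count rather than by word count keeps the rates comparable between categories whose typical word lengths differ.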
The orthographic footprint of the Van Ophuijsen system measurably shapes the corpus: graphemic correspondences such as oe → /u/ (boelan, boenga, batoe), dj → /j/ (djakaria, djohor), tj → /c/ (tjakbeir, tjekalele), and j → /y/ (aij, ajeir, kelij) systematically inflate the “Neither” bucket in dictionary matching and complicate downstream tasks (tokenization, lemmatization, and NER). Empirically, many tokens flagged as out-of-vocabulary resolve to regular Indonesian/Malay forms after orthography-aware normalization (e.g., boelan → bulan; djohor → johor; tjakbeir → cakbir), yet we deliberately preserve the pre-normalization forms for onomastic fidelity because the archaic layer is dominated by proper names, such as individuals, toponyms, boats, and ritual objects (e.g., Raja Noeilaj, Gunung Oeloepitoe, Belang Limareij).
This dual track (reversible normalization for lexical analytics and retention for named entity integrity) reduces false negatives in dictionary lookups while safeguarding historical signal for narrative mapping. Practically, we implement a deterministic rewrite table (regex rules in
Table 5) before stemming/lemmatization, then restore original spellings at render time for quotations and entity graphs; this yields cleaner frequency profiles without erasing the diachronic character of the manuscript and explains why archaic lexicons appear chiefly as names rather than productive lemmas in the morphological inventory.
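The dual-track idea can be illustrated with a small sketch. The rule set below contains only the four correspondences named in the text and is an assumption: the full regex table (Table 5) is not reproduced here, and the `protected` set of names is a hypothetical parameter.

```python
import re

# Grapheme correspondences named in the text (illustrative subset only).
RULES = {"oe": "u", "dj": "j", "tj": "c", "j": "y"}
# Longer graphemes come first so "dj"/"tj" are consumed before the bare "j" rule.
PATTERN = re.compile("|".join(sorted(RULES, key=len, reverse=True)))

def normalize(token):
    """Rewrite Van Ophuijsen graphemes in a single left-to-right pass."""
    return PATTERN.sub(lambda m: RULES[m.group(0)], token.lower())

def normalize_corpus(tokens, protected=frozenset()):
    """Normalize tokens, skip protected names, and keep a map back to originals."""
    normalized, restore = [], {}
    for tok in tokens:
        norm = tok if tok in protected else normalize(tok)
        normalized.append(norm)
        restore.setdefault(norm, set()).add(tok)
    return normalized, restore

normalized, restore = normalize_corpus(
    ["boelan", "djohor", "Noeilaj", "bulan"], protected={"Noeilaj"}
)
print(normalized)
print(restore["bulan"])
```

A single pass matters here: the j produced by the dj rule is never re-read by the j rule, so djohor becomes johor rather than yohor. The `restore` map is what allows original spellings to be recovered at render time.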
Among the loanwords (foreign terminology), the inventory reveals two principal streams of borrowing that intersect directly with the manuscript’s religious and colonial chronicle: Arabic and Dutch. The Arabic stratum ranges from greetings and ritual terms (
assalamualaikum,
azan,
zuhur), through theological lexis (
Allah,
gaib,
rahmat,
illallah), to religious onomastics (
Abdul-,
Abdullah,
Achmad,
Auliya,
Syekh), signaling a strong Islamic register that functions as a discourse marker for prayer, legitimacy, and scholarly authority, while also forming dense clusters of named entities for NER (see
Table 6). Orthographically, several occur in historical spellings, which can evade modern dictionaries. Hence, we apply reversible normalization during preprocessing and restore original forms at presentation to preserve historical fidelity.
By contrast, the Dutch layer (see
Table 7), such as
compania/VOC,
gubernur,
gulden,
kapitan,
perk (plantation),
vandel (banner), along with administrative surnames and titles such as
Verhoven and supervisory roles like
Velak, encodes the colonial infrastructure of the spice economy, governance, and coercive power in Banda. Methodologically, the two tables complement the orthographic and corpus analyses by (i) providing controlled lists for domain-semantic labelling (Religion; Colonial/Political–Economic), (ii) informing entity-sensitive orthographic rules, and (iii) reducing false negatives in dictionary matching and named entity extraction. In sum, Arabic borrowings inscribe ritual–genealogical networks, Dutch borrowings inscribe institutional–economic networks, and together they explain why most “foreign” tokens surface as proper names or domain-specific terminology rather than productive lemmas, making cross-linguistic handling and historical orthography central to analytic accuracy in this manuscript.
3.3. Semantic Fields Analysis
Semantic patterns were profiled across four themes (Maritime, Religious, Social/Genealogical, and Place/Geographic), with the vocabulary partitioned into 15 sequences ordered by descending word count (see
Figure 8). The sequences are not exactly equal in size; the partition serves only to map the sequential order of the persistent words by frequency. The sequential blocks show that Sequences
1–
4 are led by maritime and social/political governance vocabulary (e.g.,
perahu,
pelabuhan,
kora,
raja,
tuan,
kapitan,
Belanda), which consistently co-activate place/geographic anchors (
Banda,
pulau,
kota); together, these establish seaborne movement and authority relations as the narrative spine.
Sequences 5–9 thicken around geospatial nodes and mobility (Neira, Ambon, Java, Timor; ombak, layar), while rank/household terms (bangsawan, rakyat, saudara) knit actors into hierarchies; many tokens here light up multiple columns, acting as hinges that connect voyage → port → magistrate → kin network. Sequences 10–13 introduce concentrated religious bursts, such as institutions, roles, and formulae (masjid, khatib, syekh; sembahyang, bismillah, innallah), which appear as spikes rather than a steady background, marking oaths, blessings, and legitimizing passages. The tail (14–15) carries colonial–administrative/Dutch items (laksamana, perk, Pieterszoon, vandel, velak) plus occasional fauna/objects, signaling episodic detail within governance scenes. Overall, the 15 sequences reveal a skewed but coherent ecology: maritime and politico-social terms persist longest and structure the storyline; religious lexicon punctuates key ritual or moral junctures; and place terms scaffold movement throughout.
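One way to build such a profile is sketched below. The text does not state how the 15 blocks were cut, so splitting a frequency-ranked list into consecutive, near-equal blocks is an assumption, as are the toy word list and the `theme_of` mapping.

```python
from collections import Counter

def partition(ranked_words, n_blocks=15):
    """Split a frequency-ranked list into consecutive, near-equal blocks."""
    base, extra = divmod(len(ranked_words), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        size = base + (1 if i < extra else 0)  # first `extra` blocks get one more
        blocks.append(ranked_words[start:start + size])
        start += size
    return blocks

def theme_profile(blocks, theme_of):
    """Count theme memberships per block; a word may belong to several themes."""
    return [Counter(t for w in block for t in theme_of.get(w, ()))
            for block in blocks]

# Hypothetical ranked words and theme labels (assumptions for illustration).
ranked = ["raja", "perahu", "banda", "masjid", "syekh"]
theme_of = {
    "raja": {"Social/Genealogical"},
    "perahu": {"Maritime"},
    "banda": {"Place/Geographic"},
    "masjid": {"Religious"},
    "syekh": {"Religious", "Social/Genealogical"},
}
profile = theme_profile(partition(ranked, n_blocks=2), theme_of)
print(profile)
```

Allowing a word to carry more than one theme mirrors the tokens described above that act as hinges between columns.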
Principal Component Analysis (PCA) was performed on the word-context frequency matrix to plot the spatial distribution of words and assess their underlying usage patterns (see
Figure 9). The initial frequency data were converted to a binary representation using one-hot encoding (or a similar scheme), so that each word forms an observation vector whose features record its presence or absence across different contexts [
23].
Each point represents a single word, and the points are color-coded according to their pre-assigned semantic groups. Proximity between points indicates a high degree of similarity in the words’ contextual usage patterns, suggesting a shared semantic or thematic meaning, while distant points represent words with distinct frequency distributions. This visual grouping is crucial, as it provides immediate confirmation that the dimensions extracted by the PCA effectively differentiate the words according to their labeled thematic categories, validating the dimensionality reduction process.
The Maritime group forms a noticeable cluster with some internal heterogeneity, generally positioned near the central vertical axis and dominated by PCA1 (62.95%). In contrast, the Social/Genealogical group, heavily driven by PCA2 (57.66%), forms a large, central, and often dense cluster, indicating that while these terms are frequent, their primary separation may be driven by lower-ranked components or that their usage pattern lies closer to the overall average. The Religious group (dominated by PCA2, 62.56%) and the Place/Geographic group (dominated by PCA1, 57.56%) form well-defined, generally tight clusters, suggesting internal consistency in how these terms are used relative to all other words.
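The projection step can be sketched with a plain SVD-based PCA. This is an illustrative sketch under stated assumptions, not the study's implementation: the binary word-by-context matrix is a hypothetical toy example, and `pca_2d` is a helper name introduced here.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto the first two principal components via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                   # center each context column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = U[:, :2] * S[:2]                 # word coordinates on PC1 and PC2
    explained = (S ** 2) / np.sum(S ** 2)     # variance share per component
    return coords, explained[:2]

# Hypothetical presence/absence matrix: rows are words, columns are contexts.
words = ["perahu", "laut", "masjid", "syekh"]
X = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
]
coords, explained = pca_2d(X)
for word, (pc1, pc2) in zip(words, coords):
    print(f"{word:8s} {pc1:6.2f} {pc2:6.2f}")
print(explained)
```

In this toy case the two maritime words fall on one side of PC1 and the two religious words on the other, which is the kind of separation the scatter plot is read for. The sign of each component is arbitrary, and the variance shares over all components sum to 1.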
3.4. Key Entities Identification and Relationships Mapping
Figure 10 shows the hierarchical clustering of entity types based on their estimated frequency, visualized in a dendrogram. In this context, it groups entities with similar frequencies closer together using Ward’s method, which minimizes the variance of the clusters being merged. The Euclidean distance was used on the scaled frequency values as the distance metric [
24]. The vertical lines (or branches) represent the distance between the clusters (or individual entities) being merged. Several clusters were identified:
Cluster 1 (Low Frequency): This includes Place (GEOG) and Social/Genealogical group (PGRP), which are grouped first due to their very similar and low frequencies.
Cluster 2 (Mid-Frequency): This cluster consists of Maritime (PROC), Concept (CONC), Object/Physical Entity (DEVI/TISS), and Event (ACTI), all of which have frequencies clustered in the low hundreds.
Outlier: Person (LIVB) stands alone, merging with the main clusters at a much greater distance, as its estimated frequency is significantly higher than all the other entity types.
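The merge order behind such a dendrogram can be reproduced on one-dimensional data with a short sketch of Ward's criterion. The frequency values below are hypothetical placeholders chosen only to mimic the pattern described (two low types, four mid-range types, one outlier), and min-max scaling is an assumption, since the text says only that the frequencies were scaled.

```python
def min_max_scale(values):
    """Scale a name -> value mapping to the range 0..1 (assumed scaling)."""
    lo, hi = min(values.values()), max(values.values())
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}

def ward_merges(values):
    """Agglomerate 1-D points with Ward's criterion; return merges in order.

    The cost of merging clusters A and B is the increase in within-cluster
    sum of squares: |A||B| / (|A| + |B|) * (mean_A - mean_B) ** 2.
    """
    clusters = {name: (1, v) for name, v in values.items()}  # name -> (size, mean)
    merges = []
    while len(clusters) > 1:
        names = list(clusters)
        best = None
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                (na, ma), (nb, mb) = clusters[names[i]], clusters[names[j]]
                cost = na * nb / (na + nb) * (ma - mb) ** 2
                if best is None or cost < best[0]:
                    best = (cost, names[i], names[j])
        cost, a, b = best
        (na, ma), (nb, mb) = clusters.pop(a), clusters.pop(b)
        clusters[f"({a}+{b})"] = (na + nb, (na * ma + nb * mb) / (na + nb))
        merges.append((a, b, cost))
    return merges

# Hypothetical entity-type frequencies (not the values from the study).
freq = {"GEOG": 40, "PGRP": 45, "PROC": 120, "CONC": 150,
        "DEVI/TISS": 135, "ACTI": 160, "LIVB": 900}
merges = ward_merges(min_max_scale(freq))
for a, b, cost in merges:
    print(f"{a} + {b}: {cost:.4f}")
```

With these placeholder values, GEOG and PGRP merge first and LIVB joins last at a much larger cost. Library implementations may report merge heights on a different scale than the raw cost printed here.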
Table 8 presents a list of identified entities and their specified relationship types, highlighting how different entities are interconnected within the historical and cultural framework. For example,
Nabiullah Nuh and
Andara (Banda) are connected by a Geographic relationship, indicating that Andara is considered the first land to emerge after Noah’s flood.
Jailin and
Siti Gelsoen share a Genealogical relationship, marking them as the ancestral couple and the first major figures in the narrative.
Other relationships are more complex, such as Raja Noeilaj and Putri Cilu Bintang, who share a strong Genealogical connection as siblings. Raja Noeilaj also has significant Power connections, notably with Banda, where he is identified as the first and most important king. Liliselij, another key figure, has multiple relationships, such as with Lautaka (Lewetaka) (regional power) and Makkah (religious journey), demonstrating a blend of political and religious ties. Other entries in the table also reveal the interconnectedness of religion, such as the link between Kakijaij and Makkah, which records his journey to study Islam. Perjalanan ke Warandesi shows the voyage’s significance, linking to the Belang Limareij (a transport vessel) and the geographic destination of Warandesi. In addition, there are entries related to conflict and colonial influence, such as Tuan Coen’s relationship with Banda, signifying a Conflict/Power dynamic, and the Perjanjian (treaty), illustrating Dutch colonial involvement.
Figure 11 maps two dominant clusters that correspond to the general themes of the saga. First, the Genealogical/Local cluster highlights entities such as
Raja Noeilaj,
Siti Gelsoen,
Putri Cilu Bintang, and
Jailin, showcasing the story’s focus on the origin and kinship in Banda. The second cluster (Migration/Religion) highlights agglomerated entities such as
Kakijaij,
Liliselij,
Makkah, and
Jeddah, exhibiting the theme of journey and the introduction of Islam to Banda. Several characters are also identified as bridging entities that link one story fragment with another, such as
Kakijaij and
Liliselij, connecting the Genealogical cluster (
Banda,
saudara) with Religion/Migration (
Makkah,
Jeddah,
Nūr Al-
Mubīn). These entities reflect their prophetic role, which brought religious and cultural changes to Banda. Another character,
Tuan Coen, forms a nearly isolated node (connected to
Banda and
Perjanjian only). This reflects his status as an external actor and the primary source of conflict, interacting directly with the Banda rulers.
Degree centrality measures the number of direct connections (edges) a node has in a network or graph. It is used in network analysis to quantify the importance or influence of a node based on how many other nodes it is directly connected to. The normalized values range from 0 to 1: the higher the degree centrality, the more connections an entity has. For example, the entities Raja Noeilaj, Banda, Liliselij, Kakijaij, and Siti Gelsoen each have a degree centrality of 1.00, indicating that they share the highest number of direct connections in the network. In contrast, entities such as Gunung Oeloepitoe and Belang Limareij have fewer connections and are more peripheral, with a value of 0.25. The entities are categorized into four types. The Person (LIVB) category tends to have higher centrality values, suggesting that persons play a more central role in the network than Place (GEOG), Object (DEVI/TISS), and Concept (CONC) entities, which generally have lower centrality. Together with the visual representation, these quantitative values clarify how entities in the saga interact with each other, shaping a complex cultural narrative.
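As a minimal sketch of the measure, the snippet below computes normalized degree centrality on a small edge list. The edges are limited to relationships mentioned in the text and are not the full graph, and dividing by n - 1 is the common normalization, assumed here. The study's own scaling may differ, so the numbers will not match the reported values.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Return degree / (n - 1) for every node of an undirected graph."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}

# Illustrative subset of relationships mentioned in the text.
edges = [
    ("Nabiullah Nuh", "Andara"),
    ("Jailin", "Siti Gelsoen"),
    ("Raja Noeilaj", "Putri Cilu Bintang"),
    ("Raja Noeilaj", "Banda"),
    ("Liliselij", "Lautaka"),
    ("Liliselij", "Makkah"),
    ("Kakijaij", "Makkah"),
    ("Tuan Coen", "Banda"),
    ("Tuan Coen", "Perjanjian"),
]
centrality = degree_centrality(edges)
for node, value in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{node:20s} {value:.2f}")
```

Under this normalization, a value of 1.00 would mean that a node is directly connected to every other node in the graph.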
Raja Noeilaj, marked by its size and central location, forms the nucleus of the network [
25]. This node shows the thickest, strongest relational links to the key Person entities
Liliselij,
Kakijaij, and
Siti Gelsoen, and the primary Place node,
Banda, signifying these as the most frequent and important associations in the underlying narrative. The graph further reveals interconnected sub-networks, with the core group of people linking to important geographical hubs like
Jeddah and
Makkah, suggesting themes of trade or pilgrimage, while other connections involve political entities like
Portugis and abstract elements such as
Perjanjian (Treaty) and
Mahar (Dowry). Overall, the visualization effectively maps the complex historical and social landscape surrounding
Raja Noeilaj, identifying not only the central figures and locations but also the critical objects and concepts (like the single CONC node,
Perjanjian) that define the political and historical narrative being represented.