Adaptive Token Boundaries: Towards Integrating Human Chunking Mechanisms into Multimodal LLMs
Abstract
1. Introduction
- RQ1: How do human cross-modal chunking mechanisms differ from current tokenization approaches in multimodal LLMs, and what are the quantitative measures of this divergence?
- RQ2: Can dynamic, context-sensitive tokenization boundaries improve multimodal LLM performance on benchmark tasks, and if so, by what magnitude?
- RQ3: What are the computational and practical implications of implementing adaptive token boundaries in large-scale multimodal systems?
- Empirical characterization of human cross-modal chunking patterns through eye-tracking and neuroimaging data;
- Systematic analysis of limitations in current multimodal token representation methodologies;
- Development and validation of a dynamic cross-modal tokenization framework that demonstrates improved performance and greater cognitive plausibility;
- Quantitative and qualitative evaluation of the proposed approach against existing methods, demonstrating improvements in both task performance and human–model alignment.
2. Background
2.1. Cognitive Chunking Fundamentals
2.2. Tokenization in Language Models
2.3. The Gap Between Human and Machine Processing
3. Related Work
3.1. Cognitive Chunking in Human Information Processing
3.2. Tokenization in Multimodal Language Models
3.3. Human-Aligned AI and Cognitive Plausibility
4. Methods
4.1. Hypothesis
- Humans will show variable chunk sizes correlated with semantic boundaries;
- Attention patterns will differ significantly between humans and models at chunk boundaries;
- Dynamic tokenization will improve model performance on tasks requiring cross-modal reasoning.
4.2. Empirical Analysis of Human Chunking Patterns
4.2.1. Participants
4.2.2. Measures of Chunking
- Congruence Analysis (a code sketch of all three measure families follows this list):
  - Word2Vec embeddings with cosine similarity measurement;
  - BERT-based contextual similarity scores;
  - Human ratings on 5-point Likert scale (20 annotators, κ = 0.81).
- Spatial Relationships:
  - Bounding box coordinates (x, y, width, height) for visual elements;
  - 8-directional relative position encoding (N, NE, E, SE, S, SW, W, NW);
  - Normalized Euclidean distance metrics.
- Semantic Associations:
  - Pointwise Mutual Information (PMI) scores from large-scale corpora;
  - ConceptNet relationship extraction;
  - CLIP-based visual–semantic association scores.
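The Python sketch below illustrates how one representative measure from each family above could be computed. It is illustrative only: the embedding vectors are random placeholders standing in for Word2Vec/BERT/CLIP outputs, and the co-occurrence counts are invented.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (congruence analysis)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def relative_direction(box_a, box_b) -> str:
    """8-directional relative position (N, NE, E, SE, S, SW, W, NW) of box_b
    with respect to box_a. Boxes are (x, y, width, height) tuples."""
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Image coordinates grow downward, so negate dy to get a compass angle.
    angle = np.degrees(np.arctan2(-(by - ay), bx - ax)) % 360
    directions = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]
    return directions[int(((angle + 22.5) % 360) // 45)]

def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise Mutual Information from raw co-occurrence counts."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return float(np.log2(p_xy / (p_x * p_y)))

# Toy usage with placeholder values.
u, v = np.random.rand(300), np.random.rand(300)   # stand-ins for Word2Vec vectors
print(cosine_similarity(u, v))
print(relative_direction((0, 0, 10, 10), (30, 0, 10, 10)))  # 'E'
print(pmi(count_xy=50, count_x=200, count_y=300, total=10_000))
```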
4.2.3. Stimuli
4.2.4. Eye-Tracking Procedure
4.2.5. Neuroimaging Procedure
4.2.6. Working Memory Assessment
4.3. Computational Modeling
4.3.1. Baseline Models
4.3.2. Dynamic Cross-Modal Tokenization Framework
- Adaptive Boundary Detection: B(x) = 1 if σ(f_θ(x)) > T, else 0 (see the sketch after this list), where:
  - B: Boundary detection function (output: 0 or 1);
  - x: Input features (concatenated visual and textual embeddings);
  - θ: Learned parameters of the boundary detector network;
  - f_θ: Neural network with parameters θ;
  - σ: Sigmoid activation function;
  - T: Threshold value (default: 0.5).
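A minimal PyTorch sketch of this boundary detector. The two-layer MLP for f_θ is an assumption (the section specifies only the thresholded-sigmoid form, not the network internals), and the 1536-dimensional input assumes concatenated 768-dimensional visual and textual embeddings, matching d_model = 768 in the hyperparameter table.

```python
import torch
import torch.nn as nn

class BoundaryDetector(nn.Module):
    """Sketch of B(x) = 1 if sigma(f_theta(x)) > T else 0.
    The MLP architecture is illustrative, not the paper's exact design."""

    def __init__(self, input_dim: int = 1536, hidden_dim: int = 256,
                 threshold: float = 0.5):
        super().__init__()
        self.f_theta = nn.Sequential(          # f_theta: small MLP over fused features
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        self.threshold = threshold             # T, default 0.5 as above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        prob = torch.sigmoid(self.f_theta(x))  # sigma(f_theta(x)) in (0, 1)
        return (prob > self.threshold).long()  # hard boundary decision: 0 or 1

# x: one fused feature vector per sequence position.
detector = BoundaryDetector()
x = torch.randn(4, 10, 1536)                   # batch of 4 sequences, 10 positions
boundaries = detector(x)                       # shape (4, 10, 1), entries in {0, 1}
```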
- Hierarchical Representation Networks: h_l = TransformerBlock(h_{l−1} + TopDown(h_{l+1})) (see the sketch after this list), where:
  - h_l: Hidden representation at layer l;
  - h_{l−1}: Previous-layer representation;
  - h_{l+1}: Next-layer representation (for top–down connections);
  - TransformerBlock: Standard transformer layer;
  - TopDown: Learned projection for top–down information.
  - Example for the phrase "cat on mat":
    - h_0 = word embeddings [cat, on, mat];
    - h_1 = TransformerBlock(h_0 + TopDown(h_2));
    - h_2 = TransformerBlock(h_1).
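Because h_1 depends on h_2 and vice versa, a direct implementation of this update is cyclic. The sketch below resolves the cycle iteratively, with a bottom-up pass followed by top-down refinement; the iteration scheme is an illustrative choice, since only the update equation is specified above.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Sketch of h_l = TransformerBlock(h_{l-1} + TopDown(h_{l+1})),
    resolved by iterative refinement (an assumption of this sketch)."""

    def __init__(self, d_model: int = 768, n_layers: int = 2, n_heads: int = 12):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        # TopDown: learned projection carrying higher-layer context downward.
        self.top_down = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_layers)]
        )

    def forward(self, h0: torch.Tensor, refine_steps: int = 1):
        # Bottom-up pass: h_l = TransformerBlock(h_{l-1}).
        hs = [h0]
        for block in self.blocks:
            hs.append(block(hs[-1]))
        # Top-down refinement: re-run each layer with feedback from above.
        for _ in range(refine_steps):
            for l in range(1, len(hs) - 1):
                feedback = self.top_down[l - 1](hs[l + 1])
                hs[l] = self.blocks[l - 1](hs[l - 1] + feedback)
            hs[-1] = self.blocks[-1](hs[-2])  # top layer: plain bottom-up update
        return hs

# h_0: embeddings for ["cat", "on", "mat"], batch size 1.
enc = HierarchicalEncoder()
hs = enc(torch.randn(1, 3, 768))
print([h.shape for h in hs])  # three (1, 3, 768) tensors: h_0, h_1, h_2
```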
- Cross-Modal Alignment Modules:
4.4. Evaluation Methods
4.4.1. Benchmark Tasks
4.4.2. Statistical Analysis
4.5. Cross-Modal Chunking in the Wild (CMCW) Dataset
4.5.1. Data Collection (3 Months)
4.5.2. Annotation Process
4.5.3. Validation
4.5.4. Dataset Statistics
5. Results
5.1. Empirical Evidence of Cross-Modal Chunking in Human Cognition
5.2. Attention Pattern Analysis
5.3. Analysis of Current Tokenization Approaches
5.4. Performance of Dynamic Cross-Modal Tokenization
5.5. Computational Overhead and Practical Challenges
5.5.1. Computational Challenges
5.5.2. Implementation Challenges
5.5.3. Deployment Considerations
6. Discussion
6.1. Answering Research Questions
6.2. Theoretical Implications
6.3. Limitations
7. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Miller, G.A. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol. Rev. 1956, 63, 81–97.
- Rayner, K. Eye movements in reading and information processing: 20 years of research. Psychol. Bull. 1998, 124, 372–422.
- Baddeley, A. Working memory: Looking back and looking forward. Nat. Rev. Neurosci. 2003, 4, 829–839.
- Haxby, J.V.; Gobbini, M.I.; Furey, M.L.; Ishai, A.; Schouten, J.L.; Pietrini, P. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 2001, 293, 2425–2430.
- Chase, W.G.; Simon, H.A. Perception in chess. Cogn. Psychol. 1973, 4, 55–81.
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 8748–8763.
- Buschoff, L.M.S.; Akata, E.; Bethge, M.; Schulz, E. Visual cognition in multimodal large language models. Nat. Mach. Intell. 2025, 7, 96–106.
- Li, J.; Li, D.; Xiong, C.; Hoi, S.C.H. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 12888–12900.
- Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
- Marcus, G. Deep learning: A critical appraisal. arXiv 2018, arXiv:1801.00631.
- Fedorenko, E.; Scott, T.L.; Brunner, P.; Coon, W.G.; Pritchett, B.; Schalk, G.; Kanwisher, N. Neural correlate of the construction of sentence meaning. Proc. Natl. Acad. Sci. USA 2016, 113, E6256–E6262.
- Batterink, L.J.; Paller, K.A. Online neural monitoring of statistical learning. Cortex 2017, 90, 31–45.
- Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; pp. 1715–1725.
- Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8821–8831.
- Lupyan, G.; Clark, A. Words and the world: Predictive coding and the language-perception-cognition interface. Curr. Dir. Psychol. Sci. 2015, 24, 279–284.
- Lake, B.M.; Ullman, T.D.; Tenenbaum, J.B.; Gershman, S.J. Building machines that learn and think like people. Behav. Brain Sci. 2017, 40, e253.
- Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608.
- Tenenbaum, J.B.; Kemp, C.; Griffiths, T.L.; Goodman, N.D. How to grow a mind: Statistics, structure, and abstraction. Science 2011, 331, 1279–1285.
- Chollet, F. On the measure of intelligence. arXiv 2019, arXiv:1911.01547.
- Esteban, O.; Markiewicz, C.J.; Blair, R.W.; Moodie, C.A.; Isik, A.I.; Erramuzpe, A.; Kent, J.D.; Goncalves, M.; DuPre, E.; Snyder, M.; et al. fMRIPrep: A robust preprocessing pipeline for functional MRI. Nat. Methods 2019, 16, 111–116.
- OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
- Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2425–2433.
- Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollár, P.; Zitnick, C.L. Microsoft COCO captions: Data collection and evaluation server. arXiv 2015, arXiv:1504.00325.
- Hudson, D.A.; Manning, C.D. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6700–6709.
- Belghazi, M.I.; Baratin, A.; Rajeswar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, R.D. Mutual information neural estimation. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 531–540.
- Linzen, T. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, WA, USA, 5–10 July 2020; pp. 5210–5217.




| Component | Parameter | Value | Description |
|---|---|---|---|
| Base Transformer | d_model | 768 | Model dimensionality |
| | n_heads | 12 | Number of attention heads |
| | n_layers | 12 | Number of transformer layers |
| | d_ff | 3072 | Feed-forward dimension |
| | dropout | 0.1 | Dropout probability |
| Vision Encoder | patch_size | 16 × 16 | Image patch dimensions |
| | img_size | 224 × 224 | Input image dimensions |
| | v_layers | 12 | Vision transformer layers |
| Adaptive Tokenization | boundary_threshold (T) | 0.5 | Adaptive boundary threshold |
| | learning_rate | 1 × 10⁻⁴ | Adam optimizer learning rate |
| | batch_size | 32 | Training batch size |
| | epochs | 50 | Training epochs |
| | warmup_steps | 1000 | Learning rate warmup |
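For reference, these settings can be gathered into a single configuration object. The dataclass below is an illustrative convenience whose field names mirror the table's parameters; it is not the released training code.

```python
from dataclasses import dataclass

@dataclass
class DCMTConfig:
    """Hyperparameters from the table above, collected in one place."""
    # Base transformer
    d_model: int = 768           # model dimensionality
    n_heads: int = 12            # attention heads
    n_layers: int = 12           # transformer layers
    d_ff: int = 3072             # feed-forward dimension
    dropout: float = 0.1         # dropout probability
    # Vision encoder
    patch_size: int = 16         # 16 x 16 image patches
    img_size: int = 224          # 224 x 224 input images
    v_layers: int = 12           # vision transformer layers
    # Adaptive tokenization / training
    boundary_threshold: float = 0.5
    learning_rate: float = 1e-4  # Adam optimizer
    batch_size: int = 32
    epochs: int = 50
    warmup_steps: int = 1000

config = DCMTConfig()
print(config.d_model, config.learning_rate)
```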
| Task | Baseline | Our Method | Improvement | p-Value | Cohen’s d | N |
|---|---|---|---|---|---|---|
| Visual Question Answering | 68.3% | 76.1% | +7.8% | p < 0.001 | 0.82 | 5000 |
| Complex Scene Description | 72.4% | 77.7% | +5.3% | p < 0.01 | 0.65 | 3000 |
| Cross-modal Retrieval | 81.2% | 85.9% | +4.7% | p < 0.001 | 0.71 | 4000 |
| Visual Reasoning | 64.5% | 70.2% | +5.7% | p < 0.001 | 0.78 | 2500 |
| Model | VQA (%) | Complex Scene (%) | GQA (%) | CMCW (%) |
|---|---|---|---|---|
| BLIP-2 | 78.3 | 69.4 | 63.7 | 58.9 |
| Flamingo | 80.1 | 72.6 | 65.2 | 62.3 |
| GPT-4V | 86.5 | 79.8 | 72.4 | 68.7 |
| DCMT (Ours) | 94.3 | 85.1 | 77.9 | 82.4 |
| Method | Formula | Purpose | Interpretation |
|---|---|---|---|
| Cohen's d | (M₁ − M₂)/SD_pooled | Effect size for comparing two group means | 0.2 = small; 0.5 = medium; 0.8 = large |
| Fleiss' κ | (P_o − P_e)/(1 − P_e) | Inter-rater agreement for multiple annotators | >0.75 = excellent; 0.40–0.75 = fair to good; <0.40 = poor |
| Pearson's r | ∑(xᵢ − x̄)(yᵢ − ȳ)/√(∑(xᵢ − x̄)² ∑(yᵢ − ȳ)²) | Linear correlation strength | ±0.1 = weak; ±0.3 = moderate; ±0.5 = strong |
| Bonferroni correction | α_adjusted = α/n | Adjustment for multiple comparisons | Controls family-wise error rate at α |
| KL divergence | ∑P(x) log(P(x)/Q(x)) | Distribution similarity measure | 0 = identical; higher = more different |
| Mutual information | ∑P(x,y) log(P(x,y)/(P(x)P(y))) | Shared information between variables (bits) | 0 = independent; higher = more shared info |
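A NumPy sketch of several of these measures follows, run on synthetic data rather than the study's measurements (Pearson's r is available directly via np.corrcoef and is omitted).

```python
import numpy as np

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Cohen's d with pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return float((x.mean() - y.mean()) / np.sqrt(pooled_var))

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(P || Q) in bits; assumes p and q are proper distributions with q > 0."""
    return float(np.sum(p * np.log2(p / q)))

def mutual_information(joint: np.ndarray) -> float:
    """Mutual information in bits from a joint probability table P(x, y)."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0                              # skip zero cells (0 log 0 = 0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

def bonferroni(alpha: float, n_tests: int) -> float:
    """Per-test significance level controlling family-wise error at alpha."""
    return alpha / n_tests

# Toy usage with synthetic data (not the paper's measurements).
rng = np.random.default_rng(0)
a, b = rng.normal(0.0, 1.0, 100), rng.normal(0.8, 1.0, 100)
print(f"d = {cohens_d(b, a):.2f}")              # ~0.8: large by convention
print(bonferroni(0.05, n_tests=4))              # 0.0125 across four benchmark tests
joint = np.array([[0.25, 0.25], [0.25, 0.25]])  # two independent fair coins
print(mutual_information(joint))                # 0.0 bits: independent
```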
