Search Results (208)

Search Parameters:
Keywords = Vector Quantization

20 pages, 2197 KB  
Article
Perceptual Image Hashing Fusing Zernike Moments and Saliency-Based Local Binary Patterns
by Wei Li, Tingting Wang, Yajun Liu and Kai Liu
Computers 2025, 14(9), 401; https://doi.org/10.3390/computers14090401 - 21 Sep 2025
Viewed by 270
Abstract
This paper proposes a novel perceptual image hashing scheme that robustly combines global structural features with local texture information for image authentication. The method starts with image normalization and Gaussian filtering to ensure scale invariance and suppress noise. A saliency map is then generated from a color vector angle matrix using a frequency-tuned model to identify perceptually significant regions. Local Binary Pattern (LBP) features are extracted from this map to represent fine-grained textures, while rotation-invariant Zernike moments are computed to capture global geometric structures. These local and global features are quantized and concatenated into a compact binary hash. Extensive experiments on standard databases show that the proposed method outperforms state-of-the-art algorithms in both robustness against content-preserving manipulations and discriminability across different images. Quantitative evaluations based on ROC curves and AUC values confirm its superior robustness–uniqueness trade-off, demonstrating the effectiveness of the saliency-guided fusion of Zernike moments and LBP for reliable image hashing. Full article
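
As an illustration of the local-texture half of this pipeline, here is a minimal sketch that quantizes a uniform-LBP histogram of a saliency map into hash bits. It assumes scikit-image's local_binary_pattern; the frequency-tuned saliency model, the Zernike-moment stage, and the paper's exact quantizer are not reproduced, and the helper name lbp_hash_bits is hypothetical.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_hash_bits(saliency_map, P=8, R=1.0):
    """Quantize a uniform-LBP histogram of the saliency map into hash bits."""
    codes = local_binary_pattern(saliency_map, P, R, method="uniform")
    n_bins = P + 2                      # uniform LBP yields P + 2 code values
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    # One-bit quantization against the median bin value; the paper's actual
    # quantizer is not specified in the abstract.
    return (hist > np.median(hist)).astype(np.uint8)

saliency = np.random.rand(256, 256)     # stand-in for the frequency-tuned saliency map
print(lbp_hash_bits(saliency))
```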

20 pages, 285 KB  
Article
The Role of Symmetry Aspects in Considering the Spin-1 Particle with Two Additional Electromagnetic Characteristics in the Presence of Both Magnetic and Electric Fields
by Alina Ivashkevich, Viktor Red’kov, Elena Ovsiyuk and Alexander Chichurin
Symmetry 2025, 17(9), 1465; https://doi.org/10.3390/sym17091465 - 5 Sep 2025
Viewed by 323
Abstract
In this paper, we study a generalized Duffin–Kemmer equation for a spin-1 particle with two additional characteristics, anomalous magnetic moment and polarizability, in the presence of external uniform magnetic and electric fields. After separating the variables, we obtained a system of 10 first-order partial differential equations for 10 functions f_A(r,z), A = 1, ..., 10. To resolve this complicated problem, we first took into account the existing symmetry in the structure of the derived system. The main step consisted of applying a special method for fixing the r-dependence of the ten functions f_A(r,z). We used the approach of Fedorov–Gronskiy, according to which the complete 10-component wave function is decomposed into the sum of three projective constituents. The dependence of each constituent on the polar coordinate r is determined by only one corresponding function F_i(r), i = 1, 2, 3. These three basic functions are constructed in terms of confluent hypergeometric functions, and in this process a quantization rule arises due to the presence of the magnetic field. In fact, this approach is a step-by-step algebraization of the system of partial differential equations. After that, we derived a system of 10 ordinary differential equations for 10 functions f_A(z). This system was solved using the elimination method together with special linear combinations of the involved functions. As a result, we found three separated second-order differential equations, and their solutions were constructed in terms of confluent hypergeometric functions. Thus, in this paper, three types of solutions are obtained for a vector particle with two additional electromagnetic characteristics in the presence of both external uniform magnetic and electric fields. Full article
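
Schematically, the Fedorov–Gronskiy decomposition described above can be written as follows; the projector symbols and the z-dependent factors are notation assumed for this sketch, not necessarily the paper's:

```latex
\Psi(r,z) \;=\; \sum_{i=1}^{3} F_i(r)\,\Pi_i\,\psi_i(z),
\qquad \Pi_i\,\Pi_j = \delta_{ij}\,\Pi_i,
\qquad \sum_{i=1}^{3}\Pi_i = \mathbb{1},
```

so that the radial dependence of each projective constituent is carried by a single function F_i(r).
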
23 pages, 4446 KB  
Article
A Modular Framework for RGB Image Processing and Real-Time Neural Inference: A Case Study in Microalgae Culture Monitoring
by José Javier Gutiérrez-Ramírez, Ricardo Enrique Macias-Jamaica, Víctor Manuel Zamudio-Rodríguez, Héctor Arellano Sotelo, Dulce Aurora Velázquez-Vázquez, Juan de Anda-Suárez and David Asael Gutiérrez-Hernández
Eng 2025, 6(9), 221; https://doi.org/10.3390/eng6090221 - 2 Sep 2025
Viewed by 375
Abstract
Recent progress in computer vision and embedded systems has facilitated real-time monitoring of bioprocesses; however, lightweight and scalable solutions for resource-constrained settings remain limited. This work presents a modular framework for monitoring Chlorella vulgaris growth by integrating RGB image processing with multimodal sensor fusion. The system incorporates a Logitech C920 camera and low-cost pH and temperature sensors within a compact photobioreactor. It extracts RGB channel statistics, luminance, and environmental data to generate a 10-dimensional feature vector. A feedforward artificial neural network (ANN) with ReLU activations, dropout layers, and SMOTE-based data balancing was trained to classify growth phases: lag, exponential, and stationary. The optimized model, quantized to 8 bits, was deployed on an ESP32 microcontroller, achieving 98.62% accuracy with 4.8 ms inference time and a 13.48 kB memory footprint. Robustness analysis confirmed tolerance to geometric transformations, though variable lighting reduced performance. Principal component analysis (PCA) retained 95% variance, supporting the discriminative power of the features. The proposed system outperformed previous vision-only methods, demonstrating the advantages of multimodal fusion for early detection. Limitations include sensitivity to lighting and validation limited to a single species. Future directions include incorporating active lighting control and extending the model to multi-species classification for broader applicability. Full article
(This article belongs to the Special Issue Artificial Intelligence for Engineering Applications, 2nd Edition)
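
A minimal sketch of one plausible 10-dimensional feature vector, assuming per-channel mean/std, luminance mean/std, and the two sensor readings; the paper's exact feature composition is not specified in the abstract:

```python
import numpy as np

def feature_vector(rgb, ph, temp):
    """10-D feature sketch: per-channel mean/std (6), luminance mean/std (2),
    plus pH and temperature (2). The paper's exact feature set may differ."""
    pixels = rgb.reshape(-1, 3).astype(np.float32)
    means, stds = pixels.mean(axis=0), pixels.std(axis=0)
    # ITU-R BT.601 luma weights as a plausible luminance definition
    lum = pixels @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return np.concatenate([means, stds, [lum.mean(), lum.std()], [ph, temp]])

x = feature_vector(np.random.randint(0, 256, (480, 640, 3)), ph=7.2, temp=25.0)
print(x.shape)  # (10,)
```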

16 pages, 22201 KB  
Article
MECO: Mixture-of-Expert Codebooks for Multiple Dense Prediction Tasks
by Gyutae Hwang and Sang Jun Lee
Sensors 2025, 25(17), 5387; https://doi.org/10.3390/s25175387 - 1 Sep 2025
Viewed by 491
Abstract
Autonomous systems operating in embedded environments require robust scene understanding under computational constraints. Multi-task learning offers a compact alternative to deploying multiple task-specific models by jointly solving dense prediction tasks. However, recent MTL models often suffer from entangled shared feature representations and significant computational overhead. To address these limitations, we propose Mixture-of-Expert Codebooks (MECO), a novel multi-task learning framework that leverages vector quantization to design Mixture-of-Experts with lightweight codebooks. MECO disentangles task-generic and task-specific representations and enables efficient learning across multiple dense prediction tasks such as semantic segmentation and monocular depth estimation. The proposed multi-task learning model is trained end-to-end using a composite loss that combines task-specific objectives and vector quantization losses. We evaluate MECO on a real-world driving dataset collected in challenging embedded scenarios. MECO achieves a +0.4% mIoU improvement in semantic segmentation and maintains comparable depth estimation accuracy to the baseline, while reducing model parameters and FLOPs by 18.33% and 28.83%, respectively. These results demonstrate the potential of vector quantization-based Mixture-of-Experts modeling for efficient and scalable multi-task learning in embedded environments. Full article
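
At its core, each expert codebook performs a standard nearest-codeword lookup; a minimal sketch (the MoE routing and the straight-through training trick are omitted, and the shapes are assumptions):

```python
import numpy as np

def vq_assign(z, codebook):
    """Nearest-codeword assignment: z is (N, D), codebook is (K, D)."""
    # Squared Euclidean distance between every feature and every codeword
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx            # quantized features and their indices

z = np.random.randn(16, 64)              # token features from a shared encoder
codebook = np.random.randn(512, 64)      # one lightweight expert codebook
zq, idx = vq_assign(z, codebook)
```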

17 pages, 1462 KB  
Article
Key Operator Vectorization for LeNet and ResNet Based on Buddy Compiler
by Juncheng Chen, Weiwei Chen and Zhi Cai
Appl. Sci. 2025, 15(17), 9523; https://doi.org/10.3390/app15179523 - 29 Aug 2025
Viewed by 395
Abstract
Deep learning has emerged as a prominent focus in both academia and industry, with a wide range of models being applied across diverse domains. Fast and efficient model inference is essential for the practical deployment of deep learning models. Under specific hardware constraints, accelerating inference remains a key research challenge. Common techniques for model acceleration include quantization, pruning, and vectorization. Although quantization and pruning primarily reduce model precision or complexity to enhance efficiency, this paper concentrates on vectorization, a technique that accelerates models by increasing the parallelism of operator execution. Based on the open-source Buddy-MLIR project, this work implements vectorization optimizations for Matmul, Conv2d, and Max Pooling operations to improve inference performance. These optimizations are designed as compiler passes and integrated into the Buddy-MLIR framework, offering a general solution for vectorizing such operators. Two optimization approaches are proposed: general vectorization and adaptive vectorization. Compared to the standard MLIR lowering pipeline and the fully optimized LLVM backend, the proposed general and adaptive vectorization methods reduce the inference latency of LeNet-5 by 26.7% and 37.3%, respectively. For the more complex ResNet-18 model, these methods achieve latency reductions of 79.9% and 82.6%, respectively. Full article
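
The gains come from widening the innermost loops so that many multiply-accumulates execute per step. As a language-level analogy only (not the MLIR passes themselves), compare a scalar matmul inner loop with a row-at-a-time vectorized variant:

```python
import numpy as np, time

A = np.random.rand(128, 128); B = np.random.rand(128, 128)

def matmul_scalar(A, B):
    """Scalar triple loop: one multiply-accumulate per iteration."""
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            for k in range(A.shape[1]):
                C[i, j] += A[i, k] * B[k, j]
    return C

def matmul_vectorized(A, B):
    """Whole rows processed per step, analogous to widening the k-loop."""
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        C[i, :] = (A[i, :, None] * B).sum(axis=0)
    return C

for f in (matmul_scalar, matmul_vectorized):
    t = time.perf_counter(); f(A, B); print(f.__name__, time.perf_counter() - t)
```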

28 pages, 2070 KB  
Article
Enhancing Security and Applicability of Local LLM-Based Document Retrieval Systems in Smart Grid Isolated Environments
by Kiho Lee, Sumi Yang, Jaeyeong Jeong, Yongjoon Lee and Dongkyoo Shin
Electronics 2025, 14(17), 3407; https://doi.org/10.3390/electronics14173407 - 27 Aug 2025
Viewed by 621
Abstract
The deployment of large language models (LLMs) in closed-network industrial environments remains constrained by privacy and connectivity limitations. This study presents a retrieval-augmented question-answering system designed to operate entirely offline, integrating local vector embeddings, ontology-based semantic enrichment, and quantized LLMs, while ensuring compliance with industrial security standards like IEC 62351. The system was implemented using OpenChat-3.5 models with two quantization variants (Q5 and Q8), and evaluated through comparative experiments focused on response accuracy, generation speed, and secure document handling. Empirical results show that both quantized models delivered comparable answer quality, with the Q5 variant achieving approximately 1.5 times faster token generation under limited hardware. The ontology-enhanced retriever further improved semantic relevance by incorporating structured domain knowledge into the retrieval stage. Throughout the experiments, the system demonstrated effective performance across speed, accuracy, and information containment—core requirements for AI deployment in security-sensitive domains. These findings underscore the practical viability of offline LLM systems for privacy-compliant document search, while also highlighting architectural considerations essential for extending their utility to environments such as smart grids or defense-critical infrastructures. Full article
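
A minimal sketch of the offline retrieval step, assuming a flat in-memory store of 384-dimensional embeddings; the ontology-based enrichment and the quantized-LLM generation stage are not shown:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=3):
    """Cosine-similarity top-k retrieval over locally stored embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = D @ q
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]

doc_vecs = np.random.randn(1000, 384)    # embeddings computed fully offline
query_vec = np.random.randn(384)
idx, sims = retrieve(query_vec, doc_vecs)
# The retrieved passages would then be prepended to the prompt of the
# locally hosted, quantized LLM.
```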

25 pages, 8472 KB  
Article
Harnessing the Power of Pre-Trained Models for Efficient Semantic Communication of Text and Images
by Emrecan Kutay and Aylin Yener
Entropy 2025, 27(8), 813; https://doi.org/10.3390/e27080813 - 29 Jul 2025
Viewed by 778
Abstract
This paper investigates point-to-point multimodal digital semantic communications in a task-oriented setup, where messages are classified at the receiver. We employ a pre-trained transformer model to extract semantic information and propose three methods for generating semantic codewords. First, we propose semantic quantization that uses quantized embeddings of source realizations as a codebook. We investigate the fixed-length coding, considering the source semantic structure and end-to-end semantic distortion. We propose a neural network-based codeword assignment mechanism incorporating codeword transition probabilities to minimize the expected semantic distortion. Second, we present semantic compression that clusters embeddings, exploiting the inherent semantic redundancies to reduce the codebook size, i.e., further compression. Third, we introduce a semantic vector-quantized autoencoder (VQ-AE) that learns a codebook through training. In all cases, we follow this semantic source code with a standard channel code to transmit over the wireless channel. In addition to classification accuracy, we assess pre-communication overhead via a novel metric we term system time efficiency. Extensive experiments demonstrate that our proposed semantic source-coding approaches provide comparable accuracy and better system time efficiency compared to their learning-based counterparts. Full article
(This article belongs to the Special Issue Semantic Information Theory)
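
The second method (semantic compression) clusters embeddings so that cluster centers serve as a smaller codebook; a minimal sketch with scikit-learn, where the embedding and codebook sizes are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

emb = np.random.randn(5000, 768)         # pre-trained transformer embeddings
kmeans = KMeans(n_clusters=256, n_init=10, random_state=0).fit(emb)
codebook = kmeans.cluster_centers_       # 256 codewords instead of 5000 embeddings
codewords = kmeans.predict(emb)          # 8-bit codeword indices to transmit
```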

21 pages, 2467 KB  
Article
Implementation of a Conditional Latent Diffusion-Based Generative Model to Synthetically Create Unlabeled Histopathological Images
by Mahfujul Islam Rumman, Naoaki Ono, Kenoki Ohuchida, Ahmad Kamal Nasution, Muhammad Alqaaf, Md. Altaf-Ul-Amin and Shigehiko Kanaya
Bioengineering 2025, 12(7), 764; https://doi.org/10.3390/bioengineering12070764 - 15 Jul 2025
Viewed by 619
Abstract
Generative image models have revolutionized artificial intelligence by enabling the synthesis of high-quality, realistic images. These models utilize deep learning techniques to learn complex data distributions and generate novel images that closely resemble the training dataset. Recent advancements, particularly in diffusion models, have led to remarkable improvements in image fidelity, diversity, and controllability. In this work, we investigate the application of a conditional latent diffusion model in the healthcare domain. Specifically, we trained a latent diffusion model using unlabeled histopathology images. Initially, these images were embedded into a lower-dimensional latent space using a Vector Quantized Generative Adversarial Network (VQ-GAN). Subsequently, a diffusion process was applied within this latent space, and clustering was performed on the resulting latent features. The clustering results were then used as a conditioning mechanism for the diffusion model, enabling conditional image generation. Finally, we determined the optimal number of clusters using cluster validation metrics and assessed the quality of the synthetic images through quantitative methods. To enhance the interpretability of the synthetic image generation process, expert input was incorporated into the cluster assignments. Full article
(This article belongs to the Section Biosignal Processing)
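
For the cluster-count selection step, one standard validation metric is the silhouette score; this sketch assumes that choice (the paper may use different metrics) and stands in random vectors for the VQ-GAN latents:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

latents = np.random.randn(2000, 64)      # stand-in for VQ-GAN latent codes
best_k, best_s = None, -1.0
for k in range(2, 11):                   # candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(latents)
    s = silhouette_score(latents, labels)
    if s > best_s:
        best_k, best_s = k, s
print(best_k, best_s)                    # chosen number of conditioning classes
```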

13 pages, 423 KB  
Article
A Deep Learning-Driven Solution to Limited-Feedback MIMO Relaying Systems
by Kwadwo Boateng Ofori-Amanfo, Bridget Durowaa Antwi-Boasiako, Prince Anokye, Suho Shin and Kyoung-Jae Lee
Mathematics 2025, 13(14), 2246; https://doi.org/10.3390/math13142246 - 11 Jul 2025
Viewed by 622
Abstract
In this work, we investigate a new design strategy for the implementation of a deep neural network (DNN)-based limited-feedback relay system by using conventional filters to acquire training data in order to jointly solve the issues of quantization and feedback. We aim to maximize the effective channel gain to reduce the symbol error rate (SER). By harnessing binary feedback information from the implemented DNNs together with efficient beamforming vectors, a novel approach to the resulting problem is presented. We compare our proposed system to a Grassmannian codebook system to show that our system outperforms its benchmark in terms of SER. Full article
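
For context, the conventional limited-feedback baseline picks, from a shared codebook, the beamforming vector that maximizes the effective channel gain and feeds back only its index; a minimal sketch with a random unit-norm codebook standing in for the Grassmannian one (the paper's DNN replaces this exhaustive search):

```python
import numpy as np

def select_codeword(h, codebook):
    """Pick the codeword maximizing the effective channel gain |h^H w|^2."""
    gains = np.abs(codebook.conj() @ h) ** 2
    return int(gains.argmax())           # index fed back on the limited link

nt, bits = 4, 3
codebook = np.random.randn(2**bits, nt) + 1j * np.random.randn(2**bits, nt)
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)
h = (np.random.randn(nt) + 1j * np.random.randn(nt)) / np.sqrt(2)
print(select_codeword(h, codebook))
```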

13 pages, 289 KB  
Article
Induction of a Landau-Type Quantization in a Background of CPT-Odd Lorentz Symmetry Violation
by R. L. L. Vitória
Symmetry 2025, 17(7), 1070; https://doi.org/10.3390/sym17071070 - 5 Jul 2025
Viewed by 262
Abstract
In this article, we approach a scalar particle in a background characterized by Lorentz symmetry violation through a non-minimal coupling in the mathematical structure of the Klein–Gordon equation, where the Lorentz symmetry violation is governed by a background vector field. For an electric field configuration, searching for bound-state solutions, we determine the relativistic energy profile of the system, which is characterized by quantized orbits, that is, a relativistic Landau-type quantization. Then, we particularize our system and analyze it in the presence of a hard-wall potential, from which we analytically determine its relativistic energy profile for this type of confinement. Full article
17 pages, 1788 KB  
Article
Detection of Double Compression in HEVC Videos Containing B-Frames
by Yoshihisa Furushita, Daniele Baracchi, Marco Fontani, Dasara Shullani and Alessandro Piva
J. Imaging 2025, 11(7), 211; https://doi.org/10.3390/jimaging11070211 - 27 Jun 2025
Viewed by 632
Abstract
This study proposes a method to detect double compression in H.265/HEVC videos containing B-frames, a scenario underexplored in previous research. The method extracts frame-level encoding features—including frame type, coding unit (CU) size, quantization parameter (QP), and prediction modes—and represents each video as a 28-dimensional feature vector. A bidirectional Long Short-Term Memory (Bi-LSTM) classifier is then trained to model temporal inconsistencies introduced during recompression. To evaluate the method, we created a dataset of 129 HEVC-encoded YUV videos derived from 43 original sequences, covering various bitrate combinations and GOP structures. The proposed method achieved a detection accuracy of 80.06%, outperforming two existing baselines. These results demonstrate the practical applicability of the proposed approach in realistic double compression scenarios. Full article
(This article belongs to the Special Issue Celebrating the 10th Anniversary of the Journal of Imaging)
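
A minimal PyTorch sketch of the classifier stage, assuming per-frame 28-dimensional feature sequences; the hidden size and readout are assumptions, and the feature extraction itself is not reproduced:

```python
import torch
import torch.nn as nn

class DoubleCompressionDetector(nn.Module):
    """Bi-LSTM over per-frame 28-D encoding features, binary output."""
    def __init__(self, feat_dim=28, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                # x: (batch, frames, 28)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])     # logit: single vs. double compression

model = DoubleCompressionDetector()
logit = model(torch.randn(2, 120, 28))   # two clips of 120 frames each
```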

11 pages, 2486 KB  
Article
Constraints on Bit Precision and Row Parallelism for Reliable Computing-in-Memory
by Yongxiang Li, Shiqing Wang and Zhong Sun
Electronics 2025, 14(13), 2532; https://doi.org/10.3390/electronics14132532 - 22 Jun 2025
Viewed by 768
Abstract
Computing-in-memory (CIM) with emerging non-volatile resistive memory devices has demonstrated remarkable performance in data-intensive applications, such as neural networks and machine learning. A crosspoint memory array enables naturally parallel computation of matrix–vector multiplication (MVM) in the analog domain, offering significant advantages in terms of speed, energy efficiency, and computational density. However, the intrinsic device non-ideality residing in the analog conductance states distorts MVM precision and limits the application to high-precision scenarios, e.g., scientific computing. Yet, a theoretical framework for guiding reliable computing-in-memory designs has been lacking. In this work, we develop an analytical model describing the constraints on bit precision and row parallelism for reliable MVM operations. By leveraging the concept of capacity from information theory, the impact of non-ideality on computational precision is quantitatively analyzed. This work offers theoretical guidance for optimizing the quantized margins, providing valuable insights for future research and practical implementation of reliable CIM. Full article
(This article belongs to the Special Issue Analog Circuits and Analog Computing)
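
As a back-of-envelope illustration only (not the paper's capacity-based model): if each device contributes independent relative conductance noise sigma_rel, the accumulated analog error over N active rows grows roughly like sqrt(N), which caps how many rows can be summed while still resolving 2^bits output levels:

```python
import numpy as np

def max_rows(bits, sigma_rel, margin=0.5):
    """Back-of-envelope (not the paper's model): reliable readout of 2**bits
    levels requires the accumulated noise sigma_rel * sqrt(N) to stay under
    the quantization margin between adjacent output levels."""
    level = margin / (2 ** bits)         # spacing budget per output level
    return int((level / sigma_rel) ** 2)

for b in (2, 4, 6):
    print(b, "bits ->", max_rows(b, sigma_rel=0.01), "rows")
```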

24 pages, 158818 KB  
Article
Reconstruction of Cultural Heritage in Virtual Space Following Disasters
by Guanlin Chen, Yiyang Tong, Yuwei Wu, Yongjin Wu, Zesheng Liu and Jianwen Huang
Buildings 2025, 15(12), 2040; https://doi.org/10.3390/buildings15122040 - 13 Jun 2025
Viewed by 2030
Abstract
While previous studies have explored the use of digital technologies in cultural heritage site reconstruction, limited attention has been given to systems that simultaneously support cultural restoration and psychological healing. This study investigates how multimodal, deep learning–assisted digital technologies can aid displaced populations by enabling both digital reconstruction and trauma relief within virtual environments. A demonstrative virtual reconstruction workflow was developed using the Great Mosque of Aleppo in Syria as a case study. High-precision three-dimensional models were generated using Neural Radiance Fields, while Stable Diffusion was applied for texture style transfer and localized structural refinement. To enhance immersion, Vector Quantized Variational Autoencoder–based audio reconstruction was used to embed personalized ambient soundscapes into the virtual space. To evaluate the system's effectiveness, interviews, tests, and surveys were conducted with 20 refugees aged 18–50 years, using the Impact of Event Scale-Revised and the System Usability Scale as assessment tools. The results showed that the proposed approach improved the quality of digital heritage reconstruction and contributed to psychological well-being, offering a novel framework for integrating cultural memory and emotional support in post-disaster contexts. This research provides theoretical and practical insights for future efforts in combining cultural preservation and psychosocial recovery. Full article
(This article belongs to the Section Construction Management, and Computers & Digitization)

17 pages, 688 KB  
Article
Task-Based Quantizer for CSI Feedback in Multi-User MISO VLC/RF Systems
by Fugui He, Congcong Wang, Yao Nie, Xianglin Fan, Chensitian Zhang and Yang Yang
Electronics 2025, 14(11), 2277; https://doi.org/10.3390/electronics14112277 - 3 Jun 2025
Viewed by 584
Abstract
The performance of multiple-input single-output (MISO) transmission is highly dependent on the accuracy of the channel state information (CSI) at the base station (BS), which necessitates precise CSI estimation and reliable feedback from the user equipment. However, the overhead of the CSI feedback occupies substantial uplink bandwidth resources. To alleviate the overhead, this paper proposes a novel task-based quantizer for uplink MISO visible light communication (VLC) systems. In particular, a hybrid radio frequency (RF)/VLC system is considered, where VLC links are mainly used for large-volume downlink transmissions and RF links are used for uplink CSI feedback. Since the RF bandwidth resources are limited, the CSI is quantized to reduce the uplink resource requirements, which, however, inevitably causes CSI estimation errors at the BS. To guarantee the CSI estimation accuracy while minimizing the RF resource cost, a task-based quantization scheme for channel estimation (TQ-CE) is proposed. In the TQ-CE, both the quantized codebook and the post-processing matrix are optimized to minimize the mean square error (MSE) of the channel estimation. Taking the minimum MSE as the target task, the TQ-CE leverages vector quantization (VQ) to generate a codebook, which is designed to reduce the feedback overhead without compromising the precision of the channel estimation. Then, an optimal closed-form solution of the post-processing matrix is derived based on the minimum mean square error (MMSE) criterion. The simulation results demonstrate that the proposed TQ-CE achieved 0.25 Mbit/s and 0.62 Mbit/s higher data rates compared with the conventional scalar quantization-based channel estimation (SQ-CE) schemes and vector quantization-based channel estimation (VQ-CE) schemes, respectively. Moreover, in terms of the feedback overhead, compared with the 18-bit SQ-CE, the 4-bit TQ-CE achieved a 22.2% reduction in uplink bits. Full article
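
The closed-form post-processing matrix presumably specializes the standard linear-MMSE (Wiener) solution; as a generic sketch in notation of our own choosing, with q the quantized feedback, R_hq the channel–feedback cross-covariance, and R_qq the feedback covariance:

```latex
\mathbf{A}^{\star}
= \arg\min_{\mathbf{A}} \mathbb{E}\!\left[\lVert \mathbf{h} - \mathbf{A}\mathbf{q} \rVert^{2}\right]
= \mathbf{R}_{hq}\,\mathbf{R}_{qq}^{-1}.
```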

44 pages, 12058 KB  
Article
Harmonizer: A Universal Signal Tokenization Framework for Multimodal Large Language Models
by Amin Amiri, Alireza Ghaffarnia, Nafiseh Ghaffar Nia, Dalei Wu and Yu Liang
Mathematics 2025, 13(11), 1819; https://doi.org/10.3390/math13111819 - 29 May 2025
Viewed by 1801
Abstract
This paper introduces Harmonizer, a universal framework designed for tokenizing heterogeneous input signals, including text, audio, and video, to enable seamless integration into multimodal large language models (LLMs). Harmonizer employs a unified approach to convert diverse, non-linguistic signals into discrete tokens via its FusionQuantizer architecture, built on FluxFormer, to efficiently capture essential signal features while minimizing complexity. We enhance features through STFT-based spectral decomposition, Hilbert transform analytic signal extraction, and SCLAHE spectrogram contrast optimization, and train using a composite loss function to produce reliable embeddings and construct a robust vector vocabulary. Experimental validation on music datasets such as E-GMD v1.0.0, Maestro v3.0.0, and GTZAN demonstrates high fidelity across 288 s of vocal signals (MSE = 0.0037, CC = 0.9282, Cosine Sim. = 0.9278, DTW = 12.12, MFCC Sim. = 0.9997, Spectral Conv. = 0.2485). Preliminary tests on text reconstruction and UCF-101 video clips further confirm Harmonizer’s applicability across discrete and spatiotemporal modalities. Rooted in the universality of wave phenomena and Fourier theory, Harmonizer offers a physics-inspired, modality-agnostic fusion mechanism via wave superposition and interference principles. In summary, Harmonizer integrates natural language processing and signal processing into a coherent tokenization paradigm for efficient, interpretable multimodal learning. Full article
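
The two standard transforms named in the feature-enhancement stage are readily sketched with SciPy; the SCLAHE contrast step and the FusionQuantizer itself are not reproduced here:

```python
import numpy as np
from scipy.signal import stft, hilbert

fs = 16_000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)          # one second of a 440 Hz tone

# Spectral decomposition, as in the STFT-based feature stage
f, tau, Z = stft(x, fs=fs, nperseg=512)
log_mag = np.log1p(np.abs(Z))            # spectrogram to be contrast-enhanced

# Analytic signal for instantaneous amplitude/phase features
analytic = hilbert(x)
envelope = np.abs(analytic)
phase = np.unwrap(np.angle(analytic))
```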
