Abstract
Data prefetching is essential for modern file storage systems operating in large-scale cloud and data-intensive environments, where high performance increasingly depends on intelligent, adaptive mechanisms. Traditional rule-based methods and recently proposed machine learning-based techniques often struggle to cope with the complex and rapidly evolving data access patterns characteristic of big-data workloads. In this paper, we introduce an online, streaming machine learning (SML) approach for predictive data prefetching that retrieves useful data into the cache ahead of time. We present a novel online training framework that extracts features in real time and continuously updates streaming ML models so that they learn from and adapt to large, dynamic access streams. Building on this framework, we design new SML-driven prefetching algorithms that decide when, how, and what data to prefetch into the cache with minimal overhead. Extensive experiments using production traces from Huawei Technologies Inc. and Google workloads from the SNIA IOTTA repository demonstrate that our intelligent policies consistently deliver the highest byte hits among competing approaches, achieving 97% prefetch byte precision and reducing data access latency by a factor of up to 2.8. These results show that streaming ML can deliver immediate performance gains and offers a scalable foundation for future adaptive storage systems.