Efficient Use of GPU Memory for Large-Scale Deep Learning Model Training

Abstract: To achieve high accuracy when performing deep learning, it is necessary to use a large-scale training model. However, due to the limitations of GPU memory, it is difficult to train large-scale models within a single GPU. NVIDIA introduced a technology called CUDA Unified Memory with CUDA 6 to overcome the limitations of GPU memory by virtually combining GPU memory and CPU memory. In addition, in CUDA 8, memory advise options were introduced to utilize CUDA Unified Memory efficiently. In this work, we propose a newly optimized scheme based on CUDA Unified Memory to use GPU memory efficiently by applying a different memory advise to each data type according to its access pattern in deep learning training. We apply CUDA Unified Memory technology to PyTorch to evaluate the performance of large-scale learning models with the expanded GPU memory. We conduct comprehensive experiments on how to utilize Unified Memory efficiently by applying memory advises when performing deep learning. As a result, when the data used for deep learning are divided into three types and a memory advise is applied to each type according to its access pattern, the deep learning execution time is reduced by 9.4% compared to the default Unified Memory.


Introduction
To achieve high accuracy when performing deep learning, the size of deep learning models is increasing [1][2][3][4][5][6]. However, it is difficult to use large-scale deep learning models that can achieve high accuracy because of the limited memory of a single GPU [7][8][9][10][11][12]. NVIDIA introduced CUDA Unified Memory technology in CUDA 6, allowing the GPU memory to be expanded by integrating the system's host (CPU) memory and device (GPU) memory [13][14][15]. CUDA Unified Memory enables automatic data migration between host memory and device memory via virtual memory. For example, if the GPU tries to access a virtual page that does not exist in GPU memory, a page fault occurs. The page fault is then resolved by mapping the accessed page to a physical page of the GPU and migrating the data from host memory to GPU memory. In this way, device memory and host memory are logically integrated, achieving the same effect as expanding the memory of the GPU. Therefore, it is possible to train large-scale deep learning models with the expanded GPU memory provided by CUDA Unified Memory technology.
NVIDIA provides CUDA Unified Memory technology from CUDA 6.0 to expand GPU memory. CUDA Unified Memory integrates CPU memory and GPU memory into a single address space; therefore, it enables the training of large-scale deep learning models by expanding the small GPU memory. In addition, starting with CUDA 8.0, memory advise can be used to efficiently manage data allocated in unified memory according to its access pattern [13,16,17]. NVIDIA divides memory advise into three types according to access patterns: Read mostly, Preferred location, and Access by. Read mostly enables efficient use of read-intensive data: when data set to Read mostly are accessed, a replica of the data is created on the device that accessed them. Preferred location sets the device where the data will reside; therefore, page faults due to data access are minimized by locating the data on the device that will mainly use them. Access by sets the device that will access the data. Devices set via Access by are directly mapped to the data; therefore, when such a device accesses the data, no page fault occurs.
The data required for deep learning can be roughly classified into three types: model parameters, input data, and intermediate results. Model parameters must be updated with new values every iteration and must stay on the GPU until training is complete. Input data are replaced with new data every iteration and are not modified by the deep learning operations. Intermediate results are newly generated in each iteration by the deep learning operations and are no longer needed after that iteration. The access patterns of the data used in deep learning are similar to those targeted by the memory advises supported by NVIDIA; therefore, when deep learning is performed with CUDA Unified Memory technology, memory advise makes it possible to perform deep learning efficiently.
In this paper, we extend GPU memory using CUDA Unified Memory technology for large-scale deep learning model training. To the best of our knowledge, our proposed scheme is the first approach to efficiently apply CUDA Unified Memory technology and the memory management hints provided by NVIDIA to deep learning. In addition, we propose an efficient GPU memory utilization method for large-scale model training using memory advise. The main contributions of our work are as follows:
• We propose the most efficient memory advise for each of the three types of data required for deep learning. Deep learning data can be divided into three types according to access patterns: model parameters, input data, and intermediate results. Likewise, memory advise can manage data in three ways: Read mostly, Preferred location, and Access by. The access patterns of the three types of deep learning data are similar to those that memory advise can manage; therefore, GPU memory can be used efficiently by assigning each type of deep learning data an appropriate memory advise. We trained a large-scale model, BERT-Large [18], with our optimized memory advise scheme using PyTorch to which CUDA Unified Memory technology is applied, and we analyze the results to suggest efficient memory advises. When the mini-batch size is 32 and the max sequence size is 256, the training time is reduced by up to 9.4% by using appropriate memory advises. Using vanilla PyTorch, BERT-Large can be trained with mini-batches of size up to 16 on a GPU with 24 GB of memory; with our proposed scheme applied to PyTorch, training can be performed with a mini-batch of size 32, which was impossible with vanilla PyTorch.
The rest of the paper is organized as follows. In Section 2, we describe the CUDA Unified Memory and how to use CUDA Unified Memory efficiently. The data management scheme we propose for efficient deep learning is described in detail in Section 3. In addition, Section 4 describes how we modified PyTorch to apply CUDA Unified Memory.
In Section 5, we analyze the performance of the proposed GPU memory utilization method. We present the related work in Section 6. Finally, we conclude our paper in Section 7.

CUDA Unified Memory
From CUDA 6.0, NVIDIA provides CUDA Unified Memory, a technology to overcome the limit of single GPU memory size [13,14,16]. CUDA Unified Memory allows physically separated host memory and device memory to be integrated into one logical address space using a virtual memory scheme. Unified Memory works in the following way. First, when the GPU attempts to access data in the CPU memory, a page fault occurs. The host then unmaps the page from the CPU memory and migrates the data to the GPU memory. Finally, the page fault is resolved by mapping the page to the GPU memory. With CUDA Unified Memory, the GPU can use data stored in physically different locations without explicit calls to memory copy functions such as cudaMemcpy() and cudaMemcpyAsync(). However, since a page fault occurs when accessing data in a physically different location, the running computation must stall while the page fault is resolved and the data are migrated, which may degrade performance.
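The fault-and-migrate flow above can be pictured with a small simulation. This is plain Python with no real CUDA calls; the class and method names are illustrative only:

```python
# Toy model of Unified Memory demand paging: a page resides on one device;
# a first-touch access from another device raises a page fault, after which
# the page is unmapped there and migrated (names are ours, not CUDA's).

class UnifiedMemory:
    def __init__(self):
        self.location = {}   # page id -> device currently holding the page
        self.faults = 0

    def alloc(self, page, device="cpu"):
        self.location[page] = device

    def access(self, page, device):
        if self.location[page] != device:
            self.faults += 1                 # page fault on remote access
            self.location[page] = device     # unmap + migrate to accessor
        return self.location[page]

mem = UnifiedMemory()
mem.alloc("weights", device="cpu")
mem.access("weights", "gpu")   # first GPU touch: fault, then migration
mem.access("weights", "gpu")   # page now resident on the GPU: no fault
assert mem.faults == 1 and mem.location["weights"] == "gpu"
```

The second access is served locally, which is exactly why the first-touch fault cost matters only once per migration in the default scheme.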

How to Use CUDA Unified Memory Efficiently
For efficient use of CUDA Unified Memory technology, NVIDIA supports memory advise starting with CUDA 8.0. Memory advise can be divided into three types according to data access patterns [13,16]. Figure 1 shows how memory advise works.
Figure 1. Operation method according to the type of memory advise: (a) Data set as Preferred location are directly mapped when accessed from other devices; (b) If data are set to Access by, a direct mapping is created with a specific device; (c) When another device accesses data set to Read mostly, a replica is created on the accessing device.
• Preferred location sets the physical location where the data are actually stored. Data set as Preferred location are pinned to the designated physical location; therefore, if the data are accessed from another device, they are mapped directly rather than migrated. For example, as shown in Figure 1a, when Device 1 first accesses data pinned to Device 0, a page fault occurs. After the page fault, the data are mapped directly to Device 1 instead of being unmapped from the memory of Device 0 and migrated to Device 1's memory. Therefore, the next time Device 1 accesses the data, no page fault occurs. Preferred location works well when specific data need to be physically pinned to a specific device.
• Access by specifies a direct mapping to a specific device, as depicted in Figure 1b; a device that is directly mapped via Access by does not generate a page fault when accessing the corresponding data. The difference from Preferred location is that data set as Access by are not pinned to a specific device and can move freely. The directly mapped device can therefore access the data without a page fault no matter where they physically reside; however, frequently accessed data may incur overhead because accessing data located on a different device takes longer than accessing local data.
• Read mostly is efficient for read-intensive data. As shown in Figure 1c, when data set to Read mostly are accessed by a device that does not physically hold them, a replica is created on the accessing device instead of migrating the data. Therefore, after the page fault on the first access, the data can be read quickly from the local replica; however, if the data are modified, every device holding a replica must be updated, which is a disadvantage.
NVIDIA supports a data prefetch scheme in addition to memory advise to efficiently utilize CUDA Unified Memory technology. Before accessing data allocated in unified memory, the user can call cudaMemPrefetchAsync() to prefetch the data to the device where they will be used. Data prefetched through cudaMemPrefetchAsync() can be accessed without a page fault because they are already on the device. Data prefetch operates differently depending on the memory advise settings. For data set to Read mostly, prefetching creates a replica on the device. When data set as Preferred location are prefetched to a device other than the one where they are pinned, the data are moved to that device rather than remaining pinned.
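The three advises and the prefetch behavior described above can be summarized in a toy model. Again this is plain Python; the names and the simplified single-page granularity are our own, not the CUDA API:

```python
# Toy semantics of the three memory advises plus prefetch, following the
# behavior described above (simplified; names are ours, not CUDA's).

class Page:
    def __init__(self, home, advise=None):
        self.home = home        # device physically holding the data
        self.advise = advise
        self.copies = {home}    # devices that can access without faulting
        self.faults = 0

    def access_by(self, device):
        # Access by: set up a direct mapping in advance.
        self.copies.add(device)

    def prefetch(self, device):
        # cudaMemPrefetchAsync analogue: replicate for Read mostly,
        # otherwise migrate the page to the target device.
        self.copies.add(device)
        if self.advise != "read_mostly":
            self.home = device

    def access(self, device):
        if device not in self.copies:
            self.faults += 1            # first remote touch: page fault
            self.copies.add(device)     # replicate (Read mostly) or map
            if self.advise is None:
                self.home = device      # default UM migrates the page
        return self.home

# Read mostly: one fault on the first GPU access, then reads are local and
# the data stay physically on the CPU (a replica is made, not a move).
p = Page("cpu", advise="read_mostly")
p.access("gpu"); p.access("gpu")
assert p.faults == 1 and p.home == "cpu"

# Access by: the direct mapping exists up front, so no fault occurs.
q = Page("cpu")
q.access_by("gpu"); q.access("gpu")
assert q.faults == 0

# Prefetch: moving data ahead of use also avoids the fault.
r = Page("cpu")
r.prefetch("gpu"); r.access("gpu")
assert r.faults == 0 and r.home == "gpu"
```

The contrast between the three final assertions mirrors Figure 1: replication for Read mostly, fault-free mapping for Access by, and fault avoidance via prefetch.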

PyTorch
PyTorch is one of the most representative open-source deep learning frameworks. PyTorch operates in a Define-by-Run manner, unlike Caffe and TensorFlow, representative deep learning frameworks that operate in a Define-and-Run manner. By operating in the Define-by-Run manner, PyTorch avoids the difficulty of debugging that is a disadvantage of existing Define-and-Run deep learning frameworks. Further, PyTorch is a Python framework designed to be easy to use for existing Python users.
Due to the Global Interpreter Lock (GIL), Python can only run one thread at a time; therefore, Python-based Define-and-Run deep learning frameworks, which execute their precompiled graphs outside the interpreter, show better performance than Define-by-Run frameworks. PyTorch addresses this performance problem of Define-by-Run frameworks in the following ways. Its core is implemented in C++ to achieve high performance, which allows PyTorch to run multiple threads at the same time despite the GIL. In addition, PyTorch can run Python code on the CPU while computing on the GPU via the CUDA Stream mechanism; thus, PyTorch can keep GPU utilization high even though Python has a large execution overhead. PyTorch is also implemented to dynamically allocate and reuse memory according to the characteristics of deep learning, so memory is used efficiently. Furthermore, by extending Python's multiprocessing module, PyTorch avoids problems caused by the GIL and lets users easily implement parallel programs, and it avoids wasting memory through reference counting.

Efficient Data Management Scheme for Deep Learning
In this study, we enable large-scale model training on a single GPU by expanding GPU memory, integrating CPU memory and GPU memory through CUDA Unified Memory technology. In this section, we propose a method to efficiently manage the data used in deep learning using the memory advise and data prefetch features supported from CUDA 8.0. For example, an efficient convolution algorithm requires feature map data and layer weights for feed forwarding, and an additional temporary buffer is required when using a fast-Fourier-transform-based convolution algorithm [19]. In this paper, we classify these data into two types: model parameters and intermediate results. Since each layer's weights are part of the model parameters, the layer weights can be classified as model parameters. Feature map data and temporary buffers are generated and used during feed forwarding or backpropagation, so we classify them as intermediate results.
However, the data required to perform deep learning are not only intermediate results, such as feature map data and temporary buffers, and model parameters, but also input data to be trained. Therefore, we can classify the data needed in deep learning into three types: model parameters, intermediate results, and input data. The three types of data needed to perform deep learning are created, modified, and used at different times. This means that the three types of data needed to perform deep learning each have different access patterns.
The characteristics of the three types of data used in deep learning are as follows:
• First, there are model parameters. Model parameters are generated and initialized before training starts. They are weights for the features of the input data. During feed forwarding, the weights of each hidden layer (part of the model parameters) are combined with the feature map output by the previous hidden layer to generate the feature map passed to the next hidden layer. After backpropagation, the model parameters are updated with new values, and the updated parameters are used by the GPU for training in the next iteration. In addition, in most deep learning models, model parameters account for the smallest proportion of the total deep learning data [19]; therefore, we recommend that the model parameters remain on the GPU until training is complete.
• Second, there are intermediate results. As described above, intermediate results such as feature map data and temporary buffers are newly generated in each iteration by the deep learning operations and are no longer needed once that iteration finishes.
• Finally, there are input data. The input data are the data to be trained on, passed to the first layer (input layer) of the deep learning model. Input data are delivered to the GPU performing deep learning as new data every iteration. In addition, the input data are not changed by the deep learning operations and are no longer needed after one iteration. Therefore, the input data do not have to stay on the GPU until training is finished.
Table 1 summarizes the access patterns according to the types of deep learning data. Given the access characteristics of the deep learning data described above, the access patterns supported by memory advise and those of the deep learning data are similar. Model parameters remain in GPU memory until training is finished and are accessed and updated by the GPU in every iteration; therefore, model parameters can be managed efficiently when set as Preferred location. Input data are accessed and used by the GPU but are not changed by the deep learning operations, so they can be used efficiently by setting them to Read mostly. However, the input data are replaced with new data read from the storage device every iteration; that is, the input data are overwritten with new data at the start of each iteration. It is therefore also worth considering not using any memory advise for input data, since they are accessed only once and replaced every iteration. Intermediate results are used by the GPU in every iteration; however, since the intermediate results are very large, it is difficult for all of them to reside on the GPU at all times; therefore, intermediate results can be used efficiently by moving them to GPU memory through data prefetch before they are used.
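The mapping argued for above can be stated compactly as a lookup table. This is a minimal sketch; the identifiers below are our own shorthand, not API names:

```python
# Suggested memory advise per deep learning data type, following the
# access-pattern analysis above (helper name and keys are ours).

RECOMMENDED_ADVISE = {
    "model_parameters": "preferred_location",  # pinned to and updated on the GPU
    "input_data": "read_mostly",               # read each iteration, never modified
    "intermediate_results": "prefetch",        # too large to stay resident
}

def advise_for(data_type):
    return RECOMMENDED_ADVISE[data_type]

assert advise_for("model_parameters") == "preferred_location"
```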

Implementation in PyTorch
PyTorch, an open source deep learning framework, currently does not support CUDA Unified Memory technology; therefore, it is impossible to train a large-scale deep learning model that requires more than the GPU memory size through PyTorch in a single GPU environment. In this study, to overcome the limitations of existing PyTorch, CUDA Unified Memory technology is applied to PyTorch to enable large-scale deep learning model training.
PyTorch performs computation in the Define-by-Run manner [20]. PyTorch does not allocate the memory required for an operation in advance, as TensorFlow's Define-and-Run approach does; rather, it allocates GPU memory when the user requests it or when GPU operations require additional memory space. The user can copy a specific tensor to another device's memory by calling the to() function; calling to('cuda') transfers the tensor to GPU memory, where it can be used for GPU computation. Further, PyTorch automatically allocates additional GPU memory for results computed from tensors on the GPU and stores those results in GPU memory. Table 2 shows the allocated GPU memory size as training progresses when ResNet-50 is trained using PyTorch. The ResNet-50 model is a representative convolutional neural network (CNN) for image classification consisting of 50 convolution layers. By analyzing the source code of PyTorch, we confirmed that all GPU memory allocation in PyTorch is performed through CUDACachingAllocator. CUDA Unified Memory technology can be applied to PyTorch simply by calling cudaMallocManaged() instead of cudaMalloc() in the malloc() function of CUDACachingAllocator. However, with this naive approach, the user can no longer use the ordinary GPU memory allocation method, and only a single fixed memory advise can be set; therefore, users are forced to use unified memory even when the memory required for deep learning is smaller than the GPU memory. Further, even when unified memory is needed because GPU memory is insufficient, CUDA Unified Memory cannot be used efficiently because the same memory advise must be set for data with different access patterns.
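The allocator change described above (swapping cudaMalloc() for cudaMallocManaged() inside CUDACachingAllocator's malloc()) is easy to picture with a pool allocator whose backend is pluggable. The following is a pure-Python stand-in, not PyTorch's actual implementation:

```python
# Stand-in for a CUDACachingAllocator-style pool: freed blocks are cached
# and reused by size, and the backend allocator is a parameter, so moving
# from cudaMalloc to cudaMallocManaged is a one-line change (names ours).

class CachingAllocator:
    def __init__(self, backend_alloc):
        self.backend_alloc = backend_alloc   # e.g. cudaMalloc vs cudaMallocManaged
        self.free_blocks = {}                # size -> list of cached blocks
        self.backend_calls = 0

    def malloc(self, size):
        cached = self.free_blocks.get(size)
        if cached:
            return cached.pop()              # reuse without touching the backend
        self.backend_calls += 1
        return self.backend_alloc(size)

    def free(self, block, size):
        self.free_blocks.setdefault(size, []).append(block)

alloc = CachingAllocator(backend_alloc=lambda size: bytearray(size))
a = alloc.malloc(1024)
alloc.free(a, 1024)
b = alloc.malloc(1024)        # served from the cache, not the backend
assert b is a and alloc.backend_calls == 1
```

Because all allocations funnel through one malloc(), changing only the backend call is enough to route every tensor into unified memory, which is precisely why the naive swap also removes the user's ability to opt out.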
Therefore, we need an implementation that lets users choose whether to apply CUDA Unified Memory and which memory advise to use. Exposing these choices would normally require implementing a new user interface function and adding a new option, which is intrusive. To solve this problem and apply CUDA Unified Memory technology efficiently, we analyzed PyTorch's GPU memory allocation path. As a result of the analysis, we confirmed that the functions are called in the order shown in Figure 2 when memory is allocated by calling PyTorch's to("device:index"). When a PyTorch user calls the Python-level function to("device:index"), the C-level implementation to_impl() is called, and the "device:index" argument is passed to the C level, where it is used to create a device object. The to_impl() function checks the device information and calls empty_cuda() or empty_cpu() depending on the device type. Through this analysis, we confirmed that the device information is transferred to the C level when to("device:index") is called. The device in the "device:index" argument is either 'cuda' or 'cpu', indicating the type of device on which the tensor operation should be performed, and the index indicates which device of that type to use, i.e., which GPU the tensor should be stored on. Thus, we implemented our scheme to deliver the memory advise information to the C level through the index rather than the device.
In Figure 2, the functions we modified to set the memory advise are marked in red. We modified part of PyTorch's C-level implementation so that memory advise information can be passed through the index information when calling to("device:index"). The memory advise delivered through the index is passed to the malloc() function of CUDACachingAllocator, so that a tensor created at the Python level can be set to the desired memory advise. Users can set a tensor's memory advise simply by changing tensor.to("cuda:index") in the vanilla PyTorch code to tensor.to("cuda:memoryadvise"). When tensor.to("cuda:memoryadvise") is called, the tensor is copied to unified memory and the memory advise is applied. In the case of prefetch, calling tensor.to("cuda:memoryadvise") not only copies the tensor to unified memory but also calls cudaMemPrefetchAsync() to prefetch the tensor to GPU memory. Users can set the memory advise in five different ways: noadvise, preferredlocation, accessby, readmostly, and prefetch.
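The idea of overloading the index slot of "device:index" can be sketched with a small parser. This is a hypothetical Python helper illustrating the dispatch; the real logic lives in PyTorch's C-level implementation:

```python
# Sketch of decoding "device:index" when the index slot is overloaded to
# carry a memory advise string (hypothetical helper, not PyTorch's API).

ADVISES = {"noadvise", "preferredlocation", "accessby", "readmostly", "prefetch"}

def parse_device(spec):
    device, _, index = spec.partition(":")
    if index in ADVISES:
        return device, 0, index           # advise passed instead of a device index
    return device, int(index or 0), None  # ordinary "cuda:1"-style index

assert parse_device("cuda:readmostly") == ("cuda", 0, "readmostly")
assert parse_device("cuda:1") == ("cuda", 1, None)
```

Reusing the existing argument keeps the user-facing to() signature unchanged, which is the point of the design: no new interface function is needed for tensors created at the Python level.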
Modifying only the functions marked in red in Figure 2 is not enough: the intermediate results generated by operations on the GPU cannot use CUDA Unified Memory, and their memory advise information cannot be passed to the malloc() function of CUDACachingAllocator via to("device:index"). Therefore, we created a new Python-level user interface function to set the memory advise of intermediate results, and the newly created C-level function passes this memory advise information to the malloc() function of CUDACachingAllocator. We added the three functions colored in blue in Figure 3, which shows the calling sequence of the newly implemented functions that pass the memory advise information of the intermediate results. torch.cuda.set_device(args.gpu) is an API that sets the GPU to be used for computation, so users do not need to modify it to apply CUDA Unified Memory. Users add torch.cuda.set_unified(args.advise) to make PyTorch use CUDA Unified Memory; when torch.cuda.set_unified(args.advise) is called, the memory advise of the intermediate results is set, so users pass the desired memory advise to torch.cuda.set_unified(args.advise). model.cuda(args.gpu) copies the training model to the GPU and works the same as model.to("cuda:args.gpu"); users can copy the model to unified memory by changing model.to("cuda:args.gpu") to model.to("cuda:args.parameter"). In short, users can use CUDA Unified Memory with memory advise by modifying just two lines of the vanilla PyTorch code. We conducted experiments by changing the vanilla PyTorch code in Figure 4 to the modified PyTorch code in Figure 5. The results of the experiments are described in the next section.

Experiment and Performance Analysis
To analyze the performance of PyTorch with CUDA Unified Memory applied, we configured the following experimental environment. We used Python 3.6.12, CUDA 10.1, NVIDIA driver 440.54, and PyTorch 1.6.0. As the GPU, an NVIDIA Quadro RTX 6000 with 24 GB of memory was used. As the deep learning model, BERT-Large was used. BERT-Large consists of 24 layers and 16 self-attention heads, and its hidden size is 1024. BERT-Large is a large-scale deep learning model with a total of 340 M model parameters. We used BERT in our experiments because it is a widely used, representative large-scale deep learning model. When training BERT-Large on a typical GPU, it is difficult to use a mini-batch and max sequence of sufficient size. We were able to train BERT-Large with a mini-batch size of 16 using a GPU with 24 GB of memory; however, it was not possible to train BERT-Large with a mini-batch size of 32. We divided the experiments into the case where the data size is smaller than the GPU memory size and the case where it is larger. Figure 6 shows the training time when the data size is smaller than the GPU memory size; note that the y-axis of Figure 6 starts at 0.8. BERT-Large can be trained with a mini-batch size of 16 and a max sequence size of 256 without using unified memory. When deep learning is performed without unified memory, the time to perform one iteration is about 0.87 s, which is the best performance. On the other hand, when unified memory is applied, performance is lower. In particular, when the model parameters are set as Preferred location and no advise is set for the input data, the training time when prefetching the intermediate results is about 0.98 s, the lowest performance.
Further, when the memory advise of the model parameters and that of the input data are the same, prefetching the intermediate results shows the lowest performance. The reason is that, when the data size is smaller than the GPU memory size, the communication overhead of prefetching exceeds the deep learning computation time on the GPU; therefore, if deep learning fits in GPU memory, it is better not to prefetch intermediate results. When the model parameters are set as Preferred location and the input data are set to Read mostly, and the intermediate results are set as Preferred location or given no memory advise, the execution time is 0.89 s, only 0.019 s longer than when unified memory is not applied. This is the best performance with unified memory applied. The reason for this good performance seems to be that, with this memory advise setting, no overhead occurs other than that caused by the initial page faults.
Figure 7. Training time increase rate due to the use of unified memory when the data size is smaller than the GPU memory size.
Figure 7 shows the increase rate of training time due to the use of unified memory when deep learning can be performed within GPU memory. The training time increases by up to 12.3% and by at least 2.3% when unified memory is applied. Therefore, if deep learning fits within GPU memory, applying unified memory is not efficient. Further, it is better not to prefetch intermediate results when unified memory is applied. Figure 8 shows the execution time when deep learning cannot be performed without unified memory because the data size is larger than the GPU memory size. Please note that the y-axis of Figure 8 starts at 14.
Figure 8 shows that the performance is the lowest when memory advise is not set for model parameters and input data; however, setting the model parameters to the Preferred location gives the best performance.
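As a rough sanity check, the slowdown percentages for the small-data case follow directly from the per-iteration times quoted above. Note that the quoted times are rounded, so the worst case comes out near, rather than exactly at, the reported 12.3%:

```python
# Relative training-time increase computed from the per-iteration times
# quoted above (0.87 s baseline without unified memory; the quoted times
# are rounded, so the worst case lands near the reported 12.3%).

def increase_rate(baseline, measured):
    return (measured - baseline) / baseline

best = increase_rate(0.87, 0.89)    # best advise combination vs. no UM
worst = increase_rate(0.87, 0.98)   # prefetching intermediates, small data
assert round(best * 100, 1) == 2.3
assert round(worst * 100, 1) == 12.6
```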

Results of GPU Page Fault Profiling Using nvprof
In this section, we profile the experiment (mini-batch size: 32, max sequence: 256) from the previous section, in which the data size is larger than the GPU memory size, using nvprof, NVIDIA's profiling tool, and analyze the results. When the data size required for deep learning is larger than the GPU memory, the memory advise setting with the best performance is prefetching the intermediate results, setting the model parameters as Preferred location, and setting the input data to Read mostly. The memory advise setting with the lowest performance is setting the intermediate results as Preferred location without giving any memory advise to the model parameters and input data. Table 3 summarizes the GPU page fault profiling results, obtained by profiling training for 10 iterations with each memory advise applied. Table 3 shows the number of page faults caused by unified memory; page faults occur most often when no memory advise is applied. The differences in the elapsed time for unified memory data migration across memory advises are negligible. Further, data migration takes more time from host to device than from device to host. Therefore, as can be expected from the results shown in Figures 8 and 9 in the previous section, Table 3 shows that the memory advise with the best performance generated the fewest page faults. We used NVIDIA Visual Profiler, a graphical profiling tool, to examine the pattern of page faults under each memory advise. Figure 10 visualizes the profiling results using NVIDIA Visual Profiler. Figure 10a is the case where no memory advise is applied, and Figure 10b,c show the results of applying the memory advise with the highest performance and the memory advise with the lowest performance, respectively.
In the upper red area of Figure 10a-c, the pattern of page faults caused by unified memory during deep learning can be seen. NVIDIA Visual Profiler divides the execution of the process into equal-width sections and expresses, as a color, the percentage of time spent resolving GPU page faults in each section. As the upper red areas of the figures show, the colors appear in a similar pattern regardless of memory advise; that is, page faults occur in a similar pattern regardless of memory advise when deep learning is performed. In the blue area at the bottom of Figure 10a-c, the time taken to resolve page faults can be read from the color of each section. Since each section's width is the total execution time divided into the same number of intervals, even where the colors are the same across the three figures, the absolute time taken to resolve page faults is shortest when the best-performing memory advise is applied. From the profiling results, we confirm that GPU page faults can be reduced by applying the optimal memory advise; reducing GPU page faults reduces their overhead, so the GPU's compute resources can be used more efficiently. Figure 11 shows the pattern of GPU page faults when training a single mini-batch. The area marked in blue in Figure 11 is where feed forwarding is performed, and the area marked in red is where backpropagation is performed. Most sections during feed forwarding are displayed in green, whereas during backpropagation the proportion of sections marked in yellow is high. A green section means that the percentage of time spent resolving page faults is 40% to 50%, and a yellow section means 60% to 70%.
This shows that the fraction of time that computation is stalled by page faults is greater during backpropagation than during feed forwarding; therefore, to utilize the GPU's computing resources more efficiently when performing deep learning with unified memory, a technique is needed to identify and prefetch the data required by the backpropagation operations.

Related Works
Among the methods for optimizing GPU memory during deep learning is swapping data in and out between GPU memory and CPU memory. vDNN, a memory management system proposed by NVIDIA, overcomes the limitations of GPU memory by swapping the feature map data generated by deep learning operations in and out [19]. During feed forwarding, vDNN swaps feature map data that is not currently in use out of GPU memory to CPU memory. Later, vDNN swaps the feature map data required for backpropagation back in from CPU memory to GPU memory; therefore, it is possible to train with a mini-batch larger than the GPU memory. vDNN is similar to our proposed technique in that it improves deep learning performance through efficient memory management. However, unlike vDNN, our work utilizes CUDA Unified Memory to expand GPU memory. In addition, we apply memory advise according to data access patterns to use CUDA Unified Memory efficiently.
Kim et al. proposed a new memory management method for accelerating deep learning applications [11]. They extended NVIDIA's vDNN concept to address the computational performance degradation caused by PCIe bottlenecks in multi-GPU environments. To mitigate the PCIe bottleneck, they limited the number of GPUs that swap the feature map data generated in a specific hidden layer during training. They also designed an algorithm that effectively prefetches the swapped-out feature map data back into GPU memory. Unlike their research, we propose a memory expansion method for a single-GPU environment using CUDA Unified Memory technology and memory advise.
Chien, Peng, and Markidis evaluated the performance of CUDA Unified Memory technology and memory advise on Intel-Volta/Pascal-PCIe platforms and a Power9-Volta-NVLink platform [13]. They evaluated CUDA Unified Memory and memory advise using eight GPU applications, such as matrix multiplication and convolution. Through experiments on these various hardware platforms, they showed that when the data size exceeds the GPU memory size, applying memory advise to CUDA Unified Memory can improve performance. However, their study has the limitation that it includes no performance evaluation on widely used deep learning workloads. Our study shows that applying CUDA Unified Memory to deep learning workloads increases the size of the trainable mini-batch, and that memory can be managed efficiently by using memory advise.
Most of the datasets used in GPU applications such as deep learning are very large, often exceeding not only GPU memory but also host memory [21]. In most HPC systems, datasets are stored in large-capacity non-volatile memory (NVM) storage; therefore, for complex applications, data movement between GPU memory and NVM storage can be a difficult problem. To solve this problem, Direct Resource Access for GPUs Over NVM (DRAGON) was developed. DRAGON extends GPU memory to NVM storage, not just host memory, by extending NVIDIA's Unified Memory technology. It optimizes data movement by dividing data access patterns into read-only and write-only, and separately manages the intermediate data generated by GPU operations. Because DRAGON optimizes only for read-only and write-only accesses, it is difficult for it to optimize according to the data access patterns of deep learning. Our proposed scheme can manage memory more efficiently by dividing deep learning data into three types: model parameters, intermediate results, and input data.
To overcome the limitations of large-scale deep learning model training, Microsoft developed the Zero Redundancy Optimizer (ZeRO) by integrating existing data parallelization and model parallelization techniques [22]. Through ZeRO-DP, Microsoft achieved both the computational and communication efficiency of data parallelism and the memory efficiency of model parallelism, overcoming the limitations of existing parallelization techniques. Microsoft also developed ZeRO-R to optimize three types of residual state memory: activations, temporary buffers, and unusable memory fragments. Microsoft integrated ZeRO-DP and ZeRO-R to implement ZeRO, a memory-optimized system for deep learning. ZeRO efficiently utilizes multiple GPUs to train large-scale models. Unlike ZeRO, we propose a technique for efficiently training large-scale models in a single-GPU environment. In addition, we divide the data used for deep learning into three types according to access patterns and apply a memory advise to each type to utilize GPU memory efficiently.
The GPU-Enabled Memory-Aware Model-Parallelism System (GEMS) has been proposed to train large-scale deep learning models on the high-resolution images mainly used in digital pathology [23]. Four techniques are proposed in their paper: GEMS-Basic, GEMS-MAST, GEMS-MASTER, and GEMS-Hybrid. GEMS-MAST trains a replica of part of the model by utilizing memory and computing resources left unused by the existing model parallelization technique. GEMS-MASTER uses more model replicas, whereas GEMS-MAST uses only two; through this, GEMS-MASTER can train the model faster by overlapping computations and reducing the number of allreduce executions. They also achieved near-linear scalability with GEMS-Hybrid. Unlike our work, GEMS requires multiple GPUs to train large-scale models. Our work uses CUDA Unified Memory technology to train large-scale models even in a single-GPU environment, and can manage memory efficiently to improve training performance.
The Out-of-Core DNN training framework (OC-DNN) is a deep learning framework that exploits the new Unified Memory features of Volta and Pascal GPUs [24]. The two main components of OC-DNN are the Interception Library (IL) and OC-Caffe. IL is an independent library that operates between the CUDA Runtime/Driver and the HPC platform. IL allows applications to run on data that exceeds GPU memory without modifying the existing applications. OC-Caffe saves GPU memory and overlaps data copies with computation by using Unified Memory buffers. In addition, OC-Caffe changes the existing device-to-device communication to managed-memory-to-managed-memory communication, and uses cudaMemAdvise and cudaMemPrefetchAsync to optimize intra-GPU communication. OC-DNN and our study are similar in that both use CUDA Unified Memory technology. Unlike OC-DNN, we divide deep learning data into three types and use memory advise, a memory management hint provided by NVIDIA, to manage memory efficiently and improve training performance.

Conclusions
In this study, we propose an efficient data management scheme for deep learning using CUDA Unified Memory and memory advise, based on experiments training a large-scale deep learning model. We applied CUDA Unified Memory technology to PyTorch and trained a large-scale model, BERT-Large. According to the results of the experiments conducted in this paper, using Unified Memory with our optimized memory advise scheme shows performance similar to the vanilla method, with negligible overheads. On the other hand, when training on a single GPU is not otherwise possible, we show that the most efficient way to use Unified Memory is to prefetch the intermediate results and to apply the Preferred Location advise to the model parameters so that they are pinned to the GPU performing deep learning. When memory advise is configured properly by our proposed method, the deep learning execution time can be reduced by up to 9.4% compared to the default Unified Memory. Our approach does not modify the training algorithm of deep learning. Moreover, the data of most deep learning models other than BERT-Large can be classified in the same way as in this study; therefore, our proposed scheme can also improve performance when training deep learning models other than BERT-Large.
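The per-type policy summarized above can be sketched as a simple mapping from our three data types to the corresponding Unified Memory actions. The sketch below is illustrative only: the string values name the real CUDA runtime calls (cudaMemAdviseSetPreferredLocation and cudaMemPrefetchAsync), but the classification keys and the `advise_for` helper are hypothetical names of ours, and input data is assumed to be left to default on-demand migration.

```python
# Hypothetical sketch of the proposed per-type memory advise policy.
# The string values name the CUDA runtime actions that would be issued
# on the Unified Memory region holding each kind of data.
ADVISE_POLICY = {
    # Model parameters: pinned to the training GPU via Preferred Location.
    "model_parameter": "cudaMemAdviseSetPreferredLocation",
    # Intermediate results: prefetched to the GPU before they are accessed.
    "intermediate_result": "cudaMemPrefetchAsync",
    # Input data: left to default on-demand Unified Memory migration.
    "input_data": "default",
}

def advise_for(data_type):
    """Return the Unified Memory action for a given data type."""
    return ADVISE_POLICY[data_type]

if __name__ == "__main__":
    for data_type in ADVISE_POLICY:
        print(data_type, "->", advise_for(data_type))
```

Keeping the policy as a single mapping makes it easy to reuse the same classification for models other than BERT-Large, since only the assignment of each tensor to one of the three types depends on the model.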
In the future, we plan to perform experiments in various environments and to further optimize Unified Memory by implementing a data prefetch function for the backpropagation phase.