How to Use Gradient Accumulation to Overcome GPU Memory Limitations

1. Introduction

To train a machine learning model, the training dataset is split into batches, and the model processes the data one batch at a time. To process (evaluate) a batch, the computer must load it into memory, so there must be enough memory to hold each batch and process it completely. Complex models are typically trained on large volumes of detailed input data, such as high-resolution images or audio; the more detailed the data, the more memory is needed to hold it during processing. The available memory, typically GPU memory, can be insufficient to accommodate the desired batch size. When a model's memory requirement exceeds the available memory, training crashes with an out-of-memory error.

The larger the batch size, the more memory is needed to load and evaluate its data. One way of working around memory constraints is therefore to reduce the batch size. However, reducing the batch size is not always desirable: many models learn better and faster with larger batch sizes (up to a limit).

Another approach is to keep using smaller batches but not update the model parameters after every batch. Instead, the gradients computed from several batches are accumulated, and the parameters are updated once using the accumulated gradients. To a certain extent, this mimics the effect of training with a larger batch size, while only one small batch needs to fit in memory at a time. This technique is called gradient accumulation, and a short code sketch of it follows below.
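
As a concrete illustration, here is a minimal sketch of gradient accumulation in PyTorch. The toy model, optimizer, dataset, and the accumulation_steps setting are invented for this example; substitute your own. The sketch relies on the fact that in PyTorch, loss.backward() adds each batch's gradients to the existing .grad buffers, so accumulating simply means delaying optimizer.step() and optimizer.zero_grad() for several batches.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy model and data, invented for illustration; substitute your own.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=8)  # small batches that fit in memory

accumulation_steps = 4  # 4 batches of 8 mimic an effective batch size of 32

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    outputs = model(inputs)
    # Scale the loss so the accumulated gradient averages over all
    # accumulation_steps batches, approximating one large batch.
    loss = criterion(outputs, targets) / accumulation_steps
    loss.backward()  # adds this batch's gradients to the .grad buffers

    # Apply the update only once several batches have been accumulated.
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

With accumulation_steps = 4 and a per-batch size of 8, the parameters are updated as if with a batch size of 32, while only one batch of 8 ever has to reside in memory at a time.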
