Google Open-Sources GPipe for Training Large-Scale Neural Network Models
Google has open-sourced GPipe, a distributed machine learning library for efficiently training large-scale neural network models. GPipe uses synchronous stochastic gradient descent and pipeline parallelism to train any DNN that consists of multiple sequential layers. Importantly, GPipe allows researchers to easily deploy more accelerators to train larger models and to scale performance without tuning hyperparameters.
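To make the batch-splitting idea concrete, below is a minimal, self-contained sketch of pipeline-style training with micro-batches and a single synchronous SGD update per mini-batch. This is not the GPipe API: the toy `Linear` layers, the hard-coded partitioning, and the training loop are illustrative assumptions, and the "partitions" run sequentially in one process rather than on separate accelerators, which real pipeline parallelism would overlap.

```python
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """A toy linear layer with manual forward/backward (illustration only)."""
    def __init__(self, d_in, d_out):
        self.W = rng.normal(scale=0.1, size=(d_in, d_out))
        self.grad = np.zeros_like(self.W)

    def forward(self, x):
        self.x = x                              # cache input for the backward pass
        return x @ self.W

    def backward(self, grad_out):
        self.grad += self.x.T @ grad_out        # accumulate gradients across micro-batches
        return grad_out @ self.W.T

# A sequential model split into partitions; in real pipeline parallelism each
# partition would be placed on a different accelerator.
layers = [Linear(16, 32), Linear(32, 32), Linear(32, 32), Linear(32, 1)]
partitions = [layers[:2], layers[2:]]           # 2 partitions of 2 layers each

def train_step(x, y, micro_batches=4, lr=0.1):
    """One mini-batch step: split into micro-batches, push them through the
    partitions, then apply a single synchronous SGD update."""
    for layer in layers:
        layer.grad[:] = 0.0
    for xb, yb in zip(np.array_split(x, micro_batches),
                      np.array_split(y, micro_batches)):
        h = xb
        for part in partitions:                 # forward through each partition in turn
            for layer in part:
                h = layer.forward(h)
        grad = (h - yb) / len(xb)               # gradient of the mean-squared-error loss
        for part in reversed(partitions):       # backward in reverse partition order
            for layer in reversed(part):
                grad = layer.backward(grad)
    for layer in layers:                        # synchronous update: equivalent to training
        layer.W -= lr * layer.grad / micro_batches  # on the whole mini-batch at once
    return float(np.mean((h - yb) ** 2))        # loss on the last micro-batch, for monitoring

x = rng.normal(size=(64, 16))
y = rng.normal(size=(64, 1))
for step in range(5):
    print(train_step(x, y))
```

Because the accumulated gradients are identical to those of an unsplit mini-batch, the same learning rate and other hyperparameters carry over regardless of how many micro-batches or partitions are used, which is the property the paragraph above refers to.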
The development team trained AmoebaNet-B, with 557 million model parameters and a 480 x 480 input image size, on Google Cloud TPUv2s. The model performed well on several popular datasets, pushing single-crop ImageNet accuracy to 84.3%, CIFAR-10 accuracy to 99%, and CIFAR-100 accuracy to 91.3%.
GPipe maximizes the memory available for model parameters. The team experimented on Google Cloud TPUv2s, each with 8 accelerator cores and 64 GB of memory (8 GB per accelerator). Without GPipe, a single accelerator can train up to 82 million model parameters due to memory limits. Thanks to re-computation during backpropagation and batch splitting, GPipe reduced intermediate activation memory from 6.26 GB to 3.46 GB, enabling a single accelerator to train 318 million parameters. In addition, with pipeline parallelism the maximum model size grows in proportion to the number of partitions, as expected. With GPipe, AmoebaNet was able to scale to 1.8 billion parameters across the 8 accelerators of a TPUv2, 25 times more than is possible without GPipe.
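As a rough illustration of where the activation savings come from, the sketch below compares peak activation storage with and without re-computation, following the complexity argument in the GPipe paper: roughly O(N·L) without it versus O(N + (N/M)·(L/K)) with it, for mini-batch size N, L layers, M micro-batches, and K partitions. The function and the abstract "units" it counts are illustrative conveniences, not measurements; the actual GB figures (such as 6.26 GB vs. 3.46 GB) depend on layer widths, sequence sizes, and data types.

```python
def peak_activation_units(N, L, M=1, K=1, rematerialize=False):
    """Peak activation storage per accelerator, in abstract units
    (one unit = one sample's activations for one layer).

    Without re-computation, every layer's activations for the whole mini-batch
    are kept for backpropagation. With re-computation, only the partition-boundary
    activations for the full mini-batch plus one micro-batch's worth of one
    partition are held at a time; the rest are recomputed during backward.
    """
    if not rematerialize:
        return N * L
    return N + (N / M) * (L / K)

# Illustrative numbers only, not the paper's measurements:
baseline  = peak_activation_units(N=128, L=64)
pipelined = peak_activation_units(N=128, L=64, M=8, K=8, rematerialize=True)
print(baseline, pipelined, baseline / pipelined)   # the saving grows with M and K
```

Since the model's parameters are also spread over the K partitions, the largest trainable model grows roughly linearly with the number of accelerators, which is consistent with the 82 million to 1.8 billion parameter scaling reported above.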
The core GPipe library is currently open source under the Lingvo framework.