The latest trend in artificial intelligence is that larger natural language models can provide better accuracy, but larger models are difficult to train due to barriers to cost, time, and code integration. Microsoft has recently open-sourced a deep learning optimization library, DeepSpeed, which can increase the scale, speed, availability, and reduce costs to train deep learning models with more than 100 billion parameters on the current generation of GPU clusters, greatly facilitating the training of large models. At the same time, system performance can be improved by more than 5 times compared to the latest technology.
According to Microsoft,
DeepSpeed excels in four aspects:
• Scale: State-of-the-art large models such as OpenAI GPT-2, NVIDIA Megatron-LM, and Google T5 have sizes of 1.5 billion, 8.3 billion, and 11 billion parameters respectively. ZeRO stage one in DeepSpeed provides system support to run models up to 100 billion parameters, 10 times bigger. In the future, we plan to add support for ZeRO stages two and three, unlocking the ability to train models with 200 billion parameters to trillions of parameters.
• Speed: We observe up to five times higher throughput over state of the art across various hardware. For example, to train large models on GPT family of workloads, DeepSpeed combines ZeRO-powered data parallelism with NVIDIA Megatron-LM model parallelism. On NVIDIA GPU clusters with low-bandwidth interconnect (without NVIDIA NVLink or Infiniband), we achieve a 3.75x throughput improvement over using Megatron-LM alone for a standard GPT-2 model with 1.5 billion parameters. On NVIDIA DGX-2 clusters with high-bandwidth interconnect, for models of 20 to 80 billion parameters, we are three to five times faster. These throughput improvements come from DeepSpeed’s higher memory efficiency and ability to fit these models using a lower degree of model parallelism and larger batch sizes.
• Cost: Improved throughput can be translated to significantly reduced training cost. For example, to train a model with 20 billion parameters, DeepSpeed requires three times fewer resources.
• Usability: Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed and ZeRO. Compared to current model parallelism libraries, DeepSpeed does not require a code redesign or model refactoring. It also does not put limitations on model dimensions (such as number of attention heads, hidden sizes, and others), batch size, or any other training parameters. For models of up to six billion parameters, you can use data parallelism (powered by ZeRO) conveniently without requiring model parallelism, while in contrast, standard data parallelism will run out of memory for models with more than 1.3 billion parameters. ZeRO stages two and three will further increase the model size trainable with data parallelism alone. In addition, DeepSpeed supports flexible combination of ZeRO-powered data parallelism with model parallelism.
The DeepSpeed source code is available on GitHub.