Gaudi2 and Sapphire Rapids Provide Excellent Performance and Cost Savings for AI Training

In January this year, Intel launched the fourth-generation Xeon Scalable processors built on the Sapphire Rapids architecture, and in May last year it introduced its second-generation deep learning accelerator, Habana Gaudi 2. The latter targets the AI domain directly, while the former received extensive optimization for AI performance. MLCommons has now announced the results of its industry AI performance benchmark, MLPerf Training 3.0, in which both Intel products posted remarkably strong training results.

Currently, the industry predominantly believes that generative AI and Large Language Models (LLMs) are best run on GPUs. Nevertheless, the latest data demonstrates that AI solutions based on Intel’s product portfolio offer highly competitive options for customers seeking to break free from the efficiency and scale limitations of a closed ecosystem.

First, consider Habana Gaudi 2. Training generative AI and large language models requires clusters of servers to meet massive computational demands, and the latest MLPerf results tangibly validate Gaudi 2’s outstanding performance and efficient scalability on the demanding 175-billion-parameter GPT-3 model.

On the GPT-3 model, Gaudi 2 achieved a training time of 311 minutes on 384 accelerators, realizing near-linear 95% scaling from 256 to 384 accelerators. It also delivered excellent training results on the computer vision models ResNet-50 (8 accelerators) and Unet3D (8 accelerators), and on the natural language processing model BERT (8 and 64 accelerators). Compared with the data submitted last November, the BERT and ResNet results improved by 10% and 4%, respectively, attesting to the growing maturity of Gaudi 2’s software, which continues to develop in step with the escalating demands of generative AI and large language models.
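To make the “95% scaling” figure concrete, the sketch below shows how scaling efficiency between two cluster sizes is commonly computed. Only the 384-accelerator time (311 minutes) comes from the result quoted above; the 256-accelerator time is a hypothetical placeholder chosen purely for illustration.

```python
# Minimal sketch of computing scaling efficiency between two cluster sizes.
# t_384 is the reported MLPerf GPT-3 time; t_256 is hypothetical, for illustration only.

def scaling_efficiency(t_small, n_small, t_large, n_large):
    """Ratio of the achieved speedup to the ideal (linear) speedup."""
    achieved_speedup = t_small / t_large   # how much faster the larger cluster finished
    ideal_speedup = n_large / n_small      # how much faster it would be with perfect scaling
    return achieved_speedup / ideal_speedup

t_384 = 311.0   # minutes, reported result on 384 accelerators
t_256 = 443.0   # minutes, assumed value on 256 accelerators (not a published figure)
eff = scaling_efficiency(t_256, 256, t_384, 384)
print(f"Scaling efficiency from 256 to 384 accelerators: {eff:.0%}")  # ~95%
```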

The fourth-generation Xeon Scalable processor, on the other hand, was the only CPU-based solution among the many submissions, and the MLPerf results indicate that it gives businesses an out-of-the-box option: they can deploy AI on general-purpose systems and avoid the high cost and complexity of introducing specialized AI infrastructure.

In MLPerf’s closed division, the fourth-generation Xeon trained the BERT and ResNet-50 models in under 50 minutes (47.93 minutes) and under 90 minutes (88.17 minutes), respectively. In the open division for the BERT model, the results show that when scaled out to 16 nodes, the fourth-generation Xeon completed training in around 30 minutes (31.06 minutes). For the larger RetinaNet model, it achieved a training time of 232 minutes on 16 nodes, letting customers flexibly use off-peak Xeon cycles to train their models; that is, training can be fit into a morning, a lunch break, or an overnight window. The fourth-generation Intel Xeon Scalable processors, equipped with Intel AMX (Advanced Matrix Extensions), deliver significant performance improvements that span multiple frameworks, end-to-end data science tools, and a broad ecosystem of intelligent solutions.
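As a rough illustration of what CPU-based training with Intel AMX can look like in practice, here is a minimal sketch using PyTorch together with the open-source Intel Extension for PyTorch. The model and data are toy placeholders, not an MLPerf workload, and this is only one common way to enable bf16 training on a fourth-generation Xeon.

```python
# Minimal sketch of bf16 training on CPU that can exploit Intel AMX.
# Assumes PyTorch plus the optional intel_extension_for_pytorch package is installed.
import torch
import intel_extension_for_pytorch as ipex

# Toy model and optimizer as placeholders for a real workload.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                            torch.nn.Linear(512, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# ipex.optimize applies CPU-specific optimizations and prepares the model for bf16.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

criterion = torch.nn.CrossEntropyLoss()
x = torch.randn(64, 512)               # toy input batch
y = torch.randint(0, 10, (64,))        # toy labels

for _ in range(10):
    optimizer.zero_grad()
    # Autocast to bf16 on CPU; on 4th Gen Xeon these ops can map onto AMX tiles.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```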

The minority of users who only intermittently train large models from scratch can do so with general-purpose CPUs, typically on the Intel-based servers they have already deployed to run their businesses. Moreover, most users will adopt pre-trained models and fine-tune them with small datasets. The results released by Intel indicate that such fine-tuning can be completed in a matter of minutes using Intel AI software and standard industry open-source software.
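A minimal sketch of that fine-tuning workflow follows, assuming the open-source Hugging Face transformers library running on a Xeon CPU. The checkpoint name, texts, and labels are illustrative placeholders rather than anything Intel published.

```python
# Minimal sketch of fine-tuning a pre-trained model on CPU with a tiny dataset,
# using the open-source Hugging Face transformers library. All data below is toy data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"   # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["great product", "terrible experience"]   # toy fine-tuning examples
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                                 # a few passes over the tiny set
    optimizer.zero_grad()
    # bf16 autocast on CPU so AMX-capable Xeons can accelerate the matmuls.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        outputs = model(**batch, labels=labels)    # loss is computed internally
    outputs.loss.backward()
    optimizer.step()
```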