Microsoft, Nvidia Launch Cloud HPC Service
Nvidia and Microsoft have joined forces to offer a cloud HPC capability based on the GPU vendor’s V100 Tensor Core chips linked via an InfiniBand network scaling up to 800 graphics processors.
The partners announced during this week’s SC19 that the GPU-accelerated supercomputer running on the Azure cloud would allow users to rent an entire “AI supercomputer” from their desktops. The new Azure cloud instances are billed as matching the performance of on-premises machines that can take months to deploy.
Microsoft said the new Azure NDv2 instances are among the fastest available for AI, machine learning and HPC workloads. Users would be able to spin up multiple cloud instances to train models in “hours,” the partners said Monday (Nov. 18).
Engineers from both companies deployed 64 cloud instances on a prototype HPC platform; the cluster was used to train a BERT conversational AI model in about three hours.
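The partners did not publish their training code, but multi-node jobs of this kind are typically expressed with PyTorch’s DistributedDataParallel over the NCCL backend, which rides on InfiniBand when it is available. Below is a minimal, hypothetical sketch; the model and data are stand-in placeholders, not BERT or the partners’ actual benchmark:

```python
# Minimal sketch of multi-GPU data-parallel training in PyTorch, the
# pattern typically used for BERT-style runs across clustered GPU nodes.
# Model, data and hyperparameters are placeholders for illustration only.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # A launcher (e.g. torchrun) sets RANK, WORLD_SIZE and LOCAL_RANK.
    dist.init_process_group(backend="nccl")  # NCCL uses InfiniBand when present
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in for BERT
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for step in range(100):
        x = torch.randn(32, 1024, device=local_rank)  # synthetic batch
        y = torch.randn(32, 1024, device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # gradients are all-reduced across every GPU in the job
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A launcher such as `torchrun --nproc_per_node=8 train.py` (or `torch.distributed.launch` on older PyTorch releases) would start one such process per GPU on each node.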
The performance benchmark was achieved using Mellanox network interconnects along with Nvidia’s CUDA-X libraries and GPU optimizations.
The GPU-accelerated cloud instances also boosted the performance of HPC workloads built on deep learning frameworks such as MXNet, PyTorch and TensorFlow, pulled from Nvidia’s container registry and the Azure marketplace.
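Inside one of those framework containers, a short sanity check can confirm the instance’s hardware is visible to the framework. A hypothetical PyTorch snippet, assuming an NDv2 instance with eight V100s:

```python
# Quick sanity check inside a framework container: confirm that all
# eight V100 GPUs on an NDv2 instance are visible to PyTorch.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPUs visible:", torch.cuda.device_count())  # expect 8 on NDv2
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```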
The container registry also includes Helm charts. Helm is an application package manager that runs on top of the Kubernetes cluster orchestrator; its charts describe how an application is structured.
The value proposition for investing in GPU-accelerated computing is driven by the huge volumes of data crunched in AI and other machine learning applications, which demand far more computing power. Hence, HPC capabilities backed by GPU accelerators and fast interconnects are shifting to the public cloud, providing broad access to raw supercomputing power.
“Working with Nvidia, Microsoft is giving customers instant access to a level of supercomputing power that was previously unimaginable,” noted Girish Bablani, Microsoft’s corporate vice president of Azure Compute.
Groups like OpenAI have been tracking the amount of computing needed to train advanced models over time. “The models are getting bigger,” Nvidia CEO Jensen Huang noted in an SC19 keynote address this week. “If the models are getting bigger, then the amount of data you need to train it has to be proportionally better, proportionally larger.
“Otherwise, [the model] would be underfed,” Huang continued. Hence, “the amount of computation is skyrocketing,” doubling every three-and-a-half months, according to OpenAI.
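As a back-of-the-envelope check, a doubling time of three and a half months compounds to roughly an 11x increase in training compute per year:

```python
# Back-of-the-envelope: how fast does training compute grow if it
# doubles every 3.5 months, per OpenAI's estimate?
doubling_months = 3.5
print(f"Growth per year: {2 ** (12 / doubling_months):.1f}x")        # ~10.8x
print(f"Growth over 5 years: {2 ** (60 / doubling_months):,.0f}x")   # ~145,000x
```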
Microsoft joins Amazon Web Services in offering cloud access to Nvidia’s Tesla V100 GPUs; the AWS configuration, announced last November, is paired with 100 Gbps of network bandwidth. Microsoft’s NDv2 cloud instances are available now in preview. Each instance includes eight V100 GPUs, and multiple instances can be clustered via the Kubernetes orchestrator, the partners said.
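The announcement does not spell out the clustering workflow. As an illustrative sketch only, assuming a reachable Kubernetes cluster whose nodes run Nvidia’s device plugin (which advertises GPUs under the `nvidia.com/gpu` resource name), the official Kubernetes Python client can report each node’s GPU capacity:

```python
# Illustrative sketch: list Kubernetes nodes and their advertised GPU
# capacity. Assumes a configured kubeconfig and nodes running the NVIDIA
# device plugin, which exposes GPUs as the "nvidia.com/gpu" resource.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpus = node.status.capacity.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} GPU(s)")  # expect 8 per NDv2 node
```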
Pricing details for the cloud HPC instances are here.

