Microsoft lifts the lid on global AI infrastructure projects

Microsoft has revealed it’s working on a new “planet-wide” scheduling system for AI workloads, called Singularity.

As explained in a technical paper released by the company, Singularity is a “novel workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads to drive high utilization without impacting their correctness or performance, across a global fleet of AI accelerators”.

In non-technical terms, this means that the system is designed to ensure that the global network of server hardware is utilized optimally, reducing the costs associated with running AI workloads.


At the heart of Singularity’s value proposition is the ability to scale jobs mid-stream, as well as move them between different facilities located around the world.

As explained in the document, a live job can be migrated to a different cluster or data center and resumed at the exact point where it left off, thus optimizing capacity utilization. It can also be elastically scaled up or down, leveraging a varying number and type of AI accelerators as needed.
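To make the idea of resuming "at the exact point where it left off" concrete, here is a minimal sketch in Python. It is not Microsoft's implementation: the `TrainingJob` class, its `checkpoint` and `resume` methods and the "cluster A" and "cluster B" labels are all hypothetical, and the optimizer update is a placeholder. It only shows that if the full job state is captured, an interrupted and migrated run ends in the same state as an uninterrupted one.

```python
import pickle


class TrainingJob:
    """Toy stand-in for a training job; `step` and `weights` are its whole state."""

    def __init__(self, step=0, weights=0.0):
        self.step = step
        self.weights = weights

    def run(self, num_steps):
        for _ in range(num_steps):
            self.weights += 0.5  # placeholder for a real optimizer update
            self.step += 1

    def checkpoint(self):
        # Serialize everything needed to continue the job elsewhere.
        return pickle.dumps({"step": self.step, "weights": self.weights})

    @classmethod
    def resume(cls, blob):
        # Rebuild the job from a snapshot, at the step where it was taken.
        state = pickle.loads(blob)
        return cls(step=state["step"], weights=state["weights"])


# Run 3 steps on hypothetical "cluster A", then snapshot the job.
job_a = TrainingJob()
job_a.run(3)
blob = job_a.checkpoint()

# Restore the snapshot on hypothetical "cluster B" and run 2 more steps.
job_b = TrainingJob.resume(blob)
job_b.run(2)

# An uninterrupted 5-step run, for comparison.
reference = TrainingJob()
reference.run(5)
```

In this toy, `job_b` and `reference` finish with identical `step` and `weights` values. The hard part in a real system is capturing that state for unmodified jobs, including what lives on the accelerators.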

The beauty of this system, according to Microsoft, is that it demands no extra work from developers: Singularity works without any changes to their code.

However, to make all of this possible, Microsoft had to find a way to decouple workloads from hardware resources. The new solution uses something the company calls a “device proxy,” which runs in its own address space and establishes a layer of separation that enables smooth reallocation of resources.
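The decoupling can be pictured with a small Python sketch. This is an illustration under loud assumptions, not the design from the paper: `FakeDevice`, `DeviceProxy` and every method on them are hypothetical, and the proxy here lives in the same process for brevity, whereas Microsoft describes its device proxy as running in its own address space. The point is only that a job holding opaque handles, rather than direct device references, can keep working after the hardware underneath is swapped.

```python
class FakeDevice:
    """Hypothetical accelerator: just a named store of buffers."""

    def __init__(self, name):
        self.name = name
        self.buffers = {}


class DeviceProxy:
    """Hands out opaque handles so the job never touches a device directly."""

    def __init__(self, device):
        self._device = device
        self._next_handle = 0

    @property
    def device_name(self):
        return self._device.name

    def alloc(self, data):
        # Store a copy of the data on the current device, return a handle.
        handle = self._next_handle
        self._next_handle += 1
        self._device.buffers[handle] = list(data)
        return handle

    def scale(self, handle, factor):
        # Stand-in for a compute kernel executed on the device.
        buf = self._device.buffers[handle]
        self._device.buffers[handle] = [x * factor for x in buf]

    def read(self, handle):
        return list(self._device.buffers[handle])

    def migrate(self, new_device):
        # Copy device state across and rebind; existing handles stay valid.
        new_device.buffers = {h: list(b) for h, b in self._device.buffers.items()}
        self._device.buffers.clear()
        self._device = new_device


gpu_a = FakeDevice("gpu-a")
gpu_b = FakeDevice("gpu-b")
proxy = DeviceProxy(gpu_a)

h = proxy.alloc([1, 2, 3])
proxy.scale(h, 2)      # runs on gpu-a
proxy.migrate(gpu_b)   # the job's handle `h` is untouched
proxy.scale(h, 10)     # same call, now served by gpu-b
result = proxy.read(h)
```

The job code before and after `migrate` is identical, which is the property that lets a scheduler move work without developer involvement.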

“Singularity achieves a significant breakthrough in scheduling deep learning workloads, converting niche features such as elasticity into mainstream, always-on features that the scheduler can rely on for implementing stringent SLAs,” Microsoft wrote in its summary.

“With new mechanisms that make unmodified tasks preemptible and resizable with negligible performance overhead, Singularity enables unprecedented levels of workload fungibility, allowing tasks to take advantage of spare capacity anywhere in the globally distributed fleet.”

Although the scheduling service is the main focus of the paper, the authors state that the system is designed to scale across a fleet of hundreds of thousands of GPUs and other AI accelerators.

TechRadar Pro has asked Microsoft when it expects Singularity to be commercially available.
