Kwai, Kuaishou and ETH Zürich propose PERSIA, a distributed training system that supports deep learning-based recommenders of up to 100 trillion parameters
Modern recommender systems have countless real-world applications and have made astonishing progress thanks to the ever-increasing size of deep neural network models, which have grown from Google’s 2016 model with 1 billion parameters to Facebook’s latest model with 12 trillion parameters. There seems to be no limit to the significant quality improvements brought about by such model scaling, and deep learning practitioners believe that the era of 100 trillion parameter systems will arrive sooner rather than later.
Training extremely large models that are both memory- and compute-intensive is difficult, however, even with the support of industrial-scale data centers. In the new paper PERSIA: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters, a research team from Kwai Inc., Kuaishou Technology and ETH Zürich proposes PERSIA, an efficient distributed training system that exploits a novel hybrid training algorithm to ensure both training efficiency and accuracy for such recommendation models. The team provides theoretical analysis and empirical studies to validate the effectiveness of PERSIA on recommender systems of up to 100 trillion parameters.
The team summarizes the main contributions of their study as follows:
- We present a natural yet novel hybrid training algorithm that handles the embedding layer and the dense neural network modules differently.
- We provide a rigorous theoretical analysis on its convergence behavior and further relate the characteristics of a recommendation model to its convergence to justify its effectiveness.
- We design a distributed system to manage hybrid computing resources (CPU and GPU) to optimize the coexistence of asynchronicity and synchronicity in the learning algorithm.
- We evaluate PERSIA using both publicly available benchmark tasks and real-world tasks at Kwai. We show that PERSIA scales efficiently and produces speedups of up to 7.12× compared to state-of-the-art approaches.
The team first proposes a new hybrid sync-async algorithm, in which the embedding module is trained asynchronously while the dense neural network is updated synchronously. This hybrid algorithm achieves hardware efficiency comparable to fully asynchronous mode without sacrificing statistical efficiency.
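The split between asynchronous and synchronous updates can be illustrated with a toy single-process simulation. This is a minimal sketch and not PERSIA's implementation: the model (`prediction = dense_w * emb[id]`), the function names `hybrid_train` and `mse`, and the `staleness` and `num_workers` parameters are all assumptions made for illustration. Gradients for the sparse embedding table are queued and applied a few steps late, standing in for asynchronous updates, while the shared dense weight is updated every step with the average of all workers' gradients.

```python
from collections import deque


def mse(samples, emb, dense_w):
    """Mean squared error of the toy model: prediction = dense_w * emb[id]."""
    return sum((dense_w * emb[i] - y) ** 2 for i, y in samples) / len(samples)


def hybrid_train(samples, num_ids, steps, lr=0.02, staleness=2, num_workers=2):
    """Toy hybrid training loop (hypothetical, for illustration only).

    Embedding rows receive delayed (stale) gradients, imitating asynchronous
    updates. The dense weight is updated synchronously with the gradient
    averaged over all simulated workers.
    """
    emb = [0.5] * num_ids      # embedding table, one scalar per ID
    dense_w = 0.5              # dense "network": a single shared weight
    pending = deque()          # embedding gradients not yet applied

    for step in range(steps):
        # Each simulated worker processes one sample per step.
        batch = [samples[(step * num_workers + k) % len(samples)]
                 for k in range(num_workers)]

        dense_grads = []
        for idx, target in batch:
            err = dense_w * emb[idx] - target
            dense_grads.append(2 * err * emb[idx])       # d loss / d dense_w
            pending.append((idx, 2 * err * dense_w))     # d loss / d emb[idx]

        # Synchronous part: average the workers' dense gradients, apply now.
        dense_w -= lr * sum(dense_grads) / num_workers

        # Asynchronous part: apply embedding gradients only once they are
        # `staleness` steps old, so later reads may see out-of-date rows.
        while len(pending) > staleness * num_workers:
            idx, grad = pending.popleft()
            emb[idx] -= lr * grad

    return emb, dense_w
```

Calling `hybrid_train([(0, 1.0), (1, 0.5), (2, 1.5)], num_ids=3, steps=2000)` and comparing `mse` before and after gives a quick check that the loss still goes down despite the delayed embedding updates. Setting `staleness=0` applies every embedding gradient immediately, which corresponds to the fully synchronous case in this toy setting.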
The team designed PERSIA (Parallel Recommendation Training System with Hybrid Acceleration) to support the aforementioned hybrid algorithm, addressing two fundamental aspects: 1) the placement of the training workflow over a heterogeneous cluster, and 2) the corresponding training procedure over the hybrid infrastructure. PERSIA comprises four modules designed to provide efficient autoscaling and support recommendation models of up to 100 trillion parameters:
- A data loader that fetches training data from distributed storage.
- An embedding parameter server (PS) that manages the storage and update of the parameters in the embedding layer.
- A group of embedding workers that retrieve embedding parameters from the embedding PS, aggregate the embedding vectors (potentially), and send the embedding gradients back to the embedding PS.
- A group of NN workers that perform forward/backward propagation of the neural network.
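The way these four modules hand data to one another can be sketched in a few lines of single-process Python. This is an illustrative sketch only: the class and function names (`EmbeddingPS`, `EmbeddingWorker`, `NNWorker`, `data_loader`, `train`), the sum-pooling aggregation, and the one-layer linear "network" are assumptions and do not reflect PERSIA's actual API or its distributed implementation.

```python
from dataclasses import dataclass, field


def data_loader(records):
    """Stand-in for the data loader: yields (ids, label) pairs from memory."""
    for ids, label in records:
        yield ids, label


@dataclass
class EmbeddingPS:
    """Stores embedding rows and applies gradient updates to them."""
    dim: int
    lr: float = 0.1
    table: dict = field(default_factory=dict)

    def lookup(self, ids):
        # Rows are created lazily (all zeros) the first time an ID is seen.
        return [self.table.setdefault(i, [0.0] * self.dim) for i in ids]

    def apply_gradients(self, ids, grads):
        for i, g in zip(ids, grads):
            row = self.table.setdefault(i, [0.0] * self.dim)
            self.table[i] = [p - self.lr * gj for p, gj in zip(row, g)]


class EmbeddingWorker:
    """Fetches rows from the PS, aggregates them, and returns gradients."""

    def __init__(self, ps):
        self.ps = ps

    def fetch(self, ids):
        rows = self.ps.lookup(ids)
        return [sum(col) for col in zip(*rows)]   # sum-pool the rows

    def push(self, ids, pooled_grad):
        # For sum-pooling, every contributing row gets the pooled gradient.
        self.ps.apply_gradients(ids, [pooled_grad] * len(ids))


class NNWorker:
    """Runs forward/backward on a one-layer linear model with squared loss."""

    def __init__(self, dim, lr=0.1):
        self.w = [1.0] * dim
        self.lr = lr

    def step(self, x, label):
        err = sum(wj * xj for wj, xj in zip(self.w, x)) - label
        grad_x = [2 * err * wj for wj in self.w]   # gradient for embeddings
        grad_w = [2 * err * xj for xj in x]        # gradient for dense weights
        self.w = [wj - self.lr * g for wj, g in zip(self.w, grad_w)]
        return err * err, grad_x


def train(records, dim=2):
    """Wire the four modules together and return the per-step losses."""
    ps = EmbeddingPS(dim=dim)
    emb_worker, nn_worker = EmbeddingWorker(ps), NNWorker(dim)
    losses = []
    for ids, label in data_loader(records):
        pooled = emb_worker.fetch(ids)
        loss, grad_x = nn_worker.step(pooled, label)
        emb_worker.push(ids, grad_x)
        losses.append(loss)
    return losses
```

For example, `train([([3, 7], 1.0)] * 5)` repeats one sample five times and returns a list of losses that shrinks from step to step. In the real system these roles run as separate processes on CPU and GPU machines; the sketch only shows which module owns which piece of state.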
The team evaluated PERSIA on three open-source benchmarks (Taobao-Ad, Avazu-Ad, and Criteo-Ad) and the real-world production micro-video recommendation workflow at Kwai. They used two state-of-the-art distributed recommendation training systems, XDL and PaddlePaddle, as baselines.
The proposed hybrid algorithm achieved much higher throughput compared to all other systems. PERSIA achieved nearly linear speedups with significantly higher throughput compared to XDL and PaddlePaddle, and 3.8 times higher throughput compared to the fully synchronous algorithm on the Kwai-video benchmark. Additionally, PERSIA demonstrated stable training throughput even when model size increased up to 100 trillion parameters, achieving 2.6 times higher throughput than fully synchronous mode.
Overall, the results show that the proposed PERSIA effectively supports the efficient and scalable training of recommendation models at a scale of up to 100 trillion parameters. The team hopes that their study and insights can benefit both academia and industry.
Author: Hecate He | Editor: Michel Sarazen