Practice of Building AI Training Cluster Based on Kubernetes+RoCEv2 - Wang DeKui & Wang Chao IEI
AI training tasks rely on RDMA communication to improve training speed. Most deployments today use InfiniBand networks, but some users have started building AI clusters on RoCEv2 networks instead. Unlike InfiniBand, RoCEv2 is based on the UDP protocol, which brings new challenges for the maintenance and use of k8s clusters. The main ones are:
1. How to integrate a lossless RoCEv2 network into the k8s network
2. How to use the RoCEv2 network inside a k8s pod
3. How to schedule resources sensibly and optimize AI training resources on nodes with multiple GPUs and multiple RoCE network cards
4. What adjustments AI training tasks themselves require
This session explains some practices for building AI training clusters based on k8s+RoCEv2. We present solutions to the challenges above, including a k8s cluster network solution, network card virtualization, lossless RoCE network configuration, and running training tasks on RoCEv2+k8s.
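As a rough illustration of challenges 2 to 4, the sketch below builds a pod manifest that requests GPUs and RoCE devices together. It is not the speakers' solution: it assumes a Multus-style secondary network annotation, a device plugin that advertises RoCE virtual functions as an extended resource, and NCCL as the communication library. The attachment name `roce-net`, the resource name `example.com/roce_vf`, and the GID index value are hypothetical placeholders that depend on the actual cluster and NIC.

```python
import json


def build_training_pod(name, image, gpus=8, rdma_nics=8,
                       network_attachment="roce-net",        # hypothetical attachment name
                       rdma_resource="example.com/roce_vf",  # hypothetical resource name
                       gid_index=3):                         # assumed RoCEv2 GID index; NIC-specific
    """Return a Pod manifest (as a dict) requesting GPUs and RoCE devices."""
    # One secondary network attachment per requested RoCE device.
    networks = ",".join([network_attachment] * rdma_nics)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Annotation read by Multus to attach secondary interfaces.
            "annotations": {"k8s.v1.cni.cncf.io/networks": networks},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "trainer",
                "image": image,
                "env": [
                    # Tells NCCL which GID table entry to use on the RDMA device.
                    {"name": "NCCL_IB_GID_INDEX", "value": str(gid_index)},
                ],
                "resources": {"limits": {
                    "nvidia.com/gpu": gpus,
                    rdma_resource: rdma_nics,
                }},
                # RDMA applications pin memory, which typically needs IPC_LOCK.
                "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
            }],
        },
    }


if __name__ == "__main__":
    pod = build_training_pod("train-worker-0", "example/trainer:latest")
    print(json.dumps(pod, indent=2))
```

Requesting GPUs and RoCE devices in the same `limits` block lets the scheduler place the pod only on nodes that have both available; topology-aware pairing of a GPU with its nearest NIC would need additional mechanisms not shown here.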