Distributed Computing For Training Large-Scale AI Models in .NET Clusters

Authors

  • Rajashree Manjulalayam Rajendran, HomeASAP LLC, USA

Keywords

Distributed Computing, Large-Scale AI Models, .NET Clusters, Parallel Computing, Azure Service Fabric, Akka.NET

Abstract

Distributed computing plays a pivotal role in training large-scale AI models, enabling computations to be parallelized across multiple nodes within a cluster. This paper explores the integration of distributed computing techniques within .NET clusters for efficient and scalable training of AI models. The .NET ecosystem, with its versatile and extensible framework, provides a robust foundation for developing distributed computing solutions. The paper begins by outlining the challenges associated with training large-scale AI models and the need for distributed computing to address computational bottlenecks. It then examines the architectural considerations for implementing distributed computing in .NET clusters, emphasizing technologies such as Microsoft's Azure Service Fabric and third-party frameworks like Akka.NET. The proposed solution leverages the inherent capabilities of .NET for building distributed systems, allowing seamless communication and coordination among cluster nodes. Key aspects such as data parallelism, model parallelism, and asynchronous communication are explored to harness the full potential of distributed computing for AI model training. A case study demonstrates the practical implementation of the proposed solution in a real-world scenario. Performance metrics, scalability analysis, and comparisons with traditional single-node training showcase the advantages of distributed training of large-scale AI models in .NET clusters.
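
To make the abstract's approach concrete, the sketch below illustrates synchronous data-parallel training coordinated through Akka.NET actors, one of the frameworks named above. It is not the paper's implementation: the type names (GradientWorker, ComputeGradient, GradientResult, TrainStepAsync), the toy gradient computation, and the learning rate are hypothetical placeholders; only the general scatter/gather pattern of asynchronous Ask requests reflects the data-parallel, asynchronous-communication approach described in the abstract.

```csharp
// Illustrative sketch only, not the paper's code. Type names, the toy gradient,
// and the learning rate are hypothetical placeholders.
using System;
using System.Linq;
using System.Threading.Tasks;
using Akka.Actor;

// Messages exchanged between the coordinator and worker actors.
public sealed record ComputeGradient(double[] Weights, double[][] Shard);
public sealed record GradientResult(double[] Gradient);

// Worker actor: computes a local gradient on its own data shard (data parallelism).
public sealed class GradientWorker : ReceiveActor
{
    public GradientWorker()
    {
        Receive<ComputeGradient>(msg =>
        {
            // Toy stand-in for a real forward/backward pass over msg.Shard.
            var gradient = msg.Weights.Select(w => 0.01 * w).ToArray();
            Sender.Tell(new GradientResult(gradient));
        });
    }
}

public static class TrainingDriver
{
    // One synchronous data-parallel step: scatter the current weights, gather
    // local gradients asynchronously, average them, and apply a single update.
    public static async Task<double[]> TrainStepAsync(
        IActorRef[] workers, double[] weights, double[][][] shards, double learningRate = 0.1)
    {
        var requests = workers
            .Select((worker, i) => worker.Ask<GradientResult>(new ComputeGradient(weights, shards[i])))
            .ToArray();
        var results = await Task.WhenAll(requests); // asynchronous communication among nodes

        var averaged = new double[weights.Length];
        foreach (var result in results)
            for (var j = 0; j < weights.Length; j++)
                averaged[j] += result.Gradient[j] / results.Length;

        return weights.Select((w, j) => w - learningRate * averaged[j]).ToArray();
    }

    public static async Task Main()
    {
        var system = ActorSystem.Create("training-cluster");
        var workers = Enumerable.Range(0, 4)
            .Select(i => system.ActorOf(Props.Create(() => new GradientWorker()), $"worker-{i}"))
            .ToArray();

        var weights = new[] { 1.0, -2.0, 0.5 };
        var shards = Enumerable.Range(0, 4)
            .Select(_ => new[] { new[] { 0.0, 0.0, 0.0 } }) // dummy data shards
            .ToArray();

        weights = await TrainStepAsync(workers, weights, shards);
        Console.WriteLine(string.Join(", ", weights));

        await system.Terminate();
    }
}
```

Each worker actor owns one data shard, gradients are requested concurrently, and the coordinator averages them before applying a single update; model parallelism would instead partition the model itself (for example, by layer) across actors, with activations exchanged between them.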

Published

20-01-2024

How to Cite

[1]
R. Manjulalayam Rajendran, “Distributed Computing For Training Large-Scale AI Models in .NET Clusters”, J. Computational Intel. & Robotics, vol. 4, no. 1, pp. 64–78, Jan. 2024.