Distributed Computing For Training Large-Scale AI Models in .NET Clusters

Authors

  • Rajashree Manjulalayam Rajendran, HomeASAP LLC, USA

Keywords

Distributed Computing, Large-Scale AI Models, .NET Clusters, Parallel Computing, Azure Service Fabric, Akka.NET

Abstract

Distributed computing plays a pivotal role in training large-scale AI models, enabling computations to be parallelized across multiple nodes within a cluster. This paper explores the integration of distributed computing techniques within .NET clusters for efficient and scalable training of AI models. The .NET ecosystem, with its versatile and extensible framework, provides a robust foundation for developing distributed computing solutions. The paper begins by outlining the challenges associated with training large-scale AI models and the need for distributed computing to address computational bottlenecks. It then examines the architectural considerations for implementing distributed computing in .NET clusters, emphasizing technologies such as Microsoft's Azure Service Fabric and third-party frameworks like Akka.NET. The proposed solution leverages the inherent capabilities of .NET for building distributed systems, allowing seamless communication and coordination among cluster nodes. Key aspects such as data parallelism, model parallelism, and asynchronous communication are explored to harness the full potential of distributed computing for AI model training. A case study demonstrates the practical implementation of the proposed solution in a real-world scenario. Performance metrics, scalability analysis, and comparisons with traditional single-node training showcase the advantages of distributed training of large-scale AI models in .NET clusters.
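
To make the abstract's approach concrete, the sketch below illustrates synchronous data-parallel training coordinated through Akka.NET actors, one of the frameworks named above. It is not the paper's implementation: the type names (GradientWorker, ComputeGradient, GradientResult, TrainStepAsync), the toy gradient computation, and the learning rate are hypothetical placeholders; only the general scatter/gather pattern of asynchronous Ask requests reflects the data-parallel, asynchronous-communication approach described in the abstract.

```csharp
// Illustrative sketch only, not the paper's code. Type names, the toy gradient,
// and the learning rate are hypothetical placeholders.
using System;
using System.Linq;
using System.Threading.Tasks;
using Akka.Actor;

// Messages exchanged between the coordinator and worker actors.
public sealed record ComputeGradient(double[] Weights, double[][] Shard);
public sealed record GradientResult(double[] Gradient);

// Worker actor: computes a local gradient on its own data shard (data parallelism).
public sealed class GradientWorker : ReceiveActor
{
    public GradientWorker()
    {
        Receive<ComputeGradient>(msg =>
        {
            // Toy stand-in for a real forward/backward pass over msg.Shard.
            var gradient = msg.Weights.Select(w => 0.01 * w).ToArray();
            Sender.Tell(new GradientResult(gradient));
        });
    }
}

public static class TrainingDriver
{
    // One synchronous data-parallel step: scatter the current weights, gather
    // local gradients asynchronously, average them, and apply a single update.
    public static async Task<double[]> TrainStepAsync(
        IActorRef[] workers, double[] weights, double[][][] shards, double learningRate = 0.1)
    {
        var requests = workers
            .Select((worker, i) => worker.Ask<GradientResult>(new ComputeGradient(weights, shards[i])))
            .ToArray();
        var results = await Task.WhenAll(requests); // asynchronous communication among nodes

        var averaged = new double[weights.Length];
        foreach (var result in results)
            for (var j = 0; j < weights.Length; j++)
                averaged[j] += result.Gradient[j] / results.Length;

        return weights.Select((w, j) => w - learningRate * averaged[j]).ToArray();
    }

    public static async Task Main()
    {
        var system = ActorSystem.Create("training-cluster");
        var workers = Enumerable.Range(0, 4)
            .Select(i => system.ActorOf(Props.Create(() => new GradientWorker()), $"worker-{i}"))
            .ToArray();

        var weights = new[] { 1.0, -2.0, 0.5 };
        var shards = Enumerable.Range(0, 4)
            .Select(_ => new[] { new[] { 0.0, 0.0, 0.0 } }) // dummy data shards
            .ToArray();

        weights = await TrainStepAsync(workers, weights, shards);
        Console.WriteLine(string.Join(", ", weights));

        await system.Terminate();
    }
}
```

Each worker actor owns one data shard, gradients are requested concurrently, and the coordinator averages them before applying a single update; model parallelism would instead partition the model itself (for example, by layer) across actors, with activations exchanged between them.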

Published

20-01-2024

How to Cite

[1]
R. Manjulalayam Rajendran, “Distributed Computing For Training Large-Scale AI Models in .NET Clusters”, J. Computational Intel. & Robotics, vol. 4, no. 1, pp. 64–78, Jan. 2024.