Theme 3: Fine-grained Communication and Coordination

The trend toward data centers with millions of accelerators that support a wide variety of heterogeneous functionality will accelerate quickly, introducing new networking challenges. We envision hierarchical, flexible, and reconfigurable network topologies that leverage accelerators for protocol and infrastructure tasks. This evolvable hardware will be matched with an evolvable communication software stack that specializes to the available accelerators, substantially reducing the Datacenter Tax. Furthermore, to prevent accelerators from sitting idle, either because the scheduler fails to assign them work or because they are waiting for remote data to arrive, we envision a novel runtime that bundles computation into small buckets and ships them to where the data lives. Moreover, accelerators in network switches and SmartNICs use their unique vantage points to perform a variety of computations, further improving efficiency. Finally, we propose new memory and accelerator designs to speed up geo-distributed data stores, since they are a fundamental component of the cloud infrastructure.

WAN switches, datacenter switches, and SmartNICs can perform in-network computing (Courtesy of Manya Ghobadi).
Papers and Presentations:
 
Homunculus: Auto-Generating Efficient Data-Plane ML Pipelines for Datacenter Networks
Tushar Swamy, Annus Zulfiqar, Luigi Nardi, Muhammad Shahbaz, Kunle Olukotun
ASPLOS 2023, March 2023
 
ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna
2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2023
10.1109/ISPASS57527.2023.00035
 
The Slow Path Needs an Accelerator Too!
Annus Zulfiqar, Ben Pfaff, William Tu, Gianni Antichi, Muhammad Shahbaz
ACM SIGCOMM Computer Communication Review, Vol 53, Issue 1, April 2023
10.1145/3594255.3594259
 
eZNS: An Elastic Zoned Namespace for Commodity ZNS SSDs
Jaehong Min, Chenxingyu Zhao, Ming Liu, Arvind Krishnamurthy
Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation, April 2023
 
Electrode: Accelerating Distributed Protocols with eBPF
Yang Zhou, Zezhou Wang, Sowmya Dharanipragada, Minlan Yu
Proceedings of the 20th USENIX Symposium on Network Systems Design and Implementation, April 2023
 
RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure
Mark Zhao, Dhruv Choudhary, Devashish Tyagi, Ajay Somani, Max Kaplan, Sung-Han Lin, Sarunya Pumma, Jongsoo Park, Aarti Basant, Niket Agarwal, Carole-Jean Wu, and Christos Kozyrakis
Proceedings of the 6th Conference on Machine Learning and Systems, June 2023
arXiv:2211.05239
 
Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training
Mark Zhao, Satadru Pan, Niket Agarwal, Zhaoduo Wen, David Su, Anand Natarajan, Pavan Kumar, Shiva Shankar P, Ritesh Tijoriwala, Karan Asher, Hao Wu, Aarti Basant, Daniel Ford, Deli David, Nezih Yigitbasi, Pratap Singh, Carole-Jean Wu, Christos Kozyrakis
Proceedings of the 2023 USENIX Annual Technical Conference, July 2023
 
Cloud-Native 5G Mobile Core
Jingqi Huang, Bilal Saleem, Jiayi Meng, Iftekhar Alam, Ajay Thakur, Christian Maciocco, Muhammad Shahbaz, and Y. Charlie Hu
SRC TechCon September 2023
 
Direct Telemetry Access
Jonatan Langlet, Ran Ben Basat, Gabriele Oliaro, Michael Mitzenmacher, Minlan Yu, Gianni Antichi
ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference, September 2023
10.1145/3603269.3604827
 
Dissecting Overheads of Service Mesh Sidecars
Xiangfeng Zhu, Guozhen She, Bowen Xue, Yu Zhang, Yongsu Zhang, Xuan Kelvin Zou, XiongChun Duan, Peng He, Arvind Krishnamurthy, Matthew Lentz, Danyang Zhuo, Ratul Mahajan
SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing, October 2023
10.1145/3620678.3624652
 
Modeling and Generating Control-Plane Traffic for Cellular Networks
Jiayi Meng, Jingqi Huang, Y. Charlie Hu, Yaron Koral, Xiaojun Lin, Muhammad Shahbaz, Abhigyan Sharma
IMC '23: Proceedings of the 2023 ACM on Internet Measurement Conference, October 2023
 
Anticipatory Resource Allocation for ML Training
Tapan Chugh, Srikanth Kandula, Arvind Krishnamurthy, Ratul Mahajan, Ishai Menache
 
On-Fiber Photonic Computing
Mingran Yang, Zhizhen Zhong, Manya Ghobadi
HotNets ’23, November 2023
10.1145/3626111.3628177
 
Optimal Oblivious Routing with Concave Objectives for Structured Networks
Kanatip Chitavisutthivong, Sucha Supittayapornpong, Pooria Namyar, Mingyang Zhang, Minlan Yu, Ramesh Govindan
IEEE/ACM Transactions on Networking, December 2023
10.1109/TNET.2023.3264632