Theme 3: Fine-grained Communication and Coordination

The trend toward data centers with millions of accelerators supporting a wide variety of heterogeneous functionality will quickly accelerate, introducing new networking challenges. We envision hierarchical, flexible, and reconfigurable network topologies that leverage accelerators for protocol and infrastructure tasks. This evolvable hardware will be matched with an evolvable communication software stack that specializes to the available accelerators, substantially reducing the Datacenter Tax. Furthermore, to prevent accelerators from sitting idle, either because the scheduler fails to assign them work or because they are waiting for remote data to arrive, we envision a novel runtime that bundles computation into small buckets and ships them to where the data lives. Moreover, accelerators in network switches and SmartNICs will use their unique vantage points to perform a variety of computations, further improving efficiency. Finally, we propose new memory and accelerator designs to speed up geo-distributed data stores, since they are a fundamental component of cloud infrastructure.

WAN switches, datacenter switches, and SmartNICs can perform in-network computing (Courtesy of Manya Ghobadi).

Papers and Presentations:

Homunculus: Auto-Generating Efficient Data-Plane ML Pipelines for Datacenter Networks

Tushar Swamy, Annus Zulfiqar, Luigi Nardi, Muhammad Shahbaz, Kunle Olukotun

ASPLOS 2023, March 2023

10.1145/3582016.3582022

 

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna

2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2023

10.1109/ISPASS57527.2023.00035

 

The Slow Path Needs an Accelerator Too!

Annus Zulfiqar, Ben Pfaff, William Tu, Gianni Antichi, Muhammad Shahbaz

ACM SIGCOMM Computer Communication Review, Vol 53, Issue 1, April 2023

10.1145/3594255.3594259

 

eZNS: An Elastic Zoned Namespace for Commodity ZNS SSDs

Jaehong Min, Chenxingyu Zhao, Ming Liu, Arvind Krishnamurthy

Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation, April 2023

 

Electrode: Accelerating Distributed Protocols with eBPF

Yang Zhou, Zezhou Wang, Sowmya Dharanipragada, Minlan Yu

Proceedings of the 20th USENIX Symposium on Network Systems Design and Implementation, April 2023

 

RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure

Mark Zhao, Dhruv Choudhary, Devashish Tyagi, Ajay Somani, Max Kaplan, Sung-Han Lin, Sarunya Pumma, Jongsoo Park, Aarti Basant, Niket Agarwal, Carole-Jean Wu, and Christos Kozyrakis

Proceedings of the 6th Conference on Machine Learning and Systems, June 2023

arXiv:2211.05239

 

Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training

Mark Zhao, Satadru Pan, Niket Agarwal, Zhaoduo Wen, David Su, Anand Natarajan, Pavan Kumar, Shiva Shankar P, Ritesh Tijoriwala, Karan Asher, Hao Wu, Aarti Basant, Daniel Ford, Delia David, Nezih Yigitbasi, Pratap Singh, Carole-Jean Wu, Christos Kozyrakis

Proceedings of the 2023 USENIX Annual Technical Conference, July 2023

 

Cloud-Native 5G Mobile Core

Jingqi Huang, Bilal Saleem, Jiayi Meng, Iftekhar Alam, Ajay Thakur, Christian Maciocco, Muhammad Shahbaz, and Y. Charlie Hu

SRC TechCon, September 2023

 

Direct Telemetry Access

Jonatan Langlet, Ran Ben Basat, Gabriele Oliaro, Michael Mitzenmacher, Minlan Yu, Gianni Antichi

ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference, September 2023

10.1145/3603269.3604827

 

 

Dissecting Overheads of Service Mesh Sidecars

Xiangfeng Zhu, Guozhen She, Bowen Xue, Yu Zhang, Yongsu Zhang, Xuan Kelvin Zou, XiongChun Duan, Peng He, Arvind Krishnamurthy, Matthew Lentz, Danyang Zhuo, Ratul Mahajan

SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing, October 2023

10.1145/3620678.3624652

 

Modeling and Generating Control-Plane Traffic for Cellular Networks

Jiayi Meng, Jingqi Huang, Y. Charlie Hu, Yaron Koral, Xiaojun Lin, Muhammad Shahbaz, Abhigyan Sharma

IMC '23: Proceedings of the 2023 ACM on Internet Measurement Conference, October 2023

10.1145/3618257.3624808

 

Anticipatory Resource Allocation for ML Training

Tapan Chugh, Srikanth Kandula, Arvind Krishnamurthy, Ratul Mahajan, Ishai Menache

SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing, October 2023

10.1145/3620678.3624669

 

On-Fiber Photonic Computing

Mingran Yang, Zhizhen Zhong, Manya Ghobadi

HotNets ’23, November 2023

10.1145/3626111.3628177

 

Optimal Oblivious Routing with Concave Objectives for Structured Networks

Kanatip Chitavisutthivong, Sucha Supittayapornpong, Pooria Namyar, Mingyang Zhang, Minlan Yu, Ramesh Govindan

IEEE/ACM Transactions on Networking, December 2023

10.1109/TNET.2023.3264632