SpiderFlow: Efficient Topology-Aware Scheduling for LLM Training Across Decentralized GPU Clusters

Zihan Chang; Shuibing He; Bo Zhou; Sheng Xiao; Siling Yang; Rui Wang; Zhe Pan

Abstract

In response to the increasing demand for large-scale machine learning training jobs, many organizations have deployed GPU clusters across geographically distributed regions. However, existing ILP- or genetic-based cross-cluster training approaches largely overlook the topology of decentralized clusters, lacking both topology-aware task scheduling mechanisms and automated model parallelization strategies. As a result, naively applying these optimization-based methods in cross-cluster settings leads to prohibitive scheduling overhead, due to the drastically enlarged search space induced by complex inter-cluster topologies. To address these challenges, we propose SpiderFlow, a topology-aware scheduling system specifically designed for decentralized GPU clusters. We formulate cross-cluster task scheduling as a graph optimization problem and introduce SpinSearch, a low-overhead topology-aware scheduling algorithm. In addition, for automated model parallelization, we propose TPA, a two-level scheduling framework that combines heuristic methods at the inter-cluster level with ILP-based optimization within clusters, effectively reducing the search space while maintaining high training throughput with substantially lower scheduling overhead. We evaluate SpiderFlow on a physical platform comprising 8 decentralized clusters, as well as on a simulation platform with up to 64 decentralized clusters. Experimental results demonstrate that SpiderFlow reduces job completion time (JCT) by 1.2-1.3×, improves throughput by 1.12-1.25×, and reduces scheduling overhead by 20-90× on average compared to state-of-the-art scheduling systems.
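
The two-level idea described in the abstract (a cheap heuristic across clusters, a more exact search inside each cluster) can be illustrated with a toy sketch. This is not SpinSearch or TPA, whose details are not given on this page. Everything below is an assumption made for illustration: the function names `pick_clusters` and `best_split`, the bandwidth map keyed by cluster pairs, the greedy "fastest link first" rule, and the exhaustive search that stands in for an ILP solver.

```python
def pick_clusters(gpus, bandwidth, need):
    """Inter-cluster level (hypothetical greedy heuristic).

    Start from the largest cluster and repeatedly add the unchosen cluster
    with the fastest link into the chosen set, until `need` GPUs are covered.
    gpus: {cluster: gpu_count}; bandwidth: {frozenset({a, b}): link_speed}.
    """
    start = max(gpus, key=gpus.get)
    chosen = [start]
    total = gpus[start]
    while total < need:
        best = None  # (link_speed, cluster)
        for c in gpus:
            if c in chosen:
                continue
            # Best link from candidate c to any already chosen cluster.
            link = max(bandwidth.get(frozenset((c, s)), 0.0) for s in chosen)
            if link > 0 and (best is None or link > best[0]):
                best = (link, c)
        if best is None:
            raise ValueError("not enough reachable GPUs")
        chosen.append(best[1])
        total += gpus[best[1]]
    return chosen


def best_split(n_gpus, cost):
    """Intra-cluster level (exhaustive search standing in for an ILP).

    Enumerate every (tensor_parallel, data_parallel) pair whose product is
    n_gpus and return the pair with the lowest cost under `cost(tp, dp)`.
    """
    options = [(tp, n_gpus // tp) for tp in range(1, n_gpus + 1) if n_gpus % tp == 0]
    return min(options, key=lambda o: cost(*o))


# Toy topology: three clusters with made-up GPU counts and link speeds.
gpus = {"A": 8, "B": 4, "C": 8}
bandwidth = {
    frozenset(("A", "B")): 10.0,
    frozenset(("A", "C")): 1.0,
    frozenset(("B", "C")): 5.0,
}
selected = pick_clusters(gpus, bandwidth, need=12)
# Made-up cost model: tensor parallelism costs 1 per degree, data parallelism 2.
split = best_split(8, lambda tp, dp: tp + 2 * dp)
```

The point of the split is that the outer heuristic only ever looks at a handful of clusters and links, while the expensive search runs on one cluster at a time, so the combined search space stays much smaller than a joint optimization over all GPUs.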

Anthology ID: 2026.acl-long.619
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 13603–13615
URL: https://aclanthology.org/2026.acl-long.619/
Cite (ACL): Zihan Chang, Shuibing He, Bo Zhou, Sheng Xiao, Siling Yang, Rui Wang, and Zhe Pan. 2026. SpiderFlow: Efficient Topology-Aware Scheduling for LLM Training Across Decentralized GPU Clusters. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13603–13615, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): SpiderFlow: Efficient Topology-Aware Scheduling for LLM Training Across Decentralized GPU Clusters (Chang et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.619.pdf
Checklist: 2026.acl-long.619.checklist.pdf
