PMID- 32287013
OWN - NLM
STAT- PubMed-not-MEDLINE
DCOM- 20210928
LR  - 20210928
IS  - 2162-2388 (Electronic)
IS  - 2162-237X (Linking)
VI  - 32
IP  - 3
DP  - 2021 Mar
TI  - LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and
      Communication-Efficient Distributed Learning.
PG  - 962-974
LID - 10.1109/TNNLS.2020.2979762 [doi]
AB  - Gradient-based distributed learning in parameter server (PS) computing
      architectures is subject to random delays due to straggling worker nodes
      and to possible communication bottlenecks between PS and workers.
      Solutions have been recently proposed to separately address these
      impairments based on the ideas of gradient coding (GC), worker grouping,
      and adaptive worker selection. This article provides a unified analysis
      of these techniques in terms of wall-clock time, communication, and
      computation complexity measures. Furthermore, in order to combine the
      benefits of GC and grouping in terms of robustness to stragglers with
      the communication and computation load gains of adaptive selection,
      novel strategies, named lazily aggregated GC (LAGC) and grouped-LAG
      (G-LAG), are introduced. Analysis and results show that G-LAG provides
      the best wall-clock time and communication performance while maintaining
      a low computational cost, for two representative distributions of the
      computing times of the worker nodes.
FAU - Zhang, Jingjing
AU  - Zhang J
FAU - Simeone, Osvaldo
AU  - Simeone O
LA  - eng
PT  - Journal Article
PT  - Research Support, Non-U.S. Gov't
DEP - 20210301
PL  - United States
TA  - IEEE Trans Neural Netw Learn Syst
JT  - IEEE transactions on neural networks and learning systems
JID - 101616214
SB  - IM
EDAT- 2020/04/15 06:00
MHDA- 2020/04/15 06:01
CRDT- 2020/04/15 06:00
PHST- 2020/04/15 06:00 [pubmed]
PHST- 2020/04/15 06:01 [medline]
PHST- 2020/04/15 06:00 [entrez]
AID - 10.1109/TNNLS.2020.2979762 [doi]
PST - ppublish
SO  - IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):962-974. doi:
      10.1109/TNNLS.2020.2979762. Epub 2021 Mar 1.