Finally, we follow previous studies (Humbatova et al., 2020; Lou et al., 2020; Chen et al., 2021) and exclude questions that do not have an accepted answer, ensuring that we consider only questions with a confirmed answer. By studying frequent topics in how-to questions about distributed training, we can summarize the common difficulties that developers face when training DL models in a distributed way. Workflow. A distributed training task should first be partitioned so that it can run in parallel on different devices (Step 1). The two predominant parallelization strategies are data parallelism and model parallelism (Mayer and Jacobsen, 2020). In data parallelism, the training data is split into non-overlapping chunks, and these chunks are then fed into different devices, each of which loads an identical copy of the DL model (Krizhevsky et al., 2012). In model parallelism, the DL model itself is split, and each device loads a different part of the model for training (Dean et al., 2012). Through data/model parallelism, the training data and the DL model are distributed across different devices. If a post raises a how-to question (e.g., asking how to implement a specific distributed training task or inquiring about conceptual knowledge of distributed training), it is labeled with only the how-to topic.
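The contrast between the two partitioning strategies can be illustrated with a minimal sketch. This is a toy illustration of ours, not code from any framework: plain Python lists stand in for devices, data samples, and model layers, and the function names are our own.

```python
def data_parallel_partition(samples, num_devices):
    """Data parallelism: split the samples into non-overlapping chunks,
    one per device; every device holds an identical copy of the model."""
    return [samples[i::num_devices] for i in range(num_devices)]

def model_parallel_partition(layers, num_devices):
    """Model parallelism: split the model's layers across devices;
    every device sees the data but holds only part of the model."""
    chunk = -(-len(layers) // num_devices)  # ceiling division
    return [layers[i:i + chunk] for i in range(0, len(layers), chunk)]

samples = list(range(8))                     # toy dataset of 8 samples
layers = ["embed", "enc1", "enc2", "head"]   # toy 4-layer model

data_chunks = data_parallel_partition(samples, num_devices=2)
model_parts = model_parallel_partition(layers, num_devices=2)
# data_chunks: two disjoint halves of the data, full model on each device
# model_parts: two disjoint halves of the layers, full data on each device
```

In real frameworks the same split is performed by utilities such as distributed data samplers or pipeline/model-sharding APIs, but the partitioning principle is the one sketched above.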
Most of these projects are stored in version control systems, and there are discussions about them on Question & Answer websites. Note that there are two kinds of faults during distributed training: distributed-specific faults, which are caused by distributed-specific reasons (e.g., communication failure and invalid data partition), and non-distributed-specific faults, which are caused by non-distributed-specific reasons (e.g., wrong type of input data). To show the whole picture of fault symptoms in distributed training, in RQ2, we construct our taxonomy based on both kinds of faults. Otherwise, a post with a clear fault description is labeled with whether the fault is distributed-specific, the fault symptom, and the fix pattern. Specifically, we jointly read each post and exclude posts that (1) do not have clear descriptions or solutions, (2) fix a bug in the framework itself rather than in a distributed training program, or (3) are not related to distributed training. Specifically, we randomly select another 500 posts to perform the evaluation and also identify new keywords from them. We repeat the above evaluation process four times, until the keyword set achieves a recall of 90%. Note that we do not consider the precision of these keywords here, since any misidentified post is filtered out during the refining process in Section 3.1.3 and does not threaten the validity of our results. Since Horovod is specifically designed for distributed training, we take all GitHub issues in its repository into consideration, regardless of which labels they are marked with.
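The keyword-evaluation loop above boils down to computing the recall of a keyword set over a manually labeled sample of posts. The following is a minimal sketch of ours (not the authors' actual script); the example keywords and posts are invented for illustration.

```python
def matches(post_text, keywords):
    """A post is identified if any keyword occurs in its text."""
    text = post_text.lower()
    return any(kw in text for kw in keywords)

def recall(labeled_posts, keywords):
    """Fraction of the truly relevant posts that the keywords identify.
    labeled_posts: list of (text, is_distributed_training_related)."""
    relevant = [text for text, rel in labeled_posts if rel]
    if not relevant:
        return 0.0
    hits = sum(1 for text in relevant if matches(text, keywords))
    return hits / len(relevant)

# Hypothetical keyword set and labeled sample.
keywords = {"distributed", "horovod", "allreduce", "data parallel"}
sample = [
    ("How to run Horovod on two nodes?", True),
    ("Distributed training hangs after epoch 1", True),
    ("How to plot a loss curve?", False),
    ("NCCL error during multi-GPU run", True),  # missed by the keywords
]
r = recall(sample, keywords)  # 2 of the 3 relevant posts are matched
```

If `r` falls below the target (90% in the study), new keywords are mined from the missed posts and the evaluation is repeated on a fresh sample; false positives are tolerable because they are removed later by manual refinement.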
To characterize developers’ issues in distributed training, we collect and analyze relevant SO questions and GitHub issues. Next, we evaluate the recall level of these keywords, i.e., the percentage of the distributed-training-related posts that can be identified by these keywords. Table 1 shows the labels used and the number of identified GitHub issues in each repository. Specifically, they read all the posts carefully to understand their context and assign each post a set of labels describing (1) the how-to topic, which describes the how-to question briefly, (2) whether the fault is specific to distributed training, (3) the fault symptom, which shows what the fault looks like, and (4) the fix pattern, which tells how a fault is resolved. By constructing a comprehensive taxonomy of fault symptoms related to distributed DL, we present frequent fault symptoms neglected by previous work. Given the surging importance of distributed training in the current practice of developing DL software, this paper fills in the knowledge gap and presents the first comprehensive study on developers’ issues in distributed training.
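The four-part labeling schema can be made concrete with a small sketch. The class and field names below are ours, chosen to mirror the description in the text; they are not code from the study.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PostLabels:
    """Labels assigned to one post during manual analysis (our sketch)."""
    how_to_topic: Optional[str] = None         # set only for how-to questions
    distributed_specific: Optional[bool] = None  # is the fault distributed-specific?
    fault_symptom: Optional[str] = None        # what the fault looks like
    fix_pattern: Optional[str] = None          # how the fault was resolved

# A how-to question gets only a topic; a fault post gets the other three labels.
how_to = PostLabels(how_to_topic="configure multi-node communication")
fault = PostLabels(distributed_specific=True,
                   fault_symptom="training hang",
                   fix_pattern="fix communication configuration")
```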
To fill in the knowledge gap, this paper presents the first comprehensive study on developers’ issues in distributed training of DL software. In addition, on GitHub, framework vendors employ repository-specific keywords to label different types of GitHub issues, such as bug reports, feature requests, and users’ questions. Following previous work (Franco et al., 2017; Chen et al., 2021), we leverage these labels of GitHub issues to help us identify relevant developers’ issues. We focus our study on the three most popular DL frameworks that support distributed training (i.e., TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2019), and Keras (ker, 2021)) and a widely used DL framework specifically designed for distributed training (i.e., Horovod (Sergeev and Balso, 2018)) to construct the dataset of our interest. For example, some developers find it difficult to configure communication between the multiple devices involved in distributed training (com, 2017) and complain that they cannot achieve the expected training speedup (low, 2020). Moreover, some developers report that training may get stuck due to the drop-out of involved devices (dro, 2021). Unfortunately, as mentioned before, these developers’ issues have not been comprehensively uncovered and well characterized in existing studies. To train powerful DL models with large datasets efficiently, it has become a common practice for developers to parallelize and distribute the computation and memory of the training process over multiple devices, which is known as distributed training.
Distributed training is a subfield of DL that parallelizes and distributes the computation and memory of training across multiple devices, e.g., GPUs, TPUs, and server machines. Some of these faults share common symptoms (e.g., out of memory) although they are caused by different reasons. Next, every device trains its own model with the data allocated to it (Step 2). During this process, the devices communicate with each other to transfer essential data and to synchronize the training progress. To increase the accuracy of DL models, on one hand, a substantial amount of training data is required; on the other hand, sophisticated DL model architectures such as BERT (Devlin et al., 2019) have emerged.
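The synchronization step can be illustrated with a toy sketch of gradient averaging, the effect that an allreduce operation achieves in frameworks such as Horovod. This is pure Python of ours, with lists standing in for per-device gradient tensors.

```python
def allreduce_mean(per_device_grads):
    """Average the gradient vectors computed independently on each device,
    so every device applies the same update (toy stand-in for allreduce)."""
    n = len(per_device_grads)
    return [sum(component) / n for component in zip(*per_device_grads)]

# Gradients computed by two devices on their own data chunks (made-up values).
grads = [[0.2, -0.4],
         [0.4, 0.0]]
synced = allreduce_mean(grads)  # both devices now step with the same gradient
```

In a real data-parallel run, this averaging happens after every backward pass; its communication cost is one reason the expected speedup is not always achieved in practice.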