Distributed Learning of Convolutional Neural Networks on Heterogeneous Processing Units

Marques, José Fernando Duarte

Please use this identifier to cite or link to this item: https://hdl.handle.net/10316/81665

Title:	Distributed Learning of Convolutional Neural Networks on Heterogeneous Processing Units
Other Titles:	Aprendizagem distribuída de redes neuronais convolucionais em unidades de processamento heterogéneas.
Authors:	Marques, José Fernando Duarte
Orientador:	Fernandes, Gabriel Falcão Paiva
Keywords:	Redes Neuronais de Convolução; Computação Paralela; Computação Distribuída; Deep Learning; GPGPU; Convolutional Neural Networks; Parallel Computing; Distributed Computing; Deep Learning; GPGPU
Issue Date:	21-Sep-2016
Serial title, monograph or event:	Distributed Learning of Convolutional Neural Networks on Heterogeneous Processing Units
Place of publication or event:	DEEC
Abstract:	A área de deep learning tem sido o foco de muita pesquisa e desenvolvimento aolongo dos últimos anos. As DNNs, e mais concretamente as CNNs provaram ser ferramentas poderosas em tarefas que vão desde as mais comuns, como leitura de cheques, às mais essenciais, sendo usadas em diagnóstico médico. Esta evolução na área levou ao desenvolvimento de frameworks, como o Torch e o Theano, que simplificaram o processo de treino de uma CNN, sendo necessário apenas estruturar a rede, escolhendo os parâmetros ideais e fornecer os inputs e outputs desejados. No entanto, o fácil acesso a essas frameworks levou a um aumento no tamanho tanto das redes como dos conjuntos de dados usados, uma vez que as redes tiveram que se tornar maiores e mais complexas para obter resultados mais significativos. Isto levou a tempos de treinos maiores, que nem a melhoria de GPUs e mais especificamente o uso de GPGPU conseguiu acompanhar.Para dar resposta a isso, foram desenvolvidos métodos de treino distribuído, dividindo o trabalho quer por várias GPUs na mesma máquina, quer por CPUs e GPUs em máquinas distintas. As diferentes técnicas de distribuição podem ser dividas em 2 grupos: paralelismo de dados e paralelismo de modelo. O primeiro método consiste em usar réplicas de uma rede e treinar fornecendo dados diferentes. O paralelismo de modelo passa por dividir o trabalho de toda a rede pelos diferentes dispositivos usados. No entanto, nenhuma destas técnicas usadas pela diferentes frameworks existentes tira partido da paralelização oferecida pelas CNNs, e tentar usar um outro método com essas frameworks revela-se um trabalho demasiado complexo e muitas vezes impossível.Nesta tese, é apresentada uma nova técnica de treino distribuído, que faz uso da paralelização que as CNNs oferecem. O método é uma variação do paralelismo de modelo, onde apenas a camada de convolução é distribuída. Todas as máquinas recebem as mesmas entradas mas um conjunto diferente de filtros, sendo que no final das convoluções os resultados são enviados para uma máquina central, designada como nó mestre.Este método foi alvo de uma série de testes, variando o número de máquinas en-volvidas e a arquitectura da rede, cujos resultados se encontram neste documento. Os resultados mostram que esta técnica é capaz de diminuir os tempos de treino consideravelmente sem perda de desempenho de classificação, tanto para CPU como para GPU. Também foi feita uma análise detalhada sobre a influência do tamanho da rede e do batchsize no speedup conseguido. Por fim, foram também simulados resultados para um número superior de máquinas usadas, bem como o possível uso de GPUs de dispositivos móveis, cuja eficiência energética aplicada ao deep learning foi também explorada neste trabalhado, suportado pelo conteúdo do Appendix A. The field of deep learning has been the focus of plenty of research and development over the last years. Deep Neural Networks (DNNs), and more specifically Convolutional Neural Networks (CNNs) have shown to be powerful tools in tasks that range from ordinary, like check reading, to the most essential, being used in medical diagnosis. This evolution in the field has lead to the development of frameworks, such as Torch and Theano, that simplified the training process of a CNN, where the user only needs to create the network architecture, select the ideal hyper-parameters and provide the inputs and desired outputs. However, the easy access to these frameworks lead to an increase in both network size as well as dataset size, since the networks had to become bigger and more complex to be able to achieve more significant results. This lead to larger training times, that not even the improvement of Graphics Processing Units (GPUs) and more specifically the use of General-Purpose GPU (GPGPU) could keep up to.To solve that problem, several distributed training methods were developed, dividing the workload through several GPUs on the same machine or through Central Processing Units (CPUs) or GPUs on different machines. This distribution techniques can be divided into 2 groups: data parallelism and model parallelism. The first method consists on using replicas of the same network on each device and train them using different data. Model parallelism divides the workload of the entire network through the different devices used. However, none of these techniques used by the different frameworks takes advantage of the parallelization offered by the CNNs, and trying to use a different method with those frameworks ends up being a task too complex or even impossible.In the present thesis, a new distributed training technique is developed, that makes use of the parallelization that CNNs have to offer. The method is a variation of the model parallelism, but only the convolutional layer is distributed. Every machine receives the same inputs but a different set of kernels, and the result of the convolutions is then sent to a main machine, known as master node.This method was subjected to a series of test, varying the number of machines involved as well as the network architecture, with the results being presented in this document. The results show that this technique is capable of diminishing the training times considerably without classification performance loss, for both the CPUs as well as the GPUs. A detailed analysis regarding the influence of the network size and batch size was also included in this document. Finally, a simulation was executed, that shows results using a higher number of machines as well as a possible use of mobile GPUs, whose energy efficiency applied to deep learning was also explored in this work, supported by the contents of Appendix A.
Description:	Dissertação de Mestrado Integrado em Engenharia Electrotécnica e de Computadores apresentada à Faculdade de Ciências e Tecnologia
URI:	https://hdl.handle.net/10316/81665
Rights:	embargoedAccess
Appears in Collections:	UC - Dissertações de Mestrado

Files in This Item:

File	Description	Size	Format
thesis.pdf		33.05 MB	Adobe PDF	View/Open

Show full item record

Google Scholar^TM

Check

This item is licensed under a Creative Commons License

Files in This Item:

Google ScholarTM

Google Scholar^TM