What is data parallelism in deep learning?
In deep learning, data parallelism refers to parallelization across multiple processors in a parallel computing environment. It focuses on distributing the data across several nodes, which operate on their portions of the data in parallel.
Data parallelism can be applied to regular data structures such as arrays and matrices by operating on every element in parallel. Rather than depending on process or task concurrency, data parallelism arises from the structure and flow of the data itself.
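As a minimal illustration in Python (using NumPy, which is an assumption; the text names no library), an element-wise operation over an array expresses the same computation for every element, which the runtime is then free to execute across SIMD lanes or cores:

```python
import numpy as np

# The same operation is applied independently to every element of the array,
# so the runtime can split the work across SIMD lanes or processor cores.
a = np.arange(1_000_000, dtype=np.float64)
b = np.sqrt(a) + 1.0  # one logical element-wise operation over all elements
```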
The goal of data parallelism is to scale processing throughput by decomposing the data set into concurrent streams that all perform the same set of operations.
As an example, a customer address standardization process iteratively takes an address and tries to convert it into a standard form. Adapted to data parallelism, this task can be sped up roughly fourfold by instantiating four standardization processes and streaming one quarter of the address records through each instance.
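A minimal Python sketch of this setup, assuming a hypothetical standardize_address function (its body below is a stand-in, not real standardization logic): the records are split into four chunks and streamed through four worker processes, each running the same operation.

```python
from concurrent.futures import ProcessPoolExecutor

def standardize_address(record):
    # Hypothetical stand-in for real address standardization logic.
    return record.strip().upper()

if __name__ == "__main__":
    records = [" 12 main st ", "99 oak ave", "7 pine rd", "3 elm blvd"] * 1000

    # Four worker processes, each streaming through one quarter of the records.
    with ProcessPoolExecutor(max_workers=4) as pool:
        standardized = list(pool.map(standardize_address, records,
                                     chunksize=len(records) // 4))
```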
Data parallelism is essentially a finer-grained parallelism: you achieve the performance improvement by carrying out the same small set of operations repeatedly over several streams of data.
The locality of data references plays a critical role in the performance of a data-parallel programming model. Data locality depends on the memory accesses performed by the program as well as the size of the cache.
The concept of data parallelism dates to the 1960s, when the Solomon machine was created. The Solomon machine, also known as a vector processor, was built to speed up mathematical operations by working on a large data array (operating on multiple data elements in consecutive time steps). Concurrency of data operations was also exploited by operating on multiple data elements at the same time using a single instruction; processors built this way were known as array processors. The term "data parallelism" itself was coined in the 1980s to describe the programming style popularly used to program Connection Machines in data-parallel languages such as C*.
The best example of data parallelism today is the graphics processing unit (GPU), which combines both techniques, operating on multiple data elements in space and in time using a single instruction.
How is data parallelism used in GPUs (graphics processing units)?
Since data parallelism is easy to implement, it is the most widely used parallelization strategy on multi-GPU systems.
In data parallelism, every GPU uses the same model to train on a different subset of the data. There is no synchronization between GPUs during the forward pass, because every GPU holds a full copy of the model, including the network structure and the parameters. However, the parameter gradients computed by the different GPUs must be synchronized during backpropagation (BP).
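A minimal PyTorch sketch of this scheme (assuming PyTorch and, for the multi-GPU path, available CUDA devices; the model and batch sizes are arbitrary): nn.DataParallel replicates the model on each GPU, splits each batch in the forward pass, and reduces the parameter gradients during backpropagation.

```python
import torch
import torch.nn as nn

# Arbitrary small model; nn.DataParallel gives each GPU a full copy of it.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # replicate the model, split each batch
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(256, 128, device=device)        # one batch of data
targets = torch.randint(0, 10, (256,), device=device)

loss = criterion(model(inputs), targets)  # forward pass: no sync needed
optimizer.zero_grad()
loss.backward()                           # gradients are synchronized here
optimizer.step()
```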
What is model parallelism?
In model parallelism, every computational node is responsible for part of the model while training on the same data samples. The model is divided into multiple pieces, and each computing node, such as a GPU, is responsible for one of them. Communication occurs between computational nodes whenever the input of a neuron comes from the output of a neuron on another node. The performance of model parallelism tends to be worse than that of data parallelism, because its communication costs are much higher.
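A minimal PyTorch sketch of model parallelism, assuming two CUDA devices named "cuda:0" and "cuda:1" (an assumption; any two devices would do): each device holds one piece of the model, and communication happens where the output of one piece feeds the input of the next.

```python
import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    """Each half of the model lives on a different device."""
    def __init__(self, d0="cuda:0", d1="cuda:1"):
        super().__init__()
        self.d0, self.d1 = d0, d1
        self.part1 = nn.Linear(128, 64).to(d0)  # first piece on device 0
        self.part2 = nn.Linear(64, 10).to(d1)   # second piece on device 1

    def forward(self, x):
        h = torch.relu(self.part1(x.to(self.d0)))
        # Inter-device communication: activations move from device 0 to 1.
        return self.part2(h.to(self.d1))
```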
What is data-model parallelism?
Both data parallelism and model parallelism face restrictions. With data parallelism, if there are too many computational nodes, you need to reduce the learning rate to keep training stable. With model parallelism, too many nodes drastically degrade performance because of the communication expense.
Model parallelism can achieve high performance when there is a large number of weights, while data parallelism is efficient when there is a large amount of computation (neuron activity) relative to the number of weights. In convolutional neural networks, the convolutional layers hold around 90% of the computation but only about 5% of the parameters, while the fully connected layers hold about 95% of the parameters but only 5-10% of the computation.
So convolutional neural networks (CNNs) can be parallelized in data-model mode by employing data parallelism for the convolutional layers and model parallelism for the fully connected layers.
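A hedged PyTorch sketch of this hybrid scheme, again assuming two CUDA devices: the parameter-light convolutional stage is replicated with data parallelism, while the parameter-heavy fully connected stage is split across the two devices, each computing half of the output classes.

```python
import torch
import torch.nn as nn

class HybridCNN(nn.Module):
    def __init__(self):
        super().__init__()
        conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Data parallelism for the compute-heavy, parameter-light conv stage.
        self.features = nn.DataParallel(conv.to("cuda:0"))
        # Model parallelism for the parameter-heavy FC stage: each device
        # computes half of the output classes.
        self.fc_a = nn.Linear(16 * 16 * 16, 5).to("cuda:0")
        self.fc_b = nn.Linear(16 * 16 * 16, 5).to("cuda:1")

    def forward(self, x):                    # x: (N, 3, 32, 32) on cuda:0
        h = self.features(x).flatten(1)      # outputs gathered on cuda:0
        out_a = self.fc_a(h)                 # first half of the logits
        out_b = self.fc_b(h.to("cuda:1"))    # second half, on cuda:1
        return torch.cat([out_a, out_b.to("cuda:0")], dim=1)
```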
What is the difference between task and data parallelism?
To understand data parallelism versus task parallelism, start with data parallelism. Say you have a lot of data that you want to process. To get data parallelism, you divide that data up across several processors. Supercomputers have excelled in this area for years; this type of problem is very well understood, and historically it has received far more attention than task parallelism.
Task parallelism is about having multiple tasks that you want to accomplish. You could have several processors each looking at the same data set and computing different answers. Rather than splitting the data and doing the same work on different processors, task parallelism involves dividing up the tasks to be applied.
Pipelining is the most common form of task parallelism. Here you have multiple tasks; call them task 1, task 2, and task 3. Rather than having each of them operate on the data independently, you feed the data to the first task, then pass its output to the second task, and then to the third, with all three stages processing different items concurrently.
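A minimal Python sketch of such a pipeline (the three placeholder tasks are illustrative, not from the original): each stage runs in its own thread and hands results to the next stage through a queue, so all three tasks work concurrently on different items.

```python
import threading
import queue

def stage(task, q_in, q_out):
    """Run one pipeline stage: apply `task` to items until a None sentinel."""
    while True:
        item = q_in.get()
        if item is None:            # sentinel: shut down and pass it along
            if q_out is not None:
                q_out.put(None)
            break
        result = task(item)
        if q_out is not None:
            q_out.put(result)
        else:
            print(result)

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
tasks = [lambda x: x + 1,             # task 1 (placeholder work)
         lambda x: x * 2,             # task 2
         lambda x: f"result: {x}"]    # task 3

threads = [threading.Thread(target=stage, args=(tasks[0], q1, q2)),
           threading.Thread(target=stage, args=(tasks[1], q2, q3)),
           threading.Thread(target=stage, args=(tasks[2], q3, None))]
for t in threads:
    t.start()
for item in range(5):                 # stream data through the pipeline
    q1.put(item)
q1.put(None)
for t in threads:
    t.join()
```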