
No training speed improvement can be obtained by using multi-gpus with mxnet as the backend #79

Wendison opened this issue Sep 5, 2017 · 4 comments



Wendison commented Sep 5, 2017

Hi, I have some questions about training speed when using multiple GPUs with MXNet as the Keras backend. According to https://mxnet.incubator.apache.org/how_to/multi_devices.html: "By default, MXNet partitions a data batch evenly among the available GPUs. Assume a batch size b and assume there are k GPUs, then in one iteration each GPU will perform forward and backward on b/k examples. The gradients are then summed over all GPUs before updating the model." My understanding is that, with the batch size b fixed, each GPU computes gradients on b/k examples, which should take less time than computing gradients on all b examples on a single GPU. As a result, for the same batch size, each weight update should be faster with multiple GPUs than with a single GPU. But in my experiments, training with multiple GPUs is actually slower than training with a single GPU.
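For concreteness, here is a minimal NumPy sketch (my own illustration, not MXNet's actual implementation) of the partitioning described in that quote; with equal chunks, combining the per-device gradients reproduces the single-GPU gradient on all b examples, so only the compute per device changes:

import numpy as np

b, k, dim = 128, 4, 10
X = np.random.randn(b, dim)
y = np.random.randn(b, 1)
w = np.zeros((dim, 1))

def grad(w, X, y):
    # gradient of mean squared error for a linear model on one chunk
    return 2.0 * X.T.dot(X.dot(w) - y) / len(X)

# each "device" computes on b/k examples; averaging the per-device
# gradients gives one update that is mathematically the same as a
# single-device step on the full batch of b examples
chunks = zip(np.array_split(X, k), np.array_split(y, k))
g = sum(grad(w, Xc, yc) for Xc, yc in chunks) / k
w -= 0.01 * g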

Below are the relevant parts of my code, which uses a fully-connected network:

from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.optimizers import SGD

model = Sequential()
model.add(Dropout(0.1, input_shape=(2056,)))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(257))
model.summary()
opt = SGD()

NUM_GPU = 4
gpu_list = []
for i in range(NUM_GPU):
    gpu_list.append('gpu(%d)' % i)
batch_size = 128
# my_loss is a custom loss (full definition in my later comment);
# `context` is the keras-mxnet option that places training on the listed GPUs
model.compile(loss=my_loss, optimizer=opt, context=gpu_list)

I don't know whether my understanding is right. Why is no speed improvement obtained with multiple GPUs? Can anyone help me with these questions? Thanks!


Wendison commented Sep 6, 2017

Below are the training logs with 1 GPU and 4 GPUs respectively:
1 gpu:
[screenshot: training log with 1 GPU]
4 gpus:
[screenshot: training log with 4 GPUs]
It seems that training with 4 GPUs converges faster, but each epoch takes more time.
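The per-epoch times can also be measured directly with a small callback (a sketch using the standard Keras callback API), which gives cleaner numbers to compare the two runs:

import time
from keras.callbacks import Callback

# records wall-clock time per epoch so the 1-GPU and 4-GPU runs
# can be compared on the same footing
class EpochTimer(Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.t0 = time.time()
    def on_epoch_end(self, epoch, logs=None):
        print('epoch %d took %.1f s' % (epoch, time.time() - self.t0))

Passing callbacks=[EpochTimer()] to fit_generator prints one line per epoch.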

@kevinthesun

Can you provide the full code for your experiment? Sometimes multi-GPU gets no boost or can even slow down training because of the overhead of hardware communication.
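For a sense of scale, a rough back-of-envelope estimate of the gradient synchronization cost for the network in the snippet above (my own calculation; the ~10 GB/s bus bandwidth is an assumed illustrative figure, not a measured one):

# parameters of the 2056 -> 2800 -> 2800 -> 2800 -> 257 network
# (weights + biases of each Dense layer)
params = (2056*2800 + 2800) + 2*(2800*2800 + 2800) + (2800*257 + 257)
grad_bytes = params * 4                  # float32 gradients, ~89 MB
bus_bytes_per_s = 10e9                   # assumed effective bandwidth
print('parameters: %.1fM' % (params / 1e6))                  # ~22.2M
print('sync: ~%.1f ms/iteration' % (1e3 * grad_bytes / bus_bytes_per_s))

With batch_size=128 split four ways, each GPU computes on only 32 examples per iteration, so a fixed per-iteration synchronization cost of this order can easily outweigh the saved compute.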


Wendison commented Sep 7, 2017

Ok, my code is shown as follows:

import numpy as np
np.random.seed(1337)  # for reproducibility

import random
from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.optimizers import SGD
from keras import backend as K
from sklearn import preprocessing

def my_loss(y_true, y_pred):
    # weighted MSE over three output groups (257 + 93 + 257 = 607 dims)
    term1 = K.sum(K.square(y_pred[:, :257] - y_true[:, :257]), axis=-1)
    term2 = K.sum(K.square(y_pred[:, 257:350] - y_true[:, 257:350]), axis=-1)
    term3 = K.sum(K.square(y_pred[:, 350:] - y_true[:, 350:]), axis=-1)
    return 0.5*term1 + 0.3*term2 + 0.2*term3
                                                            
data_dir='/work/Wendison/training_data/'
NameX=[]
NameY=[]
Numxy=[]
##As the training data is too big (>100 GB), it was divided into 20 file pairs (input + label)
for j in range(1,21):  
    NameX.append(data_dir+'Xtrain'+str(j)+'.npy') # the path for input data of DNN
    NameY.append(data_dir+'Ytrain'+str(j)+'.npy') # the path for label data of DNN
    Numxy.append(data_dir+'Num'+str(j)+'.npy') # the path for number of samples for each file
    
meanx=np.load('meanx.npy')
stdx=np.load('stdx.npy')
meany=np.load('meany.npy')
stdy=np.load('stdy.npy')

scalerx=preprocessing.StandardScaler()
scalery=preprocessing.StandardScaler()
scalerx.mean_=meanx
scalerx.scale_=stdx
scalery.mean_=meany
scalery.scale_=stdy

##use the last data pair as the validation data
tempx=np.load(NameX[-1])
tempy=np.load(NameY[-1])
X_val=scalerx.transform(tempx)
Y_val=scalery.transform(tempy)
NameX.pop()
NameY.pop()
Numxy.pop()  # drop the validation pair from all three training lists

batch_size=128
Num=len(Numxy)
numall=0
for i in range(len(Numxy)):
    nn=np.load(Numxy[i])
    numall+=sum(nn) # compute the number of overall training samples

##define a data generator to read training data
def mygenerator(batch_size=batch_size):
    while True:  # Keras generators must yield indefinitely
        order = list(range(Num))
        random.shuffle(order)  # shuffle the order of training files
        for i in order:
            tempx = np.load(NameX[i])
            tempy = np.load(NameY[i])
            X_train = scalerx.transform(tempx)
            Y_train = scalery.transform(tempy)
            perm = np.random.permutation(X_train.shape[0])
            X_train = X_train[perm, :]
            Y_train = Y_train[perm, :]  # shuffle the order of samples in each data file
            numb = X_train.shape[0] // batch_size  # number of full batches in this file
            for ii in range(numb):
                if ii < numb - 1:
                    yield X_train[ii*batch_size:(ii+1)*batch_size, :], Y_train[ii*batch_size:(ii+1)*batch_size, :]
                else:
                    yield X_train[ii*batch_size:, :], Y_train[ii*batch_size:, :]

##model definition
model = Sequential()
model.add(Dropout(0.1,input_shape=(2056,)))
model.add(Dense(2800,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(607))
model.summary()

opt=SGD()

NUM_GPU = 4
gpu_list = []
for i in range(NUM_GPU):
    gpu_list.append('gpu(%d)' % i)
    
model.compile(loss=my_loss,optimizer=opt, context=gpu_list)

mygen=mygenerator()
for i in range(1,101):
    model.fit_generator(mygen,samples_per_epoch=numall, nb_epoch=1, verbose=1, 
                        validation_data=(X_val, Y_val))

The training data is very large (>100 GB), so I divided it into 20 file pairs and load them one after another within each epoch using a Keras generator. Could that be related to the multi-GPU training speed? Thanks! @kevinthesun
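One thing I can check on my side is how long the data pipeline alone takes (a sketch that times the generator in isolation, independent of the model):

import time

# time the generator alone for one epoch's worth of batches; if this
# is close to the epoch time, data IO is the bottleneck
gen = mygenerator()
n_batches = numall // batch_size
t0 = time.time()
for _ in range(n_batches):
    next(gen)
print('data only: %.1f s for %d batches' % (time.time() - t0, n_batches))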

@kevinthesun
Copy link

@Wendison You can benchmark pure training time without data IO to see if data IO is the bottleneck.
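A minimal sketch of such a benchmark (one possible way to do it; the synthetic arrays are random stand-ins for the real data so that no file IO is involved):

import time
import numpy as np

# train on synthetic in-memory data; any remaining multi-GPU slowdown
# is then compute/communication overhead rather than data loading
X_fake = np.random.randn(12800, 2056).astype('float32')
Y_fake = np.random.randn(12800, 607).astype('float32')

t0 = time.time()
model.fit(X_fake, Y_fake, batch_size=128, nb_epoch=1, verbose=0)
print('pure training: %.1f s' % (time.time() - t0))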
