
No training speed improvement can be obtained by using multi-gpus with mxnet as the backend #79

Wendison opened this issue Sep 5, 2017 · 4 comments



Wendison commented Sep 5, 2017

Hi, I have some questions about training speed when using multiple GPUs with MXNet as the Keras backend. According to https://mxnet.incubator.apache.org/how_to/multi_devices.html: "By default, MXNet partitions a data batch evenly among the available GPUs. Assume a batch size b and assume there are k GPUs, then in one iteration each GPU will perform forward and backward on b/k examples. The gradients are then summed over all GPUs before updating the model." My understanding is that, with the batch size b fixed, each GPU computes gradients on b/k examples, which should take less time than computing gradients on all b examples on a single GPU. As a result, for the same batch size, each weight update should be faster with multiple GPUs than with a single GPU. But in my experiments, training with multiple GPUs is actually slower than training with a single GPU.
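For concreteness, here is a minimal NumPy sketch (my own illustration, not MXNet's actual implementation) of the partitioning described in that quote; with equal chunks, combining the per-device gradients reproduces the single-GPU gradient on all b examples, so only the compute per device changes:

import numpy as np

b, k, dim = 128, 4, 10
X = np.random.randn(b, dim)
y = np.random.randn(b, 1)
w = np.zeros((dim, 1))

def grad(w, X, y):
    # gradient of mean squared error for a linear model on one chunk
    return 2.0 * X.T.dot(X.dot(w) - y) / len(X)

# each "device" computes on b/k examples; averaging the per-device
# gradients gives one update that is mathematically the same as a
# single-device step on the full batch of b examples
chunks = zip(np.array_split(X, k), np.array_split(y, k))
g = sum(grad(w, Xc, yc) for Xc, yc in chunks) / k
w -= 0.01 * g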

Below are the relevant parts of my code, which uses a fully-connected network:

from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.optimizers import SGD

model = Sequential()
model.add(Dropout(0.1, input_shape=(2056,)))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(257))
model.summary()
opt = SGD()

NUM_GPU = 4
gpu_list = []
for i in range(NUM_GPU):
    gpu_list.append('gpu(%d)' % i)
batch_size = 128
# my_loss is a custom loss (full definition in my later comment);
# `context` is the keras-mxnet option that places training on the listed GPUs
model.compile(loss=my_loss, optimizer=opt, context=gpu_list)

I don't know whether my understanding is right. Why is no speed improvement obtained with multiple GPUs? Can anyone help me with these questions? Thanks!


Wendison commented Sep 6, 2017

Below are the training logs with 1 GPU and 4 GPUs respectively:
1 gpu:
[screenshot: training log with 1 GPU]
4 gpus:
[screenshot: training log with 4 GPUs]
It seems that training with 4 GPUs converges faster, but each epoch takes more time.
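The per-epoch times can also be measured directly with a small callback (a sketch using the standard Keras callback API), which gives cleaner numbers to compare the two runs:

import time
from keras.callbacks import Callback

# records wall-clock time per epoch so the 1-GPU and 4-GPU runs
# can be compared on the same footing
class EpochTimer(Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.t0 = time.time()
    def on_epoch_end(self, epoch, logs=None):
        print('epoch %d took %.1f s' % (epoch, time.time() - self.t0))

Passing callbacks=[EpochTimer()] to fit_generator prints one line per epoch.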

@kevinthesun

Can you provide the full code for your experiment? Sometimes multi-GPU gets no boost or can even slow down training because of the overhead of hardware communication.
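For a sense of scale, a rough back-of-envelope estimate of the gradient synchronization cost for the network in the snippet above (my own calculation; the ~10 GB/s bus bandwidth is an assumed illustrative figure, not a measured one):

# parameters of the 2056 -> 2800 -> 2800 -> 2800 -> 257 network
# (weights + biases of each Dense layer)
params = (2056*2800 + 2800) + 2*(2800*2800 + 2800) + (2800*257 + 257)
grad_bytes = params * 4                  # float32 gradients, ~89 MB
bus_bytes_per_s = 10e9                   # assumed effective bandwidth
print('parameters: %.1fM' % (params / 1e6))                  # ~22.2M
print('sync: ~%.1f ms/iteration' % (1e3 * grad_bytes / bus_bytes_per_s))

With batch_size=128 split four ways, each GPU computes on only 32 examples per iteration, so a fixed per-iteration synchronization cost of this order can easily outweigh the saved compute.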


Wendison commented Sep 7, 2017

Ok, my code is shown as follows:

import numpy as np
np.random.seed(1337)  # for reproducibility

import random
from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.optimizers import SGD
from keras import backend as K
from sklearn import preprocessing

def my_loss(y_true, y_pred):
    # weighted MSE over three output groups (257 + 93 + 257 = 607 dims)
    term1 = K.sum(K.square(y_pred[:, :257] - y_true[:, :257]), axis=-1)
    term2 = K.sum(K.square(y_pred[:, 257:350] - y_true[:, 257:350]), axis=-1)
    term3 = K.sum(K.square(y_pred[:, 350:] - y_true[:, 350:]), axis=-1)
    return 0.5*term1 + 0.3*term2 + 0.2*term3
                                                            
data_dir='/work/Wendison/training_data/'
NameX=[]
NameY=[]
Numxy=[]
##As the training data is too big (>100 GB), it was divided into 20 file pairs (input + label)
for j in range(1,21):  
    NameX.append(data_dir+'Xtrain'+str(j)+'.npy') # the path for input data of DNN
    NameY.append(data_dir+'Ytrain'+str(j)+'.npy') # the path for label data of DNN
    Numxy.append(data_dir+'Num'+str(j)+'.npy') # the path for number of samples for each file
    
meanx=np.load('meanx.npy')
stdx=np.load('stdx.npy')
meany=np.load('meany.npy')
stdy=np.load('stdy.npy')

scalerx=preprocessing.StandardScaler()
scalery=preprocessing.StandardScaler()
scalerx.mean_=meanx
scalerx.scale_=stdx
scalery.mean_=meany
scalery.scale_=stdy

##use the last data pair as the validation data
tempx=np.load(NameX[-1])
tempy=np.load(NameY[-1])
X_val=scalerx.transform(tempx)
Y_val=scalery.transform(tempy)
NameX.pop()
NameY.pop()
Numxy.pop()  # drop the validation pair from all three training lists

batch_size=128
Num=len(Numxy)
numall=0
for i in range(len(Numxy)):
    nn=np.load(Numxy[i])
    numall+=sum(nn) # compute the number of overall training samples

##define a data generator to read training data
def mygenerator(batch_size=batch_size):
    while True:  # Keras generators must yield indefinitely
        order = list(range(Num))
        random.shuffle(order)  # shuffle the order of training files
        for i in order:
            tempx = np.load(NameX[i])
            tempy = np.load(NameY[i])
            X_train = scalerx.transform(tempx)
            Y_train = scalery.transform(tempy)
            perm = np.random.permutation(X_train.shape[0])
            X_train = X_train[perm, :]
            Y_train = Y_train[perm, :]  # shuffle the order of samples in each data file
            numb = X_train.shape[0] // batch_size  # number of full batches in this file
            for ii in range(numb):
                if ii < numb - 1:
                    yield X_train[ii*batch_size:(ii+1)*batch_size, :], Y_train[ii*batch_size:(ii+1)*batch_size, :]
                else:
                    yield X_train[ii*batch_size:, :], Y_train[ii*batch_size:, :]

##model definition
model = Sequential()
model.add(Dropout(0.1,input_shape=(2056,)))
model.add(Dense(2800,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(607))
model.summary()

opt=SGD()

NUM_GPU = 4
gpu_list = []
for i in range(NUM_GPU):
    gpu_list.append('gpu(%d)' % i)
    
model.compile(loss=my_loss,optimizer=opt, context=gpu_list)

mygen=mygenerator()
for i in range(1,101):
    model.fit_generator(mygen,samples_per_epoch=numall, nb_epoch=1, verbose=1, 
                        validation_data=(X_val, Y_val))

The training data is very large (>100 GB), so I divided it into 20 file pairs and load them one after another within each epoch using a Keras generator. Could that be related to the multi-GPU training speed? Thanks! @kevinthesun
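One thing I can check on my side is how long the data pipeline alone takes (a sketch that times the generator in isolation, independent of the model):

import time

# time the generator alone for one epoch's worth of batches; if this
# is close to the epoch time, data IO is the bottleneck
gen = mygenerator()
n_batches = numall // batch_size
t0 = time.time()
for _ in range(n_batches):
    next(gen)
print('data only: %.1f s for %d batches' % (time.time() - t0, n_batches))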

@kevinthesun
Copy link

@Wendison You can benchmark pure training time without data IO to see if data IO is the bottleneck.
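A minimal sketch of such a benchmark (one possible way to do it; the synthetic arrays are random stand-ins for the real data so that no file IO is involved):

import time
import numpy as np

# train on synthetic in-memory data; any remaining multi-GPU slowdown
# is then compute/communication overhead rather than data loading
X_fake = np.random.randn(12800, 2056).astype('float32')
Y_fake = np.random.randn(12800, 607).astype('float32')

t0 = time.time()
model.fit(X_fake, Y_fake, batch_size=128, nb_epoch=1, verbose=0)
print('pure training: %.1f s' % (time.time() - t0))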
