Not sure if this is a TensorFlow issue or Docker issue #9

chattertonc09 · 2019-08-14T21:08:57Z

Getting a strange error on one of my embedding layers when using this with Keras:

restype:container
2019-08-14 21:00:10,145|azureml.core.authentication|DEBUG|Time to expire 604466.854539 seconds
2019-08-14 azureml.history._tracking.PythonWorkingDirectory.workingdir|DEBUG|Calling pyfs
2019-08-14 21:00:29,324|azureml.history._tracking.PythonWorkingDirectory|INFO|Current working dir: /mnt/batch/tasks/....
2019-08-14
2019-08-14 21:00:29,324|azureml.WorkingDirectoryCM|ERROR|<class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>: indices[8,0] = 565 is not in [0, 562)
[[node master_Embedding/GatherV2 (defined at /azureml-envs/azureml/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1211) ]]

InvalidArgumentError (see above for traceback): indices[8,0] = 565 is not in [0, 562)
[[node broker_master_Embedding/GatherV2 (defined at /azureml-envs/azureml_d582dd13e83051343c8ab0e51ab5a504/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1211) ]]

Any ideas?

The driver_log.txt shows:

WARNING - From /azureml-envs/azureml_d582dd13e83051343c8ab0e51ab5a504/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 72626 samples, validate on 4035 samples
Epoch 1/100
2019-08-14 21:00:15.382966: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-14 21:00:15.388250: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2596990000 Hz
2019-08-14 21:00:15.388560: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55dbbf606c20 executing computations on platform Host. Devices:
2019-08-14 21:00:15.388579: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
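The `InvalidArgumentError` in the log above says the embedding lookup (`GatherV2`) received index 565 for the row at batch position `[8,0]`, while the layer only has 562 rows, so valid indices are 0 through 561 (equivalently, `input_dim=562`). One common cause, assumed here and not confirmed by the logs, is that categorical codes were produced from a different slice of data than the one used to size `input_dim`, so validation or scoring data contains codes past the end of the table. The plain-Python sketch below (the helper names `out_of_range`, `build_vocab` and `encode` are hypothetical) locates offending indices and encodes a category column with a fixed vocabulary that reserves index 0 for unseen values.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def out_of_range(indices: Sequence[Sequence[int]], input_dim: int) -> List[Tuple[Tuple[int, int], int]]:
    """Return ((row, col), value) for each index outside [0, input_dim)."""
    bad = []
    for r, row in enumerate(indices):
        for c, idx in enumerate(row):
            if not 0 <= idx < input_dim:
                bad.append(((r, c), idx))
    return bad


def build_vocab(train_values: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Assign 1..N to categories in first-seen order; 0 is reserved for unseen values."""
    vocab: Dict[Hashable, int] = {}
    for v in train_values:
        if v not in vocab:
            vocab[v] = len(vocab) + 1
    return vocab


def encode(values: Sequence[Hashable], vocab: Dict[Hashable, int]) -> List[List[int]]:
    """Encode one categorical column as (batch, 1) indices; unknown values map to 0."""
    return [[vocab.get(v, 0)] for v in values]


# A batch shaped like the failing one: the embedding has 562 rows, one value is 565.
print(out_of_range([[3], [10], [565]], 562))  # [((2, 0), 565)]

vocab = build_vocab(["a", "b", "c", "a"])
input_dim = len(vocab) + 1  # +1 for the reserved "unseen" row
encoded = encode(["a", "z", "c"], vocab)
print(encoded, out_of_range(encoded, input_dim))  # [[1], [0], [3]] []
```

With this scheme the Keras layer would be declared as `Embedding(input_dim=len(vocab) + 1, output_dim=...)`, and the same vocabulary would be reused for training, validation and scoring data so no code can exceed the table size.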

jpe316 · 2019-08-27T18:08:03Z

Hi, which script in this repo are you trying to run? That will help us debug.

chattertonc09 · 2019-08-27T20:27:00Z

I am using train.py, to which I've added a Keras MLP neural network as a Python class.
I got past this by looking at AmlPipelines.py, where the training script was run with a PythonScriptStep; I changed this to use a TensorFlow Estimator and an EstimatorStep. The issue I have now is running with GPU support. If I enable GPU support on 4 nodes and then use Keras's multi_gpu_model like this, it fails because it does not recognize all the available GPUs on the cluster. I'm not sure whether this is caused by the TensorFlow version or the CUDA version.

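import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model

# Placeholder values (as in the Keras docs example); use your own input size and class count.
height, width, num_classes = 224, 224, 1000

# Build the template model on the CPU so its weights live in host memory.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicates the model on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')  # optimizer as in the Keras docs example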
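One likely explanation, offered as an assumption rather than a confirmed diagnosis: `multi_gpu_model` only replicates a model across the GPUs attached to the machine the script is running on; it does not see GPUs on other nodes of the cluster. With `gpus=8` it expects devices `/gpu:0` through `/gpu:7` to exist locally and raises a `ValueError` otherwise, and it also rejects `gpus < 2`. The sketch below (the helper names are hypothetical) picks a replica count from the device types visible on the current node. In practice the list would come from something like `[d.device_type for d in device_lib.list_local_devices()]` (from `tensorflow.python.client`), but it is passed in here so the logic can be checked without TensorFlow.

```python
from typing import Iterable


def local_gpu_count(device_types: Iterable[str]) -> int:
    """Count devices whose type is exactly "GPU" (XLA_GPU entries are skipped)."""
    return sum(1 for t in device_types if t == "GPU")


def replicas_for_multi_gpu_model(requested: int, device_types: Iterable[str]) -> int:
    """Number of replicas to request from multi_gpu_model on this node.

    Returns 1 when fewer than two local GPUs exist, meaning "use the plain
    model", because multi_gpu_model rejects gpus < 2.
    """
    available = local_gpu_count(device_types)
    if available < 2:
        return 1
    # Never ask for more GPUs than this single machine actually has.
    return min(requested, available)


# Hypothetical device list for one node with 2 GPUs.
node_devices = ["CPU", "XLA_CPU", "GPU", "GPU", "XLA_GPU", "XLA_GPU"]
print(replicas_for_multi_gpu_model(8, node_devices))  # 2
```

In train.py the result would decide whether to wrap at all: call `multi_gpu_model(model, gpus=n)` only when `n > 1`, otherwise compile `model` directly. Spreading one training run across 4 nodes generally needs a distributed approach such as Horovod (which the Azure ML TensorFlow estimator can launch over MPI) rather than `multi_gpu_model`; that is outside this sketch.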

jpe316 self-assigned this Aug 27, 2019
