Not sure if this is a TensorFlow issue or Docker issue #9

Open
chattertonc09 opened this issue Aug 14, 2019 · 2 comments

@chattertonc09

I'm getting a strange error on one of my embedding layers when using this with Keras.

restype:container
2019-08-14 21:00:10,145|azureml.core.authentication|DEBUG|Time to expire 604466.854539 seconds
2019-08-14 azureml.history._tracking.PythonWorkingDirectory.workingdir|DEBUG|Calling pyfs
2019-08-14 21:00:29,324|azureml.history._tracking.PythonWorkingDirectory|INFO|Current working dir: /mnt/batch/tasks/....
2019-08-14
2019-08-14 21:00:29,324|azureml.WorkingDirectoryCM|ERROR|<class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>: indices[8,0] = 565 is not in [0, 562)
[[node master_Embedding/GatherV2 (defined at /azureml-envs/azureml/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1211) ]]

InvalidArgumentError (see above for traceback): indices[8,0] = 565 is not in [0, 562)
[[node broker_master_Embedding/GatherV2 (defined at /azureml-envs/azureml_d582dd13e83051343c8ab0e51ab5a504/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1211) ]]

Any ideas?

The driver_log.txt shows:

WARNING - From /azureml-envs/azureml_d582dd13e83051343c8ab0e51ab5a504/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 72626 samples, validate on 4035 samples
Epoch 1/100
2019-08-14 21:00:15.382966: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-14 21:00:15.388250: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2596990000 Hz
2019-08-14 21:00:15.388560: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55dbbf606c20 executing computations on platform Host. Devices:
2019-08-14 21:00:15.388579: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
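
The `InvalidArgumentError` in the log above reports that the embedding lookup received index 565 while the valid range is [0, 562). That usually means the input data contains an ID at or beyond the embedding layer's `input_dim`. Below is a minimal sketch for checking this before training, assuming the inputs are integer NumPy arrays. The helper name `find_out_of_range` and the sample batch are hypothetical and not taken from the repo.

```python
import numpy as np


def find_out_of_range(indices, input_dim):
    """Return (position, value) pairs for indices outside [0, input_dim)."""
    arr = np.asarray(indices)
    mask = (arr < 0) | (arr >= input_dim)
    return [(tuple(int(i) for i in pos), int(arr[tuple(pos)]))
            for pos in np.argwhere(mask)]


# Hypothetical batch shaped like the one in the error: row 8, column 0 holds 565.
batch = np.zeros((9, 1), dtype=np.int64)
batch[8, 0] = 565

bad = find_out_of_range(batch, input_dim=562)
print(bad)

# Smallest input_dim that would cover every ID in this batch.
required_input_dim = int(batch.max()) + 1
print(required_input_dim)
```

If the check finds offenders, one option is to make the layer's `input_dim` at least the largest ID plus one; another is to have the encoding step map unseen values into the existing range. Which applies depends on how the IDs were produced.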

@jpe316 jpe316 self-assigned this Aug 27, 2019
@jpe316
Contributor

jpe316 commented Aug 27, 2019

Hi, which script in this repo are you trying to run? That will help us debug.

@chattertonc09
Author

I am using train.py, where I've added a Keras MLP neural network as a Python class.
I got past this error by looking at AmlPipelines.py, where the train script was run with a PythonScriptStep; I changed this to use a TensorFlow Estimator and an EstimatorStep. The issue I get now is with GPU support. If I enable GPU support for 4 nodes and then use Keras multi_gpu_model like this, it fails because it does not recognize all the available GPUs on the cluster. I'm not sure whether this is because of the version of TensorFlow or CUDA.

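import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model

# Build the base model under a CPU device scope.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicates the model on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)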
parallel_model.compile(loss='categorical_crossentropy',
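
One possible cause, offered as an assumption and not a diagnosis: `multi_gpu_model` replicates a model across the GPUs visible to a single process on one machine, so GPUs on the other nodes of a 4-node cluster would not be counted. The sketch below clamps the requested replica count to the GPUs actually visible locally. The helper names and the fake device list are hypothetical; only `device_lib.list_local_devices()` and `multi_gpu_model` are real TensorFlow 1.x / Keras 2.x calls.

```python
from collections import namedtuple


def local_gpu_names(devices):
    """Names of physical GPUs among descriptors exposing .name and .device_type."""
    return [d.name for d in devices if d.device_type == "GPU"]


def usable_gpu_count(requested, available):
    """Clamp the requested replica count to what this machine actually has."""
    return max(0, min(requested, available))


def build_parallel_model(model, requested_gpus):
    """Wrap `model` with multi_gpu_model only if at least 2 local GPUs are visible."""
    # Imported lazily so the helpers above work without TensorFlow installed.
    from tensorflow.python.client import device_lib
    from keras.utils import multi_gpu_model

    visible = local_gpu_names(device_lib.list_local_devices())
    gpus = usable_gpu_count(requested_gpus, len(visible))
    # multi_gpu_model rejects gpus < 2, so fall back to the plain model.
    return multi_gpu_model(model, gpus=gpus) if gpus >= 2 else model


# Hypothetical device list for a single node with two GPUs.
Device = namedtuple("Device", ["name", "device_type"])
fake_devices = [
    Device("/device:CPU:0", "CPU"),
    Device("/device:XLA_CPU:0", "XLA_CPU"),
    Device("/device:GPU:0", "GPU"),
    Device("/device:GPU:1", "GPU"),
]
print(local_gpu_names(fake_devices))
print(usable_gpu_count(8, len(local_gpu_names(fake_devices))))
```

Printing the visible device list on the compute target is a quick way to see whether the TensorFlow build detects any GPU at all, which helps separate a CUDA or TensorFlow version mismatch from a node-count misunderstanding.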
