
Training

HuanNguyen edited this page Jul 14, 2024 · 14 revisions

1) Match the dynamics of your drone with the Gazebo model

Before training the prediction networks, you need to create a model of your robot in Gazebo and match the closed-loop responses of the low-level velocity/attitude controllers in simulation with your real robot.
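One way to check how well the simulated and real closed-loop responses match is to log a velocity step response on both and compare them. The sketch below is a hypothetical helper, not part of this repo. It assumes approximately first-order, monotonic step responses sampled at the same times, and it estimates the 63.2% time constant of each trace and the RMSE between them.

```python
import math
from typing import Sequence


def time_constant(t: Sequence[float], y: Sequence[float], target: float) -> float:
    """First time (relative to t[0]) at which y reaches 63.2% of `target`.

    Assumes the step is applied at t[0] from rest and the response rises
    monotonically, i.e. it is approximately first order.
    """
    threshold = (1.0 - math.exp(-1.0)) * target
    for ti, yi in zip(t, y):
        if yi >= threshold:
            return ti - t[0]
    raise ValueError("response never reached 63.2% of the target")


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    """Root-mean-square error between two responses sampled at the same times."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("responses must be non-empty and of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


if __name__ == "__main__":
    # Synthetic stand-ins for logged sim/real velocity responses to a 1 m/s step.
    dt = 0.01
    t = [i * dt for i in range(300)]
    sim = [1.0 - math.exp(-ti / 0.30) for ti in t]
    real = [1.0 - math.exp(-ti / 0.35) for ti in t]
    print("tau sim :", time_constant(t, sim, 1.0))
    print("tau real:", time_constant(t, real, 1.0))
    print("RMSE    :", rmse(sim, real))
```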

2) Generate training data

[Figure: Gazebo data collection environments for ORACLE (left) and seVAE-ORACLE (right)]

The Gazebo simulator is used to generate training data for predicting collision events. The left side of the figure above shows a representative Gazebo environment for collecting ORACLE data, while the last row on the right side shows seVAE-ORACLE training environments populated with thin obstacles. Note that for seVAE-ORACLE, the VAE also needs to be trained on both real-world and simulated depth images using labeled semantic masks, as shown on the right side of the figure.

Set EVALUATE_MODE = False and RUN_IN_SIM = True in the config.py file.

Run the following in one terminal (NOT inside the conda virtual environment):

# for ORACLE or A-ORACLE
roslaunch rmf_sim rmf_sim.launch
# OR for seVAE-ORACLE
roslaunch rmf_sim rmf_sim_sevae.launch

Open another terminal, source the lmf_sim_ws workspace, and run the following inside the deep_collision_predictor folder (note: remember to set PLANNING_TYPE = 1 in config.py for seVAE-ORACLE!):

# conda activate oracle_env
python generate/generate_data_info_gain.py --save_path=path_to_folder

If --save_path is not specified, the default path in common_flags.py is used.
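The flag settings this step depends on (EVALUATE_MODE, RUN_IN_SIM, and PLANNING_TYPE for seVAE-ORACLE) are easy to forget. A minimal, hypothetical pre-flight check is sketched below. It takes any object exposing the config.py constants as attributes (for example the imported config module) and returns a list of problems; an empty list means the flags match what this step requires.

```python
from types import SimpleNamespace


def check_data_generation_flags(cfg, sevae: bool = False) -> list:
    """Return a list of problems with the data-generation flags (empty if OK).

    `cfg` is any object exposing the config.py constants as attributes.
    """
    problems = []
    if getattr(cfg, "EVALUATE_MODE", None) is not False:
        problems.append("EVALUATE_MODE should be False")
    if getattr(cfg, "RUN_IN_SIM", None) is not True:
        problems.append("RUN_IN_SIM should be True")
    if sevae and getattr(cfg, "PLANNING_TYPE", None) != 1:
        problems.append("PLANNING_TYPE should be 1 for seVAE-ORACLE")
    return problems


if __name__ == "__main__":
    # Stand-in for `import config`; use the real module in practice.
    cfg = SimpleNamespace(EVALUATE_MODE=False, RUN_IN_SIM=True, PLANNING_TYPE=0)
    print(check_data_generation_flags(cfg, sevae=True))
```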

3) Process the training data

ORACLE and A-ORACLE

Set TRAIN_INFOGAIN = False (for generating ORACLE data) or True (for labeling A-ORACLE data with Voxblox) in the config.py file.

If labeling data for A-ORACLE, run the following in one terminal (this is NOT needed for ORACLE):

roslaunch voxblox_ros voxblox_gazebo.launch

In another terminal, run

# conda activate oracle_env
python process/data_processing.py --load_path=path_to_folder --save_tf_path=path_to_folder

seVAE-ORACLE

Run the script in the seVAE repo to create the di_latent.p and di_flipped_latent.p pickle files. Put these latent pickles in the same folder as the other pickle files from step 2 above.
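A quick way to confirm that the latent pickles are in place before running the processing script is sketched below. This is a hypothetical helper, and it assumes the folder to check is the one later passed as --load_path.

```python
from pathlib import Path

# File names produced by the seVAE repo script, as named above.
LATENT_FILES = ("di_latent.p", "di_flipped_latent.p")


def missing_latent_pickles(data_folder) -> list:
    """Return the latent pickle names that are not present in `data_folder`."""
    folder = Path(data_folder)
    return [name for name in LATENT_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    import sys

    missing = missing_latent_pickles(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("missing:", missing or "none")
```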

Then run

# conda activate oracle_env
python process/data_processing_sevae.py --load_path=path_to_folder --save_tf_path=path_to_folder

If --load_path or --save_tf_path is not specified, the default path in common_flags.py is used.
The tfrecord files created by data_processing.py (or data_processing_sevae.py) are saved in save_tf_path.
Split the tfrecord files into two folders, one for training and one for validation, at an 80/20 ratio.
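The 80/20 split can be scripted. The sketch below is hypothetical and not part of this repo: it copies rather than moves the files, shuffles them with a fixed seed so the split is reproducible, and splits at file granularity. The `*.tfrecord` pattern is an assumption about the file extension that the processing scripts produce; adjust it if yours differs.

```python
import random
import shutil
from pathlib import Path


def split_tfrecords(src, train_dir, val_dir, train_ratio=0.8, seed=0,
                    pattern="*.tfrecord"):
    """Copy files matching `pattern` in `src` into training/validation folders.

    Returns (train_names, val_names). A fixed seed keeps the split reproducible.
    """
    files = sorted(Path(src).glob(pattern))
    random.Random(seed).shuffle(files)
    n_train = int(round(len(files) * train_ratio))
    train, val = files[:n_train], files[n_train:]
    for dest, group in ((Path(train_dir), train), (Path(val_dir), val)):
        dest.mkdir(parents=True, exist_ok=True)
        for f in group:
            shutil.copy2(f, dest / f.name)  # copy, so save_tf_path stays intact
    return [f.name for f in train], [f.name for f in val]
```

The resulting folders can then be passed as --train_tf_folder and --validate_tf_folder in the next step.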

4) Train the network

Train ORACLE (collision prediction). Training typically takes around 200-300 epochs to converge to a good network.

# conda activate oracle_env
python train/training.py --training_type=0 --train_tf_folder=path_to_folder --validate_tf_folder=path_to_folder --model_save_path=path_to_folder

Train seVAE-ORACLE (collision prediction). Training typically takes around 500 epochs to converge to a good network.

# conda activate oracle_env
python train/training.py --training_type=1 --train_tf_folder=path_to_folder --validate_tf_folder=path_to_folder --model_save_path=path_to_folder

Or train Attentive ORACLE (info-gain prediction). Training typically takes around 200-300 epochs to converge to a good network.

# conda activate oracle_env
python train/training.py --training_type=2 --train_tf_folder=path_to_folder --validate_tf_folder=path_to_folder --model_save_path=path_to_folder

If --train_tf_folder or --validate_tf_folder or --model_save_path is not specified, the default path in common_flags.py is used.
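The three training invocations differ only in --training_type. A hypothetical wrapper that builds the argument list for train/training.py is sketched below. It omits any folder flag left as None, so the common_flags.py default applies, and it uses the current interpreter (sys.executable), so it should be run inside oracle_env.

```python
import sys

# --training_type values as given in the commands above.
TRAINING_TYPES = {"ORACLE": 0, "seVAE-ORACLE": 1, "A-ORACLE": 2}


def training_command(network, train_tf_folder=None, validate_tf_folder=None,
                     model_save_path=None):
    """Return the argv list for train/training.py for the given network."""
    argv = [sys.executable, "train/training.py",
            f"--training_type={TRAINING_TYPES[network]}"]
    optional = (("train_tf_folder", train_tf_folder),
                ("validate_tf_folder", validate_tf_folder),
                ("model_save_path", model_save_path))
    for flag, value in optional:
        if value is not None:  # omitted flags fall back to common_flags.py
            argv.append(f"--{flag}={value}")
    return argv


if __name__ == "__main__":
    cmd = training_command("seVAE-ORACLE", "tf/train", "tf/validate", "models/sevae")
    # Print only; pass `cmd` to subprocess.run(cmd, check=True) to actually train.
    print(" ".join(cmd))
```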

Note: you can view the training/validation losses and metrics by running

# conda activate oracle_env
tensorboard --logdir=logs

5) Optimize the network for inference speed (with TensorRT, optional)

Note:

  • TensorRT files are not portable across platforms. You need to re-create them for each platform (GPU, NVIDIA software versions) that you use.
  • On multi-GPU systems, you may need to export CUDA_VISIBLE_DEVICES=0 to run TensorRT; otherwise you may get runtime errors.

Set the path to the .hdf5 file with --checkpoint_path when calling the Python scripts in the optimize folder. The resulting .trt or .onnx files are created in the main folder of this package. Copy the .trt files to the folder set in CPN_TRT_CHECKPOINT_PATH (ORACLE), seVAE_CPN_TRT_CHECKPOINT_PATH (seVAE-ORACLE), or IPN_TRT_CHECKPOINT_PATH (A-ORACLE).
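Copying the engines can be scripted. The sketch below is a hypothetical helper: `package_dir` is the package's main folder where the .trt files are created, and `checkpoint_dir` is whichever folder the relevant *_TRT_CHECKPOINT_PATH entry in config.py points to. The .onnx files are left in place.

```python
import shutil
from pathlib import Path


def install_trt_engines(package_dir, checkpoint_dir):
    """Copy every *.trt file in `package_dir` into `checkpoint_dir`.

    Returns the sorted names of the copied engine files.
    """
    dest = Path(checkpoint_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for engine in sorted(Path(package_dir).glob("*.trt")):
        shutil.copy2(engine, dest / engine.name)
        copied.append(engine.name)
    return copied
```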

ORACLE

# conda activate oracle_env
python3 optimize/convert_keras_cnn_to_tensorrt_engine.py --checkpoint_path=PATH_TO_HDF5_FILE
python3 optimize/convert_keras_combiner_tensorrt_engine.py --checkpoint_path=PATH_TO_HDF5_FILE
python3 optimize/convert_keras_rnn_to_tensorrt_engine.py --checkpoint_path=PATH_TO_HDF5_FILE
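The three ORACLE conversions can also be run in one go. The sketch below is a hypothetical helper, not part of this repo. It sets CUDA_VISIBLE_DEVICES only for the child processes (equivalent to the export in the note above) and uses the current interpreter, so it should be started inside oracle_env from the package's main folder. Passing a different `scripts` tuple would cover the seVAE-ORACLE or Attentive ORACLE scripts.

```python
import os
import subprocess
import sys

ORACLE_SCRIPTS = (
    "optimize/convert_keras_cnn_to_tensorrt_engine.py",
    "optimize/convert_keras_combiner_tensorrt_engine.py",
    "optimize/convert_keras_rnn_to_tensorrt_engine.py",
)


def conversion_commands(checkpoint_path, scripts=ORACLE_SCRIPTS):
    """Return one argv list per conversion script."""
    return [[sys.executable, s, f"--checkpoint_path={checkpoint_path}"]
            for s in scripts]


def run_conversions(checkpoint_path, scripts=ORACLE_SCRIPTS, gpu="0"):
    """Run each script in order, exposing a single GPU to TensorRT."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    for argv in conversion_commands(checkpoint_path, scripts):
        subprocess.run(argv, check=True, env=env)  # stop at the first failure
```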

seVAE-ORACLE

# conda activate oracle_env
python3 optimize/convert_keras_combiner_tensorrt_engine_sevae.py --checkpoint_path=PATH_TO_HDF5_FILE
python3 optimize/convert_keras_rnn_to_tensorrt_engine_sevae.py --checkpoint_path=PATH_TO_HDF5_FILE

Attentive ORACLE

Note: The optimize scripts for the IPN have only been verified with CUDA 10.1 + cuDNN 7.6 + TensorRT 6.0.1 or CUDA 10.2 + cuDNN 8.0 + TensorRT 7.1.3 (Jetpack 4.4). You may receive CUDA error messages with other setups. In that case, use TensorFlow inference for the IPN (set INFOGAIN_USE_TENSORRT = False in config.py).

# conda activate oracle_env
python3 optimize/convert_keras_infogain_cnn_to_tensorrt_engine.py --checkpoint_path=PATH_TO_HDF5_FILE
python3 optimize/convert_keras_infogain_predictor_to_tensorrt_engine.py --checkpoint_path=PATH_TO_HDF5_FILE

Or, to predict the information gain of only one step in every ... steps in the future (controlled by the SKIP_STEP_INFERENCE_INFOGAIN parameter in config.py), run:

# conda activate oracle_env
python3 optimize/convert_keras_infogain_predictor_to_tensorrt_engine_light_inference.py --checkpoint_path=PATH_TO_HDF5_FILE

This yields even faster inference but degrades performance (SKIP_STEP_INFERENCE_INFOGAIN = 2 or 4 is recommended).
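To make the trade-off concrete, the sketch below lists which future steps would receive an info-gain prediction for a given skip value. The indexing convention (starting at step 0) and the 8-step horizon in the usage lines are assumptions for illustration only; the light-inference script may use a different convention.

```python
def infogain_steps(horizon: int, skip: int) -> list:
    """Future step indices that get an info-gain prediction.

    Hypothetical convention: one step out of every `skip` steps, starting at 0.
    """
    if skip < 1:
        raise ValueError("skip must be >= 1")
    return list(range(0, horizon, skip))


if __name__ == "__main__":
    for skip in (1, 2, 4):
        # Assumed horizon of 8 steps, for illustration only.
        print(skip, infogain_steps(8, skip))
```

Under this assumption, a skip of 4 leaves only a quarter of the steps with a prediction, which is where both the speedup and the loss of resolution come from.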

6) Evaluate the planner

Choose PLANNING_TYPE in the config.py file (to evaluate A-ORACLE in simulation, enable the RGB camera xacro in rmf_sim/rmf_sim/rotors/urdf/delta.gazebo).

If using a TensorFlow model for inference, set COLLISION_USE_TENSORRT = False (collision network) or INFOGAIN_USE_TENSORRT = False (info-gain network) in the config.py file, and update the path to the weight files (.hdf5 files) in config.py.

If using a TensorRT model for inference, set COLLISION_USE_TENSORRT = True (collision network) or INFOGAIN_USE_TENSORRT = True (info-gain network) in the config.py file, and update the path to the weight folders (containing .trt files) in config.py. Note: on multi-GPU systems, you may need to export CUDA_VISIBLE_DEVICES=0 to run TensorRT; otherwise you may get runtime errors.
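Because the expected weight path differs between the two backends (a .hdf5 file for TensorFlow, a folder of .trt files for TensorRT), a quick check before launching can catch a mismatched setting. The helper below is a hypothetical sketch. The config.py entry names for these paths are not given here, so it takes the path and the corresponding *_USE_TENSORRT value as plain arguments.

```python
from pathlib import Path


def check_weights(path, use_tensorrt):
    """Return `path` as a Path, or raise ValueError if it does not fit the backend.

    TensorFlow (use_tensorrt=False): expects an existing .hdf5 file.
    TensorRT (use_tensorrt=True): expects a folder containing at least one .trt file.
    """
    p = Path(path)
    if use_tensorrt:
        if not p.is_dir() or not any(p.glob("*.trt")):
            raise ValueError(f"{p} is not a folder containing .trt files")
    elif not (p.is_file() and p.suffix == ".hdf5"):
        raise ValueError(f"{p} is not an existing .hdf5 file")
    return p
```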

Change the world_file argument in rmf_sim.launch to choose the testing environment; some testing environments are provided in the rmf_sim/worlds folder. Additionally, set rviz_en to true in rmf_sim.launch to visualize the network's predictions. Please refer to the wiki for detailed instructions on running the demo simulations and for documentation of the parameters in config.py.
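Editing these launch arguments can be scripted. The sketch below is a hypothetical helper that assumes world_file and rviz_en are declared as top-level `<arg name=... default=.../>` elements in rmf_sim.launch. Note that ElementTree drops XML comments when it re-serializes the file.

```python
import xml.etree.ElementTree as ET


def set_launch_args(launch_xml: str, **values) -> str:
    """Return `launch_xml` with the default of each named top-level <arg> replaced."""
    root = ET.fromstring(launch_xml)
    found = set()
    for arg in root.findall("arg"):
        name = arg.get("name")
        if name in values:
            arg.set("default", str(values[name]))
            found.add(name)
    missing = set(values) - found
    if missing:
        raise KeyError(f"args not declared in launch file: {sorted(missing)}")
    return ET.tostring(root, encoding="unicode")
```

To apply it, read rmf_sim.launch as text, pass it through `set_launch_args(text, world_file=..., rviz_en="true")`, and write the result back. Alternatively, roslaunch accepts `world_file:=...` on the command line, which overrides an `<arg>` declared with `default` without editing the file.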

In SIM

Set EVALUATE_MODE = True and RUN_IN_SIM = True in the config.py file.

Run the following in one terminal (NOT inside the conda virtual environment):

roslaunch rmf_sim rmf_sim.launch

In another terminal, run

# conda activate oracle_env
source PATH_TO_lmf_sim_ws/devel/setup.bash
source PATH_TO_ros_stuff_ws/devel/setup.bash # only if your ROS version < Noetic
python evaluate/evaluate.py

Wait until the green text START planner is printed in the second terminal, then call the service to start the planner:

rosservice call /start_planner "{}"

On the real robot (see more info about the robot here)

Follow the instructions in LMF_ws to set up the software on the real robot.

Set RUN_IN_SIM = False in the config.py file, then run:

# conda activate oracle_env
source PATH_TO_lmf_ws/devel/setup.bash
python evaluate/evaluate.py

Wait until the green text START planner is printed in your terminal, then call the service to start the planner:

rosservice call /start_planner "{}"