Kaggle Competition Solution (5th)

NeurIPS 2024 - Predict New Medicines with BELKA

https://www.kaggle.com/competitions/leash-BELKA

For discussion, please refer to:
https://www.kaggle.com/competitions/leash-BELKA/discussion/456084

1. Hardware

  • GPU: 2x Nvidia Ada A6000, each with 48 GB of VRAM
  • CPU: Intel Xeon w7-3455 @ 2.5 GHz, 24 cores, 48 threads
  • Memory: 256 GB RAM

2. OS

  • Ubuntu 22.04.4 LTS

3. Set Up Environment

  • Install Python >= 3.10.9
  • Install the dependencies listed in requirements.txt into the Python environment (pip install -r requirements.txt)
  • Set up the directory structure as shown below.
└── <solution_dir>
    ├── src
    ├── result
    ├── data
    │   ├── processed
    │   │   └── all_buildingblock.csv
    │   └── kaggle
    │       └── leash-BELKA
    │           ├── sample_submission.csv
    │           ├── train.parquet
    │           └── test.parquet
    ├── LICENSE
    └── README.md
  • Modify the path settings by editing "/src/third_party/_current_dir_.py" (see the sanity-check sketch after these steps):
# please use full paths
KAGGLE_DATA_DIR = '<solution_dir>/data/kaggle'
PROCESSED_DATA_DIR = '<solution_dir>/data/processed'
RESULT_DIR = '<solution_dir>/result'
  • Run the data-processing script:
python "/src/process-data-01/run_make_data.py"

There are 98 million molecules in the training data, so processing can take a very long time.
Alternatively, you can download the processed data from the shared Google Drive folder
/leash-BELKA-solution/data/processed:
https://drive.google.com/drive/folders/1bEBGtTJrQlYc_MQRYceBp0Kb9zGYue9H?usp=drive_link
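
The row count and schema can be checked cheaply before committing to the full preprocessing run, since Parquet stores them in the file metadata. A minimal sketch using pyarrow (an assumed dependency, not necessarily pinned in requirements.txt):

```python
# Inspect train.parquet without loading ~98M rows into RAM.
import pyarrow.parquet as pq

pf = pq.ParquetFile('<solution_dir>/data/kaggle/leash-BELKA/train.parquet')
print('rows      :', pf.metadata.num_rows)        # ~98 million molecules
print('columns   :', pf.schema_arrow.names)
print('row groups:', pf.metadata.num_row_groups)  # useful for chunked processing
```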

4. Training the models

Warning: training output will overwrite the contents of the "/result" folder.

Please run the following Python scripts to train the models:

python "/src/cnn1d-nonshare-05-mean-layer5-bn/run_train.py"
Output models:
- /result/cnn1d-mean-pool-ly5-bn-01/fold-0/checkpoint/00400000.pth
- /result/cnn1d-mean-pool-ly5-bn-01/fold-1/checkpoint/00550000.pth
- /result/cnn1d-mean-pool-ly5-bn-01/fold-3/checkpoint/00415000.pth

python "/src/transformer-fa-03/run_train.py"
Output models:
- /result/transfomer-fa-03/fold-2/checkpoint/00264000.pth
- /result/transfomer-fa-03/fold-4/checkpoint/00264000.pth

python "/src/mamba-03/run_train.py"
Output model:
- /result/mamba-03/checkpoint/00255000.pth
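
Each run_train.py writes iteration-numbered checkpoints under "/result". To confirm a run produced a loadable file, you can inspect a checkpoint as below; this sketch assumes standard PyTorch .pth serialization, and the layout of the saved object is an assumption:

```python
# Inspect a saved checkpoint (sketch; the dict layout is an assumption).
import torch

ckpt = torch.load(
    '<solution_dir>/result/mamba-03/checkpoint/00255000.pth',
    map_location='cpu',
)
if isinstance(ckpt, dict):
    print('top-level keys:', list(ckpt.keys()))
else:
    print('checkpoint object type:', type(ckpt))
```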

If you want to do local validation, you can run the following scripts:

python "/src/cnn1d-nonshare-05-mean-layer5-bn/run_valid.py"
python "/src/transformer-fa-03/run_valid.py"
python "/src/mamba-03/run_valid.py"

5. Submission CSV

Please run the following scripts:

python "/src/cnn1d-nonshare-05-mean-layer5-bn/run_submit.py"
python "/src/transformer-fa-03/run_submit.py"
python "/src/mamba-03/run_submit.py"
python "/src/run_ensemble.py"
Output file:
- /result/final-3fold-tx2a-mamba-fix.submit.csv
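
run_ensemble.py blends the three models' test predictions into the final submission. The sketch below shows one plausible blend, a plain mean over the per-model binds columns; the input file names are hypothetical placeholders, and the id/binds columns follow sample_submission.csv:

```python
# Sketch of a mean ensemble over per-model submission files.
# The three input file names are hypothetical placeholders.
import pandas as pd

parts = [
    pd.read_csv('<solution_dir>/result/cnn1d.submit.csv'),
    pd.read_csv('<solution_dir>/result/transformer.submit.csv'),
    pd.read_csv('<solution_dir>/result/mamba.submit.csv'),
]
sub = parts[0][['id']].copy()
sub['binds'] = sum(p['binds'] for p in parts) / len(parts)  # simple average
sub.to_csv('<solution_dir>/result/ensemble.submit.csv', index=False)
```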



6. Reference trained models and validation results

Authors

License

  • This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgement

"We extend our thanks to HP for providing the Z8 Fury-G5 Data Science Workstation, which empowered our deep learning experiments. The high computational power and large GPU memory enabled us to design our models swiftly."