
[Bug] Getting exit code 137 in validation step while training #1939

Open
idonahum1 opened this issue Aug 28, 2024 · 0 comments
idonahum1 commented Aug 28, 2024

Branch

main branch (mmpretrain version)

Describe the bug

Hi,

I'm running some tests to train different architectures on a specific dataset. Training itself goes fine, but once the validation step is reached, at its last iteration the process is killed with exit code 137 (no Python error is raised).
I watched the RAM usage, and it looks like the machine runs out of RAM, which is what produces the 137 exit code. I can't find why RAM usage keeps growing over time. It only happens during the validation step; during training everything runs smoothly.
This happens with different architectures, not one specific model. If I disable the validation step, training works perfectly, but then I can't track the performance of my model over time.
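
For reference, this is roughly how the memory growth can be tracked per process (an illustrative sketch assuming psutil is installed; the exact monitoring method used is not part of this report):

# Sketch: watch the resident memory of the training process and its
# dataloader workers to see where the growth happens during validation.
import time
import psutil

def watch_rss(pid, interval=10):
    proc = psutil.Process(pid)
    while True:
        workers = proc.children(recursive=True)   # dataloader worker processes
        total = proc.memory_info().rss + sum(w.memory_info().rss for w in workers)
        print(f'total RSS: {total / 1024 ** 3:.2f} GiB across {len(workers)} workers')
        time.sleep(interval)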

Things I tried to solve it (a sketch of the dataloader tweaks follows this list):

  1. Changing the batch size.
  2. Changing the number of workers.
  3. Disabling pin_memory.
  4. Moving to a different machine with much more RAM.
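
For illustration, the dataloader tweaks in items 1–3 looked roughly like this (the numbers are examples, not the exact values tried):

# Example variation of the val_dataloader from the config below:
val_dataloader = dict(
    batch_size=32,               # 1. smaller batch size
    collate_fn=dict(type='default_collate'),
    dataset=test_dataset,
    num_workers=0,               # 2. fewer (or no) workers
    persistent_workers=False,    # must be False when num_workers=0
    pin_memory=False,            # 3. pin_memory disabled
    sampler=dict(shuffle=False, type='DefaultSampler'))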

Nothing seems to solve the issue.
One thing I should mention is that the dataset is huge: around 8 million crops, roughly 800 GB. I used symlinks to split it into train and test, so the files in the train and test folders are actually symlinks pointing to a different location.
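
For context, the symlinked split was created along these lines (a hypothetical sketch; the real script, source paths, and split ratio are not in this report):

# Hypothetical sketch of the symlink-based 90/10 train/test split described above.
import os
import random

src_root = '/path/to/original/crops'        # placeholder for the original data location
dst_root = '/home/ubuntu/engineCache/atlantis-90-10-split'   # data_root from the config

files = sorted(os.listdir(src_root))
random.shuffle(files)
split = int(len(files) * 0.9)

for subset, names in (('train', files[:split]), ('test', files[split:])):
    os.makedirs(os.path.join(dst_root, subset), exist_ok=True)
    for name in names:
        os.symlink(os.path.join(src_root, name),
                   os.path.join(dst_root, subset, name))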

Any ideas?

Thank you.

Environment

{'sys.platform': 'linux',
'Python': '3.8.19 | packaged by conda-forge | (default, Mar 20 2024, '
'12:47:35) [GCC 12.3.0]',
'CUDA available': True,
'MUSA available': False,
'numpy_random_seed': 2147483648,
'GPU 0,1,2,3': 'NVIDIA L4',
'CUDA_HOME': '/usr',
'NVCC': 'Cuda compilation tools, release 10.1, V10.1.24',
'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0',
'PyTorch': '1.9.0+cu111',
'TorchVision': '0.10.0+cu111',
'OpenCV': '4.10.0',
'MMEngine': '0.10.4',
'MMCV': '2.1.0',
'MMPreTrain': '1.2.0+17a886c'}

Other information

Config -


# preprocessing settings
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(scale=256, type='Resize'),
    dict(type='PackInputs'),
]

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(scale=256, type='Resize'),
    dict(type='PackInputs'),
]

# datasets
data_root = '/home/ubuntu/engineCache/atlantis-90-10-split'
dataset_type = 'CustomDataset'
num_classes = 36723

train_dataset = dict(
        data_root=data_root,
        ann_file='meta/train.txt',
        pipeline=train_pipeline,
        data_prefix='train',
        type=dataset_type)

test_dataset = dict(
        data_root=data_root,
        pipeline=test_pipeline,
        ann_file='meta/test.txt',
        data_prefix='test',
        type=dataset_type)

# dataloaders settings
batch_size = 128

train_dataloader = dict(
    batch_size=batch_size,
    collate_fn=dict(type='default_collate'),
    dataset=train_dataset,
    num_workers=8,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))

val_dataloader = dict(
    batch_size=128,
    collate_fn=dict(type='default_collate'),
    dataset=test_dataset,
    num_workers=2,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(shuffle=False, type='DefaultSampler'))

test_dataloader = val_dataloader

# model settings
model = dict(
    backbone=dict(
        depth=18,
        num_stages=4,
        out_indices=(3, ),
        style='pytorch',
        type='ResNet'),
    head=dict(
        in_channels=512,
        loss=dict(loss_weight=1.0, type='CrossEntropyLoss'),
        num_classes=num_classes,
        hidden_dim=128,
        topk=(
            1,
            5,
        ),
        type='ElectraLinearClsHead'),
    neck=dict(type='GlobalAveragePooling'),
    type='ImageClassifier')


auto_scale_lr = dict(base_batch_size=512)

data_preprocessor = dict(
    mean=[
        115.875383,
        102.297249,
        91.7643419,
    ],
    num_classes=36723,
    std=[
        71.497372,
        66.883428,
        65.108552,
    ],
    to_rgb=True)


# Hooks settings

default_hooks = dict(
    checkpoint=dict(interval=1, type='CheckpointHook'),
    logger=dict(interval=500, type='LoggerHook'),
    param_scheduler=dict(type='ParamSchedulerHook'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    timer=dict(type='IterTimerHook'),
    visualization=dict(enable=True, type='VisualizationHook'))

# Environment settings
default_scope = 'mmpretrain'
env_cfg = dict(
    cudnn_benchmark=False,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
launcher = 'none'
load_from = None
log_level = 'INFO'

# Optimizer and learning rate settings
optim_wrapper = dict(
    optimizer=dict(lr=0.1, momentum=0.9, type='SGD', weight_decay=0.0001, nesterov=True))
param_scheduler = dict(
    by_epoch=True, gamma=0.1, milestones=[
        10,
        20
    ], type='MultiStepLR')

# Training settings
randomness = dict(deterministic=False, seed=None)
resume = False
test_cfg = dict()
train_cfg = dict(by_epoch=True, max_epochs=30, val_interval=30)
val_cfg = dict()

# Evaluation settings
val_evaluator = dict(
    topk=(
        1,
        5,
    ), type='Accuracy')

test_evaluator = val_evaluator


# Visualizer and result settings
vis_backends = [
    dict(type='LocalVisBackend'),
]
visualizer = dict(
    type='UniversalVisualizer', vis_backends=[
        dict(type='LocalVisBackend'),
    ])
work_dir = '/home/ubuntu/dev/mmpretrain/train_runs'
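
For reproduction, the config above can be launched with MMEngine's Runner (a sketch; the config file path is a placeholder):

# Sketch: run training from the pasted config with MMEngine's Runner.
from mmengine.config import Config
from mmengine.runner import Runner

cfg = Config.fromfile('configs/resnet18_atlantis.py')   # placeholder path for the config above
runner = Runner.from_cfg(cfg)
runner.train()   # validation runs per val_interval and is where the 137 exit occurs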