
[Bug] cpu not fully used, data_time load slow #1927

Open · YAwei666 opened this issue Aug 7, 2024 · 2 comments

YAwei666 commented Aug 7, 2024

Branch

main branch (mmpretrain version)

Describe the bug

python tools/train.py configs/resnet/resnet50_8xb32_in1k_2.py

Contents of configs/resnet/resnet50_8xb32_in1k_2.py:

_base_ = [
    '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs32.py',
    '../_base_/schedules/imagenet_bs256_coslr.py', '../_base_/default_runtime.py'
]
model = dict(
    backbone=dict(
        frozen_stages=2,
        init_cfg=dict(
            type='Pretrained',
            checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
            prefix='backbone',
        )),
    head=dict(num_classes=5),
)

# >>>>>>>>>>>>>>> Override the data settings here >>>>>>>>>>>>>>>>>>>

data_root = '/mnt/data//dataset'
train_dataloader = dict(
    batch_size=192,
    dataset=dict(
        type='CustomDataset',
        data_root=data_root,
        ann_file='meta/train.txt',  # We assume the subfolder format, so the annotation file could be left empty
        data_prefix='',
    ))
val_dataloader = dict(
    batch_size=192,
    dataset=dict(
        type='CustomDataset',
        data_root=data_root,
        ann_file='meta/test.txt',  # We assume the subfolder format, so the annotation file could be left empty
        data_prefix='',
    ))
test_dataloader = val_dataloader
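With data_time this high, a usual first step is to tune the worker settings in this same override instead of inheriting them from the base file. A minimal sketch, assuming the standard MMEngine dataloader keys (num_workers, persistent_workers, and pin_memory are forwarded to torch.utils.data.DataLoader); the values are starting points, not measured optima:

```python
# Hypothetical override: make worker parallelism explicit for the custom dataset.
train_dataloader = dict(
    batch_size=192,
    num_workers=12,           # inherited from the base file; raise it if cores sit idle
    persistent_workers=True,  # keep workers alive between epochs
    pin_memory=True,          # faster host-to-GPU copies
    dataset=dict(
        type='CustomDataset',
        data_root=data_root,
        ann_file='meta/train.txt',
        data_prefix='',
    ))
```

These settings only help if the workers can actually run on more than two cores, which is what the rest of this thread turns on.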

optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001))

# learning rate policy
param_scheduler = dict(
    type='MultiStepLR', by_epoch=True, milestones=[15], gamma=0.1)

# train, val, test setting
train_cfg = dict(by_epoch=True, max_epochs=30, val_interval=1)

Contents of '../_base_/models/resnet50.py':

# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='ResNeSt',
        depth=50,
        num_stages=4,
        out_indices=(3, ),
        style='pytorch'),
    neck=dict(type='GlobalAveragePooling'),
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=2048,
        loss=dict(
            type='LabelSmoothLoss',
            label_smooth_val=0.1,
            num_classes=1000,
            reduction='mean',
            loss_weight=1.0),
        topk=(1, 5),
        cal_acc=False),
    train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
)

Contents of '../_base_/datasets/imagenet_bs32.py':

# dataset settings
dataset_type = 'ImageNet'
data_preprocessor = dict(
    num_classes=1000,
    # RGB format normalization parameters
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # convert image from BGR to RGB
    to_rgb=True,
)

train_pipeline = [
    dict(type='LoadImageFromFile', imdecode_backend='pillow'),
    dict(type='RandomResizedCrop', scale=224),
    dict(type='RandomFlip', prob=0.5, direction='horizontal'),
    dict(type='PackInputs'),
]

test_pipeline = [
    dict(type='LoadImageFromFile', imdecode_backend='pillow'),
    dict(type='ResizeEdge', scale=256, edge='short'),
    dict(type='CenterCrop', crop_size=224),
    dict(type='PackInputs'),
]

train_dataloader = dict(
    batch_size=128,
    num_workers=12,
    dataset=dict(
        type=dataset_type,
        data_root='data/imagenet',
        pipeline=train_pipeline),
    sampler=dict(type='DefaultSampler', shuffle=True),
)

val_dataloader = dict(
    batch_size=128,
    num_workers=12,
    dataset=dict(
        type=dataset_type,
        data_root='data/imagenet',
        pipeline=test_pipeline),
    sampler=dict(type='DefaultSampler', shuffle=False),
)
val_evaluator = dict(type='Accuracy', topk=(1, ))  # note: `(1)` is not a tuple; the trailing comma is needed

# If you want standard test, please manually configure the test dataset
test_dataloader = val_dataloader
test_evaluator = val_evaluator

Contents of '../_base_/schedules/imagenet_bs256_coslr.py':

# optimizer
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=5e-5))

# learning policy
param_scheduler = [
    dict(type='LinearLR', start_factor=0.1, by_epoch=True, begin=0, end=5),
    dict(type='CosineAnnealingLR', T_max=95, by_epoch=True, begin=5, end=100)
]

# train, val, test setting
train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
val_cfg = dict()
test_cfg = dict()

# NOTE: `auto_scale_lr` is for automatically scaling LR,
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=1024)

Training log excerpt:

lr: 1.0000e-02  eta: 17:43:57  time: 3.5791  data_time: 3.3401  memory: 4676  loss: 0.3175

Here data_time (3.34 s) accounts for nearly all of the 3.58 s per-iteration time, so data loading, not the GPU, is the bottleneck.

[Screenshot from 2024-08-07 23-18-56]
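Since the pipeline uses imdecode_backend='pillow', one way to separate raw decode speed from the training loop is to time Pillow directly. A hypothetical diagnostic, not something run in this issue (the path, extensions, and sample count are assumptions):

```python
import time
from pathlib import Path

from PIL import Image

# Point this at the actual data_root; the extension list is an assumption.
root = Path('/mnt/data/dataset')
files = [p for p in root.rglob('*')
         if p.suffix.lower() in ('.jpg', '.jpeg', '.png')][:500]

start = time.perf_counter()
for f in files:
    with Image.open(f) as im:
        im.convert('RGB').load()  # force a full decode, as the pipeline would
elapsed = time.perf_counter() - start
print(f'{len(files)} images in {elapsed:.2f}s '
      f'({len(files) / elapsed:.1f} img/s, single process)')
```

If the single-process rate times the worker count comfortably exceeds batch_size / time (here about 192 / 3.58 ≈ 54 img/s), the workers are starved of CPU rather than the disk or decoder being slow.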

Environment

{'sys.platform': 'linux',
'Python': '3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]',
'CUDA available': True,
'MUSA available': False,
'numpy_random_seed': 2147483648,
'GPU 0': 'NVIDIA GeForce RTX 3090',
'CUDA_HOME': ':/usr/local/cuda',
'GCC': 'gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609',
'PyTorch': '1.10.1',
'TorchVision': '0.11.2',
'OpenCV': '4.10.0',
'MMEngine': '0.10.4',
'MMCV': '2.2.0',
'MMPreTrain': '1.2.0+'}

Other information

No response

YAwei666 (Author) commented Aug 7, 2024

It seems only 2 CPU cores are working. What is going on?
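A minimal diagnostic for that symptom (my sketch, assuming Linux; not run in this thread) is to check whether the process is pinned to two cores, or whether a library capped its thread pools, since dataloader workers inherit both:

```python
import multiprocessing
import os

import cv2
import torch

# Cores this process may run on; a 2-core mask here (from taskset,
# a container cpuset, etc.) would explain two busy cores despite
# num_workers=12, because forked workers inherit the mask.
print('allowed cores  :', sorted(os.sched_getaffinity(0)))
print('total cores    :', multiprocessing.cpu_count())

# Thread-pool caps that limit intra-op parallelism.
print('torch threads  :', torch.get_num_threads())
print('cv2 threads    :', cv2.getNumThreads())
print('OMP_NUM_THREADS:', os.environ.get('OMP_NUM_THREADS'))
```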

liuwake commented Sep 8, 2024

This is strange. From your information, the 12 num_workers started successfully, yet only two CPU threads are busy. Is virtualization enabled on your server? That could be what limits you to two CPU threads.
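If the affinity mask does turn out to be restricted, one possible workaround (an assumption on my part, not confirmed in this thread) is to widen it at the top of the training script, before the dataloader workers are forked:

```python
import multiprocessing
import os

# Linux-only: allow this process, and the dataloader workers forked
# from it, to run on every core the machine reports.
os.sched_setaffinity(0, range(multiprocessing.cpu_count()))
```

Launching with taskset -c from the shell achieves the same effect without editing the script.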
