HRNet

Deep High-Resolution Representation Learning for Human Pose Estimation

Abstract

This is an official PyTorch implementation of Deep High-Resolution Representation Learning for Human Pose Estimation. In this work, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results on two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.
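The multi-scale fusion described above can be illustrated with a toy 1-D sketch. It is not the repository's implementation: each branch is a plain list of numbers, channels are ignored, nearest-neighbour repetition is assumed for upsampling, and average pooling stands in for the strided convolutions used when moving to a lower resolution. All function names are hypothetical.

```python
def upsample(signal, factor):
    """Nearest-neighbour upsampling: repeat every sample `factor` times."""
    return [v for v in signal for _ in range(factor)]


def downsample(signal, factor):
    """Average pooling with window `factor` (stand-in for a strided conv)."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]


def resample(signal, target_len):
    """Bring a branch to the resolution of another branch."""
    if len(signal) == target_len:
        return list(signal)
    if len(signal) < target_len:
        return upsample(signal, target_len // len(signal))
    return downsample(signal, len(signal) // target_len)


def fuse(branches):
    """One fusion step: every branch receives the sum of all branches,
    each resampled to that branch's own resolution."""
    fused = []
    for target in branches:
        resampled = [resample(src, len(target)) for src in branches]
        fused.append([sum(vals) for vals in zip(*resampled)])
    return fused


# Three parallel branches at full, 1/2 and 1/4 resolution (toy values).
branches = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 200.0],
]
fused = fuse(branches)
for branch in fused:
    print(len(branch), branch)
```

After one fusion step every branch keeps its own resolution but carries information from all the others; repeating the step corresponds to the "over and over" exchange in the abstract.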

High-resolution representation learning plays an essential role in many vision problems, e.g., pose estimation and semantic segmentation. The high-resolution network (HRNet), recently developed for human pose estimation, maintains high-resolution representations through the whole process by connecting high-to-low resolution convolutions in parallel and produces strong high-resolution representations by repeatedly conducting fusions across parallel convolutions. In this paper, we conduct a further study on high-resolution representations by introducing a simple yet effective modification and apply it to a wide range of vision tasks. We augment the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions rather than only the representation from the high-resolution convolution as done in HRNet. This simple modification leads to stronger representations, evidenced by superior results. We show top results in semantic segmentation on Cityscapes, LIP, and PASCAL Context, and facial landmark detection on AFLW, COFW, 300W, and WFLW. In addition, we build a multi-level representation from the high-resolution representation and apply it to the Faster R-CNN object detection framework and the extended frameworks. The proposed approach achieves superior results to existing single-model networks on COCO object detection.
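The aggregation step can be followed as shape arithmetic. The sketch below rests on assumptions not stated in this README: four parallel branches at strides 4, 8, 16, and 32, with channel counts doubling from one branch to the next (so a width of 18 gives 18, 36, 72, 144), and a multi-level representation obtained by repeatedly halving the aggregated map. The function names, the 256×256 input, and the number of levels are illustrative only.

```python
def branch_shapes(width, in_h, in_w, strides=(4, 8, 16, 32)):
    """(channels, height, width) of each parallel branch.
    Assumption: channels double each time the resolution halves."""
    return [(width * 2 ** i, in_h // s, in_w // s)
            for i, s in enumerate(strides)]


def aggregate(shapes):
    """Upsample every branch to the highest resolution and concatenate
    along the channel axis; only the resulting shape is computed here."""
    _, h, w = shapes[0]
    return (sum(c for c, _, _ in shapes), h, w)


def pyramid(shape, num_levels=4):
    """Multi-level representation: halve the aggregated map repeatedly
    (assumed pooling by 2 per level, channels unchanged)."""
    c, h, w = shape
    return [(c, h // 2 ** k, w // 2 ** k) for k in range(num_levels)]


shapes = branch_shapes(width=18, in_h=256, in_w=256)
merged = aggregate(shapes)
levels = pyramid(merged)

print(shapes)  # per-branch shapes
print(merged)  # aggregated high-resolution representation
print(levels)  # multi-level representation
```

Using only the first entry of `shapes` corresponds to the original HRNet output, while `merged` corresponds to the modification described above, where all branches contribute to the high-resolution representation.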

Results and Models

Faster R-CNN

| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HRNetV2p-W18 | pytorch | 1x | 6.6 | 13.4 | 36.9 | config | model \| log |
| HRNetV2p-W18 | pytorch | 2x | 6.6 | - | 38.9 | config | model \| log |
| HRNetV2p-W32 | pytorch | 1x | 9.0 | 12.4 | 40.2 | config | model \| log |
| HRNetV2p-W32 | pytorch | 2x | 9.0 | - | 41.4 | config | model \| log |
| HRNetV2p-W40 | pytorch | 1x | 10.4 | 10.5 | 41.2 | config | model \| log |
| HRNetV2p-W40 | pytorch | 2x | 10.4 | - | 42.1 | config | model \| log |

Mask R-CNN

| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | mask AP | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HRNetV2p-W18 | pytorch | 1x | 7.0 | 11.7 | 37.7 | 34.2 | config | model \| log |
| HRNetV2p-W18 | pytorch | 2x | 7.0 | - | 39.8 | 36.0 | config | model \| log |
| HRNetV2p-W32 | pytorch | 1x | 9.4 | 11.3 | 41.2 | 37.1 | config | model \| log |
| HRNetV2p-W32 | pytorch | 2x | 9.4 | - | 42.5 | 37.8 | config | model \| log |
| HRNetV2p-W40 | pytorch | 1x | 10.9 |  | 42.1 | 37.5 | config | model \| log |
| HRNetV2p-W40 | pytorch | 2x | 10.9 |  | 42.8 | 38.2 | config | model \| log |

Cascade R-CNN

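| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HRNetV2p-W18 | pytorch | 20e | 7.0 | 11.0 | 41.2 | config | model \| log |
| HRNetV2p-W32 | pytorch | 20e | 9.4 | 11.0 | 43.3 | config | model \| log |
| HRNetV2p-W40 | pytorch | 20e | 10.8 |  | 43.8 | config | model \| log |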

Cascade Mask R-CNN

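| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | mask AP | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HRNetV2p-W18 | pytorch | 20e | 8.5 | 8.5 | 41.6 | 36.4 | config | model \| log |
| HRNetV2p-W32 | pytorch | 20e |  | 8.3 | 44.3 | 38.6 | config | model \| log |
| HRNetV2p-W40 | pytorch | 20e | 12.5 |  | 45.1 | 39.3 | config | model \| log |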

Hybrid Task Cascade (HTC)

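| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | mask AP | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HRNetV2p-W18 | pytorch | 20e | 10.8 | 4.7 | 42.8 | 37.9 | config | model \| log |
| HRNetV2p-W32 | pytorch | 20e | 13.1 | 4.9 | 45.4 | 39.9 | config | model \| log |
| HRNetV2p-W40 | pytorch | 20e | 14.6 |  | 46.4 | 40.8 | config | model \| log |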

FCOS

| Backbone | Style | GN | MS train | Lr schd | Mem (GB) | Inf time (fps) | box AP | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HRNetV2p-W18 | pytorch | Y | N | 1x | 13.0 | 12.9 | 35.3 | config | model \| log |
| HRNetV2p-W18 | pytorch | Y | N | 2x | 13.0 | - | 38.2 | config | model \| log |
| HRNetV2p-W32 | pytorch | Y | N | 1x | 17.5 | 12.9 | 39.5 | config | model \| log |
| HRNetV2p-W32 | pytorch | Y | N | 2x | 17.5 | - | 40.8 | config | model \| log |
| HRNetV2p-W18 | pytorch | Y | Y | 2x | 13.0 | 12.9 | 38.3 | config | model \| log |
| HRNetV2p-W32 | pytorch | Y | Y | 2x | 17.5 | 12.4 | 41.9 | config | model \| log |
| HRNetV2p-W48 | pytorch | Y | Y | 2x | 20.3 | 10.8 | 42.7 | config | model \| log |

Note:

  • The 28e schedule in HTC means the learning rate is decreased at epochs 24 and 27, with 28 epochs in total.
  • HRNetV2 ImageNet pretrained models are available in the HRNets for Image Classification repository.
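
A step schedule such as the 28e one in the note can be written out as a small function. This is a sketch under stated assumptions: epochs are counted from 0, the learning rate is multiplied by 0.1 at each milestone, and the base learning rate of 0.02 is a placeholder rather than a value taken from the configs. The function name `step_lr` is hypothetical.

```python
from bisect import bisect_right


def step_lr(epoch, base_lr=0.02, milestones=(24, 27), gamma=0.1):
    """Learning rate for a 0-indexed epoch under a step schedule.
    base_lr and gamma are assumed placeholder values."""
    # Number of milestones already reached decides how often to decay.
    return base_lr * gamma ** bisect_right(milestones, epoch)


total_epochs = 28
schedule = [step_lr(epoch) for epoch in range(total_epochs)]

for epoch in (0, 23, 24, 26, 27):
    print(epoch, schedule[epoch])
```

Under these assumptions the first 24 epochs run at the base rate, the next three at one tenth of it, and the final epoch at one hundredth.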

Citation

@inproceedings{SunXLW19,
  title={Deep High-Resolution Representation Learning for Human Pose Estimation},
  author={Ke Sun and Bin Xiao and Dong Liu and Jingdong Wang},
  booktitle={CVPR},
  year={2019}
}

@article{SunZJCXLMWLW19,
  title={High-Resolution Representations for Labeling Pixels and Regions},
  author={Ke Sun and Yang Zhao and Borui Jiang and Tianheng Cheng and Bin Xiao
  and Dong Liu and Yadong Mu and Xinggang Wang and Wenyu Liu and Jingdong Wang},
  journal={CoRR},
  volume={abs/1904.04514},
  year={2019}
}