HRNet

Deep High-Resolution Representation Learning for Human Pose Estimation

Abstract

This is an official PyTorch implementation of Deep High-Resolution Representation Learning for Human Pose Estimation. In this work, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.
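
As a rough illustration of the stage layout described above, the sketch below lists the parallel branches that exist at each stage. It assumes that every added branch runs at half the resolution and twice the channel width of the previous one, and it ignores details such as the stem and the block types. The function name `hrnet_branch_layout` and the parameters `num_stages` and `base_width` are hypothetical and are not taken from this repository.

```python
def hrnet_branch_layout(num_stages=4, base_width=32):
    """List the parallel branches present at each stage.

    Each branch is a (downsample_factor, channels) pair, where the factor is
    relative to the high-resolution branch. Assumption: each new branch halves
    the resolution and doubles the width of the previous one.
    """
    layout = []
    for stage in range(1, num_stages + 1):
        # Stage s keeps all earlier branches and adds one lower-resolution branch.
        branches = [(2 ** i, base_width * 2 ** i) for i in range(stage)]
        layout.append(branches)
    return layout


if __name__ == "__main__":
    for stage_idx, branches in enumerate(hrnet_branch_layout(), start=1):
        print(f"stage {stage_idx}: {branches}")
```

The high-resolution branch (factor 1) appears in every stage, which is the property the abstract emphasises: the network never drops to low resolution and recovers from it.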

High-resolution representation learning plays an essential role in many vision problems, e.g., pose estimation and semantic segmentation. The high-resolution network (HRNet), recently developed for human pose estimation, maintains high-resolution representations through the whole process by connecting high-to-low resolution convolutions in parallel and produces strong high-resolution representations by repeatedly conducting fusions across parallel convolutions. In this paper, we conduct a further study on high-resolution representations by introducing a simple yet effective modification and apply it to a wide range of vision tasks. We augment the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions rather than only the representation from the high-resolution convolution as done in HRNet. This simple modification leads to stronger representations, evidenced by superior results. We show top results in semantic segmentation on Cityscapes, LIP, and PASCAL Context, and facial landmark detection on AFLW, COFW, 300W, and WFLW. In addition, we build a multi-level representation from the high-resolution representation and apply it to the Faster R-CNN object detection framework and the extended frameworks. The proposed approach achieves superior results to existing single-model networks on COCO object detection.
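
The aggregation step described above (upsample every parallel representation to the highest resolution, then combine them) can be sketched in plain Python on nested lists. This is only a minimal sketch under stated assumptions: nearest-neighbour upsampling stands in for whatever interpolation the real model uses, the combination is a plain channel concatenation, and branch `i` is assumed to be `2**i` times smaller than branch 0. The names `upsample_nearest` and `aggregate_hrnetv2` are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsample of a C x H x W nested list by an integer factor."""
    return [
        [
            [row[x // factor] for x in range(len(row) * factor)]
            for row in channel
            for _ in range(factor)  # repeat each widened row `factor` times
        ]
        for channel in fmap
    ]


def aggregate_hrnetv2(branches):
    """Concatenate all branches along the channel axis at the highest resolution.

    Assumption: branches[i] has shape C_i x (H / 2**i) x (W / 2**i).
    The result has shape sum(C_i) x H x W.
    """
    fused = []
    for i, fmap in enumerate(branches):
        fused.extend(upsample_nearest(fmap, 2 ** i))
    return fused


if __name__ == "__main__":
    high = [[[1, 2], [3, 4]]]   # 1 channel, 2 x 2
    low = [[[9]], [[7]]]        # 2 channels, 1 x 1
    fused = aggregate_hrnetv2([high, low])
    # fused has 3 channels, each 2 x 2
    print(len(fused), len(fused[0]), len(fused[0][0]))
```

Using only the first branch, `branches[0]`, corresponds to the original HRNet output described in the text; concatenating all upsampled branches is the modification.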

Results and Models

Faster R-CNN

Backbone	Style	Lr schd	Mem (GB)	Inf time (fps)	box AP	Config	Download
HRNetV2p-W18	pytorch	1x	6.6	13.4	36.9	config	model \| log
HRNetV2p-W18	pytorch	2x	6.6	-	38.9	config	model \| log
HRNetV2p-W32	pytorch	1x	9.0	12.4	40.2	config	model \| log
HRNetV2p-W32	pytorch	2x	9.0	-	41.4	config	model \| log
HRNetV2p-W40	pytorch	1x	10.4	10.5	41.2	config	model \| log
HRNetV2p-W40	pytorch	2x	10.4	-	42.1	config	model \| log

Mask R-CNN

Backbone	Style	Lr schd	Mem (GB)	Inf time (fps)	box AP	mask AP	Config	Download
HRNetV2p-W18	pytorch	1x	7.0	11.7	37.7	34.2	config	model \| log
HRNetV2p-W18	pytorch	2x	7.0	-	39.8	36.0	config	model \| log
HRNetV2p-W32	pytorch	1x	9.4	11.3	41.2	37.1	config	model \| log
HRNetV2p-W32	pytorch	2x	9.4	-	42.5	37.8	config	model \| log
HRNetV2p-W40	pytorch	1x	10.9	-	42.1	37.5	config	model \| log
HRNetV2p-W40	pytorch	2x	10.9	-	42.8	38.2	config	model \| log

Cascade R-CNN

Backbone	Style	Lr schd	Mem (GB)	Inf time (fps)	box AP	Config	Download
HRNetV2p-W18	pytorch	20e	7.0	11.0	41.2	config	model \| log
HRNetV2p-W32	pytorch	20e	9.4	11.0	43.3	config	model \| log
HRNetV2p-W40	pytorch	20e	10.8	-	43.8	config	model \| log

Cascade Mask R-CNN

Backbone	Style	Lr schd	Mem (GB)	Inf time (fps)	box AP	mask AP	Config	Download
HRNetV2p-W18	pytorch	20e	8.5	8.5	41.6	36.4	config	model \| log
HRNetV2p-W32	pytorch	20e	-	8.3	44.3	38.6	config	model \| log
HRNetV2p-W40	pytorch	20e	12.5	-	45.1	39.3	config	model \| log

Hybrid Task Cascade (HTC)

Backbone	Style	Lr schd	Mem (GB)	Inf time (fps)	box AP	mask AP	Config	Download
HRNetV2p-W18	pytorch	20e	10.8	4.7	42.8	37.9	config	model \| log
HRNetV2p-W32	pytorch	20e	13.1	4.9	45.4	39.9	config	model \| log
HRNetV2p-W40	pytorch	20e	14.6	-	46.4	40.8	config	model \| log

FCOS

Backbone	Style	GN	MS train	Lr schd	Mem (GB)	Inf time (fps)	box AP	Config	Download
HRNetV2p-W18	pytorch	Y	N	1x	13.0	12.9	35.3	config	model \| log
HRNetV2p-W18	pytorch	Y	N	2x	13.0	-	38.2	config	model \| log
HRNetV2p-W32	pytorch	Y	N	1x	17.5	12.9	39.5	config	model \| log
HRNetV2p-W32	pytorch	Y	N	2x	17.5	-	40.8	config	model \| log
HRNetV2p-W18	pytorch	Y	Y	2x	13.0	12.9	38.3	config	model \| log
HRNetV2p-W32	pytorch	Y	Y	2x	17.5	12.4	41.9	config	model \| log
HRNetV2p-W48	pytorch	Y	Y	2x	20.3	10.8	42.7	config	model \| log

Note:

The 28e schedule in HTC decreases the learning rate at epochs 24 and 27, with 28 epochs in total.
HRNetV2 ImageNet-pretrained models are available in HRNets for Image Classification.
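
The 28e schedule from the note above can be written as a small step-decay function. This is a hedged sketch, not the training code: the base learning rate `0.02`, the decay factor `gamma=0.1`, and the 0-indexed epoch convention are assumptions, and `lr_at_epoch` is a hypothetical helper. Only the milestones (24 and 27) and the total of 28 epochs come from the note.

```python
def lr_at_epoch(epoch, base_lr=0.02, milestones=(24, 27), gamma=0.1):
    """Step-decay learning rate for a 0-indexed epoch.

    The rate is multiplied by `gamma` once for every milestone already reached.
    base_lr and gamma are assumed values, not taken from the README.
    """
    num_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** num_decays


if __name__ == "__main__":
    total_epochs = 28
    for epoch in range(total_epochs):
        print(epoch, lr_at_epoch(epoch))
```

With these assumptions the rate stays at its base value for the first 24 epochs, drops once for the next three, and drops again for the final epoch.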

Citation

@inproceedings{SunXLW19,
  title={Deep High-Resolution Representation Learning for Human Pose Estimation},
  author={Ke Sun and Bin Xiao and Dong Liu and Jingdong Wang},
  booktitle={CVPR},
  year={2019}
}

@article{SunZJCXLMWLW19,
  title={High-Resolution Representations for Labeling Pixels and Regions},
  author={Ke Sun and Yang Zhao and Borui Jiang and Tianheng Cheng and Bin Xiao and Dong Liu and Yadong Mu and Xinggang Wang and Wenyu Liu and Jingdong Wang},
  journal={CoRR},
  volume={abs/1904.04514},
  year={2019}
}
