RefinePose: Towards More Refined Human Pose Estimation

Human pose estimation is a very important research topic in computer vision and attracts more and more researchers. Recently, ViTPose based on heatmap representation refreshed the state of the art for pose estimation methods. However, we find that ViTPose still has room for improvement in our experi...

Bibliographic Details
Main Authors: Hao Dong, Guodong Wang, Chenglizhao Chen, Xinyue Zhang
Format: Article
Language: English
Published: MDPI AG 2022-12-01
Series: Electronics
Subjects: human pose estimation; ViTPose; vision transformer; heatmap; deep learning
Online Access: https://www.mdpi.com/2079-9292/11/23/4060
_version_ 1797463344995631104
author Hao Dong
Guodong Wang
Chenglizhao Chen
Xinyue Zhang
author_facet Hao Dong
Guodong Wang
Chenglizhao Chen
Xinyue Zhang
author_sort Hao Dong
collection DOAJ
description Human pose estimation is an important research topic in computer vision that attracts a growing number of researchers. Recently, ViTPose, which is based on heatmap representation, set a new state of the art for pose estimation. However, our experiments show that ViTPose still has room for improvement. On the one hand, the PatchEmbedding module of ViTPose uses a convolutional layer with a stride of 14 × 14 to downsample the input image, which loses a significant amount of feature information. On the other hand, the two decoding methods used by ViTPose (Classical Decoder and Simple Decoder) are not refined enough: transposed convolution in the Classical Decoder produces its inherent checkerboard effect, and the upsampling factor in the Simple Decoder is too large, resulting in a blurry heatmap. To this end, we propose a novel pose estimation method based on ViTPose, termed RefinePose. In RefinePose, we design the GradualEmbedding module and the Fusion Decoder to solve these two problems, respectively. More specifically, the GradualEmbedding module downsamples the image to only 1/2 of its current size in each downsampling stage, and it reduces the input image to a fixed size (16 × 12 in ViTPose) through multiple downsampling stages. At the same time, we fuse the outputs of the max pooling layers and convolutional layers in each downsampling stage, which retains more meaningful feature information. In the decoding stage, our Fusion Decoder combines bilinear interpolation with max unpooling layers and gradually upsamples the feature maps to restore the predicted heatmap. In addition, we design the FeatureAggregation module to aggregate features after each sampling step (upsampling or downsampling). We validate RefinePose on the COCO dataset. The experiments show that RefinePose achieves better performance than ViTPose.
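The two mechanisms named in the description, stage-wise halving and max pooling paired with max unpooling, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The 256 × 192 input size, the 16 × 12 target, the function names, and the idea that unpooling reuses positions recorded during pooling are all assumptions made for illustration. Convolution, bilinear interpolation, and the feature fusion itself are omitted.

```python
def halving_schedule(height, width, target_height, target_width):
    """List the (height, width) after each stride-2 stage until the target is reached."""
    sizes = [(height, width)]
    while (height, width) != (target_height, target_width):
        if height % 2 or width % 2 or height <= target_height or width <= target_width:
            raise ValueError("target is not reachable by repeated halving")
        height, width = height // 2, width // 2
        sizes.append((height, width))
    return sizes


def max_pool_2x2(feature):
    """2x2 max pooling with stride 2; also returns the (row, col) of each maximum."""
    pooled, indices = [], []
    for r in range(0, len(feature), 2):
        pooled_row, index_row = [], []
        for c in range(0, len(feature[0]), 2):
            window = [(feature[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, height, width):
    """Put each pooled value back at its recorded position; other cells stay zero."""
    restored = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            restored[r][c] = value
    return restored


# Assumed input size 256 x 192 (hypothetical): four halvings reach 16 x 12.
print(len(halving_schedule(256, 192, 16, 12)) - 1)  # 4

# Toy 4x4 feature map: pool to 2x2, then unpool back to 4x4.
feature = [[1, 5, 2, 0],
           [3, 4, 8, 1],
           [0, 2, 6, 7],
           [9, 1, 3, 2]]
pooled, indices = max_pool_2x2(feature)
print(pooled)  # [[5, 8], [9, 7]]
print(max_unpool_2x2(pooled, indices, 4, 4))
```

Unpooling restores each maximum to its original location and leaves the remaining cells at zero. That sparsity is one plausible reason for combining it with a dense upsampling such as bilinear interpolation, as the description states the Fusion Decoder does.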
first_indexed 2024-03-09T17:49:19Z
format Article
id doaj.art-5e8198a94f4343b595faa970b1651d15
institution Directory Open Access Journal
issn 2079-9292
language English
last_indexed 2024-03-09T17:49:19Z
publishDate 2022-12-01
publisher MDPI AG
record_format Article
series Electronics
spelling doaj.art-5e8198a94f4343b595faa970b1651d15
2023-11-24T10:49:57Z
eng
MDPI AG
Electronics
2079-9292
2022-12-01
11
23
4060
10.3390/electronics11234060
RefinePose: Towards More Refined Human Pose Estimation
Hao Dong (College of Computer Science and Technology, Qingdao University, Qingdao 266000, China)
Guodong Wang (College of Computer Science and Technology, Qingdao University, Qingdao 266000, China)
Chenglizhao Chen (College of Computer Science and Technology, Qingdao University, Qingdao 266000, China)
Xinyue Zhang (College of Computer Science and Technology, Qingdao University, Qingdao 266000, China)
https://www.mdpi.com/2079-9292/11/23/4060
human pose estimation
ViTPose
vision transformer
heatmap
deep learning
spellingShingle Hao Dong
Guodong Wang
Chenglizhao Chen
Xinyue Zhang
RefinePose: Towards More Refined Human Pose Estimation
Electronics
human pose estimation
ViTPose
vision transformer
heatmap
deep learning
title RefinePose: Towards More Refined Human Pose Estimation
title_full RefinePose: Towards More Refined Human Pose Estimation
title_fullStr RefinePose: Towards More Refined Human Pose Estimation
title_full_unstemmed RefinePose: Towards More Refined Human Pose Estimation
title_short RefinePose: Towards More Refined Human Pose Estimation
title_sort refinepose towards more refined human pose estimation
topic human pose estimation
ViTPose
vision transformer
heatmap
deep learning
url https://www.mdpi.com/2079-9292/11/23/4060
work_keys_str_mv AT haodong refineposetowardsmorerefinedhumanposeestimation
AT guodongwang refineposetowardsmorerefinedhumanposeestimation
AT chenglizhaochen refineposetowardsmorerefinedhumanposeestimation
AT xinyuezhang refineposetowardsmorerefinedhumanposeestimation