Cofopose: Conditional 2D Pose Estimation with Transformers


Bibliographic Details
Main Authors: Evans Aidoo, Xun Wang, Zhenguang Liu, Edwin Kwadwo Tenagyei, Kwabena Owusu-Agyemang, Seth Larweh Kodjiku, Victor Nonso Ejianya, Esther Stacy E. B. Aggrey
Format: Article
Language: English
Published: MDPI AG 2022-09-01
Series: Sensors
Subjects: DETR; human pose estimation; conditional DETR; convolutional neural network (CNN); detection
Online Access: https://www.mdpi.com/1424-8220/22/18/6821
author Evans Aidoo
Xun Wang
Zhenguang Liu
Edwin Kwadwo Tenagyei
Kwabena Owusu-Agyemang
Seth Larweh Kodjiku
Victor Nonso Ejianya
Esther Stacy E. B. Aggrey
collection DOAJ
description Human pose estimation has long been a fundamental problem in computer vision and artificial intelligence. Prominent among 2D human pose estimation (HPE) methods are the regression-based approaches, which have been shown to achieve excellent results. However, ground-truth labels are often inherently ambiguous in challenging cases such as motion blur, occlusion, and truncation, leading to poor performance measurement and lower accuracy. In this paper, we propose Cofopose, a two-stage approach consisting of person- and keypoint-detection transformers for 2D human pose estimation. Cofopose combines conditional cross-attention, a conditional DEtection TRansformer (conditional DETR), and an encoder-decoder within the transformer framework to perform both person and keypoint detection. In a significant departure from other approaches, we use conditional cross-attention and a fine-tuned conditional DETR for person detection, and transformer encoder-decoders for keypoint detection. Cofopose was extensively evaluated on two benchmark datasets, MS COCO and MPII, where it improves on existing state-of-the-art frameworks by significant margins.
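The record carries no source code; as a rough, hypothetical illustration of the two-stage pipeline the abstract describes (a person detector followed by a transformer encoder-decoder that decodes keypoint queries from each person crop), the PyTorch-style sketch below is an assumption throughout: the class name, the toy convolutional stem, the dimensions, and the random stand-in crop are all illustrative, and the paper's conditional cross-attention mechanism is not reproduced here.

```python
import torch
import torch.nn as nn


class KeypointTransformer(nn.Module):
    """Stage 2 sketch: decode one learned query per keypoint from encoded crop features."""

    def __init__(self, d_model=256, num_keypoints=17, nhead=8, num_layers=4):
        super().__init__()
        # Tiny convolutional stem standing in for a real backbone (assumption).
        self.patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # One learned query per keypoint (17 joints in MS COCO).
        self.keypoint_queries = nn.Parameter(torch.randn(num_keypoints, d_model))
        self.coord_head = nn.Linear(d_model, 2)  # predicts normalized (x, y)

    def forward(self, person_crop):
        # person_crop: (B, 3, H, W) crop produced by the stage-1 person detector.
        feats = self.patchify(person_crop)            # (B, C, h, w)
        tokens = feats.flatten(2).transpose(1, 2)     # (B, h*w, C) image tokens
        memory = self.encoder(tokens)                 # encoded features for cross-attention
        queries = self.keypoint_queries.unsqueeze(0).expand(person_crop.size(0), -1, -1)
        decoded = self.decoder(queries, memory)       # (B, K, C)
        return self.coord_head(decoded).sigmoid()     # (B, K, 2) coordinates in [0, 1]


if __name__ == "__main__":
    # Stage 1 (a conditional-DETR-style person detector) would supply person boxes;
    # here a random 256x192 tensor stands in for one cropped person.
    crop = torch.randn(1, 3, 256, 192)
    print(KeypointTransformer()(crop).shape)  # torch.Size([1, 17, 2])
```

The one-query-per-keypoint decoding shown here mirrors the general DETR-style move from heatmap regression to direct set prediction; how Cofopose conditions those queries is described only in the full article.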
format Article
id doaj.art-cfa0639a61f647e6936a0cad8c34a11e
institution Directory Open Access Journal
issn 1424-8220
language English
publishDate 2022-09-01
publisher MDPI AG
series Sensors
doi 10.3390/s22186821
citation Sensors, Vol. 22, No. 18, Article 6821, published 2022-09-01 by MDPI AG
affiliations Evans Aidoo, Xun Wang, Zhenguang Liu, Seth Larweh Kodjiku, Victor Nonso Ejianya: School of Computer & Information Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
Edwin Kwadwo Tenagyei, Esther Stacy E. B. Aggrey: School of Information & Software Engineering, University of Electronic Science & Technology of China, Chengdu 611731, China
Kwabena Owusu-Agyemang: Department of Computer Science, Kwame Nkrumah University of Science and Technology (KNUST), Kumasi 03220, Ghana
title Cofopose: Conditional 2D Pose Estimation with Transformers
topic DETR
human pose estimation
conditional DETR
convolutional neural network (CNN)
detection
url https://www.mdpi.com/1424-8220/22/18/6821