RMPE: Regional Multi-Person Pose Estimation

Hao-Shu Fang1*, Shuqin Xie1, Yu-Wing Tai2, Cewu Lu1†
1Shanghai Jiao Tong University, China  2Tencent YouTu
fhaoshu@gmail.com qweasdshu@sjtu.edu.cn yuwingtai@tencent.com lucewu@sjtu.edu.cn
*Part of this work was done when Hao-Shu Fang was a student intern at Tencent. †Cewu Lu is the corresponding author.

Abstract

Multi-person pose estimation in the wild is challenging. Although state-of-the-art human detectors have demonstrated good performance, small errors in localization and recognition are inevitable. These errors can cause failures for a single-person pose estimator (SPPE), especially for methods that solely depend on human detection results. In this paper, we propose a novel regional multi-person pose estimation (RMPE) framework to facilitate pose estimation in the presence of inaccurate human bounding boxes. Our framework consists of three components: Symmetric Spatial Transformer Network (SSTN), Parametric Pose Non-Maximum-Suppression (NMS), and Pose-Guided Proposals Generator (PGPG). Our method is able to handle inaccurate bounding boxes and redundant detections, allowing it to achieve 76.7 mAP on the MPII (multi-person) dataset [3]. Our model and source codes are publicly available at https://cvsjtu.wordpress.com/rmpe-regional-multi-person-pose-estimation/.

1 Introduction

Human pose estimation is a fundamental challenge for computer vision. In practice, recognizing the poses of multiple persons in the wild is much more challenging than recognizing the pose of a single person in an image [36, 37, 25, 28, 44]. Recent attempts approach this problem by using either a two-step framework [34, 15] or a part-based framework [9, 33, 21]. The two-step framework first detects human bounding boxes and then estimates the pose within each box independently. The part-based framework first detects body parts independently and then assembles the detected body parts to form multiple human poses. Both frameworks have their advantages and disadvantages. In the two-step framework, the accuracy of pose estimation highly depends on the quality of the detected bounding boxes. In the part-based framework, the assembled human poses are ambiguous when two or more persons are too close together. Moreover, the part-based framework loses the capability to recognize body parts from a global pose view because it only utilizes second-order dependencies between body parts.

Our approach follows the two-step framework. We aim to detect accurate human poses even when given inaccurate bounding boxes. To illustrate the problems of previous approaches, we applied the state-of-the-art object detector Faster-RCNN [35] and the SPPE Stacked Hourglass model [28]. Figure 1 and Figure 2 show two major problems: the localization error problem and the redundant detection problem. In fact, SPPE is rather vulnerable to bounding box errors. Even in cases where the bounding boxes are considered correct with IoU > 0.5, the detected human poses can still be wrong. Since SPPE produces a pose for each given bounding box, redundant detections result in redundant poses.

[Figure 1 and Figure 2: examples of the localization error problem and the redundant detection problem when applying SPPE to detected bounding boxes.]

To address the above problems, a regional multi-person pose estimation (RMPE) framework is proposed. Our framework improves the performance of SPPE-based human pose estimation algorithms. We have designed a new symmetric spatial transformer network (SSTN) which is attached to the SPPE to extract a high-quality single person region from an inaccurate bounding box. A novel parallel SPPE branch is introduced to optimize this network. To address the problem of redundant detection, a parametric pose NMS is introduced. Our parametric pose NMS eliminates redundant poses by using a novel pose distance metric to compare pose similarity. A data-driven approach is applied to optimize the pose distance parameters. Lastly, we propose a novel pose-guided human proposal generator (PGPG) to augment training samples. By learning the output distribution of a human detector for different poses, we can simulate the generation of human bounding boxes, producing a large sample of training data.

Our RMPE framework is general and is applicable to different human detectors and single person pose estimators. We applied our framework on the MPII (multi-person) dataset [3], where it outperforms the state-of-the-art methods and achieves 76.7 mAP. We have also conducted ablation studies to validate the effectiveness of each proposed component of our framework. Our model and source codes are made publicly available to support reproducible research.

2 Related Work

2.1 Single Person Pose Estimation

In single person pose estimation, the problem is simplified by only attempting to estimate the pose of a single person, who is assumed to dominate the image content. Conventional methods considered pictorial structure models. For example, tree models [43, 36, 47, 42] and random forest models [37, 11] have proven to be very efficient in human pose estimation. Graph-based models such as random field models [24] and dependency graph models [17] have also been widely investigated in the literature [16, 38, 25, 32].

More recently, deep learning has become a promising technique in object/face recognition, and human pose estimation is no exception. Representative works include DeepPose [40], DNN-based models [29, 14], and various CNN-based models [23, 39, 28, 4, 44]. Apart from simply estimating a human pose, some studies [12, 31] consider human parsing and pose estimation simultaneously. These methods perform well only when the person has been correctly located. However, this assumption is not always satisfied.

[Figure 3: Pipeline of our RMPE framework.]

2.2 Multi Person Pose Estimation

Part-based Framework. Representative works on the part-based framework [9, 15, 41, 33, 21] are reviewed here. Chen et al. presented an approach to parse largely occluded people with a graphical model that treats humans as flexible compositions of body parts [9]. Gkioxari et al. used k-poselets to jointly detect people and predict the locations of human poses [15]; the final pose localization is predicted by a weighted average of all activated poselets. Pishchulin et al. proposed DeepCut, which first detects all body parts, and then labels and assembles these parts via integer linear programming [33]. A stronger part detector based on ResNet [19] and a better incremental optimization strategy was proposed by Insafutdinov et al. [21]. While part-based methods have demonstrated good performance, their body-part detectors can be vulnerable since only small local regions are considered.

Two-step Framework. Our work follows the two-step framework [34, 15]. In our work, we use a CNN-based SPPE method to estimate poses, while Pishchulin et al. [34] used conventional pictorial structure models for pose estimation. In particular, Insafutdinov et al. [21] proposed a similar two-step pipeline which uses Faster R-CNN as the human detector and a unary DeeperCut as the pose estimator. Their method achieves only 51.0 mAP on the MPII dataset, while ours achieves 76.7 mAP. With the development of object detection and single person pose estimation, the two-step framework can achieve further advances in its performance. Our paper aims to solve the problem of imperfect human detection in the two-step framework in order to maximize the power of SPPE.

3 Regional Multi-person Pose Estimation

The pipeline of our proposed RMPE is illustrated in Figure 3. The human bounding boxes obtained by the human detector are fed into the "Symmetric STN + SPPE" module, which automatically generates pose proposals. The generated pose proposals are refined by parametric pose NMS to obtain the estimated human poses. During training, we introduce a "parallel SPPE" branch to avoid local minima and further leverage the power of the SSTN. To augment the existing training samples, a pose-guided proposals generator (PGPG) is designed. In the following sections, we present the three major components of our framework.
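To make the data flow concrete, below is a minimal Python sketch of the inference pipeline; `human_detector`, `stn`, `sppe`, `sdtn`, and `pose_nms` are hypothetical callables standing in for the components described above, not the released implementation.

```python
from typing import Callable, List

def rmpe_inference(image,
                   human_detector: Callable,  # image -> list of bounding boxes
                   stn: Callable,             # (image, box) -> (region, theta), Sec. 3.1
                   sppe: Callable,            # region -> pose in region coordinates
                   sdtn: Callable,            # (pose, theta) -> pose in image coordinates
                   pose_nms: Callable) -> List:
    """Sketch of the RMPE pipeline (Figure 3): one pose proposal per detected
    box, then parametric pose NMS removes the redundant ones (Sec. 3.2)."""
    proposals = []
    for box in human_detector(image):
        region, theta = stn(image, box)      # extract a human-dominant region
        pose = sppe(region)                  # single-person pose estimation
        proposals.append(sdtn(pose, theta))  # remap to the original image
    return pose_nms(proposals)
```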

[Figure 4: Our symmetric STN and parallel SPPE module.]

3.1 Symmetric STN and Parallel SPPE

Human proposals provided by human detectors are not well-suited to SPPE, because SPPE is specifically trained on single person images and is very sensitive to localization errors. It has been shown that small translations or crops of human proposals can significantly affect the performance of SPPE [28]. Our symmetric STN + parallel SPPE module was introduced to enhance SPPE when given imperfect human proposals. The module is shown in Figure 4.

STN and SDTN. The spatial transformer network (STN) [22] has demonstrated excellent performance in automatically selecting regions of interest. In this paper, we use the STN to extract high quality dominant human proposals. Mathematically, the STN performs a 2D affine transformation which can be expressed as

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{bmatrix} \boldsymbol{\theta}_1 & \boldsymbol{\theta}_2 & \boldsymbol{\theta}_3 \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \qquad (1)$$

where $\boldsymbol{\theta}_1$, $\boldsymbol{\theta}_2$ and $\boldsymbol{\theta}_3$ are vectors in $\mathbb{R}^2$. $\{x_i^s, y_i^s\}$ and $\{x_i^t, y_i^t\}$ are the coordinates before and after transformation, respectively. After SPPE, the resulting pose is mapped into the original human proposal image. Naturally, a spatial de-transformer network (SDTN) is required to remap the estimated human pose back to the original image coordinates. The SDTN computes the $\boldsymbol{\gamma}$ for the de-transformation and generates grids based on $\boldsymbol{\gamma}$:

$$\begin{pmatrix} x_i^t \\ y_i^t \end{pmatrix} = \begin{bmatrix} \boldsymbol{\gamma}_1 & \boldsymbol{\gamma}_2 & \boldsymbol{\gamma}_3 \end{bmatrix} \begin{pmatrix} x_i^s \\ y_i^s \\ 1 \end{pmatrix} \qquad (2)$$

Since SDTN is an inverse procedure of STN, we can obtain the following:

$$\begin{bmatrix} \boldsymbol{\gamma}_1 & \boldsymbol{\gamma}_2 \end{bmatrix} = \begin{bmatrix} \boldsymbol{\theta}_1 & \boldsymbol{\theta}_2 \end{bmatrix}^{-1} \qquad (3)$$
$$\boldsymbol{\gamma}_3 = -\begin{bmatrix} \boldsymbol{\gamma}_1 & \boldsymbol{\gamma}_2 \end{bmatrix} \boldsymbol{\theta}_3 \qquad (4)$$

To back-propagate through the SDTN, $\frac{\partial J(W,b)}{\partial \theta}$ can be derived as

$$\frac{\partial J(W,b)}{\partial \begin{bmatrix} \boldsymbol{\theta}_1 & \boldsymbol{\theta}_2 \end{bmatrix}} = \frac{\partial J(W,b)}{\partial \begin{bmatrix} \boldsymbol{\gamma}_1 & \boldsymbol{\gamma}_2 \end{bmatrix}} \times \frac{\partial \begin{bmatrix} \boldsymbol{\gamma}_1 & \boldsymbol{\gamma}_2 \end{bmatrix}}{\partial \begin{bmatrix} \boldsymbol{\theta}_1 & \boldsymbol{\theta}_2 \end{bmatrix}} + \frac{\partial J(W,b)}{\partial \boldsymbol{\gamma}_3} \times \frac{\partial \boldsymbol{\gamma}_3}{\partial \begin{bmatrix} \boldsymbol{\gamma}_1 & \boldsymbol{\gamma}_2 \end{bmatrix}} \times \frac{\partial \begin{bmatrix} \boldsymbol{\gamma}_1 & \boldsymbol{\gamma}_2 \end{bmatrix}}{\partial \begin{bmatrix} \boldsymbol{\theta}_1 & \boldsymbol{\theta}_2 \end{bmatrix}} \qquad (5)$$

with respect to $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$, and

$$\frac{\partial J(W,b)}{\partial \boldsymbol{\theta}_3} = \frac{\partial J(W,b)}{\partial \boldsymbol{\gamma}_3} \times \frac{\partial \boldsymbol{\gamma}_3}{\partial \boldsymbol{\theta}_3} \qquad (6)$$

with respect to $\boldsymbol{\theta}_3$. $\frac{\partial [\boldsymbol{\gamma}_1\ \boldsymbol{\gamma}_2]}{\partial [\boldsymbol{\theta}_1\ \boldsymbol{\theta}_2]}$ and $\frac{\partial \boldsymbol{\gamma}_3}{\partial \boldsymbol{\theta}_3}$ can be derived from Eqn. (3) and Eqn. (4) respectively.

After extracting high quality dominant human proposal regions, we can utilize off-the-shelf SPPE for accurate pose estimation. In our training, the SSTN is fine-tuned together with our SPPE.
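The inverse mapping of Eqns. (3) and (4) is a closed-form per-sample 2×2 matrix inverse. The following is a minimal PyTorch sketch under the assumption that the STN outputs batched (N, 2, 3) affine matrices in the convention of torch's `affine_grid`; it is an illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def sdtn_params(theta: torch.Tensor) -> torch.Tensor:
    """Compute SDTN parameters gamma from STN parameters theta (Eqns. 3-4).
    theta: (N, 2, 3) affine matrices [theta1 theta2 theta3]."""
    A = theta[:, :, :2]                      # [theta1 theta2], shape (N, 2, 2)
    t = theta[:, :, 2:]                      # theta3, shape (N, 2, 1)
    A_inv = torch.inverse(A)                 # Eqn. (3): [gamma1 gamma2] = [theta1 theta2]^{-1}
    t_inv = -A_inv @ t                       # Eqn. (4): gamma3 = -[gamma1 gamma2] theta3
    return torch.cat([A_inv, t_inv], dim=2)  # gamma, shape (N, 2, 3)

# Usage: warp a proposal with the STN grid, estimate the pose inside it,
# then use gamma to build the de-transformation that maps results back.
theta = torch.tensor([[[1.2, 0.0, 0.1],
                       [0.0, 1.2, -0.2]]])            # hypothetical STN output
x = torch.randn(1, 3, 256, 256)                       # one human proposal
grid = F.affine_grid(theta, list(x.size()), align_corners=False)
region = F.grid_sample(x, grid, align_corners=False)  # human-dominant region
gamma = sdtn_params(theta)                            # parameters for the SDTN
```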

Parallel SPPE

To further help the STN extract good human-dominant regions, we add a parallel SPPE branch in the training phase. This branch shares the same STN with the original SPPE, but the spatial de-transformer network (SDTN) is omitted. The human pose label of this branch is specified to be centered; that is, the output of this SPPE branch is directly compared to labels of center-located ground truth poses. We freeze all the layers of this parallel SPPE during training: its purpose is to back-propagate center-located pose errors to the STN module. If the pose extracted by the STN is not center-located, the parallel branch back-propagates large errors. In this way, we help the STN focus on the correct area and extract high quality human-dominant regions. In the testing phase, the parallel SPPE is discarded. The effectiveness of our parallel SPPE will be verified in our experiments.

Discussions. The parallel SPPE can be regarded as a regularizer during the training phase. It helps to avoid a poor solution (local minimum) where the STN does not transform the pose to the center of the extracted human region. Such a local minimum is likely because the SDTN can compensate for an off-center transformation, reducing the errors that the network would otherwise generate; these errors are necessary to train the STN. With the parallel SPPE, the STN is trained to move the human to the center of the extracted region, facilitating accurate pose estimation by the SPPE.

It may seem intuitive to replace the parallel SPPE with a regression loss on center-located poses at the output of the SPPE (before the SDTN). However, this approach would degrade the performance of our system. Although the STN can partly transform the input, it is impossible to perfectly place the person at the same location as the label. The resulting difference in coordinate space between the input and label of the SPPE would largely impair its ability to learn pose estimation and cause the performance of our main branch SPPE to decrease. Thus, to ensure that both the STN and the SPPE can fully leverage their own power, a parallel SPPE with frozen weights is indispensable for our framework. The parallel SPPE always produces large errors for non-center poses, pushing the STN to produce center-located poses without affecting the performance of the main branch SPPE.
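As a concrete reading of this training setup, the sketch below combines the main-branch loss with the frozen parallel branch's center-located loss. All module and tensor names are illustrative assumptions, and MSE on heatmaps merely stands in for whatever loss the SPPE uses.

```python
import torch.nn as nn
import torch.nn.functional as F

def freeze(module: nn.Module) -> nn.Module:
    """Freeze the parallel SPPE: it only back-propagates center-located
    pose errors into the STN and is discarded at test time."""
    for p in module.parameters():
        p.requires_grad = False
    return module

def rmpe_training_loss(stn, sdtn, sppe, parallel_sppe,
                       proposal, labels, centered_labels):
    region, theta = stn(proposal)            # human-dominant region + affine params
    main = sdtn(sppe(region), theta)         # main branch: SPPE followed by SDTN
    aux = parallel_sppe(region)              # parallel branch: no SDTN
    loss_main = F.mse_loss(main, labels)
    # The auxiliary term is large unless the STN centers the person,
    # which is exactly the gradient signal the STN needs.
    loss_aux = F.mse_loss(aux, centered_labels)
    return loss_main + loss_aux
```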

3.2 Parametric Pose NMS

Human detectors inevitably generate redundant detections, which in turn produce redundant pose estimations. Therefore, pose non-maximum suppression (NMS) is required to eliminate the redundancies. Previous methods [6, 9] are either not efficient or not accurate enough. In this paper, we propose a parametric pose NMS method. The pose $P_i$ with $m$ joints is denoted as $\{\langle k_i^1, c_i^1\rangle, \ldots, \langle k_i^m, c_i^m\rangle\}$, where $k_i^j$ and $c_i^j$ are the location and confidence score of the $j^{th}$ joint, respectively.

NMS scheme. We revisit pose NMS as follows: first, the most confident pose is selected as reference, and poses close to it are eliminated by applying the elimination criterion. This process is repeated on the remaining pose set until all redundant poses are eliminated and only unique poses are reported.

Elimination Criterion. We need to define pose similarity in order to eliminate poses that are too close and too similar to each other. We define a pose distance metric $d(P_i, P_j \mid \Lambda)$ to measure pose similarity, and a threshold $\eta$ as the elimination criterion, where $\Lambda$ is the parameter set of the function $d(\cdot)$. Our elimination criterion can be written as follows:

$$f(P_i, P_j \mid \Lambda, \eta) = \mathbb{1}[d(P_i, P_j \mid \Lambda) \leq \eta] \qquad (7)$$

If $d(\cdot)$ is smaller than $\eta$, the output of $f(\cdot)$ should be $1$, which indicates that pose $P_i$ should be eliminated due to redundancy with the reference pose $P_j$.
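The greedy scheme above, with the criterion of Eqn. (7), can be sketched as follows; `pose_distance` is assumed to implement the distance of Eqn. (10) defined below, and `scores` are per-pose confidences.

```python
def parametric_pose_nms(poses, scores, pose_distance, eta):
    """Greedy pose NMS: repeatedly keep the most confident remaining pose
    and eliminate every pose P_i with d(P_i, P_ref | Lambda) <= eta (Eqn. 7)."""
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        ref = order.pop(0)                   # most confident remaining pose
        kept.append(ref)
        order = [i for i in order            # keep only non-redundant poses
                 if pose_distance(poses[ref], poses[i]) > eta]
    return [poses[i] for i in kept]
```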

Pose Distance

Now, we present the distance function $d_{pose}(P_i, P_j)$. We assume that the box for $P_i$ is $B_i$. Then we define a soft matching function

$$K_{Sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \sum_n \tanh\frac{c_i^n}{\sigma_1} \cdot \tanh\frac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ is within } \mathcal{B}(k_i^n) \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

where $\mathcal{B}(k_i^n)$ is a box centered at $k_i^n$, and each dimension of $\mathcal{B}(k_i^n)$ is $1/10$ of the original box $B_i$. The $\tanh$ operation filters out poses with low confidence scores. When two corresponding joints both have high confidence scores, the output will be close to 1. This distance softly counts the number of joints matching between poses.

The spatial distance between parts is also considered, which can be written as

$$H_{Sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[-\frac{(k_i^n - k_j^n)^2}{\sigma_2}\right] \qquad (9)$$

By combining Eqn. (8) and Eqn. (9), the final distance function can be written as

$$d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2) \qquad (10)$$

where $\lambda$ is a weight balancing the two distances and $\Lambda = \{\sigma_1, \sigma_2, \lambda\}$. Note that the previous pose NMS [9] set the pose distance parameters and thresholds manually. In contrast, our parameters can be determined in a data-driven manner.
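A minimal NumPy sketch of Eqns. (8)-(10), assuming each pose is given as joint coordinates of shape (m, 2) plus confidences of shape (m,), and that $\mathcal{B}(k_i^n)$ is axis-aligned; the names and the exact box convention are illustrative assumptions.

```python
import numpy as np

def pose_distance(k_i, c_i, k_j, c_j, box_wh, sigma1, sigma2, lam):
    """d(P_i, P_j | Lambda) of Eqn. (10). k_*: (m, 2) joint locations,
    c_*: (m,) confidences, box_wh: (w, h) of the box B_i for pose P_i."""
    # B(k_i^n): a box centered at k_i^n whose sides are 1/10 of B_i's sides
    half = np.asarray(box_wh, dtype=float) / 10.0 / 2.0
    inside = np.all(np.abs(k_j - k_i) <= half, axis=1)
    # Eqn. (8): soft count of matched joints, zero outside B(k_i^n)
    k_sim = np.sum(np.tanh(c_i / sigma1) * np.tanh(c_j / sigma1) * inside)
    # Eqn. (9): spatial similarity between corresponding joints
    h_sim = np.sum(np.exp(-np.sum((k_i - k_j) ** 2, axis=1) / sigma2))
    return k_sim + lam * h_sim               # Eqn. (10)
```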

Optimization. Given the detected redundant poses, the four parameters in the elimination criterion $f(P_i, P_j \mid \Lambda, \eta)$ are optimized to achieve the maximal mAP on the validation set. Since an exhaustive search in a 4D space is intractable, we optimize two parameters at a time while fixing the other two, in an iterative manner. Once convergence is achieved, the parameters are fixed and used in the testing phase.
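One way to realize this iterative search is block coordinate descent over a grid. The sketch below pairs (sigma1, sigma2) and (lambda, eta), which is an assumption since the paper does not specify the grouping; `evaluate_map` is a hypothetical scorer that runs pose NMS with the trial parameters and returns validation mAP.

```python
import itertools

def optimize_nms_params(evaluate_map, init, grids, n_rounds=5):
    """Optimize {sigma1, sigma2, lambda, eta} two at a time, holding the
    other two fixed, iterating until validation mAP stops improving."""
    params = dict(init)
    for _ in range(n_rounds):
        for a, b in [("sigma1", "sigma2"), ("lambda", "eta")]:
            best = (evaluate_map(params), params[a], params[b])
            for va, vb in itertools.product(grids[a], grids[b]):
                score = evaluate_map({**params, a: va, b: vb})
                if score > best[0]:
                    best = (score, va, vb)
            params[a], params[b] = best[1], best[2]
    return params
```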

3.3 Pose-guided Proposals Generator

Data Augmentation

For the two-step pose estimation, proper data augmentation is necessary to make the SSTN+SPPE module adapt to the 'imperfect' human proposals generated by the human detector. Otherwise, the module may not work properly on the human detector's output during testing. An intuitive approach is to directly use the bounding boxes generated by the human detector during the training phase. However, the human detector can only produce one bounding box for each person. By using a proposals generator, this quantity can be greatly increased: since we already have the ground truth pose and a detected bounding box for each person, we can generate a large sample of training proposals with the same distribution as the output of the human detector. With this technique, we are able to further boost the performance of our system.

Insight. We find that the distribution of the relative offset between the detected bounding box and the ground truth bounding box varies across different poses. To be more specific, there exists a distribution $P(\delta B \mid P)$, where $\delta B$ is the offset between the coordinates of a bounding box generated by the human detector and the coordinates of the ground truth bounding box, and $P$ is the ground truth pose of a person. If we can model this distribution, we are able to generate many training samples that are similar to the human proposals generated by the human detector.

[Figure 5: Distributions of box offsets for different atomic poses and the corresponding clustered human poses.]

Implementation. Directly learning the distribution $P(\delta B \mid P)$ is difficult due to the variation of human poses. Instead, we attempt to learn the distribution $P(\delta B \mid atom(P))$, where $atom(P)$ denotes the atomic pose [46] of $P$. We follow the method used by Andriluka et al. [3] to learn the atomic poses. To derive the atomic poses from annotations of human poses, we first align all poses so that their torsos have the same length. Then we use the k-means algorithm to cluster the aligned poses, and the computed cluster centers form our atomic poses. Now, for each person instance sharing the same atomic pose $a$, we calculate the offsets between its ground truth bounding box and the detected bounding box. The offsets are then normalized by the corresponding side length of the ground truth bounding box in that direction. After this process, the offsets form a frequency distribution, and we fit the data to a Gaussian mixture distribution. Different atomic poses have different Gaussian mixture parameters. We visualize some of the distributions and their corresponding clustered human poses in Figure 5.
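A compact sketch of this procedure with scikit-learn, assuming torso-aligned poses flattened to vectors and box offsets already normalized; the cluster and component counts are illustrative choices, not values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def fit_pgpg(aligned_poses, norm_offsets, n_atomic=30, n_components=3):
    """Learn P(dB | atom(P)): k-means clusters the aligned poses into
    atomic poses, then one Gaussian mixture is fitted over the normalized
    box offsets of each cluster."""
    # aligned_poses: (N, 2m) torso-aligned joints; norm_offsets: (N, 4)
    kmeans = KMeans(n_clusters=n_atomic, n_init=10).fit(aligned_poses)
    gmms = {}
    for a in range(n_atomic):
        members = norm_offsets[kmeans.labels_ == a]
        gmms[a] = GaussianMixture(n_components=n_components).fit(members)
    return kmeans, gmms
```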

Proposals Generation. During the training phase of the SSTN+SPPE, for each annotated pose in the training sample we first look up the corresponding atomic pose $a$. Then we generate additional offsets by dense sampling according to $P(\delta B \mid a)$ to produce augmented training proposals.
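At training time, generation then reduces to sampling from the fitted mixture of the pose's cluster, continuing the sketch above (all names remain illustrative):

```python
import numpy as np

def generate_proposals(gt_box, aligned_pose, kmeans, gmms, n_samples=8):
    """Sample normalized offsets from P(dB | a) and apply them to the
    ground-truth box to obtain augmented training proposals."""
    a = int(kmeans.predict(aligned_pose.reshape(1, -1))[0])  # atomic pose id
    deltas, _ = gmms[a].sample(n_samples)    # (n_samples, 4) normalized offsets
    x1, y1, x2, y2 = gt_box
    w, h = x2 - x1, y2 - y1
    scale = np.array([w, h, w, h])           # de-normalize by box side lengths
    return np.asarray(gt_box) + deltas * scale
```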

4 Experiments

The proposed method is qualitatively and quantitatively evaluated on two standard multi-person datasets with many occlusion cases: MPII [3] and the MSCOCO 2016 Keypoints Challenge dataset [1].

4.1 Evaluation datasets

MPII Multi-Person Dataset. The challenging MPII Human Pose (multi-person) benchmark [3] consists of 3,844 training and 1,758 testing groups with occluded and overlapping people. Moreover, it contains more than 28,000 training samples for single person pose estimation. We use all the training data in the single person dataset and 90% of the multi-person training set to fine-tune the SPPE, leaving 10% for validation.

MSCOCO Keypoints Challenge. We also evaluate our method on the MSCOCO Keypoints Challenge dataset [1]. This dataset requires localization of person keypoints in challenging, uncontrolled conditions. It consists of 105,698 training and around 80,000 testing human instances. The training set contains over 1 million total labeled keypoints. The testing set is divided into four roughly equally sized splits: test-challenge, test-dev, test-standard, and test-reserve.

4.2 Implementation details in testing

In this paper, we use the VGG-based SSD-512 [26] as our human detector, as it performs object detection effectively and efficiently. In order to guarantee that the entire person region is extracted, detected human proposals are extended by 30% along both the height and width directions. We use the stacked hourglass model [28] as the single person pose estimator because of its superior performance. For the STN, we adopt ResNet-18 [19] as our localization network. For memory efficiency, we use a smaller 4-stack hourglass network as the parallel SPPE.
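For illustration, the 30% extension can be implemented as below; splitting the margin evenly on both sides and clipping to the image is our assumption, since the paper does not specify these details.

```python
def extend_box(box, img_w, img_h, ratio=0.3):
    """Extend a detected box by `ratio` along width and height
    (half on each side), clipped to the image bounds."""
    x1, y1, x2, y2 = box
    dw = (x2 - x1) * ratio / 2.0
    dh = (y2 - y1) * ratio / 2.0
    return (max(0.0, x1 - dw), max(0.0, y1 - dh),
            min(float(img_w), x2 + dw), min(float(img_h), y2 + dh))
```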

To show that our framework is general and applicable to different human detectors and pose estimators, we also run experiments replacing the human detector with a ResNet-152 based Faster-RCNN [8] and the pose estimator with PyraNet [45]. In this case, we adopt multi-scale testing for human detection and use an input size of 320×256 for PyraNet.

Table 1. Results (mAP) on the full MPII multi-person test set.

| Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Total |
|---|---|---|---|---|---|---|---|---|
| Iqbal & Gall, ECCVw16 [41] | 58.4 | 53.9 | 44.5 | 35.0 | 42.2 | 36.7 | 31.1 | 43.1 |
| DeeperCut, ECCV16 [21] | 78.4 | 72.5 | 60.2 | 51.0 | 57.2 | 52.0 | 45.4 | 59.5 |
| Levinkov et al., CVPR17 [13] | 89.8 | 85.2 | 71.8 | 59.6 | 71.1 | 63.0 | 53.5 | 70.6 |
| Insafutdinov et al., CVPR17 [20] | 88.8 | 87.0 | 75.9 | 64.9 | 74.2 | 68.8 | 60.5 | 74.3 |
| Cao et al., CVPR17 [7] | 91.2 | 87.6 | 77.7 | 66.8 | 75.4 | 68.9 | 61.7 | 75.6 |
| Newell & Deng, NIPS17 [27] | 92.1 | 89.3 | 78.9 | 69.8 | 76.2 | 71.6 | 64.7 | 77.5 |
| ours | 88.4 | 86.5 | 78.6 | 70.4 | 74.4 | 73.0 | 65.8 | 76.7 |
| ours+ | 91.3 | 90.5 | 84.0 | 76.4 | 80.3 | 79.9 | 72.4 | 82.1 |

4.3 Results

[Figure 6: Qualitative pose estimation results of our method on multi-person images.]

Results on MPII dataset.

We evaluated our method on the full MPII multi-person test set. Quantitative results are given in Table 1. Notably, we achieve an average accuracy of 72 mAP on difficult joints such as wrists, elbows, ankles, and knees, which is 3.3 mAP higher than the previous state-of-the-art result. We reach a final accuracy of 70.4 mAP for the wrist and 73 mAP for the knee. By using a stronger human detector and pose estimator, we further achieve 82.1 mAP, which is 4.6 mAP higher than the previous best result. We present some of our results in Figure 6. These results show that our method can accurately predict poses in multi-person images. More results are presented in the supplementary materials.

Results on MSCOCO Keypoints dataset. We fine-tuned the SPPE on the MSCOCO Keypoints training and validation sets, leaving 5,000 images for validation. Quantitative results on the test-dev set are given in Table 2. Our method achieves state-of-the-art performance. Note that without a specifically designed pose estimation network, our framework performs on par with Megvii [10], which proposes a new pose estimation network. This demonstrates the effectiveness of our proposed framework, and we believe that using the pose network from [10] can further boost our performance.

Table 2. Results (AP) on the MSCOCO test-dev set.

| Team | AP | $AP^{50}$ | $AP^{75}$ | $AP^{M}$ | $AP^{L}$ |
|---|---|---|---|---|---|
| CMU-Pose [7] | 61.8 | 84.9 | 67.5 | 57.1 | 68.2 |
| G-RMI [30] | 68.5 | 87.1 | 75.5 | 65.8 | 73.3 |
| Mask R-CNN [18] | 63.1 | 87.3 | 68.7 | 57.8 | 71.4 |
| Megvii [10] | 72.1 | 91.4 | 80.0 | 68.7 | 77.2 |
| ours | 61.8 | 83.7 | 69.8 | 58.6 | 67.6 |
| ours+ | 72.3 | 89.2 | 79.1 | 68.0 | 78.6 |

4.4 Ablation studies

We evaluate the effectiveness of the three proposed components, i.e., the symmetric STN (with parallel SPPE), the pose-guided proposals generator, and the parametric pose NMS. The ablation studies were conducted by removing a proposed component from the pipeline or replacing it with a conventional solver. The straightforward two-step method without the three components and the upper bound of our framework are also tested for comparison. We conducted these experiments on the MPII validation set. In addition, we replaced our human detection module to demonstrate the generality of our framework.

Table 3. Ablation studies (mAP) on the MPII validation set.

| Methods | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Total |
|---|---|---|---|---|---|---|---|---|
| RMPE, full | 90.7 | 89.7 | 84.1 | 75.4 | 80.4 | 75.5 | 67.3 | 80.8 |
| a) w/o SSTN + parallel SPPE | 89.0 | 86.9 | 82.8 | 73.5 | 77.1 | 73.3 | 65.0 | 78.2 |
| a) w/o parallel SPPE only | 89.9 | 88.0 | 83.4 | 74.7 | 77.8 | 74.0 | 65.8 | 79.1 |
| b) w/o PGPG | 82.8 | 81.0 | 77.5 | 68.2 | 74.6 | 66.8 | 60.1 | 73.0 |
| b) random jittering* | 89.3 | 87.8 | 82.3 | 70.4 | 78.4 | 73.3 | 63.8 | 77.9 |
| c) w/o pose NMS | 85.1 | 83.6 | 79.2 | 69.8 | 76.4 | 72.2 | 63.6 | 75.7 |
| c) pose NMS [9] | 88.9 | 87.8 | 83.0 | 73.8 | 78.7 | 74.6 | 66.3 | 79.1 |
| c) pose NMS [6] | 90.0 | 88.6 | 83.7 | 74.6 | 79.7 | 75.1 | 67.0 | 79.9 |
| d) straightforward two-step | 81.9 | 80.4 | 74.1 | 68.5 | 69.0 | 66.1 | 62.2 | 71.7 |
| e) oracle human detection | 94.3 | 93.4 | 87.7 | 80.2 | 84.3 | 78.9 | 70.6 | 84.2 |

Symmetric STN and Parallel SPPE. To validate the importance of the symmetric STN and parallel SPPE, two experiments were conducted. In the first experiment, we removed the SSTN, including the parallel SPPE, from our pipeline. In the second experiment, we removed only the parallel SPPE and kept the symmetric STN structure. Both results are shown in Table 3(a). We observe performance degradation when removing the parallel SPPE, which implies that the parallel SPPE with single person image labels strongly encourages the STN to extract single person regions so as to minimize the total loss.

Pose-guided Proposals Generator. In Table 3(b), we demonstrate that our pose-guided proposals generator also plays an important role in our system. In this experiment, we first remove the data augmentation from our training phase; the final mAP drops to 73.0%. Then we compare our data augmentation technique with a simple baseline, formed by jittering the locations and aspect ratios of the bounding boxes produced by the person detector to generate a large number of additional proposals, keeping those with IoU > 0.5 against the ground truth boxes. From the results in Table 3(b), we can see that our technique is better than the baseline method. Generating training proposals according to the learned distribution can be regarded as a kind of data re-sampling, which helps the model better fit the human proposals.

Parametric Pose NMS. Since pose NMS is an independent module, we can directly remove it from our final model. The experimental results are shown in Table 3(c). As we can see, the mAP drops significantly when the parametric pose NMS is removed, because the increase in the number of redundant poses ultimately decreases precision. We note that previous pose NMS schemes can also eliminate redundant detections to some extent. The state-of-the-art pose NMS algorithms [6, 9] were used to replace our parametric pose NMS, with the results given in Table 3(c). These schemes perform less effectively than ours, since they lack parameter learning. In terms of efficiency, on our validation set of 1,300 images, the publicly available implementation of [6] (http://www.vision.caltech.edu/dhall/projects/MergingPoseEstimates/) takes 62.2 seconds to perform pose NMS, while our algorithm takes only 1.8 seconds.

Upper Bound of Our Framework. We tested the upper bound of our framework by using the ground truth bounding boxes as human proposals. As shown in Table 3(e), this setting yields 84.2% mAP, verifying that our system is already close to the upper bound of the two-step framework.

4.5 Failure cases

We present some failure cases in Figure 7. The SPPE cannot handle poses that rarely occur (e.g., the person performing the 'human flag' in the first image). When two persons overlap heavily, our system gets confused and cannot separate them (e.g., the two persons on the left of the second image). Misses by the person detector also cause missed human poses (e.g., the person lying down in the third image). Finally, erroneous poses may still be detected when an object looks very similar to a human and fools both the human detector and the SPPE (e.g., the background object in the fourth image).

5 Conclusion

In this paper, a novel regional multi-person pose estimation (RMPE) framework is proposed, which significantly outperforms the state-of-the-art methods for multi-person human pose estimation in terms of accuracy and efficiency. It validates the potential of the two-step framework, i.e., human detector + SPPE, when the SPPE is adapted to the human detector. Our RMPE framework consists of three novel components: symmetric STN with parallel SPPE, parametric pose NMS, and the pose-guided proposals generator (PGPG). In particular, PGPG greatly augments the training data by learning the conditional distribution of bounding box proposals for a given human pose, and the SPPE becomes adept at handling human localization errors thanks to the symmetric STN and parallel SPPE. Finally, the parametric pose NMS reduces redundant detections. In future work, it would be interesting to explore the possibility of training our framework together with the human detector in an end-to-end manner.

References

[1] MSCOCO keypoint challenge 2016. http://mscoco.org/dataset/keypoints-challenge2016.
[2] MSCOCO keypoint leaderboard. http://mscoco.org/dataset/#keypoints-leaderboard, 2016.
[3] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[4] V. Belagiannis and A. Zisserman. Recurrent human pose estimation. arXiv preprint arXiv:1605.02914, 2016.
[5] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS: Improving object detection with one line of code. In IEEE International Conference on Computer Vision (ICCV), pages 5562–5570, 2017.
[6] X. Burgos-Artizzu, D. Hall, P. Perona, and P. Dollar. Merging pose estimates across space and time. In British Machine Vision Conference (BMVC), 2013.
[7] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[8] X. Chen and A. Gupta. An implementation of Faster R-CNN with study for region sampling. arXiv preprint arXiv:1702.02138, 2017.
[9] X. Chen and A. L. Yuille. Parsing occluded people by flexible compositions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3945–3954, 2015.
[10] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. arXiv preprint arXiv:1711.07319, 2017.
[11] M. Dantone, J. Gall, C. Leistner, and L. Van Gool. Human pose estimation using body parts dependent joint regressors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3041–3048, 2013.
[12] J. Dong, Q. Chen, X. Shen, J. Yang, and S. Yan. Towards unified human parsing and pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 843–850, 2014.
[13] E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Andres, and B. Schiele. Joint graph decomposition and node labeling: Problem, algorithms, applications. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[14] X. Fan, K. Zheng, Y. Lin, and S. Wang. Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1347–1355, 2015.
[15] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3582–3589, 2014.
[16] A. Gupta, T. Chen, F. Chen, D. Kimber, and L. S. Davis. Context and observation driven latent variable model for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[17] K. Hara and R. Chellappa. Computationally efficient regression on a dependency graph for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3390–3397, 2013.
[18] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[20] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele. ArtTrack: Articulated multi-person tracking in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[21] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision (ECCV), 2016.
[22] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Conference on Neural Information Processing Systems (NIPS), pages 2017–2025, 2015.
[23] A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. arXiv preprint arXiv:1312.7302, 2013.
[24] M. Kiefel and P. V. Gehler. Human pose estimation with fields of parts. In European Conference on Computer Vision (ECCV), pages 331–346. Springer, 2014.
[25] L. Ladicky, P. H. Torr, and A. Zisserman. Human pose estimation using a joint pixel-wise and part-wise formulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3578–3585, 2013.
[26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV), 2016.
[27] A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems (NIPS), pages 2274–2284, 2017.
[28] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. arXiv preprint arXiv:1603.06937, 2016.
[29] W. Ouyang, X. Chu, and X. Wang. Multi-source deep learning for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[30] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. arXiv preprint arXiv:1701.01779, 2017.
[31] S. Park and S.-C. Zhu. Attributed grammars for joint estimation of human attributes, part and pose. In IEEE International Conference on Computer Vision (ICCV), pages 2372–2380, 2015.
[32] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Strong appearance and expressive spatial models for human pose estimation. In IEEE International Conference on Computer Vision (ICCV), pages 3487–3494, 2013.
[33] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[34] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele. Articulated people detection and pose estimation: Reshaping the future. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3178–3185, 2012.
[35] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Conference on Neural Information Processing Systems (NIPS), pages 91–99, 2015.
[36] B. Sapp, A. Toshev, and B. Taskar. Cascaded models for articulated pose estimation. In European Conference on Computer Vision (ECCV), pages 406–420. Springer, 2010.
[37] M. Sun, P. Kohli, and J. Shotton. Conditional regression forests for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3394–3401, 2012.
[38] M. Sun, M. Telaprolu, H. Lee, and S. Savarese. An efficient branch-and-bound algorithm for optimal human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1616–1623, 2012.
[39] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Conference on Neural Information Processing Systems (NIPS), pages 1799–1807, 2014.
[40] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[41] U. Iqbal and J. Gall. Multi-person pose estimation with local joint-to-person associations. In European Conference on Computer Vision Workshops (ECCVW), Workshop on Crowd Understanding, 2016.
[42] F. Wang and Y. Li. Beyond physical connections: Tree models in human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 596–603, 2013.
[43] Y. Wang and G. Mori. Multiple tree models for occlusion and spatial constraints in human pose estimation. In European Conference on Computer Vision (ECCV), pages 710–724. Springer, 2008.
[44] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4732, 2016.
[45] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang. Learning feature pyramids for human pose estimation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[46] B. Yao and L. Fei-Fei. Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(9):1691–1703, 2012.
[47] X. Zhang, C. Li, X. Tong, W. Hu, S. Maybank, and Y. Zhang. Efficient human pose estimation via parsing a tree structure based human model. In IEEE International Conference on Computer Vision (ICCV), pages 1349–1356, 2009.