A 20-minute Tour to MMPose

MMPose 1.0 is built upon a brand-new framework. For developers with basic knowledge of deep learning, this tutorial provides an overview of the MMPose 1.0 framework design. Whether you are a user of a previous version of MMPose, or a beginner wishing to start with v1.0, this tutorial will show you how to build a project based on MMPose 1.0.


This tutorial covers the topics developers will care about when using MMPose 1.0:

  • Overall code architecture

  • How to manage modules with configs

  • How to use my own custom datasets

  • How to add new modules (backbone, head, loss function, etc.)

The content of this tutorial is organized as follows:

  • A 20 Minute Guide to MMPose Framework

    • Structure

    • Overview

    • Step1: Configs

    • Step2: Data

      • Dataset Meta Information

      • Dataset

      • Pipeline

        • i. Augmentation

        • ii. Transformation

        • iii. Encoding

        • iv. Packing

    • Step3: Model

      • Data Preprocessor

      • Backbone

      • Neck

      • Head


Structure

The file structure of MMPose 1.0 is as follows:

  • apis provides high-level APIs for model inference

  • structures provides data structures like bbox, keypoint and PoseDataSample

  • datasets supports various datasets for pose estimation

    • transforms contains a lot of useful data augmentation transforms

  • codecs provides pose encoders and decoders: an encoder encodes poses (mostly keypoints) into learning targets (e.g. heatmaps), and a decoder decodes model outputs into pose predictions

  • models provides all components of pose estimation models in a modular structure

    • pose_estimators defines all pose estimation model classes

    • data_preprocessors is for preprocessing the input data of the model

    • backbones provides a collection of backbone networks

    • necks contains various neck modules

    • heads contains various prediction heads that perform pose estimation

    • losses contains various loss functions

  • engine provides runtime components related to pose estimation

    • hooks provides various hooks of the runner

  • evaluation provides metrics for evaluating model performance

  • visualization is for visualizing skeletons, heatmaps and other information



Overview

Generally speaking, there are five parts developers will use during project development:

  • General: Environment, Hook, Checkpoint, Logger, etc.

  • Data: Dataset, Dataloader, Data Augmentation, etc.

  • Training: Optimizer, Learning Rate Scheduler, etc.

  • Model: Backbone, Neck, Head, Loss function, etc.

  • Evaluation: Metric, Evaluator, etc.

Among them, modules related to General, Training and Evaluation are often provided by the training framework MMEngine, and developers only need to call APIs and adjust the parameters. Developers mainly focus on implementing the Data and Model parts.

Step1: Configs

In MMPose, we use Python files as configs for the definition and parameter management of the whole project. Therefore, we strongly recommend that first-time users of MMPose refer to Configs.

Note that all new modules need to be registered using Registry and imported in the corresponding directory before we can create their instances from configs.

Step2: Data

The organization of data in MMPose contains:

  • Dataset Meta Information

  • Dataset

  • Pipeline

Dataset Meta Information

The meta information of a pose dataset usually includes the definition of keypoints and the skeleton, symmetry characteristics, and keypoint properties (e.g. whether a keypoint belongs to the upper or lower body, its weight and sigma). This information is important in data preprocessing, model training and evaluation. In MMPose, the dataset meta information is stored in config files under $MMPOSE/configs/_base_/datasets.

To use a custom dataset in MMPose, you need to add a new config file with the dataset meta information. Take the MPII dataset (under $MMPOSE/configs/_base_/datasets/) as an example. Here is its dataset information:

dataset_info = dict(
    dataset_name='mpii',
    paper_info=dict(
        author='Mykhaylo Andriluka and Leonid Pishchulin and '
        'Peter Gehler and Schiele, Bernt',
        title='2D Human Pose Estimation: New Benchmark and '
        'State of the Art Analysis',
        container='IEEE Conference on Computer Vision and '
        'Pattern Recognition (CVPR)',
    ),
    keypoint_info={
        0:
        dict(
            name='right_ankle',
            id=0,
            color=[255, 128, 0],
            type='lower',
            swap='left_ankle'),
        ## omitted
    },
    skeleton_info={
        0:
        dict(link=('right_ankle', 'right_knee'), id=0, color=[255, 128, 0]),
        ## omitted
    },
    joint_weights=[
        1.5, 1.2, 1., 1., 1.2, 1.5, 1., 1., 1., 1., 1.5, 1.2, 1., 1., 1.2, 1.5
    ],
    # Adapted from COCO dataset.
    sigmas=[
        0.089, 0.083, 0.107, 0.107, 0.083, 0.089, 0.026, 0.026, 0.026, 0.026,
        0.062, 0.072, 0.179, 0.179, 0.072, 0.062
    ])

  • keypoint_info contains the information about each keypoint.

    1. name: the keypoint name. The keypoint name must be unique.

    2. id: the keypoint id.

    3. color: ([B, G, R]) is used for keypoint visualization.

    4. type: ‘upper’ or ‘lower’, used in the data augmentation RandomHalfBody.

    5. swap: indicates the ‘swap pair’ (also known as ‘flip pair’). When applying a horizontal image flip, the left part becomes the right part, so the keypoints must be swapped accordingly; this is used in the data augmentation RandomFlip.

  • skeleton_info contains information about the keypoint connectivity, which is used for visualization.

  • joint_weights assigns different loss weights to different keypoints.

  • sigmas is used to calculate the OKS score. You can read keypoints-eval to learn more about it.
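
As an illustration of how sigmas enter the OKS computation, here is a hedged sketch following the COCO keypoint evaluation formula (this is not MMPose's evaluation code; the function name and arguments are ours):

```python
import numpy as np

def object_keypoint_similarity(gt, pred, visible, sigmas, area):
    """Toy OKS computation for a single instance (illustrative sketch).

    gt, pred: (K, 2) arrays of keypoint coordinates.
    visible:  (K,) boolean mask of labeled keypoints.
    sigmas:   (K,) per-keypoint normalization constants.
    area:     object area used as the scale term.
    """
    d2 = np.sum((gt - pred) ** 2, axis=-1)        # squared distances
    k2 = (2 * sigmas) ** 2                        # per-keypoint constants
    e = d2 / (2 * area * k2 + np.spacing(1))      # normalized error
    return np.mean(np.exp(-e[visible]))           # average over labeled kpts

# A perfect prediction yields an OKS of 1.0
gt = np.random.rand(16, 2)
sigmas = np.array([0.089, 0.083, 0.107, 0.107, 0.083, 0.089, 0.026, 0.026,
                   0.026, 0.026, 0.062, 0.072, 0.179, 0.179, 0.072, 0.062])
visible = np.ones(16, dtype=bool)
print(object_keypoint_similarity(gt, gt, visible, sigmas, area=1.0))  # 1.0
```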

In the model config, the user needs to specify the metainfo path of the custom dataset (e.g. $MMPOSE/configs/_base_/datasets/{your_dataset}.py) as follows:

# dataset and dataloader settings
dataset_type = 'MyCustomDataset'  # or 'CocoDataset'

train_dataloader = dict(
    dataset=dict(
        type=dataset_type,
        data_root='aaa',
        # ann file is stored at {data_root}/{ann_file}
        # e.g. aaa/annotations/train.json
        ann_file='annotations/train.json',
        # img is stored at {data_root}/{img}/
        # e.g. aaa/train/c.jpg
        data_prefix=dict(img='train'),
        # specify the new dataset meta information config file
        metainfo=dict(from_file='configs/_base_/datasets/custom.py'),
    ))

val_dataloader = dict(
    dataset=dict(
        type=dataset_type,
        data_root='aaa',
        # ann file is stored at {data_root}/{ann_file}
        # e.g. aaa/annotations/val.json
        ann_file='annotations/val.json',
        # img is stored at {data_root}/{img}/
        # e.g. aaa/val/c.jpg
        data_prefix=dict(img='val'),
        # specify the new dataset meta information config file
        metainfo=dict(from_file='configs/_base_/datasets/custom.py'),
    ))

test_dataloader = val_dataloader

More specifically, if you organize your data as follows:

├── annotations
│   ├── train.json
│   ├── val.json
├── train
│   ├── images
│      ├── 000001.jpg
├── val
│   ├── images
│      ├── 000002.jpg

You need to set your config as follows:


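For the layout above, the dataset settings could be sketched as follows (`data/` is a placeholder for the directory that contains annotations/, train/ and val/; `dataset_type` is defined as shown earlier):

```python
dataset=dict(
    type=dataset_type,
    data_root='data/',
    ann_file='annotations/train.json',
    data_prefix=dict(img='train/images/'),
)
```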

To use a custom dataset in MMPose, we recommend converting the annotations into a supported format (e.g. COCO or MPII) and directly using our implementation of the corresponding dataset. If this is not applicable, you may need to implement your own dataset class.

More details about using custom datasets can be found in Customize Datasets.


If you wish to inherit from the BaseDataset provided by MMEngine, please refer to its documentation for details.

2D Dataset

Most 2D keypoint datasets in MMPose organize the annotations in a COCO-like style. Thus we provide a base class BaseCocoStyleDataset for these datasets. We recommend that users subclass BaseCocoStyleDataset and override the methods as needed (usually __init__() and _load_annotations()) to extend to a new custom 2D keypoint dataset.


Please refer to COCO for more details about the COCO data format.

The bbox format in MMPose is xyxy instead of xywh, which is consistent with the format used in other OpenMMLab projects like MMDetection. We provide useful utils for bbox format conversion, such as bbox_xyxy2xywh, bbox_xywh2xyxy, bbox_xyxy2cs, etc., which are defined in $MMPOSE/mmpose/structures/bbox/.
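
The two formats differ only in how the second pair of numbers is interpreted; a minimal numpy sketch of the conversions (not the MMPose implementations themselves) is:

```python
import numpy as np

def bbox_xyxy2xywh(bbox_xyxy):
    """Convert (x1, y1, x2, y2) boxes to (x, y, w, h)."""
    bbox = bbox_xyxy.copy()
    bbox[..., 2] = bbox_xyxy[..., 2] - bbox_xyxy[..., 0]
    bbox[..., 3] = bbox_xyxy[..., 3] - bbox_xyxy[..., 1]
    return bbox

def bbox_xywh2xyxy(bbox_xywh):
    """Convert (x, y, w, h) boxes to (x1, y1, x2, y2)."""
    bbox = bbox_xywh.copy()
    bbox[..., 2] = bbox_xywh[..., 0] + bbox_xywh[..., 2]
    bbox[..., 3] = bbox_xywh[..., 1] + bbox_xywh[..., 3]
    return bbox

boxes = np.array([[10., 20., 50., 80.]])      # xyxy
print(bbox_xyxy2xywh(boxes))                  # [[10. 20. 40. 60.]]
print(bbox_xywh2xyxy(bbox_xyxy2xywh(boxes)))  # round-trips back to xyxy
```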

Let’s take the implementation of the CrowdPose dataset (under $MMPOSE/mmpose/datasets/datasets/body/), which is in COCO format, as an example.

class CrowdPoseDataset(BaseCocoStyleDataset):
    """CrowdPose dataset for pose estimation.

    "CrowdPose: Efficient Crowded Scenes Pose Estimation and
    A New Benchmark", CVPR'2019.
    More details can be found in the paper.

    CrowdPose keypoints::

        0: 'left_shoulder',
        1: 'right_shoulder',
        2: 'left_elbow',
        3: 'right_elbow',
        4: 'left_wrist',
        5: 'right_wrist',
        6: 'left_hip',
        7: 'right_hip',
        8: 'left_knee',
        9: 'right_knee',
        10: 'left_ankle',
        11: 'right_ankle',
        12: 'top_head',
        13: 'neck'

    Args:
        ann_file (str): Annotation file path. Default: ''.
        bbox_file (str, optional): Detection result file path. If
            ``bbox_file`` is set, detected bboxes loaded from this file will
            be used instead of ground-truth bboxes. This setting is only for
            evaluation, i.e., ignored when ``test_mode`` is ``False``.
            Default: ``None``.
        data_mode (str): Specifies the mode of data samples: ``'topdown'`` or
            ``'bottomup'``. In ``'topdown'`` mode, each data sample contains
            one instance; while in ``'bottomup'`` mode, each data sample
            contains all instances in a image. Default: ``'topdown'``
        metainfo (dict, optional): Meta information for dataset, such as class
            information. Default: ``None``.
        data_root (str, optional): The root directory for ``data_prefix`` and
            ``ann_file``. Default: ``None``.
        data_prefix (dict, optional): Prefix for training data. Default:
            ``dict(img=None, ann=None)``.
        filter_cfg (dict, optional): Config for filter data. Default: `None`.
        indices (int or Sequence[int], optional): Support using first few
            data in annotation file to facilitate training/testing on a smaller
            dataset. Default: ``None`` which means using all ``data_infos``.
        serialize_data (bool, optional): Whether to hold memory using
            serialized objects, when enabled, data loader workers can use
            shared RAM from master process instead of making a copy.
            Default: ``True``.
        pipeline (list, optional): Processing pipeline. Default: [].
        test_mode (bool, optional): ``test_mode=True`` means in test phase.
            Default: ``False``.
        lazy_init (bool, optional): Whether to skip loading annotations during
            instantiation. In some cases, such as visualization, only the meta
            information of the dataset is needed, so loading the annotation
            file is unnecessary. ``BaseDataset`` can skip loading annotations
            to save time by setting ``lazy_init=True``. Default: ``False``.
        max_refetch (int, optional): If ``BaseDataset.prepare_data`` gets a
            ``None`` image, the maximum number of extra cycles to fetch a
            valid image. Default: 1000.
    """

    METAINFO: dict = dict(from_file='configs/_base_/datasets/')

For COCO-style datasets, we only need to inherit from BaseCocoStyleDataset and specify METAINFO, then the dataset class is ready to use.

3D Dataset

We provide a base class BaseMocapDataset for 3D datasets. We recommend that users subclass BaseMocapDataset and override the methods as needed (usually __init__() and _load_annotations()) to extend to a new custom 3D keypoint dataset.


Pipeline

Data augmentations and transformations during pre-processing are organized as a pipeline. Here is an example of a typical pipeline:

# pipelines
train_pipeline = [
    dict(type='LoadImage'),
    dict(type='GetBBoxCenterScale'),
    dict(type='RandomFlip', direction='horizontal'),
    dict(type='RandomHalfBody'),
    dict(type='RandomBBoxTransform'),
    dict(type='TopdownAffine', input_size=codec['input_size']),
    dict(type='GenerateTarget', encoder=codec),
    dict(type='PackPoseInputs')
]
test_pipeline = [
    dict(type='LoadImage'),
    dict(type='GetBBoxCenterScale'),
    dict(type='TopdownAffine', input_size=codec['input_size']),
    dict(type='PackPoseInputs')
]

In a keypoint detection task, data will be transformed among three scale spaces:

  • Original Image Space: the space where the original images and annotations are stored. The sizes of different images are not necessarily the same

  • Input Image Space: the image space used for model input. All images and annotations will be transformed into this space, such as 256x256, 256x192, etc.

  • Output Space: the scale space where model outputs are located, such as 64x64 (Heatmap) or 1x1 (Regression). The supervision signal is also in this space during training
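
For instance, mapping a keypoint from the input space to the output space is a per-axis rescaling (the sizes below are illustrative):

```python
import numpy as np

input_size = np.array([192, 256])    # input space size as (w, h)
heatmap_size = np.array([48, 64])    # output space size as (w, h)

# a keypoint at the center of the input image ...
keypoints_input = np.array([[96., 128.]])

# ... maps to the center of the heatmap
scale = heatmap_size / input_size
keypoints_output = keypoints_input * scale
print(keypoints_output)  # [[24. 32.]]
```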

Here is a diagram to show the workflow of data transformation among the three scale spaces:


In MMPose, the modules used for data transformation are under $MMPOSE/mmpose/datasets/transforms, and their workflow is shown as follows:


i. Augmentation

Commonly used transforms are defined in $MMPOSE/mmpose/datasets/transforms/, such as RandomFlip, RandomHalfBody, etc. For top-down methods, Shift, Rotate and Resize are implemented by RandomBBoxTransform. For bottom-up methods, BottomupRandomAffine is used.

Transforms for 3D pose data are defined in $MMPOSE/mmpose/datasets/transforms/.


Most data transforms depend on bbox_center and bbox_scale, which can be obtained by GetBBoxCenterScale.

ii. Transformation

For 2D image inputs, affine transformation is used to convert images and annotations from the original image space to the input space. This is done by TopdownAffine for top-down methods and BottomupRandomAffine for bottom-up methods.

For pose lifting tasks, transformation is merged into Encoding.

iii. Encoding

In the training phase, after the data is transformed from the original image space into the input space, it is necessary to use GenerateTarget to obtain the training target (e.g. Gaussian heatmaps). We name this process Encoding. Conversely, the process of getting the corresponding coordinates from Gaussian heatmaps is called Decoding.

In MMPose, we collect Encoding and Decoding processes into a Codec, in which encode() and decode() are implemented.
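
To make the idea concrete, here is a toy codec sketch (illustrative only, not one of MMPose's actual codec implementations): encode() renders a keypoint as a Gaussian heatmap and decode() recovers the coordinates via argmax:

```python
import numpy as np

def encode(keypoint, heatmap_size=(64, 48), sigma=2.0):
    """Render one keypoint (x, y) as a Gaussian heatmap of shape (h, w)."""
    h, w = heatmap_size
    ys, xs = np.mgrid[0:h, 0:w]
    x, y = keypoint
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

def decode(heatmap):
    """Recover the (x, y) coordinate of the heatmap maximum."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return (float(x), float(y))

# encoding a keypoint and then decoding the heatmap recovers the coordinates
print(decode(encode((20, 30))))  # (20.0, 30.0)
```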

Currently we support the following types of Targets.

  • heatmap: Gaussian heatmaps

  • keypoint_label: keypoint representation (e.g. normalized coordinates)

  • keypoint_xy_label: axis-wise keypoint representation

  • heatmap+keypoint_label: Gaussian heatmaps and keypoint representation

  • multiscale_heatmap: multi-scale Gaussian heatmaps

  • lifting_target_label: 3D lifting target keypoint representation

and the generated targets will be packed as follows.

  • heatmaps: Gaussian heatmaps

  • keypoint_labels: keypoint representation (e.g. normalized coordinates)

  • keypoint_x_labels: keypoint x-axis representation

  • keypoint_y_labels: keypoint y-axis representation

  • keypoint_weights: keypoint visibility and weights

  • lifting_target_label: 3D lifting target representation

  • lifting_target_weight: 3D lifting target visibility and weights

Note that we unify the data format of top-down, pose-lifting and bottom-up methods, which means that a new dimension is added to represent different instances from the same image, in shape:

[batch_size, num_instances, num_keypoints, dim_coordinates]
  • top-down and pose-lifting: [B, 1, K, D]

  • bottom-up: [B, N, K, D]

The provided codecs are stored under $MMPOSE/mmpose/codecs.


If you wish to customize a new codec, you can refer to Codec for more details.

iv. Packing

After the data is transformed, you need to pack it using PackPoseInputs.

This method converts the data stored in the dictionary results into standard data structures in MMPose, such as InstanceData, PixelData, PoseDataSample, etc.

Specifically, we divide the data into gt (ground-truth) and pred (prediction), each of which has the following types:

  • instances(numpy.array): instance-level raw annotations or predictions in the original scale space

  • instance_labels(torch.tensor): instance-level training labels (e.g. normalized coordinates, keypoint visibility) in the output scale space

  • fields(torch.tensor): pixel-level training labels or predictions (e.g. Gaussian Heatmaps) in the output scale space

The following is an example of the implementation of PoseDataSample under the hood:

def get_pose_data_sample(self):
    # meta
    pose_meta = dict(
        img_shape=(600, 900),   # [h, w, c]
        crop_size=(256, 192),   # [h, w]
        heatmap_size=(64, 48),  # [h, w]
    )

    # gt_instances
    gt_instances = InstanceData()
    gt_instances.bboxes = np.random.rand(1, 4)
    gt_instances.keypoints = np.random.rand(1, 17, 2)

    # gt_instance_labels
    gt_instance_labels = InstanceData()
    gt_instance_labels.keypoint_labels = torch.rand(1, 17, 2)
    gt_instance_labels.keypoint_weights = torch.rand(1, 17)

    # pred_instances
    pred_instances = InstanceData()
    pred_instances.keypoints = np.random.rand(1, 17, 2)
    pred_instances.keypoint_scores = np.random.rand(1, 17)

    # gt_fields
    gt_fields = PixelData()
    gt_fields.heatmaps = torch.rand(17, 64, 48)

    # pred_fields
    pred_fields = PixelData()
    pred_fields.heatmaps = torch.rand(17, 64, 48)

    data_sample = PoseDataSample(
        gt_instances=gt_instances,
        gt_instance_labels=gt_instance_labels,
        gt_fields=gt_fields,
        pred_instances=pred_instances,
        pred_fields=pred_fields,
        metainfo=pose_meta)

    return data_sample

Step3: Model

In MMPose 1.0, the model consists of the following components:

  • Data Preprocessor: performs data normalization and channel transposition

  • Backbone: used for feature extraction

  • Neck: GAP, FPN, etc. (optional)

  • Head: used to implement the core algorithm and loss function

We define a base class BasePoseEstimator for models in $MMPOSE/models/pose_estimators/. All models, e.g. TopdownPoseEstimator, should inherit from this base class and override the corresponding methods.

Three modes are provided in forward() of the estimator:

  • mode == 'loss': return the result of loss function for model training

  • mode == 'predict': return the prediction result in the input space, used for model inference

  • mode == 'tensor': return the model output in the output space, i.e. model forward propagation only, for model export
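
The dispatch can be sketched as follows (a pure-Python illustration with stand-in submodules, not the actual BasePoseEstimator code):

```python
class TinyPoseEstimator:
    """Minimal illustration of the three forward modes."""

    def forward(self, inputs, data_samples=None, mode='tensor'):
        feats = self.extract_feat(inputs)
        if mode == 'loss':       # training: return a dict of losses
            return {'loss_kpt': self.head_loss(feats, data_samples)}
        elif mode == 'predict':  # inference: decode into input-space results
            return self.head_predict(feats, data_samples)
        elif mode == 'tensor':   # export: raw forward propagation only
            return feats
        raise ValueError(f'Invalid mode {mode!r}')

    # stand-ins for the real backbone/head submodules
    def extract_feat(self, inputs):
        return [x * 2 for x in inputs]

    def head_loss(self, feats, data_samples):
        return sum(feats)

    def head_predict(self, feats, data_samples):
        return [x / 2 for x in feats]

est = TinyPoseEstimator()
print(est.forward([1, 2], mode='tensor'))   # [2, 4]
print(est.forward([1, 2], mode='loss'))     # {'loss_kpt': 6}
print(est.forward([1, 2], mode='predict'))  # [1.0, 2.0]
```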

Developers should build the components by calling the corresponding registry. Taking the top-down model as an example:

class TopdownPoseEstimator(BasePoseEstimator):
    def __init__(self,
                 backbone: ConfigType,
                 neck: OptConfigType = None,
                 head: OptConfigType = None,
                 train_cfg: OptConfigType = None,
                 test_cfg: OptConfigType = None,
                 data_preprocessor: OptConfigType = None,
                 init_cfg: OptMultiConfig = None):
        super().__init__(data_preprocessor, init_cfg)

        self.backbone = MODELS.build(backbone)

        if neck is not None:
            self.neck = MODELS.build(neck)

        if head is not None:
            self.head = MODELS.build(head)

Data Preprocessor

Starting from MMPose 1.0, we have added a new module to the model called the data preprocessor, which performs preprocessing steps such as image normalization and channel transposition. It can take advantage of the high computing power of devices like the GPU, and it improves integrity in model export and deployment.

A typical data_preprocessor in the config is as follows:

data_preprocessor=dict(
    type='PoseDataPreprocessor',
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    bgr_to_rgb=True)

It transposes the channel order of the input image from BGR to RGB and normalizes the data according to mean and std.
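
Numerically, this amounts to the following (a numpy sketch using the values above; the real module operates on batched tensors and can run on the GPU):

```python
import numpy as np

mean = np.array([123.675, 116.28, 103.53])   # per-channel RGB mean
std = np.array([58.395, 57.12, 57.375])      # per-channel RGB std

def preprocess(img_bgr):
    """Flip BGR -> RGB and normalize an (H, W, 3) uint8 image."""
    img_rgb = img_bgr[..., ::-1].astype(np.float32)
    return (img_rgb - mean) / std

img = np.full((4, 4, 3), 128, dtype=np.uint8)
out = preprocess(img)
print(out.shape)  # (4, 4, 3)
```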


Backbone

MMPose provides some commonly used backbones under $MMPOSE/mmpose/models/backbones.

In practice, developers often use pre-trained backbone weights for transfer learning, which can improve the performance of the model on small datasets.

In MMPose, you can use the pre-trained weights by setting init_cfg in config:
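
A typical form looks like the following (the checkpoint path is a placeholder):

```python
init_cfg=dict(
    type='Pretrained',
    checkpoint='PATH/TO/YOUR_MODEL_WEIGHTS.pth')
```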


If you want to load a checkpoint to your backbone, you should specify the prefix:
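
For example, when the checkpoint stores a whole pose estimator and only the backbone weights should be loaded (the path is a placeholder; `prefix` selects the keys starting with `backbone.`):

```python
init_cfg=dict(
    type='Pretrained',
    prefix='backbone.',
    checkpoint='PATH/TO/YOUR_CHECKPOINT.pth')
```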


checkpoint can be either a local path or a download link. Thus, if you wish to use a pre-trained model provided by Torchvision (e.g. ResNet50), you can simply use:
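
Using the `torchvision://` prefix convention:

```python
init_cfg=dict(
    type='Pretrained',
    checkpoint='torchvision://resnet50')
```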


In addition to these commonly used backbones, you can easily use backbones from other repositories in the OpenMMLab family such as MMClassification, which all share the same config system and provide pre-trained weights.

It should be emphasized that if you add a new backbone, you need to register it by doing:

@MODELS.register_module()
class YourBackbone(BaseBackbone):

Besides, import it in $MMPOSE/mmpose/models/backbones/, and add it to __all__.


Neck

A neck is usually a module between the backbone and the head, used in some algorithms. Here are some commonly used necks:

  • Global Average Pooling (GAP)

  • Feature Pyramid Networks (FPN)

  • Feature Map Processor (FMP)

    The FeatureMapProcessor is a flexible PyTorch module designed to transform the feature outputs generated by backbones into a format suitable for heads. It achieves this by utilizing non-parametric operations such as selecting, concatenating, and rescaling. Below are some examples along with their corresponding configurations:

    • Select operation

      neck=dict(type='FeatureMapProcessor', select_index=0)

    • Concatenate operation

      neck=dict(type='FeatureMapProcessor', concat=True)

      Note that all feature maps will be resized to match the shape of the first feature map (index 0) prior to concatenation.

    • Rescale operation

      neck=dict(type='FeatureMapProcessor', scale_factor=2.0)
