Training and Testing

Launch training

Train with your PC

You can use tools/train.py to train a model on a single machine with a CPU and optionally a GPU.

Here is the full usage of the script:

python tools/train.py ${CONFIG_FILE} [ARGS]

Note

By default, MMPose prefers GPU to CPU. If you want to train a model on CPU, please empty CUDA_VISIBLE_DEVICES or set it to -1 to make GPU invisible to the program.

CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE} [ARGS]

Description of arguments:

  • CONFIG_FILE: The path to the config file.
  • --work-dir WORK_DIR: The target folder to save logs and checkpoints. Defaults to a folder with the same name as the config file under ./work_dirs.
  • --resume [RESUME]: Resume training. If a checkpoint path is given, resume from it; otherwise, automatically resume from the latest checkpoint in the work directory.
  • --amp: Enable automatic mixed precision training.
  • --no-validate: Not suggested. Disable checkpoint evaluation during training.
  • --auto-scale-lr: Automatically rescale the learning rate according to the actual batch size and the original (config) batch size.
  • --cfg-options CFG_OPTIONS: Override some settings in the config; key-value pairs in xxx=yyy format are merged into the config file. If the value to be overridden is a list, it should be of the form key="[a,b]" or key=a,b. Nested list/tuple values are also allowed, e.g. key="[(a,b),(c,d)]". Note that the quotation marks are necessary and that no whitespace is allowed. See the example after this list.
  • --show-dir SHOW_DIR: The directory to save the result visualization images generated during validation.
  • --show: Visualize the prediction results in a window.
  • --interval INTERVAL: The interval of samples to visualize.
  • --wait-time WAIT_TIME: The display time of every window (in seconds). Defaults to 1.
  • --launcher {none,pytorch,slurm,mpi}: Options for the job launcher.
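
For example, assuming the config contains the keys train_dataloader.batch_size and train_cfg.max_epochs (these keys are common in MMPose configs, but check your own config), a run that overrides them could look like this:

python tools/train.py ${CONFIG_FILE} \
    --cfg-options train_dataloader.batch_size=32 train_cfg.max_epochs=100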

Train with multiple GPUs

We provide a shell script to start a multi-GPU task with torch.distributed.launch.

bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]

Description of arguments:

  • CONFIG_FILE: The path to the config file.
  • GPU_NUM: The number of GPUs to be used.
  • [PY_ARGS]: The other optional arguments of tools/train.py; see the argument descriptions above.

You can also specify extra arguments of the launcher through environment variables. For example, change the communication port of the launcher to 29666 with the command below:

PORT=29666 bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]

If you want to start multiple training jobs on the same machine using different GPUs, you can launch them by specifying different communication ports and visible devices.

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash ./tools/dist_train.sh ${CONFIG_FILE1} 4 [PY_ARGS]
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash ./tools/dist_train.sh ${CONFIG_FILE2} 4 [PY_ARGS]

Train with multiple machines

Multiple machines in the same network

If you launch a training job with multiple machines connected via Ethernet, you can run the following commands:

On the first machine:

NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS

On the second machine:

NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS

Compared with multi-GPU training on a single machine, you need to specify some extra environment variables:

  • NNODES: The total number of machines.
  • NODE_RANK: The index (rank) of the local machine.
  • PORT: The communication port; it should be the same on all machines.
  • MASTER_ADDR: The IP address of the master machine; it should be the same on all machines.

Training across machines is usually slow if you do not have high-speed networking such as InfiniBand.

Multiple machines managed with slurm

If you run MMPose on a cluster managed with slurm, you can use the script slurm_train.sh.

[ENV_VARS] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS]

Here are the argument descriptions of the script.

  • PARTITION: The partition to use in your cluster.
  • JOB_NAME: The name of your job; you can name it as you like.
  • CONFIG_FILE: The path to the config file.
  • WORK_DIR: The target folder to save logs and checkpoints.
  • [PY_ARGS]: The other optional arguments of tools/train.py; see the argument descriptions above.

Here are the environment variables that can be used to configure the slurm job.

  • GPUS: The total number of GPUs to be used. Defaults to 8.
  • GPUS_PER_NODE: The number of GPUs to be allocated per node. Defaults to 8.
  • CPUS_PER_TASK: The number of CPUs to be allocated per task (usually one GPU corresponds to one task). Defaults to 5.
  • SRUN_ARGS: The other arguments of srun; available options can be found in the Slurm srun documentation.
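
For example, the following command launches a 16-GPU training job across two 8-GPU nodes; the partition name mm_pose and the job name are placeholders for illustration:

GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=5 ./tools/slurm_train.sh mm_pose train_res50 \
    configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py \
    work_dirs/td-hm_res50_8xb64-210e_coco-256x192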

Resume training

Resuming training means continuing from a state saved during a previous training run. This state includes the model weights, the optimizer state, and the state of the parameter schedulers (e.g. the learning rate schedule).

Automatically resume training

Users can add --resume to the end of the training command to resume training. The program will automatically load the latest checkpoint from work_dirs and resume from it. If a latest checkpoint exists in work_dirs (e.g. the previous training was interrupted), training resumes from that checkpoint; otherwise (e.g. the previous run did not save a checkpoint in time, or a new training task was started), training starts from scratch.

Here is an example of resuming training:

python tools/train.py configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py --resume

Specify the checkpoint to resume training

You can also specify the checkpoint path for --resume. MMPose will automatically read the checkpoint and resume training from it. The command is as follows:

python tools/train.py configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py \
    --resume work_dirs/td-hm_res50_8xb64-210e_coco-256x192/latest.pth

If you prefer to specify the checkpoint path in the config file instead, you need to set load_from to the checkpoint path in addition to setting resume=True.

Note that if only load_from is set without resume=True, only the weights in the checkpoint are loaded and training restarts from scratch, instead of continuing from the previous state.

The following example is equivalent to the example above that specifies the --resume parameter:

resume = True
load_from = 'work_dirs/td-hm_res50_8xb64-210e_coco-256x192/latest.pth'
# model settings
model = dict(
    ## omitted ##
    )
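
In contrast, a config that only loads the weights, for example when fine-tuning from a pre-trained checkpoint, leaves resuming disabled. A minimal sketch:

resume = False  # do not restore the optimizer or scheduler state
load_from = 'work_dirs/td-hm_res50_8xb64-210e_coco-256x192/latest.pth'  # weights only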

Freeze partial parameters during training

In some scenarios, it might be desirable to freeze certain parameters of a model during training to fine-tune specific parts or to prevent overfitting. In MMPose, you can set different hyperparameters for any module in the model by setting custom_keys in paramwise_cfg. This allows you to control the learning rate and decay coefficient for specific parts of the model.

For example, if you want to freeze the parameters in backbone.layer0 and backbone.layer1, you can modify the optimizer wrapper in the config file as:

optim_wrapper = dict(
    optimizer=dict(...),
    paramwise_cfg=dict(
        custom_keys={
            'backbone.layer0': dict(lr_mult=0, decay_mult=0),
            'backbone.layer1': dict(lr_mult=0, decay_mult=0),
        }))

This configuration will freeze the parameters in backbone.layer0 and backbone.layer1 by setting their learning rate and decay coefficient to 0. By using this approach, you can effectively control the training process and fine-tune specific parts of your model as needed.

Automatic Mixed Precision (AMP) training

Mixed precision training reduces training time and memory usage without changing the model or degrading its training accuracy, thus supporting larger batch sizes, larger models, and larger input sizes.

To enable Automatic Mixed Precision (AMP) training, add --amp to the end of the training command, as follows:

python tools/train.py ${CONFIG_FILE} --amp

Specific examples are as follows:

python tools/train.py configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py --amp

Set the random seed

If you want to specify the random seed during training, you can use the following command:

python ./tools/train.py \
    ${CONFIG} \                               # config file
    --cfg-options randomness.seed=2023 \      # set the random seed = 2023
    [randomness.diff_rank_seed=True] \        # Set different seeds according to rank.
    [randomness.deterministic=True]           # Set the cuDNN backend deterministic option to True
# `[]` marks optional parameters; do not type the brackets when entering the actual command

randomness has three parameters that can be set, with the following meanings.

  • randomness.seed=2023, set the random seed to 2023.

  • randomness.diff_rank_seed=True, set different seeds according to global rank. Defaults to False.

  • randomness.deterministic=True, set the deterministic option for the cuDNN backend, i.e., set torch.backends.cudnn.deterministic to True and torch.backends.cudnn.benchmark to False. Defaults to False. See PyTorch Randomness for more details.
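
These options can also be written directly into the config file instead of being passed on the command line; a minimal sketch, assuming a standard MMEngine-style runtime config:

randomness = dict(seed=2023, diff_rank_seed=True, deterministic=True)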

Training Log

During training, the training log will be printed in the console as follows:

07/14 08:26:50 - mmengine - INFO - Epoch(train) [38][ 6/38]  base_lr: 5.148343e-04 lr: 5.148343e-04  eta: 0:15:34  time: 0.540754  data_time: 0.394292  memory: 3141  loss: 0.006220  loss_kpt: 0.006220  acc_pose: 1.000000

The training log contains the following information:

  • 07/14 08:26:50: The current time.

  • mmengine: The name of the program.

  • INFO or WARNING: The log level.

  • Epoch(train): The current training stage. train means the training stage, val means the validation stage.

  • [38][ 6/38]: The current epoch and the current iteration within the epoch (here, iteration 6 of 38 in epoch 38).

  • base_lr: The base learning rate.

  • lr: The current (real) learning rate.

  • eta: The estimated time remaining to finish training.

  • time: The time (in seconds) spent on the current iteration.

  • data_time: The time (in seconds) spent on data processing (I/O and transforms).

  • memory: The GPU memory (MB) allocated by the program.

  • loss: The total loss value of the current iteration.

  • loss_kpt: The keypoint loss computed in the head module.

  • acc_pose: The pose accuracy computed in the head module.

Visualize training process

Monitoring the training process is essential for understanding the performance of your model and making necessary adjustments. In this section, we will introduce two methods to visualize the training process of your MMPose model: TensorBoard and the MMEngine Visualizer.

TensorBoard

TensorBoard is a powerful tool that allows you to visualize how losses change during training. To enable TensorBoard visualization, you need to:

  1. Install TensorBoard

    pip install tensorboard
    
  2. Enable TensorBoard in the config file

    visualizer = dict(vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='TensorboardVisBackend'),
    ])
    

The event files generated by TensorBoard will be saved under the experiment log folder ${WORK_DIR}, which defaults to work_dirs/${CONFIG} or can be specified using the --work-dir option. To visualize the training process, use the following command:

tensorboard --logdir ${WORK_DIR}/${TIMESTAMP}/vis_data

MMEngine visualizer

MMPose also supports visualizing model inference results during validation. To activate this function, please use the --show option or set --show-dir when launching training. This feature provides an effective way to analyze the model’s performance on specific examples and make any necessary adjustments.
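
For example, the following command saves the visualization images produced during validation to the specified directory; the directory name vis_results is arbitrary:

python tools/train.py \
    configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py \
    --show-dir vis_results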

Test your model

Test with your PC

You can use tools/test.py to test a model on a single machine with a CPU and optionally a GPU.

Here is the full usage of the script:

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]

Note

By default, MMPose prefers GPU to CPU. If you want to test a model on CPU, please empty CUDA_VISIBLE_DEVICES or set it to -1 to make GPU invisible to the program.

CUDA_VISIBLE_DEVICES=-1 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]

Description of arguments:

  • CONFIG_FILE: The path to the config file.
  • CHECKPOINT_FILE: The path to the checkpoint file (it can be an HTTP link; pre-trained checkpoints can be found in the MMPose model zoo).
  • --work-dir WORK_DIR: The directory to save the file containing evaluation metrics.
  • --out OUT: The path to save the file containing evaluation metrics.
  • --dump DUMP: The path to dump all outputs of the model for offline evaluation. See the example after this list.
  • --cfg-options CFG_OPTIONS: Override some settings in the config; key-value pairs in xxx=yyy format are merged into the config file. If the value to be overridden is a list, it should be of the form key="[a,b]" or key=a,b. Nested list/tuple values are also allowed, e.g. key="[(a,b),(c,d)]". Note that the quotation marks are necessary and that no whitespace is allowed.
  • --show-dir SHOW_DIR: The directory to save the result visualization images.
  • --show: Visualize the prediction results in a window.
  • --interval INTERVAL: The interval of samples to visualize.
  • --wait-time WAIT_TIME: The display time of every window (in seconds). Defaults to 1.
  • --launcher {none,pytorch,slurm,mpi}: Options for the job launcher.
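
For example, the following command evaluates a checkpoint and dumps the raw model outputs for later offline evaluation; the output paths are arbitrary placeholders:

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} \
    --work-dir work_dirs/eval --dump work_dirs/eval/predictions.pkl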

Test with multiple GPUs

We provide a shell script to start a multi-GPU task with torch.distributed.launch.

bash ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]

Description of arguments:

  • CONFIG_FILE: The path to the config file.
  • CHECKPOINT_FILE: The path to the checkpoint file (it can be an HTTP link; pre-trained checkpoints can be found in the MMPose model zoo).
  • GPU_NUM: The number of GPUs to be used.
  • [PY_ARGS]: The other optional arguments of tools/test.py; see the argument descriptions above.

You can also specify extra arguments of the launcher through environment variables. For example, change the communication port of the launcher to 29666 with the command below:

PORT=29666 bash ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]

If you want to start multiple test jobs on the same machine using different GPUs, you can launch them by specifying different communication ports and visible devices.

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash ./tools/dist_test.sh ${CONFIG_FILE1} ${CHECKPOINT_FILE} 4 [PY_ARGS]
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash ./tools/dist_test.sh ${CONFIG_FILE2} ${CHECKPOINT_FILE} 4 [PY_ARGS]

Test with multiple machines

Multiple machines in the same network

If you launch a test job with multiple machines connected via Ethernet, you can run the following commands:

On the first machine:

NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT_FILE $GPUS

On the second machine:

NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT_FILE $GPUS

Compared with multi-GPU testing on a single machine, you need to specify some extra environment variables:

  • NNODES: The total number of machines.
  • NODE_RANK: The index (rank) of the local machine.
  • PORT: The communication port; it should be the same on all machines.
  • MASTER_ADDR: The IP address of the master machine; it should be the same on all machines.

Testing across machines is usually slow if you do not have high-speed networking such as InfiniBand.

Multiple machines managed with slurm

If you run MMPose on a cluster managed with slurm, you can use the script slurm_test.sh.

[ENV_VARS] ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} [PY_ARGS]

Here are the argument descriptions of the script.

  • PARTITION: The partition to use in your cluster.
  • JOB_NAME: The name of your job; you can name it as you like.
  • CONFIG_FILE: The path to the config file.
  • CHECKPOINT_FILE: The path to the checkpoint file (it can be an HTTP link; pre-trained checkpoints can be found in the MMPose model zoo).
  • [PY_ARGS]: The other optional arguments of tools/test.py; see the argument descriptions above.

Here are the environment variables that can be used to configure the slurm job.

  • GPUS: The total number of GPUs to be used. Defaults to 8.
  • GPUS_PER_NODE: The number of GPUs to be allocated per node. Defaults to 8.
  • CPUS_PER_TASK: The number of CPUs to be allocated per task (usually one GPU corresponds to one task). Defaults to 5.
  • SRUN_ARGS: The other arguments of srun; available options can be found in the Slurm srun documentation.

Custom Testing Features

Test with Custom Metrics

If you’re looking to assess models using unique metrics not already supported by MMPose, you’ll need to code these metrics yourself and include them in your config file. For guidance on how to accomplish this, check out our customized evaluation guide.
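
As a rough sketch (not MMPose's actual implementation), a custom metric built on MMEngine's BaseMetric could look like the following; the class name MeanKeypointError and its internal logic are purely illustrative:

import numpy as np
from mmengine.evaluator import BaseMetric

from mmpose.registry import METRICS


@METRICS.register_module()
class MeanKeypointError(BaseMetric):
    """Toy metric: mean Euclidean distance between predicted and
    ground-truth keypoints (illustrative only)."""

    default_prefix = 'toy'

    def process(self, data_batch, data_samples):
        # called once per batch: stash whatever compute_metrics() will need
        for sample in data_samples:
            pred = np.asarray(sample['pred_instances']['keypoints'])
            gt = np.asarray(sample['gt_instances']['keypoints'])
            self.results.append(float(np.linalg.norm(pred - gt, axis=-1).mean()))

    def compute_metrics(self, results):
        # called once over the whole dataset: return a dict of metric values
        return dict(mean_keypoint_error=float(np.mean(results)))

After registering the metric (e.g. by importing it in your project), you could reference it in the config with something like val_evaluator = dict(type='MeanKeypointError').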

Evaluating Across Multiple Datasets

MMPose offers a handy tool known as MultiDatasetEvaluator for streamlined assessment across multiple datasets. Setting up this evaluator in your config file is a breeze. Below is a quick example demonstrating how to evaluate a model using both the COCO and AIC datasets:

# Set up validation datasets
coco_val = dict(type='CocoDataset', ...)
aic_val = dict(type='AicDataset', ...)
val_dataset = dict(
        type='CombinedDataset',
        datasets=[coco_val, aic_val],
        pipeline=val_pipeline,
        ...)

# configure the evaluator
val_evaluator = dict(
    type='MultiDatasetEvaluator',
    metrics=[  # metrics for each dataset
        dict(type='CocoMetric',
             ann_file='data/coco/annotations/person_keypoints_val2017.json'),
        dict(type='CocoMetric',
            ann_file='data/aic/annotations/aic_val.json',
            use_area=False,
            prefix='aic')
    ],
    # the number and order of datasets must align with metrics
    datasets=[coco_val, aic_val],
    )

Keep in mind that different datasets, like COCO and AIC, have various keypoint definitions. Yet, the model’s output keypoints are standardized. This results in a discrepancy between the model outputs and the actual ground truth. To address this, you can employ KeypointConverter to align the keypoint configurations between different datasets. Here’s a full example that shows how to leverage KeypointConverter to align AIC keypoints with COCO keypoints:

aic_to_coco_converter = dict(
            type='KeypointConverter',
            num_keypoints=17,
            mapping=[
                (0, 6),
                (1, 8),
                (2, 10),
                (3, 5),
                (4, 7),
                (5, 9),
                (6, 12),
                (7, 14),
                (8, 16),
                (9, 11),
                (10, 13),
                (11, 15),
            ])

# val datasets
coco_val = dict(
    type='CocoDataset',
    data_root='data/coco/',
    data_mode='topdown',
    ann_file='annotations/person_keypoints_val2017.json',
    bbox_file='data/coco/person_detection_results/'
    'COCO_val2017_detections_AP_H_56_person.json',
    data_prefix=dict(img='val2017/'),
    test_mode=True,
    pipeline=[],
)

aic_val = dict(
        type='AicDataset',
        data_root='data/aic/',
        data_mode='topdown',  # use the same data mode as the rest of the config
        ann_file='annotations/aic_val.json',
        data_prefix=dict(img='ai_challenger_keypoint_validation_20170911/'
                         'keypoint_validation_images_20170911/'),
        test_mode=True,
        pipeline=[],
    )

val_dataset = dict(
        type='CombinedDataset',
        metainfo=dict(from_file='configs/_base_/datasets/coco.py'),
        datasets=[coco_val, aic_val],
        pipeline=val_pipeline,
        test_mode=True,
    )

val_dataloader = dict(
    batch_size=32,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
    dataset=val_dataset)

test_dataloader = val_dataloader

val_evaluator = dict(
    type='MultiDatasetEvaluator',
    metrics=[
        dict(type='CocoMetric',
             ann_file='data/coco/annotations/person_keypoints_val2017.json'),
        dict(type='CocoMetric',
            ann_file='data/aic/annotations/aic_val.json',
            use_area=False,
            gt_converter=aic_to_coco_converter,
            prefix='aic')
    ],
    datasets=val_dataset['datasets'],
    )

test_evaluator = val_evaluator

For further clarification on converting AIC keypoints to COCO keypoints, please consult this guide.

Evaluating Top-down Models with Custom Detector

To evaluate top-down models, you can use either ground truth or pre-detected bounding boxes. The bbox_file provides these boxes, generated by a specific detector. For instance, COCO_val2017_detections_AP_H_56_person.json contains bounding boxes for the COCO val2017 dataset, generated using a detector with a human AP of 56.4. To create your own bbox_file using a custom detector supported by MMDetection, run the following command:

python tools/misc/generate_bbox_file.py \
    ${DET_CONFIG} ${DET_WEIGHT} ${OUTPUT_FILE_NAME} \
    [--pose-config ${POSE_CONFIG}] \
    [--score-thr ${SCORE_THRESHOLD}] [--nms-thr ${NMS_THRESHOLD}]

Here, DET_CONFIG and DET_WEIGHT specify the detector's config file and weights. POSE_CONFIG specifies the test dataset that requires bounding box detection, while SCORE_THRESHOLD and NMS_THRESHOLD are used for bounding box filtering.
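
Once generated, the file can be used by pointing the test dataset's bbox_file at it, for example via --cfg-options (assuming a top-down config whose test dataset defines bbox_file; the output file name my_detections.json is a placeholder):

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} \
    --cfg-options test_dataloader.dataset.bbox_file=my_detections.json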
