mmpose.apis

class mmpose.apis.MMPoseInferencer(pose2d: Optional[str] = None, pose2d_weights: Optional[str] = None, pose3d: Optional[str] = None, pose3d_weights: Optional[str] = None, device: Optional[str] = None, scope: str = 'mmpose', det_model: Optional[Union[dict, mmengine.config.config.Config, mmengine.config.config.ConfigDict, str]] = None, det_weights: Optional[str] = None, det_cat_ids: Optional[Union[int, List]] = None, show_progress: bool = False)[source]

MMPose Inferencer. A unified inferencer interface for pose estimation tasks. It currently includes Pose2D and can be used to perform 2D keypoint detection.

Parameters
  • pose2d (str, optional) –

    Pretrained 2D pose estimation algorithm. It’s the path to the config file or the model name defined in metafile. For example, it could be:

    • model alias, e.g. 'body',

    • config name, e.g. 'simcc_res50_8xb64-210e_coco-256x192',

    • config path

    Defaults to None.

  • pose2d_weights (str, optional) – Path to the custom checkpoint file of the selected pose2d model. If it is not specified and “pose2d” is a model name of metafile, the weights will be loaded from metafile. Defaults to None.

  • device (str, optional) – Device to run inference. If None, the available device will be automatically used. Defaults to None.

  • scope (str, optional) – The scope of the model. Defaults to “mmpose”.

  • det_model (str, optional) – Config path or alias of detection model. Defaults to None.

  • det_weights (str, optional) – Path to the checkpoints of detection model. Defaults to None.

  • det_cat_ids (int or list[int], optional) – Category id for detection model. Defaults to None.

  • output_heatmaps (bool, optional) – Flag to visualize predicted heatmaps. If set to None, the default setting from the model config will be used. Default is None.

forward(inputs: Union[str, numpy.ndarray], **forward_kwargs) Union[mmengine.structures.instance_data.InstanceData, List[mmengine.structures.instance_data.InstanceData]][source]

Forward the inputs to the model.

Parameters

inputs (InputsType) – The inputs to be forwarded.

Returns

The prediction results. Possibly with keys “pose2d”.

Return type

Dict

preprocess(inputs: Union[str, numpy.ndarray, Sequence[Union[str, numpy.ndarray]]], batch_size: int = 1, **kwargs)[source]

Process the inputs into a model-feedable format.

Parameters
  • inputs (InputsType) – Inputs given by user.

  • batch_size (int) – batch size. Defaults to 1.

Yields

Any – Data processed by the pipeline and collate_fn.

List[str or np.ndarray] – List of original inputs in the batch.

visualize(inputs: Union[str, numpy.ndarray, Sequence[Union[str, numpy.ndarray]]], preds: Union[mmengine.structures.instance_data.InstanceData, List[mmengine.structures.instance_data.InstanceData]], **kwargs) List[numpy.ndarray][source]

Visualize predictions.

Parameters
  • inputs (list) – Inputs preprocessed by _inputs_to_list().

  • preds (Any) – Predictions of the model.

  • return_vis (bool) – Whether to return images with predicted results.

  • show (bool) – Whether to display the image in a popup window. Defaults to False.

  • show_interval (int) – The interval of show (s). Defaults to 0

  • radius (int) – Keypoint radius for visualization. Defaults to 3

  • thickness (int) – Link thickness for visualization. Defaults to 1

  • kpt_thr (float) – The threshold to visualize the keypoints. Defaults to 0.3

  • vis_out_dir (str, optional) – directory to save visualization results w/o predictions. If left as empty, no file will be saved. Defaults to ‘’.

Returns

Visualization results.

Return type

List[np.ndarray]

class mmpose.apis.Pose2DInferencer(model: Union[dict, mmengine.config.config.Config, mmengine.config.config.ConfigDict, str], weights: Optional[str] = None, device: Optional[str] = None, scope: Optional[str] = 'mmpose', det_model: Optional[Union[dict, mmengine.config.config.Config, mmengine.config.config.ConfigDict, str]] = None, det_weights: Optional[str] = None, det_cat_ids: Optional[Union[int, Tuple]] = None, show_progress: bool = False)[source]

The inferencer for 2D pose estimation.

Parameters
  • model (str, optional) –

    Pretrained 2D pose estimation algorithm. It’s the path to the config file or the model name defined in metafile. For example, it could be:

    • model alias, e.g. 'body',

    • config name, e.g. 'simcc_res50_8xb64-210e_coco-256x192',

    • config path

    Defaults to None.

  • weights (str, optional) – Path to the checkpoint. If it is not specified and “model” is a model name of metafile, the weights will be loaded from metafile. Defaults to None.

  • device (str, optional) – Device to run inference. If None, the available device will be automatically used. Defaults to None.

  • scope (str, optional) – The scope of the model. Defaults to “mmpose”.

  • det_model (str, optional) – Config path or alias of detection model. Defaults to None.

  • det_weights (str, optional) – Path to the checkpoints of detection model. Defaults to None.

  • det_cat_ids (int or list[int], optional) – Category id for detection model. Defaults to None.

forward(inputs: Union[dict, tuple], merge_results: bool = True, bbox_thr: float = - 1, pose_based_nms: bool = False)[source]

Performs a forward pass through the model.

Parameters
  • inputs (Union[dict, tuple]) – The input data to be processed. Can be either a dictionary or a tuple.

  • merge_results (bool, optional) – Whether to merge data samples, default to True. This is only applicable when the data_mode is ‘topdown’.

  • bbox_thr (float, optional) – A threshold for the bounding box scores. Bounding boxes with scores greater than this value will be retained. Default value is -1 which retains all bounding boxes.

Returns

A list of data samples with prediction instances.
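The effect of bbox_thr can be illustrated with a minimal sketch. It assumes each detection is a (bbox, score) pair; the helper name filter_bboxes is hypothetical and is not part of mmpose.

```python
from typing import List, Sequence, Tuple

# Hypothetical detection record: (bbox in xyxy order, confidence score).
Detection = Tuple[Sequence[float], float]


def filter_bboxes(detections: List[Detection],
                  bbox_thr: float = -1) -> List[Detection]:
    """Keep detections whose score is strictly greater than bbox_thr."""
    return [det for det in detections if det[1] > bbox_thr]


detections = [([0, 0, 10, 10], 0.9), ([5, 5, 20, 20], 0.2)]
kept_all = filter_bboxes(detections)             # default -1 keeps every box
kept_confident = filter_bboxes(detections, 0.5)  # only the 0.9 box survives
```

Because scores are non-negative, the default of -1 never rejects a box, which matches the documented behavior of retaining all bounding boxes.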

preprocess_single(input: Union[str, numpy.ndarray], index: int, bbox_thr: float = 0.3, nms_thr: float = 0.3, bboxes: Union[List[List], List[numpy.ndarray], numpy.ndarray] = [])[source]

Process a single input into a model-feedable format.

Parameters
  • input (InputType) – Input given by user.

  • index (int) – index of the input

  • bbox_thr (float) – threshold for bounding box detection. Defaults to 0.3.

  • nms_thr (float) – IoU threshold for bounding box NMS. Defaults to 0.3.

Yields

Any – Data processed by the pipeline and collate_fn.

update_model_visualizer_settings(draw_heatmap: bool = False, skeleton_style: str = 'mmpose', **kwargs) None[source]

Update the settings of models and visualizer according to inference arguments.

Parameters
  • draw_heatmap (bool, optional) – Flag to visualize predicted heatmaps. If not provided, it defaults to False.

  • skeleton_style (str, optional) – Skeleton style selection. Valid options are ‘mmpose’ and ‘openpose’. Defaults to ‘mmpose’.

mmpose.apis.collate_pose_sequence(pose_results_2d, with_track_id=True, target_frame=- 1)[source]

Reorganize multi-frame pose detection results into individual pose sequences.

Note

  • The temporal length of the pose detection results: T

  • The number of the person instances: N

  • The number of the keypoints: K

  • The channel number of each keypoint: C

Parameters
  • pose_results_2d (List[List[PoseDataSample]]) –

    Multi-frame pose detection results stored in a nested list. Each element of the outer list is the pose detection results of a single frame, and each element of the inner list is the pose information of one person, which contains:

    • keypoints (ndarray[K, 2 or 3]): x, y, [score]

    • track_id (int): unique id of each person, required when with_track_id==True

  • with_track_id (bool) – If True, the element in pose_results is expected to contain “track_id”, which will be used to gather the pose sequence of a person from multiple frames. Otherwise, the pose results in each frame are expected to have a consistent number and order of identities. Default is True.

  • target_frame (int) – The index of the target frame. Default: -1.

Returns

Individual pose sequences, with length N.

Return type

List[PoseDataSample]
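As a rough illustration of the track-id based gathering described above, the sketch below uses plain dictionaries in place of PoseDataSample. The record layout, the helper name gather_sequences, and the fallback to the target-frame pose when a person is missing from a frame are all assumptions of this sketch, not the library's implementation.

```python
from typing import Dict, List

# Hypothetical per-person record: {"track_id": int, "keypoints": [[x, y], ...]}
Person = Dict[str, object]


def gather_sequences(frames: List[List[Person]],
                     target_frame: int = -1) -> Dict[int, list]:
    """Collect one keypoint sequence of length T per person in the target frame."""
    sequences = {}
    for target in frames[target_frame]:
        track_id = target["track_id"]
        sequence = []
        for frame in frames:
            match = next((p for p in frame if p["track_id"] == track_id), None)
            # Sketch assumption: reuse the target-frame pose if the person
            # does not appear in this frame.
            source = match if match is not None else target
            sequence.append(source["keypoints"])
        sequences[track_id] = sequence
    return sequences


frames = [
    [{"track_id": 1, "keypoints": [[0, 0]]}, {"track_id": 2, "keypoints": [[5, 5]]}],
    [{"track_id": 2, "keypoints": [[6, 6]]}, {"track_id": 1, "keypoints": [[1, 1]]}],
    [{"track_id": 1, "keypoints": [[2, 2]]}],
]
sequences = gather_sequences(frames)  # the target frame is the last one
```

Only persons present in the target frame get a sequence, so the number of returned sequences equals N for that frame.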

mmpose.apis.collect_multi_frames(video, frame_id, indices, online=False)[source]

Collect multi frames from the video.

Parameters
  • video (mmcv.VideoReader) – A VideoReader of the input video file.

  • frame_id (int) – index of the current frame

  • indices (list(int)) – index offsets of the frames to collect

  • online (bool) – Inference mode. If set to True, future frame information cannot be used.

Returns

Multiple frames collected from the input video file.

Return type

list(ndarray)
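How frame_id, indices and online interact can be sketched on frame indices alone. The helper support_frame_indices is hypothetical; placing the current frame first and clamping out-of-range indices are assumptions of this sketch.

```python
from typing import List


def support_frame_indices(num_frames: int, frame_id: int,
                          indices: List[int], online: bool = False) -> List[int]:
    """Return the current frame index followed by clamped support frame indices."""
    # In online mode no frame after the current one may be used.
    upper = frame_id if online else num_frames - 1
    result = [frame_id]  # current frame first
    for offset in indices:
        if offset == 0:
            continue  # the current frame is already included
        result.append(min(max(frame_id + offset, 0), upper))
    return result


offline = support_frame_indices(10, 5, [-2, -1, 0, 1, 2])
online = support_frame_indices(10, 5, [-2, -1, 0, 1, 2], online=True)
```

In the online case, offsets that point to the future collapse onto the current frame, so the number of collected frames stays the same in both modes.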

mmpose.apis.convert_keypoint_definition(keypoints, pose_det_dataset, pose_lift_dataset)[source]

Convert the keypoint definition of the pose detection dataset to that of the pose lifter dataset, so that the keypoints are compatible with the definitions required for 3D pose lifting.

Parameters
  • keypoints (ndarray[N, K, 2 or 3]) – 2D keypoints to be transformed.

  • pose_det_dataset (str) – Name of the dataset for 2D pose detector.

  • pose_lift_dataset (str) – Name of the dataset for pose lifter model.

Returns

the transformed 2D keypoints.

Return type

ndarray[K, 2 or 3]
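Converting between keypoint definitions usually amounts to reordering joints and synthesizing joints that the source skeleton lacks. The sketch below is a generic illustration with a made-up three-joint skeleton; remap_keypoints and the joint layout are hypothetical and do not reproduce any dataset mapping used by mmpose.

```python
from typing import List, Sequence

Point = List[float]


def remap_keypoints(keypoints: List[Point],
                    sources: List[Sequence[int]]) -> List[Point]:
    """Build each target keypoint as the mean of one or more source keypoints."""
    remapped = []
    for source_ids in sources:
        dims = len(keypoints[source_ids[0]])
        remapped.append([
            sum(keypoints[i][d] for i in source_ids) / len(source_ids)
            for d in range(dims)
        ])
    return remapped


# Hypothetical source skeleton: left hip, right hip, head.
source = [[2.0, 4.0], [4.0, 4.0], [3.0, 0.0]]
# Hypothetical target skeleton: pelvis (midpoint of the hips), then head.
target = remap_keypoints(source, [(0, 1), (2,)])
```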

mmpose.apis.extract_pose_sequence(pose_results, frame_idx, causal, seq_len, step=1)[source]

Extract the target frame from 2D pose results, and pad the sequence to a fixed length.

Parameters
  • pose_results (List[List[PoseDataSample]]) – Multi-frame pose detection results stored in a list.

  • frame_idx (int) – The index of the frame in the original video.

  • causal (bool) – If True, the target frame is the last frame in a sequence. Otherwise, the target frame is in the middle of a sequence.

  • seq_len (int) – The number of frames in the input sequence.

  • step (int) – Step size to extract frames from the video.

Returns

Multi-frame pose detection results stored in a nested list with a length of seq_len.

Return type

List[List[PoseDataSample]]
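The roles of causal, seq_len and step can be sketched by computing which frame indices end up in the sequence. The helper window_indices is hypothetical, and clamping out-of-range indices to the first or last frame is a simplification standing in for the padding step; the library may pad differently when step is larger than 1.

```python
from typing import List


def window_indices(num_frames: int, frame_idx: int, causal: bool,
                   seq_len: int, step: int = 1) -> List[int]:
    """Frame indices of the sequence around (or ending at) frame_idx."""
    if causal:
        left, right = seq_len - 1, 0          # target frame is the last one
    else:
        left = right = (seq_len - 1) // 2     # target frame is in the middle
    return [
        min(max(frame_idx + offset * step, 0), num_frames - 1)
        for offset in range(-left, right + 1)
    ]


centered_at_start = window_indices(10, 0, causal=False, seq_len=5)
causal_at_end = window_indices(10, 9, causal=True, seq_len=3)
```

At the start of the video the missing left context is filled by repeating frame 0, which keeps the sequence length fixed.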

mmpose.apis.inference_bottomup(model: torch.nn.modules.module.Module, img: Union[numpy.ndarray, str])[source]

Inference image with a bottom-up pose estimator.

Parameters
  • model (nn.Module) – The bottom-up pose estimator

  • img (np.ndarray | str) – The loaded image or image file to inference

Returns

The inference results. Specifically, the predicted keypoints and scores are saved at data_sample.pred_instances.keypoints and data_sample.pred_instances.keypoint_scores.

Return type

List[PoseDataSample]

mmpose.apis.inference_pose_lifter_model(model, pose_results_2d, with_track_id=True, image_size=None, norm_pose_2d=False)[source]

Inference 3D pose from 2D pose sequences using a pose lifter model.

Parameters
  • model (nn.Module) – The loaded pose lifter model

  • pose_results_2d (List[List[PoseDataSample]]) – The 2D pose sequences stored in a nested list.

  • with_track_id – If True, the element in pose_results_2d is expected to contain “track_id”, which will be used to gather the pose sequence of a person from multiple frames. Otherwise, the pose results in each frame are expected to have a consistent number and order of identities. Default is True.

  • image_size (tuple|list) – image width, image height. If None, image size will not be contained in dict data.

  • norm_pose_2d (bool) – If True, scale the bbox (along with the 2D pose) to the average bbox scale of the dataset, and move the bbox (along with the 2D pose) to the average bbox center of the dataset.

Returns

3D pose inference results. Specifically, the predicted keypoints and scores are saved at data_sample.pred_instances.keypoints_3d.

Return type

List[PoseDataSample]

mmpose.apis.inference_topdown(model: torch.nn.modules.module.Module, img: Union[numpy.ndarray, str], bboxes: Optional[Union[List, numpy.ndarray]] = None, bbox_format: str = 'xyxy') List[mmpose.structures.pose_data_sample.PoseDataSample][source]

Inference image with a top-down pose estimator.

Parameters
  • model (nn.Module) – The top-down pose estimator

  • img (np.ndarray | str) – The loaded image or image file to inference

  • bboxes (np.ndarray, optional) – The bboxes in shape (N, 4), each row represents a bbox. If not given, the entire image will be regarded as a single bbox area. Defaults to None

  • bbox_format (str) – The bbox format indicator. Options are 'xywh' and 'xyxy'. Defaults to 'xyxy'

Returns

The inference results. Specifically, the predicted keypoints and scores are saved at data_sample.pred_instances.keypoints and data_sample.pred_instances.keypoint_scores.

Return type

List[PoseDataSample]
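The two bbox_format options differ only in how the last two values are interpreted. A minimal sketch of the conversion, assuming 'xywh' means top-left corner plus width and height and 'xyxy' means top-left and bottom-right corners; the helper names are illustrative.

```python
from typing import List

BBox = List[float]


def xywh_to_xyxy(bboxes: List[BBox]) -> List[BBox]:
    """Convert (x, y, width, height) boxes to (x1, y1, x2, y2) corners."""
    return [[x, y, x + w, y + h] for x, y, w, h in bboxes]


def xyxy_to_xywh(bboxes: List[BBox]) -> List[BBox]:
    """Convert (x1, y1, x2, y2) corners to (x, y, width, height) boxes."""
    return [[x1, y1, x2 - x1, y2 - y1] for x1, y1, x2, y2 in bboxes]


corners = xywh_to_xyxy([[10, 20, 30, 40]])
```

Passing boxes in the wrong format silently shifts and rescales the crop, so it is worth checking which convention the detector produces.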

mmpose.apis.init_model(config: Union[str, pathlib.Path, mmengine.config.config.Config], checkpoint: Optional[str] = None, device: str = 'cuda:0', cfg_options: Optional[dict] = None) torch.nn.modules.module.Module[source]

Initialize a pose estimator from a config file.

Parameters
  • config (str, Path, or mmengine.Config) – Config file path, Path, or the config object.

  • checkpoint (str, optional) – Checkpoint path. If left as None, the model will not load any weights. Defaults to None

  • device (str) – The device on which the model will be placed. Defaults to 'cuda:0'.

  • cfg_options (dict, optional) – Options to override some settings in the used config. Defaults to None

Returns

The constructed pose estimator.

Return type

nn.Module

mmpose.apis.visualize(img: Union[numpy.ndarray, str], keypoints: numpy.ndarray, keypoint_score: Optional[numpy.ndarray] = None, metainfo: Optional[Union[str, dict]] = None, visualizer: Optional[mmpose.visualization.local_visualizer.PoseLocalVisualizer] = None, show_kpt_idx: bool = False, skeleton_style: str = 'mmpose', show: bool = False, kpt_thr: float = 0.3)[source]

Visualize 2d keypoints on an image.

Parameters
  • img (str | np.ndarray) – The image to be displayed.

  • keypoints (np.ndarray) – The keypoint to be displayed.

  • keypoint_score (np.ndarray) – The score of each keypoint.

  • metainfo (str | dict) – The metainfo of dataset.

  • visualizer (PoseLocalVisualizer) – The visualizer.

  • show_kpt_idx (bool) – Whether to show the index of keypoints.

  • skeleton_style (str) – Skeleton style. Options are ‘mmpose’ and ‘openpose’.

  • show (bool) – Whether to show the image.

  • wait_time (int) – Value of waitKey param.

  • kpt_thr (float) – Keypoint threshold.
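The keypoint threshold can be thought of as a filter applied before drawing. The sketch below assumes keypoints with a score at or above kpt_thr are drawn; whether the comparison is inclusive is an assumption, and visible_keypoints is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def visible_keypoints(keypoints: Sequence[Sequence[float]],
                      scores: Sequence[float],
                      kpt_thr: float = 0.3) -> List[Tuple[int, Sequence[float]]]:
    """Return (index, keypoint) pairs whose score reaches kpt_thr."""
    return [
        (idx, kpt)
        for idx, (kpt, score) in enumerate(zip(keypoints, scores))
        if score >= kpt_thr
    ]


drawn = visible_keypoints([[10, 10], [20, 20], [30, 30]], [0.9, 0.1, 0.5])
```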

mmpose.codecs

class mmpose.codecs.AssociativeEmbedding(input_size: Tuple[int, int], heatmap_size: Tuple[int, int], sigma: Optional[float] = None, use_udp: bool = False, decode_keypoint_order: List[int] = [], decode_nms_kernel: int = 5, decode_gaussian_kernel: int = 3, decode_keypoint_thr: float = 0.1, decode_tag_thr: float = 1.0, decode_topk: int = 30, decode_center_shift=0.0, decode_max_instances: Optional[int] = None)[source]

Encode/decode keypoints with the method introduced in “Associative Embedding”. This is an asymmetric codec, where the keypoints are represented as gaussian heatmaps and position indices during encoding, and restored from predicted heatmaps and group tags.

See the paper “Associative Embedding: End-to-End Learning for Joint Detection and Grouping” by Newell et al. (2017) for details.

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • embedding tag dimension: L

  • image size: [w, h]

  • heatmap size: [W, H]

Encoded:

  • heatmaps (np.ndarray): The generated heatmap in shape (K, H, W) where [W, H] is the heatmap_size

  • keypoint_indices (np.ndarray): The keypoint position indices in shape (N, K, 2). Each keypoint’s index is [i, v], where i is the position index in the heatmap (\(i=y*W+x\)) and v is the visibility

  • keypoint_weights (np.ndarray): The target weights in shape (N, K)

Parameters
  • input_size (tuple) – Image size in [w, h]

  • heatmap_size (tuple) – Heatmap size in [W, H]

  • sigma (float) – The sigma value of the Gaussian heatmap

  • use_udp (bool) – Whether to use unbiased data processing. See UDP (CVPR 2020) for details. Defaults to False

  • decode_keypoint_order (List[int]) – The grouping order of the keypoint indices. The grouping usually starts from keypoints around the head and torso, and gradually moves out to the limbs

  • decode_keypoint_thr (float) – The threshold of keypoint response value in heatmaps. Defaults to 0.1

  • decode_tag_thr (float) – The maximum allowed tag distance when matching a keypoint to a group. A keypoint whose tag distance to every existing group is larger than this value will initialize a new group. Defaults to 1.0

  • decode_nms_kernel (int) – The kernel size of the NMS during decoding, which should be an odd integer. Defaults to 5

  • decode_gaussian_kernel (int) – The kernel size of the Gaussian blur during decoding, which should be an odd integer. It is only used when self.use_udp==True. Defaults to 3

  • decode_topk (int) – The number of top-k candidates of each keypoint that will be retrieved from the heatmaps during decoding. Defaults to 30

  • decode_max_instances (int, optional) – The maximum number of instances to decode. None means no limitation to the instance number. Defaults to None
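The role of decode_tag_thr can be illustrated with a simplified greedy grouping over scalar tags. This is only a sketch: it assumes one-dimensional tags (L = 1), compares each candidate with the mean tag of every existing group, and ignores keypoint order, scores and optimal matching. The helper group_by_tag is hypothetical.

```python
from typing import List


def group_by_tag(tags: List[float], tag_thr: float = 1.0) -> List[List[float]]:
    """Assign each tag to the nearest group, or open a new group if too far."""
    groups: List[List[float]] = []
    for tag in tags:
        best_group, best_dist = None, None
        for group in groups:
            dist = abs(tag - sum(group) / len(group))  # distance to group mean
            if best_dist is None or dist < best_dist:
                best_group, best_dist = group, dist
        if best_group is not None and best_dist <= tag_thr:
            best_group.append(tag)
        else:
            groups.append([tag])  # too far from every group: new instance
    return groups


groups = group_by_tag([0.1, 0.2, 5.0, 5.3, 0.15], tag_thr=1.0)
```

A smaller threshold splits instances more aggressively, while a larger one risks merging keypoints of different people into one group.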

Associative Embedding: End-to-End Learning for Joint Detection and Grouping: https://arxiv.org/abs/1611.05424

UDP (CVPR 2020): https://arxiv.org/abs/1911.07524

batch_decode(batch_heatmaps: torch.Tensor, batch_tags: torch.Tensor) Tuple[List[numpy.ndarray], List[numpy.ndarray]][source]

Decode the keypoint coordinates from a batch of heatmaps and tagging heatmaps. The decoded keypoint coordinates are in the input image space.

Parameters
  • batch_heatmaps (Tensor) – Keypoint detection heatmaps in shape (B, K, H, W)

  • batch_tags (Tensor) – Tagging heatmaps in shape (B, C, H, W), where \(C=L*K\)

Returns

  • batch_keypoints (List[np.ndarray]): Decoded keypoint coordinates of the batch, each is in shape (N, K, D)

  • batch_scores (List[np.ndarray]): Decoded keypoint scores of the batch, each is in shape (N, K). It usually represents the confidence of the keypoint prediction

Return type

tuple

decode(encoded: Any) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode keypoints.

Parameters

encoded (any) – Encoded keypoint representation using the codec

Returns

  • keypoints (np.ndarray): Keypoint coordinates in shape (N, K, D)

  • keypoints_visible (np.ndarray): Keypoint visibility in shape (N, K, D)

Return type

tuple

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray][source]

Encode keypoints into heatmaps and position indices. Note that the original keypoint coordinates should be in the input image space.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D)

  • keypoints_visible (np.ndarray) – Keypoint visibilities in shape (N, K)

Returns

  • heatmaps (np.ndarray): The generated heatmap in shape (K, H, W) where [W, H] is the heatmap_size

  • keypoint_indices (np.ndarray): The keypoint position indices in shape (N, K, 2). Each keypoint’s index is [i, v], where i is the position index in the heatmap (\(i=y*W+x\)) and v is the visibility

  • keypoint_weights (np.ndarray): The target weights in shape (N, K)

Return type

tuple
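The position index stored in keypoint_indices flattens a heatmap coordinate in row-major order. A minimal sketch, assuming the row stride is the heatmap width (named heatmap_width here); both helper names are illustrative.

```python
from typing import Tuple


def flat_index(x: int, y: int, heatmap_width: int) -> int:
    """Row-major position index of heatmap pixel (x, y)."""
    return y * heatmap_width + x


def unflatten_index(index: int, heatmap_width: int) -> Tuple[int, int]:
    """Recover (x, y) from a row-major position index."""
    y, x = divmod(index, heatmap_width)
    return x, y


index = flat_index(x=3, y=2, heatmap_width=48)
```

The flattened index makes it possible to gather tag values from a heatmap reshaped to (K, H*W) with a single indexing operation.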

class mmpose.codecs.DecoupledHeatmap(input_size: Tuple[int, int], heatmap_size: Tuple[int, int], root_type: str = 'kpt_center', heatmap_min_overlap: float = 0.7, encode_max_instances: int = 30)[source]

Encode/decode keypoints with the method introduced in the paper CID.

See the paper “Contextual Instance Decoupling for Robust Multi-Person Pose Estimation” by Wang et al. (2022) for details.

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • image size: [w, h]

  • heatmap size: [W, H]

Encoded:
  • heatmaps (np.ndarray): The coupled heatmap in shape (1+K, H, W) where [W, H] is the heatmap_size.

  • instance_heatmaps (np.ndarray): The decoupled heatmap in shape (M*K, H, W) where M is the number of instances.

  • keypoint_weights (np.ndarray): The weight for heatmaps in shape (M*K).

  • instance_coords (np.ndarray): The coordinates of instance roots in shape (M, 2)

Parameters
  • input_size (tuple) – Image size in [w, h]

  • heatmap_size (tuple) – Heatmap size in [W, H]

  • root_type (str) –

    The method to generate the instance root. Options are:

    • 'kpt_center': Average coordinate of all visible keypoints.

    • 'bbox_center': Center point of bounding boxes outlined by all visible keypoints.

    Defaults to 'kpt_center'

  • heatmap_min_overlap (float) – Minimum overlap rate among instances. Used when calculating sigmas for instances. Defaults to 0.7

  • background_weight (float) – Loss weight of background pixels. Defaults to 0.1

  • encode_max_instances (int) – The maximum number of instances to encode for each sample. Defaults to 30
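The two root_type options can be compared on a small example. This is a sketch under the assumption that a keypoint counts as visible when its visibility value is greater than zero; instance_root is a hypothetical helper, not the codec's method.

```python
from typing import Optional, Sequence, Tuple


def instance_root(keypoints: Sequence[Sequence[float]],
                  visible: Sequence[float],
                  root_type: str = "kpt_center") -> Optional[Tuple[float, float]]:
    """Compute the instance root from the visible keypoints only."""
    points = [kpt for kpt, vis in zip(keypoints, visible) if vis > 0]
    if not points:
        return None  # no visible keypoint, so no root can be defined
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    if root_type == "kpt_center":
        # Average coordinate of all visible keypoints.
        return sum(xs) / len(xs), sum(ys) / len(ys)
    if root_type == "bbox_center":
        # Center of the box outlined by all visible keypoints.
        return (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    raise ValueError(f"unknown root_type: {root_type}")


keypoints = [(0.0, 0.0), (4.0, 0.0), (2.0, 6.0)]
kpt_root = instance_root(keypoints, [1, 1, 1], "kpt_center")
bbox_root = instance_root(keypoints, [1, 1, 1], "bbox_center")
```

The keypoint average is pulled toward dense clusters of joints, whereas the bounding-box center depends only on the extreme keypoints.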

Contextual_Instance_Decoupling_for_Robust_Multi-Person_Pose_Estimation_ CVPR_2022_paper.html

decode(instance_heatmaps: numpy.ndarray, instance_scores: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode keypoint coordinates from decoupled heatmaps. The decoded keypoint coordinates are in the input image space.

Parameters
  • instance_heatmaps (np.ndarray) – Heatmaps in shape (N, K, H, W)

  • instance_scores (np.ndarray) – Confidence of instance roots prediction in shape (N, 1)

Returns

  • keypoints (np.ndarray): Decoded keypoint coordinates in shape (N, K, D)

  • scores (np.ndarray): The keypoint scores in shape (N, K). It usually represents the confidence of the keypoint prediction

Return type

tuple

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None, bbox: Optional[numpy.ndarray] = None) dict[source]

Encode keypoints into heatmaps.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D)

  • keypoints_visible (np.ndarray) – Keypoint visibilities in shape (N, K)

  • bbox (np.ndarray) – Bounding box in shape (N, 8) which includes coordinates of 4 corners.

Returns

  • heatmaps (np.ndarray): The coupled heatmap in shape (1+K, H, W) where [W, H] is the heatmap_size.

  • instance_heatmaps (np.ndarray): The decoupled heatmap in shape (N*K, H, W) where N is the number of instances.

  • keypoint_weights (np.ndarray): The weight for heatmaps in shape (N*K).

  • instance_coords (np.ndarray): The coordinates of instance roots in shape (N, 2)

Return type

dict

class mmpose.codecs.EDPoseLabel(num_select: int = 100, num_keypoints: int = 17)[source]

Generate keypoint and label coordinates for ED-Pose by Yang J. et al. (2023).

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • image size: [w, h]

Encoded:

  • keypoints (np.ndarray): Keypoint coordinates in shape (N, K, D)

  • keypoints_visible (np.ndarray): Keypoint visibility in shape (N, K, D)

  • area (np.ndarray): Area in shape (N)

  • bbox (np.ndarray): Bbox in shape (N, 4)

Parameters
  • num_select (int) – The number of candidate instances

  • num_keypoints (int) – The number of keypoints

decode(input_shapes: numpy.ndarray, pred_logits: numpy.ndarray, pred_boxes: numpy.ndarray, pred_keypoints: numpy.ndarray)[source]

Select the final top-k keypoints, and decode the results from the normalized size to the original input size.

Parameters
  • input_shapes (Tensor) – The size of the resized input image.

  • test_cfg (ConfigType) – Config of testing.

  • pred_logits (Tensor) – The result of score.

  • pred_boxes (Tensor) – The result of bbox.

  • pred_keypoints (Tensor) – The result of keypoints.

Returns

Decoded boxes, keypoints, and keypoint scores.

Return type

tuple

encode(img_shape, keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None, area: Optional[numpy.ndarray] = None, bboxes: Optional[numpy.ndarray] = None) dict[source]

Encoding keypoints, area and bbox from input image space to normalized space.

Parameters
  • img_shape – The shape of image in the format of (width, height).

  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D).

  • keypoints_visible (np.ndarray, optional) – Keypoint visibility in shape (N, K)

  • area (np.ndarray, optional) – Area in shape (N)

  • bboxes (np.ndarray, optional) – Bbox in shape (N, 4)

Returns

Contains the following items:

  • keypoint_labels (np.ndarray): The processed keypoints in shape like (N, K, D).

  • keypoints_visible (np.ndarray): Keypoint visibility in shape (N, K, D)

  • area_labels (np.ndarray): The processed target area in shape (N).

  • bboxes_labels: The processed target bbox in shape (N, 4).

Return type

encoded (dict)
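Moving keypoints between input image space and normalized space can be sketched as a division by the image size. That each coordinate is simply divided by width or height is an assumption of this sketch; the helper names are hypothetical and the handling of area and bboxes is left out.

```python
from typing import List, Tuple

Point = List[float]


def normalize_keypoints(keypoints: List[Point],
                        img_shape: Tuple[float, float]) -> List[Point]:
    """Map pixel coordinates to the [0, 1] range using (width, height)."""
    width, height = img_shape
    return [[x / width, y / height] for x, y in keypoints]


def denormalize_keypoints(keypoints: List[Point],
                          img_shape: Tuple[float, float]) -> List[Point]:
    """Map normalized coordinates back to pixel coordinates."""
    width, height = img_shape
    return [[x * width, y * height] for x, y in keypoints]


normalized = normalize_keypoints([[64, 48], [128, 0]], (128, 96))
restored = denormalize_keypoints(normalized, (128, 96))
```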

class mmpose.codecs.Hand3DHeatmap(image_size: Tuple[int, int] = [256, 256], root_heatmap_size: int = 64, heatmap_size: Tuple[int, int, int] = [64, 64, 64], heatmap3d_depth_bound: float = 400.0, heatmap_size_root: int = 64, root_depth_bound: float = 400.0, depth_size: int = 64, use_different_joint_weights: bool = False, sigma: int = 2, joint_indices: Optional[list] = None, max_bound: float = 1.0)[source]

Generate target 3d heatmap and relative root depth for hand datasets.

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

Parameters
  • image_size (tuple) – Size of image. Default: [256, 256].

  • root_heatmap_size (int) – Size of heatmap of root head. Default: 64.

  • heatmap_size (tuple) – Size of heatmap. Default: [64, 64, 64].

  • heatmap3d_depth_bound (float) – Boundary for 3d heatmap depth. Default: 400.0.

  • heatmap_size_root (int) – Size of 3d heatmap root. Default: 64.

  • depth_size (int) – Number of depth discretization size, used for decoding. Defaults to 64.

  • root_depth_bound (float) – Boundary for 3d heatmap root depth. Default: 400.0.

  • use_different_joint_weights (bool) – Whether to use different joint weights. Default: False.

  • sigma (int) – Sigma of heatmap gaussian. Default: 2.

  • joint_indices (list, optional) – Indices of joints used for heatmap generation. If None (default) is given, all joints will be used. Default: None.

  • max_bound (float) – The maximal value of heatmap. Default: 1.0.

decode(heatmaps: numpy.ndarray, root_depth: numpy.ndarray, hand_type: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode keypoint coordinates from heatmaps. The decoded keypoint coordinates are in the input image space.

Parameters
  • heatmaps (np.ndarray) – Heatmaps in shape (K, D, H, W)

  • root_depth (np.ndarray) – Root depth prediction.

  • hand_type (np.ndarray) – Hand type prediction.

Returns

  • keypoints (np.ndarray): Decoded keypoint coordinates in shape (N, K, D)

  • scores (np.ndarray): The keypoint scores in shape (N, K). It usually represents the confidence of the keypoint prediction

Return type

tuple

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray], dataset_keypoint_weights: Optional[numpy.ndarray], rel_root_depth: numpy.float32, rel_root_valid: numpy.float32, hand_type: numpy.ndarray, hand_type_valid: numpy.ndarray, focal: numpy.ndarray, principal_pt: numpy.ndarray) dict[source]

Encoding keypoints from input image space to input image space.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D).

  • keypoints_visible (np.ndarray, optional) – Keypoint visibilities in shape (N, K).

  • dataset_keypoint_weights (np.ndarray, optional) – Keypoints weight in shape (K, ).

  • rel_root_depth (np.float32) – Relative root depth.

  • rel_root_valid (float) – Validity of relative root depth.

  • hand_type (np.ndarray) – Type of hand encoded as an array.

  • hand_type_valid (np.ndarray) – Validity of hand type.

  • focal (np.ndarray) – Focal length of camera.

  • principal_pt (np.ndarray) – Principal point of camera.

Returns

Contains the following items:

  • heatmaps (np.ndarray): The generated heatmap in shape (K * D, H, W) where [W, H, D] is the heatmap_size

  • keypoint_weights (np.ndarray): The target weights in shape (N, K)

  • root_depth (np.ndarray): Encoded relative root depth

  • root_depth_weight (np.ndarray): The weights of relative root depth

  • type (np.ndarray): Encoded hand type

  • type_weight (np.ndarray): The weights of hand type

Return type

encoded (dict)

class mmpose.codecs.ImagePoseLifting(num_keypoints: int, root_index: Union[int, List] = 0, remove_root: bool = False, save_index: bool = False, reshape_keypoints: bool = True, concat_vis: bool = False, keypoints_mean: Optional[numpy.ndarray] = None, keypoints_std: Optional[numpy.ndarray] = None, target_mean: Optional[numpy.ndarray] = None, target_std: Optional[numpy.ndarray] = None, additional_encode_keys: Optional[List[str]] = None)[source]

Generate keypoint coordinates for pose lifter.

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • pose-lifting target dimension: C

Parameters
  • num_keypoints (int) – The number of keypoints in the dataset.

  • root_index (Union[int, List]) – Root keypoint index in the pose.

  • remove_root (bool) – If true, remove the root keypoint from the pose. Default: False.

  • save_index (bool) – If true, store the root position separated from the original pose. Default: False.

  • reshape_keypoints (bool) – If true, reshape the keypoints into shape (-1, N). Default: True.

  • concat_vis (bool) – If true, concat the visibility item of keypoints. Default: False.

  • keypoints_mean (np.ndarray, optional) – Mean values of keypoints coordinates in shape (K, D).

  • keypoints_std (np.ndarray, optional) – Std values of keypoints coordinates in shape (K, D).

  • target_mean (np.ndarray, optional) – Mean values of pose-lifting target coordinates in shape (K, C).

  • target_std (np.ndarray, optional) – Std values of pose-lifting target coordinates in shape (K, C).

decode(encoded: numpy.ndarray, target_root: Optional[numpy.ndarray] = None) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode keypoint coordinates from normalized space to input image space.

Parameters
  • encoded (np.ndarray) – Coordinates in shape (N, K, C).

  • target_root (np.ndarray, optional) – The target root coordinate. Default: None.

Returns

  • keypoints (np.ndarray): Decoded coordinates in shape (N, K, C).

  • scores (np.ndarray): The keypoint scores in shape (N, K).

Return type

tuple

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None, lifting_target: Optional[numpy.ndarray] = None, lifting_target_visible: Optional[numpy.ndarray] = None) dict[source]

Encoding keypoints from input image space to normalized space.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D).

  • keypoints_visible (np.ndarray, optional) – Keypoint visibilities in shape (N, K).

  • lifting_target (np.ndarray, optional) – 3d target coordinate in shape (T, K, C).

  • lifting_target_visible (np.ndarray, optional) – Target coordinate in shape (T, K, ).

Returns

Contains the following items:

  • keypoint_labels (np.ndarray): The processed keypoints in shape like (N, K, D) or (K * D, N).

  • keypoint_labels_visible (np.ndarray): The processed keypoints’ weights in shape (N, K, ) or (N-1, K, ).

  • lifting_target_label: The processed target coordinate in shape (K, C) or (K-1, C).

  • lifting_target_weight (np.ndarray): The target weights in shape (K, ) or (K-1, ).

  • trajectory_weights (np.ndarray): The trajectory weights in shape (K, ).

  • target_root (np.ndarray): The root coordinate of target in shape (C, ).

In addition, there are some optional items it may contain:

  • target_root (np.ndarray): The root coordinate of target in shape (C, ). Exists if zero_center is True.

  • target_root_removed (bool): Indicate whether the root of pose-lifting target is removed. Exists if remove_root is True.

  • target_root_index (int): An integer indicating the index of root. Exists if remove_root and save_index are True.

Return type

encoded (dict)
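The encode and decode steps for the lifting target can be sketched as root-centering followed by standardization, and the inverse on the way back. For brevity the sketch uses a single scalar mean and std instead of arrays in shape (K, C), and it works on one pose given as nested lists; encode_target and decode_target are hypothetical names.

```python
from typing import List, Tuple

Pose = List[List[float]]


def encode_target(target: Pose, root_index: int,
                  mean: float, std: float) -> Tuple[Pose, List[float]]:
    """Center the pose on its root joint, then standardize each coordinate."""
    root = target[root_index]
    centered = [[c - r for c, r in zip(joint, root)] for joint in target]
    normalized = [[(c - mean) / std for c in joint] for joint in centered]
    return normalized, root


def decode_target(normalized: Pose, root: List[float],
                  mean: float, std: float) -> Pose:
    """Undo the standardization, then add the root coordinate back."""
    return [[c * std + mean + r for c, r in zip(joint, root)]
            for joint in normalized]


target = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
encoded, target_root = encode_target(target, root_index=0, mean=0.0, std=2.0)
decoded = decode_target(encoded, target_root, mean=0.0, std=2.0)
```

Keeping target_root alongside the encoded pose is what allows decode to return coordinates in the original space rather than root-relative ones.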

class mmpose.codecs.IntegralRegressionLabel(input_size: Tuple[int, int], heatmap_size: Tuple[int, int], sigma: float, unbiased: bool = False, blur_kernel_size: int = 11, normalize: bool = True)[source]

Generate keypoint coordinates and normalized heatmaps. See the paper: DSNT by Nibali et al (2018).

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • image size: [w, h]

Encoded:

  • keypoint_labels (np.ndarray): The normalized regression labels in

    shape (N, K, D) where D is 2 for 2d coordinates

  • heatmaps (np.ndarray): The generated heatmap in shape (K, H, W) where

    [W, H] is the heatmap_size

  • keypoint_weights (np.ndarray): The target weights in shape (N, K)

Parameters
  • input_size (tuple) – Input image size in [w, h]

  • heatmap_size (tuple) – Heatmap size in [W, H]

  • sigma (float) – The sigma value of the Gaussian heatmap

  • unbiased (bool) – Whether to use the unbiased method (DarkPose) in 'msra' encoding. See Dark Pose for details. Defaults to False

  • blur_kernel_size (int) – The Gaussian blur kernel size of the heatmap modulation in DarkPose. The kernel size and sigma should follow the empirical formula \(sigma = 0.3*((ks-1)*0.5-1)+0.8\). Defaults to 11

  • normalize (bool) – Whether to normalize the heatmaps. Defaults to True.
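
The relation between `blur_kernel_size` and `sigma` given above can be evaluated directly. The function name below is hypothetical; only the formula comes from the parameter description.

```python
def sigma_from_kernel_size(ks: int) -> float:
    """Sigma implied by the formula sigma = 0.3*((ks-1)*0.5-1)+0.8."""
    return 0.3 * ((ks - 1) * 0.5 - 1) + 0.8


# Default blur_kernel_size of 11 gives a sigma of approximately 2.0
sigma_default = sigma_from_kernel_size(11)
```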

decode(encoded: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode keypoint coordinates from normalized space to input image space.

Parameters

encoded (np.ndarray) – Coordinates in shape (N, K, D)

Returns

  • keypoints (np.ndarray): Decoded coordinates in shape (N, K, D)

  • scores (np.ndarray): The keypoint scores in shape (N, K).

    It usually represents the confidence of the keypoint prediction

Return type

tuple

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None) dict[source]

Encoding keypoints to regression labels and heatmaps.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D)

  • keypoints_visible (np.ndarray) – Keypoint visibilities in shape (N, K)

Returns

  • keypoint_labels (np.ndarray): The normalized regression labels in

    shape (N, K, D) where D is 2 for 2d coordinates

  • heatmaps (np.ndarray): The generated heatmap in shape

    (K, H, W) where [W, H] is the heatmap_size

  • keypoint_weights (np.ndarray): The target weights in shape

    (N, K)

Return type

dict

class mmpose.codecs.MSRAHeatmap(input_size: Tuple[int, int], heatmap_size: Tuple[int, int], sigma: float, unbiased: bool = False, blur_kernel_size: int = 11)[source]

Represent keypoints as heatmaps via “MSRA” approach. See the paper: Simple Baselines for Human Pose Estimation and Tracking by Xiao et al (2018) for details.

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • image size: [w, h]

  • heatmap size: [W, H]

Encoded:

  • heatmaps (np.ndarray): The generated heatmap in shape (K, H, W)

    where [W, H] is the heatmap_size

  • keypoint_weights (np.ndarray): The target weights in shape (N, K)

Parameters
  • input_size (tuple) – Image size in [w, h]

  • heatmap_size (tuple) – Heatmap size in [W, H]

  • sigma (float) – The sigma value of the Gaussian heatmap

  • unbiased (bool) – Whether to use the unbiased method (DarkPose) in 'msra' encoding. See Dark Pose for details. Defaults to False

  • blur_kernel_size (int) – The Gaussian blur kernel size of the heatmap modulation in DarkPose. The kernel size and sigma should follow the empirical formula \(sigma = 0.3*((ks-1)*0.5-1)+0.8\). Defaults to 11

decode(encoded: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode keypoint coordinates from heatmaps. The decoded keypoint coordinates are in the input image space.

Parameters

encoded (np.ndarray) – Heatmaps in shape (K, H, W)

Returns

  • keypoints (np.ndarray): Decoded keypoint coordinates in shape

    (N, K, D)

  • scores (np.ndarray): The keypoint scores in shape (N, K). It

    usually represents the confidence of the keypoint prediction

Return type

tuple
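
As a rough illustration of heatmap decoding, the sketch below takes the peak of each heatmap and rescales it to the input image. It assumes plain ratio rescaling and no sub-pixel refinement, so it is a simplification rather than the codec's actual procedure; `argmax_decode` is a hypothetical name.

```python
import numpy as np


def argmax_decode(heatmaps, input_size):
    """Hypothetical helper: peak of each (H, W) heatmap, rescaled to image space.

    heatmaps: (K, H, W); input_size: (w, h). Returns (1, K, 2) and (1, K).
    """
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    idx = flat.argmax(axis=1)
    scores = flat[np.arange(K), idx]
    ys, xs = np.unravel_index(idx, (H, W))
    keypoints = np.stack([xs, ys], axis=-1).astype(np.float32)
    # Plain ratio rescaling from the heatmap grid to the input image (an assumption)
    scale = np.array([input_size[0] / W, input_size[1] / H], dtype=np.float32)
    return (keypoints * scale)[None], scores[None]


heatmaps = np.zeros((2, 64, 48), dtype=np.float32)
heatmaps[0, 10, 20] = 0.9   # keypoint 0 peaks at x=20, y=10
heatmaps[1, 30, 5] = 0.7    # keypoint 1 peaks at x=5, y=30
keypoints, scores = argmax_decode(heatmaps, input_size=(192, 256))
```

With a 48x64 heatmap and a 192x256 input, each heatmap cell maps to a 4x4 pixel block, so the first peak lands at (80, 40) in image space.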

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None) dict[source]

Encode keypoints into heatmaps. Note that the original keypoint coordinates should be in the input image space.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D)

  • keypoints_visible (np.ndarray) – Keypoint visibilities in shape (N, K)

Returns

  • heatmaps (np.ndarray): The generated heatmap in shape

    (K, H, W) where [W, H] is the heatmap_size

  • keypoint_weights (np.ndarray): The target weights in shape

    (N, K)

Return type

dict
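
A minimal sketch of Gaussian heatmap encoding follows, assuming the keypoints have already been scaled to the heatmap grid and that instances are merged with an element-wise maximum. Both choices are assumptions for illustration, and `gaussian_heatmaps` is a hypothetical name.

```python
import numpy as np


def gaussian_heatmaps(keypoints, keypoints_visible, heatmap_size, sigma):
    """Hypothetical helper: one unnormalized Gaussian per visible keypoint.

    keypoints: (N, K, 2) in heatmap coordinates.
    Returns heatmaps (K, H, W) and keypoint_weights (N, K).
    """
    W, H = heatmap_size
    N, K, _ = keypoints.shape
    xs = np.arange(W, dtype=np.float32)[None, :]
    ys = np.arange(H, dtype=np.float32)[:, None]
    heatmaps = np.zeros((K, H, W), dtype=np.float32)
    weights = keypoints_visible.astype(np.float32).copy()
    for n in range(N):
        for k in range(K):
            if weights[n, k] < 0.5:
                continue  # skip invisible keypoints
            mu_x, mu_y = keypoints[n, k]
            g = np.exp(-((xs - mu_x) ** 2 + (ys - mu_y) ** 2) / (2 * sigma ** 2))
            # Merge instances by taking the element-wise maximum (an assumption)
            heatmaps[k] = np.maximum(heatmaps[k], g)
    return heatmaps, weights


kpts = np.array([[[12.0, 20.0], [30.0, 40.0]]])   # (N=1, K=2, D=2)
vis = np.array([[1, 0]])                          # second keypoint invisible
hm, w = gaussian_heatmaps(kpts, vis, heatmap_size=(48, 64), sigma=2.0)
```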

class mmpose.codecs.MegviiHeatmap(input_size: Tuple[int, int], heatmap_size: Tuple[int, int], kernel_size: int)[source]

Represent keypoints as heatmaps via “Megvii” approach. See MSPN (2019) and CPN (2018) for details.

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • image size: [w, h]

  • heatmap size: [W, H]

Encoded:

  • heatmaps (np.ndarray): The generated heatmap in shape (K, H, W)

    where [W, H] is the heatmap_size

  • keypoint_weights (np.ndarray): The target weights in shape (N, K)

Parameters
  • input_size (tuple) – Image size in [w, h]

  • heatmap_size (tuple) – Heatmap size in [W, H]

  • kernel_size (int) – The kernel size of the heatmap Gaussian, given as a single int per the class signature

decode(encoded: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode keypoint coordinates from heatmaps. The decoded keypoint coordinates are in the input image space.

Parameters

encoded (np.ndarray) – Heatmaps in shape (K, H, W)

Returns

  • keypoints (np.ndarray): Decoded keypoint coordinates in shape

    (K, D)

  • scores (np.ndarray): The keypoint scores in shape (K,). It

    usually represents the confidence of the keypoint prediction

Return type

tuple

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None) dict[source]

Encode keypoints into heatmaps. Note that the original keypoint coordinates should be in the input image space.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D)

  • keypoints_visible (np.ndarray) – Keypoint visibilities in shape (N, K)

Returns

  • heatmaps (np.ndarray): The generated heatmap in shape

    (K, H, W) where [W, H] is the heatmap_size

  • keypoint_weights (np.ndarray): The target weights in shape

    (N, K)

Return type

dict

class mmpose.codecs.MotionBERTLabel(num_keypoints: int, root_index: int = 0, remove_root: bool = False, save_index: bool = False, concat_vis: bool = False, rootrel: bool = False, mode: str = 'test')[source]

Generate keypoint and label coordinates for MotionBERT by Zhu et al (2022).

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • pose-lifting target dimension: C

Parameters
  • num_keypoints (int) – The number of keypoints in the dataset.

  • root_index (int) – Root keypoint index in the pose. Default: 0.

  • remove_root (bool) – If true, remove the root keypoint from the pose. Default: False.

  • save_index (bool) – If true, store the root position separated from the original pose, only takes effect if remove_root is True. Default: False.

  • concat_vis (bool) – If true, concat the visibility item of keypoints. Default: False.

  • rootrel (bool) – If true, the root keypoint will be set to the coordinate origin. Default: False.

  • mode (str) – Indicating whether the current mode is ‘train’ or ‘test’. Default: 'test'.

decode(encoded: numpy.ndarray, w: Optional[numpy.ndarray] = None, h: Optional[numpy.ndarray] = None, factor: Optional[numpy.ndarray] = None) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode keypoint coordinates from normalized space to input image space.

Parameters
  • encoded (np.ndarray) – Coordinates in shape (N, K, C).

  • w (np.ndarray, optional) – The image widths in shape (N, ). Default: None.

  • h (np.ndarray, optional) – The image heights in shape (N, ). Default: None.

  • factor (np.ndarray, optional) – The factor for projection in shape (N, ). Default: None.

Returns

  • keypoints (np.ndarray): Decoded coordinates in shape (N, K, C).

  • scores (np.ndarray): The keypoint scores in shape (N, K).

Return type

tuple

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None, lifting_target: Optional[numpy.ndarray] = None, lifting_target_visible: Optional[numpy.ndarray] = None, camera_param: Optional[dict] = None, factor: Optional[numpy.ndarray] = None) dict[source]

Encoding keypoints from input image space to normalized space.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (B, T, K, D).

  • keypoints_visible (np.ndarray, optional) – Keypoint visibilities in shape (B, T, K).

  • lifting_target (np.ndarray, optional) – 3d target coordinate in shape (T, K, C).

  • lifting_target_visible (np.ndarray, optional) – Target visibilities in shape (T, K).

  • camera_param (dict, optional) – The camera parameter dictionary.

  • factor (np.ndarray, optional) – The factor mapping camera and image coordinate in shape (T, ).

Returns

Contains the following items:

  • keypoint_labels (np.ndarray): The processed keypoints in shape like (N, K, D).

  • keypoint_labels_visible (np.ndarray): The processed keypoints’ weights in shape (N, K, ) or (N, K-1, ).

  • lifting_target_label: The processed target coordinate in shape (K, C) or (K-1, C).

  • lifting_target_weight (np.ndarray): The target weights in shape (K, ) or (K-1, ).

  • factor (np.ndarray): The factor mapping camera and image coordinate in shape (T, 1).

Return type

encoded (dict)
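
The phrase "normalized space" is not defined in the text above. One common convention for pose lifting maps pixel x to [-1, 1] and scales y by the same factor, so the aspect ratio is preserved. The sketch below shows that convention only; whether this codec uses exactly this mapping is an assumption to check against the source, and both function names are hypothetical.

```python
import numpy as np


def normalize_screen_coords(keypoints, w, h):
    """Map pixel (x, y) so that x spans [-1, 1] and the aspect ratio is kept."""
    kpts = np.asarray(keypoints, dtype=np.float32)
    return kpts / w * 2 - np.array([1.0, h / w], dtype=np.float32)


def denormalize_screen_coords(normalized, w, h):
    """Inverse of normalize_screen_coords."""
    norm = np.asarray(normalized, dtype=np.float32)
    return (norm + np.array([1.0, h / w], dtype=np.float32)) * w / 2


# Corners and center of a 1000x500 image
pixels = np.array([[0.0, 0.0], [1000.0, 500.0], [500.0, 250.0]])
norm = normalize_screen_coords(pixels, w=1000, h=500)
restored = denormalize_screen_coords(norm, w=1000, h=500)
```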

class mmpose.codecs.RegressionLabel(input_size: Tuple[int, int])[source]

Generate keypoint coordinates.

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • image size: [w, h]

Encoded:

  • keypoint_labels (np.ndarray): The normalized regression labels in

    shape (N, K, D) where D is 2 for 2d coordinates

  • keypoint_weights (np.ndarray): The target weights in shape (N, K)

Parameters

input_size (tuple) – Input image size in [w, h]

decode(encoded: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode keypoint coordinates from normalized space to input image space.

Parameters

encoded (np.ndarray) – Coordinates in shape (N, K, D)

Returns

  • keypoints (np.ndarray): Decoded coordinates in shape (N, K, D)

  • scores (np.ndarray): The keypoint scores in shape (N, K).

    It usually represents the confidence of the keypoint prediction

Return type

tuple

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None) dict[source]

Encoding keypoints from input image space to normalized space.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D)

  • keypoints_visible (np.ndarray) – Keypoint visibilities in shape (N, K)

Returns

  • keypoint_labels (np.ndarray): The normalized regression labels in

    shape (N, K, D) where D is 2 for 2d coordinates

  • keypoint_weights (np.ndarray): The target weights in shape

    (N, K)

Return type

dict
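
A minimal sketch of the normalization round trip, assuming labels are obtained by dividing pixel coordinates by (w, h). Zeroing the weight of out-of-image keypoints is an additional assumption, and the function names are hypothetical.

```python
import numpy as np


def encode_regression(keypoints, keypoints_visible, input_size):
    """Hypothetical helper: divide pixel coordinates by (w, h)."""
    w, h = input_size
    labels = keypoints.astype(np.float32) / np.array([w, h], dtype=np.float32)
    # Zero the weight of keypoints that fall outside the image (an assumption)
    inside = ((labels >= 0) & (labels <= 1)).all(axis=-1)
    weights = np.where(inside, keypoints_visible, 0).astype(np.float32)
    return dict(keypoint_labels=labels, keypoint_weights=weights)


def decode_regression(encoded, input_size):
    """Hypothetical helper: multiply normalized coordinates back by (w, h)."""
    w, h = input_size
    return encoded * np.array([w, h], dtype=np.float32)


kpts = np.array([[[96.0, 128.0], [240.0, 64.0]]])   # (N=1, K=2, D=2)
vis = np.array([[1, 1]])
enc = encode_regression(kpts, vis, input_size=(192, 256))
dec = decode_regression(enc["keypoint_labels"], input_size=(192, 256))
```

The second keypoint has x=240 in a 192-pixel-wide input, so its normalized x is 1.25 and its weight is set to zero in this sketch.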

class mmpose.codecs.SPR(input_size: Tuple[int, int], heatmap_size: Tuple[int, int], sigma: Optional[Union[float, Tuple[float]]] = None, generate_keypoint_heatmaps: bool = False, root_type: str = 'kpt_center', minimal_diagonal_length: Union[int, float] = 5, background_weight: float = 0.1, decode_nms_kernel: int = 5, decode_max_instances: int = 30, decode_thr: float = 0.01)[source]

Encode/decode keypoints with Structured Pose Representation (SPR).

See the paper Single-stage multi-person pose machines by Nie et al (2017) for details

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • image size: [w, h]

  • heatmap size: [W, H]

Encoded:

  • heatmaps (np.ndarray): The generated heatmap in shape (1, H, W)

    where [W, H] is the heatmap_size. If the keypoint heatmap is generated together, the output heatmap shape is (K+1, H, W)

  • heatmap_weights (np.ndarray): The target weights for heatmaps, which

    have the same shape as heatmaps.

  • displacements (np.ndarray): The dense keypoint displacement in

    shape (K*2, H, W).

  • displacement_weights (np.ndarray): The target weights for displacements,

    which have the same shape as displacements.

Parameters
  • input_size (tuple) – Image size in [w, h]

  • heatmap_size (tuple) – Heatmap size in [W, H]

  • sigma (float or tuple, optional) – The sigma values of the Gaussian heatmaps. If sigma is a tuple, it includes both sigmas for root and keypoint heatmaps. None means the sigmas are computed automatically from the heatmap size. Defaults to None

  • generate_keypoint_heatmaps (bool) – Whether to generate Gaussian heatmaps for each keypoint. Defaults to False

  • root_type (str) –

    The method to generate the instance root. Options are:

    • 'kpt_center': Average coordinate of all visible keypoints.

    • 'bbox_center': Center point of bounding boxes outlined by

      all visible keypoints.

    Defaults to 'kpt_center'

  • minimal_diagonal_length (int or float) – The threshold of diagonal length of instance bounding box. Small instances will not be used in training. Defaults to 5

  • background_weight (float) – Loss weight of background pixels. Defaults to 0.1

  • decode_thr (float) – The threshold of keypoint response value in heatmaps. Defaults to 0.01

  • decode_nms_kernel (int) – The kernel size of the NMS during decoding, which should be an odd integer. Defaults to 5

  • decode_max_instances (int) – The maximum number of instances to decode. Defaults to 30

decode(heatmaps: torch.Tensor, displacements: torch.Tensor) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode the keypoint coordinates from heatmaps and displacements. The decoded keypoint coordinates are in the input image space.

Parameters
  • heatmaps (Tensor) – Encoded root and keypoints (optional) heatmaps in shape (1, H, W) or (K+1, H, W)

  • displacements (Tensor) – Encoded keypoints displacement fields in shape (K*D, H, W)

Returns

  • keypoints (Tensor): Decoded keypoint coordinates in shape

    (N, K, D)

  • scores (tuple):
    • root_scores (Tensor): The root scores in shape (N, )

    • keypoint_scores (Tensor): The keypoint scores in

      shape (N, K). If keypoint heatmaps are not generated, keypoint_scores will be None

Return type

tuple

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None) dict[source]

Encode keypoints into root heatmaps and keypoint displacement fields. Note that the original keypoint coordinates should be in the input image space.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D)

  • keypoints_visible (np.ndarray) – Keypoint visibilities in shape (N, K)

Returns

  • heatmaps (np.ndarray): The generated heatmap in shape

    (1, H, W) where [W, H] is the heatmap_size. If keypoint heatmaps are generated together, the shape is (K+1, H, W)

  • heatmap_weights (np.ndarray): The pixel-wise weights for heatmaps,

    which have the same shape as heatmaps

  • displacements (np.ndarray): The generated displacement fields in

    shape (K*D, H, W). The vector at each pixel represents the displacement from this pixel to the keypoints of the associated instance.

  • displacement_weights (np.ndarray): The pixel-wise weights for

    displacements, which have the same shape as displacements

Return type

dict
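
The two `root_type` options and the resulting displacement vector can be sketched for a single instance as follows. This only illustrates the definitions given above; `instance_root` is a hypothetical name, and the dense per-pixel field that the codec produces is not reproduced here.

```python
import numpy as np


def instance_root(keypoints, visible, root_type="kpt_center"):
    """Hypothetical helper: root of one instance from its visible keypoints."""
    pts = keypoints[visible > 0]
    if root_type == "kpt_center":
        # Average coordinate of all visible keypoints
        return pts.mean(axis=0)
    if root_type == "bbox_center":
        # Center of the box outlined by all visible keypoints
        return (pts.min(axis=0) + pts.max(axis=0)) / 2
    raise ValueError(f"unknown root_type: {root_type}")


kpts = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 6.0], [100.0, 100.0]])
vis = np.array([1, 1, 1, 0])          # the last keypoint is invisible
root_kpt = instance_root(kpts, vis, "kpt_center")
root_box = instance_root(kpts, vis, "bbox_center")
# Displacement of every keypoint relative to the root, flattened to K*2 values
displacement = (kpts - root_kpt).reshape(-1)
```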

get_keypoint_scores(heatmaps: torch.Tensor, keypoints: torch.Tensor)[source]

Calculate the keypoint scores with keypoints heatmaps and coordinates.

Parameters
  • heatmaps (Tensor) – Keypoint heatmaps in shape (K, H, W)

  • keypoints (Tensor) – Keypoint coordinates in shape (N, K, D)

Returns

Keypoint scores in shape (N, K)

Return type

Tensor
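
Conceptually, the score of keypoint k is the value of heatmap k at that keypoint's location. The NumPy sketch below uses nearest-pixel lookup for simplicity; the actual method operates on tensors and may interpolate, so treat this as an assumption-laden illustration with a hypothetical function name.

```python
import numpy as np


def sample_keypoint_scores(heatmaps, keypoints):
    """Hypothetical helper: read heatmap k at the rounded location of keypoint k.

    heatmaps: (K, H, W); keypoints: (N, K, 2) in heatmap coordinates.
    Returns scores of shape (N, K).
    """
    K, H, W = heatmaps.shape
    xs = np.clip(np.rint(keypoints[..., 0]).astype(int), 0, W - 1)
    ys = np.clip(np.rint(keypoints[..., 1]).astype(int), 0, H - 1)
    # Index (1, K) against (N, K) coordinates; broadcasting yields (N, K)
    return heatmaps[np.arange(K)[None, :], ys, xs]


hm = np.zeros((2, 8, 8), dtype=np.float32)
hm[0, 3, 5] = 0.8
hm[1, 6, 1] = 0.6
kp = np.array([[[5.0, 3.0], [1.0, 6.0]],
               [[0.0, 0.0], [1.2, 5.9]]])
scores = sample_keypoint_scores(hm, kp)
```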

class mmpose.codecs.SimCCLabel(input_size: Tuple[int, int], smoothing_type: str = 'gaussian', sigma: Union[float, int, Tuple[float]] = 6.0, simcc_split_ratio: float = 2.0, label_smooth_weight: float = 0.0, normalize: bool = True, use_dark: bool = False, decode_visibility: bool = False, decode_beta: float = 150.0)[source]

Generate keypoint representation via “SimCC” approach. See the paper: SimCC: a Simple Coordinate Classification Perspective for Human Pose Estimation by Li et al (2022) for more details. Old name: SimDR

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • image size: [w, h]

Encoded:

  • keypoint_x_labels (np.ndarray): The generated SimCC label for x-axis.

    The label shape is (N, K, Wx) if smoothing_type=='gaussian' and (N, K) if smoothing_type=='standard', where \(Wx=w*simcc_split_ratio\)

  • keypoint_y_labels (np.ndarray): The generated SimCC label for y-axis.

    The label shape is (N, K, Wy) if smoothing_type=='gaussian' and (N, K) if smoothing_type=='standard', where \(Wy=h*simcc_split_ratio\)

  • keypoint_weights (np.ndarray): The target weights in shape (N, K)

Parameters
  • input_size (tuple) – Input image size in [w, h]

  • smoothing_type (str) – The SimCC label smoothing strategy. Options are 'gaussian' and 'standard'. Defaults to 'gaussian'

  • sigma (float | int | tuple) – The sigma value in the Gaussian SimCC label. Defaults to 6.0

  • simcc_split_ratio (float) – The ratio of the label size to the input size. For example, if the input width is w, the x label size will be \(w*simcc_split_ratio\). Defaults to 2.0

  • label_smooth_weight (float) – Label Smoothing weight. Defaults to 0.0

  • normalize (bool) – Whether to normalize the heatmaps. Defaults to True.

  • use_dark (bool) – Whether to use the DARK post processing. Defaults to False.

  • decode_visibility (bool) – Whether to decode the visibility. Defaults to False.

  • decode_beta (float) – The beta value for decoding visibility. Defaults to 150.0.

Paper: SimCC: a Simple Coordinate Classification Perspective for Human Pose Estimation, https://arxiv.org/abs/2107.03332

decode(simcc_x: numpy.ndarray, simcc_y: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode keypoint coordinates from SimCC representations. The decoded coordinates are in the input image space.

Parameters
  • simcc_x (np.ndarray) – SimCC label for x-axis

  • simcc_y (np.ndarray) – SimCC label for y-axis

Returns

  • keypoints (np.ndarray): Decoded coordinates in shape (N, K, D)

  • scores (np.ndarray): The keypoint scores in shape (N, K).

    It usually represents the confidence of the keypoint prediction

Return type

tuple

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None) dict[source]

Encoding keypoints into SimCC labels. Note that the original keypoint coordinates should be in the input image space.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D)

  • keypoints_visible (np.ndarray) – Keypoint visibilities in shape (N, K)

Returns

  • keypoint_x_labels (np.ndarray): The generated SimCC label for

    x-axis. The label shape is (N, K, Wx) if smoothing_type=='gaussian' and (N, K) if smoothing_type=='standard', where \(Wx=w*simcc_split_ratio\)

  • keypoint_y_labels (np.ndarray): The generated SimCC label for

    y-axis. The label shape is (N, K, Wy) if smoothing_type=='gaussian' and (N, K) if smoothing_type=='standard', where \(Wy=h*simcc_split_ratio\)

  • keypoint_weights (np.ndarray): The target weights in shape

    (N, K)

Return type

dict
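
For the 'gaussian' smoothing type, each axis label is a 1-D distribution over `size * simcc_split_ratio` bins. The sketch below builds such a label for one coordinate and decodes it by argmax. The normalization constant and the use of the smaller of the two peak values as the score are assumptions, and both function names are hypothetical.

```python
import numpy as np


def simcc_gaussian_label(coord, size, split_ratio=2.0, sigma=6.0):
    """Hypothetical helper: 1-D Gaussian label over size * split_ratio bins."""
    num_bins = int(np.around(size * split_ratio))
    bins = np.arange(num_bins, dtype=np.float32)
    mu = coord * split_ratio                      # bin index of the keypoint
    label = np.exp(-((bins - mu) ** 2) / (2 * sigma ** 2))
    # One possible normalization (an assumption)
    return label / (sigma * np.sqrt(2 * np.pi))


def simcc_decode(simcc_x, simcc_y, split_ratio=2.0):
    """Hypothetical helper: argmax on each axis, divided by the split ratio."""
    x = simcc_x.argmax() / split_ratio
    y = simcc_y.argmax() / split_ratio
    score = min(simcc_x.max(), simcc_y.max())
    return np.array([x, y]), score


label_x = simcc_gaussian_label(100.0, size=192)   # Wx = 192 * 2 = 384 bins
label_y = simcc_gaussian_label(60.5, size=256)    # Wy = 256 * 2 = 512 bins
xy, score = simcc_decode(label_x, label_y)
```

With a split ratio of 2.0 the label resolves half-pixel positions, which is why the y coordinate 60.5 is recovered exactly.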

class mmpose.codecs.UDPHeatmap(input_size: Tuple[int, int], heatmap_size: Tuple[int, int], heatmap_type: str = 'gaussian', sigma: float = 2.0, radius_factor: float = 0.0546875, blur_kernel_size: int = 11)[source]

Generate keypoint heatmaps by Unbiased Data Processing (UDP). See the paper: The Devil is in the Details: Delving into Unbiased Data Processing for Human Pose Estimation by Huang et al (2020) for details.

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • image size: [w, h]

  • heatmap size: [W, H]

Encoded:

  • heatmap (np.ndarray): The generated heatmap in shape (C_out, H, W)

    where [W, H] is the heatmap_size, and C_out is the output channel number, which depends on the heatmap_type. If heatmap_type=='gaussian', C_out equals the keypoint number K; if heatmap_type=='combined', C_out equals K*3 (x_offset, y_offset and class label)

  • keypoint_weights (np.ndarray): The target weights in shape (K,)

Parameters
  • input_size (tuple) – Image size in [w, h]

  • heatmap_size (tuple) – Heatmap size in [W, H]

  • heatmap_type (str) –

    The heatmap type to encode the keypoints. Options are:

    • 'gaussian': Gaussian heatmap

    • 'combined': Combination of a binary label map and offset

      maps for X and Y axes.

  • sigma (float) – The sigma value of the Gaussian heatmap when heatmap_type=='gaussian'. Defaults to 2.0

  • radius_factor (float) – The radius factor of the binary label map when heatmap_type=='combined'. The positive region is defined as the neighborhood of the keypoint with the radius \(r=radius_factor*max(W, H)\). Defaults to 0.0546875

  • blur_kernel_size (int) – The Gaussian blur kernel size of the heatmap modulation in DarkPose. Defaults to 11

Paper: The Devil is in the Details: Delving into Unbiased Data Processing for Human Pose Estimation, https://arxiv.org/abs/1911.07524

decode(encoded: numpy.ndarray) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode keypoint coordinates from heatmaps. The decoded keypoint coordinates are in the input image space.

Parameters

encoded (np.ndarray) – Heatmaps in shape (K, H, W)

Returns

  • keypoints (np.ndarray): Decoded keypoint coordinates in shape

    (N, K, D)

  • scores (np.ndarray): The keypoint scores in shape (N, K). It

    usually represents the confidence of the keypoint prediction

Return type

tuple

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None) dict[source]

Encode keypoints into heatmaps. Note that the original keypoint coordinates should be in the input image space.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D)

  • keypoints_visible (np.ndarray) – Keypoint visibilities in shape (N, K)

Returns

  • heatmap (np.ndarray): The generated heatmap in shape

    (C_out, H, W) where [W, H] is the heatmap_size, and C_out is the output channel number, which depends on the heatmap_type. If heatmap_type=='gaussian', C_out equals the keypoint number K; if heatmap_type=='combined', C_out equals K*3 (x_offset, y_offset and class label)

  • keypoint_weights (np.ndarray): The target weights in shape

    (K,)

Return type

dict
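
The quantities named in the parameter descriptions can be computed directly. In the sketch below, the radius and channel-count helpers follow the formulas stated above. The scale factor `(size - 1) / (heatmap_size - 1)` is the unit-length convention from the UDP paper and is an assumption about this codec. All function names are hypothetical.

```python
def udp_scale_factor(input_size, heatmap_size):
    """Unit-length-based ratio between image and heatmap grids (an assumption)."""
    (w, h), (W, H) = input_size, heatmap_size
    return ((w - 1) / (W - 1), (h - 1) / (H - 1))


def combined_radius(heatmap_size, radius_factor=0.0546875):
    """Radius of the positive region: r = radius_factor * max(W, H)."""
    W, H = heatmap_size
    return radius_factor * max(W, H)


def output_channels(num_keypoints, heatmap_type):
    """C_out is K for 'gaussian' and K*3 for 'combined'."""
    return num_keypoints * 3 if heatmap_type == "combined" else num_keypoints


scale = udp_scale_factor((192, 256), (48, 64))
radius = combined_radius((48, 64))       # 0.0546875 * 64
channels = output_channels(17, "combined")
```

For a 48x64 heatmap the default `radius_factor` yields a radius of 3.5 heatmap pixels.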

class mmpose.codecs.VideoPoseLifting(num_keypoints: int, zero_center: bool = True, root_index: Union[int, List] = 0, remove_root: bool = False, save_index: bool = False, reshape_keypoints: bool = True, concat_vis: bool = False, normalize_camera: bool = False)[source]

Generate keypoint coordinates for pose lifter.

Note

  • instance number: N

  • keypoint number: K

  • keypoint dimension: D

  • pose-lifting target dimension: C

Parameters
  • num_keypoints (int) – The number of keypoints in the dataset.

  • zero_center (bool) – Whether to zero-center the target around root. Default: True.

  • root_index (Union[int, List]) – Root keypoint index in the pose. Default: 0.

  • remove_root (bool) – If true, remove the root keypoint from the pose. Default: False.

  • save_index (bool) – If true, store the root position separated from the original pose, only takes effect if remove_root is True. Default: False.

  • reshape_keypoints (bool) – If true, reshape the keypoints into shape (-1, N). Default: True.

  • concat_vis (bool) – If true, concat the visibility item of keypoints. Default: False.

  • normalize_camera (bool) – Whether to normalize camera intrinsics. Default: False.

decode(encoded: numpy.ndarray, target_root: Optional[numpy.ndarray] = None) Tuple[numpy.ndarray, numpy.ndarray][source]

Decode keypoint coordinates from normalized space to input image space.

Parameters
  • encoded (np.ndarray) – Coordinates in shape (N, K, C).

  • target_root (np.ndarray, optional) – The pose-lifting target root coordinate. Default: None.

Returns

  • keypoints (np.ndarray): Decoded coordinates in shape (N, K, C).

  • scores (np.ndarray): The keypoint scores in shape (N, K).

Return type

tuple

encode(keypoints: numpy.ndarray, keypoints_visible: Optional[numpy.ndarray] = None, lifting_target: Optional[numpy.ndarray] = None, lifting_target_visible: Optional[numpy.ndarray] = None, camera_param: Optional[dict] = None) dict[source]

Encoding keypoints from input image space to normalized space.

Parameters
  • keypoints (np.ndarray) – Keypoint coordinates in shape (N, K, D).

  • keypoints_visible (np.ndarray, optional) – Keypoint visibilities in shape (N, K).

  • lifting_target (np.ndarray, optional) – 3d target coordinate in shape (T, K, C).

  • lifting_target_visible (np.ndarray, optional) – Target visibilities in shape (T, K).

  • camera_param (dict, optional) – The camera parameter dictionary.

Returns

Contains the following items:

  • keypoint_labels (np.ndarray): The processed keypoints in shape like (N, K, D) or (K * D, N).

  • keypoint_labels_visible (np.ndarray): The processed keypoints’ weights in shape (N, K, ) or (N-1, K, ).

  • lifting_target_label: The processed target coordinate in shape (K, C) or (K-1, C).

  • lifting_target_weight (np.ndarray): The target weights in shape (K, ) or (K-1, ).

  • trajectory_weights (np.ndarray): The trajectory weights in shape (K, ).

In addition, there are some optional items it may contain:

  • target_root (np.ndarray): The root coordinate of target in shape (C, ). Exists if zero_center is True.

  • target_root_removed (bool): Indicates whether the root of the pose-lifting target is removed. Exists if remove_root is True.

  • target_root_index (int): An integer indicating the index of root. Exists if remove_root and save_index are True.

  • camera_param (dict): The updated camera parameter dictionary. Exists if normalize_camera is True.

Return type

encoded (dict)

class mmpose.codecs.YOLOXPoseAnnotationProcessor(expand_bbox: bool = False, input_size: Optional[Tuple] = None)[source]

Convert dataset annotations to the input format of YOLOX-Pose.

This processor expands bounding boxes and converts category IDs to labels.

Parameters
  • expand_bbox (bool, optional) – Whether to expand the bounding box to include all keypoints. Defaults to False.

  • input_size (tuple, optional) – The size of the input image for the model, formatted as (h, w). This argument is required for the codec in deployment but is not actually used.

encode(keypoints: Optional[numpy.ndarray] = None, keypoints_visible: Optional[numpy.ndarray] = None, bbox: Optional[numpy.ndarray] = None, category_id: Optional[List[int]] = None) Dict[str, numpy.ndarray][source]

Encode keypoints, bounding boxes, and category IDs.

Parameters
  • keypoints (np.ndarray, optional) – Keypoints array. Defaults to None.

  • keypoints_visible (np.ndarray, optional) – Visibility array for keypoints. Defaults to None.

  • bbox (np.ndarray, optional) – Bounding box array. Defaults to None.

  • category_id (List[int], optional) – List of category IDs. Defaults to None.

Returns

Encoded annotations.

Return type

Dict[str, np.ndarray]
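
The effect of `expand_bbox` can be sketched as growing each box until it contains the instance's visible keypoints. The xyxy box layout and the handling of invisible keypoints are assumptions here, and `expand_bbox_to_keypoints` is a hypothetical name rather than part of the processor.

```python
import numpy as np


def expand_bbox_to_keypoints(bbox, keypoints, keypoints_visible):
    """Hypothetical helper: grow each xyxy box to contain its visible keypoints.

    bbox: (N, 4) as x1, y1, x2, y2; keypoints: (N, K, 2); visible: (N, K).
    """
    bbox = bbox.astype(np.float32).copy()
    for n in range(bbox.shape[0]):
        pts = keypoints[n][keypoints_visible[n] > 0]
        if len(pts) == 0:
            continue  # nothing visible: leave the box unchanged
        bbox[n, :2] = np.minimum(bbox[n, :2], pts.min(axis=0))
        bbox[n, 2:] = np.maximum(bbox[n, 2:], pts.max(axis=0))
    return bbox


boxes = np.array([[10, 10, 50, 50]])
kpts = np.array([[[5.0, 20.0], [30.0, 60.0], [200.0, 200.0]]])
vis = np.array([[1, 1, 0]])              # the far-away keypoint is invisible
expanded = expand_bbox_to_keypoints(boxes, kpts, vis)
```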

mmpose.models

backbones

class mmpose.models.backbones.AlexNet(num_classes=-1, init_cfg=None)[source]

AlexNet backbone.

The input for AlexNet is a 224x224 RGB image.

Parameters
  • num_classes (int) – number of classes for classification. The default value is -1, which uses the backbone as a feature extractor without the top classifier.

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Default: None

forward(x)[source]

Forward function.

Parameters

x (Tensor | tuple[Tensor]) – x could be a torch.Tensor or a tuple of torch.Tensor, containing input data for forward computation.

class mmpose.models.backbones.CPM(in_channels, out_channels, feat_channels=128, middle_channels=32, num_stages=6, norm_cfg={'requires_grad': True, 'type': 'BN'}, init_cfg=[{'type': 'Normal', 'std': 0.001, 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

CPM backbone.

Convolutional Pose Machines. More details can be found in the paper.

Parameters
  • in_channels (int) – The input channels of the CPM.

  • out_channels (int) – The output channels of the CPM.

  • feat_channels (int) – Feature channel of each CPM stage.

  • middle_channels (int) – Feature channel of conv after the middle stage.

  • num_stages (int) – Number of stages.

  • norm_cfg (dict) – Dictionary to construct and config norm layer.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='Normal', std=0.001, layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]

Example

>>> from mmpose.models import CPM
>>> import torch
>>> self = CPM(3, 17)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 368, 368)
>>> level_outputs = self.forward(inputs)
>>> for level_output in level_outputs:
...     print(tuple(level_output.shape))
(1, 17, 46, 46)
(1, 17, 46, 46)
(1, 17, 46, 46)
(1, 17, 46, 46)
(1, 17, 46, 46)
(1, 17, 46, 46)

forward(x)[source]

Model forward function.

class mmpose.models.backbones.CSPDarknet(arch='P5', deepen_factor=1.0, widen_factor=1.0, out_indices=(2, 3, 4), frozen_stages=- 1, use_depthwise=False, arch_ovewrite=None, spp_kernal_sizes=(5, 9, 13), conv_cfg=None, norm_cfg={'eps': 0.001, 'momentum': 0.03, 'type': 'BN'}, act_cfg={'type': 'Swish'}, norm_eval=False, init_cfg={'a': 2.23606797749979, 'distribution': 'uniform', 'layer': 'Conv2d', 'mode': 'fan_in', 'nonlinearity': 'leaky_relu', 'type': 'Kaiming'})[source]

CSP-Darknet backbone used in YOLOv5 and YOLOX.

Parameters
  • arch (str) – Architecture of CSP-Darknet, from {P5, P6}. Default: P5.

  • deepen_factor (float) – Depth multiplier, multiply number of blocks in CSP layer by this amount. Default: 1.0.

  • widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Default: 1.0.

  • out_indices (Sequence[int]) – Output from which stages. Default: (2, 3, 4).

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.

  • use_depthwise (bool) – Whether to use depthwise separable convolution. Default: False.

  • arch_ovewrite (list) – Overwrite default arch settings. Default: None.

  • spp_kernal_sizes (tuple[int]) – Sequence of kernel sizes of SPP layers. Default: (5, 9, 13).

  • conv_cfg (dict) – Config dict for convolution layer. Default: None.

  • norm_cfg (dict) – Dictionary to construct and config norm layer. Default: dict(type='BN', momentum=0.03, eps=0.001).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type='Swish').

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Default: the Kaiming initialization config shown in the class signature.
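
The effect of `deepen_factor` and `widen_factor` on a single stage can be sketched as follows. The rounding rule and the base stage values are assumptions for illustration, not the backbone's actual arch settings, and `scale_stage` is a hypothetical name.

```python
def scale_stage(base_channels, base_blocks, widen_factor=1.0, deepen_factor=1.0):
    """Hypothetical helper: apply the width and depth multipliers to one stage."""
    channels = int(base_channels * widen_factor)
    # Keep at least one block per stage (an assumption)
    blocks = max(round(base_blocks * deepen_factor), 1)
    return channels, blocks


# Hypothetical base stage with 256 channels and 9 blocks
full = scale_stage(256, 9)
small = scale_stage(256, 9, widen_factor=0.5, deepen_factor=0.33)
```

Halving the width and using a depth multiplier of 0.33 turns the hypothetical stage into 128 channels with 3 blocks.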

Example

>>> from mmpose.models import CSPDarknet
>>> import torch
>>> self = CSPDarknet()
>>> self.eval()
>>> inputs = torch.rand(1, 3, 416, 416)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
...
(1, 256, 52, 52)
(1, 512, 26, 26)
(1, 1024, 13, 13)

forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

train(mode=True)[source]

Set the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Parameters

mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.

Returns

self

Return type

Module

class mmpose.models.backbones.CSPNeXt(arch: str = 'P5', deepen_factor: float = 1.0, widen_factor: float = 1.0, out_indices: Sequence[int] = (2, 3, 4), frozen_stages: int = - 1, use_depthwise: bool = False, expand_ratio: float = 0.5, arch_ovewrite: Optional[dict] = None, spp_kernel_sizes: Sequence[int] = (5, 9, 13), channel_attention: bool = True, conv_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict]] = None, norm_cfg: Union[mmengine.config.config.ConfigDict, dict] = {'eps': 0.001, 'momentum': 0.03, 'type': 'BN'}, act_cfg: Union[mmengine.config.config.ConfigDict, dict] = {'type': 'SiLU'}, norm_eval: bool = False, init_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict]] = {'a': 2.23606797749979, 'distribution': 'uniform', 'layer': 'Conv2d', 'mode': 'fan_in', 'nonlinearity': 'leaky_relu', 'type': 'Kaiming'})[source]

CSPNeXt backbone used in RTMDet.

Parameters
  • arch (str) – Architecture of CSPNeXt, from {P5, P6}. Defaults to P5.

  • expand_ratio (float) – Ratio to adjust the number of channels of the hidden layer. Defaults to 0.5.

  • deepen_factor (float) – Depth multiplier, multiply number of blocks in CSP layer by this amount. Defaults to 1.0.

  • widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Defaults to 1.0.

  • out_indices (Sequence[int]) – Output from which stages. Defaults to (2, 3, 4).

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.

  • use_depthwise (bool) – Whether to use depthwise separable convolution. Defaults to False.

  • arch_ovewrite (list) – Overwrite default arch settings. Defaults to None.

  • spp_kernel_sizes (tuple[int]) – Sequence of kernel sizes of SPP layers. Defaults to (5, 9, 13).

  • channel_attention (bool) – Whether to add channel attention in each stage. Defaults to True.

  • conv_cfg (ConfigDict or dict, optional) – Config dict for convolution layer. Defaults to None.

  • norm_cfg (ConfigDict or dict) – Dictionary to construct and config norm layer. Defaults to dict(type='BN', momentum=0.03, eps=0.001).

  • act_cfg (ConfigDict or dict) – Config dict for activation layer. Defaults to dict(type='SiLU').

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Defaults to False.

  • init_cfg (ConfigDict or dict or list[dict] or list[ConfigDict]) – Initialization config dict.
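The deepen_factor and widen_factor descriptions above can be illustrated with a small sketch. The helper name scale_stage, the base values, and the rule of keeping at least one block after rounding are assumptions made for illustration; they are not taken from the mmpose source.

```python
from typing import Tuple


def scale_stage(base_channels: int, base_blocks: int,
                widen_factor: float = 1.0,
                deepen_factor: float = 1.0) -> Tuple[int, int]:
    """Scale one stage's width and depth (illustrative helper, not the mmpose API)."""
    # Width multiplier: multiply the number of channels.
    channels = int(base_channels * widen_factor)
    # Depth multiplier: multiply the number of blocks.
    # Assumption: keep at least one block after rounding.
    blocks = max(round(base_blocks * deepen_factor), 1)
    return channels, blocks


# Hypothetical base stage with 256 channels and 6 blocks.
print(scale_stage(256, 6, widen_factor=0.75, deepen_factor=0.67))
```

With both factors left at 1.0 the stage is returned unchanged, which corresponds to the documented defaults.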

forward(x: Tuple[torch.Tensor, ...]) Tuple[torch.Tensor, ...][source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

train(mode=True) None[source]

Set the module in training mode.

This has an effect only on certain modules. See the documentation of the particular modules, e.g. Dropout and BatchNorm, for details of how they behave in training and evaluation mode.

Parameters

mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.

Returns

self

Return type

Module

class mmpose.models.backbones.DSTFormer(in_channels, feat_size=256, depth=5, num_heads=8, mlp_ratio=4, num_keypoints=17, seq_len=243, qkv_bias=True, qk_scale=None, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, att_fuse=True, init_cfg=None)[source]

Dual-stream Spatio-temporal Transformer Module.

Parameters
  • in_channels (int) – Number of input channels.

  • feat_size (int) – Number of feature channels. Default: 256.

  • depth (int) – The network depth. Default: 5.

  • num_heads (int) – Number of heads in multi-head self-attention blocks. Default: 8.

  • mlp_ratio (int, optional) – The expansion ratio of FFN. Default: 4.

  • num_keypoints (int) – Number of keypoints. Default: 17.

  • seq_len (int) – The sequence length. Default: 243.

  • qkv_bias (bool, optional) – If True, add a learnable bias to q, k, v. Default: True.

  • qk_scale (float | None, optional) – Override default qk scale of head_dim ** -0.5 if set. Default: None.

  • drop_rate (float, optional) – Dropout ratio of input. Default: 0.

  • attn_drop_rate (float, optional) – Dropout ratio of attention weight. Default: 0.

  • drop_path_rate (float, optional) – Stochastic depth rate. Default: 0.

  • att_fuse (bool) – Whether to fuse the results of attention blocks. Default: True.

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Default: None.

Example

>>> from mmpose.models import DSTFormer
>>> import torch
>>> self = DSTFormer(in_channels=3)
>>> self.eval()
>>> inputs = torch.rand(1, 2, 17, 3)
>>> level_outputs = self.forward(inputs)
>>> print(tuple(level_outputs.shape))
(1, 2, 17, 512)
forward(x)[source]

Forward function.

Parameters

x (Tensor | tuple[Tensor]) – x could be a torch.Tensor or a tuple of torch.Tensor, containing input data for forward computation.

init_weights()[source]

Initialize the weights in backbone.

class mmpose.models.backbones.HRFormer(extra, in_channels=3, conv_cfg=None, norm_cfg={'requires_grad': True, 'type': 'BN'}, transformer_norm_cfg={'eps': 1e-06, 'type': 'LN'}, norm_eval=False, with_cp=False, zero_init_residual=False, frozen_stages=- 1, init_cfg=[{'type': 'Normal', 'std': 0.001, 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

HRFormer backbone.

This backbone is the implementation of HRFormer: High-Resolution Transformer for Dense Prediction.

Parameters
  • extra (dict) –

    Detailed configuration for each stage of HRNet. There must be 4 stages, and the configuration for each stage must have 5 keys:

    • num_modules (int): The number of HRModule in this stage.

    • num_branches (int): The number of branches in the HRModule.

    • block (str): The type of block.

    • num_blocks (tuple): The number of blocks in each branch. The length must be equal to num_branches.

    • num_channels (tuple): The number of channels in each branch. The length must be equal to num_branches.

  • in_channels (int) – Number of input image channels. Normally 3.

  • conv_cfg (dict) – Dictionary to construct and config conv layer. Default: None.

  • norm_cfg (dict) – Config of norm layer. Default: dict(type='BN', requires_grad=True).

  • transformer_norm_cfg (dict) – Config of transformer norm layer. Use LN by default.

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Default: False.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='Normal', std=0.001, layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]
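The constraints on extra listed above (4 stages, 5 required keys per stage, tuple lengths equal to num_branches) can be checked before building the backbone. The following is a minimal sketch; check_extra is a hypothetical helper, not an mmpose function, and the stage1 to stage4 key names are taken from the Example in this section.

```python
REQUIRED_KEYS = ('num_modules', 'num_branches', 'block',
                 'num_blocks', 'num_channels')


def check_extra(extra):
    """Hypothetical validator for the documented rules; not part of mmpose."""
    for index in range(1, 5):
        name = f'stage{index}'  # key naming follows the Example in this section
        if name not in extra:
            raise ValueError(f'missing configuration for {name}')
        stage = extra[name]
        missing = [key for key in REQUIRED_KEYS if key not in stage]
        if missing:
            raise ValueError(f'{name} is missing keys: {missing}')
        # Both tuples must have one entry per branch.
        for key in ('num_blocks', 'num_channels'):
            if len(stage[key]) != stage['num_branches']:
                raise ValueError(
                    f'{name}: len({key}) must equal num_branches')
```

Calling check_extra(extra) on a valid configuration returns None; a configuration that violates one of the rules raises ValueError with the offending stage name.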

Example

>>> from mmpose.models import HRFormer
>>> import torch
>>> extra = dict(
...     stage1=dict(
...         num_modules=1,
...         num_branches=1,
...         block='BOTTLENECK',
...         num_blocks=(2, ),
...         num_channels=(64, )),
...     stage2=dict(
...         num_modules=1,
...         num_branches=2,
...         block='HRFORMER',
...         window_sizes=(7, 7),
...         num_heads=(1, 2),
...         mlp_ratios=(4, 4),
...         num_blocks=(2, 2),
...         num_channels=(32, 64)),
...     stage3=dict(
...         num_modules=4,
...         num_branches=3,
...         block='HRFORMER',
...         window_sizes=(7, 7, 7),
...         num_heads=(1, 2, 4),
...         mlp_ratios=(4, 4, 4),
...         num_blocks=(2, 2, 2),
...         num_channels=(32, 64, 128)),
...     stage4=dict(
...         num_modules=2,
...         num_branches=4,
...         block='HRFORMER',
...         window_sizes=(7, 7, 7, 7),
...         num_heads=(1, 2, 4, 8),
...         mlp_ratios=(4, 4, 4, 4),
...         num_blocks=(2, 2, 2, 2),
...         num_channels=(32, 64, 128, 256)))
>>> self = HRFormer(extra, in_channels=1)
>>> self.eval()
>>> inputs = torch.rand(1, 1, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 32, 8, 8)
(1, 64, 4, 4)
(1, 128, 2, 2)
(1, 256, 1, 1)
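
The shapes printed in the example above are consistent with an overall stride of 4 on the first branch and a further halving of resolution on each additional branch. The sketch below reproduces that arithmetic; the helper name and the stem_stride=4 value are assumptions inferred from the printed shapes, not read from the implementation.

```python
def expected_branch_shapes(batch, height, width, num_channels, stem_stride=4):
    """Predict per-branch output shapes (illustrative helper, not the mmpose API)."""
    shapes = []
    for branch, channels in enumerate(num_channels):
        # Assumption: branch i runs at stem_stride * 2**i of the input resolution.
        stride = stem_stride * 2 ** branch
        shapes.append((batch, channels, height // stride, width // stride))
    return shapes


# Same input size and stage4 num_channels as in the example above.
for shape in expected_branch_shapes(1, 32, 32, (32, 64, 128, 256)):
    print(shape)
```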
class mmpose.models.backbones.HRNet(extra, in_channels=3, conv_cfg=None, norm_cfg={'type': 'BN'}, norm_eval=False, with_cp=False, zero_init_residual=False, frozen_stages=- 1, init_cfg=[{'type': 'Normal', 'std': 0.001, 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

HRNet backbone.

High-Resolution Representations for Labeling Pixels and Regions

Parameters
  • extra (dict) – Detailed configuration for each stage of HRNet.

  • in_channels (int) – Number of input image channels. Default: 3.

  • conv_cfg (dict) – Dictionary to construct and config conv layer. Default: None.

  • norm_cfg (dict) – Dictionary to construct and config norm layer. Default: dict(type='BN').

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Default: False.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='Normal', std=0.001, layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]

Example

>>> from mmpose.models import HRNet
>>> import torch
>>> extra = dict(
...     stage1=dict(
...         num_modules=1,
...         num_branches=1,
...         block='BOTTLENECK',
...         num_blocks=(4, ),
...         num_channels=(64, )),
...     stage2=dict(
...         num_modules=1,
...         num_branches=2,
...         block='BASIC',
...         num_blocks=(4, 4),
...         num_channels=(32, 64)),
...     stage3=dict(
...         num_modules=4,
...         num_branches=3,
...         block='BASIC',
...         num_blocks=(4, 4, 4),
...         num_channels=(32, 64, 128)),
...     stage4=dict(
...         num_modules=3,
...         num_branches=4,
...         block='BASIC',
...         num_blocks=(4, 4, 4, 4),
...         num_channels=(32, 64, 128, 256)))
>>> self = HRNet(extra, in_channels=1)
>>> self.eval()
>>> inputs = torch.rand(1, 1, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 32, 8, 8)
forward(x)[source]

Forward function.

init_weights()[source]

Initialize the weights in backbone.

property norm1

the normalization layer named “norm1”

Type

nn.Module

property norm2

the normalization layer named “norm2”

Type

nn.Module

train(mode=True)[source]

Convert the model into training mode.

class mmpose.models.backbones.HourglassAENet(downsample_times=4, num_stacks=1, out_channels=34, stage_channels=(256, 384, 512, 640, 768), feat_channels=256, norm_cfg={'requires_grad': True, 'type': 'BN'}, init_cfg=[{'type': 'Normal', 'std': 0.001, 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

Hourglass-AE Network proposed by Newell et al.

Associative Embedding: End-to-End Learning for Joint Detection and Grouping.

More details can be found in the paper.

Parameters
  • downsample_times (int) – Downsample times in a HourglassModule.

  • num_stacks (int) – Number of HourglassModule modules stacked, 1 for Hourglass-52, 2 for Hourglass-104.

  • stage_channels (list[int]) – Feature channel of each sub-module in a HourglassModule.

  • stage_blocks (list[int]) – Number of sub-modules stacked in a HourglassModule.

  • feat_channels (int) – Feature channel of conv after a HourglassModule.

  • norm_cfg (dict) – Dictionary to construct and config norm layer.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='Normal', std=0.001, layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]

Example

>>> from mmpose.models import HourglassAENet
>>> import torch
>>> self = HourglassAENet()
>>> self.eval()
>>> inputs = torch.rand(1, 3, 512, 512)
>>> level_outputs = self.forward(inputs)
>>> for level_output in level_outputs:
...     print(tuple(level_output.shape))
(1, 34, 128, 128)
forward(x)[source]

Model forward function.

class mmpose.models.backbones.HourglassNet(downsample_times=5, num_stacks=2, stage_channels=(256, 256, 384, 384, 384, 512), stage_blocks=(2, 2, 2, 2, 2, 4), feat_channel=256, norm_cfg={'requires_grad': True, 'type': 'BN'}, init_cfg=[{'type': 'Normal', 'std': 0.001, 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

HourglassNet backbone.

Stacked Hourglass Networks for Human Pose Estimation. More details can be found in the paper.

Parameters
  • downsample_times (int) – Downsample times in a HourglassModule.

  • num_stacks (int) – Number of HourglassModule modules stacked, 1 for Hourglass-52, 2 for Hourglass-104.

  • stage_channels (list[int]) – Feature channel of each sub-module in a HourglassModule.

  • stage_blocks (list[int]) – Number of sub-modules stacked in a HourglassModule.

  • feat_channel (int) – Feature channel of conv after a HourglassModule.

  • norm_cfg (dict) – Dictionary to construct and config norm layer.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='Normal', std=0.001, layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]

Example

>>> from mmpose.models import HourglassNet
>>> import torch
>>> self = HourglassNet()
>>> self.eval()
>>> inputs = torch.rand(1, 3, 511, 511)
>>> level_outputs = self.forward(inputs)
>>> for level_output in level_outputs:
...     print(tuple(level_output.shape))
(1, 256, 128, 128)
(1, 256, 128, 128)
forward(x)[source]

Model forward function.

class mmpose.models.backbones.LiteHRNet(extra, in_channels=3, conv_cfg=None, norm_cfg={'type': 'BN'}, norm_eval=False, with_cp=False, init_cfg=[{'type': 'Normal', 'std': 0.001, 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

Lite-HRNet backbone.

Lite-HRNet: A Lightweight High-Resolution Network.

Code adapted from ‘https://github.com/HRNet/Lite-HRNet’.

Parameters
  • extra (dict) – Detailed configuration for each stage of HRNet.

  • in_channels (int) – Number of input image channels. Default: 3.

  • conv_cfg (dict) – Dictionary to construct and config conv layer. Default: None.

  • norm_cfg (dict) – Dictionary to construct and config norm layer. Default: dict(type='BN').

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='Normal', std=0.001, layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]

Example

>>> from mmpose.models import LiteHRNet
>>> import torch
>>> extra=dict(
...     stem=dict(stem_channels=32, out_channels=32, expand_ratio=1),
...     num_stages=3,
...     stages_spec=dict(
...         num_modules=(2, 4, 2),
...         num_branches=(2, 3, 4),
...         num_blocks=(2, 2, 2),
...         module_type=('LITE', 'LITE', 'LITE'),
...         with_fuse=(True, True, True),
...         reduce_ratios=(8, 8, 8),
...         num_channels=(
...             (40, 80),
...             (40, 80, 160),
...             (40, 80, 160, 320),
...         )),
...     with_head=False)
>>> self = LiteHRNet(extra, in_channels=1)
>>> self.eval()
>>> inputs = torch.rand(1, 1, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 40, 8, 8)
forward(x)[source]

Forward function.

train(mode=True)[source]

Convert the model into training mode.

class mmpose.models.backbones.MSPN(unit_channels=256, num_stages=4, num_units=4, num_blocks=[2, 2, 2, 2], norm_cfg={'type': 'BN'}, res_top_channels=64, init_cfg=[{'type': 'Kaiming', 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}, {'type': 'Normal', 'std': 0.01, 'layer': ['Linear']}])[source]

MSPN backbone. Paper ref: Li et al. “Rethinking on Multi-Stage Networks for Human Pose Estimation” (CVPR 2020).

Parameters
  • unit_channels (int) – Number of channels in an upsample unit. Default: 256.

  • num_stages (int) – Number of stages in a multi-stage MSPN. Default: 4.

  • num_units (int) – Number of downsample/upsample units in a single-stage network. Default: 4. Note: Make sure num_units == len(self.num_blocks).

  • num_blocks (list) – Number of bottlenecks in each downsample unit. Default: [2, 2, 2, 2].

  • norm_cfg (dict) – Dictionary to construct and config norm layer. Default: dict(type='BN').

  • res_top_channels (int) – Number of channels of feature from ResNetTop. Default: 64.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='Kaiming', layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm']), dict(type='Normal', std=0.01, layer=['Linear'])]

Example

>>> from mmpose.models import MSPN
>>> import torch
>>> self = MSPN(num_stages=2, num_units=2, num_blocks=[2, 2])
>>> self.eval()
>>> inputs = torch.rand(1, 3, 511, 511)
>>> level_outputs = self.forward(inputs)
>>> for level_output in level_outputs:
...     for feature in level_output:
...         print(tuple(feature.shape))
...
(1, 256, 64, 64)
(1, 256, 128, 128)
(1, 256, 64, 64)
(1, 256, 128, 128)
forward(x)[source]

Model forward function.

init_weights()[source]

Initialize model weights.

class mmpose.models.backbones.MobileNetV2(widen_factor=1.0, out_indices=(7,), frozen_stages=- 1, conv_cfg=None, norm_cfg={'type': 'BN'}, act_cfg={'type': 'ReLU6'}, norm_eval=False, with_cp=False, init_cfg=[{'type': 'Kaiming', 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

MobileNetV2 backbone.

Parameters
  • widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Default: 1.0.

  • out_indices (None or Sequence[int]) – Output from which stages. Default: (7, ).

  • frozen_stages (int) – Stages to be frozen (all param fixed). Default: -1, which means not freezing any parameters.

  • conv_cfg (dict) – Config dict for convolution layer. Default: None, which means using conv2d.

  • norm_cfg (dict) – Config dict for normalization layer. Default: dict(type='BN').

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU6').

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='Kaiming', layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]
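
As a usage sketch, the parameters above could be written as a config-style dictionary. This assumes the backbone is selected by a type key naming the class in a model registry, which is a convention not stated in this section; every value simply repeats a documented default.

```python
# Hypothetical config fragment; values mirror the defaults documented above.
backbone = dict(
    type='MobileNetV2',        # assumption: class looked up by name in a registry
    widen_factor=1.0,          # width multiplier for every layer
    out_indices=(7, ),         # output from which stages
    frozen_stages=-1,          # -1 means not freezing any parameters
    norm_cfg=dict(type='BN'),
    act_cfg=dict(type='ReLU6'),
    norm_eval=False,           # keep norm layers in training mode
    with_cp=False,             # no checkpointing
)
```

Lowering widen_factor in such a fragment shrinks the channel count of each layer, as described for that parameter.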

forward(x)[source]

Forward function.

Parameters

x (Tensor | tuple[Tensor]) – x could be a torch.Tensor or a tuple of torch.Tensor, containing input data for forward computation.

make_layer(out_channels, num_blocks, stride, expand_ratio)[source]

Stack InvertedResidual blocks to build a layer for MobileNetV2.

Parameters
  • out_channels (int) – out_channels of block.

  • num_blocks (int) – number of blocks.

  • stride (int) – stride of the first block. Default: 1

  • expand_ratio (int) – Expand the number of channels of the hidden layer in InvertedResidual by this ratio. Default: 6.

train(mode=True)[source]

Set module status before forward computation.

Parameters

mode (bool) – Whether it is train_mode or test_mode.

class mmpose.models.backbones.MobileNetV3(arch='small', conv_cfg=None, norm_cfg={'type': 'BN'}, out_indices=(- 1,), frozen_stages=- 1, norm_eval=False, with_cp=False, init_cfg=[{'type': 'Kaiming', 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm']}])[source]

MobileNetV3 backbone.

Parameters
  • arch (str) – Architecture of MobileNetV3, from {small, big}. Default: small.

  • conv_cfg (dict) – Config dict for convolution layer. Default: None, which means using conv2d.

  • norm_cfg (dict) – Config dict for normalization layer. Default: dict(type='BN').

  • out_indices (None or Sequence[int]) – Output from which stages. Default: (-1, ), which means output tensors from final stage.

  • frozen_stages (int) – Stages to be frozen (all param fixed). Default: -1, which means not freezing any parameters.

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='Kaiming', layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm'])]

forward(x)[source]

Forward function.

Parameters

x (Tensor | tuple[Tensor]) – x could be a torch.Tensor or a tuple of torch.Tensor, containing input data for forward computation.

train(mode=True)[source]

Set module status before forward computation.

Parameters

mode (bool) – Whether it is train_mode or test_mode.

class mmpose.models.backbones.PyramidVisionTransformer(pretrain_img_size=224, in_channels=3, embed_dims=64, num_stages=4, num_layers=[3, 4, 6, 3], num_heads=[1, 2, 5, 8], patch_sizes=[4, 2, 2, 2], strides=[4, 2, 2, 2], paddings=[0, 0, 0, 0], sr_ratios=[8, 4, 2, 1], out_indices=(0, 1, 2, 3), mlp_ratios=[8, 8, 4, 4], qkv_bias=True, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.1, use_abs_pos_embed=True, norm_after_stage=False, use_conv_ffn=False, act_cfg={'type': 'GELU'}, norm_cfg={'eps': 1e-06, 'type': 'LN'}, convert_weights=True, init_cfg=[{'type': 'TruncNormal', 'std': 0.02, 'layer': ['Linear']}, {'type': 'Constant', 'val': 1, 'layer': ['LayerNorm']}, {'type': 'Kaiming', 'layer': ['Conv2d']}])[source]

Pyramid Vision Transformer (PVT)

Implementation of Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions.

Parameters
  • pretrain_img_size (int | tuple[int]) – The size of the input image used in pretraining. Default: 224.

  • in_channels (int) – Number of input channels. Default: 3.

  • embed_dims (int) – Embedding dimension. Default: 64.

  • num_stages (int) – The number of stages. Default: 4.

  • num_layers (Sequence[int]) – The number of transformer encoder layers in each stage. Default: [3, 4, 6, 3].

  • num_heads (Sequence[int]) – The number of attention heads of each transformer encoder layer. Default: [1, 2, 5, 8].

  • patch_sizes (Sequence[int]) – The patch_size of each patch embedding. Default: [4, 2, 2, 2].

  • strides (Sequence[int]) – The stride of each patch embedding. Default: [4, 2, 2, 2].

  • paddings (Sequence[int]) – The padding of each patch embedding. Default: [0, 0, 0, 0].

  • sr_ratios (Sequence[int]) – The spatial reduction rate of each transformer encoder layer. Default: [8, 4, 2, 1].

  • out_indices (Sequence[int] | int) – Output from which stages. Default: (0, 1, 2, 3).

  • mlp_ratios (Sequence[int]) – The ratio of the mlp hidden dim to the embedding dim of each transformer encoder layer. Default: [8, 8, 4, 4].

  • qkv_bias (bool) – Enable bias for qkv if True. Default: True.

  • drop_rate (float) – Probability of an element to be zeroed. Default: 0.0.

  • attn_drop_rate (float) – The dropout rate for attention layer. Default: 0.0.

  • drop_path_rate (float) – Stochastic depth rate. Default: 0.1.

  • use_abs_pos_embed (bool) – If True, add absolute position embedding to the patch embedding. Default: True.

  • use_conv_ffn (bool) – If True, use Convolutional FFN to replace FFN. Default: False.

  • act_cfg (dict) – The activation config for FFNs. Default: dict(type='GELU').

  • norm_cfg (dict) – Config dict for normalization layer. Default: dict(type='LN', eps=1e-6).

  • pretrained (str, optional) – model pretrained path. Default: None.

  • convert_weights (bool) – The flag indicates whether the pre-trained model is from the original repo. We may need to convert some keys to make it compatible. Default: True.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='TruncNormal', std=0.02, layer=['Linear']), dict(type='Constant', val=1, layer=['LayerNorm']), dict(type='Kaiming', layer=['Conv2d'])]
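
To make the interplay of patch_sizes, strides and paddings concrete, the sketch below computes the feature-map size after each patch embedding using the standard convolution output-size formula. The helper name is hypothetical, and deriving each stage's width as embed_dims * num_heads[i] is an assumption about the implementation rather than something stated in this section.

```python
def pvt_stage_shapes(img_size=224, embed_dims=64, num_heads=(1, 2, 5, 8),
                     patch_sizes=(4, 2, 2, 2), strides=(4, 2, 2, 2),
                     paddings=(0, 0, 0, 0)):
    """Return (channels, height, width) per stage (illustrative helper only)."""
    shapes = []
    size = img_size
    for heads, kernel, stride, padding in zip(
            num_heads, patch_sizes, strides, paddings):
        # Standard convolution output size for a square input.
        size = (size + 2 * padding - kernel) // stride + 1
        # Assumption: stage width scales with the number of heads.
        shapes.append((embed_dims * heads, size, size))
    return shapes


for shape in pvt_stage_shapes():
    print(shape)
```

With the documented defaults each stage halves the resolution of the previous one after an initial stride of 4.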

forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights()[source]

Initialize the weights in backbone.

class mmpose.models.backbones.PyramidVisionTransformerV2(**kwargs)[source]

Implementation of PVTv2: Improved Baselines with Pyramid Vision Transformer.

class mmpose.models.backbones.RSN(unit_channels=256, num_stages=4, num_units=4, num_blocks=[2, 2, 2, 2], num_steps=4, norm_cfg={'type': 'BN'}, res_top_channels=64, expand_times=26, init_cfg=[{'type': 'Kaiming', 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}, {'type': 'Normal', 'std': 0.01, 'layer': ['Linear']}])[source]

Residual Steps Network backbone. Paper ref: Cai et al. “Learning Delicate Local Representations for Multi-Person Pose Estimation” (ECCV 2020).

Parameters
  • unit_channels (int) – Number of channels in an upsample unit. Default: 256.

  • num_stages (int) – Number of stages in a multi-stage RSN. Default: 4.

  • num_units (int) – Number of downsample/upsample units in a single-stage RSN. Default: 4. Note: Make sure num_units == len(self.num_blocks).

  • num_blocks (list) – Number of RSBs (Residual Steps Block) in each downsample unit. Default: [2, 2, 2, 2].

  • num_steps (int) – Number of steps in a RSB. Default: 4.

  • norm_cfg (dict) – Dictionary to construct and config norm layer. Default: dict(type='BN').

  • res_top_channels (int) – Number of channels of feature from ResNet_top. Default: 64.

  • expand_times (int) – Times by which the in_channels are expanded in RSB. Default: 26.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='Kaiming', layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm']), dict(type='Normal', std=0.01, layer=['Linear'])]

Example

>>> from mmpose.models import RSN
>>> import torch
>>> self = RSN(num_stages=2, num_units=2, num_blocks=[2, 2])
>>> self.eval()
>>> inputs = torch.rand(1, 3, 511, 511)
>>> level_outputs = self.forward(inputs)
>>> for level_output in level_outputs:
...     for feature in level_output:
...         print(tuple(feature.shape))
...
(1, 256, 64, 64)
(1, 256, 128, 128)
(1, 256, 64, 64)
(1, 256, 128, 128)
forward(x)[source]

Model forward function.

class mmpose.models.backbones.RegNet(arch, in_channels=3, stem_channels=32, base_channels=32, strides=(2, 2, 2, 2), dilations=(1, 1, 1, 1), out_indices=(3,), style='pytorch', deep_stem=False, avg_down=False, frozen_stages=- 1, conv_cfg=None, norm_cfg={'requires_grad': True, 'type': 'BN'}, norm_eval=False, with_cp=False, zero_init_residual=True, init_cfg=[{'type': 'Kaiming', 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

RegNet backbone.

More details can be found in the paper.

Parameters
  • arch (dict) –

    The parameter of RegNets.

    • w0 (int): Initial width.

    • wa (float): Slope of width.

    • wm (float): Quantization parameter to quantize the width.

    • depth (int): Depth of the backbone.

    • group_w (int): Width of group.

    • bot_mul (float): Bottleneck ratio, i.e. expansion of bottleneck.

  • strides (Sequence[int]) – Strides of the first block of each stage.

  • base_channels (int) – Base channels after stem layer.

  • in_channels (int) – Number of input image channels. Default: 3.

  • dilations (Sequence[int]) – Dilation of each stage.

  • out_indices (Sequence[int]) – Output from which stages.

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: “pytorch”.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.

  • norm_cfg (dict) – Dictionary to construct and config norm layer. Default: dict(type='BN', requires_grad=True).

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – whether to use zero init for last norm layer in resblocks to let them behave as identity. Default: True.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='Kaiming', layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]

Example

>>> from mmpose.models import RegNet
>>> import torch
>>> self = RegNet(
...     arch=dict(
...         w0=88,
...         wa=26.31,
...         wm=2.25,
...         group_w=48,
...         depth=25,
...         bot_mul=1.0),
...     out_indices=(0, 1, 2, 3))
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 96, 8, 8)
(1, 192, 4, 4)
(1, 432, 2, 2)
(1, 1008, 1, 1)
adjust_width_group(widths, bottleneck_ratio, groups)[source]

Adjusts the compatibility of widths and groups.

Parameters
  • widths (list[int]) – Width of each stage.

  • bottleneck_ratio (float) – Bottleneck ratio.

  • groups (int) – number of groups in each stage

Returns

The adjusted widths and groups of each stage.

Return type

tuple(list)
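
One way to read "adjusts the compatibility of widths and groups" is that each stage width is snapped to a multiple of the group width. The sketch below follows that reading. It is not the library's code: the function name is hypothetical, and treating the groups argument as a per-group channel width (the group_w entry of arch) is an assumption.

```python
def adjust_width_group_sketch(widths, bottleneck_ratio, group_width):
    """Snap stage widths to multiples of the group width (illustrative only)."""
    new_widths, new_group_widths = [], []
    for width in widths:
        bottleneck_width = int(width * bottleneck_ratio)
        # A group cannot be wider than the bottleneck itself.
        group = min(group_width, bottleneck_width)
        # Round the bottleneck width to the nearest multiple of the group width.
        bottleneck_width = int(round(bottleneck_width / group) * group)
        new_widths.append(int(bottleneck_width / bottleneck_ratio))
        new_group_widths.append(group)
    return new_widths, new_group_widths


# Illustrative stage widths with bot_mul=1.0 and group_w=48.
print(adjust_width_group_sketch([88, 200, 448, 1000], 1.0, 48))
```

With these illustrative inputs the adjusted widths coincide with the channel counts printed in the RegNet example above (96, 192, 432, 1008).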

forward(x)[source]

Forward function.

static generate_regnet(initial_width, width_slope, width_parameter, depth, divisor=8)[source]

Generates per block width from RegNet parameters.

Parameters
  • initial_width (int) – Initial width of the backbone.

  • width_slope (float) – Slope of the quantized linear function.

  • width_parameter (float) – Parameter used to quantize the width.

  • depth (int) – Depth of the backbone.

  • divisor (int, optional) – The divisor of channels. Defaults to 8.

Returns

A list of widths of each stage and the number of stages.

Return type

list, int
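
The quantized linear rule can be sketched in pure Python as follows. This is an illustration of the idea (a linear width ramp snapped to powers of width_parameter, then rounded to a multiple of divisor), written under the assumption that the method follows the usual RegNet parameterization; the function name is hypothetical and the code is not copied from mmpose.

```python
import math


def generate_regnet_sketch(initial_width, width_slope, width_parameter,
                           depth, divisor=8):
    """Per-block widths from RegNet parameters (illustrative sketch)."""
    widths = []
    for block in range(depth):
        # Continuous width grows linearly with the block index.
        continuous = initial_width + width_slope * block
        # Snap to the nearest power of width_parameter times initial_width.
        exponent = round(
            math.log(continuous / initial_width) / math.log(width_parameter))
        width = initial_width * width_parameter ** exponent
        # Make the width divisible by `divisor`.
        widths.append(int(round(width / divisor) * divisor))
    num_stages = len(set(widths))
    return widths, num_stages


# Same w0, wa, wm and depth as the arch in the RegNet example above.
widths, num_stages = generate_regnet_sketch(88, 26.31, 2.25, 25)
print(sorted(set(widths)), num_stages)
```

Blocks that share a width form one stage, so the number of distinct widths gives the number of stages.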

get_stages_from_blocks(widths)[source]

Gets widths/stage_blocks of network at each stage.

Parameters

widths (list[int]) – Width in each stage.

Returns

width and depth of each stage

Return type

tuple(list)
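
Grouping per-block widths into stages amounts to run-length encoding the width list. A minimal sketch, assuming consecutive blocks with equal width belong to the same stage; the function name and the input list are hypothetical.

```python
from itertools import groupby


def stages_from_block_widths(widths):
    """Collapse per-block widths into (stage widths, stage depths)."""
    stage_widths, stage_depths = [], []
    for width, run in groupby(widths):
        stage_widths.append(width)
        # Depth of a stage is the number of consecutive blocks sharing its width.
        stage_depths.append(len(list(run)))
    return stage_widths, stage_depths


print(stages_from_block_widths([88, 88, 200, 200, 200, 448]))
```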

static quantize_float(number, divisor)[source]

Converts a float to the closest non-zero int divisible by divisor.

Parameters
  • number (int) – Original number to be quantized.

  • divisor (int) – Divisor used to quantize the number.

Returns

Quantized number that is divisible by divisor.

Return type

int
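
A minimal sketch of this quantization, assuming ordinary rounding to the nearest multiple. The lower bound of one divisor is added here only to honor the "non-zero" wording and is an assumption, as is the function name.

```python
def quantize_float_sketch(number, divisor):
    """Round `number` to the nearest multiple of `divisor`, never below `divisor`."""
    nearest = int(round(number / divisor) * divisor)
    # Assumption: clamp to `divisor` so the result is never zero.
    return max(divisor, nearest)


print(quantize_float_sketch(198.0, 8))
```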

class mmpose.models.backbones.ResNeSt(depth, groups=1, width_per_group=4, radix=2, reduction_factor=4, avg_down_stride=True, **kwargs)[source]

ResNeSt backbone.

Please refer to the paper for details.

Parameters
  • depth (int) – Network depth, from {50, 101, 152, 200}.

  • groups (int) – Groups of conv2 in Bottleneck. Default: 1.

  • width_per_group (int) – Width per group of conv2 in Bottleneck. Default: 4.

  • radix (int) – Radix of SplitAttentionConv2d. Default: 2.

  • reduction_factor (int) – Reduction factor of SplitAttentionConv2d. Default: 4.

  • avg_down_stride (bool) – Whether to use average pool for stride in Bottleneck. Default: True.

  • in_channels (int) – Number of input image channels. Default: 3.

  • stem_channels (int) – Output channels of the stem layer. Default: 64.

  • num_stages (int) – Stages of the network. Default: 4.

  • strides (Sequence[int]) – Strides of the first block of each stage. Default: (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).

  • out_indices (Sequence[int]) – Output from which stages. If only one stage is specified, a single tensor (feature map) is returned, otherwise multiple stages are specified, a tuple of tensors will be returned. Default: (3, ).

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.

  • deep_stem (bool) – Replace 7x7 conv in input stem with 3 3x3 conv. Default: False.

  • avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Default: False.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.

  • conv_cfg (dict | None) – The config dict for conv layers. Default: None.

  • norm_cfg (dict) – The config dict for norm layers.

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Default: True.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: ``[dict(type='Kaiming', layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]``

make_res_layer(**kwargs)[source]

Make a ResLayer.

class mmpose.models.backbones.ResNeXt(depth, groups=32, width_per_group=4, **kwargs)[source]

ResNeXt backbone.

Please refer to the paper for details.

Parameters
  • depth (int) – Network depth, from {50, 101, 152}.

  • groups (int) – Groups of conv2 in Bottleneck. Default: 32.

  • width_per_group (int) – Width per group of conv2 in Bottleneck. Default: 4.

  • in_channels (int) – Number of input image channels. Default: 3.

  • stem_channels (int) – Output channels of the stem layer. Default: 64.

  • num_stages (int) – Stages of the network. Default: 4.

  • strides (Sequence[int]) – Strides of the first block of each stage. Default: (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).

  • out_indices (Sequence[int]) – Output from which stages. If only one stage is specified, a single tensor (feature map) is returned, otherwise multiple stages are specified, a tuple of tensors will be returned. Default: (3, ).

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.

  • deep_stem (bool) – Replace 7x7 conv in input stem with 3 3x3 conv. Default: False.

  • avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Default: False.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.

  • conv_cfg (dict | None) – The config dict for conv layers. Default: None.

  • norm_cfg (dict) – The config dict for norm layers.

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Default: True.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: ``[dict(type='Kaiming', layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]``

make_res_layer(**kwargs)[source]

Make a ResLayer.

class mmpose.models.backbones.ResNet(depth, in_channels=3, stem_channels=64, base_channels=64, expansion=None, num_stages=4, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), out_indices=(3,), style='pytorch', deep_stem=False, avg_down=False, frozen_stages=- 1, conv_cfg=None, norm_cfg={'requires_grad': True, 'type': 'BN'}, norm_eval=False, with_cp=False, zero_init_residual=True, init_cfg=[{'type': 'Kaiming', 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

ResNet backbone.

Please refer to the paper for details.

Parameters
  • depth (int) – Network depth, from {18, 34, 50, 101, 152}.

  • in_channels (int) – Number of input image channels. Default: 3.

  • stem_channels (int) – Output channels of the stem layer. Default: 64.

  • base_channels (int) – Middle channels of the first stage. Default: 64.

  • num_stages (int) – Stages of the network. Default: 4.

  • strides (Sequence[int]) – Strides of the first block of each stage. Default: (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).

  • out_indices (Sequence[int]) – Output from which stages. If only one stage is specified, a single tensor (feature map) is returned, otherwise multiple stages are specified, a tuple of tensors will be returned. Default: (3, ).

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.

  • deep_stem (bool) – Replace 7x7 conv in input stem with 3 3x3 conv. Default: False.

  • avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Default: False.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.

  • conv_cfg (dict | None) – The config dict for conv layers. Default: None.

  • norm_cfg (dict) – The config dict for norm layers.

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Default: True.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: ``[dict(type='Kaiming', layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]``

Example

>>> from mmpose.models import ResNet
>>> import torch
>>> self = ResNet(depth=18, out_indices=(0, 1, 2, 3))
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 64, 8, 8)
(1, 128, 4, 4)
(1, 256, 2, 2)
(1, 512, 1, 1)
forward(x)[source]

Forward function.

init_weights()[source]

Initialize the weights in backbone.

make_res_layer(**kwargs)[source]

Make a ResLayer.

property norm1

the normalization layer named “norm1”

Type

nn.Module

train(mode=True)[source]

Convert the model into training mode.

class mmpose.models.backbones.ResNetV1d(**kwargs)[source]

ResNetV1d variant described in Bag of Tricks.

Compared with default ResNet(ResNetV1b), ResNetV1d replaces the 7x7 conv in the input stem with three 3x3 convs. And in the downsampling block, a 2x2 avg_pool with stride 2 is added before conv, whose stride is changed to 1.
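In config terms, the description above corresponds to switching on two options that the ResNet parameter list already documents. The sketch below assumes ResNetV1d does nothing more than enable `deep_stem` and `avg_down`; the variable names are illustrative.

```python
# Backbone config selecting the V1d variant directly.
resnet_v1d_cfg = dict(type='ResNetV1d', depth=50)

# Assumed equivalent: plain ResNet with the two V1d changes enabled.
resnet_equivalent_cfg = dict(
    type='ResNet',
    depth=50,
    deep_stem=True,  # three 3x3 convs replace the 7x7 stem conv
    avg_down=True,   # average pooling before the conv in downsampling blocks
)
```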

class mmpose.models.backbones.SCNet(depth, **kwargs)[source]

SCNet backbone.

Improving Convolutional Networks with Self-Calibrated Convolutions, Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Changhu Wang, Jiashi Feng, IEEE CVPR, 2020. http://mftp.mmcheng.net/Papers/20cvprSCNet.pdf

Parameters
  • depth (int) – Depth of scnet, from {50, 101}.

  • in_channels (int) – Number of input image channels. Normally 3.

  • base_channels (int) – Number of base channels of hidden layer.

  • num_stages (int) – SCNet stages, normally 4.

  • strides (Sequence[int]) – Strides of the first block of each stage.

  • dilations (Sequence[int]) – Dilation of each stage.

  • out_indices (Sequence[int]) – Output from which stages.

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.

  • deep_stem (bool) – Replace 7x7 conv in input stem with 3 3x3 conv

  • avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters.

  • norm_cfg (dict) – Dictionary to construct and config norm layer.

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed.

  • zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity.

Example

>>> from mmpose.models import SCNet
>>> import torch
>>> self = SCNet(depth=50, out_indices=(0, 1, 2, 3))
>>> self.eval()
>>> inputs = torch.rand(1, 3, 224, 224)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 256, 56, 56)
(1, 512, 28, 28)
(1, 1024, 14, 14)
(1, 2048, 7, 7)
class mmpose.models.backbones.SEResNeXt(depth, groups=32, width_per_group=4, **kwargs)[source]

SEResNeXt backbone.

Please refer to the paper for details.

Parameters
  • depth (int) – Network depth, from {50, 101, 152}.

  • groups (int) – Groups of conv2 in Bottleneck. Default: 32.

  • width_per_group (int) – Width per group of conv2 in Bottleneck. Default: 4.

  • se_ratio (int) – Squeeze ratio in SELayer. Default: 16.

  • in_channels (int) – Number of input image channels. Default: 3.

  • stem_channels (int) – Output channels of the stem layer. Default: 64.

  • num_stages (int) – Stages of the network. Default: 4.

  • strides (Sequence[int]) – Strides of the first block of each stage. Default: (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).

  • out_indices (Sequence[int]) – Output from which stages. If only one stage is specified, a single tensor (feature map) is returned, otherwise multiple stages are specified, a tuple of tensors will be returned. Default: (3, ).

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.

  • deep_stem (bool) – Replace 7x7 conv in input stem with 3 3x3 conv. Default: False.

  • avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Default: False.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.

  • conv_cfg (dict | None) – The config dict for conv layers. Default: None.

  • norm_cfg (dict) – The config dict for norm layers.

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Default: True.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: ``[dict(type='Kaiming', layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]``

Example

>>> from mmpose.models import SEResNeXt
>>> import torch
>>> self = SEResNeXt(depth=50, out_indices=(0, 1, 2, 3))
>>> self.eval()
>>> inputs = torch.rand(1, 3, 224, 224)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 256, 56, 56)
(1, 512, 28, 28)
(1, 1024, 14, 14)
(1, 2048, 7, 7)
make_res_layer(**kwargs)[source]

Make a ResLayer.

class mmpose.models.backbones.SEResNet(depth, se_ratio=16, **kwargs)[source]

SEResNet backbone.

Please refer to the paper for details.

Parameters
  • depth (int) – Network depth, from {50, 101, 152}.

  • se_ratio (int) – Squeeze ratio in SELayer. Default: 16.

  • in_channels (int) – Number of input image channels. Default: 3.

  • stem_channels (int) – Output channels of the stem layer. Default: 64.

  • num_stages (int) – Stages of the network. Default: 4.

  • strides (Sequence[int]) – Strides of the first block of each stage. Default: (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).

  • out_indices (Sequence[int]) – Output from which stages. If only one stage is specified, a single tensor (feature map) is returned, otherwise multiple stages are specified, a tuple of tensors will be returned. Default: (3, ).

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.

  • deep_stem (bool) – Replace 7x7 conv in input stem with 3 3x3 conv. Default: False.

  • avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Default: False.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.

  • conv_cfg (dict | None) – The config dict for conv layers. Default: None.

  • norm_cfg (dict) – The config dict for norm layers.

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Default: True.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: ``[dict(type='Kaiming', layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]``

Example

>>> from mmpose.models import SEResNet
>>> import torch
>>> self = SEResNet(depth=50, out_indices=(0, 1, 2, 3))
>>> self.eval()
>>> inputs = torch.rand(1, 3, 224, 224)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 256, 56, 56)
(1, 512, 28, 28)
(1, 1024, 14, 14)
(1, 2048, 7, 7)
make_res_layer(**kwargs)[source]

Make a ResLayer.

class mmpose.models.backbones.ShuffleNetV1(groups=3, widen_factor=1.0, out_indices=(2,), frozen_stages=- 1, conv_cfg=None, norm_cfg={'type': 'BN'}, act_cfg={'type': 'ReLU'}, norm_eval=False, with_cp=False, init_cfg=[{'type': 'Normal', 'std': 0.01, 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'bias': 0.0001, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

ShuffleNetV1 backbone.

Parameters
  • groups (int, optional) – The number of groups to be used in grouped 1x1 convolutions in each ShuffleUnit. Default: 3.

  • widen_factor (float, optional) – Width multiplier - adjusts the number of channels in each layer by this amount. Default: 1.0.

  • out_indices (Sequence[int]) – Output from which stages. Default: (2, )

  • frozen_stages (int) – Stages to be frozen (all param fixed). Default: -1, which means not freezing any parameters.

  • conv_cfg (dict) – Config dict for convolution layer. Default: None, which means using conv2d.

  • norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN’).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type=’ReLU’).

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: ``[dict(type='Normal', std=0.01, layer=['Conv2d']), dict(type='Constant', val=1, bias=0.0001, layer=['_BatchNorm', 'GroupNorm'])]``

forward(x)[source]

Forward function.

Parameters

x (Tensor | tuple[Tensor]) – x could be a torch.Tensor or a tuple of torch.Tensor, containing input data for forward computation.

init_weights(pretrained=None)[source]

Initialize the weights.

make_layer(out_channels, num_blocks, first_block=False)[source]

Stack ShuffleUnit blocks to make a layer.

Parameters
  • out_channels (int) – out_channels of the block.

  • num_blocks (int) – Number of blocks.

  • first_block (bool, optional) – Whether it is the first ShuffleUnit in a sequence of ShuffleUnits. Default: False, which means using the grouped 1x1 convolution.

train(mode=True)[source]

Set module status before forward computation.

Parameters

mode (bool) – Whether it is train_mode or test_mode

class mmpose.models.backbones.ShuffleNetV2(widen_factor=1.0, out_indices=(3,), frozen_stages=- 1, conv_cfg=None, norm_cfg={'type': 'BN'}, act_cfg={'type': 'ReLU'}, norm_eval=False, with_cp=False, init_cfg=[{'type': 'Normal', 'std': 0.01, 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'bias': 0.0001, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

ShuffleNetV2 backbone.

Parameters
  • widen_factor (float) – Width multiplier - adjusts the number of channels in each layer by this amount. Default: 1.0.

  • out_indices (Sequence[int]) – Output from which stages. Default: (3, ).

  • frozen_stages (int) – Stages to be frozen (all param fixed). Default: -1, which means not freezing any parameters.

  • conv_cfg (dict) – Config dict for convolution layer. Default: None, which means using conv2d.

  • norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN’).

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type=’ReLU’).

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: ``[dict(type='Normal', std=0.01, layer=['Conv2d']), dict(type='Constant', val=1, bias=0.0001, layer=['_BatchNorm', 'GroupNorm'])]``

forward(x)[source]

Forward function.

Parameters

x (Tensor | tuple[Tensor]) – x could be a torch.Tensor or a tuple of torch.Tensor, containing input data for forward computation.

init_weights()[source]

Initialize the weights.

train(mode=True)[source]

Set module status before forward computation.

Parameters

mode (bool) – Whether it is train_mode or test_mode

class mmpose.models.backbones.SwinTransformer(pretrain_img_size=224, in_channels=3, embed_dims=96, patch_size=4, window_size=7, mlp_ratio=4, depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24), strides=(4, 2, 2, 2), out_indices=(0, 1, 2, 3), qkv_bias=True, qk_scale=None, patch_norm=True, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.1, use_abs_pos_embed=False, act_cfg={'type': 'GELU'}, norm_cfg={'type': 'LN'}, with_cp=False, convert_weights=False, frozen_stages=- 1, init_cfg=[{'type': 'TruncNormal', 'std': 0.02, 'layer': ['Linear']}, {'type': 'Constant', 'val': 1, 'layer': ['LayerNorm']}])[source]

Swin Transformer. A PyTorch implementation of “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”.

Inspired by https://github.com/microsoft/Swin-Transformer

Parameters
  • pretrain_img_size (int | tuple[int]) – The size of input image when pretrain. Defaults: 224.

  • in_channels (int) – The num of input channels. Defaults: 3.

  • embed_dims (int) – The feature dimension. Default: 96.

  • patch_size (int | tuple[int]) – Patch size. Default: 4.

  • window_size (int) – Window size. Default: 7.

  • mlp_ratio (int) – Ratio of mlp hidden dim to embedding dim. Default: 4.

  • depths (tuple[int]) – Depths of each Swin Transformer stage. Default: (2, 2, 6, 2).

  • num_heads (tuple[int]) – Parallel attention heads of each Swin Transformer stage. Default: (3, 6, 12, 24).

  • strides (tuple[int]) – The patch merging or patch embedding stride of each Swin Transformer stage. (In swin, we set kernel size equal to stride.) Default: (4, 2, 2, 2).

  • out_indices (tuple[int]) – Output from which stages. Default: (0, 1, 2, 3).

  • qkv_bias (bool, optional) – If True, add a learnable bias to query, key, value. Default: True

  • qk_scale (float | None, optional) – Override default qk scale of head_dim ** -0.5 if set. Default: None.

  • patch_norm (bool) – Whether to add a norm layer for patch embedding and patch merging. Default: True.

  • drop_rate (float) – Dropout rate. Defaults: 0.

  • attn_drop_rate (float) – Attention dropout rate. Default: 0.

  • drop_path_rate (float) – Stochastic depth rate. Defaults: 0.1.

  • use_abs_pos_embed (bool) – If True, add absolute position embedding to the patch embedding. Defaults: False.

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type='GELU').

  • norm_cfg (dict) – Config dict for normalization layer at output of backbone. Defaults: dict(type='LN').

  • with_cp (bool, optional) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • pretrained (str, optional) – model pretrained path. Default: None.

  • convert_weights (bool) – The flag indicates whether the pre-trained model is from the original repo. We may need to convert some keys to make it compatible. Default: False.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). Default: -1 (-1 means not freezing any parameters).

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: ``[dict(type='TruncNormal', std=.02, layer=['Linear']), dict(type='Constant', val=1, layer=['LayerNorm'])]``
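To see how `embed_dims`, `strides` and `out_indices` shape the outputs, the sketch below computes the expected per-stage feature shapes. It assumes a square input whose size is divisible by the cumulative stride and that the channel count doubles at every stage, as in the standard Swin design; `swin_stage_shapes` is a hypothetical helper, not part of mmpose.

```python
def swin_stage_shapes(img_size=224, embed_dims=96, strides=(4, 2, 2, 2),
                      out_indices=(0, 1, 2, 3)):
    shapes = []
    total_stride = 1
    for i, stride in enumerate(strides):
        # strides[0] is the patch embedding stride; later entries are
        # the patch merging strides applied before each following stage.
        total_stride *= stride
        channels = embed_dims * 2 ** i  # assumed channel doubling per stage
        size = img_size // total_stride
        if i in out_indices:
            shapes.append((channels, size, size))
    return shapes


shapes = swin_stage_shapes()
```

With the default arguments this yields channel counts 96, 192, 384 and 768 at spatial sizes 56, 28, 14 and 7.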

forward(x)[source]

Forward function.

Parameters

x (Tensor | tuple[Tensor]) – x could be a torch.Tensor or a tuple of torch.Tensor, containing input data for forward computation.

init_weights(pretrained=None)[source]

Initialize the weights in backbone.

Parameters

pretrained (str, optional) – Path to pre-trained weights. Defaults to None.

train(mode=True)[source]

Convert the model into training mode while keeping layers frozen.

class mmpose.models.backbones.TCN(in_channels, stem_channels=1024, num_blocks=2, kernel_sizes=(3, 3, 3), dropout=0.25, causal=False, residual=True, use_stride_conv=False, conv_cfg={'type': 'Conv1d'}, norm_cfg={'type': 'BN1d'}, max_norm=None, init_cfg=[{'type': 'Kaiming', 'mode': 'fan_in', 'nonlinearity': 'relu', 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

TCN backbone.

Temporal Convolutional Networks. More details can be found in the paper.

Parameters
  • in_channels (int) – Number of input channels, which equals to num_keypoints * num_features.

  • stem_channels (int) – Number of feature channels. Default: 1024.

  • num_blocks (int) – Number of basic temporal convolutional blocks. Default: 2.

  • kernel_sizes (Sequence[int]) – Sizes of the convolving kernel of each basic block. Default: (3, 3, 3).

  • dropout (float) – Dropout rate. Default: 0.25.

  • causal (bool) – Use causal convolutions instead of symmetric convolutions (for real-time applications). Default: False.

  • residual (bool) – Use residual connection. Default: True.

  • use_stride_conv (bool) – Use TCN backbone optimized for single-frame batching, i.e. where batches have input length = receptive field, and output length = 1. This implementation replaces dilated convolutions with strided convolutions to avoid generating unused intermediate results. The weights are interchangeable with the reference implementation. Default: False

  • conv_cfg (dict) – dictionary to construct and config conv layer. Default: dict(type=’Conv1d’).

  • norm_cfg (dict) – dictionary to construct and config norm layer. Default: dict(type=’BN1d’).

  • max_norm (float|None) – if not None, the weight of convolution layers will be clipped to have a maximum norm of max_norm.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: ``[dict(type='Kaiming', mode='fan_in', nonlinearity='relu', layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]``

Example

>>> from mmpose.models import TCN
>>> import torch
>>> self = TCN(in_channels=34)
>>> self.eval()
>>> inputs = torch.rand(1, 34, 243)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 1024, 235)
(1, 1024, 217)
forward(x)[source]

Forward function.
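The output lengths in the example (235 and 217 frames from a 243-frame input) can be reproduced with simple arithmetic. The sketch below assumes unpadded, non-causal convolutions with `use_stride_conv=False`, an initial convolution of size `kernel_sizes[0]`, and a dilation that grows as the product of the preceding kernel sizes; `tcn_output_lengths` is a hypothetical helper inferred from those shapes, not a library function.

```python
def tcn_output_lengths(seq_len, kernel_sizes=(3, 3, 3)):
    # The initial (expand) convolution trims kernel_size - 1 frames.
    length = seq_len - (kernel_sizes[0] - 1)
    dilation = 1
    lengths = []
    for prev_kernel, kernel in zip(kernel_sizes[:-1], kernel_sizes[1:]):
        # Dilation of block i is the product of all earlier kernel sizes.
        dilation *= prev_kernel
        # A dilated, unpadded conv trims (kernel - 1) * dilation frames.
        length -= (kernel - 1) * dilation
        lengths.append(length)
    return lengths


lengths = tcn_output_lengths(243)
```

For the default configuration this gives one length per block, matching the temporal dimensions printed in the example.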

class mmpose.models.backbones.V2VNet(input_channels, output_channels, mid_channels=32, init_cfg={'layer': ['Conv3d', 'ConvTranspose3d'], 'std': 0.001, 'type': 'Normal'})[source]

V2VNet.

Please refer to the paper <https://arxiv.org/abs/1711.07399> for details.

Parameters
  • input_channels (int) – Number of channels of the input feature volume.

  • output_channels (int) – Number of channels of the output volume.

  • mid_channels (int) – Input and output channels of the encoder-decoder block.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: ``dict(type='Normal', std=0.001, layer=['Conv3d', 'ConvTranspose3d'])``

forward(x)[source]

Forward function.

class mmpose.models.backbones.VGG(depth, num_classes=- 1, num_stages=5, dilations=(1, 1, 1, 1, 1), out_indices=None, frozen_stages=- 1, conv_cfg=None, norm_cfg=None, act_cfg={'type': 'ReLU'}, norm_eval=False, ceil_mode=False, with_last_pool=True, init_cfg=[{'type': 'Kaiming', 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}, {'type': 'Normal', 'std': 0.01, 'layer': ['Linear']}])[source]

VGG backbone.

Parameters
  • depth (int) – Depth of vgg, from {11, 13, 16, 19}.

  • with_norm (bool) – Use BatchNorm or not.

  • num_classes (int) – number of classes for classification.

  • num_stages (int) – VGG stages, normally 5.

  • dilations (Sequence[int]) – Dilation of each stage.

  • out_indices (Sequence[int]) – Output from which stages. If only one stage is specified, a single tensor (feature map) is returned, otherwise multiple stages are specified, a tuple of tensors will be returned. When it is None, the default behavior depends on whether num_classes is specified. If num_classes <= 0, the default value is (4, ), outputting the last feature map before classifier. If num_classes > 0, the default value is (5, ), outputting the classification score. Default: None.

  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters.

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • ceil_mode (bool) – Whether to use ceil_mode of MaxPool. Default: False.

  • with_last_pool (bool) – Whether to keep the last pooling before classifier. Default: True.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: ``[dict(type='Kaiming', layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm']), dict(type='Normal', std=0.01, layer=['Linear'])]``
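The default rule for `out_indices` described in the parameter list can be written out as a small function. This is only a restatement of the documented behavior; `resolve_out_indices` is a hypothetical name.

```python
def resolve_out_indices(num_classes=-1, out_indices=None):
    # An explicit value always wins.
    if out_indices is not None:
        return tuple(out_indices)
    # With a classifier, output the classification score (index 5);
    # without one, output the last feature map before it (index 4).
    return (5,) if num_classes > 0 else (4,)


backbone_default = resolve_out_indices()
classifier_default = resolve_out_indices(num_classes=1000)
```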

forward(x)[source]

Forward function.

Parameters

x (Tensor | tuple[Tensor]) – x could be a torch.Tensor or a tuple of torch.Tensor, containing input data for forward computation.

train(mode=True)[source]

Set module status before forward computation.

Parameters

mode (bool) – Whether it is train_mode or test_mode

class mmpose.models.backbones.ViPNAS_MobileNetV3(wid=[16, 16, 24, 40, 80, 112, 160], expan=[None, 1, 5, 4, 5, 5, 6], dep=[None, 1, 4, 4, 4, 4, 4], ks=[3, 3, 7, 7, 5, 7, 5], group=[None, 8, 120, 20, 100, 280, 240], att=[None, True, True, False, True, True, True], stride=[2, 1, 2, 2, 2, 1, 2], act=['HSwish', 'ReLU', 'ReLU', 'ReLU', 'HSwish', 'HSwish', 'HSwish'], conv_cfg=None, norm_cfg={'type': 'BN'}, frozen_stages=- 1, norm_eval=False, with_cp=False, init_cfg=[{'type': 'Normal', 'std': 0.001, 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

ViPNAS_MobileNetV3 backbone.

“ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search”. More details can be found in the paper.

Parameters
  • wid (list(int)) – Searched width config for each stage.

  • expan (list(int)) – Searched expansion ratio config for each stage.

  • dep (list(int)) – Searched depth config for each stage.

  • ks (list(int)) – Searched kernel size config for each stage.

  • group (list(int)) – Searched group number config for each stage.

  • att (list(bool)) – Searched attention config for each stage.

  • stride (list(int)) – Stride config for each stage.

  • act (list(dict)) – Activation config for each stage.

  • conv_cfg (dict) – Config dict for convolution layer. Default: None, which means using conv2d.

  • norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN’).

  • frozen_stages (int) – Stages to be frozen (all param fixed). Default: -1, which means not freezing any parameters.

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: ``[dict(type='Normal', std=0.001, layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]``

forward(x)[source]

Forward function.

Parameters

x (Tensor | tuple[Tensor]) – x could be a torch.Tensor or a tuple of torch.Tensor, containing input data for forward computation.

train(mode=True)[source]

Set module status before forward computation.

Parameters

mode (bool) – Whether it is train_mode or test_mode

class mmpose.models.backbones.ViPNAS_ResNet(depth, in_channels=3, num_stages=4, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), out_indices=(3,), style='pytorch', deep_stem=False, avg_down=False, frozen_stages=- 1, conv_cfg=None, norm_cfg={'requires_grad': True, 'type': 'BN'}, norm_eval=False, with_cp=False, zero_init_residual=True, wid=[48, 80, 160, 304, 608], expan=[None, 1, 1, 1, 1], dep=[None, 4, 6, 7, 3], ks=[7, 3, 5, 5, 5], group=[None, 16, 16, 16, 16], att=[None, True, False, True, True], init_cfg=[{'type': 'Normal', 'std': 0.001, 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])[source]

ViPNAS_ResNet backbone.

“ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search”. More details can be found in the paper.

Parameters
  • depth (int) – Network depth, from {18, 34, 50, 101, 152}.

  • in_channels (int) – Number of input image channels. Default: 3.

  • num_stages (int) – Stages of the network. Default: 4.

  • strides (Sequence[int]) – Strides of the first block of each stage. Default: (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).

  • out_indices (Sequence[int]) – Output from which stages. If only one stage is specified, a single tensor (feature map) is returned; if multiple stages are specified, a tuple of tensors is returned. Default: (3, ).

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.

  • deep_stem (bool) – Replace the 7x7 conv in the input stem with three 3x3 convs. Default: False.

  • avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Default: False.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.

  • conv_cfg (dict | None) – The config dict for conv layers. Default: None.

  • norm_cfg (dict) – The config dict for norm layers.

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Default: True.

  • wid (list(int)) – Searched width config for each stage.

  • expan (list(int)) – Searched expansion ratio config for each stage.

  • dep (list(int)) – Searched depth config for each stage.

  • ks (list(int)) – Searched kernel size config for each stage.

  • group (list(int)) – Searched group number config for each stage.

  • att (list(bool)) – Searched attention config for each stage.

  • init_cfg (dict or list[dict], optional) –

    Initialization config dict. Default: [dict(type='Normal', std=0.001, layer=['Conv2d']), dict(type='Constant', val=1, layer=['_BatchNorm', 'GroupNorm'])]

forward(x)[source]

Forward function.

make_res_layer(**kwargs)[source]

Make a ViPNAS ResLayer.

property norm1

the normalization layer named “norm1”

Type

nn.Module

train(mode=True)[source]

Convert the model into training mode.

necks

class mmpose.models.necks.CSPNeXtPAFPN(in_channels: Sequence[int], out_channels: int, out_indices=(0, 1, 2), num_csp_blocks: int = 3, use_depthwise: bool = False, expand_ratio: float = 0.5, upsample_cfg: Union[mmengine.config.config.ConfigDict, dict] = {'mode': 'nearest', 'scale_factor': 2}, conv_cfg: Optional[bool] = None, norm_cfg: Union[mmengine.config.config.ConfigDict, dict] = {'eps': 0.001, 'momentum': 0.03, 'type': 'BN'}, act_cfg: Union[mmengine.config.config.ConfigDict, dict] = {'type': 'Swish'}, init_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict, List[Union[mmengine.config.config.ConfigDict, dict]]]] = {'a': 2.23606797749979, 'distribution': 'uniform', 'layer': 'Conv2d', 'mode': 'fan_in', 'nonlinearity': 'leaky_relu', 'type': 'Kaiming'})[source]

Path Aggregation Network with CSPNeXt blocks. Modified from RTMDet.

Parameters
  • in_channels (Sequence[int]) – Number of input channels per scale.

  • out_channels (int) – Number of output channels (used at each scale)

  • out_indices (Sequence[int]) – Output from which stages.

  • num_csp_blocks (int) – Number of bottlenecks in CSPLayer. Defaults to 3.

  • use_depthwise (bool) – Whether to use depthwise separable convolution in blocks. Defaults to False.

  • expand_ratio (float) – Ratio to adjust the number of channels of the hidden layer. Default: 0.5

  • upsample_cfg (dict) – Config dict for interpolate layer. Default: dict(scale_factor=2, mode=’nearest’)

  • conv_cfg (dict, optional) – Config dict for convolution layer. Default: None, which means using conv2d.

  • norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN’)

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type=’Swish’)

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Default: None.

forward(inputs: Tuple[torch.Tensor, ...]) Tuple[torch.Tensor, ...][source]
Parameters

inputs (tuple[Tensor]) – input features.

Returns

CSPNeXtPAFPN features.

Return type

tuple[Tensor]

class mmpose.models.necks.ChannelMapper(in_channels: List[int], out_channels: int, kernel_size: int = 3, conv_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict]] = None, norm_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict]] = None, act_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict]] = {'type': 'ReLU'}, num_outs: Optional[int] = None, bias: Union[bool, str] = 'auto', init_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict, List[Union[mmengine.config.config.ConfigDict, dict]]]] = {'distribution': 'uniform', 'layer': 'Conv2d', 'type': 'Xavier'})[source]

Channel Mapper to reduce/increase channels of backbone features.

Parameters
  • in_channels (List[int]) – Number of input channels per scale.

  • out_channels (int) – Number of output channels (used at each scale).

  • kernel_size (int, optional) – kernel_size for reducing channels (used at each scale). Default: 3.

  • conv_cfg (ConfigDict or dict, optional) – Config dict for convolution layer. Default: None.

  • norm_cfg (ConfigDict or dict, optional) – Config dict for normalization layer. Default: None.

  • act_cfg (ConfigDict or dict, optional) – Config dict for activation layer in ConvModule. Default: dict(type=’ReLU’).

  • num_outs (int, optional) – Number of output feature maps. There would be extra_convs when num_outs larger than the length of in_channels.

  • init_cfg (ConfigDict or dict or list[ConfigDict or dict], optional) – Initialization config dict. Default: dict(type='Xavier', layer='Conv2d', distribution='uniform').

Example

>>> import torch
>>> in_channels = [2, 3, 5, 7]
>>> scales = [340, 170, 84, 43]
>>> inputs = [torch.rand(1, c, s, s)
...           for c, s in zip(in_channels, scales)]
>>> self = ChannelMapper(in_channels, 11, 3).eval()
>>> outputs = self.forward(inputs)
>>> for i in range(len(outputs)):
...     print(f'outputs[{i}].shape = {outputs[i].shape}')
outputs[0].shape = torch.Size([1, 11, 340, 340])
outputs[1].shape = torch.Size([1, 11, 170, 170])
outputs[2].shape = torch.Size([1, 11, 84, 84])
outputs[3].shape = torch.Size([1, 11, 43, 43])
forward(inputs: Tuple[torch.Tensor]) Tuple[torch.Tensor][source]

Forward function.

class mmpose.models.necks.FPN(in_channels, out_channels, num_outs, start_level=0, end_level=- 1, add_extra_convs=False, relu_before_extra_convs=False, no_norm_on_lateral=False, conv_cfg=None, norm_cfg=None, act_cfg=None, upsample_cfg={'mode': 'nearest'})[source]

Feature Pyramid Network.

This is an implementation of paper Feature Pyramid Networks for Object Detection.

Parameters
  • in_channels (list[int]) – Number of input channels per scale.

  • out_channels (int) – Number of output channels (used at each scale).

  • num_outs (int) – Number of output scales.

  • start_level (int) – Index of the start input backbone level used to build the feature pyramid. Default: 0.

  • end_level (int) – Index of the end input backbone level (exclusive) to build the feature pyramid. Default: -1, which means the last level.

  • add_extra_convs (bool | str) –

    If bool, it decides whether to add conv layers on top of the original feature maps. Defaults to False. If True, it is equivalent to add_extra_convs=’on_input’. If str, it specifies the source feature map of the extra convs. Only the following options are allowed:

    • ’on_input’: Last feat map of neck inputs (i.e. backbone feature).

    • ’on_lateral’: Last feature map after lateral convs.

    • ’on_output’: The last output feature map after fpn convs.

  • relu_before_extra_convs (bool) – Whether to apply relu before the extra conv. Default: False.

  • no_norm_on_lateral (bool) – Whether to apply norm on lateral. Default: False.

  • conv_cfg (dict) – Config dict for convolution layer. Default: None.

  • norm_cfg (dict) – Config dict for normalization layer. Default: None.

  • act_cfg (dict) – Config dict for activation layer in ConvModule. Default: None.

  • upsample_cfg (dict) – Config dict for interpolate layer. Default: dict(mode=’nearest’).

Example

>>> import torch
>>> in_channels = [2, 3, 5, 7]
>>> scales = [340, 170, 84, 43]
>>> inputs = [torch.rand(1, c, s, s)
...           for c, s in zip(in_channels, scales)]
>>> self = FPN(in_channels, 11, len(in_channels)).eval()
>>> outputs = self.forward(inputs)
>>> for i in range(len(outputs)):
...     print(f'outputs[{i}].shape = {outputs[i].shape}')
outputs[0].shape = torch.Size([1, 11, 340, 340])
outputs[1].shape = torch.Size([1, 11, 170, 170])
outputs[2].shape = torch.Size([1, 11, 84, 84])
outputs[3].shape = torch.Size([1, 11, 43, 43])
forward(inputs)[source]

Forward function.

init_weights()[source]

Initialize model weights.

class mmpose.models.necks.FeatureMapProcessor(select_index: Optional[Union[int, Tuple[int]]] = None, concat: bool = False, scale_factor: float = 1.0, apply_relu: bool = False, align_corners: bool = False)[source]

A PyTorch module for selecting, concatenating, and rescaling feature maps.

Parameters
  • select_index (Optional[Union[int, Tuple[int]]], optional) – Index or indices of feature maps to select. Defaults to None, which means all feature maps are used.

  • concat (bool, optional) – Whether to concatenate the selected feature maps. Defaults to False.

  • scale_factor (float, optional) – The scaling factor to apply to the feature maps. Defaults to 1.0.

  • apply_relu (bool, optional) – Whether to apply ReLU on input feature maps. Defaults to False.

  • align_corners (bool, optional) – Whether to align corners when resizing the feature maps. Defaults to False.

forward(inputs: Union[torch.Tensor, Sequence[torch.Tensor]]) Union[torch.Tensor, List[torch.Tensor]][source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmpose.models.necks.GlobalAveragePooling[source]

Global Average Pooling neck.

Note that we use view to remove extra channel after pooling. We do not use squeeze as it will also remove the batch dimension when the tensor has a batch dimension of size 1, which can lead to unexpected errors.
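The difference between the two reshaping strategies can be seen on shapes alone. The sketch below is a plain-Python model of the shape arithmetic (the helper names `squeeze_shape` and `view_shape` are hypothetical and not part of MMPose); it does not touch real tensors.

```python
def squeeze_shape(shape):
    """Shape after dropping every size-1 dimension (the effect of squeeze())."""
    return tuple(d for d in shape if d != 1)


def view_shape(shape):
    """Shape after flattening everything except the leading batch dimension."""
    rest = 1
    for d in shape[1:]:
        rest *= d
    return (shape[0], rest)


# (N, C, 1, 1) after global pooling, with a batch of size 1
pooled = (1, 512, 1, 1)
print(squeeze_shape(pooled))  # the batch dimension disappears
print(view_shape(pooled))     # the batch dimension is kept
```

With a batch size larger than 1 both strategies give the same shape, which is why the problem only shows up for single-sample batches.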

forward(inputs)[source]

Forward function.

class mmpose.models.necks.HybridEncoder(encoder_cfg: Union[mmengine.config.config.ConfigDict, dict] = {}, projector: Optional[Union[mmengine.config.config.ConfigDict, dict]] = None, num_encoder_layers: int = 1, in_channels: List[int] = [512, 1024, 2048], feat_strides: List[int] = [8, 16, 32], hidden_dim: int = 256, use_encoder_idx: List[int] = [2], pe_temperature: int = 10000, widen_factor: float = 1.0, deepen_factor: float = 1.0, spe_learnable: bool = False, output_indices: Optional[List[int]] = None, norm_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict]] = {'requires_grad': True, 'type': 'BN'}, act_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict]] = {'inplace': True, 'type': 'SiLU'})[source]

Hybrid encoder neck introduced in RT-DETR by Lyu et al (2023), combining transformer encoders with a Feature Pyramid Network (FPN) and a Path Aggregation Network (PAN).

Parameters
  • encoder_cfg (ConfigType) – Configuration for the transformer encoder.

  • projector (OptConfigType, optional) – Configuration for an optional projector module. Defaults to None.

  • num_encoder_layers (int, optional) – Number of encoder layers. Defaults to 1.

  • in_channels (List[int], optional) – Input channels of feature maps. Defaults to [512, 1024, 2048].

  • feat_strides (List[int], optional) – Strides of feature maps. Defaults to [8, 16, 32].

  • hidden_dim (int, optional) – Hidden dimension of the MLP. Defaults to 256.

  • use_encoder_idx (List[int], optional) – Indices of encoder layers to use. Defaults to [2].

  • pe_temperature (int, optional) – Positional encoding temperature. Defaults to 10000.

  • widen_factor (float, optional) – Expansion factor for CSPRepLayer. Defaults to 1.0.

  • deepen_factor (float, optional) – Depth multiplier for CSPRepLayer. Defaults to 1.0.

  • spe_learnable (bool, optional) – Whether positional encoding is learnable. Defaults to False.

  • output_indices (Optional[List[int]], optional) – Indices of output layers. Defaults to None.

  • norm_cfg (OptConfigType, optional) – Configuration for normalization layers. Defaults to Batch Normalization.

  • act_cfg (OptConfigType, optional) – Configuration for activation layers. Defaults to SiLU (Swish) with in-place operation.

forward(inputs: Tuple[torch.Tensor]) Tuple[torch.Tensor][source]

Forward function.

switch_to_deploy(test_cfg)[source]

Switch to deploy mode.

class mmpose.models.necks.PoseWarperNeck(in_channels, out_channels, inner_channels, deform_groups=17, dilations=(3, 6, 12, 18, 24), trans_conv_kernel=1, res_blocks_cfg=None, offsets_kernel=3, deform_conv_kernel=3, in_index=0, input_transform=None, freeze_trans_layer=True, norm_eval=False, im2col_step=80)[source]

PoseWarper neck.

“Learning temporal pose estimation from sparsely-labeled videos”.

Parameters
  • in_channels (int) – Number of input channels from backbone

  • out_channels (int) – Number of output channels

  • inner_channels (int) – Number of intermediate channels of the res block

  • deform_groups (int) – Number of groups in the deformable conv

  • dilations (list|tuple) – different dilations of the offset conv layers

  • trans_conv_kernel (int) – the kernel of the trans conv layer, which is used to get heatmap from the output of backbone. Default: 1

  • res_blocks_cfg (dict|None) –

    config of residual blocks. If None, use the default values. If not None, it should contain the following keys:

    • block (str): the type of residual block, Default: ‘BASIC’.

    • num_blocks (int): the number of blocks, Default: 20.

  • offsets_kernel (int) – the kernel of offset conv layer.

  • deform_conv_kernel (int) – the kernel of deformable conv layer.

  • in_index (int|Sequence[int]) – Input feature index. Default: 0

  • input_transform (str|None) –

    Transformation type of input features. Options: ‘resize_concat’, ‘multiple_select’, None. Default: None.

    • ’resize_concat’: Multiple feature maps will be resized to the same size as the first one and then concatenated together. Usually used in FCN head of HRNet.

    • ’multiple_select’: Multiple feature maps will be bundled into a list and passed into the decode head.

    • None: Only one select feature map is allowed.

  • freeze_trans_layer (bool) – Whether to freeze the transition layer (stop grad and set eval mode). Default: True.

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • im2col_step (int) – the argument im2col_step in deformable conv, Default: 80.

forward(inputs, frame_weight)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

train(mode=True)[source]

Convert the model into training mode.

class mmpose.models.necks.YOLOXPAFPN(in_channels, out_channels, num_csp_blocks=3, use_depthwise=False, upsample_cfg={'mode': 'nearest', 'scale_factor': 2}, conv_cfg=None, norm_cfg={'eps': 0.001, 'momentum': 0.03, 'type': 'BN'}, act_cfg={'type': 'Swish'}, init_cfg={'a': 2.23606797749979, 'distribution': 'uniform', 'layer': 'Conv2d', 'mode': 'fan_in', 'nonlinearity': 'leaky_relu', 'type': 'Kaiming'})[source]

Path Aggregation Network used in YOLOX.

Parameters
  • in_channels (List[int]) – Number of input channels per scale.

  • out_channels (int) – Number of output channels (used at each scale)

  • num_csp_blocks (int) – Number of bottlenecks in CSPLayer. Default: 3

  • use_depthwise (bool) – Whether to use depthwise separable convolution in blocks. Default: False

  • upsample_cfg (dict) – Config dict for interpolate layer. Default: dict(scale_factor=2, mode=’nearest’)

  • conv_cfg (dict, optional) – Config dict for convolution layer. Default: None, which means using conv2d.

  • norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN’)

  • act_cfg (dict) – Config dict for activation layer. Default: dict(type=’Swish’)

  • init_cfg (dict or list[dict], optional) – Initialization config dict. Default: None.

forward(inputs)[source]
Parameters

inputs (tuple[Tensor]) – input features.

Returns

YOLOXPAFPN features.

Return type

tuple[Tensor]

detectors

heads

losses

misc

class mmpose.models.utils.CSPLayer(in_channels: int, out_channels: int, expand_ratio: float = 0.5, num_blocks: int = 1, add_identity: bool = True, use_depthwise: bool = False, use_cspnext_block: bool = False, channel_attention: bool = False, conv_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict]] = None, norm_cfg: Union[mmengine.config.config.ConfigDict, dict] = {'eps': 0.001, 'momentum': 0.03, 'type': 'BN'}, act_cfg: Union[mmengine.config.config.ConfigDict, dict] = {'type': 'Swish'}, init_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict, List[Union[mmengine.config.config.ConfigDict, dict]]]] = None)[source]

Cross Stage Partial Layer.

Parameters
  • in_channels (int) – The input channels of the CSP layer.

  • out_channels (int) – The output channels of the CSP layer.

  • expand_ratio (float) – Ratio to adjust the number of channels of the hidden layer. Defaults to 0.5.

  • num_blocks (int) – Number of blocks. Defaults to 1.

  • add_identity (bool) – Whether to add identity in blocks. Defaults to True.

  • use_cspnext_block (bool) – Whether to use CSPNeXt block. Defaults to False.

  • use_depthwise (bool) – Whether to use depthwise separable convolution in blocks. Defaults to False.

  • channel_attention (bool) – Whether to add channel attention in each stage. Defaults to False.

  • conv_cfg (dict, optional) – Config dict for convolution layer. Defaults to None, which means using conv2d.

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type=’BN’)

  • act_cfg (dict) – Config dict for activation layer. Defaults to dict(type=’Swish’)

  • init_cfg (ConfigDict or dict or list[dict] or list[ConfigDict], optional) – Initialization config dict. Defaults to None.

forward(x: torch.Tensor) torch.Tensor[source]

Forward function.

class mmpose.models.utils.DetrTransformerEncoder(num_layers: int, layer_cfg: Union[mmengine.config.config.ConfigDict, dict], num_cp: int = - 1, init_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict]] = None)[source]

Encoder of DETR.

Parameters
  • num_layers (int) – Number of encoder layers.

  • layer_cfg (ConfigDict or dict) – the config of each encoder layer. All the layers will share the same config.

  • num_cp (int) – Number of checkpointing blocks in encoder layer. Defaults to -1.

  • init_cfg (ConfigDict or dict, optional) – the config to control the initialization. Defaults to None.

forward(query: torch.Tensor, query_pos: torch.Tensor, key_padding_mask: torch.Tensor, **kwargs) torch.Tensor[source]

Forward function of encoder.

Parameters
  • query (Tensor) – Input queries of encoder, has shape (bs, num_queries, dim).

  • query_pos (Tensor) – The positional embeddings of the queries, has shape (bs, num_queries, dim).

  • key_padding_mask (Tensor) – The key_padding_mask of self_attn input. ByteTensor, has shape (bs, num_queries).

Returns

Has shape (bs, num_queries, dim) if batch_first is True, otherwise (num_queries, bs, dim).

Return type

Tensor

class mmpose.models.utils.FrozenBatchNorm2d(n, eps: int = 1e-05)[source]

BatchNorm2d where the batch statistics and the affine parameters are fixed.

Copy-paste from torchvision.ops.misc with eps added before rsqrt, without which models other than torchvision.models.resnet[18,34,50,101] produce NaNs.
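As a rough illustration of why the eps term matters, the per-channel computation can be written for a single scalar value. This is a minimal sketch in plain Python, assuming the usual frozen batch-norm formula (scale from the running variance, shift from the running mean); the function name and argument names are hypothetical and the real module operates on tensors.

```python
import math


def frozen_batch_norm_scalar(x, weight, bias, running_mean, running_var, eps=1e-5):
    """Apply fixed (non-trainable) batch-norm statistics to one value."""
    # eps keeps the denominator positive even when running_var is 0.
    scale = weight / math.sqrt(running_var + eps)
    shift = bias - running_mean * scale
    return x * scale + shift


# A channel whose stored variance is exactly zero still yields a finite output.
y = frozen_batch_norm_scalar(3.0, weight=1.0, bias=0.0,
                             running_mean=1.0, running_var=0.0)
print(math.isfinite(y))
```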

forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmpose.models.utils.GAUEncoder(in_token_dims, out_token_dims, expansion_factor=2, s=128, eps=1e-05, dropout_rate=0.0, drop_path=0.0, act_fn='SiLU', bias=False, pos_enc: str = 'none', spatial_dim: int = 1)[source]

Gated Attention Unit (GAU) Encoder.

Parameters
  • in_token_dims (int) – The input token dimension.

  • out_token_dims (int) – The output token dimension.

  • expansion_factor (int, optional) – The expansion factor of the intermediate token dimension. Defaults to 2.

  • s (int, optional) – The self-attention feature dimension. Defaults to 128.

  • eps (float, optional) – The minimum value in clamp. Defaults to 1e-5.

  • dropout_rate (float, optional) – The dropout rate. Defaults to 0.0.

  • drop_path (float, optional) – The drop path rate. Defaults to 0.0.

  • act_fn (str, optional) –

    The activation function which should be one of the following options:

    • ’ReLU’: ReLU activation.

    • ’SiLU’: SiLU activation.

    Defaults to ‘SiLU’.

  • bias (bool, optional) – Whether to use bias in linear layers. Defaults to False.

  • pos_enc (str, optional) – The type of position encoding to use. Defaults to ‘none’.

  • spatial_dim (int, optional) – The spatial dimension of inputs. Defaults to 1.

Reference:

Transformer Quality in Linear Time

forward(x, mask=None, pos_enc=None)[source]

Forward function.

class mmpose.models.utils.PatchEmbed(in_channels=3, embed_dims=768, conv_type='Conv2d', kernel_size=16, stride=16, padding='corner', dilation=1, bias=True, norm_cfg=None, input_size=None, init_cfg=None)[source]

Image to Patch Embedding.

We use a conv layer to implement PatchEmbed.

Parameters
  • in_channels (int) – The num of input channels. Default: 3

  • embed_dims (int) – The dimensions of embedding. Default: 768

  • conv_type (str) – The config dict for embedding conv layer type selection. Default: “Conv2d”.

  • kernel_size (int) – The kernel_size of embedding conv. Default: 16.

  • stride (int) – The stride of the embedding conv. Default: 16. If None, it is set to kernel_size.

  • padding (int | tuple | string) – The padding length of embedding conv. When it is a string, it means the mode of adaptive padding, support “same” and “corner” now. Default: “corner”.

  • dilation (int) – The dilation rate of embedding conv. Default: 1.

  • bias (bool) – Bias of embed conv. Default: True.

  • norm_cfg (dict, optional) – Config dict for normalization layer. Default: None.

  • input_size (int | tuple | None) – The size of input, which will be used to calculate the out size. Only works when dynamic_size is False. Default: None.

  • init_cfg (mmcv.ConfigDict, optional) – The Config for initialization. Default: None.

forward(x)[source]
Parameters

x (Tensor) – Has shape (B, C, H, W). In most cases, C is 3.

Returns

Contains merged results and its spatial shape.

  • x (Tensor): Has shape (B, out_h * out_w, embed_dims)

  • out_size (tuple[int]): Spatial shape of x, arranged as (out_h, out_w).

Return type

tuple
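To make the relation between the input size and out_size concrete, the following sketch computes the patch grid in plain Python. It assumes that the adaptive padding modes pad the input just enough for every pixel to be covered by a window, followed by the standard convolution output-size formula; the helper name `patch_grid_size` is hypothetical and the exact padding used by the library is not confirmed here.

```python
import math


def patch_grid_size(input_size, kernel_size=16, stride=16, dilation=1):
    """Return (out_h, out_w) for an (H, W) input under adaptive padding."""
    grid = []
    for length in input_size:
        # Number of windows needed to cover the whole axis.
        n_windows = math.ceil(length / stride)
        # Padding required so that the last window fits (never negative).
        span = (n_windows - 1) * stride + (kernel_size - 1) * dilation + 1
        pad = max(span - length, 0)
        # Standard conv output length on the padded axis.
        grid.append((length + pad - dilation * (kernel_size - 1) - 1) // stride + 1)
    return tuple(grid)


out_h, out_w = patch_grid_size((224, 224))
# The returned x would then have out_h * out_w tokens per sample.
print(out_h, out_w, out_h * out_w)
```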

class mmpose.models.utils.RTMCCBlock(num_token, in_token_dims, out_token_dims, expansion_factor=2, s=128, eps=1e-05, dropout_rate=0.0, drop_path=0.0, attn_type='self-attn', act_fn='SiLU', bias=False, use_rel_bias=True, pos_enc=False)[source]

Gated Attention Unit (GAU) in RTMBlock.

Parameters
  • num_token (int) – The number of tokens.

  • in_token_dims (int) – The input token dimension.

  • out_token_dims (int) – The output token dimension.

  • expansion_factor (int, optional) – The expansion factor of the intermediate token dimension. Defaults to 2.

  • s (int, optional) – The self-attention feature dimension. Defaults to 128.

  • eps (float, optional) – The minimum value in clamp. Defaults to 1e-5.

  • dropout_rate (float, optional) – The dropout rate. Defaults to 0.0.

  • drop_path (float, optional) – The drop path rate. Defaults to 0.0.

  • attn_type (str, optional) –

    Type of attention which should be one of the following options:

    • ’self-attn’: Self-attention.

    • ’cross-attn’: Cross-attention.

    Defaults to ‘self-attn’.

  • act_fn (str, optional) –

    The activation function which should be one of the following options:

    • ’ReLU’: ReLU activation.

    • ’SiLU’: SiLU activation.

    Defaults to ‘SiLU’.

  • bias (bool, optional) – Whether to use bias in linear layers. Defaults to False.

  • use_rel_bias (bool, optional) – Whether to use relative bias. Defaults to True.

  • pos_enc (bool, optional) – Whether to use rotary position embedding. Defaults to False.

Reference:

Transformer Quality in Linear Time

forward(x)[source]

Forward function.

rel_pos_bias(seq_len, k_len=None)[source]

Add relative position bias.

class mmpose.models.utils.RepVGGBlock(in_channels: int, out_channels: int, stride: int = 1, padding: int = 1, dilation: int = 1, groups: int = 1, padding_mode: str = 'zeros', norm_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict]] = {'type': 'BN'}, act_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict]] = {'type': 'ReLU'}, without_branch_norm: bool = True, init_cfg: Optional[Union[mmengine.config.config.ConfigDict, dict]] = None)[source]

A block in RepVGG architecture, supporting optional normalization in the identity branch.

This block consists of 3x3 and 1x1 convolutions, with an optional identity shortcut branch that includes normalization.

Parameters
  • in_channels (int) – The input channels of the block.

  • out_channels (int) – The output channels of the block.

  • stride (int) – The stride of the block. Defaults to 1.

  • padding (int) – The padding of the block. Defaults to 1.

  • dilation (int) – The dilation of the block. Defaults to 1.

  • groups (int) – The groups of the block. Defaults to 1.

  • padding_mode (str) – The padding mode of the block. Defaults to ‘zeros’.

  • norm_cfg (dict) – The config dict for normalization layers. Defaults to dict(type=’BN’).

  • act_cfg (dict) – The config dict for activation layers. Defaults to dict(type=’ReLU’).

  • without_branch_norm (bool) – Whether to skip branch_norm. Defaults to True.

  • init_cfg (dict) – The config dict for initialization. Defaults to None.

forward(x: torch.Tensor) torch.Tensor[source]

Forward pass through the RepVGG block.

The output is the sum of 3x3 and 1x1 convolution outputs, along with the normalized identity branch output, followed by activation.

Parameters

x (Tensor) – The input tensor.

Returns

The output tensor.

Return type

Tensor

get_equivalent_kernel_bias()[source]

Derives the equivalent kernel and bias in a differentiable way.

Returns

Equivalent kernel and bias

Return type

tuple

switch_to_deploy(test_cfg: Optional[Dict] = None)[source]

Switches the block to deployment mode.

In deployment mode, the block uses a single convolution operation derived from the equivalent kernel and bias, replacing the original branches. This reduces computational complexity during inference.
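The idea behind the single equivalent convolution can be checked on a toy case. The sketch below is a simplified, single-channel, plain-Python illustration: it ignores normalization layers and biases (which the real block also has to fold in) and only shows that a 3x3 branch, a 1x1 branch and an identity branch sum to one 3x3 kernel. All helper names are hypothetical.

```python
def conv2d_same(image, kernel):
    """Stride-1, zero-padded, single-channel 2D cross-correlation."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def fuse_kernels(k3, k1, with_identity=False):
    """Pad the 1x1 kernel to 3x3 (centre tap) and add it to the 3x3 kernel."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1[0][0]
    if with_identity:
        fused[1][1] += 1.0  # identity is a 3x3 kernel with a single centre 1
    return fused


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, 0.0, -0.1], [0.2, 0.5, -0.2], [0.1, 0.0, -0.1]]
k1 = [[2.0]]

# Training-time form: three branches summed.
a, b = conv2d_same(image, k3), conv2d_same(image, k1)
branches = [[a[i][j] + b[i][j] + image[i][j] for j in range(3)] for i in range(3)]
# Deploy-time form: one convolution with the fused kernel.
fused = conv2d_same(image, fuse_kernels(k3, k1, with_identity=True))

max_diff = max(abs(branches[i][j] - fused[i][j]) for i in range(3) for j in range(3))
print(max_diff < 1e-9)
```

Because convolution is linear in the kernel, the same argument extends to multi-channel kernels; folding normalization statistics adds a per-channel rescaling and a bias term on top of this.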

class mmpose.models.utils.SinePositionalEncoding(out_channels: int, spatial_dim: int = 1, temperature: int = 100000.0, learnable: bool = False, eval_size: Optional[Union[int, Sequence[int]]] = None)[source]

Sine Positional Encoding Module. This module implements sine positional encoding, which is commonly used in transformer-based models to add positional information to the input sequences. It uses sine and cosine functions to create positional embeddings for each element in the input sequence.

Parameters
  • out_channels (int) – The number of features in the input sequence.

  • temperature (int) – A temperature parameter used to scale the positional encodings. Defaults to 1e5.

  • spatial_dim (int) – The number of spatial dimension of input feature. 1 represents sequence data and 2 represents grid data. Defaults to 1.

  • learnable (bool) – Whether to optimize the frequency base. Defaults to False.

  • eval_size (int, tuple[int], optional) – The fixed spatial size of input features. Defaults to None.

static apply_additional_pos_enc(feature: torch.Tensor, pos_enc: torch.Tensor, spatial_dim: int = 1)[source]

Apply additional positional encoding to input features.

Parameters
  • feature (Tensor) – Input feature tensor.

  • pos_enc (Tensor) – Positional encoding tensor.

  • spatial_dim (int) – Spatial dimension of input features.

static apply_rotary_pos_enc(feature: torch.Tensor, pos_enc: torch.Tensor, spatial_dim: int = 1)[source]

Apply rotary positional encoding to input features.

Parameters
  • feature (Tensor) – Input feature tensor.

  • pos_enc (Tensor) – Positional encoding tensor.

  • spatial_dim (int) – Spatial dimension of input features.

forward(*args, **kwargs)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

generate_pos_encoding(size: Optional[Union[int, Sequence[int]]] = None, position: Optional[torch.Tensor] = None)[source]

Generate positional encoding for input features.

Parameters
  • size (int or tuple[int]) – Size of the input features. Required if position is None.

  • position (Tensor, optional) – Position tensor. Required if size is None.
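For intuition, a sine positional encoding for a single 1-D position can be sketched in plain Python. This follows the common sinusoidal formulation (geometrically spaced frequencies, sine half followed by cosine half); the function name is hypothetical, and the channel layout and frequency spacing used by the module itself may differ.

```python
import math


def sine_pos_encoding(position, out_channels, temperature):
    """Encode one scalar position into `out_channels` sine/cosine features."""
    if out_channels % 2 != 0:
        raise ValueError("out_channels must be even")
    half = out_channels // 2
    # Frequencies decay geometrically from 1 towards 1 / temperature.
    freqs = [temperature ** (-i / half) for i in range(half)]
    angles = [position * f for f in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


# The temperature value here is illustrative only.
enc = sine_pos_encoding(position=3, out_channels=8, temperature=10000.0)
print(len(enc))
```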

mmpose.models.utils.check_and_update_config(neck: Optional[Union[mmengine.config.config.Config, mmengine.config.config.ConfigDict]], head: Union[mmengine.config.config.Config, mmengine.config.config.ConfigDict]) Tuple[Optional[Dict], Dict][source]

Check and update the configuration of the head and neck components.

Parameters
  • neck (Optional[ConfigType]) – Configuration for the neck component.

  • head (ConfigType) – Configuration for the head component.

Returns

Updated configurations for the neck and head components.

Return type

Tuple[Optional[Dict], Dict]

mmpose.models.utils.filter_scores_and_topk(scores, score_thr, topk, results=None)[source]

Filter results using score threshold and topk candidates.

Parameters
  • scores (Tensor) – The scores, shape (num_bboxes, K).

  • score_thr (float) – The score filter threshold.

  • topk (int) – The number of topk candidates.

  • results (dict or list or Tensor, Optional) – The results to which the filtering rule is to be applied. The shape of each item is (num_bboxes, N).

Returns

Filtered results

  • scores (Tensor): The scores after being filtered, shape (num_bboxes_filtered, ).

  • labels (Tensor): The class labels, shape (num_bboxes_filtered, ).

  • anchor_idxs (Tensor): The anchor indexes, shape (num_bboxes_filtered, ).

  • filtered_results (dict or list or Tensor, Optional): The filtered results. The shape of each item is (num_bboxes_filtered, N).

Return type

tuple
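The filtering rule described above can be sketched on nested Python lists instead of tensors. This is an illustrative re-implementation under the assumption that a candidate is kept when its score is strictly greater than score_thr and that the survivors are ranked by descending score; the name `filter_scores_and_topk_sketch` is hypothetical, and the optional `results` argument is left out.

```python
def filter_scores_and_topk_sketch(scores, score_thr, topk):
    """scores: list of rows, shape (num_bboxes, K). Returns three flat lists."""
    # Every (anchor, class) pair above the threshold is a candidate.
    candidates = [
        (score, anchor_idx, label)
        for anchor_idx, row in enumerate(scores)
        for label, score in enumerate(row)
        if score > score_thr
    ]
    # Highest scores first, then keep at most `topk` of them.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept = candidates[:topk]
    kept_scores = [c[0] for c in kept]
    labels = [c[2] for c in kept]
    anchor_idxs = [c[1] for c in kept]
    return kept_scores, labels, anchor_idxs


scores = [[0.9, 0.1], [0.4, 0.8], [0.05, 0.3]]
kept_scores, labels, anchor_idxs = filter_scores_and_topk_sketch(scores, 0.25, 3)
print(kept_scores, labels, anchor_idxs)
```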

mmpose.models.utils.inverse_sigmoid(x: torch.Tensor, eps: float = 0.001) torch.Tensor[source]

Inverse function of sigmoid.

Parameters
  • x (Tensor) – The tensor to do the inverse.

  • eps (float) – Small value used to avoid numerical overflow. Defaults to 1e-3.

Returns

The result of applying the inverse sigmoid function to x; it has the same shape as the input.

Return type

Tensor
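
A scalar sketch of the computation may help show the role of eps. It assumes the input is first clamped to [0, 1] and that both the numerator and the denominator are floored at eps before taking the logarithm; the helper names are hypothetical, and the real function operates on tensors.

```python
import math


def inverse_sigmoid_scalar(x, eps=1e-3):
    """Scalar sketch of log(x / (1 - x)) with clamping (assumed behaviour)."""
    x = min(max(x, 0.0), 1.0)        # keep x inside [0, 1]
    numerator = max(x, eps)          # avoid log(0) when x == 0
    denominator = max(1.0 - x, eps)  # avoid division by zero when x == 1
    return math.log(numerator / denominator)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Away from the clamped ends, the two functions undo each other.
roundtrip = inverse_sigmoid_scalar(sigmoid(0.7))
```

With the clamping in place, inputs of exactly 0 or 1 map to large but finite values instead of infinities.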

mmpose.models.utils.nchw_to_nlc(x)[source]

Flatten [N, C, H, W] shape tensor to [N, L, C] shape tensor.

Parameters

x (Tensor) – The input tensor of shape [N, C, H, W] before conversion.

Returns

The output tensor of shape [N, L, C] after conversion.

Return type

Tensor

mmpose.models.utils.nlc_to_nchw(x, hw_shape)[source]

Convert [N, L, C] shape tensor to [N, C, H, W] shape tensor.

Parameters
  • x (Tensor) – The input tensor of shape [N, L, C] before conversion.

  • hw_shape (Sequence[int]) – The height and width of output feature map.

Returns

The output tensor of shape [N, C, H, W] after conversion.

Return type

Tensor
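
The two layout conversions are inverses of each other when L = H * W. The sketch below illustrates the index mapping on nested Python lists, assuming row-major flattening (token l = h * W + w); the function names are hypothetical, and the real functions operate on tensors.

```python
def nchw_to_nlc_sketch(x):
    """x is a nested list indexed as x[n][c][h][w]."""
    out = []
    for sample in x:
        num_channels = len(sample)
        height, width = len(sample[0]), len(sample[0][0])
        # Token l = h * width + w gathers the C channel values at pixel (h, w).
        tokens = [
            [sample[c][h][w] for c in range(num_channels)]
            for h in range(height)
            for w in range(width)
        ]
        out.append(tokens)
    return out


def nlc_to_nchw_sketch(x, hw_shape):
    """x is a nested list indexed as x[n][l][c]; hw_shape is (H, W)."""
    height, width = hw_shape
    out = []
    for tokens in x:
        assert len(tokens) == height * width, 'L must equal H * W'
        num_channels = len(tokens[0])
        sample = [
            [[tokens[h * width + w][c] for w in range(width)]
             for h in range(height)]
            for c in range(num_channels)
        ]
        out.append(sample)
    return out


# One sample with C=2, H=2, W=3.
feature_map = [[[[1, 2, 3], [4, 5, 6]], [[10, 20, 30], [40, 50, 60]]]]
tokens = nchw_to_nlc_sketch(feature_map)
restored = nlc_to_nchw_sketch(tokens, (2, 3))
```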

mmpose.models.utils.rope(x, dim)[source]

Applies Rotary Position Embedding to input tensor.

Parameters
  • x (torch.Tensor) – Input tensor.

  • dim (int | list[int]) – The spatial dimension(s) to apply rotary position embedding.

Returns

The tensor after applying rotary position embedding.

Return type

torch.Tensor

Reference:

RoFormer: Enhanced Transformer with Rotary Position Embedding
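
As a rough illustration of the idea in the referenced paper, the sketch below rotates channel pairs of each token by a position-dependent angle along a single sequence dimension. The helper name, the base of 10000, the frequency schedule and the pairing of channel i with channel i + half are assumptions for illustration and may differ from what mmpose.models.utils.rope does.

```python
import math


def rope_1d_sketch(x, base=10000.0):
    """x is a list of token vectors (seq_len x channels); channels must be even."""
    half = len(x[0]) // 2
    out = []
    for pos, vec in enumerate(x):
        first, second = vec[:half], vec[half:]
        rotated_first, rotated_second = [], []
        for i in range(half):
            # Channel pair (i, i + half) rotates by an angle that grows with position.
            angle = pos * base ** (-i / half)
            cos, sin = math.cos(angle), math.sin(angle)
            rotated_first.append(first[i] * cos - second[i] * sin)
            rotated_second.append(second[i] * cos + first[i] * sin)
        out.append(rotated_first + rotated_second)
    return out


tokens = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
embedded = rope_1d_sketch(tokens)
```

Because each pair is only rotated, the length of every token vector is preserved, and the token at position 0 is left unchanged.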

mmpose.datasets

class mmpose.datasets.CombinedDataset(metainfo: dict, datasets: list, pipeline: List[Union[dict, Callable]] = [], sample_ratio_factor: Optional[List[float]] = None, **kwargs)[source]

A wrapper of combined dataset.

Parameters
  • metainfo (dict) – The meta information of combined dataset.

  • datasets (list) – The configs of datasets to be combined.

  • pipeline (list, optional) – Processing pipeline. Defaults to [].

  • sample_ratio_factor (list, optional) – A list of sampling ratio factors for each dataset. Defaults to None

full_init()[source]

Fully initialize all sub datasets.

get_data_info(idx: int) dict[source]

Get annotation by index.

Parameters

idx (int) – Global index of CombinedDataset.

Returns

The idx-th annotation of the datasets.

Return type

dict

property metainfo

Get meta information of dataset.

Returns

meta information collected from BaseDataset.METAINFO, annotation file and metainfo argument during instantiation.

Return type

dict

prepare_data(idx: int) Any[source]

Get data processed by self.pipeline. The source dataset depends on the index.

Parameters

idx (int) – The index of data_info.

Returns

Depends on self.pipeline.

Return type

Any
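
One way a global index could be resolved to a sub-dataset and a local index is sketched below. It assumes the sub-datasets are simply concatenated in order and ignores sample_ratio_factor; the function name locate_sample is hypothetical.

```python
def locate_sample(global_idx, subset_lengths):
    """Map a global index to (subset_index, local_index)."""
    total = sum(subset_lengths)
    if not -total <= global_idx < total:
        raise ValueError(
            f'index {global_idx} is out of bounds for length {total}')
    if global_idx < 0:
        global_idx += total  # support negative indexing
    for subset_idx, length in enumerate(subset_lengths):
        if global_idx < length:
            return subset_idx, global_idx
        global_idx -= length  # skip past this sub-dataset
    raise AssertionError('unreachable: index was range-checked above')


# Two sub-datasets holding 3 and 2 samples.
first = locate_sample(0, [3, 2])
crossing = locate_sample(3, [3, 2])
last = locate_sample(-1, [3, 2])
```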

class mmpose.datasets.MultiSourceSampler(dataset: Sized, batch_size: int, source_ratio: List[Union[int, float]], shuffle: bool = True, round_up: bool = True, seed: Optional[int] = None)[source]

Multi-Source Sampler. According to the sampling ratio, sample data from different datasets to form batches.

Parameters
  • dataset (Sized) – The dataset

  • batch_size (int) – Size of mini-batch

  • source_ratio (list[int | float]) – The sampling ratio of different source datasets in a mini-batch

  • shuffle (bool) – Whether shuffle the dataset or not. Defaults to True

  • round_up (bool) – Whether to add extra samples to make the number of samples evenly divisible by the world size. Defaults to True.

  • seed (int, optional) – Random seed. If None, set a random seed. Defaults to None
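
How batch_size might be divided among sources according to source_ratio can be sketched as follows. The rounding policy (truncate each share, then give the remainder to the first source) is an assumption, and split_batch_by_ratio is a hypothetical helper, not part of the sampler's API.

```python
def split_batch_by_ratio(batch_size, source_ratio):
    """Return the number of samples drawn from each source per mini-batch."""
    total = sum(source_ratio)
    per_source = [int(batch_size * ratio / total) for ratio in source_ratio]
    # Give the rounding remainder to the first source so the counts add up.
    per_source[0] = batch_size - sum(per_source[1:])
    return per_source


even_split = split_batch_by_ratio(8, [1, 3])
uneven_split = split_batch_by_ratio(10, [1, 1, 1])
```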

set_epoch(epoch: int) None[source]

Provided for compatibility with the epoch-based runner.

mmpose.datasets.build_dataset(cfg, default_args=None)[source]

Build a dataset from config dict.

Parameters
  • cfg (dict) – Config dict. It should at least contain the key “type”.

  • default_args (dict, optional) – Default initialization arguments. Default: None.

Returns

The constructed dataset.

Return type

Dataset
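
A config dict of the kind described above might look like the following sketch. The paths are hypothetical and depend on the local data layout; the keys follow the dataset parameters documented in this module, and the builder call is left commented out because it requires MMPose to be installed.

```python
# Hypothetical paths; adjust them to the local data layout.
dataset_cfg = dict(
    type='CocoDataset',  # the only key the builder strictly requires
    data_root='data/coco/',
    ann_file='annotations/person_keypoints_val2017.json',
    data_prefix=dict(img='val2017/'),
    data_mode='topdown',
    test_mode=True,
    pipeline=[],
)

# With MMPose installed, the config could then be passed to the builder:
# from mmpose.datasets import build_dataset
# dataset = build_dataset(dataset_cfg)
```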

datasets

class mmpose.datasets.datasets.base.BaseCocoStyleDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

Base class for COCO-style datasets.

Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; while in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. Basedataset can skip loading annotations to save time by setting lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra cycles to get a valid image when Basedataset.prepare_data returns a None img. Default: 1000.

  • sample_interval (int, optional) – The sample interval of the dataset. Default: 1.

filter_data() List[dict][source]

Filter annotations according to filter_cfg. By default, returns the full data_list.

If 'bbox_score_thr' is in filter_cfg, annotations with bbox_score below the threshold bbox_score_thr will be filtered out.
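
The threshold rule can be sketched as a standalone function. Here filter_by_bbox_score is a hypothetical helper, and keeping entries whose score equals the threshold is an assumption, since the description only says that scores below the threshold are dropped.

```python
def filter_by_bbox_score(data_list, filter_cfg=None):
    """Drop entries whose 'bbox_score' is below filter_cfg['bbox_score_thr']."""
    if not filter_cfg or 'bbox_score_thr' not in filter_cfg:
        return data_list  # no filtering configured: return the full list
    thr = filter_cfg['bbox_score_thr']
    return [info for info in data_list if info['bbox_score'] >= thr]


data_list = [
    {'img_id': 1, 'bbox_score': 0.95},
    {'img_id': 2, 'bbox_score': 0.10},
    {'img_id': 3, 'bbox_score': 0.40},
]
kept = filter_by_bbox_score(data_list, dict(bbox_score_thr=0.3))
```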

get_data_info(idx: int) dict[source]

Get data info by index.

Parameters

idx (int) – Index of data info.

Returns

Data info.

Return type

dict

load_data_list() List[dict][source]

Load data list from COCO annotation file or person detection result file.

parse_data_info(raw_data_info: dict) Optional[dict][source]

Parse raw COCO annotation of an instance.

Parameters

raw_data_info (dict) –

Raw data information loaded from ann_file. It should have following contents:

  • 'raw_ann_info': Raw annotation of an instance

  • 'raw_img_info': Raw information of the image that contains the instance

Returns

Parsed instance annotation

Return type

dict | None

prepare_data(idx) Any[source]

Get data processed by self.pipeline.

BaseCocoStyleDataset overrides this method from mmengine.dataset.BaseDataset to add the metainfo into the data_info before it is passed to the pipeline.

Parameters

idx (int) – The index of data_info.

Returns

Depends on self.pipeline.

Return type

Any

class mmpose.datasets.datasets.base.BaseMocapDataset(ann_file: str = '', seq_len: int = 1, multiple_target: int = 0, causal: bool = True, subset_frac: float = 1.0, camera_param_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000)[source]

Base class for 3d body datasets.

Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • seq_len (int) – Number of frames in a sequence. Default: 1.

  • multiple_target (int) – If larger than 0, merge every multiple_target sequence together. Default: 0.

  • causal (bool) – If set to True, the rightmost input frame will be the target frame. Otherwise, the middle input frame will be the target frame. Default: True.

  • subset_frac (float) – The fraction to reduce dataset size. If set to 1, the dataset size is not reduced. Default: 1.

  • camera_param_file (str) – Cameras’ parameters file. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; while in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. Basedataset can skip loading annotations to save time by setting lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra cycles to get a valid image when Basedataset.prepare_data returns a None img. Default: 1000.

get_camera_param(imgname)[source]

Get camera parameters of a frame by its image name.

Override this method to specify how to get camera parameters.

get_data_info(idx: int) dict[source]

Get data info by index.

Parameters

idx (int) – Index of data info.

Returns

Data info.

Return type

dict

get_sequence_indices() List[List[int]][source]

Build sequence indices.

The default method creates sample indices in which each sample is a single frame (i.e. seq_len=1). Override this method in the subclass to define how frames are sampled to form data samples.

Outputs:
sample_indices: the frame indices of each sample.

For a sample, all frames will be treated as an input sequence, and the ground-truth pose of the last frame will be the target.
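
The default behaviour and one possible override can be sketched in plain Python. The sliding-window scheme and all three helper names are hypothetical; the target selection follows the description of the causal parameter (last frame when causal, middle frame otherwise).

```python
def default_sequence_indices(num_frames):
    """Default scheme: every sample is a single frame."""
    return [[i] for i in range(num_frames)]


def sliding_window_indices(num_frames, seq_len):
    """Hypothetical override: consecutive windows of seq_len frames, no padding."""
    return [
        list(range(start, start + seq_len))
        for start in range(num_frames - seq_len + 1)
    ]


def target_frame(sample, causal=True):
    """Pick the target frame: the last one if causal, otherwise the middle one."""
    return sample[-1] if causal else sample[len(sample) // 2]


single = default_sequence_indices(3)
windows = sliding_window_indices(5, 3)
```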

load_data_list() List[dict][source]

Load data list from COCO annotation file or person detection result file.

prepare_data(idx) Any[source]

Get data processed by self.pipeline.

BaseMocapDataset overrides this method from mmengine.dataset.BaseDataset to add the metainfo into the data_info before it is passed to the pipeline.

Parameters

idx (int) – The index of data_info.

Returns

Depends on self.pipeline.

Return type

Any

class mmpose.datasets.datasets.body.AicDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

AIC dataset for pose estimation.

“AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding”, arXiv’2017. More details can be found in the paper

AIC keypoints:

0: "right_shoulder",
1: "right_elbow",
2: "right_wrist",
3: "left_shoulder",
4: "left_elbow",
5: "left_wrist",
6: "right_hip",
7: "right_knee",
8: "right_ankle",
9: "left_hip",
10: "left_knee",
11: "left_ankle",
12: "head_top",
13: "neck"
Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; while in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. Basedataset can skip loading annotations to save time by setting lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra cycles to get a valid image when Basedataset.prepare_data returns a None img. Default: 1000.

class mmpose.datasets.datasets.body.CocoDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

COCO dataset for pose estimation.

“Microsoft COCO: Common Objects in Context”, ECCV’2014. More details can be found in the paper.

COCO keypoints:

0: 'nose',
1: 'left_eye',
2: 'right_eye',
3: 'left_ear',
4: 'right_ear',
5: 'left_shoulder',
6: 'right_shoulder',
7: 'left_elbow',
8: 'right_elbow',
9: 'left_wrist',
10: 'right_wrist',
11: 'left_hip',
12: 'right_hip',
13: 'left_knee',
14: 'right_knee',
15: 'left_ankle',
16: 'right_ankle'
Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; while in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. Basedataset can skip loading annotations to save time by setting lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra cycles to get a valid image when Basedataset.prepare_data returns a None img. Default: 1000.
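
Because the AIC and COCO keypoint orders differ, combining the two datasets needs an index mapping. The sketch below derives one by matching keypoint names from the two lists documented above; the variable names are hypothetical, and matching by name is an illustrative choice rather than the library's own conversion logic.

```python
AIC_KEYPOINTS = [
    'right_shoulder', 'right_elbow', 'right_wrist', 'left_shoulder',
    'left_elbow', 'left_wrist', 'right_hip', 'right_knee', 'right_ankle',
    'left_hip', 'left_knee', 'left_ankle', 'head_top', 'neck',
]
COCO_KEYPOINTS = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip', 'left_knee',
    'right_knee', 'left_ankle', 'right_ankle',
]

# Look up each AIC keypoint name in the COCO ordering.
coco_index = {name: idx for idx, name in enumerate(COCO_KEYPOINTS)}
aic_to_coco = {
    aic_idx: coco_index[name]
    for aic_idx, name in enumerate(AIC_KEYPOINTS)
    if name in coco_index
}
# AIC keypoints with no COCO counterpart.
unmatched = [name for name in AIC_KEYPOINTS if name not in coco_index]
```

Twelve of the fourteen AIC keypoints have a COCO counterpart; 'head_top' and 'neck' do not.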

class mmpose.datasets.datasets.body.CrowdPoseDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

CrowdPose dataset for pose estimation.

“CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark”, CVPR’2019. More details can be found in the paper.

CrowdPose keypoints:

0: 'left_shoulder',
1: 'right_shoulder',
2: 'left_elbow',
3: 'right_elbow',
4: 'left_wrist',
5: 'right_wrist',
6: 'left_hip',
7: 'right_hip',
8: 'left_knee',
9: 'right_knee',
10: 'left_ankle',
11: 'right_ankle',
12: 'top_head',
13: 'neck'
Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; while in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. Basedataset can skip loading annotations to save time by setting lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra cycles to get a valid image when Basedataset.prepare_data returns a None img. Default: 1000.

class mmpose.datasets.datasets.body.ExlposeDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

Exlpose dataset for pose estimation.

“Human Pose Estimation in Extremely Low-Light Conditions”, CVPR’2023. More details can be found in the paper.

ExLPose keypoints:

0: "left_shoulder",
1: "right_shoulder",
2: "left_elbow",
3: "right_elbow",
4: "left_wrist",
5: "right_wrist",
6: "left_hip",
7: "right_hip",
8: "left_knee",
9: "right_knee",
10: "left_ankle",
11: "right_ankle",
12: "head",
13: "neck"

Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; while in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. Basedataset can skip loading annotations to save time by setting lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra cycles to get a valid image when Basedataset.prepare_data returns a None img. Default: 1000.

class mmpose.datasets.datasets.body.HumanArt21Dataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

Human-Art dataset for pose estimation with 21 kpts.

“Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes”, CVPR’2023. More details can be found in the paper.

Human-Art keypoints:

0: 'nose',
1: 'left_eye',
2: 'right_eye',
3: 'left_ear',
4: 'right_ear',
5: 'left_shoulder',
6: 'right_shoulder',
7: 'left_elbow',
8: 'right_elbow',
9: 'left_wrist',
10: 'right_wrist',
11: 'left_hip',
12: 'right_hip',
13: 'left_knee',
14: 'right_knee',
15: 'left_ankle',
16: 'right_ankle',
17: 'left_finger',
18: 'right_finger',
19: 'left_toe',
20: 'right_toe',
Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; while in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. Basedataset can skip loading annotations to save time by setting lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra cycles to get a valid image when Basedataset.prepare_data returns a None img. Default: 1000.

parse_data_info(raw_data_info: dict) Optional[dict][source]

Parse raw COCO annotation of an instance.

Parameters

raw_data_info (dict) –

Raw data information loaded from ann_file. It should have following contents:

  • 'raw_ann_info': Raw annotation of an instance

  • 'raw_img_info': Raw information of the image that contains the instance

Returns

Parsed instance annotation

Return type

dict | None

class mmpose.datasets.datasets.body.HumanArtDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

Human-Art dataset for pose estimation with 17 kpts.

“Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes”, CVPR’2023. More details can be found in the paper.

Human-Art keypoints:

0: 'nose',
1: 'left_eye',
2: 'right_eye',
3: 'left_ear',
4: 'right_ear',
5: 'left_shoulder',
6: 'right_shoulder',
7: 'left_elbow',
8: 'right_elbow',
9: 'left_wrist',
10: 'right_wrist',
11: 'left_hip',
12: 'right_hip',
13: 'left_knee',
14: 'right_knee',
15: 'left_ankle',
16: 'right_ankle'
Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; while in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. Basedataset can skip loading annotations to save time by setting lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra cycles to get a valid image when Basedataset.prepare_data returns a None img. Default: 1000.

class mmpose.datasets.datasets.body.JhmdbDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

JHMDB dataset for pose estimation.

“Towards understanding action recognition”, ICCV’2013. More details can be found in the paper

sub-JHMDB keypoints:

0: "neck",
1: "belly",
2: "head",
3: "right_shoulder",
4: "left_shoulder",
5: "right_hip",
6: "left_hip",
7: "right_elbow",
8: "left_elbow",
9: "right_knee",
10: "left_knee",
11: "right_wrist",
12: "left_wrist",
13: "right_ankle",
14: "left_ankle"
Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; while in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. Basedataset can skip loading annotations to save time by setting lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra cycles to get a valid image when Basedataset.prepare_data returns a None img. Default: 1000.

parse_data_info(raw_data_info: dict) Optional[dict][source]

Parse raw COCO annotation of an instance.

Parameters

raw_data_info (dict) –

Raw data information loaded from ann_file. It should have following contents:

  • 'raw_ann_info': Raw annotation of an instance

  • 'raw_img_info': Raw information of the image that contains the instance

Returns

Parsed instance annotation

Return type

dict
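
To make the expected input concrete, here is a minimal sketch of a raw_data_info dict and a toy parser. The field names inside 'raw_ann_info' and 'raw_img_info' follow the COCO annotation format and are assumptions here; toy_parse_data_info is a hypothetical stand-in and not the library's implementation.

```python
from typing import Optional


def toy_parse_data_info(raw_data_info: dict) -> Optional[dict]:
    """Toy parser: returns None for instances without bbox or keypoints."""
    ann = raw_data_info['raw_ann_info']
    img = raw_data_info['raw_img_info']
    if 'bbox' not in ann or 'keypoints' not in ann:
        return None

    # COCO stores boxes as (x, y, w, h); convert to clipped (x1, y1, x2, y2).
    x, y, w, h = ann['bbox']
    max_x, max_y = img['width'] - 1, img['height'] - 1
    bbox = [min(max(x, 0), max_x), min(max(y, 0), max_y),
            min(max(x + w, 0), max_x), min(max(y + h, 0), max_y)]

    # COCO stores keypoints as a flat list of (x, y, visibility) triplets.
    flat = ann['keypoints']
    keypoints = [(flat[i], flat[i + 1]) for i in range(0, len(flat), 3)]
    visible = [min(1, flat[i + 2]) for i in range(0, len(flat), 3)]

    return {'img_id': img['id'], 'bbox': bbox,
            'keypoints': keypoints, 'keypoints_visible': visible}


raw = {
    'raw_ann_info': {'bbox': [10, 20, 30, 40],
                     'keypoints': [15, 25, 2, 0, 0, 0]},
    'raw_img_info': {'id': 1, 'width': 100, 'height': 100},
}
parsed = toy_parse_data_info(raw)
```

The Optional[dict] return type matters in practice: a parser may return None to signal that an instance should be skipped.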

class mmpose.datasets.datasets.body.MhpDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

MHPv2.0 dataset for pose estimation.

“Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing”, ACM MM’2018. More details can be found in the paper.

MHP keypoints:

0: "right ankle",
1: "right knee",
2: "right hip",
3: "left hip",
4: "left knee",
5: "left ankle",
6: "pelvis",
7: "thorax",
8: "upper neck",
9: "head top",
10: "right wrist",
11: "right elbow",
12: "right shoulder",
13: "left shoulder",
14: "left elbow",
15: "left wrist",
Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations until they are needed. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to fetch a valid image when BaseDataset.prepare_data returns None. Default: 1000.
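
For orientation, a dataset like this is usually declared as a config dict whose keys match the parameters above. The following is only a sketch: every path in it is a hypothetical placeholder, and the empty pipeline would need real transforms before training.

```python
# Hypothetical paths: adjust data_root, ann_file and data_prefix to your layout.
mhp_train_dataset = dict(
    type='MhpDataset',
    data_root='data/mhp',                    # assumed directory
    ann_file='annotations/mhp_train.json',   # assumed file name
    data_mode='topdown',                     # one instance per data sample
    data_prefix=dict(img='train/images/'),   # assumed image folder
    pipeline=[],                             # add transforms here
    test_mode=False,
)
```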

class mmpose.datasets.datasets.body.MpiiDataset(ann_file: str = '', bbox_file: Optional[str] = None, headbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000)[source]

MPII Dataset for pose estimation.

“2D Human Pose Estimation: New Benchmark and State of the Art Analysis”, CVPR’2014. More details can be found in the paper.

MPII keypoints:

0: 'right_ankle'
1: 'right_knee',
2: 'right_hip',
3: 'left_hip',
4: 'left_knee',
5: 'left_ankle',
6: 'pelvis',
7: 'thorax',
8: 'upper_neck',
9: 'head_top',
10: 'right_wrist',
11: 'right_elbow',
12: 'right_shoulder',
13: 'left_shoulder',
14: 'left_elbow',
15: 'left_wrist'
Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • headbox_file (str, optional) – The path of mpii_gt_val.mat which provides the headboxes information used for PCKh. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations until they are needed. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to fetch a valid image when BaseDataset.prepare_data returns None. Default: 1000.
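
The head boxes from headbox_file feed the PCKh metric, in which a keypoint counts as correct when its distance to the ground truth is within a fraction of the head size. The sketch below shows that idea in plain Python. The pckh function, the 0.5 threshold, and the 0.6 scale factor applied to the head-box diagonal are assumptions for illustration, not values taken from this page.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def pckh(pred: Sequence[Point], gt: Sequence[Point],
         visible: Sequence[int], headbox: Tuple[Point, Point],
         thr: float = 0.5, scale_bias: float = 0.6) -> float:
    """Fraction of visible keypoints within thr * head_size of the GT."""
    (x1, y1), (x2, y2) = headbox
    # Assumed convention: head size is the scaled head-box diagonal.
    head_size = scale_bias * math.hypot(x2 - x1, y2 - y1)
    hits = total = 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue  # invisible keypoints are excluded from the metric
        total += 1
        if math.hypot(px - gx, py - gy) <= thr * head_size:
            hits += 1
    return hits / total if total else 0.0


score = pckh(pred=[(0, 0), (100, 100), (5, 5)],
             gt=[(3, 4), (100, 120), (0, 0)],
             visible=[1, 1, 0],
             headbox=((0, 0), (30, 40)))
```

In this toy case the head size is 30 pixels, so the tolerance is 15 pixels: the first keypoint (5 pixels off) is a hit, the second (20 pixels off) is a miss, and the third is ignored.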

class mmpose.datasets.datasets.body.MpiiTrbDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

MPII-TRB dataset for pose estimation.

“TRB: A Novel Triplet Representation for Understanding 2D Human Body”, ICCV’2019. More details can be found in the paper.

MPII-TRB keypoints:

0: 'left_shoulder'
1: 'right_shoulder'
2: 'left_elbow'
3: 'right_elbow'
4: 'left_wrist'
5: 'right_wrist'
6: 'left_hip'
7: 'right_hip'
8: 'left_knee'
9: 'right_knee'
10: 'left_ankle'
11: 'right_ankle'
12: 'head'
13: 'neck'

14: 'right_neck'
15: 'left_neck'
16: 'medial_right_shoulder'
17: 'lateral_right_shoulder'
18: 'medial_right_bow'
19: 'lateral_right_bow'
20: 'medial_right_wrist'
21: 'lateral_right_wrist'
22: 'medial_left_shoulder'
23: 'lateral_left_shoulder'
24: 'medial_left_bow'
25: 'lateral_left_bow'
26: 'medial_left_wrist'
27: 'lateral_left_wrist'
28: 'medial_right_hip'
29: 'lateral_right_hip'
30: 'medial_right_knee'
31: 'lateral_right_knee'
32: 'medial_right_ankle'
33: 'lateral_right_ankle'
34: 'medial_left_hip'
35: 'lateral_left_hip'
36: 'medial_left_knee'
37: 'lateral_left_knee'
38: 'medial_left_ankle'
39: 'lateral_left_ankle'
Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations until they are needed. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to fetch a valid image when BaseDataset.prepare_data returns None. Default: 1000.

class mmpose.datasets.datasets.body.OCHumanDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

OCHuman dataset for pose estimation.

“Pose2Seg: Detection Free Human Instance Segmentation”, CVPR’2019. More details can be found in the paper.

The “Occluded Human (OCHuman)” dataset contains 8110 heavily occluded human instances within 4731 images. It is designed for validation and testing only: to evaluate on OCHuman, a model is trained on the COCO training set and then tested on OCHuman to measure its robustness to occlusion.

OCHuman keypoints (same as COCO):

0: 'nose',
1: 'left_eye',
2: 'right_eye',
3: 'left_ear',
4: 'right_ear',
5: 'left_shoulder',
6: 'right_shoulder',
7: 'left_elbow',
8: 'right_elbow',
9: 'left_wrist',
10: 'right_wrist',
11: 'left_hip',
12: 'right_hip',
13: 'left_knee',
14: 'right_knee',
15: 'left_ankle',
16: 'right_ankle'
Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations until they are needed. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to fetch a valid image when BaseDataset.prepare_data returns None. Default: 1000.
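
The difference between the two data_mode values can be pictured with a small grouping routine. This is a sketch only: build_samples and the 'img_id' key are hypothetical names chosen for the example, and the real datasets attach far more information to each sample.

```python
from collections import defaultdict
from typing import Dict, List


def build_samples(instances: List[dict],
                  data_mode: str = 'topdown') -> List[Dict]:
    """Group annotated instances into data samples (illustrative only)."""
    if data_mode == 'topdown':
        # One sample per instance.
        return [{'img_id': inst['img_id'], 'instances': [inst]}
                for inst in instances]
    if data_mode == 'bottomup':
        # One sample per image, holding every instance in that image.
        grouped = defaultdict(list)
        for inst in instances:
            grouped[inst['img_id']].append(inst)
        return [{'img_id': img_id, 'instances': insts}
                for img_id, insts in grouped.items()]
    raise ValueError(f"data_mode must be 'topdown' or 'bottomup', "
                     f"got {data_mode!r}")


instances = [{'img_id': 1, 'person': 'a'},
             {'img_id': 1, 'person': 'b'},
             {'img_id': 2, 'person': 'c'}]
topdown_samples = build_samples(instances, 'topdown')
bottomup_samples = build_samples(instances, 'bottomup')
```

With three instances spread over two images, 'topdown' yields three samples and 'bottomup' yields two.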

class mmpose.datasets.datasets.body.PoseTrack18Dataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

PoseTrack18 dataset for pose estimation.

“Posetrack: A benchmark for human pose estimation and tracking”, CVPR’2018. More details can be found in the paper.

PoseTrack2018 keypoints:

0: 'nose',
1: 'head_bottom',
2: 'head_top',
3: 'left_ear',
4: 'right_ear',
5: 'left_shoulder',
6: 'right_shoulder',
7: 'left_elbow',
8: 'right_elbow',
9: 'left_wrist',
10: 'right_wrist',
11: 'left_hip',
12: 'right_hip',
13: 'left_knee',
14: 'right_knee',
15: 'left_ankle',
16: 'right_ankle'
Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations until they are needed. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to fetch a valid image when BaseDataset.prepare_data returns None. Default: 1000.
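
The max_refetch behaviour can be sketched as a bounded retry loop: if preparing a sample yields None, another index is tried, up to max_refetch extra times. fetch_with_refetch and its arguments are hypothetical names for this illustration, and choosing the replacement index at random is an assumption.

```python
import random
from typing import Any, Callable, Optional


def fetch_with_refetch(prepare_data: Callable[[int], Optional[Any]],
                       idx: int, num_samples: int,
                       max_refetch: int = 1000,
                       rng: Optional[random.Random] = None) -> Any:
    """Retry with another index while prepare_data returns None."""
    rng = rng or random.Random()
    # One initial attempt plus up to max_refetch extra attempts.
    for _ in range(max_refetch + 1):
        data = prepare_data(idx)
        if data is not None:
            return data
        idx = rng.randrange(num_samples)  # assumed: pick a random new index
    raise RuntimeError(
        f'No valid sample found after {max_refetch} extra attempts')


def toy_prepare(idx: int) -> Optional[int]:
    # Pretend sample 0 is corrupted and every other sample is fine.
    return None if idx == 0 else idx * 10


value = fetch_with_refetch(toy_prepare, idx=0, num_samples=5,
                           rng=random.Random(0))
```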

class mmpose.datasets.datasets.body.PoseTrack18VideoDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', frame_weights: List[Union[int, float]] = [0.0, 1.0], frame_sampler_mode: str = 'random', frame_range: Optional[Union[int, List[int]]] = None, num_sampled_frame: Optional[int] = None, frame_indices: Optional[Sequence[int]] = None, ph_fill_len: int = 6, metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000)[source]

PoseTrack18 dataset for video pose estimation.

“Posetrack: A benchmark for human pose estimation and tracking”, CVPR’2018. More details can be found in the paper.

PoseTrack2018 keypoints:

0: 'nose',
1: 'head_bottom',
2: 'head_top',
3: 'left_ear',
4: 'right_ear',
5: 'left_shoulder',
6: 'right_shoulder',
7: 'left_elbow',
8: 'right_elbow',
9: 'left_wrist',
10: 'right_wrist',
11: 'left_hip',
12: 'right_hip',
13: 'left_knee',
14: 'right_knee',
15: 'left_ankle',
16: 'right_ankle'
Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • frame_weights (List[Union[int, float]]) – The weight of each frame for aggregation. The first weight is for the center frame; the remaining weights follow in ascending order of frame indices. Note that the length of frame_weights should be consistent with the number of sampled frames. Default: [0.0, 1.0].

  • frame_sampler_mode (str) – Specifies the mode of frame sampler: 'fixed' or 'random'. In 'fixed' mode, each frame index relative to the center frame is fixed, specified by frame_indices, while in 'random' mode, each frame index relative to the center frame is sampled from frame_range with certain randomness. Default: 'random'.

  • frame_range (int | List[int], optional) – The sampling range of supporting frames in the same video for center frame. Only valid when frame_sampler_mode is 'random'. Default: None.

  • num_sampled_frame (int, optional) – The number of sampled frames, excluding the center frame. Only valid when frame_sampler_mode is 'random'. Default: None.

  • frame_indices (Sequence[int], optional) – The sampled frame indices, including the center frame indicated by 0. Only valid when frame_sampler_mode is 'fixed'. Default: None.

  • ph_fill_len (int) – The length of the placeholder to fill in the image filenames. Default: 6

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations until they are needed. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to fetch a valid image when BaseDataset.prepare_data returns None. Default: 1000.
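
The interaction between frame_sampler_mode, frame_range, num_sampled_frame, frame_indices and ph_fill_len is easier to see in code. The sketch below is an illustration under stated assumptions, not the dataset's implementation: both helper names are hypothetical, an int frame_range is assumed to mean the symmetric range [-frame_range, frame_range], out-of-range frames are assumed to be clamped to the video bounds, and the '.jpg' extension is assumed.

```python
import random
from typing import List, Optional, Sequence, Union


def sample_frame_offsets(mode: str,
                         frame_range: Optional[Union[int, List[int]]] = None,
                         num_sampled_frame: Optional[int] = None,
                         frame_indices: Optional[Sequence[int]] = None,
                         rng: Optional[random.Random] = None) -> List[int]:
    """Return frame offsets relative to the center frame (offset 0)."""
    if mode == 'fixed':
        if frame_indices is None:
            raise ValueError("'fixed' mode requires frame_indices")
        return list(frame_indices)
    if mode == 'random':
        if frame_range is None or num_sampled_frame is None:
            raise ValueError(
                "'random' mode requires frame_range and num_sampled_frame")
        if isinstance(frame_range, int):
            low, high = -frame_range, frame_range  # assumed symmetric range
        else:
            low, high = frame_range
        candidates = [i for i in range(low, high + 1) if i != 0]
        picked = (rng or random.Random()).sample(candidates,
                                                 num_sampled_frame)
        # Center frame first, then supporting frames in ascending order,
        # matching the order described for frame_weights.
        return [0] + sorted(picked)
    raise ValueError(f'unknown frame_sampler_mode: {mode!r}')


def frame_filename(center_idx: int, offset: int, num_frames: int,
                   ph_fill_len: int = 6, ext: str = '.jpg') -> str:
    """Zero-pad the (clamped) frame index to ph_fill_len digits."""
    idx = min(max(center_idx + offset, 0), num_frames - 1)
    return str(idx).zfill(ph_fill_len) + ext


fixed = sample_frame_offsets('fixed', frame_indices=[-1, 0, 1])
sampled = sample_frame_offsets('random', frame_range=[-2, 2],
                               num_sampled_frame=2, rng=random.Random(0))
name = frame_filename(center_idx=3, offset=1, num_frames=10)
```

Here name is '000004.jpg': the index 4 is left-padded with zeros to six characters, which is what ph_fill_len=6 describes.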

parse_data_info(raw_data_info: dict) Optional[dict][source]

Parse raw annotation of an instance.

Parameters

raw_data_info (dict) –

Raw data information loaded from ann_file. It should have following contents:

  • 'raw_ann_info': Raw annotation of an instance

  • 'raw_img_info': Raw information of the image that contains the instance

Returns

Parsed instance annotation

Return type

dict

class mmpose.datasets.datasets.face.AFLWDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

AFLW dataset for face keypoint localization.

“Annotated Facial Landmarks in the Wild: A Large-scale, Real-world Database for Facial Landmark Localization”. In Proc. First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.

The landmark annotations follow the 19 points mark-up. The definition can be found in https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/

Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations until they are needed. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to fetch a valid image when BaseDataset.prepare_data returns None. Default: 1000.

parse_data_info(raw_data_info: dict) Optional[dict][source]

Parse raw Face AFLW annotation of an instance.

Parameters

raw_data_info (dict) –

Raw data information loaded from ann_file. It should have following contents:

  • 'raw_ann_info': Raw annotation of an instance

  • 'raw_img_info': Raw information of the image that contains the instance

Returns

Parsed instance annotation

Return type

dict

class mmpose.datasets.datasets.face.COFWDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

COFW dataset for face keypoint localization.

“Robust face landmark estimation under occlusion”, ICCV’2013.

The landmark annotations follow the 29 points mark-up. The definition can be found in http://www.vision.caltech.edu/xpburgos/ICCV13/.

Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations until they are needed. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to fetch a valid image when BaseDataset.prepare_data returns None. Default: 1000.

class mmpose.datasets.datasets.face.CocoWholeBodyFaceDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

COCO-WholeBody dataset for face keypoint localization.

“Whole-Body Human Pose Estimation in the Wild”, ECCV’2020. More details can be found in the paper.

The face landmark annotations follow the 68 points mark-up.

Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations until they are needed. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to fetch a valid image when BaseDataset.prepare_data returns None. Default: 1000.

parse_data_info(raw_data_info: dict) Optional[dict][source]

Parse raw CocoWholeBody Face annotation of an instance.

Parameters

raw_data_info (dict) –

Raw data information loaded from ann_file. It should have following contents:

  • 'raw_ann_info': Raw annotation of an instance

  • 'raw_img_info': Raw information of the image that contains the instance

Returns

Parsed instance annotation

Return type

dict

class mmpose.datasets.datasets.face.Face300VWDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

300VW dataset for face keypoint tracking.

“The First Facial Landmark Tracking in-the-Wild Challenge: Benchmark and Results”, Proceedings of the IEEE international conference on computer vision workshops.

The landmark annotations follow the 68 points mark-up. The definition can be found in https://ibug.doc.ic.ac.uk/resources/300-VW/.

Parameters
  • ann_file (str) – Annotation file path. Default: ‘’.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting is only for evaluation, i.e., ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filter data. Default: None.

  • indices (int or Sequence[int], optional) – Support using first few data in annotation file to facilitate training/testing on a smaller dataset. Default: None which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to get a valid image when BaseDataset.prepare_data returns None. Default: 1000.
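As a rough illustration of how these parameters fit together, the following is a minimal config sketch in the dict style used by OpenMMLab configs. The directory layout, the annotation file name, and the empty pipeline are placeholders chosen for illustration, not values taken from this reference.

```python
# Minimal dataset config sketch. All paths below are hypothetical placeholders.
dataset_cfg = dict(
    type='Face300VWDataset',
    data_root='data/300vw/',            # hypothetical root directory
    ann_file='annotations/train.json',  # hypothetical, relative to data_root
    data_prefix=dict(img='images/'),    # hypothetical image sub-directory
    data_mode='topdown',                # one instance per data sample
    pipeline=[],                        # no transforms in this sketch
    test_mode=False,                    # bbox_file would be ignored here
)

print(dataset_cfg['type'])
```

In a full config, a dict like this would typically appear as the dataset entry of a dataloader config.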

parse_data_info(raw_data_info: dict) Optional[dict][source]

Parse raw Face300VW annotation of an instance.

Parameters

raw_data_info (dict) –

Raw data information loaded from ann_file. It should have the following contents:

  • 'raw_ann_info': Raw annotation of an instance

  • 'raw_img_info': Raw information of the image that contains the instance

Returns

Parsed instance annotation

Return type

dict
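The input and output shapes described above can be sketched as follows. This is a hypothetical stand-in, not the library's implementation: it assumes COCO-style annotations in which keypoints is a flat [x, y, v, ...] list, and the field names keypoints, id, file_name, and the output keys are assumptions made for illustration.

```python
from typing import Optional


def parse_instance(raw_data_info: dict) -> Optional[dict]:
    """Hypothetical parser; not the library's implementation."""
    ann = raw_data_info['raw_ann_info']
    img = raw_data_info['raw_img_info']

    flat = ann.get('keypoints', [])
    if not flat or len(flat) % 3 != 0:
        return None  # nothing usable to parse

    # Split the flat [x, y, v, x, y, v, ...] list into points and flags.
    keypoints = [(flat[i], flat[i + 1]) for i in range(0, len(flat), 3)]
    visible = [1 if flat[i + 2] > 0 else 0 for i in range(0, len(flat), 3)]

    return dict(
        img_id=img['id'],
        img_path=img['file_name'],
        keypoints=keypoints,
        keypoints_visible=visible,
    )


raw = dict(
    raw_ann_info=dict(keypoints=[10.0, 20.0, 2, 30.0, 40.0, 0]),
    raw_img_info=dict(id=1, file_name='0001.jpg'),
)
parsed = parse_instance(raw)
print(parsed)
```

Returning None for an unusable annotation mirrors the Optional[dict] return annotation in the signature.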

class mmpose.datasets.datasets.face.Face300WDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

300W dataset for face keypoint localization.

“300 faces In-the-wild challenge: Database and results”, Image and Vision Computing (IMAVIS) 2019.

The landmark annotations follow the 68 points mark-up. The definition can be found in https://ibug.doc.ic.ac.uk/resources/300-W/.

Parameters
  • ann_file (str) – Annotation file path. Default: ''.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting applies only to evaluation, i.e., it is ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filtering data. Default: None.

  • indices (int or Sequence[int], optional) – Restricts the dataset to a subset of the annotation file, e.g. the first few samples, to facilitate training/testing on a smaller dataset. Default: None, which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold the data in memory as serialized objects. When enabled, data loader workers can use shared RAM from the master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to get a valid image when BaseDataset.prepare_data returns None. Default: 1000.
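The effect of indices can be pictured with a small stand-alone sketch. The helper select_subset below is hypothetical and only mirrors the behaviour described in the parameter list (an int keeps the first few samples, a sequence keeps the listed positions); it assumes a non-negative int and is not the library's code.

```python
from typing import List, Sequence, Union


def select_subset(data_list: List[dict],
                  indices: Union[int, Sequence[int], None]) -> List[dict]:
    """Hypothetical helper illustrating the documented meaning of `indices`."""
    if indices is None:
        return list(data_list)            # use every sample
    if isinstance(indices, int):
        return list(data_list[:indices])  # first `indices` samples (non-negative)
    return [data_list[i] for i in indices]  # explicitly listed positions


samples = [dict(id=i) for i in range(5)]
first_two = select_subset(samples, 2)
picked = select_subset(samples, [4, 0])
print([s['id'] for s in first_two], [s['id'] for s in picked])
```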

parse_data_info(raw_data_info: dict) Optional[dict][source]

Parse raw Face300W annotation of an instance.

Parameters

raw_data_info (dict) –

Raw data information loaded from ann_file. It should have the following contents:

  • 'raw_ann_info': Raw annotation of an instance

  • 'raw_img_info': Raw information of the image that contains the instance

Returns

Parsed instance annotation

Return type

dict

class mmpose.datasets.datasets.face.Face300WLPDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

300W-LP dataset for face keypoint localization.

“300 faces In-the-wild challenge: Database and results”, Image and Vision Computing (IMAVIS) 2019.

The landmark annotations follow the 68 points mark-up. The definition can be found in https://ibug.doc.ic.ac.uk/resources/300-W/.

Parameters
  • ann_file (str) – Annotation file path. Default: ''.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting applies only to evaluation, i.e., it is ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filtering data. Default: None.

  • indices (int or Sequence[int], optional) – Restricts the dataset to a subset of the annotation file, e.g. the first few samples, to facilitate training/testing on a smaller dataset. Default: None, which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold the data in memory as serialized objects. When enabled, data loader workers can use shared RAM from the master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to get a valid image when BaseDataset.prepare_data returns None. Default: 1000.
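The role of max_refetch can be sketched as a bounded retry loop. The function fetch_with_refetch and its arguments are hypothetical; the sketch assumes that a failed sample is replaced by a randomly chosen one, which this reference does not state, and it is not the library's implementation.

```python
import random
from typing import Callable, Optional


def fetch_with_refetch(prepare_data: Callable[[int], Optional[dict]],
                       idx: int,
                       num_samples: int,
                       max_refetch: int = 1000,
                       rng: Optional[random.Random] = None) -> dict:
    """Hypothetical retry loop: one initial attempt plus up to max_refetch more."""
    rng = rng or random.Random()
    for _ in range(max_refetch + 1):
        data = prepare_data(idx)
        if data is not None:
            return data
        idx = rng.randrange(num_samples)  # assumed fallback: try another sample
    raise RuntimeError(f'No valid sample after {max_refetch} extra attempts')


def fake_prepare(i: int) -> Optional[dict]:
    # Pretend that sample 0 is broken and every other sample is fine.
    return None if i == 0 else dict(index=i)


sample = fetch_with_refetch(fake_prepare, idx=0, num_samples=5,
                            rng=random.Random(0))
print(sample)
```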

class mmpose.datasets.datasets.face.LapaDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

LaPa dataset for face keypoint localization.

“A New Dataset and Boundary-Attention Semantic Segmentation for Face Parsing”, AAAI’2020.

The landmark annotations follow the 106 points mark-up. The definition can be found in https://github.com/JDAI-CV/lapa-dataset/.

Parameters
  • ann_file (str) – Annotation file path. Default: ''.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting applies only to evaluation, i.e., it is ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filtering data. Default: None.

  • indices (int or Sequence[int], optional) – Restricts the dataset to a subset of the annotation file, e.g. the first few samples, to facilitate training/testing on a smaller dataset. Default: None, which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold the data in memory as serialized objects. When enabled, data loader workers can use shared RAM from the master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to get a valid image when BaseDataset.prepare_data returns None. Default: 1000.
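The difference between the two data_mode values can be illustrated with plain Python. The helpers to_topdown and to_bottomup and the img_id field are hypothetical names used only for this sketch; they show how the same instance list yields one sample per instance in 'topdown' mode and one sample per image in 'bottomup' mode.

```python
from collections import defaultdict
from typing import Dict, List


def to_topdown(instances: List[dict]) -> List[dict]:
    """One data sample per instance."""
    return [dict(img_id=inst['img_id'], instances=[inst]) for inst in instances]


def to_bottomup(instances: List[dict]) -> List[dict]:
    """One data sample per image, holding all of that image's instances."""
    grouped: Dict[int, List[dict]] = defaultdict(list)
    for inst in instances:
        grouped[inst['img_id']].append(inst)
    return [dict(img_id=img_id, instances=insts)
            for img_id, insts in grouped.items()]


instances = [
    dict(img_id=1, face='a'),
    dict(img_id=1, face='b'),
    dict(img_id=2, face='c'),
]
topdown_samples = to_topdown(instances)
bottomup_samples = to_bottomup(instances)
print(len(topdown_samples), len(bottomup_samples))
```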

class mmpose.datasets.datasets.face.WFLWDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refetch: int = 1000, sample_interval: int = 1)[source]

WFLW dataset for face keypoint localization.

“Look at Boundary: A Boundary-Aware Face Alignment Algorithm”, CVPR’2018.

The landmark annotations follow the 98 points mark-up. The definition can be found in https://wywu.github.io/projects/LAB/WFLW.html.

Parameters
  • ann_file (str) – Annotation file path. Default: ''.

  • bbox_file (str, optional) – Detection result file path. If bbox_file is set, detected bboxes loaded from this file will be used instead of ground-truth bboxes. This setting applies only to evaluation, i.e., it is ignored when test_mode is False. Default: None.

  • data_mode (str) – Specifies the mode of data samples: 'topdown' or 'bottomup'. In 'topdown' mode, each data sample contains one instance; in 'bottomup' mode, each data sample contains all instances in an image. Default: 'topdown'.

  • metainfo (dict, optional) – Meta information for dataset, such as class information. Default: None.

  • data_root (str, optional) – The root directory for data_prefix and ann_file. Default: None.

  • data_prefix (dict, optional) – Prefix for training data. Default: dict(img='').

  • filter_cfg (dict, optional) – Config for filtering data. Default: None.

  • indices (int or Sequence[int], optional) – Restricts the dataset to a subset of the annotation file, e.g. the first few samples, to facilitate training/testing on a smaller dataset. Default: None, which means using all data_infos.

  • serialize_data (bool, optional) – Whether to hold the data in memory as serialized objects. When enabled, data loader workers can use shared RAM from the master process instead of making a copy. Default: True.

  • pipeline (list, optional) – Processing pipeline. Default: [].

  • test_mode (bool, optional) – test_mode=True means in test phase. Default: False.

  • lazy_init (bool, optional) – Whether to defer loading annotations instead of loading them during instantiation. In some cases, such as visualization, only the meta information of the dataset is needed, so loading the annotation file is unnecessary. BaseDataset can skip loading annotations to save time when lazy_init=True. Default: False.

  • max_refetch (int, optional) – The maximum number of extra attempts to get a valid image when BaseDataset.prepare_data returns None. Default: 1000.
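Since data_root is described as the root directory for both data_prefix and ann_file, the path resolution can be sketched roughly as below. The helper join_root and all paths are hypothetical, and the rule that absolute paths are left untouched is an assumption of this sketch; posixpath is used so the result does not depend on the operating system.

```python
import posixpath
from typing import Optional


def join_root(data_root: Optional[str], path: str) -> str:
    """Prefix `path` with `data_root` unless it is already absolute."""
    if not data_root or posixpath.isabs(path):
        return path
    return posixpath.join(data_root, path)


data_root = 'data/wflw'  # hypothetical root directory
ann_file = join_root(data_root, 'annotations/train.json')  # hypothetical file
data_prefix = {
    key: join_root(data_root, value)
    for key, value in dict(img='images/').items()  # hypothetical sub-directory
}
print(ann_file, data_prefix)
```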

parse_data_info(raw_data_info: dict) Optional[dict][source]

Parse raw Face WFLW annotation of an instance.

Parameters

raw_data_info (dict) –

Raw data information loaded from ann_file. It should have the following contents:

  • 'raw_ann_info': Raw annotation of an instance

  • 'raw_img_info': Raw information of the image that contains the instance

Returns

Parsed instance annotation

Return type

dict

class mmpose.datasets.datasets.hand.CocoWholeBodyHandDataset(ann_file: str = '', bbox_file: Optional[str] = None, data_mode: str = 'topdown', metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: dict = {'img': ''}, filter_cfg: Optional[dict] = None, indices: Optional[Union[int, Sequence[int]]] = None, serialize_data: bool = True, pipeline: List[Union[dict, Callable]] = [], test_mode: bool = False, lazy_init: bool = False, max_refet