Abstract
Studies of quadruped animal motion help us to identify diseases, understand behavior and unravel the mechanics behind animal gaits. The horse is likely the best-studied animal in this respect, but data capture is challenging and time-consuming. Computer vision techniques improve animal motion extraction, but their development relies on reference datasets, which are scarce, not open-access and often provide data from only a few anatomical landmarks. Addressing this data gap, we introduce PFERD, a video and 3D marker motion dataset from horses using a full-body set-up of over 100 densely placed skin-attached markers and synchronized videos from ten camera angles. Five horses of diverse conformations provide data for various motions, from basic poses (e.g. walking, trotting) to advanced motions (e.g. rearing, kicking). We further express the 3D motions with current techniques and a 3D parameterized model, the hSMAL model, establishing a baseline for markerless 3D horse motion capture. PFERD enables advanced biomechanical studies and provides a resource of ground truth data for the methodological development of markerless motion capture.
Background & Summary
Over the years, capturing and modeling the articulated motion of humans and animals has been a research topic across different disciplines, ranging from medicine1 to robotics2 to computer graphics3. Humans and animals integrate verbal communication with body posture and movements, and understanding body language would greatly advance the design of intelligent interactive artificial systems. Detecting anomalies in articulated motion4 can serve as a crucial tool for early health intervention, helping to mitigate potential long-term injuries or disease, hence improving the subject’s quality and length of life. Systems able to synthesize realistic motion can be a useful tool for artists to create characters for games and virtual worlds5. Recently, generative AI methods have been widely trained to synthesize human motion5,6,7.
We focus our study on horses. Horses have historical and cultural significance in most human societies and are one of the oldest domesticated mammals. They have played a significant role in human history, from transportation to warfare, from agriculture to culture, from work to equestrian sports. Their significance is related to their unique bodily strength and speed, made possible by an efficient quadruped locomotor system. This body system has evolved to have bulky muscles close to the upper body, proportionally long limbs that act like pogo-sticks thanks to specialized tendons and a mass reduction of the lower limb and foot that effectively reduces inertia. This optimized locomotor apparatus is however susceptible to injuries as it operates under loading conditions close to its point of failure8. Thus, studying the motions of horses has been the focus of researchers in different fields such as robotics9,10, biology11,12 and veterinary medicine13,14.
Marker-based motion capture systems are widely used to capture complex human and animal motions by recording the positions of wearable markers placed on the body in an indoor environment. Marker-based motion capture has demonstrated its significance for motion study15,16,17 and utility across a wide array of applications14,18. While marker-based solutions are in need of physical contact with the animal and not scalable to in-the-wild scenarios, computer vision techniques have been developed to implement markerless motion capture, where the articulated motion of a skeleton is inferred from visual data19,20,21,22,23,24,25,26,27,28,29. With images and video as the input, straightforward systems provide the image coordinates of skeleton joints as solutions19,20,21. This is not sufficient for many high-quality downstream applications, where a solution independent from the capture geometry is often required. In these cases, a 3D articulated pose, given as the set of 3D rotation angles of skeleton joints, is preferable.
3D markerless motion capture for humans is a novel technology, accurate enough to be used in many applications. In particular, monocular markerless capture, where the 3D pose of the subject is inferred from just one camera, allows designing applications that can exploit low-cost capture devices like smartphones. The basis of these achievements is data-driven methods that leverage large amounts of captured human data. While 2D methods use data in the form of large image datasets with body joint annotations, which are easy to obtain, the data capture task for learning systems outputting 3D poses is significantly more challenging. Moreover, estimating 3D pose from a single view is ambiguous. The problem can be approached by learning explicit 3D pose priors, that constrain the ambiguous solutions to the more likely ones, or learning implicit 3D priors from annotated datasets. The first solution makes use of decoupled data, usually image or video datasets of humans and 2D joints, which are also used for 2D pose estimation, and datasets of human motions. The second solution requires large datasets of images and corresponding 3D poses. These datasets can be obtained at a large scale but only synthetically. Notably, the synthetic dataset generation still requires human articulated motion data. However, relying solely on the priors of human articulated motion is insufficient to achieve markerless motion capture. Seen from the camera, a long leg pointed to the camera can have the same appearance as a shorter leg bent with a different angle. To deal with these ambiguities, a 3D shape prior, encoding the correlation between the body segment proportions, is required.
In the last few years, we have seen a tremendous advance in 3D markerless motion capture for humans. This has been facilitated by the availability of the SMPL (Skinned Multi Person Linear) model30, a 3D parametric model, learned from thousands of 3D scans of people, encoding articulated human shape, and the AMASS dataset31, a large dataset of human articulated motion, captured with mocap systems, expressed in the parameters space of the SMPL model. Together, SMPL and AMASS incorporate knowledge about how people appear in shape and how they move7,32,33,34,35,36,37.
Animal motion capture has been making strides in recent years but still lags behind human motion studies. Analogous to the SMPL model, the SMAL (Skinned Multi Animal Linear) model38, learned from 41 toy scans, encodes articulated shapes of quadruped animals. Versions of the SMAL model have been made specifically for dogs39,40 and horses41. Nonetheless, data collection on animals is more challenging than on humans, since it is more difficult to instruct them to perform specific motions and to keep them in fragile indoor environments. Existing animal mocap datasets favor docile and small animals42,43,44,45, but are constrained in pose variability and lack the motion diversity that AMASS offers. This results in a paucity of comprehensive animal motion datasets for data-driven motion study, particularly for larger animals, like horses. In the equine veterinary field, motion capture has demonstrated its potential in clinical applications for lameness diagnostics46,47. However, within this field, the focus is often limited to capturing locomotion data from a limited number of anatomical landmarks48,49,50 due to the difficulty and time constraints of placing markers on the horse’s body, limiting the analysis and understanding of full-body motions.
To bridge the gap, we introduce PFERD51, a dense motion capture dataset of horses of diverse conformations and poses with rich 3D horse articulated motion data. Recorded in an indoor riding arena in Sweden, using an optical motion capture system from the company Qualisys, the dataset includes five horses of different sizes and breeds, to ensure shape diversity (Fig. 1). Over 100 reflective markers were placed on each horse, covering both skeletal structures and soft tissues, to accurately capture motions. The dataset covers a wide variation of horse motions, guided by human instructors, ranging from basic activities like standing, walking, and trotting, to complex motions like the piaffe, the passage, the pirouette, jumping and sitting, as shown in Fig. 3. Two highly trained horses perform these advanced motions, while the remaining subjects provide common gaits and motions encountered in everyday horses. Furthermore, to promote the study of markerless motion capture, we provide multiple data types. In line with AMASS, we express 3D horse articulated motion with the hSMAL model41. The dataset further enriches its data diversity by including synchronized videos from ten camera views and corresponding 2D joints.
The PFERD dataset serves as an open resource for equine motion research and for the scientific development of computer vision and modeling applications that can benefit horse health and welfare and strengthen our understanding of horse behavior. It provides synchronized 3D data from skin-placed markers and multi-view 2D RGB video streams. The dataset is small in terms of subject numbers, but unique thanks to the wealth of markers placed on the horses’ bodies as well as the subject variation in size, shape, and the rare motions that some of the horses perform. The dataset can be expanded given the detailed descriptions of data capture and model estimation procedures. With this data, we invite researchers to develop both statistical analysis and data-driven methods. We suggest the following tasks:
1) Quantitative motion analysis: The diverse motions and full-body marker setup allow for detailed biomechanical studies. The mocap data are, at the time of publication, the most marker-dense horse motion dataset available for research in horses. This data can help veterinary researchers to understand full-body motions at a more detailed level than was previously possible, and it contains unique movements that only highly trained horses can perform.

2) Addressing animal-related computer vision problems: Researchers can utilize this dataset to develop novel computer vision models or refine existing algorithms, promoting the development of markerless motion capture. The 3D motion data can further be used for many graphics tasks, such as the development of generative motion models and the improvement of 3D animatable models.

3) Benchmarking in method evaluation through provided ground truth data: The dataset contains precise 3D mocap data and multi-view RGB video, and can provide both 3D and 2D ground truth for method evaluation, helping the development of new methods for 3D pose and shape estimation. Furthermore, it enables benchmarking against the state-of-the-art methods we apply here31,52.
Methods
In this section, the procedures used to record the data are explained. In addition, the processing of the mocap data and the 3D pose data is presented.
Study subjects
The dataset has a diversity in terms of body shape and motion. We selected five horses of different breeds to provide variable information on shape and size. In terms of diverse motions, all horses performed some basic movements, such as standing (with the head moving from side to side, up and down), moving forward/backward, walking, trotting, and cantering. Two of the horses (Horse No.4 and No.5) performed advanced movements based on signaling cues from their owners, such as pirouetting, rearing, piaffing, kicking, jumping, etc. Table 1 shows the detailed characteristics of the five horses and some motions are listed in Fig. 3.
Before each subject was selected, the horse owners were introduced to the aim of data collection and informed about the procedures. Written informed consent was obtained from the owners, permitting the use of the horses’ data for research purposes. The study was non-invasive and the procedure was covered by an animal ethical permission No. 5.8.18-15533/2018. Written consent was provided by all humans appearing in the video recordings.
Experimental design
In this subsection, the mocap system and the marker setup are described.
Motion capture system
The data were collected using a Qualisys optical motion capture system on November 26–29, 2020. The system was set up in a riding arena of approximately \(19\times 30m\) at the Equine Clinic of the University Animal Hospital (UDS) of the Swedish University of Agricultural Sciences (SLU) in Uppsala, Sweden. In total, 56 mocap cameras from the Qualisys system (35 Oqus_700 + cameras and 21 Arqus_A12 cameras) and ten RGB full HD video cameras (Miqus_Video cameras) were mounted on the walls of the arena, as shown in Fig. 2. All cameras were synchronized, creating an effective recording volume of approximately \(16\times 20m\) in the center of the arena. The capture rate of the mocap cameras was 240 Hz, while the RGB videos were captured at 20, 30 or 60 Hz, depending on the recording.
Marker placement and attachment methods
Reflective spherical markers with a diameter of 19 mm were attached to the horses’ skin with double-coated adhesive tape cut in pieces of around \(2\times 3cm\). Different attachment methods were empirically tested for the more challenging body parts, such as the ears and the hooves, shown in Fig. 4.
The marker setup aimed to capture both body shape and motion as completely as possible and included 132 markers on both skeletal structures and soft tissues. Based on expert anatomical knowledge, the markers were separated into three groups. The first group of 50 markers focuses on the precise palpation of anatomical skeletal structures that mark out the most important skeletal segments related to locomotion. Connecting these markers provides a “stick figure”, roughly representing skeletal movement from landmarks on the skin surface. We call these markers the “skeletal model”, see Fig. 5a. The second group of around 70 markers was dispersed over the horse’s soft tissues, mainly covering the neck, the thoraco-abdominal region, and the hindquarter segments. The third group of 12 markers was placed in groups of three on each hoof, to allow tracing of the rotational motion of the hooves. The final full-body marker setup is shown in Fig. 5b. Detailed descriptions are reported in Table 3.
Data acquisition
In this subsection, the whole procedure of data recording is explained, including mocap system calibration, subject preparation, and type of motion performances recorded.
Qualisys calibration
Calibration of the motion capture system was done with wand calibration, according to the manufacturer’s instructions. The video cameras were calibrated along with the marker cameras. For the first calibration, an L-shaped frame with static markers was placed in the approximate center of the capture volume to define the coordinate system. Then a calibration wand with two markers at a fixed distance was moved through the volume to present it to all cameras at different angles. Subsequent calibrations were done with only the calibration wand. The system was recalibrated before recording the first, second and fourth subject.
Study subject preparation
The fur and the hooves of the horses were washed with soap water before attaching the markers. Markers were cleaned between trials if needed. Markers in the skeletal model were placed by two people with anatomical knowledge, palpating specific skeletal structures for precise positioning. The remaining markers were positioned according to the body segment’s proportions, using nearby skeletal markers and a tape measure as references, shown in Fig. 6. Each horse took 2-3 hours to finish all the preparations. The number of markers per horse (shown in Table 1) varied slightly, since certain markers had to be excluded for various reasons. For example, Horse No.3 was a small horse, and we had to reduce the number of markers to avoid marker merging or label-swapping due to markers being too close together on the small body. Horse No.5 was sweating, resulting in markers not attaching properly, especially on the lower belly.
Recording sessions
During the recording session for each horse, the horse was first led to the recording arena by the owner and was familiarized with the recording environment. Then, the owner engaged with the horse, using methods such as whistling, waving the whip, and offering treats to guide the horse into performing specific motions. The movements began with fundamental actions, like standing, neck bending, and moving forward or backward, as well as walking, trotting and cantering. The complexity of the movements varied depending on the horse’s ability. For instance, Horse No.4 demonstrated more advanced movements like pirouetting, rearing, piaffing, passage, Spanish walk and kicking, while Horse No.5 was rearing and jumping over an obstacle. The dataset comprises different numbers of data sequences for each of the five horses, ranging from 5 to 13, shown in Table 4. While most data recordings lasted approximately one minute, there were exceptions. Each data recording captured more than one motion, allowing for a diverse range of horse movements.
Data processing
In this subsection, we describe the processing of the Qualisys data and the procedure of fitting the 3D model to the mocap data using the hSMAL model and MoSh++.
Mocap data
The motion capture data was collected using Qualisys Track Manager (QTM) version 2020.3. The collected 2D data were combined into 3D trajectories using the tracking algorithms in QTM, and the trajectories were then labeled. The labeled data was exported to c3d and fbx format. Since markers might fall off or be occluded during a capture, the number of labeled trajectories might vary slightly for the same horse. In trial No.8 of Horse No.5, some miscommunication with the camera system resulted in many short gaps in the data. Therefore, linear interpolation of the marker positions was used to fill single-frame gaps in this one measurement. In the other measurements, no gap fill was used. The camera calibration information was exported from QTM, including the extrinsic and intrinsic parameters.
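The single-frame gap fill can be sketched as follows. This is an illustrative reimplementation (not the QTM code), assuming a marker trajectory is a (T, 3) array with NaN rows marking gaps; only isolated single-frame gaps are filled, matching the conservative approach described above:

```python
import numpy as np

def fill_single_frame_gaps(traj):
    """Fill isolated single-frame gaps (NaN rows) in a (T, 3) marker
    trajectory by linear interpolation between the two neighbouring
    frames. Longer gaps are left untouched."""
    traj = traj.copy()
    T = traj.shape[0]
    for t in range(1, T - 1):
        gap = np.isnan(traj[t]).any()
        prev_ok = not np.isnan(traj[t - 1]).any()
        next_ok = not np.isnan(traj[t + 1]).any()
        if gap and prev_ok and next_ok:
            # midpoint of the two valid neighbours
            traj[t] = 0.5 * (traj[t - 1] + traj[t + 1])
    return traj
```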
2D keypoints and silhouette extraction
For 2D joint extraction, the 3D mocap data was projected onto each image frame from every camera view. This process utilized the corresponding camera parameters, exported from Qualisys, and aligned the first frame of the mocap files to the initial video frame. To account for the differing frame rates between the c3d files and the videos, the c3d files were downsampled to synchronize the c3d and video frames.
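The projection and downsampling steps can be sketched with a standard pinhole camera model. This is an illustrative version only: the intrinsics `K` and extrinsics `(R, t)` stand in for the parameters exported from QTM, whose exact layout may differ:

```python
import numpy as np

def project_markers(points_3d, K, R, t):
    """Project (N, 3) world-frame marker positions to pixel coordinates
    with a pinhole model: x = K [R | t] X."""
    cam = points_3d @ R.T + t        # world -> camera frame
    uv = cam @ K.T                   # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]    # perspective divide

def sync_indices(n_mocap_frames, mocap_hz, video_hz):
    """Indices that downsample a mocap stream (e.g. 240 Hz) to the
    video frame rate, with frame 0 of both streams aligned."""
    step = mocap_hz / video_hz
    return np.arange(0, n_mocap_frames, step).astype(int)
```

For example, a 240 Hz mocap stream paired with a 60 Hz video keeps every fourth mocap frame.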
For silhouette extraction, Track Anything53, one of the state-of-the-art segmentation models, was employed to extract the 2D silhouettes of each horse in each video frame. The method operated as follows: every five frames within each video, Segment Anything (SAM)54 extracted the horse’s mask for the first frame, using the bounding box calculated from 2D key points. Track Anything then applied the results from SAM as a template mask to guide the five-frame segmentation. To ensure the quality of the segmentation, we selected 130 video sequences and manually excluded instances with occlusion or incomplete body visibility.
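The bounding box used to seed the segmenter is computed from the projected 2D keypoints. A minimal sketch is shown below; the margin value is illustrative and not taken from the original pipeline:

```python
import numpy as np

def bbox_from_keypoints(kp2d, margin=20):
    """Axis-aligned bounding box (x0, y0, x1, y1) around projected 2D
    keypoints (N, 2), padded by a pixel margin; such a box can prompt
    a promptable segmenter like SAM. NaN rows (missing markers) are
    ignored."""
    kp = kp2d[~np.isnan(kp2d).any(axis=1)]
    x0, y0 = kp.min(axis=0) - margin
    x1, y1 = kp.max(axis=0) + margin
    return float(x0), float(y0), float(x1), float(y1)
```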
3D shape and pose modeling
The body model
The 3D shape and pose of the horse are modeled and represented through the parameters of the hSMAL body model. As a horse-specific version of SMAL38, the hSMAL model41 defines a 3D horse mesh consisting of 1,497 vertices, 2,990 faces, and 36 body segments. hSMAL can be described as a function \(\xi (\beta ,\theta ,\gamma )\), where β is the shape parameter, θ is the 3D pose parameter, and γ is the model translation. The model is learned from 37 horse toys using the procedure described in38. More specifically, a purchased 3D mesh of a horse, created by an artist, is used to create a Global/Local Stitched Shape model (GLoSS)38 for horses. The GLoSS model is fitted to each toy scan so that all scans share the same mesh topology. To de-correlate body and tail shapes, tails among different toy horses are interchanged to generate a broader range of data. After a process of pose-normalization, the mean template Vmean is computed by averaging the data. The vertex-based residuals between the data and the mean template are modeled by principal component analysis (PCA). β represents the coefficients of the learned low-dimensional linear space, while Bs defines the shape deformations. More specifically, under the template pose, the shape is given as:

\({V}_{shape}={V}_{mean}+{B}_{s}\beta \)
The learned shape space of the model is shown in Fig. 7a, where the first three components mainly capture the model’s sizes of the body, the tail, and the neck, respectively. θ represents the relative rotation of each joint with respect to its parent joint in the axis-angle representation according to the skeleton tree defined in the model. The θ parameter is a vector of dimension \(3\times 36=108\). The skeleton joint positions are manually defined (Fig. 7b) to better represent the animal anatomy similar to45. The final mesh is then posed with Linear Blend Skinning (LBS)30,38 and shifted with translation parameter γ. More details are in the original papers38,41. We used the first 10 PCA coefficients of the shape space as the shape parameters.
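As a toy illustration of this linear shape space (not the released model code), the shape blend step adds PCA shape directions, weighted by β, to the mean template; the posed mesh would then be produced by LBS. Array shapes and values here are made up:

```python
import numpy as np

def shaped_vertices(v_template, shape_dirs, beta):
    """Linear shape blend: vertices are the mean template (V, 3) plus
    PCA shape directions (V, 3, B) weighted by the shape coefficients
    beta (B,). The real hSMAL model then poses this mesh with Linear
    Blend Skinning; this sketch covers only the shape step."""
    return v_template + shape_dirs @ beta
```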
Model fitting
The parameters of a 3D articulated shape model can be estimated from mocap markers using the MoSh52 and MoSh++31 methods. These methods consider the fact that markers cannot be attached to fixed positions on the human body, especially when the human is moving or the markers are on soft tissues and solve not only for model parameters, but also for the marker position on the body surface. MoSh++, the updated version of MoSh, has been applied to different human mocap datasets to create a unified dataset, AMASS, which includes markers and aligned SMPL model parameters.
Following MoSh++31, we use two stages to capture the 3D shape and pose of the horse using the hSMAL model from mocap data. We use a similar notation as in MoSh++. Please check for more details in the original paper31,52.
Stage I: Stage I focuses on estimating the shape and marker positions. A marker parameterization denoted as \(m({\widetilde{m}}_{i},\beta ,{\theta }_{t},{\gamma }_{t})\) is utilized to estimate marker positions considering the body’s shape, pose and location. More specifically, the latent markers \({\widetilde{m}}_{i}\) are mapped to the world by accounting for the model parameters \((\beta ,{\theta }_{t},{\gamma }_{t})\) at a particular frame t for marker i. To do this, F frames are randomly selected from subject-specific mocap sequences. The goal is to optimize the model parameters (\(\beta \), \(\Theta =\{{\theta }_{1:F}\}\), \(\Gamma =\{{\gamma }_{1:F}\}\)) and latent marker positions \(\widetilde{M}=\{{\widetilde{m}}_{i}\}\), based on the observed marker positions \(M={\{{m}_{i,t}\in {M}_{t}\}}_{1:F}\). More specifically, an objective function is defined as:
where \({E}_{D}\) measures the distance between the parameterized markers \(m({\widetilde{m}}_{i},\beta ,{\theta }_{t},{\gamma }_{t})\) and the observed markers \({m}_{i,t}\), and \({E}_{R}\) ensures the markers are at an appropriate distance from the model surface (here we set 10 mm). \({E}_{I}\) maintains the parameterized markers close to their initial positions. \({E}_{\beta }\) and \({E}_{\Theta }\) are regularizers related to the shape and pose priors of the hSMAL model defined in41. Finally, a four-staged approach is performed to help avoid getting stuck in local optima. Here we randomly selected \(F=12\) frames from sequences where the horse was in more static poses, as this allowed for better optimization. It is worth noting that some markers may not be visible in these selected sequences. We used specific values for \({\lambda }_{D}=105.0\times d\), \({\lambda }_{R}=10300.0\), \({\lambda }_{I}=250.0\), \({\lambda }_{\beta }=14.5\), \({\lambda }_{\theta }=7.5\) and a scaling factor \(d=50/n\), to deal with varying numbers of markers, where 50 was the number of markers in the skeletal model and n was the number of observed mocap markers in a frame.
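A schematic version of the Stage I objective is sketched below. It is not the original optimization code: the hSMAL shape and pose priors are replaced by simple L2 penalties, and all inputs are placeholders:

```python
import numpy as np

def stage1_objective(est_markers, obs_markers, surf_dist, init_markers,
                     beta, theta, lambdas):
    """Weighted sum of the Stage I terms: data term (E_D),
    surface-distance regulariser (E_R, target 10 mm), initial-position
    term (E_I), and shape/pose priors (L2 placeholders for the hSMAL
    priors)."""
    lD, lR, lI, lb, lt = lambdas
    E_D = np.sum((est_markers - obs_markers) ** 2)
    E_R = np.sum((surf_dist - 0.010) ** 2)          # keep markers ~10 mm off the surface
    E_I = np.sum((est_markers - init_markers) ** 2)
    E_beta = np.sum(beta ** 2)                       # placeholder shape prior
    E_theta = np.sum(theta ** 2)                     # placeholder pose prior
    return lD * E_D + lR * E_R + lI * E_I + lb * E_beta + lt * E_theta
```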
Stage II: Stage II focuses on optimizing the 3D poses from all subject-based sequences. The body shape β and latent marker positions of each subject in Stage I are determined and kept fixed during Stage II. More specifically, we minimize:
where \({E}_{D}\) and \({E}_{\theta }\) are the same as in Stage I, measuring the alignment of the model with the observed data and regularizing the pose. \({E}_{u}\) is a temporal smoothness term, ensuring the 3D poses change naturally and smoothly over time. Here we set \({\lambda }_{D}=480.0\times d\), \({\lambda }_{\theta }=2.3\times q\), \({\lambda }_{u}=2.5\). A variable \(q=1+\frac{x}{|M|}\times 2.5\) was introduced based on the number of missing markers x, where \(|M|\) is the total number of markers. The more markers were missing, the more strongly the pose was constrained in the given frame.
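The missing-marker-dependent pose weight and the temporal smoothness term can be illustrated as follows (a sketch under the definitions above, not the original optimization code):

```python
import numpy as np

def pose_weight(n_missing, n_markers, base_lambda=2.3):
    """Pose-prior weight lambda_theta = base * q with
    q = 1 + (x / |M|) * 2.5, so frames with more missing markers are
    pulled more strongly toward the pose prior."""
    q = 1.0 + (n_missing / n_markers) * 2.5
    return base_lambda * q

def temporal_smoothness(poses):
    """E_u: sum of squared finite differences between consecutive
    per-frame pose vectors (T, D), penalising jerky pose changes."""
    return float(np.sum((poses[1:] - poses[:-1]) ** 2))
```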
Hyper-parameter search
Certain hyperparameters λ in the two-stage optimization were determined by line search on a synthetic dataset, as inspired by MoSh++. The synthetic dataset was first created using the toy shapes from the training data of the hSMAL model. We placed 38 synthetic markers on the model with toy shapes and animated the model. We divided this data into a training set, consisting of 32 toys and five animations, and a validation set with five toys and two animations.
The search adjusted one parameter at a time while keeping the others fixed, in both the Stage I and Stage II fitting, using different random seeds on the training and validation datasets. In Stage I, we initialized the marker positions by randomly placing them near the true positions. The goal was to find a good combination of \(({\lambda }_{D},{\lambda }_{R},{\lambda }_{I},{\lambda }_{\beta },{\lambda }_{\theta })\) in Eq. 2, i.e. the values that yielded the lowest distance between the estimated markers and the synthetic true markers within the selected 12 frames in both the training and validation sets. In Stage II, the process was to find a good combination of \(({\lambda }_{D},{\lambda }_{\theta },{\lambda }_{u})\) in Eq. 3. Here we aimed to minimize the error over all vertices of the model between the estimated and true results in both the training and validation sets.
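The coordinate-wise line search can be sketched generically. Here `objective` and `fixed` are stand-ins for the synthetic-data fitting error and the hyperparameters held constant:

```python
def line_search(objective, candidates, fixed):
    """Minimal coordinate-wise line search: evaluate the validation
    objective at each candidate value of one hyperparameter while the
    remaining hyperparameters (`fixed`) stay constant, and keep the
    value with the lowest error."""
    return min(candidates, key=lambda v: objective(v, **fixed))
```

In practice this would be repeated for each λ in turn, cycling until the validation error stops improving.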
Data Records
The datasets are available at Harvard Dataverse51. The data is organized in subject folders named [Subject ID]. File names for each trial start with [Record Date]_[Subject ID]_[Trial Number], denoted [Trial Name]. Each subject folder stores seven sub-folders with complete trials:
- C3D_DATA: One C3D file per trial, exported from the QTM software, referenced as [Trial Name].c3d.
- FBX_DATA: One FBX file per trial, containing the whole scenario, including all camera information and the 3D positions of the subject, exported from the QTM software, referenced as [Trial Name].fbx.
- VIDEO_DATA: One folder per trial, named [Trial Name], with videos from ten camera views, referenced as [Trial Name]_[Camera Code].avi.
- SEGMENT_DATA: Selected segmentation subsets for evaluating the fitting results, referenced as [Trial Name]_[Camera Code]_seg.mp4.
- MODEL_DATA: NPZ files, referenced as [Trial Name]_hsmal.npz, containing the hSMAL parameters per trial. Another NPZ file, referenced as [User ID]_stagei.npz, contains the latent representation of the markers.
- KP2D_DATA: One folder per trial, named [Trial Name], with 2D keypoints projected from the 3D mocap data into the video frames of the ten camera views, referenced as [Trial Name]_[Camera Code]_2Dkp.npz.
- CAM_DATA: The camera parameters of the ten camera views, referenced as Camera_Miqus_Video_[Camera ID].npz.
The correspondences are as follows:
[Record Date]: Recording date (e.g. 20201128 or 20201129)
[Subject ID]: Subject ID (e.g. ID_1, ID_2, ID_3, ID_4, ID_5)
[Trial Number]: Trial number (e.g. 0001, shown in Table 4)
[Camera Code]: Camera code (e.g. Miqus_65_20715)
[Camera ID]: Camera ID (e.g. 20715, 21386, 23348, 23350, 23414, 23415, 23416, 23417, 23603, 23604)
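Given this naming scheme, per-trial file paths can be assembled as below. The function name and dictionary keys are our own, and `root` is wherever the Dataverse archive was unpacked:

```python
from pathlib import Path

def trial_paths(root, record_date, subject_id, trial_number, camera_code):
    """Assemble per-trial file paths following the dataset's naming
    scheme: [Record Date]_[Subject ID]_[Trial Number] is the trial
    name, nested under the subject folder's sub-folders."""
    trial = f"{record_date}_{subject_id}_{trial_number}"
    sub = Path(root) / subject_id
    return {
        "c3d": sub / "C3D_DATA" / f"{trial}.c3d",
        "fbx": sub / "FBX_DATA" / f"{trial}.fbx",
        "video": sub / "VIDEO_DATA" / trial / f"{trial}_{camera_code}.avi",
        "hsmal": sub / "MODEL_DATA" / f"{trial}_hsmal.npz",
        "kp2d": sub / "KP2D_DATA" / trial / f"{trial}_{camera_code}_2Dkp.npz",
    }
```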
Technical Validation
In this section, we first provide calibration errors from the QTM software, followed by quantitative and qualitative results of the reconstructed 3D model.
Mocap data
The motion capture system was calibrated three times over the days of data capture. The standard deviation of the wand length varied from 1.0–1.9 mm across the calibrations, with an average camera residual of 2.6–2.9 mm during calibration. For the different captures, the average camera residual varied between 1.5–3.1 mm, with higher residuals when the horse moved close to the edge of the covered volume.
3D model evaluation
Qualitative Results
We show the optimized shape and marker positions in Fig. 8. The positions of the markers (in red) have been optimized to fit different shapes, starting from their initial guess positions (in blue) on the template shape. Additionally, some markers (in gray) were not visible during Stage I, either because they were not attached or because marker detection failed during data recording.
Examples of the captured 3D model are displayed in Fig. 9 with corresponding motions illustrated in Fig. 3. The 3D shape and pose representations, derived from mocap data, effectively capture the horse’s real motions, even in challenging poses, like prancing and kicking. However, room for improvement remains in capturing more complex poses, such as sitting and lying down, especially when the limbs are in unusual positions.
Figure 10 provides results for the five horses. The images on the left side display the fitting results and the mocap frame, while the images on the right side provide a view of the reconstructed model as seen from ten different camera views. This visual comparison highlights the precision of the captured 3D shape and pose of the horses.
Shape Visualization
We provide the visualization of the estimated 3D horse shapes. Figure 12b shows the UMAP visualization of the components of the shape parameters from our five subjects, together with those of all the hSMAL toy training data. We can observe that Horse No. 3, the smallest pony, is quite distinct from the other four horses (Fig. 12a).
Quantitative Results
3D Mocap Error
To evaluate the accuracy of the model in capturing the shape and pose information from the mocap data, we analyze the Euclidean distance between the observed markers and the estimated virtual markers in each frame. As failed markers and noisy labels are inevitable, we focus on frames where more than 23 markers are detected. The results, detailed in Table 2, are reported per horse across all trials; the average across horse subjects is 0.031 m.
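This evaluation can be sketched as follows, assuming observed and estimated markers as (T, N, 3) arrays with NaN rows marking missing markers (an illustrative reimplementation, not the released evaluation code):

```python
import numpy as np

def mean_marker_error(obs, est, min_markers=23):
    """Per-frame mean Euclidean distance (metres) between observed and
    estimated markers, averaged over frames where more than
    `min_markers` markers are visible; NaN rows count as missing."""
    errs = []
    for o, e in zip(obs, est):
        vis = ~np.isnan(o).any(axis=1)
        if vis.sum() > min_markers:
            errs.append(np.linalg.norm(o[vis] - e[vis], axis=1).mean())
    return float(np.mean(errs)) if errs else float("nan")
```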
2D Silhouette Error
We measure the accuracy of the captured 3D shape and pose by calculating the intersection over union (IOU) (Eq. 4), which measures the overlap (Fig. 11c) between the extracted segmentation \(S\) (Fig. 11a) and the silhouette \(\widetilde{S}\) (Fig. 11b), obtained by projecting the corresponding 3D model with the camera. We calculate the IOU for frames within the selected silhouette subsets with more than 23 markers detected. Table 2 shows results over the selected silhouette subset per horse, with an average IOU of 0.85 across the different horse subjects. Intersection over union is defined as:

\(IOU=\frac{|S\cap \widetilde{S}|}{|S\cup \widetilde{S}|}\)
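A minimal implementation of this metric on binary masks:

```python
import numpy as np

def silhouette_iou(seg, proj):
    """IOU between a binary segmentation mask and the rendered model
    silhouette: |S intersect S~| / |S union S~|. Returns 0 for two
    empty masks."""
    inter = np.logical_and(seg, proj).sum()
    union = np.logical_or(seg, proj).sum()
    return float(inter / union) if union else 0.0
```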
Current limitations
Our current dataset has some limitations, as in Fig. 13, where the motion is not well captured. While the mocap system is precise, occasional irregularities may appear. For example, Fig. 13a shows a case where some of the markers are undetected, leading to misalignment of the model with the silhouette. Another limitation is the hSMAL model itself, which is learned from toy scans, and cannot perfectly represent real-world horses. As seen in Fig. 13b,c, certain body parts like the shape of the cheek and the back cannot be accurately captured, indicating the need for a more precise model.
Usage Notes
The provided FBX files can be imported into animation applications, including Blender and any other software that supports the FBX format. The provided C3D files can be processed with Python.
Researchers should be aware of certain anomalies in the data. Despite meticulous data collection, a small number of irregularities remain, as mentioned in the previous section: missing markers, swapped marker labels, and noisy marker positions. These can result in implausible poses of the reconstructed model when too few correct markers are visible in a given frame. To mitigate this, we require more than 23 visible markers for pose evaluation; nevertheless, these issues may influence the final results.
The silhouettes generated by deep learning methods are not infallible and should be treated as pseudo-ground truth. Even though we manually selected a subset of relatively complete silhouettes, some errors may remain.
Additionally, response times differ among the camera sensors even though all cameras are synchronized. In our current setup, we align the first frame of each mocap file with the first frame of the corresponding videos, which may still introduce minor inaccuracies. Users should account for this small margin of error during data analysis and interpretation.
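Under this first-frame alignment, mapping between the two streams reduces to a timestamp conversion. A minimal sketch (the frame rates below are illustrative defaults, not the dataset's actual capture rates):

```python
def video_to_mocap_frame(video_idx, video_fps=30.0, mocap_fps=100.0):
    """Map a video frame index to the nearest mocap frame index,
    assuming frame 0 of both streams is aligned. Residual sensor
    latency (as noted above) is not modeled."""
    t = video_idx / video_fps      # timestamp of the video frame, in seconds
    return round(t * mocap_fps)    # index of the nearest mocap sample
```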
Code availability
We provide functions for working with our dataset:
• Loading c3d files and the hSMAL model with the captured parameters to visualize the mocap data and the fitted results.
• Projecting the reconstructed model in image planes with provided camera information.
• Quantitative evaluation using the mocap data and silhouette subsets.
Further details about environment setup and code usage can be found at https://github.com/Celiali/PFERD.git.
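The projection step in the list above can be sketched with a standard pinhole camera model (the function name and the world-to-camera convention shown are assumptions; the dataset's camera files may use a different parameterization):

```python
import numpy as np

def project_points(points, K, R, t):
    """Project 3D points (N, 3) into the image plane of a pinhole
    camera with intrinsics K (3, 3), rotation R (3, 3), and
    translation t (3,). Returns (N, 2) pixel coordinates."""
    cam = points @ R.T + t         # world -> camera coordinates
    uv = cam @ K.T                 # apply intrinsic matrix
    return uv[:, :2] / uv[:, 2:3]  # perspective divide by depth
```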
References
Louis, N. et al. Temporally guided articulated hand pose tracking in surgical videos. International Journal of Computer Assisted Radiology and Surgery 18, 117–125, https://doi.org/10.1007/s11548-022-02761-6 (2023).
Zhang, J. Z. et al. Slomo: A general system for legged robot motion imitation from casual videos. IEEE Robotics and Automation Letters 8, 7154–7161, https://doi.org/10.1109/LRA.2023.3313937 (2023).
Luo, H. et al. Artemis: Articulated neural pets with appearance and motion synthesis. ACM Transactions on Graphics 41, https://doi.org/10.1145/3528223.3530086 (2022).
Khokhlova, M., Migniot, C., Morozov, A., Sushkova, O. & Dipanda, A. Normal and pathological gait classification lstm model. Artificial Intelligence in Medicine 94, 54–66, https://doi.org/10.1016/j.artmed.2018.12.007 (2019).
Raab, S. et al. Single motion diffusion. In The Twelfth International Conference on Learning Representations, https://doi.org/10.48550/arXiv.2302.05905 (2024).
Guo, C. et al. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5152–5161, https://doi.org/10.1109/CVPR52688.2022.00509 (2022).
Mir, A., Puig, X., Kanazawa, A. & Pons-Moll, G. Generating continual human motion in diverse 3d scenes. In International Conference on 3D Vision, https://doi.org/10.48550/arXiv.2304.02061 (2024).
Wilson, A. & Weller, R. The biomechanics of the equine limb and its effect on lameness. In Diagnosis and Management of Lameness in the Horse, 270–281, https://doi.org/10.1016/B978-1-4160-6069-7.00026-2 (Elsevier, 2011).
Makita, S., Murakami, N., Sakaguchi, M. & Furusho, J. Development of horse-type quadruped robot. In IEEE SMC'99 Conference Proceedings. 1999 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No. 99CH37028), vol. 6, 930–935, https://doi.org/10.1109/ICSMC.1999.816677 (IEEE, 1999).
Moro, F. L. et al. Horse-like walking, trotting, and galloping derived from kinematic motion primitives (kmps) and their application to walk/trot transitions in a compliant quadruped robot. Biological cybernetics 107, 309–320, https://doi.org/10.1007/s00422-013-0551-9 (2013).
Hoyt, D. F. & Taylor, C. R. Gait and the energetics of locomotion in horses. Nature 292, 239–240, https://doi.org/10.1038/292239a0 (1981).
Park, H. O., Dibazar, A. A. & Berger, T. W. Cadence analysis of temporal gait patterns for seismic discrimination between human and quadruped footsteps. In IEEE International Conference on Acoustics, Speech and Signal Processing, 1749–1752, https://doi.org/10.1109/ICASSP.2009.4959942 (IEEE, 2009).
Buchner, H., Obermüller, S. & Scheidl, M. Body centre of mass movement in the lame horse. Equine Veterinary Journal 33, 122–127, https://doi.org/10.1111/j.2042-3306.2001.tb05374.x (2001).
Rhodin, M. et al. Vertical movement symmetry of the withers in horses with induced forelimb and hindlimb lameness at trot. Equine veterinary journal 50, 818–824, https://doi.org/10.1111/evj.12844 (2018).
Ionescu, C., Papava, D., Olaru, V. & Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 1325–1339, https://doi.org/10.1109/TPAMI.2013.248 (2013).
Sigal, L., Balan, A. & Black, M. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision 87, 4–27, https://doi.org/10.1007/s11263-009-0273-6 (2010).
Mandery, C., Terlemez, Ö., Do, M., Vahrenkamp, N. & Asfour, T. The kit whole-body human motion database. In International Conference on Advanced Robotics, 329–336, https://doi.org/10.1109/ICAR.2015.7251476 (IEEE, 2015).
Santos, G., Wanderley, M., Tavares, T. & Rocha, A. A multi-sensor human gait dataset captured through an optical system and inertial measurement units. Scientific Data 9, 545, https://doi.org/10.1038/s41597-022-01638-2 (2022).
Cao, Z., Simon, T., Wei, S.-E. & Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7291–7299, https://doi.org/10.1109/CVPR.2017.143 (2017).
Mathis, A. et al. Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience https://doi.org/10.1038/s41593-018-0209-y (2018).
Cao, J. et al. Cross-domain adaptation for animal pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9498–9507, https://doi.org/10.1109/ICCV.2019.00959 (2019).
Kocabas, M., Karagoz, S. & Akbas, E. Self-supervised learning of 3d human pose using multi-view geometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1077–1086, https://doi.org/10.1109/CVPR.2019.00117 (2019).
Li, X., Fan, Z., Liu, Y., Li, Y. & Dai, Q. 3d pose detection of closely interactive humans using multi-view cameras. Sensors 19, 2831, https://doi.org/10.3390/s19122831 (2019).
Joska, D. et al. Acinoset: a 3d pose estimation dataset and baseline models for cheetahs in the wild. In IEEE International Conference on Robotics and Automation, 13901–13908, https://doi.org/10.1109/ICRA48506.2021.9561338 (IEEE, 2021).
Günel, S. et al. Deepfly3d, a deep learning-based approach for 3d limb and appendage tracking in tethered, adult drosophila. Elife 8, e48571, https://doi.org/10.7554/eLife.48571 (2019).
Bala, P. C. et al. Automated markerless pose estimation in freely moving macaques with openmonkeystudio. Nature communications 11, 4560, https://doi.org/10.1038/s41467-020-18441-5 (2020).
Patel, M., Gu, Y., Carstensen, L. C., Hasselmo, M. E. & Betke, M. Animal pose tracking: 3d multimodal dataset and token-based pose optimization. International Journal of Computer Vision 131, 514–530, https://doi.org/10.1007/s11263-022-01714-5 (2023).
Zimmermann, C., Schneider, A., Alyahyay, M., Brox, T. & Diester, I. Freipose: a deep learning framework for precise animal motion capture in 3d spaces. BioRxiv 2020–02, https://doi.org/10.1101/2020.02.27.967620 (2020).
Gosztolai, A. et al. Liftpose3d, a deep learning-based approach for transforming two-dimensional to three-dimensional poses in laboratory animals. Nature methods 18, 975–981, https://doi.org/10.1038/s41592-021-01226-z (2021).
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G. & Black, M. J. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics 34, 1–16, https://doi.org/10.1145/2816795.2818013 (2015).
Mahmood, N., Ghorbani, N., Troje, N. F., Pons-Moll, G. & Black, M. J. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 5442–5451, https://doi.org/10.1109/ICCV.2019.00554 (2019).
Kocabas, M., Athanasiou, N. & Black, M. J. Vibe: Video inference for human body pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5253–5263, https://doi.org/10.1109/CVPR42600.2020.00530 (2020).
Ghorbani, N. & Black, M. J. Soma: Solving optical marker-based mocap automatically. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 11097–11106, https://doi.org/10.1109/ICCV48922.2021.01093 (2021).
Yuan, Y., Iqbal, U., Molchanov, P., Kitani, K. & Kautz, J. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11028–11039, https://doi.org/10.1109/CVPR52688.2022.01076 (2022).
Li, J. et al. Task-generic hierarchical human motion prior using vaes. In International Conference on 3D Vision, 771–781, https://doi.org/10.1109/3DV53792.2021.00086 (IEEE, 2021).
Rempe, D. et al. Humor: 3d human motion model for robust pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, https://doi.org/10.1109/ICCV48922.2021.01129 (2021).
Voleti, V. et al. Smpl-ik: Learned morphology-aware inverse kinematics for ai driven artistic workflows. In SIGGRAPH Asia 2022 Technical Communications, SA ‘22, https://doi.org/10.1145/3550340.3564227 (Association for Computing Machinery, New York, NY, USA, 2022).
Zuffi, S., Kanazawa, A., Jacobs, D. W. & Black, M. J. 3d menagerie: Modeling the 3d shape and pose of animals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6365–6373, https://doi.org/10.1109/CVPR.2017.586 (2017).
Biggs, B., Boyne, O., Charles, J., Fitzgibbon, A. & Cipolla, R. Who left the dogs out? 3D animal reconstruction with expectation maximization in the loop. In European Conference on Computer Vision, https://doi.org/10.1007/978-3-030-58621-8_12 (2020).
Rüegg, N., Zuffi, S., Schindler, K. & Black, M. J. Barc: Learning to regress 3d dog shape from images by exploiting breed information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3876–3884, https://doi.org/10.1109/CVPR52688.2022.00385 (2022).
Li, C. et al. hSMAL: Detailed horse shape and pose reconstruction for motion pattern recognition. In First CV4Animals Workshop, IEEE/CVF Conference on Computer Vision and Pattern Recognition, https://doi.org/10.48550/arXiv.2106.10102 (2021).
Luo, H. et al. Artemis: articulated neural pets with appearance and motion synthesis. ACM Transactions on Graphics 41, 1–19, https://doi.org/10.1145/3528223.3530086 (2022).
Dunn, T. W. et al. Geometric deep learning enables 3d kinematic profiling across species and environments. Nature methods 18, 564–573, https://doi.org/10.1038/s41592-021-01106-6 (2021).
Zhang, H., Starke, S., Komura, T. & Saito, J. Mode-adaptive neural networks for quadruped motion control. ACM Transactions on Graphics 37, 1–11, https://doi.org/10.1145/3197517.3201366 (2018).
Kearney, S., Li, W., Parsons, M., Kim, K. I. & Cosker, D. Rgbd-dog: Predicting canine pose from rgbd sensors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8336–8345, https://doi.org/10.1109/CVPR42600.2020.00836 (2020).
Warner, S., Koch, T. & Pfau, T. Inertial sensors for assessment of back movement in horses during locomotion over ground. Equine Veterinary Journal 42, 417–424, https://doi.org/10.1111/j.2042-3306.2010.00200.x (2010).
Bragança, F. S., Rhodin, M. & Van Weeren, P. On the brink of daily clinical application of objective gait analysis: What evidence do we have so far from studies using an induced lameness model. The Veterinary Journal 234, 11–23, https://doi.org/10.1016/j.tvjl.2018.01.006 (2018).
Unt, V., Evans, J., Reed, S., Pfau, T. & Weller, R. Variation in frontal plane joint angles in horses. Equine Veterinary Journal 42, 444–450, https://doi.org/10.1111/j.2042-3306.2010.00192.x (2010).
Bosch, S. et al. Equimoves: A wireless networked inertial measurement system for objective examination of horse gait. Sensors 18, 850, https://doi.org/10.3390/s18030850 (2018).
Ericson, C., Stenfeldt, P., Hardeman, A. & Jacobson, I. The effect of kinesiotape on flexion-extension of the thoracolumbar back in horses at trot. Animals 10, 301, https://doi.org/10.3390/ani10020301 (2020).
Li, C. et al. The Poses for Equine Research Dataset (PFERD). Harvard Dataverse https://doi.org/10.7910/DVN/2EXONE (2024).
Loper, M., Mahmood, N. & Black, M. J. Mosh: motion and shape capture from sparse markers. ACM Transactions on Graphics 33, 220–1, https://doi.org/10.1145/2661229.2661273 (2014).
Yang, J. et al. Track anything: Segment anything meets videos. arXiv:2304.11968 https://doi.org/10.48550/arXiv.2304.11968 (2023).
Kirillov, A. et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4015–4026, https://doi.org/10.1109/ICCV51070.2023.00371 (2023).
Sisson, S. A Textbook of Veterinary Anatomy (Saunders, 1911).
Acknowledgements
The project was supported by a career grant from SLU No. 57122058, received by Elin Hernlund. Silvia Zuffi was in part supported by the European Commission’s NextGeneration EU Programme, PNRR grant PE0000013 Future Artificial Intelligence Research–FAIR CUP B53C22003630006. The authors sincerely thank Tove Kjellmark for her assistance during the data collection and all horses and owners for their collaboration in the data collection experiments. Thank you to Zala Zgank for helping with the labeling of motion capture data.
Funding
Open access funding provided by Swedish University of Agricultural Sciences.
Author information
Contributions
C.L. processed the data, conducted the computer science experiments and wrote the manuscript. Y.M. and J.K. labeled and processed the motion capture data. E.H., S.P., and M.H. planned, designed and contributed the resources for the data collection and performed the data collection together with J.K.. N.G. provided the framework for supporting the 3D modeling experiments. M.J.B. provided helpful insight, discussion and resources in the project. H.K., S.Z. and E.H. supervised the project, revised the manuscripts, and E.H. provided overall project management. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was done while the author was at MPI-IS.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, C., Mellbin, Y., Krogager, J. et al. The Poses for Equine Research Dataset (PFERD). Sci Data 11, 497 (2024). https://doi.org/10.1038/s41597-024-03312-1