Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition
Weichao Zhao, Hezhen Hu, Wengang Zhou, Min Wang and Houqiang Li
This repository contains the Python (PyTorch) implementation of this paper.
Under review at IEEE TIP, 2024.
```
python==3.8.13
torch==1.8.1+cu111
torchvision==0.9.1+cu111
tensorboard==2.9.0
scikit-learn==1.1.1
tqdm==4.64.0
numpy==1.22.4
```
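One way to install these dependencies (a sketch, not an official setup script: it assumes a CUDA 11.1 machine to match the `+cu111` wheels listed above, and uses PyTorch's standard wheel index for those builds):

```shell
# Install the CUDA 11.1 builds of torch/torchvision from the PyTorch wheel index
# (assumption: CUDA 11.1 is available, matching the +cu111 versions above).
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 \
    -f https://download.pytorch.org/whl/torch_stable.html

# Remaining packages from the list above.
pip install tensorboard==2.9.0 scikit-learn==1.1.1 tqdm==4.64.0 numpy==1.22.4
```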
Please refer to the provided bash scripts.
- Download the original datasets: SLR500, NMFs_CSL, WLASL, and MSASL.
- Use the off-the-shelf pose estimator MMPose with the Topdown Heatmap + HRNet + Dark setting on COCO-WholeBody to extract 2D keypoints for the sign language videos.
- Organize the final data as follows:
```
Data
├── NMFs_CSL
├── SLR500
├── WLASL
└── MSASL
    ├── Video
    ├── Pose
    └── Annotations
```
You can download the pretrained model from this link: pretrained model on four ISLR datasets
The framework of our code is based on skeleton-contrast.