ILLC & Computational Science Lab, University of Amsterdam
We present an efficient approach for capturing sign language in 3D, introduce the 3D-LEX v1.0 dataset, and detail a method for semi-automatic annotation of phonetic properties. Our procedure integrates three motion capture techniques, recording high-resolution 3D poses, 3D handshapes, and depth-aware facial features at an average rate of one sign every 10 seconds. The 3D-LEX dataset comprises 1,000 signs from American Sign Language (ASL) and 1,000 signs from the Sign Language of the Netherlands (NGT). We showcase the dataset's utility with a simple method for generating handshape annotations directly from 3D-LEX: the resulting labels improve gloss recognition accuracy by 5% over using no handshape annotations, and by 1% over expert annotations. Our motion capture data also supports in-depth analysis of sign features and the generation of 2D projections from any viewpoint.
3D-LEX combines three complementary motion capture technologies to record body pose, hand configuration, and facial dynamics simultaneously, with a triple-foot-pedal system enabling touchless synchronised control.
| System | Captures | Equipment |
|---|---|---|
| Vicon optical MoCap | Full-body 3D pose | 12 Vero v2.2 cameras, 53-marker template |
| StretchSense gloves | 3D handshapes | Pro Fidelity gloves, 26 sensors/hand |
| Live Link Face | Facial blendshapes | iPhone 13 Pro, ARKit |

| Statistic | Value |
|---|---|
| ASL signs | 1,000 |
| NGT signs | 1,000 |
| Average capture rate | ~10 sec / sign |
| Landmark detection rate | 97.6% (real), 97.8% (synthetic) |
| Alignment with WLASL | 695 / 1,000 glosses |
| Alignment with SEMLEX | 921 / 1,000 glosses |
| Alignment with SignBank NGT | 888 / 1,000 glosses |
We introduce a two-stage pipeline for automatically generating handshape labels from StretchSense glove data. First, temporal segmentation via Euclidean distance to calibration poses isolates the characteristic handshape portion of each sign from resting/transitional poses. Second, k-means clustering (k = 50 for ASL) on the segmented frames assigns new handshape labels without requiring expert knowledge.
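The two-stage idea can be sketched in plain Python on toy glove-sensor vectors. This is a minimal illustration, not the paper's implementation: the threshold, the toy data, and the deterministic farthest-point seeding are assumptions made here for clarity (the paper clusters StretchSense data with k = 50 for ASL).

```python
import math

def dist(a, b):
    """Euclidean distance between two sensor-reading vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def segment_sign(frames, rest_pose, threshold=1.0):
    """Stage 1: drop frames close to the resting calibration pose,
    keeping the characteristic handshape portion of the sign."""
    return [f for f in frames if dist(f, rest_pose) > threshold]

def farthest_point_init(points, k):
    """Deterministic seeding: start from the first point, then repeatedly
    add the point farthest from all centroids chosen so far."""
    cents = [points[0]]
    while len(cents) < k:
        cents.append(max(points, key=lambda p: min(dist(p, c) for c in cents)))
    return cents

def kmeans(points, k, iters=20):
    """Stage 2: plain k-means; returns one cluster label per point."""
    cents = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, cents[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == c]
            if members:
                cents[c] = [sum(vals) / len(members)
                            for vals in zip(*members)]
    return labels

# Toy data: 2-sensor "glove" frames for two signs with distinct handshapes.
rest = [0.0, 0.0]
sign_a = [[0.1, 0.1], [2.0, 2.1], [2.1, 2.0]]   # handshape near (2, 2)
sign_b = [[0.2, 0.0], [5.0, 5.2], [5.1, 5.0]]   # handshape near (5, 5)

active = segment_sign(sign_a, rest) + segment_sign(sign_b, rest)
labels = kmeans(active, k=2)   # frames of the same sign share a label
```

The key property illustrated here is that no expert phonetic knowledge enters the loop: the label inventory emerges from the geometry of the sensor readings themselves.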
Automatic handshape labels are evaluated on the WLASL 2000 isolated sign recognition benchmark using the SL-GCN model. The automatic annotations from 3D-LEX match or exceed expert annotations, at a fraction of the cost of manual labelling.
| Condition | Top-1 Accuracy |
|---|---|
| No handshape labels | 0.44 ± 0.01 |
| Expert handshape labels (ASL-LEX 2.0) | 0.48 ± 0.01 |
| Automatic labels — 3D-LEX (ours) | 0.49 ± 0.01 |
Beyond phonetic annotation, 3D ground-truth enables generation of 2D projections from arbitrary viewpoints via avatar retargeting — providing a route to multi-view training data without expensive multi-camera setups.
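In geometric terms, such a projection amounts to rotating the 3D keypoints to the desired azimuth and applying a pinhole camera model. A minimal sketch, where the camera distance, focal length, and function names are illustrative assumptions rather than the paper's avatar-retargeting pipeline:

```python
import math

def rotate_y(p, yaw):
    """Rotate a 3D point (x, y, z) about the vertical (y) axis."""
    x, y, z = p
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z, y, -s * x + c * z)

def project(p, camera_distance=3.0, focal=1.0):
    """Pinhole projection: camera on the +z axis, looking at the origin."""
    x, y, z = p
    depth = camera_distance - z          # distance from camera to the point
    return (focal * x / depth, focal * y / depth)

def render_view(keypoints_3d, yaw):
    """2D projection of a 3D pose as seen from azimuth `yaw` (radians)."""
    return [project(rotate_y(p, yaw)) for p in keypoints_3d]

# A keypoint to the signer's side drifts toward the image centre as the
# viewpoint rotates 90 degrees around the body.
pose = [(0.5, 1.0, 0.0)]
front = render_view(pose, 0.0)
side = render_view(pose, math.pi / 2)
```

Because the same 3D pose can be re-rendered at any yaw, one capture session yields training images from viewpoints that were never physically filmed.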
@inproceedings{ranum-etal-2024-3d,
title = {3{D}-{LEX} v1.0 {--} 3{D} Lexicons for {A}merican {S}ign {L}anguage
and {S}ign {L}anguage of the {N}etherlands},
author = {Ranum, Oline and Otterspeer, Gom{\`e}r and Andersen, Jari I. and
Belleman, Robert G. and Roelofsen, Floris},
editor = {Efthimiou, Eleni and Fotinea, Stavroula-Evita and Hanke, Thomas and
Hochgesang, Julie A. and Mesch, Johanna and Schulder, Marc},
booktitle = {Proceedings of the LREC-COLING 2024 11th Workshop on the Representation
and Processing of Sign Languages: Evaluation of Sign Language Resources},
month = may,
year = {2024},
address = {Torino, Italia},
publisher = {ELRA and ICCL},
pages = {290--301},
url = {https://aclanthology.org/2024.signlang-1.33/}
}