LREC-COLING 2024 · Sign Language Workshop

3D-LEX v1.0: 3D Lexicons for American Sign Language
and Sign Language of the Netherlands

Oline Ranum  ·  Gomèr Otterspeer  ·  Jari I. Andersen  ·  Robert G. Belleman  ·  Floris Roelofsen

ILLC & Computational Science Lab, University of Amsterdam

Phonetic annotation of sign languages is expensive and requires expert linguists. 3D-LEX provides high-resolution motion capture data for 2,000 signs across ASL and NGT, enabling semi-automatic phonetic annotation that matches or exceeds expert accuracy at a fraction of the cost.

Abstract

We present an efficient approach for capturing sign language in 3D, introduce the 3D-LEX v1.0 dataset, and detail a method for semi-automatic annotation of phonetic properties. Our procedure integrates three motion capture techniques encompassing high-resolution 3D poses, 3D handshapes, and depth-aware facial features, and attains an average sampling rate of one sign every 10 seconds. The 3D-LEX dataset includes 1,000 signs from American Sign Language and an additional 1,000 signs from the Sign Language of the Netherlands. We showcase the dataset's utility by presenting a simple method for generating handshape annotations directly from 3D-LEX. The labels enhance gloss recognition accuracy by 5% over using no handshape annotations, and by 1% over expert annotations. Our motion capture data supports in-depth analysis of sign features and facilitates the generation of 2D projections from any viewpoint.

Capture System

3D-LEX combines three complementary motion capture technologies to record body pose, hand configuration, and facial dynamics simultaneously, with a triple foot-pedal system enabling hands-free, synchronised control of all three devices.

Vicon camera layout
Vicon setup. 10 ceiling-mounted + 2 floor cameras (Vero v2.2) forming the detection zone around the signer.
Marker layout
53-marker FrontWaist template used for full-body 3D pose capture in Shogun Live.
System                 Captures             Equipment
Vicon optical MoCap    Full-body 3D pose    12 Vero v2.2 cameras, 53-marker template
StretchSense gloves    3D handshapes        Pro Fidelity gloves, 26 sensors per hand
Live Link Face         Facial blendshapes   iPhone 13 Pro, ARKit

Dataset Statistics

Statistic                       Value
ASL signs                       1,000
NGT signs                       1,000
Average capture rate            ~10 sec / sign
Landmark detection rate         97.6% (real), 97.8% (synthetic)
Alignment with WLASL            695 / 1,000 glosses
Alignment with Sem-Lex          921 / 1,000 glosses
Alignment with SignBank NGT     888 / 1,000 glosses

Semi-Automatic Phonetic Annotation

We introduce a two-stage pipeline for automatically generating handshape labels from StretchSense glove data. First, temporal segmentation via Euclidean distance to calibration poses isolates the characteristic handshape portion of each sign from resting/transitional poses. Second, k-means clustering (k = 50 for ASL) on the segmented frames assigns new handshape labels without requiring expert knowledge.
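The two stages above can be sketched as follows. This is a minimal illustration, not the authors' code: the 26-dimensional sensor vectors, the calibration-pose dictionary, the resting-pose label, and the use of per-frame clustering are all assumptions made for the example.

```python
# Stage 1: per-frame nearest calibration pose (Euclidean distance), then
# pick the characteristic (non-resting) handshape of the sign.
# Stage 2: k-means over segmented frames yields new handshape labels.
import numpy as np
from sklearn.cluster import KMeans

def nearest_calibration_pose(frames, calib):
    """Label each frame with the name of its nearest calibration pose.

    frames: (T, D) glove sensor readings for one sign
    calib:  dict mapping handshape name -> (D,) calibration pose
    """
    names = list(calib)
    C = np.stack([calib[n] for n in names])                    # (K, D)
    d = np.linalg.norm(frames[:, None, :] - C[None], axis=-1)  # (T, K)
    return [names[i] for i in d.argmin(axis=1)]

def characteristic_handshape(labels, rest_label="5"):
    """Select the most frequent non-resting handshape as the sign's label."""
    counts = {}
    for lab in labels:
        if lab != rest_label:
            counts[lab] = counts.get(lab, 0) + 1
    return max(counts, key=counts.get) if counts else rest_label

# Stage 2 on stand-in data: cluster segmented frames; cluster IDs become
# the automatic handshape labels (the paper uses k = 50 for ASL).
rng = np.random.default_rng(0)
segmented_frames = rng.normal(size=(200, 26))   # placeholder sensor data
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(segmented_frames)
auto_labels = km.labels_                         # one cluster ID per frame
```

Because stage 2 is purely unsupervised, the resulting labels need no expert phonetic knowledge; they are only as interpretable as the clusters themselves, which the t-SNE projection below the pipeline description is meant to probe.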

Temporal handshape segmentation
Temporal segmentation. Classification of the ASL sign "zero" (expert label: "o"). Each bar represents a captured frame; colour indicates the handshape identified frame-by-frame by the Euclidean distance method. The pipeline identifies "5", "f", "c", and "o", selecting "o" as the characteristic signal.
Handshape label distributions
Handshape distributions. Comparison of distributions from expert annotations (ASL-LEX 2.0) and our automatic k-means labels on 1,000 ASL glosses.
t-SNE projection of handshapes
t-SNE projection. Average hand poses projected to 2D, colour-coded by k-means cluster label, showing clearly separable handshape clusters.

Results

Automatic handshape labels are evaluated on the WLASL 2000 isolated sign recognition benchmark using SL-GCN. Automatic annotations from 3D-LEX match or exceed expert annotations at negligible cost compared to manual labelling.

Condition                                  Top-1 accuracy
No handshape labels                        0.44 ± 0.01
Expert handshape labels (ASL-LEX 2.0)      0.48 ± 0.01
Automatic labels, 3D-LEX (ours)            0.49 ± 0.01

Multi-View Data Generation

Beyond phonetic annotation, 3D ground-truth enables generation of 2D projections from arbitrary viewpoints via avatar retargeting — providing a route to multi-view training data without expensive multi-camera setups.
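The geometry behind such projections can be sketched with a simple pinhole-camera model. The paper itself uses avatar retargeting; the rotation-then-perspective-divide below, and all parameter values (yaw angle, focal length, camera distance), are illustrative assumptions only.

```python
# Project 3D keypoints to 2D image coordinates from an arbitrary viewpoint:
# rotate the scene about the vertical axis, then apply a pinhole projection.
import numpy as np

def project_from_viewpoint(points3d, yaw_deg, focal=1000.0, cam_dist=3.0):
    """Project (N, 3) points to (N, 2) pixel offsets for a camera rotated
    by `yaw_deg` around the vertical (y) axis."""
    t = np.deg2rad(yaw_deg)
    R = np.array([[np.cos(t), 0.0, np.sin(t)],
                  [0.0,       1.0, 0.0      ],
                  [-np.sin(t), 0.0, np.cos(t)]])
    p = points3d @ R.T          # rotate scene to the new viewpoint
    p[:, 2] += cam_dist         # place points in front of the camera
    return focal * p[:, :2] / p[:, 2:3]   # perspective divide

# A keypoint at the origin projects to the image centre from any yaw:
uv = project_from_viewpoint(np.array([[0.0, 0.0, 0.0]]), yaw_deg=37.0)
```

Sweeping `yaw_deg` over a range of angles turns a single 3D recording into many consistent 2D views, which is the property exploited to augment multi-view training data.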

Multi-view synthetic projection
Synthetic 2D projections generated from 3D-LEX ground truth at multiple viewpoints, used to augment the NGT200 multi-view dataset.

Citation

@inproceedings{ranum-etal-2024-3d,
  title     = {3{D}-{LEX} v1.0 {--} 3{D} Lexicons for {A}merican {S}ign {L}anguage
               and {S}ign {L}anguage of the {N}etherlands},
  author    = {Ranum, Oline and Otterspeer, Gom{\`e}r and Andersen, Jari I. and
               Belleman, Robert G. and Roelofsen, Floris},
  editor    = {Efthimiou, Eleni and Fotinea, Stavroula-Evita and Hanke, Thomas and
               Hochgesang, Julie A. and Mesch, Johanna and Schulder, Marc},
  booktitle = {Proceedings of the LREC-COLING 2024 11th Workshop on the Representation
               and Processing of Sign Languages: Evaluation of Sign Language Resources},
  month     = may,
  year      = {2024},
  address   = {Torino, Italia},
  publisher = {ELRA and ICCL},
  pages     = {290--301},
  url       = {https://aclanthology.org/2024.signlang-1.33/}
}