LREC-COLING 2024 · Sign Language Workshop

3D-LEX v1.0: 3D Lexicons for American Sign Language
and Sign Language of the Netherlands

Oline Ranum  ·  Gomèr Otterspeer  ·  Jari I. Andersen  ·  Robert G. Belleman  ·  Floris Roelofsen

ILLC & Computational Science Lab, University of Amsterdam

Phonetic annotation of sign languages is expensive and requires expert linguists. 3D-LEX provides high-resolution motion capture data for 2,000 signs across ASL and NGT, enabling semi-automatic phonetic annotation that matches or exceeds expert accuracy at a fraction of the cost.

Abstract

We present an efficient approach for capturing sign language in 3D, introduce the 3D-LEX v1.0 dataset, and detail a method for semi-automatic annotation of phonetic properties. Our procedure integrates three motion capture techniques encompassing high-resolution 3D poses, 3D handshapes, and depth-aware facial features, and attains an average sampling rate of one sign every 10 seconds. The 3D-LEX dataset includes 1,000 signs from American Sign Language and an additional 1,000 signs from the Sign Language of the Netherlands. We showcase the dataset's utility by presenting a simple method for generating handshape annotations directly from 3D-LEX. The labels enhance gloss recognition accuracy by 5% over using no handshape annotations, and by 1% over expert annotations. Our motion capture data supports in-depth analysis of sign features and facilitates the generation of 2D projections from any viewpoint.

Capture System

3D-LEX combines three complementary motion capture technologies to record body pose, hand configuration, and facial dynamics simultaneously, with a triple foot-pedal system enabling hands-free, synchronised control of all three devices.

Vicon camera layout
Vicon setup. 10 ceiling-mounted + 2 floor cameras (Vero v2.2) forming the detection zone around the signer.
Marker layout
53-marker FrontWaist template used for full-body 3D pose capture in Shogun Live.
System                 Captures             Equipment
Vicon optical MoCap    Full-body 3D pose    12 Vero v2.2 cameras, 53-marker template
StretchSense gloves    3D handshapes        Pro Fidelity gloves, 26 sensors per hand
Live Link Face         Facial blendshapes   iPhone 13 Pro, ARKit

Dataset Statistics

Statistic                       Value
ASL signs                       1,000
NGT signs                       1,000
Average capture rate            ~10 sec / sign
Landmark detection rate         97.6% (real), 97.8% (synthetic)
Alignment with WLASL            695 / 1,000 glosses
Alignment with Sem-Lex          921 / 1,000 glosses
Alignment with SignBank NGT     888 / 1,000 glosses

Semi-Automatic Phonetic Annotation

We introduce a two-stage pipeline for automatically generating handshape labels from StretchSense glove data. First, temporal segmentation via Euclidean distance to calibration poses isolates the characteristic handshape portion of each sign from resting/transitional poses. Second, k-means clustering (k = 50 for ASL) on the segmented frames assigns new handshape labels without requiring expert knowledge.
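The two stages above can be sketched as follows. This is a minimal illustration, not the authors' code: the 26-dimensional sensor vectors, the calibration-pose dictionary, the resting-pose label, and the use of per-frame clustering are all assumptions made for the example.

```python
# Stage 1: per-frame nearest calibration pose (Euclidean distance), then
# pick the characteristic (non-resting) handshape of the sign.
# Stage 2: k-means over segmented frames yields new handshape labels.
import numpy as np
from sklearn.cluster import KMeans

def nearest_calibration_pose(frames, calib):
    """Label each frame with the name of its nearest calibration pose.

    frames: (T, D) glove sensor readings for one sign
    calib:  dict mapping handshape name -> (D,) calibration pose
    """
    names = list(calib)
    C = np.stack([calib[n] for n in names])                    # (K, D)
    d = np.linalg.norm(frames[:, None, :] - C[None], axis=-1)  # (T, K)
    return [names[i] for i in d.argmin(axis=1)]

def characteristic_handshape(labels, rest_label="5"):
    """Select the most frequent non-resting handshape as the sign's label."""
    counts = {}
    for lab in labels:
        if lab != rest_label:
            counts[lab] = counts.get(lab, 0) + 1
    return max(counts, key=counts.get) if counts else rest_label

# Stage 2 on stand-in data: cluster segmented frames; cluster IDs become
# the automatic handshape labels (the paper uses k = 50 for ASL).
rng = np.random.default_rng(0)
segmented_frames = rng.normal(size=(200, 26))   # placeholder sensor data
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(segmented_frames)
auto_labels = km.labels_                         # one cluster ID per frame
```

Because stage 2 is purely unsupervised, the resulting labels need no expert phonetic knowledge; they are only as interpretable as the clusters themselves, which the t-SNE projection below the pipeline description is meant to probe.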

Temporal handshape segmentation
Temporal segmentation. Classification of the ASL sign "zero" (expert label: "o"). Each bar represents a captured frame; colour indicates the handshape identified frame-by-frame by the Euclidean distance method. The pipeline identifies "5", "f", "c", and "o", selecting "o" as the characteristic signal.
Handshape label distributions
Handshape distributions. Comparison of distributions from expert annotations (ASL-LEX 2.0) and our automatic k-means labels on 1,000 ASL glosses.
t-SNE projection of handshapes
t-SNE projection. Average hand poses projected to 2D, colour-coded by k-means cluster label, showing clearly separable handshape clusters.

Results

Automatic handshape labels are evaluated on the WLASL 2000 isolated sign recognition benchmark using SL-GCN. Automatic annotations from 3D-LEX match or exceed expert annotations at negligible cost compared to manual labelling.

Condition                                  Top-1 accuracy
No handshape labels                        0.44 ± 0.01
Expert handshape labels (ASL-LEX 2.0)      0.48 ± 0.01
Automatic labels, 3D-LEX (ours)            0.49 ± 0.01

Multi-View Data Generation

Beyond phonetic annotation, 3D ground-truth enables generation of 2D projections from arbitrary viewpoints via avatar retargeting — providing a route to multi-view training data without expensive multi-camera setups.
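The geometry behind such projections can be sketched with a simple pinhole-camera model. The paper itself uses avatar retargeting; the rotation-then-perspective-divide below, and all parameter values (yaw angle, focal length, camera distance), are illustrative assumptions only.

```python
# Project 3D keypoints to 2D image coordinates from an arbitrary viewpoint:
# rotate the scene about the vertical axis, then apply a pinhole projection.
import numpy as np

def project_from_viewpoint(points3d, yaw_deg, focal=1000.0, cam_dist=3.0):
    """Project (N, 3) points to (N, 2) pixel offsets for a camera rotated
    by `yaw_deg` around the vertical (y) axis."""
    t = np.deg2rad(yaw_deg)
    R = np.array([[np.cos(t), 0.0, np.sin(t)],
                  [0.0,       1.0, 0.0      ],
                  [-np.sin(t), 0.0, np.cos(t)]])
    p = points3d @ R.T          # rotate scene to the new viewpoint
    p[:, 2] += cam_dist         # place points in front of the camera
    return focal * p[:, :2] / p[:, 2:3]   # perspective divide

# A keypoint at the origin projects to the image centre from any yaw:
uv = project_from_viewpoint(np.array([[0.0, 0.0, 0.0]]), yaw_deg=37.0)
```

Sweeping `yaw_deg` over a range of angles turns a single 3D recording into many consistent 2D views, which is the property exploited to augment multi-view training data.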

Multi-view synthetic projection
Synthetic 2D projections generated from 3D-LEX ground truth at multiple viewpoints, used to augment the NGT200 multi-view dataset.

Citation

@inproceedings{ranum-etal-2024-3d,
  title     = {3{D}-{LEX} v1.0 {--} 3{D} Lexicons for {A}merican {S}ign {L}anguage
               and {S}ign {L}anguage of the {N}etherlands},
  author    = {Ranum, Oline and Otterspeer, Gom{\`e}r and Andersen, Jari I. and
               Belleman, Robert G. and Roelofsen, Floris},
  editor    = {Efthimiou, Eleni and Fotinea, Stavroula-Evita and Hanke, Thomas and
               Hochgesang, Julie A. and Mesch, Johanna and Schulder, Marc},
  booktitle = {Proceedings of the LREC-COLING 2024 11th Workshop on the Representation
               and Processing of Sign Languages: Evaluation of Sign Language Resources},
  month     = may,
  year      = {2024},
  address   = {Torino, Italia},
  publisher = {ELRA and ICCL},
  pages     = {290--301},
  url       = {https://aclanthology.org/2024.signlang-1.33/}
}