GRaM @ ICML 2024

The NGT200 Dataset:
Geometric Multi-View Isolated Sign Recognition

Oline Ranum  ·  David Wessels  ·  Gomèr Otterspeer  ·  Erik J. Bekkers  ·  Floris Roelofsen  ·  Jari I. Andersen

SignLab Amsterdam & AMLab, University of Amsterdam

Sign languages are three-dimensional, yet almost all recognition research operates on frontal-view 2D video. We show that a model trained on a single frontal view loses over 50% relative accuracy when tested on a side view — and that geometric, viewpoint-aware learning substantially closes this gap.

Abstract

Sign Language Processing (SLP) provides a foundation for a more inclusive future in language technology; however, the field faces several significant challenges that must be addressed to achieve practical, real-world applications. This work addresses multi-view isolated sign recognition (MV-ISR), and highlights the essential role of 3D awareness and geometry in SLP systems. We introduce the NGT200 dataset, a novel spatio-temporal multi-view benchmark, establishing MV-ISR as distinct from single-view ISR (SV-ISR). We demonstrate the benefits of synthetic data and propose conditioning sign representations on spatial symmetries inherent in sign language. Leveraging an SE(2)-equivariant model improves MV-ISR performance by 8–22% over the baseline.

The NGT200 Dataset

NGT200 contains pose and video data for 200 common NGT signs recorded from three calibrated viewpoints (−25°, 0°, +25°) by 3 Deaf signers and a retargeted synthetic avatar, yielding a multi-view benchmark designed explicitly for viewpoint-invariant sign recognition research.

SignCollect recording setup
Recording setup. Each camera is positioned 4 metres from the signer, with 25° separation between the three viewpoints. The setup uses the SignCollect platform.
Spatio-temporal point clouds
Spatio-temporal point clouds extracted with MediaPipe Holistic from the front and right viewpoints. White landmarks show a single frame; blue landmarks encode temporal dynamics across multiple frames.
Property     Value
Signs        200 NGT glosses
Viewpoints   3 (left −25°, front 0°, right +25°)
Signers      3 Deaf human signers + 1 synthetic avatar
Landmarks    75 per frame (MediaPipe Holistic)
Modalities   Spatio-temporal pose, video
License      CC BY 4.0
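The spatio-temporal point clouds above can be sketched in a few lines. This is a minimal NumPy illustration, not the dataset's actual pipeline: it stacks per-frame landmarks (75 per frame, consistent with MediaPipe Holistic's 33 pose + 21 landmarks per hand) into one point cloud where each point carries a time coordinate. The function name and time normalisation are assumptions for illustration.

```python
import numpy as np

N_LANDMARKS = 75  # 33 pose + 21 left-hand + 21 right-hand landmarks per frame

def to_spatiotemporal_cloud(frames: np.ndarray) -> np.ndarray:
    """frames: (T, 75, 3) per-frame (x, y, z) landmarks.
    Returns a (T * 75, 4) point cloud; column 3 is time, normalised to [0, 1]."""
    T = frames.shape[0]
    t = np.repeat(np.linspace(0.0, 1.0, T), N_LANDMARKS)  # one time value per point
    xyz = frames.reshape(T * N_LANDMARKS, 3)
    return np.column_stack([xyz, t])

# Example: a 40-frame clip with random landmark coordinates
clip = np.random.rand(40, N_LANDMARKS, 3)
cloud = to_spatiotemporal_cloud(clip)
print(cloud.shape)  # (3000, 4)
```

In the paper's figures, the time channel is what distinguishes the blue (multi-frame) landmarks from the white single-frame ones.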

Method

Pose Graph Construction

Pose landmarks are downsampled to a 27-node graph (10 nodes per hand, 7 for overall body position) with spatial edges approximating the human skeletal structure. The graph is used as input to both the SL-GCN baseline and the proposed Temporal-PONITA model.

Reduced spatial graph
Reduced spatial graph. 27-node skeleton with approximate bone-structure edges, used as input to both model architectures.
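The 27-node reduction (7 body nodes, 10 per hand) amounts to index selection on the 75 Holistic landmarks plus a fixed edge list. The sketch below shows the mechanics only; the specific landmark indices and edges are illustrative placeholders, not the paper's exact selection.

```python
import numpy as np

# Illustrative index sets (NOT the paper's exact choice):
# 7 body nodes, e.g. nose, shoulders, elbows, wrists.
BODY = [0, 11, 12, 13, 14, 15, 16]
# 10 nodes per hand, e.g. wrist plus selected finger joints.
HAND = [0, 4, 5, 8, 9, 12, 13, 16, 17, 20]
LEFT_HAND = [33 + i for i in HAND]    # left-hand landmarks start after 33 pose nodes
RIGHT_HAND = [54 + i for i in HAND]   # right-hand landmarks follow the left hand
KEEP = BODY + LEFT_HAND + RIGHT_HAND  # 27 node indices in total

def reduce_graph(frames: np.ndarray) -> np.ndarray:
    """frames: (T, 75, 3) full landmark array -> (T, 27, 3) reduced pose graph."""
    return frames[:, KEEP, :]

# Spatial edges approximating the skeleton, as (index, index) pairs into KEEP;
# shown only partially, as an illustration.
EDGES = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]

clip = np.random.rand(40, 75, 3)
reduced = reduce_graph(clip)
print(reduced.shape)  # (40, 27, 3)
```

The same reduced graph feeds both SL-GCN and Temporal-PONITA, so the comparison in Q4 is over architectures, not input representations.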

Temporal-PONITA: SE(2)-Equivariant Sign Recognition

We propose Temporal-PONITA, an extension of the PONITA architecture augmented with temporal convolution modules. By conditioning representations on SE(2) spatial symmetries (rotations and translations in the image plane), the model reduces the learning burden imposed by viewpoint variation and inter-signer differences.

Temporal-PONITA architecture
Temporal-PONITA architecture. Input features are embedded with a linear layer, then passed through L Temporal-PONITA layers, each containing one ConvNeXt block and one temporal block (two convolutional layers with GeLU activations).
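The temporal block described in the caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: it uses depthwise (per-channel) 1D convolutions over the time axis with "same" padding, each followed by GeLU, whereas the real model's convolutions live inside the PONITA framework with learned parameters.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GeLU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def temporal_conv(x, w, b):
    """Depthwise 1D convolution over time with 'same' padding.
    x: (T, C) features, w: (k, C) per-channel kernels, b: (C,) bias."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        out[t] = np.sum(xp[t:t + k] * w, axis=0) + b
    return out

def temporal_block(x, params):
    """Two temporal convolutions, each followed by a GeLU activation."""
    (w1, b1), (w2, b2) = params
    return gelu(temporal_conv(gelu(temporal_conv(x, w1, b1)), w2, b2))

rng = np.random.default_rng(0)
T, C, k = 16, 8, 3  # frames, channels, kernel size (illustrative)
params = [(rng.normal(size=(k, C)) * 0.1, np.zeros(C)) for _ in range(2)]
y = temporal_block(rng.normal(size=(T, C)), params)
print(y.shape)  # (16, 8)
```

The SE(2) equivariance itself comes from the spatial (ConvNeXt/PONITA) part of each layer; the temporal block only mixes information along the frame axis.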

Results

Our experiments are structured around four research questions, each probing a different aspect of multi-view isolated sign recognition. Together they establish that MV-ISR is a distinct task from single-view ISR, that geometry-aware models provide a strong path forward, and that synthetic data can meaningfully close the data gap.

Q1
Does viewing angle matter in pose-based ISR?
Yes — a model trained on one view drops to near-chance on another view.
Q2
How does including more views during training affect performance?
Consistently better. All-3-view training reaches .46/.49/.47 Top-1 across views.
Q3
Can synthetic data boost MV-ISR performance for novel signers?
Yes — combining 3-view avatar data with SignBank raises novel-signer accuracy to .32/.48/.43.
Q4
Is a geometrically grounded model viable for ISR?
Yes — SE(2)-equivariant Temporal-PONITA gains +8% to +22% over SL-GCN and trains 40% faster.

Q1 & Q2 — Viewing Angle and Multi-View Training (SL-GCN)

A model trained on only one viewpoint achieves near-chance performance when tested on a different view. Adding more frontal data from SignBank marginally improves frontal accuracy but does not recover side-view performance, confirming that MV-ISR is a genuinely distinct task from single-view ISR.

Training view      Test Left   Test Front   Test Right
Left only          .05         .03          .01
Right only         .02         .03          .06
Front only         .03         .09          .03
Front + SignBank   .06         .20          .05

Top-1 accuracy of single-view trained models, all signers (IDs 1, 2, 3 + avatar A). In each row, the best test view is the training view.

Training on combinations of views substantially improves generalisation. Using all three views together yields the best results across every test condition.

Training views   Left Top-1 / Top-3   Front Top-1 / Top-3   Right Top-1 / Top-3
Left + Front     .25 / .51            .35 / .59             —
Left + Right     .27 / .47            —                     .28 / .51
Front + Right    —                    .42 / .67             .39 / .62
All 3 views      .46 / .69            .49 / .74             .47 / .72

Q3 — Effect of Synthetic Data (SL-GCN, novel signer)

When training on signers 1 & 2 and testing on unseen signer 3, synthetic avatar data provides substantial gains. Combining 3-view avatar data with SignBank frontal views achieves the best novel-signer performance.

Training data                 Test Left (S3)   Test Front (S3)   Test Right (S3)
3-view, signers 1+2 only      .03              .14               .14
+ 1-view avatar               .22              .27               .09
+ SignBank frontal            .10              .28               .26
+ 3-view avatar               .19              .43               .38
+ 3-view avatar + SignBank    .32              .48               .43

Accuracies averaged over 10 runs. Novel signer (ID 3) held out from training entirely.

Q4 — Temporal-PONITA vs. SL-GCN: The Case for Geometric Models

Temporal-PONITA consistently outperforms SL-GCN across all view combinations, with absolute gains of +8% to +22%. The equivariant model also converges significantly faster despite higher per-epoch compute cost.

Training views   Test view   SL-GCN Top-1   PONITA Top-1   Gain
Left + Front     Left        .25            .43            +.18
Left + Right     Left        .27            .48            +.21
Left + Front     Front       .35            .55            +.20
Front + Right    Front       .42            .57            +.15
Left + Right     Right       .28            .50            +.22
Front + Right    Right       .39            .49            +.10
All 3 views      Left        .46            .54            +.08
All 3 views      Front       .49            .59            +.10
All 3 views      Right       .47            .55            +.08

Citation

@inproceedings{ranum2024the,
  title     = {The {NGT}200 Dataset - Geometric Multi-View Isolated Sign Recognition},
  author    = {Ranum, Oline and Wessels, David and Otterspeer, Gom{\`e}r and
               Bekkers, Erik J. and Roelofsen, Floris and Andersen, Jari I.},
  booktitle = {ICML 2024 Workshop on Geometry-grounded Representation Learning
               and Generative Modeling},
  year      = {2024},
  url       = {https://openreview.net/forum?id=idkNzTC67X}
}