SignLab Amsterdam & AMLab, University of Amsterdam
Sign Language Processing (SLP) provides a foundation for a more inclusive future in language technology; however, the field faces several significant challenges that must be addressed before practical, real-world applications are possible. This work addresses multi-view isolated sign recognition (MV-ISR) and highlights the essential role of 3D awareness and geometry in SLP systems. We introduce the NGT200 dataset, a novel spatio-temporal multi-view benchmark, establishing MV-ISR as distinct from single-view ISR (SV-ISR). We demonstrate the benefits of synthetic data and propose conditioning sign representations on the spatial symmetries inherent in sign language. Leveraging an SE(2)-equivariant model improves MV-ISR performance by 8–22% over the baseline.
NGT200 contains pose and video data for 200 common NGT signs recorded from three calibrated viewpoints (−25°, 0°, +25°) by 3 Deaf signers and a retargeted synthetic avatar, yielding a multi-view benchmark designed explicitly for viewpoint-invariant sign recognition research.
| Property | Value |
|---|---|
| Signs | 200 NGT glosses |
| Viewpoints | 3 (left −25°, front 0°, right +25°) |
| Signers | 3 Deaf human signers + 1 synthetic avatar |
| Landmarks | 75 per frame (Holistic MediaPipe) |
| Modalities | Spatio-temporal pose, video |
| License | CC BY 4.0 |
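To make the dataset layout concrete, the sketch below builds a synthetic stand-in for one multi-view pose recording: per viewpoint, an array of `T` frames of 75 MediaPipe Holistic landmarks with 3 channels each. The dictionary layout and channel convention are illustrative assumptions, not the dataset's official file format.

```python
import numpy as np

# Assumed layout of one NGT200 pose sample: for each of the three
# calibrated viewpoints, an array of shape (T, 75, 3) holding T frames
# of 75 Holistic landmarks with 3 coordinate channels per landmark.
VIEWPOINTS = ("left", "front", "right")  # -25 deg, 0 deg, +25 deg

def make_dummy_sample(num_frames: int = 64, seed: int = 0) -> dict:
    """Create a synthetic stand-in for one multi-view pose recording."""
    rng = np.random.default_rng(seed)
    return {view: rng.standard_normal((num_frames, 75, 3))
            for view in VIEWPOINTS}

sample = make_dummy_sample()
assert sample["front"].shape == (64, 75, 3)
```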
Pose landmarks are downsampled to a 27-node graph (10 nodes per hand, 7 for overall body position) with spatial edges approximating the human skeletal structure. The graph is used as input to both the SL-GCN baseline and the proposed Temporal-PONITA model.
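The 75-to-27 downsampling amounts to selecting a fixed subset of landmark indices per frame. The concrete index lists below are placeholders chosen only to illustrate the 7 + 10 + 10 split, not the dataset's official mapping.

```python
import numpy as np

# Illustrative downsampling of 75 Holistic landmarks to the 27-node
# graph (7 body nodes + 10 per hand). These index lists are assumed
# placeholders, NOT the official NGT200 landmark mapping.
BODY_IDX = list(range(7))          # e.g. nose, shoulders, elbows, wrists
LHAND_IDX = list(range(33, 43))    # 10 left-hand landmarks (assumed)
RHAND_IDX = list(range(54, 64))    # 10 right-hand landmarks (assumed)
KEEP = BODY_IDX + LHAND_IDX + RHAND_IDX  # 27 node indices into the 75

def downsample_pose(pose: np.ndarray) -> np.ndarray:
    """(T, 75, C) full pose -> (T, 27, C) graph-node features."""
    return pose[:, KEEP, :]

assert len(KEEP) == 27
assert downsample_pose(np.zeros((16, 75, 3))).shape == (16, 27, 3)
```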
We propose Temporal-PONITA, an extension of the PONITA architecture augmented with temporal convolution modules. By conditioning representations on SE(2) spatial symmetries (rotations and translations in the image plane), the model reduces the learning burden imposed by viewpoint variation and inter-signer differences.
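The core idea behind SE(2) conditioning can be sketched with the standard invariant pairwise attributes used by PONITA-style models: the displacement between two nodes expressed in the first node's local frame, plus their relative orientation. Rotating and translating all inputs leaves these attributes unchanged, which is what removes viewpoint-induced in-plane variation from the learning problem. This is a minimal numpy sketch of that construction, not the paper's implementation.

```python
import numpy as np

# SE(2)-invariant pairwise attributes: displacement x_j - x_i rotated
# into node i's local frame, plus the relative orientation. A global
# rotation + translation of all inputs leaves the output unchanged.
def se2_invariants(pos: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """pos: (N, 2) node positions; theta: (N,) node orientations.
    Returns (N, N, 3): in-frame displacement (2) + relative angle (1)."""
    disp = pos[None, :, :] - pos[:, None, :]      # (N, N, 2): x_j - x_i
    c, s = np.cos(theta), np.sin(theta)
    # Rotate row i's displacements by -theta_i (into i's frame).
    rx = c[:, None] * disp[..., 0] + s[:, None] * disp[..., 1]
    ry = -s[:, None] * disp[..., 0] + c[:, None] * disp[..., 1]
    rel_angle = theta[None, :] - theta[:, None]   # (N, N)
    return np.stack([rx, ry, rel_angle], axis=-1)

# Invariance check: a global rotation + translation changes nothing.
rng = np.random.default_rng(0)
pos, theta = rng.standard_normal((5, 2)), rng.standard_normal(5)
phi = 0.7
R = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
inv1 = se2_invariants(pos, theta)
inv2 = se2_invariants(pos @ R.T + np.array([1.0, -2.0]), theta + phi)
assert np.allclose(inv1, inv2, atol=1e-8)
```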
Our experiments are structured around four research questions, each probing a different aspect of multi-view isolated sign recognition. Together they establish that MV-ISR is a distinct task from single-view ISR, that geometry-aware models provide a strong path forward, and that synthetic data can meaningfully close the data gap.
A model trained on only one viewpoint achieves near-chance performance when tested on a different view. Adding more frontal data from SignBank (Sb) marginally improves frontal accuracy but does not recover side-view performance, confirming that MV-ISR is a genuinely distinct task from single-view ISR.
| Training view | Test Left | Test Front | Test Right |
|---|---|---|---|
| Left only | .05 | .03 | .01 |
| Right only | .02 | .03 | .06 |
| Front only | .03 | .09 | .03 |
| Front + SignBank | .06 | .20 | .05 |
Top-1 accuracy for models trained on a single view, evaluated on each test view across all signers (human IDs 1–3 plus avatar A).
Training on combinations of views substantially improves generalisation. Using all three views together yields the best results across every test condition.
| Training views | Test Left | Top-3 Left | Test Front | Top-3 Front | Test Right | Top-3 Right |
|---|---|---|---|---|---|---|
| Left + Front | .25 | .51 | .35 | .59 | — | — |
| Left + Right | .27 | .47 | — | — | .28 | .51 |
| Front + Right | — | — | .42 | .67 | .39 | .62 |
| All 3 views | .46 | .69 | .49 | .74 | .47 | .72 |
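The Top-1 and Top-3 numbers above follow the standard top-k accuracy metric: the fraction of test clips whose true gloss appears among the model's k highest-scoring classes. A generic sketch (not the paper's evaluation code):

```python
import numpy as np

def topk_accuracy(logits: np.ndarray, labels: np.ndarray, k: int) -> float:
    """logits: (N, num_classes); labels: (N,). Returns the fraction of
    samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(-logits, axis=1)[:, :k]          # (N, k) class ids
    hits = (topk == labels[:, None]).any(axis=1)       # (N,) booleans
    return float(hits.mean())

logits = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 2, 2])
assert topk_accuracy(logits, labels, 1) == 2 / 3   # samples 0 and 2 hit
assert topk_accuracy(logits, labels, 3) == 1.0     # every label in top 3
```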
When training on signers 1 & 2 and testing on unseen signer 3, synthetic avatar data provides substantial gains. Combining 3-view avatar data with SignBank frontal views achieves the best novel-signer performance.
| Training data | Test Left (S3) | Test Front (S3) | Test Right (S3) |
|---|---|---|---|
| 3-view, signers 1+2 only | .03 | .14 | .14 |
| + 1-view avatar | .22 | .27 | .09 |
| + SignBank frontal | .10 | .28 | .26 |
| + 3-view avatar | .19 | .43 | .38 |
| + 3-view avatar + SignBank | .32 | .48 | .43 |
Accuracies averaged over 10 runs. Novel signer (ID 3) held out from training entirely.
Temporal-PONITA consistently outperforms SL-GCN across all view combinations, with absolute Top-1 gains of 8 to 22 percentage points. The equivariant model also converges significantly faster despite a higher per-epoch compute cost.
| Training views | Test view | SL-GCN Top-1 | PONITA Top-1 | Gain |
|---|---|---|---|---|
| Left + Front | Left | .25 | .43 | +.18 |
| Left + Right | Left | .27 | .48 | +.21 |
| Left + Front | Front | .35 | .55 | +.20 |
| Front + Right | Front | .42 | .57 | +.15 |
| Left + Right | Right | .28 | .50 | +.22 |
| Front + Right | Right | .39 | .49 | +.10 |
| All 3 views | Left | .46 | .54 | +.08 |
| All 3 views | Front | .49 | .59 | +.10 |
| All 3 views | Right | .47 | .55 | +.08 |
@inproceedings{ranum2024the,
title = {The {NGT}200 Dataset - Geometric Multi-View Isolated Sign Recognition},
author = {Ranum, Oline and Wessels, David and Otterspeer, Gom{\`e}r and
Bekkers, Erik J. and Roelofsen, Floris and Andersen, Jari I.},
booktitle = {ICML 2024 Workshop on Geometry-grounded Representation Learning
and Generative Modeling},
year = {2024},
url = {https://openreview.net/forum?id=idkNzTC67X}
}