GRaM @ ICML 2024

The NGT200 Dataset:
Geometric Multi-View Isolated Sign Recognition

Oline Ranum  ·  David Wessels  ·  Gomèr Otterspeer  ·  Erik J. Bekkers  ·  Floris Roelofsen  ·  Jari I. Andersen

SignLab Amsterdam & AMLab, University of Amsterdam

Sign languages are three-dimensional, yet almost all recognition research operates on frontal-view 2D video. We show that a model trained on a single frontal view loses over 50% relative accuracy when tested on a side view — and that geometric, viewpoint-aware learning substantially closes this gap.

Abstract

Sign Language Processing (SLP) provides a foundation for a more inclusive future in language technology; however, the field faces several significant challenges that must be addressed to achieve practical, real-world applications. This work addresses multi-view isolated sign recognition (MV-ISR), and highlights the essential role of 3D awareness and geometry in SLP systems. We introduce the NGT200 dataset, a novel spatio-temporal multi-view benchmark, establishing MV-ISR as distinct from single-view ISR (SV-ISR). We demonstrate the benefits of synthetic data and propose conditioning sign representations on spatial symmetries inherent in sign language. Leveraging an SE(2)-equivariant model improves MV-ISR performance by 8–22% over the baseline.

The NGT200 Dataset

NGT200 contains pose and video data for 200 common NGT signs recorded from three calibrated viewpoints (−25°, 0°, +25°) by 3 Deaf signers and a retargeted synthetic avatar, yielding a multi-view benchmark designed explicitly for viewpoint-invariant sign recognition research.

SignCollect recording setup
Recording setup. Each camera is positioned 4 metres from the signer, with 25° separation between the three viewpoints. The setup uses the SignCollect platform.
Spatio-temporal point clouds
Spatio-temporal point clouds extracted with MediaPipe Holistic from the front and right viewpoints. White landmarks show a single frame; blue landmarks encode temporal dynamics across multiple frames.
Property     Value
Signs        200 NGT glosses
Viewpoints   3 (left −25°, front 0°, right +25°)
Signers      3 Deaf human signers + 1 synthetic avatar
Landmarks    75 per frame (MediaPipe Holistic)
Modalities   Spatio-temporal pose, video
License      CC BY 4.0
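The spatio-temporal point clouds above can be sketched in a few lines. This is a minimal NumPy illustration, not the dataset's actual pipeline: it stacks per-frame landmarks (75 per frame, consistent with MediaPipe Holistic's 33 pose + 21 landmarks per hand) into one point cloud where each point carries a time coordinate. The function name and time normalisation are assumptions for illustration.

```python
import numpy as np

N_LANDMARKS = 75  # 33 pose + 21 left-hand + 21 right-hand landmarks per frame

def to_spatiotemporal_cloud(frames: np.ndarray) -> np.ndarray:
    """frames: (T, 75, 3) per-frame (x, y, z) landmarks.
    Returns a (T * 75, 4) point cloud; column 3 is time, normalised to [0, 1]."""
    T = frames.shape[0]
    t = np.repeat(np.linspace(0.0, 1.0, T), N_LANDMARKS)  # one time value per point
    xyz = frames.reshape(T * N_LANDMARKS, 3)
    return np.column_stack([xyz, t])

# Example: a 40-frame clip with random landmark coordinates
clip = np.random.rand(40, N_LANDMARKS, 3)
cloud = to_spatiotemporal_cloud(clip)
print(cloud.shape)  # (3000, 4)
```

In the paper's figures, the time channel is what distinguishes the blue (multi-frame) landmarks from the white single-frame ones.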

Method

Pose Graph Construction

Pose landmarks are downsampled to a 27-node graph (10 nodes per hand, 7 for overall body position) with spatial edges approximating the human skeletal structure. The graph is used as input to both the SL-GCN baseline and the proposed Temporal-PONITA model.

Reduced spatial graph
Reduced spatial graph. 27-node skeleton with approximate bone-structure edges, used as input to both model architectures.
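The 27-node reduction (7 body nodes, 10 per hand) amounts to index selection on the 75 Holistic landmarks plus a fixed edge list. The sketch below shows the mechanics only; the specific landmark indices and edges are illustrative placeholders, not the paper's exact selection.

```python
import numpy as np

# Illustrative index sets (NOT the paper's exact choice):
# 7 body nodes, e.g. nose, shoulders, elbows, wrists.
BODY = [0, 11, 12, 13, 14, 15, 16]
# 10 nodes per hand, e.g. wrist plus selected finger joints.
HAND = [0, 4, 5, 8, 9, 12, 13, 16, 17, 20]
LEFT_HAND = [33 + i for i in HAND]    # left-hand landmarks start after 33 pose nodes
RIGHT_HAND = [54 + i for i in HAND]   # right-hand landmarks follow the left hand
KEEP = BODY + LEFT_HAND + RIGHT_HAND  # 27 node indices in total

def reduce_graph(frames: np.ndarray) -> np.ndarray:
    """frames: (T, 75, 3) full landmark array -> (T, 27, 3) reduced pose graph."""
    return frames[:, KEEP, :]

# Spatial edges approximating the skeleton, as (index, index) pairs into KEEP;
# shown only partially, as an illustration.
EDGES = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]

clip = np.random.rand(40, 75, 3)
reduced = reduce_graph(clip)
print(reduced.shape)  # (40, 27, 3)
```

The same reduced graph feeds both SL-GCN and Temporal-PONITA, so the comparison in Q4 is over architectures, not input representations.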

Temporal-PONITA: SE(2)-Equivariant Sign Recognition

We propose Temporal-PONITA, an extension of the PONITA architecture augmented with temporal convolution modules. By conditioning representations on SE(2) spatial symmetries (rotations and translations in the image plane), the model reduces the learning burden imposed by viewpoint variation and inter-signer differences.

Temporal-PONITA architecture
Temporal-PONITA architecture. Input features are embedded with a linear layer, then passed through L Temporal-PONITA layers, each containing one ConvNeXt block and one temporal block (two convolutional layers with GeLU activations).
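The temporal block described in the caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: it uses depthwise (per-channel) 1D convolutions over the time axis with "same" padding, each followed by GeLU, whereas the real model's convolutions live inside the PONITA framework with learned parameters.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GeLU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def temporal_conv(x, w, b):
    """Depthwise 1D convolution over time with 'same' padding.
    x: (T, C) features, w: (k, C) per-channel kernels, b: (C,) bias."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        out[t] = np.sum(xp[t:t + k] * w, axis=0) + b
    return out

def temporal_block(x, params):
    """Two temporal convolutions, each followed by a GeLU activation."""
    (w1, b1), (w2, b2) = params
    return gelu(temporal_conv(gelu(temporal_conv(x, w1, b1)), w2, b2))

rng = np.random.default_rng(0)
T, C, k = 16, 8, 3  # frames, channels, kernel size (illustrative)
params = [(rng.normal(size=(k, C)) * 0.1, np.zeros(C)) for _ in range(2)]
y = temporal_block(rng.normal(size=(T, C)), params)
print(y.shape)  # (16, 8)
```

The SE(2) equivariance itself comes from the spatial (ConvNeXt/PONITA) part of each layer; the temporal block only mixes information along the frame axis.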

Results

Our experiments are structured around four research questions, each probing a different aspect of multi-view isolated sign recognition. Together they establish that MV-ISR is a distinct task from single-view ISR, that geometry-aware models provide a strong path forward, and that synthetic data can meaningfully close the data gap.

Q1
Does viewing angle matter in pose-based ISR?
Yes — a model trained on one view drops to near-chance on another view.
Q2
How does including more views during training affect performance?
Consistently better. All-3-view training reaches .46/.49/.47 Top-1 across views.
Q3
Can synthetic data boost MV-ISR performance for novel signers?
Yes — combining 3-view avatar data with SignBank raises novel-signer accuracy to .32/.48/.43.
Q4
Is a geometrically grounded model viable for ISR?
Yes — SE(2)-equivariant Temporal-PONITA gains +8% to +22% over SL-GCN and trains 40% faster.

Q1 & Q2 — Viewing Angle and Multi-View Training (SL-GCN)

A model trained on only one viewpoint achieves near-chance performance when tested on a different view. Adding more frontal data from SignBank marginally improves frontal accuracy but does not recover side-view performance, confirming that MV-ISR is a genuinely distinct task from single-view ISR.

Training view      Test Left   Test Front   Test Right
Left only          .05         .03          .01
Right only         .02         .03          .06
Front only         .03         .09          .03
Front + SignBank   .06         .20          .05

Top-1 accuracy of single-view trained models, all signers (IDs 1, 2, 3 + avatar A). In each row, the best test view is the training view.

Training on combinations of views substantially improves generalisation. Using all three views together yields the best results across every test condition.

Training views   Left Top-1 / Top-3   Front Top-1 / Top-3   Right Top-1 / Top-3
Left + Front     .25 / .51            .35 / .59             —
Left + Right     .27 / .47            —                     .28 / .51
Front + Right    —                    .42 / .67             .39 / .62
All 3 views      .46 / .69            .49 / .74             .47 / .72

Q3 — Effect of Synthetic Data (SL-GCN, novel signer)

When training on signers 1 & 2 and testing on unseen signer 3, synthetic avatar data provides substantial gains. Combining 3-view avatar data with SignBank frontal views achieves the best novel-signer performance.

Training data                 Test Left (S3)   Test Front (S3)   Test Right (S3)
3-view, signers 1+2 only      .03              .14               .14
+ 1-view avatar               .22              .27               .09
+ SignBank frontal            .10              .28               .26
+ 3-view avatar               .19              .43               .38
+ 3-view avatar + SignBank    .32              .48               .43

Accuracies averaged over 10 runs. Novel signer (ID 3) held out from training entirely.

Q4 — Temporal-PONITA vs. SL-GCN: The Case for Geometric Models

Temporal-PONITA consistently outperforms SL-GCN across all view combinations, with absolute gains of +8% to +22%. The equivariant model also converges significantly faster despite higher per-epoch compute cost.

Training views   Test view   SL-GCN Top-1   PONITA Top-1   Gain
Left + Front     Left        .25            .43            +.18
Left + Right     Left        .27            .48            +.21
Left + Front     Front       .35            .55            +.20
Front + Right    Front       .42            .57            +.15
Left + Right     Right       .28            .50            +.22
Front + Right    Right       .39            .49            +.10
All 3 views      Left        .46            .54            +.08
All 3 views      Front       .49            .59            +.10
All 3 views      Right       .47            .55            +.08

Citation

@inproceedings{ranum2024the,
  title     = {The {NGT}200 Dataset - Geometric Multi-View Isolated Sign Recognition},
  author    = {Ranum, Oline and Wessels, David and Otterspeer, Gom{\`e}r and
               Bekkers, Erik J. and Roelofsen, Floris and Andersen, Jari I.},
  booktitle = {ICML 2024 Workshop on Geometry-grounded Representation Learning
               and Generative Modeling},
  year      = {2024},
  url       = {https://openreview.net/forum?id=idkNzTC67X}
}