Under Review

What's the Point?
Spatial Grammar & Index Resolution
for Sign Language Processing

Oline Ranum  ·  Simon Hadfield  ·  Richard Bowden

Centre for Vision, Speech and Signal Processing, University of Surrey, UK

Paper (coming soon) · Code (coming soon) · arXiv (coming soon)

In spontaneous signed discourse, approximately 40% of signs are non-lexical, consisting of productive constructions that exploit three-dimensional space, iconicity, and context. Among these, spatial indexing (pointing) is particularly frequent: signers assign discourse entities to spatial loci and re-reference them through pointing signs that carry pronominal, locative, or determiner-like meaning. Despite their prevalence, current Sign Language Processing benchmarks largely collapse indexing signs into coarse gloss tokens or omit their referential structure entirely. We show that a state-of-the-art CSLR model recovers indexing tokens at only 6.1% accuracy despite competitive overall WER, and propose a modular framework combining a pose-based index detector with an online entity memory that substantially closes this gap without retraining the recognition backbone.

Spatial indexing illustration
Spatial indexing is a grammatical function embedded in the signing space. A signer associates discourse entities with spatial loci and subsequently re-references them through pointing. While visually simple, these pointing signs encode discourse structure through spatial configuration rather than discrete lexical identity, a phenomenon largely under-modeled in gloss-centric SLP pipelines.

Abstract

Sign Language Processing (SLP) approaches predominantly represent signing as sequences of glosses or text, thereby under-modeling non-lexical and productive constructions. A prominent example is indexing: pointing gestures that assign discourse entities to spatial loci for subsequent co-reference, which lexicon-centric objectives largely fail to capture. We quantify this limitation on BOBSL, showing that a CSLR model achieving state-of-the-art Word Error Rate (WER) fails to reliably recover indexing. To address this gap, we propose a two-stage framework for spatial reference resolution, decomposing the task into (i) index detection and (ii) discourse entity linking. Our method combines a pose-based mention detection network with an online memory module trained using cluster supervision inferred by a large language model (LLM). The resulting mention representations support downstream applications such as automatic annotation and modeling of non-lexical structure in SLP pipelines. We further demonstrate their utility as an auxiliary indexing expert that augments a frozen CSLR model at inference time, improving index recovery while reducing indexing, lexical, and overall WER.

Pipeline Overview

Our method decomposes spatial reference resolution into two stages. An Index Proposal Network (IPN) first detects whether a gloss-aligned pose segment constitutes an indexing sign. Detected mentions are then passed to an Entity Linking Module (ELM), an online differentiable memory that incrementally clusters index mentions into discourse entities across the document. At inference time, both outputs are injected as additive logit biases into a frozen CSLR backbone.
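To make the injection concrete, the sketch below (PyTorch) adds the IPN detection score and per-token entity scores from the ELM as additive biases over a frozen CSLR model's output logits. The function and variable names are illustrative assumptions; only the additive-logit-bias design and the weights wIPN and wELM come from the paper.

import torch

def biased_cslr_logits(cslr_logits, ipn_score, elm_scores,
                       index_token_ids, w_ipn=8.0, w_elm=10.0):
    # cslr_logits:     (V,) frozen-backbone logits for one gloss-aligned segment
    # ipn_score:       IPN probability in [0, 1] that the segment is an index
    # elm_scores:      {token_id: score} support from the entity memory
    # index_token_ids: vocabulary ids of the pointing sub-vocabulary
    logits = cslr_logits.clone()
    # Detection boost: raise the whole pointing sub-vocabulary when the IPN fires.
    logits[index_token_ids] += w_ipn * ipn_score
    # Entity-linking boost: push probability toward tokens consistent with memory.
    for token_id, score in elm_scores.items():
        logits[token_id] += w_elm * score
    return logits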

Each segment is represented by 3D skeletal pose extracted from RGB video: 8 upper-body joints from SMPL-X and 21 joints per hand from WiLoR (42 hand joints total), normalised for scale and viewpoint to support cross-dataset generalisation. These features are processed by an SL-GCN to produce a segment-level embedding.
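A minimal sketch of how one frame of this representation might be assembled and normalised is shown below. The reference joints used for centring and scaling are assumptions, since the paper only states that poses are normalised for scale and viewpoint.

import numpy as np

def normalise_frame(body, left_hand, right_hand):
    # body:       (8, 3)  upper-body joints (e.g. from SMPL-X)
    # left_hand:  (21, 3) hand joints (e.g. from WiLoR); right_hand likewise
    joints = np.concatenate([body, left_hand, right_hand], axis=0)  # (50, 3)
    # Centre on an assumed torso reference (mean of the first two body joints,
    # taken here to be the shoulders).
    root = body[:2].mean(axis=0)
    joints = joints - root
    # Scale by shoulder width so signers and viewpoints become comparable.
    scale = np.linalg.norm(body[0] - body[1]) + 1e-8
    return joints / scale

Frames normalised this way are stacked over time and fed to the SL-GCN to produce the segment-level embedding.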

Pipeline architecture
Fig. 1. Overview of the proposed pipeline. Gloss-aligned pose segments are first processed by the IPN. Detected index mentions are passed to the online ELM, which maintains a bounded entity memory. The resulting signals are injected as inference-time biases into a frozen CSLR model.

Results

Baseline: Index Recovery in CSLR

A strong retrieval-based CSLR model (CSLR2) achieves competitive overall WER but recovers indexing tokens poorly. After reinserting excluded non-lexical segments (GIS) into evaluation, WERIndex reaches 98.4% and Index IoU drops to 6.1%, motivating an explicit indexing expert.

System             Index IoU (%)   WERAll (%)   WERIndex (%)   WERLex (%)
BL-Lexical              15.8           69.5          91.3          77.5
BL-IndexRestored         6.1           70.5          98.4          77.5

WERLex excludes pointing-token vocabulary; WERIndex is computed over the pointing sub-vocabulary only.
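The following sketch shows one way such sub-vocabulary WERs can be computed, assuming a simple filter-then-score protocol over gloss sequences; the paper's exact evaluation procedure may differ.

import numpy as np

def wer(ref, hyp):
    # Standard word error rate: edit distance normalised by reference length.
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1, -1] / max(len(ref), 1)

def subvocab_wer(ref, hyp, index_vocab, keep_index=True):
    # WERIndex: keep_index=True restricts both sequences to pointing tokens;
    # WERLex: keep_index=False excludes them instead.
    keep = (lambda t: t in index_vocab) if keep_index else (lambda t: t not in index_vocab)
    return wer([t for t in ref if keep(t)], [t for t in hyp if keep(t)])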

Phase-1: Index Detection

The IPN generalises across sign languages (BSL ↔ DGS), and joint training (B+M) consistently outperforms single-corpus models on held-out sets. A precision-calibrated variant (cw4, threshold 0.90) is used for downstream CSLR integration to suppress false-positive bleed.

Train     Eval   BalAcc   Macro F1   Prec-I   Rec-I
B         B       0.85      0.85      0.85     0.85
          M       0.76      0.76      0.72     0.86
          BOB     0.77      0.68      0.35     0.66
M         B       0.75      0.74      0.87     0.58
          M       0.87      0.87      0.87     0.88
          BOB     0.72      0.70      0.41     0.52
BM-cw4    B       0.82      0.81      0.91     0.70
          M       0.85      0.85      0.90     0.79
          BOB     0.73      0.72      0.47     0.52
BM        B       0.85      0.85      0.88     0.82
          M       0.87      0.87      0.86     0.88
          BOB     0.78      0.69      0.37     0.68

B = BSLCP, M = MDGS, BOB = BOBSL. Mean over 3 seeds, threshold = 0.5.
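As an illustration of how the precision-calibrated operating point can be chosen, the sketch below evaluates index-detection precision and recall at a fixed decision threshold on validation data. It covers only threshold selection, not the training of the cw4 variant itself, and the interface is hypothetical.

import numpy as np

def precision_recall_at(scores, labels, threshold):
    # scores: (N,) IPN index probabilities; labels: (N,) binary ground truth.
    pred = scores >= threshold
    tp = int(np.sum(pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)))
    fn = int(np.sum(~pred & (labels == 1)))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return precision, recall

# Raising the threshold from the default 0.5 to 0.90 trades recall for the
# precision needed to suppress false-positive bleed during CSLR integration:
# precision_recall_at(val_scores, val_labels, 0.5)
# precision_recall_at(val_scores, val_labels, 0.90)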

Phase-2: Entity Linking

Joint training on BSLCP+MDGS yields slightly higher Entity Cluster Accuracy (0.65 vs. 0.62) but comparable F1. However, BSLCP-only ELM training achieves lower downstream WER on BOBSL, so it is used for CSLR integration.

IPN Model   ELM Source   ECA (B)       F1 (B)        WERAll (BOB)   WERLex (BOB)   WERIndex (BOB)
BM-cw4      B            0.62 ± 0.02   0.44 ± 0.01       70.1           78.4           64.3
BM-cw4      B+M          0.65          0.43 ± 0.01       70.2           78.4           64.7

ECA = entity cluster accuracy on BSLCP test set. WER reported at wIPN=10, wELM=60.
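The ELM itself is a differentiable memory trained with LLM-inferred cluster supervision; the simplified, non-differentiable sketch below only illustrates the online assign-or-open behaviour of a bounded entity memory.

import torch
import torch.nn.functional as F

class OnlineEntityMemory:
    def __init__(self, dim, max_entities=16, sim_threshold=0.5, momentum=0.9):
        self.prototypes = torch.zeros(0, dim)   # one row per active entity
        self.max_entities = max_entities
        self.sim_threshold = sim_threshold
        self.momentum = momentum

    def assign(self, mention):
        # mention: (dim,) embedding of a detected index mention.
        if len(self.prototypes) > 0:
            sims = F.cosine_similarity(mention[None], self.prototypes)
            best = int(sims.argmax())
            if sims[best] >= self.sim_threshold or len(self.prototypes) >= self.max_entities:
                # Merge into the best-matching entity (forced when memory is full),
                # updating its prototype with an exponential moving average.
                self.prototypes[best] = (self.momentum * self.prototypes[best]
                                         + (1 - self.momentum) * mention)
                return best
        # Otherwise open a new entity slot.
        self.prototypes = torch.cat([self.prototypes, mention[None]])
        return len(self.prototypes) - 1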

Phase-3: Index-Supported CSLR (BOBSL)

The grid search sweeps detection boost weight (wIPN) and entity linking boost weight (wELM). Increasing wIPN is the primary driver of WERIndex reduction; wELM provides consistent additional gains. However, high wIPN causes lexical bleed, creating a trade-off that produces a visible valley in WERAll around wIPN ≈ 8. The selected operating point wIPN=8, wELM=10 reduces WERIndex from 96.2 to 59.6 and WERAll from 70.5 to 68.3, while leaving WERLex stable.

Grid search heatmap
Fig. 3. Grid sweep over wIPN and wELM. The three panels report WERAll, WERIndex, and WERLex on BOBSL. The selected operating point (wIPN=8, wELM=10) lies at the valley of the trade-off surface.
Config                         WERAll (%)   WERIndex (%)   WERLex (%)
Baseline (CSLR2 only)             70.5          96.2          77.5
wIPN=8, wELM=0 (IPN only)         69.5          71.4          77.4
wIPN=8, wELM=10 (selected)        68.3          59.6          77.1
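The sweep itself is straightforward; the sketch below shows the selection loop with placeholder grids and hypothetical decode/evaluate callables.

import itertools

def sweep(decode_fn, eval_fn,
          w_ipn_grid=(0, 2, 4, 8, 12, 16),   # placeholder grid values
          w_elm_grid=(0, 10, 20, 40, 60)):
    # decode_fn(w_ipn, w_elm) -> hypotheses; eval_fn(hyps) -> {"WERAll": ...}.
    results = {}
    for w_ipn, w_elm in itertools.product(w_ipn_grid, w_elm_grid):
        results[(w_ipn, w_elm)] = eval_fn(decode_fn(w_ipn, w_elm))
    # The operating point sits at the valley of the overall-WER surface.
    return min(results.items(), key=lambda kv: kv[1]["WERAll"])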

Index-Token Flow Analysis

Each flow traces a ground-truth pointing token to the coarse referential class of the predicted token, grouped into pronoun and deictic categories. Under the baseline, only 6.1% of ground-truth index tokens are recovered. The IPN detection boost alone raises this to 61.5%, and the full system reaches 71.2%, confirming that entity linking provides a consistent gain on top of detection.

The top panel shows all GIS and IPS instances; the bottom panel isolates IPS tokens to highlight entity-specific behaviour.

Sankey flow overview
Fig. 4a. Overview flow analysis for all pointing instances (GIS + IPS). Flows terminating in the correct referential category indicate recovered pointing predictions.
Sankey flow detail
Fig. 4b. Detailed flow analysis isolating IPS tokens to highlight pronoun vs. deictic category distinctions.
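The grouping underlying both flow diagrams can be pictured as a simple token-to-class mapping; the entries below are hypothetical examples, and the actual gloss inventory and class assignments follow the paper's annotation scheme.

# Hypothetical mapping from pointing glosses to coarse referential classes.
REFERENTIAL_CLASS = {
    "PRO1SG": "pronoun", "PRO2SG": "pronoun", "PRO3SG": "pronoun",
    "HERE": "deictic", "THERE": "deictic", "THIS": "deictic",
}

def flow_category(token):
    # Tokens outside the pointing inventory fall into a catch-all class.
    return REFERENTIAL_CLASS.get(token, "other")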

Qualitative Results

Entity Cluster Example

The figure below shows a qualitative example on BOBSL. The system resolves clusters across sentences and captures subtle pointing distinctions, separating a third-person singular entity from a second-person "you" reference. The base CSLR2 model recovered only the "you" index sign and missed all third-person references.

Qualitative example: cross-sentence cluster recognition
Fig. 5. Samples from episode 6164207930460576679. The system demonstrates both cross-sentence cluster recognition and in-sentence cluster separation between a third-person entity and a second-person "you" reference. The base CSLR2 only recovered the "you" index sign.

BOBSL Entity Cluster Visualisations

The following figures show predicted entity clusters on BOBSL alongside the subtitle text for each segment. Cluster structure is reflected in coherent lexical output and visual similarity across instances assigned to the same entity.

Cluster vis: PRO1SG ep6040895553921856506
Fig. 6a. Instances of PRO1SG (me/I) across four examples in episode 6040895553921856506. Three instances are correctly detected and grouped. The second panel (red border) highlights an apparent annotation error: the sign contains both a locative (there) and a self-pointing (me) component but is labelled only as the former. The system nevertheless predicts PRO1SG consistently.
Cluster vis: PRO1SG ep6003446875091426252
Fig. 6b. Instances of PRO1SG (me) across four examples in episode 6003446875091426252. The third panel highlights a missing ground-truth annotation: the video contains a self-pointing sign but is glossed as the preposition to. The model groups all four instances consistently and predicts PRO1SG throughout.
Cluster vis: two co-occurring clusters ep6177
Fig. 6c. Two co-occurring entity clusters in episode 6177195911563690266. Cluster 1 (PRO1SG, me/I) has five instances, all correctly detected and grouped; a second PRO3SG cluster (a distinct third-person referent), introduced in the same sentence window, has two instances. The system maintains both entities simultaneously across a substantial temporal gap, while the baseline predicts only two of the eight pointing tokens.

Error Analysis: False Positives & False Negatives

Red border = false positive (lexical sign predicted as index); blue border = false negative (pointing sign missed). FPs are largely systematic: mispredicted signs frequently exhibit index-like handshapes. FNs often arise from co-articulation with co-occurring lexical signs.

FP "road" · FP "same" · FN (general) · FN (general)
FP "back" · FP "because" · FN "here" · FN "I"
FP "become" · FP "bottles" · FN "me" · FN "this"
FP "can" · FP "fantastic" · FN (general) · FN (general)

Citation

@inproceedings{ranum2026whatsthepoint,
  title     = {What's the Point? Spatial Grammar \& Index Resolution
               for Sign Language Processing},
  author    = {Ranum, Oline and Hadfield, Simon and Bowden, Richard},
  booktitle = {Under Review},
  year      = {2026},
}