What's the Point? Sign Language Processing

In spontaneous signed discourse, approximately 40% of signs are non-lexical, consisting of productive constructions that exploit three-dimensional space, iconicity, and context. Among these, spatial indexing (pointing) is particularly frequent: signers assign discourse entities to spatial loci and re-reference them through pointing signs that carry pronominal, locative, or determiner-like meaning. Despite their prevalence, current Sign Language Processing benchmarks largely collapse indexing signs into coarse gloss tokens or omit their referential structure entirely. We show that a state-of-the-art CSLR model recovers only 6.1% of indexing tokens despite competitive overall WER, and propose a modular framework combining a pose-based index detector with an online entity memory that substantially closes this gap without retraining the recognition backbone.

Abstract

Sign Language Processing (SLP) approaches predominantly represent signing as sequences of glosses or text, thereby under-modeling non-lexical and productive constructions. A prominent example is indexing: pointing gestures that assign discourse entities to spatial loci for subsequent co-reference, which lexicon-centric objectives largely fail to capture. We quantify this limitation on BOBSL, showing that a CSLR model achieving state-of-the-art Word Error Rate (WER) fails to reliably recover indexing. To address this gap, we propose a two-stage framework for spatial reference resolution, decomposing the task into (i) index detection and (ii) discourse entity linking. Our method combines a pose-based mention detection network with an online memory module trained using cluster supervision inferred by a large language model (LLM). The resulting mention representations support downstream applications such as automatic annotation and modeling of non-lexical structure in SLP pipelines. We further demonstrate their utility as an auxiliary indexing expert that augments a frozen CSLR model at inference time, improving index recovery while reducing indexing, lexical, and overall WER.

Pipeline Overview

Our method decomposes spatial reference resolution into two stages. An Index Proposal Network (IPN) first detects whether a gloss-aligned pose segment constitutes an indexing sign. Detected mentions are then passed to an Entity Linking Module (ELM), an online differentiable memory that incrementally clusters index mentions into discourse entities across the document. At inference time, both outputs are injected as additive logit biases into a frozen CSLR backbone.
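The inference-time injection described above can be illustrated with a minimal sketch. The exact form of the bias is not given here, so everything below is an assumption made for illustration: the function name `apply_index_bias`, the hard gate on the IPN probability, and the per-token `entity_scores` vector standing in for the ELM output are all hypothetical. The default weights and the 0.90 threshold are the values quoted elsewhere on this page.

```python
import numpy as np

def apply_index_bias(logits, p_index, index_ids, entity_scores=None,
                     w_ipn=8.0, w_elm=10.0, threshold=0.90):
    """Return a biased copy of the frozen CSLR logits for one segment.

    logits:        (V,) scores from the frozen backbone (never modified in place)
    p_index:       IPN probability that the segment is an indexing sign
    index_ids:     vocabulary ids of the pointing sub-vocabulary
    entity_scores: optional (len(index_ids),) ELM preference per pointing token
    All names and the gating rule are illustrative assumptions.
    """
    biased = np.asarray(logits, dtype=float).copy()
    if p_index < threshold:
        return biased  # segment not treated as an index: leave logits untouched
    index_ids = np.asarray(index_ids)
    biased[index_ids] += w_ipn  # detection boost for all pointing tokens
    if entity_scores is not None:
        # entity-linking boost, favouring the token tied to the linked entity
        biased[index_ids] += w_elm * np.asarray(entity_scores, dtype=float)
    return biased

logits = [2.0, 0.5, 0.1]          # toy vocabulary: id 0 lexical, ids 1-2 pointing
boosted = apply_index_bias(logits, p_index=0.95, index_ids=[1, 2],
                           entity_scores=[0.0, 1.0])
print(int(np.argmax(logits)), int(np.argmax(boosted)))
```

Because the bias is purely additive and applied to a copy, the backbone's weights and raw outputs stay untouched. Setting `w_elm=0` in this sketch corresponds to the IPN-only configuration in the Phase-3 table.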

Each segment is represented by 3D skeletal pose extracted from RGB video: 8 upper-body joints from SMPL-X and 21 joints per hand from WiLoR (42 hand joints total), normalised for scale and viewpoint to support cross-dataset generalisation. These features are processed by an SL-GCN to produce a segment-level embedding.
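The normalisation procedure is not spelled out above, so the following is only one plausible sketch of scale and viewpoint normalisation for a single frame of 50 joints (8 body + 42 hand). It assumes, purely for illustration, that joints 0 and 1 are the left and right shoulders, that the y-axis is vertical, and that the pose is centred on the mid-shoulder point, rotated about the vertical axis so the shoulder line lies along x, and divided by shoulder width.

```python
import numpy as np

def normalise_pose(joints, l_shoulder=0, r_shoulder=1, eps=1e-8):
    """Normalise one (J, 3) pose frame for translation, yaw, and scale.

    The joint indices and the choice of shoulder-based normalisation are
    assumptions for this sketch, not details taken from the method.
    """
    joints = np.asarray(joints, dtype=float)
    ls, rs = joints[l_shoulder], joints[r_shoulder]
    centred = joints - (ls + rs) / 2.0           # origin at mid-shoulder

    d = rs - ls                                   # shoulder line
    yaw = np.arctan2(d[2], d[0])                  # its angle in the x-z plane
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotation about the vertical (y) axis that maps the shoulder line's
    # horizontal component onto the +x axis.
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    rotated = centred @ rot.T

    width = np.linalg.norm(rotated[r_shoulder] - rotated[l_shoulder])
    return rotated / (width + eps)                # unit shoulder width

rng = np.random.default_rng(0)
frame = rng.normal(size=(50, 3))                  # 8 body + 42 hand joints
print(normalise_pose(frame).shape)
```

Under these assumptions, two recordings of the same pose that differ only by camera yaw, distance, or position map to the same normalised coordinates, which is the property that cross-dataset transfer relies on.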

Pipeline architecture — **Fig. 1.** Overview of the proposed pipeline. Gloss-aligned pose segments are first processed by the IPN. Detected index mentions are passed to the online ELM, which maintains a bounded entity memory. The resulting signals are injected as inference-time biases into a frozen CSLR model.

Results

Baseline: Index Recovery in CSLR

A strong retrieval-based CSLR model (CSLR2) achieves competitive overall WER but recovers indexing tokens poorly. After reinserting excluded non-lexical segments (GIS) into evaluation, WER_Index reaches 98.4% and Index IoU drops to 6.1%, motivating an explicit indexing expert.

System	Index IoU (%)	WER_All (%)	WER_Index (%)	WER_Lex (%)
BL_Lexical	15.8	69.5	91.3	77.5
BL_IndexRestored	6.1	70.5	98.4	77.5

WER_Lex excludes pointing-token vocabulary; WER_Index is computed over the pointing sub-vocabulary only.
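To make the three WER variants concrete, here is a small sketch. The exact evaluation protocol is not described here, so filtering both reference and hypothesis to the relevant sub-vocabulary before computing edit distance is an assumption, and the gloss names in `POINTING` are hypothetical placeholders.

```python
def edit_distance(ref, hyp):
    """Token-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]

def wer(refs, hyps, keep=None):
    """Corpus-level WER in percent; `keep` optionally filters tokens first."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        if keep is not None:
            ref = [t for t in ref if keep(t)]
            hyp = [t for t in hyp if keep(t)]
        errors += edit_distance(ref, hyp)
        total += len(ref)
    return 100.0 * errors / total if total else float("nan")

POINTING = {"PRO1SG", "PRO3SG", "THERE"}   # hypothetical pointing sub-vocabulary

refs = [["PRO1SG", "GO", "SHOP", "PRO3SG"]]
hyps = [["GO", "SHOP", "PRO3SG"]]          # the first pointing sign is missed

wer_all = wer(refs, hyps)
wer_index = wer(refs, hyps, keep=lambda t: t in POINTING)
wer_lex = wer(refs, hyps, keep=lambda t: t not in POINTING)
print(wer_all, wer_index, wer_lex)
```

In this toy case one missed pointing sign costs 25% overall but 50% on the pointing sub-vocabulary, while the lexical score is unaffected. The same effect explains how a system can post a competitive WER_All while WER_Index stays very high.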

Phase-1: Index Detection

The IPN generalises across sign languages (BSL ↔ DGS). Joint training on both corpora (BM) matches or exceeds the single-corpus models in balanced accuracy on every evaluation set, including BOBSL, which is never used for IPN training. A precision-calibrated variant (BM-cw4, threshold 0.90) is used for downstream CSLR integration to suppress false-positive bleed.

Train	Eval	BalAcc	Macro F1	Prec-I	Rec-I
B	B	0.85	0.85	0.85	0.85
B	M	0.76	0.76	0.72	0.86
B	BOB	0.77	0.68	0.35	0.66
M	B	0.75	0.74	0.87	0.58
M	M	0.87	0.87	0.87	0.88
M	BOB	0.72	0.70	0.41	0.52
BM-cw4	B	0.82	0.81	0.91	0.70
BM-cw4	M	0.85	0.85	0.90	0.79
BM-cw4	BOB	0.73	0.72	0.47	0.52
BM	B	0.85	0.85	0.88	0.82
BM	M	0.87	0.87	0.86	0.88
BM	BOB	0.78	0.69	0.37	0.68

B = BSLCP, M = MDGS, BOB = BOBSL. Mean over 3 seeds, threshold = 0.5.
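The four columns can be computed from binary predictions as sketched below. This assumes that Prec-I and Rec-I denote precision and recall on the index class (label 1) and that Macro F1 averages the F1 of the index and lexical classes; the function name is hypothetical.

```python
def index_detection_metrics(y_true, y_pred):
    """Binary detection metrics with label 1 = index sign, 0 = lexical sign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)

    def safe_div(a, b):
        return a / b if b else 0.0

    prec_i, rec_i = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
    prec_n, rec_n = safe_div(tn, tn + fn), safe_div(tn, tn + fp)
    f1_i = safe_div(2 * prec_i * rec_i, prec_i + rec_i)
    f1_n = safe_div(2 * prec_n * rec_n, prec_n + rec_n)
    return {
        "bal_acc": (rec_i + rec_n) / 2,    # mean of per-class recall
        "macro_f1": (f1_i + f1_n) / 2,     # mean of per-class F1
        "prec_i": prec_i,
        "rec_i": rec_i,
    }

# Toy example: 3 index signs and 5 lexical signs.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = index_detection_metrics(y_true, y_pred)
print({k: round(v, 3) for k, v in metrics.items()})
```

Balanced accuracy averages per-class recall and does not depend on precision, so a row can combine a reasonable BalAcc with a low Prec-I, as the BOB rows do.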

Phase-2: Entity Linking

Joint training on BSLCP+MDGS yields slightly higher Entity Cluster Accuracy (0.65 vs. 0.62) but comparable F1. However, BSLCP-only ELM training achieves lower downstream WER on BOBSL, so it is used for CSLR integration.

IPN Model	ELM Source	ECA (B)	F1 (B)	WER_All (BOB)	WER_Lex (BOB)	WER_Index (BOB)
BM-cw4	B	0.62 ±0.02	0.44 ±0.01	70.1	78.4	64.3
BM-cw4	B+M	0.65	0.43 ±0.01	70.2	78.4	64.7

ECA = entity cluster accuracy on BSLCP test set. WER reported at w_IPN=10, w_ELM=60.
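For intuition about what incremental linking with a bounded memory involves, the sketch below uses a deliberately simplified stand-in: a greedy, non-differentiable, cosine-similarity clusterer. It is not the ELM, which is a trained differentiable memory. The class name, the similarity threshold, the capacity, and the use of raw 2D vectors as mention embeddings are all assumptions for illustration.

```python
import numpy as np

class OnlineEntityMemory:
    """Greedy online clustering of mention embeddings with bounded capacity."""

    def __init__(self, max_entities=8, threshold=0.7):
        self.max_entities = max_entities   # upper bound on stored entities
        self.threshold = threshold         # min cosine similarity to reuse a slot
        self.centroids = []                # running mean embedding per entity
        self.counts = []                   # number of mentions per entity

    def link(self, mention):
        """Assign one mention to an entity id, creating a new entity if needed."""
        v = np.asarray(mention, dtype=float)
        v = v / (np.linalg.norm(v) + 1e-8)
        if self.centroids:
            sims = [float(c @ v) / (np.linalg.norm(c) + 1e-8)
                    for c in self.centroids]
            best = int(np.argmax(sims))
            memory_full = len(self.centroids) >= self.max_entities
            if sims[best] >= self.threshold or memory_full:
                n = self.counts[best]
                # update the running mean of the matched entity
                self.centroids[best] = (n * self.centroids[best] + v) / (n + 1)
                self.counts[best] = n + 1
                return best
        self.centroids.append(v)
        self.counts.append(1)
        return len(self.centroids) - 1

memory = OnlineEntityMemory()
mentions = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0], [0.1, 0.99]]
entity_ids = [memory.link(m) for m in mentions]
print(entity_ids)
```

Mentions are processed strictly in document order and each decision uses only what has been seen so far, which is what makes the procedure online. When the memory is full, this sketch falls back to the nearest existing entity rather than evicting one; that is one of several possible policies.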

Phase-3: Index-Supported CSLR (BOBSL)

The grid search sweeps detection boost weight (w_IPN) and entity linking boost weight (w_ELM). Increasing w_IPN is the primary driver of WER_Index reduction; w_ELM provides consistent additional gains. However, high w_IPN causes lexical bleed, creating a trade-off that produces a visible valley in WER_All around w_IPN ≈ 8. The selected operating point w_IPN=8, w_ELM=10 reduces WER_Index from 96.2 to 59.6 and WER_All from 70.5 to 68.3, while leaving WER_Lex stable.

Grid search heatmap — **Fig. 3.** Grid sweep over w_IPN and w_ELM. The three panels report WER_All, WER_Index, and WER_Lex on BOBSL. The selected operating point (w_IPN=8, w_ELM=10) lies at the valley of the trade-off surface.

Config	WER_All (%)	WER_Index (%)	WER_Lex (%)
Baseline (CSLR2 only)	70.5	96.2	77.5
w_IPN=8, w_ELM=0 (IPN only)	69.5	71.4	77.4
w_IPN=8, w_ELM=10 (selected)	68.3	59.6	77.1
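The size of each effect is easier to compare as absolute and relative reductions against the baseline. The short script below recomputes them from the table values; the dictionary keys are just labels chosen for this sketch.

```python
# WER values (%) copied from the table above.
results = {
    "baseline": {"all": 70.5, "index": 96.2, "lex": 77.5},
    "ipn_only": {"all": 69.5, "index": 71.4, "lex": 77.4},
    "selected": {"all": 68.3, "index": 59.6, "lex": 77.1},
}

def reduction(before, after):
    """Return (absolute points, relative percent) reduction, rounded to 1 decimal."""
    absolute = before - after
    return round(absolute, 1), round(100.0 * absolute / before, 1)

for metric in ("all", "index", "lex"):
    base = results["baseline"][metric]
    for config in ("ipn_only", "selected"):
        abs_pts, rel_pct = reduction(base, results[config][metric])
        print(f"WER_{metric:<5} {config:<8} -{abs_pts} pts ({rel_pct}% relative)")
```

For the selected operating point this gives a 36.6-point (38.0% relative) reduction in WER_Index, of which 24.8 points come from the IPN boost alone and a further 11.8 points from adding the ELM boost. WER_All drops by 2.2 points and WER_Lex by 0.4 points.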

Index-Token Flow Analysis

Each flow traces a ground-truth pointing token to the coarse referential class of the predicted token, grouped into pronoun and deictic categories. Under the baseline, only 6.1% of ground-truth index tokens are recovered. The IPN detection boost alone raises this to 61.5%, and the full system reaches 71.2%, confirming that entity linking provides a consistent gain on top of detection.

The top panel shows all GIS and IPS instances; the bottom panel isolates IPS tokens to highlight entity-specific behaviour.

Sankey flow overview — **Fig. 4a.** Overview flow analysis for all pointing instances (GIS + IPS). Flows terminating in the correct referential category indicate recovered pointing predictions.

Sankey flow detail — **Fig. 4b.** Detailed flow analysis isolating IPS tokens to highlight pronoun vs. deictic category distinctions.

Qualitative Results

Entity Cluster Example

The figure below shows a qualitative example on BOBSL. The system resolves clusters across sentences and captures subtle pointing distinctions, separating a third-person singular entity from a second-person "you" reference. The base CSLR2 picked up only the "you" index sign and missed all third-person references.

Qualitative example: cross-sentence cluster recognition — **Fig. 5.** Samples from episode 6164207930460576679. The system demonstrates both cross-sentence cluster recognition and in-sentence cluster separation between a third-person entity and a second-person "you" reference. The base CSLR2 only recovered the "you" index sign.

BOBSL Entity Cluster Visualisations

The following figures show predicted entity clusters on BOBSL alongside the subtitle text for each segment. Cluster structure is reflected in coherent lexical output and visual similarity across instances assigned to the same entity.

Cluster vis: PRO1SG ep6040895553921856506 — **Fig. 6a.** Instances of PRO1SG (*me/I*) across four examples in episode 6040895553921856506. Three instances are correctly detected and grouped. The second panel (red border) highlights an apparent annotation error: the sign contains both a locative (*there*) and a self-pointing (*me*) component but is labelled only as the former. The system nevertheless predicts PRO1SG consistently.

Cluster vis: PRO1SG ep6003446875091426252 — **Fig. 6b.** Instances of PRO1SG (*me*) across four examples in episode 6003446875091426252. The third panel highlights a missing ground-truth annotation: the video contains a self-pointing sign but is glossed as the preposition *to*. The model groups all four instances consistently and predicts PRO1SG throughout.

Cluster vis: two co-occurring clusters ep6177 — **Fig. 6c.** Two co-occurring entity clusters in episode 6177195911563690266. Cluster 1 (PRO1SG, *me/I*) has five instances, all correctly detected and grouped. A second cluster (PRO3SG, a distinct third-person referent) has two instances, introduced in the same sentence window. The system maintains both entities simultaneously across a substantial temporal gap. The baseline predicts only two of the seven pointing tokens.

Error Analysis: False Positives & False Negatives

Red border = false positive (lexical sign predicted as index); blue border = false negative (pointing sign missed). FPs are largely systematic: mispredicted signs frequently exhibit index-like handshapes. FNs often arise from co-articulation with co-occurring lexical signs.