A multimodal spatio-temporal GCN model with enhancements for isolated sign recognition

Files
24015.pdf (1.11 MB)
Published version
Date
2024-05-01
Authors
Zhou, Yang
Xia, Zhaoyang
Chen, Yuxiao
Neidle, Carol
Metaxas, Dimitris
Version
Published version
Citation
Y. Zhou, Z. Xia, Y. Chen, C. Neidle, D. Metaxas. "A Multimodal Spatio-Temporal GCN Model with Enhancements for Isolated Sign Recognition." Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, pp. 132-143.
Abstract
We propose a multimodal network using skeletons and handshapes as input to recognize individual signs and detect their boundaries in American Sign Language (ASL) videos. Our method integrates a spatio-temporal Graph Convolutional Network (GCN) architecture operating on estimated human skeleton keypoints, and it uses a late-fusion approach for both forward and backward processing of the video streams. Our core method is designed to extract and analyze features from ASL videos in order to enhance the accuracy and efficiency of recognition of individual signs. A gating module based on per-channel multi-layer convolutions is employed to identify the frames most significant for recognition of isolated signs. Additionally, an auxiliary multimodal branch network, integrated with a transformer, is designed to estimate the linguistic start and end frames of an isolated sign within a video clip. We evaluated the performance of our approach on multiple datasets that include isolated, citation-form signs as well as signs pre-segmented from continuous signing based on linguistic annotations of the start and end points of signs within sentences. We achieved very promising results when both types of sign videos were combined for training, with overall sign recognition accuracy of 80.8% Top-1 and 95.2% Top-5 for citation-form signs, and 80.4% Top-1 and 93.0% Top-5 for signs pre-segmented from continuous signing.
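For illustration only, the sketch below shows one plausible PyTorch realization of the two mechanisms named in the abstract: a gating module built from per-channel (depthwise) multi-layer temporal convolutions that reweights frame features, and a late fusion of forward and backward stream outputs. The module structure, names, and parameters are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FrameGating(nn.Module):
    """Hypothetical sketch of a per-channel, multi-layer convolutional
    gating module: depthwise temporal convolutions score each frame,
    and a sigmoid gate reweights the per-frame features."""

    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        pad = kernel_size // 2
        # groups=channels makes each Conv1d filter act per channel (depthwise).
        self.score = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, groups=channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad, groups=channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames) per-frame features, e.g. from a GCN backbone.
        gate = self.score(x)   # (batch, channels, frames), values in [0, 1]
        return x * gate        # emphasize frames significant for recognition

def late_fusion(logits_fwd: torch.Tensor, logits_bwd: torch.Tensor) -> torch.Tensor:
    """Assumed form of late fusion: average class scores from the forward
    and time-reversed (backward) processing of the video stream."""
    return 0.5 * (logits_fwd + logits_bwd)

if __name__ == "__main__":
    feats = torch.randn(2, 256, 64)   # 2 clips, 256 feature channels, 64 frames
    gated = FrameGating(256)(feats)
    print(gated.shape)                # torch.Size([2, 256, 64])
```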
License
© 2024 ELRA Language Resources Association: CC BY-NC 4.0.