Skeleton-based methods have recently achieved good performance in deep learning-based gait emotion recognition (DL-GER). However, current methods have two drawbacks that limit their ability to learn discriminative emotional features from gait. First, they do not exclude the effect of the subject’s walking orientation on emotion classification. Second, they do not sufficiently learn the implicit connections between joints during human walking. In this paper, an augmented spatial-temporal graph convolutional neural network (AST-GCN) is introduced to solve these two problems. An interframe shift encoding (ISE) module captures the interframe shifts of joints, making the network sensitive to changes in emotion-related joint movements regardless of the subject’s walking orientation. A multichannel implicit connection inference method learns additional emotion-related implicit connection relations. Notably, we unify current skeleton-based methods into a common framework, which theoretically validates that our AST-GCN has the most powerful feature representation capability among them. In addition, we extend the skeleton-based gait dataset using pose estimation software. Experiments demonstrate that our AST-GCN outperforms state-of-the-art methods on three datasets across two tasks.
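The abstract does not detail the ISE module itself, but the raw quantity it encodes, the interframe shifts of joints, is easy to illustrate. Below is a minimal sketch assuming PyTorch and a skeleton tensor laid out as batch × frames × joints × coordinates; the function name and shapes are illustrative assumptions, not the paper's design.

```python
import torch

def interframe_shifts(skeleton: torch.Tensor) -> torch.Tensor:
    """Frame-to-frame joint displacements, padded so the output keeps the input length.

    skeleton: (N, T, J, C) joint coordinates over T frames.
    returns:  (N, T, J, C) shifts, where shift[t] = pos[t] - pos[t-1] (zeros at t = 0).
    """
    shifts = skeleton[:, 1:] - skeleton[:, :-1]   # (N, T-1, J, C)
    pad = torch.zeros_like(skeleton[:, :1])       # pad the first frame to keep the temporal length
    return torch.cat([pad, shifts], dim=1)

# Example: a batch of 2 gait sequences, 48 frames, 16 joints, 3D coordinates
x = torch.randn(2, 48, 16, 3)
print(interframe_shifts(x).shape)  # torch.Size([2, 48, 16, 3])
```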
arXiv
UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception
Visual emotion analysis holds significant research value in both computer vision and psychology. However, existing methods for visual emotion analysis suffer from limited generalizability due to the ambiguity of emotion perception and the diversity of data scenarios. To tackle this issue, we introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework. Inspired by psychological research emphasizing the inseparability of the emotional exploration process from the interaction between individuals and their environment, UniEmoX integrates scene-centric and person-centric low-level image spatial structural information, aiming to derive more nuanced and discriminative emotional representations. By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations more effectively. To the best of our knowledge, this is the first large-scale pretraining framework that integrates psychological theories with contemporary contrastive learning and masked image modeling techniques for emotion analysis across diverse scenarios. Additionally, we develop a visual emotion dataset titled Emo8. Emo8 samples span a range of domains, including cartoon, natural, realistic, science-fiction, and advertising-cover styles, covering nearly all common emotional scenes. Comprehensive experiments conducted on six benchmark datasets across two downstream tasks validate the effectiveness of UniEmoX. The source code is available at https://github.com/chincharles/u-emo.
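As a rough illustration of the distillation idea, the sketch below aligns a student's emotional embedding with a frozen CLIP embedding of the same image via a cosine objective. UniEmoX's actual losses, encoders, and masking scheme are not specified in this abstract; `distill_loss`, `student_embed`, and `teacher_embed` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_embed: torch.Tensor, teacher_embed: torch.Tensor) -> torch.Tensor:
    """Encourage the student's emotional embedding to align with the frozen CLIP
    semantic embedding of the same image; both are L2-normalized first."""
    s = F.normalize(student_embed, dim=-1)
    t = F.normalize(teacher_embed.detach(), dim=-1)   # the teacher is not updated
    return (1.0 - (s * t).sum(dim=-1)).mean()

# Example with random features (batch of 8, 512-dim embeddings)
loss = distill_loss(torch.randn(8, 512, requires_grad=True), torch.randn(8, 512))
loss.backward()
```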
2023
ICME
STA-GCN: Spatial Temporal Adaptive Graph Convolutional Network for Gait Emotion Recognition
Chuang Chen, and Xiao Sun*
In 2023 IEEE International Conference on Multimedia and Expo (ICME), 2023
Graph Convolutional Neural Networks (GCNs) have recently been widely used in Gait Emotion Recognition (GER). However, existing GCN-based GER methods have two drawbacks that limit their ability to learn discriminative features. In spatial modeling, context-sensitive affective features of joints are under-extracted because implicit connections are neglected. In temporal modeling, multi-scale temporal features of joint motion are under-extracted or rigidly aggregated. In this paper, we propose a novel Spatial-Temporal Adaptive Graph Convolutional Network (STA-GCN) that introduces two main modules. The Spatial Feature Learning Module (SFLM) infers context-sensitive implicit connections between joints and adaptively aggregates the spatial features mined from implicit and explicit connections. The Temporal Feature Learning Module (TFLM) extracts and adaptively aggregates multi-scale temporal features of joint motion. It is worth mentioning that we first pre-train the model using hand-crafted affective features and the corresponding gaits. Experimental results demonstrate that our STA-GCN outperforms state-of-the-art methods on two tasks.
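The SFLM itself is not specified here, but the general pattern it describes, fusing an explicit skeleton adjacency with an implicit, data-dependent adjacency inferred from joint features, can be sketched as follows in PyTorch. The shapes, the embedded-similarity graph, and the gating scheme are all assumptions rather than the authors' design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExplicitImplicitGCN(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, a_explicit: torch.Tensor):
        super().__init__()
        self.register_buffer("a_explicit", a_explicit)   # (J, J) skeleton adjacency
        self.theta = nn.Linear(in_ch, out_ch)            # embeddings used to infer the implicit graph
        self.phi = nn.Linear(in_ch, out_ch)
        self.proj = nn.Linear(in_ch, out_ch)
        self.alpha = nn.Parameter(torch.zeros(1))        # learnable gate fusing the two graphs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, J, C) per-frame joint features
        a_implicit = F.softmax(self.theta(x) @ self.phi(x).transpose(1, 2), dim=-1)  # (N, J, J)
        a = self.a_explicit.unsqueeze(0) + self.alpha * a_implicit                   # fused adjacency
        return F.relu(a @ self.proj(x))                                              # (N, J, out_ch)

adj = torch.eye(16)                        # placeholder explicit adjacency for 16 joints
layer = ExplicitImplicitGCN(3, 64, adj)
print(layer(torch.randn(4, 16, 3)).shape)  # torch.Size([4, 16, 64])
```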
arXiv
Multimodal Feature Extraction and Attention-based Fusion for Emotion Estimation in Videos
Tao Shu, Xinke Wang, Ruotong Wang, Chuang Chen, and 2 more authors
The continuous improvement of human-computer interaction technology makes it possible to compute emotions. In this paper, we introduce our submission to the CVPR 2023 Competition on Affective Behavior Analysis in-the-wild (ABAW). Sentiment analysis in human-computer interaction should, as far as possible, start from multiple dimensions, compensate for any single imperfect emotion channel, and finally determine the emotional tendency by fitting the multiple results together. Therefore, we exploited multimodal features extracted from videos of different lengths in the competition dataset, including audio, pose, and images. These well-informed emotion representations motivate us to propose an attention-based multimodal framework for emotion estimation. Our system achieves a performance of 0.361 on the validation dataset. The code is available at [this https URL].
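The abstract leaves the fusion architecture unspecified; the following is a minimal, hypothetical sketch of attention-based fusion over per-modality feature vectors (audio, pose, image), with a small regression head standing in for the ABAW output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    def __init__(self, dim: int, num_outputs: int = 2):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # one attention score per modality
        self.head = nn.Linear(dim, num_outputs)   # e.g. valence/arousal regression

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, M, D) with M modality feature vectors per clip
        w = F.softmax(self.score(feats), dim=1)   # (N, M, 1) attention weights
        fused = (w * feats).sum(dim=1)            # (N, D) weighted combination
        return self.head(fused)

# Example: batch of 4 clips, 3 modalities (audio, pose, image), 256-dim features
model = AttentionFusion(dim=256)
print(model(torch.randn(4, 3, 256)).shape)        # torch.Size([4, 2])
```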