Multimodal Deep Learning Methods and Counterfactual Reasoning for EFL Learners’ Public Speaking Anxiety Detection (75519)

Session Information: Artificial Intelligence
Session Chair: Takeshi Sato

Saturday, 11 November 2023 12:15
Session: Session 2
Room: Kirimas
Presentation Type: Paper Presentation

All presentation times are UTC + 7 (Asia/Bangkok)

Effective public speaking plays a crucial role in the personal and professional achievement of English as a Foreign Language (EFL) learners. Public speaking anxiety (PSA) is one of the most common social phobias and hinders learners' speaking performance. PSA typically involves both psychological and physiological symptoms, including feelings of dread, stuttering, trembling, pausing, and tachycardia. Accurate evaluation of PSA helps identify a learner's anxiety state and supports personalized instructional guidance to alleviate anxious feelings.
Recently, deep learning models have gained widespread attention in various fields, including natural language processing, computer vision, and speech recognition. The assessment of public speaking involves extracting features from unimodal data sources such as video, audio, and text, as well as fusing cross-modal data. We therefore employ multimodal deep learning techniques (MDLT) to automate the detection of PSA.
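The abstract does not name the unimodal encoders used, so the following is only a minimal sketch of what per-modality feature extraction could look like, assuming MFCCs for the acoustic channel and a pretrained BERT encoder for transcripts; all function names here are our own illustrative choices.

```python
# Illustrative unimodal feature extraction (not the authors' exact pipeline).
import librosa
import torch
from transformers import AutoModel, AutoTokenizer

def extract_audio_features(wav_path, n_mfcc=40):
    """Return a (frames, n_mfcc) matrix of MFCCs for one speech recording."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return torch.from_numpy(mfcc.T)  # time-major for sequence models

def extract_text_features(transcript, model_name="bert-base-uncased"):
    """Return contextual token embeddings for the speech transcript."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # (tokens, hidden_dim)
```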
In this study, we integrated learners' visual, acoustic, and textual information and constructed a novel deep learning model named Multimodal Speaking Anxiety Detection (MSAD), building on our previous work on detecting PSA. The MSAD model comprises feature extraction, unimodal representations, cross-modal fusion, and multimodal representations. The model, built around a cross-attention mechanism, was trained to automatically judge whether a learner is anxious. We further evaluated the MSAD model by training and testing on our self-developed large-scale dataset, SARC (Speaking Anxiety in Real Classrooms).
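Since the abstract describes the architecture only at a high level (unimodal representations, cross-attention, cross-modal fusion, binary anxiety prediction), the PyTorch sketch below is one plausible reading of such a pipeline, not the authors' released code; the class name `CrossAttentionFusion` and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical MSAD-style fusion: each modality attends to a shared
    multimodal context, then pooled representations feed a binary head."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 2)
        )

    def forward(self, visual, acoustic, textual):
        # visual/acoustic/textual: (batch, seq_len, dim) unimodal sequences.
        fused = []
        context = torch.cat([visual, acoustic, textual], dim=1)
        for query in (visual, acoustic, textual):
            out, _ = self.attn(query, context, context)       # cross-modal attention
            fused.append(self.norm(out + query).mean(dim=1))  # pooled per modality
        return self.classifier(torch.cat(fused, dim=-1))      # anxious / not-anxious logits
```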
Moreover, although deep learning models have achieved promising performance, they are still inevitably affected by spurious correlations within modalities. For example, certain actions in the visual modality, such as lowering the head or raising a hand, may appear strongly correlated with anxiety labels, but this correlation can be unreliable. The model may unintentionally capture, or even amplify, unintended dataset biases (i.e., label bias) during training. We therefore resort to causal inference and leverage counterfactual reasoning to eliminate single-modal bias.
The results showed that the proposed model based on counterfactual reasoning is effective for the automated assessment of PSA. The model-agnostic counterfactual framework preserves the positive effect of single-modal features while mitigating their negative effects. Our approach offers a more general paradigm for detecting PSA and opens up new research possibilities.
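The abstract does not give the exact formulation of the counterfactual framework. A common model-agnostic recipe in counterfactual debiasing work subtracts the natural direct effect of a single biased modality from the total effect at inference; the sketch below follows that recipe under our own assumptions, including the hypothetical hyperparameter `alpha`.

```python
import torch

def debiased_logits(fusion_logits, visual_only_logits, alpha=1.0):
    """Counterfactual-style debiasing (illustrative, not the paper's exact rule).

    fusion_logits:      logits from the full multimodal model (total effect).
    visual_only_logits: logits from a visual-only branch, standing in for the
                        counterfactual world where the other modalities are
                        blocked (direct effect of the biased modality).
    alpha:              strength of bias removal; an assumed hyperparameter.
    """
    # Keep the multimodal evidence; remove what the single (visual)
    # modality would predict on its own.
    return fusion_logits - alpha * visual_only_logits

# Usage sketch: at inference, take the debiased prediction.
fusion = torch.tensor([[1.2, -0.3]])       # anxious vs. not anxious
visual_only = torch.tensor([[0.9, -0.8]])  # spurious "head lowered => anxious"
prediction = debiased_logits(fusion, visual_only).argmax(dim=-1)
```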


Abstract Summary
Public speaking anxiety (PSA) is one of the most common social phobias and hinders learners' speaking performance. Accurate evaluation of PSA helps identify a learner's anxiety state and supports personalized instructional guidance to alleviate anxious feelings. Recently, deep learning models have gained widespread attention in various fields, so we employ multimodal deep learning techniques to automate the detection of PSA.
In this study, we integrated learners' visual, acoustic, and textual information and constructed a deep learning model named Multimodal Speaking Anxiety Detection (MSAD). The MSAD model comprises feature extraction, unimodal representations, cross-modal fusion, and multimodal representations. Moreover, although deep learning models have achieved promising performance, they are still inevitably affected by spurious correlations within modalities. We resort to causal inference and leverage counterfactual reasoning to eliminate single-modal bias. The results showed that the proposed model based on counterfactual reasoning is effective for the automated assessment of PSA.

Authors:
Tingting Zhang, Beijing University of Posts and Telecommunications, China
Xu Chen, Beijing University of Posts and Telecommunications, China
Chunping Zheng, Beijing University of Posts and Telecommunications, China
Bin Wu, Beijing University of Posts and Telecommunications, China


About the Presenter(s)
Tingting Zhang is a doctoral student at Beijing University of Posts and Telecommunications in China.
