All posts by eerzin

PhD Defense by Sasan Asadiabadi

Title: Deep Learning Approaches for Vocal Tract Boundary Segmentation in rtMRI

Speaker: Sasan Asadiabadi

Time: January 04, 2021, 13:00

Place: This thesis defense will be held online. You can join the presentation through the link below at the stated date and time.

Join Zoom Meeting

Thesis Committee Members:
Prof. Engin Erzin (Advisor, Koç University)
Prof. Yücel Yemez (Koç University)
Prof. Alper Erdoğan (Koç University)
Prof. Murat Saraçlar (Boğaziçi University)
Prof. Levent Arslan (Boğaziçi University)

Recent advances in real-time Magnetic Resonance Imaging (rtMRI) provide an invaluable tool to study speech articulation. Developing automatic algorithms to detect the landmarks defining the boundaries of the vocal tract (VT) is crucial for a wide range of research, from speech modeling and synthesis to clinical research. In this thesis, we present two effective deep learning approaches for supervised detection and tracking of vocal tract contours in a sequence of rtMRI frames: (1) we propose a fully convolutional network to estimate the VT contour in a heatmap regression fashion, and (2) we introduce a deep temporal regression network that learns the non-linear mapping from a temporally overlapping, fixed-length sequence of rtMRI frames to the corresponding articulatory movements. We also introduce two post-processing algorithms that follow the deep models to further improve the quality of VT contour detection: (i) a novel appearance-model-based contour refinement to overcome the potential failures of data-driven approaches for highly deformable articulators, and (ii) a spatiotemporal stabilization scheme that stabilizes the estimated contours in space and time by removing spatial outliers and temporal jitter. The proposed VT contour tracking models are trained and evaluated on the large audiovisual USC-TIMIT dataset. Performance is evaluated using various objective assessment metrics for the spatial error and temporal stability of the contour landmarks, in comparison with several baseline approaches from the recent literature. Results indicate significant improvements with the proposed methods over the state-of-the-art baselines. In addition, we develop a graphical user interface (GUI) for the analysis of rtMRI data, integrated with various features including automatic segmentation of the VT boundaries using the proposed contour estimation methods and calculation of tract variables and cross-sectional distance.
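As a rough illustration of what a "temporally overlapping, fixed-length sequence of rtMRI frames" can look like as network input, the sketch below slices a frame sequence into sliding windows. The function name `overlapping_windows` and the parameters `length` and `hop` are hypothetical and chosen for illustration only; the abstract does not state the window length or overlap used in the thesis.

```python
def overlapping_windows(frames, length, hop):
    """Split a frame sequence into fixed-length windows.

    Consecutive windows overlap whenever hop < length.
    Both parameters are illustrative assumptions, not values from the thesis.
    """
    if length <= 0 or hop <= 0:
        raise ValueError("length and hop must be positive")
    last_start = len(frames) - length
    return [frames[start:start + length] for start in range(0, last_start + 1, hop)]


# Integers stand in for 10 consecutive rtMRI frames.
frames = list(range(10))
windows = overlapping_windows(frames, length=4, hop=2)

for window in windows:
    print(window)
```

With `hop` smaller than `length`, each frame appears in more than one window, which is what gives a temporal model overlapping context.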

PhD Defense by Nusrah Hussain

Title: Engaging Human-Robot Interaction with Batch Reinforcement Learning

Speaker: Nusrah Hussain

Time: September 08, 2020, 18:00

Place: This thesis defense will be held online. You can join the presentation through the link below at the stated date and time.
Meeting ID: 924 2064 7424

Thesis Committee Members:
Prof. Engin Erzin (Advisor, Koç University)
Prof. Yücel Yemez (Co-Advisor, Koç University)
Assoc. Prof. Metin Sezgin (Koç University)
Prof. Murat Tekalp (Koç University)
Prof. Ali Albert Salah (Utrecht University)
Asst. Prof. Anca Dragan (University of California, Berkeley)

A common issue in the field of social robotics is the need to maintain user engagement during human-robot interaction (HRI). Engagement has been used as a typical metric to gauge the success of HRI, and hence it is regarded as a universal goal in the design of social robots. In this thesis, we train a model that generates non-verbal behaviors, namely smiles and nods, as backchannels for a robot to engage humans during HRI. We propose a novel batch reinforcement learning (batch-RL) formulation for the task, where we take advantage of recorded human-to-human interaction data to learn a policy offline. The formulation treats user engagement as the reward and constructs a backchannel policy that maximizes it. We propose three value-based off-policy batch-RL algorithms to address the problem, which differ in how they manipulate the samples in the dataset to make the gradient updates. To evaluate the policies trained with these algorithms, we use offline evaluation methods such as off-policy policy evaluation (OPE) and the Bellman residual. The final work presented in the thesis is the design and execution of a user study on HRI with a backchanneling robot. The interaction is designed with an expressive 3D robotic head in a story-shaping interaction scenario, where the learned backchannel policy controls the nod and smile behaviors. Subjective questionnaires and engagement values extracted from users’ social signals are used to assess the impact of the robot’s social behavior on the participants. The higher acceptability of the RL policy versus a baseline policy is indicated by statistically significant differences in the evaluation scores. The research presented in this thesis addresses only one class of robot behavior towards socially engaging robots. As a pioneering work, it paves the way for the automation of numerous other desired robot behaviors that target other metrics used in the design of human-robot interaction systems.
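The general idea of learning a value-based policy offline from a fixed set of recorded transitions can be sketched with tabular Q-learning run repeatedly over the same batch. This is a generic textbook-style sketch, not one of the three algorithms proposed in the thesis; the two states, two actions, reward values, and hyperparameters below are all made up for illustration.

```python
def batch_q_learning(transitions, n_states, n_actions, gamma=0.9, lr=0.5, sweeps=200):
    """Tabular Q-learning over a fixed batch; no new interaction data is collected."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for state, action, reward, next_state in transitions:
            # Off-policy target: reward plus discounted best value of the next state.
            target = reward + gamma * max(q[next_state])
            q[state][action] += lr * (target - q[state][action])
    return q


def greedy_policy(q):
    """Pick the highest-valued action in every state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


# Hypothetical toy data: state 0 = user disengaged, state 1 = user engaged;
# action 0 = stay idle, action 1 = nod. The reward stands in for engagement.
transitions = [
    (0, 1, 1.0, 1),  # nodding at a disengaged user leads to engagement
    (0, 0, 0.0, 0),  # staying idle keeps the user disengaged
    (1, 0, 1.0, 1),  # an engaged user stays engaged without extra nods
    (1, 1, 0.0, 0),  # nodding again at an engaged user loses engagement
]

q_values = batch_q_learning(transitions, n_states=2, n_actions=2)
policy = greedy_policy(q_values)
print(policy)
```

Because the update uses the maximum over next-state values rather than the action that was actually recorded next, the learned policy can differ from the behavior that produced the data, which is the off-policy property the abstract refers to.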

PhD Defense by Rizwan Sadiq

Title: Domain Adaptation for Speech-Driven Affective Facial Synthesis
Speaker: Rizwan Sadiq
Time: September 04, 2020, 12:00

Place: This thesis defense will be held online. You can join the presentation through the link below at the stated date and time.
Meeting ID: 951 5239 0890

Thesis Committee Members:
Prof. Engin Erzin (Advisor, Koç University)
Prof. Yücel Yemez (Koç University)
Prof. Çiğdem Eroğlu Erdem (Marmara University)
Prof. A. Murat Tekalp (Koç University)
Prof. Murat Saraçlar (Boğaziçi University)


Although speech-driven facial animation has been studied extensively in the literature, works focusing on the affective content of speech are limited. This is mostly due to the scarcity of affective audio-visual data. In this thesis, we present three major studies that lead us to speech-driven affective facial synthesis. First, we investigate the use of lip articulations for affect recognition to better understand the dependencies across lip articulations, phonetic classes and affect. Then, we propose a multimodal system, consisting of text and speech, for affective facial feature synthesis. A phoneme-based model driven by text enables the generation of speaker-independent animation, whereas a speech-based model captures affective variation during facial feature synthesis. Finally, we improve affective facial synthesis using domain adaptation, which partially reduces the data scarcity. In this last study, we first define a domain adaptation to map affective and neutral speech representations to a common latent space in which cross-domain bias is smaller. Then, the domain adaptation is used to augment affective representations for each emotion category, including angry, disgust, fear, happy, sad, surprise and neutral, so that we can better train emotion-dependent deep audio-to-visual (A2V) mapping models. Based on the emotion-dependent deep A2V models, the proposed affective facial synthesis system is realized in two stages: first, speech emotion recognition extracts soft emotion category likelihoods for the utterances; then, a soft fusion of the emotion-dependent A2V mapping outputs forms the affective facial synthesis. Experimental evaluations are performed on the SAVEE audio-visual dataset with objective and subjective evaluations. The proposed affective A2V system achieves significant mean square error loss improvements in comparison to the recent literature. Furthermore, the resulting facial animations of the proposed system are preferred over the baseline animations in the subjective evaluations.
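One plausible reading of the two-stage scheme is a likelihood-weighted average of the per-emotion A2V outputs. The sketch below assumes exactly that; the abstract does not spell out the fusion rule, so the weighted sum, the function name `soft_fusion`, and the toy two-dimensional feature vectors are illustrative assumptions rather than the thesis implementation.

```python
def soft_fusion(likelihoods, outputs):
    """Blend per-emotion A2V outputs, using emotion likelihoods as weights.

    likelihoods: dict mapping emotion name to a non-negative score
    outputs:     dict mapping emotion name to a feature vector (list of floats)
    Assumes a normalized weighted sum, which is a guess at the fusion rule.
    """
    total = sum(likelihoods.values())
    if total <= 0:
        raise ValueError("likelihoods must sum to a positive value")
    dim = len(next(iter(outputs.values())))
    fused = [0.0] * dim
    for emotion, score in likelihoods.items():
        weight = score / total
        for i, value in enumerate(outputs[emotion]):
            fused[i] += weight * value
    return fused


# Hypothetical stage 1 result: soft emotion likelihoods for one utterance.
likelihoods = {"happy": 0.75, "neutral": 0.25}
# Hypothetical stage 2 inputs: facial features predicted by each emotion-dependent model.
outputs = {"happy": [1.0, 0.0], "neutral": [0.0, 2.0]}

fused = soft_fusion(likelihoods, outputs)
print(fused)
```

Using soft weights instead of picking the single most likely emotion means a misclassified utterance still receives a contribution from the correct emotion-dependent model.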

PhD Defense by Syeda Narjis Fatima

Title: Continuous Emotion Recognition in Dyadic Interactions
Speaker: Syeda Narjis Fatima
Time: June 22, 2020, 11:30

Understanding the emotional dynamics of dyadic interactions is crucial for developing more natural human-computer interaction (HCI) systems. Emotional dependencies and affect context play important roles in dyadic interactions. During dyadic interactions, the emotional state of a participant is modulated in many communication channels, such as speech, head and body motion, vocal activity patterns, and non-verbal vocalizations. Recent studies have shown that affect recognition tasks can benefit from incorporating a particular interaction’s context. However, the role and contribution of affect context, and its incorporation into dyadic neural architectures, remain a complex and open problem. Motivated by this perspective, this thesis conducts a series of related studies targeting emotional dependencies during dyadic interactions to improve continuous emotion recognition (CER). First, we define a convolutional neural network (CNN) architecture for single-subject CER based on speech and body motion data. We then introduce dyadic CER as a two-stage regression framework and explore ways in which cross-subject affect can be used to improve CER performance for a target subject. Specifically, we propose two dyadic CNN architectures where the cross-subject contribution to the CER task is achieved by fusion of cross-subject affect and feature maps. As a concluding work, we define dyadic affect context (DAC) and propose a new Convolutional LSTM (ConvLSTM) model that exploits it for dyadic CER. Our ConvLSTM model captures local spectro-temporal correlations in speech and body motion as well as the long-term affect inter-dependencies between subjects. Our multimodal analysis demonstrates that modeling and incorporating the DAC in the proposed CER models provides significant performance improvements on the USC CreativeIT database, and the achieved results compare favorably to the state-of-the-art.