Title: Domain Adaptation for Speech-Driven Affective Facial Synthesis
Speaker: Rizwan Sadiq
Time: September 04, 2020, 12:00pm
Place: This thesis defense will be held online. You can join the presentation through the link below at the specified date and time.
Meeting ID: 951 5239 0890
Thesis Committee Members:
Prof. Engin Erzin (Advisor, Koç University)
Prof. Yücel Yemez (Koç University)
Prof. Çiğdem Eroğlu Erdem (Marmara University)
Prof. A. Murat Tekalp (Koç University)
Prof. Murat Saraçlar (Boğaziçi University)
Although speech-driven facial animation has been studied extensively in the literature, works focusing on the affective content of speech are limited, mostly due to the scarcity of affective audio-visual data. In this thesis, we present three major studies that lead us to speech-driven affective facial synthesis.

First, we investigate the use of lip articulations for affect recognition to better understand the dependencies across lip articulations, phonetic classes, and affect. Then, we propose a multimodal system, consisting of text and speech, for affective facial feature synthesis. A phoneme-based model driven by text enables generation of speaker-independent animation, whereas a speech-based model captures affective variation during facial feature synthesis. Finally, we improve affective facial synthesis using domain adaptation to partially alleviate the data scarcity.

In this last study, we first define a domain adaptation that maps affective and neutral speech representations to a common latent space in which cross-domain bias is smaller. The domain adaptation is then used to augment affective representations for each emotion category (angry, disgust, fear, happy, sad, surprise, and neutral), so that we can better train emotion-dependent deep audio-to-visual (A2V) mapping models. Based on these emotion-dependent deep A2V models, the proposed affective facial synthesis system is realized in two stages: first, speech emotion recognition extracts soft emotion category likelihoods for the utterance; then, a soft fusion of the emotion-dependent A2V mapping outputs forms the affective facial synthesis.

Experimental evaluations are performed on the SAVEE audio-visual dataset with both objective and subjective measures. The proposed affective A2V system achieves significant mean squared error improvements over the recent literature. Furthermore, the resulting facial animations of the proposed system are preferred over the baseline animations in the subjective evaluations.
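As a rough illustration of the two-stage synthesis described in the abstract, the soft fusion can be sketched as a likelihood-weighted sum of the emotion-dependent A2V model outputs. This is a minimal sketch only: the model placeholders, feature dimensions, and function names below are hypothetical and do not reflect the thesis implementation.

```python
import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def a2v_output(emotion, speech_features):
    # Hypothetical stand-in for one emotion-dependent deep A2V model.
    # Here it is just a fixed, deterministic linear map per emotion,
    # used only to make the fusion step concrete.
    rng = np.random.default_rng(sum(map(ord, emotion)))
    weights = rng.standard_normal((speech_features.shape[-1], 4))  # 4 facial features
    return speech_features @ weights

def soft_fusion(speech_features, emotion_likelihoods):
    """Weight each emotion-dependent A2V output by its soft emotion likelihood."""
    return sum(p * a2v_output(e, speech_features)
               for e, p in zip(EMOTIONS, emotion_likelihoods))

# Toy usage: one frame of 13-dimensional speech features,
# uniform soft emotion likelihoods (they sum to 1).
feats = np.ones((1, 13))
likelihoods = np.full(len(EMOTIONS), 1.0 / len(EMOTIONS))
out = soft_fusion(feats, likelihoods)
print(out.shape)  # (1, 4)
```

In the actual system, the likelihoods would come from a speech emotion recognizer and each A2V mapping would be a trained deep model; the fusion step itself is simply this convex combination of their outputs.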