F0 Modeling for Isarn Speech Synthesis using Deep Neural Networks and Syllable-level Feature Representation
Pongsathon Janyoi and Pusadee Seresangtakul
Department of Computer Science, Khon Kaen University, Thailand
Abstract: The generation of the fundamental frequency (F0)
plays an important role in speech synthesis because it directly influences the
naturalness of synthetic speech. In conventional parametric speech
synthesis, F0 is predicted frame-by-frame. This method is
insufficient to represent F0 contours in larger units,
especially tone contours of syllables in tonal languages that deviate as a
result of long-term context dependency. This work proposes a syllable-level F0
model that represents F0 contours within syllables, using
syllable-level F0 parameters that comprise sampled F0
points and dynamic features. A Deep Neural Network (DNN) was used to represent
the relationships between syllable-level contextual features and syllable-level
F0 parameters. The proposed model was examined using an Isarn
speech synthesis system with both large and small training sets. For all
training sets, the results of objective and subjective tests indicate that the
proposed approach outperforms the baseline systems based on hidden Markov
models and DNNs that predict F0 values at the frame level.
Keywords: Fundamental frequency, speech synthesis, deep
neural networks.
Received July 14, 2019; accepted May 28, 2020
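
The syllable-level F0 parameters described in the abstract (sampled F0 points plus dynamic features) can be illustrated with a minimal sketch. The abstract does not state the details, so the following are all assumptions made for illustration: the function name syllable_f0_parameters, the use of log-F0, the number of sampled points (n_points), linear interpolation over voiced frames, and a simple first-order delta window. It is not the authors' implementation.

```python
import numpy as np


def syllable_f0_parameters(f0_hz, n_points=5):
    """Build an illustrative syllable-level F0 vector for one syllable.

    f0_hz    : frame-level F0 values in Hz inside the syllable (0 = unvoiced).
    n_points : number of sampled F0 points (hypothetical choice).
    Returns a vector of n_points sampled log-F0 values followed by
    n_points delta (dynamic) features.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz > 0
    if voiced.sum() < 2:
        raise ValueError("need at least two voiced frames in the syllable")

    frames = np.flatnonzero(voiced)      # indices of voiced frames
    log_f0 = np.log(f0_hz[voiced])       # assumption: model F0 in the log domain

    # Sample the contour at equally spaced positions across the voiced span,
    # so every syllable yields the same number of points regardless of duration.
    positions = np.linspace(frames[0], frames[-1], n_points)
    sampled = np.interp(positions, frames, log_f0)

    # Dynamic features: delta[t] = 0.5 * (x[t+1] - x[t-1]), edges repeated.
    padded = np.pad(sampled, 1, mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])

    return np.concatenate([sampled, delta])


# Example: a rising contour with unvoiced frames at both syllable edges.
params = syllable_f0_parameters([0, 100, 110, 120, 130, 140, 0], n_points=5)
```

In a system of the kind the abstract describes, a vector like this would serve as the regression target of a DNN whose inputs are syllable-level contextual features. The fixed length per syllable is what lets a single network output describe the whole contour rather than one frame at a time.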