Idlak Tangle: An Open Source Kaldi Based Parametric Speech Synthesiser based on DNN

Blaise Potard (1, 3), Matthew P. Aylett (2), David A. Braude, Petr Motlicek

(1) CereProc Ltd, United Kingdom
(2) The Centre for Speech Technology Research, University of Edinburgh, United Kingdom
(3) The Idiap Research Institute, Martigny, Switzerland

Abstract

This paper presents a text-to-speech (TTS) extension to
[...] offer the first free high quality parametric Text to Speech.

3. By making our system openly available, together with the tests we describe, we offer a useful test harness and a better sounding baseline than HTS demo to the community.

In the following sections we discuss the choice of our HMM-based (HTS) parametric speech synthesis baseline, a description of the DNN modelling process, and the speech synthesis process. We continue by carrying out a listening test comparing a set of different synthesisers, present our evaluation, and conclude by discussing the choices made in our design and potential future work.

2 Using HTS demo as an Idlak baseline

Idlak supplies a bash script to download and build the publicly available HTS demo and its dependencies on Linux-based systems for comparisons. The full-context models for both training and running the system are then replaced as detailed in [7] and the Idlak documentation. Subsequently, standard HTS demo training is carried out, followed by the synthesis of the beginning of Lewis Carroll's Alice's Adventures in Wonderland (footnote 2) using HTS engine.

Footnote 2: The full text is available from Project Gutenberg, http://www.gutenberg.org/ebooks/11

HTS demo was chosen as a baseline not because it is the best HMM system available (there are many better and more sophisticated systems presented in previous research), but because it is the only system we could source that did not require proprietary audio databases, proprietary lexicons or proprietary signal analysis (e.g. STRAIGHT [11]). Kaldi itself was partly the result of the difficulties of adapting HTK for research work, because HTK has license restrictions that considerably limit its use in conventional open source projects. These same restrictions are present in HTS demo, which relies on HTK, together with a patch, to build HMM models and trees. However, the models created by the training process can then be freely distributed, and some can be used with free software tools.

In this paper we compare Tangle to the output of HTS demo v2.3 alpha, using the SPTK toolkit v3.6 for acoustic analysis and HTS engine v1.07 for synthesis. The speech database used is the standard HTS demo database, i.e. CMU ARCTIC speaker SLT, upsampled to 48 kHz.

3 DNN based duration and acoustic models

3.1 Generalities

A collection of tools and scripts was added to the Idlak toolkit to allow the training of DNNs suitable for TTS. The internal structures, training procedures and methods were derived from the nnetbin DNN variant of Kaldi.

Two deep neural networks need to be trained: a Duration Model DNN (DM DNN) that predicts the durations of both phones and HMM states from input phone labels, and an Acoustic Model DNN (AM DNN) that predicts an acoustic sequence from a sequence of acoustic labels.

In practice the synthesis procedure works as follows:

1. The Idlak front end analyses and normalises the input text, then generates a rich phonetic and contextual representation from it (a.k.a. full labels). The Idlak text processing modules each operate on an XML marked-up stream of text. Each module will typically add structure to the XML and may be dependent on structure added by previous modules. The modules can be chained into command-line binaries that use Kaldi-style options and can take input from pipes or files. Figure 1 shows the current modules that form idlaktxp. See [7] for more details.
2. These labels are then transformed into an input suitable for the DNN by mapping all features to numerical values.
3. The durations within each phone and within each HMM state are predicted using the Duration model DNN.
4. The input to the Acoustic model DNN is generated by creating an input label frame for each acoustic frame desired, i.e. the input labels for each HMM state are duplicated as needed to match the predicted durations. The quantized positions within the phone and the HMM state are appended to the input labels.
5. The raw acoustic features and their derivatives are predicted using the Acoustic model DNN.
6. The acoustic feature trajectories are smoothed using the MLPG algorithm.
7. An excitation signal is built using the voicing and band aperiodicity information.

[Figure 1: Idlak text processing system (idlaktxp). The system comprises a set of modules operating on XML input and producing further tagged XML.]

Figure 2 summarises the training procedure.

The AM DNN training requires a frame-level mapping between input labels and acoustic features; therefore the unit-level labels have to be sampled so that we have an input label per acoustic frame. Based on our previous work [12], we chose to add 2 numerical values to the full-context labels, coding respectively for the frame position within the current HMM state and the position within the current phone. We treated the state identity as a numerical value rather than a categorical feature.

[Figure 2: Tangle DNN training architecture. Diagram blocks: ARCTIC labels; ARCTIC speech data (SLT); Idlak front end; MFCC (order 13); HMM alignment; full-context quinphone labels; frame-level quinphone alignments; merge of full labels and alignment; framing; state-level full-context labels; frame-level full-context labels; state/phone duration; acoustic features (MCEP, F0, Bndap); AM DNN.]

As this system is intended to be used as a lightweight baseline rather than a state-of-the-art system, the DNNs were built using relatively modest numbers of hidden layers (3) and nodes (100 and 700 in each layer, respectively for the duration and acoustic models). Each layer comprised an affine component followed by a sigmoid activation function.

The input data (label) was further normalized so that each component had zero mean and unit variance. To reduce the issues linked to frame-by-frame independence, we spliced together 11 input frames (5 back, 5 front), which gave us input dimensions of 4125 and 4169 respectively for the duration and acoustic DNNs.

The output data (duration or acoustic) was normalized globally so that each output component had values between 0.01 and 0.99; the output activation function was a sigmoid.

Unlike other approaches such as Zen [13] or Qian [14], we did not remove silent frames from the training, as it was not found to be necessary for synthesis quality. The training procedure was standard: we used a stochastic gradient descent based on back propagation. The minimisation criterion was the Mean Square Error (MSE). The training was run on a training set, and we used a development set for cross validation.

3.2 Forced alignment procedure

A forced alignment procedure performed on the full database was used to align the full-context labels with the acoustic data using standard tools from the Kaldi toolkit.

The models for the alignment were trained on the training plus development sets, and state-level labels force-aligned to acoustic frame boundaries were generated for the training and development sets. The models used were 5-state left-right HMMs with multiple Gaussians; 3230 tied HMM states and about 50k Gaussians were used. The acoustic features used for alignment were order 13 MFCC with first and second order derivatives.

3.3 Duration modelling DNN

We trained a first DNN to learn a mapping between full label information and the respective durations of states and phones. The input of this DNN is the full label mapped to numerical features; the output is the respective duration, as a number of frames, of the units the label belongs to, which we limited to phone and HMM state, as extracted from the forced alignment. In case some states had been skipped in the alignment, inputs for the skipped states were added with an output duration of 0 for the state.

3.4 Acoustic modelling DNN

The input of this DNN needs to have the same sampling rate as the acoustic data, so the full labels with state and phone duration need to be oversampled. In practice we duplicate as many state-level labels as needed, based on the output predicted by the DM DNN. The input frames within a state are then made distinct by appending quantized positions, respectively within the current state and within the current phone. The positions within the state were restricted to 5 distinct values, while the positions within the phone were restricted to 10 distinct values.

The acoustic features contained 2 values for modelling the periodic excitation (continuous F0 and voicing probability), 25 values for aperiodic excitation (Bark scale band aperiodicity) and 60 values for modelling the filter. First and second order derivatives of all these features were also modelled, for a total output vector size of 261.

3.5 DNN synthesis

During synthesis, the full labels generated on input text by the Idlak front end are converted to numerical values, then output durations are calculated by forward propagation in the DM DNN. These values are then post-processed for consistency, so that the sum of the state durations within a given phone is equal to the phone duration.

By combining input labels and durations together, we can then generate valid input for the acoustic model DNN. The full labels with state and phone duration appended are oversampled as described in the previous subsection and then forward propagated in the AM DNN.

This generates sequences of acoustic features with their derivatives; these sequences are post-processed using the MLPG algorithm to generate a smooth sequence of acoustic features. These features are then fed to a mixed excitation MLSA synthesizer [8] to generate the audio output.

The analysis and synthesis tools used have been integrated.

4 Experiment

For a fair comparison to the HTS demo, we used the same audio database for both systems: the CMU ARCTIC database, speaker SLT. The training set consisted of 1132 audio files encoded in mono PCM wave format with a sampling rate of 48 kHz (upsampled from 32 kHz), totalling 47.01 minutes once start and end silences had been trimmed.

                 HTS demo        Tangle DNN
Filter           MCEP ord. 60    MCEP ord. 60
Periodic exc.    Discontinuous   Continuous
Aperiodic exc.   None            Band aperiod. ord. 25

Table 1: Acoustic parametrisation

The tools supplied with Idlak to build an HTS demo with the Idlak front end were used as detailed in [7], followed by synthesis of the beginning of Lewis Carroll's Alice's Adventures in Wonderland using HTS engine. Note that training the HTS models requires the proprietary HTK toolkit, which requires registration, but the synthesis procedure can be performed with the hts engine tool, which is distributed as free software.

The acoustic parametrisation used for both Tangle and HTS demo is summarised in Table 1. The MCEP coefficients were extracted in both cases using the SPTK MCEP extraction tool with α = 0.55. The periodic excitation was extracted using SPTK's pitch for HTS demo and Kaldi's compute-kaldi-pitch-feats for Tangle, and the aperiodic energy was extracted using Idlak's compute-aperiodic-feats.

A unit selection voice and a commercial-grade HTS voice, created with the same audio database, were provided by CereProc Ltd for use in comparisons. As the speech database is very small, the unit selection voice was not expected to perform significantly better than the parametric systems [15].

5 Evaluation

Fifteen expert listeners completed a MUSHRA-like preference test [16] on 12 output phrases selected to cover different phrase lengths, in which the listeners were asked to rate, between 0 and 100, the naturalness of the outputs generated by each of the 4 systems. Note that the test had neither reference nor anchor, as there are no original recordings of these samples by the target speaker, and none of the systems was expected to be consistently better or worse than all the others.

[Figure 3: Mean opinion score of the four systems: (1) Idlak Tangle, (2) commercial HTS-style system, (3) HTS demo baseline, (4) reference unit selection system. Error bars show standard error. All means except for DNN and commercial HTS are significantly different (p < 0.025). Axes: System; Naturalness (0-100 MUSHRA).]

For statistical analysis, opinion scores were averaged across subjects to produce an average score for each phrase (footnote 3). A repeated measures ANOVA was carried out by phrase with four conditions (HTS demo, HTS commercial, Idlak Tangle, Unit Selection). Results showed a significant difference between groups (F(3, 33) = 29.821, p < 0.001); pairwise comparison of the means using the least significant difference (LSD) procedure with Bonferroni correction showed a significant difference (p < 0.025) between all means except between the commercial HTS system and Idlak Tangle. The unit selection system performed best, but with a wider variance, and the Idlak Tangle DNN system consistently outperformed the HTS demo baseline. However, it was neither better nor worse than the proprietary HMM system.

Footnote 3: By assuming the discrete opinion scores are independent and identically distributed samples, we are able to use the central limit theorem to regard the means as being drawn from an approximately Gaussian distribution [17].

6 Discussion

These example voices were built from the freely available ARCTIC SLT voice. With 47 minutes of data, this is a small corpus for TTS voice building by today's standards. Given the small size of the database, the unit selection system performed surprisingly well. Results in Blizzard [15] challenges have generally shown unit selection voices to be below parametric quality for databases of this size.

Previous work on DNN approaches to synthesis has typically used larger databases, e.g. Zen et al. [13] 30 hours, Wu and King [5] 2400 utterances. The results here show that, with the right architecture, a DNN solution can also outperform or match an HMM system with a small corpus. This is especially important for less resourced languages, where the expense of recording many hours of data can be a barrier to development.

Readers are encouraged to listen to the samples at http://homepages.inf.ed.ac.uk/matthewa/interspeech2016DNN to gauge the quality of both the HMM baseline and the Tangle DNN system. The HTS demo output does not use mixed excitation. Within HTS-based systems, mixed excitation is typically driven with band aperiodic energy parameters produced using the restricted-license STRAIGHT [11] system. Hence the lack of mixed excitation in the HTS demo output, which leads to a strong sense of audio buzz in voiced regions. An important contribution from this work is an open source method of determining aperiodic band energy for use in more freely licensed systems.

The voice quality produced by Tangle is not buzzy, but does exhibit the dull and muffled quality associated with early HMM systems, which did not use global variance to increase the variance in trajectory modelling. In this early system no attempt has been made to use global variance or variance scaling to increase the variability of the speech output. Future work intends to improve this part of the released vocoder.

7 Conclusions

The DNN Tangle system presented here uses a simple, open framework. Compared to the HTS-based system, the architecture and the licensing situation are simple and allow liberal use of the system within both commercial and academic environments. The performance of Tangle is significantly better than the baseline HTS demo parametric system. Tangle is, to our knowledge, the first DNN-based parametric synthesis system with no usage restriction; however, this is by no means the current state of the art. We look forward to other research groups comparing Tangle to their own systems and contributing to the Idlak Tangle open source project.

8 Acknowledgements

This work was funded by the Eurostars Programme powered by Eurostars and the European Community under the project "D-Box: A generic dialogue box for multi-lingual conversational applications", and by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 645378 (Aria-VALUSPA).

9 References

[1] H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A. Black, and K. Tokuda, "The HMM-based speech synthesis system (HTS) version 2.0," in Proc. SSW6, 2007, pp. 294-299.
[2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82-97, 2012.
[3] Z.-H. Ling, S.-Y. Kang, H. Zen, A. Senior, M. Schuster, X.-J. Qian, H. M. Meng, and L. Deng, "Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends," Signal Processing Magazine, IEEE, vol. 32, no. 3, pp. 35-52, 2015.
[4] Z.-H. Ling, L. Deng, and D. Yu, "Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 21, no. 10, pp. 2129-2139, 2013.
[5] Z. Wu and S. King, "Improving trajectory modelling for DNN-based speech synthesis by using stacked bottleneck features and minimum trajectory error training," arXiv preprint arXiv:1602.06727, 2016.
[6] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, "The Kaldi speech recognition toolkit," in Proc. IEEE ASRU, 2011.
[7] M. P. Aylett, R. Dall, A. Ghoshal, G. E. Henter, and T. Merritt, "A flexible front-end for HTS," in Proc. Interspeech, 2014.
[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Mixed excitation for HMM-based speech synthesis," in Proceedings of Eurospeech, 2001, pp. 2259-2262.
[9] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in Proc. Interspeech, 2013, pp. 2345-2349.
[10] D. Povey, L. Burget, M. Agarwal, P. Akyazi, F. Kai, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, A. Rastrow, R. C. Rose, P. Schwarz, and S. Thomas, "The subspace Gaussian mixture model: a structured model for speech recognition," Comput. Speech Lang., vol. 25, no. 2, pp. 404-439, 2011.
[11] H. Kawahara, "STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds," Acoust. Sci. Technol., vol. 27, no. 6, pp. 349-353, 2006.
[12] A. Lazaridis, B. Potard, and P. N. Garner, "DNN-based speech synthesis: Importance of input features and training data," in International Conference on Speech and Computer, SPECOM, ser. Lecture Notes in Computer Science, A. Ronzhin, R. Potapova, and N. Fakotakis, Eds. Springer Berlin Heidelberg, 2015, vol. 9319, pp. 193-200.
[13] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. of ICASSP, 2013, pp. 7962-7966.
[14] Y. Qian, Y. Fan, W. Hu, and F. Soong, "On the training aspects of deep neural network (DNN) for parametric TTS synthesis," in Proc. of ICASSP, 2014, pp. 3829-3833.
[15] M. P. Aylett, C. J. Pidcock, and M. E. Fraser, "The CereVoice Blizzard entry 2006: A prototype database unit selection engine," in Proc. BLIZZARD Challenge, 2006.
[16] "Method for the subjective assessment of intermediate quality level of coding systems," International Telecommunications Union Std. ITU-R Rec. BS.1534-1, 2003.
[17] H. N. Boone and D. A. Boone, "Analyzing Likert data," J. Exten.
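
Illustration for Section 3.1 (output normalisation). The text states that each output component is normalised globally to lie between 0.01 and 0.99 before being matched against a sigmoid output layer. The sketch below shows one way this could be done, assuming a per-component min-max scaling fitted on the whole training set. The function name `fit_output_scaler` and the min-max choice are assumptions made for this illustration; they are not taken from the Idlak code.

```python
def fit_output_scaler(rows, low=0.01, high=0.99):
    """Fit a per-component min-max scaler on training targets (hypothetical helper).

    `rows` is a list of equal-length lists of floats (one list per frame).
    Returns two functions: scale(row) and unscale(row).
    """
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]

    def scale(row):
        scaled = []
        for value, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            if span == 0:
                # Constant component: map it to the middle of the target range.
                scaled.append((low + high) / 2)
            else:
                scaled.append(low + (high - low) * (value - lo) / span)
        return scaled

    def unscale(row):
        restored = []
        for value, lo, hi in zip(row, mins, maxs):
            restored.append(lo + (value - low) * (hi - lo) / (high - low))
        return restored

    return scale, unscale


# Toy targets: two output components over three frames.
train_targets = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
scale, unscale = fit_output_scaler(train_targets)
normalised = [scale(row) for row in train_targets]
```

Keeping the targets strictly inside (0, 1) means that a sigmoid output can reach every training value; `unscale` maps network outputs back to the original units at synthesis time.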
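
Illustration for Sections 3.1 and 3.4 (frame splicing and vector sizes). The reported input dimensions, 4125 and 4169, are both multiples of the 11-frame splicing window, and the output size 261 follows from the 2 + 25 + 60 static features with first and second order derivatives. The sketch below checks this arithmetic and shows a simple splicing routine. Reading 375 and 379 as the per-frame label sizes is an inference from the reported numbers, not a figure stated in the text, and repeating the first and last frames at the edges is an assumption of this sketch.

```python
def splice(frames, left=5, right=5):
    """Concatenate each frame with its `left` previous and `right` next frames.

    Frames beyond the start or end of the sequence are replaced by the
    first or last frame (an assumption made for this sketch).
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-left, right + 1):
            index = min(max(t + offset, 0), last)
            window.extend(frames[index])
        spliced.append(window)
    return spliced


CONTEXT = 5 + 1 + 5                 # 5 back, current frame, 5 front
DM_FRAME_DIM = 4125 // CONTEXT      # 375 values per frame (inferred)
AM_FRAME_DIM = 4169 // CONTEXT      # 379 values per frame (inferred)
STATIC_DIM = 2 + 25 + 60            # excitation + band aperiodicity + filter
OUTPUT_DIM = 3 * STATIC_DIM         # static + first + second derivatives = 261

toy = splice([[1], [2], [3]], left=1, right=1)
```

The difference of 4 between the two per-frame sizes is consistent with the acoustic model input carrying the state duration, the phone duration and the two quantized positions in addition to the label features, although the text does not spell this breakdown out.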
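
Illustration for Section 3.4 (label oversampling). The state-level labels are duplicated to match the predicted durations, and each frame is made distinct by appending a quantized position within the state (5 values) and within the phone (10 values). The sketch below is one possible reading of that step. The function names and the uniform binning rule are hypothetical; the text only gives the number of distinct values, not how positions are quantized.

```python
def quantize_position(index, length, levels):
    """Map a 0-based frame index within a unit of `length` frames to a bin in [0, levels)."""
    return min(index * levels // length, levels - 1)


def oversample_phone(state_labels, state_durations, state_levels=5, phone_levels=10):
    """Expand the state-level label vectors of one phone into frame-level vectors.

    `state_labels` holds one numeric label vector per HMM state and
    `state_durations` the matching durations in frames (integers).
    States with a duration of 0 produce no frames.
    """
    phone_duration = sum(state_durations)
    frames = []
    phone_pos = 0
    for label, duration in zip(state_labels, state_durations):
        for state_pos in range(duration):
            frames.append(list(label) + [
                quantize_position(state_pos, duration, state_levels),
                quantize_position(phone_pos, phone_duration, phone_levels),
            ])
            phone_pos += 1
    return frames


# Toy phone with two states lasting 2 and 3 frames.
frames = oversample_phone([[0.1], [0.2]], [2, 3])
```

In this toy case the phone yields five frames; the two appended values differ from frame to frame even though the label part is constant within a state.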
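
Illustration for Section 3.5 (duration consistency). The predicted durations are post-processed so that the state durations within a phone sum to the phone duration. The text does not say how this is done; the sketch below assumes the phone duration is kept and the state durations are rescaled proportionally, with rounding by the largest-remainder rule. Both the function name `reconcile_durations` and this strategy are assumptions for illustration only.

```python
def reconcile_durations(phone_frames, state_frames):
    """Return integer state durations that sum to the rounded phone duration.

    `phone_frames` is the predicted phone duration in frames (float) and
    `state_frames` the predicted per-state durations (floats).
    """
    target = max(int(round(phone_frames)), 0)
    raw = [max(d, 0.0) for d in state_frames]
    total = sum(raw)
    if total == 0:
        # No usable state prediction: fall back to an even split.
        raw = [1.0] * len(raw)
        total = float(len(raw))
    scaled = [d * target / total for d in raw]
    result = [int(s) for s in scaled]          # floor of each scaled duration
    leftover = target - sum(result)
    # Give the remaining frames to the states with the largest fractional parts.
    order = sorted(range(len(scaled)),
                   key=lambda i: scaled[i] - result[i],
                   reverse=True)
    for i in order[:leftover]:
        result[i] += 1
    return result


durations = reconcile_durations(10.0, [2.2, 3.1, 1.9, 2.0, 1.5])
```

The reconciled durations can then be passed to the oversampling step, so that the number of generated frames matches the predicted phone length exactly.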