
WAVENET: A GENERATIVE MODEL FOR RAW AUDIO

Aäron van den Oord, Karen Simonyan, Nal Kalchbrenner, Sander Dieleman, Oriol Vinyals, Andrew Senior, Heiga Zen†, Alex Graves, Koray Kavukcuoglu

Google DeepMind, London, UK
†Google, London, UK

ABSTRACT

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Chinese. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

1 INTRODUCTION

This work explores raw audio generation techniques, inspired by recent advances in neural autoregressive generative models that model complex distributions such as images (van den Oord et al., 2016a;b) and text (Józefowicz et al., 2016). Modeling joint probabilities over pixels or words using neural architectures as products of conditional distributions yields state-of-the-art generation. Remarkably, these architectures are able to model distributions over thousands of random variables (e.g. 64×64 pixels as in PixelRNN (van den Oord et al., 2016a)). The question this paper addresses is whether similar approaches can succeed in generating wideband raw audio waveforms, which are signals with very high temporal resolution, at least 16,000 samples per second (see Fig. 1).

Figure 1: A second of generated speech.

This paper introduces WaveNet, an audio generative model based on the PixelCNN (van den Oord et al., 2016a;b) architecture. The main contributions of this work are as follows:

• We show that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.


• In order to deal with long-range temporal dependencies needed for raw audio generation, we develop new architectures based on dilated causal convolutions, which exhibit very large receptive fields.

• We show that a single model can be used to generate different voices, conditioned on a speaker identity.

• The same architecture shows strong results when tested on a small speech recognition dataset, and is promising when used to generate other audio modalities such as music.

We believe that WaveNets provide a generic and flexible framework for tackling many applications that rely on audio generation (e.g. TTS, music, speech enhancement, voice conversion, source separation).

2 WAVENET

In this paper audio signals are modelled with a generative model operating directly on the raw audio waveform. The joint probability of a waveform x = {x_1, ..., x_T} is factorised as a product of conditional probabilities as follows:

p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})    (1)

Each audio sample x_t is therefore conditioned on the samples at all previous timesteps.

Similarly to PixelCNNs (van den Oord et al., 2016a;b), the conditional probability distribution is modelled by a stack of convolutional layers. There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. The model outputs a categorical distribution over the next value x_t with a softmax layer. The model is optimized to maximize the log-likelihood of the data w.r.t. the parameters. Because log-likelihoods are tractable, we tune hyper-parameters on a validation set and can easily measure overfitting/underfitting.

2.1 DILATED CAUSAL CONVOLUTIONS

Figure 2: Visualization of a stack of causal convolutional layers.

The main ingredient of WaveNet is causal convolutions. By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data: the prediction p(x_{t+1} | x_1, ..., x_t) emitted by the model at timestep t cannot depend on any of the future timesteps x_{t+1}, x_{t+2}, ..., x_T. This is visualized in Fig. 2. For images, the equivalent of a causal convolution is a masked convolution (van den Oord et al., 2016a), which can be implemented by constructing a mask tensor and multiplying it elementwise with the convolution kernel before applying it. For 1-D data such as audio, one can more easily implement this by shifting the output of a normal convolution by a few timesteps.

At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of ground truth x are known. When generating with the model, the predictions are sequential: after each sample is predicted, it is fed back into the network to predict the next sample.

Because models with causal convolutions do not have recurrent connections, they are typically faster to train than RNNs, especially when applied to very long sequences. One of the problems of causal convolutions is that they require many layers, or large filters, to increase the receptive field. For example, in Fig. 2 the receptive field is only 5 (= #layers + filter length − 1). In this paper we use dilated convolutions to increase the receptive field by orders of magnitude, without greatly increasing computational cost.
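Before moving on to dilated convolutions, here is a minimal sketch of the "shift a normal convolution" trick in Python/NumPy (not code from the paper; the helper name causal_conv1d is hypothetical): a 1-D convolution is made causal by left-padding the input with filter length − 1 zeros, so the output at timestep t depends only on inputs at timesteps ≤ t.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[0..t].

    Implemented by left-padding the input with (len(kernel) - 1) zeros and
    taking one dot product per timestep, which is equivalent to shifting the
    output of a normal convolution so that it never looks ahead.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

x = np.arange(6, dtype=float)                    # toy "waveform"
print(causal_conv1d(x, np.array([0.5, 0.5])))    # [0.  0.5 1.5 2.5 3.5 4.5]
```

At training time this filter is applied to all timesteps of the known waveform in parallel; at generation time the same filter is applied one step at a time as each predicted sample is appended to the input.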

A dilated convolution (also called à trous, or convolution with holes) is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros, but is significantly more efficient. A dilated convolution effectively allows the network to operate on a coarser scale than with a normal convolution. This is similar to pooling or strided convolutions, but here the output has the same size as the input. As a special case, dilated convolution with dilation 1 yields the standard convolution. Fig. 3 depicts dilated causal convolutions for dilations 1, 2, 4, and 8. Dilated convolutions have previously been used in various contexts, e.g. signal processing (Holschneider et al., 1989; Dutilleux, 1989) and image segmentation (Chen et al., 2015; Yu & Koltun, 2016).

Figure 3: Visualization of a stack of dilated causal convolutional layers (dilations 1, 2, 4, and 8 from input to output).
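As an illustration of the skipping described above, the sketch below (NumPy, hypothetical helper, not code from the paper) implements a dilated causal convolution by spacing the filter taps `dilation` samples apart; with dilation 1 it reduces to the causal convolution sketched earlier.

```python
import numpy as np

def dilated_causal_conv1d(x, kernel, dilation=1):
    """Dilated causal 1-D convolution with filter taps spaced `dilation` apart.

    Output[t] depends on x[t], x[t - dilation], ..., x[t - (k-1)*dilation],
    i.e. only on current and past samples; the output keeps the input length.
    """
    k = len(kernel)
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    out = np.empty(len(x))
    for t in range(len(x)):
        taps = padded[t:t + pad + 1:dilation]   # k samples, `dilation` apart
        out[t] = taps @ kernel
    return out

x = np.arange(8, dtype=float)
print(dilated_causal_conv1d(x, np.array([1.0, 1.0]), dilation=4))  # x[t] + x[t-4]
```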

Stacked dilated convolutions efficiently enable very large receptive fields with just a few layers, while preserving the input resolution throughout the network. In this paper, the dilation is doubled for every layer up to a certain point and then repeated, e.g.

1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512.

The intuition behind this configuration is two-fold. First, exponentially increasing the dilation factor results in exponential receptive field growth with depth (Yu & Koltun, 2016). For example, each 1, 2, 4, ..., 512 block has a receptive field of size 1024, and can be seen as a more efficient and discriminative (non-linear) counterpart of a 1×1024 convolution. Second, stacking these blocks further increases the model capacity and the receptive field size.
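As a quick check of the numbers quoted here (a sketch only, assuming filter length 2 throughout): each layer with dilation d extends the receptive field by d samples, so one 1, 2, 4, ..., 512 block reaches 1024 samples and three stacked blocks reach 3 × 1023 + 1 = 3070.

```python
# Receptive field of stacked dilated causal convolutions with filter length 2:
# each layer with dilation d adds d extra past samples to the receptive field.
block = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
print(1 + sum(block))                 # 1024: one block
print(1 + sum(block * 3))             # 3070: three stacked blocks
```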

2.2 SOFTMAX DISTRIBUTIONS

One approach to modeling the conditional distributions p(x_t | x_1, ..., x_{t-1}) over the individual audio samples would be to use a mixture model such as a mixture density network (Bishop, 1994) or a mixture of conditional Gaussian scale mixtures (MCGSM) (Theis & Bethge, 2015). However, van den Oord et al. (2016a) showed that a softmax distribution tends to work better, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values). One of the reasons is that a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape.
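For intuition, a toy sketch of the categorical output described here (assumed NumPy code, not from the paper): a softmax over 256 logits can place probability mass anywhere, including on several separate peaks, which a single parametric density such as a Gaussian could not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network output for one timestep: 256 unnormalized scores.
logits = rng.normal(size=256)

# Softmax gives a categorical distribution over the 256 quantized amplitudes;
# its shape is unconstrained (it may be multi-modal, skewed, heavy-tailed, ...).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

next_bin = rng.choice(256, p=probs)   # sample the next quantized value
```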

Because raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), a softmax layer would need to output 65,536 probabilities per timestep to model all possible values. To make this more tractable, we first apply a μ-law companding transformation (ITU-T, 1988) to the data, and then quantize it to 256 possible values:

f(x_t) = \operatorname{sign}(x_t)\,\frac{\ln(1 + \mu\,|x_t|)}{\ln(1 + \mu)}


where −1 < x_t < 1 and μ = 255.
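A minimal NumPy sketch of this preprocessing step (the companding formula follows the equation above; the exact rounding into 256 integer bins is an assumption of this sketch, since the paper only states that the companded signal is quantized to 256 values):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Mu-law companding followed by quantization to mu + 1 = 256 levels.

    x is assumed to be a float waveform already scaled into (-1, 1).
    """
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # in (-1, 1)
    bins = np.floor((companded + 1) / 2 * mu + 0.5)                   # round to [0, mu]
    return bins.astype(np.int64)

x = 0.8 * np.sin(np.linspace(0, 2 * np.pi, 8))
print(mu_law_encode(x))   # eight integers in [0, 255], one per sample
```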
