
WAVENET: A GENERATIVE MODEL FOR RAW AUDIO

Aäron van den Oord, Karen Simonyan, Nal Kalchbrenner, Sander Dieleman, Oriol Vinyals, Andrew Senior, Heiga Zen†, Alex Graves, Koray Kavukcuoglu

Google DeepMind, London, UK
†Google, London, UK

ABSTRACT

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Chinese. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

1 INTRODUCTION

This work explores raw audio generation techniques, inspired by recent advances in neural autoregressive generative models that model complex distributions such as images (van den Oord et al., 2016a;b) and text (Józefowicz et al., 2016). Modeling joint probabilities over pixels or words using neural architectures as products of conditional distributions yields state-of-the-art generation. Remarkably, these architectures are able to model distributions over thousands of random variables (e.g. 64×64 pixels as in PixelRNN (van den Oord et al., 2016a)). The question this paper addresses is whether similar approaches can succeed in generating wideband raw audio waveforms, which are signals with very high temporal resolution, at least 16,000 samples per second (see Fig. 1).

Figure 1: A second of generated speech.

This paper introduces WaveNet, an audio generative model based on the PixelCNN (van den Oord et al., 2016a;b) architecture. The main contributions of this work are as follows:

• We show that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.


• In order to deal with long-range temporal dependencies needed for raw audio generation, we develop new architectures based on dilated causal convolutions, which exhibit very large receptive fields.

• We show that a single model can be used to generate different voices, conditioned on a speaker identity.

• The same architecture shows strong results when tested on a small speech recognition dataset, and is promising when used to generate other audio modalities such as music.

We believe that WaveNets provide a generic and flexible framework for tackling many applications that rely on audio generation (e.g. TTS, music, speech enhancement, voice conversion, source separation).

2 WAVENET

In this paper audio signals are modelled with a generative model operating directly on the raw audio waveform. The joint probability of a waveform x = {x_1, ..., x_T} is factorised as a product of conditional probabilities as follows:

p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})    (1)

Each audio sample x_t is therefore conditioned on the samples at all previous timesteps.

Similarly to PixelCNNs (van den Oord et al., 2016a;b), the conditional probability distribution is modelled by a stack of convolutional layers. There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. The model outputs a categorical distribution over the next value x_t with a softmax layer. The model is optimized to maximize the log-likelihood of the data w.r.t. the parameters. Because log-likelihoods are tractable, we tune hyper-parameters on a validation set and can easily measure overfitting/underfitting.

2.1 DILATED CAUSAL CONVOLUTIONS

Figure 2: Visualization of a stack of causal convolutional layers.

The main ingredient of WaveNet is causal convolutions. By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data: the prediction p(x_{t+1} | x_1, ..., x_t) emitted by the model at timestep t cannot depend on any of the future timesteps x_{t+1}, x_{t+2}, ..., x_T. This is visualized in Fig. 2. For images, the equivalent of a causal convolution is a masked convolution (van den Oord et al., 2016a), which can be implemented by constructing a mask tensor and multiplying it elementwise with the convolution kernel before applying it. For 1-D data such as audio, one can more easily implement this by shifting the output of a normal convolution by a few timesteps.

At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of ground truth x are known. When generating with the model, the predictions are sequential: after each sample is predicted, it is fed back into the network to predict the next sample.

Because models with causal convolutions do not have recurrent connections, they are typically faster to train than RNNs, especially when applied to very long sequences. One of the problems of causal convolutions is that they require many layers, or large filters, to increase the receptive field. For example, in Fig. 2 the receptive field is only 5 (= #layers + filter length − 1). In this paper we use dilated convolutions to increase the receptive field by orders of magnitude, without greatly increasing computational cost.
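Before moving on to dilated convolutions, here is a minimal sketch of the "shift a normal convolution" trick in Python/NumPy (not code from the paper; the helper name causal_conv1d is hypothetical): a 1-D convolution is made causal by left-padding the input with filter length − 1 zeros, so the output at timestep t depends only on inputs at timesteps ≤ t.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[0..t].

    Implemented by left-padding the input with (len(kernel) - 1) zeros and
    taking one dot product per timestep, which is equivalent to shifting the
    output of a normal convolution so that it never looks ahead.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

x = np.arange(6, dtype=float)                    # toy "waveform"
print(causal_conv1d(x, np.array([0.5, 0.5])))    # [0.  0.5 1.5 2.5 3.5 4.5]
```

At training time this filter is applied to all timesteps of the known waveform in parallel; at generation time the same filter is applied one step at a time as each predicted sample is appended to the input.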

A dilated convolution (also called à trous, or convolution with holes) is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros, but is significantly more efficient. A dilated convolution effectively allows the network to operate on a coarser scale than with a normal convolution. This is similar to pooling or strided convolutions, but here the output has the same size as the input. As a special case, dilated convolution with dilation 1 yields the standard convolution. Fig. 3 depicts dilated causal convolutions for dilations 1, 2, 4, and 8. Dilated convolutions have previously been used in various contexts, e.g. signal processing (Holschneider et al., 1989; Dutilleux, 1989) and image segmentation (Chen et al., 2015; Yu & Koltun, 2016).

Figure 3: Visualization of a stack of dilated causal convolutional layers (dilations 1, 2, 4, and 8 from input to output).
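As an illustration of the skipping described above, the sketch below (NumPy, hypothetical helper, not code from the paper) implements a dilated causal convolution by spacing the filter taps `dilation` samples apart; with dilation 1 it reduces to the causal convolution sketched earlier.

```python
import numpy as np

def dilated_causal_conv1d(x, kernel, dilation=1):
    """Dilated causal 1-D convolution with filter taps spaced `dilation` apart.

    Output[t] depends on x[t], x[t - dilation], ..., x[t - (k-1)*dilation],
    i.e. only on current and past samples; the output keeps the input length.
    """
    k = len(kernel)
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    out = np.empty(len(x))
    for t in range(len(x)):
        taps = padded[t:t + pad + 1:dilation]   # k samples, `dilation` apart
        out[t] = taps @ kernel
    return out

x = np.arange(8, dtype=float)
print(dilated_causal_conv1d(x, np.array([1.0, 1.0]), dilation=4))  # x[t] + x[t-4]
```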

Stacked dilated convolutions efficiently enable very large receptive fields with just a few layers, while preserving the input resolution throughout the network. In this paper, the dilation is doubled for every layer up to a certain point and then repeated, e.g.

1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512.

The intuition behind this configuration is two-fold. First, exponentially increasing the dilation factor results in exponential receptive field growth with depth (Yu & Koltun, 2016). For example, each 1, 2, 4, ..., 512 block has a receptive field of size 1024, and can be seen as a more efficient and discriminative (non-linear) counterpart of a 1×1024 convolution. Second, stacking these blocks further increases the model capacity and the receptive field size.
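As a quick check of the numbers quoted here (a sketch only, assuming filter length 2 throughout): each layer with dilation d extends the receptive field by d samples, so one 1, 2, 4, ..., 512 block reaches 1024 samples and three stacked blocks reach 3 × 1023 + 1 = 3070.

```python
# Receptive field of stacked dilated causal convolutions with filter length 2:
# each layer with dilation d adds d extra past samples to the receptive field.
block = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
print(1 + sum(block))                 # 1024: one block
print(1 + sum(block * 3))             # 3070: three stacked blocks
```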

2.2 SOFTMAX DISTRIBUTIONS

One approach to modeling the conditional distributions p(x_t | x_1, ..., x_{t-1}) over the individual audio samples would be to use a mixture model such as a mixture density network (Bishop, 1994) or a mixture of conditional Gaussian scale mixtures (MCGSM) (Theis & Bethge, 2015). However, van den Oord et al. (2016a) showed that a softmax distribution tends to work better, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values). One of the reasons is that a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape.
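For intuition, a toy sketch of the categorical output described here (assumed NumPy code, not from the paper): a softmax over 256 logits can place probability mass anywhere, including on several separate peaks, which a single parametric density such as a Gaussian could not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network output for one timestep: 256 unnormalized scores.
logits = rng.normal(size=256)

# Softmax gives a categorical distribution over the 256 quantized amplitudes;
# its shape is unconstrained (it may be multi-modal, skewed, heavy-tailed, ...).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

next_bin = rng.choice(256, p=probs)   # sample the next quantized value
```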

Because raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), a softmax layer would need to output 65,536 probabilities per timestep to model all possible values. To make this more tractable, we first apply a μ-law companding transformation (ITU-T, 1988) to the data, and then quantize it to 256 possible values:

f(x_t) = \operatorname{sign}(x_t)\,\frac{\ln(1 + \mu\,|x_t|)}{\ln(1 + \mu)}


where −1 < x_t < 1 and μ = 255.
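A minimal NumPy sketch of this preprocessing step (the companding formula follows the equation above; the exact rounding into 256 integer bins is an assumption of this sketch, since the paper only states that the companded signal is quantized to 256 values):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Mu-law companding followed by quantization to mu + 1 = 256 levels.

    x is assumed to be a float waveform already scaled into (-1, 1).
    """
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # in (-1, 1)
    bins = np.floor((companded + 1) / 2 * mu + 0.5)                   # round to [0, mu]
    return bins.astype(np.int64)

x = 0.8 * np.sin(np.linspace(0, 2 * np.pi, 8))
print(mu_law_encode(x))   # eight integers in [0, 255], one per sample
```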
