Improving the Quality of Automated DVD Subtitles
via Example-Based Machine Translation

Stephen Armstrong, Andy Way
National Centre for Language Technology
School of Computing
Dublin City University
Dublin 9, Ireland
{sarmstrong,away}@computing.dcu.ie

Colm Caffrey, Marian Flanagan, Dorothy Kenny, Minako O'Hagan
Centre for Translation and Textual Studies
SALIS
Dublin City University
Dublin 9, Ireland
{colm.caffrey,marian.flanagan,dorothy.kenny,minako.ohagan}@dcu.ie

October 24, 2006
Abstract

Denoual (2005) discovered that, contrary to popular belief, an EBMT system trained on heterogeneous data produced significantly better results than a system trained on homogeneous data. Using similar evaluation metrics and a few additional ones, in this paper we show that this does not hold true for the automated translation of subtitles. In fact, our system (when trained on homogeneous data) shows a relative increase of 74% BLEU in the language direction German-English and 86% BLEU English-German. Furthermore, we show that increasing the amount of heterogeneous data results in 'bad examples' being put forward as translation candidates, thus lowering the translation quality.
1 Introduction

The demand on subtitlers to produce high-quality subtitles in an ever-diminishing space of time is at a record high, with many believing that a technology-based translation approach is the way forward (O'Hagan, 2003; Carroll, 1990; Gambier, 2005). Following on from recent research (Armstrong et al., 2006a,b) where we documented our motivations for using Example-Based Machine Translation (EBMT) in the subtitling domain and produced some rudimentary translations, we have now come to the stage of improving the output quality of our system. EBMT relies heavily on a parallel corpus, on which the system is trained. The question then arises: what type of corpus will improve translation quality the most? A language-specific corpus, or a corpus containing out-of-domain data?
This paper aims to investigate whether a correlation exists between the quality of DVD subtitles and the corpus used to train the system. We present a modular Machine Translation (MT) system, newly developed at the NCLT in Dublin City University (Stroppa et al., 2006), which we use to translate subtitles from English into German by way of EBMT. The system was loaded with separate sets of both homogeneous data (ripped subtitles) and heterogeneous data (parliamentary proceedings), and a number of experiments were conducted to determine which data set produced the highest quality output.

The remainder of this paper is structured as follows: in Section 2, we briefly discuss recent research in the area of homogeneous and heterogeneous data relevant to EBMT. We give an overview of EBMT and the Marker Hypothesis in Section 3. Section 4 introduces the system and details the chunking, chunk alignment, and translation processes. In Section 5, we present the different types of evaluation conducted and discuss the results the system achieved when loaded with the different training data sets. Finally, we conclude the paper with a summary of the results from our evaluation and give an outlook on possible future research in this area.
2 Heterogeneous versus Homogeneous Data

With almost all research in MT today being carried out using corpus-based techniques, it is strange to note that there has been little study into the effect the training corpus has on the final output of the system. Up until recently it was assumed that corpus-based MT systems achieve better results when trained with homogeneous data. Denoual (2005) set out to reassess this general assumption, and discovered that, contrary to this belief, his system yielded better results when trained on heterogeneous data, compared with equal amounts of homogeneous data. Using the BTEC corpus (a multilingual speech corpus comprised of tourism-related sentences), he randomly extracted 510 Japanese sentences and used these as input to the system. The system was then trained on increasing amounts of data from the remainder of the corpus, and automatic evaluation metrics (BLEU, NIST and mWER) were relied on to estimate the translation quality of the output produced by the system. Based on these three measures, he shows that for increasing amounts of data, translation quality improves across the board. More notably, when trained on the random heterogeneous data, translation quality is found to be either equal to or higher than when using homogeneous data for training.

Denoual's findings prove true for larger amounts of data, but when trained on relatively small amounts (29,000 sentences and less), translation seemed to be of higher quality using the homogeneous data (based on NIST scores). No reason is given for this, and it is unclear whether this cut-off point can be generalised to types of corpora other than the sets he used during his experiments.
Obviously the nature of the data used to train a system will have implications for translation quality; however, one also has to take into account the nature of the data that will be used as input to the system. Subtitles can appear very different from text in other domains; a quick glance at the statistics for our homogeneous corpus shows that sentences are much shorter compared with sentences from the Europarl corpus (see Section 5.1, Table 2). Even this one simple statistic suggests that we might be better off using a corpus of subtitles to train the system. As no previous research has been carried out with respect to the specific task of translating subtitles using a corpus-based approach, we believe that one cannot generalise that either homogeneous or heterogeneous data will yield better results, thus warranting its own investigation.
3 EBMT and the Marker Hypothesis

The approach we take to the automatic translation of subtitles is Example-Based Machine Translation (EBMT). This is based on the intuition that humans make use of previously seen translation examples to translate unseen input. The system is trained on an aligned bilingual corpus, from which 'examples' are extracted and stored. During translation, the input sentence is segmented, and its constituents are matched against this example database, with the corresponding target-language examples being recombined to produce the final output.

Even though EBMT draws some parallels with Translation Memory (TM), there is one essential difference: TM software needs a human present at all times during the translation process, and does not translate automatically. EBMT, on the other hand, is an essentially automatic technique; having located a set of relevant examples, the system recombines them to derive a final translation, rather than handing them over to the human to decide what to do with them. Another major benefit of EBMT is that the search goes beyond sentence level, with subsentential examples being obtained, meaning we do not miss out on matches which may not be seen by looking at the sentence as a whole. Recently, the two paradigms have grown more and more similar (Simard and Langlais, 2001), with second-generation TM systems adopting a subsentential approach to extracting matches and postulating a translation proposal based on these matches.
3.1 Marker-Based Chunking

As mentioned in Section 3.2, the input, along with the source-target training corpus, has to be 'chunked' in order to obtain subsentential examples. The Marker Hypothesis (Green, 1979) states that "all natural languages are marked for complex syntactic structure at surface form by a closed set of specific lexemes and morphemes which appear in a limited set of grammatical contexts and which signal that context". We have carried out several experiments (Way and Gough, 2005; Stroppa et al., 2006; Groves and Way, 2006) using this idea as the basis for the chunking component of our EBMT system, and found it to be a very efficient way of segmenting source and target sentences into smaller chunks. A set of closed-class (or marker) words, such as determiners, conjunctions, prepositions, and pronouns, is used to indicate where one chunk ends and the next one begins (Table 1), with the constraint that each chunk must contain at least one content (non-marker) word. To make this process a little clearer, let us look at the following English-German example in (1):
Determiners           <DET>
Quantifiers           <Q>
Prepositions          <P>
Conjunctions          <C>
WH-Adverbs            <WH>
Possessive Pronouns   <POSSPRON>
Personal Pronouns     <PERSPRO>
Punctuation           <PUNC>

Table 1: Some of the tags used during the chunking phase
(1) Darling, I'm sorry but I've lost my key
    → Mein Guter, es tut mir leid aber ich habe meinen Schlüssel verloren

For the first step we automatically tag each closed-class word with its marker tag, as in (2):

(2) Darling <PUNC> , <PERSPRO> I am sorry <CONJ> but <PERSPRO> I've lost <POSSPRO> my key
    → Mein Guter <PUNC> , <PERSPRO> es tut <PERSPRO> mir leid <CONJ> aber <PERSPRO> ich habe <POSSPRO> meinen Schlüssel verloren

As every chunk must contain at least one non-marker word, we just keep the first marker tag when multiple marker words appear alongside each other and discard the rest (3):

(3) Darling <PUNC> , I am sorry <CONJ> but I've lost <POSSPRO> my key
    → Mein Guter <PUNC> , es tut <PERSPRO> mir leid <CONJ> aber ich habe <POSSPRO> meinen Schlüssel verloren
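To make the chunking procedure concrete, here is a minimal sketch in Python, assuming a tiny hand-written marker lexicon and a pre-tokenised input; the lexicon, the exact tag names and the tokenisation are illustrative only and do not correspond to the actual MaTrEx chunking component.

# Illustrative marker lexicon; tags follow the style of example (3).
MARKERS = {
    ",": "<PUNC>", "but": "<CONJ>", "and": "<CONJ>",
    "i": "<PERSPRO>", "my": "<POSSPRO>",
    "in": "<P>", "with": "<P>", "the": "<DET>",
}

def chunk(tokens):
    """Segment a token list so that each chunk starts at a marker word
    and contains at least one content (non-marker) word."""
    chunks, current, has_content = [], [], False
    for tok in tokens:
        tag = MARKERS.get(tok.lower())
        if tag is None:                # content word: extend the current chunk
            current.append(tok)
            has_content = True
        elif has_content:              # marker word opens a new chunk
            chunks.append(" ".join(current))
            current, has_content = [tag, tok], False
        elif current:                  # consecutive markers: keep only the first tag
            current.append(tok)
        else:                          # marker at the very start of a chunk
            current = [tag, tok]
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk("Darling , I am sorry but I have lost my key".split()))
# ['Darling', '<PUNC> , I am sorry', '<CONJ> but I have lost', '<POSSPRO> my key']

Run on a tokenised version of (1), the sketch yields the same segmentation implied by (3): boundaries fall at the comma, at "but" and at "my", and consecutive marker words are absorbed into a single chunk headed by the first marker's tag.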
3.2 EBMT - an Example

The task for the EBMT system is to translate the input sentence in (4), given the aligned data in (5) as its training corpus.

(4) Ich wohne in Paris mit meiner Frau

(5) Ich wohne in Dublin ↔ I live in Dublin
    Es gibt viel zu tun in Paris ↔ There's lots to do in Paris
    Ich gehe gern ins Kino mit meiner Frau ↔ I love going to the cinema with my wife

The data is then chunked (based on the Marker Hypothesis), with useful chunks and their target-language partners being extracted and stored for later use (6), and less useful chunks being cast
aside. These useful chunk pairs are identified using a range of similarity metrics (see Section 4.3.1).
(6) Ich wohne ↔ I live
    in Dublin ↔ in Dublin
    Es gibt viel ↔ There's lots
    zu tun ↔ to do
    in Paris ↔ in Paris
    Ich gehe gern ↔ I love going
    ins Kino ↔ to the cinema
    mit meiner Frau ↔ with my wife

We start the translation process by searching the German side of the original corpus in (5) to see if it contains the whole sentence. It does not, so we chunk the input sentence into smaller constituents (7), using the same hypothesis used for segmenting the original corpus, and search for these in the corpus of aligned chunks (6).

(7) Ich wohne
    in Paris
    mit meiner Frau

Having found these chunks in our database, the decoder recombines them (see Section 4.4) to produce the final translation in (8):

(8) I live in Paris with my wife
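The lookup-and-recombine step of this worked example can be sketched as follows, again in Python and assuming the aligned chunk database in (6) has already been built; the real decoder (Section 4.4) scores, reorders and selects among competing candidates, whereas this toy version simply concatenates the target chunks in source order.

# Aligned chunk pairs from (6), stored as a simple dictionary.
chunk_db = {
    "Ich wohne": "I live",
    "in Dublin": "in Dublin",
    "Es gibt viel": "There's lots",
    "zu tun": "to do",
    "in Paris": "in Paris",
    "Ich gehe gern": "I love going",
    "ins Kino": "to the cinema",
    "mit meiner Frau": "with my wife",
}

# Whole-sentence pairs from the original corpus (5), checked first.
sentence_db = {"Ich wohne in Dublin": "I live in Dublin"}

def translate(source, source_chunks):
    # Prefer a full-sentence match from the original aligned corpus.
    if source in sentence_db:
        return sentence_db[source]
    # Otherwise translate chunk by chunk and concatenate monotonically.
    return " ".join(chunk_db.get(c, c) for c in source_chunks)

print(translate("Ich wohne in Paris mit meiner Frau",
                ["Ich wohne", "in Paris", "mit meiner Frau"]))
# -> I live in Paris with my wife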
4 System Architecture

We use the MaTrEx (Machine Translation using Examples) system to produce the output used in our experiments in Section 5. The system is a corpus-based MT engine, and is designed in a modular fashion, allowing the user to extend and re-implement modules with ease. The main modules are as follows (a minimal sketch of how they might fit together is given after the list):

• Word Alignment Module: takes as input an aligned corpus, and produces a set of word alignments;

• Chunking Module: takes as input an aligned corpus, and produces a corpus of source and target chunks;

• Chunk Alignment Module: takes in source and target chunks, and aligns them sentence by sentence;

• Decoder: searches for a translation using the original aligned corpus and the derived word and chunk alignments.
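As a purely structural illustration, the pipeline formed by these modules might be wired together as below; the class and method names are hypothetical, since the paper does not describe the actual MaTrEx interfaces.

class WordAlignmentModule:
    def align(self, bitext):
        """Aligned corpus -> set of word alignments."""
        ...

class ChunkingModule:
    def chunk(self, bitext):
        """Aligned corpus -> (source chunks, target chunks)."""
        ...

class ChunkAlignmentModule:
    def align(self, src_chunks, tgt_chunks):
        """Align source and target chunks sentence by sentence."""
        ...

class Decoder:
    def translate(self, sentence, bitext, word_links, chunk_links):
        """Search for a translation using the corpus and derived alignments."""
        ...

def run_pipeline(bitext, sentence):
    # Each stage feeds the next; the decoder consumes everything produced upstream.
    word_links = WordAlignmentModule().align(bitext)
    src_chunks, tgt_chunks = ChunkingModule().chunk(bitext)
    chunk_links = ChunkAlignmentModule().align(src_chunks, tgt_chunks)
    return Decoder().translate(sentence, bitext, word_links, chunk_links)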