Improving the Quality of Automated DVD Subtitles
via Example-Based Machine Translation
Stephen Armstrong, Andy Way
National Centre for Language Technology
School of Computing
Dublin City University
Dublin 9, Ireland
{sarmstrong,away}@computing.dcu.ie
Colm Caffrey, Marian Flanagan, Dorothy Kenny, Minako O'Hagan
Centre for Translation and Textual Studies
SALIS
Dublin City University
Dublin 9, Ireland
{colm.caffrey,marian.flanagan,dorothy.kenny,minako.ohagan}@dcu.ie
October 24, 2006
Abstract
Denoual (2005) discovered that, contrary to popular belief, an EBMT system trained on
heterogeneous data produced significantly better results than a system trained on homogeneous
data. Using similar evaluation metrics and a few additional ones, in this paper we show that this
does not hold true for the automated translation of subtitles. In fact, our system (when trained
on homogeneous data) shows a relative increase of 74% BLEU in the language direction
German-English and 86% BLEU English-German. Furthermore, we show that increasing the amount of
heterogeneous data results in 'bad examples' being put forward as translation candidates, thus
lowering the translation quality.
1 Introduction
The demand on subtitlers to produce high-quality subtitles in an ever-diminishing space of time
is at a record high, with many believing that a technology-based translation approach is the way
forward (O'Hagan, 2003; Carroll, 1990; Gambier, 2005). Following on from recent research
(Armstrong et al., 2006a,b) where we documented our motivations for using Example-Based Machine
Translation (EBMT) in the subtitling domain and produced some rudimentary translations, we
have now come to the stage of improving the output quality of our system. EBMT relies heavily
on a parallel corpus, on which the system is trained. The question then arises: what type of corpus
will improve translation quality the most? A language-specific corpus, or a corpus containing
out-of-domain data?
This paper aims to investigate whether a correlation exists between the quality of DVD
subtitles and the corpus used to train the system. We present a modular Machine Translation (MT)
system, newly developed at the NCLT in Dublin City University (Stroppa et al., 2006), which we
use to translate subtitles from English into German by way of EBMT. The system was loaded with
separate sets of both homogeneous data (ripped subtitles) and heterogeneous data (parliamentary
proceedings), and a number of experiments were conducted to determine which data set produced
the highest quality output.
The remainder of this paper is structured as follows. In Section 2, we briefly discuss recent
research in the area of homogeneous and heterogeneous data relevant to EBMT. We give an overview
of EBMT and the Marker Hypothesis in Section 3. Section 4 introduces the system and details the
chunking, chunk alignment, and translation processes. In Section 5, we present the different types
of evaluation conducted and discuss the results the system achieved when loaded with the different
training data sets. Finally, we conclude the paper with a summary of the results from our evaluation
and give an outlook on possible future research in this area.
2 Heterogeneous versus Homogeneous Data
With almost all research in MT today being carried out using corpus-based techniques, it is strange
to note that there has been little study into the effect the training corpus has on the final output of
the system. Until recently it was assumed that corpus-based MT systems achieve better results
when trained with homogeneous data. Denoual (2005) set out to reassess this general assumption,
and discovered that, contrary to this belief, his system yielded better results when trained on
heterogeneous data, compared with equal amounts of homogeneous data. Using the BTEC corpus
(a multilingual speech corpus comprising tourism-related sentences), he randomly extracted 510
Japanese sentences and used these as input to the system. The system was then trained on
increasing amounts of data from the remainder of the corpus, and automatic evaluation metrics (BLEU,
NIST and mWER) were relied on to estimate the translation quality of the output produced by the
system. Based on these three measures, he shows that for increasing amounts of data, translation
quality improves across the board. More notably, when trained on the random heterogeneous data,
translation quality is found to be equal to or higher than when using homogeneous data for
training.
Denoual's findings prove true for larger amounts of data, but when the system was trained on
relatively small amounts (29,000 sentences or fewer), translation seemed to be of higher quality
using the homogeneous data (based on NIST scores). No reason is given for this, and it is unclear
whether this cut-off point can be generalised to corpora other than the sets he used during his
experiments.
Obviously the nature of the data used to train a system will have implications for translation
quality; however, one also has to take into account the nature of the data that will be used as input
to the system. Subtitles can appear very different from text in other domains; a quick glance at the
statistics for our homogeneous corpus shows that sentences are much shorter compared with
sentences from the Europarl corpus (see Section 5.1, Table 2). Even this one simple statistic suggests
that we might be better off using a corpus of subtitles to train the system. As no previous research
has been carried out with respect to the specific task of translating subtitles using a corpus-based
approach, we believe that one cannot assume in advance that either homogeneous or heterogeneous
data will yield better results, and that the question therefore warrants its own investigation.
3 EBMT and the Marker Hypothesis
The approach we take to the automatic translation of subtitles is Example-Based Machine
Translation (EBMT). This is based on the intuition that humans make use of previously seen translation
examples to translate unseen input. The system is trained on an aligned bilingual corpus, from
which 'examples' are extracted and stored. During translation, the input sentence is segmented,
and its constituents are matched against this example database, with the corresponding
target-language examples being recombined to produce the final output.
Even though EBMT draws some parallels with Translation Memory (TM), there is one essential
difference: TM software needs a human present at all times during the translation process, and
does not translate automatically. EBMT, on the other hand, is an essentially automatic technique;
having located a set of relevant examples, the system recombines them to derive a final translation,
rather than handing them over to the human to decide what to do with them. Another major
benefit of EBMT is that the search goes beyond sentence level: subsentential examples are obtained,
meaning we do not miss out on matches which may not be seen by looking at the sentence as a
whole. Recently, the two paradigms have been becoming more and more similar (Simard and Langlais,
2001), with second-generation TM systems adopting a subsentential approach to extracting matches
and postulating a translation proposal based on these matches.
3.1 Marker-Based Chunking
As mentioned in Section 3.2, the input along with the source-target training corpus has to be
'chunked' in order to obtain subsentential examples. The Marker Hypothesis (Green, 1979) states that
"all natural languages are marked for complex syntactic structure at surface form by a closed set of
specific lexemes and morphemes which appear in a limited set of grammatical contexts and which
signal that context". We have carried out several experiments (Way and Gough, 2005; Stroppa
et al., 2006; Groves and Way, 2006) using this idea as the basis for the chunking component of our
EBMT system, and found it to be a very efficient way of segmenting source and target sentences
into smaller chunks. A set of closed-class (or marker) words, such as determiners, conjunctions,
prepositions, and pronouns, are used to indicate where one chunk ends and the next one begins
(Table 1), with the constraint that each chunk must contain at least one content (non-marker) word.
To make this process a little clearer, let us look at the following English-German example in (1):
Determiners          <DET>
Quantifiers          <Q>
Prepositions         <P>
Conjunctions         <C>
WH-Adverbs           <WH>
Possessive Pronouns  <POSSPRON>
Personal Pronouns    <PERSPRO>
Punctuation          <PUNC>
Table 1: Some of the tags used during the chunking phase
(1) Darling, I'm sorry but I've lost my key
    → Mein Guter, es tut mir leid aber ich habe meinen Schlüssel verloren
For the first step we automatically tag each closed-class word with its marker tag, as in (2):
(2) Darling <PUNC> , <PERSPRO> I am sorry <CONJ> but <PERSPRO> I've lost <POSSPRO> my key
    → Mein Guter <PUNC> , <PERSPRO> es tut <PERSPRO> mir leid <CONJ> aber <PERSPRO> ich
    habe <POSSPRO> meinen Schlüssel verloren
As every chunk must contain at least one non-marker word, we just keep the first marker tag when
multiple marker words appear alongside each other and discard the rest (3):
(3) Darling <PUNC> , I am sorry <CONJ> but I've lost <POSSPRO> my key
    → Mein Guter <PUNC> , es tut <PERSPRO> mir leid <CONJ> aber ich habe <POSSPRO> meinen
    Schlüssel verloren
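The tagging and filtering steps in (2) and (3) can be sketched in a few lines of Python. This is only an illustrative approximation, not the chunker used in our system: the marker lexicon `MARKERS` is a hypothetical four-entry stand-in for the full closed-class word lists, the function name `marker_chunk` is our own, and the input is assumed to be pre-tokenised with contractions expanded ("I've" becomes "I have").

```python
from typing import List, Optional, Tuple

# Hypothetical, deliberately tiny marker lexicon (lower-cased word -> tag).
# A real system would use full closed-class word lists per language.
MARKERS = {
    ",": "PUNC",
    "i": "PERSPRO",
    "but": "CONJ",
    "my": "POSSPRO",
}

# A chunk is (marker tag that opened it, or None; list of tokens).
Chunk = Tuple[Optional[str], List[str]]


def marker_chunk(tokens: List[str]) -> List[Chunk]:
    """Split tokens at marker words; every chunk keeps >= 1 content word."""
    chunks: List[Chunk] = []
    tag: Optional[str] = None
    current: List[str] = []
    has_content = False
    for token in tokens:
        marker = MARKERS.get(token.lower())
        if marker is not None and has_content:
            # A marker after a content word closes the chunk and opens a new one.
            chunks.append((tag, current))
            tag, current, has_content = marker, [token], False
        else:
            # Only the first marker of a run tags the chunk; later ones are absorbed.
            if marker is not None and not current:
                tag = marker
            current.append(token)
            has_content = has_content or marker is None
    if current:
        if has_content or not chunks:
            chunks.append((tag, current))
        else:
            # Trailing markers without a content word join the previous chunk.
            chunks[-1][1].extend(current)
    return chunks


tokens = "Darling , I am sorry but I have lost my key".split()
for chunk_tag, chunk_tokens in marker_chunk(tokens):
    print(chunk_tag, " ".join(chunk_tokens))
```

Run on the English side of (1), the sketch yields the four chunks shown in (3): the untagged chunk "Darling", followed by chunks opened by the PUNC, CONJ and POSSPRO markers.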
3.2 EBMT - an Example
The task for the EBMT system is to translate the input sentence in (4) given the aligned data in
(5) as its training corpus.
(4) Ich wohne in Paris mit meiner Frau
(5) Ich wohne in Dublin ↔ I live in Dublin
    Es gibt viel zu tun in Paris ↔ There's lots to do in Paris
    Ich gehe gern ins Kino mit meiner Frau ↔ I love going to the cinema with my wife
The data is then chunked (based on the Marker Hypothesis), with useful chunks and their
target-language partners being extracted and stored for later use (6), and less useful chunks being cast
aside. These useful chunk pairs are identified using a range of similarity metrics (see Section 4.3.1).
(6) Ich wohne ↔ I live
    in Dublin ↔ in Dublin
    Es gibt viel ↔ There's lots
    zu tun ↔ to do
    in Paris ↔ in Paris
    Ich gehe gern ↔ I love going
    ins Kino ↔ to the cinema
    mit meiner Frau ↔ with my wife
We start the translation process by searching the German side of the original corpus in (5) to see if
it contains the whole sentence. It does not, so we chunk the input sentence into smaller constituents
(7), using the same hypothesis that was used to segment the original corpus, and search for these in
the corpus of aligned chunks (6).
(7) Ich wohne
    in Paris
    mit meiner Frau
Once these chunks have been found in our database, they are recombined by the decoder (see
Section 4.4) to produce the final translation in (8):
(8) I live in Paris with my wife
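The walk-through in (4) to (8) can be reproduced with a minimal Python sketch. It uses the sentence pairs from (5) and the chunk pairs from (6) verbatim, but everything else is a simplifying assumption: the function name `translate` is hypothetical, the input chunks from (7) are supplied by hand, and recombination is plain left-to-right concatenation, which is far simpler than the decoder described in Section 4.4.

```python
from typing import Dict, List, Optional

# Aligned sentence pairs from (5).
SENTENCE_PAIRS: Dict[str, str] = {
    "Ich wohne in Dublin": "I live in Dublin",
    "Es gibt viel zu tun in Paris": "There's lots to do in Paris",
    "Ich gehe gern ins Kino mit meiner Frau":
        "I love going to the cinema with my wife",
}

# Aligned chunk pairs from (6).
CHUNK_PAIRS: Dict[str, str] = {
    "Ich wohne": "I live",
    "in Dublin": "in Dublin",
    "Es gibt viel": "There's lots",
    "zu tun": "to do",
    "in Paris": "in Paris",
    "Ich gehe gern": "I love going",
    "ins Kino": "to the cinema",
    "mit meiner Frau": "with my wife",
}


def translate(sentence: str, source_chunks: List[str]) -> Optional[str]:
    """Try an exact sentence match first, then fall back to chunk lookup."""
    if sentence in SENTENCE_PAIRS:
        return SENTENCE_PAIRS[sentence]
    targets: List[str] = []
    for chunk in source_chunks:
        target = CHUNK_PAIRS.get(chunk)
        if target is None:
            return None  # this naive sketch gives up on unseen chunks
        targets.append(target)
    # Assumption: target chunks keep the source order (no reordering).
    return " ".join(targets)


print(translate("Ich wohne in Paris mit meiner Frau",
                ["Ich wohne", "in Paris", "mit meiner Frau"]))
```

Keeping the source chunk order happens to work for this example; in general, German and English differ in word order, which is one reason a dedicated decoder is needed.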
4 System Architecture
We use the MaTrEx (Machine Translation using Examples) system to produce the output used
in our experiments in Section 5. The system is a corpus-based MT engine, and is designed in a
modular fashion, allowing the user to extend and re-implement modules with ease. The main modules
are as follows:
• Word Alignment Module: takes as input an aligned corpus, and produces a set of word
  alignments;
• Chunking Module: takes as input an aligned corpus, and produces a corpus of source and
  target chunks;
• Chunk Alignment Module: takes in source and target chunks, and aligns them sentence by
  sentence;
• Decoder: searches for a translation using the original aligned corpus and derived word and
  chunk alignments.
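To illustrate what a modular design of this kind makes possible, the following Python sketch wires a chunker, a chunk aligner and a trivial decoding step together so that each component can be swapped independently. It does not reflect the actual MaTrEx interfaces: the class `ExampleBasedPipeline`, the helpers `split_on_bar` and `align_by_position`, and the "|" boundary convention are all hypothetical, and the word alignment module is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)
Chunker = Callable[[str], List[str]]
ChunkAligner = Callable[[List[str], List[str]], List[Tuple[str, str]]]


@dataclass
class ExampleBasedPipeline:
    """Hypothetical wiring of chunking, chunk alignment and decoding."""
    chunker: Chunker
    chunk_aligner: ChunkAligner

    def train(self, corpus: List[SentencePair]) -> Dict[str, str]:
        """Build a source-chunk -> target-chunk example database."""
        examples: Dict[str, str] = {}
        for source, target in corpus:
            pairs = self.chunk_aligner(self.chunker(source), self.chunker(target))
            for src_chunk, tgt_chunk in pairs:
                examples.setdefault(src_chunk, tgt_chunk)  # keep first seen pair
        return examples

    def decode(self, sentence: str, examples: Dict[str, str]) -> str:
        """Translate chunk by chunk; unknown chunks pass through unchanged."""
        return " ".join(examples.get(c, c) for c in self.chunker(sentence))


# Toy stand-ins: chunk boundaries are pre-marked with "|" and chunks are
# aligned purely by position. Both are placeholders for real modules.
def split_on_bar(sentence: str) -> List[str]:
    return [part.strip() for part in sentence.split("|")]


def align_by_position(src: List[str], tgt: List[str]) -> List[Tuple[str, str]]:
    return list(zip(src, tgt))


pipeline = ExampleBasedPipeline(split_on_bar, align_by_position)
examples = pipeline.train([
    ("Ich wohne | in Dublin", "I live | in Dublin"),
    ("Es gibt viel | zu tun | in Paris", "There's lots | to do | in Paris"),
])
print(pipeline.decode("Ich wohne | in Paris", examples))
```

Replacing `split_on_bar` with a marker-based chunker, or `align_by_position` with a similarity-based aligner, would not require any change to the pipeline class itself.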