Synchronizing Translated Movie Subtitles
Jörg Tiedemann
Information Science
University of Groningen
PO Box 716
9700 AS Groningen, The Netherlands
j.tiedemann@rug.nl
Abstract
This paper addresses the problem of synchronizing movie subtitles, which is necessary to improve alignment quality when building a
parallel corpus out of translated subtitles. In particular, synchronization is done on the basis of aligned anchor points. Previous studies
have shown that cognate filters are useful for the identification of such points. However, this restricts the approach to related languages
with similar alphabets. Here, we propose a dictionary-based approach using automatic word alignment. Compared to the cognate-based approach, we show an improvement in alignment quality even for related languages.
1. Introduction
Movie subtitles in various languages are available on-line in ever-growing databases. They can be compiled into diverse collections of parallel corpora useful for many cross-lingual investigations and NLP applications (Lavecchia et al., 2007; Volk and Harder, 2007; Armstrong et al., 2006). Although they ought to be aligned to the corresponding movies, on-line subtitles suffer from serious synchronization differences. This is due to the process of creating textual subtitle files, which is mainly done by "ripping" (i.e. scanning) them from DVDs using various tools.
In previous studies, we have shown that time information is a valuable feature for proper subtitle alignment (Tiedemann, 2007). However, synchronization differences in terms of starting time and speed cause serious alignment problems, as shown in the same study. In (Tiedemann, 2006a), several ways of synchronizing misaligned subtitles have already been discussed. Synchronization is done by re-computing the time information in one subtitle file, adjusting speed and time offset according to two aligned fix-points in a pair of subtitles. The remaining problem is to find appropriate fix-points that can be used for this procedure. Besides defining them manually, two automatic methods have been compared in (Tiedemann, 2006a). Both are based on a "cognate filter" using string similarity measures and some heuristics for selecting the two points necessary for synchronization. Although this technique produces promising results, there are some obvious shortcomings of simple string similarity measures. First, there is the risk of finding false friends. However, the chance of false friends appearing in corresponding sentences and their local context is rather low. Secondly, the risk of selecting the wrong candidate is high in cases where names are frequently repeated in close context. The impact of such erroneous selections is minimized by considering all combinations of candidates and selecting the most promising pair according to some heuristics, as described in (Tiedemann, 2006a). Finally, the most severe drawback of string similarity measures is the restriction to languages with at least similar alphabets. This problem will be addressed in this paper.
The remaining part is organized as follows: First, we briefly describe the data we collected. Thereafter, we summarize our sentence alignment approach. Finally, we discuss the synchronization of subtitles using a dictionary filter, including an evaluation on some sample data.
2. Data Collection
We collected data from one on-line provider of movie sub-
titles, http://www.opensubtitles.org . All data
files have been converted to a standalone XML format and
UTF8 encoding. We also applied a language classifier to
clean the database. Details are given in (Tiedemann, 2007).
The current collection contains 22,794 pairs of subtitles in
29 languages covering 2,780 movies. Figure 1 lists some
statistics of the 15 largest bitexts in our collection.
language nr sentences nr words
pair source target source target
eng-spa 592,355 524,412 4,696,792 4,071,345
por-spa 443,521 414,725 3,124,539 3,170,790
cze-eng 403,605 421,135 2,581,318 3,260,751
eng-por 397,085 370,866 3,071,277 2,611,508
eng-slv 394,941 376,971 3,036,584 2,343,233
eng-swe 386,269 339,953 2,971,600 2,441,469
dut-eng 378,475 425,600 2,804,742 3,338,842
dut-spa 367,421 359,944 2,729,557 2,739,981
cze-por 365,676 366,861 2,311,908 2,532,080
cze-spa 361,038 335,278 2,278,212 2,532,657
cze-rum 347,454 345,553 2,220,880 2,491,271
por-rum 340,227 335,356 2,352,743 2,412,681
cze-slv 328,751 335,555 2,093,731 2,123,347
eng-pob 323,621 308,458 2,525,747 2,183,897
pob-spa 320,934 293,701 2,280,703 2,340,992
Figure 1: The 15 largest bitexts in the subtitle corpus
3. Time Slot Alignment
An important step in building parallel corpora is the alignment of textual units at some level. Commonly, corresponding sentences are aligned, assuming that they are the smallest linguistic units that can still be aligned monotonically between two languages. In (Tiedemann, 2007), we have shown that standard length-based approaches fail for our kind of data and that sentence alignment using time information is superior to these techniques. Basically, the time-slot alignment approach is based on the assumption that corresponding texts are shown at approximately the same time on screen and, hence, text fragments with the largest time overlap are aligned to each other. The details are described in (Tiedemann, 2007). This approach nicely handles cases of deletions and insertions, which usually cause major problems in alignment.
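As an illustration, the overlap criterion can be sketched in a few lines. This is a simplified greedy version with invented frame times, not the actual aligner of (Tiedemann, 2007):

```python
# Illustrative sketch of time-slot alignment: subtitle frames are paired
# by the amount of screen time they share. Simplified greedy version;
# the frame times below are invented for the example.

def overlap(a, b):
    """Temporal overlap (in seconds) of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def align_by_overlap(src_frames, trg_frames):
    """For each source frame, pick the target frame with the largest time
    overlap; None encodes an empty (1:0) alignment."""
    links = []
    for s in src_frames:
        best, best_ov = None, 0.0
        for j, t in enumerate(trg_frames):
            ov = overlap(s, t)
            if ov > best_ov:
                best, best_ov = j, ov
        links.append(best)
    return links

src = [(0.0, 2.0), (2.5, 4.0), (10.0, 12.0)]   # an inserted frame at the end
trg = [(0.1, 2.1), (2.4, 3.9)]
print(align_by_overlap(src, trg))  # [0, 1, None]
```

Insertions and deletions simply come out as empty links instead of derailing the rest of the alignment, which is why the approach handles them so gracefully.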
4. Subtitle Synchronization
Despite the fact that the simple alignment approach described in the previous section works well for perfectly synchronized subtitles, it fails badly for pairs of subtitles with even slight timing differences. Unfortunately, synchronization differences are quite common among the data files we have collected. The mis-synchronization problems come down to differences in starting time and in the speed at which subtitle frames are shown together with the movie. This is caused by the software used for creating the plain text subtitle files. Therefore, media players commonly include features to adjust the timing manually.
For sentence alignment we do not require proper alignment of subtitles to movies but proper alignment of subtitles to each other. However, this involves the same adjustments of starting time and display speed for one of the subtitle files in order to apply the time-slot alignment method. The approach suggested in (Tiedemann, 2006a) uses two fixed anchor points to compute the speed and time offset for a given subtitle pair and to recalculate the time stamps in one of the files. Having two anchor points with time stamps <src_1, trg_1> and <src_2, trg_2> (src_x corresponds to the time in the source language subtitle file and trg_x to the time in the target language file), we can compute the time_ratio and the time_offset as follows:

  time_ratio = (trg_1 - trg_2) / (src_1 - src_2)
  time_offset = trg_2 - src_2 * time_ratio

Using these two parameters we now adjust the time stamps in the source language file by simply multiplying each of them by the time_ratio and adding the value of time_offset.
A crucial step for this technique is to find appropriate anchor points for the synchronization. Essentially, we need to find pairs of subtitle frames which truly correspond to each other. We can then use the time stamps given at the beginning of each of these subtitle frames to compute the synchronization parameters. The best result can be expected if the two anchor points are as far away from each other as possible. Therefore, we need to look for corresponding frames at the beginning and at the end of each subtitle pair. There are several ways of finding such anchor points. In the following we first discuss the strategies previously used and, thereafter, we present our new extension using bilingual dictionaries derived from automatic word alignment.
4.1. Manually Adding Anchor Points
The safest way of synchronizing subtitles using anchor points is to manually mark corresponding frames. For this, the interactive sentence alignment front-end ISA (Tiedemann, 2006b) can be used. ISA includes features to manually add hard boundaries before aligning a bitext. It allows one to easily jump to the end and back to the beginning of the current bitext, and hard boundaries can simply be added and deleted by clicking on corresponding sentences. A screenshot of the interface is shown in figure 2.
The alignment back-end is defined in a corpus-specific configuration file. ISA passes all existing hard boundaries as parameters to the back-end in case the time-slot aligner for subtitles is used.
The advantage of the manual approach is that the alignment can be done interactively, i.e. the resulting alignment can be edited or the synchronization can be repeated using different anchor points. Furthermore, the user may decide whether synchronization is necessary at all. However, manual synchronization and interactive alignment are not an option for large amounts of data. Therefore, we use automatic techniques for anchor point identification as discussed in the following two sections.
4.2. Using Cognates for Synchronization
An obvious idea for anchor point identification is to use so-called "cognates"¹, known to be useful for sentence alignment (see, e.g., (Simard et al., 1992; Melamed, 1996)). The cognate approach for anchor point identification in subtitle alignment has already been used in the study presented in (Tiedemann, 2007). We basically scan the beginning and the end of the bitext for cognate pairs using a string similarity measure and a sliding window. The two pairs of cognate candidates which are furthest away from each other are then used for synchronization, simply using the time stamps given at the beginning of the sentences in which they appear.
The cognate approach works well for languages which use identical or at least similar alphabets, especially because subtitles usually contain a lot of names that can easily be matched. Obviously, simple string matching algorithms do not work for language pairs with different alphabets such as English and Bulgarian, as shown in figure 2, even though there might be a lot of closely related words (such as names transliterated into the respective alphabet). One possibility to use string similarity measures for such languages is to define scoring schemes that match arbitrary character pairs from both alphabets. The problem here is to define appropriate matching functions for each language pair under consideration. There are certainly ways of learning such functions and it would be interesting to investigate this direction further in future work. Another possibility for finding anchor points is to use bilingual dictionaries. This approach will be discussed in the following.
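In code, the recalculation of time stamps amounts to a two-parameter linear transformation. A minimal sketch; variable and function names are ours:

```python
# Sketch of the synchronization step: two aligned anchor points
# <src_1, trg_1> and <src_2, trg_2> (times in seconds) determine a speed
# factor and an offset; all source time stamps are then rescaled.

def sync_params(src1, trg1, src2, trg2):
    """Compute time_ratio and time_offset from two anchor points."""
    time_ratio = (trg1 - trg2) / (src1 - src2)
    time_offset = trg2 - src2 * time_ratio
    return time_ratio, time_offset

def resync(timestamps, time_ratio, time_offset):
    """Adjust source time stamps: t' = t * time_ratio + time_offset."""
    return [t * time_ratio + time_offset for t in timestamps]

# Example: the target subtitle runs 20% faster with no extra offset.
ratio, offset = sync_params(10.0, 12.0, 100.0, 120.0)
print(ratio, offset)                     # speed factor 1.2, offset ~0
print(resync([10.0, 55.0, 100.0], ratio, offset))
```

Note that both anchor points map exactly onto their target-side time stamps; every frame in between is interpolated linearly.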
¹ The term cognate is used here for words which are similar in spelling.

Figure 2: Manually adding hard boundaries to an English/Bulgarian subtitle pair (from the movie "Chicken Run") using the Interactive Sentence Aligner ISA

4.3. Word Alignment for Synchronization
Using bilingual dictionaries is a straightforward idea for finding candidates for aligned anchor points, in the same fashion as is done by the cognate filter described earlier. Now the task is to obtain appropriate dictionaries. We still want to keep the alignment approach as language-independent as possible and therefore do not want to rely on existing machine-readable dictionaries. Naturally, dictionaries are the opposite of a language-independent resource. However, this is not an issue if they can be created automatically for any given language pair. Fortunately, word alignment software is capable of finding common translational equivalents in a given parallel corpus and, hence, "rough" bilingual dictionaries can be extracted from word-aligned corpora. In the following we assume that word alignment is robust enough even for parallel corpora with many sentence alignment errors. In particular, we assume that at least the frequently aligned words correspond to good translation equivalents. The procedure is as follows:
1. sentence-align the entire corpus using the time-slot overlap approach (using cognate filters if applicable)
2. word-align each parallel corpus using GIZA++ (Och and Ney, 2003) with standard settings in both alignment directions
3. use the intersection of both Viterbi alignments (to obtain high precision) and extract relevant word correspondences (using alignment frequency thresholds)
4. run the sentence aligner once again with language-pair specific dictionaries
Important in our setting is to extract reliable word translations. Using the intersection of statistical word alignments already reduces the noise of the alignment dramatically. Furthermore, we simply set a frequency threshold for the extraction of translation equivalents and restrict ourselves to tokens of a minimal length that contain alphabetic characters only.
4.4. Anchor Point Selection
An obvious drawback of automatic synchronization approaches compared to manual synchronization is the risk of mis-synchronization. Both the cognate-based and the dictionary-based approach bear the risk of selecting inappropriate anchor points, which may cause an even worse alignment than without any synchronization. However, anchor point selection is cheap when considering limited windows (initial and final sections) only. Dictionaries are static and do not have to be re-compiled each time an alignment has to be repeated, and token-based string comparison is also fast and easy. Hence, various candidate pairs can be tested when applying synchronization. The main difficulty is to choose between competing candidates in order to select the one that yields the best sentence alignment in the end. Here, we apply a simple heuristic that works well in our experiments. Knowing that incorrect synchronization causes a lot of mismatches between time slots, we assume that the alignment in such cases includes many empty alignments (one sentence in one language aligned to no sentence in the other language). On the other hand, we expect well-synchronized subtitles to align nicely with only a few empty links. Hence, we use the alignment-type ratio (Tiedemann, 2006a) to measure the relative "quality" of an alignment compared to other possible alignments:

  algtype_ratio = (number of non-1:0-alignments + 1) / (number of 1:0-alignments + 1)
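The ratio and the exhaustive candidate test can be sketched as follows. The alignment outcomes and frame names below are invented for illustration; None marks an empty (1:0) link:

```python
# Sketch of the anchor point selection heuristic: each candidate anchor
# pair yields one synchronization and hence one sentence alignment; the
# pair whose alignment has the highest alignment-type ratio wins.

def algtype_ratio(links):
    """(number of non-1:0 alignments + 1) / (number of 1:0 alignments + 1)"""
    empty = sum(1 for link in links if link is None)
    return (len(links) - empty + 1) / (empty + 1)

def best_candidate(candidate_alignments):
    """Pick the candidate anchor pair whose alignment maximizes the ratio."""
    return max(candidate_alignments,
               key=lambda pair: algtype_ratio(candidate_alignments[pair]))

# Simulated alignment outcomes for two competing candidate anchor pairs:
outcomes = {
    ("src_frame_3", "trg_frame_2"): [0, 1, 2, 3, None, 4],           # well synchronized
    ("src_frame_7", "trg_frame_9"): [0, None, None, None, 1, None],  # mis-synchronized
}
print(best_candidate(outcomes))  # ('src_frame_3', 'trg_frame_2')
```

The add-one terms keep the ratio defined even when one of the counts is zero, so degenerate alignments can still be compared.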
Using the algtype_ratio as an indicator of alignment quality, we can now try all possible pairs of anchor point candidates and select the one that maximizes this ratio. This strategy is also applied in the experiments described below.
5. Experiments
In our experiments we used Dutch-English and Dutch-German subtitles (mainly because evaluation data was readily available from previous experiments). The intersection of word token links resulted in 111,460 unique word type pairs for Dutch-English and 45,585 pairs for Dutch-German. We extracted all word pairs with an alignment frequency larger than 5 and, furthermore, we removed all pairs that contain non-alphabetic characters and words shorter than 5 characters. In this way we discarded most of the alignment errors and also dismissed most of the highly frequent function words, which would cause a lot of false hits when looking for anchor points for synchronization. The resulting dictionaries consist of 4,802 pairs (Dutch-English) and 1,133 pairs (Dutch-German). Figure 3 shows a small sample of the Dutch/English and Dutch/German word alignment dictionaries.

Dutch/English:
misschien maybe
hallo hello
sorry sorry
bedankt thanks
omdat because
alsjeblieft please
luister listen
bedankt thank
niets nothing
mensen people
alles everything
dacht thought
nooit never
jezus jesus
vader father
...
alleen wanna
alleen there
alleen people
alleen about
alden alden
akkoord agreed
afdrukken prints
adios adios
aanvallen attack
aanval attack

Dutch/German:
waarom warum
hebben haben
alles alles
leven leben
niets nichts
vader vater
misschien vielleicht
weten wissen
zeggen sagen
moeten müssen
kunnen können
altijd immer
gezien gesehen
terug zurück
bedankt danke
...
begonnen begonnen
bedoelt meinst
banken banken
badal badal
atlanta atlanta
alsjeblief bitte
aiies aiies
achter hinten
aartsbisschop erzbischof
aarde erden

Figure 3: The top 15 and the last 10 entries from the Dutch/English and Dutch/German word type alignments (sorted by alignment frequency, which is not shown here)

As the figure indicates, the quality of these dictionaries is quite good but not perfect. Especially among less frequent alignments we can see several mistakes even after filtering. For example, in the Dutch/English dictionary there are 5 errors among the last 10 pairs, as shown in figure 3. However, due to the exhaustive test of candidate pairs described in section 4.4., we assume that our method is very robust to such noise in the dictionaries.
In the Dutch/German sample we can see another typical error in our data, caused by the software used for ripping the subtitles from the DVDs. Quite frequently we can observe OCR errors such as "aiies" (instead of "alles") in the data. This, however, should not have a negative impact on the synchronization approach. On the contrary, these links might even be very useful for our data collection.
The word alignment dictionaries were then used without further processing for movie synchronization as described earlier. For evaluation purposes we used 10 randomly selected movies and manually aligned 20 initial, 20 intermediate and 20 final sentences of each movie and language pair. The reason for focusing on different regions in the subtitles was originally to investigate the differences between the various approaches on specific alignment problems. For example, subtitles often include different amounts of information at the beginning and at the end of a movie. Titles, songs and credits may be (partly) translated in some languages whereas they are not in others. Insertions and deletions usually cause large problems for traditional approaches. However, we could not observe significant differences in these regions between the length-based approach and our time-slot approach. We also wanted to look at the ability to synchronize after initial mistakes but could not see significant differences between the two types of alignment approaches either. Therefore, we will not include separate scores for the three regions but present the overall results only (see table 1).

approach   correct  partial  wrong

Dutch - English
length     0.397    0.095    0.508
time       0.599    0.119    0.282
time-cog   0.738    0.115    0.147
time-dic   0.765    0.131    0.104

Dutch - German
length     0.631    0.148    0.220
time       0.515    0.085    0.400
time-cog   0.733    0.163    0.104
time-dic   0.752    0.148    0.100

Table 1: The quality of the different alignment approaches: length refers to the baseline using a length-based alignment approach; time refers to the time-slot overlap approach. The extension cog refers to the application of the cognate filter and dic to the dictionary approach.

The following settings have been used for the time-based alignment with movie synchronization: Anchor points are searched for in a window of 25 sentences from the beginning and from the end of each subtitle file. As discussed earlier, all combinations of the candidate pairs in the initial window and those in the final window are tried out, and the one that gives the largest alignment-type ratio is taken.
The parameters of the cognate filter are set as follows: the minimal string length is 5 characters and the similarity measure is the longest common sub-sequence ratio with a threshold of 0.6.
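A minimal implementation of this similarity measure, using the standard dynamic-programming longest common sub-sequence (function names are ours):

```python
# Sketch of the cognate filter's similarity measure: the longest common
# sub-sequence ratio (LCSR), i.e. LCS length divided by the length of the
# longer string. Pairs scoring at or above the threshold (0.6) count as
# cognate candidates.

def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b):
            cur.append(prev[j] + 1 if ca == cb else max(prev[j + 1], cur[-1]))
        prev = cur
    return prev[-1]

def lcsr(a, b):
    return lcs_len(a, b) / max(len(a), len(b))

print(lcsr("groningen", "groningen"))     # 1.0
print(round(lcsr("colour", "color"), 2))  # 0.83
```

A pair like "colour"/"color" scores 5/6 ≈ 0.83 and passes the 0.6 threshold, while unrelated word pairs fall well below it.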
6. Discussion and Conclusions
Table 1 shows that there is a clear improvement in alignment quality using the dictionary approach, compared to the baseline and also to the other time-based alignment approaches. Note that we consider closely related languages with similar alphabets only. We still outperform the cognate-based approach, which is an encouraging result considering the noisy word alignment dictionaries used. Concluding from this, we expect that the dictionary approach also helps to improve the alignment of more distant language pairs with incompatible alphabets, for which the cognate method does not work at all. However, we should be careful with our expectations for several reasons. First of all, there is often less data available for such language pairs. They also often come from less reliable sources and include many corrupt and incomplete files. Furthermore, for many languages various character encodings are used, which complicates the pre-processing step. We should also not forget that less related language pairs are more difficult to word-align anyway because of syntactic, morphological and semantic differences. Also, we might see even more OCR errors because of the limited language support of the subtitle ripping software used. Finally, the word alignment will be based on a non-synchronized parallel corpus (because the cognate-based synchronization is not applicable). All of these issues will lead to smaller and noisier bilingual dictionaries.
We word-aligned the entire corpus with all its language pairs and applied the dictionary approach to all bitexts. Unfortunately, due to time constraints, we were not able to measure the success of this approach on some example data of distant language pairs. This has to be done in future work and is just a matter of producing appropriate gold standard alignments (which can also be done using ISA). We still expect an improvement even with small and noisy dictionaries due to the robustness of our approach, as discussed earlier. The improvements might be smaller, though, and it could be an idea to iteratively alternate between word alignment and sentence alignment to push the quality further up. However, word alignment is expensive and, therefore, this approach might not be reasonable with current technology. Further investigations in this direction should be carried out in the future.
7. Availability
The parallel subtitle corpus is part of OPUS (Tiedemann and Nygård, 2004), a free collection of parallel corpora. The corpora are available from the project website at http://www.let.rug.nl/tiedeman/OPUS/ including the latest sentence alignments and links to tools and search interfaces. The entire OPUS corpus has been indexed using the Corpus Work Bench from IMS Stuttgart and can be queried on-line. Furthermore, we provide a word alignment database with access to multi-lingual dictionaries derived from automatic word alignment. The database can be queried at http://www.let.rug.nl/tiedeman/OPUS/lex.php .
We would like to thank the University of Groningen and Oslo University for providing hard disk space and Internet bandwidth for our resources.
8. References
Stephen Armstrong, Colm Caffrey, Marian Flanagan, Dorothy Kenny, Minako O'Hagan, and Andy Way. 2006. Leading by example: Automatic translation of subtitles via EBMT. Perspectives: Studies in Translatology, 14(3):163-184.
Caroline Lavecchia, Kamel Smaïli, and David Langlois. 2007. Building parallel corpora from movies. In Proc. of the 4th International Workshop on Natural Language Processing and Cognitive Science - NLPCS 2007, Funchal, Madeira.
I. Dan Melamed. 1996. A geometric approach to mapping bitext correspondence. In Eric Brill and Kenneth Church, editors, Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1-12, Philadelphia, PA.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.
Michel Simard, George F. Foster, and Pierre Isabelle. 1992. Using cognates to align sentences in bilingual corpora. In Proceedings of the 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), pages 67-81, Montreal, Canada.
Jörg Tiedemann and Lars Nygård. 2004. The OPUS corpus - parallel and free. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'2004), Lisbon, Portugal, May.
Jörg Tiedemann. 2006a. Building a multilingual parallel subtitle corpus. In Proceedings of the 17th CLIN, to appear, Leuven, Belgium.
Jörg Tiedemann. 2006b. ISA & ICA - two web interfaces for interactive alignment of bitexts. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), Genova, Italy.
Jörg Tiedemann. 2007. Improved sentence alignment for movie subtitles. In Proceedings of RANLP 2007, pages 582-588, Borovets, Bulgaria.
Martin Volk and Søren Harder. 2007. Evaluating MT with translations or translators. What is the difference? In Machine Translation Summit XI Proceedings, Copenhagen.