Spoken Language Translation
[ Alex Waibel and Christian Fügen ]
[ Enabling cross-lingual human–human communication ]
Digital Object Identifier 10.1109/MSP.2008.918415

During the past 15 years, speech translation has grown from an oddity at the fringe of speech and language processing conferences to one of the main pillars of current research activity. The expanding interest and excitement can be explained by a convergence of emerging and powerful new technical capabilities and a growing appreciation of the needs for better cross-lingual communication in a globalizing world [31].
Governments, commercial enterprises, and academic and humanitarian organi-
zations all face internationalization and globalization at an unprecedented scale.
Security, effectiveness in trade and commerce, market size, and competitive reach
all depend on global information awareness and the ability to interact and commu-
nicate globally. Increased integration (e.g., witness the integration efforts in Europe
and Asia) requires natural, yet effective, international cross-lingual communication.
It is true that certain language groups share common languages of communication (English, Spanish, Mandarin), but language abilities vary and often prevent true integration and equal opportunity for all. Effective solutions addressing the linguistic divide (not just the “digital divide”) could therefore offer considerable practical and economic benefits.
For the research community, speech translation also presents fascinating new prob-
lems that appear solvable by the introduction of considerable computing resources, seem-
ingly unlimited Web data, and promising new machine learning techniques. Despite the
promise and potential, considerable improvements are still needed in the component
technologies: speech recognition, machine translation (MT), and speech synthesis.
Moreover, to achieve effective cross-lingual human–human communication in practice,
not only do recognition and translation error rates matter but also the user interface and
overall system design in each communication setting.
In the following, we present an overview of the field of speech translation. We review
the history of the field, the main achievements, and remaining challenges. We discuss the
main approaches and the most promising applications. We also address the human factors
of delivering and deploying speech translation in different human communicative scenar-
ios and discuss issues regarding scaling the technology across domains, speaking styles,
and languages.
TECHNOLOGY
Speech translation systems typically consist of three components
(see Figure 1): automatic speech recognition (ASR), MT, and text-
to-speech synthesis (TTS). The underlying technologies for these
three components have been developed independently, and many of their performance issues and techniques apply to speech translation as well.
Clearly, better ASR, MT, or TTS
performance makes for better
speech translation performance.
However, a speech translation system is not only the cascade of
its parts. Since the goal is to produce output in a target language,
the correctness of the components’ output is of secondary con-
cern. Uncertainty at the component level can be addressed by
being noncommittal at their interface, linking components via
near-miss hypothesis lattices [1]. Such a view also offers the pos-
sibility of jointly optimizing components based on the overall
output rather than each component independently [2].
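Purely as an illustrative sketch of such loose coupling (the component classes and method names below are hypothetical, not the interface of any particular system), the cascade can be wired so that the recognizer hands several near-miss hypotheses, rather than a single best string, to the translator:

# Illustrative sketch of a loosely coupled speech translation cascade.
# Recognizer, Translator, and Synthesizer are hypothetical interfaces,
# not an actual toolkit API.

class SpeechTranslator:
    def __init__(self, recognizer, translator, synthesizer):
        self.recognizer = recognizer    # ASR: audio -> n-best list / lattice
        self.translator = translator    # MT: source hypotheses -> target text
        self.synthesizer = synthesizer  # TTS: target text -> audio

    def translate_utterance(self, audio):
        # Stay noncommittal at the interface: pass several near-miss
        # hypotheses (with scores) instead of a single 1-best string ...
        nbest = self.recognizer.recognize_nbest(audio, n=5)
        # ... and let the translator rescore them jointly with its own
        # translation and language models before committing to an output.
        target_text = self.translator.translate_nbest(nbest)
        return self.synthesizer.synthesize(target_text)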
[FIG1] Schematic overview of a speech-to-speech translation system and its models: (a) direct approach using, e.g., statistical MT, and (b) interlingua approach using an interchange language.

AUTOMATIC SPEECH RECOGNITION
The ASR component of a speech translation system, of course, faces all the challenges that are typical for ASR in general: noise, disfluencies, vocabulary size, and language perplexity, which complicate recognition and increase recognition errors. Recognition quality is generally measured in terms of word error rate (WER) as compared with a reference transcription. In addition to WER, other factors must be included to judge the capabilities of a recognizer: the language model perplexity (a measure of information/surprise provided by the word sequence), speed, memory usage, processor requirements, and microphone positioning. For speech translation a few of these challenges are particularly noteworthy.

First off, ASR error rates should generally be lower for translation to make sense (below roughly 10%) than for other ASR applications that can tolerate higher error rates (e.g., retrieval). Since many speech translation applications involve free spontaneous human dialog, however, such low error rates are more difficult to obtain: spontaneous dialogs tend to be disfluent, containing false starts, hesitations, and repetitions, and spontaneous speech is less clearly articulated, leading to higher error rates. Speech translation dialogs also often involve accents, as regional variations and cross-language expressions enter into use. Furthermore, many speech translation tasks are affected by environmental noise, or by issues deriving from microphone positioning and type. In a two-way system for doctor–patient dialogs, for example, it may be feasible for the owner of a system to wear a close-speaking microphone but not for the patient.

The ASR component must also provide a useful indication of sentence boundaries so that a subsequent translation engine can translate a sentence or fragment into another language. In many tasks, a continuous stream of voice (broadcast news, speeches, lectures, etc.) is presented, so that such sentence-level segments must be inferred automatically. The resulting segmentation algorithms use natural pauses, language model statistics, and prosodic cues to infer such segments. Optimizing subsequent translation
quality and minimizing latency are both important consid-
erations for an effective design.
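To make the segmentation idea concrete, the following is a minimal sketch; the thresholds and the language-model boundary probabilities are assumed values for illustration, not those of a deployed system, and prosodic cues could enter as an additional feature:

# A minimal sketch of pause- and language-model-based segmentation of a
# continuous ASR word stream into translatable units. Thresholds and
# boundary probabilities are assumed values for illustration only.

def segment(words, pause_after, lm_boundary_prob,
            pause_threshold=0.4, lm_threshold=0.6):
    """words[i] is followed by a silence of pause_after[i] seconds;
    lm_boundary_prob[i] is an (assumed) LM estimate that a sentence
    boundary follows word i."""
    segments, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if pause_after[i] >= pause_threshold or lm_boundary_prob[i] >= lm_threshold:
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments

# Example: the stream is split after "today" because of a long pause.
print(segment(["hello", "everyone", "today", "we", "discuss", "translation"],
              [0.05, 0.10, 0.80, 0.05, 0.05, 0.90],
              [0.10, 0.20, 0.70, 0.05, 0.10, 0.95]))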
To improve ASR modeling, more training and/or adaptation
data are required, but conversational dialog data are hard to col-
lect and “natural” dialogs
through a speech translation
device are difficult to simulate.
Similarly, text data for lan-
guage modeling and diction-
ary construction may be
available for certain speech
translation tasks (broadcast
news, parliamentary speeches)
but not for others (dialogs,
lectures). Construction of a suitable recognition “word” lexicon
can also be a problem if a language provides for many inflections
of its root forms (morphology). Depending on the language, text data for language model training and dictionary construction may in fact not even be available at all if the language exists only in spoken form or is a dialect.
ASR has a number of added practical requirements that are
of special importance to speech translation. These include speed
requirements when dialog completion is at stake or proper han-
dling of named entities (city names, person names, food, med-
ication, symptoms, etc.) as they vary in the field and application
and are essential for translation. Since effective end-to-end com-
munication is the goal, ASR components frequently output
hypothesis lattice structures (confusion networks) and confi-
dence measures to pass multiple near-miss alternatives to subse-
quent translation components. These allow for better
integration and overall system optimization.
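A purely illustrative sketch of such an interface is shown below; the slot structure, the example words, and the confidence values are invented here and do not correspond to any particular system:

# Illustrative sketch of a confusion-network ("sausage") interface between
# ASR and MT: a sequence of slots, each holding competing word alternatives
# with confidence scores. Structure and values are invented for illustration.

from dataclasses import dataclass

@dataclass
class Slot:
    alternatives: dict   # word -> posterior confidence

    def best(self):
        return max(self.alternatives, key=self.alternatives.get)

# Near-miss alternatives for a short utterance:
network = [
    Slot({"i": 0.9, "eye": 0.1}),
    Slot({"need": 0.6, "kneed": 0.3, "knee": 0.1}),
    Slot({"medication": 0.8, "meditation": 0.2}),
]

# A downstream translator can rescore whole paths with its own models
# rather than committing to the single best word in each slot.
one_best = " ".join(slot.best() for slot in network)
print(one_best)   # "i need medication"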
SPOKEN LANGUAGE TRANSLATION
For MT of text, the choice of technology and design remains a topic of discussion. Three different approaches have been popular in MT: the direct approach, the transfer approach, and the interlingua approach. In the first, a direct mapping between source language and target language is attempted, while transfer and interlingua approaches attempt to extract deeper linguistic structures first. Most commonly, transfer approaches will perform a syntactic analysis and transfer the derived structures from the source to the target language for generation of the target language sentence. Interlingua approaches [3] attempt to derive a semantic representation of an input sentence first and then generate a sentence in the target language from those semantic concepts. Direct approaches bypass this analysis and map input sentences directly onto a target language sentence. While early attempts at direct translation were rejected due to the high ambiguity of language, they have regained considerable following and popularity with the advent of statistical data-driven approaches, such as example-based and statistical MT [4], [23].

All three MT approaches have been used for speech translation as well, each with notable advantages and disadvantages [5], [9]–[11], [13]. The interlingua approach has the advantage that it can connect N languages in any combination through its common semantic representation and therefore does not require the development of O(N^2) language translators. It also permits regeneration of a paraphrase in a speaker’s own language for verification. A semantic representation also strips the input surface realization of all its disfluencies and colloquialisms and can lead to a clean and semantically equivalent utterance in the target language. The biggest drawback of the interlingua approach is the manual development of semantic parsers and the complications in designing a semantic representation common to all languages.

Statistical MT, by contrast, can handle the ambiguities of language by a stochastic source-channel model, much like today’s speech recognizers do. With it, the most likely target language word string ê given a source language word string f is estimated by way of Bayes’ rule as the product of a translation model p(f | e) and a language model p(e):

ê = arg max_e { p(e | f) } = arg max_e { p(e) p(f | e) }.
Effectively, the model combines the probabilities of different
translations of words in a sentence with the monolingual
likelihood of each resulting word sequence to determine the
most likely translation of that sentence. Instead of just mod-
eling this as a noisy channel approach, current SMT systems
use a log linear combination of a number of feature func-
tions that model important aspects including a language
model, a word reordering model, word penalties, and various
other translation models [24], [25], [27]. The SMT approach
has the advantage that it requires no manual development of
grammars or representations but is trained on large
amounts of translated reference texts (parallel corpora). Its drawbacks are its need for large parallel corpora, its lack of a common representation to connect multiple languages, and its difficulty with highly disfluent input.
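As a toy illustration of this log-linear decision rule (all feature values and weights below are invented, not taken from a real system), the chosen translation is simply the candidate that maximizes the weighted sum of its feature scores:

# Toy illustration of log-linear scoring in SMT: each candidate translation e
# is scored as a weighted sum of feature values (log translation model,
# log language model, word penalty, reordering), and the argmax wins.
# All numbers and weights are invented for illustration only.

def score(features, weights):
    return sum(weights[name] * value for name, value in features.items())

weights = {"log_p_tm": 1.0,       # translation model log p(f|e)
           "log_p_lm": 0.8,       # target language model log p(e)
           "word_penalty": -0.3,  # discourages overly long outputs
           "reordering": 0.5}     # distortion / reordering score

candidates = {
    "where is the station": {"log_p_tm": -4.1, "log_p_lm": -6.2,
                             "word_penalty": 4, "reordering": -0.5},
    "where the station is": {"log_p_tm": -3.9, "log_p_lm": -7.8,
                             "word_penalty": 4, "reordering": -1.5},
}

best = max(candidates, key=lambda e: score(candidates[e], weights))
print(best)   # "where is the station"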
For domain-limited translation systems (see discussion
below) the design of an interlingua has been shown to be possi-
ble and helpful, but for domain-unlimited applications (due to
their unrestricted semantic coverage) SMT methods have been
generally preferred. A number of hybrid techniques have been
proposed to retain some of the advantages of both, including
statistically trained semantic analyzers in an interlingua frame-
work [6], or using a natural language (e.g., English) as an inter-
mediate “pivot” language [7], [8] to connect multiple languages.
OUTPUT GENERATION (SPEECH, TEXT)
The output of a speech translation system most typically is synthetic speech in the target language. Alternative outputs, however, are possible depending on the purpose and ultimate use of
the speech translator (see discussion below). They include target
language text, or summaries from translations. In
human–human cross-lingual speech dialogs a speech synthesis
component generates audible
output from a translated text
string [30]. Commonly, full TTS
is used for convenience and
modularity, even though one
could arguably also synthesize
speech based on conceptual or
syntactic structures if they are
provided by the MT component.
Special concerns in TTS for
speech translators involve gen-
erating appropriate emotion,
style, and voices, so that the output speaking style corresponds
to the input speech in the source language [30]. Voice conver-
sion, in particular, attempts to generate speech in the output
language with the voice of the speaker of the input language.
PROGRESS IN SPEECH TRANSLATION
Based on the advances in component technologies, research on speech translation began in earnest during the late 1980s and early 1990s. In the following two decades, impressive speech translation systems have been developed. The systems and their progression can be categorized by distinct new system-level capabilities at each stage of development. These capabilities are summarized in Table 1.

Overall, each advance distinguishes itself by the levels of uncertainty that a given system can tolerate. Language is ambiguous at all levels, from signal to phonetics to syntax to semantics. Earlier systems have imposed greater constraints to control such ambiguity. For example, restrictions in speaking style, vocabulary, domain, and the use and operation of a system limit ambiguity and the search for translation hypotheses. Such constraints may be inherent in the task (prerecorded announcements, limited domain, or phraseology) or the recording conditions (e.g., broadcast news versus telephone conversations). Alternatively, they can be imposed as a requirement for system use. Restrictions on use, however, severely limit the usefulness of a system in many real-world situations. A domain-limited one-way system for tourists may be helpful but is limiting, as it requires the user to memorize the allowable phrases and it cannot translate back the response of the other party. It is equally limiting if the user is not allowed a hesitation (aeh, hum, etc.) while speaking, or if he/she must produce perfectly grammatical sentences to obtain useful output. To err is human, and useful systems must accommodate a speaker’s mistakes. Finally, domain limitation may be acceptable in certain tasks and environments (tourism, medical assistance, etc.), but for others it imposes too great a restriction to be useful. Translation of broadcast news and speeches, for example, is only possible if a system can accommodate or adapt to a broad variety of topics, an unlimited vocabulary, and a free speaking style.

Dimensions that increase uncertainty and ambiguity in speech, and hence present challenges for speech translation systems, are signal degradation/noise, vocabulary size/perplexity, spontaneity/disfluencies/speaking style, domain size, and speed requirements. Consequently, the field has progressed to date from highly restrictive demonstration systems to free simultaneous translation of spontaneous speech about unlimited topics, pushing back on each of the restrictions successively.
[TABLE 1] SUMMARY OF SYSTEM-LEVEL CAPABILITIES.

SYSTEM | YEARS | VOCABULARY | SPEAKING STYLE | DOMAIN | SPEED | PLATFORM | EXAMPLE SYSTEMS
FIRST DIALOG DEMONSTRATION SYSTEMS | 1989–1993 | RESTRICTED | CONSTRAINED | LIMITED | 2–10 × RT | WORKSTATION | C-STAR-I
ONE-WAY PHRASEBOOKS | 1997–PRESENT | RESTRICTED, MODIFIABLE | CONSTRAINED | LIMITED | 1–3 × RT | HANDHELD | PHRASELATOR, ECTACO
SPONTANEOUS TWO-WAY SYSTEMS | 1993–PRESENT | UNRESTRICTED | SPONTANEOUS | LIMITED | 1–5 × RT | PC/HANDHELD DEVICES | C-STAR, VERBMOBIL, NESPOLE, BABYLON, TRANSTAC
TRANSLATION OF BROADCAST NEWS, POLITICAL SPEECHES | 2003–PRESENT | UNRESTRICTED | READ/PREPARED SPEECH | OPEN | OFFLINE | PCS, PC-CLUSTERS | NSF-STRDUST, EC TC-STAR, DARPA GALE
SIMULTANEOUS TRANSLATION OF LECTURES | 2005–PRESENT | UNRESTRICTED | SPONTANEOUS | OPEN | REAL TIME | PC, LAPTOP | LECTURE TRANSLATOR
RESTRICTED DOMAIN, RESTRICTED SPEAKING STYLE
The first speech translation systems date back to the late 1980s and early 1990s [5], [9], [12]. They were demonstration systems that showed the concept of speech translation and proved that speech translation was possible at all. They attracted a great deal of attention, as they showed that bridging the language divide by spoken language might indeed be possible [32]. These early systems did not permit free dialog and required speakers to act out prescribed sentence patterns or allowable sentences in a correct speaking style according to a restricted syntax and/or a restricted vocabulary. Nevertheless, they were systems that were proposed at a time when the idea of speaker-independent, continuous speech recognition was still a novelty and MT was considered close to impossible.
DOMAIN-LIMITED, SPONTANEOUS SPEECH
In 1992, it was already recognized that these early concept demonstration systems fell short of being usable, as speakers had to speak in a well-behaved manner and remember the words and sentences they would be allowed to say. The most unacceptable constraints were the vocabulary, syntax, and speaking-style limitations, as it is generally not possible for humans to speak flawlessly in a limited speaking style (effectively reading sentences) or remember a limited set of words or syntactic patterns.

By contrast, it is generally possible for humans to stick to a domain of discourse when solving certain limited tasks. Many important applications are inherently domain limited, thus making spontaneous domain-limited speech translation a practically useful technology. Hotel bookings, car rentals, taxis and shopping negotiations, medical assistance, emergency relief, hotel/hospital/conference registration, force protection, military/police missions, and many more all require only dialogs in a limited domain. But they do require accuracy, speed, and an acceptable human-factors design.

More advanced technology was developed to address the limitation of speaking style: two-way dialogs handling free spontaneous speech input, both in recognition and translation. Spontaneous speech is a requirement for two-way dialog systems, as the input of the respondent cannot be controlled or restricted. For spontaneous dialogs, we must relax syntactic constraints and allow for variations in expression. Two approaches have been popular: the interlingua approach and the direct statistical approach. With the former, semantic constraints can be exploited to extract a possible interpretation from fragmented input. This is feasible for limited-domain applications, where the typical concepts and arguments can be enumerated and represented. The statistical approach, by contrast, accommodates ill-formed input by using large translation and language models to compute the statistically most likely word sequence [13]. The first spontaneous speech translation systems were demonstrated in the early 1990s under the Consortium for Speech Translation Advanced Research (C-STAR) (http://www.c-star.org) [9], [10], [16]. Considerable work has continued from the 1990s until today in Japan, Europe, and the United States, with large consortia and national projects supporting research [C-STAR, Verbmobil (http://verbmobil.dfki.de), Negotiating through Spoken Language in E-Commerce (Nespole) (http://nespole.itc.it), Enthusiast, Digital Olympics, Babylon, and Spoken Language Communication and Translation System for Tactical Use (Transtac) (http://www.darpa.mil/ipto/programs/transtac/transtac.asp)].

[FIG2] Phraselator (courtesy of Voxtec International, Annapolis, MD, http://www.voxtec.com).
PORTABLE, FIELDABLE SYSTEMS
More recently, portable, fieldable speech-to-speech translation
systems have been developed around wearable platforms (lap-
tops, PDAs). This may impose additional hardware-related
constraints on the ASR, SMT, and TTS components. For PDAs,
memory limitations and the lack of a floating-point unit
require redesign of algorithms and data structures. Thus, the
recognition and translation accuracy of PDA-based speech-to-
speech translation systems may decrease compared to systems
developed for laptops. In addition to continued attention to
speed, recognition, translation, and synthesis quality, usability
of the user interface, microphone type, place and number,
user training, and field maintenance must be considered.
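As one small, purely hypothetical example of such an algorithmic redesign, log-probability scores can be quantized offline into scaled integers so that run-time scoring needs only integer additions:

# Hypothetical sketch of fixed-point scoring for FPU-less PDA processors:
# log-probabilities are quantized offline into scaled integers, so that
# run-time score accumulation uses only integer arithmetic. The scaling
# factor and the example scores are assumptions for illustration.

SCALE = 1 << 10   # fixed-point scaling factor (assumed)

def to_fixed(log_prob):
    # Done once, offline, when the models are built.
    return int(round(log_prob * SCALE))

def accumulate(fixed_scores):
    # At run time, only integer additions are needed.
    total = 0
    for s in fixed_scores:
        total += s
    return total

# Three frame scores quantized offline, accumulated online:
frames = [to_fixed(-2.3), to_fixed(-0.7), to_fixed(-1.9)]
print(accumulate(frames), accumulate(frames) / SCALE)   # approx. -4.9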
Figure 2 and Figure 3 show two mobile speech translators.
The first, the Phraselator (http://www.sarich.com/translator), is a pragmatic approach based on restricted-domain/restricted-speaking-style technology (Figure 2). This approach does
not address the problem of speaking style, but it relaxes
vocabulary restrictions and provides speakable phrases on a
hand-held device. Sometimes called a “one-way,” it does not
allow for free dialogs between two conversants (this requires
spontaneous speech), but it permits speech entry of a list of
useful phrases for a given situation. Figure 3 is a two-way
device, the Pocket Translator, based on two-way speech trans-
lation technology described above. It runs on a standard PDA
platform and permits spoken input for travel, medical, and
military domains. A push-to-talk button on the device acti-
vates the system. The display shows recognition output, back-
translation for verification, and translation output. A
combination of using common pretranslated phrases by clas-
sifiers and look-up and performing actual translation has also
[FIG3] A PDA two-way pocket translator (English-Thai)
(courtesy of Mobile Technologies, LLC, Pittsburgh, PA, http://
www.mobytrans.com).