Spoken Language Translation
[ Alex Waibel and Christian Fügen ]
[ Enabling cross-lingual human–human communication ]
Digital Object Identifier 10.1109/MSP.2008.918415

During the past 15 years, speech translation has grown from an oddity at the fringe of speech and language processing conferences to one of the main pillars of current research activity. The expanding interest and excitement can be explained by a convergence of emerging and powerful new technical capabilities and a growing appreciation of the needs for better cross-lingual communication in a globalizing world [31].
Governments, commercial enterprises, and academic and humanitarian organi-
zations all face internationalization and globalization at an unprecedented scale.
Security, effectiveness in trade and commerce, market size, and competitive reach
all depend on global information awareness and the ability to interact and commu-
nicate globally. Increased integration (e.g., witness the integration efforts in Europe
and Asia) requires natural, yet effective, international cross-lingual communication.
It is true that certain language groups share common languages of communication (English, Spanish, Mandarin), but language abilities vary and often prevent true integration and equal opportunity for all. Effective solutions addressing the linguistic divide (not just the “digital divide”) could therefore offer considerable practical and economic benefits.
For the research community, speech translation also presents fascinating new prob-
lems that appear solvable by the introduction of considerable computing resources, seem-
ingly unlimited Web data, and promising new machine learning techniques. Despite the
promise and potential, considerable improvements are still needed in the component
technologies: speech recognition, machine translation (MT), and speech synthesis.
Moreover, to achieve effective cross-lingual human–human communication in practice,
not only do recognition and translation error rates matter but also the user interface and
overall system design in each communication setting.
In the following, we present an overview of the field of speech translation. We review
the history of the field, the main achievements, and remaining challenges. We discuss the
main approaches and the most promising applications. We also address the human factors
of delivering and deploying speech translation in different human communicative scenar-
ios and discuss issues regarding scaling the technology across domains, speaking styles,
and languages.
TECHNOLOGY
Speech translation systems typically consist of three components
(see Figure 1): automatic speech recognition (ASR), MT, and text-
to-speech synthesis (TTS). The underlying technologies for these
three components have been developed independently, and many of their performance issues and techniques apply to speech translation as well.
Clearly, better ASR, MT, or TTS
performance makes for better
speech translation performance.
However, a speech translation system is not only the cascade of
its parts. Since the goal is to produce output in a target language,
the correctness of the components’ output is of secondary con-
cern. Uncertainty at the component level can be addressed by
being noncommittal at their interface, linking components via
near-miss hypothesis lattices [1]. Such a view also offers the pos-
sibility of jointly optimizing components based on the overall
output rather than each component independently [2].
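Purely as an illustrative sketch of such loose coupling (the component classes and method names below are hypothetical, not the interface of any particular system), the cascade can be wired so that the recognizer hands several near-miss hypotheses, rather than a single best string, to the translator:

# Illustrative sketch of a loosely coupled speech translation cascade.
# Recognizer, Translator, and Synthesizer are hypothetical interfaces,
# not an actual toolkit API.

class SpeechTranslator:
    def __init__(self, recognizer, translator, synthesizer):
        self.recognizer = recognizer    # ASR: audio -> n-best list / lattice
        self.translator = translator    # MT: source hypotheses -> target text
        self.synthesizer = synthesizer  # TTS: target text -> audio

    def translate_utterance(self, audio):
        # Stay noncommittal at the interface: pass several near-miss
        # hypotheses (with scores) instead of a single 1-best string ...
        nbest = self.recognizer.recognize_nbest(audio, n=5)
        # ... and let the translator rescore them jointly with its own
        # translation and language models before committing to an output.
        target_text = self.translator.translate_nbest(nbest)
        return self.synthesizer.synthesize(target_text)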
[FIG1] Schematic overview of a speech-to-speech translation system and its models: (a) direct approach using, e.g., statistical MT, and (b) interlingua approach using an interchange language.

AUTOMATIC SPEECH RECOGNITION
The ASR component of a speech translation system, of course, faces all the challenges that are typical for ASR in general: noise, disfluencies, vocabulary size, and language perplexity, which complicate recognition and increase recognition errors. Recognition quality is generally measured in terms of word error rate (WER) as compared with a reference transcription. In addition to WER, other factors must be included to judge the capabilities of a recognizer: the language model perplexity (a measure of information/surprise provided by the word sequence), speed, memory usage, processor requirements, and microphone positioning. For speech translation a few of these challenges are particularly noteworthy.

First off, ASR error rates should generally be lower for translation to make sense (below roughly 10%) than for other ASR applications that can tolerate higher error rates (e.g., retrieval). Since many speech translation applications involve free spontaneous human dialog, however, such low error rates are more difficult to obtain: spontaneous dialogs tend to be disfluent, containing false starts, hesitations, and repetitions, and spontaneous speech is less clearly articulated, leading to higher error rates. Speech translation dialogs also often involve accents, as regional variations and cross-language expressions enter into use. Furthermore, many speech translation tasks are affected by environmental noise, or by issues deriving from microphone positioning and type. In a two-way system for doctor–patient dialogs, for example, it may be feasible for the owner of a system to wear a close-speaking microphone but not for the patient.

The ASR component must also provide a useful indication of sentence boundaries so that a subsequent translation engine can translate a sentence or fragment into another language. In many tasks, a continuous stream of voice (broadcast news, speeches, lectures, etc.) is presented, so that such sentence-level segments must be inferred automatically. The resulting segmentation algorithms use natural pauses, language model statistics, and prosodic cues to infer such segments. Optimizing subsequent translation
quality and minimizing latency are both important consid-
erations for an effective design.
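To make the segmentation idea concrete, the following is a minimal sketch; the thresholds and the language-model boundary probabilities are assumed values for illustration, not those of a deployed system, and prosodic cues could enter as an additional feature:

# A minimal sketch of pause- and language-model-based segmentation of a
# continuous ASR word stream into translatable units. Thresholds and
# boundary probabilities are assumed values for illustration only.

def segment(words, pause_after, lm_boundary_prob,
            pause_threshold=0.4, lm_threshold=0.6):
    """words[i] is followed by a silence of pause_after[i] seconds;
    lm_boundary_prob[i] is an (assumed) LM estimate that a sentence
    boundary follows word i."""
    segments, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if pause_after[i] >= pause_threshold or lm_boundary_prob[i] >= lm_threshold:
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments

# Example: the stream is split after "today" because of a long pause.
print(segment(["hello", "everyone", "today", "we", "discuss", "translation"],
              [0.05, 0.10, 0.80, 0.05, 0.05, 0.90],
              [0.10, 0.20, 0.70, 0.05, 0.10, 0.95]))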
To improve ASR modeling, more training and/or adaptation
data are required, but conversational dialog data are hard to col-
lect and “natural” dialogs
through a speech translation
device are difficult to simulate.
Similarly, text data for lan-
guage modeling and diction-
ary construction may be
available for certain speech
translation tasks (broadcast
news, parliamentary speeches)
but not for others (dialogs,
lectures). Construction of a suitable recognition “word” lexicon
can also be a problem if a language provides for many inflections
of its root forms (morphology). Depending on the language, text data for language model training and dictionary construction may in fact not even be available at all if the language exists only in spoken form or is a dialect.
ASR has a number of added practical requirements that are
of special importance to speech translation. These include speed
requirements when dialog completion is at stake or proper han-
dling of named entities (city names, person names, food, med-
ication, symptoms, etc.) as they vary in the field and application
and are essential for translation. Since effective end-to-end com-
munication is the goal, ASR components frequently output
hypothesis lattice structures (confusion networks) and confi-
dence measures to pass multiple near-miss alternatives to subse-
quent translation components. These allow for better
integration and overall system optimization.
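A purely illustrative sketch of such an interface is shown below; the slot structure, the example words, and the confidence values are invented here and do not correspond to any particular system:

# Illustrative sketch of a confusion-network ("sausage") interface between
# ASR and MT: a sequence of slots, each holding competing word alternatives
# with confidence scores. Structure and values are invented for illustration.

from dataclasses import dataclass

@dataclass
class Slot:
    alternatives: dict   # word -> posterior confidence

    def best(self):
        return max(self.alternatives, key=self.alternatives.get)

# Near-miss alternatives for a short utterance:
network = [
    Slot({"i": 0.9, "eye": 0.1}),
    Slot({"need": 0.6, "kneed": 0.3, "knee": 0.1}),
    Slot({"medication": 0.8, "meditation": 0.2}),
]

# A downstream translator can rescore whole paths with its own models
# rather than committing to the single best word in each slot.
one_best = " ".join(slot.best() for slot in network)
print(one_best)   # "i need medication"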
SPOKEN LANGUAGE TRANSLATION
For MT of text, the choice of technology and design remains a topic of discussion. Three different approaches have been popular in MT: the direct approach, the transfer approach, and the interlingua approach. In the first, a direct mapping between source language and target language is attempted, while transfer and interlingua approaches attempt to extract deeper linguistic structures first. Most commonly, transfer approaches will perform a syntactic analysis and transfer the derived structures from the source to the target language for generation of the target language sentence. Interlingua approaches [3] attempt to derive a semantic representation of an input sentence first and then generate a sentence in the target language from those semantic concepts. Direct approaches bypass this analysis and map input sentences directly onto a target language sentence. While early attempts at direct translation were rejected due to the high ambiguity of language, they have regained considerable following and popularity with the advent of statistical data-driven approaches, such as example-based and statistical MT [4], [23].

All three MT approaches have been used for speech translation as well, each with notable advantages and disadvantages [5], [9]–[11], [13]. The interlingua approach has the advantage that it can connect N languages in any combination through its common semantic representation and therefore does not require the development of O(N^2) language translators. It also permits regeneration of a paraphrase in a speaker’s own language for verification. A semantic representation also strips the input surface realization of all its disfluencies and colloquialisms and can lead to a clean and semantically equivalent utterance in the target language. The biggest drawback of the interlingua approach is the manual development of semantic parsers and the complications in designing a semantic representation common to all languages.

Statistical MT, by contrast, can handle the ambiguities of language by a stochastic source-channel model, much like today’s speech recognizers do. With it, the most likely target language word string ê given a source language word string f is estimated by way of Bayes’ rule as the product of a translation model p(f | e) and a language model p(e):

ê = arg max_e { p(e | f) } = arg max_e { p(e) p(f | e) }.
Effectively, the model combines the probabilities of different
translations of words in a sentence with the monolingual
likelihood of each resulting word sequence to determine the
most likely translation of that sentence. Instead of just mod-
eling this as a noisy channel approach, current SMT systems
use a log linear combination of a number of feature func-
tions that model important aspects including a language
model, a word reordering model, word penalties, and various
other translation models [24], [25], [27]. The SMT approach
has the advantage that it requires no manual development of
grammars or representations but is trained on large
amounts of translated reference texts (parallel corpora). Its drawbacks are its need for large parallel corpora, its lack of a common representation to connect multiple languages, and its difficulty with highly disfluent input.
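As a toy illustration of this log-linear decision rule (all feature values and weights below are invented, not taken from a real system), the chosen translation is simply the candidate that maximizes the weighted sum of its feature scores:

# Toy illustration of log-linear scoring in SMT: each candidate translation e
# is scored as a weighted sum of feature values (log translation model,
# log language model, word penalty, reordering), and the argmax wins.
# All numbers and weights are invented for illustration only.

def score(features, weights):
    return sum(weights[name] * value for name, value in features.items())

weights = {"log_p_tm": 1.0,       # translation model log p(f|e)
           "log_p_lm": 0.8,       # target language model log p(e)
           "word_penalty": -0.3,  # discourages overly long outputs
           "reordering": 0.5}     # distortion / reordering score

candidates = {
    "where is the station": {"log_p_tm": -4.1, "log_p_lm": -6.2,
                             "word_penalty": 4, "reordering": -0.5},
    "where the station is": {"log_p_tm": -3.9, "log_p_lm": -7.8,
                             "word_penalty": 4, "reordering": -1.5},
}

best = max(candidates, key=lambda e: score(candidates[e], weights))
print(best)   # "where is the station"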
For domain-limited translation systems (see discussion
below) the design of an interlingua has been shown to be possi-
ble and helpful, but for domain-unlimited applications (due to
their unrestricted semantic coverage) SMT methods have been
generally preferred. A number of hybrid techniques have been
proposed to retain some of the advantages of both, including
statistically trained semantic analyzers in an interlingua frame-
work [6], or using a natural language (e.g., English) as an inter-
mediate “pivot” language [7], [8] to connect multiple languages.
OUTPUT GENERATION (SPEECH, TEXT)
The output of a speech translation system most typically is synthetic speech in the target language. Alternative outputs, however, are possible depending on the purpose and ultimate use of
the speech translator (see discussion below). They include target
language text, or summaries from translations. In
human–human cross-lingual speech dialogs a speech synthesis
component generates audible
output from a translated text
string [30]. Commonly, full TTS
is used for convenience and
modularity, even though one
could arguably also synthesize
speech based on conceptual or
syntactic structures if they are
provided by the MT component.
Special concerns in TTS for
speech translators involve gen-
erating appropriate emotion,
style, and voices, so that the output speaking style corresponds
to the input speech in the source language [30]. Voice conver-
sion, in particular, attempts to generate speech in the output
language with the voice of the speaker of the input language.
PROGRESS IN SPEECH TRANSLATION
Based on the advances in component technologies, research on speech translation began in earnest during the late 1980s and early 1990s. In the following two decades, impressive speech translation systems have been developed. The systems and their progression can be categorized by distinct new system-level capabilities at each stage of development. These capabilities are summarized in Table 1.

Overall, each advance distinguishes itself by the levels of uncertainty that a given system can tolerate. Language is ambiguous at all levels, from signal to phonetics to syntax to semantics. Earlier systems have imposed greater constraints to control such ambiguity. For example, restrictions in speaking style, vocabulary, domain, and the use and operation of a system limit ambiguity and the search for translation hypotheses. Such constraints may be inherent in the task (prerecorded announcements, limited domain, or phraseology) or the recording conditions (e.g., broadcast news versus telephone conversations). Alternatively, they can be imposed as a requirement for system use. Restrictions on use, however, severely limit the usefulness of a system in many real-world situations. A domain-limited one-way system for tourists may be helpful but is limiting, as it requires the user to memorize the allowable phrases and it cannot translate back the response of the other party. It is equally limiting if the user is not allowed a hesitation (aeh, hum, etc.) while speaking, or if he/she must produce perfectly grammatical sentences to obtain useful output. To err is human, and useful systems must accommodate a speaker’s mistakes. Finally, domain limitation may be acceptable in certain tasks and environments (tourism, medical assistance, etc.), but for others it imposes too great a restriction to be useful. Translation of broadcast news and speeches, for example, is only possible if a system can accommodate or adapt to a broad variety of topics, an unlimited vocabulary, and a free speaking style.

Dimensions that increase uncertainty and ambiguity in speech, and hence present challenges for speech translation systems, are signal degradation/noise, vocabulary size/perplexity, spontaneity/disfluencies/speaking style, domain size, and speed requirements. Consequently, the field has progressed to date from highly restrictive demonstration systems to free simultaneous translation of spontaneous speech about unlimited topics, pushing back on each of the restrictions successively.
[TABLE 1] SUMMARY OF SYSTEM-LEVEL CAPABILITIES.

SYSTEM | YEARS | VOCABULARY | SPEAKING STYLE | DOMAIN | SPEED | PLATFORM | EXAMPLE SYSTEMS
FIRST DIALOG DEMONSTRATION SYSTEMS | 1989–1993 | RESTRICTED | CONSTRAINED | LIMITED | 2–10 × RT | WORKSTATION | C-STAR-I
ONE-WAY PHRASEBOOKS | 1997–PRESENT | RESTRICTED, MODIFIABLE | CONSTRAINED | LIMITED | 1–3 × RT | HANDHELD | PHRASELATOR, ECTACO
SPONTANEOUS TWO-WAY SYSTEMS | 1993–PRESENT | UNRESTRICTED | SPONTANEOUS | LIMITED | 1–5 × RT | PC/HANDHELD DEVICES | C-STAR, VERBMOBIL, NESPOLE, BABYLON, TRANSTAC
TRANSLATION OF BROADCAST NEWS, POLITICAL SPEECHES | 2003–PRESENT | UNRESTRICTED | READ/PREPARED SPEECH | OPEN | OFFLINE | PCS, PC-CLUSTERS | NSF-STRDUST, EC TC-STAR, DARPA GALE
SIMULTANEOUS TRANSLATION OF LECTURES | 2005–PRESENT | UNRESTRICTED | SPONTANEOUS | OPEN | REAL TIME | PC, LAPTOP | LECTURE TRANSLATOR
RESTRICTED DOMAIN, RESTRICTED SPEAKING STYLE
The first speech translation systems date back to the late 1980s and early 1990s [5], [9], [12]. They were demonstration systems that showed the concept of speech translation and proved that speech translation was possible at all. They attracted a great deal of attention, as they showed that bridging the language divide by spoken language might indeed be possible [32]. These early systems did not permit free dialog and required speakers to act out prescribed sentence patterns or allowable sentences in a correct speaking style according to a restricted syntax and/or a restricted vocabulary. Nevertheless, they were systems that were proposed at a time when the idea of speaker-independent, continuous speech recognition was still a novelty and MT was considered close to impossible.
DOMAIN-LIMITED, SPONTANEOUS SPEECH
In 1992, it was already recognized that these early concept demonstration systems fell short of being usable, as speakers had to speak in a well-behaved manner and remember the words and sentences they would be allowed to say. The most unacceptable constraints were the vocabulary, syntax, and speaking-style limitations, as it is generally not possible for humans to speak flawlessly in a limited speaking style (effectively reading sentences) or remember a limited set of words or syntactic patterns.

By contrast, it is generally possible for humans to stick to a domain of discourse when solving certain limited tasks. Many important applications are inherently domain limited, thus making spontaneous domain-limited speech translation a practically useful technology. Hotel bookings, car rentals, taxis and shopping negotiations, medical assistance, emergency relief, hotel/hospital/conference registration, force protection, military/police missions, and many more all require only dialogs in a limited domain. But they do require accuracy, speed, and an acceptable human-factors design.

More advanced technology was developed to address the limitation of speaking style: two-way dialogs handling free spontaneous speech input, both in recognition and translation. Spontaneous speech is a requirement for two-way dialog systems, as the input of the respondent cannot be controlled or restricted. For spontaneous dialogs, we must relax syntactic constraints and allow for variations in expression. Two approaches have been popular: the interlingua approach and the direct statistical approach. With the former, semantic constraints can be exploited to extract a possible interpretation from fragmented input. This is feasible for limited-domain applications, where the typical concepts and arguments can be enumerated and represented. The statistical approach, by contrast, accommodates ill-formed input by using large translation and language models to compute the statistically most likely word sequence [13]. The first spontaneous speech translation systems were demonstrated in the early 1990s under the Consortium for Speech Translation Advanced Research (C-STAR) (http://www.c-star.org) [9], [10], [16]. Considerable work has continued from the 1990s until today in Japan, Europe, and the United States, with large consortia and national projects supporting research [C-STAR, Verbmobil (http://verbmobil.dfki.de), Negotiating through Spoken Language in E-Commerce (Nespole) (http://nespole.itc.it), Enthusiast, Digital Olympics, Babylon, and Spoken Language Communication and Translation System for Tactical Use (Transtac) (http://www.darpa.mil/ipto/programs/transtac/transtac.asp)].

[FIG2] Phraselator (courtesy of Voxtec International, Annapolis, MD, http://www.voxtec.com).
PORTABLE, FIELDABLE SYSTEMS
More recently, portable, fieldable speech-to-speech translation
systems have been developed around wearable platforms (lap-
tops, PDAs). This may impose additional hardware-related
constraints on the ASR, SMT, and TTS components. For PDAs,
memory limitations and the lack of a floating-point unit
require redesign of algorithms and data structures. Thus, the
recognition and translation accuracy of PDA-based speech-to-
speech translation systems may decrease compared to systems
developed for laptops. In addition to continued attention to
speed, recognition, translation, and synthesis quality, usability
of the user interface, microphone type, place and number,
user training, and field maintenance must be considered.
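As one small, purely hypothetical example of such an algorithmic redesign, log-probability scores can be quantized offline into scaled integers so that run-time scoring needs only integer additions:

# Hypothetical sketch of fixed-point scoring for FPU-less PDA processors:
# log-probabilities are quantized offline into scaled integers, so that
# run-time score accumulation uses only integer arithmetic. The scaling
# factor and the example scores are assumptions for illustration.

SCALE = 1 << 10   # fixed-point scaling factor (assumed)

def to_fixed(log_prob):
    # Done once, offline, when the models are built.
    return int(round(log_prob * SCALE))

def accumulate(fixed_scores):
    # At run time, only integer additions are needed.
    total = 0
    for s in fixed_scores:
        total += s
    return total

# Three frame scores quantized offline, accumulated online:
frames = [to_fixed(-2.3), to_fixed(-0.7), to_fixed(-1.9)]
print(accumulate(frames), accumulate(frames) / SCALE)   # approx. -4.9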
Figure 2 and Figure 3 show two mobile speech translators.
The first, the Phraselator (http://www.sarich.com/translator), is a pragmatic approach based on restricted-domain/restricted-speaking-style technology (Figure 2). This approach does
not address the problem of speaking style, but it relaxes
vocabulary restrictions and provides speakable phrases on a
hand-held device. Sometimes called a “one-way,” it does not
allow for free dialogs between two conversants (this requires
spontaneous speech), but it permits speech entry of a list of
useful phrases for a given situation. Figure 3 is a two-way
device, the Pocket Translator, based on two-way speech trans-
lation technology described above. It runs on a standard PDA
platform and permits spoken input for travel, medical, and
military domains. A push-to-talk button on the device acti-
vates the system. The display shows recognition output, back-
translation for verification, and translation output. A
combination of using common pretranslated phrases by clas-
sifiers and look-up and performing actual translation has also
[FIG3] A PDA two-way pocket translator (English-Thai)
(courtesy of Mobile Technologies, LLC, Pittsburgh, PA, http://
www.mobytrans.com).