Speech Recognition using Neural Networks
Joe Tebelskis
May 1995
CMU-CS-95-142
School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213-3890
Submitted in partial fulfillment of the requirements for
the degree of Doctor of Philosophy in Computer Science
Thesis Committee:
Alex Waibel, chair
Raj Reddy
Jaime Carbonell
Richard Lippmann, MIT Lincoln Labs
Copyright © 1995 Joe Tebelskis
This research was supported during separate phases by ATR Interpreting Telephony Research Laboratories, NEC Corporation, Siemens AG, the National Science Foundation, the Advanced Research Projects Administration, and the Department of Defense under Contract No. MDA904-92-C-5161.
The views and conclusions contained in this document are those of the author and should not be interpreted as
representing the official policies, either expressed or implied, of ATR, NEC, Siemens, NSF, or the United
States Government.
Keywords: Speech recognition, neural networks, hidden Markov models, hybrid systems, acoustic modeling, prediction, classification, probability estimation, discrimination, global optimization.
Abstract
This thesis examines how artificial neural networks can benefit a large vocabulary, speaker independent, continuous speech recognition system. Currently, most speech recognition systems are based on hidden Markov models (HMMs), a statistical framework that supports both acoustic and temporal modeling. Despite their state-of-the-art performance, HMMs make a number of suboptimal modeling assumptions that limit their potential effectiveness. Neural networks avoid many of these assumptions, while they can also learn complex functions, generalize effectively, tolerate noise, and support parallelism. While neural networks can readily be applied to acoustic modeling, it is not yet clear how they can be used for temporal modeling. Therefore, we explore a class of systems called NN-HMM hybrids, in which neural networks perform acoustic modeling, and HMMs perform temporal modeling. We argue that a NN-HMM hybrid has several theoretical advantages over a pure HMM system, including better acoustic modeling accuracy, better context sensitivity, more natural discrimination, and a more economical use of parameters. These advantages are confirmed experimentally by a NN-HMM hybrid that we developed, based on context-independent phoneme models, that achieved 90.5% word accuracy on the Resource Management database, in contrast to only 86.0% accuracy achieved by a pure HMM under similar conditions.

In the course of developing this system, we explored two different ways to use neural networks for acoustic modeling: prediction and classification. We found that predictive networks yield poor results because of a lack of discrimination, but classification networks gave excellent results. We verified that, in accordance with theory, the output activations of a classification network form highly accurate estimates of the posterior probabilities P(class | input), and we showed how these can easily be converted to likelihoods P(input | class) for standard HMM recognition algorithms. Finally, this thesis reports how we optimized the accuracy of our system with many natural techniques, such as expanding the input window size, normalizing the inputs, increasing the number of hidden units, converting the network's output activations to log likelihoods, optimizing the learning rate schedule by automatic search, backpropagating error from word level outputs, and using gender dependent networks.
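The posterior-to-likelihood conversion mentioned above follows from Bayes' rule: P(input | class) = P(class | input) · P(input) / P(class), and since P(input) is the same for all classes in a given frame, dividing each posterior by its class prior yields a scaled likelihood that is equivalent for HMM decoding. A minimal sketch of this idea, with made-up numbers and a hypothetical helper function (not code from the thesis):

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors):
    """Divide network posteriors P(class | input) by class priors P(class).

    posteriors: array of shape (frames, classes), e.g. softmax outputs.
    priors: array of shape (classes,), estimated from training data counts.
    Returns scaled likelihoods P(input | class) / P(input), which can
    replace true likelihoods in HMM decoding since P(input) is constant
    within each frame.
    """
    return posteriors / priors

# Toy frame with three phoneme classes (illustrative values only):
posteriors = np.array([[0.7, 0.2, 0.1]])
priors = np.array([0.5, 0.3, 0.2])

scaled = posteriors_to_scaled_likelihoods(posteriors, priors)
log_likelihoods = np.log(scaled)  # HMM decoders typically work in the log domain
```

Note that the division can reorder the classes relative to the raw posteriors: a class with a high posterior but a very common prior may end up with a lower scaled likelihood than a rarer class.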
Acknowledgements
I wish to thank Alex Waibel for the guidance, encouragement, and friendship that he managed to extend to me during our six years of collaboration over all those inconvenient oceans — and for his unflagging efforts to provide a world-class, international research environment, which made this thesis possible. Alex's scientific integrity, humane idealism, good cheer, and great ambition have earned him my respect, plus a standing invitation to dinner whenever he next passes through my corner of the world. I also wish to thank Raj Reddy, Jaime Carbonell, and Rich Lippmann for serving on my thesis committee and offering their valuable suggestions, both on my thesis proposal and on this final dissertation. I would also like to thank Scott Fahlman, my first advisor, for channeling my early enthusiasm for neural networks, and teaching me what it means to do good research.

Many colleagues around the world have influenced this thesis, including past and present members of the Boltzmann Group, the NNSpeech Group at CMU, and the NNSpeech Group at the University of Karlsruhe in Germany. I especially want to thank my closest collaborators over these years — Bojan Petek, Otto Schmidbauer, Torsten Zeppenfeld, Hermann Hild, Patrick Haffner, Arthur McNair, Tilo Sloboda, Monika Woszczyna, Ivica Rogina, Michael Finke, and Thorsten Schueler — for their contributions and their friendship. I also wish to acknowledge valuable interactions I've had with many other talented researchers, including Fil Alleva, Uli Bodenhausen, Herve Bourlard, Lin Chase, Mike Cohen, Mark Derthick, Mike Franzini, Paul Gleichauff, John Hampshire, Nobuo Hataoka, Geoff Hinton, Xuedong Huang, Mei-Yuh Hwang, Ken-ichi Iso, Ajay Jain, Yochai Konig, George Lakoff, Kevin Lang, Chris Lebiere, Kai-Fu Lee, Ester Levin, Stefan Manke, Jay McClelland, Chris McConnell, Abdelhamid Mellouk, Nelson Morgan, Barak Pearlmutter, Dave Plaut, Dean Pomerleau, Steve Renals, Roni Rosenfeld, Dave Rumelhart, Dave Sanner, Hidefumi Sawai, David Servan-Schreiber, Bernhard Suhm, Sebastian Thrun, Dave Touretzky, Minh Tue Voh, Wayne Ward, Christoph Windheuser, and Michael Witbrock. I am especially indebted to Yochai Konig at ICSI, who was extremely generous in helping me to understand and reproduce ICSI's experimental results; and to Arthur McNair for taking over the Janus demos in 1992 so that I could focus on my speech research, and for constantly keeping our environment running so smoothly. Thanks to Hal McCarter and his colleagues at Adaptive Solutions for their assistance with the CNAPS parallel computer; and to Nigel Goddard at the Pittsburgh Supercomputer Center for help with the Cray C90. Thanks to Roni Rosenfeld, Lin Chase, and Michael Finke for proofreading portions of this thesis.

I am also grateful to Robert Wilensky for getting me started in Artificial Intelligence, and especially to both Douglas Hofstadter and Allen Newell for sharing some treasured, pivotal hours with me.