Speech Recognition using Neural Networks
Joe Tebelskis
May 1995
CMU-CS-95-142
School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213-3890
Submitted in partial fulfillment of the requirements for
the degree of Doctor of Philosophy in Computer Science
Thesis Committee:
Alex Waibel, chair
Raj Reddy
Jaime Carbonell
Richard Lippmann, MIT Lincoln Labs
Copyright © 1995 Joe Tebelskis
This research was supported during separate phases by ATR Interpreting Telephony Research Laboratories, NEC Corporation, Siemens AG, the National Science Foundation, the Advanced Research Projects Administration, and the Department of Defense under Contract No. MDA904-92-C-5161.
The views and conclusions contained in this document are those of the author and should not be interpreted as
representing the official policies, either expressed or implied, of ATR, NEC, Siemens, NSF, or the United
States Government.
Keywords: Speech recognition, neural networks, hidden Markov models, hybrid systems, acoustic modeling, prediction, classification, probability estimation, discrimination, global optimization.
Abstract
This thesis examines how artificial neural networks can benefit a large vocabulary, speaker independent, continuous speech recognition system. Currently, most speech recognition systems are based on hidden Markov models (HMMs), a statistical framework that supports both acoustic and temporal modeling. Despite their state-of-the-art performance, HMMs make a number of suboptimal modeling assumptions that limit their potential effectiveness. Neural networks avoid many of these assumptions, while they can also learn complex functions, generalize effectively, tolerate noise, and support parallelism. While neural networks can readily be applied to acoustic modeling, it is not yet clear how they can be used for temporal modeling. Therefore, we explore a class of systems called NN-HMM hybrids, in which neural networks perform acoustic modeling, and HMMs perform temporal modeling. We argue that a NN-HMM hybrid has several theoretical advantages over a pure HMM system, including better acoustic modeling accuracy, better context sensitivity, more natural discrimination, and a more economical use of parameters. These advantages are confirmed experimentally by a NN-HMM hybrid that we developed, based on context-independent phoneme models, that achieved 90.5% word accuracy on the Resource Management database, in contrast to only 86.0% accuracy achieved by a pure HMM under similar conditions.

In the course of developing this system, we explored two different ways to use neural networks for acoustic modeling: prediction and classification. We found that predictive networks yield poor results because of a lack of discrimination, but classification networks gave excellent results. We verified that, in accordance with theory, the output activations of a classification network form highly accurate estimates of the posterior probabilities P(class | input), and we showed how these can easily be converted to likelihoods P(input | class) for standard HMM recognition algorithms. Finally, this thesis reports how we optimized the accuracy of our system with many natural techniques, such as expanding the input window size, normalizing the inputs, increasing the number of hidden units, converting the network's output activations to log likelihoods, optimizing the learning rate schedule by automatic search, backpropagating error from word level outputs, and using gender dependent networks.
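The posterior-to-likelihood conversion mentioned above follows from Bayes' rule: P(input | class) = P(class | input) · P(input) / P(class), and since P(input) is the same for all classes in a given frame, dividing each posterior by its class prior yields a scaled likelihood that is equivalent for HMM decoding. A minimal sketch of this idea, with made-up numbers and a hypothetical helper function (not code from the thesis):

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors):
    """Divide network posteriors P(class | input) by class priors P(class).

    posteriors: array of shape (frames, classes), e.g. softmax outputs.
    priors: array of shape (classes,), estimated from training data counts.
    Returns scaled likelihoods P(input | class) / P(input), which can
    replace true likelihoods in HMM decoding since P(input) is constant
    within each frame.
    """
    return posteriors / priors

# Toy frame with three phoneme classes (illustrative values only):
posteriors = np.array([[0.7, 0.2, 0.1]])
priors = np.array([0.5, 0.3, 0.2])

scaled = posteriors_to_scaled_likelihoods(posteriors, priors)
log_likelihoods = np.log(scaled)  # HMM decoders typically work in the log domain
```

Note that the division can reorder the classes relative to the raw posteriors: a class with a high posterior but a very common prior may end up with a lower scaled likelihood than a rarer class.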
Acknowledgements
I wish to thank Alex Waibel for the guidance, encouragement, and friendship that he managed to extend to me during our six years of collaboration over all those inconvenient oceans — and for his unflagging efforts to provide a world-class, international research environment, which made this thesis possible. Alex's scientific integrity, humane idealism, good cheer, and great ambition have earned him my respect, plus a standing invitation to dinner whenever he next passes through my corner of the world. I also wish to thank Raj Reddy, Jaime Carbonell, and Rich Lippmann for serving on my thesis committee and offering their valuable suggestions, both on my thesis proposal and on this final dissertation. I would also like to thank Scott Fahlman, my first advisor, for channeling my early enthusiasm for neural networks, and teaching me what it means to do good research.

Many colleagues around the world have influenced this thesis, including past and present members of the Boltzmann Group, the NNSpeech Group at CMU, and the NNSpeech Group at the University of Karlsruhe in Germany. I especially want to thank my closest collaborators over these years — Bojan Petek, Otto Schmidbauer, Torsten Zeppenfeld, Hermann Hild, Patrick Haffner, Arthur McNair, Tilo Sloboda, Monika Woszczyna, Ivica Rogina, Michael Finke, and Thorsten Schueler — for their contributions and their friendship. I also wish to acknowledge valuable interactions I've had with many other talented researchers, including Fil Alleva, Uli Bodenhausen, Herve Bourlard, Lin Chase, Mike Cohen, Mark Derthick, Mike Franzini, Paul Gleichauff, John Hampshire, Nobuo Hataoka, Geoff Hinton, Xuedong Huang, Mei-Yuh Hwang, Ken-ichi Iso, Ajay Jain, Yochai Konig, George Lakoff, Kevin Lang, Chris Lebiere, Kai-Fu Lee, Ester Levin, Stefan Manke, Jay McClelland, Chris McConnell, Abdelhamid Mellouk, Nelson Morgan, Barak Pearlmutter, Dave Plaut, Dean Pomerleau, Steve Renals, Roni Rosenfeld, Dave Rumelhart, Dave Sanner, Hidefumi Sawai, David Servan-Schreiber, Bernhard Suhm, Sebastian Thrun, Dave Touretzky, Minh Tue Voh, Wayne Ward, Christoph Windheuser, and Michael Witbrock. I am especially indebted to Yochai Konig at ICSI, who was extremely generous in helping me to understand and reproduce ICSI's experimental results; and to Arthur McNair for taking over the Janus demos in 1992 so that I could focus on my speech research, and for constantly keeping our environment running so smoothly. Thanks to Hal McCarter and his colleagues at Adaptive Solutions for their assistance with the CNAPS parallel computer; and to Nigel Goddard at the Pittsburgh Supercomputer Center for help with the Cray C90. Thanks to Roni Rosenfeld, Lin Chase, and Michael Finke for proofreading portions of this thesis.

I am also grateful to Robert Wilensky for getting me started in Artificial Intelligence, and especially to both Douglas Hofstadter and Allen Newell for sharing some treasured, pivotal hours with me.