United States Patent | 6,421,645 |
Beigi, et al. | July 16, 2002 |
A method and apparatus are disclosed for automatically transcribing audio information from an audio-video source and concurrently identifying the speakers. The disclosed audio transcription and speaker classification system includes a speech recognition system, a speaker segmentation system and a speaker identification system. A common front-end processor computes feature vectors that are processed along parallel branches in a multi-threaded environment by the speech recognition system, speaker segmentation system and speaker identification system, for example, using a shared memory architecture that acts in a server-like manner to distribute the computed feature vectors to a channel associated with each parallel branch. The speech recognition system produces transcripts with time-alignments for each word in the transcript. The speaker segmentation system separates the speakers and identifies all possible frames where there is a segment boundary between non-homogeneous speech portions. The speaker identification system thereafter uses an enrolled speaker database to assign a speaker to each identified segment. The audio information from the audio-video source is concurrently transcribed and segmented to identify segment boundaries. Thereafter, the speaker identification system assigns a speaker label to each portion of the transcribed text.
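The shared front-end arrangement described in the abstract can be illustrated with a minimal sketch. It assumes one in-memory queue per branch as the "channel", a sentinel object to mark the end of the audio, and a placeholder feature (mean and peak of each frame) standing in for whatever acoustic features the actual system computes. The names `front_end`, `branch`, `run_pipeline` and `BRANCHES` are hypothetical and do not come from the patent, and the branch workers here only collect the vectors they receive rather than performing recognition, segmentation or identification.

```python
import queue
import threading

# Hypothetical branch names mirroring the three systems named in the abstract.
BRANCHES = ("speech_recognition", "speaker_segmentation", "speaker_identification")

_END = object()  # sentinel marking the end of the feature stream


def front_end(frames, channels):
    """Compute each feature vector once and publish it to every channel."""
    for index, frame in enumerate(frames):
        # Placeholder feature (assumption): mean and peak amplitude of the frame.
        vector = (sum(frame) / len(frame), max(frame))
        for channel in channels.values():
            channel.put((index, vector))
    for channel in channels.values():
        channel.put(_END)


def branch(name, channel, results):
    """Consume feature vectors from one channel until the sentinel arrives."""
    received = []
    while True:
        item = channel.get()
        if item is _END:
            break
        received.append(item)
    results[name] = received  # a real branch would decode, segment or identify


def run_pipeline(frames):
    """Run the front end and the three branches concurrently."""
    channels = {name: queue.Queue() for name in BRANCHES}
    results = {}
    workers = [
        threading.Thread(target=branch, args=(name, channels[name], results))
        for name in BRANCHES
    ]
    for worker in workers:
        worker.start()
    front_end(frames, channels)
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    output = run_pipeline([[1.0, 3.0], [2.0, 2.0]])
    for name in BRANCHES:
        print(name, output[name])
```

Each branch sees the same sequence of feature vectors in the same order, so the features are computed only once regardless of how many consumers are attached.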
Inventors: | Beigi; Homayoon Sadr Mohammad (Yorktown Heights, NY); Tritschler; Alain Charles Louis (New York, NY); Viswanathan; Mahesh (Yorktown Heights, NY) |
Assignee: | International Business Machines Corporation (Armonk, NY) |
Appl. No.: | 345237 |
Filed: | June 30, 1999 |
Current U.S. Class: | 704/272; 704/500; 704/275; 704/251 |
Intern'l Class: | G10L 015/00 |
Field of Search: | 704/231,500,245,239,241,240,256,255,251,235,253,270,257,272,275,260,236,238 |
5659662 | Aug., 1997 | Wilcox et al. | 704/245. |
6185527 | Feb., 2001 | Petkovic et al. | 704/231. |
ICASSP-97. Roy et al., "Speaker Identification based text to audio alignment for audio retrieval system". pp. 1099-1102, vol. 2. Apr. 1997.* S. Dharanipragada et al., "Experimental Results in Audio Indexing," Proc. ARPA SLT Workshop, (Feb. 1996). L. Polymenakos et al., "Transcription of Broadcast News--Some Recent Improvements to IBM's LVCSR System," Proc. ARPA SLT Workshop, (Feb. 1996). R. Bakis, "Transcription of Broadcast News Shows with the IBM Large Vocabulary Speech Recognition System," Proc. ICASSP98, Seattle, WA (1998). H. Beigi et al., "A Distance Measure Between Collections of Distributions and its Application to Speaker Recognition," Proc. ICASSP98, Seattle, WA (1998). S. Chen, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion," Proceedings of the Speech Recognition Workshop (1998). S. Chen et al., "Clustering via the Bayesian Information Criterion with Applications in Speech Recognition," Proc. ICASSP98, Seattle, WA (1998). S. Chen et al., "IBM's LVCSR System for Transcription of Broadcast News Used in the 1997 Hub4 English Evaluation," Proceedings of the Speech Recognition Workshop (1998). S. Dharanipragada et al., "A Fast Vocabulary Independent Algorithm for Spotting Words in Speech," Proc. ICASSP98, Seattle, WA (1998). J. Navratil et al., "An Efficient Phonotactic-Acoustic system for Language Identification," Proc. ICASSP98, Seattle, WA (1998). G. N. Ramaswamy et al., "Compression of Acoustic Features for Speech Recognition in Network Environments," Proc. ICASSP98, Seattle, WA (1998). S. Chen et al., "Recent Improvements to IBM's Speech Recognition System for Automatic Transcription of Broadcast News," Proceedings of the Speech Recognition Workshop (1999). S. Dharanipragada et al., "Story Segmentation and Topic Detection in the Broadcast News Domain," Proceedings of the Speech Recognition Workshop (1999). C. Neti et al., "Audio-Visual Speaker Recognition for Video Broadcast News," Proceedings of the Speech Recognition Workshop (1999). |