|United States Patent||6,748,356|
|Beigi, et al.||June 8, 2004|
A method and apparatus are disclosed for identifying speakers participating in an audio-video source, whether or not such speakers have been previously registered or enrolled. A speaker segmentation system separates the speakers and identifies all possible frames where there is a segment boundary between non-homogeneous speech portions. A hierarchical speaker tree clustering system clusters homogeneous segments (generally corresponding to the same speaker), and assigns a cluster identifier to each detected segment, whether or not the actual name of the speaker is known. A hierarchical enrolled speaker database is used that includes one or more background models for unenrolled speakers to assign a speaker to each identified segment. Once speech segments are identified by the segmentation system, the disclosed unknown speaker identification system compares the segment utterances to the enrolled speaker database using a hierarchical approach and finds the "closest" speaker, if any, to assign a speaker label to each identified segment. A speech segment having an unknown speaker is initially assigned a general speaker label from a set of background models for speaker identification, such as "unenrolled male" or "unenrolled female." The "unenrolled" segment is assigned a cluster identifier and is positioned in the hierarchical tree. Thus, the hierarchical speaker tree clustering system assigns a unique cluster identifier corresponding to a given node, for each speaker to further differentiate the general speaker labels.
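The labeling step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes each speaker model is a single mean vector, uses Euclidean distance as the "closest" measure, searches the models flatly rather than hierarchically, and takes the cluster identifiers as given. The names `ENROLLED`, `BACKGROUND`, `closest_model`, and `label_segments`, and all numeric values, are hypothetical.

```python
import math

# Hypothetical toy models: one mean feature vector per speaker (an assumption;
# the abstract does not specify the model form or the distance measure).
ENROLLED = {"alice": (1.0, 0.0), "bob": (0.0, 1.0)}
BACKGROUND = {"unenrolled male": (-1.0, -1.0), "unenrolled female": (2.0, 2.0)}


def closest_model(segment, models):
    """Return (name, distance) of the model nearest to the segment vector."""
    candidates = ((name, math.dist(segment, mean)) for name, mean in models.items())
    return min(candidates, key=lambda pair: pair[1])


def label_segments(segments, cluster_ids):
    """Label each segment with its closest model.

    Segments that match a background model keep the general label and are
    further differentiated by the cluster identifier supplied by the caller.
    """
    all_models = {**ENROLLED, **BACKGROUND}
    labels = []
    for segment, cluster_id in zip(segments, cluster_ids):
        name, _ = closest_model(segment, all_models)
        if name in BACKGROUND:
            name = f"{name} (cluster {cluster_id})"
        labels.append(name)
    return labels


if __name__ == "__main__":
    segments = [(0.9, 0.1), (-1.1, -0.9), (-0.8, -1.2)]
    cluster_ids = [0, 1, 2]  # assumed output of a separate clustering step
    print(label_segments(segments, cluster_ids))
```

In this sketch the second and third segments both receive the general label "unenrolled male", and only the cluster identifier tells them apart, which mirrors the role the abstract gives to the hierarchical speaker tree.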
|Inventors:||Beigi; Homayoon Sadr Mohammad (Yorktown Heights, NY); Viswanathan; Mahesh (Yorktown Heights, NY)|
|Assignee:||International Business Machines Corporation (Armonk, NY)|
|Filed:||June 7, 2000|
|Current U.S. Class:||704/245; 704/248; 704/250|
|Intern'l Class:||G10L 015/04; G10L 015/06|
|Field of Search:||704/243,244,245,246,248,249,250|
|5659662||Aug., 1997||Wilcox et al.|
|6345252||Feb., 2002||Beigi et al.||704/272.|
|6421645||Jul., 2002||Beigi et al.||704/272.|
|6424946||Jul., 2002||Tritschler et al.||704/272.|
|6684186||Jan., 2004||Beigi et al.||704/246.|
S. Dharanipragada et al., "Experimental Results in Audio Indexing," Proc. ARPA SLT Workshop, (Feb. 1996).
L. Polymenakos et al., "Transcription of Broadcast News--Some Recent Improvements to IBM's LVCSR System," Proc. ARPA SLT Workshop, (Feb. 1996).
R. Bakis, "Transcription of Broadcast News Shows with the IBM Large Vocabulary Speech Recognition System," Proc. ICASSP98, Seattle, WA (1998).
H. Beigi et al., "A Distance Measure Between Collections of Distributions and its Application to Speaker Recognition," Proc. ICASSP98, Seattle, WA (1998).
S. Chen, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion," Proceedings of the Speech Recognition Workshop (1998).
S. Chen et al., "Clustering via the Bayesian Information Criterion with Applications in Speech Recognition," Proc. ICASSP98, Seattle, WA (1998).
S. Chen et al., "IBM's LVCSR System for Transcription of Broadcast News Used in the 1997 Hub4 English Evaluation," Proceedings of the Speech Recognition Workshop (1998).
S. Dharanipragada et al., "A Fast Vocabulary Independent Algorithm for Spotting Words in Speech," Proc. ICASSP98, Seattle, WA (1998).
J. Navratil et al., "An Efficient Phonotactic-Acoustic System for Language Identification," Proc. ICASSP98, Seattle, WA (1998).
G. N. Ramaswamy et al., "Compression of Acoustic Features for Speech Recognition in Network Environments," Proc. ICASSP98, Seattle, WA (1998).
S. Chen et al., "Recent Improvements to IBM's Speech Recognition System for Automatic Transcription of Broadcast News," Proceedings of the Speech Recognition Workshop (1999).
S. Dharanipragada et al., "Story Segmentation and Topic Detection in the Broadcast News Domain," Proceedings of the Speech Recognition Workshop (1999).
C. Neti et al., "Audio-Visual Speaker Recognition for Video Broadcast News," Proceedings of the Speech Recognition Workshop (1999).