Overview of Speaker Recognition

A. E. Rosenberg, F. Bimbot, S. Parthasarathy

An introduction to automatic speaker recognition is presented in this chapter. The identifying characteristics of a person's voice that make it possible to automatically identify a speaker are discussed. Subtasks such as speaker identification, verification, and detection are described. An overview of the techniques used to build speaker models, as well as issues related to system performance, is presented. Finally, a few selected applications of speaker recognition are introduced to demonstrate the wide range of applications of speaker recognition technologies. Details of text-dependent and text-independent speaker recognition and their applications are covered in the following two chapters.

36.1 Speaker Recognition
    36.1.1 Personal Identity Characteristics
    36.1.2 Speaker Recognition Definitions
    36.1.3 Bases for Speaker Recognition
    36.1.4 Extracting Speaker Characteristics from the Speech Signal
    36.1.5 Applications
36.2 Measuring Speaker Features
    36.2.1 Acoustic Measurements
    36.2.2 Linguistic Measurements
36.3 Constructing Speaker Models
    36.3.1 Nonparametric Approaches
    36.3.2 Parametric Approaches
36.4 Adaptation
36.5 Decision and Performance
    36.5.1 Decision Rules
    36.5.2 Threshold Setting and Score Normalization
    36.5.3 Errors and DET Curves
36.6 Selected Applications for Automatic Speaker Recognition
    36.6.1 Indexing Multispeaker Data
    36.6.2 Forensics
    36.6.3 Customization: SCANmail
36.7 Summary
References

Human beings have many characteristics that make it possible to distinguish one individual from another. Some individuating characteristics, such as facial features, vocal qualities, and behavior, can be perceived very readily. Others, such as fingerprints, iris patterns, and DNA structure, are not readily perceived and require measurements, often quite complex measurements, to capture distinguishing characteristics. In recent years biometrics has emerged as an applied scientific discipline with the objective of automatically capturing personal identifying characteristics and using the measurements for security, surveillance, and forensic applications [36.1]. Typical applications using biometrics secure transactions, information, and premises to authorized individuals. In surveillance applications, the goal is to detect and track a target individual among a set of nontarget individuals. In forensic applications a sample of biometric measurements is obtained from an unknown individual, the perpetrator. The task is to compare this sample with a database of similar measurements from known individuals to find a match. Many personal identifying characteristics are based on physiological properties, others on behavior, and some combine physiological and behavioral properties.
36.1 Speaker Recognition

36.1.1 Personal Identity Characteristics

From the point of view of using personal identity characteristics as a biometric for security, physiological characteristics may offer more intrinsic security since they are not subject to the kinds of voluntary variations found in behavioral features. Voice is an example of a biometric that combines physiological and behavioral characteristics. Voice is attractive as a biometric for many reasons. It can be captured non-intrusively and conveniently with simple transducers and recording devices. It is particularly useful for remote-access transactions over telecommunication networks. A drawback is that voice is subject to many sources of variability, including behavioral variability, both voluntary and involuntary. An example of involuntary variability is a speaker's inability to repeat utterances precisely the same way. Another example is the spectral changes that occur when speakers vary their vocal effort as background noise increases. Voluntary variability is an issue when speakers attempt to disguise their voices. Other sources of variability include physical voice variations due to respiratory infections and congestion. External sources of variability are especially problematic, including variations in background noise and in transmission and recording characteristics.

36.1.2 Speaker Recognition Definitions

Different tasks are defined under the general heading of speaker recognition. They differ mainly with respect to the kind of decision that is required for each task. In speaker identification a voice sample from an unknown speaker is compared with a set of labeled speaker models. When it is known that the set of speaker models includes all speakers of interest, the task is referred to as closed-set identification. The label of the best-matching speaker is taken to be the identified speaker. Most speaker identification applications are open-set, meaning that it is possible that the unknown speaker is not included in the set of speaker models. In this case, if no satisfactory match is obtained, a no-match decision is provided. In a speaker verification trial an identity claim is provided or asserted along with the voice sample. In this case, the unknown voice sample is compared only with the speaker model whose label corresponds to the identity claim. If the quality of the comparison is satisfactory, the identity claim is accepted; otherwise the claim is rejected. Speaker verification is a special case of open-set speaker identification with a one-speaker target set. The speaker verification decision mode is intrinsic to most access control applications. In these applications, it is assumed that the claimant will respond to prompts cooperatively. It can readily be seen that in the speaker identification task performance degrades as the number of speaker models and the number of comparisons increase. In a speaker verification trial only one comparison is required, so speaker verification performance is independent of the size of the speaker population. A third speaker recognition task has been defined in recent years in National Institute of Standards and Technology (NIST) speaker recognition evaluations; it is generally referred to as speaker detection [36.2, 3]. The NIST task is an open-set identification decision associated exclusively with conversational speech.
In this task an unknown voice sample is provided and the task is to determine whether or not one of a specified set of known speakers is present in the sample. A complicating factor for this task is that the unknown sample may contain speech from more than one speaker, such as in the summed two sides of a telephone conversation. In this case, an additional task called speaker tracking is defined, in which it is required to determine the intervals in the test sample during which the detected speaker is talking. In other applications where the speech samples are multispeaker, speaker tracking has also been referred to as speaker segmentation, speaker indexing, and speaker diarization [36.4–10]. It is possible to cast the speaker segmentation task as an acoustical change detection task without creating models. The time instants where a significant acoustic change occurs are assumed to be the boundaries between different speaker segments. In this case, in the absence of speaker models, speaker segmentation would not be considered a speaker recognition task. However, in most reported approaches to this task some sort of speaker modeling does take place. The task usually includes labeling the speaker segments, in which case it falls unambiguously under the speaker recognition heading.

In addition to decision modes, speaker recognition tasks can be categorized by the kind of speech that is input. If the speaker is prompted or expected to provide a known text and if speaker models have been trained explicitly for this text, the input mode is said to be text dependent. If, on the other hand, the speaker cannot be expected to utter specified texts, the input mode is text independent. In this case speaker models are not trained on explicit texts.

36.1.3 Bases for Speaker Recognition

The principal function associated with the transmission of a speech signal is to convey a message. However, along with the message, additional kinds of information are transmitted. These include information about the gender, identity, emotional state, health, etc. of the speaker. The sources of all these kinds of information lie in both physiological and behavioral characteristics.

Fig. 36.1 Physiology of the human vocal tract, showing the articulators (lips, teeth, tongue, jaw, hard and soft palate), the oral, nasal, and pharyngeal cavities, the larynx, trachea, lungs, and diaphragm (reproduced with permission from L. H. Jamieson [36.11])

The physiological features are shown in Fig. 36.1, a cross-section of the human vocal tract. The shape of the vocal tract, determined by the position of the articulators – the tongue, jaw, lips, teeth, and velum – creates a set of acoustic resonances in response to periodic puffs of air generated by the glottis for voiced sounds or to aperiodic excitation caused by air passing through tight constrictions in the vocal tract. The spectral peaks associated with these resonances are referred to as speech formants. The locations in frequency and, to a lesser degree, the shapes of the resonances distinguish one speech sound from another. In addition, formant locations and bandwidths and spectral differences associated with the overall size of the vocal tract serve to distinguish the same sounds spoken by different speakers. The shape of the nasal tract, which determines the quality of nasal sounds, also varies significantly from speaker to speaker. The mass of the glottis is associated with the basic fundamental frequency of voiced speech sounds. The average fundamental frequency is approximately 100 Hz for adult males, 200 Hz for adult females, and 300 Hz for children; it also varies from individual to individual.

Speech signal events can be classified as segmental or suprasegmental. Generally, segmental refers to the features of individual sounds or segments, whereas suprasegmental refers to properties that extend over several speech sounds. Speaking behavior is associated with the individual's control of the articulators for individual speech sounds or segments and also with suprasegmental characteristics governing how individual speech sounds are strung together to form words. Higher-level speaking behavior is associated with choices of words and syntactic units. Variations in fundamental frequency or pitch and in rhythm are also higher-level features of the speech signal, along with such qualities as breathiness, strength of vocal effort, etc. All of these vary significantly from speaker to speaker.

36.1.4 Extracting Speaker Characteristics from the Speech Signal

A perceptual view classifies speech as containing low-level and high-level kinds of information. Low-level features of speech are associated with the periphery of the brain's speech perception mechanism and are relatively accessible from the speech signal. High-level features are associated with more-central locations in the perception mechanism. Generally speaking, low-level speaker features are easier to extract from the speech signal and model than high-level features.
Many such features are associated with spectral correlates such as formant locations and bandwidths, pitch periodicity, and segmental timings. High-level features include the perception of words and their meaning, syntax, prosody, dialect, and idiolect. It is not easy to extract stable and reliable formant features explicitly from the speech signal. In most instances it is easier to carry out short-term spectral amplitude measurements that capture low-level speaker characteristics implicitly. Short-term spectral measurements are typically carried out over 20–30 ms windows advanced every 10 ms. Short speech sounds have durations of less than 100 ms, whereas stressed vowel sounds can last for 300 ms or more. Advancing the time window every 10 ms enables the temporal characteristics of individual speech sounds to be tracked, and a 30 ms analysis window is usually long enough to provide good spectral resolution of these sounds while remaining short enough to resolve significant temporal characteristics. There are two principal methods of short-term spectral analysis: filter bank analysis and linear predictive coding (LPC) analysis. In filter bank analysis the speech signal is passed through a bank of bandpass filters covering a range of frequencies consistent with the transmission characteristics of the signal. The spacing of the filters can be uniform or, more likely, nonuniform, consistent with perceptual criteria such as the mel or Bark scale [36.12], which provide linear spacing in frequency below 1000 Hz and logarithmic spacing above. The output of each filter is typically implemented as a windowed, short-term Fourier transform using fast Fourier transform (FFT) techniques. This output is subject to a nonlinearity and a low-pass filter to provide an energy measurement.
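As a concrete illustration of the framing described above, the sketch below slices a waveform into 30 ms windows advanced every 10 ms and computes a short-term power spectrum for each frame with an FFT. It is our own illustration rather than code from the chapter: the function name, parameter defaults, and the Hamming window are assumptions.

```python
import numpy as np

def frame_power_spectra(signal, sample_rate, win_ms=30.0, hop_ms=10.0, n_fft=512):
    """Slice a waveform into overlapping frames and return per-frame power spectra."""
    win_len = int(round(win_ms * 1e-3 * sample_rate))   # 30 ms -> 240 samples at 8 kHz
    hop_len = int(round(hop_ms * 1e-3 * sample_rate))   # 10 ms -> 80 samples at 8 kHz
    window = np.hamming(win_len)                        # taper to reduce spectral leakage
    n_frames = 1 + max(0, (len(signal) - win_len) // hop_len)
    spectra = np.empty((n_frames, n_fft // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * hop_len : i * hop_len + win_len] * window
        spectrum = np.fft.rfft(frame, n=n_fft)          # windowed short-term Fourier transform
        spectra[i] = np.abs(spectrum) ** 2              # power in each DFT bin
    return spectra

# Example: one second of synthetic "speech" at 8 kHz yields roughly 98 frames.
if __name__ == "__main__":
    fs = 8000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 120 * t) + 0.3 * np.random.randn(fs)
    print(frame_power_spectra(x, fs).shape)             # (98, 257)
```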
LPC-derived features almost always include regression measurements that capture the temporal evolution of these features from one speech segment to another. It is no accident that short-term spectral measurements are also the basis for speech recognizers. This is because an analysis that captures the differences between one speech sound and another can also capture the differences between the same speech sound uttered by different speakers, often with resolutions surpassing human perception.

Other measurements that are often carried out are correlated with prosody, such as pitch and energy tracking. Pitch or periodicity measurements are relatively easy to make. However, periodicity measurement is meaningful only for voiced speech sounds, so it is also necessary to have a detector that can discriminate voiced from unvoiced sounds. This complication often makes it difficult to obtain reliable pitch tracks over long-duration utterances. Long-term average spectral and fundamental frequency measurements have been used in the past for speaker recognition, but since these measurements provide feature averages over long durations they are not capable of resolving detailed individual differences.

Although computational ease is an important consideration for selecting speaker-sensitive feature measurements, equally important considerations are the stability of the measurements, including whether they are subject to variability, noise, and distortions from one measurement of a speaker's utterances to another. One source of variability is the speaker himself. Features that are correlated with behavior, such as pitch contours – pitch measured as a function of time over specified utterances – can be consciously varied from one token of an utterance to another. Conversely, cooperative speakers can control such variability. More difficult to deal with are the variability and distortion associated with recording environments, microphones, and transmission media. The most severe kinds of variability problems occur when utterances used to train models are recorded under one set of conditions and test utterances are recorded under another.

A block diagram of a speaker recognition system is shown in Fig. 36.2, illustrating the basic elements discussed in this section. A sample of speech from an unknown speaker is input to the system. If the system is a speaker verification system, an identity claim or assertion is also input. The speech sample is recorded, digitized, and analyzed. The analysis is typically some sort of short-term spectral analysis that captures speaker-sensitive features as described earlier in this section. These features are compared with prototype features compiled into the models of known speakers. A matching process is invoked to compare the sample features and the model features. In the case of closed-set speaker identification, the match is assigned to the model with the best matching score. In the case of speaker verification, the matching score is compared with a predetermined threshold to decide whether to accept or reject the identity claim. For open-set identification, if the matching score for the best matching model does not pass a threshold test, a no-match decision is made.

Fig. 36.2 Block diagram of a speaker recognition system: a speech sample from an unknown speaker passes through speech signal processing and feature extraction, the features are pattern-matched against stored speaker models (together with an identity claim in the verification case), and a decision is output
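To make the decision modes of Fig. 36.2 concrete, here is a minimal sketch of the verification and open-set identification decisions; the scoring function is left abstract, and all names are invented for illustration.

```python
def verify(score_fn, test_features, models, claimed_id, threshold):
    """Speaker verification: compare the sample only with the claimed speaker's model."""
    score = score_fn(test_features, models[claimed_id])
    return "accept" if score >= threshold else "reject"

def identify_open_set(score_fn, test_features, models, threshold):
    """Open-set identification: pick the best-matching model, or declare no match."""
    scores = {spk: score_fn(test_features, model) for spk, model in models.items()}
    best = max(scores, key=scores.get)
    # Closed-set identification is the same procedure without this threshold test.
    return best if scores[best] >= threshold else "no-match"
```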
36.1.5 Applications

As mentioned, the most widespread applications for automatic speaker recognition are for security. These are typically speaker verification applications intended to control access to privileged transactions or information remotely over a telecommunication network. They are usually configured in a text-dependent mode in which customers are prompted to speak personalized verification phrases such as personal identification numbers (PINs) spoken as a string of digits. Typically, PIN utterances are decoded using a speaker-independent speech recognizer to provide an identity claim. The utterances are then processed in a speaker recognition mode and compared with speaker models associated with the identity claim. Speaker models are trained by recording and processing prompted verification phrases in an enrollment session. In addition to security applications, speaker verification may be used to offer personalized services to users. For example, once a speaker verification phrase is authenticated, the user may be given access to a personalized phone book for voice repertory dialing.

A forensic application is likely to be an open-set identification or verification task. A sample of speech exists from an unknown perpetrator. A suspect is required to speak utterances contained in the suspect speech sample in order to train a model. The suspect speech sample is compared both with the suspect and nonsuspect models to decide whether to accept or reject the hypothesis that the suspect and perpetrator voices are the same.

In surveillance applications the input speech mode is most likely to be text independent. Since the speaker may be unaware that his voice is being monitored, he cannot be expected to speak specified texts. The decision task is open-set identification or verification.

Large amounts of multimedia data, including speech, are being recorded and stored on digital media. The existence of such large amounts of data has created a need for efficient, versatile, and accurate data mining tools for extracting useful information content from the data. A typical need is to search or browse through the data, scanning for specified topics, words, phrases, or speakers. Most of this data is multispeaker data, collected from broadcasts, recorded meetings, telephone conversations, etc. The process of obtaining a list of speaker segments from such data is referred to as speaker indexing, segmentation, or diarization. A more-general task of annotating audio data from various audio sources, including speakers, has been referred to as audio diarization [36.10].

Still another speaker recognition application is to improve automatic speech recognition by adapting speaker-independent speech models to specified speakers. Many commercial speech recognizers do adapt their speech models to individual users, but this cannot be regarded as a speaker recognition application unless speaker models are constructed and speaker recognition is a part of the process. Speaker recognition can also be used to improve speech recognition for multispeaker data. In this situation speaker indexing can provide a table of speech segments assigned to individual speakers. The speech data in these segments can then be used to adapt speech models to each speaker. Speech recognition of multispeaker speech samples can be improved in another way: errors and ambiguities in speech recognition transcripts can be corrected using the knowledge provided by speaker segmentation, assigning the segments to the correct speakers.
36.2 Measuring Speaker Features

36.2.1 Acoustic Measurements

As mentioned in Sect. 36.1, low-level acoustic features such as short-time spectra are commonly used in speaker modeling. Such features are useful in authentication systems because speakers have less control over spectral details than over higher-level features such as pitch.

Short-Time Spectrum

There are many ways of representing the short-time spectrum. A popular representation is the mel-frequency cepstral coefficients (MFCC), which were originally developed for speaker-independent speech recognition. The choice of center frequencies and bandwidths of the filter bank used in MFCC was motivated by the properties of the human auditory system. In particular, this representation provides limited spectral resolution above 2 kHz, which might be detrimental in speaker recognition. However, somewhat counterintuitively, MFCCs have been found to be quite effective in speaker recognition. There are many minor variations in the definition of MFCC, but the essential details are as follows. Let {S(k), 0 ≤ k < K} be the discrete Fourier transform (DFT) coefficients of a windowed speech signal ŝ(t). A set of triangular filters is defined such that

w_j(k) = \begin{cases} \dfrac{(k/K)\,f_s - f_{c_{j-1}}}{f_{c_j} - f_{c_{j-1}}}, & l_j \le k \le c_j, \\ \dfrac{f_{c_{j+1}} - (k/K)\,f_s}{f_{c_{j+1}} - f_{c_j}}, & c_j < k \le u_j, \\ 0, & \text{elsewhere}, \end{cases}   (36.1)

where f_{c_{j-1}} and f_{c_{j+1}} are the lower and upper limits of the pass band for filter j, with f_{c_0} = 0 and f_{c_j} < f_s/2 for all j, and l_j, c_j, and u_j are the DFT indices corresponding to the lower, center, and upper limits of the pass band for filter j. The log-energies at the outputs of the J filters are given by

e(j) = \ln\left[\frac{1}{\sum_{k=l_j}^{u_j} w_j(k)} \sum_{k=l_j}^{u_j} |S(k)|^2\, w_j(k)\right],   (36.2)

and the MFCC coefficients are the discrete cosine transform of the filter energies, computed as

C(k) = \sum_{j=0}^{J} e(j)\, \cos\!\left[k\left(j - \frac{1}{2}\right)\frac{\pi}{J}\right], \quad k = 1, 2, \ldots, K.   (36.3)

The zeroth coefficient C(0) is set to be the average log-energy of the windowed speech signal. Typical values of the various parameters involved in the MFCC computation are as follows. A cepstrum vector is calculated using a window length of 20 ms and updated every 10 ms. The center frequencies f_{c_j} are uniformly spaced from 0 to 1000 Hz and logarithmically spaced above 1000 Hz. The number of filter energies is typically 24 for telephone-band speech, and the number of cepstrum coefficients used in modeling varies from 12 to 18 [36.13]. Cepstral coefficients based on short-time spectra estimated using linear predictive analysis and perceptual linear prediction are other popular representations [36.14].
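One possible rendering of (36.1)–(36.3) in code is sketched below: triangular filters on a mel-spaced frequency axis, log filter-bank energies, and a cosine transform. The mel-conversion helpers, the parameter defaults, and the use of the mean log filter energy as a stand-in for C(0) are our own simplifications; none of the names come from the chapter.

```python
import numpy as np

def mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)        # Hz -> mel

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)         # mel -> Hz

def triangular_filters(n_filters, n_fft, fs):
    """Triangular weights w_j(k), cf. (36.1), with mel-spaced center frequencies."""
    edges_hz = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges_hz / fs).astype(int)   # DFT indices l_j, c_j, u_j
    filters = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, u = bins[j - 1], bins[j], bins[j + 1]
        for k in range(l, c):
            filters[j - 1, k] = (k - l) / max(c - l, 1)        # rising edge
        for k in range(c, u):
            filters[j - 1, k] = (u - k) / max(u - c, 1)        # falling edge
    return filters

def mfcc_frame(power_spectrum, filters, n_ceps=13):
    """Log filter-bank energies, cf. (36.2), followed by a cosine transform, cf. (36.3)."""
    energies = filters @ power_spectrum
    log_e = np.log(np.maximum(energies, 1e-10))                # e(j)
    j = np.arange(1, len(log_e) + 1)
    ceps = np.array([np.sum(log_e * np.cos(k * (j - 0.5) * np.pi / len(log_e)))
                     for k in range(n_ceps)])
    ceps[0] = log_e.mean()     # stand-in for C(0); the chapter uses the frame's log-energy
    return ceps

# Usage with the framing sketch shown earlier (all names are ours):
# filters = triangular_filters(24, 512, 8000)
# mfccs = np.array([mfcc_frame(p, filters) for p in frame_power_spectra(x, 8000)])
```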
Short-time spectral measurements are sensitive to channel and transducer variations. Cepstral mean subtraction (CMS) is a simple and effective method to compensate for convolutional distortions introduced by slowly varying channels. In this method, the cepstral vectors are transformed such that they have zero mean. The cepstral average over a sufficiently long speech signal approximates the estimate of a stationary channel [36.14]. Therefore, subtracting the mean from the original vectors is roughly equivalent to normalizing the effects of the channel, if we assume that the average of the clean speech signal is zero. Cepstral variance normalization, which results in feature vectors with unit variance, has also been shown to improve performance in text-independent speaker recognition when there is more than a minute of speech for enrollment. Other feature normalization methods, such as feature warping [36.15] and Gaussianization [36.16], map the observed feature distribution to a normal distribution over a sliding window and have been shown to be useful in speaker recognition.

It has long been established that incorporating dynamic information is useful for speaker recognition and speech recognition [36.17]. The dynamic information is typically incorporated by extending the static cepstral vectors by their first and second derivatives, computed as

\Delta C_k = \frac{\sum_{t=-l}^{l} t\, c_{t+k}}{\sum_{t=-l}^{l} |t|},   (36.4)

\Delta\Delta C_k = \frac{\sum_{t=-l}^{l} t^2\, c_{t+k}}{\sum_{t=-l}^{l} t^2}.   (36.5)
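A minimal sketch of these normalization and dynamic-feature computations, assuming the cepstra are stored as a (frames × coefficients) array; the window half-length l = 2, the edge padding, and all names are our own choices, and (36.5) is implemented exactly as printed above.

```python
import numpy as np

def cms_cvn(cepstra):
    """Cepstral mean subtraction and variance normalization over one utterance.
    cepstra has shape (n_frames, n_coeffs)."""
    mean = cepstra.mean(axis=0)               # long-term cepstral mean ~ stationary channel
    std = cepstra.std(axis=0) + 1e-10
    return (cepstra - mean) / std

def dynamic_features(cepstra, l=2, order=1):
    """Dynamic features: (36.4) for order=1, (36.5) for order=2, with edge padding."""
    n = len(cepstra)
    padded = np.pad(cepstra, ((l, l), (0, 0)), mode="edge")
    t = np.arange(-l, l + 1)
    w = t if order == 1 else t ** 2
    norm = np.abs(t).sum() if order == 1 else (t ** 2).sum()
    num = sum(w[i] * padded[t[i] + l : t[i] + l + n] for i in range(len(t)))
    return num / norm

# Static coefficients extended by their first- and second-order dynamic features:
# c = cms_cvn(raw_cepstra)
# features = np.hstack([c, dynamic_features(c, order=1), dynamic_features(c, order=2)])
```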
Pitch

Voiced sounds are produced by a quasiperiodic opening and closing of the vocal folds in the larynx at a fundamental frequency that depends on the speaker. Pitch is a complex auditory attribute of sound that is closely related to this fundamental frequency. In this chapter, the term pitch is used simply to refer to the measure of periodicity observed in voiced speech. Prosodic information represented by pitch and energy contours has been used successfully to improve the performance of speaker recognition systems [36.18]. There are a number of techniques for estimating pitch from the speech signal [36.19], and the performance of even simple pitch-estimation techniques is adequate for speaker recognition. The major failure modes occur during speech segments that are at the boundaries of voiced and unvoiced sounds and can be ignored for speaker recognition. A more-significant problem with using pitch information for speaker recognition is that speakers have a fair amount of control over it, which results in large intraspeaker variations and mismatch between enrollment and test utterances.

36.2.2 Linguistic Measurements

In traditional speaker authentication applications, the enrollment data is limited to a few repetitions of a password, and the same password is spoken to gain access to the system. In such cases, speaker models based on short-time spectra are very effective and it is difficult to extract meaningful high-level or linguistic features. In applications such as indexing broadcasts by speaker and passive surveillance, a significant amount of enrollment data, perhaps several minutes, may be available. In such cases, the use of linguistic features has been shown to be beneficial [36.18].

Word Usage

Features such as vocabulary choices, function word frequencies, part-of-speech frequencies, etc. have been shown to be useful in speaker recognition [36.20]. In addition to words, spontaneous speech contains fillers and hesitations that can be characterized by statistical models and used for identifying speakers [36.20, 21]. There are a number of issues with speaker recognition systems based on lexical features: they are susceptible to errors introduced by large-vocabulary speech recognizers, a significant amount of enrollment data is needed to build robust models, and the speaker models are likely to characterize the topic of conversation as well as the speaker.

Phone Sequences and Lattices

Models of phone sequences output by speech recognizers using phonotactic grammars, typically phone unigrams, can be used to represent speaker characteristics [36.22]. It is assumed that these models capture speaker-specific pronunciations of frequently occurring words, choice of words, and also an implicit characterization of the acoustic space occupied by the speech signal from a given speaker. It turns out that there is an optimal tradeoff between the constraints used in the recognizer to produce the phone sequences and the robustness of the speaker models of phone sequences. For example, the use of lexical constraints in automatic speech recognition (ASR) reproduces phone sequences found in a predetermined dictionary and prevents phone sequences that may be characteristic of a speaker but are not represented in the dictionary. The phone accuracy computed using one-best output phone strings generated by ASR systems without lexical constraints is typically not very high. On the other hand, the correct phone sequence can be found in a phone lattice output by an ASR with high probability. It has been shown that it is advantageous to construct speaker models based on phone-lattice output rather than the one-best phone sequence [36.22]. Systems based on one-best phone sequences use the counts of a term such as a phone unigram or bigram in the decoded sequence. In the case of lattice outputs, these raw counts are replaced by the expected counts given by

E[C(\tau|X)] = \sum_{Q} p(Q|X)\, C(\tau|Q),   (36.6)

where Q is a path through the phone lattice for the utterance X with associated probability p(Q|X), and C(\tau|Q) is the count of the term \tau in the path Q.
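As a rough illustration of (36.6), the sketch below accumulates expected phone n-gram counts over lattice paths, simplified here to an n-best list of (posterior, phone-sequence) pairs; the data representation and all names are assumptions of ours.

```python
from collections import Counter

def expected_term_counts(paths, n=2):
    """Expected n-gram counts over lattice paths, following (36.6).

    paths: iterable of (posterior, phone_sequence) pairs whose posteriors p(Q|X)
    sum to one; phone_sequence is the list of phone labels along the path Q."""
    expected = Counter()
    for posterior, phones in paths:
        for i in range(len(phones) - n + 1):
            term = tuple(phones[i : i + n])      # one occurrence of the term in path Q
            expected[term] += posterior          # weight C(term|Q) by p(Q|X)
    return expected

# Toy example: two alternative decodings of the same utterance.
paths = [
    (0.7, ["sil", "ah", "b", "aw", "t", "sil"]),
    (0.3, ["sil", "ah", "b", "ae", "t", "sil"]),
]
print(expected_term_counts(paths, n=2)[("ah", "b")])     # -> 1.0
```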
Other Linguistic Features

A number of other features have been found to be useful for speaker modeling, including (a) pronunciation modeling of carefully chosen words, and (b) prosodic statistics such as pitch and energy contours as well as durations of phones and pauses [36.23].

36.3 Constructing Speaker Models

A speaker recognition system provides the ability to construct a model λ_s for speaker s using enrollment utterances from that speaker, and a method for comparing the quality of match of a test utterance to the speaker model. The choice of models is determined by the application constraints. In applications in which the user is expected to say a fixed password each time, it is beneficial to develop models for words or phrases to capture the temporal characteristics of speech. In passive surveillance applications, the test utterance may contain phonemes or words not seen in the enrollment data. In such cases, less-detailed models that represent the overall acoustic space of the user's utterances tend to be effective. A survey of general techniques that have been used in speaker modeling follows. The methods can be broadly classified as nonparametric or parametric. Nonparametric models make few structural assumptions and are effective when there is sufficient enrollment data that is matched to the test data. Parametric models allow a parsimonious representation of the structural constraints and can make effective use of the enrollment data if the constraints are appropriately chosen.

36.3.1 Nonparametric Approaches

Templates

This is the simplest form of speaker modeling and is appropriate for fixed-password speaker verification systems [36.24]. The enrollment data consists of a small number of repetitions of the password spoken by the target speaker. Each enrollment utterance X is a sequence of feature vectors \{x_t\}_{t=0}^{T-1} generated as described in Sect. 36.2, and serves as a template for the password as spoken by the target speaker. A test utterance Y, consisting of vectors \{y_t\}_{t=0}^{T-1}, is compared to each of the enrollment utterances, and the identity claim is accepted if the distance between the test and enrollment utterances is below a decision threshold. The comparison is done as follows. Associated with each pair of vectors, x_i and y_j, is a distance d(x_i, y_j). The feature vectors of X and Y are aligned using an algorithm referred to as dynamic time warping to minimize an overall distance defined as the average intervector distance d(x_i, y_j) between the aligned vectors [36.12]. This approach is effective in simple fixed-password applications in which robustness to channel and transducer differences is not an issue. The technique is described here mostly for historical reasons and is rarely used in real applications today.

Nearest-Neighbor Modeling

Nearest-neighbor models have been popular in nonparametric classification [36.25]. This approach is often thought of as estimating the local density of each class by a Parzen estimate and assigning the test vector to the class with the maximum local density. The local density of a class (speaker) with enrollment data X at a test vector y is defined as

p_{nn}(y; X) = \frac{1}{V[d_{nn}(y, X)]},   (36.7)

where d_{nn}(y, X) = \min_{x_j \in X} \|y - x_j\| is the nearest-neighbor distance and V(r) is the volume of a sphere of radius r in the D-dimensional feature space. Since V(r) is proportional to r^D,

\ln[p_{nn}(y; X)] \approx -D \ln[d_{nn}(y, X)].   (36.8)

The log-likelihood score of a test utterance Y with respect to a speaker specified by enrollment data X is given by

s_{nn}(Y; X) \approx -\sum_{y_j \in Y} \ln[d_{nn}(y_j, X)],   (36.9)

and the speaker with the greatest s_{nn}(Y; X) is identified. A modified version of the nearest-neighbor model, motivated by the discussion above, has been successfully used in speaker identification [36.26]. It was found empirically that a score defined as

s'_{nn}(Y; X) = \frac{1}{N_y} \sum_{y_j \in Y} \min_{x_i \in X} \|y_j - x_i\|^2 + \frac{1}{N_x} \sum_{x_j \in X} \min_{y_i \in Y} \|y_i - x_j\|^2 - \frac{1}{N_y} \sum_{y_i \in Y} \min_{y_j \in Y,\, j \ne i} \|y_i - y_j\|^2 - \frac{1}{N_x} \sum_{x_i \in X} \min_{x_j \in X,\, j \ne i} \|x_i - x_j\|^2   (36.10)

gives much better performance than s_{nn}(Y; X).
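A small sketch of the two nearest-neighbor scores, assuming the enrollment and test data are NumPy arrays of shape (frames × dimensions) with at least two frames each; the helper names are invented, and the second function follows our reconstruction of (36.10) above.

```python
import numpy as np

def _min_sq_dist(a, b):
    """For each row of a, the squared distance to its nearest neighbor among rows of b."""
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1)

def snn(test, enroll):
    """Nearest-neighbor score of (36.9), up to the constant factor D."""
    return -np.sum(np.log(np.sqrt(_min_sq_dist(test, enroll)) + 1e-12))

def snn_modified(test, enroll):
    """Modified nearest-neighbor score in the spirit of (36.10); smaller means a closer match."""
    cross = _min_sq_dist(test, enroll).mean() + _min_sq_dist(enroll, test).mean()
    within_y = np.mean([_min_sq_dist(test[i:i + 1], np.delete(test, i, 0))[0]
                        for i in range(len(test))])
    within_x = np.mean([_min_sq_dist(enroll[i:i + 1], np.delete(enroll, i, 0))[0]
                        for i in range(len(enroll))])
    return cross - within_y - within_x
```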
36.3.2 Parametric Approaches

Vector Quantization Modeling

Vector quantization constructs a set of representative samples of the target speaker's enrollment utterances by clustering the feature vectors. Although a variety of clustering techniques exist, the most commonly used is k-means clustering [36.14]. This approach partitions N feature vectors into K disjoint subsets S_j to minimize an overall distance such as

D = \sum_{j=1}^{K} \sum_{x_i \in S_j} \|x_i - \mu_j\|^2,   (36.11)

where \mu_j = (1/N_j) \sum_{x_i \in S_j} x_i is the centroid of the N_j samples in the j-th cluster. The algorithm proceeds in two steps:

1. Compute the centroid of each cluster using an initial assignment of the feature vectors to the clusters.
2. Reassign each x_i to the cluster whose centroid is closest to it.

These steps are iterated until successive steps do not reassign samples. This algorithm assumes that there exists an initial clustering of the samples into K clusters. It is difficult to obtain a good initialization of K clusters in one step. In fact, it may not even be possible to reliably estimate K clusters because of data sparsity. The Linde–Buzo–Gray (LBG) algorithm [36.27] provides a good solution to this problem. Given m centroids, the LBG algorithm produces additional centroids by perturbing one or more of the centroids using a heuristic. One common heuristic is to choose the centroid μ of the cluster with the largest variance and produce two centroids μ and μ + ε. The enrollment feature vectors are assigned to the resulting m + 1 centroids. The k-means algorithm described previously can then be applied to refine the centroid estimates. This process can be repeated until m = M or the cluster sizes fall below a threshold. The LBG algorithm is usually initialized with m = 1, computing the centroid of all the enrollment data. There are many variations of this algorithm that differ in the heuristic used for perturbing the centroids, the termination criteria, and similar details. In general, this algorithm for generating VQ models has been shown to be quite effective. The choice of K is a function of the size of the enrollment data set, the application, and other system considerations such as limits on computation and memory. Once the VQ models are established for a target speaker, scoring consists of evaluating D in (36.11) for the feature vectors in the test utterance. This approach is general, can be used for text-dependent and text-independent speaker recognition, and has been shown to be quite effective [36.28]. Vector quantization models can also be constructed on sequences of feature vectors, which are effective at modeling the temporal structure of speech. If distance functions and centroids are suitably redefined, the algorithms described in this section continue to be applicable. Although VQ models are still useful in some situations, they have been superseded by models such as Gaussian mixture models and hidden Markov models, which are described in the following sections.
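A compact sketch of LBG-style codebook growth with k-means refinement and VQ scoring, under our own assumptions: a fixed perturbation constant, a fixed iteration count, and splitting one cluster at a time. None of the names come from the chapter.

```python
import numpy as np

def kmeans(data, centroids, n_iters=20):
    """Alternate assigning vectors to the closest centroid and recomputing centroids."""
    labels = np.zeros(len(data), dtype=int)
    for _ in range(n_iters):
        d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return centroids, labels

def lbg_codebook(data, target_size, eps=0.01):
    """Grow a codebook by splitting the highest-variance cluster, then refine with k-means."""
    centroids = data.mean(axis=0, keepdims=True)          # start with m = 1
    labels = np.zeros(len(data), dtype=int)
    while len(centroids) < target_size:
        variances = [data[labels == j].var() if np.any(labels == j) else 0.0
                     for j in range(len(centroids))]
        j = int(np.argmax(variances))                     # split mu into mu and mu + eps
        centroids = np.vstack([centroids, centroids[j] + eps])
        centroids, labels = kmeans(data, centroids)
    return centroids

def vq_score(test, codebook):
    """Average distortion of the test vectors against the codebook, in the spirit of (36.11)."""
    d = ((test[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()
```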
Gaussian Mixture Models

In the case of text-independent speaker recognition (the subject of Chap. 38), where the system has no prior knowledge of the text of the speaker's utterance, Gaussian mixture models (GMMs) have proven to be very effective. This can be thought of as a refinement of the VQ model. Feature vectors of the enrollment utterances X are assumed to be drawn from a probability density function that is a mixture of Gaussians given by

p(x|\lambda) = \sum_{k=1}^{K} w_k\, p_k(x|\lambda_k),   (36.12)

where 0 \le w_k \le 1 for 1 \le k \le K, \sum_{k=1}^{K} w_k = 1, and

p_k(x|\lambda_k) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\!\left[-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right];   (36.13)

λ represents the parameters \{(\mu_k, \Sigma_k, w_k)\}_{k=1}^{K} of the distribution. Since the size of the training data is often small, it is difficult to estimate full covariance matrices reliably. In practice, \{\Sigma_k\}_{k=1}^{K} are assumed to be diagonal. Given the enrollment data X, the maximum-likelihood estimates of λ can be obtained using the expectation-maximization (EM) algorithm [36.12]. The k-means algorithm can be used to initialize the parameters of the component densities. The posterior probability that x_t is drawn from the component p_m(x_t|\lambda_m) can be written

P(m|x_t, \lambda) = \frac{w_m\, p_m(x_t|\lambda_m)}{p(x_t|\lambda)}.   (36.14)

The maximum-likelihood estimates of the parameters of λ in terms of P(m|x_t, λ) are

\mu_m = \frac{\sum_{t=1}^{T} P(m|x_t, \lambda)\, x_t}{\sum_{t=1}^{T} P(m|x_t, \lambda)},   (36.15)

\Sigma_m = \frac{\sum_{t=1}^{T} P(m|x_t, \lambda)\, x_t x_t^T}{\sum_{t=1}^{T} P(m|x_t, \lambda)} - \mu_m \mu_m^T,   (36.16)

w_m = \frac{1}{T} \sum_{t=1}^{T} P(m|x_t, \lambda).   (36.17)

The two steps of the EM algorithm consist of computing P(m|x_t, λ) given the current model, and updating the model using the equations above. These two steps are iterated until a convergence criterion is satisfied. Test utterance scores are obtained as the average log-likelihood given by

s(Y|\lambda) = \frac{1}{T} \sum_{t=1}^{T} \log[p(y_t|\lambda)].   (36.18)

Speaker verification is often based on a likelihood-ratio test statistic of the form p(Y|\lambda)/p(Y|\lambda_{bg}), where λ is the speaker model and \lambda_{bg} represents a background model [36.29]. For such systems, speaker models can also be trained by adapting \lambda_{bg}, which is generally trained on a large independent speech database [36.30]. There are many motivations for this approach. Generating a speaker model by adapting a well-trained background GMM may yield models that are more robust to channel differences and other kinds of mismatch between enrollment and test conditions than models estimated using only limited enrollment data. Details of this procedure can be found in Chap. 38.

Speaker modeling using GMMs is attractive for text-independent speaker recognition because it is simple to implement and computationally inexpensive. The fact that this model does not capture temporal aspects of speech is a disadvantage. However, it has been difficult to exploit temporal structure to improve speaker recognition performance when the linguistic content of test utterances does not overlap significantly with the linguistic content of enrollment utterances.
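The sketch below trains a small diagonal-covariance GMM with EM updates in the spirit of (36.14)–(36.17) and scores a test utterance with the average log-likelihood of (36.18). The random initialization, iteration count, and variance flooring are our own simplifications (the chapter suggests k-means initialization), and all names are invented.

```python
import numpy as np

def gmm_log_pdf(x, means, variances, weights):
    """log p(x|lambda) for a diagonal-covariance GMM; x has shape (T, D)."""
    log_comp = -0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                       + (((x[:, None, :] - means[None, :, :]) ** 2)
                          / variances[None, :, :]).sum(axis=2))
    log_comp = log_comp + np.log(weights)
    m = log_comp.max(axis=1, keepdims=True)                 # log-sum-exp over components
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

def train_gmm(x, k=8, n_iters=25, seed=0):
    """EM updates in the spirit of (36.14)-(36.17), with diagonal covariances."""
    rng = np.random.default_rng(seed)
    means = x[rng.choice(len(x), size=k, replace=False)]    # crude init; k-means is more usual
    variances = np.tile(x.var(axis=0), (k, 1)) + 1e-6
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: posteriors P(m | x_t, lambda) of (36.14)
        log_comp = -0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                           + (((x[:, None, :] - means[None, :, :]) ** 2)
                              / variances[None, :, :]).sum(axis=2)) + np.log(weights)
        log_comp -= log_comp.max(axis=1, keepdims=True)
        post = np.exp(log_comp)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: (36.15)-(36.17)
        n_m = post.sum(axis=0)
        means = (post.T @ x) / n_m[:, None]
        variances = (post.T @ (x ** 2)) / n_m[:, None] - means ** 2 + 1e-6
        weights = n_m / len(x)
    return means, variances, weights

def avg_log_likelihood(y, model):
    """Average log-likelihood s(Y|lambda) of (36.18); model is (means, variances, weights)."""
    return gmm_log_pdf(y, *model).mean()

# A likelihood-ratio style verification score, as described above, would be
# avg_log_likelihood(y, speaker_model) - avg_log_likelihood(y, background_model).
```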
Hidden Markov Models

In applications where the system has prior knowledge of the text and there is significant overlap between what is said during enrollment and testing, text-dependent statistical models are much more effective than GMMs. An example of such applications is access control to personal information or bank accounts using a voice password. Hidden Markov models (HMMs) [36.12] for phones, words, or phrases have been shown to be very effective [36.31, 32]. Passwords consisting of word sequences drawn from specialized vocabularies such as digits are commonly used. Each word can be characterized by an HMM with a small number of states, in which each state is represented by a Gaussian mixture density. The maximum-likelihood estimates of the parameters of the model can be obtained using a generalization of the EM algorithm [36.12].

ML training aims to approximate the underlying distribution of the enrollment data for a speaker. The estimates deviate from the true distribution due to lack of sufficient training data and incorrect modeling assumptions. This leads to a suboptimal classifier design. Some limitations of ML training can be overcome using discriminative training of speaker models, in which an attempt is made to minimize an overall cost function that depends on misclassification or detection errors [36.33–35]. Discriminative training approaches require examples from competing speakers in addition to examples from the target speaker. In the case of closed-set speaker identification, it is possible to construct a misclassification measure to evaluate how likely a test sample, spoken by a target speaker, is to be misclassified as any of the others. One example of such a measure is the minimum classification error (MCE), defined as follows. Consider the set of S discriminant functions \{g_s(x; \Lambda_s), 1 \le s \le S\}, where g_s(x; \Lambda_s) is the log-likelihood of observation x given the model \Lambda_s for speaker s. A set of misclassification measures for each speaker can be defined as

d_s(x; \Lambda) = -g_s(x; \Lambda_s) + G_s(x; \Lambda),   (36.19)

where Λ is the set of all speaker models and G_s(x; Λ) is the antidiscriminant function for speaker s. G_s(x; Λ) is defined so that d_s(x; Λ) is positive only if x is incorrectly classified. In speech recognition problems, G_s(x; Λ) is usually defined as a collective representation of all competing classes. In the speaker identification task, it is often advantageous to construct pairwise misclassification measures such as

d_{ss'}(x; \Lambda) = -g_s(x; \Lambda_s) + g_{s'}(x; \Lambda_{s'}),   (36.20)

with respect to a set of competing speakers s', a subset of the S speakers. Each misclassification measure is embedded into a smooth empirical loss function

l_{ss'}(x; \Lambda) = \frac{1}{1 + \exp[-\alpha\, d_{ss'}(x; \Lambda)]},   (36.21)

which approximates a loss directly related to the number of classification errors, where α is a smoothness parameter. The loss functions can then be combined into an overall loss given by

l(x; \Lambda) = \sum_{s} \sum_{s' \in S_c} l_{ss'}(x; \Lambda)\, \delta_s(x),   (36.22)

where \delta_s(x) is an indicator function equal to 1 when x is uttered by speaker s and 0 otherwise, and S_c is the set of competing speakers. The total loss, defined as the sum of l(x; Λ) over all training data, can be optimized with respect to all the model parameters using a gradient-descent algorithm. A similar algorithm has been developed for speaker verification, in which samples from a large number of speakers in a development set are used to compute a minimum verification measure [36.36]. The algorithm described above only illustrates the basic principles of discriminative training for speaker identification. Many other approaches that differ in their choice of the loss function or the optimization method have been developed and shown to be effective [36.35, 37]. The use of HMMs in text-dependent speaker verification is discussed in detail in Chap. 37.

Support Vector Modeling

Traditional discriminative training approaches such as those based on MCE have a tendency to overtrain on the training set. The complexity and generalization ability of the models are usually controlled by testing on