
EURASIP Journal on Applied Signal Processing 2004:7, 1007–1020
© 2004 Hindawi Publishing Corporation

Vibrato in Singing Voice: The Link between Source-Filter and Sinusoidal Models

Ixone Arroabarren
Departamento de Ingeniería Eléctrica y Electrónica, Universidad Pública de Navarra, Campus de Arrosadia, 31006 Pamplona, Spain
Email: ixone.arroabarren@unavarra.es

Alfonso Carlosena
Departamento de Ingeniería Eléctrica y Electrónica, Universidad Pública de Navarra, Campus de Arrosadia, 31006 Pamplona, Spain
Email: carlosen@unavarra.es

Received 4 July 2003; Revised 30 October 2003

The application of inverse filtering techniques to high-quality singing voice analysis/synthesis is discussed. In the context of source-filter models, inverse filtering provides a noninvasive method to extract the voice source, and thus to study voice quality. Although this approach is widely used in speech synthesis, this is not the case for singing voice. Several studies have shown that inverse filtering techniques fail in the case of singing voice, for reasons that remain unclear. In order to shed light on this problem, we consider here an additional feature of singing voice, not present in speech: vibrato. Vibrato has traditionally been studied by sinusoidal modeling. As an alternative, we introduce here a novel noninteractive source-filter model that incorporates the mechanisms of vibrato generation. This model also allows the comparison of the results produced by inverse filtering techniques and by sinusoidal modeling, as they apply to singing voice rather than to speech. In this way, the limitations of these conventional techniques, described in previous literature, are explained. Both synthetic signals and singer recordings are used to validate and compare the techniques presented in the paper.

Keywords and phrases: voice quality, source-filter model, inverse filtering, singing voice, vibrato, sinusoidal model.
1. INTRODUCTION

Inverse filtering provides a noninvasive method to study voice quality. In this context, high-quality speech synthesis is developed using a source-filter model, where voice texture is controlled by the characteristics of the glottal source. Efforts to apply this approach to singing voice have failed, and the reasons are not clear: the unsuitability of the model, the different range of fundamental frequencies, or both could be the cause. Lyric singers, being professionals, must phonate efficiently, and as a result they are trained to shift their formants towards the positions of the first harmonics, which could be yet another reason for the model's failure [1]. This paper aims to shed light on this problem by comparing two salient methods for glottal source and vocal tract response (VTR) estimation with a novel frequency-domain method proposed by the authors. In this way, the inverse filtering approach will be tested in singing voice analysis. In order to have a benchmark, the source-filter model will be compared to the sinusoidal model, and this comparison will be made possible by a particular feature of singing voice: vibrato.

Regarding voice production models, two approaches can be distinguished.

(i) On the one hand, interactive models are closer to the physical features of the vocal system. This system is composed of two resonant cavities (subglottal and supraglottal) connected by a valve, the glottis, where the vocal folds are located. The movement of the vocal folds provides the harmonic nature of the airflow of voiced sounds, and also controls the coupling between the two resonant cavities, which is different during the open and closed phases. As a result of this effect, the VTR changes during a single fundamental period, and there is a relationship between the glottal source and the VTR. This physical behavior has been modeled in several ways, by physical models [2] or aerodynamic models [3, 4].
From the signal processing point of view, in [4] the VTR variation is related to the glottal area, which controls the coupling of the cavities, and this relationship is represented by a frequency modulation of the central frequency and bandwidth of the formants. Another effect of the source-tract interaction is the increase of the skewness of the glottal source [4], which emphasizes the difference between the glottal area and the glottal source [5].

(ii) On the other hand, noninteractive models separate the glottal source and the VTR, and both are independently modeled as linear time-varying systems. This is the case of the source-filter model proposed by Fant in [6]. The VTR is modeled as an all-pole filter in the case of nonnasal sounds. For the glottal source, several waveform models have been proposed [7, 8, 9], but all of them try to include some of the features of the source-tract interaction, typically the asymmetric shape of the pulse. These models provide a high-quality synthesis framework for speech with low computational complexity. The synthesis is preceded by an analysis stage, which is divided into two steps: an inverse filtering step, where the glottal source and the VTR are separated [9, 10, 11, 12, 13], and a parameterization step, where the most relevant parameters of both elements are obtained [14, 15, 16].

In general, inverse filtering techniques yield worse results as the fundamental frequency increases, as is the case for women and children in speech, and for singing voice. In the latter case, the number of published works is very scarce [1, 17]. In [1], the glottal source features are studied in speech and singing voice by means of acoustic and electroglottographic signals [18, 19]. From these works, it is not apparent what the main limitation of inverse filtering in singing voice is.
It might be that the source-tract interaction is more complex than in speech, which would contradict the noninteractive assumption [20]. Another reason mentioned in [1] is that the glottal source models used in speech may not be suitable for singing voice. These statements have not been demonstrated, but they are interesting questions that should be answered. On the other hand, in [17] the noninteractive source-filter model is used as a high-quality singing voice synthesis approach. The main contribution of that work is the development of an analysis procedure that estimates the parameters of the synthesis model [12, 21]. However, there is no evidence there that could point to differences between speech and singing, as indicated in [1].

One of the goals of the present work is to clarify whether noninteractive models are able to model singing voice in the same way as high-quality speech, or whether, on the contrary, the source-tract interaction is different from speech and precludes this linear model assumption. If the noninteractive model can model singing voice, the reason for the failure of inverse filtering techniques would simply be the high fundamental frequency of singing voice. To this end, we will compare in this paper three different inverse filtering techniques, one of them novel and recently proposed by the authors, in order to obtain the source-filter decomposition. Though they work correctly for speech and low-frequency signals, we will show their limitations as the fundamental frequency increases. This is described in Section 2. Since the fundamental frequency in singing voice is higher than in speech, it seems obvious that the above-mentioned methods fail, apparently due to the limited spectral information provided by high-pitched signals.
To compensate for that, we claim that the introduction of a feature such as vibrato may serve to increase the information available, by virtue of the frequency-modulated nature, and therefore wider bandwidth, of vibrato [22, 23, 24]. The frequency variations are influenced by the VTR, and this effect can be used to obtain information about it. With this in mind, it is not surprising that vibrato has traditionally been analyzed by sinusoidal modeling [25, 26], the most important limitation being the impossibility of separating the sound generation and the VTR. In Section 3, we will take a step forward by introducing a source-filter model which accounts for the physical origin of the main features of singing voice. Making use of this model, we will also demonstrate how the simpler sinusoidal model can provide information complementary to inverse filtering, particularly in those conditions where the latter method fails.

2. INVERSE FILTERING

Throughout this section, the noninteractive source-filter model depicted in Figure 1 will be considered, and some of the possible estimation algorithms for it will be reviewed.

Figure 1: Noninteractive source-filter model of the voice production system (glottal source, VTR, lip radiation 1 − l·z⁻¹).

According to the block diagram in Figure 1, singing voice production can be modeled by a glottal source excitation that is linearly modified by the VTR and the lip radiation diagram. Typically, the VTR is modeled by an all-pole filter and, relying on the linearity of the model, the lip radiation system is combined with the glottal source, in such a way that the glottal source derivative (GSD) is considered as the vocal tract excitation. In this context, many inverse filtering algorithms to estimate the model elements have been proposed during the last decades. The technique is usually accomplished in two steps. In the first one, the GSD waveform and the VTR are estimated.
In the second one, these signals are parameterized by a few numerical values. This whole analysis can be implemented in several ways. For the sake of clarity, we can group the possibilities into two types.

(i) In the first group, the two identification steps are combined in a single algorithm, for instance in [9, 12]. There, a mathematical model for the GSD and an autoregressive (AR) model for the VTR are considered, and the VTR and GSD model parameters are estimated simultaneously. In this way, the GSD model parameterizes a given phonation type. Several different algorithms follow this structure, but all of them are invariably time-domain implementations that require glottal closure instant (GCI) detection [27]. Therefore, they suffer from a high computational load, which makes them cumbersome.

Figure 2: Block diagram of the AbS inverse filtering algorithm.

(ii) The procedures in the second group split the whole process into two stages. Regarding the first step, different inverse filtering techniques have been proposed [11, 13]. These algorithms remove the GSD effect from the speech signal, and the VTR is obtained by linear prediction (LP) [28] or, alternatively, by discrete all-pole (DAP) modeling [29], which avoids the fundamental frequency dependence of the former.

For this comparative study, three inverse filtering approaches have been selected. The first one is the analysis-by-synthesis (AbS) procedure presented in [9]; the second one is the one proposed by the authors in [13], glottal spectrum based (GSB) inverse filtering. In this way, both groups of algorithms mentioned above are represented. In addition, the closed phase covariance (CPC) method [10] has been added to the comparison.
This approach is difficult to classify because it only obtains the VTR, as is the case in the second group, but it is a time-domain implementation as in the first one. The most interesting feature of this algorithm is that it is less affected by the formant ripple due to the source-tract interaction, because it only takes into account the time interval when the vocal folds are closed. In what follows, the three approaches will be briefly described and finally compared.

2.1. Analysis by synthesis

This inverse filtering algorithm was proposed in [9]. It is based on covariance LPC [29], but the least squares error is modified in order to include the input of the system:

E = Σ_{n=0}^{N−1} [s(n) − ŝ(n)]²
  = Σ_{n=0}^{N−1} [s(n) − (Σ_{k=1}^{p} a_k s(n−k) + a_{p+1} g(n))]²,   (1)

where g(n) represents the GSD, and

H(z) = a_{p+1} / (1 − Σ_{k=1}^{p} a_k z^{−k})   (2)

represents the VTR. Since neither the VTR nor the GSD parameters are known, an iterative algorithm is proposed and a simultaneous search is developed. The block diagram of the algorithm is represented in Figure 2. As in covariance LP without source, this approach allows shorter analysis windows. However, the stability of the system is not guaranteed, and a stabilization step must be included for this purpose. Also, since it is a time-domain implementation, the voice source model must be synchronized with the speech signal, and a high sampling frequency is mandatory in order to obtain satisfactory results. As a result, the computational load is also high. The GSD parameter optimization depends on the chosen model. In the results shown in Section 2.4, the LF model is selected because it is one of the most powerful GSD models, and it allows an independent control of the three main features of the glottal source: open quotient, asymmetry coefficient, and spectral tilt. The disadvantage of this model is its computational load. For more details on the topic, readers are referred to [8]. Regarding fundamental frequency limits, it is shown in [1] that this algorithm provides unsatisfactory results for medium- and high-pitched signals.

2.2. Glottal spectrum based inverse filtering

This technique was proposed by the authors in [13] and will be briefly described here. Unlike the technique described in the previous section, it is essentially a frequency-domain implementation. In the AbS approach, the GSD effect was included in the LP error, and the AR coefficients were obtained by covariance LPC. In our case, a short-term spectrum of speech is considered (3 or 4 fundamental periods), and the GSD effect is removed from the speech spectrum. Then, the AR coefficients of (2) are obtained by DAP modeling [29]. For this spectral implementation, the KLGLOTT88 model [7] has been considered. It is less powerful than the LF model, but of a simpler implementation. As shown in Figure 3, there is a basic voicing waveform controlled by the open quotient (Oq) and the amplitude of voicing (AV), the spectral tilt being included by a first-order lowpass filter.

Figure 3: Block diagram of the KLGLOTT88 model.

Figure 4: Block diagram of the GSB inverse filtering algorithm.

Figure 5: Closed phase interval in voice.
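The spectral-division idea at the core of GSB can be sketched numerically: since the voice spectrum factors as S(f) = G(f)·V(f), dividing by a known source spectrum leaves the tract response, which DAP modeling would then fit with an all-pole model. The sketch below uses stand-in signals (white noise for the source, a short FIR for the tract) and omits the DAP fit itself:

```python
import numpy as np

rng = np.random.default_rng(2)
g = rng.standard_normal(256)             # stand-in source waveform
h = np.array([1.0, 0.5, 0.25, 0.125])    # stand-in vocal tract impulse response
s = np.convolve(g, h)                    # "voice" = source filtered by the tract

nfft = 512                               # >= len(s), so circular == linear convolution
G = np.fft.rfft(g, nfft)
S = np.fft.rfft(s, nfft)

mask = np.abs(G) > 1e-6                  # avoid dividing at spectral nulls
V_est = S[mask] / G[mask]                # estimated tract response
V_true = np.fft.rfft(h, nfft)[mask]
```

In the real algorithm only the harmonic peaks of the short-term spectrum are reliable, which is why the division is followed by peak detection and DAP modeling rather than a bin-by-bin fit.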
In our inverse filtering algorithm, once the short-term spectrum is calculated, the glottal source effect is removed by spectral division, using the spectrum of the basic voicing waveform (3), which can be obtained directly as the Fourier transform of the basic voicing waveform [30]:

G(f) = 27AV / (2 Oq³ T0³ (2πf)³) · [3j (e^{−j2πf Oq T0} − 1) − 2πf Oq T0 (1 + 2 e^{−j2πf Oq T0}) − (j/2)(2πf Oq T0)² e^{−j2πf Oq T0}],   (3)

where T0 is the fundamental period. The spectral tilt (ST) and the VTR are combined in an (N + 1)th-order all-pole filter. The block diagram of the algorithm is shown in Figure 4.

Since DAP modeling is the most important part of the algorithm, we should explain its rationale. In classical autocorrelation LP [28], it is a well-known effect that, as the fundamental frequency increases, the resulting transfer function is biased by the spectral peaks of the signal. This happens because the signal is assumed to be the impulse response of the system, and this assumption is obviously not entirely correct. In order to avoid this problem, an alternative proposed in [29] is to compute the LP error from the spectral peaks, instead of from the time-domain samples. Unfortunately, this error calculation is based on an aliased version of the true autocorrelation of the signal, and this aliasing grows as the fundamental frequency increases, so the resulting transfer function is again incorrect. To solve this problem, DAP modeling uses the Itakura-Saito error instead of the least squares error, and it can be shown that this error is minimized using only the spectral peak information. The details of the algorithm are explained in [29]. This technique allows higher fundamental frequencies than classical autocorrelation LP, but for proper operation it requires a sufficient number of spectral peaks in order to estimate the right transfer function. So, this inverse filtering algorithm will also have a limit in the highest achievable fundamental frequency.

2.3. Closed phase covariance

This inverse filtering technique was proposed in [31]. It is also based on covariance LP, like the AbS approach explained above. However, instead of removing the effect of the GSD from a long speech interval, classical covariance LP is applied taking into account only the portion of a single cycle where the vocal folds are closed. In this way, in the considered time interval there is no GSD information to be removed, and the application of covariance LP will lead to the right transfer function. Considering the linearity of the model shown in Figure 1, the closed phase interval will be the time interval where the GSD is zero. This situation is depicted in Figure 5.

The most difficult step in this technique is to detect the closed phase in the speech signal. In [10], two-channel speech processing is proposed, making use of electroglottographic signals to detect the closed phase. Electroglottography (EGG) is a technique used to indirectly register laryngeal behavior by measuring the electrical impedance across the throat during speech. Rapid variation in the conductance is mainly caused by movement of the vocal folds. As they approximate and the physical contact between them increases, the impedance decreases, which results in a relatively higher current flow through the larynx structures. Therefore, this signal provides information about the contact surface of the vocal cords. The complete inverse filtering algorithm is represented in Figure 6.

Figure 6: Closed phase covariance (CPC).
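Covariance LP restricted to the closed phase can be sketched directly: during an interval where the excitation is zero, the voice samples obey the pure AR recursion, so least squares over just those samples recovers the tract exactly. A toy second-order example (not the paper's EGG-based closed phase detection):

```python
import numpy as np

def covariance_lp(s, idx, p):
    """Covariance LP using only the analysis instants in idx (here: the closed phase)."""
    A = np.array([[s[n - k] for k in range(1, p + 1)] for n in idx])
    b = np.array([s[n] for n in idx])
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

# one glottal cycle in miniature: an impulse excites a toy 2nd-order "VTR" at n = 0,
# and the excitation stays zero afterwards (the "closed phase")
a_true = [1.2, -0.7]
s = np.zeros(60)
s[0] = 1.0
for n in range(1, len(s)):
    s[n] = a_true[0] * s[n - 1]
    if n >= 2:
        s[n] += a_true[1] * s[n - 2]

a_est = covariance_lp(s, idx=range(5, 40), p=2)
```

With real voice the closed phase shortens as F0 rises, so `idx` holds fewer and fewer samples; this is the fundamental frequency limit discussed in the text.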
Figure 7: (a) Estimated GSD, F0 = 100 Hz, vowel "a." (b) Estimated GSD, F0 = 300 Hz, vowel "a." (c) Estimated VTR, F0 = 100 Hz, vowel "a." (d) Estimated VTR, F0 = 300 Hz, vowel "a."

In Figure 6, a GCI detection block [27] is included because, even though the acoustic and electroglottographic signals are simultaneously recorded, there is a propagation delay between the acoustic signal recorded at the microphone and the impedance variation at the neck of the singer. Thus, a precise synchronization is mandatory. Since this technique is based on covariance LP, it may work with very short window lengths. However, as the fundamental frequency increases, the time length of the closed phase gets shorter, and there is much less information left for the vocal tract estimation. This fact imposes a fundamental frequency limit, even using covariance LP.

2.4. Practical results

Once the basics of the three inverse filtering techniques have been presented and described, they will be compared by simulations and also by making use of natural singing voice recordings. The main goal of this analysis is to see how the three techniques compare in terms of their fundamental frequency limitations.

2.4.1. Simulation results

First, the noninteractive model for voice production shown in Figure 1 will be used in order to synthesize some artificial test signals. The lip radiation effect and the glottal source are combined in a mathematical model for the GSD, again making use of the LF model.
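The construction of such a test signal, and the ideal outcome of inverse filtering on it, can be sketched in a few lines. A toy second-order all-pole filter stands in for the vowel VTRs, and white noise stands in for the LF-model GSD; the point is only that, when the VTR coefficients are known, the FIR inverse recovers the excitation exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
gsd = rng.standard_normal(500)   # stand-in glottal source derivative
a1, a2 = 1.3, -0.8               # toy stable pole pair (one "formant")

# all-pole synthesis: voice[n] = gsd[n] + a1*voice[n-1] + a2*voice[n-2]
voice = np.zeros_like(gsd)
for n in range(len(gsd)):
    voice[n] = gsd[n]
    if n >= 1:
        voice[n] += a1 * voice[n - 1]
    if n >= 2:
        voice[n] += a2 * voice[n - 2]

# ideal inverse filtering: gsd[n] = voice[n] - a1*voice[n-1] - a2*voice[n-2]
recovered = voice.copy()
recovered[1:] -= a1 * voice[:-1]
recovered[2:] -= a2 * voice[:-2]
```

The three estimators compared below differ precisely in how they obtain the coefficients that this sketch assumes known.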
It is well known [1, 17] that the formant position can affect inverse filtering results. In [3], it is also shown that the lower the first formant central frequency is, the higher the source-tract interaction. So, the interaction is higher in vowels where the first formant central frequency is lower. Therefore, in order to cover all possible situations, two vocal all-pole filters have been used for synthesizing the test signals: one representing the Spanish vowel "a," and the other representing the Spanish vowel "e." In this latter case, the first formant is located at lower frequencies. In order to see the fundamental frequency dependence of the inverse filtering techniques, this parameter has been varied from 100 Hz to 300 Hz in 25 Hz steps. For each fundamental frequency, the three algorithms have been applied, and the GSD as well as the VTR have been estimated.

Figure 8: Fundamental frequency dependence. (a) ErrorF1 in vowel "a." (b) ErrorF1 in vowel "e." (c) ErrorGSD in vowel "a." (d) ErrorGSD in vowel "e."

In Figures 7a to 7d, the GSD and the VTR estimated by the three approaches are shown for two different fundamental frequencies. Note that in them, and in other figures, the DC level has been arbitrarily modified to facilitate comparisons. Comparing the results obtained by the three inverse filtering approaches, it is seen that as the fundamental frequency increases, the error in both the GSD and the VTR increases. Recalling the implementation of the algorithms, the CPC uses only the time interval where the GSD is zero.
When the fundamental frequency is low, the result of this technique is the closest to the original. In the case of the other two techniques, both show slight variations in the closed phase, because in both cases the glottal source effect is removed from the speech signal in an approximate manner. On the other hand, when the fundamental frequency is high, the AbS approach leads comparatively to the best result. However, it provides neither the right GSD nor the right VTR. In Figure 8, the relative error in the first formant central frequency and the error in the GSD are represented for the three methods, calculated according to the following expressions:

ErrorF1 = |F1 − F̂1| / F1,   ErrorGSD = (1/N) Σ_{n=0}^{N−1} [g(n) − ĝ(n)]²,   (4)

where F1 represents the first formant central frequency, and g(n) and ĝ(n) are the original and estimated GSD waveforms, respectively. Although the simulation model does not take into account source-tract interactions, Figure 8 shows that the inverse filtering results depend on the first formant position, being worse as it moves to lower frequencies. Also, it is possible to see that both errors increase as the fundamental frequency increases. Therefore, the main conclusion of this simulation-based study is that the inverse filtering results show a fundamental frequency dependence even when applied to a noninteractive source-filter model.

Figure 9: (a) Estimated GSD, F0 = 123 Hz, vowel "a." (b) Estimated VTR, F0 = 123 Hz, vowel "a." (c) Estimated GSD, F0 = 295 Hz, vowel "a." (d) Estimated VTR, F0 = 295 Hz, vowel "a."
2.4.2. Natural singing voice results

For this analysis, three male professional singers were recorded: two tenors and one baritone. They were asked to sing notes of different fundamental frequencies, in order to register samples across their entire tessitura. Besides, different vocal tract configurations were considered, and thus this exercise was repeated for the five Spanish vowels "a," "e," "i," "o," "u." The singing material was recorded in a professional studio, in such a way that reverberation was reduced as much as possible. Acoustic and electroglottographic signals were synchronously recorded, with a bandwidth of 20 kHz, and stored in .wav format. In order to remove low-frequency ambient noise, the signals were filtered by a highpass linear-phase FIR filter whose cutoff frequency was set to 75% of the fundamental frequency. This filtering was also applied to the electroglottographic signals, because of the low-frequency artifacts, due to larynx movements, that are typical of this kind of signal.

In Figures 9a to 9c, the results obtained for different fundamental frequencies and vowel "a," for the same singer, are shown. These results are also representative of the other singers' recordings and of the different vowels. By comparing Figures 9a and 9c, it is possible to conclude that, in the case of a low fundamental frequency, the three algorithms provide very similar results. In the case of CPC, the GSD presents less formant ripple in the closed phase interval. Regarding the VTR, the central frequencies of the formants and the frequency responses are very similar. Nevertheless, in the case of a high fundamental frequency, the GSDs resulting from the three analyses are very different from those of Figure 9a, and also from the waveform provided by the LF model. Also, the calculated VTR is very different for the three methods. Thus, the conclusions for natural recorded voices are similar to those obtained with synthetic signals.
3. VIBRATO IN SINGING VOICE

3.1. Definition

In Section 2, inverse filtering techniques, successfully employed in speech processing, were applied to singing voice. It has been shown that, as the fundamental frequency increases, they reach a limit, and thus an alternative technique should be used. As we will show in this section, the presence of vibrato in singing voice provides more information about what may be happening.

Vibrato in singing voice can be defined as a small quasiperiodic variation of the fundamental frequency of the note. As a result of this variation, all of the harmonics of the voice will also present an amplitude variation, because of the filtering effect of the VTR. Due to these nonstationary characteristics of the signal, singing voice has been modeled by the modified sinusoidal model [25, 26]:

s(t) = Σ_{i=0}^{N−1} a_i(t) cos θ_i(t) + r(t),   (5)

where

θ_i(t) = 2π ∫_{−∞}^{t} f_i(τ) dτ,   (6)

and a_i(t) is the instantaneous amplitude of the partial, f_i(t) the instantaneous frequency of the partial, and r(t) the stochastic residual. The acoustic signal is composed of a set of components (partials), whose amplitude and frequency change with time, plus a stochastic residual, which is modeled by a time-varying spectral density function. Also in [25, 26], detailed information is given on how these time-varying characteristics can be measured.

Figure 10: AM-FM representation for the first 20 harmonics. Anechoic tenor recording, F0 = 220 Hz, vowel "a."

Of the two features of a vibrato signal, frequency and amplitude variations, frequency is the most widely studied and characterized. In [32, 33], the instantaneous frequency is characterized and decomposed into three main components, which account for three musically meaningful characteristics, respectively.
Namely,

f(t) = i(t) + e(t) cos ϕ(t),   (7)

where

ϕ(t) = 2π ∫_{−∞}^{t} r(τ) dτ,   (8)

f(t) being the instantaneous frequency, i(t) the intonation of the note, which corresponds to slow variations of pitch; e(t) represents the extent or amplitude of the pitch variations, and r(t) represents the rate or frequency of the pitch variations. All of them are time-dependent magnitudes and depend on the musical context and on the singer's talent and training. In the case of intonation, its value depends on the sung note, and thus on the context. But extent and rate are mostly singer-dependent features, typical values being 10% of the intonation value and 5 Hz, respectively.

Regarding the amplitude variation of the harmonics during vibrato, no well-established parameterization is accepted, and probably none exists, because this variation is different for each of the harmonics. It is therefore not strange that amplitude variation has been the topic of interest of only a few papers. The first work on this topic is [34], where the perceptual relevance of the instantaneous amplitude for spectral envelope discrimination is proven. In [22], the relevance of this feature is experimentally demonstrated for the synthesis of singing voice. Also, its physical cause is tackled, and a representation of the instantaneous amplitude versus the instantaneous frequency of the harmonics is introduced for the first time. This representation is proposed as a means of obtaining local information about the VTR in limited frequency ranges. Something similar is done in [35], where the singing voice is synthesized using this local information of the VTR. We have also contributed in this direction, for instance in [23], where the instantaneous amplitude is decomposed into two parts. The first one represents the sound intensity variation, and the other represents the amplitude variation determined by the local VTR, in an attempt to separate the contributions of the source and the vocal tract.
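With constant intonation, extent, and rate, (7) and (8) give a directly computable instantaneous frequency and phase. A sketch with the typical values quoted above (a 120 Hz intonation is assumed here only for concreteness):

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs                         # one second of vibrato
intonation, extent, rate = 120.0, 12.0, 5.5    # Hz; extent = 10% of intonation

# (7): instantaneous frequency of the fundamental; (8) reduces to a linear phase here
f_inst = intonation + extent * np.cos(2 * np.pi * rate * t)

# (6): instantaneous phase via rectangular integration, then the partial itself
theta = 2 * np.pi * np.cumsum(f_inst) / fs
fundamental = np.cos(theta)
```

The ith harmonic simply scales f_inst by i + 1, which is why higher partials sweep proportionally wider frequency ranges.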
Moreover, in [24], different time-frequency processing tools have been used and compared in order to identify the relationship between instantaneous amplitude and instantaneous frequency. In that work, the AM-FM representation is defined as the instantaneous amplitude versus instantaneous frequency representation, with time being an implicit parameter. This representation is compared to the magnitude response of an all-pole filter, which is typically used for VTR modeling. Two main conclusions are derived. The first one is that these two representations can be compared only when anechoic recordings are considered; otherwise, the instantaneous magnitudes will be affected by reverberation. The second one is that, since a frequency-modulated input is considered and frequency modulation is not a linear operation, the phase of the all-pole system will affect the AM-FM representation, leading to a representation different from the vocal tract magnitude response. However, the relevance of this effect depends on the formant bandwidth and on the vibrato characteristics, the vibrato rate in this case. It was also shown that in natural vibrato the phase effect of the VTR is not noticeable, because the vibrato rate is slow compared to the formant bandwidths.

Figure 10 constitutes a good example of the kind of AM-FM representations we are talking about. In it, each harmonic's instantaneous amplitude is represented versus its instantaneous frequency. For this case, only two vibrato cycles, where the vocal intensity does not change significantly, have been considered. As the harmonic number increases, the frequency range swept by each harmonic widens. Comparing Figure 10 and Figure 9b, the AM-FM representation of the former is very similar to the VTR of Figure 9b. However, in the case of the AM-FM representation, no source-filter separation has been made, and thus both elements are merged in that representation.
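The AM-FM representation can be reproduced for a single synthetic partial using the analytic signal: instantaneous amplitude from its modulus, instantaneous frequency from its unwrapped phase. A sketch (FFT-based analytic signal; in practice this is applied harmonic by harmonic to the sinusoidal-model output):

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT (even-length input assumed)."""
    n = len(x)
    X = np.fft.fft(x)
    w = np.zeros(n)
    w[0] = w[n // 2] = 1.0   # keep DC and Nyquist
    w[1:n // 2] = 2.0        # double positive frequencies, zero negative ones
    return np.fft.ifft(X * w)

fs = 8000
t = np.arange(2 * fs) / fs
f_true = 440.0 + 20.0 * np.cos(2 * np.pi * 5.5 * t)   # one partial with vibrato
phase = 2 * np.pi * np.cumsum(f_true) / fs
x = 0.8 * np.cos(phase)

z = analytic_signal(x)
amp = np.abs(z)                                             # instantaneous amplitude
freq = np.diff(np.unwrap(np.angle(z))) * fs / (2 * np.pi)   # instantaneous frequency
```

Plotting amp against freq, with time implicit, traces a local piece of the "VTR times source" curve, as in Figure 10; here the amplitude is constant because no filter was applied.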
The results obtained by other authors [22, 35] are quite similar regarding the instantaneous amplitude versus instantaneous frequency representation; however, in those works no comment is made about the recording conditions.

3.2. Simplified noninteractive source-tract model with vibrato

The main conclusion from the results presented above is that vibrato might be used to extract more information about the glottal source and the VTR in singing voice. Therefore, we propose here a simplified noninteractive source-filter model with vibrato, which serves as a signal model of vibrato production and explains the results provided by sinusoidal modeling. We first make some basic assumptions about the behavior of the GSD and the VTR during vibrato. These assumptions are based on perceptual aspects of vibrato and on the AM-FM representation of natural singing voice.

(1) The GSD characteristics remain constant during vibrato, and only the fundamental frequency of the voice changes. This assumption is justified by the fact that, perceptually, there is no phonation change during a single note.

(2) The intensity of the sound is constant, at least during one or two vibrato cycles.

(3) The VTR remains invariant during vibrato. This assumption relies on the fact that vocalization does not change along the note.

(4) The three vibrato characteristics remain constant. This assumption is not strictly true, but their time constants are considerably larger than the signal's fundamental period.

Taking these four assumptions into account, the simplified noninteractive source-filter model with vibrato can be represented by the block diagram in Figure 11. Based on this model, we will simulate the production of vibrato.
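A rough numerical sketch of such a simulation follows. The "glottal source derivative" here is only a crude stand-in (a harmonic series with a fixed spectral roll-off), not the actual LF model, and the formant frequencies and bandwidths are assumed /a/-like values, not measurements:

```python
import numpy as np

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)

# Vibrato F0 trajectory: 120 Hz intonation, 10% extent, 5.5 Hz rate.
f0 = 120.0 * (1.0 + 0.10 * np.cos(2 * np.pi * 5.5 * t))
phase = 2 * np.pi * np.cumsum(f0) / fs

# Crude stand-in for the glottal source derivative: harmonics with
# spectral roll-off (the LF model itself is more elaborate).
source = sum(np.sin(k * phase) / (k * k) for k in range(1, 20))

# All-pole VTR: cascade of resonators at assumed formant positions.
formants = [(700.0, 80.0), (1150.0, 90.0), (2600.0, 120.0)]  # (Hz, bandwidth)
a = np.array([1.0])
for fc, bw in formants:
    r = np.exp(-np.pi * bw / fs)          # pole radius from bandwidth
    w = 2 * np.pi * fc / fs               # pole angle from center frequency
    a = np.convolve(a, [1.0, -2.0 * r * np.cos(w), r * r])

# Direct-form all-pole filtering: y[n] = x[n] - sum_k a[k] * y[n-k].
y = np.zeros_like(source)
for n in range(len(source)):
    acc = source[n]
    for k in range(1, len(a)):
        if n >= k:
            acc -= a[k] * y[n - k]
    y[n] = acc
```

Because every pole radius is strictly below one, the cascade is stable by construction; `y` is the synthetic singing-voice signal to which both inverse filtering and sinusoidal modeling can then be applied.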
The GSD characteristics are the same as in Section 2.4, and the VTR has been implemented as an all-pole filter whose frequency response represents the Spanish vowel "a." A frequency variation typical of vibrato has been applied to the GSD, with a 120 Hz intonation, an extent of 10% of the intonation value, and a rate of 5.5 Hz. All of them are kept constant throughout. We have applied to the resulting signal both inverse filtering (where the presence or absence of vibrato does not influence the algorithm) and sinusoidal modeling, where the instantaneous amplitude and instantaneous frequency of each harmonic need to be measured.

[Figure 11: Noninteractive source-filter model with vibrato. Block diagram: LF-model glottal source derivative (parameters Oq, ft, α), driven by F0(t) with the vibrato parameters (intonation, rate, extent), followed by the VTR H(z) = 1/(1 − Σ_{k=1}^{p} a_k z^{−k}), producing the singing voice.]

Results obtained for this simulation are shown in Figures 12, 13, 14, and 15. In Figure 12a, inverse filtering results are shown for a short window analysis. When the fundamental frequency is low, GSD and VTR are well separated. In Figures 13a and 13b, sinusoidal modeling results are shown: the frequency variations of the harmonics of the signal are clearly observed and, as a result, so is the amplitude variation. On the other hand, in Figure 14 the AM-FM representation of the partials is shown. Taking into account the AM-FM representation of every partial, and comparing it to the VTR shown in Figure 12b, it is possible to conclude that this method provides local information about the VTR. However, as no source-filter decomposition has been performed, each AM-FM representation is shifted in amplitude depending on the GSD spectral features. This effect is a result of keeping the GSD parameters constant during vibrato. Comparing Figures 14 and 15, it can be noticed that if the GSD magnitude spectrum is removed from the AM-FM representation of the harmonics, the resulting AM-FM representation would provide only VTR information.
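This compensation amounts, in dB, to subtracting the GSD magnitude spectrum, evaluated at each harmonic's instantaneous frequency, from its instantaneous amplitude. A toy sketch of the idea (the spectra and numbers here are invented for illustration, not taken from the paper's data):

```python
import numpy as np

def remove_gsd(amp_db, freq_hz, gsd_freq, gsd_db):
    """Subtract the GSD magnitude spectrum (in dB) from a harmonic's
    AM-FM trajectory, leaving (ideally) only VTR information.
    The GSD spectrum is linearly interpolated at each instantaneous
    frequency of the harmonic."""
    gsd_at_f = np.interp(freq_hz, gsd_freq, gsd_db)
    return amp_db - gsd_at_f

# Toy check: if the measured amplitude is exactly GSD plus a flat
# 6 dB VTR, the residual after compensation is that flat 6 dB VTR.
gsd_freq = np.linspace(0.0, 4000.0, 41)
gsd_db = -6.0 * np.log2(1.0 + gsd_freq / 100.0)  # assumed source roll-off
freq = np.array([220.0, 231.0, 209.0])           # one harmonic's trajectory
amp = np.interp(freq, gsd_freq, gsd_db) + 6.0
print(remove_gsd(amp, freq, gsd_freq, gsd_db))   # → [6. 6. 6.]
```

In practice the GSD spectrum is itself an estimate (e.g. from inverse filtering), so the residual is only as accurate as that estimate.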
The result of this operation is shown in Figure 16. For this simplified noninteractive source-filter model with vibrato, the instantaneous parameters of sinusoidal modeling provide complementary information about both the GSD and the VTR. When inverse filtering works, the GSD effect can be removed from the AM-FM representation provided by sinusoidal modeling, and only the information about the VTR remains.

3.3. Natural singing voice

The relationship between these two signal models, the noninteractive source-filter model and the sinusoidal model, has been established for a synthetic signal where vibrato was included under the four assumptions stated at the beginning of the section. The question now is whether this relationship also holds in natural singing voice. Therefore, both kinds of signal analysis will now be applied to natural singing voice. In order to get close to the simulation conditions, some precautions have been taken in the recording process.

(1) The musical context has been selected in order to control intensity variations of the sound. Singers were asked to sing a word of three notes, where the first and the last simply provide musical support and the note in between is a long sustained note. This note is two semitones higher than the two accompanying ones.

[Figure 12: Inverse filtering results, GSB inverse filtering algorithm. (a) GSD (original vs. inverse filtered, amplitude over time). (b) VTR (original vs. inverse filtered, amplitude in dB vs. frequency in Hz).]

[Figure 13: Sinusoidal modeling results. (a) Instantaneous frequency (Hz) over time. (b) Instantaneous amplitude (dB) over time.]
[Figure 14: AM-FM representation (amplitude in dB vs. frequency in Hz).]

[Figure 15: GSD short-term spectrum, Blackman-Harris window (short-term spectrum and spectral peaks).]

(2) Recordings have been made in a studio where reverberations are reduced, but not completely eliminated as in an anechoic room. In this situation, the AM-FM representation will present slight deviations from the actual VTR, but a qualitative study is still possible.

In Figures 17, 18, 19, and 20 the results of these analyses are shown for a low-pitched baritone recording, F0 = 128 Hz, vowel "a". Contrary to Figures 12, 13, 14, and 15, here there is no reference for the original GSD and VTR. Comparing Figures 12b and 13b with Figures 17b and 18b, the instantaneous frequency variation is similar in simulation and natural singing voice.