Is my understanding of FFT and Pitch Detection correct here?

There have been countless discussions on Stackoverflow and beyond about FFT and pitch detection.

It's generally accepted that FFT, while fast, is not very accurate for a lot of applications but it's not often explained why.

I'd like to explain my understanding of why this is the case, in the hope that someone smarter than me can correct me and fill in the blanks where I cannot.


FFT converts input data from the time domain to the frequency domain.

Initially, we start with a series of data which, if we were to plot on a graph, would have the amplitude of the sound at a given point in time on the Y axis, and time along the X axis. This is in the time domain.

FFT converts these values of amplitudes at points in time, to amplitudes at different frequencies.

The number of data output from the FFT is the SAME as the number of data input

If we input the amplitudes for 10 points in time (10 samples) FFT will output the amplitudes for 10 different frequencies within those samples (after multiplying the sqrt of the imaginary and real numbers).

Which frequencies is determined by the following:

We call the output from an FFT a bin, the width of each bin is calculated by dividing the sample rate by the number of samples in the FFT:

bin width = Sample Rate(Hz)/FFT Length (n samples)

With some real values, that could be:

bin_width = 44100 / 512 = 86.132

So we have 512 bins from our FFT (remember it's the same no. of data in as out), and each one spans 86.132 Hz.

So for a given bin, we can calculate which frequency it represents by:

Bin Freq (Hz) = Bin number (n) * bin width (Hz)

Using the values from above, the 3rd bin in the FFT output would represent the amplitude at 258.398Hz:

Bin Freq (Hz) = 3 * 86.132 = 258.396Hz

This means that with the given sample rate and buffer size, the FFT output cannot be any more accurate than ± 86.132Hz.

If you require greater accuracy (say 1Hz) you'd have to either reduce the sample rate, or increase the buffer size (or both).

desired bin width: 1Hz = 44100 / 44100  # A buffer size of 44100 would work in this instance 

As the buffer size gets closer to the sample rate, the issue of latency becomes more of a problem.

FFT Results per second = Sample Rate / Buffer Size = 44100/44100 = 1 FFT per second

(44100 samples per second, to fill a 44100 sample buffer = 1 full buffer per second).

I realise that there's more to FFT than merely calculating the fundamental frequency (the bin with the highest amplitude), but is my understanding of FFT in Pitch Detection correct so far?

Are there any ways of improving the accuracy of an FFT without sacrificing latency?


In addition to the nice answer by @HartmutPfitzinger recommending zero padding and interpolation, it's worth pointing out that there are important fundamental limits on the information you can get from the Fourier transform of a time-limited signal extract.

Think about the limiting case of zero padding -- eg, taking a single sample, then padding it out to 1 sec duration in order to take a Fourier transform with 1 Hz resolution. It's clear that a very short fragment of signal simply doesn't contain the information about periodicity. Intuitively, we need a fragment longer than the period in question to be able to say anything about whether the signal really repeats at that period.

We can do better if we have constraints on the shape of our periodic signal. For instance, if we are only looking for single sinusoids (ie, we know our signal is s(t) = A*cos(w*t + phi)), then we can solve for unknown amplitude A, frequency w, and phase phi using as few as three samples of s(t). However, it's pretty rare that we're looking at signals that exactly fit that formulation. At the very least we expect added noise, but most often we have lots of harmonics, ie, an unknown, non-sinusoidal periodic waveform.

If you try implement interpolated peak picking and/or zero-padding suggested above, and then look at the results you get as you make your signal excerpt shorter (while keeping the FFT length the same), you'll see the uncertainty (error) growing as the fragment gets shorter - to the point where you'll likely get useless results when the fragment is shorter than around twice the cycle length of the periodicity you're trying to measure.

This illustrates a somewhat counter-intuitive but very fundamental limit: It's hard to decide the frequency of a signal to better than 1/T Hz based an observation shorter than T seconds. This is sometimes called the uncertainty principle, and mathematically it's the same as the Heisenberg uncertainty principle from quantum mechanics.

Finally, one other technique I've used to improve the resolution of discrete Fourier transforms is Instantaneous Frequency, as described in:

Toshihiko Abe, Takao Kobayashi, Satoshi Imai: Robust pitch estimation with harmonics enhancement in noisy environments based on instantaneous frequency. ICSLP 1996 (you can find the PDF online, I've run out of my link allowance).

Frequency is just the derivative of phase with respect to time; it turns out you can use "derivative by parts" to directly calculate the instantaneous frequency in each FFT bin by combining the real and imaginary parts of two FFTs using different window functions (one the derivative of the other). For a Matlab implementation, see

http://labrosa.ee.columbia.edu/matlab/chroma-ansyn/ifgram.m

or in Python:

https://github.com/bmcfee/librosa/blob/master/librosa/core.py#L343


Concerning your 1st question ("is my understanding of FFT in Pitch Detection correct so far?") I would say yes, but I would like to point to a pitfall:

Using the values from above, the 3rd bin in the FFT output would represent the amplitude at 258.398Hz:

Bin Freq (Hz) = 3 * 86.132 = 258.396Hz

Be aware that the 0th bin represents 0 Hz. This means that the bin representing 3 * 86.132 = 258.396Hz, is in the 4th position of the resulting array.

And to complete this index pitfall, if you have an FFT of 512 points (=fftsize), the index value 256 represents the Nyquist frequency (= sample frequency / 2). This means you always get fftsize/2+1 bins representing the real frequency spectrum, ie in your case 257 bins.

Concerning your 2nd question , there are two wide-spread and simple methods to increase the frequency detection precision:

  • Zero-padding (See eg some answers why zero-padding is used)

  • Parabolic Interpolation (See eg the 1st answer)

  • Finally, a not-asked answer: It is absolutely recommended to apply a window function , not only because of it is the premise for parabolic interpolation but also because it reduces the amplitude of significant artificial side lobes.


    Generally, increasing the length of the FFT not only increases the latency but can also make it difficult to detect the frequency since it is probably not constant.
    A common practice it to use overlapping and windowing, take a look at this: http://en.wikipedia.org/wiki/Spectral_density_estimation

    There are a few method of improving the accuracy of the estimated frequency, after the peak detection. For example, by interpolation on the Fourier coefficients.
    Take a look at section 2 here or section 1.3 here

    链接地址: http://www.djcxy.com/p/33840.html

    上一篇: 为什么理想带通滤波器不能按预期工作?

    下一篇: 我对FFT和音高检测的理解是否正确?