How should I interpret a Spearman's rank correlation significance of zero?

2018-06-20 12:59:30

I'm calculating the Spearman rank correlation coefficient between two vectors using corr.

[rho, p] = corr(freq_type1, freq_type2, 'type', 'Spearman');

These vectors represent the frequency of terms in different types of document. For example, type1 might be a webpage and type2 might be a newspaper article. So each of the vectors freq_type1 and freq_type2 is 1 by n, where n is the number of terms in my vocabulary. The reason I am calculating rank correlation is that I want to be able to say whether the vocabulary differs in frequency between the different types of document. I normalize each vector, so that the rank corresponds to the percentage of documents each vocabulary term appears in.

The call above returns rho = .8879 and p = 0

As I understand it, when p is small the correlation is significant, but this is so extremely small that I am slightly concerned.

My first thought was that maybe the function didn't return p-values for the Spearman method. To test the method, I tried calculating the correlation of two random vectors.

[rho, p] = corr(rand(5,1), rand(5,1), 'type', 'Spearman');

This returns rho = 0.80 and p = 1.3, so the function seems to be working.

This is what my data distribution looks like on a loglog plot.

[Figure: data plot]

From the MATLAB documentation for corr, the p-value for Spearman is computed using permutation distributions.

Here is my understanding of how this calculation works, building on the Wikipedia article about permutation testing. Initially the correlation coefficient is calculated as the "observed value of the test statistic, T(obs)". Then both input sets are mixed together and the correlation coefficient is computed for every possible resampling of the mixed datapoints. The one-sided p-value of the test is calculated as the proportion of sampled permutations where the correlation is greater than or equal to T(obs). The two-sided p-value of the test is the proportion where it was less than or equal to T(obs).

Therefore, to get a p-value of zero, I would need to get all of the correlation coefficients for the sampled permutations to either be greater than or all be less than T(obs). That seems extremely unlikely since my datapoints don't lie exactly on a line.
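For reference, a permutation test for a rank correlation is usually set up a little differently from the description above: the pairing between the two vectors is shuffled (rather than pooling the values), and the two-sided p-value is the share of shuffles whose |rho| is at least as large as the observed |rho|. The following is a minimal sketch of that idea in Python using only the standard library. It is not a description of what corr does internally, and the helper names (`ranks`, `pearson`, `spearman`, `permutation_p_value`) are made up for illustration.

```python
import random


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value for Spearman's rho."""
    rng = random.Random(seed)
    rx, ry = ranks(x), ranks(y)
    observed = pearson(rx, ry)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ry)  # break the pairing between x and y
        # Small tolerance so floating-point noise does not hide exact ties.
        if abs(pearson(rx, ry)) >= abs(observed) - 1e-12:
            hits += 1
    return observed, hits / n_perm


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2, 1, 4, 3, 6, 5, 8, 7]
    rho, p = permutation_p_value(x, y)
    print(rho, p)
```

With this estimator, p comes out as exactly 0 whenever none of the sampled shuffles is as extreme as the observed value. That does not require the points to lie on a line, only that the observed correlation is stronger than anything the shuffles happen to produce. Some people report (hits + 1) / (n_perm + 1) instead so that the estimate is never exactly zero.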

Does the rank correlation require the data to be mean-centered, or to satisfy some other constraint?

Here is a link to the data on Dropbox, if you want to see if you get the same results.

You'll have to look elsewhere for in-depth statistical advice, but I can show what the Octave (MATLAB clone) code is doing (which, by the way, returns exactly the same results you observe). Here's the relevant code, commented with the observed values:

    % --> from previous computations, R = 0.88786, NN = 1540

    % SIGNIFICANCE TEST
    tmp = 1 - R.*R;
    % --> tmp = 0.21171

    t   = R.*sqrt(max(NN-2,0)./tmp);
    % --> t = 75.675

    sig = tcdf(t,NN-2);
    % --> sig = 1

    sig = 2 * min(sig,1 - sig);
    % --> sig = 0  (same as p which is reported)

Again, you may want to consult with someone more familiar with statistics for an understanding of these steps, but my conclusion is that, yes, given the large size of the data set there is unquestionably a significant (nonzero) correlation.
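To see why the last step prints exactly 0, it can help to evaluate the same t-based p-value in log space. Below is a minimal sketch in Python (standard library only, so it is not the Octave code itself). It relies on the standard identity that the two-sided p-value of a t statistic with df degrees of freedom equals the regularized incomplete beta function I_x(df/2, 1/2) with x = df/(df + t^2), which for a correlation simplifies to x = 1 - R^2 (the tmp value above). The function name `log10_two_sided_t_pvalue` is made up for illustration.

```python
import math


def log10_two_sided_t_pvalue(r, n):
    """log10 of the two-sided p-value for a correlation r over n points,
    using the Student's t approximation with n - 2 degrees of freedom."""
    df = n - 2
    x = 1.0 - r * r  # equals df / (df + t^2)
    if not 0.0 < x < 1.0:
        raise ValueError("sketch only handles 0 < |r| < 1")
    a, b = df / 2.0, 0.5

    # Power series for the regularized incomplete beta function:
    # I_x(a, b) = x^a (1-x)^b / (a B(a, b)) * (1 + sum of the terms below).
    series, term, k = 1.0, 1.0, 0
    while term > 1e-17 * series:
        term *= x * (a + b + k) / (a + 1 + k)
        series += term
        k += 1

    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_p = (a * math.log(x) + b * math.log1p(-x)
             - math.log(a) - log_beta + math.log(series))
    return log_p / math.log(10)


if __name__ == "__main__":
    log10_p = log10_two_sided_t_pvalue(0.88786, 1540)
    print(log10_p)        # exponent of the p-value
    print(10 ** log10_p)  # the p-value itself underflows in double precision
```

For R = 0.88786 and NN = 1540 the exponent comes out on the order of -520, far below the smallest positive double (roughly 1e-308 for normal numbers). So under this approximation the p-value is not literally zero; it is just too small to represent, and 1 - sig rounds to 0.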

I agree p=0 is very odd. But for me, it's your second example that shows all is not well. "p = 1.3" means it's not giving a standard p value as p is a probability so must fall between 0 and 1. Your p>1!!

In R, I use

cor.test(datafr$variable1, datafr$variable2, method="spearman")

This returns a standard rho and p, but I've never tried it with a vector as you describe (rather than just a dataset).
