Understanding the order() function
I'm trying to understand how the order()
function works. I was under the impression that it returned a permutation of indices, which when sorted, would sort the original vector.
For instance,
> a <- c(45,50,10,96)
> order(a)
[1] 3 1 2 4
I would have expected this to return c(2, 3, 1, 4)
, since the list sorted would be 10 45 50 96.
Can someone help me understand the return value of this function?
This seems to explain it.
The definition of order
is that a[order(a)]
is in increasing order. This works with your example, where the correct order is the fourth, second, first, then third element.
You may have been looking for rank
, which returns the rank of the elements
R> a <- c(4.1, 3.2, 6.1, 3.1)
R> order(a)
[1] 4 2 1 3
R> rank(a)
[1] 3 2 4 1
so rank
tells you what order the numbers are in, order
tells you how to get them in ascending order.
plot(a, rank(a)/length(a))
will give a graph of the CDF. To see why order
is useful, though, try plot(a, rank(a)/length(a),type="S")
which gives a mess, because the data are not in increasing order
If you did
oo<-order(a)
plot(a[oo],rank(a[oo])/length(a),type="S")
or simply
oo<-order(a)
plot(a[oo],(1:length(a))/length(a)),type="S")
you get a line graph of the CDF.
I'll bet you're thinking of rank.
To sort a 1D vector or a single column of data, just call the sort function and pass in your sequence.
On the other hand, the order function is necessary to sort data two-dimensional data--ie, multiple columns of data collected in a matrix or dataframe.
Stadium Home Week Qtr Away Off Def Result Kicker Dist
751 Out PHI 14 4 NYG PHI NYG Good D.Akers 50
491 Out KC 9 1 OAK OAK KC Good S.Janikowski 32
702 Out OAK 15 4 CLE CLE OAK Good P.Dawson 37
571 Out NE 1 2 OAK OAK NE Missed S.Janikowski 43
654 Out NYG 11 2 PHI NYG PHI Good J.Feely 26
307 Out DEN 14 2 BAL DEN BAL Good J.Elam 48
492 Out KC 13 3 DEN KC DEN Good L.Tynes 34
691 Out NYJ 17 3 BUF NYJ BUF Good M.Nugent 25
164 Out CHI 13 2 GB CHI GB Good R.Gould 25
80 Out BAL 1 2 IND IND BAL Good M.Vanderjagt 20
Here is an excerpt of data for field goal attempts in the 2008 NFL season, a dataframe i've called 'fg'. suppose that these 10 data points represent all of the field goals attempted in 2008; further suppose you want to know the the distance of the longest field goal attempted that year, who kicked it, and whether it was good or not; you also want to know the second-longest, as well as the third-longest, etc.; and finally you want the shortest field goal attempt.
Well, you could just do this:
sort(fg$Dist, decreasing=T)
which returns: 50 48 43 37 34 32 26 25 25 20
That is correct, but not very useful--it does tell us the distance of the longest field goal attempt, the second-longest,...as well as the shortest; however, but that's all we know--eg, we don't know who the kicker was, whether the attempt was successful, etc. Of course, we need the entire dataframe sorted on the "Dist" column (put another way, we want to sort all of the data rows on the single attribute Dist. that would look like this:
Stadium Home Week Qtr Away Off Def Result Kicker Dist
751 Out PHI 14 4 NYG PHI NYG Good D.Akers 50
307 Out DEN 14 2 BAL DEN BAL Good J.Elam 48
571 Out NE 1 2 OAK OAK NE Missed S.Janikowski 43
702 Out OAK 15 4 CLE CLE OAK Good P.Dawson 37
492 Out KC 13 3 DEN KC DEN Good L.Tynes 34
491 Out KC 9 1 OAK OAK KC Good S.Janikowski 32
654 Out NYG 11 2 PHI NYG PHI Good J.Feely 26
691 Out NYJ 17 3 BUF NYJ BUF Good M.Nugent 25
164 Out CHI 13 2 GB CHI GB Good R.Gould 25
80 Out BAL 1 2 IND IND BAL Good M.Vanderjagt 20
This is what order does. It is 'sort' for two-dimensional data; put another way, it returns a 1D integer index comprised of the row numbers such that sorting the rows according to that vector, would give you a correct row-oriented sort on the column, Dist
Here's how it works. Above, sort was used to sort the Dist column; to sort the entire dataframe on the Dist column, we use 'order' exactly the same way as 'sort' is used above:
ndx = order(fg$Dist, decreasing=T)
(i usually bind the array returned from 'order' to the variable 'ndx', which stands for 'index', because i am going to use it as an index array to sort.)
that was step 1, here's step 2:
'ndx', what is returned by 'sort' is then used as an index array to re-order the dataframe, 'fg':
fg_sorted = fg[ndx,]
fg_sorted is the re-ordered dataframe immediately above.
In sum, 'sort' is used to create an index array (which specifies the sort order of the column you want sorted), which then is used as an index array to re-order the dataframe (or matrix).
(I thought it might be helpful to lay out the ideas very simply here to summarize the good material posted by @doug, & linked by @duffymo; +1 to each,btw.)
?order tells you which element of the original vector needs to be put first, second, etc., so as to sort the original vector, whereas ?rank tell you which element has the lowest, second lowest, etc., value. For example:
> a <- c(45, 50, 10, 96)
> order(a)
[1] 3 1 2 4
> rank(a)
[1] 2 3 1 4
So order(a)
is saying, 'put the third element first when you sort... ', whereas rank(a)
is saying, 'the first element is the second lowest... '. (Note that they both agree on which element is lowest, etc.; they just present the information differently.) Thus we see that we can use order()
to sort, but we can't use rank()
that way:
> a[order(a)]
[1] 10 45 50 96
> sort(a)
[1] 10 45 50 96
> a[rank(a)]
[1] 50 10 45 96
In general, order()
will not equal rank()
unless the vector has been sorted already:
> b <- sort(a)
> order(b)==rank(b)
[1] TRUE TRUE TRUE TRUE
Also, since order()
is (essentially) operating over ranks of the data, you could compose them without affecting the information, but the other way around produces gibberish:
> order(rank(a))==order(a)
[1] TRUE TRUE TRUE TRUE
> rank(order(a))==rank(a)
[1] FALSE FALSE FALSE TRUE
链接地址: http://www.djcxy.com/p/24890.html
上一篇: 学习R.从哪里开始?
下一篇: 了解order()函数