Convert row data to binary columns
I am trying to expand a column of data into many binary columns, to eventually use for association rule mining. I have had some success using a for loop and a simple triplet matrix, but I am unsure how to then aggregate by the levels in the first column, similar to a GROUP BY statement in SQL. I have provided an example below with a much smaller data set; my actual data set will be 4,200 rows by 3,902 columns, so any solution needs to be scalable. Any suggestions or alternative approaches would be greatly appreciated!
> data <- data.frame(a=c('sally','george','andy','sue','sue','sally','george'), b=c('green','yellow','green','yellow','purple','brown','purple'), stringsAsFactors=TRUE)
> data
a b
1 sally green
2 george yellow
3 andy green
4 sue yellow
5 sue purple
6 sally brown
7 george purple
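library(slam)  # provides simple_triplet_matrix()
## The columns must be factors: as.numeric() and levels() are applied to them
x <- data[,1]
for(i in 2:ncol(data))
  x <- cbind(x, simple_triplet_matrix(i = 1:nrow(data), j = as.numeric(data[,i]),
    v = rep(1, nrow(data)), dimnames = list(NULL, levels(data[,i]))))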
##Looks like this:
> as.matrix(x)
name brown green purple yellow
[1,] "sally" "0" "1" "0" "0"
[2,] "george" "0" "0" "0" "1"
[3,] "andy" "0" "1" "0" "0"
[4,] "sue" "0" "0" "0" "1"
[5,] "sue" "0" "0" "1" "0"
[6,] "sally" "1" "0" "0" "0"
[7,] "george" "0" "0" "1" "0" ##Need to aggregate by name
##Would like it to look like this:
name brown green purple yellow
[1,] "sally" "1" "1" "0" "0"
[2,] "george" "0" "0" "1" "1"
[3,] "andy" "0" "1" "0" "0"
[4,] "sue" "0" "0" "1" "1"
This should do it:
## Get a contingency table of counts
X <- with(data, table(a,b))
## Massage it into the format you're wanting
cbind(name = rownames(X), apply(X, 2, as.character))
# name brown green purple yellow
# [1,] "andy" "0" "1" "0" "0"
# [2,] "george" "0" "0" "1" "1"
# [3,] "sally" "1" "1" "0" "0"
# [4,] "sue" "0" "0" "1" "1"
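One caveat with the contingency table: table() returns counts, so a name/value pair that appears more than once would show up as 2 or more rather than 1. The following is a minimal sketch, assuming strict 0/1 flags are wanted and that rows should follow the order in which names first appear (as in the question's desired output). The variable names `counts`, `binary` and `by_first_seen` are illustrative only, not from the original answer.

```r
# Same example data as in the question
data <- data.frame(
  a = c('sally', 'george', 'andy', 'sue', 'sue', 'sally', 'george'),
  b = c('green', 'yellow', 'green', 'yellow', 'purple', 'brown', 'purple')
)

# Contingency table of counts, stripped down to a plain integer matrix
counts <- unclass(with(data, table(a, b)))

# Collapse counts to 0/1 flags: any count above zero becomes 1
binary <- (counts > 0) * 1L

# Optional: reorder rows by first appearance of each name in the data
by_first_seen <- binary[unique(as.character(data$a)), ]
by_first_seen
```

For the full-size data, a dense 4,200 by 3,902 integer matrix should still fit comfortably in memory, but if a sparse result is preferred, xtabs(~ a + b, data, sparse = TRUE) returns a sparse matrix from the Matrix package instead.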