Separated Column to Columns with Booleans
I have the following comma-separated data in one of my data.frame's columns, called services.
> dput(structure(df$services[1:5]))
list("Global Expense Management, Company Privacy Policy", "Removal Services, Global Expense Management",
"Removal Services, Exception & Cost Admin, Global Cost Estimate, Company Privacy Policy",
"Removal Services, Exception & Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy",
"Global Expense Management, Company Privacy Policy")
I would like to transform this data into separate columns in my dataframe and if the row contains the service, then set TRUE under that service's column. Otherwise, set the value as FALSE.
For example, I would like my dataframe to look like this:
GlobalExpenseManagement | CompanyPrivacyPolicy | etc...
TRUE TRUE
TRUE FALSE
FALSE TRUE
I assume I would have to split out the comma-separated values, group them to remove duplicates, then add them as names(df) to my dataframe. However, I don't know how to iterate over the dataset and set TRUE/FALSE depending on whether the row contains that service.
Does anyone have any good ideas of how to do this?
Edit: Combining the data back
I am now trying to combine the new matrix with my existing dataframe to replace the services with their new column counterparts. I have tried this based on @plafort's great answer below:
names(df) <- headnames
rbind(mat, df)
However, I get this error:
Error in names(df) <- headnames : 'names' attribute [178] must be the same length as the vector [7]
I have also tried this:
final <- data.frame(cbind(mat, df))
But it seems to be missing the columns from df. How can I combine the columns from mat to df?
I would consider cSplit_e from my "splitstackshape" package. The result is a binary "1" and "0" instead of TRUE and FALSE, but that should be easy to convert.
Sample data:
df <- data.frame(services = I(
list("Global Expense Management, Company Privacy Policy", "Removal Services, Global Expense Management",
"Removal Services, Exception & Cost Admin, Global Cost Estimate, Company Privacy Policy",
"Removal Services, Exception & Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy",
"Global Expense Management, Company Privacy Policy")))
Convert the "services" column to a vector instead of a list:
df$services <- unlist(df$services)
Now split it up:
library(splitstackshape)
cSplit_e(df, "services", ",", type = "character", fill = 0)
## services
## 1 Global Expense Management, Company Privacy Policy
## 2 Removal Services, Global Expense Management
## 3 Removal Services, Exception & Cost Admin, Global Cost Estimate, Company Privacy Policy
## 4 Removal Services, Exception & Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy
## 5 Global Expense Management, Company Privacy Policy
## services_Ancillary Services services_Company Privacy Policy services_Exception & Cost Admin
## 1 0 1 0
## 2 0 0 0
## 3 0 1 1
## 4 1 1 1
## 5 0 1 0
## services_Global Cost Estimate services_Global Expense Management services_Perm Storage
## 1 0 1 0
## 2 0 1 0
## 3 1 0 0
## 4 1 1 1
## 5 0 1 0
## services_Removal Services
## 1 0
## 2 1
## 3 1
## 4 1
## 5 0
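As for converting the 1/0 indicators to TRUE/FALSE, one possible sketch in base R follows. The data frame `out` here is a small hypothetical stand-in for the cSplit_e() result (it is not the real output), and the code assumes the indicator columns are the ones whose names start with "services_":

```r
# Hypothetical stand-in for a cSplit_e() result: the original text column
# plus numeric 0/1 indicator columns prefixed with "services_".
out <- data.frame(
  services   = c("A, B", "B"),
  services_A = c(1, 0),
  services_B = c(1, 1),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Find the indicator columns by their prefix (an assumption about the naming)
ind <- grep("^services_", names(out))

# as.logical() maps 1 to TRUE and 0 to FALSE
out[ind] <- lapply(out[ind], as.logical)
```

The original "services" column is left untouched, so it can be dropped or kept as needed.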
Try:
splitup <- sapply(unlist(df$services), strsplit, ', ')
headnames <- unique(unlist(splitup))
(mat <- t(unname(sapply(splitup, function(x) headnames %in% x))))
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[2,] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
[3,] FALSE TRUE TRUE TRUE TRUE FALSE FALSE
[4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[5,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
We start by splitting up the data by comma and use unlist to access the elements directly. headnames does as you mention: it looks for the unique category headings. The last line first matches the heading categories with each list item, then removes the automatic naming with unname and transposes the data back to how we'd like with t.
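The matching step is easiest to see on a single row. This is only an illustrative sketch: the three headings and the `row_services` vector are made-up inputs, not the full data from the question.

```r
# Illustrative inputs (shortened; not the full set of services)
headnames <- c("Global Expense Management", "Company Privacy Policy", "Removal Services")
row_services <- c("Removal Services", "Global Expense Management")

# %in% returns one logical per heading, in the order of headnames,
# so every row yields a vector that lines up with the same columns
flags <- headnames %in% row_services
```

Because the left-hand side is always headnames, the position of each TRUE/FALSE is fixed regardless of the order in which the services appear in the row.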
To add the names on top, we assign the unique names that were previously defined as column headings using the function colnames. The order works out correctly because this is the same headnames vector that was used to make the row observations.
colnames(mat) <- headnames
Global Expense Management Company Privacy Policy
[1,] TRUE TRUE
[2,] TRUE FALSE
[3,] FALSE TRUE
[4,] TRUE TRUE
[5,] TRUE TRUE...
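Regarding the edit about combining the matrix with the existing data frame: the error comes from assigning headnames to names(df), which renames the data frame's own columns rather than adding new ones. A minimal sketch of one way to attach the columns instead is below. It assumes the rows of mat are in the same order as the rows of df; the `id` column and the short service names are hypothetical.

```r
# Hypothetical data frame with one extra column besides services
df <- data.frame(
  id       = 1:3,
  services = c("A, B", "B", "A, C"),
  stringsAsFactors = FALSE
)

# Build the logical matrix the same way as above
splitup   <- strsplit(df$services, ", ", fixed = TRUE)
headnames <- unique(unlist(splitup))
mat <- t(sapply(splitup, function(x) headnames %in% x))
colnames(mat) <- headnames

# Drop the original services column and bind the new columns side by side.
# cbind() with a data frame as first argument returns a data frame.
final <- cbind(df[setdiff(names(df), "services")], mat)
```

cbind is used rather than rbind because the goal is to add columns for the same rows, not to stack more rows.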