Separated Column to Columns with Booleans

2018-07-04 13:42:05

I have the following comma-separated data in one of my data.frame's columns, called services.

> dput(structure(df$services[1:5]))
list("Global Expense Management, Company Privacy Policy", "Removal Services, Global Expense Management", 
    "Removal Services, Exception &amp; Cost Admin, Global Cost Estimate, Company Privacy Policy", 
    "Removal Services, Exception &amp; Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy", 
    "Global Expense Management, Company Privacy Policy")

I would like to transform this data into separate columns in my dataframe and if the row contains the service, then set TRUE under that service's column. Otherwise, set the value as FALSE.

For example, I would like my dataframe to look like this:

GlobalExpenseManagement    |    CompanyPrivacyPolicy   |   etc...
TRUE                            TRUE
TRUE                            FALSE
FALSE                           TRUE

I assume I would have to split out the comma-separated values, group them to remove duplicates, then add them as names(df) to my dataframe. However, I don't know how to iterate over the dataset and set TRUE/FALSE depending on whether the row contains that service.

Does anyone have any good ideas of how to do this?

Edit: Combining the data back

I am now trying to combine the new matrix with my existing dataframe to replace the services with their new column counterparts. I have tried this based on @plafort's great answer below:

names(df) <- headnames
rbind(mat, df)

However, I get this error:

Error in names(df) <- headnames : 'names' attribute [178] must be the same length as the vector [7]

I have also tried this:

final <- data.frame(cbind(mat, df))

But it seems to be missing the columns from df. How can I combine the columns from mat with df?

I would consider cSplit_e from my "splitstackshape" package. The result is binary ("1" and "0") instead of TRUE and FALSE, but that should be easy to convert.

Sample data:

df <- data.frame(services = I(
  list("Global Expense Management, Company Privacy Policy", "Removal Services, Global Expense Management", 
       "Removal Services, Exception &amp; Cost Admin, Global Cost Estimate, Company Privacy Policy", 
       "Removal Services, Exception &amp; Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy", 
       "Global Expense Management, Company Privacy Policy")))

Convert the "services" column to a vector instead of a list:

df$services <- unlist(df$services)

Now split it up:

library(splitstackshape)
cSplit_e(df, "services", ",", type = "character", fill = 0)
##                                                                                                                                                  services
## 1                                                                                                       Global Expense Management, Company Privacy Policy
## 2                                                                                                             Removal Services, Global Expense Management
## 3                                                              Removal Services, Exception &amp; Cost Admin, Global Cost Estimate, Company Privacy Policy
## 4 Removal Services, Exception &amp; Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy
## 5                                                                                                       Global Expense Management, Company Privacy Policy
##   services_Ancillary Services services_Company Privacy Policy services_Exception &amp; Cost Admin
## 1                           0                               1                                   0
## 2                           0                               0                                   0
## 3                           0                               1                                   1
## 4                           1                               1                                   1
## 5                           0                               1                                   0
##   services_Global Cost Estimate services_Global Expense Management services_Perm Storage
## 1                             0                                  1                     0
## 2                             0                                  1                     0
## 3                             1                                  0                     0
## 4                             1                                  1                     1
## 5                             0                                  1                     0
##   services_Removal Services
## 1                         0
## 2                         1
## 3                         1
## 4                         1
## 5                         0
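
Converting the 1/0 indicator columns to TRUE/FALSE could look like the sketch below. It assumes the cSplit_e result has been stored in a variable (the name out is hypothetical) and that every indicator column carries the services_ prefix shown above. A small hand-built stand-in with two of those columns is used so the snippet runs without the package.

```r
# Stand-in for a cSplit_e() result (hypothetical name `out`), built by hand
# with two of the indicator columns from the output above.
out <- data.frame(
  services = c("Global Expense Management, Company Privacy Policy",
               "Removal Services, Global Expense Management"),
  stringsAsFactors = FALSE
)
out[["services_Company Privacy Policy"]] <- c(1, 0)
out[["services_Removal Services"]] <- c(0, 1)

# Pick the indicator columns by their "services_" prefix.
flag_cols <- grep("^services_", names(out), value = TRUE)

# Compare each indicator column with 1 to get TRUE/FALSE.
out[flag_cols] <- lapply(out[flag_cols], function(x) x == 1)

str(out)
```

Selecting the columns by prefix leaves the original services column untouched.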

Try:

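# lst is assumed to be the list of comma-separated strings from the question, e.g. lst <- df$services
splitup <- sapply(unlist(lst), strsplit, ', ')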
headnames <- unique(unlist(splitup))
(mat <- t(unname(sapply(splitup, function(x) headnames %in% x))))

      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
[1,]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[2,]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
[3,] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[4,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[5,]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

We start by splitting the data on ', ', using unlist so that strsplit sees the elements directly. headnames then does what you describe: it collects the unique category headings. The last line checks, for each list item, which headings it contains, removes the automatic naming with unname, and transposes the result with t so that each row is one observation.
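
To make the matching step concrete, here is a minimal sketch of what %in% returns for a single row. It uses the first two rows of the sample data, already split; the name row2 is only for illustration.

```r
# First two rows of the sample data, already split on ", ".
splitup <- list(
  c("Global Expense Management", "Company Privacy Policy"),
  c("Removal Services", "Global Expense Management")
)

# Unique service names, in order of first appearance.
headnames <- unique(unlist(splitup))

# For one row, %in% returns one logical value per heading.
row2 <- headnames %in% splitup[[2]]
names(row2) <- headnames
print(row2)
```

Each element of the result says whether the corresponding heading occurs in that row, which is why the column order of the final matrix follows headnames.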

To add the names on top we assign the unique names that were previously defined as column headings using the function colnames . The order works out correctly because this is the same headnames vector that was used to make the row observations.

colnames(mat) <- headnames

     Global Expense Management Company Privacy Policy
[1,]                      TRUE                   TRUE
[2,]                      TRUE                  FALSE
[3,]                     FALSE                   TRUE
[4,]                      TRUE                   TRUE
[5,]                      TRUE                   TRUE...
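
As for the edit in the question about combining mat with the existing data frame, one possible approach is cbind(df, mat). names(df) <- headnames replaces the existing column names rather than adding columns, so it fails whenever the two lengths differ, and rbind tries to stack rows rather than add columns. The sketch below is self-contained and uses a hypothetical id column to stand in for the other columns of df.

```r
# Hypothetical data frame: `id` stands in for the other columns of df.
# The services values are rows 1, 2 and 5 of the sample data.
df <- data.frame(
  id = 1:3,
  services = c("Global Expense Management, Company Privacy Policy",
               "Removal Services, Global Expense Management",
               "Global Expense Management, Company Privacy Policy"),
  stringsAsFactors = FALSE
)

# Build the logical matrix as in the answer above.
splitup <- strsplit(df$services, ", ", fixed = TRUE)
headnames <- unique(unlist(splitup))
mat <- t(sapply(splitup, function(x) headnames %in% x))
colnames(mat) <- headnames

# cbind() adds the matrix columns next to the existing ones.
final <- cbind(df, mat)

# Optionally drop the original comma-separated column.
final$services <- NULL
str(final)
```

Because the column names contain spaces, they need backticks or [[ ]] to be accessed, for example final[["Removal Services"]].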
