Errors related to data frame columns during merging
The following piece of code is supposed to load and prepare datasets from a specified directory for further data analysis. The problem is that the code generates the following errors during attempts to merge data (one for each merging option). I'm confused about what is going on here. However, my gut feeling tells me that the errors might be due to the absence of column names in some data frames. I would appreciate a clarification. Also, please advise on the preferred merging option (between #1 and #2). Thank you!
UPDATE 2 (Reworked with minimal reproducible example, previous versions removed):
Current error (Merging Option 1 enabled):
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
Current error (Merging Option 2 enabled):
Error in `[.data.frame`(x, rep.int(NA_integer_, nyy), nm.x, drop = FALSE) : undefined columns selected
Required Packages: none, other than standard ones.
Source Code (includes reproducible data):
# load the datasets of transformed data
# real data
#dataSets <- loadDataSets(SRDA_DIR)
# reproducible example data
# (generated via `dput(lapply(dataSets, head))`, thanks to @MrFlick)
dataSets <- list(structure(list(`NA` = c("284", "284", "284", "284", "284",
"284"), `NA` = c("490", "490", "490", "490", "490", "490")), .Names = c(NA_character_,
NA_character_), SQL = structure("ClNFTEVDVCBnLmdyb3VwX2lkLCB1LnVzZXJfaWQKRlJPTSBzZjA1MTQuZ3JvdXBzIGcsIHNmMDUxNC51c2VycyB1LCBzZjA1MTQucHJvamVjdF9oaXN0b3J5IHBoLCBzZjA1MTQucHJvamVjdF90YXNrIHB0LCBzZjA1MTQucHJvamVjdF9ncm91cF9saXN0IHBnbApXSEVSRSBwaC5wcm9qZWN0X3Rhc2tfaWQgPSBwdC5wcm9qZWN0X3Rhc2tfaWQKQU5EIHB0Lmdyb3VwX3Byb2plY3RfaWQgPSBwZ2wuZ3JvdXBfcHJvamVjdF9pZApBTkQgZy5ncm91cF9pZCA9IHBnbC5ncm91cF9pZApBTkQgcGgubW9kX2J5ID0gdS51c2VyX2lk", class = "base64"), indicatorName = structure("Y29udHJpYlBlb3BsZQ==", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA,
6L), class = "data.frame"), structure(list(`Project ID` = c("85684",
"172552", "228484", "173865", "94140", "179097"), Enabled = c("1",
"1", "1", "1", "1", "1"), `Repo URL` = c("http://svn.tr51.org/svn/variomat/trunk/",
"http://svn.hyperic.org/?root=Hyperic+SIGAR", "http://code.google.com/p/ufolder/source/browse",
"https://svn.canoo.com/trunk/webtestclipse/", "http://www.rasilon.net/svn/sptools/trunk/sptools",
"http://trac.pocoo.org/repos/pygments"), `Repo Instructions` = c("Login is currently disabled.",
"For anonymous access, simply issue the command 'svn co http://svn.hyperic.org/projects/sigar' For developer access, send email to sourceforge user "hyperic".",
"Anonymous browsing", "https://svn.canoo.com/trunk/webtestclipse/",
"SVN stuff to go here. If you just want a copy of the source, run svn co http://www.rasilon.net/svn/sptools/trunk/sptools",
"The Subversion repository is at http://trac.pocoo.org/repos/pygments."
)), .Names = c("Project ID", "Enabled", "Repo URL", "Repo Instructions"
), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgZW5hYmxlZCwgdXJsX3ByaW1hcnksIGluc3RydWN0aW9uc19wdWJsaWMKRlJPTSBzZjA1MTQuZXh0ZXJuYWxfdG9vbF9saW5rcw==", class = "base64"), indicatorName = structure("ZGV2TGlua3M=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgRW5hYmxlZCwgUmVwbyBVUkwsIFJlcG8gSW5zdHJ1Y3Rpb25z", class = "base64"), row.names = c(NA,
6L), class = "data.frame"), structure(list(`NA` = c("1343228",
"230959", "1938195", "1883362", "404683", "650286"), `NA` = c("6",
"6", "6", "6", "6", "6"), `NA` = c("http://sourceforge.net/p/aprpg/discussion",
"http://sourceforge.net/project/memberlist.php?group_id=230959",
"http://www.polishavenue.com", "http://sourceforge.net/p/wakemypc/tickets",
"http://sourceforge.net/apps/trac/graphz/", "http://testando1"
)), .Names = c(NA_character_, NA_character_, NA_character_), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgcHJlZmVycmVkX3N1cHBvcnRfdHlwZSwgcHJlZmVycmVkX3N1cHBvcnRfcmVzb3VyY2UKRlJPTSBzZjA1MTQuZ3JvdXBzCldIRVJFIHByZWZlcnJlZF9zdXBwb3J0X3R5cGUgPSA2", class = "base64"), indicatorName = structure("ZGV2U3VwcG9ydA==", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA,
6L), class = "data.frame"), structure(list(`Project ID` = c("1692507",
"1095949", "685064", "900864", "976917", "1949934"), `Development Team Size` = c(1,
1, 1, 1, 1, 1)), .Names = c("Project ID", "Development Team Size"
), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgQ09VTlQodXNlcl9pZCkKRlJPTSBzZjA1MTQudXNlcl9ncm91cApXSEVSRSBncmFudGN2cyA9IDEKR1JPVVAgQlkgZ3JvdXBfaWQ=", class = "base64"), indicatorName = structure("ZGV2VGVhbVNpemU=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgRGV2ZWxvcG1lbnQgVGVhbSBTaXpl", class = "base64"), row.names = c(NA,
6L), class = "data.frame"), structure(list(`NA` = c("1844416",
"1849571", "1850512", "1850521", "1854556", "1855148"), `NA` = c("0",
"0", "0", "0", "0", "0"), `NA` = c("1", "1", "1", "1", "1", "1"
)), .Names = c(NA_character_, NA_character_, NA_character_), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgdXNlX3dpa2ksIHVzZV9mb3J1bQpGUk9NIHNmMDUxNC5ncm91cHM=", class = "base64"), indicatorName = structure("ZG1Qcm9jZXNz", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA,
6L), class = "data.frame"), structure(list(`Project ID` = c("2107960",
"2068039", "2156229", "2068032", "2068046", "2081469"), `Project Age` = c(5,
6.5, 4, 6.5, 6.5, 6)), .Names = c("Project ID", "Project Age"
), row.names = c(NA, 6L), class = "data.frame"), structure(list(
`Project ID` = c("708994", "1586967", "581072", "738614",
"758081", "782990"), `Project License` = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("", "afl", "apache", "apache2",
"artistic", "boostlicense", "bsd", "cddl", "eclipselicense",
"educom", "fair", "gpl", "ibm", "ibmcpl", "iosl", "jabber",
"lgpl", "mit", "mpl", "mpl11", "ms-rl", "nasalicense", "ncsa",
"nethack", "none", "nposl3", "osl", "other", "php", "php-license",
"psfl", "public", "publicdomain", "python", "qpl", "sissl",
"sunpublic", "website", "wxwindows", "zlib", "zope"), class = "factor"),
`License Restrictiveness` = structure(c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = c("Highly Restrictive",
"Permissive", "Restrictive", "Unknown"), class = "factor")), .Names = c("Project ID",
"Project License", "License Restrictiveness"), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgbGljZW5zZQpGUk9NIHNmMDUxNC5ncm91cHM=", class = "base64"), indicatorName = structure("cHJqTGljZW5zZQ==", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgUHJvamVjdCBMaWNlbnNl", class = "base64"), row.names = c(NA,
6L), class = "data.frame"), structure(list(`Project ID` = c("2",
"3", "7", "11", "12", "14"), `Latest Release` = c("Snapshots",
"7.5", "gedit 0.9.5", "r2-00", "0.9.7", "dhiggen_merge-5.0"),
`Project Maturity` = structure(c(NA, 3L, 1L, 3L, 1L, 3L), .Label = c("Alpha/Beta",
"Stable", "Mature"), class = "factor")), .Names = c("Project ID",
"Latest Release", "Project Maturity"), SQL = structure("ClNFTEVDVCBmcC5ncm91cF9pZCwgTUFYKGZyLm5hbWUpCkZST00gc2YwNTE0LmZyc19wYWNrYWdlIGZwLCBzZjA1MTQuZnJzX3JlbGVhc2UgZnIsIHNmMDUxNC5mcnNfc3RhdHVzIGZzCldIRVJFIGZwLnBhY2thZ2VfaWQgPSBmci5wYWNrYWdlX2lkCkdST1VQIEJZIGZwLmdyb3VwX2lk", class = "base64"), indicatorName = structure("cHJqTWF0dXJpdHk=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgTGF0ZXN0IFJlbGVhc2U=", class = "base64"), row.names = c(NA,
6L), class = "data.frame"), structure(list(`NA` = c("1660372",
"1590394", "1590772", "85777", "1591062", "1591181"), `NA` = c("0",
"0", "0", "0", "0", "0")), .Names = c(NA_character_, NA_character_
), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgdXNlX3dpa2kKRlJPTSBzZjA1MTQuZ3JvdXBz", class = "base64"), indicatorName = structure("cHViUm9hZG1hcA==", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA,
6L), class = "data.frame"), structure(list(ID = c("141", "66",
"55", "45", "75", "80"), `Software Type` = c("Clustering", "Database",
"Desktop", "Development", "Financial", "Games")), .Names = c("ID",
"Software Type"), SQL = structure("ClNFTEVDVCB0cm92ZV9jYXRfaWQsIGRlc2NyaXB0aW9uCkZST00gc2YwNTE0LnRyb3ZlX2Zyb250cGFnZQ==", class = "base64"), indicatorName = structure("c29mdHdhcmVUeXBl", class = "base64"), resultNames = structure("SUQsIFNvZnR3YXJlIFR5cGU=", class = "base64"), row.names = c(NA,
6L), class = "data.frame"), structure(list(`Project ID` = c("142",
"129", "120", "119", "107", "106"), `User Community Size` = c("153237",
"3299", "135710", "16249", "6042", "2508")), .Names = c("Project ID",
"User Community Size"), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgZG93bmxvYWRzCkZST00gc2YwNTE0LnN0YXRzX3Byb2plY3RfYWxs", class = "base64"), indicatorName = structure("dXNlckNvbW11bml0eVNpemU=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgVXNlciBDb21tdW5pdHkgU2l6ZQ==", class = "base64"), row.names = c(NA,
6L), class = "data.frame"))
# Merging Option 1
flossData <- data.frame(dataSets[[1]][1])
# merge all loaded datasets by common column ("Project ID")
silent <- lapply(seq(2, length(dataSets) - 1),
function(i) {merge(flossData, dataSets[[1]][i],
by = "Project ID",
all.y = TRUE)})
# Merging Option 2
#flossData <- Reduce(function(...)
# merge(..., by.x = "row.names", by.y = "Project ID", all = TRUE),
# dataSets)
# Additional Transformations
# convert presence of Repo URL to integer
flossData[["Repo URL"]] <- as.integer(flossData[["Repo URL"]] != "")
# convert License Restrictiveness' factor levels to integers
#flossData[["License Restrictiveness"]] <-
# as.integer(flossData[["License Restrictiveness"]])
# convert User Community Size from character to integer
flossData[["User Community Size"]] <-
as.integer(flossData[["User Community Size"]])
# remove NAs
#flossData <- flossData[complete.cases(flossData[,3]),]
rowsNA <- apply(flossData, 1, function(x) {any(is.na(x))})
flossData <- flossData[!rowsNA,]
Environment:
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C
[4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=C
[7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plspm_0.4.1 turner_0.1.7 tester_0.1.7 diagram_1.6.2 shape_1.4.1 amap_0.8-12
loaded via a namespace (and not attached):
[1] tools_3.1.1
UPDATE 3:
An attempt to merge data frames, using reshape package ( reshape::merge_all(dataSets)
), resulted in the following error message: Error: cannot allocate vector of size 332.8 Gb
. This is quite strange, considering that the total size of R objects stored in that directory and being merged is only 4.3 MB.
An attempt to merge data fames, using plyr package ( plyr::join_all(dataSets)
), resulted in the following error message: Error in ``[.data.frame``(x, by) : undefined columns selected
. This seems to match the error message in the Merging Option 2.
lapply(dataSets, names)
pids = which(sapply(dataSets, FUN=function(x) { 'Project ID' %in% names(x) }))
acc = dataSets[[pids[1]]]
for (id in pids[2:length(pids)]) {
acc = merge(acc, dataSets[[id]], by='Project ID', all=T)
}
Assumptions I had to make:
dataSet
have Project ID
column, so assumed you need to join only those which have it. So first I find them and put their indexes to pids
all=T
flag in merge
Regarding your question about what way of joining to use, I'd say it's not a good idea to do this in R. I'd use an RDBMS whenever possible to do joins and other things before loading data to R
链接地址: http://www.djcxy.com/p/24798.html上一篇: 如何加入(合并)数据框架(内部,外部,左侧,右侧)?
下一篇: 合并期间与数据帧列相关的错误