合并期间与数据帧列相关的错误

下面的一段代码应该从指定的目录加载和准备数据集以进行进一步的数据分析。 问题是,在尝试合并数据期间,代码会生成以下错误 (每个合并选项一个)。 我对这里发生的事情感到困惑。 但是,我的直觉告诉我,错误可能是由于某些数据框中没有列名。 我希望澄清。 另外,请告知首选合并选项 (#1和#2之间)。 谢谢!

更新2 (以最小的可重现示例重写,删除以前的版本):

当前错误 (启用合并选项1):

Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

当前错误 (启用合并选项2):

Error in `[.data.frame`(x, rep.int(NA_integer_, nyy), nm.x, drop = FALSE) : undefined columns selected

必需软件包:无,除标准软件以外。

源代码(包括可重现的数据):

# load the datasets of transformed data

# real data
#dataSets <- loadDataSets(SRDA_DIR)

# reproducible example data
# (generated via `dput(lapply(dataSets, head))`, thanks to @MrFlick)

dataSets <- list(structure(list(`NA` = c("284", "284", "284", "284", "284", 
"284"), `NA` = c("490", "490", "490", "490", "490", "490")), .Names = c(NA_character_, 
NA_character_), SQL = structure("ClNFTEVDVCBnLmdyb3VwX2lkLCB1LnVzZXJfaWQKRlJPTSBzZjA1MTQuZ3JvdXBzIGcsIHNmMDUxNC51c2VycyB1LCBzZjA1MTQucHJvamVjdF9oaXN0b3J5IHBoLCBzZjA1MTQucHJvamVjdF90YXNrIHB0LCBzZjA1MTQucHJvamVjdF9ncm91cF9saXN0IHBnbApXSEVSRSBwaC5wcm9qZWN0X3Rhc2tfaWQgPSBwdC5wcm9qZWN0X3Rhc2tfaWQKQU5EIHB0Lmdyb3VwX3Byb2plY3RfaWQgPSBwZ2wuZ3JvdXBfcHJvamVjdF9pZApBTkQgZy5ncm91cF9pZCA9IHBnbC5ncm91cF9pZApBTkQgcGgubW9kX2J5ID0gdS51c2VyX2lk", class = "base64"), indicatorName = structure("Y29udHJpYlBlb3BsZQ==", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("85684", 
"172552", "228484", "173865", "94140", "179097"), Enabled = c("1", 
"1", "1", "1", "1", "1"), `Repo URL` = c("http://svn.tr51.org/svn/variomat/trunk/", 
"http://svn.hyperic.org/?root=Hyperic+SIGAR", "http://code.google.com/p/ufolder/source/browse", 
"https://svn.canoo.com/trunk/webtestclipse/", "http://www.rasilon.net/svn/sptools/trunk/sptools", 
"http://trac.pocoo.org/repos/pygments"), `Repo Instructions` = c("Login is currently disabled.", 
"For anonymous access, simply issue the command 'svn co http://svn.hyperic.org/projects/sigar'  For developer access, send email to sourceforge user &quot;hyperic&quot;.", 
"Anonymous browsing", "https://svn.canoo.com/trunk/webtestclipse/", 
"SVN stuff to go here.  If you just want a copy of the source, run svn co http://www.rasilon.net/svn/sptools/trunk/sptools", 
"The Subversion repository is at http://trac.pocoo.org/repos/pygments."
)), .Names = c("Project ID", "Enabled", "Repo URL", "Repo Instructions"
), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgZW5hYmxlZCwgdXJsX3ByaW1hcnksIGluc3RydWN0aW9uc19wdWJsaWMKRlJPTSBzZjA1MTQuZXh0ZXJuYWxfdG9vbF9saW5rcw==", class = "base64"), indicatorName = structure("ZGV2TGlua3M=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgRW5hYmxlZCwgUmVwbyBVUkwsIFJlcG8gSW5zdHJ1Y3Rpb25z", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`NA` = c("1343228", 
"230959", "1938195", "1883362", "404683", "650286"), `NA` = c("6", 
"6", "6", "6", "6", "6"), `NA` = c("http://sourceforge.net/p/aprpg/discussion", 
"http://sourceforge.net/project/memberlist.php?group_id=230959", 
"http://www.polishavenue.com", "http://sourceforge.net/p/wakemypc/tickets", 
"http://sourceforge.net/apps/trac/graphz/", "http://testando1"
)), .Names = c(NA_character_, NA_character_, NA_character_), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgcHJlZmVycmVkX3N1cHBvcnRfdHlwZSwgcHJlZmVycmVkX3N1cHBvcnRfcmVzb3VyY2UKRlJPTSBzZjA1MTQuZ3JvdXBzCldIRVJFIHByZWZlcnJlZF9zdXBwb3J0X3R5cGUgPSA2", class = "base64"), indicatorName = structure("ZGV2U3VwcG9ydA==", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("1692507", 
"1095949", "685064", "900864", "976917", "1949934"), `Development Team Size` = c(1, 
1, 1, 1, 1, 1)), .Names = c("Project ID", "Development Team Size"
), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgQ09VTlQodXNlcl9pZCkKRlJPTSBzZjA1MTQudXNlcl9ncm91cApXSEVSRSBncmFudGN2cyA9IDEKR1JPVVAgQlkgZ3JvdXBfaWQ=", class = "base64"), indicatorName = structure("ZGV2VGVhbVNpemU=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgRGV2ZWxvcG1lbnQgVGVhbSBTaXpl", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`NA` = c("1844416", 
"1849571", "1850512", "1850521", "1854556", "1855148"), `NA` = c("0", 
"0", "0", "0", "0", "0"), `NA` = c("1", "1", "1", "1", "1", "1"
)), .Names = c(NA_character_, NA_character_, NA_character_), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgdXNlX3dpa2ksIHVzZV9mb3J1bQpGUk9NIHNmMDUxNC5ncm91cHM=", class = "base64"), indicatorName = structure("ZG1Qcm9jZXNz", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("2107960", 
"2068039", "2156229", "2068032", "2068046", "2081469"), `Project Age` = c(5, 
6.5, 4, 6.5, 6.5, 6)), .Names = c("Project ID", "Project Age"
), row.names = c(NA, 6L), class = "data.frame"), structure(list(
    `Project ID` = c("708994", "1586967", "581072", "738614", 
    "758081", "782990"), `Project License` = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("", "afl", "apache", "apache2", 
    "artistic", "boostlicense", "bsd", "cddl", "eclipselicense", 
    "educom", "fair", "gpl", "ibm", "ibmcpl", "iosl", "jabber", 
    "lgpl", "mit", "mpl", "mpl11", "ms-rl", "nasalicense", "ncsa", 
    "nethack", "none", "nposl3", "osl", "other", "php", "php-license", 
    "psfl", "public", "publicdomain", "python", "qpl", "sissl", 
    "sunpublic", "website", "wxwindows", "zlib", "zope"), class = "factor"), 
    `License Restrictiveness` = structure(c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = c("Highly Restrictive", 
    "Permissive", "Restrictive", "Unknown"), class = "factor")), .Names = c("Project ID", 
"Project License", "License Restrictiveness"), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgbGljZW5zZQpGUk9NIHNmMDUxNC5ncm91cHM=", class = "base64"), indicatorName = structure("cHJqTGljZW5zZQ==", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgUHJvamVjdCBMaWNlbnNl", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("2", 
"3", "7", "11", "12", "14"), `Latest Release` = c("Snapshots", 
"7.5", "gedit 0.9.5", "r2-00", "0.9.7", "dhiggen_merge-5.0"), 
    `Project Maturity` = structure(c(NA, 3L, 1L, 3L, 1L, 3L), .Label = c("Alpha/Beta", 
    "Stable", "Mature"), class = "factor")), .Names = c("Project ID", 
"Latest Release", "Project Maturity"), SQL = structure("ClNFTEVDVCBmcC5ncm91cF9pZCwgTUFYKGZyLm5hbWUpCkZST00gc2YwNTE0LmZyc19wYWNrYWdlIGZwLCBzZjA1MTQuZnJzX3JlbGVhc2UgZnIsIHNmMDUxNC5mcnNfc3RhdHVzIGZzCldIRVJFIGZwLnBhY2thZ2VfaWQgPSBmci5wYWNrYWdlX2lkCkdST1VQIEJZIGZwLmdyb3VwX2lk", class = "base64"), indicatorName = structure("cHJqTWF0dXJpdHk=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgTGF0ZXN0IFJlbGVhc2U=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`NA` = c("1660372", 
"1590394", "1590772", "85777", "1591062", "1591181"), `NA` = c("0", 
"0", "0", "0", "0", "0")), .Names = c(NA_character_, NA_character_
), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgdXNlX3dpa2kKRlJPTSBzZjA1MTQuZ3JvdXBz", class = "base64"), indicatorName = structure("cHViUm9hZG1hcA==", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(ID = c("141", "66", 
"55", "45", "75", "80"), `Software Type` = c("Clustering", "Database", 
"Desktop", "Development", "Financial", "Games")), .Names = c("ID", 
"Software Type"), SQL = structure("ClNFTEVDVCB0cm92ZV9jYXRfaWQsIGRlc2NyaXB0aW9uCkZST00gc2YwNTE0LnRyb3ZlX2Zyb250cGFnZQ==", class = "base64"), indicatorName = structure("c29mdHdhcmVUeXBl", class = "base64"), resultNames = structure("SUQsIFNvZnR3YXJlIFR5cGU=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("142", 
"129", "120", "119", "107", "106"), `User Community Size` = c("153237", 
"3299", "135710", "16249", "6042", "2508")), .Names = c("Project ID", 
"User Community Size"), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgZG93bmxvYWRzCkZST00gc2YwNTE0LnN0YXRzX3Byb2plY3RfYWxs", class = "base64"), indicatorName = structure("dXNlckNvbW11bml0eVNpemU=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgVXNlciBDb21tdW5pdHkgU2l6ZQ==", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"))

# Merging Option 1

flossData <- data.frame(dataSets[[1]][1])

# merge all loaded datasets by common column ("Project ID")
silent <- lapply(seq(2, length(dataSets) - 1),
                 function(i) {merge(flossData, dataSets[[1]][i],
                                    by = "Project ID",
                                    all.y = TRUE)})

# Merging Option 2

#flossData <- Reduce(function(...) 
#  merge(..., by.x = "row.names", by.y = "Project ID", all = TRUE),
#  dataSets)

# Additional Transformations

# convert presence of Repo URL to integer
flossData[["Repo URL"]] <- as.integer(flossData[["Repo URL"]] != "")

# convert License Restrictiveness' factor levels to integers
#flossData[["License Restrictiveness"]] <- 
#  as.integer(flossData[["License Restrictiveness"]])

# convert User Community Size from character to integer
flossData[["User Community Size"]] <- 
  as.integer(flossData[["User Community Size"]])

# remove NAs
#flossData <- flossData[complete.cases(flossData[,3]),]
rowsNA <- apply(flossData, 1, function(x) {any(is.na(x))})
flossData <- flossData[!rowsNA,]

环境:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plspm_0.4.1   turner_0.1.7  tester_0.1.7  diagram_1.6.2 shape_1.4.1   amap_0.8-12  

loaded via a namespace (and not attached):
[1] tools_3.1.1

更新3:

尝试合并数据帧,使用重塑包( reshape::merge_all(dataSets) )导致以下错误消息: Error: cannot allocate vector of size 332.8 Gb 。 这很奇怪,考虑到存储在该目录中并被合并的R对象的总大小仅为4.3 MB。

试图合并数据脂肪酸甲酯,使用plyr封装( plyr::join_all(dataSets) ),导致以下错误消息: Error in ``[.data.frame``(x, by) : undefined columns selected 。 这似乎与合并选项2中的错误消息相匹配。


lapply(dataSets, names)
pids = which(sapply(dataSets, FUN=function(x) { 'Project ID' %in% names(x) }))

acc = dataSets[[pids[1]]]

for (id in pids[2:length(pids)]) {
  acc = merge(acc, dataSets[[id]], by='Project ID', all=T)
}

我不得不作出的假设:

  • 并非来自dataSet所有数据集都有Project ID列,因此假定您只需加入那些拥有它的人。 所以首先我找到它们并将它们的索引放到pids
  • 因为有数据集没有任何共同的项目ID,我假设你想在merge执行外部联接 - all=T标志
  • 关于您使用何种方式进行连接的问题,我想说在R中执行此操作并不是一个好主意。在将数据加载到R之前,我会尽可能使用RDBMS进行连接和其他操作

    链接地址: http://www.djcxy.com/p/24797.html

    上一篇: Errors related to data frame columns during merging

    下一篇: Raise exception in de