How to join (merge) data frames (inner, outer, left, right)?

2018-05-30 10:50:05

Given two data frames:

df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

df1
#  CustomerId Product
#           1 Toaster
#           2 Toaster
#           3 Toaster
#           4   Radio
#           5   Radio
#           6   Radio

df2
#  CustomerId   State
#           2 Alabama
#           4 Alabama
#           6    Ohio

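How can I do database-style, i.e. SQL-style, joins? That is, how do I get:

An inner join of df1 and df2:
Return only the rows in which the left table has matching keys in the right table.

An outer join of df1 and df2:
Return all rows from both tables, joining records from the left table that have matching keys in the right table.

A left outer join (or simply left join) of df1 and df2:
Return all rows from the left table, and any rows with matching keys from the right table.

A right outer join of df1 and df2:
Return all rows from the right table, and any rows with matching keys from the left table.

Extra credit:

How can I do a SQL-style select statement?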

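By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you are matching only on the fields you want. You can also use the by.x and by.y parameters if the matching variables have different names in the two data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to pass "CustomerId" to R explicitly as the matching variable. I think it is almost always best to state explicitly the identifiers on which you want to merge; it is safer if the input data frames change unexpectedly, and easier to read later on.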
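As a quick check of the calls above, here is a minimal sketch that runs each merge variant on the sample data from the question and compares row counts. The result variable names (inner, outer, left, right, cross) are only illustrative.

```r
# Sample data from the question
df1 <- data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

inner <- merge(df1, df2, by = "CustomerId")                # only keys present in both
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # every key from either side
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # every row of df1
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # every row of df2
cross <- merge(df1, df2, by = NULL)                        # every pairing of rows

# Row counts per join type
sapply(list(inner = inner, outer = outer, left = left,
            right = right, cross = cross), nrow)

# Rows of df1 without a match in df2 get NA in the State column
left[is.na(left$State), ]
```

With this data the inner and right joins return 3 rows, the outer and left joins return 6, and the cross join returns 6 * 3 = 18.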

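You can merge on multiple columns by giving by a vector, e.g. by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2", where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)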
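To make the by, by.x and by.y variants concrete, here is a small sketch. The data frames orders, customers, a and b, and the columns OrderId, Qty and Paid, are hypothetical and exist only for this illustration.

```r
# Hypothetical data frames whose key columns have different names
orders    <- data.frame(CustomerId_in_df1 = c(1, 2, 3),
                        Product = c("Toaster", "Radio", "Radio"))
customers <- data.frame(CustomerId_in_df2 = c(2, 3, 4),
                        State = c("Alabama", "Ohio", "Ohio"))

# Inner join on differently named key columns
merged <- merge(orders, customers,
                by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2")
names(merged)  # the key column keeps the name given in by.x

# Hypothetical data frames sharing a two-column key
a <- data.frame(CustomerId = c(1, 1, 2), OrderId = c(10, 11, 10), Qty = c(5, 1, 2))
b <- data.frame(CustomerId = c(1, 2, 2), OrderId = c(11, 10, 12),
                Paid = c(TRUE, FALSE, TRUE))

# Rows match only when both CustomerId and OrderId agree
both <- merge(a, b, by = c("CustomerId", "OrderId"))
both
```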

I would recommend checking out Gabor Grothendieck's sqldf package, which allows you to express these operations in SQL.

library(sqldf)

## inner join
df3 <- sqldf("SELECT CustomerId, Product, State
              FROM df1
              JOIN df2 USING(CustomerId)")

## left join (substitute 'right' for a right join, provided the
## underlying SQLite version supports RIGHT JOIN)
df4 <- sqldf("SELECT CustomerId, Product, State
              FROM df1
              LEFT JOIN df2 USING(CustomerId)")

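I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).

See Gabor's sqldf GitHub for more information on joins.

There is also the data.table approach for an inner join, which is very time- and memory-efficient (and necessary for some larger data frames):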

library(data.table)

dt1 <- data.table(df1, key = "CustomerId") 
dt2 <- data.table(df2, key = "CustomerId")

joined.dt1.dt.2 <- dt1[dt2]

merge also works on data.tables (as it is generic and calls merge.data.table):

merge(dt1, dt2)
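One point worth noting about the X[Y] form: by default it returns one row for every row of Y, filling NA where X has no match, so it only coincides with an inner join when every key in Y also exists in X (as it does in the sample data). The sketch below assumes a hypothetical extra customer 7 in the second table to show the difference, and uses the on = argument instead of setting keys. nomatch = NULL needs a reasonably recent data.table; older releases spell it nomatch = 0.

```r
library(data.table)

dt1 <- data.table(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
# Customer 7 is hypothetical: it has no counterpart in dt1
dt2x <- data.table(CustomerId = c(2L, 4L, 7L),
                   State = c("Alabama", "Alabama", "Ohio"))

# Default X[Y]: keeps every row of dt2x, Product is NA for customer 7
all_of_dt2x <- dt1[dt2x, on = "CustomerId"]

# Strict inner join: rows without a match are dropped
inner <- dt1[dt2x, on = "CustomerId", nomatch = NULL]

# Keeping every row of dt1 instead: swap the roles of the two tables
all_of_dt1 <- dt2x[dt1, on = "CustomerId"]
```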

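data.table documented on Stack Overflow:
How to do a data.table merge operation
Translating SQL joins on foreign keys to R data.table syntax
Efficient alternatives to merge for larger data frames
How to do a basic left outer join with data.table in R?

Another alternative is the join function found in the plyr package: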

library(plyr)

join(df1, df2,
     type = "inner")

#   CustomerId Product   State
# 1          2 Toaster Alabama
# 2          4   Radio Alabama
# 3          6   Radio    Ohio

Options for type: inner, left, right, full.

From ?join: Unlike merge, [join] preserves the order of x no matter what join type is used.
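The ordering difference is easiest to see when x is not already sorted by the key. The sketch below reorders df1 (the name shuffled and the particular row order are arbitrary choices for illustration) and compares join with merge, which sorts on the key columns by default (sort = TRUE).

```r
library(plyr)

df1 <- data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

# Put df1 in an arbitrary row order
shuffled <- df1[c(6, 3, 1, 5, 2, 4), ]

joined <- join(shuffled, df2, by = "CustomerId", type = "left")
merged <- merge(shuffled, df2, by = "CustomerId", all.x = TRUE)

joined$CustomerId  # follows the row order of shuffled
merged$CustomerId  # sorted by CustomerId
```

If the sorting done by merge is unwanted, merge also accepts sort = FALSE, although its documentation then leaves the row order unspecified.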
