How to join (merge) data frames (inner, outer, left, right)?
Given two data frames:
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))
df1
# CustomerId Product
# 1 Toaster
# 2 Toaster
# 3 Toaster
# 4 Radio
# 5 Radio
# 6 Radio
df2
# CustomerId State
# 2 Alabama
# 4 Alabama
# 6 Ohio
How can I do database style, ie, sql style, joins? That is, how do I get:
df1
and df2
: Return only the rows in which the left table have matching keys in the right table.
df1
and df2
: Returns all rows from both tables, join records from the left which have matching keys in the right table.
df1
and df2
Return all rows from the left table, and any rows with matching keys from the right table.
df1
and df2
Return all rows from the right table, and any rows with matching keys from the left table.
Extra credit:
How can I do a SQL style select statement?
By using the merge
function and its optional parameters:
Inner join: merge(df1, df2)
will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId")
to make sure that you were matching on only the fields you desired. You can also use the by.x
and by.y
parameters if the matching variables have different names in the different data frames.
Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
Cross join: merge(x = df1, y = df2, by = NULL)
Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.
You can merge on multiple columns by giving by
a vector, eg, by = c("CustomerId", "OrderId")
.
If the column names to merge on are not the same, you can specify, eg, by.x = "CustomerId_in_df1",
by.y = "CustomerId_in_df2" where
CustomerId_in_df1 is the name of the column in the first data frame and
CustomerId_in_df2` is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
I would recommend checking out Gabor Grothendieck's sqldf package, which allows you to express these operations in SQL.
library(sqldf)
## inner join
df3 <- sqldf("SELECT CustomerId, Product, State
FROM df1
JOIN df2 USING(CustomerID)")
## left join (substitute 'right' for right join)
df4 <- sqldf("SELECT CustomerId, Product, State
FROM df1
LEFT JOIN df2 USING(CustomerID)")
I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).
See Gabor's sqldf GitHub for more information on joins.
There is the data.table approach for an inner join, which is very time and memory efficient (and necessary for some larger data.frames):
library(data.table)
dt1 <- data.table(df1, key = "CustomerId")
dt2 <- data.table(df2, key = "CustomerId")
joined.dt1.dt.2 <- dt1[dt2]
merge
also works on data.tables (as it is generic and calls merge.data.table
)
merge(dt1, dt2)
data.table documented on stackoverflow:
How to do a data.table merge operation
Translating SQL joins on foreign keys to R data.table syntax
Efficient alternatives to merge for larger data.frames R
How to do a basic left outer join with data.table in R?
Yet another option is the join
function found in the plyr package
library(plyr)
join(df1, df2,
type = "inner")
# CustomerId Product State
# 1 2 Toaster Alabama
# 2 4 Radio Alabama
# 3 6 Radio Ohio
Options for type
: inner
, left
, right
, full
.
From ?join
: Unlike merge
, [ join
] preserves the order of x no matter what join type is used.
上一篇: 大部分数据可视化效果不佳