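Fast, efficient way to loop over millions of rows and match columns
I'm working with eye-tracking data, so I have a huge dataset (think millions of rows) and need a fast way to do the following task. This is a simplified version of it.
The data tell you where the eye is looking at each time point, for each file being viewed. X1, Y1 are the coordinates of the point being looked at. Each file has multiple time points (the eye looks at different positions in the file at different times).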
Filename Time X1 Y1
1 1 10 10
1 2 12 10
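I also have a second data frame that gives the item positions for each Filename. Each file contains (in this simplified case) two objects. X1, Y1 are the coordinates of the lower-left corner and X2, Y2 those of the upper-right corner. You can think of this as a bounding box around each item in a file. For example: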
Filename Item X1 Y1 X2 Y2
1 Dog 11 10 20 20
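What I want to do is add another column to the first data frame that tells me which object the person is looking at, for each file at each time point. If they are not looking at any object, the column should say "none". Points on the boundary of a box count as looking at the object. For example: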
Filename Time X1 Y1 LookingAt
1 1 10 10 none
1 2 12 11 Dog
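To make the matching rule concrete, here is a minimal sketch of the inclusive bounding-box test for a single point. The helper name `in_box` is made up for illustration and is not part of my real code; it assumes numeric coordinates with (x1, y1) as the lower-left and (x2, y2) as the upper-right corner.

```r
# Hypothetical helper: TRUE when (x, y) lies inside or on the edge of the box
in_box <- function(x, y, x1, y1, x2, y2) {
  x >= x1 & x <= x2 & y >= y1 & y <= y2
}

in_box(12, 11, 11, 10, 20, 20)  # gaze point (12, 11) against the Dog box: TRUE
in_box(10, 10, 11, 10, 20, 20)  # x = 10 is left of the box: FALSE
in_box(11, 10, 11, 10, 20, 20)  # lower-left corner itself counts: TRUE
```

The comparisons use `&`, so the same function also works element-wise on whole columns.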
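I know how to do this with a for loop, but it takes forever (and crashes my RStudio). I'm wondering whether there is a faster, more efficient way that I'm missing.
Here is the dput of the first data frame (the dput output contains more rows than the example shown above):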
structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L,
3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), Time = structure(c(1L,
2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5",
"6"), class = "factor"), X1 = structure(c(1L, 4L, 3L, 2L, 1L,
4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"
), class = "factor"), Y1 = structure(c(1L, 5L, 6L, 4L, 1L, 2L,
3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), .Names = c("Filename",
"Time", "X1", "Y1"), row.names = c(NA, -9L), class = "data.frame")
Here is the dput of the second data frame:
structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1",
"3"), class = "factor"), Item = structure(1:4, .Label = c("Cat",
"Dog", "House", "Mouse"), class = "factor"), X1 = structure(c(2L,
4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"),
Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11",
"13", "35"), class = "factor"), X2 = structure(c(1L, 3L,
4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"),
Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11",
"13", "35"), class = "factor")), .Names = c("Filename", "Item",
"X1", "Y1", "X2", "Y2"), row.names = c(NA, -4L), class = "data.frame")
Using data.table and the example data you provided, I would approach it as follows:
library(data.table)
# 'dat' and 'lu' are the example data frames listed at the end of this answer
# getting the data in the right format
# (convert factors via as.character() first; as.numeric() on a factor returns the level codes)
datcols <- c("X","Y")
lucols <- c("X1","X2","Y1","Y2")
setDT(dat)[, (datcols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = datcols
           ][, Filename := as.character(Filename)]
setDT(lu)[, (lucols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = lucols
          ][, `:=` (Filename = as.character(Filename),
                    X1 = pmin(X1,X2), X2 = pmax(X1,X2), # make sure that 'X1' is always the lowest value
                    Y1 = pmin(Y1,Y2), Y2 = pmax(Y1,Y2))] # make sure that 'Y1' is always the lowest value
# matching the 'Items' to the correct rows
dat[, looked_at := lu$Item[Filename==lu$Filename &
between(X, lu$X1, lu$X2) &
between(Y, lu$Y1, lu$Y2)],
by = .(Filename,Time)]
This gives:
> dat
Filename Time X Y looked_at
1: 1 1 10 10 Cat
2: 1 2 15 20 NA
3: 1 3 12 25 NA
4: 2 1 11 15 NA
5: 2 2 10 10 NA
6: 3 1 15 11 NA
7: 3 2 25 12 NA
8: 3 5 20 15 House
9: 3 6 10 10 Mouse
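The question asks for "none" where no object is being looked at, while the result above has NA in those rows. A minimal sketch of one way to fill them in, shown on a toy table (`res` is a made-up stand-in for the result, not the real data):

```r
library(data.table)

# toy stand-in for the result table; 'looked_at' is a factor with a missing value
res <- data.table(Filename = c("1", "1"),
                  looked_at = factor(c("Cat", NA)))

# convert the factor to character first, then overwrite the missing entries by reference
res[, looked_at := as.character(looked_at)
    ][is.na(looked_at), looked_at := "none"]
```

Converting to character first avoids having to add "none" as a new factor level.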
Data used:
dat <- structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"),
Time = structure(c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", "6"), class = "factor"),
X = structure(c(1L, 4L, 3L, 2L, 1L, 4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor"),
Y = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")),
.Names = c("Filename", "Time", "X", "Y"), row.names = c(NA, -9L), class = "data.frame")
lu <- structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", "3"), class = "factor"),
Item = structure(1:4, .Label = c("Cat", "Dog", "House", "Mouse"), class = "factor"),
X1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"),
X2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"),
Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "13", "35"), class = "factor"),
Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "13", "35"), class = "factor")),
.Names = c("Filename", "Item", "X1", "X2", "Y1", "Y2"), row.names = c(NA, -4L), class = "data.frame")
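For millions of rows, a non-equi update join may scale better than evaluating the condition once per group. The following is only a sketch of that alternative, not the approach used above: `gaze` and `boxes` are made-up names holding a few rows of the example data, already converted to numeric and with X1 <= X2 and Y1 <= Y2. It assumes a data.table version that supports non-equi joins in `on` (1.9.8 or later).

```r
library(data.table)

# small stand-ins for 'dat' and 'lu' (a subset of the example values, already numeric)
gaze <- data.table(Filename = c("1", "1", "3", "3"),
                   Time = c(1, 2, 5, 6),
                   X = c(10, 15, 20, 10),
                   Y = c(10, 20, 15, 10))
boxes <- data.table(Filename = c("1", "1", "3", "3"),
                    Item = c("Cat", "Dog", "House", "Mouse"),
                    X1 = c(10, 20, 20, 10), X2 = c(11, 35, 35, 11),
                    Y1 = c(10, 20, 13, 10), Y2 = c(11, 35, 35, 11))

# update join: for each box, find the gaze rows inside it (edges included)
# and write the item name into 'looked_at' by reference
gaze[boxes, on = .(Filename, X >= X1, X <= X2, Y >= Y1, Y <= Y2),
     looked_at := i.Item]
```

Rows that fall in no box keep NA in `looked_at`. If boxes within a file overlap, a point can match more than one item and only one of them is kept, so that case needs its own rule.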