快速，高效的方式循环数百万行并匹配列

2018-07-02 02:52:12

我现在正在使用眼动追踪数据，所以有一个巨大的数据集（想想成百万行），所以想要一个快速的方法来完成这个任务。这是它的简化版本。

数据告诉你眼睛在每个时间点看什么，以及我们正在看的每个文件。 X1，Y1到我们正在查看的点的坐标。每个文件有多个时间点（代表着眼睛在不同时间看着文件中的不同位置）。

Filename    Time    X1    Y1
   1         1      10    10
   1         2      12    10

我还有一个文件，显示每个文件名的项目位置。每个文件都包含（在这个简化的情况下）两个对象。 X1，Y1是左下角坐标，X2，Y2是右上角。你可以想象这是给每个文件中项目所在的边界框。例如

Filename    Item    X1   Y1   X2   Y2
  1          Dog    11   10   20   20

我想要做的是在第一个数据框中添加另一列，告诉我每个文件在每个时间内人们正在查看的对象。如果没有查看任何对象，我希望列可以说“无”。在边界上的事情被视为正在计算。例如

Filename    Time    X1    Y1   LookingAt
   1         1      10    10    none
   1         2      12    11    Dog

我知道如何做到for循环的方式，但它需要永远（并使我的RStudio崩溃）。我想知道是否可能有更快，更有效的方式我错过了。

这里是第一个数据帧的输入（这些包含更多的行，我上面展示的例子）：

structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 
3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), Time = structure(c(1L, 
2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", 
"6"), class = "factor"), X1 = structure(c(1L, 4L, 3L, 2L, 1L, 
4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"
), class = "factor"), Y1 = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 
3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), .Names = c("Filename", 
"Time", "X1", "Y1"), row.names = c(NA, -9L), class = "data.frame")

这是第二次投资：

structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", 
"3"), class = "factor"), Item = structure(1:4, .Label = c("Cat", 
"Dog", "House", "Mouse"), class = "factor"), X1 = structure(c(2L, 
4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"), 
Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", 
"13", "35"), class = "factor"), X2 = structure(c(1L, 3L, 
4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"), 
Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", 
"13", "35"), class = "factor")), .Names = c("Filename", "Item", 
"X1", "Y1", "X2", "Y2"), row.names = c(NA, -4L), class = "data.frame")

使用data.table和您提供的示例数据，我会按如下方式处理它：

# getting the data in the right format
datcols <- c("X","Y")
lucols <- c("X1","X2","Y1","Y2")
setDT(dat)[, (datcols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcol = datcols
           ][, Filename := as.character(Filename)]
setDT(lu)[, (lucols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcol = lucols
          ][, `:=` (Filename = as.character(Filename),
                    X1 = pmin(X1,X2), X2 = pmax(X1,X2),   # make sure that 'X1' is always the lowest value
                    Y1 = pmin(Y1,Y2), Y2 = pmax(Y1,Y2))]  # make sure that 'Y1' is always the lowest value

# matching the 'Items' to the correct rows
dat[, looked_at := lu$Item[Filename==lu$Filename &
                      between(X, lu$X1, lu$X2) &
                      between(Y, lu$Y1, lu$Y2)],
    by = .(Filename,Time)]

这使：

> dat
   Filename Time  X  Y looked_at
1:        1    1 10 10       Cat
2:        1    2 15 20        NA
3:        1    3 12 25        NA
4:        2    1 11 15        NA
5:        2    2 10 10        NA
6:        3    1 15 11        NA
7:        3    2 25 12        NA
8:        3    5 20 15     House
9:        3    6 10 10     Mouse

使用的数据：

dat <- structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), 
                     Time = structure(c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", "6"), class = "factor"), 
                     X = structure(c(1L, 4L, 3L, 2L, 1L, 4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor"), 
                     Y = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), 
                .Names = c("Filename", "Time", "X", "Y"), row.names = c(NA, -9L), class = "data.frame")
lu <- structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", "3"), class = "factor"), 
                     Item = structure(1:4, .Label = c("Cat", "Dog", "House", "Mouse"), class = "factor"), 
                     X1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"), 
                     X2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"), 
                     Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "13", "35"), class = "factor"), 
                     Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "13", "35"), class = "factor")), 
                .Names = c("Filename", "Item", "X1", "X2", "Y1", "Y2"), row.names = c(NA, -4L), class = "data.frame")

链接地址: http://www.djcxy.com/p/89539.html

上一篇: fast, efficient way to loop over millions of rows and match columns

下一篇: pick" in Github App for Mac