Writing functions (procedures) for data.table objects
In the book Software for Data Analysis: Programming with R, John Chambers emphasizes that functions should generally not be written for their side effects; rather, a function should return a value without modifying any variables in its calling environment. Conversely, writing good scripts using data.table objects specifically avoids object assignment with <-, typically used to store the result of a function.
First, a technical question. Imagine an R function called proc1 that accepts a data.table object x as its argument (in addition, perhaps, to other parameters). proc1 returns NULL but modifies x using :=. From what I understand, calling proc1(x=x1) makes a copy of x1 just because of the way that promises work. However, as demonstrated below, the original object x1 is still modified by proc1. Why/how is this?
> require(data.table)
> x1 <- CJ(1:2, 2:3)
> x1
V1 V2
1: 1 2
2: 1 3
3: 2 2
4: 2 3
> proc1 <- function(x){
+ x[,y:= V1*V2]
+ NULL
+ }
> proc1(x1)
NULL
> x1
V1 V2 y
1: 1 2 2
2: 1 3 3
3: 2 2 4
4: 2 3 6
>
Furthermore, it seems that using proc1(x=x1) isn't any slower than performing the operation directly on x1, indicating that my vague understanding of promises is wrong and that they work in a pass-by-reference sort of way:
> x1 <- CJ(1:2000, 1:500)
> x1[, paste0("V",3:300) := rnorm(1:nrow(x1))]
> proc1 <- function(x){
+ x[,y:= V1*V2]
+ NULL
+ }
> system.time(proc1(x1))
user system elapsed
0.00 0.02 0.02
> x1 <- CJ(1:2000, 1:500)
> system.time(x1[,y:= V1*V2])
user system elapsed
0.03 0.00 0.03
So, given that passing a data.table argument to a function doesn't add time, that makes it possible to write procedures for data.table objects, incorporating both the speed of data.table and the generalizability of a function. However, given what John Chambers said, that functions should not have side-effects, is it really "ok" to write this type of procedural programming in R? Why was he arguing that side effects are "bad"? If I'm going to ignore his advice, what sort of pitfalls should I be aware of? What can I do to write "good" data.table procedures?
Yes, the addition, modification, and deletion of columns in data.tables is done by reference. In a sense, this is a good thing: a data.table usually holds a lot of data, and it would be very memory- and time-consuming to reassign it all every time a change is made. On the other hand, it is a bad thing because it goes against the no-side-effect functional programming approach that R tries to promote by using pass-by-value by default. With no-side-effect programming, there is little to worry about when you call a function: you can rest assured that your inputs and your environment won't be affected, and you can just focus on the function's output. It's simple, hence comfortable.
Of course it is ok to disregard John Chambers's advice if you know what you are doing. About writing "good" data.table procedures, here are a couple of rules I would consider if I were you, as a way to limit complexity and the number of side-effects: if a function modifies a table by reference, make that side effect its only output, i.e., have it return NULL, so that it is called as do.something.to(table) and not table <- do.something.to(table). If instead the function had another ("real") output, then when calling result <- do.something.to(table), it is easy to imagine how you might focus your attention on the output and forget that calling the function had a side effect on your table. While "one output / no side-effect" functions are the norm in R, the above rules allow for "one output or side-effect". If you agree that a side effect is somehow a form of output, then you'll agree I am not bending the rules too much by loosely sticking to R's one-output functional programming style. Allowing functions to have multiple side effects would be a little more of a stretch; not that you can't do it, but I would try to avoid it if possible.
The documentation could be improved (suggestions very welcome), but here is what's there at the moment. Perhaps it should say "even within functions"?
In ?":=":
data.tables are not copied-on-change by :=, setkey or any of the other set* functions. See copy.
DT is modified by reference and the new value is returned. If you require a copy, take a copy first (using DT2=copy(DT)). Recall that this package is for large data (of mixed column types, with multi-column keys) where updates by reference can be many orders of magnitude faster than copying the entire table.
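The quoted advice (take a copy first if you need one) can be sketched as follows; the table and column names here are illustrative only.

```r
library(data.table)

# := never copies on change, so preserve the original with an explicit copy().
DT  <- data.table(a = 1:3)
DT2 <- copy(DT)        # deep copy; without copy(), "DT2 <- DT" would alias DT
DT2[, b := a * 2]      # by-reference update touches only DT2
"b" %in% names(DT)     # FALSE -- DT is unaffected
```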
and in ?copy (but I realise this is muddled in with setkey):
The input is modified by reference, and returned (invisibly) so it can be used in compound statements; eg, setkey(DT,a)[J("foo")]. If you require a copy, take a copy first (using DT2=copy(DT)). copy() may also sometimes be useful before := is used to subassign to a column by reference. See ?copy. Note that setattr is also in package bit. Both packages merely expose R's internal setAttrib function at C level, but differ in return value. bit::setattr returns NULL (invisibly) to remind you the function is used for its side effect. data.table::setattr returns the changed object (invisibly), for use in compound statements.
where the last two sentences about bit::setattr relate to flodel's point 2, interestingly.
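The invisible return of the set* functions is what makes the compound statement quoted above work; here is a small runnable version of that setkey(DT,a)[J("foo")] pattern (data values are mine).

```r
library(data.table)

DT <- data.table(a = c("foo", "bar", "foo"), v = 1:3)

# setkey sorts DT by reference and returns it invisibly, so the join
# can be chained onto the same statement.
res <- setkey(DT, a)[J("foo")]
res$v       # 1 3 -- the rows where a == "foo"
haskey(DT)  # TRUE -- the keying side effect persisted on DT itself
```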
Also see these related questions:
Understanding exactly when a data.table is a reference to (vs a copy of) another data.table
Pass by reference: The := operator in the data.table package
data.table 1.8.1.: “DT1 = DT2” is not the same as DT1 = copy(DT2)?
I very much like this part of your question:
that makes it possible to write procedures for data.table objects, incorporating both the speed of data.table and the generalizability of a function.
Yes, this is definitely one of the intentions. Consider how a database works: lots of different users/programs change by reference (insert/update/delete) one or more (large) tables in the database. That works fine in database land, and is more like the way data.table thinks. Hence the svSocket video on the homepage, and the desire for insert and delete (by reference, verb only, side-effect functions).