Limiting size of hierarchical data for reproducible example

2018-05-30 10:43:53

I am trying to come up with a reproducible example (RE) for this question: Errors related to data frame columns during merging. To qualify as having an RE, the question lacks only reproducible data. However, when I tried the fairly standard approach of dput(head(myDataObj)), the output was a 14 MB file. The problem is that my data object is a list of data frames, so head() does not limit the size recursively: it returns the first six elements of the list, each of which is still a complete data frame.

I haven't found any options for dput() or head() that would let me control data size recursively for complex objects. Unless I am wrong about the above, what other approaches to creating a minimal RE dataset would you recommend in this situation?

Along the lines of @MrFlick's comment about using lapply, you can use any of the apply family of functions to apply head or sample, depending on your needs, and so reduce the size of the data for both REs and testing. (I've found that working with subsets or subsamples of large data sets is preferable for debugging and even charting.)

Note that head and tail return the first or last elements of a structure. These sometimes lack sufficient variation for RE purposes and are certainly not random, which is where sample becomes more useful.
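For the case in the question (a flat list of data frames), the lapply idea can be sketched as follows. This is a minimal illustration, not the asker's actual data: my_list and its columns are made up here to stand in for a large object.

```r
# Hypothetical stand-in for a large list of data frames.
set.seed(1)
my_list <- list(
  first  = data.frame(id = 1:100, value = rnorm(100)),
  second = data.frame(id = 1:50, label = sample(letters, 50, replace = TRUE))
)

# First three rows of every data frame; the list names are preserved.
small_head <- lapply(my_list, head, n = 3)

# Three random rows of every data frame (fewer if a frame has fewer rows).
small_sample <- lapply(my_list, function(d) {
  d[sample(nrow(d), min(nrow(d), 3)), , drop = FALSE]
})

# The trimmed object is small enough to paste into a question.
dput(small_head)
```

Sampling row indices rather than the data frame itself keeps each row intact, so relationships between columns survive in the subsample.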

Suppose we have a hierarchical tree structure (list of lists of...) and we want to subset each "leaf" while preserving the structure and labels in the tree.

x <- list( 
    a=1:10, 
    b=list( ba=1:10, bb=1:10 ), 
    c=list( ca=list( caa=1:10, cab=letters[1:10], cac="hello" ), cb=toupper( letters[1:10] ) ) )

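NOTE: In the following examples I see no difference between using how="replace" and how="list". Per the rapply documentation, the two differ only when the classes argument excludes some leaves: "replace" leaves those leaves unchanged, while "list" replaces them with deflt.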
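The two modes can be told apart once classes restricts which leaves the function is applied to. A small sketch, using a made-up list y that is not part of the original example:

```r
# Hypothetical two-leaf list, made up for this illustration.
y <- list(num = 1:3, chr = c("a", "b"))

# "replace": only character leaves are transformed; other leaves are kept.
res_replace <- rapply(y, toupper, classes = "character", how = "replace")

# "list": character leaves are transformed; other leaves become `deflt`,
# which is NULL unless specified.
res_list <- rapply(y, toupper, classes = "character", how = "list")

str(res_replace)
str(res_list)
```

For trimming every leaf of a tree, as in the examples here, classes stays at its default and either mode should do.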

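ALSO NOTE: This won't be great for data.frame leaf nodes, because rapply treats a data frame as a list and descends into its columns instead of handling it as a single leaf.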
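If the tree does contain data frames, one possible workaround is a small hand-written recursion that stops at data frames instead of descending into them. The function name shrink_leaves and the list nested below are hypothetical, chosen only for this sketch:

```r
# Recursively trim every leaf to its first `n` elements (or rows),
# treating a data frame as a single leaf rather than as a list of columns.
shrink_leaves <- function(obj, n = 2) {
  if (is.data.frame(obj)) {
    head(obj, n)                        # first n rows
  } else if (is.list(obj)) {
    lapply(obj, shrink_leaves, n = n)   # recurse; lapply keeps the names
  } else {
    head(obj, n)                        # first n elements of a vector
  }
}

# Hypothetical nested structure with a data frame as one of the leaves.
nested <- list(
  a = 1:10,
  b = list(df = data.frame(x = 1:10, y = letters[1:10]), v = letters)
)

small <- shrink_leaves(nested)
str(small)
```

The is.data.frame test has to come before the is.list test, since a data frame is also a list.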

# Set seed so the example is reproducible with randomized methods:
set.seed(1)

You can use the default head in a recursive apply in this way:

rapply( x, head, how="replace" )

Or pass an anonymous function that modifies the behavior:

# Complete anonymous function
rapply( x, function(y){ head(y,2) }, how="replace" )
# Same behavior, but using the rapply "..." argument to pass the n=2 to head.
rapply( x, head, how="replace", n=2 )

The following takes a random sample from each leaf:

# min() caps the sample size at the length of the leaf, so leaves
# shorter than the requested size (here 2) are still handled.
rapply( x, function(y){ sample(y, size=min(length(y),2) ) }, how="replace" )

# Less efficient, but maybe easier to read: shuffle the whole leaf, then
# keep the first elements (up to 6, the default for head):
rapply( x, function(y){ head(sample(y)) }, how="replace" )

# XXX: The following does NOT work, because `sample` fails when `size`
# is greater than the length of the item being sampled (when sampling
# without replacement). Here the leaf "hello" has length 1.
rapply( x, function(y){ sample(y, size=2) }, how="replace" )
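To tie this back to the original goal, the trimmed tree can be passed to dput and the output checked before posting. A minimal sketch, reusing the example list x from above; the variable names small, txt and rebuilt are introduced here only for illustration:

```r
# The example tree from above.
x <- list(
    a = 1:10,
    b = list(ba = 1:10, bb = 1:10),
    c = list(ca = list(caa = 1:10, cab = letters[1:10], cac = "hello"),
             cb = toupper(letters[1:10])))

# Keep the first two elements of every leaf.
small <- rapply(x, head, how = "replace", n = 2)

# Text that could be pasted into a question.
txt <- capture.output(dput(small))
cat(txt, sep = "\n")

# Check that the pasted text rebuilds an equal object.
rebuilt <- eval(parse(text = txt))
stopifnot(isTRUE(all.equal(rebuilt, small)))

# Compare in-memory sizes before and after trimming.
print(object.size(x))
print(object.size(small))
```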
