How to easily combine data sets; how to quantify text data
I'm just getting started with R and R-Studio. I'm working with a couple different data sets: each contains the same variables, and within those variables the same types of information.
The data sets have been imported into R-Studio as separate sets/files. First question: how can I go about combining them? There are seventeen in all. Here is an abbreviated example of two of them:
EVENT_ID STATE YEAR MONTH_NAME EVENT_TYPE INJURIES_DIRECT DEATHS_DIRECT
1 5551758 MASSACHUSETTS 1996 January Heavy Snow 0 0
2 5551581 MASSACHUSETTS 1996 January Heavy Snow 0 0
3 5551757 MASSACHUSETTS 1996 January Heavy Snow 0 0
4 5551573 MASSACHUSETTS 1996 January Heavy Snow 0 0
5 5551572 MASSACHUSETTS 1996 January Heavy Snow 0 0
EVENT_ID STATE YEAR MONTH_NAME EVENT_TYPE INJURIES_DIRECT DEATHS_DIRECT
1 5591809 MASSACHUSETTS 1997 January Winter Weather 0 0
2 5591810 MASSACHUSETTS 1997 January Winter Weather 0 0
3 5591817 MASSACHUSETTS 1997 January Heavy Snow 0 0
4 5591820 MASSACHUSETTS 1997 January Heavy Snow 0 0
5 5591819 MASSACHUSETTS 1997 January Heavy Snow 0 0
6 5591811 MASSACHUSETTS 1997 January Heavy Snow 0 0
7 5591813 MASSACHUSETTS 1997 January Heavy Snow 0 0
As you can see, each has the same headers. Once I have combined these data sets - to not include the headers in the middle of the data! - I will begin analysis. Second question: how can I go about quantifying factors, such as those found in the EVENT_TYPE variable? I tried converting them to "as.numeric", which I believe orders them 1-x based on alphabetical order. That's fine, but how would I go about keeping track of that data? I'm hoping to play with them like I would numeric data, but do not know where or how to begin doing that.
If there is another place this is explained, please let me know and I'm happy to read those examples. I wasn't sure how to best ask.
Create a list of the data frames and use do.call to run rbind on them:
do.call( rbind, list(df1,df2,df3, ....,dfN) )
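For a concrete picture of that call, here is a minimal sketch. The two data frames are small stand-ins built from a few rows of the question's sample (with fewer columns), and the names df1, df2, frames and combined are just illustrative.

```r
# Two small stand-in data frames with identical columns, using a few rows
# copied from the question's sample (not the real seventeen files).
df1 <- data.frame(
  EVENT_ID   = c(5551758, 5551581),
  STATE      = "MASSACHUSETTS",
  YEAR       = 1996,
  EVENT_TYPE = "Heavy Snow",
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  EVENT_ID   = c(5591809, 5591817, 5591820),
  STATE      = "MASSACHUSETTS",
  YEAR       = 1997,
  EVENT_TYPE = c("Winter Weather", "Heavy Snow", "Heavy Snow"),
  stringsAsFactors = FALSE
)

# Put the data frames in a list, then let do.call() pass every element
# of the list to rbind() as a separate argument.
frames   <- list(df1, df2)
combined <- do.call(rbind, frames)

nrow(combined)   # total number of rows from both data frames
names(combined)  # the header appears once, as column names
```

Because rbind matches data frame columns by name, the headers never end up in the middle of the data; they only need to be identical across the inputs.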
For the actual unification: see BondedDust's answer (for a more expansive thing to achieve basically the same end, see here.)
In terms of ordering and ranking the EVENT_TYPE values, have you looked at ?as.factor at all? If you could explain what you're looking to do with the data we'll probably be able to provide a more substantive answer :).
help(rbind) will get you started.
You want to read the data in as data frames, probably using read.csv or read.table, then combine the data frames with rbind. See help(data.frame) and help(rbind) for explanations and examples. There's also a very brief example at http://www.endmemo.com/program/R/rbind.php
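Reading and combining can be done in one pass. The sketch below assumes the files are CSVs sitting in one directory; the directory, the file names and the reduced column set are all hypothetical, and the files are written to a temporary folder first only so the example runs on its own.

```r
# Write two tiny CSV files to a temporary directory so the sketch is
# self-contained; in practice these would be the existing data files.
dir_path <- file.path(tempdir(), "storm_csv_demo")  # hypothetical location
dir.create(dir_path, showWarnings = FALSE)
write.csv(data.frame(EVENT_ID = c(5551758, 5551581), YEAR = 1996,
                     EVENT_TYPE = "Heavy Snow"),
          file.path(dir_path, "events_1996.csv"), row.names = FALSE)
write.csv(data.frame(EVENT_ID = c(5591809, 5591817), YEAR = 1997,
                     EVENT_TYPE = c("Winter Weather", "Heavy Snow")),
          file.path(dir_path, "events_1997.csv"), row.names = FALSE)

# List every CSV file in the directory, read each one into a data frame
# (keeping text columns as character), and stack the results with rbind().
files  <- list.files(dir_path, pattern = "\\.csv$", full.names = TRUE)
frames <- lapply(files, read.csv, stringsAsFactors = FALSE)
events <- do.call(rbind, frames)

str(events)
```

If the data sets are already loaded as separate objects in the R session, the list.files and read.csv steps can be skipped and the existing data frames placed in a list directly.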
Don't convert the strings to factors until after you combine them. You can do this by specifying stringsAsFactors = FALSE when you load the data.
Once you have the data frames combined, though, you can use d[,colnum] = as.factor(d[,colnum]) to convert the column to a factor. This will create integer codes for each of the phrases that appear in that column, one per factor level. You may want to specify the order of the levels if you want to actually work with them numerically (so that heavy snow has a higher number than snow, etc.). You will also need to check for missing values, and watch out for variations in the values such as upper case/lower case or extra spaces.
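A rough sketch of those steps follows. The sample values and the severity ordering are assumptions made up for illustration; the real column would be something like d$EVENT_TYPE, and the ranking is for you to decide.

```r
# Stand-in for the combined EVENT_TYPE column, with the kinds of
# inconsistencies mentioned above (case, stray spaces, a missing value).
event_type <- c("Heavy Snow", "heavy snow ", "Winter Weather", NA, "Heavy Snow")

# Normalize before converting: trim whitespace and unify case.
cleaned <- toupper(trimws(event_type))

# Give the levels an explicit order (an assumed severity ranking) instead
# of relying on the default alphabetical order.
severity <- c("WINTER WEATHER", "HEAVY SNOW")
event_f  <- factor(cleaned, levels = severity)

# Integer codes follow the order of `severity`; missing values stay NA.
codes <- as.integer(event_f)

# Lookup table that records which code stands for which label.
key <- data.frame(code = seq_along(levels(event_f)), label = levels(event_f))

table(event_f, useNA = "ifany")  # counts per event type, including NA
```

The key data frame is one way to keep track of what each number means, and table() gives counts per event type, which is often the more useful kind of quantification for categories that have no natural order.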