Cleaning mixed decimal separators after Excel import (gsub maybe?)
I needed to read several Excel files and used the gdata package. Unfortunately, the files were formatted inconsistently: some use "," as the decimal or thousands separator, some use ".", and some use no separator at all.
To give you an idea, the numbers can look like this:
# Five times 1000.1 and four times 1000.0
x <- c("1,000.1","1.000.1","1000.1","1000,1","1.000,1","1000","1,000","1.000","1000.0")
x
Is there a general way to convert these into 1000.1 and 1000.0 respectively? I thought about using gsub() and a regexp.
A first gsub() could replace every "," with ".". A second gsub() could then use a regexp that deletes every "." followed by three digits and keeps the others.
However, I'm not familiar with regexps and don't know how to write that. Can anybody help? Is there a simpler way to clean Excel sheets?
Thanks!
Using gsub
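For example, this drops an optional separator after the first digit and puts a decimal point after the next three digits: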
as.numeric(gsub('([0-9])[,|.]?([0-9]{3})[,|.]?','\\1\\2.',x))
[1] 1000.1 1000.1 1000.1 1000.1 1000.1 1000.0 1000.0 1000.0 1000.0
For this specific case you can even simplify the regular expression to:
as.numeric(gsub('^(1)[,|.]?(0{3})[,|.]?','\\1\\2.',x))
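And here I break down the last regular expression:
^          | 1 | [,|.]?         | 0{3}    | [,|.]?         | (0|1)?
begin with | 1 | comma or point | 3 zeros | comma or point | 0 or 1 or nothing
The last column describes the digit that may follow; the expression above does not match it, so gsub() leaves it in place.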
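Both patterns above rely on the integer part having exactly four digits. For other magnitudes, the idea from the question (drop every separator that is followed by exactly three digits, keep the rest as the decimal mark) can be sketched with a lookahead. This is only a sketch, not part of the original answer: clean_number is a hypothetical helper name, and it assumes no value has exactly three decimal places, since "1.000" is then indistinguishable from one thousand.

```r
# Hypothetical helper, not from the original answer.
# Assumption: a "," or "." followed by exactly three digits is a thousands
# separator; any other separator is the decimal mark.
clean_number <- function(x) {
  # Lookahead requires perl = TRUE; (?!\\d) rejects runs of four or more digits.
  no_thousands <- gsub("[.,](?=\\d{3}(?!\\d))", "", x, perl = TRUE)
  # At most one separator remains; turn a decimal comma into a point.
  as.numeric(sub(",", ".", no_thousands, fixed = TRUE))
}

x <- c("1,000.1","1.000.1","1000.1","1000,1","1.000,1","1000","1,000","1.000","1000.0")
clean_number(x)

# Values with more than four integer digits:
clean_number(c("1,234,567.8", "12.345,67"))
```

If some sheets really do carry three decimal places, no regexp can resolve the ambiguity on its own; the separator convention would have to be known per file.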