From Factor to Numeric or Integer error

2018-06-08 05:23:08

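I have a data frame in R that I loaded from a CSV file. One of the variables is called "Amount" and is meant to contain positive and negative numbers.

When I look at the data frame, this variable's data type is listed as a factor, and I need it in a numeric format (not sure which one: integer, numeric, umm...?). So I tried to convert it to one of those two formats, but saw some interesting behavior.

Initial data frame: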

str(df)

Amount        : Factor w/ 11837 levels "","-1","-10",..: 2 2 1664 4 6290 6290 6290 6290 6290 6290 ...

As I mentioned above, I see something odd when I try to convert it to numeric or integer. To show this, I put together this comparison:

df2 <- data.frame(df$Amount, as.numeric(df$Amount), as.integer(df$Amount))

str(df2)
'data.frame':   2620276 obs. of  3 variables:
 $ df.Amount            : Factor w/ 11837 levels "","-1","-10",..: 2 2 1664 4 6290 6290 6290 6290 6290 6290 ...
 $ as.numeric.df.Amount.: num  2 2 1664 4 6290 ...
 $ as.integer.df.Amount.: int  2 2 1664 4 6290 6290 6290 6290 6290 6290 ...

> head(df2, 20)
         df.Amount        as.numeric.df.Amount.       as.integer.df.Amount.
1               -1                           2                           2
2               -1                           2                           2
3             -201                        1664                        1664
4             -100                           4                           4
5                1                        6290                        6290
6                1                        6290                        6290
7                1                        6290                        6290
8                1                        6290                        6290
9                1                        6290                        6290
10               1                        6290                        6290
11               1                        6290                        6290
12               1                        6290                        6290
13               1                        6290                        6290
14               1                        6290                        6290
15               1                        6290                        6290
16               1                        6290                        6290
17               1                        6290                        6290
18               2                        7520                        7520
19               2                        7520                        7520
20               2                        7520                        7520

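The as.numeric and as.integer functions are taking the Amount variable and doing something to it, but I don't know what that is. My goal is to get the Amount variable into some numeric data type so that I can perform sum / mean / etc. on it.

What am I doing wrong that leads to the odd numbers, and what can I do to fix it?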

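The root of the problem is likely some funky value in your imported csv. If it came from Excel, this is not uncommon. It can be a percent sign, a "comment" character from Excel, or any of a long list of things. I would look at the csv in your editor of choice and see what you can see.

Aside from that, you have a few options.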

read.csv takes an optional argument stringsAsFactors, which you can set to FALSE.
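A minimal sketch of that option, assuming a small inline CSV in place of the real file (the `csv_text` string, the `df_demo` name, and the values are made up for illustration). Since R 4.0.0, `stringsAsFactors` already defaults to FALSE, so passing it explicitly mostly matters on older versions.

```r
# Hypothetical stand-in for the real CSV file; values invented for illustration.
csv_text <- "Amount\n-1\n-201\n4%\n2\n"

# Keep strings as character vectors instead of converting them to factors.
df_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Because of the stray "4%", the column is read as character, not as a number.
class(df_demo$Amount)

# Converting a character vector works on the actual text, so the numbers survive
# and the unparseable entry becomes NA (the warning is expected and suppressed).
amount_num <- suppressWarnings(as.numeric(df_demo$Amount))
sum(amount_num, na.rm = TRUE)
```

The point of the sketch is that a character column converts by its text, whereas a factor column converts by its internal codes.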

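A factor is stored as integer codes that map to the level values. When you convert directly with as.numeric, you wind up with those integer codes rather than the initial values: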

> x<-10:20
> as.numeric(factor(x))
 [1]  1  2  3  4  5  6  7  8  9 10 11
>

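Otherwise, look at ?factor:

In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).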

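However, I suspect this will not convert cleanly (non-numeric entries come back as NA with a warning), because the input has something besides numbers in it.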
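As a small sketch of the difference between the three conversions, assuming a toy factor `f` whose labels happen to be clean numbers (the values are invented and only loosely modeled on the Amount column):

```r
# A factor whose labels are numbers written as text.
f <- factor(c("-1", "-201", "1", "1", "2"))

# Returns the internal integer codes, which depend on how the levels are sorted.
codes <- as.numeric(f)

# The idiom recommended in ?factor: convert each level once, then index by code.
values_a <- as.numeric(levels(f))[f]

# Same result via character; it converts every element instead of every level.
values_b <- as.numeric(as.character(f))

values_a
identical(values_a, values_b)
```

With only a handful of rows the efficiency difference is irrelevant; it starts to matter when a column has millions of rows but far fewer distinct levels.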

@Justin is right. Here is how to find the offending values:

# A sample data set with a weird value ("4%") in it
# (stringsAsFactors=TRUE is needed on R >= 4.0.0 to get a factor column)
d <- read.table(text="A B\n1 2\n3 4%\n", header=TRUE, stringsAsFactors=TRUE)
str(d)
#'data.frame':   2 obs. of  2 variables:
# $ A: int  1 3
# $ B: Factor w/ 2 levels "2","4%": 1 2

as.numeric(d$B) # WRONG, returns 1 2 (the internal factor codes)

# This correctly converts to numeric
x <- as.numeric(levels(d$B))[d$B] # 2 NA

# ...and this finds the offending value(s):
d$B[is.na(x)]  # 4% 

# and this finds the offending row numbers:
which(is.na(x)) # row 2
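For a column with millions of rows but only a few thousand levels, it may be quicker to test the levels than the rows. The following is a sketch under that assumption, using a made-up factor `f` that contains an empty string (like the first level shown in the question) and one junk label; `lev_num`, `bad_levels`, and `bad_rows` are illustrative names.

```r
# Hypothetical factor with one empty string and one junk label.
f <- factor(c("1", "", "2", "4%", "1"))

# Try converting each distinct level once rather than every row.
lev_num <- suppressWarnings(as.numeric(levels(f)))

# Levels that could not be converted to a number.
bad_levels <- levels(f)[is.na(lev_num)]
bad_levels

# Row positions that carry one of those levels.
bad_rows <- which(f %in% bad_levels)
bad_rows
```

Note that an empty string also converts to NA, so blank cells in the CSV show up here alongside the genuinely malformed values.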

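Note that if your data set has missing values encoded as something other than an empty cell or the string "NA", you have to tell read.table about it through na.strings: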

# Here "N/A" is used instead of "NA"...
read.table(text="A B\n1 2\n3 N/A\n", header=TRUE, na.strings="N/A")

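I'm new to R, but I've been using this forum for my queries. I had a similar problem, and the following worked for me. I was reading data from a txt file into a data frame: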

data <- read.delim(paste(folderpath,"data.txt",sep=""),header=TRUE,sep="",as.is=6)

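Note that I used as.is on column 6, which contains numeric data along with some garbage characters in a few rows. Using as.is reads column 6 as character; the following then changes the characters in column 6 to numeric values. All garbage values are converted to NA, which can be removed later.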

data[,6] <- as.numeric(data[,6])
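A sketch of the "remove later" step, assuming a small hand-built data frame in place of the file above (`data_demo`, its columns, and its values are hypothetical):

```r
# Hypothetical data standing in for the file; the value column has junk in it.
data_demo <- data.frame(id = 1:4,
                        value = c("10", "x7", "-3", "n/a"),
                        stringsAsFactors = FALSE)

# Convert to numeric; anything that is not a number becomes NA. The warning is
# suppressed here because the NAs are expected.
data_demo$value <- suppressWarnings(as.numeric(data_demo$value))

# Option 1: drop the rows whose value could not be parsed.
clean <- data_demo[!is.na(data_demo$value), ]

# Option 2: keep the rows and skip the NAs when summarising.
sum(data_demo$value, na.rm = TRUE)
mean(data_demo$value, na.rm = TRUE)
```

Which option fits depends on whether the other columns of the affected rows are still needed.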

Hope this helps.
