Does varchar perform better than string in Hive?
Since version 0.12, Hive supports the VARCHAR data type.
Will VARCHAR provide better performance than STRING in a typical analytical Hive query?
When Hive tables are accessed through IBM Big SQL, the default behavior for the STRING data type is to map it to the SQL data type VARCHAR(32762), and this default can lead to performance issues.
This explanation is based on IBM Big SQL, which uses Hive implicitly.
IBM BigInsights doc reference
The VARCHAR data type is also stored internally as a String. The only difference I see is that STRING has no declared length, while VARCHAR is bounded by a declared maximum length of 1 to 65,535 characters. I don't think there will be any performance gain, because the internal implementation in both cases is String. I don't know much about Hive internals, but I can see that Hive does additional processing to truncate VARCHAR values. Below is the code (from org.apache.hadoop.hive.common.type.HiveVarchar):
public static String enforceMaxLength(String val, int maxLength) {
    String value = val;
    if (maxLength > 0) {
        int valLength = val.codePointCount(0, val.length());
        if (valLength > maxLength) {
            // Truncate the excess chars to fit the character length.
            // Also make sure we take supplementary chars into account.
            value = val.substring(0, val.offsetByCodePoints(0, maxLength));
        }
    }
    return value;
}
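To see what this extra work amounts to, here is a minimal standalone sketch that copies the method above into a hypothetical class (VarcharTruncationDemo and the sample inputs are my own, not part of Hive) and calls it on a few values. Note that even a value that already fits is scanned once by codePointCount, which is the per-value overhead that STRING would not pay. This is only an illustration of the method's behavior, not a measurement of Hive query performance.

```java
// Standalone sketch: enforceMaxLength is copied from the HiveVarchar snippet above.
// The class name and the sample inputs are hypothetical, chosen for illustration.
class VarcharTruncationDemo {

    static String enforceMaxLength(String val, int maxLength) {
        String value = val;
        if (maxLength > 0) {
            // Counts Unicode code points, not UTF-16 chars; this scans the whole string.
            int valLength = val.codePointCount(0, val.length());
            if (valLength > maxLength) {
                // Cut at a code point boundary so surrogate pairs are never split.
                value = val.substring(0, val.offsetByCodePoints(0, maxLength));
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // Fits in a length of 10: returned unchanged, but still scanned once.
        System.out.println(enforceMaxLength("hive", 10));        // hive

        // Longer than 5 characters: truncated to the first 5.
        System.out.println(enforceMaxLength("hello world", 5));  // hello

        // U+1F600 is one code point but two Java chars (a surrogate pair).
        String withEmoji = "ab\uD83D\uDE00cd";
        String cut = enforceMaxLength(withEmoji, 3);
        System.out.println(cut.length());                        // 4
        System.out.println(cut.codePointCount(0, cut.length())); // 3
    }
}
```

The last case shows that the limit is applied in characters (code points) rather than in bytes or Java chars: three characters are kept even though they occupy four chars in the Java String.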
If anyone has done performance analysis/benchmarking please share.