Does varchar perform better than string in Hive?

Since version 0.12 Hive supports the VARCHAR data type.

Will VARCHAR provide better performance than STRING in a typical analytical Hive query?


In Hive as used by IBM Big SQL, STRING is mapped by default to VARCHAR(32762). This means:

  • if a value exceeds that length, it is truncated
  • if the data does not need the maximum VARCHAR length (for example, if the column never exceeds 100 characters), unnecessary resources are allocated for handling that column
  • because this mapping to the SQL data type VARCHAR(32762) is the default for STRING, leaving it in place can lead to performance issues

    This explanation is based on IBM Big SQL, which uses Hive implicitly.

    IBM BigInsights doc reference


    The VARCHAR data type is also stored internally as a String. The only difference I see is that STRING has no declared length, while VARCHAR is declared with a maximum length of between 1 and 65,535 characters. I don't think there is any performance gain, because the internal implementation in both cases is String. I don't know much about Hive internals, but I can see the additional processing Hive does to truncate VARCHAR values. Below is the code (org.apache.hadoop.hive.common.type.HiveVarchar):

    public static String enforceMaxLength(String val, int maxLength) {
      String value = val;

      if (maxLength > 0) {
        int valLength = val.codePointCount(0, val.length());
        if (valLength > maxLength) {
          // Truncate the excess chars to fit the character length.
          // Also make sure we take supplementary chars into account.
          value = val.substring(0, val.offsetByCodePoints(0, maxLength));
        }
      }
      return value;
    }
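
    To make the extra work concrete, here is a minimal, self-contained sketch that copies the logic of the method above into a stand-alone class and runs it on two inputs. The class name VarcharTruncationSketch and the sample strings are my own and are not part of Hive. It only illustrates the per-value code point count and truncation; it is not a benchmark and says nothing about how large the cost is in a real query.

```java
// Hypothetical stand-alone class; only the body of enforceMaxLength
// mirrors the Hive excerpt quoted above.
class VarcharTruncationSketch {

    public static String enforceMaxLength(String val, int maxLength) {
        String value = val;
        if (maxLength > 0) {
            // Counting code points scans the whole value.
            int valLength = val.codePointCount(0, val.length());
            if (valLength > maxLength) {
                // Cut on a code point boundary so a surrogate pair is never split.
                value = val.substring(0, val.offsetByCodePoints(0, maxLength));
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // Plain ASCII value checked against a limit of 5 characters.
        System.out.println(enforceMaxLength("hello world", 5));

        // "a", then U+1F600 (one code point, two UTF-16 chars), then "b".
        String mixed = "a\uD83D\uDE00b";
        String cut = enforceMaxLength(mixed, 2);
        System.out.println(cut.length());                       // UTF-16 units kept
        System.out.println(cut.codePointCount(0, cut.length())); // code points kept
    }
}
```

    The limit is applied in code points rather than in Java chars, so the second value keeps three UTF-16 units for a declared length of 2. A STRING column skips this check entirely, which is the "additional processing" mentioned above.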
    

    If anyone has done performance analysis/benchmarking please share.
