Haskell and mutable structures' performance

I was studying the answers given in Optimizing Haskell code and noticed that, with a small input, Haskell did indeed run faster than Python.

But as the dataset grew, Python took the lead. A HashMap-based version improved the performance, but it still lagged behind.

Even worse, when I transliterated Python's dictionaries into mutable hashtables, I observed a severe performance hit. I really want to understand what's going on, as I'll need mutable structures for a future application.

Here's the slightly modified Python code:

#! /usr/bin/env python2.7
import random
import re
import cPickle

class Markov:
    def __init__(self, filenames):
        self.filenames = filenames
        self.cache = self.train(self.readfiles())
        picklefd = open("dump", "w")
        cPickle.dump(self.cache, picklefd)
    print "Built a db of length "+str(len(self.cache))
        picklefd.close()

    def train(self, text):
        splitted = text.split(' ')
        print "Total of %d splitted words" % (len(splitted))
        cache = {}
        for i in xrange(len(splitted)-2):
            pair = (splitted[i], splitted[i+1])
            followup = splitted[i+2]
            if pair in cache:
                if followup not in cache[pair]:
                    cache[pair][followup] = 1
                else:
                    cache[pair][followup] += 1
            else:
                cache[pair] = {followup: 1}
        return cache

    def readfiles(self):
        data = ""
        for filename in self.filenames:
            fd = open(filename)
            data += fd.read()
            fd.close()
        return data

Markov(["76.txt"])

Haskell, with the original answer's version (train4), a HashMap variant thereof (trainHM2), and the hashtable transliteration (trainHT):

{-# LANGUAGE BangPatterns,DeriveGeneric #-}

import GHC.Generics (Generic)

import Data.List (foldl')
import Data.Hashable

import qualified Data.Map as M

import qualified Data.HashMap.Strict as HM
import qualified Data.ByteString.Char8 as B

import qualified Data.HashTable.IO as HT

--Using this instead of tuples yielded a 5~10% speedup
data StringTuple = STP !B.ByteString !B.ByteString deriving(Ord,Eq,Generic)
instance Hashable StringTuple

type Database3 = M.Map StringTuple (M.Map B.ByteString Int)
type DatabaseHM = HM.HashMap StringTuple (HM.HashMap B.ByteString Int)
type DatabaseHT = HT.BasicHashTable StringTuple DatabaseInnerHT
type DatabaseInnerHT = (HT.BasicHashTable B.ByteString Int)

train4 :: [B.ByteString] -> Database3
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.insertWith' (inc z) (STP x y) (M.singleton z 1) m
          inc k _ = M.insertWith' (+) k 1

trainHM2 :: [B.ByteString] -> DatabaseHM
trainHM2 words = trainHM2G words HM.empty
    where 
    trainHM2G (x:y:[]) !hm = hm
    trainHM2G (x:y:z:rem) !hm = trainHM2G (y:z:rem) (HM.insertWith (inc z) (STP x y) (HM.singleton z 1) hm)
            where inc k _ = HM.insertWith (+) k 1


trainHT :: [B.ByteString] -> IO DatabaseHT
trainHT words = do
    hm <- HT.new
    trainHT' words hm
  where
    trainHT' (x:y:[]) !hm = return hm
    trainHT' (x:y:z:rem) !hm = do
        let pair = STP x y
        inCache <- HT.lookup hm pair
        case inCache of
            Nothing -> do
                -- first time we see this pair: fresh inner table with count 1
                htN <- HT.new :: IO DatabaseInnerHT
                HT.insert htN z $! 1
                HT.insert hm pair $! htN
            Just ht -> do
                cvM <- HT.lookup ht z
                case cvM of
                    Nothing -> HT.insert ht z 1
                    Just cv -> HT.insert ht z $! (cv + 1)
        trainHT' (y:z:rem) hm

main = do
    contents <- B.readFile "76.txt"
    let bcont = B.split ' ' contents
    print $ length bcont
    let db = train4 bcont
    print $ "Built a DB of " ++ show (M.size db) ++ " words"
    --let db = trainHM2 bcont
    --print $ "Built a DB of " ++ show (HM.size db) ++ " words"
    --db <- trainHT bcont
    --print $ "Built a DB"

A makeshift C++11 transliteration (as first written it needed -fpermissive to compile, since the std::hash specialization's operator() lacked a const qualifier; that is fixed below, but feel free to correct it further):

#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <unordered_map>
#include <tuple>

/*
 Hash stuff here
 Taken from https://stackoverflow.com/a/7111460/314327
*/
size_t hash_combiner(size_t left, size_t right) // replaceable
{ return left ^ right; }

template<int index, class...types>
struct hash_impl {
    size_t operator()(size_t a, const std::tuple<types...>& t) const {
        typedef typename std::tuple_element<index, std::tuple<types...>>::type nexttype;
        hash_impl<index-1, types...> next;
        size_t b = std::hash<nexttype>()(std::get<index>(t));
        return next(hash_combiner(a, b), t); 
    }
};
template<class...types>
struct hash_impl<0, types...> {
    size_t operator()(size_t a, const std::tuple<types...>& t) const {
        typedef typename std::tuple_element<0, std::tuple<types...>>::type nexttype;
        size_t b = std::hash<nexttype>()(std::get<0>(t));
        return hash_combiner(a, b); 
    }
};

namespace std {
    template<class...types>
    struct hash<std::tuple<types...>> {
        size_t operator()(const std::tuple<types...>& t) const { // const so unordered_map can invoke it
            const size_t begin = std::tuple_size<std::tuple<types...>>::value-1;
            return hash_impl<begin, types...>()(1, t); // 1 should be some larger value
        }
    };
}

/*
 Hash stuff end
*/

using namespace std;

/*
 Split, from https://stackoverflow.com/a/236803/314327
*/
vector<string> &split(const string &s, char delim, vector<string> &elems) {
 stringstream ss(s);
 string item;
 while (getline(ss, item, delim)) {
  elems.push_back(item);
 }
 return elems;
}

vector<string> split(const string &s, char delim) {
 vector<string> elems;
 split(s, delim, elems);
 return elems;
}
/*
 Split end
*/

typedef tuple<string,string> STP;

unordered_map< STP,unordered_map< string,int > > train(vector<string> &words)
{
 unordered_map< STP,unordered_map< string,int > > cache; 

 for (size_t i = 0; i + 2 < words.size(); i++) // i+2 < size() avoids unsigned underflow on short inputs
 {
  STP tup = make_tuple(words[i],words[i+1]);
  auto it = cache.find(tup);
  if(it!=cache.end())
  {
   auto it2 = it->second.find(words[i+2]);
   if(it2!=it->second.end())
   {
    it2->second += 1;
   }
   else
    it->second[words[i+2]] = 1;
  }
  else
  {    
   unordered_map< string,int > cacheInner;
   cacheInner[words[i+2]] = 1;
   cache[tup] = cacheInner;
  }
 }

 return cache;
}

int main()
{
 ifstream ifs("76.txt");
 stringstream buf;
 buf << ifs.rdbuf();
 string contents(buf.str());

 auto words = split(contents,' '); 
 cout << words.size(); 

 auto wordCache = train(words);

 cout << "nHashtable count " << wordCache.size();

 cout << "n";
 return 0;
}

And the results are:

C++ (GCC 4.6.3)

$ g++ -O3 -fpermissive -std=c++0x cpptest.cpp -o cpptest
$ /usr/bin/time -f "%E" ./cpptest
1255153

Hashtable count 64442
0:01.02

Python (2.7)

$ /usr/bin/time -f "%E" ./pytest.py 
Total of 1255153 splitted words
Built a db of length 64442
0:02.62

Haskell (GHC 7.4.1) - "train4"

$ ghc -fllvm -O2 -rtsopts -fforce-recomp -funbox-strict-fields hasktest.hs -o hasktest
[1 of 1] Compiling Main             ( hasktest.hs, hasktest.o )
Linking hasktest ...
$ /usr/bin/time -f "%E" ./hasktest
1255153
"Built a DB of 64442 words"
0:06.35

Haskell - "trainHM2"

$ /usr/bin/time -f "%E" ./hasktest
1255153
"Built a DB of 64442 words"
0:04.23

Haskell - "trainHT" - Using Basic variant (which is close to what Python does for dictionaries, I guess ?)

$ /usr/bin/time -f "%E" ./hasktest
1255153
"Built a DB"
0:10.42

Using Linear or Cuckoo for both tables

0:06.06
0:05.69

Using Cuckoo for the outermost table and Linear on the inside

0:04.17

Profiling showed that there was quite a lot of GC, so, with +RTS -A256M:
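(The binary accepts RTS flags directly because of -rtsopts in the GHC invocation above; the run was along these lines:)

$ /usr/bin/time -f "%E" ./hasktest +RTS -A256M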

0:02.11

For the input data, I chose 76.txt, as indicated in one of the answers, and duplicated the whole text 12 times; it amounts to about 7 MB.
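The duplication itself was a quick one-liner, something along these lines (the source filename is illustrative):

$ for i in $(seq 12); do cat pg76_orig.txt; done > 76.txt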

Tests were run on Ubuntu 12.04 inside a VirtualBox VM, using a single i5-520M core. Each test was run more than once, and all the results were pretty close.

That last result is pretty good for this microbenchmark, but is there anything else to improve in the code, considering that:

  • Cuckoo & Linear might be better suited to this dataset, but the "generic" Python solution is good to go without any such optimisation,
  • Valgrind reports that the C++ & Python versions take approximately 60 MB, whilst the Haskell RTS reports anywhere from 125 MB (Cuckoo & Linear) to 409 MB (Basic, larger heap) of memory for the same task. Wouldn't tweaking the garbage collector this much be detrimental in a production environment? Is it possible to refactor the code so it uses less memory?
Update:

I guess "reducing garbage" is what I'm looking for. I know Haskell doesn't work the same way C++ does, but I want to know whether it's possible to reduce the garbage produced in imperative code, since the C++ example consumed half the memory without any space leaks. That would hopefully be an improvement in both memory usage and execution time (as there would be less GC).

Update 2:

Computing the length during table construction has certainly reduced the memory footprint (down to around 40 MB, actually!), but it makes the GC take longer, resulting in a slower execution time (due to discarding values that had been lazily read from the list, I presume?).
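For reference, here is roughly what that change looks like as a sketch (trainHTCount is a made-up name; it reuses the types and helpers defined above): the word count is threaded through the loop instead of forcing the whole list with length beforehand.

trainHTCount :: [B.ByteString] -> IO (DatabaseHT, Int)
trainHTCount ws0 = do
    hm <- HT.new
    n  <- go ws0 (0 :: Int) hm
    return (hm, n)
  where
    -- same insertion logic as trainHT', plus a strict counter
    go (x:y:z:rest) !n hm = do
        let pair = STP x y
        inCache <- HT.lookup hm pair
        case inCache of
            Nothing -> do
                htN <- HT.new :: IO DatabaseInnerHT
                HT.insert htN z 1
                HT.insert hm pair htN
            Just ht -> do
                cvM <- HT.lookup ht z
                case cvM of
                    Nothing -> HT.insert ht z 1
                    Just cv -> HT.insert ht z $! cv + 1
        go (y:z:rest) (n + 1) hm
    go rest !n _ = return (n + length rest)  -- the trailing words still count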

And yes, the hashtables' operations do take a significant amount of time. I'll try mimicking alter to see whether that improves things further.


This isn't really an answer, but it's too much to put in comments, so I'll drop it here until something better comes along. I don't see anything obviously wrong with your hashtable code (the only part I really looked at), other than a bit of refactoring/golfing.

First, the memory usage isn't very surprising to me. With -A256M, you're requiring that the RTS have a minimum allocation area of 256 MB, which puts a floor on your memory usage. If data gets promoted or copied after a GC, the memory usage will grow. Also note that Haskell data structures tend to be a bit memory-hungry relative to other languages; see for example Memory footprint of Haskell data types. Given both of these factors, I'm not surprised by the total memory usage with a large allocation area.

Structures like the HashMap or a bytestring trie could use less memory, with the attendant downsides of using a data structure other than a hashtable.
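To sketch the trie idea (untested, and assuming the bytestring-trie package; DatabaseTrie and trainTrie are made-up names): keying the outer structure on the concatenated word pair lets entries sharing a first word share storage.

import qualified Data.Trie as T
import qualified Data.Trie.Convenience as TC

type DatabaseTrie = T.Trie (HM.HashMap B.ByteString Int)

-- merge each singleton inner map into whatever is already stored for the pair
trainTrie :: [B.ByteString] -> DatabaseTrie
trainTrie ws = foldl' step T.empty (zip3 ws (drop 1 ws) (drop 2 ws))
  where
    key x y = B.concat [x, B.singleton ' ', y]
    step t (x, y, z) = TC.insertWith (HM.unionWith (+)) (key x y) (HM.singleton z 1) t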

Speaking of the allocation area, this code is a bit of a microbenchmark in that nearly all the allocated data (mostly bytestring data and internal hashtable values) is long-lived (it lasts until the program ends). This puts your test program in a situation where a very large allocation area is particularly beneficial, whereas if this database construction were just part of a larger program, the costs of the larger area might become dominant.

As to the optimal GC settings for a production environment, it's very hard to tell outside the context of an actual, complete program. I can say that if performance really matters, it's worth spending some time tuning the GC flags, even more so if you've enabled the threaded runtime.
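For example, a first pass might just compare a few allocation-area sizes with -s enabled, which makes the RTS print a GC summary (invocations illustrative):

$ ./hasktest +RTS -s -A8M
$ ./hasktest +RTS -s -A64M
$ ./hasktest +RTS -s -A256M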

Aside from memory issues, I strongly suspect that the hashtables package is working against you here. According to a profile, the four costliest functions are lookup/go, lookup, insert, and delete'.go. I think that if it had an equivalent of Data.Map.alter, some of your operations could be coalesced for a performance gain. I would be very surprised if Python dictionaries weren't optimized for cases like cache[key] += 1, after all.
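To illustrate the coalescing: this is Data.Map's immutable API, not something hashtables provides, but it shows how the lookup-then-insert pair collapses into a single traversal.

-- insert 1 if the key is absent, bump the existing count otherwise
bump :: Ord k => k -> M.Map k Int -> M.Map k Int
bump = M.alter (Just . maybe 1 (+ 1))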
