How can I get a strict accumArray?

2018-06-22 16:18:29

I have a list of key-value pairs and I want to count how many times each key occurs and what values it occurs with, but when I try, I get a stack overflow. Here's a simplified version of the code I'm running:

import Array
add (n, vals) val = n `seq` vals `seq` (n+1,val:vals)
histo = accumArray add (0,[]) (0,9) [(0, n) | n <- [0..5000000]]
main = print histo

When I compile this with 'ghc -O' and run it, I get "Stack space overflow: current size 8388608 bytes."

I think I know what's going on: accumArray has the same properties as foldl, and so I need a strict version of accumArray. Unfortunately, the only one I've found is in Data.Array.Unboxed, which doesn't work for an array of lists.

The documentation says that when the accumulating function is strict, then accumArray should be too, but I can't get this to work, and the discussion here claims that the documentation is wrong (at least for GHC).

Is there a strict version of accumArray other than the one in Data.Array.Unboxed? Or is there a better way to do what I want?

Well, strict doesn't necessarily mean that no thunks are created, it just means that if an argument is bottom, the result is bottom too. But accumArray is not that strict, it just writes bottoms to the array if they occur. It can't really do anything else, since it must allow for non-strict functions that could produce defined values from intermediate bottoms. And the strictness analyser can't rewrite it so that the accumulation function is evaluated to WHNF on each write if it is strict, because that would change the semantics of the programme in a rather drastic way (an array containing some bottoms vs. bottom).

That said, I agree that there's an unfortunate lack of strict and eager functions in several areas.

For your problem, you can use a larger stack ( +RTS -K128M didn't suffice here, but 256M did), or you can use

import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import GHC.Arr

strictAccumArray :: Ix i => (e -> a -> e) -> e -> (i,i) -> [(i,a)] -> Array i e
strictAccumArray fun ini (l,u) ies = case iar of
                                       Array _ _ m barr -> Array l u m barr
  where
    iar = runSTArray $ do
        let n = safeRangeSize (l,u)
            stuff = [(safeIndex (l,u) n i, e) | (i, e) <- ies]
        arr <- newArray (0,n-1) ini
        let go ((i,v):ivs) = do
                old <- unsafeRead arr i
                unsafeWrite arr i $! fun old v
                go ivs
            go [] = return arr
        go stuff

With a strict write, the thunks are kept small, so there's no stack overflow. But beware, the lists take a lot of space, so if your list is too long, you may get a heap exhaustion.

Another option would be to use a Data.Map (or Data.IntMap , if the version of containers is 0.4.1.0 or later) instead of an array, since that comes with insertWith' , which forces the result of the combining function on use. The code could for example be

import qualified Data.Map as M  -- or Data.IntMap
import Data.List (foldl')

histo :: M.Map Int (Int,[Int])  -- M.IntMap (Int,[Int])
histo = foldl' upd M.empty [(0,n) | n <- [0 .. 15000000]]
  where
    upd mp (i,n) = M.insertWith' add i (1,[n]) mp
    add (j,val:_) (k,vals) = k `seq` vals `seq` (k+j,val:vals)
    add _ pr = pr  -- to avoid non-exhaustive pattern warning

Disadvantages of using a Map are

the combining function must have type a -> a -> a , so it needs to be a bit more complicated in your case.

an update is O(log size) instead of O(1), so for large histograms, it will be considerably slower.

Map s and IntMap s have some book-keeping overhead, so that will use more space than an array. But if the list of updates is large compared to the number of indices, the difference will be negligible (the overhead is k words per key, independent of the size of the values) in this case, where the size of the values grows with each update.

链接地址: http://www.djcxy.com/p/63632.html

上一篇: Haskell：并发数据结构指南

下一篇: 我怎样才能得到一个严格的accumArray？