creation with while loop + stacks

I'm a bit embarassed to admit this, but I seem to be pretty stumped by what should be a simple programming problem. I'm building a decision tree implementation, and have been using recursion to take a list of labeled samples, recursively split the list in half, and turn it into a tree.

Unfortunately, with deep trees I run into stack overflow errors (ha!), so my first thought was to use continuations to turn it into tail recursion. Unfortunately Scala doesn't support that kind of TCO, so the only solution is to use a trampoline. A trampoline seems kinda inefficient and I was hoping there would be some simple stack-based imperative solution to this problem, but I'm having a lot of trouble finding it.

The recursive version looks sort of like (simplified):

private def trainTree(samples: Seq[Sample], usedFeatures: Set[Int]): DTree = {
  if (shouldStop(samples)) {
    DTLeaf(makeProportions(samples))
  } else {
    val featureIdx = getSplittingFeature(samples, usedFeatures)
    val (statsWithFeature, statsWithoutFeature) = samples.partition(hasFeature(featureIdx, _))
    DTBranch(
      trainTree(statsWithFeature, usedFeatures + featureIdx), 
      trainTree(statsWithoutFeature, usedFeatures + featureIdx),
      featureIdx)
  }
}

So basically I'm recursively subdividing the list into two according to some feature of the data, and passing through a list of used features so I don't repeat - that's all handled in the "getSplittingFeature" function so we can ignore it. The code is really simple! Still, I'm having trouble figuring out a stack-based solution that doesn't just use closures and effectively become a trampoline. I know we'll at least have to keep around little "frames" of arguments in the stack but I would like to avoid closure calls.

I get that I should be writing out explicitly what the callstack and program counter handle for me implicitly in the recursive solution, but I'm having trouble doing that without continuations. At this point it's hardly even about efficiency, I'm just curious. So please, no need to remind me that premature optimization is the root of all evil and the trampoline-based solution will probably work just fine. I know it probably will - this is basically a puzzle for it's own sake.

Can anyone tell me what the canonical while-loop-and-stack-based solution to this sort of thing is?

UPDATE: Based on Thipor Kong's excellent solution, I've coded up a while-loops/stacks/hashtable based implementation of the algorithm that should be a direct translation of the recursive version. This is exactly what I was looking for:

FINAL UPDATE: I've used sequential integer indices, as well as putting everything back into arrays instead of maps for performance, added maxDepth support, and finally have a solution with the same performance as the recursive version (not sure about memory usage but I would guess less):

private def trainTreeNoMaxDepth(startingSamples: Seq[Sample], startingMaxDepth: Int): DTree = {
  // Use arraybuffer as dense mutable int-indexed map - no IndexOutOfBoundsException, just expand to fit
  type DenseIntMap[T] = ArrayBuffer[T]
  def updateIntMap[@specialized T](ab: DenseIntMap[T], idx: Int, item: T, dfault: T = null.asInstanceOf[T]) = {
    if (ab.length <= idx) {ab.insertAll(ab.length, Iterable.fill(idx - ab.length + 1)(dfault)) }
    ab.update(idx, item)
  }
  var currentChildId = 0 // get childIdx or create one if it's not there already
  def child(childMap: DenseIntMap[Int], heapIdx: Int) =
    if (childMap.length > heapIdx && childMap(heapIdx) != -1) childMap(heapIdx)
    else {currentChildId += 1; updateIntMap(childMap, heapIdx, currentChildId, -1); currentChildId }
  // go down
  val leftChildren, rightChildren = new DenseIntMap[Int]() // heapIdx -> childHeapIdx
  val todo = Stack((startingSamples, Set.empty[Int], startingMaxDepth, 0)) // samples, usedFeatures, maxDepth, heapIdx
  val branches = new Stack[(Int, Int)]() // heapIdx, featureIdx
  val nodes = new DenseIntMap[DTree]() // heapIdx -> node
  while (!todo.isEmpty) {
    val (samples, usedFeatures, maxDepth, heapIdx) = todo.pop()
    if (shouldStop(samples) || maxDepth == 0) {
      updateIntMap(nodes, heapIdx, DTLeaf(makeProportions(samples)))
    } else {
      val featureIdx = getSplittingFeature(samples, usedFeatures)
      val (statsWithFeature, statsWithoutFeature) = samples.partition(hasFeature(featureIdx, _))
      todo.push((statsWithFeature, usedFeatures + featureIdx, maxDepth - 1, child(leftChildren, heapIdx)))
      todo.push((statsWithoutFeature, usedFeatures + featureIdx, maxDepth - 1, child(rightChildren, heapIdx)))
      branches.push((heapIdx, featureIdx))
    }
  }
  // go up
  while (!branches.isEmpty) {
    val (heapIdx, featureIdx) = branches.pop()
    updateIntMap(nodes, heapIdx, DTBranch(nodes(child(leftChildren, heapIdx)), nodes(child(rightChildren, heapIdx)), featureIdx))
  }
  nodes(0)
}

Just store the binary tree in an array, as described on Wikipedia: For node i , the left child goes into 2*i+1 and the right child in to 2*i+2 . When doing "down", you keep a collection of todos, that still have to be splitted to reach a leaf. Once you've got only leafs, to go upward (from right to left in the array) to build the decision nodes:

Update: A cleaned up version, that also supports the features stored int the branches (type parameter B) and that is more functional/fully pure and that supports sparse trees with a map as suggested by ron.

Update2-3: Make economical use of name space for node ids and abstract over type of ids to allow of large trees. Take node ids from Stream.

sealed trait DTree[A, B]
case class DTLeaf[A, B](a: A, b: B) extends DTree[A, B]
case class DTBranch[A, B](left: DTree[A, B], right: DTree[A, B], b: B) extends DTree[A, B]

def mktree[A, B, Id](a: A, b: B, split: (A, B) => Option[(A, A, B)], ids: Stream[Id]) = {
  @tailrec
  def goDown(todo: Seq[(A, B, Id)], branches: Seq[(Id, B, Id, Id)], leafs: Map[Id, DTree[A, B]], ids: Stream[Id]): (Seq[(Id, B, Id, Id)], Map[Id, DTree[A, B]]) =
    todo match {
      case Nil => (branches, leafs)
      case (a, b, id) :: rest =>
        split(a, b) match {
          case None =>
            goDown(rest, branches, leafs + (id -> DTLeaf(a, b)), ids)
          case Some((left, right, b2)) =>
            val leftId #:: rightId #:: idRest = ids
            goDown((right, b2, rightId) +: (left, b2, leftId) +: rest, (id, b2, leftId, rightId) +: branches, leafs, idRest)
        }
    }

  @tailrec
  def goUp[A, B](branches: Seq[(Id, B, Id, Id)], nodes: Map[Id, DTree[A, B]]): Map[Id, DTree[A, B]] =
    branches match {
      case Nil => nodes
      case (id, b, leftId, rightId) :: rest =>
        goUp(rest, nodes + (id -> DTBranch(nodes(leftId), nodes(rightId), b)))
    }

  val rootId #:: restIds = ids
  val (branches, leafs) = goDown(Seq((a, b, rootId)), Seq(), Map(), restIds)
  goUp(branches, leafs)(rootId)
}

// try it out

def split(xs: Seq[Int], b: Int) =
  if (xs.size > 1) {
    val (left, right) = xs.splitAt(xs.size / 2)
    Some((left, right, b + 1))
  } else {
    None
  }

val tree = mktree(0 to 1000, 0, split _, Stream.from(0))
println(tree)
链接地址: http://www.djcxy.com/p/10762.html

上一篇: 编译xml布局文件并使用它们

下一篇: 用while循环+堆栈创建