Scanning a HUGE JSON file for deserializable data in Scala

I need to be able to process large JSON files, instantiating objects from deserializable sub-strings as we are iterating-over/streaming-in the file.

For example:

Let's say I can only deserialize into instances of the following:

case class Data(val a: Int, val b: Int, val c: Int)

and the expected JSON format is:

{   "foo": [ {"a": 0, "b": 0, "c": 0 }, {"a": 0, "b": 0, "c": 1 } ], 
    "bar": [ {"a": 1, "b": 0, "c": 0 }, {"a": 1, "b": 0, "c": 1 } ], 
     .... MANY ITEMS .... , 
    "qux": [ {"a": 0, "b": 0, "c": 0 }  }

What I would like to do is:

import com.codahale.jerkson.Json
val dataSeq : Seq[Data] = Json.advanceToValue("foo").stream[Data](fileStream)
// NOTE: this will not compile since I pulled the "advanceToValue" out of thin air.

As a final note, I would prefer to find a solution that involves Jerkson or any other libraries that comes with the Play framework, but if another Scala library handles this scenario with greater ease and decent performance: I'm not opposed to trying another library. If there is a clean way of manually seeking through the file and then using a Json library to continue parsing from there: I'm fine with that.

What I do not want to do is ingest the entire file without streaming or using an iterator, as keeping the entire file in memory at a time would be prohibitively expensive.


I have not done it with JSON (and I hope someone will come up with a turnkey solution for you) but done it with XML and here is a way of handling it.

It is basically a simple Map->Reduce process with the help of stream parser.

Map (your advanceTo )

Use a streaming parser like JSON Simple (not tested). When on the callback you match your "path", collect anything below by writing it to a stream (file backed or in-memory, depending on your data). That will be your foo array in your example. If your mapper is sophisticated enough, you may want to collect multiple paths during the map step.

Reduce (your stream[Data] )

Since the streams you collected above look pretty small, you probably do not need to map/split them again and you can parse them directly in memory as JSON objects/arrays and manipulate them (transform, recombine, etc...).


Here is the current way I am solving the problem:

import collection.immutable.PagedSeq
import util.parsing.input.PagedSeqReader
import com.codahale.jerkson.Json
import collection.mutable

private def fileContent = new PagedSeqReader(PagedSeq.fromFile("/home/me/data.json"))
private val clearAndStop = ']'

private def takeUntil(readerInitial: PagedSeqReader, text: String) : Taken = {
  val str = new StringBuilder()
  var readerFinal = readerInitial

  while(!readerFinal.atEnd && !str.endsWith(text)) {
    str += readerFinal.first
    readerFinal = readerFinal.rest
  }

  if (!str.endsWith(text) || str.contains(clearAndStop))
    Taken(readerFinal, None)
  else
    Taken(readerFinal, Some(str.toString))
}

private def takeUntil(readerInitial: PagedSeqReader, chars: Char*) : Taken = {
  var taken = Taken(readerInitial, None)
  chars.foreach(ch => taken = takeUntil(taken.reader, ch.toString))

  taken
}

def getJsonData() : Seq[Data] = {
  var data = mutable.ListBuffer[Data]()
  var taken = takeUntil(fileContent, ""foo"")
  taken = takeUntil(taken.reader, ':', '[')

  var doneFirst = false
  while(taken.text != None) {
    if (!doneFirst)
      doneFirst = true
    else
      taken = takeUntil(taken.reader, ',')

    taken = takeUntil(taken.reader, '}')
    if (taken.text != None) {
      print(taken.text.get)
      places += Json.parse[Data](taken.text.get)
    }
  }

  data
}

case class Taken(reader: PagedSeqReader, text: Option[String])
case class Data(val a: Int, val b: Int, val c: Int)

Granted, This code doesn't exactly handle malformed JSON very cleanly and to use for multiple top-level keys "foo", "bar" and "qux", will require looking ahead (or matching from a list of possible top-level keys), but in general: I believe this does the job. It's not quite as functional as I'd like and isn't super robust but PagedSeqReader definitely keeps this from getting too messy.

链接地址: http://www.djcxy.com/p/11738.html

上一篇: 在js修改输入的值后,“滚动”到输入的末尾

下一篇: 扫描一个巨大的JSON文件,用于Scala中的反序列化数据