Parsing huge logfiles in Node.js

I need to do some parsing of large (5-10 GB) logfiles in JavaScript/Node.js (I'm using Cube).

The logline looks something like:

10:00:43.343423 I'm a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".

We need to read each line, do some parsing (e.g. strip out 5, 7 and SUCCESS), then pump this data into Cube (https://github.com/square/cube) using their JS client.
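
For concreteness, this is roughly the kind of extraction I have in mind for the sample line above (the field names and the exact regex are only illustrative):

    // Illustrative only: pull the timestamp, the two counts and the state out of one line.
    var line = '10:00:43.343423 I\'m a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".';

    var match = line.match(/^(\S+) .* (\d+) cats, and (\d+) dogs\. .* state "(\w+)"\./);
    if (match) {
        var event = {
            time: match[1],                // still a string; needs converting to a Date
            cats: parseInt(match[2], 10),  // 5
            dogs: parseInt(match[3], 10),  // 7
            state: match[4]                // "SUCCESS"
        };
    }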

Firstly, what is the canonical way in Node to read in a file, line by line?

It seems to be a fairly common question online:

  • http://www.quora.com/What-is-the-best-way-to-read-a-file-line-by-line-in-node-js
  • Read a file one line at a time in node.js?

A lot of the answers seem to point to a bunch of third-party modules:

  • https://github.com/nickewing/line-reader
  • https://github.com/jahewson/node-byline
  • https://github.com/pkrumins/node-lazy
  • https://github.com/Gagle/Node-BufferedReader

However, this seems like a fairly basic task - surely there's a simple way within the stdlib to read in a textfile, line by line?

Secondly, I then need to process each line (e.g. convert the timestamp into a Date object, and extract useful fields).

What's the best way to do this, maximising throughput? Is there some way that won't block on either reading in each line, or on sending it to Cube?

Thirdly - I'm guessing using string splits and the JS equivalent of contains (indexOf() != -1?) will be a lot faster than regexes? Has anybody had much experience parsing massive amounts of text data in Node.js?
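
By "string splits / indexOf" I mean something along these lines (untested, just to show the shape of the non-regex approach):

    // Untested sketch of the indexOf()/slice() approach, as an alternative to a regex.
    function parseLine(line) {
        var time = line.slice(0, line.indexOf(' '));   // "10:00:43.343423"

        var catsEnd = line.indexOf(' cats');
        var cats = parseInt(line.slice(line.lastIndexOf(' ', catsEnd - 1) + 1, catsEnd), 10);

        var dogsEnd = line.indexOf(' dogs');
        var dogs = parseInt(line.slice(line.lastIndexOf(' ', dogsEnd - 1) + 1, dogsEnd), 10);

        var quoteStart = line.indexOf('"');
        var state = line.slice(quoteStart + 1, line.indexOf('"', quoteStart + 1));

        return { time: time, cats: cats, dogs: dogs, state: state };
    }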

Cheers, Victor


I searched for a solution to parse very large files (GBs) line by line using a stream. All the third-party libraries and examples did not suit my needs, since they either did not process the files strictly line by line (line 1, 2, 3, 4, ...) or read the entire file into memory.

The following solution can parse very large files, line by line, using stream & pipe. For testing I used a 2.1 GB file with 17,000,000 records. RAM usage did not exceed 60 MB.

    var fs = require('fs')
        , es = require('event-stream');
    
    var lineNr = 0;

    // stand-in helper so the snippet runs as-is: logs current memory usage for a line number
    function logMemoryUsage(lineNr) {
        console.log('line ' + lineNr + ': ' +
            Math.round(process.memoryUsage().rss / 1048576) + ' MB resident');
    }
    
    var s = fs.createReadStream('very-large-file.csv')
        .pipe(es.split())
        .pipe(es.mapSync(function(line){
    
            // pause the readstream
            s.pause();
    
            lineNr += 1;
    
            // process line here and call s.resume() when ready
            // (logMemoryUsage is the helper defined above, used here to track RAM use)
            logMemoryUsage(lineNr);
    
            // resume the readstream, possibly from a callback
            s.resume();
        })
        .on('error', function(err){
            console.log('Error while reading file.', err);
        })
        .on('end', function(){
            console.log('Read entire file.')
        })
    );
    


Please let me know how it goes!


You can use the built-in readline module (docs: https://nodejs.org/api/readline.html). I use the stream module to create a new output stream.

    var fs = require('fs'),
        readline = require('readline'),
        stream = require('stream');
    
    var instream = fs.createReadStream('/path/to/file');
    var outstream = new stream;
    outstream.readable = true;
    outstream.writable = true;
    
    var rl = readline.createInterface({
        input: instream,
        output: outstream,
        terminal: false
    });
    
    rl.on('line', function(line) {
        console.log(line);
        // Do your stuff: parse the line into whatever you need to send on
        var cubestuff = line; // placeholder for your processed data
        // Then write to the output stream
        rl.write(cubestuff);
    });
    

Large files will take some time to process. Do tell if it works.
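
On newer Node.js versions (v11.4+ / v10.16+, where readline interfaces are async-iterable) the same idea can be written without a separate output stream at all - a minimal sketch:

    const fs = require('fs');
    const readline = require('readline');

    async function processLogFile(path) {
        const rl = readline.createInterface({
            input: fs.createReadStream(path),
            crlfDelay: Infinity   // treat \r\n as a single line break
        });

        // Backpressure is handled for you: the next line is only pulled
        // once the loop body (including any awaited work) has finished.
        for await (const line of rl) {
            // parse the line and send it to Cube here
            console.log(line);
        }
    }

    processLogFile('/path/to/file').catch(console.error);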


I really liked @gerard's answer, which actually deserves to be the accepted answer here. I made some improvements:

  • Code is in a class (modular)
  • Parsing is included
  • The ability to resume is exposed to the caller, in case an asynchronous job is chained to reading the CSV, like inserting into a DB or making an HTTP request
  • Reading in chunk/batch sizes that the user can declare. I took care of encoding in the stream too, in case you have files in a different encoding.

Here's the code:

    'use strict'
    
    const fs = require('fs'),
        util = require('util'),
        stream = require('stream'),
        es = require('event-stream'),
        parse = require("csv-parse"),
        iconv = require('iconv-lite');
    
    class CSVReader {
      constructor(filename, batchSize, columns) {
        this.reader = fs.createReadStream(filename).pipe(iconv.decodeStream('utf8'))
        this.batchSize = batchSize || 1000
        this.lineNumber = 0
        this.data = []
        this.parseOptions = {delimiter: '\t', columns: true, escape: '/', relax: true}
      }
    
      read(callback) {
        this.reader
          .pipe(es.split())
          .pipe(es.mapSync(line => {
            ++this.lineNumber
    
            parse(line, this.parseOptions, (err, d) => {
              if (!err && d.length) {
                this.data.push(d[0])
              }
            })

            if (this.lineNumber % this.batchSize === 0) {
              // pause until the caller has processed the batch and calls continue()
              this.reader.pause()
              callback(this.data)
            }
          })
          .on('error', function(){
              console.log('Error while reading file.')
          })
          .on('end', function(){
              console.log('Read entire file.')
          }))
      }
    
      continue () {
        this.data = []
        this.reader.resume()
      }
    }
    
    module.exports = CSVReader
    

So basically, here is how you would use it:

    let reader = new CSVReader('path_to_file.csv')
    reader.read(() => reader.continue())
    

I tested this with a 35 GB CSV file and it worked for me, which is why I chose to build on @gerard's answer. Feedback is welcome.
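
For example, if each batch has to go through an asynchronous step before reading continues, the callback/continue pair can be chained like this (a sketch: the require path and saveToDb are hypothetical stand-ins for your own module path and DB insert or HTTP request):

    // saveToDb is a hypothetical async function (e.g. a bulk DB insert) returning a Promise.
    const CSVReader = require('./csv-reader')

    let reader = new CSVReader('path_to_file.csv', 1000)
    reader.read(batch => {
        saveToDb(batch)
            .then(() => reader.continue())  // only resume reading once the batch is persisted
            .catch(err => console.log('Error while saving batch.', err))
    })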
