Parsing huge logfiles in Node.js

2018-06-18 12:57:31

I need to parse large (5-10 GB) logfiles in JavaScript / Node.js (I'm using Cube).

A log line looks something like this:

10:00:43.343423 I'm a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".

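We need to read each line, do some parsing (e.g. strip out 5, 7 and SUCCESS), then pump this data into Cube (https://github.com/square/cube) using their JS client.

Firstly, what is the canonical way in Node to read in a file, line by line?

It seems to be a fairly common question online: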

http://www.quora.com/What-is-the-best-way-to-read-a-file-line-by-line-in-node-js

Read a file one line at a time in node.js?

A lot of the answers seem to point to a bunch of third-party modules:

https://github.com/nickewing/line-reader

https://github.com/jahewson/node-byline

https://github.com/pkrumins/node-lazy

https://github.com/Gagle/Node-BufferedReader

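However, this seems like a fairly basic task - surely there's a simple way within the stdlib to read in a text file, line by line?

Secondly, I then need to process each line (e.g. convert the timestamp into a Date object, and extract useful fields).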
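To make the per-line step concrete, here is a minimal sketch of parsing the sample line above. The helper name `parseLogLine`, the field names (`cats`, `dogs`, `state`) and the regular expression are assumptions made for illustration, not part of Cube or of the original question. The log line only carries a time of day, so the calendar date has to be supplied by the caller; the date used below is arbitrary.

```javascript
// Hypothetical helper: parse one log line of the shape shown in the question.
// The regex and the field names are illustrative assumptions.
const LINE_RE = /^(\d{2}):(\d{2}):(\d{2})\.(\d+) .*?(\d+) cats, and (\d+) dogs\. We are in state "([^"]*)"/;

function parseLogLine(line, day) {
  const m = LINE_RE.exec(line);
  if (m === null) return null; // line does not have the expected shape

  // Date only has millisecond resolution, so the microsecond digits are truncated.
  const millis = Number(m[4].slice(0, 3).padEnd(3, '0'));
  const time = new Date(day.getFullYear(), day.getMonth(), day.getDate(),
                        Number(m[1]), Number(m[2]), Number(m[3]), millis);

  return { time: time, cats: Number(m[5]), dogs: Number(m[6]), state: m[7] };
}

const sample = '10:00:43.343423 I\'m a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".';
const record = parseLogLine(sample, new Date(2018, 0, 1)); // arbitrary example date
console.log(record.cats, record.dogs, record.state);
```

Returning `null` for lines that do not match keeps the caller in charge of what to do with malformed input (skip, count, or log it).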

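What's the best way to do this while maximising throughput? Is there some way that won't block on either reading in each line, or on sending it to Cube?

Thirdly - I'm guessing that using string splits and the JS equivalent of contains (indexOf != -1?) will be a lot faster than regexes? Has anybody had much experience in parsing massive amounts of text data in Node.js?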
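For comparison, this is roughly what the regex-free approach could look like for the quoted state field. `extractState` and the `state "` marker are hypothetical names chosen for this sketch; whether this is actually faster than a regular expression depends on the engine and the data, so it is worth measuring on a real sample before committing to either.

```javascript
// Hypothetical helper: pull the quoted state out of a line using only indexOf/slice.
function extractState(line) {
  const marker = 'state "';
  const start = line.indexOf(marker);
  if (start === -1) return null; // the "contains" check from the question

  const from = start + marker.length;
  const end = line.indexOf('"', from); // closing quote of the state value
  return end === -1 ? null : line.slice(from, end);
}

const stateSample = '10:00:43.343423 I\'m a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".';
console.log(extractState(stateSample));
```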

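Cheers, Victor

I searched for a solution to parse very large files (GBs) line by line using a stream. None of the third-party libraries and examples suited my needs, since they either did not process the file strictly line by line (1, 2, 3, 4 ...) or read the entire file into memory.

The following solution can parse very large files line by line using stream & pipe. For testing I used a 2.1 GB file with 170,000 records. RAM usage did not exceed 60 MB.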

var fs = require('fs')
    , es = require('event-stream');

var lineNr = 0;

var s = fs.createReadStream('very-large-file.csv')
    .pipe(es.split())
    .pipe(es.mapSync(function(line){

        // pause the readstream
        s.pause();

        lineNr += 1;

        // process line here and call s.resume() when ready
        // logMemoryUsage is the author's own helper (not shown here) for logging memory usage
        logMemoryUsage(lineNr);

        // resume the readstream, possibly from a callback
        s.resume();
    })
    .on('error', function(err){
        console.log('Error while reading file.', err);
    })
    .on('end', function(){
        console.log('Read entire file.')
    })
);
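The snippet above calls `logMemoryUsage`, which the answer does not show. A possible stand-in, assuming all it needs to do is print the process's resident set size every so often, might look like this. The reporting interval and the output format are assumptions.

```javascript
// Hypothetical stand-in for the logMemoryUsage helper that the answer does not show.
function formatMemoryUsage(lineNr, usage) {
  const mb = (usage.rss / (1024 * 1024)).toFixed(1);
  return 'line ' + lineNr + ': rss ' + mb + ' MB';
}

function logMemoryUsage(lineNr, every) {
  every = every || 100000; // report only every N lines so logging stays cheap
  if (lineNr % every !== 0) return false;
  console.log(formatMemoryUsage(lineNr, process.memoryUsage()));
  return true;
}
```

`process.memoryUsage()` is part of Node's standard library; `rss` is the total memory held by the process, which is the figure most comparable to the 60 MB mentioned above.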


Please let me know how it goes!

You could use the built-in readline package; see the docs. I use stream to create a new output stream.

var fs = require('fs'),
    readline = require('readline'),
    stream = require('stream');

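var instream = fs.createReadStream('/path/to/file');
// PassThrough is a real readable and writable stream; pipe it wherever the results should go
var outstream = new stream.PassThrough();
outstream.pipe(process.stdout);

var rl = readline.createInterface({
    input: instream,
    terminal: false
});

rl.on('line', function(line) {
    // Do your stuff ...
    // Then write the result to outstream (here the line is passed through unchanged).
    // rl.write() would feed the data back into readline's input, so write to outstream directly.
    outstream.write(line + '\n');
});

// End the output stream once the whole input has been read
rl.on('close', function() {
    outstream.end();
});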

Large files will take some time to process. Do tell if it works.
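If the per-line work is asynchronous (for example sending each record to Cube), one option in recent Node.js versions is to iterate over the readline interface with `for await`. The sketch below assumes a caller-supplied `handleLine` callback, which is a hypothetical name; readline may still read ahead internally, so this orders the work but is not a strict memory guarantee.

```javascript
const fs = require('fs');
const readline = require('readline');

// handleLine is a hypothetical async callback, e.g. one that sends a record to Cube.
async function processFile(path, handleLine) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity // treat \r\n as a single line break
  });

  let count = 0;
  for await (const line of rl) {
    await handleLine(line); // the loop body finishes before the next line is handled
    count += 1;
  }
  return count; // number of lines processed
}
```

A call could look like `processFile('/path/to/file', async function (line) { /* parse and send */ })`.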

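I really liked @gerard's answer, which actually deserves to be the accepted answer here. I made some improvements:

The code is in a class (modular)

Parsing is included

The ability to resume is handed to the caller, in case an asynchronous job is chained to reading the CSV (such as inserting into a database or making an HTTP request)

Reading happens in chunks/batches whose size the user can declare. I also took care of the encoding in the stream, in case you have files in a different encoding.

Here's the code: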

'use strict'

const fs = require('fs'),
    util = require('util'),
    stream = require('stream'),
    es = require('event-stream'),
    parse = require("csv-parse"),
    iconv = require('iconv-lite');

class CSVReader {
  constructor(filename, batchSize, columns) {
    this.reader = fs.createReadStream(filename).pipe(iconv.decodeStream('utf8'))
    this.batchSize = batchSize || 1000
    this.lineNumber = 0
    this.data = []
    // tab-separated input; '\t' is the tab character
    this.parseOptions = {delimiter: '\t', columns: true, escape: '/', relax: true}
  }

  read(callback) {
    this.reader
      .pipe(es.split())
      .pipe(es.mapSync(line => {
        ++this.lineNumber

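        parse(line, this.parseOptions, (err, d) => {
          // skip lines that fail to parse or yield no record instead of throwing on d[0]
          if (err || !d || d.length === 0) return
          this.data.push(d[0])
        })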

        if (this.lineNumber % this.batchSize === 0) {
          callback(this.data)
        }
      })
      .on('error', function(){
          console.log('Error while reading file.')
      })
      .on('end', function(){
          console.log('Read entire file.')
      }))
  }

  continue () {
    this.data = []
    this.reader.resume()
  }
}

module.exports = CSVReader

So basically, here is how you would use it:

let reader = new CSVReader('path_to_file.csv')
reader.read(() => reader.continue())

I tested this with a 35 GB CSV file and it worked for me, which is why I chose to build on @gerard's answer. Feedback is welcome.
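The batching idea can also be sketched without any third-party module, as an async generator that groups lines from any iterable into arrays of a given size. This is not the class above, just an alternative illustration; `inBatches` is a hypothetical name, and `insertIntoDb` in the usage comment stands for whatever asynchronous job consumes a batch.

```javascript
// Hypothetical helper: group any async (or sync) iterable of lines into arrays of `size`.
async function* inBatches(lines, size) {
  let batch = [];
  for await (const line of lines) {
    batch.push(line);
    if (batch.length === size) {
      yield batch; // nothing more is pulled until the consumer asks for the next batch
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final, partial batch
}

// Possible usage with a readline interface `rl` and a hypothetical insertIntoDb():
//   for await (const batch of inBatches(rl, 1000)) {
//     await insertIntoDb(batch);
//   }
```

Because the generator only advances when the consumer requests the next batch, the "continue" step is implicit in the `await` rather than an explicit method call.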
