Load a large dataset into crossfilter/dc.js

I built a visualization with several dimensions and groups in dc.js to display the data. The data being visualized is bike trip data, and every trip is loaded. Right now there are over 750,000 rows of data. The JSON file I'm using is 70 MB, and it will only grow as I receive more data over the coming months.

So my question is: how can I make the data leaner so it scales well? Right now it takes about 15 seconds to load on my internet connection, but I'm worried it will take far too long once I have too much data. Additionally, I tried (unsuccessfully) to display a progress bar/spinner while the data loads.

The data columns I need are start_date, start_time, usertype, gender, tripduration, meters, age. I shortened these fields in the JSON to start_date, start_time, u, g, dur, m, age to make the file smaller. On the crossfilter there is a line chart at the top showing the total number of trips per day. Below that are row charts for day of week (calculated from the data) and month (also calculated), and pie charts for user type, gender, and age. Below those are two bar charts for start_time (rounded down to the hour) and tripduration (rounded to the minute).

The project is on GitHub: https://github.com/shaunjacobsen/divvy_explorer (the dataset is in data2.json). I tried to create a jsfiddle, but it doesn't work (probably because of the data, even when collecting only 1,000 rows and loading it into the HTML with <pre> tags): http://jsfiddle.net/QLCS2/

Ideally it would work so that only the data for the top chart loads first: since that is just a count of trips per day, it would load quickly. But once a user drills into the other charts, it would progressively fetch more data for the details. Any ideas on how to get this working?


I'd suggest shortening all of your field names in the JSON to one character (including "start_date" and "start_time"). That should help a little. Also, make sure that compression is turned on on your server. That way the data sent to the browser will be compressed automatically in transit, which should speed things up significantly if it's not already turned on.

For better responsiveness, I'd also recommend first setting up your crossfilter (empty), then all your dimensions and groups and all your dc.js charts, and then using crossfilter.add() to add chunks of data into the crossfilter. The easiest way to do this is to divide your data up into bite-sized chunks (a few MB each) and load them serially. So if you are using d3.json, start the next file load in the callback of the previous one. This results in a bunch of nested callbacks, which is a bit nasty, but it should allow the user interface to stay responsive while the data is loading.
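A minimal sketch of that serial loading flow, with the loader passed in as a parameter (in the page it would be d3.json; the names below are hypothetical):

```javascript
// Load chunks one at a time: the next request starts only inside the
// callback of the previous one, so the UI thread stays responsive
// and chunks arrive in order.
function loadChunksSerially(urls, loadChunk, onChunk, onDone) {
  function next(i) {
    if (i >= urls.length) { onDone(null); return; }
    loadChunk(urls[i], function (error, rows) {
      if (error) { onDone(error); return; }
      onChunk(rows); // e.g. ndx.add(rows); dc.redrawAll();
      next(i + 1);
    });
  }
  next(0);
}
```

In the page this might be invoked as loadChunksSerially(['data2-part1.json', 'data2-part2.json'], d3.json, function (rows) { ndx.add(rows); dc.redrawAll(); }, function () {}), where the part file names are an assumption about how the 70 MB file gets split.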

Finally, with this much data I believe you will start running into performance problems in the browser, not just while loading the data. I suspect you are already seeing this, and that the 15-second pause you observe is at least partly in the browser. You can check by profiling in your browser's developer tools. To address this, you will have to profile, identify the performance bottlenecks, and then try to optimize them. Also, be sure to test on slower computers if they are in your audience.


Consider my class design. It doesn't match your requirements, but it illustrates my point.

public class MyDataModel
{
    public List<MyDatum> Data { get; set; }
}

public class MyDatum
{
    public long StartDate { get; set; }
    public long EndDate { get; set; }
    public int Duration { get; set; }
    public string Title { get; set; }
}

The start and end dates are Unix timestamps, and the duration is in seconds.

Serializes to: {"Data":[{"StartDate":1441256019,"EndDate":1441257181,"Duration":451,"Title":"Rad is a cool word."},...]}

One row of data is 92 characters.

Let's start compressing! Convert the dates and times to base-60 strings. Store everything in an array of arrays of strings.

public class MyDataModel
{
    public List<List<string>> Data { get; set; }
}

Serializes to: {"Data":[["1pCSrd","1pCTD1","7V","Rad is a cool word."],...]}

One row of data is now 47 characters. moment.js is a good library for working with dates and times; it has built-in functions for unpacking the base-60 format.
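For illustration, here is a codec sketch for that base-60 format. This alphabet (digits, then upper case, then lower case) reproduces the sample strings above, e.g. 451 encodes to "7V" and 1441256019 to "1pCSrd", though whatever moment.js helper the answer has in mind may differ:

```javascript
// 60-character alphabet: 0-9, A-Z, then a-x (10 + 26 + 24 = 60 symbols).
var ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwx';

// Encode a non-negative integer (e.g. a Unix timestamp) as a base-60 string.
function encode60(n) {
  var s = '';
  do {
    s = ALPHABET[n % 60] + s;
    n = Math.floor(n / 60);
  } while (n > 0);
  return s;
}

// Decode a base-60 string back to the original integer.
function decode60(s) {
  var n = 0;
  for (var i = 0; i < s.length; i++) {
    n = n * 60 + ALPHABET.indexOf(s[i]);
  }
  return n;
}
```

Timestamps around 1.4 billion fit in six base-60 digits instead of ten decimal ones, which is where most of the 92-to-47-character saving comes from.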

Using arrays of arrays will make your code less readable, so add comments to document it.

Load only the most recent 90 days. Zoom to 30 days. When the user drags the brush left on the range chart, start fetching more data in 90-day chunks until the user stops dragging. Add the data to the existing crossfilter using its add method.

As you add more and more data you will notice that your charts get less and less responsive. That's because you are rendering hundreds or even thousands of elements in your SVG, and the browser is getting crushed. Use the d3 quantize function to group data points into buckets. Reduce the displayed data to 50 buckets.
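d3 ships quantize scales for exactly this; the sketch below shows the underlying idea in plain JavaScript (the function names and the bucket-averaging strategy are my own, not the d3 API):

```javascript
// Map a value in [lo, hi] to one of nBuckets equal-width buckets.
function bucketIndex(value, lo, hi, nBuckets) {
  var i = Math.floor((value - lo) / (hi - lo) * nBuckets);
  return Math.max(0, Math.min(nBuckets - 1, i)); // clamp the hi edge into the last bucket
}

// Collapse a large point set into at most nBuckets averaged points,
// so the chart renders a bounded number of SVG elements.
function quantizePoints(values, nBuckets) {
  var lo = Math.min.apply(null, values),
      hi = Math.max.apply(null, values),
      sums = new Array(nBuckets).fill(0),
      counts = new Array(nBuckets).fill(0);
  values.forEach(function (v) {
    var i = bucketIndex(v, lo, hi, nBuckets);
    sums[i] += v;
    counts[i] += 1;
  });
  return sums.map(function (s, i) { return counts[i] ? s / counts[i] : null; });
}
```

With nBuckets fixed at 50, the rendered element count stays constant no matter how many trips are loaded.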

Quantizing is worth the effort, and it's the only way you can create a scalable graph with a continuously growing dataset.

Your other option is to abandon the range chart and group the data by month, day, and hour. Then add a date-range picker. Since your data would be grouped by month, day, and hour, you'll never have a result set larger than 8,766 rows, even if you ride your bike every hour of every day.
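A sketch of such a group key: truncating each trip's start to the hour caps the number of distinct rows at the number of hours covered (8,766 is roughly 365.25 days times 24 hours, i.e. one year of hourly groups):

```javascript
// Truncate a Date to the hour. Used as a crossfilter dimension key,
// every trip starting in the same hour collapses into one group row.
function hourKey(date) {
  return new Date(date.getFullYear(), date.getMonth(), date.getDate(),
                  date.getHours());
}
```

In the app this could back a dimension like ndx.dimension(function (d) { return hourKey(d.start_date); }), assuming start_date has already been parsed into a Date.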


I've seen a similar problem (working at an enterprise company), and I found a few ideas worth trying.

  • Your data has a regular structure, so you can put the keys in the first row and only data in the following rows, imitating CSV (header first, data next)
  • Datetimes can be changed to epoch numbers (and you can move the start of the epoch to 01/01/2015 and compute the real date on receipt)
  • Use oboe.js (http://oboejs.com/) when fetching the response from the server; since the dataset is large, consider using oboe.drop during loading
  • Update the visualization with a JavaScript timer
  • Timer sample

    var datacnt=0;
    var timerId=setInterval(function () {
        d3.select("#count-data-current").text(datacnt);
        // updating the visualization should go here, e.g. dc.redrawAll()
    },300);
    
    oboe("relative-or-absolute path to your data (ajax)")
    .node('CNT',function (count,path) {
        // the total record count arrives first
        d3.select("#count-data-all").text("Expecting " + count + " records");
        return oboe.drop;
    })
    .node('data.*', function (record, path) {
        // count each record as it streams in; oboe.drop releases it from memory
        datacnt++;
        return oboe.drop;
    })
    .node('done', function (item, path) {
        d3.select("#progress-data").text("all data loaded");
        clearInterval(timerId);
        d3.select("#count-data-current").text(datacnt);
    });
    

    Data sample

    {"CNT":107498, 
     "keys": ["DATACENTER","FQDN","VALUE","CONSISTENCY_RESULT","FIRST_REC_DATE","LAST_REC_DATE","ACTIVE","OBJECT_ID","OBJECT_TYPE","CONSISTENCY_MESSAGE","ID_PARAMETER"], 
     "data": [[22,202,"4.9.416.2",0,1449655898,1453867824,-1,"","",0,45],[22,570,"4.9.416.2",0,1449655912,1453867884,-1,"","",0,45],[14,377,"2.102.453.0",-1,1449654863,1468208273,-1,"","",0,45],[14,406,"2.102.453.0",-1,1449654943,1468208477,-1,"","",0,45],[22,202,"10.2.293.0",0,1449655898,1453867824,-1,"","",0,8],[22,381,"10.2.293.0",0,1449655906,1453867875,-1,"","",0,8],[22,570,"10.2.293.0",0,1449655912,1453867884,-1,"","",0,8],[22,381,"1.80",0,1449655906,1453867875,-1,"","",0,41],[22,570,"1.80",0,1449655912,1453867885,-1,"","",0,41],[22,202,"4",0,1449655898,1453867824,-1,"","",0,60],[22,381,"4",0,1449655906,1453867875,-1,"","",0,60],[22,570,"4",0,1449655913,1453867885,-1,"","",0,60],[22,202,"A20",0,1449655898,1453867824,-1,"","",0,52],[22,381,"A20",0,1449655906,1453867875,-1,"","",0,52],[22,570,"A20",0,1449655912,1453867884,-1,"","",0,52],[22,202,"20140201",2,1449655898,1453867824,-1,"","",0,40],[22,381,"20140201",2,1449655906,1453867875,-1,"","",0,40],[22,570,"20140201",2,1449655912,1453867884,-1,"","",0,40],[22,202,"16",-4,1449655898,1453867824,-1,"","",0,58],[22,381,"16",-4,1449655906,1453867875,-1,"","",0,58],[22,570,"16",-4,1449655913,1453867885,-1,"","",0,58],[22,202,"512",0,1449655898,1453867824,-1,"","",0,57],[22,381,"512",0,1449655906,1453867875,-1,"","",0,57],[22,570,"512",0,1449655913,1453867885,-1,"","",0,57],[22,930,"I32",0,1449656143,1461122271,-1,"","",0,66],[22,930,"20140803",-4,1449656143,1461122271,-1,"","",0,64],[14,1359,"10.2.340.19",0,1449655203,1468209257,-1,"","",0,131],[14,567,"10.2.340.19",0,1449655185,1468209111,-1,"","",0,131],[22,930,"4.9.416.0",-1,1449656143,1461122271,-1,"","",0,131],[14,1359,"10.2.293.0",0,1449655203,1468209258,-1,"","",0,13],[14,567,"10.2.293.0",0,1449655185,1468209112,-1,"","",0,13],[22,930,"4.9.288.0",-1,1449656143,1461122271,-1,"","",0,13],[22,930,"4",0,1449656143,1461122271,-1,"","",0,76],[22,930,"96",0,1449656143,1461122271,-1,"","",0,77],[22,930,"4",0,1449656143,1461122271,-1,"","",0,74],[22,930,"VMware ESXi 
5.1.0 build-2323236",0,1449656143,1461122271,-1,"","",0,17],[21,616,"A20",0,1449073850,1449073850,-1,"","",0,135],[21,616,"4",0,1449073850,1449073850,-1,"","",0,139],[21,616,"12",0,1449073850,1449073850,-1,"","",0,138],[21,616,"4",0,1449073850,1449073850,-1,"","",0,140],[21,616,"2",0,1449073850,1449073850,-1,"","",0,136],[21,616,"512",0,1449073850,1449073850,-1,"","",0,141],[21,616,"Microsoft Windows Server 2012 R2 Datacenter",0,1449073850,1449073850,-1,"","",0,109],[21,616,"4.4.5.100",0,1449073850,1449073850,-1,"","",0,97],[21,616,"3.2.7895.0",-1,1449073850,1449073850,-1,"","",0,56],[9,2029,"10.7.220.6",-4,1470362743,1478315637,1,"vmnic0","",1,8],[9,1918,"10.7.220.6",-4,1470362728,1478315616,1,"vmnic3","",1,8],[9,1918,"10.7.220.6",-4,1470362727,1478315616,1,"vmnic2","",1,8],[9,1918,"10.7.220.6",-4,1470362727,1478315615,1,"vmnic1","",1,8],[9,1918,"10.7.220.6",-4,1470362727,1478315615,1,"vmnic0","",1,8],[14,205,"934.5.45.0-1vmw",-50,1465996556,1468209226,-1,"","",0,47],[14,1155,"934.5.45.0-1vmw",-50,1465996090,1468208653,-1,"","",0,14],[14,963,"934.5.45.0-1vmw",-50,1465995972,1468208526,-1,"","",0,14],
     ],
     "done" : true}
    

    First convert the keys-first data back into an array of full objects

        //function to convert main data to array of objects
        function convertToArrayOfObjects(data) {
            var keys = data.shift(),
                i = 0, k = 0,
                obj = null,
                output = [];
    
            for (i = 0; i < data.length; i++) {
                obj = {};
    
                for (k = 0; k < keys.length; k++) {
                    obj[keys[k]] = data[i][k];
                }
    
                output.push(obj);
            }
    
            return output;
        }
    

    The function above works on a slightly modified version of the data sample here

       [["ID1","ID2","TEXT1","STATE1","DATE1","DATE2","STATE2","TEXT2","TEXT3","STATE3","ID3"],
        [14,377,"2.102.453.0",-1,1449654863,1468208273,-1,"","",0,45],
        [14,406,"2.102.453.0",-1,1449654943,1468208477,-1,"","",0,45],
        [22,202,"10.2.293.0",0,1449655898,1453867824,-1,"","",0,8],
        [22,381,"10.2.293.0",0,1449655906,1453867875,-1,"","",0,8],
        [22,570,"10.2.293.0",0,1449655912,1453867884,-1,"","",0,8],
        [22,381,"1.80",0,1449655906,1453867875,-1,"","",0,41],
        [22,570,"1.80",0,1449655912,1453867885,-1,"","",0,41],
        [22,202,"4",0,1449655898,1453867824,-1,"","",0,60],
        [22,381,"4",0,1449655906,1453867875,-1,"","",0,60],
        [22,570,"4",0,1449655913,1453867885,-1,"","",0,60],
        [22,202,"A20",0,1449655898,1453867824,-1,"","",0,52]]
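To illustrate, here is the same conversion in condensed form, run on a tiny keys-first sample (the helper is repeated so the sketch is self-contained):

```javascript
// Same conversion as convertToArrayOfObjects above, in condensed form.
function convertToArrayOfObjects(data) {
  var keys = data.shift(); // the first row holds the field names
  return data.map(function (row) {
    var obj = {};
    keys.forEach(function (key, k) { obj[key] = row[k]; });
    return obj;
  });
}

var rows = convertToArrayOfObjects([
  ["ID1", "ID2", "TEXT1"],
  [14, 377, "2.102.453.0"],
  [22, 202, "10.2.293.0"]
]);
// rows[0] is { ID1: 14, ID2: 377, TEXT1: "2.102.453.0" }
```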
    

    Also consider caching the data server-side with memcached (https://memcached.org/) or redis (https://redis.io/); depending on the data size, redis may get you further
