Load large dataset into crossfilter/dc.js
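I built a crossfilter with several dimensions and groups to display the data visually using dc.js. The data being visualized is bike trip data, and every trip is loaded in. Right now there are over 750,000 records. The JSON file I'm using is 70 MB, and it will only grow as I receive more data in the coming months.
So my question is: how can I make the data leaner so that it scales well? Right now it takes about 15 seconds to load on my internet connection, and I'm worried it will take too long once there is much more data. I have also tried to show a progress bar/spinner while the data loads, without success.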
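The columns I need are start_date, start_time, usertype, gender, tripduration, meters, age. I have shortened these fields in my JSON to start_date, start_time, u, g, dur, m, age so the file is smaller. On the crossfilter there is a line chart at the top showing the total number of trips per day. Below that are row charts for day of week (calculated from the data) and month (also calculated), and pie charts for usertype, gender, and age. Below that there are two bar charts, one for start_time (rounded to the hour) and one for tripduration (rounded to the minute).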
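The project is on GitHub: https://github.com/shaunjacobsen/divvy_explorer (the dataset is in data2.json). I tried to create a jsfiddle but it isn't working (probably because of the data, even when taking only 1,000 rows and loading them into the HTML with <pre> tags): http://jsfiddle.net/QLCS2/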
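Ideally it would work so that only the data for the top chart loads first: that would be quick, since it is just a count of trips per day. The other charts, however, need progressively more data to drill down into finer detail. Any ideas on how to get this to work?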
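I'd recommend shortening all of the field names in your JSON to 1 character (including "start_date" and "start_time"). That should help a little. Also, make sure compression is turned on on your server. That way the data sent to the browser is compressed automatically in transit, which should speed things up considerably if it isn't already enabled.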
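To get a rough feel for what shorter field names save before compression, the sketch below renames keys through a lookup table and compares the serialized sizes. The one-character names and the sample record are made up for illustration; they are not taken from the project.

```javascript
// Hypothetical mapping from the long column names to one-character keys.
var SHORT_KEYS = {
  start_date: "d", start_time: "t", usertype: "u", gender: "g",
  tripduration: "r", meters: "m", age: "a"
};

// Return a copy of `record` with every mapped key replaced by its short name.
// Keys that are not in the map are kept unchanged.
function shortenKeys(record, keyMap) {
  var out = {};
  Object.keys(record).forEach(function (key) {
    out[keyMap[key] || key] = record[key];
  });
  return out;
}

// Made-up trip record, only used to compare serialized sizes.
var trip = {
  start_date: "2014-06-30", start_time: 17, usertype: "Subscriber",
  gender: "F", tripduration: 12, meters: 2400, age: 31
};
var compact = shortenKeys(trip, SHORT_KEYS);

// Characters per record with long keys vs. short keys.
console.log(JSON.stringify(trip).length, JSON.stringify(compact).length);
```

Repeated key names compress well, so with gzip enabled the gain from renaming is smaller than the raw numbers suggest; it is worth measuring the compressed size too.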
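For better responsiveness, I'd also recommend first setting up your Crossfilter (empty), all of your dimensions and groups, and all of your dc.js charts, and then using Crossfilter.add() to feed the data into the Crossfilter in chunks. The easiest way to do this is to split your data into bite-sized files (a few MB each) and load them serially. So if you are using d3.json, start the next file load in the callback of the previous one. This gives you a bunch of nested callbacks, which is a bit nasty, but it should keep the user interface responsive while the data loads.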
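The serial loading described above does not have to be written as hand-nested callbacks. Here is a minimal sketch of a helper that walks a list of chunk URLs one at a time. The helper name, the chunk file names and the injected loadJson parameter are assumptions for illustration; loadJson is expected to follow the callback(error, data) convention.

```javascript
// Load chunk files one after another and hand each chunk to `onChunk`.
// `loadJson(url, callback)` is injected so the helper does not depend on a
// particular library; the callback is called as callback(error, rows).
function loadChunksSerially(urls, loadJson, onChunk, onDone) {
  var index = 0;

  function next() {
    if (index >= urls.length) {
      onDone(null); // every chunk was loaded
      return;
    }
    var url = urls[index++];
    loadJson(url, function (error, rows) {
      if (error) {
        onDone(error); // stop at the first failure
        return;
      }
      onChunk(rows);
      next(); // only now request the following chunk
    });
  }

  next();
}

// Hypothetical wiring (not executed here; assumes d3 v3, crossfilter, dc.js):
// var ndx = crossfilter();            // empty; define dimensions/charts first
// loadChunksSerially(
//   ["trips-0.json", "trips-1.json"], // made-up file names
//   d3.json,
//   function (rows) { ndx.add(rows); dc.redrawAll(); },
//   function (error) { if (error) { console.error(error); } }
// );
```

Because each request starts only after the previous chunk has been added, the browser gets a chance to repaint between chunks, and a progress indicator can be updated from onChunk.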
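Lastly, with this much data I believe you will start running into performance problems in the browser, not just while loading the data. I suspect you are already seeing this, and that the 15-second pause you observe is at least partly spent in the browser. You can check by profiling in your browser's developer tools. To address it, profile, identify the performance bottlenecks, and then try to optimize those. Also, be sure to test on slower computers if your audience uses them.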
Consider my class design. It doesn't match yours, but it illustrates my points.
using System.Collections.Generic;

public class MyDataModel
{
    public List<MyDatum> Data { get; set; }
}

public class MyDatum
{
    public long StartDate { get; set; }   // Unix timestamp
    public long EndDate { get; set; }     // Unix timestamp
    public int Duration { get; set; }     // seconds
    public string Title { get; set; }
}
Start and end date are Unix timestamps, and duration is in seconds.
Serializes to: {"Data":[{"StartDate":1441256019,"EndDate":1441257181,"Duration":451,"Title":"Rad is a cool word."},...]}
That is 92 characters for one row of data.
Let's start compressing! Convert dates and times to base-60 strings. Store everything in an array of arrays of strings.
public class MyDataModel
{
    public List<List<string>> Data { get; set; }
}
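Serializes to: {"Data":[["1pCSrd","1pCTD1","7V","Rad is a cool word."],...]}
One row of data is now 47 characters. moment.js is a good library for working with dates and times. It has functions built in to unpack the base-60 format.
Working with an array of arrays makes your code less readable, so add comments to document the code.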
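For reference, the base-60 packing described above can be sketched in a few lines of plain JavaScript. The digit alphabet used here (0-9, then A-Z, then a-x) is inferred from the sample values in this answer, which it reproduces; a library's base-60 helper may order its digits differently, so treat the alphabet as an assumption.

```javascript
// 60 digits: 0-9, A-Z, a-x. Inferred from the sample values in the answer.
var BASE60_DIGITS =
  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwx";

// Encode a non-negative integer as a base-60 string.
function toBase60(n) {
  if (!Number.isInteger(n) || n < 0) {
    throw new RangeError("expected a non-negative integer");
  }
  if (n === 0) {
    return "0";
  }
  var out = "";
  while (n > 0) {
    out = BASE60_DIGITS[n % 60] + out; // prepend the least significant digit
    n = Math.floor(n / 60);
  }
  return out;
}

// Decode a base-60 string back into an integer.
function fromBase60(text) {
  var n = 0;
  for (var i = 0; i < text.length; i++) {
    var digit = BASE60_DIGITS.indexOf(text[i]);
    if (digit < 0) {
      throw new RangeError("invalid base-60 digit: " + text[i]);
    }
    n = n * 60 + digit;
  }
  return n;
}

console.log(toBase60(1441256019)); // "1pCSrd"
console.log(fromBase60("7V"));     // 451
```

A ten-digit Unix timestamp shrinks to six characters this way, which is where most of the saving per row comes from.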
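Load only the most recent 90 days. Zoom to 30 days. When the user drags the brush on the range chart to the left, start fetching more data in chunks of 90 days until the user stops dragging. Add the data to the existing crossfilter using the add method.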
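One way to decide what to fetch while the brush moves left is to compute the 90-day windows that are still missing between the brush start and the oldest data already loaded. This is only a sketch: the function name and the idea of tracking an oldestLoadedMs timestamp are my assumptions, and the actual fetch and the brush event wiring are left out.

```javascript
var DAY_MS = 24 * 60 * 60 * 1000;

// Return the [from, to) windows, newest first, needed to extend the loaded
// data back to `brushStartMs`. All times are epoch milliseconds.
function rangesToFetch(brushStartMs, oldestLoadedMs, chunkDays) {
  if (!(chunkDays > 0)) {
    throw new RangeError("chunkDays must be positive");
  }
  var ranges = [];
  var to = oldestLoadedMs;
  while (brushStartMs < to) {
    var from = to - chunkDays * DAY_MS;
    ranges.push({ from: from, to: to });
    to = from; // the next window ends where this one starts
  }
  return ranges;
}

// Example: data is loaded back to 1 June 2015 and the brush now starts
// 100 days earlier, so two 90-day windows are needed.
var oldestLoaded = Date.UTC(2015, 5, 1);
console.log(rangesToFetch(oldestLoaded - 100 * DAY_MS, oldestLoaded, 90).length);
```

Each returned window would be requested from the server in turn and passed to the crossfilter's add method, after which the oldest loaded timestamp moves back to the window's from value.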
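As you add more and more data, you will notice your charts becoming less and less responsive. That is because you have rendered hundreds or even thousands of elements in your SVG, and the browser is getting crushed. Use d3's quantize scale to group data points into buckets, and reduce the displayed data to 50 buckets.
Quantizing is worth the effort, and it is the only way you can build a scalable graph on a continuously growing dataset.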
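To make the bucketing idea concrete, here is a plain-JavaScript sketch that does by hand what a quantize scale does: it maps a continuous range onto a fixed number of equal-width buckets and counts the values in each. It deliberately avoids d3 so that it runs on its own; the function name and the sample values are assumptions.

```javascript
// Count how many values fall into each of `bucketCount` equal-width buckets
// covering [min, max]. Values outside the range are ignored.
function quantizeCounts(values, min, max, bucketCount) {
  if (!(max > min) || !(bucketCount > 0)) {
    throw new RangeError("need max > min and bucketCount > 0");
  }
  var counts = new Array(bucketCount).fill(0);
  var width = (max - min) / bucketCount;
  values.forEach(function (value) {
    if (value < min || value > max) {
      return;
    }
    // value === max would land one past the end, so clamp to the last bucket.
    var index = Math.min(bucketCount - 1, Math.floor((value - min) / width));
    counts[index] += 1;
  });
  return counts;
}

// Eleven made-up trip durations reduced to five bars.
console.log(quantizeCounts([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0, 10, 5));
// [2, 2, 2, 2, 3]
```

With 50 buckets the chart draws 50 elements no matter how many trips are loaded, which is what keeps rendering cost flat as the dataset grows.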
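Your other option is to abandon the range chart and group the data by month, by day, and by hour. Then add a date range picker. Since your data would be grouped by month, day, and hour, even if you rode your bike every hour of every day you would never have a result set larger than 8766 rows (the number of hours in an average year, 365.25 × 24).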
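As a sketch of that pre-aggregation, the function below collapses individual trip start times into one row per hour. It assumes the start times are Unix timestamps in seconds; the function name and the sample timestamps are made up.

```javascript
var HOUR_SECONDS = 3600;

// Collapse trip start times (Unix seconds) into one row per hour,
// sorted by time: [{ hour: <start of hour>, trips: <count> }, ...].
function countTripsByHour(startTimestamps) {
  var counts = {};
  startTimestamps.forEach(function (ts) {
    var hourStart = Math.floor(ts / HOUR_SECONDS) * HOUR_SECONDS;
    counts[hourStart] = (counts[hourStart] || 0) + 1;
  });
  return Object.keys(counts)
    .map(function (key) {
      return { hour: Number(key), trips: counts[key] };
    })
    .sort(function (a, b) {
      return a.hour - b.hour;
    });
}

// Five made-up trips spread over three hours.
console.log(countTripsByHour([0, 10, 3599, 3600, 7300]));
```

Doing this on the server means the browser receives at most one row per hour regardless of how many trips were recorded, at the cost of losing per-trip fields unless they are added to the grouping key.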
I've observed similar issues with data (working in an enterprise company), and I found a couple of ideas worth trying.
Timer sample

var datacnt = 0;
// Refresh the on-screen counter every 300 ms while records stream in.
var timerId = setInterval(function () {
    d3.select("#count-data-current").text(datacnt);
    // update visualization should go here, something like dc.redrawAll()...
}, 300);

oboe("relative-or-absolute path to your data(ajax)")
    // "CNT" arrives first and announces how many records to expect.
    .node('CNT', function (count, path) {
        d3.select("#count-data-all").text("Expecting " + count + " records");
        return oboe.drop;
    })
    // Called once per row of "data"; dropping the node keeps memory flat.
    .node('data.*', function (record, path) {
        datacnt++;
        return oboe.drop;
    })
    // "done" is the last key in the payload, so loading has finished.
    .node('done', function (item, path) {
        d3.select("#progress-data").text("all data loaded");
        clearInterval(timerId); // the timer was created with setInterval
        d3.select("#count-data-current").text(datacnt);
    });
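The timer sample above only counts records. If the streamed records should also end up in the charts, one option is to buffer them and hand them over in batches from the timer, so the charts are redrawn a few times per second instead of once per record. A minimal sketch follows; createBatcher is a made-up helper, not part of oboe.js or dc.js.

```javascript
// Buffer incoming records and pass them on in batches.
function createBatcher(onFlush) {
  var buffer = [];
  return {
    // Call this for every streamed record.
    push: function (record) {
      buffer.push(record);
    },
    // Call this from a timer; returns how many records were handed over.
    flush: function () {
      if (buffer.length === 0) {
        return 0;
      }
      var batch = buffer;
      buffer = []; // start a fresh buffer before calling out
      onFlush(batch);
      return batch.length;
    }
  };
}

// Hypothetical wiring with the sample above (not executed here):
// var batcher = createBatcher(function (batch) {
//   ndx.add(batch);      // ndx: an existing crossfilter instance
//   dc.redrawAll();
// });
// in the 'data.*' callback:  batcher.push(record);
// in the setInterval body:   batcher.flush();
// in the 'done' callback:    batcher.flush();  // hand over the remainder
```

Since the rows in the data sample are plain arrays, they would need to be converted to objects before being added if the dimensions access fields by name.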
Data sample
{"CNT":107498,
"keys": ["DATACENTER","FQDN","VALUE","CONSISTENCY_RESULT","FIRST_REC_DATE","LAST_REC_DATE","ACTIVE","OBJECT_ID","OBJECT_TYPE","CONSISTENCY_MESSAGE","ID_PARAMETER"],
"data": [[22,202,"4.9.416.2",0,1449655898,1453867824,-1,"","",0,45],[22,570,"4.9.416.2",0,1449655912,1453867884,-1,"","",0,45],[14,377,"2.102.453.0",-1,1449654863,1468208273,-1,"","",0,45],[14,406,"2.102.453.0",-1,1449654943,1468208477,-1,"","",0,45],[22,202,"10.2.293.0",0,1449655898,1453867824,-1,"","",0,8],[22,381,"10.2.293.0",0,1449655906,1453867875,-1,"","",0,8],[22,570,"10.2.293.0",0,1449655912,1453867884,-1,"","",0,8],[22,381,"1.80",0,1449655906,1453867875,-1,"","",0,41],[22,570,"1.80",0,1449655912,1453867885,-1,"","",0,41],[22,202,"4",0,1449655898,1453867824,-1,"","",0,60],[22,381,"4",0,1449655906,1453867875,-1,"","",0,60],[22,570,"4",0,1449655913,1453867885,-1,"","",0,60],[22,202,"A20",0,1449655898,1453867824,-1,"","",0,52],[22,381,"A20",0,1449655906,1453867875,-1,"","",0,52],[22,570,"A20",0,1449655912,1453867884,-1,"","",0,52],[22,202,"20140201",2,1449655898,1453867824,-1,"","",0,40],[22,381,"20140201",2,1449655906,1453867875,-1,"","",0,40],[22,570,"20140201",2,1449655912,1453867884,-1,"","",0,40],[22,202,"16",-4,1449655898,1453867824,-1,"","",0,58],[22,381,"16",-4,1449655906,1453867875,-1,"","",0,58],[22,570,"16",-4,1449655913,1453867885,-1,"","",0,58],[22,202,"512",0,1449655898,1453867824,-1,"","",0,57],[22,381,"512",0,1449655906,1453867875,-1,"","",0,57],[22,570,"512",0,1449655913,1453867885,-1,"","",0,57],[22,930,"I32",0,1449656143,1461122271,-1,"","",0,66],[22,930,"20140803",-4,1449656143,1461122271,-1,"","",0,64],[14,1359,"10.2.340.19",0,1449655203,1468209257,-1,"","",0,131],[14,567,"10.2.340.19",0,1449655185,1468209111,-1,"","",0,131],[22,930,"4.9.416.0",-1,1449656143,1461122271,-1,"","",0,131],[14,1359,"10.2.293.0",0,1449655203,1468209258,-1,"","",0,13],[14,567,"10.2.293.0",0,1449655185,1468209112,-1,"","",0,13],[22,930,"4.9.288.0",-1,1449656143,1461122271,-1,"","",0,13],[22,930,"4",0,1449656143,1461122271,-1,"","",0,76],[22,930,"96",0,1449656143,1461122271,-1,"","",0,77],[22,930,"4",0,1449656143,1461122271,-1,"","",0,74],[22,930,"VMware ESXi 5.1.0 build-2323236",0,1449656143,1461122271,-1,"","",0,17],[21,616,"A20",0,1449073850,1449073850,-1,"","",0,135],[21,616,"4",0,1449073850,1449073850,-1,"","",0,139],[21,616,"12",0,1449073850,1449073850,-1,"","",0,138],[21,616,"4",0,1449073850,1449073850,-1,"","",0,140],[21,616,"2",0,1449073850,1449073850,-1,"","",0,136],[21,616,"512",0,1449073850,1449073850,-1,"","",0,141],[21,616,"Microsoft Windows Server 2012 R2 Datacenter",0,1449073850,1449073850,-1,"","",0,109],[21,616,"4.4.5.100",0,1449073850,1449073850,-1,"","",0,97],[21,616,"3.2.7895.0",-1,1449073850,1449073850,-1,"","",0,56],[9,2029,"10.7.220.6",-4,1470362743,1478315637,1,"vmnic0","",1,8],[9,1918,"10.7.220.6",-4,1470362728,1478315616,1,"vmnic3","",1,8],[9,1918,"10.7.220.6",-4,1470362727,1478315616,1,"vmnic2","",1,8],[9,1918,"10.7.220.6",-4,1470362727,1478315615,1,"vmnic1","",1,8],[9,1918,"10.7.220.6",-4,1470362727,1478315615,1,"vmnic0","",1,8],[14,205,"934.5.45.0-1vmw",-50,1465996556,1468209226,-1,"","",0,47],[14,1155,"934.5.45.0-1vmw",-50,1465996090,1468208653,-1,"","",0,14],[14,963,"934.5.45.0-1vmw",-50,1465995972,1468208526,-1,"","",0,14]],
"done" : true}
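First convert the header-plus-rows layout into a full array of objects:
// Convert [headerRow, row, row, ...] into an array of objects.
// Note: data.shift() removes the header row from the array passed in.
function convertToArrayOfObjects(data) {
    var keys = data.shift(),
        i = 0, k = 0,
        obj = null,
        output = [];
    for (i = 0; i < data.length; i++) {
        obj = {};
        for (k = 0; k < keys.length; k++) {
            obj[keys[k]] = data[i][k];
        }
        output.push(obj);
    }
    return output;
}
The function above works with a slightly modified version of the data sample, shown here: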
[["ID1","ID2","TEXT1","STATE1","DATE1","DATE2","STATE2","TEXT2","TEXT3","ID3"],
[14,377,"2.102.453.0",-1,1449654863,1468208273,-1,"","",0,45],
[14,406,"2.102.453.0",-1,1449654943,1468208477,-1,"","",0,45],
[22,202,"10.2.293.0",0,1449655898,1453867824,-1,"","",0,8],
[22,381,"10.2.293.0",0,1449655906,1453867875,-1,"","",0,8],
[22,570,"10.2.293.0",0,1449655912,1453867884,-1,"","",0,8],
[22,381,"1.80",0,1449655906,1453867875,-1,"","",0,41],
[22,570,"1.80",0,1449655912,1453867885,-1,"","",0,41],
[22,202,"4",0,1449655898,1453867824,-1,"","",0,60],
[22,381,"4",0,1449655906,1453867875,-1,"","",0,60],
[22,570,"4",0,1449655913,1453867885,-1,"","",0,60],
[22,202,"A20",0,1449655898,1453867824,-1,"","",0,52]]
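When the payload carries the column names separately, as in the "keys"/"data" sample earlier in this answer, a variant that takes keys and rows as two arguments avoids having to prepend the header row, and it leaves the input untouched. This is a sketch; the name rowsToObjects is mine, and the payload below uses only the first three columns of the sample to stay short.

```javascript
// Build one object per row, using `keys` as the property names.
// Neither `keys` nor `rows` is modified.
function rowsToObjects(keys, rows) {
  return rows.map(function (row) {
    var obj = {};
    keys.forEach(function (key, i) {
      obj[key] = row[i];
    });
    return obj;
  });
}

// Shortened payload in the shape of the data sample (first three columns).
var payload = {
  keys: ["DATACENTER", "FQDN", "VALUE"],
  data: [
    [22, 202, "4.9.416.2"],
    [14, 377, "2.102.453.0"]
  ]
};

var records = rowsToObjects(payload.keys, payload.data);
console.log(records[0].VALUE); // "4.9.416.2"
```

The resulting array of objects is the shape that crossfilter dimensions with named accessors expect.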
Also consider using memcached (https://memcached.org/) or redis (https://redis.io/) to cache the data on the server side; depending on the data size, redis might take you further.