Data Mining in a Django/Postgres application

I need to build an analytics (reporting, charting and graphing) system into my Django application. In an ideal world I could just query my Postgres DB and get the data I want, but once the amount of data in the DB goes through the roof, I'll hit performance bottlenecks and other issues like index hell.

I'm wondering if you could point me in the right direction to implement this:

  • Is this a good scenario for using a NoSQL DB (like CouchDB, MongoDB or Redis) and querying the data from that?
  • Since Postgres and Django have no OLAP/MDX support, should I go with a star schema in a different database and query that?

I have two requirements:

  • I don't want to query my actual DB for analytics, as it might take a huge performance hit.
  • I'd like to keep my analytics as up to date as possible, i.e. I'd like to incrementally update my data warehouse so that it has the latest data. Every time there's a CRUD operation on my transactional DB, I'd like to update the data warehouse.

This is yet another scenario that I haven't worked with, and I'm trying to understand the quickest and best way to accomplish it.

    I hope I've been verbose enough. If not, I'd gladly explain more.

    Thanks everyone


    After digging around the web and using the knowledge I have, I've come to this solution:

    Use Postgres to store the relational data. On every CRUD operation, call the analytics code to do the calculations on the data and store the results in a NoSQL DB like Redis/CouchDB.

    Looking at this good comparison of the NoSQL DBs (http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis), I think Redis or CouchDB would fit this analytics use case just fine.

    I could store the calculated analytics in Redis/CouchDB and update them incrementally when my source data changes.
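
A minimal sketch of what "update them incrementally" could look like, assuming running totals per status are the analytics being kept. The class `IncrementalStats`, the row fields `status` and `amount`, and the in-memory dictionaries are all hypothetical; the dictionaries only stand in for something like a Redis hash so that the example runs on its own.

```python
from collections import defaultdict


class IncrementalStats:
    """In-memory stand-in for a key-value store of running totals (hypothetical)."""

    def __init__(self):
        self.count = defaultdict(int)    # number of rows per status
        self.total = defaultdict(float)  # summed amount per status

    def on_create(self, row):
        # A new row adds its values to the aggregates.
        self.count[row["status"]] += 1
        self.total[row["status"]] += row["amount"]

    def on_delete(self, row):
        # A deleted row takes its values back out.
        self.count[row["status"]] -= 1
        self.total[row["status"]] -= row["amount"]

    def on_update(self, old, new):
        # An update is the removal of the old values plus the addition of the new ones.
        self.on_delete(old)
        self.on_create(new)


stats = IncrementalStats()
stats.on_create({"status": "paid", "amount": 10.0})
stats.on_create({"status": "paid", "amount": 5.0})
stats.on_update(
    {"status": "paid", "amount": 5.0},
    {"status": "refunded", "amount": 5.0},
)
print(dict(stats.count), dict(stats.total))
```

In a Django project these three methods would probably be called from `post_save` and `post_delete` signal handlers. Note that an update needs the old values as well as the new ones, so the handler would have to get hold of the previous state of the row somehow.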

    Is this a good solution?


    You might want to consider Cube. It is not a Django app, but it has a lot of nice features baked in, and Django can communicate with it easily. Also, it is lovely.

    [Cube screenshot]

    You could have your Django app just blast off events into MongoDB when they occur. This separation of systems would prevent any additional strain on your Django app.
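
One way to "blast off" events without slowing down the request is to hand them to a background thread. The sketch below is only an illustration under assumptions: `EventEmitter`, `make_event` and `http_sender` are hypothetical names, the `type`/`time`/`data` event shape follows what I recall of Cube's event format, and `COLLECTOR_URL` assumes a Cube collector listening locally on its default port. Check the Cube documentation before relying on any of these.

```python
import json
import queue
import threading
import urllib.request
from datetime import datetime, timezone

# Assumed address of a local Cube collector; adjust to your setup.
COLLECTOR_URL = "http://localhost:1080/1.0/event/put"


def make_event(event_type, data):
    """Build one event as a plain dict with a UTC timestamp."""
    return {
        "type": event_type,
        "time": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }


def http_sender(events):
    """POST a list of events as JSON to the collector."""
    body = json.dumps(events).encode("utf-8")
    request = urllib.request.Request(
        COLLECTOR_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=2).close()


class EventEmitter:
    """Queue events and send them from a background thread."""

    def __init__(self, sender=http_sender):
        self._queue = queue.Queue()
        self._sender = sender
        threading.Thread(target=self._run, daemon=True).start()

    def emit(self, event_type, **data):
        # Returns immediately; the web request never waits for the network.
        self._queue.put(make_event(event_type, data))

    def _run(self):
        while True:
            event = self._queue.get()
            try:
                self._sender([event])
            except Exception:
                pass  # an analytics failure should not break the application
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every queued event has been handed to the sender."""
        self._queue.join()


# Usage with a fake sender that just collects events in a list.
sent = []
emitter = EventEmitter(sender=sent.extend)
emitter.emit("order_created", order_id=42, amount=10.0)
emitter.flush()
```

Dropping events silently on failure is a deliberate simplification here; a real setup would likely want logging or a retry.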


    Sorry, Mridang Agarwalla, your question keeps coming back to my mind ...

    I have thought of a way to keep both databases, OLAP and OLTP, in sync and up to date with low impact on the OLTP side.

    In 2002 I successfully employed this technique for a similar issue. It works as follows:

  • You write a trigger for each fact table. When fact data is modified, the trigger inserts a row into a table that records this event (idEvent + update | delete | insert + foreign key to the fact table).
  • A low-priority daemon runs an infinite loop; on each iteration it 'pops' 10 events from the table and updates the OLAP database with this new information.
  • You can optimize the daemon's behavior. For example, if the table has no new events, the daemon can sleep for 15 seconds.
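
The steps above can be sketched roughly as follows. To keep the example runnable on its own it uses SQLite from the Python standard library instead of Postgres (where the triggers would be written with a trigger function instead), and a plain dictionary stands in for the OLAP database. The table names `sale` and `sale_event` and the functions `process_batch` and `run_daemon` are hypothetical.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE sale (id INTEGER PRIMARY KEY, product TEXT, amount REAL);
CREATE TABLE sale_event (
    id_event INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT NOT NULL,          -- 'insert' | 'update' | 'delete'
    sale_id INTEGER NOT NULL   -- key of the affected fact row
);
CREATE TRIGGER sale_ai AFTER INSERT ON sale BEGIN
    INSERT INTO sale_event (op, sale_id) VALUES ('insert', NEW.id);
END;
CREATE TRIGGER sale_au AFTER UPDATE ON sale BEGIN
    INSERT INTO sale_event (op, sale_id) VALUES ('update', NEW.id);
END;
CREATE TRIGGER sale_ad AFTER DELETE ON sale BEGIN
    INSERT INTO sale_event (op, sale_id) VALUES ('delete', OLD.id);
END;
"""


def process_batch(oltp, olap, batch_size=10):
    """Pop up to batch_size events and apply them to the OLAP side."""
    events = oltp.execute(
        "SELECT id_event, op, sale_id FROM sale_event ORDER BY id_event LIMIT ?",
        (batch_size,),
    ).fetchall()
    for id_event, op, sale_id in events:
        row = oltp.execute(
            "SELECT product, amount FROM sale WHERE id = ?", (sale_id,)
        ).fetchone()
        if op == "delete" or row is None:
            olap.pop(sale_id, None)  # the fact row no longer exists
        else:
            # Copy the current state of the fact row (safe to repeat).
            olap[sale_id] = {"product": row[0], "amount": row[1]}
        oltp.execute("DELETE FROM sale_event WHERE id_event = ?", (id_event,))
    oltp.commit()
    return len(events)


def run_daemon(oltp, olap, idle_sleep=15):
    """Infinite loop: process batches, sleep when there is nothing to do."""
    while True:
        if process_batch(oltp, olap) == 0:
            time.sleep(idle_sleep)


oltp = sqlite3.connect(":memory:")
oltp.executescript(SCHEMA)
olap = {}

oltp.execute("INSERT INTO sale (id, product, amount) VALUES (1, 'pen', 2.5)")
oltp.execute("INSERT INTO sale (id, product, amount) VALUES (2, 'ink', 7.0)")
oltp.execute("UPDATE sale SET amount = 3.0 WHERE id = 1")
oltp.execute("DELETE FROM sale WHERE id = 2")

processed = process_batch(oltp, olap)  # one iteration of the daemon loop
```

Here `run_daemon` is defined but not called, since it never returns; in practice it would run in its own low-priority process. The event row only stores the key of the changed fact, and the worker reads the current row when it processes the event, which keeps the trigger cheap.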

    In my scenario only fact tables have triggers. If a fact row references data that is not yet in the OLAP database, I create that data at that point (the OLTP and OLAP databases have different schemas).

    If you analyze your database you may find hundreds of tables, but only a few of them are really fact tables.

    Well, I know this is only a partial answer to your question. The second part of your question asks about a powerful tool to analyze data. I can't suggest any open source product, because I have no experience with open source analysis tools. I have worked with Microsoft Analysis Services with Tableau software as the front end. This is a very nice solution, but I don't know if it matches your philosophy. For data mining there is KNIME (Konstanz Information Miner), a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform (but you need a prior ETL task).

    Please send me news about your project; I'm very interested. I have a Django student attendance solution and I want to add analysis functionality.
