PostgreSQL tuning best practices for data warehousing

I have found plenty of online and print guides on how to tune and optimize Postgres performance for OLTP applications, but I haven't found anything of the sort specific to Data Warehousing applications. Since the workloads differ in so many ways, I'm sure there have to be some differences in how the databases are managed and tuned.

Some of my own:

  • On the DDL side, I have found that I use indexes a lot more liberally, since I usually only worry about inserts once a day and can do batch loads with index rebuilds (a load sketch follows this list).

  • I will typically attach integer surrogate keys to data that has more than one natural key, for faster joins.

  • I will usually define and maintain a very comprehensive date table that has prebuilt date manipulations (fiscal date as opposed to calendar date, fiscal year-month, starting day of the week, etc.) and use it liberally instead of calling date functions in SELECT and WHERE clauses. This usually helps during CPU-bound aggregate queries. (A sketch of such a table also follows this list.)

  • I was hoping that I would find some information on memory management and other database settings, but I would be happy to hear any useful best practices specific to Postgres-based Data Warehousing.
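
For the batch-load point above, here is a minimal sketch of the drop-load-rebuild pattern; the table, index, column and file names are all made up for illustration:

    -- Hypothetical nightly load for a fact table; all names are illustrative.
    BEGIN;
    DROP INDEX IF EXISTS idx_fact_sales_customer;
    DROP INDEX IF EXISTS idx_fact_sales_date;

    -- Server-side COPY; use \copy from psql if the file lives on the client.
    COPY fact_sales (sale_id, customer_id, sale_date, amount)
        FROM '/staging/fact_sales.csv' WITH (FORMAT csv, HEADER true);

    CREATE INDEX idx_fact_sales_customer ON fact_sales (customer_id);
    CREATE INDEX idx_fact_sales_date     ON fact_sales (sale_date);
    COMMIT;

    -- Refresh planner statistics after the bulk load.
    ANALYZE fact_sales;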
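
And a minimal sketch of the kind of date dimension described above, built with generate_series; the column set and the assumption that the fiscal year starts in July are mine, not prescriptive:

    -- Hypothetical date dimension; fiscal year assumed to start in July
    -- and to be named after its ending calendar year.
    CREATE TABLE dim_date AS
    SELECT d::date                                          AS calendar_date,
           EXTRACT(year  FROM d)::int                       AS calendar_year,
           EXTRACT(month FROM d)::int                       AS calendar_month,
           EXTRACT(isodow FROM d)::int                      AS iso_day_of_week,
           date_trunc('week', d)::date                      AS week_start_date,
           EXTRACT(year FROM d + interval '6 months')::int  AS fiscal_year,
           to_char(d + interval '6 months', 'YYYY-MM')      AS fiscal_year_month
    FROM generate_series('2000-01-01'::date, '2030-12-31'::date,
                         interval '1 day') AS d;

    ALTER TABLE dim_date ADD PRIMARY KEY (calendar_date);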


    My experience (admittedly on a pretty small scale when it comes to data warehouses):

  • Like you mention, pre-aggregating data is easily the most important thing, as it reduces the amount of data that needs to be read by many orders of magnitude (a materialized-view sketch follows this list).
  • Avoid large numbers of small write transactions, subtransactions and savepoints. This includes exception handling in PL/pgSQL. These burn through the available "transaction ID" space quickly, and cause expensive "wraparound" vacuums that need to scan and freeze whole tables.
  • I found that partitioning tables such that each partition individually fits in the kernel's page cache is good for maintenance and migrations, if you ever need to do any. It means you can rebuild all the indexes on a partition with just one sequential scan from disk, instead of one scan per index (a partitioning sketch follows this list).
  • Like Chris already mentioned, be generous with work_mem and maintenance_work_mem; even when your workload doesn't fit in RAM, keeping more intermediate data in memory saves I/O and CPU time by enabling smarter query plans (most importantly HashAggregate).
  • If you need to do huge sorts, it can help to buy a dedicated SSD for storing the temporary files (a settings sketch covering this and the previous point follows the list).
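
To illustrate the pre-aggregation point, one common approach (a sketch only; the fact table and rollup columns are hypothetical, and this is not necessarily what the answer had in mind) is a materialized view that is refreshed after each batch load:

    -- Hypothetical daily rollup of the fact table used in the question's sketch.
    CREATE MATERIALIZED VIEW mv_daily_sales AS
    SELECT sale_date,
           customer_id,
           count(*)    AS order_count,
           sum(amount) AS total_amount
    FROM fact_sales
    GROUP BY sale_date, customer_id;

    -- CONCURRENTLY refresh requires a unique index on the materialized view.
    CREATE UNIQUE INDEX ON mv_daily_sales (sale_date, customer_id);

    -- Rebuild after each nightly load without blocking readers.
    REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_sales;

For the partitioning point, a minimal declarative-partitioning sketch (PostgreSQL 10+; the table name, monthly ranges and index are assumptions):

    -- Hypothetical range-partitioned fact table, one partition per month.
    CREATE TABLE fact_events (
        event_id    bigint        NOT NULL,
        customer_id integer,
        event_date  date          NOT NULL,
        amount      numeric(12,2)
    ) PARTITION BY RANGE (event_date);

    CREATE TABLE fact_events_2024_01 PARTITION OF fact_events
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    CREATE TABLE fact_events_2024_02 PARTITION OF fact_events
        FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

    -- An index declared on the parent is created on each partition (PostgreSQL 11+),
    -- so it can be rebuilt one partition at a time:
    CREATE INDEX ON fact_events (customer_id);
    REINDEX TABLE fact_events_2024_01;

For the work_mem / maintenance_work_mem and SSD-for-temp-files points, a sketch of session-level settings; the values and the fast_temp tablespace are placeholders that depend entirely on your RAM and concurrency, not recommendations:

    -- Per-session settings for a large reporting or maintenance job.
    SET work_mem = '512MB';              -- applies per sort/hash node, so mind concurrency
    SET maintenance_work_mem = '2GB';    -- used by CREATE INDEX, VACUUM, etc.

    -- Spill large sorts and hashes to a tablespace on a dedicated SSD.
    -- The tablespace must exist first, e.g.:
    --   CREATE TABLESPACE fast_temp LOCATION '/mnt/ssd/pg_temp';
    SET temp_tablespaces = 'fast_temp';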
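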
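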

  • From a memory management perspective, one of your largest differences is that you can often hope to keep the working OLTP set in memory, while this is not the case in OLAP environments. Additionally, your joined sets are very often bigger. This means higher work_mem settings can be very helpful, and to the extent tables are denormalized you can push work_mem a bit higher than it might be otherwise. I am not sure my advice on shared_buffers would change (I prefer to start low and increase, testing performance at each step), but work_mem certainly would need to increase if you are doing reporting on sets of any size (one per-role way to do this is sketched below).
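
One way to apply this without raising work_mem for every connection (a sketch; the role name and values are made up) is to set it per role for the reporting users, and to change the cluster-wide settings in small, tested steps:

    -- Hypothetical reporting role with a larger work_mem than the cluster default.
    ALTER ROLE reporting_user SET work_mem = '1GB';

    -- Cluster-wide settings, changed one cautious step at a time.
    -- shared_buffers needs a server restart; work_mem only a reload
    -- (SELECT pg_reload_conf()).
    ALTER SYSTEM SET shared_buffers = '8GB';
    ALTER SYSTEM SET work_mem = '64MB';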
