long COMMIT duration, high I/O wait in postgres 9.1

2018-07-03 17:44:28

We are observing long COMMIT times in postgres logs and high I/O wait times. Postgres version PostgreSQL 9.1.14 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit

iotop shows following output

          TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
04:01:25 15676 be/4 postgres    0.00 B/s    0.00 B/s  0.00 % 99.99 % postgres: masked masked 10.2.21.22(37713) idle
04:01:16 15676 be/4 postgres    0.00 B/s    0.00 B/s  0.00 % 99.99 % postgres: masked masked 10.2.21.22(37713) idle
04:01:15 15675 be/4 postgres    0.00 B/s    0.00 B/s  0.00 % 99.99 % postgres: masked masked 10.2.21.22(37712) idle in transaction
04:00:51 15407 be/4 postgres  173.52 K/s    0.00 B/s  0.00 % 99.99 % postgres: masked masked 10.2.21.22(37670) idle
04:02:12 16054 be/4 postgres    0.00 B/s    0.00 B/s  0.00 % 96.63 % postgres: masked masked 10.2.21.22(37740) idle
04:04:11 16578 be/4 postgres    0.00 B/s   23.66 K/s  0.00 % 95.39 % postgres: masked masked 10.2.21.22(37793) idle
04:00:59 15570 be/4 postgres    0.00 B/s    0.00 B/s  0.00 % 85.27 % postgres: masked masked 10.2.21.22(37681) COMMIT
04:02:11 16051 be/4 postgres    0.00 B/s    0.00 B/s  0.00 % 80.07 % postgres: masked masked 10.2.21.22(37737) idle
04:01:23 15660 be/4 postgres    0.00 B/s   15.75 K/s  0.00 % 52.99 % postgres: masked masked 10.2.21.22(37693) idle
04:01:35 15658 be/4 postgres    0.00 B/s   39.42 K/s  0.00 % 39.18 % postgres: masked masked 10.2.21.22(37691) idle in transaction
04:01:59 15734 be/4 postgres 1288.75 K/s    0.00 B/s  0.00 % 30.35 % postgres: masked masked 10.2.21.22(37725) idle
04:01:02 15656 be/4 postgres    7.89 K/s    0.00 B/s  0.00 % 30.06 % postgres: masked masked 10.2.21.22(37689) idle
04:02:28 16064 be/4 postgres 1438.18 K/s   15.72 K/s  0.00 % 23.72 % postgres: masked masked 10.2.21.22(37752) SELECT
04:03:30 16338 be/4 postgres  433.52 K/s   15.76 K/s  0.00 % 22.59 % postgres: masked masked 10.2.21.22(37775) idle in transaction
04:01:43 15726 be/4 postgres    0.00 B/s    7.88 K/s  0.00 % 20.77 % postgres: masked masked 10.2.21.22(37717) idle
04:01:23 15570 be/4 postgres    0.00 B/s   15.75 K/s  0.00 % 19.81 % postgres: masked masked 10.2.21.22(37681) idle
04:02:51 16284 be/4 postgres  441.56 K/s    7.88 K/s  0.00 % 17.11 % postgres: masked masked 10.2.21.22(37761) idle
04:03:39 16343 be/4 postgres  497.22 K/s   63.14 K/s  0.00 % 13.77 % postgres: masked masked 10.2.21.22(37780) idle
04:02:40 16053 be/4 postgres  204.88 K/s   31.52 K/s  0.00 % 11.31 % postgres: masked masked 10.2.21.22(37739) BIND
04:01:13 15646 be/4 postgres    0.00 B/s   47.24 K/s  0.00 % 11.17 % postgres: masked masked 10.2.21.22(37682) BIND
04:01:13 15660 be/4 postgres   94.49 K/s    0.00 B/s  0.00 % 10.80 % postgres: masked masked 10.2.21.22(37693) COMMIT

Commit times go up to 60 seconds during peak times. The issue started a week ago and it appears to happen during first minute of every hour. There was no change to application. There are no batch jobs running at that time that can cause this issue. We have eliminated that by stopping all jobs/crawling processes.

We have used pg_repack to remove bloats from 99% tables. The slow COMMIT operation is on a table that does not have bloats anymore.

RAID10 configuration is used. Storage is magnetic EBS. synchronous commit is on. Postgres is using fdatasync(). AWS supports claims that storage is healthy.

strace showed bunch of semop calls taking lot of times and only one slow fdatasync call.

$ egrep "<[0-9][0-9]." t.*
t.31944:1479632446.159939 semop(6029370, {{11, -1, 0}}, 1) = 0 <15.760687>
t.32000:1479632447.872642 semop(6127677, {{0, -1, 0}}, 1) = 0 <14.095245>
t.32001:1479632444.780242 semop(6094908, {{15, -1, 0}}, 1) = 0 <17.113239>
t.32151:1479632493.655164 select(8, [3 6 7], NULL, NULL, {60, 0}) = 1 (in [3], left {46, 614240}) <14.339090>
t.32198:1479632451.200194 semop(5963832, {{7, -1, 0}}, 1) = 0 <11.095583>
t.32200:1479632445.740529 semop(6094908, {{13, -1, 0}}, 1) = 0 <16.153911>
t.32207:1479632451.329028 semop(6062139, {{7, -1, 0}}, 1) = 0 <10.970497>
t.32226:1479632446.384585 semop(6029370, {{8, -1, 0}}, 1) = 0 <15.565608>
t.32289:1479632451.044155 fdatasync(106)        = 0 <10.849081>
t.32289:1479632470.284825 semop(5996601, {{14, -1, 0}}, 1) = 0 <10.686889>
t.32290:1479632444.608746 semop(5963832, {{8, -1, 0}}, 1) = 0 <17.284606>
t.32301:1479632445.757671 semop(6127677, {{8, -1, 0}}, 1) = 0 <16.137046>
t.32302:1479632445.504563 semop(6094908, {{4, -1, 0}}, 1) = 0 <16.389120>
t.32303:1479632445.889161 semop(6029370, {{6, -1, 0}}, 1) = 0 <16.005659>
t.32304:1479632446.377368 semop(6062139, {{12, -1, 0}}, 1) = 0 <15.554953>
t.32305:1479632448.269680 semop(6062139, {{14, -1, 0}}, 1) = 0 <13.717228>
t.32306:1479632450.465661 semop(5963832, {{3, -1, 0}}, 1) = 0 <11.783744>
t.32307:1479632448.959793 semop(6062139, {{8, -1, 0}}, 1) = 0 <13.289375>
t.32308:1479632446.948341 semop(6062139, {{10, -1, 0}}, 1) = 0 <15.001958>
t.32315:1479632451.534348 semop(6127677, {{12, -1, 0}}, 1) = 0 <10.765300>
t.32316:1479632450.209942 semop(6094908, {{3, -1, 0}}, 1) = 0 <12.039340>
t.32317:1479632451.032158 semop(6094908, {{7, -1, 0}}, 1) = 0 <11.217471>
t.32318:1479632451.088017 semop(5996601, {{12, -1, 0}}, 1) = 0 <11.161855>
t.32320:1479632452.161327 semop(5963832, {{14, -1, 0}}, 1) = 0 <10.138437>
t.32321:1479632451.070412 semop(5963832, {{13, -1, 0}}, 1) = 0 <11.179321>

pg_test_fsync output is available.

Any other pointers are greatly appreciated. Thanks!

Resolved the issue by making following changes.

Moved master DB to EBS optimized instance

Backed by SSD

Using provisioned IOPS

used pg_repack to remove bloats

链接地址: http://www.djcxy.com/p/93990.html

上一篇: 使用输入流阅读器读取时缺少行

下一篇: 长COMMIT持续时间，Postgres 9.1中的高I / O等待