Huge deadlocks count on the last phase operations ALTER INDEX RENAME TO #35

Open
aulaev opened this issue Oct 31, 2020 · 6 comments

aulaev commented Oct 31, 2020

Hello, Maxim!

Thank you for a very useful tool. We are seeing deadlock errors in the PostgreSQL logs related to the last phase of pgcompacttable, the "REINDEX" step.
For example:

2020-10-31 21:05:37 MSK [21303] 172.20.2.144 PostgreSQL JDBC Driver queue2@queue2 40P01 UPDATE ERROR: deadlock detected
2020-10-31 21:05:37 MSK [21303] 172.20.2.144 PostgreSQL JDBC Driver queue2@queue2 40P01 UPDATE DETAIL: Process 21303 waits for RowExclusiveLock on relation 38747785 of database 16619; blocked by process 20639.
Process 20639 waits for AccessExclusiveLock on relation 39109132 of database 16619; blocked by process 21303.
Process 21303: UPDATE public.queue_message2 as x SET sended_at = now(), state='SENDED'
WHERE id = (
SELECT id
FROM public.queue_message2
WHERE queue_id = $1
AND state = 'QUEUED'
AND router_tag = $2
AND expired_at > now()
ORDER BY queue_id, state, router_tag,
priority DESC, queued_at ASC FOR UPDATE SKIP LOCKED LIMIT 1)
RETURNING x.*
Process 20639:
ALTER INDEX "public".pgcompact_index_20528 RENAME TO "pgcompact_index_14090";

Could any of the "Options controlling the behaviour", such as the --reindex-* options, have a positive effect here?
Maybe --print-reindex-queries would help in such a situation, so that we could perform the reindex manually later?
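For reference, the kind of invocation I have in mind is roughly the following (just a sketch, using the flag named above and the path from our installation):

/u01/postgres/pgcompacttable-master/bin/pgcompacttable --dbname queue2 --print-reindex-queries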

The problem arises only on database shards with a heavy DML load.


aulaev commented Oct 31, 2020

2020-10-31 21:05:28 MSK [20639] [local] [unknown] postgres@queue2 57014 ALTER INDEX ERROR: canceling statement due to statement timeout
2020-10-31 21:05:28 MSK [20639] [local] [unknown] postgres@queue2 57014 ALTER INDEX STATEMENT:
ALTER INDEX "public".pgcompact_index_20528 RENAME TO "idx_queue_message2_main";


aulaev commented Nov 1, 2020

I've scheduled hourly execution via crontab:
/u01/postgres/pgcompacttable-master/bin/pgcompacttable --dbname queue2 --verbose info --reindex-retry-count 5 --reindex-retry-pause 10 --reindex-lock-timeout 5000 -f

and I'm still getting errors:

[postgres@ ~]$ grep -c ERR /u01/postgres/main/log/postgresql-2020-11-01.log
26

[local]:5432 postgres@queue2=# \d+ queue_message2
Table "public.queue_message2"
...
Indexes:
"pk_queue_message1" PRIMARY KEY, btree (id)
"pgcompact_index_28802" UNIQUE, btree (id)
"idx_queue_message1_msg_id" btree (message_id)
"idx_queue_message2_cnt" btree (expired_at, queue_id) WHERE state = 'QUEUED'::text
"idx_queue_message2_content" btree (queue_id, state, router_tag, content_type, priority DESC, queued_at, expired_at, id)
"idx_queue_message2_expired" btree (state, expired_at, sended_at, id)
"idx_queue_message2_main" btree (queue_id, state, router_tag, priority DESC, queued_at, expired_at, id)
"idx_queue_message2_message_id" btree (queue_id, message_id) WHERE state = 'SENDED'::text
"idx_queue_message2_mt" btree (queue_id, state, router_tag, message_type, priority DESC, queued_at, expired_at, id)
"pgcompact_index_14090" btree (queue_id, state, router_tag, priority DESC, queued_at, expired_at, id)
"pgcompact_index_4300" btree (queue_id, state, router_tag, priority DESC, queued_at, expired_at, id)
"pgcompact_index_4803" btree (queue_id, state, router_tag, priority DESC, queued_at, expired_at, id)
"queue_message2_qualified_content_name_idx" btree (qualified_content_name)

It seems strange that the pgcompact_index_* indexes were not removed, and that all of them, plus their "master" index (idx_queue_message2_main), lead to deadlock errors on ALTER INDEX ... RENAME TO operations.

[local]:5432 postgres@queue2=# select * from pg_stat_user_indexes where indexrelname='idx_queue_message2_main';
-[ RECORD 1 ]-+------------------------
relid | 16632
indexrelid | 39485407
schemaname | public
relname | queue_message2
indexrelname | idx_queue_message2_main
idx_scan | 1652
idx_tup_read | 3945
idx_tup_fetch | 78


aulaev commented Nov 1, 2020

[local]:5432 postgres@queue2=# select * from pg_stat_user_indexes where relname='queue_message2';
relid | indexrelid | schemaname | relname | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
-------+------------+------------+----------------+-------------------------------------------+----------+--------------+---------------
16632 | 39485382 | public | queue_message2 | idx_queue_message2_message_id | 25 | 626576 | 0
16632 | 39485389 | public | queue_message2 | idx_queue_message2_cnt | 56 | 3689 | 0
16632 | 39485402 | public | queue_message2 | idx_queue_message1_msg_id | 683526 | 591117 | 377035
16632 | 39485403 | public | queue_message2 | pk_queue_message1 | 408 | 10 | 10
16632 | 39485405 | public | queue_message2 | pgcompact_index_28802 | 4752409 | 305091 | 272196
16632 | 39485406 | public | queue_message2 | queue_message2_qualified_content_name_idx | 0 | 0 | 0
16632 | 39485407 | public | queue_message2 | idx_queue_message2_main | 1652 | 3945 | 78
16632 | 39485408 | public | queue_message2 | pgcompact_index_14090 | 1251 | 771 | 63
16632 | 39485409 | public | queue_message2 | pgcompact_index_4300 | 1903 | 2173 | 93
16632 | 39485410 | public | queue_message2 | pgcompact_index_4803 | 3989263 | 3574874 | 318559
16632 | 39485423 | public | queue_message2 | idx_queue_message2_mt | 9799 | 5 | 2
16632 | 39485436 | public | queue_message2 | idx_queue_message2_expired | 594 | 10766611 | 17919
16632 | 39485437 | public | queue_message2 | idx_queue_message2_content | 724183 | 209192 | 77533
(13 rows)


aulaev commented Nov 1, 2020

[postgres@ ~]$ /u01/postgres/pgcompacttable-master/bin/pgcompacttable -V
/u01/postgres/pgcompacttable-master/bin/pgcompacttable v1.0.10

We're using the latest version of the tool.


aulaev commented Nov 3, 2020

One more type of error - canceling autovacuum task (it occurs only while pgcompacttable is running).

PG log:
2020-11-02 17:30:28 MSK [6018] ERROR: canceling autovacuum task
2020-11-02 17:30:28 MSK [6018] CONTEXT: automatic vacuum of table "queue2.public.queue_message2"

tool log:
[Mon Nov 2 17:30:03 2020] (queue2:public.queue_message2) Set pages/round: 5.
[Mon Nov 2 17:30:03 2020] (queue2:public.queue_message2) Set pages/vacuum: 626.
[Mon Nov 2 17:30:28 2020] (queue2:public.queue_message2) Vacuum final: cannot clean 32 pages, 7488 pages left, duration 2.411 seconds.
[Mon Nov 2 17:30:29 2020] (queue2:public.queue_message2) Analyze final: duration 0.297 second.
[Mon Nov 2 17:30:29 2020] (queue2:public.queue_message2) Bloat statistics with pgstattuple: duration 0.048 seconds.


Melkij commented Nov 5, 2020

Hello
For some applications it is important that the index names do not change. To achieve this, we first have to build a new index (named pgcompact_index_N), then swap the new and old indexes under the index rename lock (with a maximum lock duration of --reindex-lock-timeout; this is exactly the reason for the "canceling statement due to statement timeout" errors in the PG log). When we can't rename the index, we sleep for --reindex-retry-pause seconds and try the rename again and again, up to --reindex-retry-count retries. --print-reindex-queries will print the SQL that the tool uses for the reindex.
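Conceptually the swap looks something like this (a simplified sketch only, not the tool's literal output; pgcompact_index_N and pgcompact_index_tmp are placeholder names and the statement_timeout value is illustrative - --print-reindex-queries shows the exact statements for your indexes):

-- build a fresh copy of the bloated index under a temporary name
CREATE INDEX CONCURRENTLY pgcompact_index_N ON public.queue_message2
    (queue_id, state, router_tag, priority DESC, queued_at, expired_at, id);

-- swap the names under a short lock window (--reindex-lock-timeout)
BEGIN;
SET LOCAL statement_timeout TO '5s';
ALTER INDEX public.idx_queue_message2_main RENAME TO pgcompact_index_tmp;
ALTER INDEX public.pgcompact_index_N RENAME TO idx_queue_message2_main;
COMMIT;

-- drop the old, bloated copy (cannot be run inside a transaction block)
DROP INDEX CONCURRENTLY public.pgcompact_index_tmp;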

I didn't actually expect a deadlock here. I think that in this case there is no further error handling and pgcompacttable will leave the new index behind. I suggest checking all pgcompact_index_N indexes; most likely they are redundant and should be removed.
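For example, something along these lines lists the leftovers (the DROP uses one of the names from your output; please verify that each index really is a duplicate before dropping it):

SELECT schemaname, indexrelname
FROM pg_stat_user_indexes
WHERE relname = 'queue_message2'
  AND indexrelname LIKE 'pgcompact_index_%';

-- then, for each confirmed duplicate:
DROP INDEX CONCURRENTLY public.pgcompact_index_14090;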

I've scheduled hourly execution via crontab

I doubt this is a good idea at all. For regular maintenance it would be better to tune the autovacuum settings and investigate the root cause of your table (or index) bloat. The usual reason is long transactions combined with the default (i.e. lazy) autovacuum settings. Typically, PostgreSQL with appropriately tuned autovacuum will not need pgcompacttable (except after massive deletes).
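As an illustration only (the values are made up and must be adapted to your workload), autovacuum can be made more aggressive for a single table:

ALTER TABLE public.queue_message2 SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% of rows are dead instead of the 20% default
    autovacuum_vacuum_cost_limit = 1000     -- allow autovacuum to do more work per cost round
);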

One more type of error - canceling autovacuum task

Yes, this is expected behavior. pgcompacttable runs some commands (e.g. VACUUM, ANALYZE) that definitely conflict with autovacuum; autovacuum notices this and cancels itself with such a log message.

If you have a modern PostgreSQL (12 and above), it would be better to use the native REINDEX CONCURRENTLY command instead of the manual index rename. It works with indexes in the same way as pgcompacttable, but has no concurrent locking issues during the index swap. I am one of the people who worked on this feature in the PostgreSQL core, but unfortunately I still have no time to add support for it in pgcompacttable.
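For example, on 12+ it is simply (index and table names taken from this thread):

REINDEX INDEX CONCURRENTLY public.idx_queue_message2_main;
-- or rebuild all indexes of the table at once:
REINDEX TABLE CONCURRENTLY public.queue_message2;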
