Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Worker Coordinator fails with PgBouncer due to unsupported statement_timeout #205

Closed
mfrister opened this issue Feb 16, 2024 · 10 comments
Closed

Comments

@mfrister
Copy link

Thanks for the extensive documentation on using river with PgBouncer!

I’m still having a small issue with river and PgBouncer - when the worker tries to connect, it fails (wrapped for better readability):

river client start: error making initial connection to database:
failed to connect to `host=... user=... database=...`:
server error (FATAL: unsupported startup parameter: statement_timeout (SQLSTATE 08P01))

PgBouncer doesn’t seem to support the statement_timeout parameter. Disabling the statement_timeout in the Notifier by removing the line makes the worker run. The comment above this line indicates that the timeout is necessary, though:

// Rely on an overall statement timeout instead of setting identical context timeouts on every query:

Side Note: PgBouncer does seem to have an option to ignore certain startup parameters, but I’m a bit wary of doing so, given that ignoring the statement timeout might have hard-to-notice side effects for other applications. I’d rather have the applications fail while connecting and adapt them however necessary.

My questions:

  1. Do you expect river to be negatively impacted by removing the statement timeout? I interpret the comment mentioned above as the statement timeout being required for reliable operation, although I'm wondering whether context-based timeouts would be necessary anyway to properly deal with network issues that would prevent river from receiving a response from Postgres. I found at least one context-based timeout in the Notifier, I'm not sure that's enough to cover all cases, though.
  2. If river is able to work reliably without the statement timeout, would you be open to making the timeout optional in some way or replace it with context-based timeouts where necessary?

I’m happy to submit a PR if you think a change in river makes sense.

@brandur
Copy link
Contributor

brandur commented Feb 17, 2024

Hi @mfrister, thanks for the report -- we'd intended to do more testing with PgBouncer to make sure everything looked good there, but hadn't gotten around to it yet.

One question before I start: what pooling mode are you using PgBouncer in?

I didn't test it, but reading through the features docs in PgBouncer [1], SET should be supported in session pooling mode. And if you aren't using session pooling mode, even if we take out the statement_timeout, the client's probably going to fail on something anyway. You can run an insert-only client in other other PgBouncer pooling modes, but when the client is trying to act as work coordinator, it's going to need session pooling or bust.

Do you expect river to be negatively impacted by removing the statement timeout? I interpret the comment mentioned above as the statement timeout being required for reliable operation, although I'm wondering whether context-based timeouts would be necessary anyway to properly deal with network issues that would prevent river from receiving a response from Postgres. I found at least one context-based timeout in the Notifier, I'm not sure that's enough to cover all cases, though.

Blake would probably have a different answer on this, but I think realistically, no. statement_timeout is generally a good idea, especially when you're protecting against any query that might potentially be written in an app, some of which may eventually become very unperformant, but everything the notifier does we'd expect to be pretty fast.

If river is able to work reliably without the statement timeout, would you be open to making the timeout optional in some way or replace it with context-based timeouts where necessary?

Yeah, we're not married to the statement_timeout there. Talked to Blake and we'll probably make the change (however see my note on pooling mode at the top) since most of the rest of the River code uses context timeouts anyway.

[1] https://www.pgbouncer.org/features.html

@andreirtaylor
Copy link

I'm assuming that this is related, but I have not been able to get river to work with supabase they use a proprietary connection pooling system called supavisor.

The error output from river is below when using session pooling.

Happy to help with any testing here and or opening up a new issue.

time=2024-02-25T11:52:38.976-05:00 level=ERROR msg="Scheduler: Error scheduling jobs" error="error deleting completed jobs: ERROR: prepared statement \"stmtcache_9233d4c846c6a2a54af178e1e500878088ef7296ba8c6bf0\" already exists (SQLSTATE 42P05)"
time=2024-02-25T11:52:39.275-05:00 level=ERROR msg="error attempting reelection" elector.err="ERROR: prepared statement \"stmtcache_e924bba8ef07b260df276cc58553fdf77a5d2ac1321679b8\" already exists (SQLSTATE 42P05)"
time=2024-02-25T11:52:39.504-05:00 level=ERROR msg="producer: Error fetching jobs" err="ERROR: prepared statement \"stmtcache_533cb1679bed1f4917f01ee331deb9347cca82532ad1cad9\" already exists (SQLSTATE 42P05)"
time=2024-02-25T11:52:40.416-05:00 level=ERROR msg="producer: Error fetching jobs" err="ERROR: prepared statement \"stmtcache_533cb1679bed1f4917f01ee331deb9347cca82532ad1cad9\" already exists (SQLSTATE 42P05)"
time=2024-02-25T11:52:41.393-05:00 level=ERROR msg="producer: Error fetching jobs" err="ERROR: prepared statement \"stmtcache_533cb1679bed1f4917f01ee331deb9347cca82532ad1cad9\" already exists (SQLSTATE 42P05)"
time=2024-02-25T11:52:42.405-05:00 level=ERROR msg="producer: Error fetching jobs" err="ERROR: prepared statement \"stmtcache_533cb1679bed1f4917f01ee331deb9347cca82532ad1cad9\" already exists (SQLSTATE 42P05)"
time=2024-02-25T11:52:43.381-05:00 level=ERROR msg="producer: Error fetching jobs" err="ERROR: prepared statement \"stmtcache_533cb1679bed1f4917f01ee331deb9347cca82532ad1cad9\" already exists (SQLSTATE 42P05)"
time=2024-02-25T11:52:43.977-05:00 level=ERROR msg="Scheduler: Error scheduling jobs" error="error deleting completed jobs: ERROR: prepared statement \"stmtcache_9233d4c846c6a2a54af178e1e500878088ef7296ba8c6bf0\" already exists (SQLSTATE 42P05)"
time=2024-02-25T11:52:44.394-05:00 level=ERROR msg="producer: Error fetching jobs" err="ERROR: prepared statement \"stmtcache_533cb1679bed1f4917f01ee331deb9347cca82532ad1cad9\" already exists (SQLSTATE 42P05)"
time=2024-02-25T11:52:44.644-05:00 level=ERROR msg="error attempting reelection" elector.err="ERROR: prepared statement \"stmtcache_e924bba8ef07b260df276cc58553fdf77a5d2ac1321679b8\" already exists (SQLSTATE 42P05)"
time=2024-02-25T11:52:45.374-05:00 level=ERROR msg="producer: Error fetching jobs" err="ERROR: prepared statement \"stmtcache_533cb1679bed1f4917f01ee331deb9347cca82532ad1cad9\" already exists (SQLSTATE 42P05)"
time=2024-02-25T11:52:46.418-05:00 level=ERROR msg="producer: Error fetching jobs" err="ERROR: prepared statement \"stmtcache_533cb1679bed1f4917f01ee331deb9347cca82532ad1cad9\" already exists (SQLSTATE 42P05)"

brandur added a commit that referenced this issue Feb 25, 2024
We had an issue opened a little while back (#171) wherein it turned out
that River couldn't operate with pgx's simple protocol turned on, which
appeared to be because the simple protocol doesn't know anything about a
target Postgres tuple's makeup and therefore had to guess how to encode
parameters based on input types, which was leading `[]byte` intended for
`jsonb`s to be binary encoded, which isn't allowed.

We closed that issue after I pointed out that PgBouncer now supports
prepared statement pooling, and therefore simple protocol is no longer
necessary.

We recently got another comment [1] where someone had tried to use River
with Supavisor on Supabase, which also does not supported prepared
statements, leading to lots of errors. Normally, pgx's simple protocol
could be used to address this, but as we saw in #171, River doesn't
support that.

Here, a proof of concept demonstrating how River could support simple
protocol by modeling `jsonb` fields in sqlc as strings, which prevents
pgx from encoding to binary with simple procotol, and we get to a point
where all driver tests are passing.

I'm not sure if we should action on this one or not. Having to fall back
to `string` fields a bit gross, although I don't *think* carries
significant performance penalty because it's basically the same
underlying structure as far as Go is concerned.

[1] #205 (comment)
@brandur
Copy link
Contributor

brandur commented Feb 25, 2024

Interesting. I've done a little investigating on this, and I'd love to get supavisor working, but it looks like it might be a bit of a yak shake.

Supavisor does support prepared statements now [1], but only in transaction mode, and of course the client needs to start in session mode.

Unfortunately, the only way to stop Pgx from using prepared statements is to have it downgrade to "simple protocol", and again unfortunately, we got a report in #171 that simple protocol doesn't work because sqlc tries to encode our jsonb input as binary, which Postgres does not allow.

So I think we could potentially try to solve this in two ways:

  1. We could try to force pgx to do the right thing with simple protocol, which basically involves modeling jsonbs as strings so that pgx won't binary encode their input. I put together a proof of concept of this in Proof of concept: River operating with pgx simple protocol #234, but it'd be a little more work to make sure it's the right move and vet everything.

  2. We could try to get the River client working in such a way that it doesn't require session pooling and can operate in transaction mode. This has a sizable downside in that we'd no longer be able to use listen/notify, and therefore it'd have to poll for leadership changes and new jobs. That might not be so bad though, and I've been thinking about adding a poll mode so that we could potentially interoperate with MySQL and SQLite which don't have listen/notify anyway.

The trouble with (1) is that even if we got that fixed, Supavisor also doesn't support listen/notify [2], so you'd probably just run into more problems.

(2) seems plausible, but it's a bit of a job, and it may not be the most important thing we should be working on at the moment.

[1] supabase/supavisor#207
[2] supabase/supavisor#85

@mfrister
Copy link
Author

Hi @brandur, thanks for your quick reply! Sorry it took me so long - I wasn’t working last week.

I’m using PgBouncer in session mode.

I looked a bit further into this and while PgBouncer supports SET in session mode, pgx passes RuntimeConfig values in a special connection startup message and not via the SQL SET command.

For the startup message, PgBouncer unfortunately only supports a very limited number of parameters and statement_timeout is not among them, so river fails to connect to PgBouncer.

Maybe setting a AfterConnect function on the pgxpool config that executes a SET command could replicate the same behaviour, but I haven’t tried whether this actually works. If you’re interested in this, I’m happy to look into this and implement it if it works.

Thanks for discussing and being open to using contexts instead of statement_timeout!

How would you like to continue?

@bgentry
Copy link
Contributor

bgentry commented Feb 26, 2024

@mfrister out of curiosity, does your setup give you any ability to access the raw (non-pgbouncer) url? One of the setups we’ve been considering for pgbouncer users is to allow specifying two pgx pools, or maybe one pool and an additional conn config: one for general use, and the latter solely for the notify listener.

However if you are ultimately just using Supabase and your setup doesn’t support listen/notify at all, this is a moot point and we probably can’t support this kind of infrastructure as of now. There’s a fair bit of stuff dependent on the pub/sub setup and probably will be more in the future. And although we’ve talked about ways to “bring your own pubsub” it’s probably not on the short term priority list.

@brandur
Copy link
Contributor

brandur commented Feb 26, 2024

Thanks for discussing and being open to using contexts instead of statement_timeout!

FWIW, the statement_timeout ended up getting refactored out in #212. I'm also curious to the answer to Blake's question though, because I suspect that even with that fixed, there may be more problems laying in wait.

@bgentry
Copy link
Contributor

bgentry commented Feb 26, 2024

Further context on my idea above: each River client that's working jobs needs a single dedicated Postgres connection that's used solely for listening for notifications. This is the only connection that actually needs to be able to support Postgres LISTEN functionality.

All of the other connections used within the client are for regular short-lived queries that can easily share a larger connection pool with the rest of your application and/or pass through pgbouncer in txn mode, but the listener's connection really needs either a direct DB connection or at least pgbouncer in session mode.

@mfrister
Copy link
Author

@bgentry Our setup currently doesn't allow me to access Postgres directly, only via PgBouncer. I'll only get a limited number of connections via the session-based PgBouncer, so I'm already using the session-based PgBouncer for the worker coordinator only and I'm using a separate connection pool for the workers themselves that uses a transaction-based PgBouncer. So as far as I understand, using the coordinator with a separate connection pool might already work well enough, but maybe it's still opening more connections than I expect and having a single connection for LISTEN makes sense (haven't put it into production yet, only in a test environment with a single worker process).

We're not using Supabase at all, so I'd expect us to be able to use NOTIFY/LISTEN. I've already tried the worker with the statement_timeout line commented out before and it picked up jobs promptly, so I think the LISTEN via our session-based PgBouncer works in general.

FWIW, the statement_timeout ended up getting refactored out in #212.

🎉 Thanks a lot! This probably solves this issue for me. I'll try the unreleased master tomorrow and report back whether it works.

@mfrister
Copy link
Author

mfrister commented Feb 28, 2024

I tested commit b2cb142 and it works with our PgBouncer/Postgres setup. Thanks again for the quick fix!

I'll close the issue, given that the issue I originally reported is now fixed. If I notice issues regarding too many connections for the session-based PgBouncer, I'll open a separate one. Feel free to reopen this, if you want to further discuss Supabase support or something else in here, of course.

@brandur
Copy link
Contributor

brandur commented Feb 28, 2024

@mfrister Ah that's great to hear. Thanks for the follow up.

I opened #238 separately for potential Supabase support.

brandur added a commit that referenced this issue Mar 3, 2024
We had an issue opened a little while back (#171) wherein it turned out
that River couldn't operate with pgx's simple protocol turned on, which
appeared to be because the simple protocol doesn't know anything about a
target Postgres tuple's makeup and therefore had to guess how to encode
parameters based on input types, which was leading `[]byte` intended for
`jsonb`s to be binary encoded, which isn't allowed.

We closed that issue after I pointed out that PgBouncer now supports
prepared statement pooling, and therefore simple protocol is no longer
necessary.

We recently got another comment [1] where someone had tried to use River
with Supavisor on Supabase, which also does not supported prepared
statements, leading to lots of errors. Normally, pgx's simple protocol
could be used to address this, but as we saw in #171, River doesn't
support that.

Here, a proof of concept demonstrating how River could support simple
protocol by modeling `jsonb` fields in sqlc as strings, which prevents
pgx from encoding to binary with simple procotol, and we get to a point
where all driver tests are passing.

I'm not sure if we should action on this one or not. Having to fall back
to `string` fields a bit gross, although I don't *think* carries
significant performance penalty because it's basically the same
underlying structure as far as Go is concerned.

[1] #205 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants