Running Elixir apps on GKE at scale with PostgreSQL backend using PGBouncer

Bhuvanesh
Published in Searce · Nov 20, 2020 · 5 min read

Elixir, which runs on the Erlang VM, has been on the programming language scene since 2012. It's a dynamic, functional language designed for building scalable and maintainable applications. We recently migrated an Elixir-heavy environment from a well-known public cloud to Google Cloud Platform. As part of the app and infra modernization, we moved all of the application servers from VMs to Kubernetes (GKE clusters). In this blog, we'll look at how the Elixir apps maintain their backend PostgreSQL connections, the main bottleneck with application-level connection pooling, and finally how PgBouncer solved our scaling issues.

Moving to the GKE clusters:

Elixir apps are compatible with Kubernetes. In the initial phase, we just did the GKE cluster deployment, then started unit tests with a bunch of pods; the backend database was PostgreSQL (Cloud SQL, version 11). On the application side, we used the connection settings below.

pool_size: 20
queue_target: 2_000
queue_interval: 1_000
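These options live in the Ecto repo configuration. A minimal sketch of where they go (the app and repo names here are hypothetical, not from our setup):

```elixir
# config/runtime.exs -- app/repo names are hypothetical
import Config

config :core, Core.Repo,
  hostname: System.get_env("POSTGRES_DB_HOST"),
  database: "coredb",
  # Each pod holds its own pool of 20 connections.
  pool_size: 20,
  # Target queue wait (ms) before DBConnection treats the pool as overloaded.
  queue_target: 2_000,
  # Interval (ms) over which queue wait times are measured.
  queue_interval: 1_000
```

With DBConnection, queue_target and queue_interval control when queued checkout requests start being dropped, which matters later in this story.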

For the first set of basic tests, it worked well. But when we started the stress test with a decent amount of traffic, we got more database-related errors. Most of the error messages were related to connection drops.

tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)

Then we pushed more traffic to see what was happening on the backend: the database connections started getting throttled because the GKE cluster provisioned more pods, and with pool_size: 20 every new pod opens 20 more backend connections, so the pod count alone can exhaust PostgreSQL's max_connections.

now psql: error: could not connect to server: FATAL:  Sorry, too many clients already

PgBouncer:

We realized that this modernization needed a robust connection pooler for the PostgreSQL database to handle the traffic from multiple pods. So we deployed PgBouncer, an open-source connection pooler for PostgreSQL. To ensure availability, we deployed two PgBouncer instances on Compute Engine (in different zones) and put them behind a TCP load balancer.

After this PgBouncer setup, we never faced database-level connection issues again. Here are the initial PgBouncer settings.

[databases]
proddb_1 = host=10.10.10.5 port=5432

[pgbouncer]
server_reset_query = DISCARD ALL
pool_mode = transaction
max_client_conn = 4000
min_pool_size = 15
default_pool_size = 20
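The application then connects to the TCP load balancer in front of the two PgBouncer VMs instead of Cloud SQL directly. A sketch of the change on the Ecto side (the load balancer address is hypothetical; 6432 is PgBouncer's default listen_port):

```elixir
# config/runtime.exs -- app/repo names and the VIP address are hypothetical
import Config

config :core, Core.Repo,
  # Load balancer VIP in front of the PgBouncer VMs.
  hostname: "10.10.10.100",
  # PgBouncer's default listen_port; Cloud SQL itself stays on 5432.
  port: 6432
```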

After the necessary changes, we started testing again, but this time we ran into a different issue. The error occurred for only some sessions; almost 50% of the sessions ran fine.

ERROR 08P01 (protocol_violation) bind message supplies 4 parameters, but prepared statement "ecto_323" requires 2

We identified that it was coming from prepared statements: under transaction pooling, consecutive transactions from different clients can land on the same server connection, so a statement prepared by one client was not being deallocated and was being reused with another client's parameters. We then tried session as the pool mode, but again ran into a prepared-statement issue; this time the error said the prepared statement itself didn't exist.

ERROR 26000 (invalid_sql_statement_name) prepared statement "ecto_850" does not exist

Then we started looking for similar issues on the Elixir/Postgrex GitHub repo, and we found that many folks had problems with PgBouncer integration.

  1. https://github.com/elixir-ecto/postgrex/issues/219
  2. https://github.com/elixir-ecto/postgrex/issues/297

On issue #219 they added a link to Ecto's documentation, which describes the settings for integrating with PgBouncer.

After adding the prepare: :unnamed parameter on the application side, we changed the settings below in PgBouncer. These changes solved the prepared-statement issues.

pool_mode = transaction
server_reset_query = DEALLOCATE ALL;
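On the Elixir side, the corresponding Postgrex option makes Ecto use unnamed prepared statements, so no named statement can outlive its own use on a pooled connection. A sketch (app/repo names are hypothetical):

```elixir
# config/runtime.exs -- app/repo names are hypothetical
import Config

config :core, Core.Repo,
  # Use unnamed prepared statements so they are safe behind
  # PgBouncer's transaction pooling.
  prepare: :unnamed
```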

Database level optimization:

But unfortunately, these changes still didn't solve the connection-closed error.

Erlang error: {:timeout, {Task, :await, [%Task{owner: #PID<0.30705.1>, pid: #PID<0.30754.1>, ref: #Reference<0.1369119585.2451046402.239841>}, 5000]}}
Last message: {:continue, :start}
08:56:58.111 [error] GenServer Oban.Queue.Refund_transaction.Producer terminating
** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 1994ms.

Then we started zooming into the database. One of the issues on the database side was an UPDATE query that ran for more than 30 seconds, and it was running in many sessions.

(Charts from the unit test: number of updates, and number of locks.)
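Long-running queries like this can be spotted from pg_stat_activity. A sketch using Ecto's raw SQL interface (the repo name is hypothetical; Repo.query/2 is provided by Ecto.Adapters.SQL):

```elixir
# List backend sessions whose current query has been running
# for more than 30 seconds. Core.Repo is a hypothetical repo name.
{:ok, %{rows: rows}} =
  Core.Repo.query("""
  SELECT pid, now() - query_start AS runtime, state, query
  FROM pg_stat_activity
  WHERE state <> 'idle'
    AND now() - query_start > interval '30 seconds'
  ORDER BY runtime DESC
  """)

for [pid, runtime, state, query] <- rows do
  IO.puts("#{pid} #{inspect(runtime)} #{state}: #{query}")
end
```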

We also found that we needed to create the right indexes for a couple of tables; based on the query analysis, we created them. Finally, almost all the errors were resolved. But we still weren't production-ready: we had tuned the database, PgBouncer, and the GKE cluster, and the next step was application-level tuning.
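As an illustration of the index work, an Ecto migration can build an index concurrently so it doesn't block writes on a live table (the table and column names below are hypothetical; the real ones came from our query analysis):

```elixir
defmodule Core.Repo.Migrations.AddTransactionsStatusIndex do
  use Ecto.Migration

  # Required so the index can be built CONCURRENTLY, outside a transaction.
  @disable_ddl_transaction true
  @disable_migration_lock true

  def change do
    # Hypothetical table/columns chosen to match the slow UPDATE's WHERE clause.
    create index(:transactions, [:status, :updated_at], concurrently: true)
  end
end
```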

Elixir — DB Connection parameter:

From the monitoring dashboard, we identified that the maximum execution time for a query was just 2 seconds; it never went above that. Also, PgBouncer's show pools; system view showed more than 200 connections waiting, with the oldest waiting 2 seconds. Then we realized something at the client level was giving up after 2 seconds: with queue_target: 2_000, DBConnection drops requests that sit in the checkout queue for longer than about 2 seconds.

Finally, we made some changes to the database connection parameters. Here are our final app-level settings.

adapter: Ecto.Adapters.Postgres,
database: "coredb",
username: "${POSTGRES_DB_USER}",
password: "${POSTGRES_DB_PASSWORD}",
hostname: "${POSTGRES_DB_HOST}",
port: "${POSTGRES_DB_PORT}",
prepare: :unnamed,
pool_size: 100,
timeout: 65_000,
queue_target: 2_000,
queue_interval: 10_000

We ran many stress tests without seeing any connection breaks, so we settled on these values as optimal for us. The PgBouncer settings were modified as below (we were running 2 PgBouncer VMs).

[databases]
coredb = host=10.10.10.5 port=5432 pool_size=120 reserve_pool=10
processdb = host=10.10.10.8 port=5432 pool_size=110 reserve_pool=20
timescaledb = host=10.10.10.20 port=5432 pool_size=110 reserve_pool=20

[pgbouncer]
pool_mode = transaction
server_reset_query = DEALLOCATE ALL;
ignore_startup_parameters = extra_float_digits
server_check_query = select 1
max_client_conn = 4000
reserve_pool_timeout = 5

Here is the database connection flow: GKE pods → TCP load balancer → PgBouncer VMs → Cloud SQL PostgreSQL.


Conclusion:

The migration was smooth, and PgBouncer is a mandatory layer for a large-scale Elixir deployment to handle connections well. We also tried Pgpool-II (to leverage features like caching and read/write splitting), but there seems to be no official Elixir support for it, and it caused issues in our tests. So we went live with PgBouncer.

Right now we run this whole setup with PgBouncer on VMs, but the next phase will be either deploying PgBouncer as a sidecar in our GKE clusters that talks directly to the PostgreSQL database, or keeping the PgBouncer sidecars but having them talk to a centralized PgBouncer that then connects to PostgreSQL.

Hope you found it useful! :)
