
CNPG Recipe 19 – Finer Control Over Postgres Startup with Probes

CloudNativePG 1.26 introduces enhanced support for Kubernetes startup probes, giving users finer control over how and when PostgreSQL instances are marked as “started.” This article explores the new capabilities, including both basic and advanced configuration modes, and explains the different probe strategies—such as pg_isready, SQL query, and streaming for replicas. It provides practical guidance for improving the reliability of high-availability Postgres clusters by aligning startup conditions with actual database readiness.


CloudNativePG 1.26 introduces an enhanced implementation of Kubernetes startup and readiness probes for each Postgres pod in a high-availability cluster—comprising a primary instance and any number of optional replicas (also known as standby instances). This improvement gives Kubernetes more accurate insight into when a Postgres instance should be considered started and, more importantly, ready to accept connections from applications or to be promoted following a failover.

In this article, I’ll focus on the new startup probe, while deferring discussion of the updated readiness probe to a future post.


Understanding Startup Probes #

Startup probes, generally available since Kubernetes 1.20, provide better control over workloads that take time to initialise by delaying the execution of liveness and readiness probes until the application is truly up. If a startup probe fails, Kubernetes terminates the container and attempts to restart it.

Before startup probes, the typical workaround was to configure an initial delay in the readiness probe—an imprecise and often unreliable solution.

Startup probes are as configurable as liveness and readiness probes, using the same key parameters:

  • failureThreshold – The number of consecutive failures allowed before the container is considered to have failed startup and is restarted.
  • periodSeconds – How often (in seconds) to perform the probe.
  • successThreshold – The number of consecutive successes required for the probe to be considered successful (typically set to 1).
  • timeoutSeconds – The number of seconds after which the probe times out if no response is received.

These parameters allow for fine-tuned behaviour based on the expected startup characteristics of your application.

As with other probes, the outcome (success or failure) is determined by invoking a specific endpoint or command defined in the container’s configuration.
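
To make this concrete, here is a minimal sketch of how a startup probe looks in a plain Kubernetes Pod manifest. The pod name, image, and endpoint below are hypothetical, and, as shown later, CloudNativePG generates the probe configuration for you, so you never write this by hand for a Postgres instance:

apiVersion: v1
kind: Pod
metadata:
  name: slow-app                                   # hypothetical pod, for illustration only
spec:
  containers:
    - name: app
      image: registry.example.com/slow-app:latest  # hypothetical image
      startupProbe:
        httpGet:
          path: /healthz                           # hypothetical health endpoint
          port: 8080
        periodSeconds: 10                          # probe every 10 seconds
        failureThreshold: 30                       # up to 30 failures (5 minutes) before restart
        successThreshold: 1                        # a single success marks the container as started
        timeoutSeconds: 5                          # each attempt times out after 5 seconds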

For more details, refer to the official Kubernetes documentation on probes.

Rather than reinventing the wheel with custom health checks at the PostgreSQL layer, CloudNativePG—staying true to its cloud-native principles—builds on Kubernetes’ native probe framework to manage the health of individual Postgres instances and the overall cluster. This article, along with the ones that follow, aims to explain how these probes work and how to make the most of them in your deployments.


Why Startup Probes Matter for Postgres #

Postgres is a prime example of a workload that can require significant time to start—especially in high-availability setups. Whether it’s a primary or a replica, startup time can vary depending on factors such as crash recovery, recovery from a restartpoint, or the amount of WAL that must be replayed to reach a consistent state. Replicas in particular may take longer to become ready, especially when replaying WAL segments after a restart.

In the early days of EDB’s Cloud Native Postgres operator (the proprietary predecessor of CloudNativePG), this behaviour often led to problems: replicas would fail to become ready within the configured window and be prematurely restarted by Kubernetes.

With startup probes, CloudNativePG can now give Kubernetes a clearer, more accurate signal about a pod’s initialisation state—reducing unnecessary restarts and improving cluster stability.

How CloudNativePG Implements Startup Probes #

CloudNativePG offers two modes for configuring startup probes, giving users the choice between simplicity and fine-grained control:

  • Basic mode – A simplified configuration that automatically sets key probe parameters based on the value of .spec.startDelay.
  • Advanced mode – A fully customisable setup using the .spec.probes.startup stanza.

By default, CloudNativePG uses pg_isready to check whether a PostgreSQL instance has started successfully (more on this later).

The following diagram provides a simplified flow of the startup probe lifecycle, illustrating how its parameters interact and how it fits into the broader context of liveness and readiness probes.

A simplified view of the startup probe implemented by CloudNativePG

Basic Mode: Simplicity Through startDelay #

The default and recommended approach is to use the startDelay parameter defined in the Cluster specification. This value is expressed in seconds and is set to 3600 (1 hour) by default—ensuring sufficient time for even complex or delayed PostgreSQL startup scenarios, such as replica recovery.

In this mode, CloudNativePG preconfigures the startup probe with the following values:

  • periodSeconds: 10 seconds – the probe runs every 10 seconds
  • successThreshold: 1 – only one successful check is required
  • timeoutSeconds: 5 seconds – the probe will wait up to 5 seconds for a response

The failureThreshold is automatically calculated using the formula:

failureThreshold = startDelay / periodSeconds

With the default startDelay of 3600 seconds and periodSeconds of 10, this results in:

failureThreshold = 3600 / 10 = 360

This means Kubernetes will wait up to an hour (360 failed probes at 10-second intervals) before declaring the container as failed and restarting it, in accordance with the pod’s restart policy (Always with CloudNativePG).

The following example creates a PostgreSQL cluster with a startDelay value of 600 seconds (10 minutes), using the basic startup probe configuration mode:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: freddie
spec:
  instances: 1

  storage:
    size: 1Gi

  startDelay: 600

Once the cluster has been created, you can inspect the startupProbe configuration applied to the container in the primary pod (freddie-1) with the following command:

kubectl get pod freddie-1 \
  -o jsonpath='{.spec.containers[*].startupProbe}' | jq

The output should be similar to:

{
  "failureThreshold": 60,
  "httpGet": {
    "path": "/startupz",
    "port": 8000,
    "scheme": "HTTPS"
  },
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 5
}

Note that failureThreshold is 60: the startDelay of 600 seconds divided by the periodSeconds of 10, as per the formula above. While this approach is sufficient for most use cases—providing a reliable and consistent startup window with minimal configuration—the remainder of this article focuses on the advanced method for those who require greater control and customisation.

Advanced Mode: Full Probe Customisation #

For use cases that require more granular control, CloudNativePG provides the .spec.probes.startup stanza. This allows you to explicitly define all the startup probe parameters introduced earlier in this article.

The following configuration instructs Kubernetes to:

  • Probe the container every 5 seconds (periodSeconds)
  • Allow up to 120 consecutive failures (failureThreshold)—equivalent to 10 minutes (600 seconds)—before considering the startup probe failed
  • Restart the container if it hasn’t successfully started within that time

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: freddie
spec:
  instances: 1

  storage:
    size: 1Gi

  probes:
    startup:
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 120
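
Once the cluster is running, you can inspect the resulting startupProbe on the pod, exactly as in the basic mode example, to confirm that these values have been applied:

kubectl get pod freddie-1 \
  -o jsonpath='{.spec.containers[*].startupProbe}' | jq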

This approach is ideal when the default settings are not suitable for your workload, in particular when you need to change the periodSeconds option.

As you may have noticed, these settings apply uniformly to all PostgreSQL instance pods—whether they are primaries or standbys. Now, let’s explore what additional aspects you can customise in your setup to fine-tune behaviour even further.

Probe Strategies #

By default, CloudNativePG uses PostgreSQL’s pg_isready utility to determine whether a PostgreSQL instance—primary or replica—is accepting connections. If pg_isready returns 0, the probe is considered successful; any other return code results in a failure.

The strategy can be changed through the .spec.probes.startup.type parameter.

Type: pg_isready (default) #

This is the default behaviour for both primary and replica pods and requires no additional configuration.
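
If you want to get a feel for what this check amounts to, you can run pg_isready manually inside the instance pod. This is only a manual approximation; the options CloudNativePG passes to its internal check may differ:

kubectl exec freddie-1 -- pg_isready -h localhost -p 5432
echo $?   # 0 means the server is accepting connections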

Type: query #

Alternatively, you can configure the probe to execute a basic ping SQL query on the local postgres database (see query.go).

To enable this mode, set .spec.probes.startup.type to query, as shown in the following example:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: freddie
spec:
  instances: 1

  storage:
    size: 1Gi

  probes:
    startup:
      type: query
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 120

This approach works for both primaries and replicas and may be preferred in scenarios where a successful connection isn’t enough—you also want to ensure the instance can respond to a minimal query.
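
A rough manual equivalent of what this strategy verifies is running a trivial statement against the local postgres database (the actual query used by the operator lives in query.go and may differ), for example:

kubectl exec freddie-1 -- psql -d postgres -tAc 'SELECT 1'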

Type: streaming (replica-only) #

CloudNativePG also provides a third probe type, streaming, specifically designed for replicas.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: freddie
spec:
  instances: 3

  storage:
    size: 1Gi

  probes:
    startup:
      type: streaming
      maximumLag: 32Mi
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 120

In this example, a replica is considered started only when it is actively streaming from the primary and its replication lag falls below a defined threshold. This threshold is controlled via the maximumLag parameter, which specifies the acceptable lag in bytes (here set to 32MB, the equivalent of two WAL segments at the default 16MB segment size).

If maximumLag is not set (the default behaviour), the replica is marked as started as soon as it begins streaming from the primary, regardless of its lag.
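
To get a sense of a reasonable maximumLag value for your workload, you can manually inspect the current replication lag of each replica from the primary (here assumed to be freddie-1). This is only a manual check from the primary’s point of view, not the query the probe itself runs:

kubectl exec freddie-1 -- psql -d postgres -c \
  "SELECT application_name,
          pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
     FROM pg_stat_replication;"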

I strongly recommend considering this probe strategy for production-grade high-availability setups. It ensures that a replica is marked as started not merely when it accepts connections, but when it is meaningfully synchronised with the primary: attached, actively streaming, and within whatever lag tolerance you require. This leads to a more reliable failover experience and better application consistency.

Key Takeaways #

Startup probes in Kubernetes are essential for ensuring that PostgreSQL instances—especially replicas—aren’t prematurely marked as healthy. With CloudNativePG 1.26, you now have the flexibility to define exactly what “started” means for your workloads, whether that’s simply accepting connections, executing a basic query, or actively streaming with minimal lag.

For most production environments, the new streaming probe strategy for replicas is worth adopting. It allows you to delay declaring a replica as started until it’s actually useful for failover, leading to more robust and predictable high-availability behaviour.

In short: startup probes are no longer a checkbox—they’re a tuning instrument. In the next article, I’ll walk you through readiness probes and how they complement startup probes to ensure your PostgreSQL workloads are both ready and resilient.


Stay tuned for the upcoming recipes! For the latest updates, consider subscribing to my LinkedIn and Twitter channels.

If you found this article informative, feel free to share it within your network on social media using the provided links below. Your support is immensely appreciated!

Cover Picture: “Baby Sri Lankan Elephant”.