Health checks and circuit breakers

Uses: Kong Gateway

How does a health check determine health?

Determining health for Targets

Any request to a Target can produce a TCP error, a timeout, or an HTTP status code. The health checker uses this outcome to determine whether a Target is healthy or unhealthy.

For active checks, this information is gathered by an active probe
For passive checks, this information is gathered from a proxied request

Based on the gathered data, the health checker updates a series of internal counters:

If the returned status code is one configured as healthy, it increments the Successes counter for the Target and clears all its other counters
If the connection fails, it increments the TCP failures counter for the Target and clears the Successes counter
If the request times out, it increments the Timeouts counter for the Target and clears the Successes counter
If the returned status code is one configured as unhealthy, it increments the HTTP failures counter for the Target and clears the Successes counter

If the TCP failures, HTTP failures, or Timeouts counter reaches its configured threshold, the Target is marked as unhealthy.

If the Successes counter reaches its configured threshold, the Target will be marked as healthy.
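
The counter rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not Kong Gateway's implementation: the class name `TargetHealth`, the outcome labels `"tcp_error"` and `"timeout"`, and the threshold values are all hypothetical. The sketch also assumes that a status code in neither list leaves the counters untouched, and that a threshold of 0 disables that counter.

```python
from dataclasses import dataclass, field

FAILURE_COUNTERS = ("tcp_failures", "timeouts", "http_failures")


@dataclass
class TargetHealth:
    # Illustrative thresholds and status lists, not Kong Gateway defaults.
    successes_threshold: int = 2
    failure_thresholds: dict = field(
        default_factory=lambda: {"tcp_failures": 2, "timeouts": 2, "http_failures": 3}
    )
    healthy_statuses: frozenset = frozenset({200, 302})
    unhealthy_statuses: frozenset = frozenset({429, 500, 503})
    healthy: bool = True
    counters: dict = field(
        default_factory=lambda: {
            "successes": 0, "tcp_failures": 0, "timeouts": 0, "http_failures": 0,
        }
    )

    def _record_failure(self, name):
        # A failure increments its own counter and clears Successes.
        self.counters[name] += 1
        self.counters["successes"] = 0
        threshold = self.failure_thresholds[name]
        if threshold > 0 and self.counters[name] >= threshold:
            self.healthy = False

    def _record_success(self):
        # A success clears every failure counter and increments Successes.
        for name in FAILURE_COUNTERS:
            self.counters[name] = 0
        self.counters["successes"] += 1
        if 0 < self.successes_threshold <= self.counters["successes"]:
            self.healthy = True

    def report(self, outcome):
        """Record one outcome: "tcp_error", "timeout", or an HTTP status code."""
        if outcome == "tcp_error":
            self._record_failure("tcp_failures")
        elif outcome == "timeout":
            self._record_failure("timeouts")
        elif outcome in self.healthy_statuses:
            self._record_success()
        elif outcome in self.unhealthy_statuses:
            self._record_failure("http_failures")
        # Assumption: any other status code is ignored.
        return self.healthy


target = TargetHealth()
for outcome in [500, 500, 200, 500, 500, 500]:
    target.report(outcome)
print(target.healthy)  # False: three consecutive 500s after the 200 reset the count
```

Note how the single 200 in the middle resets the HTTP failures counter, so it takes three further failures to reach the threshold.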

The lists of HTTP status codes considered healthy or unhealthy, and the thresholds for each of these counters, are configurable per Upstream. You can find all of the default values for an Upstream in the Upstream schema.

Notes:

Unhealthy Targets won’t be removed from the load balancer, and won’t have any impact on the balancer layout when using a hashing algorithm. Instead, they will just be skipped.

Health checks operate only on enabled Targets and don’t modify the status of a Target in the Kong Gateway database.

The DNS caveats also apply to health checks. If using hostnames for the Targets, then make sure the DNS server always returns the full set of IP addresses for a name, and does not limit the response.

Determining health for Upstreams

The health of an Upstream is determined based on the status of its Targets. You can configure the threshold for a healthy Upstream using its healthchecks.threshold parameter. This sets the minimum percentage of available Target weight (capacity) that the Upstream needs to be considered healthy.

If the available capacity percentage of an Upstream is less than the configured threshold, the Upstream is considered unhealthy and Kong Gateway will respond to requests to the Upstream with 503 Service Unavailable.

Here is a simple example:

You have an Upstream configured with healthchecks.threshold=55
The Upstream has 5 Targets, each with weight=100, so the total weight in the ring balancer is 500
Each Target represents 20% of the total available capacity

In this scenario, the Upstream can handle losing 2 of its 5 Targets, as it will then be working at 60% capacity, which is still higher than the configured threshold of 55%. Once a third Target becomes unhealthy, the capacity drops to 40%, and the Upstream itself becomes unhealthy as well.
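
The arithmetic in this example can be checked with a short Python sketch. The function names and the `(weight, healthy)` tuple representation are hypothetical, chosen only for illustration. The comparison follows the rule stated above: the Upstream is unhealthy when the available capacity is less than the threshold.

```python
def available_capacity_percent(targets):
    """targets is a list of (weight, healthy) pairs."""
    total = sum(weight for weight, _ in targets)
    if total == 0:
        return 0.0
    available = sum(weight for weight, healthy in targets if healthy)
    return 100.0 * available / total


def upstream_is_healthy(targets, threshold):
    # Unhealthy only when available capacity drops below the threshold.
    return available_capacity_percent(targets) >= threshold


def five_targets(unhealthy_count):
    # Five Targets of weight 100; the first `unhealthy_count` are unhealthy.
    return [(100, index >= unhealthy_count) for index in range(5)]


print(available_capacity_percent(five_targets(2)))  # 60.0
print(upstream_is_healthy(five_targets(2), 55))     # True
print(available_capacity_percent(five_targets(3)))  # 40.0
print(upstream_is_healthy(five_targets(3), 55))     # False
```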

Once it enters an unhealthy state, the Upstream will only return errors. This lets the Targets recover from the cascading failures they were experiencing.

When the Targets start recovering and the Upstream’s available capacity passes the threshold again, the health status of the ring balancer is automatically updated and the Upstream is reactivated.

Active health checks

Active health checks actively probe Targets for their health. When active health checks are enabled in an Upstream entity, Kong Gateway periodically issues HTTP or HTTPS requests to a configured path at each Target of the Upstream. This allows Kong Gateway to automatically enable and disable Targets in the balancer based on the probe results.

The interval between active health checks can be configured separately for healthy and unhealthy Targets.

Note: Active health checks only support HTTP/HTTPS Targets. They don’t apply to Upstreams assigned to Services with the protocol attribute set to tcp or tls.

Configure active health checks

To enable active health checks, you need to configure the parameters under healthchecks.active in the Upstream object configuration.

Parameter	Description
`healthchecks.active.concurrency` Default: `10`	Number of Targets to check concurrently in active health checks.
`healthchecks.active.healthy.http_statuses` Default: `200, 302`	An array of HTTP statuses to consider a success, indicating healthiness, when returned by a probe in active health checks.
`healthchecks.active.healthy.interval` Default: `0`	Interval between active health checks for healthy Targets (in seconds). Set this to a positive value to enable active health checks for healthy Targets.
`healthchecks.active.healthy.successes` Default: `0`	Number of successes in active probes (as defined by `healthchecks.active.healthy.http_statuses`) to consider a Target healthy.
`healthchecks.active.http_path` Default: `/`	The path to use when issuing the HTTP GET request to the Target.
`healthchecks.active.https_sni`	(Only used for HTTPS) The hostname to use as an SNI (Server Name Indication) when performing active health checks using HTTPS. This is particularly useful when Targets are configured using IPs, so that the Target host’s certificate can be verified with the proper SNI.
`healthchecks.active.https_verify_certificate` Default: `true`	(Only used for HTTPS) Whether to check the validity of the SSL certificate of the remote host when performing active health checks using HTTPS. Failed TLS verifications will increment the `TCP failures` counter. `HTTP failures` refer only to HTTP status codes, whether probes are done through HTTP or HTTPS.
`healthchecks.active.timeout` Default: `1`	The connection timeout limit (in seconds) for the HTTP GET request of the probe.
`healthchecks.active.type` Default: `http`	Specify whether to perform `http` or `https` probes, or set this field to `tcp` to test the connection to a given host and port.
`healthchecks.active.unhealthy.http_failures` Default: `0`	Number of HTTP failures in active probes (as defined by `healthchecks.active.unhealthy.http_statuses`) to consider a Target unhealthy.
`healthchecks.active.unhealthy.http_statuses` Default: `429, 404, 500, 501, 502, 503, 504, 505`	An array of HTTP statuses to consider a failure, indicating unhealthiness, when returned by a probe in active health checks.
`healthchecks.active.unhealthy.interval` Default: `0`	Interval between active health checks for unhealthy Targets (in seconds). Set this to a positive value to enable active health checks for unhealthy Targets.
`healthchecks.active.unhealthy.tcp_failures` Default: `0`	Number of TCP failures or TLS verification failures in active probes to consider a Target unhealthy.
`healthchecks.active.unhealthy.timeouts` Default: `0`	Number of timeouts in active probes to consider a Target unhealthy.
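
As a rough sketch of how these parameters fit together, the following Python builds (but does not send) an Admin API request that enables active health checks. It assumes the Admin API listens on `http://localhost:8001` and that the Upstream is named `example-upstream`, as in the curl command elsewhere on this page, and that the Upstream is updated with a `PATCH` to `/upstreams/{name}`. The `/health` path, the intervals, and the thresholds are illustrative values, and `build_upstream_patch` is a hypothetical helper.

```python
import json
import urllib.request

ADMIN_URL = "http://localhost:8001"  # assumed Admin API address


def build_upstream_patch(upstream, healthchecks, admin_url=ADMIN_URL):
    """Build a PATCH request that updates an Upstream's healthchecks."""
    body = json.dumps({"healthchecks": healthchecks}).encode("utf-8")
    return urllib.request.Request(
        f"{admin_url}/upstreams/{upstream}",
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


# Illustrative settings: probe GET /health every 5 seconds.
active_settings = {
    "active": {
        "type": "http",
        "http_path": "/health",  # hypothetical health endpoint on the Target
        "timeout": 1,
        "healthy": {"interval": 5, "successes": 2},
        "unhealthy": {
            "interval": 5,
            "http_failures": 3,
            "tcp_failures": 2,
            "timeouts": 2,
        },
    }
}

request = build_upstream_patch("example-upstream", active_settings)
print(request.get_method(), request.full_url)

# To apply it against a running gateway, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

After applying a change like this, it is worth reading the Upstream back to confirm that the stored configuration matches what you intended.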

Disable active health checks

To completely disable active health checks for an Upstream, set healthchecks.active.healthy.interval and healthchecks.active.unhealthy.interval to 0.
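
The rule can be expressed as a small check. The helper `active_checks_enabled` below is hypothetical and only mirrors the statement above: active probes run when either interval is positive, and both default to 0.

```python
def active_checks_enabled(active):
    """True if probes run for healthy or unhealthy Targets (interval > 0)."""
    healthy_interval = active.get("healthy", {}).get("interval", 0)
    unhealthy_interval = active.get("unhealthy", {}).get("interval", 0)
    return healthy_interval > 0 or unhealthy_interval > 0


# Settings that turn active health checks off completely.
disabled = {"healthy": {"interval": 0}, "unhealthy": {"interval": 0}}

print(active_checks_enabled(disabled))                     # False
print(active_checks_enabled({"healthy": {"interval": 5}}))  # True
```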

Passive health checks (circuit breakers)

Note: This feature is not supported in Konnect or hybrid mode.

Passive health checks, also known as circuit breakers, are checks performed based on the requests proxied by Kong Gateway (HTTP/HTTPS/TCP) with no additional traffic generated. When a Target becomes unresponsive, the passive health checker detects that and marks the Target unhealthy. The ring balancer starts skipping this Target and doesn’t route any more traffic to it.

Configure passive health checks

Passive health checks don’t have a probe, as they work by interpreting the ongoing traffic that flows from a Target. To enable passive checks, you only need to configure the Upstream’s counter thresholds, which you can find under healthchecks.passive in the Upstream object configuration:

Parameter	Description
`healthchecks.passive.healthy.http_statuses` Default: `200, 201, 202, 203, 204, 205, 206, 207, 208, 226, 300, 301, 302, 303, 304, 305, 306, 307, 308`	An array of HTTP statuses which represent healthiness when produced by proxied traffic, as observed by passive health checks.
`healthchecks.passive.healthy.successes` Default: `0`	Number of successes in proxied traffic (as defined by `healthchecks.passive.healthy.http_statuses`) to consider a Target healthy, as observed by passive health checks. This needs to be positive when passive checks are enabled so that healthy traffic resets the unhealthy counters.
`healthchecks.passive.unhealthy.http_failures` Default: `0`	Number of HTTP failures in proxied traffic (as defined by `healthchecks.passive.unhealthy.http_statuses`) to consider a Target unhealthy, as observed by passive health checks.
`healthchecks.passive.unhealthy.http_statuses` Default: `429, 500, 503`	An array of HTTP statuses which represent unhealthiness when produced by proxied traffic, as observed by passive health checks.
`healthchecks.passive.unhealthy.tcp_failures` Default: `0`	Number of TCP failures in proxied traffic to consider a Target unhealthy, as observed by passive health checks.
`healthchecks.passive.unhealthy.timeouts` Default: `0`	Number of timeouts in proxied traffic to consider a Target unhealthy, as observed by passive health checks.
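
A minimal sketch of a passive configuration, written as the JSON body you might send when updating the Upstream. The threshold values are illustrative, and `passive_config_problems` is a hypothetical helper that encodes the note above about `healthchecks.passive.healthy.successes` needing to be positive when passive checks are enabled.

```python
import json

UNHEALTHY_THRESHOLDS = ("http_failures", "tcp_failures", "timeouts")


def passive_config_problems(passive):
    """Return a list of warnings for a healthchecks.passive configuration."""
    unhealthy = passive.get("unhealthy", {})
    enabled = any(unhealthy.get(name, 0) > 0 for name in UNHEALTHY_THRESHOLDS)
    successes = passive.get("healthy", {}).get("successes", 0)
    problems = []
    if enabled and successes <= 0:
        problems.append(
            "healthy.successes should be positive so that healthy traffic "
            "resets the unhealthy counters"
        )
    return problems


# Illustrative thresholds, not defaults.
passive = {
    "healthy": {"successes": 3},
    "unhealthy": {"http_failures": 5, "tcp_failures": 2, "timeouts": 3},
}

print(passive_config_problems(passive))  # []
body = json.dumps({"healthchecks": {"passive": passive}}, indent=2)
print(body)
```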

Re-enable a Target disabled by a passive health check

Passive health checks have the advantage of not producing extra traffic, but they are unable to automatically mark a Target as healthy again. Once the problem with a Target is solved and it is ready to receive traffic, you have to manually inform the health checker that the Target’s status is healthy:

curl -i -X PUT http://localhost:8001/upstreams/example-upstream/targets/10.1.2.3:1234/healthy

This command broadcasts a cluster-wide message so that the healthy status is propagated to the whole Kong Gateway cluster. It resets the health counters of the health checkers running in all workers of every Kong Gateway node, allowing the ring balancer to route traffic to the Target again.

Disable passive health checks

To completely disable passive health checks for an Upstream, set all counter thresholds under healthchecks.passive to 0.
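
For illustration, a hypothetical helper that takes an existing `healthchecks.passive` configuration and returns a copy with every counter threshold set to 0. It leaves the status code lists alone, since only the thresholds control whether passive checks are active.

```python
import copy


def disable_passive(passive):
    """Return a copy of a healthchecks.passive config with all thresholds at 0."""
    disabled = copy.deepcopy(passive)
    disabled.setdefault("healthy", {})["successes"] = 0
    unhealthy = disabled.setdefault("unhealthy", {})
    for name in ("http_failures", "tcp_failures", "timeouts"):
        unhealthy[name] = 0
    return disabled


current = {
    "healthy": {"successes": 3},
    "unhealthy": {
        "http_failures": 5,
        "tcp_failures": 2,
        "timeouts": 3,
        "http_statuses": [429, 500, 503],
    },
}

print(disable_passive(current))
```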

FAQs

Why should I choose active or passive health checks?

Do health checks affect all nodes in the cluster?