One disadvantage of using TTL-based health checks is the high network
traffic between Consul agents (either between servers, or between a
server and a client).
For a service to be considered alive by Consul, each microservice must
send a TTL update to Consul every n seconds (currently 30 seconds).
Here is the explanation of TTL checks from the Consul documentation [1]:
Time to Live (TTL) - These checks retain their last known state for a
given TTL. The state of the check must be updated periodically over
the HTTP interface. If an external system fails to update the status
within a given TTL, the check is set to the failed state. This
mechanism, conceptually similar to a dead man's switch, relies on the
application to directly report its health. For example, a healthy app
can periodically PUT a status update to the HTTP endpoint; if the app
fails, the TTL will expire and the health check enters a critical
state. The endpoints used to update health information for a given
check are the pass endpoint and the fail endpoint. TTL checks also
persist their last known status to disk. This allows the Consul agent
to restore the last known status of the check across restarts.
Persisted check status is valid through the end of the TTL from the
time of the last check.
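As a rough illustration of the periodic update described above, here is a
minimal Python sketch that builds the PUT request for the agent's pass
endpoint. The agent address (the default local port 8500), the check ID
"web-ttl", and the helper names are assumptions made for this example; they
are not taken from our services.

```python
import urllib.parse
import urllib.request

# Assumption: a Consul agent listening on its default local HTTP address.
CONSUL_AGENT = "http://127.0.0.1:8500"
# The update interval mentioned above, in seconds.
TTL_UPDATE_INTERVAL = 30


def build_ttl_pass_request(check_id, agent=CONSUL_AGENT):
    """Build (but do not send) the PUT that marks a TTL check as passing."""
    quoted_id = urllib.parse.quote(check_id, safe="")
    url = f"{agent}/v1/agent/check/pass/{quoted_id}"
    return urllib.request.Request(url, method="PUT")


def send_ttl_pass(check_id, agent=CONSUL_AGENT):
    """Send one TTL update and return the HTTP status code.

    A service would call this once every TTL_UPDATE_INTERVAL seconds;
    if it stops, the TTL expires and the check becomes critical.
    """
    request = build_ttl_pass_request(check_id, agent)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A service would call send_ttl_pass("web-ttl") from a timer or background
loop. Every one of these calls is a request that the agent has to process,
which is where the traffic discussed here comes from.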
Hint:
TTL checks also persist their last known status to disk. This allows
the Consul agent to restore the last known status of the check
across restarts.
Each time a microservice updates its TTL, Consul writes the check status
to disk. Every such write has to be replicated to the other slaves, so
the master must inform the standby Consul servers to pull the new
catalog. Hence the increased traffic.
More information about this issue can be found on the Consul mailing list [2].
[1] https://www.consul.io/docs/agent/checks.html
[2] https://groups.google.com/forum/#!topic/consul-tool/84h7qmCCpjg
Consul treats a node with a healthcheck in the warning state as a
"failed" node. This means that when we ask Consul for services that are
passing, it does not return nodes that have warning healthchecks.
In the cache, we only skip nodes that have critical healthchecks. This
makes the cache out of sync with the non-cache implementation.
This patch reworks the non-cache implementation to ask for all nodes
(even unhealthy ones) and to apply the same check as the cache: skip
nodes that have critical healthchecks.
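The filtering rule can be sketched as follows. This is only an
illustration in Python, not the code from this patch: the function names
are hypothetical, and the entries are assumed to have the shape returned by
Consul's /v1/health/service/<name> endpoint when it is queried without the
passing filter (a list of objects with "Node", "Service" and "Checks").

```python
def has_critical_check(entry):
    """Return True if any healthcheck on this entry is in the critical state."""
    return any(
        check.get("Status") == "critical"
        for check in entry.get("Checks", [])
    )


def usable_nodes(entries):
    """Keep passing and warning nodes; skip only nodes with a critical check."""
    return [entry for entry in entries if not has_critical_check(entry)]


# Hypothetical sample data: one passing, one warning, one critical node.
SAMPLE_ENTRIES = [
    {"Node": {"Node": "node-a"}, "Checks": [{"Status": "passing"}]},
    {"Node": {"Node": "node-b"},
     "Checks": [{"Status": "passing"}, {"Status": "warning"}]},
    {"Node": {"Node": "node-c"},
     "Checks": [{"Status": "passing"}, {"Status": "critical"}]},
]

# node-a and node-b are kept; node-c is skipped.
USABLE_NAMES = [e["Node"]["Node"] for e in usable_nodes(SAMPLE_ENTRIES)]
```

With this rule a node in the warning state stays available in both code
paths, which is the behaviour the cache already had.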
We noticed this issue when we deployed custom healthchecks: the cache
was acting properly, but after 1 minute we saw "None Available" errors.
This is due to the TTL expiry on the cache, which is then followed by a
non-cached request.