One disadvantage of using TTL-based health checks is the increased network
traffic between Consul agents (either between servers, or between a server
and a client).
For its services to be considered alive by Consul, a microservice must
send a TTL update to Consul every n seconds (currently 30 seconds).
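For illustration, here is a minimal sketch of such a keep-alive loop against the local agent's HTTP API. The agent address, check ID, TTL value, and update interval are assumptions for the example:

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"time"
)

const agent = "http://127.0.0.1:8500" // assumed local Consul agent address

// put issues a PUT to the local agent's HTTP API.
func put(path string, body []byte) error {
	req, err := http.NewRequest(http.MethodPut, agent+path, bytes.NewReader(body))
	if err != nil {
		return err
	}
	rsp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return rsp.Body.Close()
}

func main() {
	// Register a TTL check with the local agent ("my-service-ttl" and the
	// 30s TTL are illustrative values).
	check := []byte(`{"ID": "my-service-ttl", "Name": "my-service", "TTL": "30s"}`)
	if err := put("/v1/agent/check/register", check); err != nil {
		log.Fatal(err)
	}

	// Keep hitting the pass endpoint before the TTL expires; if this loop
	// ever stops, the check lapses into the critical state.
	for {
		if err := put("/v1/agent/check/pass/my-service-ttl", nil); err != nil {
			log.Printf("TTL update failed: %v", err)
		}
		time.Sleep(15 * time.Second) // update well within the 30s TTL
	}
}
```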
Here is the explanation of TTL checks from the Consul documentation [1]:
> Time to Live (TTL) - These checks retain their last known state for a
> given TTL. The state of the check must be updated periodically over
> the HTTP interface. If an external system fails to update the status
> within a given TTL, the check is set to the failed state. This
> mechanism, conceptually similar to a dead man's switch, relies on the
> application to directly report its health. For example, a healthy app
> can periodically PUT a status update to the HTTP endpoint; if the app
> fails, the TTL will expire and the health check enters a critical
> state. The endpoints used to update health information for a given
> check are the pass endpoint and the fail endpoint. TTL checks also
> persist their last known status to disk. This allows the Consul agent
> to restore the last known status of the check across restarts.
> Persisted check status is valid through the end of the TTL from the
> time of the last check.
Hint:

> TTL checks also persist their last known status to disk. This allows
> the Consul agent to restore the last known status of the check
> across restarts.
When a microservice updates its TTL, Consul writes the new status to disk.
That write must then be replicated to all of the other servers: the leader
has to inform the standby Consul servers to pull the new catalog. Hence
the increased traffic.
More information about this issue can be found on the Consul mailing list [2].
[1] https://www.consul.io/docs/agent/checks.html
[2] https://groups.google.com/forum/#!topic/consul-tool/84h7qmCCpjg
When developing go-micro services, it is easy to set an invalid result in the response pointer. When this happens (as @trushton and I experienced personally), `sendResponse()` correctly returns an error explaining what happened (e.g. protobuf refused to encode a bad struct), but the `call()` function one frame above it in the stack ignores the returned error object.
Thus, invalid structs go un-encoded and the _client side times out_. @trushton and I first caught this in our CI builds when we left a protobuf.Empty field uninitialized (nil) instead of setting it to `&ptypes.Empty{}`. This resulted in a `proto: oneof field has nil value` error, but the error was dropped and surfaced as a terribly confusing client timeout instead.
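To make the failure mode concrete, here is a hedged sketch of the mistake; `Response` and its `Done` field are hypothetical stand-ins for a generated message type:

```go
package main

import (
	"fmt"

	ptypes "github.com/golang/protobuf/ptypes/empty"
)

// Response is a hypothetical stand-in for a generated response struct
// with a protobuf Empty member.
type Response struct {
	Done *ptypes.Empty
}

func main() {
	rsp := &Response{}
	// Leaving rsp.Done nil is what tripped the encoder for us; the fix
	// is to initialize the field explicitly before returning:
	rsp.Done = &ptypes.Empty{}
	fmt.Printf("Done initialized: %v\n", rsp.Done != nil)
}
```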
This patch makes two independent changes:
* In rpc_codec: when a serialization failure occurs, serialize an error message about the encoding failure, which will correctly become a 500 for HTTP services. This means rpc_codec only returns an `error` when a socket failure occurs, which I believe is the behavior rpc_service expects anyway.
* In rpc_service: log any errors returned by `sendResponse()` instead of dropping the error object. This will make debugging client timeouts less of a hassle (see the sketch below).
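As a rough illustration of the second change (go-micro's real internals are elided; `sendResponse` below is a stand-in, not the library's actual signature):

```go
package main

import (
	"errors"
	"log"
)

// sendResponse stands in for go-micro's internal helper; here it always
// fails the way a protobuf encoder would on a bad struct.
func sendResponse(rsp interface{}) error {
	return errors.New("proto: oneof field has nil value")
}

func main() {
	// Before the patch, call() dropped the error:
	_ = sendResponse(nil) // encoding failure vanishes; client times out

	// After the patch, the error is logged so the failure is visible
	// server-side instead of surfacing as a client timeout:
	if err := sendResponse(nil); err != nil {
		log.Printf("rpc: error sending response: %v", err)
	}
}
```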