In a perfect world, all of your servers would be hard-wired to each other in one room, with a single opening through which the world connects to your little slice of the Internet. All of your cross-service communication with databases, infrastructure and other services would then happen within that one set of directly connected servers.
In the modern world we often have to reach across the Internet to access services and applications. This can be an awkward feeling and presents some unique problems. However, there are a few techniques and patterns you can use to make it a little less frightening. Let’s talk through some of the bigger concerns and what you can do about them.
One big reason reaching across data centers, even for first-party systems, can be an issue is the matter of security. There’s a base level of security you lose sending data and commands across the Internet where anyone can glance at your requests and try to decode the data or alter what’s being sent. All someone needs is a basic understanding of networking, attack techniques like Man-in-the-middle and perhaps Wireshark to unpack your cross-service request, see sensitive data, tinker with the request and send an altered request to the final destination. Fear not, however, there are some standard techniques to mitigate this risk:
Always communicate over SSL (today more precisely TLS, its successor) when you’re sending requests back and forth between your systems. This is a straightforward, standard way to secure communications between two services or entities on the Web. Under the hood, SSL uses public/private key cryptography to authenticate the server and agree on a session key, and that session key then encrypts the whole request, headers and body alike, between the two entities. Reddit, Facebook and all of your financial institutions use SSL (HTTPS) to communicate with your browser, and likely when they communicate between internal services. It’s become far easier and cheaper (free) to get SSL certificates for your services as well, thanks to organizations like Let’s Encrypt.
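As a small illustration of the advice above, here is a sketch of a client that refuses to talk to another service without verified HTTPS. It uses only Python's standard library; the function names `make_tls_context` and `fetch`, and the one-second default timeout, are assumptions made for this example rather than anything Highrise actually runs.

```python
import ssl
import urllib.request


def make_tls_context() -> ssl.SSLContext:
    # create_default_context() enables certificate verification and
    # hostname checking against the system's trusted CA certificates.
    context = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def fetch(url: str, timeout: float = 1.0) -> bytes:
    # Hypothetical helper: reject plain HTTP before anything leaves the machine.
    if not url.startswith("https://"):
        raise ValueError("refusing to send a request over plain HTTP")
    with urllib.request.urlopen(url, timeout=timeout, context=make_tls_context()) as response:
        return response.read()
```

The important part is what is left alone: certificate and hostname verification stay switched on, since turning them off is what makes a man-in-the-middle attack practical again.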
While communication over SSL is somewhat secure, it can fail. Or perhaps you don’t need SSL to prevent snooping, but you do want to ensure the data wasn’t tampered with. At Highrise we decided to use a draft standard, currently being worked on at the IETF, which outlines a method for signing a request. This means you can use a signing algorithm and a set of keys that you configure to define a formal verification for the content of your request. Let’s say I want to ensure that the Digest, Authorization and Date headers were never altered. By following this protocol I would set up the request, compute the signature (using the signing keys) over the specified headers, add the signature to the request and execute the request. This standard allows for specifying which key you used to sign the request (via a KeyId parameter), which headers were signed, and which algorithm was used to do the signing. The recipient server can use this information to verify that the contents of those headers were not altered during transport. The details of this freshly forming protocol go a fair bit deeper and are worth understanding. There will be a follow-up post on this topic shortly.
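To make the signing flow a little more concrete, here is a minimal sketch of the idea: build a string from the chosen headers, sign it with a shared secret, and let the recipient recompute and compare. It is only loosely modeled on the parameters described above (key id, signed headers, algorithm) and is not the exact wire format of the draft. The function names, the HMAC-SHA256 choice and the `service-a` key id are all assumptions for illustration.

```python
import base64
import hashlib
import hmac
import re


def signing_string(headers: dict, signed_headers: list) -> str:
    # One "name: value" line per signed header, in the order listed.
    lowered = {name.lower(): value for name, value in headers.items()}
    return "\n".join(f"{name.lower()}: {lowered[name.lower()]}" for name in signed_headers)


def _mac(secret: bytes, text: str) -> str:
    digest = hmac.new(secret, text.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def sign(headers: dict, signed_headers: list, key_id: str, secret: bytes) -> str:
    # Returns the value the sender would attach as a signature header.
    names = " ".join(name.lower() for name in signed_headers)
    signature = _mac(secret, signing_string(headers, signed_headers))
    return f'keyId="{key_id}",algorithm="hmac-sha256",headers="{names}",signature="{signature}"'


def verify(headers: dict, signature_header: str, secrets_by_key_id: dict) -> bool:
    # The recipient looks up the key, rebuilds the signing string and compares.
    params = dict(re.findall(r'(\w+)="([^"]*)"', signature_header))
    secret = secrets_by_key_id.get(params.get("keyId", ""))
    if secret is None or params.get("algorithm") != "hmac-sha256":
        return False
    try:
        expected = _mac(secret, signing_string(headers, params.get("headers", "").split()))
    except KeyError:
        return False  # a header that was signed is missing from the request
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected.encode("ascii"), params.get("signature", "").encode("utf-8"))
```

With this sketch, changing the Date header after signing makes `verify` return False, while changing a header that was not in the signed list goes unnoticed. That is why choosing which headers to sign matters.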
Together, these two techniques give us stronger confidence in what is being sent over the wire to other services.
Slow access to external services caused by network fluctuations, as well as actual downtime, are facts of a cross-data-center world. Both types of issues can compound and start making whole services virtually unusable. You often won’t be able to stop these things from happening, so you have to prepare for them. Let’s talk about four mitigation techniques:
Caching, or intelligently deciding when to request across services, can cut down on the number of actual requests you need to make. ETags can help with this, as can expiration headers or simply not requesting data unless you absolutely need it to accomplish your task. If the resource hasn’t changed since the last time it was requested, let the client reuse the data it already has.
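Here is a rough sketch of the ETag idea from the client's side. The `ETagCache` class and the shape of the `fetch` callable (it takes a URL and an optional ETag, and returns status, ETag and body) are hypothetical, chosen so the example stays independent of any particular HTTP library. In a real client the stored ETag would travel in an `If-None-Match` header.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical transport: fetch(url, if_none_match) -> (status, etag, body)
Fetch = Callable[[str, Optional[str]], Tuple[int, Optional[str], Optional[bytes]]]


class ETagCache:
    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch
        self._entries: Dict[str, Tuple[str, bytes]] = {}  # url -> (etag, body)

    def get(self, url: str) -> bytes:
        cached = self._entries.get(url)
        # Send the ETag we already hold so the server can answer "not modified".
        status, etag, body = self._fetch(url, cached[0] if cached else None)
        if status == 304 and cached is not None:
            return cached[1]  # unchanged: reuse the data we already have
        if status != 200 or body is None:
            raise RuntimeError(f"unexpected response {status} for {url}")
        if etag:
            self._entries[url] = (etag, body)
        return body
```

A request still crosses the network on every call, but a 304 response carries no body, so the expensive part of the transfer is skipped whenever nothing changed.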
I mentioned earlier that slow responses from external services can create a compounding problem for your system. You can mitigate this risk by planning for it to happen and wrapping specific patterns around your communication. Specifically, set reasonable timeouts when you make external requests. One problem with timeouts is that you can’t tell whether the request ever reached the server, so you should plan to make your endpoints idempotent whenever possible. Idempotent endpoints make retries simpler as well, since you can keep hitting the endpoint without causing any unexpected change. Finally, you should space out the retries, waiting longer before each new attempt, to give the system time to recover and to avoid hammering the service. This is called exponential back-off.
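A minimal sketch of retries with exponential back-off might look like the following. The function names, the default of three retries and the base of three seconds are illustrative assumptions. `operation` stands for any callable that raises `TimeoutError` when the external request times out, and `sleep` is injectable so the logic can be exercised without actually waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 3.0) -> List[float]:
    # The wait before retry n is base ** n, so each wait is `base` times the last.
    return [base ** attempt for attempt in range(1, retries + 1)]


def call_with_retries(
    operation: Callable[[], T],
    retries: int = 3,
    base: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for delay in backoff_delays(retries, base):
        try:
            return operation()
        except TimeoutError:
            sleep(delay)  # back off before the next attempt
    # Last retry: if this one also times out, let the error reach the caller,
    # which is the point where a team would be alerted.
    return operation()
```

In a job queue you would typically reschedule the job for later instead of sleeping inside a worker, but the schedule of delays is computed the same way.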
At Highrise, certain important requests have a timeout like 1 second. If the request fails it will be retried 3 times before it stops trying and starts messaging our team about issues. Each time it will schedule the job to retry further out: 3 seconds after failure, 9 seconds after failure and 27 seconds after failure, because of the exponential back-off algorithm. In cases where something is, for instance, sending an email via an external request, idempotency is a very serious concern so that you avoid sending the exact same email 3 times because of retries. You can accomplish something like that with a key that the server uses to decide if that operation has already been accomplished.
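The key-based approach to idempotency can be sketched like this. `EmailService` is a stand-in for the receiving service, not Highrise's actual code, and it remembers results in memory where a real service would use a durable store. The assumption is that the caller generates one key per logical operation (for example with `uuid.uuid4()`) and sends the same key on every retry.

```python
import uuid
from typing import Dict, List


class EmailService:
    """Hypothetical receiving service that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.sent: List[str] = []            # messages actually delivered
        self._results: Dict[str, str] = {}   # idempotency key -> earlier result

    def send_email(self, idempotency_key: str, message: str) -> str:
        # If this key was already handled, return the earlier result
        # instead of sending the email a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.sent.append(message)
        result = f"sent:{len(self.sent)}"
        self._results[idempotency_key] = result
        return result


def new_idempotency_key() -> str:
    # Generated once by the caller and reused for every retry of that operation.
    return str(uuid.uuid4())
```

If a response is lost to a timeout and the caller retries with the same key, the service recognizes the key and the recipient still gets exactly one email.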
Circuit Breakers paired with timeouts can help you both better handle full-service degradation and provide a window for recovery. A Circuit Breaker basically lets you define a set of rules that says when a breaker should “trip.” When a breaker trips you skip over an operation and instead respond with “try again later please,” re-queue a job or use some other retry mechanism. In practice at Highrise, we wrap requests to an external service in a circuit breaker. If the breaker trips due to too many request timeouts, we display a message to any users trying to access functionality that would use that service, and put jobs on hold that use that service. Jobs that were in-flight will presumably fail and be retried as usual. A tripped breaker stays tripped for several minutes (a configured value) and thus keeps us from hammering a service that may be struggling to keep up. This gives Operations some breathing room to add servers, fix a bug or simply allow network latency to recover a little.
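As a rough sketch of the pattern described above, a circuit breaker can be as small as a failure counter plus a timestamp. The class name, the threshold of five timeouts and the 300-second cool-down are assumptions for illustration, and production libraries usually add more states and thread safety than this.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the service while the breaker is tripped."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._reset_after:
            # Cool-down elapsed: close the breaker and allow traffic again.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def call(self, operation: Callable[[], T]) -> T:
        if self.is_open():
            # Skip the operation entirely; the caller shows "try again later"
            # or re-queues the job.
            raise CircuitOpenError("service temporarily unavailable")
        try:
            result = operation()
        except TimeoutError:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # trip the breaker
            raise
        self._failures = 0  # a success resets the consecutive-timeout count
        return result
```

A web request would catch `CircuitOpenError` and render the "try again later" message, while a background job would catch it and put itself on hold.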
Upchecks, Health-Checks and the like are very useful to get a basic understanding of whether you can reach a service. Libraries standardize some of this for you so you don’t have to think much about what to provide. Really what you want is to understand whether you can reach the service and whether its basic functions are operational. Upchecks paired with a circuit breaker can help decide whether to show a maintenance page or to skip jobs that won’t work at the moment. These checks should be extremely fast. At Highrise, for our first-party external services, we check once per web request for the liveness of the feature about to be accessed. Again, let’s say we have an external emailing service. If someone goes to the email feature, we wouldn’t check at each email operation in the code that the service is up. Instead, we would check at the beginning of the web request whether the email service is up. If it is up, we continue to the feature; if it isn’t, we display a basic “down, please try later” message.
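The "check once per web request" idea can be sketched as a small per-request object that remembers the result of each upcheck. `RequestContext`, `email_feature` and the service name `"email"` are hypothetical. Each upcheck is assumed to be a fast callable that returns True when the service answers, typically an HTTP call with a very short timeout.

```python
from typing import Callable, Dict, List


class RequestContext:
    """Lives for one web request and remembers each upcheck result."""

    def __init__(self, upchecks: Dict[str, Callable[[], bool]]) -> None:
        self._upchecks = upchecks
        self._results: Dict[str, bool] = {}

    def service_is_up(self, name: str) -> bool:
        if name not in self._results:
            check = self._upchecks[name]
            try:
                self._results[name] = bool(check())
            except Exception:
                # An upcheck that errors or times out counts as "down".
                self._results[name] = False
        return self._results[name]


def email_feature(ctx: RequestContext, recipients: List[str]) -> str:
    # Checked once at the start of the request, not before every email operation.
    if not ctx.service_is_up("email"):
        return "Email is down, please try later."
    return f"Queued {len(recipients)} emails."
```

Because the result is stored on the context, later code in the same request can ask again without triggering another network call.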
When it comes to external services, even ones you wrote yourself, you have to act like you have no control over them. You can’t assume any external service will always operate normally. The reality is you have limited control, so you have to design a system that explains issues to your users as they happen and mostly recovers on its own. Protect what you can control and avoid requiring humans to repair issues like this. The more your computers can recover on their own, the more you can worry about the next feature, the next user or the next beer.