Resilience in Ruby: Handling Failure
For the second half of 2014 and 2015, most of my time at GitHub was spent working on moving the notifications feature off the primary MySQL cluster on to a new MySQL cluster and making the feature resilient to failure of that cluster.
When all your data is stored in one database, making your application resilient to failures of that database is kind of pointless. If the database is unavailable, your entire application is likely unusable.
Once you outgrow a single database and begin to partition data, resilience to failure becomes important. You do not want your entire application unavailable to users solely because you are unable to get a count of unread notifications to show in the top bar of the website.
Error Handling Basics
The first step in resilience is error handling. There are a lot of different approaches to handling errors, as outlined by Joe Duffy in his marvelous post The Error Model. The approach Ruby takes to errors is exceptions. When an error is encountered, it is pretty typical that an exception will be raised. For example, let’s make an HTTP request to a service that is not available.
require "net/http"
Net::HTTP.new("localhost", 9999).request(Net::HTTP::Get.new("/"))
If you drop the snippet above in IRB, you will end up with an error like this (assuming you do not have a web server running on port 9999):
Errno::ECONNREFUSED: Connection refused - connect(2) for "localhost" port 9999
The error causes your program to exit with a failure. If you drop the snippet above in a file, execute it with ruby and check the exit status, it will be 1 (failure).
$ ruby resilience.rb
... [snip several lines of error and backtrace] ...
$ echo $?
1
In order to handle the error, Ruby provides begin
/rescue
.
require "net/http"
begin
Net::HTTP.new("localhost", 9999).request(Net::HTTP::Get.new("/"))
puts "Succeeded"
rescue
puts "Failed"
end
# Outputs: Failed
The snippet above, if dropped in a file and executed, would return an exit status of 0 (success).
$ ruby resilience.rb
Failed
$ echo $?
0
The problem with a rescue statement like this is that all exceptions will be rescued, even those related to programmer error, such as an ArgumentError
. For example, in this snippet note that I forgot to provide a request.
require "net/http"
begin
Net::HTTP.new("localhost", 9999).request()
puts "Succeeded"
rescue
puts "Failed"
end
# Outputs: Failed
If you execute this, your script indeed says that it failed and exits with success.
$ ruby resilience.rb
Failed
$ echo $?
0
The rescue
happily swallows the ArgumentError
and you never get feedback that the request is failing due to programmer error. You likely would assume that the backend service was down and lose a lot of time debugging it, instead of quickly noticing that you were missing an argument.
In order to allow for programmer feedback during development (like an ArgumentError
), but resilience in production, you can be more specific about which exceptions you would like to handle. For example, you can handle the Errno::ECONNREFUSED
error like so:
require "net/http"
begin
Net::HTTP.new("localhost", 9999).request(Net::HTTP::Get.new("/"))
puts "Succeeded"
rescue Errno::ECONNREFUSED
puts "Failed"
end
# Outputs: Failed
If you run this, you do indeed get Failed
and a successful exit status.
$ ruby resilience.rb
Failed
$ echo $?
0
Additionally, if you forget the argument, you get an ArgumentError
and an unsuccessful exit code. Running this snippet:
require "net/http"
begin
Net::HTTP.new("localhost", 9999).request()
puts "Succeeded"
rescue Errno::ECONNREFUSED
puts "Failed"
end
results in:
$ ruby resilience.rb
/Users/jnunemaker/.rbenv/versions/2.2.5/lib/ruby/2.2.0/net/http.rb:1373:in `request': wrong number of arguments (0 for 1..2) (ArgumentError)
from resilience.rb:3:in `<main>'
$ echo $?
1
Verbose and Brittle
Though begin
and rescue
give you the tools you need to handle errors in Ruby, they definitely left me wanting. For starters, needing to wrap every call felt verbose. A call that use to be one line and relatively clear, is now a bit muddled with at least 3 extra lines. In plain old ruby code, the verbosity is not that bad, but once you move into Rails views or something similar, you really feel you need to begin being rescued (see what I did there).
Additionally, if you do not provide some extra layer on top, you will end up with a lot of call sites rescuing a lot of exceptions. Each new exception will require updating all of the call sites and, trust me, this is tedious and brittle (easy to miss a call site).
A Layer On Top
Let’s continue with our HTTP request example and add a layer on top to chokepoint error handling to one place. Let’s also pretend our HTTP request is reaching out to a service that provides a JSON response.
require "json"
require "net/http"
class Client
# Returns Hash of notifications data for successful response.
# Returns nil if notification data cannot be retrieved.
def notifications
begin
request = Net::HTTP::Get.new("/")
http = Net::HTTP.new("localhost", 9999)
response = http.request(request)
JSON.parse(response.body)
rescue Errno::ECONNREFUSED
# what should we return here???
end
end
end
client = Client.new
p client.notifications
You now have the chokepoint for error handling in the notifications method, which is cool, but nil is being used as a way of communicating that something went wrong, which is not cool. You could return something other than nil, but you end up with a fork in the road. Success returns one object and failure returns something different.
Response Objects
What I have been reaching for lately is something along these lines:
require "json"
require "net/http"
class Client
class NotificationsResponse
attr_reader :notifications, :error
def initialize(&block)
@error = false
@notifications = begin
yield
rescue Errno::ECONNREFUSED => error
@error = error
{} # sensible default
end
end
def ok?
@error == false
end
end
def notifications
NotificationsResponse.new do
request = Net::HTTP::Get.new("/")
http = Net::HTTP.new("localhost", 9999)
http_response = http.request(request)
JSON.parse(http_response.body)
end
end
end
client = Client.new
response = client.notifications
if response.ok?
# Do something with notifications like show them as a list...
else
# Communicate that things went wrong to the caller or user.
end
By putting a response object in between the caller and the call to get the data:
- we always return the same object, avoiding nil and retaining duck typing.
- we now have a place to add more context about the failure if necessary, which we did not have with
nil
. - we have a single place to update rescued exceptions if a new one pops up.
- we have a nice place for instrumentation and circuit breakers in the future.
- we avoid needing
begin
andrescue
all over and instead can use conditionals or whatever makes sense. - we give the caller the ability to handle different failures differently (Conn refused vs Timeout vs Rate limited, etc.).
If we so desire, we can even go a step further and make sure that Response#ok?
is checked by doing something like this:
require "json"
require "net/http"
class Client
class NotificationsResponse
attr_reader :error
def initialize(&block)
@error = false
@notifications = begin
yield
rescue Errno::ECONNREFUSED => error
@error = error
{} # sensible default
end
end
def ok?
@ok_predicate_checked = true
@error == false
end
def notifications
unless @ok_predicate_checked
raise "ok? must be checked prior to accessing response data"
end
@notifications
end
end
def notifications
NotificationsResponse.new do
request = Net::HTTP::Get.new("/")
http = Net::HTTP.new("localhost", 9999)
response = http.request(request)
JSON.parse(response.body)
end
end
end
client = Client.new
response = client.notifications
# response.notifications would raise error because ok? was not checked
GitHub::Result
Along these lines, I recently open sourced github-ds, which contains a GitHub::Result
class. We use GitHub::Result
at GitHub more and more of late to bake resiliency in at the beginning of projects. GitHub::KV
is a good example of GitHub::Result
in the real world, but here is a little taste:
def do_something
1
end
def do_something_that_errors
raise "noooooppppeeeee"
end
result = GitHub::Result.new { do_something }
p result.ok? # => true
p result.value! # => 1
result = GitHub::Result.new { do_something_that_errors }
p result.ok? # => false
p result.value { "default when error happens" } # => "default when error happens"
begin
result.value! # raises exception because error happened
rescue => error
p result.error
p error
end
# Outputs Step 1, 2, 3
result = GitHub::Result.new {
GitHub::Result.new { puts "Step 1: success!" }
}.then { |value|
GitHub::Result.new { puts "Step 2: success!" }
}.then { |value|
GitHub::Result.new { puts "Step 3: success!" }
}
p result.ok? # => true
# Outputs Step 1, 2 and stops.
result = GitHub::Result.new {
GitHub::Result.new { puts "Step 1: success!" }
}.then { |value|
GitHub::Result.new {
puts "Step 2: failed!"
raise
}
}.then { |value|
GitHub::Result.new {
puts "Step 3: should not get here because previous step failed!"
}
}
p result.ok? # => false
Conclusion
Welp, this article has rambled on long enough. It is by no means a comprehensive article on resiliency, but it covers step 1, which is handling failure. The key to me including a layer on top that bakes in the resiliency, making it easy for callers to do the right thing in the face of failure. Using response objects and/or GitHub::Result
can definitely help with that.
I am not going to be wreckless and promise that I will write up step 2, but if I do, there is a good chance it will be titled “Limit Everything”. :)