Nobody likes when something fails. When a Chef run fails for some reason, we have to figure out what
resource caused the failure and then try to fix it and then re-run it. If we know what might fail, we
can try to do some steps and then try the action again. This message sent to the Chef community
mailing list proposes something very similar.
Here is an RFC proposed by the Chef community which is still in discussion. It provides a resource
specific error handling mechanism. It has many features that will help making an awesome overall Chef experience.
If we are aware of something that can fail, why not make the resource try something to fix it by itself?
The Chef Handler provides a way of handling exceptions but that is very limited to just handling exceptions and
performing some actions after the failure and be done. I wrote a blog post earlier about how to write custom
Chef Handlers.
Proof of Concept
I wrote a simple cookbook called on_failure that monkey patches the way Chef runs the resources and implements
this feature. I tried to add as many features as possible but there are still some discussions on how to do
certain things. This is just a starting point for trying out how this would work. This cookbook is also published
to the community.
In Action
The on_failure cookbook has some additional cookbooks that help trying out this feature. The following are those
cookbooks and their purposes:
sample – Contains some sample recipes for using the on_failure handler feature.
meal – Provides a resource called meal to mimic the example provided in the RFC.
food – Provides a resource called food to mimic the example provided in the RFC.
I tried to make all the examples used here to be very similar to the ones proposed in the RFC to make them
easier to understand and those examples are also fun to use.
Simple Example
Here is a simple example of how adding this feature to a resource will look like:
This recipe doesn’t actually raise any error so there is no use for the on_failure handler in the
meal[breakfast] resource. Let’s simulate a failure and see what the (sample::with_exception) recipe looks like:
sample::with_exception
12345678910
# At least 1 bacon slice is required to complete breakfast without getting hungry.node.override['meal']['bacon_required']=1meal'breakfast'doon_failure(HungryError){notify:eat,'food[bacon]'}endfood'bacon'doaction:nothingend
This recipe will raise a HungryError exception when meal[breakfast] resource first runs. During failure, the
food[bacon] resource’s :eat action will be run and it will make the resource meal[breakfast] to succeed when
it is tried again. See the output in the following Chef run:
Chef run
1234567891011121314
[2014-04-09T04:07:10+00:00] INFO: *** Chef 11.12.0 ***
[2014-04-09T04:07:10+00:00] INFO: Chef-client pid: 8683
[2014-04-09T04:07:12+00:00] INFO: Setting the run_list to ["recipe[sample::with_exception]"] from CLI options
[2014-04-09T04:07:12+00:00] INFO: Run List is [recipe[sample::with_exception]]
[2014-04-09T04:07:12+00:00] INFO: Run List expands to [sample::with_exception]
[2014-04-09T04:07:12+00:00] INFO: Starting Chef Run for on-failure-berkshelf
[2014-04-09T04:07:12+00:00] INFO: Running start handlers
[2014-04-09T04:07:12+00:00] INFO: Start handlers complete.
[2014-04-09T04:07:12+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::HungryError: I want 1 bacon slices. But I only ate: 0>
[2014-04-09T04:07:12+00:00] INFO: Ate 'bacon' successfully. yummy!
[2014-04-09T04:07:12+00:00] INFO: Meal 'breakfast' completed successfully.
[2014-04-09T04:07:12+00:00] INFO: Chef Run complete in 0.091023597 seconds
[2014-04-09T04:07:12+00:00] INFO: Running report handlers
[2014-04-09T04:07:12+00:00] INFO: Report handlers complete
Handling with Retries
By default, only one attempt is made to rerun the resource action. It can be customized by giving the retries
option in the on_failure construct. The resource action will be tried for the specified number of times and the block
given will be executed every time before retrying the resource action.
sample::with_retries
12345678910
# 3 bacon slices are required for completing the breakfast without getting hungry. This is used to do retries.node.override['meal']['bacon_required']=3meal'breakfast'doon_failure(retries:5){notify:eat,'food[bacon]'}endfood'bacon'doaction:nothingend
In this example, the meal[breakfast] resource will succeed only if 3 bacon slices are eaten — so it will pass
at the 4th attempt. See the output in the following Chef run:
Chef run
123456789101112131415161718
[2014-04-09T05:12:19+00:00] INFO: *** Chef 11.12.0 ***
[2014-04-09T05:12:19+00:00] INFO: Chef-client pid: 8919
[2014-04-09T05:12:21+00:00] INFO: Setting the run_list to ["recipe[sample::with_retries]"] from CLI options
[2014-04-09T05:12:21+00:00] INFO: Run List is [recipe[sample::with_retries]]
[2014-04-09T05:12:21+00:00] INFO: Run List expands to [sample::with_retries]
[2014-04-09T05:12:21+00:00] INFO: Starting Chef Run for on-failure-berkshelf
[2014-04-09T05:12:21+00:00] INFO: Running start handlers
[2014-04-09T05:12:21+00:00] INFO: Start handlers complete.
[2014-04-09T05:12:21+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::HungryError: I want 3 bacon slices. But I only ate: 0>
[2014-04-09T05:12:21+00:00] INFO: Ate 'bacon' successfully. yummy!
[2014-04-09T05:12:21+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::HungryError: I want 3 bacon slices. But I only ate: 1>
[2014-04-09T05:12:21+00:00] INFO: Ate 'bacon' successfully. yummy!
[2014-04-09T05:12:21+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::HungryError: I want 3 bacon slices. But I only ate: 2>
[2014-04-09T05:12:21+00:00] INFO: Ate 'bacon' successfully. yummy!
[2014-04-09T05:12:21+00:00] INFO: Meal 'breakfast' completed successfully.
[2014-04-09T05:12:21+00:00] INFO: Chef Run complete in 0.099481932 seconds
[2014-04-09T05:12:21+00:00] INFO: Running report handlers
[2014-04-09T05:12:21+00:00] INFO: Report handlers complete
Handling Multiple Exceptions
Multiple exceptions can be specified in on_failure and it will be caught if any of the exceptions are raised by the
resource.
sample::multiple_exceptions
12345678910
# Simlate that the meal is not cooked so UncookedError is raisednode.set['meal']['cooked']=falsemeal'breakfast'doon_failure(UncookedError,HungryError,retries:3){notify:fry,'food[bacon]'}endfood'bacon'doaction:nothingend
See the output in the following Chef run:
Chef run
1234567891011121314
[2014-04-09T05:14:08+00:00] INFO: *** Chef 11.12.0 ***
[2014-04-09T05:14:08+00:00] INFO: Chef-client pid: 9141
[2014-04-09T05:14:10+00:00] INFO: Setting the run_list to ["recipe[sample::multiple_exceptions]"] from CLI options
[2014-04-09T05:14:10+00:00] INFO: Run List is [recipe[sample::multiple_exceptions]]
[2014-04-09T05:14:10+00:00] INFO: Run List expands to [sample::multiple_exceptions]
[2014-04-09T05:14:10+00:00] INFO: Starting Chef Run for on-failure-berkshelf
[2014-04-09T05:14:10+00:00] INFO: Running start handlers
[2014-04-09T05:14:10+00:00] INFO: Start handlers complete.
[2014-04-09T05:14:10+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::UncookedError: Meal is not cooked well>
[2014-04-09T05:14:10+00:00] INFO: Meal 'bacon' is cooked
[2014-04-09T05:14:11+00:00] INFO: Meal 'breakfast' completed successfully.
[2014-04-09T05:14:11+00:00] INFO: Chef Run complete in 0.089208967 seconds
[2014-04-09T05:14:11+00:00] INFO: Running report handlers
[2014-04-09T05:14:11+00:00] INFO: Report handlers complete
Using Multiple on_failure Blocks
We may not want to do the same thing for all sorts of failures, so you can specify different on_failure blocks
for different exception types. The blocks will be evaluated in the order they appear in the resource declaration.
sample::multiple_blocks
1234567891011121314
# Simulate that meal is not cooked so UncookedError is raisednode.set['meal']['cooked']=false# Simulate that meal is cold so ColdError is raisednode.set['meal']['cold']=truemeal'breakfast'doon_failure(UncookedError){notify:fry,'food[bacon]'}on_failure(ColdError){notify:microwave,'food[bacon]'}endfood'bacon'doaction:nothingend
See the output of the following Chef run:
Chef run
12345678910111213141516
[2014-04-09T05:15:33+00:00] INFO: *** Chef 11.12.0 ***
[2014-04-09T05:15:33+00:00] INFO: Chef-client pid: 9362
[2014-04-09T05:15:35+00:00] INFO: Setting the run_list to ["recipe[sample::multiple_blocks]"] from CLI options
[2014-04-09T05:15:35+00:00] INFO: Run List is [recipe[sample::multiple_blocks]]
[2014-04-09T05:15:35+00:00] INFO: Run List expands to [sample::multiple_blocks]
[2014-04-09T05:15:35+00:00] INFO: Starting Chef Run for on-failure-berkshelf
[2014-04-09T05:15:35+00:00] INFO: Running start handlers
[2014-04-09T05:15:35+00:00] INFO: Start handlers complete.
[2014-04-09T05:15:35+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::UncookedError: Meal is not cooked well>
[2014-04-09T05:15:35+00:00] INFO: Meal 'bacon' is cooked
[2014-04-09T05:15:35+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::ColdError: Meal is cold>
[2014-04-09T05:15:35+00:00] INFO: Meal 'bacon' is now hot!
[2014-04-09T05:15:35+00:00] INFO: Meal 'breakfast' completed successfully.
[2014-04-09T05:15:35+00:00] INFO: Chef Run complete in 0.090865808 seconds
[2014-04-09T05:15:35+00:00] INFO: Running report handlers
[2014-04-09T05:15:35+00:00] INFO: Report handlers complete
It is clear that initially the Chef run failed because the food was not cooked. So it tried to fry the bacon so the
food was then cooked but it was cold (what?? just for example) so it tried to microwave the food and hence the Chef run
succeeded.
Access to Resource Attributes
The block specified in on_failure can take the resource as an argument so it can access all the resource’s
attributes.
sample::resource_attributes
12345678
# At least 1 bacon slice is required to complete breakfast without getting hungry.node.override['meal']['bacon_required']=1# The handler block is just used to demonstrate that the resource attributes can be accessed.meal'breakfast'dotime'2014-04-09 08:00:00 -0700'on_failure{|breakfast|Chef::Log.info"Tried eating breakfast at: #{breakfast.time}"}end
See the output of the following Chef run:
Chef run
123456789101112131415161718192021
[2014-04-09T06:09:05+00:00] INFO: *** Chef 11.12.0 ***
[2014-04-09T06:09:05+00:00] INFO: Chef-client pid: 2919
[2014-04-09T06:09:07+00:00] INFO: Setting the run_list to ["recipe[sample::resource_attributes]"] from CLI options
[2014-04-09T06:09:07+00:00] INFO: Run List is [recipe[sample::resource_attributes]]
[2014-04-09T06:09:07+00:00] INFO: Run List expands to [sample::resource_attributes]
[2014-04-09T06:09:07+00:00] INFO: Starting Chef Run for on-failure-berkshelf
[2014-04-09T06:09:07+00:00] INFO: Running start handlers
[2014-04-09T06:09:07+00:00] INFO: Start handlers complete.
[2014-04-09T06:09:07+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::HungryError: I want 1 bacon slices. But I only ate: 0>
[2014-04-09T06:09:07+00:00] INFO: Tried eating breakfast at: 2014-04-09 08:00:00 -0700
[2014-04-09T06:09:07+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::HungryError: I want 1 bacon slices. But I only ate: 0>
================================================================================
Error executing action `eat` on resource 'meal[breakfast]'
================================================================================
MealExceptions::HungryError
---------------------------
I want 1 bacon slices. But I only ate: 0
...
After the failure occurred, the block given in on_failure got executed which just had a single log line to show
that it had access to the resource’s attributes.