Overview

Nobody likes when something fails. When a Chef run fails for some reason, we have to figure out what resource caused the failure and then try to fix it and then re-run it. If we know what might fail, we can try to do some steps and then try the action again. This message sent to the Chef community mailing list proposes something very similar.

Here is an RFC proposed by the Chef community which is still in discussion. It provides a resource specific error handling mechanism. It has many features that will help making an awesome overall Chef experience. If we are aware of something that can fail, why not make the resource try something to fix it by itself?

The Chef Handler provides a way of handling exceptions but that is very limited to just handling exceptions and performing some actions after the failure and be done. I wrote a blog post earlier about how to write custom Chef Handlers.

Proof of Concept

I wrote a simple cookbook called on_failure that monkey patches the way Chef runs the resources and implements this feature. I tried to add as many features as possible but there are still some discussions on how to do certain things. This is just a starting point for trying out how this would work. This cookbook is also published to the community.

In Action

The on_failure cookbook has some additional cookbooks that help trying out this feature. The following are those cookbooks and their purposes:

  • sample – Contains some sample recipes for using the on_failure handler feature.
  • meal – Provides a resource called meal to mimic the example provided in the RFC.
  • food – Provides a resource called food to mimic the example provided in the RFC.

I tried to make all the examples used here to be very similar to the ones proposed in the RFC to make them easier to understand and those examples are also fun to use.

Simple Example

Here is a simple example of how adding this feature to a resource will look like:

sample::default
1
2
3
4
5
6
7
meal 'breakfast' do
  on_failure { notify :eat, 'food[bacon]' }
end

food 'bacon' do
  action :nothing
end

This recipe doesn’t actually raise any error so there is no use for the on_failure handler in the meal[breakfast] resource. Let’s simulate a failure and see what the (sample::with_exception) recipe looks like:

sample::with_exception
1
2
3
4
5
6
7
8
9
10
# At least 1 bacon slice is required to complete breakfast without getting hungry.
node.override['meal']['bacon_required'] = 1

meal 'breakfast' do
  on_failure(HungryError) { notify :eat, 'food[bacon]' }
end

food 'bacon' do
  action :nothing
end

This recipe will raise a HungryError exception when meal[breakfast] resource first runs. During failure, the food[bacon] resource’s :eat action will be run and it will make the resource meal[breakfast] to succeed when it is tried again. See the output in the following Chef run:

Chef run
1
2
3
4
5
6
7
8
9
10
11
12
13
14
[2014-04-09T04:07:10+00:00] INFO: *** Chef 11.12.0 ***
[2014-04-09T04:07:10+00:00] INFO: Chef-client pid: 8683
[2014-04-09T04:07:12+00:00] INFO: Setting the run_list to ["recipe[sample::with_exception]"] from CLI options
[2014-04-09T04:07:12+00:00] INFO: Run List is [recipe[sample::with_exception]]
[2014-04-09T04:07:12+00:00] INFO: Run List expands to [sample::with_exception]
[2014-04-09T04:07:12+00:00] INFO: Starting Chef Run for on-failure-berkshelf
[2014-04-09T04:07:12+00:00] INFO: Running start handlers
[2014-04-09T04:07:12+00:00] INFO: Start handlers complete.
[2014-04-09T04:07:12+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::HungryError: I want 1 bacon slices. But I only ate: 0>
[2014-04-09T04:07:12+00:00] INFO: Ate 'bacon' successfully. yummy!
[2014-04-09T04:07:12+00:00] INFO: Meal 'breakfast' completed successfully.
[2014-04-09T04:07:12+00:00] INFO: Chef Run complete in 0.091023597 seconds
[2014-04-09T04:07:12+00:00] INFO: Running report handlers
[2014-04-09T04:07:12+00:00] INFO: Report handlers complete

Handling with Retries

By default, only one attempt is made to rerun the resource action. It can be customized by giving the retries option in the on_failure construct. The resource action will be tried for the specified number of times and the block given will be executed every time before retrying the resource action.

sample::with_retries
1
2
3
4
5
6
7
8
9
10
# 3 bacon slices are required for completing the breakfast without getting hungry. This is used to do retries.
node.override['meal']['bacon_required'] = 3

meal 'breakfast' do
  on_failure(retries: 5) { notify :eat, 'food[bacon]' }
end

food 'bacon' do
  action :nothing
end

In this example, the meal[breakfast] resource will succeed only if 3 bacon slices are eaten — so it will pass at the 4th attempt. See the output in the following Chef run:

Chef run
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
[2014-04-09T05:12:19+00:00] INFO: *** Chef 11.12.0 ***
[2014-04-09T05:12:19+00:00] INFO: Chef-client pid: 8919
[2014-04-09T05:12:21+00:00] INFO: Setting the run_list to ["recipe[sample::with_retries]"] from CLI options
[2014-04-09T05:12:21+00:00] INFO: Run List is [recipe[sample::with_retries]]
[2014-04-09T05:12:21+00:00] INFO: Run List expands to [sample::with_retries]
[2014-04-09T05:12:21+00:00] INFO: Starting Chef Run for on-failure-berkshelf
[2014-04-09T05:12:21+00:00] INFO: Running start handlers
[2014-04-09T05:12:21+00:00] INFO: Start handlers complete.
[2014-04-09T05:12:21+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::HungryError: I want 3 bacon slices. But I only ate: 0>
[2014-04-09T05:12:21+00:00] INFO: Ate 'bacon' successfully. yummy!
[2014-04-09T05:12:21+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::HungryError: I want 3 bacon slices. But I only ate: 1>
[2014-04-09T05:12:21+00:00] INFO: Ate 'bacon' successfully. yummy!
[2014-04-09T05:12:21+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::HungryError: I want 3 bacon slices. But I only ate: 2>
[2014-04-09T05:12:21+00:00] INFO: Ate 'bacon' successfully. yummy!
[2014-04-09T05:12:21+00:00] INFO: Meal 'breakfast' completed successfully.
[2014-04-09T05:12:21+00:00] INFO: Chef Run complete in 0.099481932 seconds
[2014-04-09T05:12:21+00:00] INFO: Running report handlers
[2014-04-09T05:12:21+00:00] INFO: Report handlers complete

Handling Multiple Exceptions

Multiple exceptions can be specified in on_failure and it will be caught if any of the exceptions are raised by the resource.

sample::multiple_exceptions
1
2
3
4
5
6
7
8
9
10
# Simlate that the meal is not cooked so UncookedError is raised
node.set['meal']['cooked'] = false

meal 'breakfast' do
  on_failure(UncookedError, HungryError, retries: 3) { notify :fry, 'food[bacon]' }
end

food 'bacon' do
  action :nothing
end

See the output in the following Chef run:

Chef run
1
2
3
4
5
6
7
8
9
10
11
12
13
14
[2014-04-09T05:14:08+00:00] INFO: *** Chef 11.12.0 ***
[2014-04-09T05:14:08+00:00] INFO: Chef-client pid: 9141
[2014-04-09T05:14:10+00:00] INFO: Setting the run_list to ["recipe[sample::multiple_exceptions]"] from CLI options
[2014-04-09T05:14:10+00:00] INFO: Run List is [recipe[sample::multiple_exceptions]]
[2014-04-09T05:14:10+00:00] INFO: Run List expands to [sample::multiple_exceptions]
[2014-04-09T05:14:10+00:00] INFO: Starting Chef Run for on-failure-berkshelf
[2014-04-09T05:14:10+00:00] INFO: Running start handlers
[2014-04-09T05:14:10+00:00] INFO: Start handlers complete.
[2014-04-09T05:14:10+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::UncookedError: Meal is not cooked well>
[2014-04-09T05:14:10+00:00] INFO: Meal 'bacon' is cooked
[2014-04-09T05:14:11+00:00] INFO: Meal 'breakfast' completed successfully.
[2014-04-09T05:14:11+00:00] INFO: Chef Run complete in 0.089208967 seconds
[2014-04-09T05:14:11+00:00] INFO: Running report handlers
[2014-04-09T05:14:11+00:00] INFO: Report handlers complete

Using Multiple on_failure Blocks

We may not want to do the same thing for all sorts of failures, so you can specify different on_failure blocks for different exception types. The blocks will be evaluated in the order they appear in the resource declaration.

sample::multiple_blocks
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Simulate that meal is not cooked so UncookedError is raised
node.set['meal']['cooked'] = false

# Simulate that meal is cold so ColdError is raised
node.set['meal']['cold'] = true

meal 'breakfast' do
  on_failure(UncookedError) { notify :fry, 'food[bacon]' }
  on_failure(ColdError) { notify :microwave, 'food[bacon]' }
end

food 'bacon' do
  action :nothing
end

See the output of the following Chef run:

Chef run
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
[2014-04-09T05:15:33+00:00] INFO: *** Chef 11.12.0 ***
[2014-04-09T05:15:33+00:00] INFO: Chef-client pid: 9362
[2014-04-09T05:15:35+00:00] INFO: Setting the run_list to ["recipe[sample::multiple_blocks]"] from CLI options
[2014-04-09T05:15:35+00:00] INFO: Run List is [recipe[sample::multiple_blocks]]
[2014-04-09T05:15:35+00:00] INFO: Run List expands to [sample::multiple_blocks]
[2014-04-09T05:15:35+00:00] INFO: Starting Chef Run for on-failure-berkshelf
[2014-04-09T05:15:35+00:00] INFO: Running start handlers
[2014-04-09T05:15:35+00:00] INFO: Start handlers complete.
[2014-04-09T05:15:35+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::UncookedError: Meal is not cooked well>
[2014-04-09T05:15:35+00:00] INFO: Meal 'bacon' is cooked
[2014-04-09T05:15:35+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::ColdError: Meal is cold>
[2014-04-09T05:15:35+00:00] INFO: Meal 'bacon' is now hot!
[2014-04-09T05:15:35+00:00] INFO: Meal 'breakfast' completed successfully.
[2014-04-09T05:15:35+00:00] INFO: Chef Run complete in 0.090865808 seconds
[2014-04-09T05:15:35+00:00] INFO: Running report handlers
[2014-04-09T05:15:35+00:00] INFO: Report handlers complete

It is clear that initially the Chef run failed because the food was not cooked. So it tried to fry the bacon so the food was then cooked but it was cold (what?? just for example) so it tried to microwave the food and hence the Chef run succeeded.

Access to Resource Attributes

The block specified in on_failure can take the resource as an argument so it can access all the resource’s attributes.

sample::resource_attributes
1
2
3
4
5
6
7
8
# At least 1 bacon slice is required to complete breakfast without getting hungry.
node.override['meal']['bacon_required'] = 1

# The handler block is just used to demonstrate that the resource attributes can be accessed.
meal 'breakfast' do
  time '2014-04-09 08:00:00 -0700'
  on_failure { |breakfast| Chef::Log.info "Tried eating breakfast at: #{breakfast.time}" }
end

See the output of the following Chef run:

Chef run
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
[2014-04-09T06:09:05+00:00] INFO: *** Chef 11.12.0 ***
[2014-04-09T06:09:05+00:00] INFO: Chef-client pid: 2919
[2014-04-09T06:09:07+00:00] INFO: Setting the run_list to ["recipe[sample::resource_attributes]"] from CLI options
[2014-04-09T06:09:07+00:00] INFO: Run List is [recipe[sample::resource_attributes]]
[2014-04-09T06:09:07+00:00] INFO: Run List expands to [sample::resource_attributes]
[2014-04-09T06:09:07+00:00] INFO: Starting Chef Run for on-failure-berkshelf
[2014-04-09T06:09:07+00:00] INFO: Running start handlers
[2014-04-09T06:09:07+00:00] INFO: Start handlers complete.
[2014-04-09T06:09:07+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::HungryError: I want 1 bacon slices. But I only ate: 0>
[2014-04-09T06:09:07+00:00] INFO: Tried eating breakfast at: 2014-04-09 08:00:00 -0700
[2014-04-09T06:09:07+00:00] INFO: meal[breakfast] failed with: #<MealExceptions::HungryError: I want 1 bacon slices. But I only ate: 0>

================================================================================
Error executing action `eat` on resource 'meal[breakfast]'
================================================================================


MealExceptions::HungryError
---------------------------
I want 1 bacon slices. But I only ate: 0
...

After the failure occurred, the block given in on_failure got executed which just had a single log line to show that it had access to the resource’s attributes.

Comments