Archive for March, 2008

Finding and Fixing Strange Rails Bugs (or, How I Wasted My Afternoon)

March 13, 2008

Rails is a great framework, but sometimes bugs can be incredibly tricky to track down. Case in point, I spent a significant portion of yesterday tracking down a very strange bug. I’ll post the resolution – but perhaps of more general interest, I’ll outline some tips that you can use when tracking down Rails issues (or really, any issues). These tips may be obvious, but sometimes in my desire to quickly fix a bug, I forget the most basic rules of debugging.
Short Story: Ruby Classifier requires ‘mathn’. As it turns out, requiring ‘mathn’ breaks redirect behavior in Rails. Check out Rails ticket 5433 for more info and a workaround.
(Very) Long Story:
The initial behavior I noticed was that one integration test was failing. The integration test was doing the following


post_via_redirect '/account/login', options
assert_response :success

And was failing with the following error:
Expected response to be a , but was <302>
Not so weird in itself, but the bug only appeared when I ran the tests using autotest. When I ran the test via rake test:integration, everything worked. I looked at the test and the controller and everything seemed normal (and I hadn’t changed anything in the test or in the controller lately), so I started to dig in deeper.
Tip 1: Isolate the changes. My first step was to use svn to track down the exact revision that had caused the bug (I hadn’t caught the bug before checking in because I had been testing using rake, not autotest). Luckily, this is pretty simple – just keep performing svn update -r REVISION_NUM and running the test until you find the revision that causes the test to fail (I wish I had a tool to do this automatically…).
However, in my laziness, I’d forgotten to …
Tip 2: Minimize the Repro. This really should be tip 1, but I’m telling this story chronologically, so you can learn from my mistakes. As I said, this bug only reproed when using autotest. Each time I wanted to test for the bug, I had to wait almost a minute for autotest to run the tests. Due to this long repro, I ended up spending probably ten times the amount of time I needed to figure out which checkin caused the bug.
Usually, when a test fails, you can just use ruby to run a single test. But if you ever have a bug that only seems to repro in autotest, remember that autotest prints out all the commands it runs. In this case, it was simply a case of copying and pasting the code (which ran a ton of test files) from the output of autotest onto the command line. It looked something like this:


ruby -I.:lib:test -rtest/unit
-e "%w[test/functional/resources_controller_test.rb
test/functional/account_controller_test.rb test/unit/merge_test.rb
test/unit/render_state_test.rb test/integration/filter_clears_test.rb
test/unit/item_test.rb test/functional/admin_items_controller_test.rb
test/unit/comment_test.rb test/unit/lists_helper_test.rb
test/functional/lists_controller_test.rb test/functional/items_controller_test.rb
test/unit/list_test.rb test/unit/memory_list_test.rb
test/integration/contact_form_test.rb test/unit/xpath_test.rb
test/functional/contact_controller_test.rb test/unit/resource_test.rb
test/unit/membership_test.rb test/unit/list_classifier_test.rb
test/functional/listgroups_controller_test.rb test/integration/resources_test.rb].each
{ |f| require f }" | unit_diff -u

Yikes. That successfully reproed the bug, but it took way too long. The next step was manually pulling out test files until the bug stopped reproing. That left me with a simpler (and much faster) repro:
ruby -I.:lib:test -rtest/unit -e "%w[test/unit/list_classifier_test.rb test/integration/resources_test.rb].each { |f| require f }"
OK, so it seemed that there is some weird interaction between list_classifier_test and resources_test.
Tip 3: Simplify, Simplify This is really an extension of tip 2, but I’m repeating it because most of debugging is systematically simplifying the problem by removing complicating factors. So, I went to list_classifier and commented out all the tests. The bug disappeared. I put a single test back in – the bug reappeared.
Now I was onto something. I then realized that loading the file under test (list_classifier.rb) was the problem. That code is based on the awesome Ruby Classifier library. As a result, the file begins with require 'classifier'. Interesting. So I tried just requiring ‘classifier’ in the integration test. Bingo. Somehow, using the Classifier library was breaking my integration test.
Tip 4: Follow your gut, but validate. For nasty bugs like this, the interactions are so complicated it would take forever to just step through the code and find the bug. So, you have to develop a gut feeling for what might be causing the bug and follow that first. Of course, you have to develop tricks to quickly confirm or reject your hunches so you don’t go down the wrong path for an hour.
For instance, my initial feeling was that monkey patching was to blame. Specifically, I noticed that both Classifier and Rails had defined a ‘sum’ method on Array (actually, Rails defines ‘sum’ on Enumerable, but Array includes Enumerable). A quick check confirmed that Classifier’s ‘sum’ had overwritten Rails’ ‘sum’ by the time my integration test ran. But I wasn’t sure if the Rails code actually ever called this method.
To check this, I used ruby-debug. I first set a breakpoint right before the troublesome line in the integration test. At the debugger prompt, I turned on trace mode (trace on all) and then continued (c). Trace mode prints out every single line of code that is executed, including the file in which the line appears. It was simple enough to copy/paste the trace into Emacs and search for any line executed in the ‘classifier’ library.
As it turns out, I didn’t find anything. My hunch was wrong – Classifier’s ‘sum’ method wasn’t to blame. But luckily, it didn’t take me too long figure it out.
So, I went back to simplifying the repro. Instead of doing require 'classifier' (which includes the whole gem), I started including specific files from the gem. After doing this for awhile, I found that only requiring the file ‘classifier-1.3.1/lib/classifier/extensions/vector.rb’ caused the bug.
I tried to simplify even more. That file starts with both require 'matrix and require 'mathn'. Could one of those be the problem? I tried requiring each of those files at the top of the integration test and sure enough, require 'mathn' caused the problem.
Tip 5: Google is your friend. All throughout this long process, I’d been Googling to see if anyone had solved this issue. Of course, until the end I didn’t really know what the root cause was, so my searches turned up nothing useful. However, after narrowing it down to integration tests and mathn, I found a this Rails ticket (search ‘rails integration test mathn’). As it turns out, mathn changes the behavior of integer division, which makes the code return the wrong HTTP status code. And the reporter includes a simple, one-line patch which is easy to apply.
In retrospect, it all seems to simple (and in fact, it’s frustrating that it took me so long to solve). When tracking down a bug, it’s very easy to get frustrated or overwhelmed with the sheer size of the search space and forget basic debugging practices. Although tracking down bugs will always remain somewhere between science and art, these tips should help you the next time you encounter a real head-scratcher.