Other Posts in This Series
- Part 1: Test Timeouts (You’re here!)
- Part 2: On Video Conferencing
- Part 3: <select> Tags Crash Chrome
Everybody’s broken something in prod before. Not everyone has broken something in other people’s prod before. Browser and OS developers can certainly say they have!
In web development, a bug in a browser engine, or even at the OS level, can have a big effect on how a web application behaves, without a single line of application code changing.
Today I want to look at the first of three cases where a Chromium release unexpectedly affected our application, leading to some panicked incident reports and some pretty hilarious problems. A common thread in these issues is that there's usually a problem in application code that lines up with a change made upstream in Chromium.
Browser tests using headless Chrome suddenly time out
Background
This case is the most recent and had a big effect on feedback cycles for developers, but luckily wasn’t really an issue in production.
The app in question is a Ruby on Rails app, running an RSpec test suite that uses Capybara / Selenium for some integration tests. Those in turn use Chromedriver to run headless Chrome and interact with a real UI.
Symptoms Begin
A few days ago, Chromium 126 was released, replacing version 125. Our CI pipeline grabs the latest Chromium version for the browser tests it needs, so it picked up 126 automatically. Chromium's release cadence looks to be roughly once a month, so it's not uncommon for CI to end up running tests on a brand-new release. Once 126 landed, tests that normally ran in seconds began taking minutes, hanging while they waited for some request to complete (the initial reports didn't make clear which one). Our pipeline started taking over an hour to finish, and we were lucky that only a handful of tests were affected: more would have taken hours to crunch through, or possibly crashed or stalled workers. Engineers were baffled because the bug wasn't immediately reproducible locally: developers upgrade Chrome / Chromium on their own machines at their discretion, so local machines were still on 125.
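One way to keep a hang like this from eating an hour of pipeline time is to put a hard deadline around slow browser steps. The following is only a minimal sketch using Ruby's standard `Timeout` module, not what our suite actually does; `with_deadline` and its `label` parameter are hypothetical names invented for illustration.

```ruby
require "timeout"

# Hypothetical helper (not part of RSpec or Capybara): run the given block,
# but raise Timeout::Error with a descriptive message if it takes longer
# than `seconds`. Returns the block's value when it finishes in time.
def with_deadline(seconds, label: "browser step")
  message = "#{label} exceeded #{seconds}s; possible browser hang"
  Timeout.timeout(seconds, Timeout::Error, message) { yield }
end

# Finishes well inside the deadline, so the block's value is returned.
value = with_deadline(1, label: "quick step") { :done }
```

In an RSpec suite, a helper like this could be called from an `around` hook so that a stuck example fails after a known limit instead of stalling the worker. `Timeout` works by interrupting the running thread, so it is a blunt safety net and not a substitute for finding the cause.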
Diagnosis
Luckily, we had decent logs of which tests failed, which let us narrow the problem down to a small group of related tests. We checked for recent changes to those files and their tests and came up empty; they had been stable for a while. Crucially, the logs also show, more or less, the environment setup for the browser, and we noticed that CI was running Chromedriver 126 while local environments were on 125. (The Chromedriver version is typically tied to the Chromium version, so this meant Chromium was on 126 in CI and 125 locally as well.) Once we found that out, it was possible to reproduce the problem on a dev machine.
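The version comparison that cracked the case is easy to automate. Here is a rough sketch, assuming the version strings look like the output of the browser's or driver's `--version` flag; the helper names `major_version` and `version_drift?` are hypothetical, and the build numbers are made up for illustration.

```ruby
# Extract the major version from a string such as
# "Chromium 126.0.0.0" or "ChromeDriver 125.0.0.0 (build info)".
# Returns nil when no four-part version number is present.
def major_version(version_output)
  match = version_output.match(/(\d+)\.\d+\.\d+\.\d+/)
  match && match[1].to_i
end

# Hypothetical check: true when both versions parse and their majors differ.
def version_drift?(ci_output, local_output)
  ci = major_version(ci_output)
  local = major_version(local_output)
  !ci.nil? && !local.nil? && ci != local
end

# Illustrative inputs only; real build numbers will differ.
drift = version_drift?("ChromeDriver 126.0.0.0", "ChromeDriver 125.0.0.0")
# drift => true
```

Printing a warning when CI and local majors differ would have pointed us at the Chromium upgrade right away, instead of after a round of log archaeology.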
So What the Heck Happened?
After carefully logging requests while the tests were running, we found that on Chromium / Chromedriver 126:
- The failing tests were closely related and involved a shared UI partial template.
- The template is responsible for rendering image content of various filetypes, like JPEG, PNG, and crucially, PDF.
- PDFs are rendered in an iframe.
- The PDF being rendered requires a high level of authorization, so access is only granted with a specific permission (a concept I have talked about once or twice already).
- The test user didn't have the right permission, so the request would have been redirected with a 403 Forbidden error.
We were able to confirm that any of the following resolved the issue: removing the iframe, changing the content type from PDF to an image format, or reverting to a previous Chromium / Chromedriver. For good measure, we also double-checked that pages weren't hanging in prod.
Now, I don't pretend to know 100% what is going on under the hood. There's a weird interaction somewhere among RSpec, Capybara, Selenium, Chromedriver, Chromium, and Chrome itself that simply wasn't worth digging into too much. All I know for certain is that a failed permission check in an iframe will cause tests to hang in Chromium 126.
The short term solution: just add the right permissions. A lot of work for a one-line configuration change!
That’s all for now. Thanks for reading!