Application performance monitoring (APM) typically ignores HTTP 4XX status codes when tracking errors for a given application. Usually this is desirable: most people who encounter these do so because of user error, be it authorization (401/403) or a NOT FOUND issue (404). For the latter, the most common 404 someone encounters is for a specific record that can’t be located by some identifier, such as navigating to /users/:id but not being able to look up the user by the given ID. On the other hand, a 404 might also be encountered by someone visiting an endpoint that the application simply can’t handle, such as making a typo in the route: /usrs/:id instead of /users/:id.
This latter issue is relevant for today’s story.
The Setup
We had a chunk of code that was deprecated and scheduled for deletion so that we could make way for a new feature and stop maintaining the old code. It encompassed work over many years, across multiple teams, and by engineers who were no longer with the company. The PR was heavily scrutinized and, after several rounds of feedback, was eventually merged. Teams with ownership over the application monitored it for a while, and we determined there were no major regressions, so we rolled forward.
Unbeknownst to us, however, we had removed an API endpoint that was pretty heavily utilized by a third-party vendor. This endpoint was responsible for uploading some statistics and reporting that our data team consumed and generated reports from. The vendor ran some export code on a schedule or in a background job, and so apparently didn’t notice that we had removed the route entirely and that their reporting was now failing. We, on the other hand, also didn’t know anything was wrong, because requests to the API that used to exist simply failed with 404 NOT FOUND, and APM never picked it up.
We were effectively flying blind.
Casting a Net
In order to prevent this in the future and identify any current gaps we might have, we implemented a simple fallback in our Rails routes:
match "*path", :to => "application#missing", :via => :all
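What this does is route any request that matches no other route to a known good controller action, where we manually report an error to APM. Because Rails matches routes in the order they are defined, the catch-all belongs at the bottom of the routes file. This way, a request to a missing route never goes unnoticed.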
NOTE: You do not want to do this in an externally-facing application. It’ll be extraordinarily noisy and might cause you to overrun your monitoring quota if you have a heavily trafficked site. Since our application is internal and low-traffic, this is acceptable for our use case.
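As a rough illustration of what the catch-all action might do, here is a minimal sketch in plain Ruby. The `InMemoryApm` class, the `MissingRouteReporter` class, the `"RouteNotFound"` event name, and the `APM_CLIENT` constant are all hypothetical names made up for this example; a real application would call whatever SDK its APM vendor provides. Responding with 404 after reporting is also an assumption, chosen so callers still see the same status they would have seen without the fallback.

```ruby
# Hypothetical stand-in for an APM client. A real app would call its vendor's
# SDK here; this class only records events in memory.
class InMemoryApm
  attr_reader :events

  def initialize
    @events = []
  end

  def notify(name, context)
    @events << { name: name, context: context }
  end
end

# Reports an unmatched request and returns the HTTP status to respond with.
class MissingRouteReporter
  def initialize(apm)
    @apm = apm
  end

  def report(http_method:, path:)
    @apm.notify("RouteNotFound", { http_method: http_method, path: path })
    404
  end
end

# Inside Rails, the action the catch-all route points at might look like:
#
#   def missing
#     reporter = MissingRouteReporter.new(APM_CLIENT) # APM_CLIENT is assumed
#     head reporter.report(http_method: request.request_method, path: request.path)
#   end
```

Keeping the reporting logic in a small object of its own means it can be exercised without booting the whole application.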
Catching More Than We Bargained For
We fully expected to receive some noise even with our low request volume. We even caught a useless internal API call that we were able to fix rather quickly.
What we did not expect was our users making a very specific typo on a single route. This raised a giant red flag that warranted further investigation.
Our users frequently added a /detail suffix to a couple of resources. This happened quite often, but we were only able to estimate traffic by how often they made typos (such as /details or /detial). These typos made up, by far, the largest portion of missing routes in the application. Based on this monitoring, we were even able to determine that they were attempting this on completely unrelated resources (without the typo), where the route simply didn’t exist.
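One way to spot a pattern like this in the reported paths is to collapse record IDs and count the resulting route shapes. The sketch below is only an assumption about how such a tally could be done, not the analysis we actually ran; the helper names and the sample paths are made up for illustration.

```ruby
# Replace numeric path segments with :id so requests for different records
# collapse into the same route shape.
def route_shape(path)
  path.gsub(%r{/\d+(?=/|\z)}, "/:id")
end

# Count how often each route shape appears, most frequent first.
def tally_route_shapes(paths)
  paths.map { |path| route_shape(path) }
       .tally
       .sort_by { |_shape, count| -count }
end

# Made-up sample paths for illustration only; not data from the application.
sample = ["/users/12/details", "/orders/7/detial", "/users/98/details", "/usrs/3"]
tally_route_shapes(sample).each { |shape, count| puts "#{count}\t#{shape}" }
```

Grouping by shape rather than by raw path keeps a single frequently mistyped suffix from being hidden behind many different record IDs.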
This signaled 3 things:
- Our users are manually entering a URL to navigate to some sort of detail view
- They are trying to do this willy-nilly in order to gain granular access to a variety of resources across the application
- We now need to audit this usage to determine whether appropriate authorization is happening
All Hands On Deck
We pulled in the team that owned this section of the application and found that they were aware of the existence of this page and that they used it heavily as sort of a “debug” menu for engineers. They also knew that there were no buttons or links that navigated to it. It had been handed off from the previous engineer when they took ownership of the app. They had assumed it was appropriately secured to a small group.
What we found, though, was that any logged-in user could access the page.
We immediately issued a patch to fix the authorization. Furthermore, given that this was a useful page, we had the owning team commit to moving this feature to a more appropriate location (with its own UI paths to navigate there instead of manually adding /detail) so that we could better control who was using it and how they did so. Finally, product teams engaged our internal users to figure out how this feature had essentially “leaked” and subsequently spread to other internal teams purely by word of mouth.
All in all, it was a very wild chain of events that luckily ended in a positive result.
That’s all for now. Thanks for reading!