Update 6/21/23: I wrote a postmortem about building this feature, which you can read here.

This post could alternatively be titled “How to replace an airplane’s wings while it’s flying, without anyone noticing”. I’m only half joking, because that’s what it felt like sometimes.


Today I want to talk about a professional project that I’m really proud of. Not so much because it was difficult, or I was clever, or I worked really fast and got the solution out the door quickly – although most of those things did happen. The success I’ll be covering is that, despite vague requirements and lots of unknowns about how the existing system worked (more on that later), I designed a solution and was able to silently swap out the internals of a heavily used tool without anyone realizing my team and I had pulled the ol’ switcheroo.

You might already suspect what I’ll be talking about – “dark” or “shadow” launches.

In essence, it involves writing new code that operates in parallel with the existing code so that you can instrument it and compare it to the original. This is in contrast to something like feature flagging or A/B testing, where only one version of the code runs for a given user and the cohort exposed to the treatment is some percentage of the total population. With shadow launches, the entire population is exposed to both the control and the treatment version of the code.
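To make that concrete, here’s a minimal sketch of the pattern (every name in it is made up for illustration – this is not our code): the old path stays authoritative, the new path runs alongside it, and all we record is whether the two agree.

def allowed_to_edit?(user, thing)
  old_result = LegacyAbility.new(user).can?(:edit, thing)    # existing behavior stays authoritative

  begin
    new_result = NewPolicy.new(user, thing).edit?             # new code runs "in the shadows"
    record_mismatch(user, thing) if new_result != old_result  # instrument and compare
  rescue StandardError => e
    report_shadow_error(e)                                    # the new path must never break the old one
  end

  old_result                                                  # callers only ever see the old answer
end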


Project Background

When I joined ID.me back in September as a Rails developer, there were a couple of long-running project ideas that had been floating around for a while. One sounded particularly interesting to me – redesigning our authorization system. For one reason or another, this got pushed back for a few months until finally, at the beginning of the year, my team was given the green light to start working on it. I got put in charge of designing the system mainly because I asked to do it (if you ever get this opportunity in your career, you should take it!). By the time I sat down to come up with a better solution, we had accumulated some serious tech debt. Some of it I would even categorize as organizational debt – I learned that it was so crusty that even experienced engineers struggled to understand it. In turn, that led to a culture of wantonly requesting far more access than people required just to ensure they could do their jobs.

Specifically:

  1. The application used CanCan for its authorization library. I’m not here to bash on CanCan, but it caused a lot of problems for us (admittedly some self-inflicted):
    • As developers came and went, some did not have as strong a grasp on the library as others.
    • Our system allowed for multiple roles instead of enforcing a single role.
    • Abilities (CanCan’s representation of permissions) are order-dependent.
    • Abilities can be negative and remove permissions when they are added.
    • “Magic” abilities that grant ALL permissions.
  2. The above issues led to engineers not really understanding levels of access, and so they could not articulate them to the IT team that was responsible for assigning them.
  3. The IT team’s confusion was passed on to the end user and so access requests expanded to cover more permissions than required “just in case”. Ultimately this resulted in approximately 165 (!) unique role combinations in our system.

If you ever design an authorization library, I beg you not to ever allow permission subtraction or magic keywords. These two things caused by far the most headaches for us.


Examples

Just to level set, in CanCan you define an ability by using can(:verb, :noun), with some exceptions I’ll definitely call out below. This makes it pretty readable – you can “do” (verb) something to a “thing” (noun). There are of course some other helpers, like can?(:verb, :noun), which returns a boolean indicating whether you have a certain permission. You can also use classes in your permissions, like can(:update, Thing) for updating Thing instances. All of these things live in a CanCan::Ability instance and get invoked during UI authorization, request authorization, etc.
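Put together, a toy ability class and a permission check might look like this (an illustrative snippet, not from our codebase):

# app/models/ability.rb
class Ability
  include CanCan::Ability

  def initialize(user)
    can :view, :dashboard   # symbol-based: "view" the "dashboard"
    can :update, Thing      # class-based: update any Thing instance
  end
end

# elsewhere, e.g. in a controller or view
ability = Ability.new(current_user)
ability.can?(:update, Thing)  # => true
ability.can?(:destroy, Thing) # => false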

Let’s take a look at a few issues with our implementation of CanCan:


Abilities can be subtractive

# app/models/ability.rb

if user.has_role?(:role_1)
  can :do, :thing
end

if user.has_role?(:role_2)
  cannot :do, :thing
end

This isn’t bad on its own. If you have the role_1 role, you can do things. If you have role_2, you can’t.

But if you have [role_1, role_2], you can’t, because the second role has subtracted your first permission. That also leads into the second issue:


Abilities are order-dependent

Because we’re using multiple roles, this now means that the order the roles are defined also determines the access:

# First implementation
if user.has_role?(:role_1)
  can :do, :thing
end

if user.has_role?(:role_2)
  cannot :do, :thing
end

# Second implementation
if user.has_role?(:role_2)
  cannot :do, :thing
end

if user.has_role?(:role_1)
  can :do, :thing
end

These look very similar, but they aren’t equivalent. In the first implementation, [role_1, role_2] can’t do things. But in the second, it can! The order in which the rules are defined determines whether the subtraction takes effect.


Magic abilities are massively confusing

There are a bunch of magic words in CanCan, but I want to talk about two of the really dangerous ones – manage and all. manage is the catch-all verb – it grants every action on a given resource. Similarly, all is the catch-all resource, granting a given action on every resource. Combining them is incredibly dangerous because it’s not just all current actions and resources. Left untouched, this code grants every action on every resource that will ever exist in your application:

if user.has_role?(:superadmin)
  can :manage, :all
end

Yikes!


Here’s just a short snippet showing a few roles with a single resource, just to drive home how difficult it is to understand a system set up like this. In reality, our system covered approximately 20 separate roles, with permissions across most of the models in the system. And remember, we also had to deal with understanding ~165 unique combinations of roles at the same time!

I encourage you to take a second to read this and try to figure out what you think it does before looking at the answer that follows.

if user.has_role?(:superadmin)
  can :manage, :all
end

if user.has_role?(:role_1)
  can :view, Thing
  can :update, Thing
  
  cannot :destroy, Thing
end

if user.has_role?(:role_2)
  can :view, Thing
  can :destroy, Thing

  cannot :update, Thing
end

This maps to:

Role Combination               Thing Permissions
[superadmin, role_1, role_2]   view
[superadmin, role_1]           view, update
[superadmin, role_2]           view, destroy
[superadmin]                   view, update, destroy
[role_1, role_2]               view
[role_1]                       view, update
[role_2]                       view, destroy

As you can see, adding roles to superadmin behaves completely counterintuitively!


Out With the Old, In With the New

The new system is quite a bit easier to reason about. We based it on Pundit, and it uses permission-based access control (PBAC) rather than role-based access control (RBAC).

Our authorization system provides a UI for permission management, and the permissions it manages get passed down to each of the systems it controls. That keeps the authorization system generic: all it has to provide is a single permission handle, and the consuming systems can implement permissions however they like, so long as they accept that handle.

A handle is quite simple – we just mash together the controller name, the action we’re in, and any optional attributes:

Controller Name    Action  Attribute  Final Handle
ThingsController   edit    locked     things:edit:locked
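If it helps, here’s a hypothetical helper showing the mashing (in Rails, controller_path for ThingsController is already “things” and action_name is “edit”; the attribute is optional):

def permission_handle(controller_path, action_name, attribute = nil)
  # "things" + "edit" (+ "locked") => "things:edit:locked"
  [controller_path, action_name, attribute].compact.join(":")
end

permission_handle("things", "edit", "locked") # => "things:edit:locked"
permission_handle("things", "edit")           # => "things:edit"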

Next, we define policies that can handle our permission handles (get it?):

class ThingPolicy < ModelPolicy

  register_attribute :locked, required_if: :locked?

  def edit?
    required_permissions?
  end

end

You’ll notice this is not a standard Pundit policy. I’m not going to get into the specifics of how this works, but basically this policy knows based on the model name that it’s looking for a things permission (due to being the ThingPolicy), with an edit action (due to the edit? method), with an optional attribute of locked if the thing meets an additional conditional.

It’s sufficient to know that this policy defines 2 permissions: a permission to edit things, and a permission to edit locked things.
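I won’t reproduce ModelPolicy here, but as a rough, hypothetical sketch, the edit? check effectively boils down to something like this (permission_handles is a made-up accessor for the handles granted to the user):

# Hypothetical expansion of what ThingPolicy#edit? effectively checks
def edit?
  handles = ["things:edit"]                           # always required for the edit action
  handles << "things:edit:locked" if record.locked?   # extra handle required when the thing is locked
  handles.all? { |handle| user.permission_handles.include?(handle) }
end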

Once in the application, we simply ensure authorization on every controller action, something that Pundit supports out of the box:

class ApplicationController < ActionController::Base
  include Pundit::Authorization

  after_action :verify_authorized, except: :index
  before_action :authorize
end

Here, the before_action :authorize is calling into our policy to determine if we have permissions. Easy peasy!
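One aside: Pundit doesn’t render anything when a check fails – it raises Pundit::NotAuthorizedError. Turning that into a friendly response is up to the application; a common pattern, close to the one in Pundit’s docs (not necessarily what our app does verbatim), looks like:

class ApplicationController < ActionController::Base
  include Pundit::Authorization

  rescue_from Pundit::NotAuthorizedError, with: :user_not_authorized

  private

  # Render a friendly "forbidden" response instead of a 500
  def user_not_authorized
    flash[:alert] = "You are not authorized to perform this action."
    redirect_back(fallback_location: root_path)
  end
end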


Now for the Hard Part

So the new system is easy to use, readable, extensible, and doesn’t suffer from the gotchas of the old system. But how the hell do we implement it without taking down the authorization layer of our application and impacting critical workflows? Remember how complicated even the contrived example rules were in the RBAC model, and our production system has dozens of roles and hundreds of unique combinations of roles assigned to users. It’s simply not feasible to know up front whether we have a 100% accurate reproduction of roles in our new system, as it would be time-prohibitive for an engineer to walk through them all. If we screw up, people can’t work.

Enter the solution: a dark launch.

Here’s the gist:

We came up with a query to consolidate some of the lesser used combinations of roles into a more commonly used one. That reduced the total number of new roles we needed to implement.
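I won’t share the real query, but the analysis behind it looks roughly like this (a hypothetical sketch assuming a rolify-style User has_many :roles association):

# Count how many users hold each exact combination of roles, so that rarely
# used combinations can be folded into a nearby, more common one.
User.includes(:roles)
    .group_by { |user| user.roles.map(&:name).sort }
    .transform_values(&:count)
    .sort_by { |_combo, count| count }
    .each { |combo, count| puts format("%5d  %s", count, combo.join(", ")) }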

Next, we devised our dark launch framework. We literally ran the two authorization libraries in parallel with a complicated series of feature flags (both in a feature flagging tool called Flipper and on a per-user basis in our authorization system DB).
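For a sense of how the flags get toggled (the flag and column names match the authorize override shown below; the actual rollout scripting is glossed over):

# Global switch: turn shadow mode on for everyone (nothing is enforced yet)
Flipper.enable(:permission_dark_launch)

# Per-user switch: this user is now actually enforced by the new system
user.update!(permissions_enabled: true)

# Shadow mode can be shut off in one place for anyone not individually enabled
Flipper.disable(:permission_dark_launch)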

Remember that before_action :authorize lifecycle hook from earlier? We overloaded it to support our dark launch use case as well:

def authorize(record = nil, dependencies: {})
  @_pundit_authorized = true
  return unless current_user.permissions_enabled || Flipper.enabled?(:permission_dark_launch)

  @_pundit_exception = nil

  begin
    ApplicationPolicy.new(current_user, record, controller_path, action_name, dependencies).authorize!
  rescue Pundit::NotAuthorizedError => exception
    @_pundit_exception = exception
    raise exception if current_user.permissions_enabled
  end
end

Bear with me, there’s a lot going on.

  1. We early return unless some feature flags are in place. permissions_enabled lives on the users table and is a per-user feature flag that will persist over time. The Flipper feature flag is temporary, but controls the entire dark launch.
  2. We call out to the ApplicationPolicy. It’s the master policy that knows the user, the thing being authorized, what controller and action caused it, and any additional data needed.
  3. We capture any exceptions.
  4. We re-raise the exception if permissions are enabled for the user. In other words, enabling the per-user flag lets the exception pass through instead of being swallowed, so the new system actually enforces authorization for that user.
  5. Finally, we report on the captured exception later (not shown above – see the sketch below).
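The reporting piece looks roughly like this (a hypothetical sketch – @_legacy_authorized stands in for however the CanCan outcome was captured, and the logger call is illustrative):

class ApplicationController < ActionController::Base
  after_action :report_authorization_mismatch

  private

  # Compare the legacy (CanCan) outcome with the shadowed Pundit outcome and log disagreements
  def report_authorization_mismatch
    return unless Flipper.enabled?(:permission_dark_launch)

    legacy_allowed = @_legacy_authorized      # assumption: captured by the existing authorization path
    pundit_allowed = @_pundit_exception.nil?  # no captured exception means the new system said yes

    if legacy_allowed != pundit_allowed
      Rails.logger.warn(
        "[permission_dark_launch] mismatch user=#{current_user.id} " \
        "#{controller_path}##{action_name} legacy=#{legacy_allowed} pundit=#{pundit_allowed}"
      )
    end
  end
end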

The exact same thing happens for authorization in our existing system, and we also included other helpers to do similar work on the front end (for example, hiding a button based on permissions).

All in all, we were able to capture granular data about each page of the application, in situ, with almost no downtime, simply by running two discrete implementations of permissions in parallel. We started with the Pundit approach running in “dark” mode and eventually segmented our user base based on their role combination (again, with feature flags) so that we could move small subsets of users into the new implementation as we completed them.


Vetting the Work

That’s the technical piece, but probably two-thirds of the time spent on the feature went into validating that our new implementation matched the old one. We needed to carefully monitor any mismatches and re-implement the pieces that didn’t match. Ultimately, this is just testing in prod, but looking back, I’m not sure how we could have done it differently given the design constraints.

One interesting risk we identified was that actions with low throughput (read: they don’t get used a lot) are prone to false negatives. That is to say, we implement a permission that doesn’t actually match the existing RBAC ability, but because nobody exercises the action, the mismatch never shows up in our reports. When this happens, it’s a time bomb – things look fine until the first time someone tries to perform the action and gets hit with a 401. Unfortunately, that’s a fundamental issue with this approach, so bear it in mind if you implement something with a dark launch pattern.

Overall, this was a really interesting project to work on and I learned quite a lot about experimentation frameworks and how to choose an appropriate one. I don’t think I’ll use this particular one a ton in the future, but it presented unique challenges and I’m not sure of any better way to tackle the issue.

That’s all for now. Thanks for reading!
