June 21st, 2023

A little over 6 months ago I wrote a post about our dark launch pattern, in which I replaced our authorization system with a bespoke one I designed in-house. To eventually rip out CanCan, I ran CanCan and our home-grown Pundit system side by side in production, with a slew of feature flags to orchestrate the entire swap.

If you haven’t read the article, I encourage you to do so now as the remainder won’t make much sense without this context.

While the endeavor was a resounding success, there are some lessons learned from this approach. Some are specific to the use case (authorization), but hopefully some are generic enough to be useful outside this context.

What Went Well

We completed the work in about 3 months, give or take.

If you remember from the original post, it would have been very difficult to even begin to estimate the level of effort for such a lift in our old system. Three months feels relatively quick given the huge number of unknowns and the complexity of the system that needed to be replaced.

Case in point: I gave an example role matrix in the original article where I defined 3 roles with abilities that add or subtract in random orders. This matrix is wrong. I missed the fact that the combination [superadmin, role_1, role_2] has view and destroy permissions, not just view!

Almost no downtime.

I can’t say that there was no downtime, but when we did have issues, we were always able to roll forward rather than revert or shut off the feature we were building.

High reliability.

The system works as intended nearly 100% of the time. I chalk this up to vetting it against a known good implementation in production for 3 months before calling it “done”.
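To make "vetting it against a known good implementation" concrete, here is a minimal sketch of a shadow comparison. It assumes the two systems can each be wrapped as a callable. Every name in it (`ShadowAuthorizer`, `LEGACY_CHECK`, `V2_CHECK`, the user hash layout) is hypothetical and not taken from our codebase.

```ruby
# Hypothetical stand-ins for the two implementations (illustrative only).
LEGACY_CHECK = ->(user, action) { user[:legacy_abilities].include?(action) }
V2_CHECK     = ->(user, action) { user[:v2_abilities].include?(action) }

# Runs both checks, enforces the legacy answer, and records disagreements.
class ShadowAuthorizer
  attr_reader :mismatches

  def initialize(legacy:, candidate:)
    @legacy = legacy
    @candidate = candidate
    @mismatches = []
  end

  def allowed?(user, action)
    expected = @legacy.call(user, action)
    begin
      actual = @candidate.call(user, action)
      unless actual == expected
        @mismatches << { user: user[:id], action: action,
                         expected: expected, actual: actual }
      end
    rescue StandardError => e
      # A crash in the candidate must never affect the real request.
      @mismatches << { user: user[:id], action: action,
                       expected: expected, error: e.class.name }
    end
    expected
  end
end

authorizer = ShadowAuthorizer.new(legacy: LEGACY_CHECK, candidate: V2_CHECK)
user = { id: 1, legacy_abilities: %i[view destroy], v2_abilities: %i[view] }

authorizer.allowed?(user, :view)    # both systems agree, nothing recorded
authorizer.allowed?(user, :destroy) # legacy allows, V2 denies: recorded
puts authorizer.mismatches.inspect
```

The known good system always decides the outcome. The new one can only produce a mismatch record, which is what makes it safe to run against production traffic for months.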

It’s easy to prove it’s reliable.

This is a combination of all the above, really, but it’s not something I foresaw as a huge feature when developing.

Basically, when you announce a feature this big, and developers are used to the old system being clunky and hard to use, one thing you’re trying to do is to build confidence that your system works. Occasionally, to this day (almost a year after I finished implementation), I still get asked if bugs are caused by my permission system. Of course, I give this a fair shake, but the answer has always been “no”.

I can prove it to you – just turn the V2 system off and watch your bug continue to occur. If you still have trouble, let’s talk. I’ve never had anyone take me up on that offer, though.
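As a rough illustration of that offer, the sketch below routes a check through V1 or V2 based on a single flag and compares the outcome with the flag on and off. The `Flags` class, the `:permissions_v2` flag name and the `authorized?` helper are assumptions made for the example, not our real flag service.

```ruby
# Hypothetical in-memory flag store; a real app would use its own flag service.
class Flags
  def initialize(states = {})
    @states = states
  end

  def enabled?(name)
    @states.fetch(name, false)
  end

  def disable(name)
    @states[name] = false
  end
end

# Routes the check to V2 only when the (assumed) :permissions_v2 flag is on.
def authorized?(flags, user, action, v1:, v2:)
  checker = flags.enabled?(:permissions_v2) ? v2 : v1
  checker.call(user, action)
end

v1 = ->(user, action) { user[:abilities].include?(action) }
v2 = ->(user, action) { user[:abilities].include?(action) }

flags = Flags.new(permissions_v2: true)
user = { abilities: %i[view] }

with_v2 = authorized?(flags, user, :destroy, v1: v1, v2: v2)
flags.disable(:permissions_v2)
without_v2 = authorized?(flags, user, :destroy, v1: v1, v2: v2)

# If the outcome is identical with V2 off, the new system is not the cause.
puts "same behaviour with V2 off: #{with_v2 == without_v2}"
```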

What Went Wrong

Migrating prod is “easy”. Migrating other environments, not so much.

When I set out to build a new authorization system, everyone knew it was a big ask, but there was huge demand for it. When it was done, I got high fives all around, and then it was off to the next feature.

But wait! I wasn’t done! I’d worked quickly and built out a version that worked in prod, but I still had to tack on the pieces to maintain it in dev environments.

This was a huge undertaking, and it’s still ongoing almost 12 months later. Part of the problem is that there were a lot of feature flags, so it was easier for devs to simply leave most of them off, since by design the code is intended to run two separate versions of authorization.

The other problem was…

There was a huge delay in ripping out the dark launch code.

Again, because we can’t break the system, it was designed to fall back to the Version 1 implementation if it had to. But because it was forward compatible with Version 2 at the same time… There’s no impetus to rip it out.
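A minimal sketch of that fallback shape, with one addition we did not have at the time: a counter of how often the fallback actually fires. The `FallbackAuthorizer` class, the `V2Unavailable` error and the user hash layout are all hypothetical.

```ruby
# Raised by the (hypothetical) V2 check when it cannot answer for a user.
class V2Unavailable < StandardError; end

# Prefers V2, falls back to V1, and counts how often the fallback fires.
class FallbackAuthorizer
  attr_reader :fallback_count

  def initialize(v1:, v2:)
    @v1 = v1
    @v2 = v2
    @fallback_count = 0
  end

  def allowed?(user, action)
    @v2.call(user, action)
  rescue V2Unavailable
    @fallback_count += 1
    @v1.call(user, action)
  end
end

v1 = ->(user, action) { user[:v1_abilities].include?(action) }
v2 = lambda do |user, action|
  raise V2Unavailable, "user not migrated" unless user[:v2_enabled]

  user[:v2_abilities].include?(action)
end

auth = FallbackAuthorizer.new(v1: v1, v2: v2)
migrated   = { v2_enabled: true,  v1_abilities: [],        v2_abilities: %i[view] }
unmigrated = { v2_enabled: false, v1_abilities: %i[view], v2_abilities: [] }

auth.allowed?(migrated, :view)   # answered by V2
auth.allowed?(unmigrated, :view) # answered by V1, counted as a fallback
puts "fallbacks so far: #{auth.fallback_count}"
```

A count that stays at zero for long enough is a concrete argument for deleting the V1 path, which is exactly the impetus that was missing.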

Like the previous example, this is still ongoing (but wrapping up!).

It broke a lower environment that I didn’t consider.

I didn’t catch this problem until it was too late, and the team was well into the final stages of the refactor. The two authorization implementations had diverged by this point, as they naturally would, and that broke a lab environment that was in use outside the scope of our team.

By this point, there was so much that needed to be done to get the application healthy – seed the authorization database, flip a dozen feature flags, mark all the users as V2 permissions enabled in the auth DB, etc. Basically, we needed to re-do all the little incremental steps we had taken in production all at once (thank god for good documentation of this).
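In hindsight, those steps could have lived in one repeatable script from the start. The sketch below is only an illustration of that idea: `Environment`, `bootstrap_v2!` and the flag names are invented for the example, and the role names are borrowed from the matrix in the original article.

```ruby
# Hypothetical model of one environment's authorization state.
Environment = Struct.new(:name, :roles, :flags, :users)

# Illustrative flag and role names, not our real ones.
V2_FLAGS   = %i[permissions_v2_read permissions_v2_write].freeze
SEED_ROLES = %w[superadmin role_1 role_2].freeze

# Applies every migration step in order; safe to run more than once.
def bootstrap_v2!(env)
  # 1. Seed the authorization data without duplicating existing roles.
  env.roles |= SEED_ROLES
  # 2. Flip every V2 feature flag on.
  V2_FLAGS.each { |flag| env.flags[flag] = true }
  # 3. Mark all users as V2-enabled.
  env.users.each { |user| user[:v2_enabled] = true }
  env
end

lab = Environment.new("lab", ["superadmin"], {}, [{ id: 1, v2_enabled: false }])
bootstrap_v2!(lab)
puts lab.roles.inspect
puts lab.flags.inspect
```

Keeping each step idempotent means the same script can be pointed at dev, lab, or a fresh environment without first checking which incremental steps have already been applied.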

There was no early detection of this issue, again because of the double implementation approach.
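One possible form of early detection is to diff each environment's flag state against production on a schedule. This is a sketch under the assumption that flag states can be exported as simple hashes; `flag_drift` and the flag names are hypothetical.

```ruby
# Returns the flags whose state differs between two environments.
# Flags missing from either side are treated as off.
def flag_drift(reference, candidate)
  (reference.keys | candidate.keys).each_with_object({}) do |flag, drift|
    ref  = reference.fetch(flag, false)
    cand = candidate.fetch(flag, false)
    drift[flag] = { reference: ref, candidate: cand } unless ref == cand
  end
end

prod_flags = { permissions_v2_read: true, permissions_v2_write: true }
lab_flags  = { permissions_v2_read: true }

drift = flag_drift(prod_flags, lab_flags)
puts drift.empty? ? "lab matches prod" : "lab has drifted: #{drift.inspect}"
```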

CanCan performed a second function in our application, delaying its removal.

While the permissions refactor was highly successful, it turns out that CanCan was also being used to load records from the DB on almost every controller action. Its implementation of this functionality was as obtuse as its authorization strategy, in my opinion.

To fully get rid of CanCan, we needed to replace that record loading as well, but yet again, it would be dangerous to do so without a second dark launch strategy. This effort is ongoing and expected to be completed soon.

Application performance monitoring tools are not the best choice for this task.

Our application uses Sentry for our APM. For one thing, in most large applications error reporting is sampled (e.g. for 10 errors at a sample rate of 10%, you’d expect only about 1 report).

Secondly, and as was called out as a risk in the previous article, things that don’t cause errors immediately don’t get reported. Low-traffic endpoints don’t pose a problem until weeks or months down the road.
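The sampling arithmetic is worth spelling out. The sketch below assumes each error is sampled independently at the configured rate, which is a simplification; the helper names are made up for the example.

```ruby
# Expected number of reports when each error is kept with probability sample_rate.
def expected_reports(errors, sample_rate)
  errors * sample_rate
end

# Probability that at least one of the errors is reported at all.
def chance_of_any_report(errors, sample_rate)
  1.0 - (1.0 - sample_rate)**errors
end

puts expected_reports(10, 0.1)              # 1.0 report expected
puts chance_of_any_report(10, 0.1).round(3) # 0.651
```

So with 10 errors at a 10% sample rate, there is roughly a one in three chance of seeing no report at all. A low-traffic endpoint that produces only a handful of mismatches can easily go unnoticed.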

I hope this was helpful. Our biggest win was being able to quickly iterate on authorization functionality with high confidence. The team implemented the feature with low downtime and in a way that gave other devs assurance that the refactor wasn’t interfering with their workflows.

However, on the flip side, once the dark launch was done, I had a hard time fighting for the time to build the tooling needed for lower environments. The fact that two implementations can work in parallel at any given time meant that there was no real drive to remove the deprecated version, even though it led to slow uptake in dev envs and issues with config mismatches in others.

That’s all for now. Thanks for reading!

ty-porter

Launching Dark (Postmortem)
Revisiting our dark launch pattern with lessons learned.
