Six months ago, Assembled supported only a handful of configurable permissions within user instances. Now, users can create as many custom roles as they need and adjust their level of access to meet their unique business requirements. The Identity and Configuration Team achieved this by developing a new internal authorization framework to replace the existing combination of rigid roles, enums, and custom logic. This article describes the steps we took to deprecate the existing system and make access control a more delightful experience for our customers.
It started with role-based permissions
Originally, Assembled came with five predefined roles: basic, standard, team lead, manager, and admin. Permissions were primarily enforced through HTTP middleware by adding handlers to different routers for simplicity. However, this setup proved inflexible for modifying permissions after they were established, leading engineers to embed custom permissions code directly into handlers and other parts of the codebase.
This ad hoc approach often mixed authorization logic with business logic, resulting in a system that was both unclear and inconsistent. The complexity of navigating this code also led to permissions being duplicated and inconsistently applied across the frontend and backend, which caused not only user-facing errors, but also potential security vulnerabilities.
The problem with role-based permissions
Five out-of-the-box roles are insufficient for large customer support organizations with diverse internal functions, including managers, leads, schedulers, and forecasters, all of whom need access to different tools. Additionally, managing access for outsourced teams from multiple vendors further complicates the process.
A lot of custom work was required to address this limitation. Each team had slightly different needs — some wanted everyone in the organization to have access to all data, while others preferred to carefully restrict each team’s access to only their own data and limit each user’s access to the specific features they needed. During an audit we conducted while scoping the project, we discovered 42 existing feature flags used to modify role behavior, with a few more added during the build process. This approach created technical debt, made the code difficult to maintain and understand, and consumed a lot of time, causing friction for both the engineering and customer teams.
Several other permissions systems had been built when feature flags weren’t enough, but none of them presented a unified configuration. In some cases, there was no user-facing configuration at all. Examples of these include:
Event type permissions allowed customers to configure roles so they could only edit certain types of events (e.g. time off, breaks) in the schedule. This could only be configured using an internal Retool app.
Event change request permissions (yes, it’s confusing) restricted which users were allowed to request changes to their schedules based on different attributes, and which roles could approve or deny event change requests.
Restricted sites prevented users from seeing data from other sites. This was mostly used to prevent outsourced teams from viewing internal data.
Each of these systems was implemented and enforced separately, and whenever a new feature was added, engineers had to think about whether it needed to interact with them (this is especially relevant for restricted sites).
This was particularly problematic because there were multiple sources of truth. Permissions logic had to be duplicated on both the client and server, and the two copies often didn’t stay in sync for very long. This caused problems such as users navigating to pages in the app that would only ever throw errors, and worse, permissions that were only enforced on the frontend. Eventually, it became difficult to know what the intended behavior was, and there was no definitive documentation that accurately described the permissions for each role. It became very difficult to debug issues, and very easy to create new ones. Most backend authorization checks happened in middleware, but there were also some additional ones buried in the application code, and once in a while, you would come across a boolean logic monstrosity and have no idea what to do with it.
Key requirements for a new permissions system
Having seen the pitfalls of the existing system, it was clear we needed a solution that could:
Perform tens to hundreds of millions of authorization checks per day without impacting site reliability or request latency.
Synchronize permissions across the client and server without duplicating logic.
Be configured in a centralized user interface. This user interface needed to be powerful enough to set up granular permissions, but intuitive enough for non-technical users to understand. Given that adjusting roles is a high-risk action that shouldn’t happen very often, and that the feature would only be used by a very small set of users, we decided that we could sacrifice brevity and efficiency for flexibility and clarity.
Support all of the existing permissions, as well as some new ones that we wanted to add but couldn’t with the previous system.
Meet several enterprise customer commitments out of the box without introducing additional custom code.
Allow us to reuse code across different applications without authorization interfering (e.g. web API server, internal CLI tools, cron jobs).
Allow engineers to easily write unit tests for code that uses the permissions system.
Developing a custom authorization framework
To address the problems with the original system, the Identity and Configuration Team set out to create a scalable and flexible authorization framework. Our solution needed to integrate seamlessly with existing code while offering improved scalability and flexibility.
We evaluated several existing solutions, including Casbin, AuthZed, Auth0, and Amazon Verified Permissions, but found they either didn’t meet all our technical requirements or were too rigid to support all of the different configurations we needed. We also wanted to avoid replicating data across multiple systems. We found inspiration in AWS IAM policies, which strike a good balance between configurability and structure, but we also wanted to maintain good UX, which AWS notoriously lacks. This led us to develop a custom solution that combines role-based access control (RBAC) and attribute-based access control (ABAC) models to offer dynamic permissions management.
On the configuration side, the core abstraction in the new system is a policy, which is nothing more than a set of rules. Each policy could be associated with a role, which would then be associated with users.
Each rule has an effect, actions, resources, and some conditions. The effect is applied if all the other fields match a given authorization request. In order to maintain strong type safety and limit the complexity of any given rule, we defined three different kinds of conditions we would support:
Reference conditions: both operands reference attributes, which are dynamic values provided when an authorization request is made. Put another way: [attribute] [operator] [attribute], for example user_id equals resource.created_by.
Right value conditions: the left operand references an attribute, and the right operand is a constant. Put another way: [attribute] [operator] [constant], for example user_teams include “fc-barcelona”.
Left value conditions: the right operand references an attribute, and the left operand is a constant. Put another way: [constant] [operator] [attribute], for example 2024-01-01 is_before resource.created_at.
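To make the three kinds of conditions concrete, here is a minimal Go sketch. Every name in it (Attributes, Operand, Condition, Attr, Const) is an assumption made for illustration rather than the framework's actual type, and only two operators are implemented.

```go
package main

import "fmt"

// Attributes are the dynamic values supplied with an authorization request.
// A flat map of strings is an assumption made to keep the sketch small.
type Attributes map[string]string

// Operand is either a reference to an attribute or a constant.
type Operand struct {
	Value      string // attribute name, or the literal when IsConstant is true
	IsConstant bool
}

func Attr(name string) Operand   { return Operand{Value: name} }
func Const(value string) Operand { return Operand{Value: value, IsConstant: true} }

// resolve returns the operand's value and whether it could be determined.
func (o Operand) resolve(attrs Attributes) (string, bool) {
	if o.IsConstant {
		return o.Value, true
	}
	v, ok := attrs[o.Value]
	return v, ok
}

// Condition covers all three kinds; which kind it is depends only on
// which of its operands are constants.
type Condition struct {
	Left     Operand
	Operator string
	Right    Operand
}

// Matches reports whether the condition holds. A referenced attribute that
// is missing from the request makes the condition fail.
func (c Condition) Matches(attrs Attributes) bool {
	left, ok := c.Left.resolve(attrs)
	if !ok {
		return false
	}
	right, ok := c.Right.resolve(attrs)
	if !ok {
		return false
	}
	switch c.Operator {
	case "equals":
		return left == right
	case "is_before":
		// ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
		return left < right
	default:
		return false
	}
}

func main() {
	attrs := Attributes{
		"user_id":             "u-1",
		"resource.created_by":  "u-1",
		"resource.created_at":  "2024-03-05",
	}
	reference := Condition{Attr("user_id"), "equals", Attr("resource.created_by")}
	rightValue := Condition{Attr("user_id"), "equals", Const("u-2")}
	leftValue := Condition{Const("2024-01-01"), "is_before", Attr("resource.created_at")}
	fmt.Println(reference.Matches(attrs), rightValue.Matches(attrs), leftValue.Matches(attrs))
}
```

Note that this simplified struct would also accept a condition with two constants. A stricter design could expose one constructor per kind so that such a condition cannot be built at all.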
Validation
To validate that this solution was expressive enough to meet our needs, we gathered a list of all existing permissions, as well as the requests that we had received from customers and prospects, and manually checked that the configuration could be modeled by a policy.
In one instance, “agents can only view their own agent scorecard” would be defined as:
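As a purely illustrative sketch, such a policy could be written as a single allow rule with one reference condition. The struct layout, the JSON field names, and identifiers such as agent_scorecard and resource.agent_id are hypothetical and not taken from Assembled's codebase.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Condition is a reference condition: both operands name attributes.
type Condition struct {
	Left     string `json:"left"`
	Operator string `json:"operator"`
	Right    string `json:"right"`
}

// Rule has an effect that applies when actions, resources, and all
// conditions match the authorization request.
type Rule struct {
	Effect     string      `json:"effect"`
	Actions    []string    `json:"actions"`
	Resources  []string    `json:"resources"`
	Conditions []Condition `json:"conditions"`
}

// Policy is nothing more than a named set of rules.
type Policy struct {
	Name  string `json:"name"`
	Rules []Rule `json:"rules"`
}

// AgentScorecardPolicy models "agents can only view their own agent
// scorecard": viewing is allowed only when the requesting user is the
// agent the scorecard belongs to. All names are assumptions.
func AgentScorecardPolicy() Policy {
	return Policy{
		Name: "view_own_agent_scorecard",
		Rules: []Rule{{
			Effect:    "allow",
			Actions:   []string{"view"},
			Resources: []string{"agent_scorecard"},
			Conditions: []Condition{
				{Left: "user_id", Operator: "equals", Right: "resource.agent_id"},
			},
		}},
	}
}

func main() {
	out, err := json.MarshalIndent(AgentScorecardPolicy(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because unmatched requests are denied by default, no explicit deny rule is needed for other agents' scorecards.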
Enforcing permissions
On the enforcement side, we wanted engineers to only need to interact with a handful of functions and basic interfaces: enforcers, resources, and actions. Our enforcer exposes two functions:
IsAuthorized: this is a hard authorization check. Given an action and a resource, this function returns an unauthorized error if the current user isn’t allowed to perform that action on that resource based on their associated policies. If no rules match the request, the action is unauthorized, and deny effects take precedence over allow effects if multiple rules match. Any uncertainty, such as attributes referenced in a rule but not provided in the request, results in that rule not being matched.
AuthorizationConditions: this function is used to find resources that a user can access or determine if they might have access to a subset of some resources. It’s also used to infer usable filters to show in the UI. If an action and resource pair is partially provided, and there are conditions that could potentially allow the action if met, but there's not enough information to fully evaluate them, those conditions are returned.
Using this interface also allowed us to use dependency injection for unit testing and to ignore authorization checks in internal tools, thus fulfilling both the code reusability and testing requirements.
In order to make it easy to use existing data structures for authorization, the enforcer consumes resource and action interfaces, defined below. This allows engineers to simply add the required functions to an existing data model and make a call to check authorization.
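A minimal sketch of what the resource and action interfaces and a hard IsAuthorized check might look like follows. The method names (ActionName, ResourceType, Attributes), the simplified Rule type, and the Scorecard model are assumptions for illustration, not the original definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// Action and Resource are the small interfaces the enforcer consumes.
type Action interface{ ActionName() string }

type Resource interface {
	ResourceType() string
	Attributes() map[string]string
}

type Effect int

const (
	Allow Effect = iota
	Deny
)

// Rule is simplified: each condition is a pair of attribute names whose
// values must be equal.
type Rule struct {
	Effect     Effect
	Action     string
	Resource   string
	Conditions [][2]string
}

var ErrUnauthorized = errors.New("unauthorized")

type Enforcer struct {
	UserAttrs map[string]string
	Rules     []Rule
}

func (e *Enforcer) matches(r Rule, a Action, res Resource) bool {
	if r.Action != a.ActionName() || r.Resource != res.ResourceType() {
		return false
	}
	attrs := map[string]string{}
	for k, v := range e.UserAttrs {
		attrs[k] = v
	}
	for k, v := range res.Attributes() {
		attrs[k] = v
	}
	for _, c := range r.Conditions {
		left, okLeft := attrs[c[0]]
		right, okRight := attrs[c[1]]
		if !okLeft || !okRight || left != right {
			return false // a missing attribute never matches
		}
	}
	return true
}

// IsAuthorized checks deny rules first so it can return early, then looks
// for a matching allow rule. With no match, the request is denied.
func (e *Enforcer) IsAuthorized(a Action, res Resource) error {
	for _, r := range e.Rules {
		if r.Effect == Deny && e.matches(r, a, res) {
			return ErrUnauthorized
		}
	}
	for _, r := range e.Rules {
		if r.Effect == Allow && e.matches(r, a, res) {
			return nil
		}
	}
	return ErrUnauthorized
}

// Scorecard stands in for an existing data model that gains the two
// Resource methods.
type Scorecard struct{ AgentID string }

func (Scorecard) ResourceType() string { return "agent_scorecard" }
func (s Scorecard) Attributes() map[string]string {
	return map[string]string{"resource.agent_id": s.AgentID}
}

type View struct{}

func (View) ActionName() string { return "view" }

func main() {
	e := &Enforcer{
		UserAttrs: map[string]string{"user_id": "agent-1"},
		Rules: []Rule{{
			Effect: Allow, Action: "view", Resource: "agent_scorecard",
			Conditions: [][2]string{{"user_id", "resource.agent_id"}},
		}},
	}
	fmt.Println(e.IsAuthorized(View{}, Scorecard{AgentID: "agent-1"})) // own scorecard
	fmt.Println(e.IsAuthorized(View{}, Scorecard{AgentID: "agent-2"})) // someone else's
}
```

If callers depend on an interface with the same IsAuthorized method instead of the concrete type, unit tests and internal tools can substitute an implementation that always returns nil.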
Client-side authorization
As much as being able to enforce permissions is crucial, we also needed to provide a good experience to our users. Simply returning a 403 error when someone attempts an unauthorized action isn’t sufficient. Ideally, users never even attempt that action in the first place. To that end, we introduced two custom React hooks to check permissions and disable frontend elements. Under the hood, these hooks simply make a request to the backend, which uses the enforcer.
This isn’t as simple as just a wrapper around the IsAuthorized function though. We observed three key differences between checking authorization on the client and server side:
The client occasionally doesn’t know all of the information (attributes) that it needs to provide in order to definitively determine if a request is authorized.
A common client-side use case is: "If the user can perform this action on any resources with these characteristics, display a navigation item or table component; otherwise, show an error message.”
A false positive (something that’s allowed that shouldn’t be) on the client side isn’t dangerous; it’s just bad UX, since the server will ultimately reject the request later. In fact, we would rather have a false positive than a false negative (which definitely isn’t the case on the server side), since a false negative would prematurely block an action that should be allowed.
Based on this understanding, we designed the handler for these hooks to work differently from simply calling IsAuthorized. Instead, we call AuthorizationConditions, and if the action could potentially be authorized (any conditions are returned) based on the partial information provided, we inform the client that the request is authorized. This approach allows us to hide and disable components that the user clearly doesn’t have access to, while keeping those they might be able to use accessible.
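The lenient client-side check described above could be sketched as follows. This is an assumption-laden illustration: conditions are reduced to "attribute equals constant", the return shape of AuthorizationConditions (a boolean plus pending conditions) is invented for the sketch, and ClientCheck is a hypothetical name for the logic behind the hooks' handler.

```go
package main

import "fmt"

type Effect int

const (
	Allow Effect = iota
	Deny
)

// Condition is "attribute equals constant" to keep the sketch short.
type Condition struct{ Attribute, Value string }

type Rule struct {
	Effect     Effect
	Action     string
	Resource   string
	Conditions []Condition
}

type outcome int

const (
	fails outcome = iota
	holds
	undecided
)

// evaluate classifies a rule against partial attributes and returns the
// conditions that could not be evaluated yet.
func evaluate(r Rule, attrs map[string]string) (outcome, []Condition) {
	var pending []Condition
	for _, c := range r.Conditions {
		got, known := attrs[c.Attribute]
		if !known {
			pending = append(pending, c)
		} else if got != c.Value {
			return fails, nil
		}
	}
	if len(pending) > 0 {
		return undecided, pending
	}
	return holds, nil
}

// AuthorizationConditions reports whether the request is already allowed
// and which conditions would still have to hold for it to be allowed.
// Only a deny rule that definitely matches blocks the request here.
func AuthorizationConditions(rules []Rule, action, resource string, attrs map[string]string) (bool, []Condition) {
	for _, r := range rules {
		if r.Effect != Deny || r.Action != action || r.Resource != resource {
			continue
		}
		if result, _ := evaluate(r, attrs); result == holds {
			return false, nil
		}
	}
	allowed := false
	var pending []Condition
	for _, r := range rules {
		if r.Effect != Allow || r.Action != action || r.Resource != resource {
			continue
		}
		switch result, conds := evaluate(r, attrs); result {
		case holds:
			allowed = true
		case undecided:
			pending = append(pending, conds...)
		}
	}
	return allowed, pending
}

// ClientCheck prefers false positives: anything that might be allowed is
// reported as authorized, and the server makes the final decision later.
func ClientCheck(rules []Rule, action, resource string, attrs map[string]string) bool {
	allowed, pending := AuthorizationConditions(rules, action, resource, attrs)
	return allowed || len(pending) > 0
}

func main() {
	rules := []Rule{{
		Effect: Allow, Action: "edit", Resource: "schedule",
		Conditions: []Condition{{"resource.site", "madrid"}},
	}}
	fmt.Println(ClientCheck(rules, "edit", "schedule", map[string]string{}))
	fmt.Println(ClientCheck(rules, "edit", "schedule", map[string]string{"resource.site": "lisbon"}))
}
```

In the first call the site is unknown, so the navigation item would stay visible. In the second call the condition definitely fails, so the element can be hidden or disabled.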
Configuring permissions
When we initially designed our solution, we planned to store all policies and conditions in the database, and somehow manage all of that configuration via a UI. We called these fully-configurable policies “custom policies.” However, we quickly realized that most permissions don’t need to be that complicated, and that it should be as simple as changing which predefined (“managed”) policies are added to a role. This approach had several advantages:
Managed policies can be defined and maintained in code, which guarantees that they are consistent across accounts and makes them easy to update and debug.
Getting managed policies for a given user is cheaper because only the associations between users and roles, and roles and managed policies, are stored in the database. This approach avoids the need to store entire policies in the database, as would be required with custom policies.
Based on this approach, we designed our React.js UI to consume a configuration that contains “sections” and “permissions” to generate a role editor with a series of radio groups and checkboxes. The configuration for a single permission might look something like this:
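As a hypothetical illustration of a sections-and-permissions configuration, the Go sketch below declares one radio-group permission and resolves a selection into managed policy names. The real configuration is consumed by the React UI and its format is not shown in this article, so every type, field, and policy name here is an assumption.

```go
package main

import "fmt"

// Option is one choice within a permission.
type Option struct {
	Value    string
	Label    string
	Policies []string // managed policies granted when this option is selected
}

// Permission renders as a radio group or a checkbox in the role editor.
type Permission struct {
	Key     string
	Label   string
	Control string // "radio" or "checkbox"
	Options []Option
}

// Section groups related permissions under one title.
type Section struct {
	Title       string
	Permissions []Permission
}

// Scheduling is an example section with a single permission.
var Scheduling = Section{
	Title: "Scheduling",
	Permissions: []Permission{{
		Key: "schedule_events", Label: "Schedule events", Control: "radio",
		Options: []Option{
			{Value: "none", Label: "No access"},
			{Value: "view", Label: "View only", Policies: []string{"schedule_view"}},
			{Value: "edit", Label: "View and edit", Policies: []string{"schedule_view", "schedule_edit"}},
		},
	}},
}

// PoliciesFor translates the selected option of each permission
// (permission key -> option value) into managed policy names for a role.
func PoliciesFor(s Section, selected map[string]string) []string {
	var policies []string
	for _, p := range s.Permissions {
		for _, o := range p.Options {
			if o.Value == selected[p.Key] {
				policies = append(policies, o.Policies...)
			}
		}
	}
	return policies
}

func main() {
	fmt.Println(PoliciesFor(Scheduling, map[string]string{"schedule_events": "edit"}))
}
```

With a structure like this, exposing a new permission amounts to adding one entry to a section.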
Designing our UI this way made it very easy for engineers to expose simple permission configurations to users without worrying about the logic that happens behind the scenes to update a role’s policies or render React components. Most of the time they just had to add an enum and a couple of lines of configuration.
Note: This explanation doesn’t do justice to the work our designer put into the UI, but since this is an engineering blog, we won’t get into that here.
More configurability without breaking the UI
Managed policies couldn’t do everything we wanted though, which is why we came up with the concept of “template conditions.” Template conditions are a middle ground between the rigidity of managed policies and the complexity of custom policies. A template condition is essentially a right or left value condition (reminder: condition where one of the operands is a constant), where instead of the constant being defined by an engineer in code, it is configurable by users and stored in the database.
This approach was particularly useful for permissions like "users with this role can assign the following roles," as it allowed most of the policy to remain in the code while enabling configuration specifically for the condition.
We then exposed reusable components for displaying these template values in the UI so engineers could easily add more.
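A template condition could be sketched as a managed policy defined in code whose constant is looked up at evaluation time. In the sketch below, the TemplateValues map stands in for values stored in the database, and all type, field, and key names are assumptions.

```go
package main

import "fmt"

// TemplateCondition is a right value condition of the form
// [attribute] in [values], where the values are configured by users.
type TemplateCondition struct {
	Attribute   string
	TemplateKey string // key under which the configured values are stored
}

// ManagedPolicy stays in code; only the template values vary per role.
type ManagedPolicy struct {
	Name     string
	Action   string
	Resource string
	Template TemplateCondition
}

// AssignRoles models "users with this role can assign the following roles".
var AssignRoles = ManagedPolicy{
	Name:     "assign_roles",
	Action:   "assign",
	Resource: "role",
	Template: TemplateCondition{Attribute: "resource.role_name", TemplateKey: "assignable_roles"},
}

// TemplateValues stands in for the per-role values stored in the database.
type TemplateValues map[string][]string

// Allows fills the template with the stored values and evaluates it.
func (p ManagedPolicy) Allows(action, resource string, attrs map[string]string, values TemplateValues) bool {
	if action != p.Action || resource != p.Resource {
		return false
	}
	got, ok := attrs[p.Template.Attribute]
	if !ok {
		return false // a missing attribute never matches
	}
	for _, v := range values[p.Template.TemplateKey] {
		if v == got {
			return true
		}
	}
	return false
}

func main() {
	// Values a customer might have configured for a "team lead" role.
	teamLead := TemplateValues{"assignable_roles": {"basic", "standard"}}
	fmt.Println(AssignRoles.Allows("assign", "role", map[string]string{"resource.role_name": "standard"}, teamLead))
	fmt.Println(AssignRoles.Allows("assign", "role", map[string]string{"resource.role_name": "admin"}, teamLead))
}
```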
Optimizations
As we implemented the new system, we made a couple of important optimizations to ensure that authorization remained fast and easy to use:
Since rules with deny effects take precedence over allow effects, we can evaluate them first and return as soon as we find a matching deny rule, thus avoiding having to check every rule for each request.
We encourage compound authorization (a single user action requiring multiple individual authorization checks) because it produces a simpler action/resource model, which means that there are often multiple authorization requests per HTTP request. Of all of the steps for authorizing requests, getting the policy information that’s stored in the database is always going to be the most expensive: we’ve measured that fetching the stored data takes, on average, 150 milliseconds, while actually checking authorization takes only 7 microseconds. Therefore, we front-load this work by loading policies into memory once per HTTP request, and then using the pre-fetched data for all subsequent authorization requests. This ensures that the cost of our system scales with the number of requests (a proxy for users), and not the number of authorization checks, which is higher. This also allowed us to validate that we could support the scale of a full rollout before investing in migrating every permission.
Because the enforcer is going to be used pretty much everywhere in the codebase, we added it to the Go context. This has the advantage of being easy to do: simply add middleware that updates the request context in the web server, and for other commands, update the context when it’s initialized. It also means that engineers don’t have to worry about passing it around everywhere, which is particularly annoying when adding a new dependency that pretty much everything requires. There’s a tradeoff in type safety here, as there’s no guarantee that the enforcer will be present in every context. However, because this is such a critical dependency, we weren’t overly concerned about these issues going unnoticed during testing. To be extra cautious, we added monitoring and alerting.
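The last two optimizations, loading policies once per HTTP request and carrying the enforcer in the Go context, might be combined roughly as in the sketch below. The Enforcer is reduced to a list of allowed action names, the X-User-ID header is a stand-in for real session handling, and LoadPolicies, EnforcerMiddleware, and Serve are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Enforcer is reduced to a list of allowed action names for this sketch.
type Enforcer struct{ Policies []string }

func (e *Enforcer) IsAuthorized(action string) error {
	for _, p := range e.Policies {
		if p == action {
			return nil
		}
	}
	return errors.New("unauthorized")
}

type enforcerKey struct{}

func WithEnforcer(ctx context.Context, e *Enforcer) context.Context {
	return context.WithValue(ctx, enforcerKey{}, e)
}

// EnforcerFrom returns an error rather than panicking, because the compiler
// cannot guarantee that the enforcer is present in every context.
func EnforcerFrom(ctx context.Context) (*Enforcer, error) {
	e, ok := ctx.Value(enforcerKey{}).(*Enforcer)
	if !ok || e == nil {
		return nil, errors.New("no enforcer in context")
	}
	return e, nil
}

// LoadPolicies stands in for the expensive database fetch.
type LoadPolicies func(userID string) ([]string, error)

// EnforcerMiddleware loads policies once and stores the enforcer in the
// request context for every later authorization check.
func EnforcerMiddleware(load LoadPolicies, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		policies, err := load(r.Header.Get("X-User-ID"))
		if err != nil {
			http.Error(w, "could not load policies", http.StatusInternalServerError)
			return
		}
		ctx := WithEnforcer(r.Context(), &Enforcer{Policies: policies})
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// Serve sends one request through the middleware to a handler that checks
// each action, and returns the status code and the number of policy loads.
func Serve(actions ...string) (status int, loads int) {
	load := func(string) ([]string, error) {
		loads++
		return []string{"schedule:view", "schedule:edit"}, nil
	}
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		e, err := EnforcerFrom(r.Context())
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		for _, a := range actions {
			if err := e.IsAuthorized(a); err != nil {
				http.Error(w, err.Error(), http.StatusForbidden)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/schedule", nil)
	EnforcerMiddleware(load, handler).ServeHTTP(rec, req)
	return rec.Code, loads
}

func main() {
	status, loads := Serve("schedule:view", "schedule:edit")
	fmt.Println(status, loads) // two checks, one policy load
}
```

For CLI tools and cron jobs, the same WithEnforcer call can be made once when the command's context is initialized.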
Rolling out the new solution
Finding a solution that met our requirements was only part of the challenge. We also needed to migrate hundreds of existing permissions to the new system, which couldn’t be done all at once. This meant we had to support both the old and new systems simultaneously. To manage this, we began using the term “v1 role” for the five existing hardcoded roles and “v2 role” for our new custom roles.
We added a “legacy_role” field to our v2 roles, mapping them to the corresponding v1 roles. Every account was populated with five v2 roles that mapped 1:1 to the old roles, and we rewrote all role assignment and display logic to use v2 roles while preserving the old behavior through the legacy role field. To existing users, it appeared as though nothing had changed, but behind the scenes, we were prepared to start enforcing permissions based on v2 roles. Additionally, if necessary, we could manually create additional roles in customer accounts.
From here we observed that the fastest, safest way to roll out new permissions was the following simple procedure:
Find a permission that’s being gated by v1 roles and write a policy to represent the same permission.
Write a migration to backfill that permission for existing roles. In some cases this was easy: if the permission was granted to all v1 roles above standard, you could assign the policy to all v2 roles with a legacy role higher than standard. In other cases there were feature flags or other custom code involved, and engineers had to write more complicated logic to assign the policy to the correct roles.
Now that you’ve checked that all the right roles have the new policy, you can add code to enforce the permission using the new system on the server and the client.
Once the new and old systems have identical behavior, you can remove the old code and make the permission available for configuration in the UI.
The plan was simple: using this setup, incrementally migrate the existing permissions to the new framework, and once nothing is using v1 roles anymore, drop the legacy_role column.
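The simple case of the backfill step, granting a policy to every v2 role whose legacy role ranks above a given v1 role, could look roughly like the sketch below. It works on an in-memory slice instead of a database, assumes the v1 roles are ordered as listed earlier (basic, standard, team lead, manager, admin), and uses hypothetical names throughout.

```go
package main

import "fmt"

// LegacyRole orders the five v1 roles from least to most privileged.
type LegacyRole int

const (
	Basic LegacyRole = iota
	Standard
	TeamLead
	Manager
	Admin
)

// V2Role is a custom role that remembers which v1 role it maps to.
type V2Role struct {
	Name       string
	LegacyRole LegacyRole
	Policies   []string
}

func hasPolicy(r V2Role, policy string) bool {
	for _, p := range r.Policies {
		if p == policy {
			return true
		}
	}
	return false
}

// Backfill adds policy to every role whose legacy role ranks above the
// given one and returns how many roles changed. Running it twice is safe
// because roles that already have the policy are skipped.
func Backfill(roles []V2Role, policy string, above LegacyRole) int {
	changed := 0
	for i := range roles {
		if roles[i].LegacyRole > above && !hasPolicy(roles[i], policy) {
			roles[i].Policies = append(roles[i].Policies, policy)
			changed++
		}
	}
	return changed
}

func main() {
	roles := []V2Role{
		{Name: "Basic", LegacyRole: Basic},
		{Name: "Standard", LegacyRole: Standard},
		{Name: "Team lead", LegacyRole: TeamLead},
		{Name: "Manager", LegacyRole: Manager},
		{Name: "Admin", LegacyRole: Admin},
	}
	fmt.Println(Backfill(roles, "schedule_edit", Standard))
	fmt.Println(Backfill(roles, "schedule_edit", Standard)) // second run changes nothing
}
```

Permissions gated by feature flags or custom code would need their own selection logic in place of the simple rank comparison.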
What we learned along the way
We’ve come a long way from the five out-of-the-box roles we had six months ago: users are creating custom roles and editing permissions themselves, and engineers are adding permissions that will scale with the product. Here are a few things that we learned along the way:
We could have avoided work by not overbuilding at the beginning. Policies as a concept were definitely the right choice for us, but we never actually ended up using custom policies for anything. Had we started migrating permissions earlier, and built the simplest systems first, we might not have wasted time on tools that we never needed. There might be a use case for even more flexibility in the future, but only time will tell.
Migrating legacy permissions takes a lot of time, and in many ways was harder than building the policy engine itself. It was worth doing more building upfront to make life easier for the engineers using the framework to implement permissions. This should be accounted for in build/buy decisions, as well as in any technical design.
It was essential for us to find a way to roll out incrementally, and eventually launch the feature to customers without migrating every single permission. Had we blocked on fully removing v1 roles, we could easily have spent countless more months doing grunt work without demonstrating very much value. Instead, we found a clever product solution to reconcile the concepts of v1 and v2 roles for users. The incremental rollout also allowed us to stop the legacy permissions bleeding earlier, and satisfy several customer commitments out of the box before reaching general availability.
We’re hiring!
If you’re excited about these kinds of problems and want to apply cutting-edge techniques to solving customer support challenges, check out our open roles — we’re hiring!