AWS Multi-Account setup: It's complicated

I think it's about 5 years since I started using AWS "for realz". We were gearing up to migrate a bunch of stuff from on-prem to cloud and we planned to use Azure - we were a Windows/.NET shop after all. But. There were some issues. So just for fun, we took an app that we were having serious difficulties running in Azure, and refactored it to run on AWS Lambda using Kinesis Firehose and S3 (it was a kind of event ingestion app), and it was rock solid from day one (as far as I know, it still hasn't missed a beat). As we learnt about the mature client SDKs in all the languages we needed, the complete documentation and the robust (and actually understandable) VPC networking, it just clicked. We did a 180, scrapped Azure and went full steam ahead on AWS. And we mostly never looked back.

There was one thing we sorely missed tho: Resource Groups. Both Azure and GCP have the ability to separate an account (or subscription if you will) into "buckets" or "folders" called resource groups, and to use these "containers" as permission boundaries. Bob can do anything he wants, as long as he does it in the `Backend-team-prod` resource group. And so on.

AWS doesn't have any of this, and the recommendation is to simply separate everything into multiple AWS accounts. One account for `Backend-team-prod`, and one account for `Machine-learning-geeking-out` and so on. There are, however, some fairly serious pain-points with this multi-account approach - and I would argue that it especially hurts small-to-medium organizations:

Problem 1: Stuff generally can't be moved between accounts: Let's say you're building an app, but since your org is super-duper-agile, it's decided that the dev teams are reshuffled, meaning that "app ownership" will change. Now you're stuck pondering how to move the data in S3 buckets and DynamoDB tables from the old "owning team"'s account to the new one. It's a yuge pain, and I'm pretty sure many orgs' AWS account layouts represent a previous version of the team structure instead of the current one. Moving stuff around is simply too painful.

Problem 2: Where do we put stuff? Your org probably only has one Datadog account, and one Github "tenant". Where do you put your container images and your code artifacts? I have seen a real reluctance to use "developer-facing" AWS services because honestly we struggle to determine where to place stuff. For container images and code artifacts, the org I'm currently working with is using a "shared" account in an attempt to centralize built artifacts. However, it means that we have to maintain custom lambda functions and lots of other pieces of logic to tie together the various accounts so that build/publish/pull permissions line up. It's a lot of work to set up, and difficult to get right.
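To give a feel for the "permissions line up" part, here is a minimal sketch of the kind of ECR repository policy a shared artifacts account ends up generating so that other accounts can pull images. It is an illustration, not the setup described above: the function name `build_pull_policy` and the account IDs are made up, and a real setup also needs matching IAM permissions (including `ecr:GetAuthorizationToken`) on the consuming side.

```python
import json

# Read-only ECR actions a consumer needs in order to pull an image.
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def build_pull_policy(consumer_account_ids):
    """Return an ECR repository policy that lets the given accounts pull images.

    `consumer_account_ids` is a hypothetical list of 12-digit account IDs.
    """
    # Sorting keeps the generated document stable between runs.
    principals = [
        f"arn:aws:iam::{account_id}:root"
        for account_id in sorted(consumer_account_ids)
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountPull",
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": PULL_ACTIONS,
            }
        ],
    }


if __name__ == "__main__":
    # Made-up account IDs, for illustration only.
    policy = build_pull_policy(["222222222222", "111111111111"])
    print(json.dumps(policy, indent=2))
```

A document like this has to be attached to every repository and regenerated whenever an account is added or removed, which is exactly the sort of glue that ends up living in custom lambda functions.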

Problem 3: AWS SSO and roles. Developers access AWS accounts not through group memberships, but through "roles". AWS recently released "AWS SSO", a service that (among other things) simplifies multi-account logins, especially for orgs using external identity services such as Azure AD, G Suite or similar. AWS SSO lets you establish user groups, and each group can be tied to a certain IAM role in a certain AWS account. Problem is, AWS SSO does _not_ let you stay logged in to multiple accounts at the same time in the same browser, and the login process itself is very slow. If a user happens to have access to multiple roles, he or she needs to choose which one to use before accessing the AWS account. It's almost unbearably complicated compared to the other two major cloud providers.

In my previous job, we found the whole "role system" so unfriendly to devs that we actually built our own: We wrote a "sync" engine to provision user accounts for developers into our various AWS Accounts, and a simple frontend where users could reset their AWS credentials for all AWS accounts in one fell swoop. The same solution would invalidate access keys (used for accessing AWS stuff on the command-line or thru SDKs) every night, so that we never had long-living access keys used by humans. This actually worked quite well and had the added benefit that you could use multiple AWS accounts in a single browser. The fact that AWS' own service is so user-unfriendly that we actually made the decision to build this ourselves is, well, unfortunate.
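The nightly key invalidation is the easiest part of such a setup to illustrate. The following is a rough sketch and not the code we actually ran: the function name `expire_access_keys`, the 24-hour threshold and the choice to delete keys (rather than merely deactivate them) are all assumptions. The `iam` argument is expected to behave like a boto3 IAM client, of which only `list_access_keys` and `delete_access_key` are used.

```python
from datetime import datetime, timedelta, timezone


def expire_access_keys(iam, user_names, max_age=timedelta(hours=24), now=None):
    """Delete access keys older than `max_age` and return the deleted key IDs.

    `iam` is assumed to be a boto3 IAM client (or anything with the same
    `list_access_keys` / `delete_access_key` methods).
    """
    now = now or datetime.now(timezone.utc)
    deleted = []
    for user in user_names:
        # The response carries metadata only: key ID, status, creation date.
        # An IAM user can hold at most two keys, so pagination is skipped here.
        keys = iam.list_access_keys(UserName=user)["AccessKeyMetadata"]
        for key in keys:
            if now - key["CreateDate"] >= max_age:
                iam.delete_access_key(
                    UserName=user, AccessKeyId=key["AccessKeyId"]
                )
                deleted.append(key["AccessKeyId"])
    return deleted


# Possible usage from a scheduled job (requires boto3 and credentials):
#   import boto3
#   expire_access_keys(boto3.client("iam"), ["bob", "alice"])
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, and the same job would have to run once per AWS account.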

It's funny. I still remember when I started digging into the world of AWS IAM and was so impressed by how detailed it is possible to be when specifying permissions - and the fact that it was all documented so well felt like heaven for an Azure person like me - used to having to intercept HTTP calls between my browser and Azure in order to figure out how things worked. But then it dawned on me - it is next to impossible to build "wide" permissions in AWS, along the lines of "do whatever you want, as long as the thing you're creating has these tags". IAM statements are very much tuned towards big organizations that have dedicated experts who manage these policies - possibly in separate cloud security teams.
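To make the point concrete, here is a sketch of what "do whatever you want with EC2 instances, as long as they carry your team's tag" looks like for one of the services that does support tag conditions. The condition keys (`aws:RequestTag`, `aws:ResourceTag`, `ec2:CreateAction`) are real IAM keys, but the function name `team_ec2_policy`, the `team` tag key and the selection of actions are my own assumptions, and the policy has not been validated against a live account.

```python
import json


def team_ec2_policy(team, tag_key="team"):
    """Build an IAM policy scoping EC2 instance access to one team's tag."""
    instance = "arn:aws:ec2:*:*:instance/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Launching is allowed only if the request tags the instance.
                "Sid": "LaunchTaggedInstances",
                "Effect": "Allow",
                "Action": "ec2:RunInstances",
                "Resource": instance,
                "Condition": {
                    "StringEquals": {f"aws:RequestTag/{tag_key}": team}
                },
            },
            {   # RunInstances also touches resources without request tags.
                "Sid": "LaunchSupportingResources",
                "Effect": "Allow",
                "Action": "ec2:RunInstances",
                "Resource": [
                    "arn:aws:ec2:*::image/*",
                    "arn:aws:ec2:*:*:subnet/*",
                    "arn:aws:ec2:*:*:network-interface/*",
                    "arn:aws:ec2:*:*:security-group/*",
                    "arn:aws:ec2:*:*:key-pair/*",
                    "arn:aws:ec2:*:*:volume/*",
                ],
            },
            {   # Tags may only be set as part of the launch call itself.
                "Sid": "TagOnCreateOnly",
                "Effect": "Allow",
                "Action": "ec2:CreateTags",
                "Resource": instance,
                "Condition": {
                    "StringEquals": {"ec2:CreateAction": "RunInstances"}
                },
            },
            {   # Existing instances are manageable only with the team tag.
                "Sid": "ManageOwnInstances",
                "Effect": "Allow",
                "Action": [
                    "ec2:StartInstances",
                    "ec2:StopInstances",
                    "ec2:TerminateInstances",
                ],
                "Resource": instance,
                "Condition": {
                    "StringEquals": {f"aws:ResourceTag/{tag_key}": team}
                },
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(team_ec2_policy("backend-team-prod"), indent=2))
```

That is four statements for a handful of actions on a single resource type in a single service, and the exact shape differs from service to service depending on which tag conditions each one supports.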

My wish-list? Glad you asked:

  • Implement tag-on-create support as well as "resource tag condition" support on ALL AWS services (this will allow enforcing that resources are created with a certain set of tags). Once this is done, it is relatively straightforward to build "resource-group-like" functionality on AWS using known tag keys and values - and that's something I hope the community will do so that individual organizations can benefit from each other.

  • Treat single-account AWS setups as a completely valid option, and provide guidance on how small and medium-sized orgs can build AWS setups that are secure enough, yet flexible enough to accommodate the way smaller organizations work.

  • Stop the "choose-role-at-login" madness in AWS SSO. Roles work for system accounts, since those generally do a very limited set of things. Humans are complex. They do many things. Forcing me to choose between managing `backend-team-a` or `backend-team-b`'s ec2 instances at login is just… wrong. Give me good old additive group membership any day instead.

In about two weeks I'm switching to an org that's running on GCP. I fully expect to miss the breadth of AWS services and the rich documentation. I also expect that I won't be missing multi-account headaches, waiting for endless AWS SSO login redirects and worrying about whether we got it all right, since those S3 buckets with millions of objects in them will be practically impossible to move later.