This is one example of the CEO making something happen that essentially birthed ...

potamic · on Nov 17, 2022

I wonder if we are far too quick to give CEOs credit for something other people effected. Sure, the CEO is the one to sign off on everything, but the question to ask is: could any other person in the CEO role at that time have done anything differently? I don't know the details in this particular case, but I'll go out on a limb and say the CEO quite likely did not proclaim "make this happen". Business was growing at an unbelievable pace, their systems were probably stressed to the max, their development was likely choked, and the technical team came up and said this is what we need to do, otherwise we can't handle more than this much traffic. What choice does the CEO have? He says "send me your proposal" and broadcasts it.

As for AWS, as far as I remember, Bezos was initially against the idea. The idea was the brainchild of one Andy Jassy, who along with Rick Dalzell convinced a reluctant board to try it out. They realized that they had been unintentionally building this cloud platform for some years in order to provide sellers with computing resources. Opening it up to public users was just a small sales move. Whether they did it or not, they were going to continue to invest in their cloud platform and nothing would change as far as their technical direction was concerned, so the board finally relented.

jzelinskie · on Nov 17, 2022

Distributed authorization is indeed hard! IAM is one of the few (maybe the only) AWS services that aren't regional, and it's because permissions must propagate globally for correctness' sake. As a distributed systems junkie, I'm shocked that other folks aren't as interested in authorization systems, because they really push the boundaries of what we can do with data consistency at scale.

It's unfortunate that only Amazon themselves can add new permissions to IAM to secure their services. Why can't our applications add new permissions to IAM and query those? This is going to be a shameless plug, but it was this very problem that caused my cofounders and me to quit our jobs and start a company. Together (and now with a community of hundreds of users and contributions from a few well-known companies) we built SpiceDB[0], which is the culmination of state-of-the-art distributed systems and authorization technology, developed open source instead of behind closed doors at a hyper-scaler. We were mostly inspired by the internal system at Google, which is actually more powerful than AWS or Google Cloud's IAM services, despite a fork of it actually powering GCP's IAM.

[0]: https://github.com/authzed/spicedb

p_l · on Nov 16, 2022

It's easy till you break IAM itself with your policy complexity and random services start dying because other AWS components a few layers deep can't get IAM to parse

another_devy · on Nov 16, 2022

Intriguing. Can you share details or an overview of why it failed for you? It would be a useful list of gotchas for me.

p_l · on Nov 16, 2022

Essentially there's a maximum size of IAM policy, which AFAIK is not documented properly anywhere - get close to it or exceed it and you start getting random failures everywhere.

donavanm · on Nov 17, 2022

Character limits & the number of applied policies are all publicly documented https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_i.... I'm not aware of any evaluation complexity limits and have never run into that sort of problem in my ~10 years of dealing with IAM.

I expect you ran into this sharp bit: "You can add as many inline policies as you want to an IAM user, role, or group. But the total aggregate policy size (the sum size of all inline policies) per entity cannot exceed the following limits." Calculating the sum would be a pain as a user.
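For anyone who wants to do that sum by hand, here is a rough sketch of the arithmetic. It assumes you already have the inline policy documents for one entity in memory (as dicts or JSON strings). The limit values in `ASSUMED_LIMITS` and the "whitespace is not counted" rule are my assumptions from memory of the quotas page, so verify them against the current docs. The function names are made up for illustration.

```python
import json
import re

# ASSUMPTION: aggregate inline-policy limits per entity type, in characters.
# Verify against the current IAM quotas documentation before relying on them.
ASSUMED_LIMITS = {"user": 2048, "role": 10240, "group": 5120}


def policy_size(document):
    """Return the character count of one policy document, ignoring whitespace.

    ASSUMPTION: all whitespace is excluded from the count, including any
    whitespace inside string values.
    """
    text = document if isinstance(document, str) else json.dumps(document)
    return len(re.sub(r"\s+", "", text))


def aggregate_inline_size(documents):
    """Sum the sizes of all inline policies attached to a single entity."""
    return sum(policy_size(d) for d in documents)


def headroom(entity_type, documents, limits=ASSUMED_LIMITS):
    """Characters left before the entity reaches its assumed aggregate limit.

    A negative result means the assumed limit is already exceeded.
    """
    return limits[entity_type] - aggregate_inline_size(documents)
```

Running `headroom("role", docs)` over every role before a deploy would at least flag which entity is getting close, instead of finding out from a failure a few layers down.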

p_l · on Nov 17, 2022

We didn't use inline policies much, but we had many policies linked across different objects, and the error message never pointed properly and we somehow didn't stumble upon the docs you mention (that's going into my notes :D).

I no longer work on that project, but it was a considerable blocker when I was leaving, as Sagemaker notebooks started randomly failing to start depending on the role they were launched with.

manv1 · on Nov 17, 2022

Yeah, I can see that happening. There are combinations of roles etc that might hit the limit.

Do you remember what was failing? That would give some insight into how these get evaluated.

I know that S3 does evaluation differently than the other services, which gave me some insight into the process. Unfortunately I forgot what the insight was (doh).

p_l · on Nov 17, 2022

The service that hit it was Sagemaker Notebooks, or specifically the underlying EC2 instance (which you normally don't see as a customer, afaik). It failed trying to attach a network interface to the instance, because of an IAM failure mentioning something rhyming with blown stack (it's been over a year, so I don't recall the details).

anonymousDan · on Nov 16, 2022

Can you (or anyone) say a bit about how the auth service is implemented from a distributed systems perspective? For example is it some kind of custom KV store?

dastbe · on Nov 17, 2022

In AWS, authentication and authorization happen within the application.

For the purposes of authorization, services integrate with a library that handles retrieving and caching policies based on caller identity. Services create a context that includes all of the relevant metadata (service, operation, resources, etc.) and the library evaluates the policy and says allow or deny.
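To make the shape of that concrete, here is a toy sketch of an in-process evaluator. This is not the actual AWS library. `Statement`, `RequestContext`, and `evaluate` are hypothetical names, glob matching via `fnmatchcase` is a simplification, and conditions are left out entirely. It only assumes the publicly documented IAM behaviour that an explicit deny wins and that the default is deny.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Statement:
    """One simplified policy statement (hypothetical, no conditions)."""
    effect: str        # "Allow" or "Deny"
    actions: tuple     # glob patterns such as "s3:Get*"
    resources: tuple   # glob patterns over resource identifiers


@dataclass(frozen=True)
class RequestContext:
    """Metadata the calling service assembles for a single API call."""
    principal: str
    action: str
    resource: str


def _matches(patterns, value):
    """True if any glob pattern matches the value (case-sensitive)."""
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def evaluate(statements, ctx):
    """Return "allow" or "deny" for the request.

    An explicit Deny wins immediately; otherwise at least one matching
    Allow is required; with no match the result is an implicit deny.
    """
    allowed = False
    for statement in statements:
        if not _matches(statement.actions, ctx.action):
            continue
        if not _matches(statement.resources, ctx.resource):
            continue
        if statement.effect == "Deny":
            return "deny"
        if statement.effect == "Allow":
            allowed = True
    return "allow" if allowed else "deny"
```

A service would build a `RequestContext` per call and pass in whatever statements the library has cached for that caller. The real thing presumably also handles conditions, resource policies, boundaries and so on.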

Doing it all in application means that if the control/distribution systems for auth go down most things that are in motion will remain in motion, and that deployments of the authentication/authorization code deploy out at a per-service granularity which also scopes blast radius.
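A rough sketch of why that degrades nicely, under my own assumptions: `PolicyCache` and the `fetch` callback are invented names, and the "serve the stale entry when the refresh fails" rule is my guess at one way to get the "in motion stays in motion" property, not a description of what AWS actually does.

```python
import time


class PolicyCache:
    """Per-process cache of policies keyed by caller identity (hypothetical).

    Fresh entries are served from memory. When an entry has expired, the
    cache tries to refresh it; if the policy store is unreachable, it falls
    back to the stale entry rather than failing the request.
    """

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch        # callable: principal -> policies
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # principal -> (fetched_at, policies)

    def get(self, principal):
        now = self._clock()
        entry = self._entries.get(principal)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # still fresh, no remote call
        try:
            policies = self._fetch(principal)
        except Exception:
            if entry is not None:
                return entry[1]    # store is down: keep serving stale data
            raise                  # never seen this caller: cannot decide
        self._entries[principal] = (now, policies)
        return policies
```

The trade-off is visible in the fallback branch: callers already in the cache keep working through an outage, but a revocation cannot take effect on that host until the store is reachable again.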

There are some pretty obvious pain points (doing anything as a library means updating the world for new features), but it has nice degradation properties and is relatively straightforward to grok as a service owner.

mannyv · on Nov 16, 2022

Well, it's really tough, because (1) every operation has to check if the calling entity is authorized, (2) changes need to propagate super quickly, and (3) performance needs to be pretty much realtime.

At some level every API call is authorized (and tracked).

To be honest, this is one of the secret sauces that makes AWS go. Someone once told me that they're not doing anything exciting, just caching, but I'm pretty sure they didn't really know what was going on.