This is one example of the CEO making something happen that essentially birthed AWS.
Bezos, of all people, was like "make it happen." And it did. It was basically work for no reason except future proofing. Having someone up the food chain OK this much work for the future (and no hard dollar benefit) is highly unusual.
And besides that they've done some incredible things with their infrastructure, like authorization. Distributed authorization is really hard, but at AWS it's completely invisible. Remove a permission from an IAM role and it moves through AWS really, really fast. It's totally magic. Anyone who was abused by CORBA knows how hard that is to do well.
Their newer stuff (like Cognito) is sort of weird, but other things are surprisingly solid given how big AWS is. Small shops have trouble shipping feature complete software, and BigCorps can be even worse. AWS has gotten really good at it.
I wonder if we are far too quick to bestow CEOs with credit for something other people effected. Sure, the CEO is the one to sign off on everything, but the question to ask is: could any other person in the CEO role at that time have done anything different? I don't know the details in this particular case, but I'll go out on a limb and say the CEO quite likely did not proclaim "make this happen". Business was growing at an unbelievable pace, their systems were probably stressed to the max, their development was likely choked, and the technical team came up and said this is what we need to do, otherwise we can't handle more than this much traffic. What choice does the CEO have? He says, send me your proposal, and broadcasts it.
As for AWS, as far as I remember, Bezos was initially against the idea. The idea was the brainchild of one Andy Jassy, who along with Rick Dalzell convinced a reluctant board to try it out. They realized that they had been unintentionally building this cloud platform for some years in order to provide sellers with computing resources. Opening it up to public users was just a small sales move. Whether they did it or not, they were going to continue to invest in their cloud platform and nothing would change as far as their technical direction was concerned, so the board finally relented.
Distributed authorization is indeed hard! IAM is one of the few (maybe the only) AWS services that isn't regional, and that's because permissions must propagate globally for correctness' sake. As a distributed systems junkie, I'm shocked that other folks aren't as interested in authorization systems, because they really push the boundaries of what we can do with data consistency at scale.
It's unfortunate that only Amazon themselves can add new permissions to IAM to secure their services. Why can't our applications add new permissions to IAM and query those? This is going to be a shameless plug, but it was this very problem that caused my cofounders and me to quit our jobs and start a company. Together (and now with a community of hundreds of users and contributions from a few well-known companies) we built SpiceDB[0], which is the culmination of state-of-the-art distributed systems and authorization technology, developed open source instead of behind closed doors at a hyper-scaler. We were mostly inspired by the internal system at Google, which is actually more powerful than AWS or Google Cloud's IAM services, despite a fork of it actually powering GCP's IAM.
It's easy till you break IAM itself with your policy complexity and random services start dying because other AWS components a few layers deep can't get IAM to parse.
Essentially there's a maximum size of IAM policy, which AFAIK is not documented properly anywhere - get close to it or exceed it and you start getting random failures everywhere.
Character limits & the number of applied policies are all publicly documented https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_i.... I'm not aware of any evaluation complexity limits and have never run into that sort of problem in my ~10 years of dealing with IAM.
I expect you ran into this sharp bit: "You can add as many inline policies as you want to an IAM user, role, or group. But the total aggregate policy size (the sum size of all inline policies) per entity cannot exceed the following limits." Calculating the sum would be a pain as a user.
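For what it's worth, the sum itself is easy to approximate once you have the policy documents in hand. Here's a rough sketch, not an official tool: the function names are made up, the per-entity limits are the numbers I recall from the IAM quotas page (double-check them against the current docs), and "strip all whitespace, then count characters" is my approximation of how the docs say size is measured.

```python
import json

# Aggregate inline policy limits in characters, as I recall them from the
# IAM quotas page. Treat these as assumptions and verify against the docs.
INLINE_POLICY_LIMITS = {"user": 2048, "group": 5120, "role": 10240}


def policy_size(document: dict) -> int:
    """Approximate size of one policy document, ignoring whitespace."""
    compact = json.dumps(document, separators=(",", ":"))
    # Drop any whitespace left inside string values as well.
    return sum(1 for ch in compact if not ch.isspace())


def aggregate_inline_size(documents) -> int:
    """Sum the approximate sizes of all inline policies on one entity."""
    return sum(policy_size(doc) for doc in documents)


def headroom(entity_type: str, documents) -> int:
    """Characters left before the assumed limit; negative means over."""
    return INLINE_POLICY_LIMITS[entity_type] - aggregate_inline_size(documents)
```

Fetching the documents is the annoying part in practice; this only covers the arithmetic, and only for inline policies, not managed ones attached to the same entity.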
We didn't use inline policies much, but we had many policies linked across different objects, the error messages never pointed at the actual cause, and we somehow didn't stumble upon the docs you mention (that's going into my notes :D).
I no longer work on that project, but it was a considerable blocker when I was leaving, as SageMaker notebooks started randomly failing to start depending on the role they were launched with.
Yeah, I can see that happening. There are combinations of roles etc that might hit the limit.
Do you remember what was failing? That would give some insight into how these get evaluated.
I know that S3 does evaluation differently than the other services, which gave me some insight into the process. Unfortunately I forgot what the insight was (doh).
The service that hit it was SageMaker Notebooks, or specifically the underlying EC2 instance (which you normally don't see as a customer, afaik). It failed trying to attach a network interface to the instance, because of an IAM failure mentioning something rhyming with blown stack (it's been over a year, so I don't recall the details).
Can you (or anyone) say a bit about how the auth service is implemented from a distributed systems perspective? For example is it some kind of custom KV store?
In AWS, authentication and authorization happen within the application.
For the purposes of authorization, services integrate with a library that handles retrieving and caching policies based on caller identity. Services create a context that includes all of the relevant metadata (service, operation, resources, etc.), and the library evaluates the policy and says allow or deny.
Doing it all in application means that if the control/distribution systems for auth go down most things that are in motion will remain in motion, and that deployments of the authentication/authorization code deploy out at a per-service granularity which also scopes blast radius.
There's some pretty obvious pain points (doing anything as a library means update the world for new features) but it has nice degradation properties and is relatively straightforward to grok as a service owner.
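To make the shape of that concrete, here's a toy sketch of an in-process authorizer. This is not AWS's actual library: `Statement`, `RequestContext`, `PolicyCache`, and `is_allowed` are all hypothetical names, the glob matching via `fnmatchcase` is a simplification of real policy matching, and the "explicit deny wins, default deny" rule is just the behavior the public IAM docs describe.

```python
import time
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Statement:
    effect: str        # "Allow" or "Deny"
    actions: tuple     # glob patterns, e.g. ("s3:Get*",)
    resources: tuple   # glob patterns over resource names


@dataclass(frozen=True)
class RequestContext:
    principal: str
    action: str
    resource: str


class PolicyCache:
    """Caches statements per principal; serves stale data if fetch fails."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch          # hypothetical call to the policy store
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # principal -> (fetched_at, statements)

    def get(self, principal):
        now = self._clock()
        cached = self._entries.get(principal)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        try:
            statements = self._fetch(principal)
        except Exception:
            # Policy store unreachable: keep going on the last known
            # policy, or fail closed if we never had one.
            return cached[1] if cached is not None else ()
        self._entries[principal] = (now, statements)
        return statements


def _matches(patterns, value):
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def is_allowed(cache, ctx):
    """Explicit Deny wins; otherwise at least one Allow is required."""
    allowed = False
    for st in cache.get(ctx.principal):
        if _matches(st.actions, ctx.action) and _matches(st.resources, ctx.resource):
            if st.effect == "Deny":
                return False
            allowed = True
    return allowed
```

The stale-on-failure branch is the "things in motion stay in motion" property. The cost is that a revocation can't land while the policy store is unreachable, so the TTL and the fallback behavior are where the real tradeoffs live.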
Well, it's really tough, because (1) every operation has to check if the calling entity is authorized, (2) changes need to propagate super quickly, and (3) performance needs to be pretty much realtime.
At some level every API call is authorized (and tracked).
To be honest, this is one of the secret sauces that makes AWS go. Someone once told me that they're not doing anything exciting, just caching, but I'm pretty sure they didn't really know what was going on.