But here is a problem I am facing personally (numbers are hypothetical).
I get 10-15 review requests a day from 4 teammates who are generating code by prompting, and I am doing the same, so you can guess we might have ~20 PRs/day to review. Each PR roughly updates 5-6 files, with 10-15 lines changed in each.
So you can estimate that I am looking at around 50-60 files, but I can't keep the context of the whole file, because the change I am looking at is somewhere in the middle: 3 lines here, 5 lines there, and another 4 lines at the end.
Ideally, you’re working with teammates you trust. On the best teams I’ve worked on, reviews were a formality. Most of the time it was a quick scan and an LGTM. We worked together prior to the review, as needed, on areas we knew would need input from others.
AI changes none of this. If you’re putting up PRs and getting comments, you need to slow down. Slow is smooth, and smooth is fast.
I’ll caveat this with: that’s only if your employer cares about quality. If you’re fine passing that cost on to your users, you might as well stop reviewing altogether.
> Ideally, you’re working with teammates you trust.
I do trust them, but the code is not theirs; the prompt is. What if I trust them, but because of how much they use LLMs their brains have started getting lazy and they have started missing edge cases? Who should review the code then: me or them?
At the beginning I relied on my trust and did quick scans, but eventually I noticed they became uninterested in the craft and started submitting LLM output as-is. I still trust them as good-faith actors, but not their brains anymore (nor my own, for that matter).
Also, this assumption is based on an ideal team where everyone behaves in good faith. That is not the case in corporations and big tech, especially when incentives are aligned with the "output/impact" you are making. A lot of the time, people get promoted before they ever see the impact of their past bad judgment, so why craft perfect code?
Yeah, I agree with you. I’d say they’re not high performers anymore. The best answer I’ve got is: find a place where quality matters. If you’re at a body shop, it’s not gonna be fun.
I do think some of this is just a hype wave and businesses will learn quality and trust matter. But maybe not - if wealth keeps becoming more concentrated at the top, it’s slop for the plebs.
My work has turned into churning out a PR, marking it as a draft so no one reviews it, and walking away. I come back after thinking about what it produced and usually realize it missed something or that the implications of some minor change are more far-reaching than the LLM understood. I take another pass. Then, I walk away again. Repeat.
Honestly I'm not sure much has changed with my output, because I don't submit PRs which aren't thoughtful. That is what the most annoying people in my organization do. They submit something that compiles, and then I spend a couple hours of my day demonstrating how incorrect it is.
For small fixes, where I can recognize there is a clear, easily testable change, I no longer add them to a TODO list; I simply set an agent off on the task and take it all the way to a PR. It has been nice to be able to autopilot mindless changesets.
If reviewing has become the bottleneck, the obvious - albeit slightly boring - solution is to slow down spitting out new code, and spend relatively more time reviewing.
Just going ahead and piling up PRs or skipping the review process is of course not recommended.
You are not wrong, but the solution you are proposing just throttles the system because of the bottleneck; it doesn't solve the bottleneck problem itself.
Correct, but that has and probably always will be the case.
You spend the time on what is needed for you to move ahead - if code review is now the most time consuming part, that is where you will spend your time. If ever that is no longer a problem, defining requirements will maybe be the next bottleneck and where you spend your time, and so forth.
Of course it would be great to get rid of the review bottleneck as well, but I at least don't have an answer to that. I don't think the current generation of LLMs is good enough to allow us to bypass that step.
You know we’ve had the ability to generate large amounts of code for a long time, right? You could have been drowning in reviews in 2018. Cheap devs are not new. There’s a reason this trend never caught on for any decent company.
I hope you are not a bot, because your account was created just 8 minutes ago.
> You know we’ve had the ability to generate large amounts of code for a long time, right?
No, I was not aware. Nothing comes close to the scale of 'coherent looking' code generation of today's tech.
Even if you employ 100K people and ask them to write proper if/else code non-stop, an LLM can still outcompete them by a huge margin, with much better-looking code.
(Don't compare LLM output to the codegen of the past, because codegen was carefully crafted and a lot of the time deterministic. I am only talking about people writing code vs LLMs writing code.)
> No, I was not aware. Nothing comes close to the scale of 'coherent looking' code generation of today's tech.
Are you talking about “I’m overwhelmed by code review” or “we can now produce code at a scale no amount of humans can ever review”? Those are 2 very different things.
To the former: you review code because you’re responsible for it. This problem existed pre-AI and nothing has changed wrt being overwhelmed; the solution is still the same. To the latter: I think that’s more the software dark factory kind of thinking?
I find that interesting, and maybe we’ll get there. But again, the code it takes to verify a system is drastically more complex than the system itself. I don’t know how you could build such a thing except in narrow use cases, which I do think we’ll see one day, though how narrow they are is the key part.
I don't quite follow - are you describing an issue with the way your team has structured PRs? IMO, a PR should contain just enough code to clearly and completely solve "a thing" without solving too much at once. But what this means in practice depends on the team, product, velocity, etc. It sounds like your PRs might be broken up into too small of chunks if you can't understand why the code is being added.
I am saying the PRs I get are around 60-70 lines of change, which is small enough to be considered a single unit (add to this the unit tests that must pass with the new change, so we are talking about a 30-line change + a 30-line unit test).
But when looking at the PR changes, you don't always see the whole picture, because the review subjects (code lines) are scattered across files and methods, and GitHub also shows methods and files only partially, making it even more difficult to quickly spot the context around those updated lines.
It's a difficult problem, because even if GitHub shows the whole body of the updated method or file, you still don't see the grand picture.
For example: A (calls) -> B -> C -> D
And you made changes in D. How do you know the side effects on B? What if it broke A?
If the code is well architected, the contract between C and D should make it clear whether changes in D affect C or not. And if C is not affected, then B and A won't be either.
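To make this concrete, here's a minimal Python sketch of that A -> B -> C -> D idea (all names here, like `d_lookup_price`, are made up for illustration). C depends only on D's declared contract, so D's internals can change freely without C noticing:

```python
# Hypothetical sketch: C is written against D's contract (signature and
# return type), not against D's internals.

from dataclasses import dataclass

@dataclass
class Price:
    amount_cents: int
    currency: str

def d_lookup_price(sku: str) -> Price:
    # D: internals can change (caching, a different data source) as long
    # as the returned Price still honors the contract.
    return Price(amount_cents=1999, currency="USD")

def c_apply_discount(sku: str, percent: int) -> Price:
    # C: only relies on d_lookup_price returning a valid Price.
    base = d_lookup_price(sku)
    discounted = base.amount_cents * (100 - percent) // 100
    return Price(amount_cents=discounted, currency=base.currency)

print(c_apply_discount("ABC-1", 10).amount_cents)  # 1999 * 90 // 100 == 1799
```

As long as a change to D keeps that `Price` contract intact, C (and by extension B and A) should be unaffected.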
> It's a difficult problem, because even if GitHub shows the whole body of the updated method or file, you still don't see the grand picture.
> For example: A (calls) -> B -> C -> D
> And you made changes in D. How do you know the side effects on B? What if it broke A?
That's poor encapsulation. If the changes in D respect its contract, and C respects D's contract, your changes in D shouldn't affect C, much less B or A.
That's the reality of most software built in the last 20 years.
> If the changes in D respect its contract, and C respects D's contract, your changes in D shouldn't affect C, much less B or A.
Any change in D must eventually affect B or A; it's inevitable, otherwise D shouldn't exist in the call stack.
Here's how the case I mentioned can happen. Imagine in each layer you have 3 variations: 1 happy path and 2 edge cases. Let's start from the lowest:
D: 3 paths, C: 3×D = 9, B: 3×C = 27, A: 3×B = 81
Obviously, you won't be writing 81 unit tests for A and 27 for B; you will mock implementations and write enough unit tests to make the coverage good. Because of that mocking, when you update D and add a new case, but do not surface the relevant mocking to the upper layers, you end up in a situation where D impacts A, but it's not visible in the unit tests.
While reading the changes in D, I can't reconstruct every possible parent caller chain in my head to ask the engineer to write the relevant unit tests.
So the case I mentioned happens; otherwise there would be no bugs in the real world.
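Here's a minimal Python sketch of exactly that failure mode (function names like `d_fetch_stock` are invented for the example): B's test mocks D with the old behavior, so a new edge case added to D never shows up in B's test suite, even though the real call path breaks.

```python
# Hypothetical sketch: a stale mock in B's tests hides a new case in D.

from unittest.mock import patch

def d_fetch_stock(sku: str):
    # After the change, D can return None for discontinued SKUs.
    if sku.startswith("DISC-"):
        return None  # new edge case added in D
    return 5

def b_can_ship(sku: str) -> bool:
    # B was written assuming D always returns an int.
    return d_fetch_stock(sku) > 0

# B's unit test, written before D changed, mocks D and still passes:
with patch(f"{__name__}.d_fetch_stock", return_value=5):
    assert b_can_ship("DISC-123") is True  # the mock hides the new case

# But the real, unmocked call path now raises (None > 0 is a TypeError):
try:
    b_can_ship("DISC-123")
except TypeError as e:
    print("real path broke:", e)
```

The green test suite is telling the truth about the mocked world, not the real one; that gap is what the reviewer has to catch by hand.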
check out the branch. if the changes are that risky, the web ui for your repository host is not suitable for reviewing them.
the rest of your issues sound architectural.
if changes are breaking contracts in calling code, that heavily implies that type declarations are not in use, or enumerable values which drive conditional behavior are mistyped as a primitive supertype.
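as a sketch of that point (names invented for illustration): conditional behavior driven by a bare string silently ignores a new case, while an Enum gives type checkers and exhaustiveness checks something to flag.

```python
# Hypothetical sketch: an Enum makes "a new case was added" visible,
# where a primitive str would let unhandled values pass silently.

from enum import Enum

class Status(Enum):
    ACTIVE = "active"
    SUSPENDED = "suspended"
    # Adding a new member (e.g. DELETED) makes unhandled call sites
    # fail loudly below, instead of silently falling through.

def describe(status: Status) -> str:
    if status is Status.ACTIVE:
        return "ok"
    if status is Status.SUSPENDED:
        return "blocked"
    raise ValueError(f"unhandled status: {status}")

print(describe(Status.ACTIVE))
```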
if unit tests are not catching things, that implies the unit tests are asserting trivial things, being written after the implementation to just make cases that pass based on it, or are mocking modules they don't need to.
outside of pathological cases the only thing you should be mocking is i/o, and even then that is the textbook use for dependency injection.
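a minimal sketch of that, assuming Python (function names are made up): the I/O boundary is injected as a parameter, so tests hand in a fake instead of patching internal collaborators.

```python
# Hypothetical sketch: mock only I/O, via dependency injection.

def load_user_names(fetch_rows) -> list[str]:
    # fetch_rows is the injected I/O boundary (e.g. a real DB query in
    # production); the logic itself never touches I/O directly.
    return sorted(row["name"] for row in fetch_rows("SELECT name FROM users"))

# A test injects a fake at the boundary; no patching of internals needed:
def fake_fetch(query: str):
    return [{"name": "zoe"}, {"name": "amir"}]

print(load_user_names(fake_fetch))  # ['amir', 'zoe']
```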
Who reviews the tests? Again me? That's exactly why I am saying review is a bottleneck, especially with current tooling, which doesn't show the second-order impacts of the changes, and it's not easy to reason about when a method gets called by 10 other methods with 4 levels of parent hierarchy.
I really want to say: "You are absolutely right"
But here is a problem I am facing personally (numbers are hypothetical).
I get 10-15 review requests a day from 4 teammates who are generating code by prompting, and I am doing the same, so you can guess we might have ~20 PRs/day to review. Each PR roughly updates 5-6 files, with 10-15 lines changed in each.
So you can estimate that I am looking at around 50-60 files, but I can't keep the context of the whole file, because the change I am looking at is somewhere in the middle: 3 lines here, 5 lines there, and another 4 lines at the end.
How am I supposed to review all these?