> Correctness will go from a binary pass/fail to a probability
Excellent point. The pervasiveness of neural nets will require engineers (and really, everyone) to start thinking more probabilistically and establish acceptability thresholds instead of certainty. It's the way of the future.
That seems like a regression in many regards, especially given how much of this looks like a solution in search of a problem rather than the other way around.
Agreed. We may imagine ourselves very smart, but in my experience most people do not understand probability and statistics well.
Is 90% accuracy good enough? Is 95%? 99%? 99.9%? No matter the answer, you have to tolerate errors. Now your stakeholders have to tolerate errors. Are they going to accept errors just because "Software 2.0" is here and that's what we all have to live with? Nope.
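To make the threshold question concrete: here's a minimal sketch (my own illustration, not anyone's actual tooling) of an acceptance gate that only ships when we're statistically confident the true error rate is under the agreed threshold, using a Wilson upper bound:

```python
import math

def error_rate_upper_bound(failures: int, trials: int, z: float = 1.96) -> float:
    """Wilson score upper bound on the true error rate (~95% confidence)."""
    if trials == 0:
        return 1.0
    p = failures / trials
    center = p + z**2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (center + margin) / (1 + z**2 / trials)

# Hypothetical gate: ship only if we're confident the error rate is below
# whatever threshold the stakeholders agreed to tolerate.
THRESHOLD = 0.001  # "99.9% correct"
failures, trials = 3, 10_000
print("ship" if error_rate_upper_bound(failures, trials) < THRESHOLD else "hold")
```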
All software is riddled with errors and for most purposes that's fine. Any developer, stakeholder, whatever, who thinks otherwise is living in a parallel universe.
It may be possible in the future to have "for all practical purposes flawless" software, which might make sense for select special applications. That would be a new thing though, rather than something we have and could lose due to adopting AI development.
> All software is riddled with errors and for most purposes that's fine.
This is typically not that true when it comes to correctness. Most software does the correct thing in the eyes of the user, nearly 100% of the time. And when it doesn't, the bug gets fixed and that edge case is corrected for every other user going forward.
AI-generated software, from what I've seen, has a wide range of errors in correctness (along with all the other errors that you mentioned all software having... which is true). It literally just does the wrong thing given what the user expects it to do. The path toward iteratively improving it to an acceptable level of correctness for any given application might exist, but so far I have not seen it.
That's an interesting difference but I'm not convinced it's valid. Perhaps you can explain what you mean in more detail?
Let's say the software is good enough if it does the right thing 99.9% of the time it is used. I take it you're saying that if an AI starts modifying it and only writes correct code in 99.9% of cases (yes, current AI is not even close, but it will improve), that makes it worse, because the software might start failing completely. However, if you have proper tests and release management, such obvious flaws will quickly be detected and fixed or rolled back. For most applications that seems pretty much equivalent to what we have now.
The other case is software that is completely AI generated and cannot reasonably be modified by humans anymore. In that case, again, you have tests and a sane deployment strategy that mitigates failures to a sufficient degree depending on application.
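As a sketch of what "proper tests and release management" could mean here (the commands are placeholders, not any particular platform's API), the gate is simply: run the human-owned suite, roll back on any regression:

```python
import subprocess

def gate_release(test_cmd: list[str], rollback_cmd: list[str]) -> bool:
    """Run the human-owned test suite; undo the change if it regresses."""
    if subprocess.run(test_cmd).returncode != 0:
        subprocess.run(rollback_cmd)  # obvious flaws get rolled back fast
        return False
    return True

# e.g. gate_release(["pytest", "tests/"], ["git", "revert", "--no-edit", "HEAD"])
```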
So the only issue is when you start building completely AI-generated software and fail to ever meet requirements or pass the human-written test cases? Users at least will never be impacted by that. Even now, many human-written software projects never get to the stage where they can be used. Is this really a problem, especially if the attempt at AI generation of software is cheap?
With "it will improve" being the big "if". Improve, sure, but improve to the point where we can trust it to do things with the level of correctness actually required? Unclear so far. GPT-3, the state of the art, can't be trusted to answer basic questions correctly (yet).
I also don't personally buy into the other notions in these comments that "the future is probabilistic software". I think that's wishful thinking outside of some specific domains and an attempt to bend our actual requirements to meet the capabilities of AI software, rather than the opposite.
> pass the human-written test cases
I'm not super sold on this idea either. It seems reasonably possible that writing the test cases to the level of specification necessary to ensure the correctness we're after is just as much effort as writing the code itself.
The way I see it is as expanding the high-level/low-level spectrum of programming tools toward the high end. In very large-scale projects with many levels of complexity, I would argue "probabilistic software" is already a reality. You don't even try to fix all the bugs or edge cases. You try to minimize the impact of failures while accepting that there will always be failures. Building in logic to fail more gracefully is a big part of that, and the cost/benefit of such efforts is a probabilistic question, even if it might rarely be framed as such.
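A toy example of that "fail more gracefully" logic (all names made up):

```python
POPULAR_ITEMS = ["item-1", "item-2", "item-3"]  # static fallback content

def personalized_recommendations(user_id: str) -> list[str]:
    # Stand-in for a flaky ML-backed service call.
    raise TimeoutError("model service did not respond")

def recommend(user_id: str) -> list[str]:
    """Accept that failures will happen; minimize their impact instead."""
    try:
        return personalized_recommendations(user_id)
    except Exception:
        return POPULAR_ITEMS  # degraded but usable, instead of an error page

print(recommend("u42"))  # -> ['item-1', 'item-2', 'item-3']
```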
It's usually much easier (and never harder) to specify what needs to be done than how it should be done. Whether you can trust the result (enough) depends on the application. AI-driven development will be applied first where errors and failures are least harmful and advance from there. It might take a long time until the degree of correctness improves enough and trust is built around it. After all, some countries' railroad networks still don't use computers; instead, a human maps out new routes and schedules on paper to make sure trains don't crash. Nevertheless, the speed of AI development has exceeded (at least my) expectations time and time again in recent years.
What might happen is that we eventually get a lot of software that fails more often than now but is a lot cheaper. That would still mean that the new methods are widely adopted. People accept software errors as a part of life already, so even if they get more frequent in less critical applications, we will adapt.
What is really dangerous is when the various software components get too fast and complex to understand or control and develop pathological feedback loops in situations that cause real trouble. This kind of thing is a continuum of badness which tops out at "AI taking over and wiping out humanity". Given market incentives driving adoption, which I anticipate to be strong, it's hard to imagine how such risks might be mitigated.
> You don't even try to fix all the bugs or edge cases. You try to minimize the impact of failures while accepting that there will always be failures
Yes, but I'll point you back at my original comment about correctness. I've never been on a team that shipped code we knew would do the wrong thing. I ship code with known failure points all the time, but when the code runs to completion, I'm pretty darn sure it's doing the correct thing, and we try very, very hard to make sure of that. With AI-generated code, I'm seeing that it can't discern between issues of correctness and issues of failure or availability. It's just something like 95% success across all of those dimensions.
> What is really dangerous is when the various software components get too fast and complex to understand or control and develop pathological feedback loops in situations that cause real trouble.
Yea I agree, that's somewhat scary to think about.
> This is typically not that true when it comes to correctness. Most software does the correct thing in the eyes of the user, nearly 100% of the time. And when it doesn't, the bug gets fixed and that edge case is corrected for every other user going forward.
You're confusing pleasantness with correctness. Pleasantness is the property of pleasing the user, or more often the software's owner. Correctness is the property of conforming to a specification. Since most software written has no specification, correctness is undefined. Evidently this works adequately well in the marketplace.
There's a difference between errors in UI/reliability/performance/etc and errors in business logic. When there's an error in the business logic, then heads roll.
The entire field of Site Reliability Engineering is dedicated to finding the exact boundaries of acceptable errors, and to defining the response function when those boundaries are crossed.
I would assume this actually gels very nicely with neural nets, since they're constantly optimizing for fitness. Hell, in theory you could bake your SLAs/SLIs into the models to self-correct? Give the model direct feedback that it's unfit?
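Purely as a back-of-the-envelope illustration (not an established training recipe), "baking SLIs into the model" might mean treating SLO breaches as an extra penalty term in the objective:

```python
def sli_penalty(error_rate: float, slo_target: float = 0.001,
                weight: float = 10.0) -> float:
    """Extra loss that kicks in once the measured error rate breaches the SLO."""
    return weight * max(0.0, error_rate - slo_target)

base_loss = 0.42                             # the model's usual objective value
total_loss = base_loss + sli_penalty(0.004)  # SLO breached -> penalized
print(round(total_loss, 4))                  # 0.42 + 10 * 0.003 = 0.45
```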
I'd argue that it's a solution to the problem that programmers are expensive and only a fraction can be counted on to produce 100% reliable code anyway.
While a highly skilled, highly professional developer can both write code that solves a given problem 100% correctly and write tests that prove it does so over the entire input domain, in practice most developers fall short on both counts. Every time you interact with a date or phone number field that chastises you for your use or non-use of punctuation, you know this.
So, for many use cases, it's possible imperfect programmers will be replaced with neural networks that are 95% accurate, perhaps with a differently-trained one checking the work of the first one.
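A rough sketch of that generate-then-check wiring; both model calls are stubs, since the real API depends on the vendor:

```python
def generate_answer(prompt: str) -> str:
    # Stand-in for the first model's completion call.
    return "42"

def check_answer(prompt: str, answer: str) -> bool:
    # Stand-in for a differently-trained verifier scoring the first model's output.
    return answer.strip().isdigit()

def answer_with_check(prompt: str, retries: int = 3) -> str | None:
    for _ in range(retries):
        candidate = generate_answer(prompt)
        if check_answer(prompt, candidate):
            return candidate
    return None  # escalate to a human instead of shipping an unchecked answer

print(answer_with_check("What is 6 * 7?"))  # -> 42
```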
I've been looking for an opportunity to try out a pattern where the two approaches improve each other:
1. Generate a 95% accurate model
2. Use it to generate test cases
3. Code the thing, with the help of the cases
4. Manually remove cases in the 5%
I imagine steps 1 and 2 being completed by a product owner and steps 3 and 4 being completed by a software engineer.
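Step 2 might look something like this in practice. The model and `my_app.classify` are hypothetical stand-ins, but the point is that the generated cases land in an ordinary pytest file the engineer can prune by hand in step 4:

```python
# Step 2, sketched: turn model outputs into ordinary pytest cases.
def model_predict(text: str) -> str:
    return "positive" if "good" in text else "negative"  # the "95% model"

inputs = ["good movie", "terrible plot", "good acting, bad script"]

with open("test_generated.py", "w") as f:
    f.write("from my_app import classify\n\n")  # hypothetical module under test
    for i, text in enumerate(inputs):
        f.write(f"def test_case_{i}():\n")
        f.write(f"    assert classify({text!r}) == {model_predict(text)!r}\n\n")
```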
We're so horrifically bad at communicating a requirement's intent, I wonder what would happen if we tried to use AI to communicate them via their extent instead.
I'm not familiar enough with AI workflows to weigh in on how easy or hard it would be. I've just noticed that it's much easier to describe why something is not what you want than describing what you want.
So if AI can give us a mockup that's workable enough to skip the first few iterations of "no that's not what I want" and get right to the part where the engineer is asking questions about the edge cases that weren't explicit in the requirements... That's a win.
I imagine you'd still want to have it in a box of some sort re: creating infra. Like you give it a very small cluster and probably make the stack decisions yourself: "write me a postgres schema for... write a fastapi API for the schema... write a react UI for the API... write me a k8s operator that up/downs the above components..." Workshop the idea with other product people...
...and only then involve the engineer like: "make this AI-generated house of cards into a fortress".
Less of a regression than you might think. How many software systems (outside of very well unit-tested components) really have strong correctness proofs? It’s an almost futile task in modern software engineering, with all those distributed systems everywhere.
It seems to me that current AI techniques are more about reducing programmer workload than actually doing something that isn't possible with traditional methods.
> Our software has a 99% chance to calculate your taxes correctly! And only a 1% chance of failure in which it's your fault and it's you that's committing tax fraud
Considering ChatGPT spits out inaccurate information all the time, I think "our software has a 99% chance of you going to jail for tax fraud" is more accurate.
The problem really is that ChatGPT writes things based on what is most likely given what it wrote before and what the input is, so the possible error scenarios are:
1. Making a mistake in this part is really common; ChatGPT makes the common mistake.
2. You have an uncommon situation at play; ChatGPT ignores it and writes things that either get you in trouble or cause you to pay more than you should.
Also, the longer it goes on writing, the more likely it is that what it writes doesn't hang together with what it wrote before. When a human lies, they at least try to make their lies follow a sensible pattern. ChatGPT would be likely to get you flagged for audits, because you can't be sure that what it wrote on page 1 jibes with what it writes on page 3.
If they were broken, they were deterministically broken; unless it's really poorly written, the calculator won't randomize its answers to the same inputs.
For some use cases, sure. We go through painstaking efforts to ensure things like correctness, consistency, and idempotency for a reason, though. We want most things to be deterministic, and when something's not, we freak out and fix it ASAP (including waking people up in the middle of the night to do so).
Yea, aka how engineers working on safety-critical or high-reliability stuff already have to think.
Assume your software has a probability to fail, have bugs, get hit by bit flips, or run on unreliable hardware. There’s a whole field for dealing with those kinds of things that typical web devs haven’t had to worry about as much.
These are not those types of engineers. Not the ones doing triple-redundant computation on three different processors with a summation-and-voting process, even using different programming languages and compilers because the compiler can also have bugs. These are the engineers running self-driving beta neural nets on public roads...
I use a process called sample-and-vote when having LLMs return executable solutions. I imagine we will see this kind of design a lot with probabilistic computing.
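Roughly how I'd picture it (the LLM call is stubbed out; in a real setup each candidate would be executed in a sandbox before voting):

```python
import random
from collections import Counter

def llm_solution(prompt: str) -> str:
    # Stand-in for an LLM returning a candidate solution's output.
    return random.choice(["42", "42", "42", "41"])  # mostly right, sometimes not

def sample_and_vote(prompt: str, n: int = 7) -> str:
    """Sample n candidates, compare their outputs, keep the majority answer."""
    outputs = [llm_solution(prompt) for _ in range(n)]
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes <= n // 2:
        raise RuntimeError("no majority; escalate to a human")
    return winner

print(sample_and_vote("compute the answer"))  # almost always "42" with this stub
```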
More like: we're 98.7% certain that the screen layout looks OK for all device types and languages. The question is how hard it is to fix the problematic cases without turning the whole thing upside down when you find out the original solution isn't enough.
Engineer brains are just a type of neural network too, which also has a probability of putting out bad code. Space exploration projects run into the hundreds of millions of dollars each and recruit the brightest people on the planet to design and program their software, yet there is a long list of projects that failed due to code errors. Mars probes have failed to reach the planet, orbited too low and burned up, crashed into the surface, or landed successfully but later overwrote critical memory due to a flawed software update.
The AI doesn't have to be perfect, but only offer a lower error rate than humans.
Engineer brains are not a neural net. It's only our lack of knowledge of how the brain works that leads us to anthropomorphize matrices and weighted graphs :-)
Reading a bit about the brain will quickly dispel those analogies. I suggest these two as good starting points:
- Lange Clinical Neurology - 11th Edition
- Bradley's Neurology in Clinical Practice, 8th Edition
That doesn't sound like progress. Maybe for CRUD apps it will be okay. I have a hard time imagining it will be acceptable for financial or critical systems.
You’ll rent a self-driving car from a ride-sharing app; they’ll make sure that the expected cost of fines and lawsuits is significantly less than the expected revenue from providing rides. Although they can also push down the former by making sure they operate from a favorable jurisdiction.
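The cold math there is just an expected-value comparison; the numbers below are invented purely for illustration:

```python
rides_per_year = 1_000_000
revenue_per_ride = 12.00         # all figures invented for illustration
p_incident = 1e-5                # chance a ride ends in fines or a lawsuit
cost_per_incident = 250_000.00   # jurisdiction-dependent, hence the shopping

expected_revenue = rides_per_year * revenue_per_ride                   # $12.0M
expected_liability = rides_per_year * p_incident * cost_per_incident   # $2.5M
print("viable" if expected_liability < expected_revenue else "not viable")
```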
That assumes people even care what the statistics are. Many people just go with their emotional response based on a political outlook when deciding what course of action to take.