There's a position in between "exit cleanly" and "general protection fault, core dumped" where the process essentially does the internal equivalent of SIGKILLing itself.
I.e. either intentionally (e.g. tripping an assertion failure), or accidentally due to some logic failure in exception/error handling, the process ends up calling the _exit(2) syscall without first having run the libc atexit(3) handlers that a clean exit(3) would run; or, at a slightly higher runtime abstraction level, the process calls exit(3) or returns from main() without having run through the appropriate RAII destructors (in C++/Rust), or without gracefully signalling will-shutdown to managed threads to allow them to run terminating-state code (in Java/Go/Erlang/Win32/etc), and so on.
This kind of "hard abort" often truncates logging output at the point of abort; leaves TCP connections hanging open; leaves lockfiles around on disk; and has the potential to corrupt any data files that were being written to. Basically, it results in the process not executing "should always execute" code to clean up after itself.
So, although the OS kernel/scheduler thinks everything went fine, and that it didn't have to step in to forcibly terminate the process's lifecycle (though it did very likely observe a nonzero process exit code), I think most people would still generally call this type of abort a "crash." The process's runtime got into an invalid/broken state and stopped cleaning up, even if the process itself didn't violate any protection rules / resource limits / etc.
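To make the "skipped cleanup" case concrete, here's a minimal POSIX sketch of the difference between exit(3) and _exit(2). The names `cleanup` and `cleanup_ran_in_child` are made up for illustration, and the pipe is just a way for the parent to observe whether the child's handler ran; this assumes a system with fork(2).

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;  /* write end of the pipe; set in the child only */

/* Stand-in for "should always execute" cleanup code (hypothetical). */
static void cleanup(void) {
    const char mark = 'c';
    ssize_t n = write(report_fd, &mark, 1);
    (void)n;
}

/* Forks a child that registers cleanup() with atexit(3) and then terminates
 * via exit(3) (hard == 0) or via _exit(2) (hard != 0).
 * Returns 1 if the handler ran, 0 if it did not, -1 on setup error. */
int cleanup_ran_in_child(int hard) {
    int fds[2];
    if (pipe(fds) != 0) return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                      /* child */
        close(fds[0]);
        report_fd = fds[1];
        if (atexit(cleanup) != 0) _exit(2);
        if (hard) _exit(0);              /* straight to the kernel; handlers skipped */
        exit(0);                         /* runs atexit handlers, then terminates */
    }

    close(fds[1]);                       /* parent */
    char buf;
    ssize_t n = read(fds[0], &buf, 1);   /* returns 0 (EOF) if nothing was written */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1 ? 1 : 0;
}
```

Note that the child exits with status 0 in both cases, so from the outside the two look the same; only the side effect of the handler tells them apart.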
> It is not OK to stop at the first sign of trouble, and return whatever maybe is right. “123timmy” is not a number, nor is the empty string.
None of the C functions referenced (atol, strtol, sscanf) are number-parsing functions per se. Rather, they're numeric-lexeme scanning+extraction functions.
These functions are all designed to avoid making any assumptions about the syntax of the larger document the numeric lexeme might be embedded in. You might, after all, be using a syntax where numbers can come with units on the end. Or you might be reading numbers as comma-separated values.
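As a rough illustration of that "scan, don't judge" behaviour, here's a small sketch using strtol's end pointer. The helper name `scan_quantity` is hypothetical; the point is only that the function hands back both the number and whatever follows it, leaving the caller to decide whether a unit, a comma, or anything else is acceptable there.

```c
#include <stdlib.h>

/* Scans a leading base-10 integer lexeme from `s` and stores it in *value.
 * Returns a pointer to the first byte after the lexeme; if no digits were
 * found, the returned pointer equals `s` and *value is 0. */
const char *scan_quantity(const char *s, long *value) {
    char *end;
    *value = strtol(s, &end, 10);
    return end;
}
```

Given "15px" this yields 15 with the rest pointing at "px"; given "10,20" it yields 10 with the rest pointing at the comma.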
And, as a key point the author might be missing: C, in being co-designed with UNIX, offers primitives tuned for the context of:
- writing UNIX CLI tools that work with unbounded streams of input (i.e. piped output from other UNIX CLI tools),
- where, crucially, the stream is just text, and so carries no TLV-esque framing protocol to tell you the definitive length of a thing;
- and nor (especially in early memory-constrained systems) are you able to perform allocations of heap memory in order to employ an unbounded growable buffer for retaining the current lexeme until you do reach the end of it (which, if you could, would let you use a scanner state-machine that doubles as a parser/validator, returning either a parsed value or an error)
- but instead, to deal with the 1. unbounded input, 2. of textual encoding, 3. in constant memory, you must eagerly scan the input stream (i.e. synchronously reduce over each received byte, or at most each fixed-length N-byte chunk using a static or stack-allocated fixed-length buffer, discarding the original string bytes once reduced-over) to produce lexically-decoded (but not parsed/validated) lexemes; and then do this again, on a higher level, feeding your stream of lexemes into a fixed-sized sum-typed ring-buffer (i.e. an array-of-union-typed-lexeme-struct-type-entries), where you can then invoke a function that attempts to scan over + consume them (but unlike the original stream-parsing function, doesn't consume the buffer unless successful, and so isn't functioning as a scanner per se, but rather as an LR parser.)
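A minimal sketch of the first half of that (eagerly reducing over a byte stream in constant memory) might look like the following. `scan_decimal` is a made-up name, it only handles unsigned decimal digits, and it is meant as an illustration of the style rather than a description of how any particular libc implements its scanners.

```c
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Scans one unsigned decimal lexeme from `in`, one byte at a time, keeping
 * only a running accumulator (no buffer holding the lexeme's text).
 * Returns 1 and stores the value if at least one digit was consumed,
 * 0 if the next byte is not a digit (nothing is consumed),
 * -1 if the value would overflow a long (digits so far are consumed). */
int scan_decimal(FILE *in, long *out) {
    long acc = 0;
    int ndigits = 0;
    int c;

    while ((c = getc(in)) != EOF && isdigit(c)) {
        int d = c - '0';
        if (acc > (LONG_MAX - d) / 10) return -1;  /* acc * 10 + d would overflow */
        acc = acc * 10 + d;                        /* reduce, then forget the byte */
        ndigits++;
    }
    if (c != EOF) ungetc(c, in);  /* leave the terminating byte for the caller */

    if (ndigits == 0) return 0;
    *out = acc;
    return 1;
}
```

The scanner stops at the first byte that can't extend the lexeme and pushes it back, which is exactly the "stop at the first sign of trouble" behaviour being complained about: whether that byte is a delimiter or an error is a decision for the layer above.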
If you're not writing UNIX CLI tools, direct use of the C-stdlib numeric-lexeme scan functions is operating on the wrong abstraction layer. What you want, if you have pre-framed strings that are "either valid numbers or parse errors", is to implement an actual parsing function... that can then invoke these numeric-lexer functions to do the majority of its work.
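For example, a strict "valid number or parse error" function layered on top of strtol could be sketched like this. The name `parse_long_strict` and its bool-plus-out-parameter shape are assumptions for illustration, not anything from the standard library.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Parses the whole of `s` as a base-10 long.
 * Returns true and stores the value only if the entire string is one
 * numeric lexeme that fits in a long; otherwise returns false and leaves
 * *out untouched. */
bool parse_long_strict(const char *s, long *out) {
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);

    if (end == s) return false;         /* no digits at all: "" or "timmy" */
    if (*end != '\0') return false;     /* trailing junk: "123timmy" */
    if (errno == ERANGE) return false;  /* value out of range for long */

    *out = v;
    return true;
}
```

With this, "123" parses to 123, while "123timmy" and the empty string are both rejected. Note that strtol itself still accepts leading whitespace and a sign, so a stricter grammar would need an extra check on the first character before the call.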
And if you're writing C, and yet you're not in UNIX-pipeline unbounded-text-stream land, but rather are parsing well-defined bounded-length "documents" (like, say, C source files)... then you probably want to use a real lexer-generator (like flex) to feed a parser-generator (like yacc/bison). Where:
- you'd validate the token in context, in the parsing phase;
- and your lexing rules would make certain classes of input invalid at lexing time. (E.g. you can write your lexeme matching rules such that multi-digit numbers with leading zeroes, or floating-point values with no digits before/after the decimal place, simply aren't "numbers" from your lexer's perspective.)
...which means that, once again, you can "get away with" invoking the regular C numeric-lexeme scanner functions; i.e. `yylval = atoi(yytext);` in bison terms. (And you'd want to, since doing so saves memory vs. keeping the numbers around as strings.)
When you say "improve an svg like this", how are you imagining setting that workflow up? Are you just feeding them the SVG to iterate on; or are you giving them access to a browser to look at the rendering of the SVG?
I ask because:
Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.
But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)
I imagine current (multi-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.
I'm talking about two types of improvement: model improvement and prompt-based improvement. I am noticing that the baseline output has a lot more going on; the model has improved, yet it still makes those obvious-looking mistakes with the shape of the frame or disconnected limbs etc.
And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and a fish in the bird's mouth.
This is also my gripe with a lot of this stuff, always evaluating models on what they can literally oneshot is completely pointless; it's not how anything works, neither for humans nor for scaffolded AIs. I guess it's neat if you want to argue that a certain level of intelligence can "never be achieved" in a single forward pass, but like, so what. No one cares about that, except people who have already decided to be anti AI.
(not that I am in any sense pro AI, but it's just a weird lack of intellectual rigor)
Asking a model to improve its output is not one-shotting tho? My observation was that asking an LLM to iterate and improve a response causes it to add more stuff, rather than repair the broken stuff. And that model progress in general has the same pattern. This new model adds more details to its responses but continues to make mistakes at about the same rate.
The question was whether you were giving it the rendered image and using the model's visual modal capability, or feeding back in the textual SVG.
It's hard to "imagine" what the rendered SVG looks like, for both humans and LLMs, so just iterating on text won't really be as useful of a test. But if you show it what it rendered, it might observe the bad-looking bicycle and be able to fix the text that way.
> Honestly, with the first step, it seems the PMs are already halfway there to implementation of the feature so I wonder if in the future they'll just do everything themselves
I'm guessing they've tried (or been induced to try by upper management), but given up because they don't know how to debug any problems that arise due to the LLM working itself into a corner.
Coding-agent LLMs act a lot like junior devs. And junior devs are: eager to write code before gathering requirements; often reaching for dumb brute-force solutions that require more work from them and are more error-prone, rather than embracing laziness/automation; getting confused and then "spinning their wheels" trying things that clearly won't work instead of asking for help; not recognizing when they've created an X-Y problem, and have then solved for their Y but not actually solved for the original problem X; etc.
The way you compensate for those inexperience-driven flaws in junior devs' approach, is to have them paired with, or fast-iteration-code-reviewed by, senior devs.
Insofar as a PM has development experience, it's usually only to the level of being a "junior dev" themselves. But to compensate for LLMs-as-junior-devs, they really need senior-dev levels of experience.
The good PMs know all of this, and so they're generally wary to take responsibility for driving the actual coding-agent development process on all but the most trivial change requests. A large part of a PM's job is understanding task assignment / delegation based on comparative advantage; and from their perspective, it's obvious that wielding LLMs in solution-space (as opposed to problem-space, as they do) is something still best left to the engineers trained to navigate solution-space.
> I think when LLMs first came out people thought they could just say something like, "Make a Facebook clone". But now we're realizing we need to be more exact with our requirements and define things better.
The annoying thing is that giving an LLM vague instructions like "make a Facebook clone" does work... in certain limited cases. Those being mostly the exact things a not-very-creative "ideas person" would think to try first. Which gave the "ideas people" totally the wrong idea about what these things can do.
These same "ideas people" have been contracting human software developers to "make them a Facebook clone" (and other requests of similar quality) for decades now.
And every so often, the result of one of those requests would end up out there on the internet; most recently on GitHub. (Which is, once there's enough of them lying about, already enough to allow a coding-agent LLM trained on GitHub sources to spew out a gestalt reconstruction of these attempts. For better or worse.)
But for the most common of these harebrained ideas (both social-media-feed websites and e-commerce marketplace websites fit here), entire frameworks or "engines" have also been developed to make shipping one of these derivative projects as easy as shipping a Wordpress.org site. You don't rewrite the code; you just use the engine.
And so, if you ask an LLM to build you Facebook, it won't build you Facebook from scratch. It'll just pull in one of those frameworks.
And if you're an "ideas person", you'll think the LLM just did something magical. You won't necessarily understand what a library ecosystem even is; you won't realize the LLM didn't just generate all the code that powers the site itself, spitting out something perfectly functional after just a minute.
> These same "ideas people" have been contracting human software developers to "make them a Facebook clone" (and other requests of similar quality) for decades now.
I was doing an exercise boot camp when mobile apps started to hit big. The instructor came to me saying he needed an app. "Okay, what should the app do?" He had no idea, he just knew he needed an app.
Maybe not viruses much any more, but definitely worms. (And also some automated malicious servers scattered about the Internet that pull lists of devices with certain ports open from Shodan et al, and then repeatedly attempt to attack/penetrate whatever's on those lists.)
There are several videos available on YouTube, of someone connecting a Win9x/2K/XP machine to the modern Internet, waiting just a few minutes, and then observing (through Process Explorer) the silent introduction of various payloads onto the system.
You didn't have a router with dialup, or early DSL, where the modem was a separate device. You'd often get publicly routable IPv4s in your university dorm, too. See also napster. :-)
IMHO, if you're working on large feature changes, before nudging the agent to write any code, it's best to:
1. establish consensus, just in the chat, on the problem domain — i.e. the business-domain problem you're solving (as if the agent is your contact at a software-development contractor, and you're sitting down to pin down what you want from them)
2. co-write with the agent a hierarchical-bullet-pointed design document (this should be an actual .md file, not just in the chat) — letting the agent generate + edit most of this, but nitpicking it thoroughly for problems and decision-vagueness, forcing all design-level decisions to be made up-front here
3. tell it to translate the design spec into a skeleton for a BDD spec test suite, to be populated as it implements
4. let the agent free to actually do the impl — where the agent is free to add/modify/delete unit tests and integration tests and so on, but where it must keep the design-spec file and the structure of the derived BDD spec tests fixed (and, before considering itself done, ensure that A. the BDD spec tests are all fleshed out with proper logic reflecting their labels, and B. they all pass.)
5. At this point you might be done. But if your project is absolutely huge, you might do another "sprint" at this point, starting again from the top by defining new business requirements, amending the design, getting the agent to add to the BDD suite, etc. (Or, if you want to talk everything out up front, you'd insert a step between 2 and 3 of "breaking the design down into milestones" — where the agent will only create BDD spec items for the current milestone, solve for them, get approval, and then move on to the next milestone.)
Yes, I'm basically saying you should do waterfall with LLMs. Waterfall can actually be rather pleasant, when the whole process happens over the course of an hour.
And the key point here, for understanding: after the project (or after each milestone for a large project), you can have the agent walk you through the code it wrote, explaining it to you in the chat — with the constraint that it shouldn't bother to explain anything already "implied" by the design.
You can then have it turn this explanation of "the surprising parts" into code comments — and the resulting comments would actually be of the kind humans would write, rather than being pro-forma garbage!
Or https://www.arduboy.com/. (Though both of these are "more capable" in the sense of kids being able to make their own games for them; not necessarily in the sense of the games having even the fidelity of a Gameboy Color.)
A Tag-Connect footprint has a BOM cost of zero. (Though not a TCO of zero in a closed enterprise ecosystem, as everyone needing to debug the devices needs to buy a stupid $60 pogo-pin cable.)
> Ultimately, the point of hardware attestation isn't to ensure that your device is trusted, but that the action you're trying to perform was done by a human, not a bot doing millions of them per second. It's just another CAPTCHA mechanism in disguise, required because bots have gotten so good at solving the existing ones.
...no? Maybe this is true of end-user device attestation. But there are other use-cases for attestation.
Server device attestation is an entirely different thing. It's used in e.g. IaaS "Confidential VM" offerings, where the audience for the attestation information is the customer, rather than the server host. It's a very pro-privacy / pro-data-sovereignty feature.
And while embedded device attestation is sometimes about preventing customers from tampering with IoT stuff you "sold" them, more often it's about being able to trust and confidently assert that e.g. the climate sensors you've deployed all over a forest as part of a research project haven't been fucked with to report false data by someone with an agenda. (Or to "apply denial" to your unmanned military satellite downlink station the moment you detect that there's some unknown person out there futzing with it.)