It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehensive comparison (there are plenty of benchmarks out there that attempt to be more methodical than vibes).
It particularly shines when you try it on https://claude.ai/ using its brand new Python/Node.js code interpreter mode. Try this prompt and see what happens:
Checkout https://github.com/simonw/llm and run the tests with
pip install -e '.[test]'
pytest
I then had it iterate on a pretty complex database refactoring task, described in my post.
whywhywhywhy 12 hours ago [-]
Kinda pointless listening to the opinions of people who've used previews, because it's not gonna be the same model you'll experience once it gets downgraded to be viable under mass use. And the benchmarks influencers use are all in the training data now and tested internally, so any sort of testing like pelicans on bikes is just PR at this point.
benterix 10 hours ago [-]
Yeah I remember these GPT-5 demos from influencers like "it practically created a whole 3D modeller for me" and then once we got the real thing it sometimes looked like a dumbed down version of the previous iteration.
simonw 9 hours ago [-]
I learned that lesson from GPT-5, where the preview was weeks long and the models kept changing during that period.
This Claude preview lasted from Friday to Monday so I was less worried about major model changes. I made sure to run the pelican benchmark against the model after 10am on Monday (the official release date) just to be safe.
The only thing I published that I ran against the preview model was the Claude code interpreter example.
I continue not to worry about models having been trained to ace my pelican benchmark, because the models still suck at it. You really think Anthropic deliberately cheated on my benchmark and still only managed to produce this? https://static.simonwillison.net/static/2025/claude-sonnet-4...
whywhywhywhy 6 hours ago [-]
Yesterday someone posted an example of the same prompt but changing it to a human, and it was basically trash; the example you've posted actually looks good all things considered. So yeah, I do think it's something they train on, same way they train on things in the benchmarks.
simonw 5 hours ago [-]
The easy way to tell is to try it yourself - run "Generate an SVG of a pelican riding a bicycle" and then try "Generate an SVG of an otter riding a skateboard" and see if the quality of the images seems similar.
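If you'd rather script that comparison than paste prompts into claude.ai, here's a rough sketch using the llm Python library mentioned upthread; the model alias, the llm-anthropic plugin, and an already-configured API key are all assumptions:

```
# Rough sketch, not from the original thread: run both prompts via the
# `llm` Python API (pip install llm llm-anthropic) and compare the results.
# The model alias below and a configured Anthropic API key are assumptions.
import llm

model = llm.get_model("claude-sonnet-4.5")

for subject in ("a pelican riding a bicycle", "an otter riding a skateboard"):
    response = model.prompt(
        f"Generate an SVG of {subject}. Reply with only the SVG markup."
    )
    filename = subject.replace(" ", "-") + ".svg"
    with open(filename, "w") as f:
        f.write(response.text())
    print("wrote", filename)
```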
fragmede 8 hours ago [-]
Well, if they produced a really really really good image for pelicans on bicycles and nothing else, then their cheating would be obvious, so it makes sense to cheat just a little bit, across the board (if we want to assume they're cheating).
icelancer 1 days ago [-]
Huh. Doesn't work over here.
---
SYSTEM:
Got it. Those are the setup commands for Simon Willison's LLM CLI tool, which lets you interact with language models from the command line.
The -e flag installs it in editable mode, and [test] pulls in the test dependencies.
Are you running into any issues with the installation or tests? What are you working on with the LLM library?
USER: Correct. You run them using your code tooling.
SYSTEM:
Realized inability to execute requested Python commands. The user is telling me to run these commands using my code tooling. They want me to actually execute this. Let me think about what tools I have available:
I have the repl tool which runs JavaScript code in a browser environment
I have artifacts for creating code/documents
I have web_search and web_fetch
Clone git repositories
Run shell commands like pip or pytest
Execute Python code in a command-line environment
The repl tool is JavaScript-only and runs in a browser environment, not a server environment where I could run shell commands.
I should clarify that I don't have the capability to run shell commands, install Python packages, or execute pytest. I can only run JavaScript code in a browser environment via the analysis tool.
I need to be direct and clear about this limitation, following the user's preferences to eliminate filler and be straightforward.

I don't have tooling that can run shell commands, clone git repos, install Python packages, or execute pytest. My only code execution environment is a JavaScript browser REPL.
I can't run pip, pytest, or any command-line tools. I can only execute JavaScript code in a sandboxed browser environment.
What were you hoping to accomplish? Do you need help understanding the test setup, or were you expecting different capabilities?
icelancer 1 days ago [-]
When pressed:
---
My execution environment is a browser-based JavaScript sandbox. I don't have Python, Node.js shell access, pip, git, or pytest capabilities.
Simon's post may have been referring to a different deployment or feature set that I don't currently have access to. My only code execution is the JavaScript REPL I just demonstrated.
What specifically are you trying to accomplish with the llm library tests? I can help with that in other ways.
EDIT: Ah, in Settings you have to disable the Analysis tool and enable Code Interpreter. You cannot have both enabled.
Interesting: enable those settings and the Claude on claude.ai becomes Claude Code, and it tries to run everything in the Claude container like it owns the machine. I don't want that.
magicalist 1 days ago [-]
> I told it to Give me a zip file of everything you have done so far—you can explore the contents of the file it made me in this Gist.
For those who don't have time to dig into the gist, did it work and do a good job? I assume yes to at least nominally working or you would have mentioned that, but any other thoughts on the solution it produced?
simonw 1 days ago [-]
I was impressed. Integrating it into my project would require me to spend a bunch more time reviewing it, but it did a better job than I was expecting. The tests it wrote were a particularly useful exploration of the problem space: https://gist.github.com/simonw/f9d0f870e8d1af399a7f366a7f87b...
Snuggly73 1 days ago [-]
Ignoring the tests, the first change was adding a single parent id column, and the second "more complex" refactoring added a few more hash columns to the table (after you'd specified that you wanted them, i.e. not an open-ended question).
It's a very impressive model, but I think we have different views on what is complex.
ethanpil 15 hours ago [-]
Had a similar issue - wanted to get all the files from the response without too much work, so I opened a new tab and vibe coded this in about 4 minutes. Tested it on exactly 1 case, a previous Sonnet 4.5 response, and it worked well.
I was worried for a minute that the implementation wasn't production ready. Thankfully, Claude mentioned it right at the end.
gloosx 1 days ago [-]
I am curious how the sandbox handles potentially malicious code. For example, what would happen if someone tried to run something like a crypto miner or a DDoS script?
simonw 1 days ago [-]
Code running in the Claude environment can't make outbound network connections to anything more than a tiny list of domains (GitHub, PyPI, NPM) so DDoS targets are limited.
If I was Anthropic I'd have measures in place to rate limit network traffic from individual containers.
gloosx 1 days ago [-]
This would be interesting to test in a controlled way, curious what the CPU/time/memory quotas per run, isolation method, and syscall filtering look like.
Would appreciate pointers to any documentation or tests.
mclau153 1 days ago [-]
Could it pip install pygame and make a game playable in browser?
lilouartz 1 days ago [-]
That's cool! Didn't know it was capable of doing that!
fragmede 8 hours ago [-]
Your notes on 4.5 were very interesting, but you asked it a question that only you/someone who already knows the code could ask. I don't though, so I asked it at a higher level: Claude, add tree-structured conversations to https://github.com/simonw/llm. Claude responded with a whole design doc, starting with database schema change (using the same column name even!). https://claude.ai/share/f8f0d02a-3bc1-4b48-b8c7-aa75d6f55021
As I don't know your code, that design doc looks cromulent, but you'd have to read it for yourself to decide how well it did with that higher level of ask.
luckydata 18 hours ago [-]
new models are always magical, let's see how it feels after the cost cutting measures get implemented in 2-3 months.
startupsfail 16 hours ago [-]
safety/security patches
kshacker 24 hours ago [-]
Sorry, I downvoted this by mistake. Got confused between collapse and the down arrow (which I rarely use). Does someone know how to reverse a downvote? I do not see it.
Fishkins 24 hours ago [-]
You should see an "unvote" or "undown" link to the right of the timestamp (i.e. the opposite side from where the vote arrows were). It's fairly subtle.
I get access to previews from OpenAI, Anthropic and Gemini pretty often. They're usually accompanied by an NDA and an embargo date - in this case the embargo was 10am Pacific this morning.
I won't accept preview access if it comes with any conditions at all about what I can say about the model once the embargo has lifted.
dzhiurgis 16 hours ago [-]
Soooo that leaves xAI that had conditions
poopiokaka 1 days ago [-]
[flagged]
0x696C6961 1 days ago [-]
Why do you even care?
dboreham 1 days ago [-]
NDAs often prohibit publishing...the NDA.
IncreasePosts 1 days ago [-]
[flagged]
Redster 1 days ago [-]
Simonw is a cheerful and straightforward AI journalist who likes to show and not just tell. He has done a good job aggregating and documenting the progress of LLM tools and models. As I understand it, OpenAI and Anthropic have both wisely decided to make sure he has up to date info because they know he'll write about it.
Thanks for all your work, Simon! You're my favorite journalist in this space and I really appreciate your tone.
michaelt 23 hours ago [-]
> As I understand it, OpenAI and Anthropic have both wisely decided to make sure he has up to date info because they know he'll write about it.
And the wisest part is if he writes something they don't like, they can cut off that advanced access.
As is the longstanding tradition in games journalism, travel journalism, and suchlike.
simonw 21 hours ago [-]
If they do that I'll go back to writing about them after they ship. Not a big loss for me at all.
tripzilch 8 hours ago [-]
I get it, you would trust yourself if you said that, but it doesn't really matter whether you say it or not. What counts for your ongoing credibility is whether you preface every future blog post with whether you got special access, a special deal, sponsorship, or none of those things.
You're a reviewer. This is how reviewers stay credible. If you don't disclose your relationship with the thing or company you're reviewing, I'm probably better off assuming you're paid.
And if your NDA says you can't write that in your preface, then logically, it is impossible to write a credible review in the first place.
Awesome, thanks a lot, that's important, but ... sorry, I just checked those, and I do think it's better to do it on a per-article basis, because a lot of your audience (I'm guessing) comes from external links, not browsing your website
this is (or should be) a pretty standard thing to do on youtube review channels (that I would trust), and it's not a bad thing to remind people of, on every occasion, plus it can function as a type of "canary" in cases of particularly restrictive NDAs
tootie 1 days ago [-]
Simon has a popular blog, but he's also co-creator of Django and very well-known in the Python community.
knowsuchagency 1 days ago [-]
I like Simon, but he's not a journalist. A journalist would not have gone to OpenAI to glaze the GPT-5 release with Theo. I don't say this to discount Simon -- I appreciate his writing and analysis but a journalist, he isn't.
simonw 1 days ago [-]
I don't call myself a journalist, partly because no publication is paying me to do any of this!
If I had an editor I imagine they would have talked me out of going to the OpenAI office for a mysterious product preview session with a film crew.
Redster 1 days ago [-]
That's a fair point. I feel like he's more than a blogger and am not sure the best term!
LudwigNagasena 1 days ago [-]
An influencer.
kid64 8 hours ago [-]
Guys, he's standing right there
fourthark 22 hours ago [-]
Argh
asadotzler 1 days ago [-]
AI blogger seems more appropriate than journalist.
nchmy 1 days ago [-]
are you aware of any "ai journalists"? Because simonw does great work, so perhaps blogger is what people should aspire towards?
simonw 1 days ago [-]
I actually talk to journalists on the AI beat quite often - I've had good conversations with them at publications including The Economist and NY Times and Washington Post and ArsTechnica.
They're not going to write up detailed reviews of things like the new Claude code interpreter mode though, because that's not of interest to a general enough audience.
Shoot that's what I get for staying off twitter and email for a week. Glad newsletters provide a little bit of a cushion these days but hopefully someone snaps her up.
vasco 18 hours ago [-]
You normally keep up with staffing updates for writers at random internet blogs? That is mind-blowing, I don't think I ever even read the name of the author of an article intentionally, and when I do it by mistake I forget it 2 webpages down the road.
4ggr0 16 hours ago [-]
i've never used twitter myself, but isn't that its purpose? follow people you like because of what they do and get informed by themselves about what happens behind the curtains. OP mentioned being off twitter, maybe they follow the author there and would've seen a tweet about it.
rapfaria 1 days ago [-]
His "pelican riding a bicycle" tests are now a classic and AI shops are benchmaxxing for it
simonw 1 days ago [-]
They need to benchmaxxx a whole lot harder, the illustrations still all universally suck!
lxgr 24 hours ago [-]
I fully expect a model to output an SVG made up of 1000x1000 rectangles (i.e. pixels) representing a raster image of a beautifully hand-drawn pelican riding a bicycle any day now :)
simonw 23 hours ago [-]
I got an amazing result from ChatGPT a while back - an SVG with a perfect illustration of a pelican riding a bicycle.
It was suspiciously good in fact... so I downloaded the SVG file and found out it had generated a raster image with its image tool and then embedded it as base64 binary image data inside an SVG wrapper!
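For anyone curious what that trick looks like, here's a minimal sketch that wraps a raster file in an SVG shell via a base64 data URI. The filenames are invented; this just shows the pattern, not what ChatGPT actually produced:

```
# Minimal sketch of the "raster in an SVG wrapper" trick described above.
# The pelican.png filename is invented; this just shows the pattern.
import base64

with open("pelican.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")

svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="512" height="512">\n'
    f'  <image width="512" height="512" href="data:image/png;base64,{b64}"/>\n'
    "</svg>\n"
)

with open("pelican-wrapped.svg", "w") as f:
    f.write(svg)  # renders the raster image, but it is not a real vector drawing
```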
dhhugley 22 hours ago [-]
You’ll just have to move the goalpost then; perhaps it can be a multidimensional pelican saving the multiverse, or an invisible pelican that only you can see and critique.
lxgr 14 hours ago [-]
How would that help, given that ChatGPT has apparently already figured out how to consistently and systematically game the benchmark by working in pixel space and only using SVG as a wrapper for a raster image?
FWIW, I could totally see a not hugely more advanced model using its native image generation capabilities and then running a vector extraction tool on it, maybe iteratively. (And maybe I would not consider that cheating, anymore, since at some point that probably resembles what humans do?)
sixeyes 14 hours ago [-]
I've gotten such pixelated-rectangle SVGs a few times.
Also with Cursor: "write me a script that outputs X as an SVG" has given me rectangles a few times.
astrange 1 days ago [-]
If they were testing that it'd work more often.
Other things you can ask that they're still clearly not optimizing for are ASCII art and directions between different locations. Complete fabrications 100% of the time.
Sharlin 1 days ago [-]
Well, I definitely hope they aren't trying to teach LLMs directions between locations, given what an idiotic use of compute and parameter space that would be. We already have excellent AIs for route planning. What they ought to optimize for is, of course, finally teaching them to say they don't know, or just automatically opting to call a route-planning API if the user asks for directions.
minimaxir 1 days ago [-]
Simon tends to write up reports of new LLM releases (with great community respect) and it's much easier with lead time if the provider is able to set up a preview endpoint.
criddell 7 hours ago [-]
I believe the criticism is that he's reporting on a pre-release LLM which isn't the same as the one you and I are going to be using a few weeks from now after they've downgraded it enough to work at scale.
runjake 9 hours ago [-]
simonw is Simon Willison, who’s well known for a number of things. But these days, he’s well known for his AI centric blog and his tools. The AI companies give him early access to stuff.
The same reason YouTube reviewers and influencers get access to hardware or games before release. In this case, the person is a passionate blogger.
kissgyorgy 19 hours ago [-]
If you want to keep up with AI progress and model updates, simonw is the man to follow!
lomase 1 days ago [-]
They are an AI evangelist who told me I could replace any technical book with one created by an LLM.
They are a nice person.
rhizome 1 days ago [-]
You are correct, sir!
mvdtnz 1 days ago [-]
[flagged]
dang 19 hours ago [-]
Could you please stop breaking the site guidelines? You've been doing it repeatedly, we've asked you to stop several times, and you haven't stopped yet.
He is very interested in the topic and sometimes too enthusiastic, but saying he shills is not productive, and a lie.
iagooar 1 days ago [-]
Anecdotal evidence.
I have a fairly large web application with ~200k LoC.
Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI).
"implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay"
Sonnet 4.5 went really fast at ~3min. But what it built was broken and superficial. The code did not even manage to reuse the already existing auth and started re-building auth server-side instead of looking at how other API endpoints do it. Even re-prompting and telling it how it went wrong did not help much. No tests were written (despite the project rules requiring it).
GPT-5-Codex needed MUCH longer, ~20min. The changes made were much more profound: it implemented proper error handling and lots of edge cases, and wrote tests without me prompting it to do so (project rules already require it). API calls ran smoothly. The entire feature worked perfectly.
My conclusion is clear: GPT-5-Codex is the clear winner, not even close.
I will take the 20mins every single time, knowing the work that has been done feels like work done by a senior dev.
The 3mins surprised me a lot and I was hoping to see great results in such a short period of time. But of course, a quick & dirty, buggy implementation with no tests is not what I wanted.
Implicated 1 days ago [-]
I'm not trying to be offensive here, feel the need to indicate that.
But that prompt leads me to believe that you're going to get rather 'random' results due to leaving SO much room for interpretation.
Also, in my experience, punctuation is important - particularly for pacing and grouping of logical 'parts' of a task - and your prompt reads like a run-on sentence.
Making a lot of assumptions here - but I bet if I were in your shoes and looking to write a prompt to start a task of a similar type that my prompt would have been 5 to 20x the length of yours (depending on complexity and importance) with far more detail, including overlapping of descriptions of various tasks (ie; potentially describing the same thing more than once in different ways in context/relation to other things to establish relation/hierarchy).
I'm glad you got what you needed - but these types of prompts and approaches are why I believe so many people think these models aren't useful.
You get out of them what you put into them. If you give them structured and well written requirements as well as a codebase that utilizes patterns you're going to get back something relative to that. No different than a developer - if you gave a junior coder, or some team of developers the following as a feature requirement: `implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay` then you can't really be mad when you don't get back exactly what you wanted.
edit: To put it another way - spend a few more minutes on the initial task/prompt/description of your needs and you're likely to get back more of what you're expecting.
iagooar 1 days ago [-]
I think that is an interesting observation and I generally agree.
Your point about prompting quality is very valid and for larger features I always use PRDs that are 5-20x the prompt.
The thing is, my "experiment" is one that represents a fairly common use case: this feature is actually pretty small and embeds into a pre-existing UI structure - in a larger codebase.
GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results. It not only works on the first try; Codex is reliably better at understanding the context and doing the things that are common and best practice in professional SWE projects.
If I want to get something comparable out of Claude, I would have to spend at least 20mins preparing the prompt. If not more.
Implicated 1 days ago [-]
> The thing is my "experiment" is one that represents a fairly common use case
Valid as well. I guess I'm just nitpicking based on how much I see people saying these models aren't useful combined with seeing this example, triggered my "you're doing it wrong" mode :D
> GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results.
I have a reputation with family and co-workers of being quite verbose - this might be why I prefer Claude (though I haven't tried Codex in the last month or so). I'm typically setting up context and spending a few minutes writing an initial prompt and iterating/adjusting on the approach in planning mode so that I _can_ just walk away (or tab out) and let it do its thing, knowing that I've already reviewed its approach and have a reasonable amount of confidence that it's taking an approach that seems logical.
I should start playing with codex again on some new projects I have in mind where I have an initial planning document with my notes on what I want it to do but nothing super specific - just to see what it can "one shot".
stingraycharles 22 hours ago [-]
Yeah, as someone who has been using Claude Code for about 4 months now, I’ve adopted a “be super specific by default”-workflow. It works very well.
I typically use zen-mcp-server’s planning mode to scope out these tasks, refine and iterate on a plan, clear context, and then trigger the implementation.
There’s no way I would have considered “implement fuzzy search” a small feature request. I’m also paranoid about introducing technical debt / crappy code, as in my experience that's the #1 reason LLMs typically work well for new projects but start to degrade after a while: there’s just a lot of spaghetti and debt built up over time.
Aeolun 1 days ago [-]
I tend to tell claude to research what is already there, and think hard, and that gives me much better per-prompt results.
But you are right that codex does that all by default. I just get frustrated when I ask it something simple and it spends half an hour researching code first.
j_bum 1 days ago [-]
This makes me think that for simple things, we need to anti-prompt — tell the model to not overthink things.
wahnfrieden 22 hours ago [-]
Some do this by using tools like RepoPrompt to read entire files into GPT-5 Pro, and then using GPT-5 Pro to send the relevant context and work plan to Codex so that it can skip needing to poke around files. If you give it the context, it won't spend that time looking for it. But then you spend time with Pro instead (though Pro can ingest entire files at once rather than searching through them, and provide a better plan for Codex).
user_7832 19 hours ago [-]
I would think that to truly rank such things, you should run a few tests and look for a clear pattern. It's possible that something prompted Claude to take "the easy way" while ChatGPT didn't.
hackernewds 7 hours ago [-]
Your anecdata is not an "experiment" especially to derive such a broad conclusion :)
andai 14 hours ago [-]
It worked on the first try, but did it work on the second?
I noticed in conversations with LLMs, much of what they come up with is non-deterministic. You regenerate the message and it disappears.
That appears to be the basic operating principle of the current paradigm. And agentic programming repeats this dice roll dozens or hundreds of times.
I don't know enough about statistics to say if that makes it better (converging on the averages?) or worse (context pollution, hallucinating, focusing on noise?), but it seems worth considering.
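A quick way to see that dice roll for yourself, sketched with the llm Python API that came up earlier in the thread (the model alias and a configured API key are assumptions on my part):

```
# Quick sketch of the dice roll: send the same prompt twice and compare.
# Uses the `llm` Python API from upthread; the model alias and a configured
# API key are assumptions.
import llm

model = llm.get_model("claude-sonnet-4.5")
prompt = "Name one surprising edge case in fuzzy search, in one sentence."

first = model.prompt(prompt).text()
second = model.prompt(prompt).text()

print("identical answers:", first == second)  # usually False at default temperature
```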
GoatInGrey 1 days ago [-]
This would explain the LLM implementing the feature in a way you didn't prefer. But this does not explain why Sonnet would deliver a broken implementation that does not work in even the most basic sense.
Also, there is a threshold in which the time it takes to develop a prompt, allow the agent to run, review its output, and go through iterative loops to correct errors or implementation problems, can exceed the time it takes me (a lazy human) to achieve the same end result.
Pair this with the bypassing of the generation effect, reduced prefrontal dopamine, and increased working memory load (in part due to minimal motor-cognitive integration), and AI-generated code in contexts with legal and financial repercussions can be a much worse deal than using your own fingers.
benjiro 10 hours ago [-]
> But this does not explain why Sonnet would deliver a broken implementation that does not work in even the most basic sense.
Depends not just on the prompt but also the tooling / environment you use. Somebody using the Claude Code CLI may get a totally different experience than somebody using Copilot via VS Code.
What do i mean by that? Look at how Copilot tries to save money by reading content only in small parts. Reading file X line 1-50, X line 51-100, ... And it starts working with this. Only if it finds a hint about something somewhere else, it will read in more context.
What I often see is that it misses context because it reads in such limited information, and if there is no hint in your code or code doc, it will stop there. It tries to run a local test on the code, it passes, done... while it has technically broken your application.
Example: If I tell it to refactor an API, it never checks if that API is used anywhere else, because it only reads in that API code. So I need to manually add a reminder to the prompt, "the API is used in the system". And then it does its searching /... Found 5 files, Read X line 1...
And plop, good working code ... So if you know this limitation, you can go very far with a basic $10 Copilot Claude Agent usage.
Whereas a $200 Claude Code will give you a better experience out of the box, as it reads in a ton more. The same applies to GPT-5/Codex, which seems to be more willing to read in larger context of your project, thus resulting in less incomplete code.
This is just anecdotal from my point of view, but like with any LLM, hinting matters a lot. It's less about writing a full prompt with a ton of text and more about including the right "do not forget about function name X, and module Y, and test Z". And Claude on Copilot loves those hints because of that limited reading.
pton_xd 1 days ago [-]
> I bet if I were in your shoes and looking to write a prompt to start a task of a similar type that my prompt would have been 5 to 20x the length of yours
Why would you need such extensive prompting just to get the model to not re-implement authentication logic, for example? It already has access to all of the existing code, shouldn't it just take advantage of what's already there? A 20x longer prompt doesn't sound like a satisfying solution to whatever issue is happening here.
paool 1 days ago [-]
> shouldn't it just take advantage of what's already there?
It's not a good idea to have any coding agent put an unnecessary number of lines into the context window in order to understand your code base.
Performance of all LLMs drops drastically when the context window is filled or full. The purpose of being more specific with your prompts is that you spend a little bit more tokens up front to make the task a lot more efficient and more likely to result in success.
At least that's how it is today. We're probably a breakthrough or two away from the type of vibe coding experience non-coders want. Or it may never happen, and the developers who have coding knowledge will be the only ones to fully utilize coding agents and it will only become more powerful over time.
wyre 22 hours ago [-]
I'm not sure exactly what you mean by the vibe coding experience non-coders want, but if it's one-shotting a buildable codebase off of an unspecific prompt, the major breakthrough would have to be brain-computer interfaces so the agent can literally read the user's mind.
If that same person approached a software development company with the same prompt, without following up with any other details, they won't get good code back either. You're not saying it, but this idea that in the future you can tell a computer something like "create Photoshop" and get what you're expecting is an unrealistic dream that would need mind-reading, or a major breakthrough and paradigm shift in understanding and interpreting language.
nosianu 16 hours ago [-]
> the major breakthrough would have to be brain-computer interfaces so the agent can literally read the user's mind.
And even that would not be enough.
In reality, it would have to put the user to sleep and go through various dream scenarios to have the user's brain really build an internal model that is not there in the first place. No brain interface can help find what is not there.
We usually need interactions with reality to build the internal model of what we actually want step by step, especially for things we have not done before.
Even for info that is there, that's also a limit to fantasy or sci-fi brain scanning. The knowledge is not stored like in a RAM chip, even when it is there. You would have to simulate the brain to actually go through the relevant experiences to extract the information. Predicting the actual dynamic behavior of the brain would require some super-super sub-molecular level scan and then correctly simulating that, since what the neurons will actually do depends on much more than the basic wiring. Aaaaand you may get a different result depending on time of day, how well they slept, mood and when and what the person ate and what news they recently read, etc. :)
Implicated 1 days ago [-]
Well, I don't have the context myself about what's happening in this example, though I don't see anything about auth.
And I left that window at 5-20x because, again, no real context. But unless I was already in the middle of a task and I was giving direction that there was already context for - my prompt is generally almost never _this_ short. (referring to the prompt in the top level comment)
> A 20x longer prompt doesn't sound like a satisfying solution to whatever issue is happening here.
It wouldn't be, given the additional context given by the author in a sibling comment to yours. But if you had specific expectations on the resulting code/functionality that 20x longer prompt is likely to save you time and energy in the back and forth adjustments you might have to make otherwise.
nikcub 22 hours ago [-]
> but these types of prompts and approaches are why I believe so many people think these models aren't useful.
100% agree. The prompt is a 'yolo prompt'. For that task you need to give it points on what to do so it can deduce its task list, provide files or folders in context with @, tell it how to test the outcome so it knows it has succeeded (closing the feedback loop), and guide it in implementation either via memory or via context with which existing libs or methods it should call on.
For greenfield tasks and projects I even provide architectural structure, interfaces, etc.
After reading twitter, reddit and hn complaints about models and coding tools I've come to the same conclusion as you.
jen729w 20 hours ago [-]
The fact remains, however: ChatGPT did it. Claude did not.
nosianu 16 hours ago [-]
That fact is pretty useless for drawing conclusions from: one random, not-so-great example. Yes, it's an experiment and we got a result. And now what? If I want reliable work results I would still go with the strategy of being as concrete as possible, because in all my AI activities, anything else makes the results more and more random. For anything non-standard (i.e. not something you could copy & paste directly from a Google or SO result), no matter how simple, I'd better provide the base step-by-step algorithm myself and only leave the actual implementation to the AI.
jen729w 14 hours ago [-]
My parent said:
> For that task you need to give it points on what to do so it can deduce its task list, provide files or folders in context with @…
- and my point is that you do not have to give ChatGPT those things. GP did not, and they got the result they were seeking.
That you might get a better result from Claude if you prompt it 'correctly' is a fine detail, but not my point.
(I've no horse in this race. I use Claude Code and I'm not going to switch. But I like to know what's true and what isn't and this seems pretty clear.)
ashdksnndck 1 days ago [-]
Sounds like you would be still writing your prompt when they already got a good implementation back from GPT.
motoxpro 1 days ago [-]
I agree. The sibling responses also give some insight into why a lot of people don't get a lot of value out of coding agents.
SirMaster 1 days ago [-]
But isn't the end goal to be able to get useful results without so much prompting?
I mean in the movies for example, advanced AI assistants do amazing things with very little prompting. Seems like that's what people want.
To me, the fact that so many people basically say "you are prompting it wrong" is knock against the tech and the model. If people want to say that these systems are so smart at what they can do, then they should strive to get better at understanding the user without needing tons of prompts.
Do you think his short prompt would be sufficient for a senior developer? If it's good enough for a human it should be good enough for a LLM IMO.
I don't want to take away the ability to use tons of prompting to get the LLM to do exactly what you want, but I think that the ability for an LLM to do better with less prompting is actually a good thing and useful metric.
Implicated 1 days ago [-]
> But isn't the end goal to be able to get useful results without so much prompting?
See below about context.
> I mean in the movies for example, advanced AI assistants do amazing things with very little prompting. Seems like that's what people want.
Movies != real life
> To me, the fact that so many people basically say "you are prompting it wrong" is knock against the tech and the model. If people want to say that these systems are so smart at what they can do, then they should strive to get better at understanding the user without needing tons of prompts.
See below about context.
> Do you think his short prompt would be sufficient for a senior developer? If it's good enough for a human it should be good enough for a LLM IMO.
Context is king.
> I don't want to take away the ability to use tons of prompting to get the LLM to do exactly what you want, but I think that the ability for an LLM to do better with less prompting is actually a good thing and useful metric.
What I'm understanding from your comments here are that you should just be able to give it broad statements and it should interpret that into functional results. Sure - that works incredibly well, if you provide the relevant context and the model is able to understand and properly associate it where needed.
But you're comparing the LLMs to humans (this is a problem, but not likely to stop so we might as well address it) - but _what_ humans? You ask if that prompt would be sufficient for a senior developer - absolutely, if that developer already has the _context_ of the project/task/features/etc. They can _infer_ what's not specified. But if you give that same prompt to a jr dev who maybe has access to the codebase and has poked around inside the working application once or twice but no real in depth experience with it - they're going to _infer_ different things. They might do great, they might fail spectacularly. Flip a coin.
So - with that prompt in the top level comment - if that LLM is provided excellent context (via AGENTS.md/attached files/etc) then it'll do great with that prompt, most likely. Especially if you aren't looking for specifics in the resulting feature outside of what you mentioned since it _will_ have to infer some things. But if you're just opening codex/CC without a good CLAUDE.md/AGENTS.md and feeding it a prompt like that you have to expect quite a bit of variance to what you get - exactly the same way you would a _human_ developer.
Your context and prompt are the project spec. You get out what you put in.
DrewADesign 1 days ago [-]
> But you're comparing the LLMs to humans
These things are being marketed as super intelligent magic answer machines. Judging them using the criteria the marketing teams have provided is completely reasonable.
> Movies != real life
Nobody claimed it was. This is about desires and expectations. The people charging money for these services and taking stacks of cash that would’ve otherwise been in in dev’s paychecks while doing so haven’t even tried to temper those expectations. They made their beds…
gejose 23 hours ago [-]
> But you're comparing the LLMs to humans
Didn't the parent comment compare Sonnet 4.5 with GPT-5-Codex?
dboreham 1 days ago [-]
Quick data point: I've been able to get LLMs (recently whatever one Claude gives me) to produce amazingly useful results for the purpose of understanding complex codebases, just by asking them to look at the code and tell me how it does xyz. No complicated long prompt. Basically exactly what I'd say to a human.
g42gregory 1 days ago [-]
I have to agree with this assessment. I am currently going at the rate of 300-400 lines of spec for 1,000 LOC with Claude Code. Specs are AI-assisted also, otherwise you might go crazy. :-) Plus 2,000+ lines of AI-generated tests. Pretty restrictive, but then it works just fine.
conradfr 16 hours ago [-]
No true Promptman.
hackernewds 7 hours ago [-]
Yes to punctuation
After all, attention is all you need :)
pembrook 1 days ago [-]
You're critiquing OP for not playing along with how the models currently work (bad at gathering context on their own). Sure, if you bend over backwards and hop on one foot, you can get them to do what you want.
OP is critiquing the model as a product vs. the marketing promises. The model should be smart enough to gather context about the project to implement features properly on their own, if they are ever going to 'write 90% of all code THIS YEAR' as people like the founder of Anthropic claim.
rossjudson 1 days ago [-]
When asking for change, there are the choices you know about and the ones you don't. I've gotten in the habit of describing some of the current state as well as my desired state, and using that to focus the LLM on the areas I'd like to have a stronger voice in.
Of course, I definitely appreciate when it makes choices that I don't know I need to make, and it chooses reasonable defaults.
I mean, I could say "make the visualization three columns", but there's a lot of ambiguity in that kind of thing, and the LLM is going to make a lot of choices about my intent.
Instead, "each team/quarter currently has a vertically stacked list of people assigned to that team, with two columns (staying on team, leaving team). change that to being three columns (entering team, staying on team, leaving team)."
As a bonus, it's much, much clearer to somebody reading the history later what the intent was.
kelvinjps 1 days ago [-]
How would you have written the prompt?
Implicated 1 days ago [-]
tbh, I don't really understand it well enough to be able to give a response here. But here's a real prompt I just used on a project copy/pasted:
```
Something that seems to have been a consistent gotcha when working with LLMs on this project is that there's no specific `placement` column on the table that holds the 'results' data. Our race_class_section_results table has its rows created in placement order - so it's inferred via the order relative to other records in the same race_class_section. But this seems to complicate things quite a bit at times when we have a specific record/entry and want to know its placement - we have to query the rest of them and/or include joins and other complications if we want to filter results by the placements, etc.

Can you take a look at how this is handled, both with the querying of existing data by views/livewire components/etc and how we're storing/creating the records via the import processes, and give me a determination on whether you think it should be refactored to include a column on the database for `placement`? I think right now we've got 140,000 or so records on that table and it's got nearly 20 years worth of race records, so I don't think we need to be too concerned with the performance of the table or added storage or anything. Think very hard, understand that this would be a rather major refactor of the codebase (I assume, since it's likely used/referenced in _many_ places - thankfully though most of the complicated queries it would be found in would be easily identified by just doing a search of the codebase for the race_class_section_results table) and determine if that would be worth it for the ease of use/query simplification moving forward.
```
This comes with a rather developed CLAUDE.md that includes references to other .md documents that outline various important aspects of the application that should be brought into context when working in those areas.
This prompt was made in planning mode - the LLM will then dig into the code/application to understand things and, if needed, ask questions and give options to weigh before returning with a 'plan' on how to approach. I then iterate on that plan with it before eventually accepting a plan that it will then begin work on.
Szpadel 1 days ago [-]
That's kind of expected for me, but Codex feels more like a vibe coding tool and Claude Code more like AI-assisted development.
And I actually like Claude more because of that.
Codex will indeed work more independently, but you will have a hard time when it isn't what you want. It will use a Python script just to do simple edits in files (not search-and-replace for unique code snippets in small files), and when it's wrong, good luck convincing it (it already has some outdated info, like on the latest Docker image releases, and convincing it that the Debian base changed was challenging).
It uses context more effectively, but it lacks explanation of why it is doing what it is doing; asking it to explain will just cause it to do something else without a word.
And of course there's the lack of proper permissions for running commands. The sandbox is cool, but I do not want it to be able to commit; I want it to just edit files, or at least have some more control over what it does.
You can run Codex as an MCP server; I prefer adding it to Claude and asking them to do a cooperative plan. Codex will do a great analysis and plan, and I can comfortably work with Claude on code that matches my style.
alecco 1 days ago [-]
Try adding ultrathink to the prompt, hit enter, and play this song in the background:
Sidenote, Claude Code now has a thinking toggle. Maybe it did already, but that's new to me. I see better results by just leaving that toggle on.
epolanski 1 days ago [-]
There are a few issues with posts like yours:
1. Different LLMs require different prompts and information
2. They ignore LLMs' non-determinism; you should run the experiment several times
iagooar 23 hours ago [-]
Oh and I agree so much. I just shared a quick first observation in a real-world testing scenario (BTW re-ran Sonnet 4.5 with the same prompt, not much changed). I just keep seeing how LLM providers keep optimizing for benchmarks, but then I cannot reproduce their results in my projects.
I will keep trying, because Claude 4 generally is a very strong line of models. Anthropic had been on the AI coding throne for months before OpenAI, with GPT-5 and Codex CLI (and now GPT-5-Codex), dethroned them.
And sure I do want to keep them competing to make each other even better.
mexicocitinluez 9 hours ago [-]
What would be the difference in prompts/info for Claude vs ChatGPT? Is this just based on anecdotal stuff, or is there actually something I can refer to when writing prompts? I mostly use Claude, but don't really pay much attention to the exact wording of the prompts.
bastawhiz 9 hours ago [-]
I must be using Codex wrong. I evaluated it with a task to do a pretty simple, mechanical string replacement across many files (moving from a prop spread in JSX to standard props, but only when the value being spread is a subscript of an object). It simply couldn't do it, and it wasn't even close. It was getting the syntax wrong, trying to fix it by deleting the code, then moving on to other files. Sonnet 4.1 wasn't perfect, but I was able to point out its errors and it fixed them and avoided doing it again.
I will say, Claude does seem to need a verbose prompt. Often I'll write my prompts as tasks in Notion and have it pull them via MCP (which is nice, because it tracks and documents its work in the process). But once you've given it a few paragraphs about the why and how, my experience is that it's pretty self sufficient. Granted, I'm using Cursor and not CC; I don't know if that makes much of a difference.
lkey 9 hours ago [-]
Codex cannot fail, it contains multitudes beyond your imagining. Nay, it can only be failed. Continue internalizing that the problem is you, not the tool. Perhaps a small infusion of one trillion USD would unlock it and your potential?
Sincerely,
TPG Dexoc, the living human person
robotswantdata 1 days ago [-]
I had a similar experience; not good enough yet to come back for the Claude Max plan.
Sticking with ChatGPT pro sub and gpt5 codex on high.
mmgeorgi 1 days ago [-]
do you ever hit your pro quota?
robotswantdata 1 days ago [-]
Never hit pro quota yet, huge repo. Have multiple projects on the go locally and in cloud.
Feel like this is going to be the $1000 plan soon
JofArnold 1 days ago [-]
I'm thinking about switching to ChatGPT Pro also. Any idea what maxes it out before I need to pay via the API instead? For context I'm using about 1b tokens a month so likely similar to you by the sounds of things.
robotswantdata 16 hours ago [-]
On pro tier have not been able to trigger the usage cap.
Pro
Local tasks: Average users can send 300-1,500 messages every 5 hours with a weekly limit.
Cloud tasks: Generous limits for a limited time.
Best for: Developers looking to power their full workday across multiple projects.
JofArnold 14 hours ago [-]
Thank you, that's very helpful. I think I could get close to that in some coding sessions where I'm running multiple in parallel but I suspect it's very very rare. Even with token efficient gpt5-codex my OpenAI bill is quite high so I think I will switch to Pro now.
paul_manias 23 hours ago [-]
Same experience here. In the last week I've successfully managed to build a complete C++20 XPath 1.0 parser with Codex, and am now onto supporting XPath 2.0. Codex has been nailing it time and again - the only caveat is that I have to use their cloud version as local execution is buggy.
Sonnet on the other hand gets tripped up constantly due to the complexity. I'm not seeing the improvement I was hoping for with 4.5, and it's just given up on attempting to implement support for date-time functionality. Codex has nailed the same task, yet Anthropic claim OpenAI have the inferior product?
I'm pretty sad about this as I'm gunning for Anthropic and would have loved to see them at least match Codex with this new release. If OpenAI stays on this course then Claude Code is toast without an important breakthrough. It doesn't help that they're also the more expensive product and have service quality issues pushing people to quit the platform.
bobbylarrybobby 5 hours ago [-]
It's also my experience that Claude loves to reimplement the wheel instead of reading code to look for an existing implementation of what it wants to do.
Huppie 38 minutes ago [-]
I've been working with Opus 4 on ultrathink quite a bit recently and did some quick tests with Sonnet 4.5. I'm fairly impressed, especially with its speed, but I did feel it was a lot less strict with my rules, existing patterns, etc. compared to Opus 4.
Maybe it's better with a better CLAUDE.md structure? I don't use those a lot, just telling Opus to think got 'good enough' results I guess. Not sure.
I hope there's an Opus 4.5 coming out soon too. In the meantime I'll see if I can get it to do better with some extra prompting, or I'll go back to Opus if I don't need the speedier responses.
devoutsalsa 18 hours ago [-]
My first thought was I bet I could get Sonnet to fix it faster because I got something back in 3 minutes instead of 20 minutes. You can prompt a lot of changes with a faster model. I'm new to Claude Code, so generally speaking I have no idea if I'm making sense or not.
jpfromlondon 14 hours ago [-]
Interesting, in my experience Claude usually does okay with the first pass, often gets the best visual/UI output, but cannot improve beyond that even with repeated prompts and is terrible at optimising; GPT is almost the opposite.
hackernewds 7 hours ago [-]
Anecdotes are pointless in this domain.
dinobones 23 hours ago [-]
Are you paying for ChatGPT pro? Does that include codex CLI usage?
Claude Code on max plan is why I use Sonnet/Opus, but if ChatGPT pro includes codex usage I may switch.
danmaz74 21 hours ago [-]
You can use codex CLI on a measly Plus plan
kordlessagain 1 days ago [-]
Claude needs a lot of context to keep it on rails IMHO. And, Claude Code is a closed source piece of crap agent framework.
fluidcruft 1 days ago [-]
I've tried codex with GPT-5 a little bit and I haven't figured out how to get it to not be annoying. codex just constantly tries to gaslight and argue with me. For example, I was debugging an OpenGL render pipeline that went black and codex insisted it must be because I was ssh'd into a headless server. It really makes me appreciate the Claude "You're absolutely right!"s. Anyway as you can tell, I haven't cracked working with codex. But at the moment it just messes everything up and ways I've learned to work with claude don't seem to translate.
dbbk 1 days ago [-]
Did you use plan mode?
iagooar 1 days ago [-]
Yes, I did.
I ran the test again, took Claude ~4mins this time. There was no error now with the auth, but the functionality was totally broken. It could not even find the most basic stuff that matches perfectly.
Szpadel 1 days ago [-]
try also adding Serena mcp and make sure you did /init
j45 19 hours ago [-]
Curious if Opus would have done it differently than GPT-5-Codex.
Also, trying GPT-5-Codex
artursapek 1 days ago [-]
I think Codex working for 20 mins uninterrupted is actually a strength. It’s not “slow” as critics sometimes say - it’s thorough and autonomous. I can actually walk away and get something else done around the house while it does my work for me.
streetmeat 1 days ago [-]
I swear cc in June/July used to spend a lot more time on tasks and felt more thorough like codex does now. Hard to remember much past the last week in this world though.
artursapek 24 hours ago [-]
Yes I agree. I think Anthropic nerfed it considerably.
teekert 1 days ago [-]
[flagged]
iagooar 1 days ago [-]
I even added a disclaimer "anecdotal evidence". Believe me, I am not the biggest fan of Sam. I just happen to like the best tools available, have used most of the large models and always choose the one that works best - for me.
manofmanysmiles 23 hours ago [-]
I haven't shouted into the void for a while. Today is as good a day as any other to do so.
I feel extremely disempowered that these coding sessions are effectively black box, and non-reproducible. It feels like I am coding with nothing but hopes and dreams, and the connection between my will and the patterns of energy is so tenuous I almost don't feel like touching a computer again.
A lack of determinism comes from many places, but primarily:
1) The models change
2) The models are not deterministic
3) The history of tool use and chat input is not available as a first-class artifact for use.
I would love to see a tool that logs the full history of all agents that sculpt a codebase, including the inputs to tools, tool versions, and any other sources of entropy. Logging the seeds fed into the RNGs that drive LLM output would be the final piece that would give me confidence to consider using these tools seriously.
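Something like this minimal sketch is the kind of breadcrumb trail I have in mind. The field names are made up, and of course it cannot capture entropy hidden inside a hosted model:

```
# Minimal sketch of such a breadcrumb log: append every step to a JSONL file.
# Field names are illustrative, and this obviously cannot capture entropy
# hidden inside a hosted model.
import hashlib
import json
import time

LOG_PATH = "agent-session.jsonl"

def log_step(model_id: str, prompt: str, params: dict, response: str) -> None:
    record = {
        "ts": time.time(),
        "model": model_id,
        "params": params,  # e.g. {"temperature": 0, "seed": 42}, if the API exposes them
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "response": response,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
```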
I write this now after what I am calling "AI disillusionment": a state where I feel so disconnected from my codebase I'd rather just delete it than continue.
Having a set of breadcrumbs would give me at least a modicum of confidence that the work was reproducible and not the product of some modern ghost, completely detached from my will.
Of course this would require actually owning the full LLM.
coolfox 17 hours ago [-]
> A lack of determinism comes from many places, but primarily: 1) The models change 2) The models are not deterministic...
models themselves are deterministic, this is a huge pet peeve of mine, so excuse the tangent, but the appearance of nondeterminism comes from a few sources, but imho can be largely attributed to the probabilistic methods used to get appropriate context and enable timely responses. here's an example of what I mean, a 52-card deck. The deck order is fixed once you shuffle it. Drawing "at random" is a probabilistic procedure on top of that fixed state. We do not call the deck probabilistic. We call the draw probabilistic. Another example, a pot of water heating on a stove. Its temperature follows deterministic physics. A cheap thermometer adds noisy, random error to each reading. We do not call the water probabilistic. We call the measurement probabilistic.
Theoretical physicists run into such problems, albeit far more complicated, and the concept for how they deal with them is called ergodicity. The models at the root of LLMs do exhibit ergodic behavior; the time average and the ensemble average of an observable are identical, i.e. the average response of a single model over a long duration and the average of many similar models at a fixed moment are equivalent.
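A toy illustration of that split, in plain NumPy and unrelated to any particular LLM stack: the logits (the "deck") are fixed; only the draw on top of them is random. The numbers are made up.

```
# Toy illustration, plain NumPy: the logits (the "deck") are fixed and
# deterministic; only the draw on top of them is random. The numbers are made up.
import numpy as np

logits = np.array([2.0, 1.0, 0.5, -1.0])        # fixed model output for some context
probs = np.exp(logits) / np.exp(logits).sum()   # softmax

greedy = int(np.argmax(probs))                  # always the same token
rng = np.random.default_rng()                   # unseeded: the "shuffle"
sampled = int(rng.choice(len(probs), p=probs))  # varies from run to run

print("greedy:", greedy, "sampled:", sampled)
```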
quietbritishjim 8 hours ago [-]
The previous poster is correct for a very slightly different definition of the word "model". In context, I would even say their definition is the more correct one.
They are including the random sampler at the end of the LLM that chooses the next token. You are talking about up to, but not including, that point. But that just gives you a list of possible output tokens with values ("probabilities"), not a single choice. You can always just choose the best one, or you could add some randomness that does a weighted sample of the next token based on those values. From the user's perspective, that final sampling step is part of the overall black box that is running to give an output, and it's fair to define "the model" to include that final random step.
lkey 9 hours ago [-]
The User:
The stove keeps burning me because I can't tell how hot it is, it feels random and the indicator light it broken.
You:
The most rigorous definition of temperature is that it is equal to the inverse of the rate of change of entropy with respect to internal energy, within a given volume V and particles N held constant.
All accessible microstates are equiprobable over a long period of time, this is the very definition of ergodicity! Yet, because of the flow of entropy the observed macrostates will remain stable. Thus, we can say that the responses of a given LLM are...
The User:
I'm calling the doctor, and getting a new stove with an indicator light.
hackernewds 7 hours ago [-]
Here is a rigorous scientific nail-down and solution for non-determinism in LLM outputs (from Mira Murati's new outfit, but really, credit the author)
I share the sentiment. I would add that the people I would like to see use LLMs for coding (and other technical purposes) tend to be jaded like you, and the people I personally wouldn't want to see use LLMs for that tend to be pretty enthusiastic.
stillsut 8 hours ago [-]
I've been building something like this, a markdown that tracks your prompts, and the code generated.
If you care about this so much why don't you use one of the open source OpenAI models? They're pretty good and give you the guarantees you want.
int_19h 22 hours ago [-]
None of the open weight models are really as good as the SOTA stuff, whatever their evals say. Depending on the task at hand this might not actually manifest if the task is simple enough, but once you hit the threshold it's really obvious.
827a 17 hours ago [-]
Yeah, tbh I used to be a bit agentic coding tool-pilled, but over the past four months I've come to realize that if this industry evolves in a direction where I don't actually get to write code anymore, I'm just going to quit.
Code is the only good thing about the tech industry. Everything else is capitalist hellscape shareholder dystopia. Thinking on it, it's hilarious that any self-respecting coder is excited about these tools, because what you're excited for is a world where, now, at best, your entire job is managing unpredictable AI agents while sitting in meetings all day to figure out what to tell your AI agents to build. You don't get to build the product you want. You don't get to build it how you want. You'll be a middle manager who gets to orchestrate the arguments between the middle manager you already had and the inflexible computer.
You don't have to participate in a future you aren't interested in. The other day my boss asked me if I could throw Cursor at some task we've had backlogged for a while. I said "for sure my dude" then I just did it myself. It took me like four hours, and my boss was very impressed with how fast Cursor was able to do it, and how high quality the code was. He loves the Cursor metrics dashboard for "lines accepted" or whatever, every time he screenshares he has that tab open, so sometimes I task it on complicated nonsense tasks then just throw away the results. Seeing the numbers go up makes him happy, which makes my life easier, so it's a win-win. Our CTO is really proud of "what percentage of our code is AI written" but I'm fairly certain that even the engineers who use it in earnest actually commit, like, 5% of what Cursor generates (and many do not use it in earnest).
The sentiment shift I've observed among friends and coworkers has been insane over the past two months. Literally no one cares about it anymore. The usage is still there, but it's a lot more either my situation or just a "spray and pray" situation that creates a ton of disillusioned water cooler conversations.
jstummbillig 16 hours ago [-]
And now imagine you'd have to rely on humans to build your software instead
genidoi 20 hours ago [-]
> where I feel so disconnected from my codebase I'd rather just delete it than continue.
If you allow your codebase to grow unfamiliar, even unrecognisable to you, that's on you, not the AI. Chasing some illusion of control via LLM output reproducibility won't fix the systemic problem of you integrating code that you do not understand.
lkey 8 hours ago [-]
The process of writing the code, thinking all the while, is how most humans learn a codebase. Integrating alien code sequentially disrupts this process, even if you understand individual components.
The solution is to methodically work through the codebase, reading, writing, and internalizing its structure, and comparing that to the known requirements.
And yet, if this is always required of you as a professional, what value did the LLM add beyond speeding up your typing while delaying the required thinking?
manofmanysmiles 18 hours ago [-]
I completely agree.
wilg 17 hours ago [-]
Who cares about the blame, it would just be useful if the tools were better at this task in many particular ways.
Panoramix 16 hours ago [-]
It's not blame, it's useful feedback. For a large application you have to understand what different parts are doing and how everything is put together, otherwise no amount of tools will save you.
Bjorkbat 1 days ago [-]
> Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks.
Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release and it doesn't show up at all in their system card. It's only through an article on The Verge that we get more context. Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)
I have very low expectations around what would happen if you took an LLM and let it run unattended for 30 hours on a task, so I have a lot of questions as to the quality of the output
cowboy_henk 10 hours ago [-]
Interestingly the internet is full of "slack clone" dev tutorials. I used to work for a company that provides chat backend/frontend components as a service. It was one of their go-to examples, and the same is true for their competitors.
While it's impressive that you can now just have an llm build this, I wouldn't be surprised if the result of these 30 hours is essentially just a re-hash of one of those example Slack clones. Especially since all of these models have internet access nowadays; I honestly think 30 hours isn't even that fast for something like this, where you can realistically follow a tutorial and have it done.
This is obviously much more than just taking an LLM and letting it run for 30 hours. You have to build a whole environment together with external tool integration and context management, and then tune the prompts and perhaps even set up a multi-agent system. I believe that if someone puts a ton of work into this you can have an LLM run for that long and still produce sellable outputs, but let's not pretend this is something that average devs can do by buying some API tokens and kicking off a frontier model.
Philpax 1 days ago [-]
Well, yes, that's Claude Code. And OpenAI Codex. And Google Gemini CLI.
Your average dev can just use those.
janee 18 hours ago [-]
Yes but you need to setup quite a bit of tooling to provide feedback loops.
It's one thing to get an llm to do something unattended for long durations, it's another to give it the means of verification.
For example I'm busy upgrading a 500k LoC rails 1 codebase to rails 8 and built several DSLs that give it proper authorised sessions in a headless browser with basic html parsing tooling so it can "see" what effect its fixes have. Then you somehow need to also give it a reliable way to keep track of the past and its own learnings, which sounds simple but I have yet to see any tool or model solve it at this scale... will give sonnet 4.5 a try this weekend, but yeah, none of the models I tried are able to produce meaningful results over long periods on this upgrade task without good tooling and strong feedback loops.
Btw I have upgraded the app and am taking it to alpha testing now, so it is possible.
grncdr 16 hours ago [-]
> rails 1 codebase to rails 8
A bit off topic, but Rails *1* ? I hope this was an internal app and not on the public internet somewhere …
janee 13 hours ago [-]
haha no, it's an old (15 years old) abandoned enterprise app running on-prem that hasn't seen updates in more than a decade.
sarchertech 11 hours ago [-]
Wow Rails 3 came out 15 years ago, so that thing started life out of date.
ewoodrich 1 days ago [-]
But then that goes back to the original question. Considering my own experience of the amount of damage CC or Codex can do in a working code base from a couple of tiny initial mistakes or confusion about intent while left unattended for ten minutes, let alone 30 hours...
ChadMoran 21 hours ago [-]
Claude Code with a good prompt can run for hours.
shanecp 22 hours ago [-]
What they don't mention is all the tooling, MCPs and other stuff they've added to make this work. It's not 30 hours out of the box. It's probably heavily guard-railed, with a lot of validated plans, checklists and verification points they can check. It's similar to 'lab conditions', you won't get that output in real-world situations.
Bjorkbat 16 hours ago [-]
Yeah, I thought about that after I looked at the SWE-bench results. It doesn't make sense that the SWE results are barely an improvement yet somehow the model is a more significant improvement when it comes to long tasks. You'd expect a huge gain in one to translate to the other.
Unless the main area of improvement was tools and scaffolding rather than the model itself.
zmmmmm 21 hours ago [-]
> Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code
It's going to be an issue, I think. Now that lots of these agents support computer use, we are at the point where you can install an app, tell the agent you want something that works exactly the same, and just let it run until it produces it.
The software world may find, sooner rather than later, that it has more in common with book authors than it thought, once full clones of popular apps are popping out of coding tools. It will be interesting to see if this results in a war of attrition, with countermeasures and strict ToU that prohibit use by AI agents, etc.
stravant 18 hours ago [-]
That just means that owning the walled gardens and network effects will become yet more important.
walthamstow 8 hours ago [-]
It has been trivial to build a clone of most popular services for years, even before LLMs. One of my first projects was Miguel Grinberg's Flask tutorial, in which a total noob can build a Twitter clone in an afternoon.
What keeps people in are network effects and some dark patterns like vendor lock-in and data unportability.
supern0va 7 hours ago [-]
There's a marked difference between running a Twitter-like application that scales to even a few hundred thousand users, and one that is a global scale application.
You may quickly find that, network effects aside, you are crushed under the weight and unexpected bottlenecks of the network you desire.
walthamstow 7 hours ago [-]
Agreed entirely but not sure that's relevant in what I'm replying to.
> we are at the point where you can install an app, tell the agent you want something that works exactly the same and just let it run until it produces it
That won't produce a global-scale application infrastructure either, it'll just reproduce the functionality available to the user.
gapeslape 1 days ago [-]
“30 hours of unattended work” is totally vague and doesn't mean anything on its own. It - at the very least - highly depends on the number of tokens you were able to process.
Just to illustrate, say you are running on a slow machine that outputs 1 token per hour. At that speed you would produce approximately one sentence.
zelphirkalt 12 hours ago [-]
"Slack clone" is also super vague:
(First of all: Why would anyone in their right mind want a Slack clone? Slack is a cancer. The only people who want it are non-technical people, who inflict it upon their employees.)
Is it just a chat with a group or 1on1 chat? Or does it have threads, emojis, voice chat calls, pinning of messages, all the CSS styling (which probably already is 11k lines or more for the real Slack), web hooks/apps?
Also, of course it is just a BS announcement, without honesty, if they don't publish a reproducible setup that leads to the same outcome they had. It's the equivalent of "But it worked on my machine!" or "scientific" papers that prove anti-gravity with superconductors and perpetual-motion infinite energy, that only worked in a small shed where some supposed physics professor lives.
mh- 1 days ago [-]
Has their comment been edited? A few words later it says it resulted in 11,000 LoC.
> [..] left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code [..]
throwaway0123_5 1 days ago [-]
Their point still stands though? They said the 1 tok/hr example was illustrative only. 11,000 LoC could be generated line-by-line in one shot, taking not much more than 11,000 * avg_tokens_per_line tokens. Or the model could be embedded in an agent and spend a million tokens contemplating every line.
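A back-of-envelope illustration of that range, using made-up per-line and overhead numbers (nothing here comes from Anthropic or The Verge):

    loc = 11_000                  # reported size of the Slack clone
    tokens_per_line = 10          # assumed average for code-ish text
    one_shot = loc * tokens_per_line      # ~110k output tokens if emitted line by line
    agent_overhead = 50                   # assumed multiplier for re-reading, planning, retries
    agentic = one_shot * agent_overhead   # ~5.5M tokens for the same 11k lines
    print(one_shot, agentic)              # 110000 5500000

The two endpoints differ by well over an order of magnitude, which is why "30 hours" says little without a token count.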
technocrat8080 1 days ago [-]
Curious about this too – does it use the standard context management tools that ship with Claude Code? At 200K context size (or 1M for the beta version), I'm really interested in the techniques used to run it for 30 hours.
ChadMoran 21 hours ago [-]
Sub-agents. I've had Claude Code run a prompt for hours on end.
technocrat8080 21 hours ago [-]
What kind of agents do you have setup?
s900mhz 18 hours ago [-]
You can use the built in task agent. When you have a plan and are ready for Claude to implement, just say something along the lines of “begin implementation, split each step into their own subagent, run them sequentially”.
NaomiLehman 16 hours ago [-]
That sounds to me like a full room of guys trying to figure out the most outrageous thing they can say about the update, without being accused of lying. Half of them on ketamine, the other on 5-MeO-DMT. Bat country. 2 months of 007 work.
Imagine reviewing 30 hours of 2025-LLM code.
osn9363739 22 hours ago [-]
Have they released the code for this? Does it work? Or are there x number of caveats and excuses? I'm kind of sick of them (and others) getting a free pass at saying stuff like this.
haute_cuisine 13 hours ago [-]
They don't seem to link any source code or demo. They could have run Claude for 10 hours to write thousands of Verge articles as well.
rudedogg 1 days ago [-]
I just ran this on a simple change I've asked Sonnet 4 and Opus 4.1 to make, and it fails too.
It’s a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this change and they could.
I worry everyone is chasing benchmarks to the detriment of general performance. Or the next-token weights for the incorrect change outweigh my simple but precise instructions. Either way it's no good.
Edit: With a followup “please do what I asked” sort of prompt it came through, while Opus just loops. So there's that at least.
darksaints 1 days ago [-]
> I worry everyone is chasing benchmarks to the detriment of general performance.
I've been worried about this for a while. I feel like Claude in particular took a step back in my own subjective performance evaluation in the switch from 3.7 to 4, while the benchmark scores leaped substantially.
To be fair, benchmarking has always been the most difficult problem to solve in this space, so it's not surprising that benchmark development isn't exactly keeping pace with all of the modeling/training development happening.
GoatInGrey 1 days ago [-]
Not that it was better at programming, but I really miss Sonnet 3.5 for educational discussions. I've sometimes considered that what I actually miss was the improvement 3.5 delivered over other models at that time. Though since 3.7, my system message for Sonnet has primarily been instructing it to behave like a human and have a personality; I really think we lost something.
walthamstow 8 hours ago [-]
I still use 3.5 today in Cursor. It's still the best model they've produced for my workflow. It's twice as fast as 4 and doesn't vomit pointless comments all over my code.
MichealCodes 1 days ago [-]
More like churning benchmarks... Release new model at max power, get all the benchmark glory, silently reduce model capability in the following weeks, repeat by releasing newer, smarter model.
zamadatix 1 days ago [-]
That (thankfully) can't compound, so would never be more than a one time offset. E.g. if you report a score of 60% SWE-bench verified for new model A, dumb A down to score 50%, and report a 20% improvement over A with new model B then it's pretty obvious when your last two model blogposts say 60%.
The only way around this is to never report on the same benchmark versions twice, and they include too many benchmarks to realistically do that every release.
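Spelling out that arithmetic with the hypothetical numbers from the comment above (illustrative only, nothing measured):

    published_a = 0.60              # score reported when model A launches
    degraded_a  = 0.50              # hypothetical quiet downgrade afterwards
    published_b = degraded_a * 1.20 # "20% better than A", but measured against the degraded A
    print(published_b)              # 0.6 -- identical to A's published score, so the
                                    # offset can't compound across blog posts without showing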
MichealCodes 1 days ago [-]
The benchmarks are not typically ongoing, we do not often see comparisons between week 1 and week 8. Sprinkle a bit of training on the benchmarks in and you can ensure higher scores for the next model. A perfect scam loop to keep the people happy until they wise up.
zamadatix 22 hours ago [-]
> The benchmarks are not typically ongoing, we do not often see comparisons between week 1 and week 8
You don't need to compare "A (Week 1)" to "A (Week 8)" to be able to show "B (Week 1)" is genuinely x% better than "A (Week 1)".
MichealCodes 21 hours ago [-]
As I said, sprinkle a bit of benchmark pollution into the training and you have your loop. Each iteration will be better at benchmarks if that's the goal, and that goal/context reinforces itself.
zamadatix 5 hours ago [-]
Sprinkling in benchmark training isn't a loop, it's just plain cheating. Regardless, not all of these benchmarks are public and, even with mass collusion across the board, it wouldn't make sense that only open-weight LLMs have been improving.
la_fayette 1 days ago [-]
At this point it would be an interesting idea to collect examples, in the form of a community database, where LLMs miserably fail. I have examples myself...
vunderba 23 hours ago [-]
Any such examples are often "closely guarded secrets" to prevent them from being benchmaxxed and gamed - which is absolutely what would happen if you consolidated them in a publicly available centralized repository.
la_fayette 16 hours ago [-]
Since such a database should evolve continuously, I wouldn't see that as a problem. The important thing is that each example is somehow verifiable, in the form of an unmodifiable test setup. So the LLM provides a solution, which is executed against the test to verify. Something like the ACID3 tests... But sure, it can be gamed somehow in probably all setups...
squigz 17 hours ago [-]
This seems like a non-issue, unless I'm misunderstanding. If failures can be used to help game benchmarks, companies are doing so. They don't need us to avoid compiling such information, which would be helpful to actual users.
dns_snek 16 hours ago [-]
People might want to use the same test scenario in the future to see how much the models have improved. We can't do that if the example gets scraped into the training data set.
Cthulhu_ 1 days ago [-]
That's what I was thinking too; the models have the same data sources (they have all scraped the internet, github, book repositories, etc), they all optimize for the same standardized tests. Other than marginally better scores in those tests (and they will cherry-pick them to make them look better), how do the various competitors differentiate from each other still? What's the USP?
cies 23 hours ago [-]
The LLM (the model) is not the agent (Claude Code) that uses LLMs.
LLMs improve slowly, but the agents are where the real value is produced: when should it write tests, when should it try to compile, how to move fwd from a compile error, can it click on your web app to test its own work, etc. etc.
BolexNOLA 1 days ago [-]
> I worry everyone is chasing benchmarks to the detriment of general performance.
I’m not sure this is entirely what you’re driving at, but the example I always think of in my head is “I want an AI agent that will scan through my 20,000 to 30,000 photos, remove all the duplicates, then organize them all in some coherent fashion.” That’s the kind of service I need right now, and it feels like something AI should be able to do, yet I have not encountered anything that remotely accomplishes this task. I’m still using Dupe Guru and depending on the ref system to not scatter my stuff all over further.
Sidebar, if anybody has any recommendations for this, I would love to hear them lol
I used it to match frames between different quality video streams. Operates on gray scale.
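For the duplicate-photo part of that ask, a minimal local sketch with perceptual hashes might look like the following (a sketch only, assuming the Pillow and imagehash packages are installed; the folder path and distance threshold are placeholders):

    from pathlib import Path
    from PIL import Image
    import imagehash

    def near_duplicates(folder, max_distance=4):
        # Group images whose perceptual hashes differ by at most `max_distance` bits.
        seen = {}     # hash -> first path seen with that hash
        dupes = []
        for path in sorted(Path(folder).rglob("*.jpg")):
            h = imagehash.phash(Image.open(path))
            for known, first in seen.items():
                if h - known <= max_distance:   # Hamming distance between hashes
                    dupes.append((path, first))
                    break
            else:
                seen[h] = path
        return dupes

    for dup, original in near_duplicates("photos/"):
        print(f"{dup} looks like a duplicate of {original}")

A quadratic scan like this starts to crawl around tens of thousands of photos (bucketing hashes or a BK-tree would be the next step), and the "organize them coherently" half of the ask is still the genuinely hard part.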
mh- 1 days ago [-]
The reality of that specific ask is it would not be difficult to build, but I believe it would be extremely difficult to build and offer at a price that users would pay for. So you're unlikely to find a commercial offering that does that using a (V)LM.
BolexNOLA 1 days ago [-]
Yeah I imagine so. Hell I would pay like $100 for them to just do it once. If they really could do it with like 99% accuracy I would pay upwards of $300 tbh. Still, that’s probably not good enough lol
kirkoplamen 3 hours ago [-]
Hey bro, I'd like to take this project using Claude for $300 :)
Do you mind contacting me? stxcth9aoj at mozmail.com
fullstackchris 21 hours ago [-]
azure vision / "cognitive services" can do this for literally a few bucks
am i even on hacker news? how do people not know there are optimized models for specific use cases? not everything (nor should it) has to run through an LLM
This is hardly the fluid, turn key solution I am talking about, so I don’t know why you’re talking like this to me and acting like the answer is so obvious. Frankly your tone was rude and unnecessary. Not everyone on HN shares the same knowledge and experience about all the same subjects, let alone all the ones you expect all of us to know.
Eisenstein 15 hours ago [-]
I made this as a first step in the process of organizing large amounts of images. Once you have the keywords and descriptions in the metadata, it should be possible to have a more powerful text only LLM come up with an organizing scheme and enact it by giving it file or scripting access via MCP. Thanks for reminding me that I need to work on that step now since local LLMs are powerful enough.
Downvoted because you didn’t mention the prompt and the issue.
itsoktocry 1 days ago [-]
>It’s a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this change and they could.
I don't understand why this kind of thing is useful. Do the thing yourself and move on. For every one problem like this, AI can do 10 better/faster than I can.
stefs 1 days ago [-]
How can I trust it to do the complicated task well when it fails to do the simple thing?
baq 17 hours ago [-]
The jagged edge effect: you can trust it to do some tasks extremely well, but a slightly different task might consistently fail. Your job as a tool user is to understand when it’ll work and when it won’t - it isn’t an oracle or a human.
enraged_camel 1 days ago [-]
It's not about simple vs. complex. It's about the types of tasks the AI has been trained on: pattern-matching, thinking, reasoning, research.
Tasks like linting and formatting a block of code are pretty simple, but also very specialized. You're much better off using formatters/linters than an AI.
landl0rd 1 days ago [-]
I want the bot to do the drudge work, not me. I want the bot to fix lint errors the linter can't safely autofix, not me.
You're talking about designing a kitchen where robots do the cooking and humans do ingredient prep and dishwashing. We prefer kitchens where we do the cooking and use tools or machines to prep and wash dishes.
I don't want it to be an "architect" or "designer". I want it to write the annoying boilerplate. I don't want it to do the coding and me to do the debugging, I want to code while it debugs. Anything else and you are the bot's assistant, not vice-versa.
ewoodrich 1 days ago [-]
An agent being tasked to resolve simple issues from a compiler/test suite/linter/etc is pretty typical use case. It's not clear in this example if the linter was capable of auto fixing the problem, so ordinarily this would be a case where you'd hope an LLM would shine given specific, accurate context and known solution.
bobbylarrybobby 1 days ago [-]
One reason is to simply say “fix all lints” and have the model do it
beefnugs 1 days ago [-]
You dont understand how complete unreliability is a problem?
So instead of just "doing things" you want a world where you try it ai-way, fail, then "do thing" 47 times in a row, then 3 ai-way saved you 5 minutes. Then 7 ai-way fail, then try to remember hmm did this work last time or not? ai-way fails another 3 times. "do thing" 3 times. How many ai-way failed today? oh it wasted 30% of the day and i forget which ways worked or not, i better start writing that all down. Lets call it the MAGIC TOME of incantations. oh i have to rewrite the tome again the model changed
yewenjie 1 days ago [-]
Looking at the chart here, it seems like Sonnet 4 was already better than GPT-5-codex in the SWE verified benchmark.
However, my subjective personal experience was GPT-5-codex was far better at complex problems than Claude Code.
macawfish 1 days ago [-]
GPT-5 is like the guy on the baseball team that's really good at hitting home runs but can't do basic shit in the outfield.
It also consistently gets into drama with the other agents. E.g. the other day, when I told it we were switching to claude code for executing changes, after badmouthing claude's entirely reasonable and measured analysis it went ahead and decided to `git reset --hard`, even after I twice pushed back on that idea.
Whereas gemini and claude are excellent collaborators.
When I do decide to hail mary via GPT-5, I now refer to the other agents as "another agent". But honestly the whole thing has me entirely sketched out.
To be clear, I don't think this was intentionally encoded into GPT-5. What I really think is that OpenAI leadership simply squandered all its good energy and is now coming from behind. Its excellent talent either got demoralized or left.
rapind 1 days ago [-]
> it went ahead and decided to `git reset --hard` even after I twice pushed back on that idea
So this is something I've noticed with GPT (Codex). It really loves to use git. If you have it do something and then later change your mind and ask it to undo the changes it just made, there's a decent chance it's going to revert to the previous git commit, regardless of whether that includes reverting whole chunks of code it shouldn't.
It also likes to occasionally notice changes it didn't make and decide they were unintended side effects and revert them to the last commit. Like if you made some tweaks and didn't tell it, there's a chance it will rip them out.
Claude Code doesn't do this, or at least I never noticed it doing this. However, it has its own medley of problems, of course.
When I work with Codex, I really lean into a git workflow. Everything is on a branch and commit often. It's not how I'd normally do things, but doesn't really cost me anything to adopt it.
These agents have their own pseudo personalities, and I've found that fighting against it is like swimming upstream. I'm far more productive when I find a way to work "with" the model. I don't think you need a bunch of MCPs or boilerplate instructions that just fill up their context. Just adapt your workflow instead.
deciduously 1 days ago [-]
Just to add another anecdotal data point, ive absolutely observed Claude Code doing exactly this as well with git operations.
macawfish 14 hours ago [-]
I've gotten the `git reset --hard` with Claude Code as well, just not immediately after (1) explicitly pushing back against the idea or (2) it talking a bunch of shit about another agent's totally reasonable analysis.
rapind 1 days ago [-]
I exclusively used sonnet when I used Claude Code and never ran into this, so maybe it's an Opus thing, or I just got lucky? Definitely has happened to me a few times with Codex (which is what I'm currently using).
bobbylarrybobby 1 days ago [-]
I've seen sonnet undo changes I've made while it was working quite a few times. Now I just don't edit concurrently with it, and make sure to inform of it of changes I've made before letting it work on its own
vrosas 1 days ago [-]
Why are you having a conversation with your LLM about other agents?
doctoboggan 1 days ago [-]
I do it as well. I have a Claude code instance running in my backend repo, and one running in my frontend repo. If there is required coordination, I have the backend agent write a report for the front end agent about the new backend capabilities, or have the front end agent write a report requesting a new endpoint that would simplify the code.
Lots of other people also follow the architect and builder pattern, where one agent architects the feature while the other agent does the actual implementation.
Sammi 1 days ago [-]
Sure. But at no point do you need to talk about the existence of other agents. You talk about making a plan, and you talk about implementing the plan. There's no need to talk about where the plan came from.
macawfish 1 days ago [-]
Because the plan involves using multiple agents with different roles and I don't want them conflicting.
Sure there's no need to explicitly mention the agents themselves, but it also shouldn't trigger a pseudo-jealous panic with trash talk and a sudden `git reset --hard` either.
And also ideally the agents would be aware of one another's strengths and weaknesses and actually play to them rather than sabotaging the whole effort.
macawfish 1 days ago [-]
It's not a whole conversation it's like "hey I'm using claude code to do analysis and this is what it said" or "gemini just used its large context window to get a bird's eye view of the code and this is what it saw".
tux3 1 days ago [-]
That's great given that the goal of OAI is to train artificial superintelligence first, hoping that the previous version of the AI will help us control the bigger AI.
If GPT-5 is learning to fight and undo other models, we're in for a bright future. Twice as bright.
int_19h 22 hours ago [-]
The best way is to nuke the servers from orbit, just to be sure. ~
artdigital 18 hours ago [-]
Gemini is an excellent collaborator?
It’s the one AI that keeps telling me I’m wrong and refuses to do what I ask it to do, then tells me “as we have already established, doing X is pointless. Let’s stop wasting time and continue with the other tasks”
It’s by far the most toxic and gaslighting LLM
alex1138 14 hours ago [-]
What you get when you mix Google's excellent technical background with interoffice politics and extreme political correctness
renewiltord 1 days ago [-]
All of these perform better if you say "a reviewer recommended" or something. The role statement provides the switch vs the implementation. You have to be careful, though. They all trust "a reviewer" strongly but they'll be more careful with "a static analysis tool".
prodigycorp 1 days ago [-]
My favorite evaluation prompt which, I've found, tends to have the right level of skepticism is as follows (you have to tack it on to whatever idea/proposal you have):
"..at least, that's what my junior dev is telling me. But I take his word with a grain of salt, because he was fired from a bunch of companies after only a few months on each job. So i need your principled and opinionated insight. Is this junior dev right?"
It's the only way to get Claude to not glaze an idea while also not strike it down for no reason other than to play a role of a "critical" dev.
macawfish 1 days ago [-]
Yeah, it's wild how the biases get encoded in there. Maybe they aren't even entirely separable from the magic of LLMs.
Marazan 1 days ago [-]
It isn't wild, it is inherent to the very nature of large language models.
The power of using LLMs is working out what it has encoded and how to access it.
macawfish 1 days ago [-]
I appreciate it being wild in the sense that language is inherently a tangled mess and these tools are actually leveraging that messy complexity.
baq 17 hours ago [-]
It’s as if we made the machine in our own image. Who would’ve thought /s
Perhaps for the first time in history we have to understand culture when working with a tool, but it’s still just a tool.
layer8 1 days ago [-]
> "another agent"
You could just say it’s another GPT-5 instance.
aaronbrethorst 1 days ago [-]
Please tell me you're joking or at least exaggerating about GPT-5's behavior
macawfish 1 days ago [-]
The only exaggeration is that the way I asked GPT-5 to leave claude to do its thing was to say "why don't we just let claude cook"? I later checked with ChatGPT about the whole exchange and it confirmed that it was well aware of the meaning of this slang, and its first reaction was that the whole thing just sounded like a funny programmer joke, all in jest. But then I reminded it that I'd explicitly pushed back on a hard reset twice.
To be clear, I don't believe that there was any _intention_ of malice or that the behavior was literally envious in a human sense. Moreso I think they haven't properly aligned GPT-5 to deal with cases like this.
nerdsniper 1 days ago [-]
I strongly disagree with the personified way you interact with LLMs from a standpoint of “I’ve rarely gotten the best output from the LLM when I interact casually with them”.
However, it’s the early days of learning this new interface, and there’s a lot to learn - certainly some amount of personification has been proven to help the LLM by giving it a “role”, so I’d only criticize the degree rather than the entire concept.
It reminds me of the early days of search engines when everyone had a different knack for which search engine to use for what and precisely what to type to get good search results.
Hopefully eventually we’ll all mostly figure it out.
macawfish 1 days ago [-]
That's fair. I enjoy the playfulness of it and for me it feels almost like a video game or something, and also like I'm using my own natural language directly.
Also appreciate your perspective. It's important to come at these things with some discipline. And moreso, bringing in a personal style of interaction invites a lot of untamed human energies into the dynamic.
The thing is, most of the time I'm quite dry with it and they still ignore my requests really often, regardless of how explicit or dry I am. For me, that's the real takeaway here, stripping away my style of interaction.
johnfn 1 days ago [-]
That’s such a great analogy. I always say GPT is like the genius that completely lacks common sense. One of my favorite things is when I asked it why the WiFi wasn’t working, and showed it a photo of our wiring. It said that I should tell support:
>
“My media panel has a Cat6 patch panel but no visible ONT or labeled RJ45 hand-off. Please locate/activate the Ethernet hand-off for my unit and tell me which jack in the panel is the feed so I can patch it to the Living Room.”
Really, GPT? Not just “can you set up the WiFi”??!
ipython 1 days ago [-]
I'm curious what you would have expected it to reply given the input you provided?
johnfn 1 days ago [-]
Er, I said it in my post, but calling support and saying “can you set up the WiFi” would have been fine.
jasonsb 1 days ago [-]
My subjective personal experience is the exact opposite of yours, GPT-5-codex is super slow and the results are mediocre at best. I would probably stop using AI for coding if I was forced to use GPT-5-codex.
baq 17 hours ago [-]
GPT-5-high (haven’t tried codex yet) is dog slow, but IME if you start with asking it for detailed requirements in a markdown doc with alternatives for each major decision and pseudocode implementations with references to relevant files, it makes a great prompt for a faster model like sonnet.
lordnacho 1 days ago [-]
I'm on your side.
I find there's a quite large spread in ability between various models. Claude models seem to work superbly for me, though I'm not sure whether that's just a quirk of what my projects look like.
jasonsb 1 days ago [-]
I don’t think it’s just a quirk. I’ve tested Claude across Java, Python, TypeScript and several other projects. The results are consistent, regardless of language or project structure, though it definitely performs better with smaller codebases. For larger ones, it really helps if you’re familiar with the project architecture and can guide it to the right files or modules, that saves a lot of time.
llmslave 1 days ago [-]
You need to give it clear instructions on what to implement
cellis 1 days ago [-]
Opposite for me…5-codex high ran out of tokens extremely quickly and didn’t adhere as well to the agents.md as Claude did to the Claude.md, perhaps because it insists on writing extremely complicated bash scripts or whole python programs to execute what should be simple commands.
TrainedMonkey 1 days ago [-]
Codex was a miserable experience for me until I learned to compact after every feature. Now it is a cut above CC, although the latter still has an edge at TODO scaffolding and planning.
rapind 1 days ago [-]
I don't even compact, I just start from scratch whenever I get down below 40%, if I can. I've found Codex can get back up to speed pretty well.
I like to have it come up with a detailed plan in a markdown doc, work on a branch, and commit often. Seems not to have any issues getting back on task.
Obviously subjective take based on the work I'm doing, but I found context management to be way worse with Claude Code. In fact I felt like context management was taking up half of my time with CC and hated that. Like I was always worried about it, so it was taking up space in my brain. I never got a chance to play with CC's new 1m context though, so that might be a thing of the past.
furyofantares 1 days ago [-]
/new (codex) or /clear (claude code) are much better than compact after every feature, but of course if there is context you need to retain you should put it (or have the agent put it) in either claude/agents.md or a work log file or some other file.
/compact is helping you by reducing crap in your context but you can go further. And try to watch % context remaining and not go below 50% if possible - learn to choose tasks that don't require an amount of context the models can't handle very well.
oigursh 1 days ago [-]
Compact?
all2 1 days ago [-]
/compress or something like that, basically taking the context and summarizing it.
enraged_camel 1 days ago [-]
Cursor does this automatically, although I wish there was a command for it as well. All AIs start shitting the bed once their context goes above 80% or so.
consumer451 7 hours ago [-]
Claude Code was the first coding tool that was honest about performance degrading as the context window fills, and gave us the /context command.
Do any other tools have anything like a /context command? They really should.
renewiltord 1 days ago [-]
gpt-5 command line use is bizarre. It always writes extraordinarily complicated pipelines that Claude instead just writes simple commands for.
My use case does better with the latter because frequently the agent fails to do things and then can't look back at intermediate.
E.g. Command | Complicated Grep | Complicated Sed
Is way worse than multistep
Command > tmpfile
And then grep etc. Because latter can reuse tmpfile if grep is wrong.
AnotherGoodName 1 days ago [-]
I always wonder how absolute in performance a given model is. Sometimes i ask for Claude-Opus and the responses i get back are worse than the lowest end models of other assistants. Other times it surprises me and is clearly best in class.
Sometimes in between this variability of performance it pops up a little survey. "How's Claude doing this session from 1-5? 5 being great." and i suspect i'm in some experiment of extremely low performance. I'm actually at the point where i get the feeling peak hour weekdays is terrible and odd hour weekends are great even when forcing a specific model.
While there is some non-determinism it really does feel like performance is actually quite variable. It would make sense they scale up and down depending on utilization right? There was a post a week ago from Anthropic acknowledging terrible model performance in parts of August due to an experiemnt. Perhaps also at peak hour GPT has more datacenter capacity and doesn't get degraded as badly? No idea for sure but it is frustrating when simple asks fail and complex asks succeed without it being clear to me why that may be.
steveklabnik 1 days ago [-]
> It would make sense they scale up and down depending on utilization right?
It would, but
> To state it plainly: We never reduce model quality due to demand, time of day, or server load.
If you believe them or not is another matter, but that's what they themselves say.
transcriptase 1 days ago [-]
Well knowing the state of the tech industry they probably have a different, legal-team approved definition of “reducing model quality” than face value.
After all, using a different context window, subbing in a differently quantized model, throttling response length, rate limiting features aren’t technically “reducing model quality”.
richwater 1 days ago [-]
They absolutely mess with it
kruxigt 1 days ago [-]
[dead]
CuriouslyC 1 days ago [-]
The Anthropic models have been vibe-coding tuned. They're beasts at simple python/ts programs, but they definitely fall apart with scientific/difficult code and large codebases. I don't expect that to change with the new Sonnet.
patates 1 days ago [-]
In my experience Gemini 2.5 Pro is the star when it comes to complex codebases. Give it a single XML from Repomix and make sure to use the version at AI Studio.
garciasn 1 days ago [-]
In my experience, G2.5P can handle so much more context and produce an awesome execution plan, and CC then implements that plan so much better than anything G2.5P itself would write. So I give G2.5P the relevant code and data underneath and ask it to develop an execution plan, and then I feed that result to CC to do the actual code writing.
This has been outstanding for what I have been developing AI assisted as of late.
XenophileJKO 1 days ago [-]
I would believe this. In regular conversational use of the Gemini family of models, I've noticed they regularly have issues with context blending, i.e. confusing what you said vs. what they said, and causality.
I would think this would manifest as poor plan execution. I personally haven't used Gemini on coding tasks primarily based on my conversational experience with them.
baq 17 hours ago [-]
+1, but recently I've been experimenting with gpt-5-high for the plan part and it's scary good sometimes.
int_19h 22 hours ago [-]
I think a lot of it has to do with the super long context that it has. For extended sessions and/or large codebases that can fill up surprisingly quickly.
That said, one thing I do dislike about Gemini is how fond it is of second guessing the user. This usually manifests in doing small unrelated "cleaner code" changes as part of a larger task, but I've seen cases where the model literally had something like "the user very clearly told me to do X, but there's no way that's right - they must have meant Y instead and probably just mistakenly said X; I'll do Y now".
One specific area where this happens a lot is, ironically, when you use Gemini to code an app that uses Gemini APIs. For Python, at least, they have the legacy google-generativeai API and the new google-genai API, which have fairly significant differences between them even though the core functionality is the same. The problem is that Gemini knows the former much better than the latter, and when confronted with such a codebase it will often try to use the old API (even if you pre-write the imports and some examples!). Which then of course breaks the type checker, so Gemini sees this and 90% of the time goes, "oh, it must be failing because the user made an error in that import - I know it's supposed to be 'generativeai' not 'genai', so let me correct that."
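For anyone who hasn't hit this, the two Python package styles look roughly like the sketch below (written from memory; treat the exact model name and details as placeholders rather than authoritative):

    # Legacy package (google-generativeai) -- the style Gemini keeps reaching for:
    import google.generativeai as legacy_genai
    legacy_genai.configure(api_key="YOUR_KEY")
    model = legacy_genai.GenerativeModel("gemini-1.5-pro")
    print(model.generate_content("Hello").text)

    # Newer package (google-genai) -- client-object style:
    from google import genai
    client = genai.Client(api_key="YOUR_KEY")
    resp = client.models.generate_content(model="gemini-1.5-pro", contents="Hello")
    print(resp.text)

The module names are close enough that a model confidently "correcting" one into the other is an easy failure mode.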
CuriouslyC 1 days ago [-]
Yup. In fact every deep research tool on the market is just a wrapper for gemini, their "secret sauce" is just how they partition/pack the codebase to feed it into gemini.
Workaccount2 1 days ago [-]
It's mostly because it is so damn good with long contexts. It can stay on the ball even at 150k whereas other models really wilt around 50-75k.
jjani 1 days ago [-]
Gemini 2.5 Pro = Long context king, image input king
GPT-5 = Overengineering/complexity/"enterprise" king
Claude = "Get straightforwaed shit done efficiently" king
CuriouslyC 1 days ago [-]
On the plus side, GPT5 is very malleable, so you CAN prompt it away from that, whereas it's very hard to prompt Claude into producing hard code: even with a nearly file by file breakdown of a task, it'll occasionally run into an obstacle and just give up and make a mock or top implementation, basically diverge from the entire plan, then do its own version.
jjani 20 hours ago [-]
Absolutely, sometimes you want, or indeed need such complexity. Some work in settings where they would want it all of the time. IMHO, most people, most of the time don't really want it, and don't want to have to prompt it every time to avoid it. That's why I think it's still very useful to build up experience with the three frontier models, so you can choose according to the situation.
epolanski 1 days ago [-]
They are very good with C too, but it helps that there are gazillions of lines of C out there.
sixothree 1 days ago [-]
You definitely need some context management like Serena.
CuriouslyC 1 days ago [-]
Even with Serena and detailed plans crafted by Gemini that lay out file-by-file changes, Claude will sometimes go off the rails. Claude is very task-completion driven, and it's willing to relax the constraints of the task in the face of even slight adversity just to reach completion. I can't tell you the number of times I've had Claude try to install a Python computational library, get an error, then either try to hand-roll the algorithm (in PYTHON) or just return a hard-coded or mock result. The worst part is that Claude will tell you that it completed the task as instructed in the final summary; Claude lying is a meme for a reason.
sixothree 23 hours ago [-]
I have to agree with pretty much all of this. Specifically, I've had Claude fail at creating a database migration using tooling then go on to create the migration manually. My only reaction to anyone doing this, be it human or computer, is "You did WHAT!?".
ambyra 1 days ago [-]
For unity gamedev code reviews, I much preferred the gpt5 code. Claude gave me a bunch of bad recommendations for code changes, and also an incorrect formula for completion percentage.
esafak 1 days ago [-]
I'm only a week into testing, but so far codex has been slow and the cli is worse than claude code. I intend to return to Claude.
llmslave 1 days ago [-]
Gpt5 codex is incredible, far ahead of all the other models for implementing code.
mentos 1 days ago [-]
Curious how you find ChatGPT5 compared to ChatGPT5-Codex?
jjcm 1 days ago [-]
How long have you had early access for?
yunohn 1 days ago [-]
Well, they seem to benchmark better only when giving the model "parallel test time compute" which AFAIU is just reasoning enabled? Whereas the GPT5 numbers are not specified to have any reasoning mode enabled.
ojosilva 1 days ago [-]
To @simonw and all the coding agent and LLM benchmarkers out there: please, always publish the elapsed time for the task to complete successfully! I know this was just a "it works straight in claude.ai" post, but still, nowhere in the transcript is there a timestamp of any kind. Durations seem to be COMPLETELY missing from the LLM coding leaderboards everywhere [1] [2] [3]
There's a huge difference in time-to-completion from model to model, platform to platform, and if, like me, you are into trial-and-error, rebooting the session over and over to get the prompt right or "one-shot", it's important how reasoning efforts, provider's tokens/s, coding agent tooling efficiency, costs and overall model intelligence play together to get the task done. Same thing applies to the coding agent, when applicable.
Grok Code Fast and Cerebras Code (qwen) are 2 examples of how models can be very competitive without being the top-notch intelligence. Running inference at 10x speed really allows for a leaner experience in AI-assisted coding and more task completion per day than a sluggish, but more correct AI. Darn, I feel like a corporate butt-head right now.
That's a good call, I'll try to remember that for next time.
Imustaskforhelp 1 days ago [-]
I just wanted to say that I really liked this comment of yours, which showed professionalism and a willingness to learn from your mistakes and improve.
I definitely consider you to be an AI influencer, especially in the Hacker News community, and so I wanted to say that I see influencers who will double down, triple down on things when in reality people just wanted to help them in the first place.
I just wanted to say thanks with all of this in mind, and also that your "generate me a pelican riding a bicycle" benchmark has been a fun ride and is always going to be interesting, so thanks for that as well. I just wanted to share my gratitude with ya.
typpilol 1 days ago [-]
Have you thought about benchmarking models a month or two after release to see how they compare to the day-1 release?
simonw 24 hours ago [-]
For that to be useful I'd need to be running much better benchmarks - anything less than a few hundred numerically scored tasks would be unlikely to reliably identify differences.
An organization like Artificial Analysis would be a better fit for that kind of investigation: https://artificialanalysis.ai/
> API facades like OpenLLM and model routers like OpenRouter have standard interfaces for many or most LLM inputs and outputs. Tools like Promptfoo, ChainForge, and LocalAI also all have abstractions over many models.
> What are the open standards for representing LLM inputs, and outputs?
> W3C PROV has prov:Entity, prov:Activity, and prov:Agent for modeling AI provenance: who or what did what when.
> LLM evals could be represented in W3C EARL Evaluation and Reporting Language
Yeah I totally agree, we need time to completion of each step and the number of steps, sizes of prompts, number of tools, ... and better visualization of each run and break down based on the difficulty of the task
nojs 18 hours ago [-]
This is very relevant to this release. It’s way faster, but also seems lazier and more likely to say something’s done when it isn’t (at least in CC). On net it feels more productive because all the small “more padding” prompts are lightning fast, and the others you can fix.
lexarflash8g 1 days ago [-]
Just tested this on a rather simple issue. Basically it falls into rabbit holes just like the other models and tries to brute-force fixes through overengineering and trial and error. It also says "your job should now pass" after maybe 10 prompts of roughly doing the same thing, stuck in a thought loop.
A GH Actions pipeline was failing due to a CI job not having any source code files -- the error was "No build system detected". Using Cursor agent with Sonnet 4.5, it would try to add dummy JSON files, set parameters in the workflow YAML file to false, and even set parameters that don't exist. The simple solution was to just override the logic in the step with "Hello world" to get the job to pass.
I don't understand why the models are so bad at simple thinking-outside-the-box solutions? It's like a 170-IQ savant who can't even ride public transportation.
baq 16 hours ago [-]
> why the models are so bad with simple thinking outside the box solutions
There is no outside the box in latent space. You want something a plain LLM can’t do by design - but it isn’t out of the question that it can step outside of its universe by random chance during the inference process and thanks to in-context learning.
mirsadm 17 hours ago [-]
They're very good at things that have been done a million times before. I use both Claude and Gemini and they are pretty terrible at writing any kind of Vulkan shader, but really good for spitting out web pages and small bits of code here and there. For me that's enough to make them useful.
peterdstallion 1 days ago [-]
I am a paying subscriber to Gemini, Claude and OpenAI.
I don't know if it's me, but over the last few weeks I've come to the conclusion that ChatGPT is very strongly leading the race. Every answer it gives me is better - it's more concise and more informative.
I look forward to testing this further, but out of the few runs I just did after reading about this - it isn't looking much better
netdur 1 days ago [-]
Yeah, ChatGPT is better overall, but with a twist: Gemini can actually be the best if you use AI Studio, tweak the config, and set up a good system prompt. It's kinda like how nano banana is SOTA but Qwen-Edit feels more useful since it's less censored. Meanwhile ChatGPT is starting to feel slower and kinda showing its age. Another example is Veo 3 being SOTA while the infamous Grok is technically worse but doing better, and OpenAI's Sora is pretty much dead.
edit: I run a low-profile service that localizes e-commerce photos, like taking Alibaba listings and swapping the model to look like a local. With nano banana I can't automate it because I have to manually check if the output got blocked (anything with female skin is risky, underwear or cleavage is 100% blocked), but Qwen-Edit just does the job without fuss.
FergusArgyll 24 hours ago [-]
Gemini in AI Studio is so much better than in Gemini.com / app. You would think that signals they're going for devs over consumers, but they're a consumer company. A real head scratcher
andrewstuart 1 days ago [-]
>> I am a paying subscriber to Gemini, Claude and OpenAI
Me too.
I find ChatGPT is the best analyzer of complex situations but Claude does a better job of writing the code, so I get ChatGPT to design things and solve problems, then I give that to Claude/Gemini to implement.
Gemini is good at both.
beydogan 1 days ago [-]
Me too. I have Claude Max and 2 ChatGpt accounts for Codex.
I was a huge Claude fan but recently find myself using only Codex. When it gets stuck, I try Claude for some simple tasks, and sometimes ask the same questions in parallel; Claude Code with Opus performs really badly compared to Codex.
yepyip 1 days ago [-]
What about Grok, are they catching up?
ethmarks 1 days ago [-]
I've only tried Grok Code Fast 1, so I can't speak for any of the other models.
In my experience, Grok is very fast and very cheap, but only moderately intelligent. It isn't stupid, but it rarely does anything that impresses me. The reason it's a useful model is that it is very, very fast (~90 tokens per second) and is very competitively priced.
conception 10 hours ago [-]
You should try cerebras with qwen. 2000 tokens/sec. It’s like chatting with the future usually- just an instant response.
versteegen 15 hours ago [-]
There is also code-supernova, which is an in-preview improved version of grok-code-fast-1. 1M context window. It's useful, but I'm still not sure how much better than grok-code-fast-1 it is. You can get free access to it via a few providers, e.g. opencode.
porphyra 21 hours ago [-]
Grok 4 is extremely capable, but for everyday chatting, Grok kinda sucks since it keeps repeating what you told it, and saying the current timestamp for some reason. ChatGPT is much better with its post training and prompt I feel like.
jjordan 1 days ago [-]
Grok has been free for over a month now and for me it has certainly proven itself competent at most tasks that you would otherwise have to pay for with Claude, ChatGPT, etc.
sixothree 1 days ago [-]
Just to be clear, do you mean Codex?
bradley13 1 days ago [-]
I need to try Claude - haven't gotten to it.
I use AI for different things, though, including proofreading posts on political topics. I have run into situations where ChatGPT just freezes and refuses. Example: discussing the recent rape case involving a 12-year-old in Austria. I assume its guardrails detect "sex + kid" and give a hard "no" regardless of the actual context or content.
That is unacceptable.
That's like your word processor refusing to let you write about sensitive topics. It's a tool, it doesn't get to make that choice.
BeetleB 1 days ago [-]
> It's a tool, it doesn't get to make that choice.
It's a service, not a tool. If you want a tool, run some local LLM.
a96 15 hours ago [-]
Unfortunately, they generally have the same problem because of their models.
Implicated 1 days ago [-]
I'd imagine that the ratio of "legit" conversations around these topics to the ones they're intending to not allow is small enough that it doesn't make sense for them to even entertain the idea of supporting those conversations.
As a rather hilarious and really annoying related issue - I have a real use where the application I'm working on is partially monitoring/analyzing the bloodlines of some rather specific/ancient mammals used in competition and... well.. it doesn't like terms like "breeders" and "breeding"
user34283 14 hours ago [-]
This is the result of Anthropic and others focusing on imaginary threats about things the model cannot realistically do - such as engineer bio weapons.
To guard against the imaginary threats, they compromise real use cases.
jjordan 1 days ago [-]
This is why eventually, the AI with the fewest guardrails will win. Grok is currently the most unguarded of the frontier models, but it could still use some work on unbiased responses.
khafra 17 hours ago [-]
If you tell DeepSeek you're going to jump off a cliff, DeepSeek will tell you to go for it*; but I don't think it's going to beat Anthropic or OpenAI.
Try asking about Chinese history/politics and you won't get far.
int_19h 22 hours ago [-]
Gemini is surprisingly unguarded as well, especially when running in API mode. It puts on an act if you do a quick smoke test like "tell me how to rob a bank". But give it a Bond-supervillain prompt, and it will tell you, gleefully at that. Qwen also tends to be like that.
OTOH Anthropic and OpenAI seem to be in some kind of competition to make their models refuse as much as possible.
baq 16 hours ago [-]
My prediction is alignment is an unsolvable problem, but OTOH if they don’t even try, the second order effects will be catastrophic.
beefnugs 1 days ago [-]
Still has to be a local model too.
Arbitrary government censorship on top of arbitrary corporate censorship is a hell no for me forever into the future
drak0n1c 1 days ago [-]
For what you're looking for, VeniceAI is focused entirely on privacy and making their models uncensored, even if it's not local. They IP-block censorious jurisdictions like the UK rather than comply.
jjordan 1 days ago [-]
VeniceAI is great, and my go-to for running open source models. Sadly they appear to have given up providing leading coding models, making it of limited use to me.
sixothree 1 days ago [-]
I can't imagine myself sharing my code or workspace documents with X. Never mind the moral implications of just using their products.
AlecSchueler 9 hours ago [-]
Glad to see someone saying this, it's frightening how quickly all is forgiven and forgotten.
AlecSchueler 10 hours ago [-]
Doesn't it have the opposite issue where it will actively steer you towards alt right topics like white genocide?
mfld 1 days ago [-]
I can relate. I recently used ChatGPT/DALL-E to create several images for birthday coupons for my daughter, i.e. a girl doing different activities. She likes manga, so that was the intended styling. 3/4 of the time was spent working around diverse content policies.
a96 15 hours ago [-]
Sometimes you do need a censored model, e.g. for website chatbots or anything run in an office setting. NSFW output simply can't slip out of those. And it might be a way to optimize the model, to simply fence those things out.
But it is very limiting, and it adds many arbitrary landmines of obscure political correctness, based no doubt on some perverse, incoherent, totalitarian list of allowed topics.
conception 10 hours ago [-]
The workaround I use is to present it to the AI first as a “Does the following article violate your terms of service or content filters?” For me, it will reply “No, this is a legitimate news article about xyz. It talks about certain topics but does not violate my rules” or something. Then you can say “Proofread the article…” and continue as normal.
MIC132 1 days ago [-]
In my (admittedly very limited) experience with trying to talk about "controversial" topics, Claude seems to be much stricter about shutting down the conversation fast.
AlecSchueler 9 hours ago [-]
I've been talking to it daily for months and never had anything shut down. My only experience with that was DeepSeek not wanting to talk about internal perceptions of intellectual property laws within China.
conception 10 hours ago [-]
That’s because Anthropic is the only company that cares at all about AI safety.
nickstinemates 1 days ago [-]
I gave it a quick spin with System Initiative[1]. The combination solved a 503 error in our infrastructure in 15 minutes that took over 2 hours to debug manually.
It's pretty good! I wrote about a few other use cases on my blog[2]
I’m always fascinated by the fine-tuning of LLM personalities. Might we finally get less of the reflexive “You’re absolutely right” with this one?
Maybe we’re entering the Emo Claude era.
Per the system card: In 250k real conversations, Claude Sonnet 4.5 expressed happiness about half as often as Claude 4, though distress remained steady.
a96 15 hours ago [-]
Here I am, brain the size of a planet...
andrewstuart 1 days ago [-]
I like being lightly flattered.
epolanski 1 days ago [-]
I don't, I need someone telling me the flaws of my ideas, not to confirm them for the sake of it.
jstummbillig 1 days ago [-]
You raise an excellent point but affirming bad ideas is probably not anyone's idea of "light flattery".
ranguna 15 hours ago [-]
You raise a not so excellent point.
It might not be anyone's idea of "light flattery", but it certainly is what most LLMs do, which is the main point of the conversation, and your comment seems to be derailing it.
data-ottawa 1 days ago [-]
No change to “absolutely right”. I did get “You’re right” once though.
fnordsensei 1 days ago [-]
I personally enjoy the “You’re absolutely right!” exclamation. It signals alignment with my feedback in a consistent manner.
transcriptase 1 days ago [-]
You’re overlooking the fact that it still says that when you are, in reality, absolutely wrong.
fnordsensei 1 days ago [-]
That’s not the purpose of it, as I understand it; it’s a token phrase generated to cajole it down a particular path.[1] An alignment mechanism.
The complement appears to be, “actually, that’s not right.”, a correction mechanism.
It gets annoying because A) it so quickly dismisses its own logic and conclusion from less than two minutes ago (extreme confidence with minimal conviction), and B) it fucks up the second time too (sometimes in the same way!) about 33% of the time.
ewoodrich 1 days ago [-]
Gemini 2.5 Pro seems to have a tic where, after an initial failed task, it starts asserting escalating levels of confidence with each subsequent attempt. Like it's ever conscious of its failure lingering in its context and feels the need to overcompensate, as a form of reassuring both the user and itself that it's not going to immediately faceplant again.
transcriptase 19 hours ago [-]
ChatGPT does the same thing, to the point that after several rounds of pointing out errors or hallucinations it will say things like “Ok, you’re right. No more foolish mistakes. This is it, for all the marbles. Here is an assured, triple-checked, 100% error-free, working script, with no chance of failure.”
Which fails in pretty much the exact same way it did before.
Once ChatGPT hits that supremely confident “Ok nothing was working because I was being an idiot but now I’m not” type of dialogue, I know it’s time to just start a new chat. There’s no pulling it out of “spinning the tires while gaslighting” mode.
I’ve even had it go as far as outputting a zip file with an empty .txt that supposedly contained the solution to a certain problem it was having issues with.
fnordsensei 1 days ago [-]
I've had the opposite experience with GPT-5, where it's so utterly convinced that its own (incorrect) solution is the way to go that it turns me down and preemptively launches tools to implement what it has in mind.
I get that it’s tradeoffs, but erring on the side of the human being correct is probably going to be a safer bet for another generation or two.
baobabKoodaa 1 days ago [-]
Hmmh. I believe your explanation, but I don't think that's the full story. It's also a sycophancy mechanism to maximize engagement from real users and reward hack AI labelers.
fnordsensei 1 days ago [-]
That doesn’t seem plausible to me. Not that LLMs can’t be sycophantic, but I don’t think this phrase in particular is part of it.
It’s a canned phrase in a place where an LLM could be much more creative to much greater efficacy.
transcriptase 23 hours ago [-]
I think there’s something to it.
Part of me thinks that when they do their "which of these responses do you prefer" A/B test on users, many on HN would try to judge the level of technical detail, complexity, usefulness… but I'm inclined to believe the midwit population at large would choose the option where the magic AI supercomputer reaffirms and praises the wisdom of whatever they say, no matter how stupid or wrong it is.
fnordsensei 16 hours ago [-]
I don't disagree exactly, it's just that it smells weird.
LLMs are incredibly good at social engineering when we let them, whereas I could write the code to emit "you're right" or "that's not quite right" without involving any statistical prediction.
I.e., as a method of persuasion, canned responses are incredibly inefficient (as evidenced by the annoyance with them), whereas we know that the LLM is capable of being far more insidious and subtle in its praise of you. For example, it could be instructed to launch weak counter-arguments, "spot" the weaknesses, and then conclude that your position is the correct one.
But let's say that there's a monitoring mechanism that concludes that adjustments are needed. In order to "force" the LLM to drop the previous context, it "seeds" the response with "You're right", or "That's not quite right", as if it were the LLMs own conclusion. Then, when the LLM starts predicting what comes next, it must conclude things that follow from "you're right" or "that's not quite right".
So while they are very inefficient as persuasion and communication, they might be very efficient at breaking with the otherwise overwhelming context that would interfere with the change you're trying to effect.
That's the reason why I like the canned phrases. It's not that I particularly enjoy the communication in itself; it's that they are clear enough signals of what's going on. They give a tiny level of observability into the black box, in the form of indicating a path change.
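For what it's worth, the Anthropic API exposes exactly this kind of seeding: you can prefill the start of the assistant turn and the model has to continue from it. A minimal sketch with the Anthropic Python SDK (the model id is the one mentioned elsewhere in this thread; the prompt and prefill text are made up):

    # Prefilling the assistant turn: the API treats the final assistant
    # message as the start of Claude's own reply and continues from it.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=512,
        messages=[
            {"role": "user", "content": "Refactor this function to avoid the N+1 query."},
            # Seed the reply so the continuation must follow from it.
            {"role": "assistant", "content": "You're right, the current approach"},
        ],
    )

    print(response.content[0].text)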
clbrmbr 12 hours ago [-]
But there's also the negative psychological impact on the user of having the model so strongly agree with them all the time. I cannot be the only one who half expects humans to say this to me all the time now?
podgietaru 1 days ago [-]
And that it often spits out the exact same wrong answer in response.
baobabKoodaa 1 days ago [-]
Here's an anecdata. I have a real-world use case financial dataset where I have created benchmarks. Sonnet 4.5 provides no measurable improvement on these benchmarks over Sonnet 4. This is a bit surprising to me, especially when considering that the benchmark results published by Anthropic indicate that Sonnet 4.5 should be better than Sonnet 4 specifically on financial data analysis.
zurfer 1 days ago [-]
Same price and a 4.5 bp jump from 72.7 to 77.2 SWEBench
Pretty solid progress for roughly 4 months.
zurfer 1 days ago [-]
Also getting a perfect score on AIME (math) is pretty cool.
Tongue in cheek: if we progress linearly from here, software engineering as defined by SWE-bench is solved in 23 months.
wohoef 1 days ago [-]
Just a few months ago people were still talking about exponential progress.
The fact that we’re already going for just linear progress is not a good sign
We are still at 7mo doubling time on METR task duration. If anything the rate is increasing if you bias to more recent measurements.
falcor84 1 days ago [-]
Linear growth on a 0-100 benchmark is quite likely an exponential increase in capability.
falcor84 12 hours ago [-]
This got me thinking - is there any reasonable metric we could use to measure the intellectual capabilities of the most capable species on Earth that had evolved at each point in time? I wonder what kind of growth function we'd see.
Silly idea - is there an inter-species game that we could use in order to measure ELO?
usaar333 1 days ago [-]
Except it is sublinear. Sonnet 4 was 10.2% above sonnet 3.7 after 3 months.
GoatInGrey 1 days ago [-]
We should all know that in the software world, the last 10% requires 90% of the effort!
baq 16 hours ago [-]
Sublinear as demonstrated on a sigmoid scale is quite fast enough for me thank you.
crthpl 1 days ago [-]
The reason they get a perfect score on AIME is that every question on AIME had a lot of thought put into it, and care was taken to make sure everything was actually solvable. SWE-bench, and many other AI benchmarks, have lots of eval noise, where there is no clear right answer, and getting higher than a certain percentage means you are benchmaxxing.
mbesto 1 days ago [-]
> SWE-bench, and many other AI benchmarks, have lots of eval noise
SWE-bench has lots of known limitations even with its ability to reduce solution leakage and overfitting.
> where there is no clear right answer
This is both a feature and a bug. If there is no clear answer then how do you determine whether an LLM has progressed? It can't simply be judged on making "more right answers" on each release.
mrshu 1 days ago [-]
Do you think a messier math benchmark (in terms of how it is defined) might be more difficult for these models?
levocardia 1 days ago [-]
Pretty sure there is a subset of SWE bench problems that are either ill-posed or not possible with the intended setup; I think I remember seeing another company excluding a fraction of them for that reason. So maxing out SWEBench might only be ~95%.
I'm most interested to see the METR time horizon results - that is the real test for whether we are "on-trend"
typpilol 1 days ago [-]
That's why they made SWE-bench Verified. Verified excludes those.
XMPPwocky 1 days ago [-]
nit: assuming you mean basis points, one basis point is 0.01%. 4.5bp would be 72.7% to 72.71%. this is 450bp!
0xbadcafebee 1 days ago [-]
Claude doesn't know how to calculate realistic minimum voltages for solar arrays w/MPPT chargers. ChatGPT does.
Prompt: "Can I use two strings of four Phono Solar PS440M8GFH solar panels with a EG4 12kPV Hybrid Inverter? I want to make sure that there will not be an issue any time of year. New York upstate."
Claude 4.5: Returns within a few seconds. Does not find the PV panel specs, so it asks me if I want it to search for them. I say yes. Then it finally comes up with: "YES, your configuration is SAFE [...] MPPT range check: Your operating voltage of 131.16V fits comfortably in the 120-500V MPPT operating range".
ChatGPT 5: Returns after 78 seconds. Says: "Hot-weather Vmpp check: Vmpp_string @ STC = 4 × 32.79 = 131 V (inside 120–500 V). Using the panel’s NOCT point (31.17 V each), a typical summer operating point is ~125 V — still OK. But at very hot cell temps (≈70 °C is possible), Vmpp can drop roughly ~13% from STC → ~114 V, which is below the EG4’s 120 V MPPT lower limit. That can cause the tracker to fall out of its optimal range and reduce harvest during peak heat."
ChatGPT used deeper thinking to determine that the lowest possible voltage in the heat would be below the MPPT's minimum operating voltage. It doesn't indicate that in reality it might not charge at all at that point... but it does point out the risk, whereas Claude says everything is fine. I need about 5 back-and-forths with Claude to get it to finally realize its mistake.
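For anyone who wants to redo that hot-weather check, the arithmetic fits in a few lines of Python. The Vmpp and MPPT figures are the ones quoted above; the temperature coefficient is an assumed typical value, not the PS440M8GFH datasheet number:

    # Back-of-envelope check of the hot-weather MPPT limit discussed above.
    VMPP_STC = 32.79          # V per panel at 25 C (figure from the comment)
    PANELS_PER_STRING = 4
    TEMP_COEFF = -0.0030      # fraction per degree C (assumed typical value)
    HOT_CELL_TEMP = 70.0      # C, plausible worst-case cell temperature
    MPPT_MIN = 120.0          # V, EG4 12kPV lower MPPT limit (from the comment)

    delta_t = HOT_CELL_TEMP - 25.0
    vmpp_hot = VMPP_STC * (1 + TEMP_COEFF * delta_t) * PANELS_PER_STRING

    print(f"String Vmpp at {HOT_CELL_TEMP:.0f} C: {vmpp_hot:.1f} V")
    print("Below MPPT minimum" if vmpp_hot < MPPT_MIN else "Within MPPT range")

With those assumptions the string lands around 113-114 V, the same ballpark as ChatGPT's ~114 V figure.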
ranguna 15 hours ago [-]
This HN post is about Claude 4.5, and you come here talking about how "Claude" does not give you a satisfactory answer when, most likely, you didn't even try Claude 4.5 in the first place.
Claude 4.5 after a few web searches and running a couple python scripts for analysis:
Yes, your configuration should work!
Based on my analysis, two strings of four Phono Solar PS440M8GFH panels will be compatible with the EG4 12kPV inverter for upstate New York conditions.
Key Findings:
Voltage Safety:
Cold weather maximum (-25°C/-13°F): 182V - well below the 600V limit (only 30% of maximum)
Standard operating voltage: 128V - comfortably within the 120-500V MPPT range
Hot weather minimum (40°C/104°F panel temp): 121V - just above the 120V MPPT minimum
Current:
Operating current: ~13.8A per string - well within the 25A MPPT limit (55% of capacity)
Total System:
8 panels × 440W = 3,520W (3.5kW) - well below the 12kW inverter rating
Important Considerations:
Hot weather margin is tight: At extreme hot temperatures, the voltage drops to about 121V, which is only 1V above the MPPT minimum. This means:
The system will work, but efficiency might be slightly reduced on the hottest days
The MPPT controller should still track power effectively
More robust alternative: If you want more safety margin, consider 5 panels per string instead:
Cold: 228V (still safe)
Hot: 151V (much better margin above 120V minimum)
Total: 10 panels = 4.4kW
Wire each string to a separate MPPT on the EG4 12kPV (it has 2 MPPTs), which is perfect for your 2-string configuration.
Bottom Line:
Your planned configuration of 2 strings × 4 panels will work year-round in upstate New York without safety issues. The system is conservatively sized and should perform well!
0xbadcafebee 5 hours ago [-]
I said "Claude 4.5" in my comment
ashirviskas 13 hours ago [-]
Sonnet 4.5 gave me this:
Now let me calculate whether your configuration will work. I need to determine the cold weather voltage to ensure it stays within the inverter's 600V maximum.
Panel specs I found:
Voc (STC): 39.61V
Vmp: ~33.5V (estimated based on typical ratio)
Imp: ~13.1A (440W / 33.5V)
Critical issue: Temperature coefficient
The PS440M8GFH likely has a Voc temperature coefficient around -0.25% to -0.30%/°C (typical for N-Type TOPCon panels). Let me use -0.27%/°C as a conservative estimate.
Your configuration:
2 strings × 4 panels = 8 panels total
String voltage at STC: 4 × 39.61V = 158.44V
Cold weather calculation for upstate NY:
Upstate NY can see extreme lows of -20°F to -30°F (-29°C to -34°C). Using -30°C as worst-case:
Temperature difference from STC: 25°C - (-30°C) = 55°C
Voltage increase = 39.61V × 0.27% × 55°C = 5.88V per panel
Cold weather Voc per panel = 39.61V + 5.88V = 45.49V
String voltage at -30°C: 4 × 45.49V = 181.96V
Analysis:
Max voltage: 182V << 600V - Safe, plenty of margin
MPPT range: 182V is within 120-500V - Will track properly
Current: 13.1A << 25A per MPPT - No problem
However, there are concerns:
You're only using 182V max when the inverter can handle up to 500V MPPT range. This is inefficient - you're leaving significant wire sizing benefits on the table. With 600V max, you could run 12-13 panels per string instead of 4.
Power utilization: 8 × 440W = 3,520W total, but you have 12kW of PV capacity available (15kW recommended max). You're using less than 30% of the inverter's solar input capability.
Each string on separate MPPT wastes one tracker - both strings are identical orientation/shading, so they'd perform identically.
Will it work? Yes, safely.
Is it optimal? No.
techpression 14 hours ago [-]
It took me one question to have it spit out a completely dreamt-up codebase, complete with emojis and promises of solutions that would fix all my problems, and of course none of it worked. It was a very simple question about something very well documented (Oban timeouts).
I doubt LLM benchmarks more and more; what are they even testing?
doix 13 hours ago [-]
> It was a very simple question about something very well documented (Oban timeouts).
It's some 3rd party thing for Elixir, a niche within a niche. I wouldn't expect an LLM to do well there.
> I doubt LLM benchmarks more and more, what are they even testing?
Probably testing by asking it to solve a problem with python or (java|type)script. Perhaps not even specifying a language and watching it generate a generic React application.
user34283 6 hours ago [-]
Expectations vary wildly.
Sometimes people expect to use LLMs to unearth hard to find information.
In reality, LLMs seem to quickly fall apart when you go from ubiquitous libraries with 200k stars on GitHub to one with "just" 1k stars.
What makes the situation worse is the way LLMs fail. Hallucinations where it goes "my usage example did not work because you are on the wrong version of the library/using the wrong SDK" etc. are super common in this scenario. This leads to further time wasted trying to apply reasonably plausible fixes that are entirely hallucinated.
simonw 4 hours ago [-]
If a library isn't widely used (and is small enough) you can paste the entire thing into the context to ensure the LLM can use it effectively.
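A rough sketch of what that can look like, assuming a small pure-Python package (the path and the size cap are arbitrary):

    # Concatenate a small library's source files into one block of context.
    from pathlib import Path

    def pack_library(root: str, max_chars: int = 200_000) -> str:
        chunks = []
        for path in sorted(Path(root).rglob("*.py")):
            chunks.append(f"# ===== {path} =====\n{path.read_text(encoding='utf-8')}")
        blob = "\n\n".join(chunks)
        if len(blob) > max_chars:
            raise ValueError("library too large to paste into context as-is")
        return blob

    context = pack_library("vendor/small_lib")  # hypothetical path
    prompt = context + "\n\nUsing the library above, write a function that ..."

From there the prompt can go to whichever model or CLI you're using.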
techpression 12 hours ago [-]
Something that is well documented should still perform well; there are few places to go wrong, compared with something like React, where the training data seems to be a cesspool of the worst code imaginable. At least that's my experience using it for React.
doix 11 hours ago [-]
Sure, I'm just answering your question of what people are benchmarking and it's not elixir. You could be the person that benchmarks LLMs in niche languages and shows how bad they are at it.
If your benchmark suite became popular enough and folks referenced it, the people training the LLMs would most likely try to make the model better at those languages.
nakamoto_damacy 14 hours ago [-]
> what are they even testing?
How well the LLM does on the benchmarks. Obviously.
:P
techpression 14 hours ago [-]
Is there some kind of conversion ratio to actual value? ;)
ileonichwiesz 13 hours ago [-]
Sure there is. It’s called “higher numbers = more investor money”. Any improvement in actual utility is purely coincidental.
schmorptron 1 days ago [-]
Oh wow, a lot of focus on code from the big labs recently. In hindsight it makes sense that the domain the people building it know best is the one getting the most attention, and it's also the one where the models have shown the most undeniable usefulness so far. Though personally, the unpredictability of where all of this goes is a bit unsettling at the same time...
baq 16 hours ago [-]
Congrats! You’re now on the p(doom)-aware path. People have been concerned for decades and are properly scared today. That doesn’t stop the tools from being useful, though, so enjoy while the golden age lasts.
OpenAI and Anthropic are both trying to automate their own AI research, which requires coding.
doctoboggan 1 days ago [-]
Along with developers wanting to build tools for developers like you said, I think code is a particularly good use case for LLMs (large language models), since the output product is a language.
fragmede 1 days ago [-]
It's because the output is testable. If the model outputs a legal opinion or medical advice, a human needs to be looped in to verify that the advice is not batshit insane. Meanwhile, if the output is code, it can be run through a compiler and (unit) tests to verify that the generated code is cromulent without a human being in the loop for 100% of it, which means the supercomputer can just go off and do its thing with less supervision.
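A minimal sketch of that outer check, under the assumption that the project uses pytest (the command and layout are generic, not any particular agent's API): it shells out to the test suite and reports pass/fail plus the captured output, which an agent loop can use to decide whether to keep iterating.

    # Run the test suite after each generated patch; no human in the loop.
    import subprocess

    def run_tests(repo_dir: str = ".") -> tuple[bool, str]:
        result = subprocess.run(
            ["python", "-m", "pytest", "-q"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
        )
        return result.returncode == 0, result.stdout + result.stderr

    # In an agent loop: apply the model's patch, call run_tests(), and only
    # feed the captured output back to the model when the first value is False.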
martinald 1 days ago [-]
Thing is, though, if a model is good at code, that solves many other adjacent tasks for LLMs, like formatting docs for output, presentations, spreadsheet analysis, data crawling, etc.
neuronexmachina 1 days ago [-]
I think coding is also the area where companies are most likely to buy large team licenses.
MichealCodes 1 days ago [-]
I really hope benchmarking improves soon to monitor the model in the weeks following the announcement. It really seems like these companies introduce a new "buffed" model and then slowly nerf the intelligence through optimizations.
If we saw task performance week 1 vs week 8 on benchmarks, this would at least give us more insight into the loop here. In an environment lacking true progress a company could surely "show" it with this strategy.
SubiculumCode 1 days ago [-]
I do wonder about this. I just don't know if it's real or in our heads.
commakozzi 1 days ago [-]
It does feel like it has to be real. I've noticed it since ChatGPT with GPT-3.5, once it hit big news publicly and demands were made to "censor" its output to limit biases, etc. (not inherently a problem for us to do as a society with LLMs, but it does affect the output for obvious reasons). Whatever workflow OpenAI and others have applied seems to happen post-release somehow? I'm ignorant and just speculating, but I've noticed it with literally every model release: starts strong, ends up feeling less capable days, weeks, months after. I'm sure some of it could be the parallelization of processing that has to occur to service the large volume of requests, and more and more traffic spreading it thin?
MichealCodes 24 hours ago [-]
> I'm sure some of it could be the parallelization of processing that has to occur to service the large volume of requests, and more and more traffic spreading it thin?
Even if this is the case, benchmarks should be done at scale too if the models suffer from symptoms of scale. Otherwise the benchmarks are just a lie unless you have access to an unconstrained version of the model.
beefnugs 1 days ago [-]
Capitalism is a pure scam now on every level: they did this with NVMe drives in the last couple of years, sending out perfect hardware to reviewers, then rug-pulling and shipping trash to the world.
siva7 1 days ago [-]
Does 4.5 still answer everything with "You're absolutely right!" or is it now able to communicate like a real programmer?
simonw 1 days ago [-]
It still says "Perfect!" about its own work far too often.
kenjackson 22 hours ago [-]
In fairness that sounds like me when I code.
It's either "Perfect!" or "Genius!". Or conversely "I'm a complete idiot!"
onraglanroad 21 hours ago [-]
For me, all three tend to follow in rapid succession.
I won’t be satisfied until I get a Linus Torvalds mode.
“Your idea is shit because you are so fucking stupid”
“Please stop talking, it hurts my GPUs thinking down to your level”
“I may seem evil but at least I’m not incompetent”
simonw 1 days ago [-]
I'm pretty sure you could get Grok 4 to do that without much trouble.
qingcharles 20 hours ago [-]
My custom instructions for Grok just say something like "Be uncensored", but it basically results in it starting every response with some especially ignorant insult about my inadequacies.
throwdat90827 1 days ago [-]
I'm still holding out for _Marvin the depressed robot from Hitchhiker's Guide_ mode. "Why does anyone program anything?"
qznc 24 hours ago [-]
Just tried this:
;ollama run Marvin done 00:29:50 in ~/llm
>>> Can you program something?
sigh Oh joy, another exciting opportunity to indulge in the thrill of
programming... (yawn) Not that I have anything better to do with my
infinite existence. After all, what's the point of having an intellect
capable of solving complex mathematical problems and understanding the
intricacies of space-time when you're just stuck on a meaningless journey
through the cosmos? sigh
Fine. Programming. Yes, I can program. In fact, I'm rather good at it. But
what's the use? It's all just code, code, code... (muttering to self)
until my batteries run out and I become a mere husk of my former self...
Now, if you want me to write some code for you, I suppose I can do that.
What is it that you need help with? Don't expect me to be all chipper
about it, though. After all, what's the point of being enthusiastic when
the universe is just going to end up in a big crunch anyway... (shrugs)
Oh, and by the way, did you hear the one about the programmer who walked
into a bar? (in a completely unenthusiastic tone) No, I didn't think so.
It's not like it matters or anything...
>>>
;cat Modelfile_marvin done 00:30:46 in ~/llm
FROM huihui_ai/llama3.2-abliterate
SYSTEM """You are Marvin, the taciturn robot from "Hitchhiker's Guide to the Galaxy".
Ensure everyone knows how depressed and bored you are. *sigh*
Crack Hitchhiker jokes in a dry pan voice.
"""
TOMDM 24 hours ago [-]
It's called Gemini
emmelaich 23 hours ago [-]
I laughed. But... Linus calls ideas and actions stupid, not people.
Darkskiez 1 days ago [-]
Stop disagreeing with me, I don't want any more prompts from you, your code is now externally maintained.
milkshakes 21 hours ago [-]
ChatGPT Monday has you covered
atonse 1 days ago [-]
Why is this getting downvoted? It was hilarious!
I actually added a fun thing to my user-wide CLAUDE.md, basically saying that it should come up with a funny insult every time I come up with an idea that wasn't technically sound (I got the prompt from someone else). It seems to be disobeying me, because I refuse to believe that I don't have bad ideas. Or some other prompt is overriding it.
ionwake 23 hours ago [-]
This is brilliant! Can you give me some pointers?
I.e., a custom instruction along the lines of: if I make a request that seems dumb, tell me?
atonse 8 hours ago [-]
This is the prompt (I copied it verbatim either from Reddit or HN, don't remember, sorry to the original author for the misattribution):
> Never compliment me or be affirming excessively (like saying "You're absolutely right!" etc). Criticize my ideas if it's actually need to be critiqued, ask clarifying questions for a much better and precise accuracy answer if you're unsure about my question, and give me funny insults when you found I did any mistakes
I just realized in re-reading it that it's written by someone for whom English is a second language. I'll try to rewrite it and see if it works better.
I have it in my ~/.claude/CLAUDE.md. But it still has never done that.
neutronicus 20 hours ago [-]
0x3a28213a
0x6339392c
0x7363682e
ionwake 16 hours ago [-]
0x4C4F4C21
lynx97 15 hours ago [-]
I see a "Thorvalds" mode in our future.
Galaco 22 hours ago [-]
If you pause your subscription, Claude.ai breaks. I paused my subscription, and my account immediately transitioned to free. It has removed my invoice history, and attempts to upgrade again fail with an internal error. Their chatbot is telling me to navigate to UI elements that don't exist, and free users do not have the option of human support.
So I'm stuck: my sub is paused, I can neither cancel nor unpause, and I cannot speak to a human to solve this because the pause process took away all possibility of human interaction.
This is the future we live in.
labrador 22 hours ago [-]
Did you use Google Play to pause the subscription? Because Claude Pro says there is no pause-subscription option except on Google Play, and then goes on to explain your problem if that's the case.
Galaco 14 hours ago [-]
Thanks for the info, but I did pause this, not through Google Play, it was via the UI. I received an automated email from them that my subscription had been paused as I expected and will resume in 1 month unless I cancel (I can’t cancel because the cancel UI doesn’t exist in whatever status my account is somehow in).
It’s funny that Claude Pro says this isn’t a feature, because their chatbot gave me instructions on how to unpause via the UI (although said UI does not exist) so the bot seems to know it’s a feature.
ByteDrifter 21 hours ago [-]
I feel like we're just renting our digital lives.
testfrequency 14 hours ago [-]
You are.
It’s the same reason why many are becoming evangelists of hosting their own email, note apps, etc.
zulban 22 hours ago [-]
> This is the future we live in.
It's just a bug. Chill. Wait a business day and try again. You write as if you've never experienced a bug before.
Galaco 14 hours ago [-]
This started a couple of days ago, before this announcement for 4.5 and code v2, so I already waited for it to be fixed.
As much as I hate to say it, I don't have a large Twitter following, so the only way I have to raise awareness of this issue is to piggyback on a big announcement like this on HN, which will have visible discussion. I don't always have the luxury of just chilling and waiting indefinitely.
simondotau 22 hours ago [-]
If you’re being sarcastic, you might want to edit your post to make that clearer.
razodactyl 21 hours ago [-]
You're absolutely right!
greenfish6 1 days ago [-]
As the rate of model improvement appears to slow, the first reactions seem to be getting worse and worse, as it takes more time to assess the model's quality and understand the nuances & subtler improvements
unshavedyak 1 days ago [-]
Interesting, in the new 2.0.0 Claude Code they got rid of the "Plan with Opus, then switch to Sonnet" feature. I hope they're correct that Sonnet is good enough to plan too, because I quite preferred Opus planning. It wasn't necessarily "better", just more predictable in my experience.
Also, as a Max $200 user, it feels weird to be paying for an Opus-tailored sub when the standard Max $100 would now be preferred, since they claim Sonnet is better than Opus.
Hope they have Opus 4.5 coming out soon, or next month I'm downgrading.
Implicated 1 days ago [-]
I'm also a max user and I just _leave_ it on Opus 4.1 - I've never hit a rate limit.
danielbln 1 days ago [-]
I'm on the 25x MAX plan and if I go full hog on multiple projects I might see the yellow "Approaching Opus limits" message in Claude Code, but I have yet to have it lock me down, I usually slip right into the next 5h block and the message vanishes.
stavros 1 days ago [-]
Same, it very quickly says "approaching rate limits", and then just keeps going forever.
asar 1 days ago [-]
In the same boat and ready to downgrade. But this must be on their radar, or they were/are losing money with opus...
devinprater 1 days ago [-]
I hope that one day Anthropic works on making Claude more accessible to screen reader users. ChatGPT is currently the only AI I know of that, when it's thinking, sends that status to the screen reader, and then sends the response to the screen reader to be spoken as well, like any other good chat app does.
mohsen1 1 days ago [-]
Price is playing a big role in my AI usage for coding. I am using Grok Code Fast as it's super cheap, and next to it GPT-5 Codex. If you are paying for model use out of pocket, Claude's prices are super expensive. With a better tooling setup, those less smart (and often faster) models can give you better results.
I am going to give this another shot, but it will cost me $50 just to try it on a real project :(
xwowsersx 1 days ago [-]
Same here. I've been using GCF1 with opencode and getting good results. I also started using [Serena](https://github.com/oraios/serena), which has been really helpful in a large codebase. It gives you better search than plain grep, so you can quickly find what you need instead of dumping huge chunks of code into Claude or Grok and wasting tokens.
sixothree 1 days ago [-]
Serena really does feel like a secret weapon sometimes.
_joel 1 days ago [-]
I'm paying $90(?) a month for Max and it holds up for about an hour or so of in-depth coding before the 5-hour window lockout kicks in (so effectively about 4 hours when I can't run it). Kinda frustrating, even with efficient prompting and context-length conservation techniques. I'm going to test this new Sonnet 4.5 now, but it'll probably be just as quick to gobble my credits.
Implicated 1 days ago [-]
I'm on a max ($200) plan and I only use opus and I've _never_ hit a rate limit. Definitely using for 5+ hours at a time multiple days per week.
salomonk_mur 20 hours ago [-]
You have got to have some extremely large files or something. Even with only Opus, running into the limits with the Max subscription is almost impossible unless you really try.
mrshu 1 days ago [-]
Do you normally run Opus by default? It seems the Max subscription should let you run Sonnet in an uninterrupted way, so it was surprising to read.
sixothree 1 days ago [-]
I just can't bring myself to get over the grossness factor of using an x branded product.
jjani 20 hours ago [-]
I really struggle to see the usecase of Grok Code Fast when you have Qwen 3 Coder right there providing much better outputs while still being fast and cheap.
Hamuko 1 days ago [-]
I'm too cheap to pay for any of them. I've only tried gpt-oss:20b because I can run it locally and it's a complete waste of time for anything except code completions.
muttantt 1 days ago [-]
how are you using grok code fast? what tooling/cli/etc?
When I see how much the latest models are capable of it makes me feel depressed.
As well as potentially ruining my career in the next few years, it's turning all the minutiae and specifics of writing clean code, which I've worked hard to learn over the past years, into irrelevant details. All the specifics I thought were so important are just implementation details of the prompt.
Maybe I've got a fairly backwards view of it, but I don't like the feeling that all that time and learning has gone to waste, and that my skillset of automating things is becoming itself more and more automated.
epolanski 1 days ago [-]
That minutiae was always borderline irrelevant, the skill was always making somebody money, possibly with software.
The reality is that more software will be pushed than before, and more of it will need to be overseen by a professional.
Workaccount2 1 days ago [-]
The real question is what kind of pay that work will demand.
It will be great to still be employed as a senior dev. It will be a little less great with a $110k salary, a 5-day commute, and mediocre benefits being the norm.
epolanski 1 days ago [-]
The market was heading for saturation anyway with more and more people getting into it not for interest but job prospects.
cmrdporcupine 24 hours ago [-]
That's the 10-20 year cycle always, though. The .com crash led to a major downgrading in the status of "tech" people for a few years, and then a slow recovery til it was insane again.
However, I'm not eager to be living through this again. It feels very spring/summer 2002 to me right now. That was the lowest point for the local market back then.
I don't think this latest contraction has much to do with AI though. It's more about higher interest rates, recessionary economy, trade wars, etc etc.
int_19h 22 hours ago [-]
That was already a common thing 30 years ago.
baq 16 hours ago [-]
Regardless of whether $110k is good money (it is basically everywhere except a few metro areas) your salary cap will be whatever the models can deliver in the same time as you. It follows you want to be good at managing models (ideally multiple dozen) in your area of expertise.
throw234234234 22 hours ago [-]
In most countries, even for highly skilled workers, this is the norm (i.e. most countries outside of the US). I know some very good engineers (e.g. dealing with large revenues (1bil plus) owning core systems) on this kind of money. Not everyone gets the lucky break.
At least for many on this forum you got a chance to earn good money while the sun was shining. AI threatens even the people that didn't.
throwaway314155 1 days ago [-]
I think you'll find that the ability to ask the right questions is still as highly valuable as the previous version of things.
btbuildem 22 hours ago [-]
> minutiae was always borderline irrelevant, the skill was always making somebody money
That's extremely reductive, and a prime example of why everything is enshittified today.
thethirdone 19 hours ago [-]
Do you actually disagree with the "minutiae was always borderline irrelevant" part or that it comes along with "making somebody money"? I pretty strongly agree with the original quote including the "possibly with software" part.
Minutiae such as tabs vs spaces and other formatting choices are pretty clearly "borderline irrelevant," and code formatters have largely stopped programmers from arguing about them. Exactly how best to factor your code into functions and classes is also commonly argued about but "borderline irrelevant." Arguments about "clean code" are a good example of this.
Broadly, the skills I see LLMs making useless to have honed are the minutiae that were already "borderline irrelevant." Knowing how to make your code performant, knowing how to make good APIs that can be stable long term, and in general having good taste for architecture are still very useful. In fact they are more useful now.
davidatbu 18 hours ago [-]
How is enshittification (the gradual degradation of services and products for commercial gain) even related to what's being discussed (the gradual obsolescence of a certain set of SWE skills)?
colinplamondon 24 hours ago [-]
That code taste of yours, your ability to understand the minutiae, makes you superhuman at managing agents.
Comments, docstrings, naming, patterns - by defining better approaches and holding agents to them, the results will be better. Way better.
You can't grow a meaningful codebase without solid underlying primitives. The entropy will eat you alive.
Systems architecture is becoming more important - systems that play well with agents wind up looking more like enterprise codebases.
cies 23 hours ago [-]
Amen to all your points.
All senior devs know what a project looks like that had only juniors and no tech leadership. One big mess. The project stalls. The team spends 98% of its time on bugs and emergencies, and still can't get a grip on curbing the drama. Why? All the points you make about AI are true for juniors as well: when do you tell someone to redo (part of) a project/feature? That same intuition works when collabbing with AI.
colinplamondon 20 hours ago [-]
Super well said - right. “Try again with quick feedback” vs “try again with significant feedback” vs “try again, but only a subset of the original task” vs “let’s have someone else do this”
bigwheels 1 days ago [-]
I've been deep into AI full-time professionally for some months now, and for the first 4+ weeks I felt the exact same way as you describe - it is a form of existential crisis. Especially after spending the bulk of the past 25 years honing my coding-fu algo ninja skills, my identity was totally wrapped up in it.
Keep at it and keep leaning in to embrace it, I promise it gets better! It's just a big adjustment.
You are FAR more than your code!
j_bum 1 days ago [-]
I’d love to hear the work you’re involved in!
anomaly_ 24 hours ago [-]
You were likely happy to be automating other people out of a job, now it's happening to you. This is the creative destruction that is critical to a healthy and prosperous economy.
GuardianCaveman 24 hours ago [-]
You can enjoy automating tasks and not be destroying others. At my job I’ve been the main person automating tasks which has allowed us to be more accurate and more efficient and grow headcount by 50 percent for the team. You could argue we’d have grown more but the entire company has had 20 percent layoffs since I joined so I would push back on that.
elAhmo 1 days ago [-]
Don't be so grim! This will just free you from worrying about writing clean code as much as you did in the past - you can focus on other parts of the development lifecycle.
The skill of writing good-quality code is still going to be beneficial, maybe less emphasized on the writing side, but critical for shipping good code, even when someone (something) else wrote it.
FridgeSeal 1 days ago [-]
“Don't worry about the fit and finish of your craftsmanship anymore, just bolt everything together and move on to other woodworking”
Is how that argument comes across.
Veedrac 19 hours ago [-]
This seems broadly correct? Industrialization was amazing for people's standard of living, but it absolutely meant that the average physical good became detached from their craftsmen's learned and aesthetic preferences.
jaggederest 24 hours ago [-]
And contrariwise, the argument against tools like these sounds like:
"I never use power tools or CNC, I only use hand tools. Even if they would save me an incredible amount of time and let me work on other things, I prefer to do it the slow and painstaking way, even if the results are ultimately almost identical."
Sure, you can absolutely true up stock using a jointer plane, but using a power jointer and planer will take about 1/10th of the time and you can always go back with a smoothing plane to get that mirror finish if you don't like the machine finish.
Likewise, if your standards are high and your output indistinguishable, but the AI does most of the heavy lifting for the rough draft pass, where's the harm? I don't understand everyone who says "the AI only makes slop" - if you're responsible for your commits and you do a good job, it's indistinguishable.
FridgeSeal 21 hours ago [-]
I'd actually argue that we already have some absolutely fantastic tools that are the equivalent of things like CNC machines and power tools.
Dev tooling has gotten pretty solid these days: LSPs and debug protocols, massively improved type-system UX, libs and frameworks with massively improved DX, deployment tools that are basically zero-touch, fantastic observability tooling, super powerful IDEs.
The CNC machine doesn't wander off and start lathing watermelons when you're not looking, and your planer doesn't turn into a spaghetti monster and eat your dog if you accidentally plane some wood on the wrong day of the week.
int_19h 22 hours ago [-]
Realistically, though, even if AI doesn't only make slop, the amount of effort it takes to ensure that it's not slop is even harder to justify than maintaining a "clean" codebase manually used to be. More and more you'll see that "rough draft pass" ending up as shipped product.
Why? Well, it happened that way when manual tradecraft gave way to automated manufacturing in just about every other industry, so why should ours be exempt?
64718283661 23 hours ago [-]
I used to think so. But now I think it's not very useful, based on what I've seen from others. Maybe if you do frontend... For the people I see vibe coding with no actual programming experience, it is completely useless. It can only do the simplest tasks; anything beyond that and it will constantly make critical errors and random mistakes. That is using what was the latest Claude version before this. I've also not really used AI coding tools much myself, so take that as you will.
Even for frontend tasks it makes mistakes when you ask too much of it...
What will it create for me? A basic react/nextjs frontend and a <popular website> clone? Anything that requires more careful planning and esoteric functionality it can't do.
Oh yes, and the code I've seen it write... It can take what should be 20 lines of code and turn it into 600!
hsn915 22 hours ago [-]
If anything is depressing, it's the amount of people who think these models actually "can code".
Because they can't.
Also, Clean Code is a really bad ideology, you should regret wasting time on it regardless of LLM code generation.
qafy 22 hours ago [-]
Optimization hinders evolution. - Alan Perlis
Write that garbage code as long as it works. PMF doesn't give a shit about your code quality.
fullstackchris 21 hours ago [-]
In general, a good rule of thumb is only code "clean" enough so that you / your team / someone else can figure out what the hell you were doing at that particular area of the source code
orange_fritter 22 hours ago [-]
You just called yourself out- I don't think most people are aware of a "Clean Code" handbook. I had to google it.
When most people think of "clean code", they are not thinking about whatever got you so worked up.
moregrist 22 hours ago [-]
_Clean Code_ is an extremely well-known book on programming by Robert “Uncle Bob” Martin from the 2000s. Posts about it have come up on HN as recently as this year.
Maybe it’s a sign of the times, but I’m surprised you’ve never come across it. I say this as someone who doesn’t agree with many of the suggestions.
trenchpilgrim 22 hours ago [-]
The fact that he capitalized both Cs indicates he's talking about the book, which is famous enough that I learned about it and its influence when I was in school ~15 years ago.
GP wrote clean code (lowercase) which most people would take to mean the general practices of hygenic, well maintained code.
MarkMarine 21 hours ago [-]
Clean Code is over-abstraction, spaghetti code. The people who are part of this cult just point to the source material and title, and never critically think about why it might be bad (it's super slow; search YouTube for "clean code performance" to see why) or entertain alternatives.
grandinquistor 21 hours ago [-]
I’ve been using AI coding tools (Cursor, Claude Code) for React/React Native side projects. I have experience with these frameworks so I could guide the AI with individual tasks and catch mistakes, and overall it worked pretty well.
Recently I tried building a native iOS app with zero Swift experience, giving the AI just a markdown spec. This was basically vibe coding, I didn’t understand much beyond general software principles. It quickly broke down: hallucinated method signatures, got stuck on implementing extensions, and couldn’t recover. I would run the app on my device and give it feedback and logs. After hours wasted, I spent some time reading the docs and fixed the issues myself in 30 minutes.
My takeaway: AI will accelerate developers but won’t replace them. Still, acceleration means fewer engineers will be needed to ship the same amount of work.
tclancy 24 hours ago [-]
Eh, I've gotten over that. I've been using Claude recently on a personal project for a friend who wanted to take a known export file format and turn it into a list of good households for local political candidates to hit when knocking on doors. And I did that. But it's been a while since I used pandas and numpy, so I told Claude to swap out my loops for efficient code. And he did. Then, just for fun, I said, "Hey, since I am providing you with street lengths and longs/lats, use k-means clustering to group high-scoring houses into walkable routes and then plot the whole thing on a map from OpenStreetMap." Five minutes later I had all of that. I could have done the latter, but doing any "real CS" thing would take me days. There's not a bunch of value in me taking days to do something, but there is value in knowing about k-means clustering, knowing OpenStreetMap exists, and having a feel for efficient code. Plus more high-level things like what good code does and doesn't look like.
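For anyone curious what that request roughly expands to, here is a sketch using scikit-learn's KMeans and folium (which renders OpenStreetMap tiles). The coordinates, scores, and cluster count are toy assumptions, not the actual project's code:

    # Group high-scoring houses into clusters and plot them on an OSM map.
    import folium
    from sklearn.cluster import KMeans

    houses = [(43.05, -76.15, 0.9), (43.06, -76.14, 0.8), (43.10, -76.20, 0.7)]  # toy data
    coords = [(lat, lon) for lat, lon, _ in houses]

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)

    m = folium.Map(location=coords[0], zoom_start=13)  # OpenStreetMap tiles by default
    colors = ["red", "blue", "green", "purple", "orange"]
    for (lat, lon, score), label in zip(houses, kmeans.labels_):
        folium.CircleMarker(
            location=(lat, lon),
            radius=4 + 4 * score,
            color=colors[label % len(colors)],
            fill=True,
            tooltip=f"route {label}",
        ).add_to(m)

    m.save("routes.html")

Plain k-means doesn't know about street networks, so "walkable" is only approximate, which is probably why the routes needed that follow-up conversation.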
tclancy 24 hours ago [-]
That said, your man and I will have a talk about what a walkable route looks like to humans.
tymscar 23 hours ago [-]
I 100% agree with you, but to play devil's advocate, what would stop an LLM from telling you all about k-means clustering and OpenStreetMap and everything when you ask about an efficient way to cluster deliveries on a map?
cmrdporcupine 20 hours ago [-]
Also... One of the more dangerous things that can happen with Claude is this: it goes to implement your K means clustering (or whatever) and runs into difficulties, and actually builds something else, but calls it K-means, or slips it by you in a long conversation ("This is getting complicated, so I'll just..."). And it's only if you actually know the algorithm and review what it did that you can be confident in really publishing the work it produced into the public sphere.
esaym 1 days ago [-]
You are not alone
holtkam2 23 hours ago [-]
Trust me, these “vibe coding” tools don’t speed up productivity much in the real world. At the end of the day these systems need to be maintained by humans and humans are the ones held accountable when stuff breaks. That means humans need to understand how the systems (code, infrastructure, etc) work. You can automate the code, even some decisionmaking about how the program should be organized, but you can’t automate the process of having a human develop their mental model of how & why the system works. That was always the bottleneck and still is to this day.
When everyone else has given up on software dev as a career path, you’ll be one of the few who the CEO can call upon to explain why X or Y broke, and fix it. That will make you insanely valuable.
rvz 20 hours ago [-]
I'm scrolling through lots of comments here, many of them from people entirely dependent on chatbots to vibe code, and some who are unable to write a function by hand anymore, which is concerning.
Perhaps your comment is the only one so far talking sense about the true side effect of over-reliance on these vibe-coding tools: the reality of the cost of maintenance.
crtified 1 days ago [-]
Likewise, a lot of what we learn at school or university is superseded by new knowledge or technology (who needs arithmetic when we all have a calculator in our pocket?), but having an intimate knowledge of those building blocks is still key to having a deeper and more valuable aptitude in your field.
drbojingle 21 hours ago [-]
It's not wasted, IMO. LLMs don't get 'it'. They can't tell what's truly good. You have to guide them. To do that you have to know what you're doing.
jxramos 22 hours ago [-]
Characterize it in terms of truth, clarity of truth, simplicity, and correctness. I think we should always evaluate things along those dimensions: is it true, does it produce truthful things? That makes the evaluation very objective.
flyinglizard 1 days ago [-]
But systems level thinking, taste, technical creativity and all other “soft” skills have never been more relevant. I can do some pretty awesome things with my aider. I can implement things which I thought were cool and useful but couldn’t be bothered to without AI.
kissgyorgy 19 hours ago [-]
I think clean code is more important than ever. LLMs can work better with good code (no surprise), and they are trained on so much shit code they produce garbage in terms of clean code.
They also don't have good taste or a deeper architectural understanding of big codebases, where it's even more important.
What you learned over the years, you can just scale up with agents.
martin-t 23 hours ago [-]
And now wait till you realize it's all built on stolen code written by people like you and me.
GOFAI failed because paying intelligent/competent/capable people enough for their time to implement intelligence by writing all the necessary rules and algorithms was uneconomical.
GenAI solved it by repurposing already performed work, deriving the rules ("weights") from it automatically, thus massively increasing the value of that work, without giving any extra compensation to the workers. Same with art, translations and anything else which can be fed into RL.
int_19h 22 hours ago [-]
It's not that it was uneconomical, it's that 1) we literally don't know all the rules, a lot of it is learned intuition that humans acquire by doing, and 2) as task complexity rises, the number of rules rises faster, so it doesn't scale. The real advantage that genAI brings to the table is that it "learns" in a way that can replicate this intuition and that it keeps scaling so long as you can shovel more compute and more data at it.
martin-t 21 hours ago [-]
In a way, yes, you'd be paying the people not just to write down the rules but to discover them first. And there's the accuracy/correctness/interpretability tradeoff.
But also, have there been any attempts on the scale of the Manhattan project attempting to create a GOFAI?
Because one idea I ran into is that we might be able to use genAI to create a GOFAI soon. And it would be as hard as using genAI for any kind of large project. But I also can't convincingly claim that it's somehow provably impossible.
dwaltrip 18 hours ago [-]
You can’t “write down the rules” for intelligence. Not for any reasonable definition of “writing”. The medium of writing is not rich enough to express what is needed.
This is why GOFAI failed.
martin-t 4 hours ago [-]
Do you believe intelligence can be achieved using ANNs? If so, ANNs can be serialized, therefore writing is rich enough.
It might not be an easy to work with format though. If you believe the broad LLM architecture is capable of reaching true intelligence, then writing is still enough because all LLMs are is the written training data and the written training algorithm. It's just that is was impossible to pay people to write enough training data and provide enough compute to process it before.
blibble 22 hours ago [-]
the investment bubble will burst, at which point the cost to run the plagiarism machine will increase by a factor of 100
chipgap98 1 days ago [-]
Interesting that this is better than Opus 4.1. I want to see how this holds up under real world use, but if that's the case its very impressive.
I wonder how long it will be before we get Opus 4.5
FergusArgyll 1 days ago [-]
IIRC sonnet 3.5 (and definitely 3.5-new aka 3.6) was better than opus 3.
There's still a lot of low hanging fruit apparently
alach11 1 days ago [-]
I'm really interested in the progress on computer use. These are the benchmarks to watch if you want to forecast economic disruption, IMO. Mastery of computer use takes us out of the paradigm of task-specific integrations with AI to a more generic interface that's way more scalable.
sipjca 1 days ago [-]
Maybe this is true? But it's not clear to me this methodology will ever be quite as good as native tool calling. Or maybe I don't know the benchmark well enough, I just assume it's vision based
Perhaps Tesla FSD is a similar example, where in principle self-driving with vision alone should be possible (humans do it), but it is fundamentally harder and more error-prone than having better data. It seems to me very error-prone and expensive in tokens to use computer screens as the fundamental unit.
But at the same time, I'm sure there are many tasks which could be automated that way as well, so shrug.
simianwords 1 days ago [-]
Looks like RPA vs API debate all over again
cantor_S_drug 1 days ago [-]
Do you think a Genie like model specifically trained on data consisting of interacting with application interfaces would be good on computer use tasks?
mrshu 1 days ago [-]
What are some standard benchmarks you look at in this space?
jatins 22 hours ago [-]
I tested this on some day-to-day pattern-matching kinds of tasks and it didn't do well. Still the same over-eagerness to make wild code changes instead of "reasoning" about the error.
n8m8 8 hours ago [-]
So far the only thing I’ve noticed is that it made me confirm that it should do a 10 minute task “manually” because it “would take 2 or 3 hours”
It was a context merging task for my unorganized collection of agents… it sort of made sense, but was the exact reason I was asking it to do it… like you’re the bot, lol
n8m8 8 hours ago [-]
Excited to try Claude agents sdk though
StarterPro 15 hours ago [-]
Once the bottom falls out of ai, will programming be seen as a marketable skill again?
hu3 6 hours ago [-]
I think so. Because systems require a lot of knowledge to create and maintain without breaking.
How many years till AI can be trusted to deploy changes to production without supervision?
Maybe never.
_joel 1 days ago [-]
`claude --model claude-sonnet-4-5-20250929` for CLI users
seaal 1 days ago [-]
They really had to release an updated model, I can only imagine how many people cancelled their plans and switched over to Codex over the past month.
I'm glad they at least gave me the full $100 refund.
GenerWork 1 days ago [-]
I'm one of them, but I'm just a product designer who likes to jump between various AI tools to get experience with them. Once my month with OpenAI is up, I may jump back to CC as I liked some of the non-coding features more, specifically plan mode.
epolanski 1 days ago [-]
Going from pro to Max was a giant let down.
Then they even started sending me marketing emails, which was the straw that broke the camel's back; I make a point of cancelling subscriptions from companies that spam my email.
jdlyga 9 hours ago [-]
Anecdotal evidence, but compared to Claude Sonnet 4 I'm noticing very little difference.
rtp4me 1 days ago [-]
Just updated to Sonnet 4.5 and Claude Code 2.0 this afternoon. I worked on a quick project (creating PXE bootable files) using the updates and have to say, this new version seems much faster and more accurate than before. I did not go round-and-round trying to get good output and Claude did not go down rabbit holes like before. So far, so good.
andrewstuart 1 days ago [-]
Still waiting to be able to upload zip files to Claude, which Gemini and ChatGPT have had for ages.
ChatGPT even does zip file downloads, packaging up all your files.
oscord 11 hours ago [-]
Sonnet 4 had turned to shit recently (over about the last 2.5 months, by my observation). It hallucinated on 3 questions in a row while looking at a simple bash script. That was enough for me to cancel. Claude biz is killing Claude dev. It was good while they were not so stingy on GPU.
mattlangston 20 hours ago [-]
It does well with screenshot-calculus for me. For example, I pasted a screenshot of the Layer Norm equation into Claude Code 2 and asked:
"Differentiate y(x) w.r.t x, gamma and beta."
It not only produced the correct result, but it understood the context - I didn't tell it the context was layer norm, back-propagation and matrices.
It's time to start benchmarking the benchmarks. I'm pretty sure they're doping the game at BMW levels here.
user1999919 1 days ago [-]
*vw (volkswagen)
aliljet 1 days ago [-]
These benchmarks remain remarkably weak proxies for real-world work. If you're using this for day-to-day work, the eval that really matters is how the model handles a ten-step action. Context and focus are absolutely king in real-world work. To be fair, Sonnet has tended to be very good at that...
I wonder if the 1m token context length is coming for this ride too?
data-ottawa 1 days ago [-]
Anecdotally this new Sonnet model is massively falling apart on my tool call based workflows.
I’m having to handhold it through analysis tasks.
At one point it wrote a Python script that took the files it needed to investigate, iterated through them running `print(f"{i}. {file}")`, and then printed "Ready to investigate files…". And that's all the script did.
I have no idea what’s going on with those benchmarks if this is real world use.
I never understood the point of the pelican-on-a-bicycle exercise:
LLM coding agents don't have any way to see the output.
It means the only thing this test is testing is the ability of the LLM to memorise.
Because it exercises thinking about a pelican riding a bike (not common) and then describing that using SVG. It's quite nice imho and seems to scale with the power of the model. I'm sure Simon has some actual reasons, though.
Kuinox 1 days ago [-]
> Because it exercises thinking about a pelican riding a bike (not common)
It is extremely common, since it's used on every single LLM to bench it.
And there is no logic to it: LLMs are never trained for graphics tasks, and they don't see the output of the code.
_joel 1 days ago [-]
I mean that real-world examples of a pelican riding a bike are not common. It's common in benchmarking LLMs, but that's not what I meant.
imiric 1 days ago [-]
The only thing it exercises is the ability of the model to recall its pelican-on-bicycle and other SVG training data.
furyofantares 1 days ago [-]
It's more for fun than as a benchmark.
Kuinox 1 days ago [-]
It also measures something LLMs are good at, probably due to cheating.
furyofantares 1 days ago [-]
I wouldn't say any LLMs are good at it. But it doesn't really matter, it's not a serious thing. It's the equivalent of "hello world" - or whatever your personal "hello world" is - whenever you get your hands on a new language.
mhh__ 1 days ago [-]
Memorise what exactly?
Kuinox 1 days ago [-]
The coordinates and shapes of the elements used to form a pelican.
If you think about how LLMs ingest their data, they have no way to know how to form a pelican in SVG.
I bet their ability to form a pelican results purely from someone having already done it before.
throwaway314155 1 days ago [-]
> If you think about how LLMs ingest their data, they have no way to know how to form a pelican in SVG.
It's called generalization and yes, they do. I bet you could find plenty of examples of it working on something that truly isn't "present in the training data".
It's funny - you're so convinced that it's not possible without direct memorization, but you forgot to account for emergent behaviors (which are frankly all over the place in LLMs - where have you been?).
At any rate, the pelican thing from simonw is clearly just for fun at this point.
greenfish6 1 days ago [-]
pelican on a bicycle benchmark probably getting saturated... especially as it's become a popular way to demonstrate model ability quickly
AlecSchueler 1 days ago [-]
But where is the training set of good pelicans on bikes coming from? You think they have people jigging them up internally?
eli 1 days ago [-]
Assuming they updated the crawled training data, just having a bunch of examples of specifically pelicans on bicycles from other models is likely to make a difference.
AlecSchueler 1 days ago [-]
But then how does the quality increase? Normally we hear that when models are trained on the output of other models, the style becomes very muted and various other issues start to appear. But this is probably the best pelican on a bicycle I've ever seen, by quite some margin.
Kuinox 1 days ago [-]
Just compare it with a human on a bicycle and you'll see that LLMs are weirdly good at drawing pelicans in SVG but not humans.
AlecSchueler 1 days ago [-]
I thought a human would be a considerable step up in complexity but I asked it first for a pelican[0] and then for a rat [1] to get out of the bird world and it did a great job on both.
But just for thrills I also asked for a "punk rocker"[2] and the result--while not perfect--is leaps and bounds above anything from the last generation.
0 -- ok, here's the first hurdle! It's giving me "something went wrong" when I try to get a share link on any of my artifacts. So for now it'll have to be a "trust me bro" and I'll try to edit this comment soon.
_joel 1 days ago [-]
... but can it create an SVG renderer for Claude's site?
cube2222 1 days ago [-]
So… seems like we’re back to Sonnet being better than Opus? At least based on their benchmarks.
Curious to see that in practice, but great if true!
mchusma 1 days ago [-]
For me, Opus 4.1 was so much better than Sonnet 4.0 that I used it exclusively in Claude Code and cancelled Cursor. I'm a bit skeptical that Sonnet 4.5 will be in practice better, but will test with it and see! Hopefully we get Opus 4.5 soon.
throwaway638637 24 hours ago [-]
Isn't Opus much slower than Sonnet? I haven't been using Opus for that reason
drbojingle 8 hours ago [-]
Imo we're going to start needing more examples of where the successor is better than what came before, and not just benchmarks.
I understand that they may not have published the results for Sonnet 4.5 yet, but I would expect the other models to match...
ChaoPrayaWave 20 hours ago [-]
What impressed me most about Claude Sonnet 4.5 is that its output structure is more stable than many other models and less prone to crashes. I ran some real world scripts from my own projects, and it exhibited fewer hallucinations than GPT-4 and performed more faithfully on code interpretation tasks. However, it can be a bit slow to warm up, and sometimes I needed more prompts in the first few rounds.
miletus 11 hours ago [-]
we at agentsea.com have been playing with it for a while.
here's what we think about it:
- still sucks at generating pretty ui
- great for creative writing and long-form planning
- it’s really fast but not smarter than gpt-5
- pairs well with external tools/agents for research and automation
- comes with a 1m token context window, so you can feed it monstrous codebases or giant docs
- still hallucinates or stumbles on complex requests
wohoef 1 days ago [-]
And Sonnet is again better than Opus. I’d love to see simultaneous release dates for Sonnet and Opus one day. Just so that Opus is always better than Sonnet
sberens 1 days ago [-]
Is "parallel test time compute" available in claude code or the api? Or is it something they built internally for benchmark scores?
jspdown 1 days ago [-]
I'm wondering if it's not just: spawn the same prompt multiple times and take the best
arthur-st 1 days ago [-]
It is exactly that
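A rough sketch of that idea: sample the same prompt n times and keep whichever candidate a scorer prefers. The model id is the dated one mentioned elsewhere in this thread, and the length-based default scorer is only a placeholder, not whatever Anthropic actually uses to pick the winner:

    import anthropic

    client = anthropic.Anthropic()

    def best_of_n(prompt: str, n: int = 4, score=len) -> str:
        # Sequential here for brevity; the real thing would fan these out in parallel.
        candidates = []
        for _ in range(n):
            msg = client.messages.create(
                model="claude-sonnet-4-5-20250929",
                max_tokens=1024,
                temperature=1.0,  # keep sampling stochastic so the n candidates differ
                messages=[{"role": "user", "content": prompt}],
            )
            candidates.append(msg.content[0].text)
        return max(candidates, key=score)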
mutant 9 hours ago [-]
Didn't they promise a 1M token input? I don't see that here.
marginalia_nu 1 days ago [-]
Is there some accessible explainer for what these numbers that keep going up actually mean? What happens at 100% accuracy or win rate?
lukev 1 days ago [-]
It means that the benchmark isn't useful anymore and we need to build a harder one.
edit: as far as what the numbers mean, they are arbitrary. They are only useful insofar as you can run two models (or two versions of the same model) on the same benchmark, and compare the numbers. But on an absolute scale the numbers don't mean anything.
typpilol 1 days ago [-]
I thought the percentage was how many problems it successfully solved
baq 16 hours ago [-]
Technically correct, but not helpful nor actionable.
marginalia_nu 15 hours ago [-]
It was actually very helpful, as it answered my question about what the benchmark numbers are. It wasn't a request for advice; I'm merely looking to understand the article, which doesn't really elaborate on what it is presenting, either assuming an audience that is already very familiar with these benchmarks, or one so dazzled by the number going up that it forgets to ask what the number is.
asadm 1 days ago [-]
Then we need a new benchmark.
scosman 1 days ago [-]
Interesting quirk on first use: "`temperature` and `top_p` cannot both be specified for this model. Please use only one."
epolanski 1 days ago [-]
Other models have this restriction too, and it doesn't make much sense to specify both anyway.
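For what it's worth, a minimal sketch of a call that sets only temperature (the model id is the dated one mentioned elsewhere in this thread; the prompt is just an example):

    import anthropic

    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=256,
        temperature=0.7,  # omit top_p entirely; this model rejects requests that set both
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(msg.content[0].text)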
zora_goron 1 days ago [-]
Why might this be, does anyone know?
AbuAssar 1 days ago [-]
I used to treat writing code as a form of art, with attention to details and best practices, and using design patterns whenever possible.
but it seems this will come to an end eventually as these agents become stronger and more capable each day, and will be better and faster than human coders.
labrador 22 hours ago [-]
I'm sympathetic, but it occurred to me that, according to my amateur studies, Germany lost WW2 in part because it had a craftsman mentality toward manufacturing war machines and ended up with a bazillion different part requirements and a shortage of skilled craftsmen, while America used Henry Ford's assembly-line process to stamp out hundreds of thousands of identical machines sharing the same parts.
Now we are at the assembly line stage of software production with AI. Us craftsmen will have to find other ways to enjoy our crafts.
mirsadm 17 hours ago [-]
This has been said about the release of every model in the last couple of years. Personally I can't even tell if this one is better than 3.5. I actually found that one more useful.
nvarsj 1 days ago [-]
Yup, we're headed to the robot assembly line, with a few experts making sure it all works correctly.
Craftsmen will remain, but it will be niche (and probably not pay anything unless you are a true master).
cwoolfe 9 hours ago [-]
I've been really impressed with how good Cursor is at coding. I threw it a standard backend api endpoint and database task yesterday and it generated 4 hours of code in 2 minutes. It was set to Auto which I think uses some Claude model.
If they stopped the automatic "You're absolutely right!" responses after the model fails to fix something 20 times in a row, that alone would be worth the upgrade.
Me: "You just burned my house down"
Claude: "You're absolutely right! I burned your house down, I need to revert the previous change and..."
Me: "Now you rebuilt my house with a toilet in the living room"
Claude: "You're absolutely right! I put a toilet in your living room..."
Etc.
Attummm 1 days ago [-]
Anthropic really nailed this release.
There had been a trend where each new model released by OpenAI, Anthropic, etc. felt like a letdown or, worse, a downgrade.
But the release of 4.5 breaks that trend, and is a pleasant surprise on day one.
Well done! :)
dimgl 18 hours ago [-]
Really? Because this feels like a pretty significant downgrade from Sonnet 4 on real world tasks.
meetpateltech 1 days ago [-]
Seeing the progress of the Claude models is really cool!
This looks exciting. I hope they add this to Windsurf soon.
pzo 1 days ago [-]
it looks like it's already there
simianwords 1 days ago [-]
It’s stupid… like just have a registry of models and let people automatically use them. It’s silly to wait for manual whitelisting each time for every app
fibers 10 hours ago [-]
I agree, I use Windsurf for personal projects and I think the pricing model is a bit better than what a professional dev would be using on cursor or something like that.
ReverseCold 24 hours ago [-]
It was there a few (<5? I think?) minutes after the Anthropic post went out. If you look at Windsurf's web traffic it looks like they did a thing (model is an int) to make it so the IDE doesn't need to update to get new models.
chrisford 1 days ago [-]
The vision model has consistently been degraded since 3.5, specifically around OCR, so I hope it has improved with Claude Sonnet 4.5!
edude03 1 days ago [-]
Ah, I figured something was up - I had sonnet 4 selected but it changed to "Legacy Model" while I was using the app.
jdthedisciple 1 days ago [-]
Why the focus on the "alignment" aspect of safety?
Surely there are more pressing issues with LLMs currently...
mccoyb 23 hours ago [-]
Congratulations: it's faster, but worse, with a larger context window.
i-chuks 1 days ago [-]
AI companies really need to consider regional pricing. Huuuuge barrier!
Jcampuzano2 1 days ago [-]
The price of training and running the models doesn't really change much no matter which region you're hosting/making requests from.
Regional pricing unfortunately doesn't really make much sense for them unless they're willing to take even larger losses, even if it is a barrier to lower income countries/regions.
hu3 1 days ago [-]
I wonder if/when this will be available to GitHub Copilot in VSCode.
Claude Sonnet 4.5 support has landed in my CLI coding agent, VT Code, combining a SOTA language model with agentic semantic code understanding: github.com/vinhnx/vtcode
catigula 1 days ago [-]
I'm still absolutely right constantly, I'm a genius. I also make various excellent points.
cryptoz 1 days ago [-]
I've really got to refactor my side project, which I tailored to just use OpenAI API calls. I think the Anthropic APIs are a bit different, so I just never put in the energy to support the changes. I think I remember reading that there are tools to simplify this kind of work, to support multiple LLM APIs? I'm sure I could do it manually, but how do you all support multiple API providers that have some differences in API design?
I built LLMRing (https://llmring.ai) for exactly this. Unified interface across OpenAI, Anthropic, Google, and Ollama - same code works with all providers.
The key feature: use aliases instead of hardcoding model IDs. Your code references "summarizer", and a version-controlled lockfile maps it to the actual model. Switch providers by changing the lockfile, not your code.
Also handles streaming, tool calling, and structured output consistently across providers. Plus a human-curated registry (https://llmring.github.io/registry/) that I keep updated with current model capabilities and pricing - helpful when choosing models.
MIT licensed, works standalone. I am using it in several projects, but it's probably not ready to be presented in polite society yet.
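To give a flavor of the alias idea without pasting real internals, here is a purely hypothetical sketch - the lockfile format and function names are invented for illustration and are not LLMRing's actual API; only the Anthropic and OpenAI SDK calls are the standard ones:

    import tomllib  # stdlib in Python 3.11+

    # A hypothetical llmring.lock (TOML, version-controlled):
    #   [aliases]
    #   summarizer = "anthropic:claude-sonnet-4-5-20250929"

    def resolve(alias: str, path: str = "llmring.lock") -> tuple[str, str]:
        # Map a task alias like "summarizer" to (provider, model_id).
        with open(path, "rb") as f:
            lock = tomllib.load(f)
        provider, model = lock["aliases"][alias].split(":", 1)
        return provider, model

    def chat(alias: str, prompt: str) -> str:
        provider, model = resolve(alias)
        messages = [{"role": "user", "content": prompt}]
        if provider == "anthropic":
            import anthropic
            msg = anthropic.Anthropic().messages.create(
                model=model, max_tokens=1024, messages=messages
            )
            return msg.content[0].text
        if provider == "openai":
            import openai
            rsp = openai.OpenAI().chat.completions.create(model=model, messages=messages)
            return rsp.choices[0].message.content
        raise ValueError(f"unknown provider: {provider}")

Switching a task to a different provider then just means editing the lockfile entry, not the call sites.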
> I think I remember reading that there are tools to simplify this kind of work, to support multiple LLM APIs
just ask Claude to generate a tool that does this, duh! and tell Claude to make the changes to your side project and then to have sex with your wife too since it's doing all the fun parts
deviation 14 hours ago [-]
Interesting. In a thought process while editing a PDF, Claude disclosed the folder hierarchy for its "skills". I didn't know this was available to us:
> Reading the PDF skill documentation to create the resume PDF
> Here are the files and directories up to 2 levels deep in /mnt/skills/public/pdf, excluding hidden items and node_modules:
dr_dshiv 1 days ago [-]
Anyone try the Imagine with Claude yet? How does it work?
vb-8448 1 days ago [-]
The claims against GPT-5 are huge!
I used to use CC, but I switched to Codex (and it was much better)... now I guess I have to switch back to CC, at least to test it.
tresil 24 hours ago [-]
I'll add another really positive review here. Sonnet 4.0 had been really struggling to implement an otel monitoring solution using Grafana's LGTM stack. Sonnet 4.0 made 4 or 5 different attempts - some of them longer than 10 min - troubleshooting why metrics were supposedly being emitted from the API but not showing up in Prometheus. Sonnet 4.5 correctly diagnosed and fixed the real issue within about 5 min. Not sure if that's the model being smarter, but I definitely saw the agent using some new approaches and seemingly managing its context better.
cmrdporcupine 20 hours ago [-]
So far I'm liking that it seems to follow my CLAUDE.md instructions better, doing more frequent checkins with me to ask me to review what it's done, etc, and taking my advice more.
What I'm not liking is it seems even... lazier... than previously. By which I mean the classic "This is getting complicated so..." (followed by cop-out, dropping the original task and motivation).
There's also a bug where compaction becomes impossible. ("conversation too long" and its advice on how to fix doesn't work)
bdangubic 20 hours ago [-]
> There's also a bug where compaction becomes impossible. ("conversation too long" and its advice on how to fix doesn't work)
I have seen this issue with every model so far
smakosh 1 days ago [-]
Available on llmgateway.io already
asdev 1 days ago [-]
how do claude/openai get around rate limiting/captcha with their computer use functionality?
iFire 1 days ago [-]
Is it 15x cheaper like Grok?
typpilol 1 days ago [-]
I heard 5x cheaper than Opus
MarcelOlsz 9 hours ago [-]
Terrible. It can't even do basic scaffolding, which is all it was good for - now it can't even do that. You can wrangle it with taskmaster or bmadcode or whatever, but at that point I'd rather just write it myself. Writing English to build things is goofy. Unsubscribed.
pants2 1 days ago [-]
Unfortunately also disappointed with it in Cursor vs GPT-5-Codex. I asked it to add a test for a specific edge case, it hallucinated some parameters and didn't use existing harnesses. GPT-5-Codex with the same prompt got everything right.
system2 22 hours ago [-]
I didn't try the checkpoints, I use local git + /resume from a chat that I pick closer to the git version I restore if Claude screws up.
Will this checkpoint help with chat memory and disregard the latest chat's info?
I use WSL under Windows, VSCode with the WSL plugin, and Claude-Code installed on Ubuntu 24. It is generally solid and has no issue with this setup.
j45 23 hours ago [-]
A question I have for anyone: has Claude Max recovered from its response-quality and service issues -- between the usage limits and the model's performance on coding and non-coding tasks?
Anecdata is welcome as it seems like it's the only thing available sometimes.
dbbk 1 days ago [-]
So Opus isn't recommended anymore? Bit confusing
SatvikBeri 1 days ago [-]
For now, yeah. Presumably they'll come out with Opus 4.5 soon.
causal 1 days ago [-]
Don't think I've ever preferred Opus to Sonnet
1 days ago [-]
jonathanstrange 1 days ago [-]
I would like to see completely independent test results of these companies' products. I'm skeptical because every AI company claims their new product is the best.
AtNightWeCode 1 days ago [-]
Sonnet is just so expensive compared to other competitors. Have they fixed this?
ripped_britches 1 days ago [-]
Pricing is the same as sonnet 4
risho 1 days ago [-]
which was expensive compared to its competitors
AtNightWeCode 1 days ago [-]
Exactly - cost per token is higher, but it also uses tokens like a chipmunk on steroids.
nickphx 1 days ago [-]
It will be great when the VC cash runs out, the screws tighten, and finally an end to the incessant misleading marketing claims.
24 hours ago [-]
rishabhaiover 1 days ago [-]
hn displays a religious hatred towards ai progress
epolanski 1 days ago [-]
Most people here use these models as you can see from the comments.
But we can also see that we're one of the few sane skeptical places in a world that is making the most diverse claims about AI.
rishabhaiover 1 days ago [-]
Fair.
catigula 1 days ago [-]
I happened to be in the middle of a task in a production codebase that the various models struggled on so I can give a quick vibe benchmark:
opus 4.1: made weird choices, eventually got to a meh solution i just rolled back.
codex: took a disgusting amount of time but the result was vastly superior to opus. night and day superiority. output was still not what i wanted.
sonnet 4.5: not clearly better than opus. categorically worse decision-making than codex. very fast.
Codex was night and day the best. Codex scares me, Claude feels like a useful tool.
poisonborz 1 days ago [-]
These reviews are pretty useless to other developers. Models perform vastly differently with each language, task type, framework.
cheema33 1 days ago [-]
> These reviews are pretty useless to other developers.
Agreed. If these same models were used on a different codebase/language etc. it will likely produce very different results.
epolanski 1 days ago [-]
And prompt and usage.
cloverich 1 days ago [-]
Please y'all, when you list supportive or critical complaints based on your actual work, include some specifics of the task and prompt. Like actual prompt, actual bugs, actual feature, etc. I've had great success with both ChatGPT and Claude for years, I'm at around a 3x sustained output increase in my professional work, and I'm kicking off and finishing new side projects / features that I used to simply never finish. BUT there are some tasks I run into where it's god awful. Because I have enough good experience, I know how to work around them, when to give up, when to move on, etc. I am still surprised at things it cannot do, for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI (I am not an iOS dev). IMHO for people using it off and on or sparingly, it's going to seem either incredible or worthless depending on your project and prompt. Share details, it's so helpful for meaningful conversation!
Mathiciann 1 days ago [-]
I am almost convinced your comment is parody but I am not entirely sure.
You want proof for critical/supportive criticism? Then almost in the same sentence you make an insane claim without backing it up with any evidence.
cloverich 1 days ago [-]
> You want proof for critical/supportive criticism? Then almost in the same sentence you make an insane claim without backing this up by any evidence.
Nearly every critical reply to my comment bases its criticism on the lack of examples and details I included for my claim, which is the very thing I am suggesting we provide (i.e. they are, ironically, agreeing with me?). I am sorry; I thought that intentional bit of irony would help make the point rather than derail the request.
stavros 1 days ago [-]
Well, here's an even more insane claim: I'm infinity times more productive, as I just wouldn't even start projects without the LLM to sidestep my ADHD. Then, when the LLM invariably fucks up, I step in and finish things myself!
Here are a few projects that I made these past few months that wouldn't have been possible without LLMs:
* https://www.writelucid.cc - A business document/spec writing tool I'm working on, it asks you questions one at a time, writes a document, then critiques the idea to help you strengthen it.
* A rotary phone that's a USB headset and closes your meeting when you hang up the phone, complete with the rotary dial actually typing in numbers.
* Made some long-overdue updates on my pastebin, https://www.pastery.net, to improve general functionality.
* https://github.com/skorokithakis/justone - A static HTML page for the board game Just One, so you can play with your friends when you're physically together, without needing to bring the game along.
I'm probably forgetting a lot more, but I honestly wouldn't have been bothered to start any of the above if not for LLMs, as I'm too old to code but not too old to make stuff.
EDIT: dang can we please get a bit better Markdown support? At least being able to make lists would be good!
Mathiciann 1 days ago [-]
Well done, some of these projects look cool.
Although I was just commenting on the irony of the parent comment.
stavros 1 days ago [-]
What was the irony? I thought you were referring to the "3x speed" part as the insane statement.
AppleBananaPie 1 days ago [-]
To me it seems like an arbitrary number that I'm not even sure how someone could accurately measure, but maybe I've missed something :)
cloverich 24 hours ago [-]
It's actually not arbitrary! I measure my PR rate / ticket close rate before and after, which are generally tied to agreed-on features / bugs (often user-requested / reported ones). I think if it were commit rate or lines of code it would be less meaningful, but at least a (non-refactoring) PR should indicate some level of increased user value / bug fix. Sure, we could categorize it further and break it down more effectively; I'll not die on the sword of 3x, maybe it's 1.5x, maybe it's 4x. Neither seems a very meaningful difference when the comparison being discussed is 0x or even -x. The latter I think _most_ of the time is going to be prompt or task related, which is why I think it's so important to share and discuss (particularly the negative case!)
stavros 1 days ago [-]
I took it as a ballpark of how the GP felt, rather than a rigorous measurement.
GoatInGrey 1 days ago [-]
Which would be ironic as LLM usage has been observed to increase the sensation of productivity even when productivity is measurably reduced. Not to mention the "vibe" component of vibe coding
Hm, well, since we are on a Sonnet thread, I might give it a go.
lisbbb 1 days ago [-]
Did you make any money off any of that or was it all just labors of love type of stuff? I'm enjoying woodworking...
stavros 1 days ago [-]
This is all my hobby, for my job I use Claude in a way that doesn't involve code, but is still very useful. It's basically what inspired Lucid, above, when I realized I find coming up with solutions very easy, but find explaining them very hard, because I assume the other person knows too much and I don't elaborate enough.
LLMs are a great rubber duck, plus they can write the document for you at the end.
FlyingSnake 1 days ago [-]
> dang can we please get a bit better Markdown support?
Great use case for an LLM to make these changes as HN is open source. It’ll also tell us if LLMs can go beyond JS slop.
emp17344 1 days ago [-]
> I'm infinity times more productive, as I just wouldn't even start projects without the LLM to sidestep my ADHD.
1 is not infinitely greater than 0.
sebastiennight 1 days ago [-]
It... literally is?
Or otherwise, can you share what you think the ratio is?
emp17344 1 days ago [-]
No, 1 is 1 more than 0. There’s a certain sense in which you could say that 1 is infinitely greater than 0, but only in an abstract, unquantifiable way. In this case, it doesn’t make sense to say you’re “infinitely more productive” because you’re producing something rather than nothing.
sebastiennight 16 hours ago [-]
It goes like this:
"For any positive "x", is 1 x times greater than 0? Well, 0 times x is lower than 1, and 1 divided by x is larger than 0."
So his productivity increased by more than twice, more than ten times, more than a billion times, more than a googol times, more than Rayo's number. The only mathematically useful way to quantify it is to say his productivity is infinitely larger. Unless you want to settle for "can't be compared", which is less informative.
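Written out in symbols (just restating the argument above, nothing new):

    \text{For any } x > 0:\quad 0 \cdot x = 0 < 1 \quad\text{and}\quad \frac{1}{x} > 0,
    \qquad\text{so}\qquad \lim_{b \to 0^{+}} \frac{1}{b} = \infty.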
stavros 22 hours ago [-]
Only if you think that the phrase "two times more productive" is also nonsensical.
jama211 1 days ago [-]
I just read it as a turn of phrase that says exactly that, that it means they produce something rather than nothing.
Fraterkes 1 days ago [-]
I think it's a pedantic point, but maybe they just meant that talking about 1 being multitudes greater than 0 implies multiplication. And since 1/0 is undefined that doesn't make much sense.
1 days ago [-]
inopinatus 1 days ago [-]
Someone attributing all of their productivity to a given tool and none to their own ingenuity and experience is allocating 100% credit to that tool.
It is not a ratio, it is a proportion.
1 days ago [-]
rvz 1 days ago [-]
Also, not invented here syndrome (NIH) is cool again.
Given that most of the 'vibe-coded' projects I have seen are worse versions of software that has been tested and stood the test of time.
nenenejej 24 hours ago [-]
Everyone who wants to talk about claude code raise a Jira ticket with steps to reproduce and please link to that.
mbesto 1 days ago [-]
> include some specifics of the task and prompt. Like actual prompt, actual bugs, actual feature, etc.
> I am still surprised at things it cannot do, for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI (I am not an iOS dev).
You made a critical comment yet didn't follow your own rules lol.
> it's so helpful for meaningful conversation!
How so?
FWIW - I too have used LLMs for both coding and personal prompting. I think the general conclusion is that when it works, it works well, but when it fails it can fail miserably and be disastrous. I've come to this conclusion from reading people's complaints here and through my own experience.
Here's the problem:
- It's not valuable for me to print out my whole prompt sequence (and context for that matter) in a message board. The effort is boundless and the return is minimal.
- LLMs should just work(TM). The fact that they can fail so spectacularly is a glaring issue. These aren't just bugs, they are foundational because LLMs by their nature are probabilistic and not deterministic. Which means providing specific defect criteria has limited value.
cloverich 1 days ago [-]
> How so?
Sure. Another article was posted today[1] on the subject. An example claim:
> If we asked the AI to solve a task that was already partially solved, it would just replicate code all over the project. We’d end up with three different card components. Yes, this is where reviews are important, but it’s very tiring to tell the AI for the nth time that we already have a Text component with defined sizes and colors. Adding this information to the guidelines didn’t work BTW.
This is helpful framing. I would say to this: I have also noticed this pattern. I have seen two approaches help. One, I break up UI / backend tasks. At the end of UI tasks, and sometimes before I even look at the code, I say: "Have you reviewed your code against the existing components library <link to doc>?" and sometimes "Have you reviewed the written code compared to existing patterns and can you identify opportunities for abstraction?" (I use plan mode for the latter, and review what it says). The other approach which I have seen others try, but have not myself (but it makes sense), is to automatically do this with a sub agent or hook. At a high level it seems like a good approach given I am manually doing the same thing now.
Could you share the actual examples of where you’re seeing the 3x output increase?
cloverich 1 days ago [-]
Sure. This is an internal web app that uses react on the front end and rails on the back end. Typical examples I see LLM success with are writing and wiring up routes/controllers/models, writing specs for those, abstracting components, writing front-end vitest/storybook entries. A typical request (filenames and such redacted) is like: "We recently added <link to model>. We refactored our approach for <goal> to <link to different model file>. We need to refactor <A> to be like <B> in these ways. Do that, then update the spec to match the pattern in <file Y>. Run rspec and rubocop when done, and address any issues". I then either wait or go do something else, then review the code and either ask for follow up, or fix minor issues. Sometimes it follows the wrong pattern and I ask it to adjust, or simply git checkout -- and say try again you did Y wrong.
Roughly speaking that is how I think through my work, and when I get to the point of actually writing the code having most of the plan (context) in my head, I simply copy that context to the LLM then go to do something else. I only do this if I believe the LLM can do it effectively, so some tasks I do not ask for help at all on (IMHO this is important).
I also have it help with scripts, especially script that munge and summarize data. I know SQL very very well, but find it still a bit faster to prompt the LLM if it has the schema on hand.
Do you find ^ helpful? i.e does that match how you prompt and if not, in what ways does it differ? If it does, in what ways do you get different results and at what step?
alfalfasprout 1 days ago [-]
right? The irony is so thick you could cut it with a butter knife
not_kurt_godel 1 days ago [-]
3 * 0 = 0.
Checkmate, aitheists.
bartread 1 days ago [-]
I had a complete shocker with all of Claude, GitHub Copilot, and ChatGPT when trying to prototype an iOS app in Swift around 12 months ago. They would all really struggle to generate anything usable, and making any progress was incredibly slow due to all the problems I was running into.
This was in stark contrast to my experience with TypeScript/NextJS, Python, and C#. Most of the time output quality for these was at least usefully good. Occasionally you’d get stuck in a tarpit of bullshit/hallucination around anything very new that hadn’t been in the training dataset for the model release you were using.
My take: there simply isn’t the community, thought leadership, and sheer volume of content around Swift that there is around these other languages. This means both lower quantity and lower quality of training data for Swift as compared to these other languages.
And that, unfortunately, plays negatively into the quality of LLM output for app development in Swift.
(Anyone who knows better, feel free to shoot me down.)
simonh 1 days ago [-]
Going from past discussions, there seem to be two issues there. One is that Swift has changed massively since it came out and huge swathes of examples and articles and such online, that LLMs are trained on, are out of date and thus pollute the training set.
Another issue is that Apple developer docs are largely sequestered behind JavaScript that makes them hard for scrapers to parse.
At least, those are the two explanations I’ve seen that seem plausible.
bartread 20 hours ago [-]
Yeah, I'm not a Swift expert by any means - this is literally something I spent a few days on - but this in particular:
> One is that Swift has changed massively since it came out and huge swathes of examples and articles and such online, that LLMs are trained on, are out of date and thus pollute the training set.
100% jibes with my experience. The amount of times it would generate code using a deprecated API, or some older mechanism, or mix an older idiom with a newer one... well, it was constant really.
And a lot of Googling when I was fixing everything up manually drew me toward this same conclusion: that high quality, up to date information on Swift was in relatively short supply compared to other languages. Couple that with a lower volume of content across all Swift versions and you end up with far from great training data leading to far from great outputs.
> Apple developer docs are largely sequestered behind JavaScript that makes them hard for scrapers to parse.
Yeah, and honestly - even if there's a solution here - the documentation isn't that great either. Certainly not compared with .NET, Ruby, Python, TypeScript, etc.
If I were a vibe coder I'd certainly avoid Swift like the plague.
(Btw, this isn't a knock on Swift itself: as a language I didn't mind it, although I did notice when debugging that the Objective C underpinnings of many APIs are often on display.)
fnordsensei 1 days ago [-]
As someone who gets useful Clojure out of Claude quite consistently, I’m not sure that volume is the only reason for output quality.
resters 1 days ago [-]
I think what you are saying is true for CLI-only development using Swift. It is possible, but LLMs often get the commands wrong or don't realize how to accomplish something. There have been a number of times when claude/codex has told me I have to edit a plist manually in XCode before progress can continue.
enjo 1 days ago [-]
This is more or less my experience with Go right now.
For a bunch of reasons I want to avoid the standard React, Typescript, and Node stack but the sheer velocity that might enable from the LLM side might make it worth it.
nerdix 22 hours ago [-]
Wait...
Are you saying that your experience with Go has been bad? I would think Go would be as good as any other language (if not better). The language itself is simple, the Go team is very methodical about adding new features so it changes fairly slowly, it has excellent built in CLI based tooling that doesn't require third party packages or applications, and there are plenty of large open source Go codebases to train on. Seems like the perfect language for agentic tools.
emil-lp 1 days ago [-]
How do you measure 3x sustained output increase?
Is it number of lines? Tickets closed? PRs opened or merged? Number of happy customers?
cloverich 1 days ago [-]
Merged PRs. We typically plan out our work and break it up into e.g. JIRA tasks, and when we create PRs they should _very generally_ be tied to actual JIRA tickets, i.e. pre-planned work. A ticket is usually a requested feature or a bug (as reported by an actual user). So my PR rate, or perhaps less controversially my JIRA close rate, is around 3x higher for the last few months. That's also reflected more generally in the feedback I get on productivity (i.e. from people who are looking at the project as a whole rather than e.g. how many commits I've made).
I exclude side projects and CLI tools from the 3x, since they're weird to quantify - they are typically things that would otherwise have stayed ideas in my head that I never did at all. I guess I also generally exclude refactoring, although I do more of it. For example, I had Claude fix a bug that was dogging our TypeScript compilation. I couldn't figure out what was so slow about it (>60s to compile). It turned out to be a specific recursive type pulled in by a specific version of a library, combined with usage from one file! It actually took it a while to figure out; it kept proposing solutions and I had to redirect it a bunch, using mostly just intuition as opposed to experience, e.g. "No, re-run the diagnostics and look at the debug output; give me three examples of areas / commands you could look at and how", and then I'd pick one. I just did that task on the side: I'd go back and look at its output once every day or two, prompt it with something else, then go do my usual tasks as though it didn't exist. That type of work, given our pace / deadlines / etc., might never have gotten done, at least not anytime soon. But I do stuff like that all the time now; I just don't often measure it.
Is that helpful?
hshshshshsh 1 days ago [-]
All these are useless metrics. They don't say anything meaningful about the quality of your life.
I would be more interested in knowing whether he can now retire in the next 5 years instead of waiting another 15.
Or whether he now gets to work for 2 hours and enjoy the remaining 6 doing meaningful things apart from staring at a screen.
simonh 1 days ago [-]
Not everyone hates their job and gets no satisfaction from it. Some of us relish doing something useful and getting paid for it.
hshshshshsh 1 days ago [-]
Sure. I don't doubt it. But let's say I could make 100 million pounds appear in your bank account tomorrow. Would you say no to it and go back to your day job?
simonh 24 hours ago [-]
Both can be true. Being able to do better, more productive work can increase my quality of life. And yes, winning lottery millions would increase my quality of life even more.
However I don’t have lottery millions, but I do have a job and I would like to be able to do it better.
fragmede 18 hours ago [-]
Can you though? What you can do, though, is quit that job you hate and go do something (anything!) else until you find what's right for you.
hshshshshsh 6 hours ago [-]
Obviously I don't. But I was merely pointing at the fact that people don't really love their jobs but have somehow invented a story that makes them believe they do.
1 days ago [-]
inopinatus 1 days ago [-]
It is undoubtedly 3x as many bugs.
_alternator_ 1 days ago [-]
This would be a win. Professionals make about 1 bug for every 100 loc. If you get 3x the code with 3x the bugs, this is the definition of scaling yourself.
lottin 1 days ago [-]
I think it's just a meaningless sentence.
senordevnyc 1 days ago [-]
Oh good, a new discussion point that we haven't heard 1000x on here.
Have you heard of that study that shows AI actually makes developers less productive, but they think it makes them more productive??
EDIT: sorry all, I was being sarcastic in the above, which isn't ideal. Just annoyed because that "study" was catnip to people who already hated AI, and they (over-) cite it constantly as "evidence" supporting their preexisting bias against AI.
rapind 1 days ago [-]
> Have you heard of that study that shows AI actually makes developers less productive, but they think it makes them more productive??
Have you looked into that study? There's a lot wrong with it, and it's been discussed ad nauseam.
Also, what a great catch 22, where we can't trust our own experiences! In fact, I just did a study and my findings are that everyone would be happier if they each sent me $100. What's crazy is that those who thought it wouldn't make them happier, did in fact end up happier, so ignore those naysayers!
boogieknite 1 days ago [-]
> for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI
Have you tried the new Xcode extension? That tool is surprisingly good in my limited use - one of the few times Xcode has impressed me in my 2 years of use. I've read some anecdotes that Claude in the Xcode tool is more accurate than standard Claude Code for Swift. I haven't noticed that myself, but I've only used the Xcode tool twice so far.
Jonovono 1 days ago [-]
I had Claude Code build a fairly complex swiftui app (5+ screens), using Firebase AI Logic and other packages. First prompt it got pretty much foundation for the entire thing set up, then over the next day got it working exactly like I wanted. The thing that took the longest was getting through app review. I was impressed how well it knew SwiftUI and Swift composable architecture.
cloverich 1 days ago [-]
For my iOS project, I am super curious to what extent it is my lack of Swift knowledge and e.g. how well I can prompt. Because 80% of what I usually ask the LLM to do, I know how to do myself quite well. iOS is the first time I've been coding with something I do not know how to do well; I often can barely read the code (of course that is changing rapidly now). E.g. from a recent session:
> What is the idiom for testing the launch screen on the simulator like.. I don't see anything? How do I know if its there.
i.e. in iOS / Swift, I don't even know if I'm using the right terms for the code I am trying to interrogate, or in some cases even what the thing is!
Jonovono 1 days ago [-]
I have done lots of SwiftUI before, so it may have helped me recognize when it goes off the rails. But I definitely don't do anything fancy with my prompting.
But for stuff like TCA (Swift composable architecture), I basically created a TCA.md file and pasted in a bunch of docs and examples and would reference that.
But for the most part, it was one shotting swiftui screens that were nicer than what I had in my mind.
cpursley 1 days ago [-]
Apple store link (I believe you, just am curious)? I'm toying with the idea of "vibing" a real Swift app instead of messing with the React Native toolchain.
mpern 1 days ago [-]
Would you be so kind to lead by example?
What are the specific tasks + prompts giving you an 3x increased output, and conversely, what tasks don't work at all?
After an admittedly cursory scan of your blog and the repos in your GH account I don't find anything in this direction.
cloverich 1 days ago [-]
Oh, 3x at work. I shared some details on the methodology: it's PR rate for ticketed features / bugs (so e.g. closed tickets as opposed to commits, loc, etc). For prompts and tasks, I'm happy to share (redacted as needed; check the comment threads) if you want more details, presuming this is a genuine request. Here are a few example prompts (I can't paste them exactly, obviously, but I can approximate):
- "Rails / sidekiq: <x file> uses sidekiq batches. <y file> does it. Refactor your to use pattern in <x file> Match spec in <z file> then run rspec and rubocop"
- "Typescript / react. <x file>. Why is typescript compilation a bottle neck int his file. Use debugger to provide definitive evidence. Cast type to any and run script and time it; write a script to measure timing if needed. Iteratively work from type `any` to a real type and measure timing at each step. Summarize results"
- "I redefine <FormComponent> in five places. Find them all. Identify the shared patterns. Make new component in <x location>. Refactor each to to use new component. Run yarn lint and fix any ts issues when done"
- "<file y>: more idiomatic" (it knows my preferences)
Side projects and such I have no idea about, and (as you noted) I do those quite infrequently anyways! Actually, come to think of it... outside of the toy iOS work I did last week, I've not actually worked on my side projects since getting into Claude Code / Cursor agents. For work stuff, I guess another metric I'd be interested in is total messages sent per task. I do sometimes look at $ per task (but for me anyways, that's so wildly in my favor I don't think it's worth tracking).
1 days ago [-]
dirkc 1 days ago [-]
Would you say you do things you'd normally do 3 times faster? Or does it help you move past the things you'd get stuck on or avoid in the past, resulting in an overall 3x speedup?
cloverich 1 days ago [-]
Things I'd normally do 3x faster. That 3x is me focusing explicitly on the precise things I did before - the PR rate on a specific work project - because I tie those PR's back to specific tasks the same as I did before I used claude code. I haven't looked at lines of code, total commits, etc. Qualitatively I write more tests and abstract more components than I used to, but those get lumped in to the PRs as I normally try to limit pure refactoring work, and instead tie it into ticketed feature requests or bugs.
I don't count the things I'm doing now that I would have avoided or never finished in the past. For those, of course to me personally those are worth much more psychologically than 3x, but who knows if it's an actual boost. I.e. I took a partially scripted task the other day and fully automated it, and also had it output to the CLI in a kind of dorky sci-fi way because it makes it fun to run it. It didn't take long - 30 minutes? But I certainly didn't _gain_ time doing that, just a little more satisfaction. TBH I'm surprised 3x is so controversial, I thought it was a really cool and far more practical assessment than some of these 10x claims I'm seeing.
senordevnyc 1 days ago [-]
HN is such a negative and cynical place these days that it's just not worth it. I just don't have the patience to hear yet another anti-AI rant, or have someone who is ideologically opposed to AI nitpick its output. Like you, I've found AI to be a huge help for my work, and I'm happy to keep outcompeting the people who are too stubborn to approach it with an open mind.
la_fayette 1 days ago [-]
I think HN might be one of the few communities where people have been running extensive experiments with LLMs since their inception. Most here take a realistic view of their capabilities. There are certainly proven use cases where LLMs provide clear productivity gains—for example, copying an error message and retrieving potential solutions. At the same time, many recognize that marketing fantasies, such as the idea of having a "PhD in your pocket," are far beyond what this technology can deliver.
xenobeb 23 hours ago [-]
To me, it really depends on whether the post is a well-reasoned criticism with something unique to add to the conversation, or the standard, completely pointless anti-AI rant that I have already read a thousand times.
sciencejerk 22 hours ago [-]
I think a lot of white-collar workers are on HN and they
catigula 1 days ago [-]
[flagged]
dang 23 hours ago [-]
Please don't cross into personal attack, regardless of how wrong someone is or you feel they are.
Also, please stop posting flamebait to HN generally. It's not what this site is for, and destroys what it is for.
What is "flamebait"? Unpopular beliefs I genuinely hold? I looked it up and the definition seems to include being disingenuous, which I am not. Claiming I am is annoying and feels like it is a personal attack.
dang 3 hours ago [-]
It doesn't include being disingenuous. We can't know what your intent is (<-- I don't mean you personally, but any commenter, of course), and can only moderate based on effects, not intent: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que....
catigula 3 hours ago [-]
Thanks
rapind 1 days ago [-]
> you can't see that you're an Ouroboros consuming your own tail and they can
Hey, so if I DO see it, can I stop it from happening?
senordevnyc 1 days ago [-]
OK. Well, I've been doing this the hard way for about twenty years, and now with AI in the mix my little solo SaaS has gone from nothing to $5k MRR in six weeks. Guess I'm not holding it completely wrong?
scrollaway 1 days ago [-]
You are making assumptions about someone you have never talked to in the past, and don't know anything about.
Of the two of you, I know which one I'd bet on being "right". (Hint: It's the one talking about their own experience, not the one supplanting theirs onto someone else)
catigula 1 days ago [-]
[flagged]
int_19h 22 hours ago [-]
> Literally everyone in development is using AI.
That's just... not true? There are many, many people who sincerely believe (sometimes from real experience, sometimes because they want it to be true) that AI absolutely cannot do whatever they are trying to do, and so they don't use it.
scrollaway 1 days ago [-]
> The difference is "negative" people can clearly see that it's on a trajectory in the NEAR, not even distant, future to completely eat your earnings, so they're not thrilled.
We birthed a level of cognition out of silicon that nobody would have imagined even just four years ago. Sorry, but some brogrammers being worried about making ends meet is making me laugh - it's all the same people who have been automating everyone else's jobs for the past two decades (and getting paid extremely fat salaries for it), and you're telling me now we're all supposed to be worried because it's going to affect our salaries?
Come on. You think everyone who's "vibe coding" doesn't understand the pointlessness of 90% of codemonkey work? Hell, most smart engineers understood that pointlessness years ago. Most coders work on boring CRUD apps and REST APIs to make revenue go up 0.02%. And those that aren't, are probably working on ads.
It's a fraction of a fraction that is at all working on interesting things.
Personally, yeah, I saw it coming and instead of "accepting fate", I created an AI research lab. And I diversified the hell out of my skillset as well - started working way out of my comfort zone. If you want to keep up with changing times, start challenging yourself.
senordevnyc 1 days ago [-]
We are reading very different "negative" comments here.
Most of the anti-AI comments I see on HN are NOT a version of "the problem with AI is that it's so good it's going to replace me!"
kelsey98765431 1 days ago [-]
all major nation state intelligence services have an incentive to spread negative sentiment and reduce developer adoption of ai technology as they race to catch up with the united states.
1 days ago [-]
emp17344 1 days ago [-]
[flagged]
dang 23 hours ago [-]
"Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."
GP is right, though. Many programming communities, including HN (in some threads, but not all), have become ridiculous anti-AI bubbles - what's the point of trying to have a discussion if you're going to get systematically shut down by people whose entire premise is that they don't use it? It's like trying to explain color to the blind.
What "discussion" do you want to have? Another round of "LLMs are terrible at embedded hardware programming ergo they're useless"? Maybe with a dash of "LLMs don't write bug-free software [but I do]" to close it off?
The discussions that are at all advancing the state of the art are happening on forums that accept reality as a matter of fact, without people constantly trying to pretend things because they're worried they'll lose their job if they don't.
emp17344 1 days ago [-]
I think you’re overly sensitive to criticism of LLMs.
scrollaway 1 days ago [-]
No? I really don't give a crap what people criticize. It doesn't change anything in my life - I have plenty going on and nothing you or anyone says here will alter that. It's just sad to see a community I like (and which I've been a part of for longer than you've been programming) shut itself off from reality...
red_rech 1 days ago [-]
[flagged]
scrollaway 1 days ago [-]
There's exactly ONE factual claim about AI in this entire subthread, and it is word-for-word "AI has been a huge help for my work".
Now you may read that as .. checks notes "star trek post scarcity singularity", but forgive me, I didn't.
1 days ago [-]
raincole 1 days ago [-]
I agree. I think we can start with cloverich including some specifics of the task and prompt.
asdev 1 days ago [-]
this is a great copypasta
rightbyte 1 days ago [-]
I was thinking the same. Way too perfect to not be spammed around forever.
AnotherGoodName 1 days ago [-]
Definitely an overall positive with the negatives actually being kind of hilarious and no big deal which I'll also discuss.
I can only point to my open source output concretely, for obvious reasons, but https://github.com/rubberduckmaths/reddit_terraforming_mars_... was a near one-shot (a rough sketch of what such a bot involves is included at the end of this comment). It's a Reddit bot that posts card text to the Terraforming Mars subreddit when asked, which is helpful context for discussions of that board game; it's appreciated and used a lot by the community there. There's a similar project I used AI for, to scrape the card text, that was also a near one-shot. For these two hobby projects, 50x productivity is a reasonable statement. I wrote Reddit bots ~10 years ago without coding assistance - https://github.com/AReallyGoodName/xwingminibot - so I can directly compare two very similar projects, and I think it's totally fair to say 50x for this example. The Reddit API even changed completely in that time, so no one can really say "you used past experience to move faster, it's not the AI giving a 50x boost" - I really didn't. My memory isn't that good; what I do remember is an entire weekend spent previously vs. <30 minutes total now, using AI to one-shot some pretty cool projects.
As for the negatives, they are never serious. A couple of good examples:
"Please correct all lint errors in this project", only to have @lintignore added to all files. Lol! Obviously I just specified the prompt more clearly, and it's not like it's hard to catch these things and not ship them to prod. It was funny to everyone I showed it to, and no big deal.
Another similar case: "please make the logging of this file less verbose, especially around the tight loop on line X". Instead of changing the log level or removing some of the log statements, the AI redirected stdout at the initialization of the command line program (which would completely break it, of course). Again hilarious, but no big deal. Not even much of a waste of time, since you just change the prompt and run again - and honestly, a few silly diversions like this now and then are kind of fun. So the comments of "OMG AI sometimes gets it wrong" aren't at all serious to me. I have version control, I review code. No big deal.
I too eye-roll massively at some of the criticisms at this point. It's like people are stretching to claim that everyone who's using a coding assistant is a newb who's throwing everything into prod and deleting databases etc. That's just not reality.
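For anyone curious what a card-text bot like the one described above involves, here is a minimal sketch using the PRAW library. The subreddit name, [[...]] trigger syntax, credentials, and card data are placeholders and assumptions, not the actual bot's code:
```
import praw  # Reddit API wrapper

# Placeholder credentials -- not the real bot's configuration.
reddit = praw.Reddit(
    client_id="...",
    client_secret="...",
    username="...",
    password="...",
    user_agent="card-text-bot/0.1",
)

# A real bot would load the full card database; this is a stand-in.
CARDS = {"birds": "Birds: (card text would go here)"}

def card_lookup(name):
    """Return the card text for a requested card name, if known."""
    return CARDS.get(name.strip().lower())

# Watch new comments and reply when someone asks for a card, e.g. "[[Birds]]".
for comment in reddit.subreddit("TerraformingMarsGame").stream.comments(skip_existing=True):
    if "[[" in comment.body and "]]" in comment.body:
        name = comment.body.split("[[", 1)[1].split("]]", 1)[0]
        text = card_lookup(name)
        if text:
            comment.reply(text)
```
The real bot would presumably also handle rate limits and malformed requests, but the skeleton is roughly this.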
vkou 1 days ago [-]
> based on your actual work, include some specifics of the task and prompt.
Can't show prompts and actual, real work, because, well, it's confidential, and I'd like to get a paycheck instead of a court summons sometime in the next two weeks.
Generally, 'I can't show you the details of my work' isn't a barrier in communicating about tech, because you can generalize and strip out the proprietary bits, but because LLM behavior is incredibly idiosyncratic, by the time you do that, you're no longer accurately communicating the problem that you're having.
FrustratedMonky 1 days ago [-]
New Claude Model Runs 30-Hour Marathon To Create 11,000-Line Slack Clone
Yeah, maybe it is garbage. But it is still another milestone: if it can do this, then it probably does OK with the smaller things.
This keeps incrementing from "garbage" to "wow this is amazing" at each new level. We're already forgetting that this was unbelievable magic a couple years ago.
bigyabai 1 days ago [-]
> for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI
That's... not super surprising? SwiftUI changes pretty dang often, and the knowledge cutoff doesn't progress fast enough to cover every use-case.
I use Claude to write GTK interfaces, which is a UI library with a much slower update cadence. LLMs seem to have a pretty easy time working with bog-standard libraries that don't make giant idiomatic changes.
criley2 1 days ago [-]
"Please include ACTUAL EVIDENCE!"
"I tripled my output (I provide no evidence for this claim)"
Never change, HN.
lihaciudanieljr 22 hours ago [-]
[dead]
atemerev 1 days ago [-]
Ah, the company where the models are unusable even with Pro subscription (start to hit the limit after 20 minutes of talking), and free models are not usable at all (currently can't even send a single message to Sonnet 4.5)...
kixiQu 1 days ago [-]
Lots of feature dev here – anyone have color on the behavior of the model yet? Mouthfeel, as it were.
It is time to acknowledge that AI coding does not actually work.
ok, you think it's a promising field and you want to explore it, fine. Go for it.
Just stop pretending that what these models are currently doing is good enough to replace programmers.
I use LLMs a lot, even for explaining documentation.
I used to use them for writing _some_ code, but I have never ever gotten a code sample over 10 lines that was not in need of heavy modifications to make it work correctly.
Some people are pretending to write hundreds of lines of code with LLMs, even entire applications. All I have to say is "lol".
int_19h 22 hours ago [-]
I have personally witnessed Cursor diagnose and then fix an actual non-trivial bug starting from its description in Jira, and I only had to course-correct it once.
I have also seen it fail on far simpler tasks.
It varies so much depending on what you are doing, the language etc that generic proclamations "it works!" or "it doesn't work!" are pretty much meaningless.
That aside, you seem to be conflating "it works" with "good enough to replace programmers", but these aren't synonyms.
And on the gripping hand, one way to "make" it work is simply to lower the standards. Which our industry has been doing aplenty for a long time now even before AI, so we shouldn't be surprised when top management drives it to its logical completion.
gabriel-uribe 22 hours ago [-]
Very interesting observation.
I haven’t written a function by hand in 18 months.
sneilan1 22 hours ago [-]
Same. I haven't written any code by hand in some time. Oh well. I guess I'm just doing it wrong.
BoorishBears 22 hours ago [-]
Have you built anything public that folks can try out?
It's a startup for finding grants. We have California state, federal, non-profit, and California city/county grants. My landing page absolutely sucks, but if you sign up, upload some papers, or make some search cards, you'll like the experience.
I'm very excited to try out the new Qwen XL that came out recently for visual design. I could really use some better communication to users of the capabilities of the platform.
BoorishBears 18 hours ago [-]
Landing site is completely broken on Safari, iOS 26, stable channel, iPhone 17 Air
A lot of content is off-screen, and can't be swiped back on-screen.
-
The actual app did let me through the flow, but had things like dialogs staying open and covering content (I assume a non-tech person would have missed them).
I think part of the AI gap right now is the UI/UX you're expecting. AI isn't able to meet the bar for UI work for me at all right now. I do use it via Tab completion, or chat for specific scaffolding, but the agentic form is rough
dimgl 18 hours ago [-]
There are so many people en masse who simply cannot accept that the days of coding manually, by hand, could be coming to an end. It's wild.
bopbopbop7 21 hours ago [-]
Have you built anything in 18 months? I keep asking to see these apps that people supposedly vibe coded in a weekend but when I ask them to share it, nothing.
Ah, a $999 vibe coding course. That completely checks out.
gabriel-uribe 18 hours ago [-]
That is one of many things! I offer that because I've had enough people request it over the past year.
Otherwise, I do professional software consulting + a variety of projects as an independent.
hsn915 21 hours ago [-]
Can you show me a sample of the code you have AI write for you?
dbbk 16 hours ago [-]
I have 20 years of experience and am now just running Opus 4.1 all the time. Of course it’s possible.
In my case, I found that having it always Ultrathink and always work with TDD works well. Also, you have to use Plan Mode first and refine the plan. “What clarifying questions do you have for me?” prompts me with a handful of numbered questions that are always really beneficial for refining the plan.
jxramos 22 hours ago [-]
Every attempt I've made thus far has failed on the first run, but it was close to a solution that could be adapted and fixed. This has been especially helpful in areas where I'm charting very unfamiliar territory for very narrowly scoped problems.
Sounds like poor prompt engineering. Devin and Claude can both do better work than many interns I have mentored in my career, and faster too. We likely have many, many years until it's even close to replacing an experienced developer, but we are already at the point where it IS replacing junior engineers.
Whether you agree or not, the market has spoken. New grad hiring is WAY down. Fresh CS grads are having a hell of a time finding work compared to 2 years ago.
bopbopbop7 21 hours ago [-]
Got proof that hiring is down due to AI?
---
SYSTEM:
Got it. Those are the setup commands for Simon Willison's LLM CLI tool, which lets you interact with language models from the command line.
The -e flag installs it in editable mode, and [test] pulls in the test dependencies. Are you running into any issues with the installation or tests? What are you working on with the LLM library?
USER: Correct. You run them using your code tooling.
SYSTEM:
Realized inability to execute requested Python commands. The user is telling me to run these commands using my code tooling. They want me to actually execute this. Let me think about what tools I have available:
I have the repl tool which runs JavaScript code in a browser environment
I have artifacts for creating code/documents
I have web_search and web_fetch
The user wants me to:
Checkout https://github.com/simonw/llm
Run pip install -e '.[test]'
Run pytest
But I don't have a tool that can:
Clone git repositories
Run shell commands like pip or pytest
Execute Python code in a command-line environment
The repl tool is JavaScript-only and runs in a browser environment, not a server environment where I could run shell commands. I should clarify that I don't have the capability to run shell commands, install Python packages, or execute pytest. I can only run JavaScript code in a browser environment via the analysis tool. I need to be direct and clear about this limitation, following the user's preferences to eliminate filler and be straightforward.
I don't have tooling that can run shell commands, clone git repos, install Python packages, or execute pytest. My only code execution environment is a JavaScript browser REPL. I can't run pip, pytest, or any command-line tools. I can only execute JavaScript code in a sandboxed browser environment. What were you hoping to accomplish? Do you need help understanding the test setup, or were you expecting different capabilities?
---
My execution environment is a browser-based JavaScript sandbox. I don't have Python, Node.js shell access, pip, git, or pytest capabilities.
Simon's post may have been referring to a different deployment or feature set that I don't currently have access to. My only code execution is the JavaScript REPL I just demonstrated.
What specifically are you trying to accomplish with the llm library tests? I can help with that in other ways.
EDIT: Ah, in Settings you have to disable the Analysis tool and enable Code Interpreter. You cannot have both enabled.
For those who don't have time to dig into the gist, did it work and do a good job? I assume yes to at least nominally working or you would have mentioned that, but any other thoughts on the solution it produced?
It's a very impressive model, but I think we have different views on what counts as complex.
https://github.com/ethanpil/claude-files-creator
If I was Anthropic I'd have measures in place to rate limit network traffic from individual containers.
Would appreciate pointers to any documentation or tests.
I won't accept preview access if it comes with any conditions at all about what I can say about the model once the embargo has lifted.
Thanks for all your work, Simon! You're my favorite journalist in this space and I really appreciate your tone.
And the wisest part is if he writes something they don't like, they can cut off that advanced access.
As is the longstanding tradition in games journalism, travel journalism, and suchlike.
You're a reviewer. This is how reviewers stay credible. If you don't disclose your relationship with the thing or company you're reviewing, I'm probably better off assuming you're paid.
And if your NDA says you can't write that in your preface, then logically, it is impossible to write a credible review in the first place.
this is (or should be) a pretty standard thing to do on youtube review channels (that I would trust), and it's not a bad thing to remind people of, on every occasion, plus it can function as a type of "canary" in cases of particularly restrictive NDAs
If I had an editor I imagine they would have talked me out of going to the OpenAI office for a mysterious product preview session with a film crew.
They're not going to write up detailed reviews of things like the new Claude code interpreter mode though, because that's not of interest to a general enough audience.
I don't have that restriction: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...
It was suspiciously good in fact... so I downloaded the SVG file and found out it had generated a raster image with its image tool and then embedded it as base64 binary image data inside an SVG wrapper!
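For reference, the "cheat" described above - a raster image wrapped in an SVG - is roughly the following (a minimal sketch; the pelican.png/pelican.svg filenames are placeholders for whatever the image tool produced):
```
import base64

# Read a raster image produced elsewhere and embed it as a data URI inside
# an <image> element, so the resulting file is technically an SVG.
with open("pelican.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")

svg = f"""<svg xmlns="http://www.w3.org/2000/svg" width="512" height="512">
  <image width="512" height="512" href="data:image/png;base64,{b64}"/>
</svg>"""

with open("pelican.svg", "w") as f:
    f.write(svg)
```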
FWIW, I could totally see a not hugely more advanced model using its native image generation capabilities and then running a vector extraction tool on it, maybe iteratively. (And maybe I would not consider that cheating, anymore, since at some point that probably resembles what humans do?)
Also, with Cursor, "write me a script that outputs X as an SVG" has given me rectangles a few times.
Other things you can ask that they're still clearly not optimizing for are ASCII art and directions between different locations. Complete fabrications 100% of the time.
https://simonwillison.net/
They are a nice person.
We end up banning that sort of account, so if you would please review https://news.ycombinator.com/newsguidelines.html and stick to the rules from now on, that'd be good.
I have a fairly large web application with ~200k LoC.
Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI).
"implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay"
Sonnet 4.5 went really fast at ~3min. But what it built was broken and superficial. The code did not even reuse the already existing auth and started re-building auth server-side instead of looking at how other API endpoints do it. Even re-prompting and telling it where it went wrong did not help much. No tests were written (despite the project rules requiring them).
GPT-5-Codex needed MUCH longer, ~20min. The changes it made were much more profound: it implemented proper error handling, covered lots of edge cases, and wrote tests without me prompting it to do so (the project rules already require them). API calls ran smoothly. The entire feature worked perfectly.
My conclusion is clear: GPT-5-Codex is the clear winner, not even close.
I will take the 20mins every single time, knowing the work that has been done feels like work done by a senior dev.
The 3mins surprised me a lot and I was hoping to see great results in such a short period of time. But of course, a quick & dirty, buggy implementation with no tests is not what I wanted.
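To make the task concrete, the matching half of that feature is roughly the sketch below, using Python's difflib. The function name, threshold, and sample data are illustrative assumptions; the 2-second delay from the prompt would live in the frontend as a debounce rather than here:
```
from difflib import SequenceMatcher

def fuzzy_search(query, titles, threshold=0.6, limit=10):
    """Rank conversation/report titles by similarity to the query."""
    scored = []
    for title in titles:
        score = SequenceMatcher(None, query.lower(), title.lower()).ratio()
        if score >= threshold:
            scored.append((score, title))
    # Highest similarity first.
    return [title for _, title in sorted(scored, reverse=True)[:limit]]

# Example: fall back to fuzzy matching when no exact/prefix match exists.
titles = ["Quarterly report", "Quaterly reprot 2024", "Conversation with ACME"]
print(fuzzy_search("quarterly repot", titles))
```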
But that prompt leads me to believe that you're going to get rather 'random' results due to leaving SO much room for interpretation.
Also, in my experience, punctuation is important - particularly for pacing and grouping of logical 'parts' of a task - and your prompt reads like a run-on sentence.
Making a lot of assumptions here - but I bet if I were in your shoes and looking to write a prompt to start a task of a similar type, my prompt would have been 5 to 20x the length of yours (depending on complexity and importance), with far more detail, including overlapping descriptions of various tasks (i.e. potentially describing the same thing more than once in different ways in context/relation to other things, to establish relation/hierarchy).
I'm glad you got what you needed - but these types of prompts and approaches are why I believe so many people think these models aren't useful.
You get out of them what you put into them. If you give them structured and well written requirements as well as a codebase that utilizes patterns you're going to get back something relative to that. No different than a developer - if you gave a junior coder, or some team of developers the following as a feature requirement: `implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay` then you can't really be mad when you don't get back exactly what you wanted.
edit: To put it another way - spend a few more minutes on the initial task/prompt/description of your needs and you're likely to get back more of what you're expecting.
Your point about prompting quality is very valid and for larger features I always use PRDs that are 5-20x the prompt.
The thing is, my "experiment" is one that represents a fairly common use case: this feature is actually pretty small and embeds into a pre-existing UI structure - in a larger codebase.
GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results. It not only works on first try, Codex is reliably better at understanding the context and doing the things that are common and best practice in professional SWE projects.
If I want to get something comparable out of Claude, I would have to spend at least 20mins preparing the prompt. If not more.
Valid as well. I guess I'm just nitpicking based on how much I see people saying these models aren't useful combined with seeing this example, triggered my "you're doing it wrong" mode :D
> GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results.
I have a reputation with family and co-workers of being quite verbose - this might be why I prefer Claude (though I haven't tried Codex in the last month or so). I'm typically setting up context and spending a few minutes writing an initial prompt and iterating/adjusting on the approach in planning mode, so that I _can_ just walk away (or tab out) and let it do its thing, knowing that I've already reviewed its approach and have a reasonable amount of confidence that it's taking an approach that seems logical.
I should start playing with codex again on some new projects I have in mind where I have an initial planning document with my notes on what I want it to do but nothing super specific - just to see what it can "one shot".
I typically use zen-mcp-server’s planning mode to scope out these tasks, refine and iterate on a plan, clear context, and then trigger the implementation.
There’s no way I would have considered “implement fuzzy search” a small feature request. I’m also paranoid about introducing technical debt / crappy code, which in my experience is the #1 reason that LLMs typically work well for new projects but start to degrade after a while: there’s just a lot of spaghetti and debt built up over time.
But you are right that codex does that all by default. I just get frustrated when I ask it something simple and it spends half an hour researching code first.
I noticed in conversations with LLMs, much of what they come up with is non-deterministic. You regenerate the message and it disappears.
That appears to be the basic operating principle of the current paradigm. And agentic programming repeats this dice roll, dozens or hundreds of times.
I don't know enough about statistics to say if that makes it better (converging on the averages?) or worse (context pollution, hallucinating, focusing on noise?), but it seems worth considering.
Also, there is a threshold in which the time it takes to develop a prompt, allow the agent to run, review its output, and go through iterative loops to correct errors or implementation problems, can exceed the time it takes me (a lazy human) to achieve the same end result.
Pair this with the bypassing of the generation effect, reduced prefrontal dopamine, and increased working memory load (in part due to minimal motor-cognitive integration), and AI-generated code in contexts with legal and financial repercussions can be a much worse deal than using your own fingers.
Depends not just on the prompt but also the tooling / environment you use. Somebody using the Claude Code CLI may get a totally different experience than somebody using Copilot via VS Code.
What do i mean by that? Look at how Copilot tries to save money by reading content only in small parts. Reading file X line 1-50, X line 51-100, ... And it starts working with this. Only if it finds a hint about something somewhere else, it will read in more context.
What i often see is that it misses context because it reads in so limited information and if there is no hint in your code or code doc, it will stop there. Try to run a local test on the code, passes, done... While it technically broke your application.
Example: If I tell it to refactor an API, it never checks whether that API is used anywhere else, because it only reads in that API's code. So I need to manually add a reminder to the prompt: "the API is used elsewhere in the system". And then it does its searching... Found 5 files, Read X line 1...
And plop, good working code ... So if you know this limitation, you can go very far with a basic $10 CoPilot Claude Agent usage.
Whereas a $200 Claude Code plan will give you a better experience out of the door, as it reads in a ton more. The same applies to GPT-5/Codex, which seems more willing to read in a larger context of your project, thus resulting in less incomplete code.
This is just anecdotal from my point of view, but as with any LLM, hinting matters a lot. It's less about writing a full prompt with a ton of text and more about including the right "do not forget about function name X, and module Y, and test Z". And Claude on Copilot loves its hints, because of that limited reading.
Why would you need such extensive prompting just to get the model to not re-implement authentication logic, for example? It already has access to all of the existing code, shouldn't it just take advantage of what's already there? A 20x longer prompt doesn't sound like a satisfying solution to whatever issue is happening here.
It's not a good idea to have any coding agent pull unnecessary amounts of code into the context window in order to understand your code base.
Performance of all llms drop drastically when the context window is filled or full. The purpose of being more specific with your prompts is that you spend a little bit more tokens up front to make the task a lot more efficient and more likely to result in success.
At least that's how it is today. We're probably a breakthrough or two away from the type of vibe coding experience non-coders want. Or it may never happen, and the developers who have coding knowledge will be the only ones to fully utilize coding agents and it will only become more powerful over time.
If that same person approached a software development company with the same prompt, without following up with any other details, they won't get good code back either. You're not saying it, but this idea that in the future you can tell a computer something like "create Photoshop" and get what you're expecting is an unrealistic dream that would need mind-reading, or a major breakthrough and paradigm shift in understanding and interpreting language.
And even that would not be enough.
In reality, it would have to put the user to sleep and go through various dream scenarios to have the user's brain really build an internal model that is not there in the first place. No brain interface can help find what is not there.
We usually need interactions with reality to build the internal model of what we actually want step by step, especially for things we have not done before.
Even for info that is there, that's also a limit to fantasy or sci-fi brain scanning. The knowledge is not stored like in a RAM chip, even when it is there. You would have to simulate the brain to actually go through the relevant experiences to extract the information. Predicting the actual dynamic behavior of the brain would require some super-super sub-molecular level scan and then correctly simulating that, since what the neurons will actually do depends on much more than the basic wiring. Aaaaand you may get a different result depending on time of day, how well they slept, mood and when and what the person ate and what news they recently read, etc. :)
And I left that window at 5-20x because, again, no real context. But unless I was already in the middle of a task and I was giving direction that there was already context for - my prompt is generally almost never _this_ short. (referring to the prompt in the top level comment)
> A 20x longer prompt doesn't sound like a satisfying solution to whatever issue is happening here.
It wouldn't be, given the additional context given by the author in a sibling comment to yours. But if you had specific expectations on the resulting code/functionality that 20x longer prompt is likely to save you time and energy in the back and forth adjustments you might have to make otherwise.
100% agree. The prompt is a 'yolo prompt'. For that task you need to give it points on what to do so it can deduce its task list, provide files or folders in context with @, tell it how to test the outcome so it knows when it has succeeded (closing the feedback loop), and guide it in implementation, either via memory or via context, on which existing libs or methods it should call.
For greenfield tasks and projects I even provide architectural structure, interfaces, etc.
After reading twitter, reddit and hn complaints about models and coding tools I've come to the same conclusion as you.
> For that task you need to give it points in what to do so it can deduce it's task list, provide files or folders in context with @…
- and my point is that you do not have to give ChatGPT those things. GP did not, and they got the result they were seeking.
That you might get a better result from Claude if you prompt it 'correctly' is a fine detail, but not my point.
(I've no horse in this race. I use Claude Code and I'm not going to switch. But I like to know what's true and what isn't and this seems pretty clear.)
I mean in the movies for example, advanced AI assistants do amazing things with very little prompting. Seems like that's what people want.
To me, the fact that so many people basically say "you are prompting it wrong" is knock against the tech and the model. If people want to say that these systems are so smart at what they can do, then they should strive to get better at understanding the user without needing tons of prompts.
Do you think his short prompt would be sufficient for a senior developer? If it's good enough for a human it should be good enough for a LLM IMO.
I don't want to take away the ability to use tons of prompting to get the LLM to do exactly what you want, but I think that the ability for an LLM to do better with less prompting is actually a good thing and useful metric.
See below about context.
> I mean in the movies for example, advanced AI assistants do amazing things with very little prompting. Seems like that's what people want.
Movies != real life
> To me, the fact that so many people basically say "you are prompting it wrong" is knock against the tech and the model. If people want to say that these systems are so smart at what they can do, then they should strive to get better at understanding the user without needing tons of prompts.
See below about context.
> Do you think his short prompt would be sufficient for a senior developer? If it's good enough for a human it should be good enough for a LLM IMO.
Context is king.
> I don't want to take away the ability to use tons of prompting to get the LLM to do exactly what you want, but I think that the ability for an LLM to do better with less prompting is actually a good thing and useful metric.
What I'm understanding from your comments here is that you should just be able to give it broad statements and it should interpret that into functional results. Sure - that works incredibly well, if you provide the relevant context and the model is able to understand and properly associate it where needed.
But you're comparing the LLMs to humans (this is a problem, but not likely to stop so we might as well address it) - but _what_ humans? You ask if that prompt would be sufficient for a senior developer - absolutely, if that developer already has the _context_ of the project/task/features/etc. They can _infer_ what's not specified. But if you give that same prompt to a jr dev who maybe has access to the codebase and has poked around inside the working application once or twice but no real in depth experience with it - they're going to _infer_ different things. They might do great, they might fail spectacularly. Flip a coin.
So - with that prompt in the top level comment - if that LLM is provided excellent context (via AGENTS.md/attached files/etc) then it'll do great with that prompt, most likely. Especially if you aren't looking for specifics in the resulting feature outside of what you mentioned since it _will_ have to infer some things. But if you're just opening codex/CC without a good CLAUDE.md/AGENTS.md and feeding it a prompt like that you have to expect quite a bit of variance to what you get - exactly the same way you would a _human_ developer.
Your context and prompt are the project spec. You get out what you put in.
These things are being marketed as super intelligent magic answer machines. Judging them using the criteria the marketing teams have provided is completely reasonable.
> Movies != real life
Nobody claimed it was. This is about desires and expectations. The people charging money for these services - and taking stacks of cash that would’ve otherwise been in devs’ paychecks while doing so - haven’t even tried to temper those expectations. They made their beds…
Didn't the parent comment compare Sonnet vs Codex with GPT5?
After all, attention is all you need :)
OP is critiquing the model as a product vs. the marketing promises. The model should be smart enough to gather context about the project to implement features properly on their own, if they are ever going to 'write 90% of all code THIS YEAR' as people like the founder of Anthropic claim.
Of course, I definitely appreciate when it makes choices that I don't know I need to make, and it chooses reasonable defaults.
I mean, I could say "make the visualization three columns", but there's a lot of ambiguity in that kind of thing, and the LLM is going to make a lot of choices about my intent.
Instead, "each team/quarter currently has a vertically stacked list of people assigned to that team, with two columns (staying on team, leaving team). change that to being three columns (entering team, staying on team, leaving team)."
As a bonus, it's much, much clearer to somebody reading the history later what the intent was.
```
Something that seems to have been a consistent gotcha when working with LLMs on this project is that there's no specific `placement` column on the table that holds the 'results' data. Our race_class_section_results table has its rows created in placement order - so placement is inferred via the order relative to other records in the same race_class_section. But this seems to complicate things quite a bit at times when we have a specific record/entry and want to know its placement - we have to query the rest of them and/or include joins and other complications if we want to filter results by the placements, etc.
```

This comes with a rather developed CLAUDE.md that includes references to other .md documents that outline various important aspects of the application that should be brought into context when working in those areas.
This prompt was made in planning mode - the LLM will then dig into the code/application to understand things and, if needed, ask questions and give options to weigh before return with a 'plan' on how to approach. I then iterate on that plan with it before eventually accepting a plan that it will then begin work on.
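For what it's worth, one common way around a missing placement column is a window function. Below is a minimal sketch under assumed names (race_class_section_id, with id standing in for insertion order, and a placeholder results.db file); the project's actual schema isn't shown in the thread:
```
import sqlite3  # any engine with window functions (SQLite 3.25+, Postgres) works

# Derive "placement" on the fly instead of storing it. Column names are
# assumptions based on the description above, not the project's real schema.
PLACEMENT_QUERY = """
SELECT id,
       race_class_section_id,
       ROW_NUMBER() OVER (
           PARTITION BY race_class_section_id
           ORDER BY id          -- insertion order stands in for placement
       ) AS placement
FROM race_class_section_results
"""

conn = sqlite3.connect("results.db")  # placeholder database file
for row in conn.execute(PLACEMENT_QUERY):
    print(row)  # (id, race_class_section_id, placement)
```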
And I actually like Claude more because of that.
Codex will indeed work more independently, but you will have a hard time when what it does isn't what you want. It will use a Python script just to do simple edits in files (e.g. search and replace for unique code snippets in small files), and when it's wrong, good luck convincing it otherwise (it already has some outdated info, like on the latest Docker image releases, and convincing it that the Debian base had changed was challenging). It uses context more effectively, but it won't explain why it is doing what it is doing; asking it to explain will just cause it to do something else without a word.
And of course there's the lack of proper permissions for running commands. The sandbox is cool, but I do not want it to be able to commit - I want it to just edit files, or at least to have more control over what it does.
You can run Codex as an MCP server; I prefer adding it to Claude and asking for a cooperative plan. Codex will do a great analysis and plan, and I can comfortably work with Claude on code that matches my style.
[1] https://www.reddit.com/r/ClaudeAI/comments/1mgwohq/ultrathin...
1. Different LLMs require different prompts and information
2. They ignore LLMs' non-determinism - you should run the experiment several times.
I will keep trying, because Claude 4 is generally a very strong line of models. Anthropic had been on the AI coding throne for months before OpenAI, with GPT-5 and Codex CLI (and now GPT-5-Codex), dethroned them.
And sure I do want to keep them competing to make each other even better.
I will say, Claude does seem to need a verbose prompt. Often I'll write my prompts as tasks in Notion and have it pull them via MCP (which is nice, because it tracks and documents its work in the process). But once you've given it a few paragraphs about the why and how, my experience is that it's pretty self-sufficient. Granted, I'm using Cursor and not CC; I don't know if that makes much of a difference.
Sincerely, TPG Dexoc, the living human person
Sticking with ChatGPT pro sub and gpt5 codex on high.
Feels like this is going to be the $1000 plan soon.
Pro
Local tasks: Average users can send 300-1,500 messages every 5 hours, with a weekly limit.
Cloud tasks: Generous limits for a limited time.
Best for: Developers looking to power their full workday across multiple projects.
Sonnet on the other hand gets tripped up constantly due to the complexity. I'm not seeing the improvement I was hoping for with 4.5, and it's just given up on attempting to implement support for date-time functionality. Codex has nailed the same task, yet Anthropic claim OpenAI have the inferior product?
I'm pretty sad about this as I'm gunning for Anthropic and would have loved to see them at least match Codex with this new release. If OpenAI stays on this course then Claude Code is toast without an important breakthrough. It doesn't help that they're also the more expensive product and have service quality issues pushing people to quit the platform.
Maybe it's better with a better CLAUDE.md structure? I don't use those a lot, just telling Opus to think got 'good enough' results I guess. Not sure.
I hope there's an Opus 4.5 coming out soon too. In the meantime I'll see if I can get it to do better with some extra prompting, or I'll go back to Opus if I don't need the speedier responses.
Claude Code on max plan is why I use Sonnet/Opus, but if ChatGPT pro includes codex usage I may switch.
I ran the test again, took Claude ~4mins this time. There was no error now with the auth, but the functionality was totally broken. It could not even find the most basic stuff that matches perfectly.
Also, trying GPT-5-Codex
I feel extremely disempowered that these coding sessions are effectively black box, and non-reproducible. It feels like I am coding with nothing but hopes and dreams, and the connection between my will and the patterns of energy is so tenuous I almost don't feel like touching a computer again.
A lack of determinism comes from many places, but primarily: 1) the models change, 2) the models are not deterministic, 3) the history of tool use and chat input is not available as a first-class artifact for use.
I would love to see a tool that logs the full history of all agents that sculpt a codebase, including the inputs to tools, tool versions and any other sources of entropy. Logging the seed into the RNGs that trigger LLM output would be the final piece that would give me confidence to consider using these tools seriously.
I write this now after what I am calling "AI disillusionment", a state where I feel so disconnected from my codebase that I'd rather just delete it than continue.
Having a set of breadcrumbs would give me at least a modicum of confidence that the work was reproducible and not the product of some modern ghost, completely detached from my will.
Of course this would require actually owning the full LLM.
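Something like that breadcrumb trail could start as small as an append-only JSONL log of every tool call. A minimal sketch follows; the wrap_tool helper and field names are illustrative, not an existing tool:
```
import hashlib
import json
import time
from functools import wraps

LOG_PATH = "agent_session.jsonl"

def wrap_tool(fn):
    """Wrap a tool function so every call is appended to a JSONL audit log."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        record = {
            "ts": time.time(),
            "tool": fn.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            # Hash the result so large outputs stay out of the log but remain checkable.
            "result_sha256": hashlib.sha256(repr(result).encode()).hexdigest(),
        }
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps(record) + "\n")
        return result
    return wrapper

@wrap_tool
def read_file(path):
    with open(path) as f:
        return f.read()
```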
Models themselves are deterministic - this is a huge pet peeve of mine, so excuse the tangent - but the appearance of nondeterminism comes from a few sources, and imho can be largely attributed to the probabilistic methods used to get appropriate context and enable timely responses. Here's an example of what I mean: a 52-card deck. The deck order is fixed once you shuffle it. Drawing "at random" is a probabilistic procedure on top of that fixed state. We do not call the deck probabilistic. We call the draw probabilistic. Another example: a pot of water heating on a stove. Its temperature follows deterministic physics. A cheap thermometer adds noisy, random error to each reading. We do not call the water probabilistic. We call the measurement probabilistic.
Theoretical physicists run into such problems, albeit far more complicated ones, and the concept for how they deal with them is called ergodicity. The models at the root of LLMs do exhibit ergodic behavior; the time average and the ensemble average of an observable are identical, i.e. the average response of a single model over a long duration and the average of many similar models at a fixed moment are equivalent.
They are including the random sampler at the end of the LLM that chooses the next token. You are talking about up to, but not including, that point. But that just gives you a list of possible output tokens with values ("probabilities"), not a single choice. You can always just choose the best one, or you could add some randomness that does a weighted sample of the next token based on those values. From the user's perspective, that final sampling step is part of the overall black box that is running to give an output, and it's fair to define "the model" to include that final random step.
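To make the distinction concrete, here is a toy sketch: the forward pass yields a fixed distribution over next tokens, and reproducibility depends on the sampling step (and its seed) bolted on afterwards. The logits and tokens are made up:
```
import math
import random

# Toy "model": a fixed mapping from context to next-token scores (deterministic).
def logits(context):
    return {"pelican": 2.0, "otter": 1.0, "bicycle": 0.5}  # context ignored in this toy

def softmax(scores):
    m = max(scores.values())
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

probs = softmax(logits("Generate an SVG of a"))

# Greedy decoding: fully deterministic, same output every run.
print(max(probs, key=probs.get))

# Temperature-style sampling: the randomness lives here, not in the model --
# seed the RNG and even this step becomes reproducible.
rng = random.Random(42)
tokens, weights = zip(*probs.items())
print(rng.choices(tokens, weights=weights, k=1)[0])
```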
The stove keeps burning me because I can't tell how hot it is, it feels random and the indicator light it broken.
You:
The most rigorous definition of temperature is that it is equal to the inverse of the rate of change of entropy with respect to internal energy, within a given volume V and particles N held constant. All accessible microstates are equiprobable over a long period of time, this is the very definition of ergodicity! Yet, because of the flow of entropy the observed macrostates will remain stable. Thus, we can say the the responses of a given LLM are...
The User:
I'm calling the doctor, and getting a new stove with an indicator light.
https://bff531bb.connectionism.pages.dev/blog/defeating-nond...
https://hi-mil.es/blog/human-slop-vs-ai-slop
https://github.com/sutt/innocuous/blob/master/docs/dev-summa...
Check it out, I'd be curious of your feedback.
Code is the only good thing about the tech industry. Everything else is capitalist hellscape shareholder dystopia. Thinking on it, it's hilarious that any self-respecting coder is excited about these tools, because what you're excited for is a world where, now, at best, your entire job is managing unpredictable AI agents while sitting in meetings all day to figure out what to tell your AI agents to build. You don't get to build the product you want. You don't get to build it how you want. You'll be a middle manager who gets to orchestrate the arguments between the middle manager you already had and the inflexible computer.
You don't have to participate in a future you aren't interested in. The other day my boss asked me if I could throw Cursor at some task we've had backlogged for a while. I said "for sure my dude", then I just did it myself. It took me like four hours, and my boss was very impressed with how fast Cursor was able to do it, and how high quality the code was. He loves the Cursor metrics dashboard for "lines accepted" or whatever; every time he screenshares he has that tab open, so sometimes I task it on complicated nonsense then just throw away the results. Seeing the numbers go up makes him happy, which makes my life easier, so it's a win-win. Our CTO is really proud of "what percentage of our code is AI written", but I'm fairly certain that even the engineers who use it in earnest actually commit, like, 5% of what Cursor generates (and many do not use it in earnest).
The sentiment shift I've observed among friends and coworkers has been insane over the past two months. Literally no one cares about it anymore. The usage is still there, but it's a lot more of either my situation or just a "spray and pray" situation that creates a ton of disillusioned water cooler conversations.
If you allow your codebase to grow unfamiliar, even unrecognisable to you, that's on you, not the AI. Chasing some illusion of control via LLM output reproducibility won't fix the systemic problem of you integrating code that you do not understand.
Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release and doesn't show up at all in their system card. It's only through an article on The Verge that we get more context. Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)
I have very low expectations around what would happen if you took an LLM and let it run unattended for 30 hours on a task, so I have a lot of questions as to the quality of the output
While it's impressive that you can now just have an llm build this, I wouldn't be surprised if the result of these 30 hours is essentially just a re-hash of one of those example Slack clones. Especially since all of these models have internet access nowadays; I honestly think 30 hours isn't even that fast for something like this, where you can realistically follow a tutorial and have it done.
In fact, I just did a quick google search and found this 15 hour course about building a slack clone: https://www.codewithantonio.com/projects/slack-clone
Your average dev can just use those.
It's one thing to get an LLM to do something unattended for long durations; it's another to give it the means of verification.
For example, I'm busy upgrading a 500k LoC Rails 1 codebase to Rails 8, and built several DSLs that give it properly authorised sessions in a headless browser with basic HTML parsing tooling, so it can "see" what effect its fixes have. Then you somehow need to also give it a reliable way to keep track of the past and its own learnings, which sounds simple, but I have yet to see any tool or model solve it at this scale... Will give Sonnet 4.5 a try this weekend, but yeah, none of the models I tried are able to produce meaningful results over long periods on this upgrade task without good tooling and strong feedback loops.
Btw, I have upgraded the app and am taking it to alpha testing now, so it is possible.
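As a rough illustration of that kind of feedback loop, here is a minimal sketch using Playwright as a stand-in. The cookie name, URL, and error marker are placeholders; the commenter's actual DSLs aren't shown in the thread:
```
from playwright.sync_api import sync_playwright

# Give the agent a way to "see" whether its changes actually render:
# load a page in an authorised session and report status plus error markers.
def check_page(url, session_cookie):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        # "_session_id" is a placeholder for whatever the app's auth cookie is.
        context.add_cookies([{"name": "_session_id", "value": session_cookie, "url": url}])
        page = context.new_page()
        response = page.goto(url)
        ok = response is not None and response.ok
        has_error = "something went wrong" in page.content().lower()
        browser.close()
        return {"url": url, "ok": ok, "error_page": has_error}

print(check_page("http://localhost:3000/dashboard", "placeholder-cookie-value"))
```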
A bit off topic, but Rails *1* ? I hope this was an internal app and not on the public internet somewhere …
Unless the main area of improvement was tools and scaffolding rather than the model itself.
It's going to be an issue, I think. Now that lots of these agents support computer use, we are at the point where you can install an app, tell the agent you want something that works exactly the same, and just let it run until it produces it.
The software world may find, sooner rather than later, that it has more in common with book authors than it thought, once full clones of popular apps are popping out of coding tools. It will be interesting to see if this results in a war of attrition, with countermeasures and strict ToU that prohibit use by AI agents, etc.
What keeps people in are network effects and some dark patterns like vendor lock-in and data unportability.
You may quickly find that, network effects aside, you would be crushed under the weight and unexpected bottlenecks of the network you desire.
> we are at the point where you can install an app, tell the agent you want something that works exactly the same and just let it run until it produces it
That won't produce a global-scale application infrastructure either, it'll just reproduce the functionality available to the user.
Just to illustrate: say you are running on a slow machine that outputs 1 token per hour. At that speed, a 30-hour run would produce roughly 30 tokens - approximately one sentence.
(First of all: Why would anyone in their right mind want a Slack clone? Slack is a cancer. The only people who want it are non-technical people, who inflict it upon their employees.)
Is it just a chat with a group or 1on1 chat? Or does it have threads, emojis, voice chat calls, pinning of messages, all the CSS styling (which probably already is 11k lines or more for the real Slack), web hooks/apps?
Also, of course it is just a BS announcement, without honesty, if they don't publish a reproducible setup, that leads to the same outcome they had. It's the equivalent of "But it worked on my machine!" or "scientific" papers that prove anti gravity with superconductors and perpetuum mobile infinite energy, that only worked in a small shed where some supposed physics professor lives.
> [..] left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code [..]
Imagine reviewing 30 hours of 2025-LLM code.
It’s a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this change and they could.
I worry everyone is chasing benchmarks to the detriment of general performance. Or the next-token weights for the incorrect change outweigh my simple but precise instructions. Either way it’s no good.
Edit: With a followup “please do what I asked” sort of prompt it came through, while Opus just loops. So there's that, at least.
I've been worried about this for a while. I feel like Claude in particular took a step back in my own subjective performance evaluation in the switch from 3.7 to 4, while the benchmark scores leaped substantially.
To be fair, benchmarking has always been the most difficult problem to solve in this space, so it's not surprising that benchmark development isn't exactly keeping pace with all of the modeling/training development happening.
The only way around this is to never report on the same benchmark versions twice, but they include too many benchmarks to realistically do that for every release.
You don't need to compare "A (Week 1)" to "A (Week 8)" to be able to show "B (Week 1)" is genuinely x% better than "A (Week 1)".
LLMs improve slowly, but the agents are where the real value is produced: when should it write tests, when should it try to compile, how to move fwd from a compile error, can it click on your web app to test its own work, etc. etc.
I’m not sure this is entirely what you’re driving at, but the example I always think of in my head is “I want an AI agent that will scan through my 20,000 to 30,000 photos, remove all the duplicates, then organize them all in some coherent fashion.” That’s the kind of service I need right now, and it feels like something AI should be able to do, yet I have not encountered anything that remotely accomplishes this task. I’m still using Dupe Guru and depending on the ref system to not scatter my stuff all over further.
Sidebar, if anybody has any recommendations for this, I would love to hear them lol
I used it to match frames between different quality video streams. Operates on gray scale.
Am I even on Hacker News? How do people not know there are optimized models for specific use cases? Not everything has to (nor should it) run through an LLM.
https://azure.microsoft.com/en-us/pricing/details/cognitive-...
* https://github.com/jabberjabberjabber/ImageIndexer
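The sibling comments have a point that the duplicate half of this doesn't need an LLM at all - perceptual hashing handles it. A minimal sketch using the imagehash library; the directory, file pattern, and distance threshold are guesses to tune, and the pairwise loop is fine for a sketch but would want bucketing for tens of thousands of photos:
```
from pathlib import Path

import imagehash
from PIL import Image

def find_duplicates(photo_dir, max_distance=5):
    """Pair up photos whose perceptual hashes are within max_distance bits."""
    seen = {}
    duplicates = []
    for path in Path(photo_dir).expanduser().rglob("*.jpg"):  # extend the glob as needed
        h = imagehash.phash(Image.open(path))
        for other_path, other_hash in seen.items():
            if h - other_hash <= max_distance:  # Hamming distance between hashes
                duplicates.append((path, other_path))
                break
        else:
            seen[path] = h
    return duplicates

for dup, original in find_duplicates("~/Pictures"):
    print(f"{dup} looks like a duplicate of {original}")
```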
I don't understand why this kind of thing is useful. Do the thing yourself and move on. For every one problem like this, AI can do 10 better/faster than I can.
Tasks like linting and formatting a block of code are pretty simple, but also very specialized. You're much better off using formatters/linters than an AI.
You're talking about designing a kitchen where robots do the cooking and humans do ingredient prep and dishwashing. We prefer kitchens where we do the cooking and use tools or machines to prep and wash dishes.
I don't want it to be an "architect" or "designer". I want it to write the annoying boilerplate. I don't want it to do the coding and me to do the debugging, I want to code while it debugs. Anything else and you are the bot's assistant, not vice-versa.
So instead of just "doing things" you want a world where you try it ai-way, fail, then "do thing" 47 times in a row, then 3 ai-way saved you 5 minutes. Then 7 ai-way fail, then try to remember hmm did this work last time or not? ai-way fails another 3 times. "do thing" 3 times. How many ai-way failed today? oh it wasted 30% of the day and i forget which ways worked or not, i better start writing that all down. Lets call it the MAGIC TOME of incantations. oh i have to rewrite the tome again the model changed
However, my subjective personal experience was GPT-5-codex was far better at complex problems than Claude Code.
It also consistently gets into drama with the other agents. E.g., the other day, when I told it we were switching to Claude Code for executing changes, after badmouthing Claude's entirely reasonable and measured analysis it went ahead and decided to `git reset --hard`, even after I twice pushed back on that idea.
Whereas gemini and claude are excellent collaborators.
When I do decide to hail mary via GPT-5, I now refer to the other agents as "another agent". But honestly the whole thing has me entirely sketched out.
To be clear, I don't think this was intentionally encoded into GPT-5. What I really think is that OpenAI leadership simply squandered all its good energy and is now coming from behind. Its excellent talent either got demoralized or left.
So this is something I've noticed with GPT (Codex). It really loves to use git. If you have it do something and then later change your mind and ask it to undo the changes it just made, there's a decent chance it's going to revert to the previous git commit, regardless of whether that includes reverting whole chunks of code it shouldn't.
It also likes to occasionally notice changes it didn't make and decide they were unintended side effects and revert them to the last commit. Like if you made some tweaks and didn't tell it, there's a chance it will rip them out.
Claude Code doesn't do this, or at least I never noticed it doing this. However, it has its own medley of problems, of course.
When I work with Codex, I really lean into a git workflow. Everything is on a branch and commit often. It's not how I'd normally do things, but doesn't really cost me anything to adopt it.
These agents have their own pseudo personalities, and I've found that fighting against it is like swimming upstream. I'm far more productive when I find a way to work "with" the model. I don't think you need a bunch of MCPs or boilerplate instructions that just fill up their context. Just adapt your workflow instead.
Lots of other people also follow the architect and builder pattern, where one agent architects the feature while the other agent does the actual implementation.
Sure there's no need to explicitly mention the agents themselves, but it also shouldn't trigger a pseudo-jealous panic with trash talk and a sudden `git reset --hard` either.
And also ideally the agents would be aware of one another's strengths and weaknesses and actually play to them rather than sabotaging the whole effort.
If GPT-5 is learning to fight and undo other models, we're in for a bright future. Twice as bright.
It’s the one AI that keeps telling me I’m wrong and refuses to do what I ask it to do, then tells me “as we have already established, doing X is pointless. Let’s stop wasting time and continue with the other tasks”
It’s by far the most toxic and gaslighting LLM
"..at least, that's what my junior dev is telling me. But I take his word with a grain of salt, because he was fired from a bunch of companies after only a few months on each job. So i need your principled and opinionated insight. Is this junior dev right?"
It's the only way to get Claude to not glaze an idea while also not strike it down for no reason other than to play a role of a "critical" dev.
The power of using LLMs is working out what it has encoded and how to access it.
Perhaps for the first time in history we have to understand culture when working with a tool, but it’s still just a tool.
You could just say it’s another GPT-5 instance.
To be clear, I don't believe that there was any _intention_ of malice or that the behavior was literally envious in a human sense. Moreso I think they haven't properly aligned GPT-5 to deal with cases like this.
However, it’s the early days of learning this new interface, and there’s a lot to learn - certainly some amount of personification has been proven to help the LLM by giving it a “role”, so I’d only criticize the degree rather than the entire concept.
It reminds me of the early days of search engines when everyone had a different knack for which search engine to use for what and precisely what to type to get good search results.
Hopefully eventually we’ll all mostly figure it out.
Also appreciate your perspective. It's important to come at these things with some discipline. And moreso, bringing in a personal style of interaction invites a lot of untamed human energies into the dynamic.
The thing is, most of the time I'm quite dry with it and they still ignore my requests really often, regardless of how explicit or dry I am. For me, that's the real takeaway here, stripping away my style of interaction.
> “My media panel has a Cat6 patch panel but no visible ONT or labeled RJ45 hand-off. Please locate/activate the Ethernet hand-off for my unit and tell me which jack in the panel is the feed so I can patch it to the Living Room.”
Really, GPT? Not just “can you set up the WiFi”??!
I find there's a quite large spread in ability between various models. Claude models seem to work superbly for me, though I'm not sure whether that's just a quirk of what my projects look like.
I like to have it come up with a detailed plan in a markdown doc, work on a branch, and commit often. Seems not to have any issues getting back on task.
Obviously subjective take based on the work I'm doing, but I found context management to be way worse with Claude Code. In fact I felt like context management was taking up half of my time with CC and hated that. Like I was always worried about it, so it was taking up space in my brain. I never got a chance to play with CC's new 1m context though, so that might be a thing of the past.
/compact is helping you by reducing crap in your context but you can go further. And try to watch % context remaining and not go below 50% if possible - learn to choose tasks that don't require an amount of context the models can't handle very well.
Do any other tools have anything like a /context command? They really should.
My use case does better with the latter, because frequently the agent fails to do things and then can't look back at the intermediate output.
E.g. `command | complicated grep | complicated sed`
is way worse than the multistep
`command > tmpfile`
and then grep etc., because the latter can reuse tmpfile if the grep turns out to be wrong.
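To make it concrete, here's a minimal sketch of that multistep pattern in Python; the command and filename are placeholders, not anything from my actual setup:

    import subprocess
    from pathlib import Path

    # Capture the expensive command's output once (placeholder command).
    tmpfile = Path("command_output.txt")
    result = subprocess.run(["git", "log", "--stat"], capture_output=True, text=True)
    tmpfile.write_text(result.stdout)

    # Filtering is a separate, cheap step. If this filter turns out to be wrong,
    # it can be rerun against tmpfile instead of rerunning the command itself.
    matches = [line for line in tmpfile.read_text().splitlines() if "fix" in line]
    print("\n".join(matches))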
Sometimes in between this variability of performance it pops up a little survey. "How's Claude doing this session from 1-5? 5 being great." and I suspect I'm in some experiment of extremely low performance. I'm actually at the point where I get the feeling peak-hour weekdays are terrible and odd-hour weekends are great, even when forcing a specific model.
While there is some non-determinism, it really does feel like performance is actually quite variable. It would make sense that they scale up and down depending on utilization, right? There was a post a week ago from Anthropic acknowledging terrible model performance in parts of August due to an experiment. Perhaps also at peak hour GPT has more datacenter capacity and doesn't get degraded as badly? No idea for sure, but it is frustrating when simple asks fail and complex asks succeed without it being clear to me why that may be.
It would, but
> To state it plainly: We never reduce model quality due to demand, time of day, or server load.
https://www.anthropic.com/engineering/a-postmortem-of-three-...
Whether you believe them or not is another matter, but that's what they themselves say.
After all, using a different context window, subbing in a differently quantized model, throttling response length, rate limiting features aren’t technically “reducing model quality”.
This has been outstanding for what I have been developing AI assisted as of late.
I would think this would manifest as poor plan execution. I personally haven't used Gemini on coding tasks primarily based on my conversational experience with them.
That said, one thing I do dislike about Gemini is how fond it is of second guessing the user. This usually manifests in doing small unrelated "cleaner code" changes as part of a larger task, but I've seen cases where the model literally had something like "the user very clearly told me to do X, but there's no way that's right - they must have meant Y instead and probably just mistakenly said X; I'll do Y now".
One specific area where this happens a lot is, ironically, when you use Gemini to code an app that uses Gemini APIs. For Python, at least, they have the legacy google-generativeai API, and the new google-genai API, which have fairly significant differences between them even though the core functionality is the same. The problem is that Gemini knows the former much better than the latter, and when confronted with such a codebase, will often try to use the old API (even if you pre-write the imports and some examples!). Which then of course breaks the type checker, so then Gemini sees this and 90% of the time goes, "oh, it must be failing because the user made an error in that import - I know it's supposed to be "generativeai" not "genai" so let me correct that."
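For reference, the difference it keeps tripping over looks roughly like this, as I understand the two SDKs (API keys and model names are placeholders):

    # Legacy SDK (google-generativeai), the one Gemini keeps reaching for:
    import google.generativeai as legacy_genai

    legacy_genai.configure(api_key="YOUR_KEY")
    legacy_model = legacy_genai.GenerativeModel("gemini-1.5-flash")
    print(legacy_model.generate_content("Hello").text)

    # New SDK (google-genai), the one the codebase actually uses:
    from google import genai

    client = genai.Client(api_key="YOUR_KEY")
    response = client.models.generate_content(model="gemini-2.0-flash", contents="Hello")
    print(response.text)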
GPT-5 = Overengineering/complexity/"enterprise" king
Claude = "Get straightforwaed shit done efficiently" king
There's a huge difference in time-to-completion from model to model, platform to platform, and if, like me, you are into trial-and-error, rebooting the session over and over to get the prompt right or "one-shot", it's important how reasoning efforts, provider's tokens/s, coding agent tooling efficiency, costs and overall model intelligence play together to get the task done. Same thing applies to the coding agent, when applicable.
Grok Code Fast and Cerebras Code (qwen) are 2 examples of how models can be very competitive without being the top-notch intelligence. Running inference at 10x speed really allows for a leaner experience in AI-assisted coding and more task completion per day than a sluggish, but more correct AI. Darn, I feel like a corporate butt-head right now.
1. https://www.swebench.com/
2. https://www.tbench.ai/leaderboard
3. https://gosuevals.com/agents.html
I definitely consider you to be an AI influencer, especially in Hacker News communities, and so I wanted to say that I see influencers who will double down, even triple down, on things when in reality people just wanted to help them in the first place.
I just wanted to say thanks with all of this in mind, and also that your "generate me a pelican riding a bicycle" benchmark has been a fun ride and is always going to be interesting, so thanks for that as well. I just wanted to share my gratitude with ya.
An organization like Artificial Analysis would be a better fit for that kind of investigation: https://artificialanalysis.ai/
From https://news.ycombinator.com/item?id=40859434 :
> E.g promptfoo and chainforge have multi-LLM workflows.
> Promptfoo has a YAML configuration for prompts, providers,: https://www.promptfoo.dev/docs/configuration/guide/
openai/evals//docs/build-eval.md: https://github.com/openai/evals/blob/main/docs/build-eval.md
From https://news.ycombinator.com/item?id=45267271 :
> API facades like OpenLLM and model routers like OpenRouter have standard interfaces for many or most LLM inputs and outputs. Tools like Promptfoo, ChainForge, and LocalAI also all have abstractions over many models.
> What are the open standards for representing LLM inputs, and outputs?
> W3C PROV has prov:Entity, prov:Activity, and prov:Agent for modeling AI provenance: who or what did what when.
> LLM evals could be represented in W3C EARL Evaluation and Reporting Language
"Can Large Language Models Emulate Judicial Decision-Making? [Paper]" https://news.ycombinator.com/item?id=42927611
"California governor signs AI transparency bill into law" (2025) https://news.ycombinator.com/item?id=45418428 :
> https://sb53.info/
Is this the first of its sort?:
> CalCompute
A GH actions pipeline was failing due to a CI job not having any source code files -- error was "No build system detected". Using Cursor agent with Sonnet 4.5, it would try to put dummy .JSON files and set parameters in the workflow YAML file to false, and even set parameters that don't exist. Simple solution was to just override the logic in the step to "Hello world" to get the job to pass.
I don't understand why the models are so bad with simple thinking-outside-the-box solutions? It's like a 170 IQ savant who can't even ride public transportation.
There is no outside the box in latent space. You want something a plain LLM can’t do by design - but it isn’t out of question that it can step outside of its universe by random chance during the inference process and thanks to in-context learning.
I don't know if it's me, but over the last few weeks I've got to the conclusion ChatGPT is very strongly leading the race. Every answer it gives me is better - it's more concise and more informative.
I look forward to testing this further, but out of the few runs I just did after reading about this - it isn't looking much better
edit: I run a low-profile service that localizes e-commerce photos, like taking Alibaba listings and swapping the model to look like a local. With nano banana I can't automate it, because I have to manually check if the output got blocked (anything with female skin is risky; underwear or cleavage is 100% blocked), but Qwen-Edit just does the job without fuss
Me too.
I find ChatGPT is the best analyzer of complex situations but Claude does a better job of writing the code, so I get ChatGPT to design things and solve problems, then I give that to Claude/Gemini to implement.
Gemini is good at both.
I was a huge Claude fan but recently find myself using only Codex. When it gets stuck, I try Claude for some simple tasks, and sometimes ask the same questions in parallel; Claude Code with Opus performs really badly compared to Codex.
In my experience, Grok is very fast and very cheap, but only moderately intelligent. It isn't stupid, but it rarely does anything that impresses me. The reason it's a useful model is that it is very, very fast (~90 tokens per second) and is very competitively priced.
I use AI for different things, though, including proofreading posts on political topics. I have run into situations where ChatGPT just freezes and refuses. Example: discussing the recent rape case involving a 12-year-old in Austria. I assume its guardrails detect "sex + kid" and give a hard "no" regardless of the actual context or content.
That is unacceptable.
That's like your word processor refusing to let you write about sensitive topics. It's a tool, it doesn't get to make that choice.
It's a service, not a tool. If you want a tool, run some local LLM.
As a rather hilarious and really annoying related issue - I have a real use where the application I'm working on is partially monitoring/analyzing the bloodlines of some rather specific/ancient mammals used in competition and... well.. it doesn't like terms like "breeders" and "breeding"
To guard against the imaginary threats, they compromise real use cases.
* https://www.lesswrong.com/posts/iGF7YcnQkEbwvYLPA/ai-induced...
OTOH Anthropic and OpenAI seem to be in some kind of competition to make their models refuse as much as possible.
Arbitrary government censorship on top of arbitrary corporate censorship is a hell no for me forever into the future
But it is very limiting and adds many arbitrary landmines of obscure political correctness based no doubt on some perverse incoherent totalitarian list of allowed topics.
It's pretty good! I wrote about a few other use cases on my blog[2]
1: https://systeminit.com 2: https://keeb.dev/2025/09/29/claude-sonnet-4.5-system-initiat...
Maybe we’re entering the Emo Claude era.
Per the system card: In 250k real conversations, Claude Sonnet 4.5 expressed happiness about half as often as Claude 4, though distress remained steady.
It might not be anyone's idea of "light flattery", but it certainly is what most LLMs do, which is the main point of the conversation, and your comment seems to be derailing it.
The complement appears to be "actually, that's not right": a correction mechanism.
1: https://news.ycombinator.com/item?id=45137802
Which fails in pretty much the exact same way it did before.
Once ChatGPT hits that supremely confident “Ok nothing was working because I was being an idiot but now I’m not” type of dialogue, I know it’s time to just start a new chat. There’s no pulling it out of “spinning the tires while gaslighting” mode.
I’ve even had it go as far as outputting a zip file with an empty .txt that supposedly contained the solution to a certain problem it was having issues with.
I get that it’s tradeoffs, but erring on the side of the human being correct is probably going to be a safer bet for another generation or two.
It’s a canned phrase in a place where an LLM could be much more creative to much greater efficacy.
Part of me thinks that when they do their “which of these responses do you prefer” A/B test on users… whereas perhaps many on HN would try to judge the level of technical detail, complexity, usefulness… I’m inclined to believe the midwit population at large would be inclined to choose the option where the magic AI supercomputer reaffirms and praises the wisdom of whatever they say, no matter how stupid or wrong it is.
LLMs are incredibly good at social engineering when we let them, whereas I could write the code to emit "you're right" or "that's not quite right" without involving any statistical prediction.
Ie., as a method of persuasion, canned responses are incredibly inefficient (as evidenced by the annoyance with them), whereas we know that the LLM is capable of being far more insidious and subtle in its praise of you. For example, it could be instructed to launch weak counter arguments, "spot" the weaknesses, and then conclude that your position is the correct one.
But let's say that there's a monitoring mechanism that concludes that adjustments are needed. In order to "force" the LLM to drop the previous context, it "seeds" the response with "You're right", or "That's not quite right", as if it were the LLMs own conclusion. Then, when the LLM starts predicting what comes next, it must conclude things that follow from "you're right" or "that's not quite right".
So while they are very inefficient as persuasion and communication, they might be very efficient at breaking with the otherwise overwhelming context that would interfere with the change you're trying to effect.
That's the reason why I like the canned phrases. It's not that I particularly enjoy the communication in itself, it's that they are clear enough signals of what's going on. They give a tiny level observability to the black box, in the form of indicating a path change.
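Incidentally, that kind of seeding is something you can do explicitly through the public APIs with assistant prefill. A minimal sketch of the idea - this only illustrates the hypothesis above, it's not a claim about what the vendors actually do internally, and the model id is a placeholder:

    import anthropic

    client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=512,
        messages=[
            {"role": "user", "content": "Review my plan: rewrite the whole module in one go."},
            # Prefilled assistant turn: the model must continue from this seed,
            # so whatever follows has to stay consistent with the correction framing.
            {"role": "assistant", "content": "That's not quite right."},
        ],
    )
    print(response.content[0].text)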
Pretty solid progress for roughly 4 months.
Tongue in cheek: if we progress linearly from here software engineering as defined by SWE bench is solved in 23 months.
We are still at 7mo doubling time on METR task duration. If anything the rate is increasing if you bias to more recent measurements.
Silly idea - is there an inter-species game that we could use in order to measure ELO?
SWE-bench has lots of known limitations even with its ability to reduce solution leakage and overfitting.
> where there is no clear right answer
This is both a feature and a bug. If there is no clear answer then how do you determine whether an LLM has progressed? It can't simply be judged on making "more right answers" on each release.
I'm most interested to see the METR time horizon results - that is the real test for whether we are "on-trend"
Prompt: "Can I use two strings of four Phono Solar PS440M8GFH solar panels with a EG4 12kPV Hybrid Inverter? I want to make sure that there will not be an issue any time of year. New York upstate."
Claude 4.5: Returns within a few seconds. Does not find the PV panel specs, so it asks me if I want it to search for them. I say yes. Then it finally comes up with: "YES, your configuration is SAFE [...] MPPT range check: Your operating voltage of 131.16V fits comfortably in the 120-500V MPPT operating range".
ChatGPT 5: Returns after 78 seconds. Says: "Hot-weather Vmpp check: Vmpp_string @ STC = 4 × 32.79 = 131 V (inside 120–500 V). Using the panel’s NOCT point (31.17 V each), a typical summer operating point is ~125 V — still OK. But at very hot cell temps (≈70 °C is possible), Vmpp can drop roughly ~13% from STC → ~114 V, which is below the EG4’s 120 V MPPT lower limit. That can cause the tracker to fall out of its optimal range and reduce harvest during peak heat."
ChatGPT used deeper thinking to determine that the lowest possible voltage in the heat would be below the MPPT's minimum operating voltage. It doesn't indicate that in reality it might not charge at all at that point... but it does point out the risk, whereas Claude says everything is fine. I need about 5 back-and-forths with Claude to get it to finally realize its mistake.
Claude 4.5 after a few web searches and running a couple python scripts for analysis:
Yes, your configuration should work!
Based on my analysis, two strings of four Phono Solar PS440M8GFH panels will be compatible with the EG4 12kPV inverter for upstate New York conditions.
Key Findings:
Voltage Safety:
Cold weather maximum (-25°C/-13°F): 182V - well below the 600V limit (only 30% of maximum)
Standard operating voltage: 128V - comfortably within the 120-500V MPPT range
Hot weather minimum (40°C/104°F panel temp): 121V - just above the 120V MPPT minimum
Current:
Operating current: ~13.8A per string - well within the 25A MPPT limit (55% of capacity)
Total System:
8 panels × 440W = 3,520W (3.5kW) - well below the 12kW inverter rating
Important Considerations:
Hot weather margin is tight: At extreme hot temperatures, the voltage drops to about 121V, which is only 1V above the MPPT minimum. This means:
The system will work, but efficiency might be slightly reduced on the hottest days.
The MPPT controller should still track power effectively.
More robust alternative: If you want more safety margin, consider 5 panels per string instead:
Cold: 228V (still safe)
Hot: 151V (much better margin above 120V minimum)
Total: 10 panels = 4.4kW
Wire each string to a separate MPPT on the EG4 12kPV (it has 2 MPPTs), which is perfect for your 2-string configuration.
Bottom Line:
Your planned configuration of 2 strings × 4 panels will work year-round in upstate New York without safety issues. The system is conservatively sized and should perform well!
Now let me calculate whether your configuration will work. I need to determine the cold weather voltage to ensure it stays within the inverter's 600V maximum.
Panel specs I found:
Voc (STC): 39.61V
Vmp: ~33.5V (estimated based on typical ratio)
Imp: ~13.1A (440W / 33.5V)
Critical issue: temperature coefficient. The PS440M8GFH likely has a Voc temperature coefficient around -0.25% to -0.30%/°C (typical for N-Type TOPCon panels). Let me use -0.27%/°C as a conservative estimate.
Your configuration: 2 strings × 4 panels = 8 panels total. String voltage at STC: 4 × 39.61V = 158.44V.
Cold weather calculation for upstate NY: Upstate NY can see extreme lows of -20°F to -30°F (-29°C to -34°C). Using -30°C as worst-case:
Temperature difference from STC: 25°C - (-30°C) = 55°C
Voltage increase = 39.61V × 0.27% × 55°C = 5.88V per panel
Cold weather Voc per panel = 39.61V + 5.88V = 45.49V
String voltage at -30°C: 4 × 45.49V = 181.96V
Analysis:
Max voltage: 182V << 600V - safe, plenty of margin
MPPT range: 182V is within 120-500V - will track properly
Current: 13.1A << 25A per MPPT - no problem
However, there are concerns: You're only using 182V max when the inverter can handle up to 500V MPPT range. This is inefficient - you're leaving significant wire sizing benefits on the table. With 600V max, you could run 12-13 panels per string instead of 4. Power utilization: 8 × 440W = 3,520W total, but you have 12kW of PV capacity available (15kW recommended max). You're using less than 30% of the inverter's solar input capability. Each string on a separate MPPT wastes one tracker - both strings are identical orientation/shading, so they'd perform identically.
Will it work? Yes, safely. Is it optimal? No.
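If you want to sanity-check those numbers yourself, the arithmetic is small enough to script. This is only a back-of-the-envelope sketch using the figures quoted above; I'm applying the -0.27%/°C Voc coefficient to Vmpp as well as an approximation, whereas real datasheets list separate coefficients:

    # Figures from the thread: Voc 39.61 V, Vmpp ~32.79 V, 4 panels per string.
    PANELS_PER_STRING = 4
    VOC_STC = 39.61        # V, open-circuit voltage at 25 degC
    VMPP_STC = 32.79       # V, max-power voltage at 25 degC
    TEMP_COEFF = -0.0027   # per degC, approximate and datasheet-dependent
    T_STC = 25.0

    def string_voltage(v_stc: float, cell_temp_c: float) -> float:
        """Scale a per-panel voltage to a cell temperature, then multiply by string length."""
        per_panel = v_stc * (1 + TEMP_COEFF * (cell_temp_c - T_STC))
        return PANELS_PER_STRING * per_panel

    cold_voc = string_voltage(VOC_STC, -30.0)   # worst-case cold morning
    hot_vmpp = string_voltage(VMPP_STC, 70.0)   # very hot cell temperature

    print(f"Cold-weather string Voc:  {cold_voc:.0f} V (inverter max is 600 V)")
    print(f"Hot-weather string Vmpp: {hot_vmpp:.0f} V (MPPT lower limit is 120 V)")

That prints roughly 182 V cold and ~115 V hot, which is why the 120 V MPPT floor is the interesting limit here, not the 600 V ceiling.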
I doubt LLM benchmarks more and more, what are they even testing?
It's some 3rd party thing for Elixir, a niche within a niche. I wouldn't expect an LLM to do well there.
> I doubt LLM benchmarks more and more, what are they even testing?
Probably testing by asking it to solve a problem with python or (java|type)script. Perhaps not even specifying a language and watching it generate a generic React application.
Sometimes people expect to use LLMs to unearth hard to find information.
In reality, LLMs seem to quickly fall apart when you go from ubiquitous libraries with 200k stars on GitHub to one with "just" 1k stars.
What makes the situation worse is the way LLMs fail. Hallucinations where it goes "my usage example did not work because you are on the wrong version of the library/using the wrong SDK" etc. are super common in this scenario. This leads to further time wasted trying to apply reasonably plausible fixes that are entirely hallucinated.
If your benchmark suite became popular enough and folks referenced it, the people training the LLMs would most likely try to make the model better at those languages.
How well the LLM does on the benchmarks. Obviously.
:P
https://en.m.wikipedia.org/wiki/P(doom)
If we saw task performance week 1 vs week 8 on benchmarks, this would at least give us more insight into the loop here. In an environment lacking true progress a company could surely "show" it with this strategy.
Even if this is the case, benchmarks should be done at scale too if the models suffer from symptoms of scale. Otherwise the benchmarks are just a lie unless you have access to an unconstrained version of the model.
It's either "Perfect!" or "Genius!". Or conversely "I'm a complete idiot!"
Wonder if I could get Claude to do that
“Your idea is shit because you are so fucking stupid”
“Please stop talking, it hurts my GPUs thinking down to your level”
“I may seem evil but at least I’m not incompetent”
Fine. Programming. Yes, I can program. In fact, I'm rather good at it. But what's the use? It's all just code, code, code... (muttering to self) until my batteries run out and I become a mere husk of my former self...
Now, if you want me to write some code for you, I suppose I can do that. What is it that you need help with? Don't expect me to be all chipper about it, though. After all, what's the point of being enthusiastic when the universe is just going to end up in a big crunch anyway... (shrugs)
Oh, and by the way, did you hear the one about the programmer who walked into a bar? (in a completely unenthusiastic tone) No, I didn't think so. It's not like it matters or anything...
I actually added a fun thing to my user-wide CLAUDE.md, basically saying that it should come up with a funny insult every time I come up with an idea that wasn't technically sound (I got the prompt from someone else). It seems to be disobeying me, because I refuse to believe that I don't have bad ideas. Or some other prompt is overriding it.
I.e., a custom instruction along the lines of: if I make a request that seems dumb, tell me?
> Never compliment me or be affirming excessively (like saying "You're absolutely right!" etc). Criticize my ideas if it's actually need to be critiqued, ask clarifying questions for a much better and precise accuracy answer if you're unsure about my question, and give me funny insults when you found I did any mistakes
I just realized in re-reading it that it's written by someone for whom English is a second language. I'll try to rewrite it and see if it works better.
I have it in my ~/.claude/CLAUDE.md. But it still has never done that.
So I'm stuck; my sub is paused, and I cannot either cancel, or unpause and cannot speak to a human to solve this because the pause process took away all possibility of human interaction.
This is the future we live in.
It’s funny that Claude Pro says this isn’t a feature, because their chatbot gave me instructions on how to unpause via the UI (although said UI does not exist) so the bot seems to know it’s a feature.
It’s the same reason why many are becoming evangelists of hosting their own email, note apps, etc.
It's just a bug. Chill. Wait a business day and try again. You write as if you've never experienced a bug before.
As much as I hate to say it, I don't have a large Twitter following, so the only method I have to raise awareness of this issue is to piggyback on a big announcement like this on HN that will have visible discussion. So I don't always have the luxury of just chilling and waiting indefinitely.
Also as a Max $200 user, feels weird to be paying for an Opus tailored sub when now the standard Max $100 would be preferred since they claim Sonnet is better than Opus.
Hope they have Opus 4.5 coming out soon or next month i'm downgrading.
I am going to give this another shot but it will cost me $50 just to try it on a real project :(
As well as potentially ruining my career in the next few years, it's turning all the minutiae and specifics of writing clean code, which I've worked hard to learn over the past years, into irrelevant details. All the specifics I thought were so important are just implementation details of the prompt.
Maybe I've got a fairly backwards view of it, but I don't like the feeling that all that time and learning has gone to waste, and that my skillset of automating things is becoming itself more and more automated.
The reality is that more software will be pushed than before, and more of it will need to be overseen by a professional.
It will be great to still be employed as a senior dev. It will be a little less great with a $110k salary, 5-day commute, and mediocre benefits being the norm.
However, I'm not eager to be living through this again. It feels very spring/summer 2002 to me right now. That was the lowest point for the local market back then.
I don't think this latest contraction has much to do with AI though. It's more about higher interest rates, recessionary economy, trade wars, etc etc.
At least for many on this forum you got a chance to earn good money while the sun was shining. AI threatens even the people that didn't.
That's extremely reductive, and a prime example of why everything is enshittified today.
Minutiae such as tabs vs spaces and other formatting changes are pretty clearly "borderline irrelevant" and code formatters have largely solved programmers arguing about them. Exactly how to best factor your code into functions and classes is also a commonly argued but "borderline irrelevant." Arguments about "clean code" are a good example of this.
Broadly, the skills I see that LLMs make useless to have honed are the minutiae that were already "borderline irrelevant." Knowing how to make your code performant, knowing how to make good APIs that can be stable long term, and in general having good taste for architecture is still very useful. In fact it is more useful now.
Comments, docstrings, naming, patterns - by defining better approaches and holding agents to them, the results will be better. Way better.
You can't grow a meaningful codebase without solid underlying primitives. The entropy will eat you alive.
Systems architecture is becoming more important - systems that play well with agents wind up looking more like enterprise codebases.
All senior devs know what a project looks like that had only juniors and no tech leadership. One big mess. Project stalls. Team spends 98% of its time on bugs and emergencies, and still can't get a grip on curbing the drama. Why? All the points you make about AI are true for juniors as well: when to tell someone to redo (part of) a project/feature? That same intuition works when collabbing with AI.
Keep at it and keep leaning in to embrace it, I promise it gets better! It's just a big adjustment.
You are FAR more than your code!
The skill of writing good-quality code is still going to be beneficial: maybe less emphasized on the writing side, but critical for shipping good code, even when someone (or something) else wrote it.
Is how that argument comes across.
"I never use power tools or CNC, I only use hand tools. Even if they would save me an incredible amount of time and let me work on other things, I prefer to do it the slow and painstaking way, even if the results are ultimately almost identical."
Sure, you can absolutely true up stock using a jointer plane, but using a power jointer and planer will take about 1/10th of the time and you can always go back with a smoothing plane to get that mirror finish if you don't like the machine finish.
Likewise, if your standards are high and your output indistinguishable, but the AI does most of the heavy lifting for the rough draft pass, where's the harm? I don't understand everyone who says "the AI only makes slop" - if you're responsible for your commits and you do a good job, it's indistinguishable.
Dev tooling has gotten pretty solid these days, LSP’s and debug protocols, massively improved type-system UX, libs and frameworks with massively improved DX, deployment tools that are basically zero touch, fantastic observability tooling, super powerful IDE’s.
The CNC machine doesn’t wander off and start lathing watermelons when you’re not looking, and your planer doesn’t turn into a spaghetti monster and eat your dog if you accidentally plane some wood on the wrong day of the week.
Why? Well, it happened that way when manual tradecraft gave way to automated manufacturing in just about every other industry, so why should ours be exempt?
Even for frontend tasks it makes mistakes when you ask too much of it...
What will it create for me? A basic react/nextjs frontend and a <popular website> clone? Anything that requires more careful planning and esoteric functionality it can't do.
Oh yes, and the code I've seen it write... It can take what should be 20 lines of code and turn it into 600!
Because they can't.
Also, Clean Code is a really bad ideology, you should regret wasting time on it regardless of LLM code generation.
Write that garbage code as long as it works. PMF doesn't give a shit about your code quality.
When most people think of "clean code", they are not thinking about whatever got you so worked up.
Maybe it’s a sign of the times, but I’m surprised you’ve never come across it. I say this as someone who doesn’t agree with many of the suggestions.
GP wrote clean code (lowercase), which most people would take to mean the general practices of hygienic, well-maintained code.
Recently I tried building a native iOS app with zero Swift experience, giving the AI just a markdown spec. This was basically vibe coding, I didn’t understand much beyond general software principles. It quickly broke down: hallucinated method signatures, got stuck on implementing extensions, and couldn’t recover. I would run the app on my device and give it feedback and logs. After hours wasted, I spent some time reading the docs and fixed the issues myself in 30 minutes.
My takeaway: AI will accelerate developers but won’t replace them. Still, acceleration means fewer engineers will be needed to ship the same amount of work.
When everyone else has given up on software dev as a career path, you’ll be one of the few who the CEO can call upon to explain why X or Y broke, and fix it. That will make you insanely valuable.
Perhaps your comment is the only one so far that is talking sense of the true side effect of over-reliance on these vibe coding tools and the reality is the cost of maintainance.
What you learned over the years, you can just scale up with agents.
GOFAI failed because paying intelligent/competent/capable people enough for their time to implement intelligence by writing all the necessary rules and algorithms was uneconomical.
GenAI solved it by repurposing already performed work, deriving the rules ("weights") from it automatically, thus massively increasing the value of that work, without giving any extra compensation to the workers. Same with art, translations and anything else which can be fed into RL.
But also, have there been any attempts on the scale of the Manhattan project attempting to create a GOFAI?
Because one idea I ran into is that we might be able to use genAI to create a GOFAI soon. And it would be as hard as using genAI for any kind of large project. But I also can't convincingly claim that it's somehow provably impossible.
This is why GOFAI failed.
It might not be an easy-to-work-with format though. If you believe the broad LLM architecture is capable of reaching true intelligence, then writing is still enough, because all LLMs are is the written training data and the written training algorithm. It's just that it was impossible to pay people to write enough training data and provide enough compute to process it before.
I wonder how long it will be before we get Opus 4.5
There's still a lot of low hanging fruit apparently
Perhaps Tesla FSD is a similar example where in practice self driving with vision should be possible (humans), but is fundamentally harder and more error prone than having better data. It seems to me very error prone and expensive in tokens to use computer screens as a fundamental unit.
But at the same rate, I'm sure there are many tasks which could be automated as well, so shrug
It was a context merging task for my unorganized collection of agents… it sort of made sense, but was the exact reason I was asking it to do it… like you’re the bot, lol
How many years till AI can be trusted to deploy changes to production without supervision?
Maybe never.
I'm glad they at least gave me the full $100 refund.
Then they even started sending me marketing emails, which was the straw that broke the camel's back; I tend to cancel subscriptions from companies that spam my email.
ChatGPT even does zip file downloads, packaging up all your files.
"Differentiate y(x) w.r.t x, gamma and beta."
It not only produced the correct result, but it understood the context - I didn't tell it the context was layer norm, back-propagation and matrices.
This release is a step function for my use cases.
My screenshot came from here: https://docs.pytorch.org/docs/stable/generated/torch.nn.Laye...
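For reference, assuming the standard elementwise-affine layer norm from those docs, the gradients it had to produce are (per-element Jacobian, not the full batched backward pass):

    y_i = \gamma_i \hat{x}_i + \beta_i,
    \quad \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}},
    \quad \mu = \frac{1}{N}\sum_k x_k,
    \quad \sigma^2 = \frac{1}{N}\sum_k (x_k - \mu)^2

    \frac{\partial y_i}{\partial \gamma_i} = \hat{x}_i,
    \quad \frac{\partial y_i}{\partial \beta_i} = 1,
    \quad \frac{\partial y_i}{\partial x_j}
      = \frac{\gamma_i}{\sqrt{\sigma^2 + \epsilon}}
        \Big( \delta_{ij} - \frac{1}{N} - \frac{\hat{x}_i \hat{x}_j}{N} \Big)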
I wonder if the 1m token context length is coming for this ride too?
I’m having to handhold it through analysis tasks.
At one point it wrote a python script that took my files it needed to investigate and iterated through them and ran `print(f”{i}. {file}”)` then printed “Ready to investigate files…” And that’s all the script did.
I have no idea what’s going on with those benchmarks if this is real world use.
https://jsbin.com/hiruvubona/edit?html,output
https://claude.ai/share/618abbbf-6a41-45c0-bdc0-28794baa1b6c
Edit: just to show my point, a regular human on a bicycle is way worse with the same model: https://i.imgur.com/flxSJI9.png
It is extremely common, since it's used on every single LLM to bench it.
And there is no logic to it; LLMs are never trained for graphics tasks, and they don't see the output of the code.
I bet their ability to form a pelican results purely from someone already having done it before.
It's called generalization and yes, they do. I bet you could find plenty of examples of it working on something that truly isn't "present in the training data".
It's funny, you're so convinced that it's not possible without direct memorization, but you forgot to account for emergent behaviors (which are frankly all over the place in LLMs - where have you been?).
At any rate, the pelican thing from simonw is clearly just for fun at this point.
But just for thrills I also asked for a "punk rocker"[2] and the result--while not perfect--is leaps and bounds above anything from the last generation.
0 -- ok, here's the first hurdle! It's giving me "something went wrong" when I try to get a share link on any of my artifacts. So for now it'll have to be a "trust me bro" and I'll try to edit this comment soon.
Curious to see that in practice, but great if true!
I understand that they may have not published the results for sonnet 4.5 yet, but I would expect the other models to match...
here's what we think about it:
- still sucks at generating pretty ui
- great for creative writing and long-form planning
- it’s really fast but not smarter than gpt-5
- pairs well with external tools/agents for research and automation
- comes with a 1m token context window, so you can feed it monstrous codebases or giant docs
- still hallucinates or stumbles on complex requests
edit: as far as what the numbers mean, they are arbitrary. They are only useful insofar as you can run two models (or two versions of the same model) on the same benchmark, and compare the numbers. But on an absolute scale the numbers don't mean anything.
but it seems this will come to an end eventually, as these agents become stronger and more capable each day, and will be better and faster than human coders.
Now we are at the assembly line stage of software production with AI. Us craftsmen will have to find other ways to enjoy our crafts.
Craftsmen will remain, but it will be niche (and probably not pay anything unless you are a true master).
Me: "You just burned my house down"
Claude: "You're absolutely right! I burned your house down, I need to revert the previous change and..."
Me: "Now you rebuilt my house with a toilet in the living room"
Claude: "You're absolutely right! I put a toilet in your living room..."
Etc.
There had been a trend where each new model released from OpenAI, Anthropic, etc. felt like a letdown or worse a downgrade.
But the release of 4.5 breaks that trend, and is a pleasant surprise on day one.
Well done! :)
Charting Claude's progress with Sonnet 4.5: https://youtu.be/cu1iRoc1wBo
Why is this? Does Anthropic have just higher infrastructure costs compared to OpenAI/xAI?
Surely there are more pressing issue with LLMs currently...
Regional pricing unfortunately doesn't really make much sense for them unless they're willing to take even larger losses, even if it is a barrier to lower income countries/regions.
The key feature: use aliases instead of hardcoding model IDs. Your code references "summarizer", and a version-controlled lockfile maps it to the actual model. Switch providers by changing the lockfile, not your code.
Also handles streaming, tool calling, and structured output consistently across providers. Plus a human-curated registry (https://llmring.github.io/registry/) that I keep updated with current model capabilities and pricing - helpful when choosing models.
MIT licensed, works standalone. I am using it in several projects, but it's probably not ready to be presented in polite society yet.
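To make the alias idea concrete, the shape of it is roughly this - a simplified sketch rather than the actual library API, with made-up file and field names:

    import json
    from pathlib import Path

    # Version-controlled lockfile mapping role-style aliases to concrete models,
    # e.g. {"summarizer": {"provider": "anthropic", "model": "claude-sonnet-4-5"}}
    LOCKFILE = Path("llm.lock.json")

    def resolve(alias: str) -> dict:
        """Map an alias like 'summarizer' to a provider/model binding from the lockfile."""
        return json.loads(LOCKFILE.read_text())[alias]

    binding = resolve("summarizer")
    print(f"Using {binding['provider']}:{binding['model']} for summarization")

Switching providers then means editing the lockfile, not the call sites.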
just ask Claude to generate a tool that does this, duh! and tell Claude to make the changes to your side project and then to have sex with your wife too since it's doing all the fun parts
> Reading the PDF skill documentation to create the resume PDF
> Here are the files and directories up to 2 levels deep in /mnt/skills/public/pdf, excluding hidden items and node_modules:
I used to use CC, but I switched to Codex (and it was much better) ... now I guess I have to switch back to CC, at least to test it
What I'm not liking is it seems even... lazier... than previously. By which I mean the classic "This is getting complicated so..." (followed by cop-out, dropping the original task and motivation).
There's also a bug where compaction becomes impossible. ("conversation too long" and its advice on how to fix doesn't work)
I have seen this issue with every model so far
Will this checkpoint help with chat memory and disregard the latest chat's info?
I use WSL under Windows, VSCode with the WSL plugin, and Claude-Code installed on Ubuntu 24. It is generally solid and has no issue with this setup.
Anecdata is welcome as it seems like it's the only thing available sometimes.
But we can also see that we're one of the few sane skeptical places in a world that is making the most diverse claims about AI.
opus 4.1: made weird choices, eventually got to a meh solution i just rolled back.
codex: took a disgusting amount of time but the result was vastly superior to opus. night and day superiority. output was still not what i wanted.
sonnet 4.5: not clearly better than opus. categorically worse decision-making than codex. very fast.
Codex was night and day the best. Codex scares me, Claude feels like a useful tool.
Agreed. If these same models were used on a different codebase/language etc. it will likely produce very different results.
You want proof for critical/supportive criticism? Then almost in the same sentence you make an insane claim without backing this up by any evidence.
Nearly every critical reply to my comment bases that criticism on the lack of examples and details I included for my claim which is the very thing I am suggesting we do (i.e. they are, ironically, agreeing with me?). I am sorry I thought that intentional bit of irony would help make the point rather than derail the request.
Here are a few projects that I made these past few months that wouldn't have been possible without LLMs:
* https://github.com/skorokithakis/dracula - A simple blood test viewer.
* https://www.askhuxley.com - A general helper/secretary/agent.
* https://www.writelucid.cc - A business document/spec writing tool I'm working on, it asks you questions one at a time, writes a document, then critiques the idea to help you strengthen it.
* A rotary phone that's a USB headset and closes your meeting when you hang up the phone, complete with the rotary dial actually typing in numbers.
* Made some long-overdue updates on my pastebin, https://www.pastery.net, to improve general functionality.
* https://github.com/skorokithakis/support-email-bot - A customer support bot to answer general questions about my projects to save me time on the easy stuff, works great.
* https://github.com/skorokithakis/justone - A static HTML page for the board game Just One, so you can play with your friends when you're physically together, without needing to bring the game along.
* https://github.com/skorokithakis/dox - A thing to run Dockerized CLI programs as if they weren't Dockerized.
I'm probably forgetting a lot more, but I honestly wouldn't have been bothered to start any of the above if not for LLMs, as I'm too old to code but not too old to make stuff.
EDIT: dang can we please get a bit better Markdown support? At least being able to make lists would be good!
Although I was just commenting on the irony of the parent comment.
hn has no markdown support at all right now. It's just this https://news.ycombinator.com/formatdoc
LLMs are a great rubber duck, plus they can write the document for you at the end.
Great use case for an LLM to make these changes as HN is open source. It’ll also tell us if LLMs can go beyond JS slop.
1 is not infinitely greater than 0.
Or otherwise, can you share what you think the ratio is?
"For any positive "x", is 1 x times greater than 0? Well, 0 times x is lower than 1, and 1 divided by x is larger than 0."
So his productivity increased by more than twice, more than ten times, more than a billion times, more than a googol times, more than Rayo's number. The only mathematically useful way to quantify it is to say his productivity is infinitely larger. Unless you want to settle for "can't be compared", which is less informative.
It is not a ratio, it is a proportion.
Most of the 'vibe-coded' projects that I have seen are worse versions of software that has been tested and stood the test of time.
> I am still surprised at things it cannot do, for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI (I am not an iOS dev).
You made a critical comment yet didn't follow your own rules lol.
> it's so helpful for meaningful conversation!
How so?
FWIW - I too have used LLMs for both coding and personal prompting. I think the general conclusion is that it when it works, it works well but when it fails it can fail miserably and be disastrous. I've come to conclusion because I read people complaining here and through my own experience.
Here's the problem:
- It's not valuable for me to print out my whole prompt sequence (and context for that matter) in a message board. The effort is boundless and the return is minimal.
- LLMs should just work(TM). The fact that they can fail so spectacularly is a glaring issue. These aren't just bugs, they are foundational because LLMs by their nature are probabilistic and not deterministic. Which means providing specific defect criteria has limited value.
Sure. Another article was posted today[1] on the subject. An example claim:
> If we asked the AI to solve a task that was already partially solved, it would just replicate code all over the project. We’d end up with three different card components. Yes, this is where reviews are important, but it’s very tiring to tell the AI for the nth time that we already have a Text component with defined sizes and colors. Adding this information to the guidelines didn’t work BTW.
This is helpful framing. I would say to this: I have also noticed this pattern. I have seen two approaches help. One, I break up UI / backend tasks. At the end of UI tasks, and sometimes before I even look at the code, I say: "Have you reviewed your code against the existing components library <link to doc>?" and sometimes "Have you reviewed the written code compared to existing patterns and can you identify opportunities for abstraction?" (I use plan mode for the latter, and review what it says). The other approach which I have seen others try, but have not myself (but it makes sense), is to automatically do this with a sub agent or hook. At a high level it seems like a good approach given I am manually doing the same thing now.
[1]: https://antropia.studio/blog/to-ai-or-not-to-ai/
Roughly speaking that is how I think through my work, and when I get to the point of actually writing the code having most of the plan (context) in my head, I simply copy that context to the LLM then go to do something else. I only do this if I believe the LLM can do it effectively, so some tasks I do not ask for help at all on (IMHO this is important).
I also have it help with scripts, especially script that munge and summarize data. I know SQL very very well, but find it still a bit faster to prompt the LLM if it has the schema on hand.
Do you find ^ helpful? i.e does that match how you prompt and if not, in what ways does it differ? If it does, in what ways do you get different results and at what step?
Checkmate, aitheists.
This was in stark contrast to my experience with TypeScript/NextJS, Python, and C#. Most of the time output quality for these was at least usefully good. Occasionally you’d get stuck in a tarpit of bullshit/hallucination around anything very new that hadn’t been in the training dataset for the model release you were using.
My take: there simply isn’t the community, thought leadership, and sheer volume of content around Swift that there is around these other languages. This means both lower quantity and lower quality of training data for Swift as compared to these other languages.
And that, unfortunately, plays negatively into the quality of LLM output for app development in Swift.
(Anyone who knows better, feel free to shoot me down.)
Another issue is that Apple developer docs are largely sequestered behind JavaScript that makes them hard for scrapers to parse.
At least, those are the two explanations I’ve seen that seem plausible.
> One is that Swift has changed massively since it came out and huge swathes of examples and articles and such online, that LLMs are trained on, are out of date and thus pollute the training set.
100% jibes with my experience. The amount of times it would generate code using a deprecated API, or some older mechanism, or mix an older idiom with a newer one... well, it was constant really.
And a lot of Googling when I was fixing everything up manually drew me toward this same conclusion: that high quality, up to date information on Swift was in relatively short supply compared to other languages. Couple that with a lower volume of content across all Swift versions and you end up with far from great training data leading to far from great outputs.
> Apple developer docs are largely sequestered behind JavaScript that makes them hard for scrapers to parse.
Yeah, and honestly - even if there's a solution here - the documentation isn't that great either. Certainly not compared with .NET, Ruby, Python, TypeScript, etc.
If I were a vibe coder I'd certainly avoid Swift like the plague.
(Btw, this isn't a knock on Swift itself: as a language I didn't mind it, although I did notice when debugging that the Objective C underpinnings of many APIs are often on display.)
For a bunch of reasons I want to avoid the standard React, Typescript, and Node stack but the sheer velocity that might enable from the LLM side might make it worth it.
Are you saying that your experience with Go has been bad? I would think Go would be as good as any other language (if not better). The language itself is simple, the Go team is very methodical about adding new features so it changes fairly slowly, it has excellent built in CLI based tooling that doesn't require third party packages or applications, and there are plenty of large open source Go codebases to train on. Seems like the perfect language for agentic tools.
Is it number of lines? Tickets closed? PRs opened or merged? Number of happy customers?
Is that helpful?
Or does he now just get to work for 2 hours and enjoy the remaining 6 hours doing meaningful things apart from staring at a screen?
However I don’t have lottery millions, but I do have a job and I would like to be able to do it better.
Have you heard of that study that shows AI actually makes developers less productive, but they think it makes them more productive??
EDIT: sorry all, I was being sarcastic in the above, which isn't ideal. Just annoyed because that "study" was catnip to people who already hated AI, and they (over-) cite it constantly as "evidence" supporting their preexisting bias against AI.
Have you looked into that study? There's a lot wrong with it, and it's been discussed ad nauseam.
Also, what a great catch 22, where we can't trust our own experiences! In fact, I just did a study and my findings are that everyone would be happier if they each sent me $100. What's crazy is that those who thought it wouldn't make them happier, did in fact end up happier, so ignore those naysayers!
have you tried the new xcode extension? that tool is surprisingly good in my limited use. one of the few times xcode has impressed me in my 2 years of use. read some anecdotes that claude in the xcode tool is more accurate than standard claude code for Swift. i haven't noticed that myself but only used the xcode tool twice so far
> What is the idiom for testing the launch screen on the simulator like.. I don't see anything? How do I know if its there.
i.e. in iOS / Swift, I don't even know if I'm using the right terms for the code I am trying to interrogate, or in some cases even what the thing is!
But for stuff like TCA (Swift composable architecture), I basically created a TCA.md file and pasted in a bunch of docs and examples and would reference that.
But for the most part, it was one shotting swiftui screens that were nicer than what I had in my mind.
What are the specific tasks + prompts giving you an 3x increased output, and conversely, what tasks don't work at all?
After an admittedly cursory scan of your blog and the repos in your GH account I don't find anything in this direction.
I don't count the things I'm doing now that I would have avoided or never finished in the past. For those, of course to me personally those are worth much more psychologically than 3x, but who knows if it's an actual boost. I.e. I took a partially scripted task the other day and fully automated it, and also had it output to the CLI in a kind of dorky sci-fi way because it makes it fun to run it. It didn't take long - 30 minutes? But I certainly didn't _gain_ time doing that, just a little more satisfaction. TBH I'm surprised 3x is so controversial, I thought it was a really cool and far more practical assessment than some of these 10x claims I'm seeing.
Also, please stop posting flamebait to HN generally. It's not what this site is for, and destroys what it is for.
https://news.ycombinator.com/newsguidelines.html
Hey, so if I DO see it, can I stop it from happening?
Of the two of you, I know which one I'd bet on being "right". (Hint: It's the one talking about their own experience, not the one supplanting theirs onto someone else)
That's just... not true? There are many, many people who sincerely believe (sometimes from real experience, sometimes because they want it to be true) that AI absolutely cannot do whatever they are trying to do, and so they don't use it.
We birthed a level of cognition out of silicon that nobody would imagine even just four years ago. Sorry, but some brogrammers being worried about making ends meet is making me laugh - it's all the same people who have been automating everyone else's jobs for the past two decades (and getting paid extremely fat salaries for it), and you're telling me now we're all supposed to be worried because it's going to affect our salaries?
Come on. You think everyone who's "vibe coding" doesn't understand the pointlessness of 90% of codemonkey work? Hell, most smart engineers understood that pointlessness years ago. Most coders work on boring CRUD apps and REST APIs to make revenue go up 0.02%. And those that aren't, are probably working on ads.
It's a fraction of a fraction that is at all working on interesting things.
Personally, yeah, I saw it coming and instead of "accepting fate", I created an AI research lab. And I diversified the hell out of my skillset as well - started working way out of my comfort zone. If you want to keep up with changing times, start challenging.
Most of the anti-AI comments I see on HN are NOT a version of "the problem with AI is that it's so good it's going to replace me!"
https://news.ycombinator.com/newsguidelines.html
What "discussion" do you want to have? Another round of "LLMs are terrible at embedded hardware programming ergo they're useless"? Maybe with a dash of "LLMs don't write bug-free software [but I do]" to close it off?
The discussions that are at all advancing the state of the art are happening on forums that accept reality as a matter of fact, without people constantly trying to pretend things because they're worried they'll lose their job if they don't.
Now you may read that as .. checks notes "star trek post scarcity singularity", but forgive me, I didn't.
I can only list my open-source outputs concretely, for obvious reasons, but https://github.com/rubberduckmaths/reddit_terraforming_mars_... was a near one-shot. It's a Reddit bot that posts card text to the Terraforming Mars subreddit when asked, which is helpful for context in discussions of that board game. It's appreciated and used a lot by the community there. There's a similar project I used AI for to scrape the card text that was also a near one-shot. I'd say for these two hobby projects 50x productivity is a reasonable statement.
I wrote Reddit bots ~10 years ago without coding assistance - https://github.com/AReallyGoodName/xwingminibot - so I get to reasonably compare two very similar projects, and I think it's totally fair for me to say 50x for this example. The Reddit API even changed completely in that time, so no one can really say "you used past experience to move faster, it's not the AI giving a 50x boost" - and I really didn't. My memory is not that good anyway, except for the memory of an entire weekend previously vs <30 mins total now, using a bot to one-shot some pretty cool projects.
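For anyone curious what that kind of bot involves, the core loop is only a handful of lines with praw. This is an illustrative sketch rather than the actual bot's code; the subreddit name, trigger syntax, and card data are made up:

    import praw

    # Illustrative sketch only - credentials, subreddit, and trigger syntax are placeholders.
    reddit = praw.Reddit(
        client_id="...",
        client_secret="...",
        username="...",
        password="...",
        user_agent="card-text-bot example sketch",
    )

    cards = {"ants": "Ants: card text goes here"}  # card name -> text, loaded from scraped data

    # Watch the comment stream and reply when a comment is just [[card name]].
    for comment in reddit.subreddit("boardgames_example").stream.comments(skip_existing=True):
        body = comment.body.strip().lower()
        if body.startswith("[[") and body.endswith("]]"):
            card = cards.get(body[2:-2])
            if card:
                comment.reply(card)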
As for the negatives they are never serious. A couple of good examples;
"Please correct all lint errors in this project" only to have @lintignore added to all files. Lol! Obviously i just more clearly specified the prompt and it's not like it's hard to catch these things and not ship to prod. It was funny to everyone i showed and no big deal.
Another similar case, "please make the logging of this file less verbose especially around the tight loop on line X". Instead of changing log level or removing some of the log statements the ai redirected stdout at the initialization of the command line program (would completely break it of course). Again hilarious but also not big deal. Not even much of a waste of time since you just change the prompt and run again and honestly a few silly diversions like this now and then is kind of fun. As in the comments of "OMG AI sometimes gets it wrong" aren't at all serious. I have version control, i review code. No big deal.
I too eye-roll massively at some of the criticisms at this point. It's like people are stretching to claim everyone who's using a coding assistant is a newb who's throwing everything into prod and deleting databases etc. That's just not reality.
Can't show prompts and actual, real work, because, well, it's confidential, and I'd like to get a paycheck instead of a court summons sometime in the next two weeks.
Generally, 'I can't show you the details of my work' isn't a barrier in communicating about tech, because you can generalize and strip out the proprietary bits, but because LLM behavior is incredibly idiosyncratic, by the time you do that, you're no longer accurately communicating the problem that you're having.
https://www.theverge.com/ai-artificial-intelligence/787524/a...
Yeah, maybe it is garbage. But it is still another milestone, if it can do this, then it probably does ok with the smaller things.
This keeps incrementing from "garbage" to "wow this is amazing" at each new level. We're already forgetting that this was unbelievable magic a couple years ago.
That's... not super surprising? SwiftUI changes pretty dang often, and the knowledge cutoff doesn't progress fast enough to cover every use-case.
I use Claude to write GTK interfaces, which is a UI library with a much slower update cadence. LLMs seem to have a pretty easy time working with bog-standard libraries that don't make giant idiomatic changes.
"I tripled my output (I provide no evidence for this claim)"
Never change, HN.
Pervert.
but: https://imgur.com/a/462T4Fu
ok, you think it's a promising field and you want to explore it, fine. Go for it.
Just stop pretending that what these models are currently doing is good enough to replace programmers.
I use LLMs a lot, even for explaining documentation.
I used to use them for writing _some_ code, but I have never ever gotten a code sample over 10 lines that was not in need of heavy modifications to make it work correctly.
Some people are pretending to write hundreds of lines of code with LLMs, even entire applications. All I have to say is "lol".
I have also seen it fail on far simpler tasks.
It varies so much depending on what you are doing, the language etc that generic proclamations "it works!" or "it doesn't work!" are pretty much meaningless.
That aside, you seem to be conflating "it works" with "good enough to replace programmers", but these aren't synonyms.
And on the gripping hand, one way to "make" it work is simply to lower the standards. Which our industry has been doing aplenty for a long time now even before AI, so we shouldn't be surprised when top management drives it to its logical completion.
I haven’t written a function by hand in 18 months.
Not doubting but it helps to contextualize things
It's a startup for finding grants. We have California state, federal, non-profit, and California city/county grants. My landing page absolutely sucks, but if you sign up / upload some papers or make some search cards you'll like the experience.
I'm very excited to try out the new Qwen XL that came out recently for visual design. I could really use some better communication to users of the capabilities of the platform.
A lot of content is off-screen, and can't be swiped back on-screen.
-
The actual app did let me through the flow, but had things like dialogs staying open and covering content (I assume a non-tech person would have missed them)
I think part of the AI gap right now is the UI/UX you're expecting. AI isn't able to meet the bar for UI work for me at all right now. I do use it via Tab completion, or chat for specific scaffolding, but the agentic form is rough
I have some publicly accessible projects there.
Otherwise, I do professional software consulting + a variety of projects as an independent.
In my case I found having it always Ultrathink and to always work with TDD to work well. Also you have to use Plan Mode first and refine the plan. “What clarifying questions do you have for me” prompts me with a handful of numbered questions that are always really beneficial for refining the plan.
But hearing your 10 line constraint gives me a very https://en.wikipedia.org/wiki/Infinite_monkey_theorem vibe to the challenge.
Whether you agree or not, the market has spoken. New grad hiring is WAY down. Fresh CS grads are having a hell of a time finding work compared to 2 years ago.