Surprisingly, I found out that layout was the trickiest thing, as newspaper articles often had multiple layers of headers, spanned multiple columns, etc.
Yes, I agree the layouts are the trickiest part. I tried a few options and ended up using some of the PaddlePaddle models for document layout analysis, orientation, and the like. They give bounding boxes and a predicted reading order, but the reading order isn't great on complex layouts even with the most recent SOTA models, or even on simple layouts when there are mastheads, images, or other artifacts to work around. It's still valuable information, though: combined with heuristics, it can be stitched into a more accurate reading order as the starting point of a pipeline.
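To make the "bounding boxes plus heuristics" idea concrete, here is a minimal sketch of one such heuristic, not the actual pipeline described above. It assumes each layout region arrives as a labeled box in page coordinates (origin at the top left). The names `Box`, `reading_order`, and `order_columns`, and the `wide_frac` threshold, are hypothetical and are not part of any PaddlePaddle API. Wide regions such as mastheads split the page into horizontal bands, and within each band the columns are read left to right, top to bottom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A layout region in page coordinates (assumed origin: top left)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def order_columns(band):
    """Group boxes into columns by horizontal overlap, then read each column top to bottom."""
    columns = []
    for box in sorted(band, key=lambda b: b.x0):
        for col in columns:
            ref = col[0]
            overlap = min(box.x1, ref.x1) - max(box.x0, ref.x0)
            narrower = min(box.x1 - box.x0, ref.x1 - ref.x0)
            # Same column if the boxes share more than half of the narrower width.
            if overlap > 0.5 * narrower:
                col.append(box)
                break
        else:
            columns.append([box])
    columns.sort(key=lambda col: min(b.x0 for b in col))
    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda b: b.y0))
    return ordered


def reading_order(boxes, page_width, wide_frac=0.6):
    """Order boxes band by band; wide boxes (e.g. mastheads) act as band separators."""
    def is_wide(b):
        return (b.x1 - b.x0) >= wide_frac * page_width

    wide = sorted((b for b in boxes if is_wide(b)), key=lambda b: b.y0)
    remaining = [b for b in boxes if not is_wide(b)]
    ordered = []
    cuts = [w.y0 for w in wide] + [float("inf")]
    for i, cut in enumerate(cuts):
        band = [b for b in remaining if b.y0 < cut]
        remaining = [b for b in remaining if b.y0 >= cut]
        ordered.extend(order_columns(band))
        if i < len(wide):
            ordered.append(wide[i])
    return ordered


page = [
    Box("col2-bottom", 520, 510, 1000, 800),
    Box("col1-top", 0, 110, 480, 400),
    Box("masthead", 0, 0, 1000, 100),
    Box("col2-top", 520, 110, 1000, 500),
    Box("col1-bottom", 0, 410, 480, 800),
]
print([b.label for b in reading_order(page, page_width=1000)])
# ['masthead', 'col1-top', 'col1-bottom', 'col2-top', 'col2-bottom']
```

In practice a model's predicted order could serve as a tie-breaker inside each column, and articles that span several columns under a shared headline would need extra rules beyond this sketch.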
Great! I was thinking about PP, but because I only ran an order of magnitude fewer articles (under 1 million pages, by piggybacking on Dell's OCR), I relied on Arcanum ( https://www.arcanum.com/en/newspaper-segmentation/about/ ), which was cheap enough (though I think not cheap enough at your scale).
Hmm, I just tried to upload the JPGs of some of today's samples to Arcanum via https://www.arcanum.com/en/newspaper-segmentation/try-it/ and it didn't work. I'll try again later, but from a cursory look it seems it wouldn't return the info I'd need to correct the output if I didn't like it, and I'd still have to stitch the individual pages back together myself?
Hard to have any idea of what happened due to the fog of war. There's some grainy footage (posted by Trump) of someone running past the Secret Service checkpoint, where you can see some shots fired by the Secret Service. Then there's Blitzer, who says this person might have had multiple large guns, etc.
I've sworn off Musk-related products, and this will probably make Cursor worse (a switch to X's LLM, for instance). So, any suggestions for switching? Codex? Claude Code? (I like my IDE and I like the freedom to choose a model, which is why I stuck with Cursor even when it felt more expensive.)
Everyone talks badly about Cursor, and it is kind of a piece of junk, but no, there's nothing else that has these features: being able to see agent diffs in an editor, seeing diffs inline in chat, being able to click them to jump to the code, and being able to click old chat messages to edit/fork them.
Those are basically my only requirements, and it feels like I've tried everything and every alternative has only one of those features. Zed is the closest: it technically has those features, they're just buggy and have provider-specific quirks.
So I'm stuck on Cursor until Anthropic invents IDE technology, or at least VS Code wrapper technology.
JetBrains IDEs have AI support with all the things you've described, in a more polished experience that requires significantly less maintenance and tuning. They do that while offering an actual IDE experience that works well for supported languages/projects out of the box, without the need to constantly tune plugins or put up with the janky, misaligned UX that seems to be the norm for VS Code and its derivatives.
I have no association with JetBrains, and despite having a license, I don't even use their AI support much myself (I mostly use CC, with IDE integration for diff viewing). But if you haven't tried it recently, it's probably worth a revisit if you're open to JetBrains products.
I hope their models improve. I used Junie when it first came out and it was okay but unreliable. I use Cursor with composer right now and I never have any issues. I sure do miss using PyCharm though.
I really doubt they'll swap in Grok. Grok seems pretty dead. Probably more likely they'll reuse the hardware for composer.
If value is a concern, Codex. It's pretty hard to beat those subsidies. If you really want model freedom, Copilot is surprisingly decent value and, as of right now, lets you use your sub in other harnesses like OpenCode.
Codex is not a replacement for an IDE. Yes I still need an IDE.
When coding agents work they're great. When they don't I still need the IDE. They usually don't work that great when I'm working on something novel or brownfield. Which happens quite regularly.
But I definitely still want AI autocomplete. I'm not a Vim user. Coding isn't about typing for me, it's about solving problems. So a tool that does lots of the typing for me is a godsend.
So do I go for VS Code + Copilot? Because it was bad when I tried it again for a few days in November. Slow to respond and gave poor results. Cursor is snappy and gives useful results most of the time.
Same, I'll be switching off Cursor today to either Claude Code or Kiro. Luckily my company lets us choose which agentic software we want to use. I won't touch anything Musk is related to; he is toxic, and anything he touches turns toxic and ends up supporting him.
Claude is much faster and better at reading papers than Codex (some of this is nested skill dispatch), but they both work incredibly well for this. Compile your set of papers, queue it up, hit /ingest-collection, go to sleep, and come back to a remarkable knowledge base :)
I've just added a simple example on the GitHub page of how the resulting difference file looks in an HTML browser. I hope this helps give a better idea.
It's noblesse oblige, or rather an example of the end of noblesse oblige, that the super rich don't even have to pretend to do things for others any more. Which, I would suggest, is a short-sighted and ultimately hubristically stupid change...
I've also worked with this data, but only for research purposes:
https://www.finhist.com/bank-runs/episodes/13895.html https://www.finhist.com/bank-runs/index.html
Surprisingly, I found out that layout was the trickiest thing, as newspaper articles often had multiple layers of headers, spanned multiple columns, etc.
Do you have a preferred solution on that?