LLM scraping
Using an LLM for web scraping, where you feed the page content to the model and prompt it to return the data, isn't reliable or affordable. For one-off scraping it can work fine, of course, but as soon as you're scraping hundreds or thousands of pages you're far better off writing a Puppeteer or bs4 script. Selectors don't change that often, and a script is infinitely more scalable and cheaper than running thousands of tokens through an LLM for every single page. Takes me < 1 hour to write a script for nearly any source.
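For flavor, here's a minimal sketch of that kind of script in bs4; the URL and the table selector are made up, but the shape is the same for most sources:

```python
# Minimal bs4 scraper sketch. The URL and CSS selector below are
# hypothetical placeholders; swap in whatever the target page uses.
import requests
from bs4 import BeautifulSoup

def scrape_listings(url: str) -> list[list[str]]:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Hard-coded selector: brittle in theory, but selectors rarely change.
    for row in soup.select("table.listings tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            rows.append(cells)
    return rows

if __name__ == "__main__":
    for row in scrape_listings("https://example.com/rentals"):
        print(row)
```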
There's only one company I know of that has discovered the real alpha: using an LLM to guess the correct selectors and then plugging them into cookie-cutter scripts. Something like "what's the XPath of the rental listings table on this page?" and then pulling that into a bs4 script.
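A rough sketch of that pattern, with the model call stubbed out as a hypothetical `ask_llm()` since the specific client doesn't matter; the point is the expensive call runs once, and a cheap lxml script does all the repeated work:

```python
# Selector-guessing sketch: the LLM finds the XPath once, then the
# script reuses it for free. ask_llm() is a placeholder, not a real API.
import requests
from lxml import html as lxml_html

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def discover_xpath(page_html: str) -> str:
    # One-time, token-expensive call; cache the result per source.
    return ask_llm(
        "What is the XPath of the rental listings table on this page? "
        "Reply with the XPath only.\n\n" + page_html
    ).strip()

def scrape_with_xpath(url: str, table_xpath: str):
    tree = lxml_html.fromstring(requests.get(url, timeout=30).text)
    for row in tree.xpath(table_xpath + "//tr"):
        yield [cell.text_content().strip() for cell in row.xpath(".//td")]
```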
Yeah, and unironically the new Gemini model can generate reliable scripts even from very long HTML pages (1 million+ tokens, once you get rid of script tags and some other useless tags). Then just use the generated script, not the model.
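The stripping step could look something like this; the exact tag list is a guess at "useless tags" and worth tuning per source:

```python
# Sketch of the pre-processing step: remove <script>, <style>, and
# similar noise so a long page fits in the model's context window.
# The tag list here is an assumption; adjust it for the pages you scrape.
from bs4 import BeautifulSoup

def shrink_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "noscript", "svg", "iframe"]):
        tag.decompose()  # delete the tag and everything inside it
    return str(soup)
```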