Basic Tools
These tools do not use a real browser, so they’re perfect for scraping standards-based websites, but you may hit a wall when trying to scrape sites that abuse scripting (Angular or React client-only sites, other single-page applications…).
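To make the "wall" concrete, here is a rough sketch of what a browserless tool sees. The `extractLinks` helper and both HTML strings are made up for illustration, and the regex is a deliberately crude stand-in for a real HTML parser.

```javascript
// Crude link extraction from raw HTML (illustrative only; a real scraper
// would use a proper HTML parser rather than a regex).
function extractLinks(html) {
  const pattern = /<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/gi;
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

// Hypothetical server responses: a classic page and a client-only app shell.
const staticPage =
  '<ul><li><a href="/post/1">One</a></li><li><a href="/post/2">Two</a></li></ul>';
const spaShell = '<div id="root"></div><script src="/bundle.js"></script>';

console.log(extractLinks(staticPage)); // the links are in the served HTML
console.log(extractLinks(spaShell)); // nothing to find until the scripts run
```

The second call comes back empty because the links only exist after the page's scripts have run, which is exactly what the tools in this category never do.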
Less Basic Tools
These tools actually run in your own browser.
- WebScraper and its Chrome extension
Ninja Tools
These tools let you run a scripted browser in the background, or even on a remote server or server farm. If you’re doing serious web-stealing/fuck-your-content work, then you should go here.
- Jupyter
- Selenium (Python)
- CouchDB
Recipes
Using Chrome Developer Tools to autoload an infinitely scrollable list
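// Paste into the DevTools console. Requires jQuery on the page ($ must be jQuery).
function scrollDown() {
  // Jump to the bottom of the page to trigger loading of the next batch.
  window.scrollTo(0, document.body.scrollHeight);
}
// ajaxSuccess fires after each successful Ajax request made through jQuery.
$(document).ajaxSuccess(function () {
  scrollDown();
});
scrollDown(); // kick off the first load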
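The recipe above relies on jQuery's `ajaxSuccess` event, which only fires for requests made through jQuery. For pages without jQuery, one possible variant is to keep scrolling until the page height stops growing. This is only a sketch under assumptions: `autoScroll`, `browserPage`, `pauseMs` and `maxRounds` are names invented here, and it assumes that new items make `document.body` taller within the pause.

```javascript
// Scroll repeatedly until the page height stops changing or maxRounds is hit.
// `page` is any object exposing getHeight() and scrollTo(y) (hypothetical interface).
async function autoScroll(page, { pauseMs = 1000, maxRounds = 50 } = {}) {
  let previous = -1;
  let rounds = 0;
  while (rounds < maxRounds) {
    const height = page.getHeight();
    if (height === previous) break; // nothing new was loaded: stop
    previous = height;
    page.scrollTo(height);
    // Give the page some time to fetch and render the next batch.
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
    rounds += 1;
  }
  return rounds; // number of scrolls performed
}

// Adapter for the DevTools console (assumes the list grows document.body).
const browserPage = {
  getHeight: () => document.body.scrollHeight,
  scrollTo: (y) => window.scrollTo(0, y),
};

// In the console:
// autoScroll(browserPage).then((n) => console.log(n, 'scrolls'));
```

`maxRounds` is there so a truly endless feed cannot keep the loop running forever, and `pauseMs` probably needs tuning per site: too short and the loop stops before the next batch has arrived.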
Misc
TODO:
- explore ZeroMQ to distribute work? Dask?