There were not many talks on this subject, so let's try to put together a small summary of the state of the art. There are lots of opinions here; feel free to disagree.



The running joke of the conference has been «How many of you have heard of/used Jupyter Notebooks?». One speaker, seeing the audience smile, even added: «How many of you have been asked this multiple times already at this conference?».

Obviously, Jupyter has a lot of momentum, for good reasons. And I have to include it here, because it's very convenient for sketching out scraping tasks.


Selenium is a great tool to script browsing sessions, and with the increasing number of JavaScript-heavy websites, it is becoming more or less the de facto tool for Python scraping. One could also use things like PhantomJS/CasperJS, but those are primarily scripted in JavaScript rather than Python. Which may be perfectly fine, of course.
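
To make this a bit more concrete, here is a minimal sketch of a scripted browsing session. It is my own illustration, not something shown in a talk: the helper names (`collect_texts`, `run_live_demo`), the URL and the CSS selector are all assumptions, and the live part assumes Selenium 4 with Firefox and geckodriver available.

```python
# Same value as selenium.webdriver.common.by.By.CSS_SELECTOR; spelled out so
# the helper below has no import-time dependency on Selenium.
CSS_SELECTOR = "css selector"


def collect_texts(driver, url, css):
    """Load `url` in the given WebDriver and return the stripped, non-empty
    texts of every element matching the CSS selector `css`."""
    driver.get(url)
    elements = driver.find_elements(CSS_SELECTOR, css)
    texts = (element.text.strip() for element in elements)
    return [text for text in texts if text]


def run_live_demo(url="https://example.com", css="h1"):
    """Hypothetical live run; needs `pip install selenium` and Firefox."""
    from selenium import webdriver  # imported lazily, only needed here

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")  # no visible browser window
    driver = webdriver.Firefox(options=options)
    try:
        # Let element lookups wait up to 10 s, so that content rendered by
        # JavaScript has a chance to appear before we give up.
        driver.implicitly_wait(10)
        return collect_texts(driver, url, css)
    finally:
        driver.quit()  # always release the browser process
```

Calling `run_live_demo()` starts a headless Firefox, loads the page and returns the matching texts. Keeping `collect_texts` independent of how the driver was created makes it easy to reuse with another browser.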


It was argued that when you get serious about developing crawlers/scrapers, you want a client-server architecture: the server is in charge of running Selenium and handling one-off tasks like authentication, while the client(s) just send commands to it, probably communicating through a simple networking library like ZeroMQ.
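
As a rough idea of what such a setup could look like, here is a small sketch using pyzmq's request/reply sockets. Everything specific in it is an assumption of mine rather than what the speakers presented: the command format (`{"action": ...}`), the function names, and the endpoint `tcp://127.0.0.1:5555`. The `browser` argument is expected to be a Selenium WebDriver that the server created and, for instance, already logged in with.

```python
DEFAULT_ENDPOINT = "tcp://127.0.0.1:5555"  # hypothetical address


def handle_command(browser, command):
    """Run one command (a dict) against the browser, return a reply dict."""
    action = command.get("action")
    if action == "get":
        browser.get(command["url"])
        return {"ok": True, "title": browser.title}
    if action == "source":
        return {"ok": True, "html": browser.page_source}
    return {"ok": False, "error": f"unknown action: {action!r}"}


def serve(browser, endpoint=DEFAULT_ENDPOINT):
    """Server side: owns the browser and answers commands forever."""
    import zmq  # pyzmq, imported lazily so the dispatcher works without it

    socket = zmq.Context.instance().socket(zmq.REP)
    socket.bind(endpoint)
    while True:
        command = socket.recv_json()
        try:
            reply = handle_command(browser, command)
        except Exception as exc:
            # A REP socket must answer every request, even a failed one.
            reply = {"ok": False, "error": str(exc)}
        socket.send_json(reply)


def send_command(command, endpoint=DEFAULT_ENDPOINT):
    """Client side: send one command and wait for the server's reply."""
    import zmq

    socket = zmq.Context.instance().socket(zmq.REQ)
    socket.connect(endpoint)
    try:
        socket.send_json(command)
        return socket.recv_json()
    finally:
        socket.close()
```

A client would then call something like `send_command({"action": "get", "url": "https://example.com"})` without ever touching Selenium itself. Keeping `handle_command` separate from the socket loop means the command logic can be exercised without a network or a real browser.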


Scrapy has been around for a long time, and I would not use it for any new scraping project. It's browser-less, probably not ready to transition to asynchronous patterns, and needs a lot of framework-specific learning.

