EuroPython: Web scraping

There were not a lot of talks on this subject, so let me try to make a small summary of the state of the art. Lots of opinions here, feel free to disagree.

Jupyter

The running joke of the conference has been «How many of you have heard of/used Jupyter notebooks?». One speaker, seeing the audience smile, even added: «How many of you have been asked this multiple times already at this conference?».

Obviously, Jupyter has a lot of momentum, for good reasons. And I have to include it here, because it's very convenient for sketching out quick scraping tasks.
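As a taste of that workflow, here is a minimal sketch of the kind of one-off scrape that fits in a single notebook cell. The URL is a placeholder and requests/BeautifulSoup are assumed to be installed; none of this comes from the talks themselves.

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page and parse it; the URL is purely illustrative.
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Print every link target found on the page.
for link in soup.find_all("a"):
    print(link.get("href"))
```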

Selenium

Selenium is a great tool for scripting browsing sessions, and with the increasing number of JavaScript-heavy websites, it has become more or less the de facto tool for Python scraping. One could also use tools like PhantomJS/CasperJS, but then you won't be able to control them from Python. Which may be perfectly fine of course.
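To make this concrete, here is a minimal sketch of a scripted browsing session using Selenium's Python bindings. The URL, form field names, and credentials are illustrative placeholders (not from any talk), and a local Firefox/geckodriver setup is assumed.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # assumes geckodriver is on the PATH
try:
    # Navigate to a (hypothetical) login page and fill in the form.
    driver.get("https://example.com/login")
    driver.find_element(By.NAME, "username").send_keys("alice")
    driver.find_element(By.NAME, "password").send_keys("secret")
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

    # Because a real browser rendered the page, JavaScript-generated
    # content is present in page_source, unlike with a plain HTTP fetch.
    print(driver.page_source[:500])
finally:
    driver.quit()
```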

Client-server

It was argued that when you become serious about crawler/scraper development, you want a client-server architecture: the server is in charge of running Selenium and handling one-off tasks like authentication, while the client(s) just send commands to it, probably communicating through a simple networking library like ZeroMQ. A sketch of that split follows.
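Here is a minimal sketch of that architecture using pyzmq's REQ/REP sockets. The command format, function names, and port are all my own assumptions, and the part that actually drives Selenium is left as a stub.

```python
import zmq

def server():
    """Runs next to Selenium, replying to one command at a time."""
    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind("tcp://*:5555")
    while True:
        command = socket.recv_string()  # e.g. "fetch https://example.com" (made-up format)
        # ... drive the shared, already-authenticated Selenium session here ...
        socket.send_string("done: " + command)

def client(command):
    """Sends a single command to the scraping server and waits for the reply."""
    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect("tcp://localhost:5555")
    socket.send_string(command)
    return socket.recv_string()
```

The appeal of this split is that the expensive, stateful part (the authenticated browser session) lives in exactly one place, while the clients stay trivially simple.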

Scrapy

Scrapy has been around for a long time, and I would not use it for any new scraping project. It's browser-less, probably not ready for the transition to asynchronous patterns, and requires a lot of framework-specific learning.
