EuroPython: Data processing

Tadaaa-Tadaaa (Dataaa Dataaa?)

One of the keynotes, by Gaël Varoquaux, was called "How Python became the language of data". And while Python is probably not the only language that's good with data, the fact that it is widely used for this purpose by people with very different educational and professional backgrounds, like engineers and (hard-core) scientists, means the libraries available in this domain are of very good quality.

Most of the best tools available are summarized on PyData.

Managing data processing tasks at scale

Airflow (from Airbnb, now an Apache Incubator project)

Luigi (from Spotify, with built-in HDFS support; see the sketch below)
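
To give a feel for the style, here is a minimal sketch of a Luigi pipeline. The task names, file paths, and logic are hypothetical examples of mine, not from any talk; the point is that the second task declares the first as a dependency and Luigi's scheduler works out the execution order:

    import luigi

    class ExtractWords(luigi.Task):
        """Hypothetical task: split a raw text file into one word per line."""
        input_path = luigi.Parameter(default="raw.txt")

        def output(self):
            return luigi.LocalTarget("words.txt")

        def run(self):
            with open(self.input_path) as src, self.output().open("w") as dst:
                for line in src:
                    for word in line.split():
                        dst.write(word + "\n")

    class CountWords(luigi.Task):
        """Hypothetical task: count the words produced by ExtractWords."""
        def requires(self):
            return ExtractWords()

        def output(self):
            return luigi.LocalTarget("count.txt")

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(str(sum(1 for _ in src)))

    if __name__ == "__main__":
        # local_scheduler avoids needing a running luigid daemon
        luigi.build([CountWords()], local_scheduler=True)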

Natural Language Processing

An amazing talk by Katharine Jarmul covered the state of the art in natural language processing. You can find her slides on her website: I hate you NLP.

Machine Learning for Dummies

What I know about machine learning is between nothing and not much:

  • Start-ups love to include it in their pitches, even if what they're actually doing is a few regex matches.
  • Neural networks helped AlphaGo learn how to beat Lee Sedol, one of the (if not the) best Go players in the world.
  • ML is mostly about transforming some vague, human-friendly thing into an n-dimensional vector in a given n-dimensional space, so that similarity becomes easy to compute as the geometric distance between two vectors (see the sketch after this list).
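
As an illustration of that last point, here is a toy sketch with NumPy. The three-dimensional "embeddings" below are made-up numbers of my own, but they show how similarity falls out of simple vector math:

    import numpy as np

    # Made-up 3-dimensional "embeddings"; real models learn vectors
    # with hundreds of dimensions from data.
    vectors = {
        "dog": np.array([0.9, 0.1, 0.3]),
        "puppy": np.array([0.8, 0.2, 0.35]),
        "car": np.array([0.1, 0.9, 0.7]),
    }

    def euclidean_distance(a, b):
        """Plain geometric distance: smaller means more similar."""
        return float(np.linalg.norm(a - b))

    print(euclidean_distance(vectors["dog"], vectors["puppy"]))  # small
    print(euclidean_distance(vectors["dog"], vectors["car"]))    # large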

I added a few bits to my understanding of the concept during the conference talks:

  • Tensors are an n-dimensional generalization of matrices. But now that I'm reading the Wikipedia page, I'm lost again.
  • Neural networks are networks of "neurons" that transform their input by applying a y = softmax(weight * x + bias) transformation (sketched in code after this list).
  • Machine learning is an iterative process that adjusts the weights and biases of each neuron on every iteration, trying to get a network that is actually able to transform some abstract matrix (like an image, a document, a word, a Go board...) into a computer-intelligible concept (like "dog", "happy", "sad", "good move", "bad move").
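
To make that formula concrete, here is a toy single-layer sketch in NumPy. The sizes, weights, and input are arbitrary values of my own; training would be the process of adjusting weight and bias:

    import numpy as np

    def softmax(z):
        """Turn raw scores into probabilities that sum to 1."""
        e = np.exp(z - z.max())  # subtract the max for numerical stability
        return e / e.sum()

    rng = np.random.default_rng(0)
    weight = rng.normal(size=(3, 4))  # 3 neurons over a 4-dimensional input
    bias = np.zeros(3)

    x = np.array([1.0, 0.5, -0.2, 0.3])  # some input vector
    y = softmax(weight @ x + bias)       # y = softmax(weight * x + bias)
    print(y, y.sum())                    # three probabilities summing to 1.0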

Sorry if I'm taking shortcuts or showing severe gaps here; please feel free to comment and tell me how incorrect I am.

TensorFlow is a Python library that makes it "easy" to create neural network representations in Python, and then to execute them either locally or on a distributed cluster of machines. Ian Lewis, from the Google Cloud Platform team, gave a great talk about it.
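
Here is what the same softmax layer looks like in TensorFlow, as a minimal sketch in the TensorFlow 1.x graph style (shapes and inputs are arbitrary examples of mine): you first describe the computation, then execute it in a session.

    import tensorflow as tf  # TensorFlow 1.x graph-style API

    # Describe the computation: y = softmax(x * weight + bias)
    x = tf.placeholder(tf.float32, shape=[None, 4])
    weight = tf.Variable(tf.random_normal([4, 3]))
    bias = tf.Variable(tf.zeros([3]))
    y = tf.nn.softmax(tf.matmul(x, weight) + bias)

    # Nothing has been computed yet; a session executes the graph,
    # locally here, but the same graph can run on a cluster.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(y, feed_dict={x: [[1.0, 0.5, -0.2, 0.3]]}))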

Data visualization

Although I haven't found the time to play with the tool yet, we had a great presentation of Bokeh, an amazing data visualization library with a lot of really useful features. It lets you describe graphs in pure Python, either in a notebook or in a regular script, and can display them either in a client-server fashion or by writing plain static HTML files (think reporting).
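
For instance, the static-HTML mode boils down to something like this minimal sketch (the data and file name are placeholders of mine):

    from bokeh.plotting import figure, output_file, show

    # Target a standalone HTML file instead of a Bokeh server session.
    output_file("report.html")

    xs = list(range(10))
    p = figure(title="A minimal Bokeh line plot")
    p.line(xs, [x * x for x in xs])

    show(p)  # writes report.html and opens it in the browser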

Definitely worth checking out.

Misc

As always, the use of Jupyter notebooks is highly recommended when creating your models.

Also, I found the Cookiecutter Data Science template to be a great starting point for the directory structure of data processing projects.



Photo credits: jannekestaaks cc-by-nc
