Data science road blocks

Abhijeet Kamble
Jun 4, 2019 · 7 min read

Hello Internet! This is my third blog post, and sadly only the first good one, because like many data scientists, I’m much better with numbers than words.

I am a student at the Flatiron School and I’ve been tinkering with Big Data for the last couple of months. The results have been fascinating, but certain aspects of the field make the analysis numbingly dull. I found a few ways around this and thought I would start a blog to help fellow data scientists experiencing similarly mundane issues. So, in the hope of making all of our lives as data scientists a little less tedious, take what you will from this post.

I did a data science project last weekend where I attempted to solve a classic classification problem. However, like many data scientists, the largest roadblocks I faced were Exploratory Data Analysis (EDA), Feature Engineering, and the time it took to see results because of the computational complexity of the advanced machine learning models (more on that later).

I feel a lot of budding data scientists face the same issues in their projects, so I want to address them here.

EDA (Exploratory Data Analysis)

A lot of EDA is done manually (bleh!): looking for missing values, checking the correlation between columns, checking whether the data is normally distributed, unique value counts, the averages trio (mean, median, mode), the maxima and minima, kurtosis, etc. (oh, my eyes!)

Comparing all these values is time-consuming and cumbersome! Bleh!

I found a library in Python called pandas_profiling. Here is an example of how it works with one of the most common datasets on Kaggle, the Titanic dataset. It is just a one-liner: “ pandas_profiling.ProfileReport(df) ”.
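Here is a minimal sketch of the whole thing in a notebook, assuming the Titanic CSV has already been downloaded from Kaggle and saved as titanic.csv (that file name is my assumption, not part of the original project):

```python
import pandas as pd
import pandas_profiling

# Load the Kaggle Titanic data (the file name/path here is an assumption).
df = pd.read_csv("titanic.csv")

# One line builds a full EDA report: missing values, correlations, distributions,
# unique value counts, descriptive statistics, and more.
report = pandas_profiling.ProfileReport(df)

# In a Jupyter cell, `report` renders inline; you can also export it to HTML.
report.to_file("titanic_profile.html")
```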

[Image: the pandas_profiling report generated for the Titanic dataset]

This is what the pandas_profiling report does for your initial EDA; you can, of course, dive deeper from there.

In my opinion, it is better than the inbuilt .describe() function of pandas.

FEATURE ENGINEERING

In the context of data science, a feature can be described as a characteristic, or a set of characteristics, that explains the occurrence of a phenomenon. When these characteristics are converted into some numerical, measurable form, they are called features.

For example, the different columns of the Titanic data, such as Age, can each be a different feature.

Feature Engineering can be simply defined as the process of creating new features from the existing features in a dataset.

Example: we could combine Fare with Class and make a new feature out of them. This process of creating new, relevant features to get a better estimate is called Feature Engineering.
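As a rough illustration (these are not the exact features from my project), here is what a couple of hand-made features might look like in pandas; the column names follow the Kaggle Titanic dataset, and the new feature names are just placeholders I made up:

```python
import pandas as pd

# Kaggle Titanic columns include Pclass, Fare, SibSp, Parch, Age, ...
df = pd.read_csv("titanic.csv")  # file name/path is an assumption

# Combine Fare with Class: fare paid relative to the passenger's class.
df["Fare_per_Class"] = df["Fare"] / df["Pclass"]

# Another common hand-made feature: family size on board.
df["Family_Size"] = df["SibSp"] + df["Parch"] + 1
```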

The performance of a predictive model is heavily dependent on the quality of the features in the model; if a feature conveys more information about the target variable in an efficient way, performance will definitely rise.

This is a juggernaut of a task for a data scientist (again, bleh!), as it involves a lot of manual labour and brainstorming in the form of tuning features, creating new ones, examining their relevance, and coming to a collective decision. Trust me, it’s easier said than done! And I cringed through it all.

I spent the vast majority of my time struggling to determine which features to use and which to eliminate. The rest of my peers in the program are in the same struggle boat. One stated that “Feature Engineering is truly an art!” and I cannot help but agree with them.

In the quest to get past these tedious tasks, I’ve come across this dope Python library called featuretools!


Featuretools could dramatically automate this arduous task of FE. But here’s the kicker…

Despite being a valuable time saver, the tool only gets you through half of the process; that is, it can only automate feature engineering to a certain degree, freeing up time so that we can focus on the important parts of model building. That freed-up time is great, but not enough.

Before taking Featuretools for a spin, there are three major components of the package that we should be aware of:

  • Entities
  • Deep Feature Synthesis (DFS)
  • Feature primitives

a) An Entity can be considered a representation of a Pandas DataFrame. A collection of multiple entities is called an EntitySet.

b) Deep Feature Synthesis (DFS) has got nothing to do with deep learning. Don’t worry. DFS is actually a Feature Engineering method and is the backbone of Featuretools. It enables the creation of new features from single, as well as multiple data frames.

c) DFS creates features by applying Feature primitives to the entity relationships in an EntitySet. These primitives are the methods often used to generate features manually. For example, the primitive “mean” would find the mean of a variable at an aggregated level.

Installation: run “ !pip install featuretools ” in a Jupyter notebook cell, and then “ import featuretools as ft ”.

Step I: Create an EntitySet

[Image: code snippet and output for creating the EntitySet]
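Since the original screenshot isn’t reproduced here, this is a rough sketch of what Step I might look like, using the featuretools 0.x API that was current when this post was written; the entity and EntitySet names are just placeholders:

```python
import featuretools as ft
import pandas as pd

df = pd.read_csv("titanic.csv")  # file name/path is an assumption

# An EntitySet is a collection of entities (dataframes) plus the relationships between them.
es = ft.EntitySet(id="titanic_data")

# Register the passengers dataframe as an entity, keyed by its unique PassengerId column.
es = es.entity_from_dataframe(
    entity_id="passengers",
    dataframe=df,
    index="PassengerId",
)

print(es)  # lists the entities (and any relationships) in the set
```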

Step II: Run Deep Feature Synthesis (DFS)

[Image: code and output for DFS]
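Sketched under the same assumptions, Step II boils down to a single call to ft.dfs, pointed at the entity we want features for:

```python
# Deep Feature Synthesis: automatically stack feature primitives over the EntitySet.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="passengers",  # the entity we want one row of features per
    max_depth=2,                 # how many primitives DFS is allowed to stack
    verbose=True,
)
```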

Step III: Inspect the DFS feature matrix

[Image: the various features generated by DFS]
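And Step III is just looking at what DFS produced:

```python
# The feature definitions describe how each generated feature was built.
print(feature_defs[:10])

# The feature matrix has one row per passenger and one column per generated feature.
print(feature_matrix.shape)
feature_matrix.head()
```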

The featuretools package is truly a game-changer in data science. While its applications are understandably still limited for industry use cases, the amount of time it saves, and the usefulness of the features it generates, has won me over.

Computational Complexity

Before entering this data-driven Bootcamp, the staff suggested that I use a MacBook. At first, I was like “why do I need one? I have a top-of-the-line Windows device with killer specs.”

A week into the boot camp, I realized how wrong I was. Not only is a MacBook much more reliable (Rest in Power, Steve Jobs), it was also computationally faster than my killer, state-of-the-art Windows device! (Points to 🍎 once again!)

So, I ended up buying one, and I was really happy with the new machine, which had a 4-core processor, 16 gigs of RAM, and whatnot! Not groundbreaking specs, of course, but still better than most laptops out there.

However, as soon as I got to complex machine learning algorithms like SVMs, Random Forests, etc., my laptop took 2–3 hours to run a grid search over these models for parameter tuning. During that time, as my models ran their course, I could do nothing but sit and stare.
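For context, here is the kind of grid search I mean, sketched with scikit-learn; X and y are stand-ins for the prepared features and target, not code from my actual project:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A modest grid like this already means 3 * 3 * 3 * 5 = 135 model fits with 5-fold CV.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    n_jobs=-1,  # use every available core, and it still takes a while
)
grid.fit(X, y)  # X, y: the prepared features and target

print(grid.best_params_, grid.best_score_)
```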

I had little to no clue how much computing power an extensive grid search demands, or how much many modern models lean on a GPU, something my Mac barely has.

Google solved this issue for all of us by providing Google Colab. Thank you, Google! A real VIP!

Google Colab is a free cloud service and now it supports free GPU! You can:

  • Improve your Python coding skills.
  • Develop deep learning applications using popular libraries such as Keras, TensorFlow, PyTorch, and OpenCV.
  • Create notebooks in Colab, upload them, store them, and share them.
  • Mount your Google Drive and use whatever you’ve got stored in there, import most of your favourite directories, and upload your personal Jupyter notebooks (mounting Drive is shown in the sketch right after this list).
  • Upload notebooks directly from GitHub, pull in Kaggle files, and download your notebooks.
  • Tweak and run your Python code. It is quite user-friendly with Jupyter notebooks: if you are familiar with Jupyter you will adapt easily, and it is not a bad place to start for anyone who doesn’t know Jupyter yet.
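Mounting Drive, for example, is two lines in a Colab cell, using Colab’s own google.colab helper (the mount point below is the standard one):

```python
# Run this in a Colab cell; it prompts you to authorize access to your Drive.
from google.colab import drive

drive.mount('/content/drive')
# After mounting, your files are available under /content/drive/My Drive/
```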

The most important feature that distinguishes Colab from other free cloud services is: Colab provides GPU and is totally free.

Again. It is free, free, free, free, free, free, free, free, and even freer.

It has its limits: it does not support Scala or R yet, and it limits your session size, but there are legit loopholes to get around that, e.g. re-uploading your files.

Here’s how you get started with it.

First, create a folder on Google Drive.

Next, create a Google Colab notebook inside that Google Drive folder.

[Image: creating a new Colaboratory notebook]

Finally, set the GPU

[Image: GPU setup]

Just kidding, the “more power” option shown there is just an April Fools’ prank by Google.

To set up the GPU, which is not on by default, go to Runtime >> Change runtime type and select GPU from there.
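Once the runtime is switched over, a quick sanity check confirms the GPU is attached; sketched here with TensorFlow, though torch.cuda.is_available() works just as well if you prefer PyTorch:

```python
import tensorflow as tf

# Prints something like '/device:GPU:0' when a GPU runtime is attached, and '' otherwise.
print(tf.test.gpu_device_name())
```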

And you are all set. Use it just like your Jupyter notebook, and your complex ML models will run in a matter of seconds, or at worst minutes.

With great power comes great responsibility!!

So, that’s it for now.
