Natural language processing

Oct 24, 2019

Now that we’re familiar with the basics of NLP and NLTK, we’re going to keep assembling the tools and knowledge that we’ll put to use as we move forward in this blog. This time, we’re going to learn how to read in semi-structured data. This probably isn’t something you think about much if you’re used to reading in well-formatted CSV or PRK files, which would be represented by the nice column- and row-based depiction on the left, but reading in text files can be a bit of a battle, as represented by the figure on the right.
Text data will typically be in a semi-structured or unstructured format. I’ve mentioned unstructured a couple of times, so what does that really mean? It could mean binary data, it could mean no delimiters, or it could mean no indication of where rows begin and end. A few examples are emails, PDF files, and social media posts. These may just get dumped into a file with no indication of where the subject of an email ends and the body begins, or even where one email ends and the next begins. The text can also get cluttered by things like HTML tags. We won’t really get into that in this blog, but the idea is that it can get really messy.
It’s important to note that Python is pretty smart, but unless it’s told otherwise, it basically sees everything as a string of characters. It needs to be told what those characters mean.
To illustrate how to read in semi-structured data, we’re going to use a dataset from the UCI Machine Learning Repository. This dataset is a collection of text messages, each labeled as either spam or ham. We’ll be using the same dataset for the duration of the blog, and it’s all contained in your exercise files, so there’s no need to download it. Before we dive in, I want to show you what the raw text file looks like, so we know what we’re working with. It’s not a clean CSV file, but it’s not terribly unstructured, either.
Each row has a distinct text message and a distinct label of either spam or ham. In the context of text datasets, this is actually pretty well structured, so it shouldn’t be too difficult to work with.
Now, let’s jump back to our notebook. We’re going to start by opening the file and just reading it, which means pulling in the text without any semblance of structure at all. You might do this when you don’t know what your data looks like: it gives you a first look at the data, and from there you can determine your next step.
I should note that this lesson is as much about text manipulation as it is about reading in a file. The way we’ll read this in certainly isn’t the easiest, but many times you don’t know what the easiest way is up front, and you have to explore the data first. I also don’t want to limit you from taking on more complicated datasets moving forward. So we’re going to take a roundabout, inefficient way of reading this in, with the purpose of arming you with tools you’ll need to take on more difficult datasets later on.
We’ve read in our dataset using open and .read. Now let’s print it out and take a look at what it contains. Because we read it in without any indication of format, Python doesn’t know whether there are rows, columns, words, or anything else. All it sees is one very long string. So to tell it how much to print, we tell it how many characters: print out the first 500 characters. Now you can see that it’s basically a block of text with \t and \n separators. The \t’s sit between the labels and the text message bodies, and the \n’s are typically at the end of each line. There are many ways to tackle this.
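As a recap of the reading step, here is a minimal sketch of the open-and-read approach. The file name and the three sample messages are made up for illustration (they are not taken from the real dataset); the sketch writes them to a temporary file so it can run on its own.

```python
import os
import tempfile

# Made-up stand-in for the real file: label, a tab, the message, a newline.
sample = (
    "ham\tSee you at lunch?\n"
    "spam\tYou have won a free prize, call now!\n"
    "ham\tRunning late, start without me.\n"
)

# Hypothetical file name; point this at your own copy of the dataset instead.
path = os.path.join(tempfile.mkdtemp(), "sms_sample.tsv")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Read the whole file as one long string, with no structure imposed.
with open(path, encoding="utf-8") as f:
    rawData = f.read()

# Slicing a string counts characters, so this shows (up to) the first 500.
print(rawData[0:500])

# repr() shows the \t and \n separators in escaped form instead of rendering them.
print(repr(rawData[0:500]))
```

Note that print renders tabs and newlines as whitespace, while repr (or evaluating the slice on its own as the last line of a notebook cell) shows them as \t and \n.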
The way I’m going to do it is to replace the \t’s with \n’s, which lets us split the whole string into a list and then divide it up from there. We’re going to assign the result to a list called parsedData. We start with the rawData string and use its replace method. The first parameter is the thing it’s looking for, which in this case is \t. The second parameter is what to replace it with, which in our case is \n. So every time it sees \t, it replaces it with \n. That allows us to then call .split, which takes our string, splits it on a given character, and returns a list. We tell it to split on \n, so every time it sees a \n, it chops the string there and adds that piece to the list.
Let’s run this and print out parsedData. It’s now an actual list, so instead of printing the first 500 characters, we’ll just print the first five items. You can see that the list alternates: the labels sit in the zeroth position, the second position, the fourth position, and so on, and the text bodies sit in the positions in between. This gives us some structure to work with.
Next, we’re going to create a new list called labelList by pulling every other item from parsedData. We write parsedData[0::2]: we start at position zero, which is the very first item; we leave the second place empty, which means go to the very end; and the two tells it to take every other item. So it’ll grab ham, spam, ham, and so on. Now we want to do the same thing for the text bodies, so that they end up in a separate list called textList.
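Here is a minimal sketch of the replace, split, and slice steps described so far. The rawData string is a made-up stand-in with the same label, tab, message, newline layout, so the printed values will differ from the real dataset.

```python
# Made-up stand-in for the raw string: label, tab, message, newline.
rawData = (
    "ham\tSee you at lunch?\n"
    "spam\tYou have won a free prize, call now!\n"
    "ham\tRunning late, start without me.\n"
)

# Turn every tab into a newline, then split on newlines to get one flat list.
parsedData = rawData.replace("\t", "\n").split("\n")
print(parsedData[0:5])

# Slice syntax is [start:stop:step]; leaving stop blank means "to the end".
# Start at index 0 and step by 2 to pick up the items in the label positions.
labelList = parsedData[0::2]
print(labelList[0:3])
```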
Again, we’re taking parsedData, but this time we start at position one, which is the second element in the list, and grab every other item from there: parsedData[1::2]. We’ll run that, then take a look at the first five entries of labelList and the first five entries of textList, just to make sure it did what we expected. We’ll wrap both in print, because without it a Jupyter Notebook only displays the result of the last expression in a cell, so we would see textList but not labelList. Printing both, we can see that the lists are separated the way we expected.
Now that the information we need is split into two lists, we can start thinking about combining them in a way we can actually use for analysis. The first thing we have to do is import pandas, and we’ll alias it as pd so we don’t have to type pandas every time we use one of its functions. Then we’re going to create a data frame called fullCorpus by calling pd.DataFrame. Inside the parentheses, we pass a dictionary in which the keys are the column names and the values are the lists that hold the actual values. We’ll call the first column label and pass in labelList, and we’ll call the second column bodyList and pass in textList. This should take labelList and textList, create a data frame out of them, and name the columns label and bodyList. Let’s print it out with fullCorpus.head(), which shows the first five rows. And we get an error.
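The following sketch reproduces this step on made-up sample rows (not the real dataset). It assumes pandas is installed, and it wraps the data frame construction in try/except so that the error is printed rather than stopping the script; the error_message variable is only there for illustration.

```python
import pandas as pd

# Made-up stand-in rows in the same label<TAB>message<NEWLINE> layout.
rawData = (
    "ham\tSee you at lunch?\n"
    "spam\tYou have won a free prize, call now!\n"
    "ham\tRunning late, start without me.\n"
)
parsedData = rawData.replace("\t", "\n").split("\n")

labelList = parsedData[0::2]  # items 0, 2, 4, ...
textList = parsedData[1::2]   # items 1, 3, 5, ...

# Explicit print calls so that both lists are shown, not just the last one.
print(labelList[0:5])
print(textList[0:5])

# Dictionary keys become column names; the lists supply the column values.
error_message = None
try:
    fullCorpus = pd.DataFrame({"label": labelList, "bodyList": textList})
    print(fullCorpus.head())
except ValueError as err:
    # The two lists differ in length, so pandas refuses to build the frame.
    error_message = str(err)
    print("Could not build the data frame:", error_message)
```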
The error says that the arrays must all have the same length, so let’s dig into that. We’re going to check the length of each list to see where the issue lies. We’ll do print(len(labelList)) to print the length of that list, and then do the same for textList. Print those, and you see that labelList has one extra entry that textList does not have. My guess is that it picked up something at the very end of the file that creates the mismatch.
Let’s print out the last five items of labelList. The way to do that is labelList[-5:], which says: go to the very end, count back five, and print from there to the end. You’ll see that the very last entry is an empty string. That is what you would expect if the file ends with a \n: splitting on \n leaves one empty item after it, and that extra item lands in labelList. If we drop it, labelList will have the same length as textList, and they’ll match up.
We can copy the code from before, because we’re going to use exactly the same thing to create our data frame. The only change is that instead of passing all of labelList, we pass labelList[:-1], which says: start at the beginning and take everything except the very last item. Now when we create fullCorpus and print the first five rows, it prints the way we expect.
So now we have a nice, clean version. Compared with the raw text, where it was rather messy and you could only make out some of the structure, it’s all cleaned up in a data frame, and you can start to see how you could actually do some analysis with it. Now I’d like to revisit the fact that I said we would take a roundabout way, for the purpose of arming you with the tools to tackle more difficult datasets.
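A sketch of the diagnosis and the fix, again on made-up sample rows and assuming pandas is installed. The counts shown here come from the three-row sample, not from the real dataset.

```python
import pandas as pd

# Made-up stand-in rows; note the trailing newline at the very end.
rawData = (
    "ham\tSee you at lunch?\n"
    "spam\tYou have won a free prize, call now!\n"
    "ham\tRunning late, start without me.\n"
)
parsedData = rawData.replace("\t", "\n").split("\n")
labelList = parsedData[0::2]
textList = parsedData[1::2]

# Compare the lengths to see which list has the extra entry.
print(len(labelList))
print(len(textList))

# A negative start counts back from the end: this shows up to the last five items.
print(labelList[-5:])

# [:-1] keeps everything except the final (empty) item, so the lengths match.
fullCorpus = pd.DataFrame({"label": labelList[:-1], "bodyList": textList})
print(fullCorpus.head())
```

Dropping the last item by position works here because the sketch has already checked that it is the empty string; on an unfamiliar file it is worth confirming that before slicing.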
To give you an idea of where we could have taken a shortcut: as soon as we noticed the \t in the raw text, that told us this file is tab delimited, and pandas lets you read in tab-separated files very easily. You’ll learn that details like this \t are giveaways that allow you to take shortcuts when reading in your data.
The easiest way to read this in is the read_csv function from pandas. We’ll assign the result to dataset, call pd.read_csv, and pass in our spam collection file. The thing that tells it the file is tab separated is the sep parameter, which we set to \t. The only other thing we need to add is header=None. The raw dataset does not have column names, and if we don’t say header=None, pandas will take the first row and assume it contains the column names. Let’s go ahead and print this out. As you can see, the column order is switched, but it’s the same content, and there are just no column labels.
So in this lesson, we took a dataset whose structure we didn’t know, explored it to uncover that structure, and organized it in a clean way, so now it’s prepared for the next steps in our NLP pipeline.
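The shortcut can be sketched as follows, assuming pandas is installed. To keep the example self-contained, it reads made-up sample rows from an in-memory buffer (io.StringIO) rather than from the real file; with the actual dataset you would pass the file path as the first argument instead.

```python
import io

import pandas as pd

# Made-up stand-in for the tab-separated file contents.
sample = (
    "ham\tSee you at lunch?\n"
    "spam\tYou have won a free prize, call now!\n"
    "ham\tRunning late, start without me.\n"
)

# sep="\t" marks the data as tab separated; header=None says the first row
# is data, not column names, so pandas numbers the columns 0 and 1.
dataset = pd.read_csv(io.StringIO(sample), sep="\t", header=None)
print(dataset.head())
```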

Written by Abhijeet Kamble