Data Cleaning Challenges!


data_cleansing.jpg

Data Cleansing Definition (1)

Data Cleaning Challenge:


data_quality_problems.jpg

Data Cleansing Definition (2)

This notebook deals with data cleaning and walks through the steps of the data cleaning process following good practice.

Handling Missing Values


Part (1) of (5) in the Data Cleaning process! We are going to be looking at how to deal with missing values.

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. Missing data can occur because of nonresponse: no information is provided for one or more items or a whole unit ("subject"). Some items are more likely to generate a nonresponse than others: items about private subjects such as income. Attrition is a type of missingness that can occur in longitudinal studies—for instance, studying development where a measurement is repeated after a certain period of time. Missingness occurs when participants drop out before the test ends, and one or more measurements are missing.

In part (1) we will:


missing_data.jpg

Missing Data (3)

First look at the data


The first thing we will need to do is load in the libraries and datasets we will be using. We will be using a dataset of events that occurred in American Football games for demonstration.

Important! Make sure we run this cell ourselves or the rest of our code won't work!

The first thing we do when we get a new dataset is take a look at some of it. This lets us see that it all read in correctly and gives us an idea of what is going on with the data. In this case, we are looking to see whether there are any missing values, which will be represented as NaN or None.
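As a sketch of this first look (the real notebook would read the full NFL CSV; the file path and columns here are assumptions, and we build a tiny stand-in so the example runs anywhere):

```python
import numpy as np
import pandas as pd

# In the real notebook we would load the full file, e.g.:
# nfl_data = pd.read_csv("../input/nfl_data.csv")
# Here we build a tiny stand-in instead so the example runs anywhere.
nfl_data = pd.DataFrame({
    "Date": ["2009-09-10", "2009-09-10", "2009-09-10"],
    "TimeSecs": [3600.0, np.nan, 3540.0],
    "PenalizedTeam": [np.nan, "PIT", np.nan],
})

# Take a first look; missing values show up as NaN
print(nfl_data.head())
```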

How many missing data points we have


We do have some missing values. We will see how many we have in each column.

That seems like a lot! It might be helpful to see what percentage of the values in our dataset are missing to give us a better sense of the scale of this problem:

About a quarter of the cells in this dataset are empty! In the next step, we are going to take a closer look at some of the columns with missing values and try to figure out what might be going on with them.
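The per-column counts and the overall percentage can be computed like this (a minimal sketch on a toy dataframe; the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the NFL data (column names are assumptions)
nfl_data = pd.DataFrame({
    "TimeSecs": [3600.0, np.nan, 3540.0, np.nan],
    "PenalizedTeam": [np.nan, "PIT", np.nan, np.nan],
})

# How many missing data points per column
missing_values_count = nfl_data.isnull().sum()
print(missing_values_count)

# What percentage of all cells are missing
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()
percent_missing = (total_missing / total_cells) * 100
print(percent_missing)  # 62.5 for this toy dataframe
```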

Why the data is missing


This is the point at which we get into the part of data science that we like to call "data intuition", by which we mean "really looking at our data and trying to figure out why it is the way it is and how that will affect our analysis". It can be a frustrating part of data science, especially if we are newer to the field and do not have a lot of experience. For dealing with missing values, we will need to use our intuition to figure out why the value is missing. One of the most important questions we can ask ourselves to help figure this out is this:

Is this value missing because it was not recorded or because it does not exist?

If a value is missing because it does not exist (like the height of the oldest child of someone who does not have any children) then it does not make sense to try and guess what it might be. These values we probably do want to keep as NaN. On the other hand, if a value is missing because it was not recorded, then we can try to guess what it might have been based on the other values in that column and row. (This is called "imputation", and we will learn how to do it!)

We will work through an example. Looking at the number of missing values in the nfl_data dataframe, we notice that the column TimesSec has a lot of missing values in it:

By looking at the business documentation, we will assume this column has information on the number of seconds left in the game when the play was made. This means that these values are probably missing because they were not recorded, rather than because they do not exist. So, it would make sense for us to try and guess what they should be rather than just leaving them as NA's.

On the other hand, there are other fields, like PenalizedTeam, that also have a lot of missing values. In this case, though, the field is missing because if there was no penalty then it does not make sense to say which team was penalized. For this column, it would make more sense to either leave it empty or to add a third value like "neither" and use that to replace the NA's.

Always: Read over the dataset documentation if we have not already! If we are working with a dataset that we have gotten from another department, we can also try reaching out to them to get more information.

If we are doing very careful data analysis, this is the point at which we would look at each column individually to figure out the best strategy for filling those missing values. For the rest of this notebook, we will cover some techniques that can help us with missing values but will probably also end up removing some useful information or adding some noise to our data.

Drop missing values


If we are in a hurry or do not have a reason to figure out why our values are missing, one option we have is to just remove any rows or columns that contain missing values. (Note: We do not generally recommend this approach for important projects! It is usually worth it to take the time to go through our data and look at all the columns with missing values one-by-one to really get to know our dataset.)

If we are sure we want to drop rows with missing values, pandas does have a handy function, dropna(), to help us do this. We will try it out on our NFL dataset!

Looks like that removed all our data! This is because every row in our dataset had at least one missing value. We might do better by removing all the columns that have at least one missing value instead.

We have lost quite a bit of data, but at this point we have successfully removed all the NaN's from our data.
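A minimal sketch of both dropna() variants on a toy dataframe (the NFL data behaves the same way, only bigger):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# Drop every ROW containing at least one missing value
rows_dropped = df.dropna()
print(rows_dropped.shape)  # (1, 3): only the complete last row survives

# Drop every COLUMN containing at least one missing value
cols_dropped = df.dropna(axis=1)
print(cols_dropped.shape)  # (3, 1): only column "c" survives
```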

Filling in missing values


Another option is to try and fill in the missing values. For this next bit, we will grab a small sub-section of the NFL data so that it will print well.

We can use pandas' fillna() function to fill in missing values in a dataframe for us. One option we have is to specify what we want the NaN values to be replaced with. For example, we can replace all the NaN values with 0.

We could replace missing values with whatever value comes directly after it in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)
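A sketch of both fillna() options on a toy sub-section (the column names here are assumptions):

```python
import numpy as np
import pandas as pd

subset = pd.DataFrame({
    "EPA": [1.0, np.nan, np.nan, 2.0],
    "airEPA": [np.nan, 0.5, np.nan, 0.7],
})

# Option 1: replace every NaN with 0
filled_zero = subset.fillna(0)

# Option 2: back-fill, i.e. replace each NaN with the next value in the
# same column, then fill any NaN still left at the end with 0
filled_bfill = subset.bfill().fillna(0)
print(filled_bfill)
```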

Filling in missing values is also known as "imputation"



Scale And Normalize Data

Part (2) of (5) in the Data Cleaning process! We are going to be looking at how to scale and normalize data (and what the difference is between the two!).

Feature scaling is a method used to normalize the range of independent variables or features of data. It is also known as data normalization and is generally performed during the data preprocessing step.

In part (2) we will:

Get our environment set up for scale and normalize data


The first thing we will need to do is load in the libraries and datasets we will be using.

Important! Make sure we run this cell ourselves or the rest of our code won't work!

Scaling vs. Normalization: What is the difference?


One of the reasons that it is easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, we are transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that, in scaling, we are changing the range of our data while in normalization we are changing the shape of the distribution of our data.


Scaling

This means that we are transforming our data so that it fits within a specific scale, like 0-100 or 0-1. We want to scale data when we are using methods based on measures of how far apart data points are, like support vector machines (SVM) or k-nearest neighbors (KNN). With these algorithms, a change of "1" in any numeric feature is given the same importance.

For example, we might be looking at the prices of some products in both Yen and US Dollars. One US Dollar is worth about 100 Yen, but if we do not scale our prices, methods like SVM or KNN will consider a difference in price of 1 Yen as important as a difference of 1 US Dollar! This clearly does not fit with our intuitions of the world. With currency, we can convert between currencies. But what if we are looking at something like height and weight? It is not entirely clear how many pounds should equal one inch (or how many kilograms should equal one meter).

By scaling our variables, we can help compare different variables on equal footing. To help solidify what scaling looks like, we will look at a made-up example.

We notice that the shape of the data does not change, but that instead of ranging from 0 to 8, it now ranges from 0 to 1.
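The made-up example can be sketched with plain NumPy min-max scaling (the original notebook may use a helper library for this; here we write the formula out directly):

```python
import numpy as np

np.random.seed(0)
# Made-up data drawn from an exponential distribution
original_data = np.random.exponential(size=1000)

# Min-max scaling: map every value into the range 0-1
scaled_data = (original_data - original_data.min()) / (
    original_data.max() - original_data.min()
)

print(scaled_data.min(), scaled_data.max())  # 0.0 1.0
```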


Normalization

Scaling just changes the range of our data. Normalization is a more radical transformation. The point of normalization is to change our observations so that they can be described as a normal distribution.

Normal distribution: Also known as the "bell curve", this is a specific statistical distribution where roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean. The normal distribution is also known as the Gaussian distribution.

In general, we will only want to normalize our data if we are going to be using a machine learning or statistics technique that assumes our data is normally distributed. Some examples of these include t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method with "Gaussian" in the name probably assumes normality.)

The method we are using to normalize here is called the Box-Cox transformation. We will take a quick peek at what normalizing some data looks like:

Notice that the shape of our data has changed. Before normalizing it was almost L-shaped. But after normalizing it looks more like the outline of a bell (hence "bell curve").
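A quick sketch of the Box-Cox transformation using scipy.stats.boxcox (assuming SciPy is available; Box-Cox requires strictly positive input):

```python
import numpy as np
from scipy import stats

np.random.seed(0)
# Box-Cox needs strictly positive data; exponential samples qualify
original_data = np.random.exponential(size=1000)

# stats.boxcox returns the transformed data and the fitted lambda
normalized_data, fitted_lambda = stats.boxcox(original_data)

print(normalized_data.shape)  # (1000,)
```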


Practice scaling


To practice scaling and normalization, we are going to be using a dataset of Kickstarter campaigns. (Kickstarter is a website where people can ask people to invest in various projects and concept products.)

We will start by scaling the goals of each campaign, which is how much money they were asking for.

We can see that scaling changed the scales of the plots dramatically (but not the shape of the data: it looks like most campaigns have small goals but a few have very large ones).

Practice normalization


We will try practicing normalization. We are going to normalize the amount of money pledged to each campaign.

It is not perfect (it looks like a lot of campaigns got very few pledges) but it is much closer to normal!



Parsing Dates

Part (3) of (5) in the Data Cleaning process! We are going to be looking at how to deal with dates.

The Date.parse() method parses a string representation of a date, and returns the number of milliseconds since January 1, 1970, 00:00:00 UTC or NaN if the string is unrecognized or, in some cases, contains illegal date values (e.g. 2015-02-31).

In part (3) we will:

Get our environment set up for parsing dates


The first thing we will need to do is load in the libraries and datasets we will be using. We will be working with two datasets: one containing information on earthquakes that occurred between 1965 and 2016, and another that contains information on landslides that occurred between 2007 and 2016.

Important! Make sure we run this cell ourselves or the rest of our code won't work!

Check the data type of our date column


For this part of the challenge, we will be working with the date column from the landslides dataframe. The very first thing we are going to do is take a peek at the first few rows to make sure it actually looks like it contains dates.

Dates are showing! But just because we as humans can tell that these are dates does not mean that Python knows they are dates. Notice that at the bottom of the output of head(), we can see that the data type of this column is "object".

Pandas uses the "object" dtype for storing various types of data types, but most often when we see a column with the dtype "object" it will have strings in it.

If we check the pandas dtype documentation here, we will notice that there is also a specific datetime64 dtype. Because the dtype of our column is object rather than datetime64, we can tell that Python does not know that this column contains dates.

We can also look at just the dtype of our column without printing the first few rows if we like:

We may have to check the numpy documentation to match the letter code to the dtype of the object. "O" is the code for "object", so we can see that these two methods give us the same information.

Convert our date columns to datetime


Now that we know that our date column is not being recognized as a date, it is time to convert it so that it is recognized as a date. This is called "parsing dates" because we are taking in a string and identifying its component parts.

We can tell pandas what the format of our dates is with a guide called a "strftime directive", which we can find more information on at this link. The basic idea is that we need to point out which parts of the date are where and what punctuation is between them. There are lots of possible parts of a date, but the most common are %d for day, %m for month, %y for a two-digit year and %Y for a four-digit year.

Some examples: 1/17/07 has the format "%m/%d/%y", and 17-1-2007 has the format "%d-%m-%Y".

Looking back up at the head of the date column in the landslides dataset, we can see that it's in the format "month/day/two-digit year", so we can use the same syntax as the first example to parse in our dates:

Now when we check the first few rows of the new column, we can see that the dtype is datetime64. We can also see that our dates have been slightly rearranged so that they fit the default order for datetime objects (year-month-day).

Now that our dates are parsed correctly, we can interact with them in useful ways.


landslides['date_parsed'] = pd.to_datetime(landslides['date'], infer_datetime_format=True)


Select just the day of the month from our column


We will try to get information on the day of the month that a landslide occurred on from the original "date" column, which has an "object" dtype:

We got an error! The important part to look at here is the part at the very end that says AttributeError: Can only use .dt accessor with datetimelike values. We are getting this error because the dt.day accessor does not know how to deal with a column with the dtype "object". Even though our dataframe has dates in it, because they have not been parsed we cannot interact with them in a useful way.

We have the column that we parsed earlier, and that lets us get the day of the month out with no problem:
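A sketch of pulling the day of the month out of a parsed column (the sample dates are assumptions):

```python
import pandas as pd

# A parsed datetime64 column (sample values are made up)
date_parsed = pd.to_datetime(pd.Series(["3/2/07", "3/22/07"]), format="%m/%d/%y")

# The .dt accessor works once the dtype is datetime64
day_of_month = date_parsed.dt.day
print(day_of_month.tolist())  # [2, 22]
```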

Plot the day of the month to check the date parsing


One of the biggest dangers in parsing dates is mixing up the months and days. The to_datetime() function does have very helpful error messages, but it does not hurt to double-check that the days of the month we've extracted make sense.

To do this, we will plot a histogram of the days of the month. We expect it to have values between 1 and 31 and, since there is no reason to suppose the landslides are more common on some days of the month than others, a relatively even distribution. (With a dip on 31 because not all months have 31 days.) We will see if that is the case:
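Instead of reproducing the plot here, this sketch runs the same sanity check programmatically (the sample dates are assumptions; in the notebook we would draw the histogram):

```python
import pandas as pd

# Hypothetical parsed dates standing in for the landslides column
date_parsed = pd.to_datetime(
    pd.Series(["3/2/07", "3/22/07", "4/6/07", None]), format="%m/%d/%y"
)

# Drop NaT values, then extract the day of the month
day_of_month = date_parsed.dropna().dt.day

# In the notebook we would plot a histogram of day_of_month here;
# the quick programmatic version of the same check:
assert day_of_month.between(1, 31).all()
```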

We did parse our dates correctly and this graph makes good sense to us.




ascii_conversion_chart.jpg

ASCII Conversion Chart (4)

Character Encodings

Part (4) of (5) in the Data Cleaning process! We are going to be working with different character encodings.


In computing, data storage, and data transmission, character encoding is used to represent a repertoire of characters by some kind of encoding system that assigns a number to each character for digital representation. Depending on the abstraction level and context, corresponding code points and the resulting code space may be regarded as bit patterns, octets, natural numbers, electrical pulses, etc. A character encoding is used in computation, data storage, and transmission of textual data. "Character set", "character map", "codeset" and "code page" are related, but not identical, terms.

In part (4) we will:

Get our environment set up for character encodings


The first thing we will need to do is load in the libraries we will be using. Not our datasets, though: we will get to those later!

Important! Make sure we run this cell ourselves or the rest of our code won't work!

What are encodings?


Character encodings are specific sets of rules for mapping from raw binary byte strings (that look like this: 0110100001101001) to characters that make up human-readable text (like "hello world"). There are many different encodings, and if we try to read in text with a different encoding than the one it was originally written in, we end up with scrambled text called "mojibake" (said like mo-gee-bah-kay). Here is an example of mojibake:

æ–‡å—化ã??

We might also end up with "unknown" characters. These are what gets printed when there is no mapping between a particular byte and a character in the encoding we are using to read our byte string in, and they look like this:

����������

Character encoding mismatches are less common today than they used to be, but it is definitely still a problem. There are lots of different character encodings, but the main one we need to know is UTF-8.

UTF-8 is the standard text encoding. All Python code is in UTF-8 and, ideally, all our data should be as well. It is when things are not in UTF-8 that we run into trouble.

It was pretty hard to deal with encodings in Python 2, but thankfully in Python 3 it is a lot simpler. There are two main data types we will encounter when working with text in Python 3. One is the string, which is what text is by default.

The other is the bytes data type, which is a sequence of integers. We can convert a string into bytes by specifying which encoding it is in:

If we look at a bytes object, we will see that it has a b in front of it, and then maybe some text after. That is because bytes are printed out as if they were characters encoded in ASCII. (ASCII is an older character encoding that does not really work for writing any language other than English.) Here we can see that our euro symbol has been replaced with some mojibake that looks like "\xe2\x82\xac" when it is printed as if it were an ASCII string.
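A small round-trip sketch of encoding and decoding with UTF-8:

```python
# Start with a string containing a non-ASCII character
before = "This is the euro symbol: €"

# Encode it to bytes, specifying the UTF-8 encoding
after = before.encode("utf-8", errors="replace")
print(after)  # the euro symbol shows up as \xe2\x82\xac

# Decode back with the SAME encoding and the text survives intact
print(after.decode("utf-8") == before)  # True
```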

When we convert our bytes back to a string with the correct encoding, we can see that our text is all there correctly, which is great! :)

However, when we try to use a different encoding to map our bytes into a string, we get an error. This is because the encoding we are trying to use does not know what to do with the bytes we are trying to pass it. We need to tell Python the encoding that the byte string is actually in.

We can think of different encodings as different ways of recording music. We can record the same music on a CD, cassette tape or 8-track. While the music may sound more-or-less the same, we need to use the right equipment to play the music from each recording format. The correct decoder is like a cassette player or a cd player. If we try to play a cassette in a CD player, it just won't work.

We can also run into trouble if we try to use the wrong encoding to map from a string to bytes. As we mentioned, strings are UTF-8 by default in Python 3, so if we try to treat them like they were in another encoding we will create problems.

For example, if we try to convert a string to bytes for ASCII using encode(), we can ask for the bytes to be what they would be if the text was in ASCII. Since our text is not in ASCII, though, there will be some characters it cannot handle. We can automatically replace the characters that ASCII cannot handle, but then any character not in ASCII will just be replaced with the unknown character, and when we convert the bytes back to a string it stays replaced. The dangerous part about this is that there is no way to tell which character it should have been. That means we may have just made our data unusable!
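A sketch of the lossy ASCII round trip described above (Python's encode() with errors="replace" uses "?" as its replacement character):

```python
before = "This is the euro symbol: €"

# Force ASCII, replacing anything ASCII cannot handle
after = before.encode("ascii", errors="replace")

# The euro symbol is gone for good: it came back as "?"
print(after.decode("ascii"))  # This is the euro symbol: ?
```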

This is not right and we want to avoid doing it! It is far better to convert all our text to UTF-8 as soon as we can and keep it in that encoding. The best time to convert non UTF-8 input into UTF-8 is when we read in files, which we will talk about next.

First, however, we will try converting between bytes and strings with different encodings and see what happens. Notice what this does to our text. Would we want this to happen to data we were trying to analyze?

Reading in files with encoding problems


Most files we will encounter will probably be encoded with UTF-8. This is what Python expects by default, so most of the time we won't run into problems. However, sometimes we will get an error like this:

Notice that we get the same UnicodeDecodeError we got when we tried to decode UTF-8 bytes as if they were ASCII! This tells us that this file is not actually UTF-8. We do not know what encoding it actually is though. One way to figure it out is to try and test a bunch of different character encodings and see if any of them work. A better way, though, is to use the chardet module to try and automatically guess what the right encoding is. It is not 100% guaranteed to be right, but it is usually faster than just trying to guess.
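As a self-contained stand-in for the file example (no CSV or chardet needed), this sketch reproduces the same mismatch with a byte string: decoding Windows-1252 bytes as UTF-8 fails, while the correct codec works:

```python
# Bytes written in Windows-1252 (the £ sign is the single byte 0xa3 there)
raw = "£100 in damages".encode("windows-1252")

# Decoding them as UTF-8 fails, just like reading the file did
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-8 failed:", err.reason)

# Decoding with the encoding the bytes were actually written in works
print(raw.decode("windows-1252"))  # £100 in damages
```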

We are going to just look at the first ten thousand bytes of this file. This is usually enough for a good guess about what the encoding is and is much faster than trying to look at the whole file. (Especially with a large file this can be very slow.) Another reason to just look at the first part of the file is that we can see by looking at the error message that the first problem is the 11th character. So we probably only need to look at the first little bit of the file to figure out what is going on.

So chardet is 73% confident that the right encoding is "Windows-1252". We will see if that is correct:

Looks like chardet was right! The file reads in with no problem (although we do get a warning about datatypes) and when we look at the first few rows it seems to be fine.

What if the encoding chardet guesses is not right? Since chardet is basically just a fancy guesser, sometimes it will guess the wrong encoding. One thing we can try is looking at more or less of the file and seeing if we get a different result and then try that.

Saving our files with UTF-8 encoding


Finally, once we have gone through all the trouble of getting our file into UTF-8, we will probably want to keep it that way. The easiest way to do that is to save our files with UTF-8 encoding. The good news is, since UTF-8 is the standard encoding in Python, when we save a file it will be saved as UTF-8 by default:
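A sketch of the save step using pandas' to_csv(), which writes UTF-8 by default (the file name and toy data are assumptions; a temp directory is used so the example runs anywhere):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"city": ["karachi", "lahore"]})

# to_csv writes UTF-8 by default, so no encoding argument is needed
path = os.path.join(tempfile.gettempdir(), "my-data-utf8.csv")
df.to_csv(path, index=False)

# ...and the file reads straight back in without an encoding argument
round_trip = pd.read_csv(path)
print(round_trip.equals(df))  # True
```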

And we are done!

If we have not saved a file in a kernel before, we need to hit the commit & run button and wait for our notebook to finish running first before we can see or access the file we have saved out. If we do not see it at first, wait a couple minutes and it should show up. The files we save will be in our work directory "../output/", and we can download them from our notebook.




inconsistent.jpg

Inconsistent Data Entry (5)

Inconsistent Data Entry

Part (5) of (5) in the Data Cleaning process! We are going to learn how to clean up inconsistent text entries.

Typographic errors and spelling mistakes can completely wreck even the most well-planned database. These types of errors are difficult to predict and often occur when data entry tasks are being performed by human beings: we simply get in a hurry and get a little sloppy with our typing. Related to this is the common problem of different entry styles among a variety of individuals. Different people might be in charge of putting data into the database; some people prefer one abbreviation over another, and others will spell out information completely. We will take a look at an example.

In part (5) we will:

Get our environment set up for inconsistent data entry


The first thing we will need to do is load in the libraries we will be using.

Important! Make sure we run this cell ourselves or the rest of our code won't work!

When we tried to read in the PakistanSuicideAttacks Ver 11 (30-November-2017).csv file the first time, we got a character encoding error, so we are going to quickly check what the encoding should be...

We will read it in with the correct encoding.

Now we are ready to get started! We can, as always, take a moment here to look at the data and get familiar with it. :)

Do some preliminary text pre-processing


We will clean up the "City" column to make sure there are no data entry inconsistencies in it. We could go through and check each row by hand, of course, and hand-correct inconsistencies when we find them. There is a more efficient way to do this though!

We can see some problems due to inconsistent data entry: 'Lahore' and 'Lahore ', for example, or 'Lakki Marwat' and 'Lakki marwat'.

The first thing we are going to do is make everything lower case (we can change it back at the end if we like) and remove any white spaces at the beginning and end of cells. Inconsistencies in capitalizations and trailing white spaces are very common in text data and we can fix a good 80% of our text data entry inconsistencies by doing this.
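A sketch of the lower-casing and stripping step on a toy version of the City column (the sample values are assumptions):

```python
import pandas as pd

# Toy stand-in for the "City" column (sample values are assumptions)
suicide_attacks = pd.DataFrame(
    {"City": ["Lahore", "Lahore ", "Lakki marwat", "LAKKI MARWAT"]}
)

# Convert to lower case and strip leading/trailing white space
suicide_attacks["City"] = suicide_attacks["City"].str.lower().str.strip()

print(suicide_attacks["City"].unique())  # ['lahore' 'lakki marwat']
```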

Next we are going to tackle more difficult inconsistencies.

Use fuzzy matching to correct inconsistent data entry


We will look at the city column and see if there is any more data cleaning we need to do.

It does look like there are some remaining inconsistencies: 'd. i khan' and 'd.i khan' should probably be the same. (We looked it up and 'd.g khan' is a separate city, so we should not combine those.)

We are going to use the fuzzywuzzy package to help identify which strings are closest to each other. This dataset is small enough that we could probably correct the errors by hand, but that approach does not scale well. (It is a challenge to hand-correct thousands of entries!) Automating things as early as possible is generally a good idea.

Fuzzy matching: The process of automatically finding text strings that are very similar to the target string. In general, a string is considered "closer" to another one the fewer characters we would need to change if we were transforming one string into another. So "apple" and "snapple" are two changes away from each other (add "s" and "n") while "in" and "on" are one change away (replace "i" with "o"). We won't always be able to rely on fuzzy matching 100%, but it will usually end up saving us at least a little time.

Fuzzywuzzy returns a ratio given two strings. The closer the ratio is to 100, the smaller the edit distance between the two strings. We are going to get the ten strings from our list of cities that have the closest distance to "d.i khan".
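fuzzywuzzy may not be installed everywhere, so this sketch uses difflib from the standard library instead, whose SequenceMatcher ratio plays the same role (scaled to 0-100 here to mimic fuzzywuzzy's score; the city list is a toy assumption):

```python
import difflib

cities = ["d. i khan", "d.i khan", "d.g khan", "karachi", "lahore"]

def similarity(a: str, b: str) -> int:
    """A 0-100 similarity score, playing the role of fuzzywuzzy's ratio."""
    return round(difflib.SequenceMatcher(None, a, b).ratio() * 100)

# Rank the cities by closeness to the target string
matches = sorted(cities, key=lambda c: similarity("d.i khan", c), reverse=True)
print(matches[:3])  # ['d.i khan', 'd. i khan', 'd.g khan']
```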

We can see that two of the items in the cities are very close to "d.i khan": "d. i khan" and "d.i khan". We can also see that "d.g khan", which is a separate city, has a ratio of 88. Since we do not want to replace "d.g khan" with "d.i khan", we will replace all rows in our City column that have a ratio of > 90 with "d.i khan".

We are going to write a function. It is a good idea to write a general purpose function we can reuse if we think we might have to do a specific task more than once or twice. This keeps us from having to copy and paste code too often, which saves time and can help prevent mistakes.
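Such a function might look like the following sketch (using a difflib-based ratio as a stand-in for fuzzywuzzy; the helper name and threshold mirror the text but are assumptions):

```python
import difflib

import pandas as pd

def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    """Replace values in df[column] that closely match string_to_match.

    Uses difflib's 0-100 ratio as a stand-in for fuzzywuzzy's.
    """
    def ratio(s):
        return round(
            difflib.SequenceMatcher(None, s, string_to_match).ratio() * 100
        )

    close_matches = [s for s in df[column].unique() if ratio(s) >= min_ratio]
    df.loc[df[column].isin(close_matches), column] = string_to_match

suicide_attacks = pd.DataFrame({"City": ["d. i khan", "d.i khan", "d.g khan"]})
replace_matches_in_column(suicide_attacks, "City", "d.i khan")
print(suicide_attacks["City"].unique())  # ['d.i khan' 'd.g khan']
```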

Now that we have a function, we can put it to the test!

We will check the unique values in our City column again and make sure we have tidied up d.i khan correctly.

We only have "d.i khan" in our dataframe now, and we did not have to change anything by hand.