GithubHelp home page GithubHelp logo

dsc-enterprise-deloitte-dl-pandas-series-and-dataframes's Introduction

Understanding Pandas Series and DataFrames

Introduction

In this lesson, we're digging into Pandas Series and DataFrames - the two main data types you'll work with.

Objectives

You will be able to:

  • Understand and explain what Pandas Series and DataFrames are and how they differ from dictionaries and lists
  • Create Series & DataFrames from dictionaries and lists
  • Manipulate columns in DataFrames (df.rename(), df.drop())
  • Manipulate the index in DataFrames (df.reindex(), df.drop(), df.rename())
  • Manipulate column datatypes

Pandas Data Types vs. Native Python Data Types

As we talk more about Object-Oriented Programming (OOP), using Pandas Series and DataFrames instead of built-in Python datatypes can have a range of benefits. One of the most important benefit is that Series and DataFrames have a range of built-in methods which make standard practices and procedures streamlined. Some of these methods can result in dramatic performance gains. To read more about these methods, make sure to continuously reference the Pandas documentation. It is impossible to know every method of pandas at any given time, nor should you devote much time to memorization. We will not deeply explain every Pandas method in these upcoming lessons and labs, but a critical part of every Data Scientist's job is to investigate documentation to learn about components of these tools on your own.

Setup

Let's take a little time to import the packages we need and to import and preview a dataset.

Import pandas and other packages to be used

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Import and preview a dataset

df = pd.read_csv('turnstile_180901.txt')
print(df.shape)
df.head()
(197625, 11)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS
0 A002 R051 02-00-00 59 ST NQR456W BMT 08/25/2018 00:00:00 REGULAR 6736067 2283184
1 A002 R051 02-00-00 59 ST NQR456W BMT 08/25/2018 04:00:00 REGULAR 6736087 2283188
2 A002 R051 02-00-00 59 ST NQR456W BMT 08/25/2018 08:00:00 REGULAR 6736105 2283229
3 A002 R051 02-00-00 59 ST NQR456W BMT 08/25/2018 12:00:00 REGULAR 6736180 2283314
4 A002 R051 02-00-00 59 ST NQR456W BMT 08/25/2018 16:00:00 REGULAR 6736349 2283384
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197625 entries, 0 to 197624
Data columns (total 11 columns):
C/A                                                                     197625 non-null object
UNIT                                                                    197625 non-null object
SCP                                                                     197625 non-null object
STATION                                                                 197625 non-null object
LINENAME                                                                197625 non-null object
DIVISION                                                                197625 non-null object
DATE                                                                    197625 non-null object
TIME                                                                    197625 non-null object
DESC                                                                    197625 non-null object
ENTRIES                                                                 197625 non-null int64
EXITS                                                                   197625 non-null int64
dtypes: int64(2), object(9)
memory usage: 16.6+ MB

Data Munging/ Manipulation

This MTA turnstile dataset is a great place for us to get our hands dirty wrangling and cleaning some data! Here's the data dictionary if you want to know more about the dataset http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt

Let's start by filtering the data down to all stations for the N line. To do this, we'll need to extract all "N"s from the LINENAME column, or create a column indicating whether or not the stop is an N line stop.

Defining Functions

At this point, we will need to define some functions to perform data manipulation so that we can reuse them easily. Let's review how to do this: In Python, we define a function using the def keyword. Afterwards, we give the function a name, followed by parentheses. Any required (or optional parameters) are specified within the parentheses, just as you would normally call a function. You then specify the function's behavior using a colon and an indentation, much the same way you would a for loop or conditional block. Finally, if you want your function to return something (as with the str.pop() method) as opposed to a function that simply does something in the background but returns nothing (such as list.append()), you must use the return keyword. Note that as soon as a function hits a point in execution where something is returned, the function would terminate and no further commands would be executed. In other words the return command both returns a value and forces termination of the function.

def contains_n(text):
    if 'N' in text:
        return True
    else:
        return False

#or the shorter, more pythonic:
def contains_n(text):
    bool_val = 'N' in text
    return bool_val
df['On_N_Line'] = df.LINENAME.map(contains_n)
df.head(2)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS On_N_Line
0 A002 R051 02-00-00 59 ST NQR456W BMT 08/25/2018 00:00:00 REGULAR 6736067 2283184 True
1 A002 R051 02-00-00 59 ST NQR456W BMT 08/25/2018 04:00:00 REGULAR 6736087 2283188 True
df.tail(2)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS On_N_Line
197623 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 08/31/2018 17:00:00 REGULAR 5554 348 False
197624 TRAM2 R469 00-05-01 RIT-ROOSEVELT R RIT 08/31/2018 21:00:00 REGULAR 5554 348 False
df.On_N_Line.value_counts(normalize=True)
False    0.870441
True     0.129559
Name: On_N_Line, dtype: float64

If you have not seen value_counts() before, this would be a good time to check out the documentation for it !

Explanation

Above we used the .map() method for Pandas series. This allows us to pass a function that will be applied to each and every data entry within the series. As shorthand, we could also pass a lambda function to determine whether or not each row was on the N line or not:

df['On_N_Line'] = df.LINENAME.map(lambda x: 'N' in x)

This is shorter and equivalent to the functions defined above. Lambda functions are often more convenient, but have less functionality than defining functions explicitly.

Cleaning Column Names

Sometimes, you have messy column names.

df.columns
Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES',
       'EXITS                                                               ',
       'On_N_Line'],
      dtype='object')

You might notice that, foolishly, the EXITS column has a lot of annoying whitespace following it. We can quickly use a list comprehension to clean up all of the column names.

Reformatting Column Types

Another common data munging technique can be reformatting column types. We first previewed column types above using the df.info() method, which we'll repeat here.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197625 entries, 0 to 197624
Data columns (total 12 columns):
C/A                                                                     197625 non-null object
UNIT                                                                    197625 non-null object
SCP                                                                     197625 non-null object
STATION                                                                 197625 non-null object
LINENAME                                                                197625 non-null object
DIVISION                                                                197625 non-null object
DATE                                                                    197625 non-null object
TIME                                                                    197625 non-null object
DESC                                                                    197625 non-null object
ENTRIES                                                                 197625 non-null int64
EXITS                                                                   197625 non-null int64
On_N_Line                                                               197625 non-null bool
dtypes: bool(1), int64(2), object(9)
memory usage: 16.8+ MB

A common transformation needed is converting numbers stored as text to float or integer representations. In this case ENTRIES and EXITS are appropriately int64, but to practice, we'll demonstrate changing that to a float and then back to an int.

print(df.ENTRIES.dtype) #We can also check an individual column type rather then all 
df.ENTRIES = df.ENTRIES.astype(float) #Changing the column to float
print(df.ENTRIES.dtype) #Checking our changes
int64
float64
#Converting Back
print(df.ENTRIES.dtype) 
df.ENTRIES = df.ENTRIES.astype(int)
print(df.ENTRIES.dtype)
float64
int64

Attempting to convert a string column to int or float will produce errors if there are actually non-numeric characters

df.LINENAME = df.LINENAME.astype(int)
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-12-9635123507d4> in <module>
----> 1 df.LINENAME = df.LINENAME.astype(int)


//anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5689             # else, only a single dtype is given
   5690             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 5691                                          **kwargs)
   5692             return self._constructor(new_data).__finalize__(self)
   5693 


//anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
    529 
    530     def astype(self, dtype, **kwargs):
--> 531         return self.apply('astype', dtype=dtype, **kwargs)
    532 
    533     def convert(self, **kwargs):


//anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    393                                             copy=align_copy)
    394 
--> 395             applied = getattr(b, f)(**kwargs)
    396             result_blocks = _extend_blocks(applied, result_blocks)
    397 


//anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    532     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    533         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 534                             **kwargs)
    535 
    536     def _astype(self, dtype, copy=False, errors='raise', values=None,


//anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    631 
    632                     # _astype_nansafe works fine with 1-d only
--> 633                     values = astype_nansafe(values.ravel(), dtype, copy=True)
    634 
    635                 # TODO(extension)


//anaconda3/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    681         # work around NumPy brokenness, #1987
    682         if np.issubdtype(dtype.type, np.integer):
--> 683             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    684 
    685         # if we have a datetime/timedelta array of objects


pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()


ValueError: invalid literal for int() with base 10: 'NQR456W'

Converting Dates

A slightly more complicated data type transformation is creating date or datetime objects. These are built-in datatypes that have useful information such as being able to quickly calculate the time between two days, or extracting the day of the week from a given date. However, if we look at our current date column, we will notice it is simply a non-null object (probably simply text).

df.DATE.dtype
dtype('O')

pd.to_datetime()

This is the handiest of methods when converting strings to datetime objects.

#Often you can simply pass the series into this method.
pd.to_datetime(df.DATE).head() #It is good practice to preview the results first
#This prevents overwriting data if some error was produced. However everything looks good!
0   2018-08-25
1   2018-08-25
2   2018-08-25
3   2018-08-25
4   2018-08-25
Name: DATE, dtype: datetime64[ns]

Sometimes the above won't work and you'll have to explicitly pass how the date is formatted.
To do that, you have to use some datetime codes. Here's a preview of some of the most common ones:

To explicitly pass formatting parameters, preview your dates and write the appropriate codes.

df.DATE.iloc[0] #Another method for slicing series/dataframes
'08/25/2018'
#Notice we include delimiters (in this case /) between the codes.
pd.to_datetime(df.DATE, format='%m/%d/%Y').head()
0   2018-08-25
1   2018-08-25
2   2018-08-25
3   2018-08-25
4   2018-08-25
Name: DATE, dtype: datetime64[ns]
#Actually apply and save our changes
df.DATE = pd.to_datetime(df.DATE)
print(df.DATE.dtype)
#Preview updated dataframe
df.head(2)
datetime64[ns]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS On_N_Line
0 A002 R051 02-00-00 59 ST NQR456W BMT 2018-08-25 00:00:00 REGULAR 6736067 2283184 True
1 A002 R051 02-00-00 59 ST NQR456W BMT 2018-08-25 04:00:00 REGULAR 6736087 2283188 True

Datetime Methods

Now that we have converted the DATE field to a datetime object we can use some handy built-in methods.

#dt stores all the built in datetime methods (only works for datetime columns)
df.DATE.dt.day_name().head()
0    Saturday
1    Saturday
2    Saturday
3    Saturday
4    Saturday
Name: DATE, dtype: object

Renaming Columns

You can rename columns using dictionaries as follows:

df = df.rename(columns={'DATE' : 'date'})
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
C/A UNIT SCP STATION LINENAME DIVISION date TIME DESC ENTRIES EXITS On_N_Line
0 A002 R051 02-00-00 59 ST NQR456W BMT 2018-08-25 00:00:00 REGULAR 6736067 2283184 True
1 A002 R051 02-00-00 59 ST NQR456W BMT 2018-08-25 04:00:00 REGULAR 6736087 2283188 True
2 A002 R051 02-00-00 59 ST NQR456W BMT 2018-08-25 08:00:00 REGULAR 6736105 2283229 True
3 A002 R051 02-00-00 59 ST NQR456W BMT 2018-08-25 12:00:00 REGULAR 6736180 2283314 True
4 A002 R051 02-00-00 59 ST NQR456W BMT 2018-08-25 16:00:00 REGULAR 6736349 2283384 True

Dropping Columns

You can also drop columns.

df = df.drop('C/A', axis=1) #If you don't pass the axis=1 parameter, pandas will try and drop a row with the specified index
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
UNIT SCP STATION LINENAME DIVISION date TIME DESC ENTRIES EXITS On_N_Line
0 R051 02-00-00 59 ST NQR456W BMT 2018-08-25 00:00:00 REGULAR 6736067 2283184 True
1 R051 02-00-00 59 ST NQR456W BMT 2018-08-25 04:00:00 REGULAR 6736087 2283188 True
2 R051 02-00-00 59 ST NQR456W BMT 2018-08-25 08:00:00 REGULAR 6736105 2283229 True
3 R051 02-00-00 59 ST NQR456W BMT 2018-08-25 12:00:00 REGULAR 6736180 2283314 True
4 R051 02-00-00 59 ST NQR456W BMT 2018-08-25 16:00:00 REGULAR 6736349 2283384 True

Setting a New Index

It can also be helpful to set an index such as when graphing.

df = df.set_index('date')
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
UNIT SCP STATION LINENAME DIVISION TIME DESC ENTRIES EXITS On_N_Line
date
2018-08-25 R051 02-00-00 59 ST NQR456W BMT 00:00:00 REGULAR 6736067 2283184 True
2018-08-25 R051 02-00-00 59 ST NQR456W BMT 04:00:00 REGULAR 6736087 2283188 True
2018-08-25 R051 02-00-00 59 ST NQR456W BMT 08:00:00 REGULAR 6736105 2283229 True
2018-08-25 R051 02-00-00 59 ST NQR456W BMT 12:00:00 REGULAR 6736180 2283314 True
2018-08-25 R051 02-00-00 59 ST NQR456W BMT 16:00:00 REGULAR 6736349 2283384 True

Summary

We've seen in this lesson the differences between Pandas (Series and DataFrames) and Python native (Dictionaries and Lists) data types. We've also looked at how to create the Series and DataFrames from dictionaries and lists, and how to manipulate both columns and the index in DataFrame.

dsc-enterprise-deloitte-dl-pandas-series-and-dataframes's People

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.