Combining DataFrames With Pandas

Introduction

In this section, we'll learn about how to combine DataFrames with concatenation. We'll also learn how to read in tables from SQL databases and store them in DataFrames, as well as the various types of joins that exist and how we can perform them in pandas.

Objectives:

You will be able to:

  • Understand and explain when to use DataFrame joins and merges
  • Be able to use pd.merge when combining DataFrames based on column values
  • Understand, explain and use a range of DataFrame merge types: outer, inner, left and right
  • Use pd.concat() to stack DataFrames

Concatenating DataFrames

Recall that "concatenation" means adding the contents of a second collection onto the end of a first collection. We learned how to do this when working with strings. For instance:

print('Data ' + 'Science!')
# Output: "Data Science!"

Since strings are sequences in Python, we can concatenate them as above.

DataFrames are also collections, so it stands to reason that pandas provides an easy way to concatenate them. The pandas documentation on concatenation illustrates this with a diagram in which three DataFrames are concatenated, resulting in one larger DataFrame containing their contents in the order they were concatenated.

To perform a concatenation between two or more DataFrames, we pass a list of the DataFrames to concatenate to the pd.concat() function, as demonstrated below:

to_concat = [df1, df2, df3]
big_df = pd.concat(to_concat)

Note that there are many optional keyword arguments we can set with pd.concat(). For a full breakdown of all the ways we can use this function, take a look at the pandas documentation.
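As a minimal self-contained sketch (the small DataFrames below are made up for illustration), concatenation stacks the rows in the order given; passing ignore_index=True renumbers the resulting index:

```python
import pandas as pd

# Three small example DataFrames with the same column
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})
df3 = pd.DataFrame({"a": [5, 6]})

to_concat = [df1, df2, df3]
# ignore_index=True gives the combined DataFrame a fresh 0..5 index
big_df = pd.concat(to_concat, ignore_index=True)
print(list(big_df["a"]))  # [1, 2, 3, 4, 5, 6]
```

Without ignore_index=True, each DataFrame keeps its original index labels, so the combined index would repeat 0, 1 three times.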

Working With SQL Tables

Often, we'll want to load SQL tables directly into pandas to take advantage of all the functionality that DataFrames provide. This is easy to do, since SQL tables and pandas DataFrames have the same innate structure.

The easiest way to load in a table from a SQL database is to use the pd.read_sql_table() function. However, in order to use this, we first have to connect to the database in question using the sqlalchemy library. Don't worry too much about the details of SQL or sqlalchemy - we'll dig into both of them later in the course.

The following code demonstrates how to create an engine object using sqlalchemy that will connect to our database for us and allow us to use the pd.read_sql_table() method in pandas:

import pandas as pd
from sqlalchemy import create_engine

database_path = "some_database"
table_name = "some_table"

# Create an engine that connects to the SQLite database at database_path
engine = create_engine('sqlite:///' + database_path, echo=False)

df_from_table = pd.read_sql_table(table_name, engine)

Note that in this case, we need to provide both the path to the database during the creation of the engine object, as well as the table name to read in and store in a DataFrame when using the read_sql_table() method (as well as the engine object we created).

Once we have read in a table using this method, you'll have a regular pandas DataFrame to work with!
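A runnable end-to-end sketch, using an in-memory SQLite database and a hypothetical "people" table created just for this example (so there is something to read back):

```python
import pandas as pd
from sqlalchemy import create_engine

# "sqlite://" with no path creates a temporary in-memory database
engine = create_engine("sqlite://", echo=False)

# Write a small table so read_sql_table has something to load
pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}).to_sql(
    "people", engine, index=False)

# Read the table straight back into a DataFrame
df_from_table = pd.read_sql_table("people", engine)
print(df_from_table.shape)  # (2, 2)
```

Against a real database, you would skip the to_sql() step and point the engine at the database file instead.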

Keys and Indexes

Every table in a relational database typically has a column that serves as the Primary Key. This key acts as the index for that table. We'll use these keys, along with Foreign Keys, which point to a primary key value in another table, to execute Joins. This allows us to "line up" information from multiple tables and combine them into one table. We'll learn more about Primary Keys and Foreign Keys in the next section when we dive into SQL and relational databases, so don't worry too much about these concepts now.

Often, it is useful for us to set a column to act as the index for a DataFrame. To do this, we would type:

some_dataframe.set_index("name_of_index_column", inplace=True)

Note that this will mutate the dataset in place and set the column with the specified name as the index column of the DataFrame. If inplace is not specified it will default to False, meaning that a copy of the DataFrame with the requested changes will be returned, but the original object will remain unchanged.

NOTE: Running cells that make an inplace change more than once will often cause pandas to throw an error. If this happens, just restart the kernel.

By setting the index columns on DataFrames, we make it easy for us to join DataFrames later on. Note that this is not always feasible, but it's a useful step when possible.
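A short sketch of set_index(), using a hypothetical "employee_id" column standing in for a primary key:

```python
import pandas as pd

# Hypothetical table whose "employee_id" column acts like a primary key
df = pd.DataFrame({"employee_id": [101, 102],
                   "name": ["Ada", "Grace"]})

# Mutate df in place: "employee_id" becomes the index
df.set_index("employee_id", inplace=True)

print(df.index.name)        # employee_id
print(df.loc[101, "name"])  # Ada
```

After this call, rows can be looked up by key with df.loc, and the index is what .join() will match on.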

Types of Joins

Joins are always executed between a Left Table and a Right Table. There are four different types of join we can execute, and they are easy to conceptualize as Venn diagrams, with each circle representing one table's keys:

An Outer Join returns all records from both tables.

An Inner Join returns only the records with matching keys in both tables.

A Left Join returns all the records from the left table, as well as any records from the right table that have a matching key with a record from the left table.

A Right Join returns all the records from the right table, as well as any records from the left table that have a matching key with a record from the right table.

DataFrames contain a built-in .join() method. By default, the table calling the .join() method is always the left table. The following code snippet demonstrates how to execute a join in pandas:

joined_df = df1.join(df2, how='inner')

Note that to call .join(), we must pass in the right table. We can also set the type of join to perform with the how parameter. The options are 'left', 'right', 'inner', and 'outer'.

If how= is not specified, it defaults to 'left'.

NOTE: If both tables contain columns with the same name, the join will throw an error due to a naming collision, since the resulting table would have multiple columns with the same name. To solve this, pass in a value to lsuffix= or rsuffix=, which will append this suffix to the offending columns to resolve the naming collisions.
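A minimal sketch of this situation, with two made-up tables that share a "value" column and are joined on their indexes:

```python
import pandas as pd

left = pd.DataFrame({"value": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"value": [10, 30]}, index=["a", "c"])

# Both tables have a "value" column, so suffixes are required
# to avoid a naming collision in the result
inner = left.join(right, how="inner", lsuffix="_left", rsuffix="_right")

print(list(inner.columns))  # ['value_left', 'value_right']
print(list(inner.index))    # ['a']  (the only key in both tables)
```

With how='inner', only index label 'a' survives, since it is the only key present in both tables; how='outer' would instead keep 'a', 'b', and 'c', filling the gaps with NaN.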

Summary

In this section, we learned how to combine DataFrames in pandas: concatenating them with pd.concat(), reading SQL tables into DataFrames with pd.read_sql_table(), and performing outer, inner, left, and right joins.
