dsc-enterprise-bain-combining-dataframes's Introduction

Combining DataFrames With Pandas

Introduction

In this section, you'll learn how to combine DataFrames with concatenation. You'll also learn how to read in tables from SQL databases and store them in DataFrames, as well as the various types of joins that exist and how to perform them in pandas.

Objectives

You will be able to:

  • Understand and explain when to use DataFrame joins and merges
  • Use pd.merge() when combining DataFrames based on column values
  • Understand, explain and use a range of DataFrame merge types: outer, inner, left and right
  • Use pd.concat() to stack DataFrames

Concatenating DataFrames

Recall that "concatenation" means adding the contents of a second collection onto the end of a first collection. You learned how to do this when working with strings. For instance:

print('Data ' + 'Science!')
# Output: "Data Science!"

Since strings are a type of collection in Python, you can concatenate them as above.

DataFrames are also collections, so it stands to reason that pandas provides an easy way to concatenate them. Examine the following diagram from the pandas documentation on concatenation:

In this example, three DataFrames have been concatenated, resulting in one larger DataFrame containing their contents in the order they were concatenated.

To concatenate two or more DataFrames, pass a list of the objects to concatenate to the pd.concat() function, as demonstrated below:

to_concat = [df1, df2, df3]
big_df = pd.concat(to_concat)

Note that pd.concat() accepts many optional keyword arguments. For a full breakdown of the ways you can use this function, take a look at the pandas documentation.
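
As a minimal sketch of pd.concat() in action, the example below stacks three small DataFrames. The frames, column names, and values are made up for illustration. It also shows one of the optional keyword arguments, ignore_index, which renumbers the rows instead of keeping each frame's original index labels.

```python
import pandas as pd

# Three small hypothetical DataFrames that share the same columns
df1 = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})
df2 = pd.DataFrame({"name": ["Cal"], "score": [70]})
df3 = pd.DataFrame({"name": ["Dee", "Eve"], "score": [88, 92]})

# Default behavior: rows are stacked in order and each frame keeps its own index labels
stacked = pd.concat([df1, df2, df3])

# ignore_index=True discards the old labels and numbers the rows from 0
renumbered = pd.concat([df1, df2, df3], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```

Duplicate index labels, as in the first result, are legal in pandas but can be surprising when you later select rows by label, so ignore_index=True is often the safer choice when the original labels carry no meaning.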

Keys and Indexes

In a relational database, a table typically has a column that serves as its Primary Key. In pandas, the index plays the role of the primary key for a DataFrame. You'll use these keys, along with the Foreign Key, which points to a primary key value in another table, to execute Joins. This allows us to "line up" information from multiple tables and combine it into one table. You'll learn more about Primary Keys and Foreign Keys in the near future when you dive into SQL and relational databases, so don't worry too much about these concepts now. That said, you can use similar functionality in pandas.

Often, it is useful for us to set a column to act as the index for a DataFrame. To do this, you would type:

some_dataframe.set_index("name_of_index_column", inplace=True)

Note that this will mutate the dataset in place and set the column with the specified name as the index column of the DataFrame. If inplace is not specified it will default to False, meaning that a copy of the DataFrame with the requested changes will be returned, but the original object will remain unchanged.
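
The difference between the two modes can be sketched as follows. The DataFrame, its column names, and its values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical table with an ID column that we want to use as the index
people = pd.DataFrame({"person_id": [101, 102, 103], "name": ["Ann", "Bob", "Cal"]})

# Without inplace=True, set_index returns a new DataFrame and leaves the original alone
indexed_copy = people.set_index("person_id")
columns_before = people.columns.tolist()
print(columns_before)           # ['person_id', 'name']
print(indexed_copy.index.name)  # person_id

# With inplace=True, the original object is modified and the call returns None
result = people.set_index("person_id", inplace=True)
print(result)                   # None
print(people.index.tolist())    # [101, 102, 103]
```

Because the in-place call returns None, assigning its result back to the variable (people = people.set_index(..., inplace=True)) would replace the DataFrame with None, which is a common mistake.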

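NOTE: Running a cell that makes an inplace change more than once will often cause pandas to raise an error. For example, a second call to set_index() fails because the column has already been moved into the index. If this happens, restart the kernel and run the cells again in order.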
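
To illustrate why a repeated in-place change fails, here is a small sketch using set_index() on a made-up DataFrame. The second call raises a KeyError because the column no longer exists as a regular column.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"person_id": [1, 2], "name": ["Ann", "Bob"]})

# First run of the "cell": person_id moves out of the columns and into the index
df.set_index("person_id", inplace=True)

# Second run of the same "cell": the column is gone, so pandas raises KeyError
second_call_failed = False
try:
    df.set_index("person_id", inplace=True)
except KeyError as err:
    second_call_failed = True
    print("Second call failed:", err)

# One possible guard: only set the index if the column is still present
if "person_id" in df.columns:
    df.set_index("person_id", inplace=True)
```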

By setting the index columns on DataFrames, you make it easy to join DataFrames later on. Note that this is not always feasible, but it's a useful step when possible.

Types of Joins

Joins are always executed between a Left Table and a Right Table. There are four different types of Joins you can execute. Consider the following Venn Diagrams:

When thinking about Joins, it is easy to conceptualize them as Venn Diagrams.

An Outer Join returns all records from both tables.

An Inner Join returns only the records with matching keys in both tables.

A Left Join returns all the records from the left table, as well as any records from the right table that have a matching key with a record from the left table.

A Right Join returns all the records from the right table, as well as any records from the left table that have a matching key with a record from the right table.

DataFrames have a built-in .join() method. The DataFrame calling .join() is always the left table, and by default rows are matched on the index. The following code snippet demonstrates how to execute a join in pandas:

joined_df = df1.join(df2, how='inner')

Note that to call .join(), you must pass in the right table. You can also set the type of join to perform with the how parameter. The options are 'left', 'right', 'inner', and 'outer'.

If how= is not specified, it defaults to 'left'.
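
The four join types can be compared side by side with a short sketch. The two tables below are hypothetical; they share an index named person_id, with keys 2 and 3 present in both tables, key 1 only in the left table, and key 4 only in the right table.

```python
import pandas as pd

# Hypothetical tables indexed by a shared key
left = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cal"]},
    index=pd.Index([1, 2, 3], name="person_id"),
)
right = pd.DataFrame(
    {"city": ["Oslo", "Rome", "Lima"]},
    index=pd.Index([2, 3, 4], name="person_id"),
)

# Collect the keys that survive each type of join
keys_by_join = {}
for how in ["inner", "outer", "left", "right"]:
    joined = left.join(right, how=how)
    keys_by_join[how] = sorted(joined.index.tolist())
    print(how, keys_by_join[how])

# Omitting how= gives the same result as how='left'
default_join = left.join(right)
print(default_join.equals(left.join(right, how="left")))
```

In the outer, left, and right joins, rows without a match on the other side are kept and the missing values are filled with NaN. For example, person 1 has no city in the left join.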

NOTE: If both tables contain columns with the same name, the join will throw an error due to a naming collision, since the resulting table would have multiple columns with the same name. To solve this, pass in a value to lsuffix= or rsuffix=, which will append this suffix to the offending columns to resolve the naming collisions.
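
A naming collision and its fix might look like the sketch below. The two DataFrames and the suffix strings are invented for this example; any suffixes that make the column names unique would work.

```python
import pandas as pd

# Two hypothetical tables that both have a column called 'sales'
q1 = pd.DataFrame({"sales": [10, 20]}, index=["east", "west"])
q2 = pd.DataFrame({"sales": [15, 25]}, index=["east", "west"])

# Joining without suffixes raises ValueError because 'sales' appears in both tables
collision_detected = False
try:
    q1.join(q2)
except ValueError as err:
    collision_detected = True
    print("Join failed:", err)

# Supplying suffixes renames the overlapping columns so the join can succeed
both = q1.join(q2, lsuffix="_q1", rsuffix="_q2")
print(both.columns.tolist())  # ['sales_q1', 'sales_q2']
```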

Summary

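In this section, you learned how to stack multiple DataFrames with pd.concat(), how to set a column as the index of a DataFrame, and how to combine DataFrames with inner, outer, left, and right joins using the .join() method.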
