
This project forked from learn-co-students/dsc-working-with-known-json-schemas-online-ds-sp-000


Working with Known JSON Schemas

Introduction

You've started taking a look at JSON files, and you'll continue to explore how to navigate and traverse them. One common use case for JSON files is connecting to websites through their established APIs to retrieve data. With these, you are typically given a schema describing how the data is structured, and you then use this knowledge to retrieve the pertinent information. In this lesson, you'll take a look at a response from the NY Times API.

Objectives

You will be able to:

  • Use the JSON module to load and parse JSON documents
  • Write data to predefined JSON schemas
  • Convert JSON to a pandas dataframe

Reading a JSON Schema

Here's the JSON schema provided for a section of the NY Times API:

or a more detailed view (truncated):

You can see that the top-level structure is a dictionary with a key named 'response'. Its value is also a dictionary, with two keys: 'docs' and 'meta'. As you continue down the schema hierarchy, you'll notice that the vast majority of entries, in this case, are dictionaries.
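To make the idea of reading a schema hierarchy concrete, here is a minimal sketch that walks a nested structure and reports the type found under each key. The `sample` dictionary and the `describe` helper are hypothetical stand-ins invented for illustration; the key names mirror the ones used in the code later in this lesson, and the real response has many more fields.

```python
# Hypothetical miniature of the response; the real payload has many more fields.
sample = {
    "response": {
        "docs": [{"headline": {"main": "Example headline"}, "word_count": 213}],
        "meta": {"hits": 1},
    }
}


def describe(node, depth=0, max_depth=3):
    """Return one line per key, indented by depth, naming each value's type."""
    lines = []
    if depth > max_depth:
        return lines
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append("  " * depth + f"{key}: {type(value).__name__}")
            lines.extend(describe(value, depth + 1, max_depth))
    elif isinstance(node, list) and node:
        # Only inspect the first element, assuming list items share one shape.
        lines.append("  " * depth + f"[0]: {type(node[0]).__name__}")
        lines.extend(describe(node[0], depth + 1, max_depth))
    return lines


outline = describe(sample)
print("\n".join(outline))
```

An outline like this is only a quick orientation aid. It assumes every item in a list has the same shape as the first one, which the API documentation may or may not guarantee.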

Loading the Data File

As you saw before, start by loading the data from the file. Here's how to open the file and load its contents.

import json
# Use a context manager so the file is closed once the data is loaded
with open('ny_times_response.json', 'r') as f:
    data = json.load(f)
print(type(data))
print(data.keys())
<class 'dict'>
dict_keys(['status', 'copyright', 'response'])

You should see that there are two additional keys 'status' and 'copyright' which were not shown in the schema documentation.
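One way to spot such differences systematically is to compare the keys you actually received with the keys the documentation led you to expect. The sketch below uses a small hypothetical dictionary in place of the loaded file, and the `documented` set is an assumption based on the schema described above.

```python
# Top-level keys shown in the schema documentation (assumed for this sketch)
documented = {"response"}

# Hypothetical stand-in for the dictionary returned by json.load()
data = {"status": "OK", "copyright": "placeholder text", "response": {}}

# Keys present in the data but not in the documentation, and vice versa
undocumented = sorted(set(data) - documented)
missing = sorted(documented - set(data))

print("Undocumented keys:", undocumented)
print("Missing keys:", missing)
```

Set differences keep the check short, and the same pattern can be applied one level down, for example to the keys of each item in a nested list.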

Loading Specific Data

Looking at the schema, you might be interested in retrieving a specific piece of data, such as the articles' headlines. Notice that 'headline' is a key under 'docs', which is under 'response', so the path is roughly data['response']['docs']['headline']. While this is close to the code you'll use to extract headlines, something is a bit off: if you look closely at the schema outline, the 'docs' entry is actually a list. Each item within this list should be a dictionary with the keys shown above, so you need to iterate over the list before indexing by key. Breaking it into two steps, you have:

docs = data['response']['docs']
print(type(docs), len(docs))
<class 'list'> 9
for doc in docs:
    print(doc['headline'])
{'main': "HIGGINS, SPENT $22,189.53.; Governor-Elect's Election Expenses -- Harrison $9,220.28.", 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'GARDEN BOUTS CANCELED; Mauriello Says He Could Not Be Ready on Nov. 3', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Stock Drop Is Biggest in 2 Months--Margin Rise Held Factor in Lightest Trading of 1955', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'MUSIC OF THE WEEK', 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Anacomp Inc. reports earnings for Qtr to March 31', 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Brooklyn Routs Yeshiva', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Albuquerque Program Gives Drinkers a Lift', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Front Page 7 -- No Title', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'UNIONS AND BUILDERS READY FOR LONG FIGHT; None of the Strikers Back - Lock-Out Soon in Effect. 23,000 ALREADY INVOLVED Orders Sent to Every Building Employer Within Twenty-five Miles -- House-smiths Vote Not to Strike.', 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}

Or if you want to just print the main headlines themselves:

for doc in docs:
    print(doc['headline']['main'])
    print('\n')
HIGGINS, SPENT $22,189.53.; Governor-Elect's Election Expenses -- Harrison $9,220.28.


GARDEN BOUTS CANCELED; Mauriello Says He Could Not Be Ready on Nov. 3


Stock Drop Is Biggest in 2 Months--Margin Rise Held Factor in Lightest Trading of 1955


MUSIC OF THE WEEK


Anacomp Inc. reports earnings for Qtr to March 31


Brooklyn Routs Yeshiva


Albuquerque Program Gives Drinkers a Lift


Front Page 7 -- No Title


UNIONS AND BUILDERS READY FOR LONG FIGHT; None of the Strikers Back - Lock-Out Soon in Effect. 23,000 ALREADY INVOLVED Orders Sent to Every Building Employer Within Twenty-five Miles -- House-smiths Vote Not to Strike.
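If you would rather collect the headlines than print them, a list comprehension does the same traversal. The sketch below also guards against records that lack a 'headline' entry. The `docs` list here is a shortened, partly hypothetical stand-in (the third record is deliberately malformed), not the real file contents.

```python
# Shortened stand-in for data['response']['docs']
docs = [
    {"headline": {"main": "MUSIC OF THE WEEK", "kicker": None}},
    {"headline": {"main": "Brooklyn Routs Yeshiva", "kicker": "1"}},
    {"web_url": "record without a headline"},  # hypothetical malformed record
]

# `or {}` covers both a missing 'headline' key and a 'headline' that is None
headlines = [(doc.get("headline") or {}).get("main") for doc in docs]
print(headlines)
```

Whether to tolerate missing keys like this or let a KeyError surface is a judgment call: failing loudly is often preferable when the schema is supposed to be guaranteed.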

Transforming JSON to Alternative Formats

You've also previously started to take a look at how to transform JSON to DataFrames. Looking at the schema, a good candidate for this is again the 'docs' list. While its items still contain nested data, it's often easier to load the entire list as a DataFrame and then use additional functions to break apart the internally nested data from there.

import pandas as pd
df = pd.DataFrame(data['response']['docs'])
df.head(3)
| | _id | abstract | blog | byline | document_type | headline | keywords | multimedia | news_desk | print_page | pub_date | score | snippet | source | type_of_material | web_url | word_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4fc04eb745c1498b0d23da00 | Spent $22,200 | {} | NaN | article | {'main': 'HIGGINS, SPENT $22,189.53.; Governor... | [{'name': 'persons', 'value': 'HIGGINS, LT. GO... | [] | NaN | 2 | 1904-11-17T00:00:00Z | 1 | Spent $22,200 | The New York Times | Article | https://query.nytimes.com/gst/abstract.html?re... | 213 |
| 1 | 4fc21ebf45c1498b0d612b22 | NaN | {} | NaN | article | {'main': 'GARDEN BOUTS CANCELED; Mauriello Say... | [] | [] | NaN | 15 | 1944-10-23T00:00:00Z | 1 | | The New York Times | Article | https://query.nytimes.com/gst/abstract.html?re... | 149 |
| 2 | 4fc3b41d45c1498b0d7fd41e | NaN | {} | {'original': 'By JOHN G. FORREST', 'person': [... | article | {'main': 'Stock Drop Is Biggest in 2 Months--M... | [] | [] | NaN | F1 | 1955-05-15T00:00:00Z | 1 | Stock prices last week, on the lightest volume... | The New York Times | Article | https://query.nytimes.com/gst/abstract.html?re... | 823 |
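Before breaking anything apart, it can help to find out which columns still hold nested values. The following sketch flags every column containing at least one dict or list. The tiny `docs` list and its `_id` values are hypothetical placeholders, not rows from the real file.

```python
import pandas as pd

# Hypothetical stand-in for data['response']['docs']
docs = [
    {"_id": "a1", "headline": {"main": "MUSIC OF THE WEEK"}, "keywords": [], "word_count": 213},
    {"_id": "b2", "headline": {"main": "Brooklyn Routs Yeshiva"}, "keywords": [], "word_count": 149},
]
df = pd.DataFrame(docs)

# A column counts as nested if any of its cells is a dict or a list
nested_cols = [
    col for col in df.columns
    if df[col].map(lambda value: isinstance(value, (dict, list))).any()
]
print(nested_cols)
```

Columns reported here are the candidates for further unpacking; scalar columns such as the word count can be used as they are.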

Breaking out nested data

Now that you have the data loaded, it's time to clean it up by breaking out some of the nested data. For example, you should notice that the headline entries are actually dictionaries. You could transform these into separate columns with something like this:

# Get the dictionary keys from the first headline
keys = df.headline.iloc[0].keys()
# Keep track of the columns we make for the subsequent preview
new_cols = []
# Create a new column for each of these keys
for key in keys:
    new_col = 'headline_{}'.format(key)  # New column name
    df[new_col] = df.headline.map(lambda x: x[key])  # Pull this key out of each headline dict
    new_cols.append(new_col)
df[new_cols].head()
| | headline_main | headline_kicker | headline_content_kicker | headline_print_headline | headline_name | headline_seo | headline_sub |
|---|---|---|---|---|---|---|---|
| 0 | HIGGINS, SPENT $22,189.53.; Governor-Elect's E... | None | None | None | None | None | None |
| 1 | GARDEN BOUTS CANCELED; Mauriello Says He Could... | 1 | None | None | None | None | None |
| 2 | Stock Drop Is Biggest in 2 Months--Margin Rise... | 1 | None | None | None | None | None |
| 3 | MUSIC OF THE WEEK | None | None | None | None | None | None |
| 4 | Anacomp Inc. reports earnings for Qtr to March 31 | None | None | None | None | None | None |

Wahoo! This is a good general strategy for transforming nested JSON: create a DataFrame and then break out nested features into their own column features.
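As a possible alternative to the manual loop, pandas also provides `pd.json_normalize`, which flattens nested dictionaries into columns in one call. This is a swapped-in technique rather than the approach used in this lesson, and the `docs` list below is a hypothetical miniature of the real data.

```python
import pandas as pd

# Hypothetical stand-in for data['response']['docs']
docs = [
    {"_id": "a1", "headline": {"main": "MUSIC OF THE WEEK", "kicker": None}, "word_count": 213},
    {"_id": "b2", "headline": {"main": "Brooklyn Routs Yeshiva", "kicker": "1"}, "word_count": 149},
]

# sep='_' joins parent and child keys, giving names like 'headline_main'
flat = pd.json_normalize(docs, sep="_")
print(flat.columns.tolist())
print(flat["headline_main"].tolist())
```

Note that `json_normalize` expands nested dictionaries but leaves lists (such as the keywords) as single cells unless you ask for them explicitly, so the manual approach can still be useful for finer control.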

Outputting to JSON

Finally, take a look at how you can write data back to JSON. As with loading, you first open a file (this time in write mode) and then use the json package's dump() function to write the data to that file object.

with open('output.json', 'w') as f:
    json.dump(data, f)
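A quick way to convince yourself that nothing was lost is to read the file back and compare it with the original object. The sketch below does this round trip in a temporary directory with a small hypothetical `data` dictionary; the `indent=2` argument is optional and only makes the file easier to read.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the loaded NY Times response
data = {"response": {"docs": [{"headline": {"main": "MUSIC OF THE WEEK"}}]}}

# Write to a temporary directory so the sketch does not touch your working files
path = os.path.join(tempfile.mkdtemp(), "output.json")
with open(path, "w") as f:
    json.dump(data, f, indent=2)

# Read the file back and confirm the round trip preserved the structure
with open(path) as f:
    reloaded = json.load(f)

print(reloaded == data)
```

The comparison holds for data made of dictionaries, lists, strings, numbers, booleans, and None. Other Python types (tuples or non-string dictionary keys, for example) do not survive a JSON round trip unchanged.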

Summary

There you have it! In this lesson, you took another look at JSON, working through an example schema diagram and retrieving information from it. You also looked at a general procedure for transforming nested data to pandas DataFrames (create a DataFrame, and then break apart nested data using lambda functions to create additional columns). Finally, you took a brief look at saving data to JSON files.

Contributors: fpolchow, lmcm18, loredirick, mas16, mathymitchell
