Working with Known JSON Schemas

Introduction

You've started taking a look at JSON files and you'll continue to explore how to navigate and traverse these files. One common use case of json files will be when you are connecting to various websites through their established APIs to retrieve data from them. With these, you are typically given a schema for how the data is structured and then will use this knowledge to retrieve pertinent information. In this lecture, you'll take a look at the response from the NY Times API.

Objectives

You will be able to:

Use the JSON module to load and parse JSON documents
Write data to predefined JSON schemas
Convert JSON to a pandas dataframe

Reading a JSON Schema

Here's the JSON schema provided for a section of the NY Times API:

or a more detailed view (truncated):

You can see that the master structure is a dictionary and has a key named 'response'. This is also a dictionary and has two keys: 'data' and 'meta'. As you continue to examine the schema hierarchy, you'll notice the vast majority, in this case, are dictionaries.

Loading the Data File

As you saw before, let's start by importing this data from the file. Here's how to open the file and load its contents.

import json

f = open('ny_times_response.json', 'r')
data = json.load(f)

print(type(data))
print(data.keys())

<class 'dict'>
dict_keys(['status', 'copyright', 'response'])

You should see that there are two additional keys 'status' and 'copyright' which were not shown in the schema documentation.

Loading Specific Data

Looking at the schema, you might be interested in retrieving a specific piece of data, such as the articles' headlines. Notice that this is a key under 'docs', which is under 'response'. So the schema is roughly: data['response']['docs']['headline']. While this is close to the code you'll use to extract headlines, something is a bit off. Notice that if you look closely at the schema outline, that the 'docs' subheading is actually a list. Each item within this list should be a dictionary with the keys shown above, but that is an important distinction. Breaking it into two steps you have:

docs = data['response']['docs']
print(type(docs), len(docs))

<class 'list'> 9

for doc in docs:
    print(doc['headline'])

{'main': "HIGGINS, SPENT $22,189.53.; Governor-Elect's Election Expenses -- Harrison $9,220.28.", 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'GARDEN BOUTS CANCELED; Mauriello Says He Could Not Be Ready on Nov. 3', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Stock Drop Is Biggest in 2 Months--Margin Rise Held Factor in Lightest Trading of 1955', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'MUSIC OF THE WEEK', 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Anacomp Inc. reports earnings for Qtr to March 31', 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Brooklyn Routs Yeshiva', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Albuquerque Program Gives Drinkers a Lift', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'Front Page 7 -- No Title', 'kicker': '1', 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}
{'main': 'UNIONS AND BUILDERS READY FOR LONG FIGHT; None of the Strikers Back - Lock-Out Soon in Effect. 23,000 ALREADY INVOLVED Orders Sent to Every Building Employer Within Twenty-five Miles -- House-smiths Vote Not to Strike.', 'kicker': None, 'content_kicker': None, 'print_headline': None, 'name': None, 'seo': None, 'sub': None}

Or if you want to just print the main headlines themselves:

for doc in docs:
    print(doc['headline']['main'])
    print('\n')

HIGGINS, SPENT $22,189.53.; Governor-Elect's Election Expenses -- Harrison $9,220.28.


GARDEN BOUTS CANCELED; Mauriello Says He Could Not Be Ready on Nov. 3


Stock Drop Is Biggest in 2 Months--Margin Rise Held Factor in Lightest Trading of 1955


MUSIC OF THE WEEK


Anacomp Inc. reports earnings for Qtr to March 31


Brooklyn Routs Yeshiva


Albuquerque Program Gives Drinkers a Lift


Front Page 7 -- No Title


UNIONS AND BUILDERS READY FOR LONG FIGHT; None of the Strikers Back - Lock-Out Soon in Effect. 23,000 ALREADY INVOLVED Orders Sent to Every Building Employer Within Twenty-five Miles -- House-smiths Vote Not to Strike.

Transforming JSON to Alternative Formats

You've also previously started to take a look at how to transform JSON to DataFrames. Investigating the schema, a good option for this could again be the 'docs' subheading. While this still has nested data itself, it's often easier to load the entire section as a dataframe and then use additional functions to break apart the internally nested data from there.

import pandas as pd

df = pd.DataFrame(data['response']['docs'])
df.head(3)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	_id	abstract	blog	byline	document_type	headline	keywords	multimedia	news_desk	print_page	pub_date	score	snippet	source	type_of_material	web_url	word_count
0	4fc04eb745c1498b0d23da00	Spent $22,200	{}	NaN	article	{'main': 'HIGGINS, SPENT $22,189.53.; Governor...	[{'name': 'persons', 'value': 'HIGGINS, LT. GO...	[]	NaN	2	1904-11-17T00:00:00Z	1	Spent $22,200	The New York Times	Article	https://query.nytimes.com/gst/abstract.html?re...	213
1	4fc21ebf45c1498b0d612b22	NaN	{}	NaN	article	{'main': 'GARDEN BOUTS CANCELED; Mauriello Say...	[]	[]	NaN	15	1944-10-23T00:00:00Z	1		The New York Times	Article	https://query.nytimes.com/gst/abstract.html?re...	149
2	4fc3b41d45c1498b0d7fd41e	NaN	{}	{'original': 'By JOHN G. FORREST', 'person': [...	article	{'main': 'Stock Drop Is Biggest in 2 Months--M...	[]	[]	NaN	F1	1955-05-15T00:00:00Z	1	Stock prices last week, on the lightest volume...	The New York Times	Article	https://query.nytimes.com/gst/abstract.html?re...	823

Breaking out nested data

Now that you have the data loaded, it's time to clean it up by breaking out some of the nested data. For example, you should notice that the headline entries are actualy dictionaries. You could transform these into singular data columns with something like this:

keys = df.headline.iloc[0].keys() #Get dictionary keys
#Keep track of columns we make for subsequent preview
new_cols = []
#Create a new feature for each of these keys
for key in keys:
    new_col = 'headline_{}'.format(key) #Create new column name
    df[new_col] = df.headline.map(lambda x: x[key]) #Create a new column
    new_cols.append(new_col)
df[new_cols].head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	headline_main	headline_kicker	headline_content_kicker	headline_print_headline	headline_name	headline_seo	headline_sub
0	HIGGINS, SPENT $22,189.53.; Governor-Elect's E...	None	None	None	None	None	None
1	GARDEN BOUTS CANCELED; Mauriello Says He Could...	1	None	None	None	None	None
2	Stock Drop Is Biggest in 2 Months--Margin Rise...	1	None	None	None	None	None
3	MUSIC OF THE WEEK	None	None	None	None	None	None
4	Anacomp Inc. reports earnings for Qtr to March 31	None	None	None	None	None	None

Wahoo! This is a good general strategy for transforming nested JSON: create a DataFrame and then break out nested features into their own column features.

Outputing to JSON

Finally, take a look at how you can write data back to JSON. Like loading, you first open a file (this time with write permission) and use the json package to transfer data to that file container.

with open('output.json', 'w') as f:
    json.dump(data, f)

Summary

There you have it! In this, you took another look at JSON, taking a look at an example schema diagram and retrieving information. You also looked at a general procedure for transforming nested data to Pandas DataFrames (create a DataFrame, and then break apart nested data using lambda functions to create additional columns). Finally, you also took a brief look at saving data to json files.

princess-1 / dsc-working-with-known-json-schemas-online-ds-sp-000 Goto Github PK

dsc-working-with-known-json-schemas-online-ds-sp-000's Introduction

Working with Known JSON Schemas

Introduction

Objectives

Reading a JSON Schema

Loading the Data File

Loading Specific Data

Transforming JSON to Alternative Formats

Breaking out nested data

Outputing to JSON

Summary

dsc-working-with-known-json-schemas-online-ds-sp-000's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs