XML

Introduction

In this lecture, you'll continue investigating new formats for datasets. Specifically, you'll investigate another of the most popular data formats for the web: XML.

Objectives

You will be able to:

  • Use the XML module to load and parse XML data
  • Compare and contrast JSON and XML as data interchange types

XML

XML stands for 'Extensible Markup Language'. You may note the acronym's similarity to HTML (HyperText Markup Language). While HTML describes how to display a page, XML is used to store and transport the data itself. Like HTML, XML uses tags to separate and organize data in a hierarchical manner. Here's a brief preview of an XML file:

Loading XML Data

Prebuilt Python modules exist that will give you a powerful starting point for accessing and manipulating the underlying data within XML files.

The XML Module

You can check out the full details of the XML package here:
https://docs.python.org/3.6/library/xml.html#
but for now, you'll simply be using a submodule, ElementTree:
https://docs.python.org/3.6/library/xml.etree.elementtree.html#module-xml.etree.ElementTree

Notice the nested structure of the XML file:

Compare and contrast this nested data structure with the brief preview of the same file above, now in JSON:

JSON files are much simpler to read than XML files! Nonetheless, learning how to work with XML files will come in handy when learning to parse HTML, which you'll encounter soon in the section about web scraping.
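
For a rough side-by-side feel, here is a minimal sketch that parses the same small record from both formats. The record, its field names, and its values are hypothetical and are not taken from the lesson's dataset.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration (not from the lesson's dataset)
xml_text = """<candidate id="42">
    <name>Jane Doe</name>
    <office>Mayor</office>
</candidate>"""

json_text = '{"id": 42, "name": "Jane Doe", "office": "Mayor"}'

# XML: values live in attributes and in the text of child elements, always as strings
xml_root = ET.fromstring(xml_text)
xml_record = {"id": xml_root.attrib["id"]}
for child in xml_root:
    xml_record[child.tag] = child.text

# JSON: the parsed result is already a dictionary, and numbers keep their type
json_record = json.loads(json_text)

print(xml_record)
print(json_record)
```

Note that the XML version needs a small loop to reach the same dictionary shape, and its id comes back as the string '42' rather than the integer 42.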

Parsing XML files

When parsing the data, you'll have to navigate through this hierarchical structure. This is the idea behind the ElementTree submodule. You'll start with a root node and then iterate over its children, each of which has a tag (the name in <angle_brackets>), a dictionary of attributes (the key="value" pairs inside the opening tag), and optionally some text (the data between the opening and closing tags: <start> data </start>).
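
To make the parts of an element concrete, here is a minimal sketch on a hypothetical one-element document (the content is invented). In ElementTree, .tag holds the element's name, .attrib holds the key-value pairs from the opening tag, and .text holds the text between the opening and closing tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document, invented for illustration
sample_element = ET.fromstring('<row _id="1">some data</row>')

print(sample_element.tag)     # the name inside the angle brackets
print(sample_element.attrib)  # dictionary of attributes from the opening tag
print(sample_element.text)    # the text between the opening and closing tags
```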

import xml.etree.ElementTree as ET

First, you create the tree and retrieve its root element.

tree = ET.parse('nyc_2001_campaign_finance.xml')
root = tree.getroot()
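
If the XML is already in memory rather than in a file, a similar result can be reached with ET.fromstring(). The sketch below uses a hypothetical XML string and an in-memory file object, so it runs without the lesson's data file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical XML content, invented for illustration
sample_xml = "<response><row><row _id='1'/></row></response>"

# ET.parse() accepts a filename or a file-like object and returns an ElementTree
sample_tree = ET.parse(io.StringIO(sample_xml))
root_from_file = sample_tree.getroot()

# ET.fromstring() parses a string and returns the root Element directly
root_from_string = ET.fromstring(sample_xml)

print(root_from_file.tag, root_from_string.tag)
```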

Afterwards, you can iterate through the root node's children:

for child in root:
    print(child.tag, child.attrib)
row {}

Due to the nested structure, you often have to dig further down the tree:

# count limits how many grandchildren are printed in total
count = 0
for child in root:
    print('Child:\n')
    print(child.tag, child.attrib)
    print('Grandchildren:')
    for grandchild in child:
        count += 1
        if count < 10:
            print(grandchild.tag, grandchild.attrib)
    print('\n\n')
Child:

row {}
Grandchildren:
row {'_id': '1', '_uuid': 'E3E9CC9F-7443-43F6-94AF-B5A0F802DBA1', '_position': '1', '_address': 'https://data.cityofnewyork.us/resource/_8dhd-zvi6/1'}
row {'_id': '2', '_uuid': '9D257416-581A-4C42-85CC-B6EAD9DED97F', '_position': '2', '_address': 'https://data.cityofnewyork.us/resource/_8dhd-zvi6/2'}
row {'_id': '3', '_uuid': 'B80D7891-93CF-49E8-86E8-182B618E68F2', '_position': '3', '_address': 'https://data.cityofnewyork.us/resource/_8dhd-zvi6/3'}
row {'_id': '4', '_uuid': 'BB012003-78F5-406D-8A87-7FF8A425EE3F', '_position': '4', '_address': 'https://data.cityofnewyork.us/resource/_8dhd-zvi6/4'}
row {'_id': '5', '_uuid': '945825F9-2F5D-47C2-A16B-75B93E61E1AD', '_position': '5', '_address': 'https://data.cityofnewyork.us/resource/_8dhd-zvi6/5'}
row {'_id': '6', '_uuid': '9546F502-39D6-4340-B37E-60682EB22274', '_position': '6', '_address': 'https://data.cityofnewyork.us/resource/_8dhd-zvi6/6'}
row {'_id': '7', '_uuid': '4B6C74AD-17A0-4B7E-973A-2592D68A687D', '_position': '7', '_address': 'https://data.cityofnewyork.us/resource/_8dhd-zvi6/7'}
row {'_id': '8', '_uuid': 'ABD22A5E-B8DA-446F-82BC-93AA11AF99DF', '_position': '8', '_address': 'https://data.cityofnewyork.us/resource/_8dhd-zvi6/8'}
row {'_id': '9', '_uuid': '7CD36FB5-600F-44F5-A10C-CB3434B6805F', '_position': '9', '_address': 'https://data.cityofnewyork.us/resource/_8dhd-zvi6/9'}

Because of this nesting, ElementTree also provides a convenience method, .iter(), that iterates over an element and all of its descendants, regardless of depth.

count = 0
for element in root.iter():
    count += 1
    if count < 10:
        print(element.tag, element.attrib)
response {}
row {}
row {'_id': '1', '_uuid': 'E3E9CC9F-7443-43F6-94AF-B5A0F802DBA1', '_position': '1', '_address': 'https://data.cityofnewyork.us/resource/_8dhd-zvi6/1'}
candid {}
candname {}
officeboro {}
canclass {}
row {'_id': '2', '_uuid': '9D257416-581A-4C42-85CC-B6EAD9DED97F', '_position': '2', '_address': 'https://data.cityofnewyork.us/resource/_8dhd-zvi6/2'}
election {}

With some finesse, you could also extract the attributes of all of these row tags into a pandas DataFrame:

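import pandas as pd

dfs = []
for n, element in enumerate(root.iter('row')):
    # Skip the first match: it is the outer container row, which has no attributes
    if n > 0:
        # Build a one-row DataFrame from this row's attribute dictionary
        dfs.append(pd.DataFrame.from_dict(element.attrib, orient='index').transpose())
# Stack the one-row DataFrames into a single DataFrame
df = pd.concat(dfs)
print(len(df))
df.head()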
285
_id _uuid _position _address
0 1 E3E9CC9F-7443-43F6-94AF-B5A0F802DBA1 1 https://data.cityofnewyork.us/resource/_8dhd-z...
0 2 9D257416-581A-4C42-85CC-B6EAD9DED97F 2 https://data.cityofnewyork.us/resource/_8dhd-z...
0 3 B80D7891-93CF-49E8-86E8-182B618E68F2 3 https://data.cityofnewyork.us/resource/_8dhd-z...
0 4 BB012003-78F5-406D-8A87-7FF8A425EE3F 4 https://data.cityofnewyork.us/resource/_8dhd-z...
0 5 945825F9-2F5D-47C2-A16B-75B93E61E1AD 5 https://data.cityofnewyork.us/resource/_8dhd-z...
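
The attributes above only carry row metadata. The values themselves sit in the text of each row's child elements. A minimal sketch of collecting both into one record per row follows; the document is hypothetical, with tag names borrowed from the earlier output and invented values.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tag names follow the earlier output, values are invented
sample_xml = """<response>
  <row>
    <row _id="1"><candid>A1</candid><candname>Doe, Jane</candname></row>
    <row _id="2"><candid>B2</candid><candname>Roe, Richard</candname></row>
  </row>
</response>"""
sample_root = ET.fromstring(sample_xml)

records = []
for row in sample_root.iter('row'):
    if not row.attrib:
        continue  # skip the outer container <row>, which has no attributes
    record = dict(row.attrib)           # start with the row's attributes
    for field in row:                   # then add each child's tag and text
        record[field.tag] = field.text
    records.append(record)

print(records)
```

A list of dictionaries like this can be passed directly to pd.DataFrame(records) to get one row per record.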

Phew!

As you can see, parsing XML can get a bit complicated. It's a useful warm-up for web scraping, as HTML has a similar nested structure that you'll need to navigate. That said, although XML is still widely used, JSON has become the more common format for exchanging data on the web.

Files using the JSON format are generally more compact and easier to read than their XML equivalents. The JSON format was introduced after XML, and was meant to streamline many of the data interchange issues that existed at the time.

Summary

As you can see, there's still a lot going on here with the deeply nested structure of some of these data files. In the upcoming lab, you'll get a chance to practice loading files and previewing the data as you did here.
