GithubHelp home page GithubHelp logo

table-scraper's Introduction

table-scraper

Do you need a data set? Are you lazy (I sure am!)? This table scraper can make it easier for you. By just answering a few prompt questions from the command line, you can easily scrape HTML tables from a website.

##Getting Started

So first, let's look at what the files do:

TableScraperScript.py: This is a scraper that I made to get some data off the Grammy's website. If you're working on a similar project and want to fork and edit it to suit your needs, awesome! Go right ahead. Otherwise, it may not be totally useful to you other than to serve as an example of how to build your own scraper. But I did use this template to build...

TableScraperEdit.py: This is sort of a global version of this scraper. If you run it in the command line and answer the questions it prompts you with, you can scrape HTML tables on the web. If this is what you're interested in using, you don't actually need the other files.

output.xml: The edited data set that TableScraperScript.py gave me.

Alright, so let's get started! First, download any files you want, just make sure you get TableScraperEdit.py.

Next, make sure you've got BeautifulSoup downloaded (if not, do apt-get python-bs4 or pip install beautifulsoup4).

Then, open up a window with the site you want to scrape and go ahead and inspect all up in the table you want to scrape (right click "Inspect Element").

Open a terminal window and enter in python TableScraperEdit.py (if you've downloaded the whole folder, make sure you enter cd table-scraper first). Now it's going to ask you a bunch of questions -- they're not terribly difficult but it's important to remember to PUT ALL OF YOUR ANSWERS IN QUOTATION MARKS. Otherwise, the program will get confused and cry little error tears all over you and you'll have to run the script again. You should also know that the script can currently only scrape a table that's up to six columns -- it's a limitation that I'm working on. Now, let's go through the questions:

Paste the url you want to use: Pretty easy. Just copy the url from the website you're scraping from, paste it in the terminal, put quotes around it and press enter. For example, I used search results from the Grammys winners website for my project, so I did "http://www.grammy.com/nominees/search?artist=&field_nominee_work_value=&year=All&genre=31".

What kind of tag are you looking for (class or id)?: Now we're starting to identify the table that the script should be looking through. This is where you should enter the table's css selector, usually either a class or an id. Here's what my table looks like in the element inspector: What my table looks like in the element inspector

So I'll enter "class".

What are the cells tagged as?: This means entering the specific class or id of your table. For example, you'll see that if you look up at the picture of my table's HTML, the class is "views-table cols-4", so that's what I'll enter.

What would you like to name the data file?: This not only determines the file's name but also its format. Right now, it's just a choice between xml and csv, so choose one of those formats and enter a name. For example, I entered "output.xml".

The terminal should print out that the data is writing for a little bit, and then that'll tell you that it's done. Go look for your file, and it should be there!

##Contributor Guidelines

This little project is just getting its legs, so please, feel free to contribute regardless of your level as a programmer! I only ask that you:

  • Be respectful of others. Use comments for constructive criticsm or positive remarks.
  • Do commits in your own fork and send me a pull request to merge any changes.
  • If you find an issue, please be descriptive - make sure to note what file the issue is in and what lines of code need to be fixed.
  • This project is primarily written in Python, so please indent with four spaces. Also, be descriptve by naming variables by what they do and by frequently using comments.

table-scraper's People

Watchers

Caroline Pate avatar

table-scraper's Issues

Create a visualization with this thing

That's why I created this project! It's not exactly integral to using the program, and the data I pulled was strings so it will take a bit more work to visualize...but it would be nice to be able to show people what they can do with scraped data.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.