br-scraper

This is a simple Python module for scraping tables from the baseball statistics website baseball-reference.com. Instantiate the BRScraper class with an optional server argument, which defaults to http://www.baseball-reference.com/. Then call the method parse_tables() with a page path (everything in the URL after the server and slash) to scrape the tables from that page. For example:

import brscraper
scraper = brscraper.BRScraper()
data = scraper.parse_tables("teams/BOS/2011.shtml")

data will contain a dictionary of the parseable tables scraped from the indicated page, the 2011 Boston Red Sox season page, keyed by HTML id. At the time of this writing, there were five such tables on that page. To access the team pitching data, which has the id team_pitching, use data["team_pitching"].

Each table is a list of dictionaries: each item in the list represents a row of the table, and the keys are the table's headers. So in the above example, to get the ERA of the first pitcher listed in the table, you would write:

>>> float(data["team_pitching"][0]["ERA"])
2.89
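To get a feel for this structure, it can help to list what each table contains before indexing into it. The following is a minimal sketch, not part of brscraper itself: summarize_tables is a hypothetical helper, and the sample dictionary only imitates the shape that parse_tables() is described as returning, using the values from the example above.

```python
def summarize_tables(data):
    """Map each table id to (row count, sorted headers of the first row)."""
    summary = {}
    for table_id, rows in data.items():
        # An empty table has no first row to read headers from.
        headers = sorted(rows[0].keys()) if rows else []
        summary[table_id] = (len(rows), headers)
    return summary


# Hand-written stand-in for the result of parse_tables(); a real page
# would yield several tables with many more rows and headers.
data = {"team_pitching": [{"": "Josh Beckett", "ERA": "2.89"}]}

for table_id, (row_count, headers) in summarize_tables(data).items():
    print(table_id, row_count, headers)
```

With real scraped data, this shows which ids and headers are available on a given page.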

If a header does not have a name, it is assigned an empty string as its key. Continuing the same example, baseball-reference at the time of this writing does not name the column containing player names, so to get that pitcher's name you would do the following:

>>> data["team_pitching"][0][""]
u'Josh Beckett'
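If the empty-string key is awkward to work with, you can rename it after scraping. This is a sketch under stated assumptions: rename_blank_header and the "Name" key are illustrative choices of mine, not something brscraper provides, and the sample row is written by hand from the example above.

```python
def rename_blank_header(rows, new_key="Name"):
    """Return copies of the rows with the empty-string key renamed."""
    renamed = []
    for row in rows:
        copy = dict(row)  # leave the scraped rows untouched
        if "" in copy:
            copy[new_key] = copy.pop("")
        renamed.append(copy)
    return renamed


# Hand-written row shaped like an entry of data["team_pitching"].
rows = [{"": "Josh Beckett", "ERA": "2.89"}]
named_rows = rename_blank_header(rows)
print(named_rows[0]["Name"])
```

The function copies each row, so the original scraped data is unchanged.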

Note that you need to do your own conversions: everything is stored as a string, and there is no attempt to guess whether a value should be a number or a string. In several cases numbers, strings, and/or blanks all appear in the same column, so you'll probably need exception handling on the conversions as well.
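One way to handle such mixed columns is a small conversion helper that falls back to a default instead of raising. The sketch below is only an illustration: to_float is a hypothetical helper, not part of brscraper, and the second row is a made-up placeholder to show a blank cell.

```python
def to_float(value, default=None):
    """Convert a scraped string to a float, or return default on failure."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # Blank cells, text such as team names, or missing keys end up here.
        return default


# Hand-written rows shaped like an entry of data["team_pitching"]; only
# the first row's values come from the example above.
rows = [
    {"": "Josh Beckett", "ERA": "2.89"},
    {"": "Placeholder Pitcher", "ERA": ""},  # made-up row with a blank cell
]

eras = [to_float(row.get("ERA")) for row in rows]
print(eras)
```

Returning None for unparseable values keeps the list aligned with the rows, so you can decide later whether to skip or report them.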

Some more examples can be seen in the unit test file brscraper_unittest.py, which is by no means comprehensive but shows a few of the different baseball-reference pages you can scrape data from.

This was an ad-hoc script that I figured might as well go on GitHub; I don't plan to update it regularly. Thanks as well to baseball-reference.com for creating a wonderful resource for baseball statheads.

br-scraper's People

Contributors

andrewblim
