recipebook

About

This is a simple application for scraping and parsing food recipe data found on the web in hRecipe format, producing results in json.

This project was inspired by this answer to a query for an open database of recipes.

Contribute your favorite site by implementing a RecipeParser class for it, and make a pull request.
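
As a rough starting point, a new parser might look like the following sketch; the base-class import path and the override methods shown are assumptions, so mirror an existing parser in the sites folder (e.g. sites/epicurious.py) for the actual contract:

# hypothetical skeleton for a new site parser -- the import path and the
# override methods below are assumptions, not the repository's actual API
from parser import RecipeParser

class MyRecipeSite(RecipeParser):

    def getTitle(self):        # assumed override: return the recipe title
        pass

    def getIngredients(self):  # assumed override: return the ingredient list
        pass

    def getDirections(self):   # assumed override: return the directions list
        pass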

Data

Recipes collected using the crawler are now available in a parallel repository.

Usage

The current version uses Python 3, which is easily installed with Anaconda: use the Python 3.x command line installer, and optionally, the PyCharm IDE (the community edition is free).

Individual recipes

Import the class corresponding to the site you want, and use the recipe URL in its constructor.

Here's an example to fetch and parse the Chocolate, Almond, and Banana Parfaits recipe from Epicurious:

>>> import sys; sys.path.append('sites')
>>> from epicurious import Epicurious
>>> recipe = Epicurious("http://www.epicurious.com/recipes/food/views/Chocolate-Almond-and-Banana-Parfaits-357369")

Use the save() method to write the recipe to a file as a json object.

The file name is derived from the URL. The output folder is defined in the settings.py file as OUTPUT_FOLDER, and can be overridden by creating a local_settings.py file (an example follows the output below):

>>> recipe.save()

This creates /tmp/chocolate-almond-and-banana-parfaits-357369.json with these contents:

{
    "directions": [
        "Heat chocolate chips and 4 tablespoons cream in microwave in 1-cup glass measuring cup at 50 percent power just until chocolate is melted, about 30 to 35 seconds. Stir to blend; cool chocolate sauce to lukewarm. Whisk mascarpone, amaretto, sugar, and remaining 2 tablespoons cream in medium bowl until blended and mixture just starts to thicken.",
        "Using 2 1/2-inch-diameter cookie cutter, cut out round from each angel food cake slice. Place 1 cake round in each of 4 wine goblets or old-fashioned glasses. Top each cake round with 3 banana slices, 1 heaping tablespoon mascarpone mixture, bittersweet chocolate sauce, and sprinkling of almonds. Repeat parfait layering 1 more time and serve."
    ],
    "ingredients": [
        "1/2 cup bittersweet chocolate chips",
        "6 tablespoons heavy whipping cream, divided",
        "3/4 cup mascarpone cheese",
        "3 tablespoons amaretto",
        "2 tablespoons sugar",
        "8 1/2-inch-thick angel food cake slices",
        "24 1/3-inch-thick diagonal banana slices (from about 3 bananas)",
        "1/3 cup (about) sliced almonds, toasted"
    ],
    "language": "en-US",
    "source": "www.epicurious.com",
    "tags": [
        "Chocolate",
        "Dessert",
        "Quick & Easy",
        "High Fiber",
        "Banana",
        "Almond",
        "Amaretto",
        "Shower",
        "Party",
        "Vegetarian",
        "Pescatarian",
        "Peanut Free",
        "Soy Free",
        "Kosher"
    ],
    "title": "Chocolate, Almond, and Banana Parfaits",
    "url": "http://www.epicurious.com/recipes/food/views/Chocolate-Almond-and-Banana-Parfaits-357369"
}
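
For example, to change where the json files are written, create a local_settings.py alongside settings.py and redefine OUTPUT_FOLDER (the path below is only an illustration):

# local_settings.py -- overrides the defaults in settings.py
OUTPUT_FOLDER = '/home/alice/recipes'  # example path; the run above used the default, /tmp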

Crawling

Most sites offer related links within each recipe.

From the example above, the getOtherRecipeLinks() method produces more URLs to fetch:

>>> recipe.getOtherRecipeLinks()
['http://www.epicurious.com/recipes/food/views/chocolate-amaretto-souffles-104730', 'http://www.epicurious.com/recipes/food/views/coffee-almond-ice-cream-cake-with-dark-chocolate-sauce-11036', 'http://www.epicurious.com/recipes/food/views/toasted-almond-mocha-ice-cream-tart-12550', 'http://www.epicurious.com/recipes/food/views/chocolate-marble-cheesecake-241488', 'http://www.epicurious.com/recipes/food/views/hazelnut-dome-cake-4246']

The crawler.py application takes advantage of this by visiting each related recipe link in parallel, getting even more recipe links, fetching each of those, and so on.

Kick it off with a specific site and a file of initial seed links, and it will automatically fetch and parse all the related links it finds, without repeating the same link twice.
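
Conceptually, that crawl is a loop over a work queue with a set of visited links; here is a minimal single-threaded sketch of the same logic, using the class and methods from the example above (crawler.py itself does this with parallel worker threads):

import sys
sys.path.append('sites')
from epicurious import Epicurious

# seed the queue, then follow related links without repeating any of them
queue = ['http://www.epicurious.com/recipes/food/views/Chocolate-Almond-and-Banana-Parfaits-357369']
seen = set()
while queue:
    url = queue.pop()
    if url in seen:
        continue                # skip links already fetched
    seen.add(url)
    recipe = Epicurious(url)    # fetch and parse this recipe
    recipe.save()               # write its json file to OUTPUT_FOLDER
    queue.extend(recipe.getOtherRecipeLinks())  # enqueue more related links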

From the example above, here is how to start the crawler with four parallel worker threads.

The file /tmp/epi.link passed in the second argument contains the seed URL http://www.epicurious.com/recipes/food/views/Chocolate-Almond-and-Banana-Parfaits-357369 for this example, though it could contain more links, too.

It is also a good idea to capture the output into a log file, as shown here, in order to see the full list of parsed recipes, along with any error messages.

python crawler.py Epicurious /tmp/epi.link 4 > epicurious.log 2>&1

By default, all the json files are written to the OUTPUT_FOLDER folder specified in settings.py or local_settings.py, but this can be changed by passing a fourth argument: "False" or "F" (in either upper or lower case) prevents the individual recipes from being written to the OUTPUT_FOLDER folder at all.

Similarly, storing the results to an ARMS mongo service is off by default, but if the fifth and sixth arguments specify a database and collection, respectively, the crawler will attempt to store them, using the ARMS server, API key, and seed definitions in settings.py or local_settings.py.

Avoiding server blocks

The crawler can also be configured to pause for a random number of seconds between fetches, to prevent recipe hosts from blocking it for making too many requests.

The default pause configuration is defined on lines 12 and 13 of the settings.py file, and can be overridden in a local_settings.py definition.

Another strategy, which can be done in conjunction with pausing, is to change the user agent from the default defined on line 11 of the settings.py file to something resembling a human user.

MDN maintains a list of current common browser agent strings, which can be used in a local_settings.py definition of the UA variable.
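
For example, a local_settings.py could override both the pause behavior and the user agent; the pause variable names below are placeholders (this README identifies them only by their line numbers in settings.py), while UA is the name given above:

# local_settings.py -- the PAUSE_* names are placeholders; use whatever names
# settings.py actually defines on lines 12 and 13
PAUSE_CRAWLER = True         # placeholder: enable the random pause
PAUSE_TIME_RANGE = (2, 10)   # placeholder: min and max seconds to sleep
UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0'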

Usage

Here is the crawler usage in full:

python crawler.py [site: (AllRecipes|Epicurious|FoodNetwork|Saveur|SiroGohan|WilliamsSonoma)] [file of seed urls] [threads] [save() (defaults to True)] [store() database (defaults to None)] [store() collection (defaults to None)]
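
For example, this invocation would crawl Epicurious with four threads, skip writing the individual json files, and attempt to store the results in a mongo database named recipes under a collection named epicurious (both names here are just placeholders):

python crawler.py Epicurious /tmp/epi.link 4 F recipes epicurious > epicurious.log 2>&1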


recipebook's Issues

time delay between fetches...

I'm adding one myself right now, but you might consider adding a time delay between fetches for crawler.py. AllRecipes gets really upset if you crawl their site too fast: at first it will just give you a "Too Many Requests" response, which you might want to start looking for, and if you continue they'll block your IP for 48 hours. I'm betting many of the other sites will as well.
