anvaari / patent-crawler

Patent Crawler is a Python program that crawls patent information from Google Patents for given keywords.

License: Apache License 2.0

Python 100.00%
python crawler data data-mining patent-data patents


Patent Crawler

Patent Crawler is a Python program that crawls patent information from Google Patents for given keywords.

How It Works

Google sets a very low rate limit on search pages and blocks any activity it detects as scraping, but it has no such policy on individual patent pages. So the program first downloads a list of patents that includes a few fields, among them each patent's URL, and then visits those URLs and scrapes them. I tried to write these programs in a user-friendly way, so running them will guide you through scraping whatever you want.
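The two-phase flow above can be sketched as follows. This is a minimal illustration, not the repository's code: the column layout of gp-search.csv is an assumption, so instead of hard-coding a column name the sketch keeps any field that looks like a patent URL (and the real crawler drives Firefox via Selenium/geckodriver rather than parsing CSV text directly).

```python
import csv
import io


def patent_urls(csv_text: str) -> list[str]:
    """Collect patent-page URLs from a gp-search.csv export.

    The exact column layout is an assumption, so we keep any
    field that looks like a Google Patents patent URL.
    """
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        for field in row:
            if field.startswith("https://patents.google.com/patent/"):
                urls.append(field)
    return urls


# Tiny stand-in for a real export (a header plus two result rows).
sample = (
    "id,title,result link\n"
    "US-1234567-A,Example patent,https://patents.google.com/patent/US1234567A/en\n"
    "US-7654321-B2,Another patent,https://patents.google.com/patent/US7654321B2/en\n"
)

print(patent_urls(sample))

# Phase two would then visit each URL with a polite delay, e.g.:
# for url in patent_urls(sample):
#     html = fetch(url)   # hypothetical per-page fetch helper
#     time.sleep(5)       # stay under Google's per-page rate limit
```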

Usage

  • Clone the repo.
  • Create a virtual environment and activate it. How
    • pip install -r requirements.txt
  • Download the gecko driver for Firefox from here and place it in the code path.
  • Now it's time to download gp-search.csv, a CSV that contains all the search results for your keyword. Search_Url_Finder.py guides you step by step through downloading this CSV, or you can do it manually by going to Google Patents.
    • python Search_Url_Finder.py
  • Rename the downloaded CSV file to gp-search.csv and place it in the code path.
  • Now run Patent_Crawler.py. It will scrape the information of all patents in gp-search.csv and save it to patents_data.csv.
    • python Patent_Crawler.py

Notes

  • Patent_Crawler extracts the following information from each patent's page (Google Patents) and stores it in a dataframe:

    • ID
    • Title
    • Abstract
    • Description
    • Claims
    • Inventors
    • Patent Office
    • Publication Date
    • URL
  • Patent_Crawler can resume from the last run, so don't worry if something unwanted happens (e.g. a power outage!).

    • Patent_Crawler saves data to disk after scraping every 5 patents. This can slow the process down when the data becomes very large (when there is a large number of patents), so it's better to set this to 15 or 30 for better speed.
  • Google will block your IP if the number of requests exceeds a specific limit per hour (or overall, I'm not sure which), so I put some sleeps in the code. You can reduce the sleep time, but doing so increases the probability of getting banned!

  • Two files will be created in the code directory:

    • patents_data.csv --> contains all the information scraped from the patent pages
    • not_scrap_pickle --> contains all the patents from gp-search.csv that haven't been scraped yet
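The resume and rate-limit behaviour described in these notes can be sketched like this. It is an illustrative re-implementation, not the project's code: the file names match the README (patents_data.csv, not_scrap_pickle) and the 5-patent save interval matches the note above, but the scrape() helper, the CSV columns, and the sleep length are assumptions.

```python
import csv
import os
import pickle
import time

SAVE_EVERY = 5        # README default; raise to 15 or 30 for speed
SLEEP_SECONDS = 5     # assumed delay between requests to avoid an IP ban
PICKLE_PATH = "not_scrap_pickle"
OUTPUT_PATH = "patents_data.csv"


def scrape(url: str) -> dict:
    """Hypothetical stand-in for the real per-patent scraper."""
    return {"URL": url, "Title": f"title of {url}"}


def flush(rows: list[dict]) -> None:
    """Append scraped rows to patents_data.csv, writing a header once."""
    if not rows:
        return
    new_file = not os.path.exists(OUTPUT_PATH)
    with open(OUTPUT_PATH, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["URL", "Title"])
        if new_file:
            writer.writeheader()
        writer.writerows(rows)


def crawl(urls: list[str], sleep_seconds: float = SLEEP_SECONDS) -> None:
    # Resume: if a previous run left unfinished URLs, start from those.
    if os.path.exists(PICKLE_PATH):
        with open(PICKLE_PATH, "rb") as f:
            urls = pickle.load(f)

    buffer = []
    for i, url in enumerate(urls, start=1):
        buffer.append(scrape(url))
        if i % SAVE_EVERY == 0:
            flush(buffer)
            buffer.clear()
            # Checkpoint the URLs that are still unscraped.
            with open(PICKLE_PATH, "wb") as f:
                pickle.dump(urls[i:], f)
        time.sleep(sleep_seconds)
    flush(buffer)
    if os.path.exists(PICKLE_PATH):
        os.remove(PICKLE_PATH)
```

A crash between two checkpoints loses at most SAVE_EVERY patents' worth of work: the next run reloads not_scrap_pickle and continues from there.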

Contribution

I really love the open source community; it makes me proud to be a part of it. So feel free to send a pull request or ask a question in the issues.

Hope Patent_Crawler can help you :)

Donation

Donations make the developer of this project very happy and grateful :) So if Patent Crawler helped you and you want to donate, here is my address on the Lightning Network. You can donate bitcoin with a very small fee :)

lightning : [email protected]

