PRAW-NER

PRAW-NER is a toolkit built on top of the Python Reddit API Wrapper (PRAW), which provides easy access to all the elements of Reddit. It uses the Python APIs exposed by Reddit to extract posts, comments, and saved content, and can be extended to everything that is available on Reddit. NER stands for Named Entity Recognition, which is used to pick out entities from raw text; in this case, a post or a comment.

The requirement stemmed from the hundreds of saved posts and comments I had gathered over the last 4-5 years and the need to pick out only the most important things. Scraping posts or top-level comments and then extracting only the relevant entities in them helped me sort out my Reddit profile.


PRAW Information

All the information you will need to learn about PRAW can be found in the official PRAW documentation.


Configuration

Instructions for configuring an instance of PRAW can be found on PRAW's configuration page. PRAW provides multiple ways of configuring an instance of the API, but if you plan to open-source your work on PRAW, it is recommended to use a .ini file and add it to your .gitignore list so that your personal information stays with you. If you want to access a certain user's profile (or your own) via the API, you will need to provide your Reddit username and password in the .ini file along with the user agent name. You will also need a client ID and client secret, which Reddit uses for OAuth. The steps to obtain those can be found here.
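
To make this concrete, here is a minimal praw.ini sketch with the fields described above; the site name [mySite] and every value are placeholders for your own details:

    [mySite]
    client_id=YOUR_CLIENT_ID
    client_secret=YOUR_CLIENT_SECRET
    user_agent=praw-ner by u/YOUR_USERNAME
    ; username and password are only needed when accessing a user's profile
    username=YOUR_REDDIT_USERNAME
    password=YOUR_REDDIT_PASSWORD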


Code Walkthrough

Once your configuration is set, we can look at what this toolkit provides and how you can leverage it for your own purposes.

  1. Initializing Reddit instance

I have created two distinct configurations, or "sites", based on how PRAW is used. One is a site to access my own profile; the other is a generic site to access public content on Reddit. Note that one .ini file can hold multiple sites. These can be found in /base/basePRAW.py and /scraper/commentScraper.py. The two sites, savedPostsParser and commentScraper, are what you pass to the Reddit object to instantiate the Reddit API, as sketched below. If you are looking only to scrape content and store it locally, you can use the commentScraper config, as you won't need to pass in your Reddit credentials for it.
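
As a rough sketch, assuming a praw.ini that defines both sites, the instantiation looks like this:

    import praw

    # Authenticated site, used to work with your own profile (saved posts, unsaving).
    profile_reddit = praw.Reddit(site_name="savedPostsParser")

    # Read-only site for scraping public content; no account credentials required.
    scraper_reddit = praw.Reddit(site_name="commentScraper")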

  2. Parsing saved content from user profiles

scraper/savedContentScraper.py provides different methods for extracting saved data from Reddit. Depending on what the user needs, they can get the saved subreddits, saved content specific to a subreddit, or, more generally, all saved content in a user's profile.
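
The toolkit's own method names aside, the underlying PRAW calls look roughly like this, continuing from the instantiation sketch above:

    # Iterate over everything saved on the authenticated account.
    for item in profile_reddit.user.me().saved(limit=None):
        if isinstance(item, praw.models.Submission):
            print("post:", item.title, "- r/" + item.subreddit.display_name)
        else:  # a praw.models.Comment
            print("comment:", item.body[:80], "- r/" + item.subreddit.display_name)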

  3. Unsave Content

Once saved content has been pulled in, the user can easily clear that space by using /unsave/unsaveContent.py.
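
Under the hood this amounts to calling unsave() on each saved item; a minimal sketch, and a destructive one, so only run it once the content is backed up:

    # Unsave every item on the authenticated account to clear the saved list.
    for item in profile_reddit.user.me().saved(limit=None):
        item.unsave()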

  4. Parsing comments on any post on Reddit

scraper/commentScraper.py provides methods with which users can obtain all top-level comments on a post, using either the post URL or the post ID.
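
In plain PRAW terms, fetching top-level comments looks roughly like this (the URL is a made-up placeholder):

    # Load a post by URL; scraper_reddit.submission(id="post_id") works as well.
    url = "https://www.reddit.com/r/books/comments/abc123/example_post/"
    submission = scraper_reddit.submission(url=url)
    submission.comments.replace_more(limit=0)  # drop "load more comments" stubs

    # Iterating the CommentForest directly yields only top-level comments.
    for top_level_comment in submission.comments:
        print(top_level_comment.body)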

  5. File Writer

The file writer takes as input all the scraped content and writes it to a local directory provided by the user.
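
As an illustration only, write_lines below is a hypothetical helper, not the repository's actual function; a file writer of this kind can be as small as:

    from pathlib import Path

    def write_lines(lines, out_dir, filename="scraped.txt"):
        # Create the target directory if needed, then write one entry per line.
        path = Path(out_dir) / filename
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("\n".join(lines), encoding="utf-8")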

  6. Named Entity Recognition (NER)

NER/namedEntityRecognizer.py uses spaCy's English language model. Named Entity Recognition is one of spaCy's linguistic features: the library provides a set list of entity types that its models have been trained on. For my use case, I wanted to extract the book titles in all the posts and comments I had saved, so I used the WORK_OF_ART entity label of the spaCy model to get the book titles I wanted. Please note: sometimes spaCy may not be able to recognize the entities that you need recognized. For this, spaCy also provides a mechanism to retrain the original model or create new models from scratch. Finally, these entities are all consolidated in a list which is passed to the File Writer.
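
A minimal sketch of this step, assuming the small English pipeline en_core_web_sm is installed (extract_titles is an illustrative name, not the repository's function):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # spaCy's small English pipeline

    def extract_titles(texts):
        # Collect every span the model labels WORK_OF_ART (books, songs, films, ...).
        titles = []
        for doc in nlp.pipe(texts):
            titles.extend(ent.text for ent in doc.ents if ent.label_ == "WORK_OF_ART")
        return titles

    print(extract_titles(["I just finished reading The Name of the Wind last week."]))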
