textgen

Objective of Project:

Use the opinions, statements, and questions of 330 million Reddit users as training data to generate personalized replies to a given comment on a post. The project is scoped to subreddits that may contain content related to highly debated topics in the United States, such as healthcare, immigration, and gun control. This project is not intended to troll, but to produce coherent, personalized responses so as to maximize the probability that our model's reply is not detected as machine-generated and is subsequently replied to.

Overview of Project Architecture:

The image below represents a two-topic architecture. The architecture is set up to be distributed, so any number of topics is theoretically possible. For instance, topic1 may be healthcare and topic2 may be immigration. Each topic has a predefined list of subreddits to stream from using the Reddit API. Each topic also has a predefined set of target keywords to search for within the title of each streamed post. For example, 'healthcare' is a keyword for the healthcare topic. If a keyword is contained in the title, a new document storing that post's information is created within a MongoDB collection.

Post Information Stored: Postid, Title, Author, DateTime Posted, Number of Current Upvotes, and Number of Current Comments
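The keyword filter and the stored fields can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the `TOPICS` configuration, the function names, and the dictionary keys are assumptions, and the real subreddit and keyword lists are not given here. The dict returned by `build_post_document` is the kind of object that could be passed to a MongoDB insert.

```python
from datetime import datetime, timezone

# Hypothetical topic configuration; the actual subreddits and keywords are not listed in this README.
TOPICS = {
    "healthcare": {"subreddits": ["politics"], "keywords": ["healthcare", "medicare"]},
    "immigration": {"subreddits": ["politics"], "keywords": ["immigration", "border"]},
}


def matching_topics(title, topics=TOPICS):
    """Return the names of all topics with at least one keyword in the title."""
    lowered = title.lower()
    return [
        name
        for name, cfg in topics.items()
        if any(keyword in lowered for keyword in cfg["keywords"])
    ]


def build_post_document(post_id, title, author, created_utc, upvotes, num_comments):
    """Build the dict that would be stored as one document per matched post."""
    return {
        "postid": post_id,
        "title": title,
        "author": author,
        "datetime_posted": datetime.fromtimestamp(created_utc, tz=timezone.utc),
        "upvotes": upvotes,
        "num_comments": num_comments,
    }
```

Matching here is a case-insensitive substring test, so 'healthcare' also matches 'Healthcare.gov'. Whether the project matches on substrings or whole words is not stated.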

The issue with streaming data from Reddit is that every post is brand new, so most posts have no comments when their information is stored in MongoDB. Thus, every hour, the collection is polled and another call is made via the Reddit API to update the current number of comments for each post. If the second difference of a post's comment count turns negative (that is, the hourly rate of new comments has started to fall), the post is archived and its postid is sent via a Kafka producer to the scrape program. The Kafka consumer receives this postid and scrapes all comments under the post in linear order, i.e. top to bottom. Each comment (record) of the post is serialized and sent via another Kafka producer to a program that transforms each record. Each record is then inserted into a Postgres database.
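The archiving rule can be illustrated with a small sketch. It assumes the hourly comment counts for a post are kept as a list, oldest first, and that "turns negative" means the most recent second difference is below zero. The function names are hypothetical.

```python
def second_difference(counts):
    """Discrete second difference of the three most recent hourly counts.

    Returns None when fewer than three observations are available.
    """
    if len(counts) < 3:
        return None
    return counts[-1] - 2 * counts[-2] + counts[-3]


def should_archive(counts):
    """True when the growth in comments has started to slow down."""
    d2 = second_difference(counts)
    return d2 is not None and d2 < 0
```

For counts of 0, 10, 30 the hourly gains are 10 and 20, so the second difference is +10 and the post keeps being polled. After a further reading of 40 the gains are 20 and 10, the second difference is -10, and the post would be archived and handed to the scraper.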

Comment Information Sent/Stored: Postid, Comment Parent ID, Comment ID, Comment Author, Comment, Level, Thread, and Upvotes
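One way to produce these records in top-to-bottom order is a depth-first walk of the comment tree. The sketch below uses plain dicts in place of Reddit API comment objects. It assumes that Level is the nesting depth (0 for a top-level comment) and that Thread is the index of the top-level comment a reply descends from. Neither definition is spelled out above, so treat both as guesses. JSON is used for serialization only as an example format.

```python
import json


def flatten_comments(post_id, comments):
    """Walk the comment tree depth-first and return one record per comment."""
    records = []

    def visit(comment, parent_id, level, thread):
        records.append({
            "postid": post_id,
            "parent_id": parent_id,
            "comment_id": comment["id"],
            "author": comment["author"],
            "comment": comment["body"],
            "level": level,      # assumed: nesting depth
            "thread": thread,    # assumed: index of the top-level ancestor
            "upvotes": comment["upvotes"],
        })
        for reply in comment.get("replies", []):
            visit(reply, comment["id"], level + 1, thread)

    for thread, top_level in enumerate(comments):
        visit(top_level, post_id, 0, thread)
    return records


def serialize(record):
    """Encode a record as UTF-8 JSON bytes, suitable as a message value."""
    return json.dumps(record).encode("utf-8")
```

The depth-first order matches how a comment page reads from top to bottom: each reply chain is emitted in full before the next top-level comment.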

The three major operations performed on each post are determining the post's keywords, identifying all interactions between users on the post, and extracting any URLs mentioned. After the operations are performed on a record, the Neo4j graph database is updated to reflect the new information.
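As an illustration of the URL extraction step, a regular expression is one simple option. The pattern and function name below are assumptions rather than the project's actual transform, and the pattern is deliberately conservative: it stops at whitespace, brackets, and parentheses so that Reddit-style markdown links are handled, at the cost of truncating URLs that themselves contain parentheses.

```python
import re

# Matches http(s) URLs up to the first whitespace, bracket, or parenthesis.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def extract_urls(text):
    """Return all URLs found in a comment, without trailing punctuation."""
    return [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]
```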

[Figure: System Design]

Network of Posts and Keywords:

Suppose there are two posts 'A' and 'B'. To quickly determine if 'A' and 'B' are related in some way, we can query the graph and see if 'A' and 'B' both contain a given keyword.

[Figure: Keywords]
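The relatedness check amounts to asking whether two posts share a keyword node. The sketch below uses an in-memory dict of sets as a stand-in for the graph, purely to show the logic. The sample data and function names are hypothetical, and in the project itself this would be a query against Neo4j.

```python
# Hypothetical post -> keywords index standing in for the post/keyword graph.
POST_KEYWORDS = {
    "A": {"healthcare", "medicare"},
    "B": {"medicare", "border"},
    "C": {"gun control"},
}


def shared_keywords(post_a, post_b, index=POST_KEYWORDS):
    """Return the keywords that both posts are connected to."""
    return index.get(post_a, set()) & index.get(post_b, set())


def are_related(post_a, post_b, index=POST_KEYWORDS):
    """Two posts are related if they share at least one keyword."""
    return bool(shared_keywords(post_a, post_b, index))
```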

Network of Interactions between Comment Authors and Posts:

Suppose there are two posts 'A' and 'B'. User1 previously interacted with User2 on post 'A'. Today, post 'B' was published and User1 interacted with User2 again. The weight on the edge of the graph between User1 and User2 would be updated from 1 to 2. Additionally, both User1 and User2 would have an edge (relationship) with both post 'A' and post 'B'.

[Figure: Interactions]
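The edge-weight update in this example can be sketched with an in-memory structure. The class below is a hypothetical stand-in for the graph database. It assumes the user-to-user edge is undirected and that the weight grows by one per recorded interaction, which is consistent with the example but not stated as a general rule.

```python
from collections import Counter


class InteractionGraph:
    """Toy model of weighted user-user edges and user-post edges."""

    def __init__(self):
        self.user_edges = Counter()  # frozenset({user_a, user_b}) -> weight
        self.post_edges = set()      # (user, post_id) pairs

    def record(self, user_a, user_b, post_id):
        """Register one interaction between two users on a post."""
        self.user_edges[frozenset((user_a, user_b))] += 1
        self.post_edges.add((user_a, post_id))
        self.post_edges.add((user_b, post_id))

    def weight(self, user_a, user_b):
        """Current weight of the edge between two users (0 if absent)."""
        return self.user_edges[frozenset((user_a, user_b))]
```

Recording the interaction on post 'A' gives a weight of 1. Recording the later one on post 'B' raises it to 2 and leaves each user with an edge to both posts.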

API Used: Reddit API (PRAW)

Tools Used: Python, Scala, SQL, Cypher, Apache Spark, Apache Kafka

Databases Used: MongoDB, Postgres, Neo4j

textgen's People

Contributors

ttheisen