
MarginalBear

MarginalBear is a chit-chat bot with a conversation-retrieval engine built on a PTT corpus. The core modules in this repo are crawl_app, ingest_app and chat_app, and we use Django to manage these apps.

Chat Demos

Setting Demos in Django Admin

Crawler setting:

Blacklist setting:

Vocabulary:

PTT-Crawler

Crawlers are implemented with the Scrapy framework; the crawling logic is defined under the crawl_app/spider/ directory. Each crawled article is collected in JSON Lines files and formatted as follows:

"url": <url>,
"date": <article-publish-date>,
"title": <title>,
"author": <author>,
"content": <article-body>,
"push": <list of comment-string>,

To build the conversation corpus, we pair the title and push fields to mimic Q&A behavior. Here are some examples:

<title> as Q              <push> as A
綜藝玩很大是不是走下坡了      很久沒看了  都是老梗
(Is the variety show 玩很大 going downhill?)   (Haven't watched it in ages; it's all old gags)
該怎麼挽回好友?             就算挽回 以後也會因為別的事離開你
(How do I win back a good friend?)   (Even if you win them back, they'll leave you over something else later)
妹妹想去補習,該怎麼辦        其實你沒有妹妹
(My little sister wants to go to cram school; what should I do?)   (Actually, you don't have a sister)
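The pairing step above can be sketched in a few lines. This is a minimal illustration (not the repo's actual code) that reads JSON Lines items in the format shown earlier and emits (title, comment) pairs:

```python
import json

def build_qa_pairs(jsonlines):
    """Pair each article's title (Q) with each of its push comments (A)."""
    pairs = []
    for line in jsonlines:
        item = json.loads(line)
        for comment in item.get("push", []):
            pairs.append((item["title"], comment))
    return pairs

# One crawled item, in the format shown above (abbreviated).
sample = [json.dumps({
    "url": "https://www.ptt.cc/bbs/Gossiping/M.123.html",
    "title": "該怎麼挽回好友?",
    "push": ["就算挽回 以後也會因為別的事離開你"],
}, ensure_ascii=False)]

print(build_qa_pairs(sample))
```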

Further data cleaning is handled by ingest_app.

Each crawler handles articles from a single PTT forum. Since user habits differ considerably across forums (e.g. gossiping, sex, mantalk, etc.), we may apply forum-specific rules to each crawler. To manage these crawlers easily, the crawl engine is integrated with Django: in the Django admin interface, we can create rules to filter out noisy articles. A rule is a blacklist, i.e. a set of phrases that should be filtered, together with a type naming the field of the crawled item it applies to. The types are:

  • title: related to the title field of crawled items.
  • push: related to the push field of crawled items.
  • author: related to the author field of crawled items.
  • audience: related to the commenters of the push field.

A blacklist can be defined in admin as:

"type": title,
"phrase": 公告, Re:, Fw:, 投稿, 水桶,

This means the crawler should drop an item whenever the article's title contains one of these phrases. With this configuration, each crawler can be equipped with multiple rules targeting different kinds of unwanted content.
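The rule check can be sketched as follows. This is an illustrative stand-in for the repo's actual filtering code; `rules` is a list of (type, phrases) pairs mirroring the admin setup shown above:

```python
def should_drop(item, rules):
    """Return True if any blacklist phrase appears in the rule's target field."""
    for field, phrases in rules:
        values = item.get(field, "")
        if isinstance(values, str):          # title/author are strings,
            values = [values]                # push is a list of comment strings
        if any(phrase in value for phrase in phrases for value in values):
            return True
    return False

rules = [("title", ["公告", "Re:", "Fw:", "投稿", "水桶"])]
print(should_drop({"title": "Re: 問卦"}, rules))   # → True (title contains "Re:")
```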

A spider can be defined in admin as:

"tag": Gossiping,  # forum name
"entry": https://www.ptt.cc/bbs/Gossiping/index{index}.html,
"page": 250,   # pages to crawl in a crawl task
"offset": 50,  # the distance from the newest page
"freq": 1,     # crawl frequency, used with crontab, ex: daily
"blacklist": [<rule1>, <rule2>, ...],
"start": -1,   # start page index
"end": -1,     # end page index
"status": debug, # pass or debug

When a spider is created, run this command to check whether the config is valid:

./manage.py okbot_update_spider <tag>

The start and end indices are updated according to the page and offset settings. If everything goes fine, the status changes to pass, meaning the spider is ready to fire:
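One plausible way the start/end indices could be derived from the page and offset settings, given that offset is described as the distance from the newest page. This is an assumption for illustration; the actual formula lives in the okbot_update_spider command:

```python
def page_range(newest_index, page, offset):
    """Guess at the crawl window (ASSUMPTION, not the repo's actual logic):
    skip `offset` pages back from the newest index, then take `page` pages."""
    end = newest_index - offset
    start = end - page + 1
    return start, end

# e.g. with the spider config above (page=250, offset=50) and a
# hypothetical newest forum index of 39000:
print(page_range(newest_index=39000, page=250, offset=50))  # → (38701, 38950)
```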

./manage.py okbot_crawl <tag>

After issuing a crawl task, a job log is generated; when the task finishes, a statistics summary is recorded and can be viewed in admin, e.g.:

"name": "Gossiping",
"item_num": "3227",
"drop_num": "10",
"title": "mean: 19.2, std: 4.3",
"url": "mean: 56.0, std: 0.0",
"author": "mean: 16.7, std: 4.2",
"date": "mean: 24.0, std: 0.0",
"push": "mean: 17.4, std: 9.4",
"content": "mean: 269.3, std: 350.1"
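The per-field numbers above look like the mean and standard deviation of field text lengths across the crawled items. A minimal sketch of that computation, assuming population standard deviation and string-valued fields:

```python
import statistics

def field_stats(items, field):
    """Mean and population std of the text length of `field` across items."""
    lengths = [len(item.get(field, "")) for item in items]
    return round(statistics.mean(lengths), 1), round(statistics.pstdev(lengths), 1)

# Two toy items with titles of length 15 and 23.
items = [{"title": "a" * 15}, {"title": "b" * 23}]
print(field_stats(items, "title"))  # → (19.0, 4.0)
```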

Finally, we use crontab to schedule daily crawl jobs; you can find the handler script in crawl_ingest.py.

Ingester

This module "ingests" crawled data into the database and does three things:

  1. Builds the vocabulary by tokenizing (with jieba) the articles' titles.
  2. Indexes every article.
  3. Builds the ManyToMany relation (an inverted index) between vocabulary entries and articles.
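The three steps can be sketched roughly as follows, using a plain dict in place of the actual Django ManyToMany models and a whitespace tokenizer standing in for jieba:

```python
from collections import defaultdict

def ingest(titles, tokenize=str.split):
    """Build a vocabulary and an inverted index from article titles."""
    vocab = set()
    inverted = defaultdict(set)   # word -> set of article ids
    for article_id, title in enumerate(titles):
        for word in tokenize(title):
            vocab.add(word)
            inverted[word].add(article_id)
    return vocab, inverted

vocab, inverted = ingest(["how to cook rice", "how to sleep well"])
print(sorted(inverted["how"]))  # → [0, 1]
```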

The tasks are wrapped into a single command:

./manage.py okbot_ingest --jlpath <jsonline-file> --tokenizer <tokenizer>

The script only supports PostgreSQL. If you use a PostgreSQL backend with Django, provide these environment variables and the command should work:

  • OKBOT_DB_USER
  • OKBOT_DB_NAME
  • OKBOT_DB_PASSWORD

The vocabulary is listed in Django admin. Since the retrieval mechanism works with an inverted index, you should label words with high document frequency as stopwords, or the retrieval process will be very slow.
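High document-frequency words have long posting lists, which is why they slow retrieval down. Stopword candidates could be spotted like this (a sketch; the 0.5 threshold is an arbitrary choice for illustration):

```python
def high_df_words(inverted_index, total_docs, threshold=0.5):
    """Words appearing in more than `threshold` of all documents are
    stopword candidates. `inverted_index` maps word -> set of doc ids."""
    return {word for word, docs in inverted_index.items()
            if len(docs) / total_docs > threshold}

# "the" occurs in all 4 documents, the other words in only one each.
index = {"the": {0, 1, 2, 3}, "rice": {0}, "sleep": {1}}
print(high_df_words(index, total_docs=4))  # → {'the'}
```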

Chatbot

The bot is deployed on both the Messenger and LINE platforms; you can find the API implementation in chat_app/views.py. When the bot receives a query, the engine finds related articles via the inverted index, then scores each article's title against the query using Jaccard or BM25 similarity together with some other features. After ranking the articles, the bot picks a comment from the top-ranking articles as the response. You can find the ranking algorithm and its implementation in chat_app/bots.py. A word2vec model (built with the gensim package) is also applied to queries to generate similar phrases, enriching the search.
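A minimal sketch of the retrieval idea using only Jaccard similarity on token sets; the actual ranking in chat_app/bots.py combines more features, and comment selection here is simplified to taking the first comment of the best match:

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_articles(query_tokens, articles):
    """Sort (title_tokens, comments) articles by title similarity to the query."""
    return sorted(articles,
                  key=lambda art: jaccard(query_tokens, art[0]),
                  reverse=True)

articles = [
    (["how", "to", "cook", "rice"], ["use a rice cooker"]),
    (["how", "to", "sleep", "well"], ["count sheep"]),
]
best = rank_articles(["how", "cook", "rice"], articles)[0]
print(best[1][0])  # → use a rice cooker
```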

Other features:

  • Chat rules table
  • Chat tree/caching
  • Jieba tag weighting table

Evaluation
