GithubHelp home page GithubHelp logo

dechrissen / hstk Goto Github PK

View Code? Open in Web Editor NEW
1.0 2.0 0.0 301 KB

Toolkit for creating and interfacing with a database of news headlines

Python 100.00%
database language-model ocr snapchat toolkit data-analysis nlp news-headlines

hstk's Introduction

hstk โ€“ Headline Snap Toolkit

hstk provides a collection of tools for creating a database of fabricated news headlines (Headline Snaps) and interfacing with the database. Some of its built-in tools include language modeling, data analysis, data visualization tools. It can also synthesize new language data (news headlines) based on a trigram model.

Table of Contents

Installation

First, clone this repository (or fork your own copy, then clone that).

git clone [email protected]:Dechrissen/hstk.git

Alternatively, you can download the latest release's source code from the releases section of this repository. In that case, you'd need to extract it.

Setup

Ensure you have Python 3 installed.

cd to the project directory.

To install dependencies, run:

pip install -r requirements.txt

(For this to work on Windows, you might need to prefix the command with python -m, i.e. python -m pip install -r requirements.txt).

Installing the Tesseract executable

The Tesseract executable (engine) is required for the OCR backend in this toolkit. It is available on both Windows and Linux.

  • Windows installer
    • make sure you install the executable in this location exactly: C:\Program Files\Tesseract-OCR\tesseract.exe
  • Linux instructions
    • no special location instructions; just install via your distro's package manager

To perform the first-time setup, run:

python hstk.py

without any arguments. This will initialize the data directories for use.

Usage

cd to the project directory.

To get a detailed help message, run:

python hstk.py -h

On data persistence

When you run this toolkit for the first time, a directory called /data will be created locally within the project directory. This is where you will place your own source image data (in /data/src/raw), and where various generated data will be output by the tools.

The idea is that the data directory (or at least, the database files that get created in /data/db) should persist regardless of whatever subcommands you run with hstk. The source data you provide can theoretically stay in the data directory forever, and new data can be added at any time. Updates to this toolkit, for example, shouldn't affect existing data in your local directory.

The source data is what's used to generate the entries in the database files (e.g. /data/db/hs.db). Whenever new source data is added to /data/src/raw, running the conversion command (python hstk.py -c) will update the database.

Similarly, newline-separated files containing Headline Snaps as text can be added to /data/src/text at any time, with a .txt extension. Then python hstk.py -a will add those to the database.

Features

There are several commands at your disposal in this toolkit.

command flag description
add -a adds the current contents of the files in /data/src/text (newline-separated) to the database
total -t output the total number of Headline Snaps in the database
random -r print a random Headline Snap from the database
convert -c convert the Headline Snap image files in /data/src/raw to text via OCR and output them to /data/text/ocr_output.txt, then adds all contents of the /data/text directory to the database
export -x dump all Headline Snaps from the database to a text file at /data/dump.txt
search -s query the database for Headline Snaps containing a provided search phrase
delete -d delete all data from the Headline Snap and token databases

Note: run the toolkit with the -h flag to see all commands.

Additional functionality exists via subcommands, outlined below.

Via the tokenizer subcommand ...

command flag description
update_tokens -u iterate through all Headline Snaps in the database and add all unique tokens to a separate token database, keeping track of counts
query_tokens -q print the number of times some individual token appears in the database, according to the token database

Via the trigrams subcommand ...

command flag description
generate -g train a trigram language model on the database and synthesize a new Headline Snap based on it

Via the visualizer subcommand ...

command flag description
word_cloud -w generate a word cloud representing the most commonly occurring terms in the database

For detailed help with each subcommand, run:

python htsk.py <SUBCOMMAND> -h

Converting non-compliant Headline Snaps

In some cases, you might need to use a9t9 to convert some legacy Headline Snap image files, i.e., those which do not follow the guidelines (background not black, text too high in the frame, etc.). For detailed instructions, see this guide.

hstk's People

Contributors

dechrissen avatar quadrixis avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.