
A data labeling and NLP tool for Elixir (uses spaCy)


Hairy Text

HairyText diagram

Hairy Text is a tool for natural language processing.

With Hairy Text, you can perform named entity recognition (NER) tasks using the world-class spaCy library and label training data to improve your model, all from a normal-looking, mobile-friendly web application.

It is written in Elixir + Phoenix LiveView and Python, doesn't require a database (it is totally self-contained), and runs fine on a $5/mo server without a GPU.

Screenshot

List of examples for labeling

HairyText Examples screenshot

Labeling interface in modal window

HairyText Labeler screenshot

Testing unseen input (and adding to labeling queue)

HairyText TEST screenshot

Features

  • Built with the awesome spaCy NLP framework (so I probably didn't mess it up!)
  • Easily label text fragments for machine learning / NLP experiments
  • Interactive "test" console lets you quickly debug your model
  • Refreshless but highly dynamic Phoenix LiveView web-based user interface (like React, without it)
  • User logins with HTTP AUTH password check
  • Export a .ZIP of your labeled examples and prediction history (both convenient .JSON files)
  • REST JSON API for making predictions (to tie it into the rest of your project)
  • API predictions stored in log and reviewable in app
  • Label examples or images with a category/class
  • Label text inside examples as entities - for instance "time reference" or "place name"
  • Filter by entity tags or labels
  • Train online via the web interface, with live training progress reports (still rough)
  • Support for multiple projects (some bugs)

Future

  • Make into embeddable component like LiveDashboard
  • Support for "one at a time" editing that's more about a workflow of doing one labeling task after another
  • Object detection (aka classify regions inside images)
  • Assist in generating low-confidence predictions to more quickly improve model
  • Each project should have its own DETS files
  • APIs to: label examples new and old, bulk predict, view predictions log
  • Learn from embeddings (BERT, I'm looking at you)
  • uPlot training graphs

Bugs

  • Many Elixir warnings
  • When creating a new example from the Predictions or Test screen, clicking on the example text to label it will cause it to reset. This is really annoying. Use a two-step editing process for now.
  • Projects support broken in some ways (training and testing)

Notes

Hairy Text stores training examples, logs, etc. in a custom DETS-based storage shim. The shim integrates with Ecto, the Elixir database library, but only for trivial parts of Ecto's functionality.

The connection to the Python-based NLP backend uses ErlPort.

Motivation

I wanted Prodigy but can't afford such bourgeois things.

Requirements

  • Elixir 1.10+
  • Phoenix 1.5 + LiveView
  • Python 3.6+
  • spaCy NLP toolkit for Python

Installation

First, grab the code:

$ git clone https://github.com/tlack/hairytext/

Then, install spaCy for Python. You'll need a recent version of Python. Consider using virtualenv with HairyText. The Python code is in priv/python.

$ pip install spacy

Next, configure the default username/password:

$ vim config/config.exs

Finally, install the JavaScript and Elixir dependencies and start the server:

$ npm install --prefix assets
$ mix deps.get
$ iex -S mix phx.server

Then open http://localhost:4141 to start playing. The default username and password are admin:sohairy.

API

Request a prediction for a piece of text. The path includes the project ID:

$ curl 'http://localhost:4141/api/predict/9d00fa70-df5c-4a3a-9f0d-8c53f3345417?text=i+am+live+on+twitch' | json_pp
{
   "text" : "i am live on twitch",
   "label" : "good",
   "label_confidence" : 0.999650120735168,
   "entities" : {
      "service" : "twitch"
   }
}
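The same request can be made from a script. The following is a minimal Python sketch using only the standard library. The base URL and the response fields are taken from the curl example above; the function names (`build_predict_url`, `predict`, `summarize`) are hypothetical helpers and not part of HairyText.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Default address from the installation steps; adjust for your deployment.
BASE_URL = "http://localhost:4141"


def build_predict_url(project_id, text, base_url=BASE_URL):
    """Return the GET URL for a prediction request (same shape as the curl example)."""
    return f"{base_url}/api/predict/{project_id}?{urlencode({'text': text})}"


def predict(project_id, text, base_url=BASE_URL):
    """Call the prediction endpoint and decode the JSON body into a dict."""
    with urlopen(build_predict_url(project_id, text, base_url)) as response:
        return json.load(response)


def summarize(prediction):
    """Pull out the fields shown in the sample response, tolerating missing keys."""
    label = prediction.get("label")
    confidence = prediction.get("label_confidence")
    entities = prediction.get("entities") or {}
    return label, confidence, entities
```

With the server running, `summarize(predict("9d00fa70-df5c-4a3a-9f0d-8c53f3345417", "i am live on twitch"))` should yield the label, its confidence, and the entity map, assuming the response has the shape shown above.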

Add a new example to an image classification project:

$ curl https://example.com/test/someimage.jpg -o test.jpg
$ curl -X POST -F "image=@test.jpg" "http://localhost:4141/api/example/16700ec8-dab3-4d53-bcee-9b5e2ea52d3d"

Add a new example image to a project using its URL:

$ curl -X POST -F "image=http://example.com/images/1.jpg" \
    "http://localhost:4141/api/example/16700ec8-dab3-4d53-bcee-9b5e2ea52d3d"

Add a new example, with a known label, to a project:

$ curl -X POST -F "image=http://example.com/images/1.jpg" \
    -F "label=yellow" \
    "http://localhost:4141/api/example/16700ec8-dab3-4d53-bcee-9b5e2ea52d3d"

Use from iex shell

Make a prediction for some new text. This returns the raw Spacy result format.

iex(48)> Spacy.predict("i want to go live on whuwhuwhaaaaat at 7am")
%{
  'classification' => %{
    [] => 0.9619296193122864,
    'bad' => 0.03623370826244354,
    'good' => 0.9784072041511536
  },
  'entities' => [["service", "whuwhuwhaaaaat"], ["when", "7am"]],
  'text' => "i want to go live on whuwhuwhaaaaat at 7am"
}

See the raw data about an example in the system:

iex(418)> HT.Data.list_examples |> List.last |> Map.get(:id) |> HT.Data.get_example!
%HT.Data.Example{
  __meta__: #Ecto.Schema.Metadata<:built, "examples">,
  entities: %{},
  id: "e94e1954-0548-4f51-9570-e63cd298d2d7",
  image: nil,
  inserted_at: ~U[2020-04-30 04:05:34.965387Z],
  label: "bad",
  project: "9d00fa70-df5c-4a3a-9f0d-8c53f3345417",
  source: nil,
  status: nil,
  text: "i hate when people start live streaming. twitch sucks.",
  updated_at: ~U[2020-05-02 06:42:11.799197Z]
}

There is a handy utility for when you have a large batch of images to label.

First, copy images from your examples directory into the HairyText image_examples/ subdirectory for your project. Have your HairyText project ID at hand for this process (you can find it in the project's settings).

$ find /tmp/my-new-examples/ -type f -name \*png | shuf | head -250 > example-list.txt
$ cp `cat example-list.txt` ~/hairy-text-path/image_examples/16700ec8-dab3-4d53-bcee-9b5e2ea52d3d
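The two shell commands above can also be sketched in Python, which avoids trouble with file names containing spaces. This is only an illustration: `copy_random_images` is a hypothetical helper, not part of HairyText, and it matches `*.png` where the shell pattern matches any name ending in `png`. Like `cp`, it overwrites files that share a name.

```python
import random
import shutil
from pathlib import Path


def copy_random_images(source_dir, target_dir, count=250, pattern="*.png", seed=None):
    """Copy up to `count` randomly chosen images from source_dir into target_dir.

    Searches source_dir recursively, like `find`, and returns the copied paths.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    # Sort first so that a fixed seed gives a reproducible selection.
    candidates = sorted(p for p in source.rglob(pattern) if p.is_file())
    rng = random.Random(seed)
    chosen = rng.sample(candidates, min(count, len(candidates)))
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in chosen:
        # Files are flattened into target_dir; identical names overwrite each other.
        copied.append(Path(shutil.copy2(path, target / path.name)))
    return copied
```

For example, `copy_random_images("/tmp/my-new-examples", Path.home() / "hairy-text-path/image_examples/16700ec8-dab3-4d53-bcee-9b5e2ea52d3d")` mirrors the commands above.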

The images are now in the right path for HairyText to manipulate, but they still need to get into the database. Luckily, HairyText provides a convenience function to do this.

iex> Util.upsert_examples_from_image_folder()
