snorkel-team / snorkel
A system for quickly generating training data with weak supervision
Home Page: https://snorkel.org
License: Apache License 2.0
Fix getting pushed with new example notebook
See example in ddlite.py
Also, don't pair the same token to itself for a relation
Rather than a fully random sample, do we want
Apparently there are some problems here with the parser interface (@chrismre)
DeepDive can do L1 logistic regression, so ddlite should too
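As a minimal sketch of what this would look like on the ddlite side, scikit-learn already supports L1-regularized logistic regression; the data below is synthetic and the variable names are illustrative, not ddlite's actual API:

```python
# Sketch: L1-regularized logistic regression via scikit-learn.
# X and y are synthetic placeholders, not ddlite's real feature matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = (X[:, 0] > 0).astype(int)  # only feature 0 is informative

# penalty="l1" needs a solver that supports it, e.g. liblinear or saga
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

# L1 drives many coefficients exactly to zero, giving a sparse model
n_nonzero = np.count_nonzero(clf.coef_)
```

The sparsity from L1 is the point here: uninformative features get exactly-zero weights, which matches what DeepDive's L1 option provides.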
Either Extractions should be able to handle a list of lists of sentences, or the parsers should output a flat list of sentences
Need to be able to set initialization params and "train unsupervised", so the scikit-learn version would need modifications... could also use scipy.opt
This will be almost identical to the existing DictionaryMatch operator. This RegexMatch operator, if essentially copied from DictionaryMatch, will be able to trivially do e.g. POS tag sequence matches (using match_attrib=poses); however, this could be wrapped / presented as a separate operator...?
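A rough sketch of such an operator follows; the class name mirrors the proposal, but the interface (a dict-like sentence with a `words`/`poses` attribute and an `apply` method yielding token spans) is an assumption, not the actual ddlite matcher API:

```python
# Hypothetical RegexMatch operator sketch; the sentence format and
# method names are assumptions for illustration only.
import re

class RegexMatch:
    def __init__(self, pattern, match_attrib="words"):
        self.regex = re.compile(pattern)
        self.match_attrib = match_attrib  # e.g. "words" or "poses"

    def apply(self, sentence):
        """Yield (start, end) token spans whose joined attribute
        values fully match the regex."""
        seq = sentence[self.match_attrib]
        n = len(seq)
        for i in range(n):
            for j in range(i + 1, n + 1):
                if self.regex.fullmatch(" ".join(seq[i:j])):
                    yield (i, j)

# POS tag sequence match: an adjective followed by one or more nouns
sent = {"words": ["fast", "lithium", "batteries"],
        "poses": ["JJ", "NN", "NNS"]}
matcher = RegexMatch(r"JJ( NNS?)+", match_attrib="poses")
spans = list(matcher.apply(sent))
```

Matching over the POS sequence instead of the words is just a change of `match_attrib`, which is why it could plausibly stay one operator rather than two.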
Would be good to separate the concepts of Extractions as a data container/operator and the learning algorithms it implements? Should also probably implement Relation and Entity using a proxy pattern.
@ajratner let's talk before deciding either way on this?
edit: Adding question marks so it sounds like a question?
This breaks one cell in the extraction example
E.g. have one for getting to an Extractions object, then dump this, and have a separate notebook for writing rules & testing model?
Just need to wait one sec...
For regex matching, it would be very helpful to have access to the text of a single sentence without any tokenization. When tagging chemical names, for example, we frequently get these tokenization type artifacts:
Li ( 3 ) PS ( 4 ) vs. Li(3)PS(4)
Some of these can be fixed with modified regexes, but it would be nice to operate on the original text itself. As far as mapping back to tokens for entity tags, we could just consider a match as anything that overlaps with the original span.
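The overlap idea above can be sketched directly: run the regex over the raw sentence text, then recover every token whose character span overlaps the match. The helper name and offset format are illustrative assumptions:

```python
# Sketch: regex over the raw (untokenized) text, mapped back to tokens
# by character-span overlap. Names and formats are hypothetical.
import re

def char_match_to_tokens(text, token_offsets, pattern):
    """token_offsets: list of (char_start, char_end) per token.
    Returns (matched_text, overlapping_token_indices) per match."""
    results = []
    for m in re.finditer(pattern, text):
        toks = [i for i, (s, e) in enumerate(token_offsets)
                if s < m.end() and e > m.start()]  # any overlap counts
        results.append((m.group(), toks))
    return results

# "Li(3)PS(4)" gets shattered into many tokens by the tokenizer; the
# regex still fires on the raw text and we recover the covered tokens.
text = "We studied Li(3)PS(4) electrolytes."
offsets = [(0, 2), (3, 10), (11, 13), (13, 14), (14, 15), (15, 16),
           (16, 18), (18, 19), (19, 20), (20, 21), (22, 34), (34, 35)]
matches = char_match_to_tokens(text, offsets, r"Li\(\d\)PS\(\d\)")
```

Any-overlap is deliberately permissive: a match that starts or ends mid-token still claims that whole token for the entity tag.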
Could always run the good ole essential-feature-is-behind-a-paywall con
Helpful for people interested in Relations-style applications
Fix getting pushed with #44
When we reopen MindTagger, we can keep the same sample as before, but not reload the tags. @netj is there a nice way to do this with the API, like how we retrieve the tags, or should we form and dump tags.json to the instance directory?
How do we illustrate best practices for debugging vs. enhancement vs. revision to new users?
E.g. user should be able to access a path_between attribute of a Relation object, etc. This can be / is currently being done with treedlib; however, I am trying to decouple these two repos. We could bring this back in under the covers in a more limited form (e.g. Relation objects initialize with several dependency path attributes like path_between, but don't expose the direct XPath mechanisms to the user).
Emphasis on simple- this is not going to be an optimal preprocessing setup either way, we just want to make it a bit better through simple means that don't require any additional installs, configs, etc.
Stats to show for label function development:
Questions:
@netj Chris mentioned doing this (I think you were there?)
Facilitates applying gold standard labels (among other things)
@jason-fries
One simple way to have db connectivity in the notebook is our favorite extension, ipython-sql. We could initially just build some helper functions around this (or any other psql connector).
However, in DDL we pass around an object containing the entire dataset (Relations) - this would allow us to connect to the database in a way that is opaque to the user, turning this Relations object into essentially a cache for the DeepDive db...
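One way the cache-over-the-db idea could look, as a minimal sketch: a thin wrapper that fetches rows once and serves them from memory afterwards. sqlite3 stands in here for a psql connector (ipython-sql or psycopg2 would be analogous), and the table and class names are made up for illustration:

```python
# Hypothetical Relations-as-cache sketch; sqlite3 stands in for a psql
# connector, and the schema below is invented for the example.
import sqlite3

class RelationStore:
    def __init__(self, conn):
        self.conn = conn
        self._cache = None  # lazily-loaded cache of relation rows

    def relations(self):
        """Fetch rows once, then serve them from the in-memory cache,
        keeping the database opaque to the notebook user."""
        if self._cache is None:
            cur = self.conn.execute("SELECT id, e1, e2 FROM relations")
            self._cache = cur.fetchall()
        return self._cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relations (id INTEGER, e1 TEXT, e2 TEXT)")
conn.execute("INSERT INTO relations VALUES (1, 'Li3PS4', 'electrolyte')")
store = RelationStore(conn)
rows = store.relations()
```

Repeated calls return the same cached list, so the notebook user never sees a second round-trip to the db.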
What else?
Each parse_sents call requires a new Java process and a 10sec+ delay to load the models.
Desired initial functionalities:
Ideally there would be some simple way to extend so that users could write basic XML/HTML parser modules (e.g. to grab metadata, preserve section structure, etc.) via some python library (e.g. lxml, beautifulsoup). This kind of solution would not be performant, but could potentially be very simple...
Not a big deal, but shows up as -LRB- and -RRB- in MindTagger, which looks a lot like a gene to the underinformed
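The fix is a small display-time mapping from PTB-style bracket tokens back to literal brackets; a minimal sketch (function name is illustrative):

```python
# Sketch: unescape Penn Treebank bracket tokens before display so
# -LRB-/-RRB- don't show up in MindTagger.
PTB_BRACKETS = {"-LRB-": "(", "-RRB-": ")",
                "-LSB-": "[", "-RSB-": "]",
                "-LCB-": "{", "-RCB-": "}"}

def unescape_ptb(tokens):
    return [PTB_BRACKETS.get(t, t) for t in tokens]

tokens = ["Li", "-LRB-", "3", "-RRB-", "PS"]
display = unescape_ptb(tokens)
```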
sudo. Even if they do, we'd still want to use virtualenv.
requirements.txt to list all py dependencies.
Example notebooks are the de facto documentation. Example notebooks are not actually documentation.
We need to output the basic deepdive calibration plots (notebooks are perfect for this!), as well as potentially some other histograms which guide users towards correct error analysis / debugging procedures.
We also need to output the marginals, which is a minor sub-function to add in.
Should we focus on the hold out set? What are the important metrics to users wanting to refine their DSRs?
Support sparse matrices as optional input
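Accepting sparse input mostly means branching on the matrix type instead of densifying; a minimal sketch with an illustrative function name (not any existing ddlite function):

```python
# Sketch: accept either a dense array or a scipy.sparse matrix without
# densifying the sparse case. Function name is illustrative only.
import numpy as np
import scipy.sparse as sp

def column_means(X):
    if sp.issparse(X):
        # .mean on a sparse matrix returns an np.matrix; flatten it
        return np.asarray(X.mean(axis=0)).ravel()
    return np.asarray(X).mean(axis=0)

dense = np.array([[1.0, 0.0], [3.0, 2.0]])
sparse = sp.csr_matrix(dense)
```

Keeping the sparse branch sparse matters for the large, mostly-zero feature matrices these pipelines produce.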
Should give each extraction a unique id
Useful for NER systems, and unary Relations aren't friendly
Refine saving and loading annotation dumps from MindTagger during DSR refinement. I realize "items.csv" is dumped in the MindTagger directory under some unique folder id and tags can be fetched using get_mindtagger_tags() on the MindTagger instance, but the metrics associated with these values should be wrapped up in some sort of "classification_report" type function.
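A sketch of what that "classification_report"-type wrapper might look like, with sklearn.metrics doing the actual work; the flat 0/1 tag format and function name are assumptions about how MindTagger tags would be normalized, not the real API:

```python
# Hypothetical report wrapper over fetched MindTagger tags; the binary
# tag format here is an assumption for illustration.
from sklearn.metrics import precision_recall_fscore_support

def tag_report(gold_tags, pred_tags):
    p, r, f1, _ = precision_recall_fscore_support(
        gold_tags, pred_tags, average="binary")
    return {"precision": p, "recall": r, "f1": f1}

gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
report = tag_report(gold, pred)
```

Wrapping the metrics this way keeps the items.csv / get_mindtagger_tags() plumbing out of the user's error-analysis loop.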
Let the user evaluate how well their rules work if:
E.g. for an even quicker-and-dirtier option / baselines!