snorkel-team / snorkel
A system for quickly generating training data with weak supervision
Home Page: https://snorkel.org
License: Apache License 2.0
Fix getting pushed with new example notebook
See example in ddlite.py
Also, don't pair the same token to itself for a relation
Rather than a fully random sample, do we want
Apparently there are some problems here with the parser interface (@chrismre)
DeepDive can do L1 logistic regression, so ddlite should too
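As a minimal sketch of what this would look like on the ddlite side, scikit-learn already supports L1-regularized logistic regression; the data below is synthetic and the variable names are illustrative, not ddlite's actual API:

```python
# Sketch: L1-regularized logistic regression via scikit-learn.
# X and y are synthetic placeholders, not ddlite's real feature matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = (X[:, 0] > 0).astype(int)  # only feature 0 is informative

# penalty="l1" needs a solver that supports it, e.g. liblinear or saga
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

# L1 drives many coefficients exactly to zero, giving a sparse model
n_nonzero = np.count_nonzero(clf.coef_)
```

The sparsity from L1 is the point here: uninformative features get exactly-zero weights, which matches what DeepDive's L1 option provides.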
Either Extractions should be able to handle a list of lists of sentences, or the parsers should output a flat list of sentences
Need to be able to set initialization params and "train unsupervised", so the scikit-learn version would need modifications... could also use scipy.opt
This will be almost identical to the existing DictionaryMatch operator. This RegexMatch operator, if essentially copied from DictionaryMatch, will be able to trivially do e.g. POS tag sequence matches (using match_attrib=poses); however, this could be wrapped / presented as a separate operator...?
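A rough sketch of such an operator follows; the class name mirrors the proposal, but the interface (a dict-like sentence with a `words`/`poses` attribute and an `apply` method yielding token spans) is an assumption, not the actual ddlite matcher API:

```python
# Hypothetical RegexMatch operator sketch; the sentence format and
# method names are assumptions for illustration only.
import re

class RegexMatch:
    def __init__(self, pattern, match_attrib="words"):
        self.regex = re.compile(pattern)
        self.match_attrib = match_attrib  # e.g. "words" or "poses"

    def apply(self, sentence):
        """Yield (start, end) token spans whose joined attribute
        values fully match the regex."""
        seq = sentence[self.match_attrib]
        n = len(seq)
        for i in range(n):
            for j in range(i + 1, n + 1):
                if self.regex.fullmatch(" ".join(seq[i:j])):
                    yield (i, j)

# POS tag sequence match: an adjective followed by one or more nouns
sent = {"words": ["fast", "lithium", "batteries"],
        "poses": ["JJ", "NN", "NNS"]}
matcher = RegexMatch(r"JJ( NNS?)+", match_attrib="poses")
spans = list(matcher.apply(sent))
```

Matching over the POS sequence instead of the words is just a change of `match_attrib`, which is why it could plausibly stay one operator rather than two.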
Would be good to separate the concepts of Extractions as a data container/operator and the learning algorithms it implements? Should also probably implement Relation and Entity using a proxy pattern.
@ajratner let's talk before deciding either way on this?
edit: Adding question marks so it sounds like a question?
This breaks one cell in the extraction example
E.g. have one for getting to an Extractions object, then dump this, and have a separate notebook for writing rules & testing model?
Just need to wait one sec...
For regex matching, it would be very helpful to have access to the text of a single sentence without any tokenization. When tagging chemical names, for example, we frequently get these tokenization type artifacts:
Li ( 3 ) PS ( 4 ) vs. Li(3)PS(4)
Some of these can be fixed with modified regexes, but it would be nice to operate on the original text itself. As far as mapping back to tokens for entity tags, we could just consider a match as anything that overlaps with the original span.
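The overlap idea above can be sketched directly: run the regex over the raw sentence text, then recover every token whose character span overlaps the match. The helper name and offset format are illustrative assumptions:

```python
# Sketch: regex over the raw (untokenized) text, mapped back to tokens
# by character-span overlap. Names and formats are hypothetical.
import re

def char_match_to_tokens(text, token_offsets, pattern):
    """token_offsets: list of (char_start, char_end) per token.
    Returns (matched_text, overlapping_token_indices) per match."""
    results = []
    for m in re.finditer(pattern, text):
        toks = [i for i, (s, e) in enumerate(token_offsets)
                if s < m.end() and e > m.start()]  # any overlap counts
        results.append((m.group(), toks))
    return results

# "Li(3)PS(4)" gets shattered into many tokens by the tokenizer; the
# regex still fires on the raw text and we recover the covered tokens.
text = "We studied Li(3)PS(4) electrolytes."
offsets = [(0, 2), (3, 10), (11, 13), (13, 14), (14, 15), (15, 16),
           (16, 18), (18, 19), (19, 20), (20, 21), (22, 34), (34, 35)]
matches = char_match_to_tokens(text, offsets, r"Li\(\d\)PS\(\d\)")
```

Any-overlap is deliberately permissive: a match that starts or ends mid-token still claims that whole token for the entity tag.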
Could always run the good ole essential-feature-is-behind-a-paywall con
Helpful for people interested in Relations-style applications
Fix getting pushed with #44
When we reopen MindTagger, we can keep the same sample as before, but not reload the tags. @netj is there a nice way to do this with the API, like how we retrieve the tags, or should we form and dump tags.json to the instance directory?
How do we illustrate best practices for debugging vs. enhancement vs. revision to new users?
E.g. user should be able to access a path_between attribute of a Relation object, etc. This can be / is currently being done with treedlib; however, I am trying to decouple these two repos. We could bring this back in under the covers in a more limited form (e.g. Relation objects initialize with several dependency path attributes like path_between, but don't expose the direct XPath mechanisms to the user).
Emphasis on simple- this is not going to be an optimal preprocessing setup either way, we just want to make it a bit better through simple means that don't require any additional installs, configs, etc.
Stats to show for label function development:
Questions:
@netj Chris mentioned doing this (I think you were there?)
Facilitates applying gold standard labels (among other things)
@jason-fries
One simple way to have db connectivity in the notebook is our favorite extension, ipython-sql. We could initially just build some helper functions around this (or any other psql connector).
However, in DDL we pass around an object containing the entire dataset (Relations) - this would allow us to connect to the database in a way that is opaque to the user, turning this Relations object into essentially a cache for the DeepDive db...
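One way the cache-over-the-db idea could look, as a minimal sketch: a thin wrapper that fetches rows once and serves them from memory afterwards. sqlite3 stands in here for a psql connector (ipython-sql or psycopg2 would be analogous), and the table and class names are made up for illustration:

```python
# Hypothetical Relations-as-cache sketch; sqlite3 stands in for a psql
# connector, and the schema below is invented for the example.
import sqlite3

class RelationStore:
    def __init__(self, conn):
        self.conn = conn
        self._cache = None  # lazily-loaded cache of relation rows

    def relations(self):
        """Fetch rows once, then serve them from the in-memory cache,
        keeping the database opaque to the notebook user."""
        if self._cache is None:
            cur = self.conn.execute("SELECT id, e1, e2 FROM relations")
            self._cache = cur.fetchall()
        return self._cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relations (id INTEGER, e1 TEXT, e2 TEXT)")
conn.execute("INSERT INTO relations VALUES (1, 'Li3PS4', 'electrolyte')")
store = RelationStore(conn)
rows = store.relations()
```

Repeated calls return the same cached list, so the notebook user never sees a second round-trip to the db.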
What else?
Each parse_sents call requires a new Java process and a 10sec+ delay to load the models.
Desired initial functionalities:
Ideally there would be some simple way to extend so that users could write basic XML/HTML parser modules (e.g. to grab metadata, preserve section structure, etc.) via some python library (e.g. lxml, beautifulsoup). This kind of solution would not be performant, but could potentially be very simple...
Not a big deal, but shows up as -LRB- and -RRB- in MindTagger, which looks a lot like a gene to the underinformed
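The fix is a small display-time mapping from PTB-style bracket tokens back to literal brackets; a minimal sketch (function name is illustrative):

```python
# Sketch: unescape Penn Treebank bracket tokens before display so
# -LRB-/-RRB- don't show up in MindTagger.
PTB_BRACKETS = {"-LRB-": "(", "-RRB-": ")",
                "-LSB-": "[", "-RSB-": "]",
                "-LCB-": "{", "-RCB-": "}"}

def unescape_ptb(tokens):
    return [PTB_BRACKETS.get(t, t) for t in tokens]

tokens = ["Li", "-LRB-", "3", "-RRB-", "PS"]
display = unescape_ptb(tokens)
```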
sudo. Even if they do, we'd still want to use virtualenv.
requirements.txt to list all py dependencies.
Example notebooks are the de facto documentation. Example notebooks are not actually documentation.
We need to output the basic deepdive calibration plots (notebooks are perfect for this!), as well as potentially some other histograms which guide users towards correct error analysis / debugging procedures.
We also need to output the marginals, which is a minor sub-function to add in.
Should we focus on the hold out set? What are the important metrics to users wanting to refine their DSRs?
Support sparse matrices as optional input
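Accepting sparse input mostly means branching on the matrix type instead of densifying; a minimal sketch with an illustrative function name (not any existing ddlite function):

```python
# Sketch: accept either a dense array or a scipy.sparse matrix without
# densifying the sparse case. Function name is illustrative only.
import numpy as np
import scipy.sparse as sp

def column_means(X):
    if sp.issparse(X):
        # .mean on a sparse matrix returns an np.matrix; flatten it
        return np.asarray(X.mean(axis=0)).ravel()
    return np.asarray(X).mean(axis=0)

dense = np.array([[1.0, 0.0], [3.0, 2.0]])
sparse = sp.csr_matrix(dense)
```

Keeping the sparse branch sparse matters for the large, mostly-zero feature matrices these pipelines produce.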
Should give each extraction a unique id
Useful for NER systems, and unary Relations aren't friendly
Refine saving and loading annotation dumps from MindTagger during DSR refinement. I realize "items.csv" is dumped in the MindTagger directory under some unique folder id and tags can be fetched using get_mindtagger_tags() on the MindTagger instance, but the metrics associated with these values should be wrapped up in some sort of "classification_report" type function.
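A sketch of what that "classification_report"-type wrapper might look like, with sklearn.metrics doing the actual work; the flat 0/1 tag format and function name are assumptions about how MindTagger tags would be normalized, not the real API:

```python
# Hypothetical report wrapper over fetched MindTagger tags; the binary
# tag format here is an assumption for illustration.
from sklearn.metrics import precision_recall_fscore_support

def tag_report(gold_tags, pred_tags):
    p, r, f1, _ = precision_recall_fscore_support(
        gold_tags, pred_tags, average="binary")
    return {"precision": p, "recall": r, "f1": f1}

gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
report = tag_report(gold, pred)
```

Wrapping the metrics this way keeps the items.csv / get_mindtagger_tags() plumbing out of the user's error-analysis loop.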
Let the user evaluate how well their rules work if:
E.g. for an even quicker-and-dirtier option / baselines!