Comments (6)
I very much like the idea of this, but I'm not convinced the code/data for so many domains should be rolled up into one super-pip install. Probably we want additional projects (like adapt-data-music), and possibly language specific versions of each. These data sets may be very large, and we want to be respectful of resources on dev boxes as well as end-user devices. I'd be happy to create an adapt-data-music-en repo for you to start playing in, and I'll see if I can find some time to make an adapt-data-weather-en repo to act as an example.
from adapt.
True. But when fetching the entities from Wikidata, there could just be scripts that query the SPARQL endpoint and generate the dictionaries.
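A minimal sketch of what such a script might look like, building (but not sending) a Wikidata SPARQL query. The endpoint URL and the `wdt:P31` instance-of pattern are standard Wikidata SPARQL; the helper names and the example class ID (`wd:Q215380`, musical group) are just illustrative:

```python
from urllib.parse import urlencode

# Standard Wikidata SPARQL endpoint.
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def build_query(wikidata_class: str, limit: int = 100) -> str:
    """Return a SPARQL query selecting English labels of all instances
    of the given Wikidata class (e.g. 'wd:Q215380' for musical groups)."""
    return f"""
    SELECT ?item ?itemLabel WHERE {{
      ?item wdt:P31 {wikidata_class} .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """

def build_url(query: str) -> str:
    """Encode the query as a GET URL requesting JSON output."""
    return WIKIDATA_SPARQL + "?" + urlencode({"query": query, "format": "json"})

url = build_url(build_query("wd:Q215380", limit=10))
```

Sending the request and flattening the JSON bindings into a label-to-ID dictionary would then be a few more lines with `urllib.request` or `requests`.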
Plus there could be a script (or even a function) in adapt that downloads pre-built dictionaries (maybe domain-specific ones if they become too large). That's the way NLTK does it (they have an nltk.download() function).
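A rough sketch of that NLTK-style model, assuming a hypothetical cache directory and function names (nothing here is adapt's actual API): data lives under a well-known directory and is fetched only when the local copy is missing.

```python
import os

# Hypothetical cache directory, analogous to NLTK's nltk_data.
ADAPT_DATA = os.path.expanduser(os.path.join("~", ".adapt_data"))

def data_path(domain: str, base: str = ADAPT_DATA) -> str:
    """Local cache path for a domain's pre-built dictionary."""
    return os.path.join(base, domain + ".trie")

def download(domain: str, fetch, base: str = ADAPT_DATA) -> str:
    """Return the cached file for `domain`, calling `fetch` (a function
    returning the raw bytes) only if no cached copy exists yet."""
    path = data_path(domain, base)
    if not os.path.exists(path):
        os.makedirs(base, exist_ok=True)
        with open(path, "wb") as f:
            f.write(fetch())
    return path
```

Re-running `download("music-en", ...)` would then be a no-op until the cached file is deleted, which avoids re-running the Wikidata queries unnecessarily.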
You can check out a working prototype here: https://github.com/wolfv/adapt/tree/feature-numbers-dates/adapt/tools
I've added the entity_fetcher script and a trie of almost all musicians and bands in wikidata.
The trie is built using marisa-trie, which I think is really good and fast.
The entire trie is only 1.6 MB :)
I had completely forgotten about NLTK's data management model! I definitely like that; we'd want to come up with a standardized way/location for storing the data so that it can be cached locally (as opposed to re-running queries unnecessarily).
As for marisa-trie: that looks like a pretty rockin' trie implementation, but it's missing one major feature of the adapt trie: gather. At least, that appears to be the case from my cursory reading of the marisa-trie Python wrapper. I'm not gonna lie, that is some brutally dense code, and having been out of C++ for 5 years (and never having written Cython bindings), I can't make any true claim of understanding it.
I can however explain my code! The purpose of gather is to allow us to make N passes on an utterance for entity tagging (one pass per token), as opposed to doing an N-Gram expansion on the utterance (which would be N! complexity). Maybe there's a clever way to reimplement (or reverse) that logic so we can use a standard trie implementation, but maintain the performance characteristics? I'm open to suggestions.
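A minimal sketch of that idea (this is not adapt's implementation, just an illustration of the described behavior): one walk per token position, following the trie token-by-token, so an utterance of N tokens needs at most N walks instead of enumerating every n-gram up front.

```python
def build_trie(entities):
    """Nested-dict trie keyed by tokens; '$' marks a complete entity."""
    root = {}
    for name in entities:
        node = root
        for tok in name.lower().split():
            node = node.setdefault(tok, {})
        node["$"] = name
    return root

def gather(trie, tokens, start):
    """Yield (start, end, entity) for every entity beginning at `start`."""
    node = trie
    for end in range(start, len(tokens)):
        node = node.get(tokens[end].lower())
        if node is None:
            return  # no stored entity continues with this token
        if "$" in node:
            yield (start, end + 1, node["$"])

def tag(trie, utterance):
    """One gather pass per token position."""
    tokens = utterance.split()
    return [m for i in range(len(tokens)) for m in gather(trie, tokens, i)]

trie = build_trie(["Blues Brothers", "Blues", "James Brown"])
matches = tag(trie, "play the Blues Brothers")
```

Each walk stops as soon as the trie has no continuation, which is where the savings over blind n-gram expansion come from.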
Good to hear! Yes, definitely: my idea would be to have a download option that fetches the data from somewhere other than Wikidata, because hitting their server with these queries all the time would be quite expensive.
Hmm, if I understand the gather functionality correctly, then my idea would be the following:
Split all names into tokens (e.g. "Blues Brothers" -> "Blues", "Brothers")
Append the ID to each token ("Blues" -> 123, "Brothers" -> 123)
and afterwards one can find the intersection of all entity IDs stored under "Blues" and "Brothers" to find out that they belong together.
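A sketch of the steps above as a token-to-ID inverted index (the entity IDs here are made up for illustration):

```python
from collections import defaultdict

entities = {123: "Blues Brothers", 456: "James Brown", 789: "Moody Blues"}

# Step 1 + 2: split names into tokens and append the ID to each token.
index = defaultdict(set)
for eid, name in entities.items():
    for token in name.lower().split():
        index[token].add(eid)

# Step 3: intersect the ID sets of all tokens in the mention.
def lookup(phrase):
    """IDs whose names contain every token of `phrase`."""
    sets = [index[t] for t in phrase.lower().split()]
    return set.intersection(*sets) if sets else set()
```

One caveat with intersection alone: it ignores token order and adjacency ("Brothers Blues" would resolve to the same ID), so positions would still need to be checked for exact matching.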
But on a related note, I think that 'in' queries, even with n-gram expansion, are so cheap with the Marisa Trie that it doesn't really matter.
Another option might be to use the following function: trie.has_keys_with_prefix(u'fo') to iteratively build up the n-gram expansion.
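A sketch of that prefix-guided expansion, using a sorted list plus bisect as a stdlib stand-in for marisa-trie's has_keys_with_prefix() (the keys are illustrative): an n-gram is grown one token at a time and abandoned as soon as no stored key starts with it.

```python
from bisect import bisect_left

keys = sorted(["blues brothers", "james brown", "moody blues"])

def has_keys_with_prefix(prefix):
    """True if any stored key starts with `prefix` (marisa-trie stand-in)."""
    i = bisect_left(keys, prefix)
    return i < len(keys) and keys[i].startswith(prefix)

def expand(tokens):
    """All n-grams that are actual keys, pruning dead-end prefixes early."""
    found = []
    for start in range(len(tokens)):
        prefix = ""
        for end in range(start, len(tokens)):
            prefix = tokens[end] if not prefix else prefix + " " + tokens[end]
            if not has_keys_with_prefix(prefix):
                break  # no key continues this n-gram; skip longer ones
            i = bisect_left(keys, prefix)
            if i < len(keys) and keys[i] == prefix:
                found.append((start, end + 1, prefix))
    return found

matches = expand("play the blues brothers".split())
```

With marisa-trie itself, `has_keys_with_prefix` and the exact-match check would be the real `trie.has_keys_with_prefix(...)` and `prefix in trie` calls.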
Let me know if this stuff made sense :) However, it will probably be a bit harder to implement matching with edit distance, I guess...
FYI: I still think this is a really interesting idea! I don't believe there's been a ton of progress, but I may revive it in a post-1.0 world. thanks!
Related Issues (20)
- Using keyword "my" results in regex problems
- Possible Regression using two or more regex
- How do I use audio for intent classification? Any code?
- upgrade to latest pyee
- Catching first number with regex fails
- Confusing examples: MultiIntent* examples define unused Parser and EntityTagger
- Trie's `max_threshold` is documented as int, seems to be float
- `ZeroDivisionError` in determine_intent when tags are empty
- Bug causing .optionally regex to not execute, but it works with .required. IntentBuilder
- Entity matching more than it should
- An issue with adapt-parser. Adding new intents is breaking old behavior.
- Adapt react-native
- Add license and test files to PyPI packages
- Consolidate package requirements
- Tooling for debugging Adapt
- Improve the readability of Adapt
- AttributeError when re-registering regex
- IntentDeterminationEngine.determine_intent does not return sorted results
- Github Action: Fix exit status
- Regex entities with optional words