erre-quadro / spikex

License: Apache License 2.0

Topics: spacy spacy-pipes nlp named-entity-recognition entity-linking wikipedia sentence-splitting abbreviations-detection acronym-recognition noun-phrase-extract verb-phrase-extract clustering wikigraph wikipedia-graph

spikex's Introduction

SpikeX - SpaCy Pipes for Knowledge Extraction

SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge extraction tools with almost-zero effort.


What's new in SpikeX 0.5.0

WikiGraph has never been so lightning fast:

  • 🌕 Performance mooning, thanks to the adoption of a sparse adjacency matrix to handle the pages graph, instead of igraph (see the sketch below)
  • 🚀 Memory optimization, with consumption cut by ~40% and compressed size cut by ~20%, thanks to new bidirectional dictionaries for managing data
  • 📖 New APIs for faster and easier usage and interaction
  • 🛠 Overall fixes, for a better graph and better page matching
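
For intuition about the first point, here is a minimal sketch of a pages graph held as a sparse adjacency matrix with scipy; this is purely illustrative and not SpikeX's actual internals:

import numpy as np
from scipy.sparse import csr_matrix

# Toy graph with 4 nodes (pages/categories): a 1 in row i, column j
# means node i links to node j. Only non-zero entries are stored,
# which is what makes this representation compact and fast to traverse.
rows = np.array([0, 1, 2, 3])
cols = np.array([2, 2, 3, 3])
data = np.ones(4, dtype=np.int8)
adj = csr_matrix((data, (rows, cols)), shape=(4, 4))

# The neighbors of node 0 are just the column indices of its non-zeros.
print(adj[0].indices)  # -> [2]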

Pipes

  • WikiPageX links Wikipedia pages to chunks in text
  • ClusterX picks noun chunks in a text and clusters them using Radial Ball Mapper, a revisited version of the Ball Mapper algorithm
  • AbbrX detects abbreviations and acronyms, linking them to their long forms. It is based on scispacy's detector, with improvements
  • LabelX takes labeled pattern-matching expressions and catches them in a text, resolving overlaps, abbreviations and acronyms
  • PhraseX creates a Doc underscore extension based on a custom attribute name and phrase patterns. Examples are NounPhraseX and VerbPhraseX, which extract noun phrases and verb phrases, respectively
  • SentX detects sentences in a text, based on Splitta with refinements

Tools

  • WikiGraph with pages as leaves linked to categories as nodes
  • Matcher that inherits its interface from spaCy's, but is built on a RegEx engine that boosts its performance

Install SpikeX

Some requirements are inherited from spaCy:

  • spaCy version: 2.3+
  • Operating system: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
  • Python version: Python 3.6+ (only 64 bit)
  • Package managers: pip

Some dependencies use Cython, which needs to be installed before SpikeX:

pip install cython

Remember that a virtual environment is always recommended, in order to avoid modifying system state.
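
For example, using the standard venv module (POSIX shell commands; create and activate the environment before installing anything):

python3 -m venv .venv
source .venv/bin/activate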

pip

At this point, installing SpikeX via pip is a one-line command:

pip install spikex

Usage

Prerequisites

SpikeX pipes work with spaCy, hence a model needs to be installed. Follow the official instructions here. The brand new spaCy 3.0 is supported!

WikiGraph

A WikiGraph is built starting from some key components of Wikipedia: pages, categories and relations between them.

Auto

Creating a WikiGraph can take time, depending on how large its Wikipedia dump is. For this reason, we provide WikiGraphs ready to be used:

Date        WikiGraph        Lang  Size (compressed)  Size (memory)
2021-05-20  enwiki_core      EN    1.3GB              8GB
2021-05-20  simplewiki_core  EN    20MB               130MB
2021-05-20  itwiki_core      IT    208MB              1.2GB
More coming...

SpikeX provides a command to shortcut downloading and installing a WikiGraph (Linux or macOS; Windows is not supported yet):

spikex download-wikigraph simplewiki_core

Manual

A WikiGraph can be created from the command line, specifying which Wikipedia dump to take and where to save it:

spikex create-wikigraph \
  <YOUR-OUTPUT-PATH> \
  --wiki <WIKI-NAME, default: en> \
  --version <DUMP-VERSION, default: latest> \
  --dumps-path <DUMPS-BACKUP-PATH>
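
A concrete invocation might look like this (output and dump paths are illustrative):

spikex create-wikigraph wikigraph_raw --wiki en --version latest --dumps-path ./dumps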

Then it needs to be packed and installed:

spikex package-wikigraph \
  <WIKIGRAPH-RAW-PATH> \
  <YOUR-OUTPUT-PATH>

Follow the instructions at the end of the packing process and install the distribution package in your virtual environment. Now you are ready to use your WikiGraph as you wish:

from spikex.wikigraph import load as wg_load

wg = wg_load("enwiki_core")
page = "Natural_language_processing"
categories = wg.get_categories(page, distance=1)
for category in categories:
    print(category)

>>> Category:Speech_recognition
>>> Category:Artificial_intelligence
>>> Category:Natural_language_processing
>>> Category:Computational_linguistics

Matcher

The Matcher is identical to spaCy's, but faster when handling many patterns at once (on the order of thousands), so follow the official usage instructions here.

A trivial example:

from spikex.matcher import Matcher
from spacy import load as spacy_load

nlp = spacy_load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("TEST", [[{"LOWER": "nlp"}]])
doc = nlp("I love NLP")
for _, s, e in matcher(doc):
  print(doc[s: e])

>>> NLP

WikiPageX

The WikiPageX pipe uses a WikiGraph in order to find chunks in a text that match Wikipedia page titles.

from spacy import load as spacy_load
from spikex.wikigraph import load as wg_load
from spikex.pipes import WikiPageX

nlp = spacy_load("en_core_web_sm")
doc = nlp("An apple a day keeps the doctor away")
wg = wg_load("simplewiki_core")
wpx = WikiPageX(wg)
doc = wpx(doc)
for span in doc._.wiki_spans:
  print(span._.wiki_pages)

>>> ['An']
>>> ['Apple', 'Apple_(disambiguation)', 'Apple_(company)', 'Apple_(tree)']
>>> ['A', 'A_(musical_note)', 'A_(New_York_City_Subway_service)', 'A_(disambiguation)', 'A_(Cyrillic)']
>>> ['Day']
>>> ['The_Doctor', 'The_Doctor_(Doctor_Who)', 'The_Doctor_(Star_Trek)', 'The_Doctor_(disambiguation)']
>>> ['The']
>>> ['Doctor_(Doctor_Who)', 'Doctor_(Star_Trek)', 'Doctor', 'Doctor_(title)', 'Doctor_(disambiguation)']

ClusterX

The ClusterX pipe takes noun chunks in a text and clusters them using a Radial Ball Mapper algorithm.

from spacy import load as spacy_load
from spikex.pipes import ClusterX

nlp = spacy_load("en_core_web_sm")
doc = nlp("Grab this juicy orange and watch a dog chasing a cat.")
clusterx = ClusterX(min_score=0.65)
doc = clusterx(doc)
for cluster in doc._.cluster_chunks:
  print(cluster)

>>> [this juicy orange]
>>> [a cat, a dog]

AbbrX

The AbbrX pipe finds abbreviations and acronyms in the text, linking short and long forms together:

from spacy import load as spacy_load
from spikex.pipes import AbbrX

nlp = spacy_load("en_core_web_sm")
doc = nlp("a little snippet with an abbreviation (abbr)")
abbrx = AbbrX(nlp.vocab)
doc = abbrx(doc)
for abbr in doc._.abbrs:
  print(abbr, "->", abbr._.long_form)

>>> abbr -> abbreviation

LabelX

The LabelX pipe matches and labels patterns in a text, resolving overlaps, abbreviations and acronyms.

from spacy import load as spacy_load
from spikex.pipes import LabelX

nlp = spacy_load("en_core_web_sm")
doc = nlp("looking for a computer system engineer")
patterns = [
  [{"LOWER": "computer"}, {"LOWER": "system"}],
  [{"LOWER": "system"}, {"LOWER": "engineer"}],
]
labelx = LabelX(nlp.vocab, [("TEST", patterns)], validate=True, only_longest=True)
doc = labelx(doc)
for labeling in doc._.labelings:
  print(labeling, f"[{labeling.label_}]")

>>> computer system engineer [TEST]

PhraseX

The PhraseX pipe creates a custom Doc underscore extension and fills it with matches from phrase patterns.

from spacy import load as spacy_load
from spikex.pipes import PhraseX

nlp = spacy_load("en_core_web_sm")
doc = nlp("I have Melrose and McIntosh apples, or Williams pears")
patterns = [
  [{"LOWER": "mcintosh"}],
  [{"LOWER": "melrose"}],
]
phrasex = PhraseX(nlp.vocab, "apples", patterns)
doc = phrasex(doc)
for apple in doc._.apples:
  print(apple)

>>> Melrose
>>> McIntosh

SentX

The SentX pipe splits a text into sentences. It modifies the tokens' is_sent_start attribute, so it must be added before the parser pipe in the spaCy pipeline:

from spacy import load as spacy_load
from spikex.pipes import SentX
from spikex.defaults import spacy_version

if spacy_version >= 3:
  from spacy.language import Language

  @Language.factory("sentx")
  def create_sentx(nlp, name):
      return SentX()

nlp = spacy_load("en_core_web_sm")
sentx_pipe = SentX() if spacy_version < 3 else "sentx"
nlp.add_pipe(sentx_pipe, before="parser")
doc = nlp("A little sentence. Followed by another one.")
for sent in doc.sents:
  print(sent)

>>> A little sentence.
>>> Followed by another one.

That's all folks

Feel free to contribute and have fun!

spikex's People

Contributors

chetan8000 · hp0404 · paoloq · tomerre2


spikex's Issues

Cannot download "enwiki_core"

Very insightful package and explanation, thanks a lot!

Now I have encountered a problem. SpikeX provides wikigraphs ready to be used, including enwiki_core, simplewiki_core, and itwiki_core.
However, it seems that the website that stores these packages is gone (404 Not Found), so we cannot download "enwiki_core". How can I access the wikigraph package (i.e., enwiki_core) at the moment?
I would be very grateful if someone could help me download these packages.

Creating dewiki_core

  • spikex version: 0.5.2
  • Python version: 3.9.7
  • Operating System: Windows 10

Description

I tried to create a German wikigraph and got a TypeError from compression_wrapper().

What I Did

spikex create-wikigraph de_wiki_graph --wiki de --dumps-path de_wiki_dumps

Traceback (most recent call last):
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\Scripts\spikex.exe\__main__.py", line 7, in <module>
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\spikex\__main__.py", line 23, in main
    typer.run(commands[command])
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\typer\main.py", line 859, in run
    app()
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\typer\main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\typer\main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\spikex\cli\create.py", line 62, in create_wikigraph
    wg = WikiGraph.build(**kwargs)
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\spikex\wikigraph\wikigraph.py", line 61, in build
    p, r, d, c, cl = _make_graph_components(**kwargs)
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\spikex\wikigraph\wikigraph.py", line 278, in _make_graph_components
    pprops = _get_pprops(**kwargs)
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\spikex\wikigraph\wikigraph.py", line 316, in _get_pprops
    for pageid, prop, value in iter_pprops_data:
  File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\spikex\wikigraph\dumptools.py", line 211, in _parse_wiki_sql_dump
    ) as pbar, compression_wrapper(compress_obj, "rb") as decompress_obj:
TypeError: compression_wrapper() missing 1 required positional argument: 'compression'

Additional Information

https://github.com/RaRe-Technologies/smart_open/blob/develop/smart_open/compression.py -> Line 106:
def compression_wrapper(file_obj, mode, compression):

The current "compression_wrapper" function actually expects another argument called "compression", which is not passed at the moment: compression_wrapper(compress_obj, "rb")

I fixed the error locally by adding the missing argument: compression_wrapper(compress_obj, "rb", 'infer_from_extension')

Incomplete list of categories

  • spikex version: 0.5.2
  • Python version: 3.9.7
  • Operating System: Windows 10

Description

I want to get all categories of a page, but most categories are missing

What I Did

from spikex.wikigraph import load as wg_load

wg = wg_load("dewiki_core")  # the German graph built earlier
page = "Peking_2022"
categories = wg.get_categories(page, distance=1)

What I get: ['Category:Olympische_Winterspiele_2022']
The output I expect: ['Austragung der Olympischen Winterspiele', 'Olympische Winterspiele 2022', 'Sport (Hebei)', 'Sportveranstaltung 2022', 'Sportveranstaltung in Peking', 'Wikipedia:Veraltet nach Jahr 2022', 'Zukünftige Sportveranstaltung']
Proof: https://de.wikipedia.org/wiki/Olympische_Winterspiele_2022

I created a categorylinks dictionary from categorylinks.sql.gz, so that the keys are the page_ids and under each key sits the list of categories. I used your functions to get the page_id (page_id = self.get_pageid(self.redirect(page))) together with my categorylinks dictionary, and with this method I get the expected output. If this behaviour is not intended, I suspect there is a problem with the processing of categorylinks.sql.gz on your side. A sketch of the workaround follows below.
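
A minimal sketch of that workaround, assuming the standard MediaWiki categorylinks schema (cl_from page id first, cl_to category name second; the dump filename is illustrative):

import gzip
import re
from collections import defaultdict

# Grab the first two fields of each tuple in the INSERT statements:
# (cl_from, 'cl_to', ...) -> page id and category name.
row = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*?)'")

categorylinks = defaultdict(list)
with gzip.open("dewiki-latest-categorylinks.sql.gz", "rt",
               encoding="utf-8", errors="replace") as fh:
    for line in fh:
        if not line.startswith("INSERT INTO"):
            continue
        for page_id, category in row.findall(line):
            categorylinks[int(page_id)].append(category)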

Umlauts

  • spikex version: 0.5.2
  • Python version: 3.9.7
  • Operating System: Windows 10

Description

Getting categories for a page with umlauts from my dewiki_core (Cem Özdemir: https://de.wikipedia.org/wiki/Cem_%C3%96zdemir) crashes, which shouldn't happen. There is also an English wiki page for him (https://en.wikipedia.org/wiki/Cem_%C3%96zdemir).

What I Did

from spikex.wikigraph import load as wg_load
wg = wg_load("dewiki_core")
page = "Cem_Özdemir"
categories = wg.get_categories(page, distance=1)

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

abbreviation difference from scispacy

Hi! scispacy developer here. Could you share what changes you made to our abbreviation detector? I am curious what issues you encountered/fixed (and obviously not bothered at all that you based yours on ours).

Exception: invalid data, magic number is not correct

  • spikex version: 0.5.0
  • Python version: 3.6
  • Operating System: Windows 10

Description

Hi, I installed spikex and downloaded enwiki_core. However, when I try to load enwiki_core:

from spikex.wikigraph import load as wg_load
from spikex.pipes import WikiPageX

# load a WikiGraph
wg = wg_load('enwiki_core')

I am getting the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\local\Pathways\nicepaths\lib\site-packages\spikex\wikigraph\wikigraph.py", line 41, in load
    return WikiGraph.load(data_path, meta)
  File "C:\local\Pathways\nicepaths\lib\site-packages\spikex\wikigraph\wikigraph.py", line 81, in load
    wg._wpd = WikiPageDetector.load(data_path)
  File "C:\local\Pathways\nicepaths\lib\site-packages\spikex\wikigraph\wikigraph.py", line 180, in load
    wpd._trie = Trie.from_buff(mmap(bf.fileno(), 0), copy=False)
  File "lib\cyac\trie.pyx", line 1086, in cyac.trie.Trie.from_buff
  File "lib\cyac\trie.pyx", line 1103, in cyac.trie.trie_from_buff
Exception: invalid data, magic number is not correct

The cyac version is 1.3 (the latest one).
Any ideas, please?

Abbrv pipeline errors out

  • spikex version: spikex-0.4.0.dev2 from source / spacy 2.3.5
  • Python version: 3.6
  • Operating System: OSX

Description

Describe what you were trying to get done.

  • I was trying to test the abbrv pipeline

Tell us what happened, what went wrong, and what you expected to happen.

  • Copied the example from README

What I Did

import spacy
from spikex.pipes import AbbrX

nlp = spacy.load("en_core_web_sm")

abbrx = AbbrX(nlp)
nlp.add_pipe(abbrx)
doc = abbrx(nlp("a little snippet with abbreviations (abbrs)"))
doc._.abbrs
205         return (
    206             self.vocab.strings.add(key)
--> 207             if key not in self.vocab.strings
    208             else self.vocab.strings[key]
    209         )

AttributeError: 'English' object has no attribute 'strings'
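
Note that the README example above constructs the pipe from the vocab rather than the language object, so the likely fix is:

abbrx = AbbrX(nlp.vocab)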

How to speed up the process of adding patterns

  • spikex version: 0.5.0
  • Python version:
  • Operating System: linux

Description

Hey, guys. I found your tool very powerful, thanks for sharing.
I ran into a problem: the time cost is huge when trying to add 30 thousand patterns to initialize LabelX.
This process is also much slower than spaCy's; is there any solution you can propose?

spikex download-wikigraph simplewiki_core

Hello,

I tested this command:

spikex download-wikigraph simplewiki_core

In a Jupyter Notebook, it returns:

File "<ipython-input-7-d71a5d9ca149>", line 1 spikex download-wikigraph simplewiki_core ^ SyntaxError: invalid syntax

In Anaconda prompt, it returns:

(base) C:\WINDOWS\system32>spikex download-wikigraph simplewiki_core
Traceback (most recent call last):
  File "c:\users\ludovic\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\ludovic\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Ludovic\anaconda3\Scripts\spikex.exe\__main__.py", line 7, in <module>
  File "c:\users\ludovic\anaconda3\lib\site-packages\spikex\__main__.py", line 23, in main
    typer.run(commands[command])
  File "c:\users\ludovic\anaconda3\lib\site-packages\typer\main.py", line 859, in run
    app()
  File "c:\users\ludovic\anaconda3\lib\site-packages\typer\main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "c:\users\ludovic\anaconda3\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\ludovic\anaconda3\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\users\ludovic\anaconda3\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\ludovic\anaconda3\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\users\ludovic\anaconda3\lib\site-packages\typer\main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "c:\users\ludovic\anaconda3\lib\site-packages\spikex\cli\download.py", line 46, in download_wikigraph
    _run_command(f"wget --quiet --show-progress -O {wg_tar} {wg_url}")
  File "c:\users\ludovic\anaconda3\lib\site-packages\spikex\cli\download.py", line 54, in _run_command
    return run(
  File "c:\users\ludovic\anaconda3\lib\subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "c:\users\ludovic\anaconda3\lib\subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "c:\users\ludovic\anaconda3\lib\subprocess.py", line 1311, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] Le fichier spécifié est introuvable (the system cannot find the file specified)

(base) C:\WINDOWS\system32>

What can I do?

Thank you very much for your help!

labelX pipeline errors out

  • spikex version: 0.5.2
  • Python version: 3.7.11
  • Operating System: google.cloud

Description

Describe what you were trying to get done.

  • I was trying to test the labelX pipeline

Tell us what happened, what went wrong, and what you expected to happen.

  • Copied the example from README

What I Did

from spacy import load as spacy_load
from spikex.pipes import LabelX

nlp = spacy_load("en_core_web_sm")
doc = nlp("looking for a computer system engineer")
patterns = [
  [{"LOWER": "computer"}, {"LOWER": "system"}],
  [{"LOWER": "system"}, {"LOWER": "engineer"}],
]
labelx = LabelX(vocab=nlp.vocab, labelings=("TEST", patterns), validate=True, only_longest=True)
doc = labelx(doc)
for labeling in doc._.labelings:
  print(labeling, f"[{labeling.label_}]")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-da0684755206> in <module>()
      9   [{"LOWER": "system"}, {"LOWER": "engineer"}],
     10 ]
---> 11 labelx = LabelX(vocab=nlp.vocab, labelings=("TEST", patterns), validate=True, only_longest=True)
     12 doc = labelx(doc)
     13 for labeling in doc._.labelings:

/usr/local/lib/python3.7/dist-packages/spikex/pipes/labels.py in __init__(self, vocab, labelings, validate, only_longest)
     32         if not labelings or labelings is None:
     33             return
---> 34         for label, patterns in labelings:
     35             self.add(label, patterns)
     36 

ValueError: too many values to unpack (expected 2)
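
As the README example shows, the labelings argument is expected to be a list of (label, patterns) tuples, so wrapping the tuple in a list should fix this:

labelx = LabelX(nlp.vocab, [("TEST", patterns)], validate=True, only_longest=True)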
