wojciechmula / pyahocorasick


Python module (C extension and plain python) implementing Aho-Corasick algorithm

License: BSD 3-Clause "New" or "Revised" License

C 52.59% Python 37.93% Makefile 0.35% Shell 2.58% HTML 3.27% CSS 3.29%
string-manipulation automaton aho-corasick trie

pyahocorasick's Introduction

pyahocorasick


pyahocorasick is a fast and memory-efficient library for exact or approximate multi-pattern string search, meaning that you can find occurrences of multiple key strings at once in some input text. The string "index" can be built ahead of time and saved (as a pickle) to disk to reload and reuse later. The library provides an ahocorasick Python module that you can use as a plain dict-like Trie, or you can convert a Trie to an automaton for efficient Aho-Corasick search.

pyahocorasick is implemented in C and tested on Python 3.8 and up. It works on 64-bit Linux, macOS and Windows.

The license is BSD-3-Clause. Some utilities, such as the tests and the pure Python automaton, are dedicated to the Public Domain.

Testimonials

Many thanks for this package. Wasn't sure where to leave a thank you note but this package is absolutely fantastic in our application where we have a library of 100k+ CRISPR guides that we have to count in a stream of millions of DNA sequencing reads. This package does it faster than the previous C program we used for the purpose and helps us stick to just Python code in our pipeline.

Miika (AstraZeneca Functional Genomics Centre) #145

Download and source code

You can fetch pyahocorasick from:

The documentation is published at https://pyahocorasick.readthedocs.io/

Quick start

This module is written in C. You need a C compiler installed to compile native CPython extensions. To install:

pip install pyahocorasick

Then create an Automaton:

>>> import ahocorasick
>>> automaton = ahocorasick.Automaton()

You can use the Automaton class as a trie. Add some string keys and their associated value to this trie. Here we associate a tuple of (insertion index, original string) as a value to each key string we add to the trie:

>>> for idx, key in enumerate('he her hers she'.split()):
...   automaton.add_word(key, (idx, key))

Then check if some string exists in the trie:

>>> 'he' in automaton
True
>>> 'HER' in automaton
False

And play with the get() dict-like method:

>>> automaton.get('he')
(0, 'he')
>>> automaton.get('she')
(3, 'she')
>>> automaton.get('cat', 'not exists')
'not exists'
>>> automaton.get('dog')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError

Now convert the trie to an Aho-Corasick automaton to enable Aho-Corasick search:

>>> automaton.make_automaton()

Then search all occurrences of the keys (the needles) in an input string (our haystack).

Here we print the results and just check that they are correct. The Automaton.iter() method returns the results as two-tuples of the end index where a trie key was found in the input string and the associated value for this key. Here the values we stored are tuples of the trie insertion order and the original string:

>>> haystack = '_hershe_'
>>> for end_index, (insert_order, original_value) in automaton.iter(haystack):
...     start_index = end_index - len(original_value) + 1
...     print((start_index, end_index, (insert_order, original_value)))
...     assert haystack[start_index:start_index + len(original_value)] == original_value
...
(1, 2, (0, 'he'))
(1, 3, (1, 'her'))
(1, 4, (2, 'hers'))
(4, 6, (3, 'she'))
(5, 6, (0, 'he'))

You can also create a potentially large automaton ahead of time and pickle it to reload later. Here we just pickle to a string. You would typically pickle to a file instead:

>>> import pickle
>>> pickled = pickle.dumps(automaton)
>>> B = pickle.loads(pickled)
>>> B.get('he')
(0, 'he')

See also:

Documentation

The full documentation including the API overview and reference is published on readthedocs.

Overview

With an Aho-Corasick automaton you can efficiently search all occurrences of multiple strings (the needles) in an input string (the haystack), making a single pass over the input string. With pyahocorasick you can build large automatons and pickle them to reuse them over and over as an indexed structure for fast multi-pattern string matching.

One of the advantages of an Aho-Corasick automaton is that the typical worst-case and best-case runtimes are about the same and depend primarily on the size of the input string and secondarily on the number of matches returned. While this may not be the fastest string search algorithm in all cases, it can search for multiple strings at once and its runtime guarantees make it rather unique. Because pyahocorasick is based on a Trie, it stores shared key prefixes only once, which uses memory efficiently.

A drawback is that it needs to be constructed and "finalized" ahead of time before you can search strings. In the many applications where you search for several pre-defined "needles" in variable "haystacks", this is actually an advantage.

Aho-Corasick automatons are commonly used for fast multi-pattern matching in intrusion detection systems (such as snort), anti-viruses and many other applications that need fast matching against a pre-defined set of string keys.

Internally an Aho-Corasick automaton is typically based on a Trie with extra data for failure links and an implementation of the Aho-Corasick search procedure.
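
As a rough illustration of the trie-plus-failure-links idea described above, here is a minimal pure Python sketch. It is not the library's C implementation: the node layout and the names build_automaton and search are assumptions made only for this example.

```python
from collections import deque


def build_automaton(words):
    """Build a toy Aho-Corasick automaton from an iterable of strings."""
    # Each node holds its children by character, a failure link and outputs.
    nodes = [{"next": {}, "fail": 0, "out": []}]
    for word in words:
        state = 0
        for ch in word:
            if ch not in nodes[state]["next"]:
                nodes.append({"next": {}, "fail": 0, "out": []})
                nodes[state]["next"][ch] = len(nodes) - 1
            state = nodes[state]["next"][ch]
        nodes[state]["out"].append(word)

    # Breadth-first pass: a node's failure link points to the longest proper
    # suffix of its path that is also a path starting at the root.
    queue = deque(nodes[0]["next"].values())
    while queue:
        current = queue.popleft()
        for ch, child in nodes[current]["next"].items():
            queue.append(child)
            fallback = nodes[current]["fail"]
            while fallback and ch not in nodes[fallback]["next"]:
                fallback = nodes[fallback]["fail"]
            fail = nodes[fallback]["next"].get(ch, 0)
            nodes[child]["fail"] = fail
            # Inherit the keys that end at the failure target.
            nodes[child]["out"] = nodes[child]["out"] + nodes[fail]["out"]
    return nodes


def search(nodes, text):
    """Yield (end_index, word) for every occurrence, including overlaps."""
    state = 0
    for index, ch in enumerate(text):
        while state and ch not in nodes[state]["next"]:
            state = nodes[state]["fail"]
        state = nodes[state]["next"].get(ch, 0)
        for word in nodes[state]["out"]:
            yield index, word


nodes = build_automaton(["he", "her", "hers", "she"])
matches = list(search(nodes, "_hershe_"))
print(matches)
```

The search makes a single pass over the input and only follows failure links when the current path cannot be extended, which is where the runtime behaviour described in this overview comes from.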

Behind the scenes the pyahocorasick Python library implements these two data structures: a Trie and an Aho-Corasick string matching automaton. Both are exposed through the Automaton class.

In addition to Trie-like and Aho-Corasick methods and data structures, pyahocorasick also implements dict-like methods: the pyahocorasick Automaton is a Trie, a dict-like structure indexed by string keys, each associated with a value object. You can use this to retrieve an associated value in a time proportional to the length of the string key.

pyahocorasick is available in two flavors:

  • a CPython C-based extension, compatible with Python 3 only. Use the older version 1.4.x for Python 2.7.x and 32-bit support.
  • a simpler pure Python module, compatible with Python 2 and 3. This is only available in the source repository (not on PyPI) under the etc/py/ directory and has a slightly different API.

Unicode and bytes

The type of strings accepted and returned by Automaton methods is either unicode or bytes, depending on a compile-time setting (the preprocessor definition of AHOCORASICK_UNICODE as set in setup.py).

The Automaton.unicode attribute tells you how the library was built. On Python 3, unicode is the default.

Warning

When the library is built with unicode support, an Automaton will store 2 or 4 bytes per letter, depending on your Python installation. When built for bytes, only one byte per letter is needed.
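
To make the warning concrete, the following back-of-the-envelope sketch counts the bytes needed for the letters of one key at each storage width. The encodings are only stand-ins for counting bytes per letter; this is an assumption for illustration and says nothing about the library's actual node overhead.

```python
key = "hers"

# Bytes needed for the letters alone under each storage width.
one_byte_per_letter = len(key.encode("latin-1"))      # bytes build
two_bytes_per_letter = len(key.encode("utf-16-le"))   # 2-byte unicode storage
four_bytes_per_letter = len(key.encode("utf-32-le"))  # 4-byte unicode storage

print(one_byte_per_letter, two_bytes_per_letter, four_bytes_per_letter)
```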

Build and install from PyPI

To install for common operating systems, use pip. Pre-built wheels should be available on PyPI at some point in the future:

pip install pyahocorasick

To build from sources you need to have a C compiler installed and configured, which should be standard on Linux and easy to get on macOS.

To build from sources, clone the git repository or download and extract the source archive.

Install pip (and its setuptools companion) and then run (in a virtualenv of course!):

pip install .

If compilation succeeds, the module is ready to use.

Support

Support is available through the GitHub issue tracker to report bugs or ask questions.

Contributing

You can submit contributions through GitHub pull requests.

  • There is a Makefile with a default target that builds and runs tests.
  • The tests can be run with pip install -e .[testing] && pytest -vvs
  • See also the .github directory for CI tests and workflow

Authors

The initial author and maintainer is Wojciech Muła. Philippe Ombredanne is Wojciech's sidekick: he helps with maintenance, rewrote the documentation, set up the CI servers and did some work to make this module more accessible to end users.

Alphabetic list of authors and contributors:

  • Andrew Grigorev
  • Ayan Mahapatra
  • Bogdan
  • David Woakes
  • Edward Betts
  • Frankie Robertson
  • Frederik Petersen
  • gladtosee
  • INADA Naoki
  • Jan Fan
  • Pastafarianist
  • Philippe Ombredanne
  • Renat Nasyrov
  • Sylvain Zimmer
  • Xiaopeng Xu

and many others!

This library would not be possible without the help of many people, who contributed in various ways. They created pull requests, reported bugs as GitHub issues or via direct messages, proposed fixes, or spent their valuable time on testing.

Thank you.

License

This library is licensed under the very liberal BSD-3-Clause license. Some portions of the code, such as the pure Python automaton and the test code, are dedicated to the public domain.

The full text of the license is available in the LICENSE file.

Other Aho-Corasick implementations for Python you can consider

While pyahocorasick tries to be the finest and fastest Aho Corasick library for Python you may consider these other libraries:

  • Written in pure Python.
  • Poor performance.
  • Written in pure Python.
  • Better performance than py-aho-corasick.
  • Using pypy, ahocorapy's search performance is only slightly worse than pyahocorasick's.
  • Performs additional suffix shortcutting (more setup overhead, less search overhead for suffix lookups).
  • Includes visualization tool for resulting automaton (using pygraphviz).
  • MIT-licensed, 100% test coverage, tested on all major python versions (+ pypy)
  • Written in C. Does not return overlapping matches.
  • Does not compile on Windows (July 2016).
  • No support for the pickle protocol.
  • Written in Cython.
  • Large automaton may take a long time to build (July 2016)
  • No support for a dict-like protocol to associate a value to a string key.
  • Written in C.
  • seems unmaintained (last update in 2005).
  • GPL-licensed.

pyahocorasick's People

Contributors

axvin, ayansinhamahapatra, charlesxu90, dgrunwald, edwardbetts, ei-grad, frankier, frederikp, grrrrrrrrr, guangyi-z, koichiyasuoka, littlebear0729, melsabagh, methane, nathaniel-daniel, pehat, pombredanne, robinchm, smancill, spock, sylvinus, timgates42, tirkarthi, woakesd, wojciechmula, zhu


pyahocorasick's Issues

Segfault in python3

Simon Rosenthal has reported that the following code causes a segfault:

import sys
import ahocorasick
import pdb

#pdb.set_trace()
A = ahocorasick.Automaton()

# add some words to trie
for index, word in enumerate("he her hers she".split()):
    A.add_word(word, (index, word))
A.make_automaton()

# then find all occurrences in string
for item in A.iter("_hershe_"):
    print(item)
#
A = None #### segfault here
sys.exit(0)

Releasing 1.1.5 ?

I have an encoding issue during installation (python3.6/ubuntu 16.04) while installing in 1.1.4 which is fixed by #48 but not in the latest release. Any chance to have soon a release including that fix ? Thx !

Doc on wildcards.... wildcards preprocessor has no support for escape sequence, so wildcards can't match characters ? nor *

In the example 1, https://github.com/WojciechMula/pyahocorasick/blob/7fc453f58e7187fbc5622695b466d0337c4b21f1/README.rst#example
we have for item in A.iter("_hershe_"): to find all occurrences...
I could not find much doc on the significance of this magic underscore... How does this relate to wildcards? Can the underscore or a wildcard also be used as a character in indexed "words"? Where in the code is this underscore processed?

Failed to build from sdist with 1.1.15

https://ci.appveyor.com/project/pombreda/thirdparty/build/job/yp3w6bahsokx2swn

 creating build\temp.win32-3.6
  creating build\temp.win32-3.6\Release
  C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -DAHOCORASICK_UNICODE= -IC:\Python36\include -IC:\Python36\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.14393.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.14393.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.14393.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.14393.0\winrt" /Tcpyahocorasick.c /Fobuild\temp.win32-3.6\Release\pyahocorasick.obj
  pyahocorasick.c
  c:\users\appveyor\appdata\local\temp\1\pip-build-mbzdwcr8\pyahocorasick\windows.h(14): fatal error C1083: Cannot open include file: 'msinttypes/stdint.h': No such file or directory
  error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\cl.exe' failed with exit status 2
  
  ----------------------------------------
  Failed building wheel for pyahocorasick
  Running setup.py clean for pyahocorasick
Failed to build pyahocorasick
ERROR: Failed to build one or more wheels

The msinttypes/stdint.h file appears to be missing from MANIFEST.in.

Windows build in MSVC 2015

Nishant Sharma has reported that the following change is required in windows.h

typedef __int32 int32_t;
typedef unsigned __int32 uint32_t;

in order to build with the latest MSVC. I'll include it in the next release.

Maybe a bug?

I built an AC automaton on top of tens of thousands of keywords, and then I did the following search:

In [28]: m.get('poke')
Out[28]: (4, 'poke')

In [29]: m.get('go')
Out[29]: (2, 'go')

In [30]: m.find_all('pokego', print)
0 (1, 'p')
1 (2, 'po')
1 (1, 'o')
2 (3, 'pok')
2 (2, 'ok')
2 (1, 'k')
3 (4, 'poke')
3 (3, 'oke')
3 (2, 'ke')
3 (1, 'e')

I expect find_all to return match for 'go', but it didn't.

installation under python 2: C compilation is attempted and fails

Hello!

I've tried to install pyahocorasick both using pip and by cloning current master and doing a python setup.py install.
Using python 2.7.9 I see setup trying to compile the C module and failing with

pyahocorasick.c:42:1: error: unknown type name ‘PyModuleDef’
 PyModuleDef ahocorasick_module = {
 ^
pyahocorasick.c:43:2: error: ‘PyModuleDef_HEAD_INIT’ undeclared here (not in a function)

Isn't the module supposed to be compiled only for Python 3?

Cannot build using MinGW

I configured (this is rather painful) a Windows 7 VM to compile Python extensions with MinGW.
Using these https://rnsharp.wordpress.com/2011/12/02/how-to-create-a-python-extension-with-mingw/

Here is what I get

C:\dev\pyahocorasick\pyahocorasick-0a4a98>python setup.py build -c mingw32
running build
running build_ext
building 'ahocorasick' extension
C:\MinGW\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c pyahocorasick.c -o build\temp.win32-2.7\Release\pyahocorasick.o
In file included from pyahocorasick.c:21:0:
utils.c: In function 'pymod_get_string':
utils.c:41:15: warning: pointer targets in assignment differ in signedness [-Wpointer-sign]
         *word = PyString_AS_STRING(obj);
               ^
utils.c: In function 'pymod_parse_start_end':
utils.c:100:3: warning: unknown conversion type character 'z' in format [-Wformat=]
   PyErr_Format(PyExc_IndexError, "start index not in range %zd..%zd", min, max);
   ^
utils.c:100:3: warning: unknown conversion type character 'z' in format [-Wformat=]
utils.c:100:3: warning: too many arguments for format [-Wformat-extra-args]
utils.c:124:3: warning: unknown conversion type character 'z' in format [-Wformat=]
   PyErr_Format(PyExc_IndexError, "end index not in range %zd..%zd", min, max);
   ^
utils.c:124:3: warning: unknown conversion type character 'z' in format [-Wformat=]
utils.c:124:3: warning: too many arguments for format [-Wformat-extra-args]
In file included from pyahocorasick.c:25:0:
Automaton.c: At top level:
Automaton.c:1010:2: error: initializer element is not constant
  PyVarObject_HEAD_INIT(&PyType_Type, 0)
  ^
Automaton.c:1010:2: error: (near initialization for 'automaton_type.ob_type')
In file included from pyahocorasick.c:26:0:
AutomatonItemsIter.c:259:2: error: initializer element is not constant
  PyVarObject_HEAD_INIT(&PyType_Type, 0)
  ^
AutomatonItemsIter.c:259:2: error: (near initialization for 'automaton_items_iter_type.ob_type')
In file included from pyahocorasick.c:27:0:
AutomatonSearchIter.c:243:2: error: initializer element is not constant
  PyVarObject_HEAD_INIT(&PyType_Type, 0)
  ^
AutomatonSearchIter.c:243:2: error: (near initialization for 'automaton_search_iter_type.ob_type')
error: command 'gcc' failed with exit status 1

End position out after supplementary plane character in Windows

This is an issue with Python under windows only. The attached sample illustrates the problem.

I presume this is because Windows python builds are narrow builds (16 bits per character, with supplementary plane characters stored as a surrogate pair).

I can add the 🙈 as a word and I get a match for it, though end is 9!

issue.txt
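
The effect described in this report can be sketched with the standard library alone. The snippet below compares an end index counted in code points with the same position counted in UTF-16 code units, which is how a narrow build counts. The sample text and the helper utf16_units are assumptions for illustration; this is not the attached issue.txt sample.

```python
text = "see no 🙈 evil"
char = "🙈"


def utf16_units(s):
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


# Index of the last code point of the match, as a wide build would count it.
end_code_points = text.index(char) + len(char) - 1

# The same position counted in UTF-16 code units, as a narrow build counts it.
end_utf16 = utf16_units(text[:text.index(char)]) + utf16_units(char) - 1

print(end_code_points, end_utf16)
```

The two end positions differ by one for each supplementary plane character at or before the match, because each such character occupies two code units.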

Memory leaks when unpickling Automaton

How to reproduce:

import pickle
import time

import ahocorasick
import psutil
import requests

automaton = ahocorasick.Automaton()

r = requests.get('https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm')
assert r.ok

for word in r.text.split():
    if word.isalpha():
        automaton.add_word(word.encode('utf8').lower(), word)

automaton.make_automaton()
pickled = pickle.dumps(automaton)

print('Cycles    Free MiB')
for i in range(10000):
    if i % 1000 == 0:
        free = psutil.virtual_memory().free
        print('{:05d}     {}'.format(i, free/1000000))
    unpickled = pickle.loads(pickled)
    time.sleep(0.001)
# Tested on environment:
#
# pyahocorasick==1.1.4
# Python 2.7.9-1
# Linux 3.16.0-4-amd64 Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux

# Results:
#
# Cycles    Free MiB
# 00000     2068
# 01000     2026
# 02000     1984
# 03000     1942
# 04000     1900
# 05000     1858
# 06000     1816
# 07000     1774
# 08000     1732
# 09000     1690

Same on python3 - remove .encode('utf8')

Pickling fails with a pure Python Trie that reaches some depth and hits the recursion limit

This is a well-known issue with pickle: recursive data structures (such as the nested dicts used in py/pyahocorasick.py) do not pickle once they reach a certain depth.

First I comment out the slots line in pyahocorasick.py (so that I do not have to implement the __*state__ methods):
#__slots__ = ['char', 'output', 'fail', 'children']

Then I run this:

$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyahocorasick as pa
>>> for x in range(10):
...  key = str(range(x, x+100))
... 
>>> t=pa.Trie()
>>> for x in range(10):
...  key = str(range(x, x+100))
...  t.add_word(key, x)
... 
>>> import pickle
>>> p=pickle.dumps(t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/pickle.py", line 1374, in dumps
    Pickler(file, protocol).dump(obj)
[...]
  File "/usr/lib/python2.7/pickle.py", line 271, in save
    pid = self.persistent_id(obj)
RuntimeError: maximum recursion depth exceeded

The obvious solution is to increase the recursion limit, but this fails quickly and can exhaust system resources:

>>> import sys
>>> sys.setrecursionlimit(10000)
>>> p=pickle.dumps(t)

Add a few more and it dies too:

>>> for x in range(100):
...  key = str(range(x, x+1000))
...  t.add_word(key, x)
... 
>>> p=pickle.dumps(t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/pickle.py", line 1374, in dumps
    Pickler(file, protocol).dump(obj)
[...]
  File "/usr/lib/python2.7/pickle.py", line 271, in save
    pid = self.persistent_id(obj)
RuntimeError: maximum recursion depth exceeded

The culprit is that the Trie uses nested dictionaries (as opposed to the C Automaton, which uses an array-based Trie structure). One way out would be to have a similar data structure in Python such that pickling works.
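
One possible shape for such a structure, sketched under the assumption that nodes are referenced by integer index in flat lists instead of being nested inside each other. The names new_trie, add_word and get are hypothetical and are not part of the existing pure Python module.

```python
import pickle


def new_trie():
    """Return an empty trie stored as two parallel lists."""
    # children[i] maps a character to a node index; values[i] is node i's value.
    return {"children": [{}], "values": [None]}


def add_word(trie, key, value):
    node = 0
    for ch in key:
        nxt = trie["children"][node].get(ch)
        if nxt is None:
            nxt = len(trie["children"])
            trie["children"].append({})
            trie["values"].append(None)
            trie["children"][node][ch] = nxt
        node = nxt
    trie["values"][node] = value


def get(trie, key, default=None):
    node = 0
    for ch in key:
        node = trie["children"][node].get(ch)
        if node is None:
            return default
    value = trie["values"][node]
    return default if value is None else value


trie = new_trie()
long_key = "ab" * 5000  # a path 10,000 nodes deep
add_word(trie, long_key, 42)

# No node contains another node, so pickle never recurses along the key.
restored = pickle.loads(pickle.dumps(trie))
print(get(restored, long_key))
```

Because every node refers to its children by index, the pickled object graph stays a few levels deep no matter how long the keys are, so the recursion limit does not need to be raised.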

Pickling fails with an Automaton(STORE_INTS)

Using the latest code with Python 2.7 on 64 bits Linux and an Automaton(STORE_INTS) pickling fails.

>>> import ahocorasick as aho
>>> a=aho.Automaton(aho.STORE_INTS)
>>> a.add_word('abc', 12)
True
>>> a.make_automaton()
>>> import cPickle
>>> p=cPickle.dumps(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: NULL object passed to Py_BuildValue

Bug in iter method

Reported by Jonathan Grs:

ac = ahocorasick.Automaton()
ac.add_word(b'S', 1)
ac.make_automaton()
buffer = b'SSS'
ac.iter(buffer, 0, 3) # this causes an error
ac.iter(buffer, 0, 2) # no error, but it misses the last 'S' in the buffer

I think the solution here is to use '>' instead of '>=' in the limits
checking in pymod_parse_start_end but I am not completely sure.

Jonathan kindly supplied the patch, which I will apply in the near future.
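
The off-by-one can be illustrated with a Python sketch of half-open range validation. This is not the C code of pymod_parse_start_end: the function parse_start_end below is a hypothetical analogue, written on the assumption that end is an exclusive bound as in Python slicing.

```python
def parse_start_end(length, start=0, end=None):
    """Validate a half-open [start, end) range over a buffer of `length` items."""
    if end is None:
        end = length
    # end == length is legal for a half-open range, so compare with `>`
    # rather than `>=`.
    if start < 0 or start > length:
        raise IndexError("start index not in range 0..%d" % length)
    if end < 0 or end > length:
        raise IndexError("end index not in range 0..%d" % length)
    return start, end


buffer = b"SSS"
start, end = parse_start_end(len(buffer), 0, 3)
print(buffer[start:end])
```

With the `>` comparison, the range 0..3 is accepted and covers all three bytes of the buffer, while 0..4 is still rejected.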

Need an iter_long() method [was: Question about functionality. ]

Hello, and thanks for your work.

  1. Is it possible to save a trie, in something like the pickle format?
  2. The ahocorasick module by Danny Yoo has this method: search_long(query, [startpos]). Same as search(), except that this searches for the longest leftmost keyword that matches. Does pyahocorasick have something like this method?
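
For readers unfamiliar with the "longest leftmost" semantics requested here, the following is a small sketch of the behaviour. It walks a plain dict-based trie from each start position instead of running a true Aho-Corasick search, and the names build_trie and search_longest are assumptions for this example, not the API of either library.

```python
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word  # the None key marks the end of a keyword
    return root


def search_longest(root, text):
    """Yield (start, end, word) for leftmost-longest, non-overlapping matches."""
    pos = 0
    while pos < len(text):
        node, best = root, None
        for i in range(pos, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if None in node:
                best = (pos, i, node[None])  # keep extending to find a longer key
        if best is None:
            pos += 1
        else:
            yield best
            pos = best[1] + 1  # resume after the match, so matches never overlap


root = build_trie(["he", "her", "hers", "she"])
longest = list(search_longest(root, "_hershe_"))
print(longest)
```

On this input the plain search would also report "he", "her" and "she", while the longest-leftmost variant keeps only "hers" and the trailing "he".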

python setup.py install error

running install
running build
running build_ext
building 'ahocorasick' extension
clang -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Qunused-arguments -Qunused-arguments -DAHOCORASICK_UNICODE= -I/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c pyahocorasick.c -o build/temp.macosx-10.10-x86_64-2.7/pyahocorasick.o
In file included from pyahocorasick.c:25:
./utils.c:90:69: warning: format specifies type 'int' but the argument has type 'ssize_t' (aka 'long')
[-Wformat]
PyErr_Format(PyExc_IndexError, "start index not in range %d..%d", min, max);
~~ ^~~
%zd
./utils.c:90:74: warning: format specifies type 'int' but the argument has type 'ssize_t' (aka 'long')
[-Wformat]
PyErr_Format(PyExc_IndexError, "start index not in range %d..%d", min, max);
~~ ^~~
%zd
./utils.c:113:67: warning: format specifies type 'int' but the argument has type 'ssize_t' (aka 'long')
[-Wformat]
PyErr_Format(PyExc_IndexError, "end index not in range %d..%d", min, max);
~~ ^~~
%zd
./utils.c:113:72: warning: format specifies type 'int' but the argument has type 'ssize_t' (aka 'long')
[-Wformat]
PyErr_Format(PyExc_IndexError, "end index not in range %d..%d", min, max);
~~ ^~~
%zd
In file included from pyahocorasick.c:29:
In file included from ./Automaton.c:948:
./Automaton_pickle.c:145:16: warning: cast to 'TrieNode *' (aka 'struct TrieNode *') from smaller integer
type 'int' [-Wint-to-pointer-cast]
dump->fail = (TrieNode*)(NODEID(tmp)->id);
^
./Automaton_pickle.c:154:12: warning: cast to 'TrieNode *' (aka 'struct TrieNode *') from smaller integer
type 'int' [-Wint-to-pointer-cast]
arr[i] = (TrieNode*)(NODEID(child)->id); // save id of child node
^
pyahocorasick.c:42:1: error: unknown type name 'PyModuleDef'
PyModuleDef ahocorasick_module = {
^
pyahocorasick.c:43:2: error: use of undeclared identifier 'PyModuleDef_HEAD_INIT'
PyModuleDef_HEAD_INIT,
^
pyahocorasick.c:59:11: warning: implicit declaration of function 'PyModule_Create' is invalid in C99
[-Wimplicit-function-declaration]
module = PyModule_Create(&ahocorasick_module);
^
pyahocorasick.c:61:3: error: void function 'PyInit_ahocorasick' should not return a value [-Wreturn-type]
return NULL;
^ ~~~~
pyahocorasick.c:65:3: error: void function 'PyInit_ahocorasick' should not return a value [-Wreturn-type]
return NULL;
^ ~~~~
pyahocorasick.c:90:2: error: void function 'PyInit_ahocorasick' should not return a value [-Wreturn-type]
return module;
^ ~~~~~~
7 warnings and 5 errors generated.
error: command 'clang' failed with exit status 1

Pickle rewrite - help needed

Bugs related to pickling are recurrent and annoy everybody; sometimes a bug crashes the interpreter, which is completely unacceptable. I tried my best to track the problem(s) down, but I failed. Moreover, last year was tough for me (I was ill, then I bought and renovated a flat, and finally recent changes at my former company forced me to look for a new job) and as a result I couldn't spend much time on side projects.

This project is pretty popular, and it would be great if somebody helped with a pickling algorithm. IMO the best option is to trash the current one and start over.

Store sequences of integers

It would be great if we could store sequences of integers over a well defined range in an automaton instead of just plain bytes or unicode.
Here is a simple python proof of concept (derived from the pure Python trie)

# -*- coding: utf-8 -*-

# Copyright (c) 2011-2014 Wojciech Mula
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or
# without modification, are permitted provided that the following
# conditions are met:
#
# * Redistributions of source code must retain the above copyright
#   notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above
#   copyright notice, this list of conditions and the following
#   disclaimer in the documentation and/or other materials
#   provided with the distribution.
# * Neither the name of the Wojciech Mula nor the names of its
#   contributors may be used to endorse or promote products
#   derived from this software without specific prior written
#   permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
# CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
# AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
# IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
# THE POSSIBILITY OF SUCH DAMAGE.

"""
Aho-Corasick string search algorithm.

Author    : Wojciech Mula, [email protected]
WWW       : http://0x80.pl
License   : public domain
"""

from __future__ import print_function, absolute_import

from collections import deque


# used to distinguish from None
NIL = -1

# attributes index in the node
_key = 0
_val = 1
_fail = 2
_kids = 3

def TrieNode(key):
    """
    Build a new node as a simple list: [key, value, fail, kids]
    - key is an integer
    - value is an arbitrary object associated with this node
    - failure link used by Aho-Corasick automaton
    - kids is a mapping of children
    """
    # [keyitem:0, value:1, fail:2, _kids:3]
    return [key, -1, -1, {}]


class Trie(object):
    """
    Trie/Aho-Corasick automaton specialized to store integer sequences.
    """

    def __init__(self, items_range):
        """
        Initialize a Trie for storing sequences of contiguous integer items
        taken from range(`items_range`).
        """
        self.root = TrieNode([])
        self.items_range = items_range

    def __get_node(self, seq):
        """
        Return a final node or None if the trie does not contain the sequence of
        integers.
        """
        node = self.root
        for key in seq:
            try:
                # note: kids may be None
                node = node[_kids][key]
            except KeyError:
                return None
        return node

    def get(self, seq, default=-1):
        """
        Return the value associated with the sequence of integers. If the
        sequence of integers is not present in trie return the `default` if
        provided or raise a KeyError if not provided.
        """
        node = self.__get_node(seq)
        value = -1
        if node:
            value = node[_val]

        if value == -1:
            if default == -1:
                raise KeyError()
            else:
                return default
        else:
            return value

    def iterkeys(self):
        return (k for k, _v in self.iteritems())

    def itervalues(self):
        return (v for _k, v in self.iteritems())

    def iteritems(self):
        L = []

        def walk(node, s):
            s = s + [node[_key]]
            if node[_val] != -1:
                L.append((s, node[_val]))

            # FIXME: this is using recursion rather than a stack
            for child in node[_kids].values():
                if child is not node:
                    walk(child, s)

        walk(self.root, [])
        return iter(L)

    def __len__(self):
        stack = deque()
        stack.append(self.root)
        n = 0
        while stack:
            node = stack.pop()
            if node[_val] != -1:
                n += 1
            for child in node[_kids].values():
                stack.append(child)
        return n

    def add(self, seq, value):
        """
        Add a sequence of integers and its associated value to the trie. If `seq`
        already exists in the trie, its value is replaced by `value`.
        """
        if not seq:
            return

        node = self.root
        for key in seq:
            try:
                # note: kids may be None
                node = node[_kids][key]
            except KeyError:
                n = TrieNode(key)
                node[_kids][key] = n
                node = n

        # only assign the value to the last item of the sequence
        node[_val] = value

    def clear(self):
        """
        Clears trie.
        """
        self.root = TrieNode([])


    def exists(self, seq):
        """
        Return True if the sequence of integers is present in the trie.
        """
        node = self.__get_node(seq)
        if node:
            return bool(node[_val] != -1)
        else:
            return False

    def match(self, seq):
        """
        Return True if the sequence of items is a prefix of any existing
        sequence of items in the trie.
        """
        return self.__get_node(seq) is not None

    def make_automaton(self):
        """
        Convert the trie to an Aho-Corasick automaton adding the failure links.
        """
        queue = deque()

        #1. create top root kids over the items range, failing to root
        for item in range(self.items_range):
            # self.content is either int or chr
            # item = self.content(i)
            if item in self.root[_kids]:
                node = self.root[_kids][item]
                # f(s) = 0
                node[_fail] = self.root
                queue.append(node)
            else:
                self.root[_kids][item] = self.root

        #2. using the queue of all possible items, walk the trie and add failure links
        while queue:
            current = queue.popleft()
            for node in current[_kids].values():
                queue.append(node)
                state = current[_fail]
                while node[_key] not in state[_kids]:
                    state = state[_fail]
                node[_fail] = state[_kids].get(node[_key], self.root)

    def search(self, seq):
        """
        Yield all matches of `seq` sequence of integers in the automaton
        performing an Aho-Corasick search. This includes overlapping matches.

        The returned tuples are: (matched end index in seq, [associated values, ...])
        such that the actual matched sub-sequence is: seq[end_index - n + 1:end_index + 1]
        """
        state = self.root
        for index, key in enumerate(seq):
            # find the first failure link and next state
            while key not in state[_kids]:
                state = state[_fail]

            # follow kids or get back to root
            state = state[_kids].get(key, self.root)
            tmp = state
            value = []
            while tmp != -1:
                if tmp[_val] != -1:
                    value.append(tmp[_val])
                tmp = tmp[_fail]
            if value:
                yield index, value

    def search_long(self, seq):
        """
        Yield all longest non-overlapping matches of the `seq` sequence of
        integers in the automaton performing an Aho-Corasick search such that
        when matches overlap, only the longest is returned.

        Note that because of the original index construction, two matches cannot
        be the same as no two rules are identical.

        The returned tuples are: (matched end index in seq, [associated values, ...])
        such that the actual matched sub-sequence is: seq[end_index - n + 1:end_index + 1]
        """
        state = self.root
        last = None

        index = 0
        while index < len(seq):
            item = seq[index]

            if item in state[_kids]:
                state = state[_kids][item]
                if state[_val] != -1:
                    # save the last node on the path
                    last = index, [state[_val]]
                index += 1
            else:
                if last:
                    # return the saved match
                    yield last
                    # and start over since we do not want overlapping results
                    # Note: this leads to quadratic complexity in the worst case
                    # last is (end_index, [value]): resume right after end_index
                    index = last[0] + 1
                    state = self.root
                    last = None
                else:
                    # if no output, perform classic Aho-Corasick algorithm
                    while item not in state[_kids]:
                        state = state[_fail]
        # last match if any
        if last:
            yield last
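
As a compact illustration of the steps used by make_automaton and search above (insert the sequences, add failure links breadth-first, then scan the input), a self-contained sketch might look like the following. The table-based layout and the names build_automaton and search_all are assumptions made for this example; they are neither the class shown above nor the C extension's API. Unlike the code above, this sketch merges the values reachable through failure links at build time instead of walking the failure chain during the search.

```python
from collections import deque


def build_automaton(patterns):
    """Build (goto, fail, out) tables from a {tuple_of_ints: value} mapping."""
    goto = [{}]   # goto[state][item] -> next state; state 0 is the root
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> values of every pattern ending at this state

    # 1. Insert every pattern into the trie.
    for seq, value in patterns.items():
        state = 0
        for item in seq:
            if item not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][item] = len(goto) - 1
            state = goto[state][item]
        out[state].append(value)

    # 2. Breadth-first walk. Depth-1 states fail to the root; deeper states
    #    follow their parent's failure chain until the item can be consumed.
    queue = deque(goto[0].values())
    while queue:
        current = queue.popleft()
        for item, child in goto[current].items():
            queue.append(child)
            state = fail[current]
            while state and item not in goto[state]:
                state = fail[state]
            fail[child] = goto[state].get(item, 0)
            # Inherit the values reachable through the failure link.
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def search_all(automaton, seq):
    """Yield (end_index, values) for each position where patterns end."""
    goto, fail, out = automaton
    state = 0
    for index, item in enumerate(seq):
        while state and item not in goto[state]:
            state = fail[state]
        state = goto[state].get(item, 0)
        if out[state]:
            yield index, out[state]


automaton = build_automaton({(1, 2, 3): "abc", (2, 3): "bc", (3,): "c"})
for end_index, values in search_all(automaton, [0, 1, 2, 3, 4]):
    print(end_index, values)
```

Overlapping matches that end at the same position are reported together, longest pattern first.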

Links to projects using pyahocorasick

This could go either in the README or, better, in a wiki page. With a new release coming up, it would make sense to provide some pointers to examples of real-world usage.

I collected these few links to projects that use pyahocorasick:

_pickle.load fails for more than 64 words

Hello, I don't know whether this is connected to #50, so I created a new issue:

This works ok:

import ahocorasick
import _pickle

A = ahocorasick.Automaton()
for i in range(0, 64):
    A.add_word(str(i), (i, i))
_pickle.dump(A, open('aho', 'wb'))
_pickle.load(open('aho', 'rb'))
#<ahocorasick.Automaton object at 0x7ff51acc4a58>

And this fails every time:

import ahocorasick
import _pickle

A = ahocorasick.Automaton()
for i in range(0, 65):
    A.add_word(str(i), (i, i))
_pickle.dump(A, open('aho', 'wb'))
_pickle.load(open('aho', 'rb'))
#---------------------------------------------------------------------------
#ValueError                                Traceback (most recent call last)
#<ipython-input-129-f886db783629> in <module>()
#      3     A.add_word(str(i), (i, i))
#      4 _pickle.dump(A, open('aho', 'wb'))
#----> 5 _pickle.load(open('aho', 'rb'))

#ValueError: binary data truncated (2)

python --version
# Python 3.6.2

pip list | grep aho
# pyahocorasick (1.1.4)

Python 3.4.3 segmentation fault on ahocorasick.clear()

The following code is causing a segmentation fault with Python 3.4.3. The segmentation fault manifests after the call to clear().

import ahocorasick
A = ahocorasick.Automaton()
for index, word in enumerate("he her hers she".split()):
   A.add_word(word, (index, word))
A.clear()

UnicodeDecodeError install with python3.5 on ubuntu docker

Python 3.5.2

Running

uname -a

I get

Linux 0dd086988daa 4.9.49-moby #1 SMP Wed Sep 27 23:17:17 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Running

pip3 install pyahocorasick

I get

Collecting pyahocorasick==1.0
  Downloading pyahocorasick-1.0.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-11rm5xis/pyahocorasick/setup.py", line 44, in <module>
        long_description = get_readme(),
      File "/tmp/pip-build-11rm5xis/pyahocorasick/setup.py", line 6, in get_readme
        return f.read()
      File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1268: ordinal not in range(128)

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-11rm5xis/pyahocorasick/

Update:

Got it to work by installing from source.

RUN cd /tmp; git clone https://github.com/WojciechMula/pyahocorasick/; cd pyahocorasick; pip3 install .

But I guess that's 1.1.5.dev1? Can this be pushed up to PyPI?

By the way, awesome lib: https://stackoverflow.com/questions/192957/efficiently-querying-one-string-against-multiple-regexes/47319512#47319512 worked great, and I'm going to use it a lot more.

Killed while creating huge automation

I'm trying to create a huge automaton based on a data dump from Wikidata. I have a dictionary with 14234049 keys, which correspond to all English and Swedish labels of all entities there.

Here's my code:

dictionary = {"Belgium": [{"en": "Q31"}], ... }  # HUGE dictionary with over 14 million items
automation = ahocorasick.Automaton()
for word, matches in dictionary.items():
    automation.add_word(word, (word, matches))
automation.make_automaton()

The output I get is simply "Killed: 9".

I'm running Python 3.5 on macOS 10.12.1, on a MacBook Pro with 16 GB of RAM.

calling get before first add_word crashes

There is no root in the trie until add_word is called; therefore, trienode_get_next fails its assert.

import ahocorasick
a = ahocorasick.Automaton()
a.get('foo', None)

results in

trienode.c:trienode_get_next:33 - node failed!

Failure links error

This bug has been reported by Spiros Antonatos.

I have the following piece of code, but it seems that pyahocorasick's failure node implementation is buggy:

import pyahocorasick

patterns = ['GT-C3303','SAMSUNG-GT-C3303K/']
text = 'SAMSUNG-GT-C3303i/1.0 NetFront/3.5 Profile/MIDP-2.0 Configuration/CLDC-1.1'

pmatch_tree = pyahocorasick.Trie()
for pattern in patterns:
  ret = pmatch_tree.add_word(pattern, (0, pattern))

pmatch_tree.make_automaton()

for res in pmatch_tree.iter(text):
  print res
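
The report does not include the output that was observed. As a reference point, a brute-force scan shows what a correct automaton would be expected to report for these inputs: the shorter pattern should be found once, even though the text first follows the path of the longer pattern and only diverges at the character 'i'. The helper naive_matches is hypothetical and written for this illustration only; it is not part of the library.

```python
def naive_matches(patterns, text):
    """Return sorted (end_index, pattern) pairs found by plain substring search."""
    found = []
    for pattern in patterns:
        start = text.find(pattern)
        while start != -1:
            # Report the index of the last matched character, as the
            # automaton search shown earlier on this page does.
            found.append((start + len(pattern) - 1, pattern))
            start = text.find(pattern, start + 1)
    return sorted(found)


patterns = ['GT-C3303', 'SAMSUNG-GT-C3303K/']
text = 'SAMSUNG-GT-C3303i/1.0 NetFront/3.5 Profile/MIDP-2.0 Configuration/CLDC-1.1'

expected = naive_matches(patterns, text)
print(expected)
```

Comparing this list with the automaton's results is a quick way to tell whether a match reachable only through a failure link is being dropped.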

GCC warnings with latest 1.1.5

A build gives this:

creating build/temp.linux-x86_64-3.6
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DAHOCORASICK_UNICODE= -I/home/pombreda/.pyenv/versions/3.6.1/include/python3.6m -c pyahocorasick.c -o build/temp.linux-x86_64-3.6/pyahocorasick.o
In file included from pyahocorasick.c:22:0:
utils.c: In function ‘pymod_get_string’:
utils.c:49:19: warning: pointer targets in assignment differ in signedness [-Wpointer-sign]
             *word = PyUnicode_AS_UNICODE(obj);
                   ^
utils.c:44:8: warning: unused variable ‘bytes’ [-Wunused-variable]
  char* bytes;
        ^
utils.c:43:10: warning: unused variable ‘i’ [-Wunused-variable]
  ssize_t i;
          ^
In file included from pyahocorasick.c:23:0:
trienode.c: In function ‘trienode_get_next’:
trienode.c:37:14: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  for (i=0; i < node->n; i++)
              ^
In file included from pyahocorasick.c:24:0:
trie.c: In function ‘trie_add_word’:
trie.c:30:14: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  for (i=0; i < wordlen; i++) {
              ^
trie.c: In function ‘trie_traverse_aux’:
trie.c:128:14: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  for (i=0; i < node->n; i++) {
              ^
In file included from pyahocorasick.c:26:0:
Automaton.c: In function ‘automaton_add_word’:
Automaton.c:250:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
      if (integer == -1 and PyErr_Occurred())
                  ^
Automaton.c: In function ‘clear_aux’:
Automaton.c:337:15: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (i=0; i < node->n; i++) {
               ^
Automaton.c: In function ‘automaton_make_automaton’:
Automaton.c:547:14: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  for (i=0; i < automaton->root->n; i++) {
              ^
Automaton.c:573:15: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (i=0; i < node->n; i++) {
               ^
Automaton.c: In function ‘get_stats_aux’:
Automaton.c:961:14: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  for (i=0; i < node->n; i++)
              ^
Automaton.c: In function ‘dump_aux’:
Automaton.c:1047:14: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  for (i=0; i < node->n; i++) {
              ^
In file included from Automaton.c:1133:0,
                 from pyahocorasick.c:26:
Automaton_pickle.c: In function ‘pickle_dump_save’:
Automaton_pickle.c:155:14: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  for (i=0; i < node->n; i++) {
              ^
Automaton_pickle.c: In function ‘automaton_unpickle’:
Automaton_pickle.c:364:14: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  for (i=1; i < id; i++) {
              ^
Automaton_pickle.c:397:15: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (i=1; i < id; i++) {
               ^
In file included from pyahocorasick.c:27:0:
AutomatonItemsIter.c: In function ‘automaton_items_iter_next’:
AutomatonItemsIter.c:207:6: warning: pointer targets in passing argument 1 of ‘PyUnicode_FromUnicode’ differ in signedness [-Wpointer-sign]
      return PyUnicode_FromUnicode(iter->buffer + 1, item->depth);
      ^
In file included from /home/pombreda/.pyenv/versions/3.6.1/include/python3.6m/Python.h:77:0,
                 from common.h:14,
                 from pyahocorasick.c:13:
/home/pombreda/.pyenv/versions/3.6.1/include/python3.6m/unicodeobject.h:688:23: note: expected ‘const Py_UNICODE *’ but argument is of type ‘uint32_t *’
 PyAPI_FUNC(PyObject*) PyUnicode_FromUnicode(
                       ^
In file included from pyahocorasick.c:22:0:
pyahocorasick.c: At top level:
utils.c:111:1: warning: ‘pymod_get_string_from_tuple’ defined but not used [-Wunused-function]
 pymod_get_string_from_tuple(PyObject* tuple, int index, TRIE_LETTER_TYPE** word, ssize_t* wordlen) {
 ^
utils.c:169:1: warning: ‘pymod_get_sequence_from_tuple’ defined but not used [-Wunused-function]
 pymod_get_sequence_from_tuple(PyObject* tuple, int index, TRIE_LETTER_TYPE** word, ssize_t* wordlen) {
 ^
creating build/lib.linux-x86_64-3.6

This is likely innocuous, though.

Failed install on Windows 10 with Python 3.5.2

I'm not able to install from the command line using

py -m pip install ahocorasick

The compile is failing, I've attached the error messages I'm getting.

I have Visual Studio Professional 2015 installed and Cython 0.25.2. Any pointers on what I may be doing wrong would be gratefully received!

ahocorasick-compile-error.txt

Invalid pickle file generated: "ValueError: binary data truncated (1)"

I've managed to create an automaton and then pickle it to a 286 MB pickle file. The problem is that when I try to unpickle it, I get this error:

$ python -m pickle wikidata-automation.pickle 
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/pickle.py", line 1605, in <module>
    obj = load(f)
ValueError: binary data truncated (1)

The source of that error is here: https://github.com/WojciechMula/pyahocorasick/blob/master/Automaton_pickle.c#L309

Would you mind helping me troubleshoot this? Any ideas? I don't think I can send you files this big.

Update: This is how I build the pickle file:

automaton = ahocorasick.Automaton()
for i, (label, id_) in enumerate(generator):
    automaton.add_word(label, id_)

automaton.make_automaton()

with open(filename_out, "wb") as f:
    pickle.dump(automaton, f)

Where generator just runs yield ("Belgium", "Q31").

Search through an mmap()-ed file?

Hi,

I'm trying to use this module to search through a file that I've mmap()-ed, following one of the examples. So roughly something like this:

f = open(filename, "rb")
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
A = ahocorasick.Automaton()
A.add_word("foo")
A.add_word("bar\x00")
A.add_word("baz")
A.make_automaton()
for item in A.iter(m):
    print(item)

I get this error:

Traceback (most recent call last):
  File "./ac.py", line 15, in func
    for item in A.iter(m):
TypeError: string required

Is this possible? I was hoping to search through large amounts of "unclean" input, including searching for binary values like "bar\x00".

Support for duplicate keys

It seems that when a key is added a second time, a duplicate is not added; instead, the value associated with that key is replaced. Storing both would be very useful for my use case, and I'm sure for others, and I've seen this in other implementations.

I could handle this up front but it means spending quite a bit of time in Python and so would be slow. I'd be willing to help with a change, if people generally agreed it would be positive, but might need some pointers.

Thanks for an awesome, stable and blazingly fast library!
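
For context, the "handle this up front" workaround mentioned above could be sketched as follows: group the values per key in plain Python first, then add each key once with the whole list as its value. The helper name group_values and the sample pairs are assumptions for illustration; only add_word(key, value) comes from the library.

```python
from collections import defaultdict


def group_values(pairs):
    """Collect every value seen for a key into one list, preserving order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


pairs = [("he", 1), ("she", 2), ("he", 3)]
grouped = group_values(pairs)
print(grouped)

# Each key would then be added a single time, for example:
#   for key, values in grouped.items():
#       A.add_word(key, values)   # A is an ahocorasick.Automaton
```

This costs one extra pass over the input in Python, which is the overhead the request hopes to avoid.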

Word segmentation on strings without spaces

Could Aho-Corasick (this module in particular) be used for such a task? I am asking because I have not used the algorithm before and could not find an example showcasing this use.

For example:

"hellotherehowareyou" --> "hello there how are you"

With dynamic programming you could do this in O(N^2). Can we do better? Aho-Corasick seems like the way to go?

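A minimal sketch of the dynamic-programming formulation mentioned in the question is shown below, assuming a known word list. The function name segment and the sample vocabulary are hypothetical. A plain set lookup stands in for the matching step; with an automaton, the words ending at each position would instead come from the (end_index, value) pairs that A.iter(text) yields, which avoids testing every candidate substring.

```python
def segment(text, words):
    """Split `text` into words from `words`, or return None if impossible."""
    max_len = max(map(len, words), default=0)
    # best[i] holds one segmentation of text[:i], or None if there is none.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        # Try the longest candidate word ending at `end` first.
        for start in range(max(0, end - max_len), end):
            if best[start] is not None and text[start:end] in words:
                best[end] = best[start] + [text[start:end]]
                break
    return best[len(text)]


vocabulary = {"hello", "there", "how", "are", "you", "the"}
print(segment("hellotherehowareyou", vocabulary))
```

This variant performs at most N times max_len lookups, where max_len is the longest word in the vocabulary, and returns a single segmentation rather than all of them.
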
Method to manually destroy automaton?

Hi,

Is there a method I can call, or some other way, to manually ensure the Automaton is destroyed? In my case I am repeatedly creating an Automaton assigned to the same variable and searching with it, but memory seems to accumulate, which should not be the case if the previous Automaton was automatically garbage collected. Is there a way I could perhaps trigger the garbage collection?

Thanks,
Ayaan

Possible memory leak that can be

Problem reported by Jonathan Grs:

In addition, I have encountered a possible memory leak that can be
reproduced very easily using the following code (at least on my Debian
machine with either Python 3.2 or 3.4):

import ahocorasick
import sys

ac = ahocorasick.Automaton()
ac.add_word(b'SSSSS', 1)
ac.make_automaton()

with open('somefile', 'rb') as f:
    data = f.read()

for loop in range(1000):
    for start in range(0, len(data) - 20):
        ac.iter(data, start)
    print('.', end='')
    sys.stdout.flush()

After some debugging I singled out pymod_parse_start_end in utils.c as the
cause.
It seems PyNumber_Index adds a reference to obj, so to resolve the leak I
added 'Py_DECREF(obj);' after obj is not needed any more (once for start
and once for end).

Can't not run Example in Win7 64bit system

Hi, I am new to Python. I installed pyahocorasick on a Win7 64-bit system and tried to run the example in the README, but it fails with the error message "AttributeError: 'module' object has no attribute 'Automaton'". I would like to know how to install pyahocorasick correctly.

Unicode support

Excerpt from another e-mail from Simon Rosenthal:

However, our requirement is to be able to locate substrings in Japanese text rather than English. When I
loaded a set of Japanese strings, I got this error message after the following statement:
A.make_automaton()
... Automaton.c:automaton_make_automaton:508 - state failed!
I'm wondering if this is any way related to UTF-8 vs Ascii ? I know Python 3 is fully UTF-8 compliant, but the
C code may not be handling Japanese multibyte characters properly.
