biterm's Introduction

Biterm Topic Model

This is a simple Python implementation of the Biterm Topic Model (BTM), a topic model that performs well on short-text classification. It explicitly models word co-occurrence patterns (biterms) across the whole corpus, which addresses the sparsity of word co-occurrence at the document level.
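
To make the idea concrete, here is a minimal sketch of what a biterm is: an unordered pair of distinct words that co-occur in the same short text. The function name `extract_biterms` and the toy corpus are illustrative assumptions, not part of the biterm package.

```python
from collections import Counter
from itertools import combinations

def extract_biterms(tokens):
    """Return every unordered pair of distinct words in one short text."""
    unique = sorted(set(tokens))
    return list(combinations(unique, 2))

# Hypothetical toy corpus of three very short documents.
corpus = [
    "apple store opens".split(),
    "apple store closes".split(),
    "banana price rises".split(),
]

# Pooling biterms over the whole corpus yields co-occurrence counts that
# a single three-word document cannot provide on its own.
corpus_counts = Counter(b for doc in corpus for b in extract_biterms(doc))
print(corpus_counts[("apple", "store")])
```

Counting at corpus level is what lets the model see that "apple" and "store" belong together even though each document is only three words long.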

Simply install by:

pip install biterm

Load some short texts and vectorize them via sklearn.

    from sklearn.feature_extraction.text import CountVectorizer

    texts = open('./data/reuters.titles').read().splitlines()[:50]
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(texts).toarray()

Get the vocabulary and the biterms from the texts.

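    import numpy as np
    from biterm.utility import vec_to_biterms

    # get_feature_names() was removed in scikit-learn 1.2;
    # on scikit-learn < 1.0 use vec.get_feature_names() instead
    vocab = np.array(vec.get_feature_names_out())
    biterms = vec_to_biterms(X)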
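
As a rough picture of what this step produces, the sketch below builds one list of word-index pairs per document from a count matrix. It is an assumption about the general shape of the output, not the source of `vec_to_biterms`; the helper name `rows_to_biterms` and the toy matrix are hypothetical.

```python
from itertools import combinations

def rows_to_biterms(X):
    """For each row of a document-term count matrix, pair up the indices
    of all words that occur at least once in that document."""
    biterms = []
    for row in X:
        present = [j for j, count in enumerate(row) if count > 0]
        biterms.append(list(combinations(present, 2)))
    return biterms

# Toy matrix: 2 documents over a 4-word vocabulary.
X_toy = [
    [1, 0, 2, 1],
    [0, 1, 1, 0],
]
print(rows_to_biterms(X_toy))
```

Working with vocabulary indices rather than strings keeps the biterms compact, and the `vocab` array maps indices back to words when needed.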

Create a BTM and pass the biterms to train it.

    from biterm.btm import oBTM

    btm = oBTM(num_topics=20, V=vocab)
    topics = btm.fit_transform(biterms, iterations=100)
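
For intuition about the per-document topic vectors, the BTM paper derives a document's topic mixture by averaging the topic posteriors of its biterms: P(z|d) = sum over b of P(z|b) P(b|d), with P(z|b) proportional to theta_z * phi_{w1|z} * phi_{w2|z}. The sketch below illustrates that formula on made-up parameters. The names `doc_topics`, `theta`, and `phi` are illustrative, the fallback for documents without biterms is an assumption, and the package's own `transform` may differ in detail.

```python
def doc_topics(doc_biterms, theta, phi):
    """Average P(z | biterm) over a document's biterms.

    Repeated biterms in the list count multiple times, which corresponds
    to using their empirical frequency as P(b | d).
    """
    num_topics = len(theta)
    if not doc_biterms:
        # Assumption: a text with no biterms (e.g. one word) falls back
        # to the corpus-level topic mixture.
        return list(theta)
    mix = [0.0] * num_topics
    for w1, w2 in doc_biterms:
        scores = [theta[z] * phi[z][w1] * phi[z][w2] for z in range(num_topics)]
        total = sum(scores)
        for z in range(num_topics):
            mix[z] += scores[z] / total / len(doc_biterms)
    return mix

# Made-up parameters: 2 topics over a 3-word vocabulary.
theta = [0.5, 0.5]
phi = [[0.7, 0.2, 0.1],
       [0.1, 0.2, 0.7]]
print(doc_topics([(0, 1)], theta, phi))
```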

Save a topic plot using pyLDAvis and explore the results! (also see simple_btml.py)

    import numpy as np
    import pyLDAvis

    vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
    pyLDAvis.save_html(vis, './vis/simple_btm.html')  # example output path

pyLDAvis Visualization

Inference uses Gibbs sampling, which is slow in this implementation, so the package is not meant for production use. If you have to classify a lot of texts, you can try online learning, which trains the model on successive chunks of biterms:

import numpy as np
import pyLDAvis
from biterm.btm import oBTM 
from sklearn.feature_extraction.text import CountVectorizer
from biterm.utility import vec_to_biterms, topic_summuary # helper functions

if __name__ == "__main__":

    texts = open('./data/reuters.titles').read().splitlines() # path of data file

    # vectorize texts
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(texts).toarray()

    # get vocabulary
    vocab = np.array(vec.get_feature_names_out())  # on scikit-learn < 1.0 use get_feature_names()

    # get biterms
    biterms = vec_to_biterms(X)

    # create btm
    btm = oBTM(num_topics=20, V=vocab)

    print("\n\n Train Online BTM ..")
    for i in range(0, len(biterms), 100): # process chunks of 100 texts
        biterms_chunk = biterms[i:i + 100]
        btm.fit(biterms_chunk, iterations=50)
    topics = btm.transform(biterms)

    print("\n\n Visualize Topics ..")
    vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
    pyLDAvis.save_html(vis, './vis/online_btm.html')  # path to output

    print("\n\n Topic coherence ..")
    topic_summuary(btm.phi_wz.T, X, vocab, 10)

    print("\n\n Texts & Topics ..")
    for i in range(len(texts)):
        print("{} (topic: {})".format(texts[i], topics[i].argmax()))
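
To see why inference is slow, it helps to look at what one Gibbs sweep does. The following is a didactic pure-Python sketch of collapsed Gibbs sampling for a biterm topic model, following the sampling formula from the BTM paper. It is not the package's implementation; the function name `btm_gibbs`, the default hyperparameters, and the toy data are assumptions chosen for illustration.

```python
import random

def btm_gibbs(biterms, num_topics, vocab_size, iterations=50,
              alpha=1.0, beta=0.01, seed=0):
    """Collapsed Gibbs sampling over a flat list of (w1, w2) index pairs."""
    rng = random.Random(seed)
    n_z = [0] * num_topics                                # biterms per topic
    n_wz = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    z = []                                                # topic of each biterm

    def update(k, w1, w2, delta):
        n_z[k] += delta
        n_wz[k][w1] += delta
        n_wz[k][w2] += delta

    # Random initial topic assignment.
    for w1, w2 in biterms:
        k = rng.randrange(num_topics)
        z.append(k)
        update(k, w1, w2, +1)

    for _ in range(iterations):
        for i, (w1, w2) in enumerate(biterms):
            update(z[i], w1, w2, -1)  # remove the current assignment
            weights = []
            for t in range(num_topics):
                # Each biterm contributes two words, so the word total is 2 * n_z.
                denom = 2 * n_z[t] + vocab_size * beta
                weights.append((n_z[t] + alpha)
                               * (n_wz[t][w1] + beta) * (n_wz[t][w2] + beta)
                               / (denom * denom))
            z[i] = rng.choices(range(num_topics), weights=weights)[0]
            update(z[i], w1, w2, +1)

    total = len(biterms)
    theta = [(n_z[t] + alpha) / (total + num_topics * alpha)
             for t in range(num_topics)]
    phi = [[(n_wz[t][w] + beta) / (2 * n_z[t] + vocab_size * beta)
            for w in range(vocab_size)] for t in range(num_topics)]
    return theta, phi

# Toy data: two groups of words that never co-occur with each other.
toy_biterms = [(0, 1), (0, 1), (1, 0), (2, 3), (2, 3), (3, 2)]
theta, phi = btm_gibbs(toy_biterms, num_topics=2, vocab_size=4, iterations=20)
print(theta)
```

Every sweep visits each biterm once and evaluates every topic for it, so the cost grows with iterations x biterms x topics. That inner loop is the part a compiled implementation speeds up.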

Use the Cython version to speed things up: download the repo and build cbtm.pyx for your operating system, then import the Cython implementation with from biterm.cbtm import oBTM.

biterm's People

Contributors

markoarnauto

biterm's Issues

"no module named utility"

Hi

I have installed biterm on pycharm, and have the following imports in my code:

import numpy as np
import pyLDAvis
from biterm.cbtm import oBTM
from sklearn.feature_extraction.text import CountVectorizer
from biterm.utility import vec_to_biterms, topic_summuary

however i get the error:
/homes/ahr18/PycharmProjects/biterm/Source/transformation.py/bin/python /homes/ahr18/PycharmProjects/biterm/Source/biterm.py
Traceback (most recent call last):
  File "/homes/ahr18/PycharmProjects/biterm/Source/biterm.py", line 3, in <module>
    from biterm.cbtm import oBTM
  File "/homes/ahr18/PycharmProjects/biterm/Source/biterm.py", line 3, in <module>
    from biterm.cbtm import oBTM
ImportError: No module named cbtm

Process finished with exit code 1

any idea how i can fix this?

Thanks !
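
A possible diagnostic, offered as a sketch rather than a confirmed fix: the traceback above lists .../Source/biterm.py twice, which may indicate that the script's own file name shadows the installed biterm package. The snippet below prints the file Python would load for a top-level module name; the helper `resolve_module_path` is hypothetical. Separately, the README notes that biterm.cbtm is only available after building cbtm.pyx.

```python
import importlib.util

def resolve_module_path(name):
    """Return the file Python would load for a top-level module name, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None

# If this prints the path of your own script rather than a site-packages
# directory, renaming the script should stop it from shadowing the package.
print(resolve_module_path("biterm"))
```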

Multithreading

How can I use multithreading instead of singlethreading when running biterm?

python setup.py egg_info failed with error code 1

I can't install the package.

Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\BM-008\AppData\Local\Temp\pip-install-2l82s4ev\biterm\setup.py", line 39, in <module>
    ext_modules=cythonize(extensions)
  File "c:\users\bm-008\anaconda3\lib\site-packages\Cython\Build\Dependencies.py", line 796, in cythonize
    aliases=aliases)
  File "c:\users\bm-008\anaconda3\lib\site-packages\Cython\Build\Dependencies.py", line 688, in create_extension_list
    for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
  File "c:\users\bm-008\anaconda3\lib\site-packages\Cython\Build\Dependencies.py", line 107, in nonempty
    raise ValueError(error_msg)
ValueError: 'biterm/cbtm.pyx' doesn't match any files

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in C:\Users\BM-008\AppData\Local\Temp\pip-install-2l82s4ev\biterm\

Installation issue

I am unable to install biterm in python3 and getting the following error. Please resolve it.

biterm/cbtm.c(2139): warning C4244: '=': conversion from 'npy_intp' to 'int', possible loss of data
biterm/cbtm.c(2237): warning C4013: 'drand48' undefined; assuming extern returning int

cbtm.obj : error LNK2001: unresolved external symbol drand48
build\lib.win-amd64-3.7\biterm\cbtm.cp37-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.21.27702\bin\HostX86\x64\link.exe' failed with exit status 1120

Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\link.exe' failed with exit status 1120

Hi markoarnauto (great name btw),

I failed to install biterm on my system (Win10, Python 3.6 Anaconda, installing with pip install biterm).
Microsoft Visual C++ 14 is installed (and re-installed), but it still fails. I can't figure out why; the closed issue looks similar, but I can't use it. Sorry if I'm getting something wrong and this is my fault rather than an issue; I'm not a pro with Python dependencies.

EDIT: Installing with pip install git+git://github.com/markoarnauto/biterm.git works fine.

Here's the error message I'm getting:

(classification_omt) C:\Users\fmeyer>pip install biterm
Collecting biterm
Using cached https://files.pythonhosted.org/packages/36/ca/5a43511e6ea8ca02cc9e8be1b8898ad79b140c055d4400342dc210ba23bb/biterm-0.1.5.tar.gz
Requirement already satisfied: numpy in c:\users\fmeyer\anaconda3\envs\classification_omt\lib\site-packages (from biterm) (1.16.4)
Requirement already satisfied: tqdm in c:\users\fmeyer\anaconda3\envs\classification_omt\lib\site-packages (from biterm) (4.31.1)
Requirement already satisfied: cython in c:\users\fmeyer\anaconda3\envs\classification_omt\lib\site-packages (from biterm) (0.29.6)
Requirement already satisfied: nltk in c:\users\fmeyer\anaconda3\envs\classification_omt\lib\site-packages (from biterm) (3.3)
Requirement already satisfied: six in c:\users\fmeyer\anaconda3\envs\classification_omt\lib\site-packages (from nltk->biterm) (1.11.0)
Building wheels for collected packages: biterm
Running setup.py bdist_wheel for biterm ... error
Complete output from command C:\Users\fmeyer\Anaconda3\envs\classification_omt\python.exe -u -c "import setuptools, tokenize;file='C:\Users\fmeyer\AppData\Local\Temp\pip-install-n_9xm6j5\biterm\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" bdist_wheel -d C:\Users\fmeyer\AppData\Local\Temp\pip-wheel-7z338gxy --python-tag cp36:
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.6
creating build\lib.win-amd64-3.6\biterm
copying biterm\btm.py -> build\lib.win-amd64-3.6\biterm
copying biterm\utility.py -> build\lib.win-amd64-3.6\biterm
copying biterm\__init__.py -> build\lib.win-amd64-3.6\biterm
copying biterm\__main__.py -> build\lib.win-amd64-3.6\biterm
running build_ext
building 'biterm.cbtm' extension
creating build\temp.win-amd64-3.6
creating build\temp.win-amd64-3.6\Release
creating build\temp.win-amd64-3.6\Release\biterm
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\fmeyer\Anaconda3\envs\classification_omt\lib\site-packages\numpy\core\include -IC:\Users\fmeyer\Anaconda3\envs\classification_omt\include -IC:\Users\fmeyer\Anaconda3\envs\classification_omt\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /Tcbiterm/cbtm.c /Fobuild\temp.win-amd64-3.6\Release\biterm/cbtm.obj
cbtm.c
c:\users\fmeyer\anaconda3\envs\classification_omt\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
biterm/cbtm.c(2139): warning C4244: '=': conversion from 'npy_intp' to 'int', possible loss of data
biterm/cbtm.c(2237): warning C4013: 'drand48' undefined; assuming extern returning int
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\Users\fmeyer\Anaconda3\envs\classification_omt\libs /LIBPATH:C:\Users\fmeyer\Anaconda3\envs\classification_omt\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.10240.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.10240.0\um\x64" /EXPORT:PyInit_cbtm build\temp.win-amd64-3.6\Release\biterm/cbtm.obj /OUT:build\lib.win-amd64-3.6\biterm\cbtm.cp36-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.6\Release\biterm\cbtm.cp36-win_amd64.lib
cbtm.obj : warning LNK4197: export 'PyInit_cbtm' specified multiple times; using first specification
Creating library build\temp.win-amd64-3.6\Release\biterm\cbtm.cp36-win_amd64.lib and object build\temp.win-amd64-3.6\Release\biterm\cbtm.cp36-win_amd64.exp
cbtm.obj : error LNK2001: unresolved external symbol drand48
build\lib.win-amd64-3.6\biterm\cbtm.cp36-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe' failed with exit status 1120


Failed building wheel for biterm
Running setup.py clean for biterm
Failed to build biterm
Installing collected packages: biterm
Running setup.py install for biterm ... error
Complete output from command C:\Users\fmeyer\Anaconda3\envs\classification_omt\python.exe -u -c "import setuptools, tokenize;file='C:\Users\fmeyer\AppData\Local\Temp\pip-install-n_9xm6j5\biterm\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\fmeyer\AppData\Local\Temp\pip-record-erqtkw2c\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.6
creating build\lib.win-amd64-3.6\biterm
copying biterm\btm.py -> build\lib.win-amd64-3.6\biterm
copying biterm\utility.py -> build\lib.win-amd64-3.6\biterm
copying biterm\__init__.py -> build\lib.win-amd64-3.6\biterm
copying biterm\__main__.py -> build\lib.win-amd64-3.6\biterm
running build_ext
building 'biterm.cbtm' extension
creating build\temp.win-amd64-3.6
creating build\temp.win-amd64-3.6\Release
creating build\temp.win-amd64-3.6\Release\biterm
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\fmeyer\Anaconda3\envs\classification_omt\lib\site-packages\numpy\core\include -IC:\Users\fmeyer\Anaconda3\envs\classification_omt\include -IC:\Users\fmeyer\Anaconda3\envs\classification_omt\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /Tcbiterm/cbtm.c /Fobuild\temp.win-amd64-3.6\Release\biterm/cbtm.obj
cbtm.c
c:\users\fmeyer\anaconda3\envs\classification_omt\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
biterm/cbtm.c(2139): warning C4244: '=': conversion from 'npy_intp' to 'int', possible loss of data
biterm/cbtm.c(2237): warning C4013: 'drand48' undefined; assuming extern returning int
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\Users\fmeyer\Anaconda3\envs\classification_omt\libs /LIBPATH:C:\Users\fmeyer\Anaconda3\envs\classification_omt\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.10240.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.10240.0\um\x64" /EXPORT:PyInit_cbtm build\temp.win-amd64-3.6\Release\biterm/cbtm.obj /OUT:build\lib.win-amd64-3.6\biterm\cbtm.cp36-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.6\Release\biterm\cbtm.cp36-win_amd64.lib
cbtm.obj : warning LNK4197: export 'PyInit_cbtm' specified multiple times; using first specification
Creating library build\temp.win-amd64-3.6\Release\biterm\cbtm.cp36-win_amd64.lib and object build\temp.win-amd64-3.6\Release\biterm\cbtm.cp36-win_amd64.exp
cbtm.obj : error LNK2001: unresolved external symbol drand48
build\lib.win-amd64-3.6\biterm\cbtm.cp36-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe' failed with exit status 1120

----------------------------------------

Command "C:\Users\fmeyer\Anaconda3\envs\classification_omt\python.exe -u -c "import setuptools, tokenize;file='C:\Users\fmeyer\AppData\Local\Temp\pip-install-n_9xm6j5\biterm\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\fmeyer\AppData\Local\Temp\pip-record-erqtkw2c\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\fmeyer\AppData\Local\Temp\pip-install-n_9xm6j5\biterm\

Regards,
Fabian
