samhames / hyperreal
A Python package for interpretive topic modelling
License: Apache License 2.0
Choose a cluster and/or feature, and pivot the display so that everything is sorted by similarity to that query. This is effectively a 1D projection of the dataset by similarity to a chosen probe query.
Getting the below error when trying to create an index from a corpus with hyperreal plaintext-corpus index corpus.db corpus_index.db:
Indexing corpus.db into corpus_index.db.
Traceback (most recent call last):
File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 625, in _
rmtree_unsafe
os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by an
other process: 'C:\\Users\\<...>\\AppData\\Local\\Temp\\tmpp5wq8h2m\\0'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 805, in
onerror
_os.unlink(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by an
other process: 'C:\\Users\\<...>\\AppData\\Local\\Temp\\tmpp5wq8h2m\\0'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\<...>\PycharmProjects\hyperreal\venv\Scripts\hyperreal-script.py", line 33,
in <module>
sys.exit(load_entry_point('hyperreal', 'console_scripts', 'hyperreal')())
File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line
1130, in __call__
return self.main(*args, **kwargs)
File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line
1055, in main
rv = self.invoke(ctx)
File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line
1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line
1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\N10980695\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line
1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line
760, in invoke
return __callback(*args, **kwargs)
File "c:\users\<...>\pycharmprojects\hyperreal\hyperreal\cli.py", line 69, in plaintext_co
rpus_index
doc_index.index()
File "c:\users\<...>\pycharmprojects\hyperreal\hyperreal\index.py", line 364, in index
self.save_corpus()
File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 830, in
__exit__
self.cleanup()
File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 834, in
cleanup
self._rmtree(self.name)
File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 816, in
_rmtree
_shutil.rmtree(name, onerror=onerror)
File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 749, in r
mtree
return _rmtree_unsafe(path, onerror)
File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 627, in _
rmtree_unsafe
onerror(os.unlink, fullname, sys.exc_info())
File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 808, in
onerror
cls._rmtree(path)
File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 816, in
_rmtree
_shutil.rmtree(name, onerror=onerror)
File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 749, in r
mtree
return _rmtree_unsafe(path, onerror)
File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 608, in _
rmtree_unsafe
onerror(os.scandir, path, sys.exc_info())
File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 605, in _
rmtree_unsafe
with os.scandir(path) as scandir_it:
NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\<...>\\AppData\
\Local\\Temp\\tmpp5wq8h2m\\0'
The issue appears to be similar to Belval/pdf2image#151.
Environment: Windows 10, winver 21H2
Make it easier to actually keep adding documentation, instead of keeping on putting it in the too hard basket.
When trying to run the tool locally (command: python3 hyperreal/server.py serve), I am encountering a ValueError (f"{args.index_path} is not a valid index file."). It would be helpful to know the format required for the index file and other related steps.
Currently status is reported arbitrarily, and mostly in a way that isn't useful. We need to integrate a logging framework and actually instrument the various functions for it to be useful.
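One possible starting point, sketched with the standard library logging module: a decorator that logs when a function starts and how long it took. The logger name, the decorator name log_timing and the example function index_documents are all hypothetical and are not part of hyperreal.

```python
import functools
import logging
import time

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("hyperreal_example")


def log_timing(func):
    """Log the start of a call and its elapsed time, even if it raises."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        logger.info("Starting %s", func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            logger.info("Finished %s in %.3fs", func.__name__, elapsed)

    return wrapper


@log_timing
def index_documents(docs):
    # Stand-in for a real indexing step: just counts the documents.
    return len(docs)
```

Calling logging.basicConfig(level=logging.INFO) once at the CLI entry point would be enough to make these messages visible, while library users could leave logging unconfigured and see nothing.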
To make it easy to choose what to include or exclude, as well as providing additional context about the indexing process.
Common operations like time series trends are going to require an operator that takes a query and intersects it against an entire field, with varying operations and measurements. This is particularly going to be important for time where simple counts and similarity will go a long way.
The best long term API/functionality is very uncertain, but right now something simple like an iterator over the values in a field, combined with indexing-time binning might be enough to do some useful things with prechosen granularity for aggregations. This would allow something like one of the following to construct an hourly count of something:
from hyperreal import index
i = index.Index('test_index.db')
# reuse __getitem__ as an iterator when provided with only a string as the field name
[(value, query.intersection_cardinality(docs)) for value, docs in i['text']]
# Add an additional method --> this is probably clearer that this is an iterator of outputs
[(value, query.intersection_cardinality(docs)) for value, docs in i.iterate_field_values('text')]
# Use the slice notation, with the use of the None value as a sentinel for "from the beginning/to the end"
[(field, value, query.intersection_cardinality(docs)) for (field, value), docs in i[(field, None):(field, None)]]
This will probably also feed into future work on fancier drilldowns, and approaches to how we integrate/display non-model features within the context of the model clusters.
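To make the idea of an iterator over field values with indexing-time binning concrete, here is a minimal self-contained sketch. It does not use the hyperreal API: plain Python sets stand in for whatever document-set type the index stores, and bin_to_hour, build_field_index and counts_by_value are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime


def bin_to_hour(timestamp):
    """Indexing-time binning: truncate a timestamp to the start of its hour."""
    return timestamp.replace(minute=0, second=0, microsecond=0)


def build_field_index(docs):
    """Map each binned value to the set of doc ids that carry it.

    docs is an iterable of (doc_id, timestamp) pairs. The result is a list
    of (value, doc_ids) pairs sorted by value, mimicking an iterator over
    the values of a single field.
    """
    postings = defaultdict(set)
    for doc_id, timestamp in docs:
        postings[bin_to_hour(timestamp)].add(doc_id)
    return sorted(postings.items())


def counts_by_value(field_values, query_docs):
    """Intersect a query's doc ids against every value of the field."""
    return [(value, len(query_docs & docs)) for value, docs in field_values]
```

With this shape, an hourly count for a query is one pass over the field's values, and the granularity is fixed by whichever binning function was applied at indexing time.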
I stumbled upon a very minor scenario where not specifying a number of clusters when editing them throws a 500 error, so documenting it here.
Traceback (most recent call last):
File "C:\Users\<user>\hyperreal\venv\lib\site-packages\cherrypy\_cprequest.py", line 638, in respond
self._do_respond(path_info)
File "C:\Users\<user>l\venv\lib\site-packages\cherrypy\_cprequest.py", line 697, in _do_respond
response.body = self.handler()
File "C:\Users\<user>\hyperreal\venv\lib\site-packages\cherrypy\lib\encoding.py", line 223, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "C:\Users\<user>\hyperreal\venv\lib\site-packages\cherrypy\_cpdispatch.py", line 54, in __call__
return self.callable(*self.args, **self.kwargs)
File "c:\users\<user>hyperreal\hyperreal\server.py", line 198, in refine
cherrypy.request.index.refine_clusters(
File "c:\users\<user>\hyperreal\hyperreal\index.py", line 101, in wrapper
results = func(*args, **kwargs)
File "c:\users\<user>\hyperreal\hyperreal\index.py", line 1793, in refine_clusters
cluster_feature, new_cluster_ids = self._refine_feature_groups(
File "c:\users\<user>\hyperreal\hyperreal\index.py", line 1692, in _refine_feature_groups
comparison_delta, comparison_cluster = best_feature_clusters[
KeyError: 8194
After running hyperreal stackexchange-corpus index travel_sx.db travel_sx_index.db, I am seeing several lines of output like defaultdict(<class 'list'>, {0: ['C:\\Users\\N10980~1\\AppData\\Local\\Temp\\tmpe1_cih5z\\0']}). I am not sure if this output is needed or helpful.
Currently the web layer is a big ol mess of a prototype - at some point this will need to be reviewed from an architecture perspective to see what this actually needs to be in the future.
Getting the below error when running hyperreal stackexchange-corpus index travel_sx.db travel_sx_index.db:
(venv) PS C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal> hyperreal stackexchange-corpus index travel_sx.db travel_sx_index.db
Indexing travel_sx.db into travel_sx_index.db.
defaultdict(<class 'list'>, {0: ['C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\0']})
defaultdict(<class 'list'>, {0: ['C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\0', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\
1']})
defaultdict(<class 'list'>, {0: ['C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\0', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\
1', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\7']})
defaultdict(<class 'list'>, {0: ['C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\0', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\
1', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\7', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\2']})
Traceback (most recent call last):
File "c:\users\nXXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 537, in index
os.remove(next_temp_file)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Te
mp\\tmpvgyzcycl\\13'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 617, in _rmtree_unsafe
os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Te
mp\\tmpvgyzcycl\\13'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 820, in onerror
_os.unlink(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Te
mp\\tmpvgyzcycl\\13'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\Scripts\hyperreal-script.py", line 33, in <module>
sys.exit(load_entry_point('hyperreal', 'console_scripts', 'hyperreal')())
File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "c:\users\nXXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\cli.py", line 159, in stackexchange_corpus_index
doc_index.index(doc_batch_size=doc_batch_size)
File "c:\users\nXXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 145, in wrapper_func
return func(self, *args, **kwargs)
File "c:\users\nXXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 380, in index
with tempfile.TemporaryDirectory() as tempdir:
File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 846, in __exit__
self.cleanup()
File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 850, in cleanup
self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 832, in _rmtree
_shutil.rmtree(name, onerror=onerror)
File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 749, in rmtree
return _rmtree_unsafe(path, onerror)
File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 619, in _rmtree_unsafe
onerror(os.unlink, fullname, sys.exc_info())
File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 823, in onerror
cls._rmtree(path, ignore_errors=ignore_errors)
File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 832, in _rmtree
_shutil.rmtree(name, onerror=onerror)
File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 749, in rmtree
return _rmtree_unsafe(path, onerror)
File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 600, in _rmtree_unsafe
onerror(os.scandir, path, sys.exc_info())
File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 597, in _rmtree_unsafe
with os.scandir(path) as scandir_it:
NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\13'
(venv) PS C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal>
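On Windows a file cannot be deleted while any handle to it is still open, which is consistent with the WinError 32 in the traceback above. Where the handle is being held in hyperreal is not established here, so the following is only a general sketch of the pattern that avoids the problem: every temporary file is opened in a with block, so it is closed before os.remove or the TemporaryDirectory cleanup touches it. The function name write_and_merge_segments is hypothetical.

```python
import os
import tempfile


def write_and_merge_segments(chunks):
    """Write each chunk to a temp file, then read the files back and delete them.

    Each file handle is closed by its with block before the file is
    removed, so no open handle can block deletion on Windows.
    """
    merged = []
    with tempfile.TemporaryDirectory() as tempdir:
        paths = []
        for i, chunk in enumerate(chunks):
            path = os.path.join(tempdir, str(i))
            with open(path, "w", encoding="utf-8") as f:
                f.write(chunk)
            paths.append(path)

        for path in paths:
            with open(path, encoding="utf-8") as f:
                merged.append(f.read())
            # The handle is closed here, so removal is safe.
            os.remove(path)
    return merged
```

If the temporary files are written by other processes, the same rule applies across process boundaries: every process that opened a file has to close it before the parent tries to delete it.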
It probably makes more sense to store the path to the corpus as a path relative to the index file, not the working directory. Currently the following won't work because of this:
# Create an index in the current directory
hyperreal plaintext-corpus index example.db example_index.db
cd ..
# The index will successfully open, but trying to access the corpus will look for example.db in the current directory!
hyperreal serve example_directory/example_index.db
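A minimal sketch of the relative-path idea using the standard library: store the corpus path relative to the directory containing the index file, and resolve it against that same directory when the index is opened. Both function names are hypothetical and nothing here reflects how hyperreal currently stores the path.

```python
import os
from pathlib import Path


def corpus_path_for_storage(corpus_path, index_path):
    """Return corpus_path expressed relative to the index file's directory."""
    index_dir = Path(index_path).resolve().parent
    # Note: os.path.relpath raises ValueError on Windows when the two
    # paths are on different drives, so a real implementation would need
    # a fallback (for example, storing the absolute path).
    return os.path.relpath(Path(corpus_path).resolve(), start=index_dir)


def resolve_corpus_path(stored_path, index_path):
    """Resolve a stored relative path against the index file's directory."""
    return (Path(index_path).resolve().parent / stored_path).resolve()
```

With this approach the working directory at serve time no longer matters, as long as the corpus and the index are moved together.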
I think this will just be a temporary waypoint while we think about a more solid direction for future work, but there's no reason not to start with this low-hanging fruit.
As a starting point I'd suggest we want to do the following:
Work remaining:
Now that I've played around with things a bit I have a better idea of what things need to look like, time to start writing things down to capture the thoughts.
Either ATAP, or my own account.
We want to see the documents associated with a particular cluster or selected features - this supports close reading of the documents in conjunction with the surface forms of the features grouped by the model.
This requires:
After running hyperreal plaintext-corpus index corpus.db corpus_index.db, getting a NameError: name '_corpus' is not defined error.
Getting this error after pulling new updates from the main branch:
Traceback (most recent call last):
File "C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\cherrypy\_cprequest.py", line 638, in respond
self._do_respond(path_info)
File "C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\cherrypy\_cprequest.py", line 697, in _do_respond
response.body = self.handler()
File "C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\cherrypy\lib\encoding.py", line 223, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\cherrypy\_cpdispatch.py", line 54, in __call__
return self.callable(*self.args, **self.kwargs)
File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\server.py", line 167, in index
rendered_docs = cherrypy.request.index.render_docs(ranked)
File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 56, in wrapper_func
return func(self, *args, **kwargs)
File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 551, in render_docs
return self.corpus.render_docs_html(doc_keys)
File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\corpus.py", line 619, in render_docs_html
docs = [(key, self._render_doc_key(key)) for key in doc_keys]
File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\corpus.py", line 619, in <listcomp>
docs = [(key, self._render_doc_key(key)) for key in doc_keys]
File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\corpus.py", line 545, in _render_doc_key
base_fields = list(
IndexError: list index out of range
The current implementation of the index is a boolean information retrieval system: documents are represented and indexed as a set of attributes, and can be recalled based on the attributes present in the document (the features). The base feature clustering algorithm can be interpreted within this framework as creating a series of disjunctive queries (feature1 OR feature2 OR feature3) for documents that share these features. This is conceptually and computationally simple to work with, but obviously falls short of the expressive power of even early relevance ranking search engines.
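As an illustration of the boolean model described above, a cluster read as a disjunctive query is just the union of the document sets of its features. This sketch uses a plain dict of Python sets as a stand-in for the index; the names postings and disjunctive_query are hypothetical and this is not the hyperreal implementation.

```python
def disjunctive_query(postings, features):
    """Return the ids of documents matching feature1 OR feature2 OR ...

    postings maps each feature to the set of doc ids that contain it.
    Features missing from the index simply contribute no documents.
    """
    result = set()
    for feature in features:
        result |= postings.get(feature, set())
    return result
```

Every matching document is returned with equal standing: a document containing all of the features is indistinguishable from one containing a single feature, which is the limitation relative to relevance ranking noted above.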
The question is, what can we do about this? There are a few different directions we can consider:
Provide a first pass for dealing with more of the complexities of Twitter data as an additional example corpus.
Currently the individual cluster view is very different from the global view - these need to be made consistent in advance of the release, including statistics, sorting, and showing query results where appropriate.
Want to avoid things like #3.