samhames / hyperreal


A Python package for interpretive topic modelling

License: Apache License 2.0

Python 91.25% HTML 8.75%

hyperreal's Issues

Pivot by selected query

Choose a cluster and/or feature, and pivot the display so that everything is sorted by similarity to that query. This is effectively a 1D projection of the dataset by similarity to a chosen probe query.

Index functionality to pivot according to a query
Pivot clusters view by selected cluster or feature
Pivot cluster detail view by selected feature

Allow selecting specific fields to use when creating a model

Pathing / directory issue (may be Windows-specific) when attempting to write to a Temp directory

Getting the error below when trying to create an index from a corpus with hyperreal plaintext-corpus index corpus.db corpus_index.db:

Indexing corpus.db into corpus_index.db.
Traceback (most recent call last):
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 625, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\<...>\\AppData\\Local\\Temp\\tmpp5wq8h2m\\0'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 805, in onerror
    _os.unlink(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\<...>\\AppData\\Local\\Temp\\tmpp5wq8h2m\\0'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\Scripts\hyperreal-script.py", line 33, in <module>
    sys.exit(load_entry_point('hyperreal', 'console_scripts', 'hyperreal')())
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\N10980695\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\<...>\pycharmprojects\hyperreal\hyperreal\cli.py", line 69, in plaintext_corpus_index
    doc_index.index()
  File "c:\users\<...>\pycharmprojects\hyperreal\hyperreal\index.py", line 364, in index
    self.save_corpus()
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 830, in __exit__
    self.cleanup()
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 834, in cleanup
    self._rmtree(self.name)
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 816, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 749, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 627, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 808, in onerror
    cls._rmtree(path)
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 816, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 749, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 608, in _rmtree_unsafe
    onerror(os.scandir, path, sys.exc_info())
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 605, in _rmtree_unsafe
    with os.scandir(path) as scandir_it:
NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\<...>\\AppData\\Local\\Temp\\tmpp5wq8h2m\\0'

The issue appears to be similar to Belval/pdf2image#151.

Environment : Windows 10, winver 21H2
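A common cause of WinError 32 during temporary directory cleanup is a file handle that is still open when the directory is removed, since Windows refuses to delete files that are in use. Whether that is what happens inside hyperreal is an assumption; the traceback only shows that cleanup failed. The sketch below is not hyperreal's code, and write_temp_chunks is a hypothetical helper that only illustrates the pattern of closing and removing each temporary file before the TemporaryDirectory context exits.

```python
import os
import tempfile


def write_temp_chunks(chunks):
    """Write each chunk to its own temp file, closing every handle before cleanup.

    Hypothetical helper for illustration only; not part of hyperreal.
    Returns the size of each file and whether the temp directory still exists.
    """
    sizes = []
    with tempfile.TemporaryDirectory() as tempdir:
        for i, chunk in enumerate(chunks):
            path = os.path.join(tempdir, str(i))
            # The inner `with` guarantees the handle is closed before we
            # touch the file again, which matters on Windows.
            with open(path, "wb") as handle:
                handle.write(chunk)
            sizes.append(os.path.getsize(path))
            # Safe to remove: no open handle refers to the file any more.
            os.remove(path)
    # Leaving the outer `with` removes the directory itself.
    return sizes, os.path.exists(tempdir)


if __name__ == "__main__":
    print(write_temp_chunks([b"ab", b"cde"]))
```

If the files are opened by worker processes, the same idea applies: the workers would need to release their handles before the parent process leaves the context manager.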

First public release

Out of Scope for first release

Hierarchical clustering of features - leave to future work/optimisation for large corpora
Changes to scoring or display of document features
Design work on corpus - index interface - wait for initial features and use to bed in before abstracting any further
Visual improvements outside of basic consistency work
Field related metadata to make suggestions for what to include in the model
Field related metadata to drive computational associations and drill down

Create a more complete documentation setup

Make it easier to keep adding documentation, instead of continually putting it in the too-hard basket.

Could not find instructions for valid index file

When trying to run the tool locally (command: python3 hyperreal/server.py serve), I am encountering a ValueError (f"{args.index_path} is not a valid index file."). It would be helpful to know the format required for the index file and any other related steps.

Setup github actions for CI

Logging framework

Currently status is reported arbitrarily, and mostly in a way that isn't useful. We need to integrate a logging framework and instrument the various functions so that the reported status is actually informative.
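As a starting point, the standard library logging module may be enough. The sketch below assumes a module-level logger and a hypothetical index_batches function standing in for a real indexing step; the logger name and the messages are illustrative, not anything hyperreal currently defines.

```python
import logging

# Hypothetical logger name; a real module would typically use __name__.
logger = logging.getLogger("hyperreal.example")


def index_batches(batches):
    """Count documents across batches, reporting progress through logging."""
    total = 0
    for batch_number, batch in enumerate(batches):
        total += len(batch)
        # Per-batch detail is only shown when the level is DEBUG.
        logger.debug("Indexed batch %d (%d documents)", batch_number, len(batch))
    # A single summary line is shown at the default INFO level.
    logger.info("Indexed %d documents in total", total)
    return total


if __name__ == "__main__":
    # The CLI entry point would decide the level and format once.
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
    index_batches([["doc a", "doc b"], ["doc c"]])
```

Keeping configuration in the entry point means library users can silence or redirect the output without changing the instrumented functions.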

Read the docs configuration - or host the docs on github pages instead?

Choose a license

Allow customising the number of examples shown when showing results

Example notebook/s - working with speeches from Australian Hansard

Setup automated releases to PyPI

Decide on versioning scheme and whether to consider this a "stable" release or not
Depends on #43
Make repository public + move to somewhere else
PyPI credentials + release token
Final opportunity to rethink/review the name

Allow creating a new model and selecting fields through the web interface/CLI

To make it easy to choose what to include or exclude, as well as providing additional context about the indexing process.

Decide where in the URL hierarchy this fits?
Show the field_summary statistics table from the index
Form for creating a new model, with selectable fields
Option to run further iterations on the existing model

Framework for intersections/arrangements with a field

Common operations like time series trends are going to require an operator that takes a query and intersects it against an entire field, with varying operations and measurements. This is particularly going to be important for time where simple counts and similarity will go a long way.

The best long term API/functionality is very uncertain, but right now something simple like an iterator over the values in a field, combined with indexing-time binning might be enough to do some useful things with prechosen granularity for aggregations. This would allow something like one of the following to construct an hourly count of something:

from hyperreal import index
i = index.Index('test_index.db')

# Option 1: reuse __getitem__ as an iterator when given only a field name as a string
[(value, query.intersection_cardinality(docs)) for value, docs in i['text']]
# Option 2: add a dedicated method, which makes it clearer that this is an iterator of outputs
[(value, query.intersection_cardinality(docs)) for value, docs in i.iterate_field_values('text')]
# Option 3: use slice notation, with None as a sentinel for "from the beginning/to the end"
[(field, value, query.intersection_cardinality(docs)) for (field, value), docs in i[(field, None):(field, None)]]

This will probably also feed into future work on fancier drilldowns, and approaches to how we integrate/display non-model features within the context of the model clusters.
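To make the idea concrete without depending on the eventual API, here is a minimal sketch using plain Python sets in place of the index's document sets. The names bin_by_hour and intersect_field are hypothetical, and binning timestamps to the hour stands in for the "indexing-time binning" mentioned above.

```python
from collections import defaultdict
from datetime import datetime


def bin_by_hour(docs):
    """Map each hourly bin to the set of document ids that fall in it.

    `docs` is an iterable of (doc_id, timestamp) pairs.
    """
    field_values = defaultdict(set)
    for doc_id, timestamp in docs:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        field_values[hour].add(doc_id)
    return field_values


def intersect_field(query, field_values):
    """Count, for every value of the field, how many query documents match it."""
    return [
        (value, len(query & doc_ids))
        for value, doc_ids in sorted(field_values.items())
    ]


if __name__ == "__main__":
    docs = [
        (1, datetime(2022, 1, 1, 9, 5)),
        (2, datetime(2022, 1, 1, 9, 40)),
        (3, datetime(2022, 1, 1, 10, 15)),
    ]
    # `query` is the set of documents matching some cluster or feature.
    print(intersect_field({1, 3}, bin_by_hour(docs)))
```

Whichever of the three spellings is chosen for the index, the output shape would be the same: one (value, count) pair per field value, in field order.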

Nice-to-have: Error handling of cluster selection when mistakenly not specifying clusters

I stumbled upon a very minor scenario where not specifying a number of clusters when editing them throws a 500 error, so documenting it here.

Steps to reproduce

Select a cluster for editing from the page with all clusters using a checkbox
Click "Edit selected clusters" option
Click "Refine selected clusters" without providing a number of clusters
See a 500 Error / KeyError appear:

Traceback (most recent call last):
  File "C:\Users\<user>\hyperreal\venv\lib\site-packages\cherrypy\_cprequest.py", line 638, in respond
    self._do_respond(path_info)
  File "C:\Users\<user>l\venv\lib\site-packages\cherrypy\_cprequest.py", line 697, in _do_respond
    response.body = self.handler()
  File "C:\Users\<user>\hyperreal\venv\lib\site-packages\cherrypy\lib\encoding.py", line 223, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "C:\Users\<user>\hyperreal\venv\lib\site-packages\cherrypy\_cpdispatch.py", line 54, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "c:\users\<user>hyperreal\hyperreal\server.py", line 198, in refine
    cherrypy.request.index.refine_clusters(
  File "c:\users\<user>\hyperreal\hyperreal\index.py", line 101, in wrapper
    results = func(*args, **kwargs)
  File "c:\users\<user>\hyperreal\hyperreal\index.py", line 1793, in refine_clusters
    cluster_feature, new_cluster_ids = self._refine_feature_groups(
  File "c:\users\<user>\hyperreal\hyperreal\index.py", line 1692, in _refine_feature_groups
    comparison_delta, comparison_cluster = best_feature_clusters[
KeyError: 8194
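One possible way to turn this into a friendlier error is to validate the submitted value before calling the refinement code. The sketch below is an assumption about how that could look, not the project's handler: parse_cluster_count is a hypothetical helper, and the traceback alone does not confirm that the missing value is what triggers the KeyError.

```python
def parse_cluster_count(raw_value):
    """Validate a user-supplied cluster count from a web form.

    Hypothetical helper: raises ValueError with a readable message instead of
    letting a missing or malformed value reach the clustering code.
    """
    if raw_value is None or not str(raw_value).strip():
        raise ValueError("Please provide the number of clusters to refine into.")
    try:
        count = int(raw_value)
    except ValueError:
        raise ValueError(f"Number of clusters must be a whole number, got {raw_value!r}.") from None
    if count < 1:
        raise ValueError("Number of clusters must be at least 1.")
    return count


if __name__ == "__main__":
    print(parse_cluster_count("4"))
```

In the web layer the ValueError could then be translated into a 400 response with the message shown to the user, rather than a 500.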

Unclear output in the console after running hyperreal stackexchange-corpus index travel_sx.db travel_sx_index.db

After running hyperreal stackexchange-corpus index travel_sx.db travel_sx_index.db I am seeing several lines of output like defaultdict(<class 'list'>, {0: ['C:\\Users\\N10980~1\\AppData\\Local\\Temp\\tmpe1_cih5z\\0']}). I am not sure if this output is needed / helpful.

Create a standalone executable for ease of use

Work out how to approach index migrations

Decide on a more sustainable web architecture

Currently the web layer is a big ol mess of a prototype - at some point this will need to be reviewed from an architecture perspective to see what this actually needs to be in the future.

Use subclustering to make large clusterings more legible

Directory / process related errors

Getting the below error when running hyperreal stackexchange-corpus index travel_sx.db travel_sx_index.db:

(venv) PS C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal> hyperreal stackexchange-corpus index travel_sx.db travel_sx_index.db
Indexing travel_sx.db into travel_sx_index.db.
defaultdict(<class 'list'>, {0: ['C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\0']})
defaultdict(<class 'list'>, {0: ['C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\0', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\1']})
defaultdict(<class 'list'>, {0: ['C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\0', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\1', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\7']})
defaultdict(<class 'list'>, {0: ['C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\0', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\1', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\7', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\2']})
Traceback (most recent call last):
  File "c:\users\nXXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 537, in index
    os.remove(next_temp_file)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\13'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 617, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\13'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 820, in onerror
    _os.unlink(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\13'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\Scripts\hyperreal-script.py", line 33, in <module>
    sys.exit(load_entry_point('hyperreal', 'console_scripts', 'hyperreal')())
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\nXXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\cli.py", line 159, in stackexchange_corpus_index
    doc_index.index(doc_batch_size=doc_batch_size)
  File "c:\users\nXXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 145, in wrapper_func
    return func(self, *args, **kwargs)
  File "c:\users\nXXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 380, in index
    with tempfile.TemporaryDirectory() as tempdir:
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 846, in __exit__
    self.cleanup()
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 850, in cleanup
    self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 832, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 749, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 619, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 823, in onerror
    cls._rmtree(path, ignore_errors=ignore_errors)
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 832, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 749, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 600, in _rmtree_unsafe
    onerror(os.scandir, path, sys.exc_info())
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 597, in _rmtree_unsafe
    with os.scandir(path) as scandir_it:
NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\13'
(venv) PS C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal>

Paths to corpus objects stored in the index are relative to the base directory, not the index directory.

It probably makes more sense to store the path to the corpus as a path relative to the index file, not the working directory. Currently the following won't work because of this:

# Create an index in the current directory
hyperreal plaintext-corpus index example.db example_index.db
cd ..
# The index will successfully open, but trying to access the corpus will look for example.db in the current directory!
hyperreal serve example_directory/example_index.db
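A minimal sketch of the proposed behaviour, assuming the index stores the corpus path as a string: resolve_corpus_path is a hypothetical helper, not an existing hyperreal function, and it only shows how a stored relative path could be interpreted against the directory containing the index file.

```python
from pathlib import Path


def resolve_corpus_path(index_path, stored_corpus_path):
    """Interpret a stored corpus path relative to the index file's directory.

    Absolute stored paths are returned unchanged.
    """
    corpus_path = Path(stored_corpus_path)
    if corpus_path.is_absolute():
        return corpus_path
    # Join against the directory holding the index, not the working directory.
    return Path(index_path).parent / corpus_path


if __name__ == "__main__":
    print(resolve_corpus_path("example_directory/example_index.db", "example.db"))
```

With this rule the serve command above would look for example_directory/example.db regardless of where it is run from.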

Can Model Runs Be Deterministic?

Allow editing of the base model

I think this will just be a temporary waypoint while we think about a more solid direction for future work, but there's no reason not to start with this low-hanging fruit.

As a starting point I'd suggest we want to do the following:

Delete a cluster (and remove all contained features within it)
Delete a feature from a cluster
Merge two or more selected clusters into one
Select a group of features within a cluster and create a new distinct cluster from them
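The four operations above can be sketched on a plain dictionary that maps cluster ids to sets of features. This is an illustrative data structure only; how the index actually stores clusters, and the function names used here, are assumptions.

```python
def delete_cluster(clusters, cluster_id):
    """Remove a cluster and every feature it contains."""
    clusters.pop(cluster_id)


def delete_feature(clusters, cluster_id, feature):
    """Remove a single feature from a cluster."""
    clusters[cluster_id].discard(feature)


def merge_clusters(clusters, cluster_ids):
    """Merge the given clusters into the first one listed and return its id."""
    target, *others = cluster_ids
    for cluster_id in others:
        clusters[target] |= clusters.pop(cluster_id)
    return target


def split_cluster(clusters, cluster_id, features):
    """Move the selected features into a new cluster and return the new id."""
    selected = set(features) & clusters[cluster_id]
    clusters[cluster_id] -= selected
    new_id = max(clusters) + 1
    clusters[new_id] = selected
    return new_id


if __name__ == "__main__":
    clusters = {0: {"a", "b"}, 1: {"c"}, 2: {"d", "e"}}
    merge_clusters(clusters, [0, 1])
    split_cluster(clusters, 2, ["e"])
    print(clusters)
```

All four edits are simple set operations, which is part of why this looks like low-hanging fruit.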

Create an example corpus type for stack exchange data

Work remaining:

Way to specify the source site of data
Allow ingesting multiple sites, with the source site as a feature to enable cross site analysis
Link to the site from the document view

Revisit the overall architecture as well

Now that I've played around with things a bit I have a better idea of what things need to look like. Time to start writing things down to capture the thoughts.

Migrate repository to a public location

Either ATAP, or my own account.

Show matching documents for a query

We want to see the documents associated with a particular cluster or selected features - this supports close reading of the documents in conjunction with the surface forms of the features grouped by the model.

This requires:

The corpus object needs to have some standardised way of rendering documents for the web layer
The web layer needs a way to represent a choice of objects to indicate what to search for. At a minimum we should be able to represent both a single specific feature and a cluster of features. This will likely be significantly expanded in the future.
UI for representing the query results in the web layer - I'll put together some wireframes later with an emphasis on "easy to implement"
Pagination or random selection of a number of results matching the search query. I'd suggest that we go for random sampling as the simplest thing for the initial version and come back to ranking and/or pagination later.
Is this all supported by the index, or do we need some additional features?
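The random sampling option could be as small as the sketch below. It assumes the matching documents are available as a collection of integer document ids; sample_matching_docs and its seed parameter are hypothetical names introduced for illustration.

```python
import random


def sample_matching_docs(matching_doc_ids, n, seed=None):
    """Return up to `n` document ids chosen at random from the matches.

    Passing a seed makes the selection repeatable, which helps when testing.
    """
    doc_ids = sorted(matching_doc_ids)
    if len(doc_ids) <= n:
        return doc_ids
    rng = random.Random(seed)
    # Sort the sample so the display order is stable for a given selection.
    return sorted(rng.sample(doc_ids, n))


if __name__ == "__main__":
    print(sample_matching_docs(range(100), 5, seed=0))
```

Ranking or pagination could later replace the sampling step without changing how the web layer asks for results.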

Add document counts to the main cluster view

Add how to cite.

Running into NameError: name '_corpus' is not defined

After running hyperreal plaintext-corpus index corpus.db corpus_index.db, getting a NameError: name '_corpus' is not defined error.

500 error when pivoting by cluster

Getting this error after pulling new updates from the main branch:

Traceback (most recent call last):
  File "C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\cherrypy\_cprequest.py", line 638, in respond
    self._do_respond(path_info)
  File "C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\cherrypy\_cprequest.py", line 697, in _do_respond
    response.body = self.handler()
  File "C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\cherrypy\lib\encoding.py", line 223, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\cherrypy\_cpdispatch.py", line 54, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\server.py", line 167, in index
    rendered_docs = cherrypy.request.index.render_docs(ranked)
  File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 56, in wrapper_func
    return func(self, *args, **kwargs)
  File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 551, in render_docs
    return self.corpus.render_docs_html(doc_keys)
  File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\corpus.py", line 619, in render_docs_html
    docs = [(key, self._render_doc_key(key)) for key in doc_keys]
  File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\corpus.py", line 619, in <listcomp>
    docs = [(key, self._render_doc_key(key)) for key in doc_keys]
  File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\corpus.py", line 545, in _render_doc_key
    base_fields = list(
IndexError: list index out of range

Support more sophisticated querying

The current implementation of the index is a boolean information retrieval system: documents are represented and indexed as a set of attributes, and can be recalled based on the attributes present in the document (the features). The base feature clustering algorithm can be interpreted within this framework as creating a series of disjunctive queries (feature1 OR feature2 OR feature3) for documents that share these features. This is conceptually and computationally simple to work with, but obviously falls short of the expressive power of even early relevance ranking search engines.

The question is, what can we do about this? There are a few different directions we can consider:

1. Support more boolean search operations (AND, OR, NOT and grouping), for the following purposes:
   i. Directly querying for specific documents (outside the context of the model)
   ii. Creating new features from combinations of existing features
   iii. Extending the cluster model by allowing modification of the set of terms and how they influence the documents retrieved: instead of a topic being a cluster of features, a topic starts as a disjunction of features and can be incrementally modified to exclude or require certain features
2. Relevance ranking: either by identifying existing approaches that work with the current binary model (preferred), or by incorporating new indexing functionality that better accounts for relevance ranking of features. This could take the form of general relevance ranking for arbitrary queries, or of relevance ranking specifically in the construction of the clusters/topics.
3. An interface for arbitrary querying/ranking? This might be useful if we want to explore dense representations of documents as well as, or instead of, the current sparse boolean model. Thinking out loud, this could also work in conjunction with 1.iii.
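The idea of a topic that starts as a disjunction and is then narrowed can be sketched with plain sets. Here the index is modelled as a dictionary from feature to the set of matching document ids, and topic_query with its any_of, required and excluded parameters is a hypothetical interface, not something the package provides.

```python
def topic_query(index, any_of, required=(), excluded=()):
    """Evaluate (any_of[0] OR any_of[1] ...) AND required... AND NOT excluded...

    `index` maps each feature to the set of document ids containing it.
    """
    docs = set()
    # Start from the disjunction that a cluster of features represents.
    for feature in any_of:
        docs |= index.get(feature, set())
    # Every required feature must also be present in the document.
    for feature in required:
        docs &= index.get(feature, set())
    # Documents containing an excluded feature are dropped.
    for feature in excluded:
        docs -= index.get(feature, set())
    return docs


if __name__ == "__main__":
    index = {"cat": {1, 2, 3}, "dog": {2, 4}, "fish": {3, 5}}
    print(topic_query(index, ["cat", "dog"], required=["fish"]))
```

Because each step is a set operation on document ids, this stays within the current boolean model and would not need any new indexing functionality.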

Pick an interesting openly licensed dataset
Write an end-to-end test case that exercises at least the model building pipeline and functionality
