hyperreal's People

Contributors

katkasian, martinschweinberger, samhames

hyperreal's Issues

Pivot by selected query

Choose a cluster and/or feature, and pivot the display so that everything is sorted by similarity to that query. This is effectively a 1D projection of the dataset by similarity to a chosen probe query.

  • Index functionality to pivot according to a query
  • Pivot clusters view by selected cluster or feature
  • Pivot cluster detail view by selected feature
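
A minimal sketch of the pivot idea, assuming each cluster and the probe query can be represented as a set of matching document ids. Jaccard similarity is used here only as one plausible choice of measure, and the names `jaccard` and `pivot_by_query` are hypothetical, not part of the existing index.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of document ids."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pivot_by_query(cluster_docs, probe_docs):
    """Return (cluster_id, similarity) pairs, most similar to the probe first."""
    scored = [(cid, jaccard(docs, probe_docs)) for cid, docs in cluster_docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: cluster name -> ids of documents matching any feature in the cluster.
cluster_docs = {
    "travel": {1, 2, 3, 4},
    "visas": {3, 4, 5},
    "food": {6, 7},
}

# Pivot the whole display around the "visas" cluster.
ranking = pivot_by_query(cluster_docs, cluster_docs["visas"])
```

Sorting by a single similarity score is what makes this a 1D projection: every cluster gets one coordinate relative to the probe.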

Pathing / directory issue (may be Windows-specific) when attempting to write to a Temp directory

Getting the error below when trying to create an index from a corpus with the command hyperreal plaintext-corpus index corpus.db corpus_index.db:

Indexing corpus.db into corpus_index.db.
Traceback (most recent call last):
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 625, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\<...>\\AppData\\Local\\Temp\\tmpp5wq8h2m\\0'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 805, in onerror
    _os.unlink(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\<...>\\AppData\\Local\\Temp\\tmpp5wq8h2m\\0'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\Scripts\hyperreal-script.py", line 33, in <module>
    sys.exit(load_entry_point('hyperreal', 'console_scripts', 'hyperreal')())
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\<...>\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\<...>\pycharmprojects\hyperreal\hyperreal\cli.py", line 69, in plaintext_corpus_index
    doc_index.index()
  File "c:\users\<...>\pycharmprojects\hyperreal\hyperreal\index.py", line 364, in index
    self.save_corpus()
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 830, in __exit__
    self.cleanup()
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 834, in cleanup
    self._rmtree(self.name)
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 816, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 749, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 627, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 808, in onerror
    cls._rmtree(path)
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\tempfile.py", line 816, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 749, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 608, in _rmtree_unsafe
    onerror(os.scandir, path, sys.exc_info())
  File "C:\Users\<...>\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 605, in _rmtree_unsafe
    with os.scandir(path) as scandir_it:
NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\<...>\\AppData\\Local\\Temp\\tmpp5wq8h2m\\0'

The issue appears to be similar to Belval/pdf2image#151.

Environment: Windows 10, version 21H2
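
On Windows, a file cannot be deleted while any process still holds an open handle to it, which is one common cause of WinError 32. Whether that is the cause here is not confirmed by the traceback. The sketch below only illustrates the general pattern of closing every handle to a temporary file before removing it and before the temporary directory is cleaned up. The function name `merge_temp_segments` and the segment layout are hypothetical and not taken from hyperreal's code.

```python
import os
import tempfile


def merge_temp_segments(chunks):
    """Write chunks to numbered temp files, then read them back and merge.

    Every file is opened in its own `with` block, so its handle is closed
    before os.remove() runs and before TemporaryDirectory cleans up.
    """
    with tempfile.TemporaryDirectory() as tempdir:
        paths = []
        for n, chunk in enumerate(chunks):
            path = os.path.join(tempdir, str(n))
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(chunk)
            paths.append(path)

        merged = []
        for path in paths:
            with open(path, encoding="utf-8") as handle:
                merged.append(handle.read())
            # The handle is closed here, so removal is safe on Windows too.
            os.remove(path)
        return "".join(merged)
```

If the files are written by worker processes, the same rule applies across processes: the workers need to have closed their handles before the parent deletes the files. Python 3.10 added an ignore_cleanup_errors parameter to tempfile.TemporaryDirectory, but that suppresses the cleanup failure rather than fixing the open handle.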

First public release

  • #43
  • #48
  • #45
  • #38 Packaging and PyPI release
  • #18 Documentation:
    • #46
    • #47
  • Final feature work:
    • #30 Logging framework
    • #33 as a first attempt to tackle some aspects of reproducibility

Out of Scope for first release

  • Hierarchical clustering of features - leave to future work/optimisation for large corpora
  • Changes to scoring or display of document features
  • Design work on the corpus/index interface - wait for the initial features and usage to bed in before abstracting any further
  • Visual improvements outside of basic consistency work
  • Field related metadata to make suggestions for what to include in the model
  • Field related metadata to drive computational associations and drill down

Create a more complete documentation setup

Make it easier to keep adding documentation, instead of continually putting it in the too-hard basket.

  • Tutorial
  • API Docs
  • Architecture overview
  • Developer documentation/contributing
  • High level overview in the readme
  • Future plans documentation

Could not find instructions for valid index file

When trying to run the tool locally (command: python3 hyperreal/server.py serve), I encounter a ValueError (f"{args.index_path} is not a valid index file."). It would be helpful to know the format required for the index file and any other related steps.

Logging framework

Currently, status is reported ad hoc, and mostly in a way that isn't useful. We need to integrate a logging framework and actually instrument the various functions for it to be useful.
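
As a rough illustration of what instrumenting a function could look like, here is a minimal sketch using Python's standard logging module. The logger name "hyperreal.index" and the function `index_documents` are assumptions made for the example, not existing code.

```python
import logging

# Module-level logger; the name is an assumption for this sketch.
logger = logging.getLogger("hyperreal.index")


def index_documents(docs, batch_size=2):
    """Hypothetical stand-in for an indexing loop, instrumented with logging."""
    logger.info("Indexing %d documents in batches of %d", len(docs), batch_size)
    indexed = 0
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        indexed += len(batch)
        # Per-batch progress is noisy, so it goes to DEBUG rather than INFO.
        logger.debug("Indexed batch starting at %d (%d/%d)", start, indexed, len(docs))
    logger.info("Finished indexing %d documents", indexed)
    return indexed


if __name__ == "__main__":
    # The application entry point (for example the CLI) decides level and format.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    index_documents(["a", "b", "c"])
```

Library code only emits records through a named logger. Configuration stays with the entry point, so the CLI and the web server can choose different levels.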

Setup automated releases to PyPI

  • Decide on versioning scheme and whether to consider this a "stable" release or not
  • Depends on #43
  • Make repository public + move to somewhere else
  • PyPI credentials + release token
  • Final opportunity to rethink/review the name

Allow creating a new model and selecting fields through the web interface/CLI

This would make it easy to choose what to include or exclude, and would provide additional context about the indexing process.

  • Decide where in the URL hierarchy this fits?
  • Show the field_summary statistics table from the index
  • Form for creating a new model, with selectable fields
  • Option to run further iterations on the existing model

Framework for intersections/arrangements with a field

Common operations like time series trends are going to require an operator that takes a query and intersects it against an entire field, with varying operations and measurements. This is going to be particularly important for time, where simple counts and similarity will go a long way.

The best long-term API/functionality is very uncertain, but right now something simple, like an iterator over the values in a field combined with indexing-time binning, might be enough to do some useful things with a prechosen granularity for aggregations. This would allow something like one of the following to construct an hourly count of something:

from hyperreal import index
i = index.Index('test_index.db')

# `query` is assumed to be the set of documents matching a probe query,
# and `field` the name of the field being iterated, e.g. 'text'.

# Option 1: reuse __getitem__ as an iterator when provided with only a string as the field name
[(value, query.intersection_cardinality(docs)) for value, docs in i['text']]
# Option 2: add an additional method --> this is probably clearer that this is an iterator of outputs
[(value, query.intersection_cardinality(docs)) for value, docs in i.iterate_field_values('text')]
# Option 3: use the slice notation, with the None value as a sentinel for "from the beginning/to the end"
[(field, value, query.intersection_cardinality(docs)) for (field, value), docs in i[(field, None):(field, None)]]

This will probably also feed into future work on fancier drilldowns, and approaches to how we integrate/display non-model features within the context of the model clusters.
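
To make the hourly count idea concrete, here is a self-contained sketch that uses plain Python sets in place of the index. The helper names `bin_by_hour` and `iterate_field_values` are hypothetical stand-ins for indexing-time binning and the proposed field iterator. They are not the real API.

```python
from collections import defaultdict
from datetime import datetime


def bin_by_hour(timestamped_docs):
    """Indexing-time binning: map each hour to the set of doc ids created in it."""
    bins = defaultdict(set)
    for doc_id, created in timestamped_docs:
        bins[created.replace(minute=0, second=0, microsecond=0)].add(doc_id)
    return bins


def iterate_field_values(field_bins):
    """Yield (value, doc_ids) pairs in sorted value order."""
    for value in sorted(field_bins):
        yield value, field_bins[value]


docs = [
    (0, datetime(2022, 1, 1, 9, 5)),
    (1, datetime(2022, 1, 1, 9, 40)),
    (2, datetime(2022, 1, 1, 10, 15)),
    (3, datetime(2022, 1, 1, 11, 0)),
]
query = {1, 2}  # doc ids matching some probe query

# Intersect the query against every value of the binned field.
hourly = [
    (value, len(query & doc_ids))
    for value, doc_ids in iterate_field_values(bin_by_hour(docs))
]
```

Because the granularity is fixed when the bins are built, the query-time work is just one set intersection per field value.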

Nice-to-have: Error handling of cluster selection when mistakenly not specifying clusters

I stumbled upon a very minor scenario where not specifying a number of clusters when editing them throws a 500 error, so documenting it here.

Steps to reproduce

  1. Select a cluster for editing from the page with all clusters using a checkbox
  2. Click "Edit selected clusters" option
  3. Click "Refine selected clusters" without providing a number of clusters
  4. See a 500 Error / KeyError appear:
Traceback (most recent call last):
  File "C:\Users\<user>\hyperreal\venv\lib\site-packages\cherrypy\_cprequest.py", line 638, in respond
    self._do_respond(path_info)
  File "C:\Users\<user>l\venv\lib\site-packages\cherrypy\_cprequest.py", line 697, in _do_respond
    response.body = self.handler()
  File "C:\Users\<user>\hyperreal\venv\lib\site-packages\cherrypy\lib\encoding.py", line 223, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "C:\Users\<user>\hyperreal\venv\lib\site-packages\cherrypy\_cpdispatch.py", line 54, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "c:\users\<user>hyperreal\hyperreal\server.py", line 198, in refine
    cherrypy.request.index.refine_clusters(
  File "c:\users\<user>\hyperreal\hyperreal\index.py", line 101, in wrapper
    results = func(*args, **kwargs)
  File "c:\users\<user>\hyperreal\hyperreal\index.py", line 1793, in refine_clusters
    cluster_feature, new_cluster_ids = self._refine_feature_groups(
  File "c:\users\<user>\hyperreal\hyperreal\index.py", line 1692, in _refine_feature_groups
    comparison_delta, comparison_cluster = best_feature_clusters[
KeyError: 8194
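
One possible way to handle this more gracefully is to validate the form value before calling the refinement step. This is a sketch only: the helper name `parse_cluster_count` is hypothetical, and the traceback does not establish that the empty field is the direct cause of the KeyError.

```python
def parse_cluster_count(raw_value, minimum=1):
    """Parse a form field into a cluster count.

    Raises ValueError with a user-readable message instead of letting an
    empty or malformed value reach the clustering code.
    """
    if raw_value is None or str(raw_value).strip() == "":
        raise ValueError("Please provide the number of clusters to refine into.")
    try:
        count = int(raw_value)
    except (TypeError, ValueError):
        raise ValueError(
            f"Number of clusters must be a whole number, got {raw_value!r}."
        ) from None
    if count < minimum:
        raise ValueError(f"Number of clusters must be at least {minimum}, got {count}.")
    return count
```

The web layer could catch the ValueError and return a 400 response containing the message, rather than a 500.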

Decide on a more sustainable web architecture

Currently the web layer is a big ol mess of a prototype - at some point this will need to be reviewed from an architecture perspective to see what this actually needs to be in the future.

Directory / process related errors

Getting the below error when running hyperreal stackexchange-corpus index travel_sx.db travel_sx_index.db:

(venv) PS C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal> hyperreal stackexchange-corpus index travel_sx.db travel_sx_index.db
Indexing travel_sx.db into travel_sx_index.db.
defaultdict(<class 'list'>, {0: ['C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\0']})
defaultdict(<class 'list'>, {0: ['C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\0', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\1']})
defaultdict(<class 'list'>, {0: ['C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\0', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\1', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\7']})
defaultdict(<class 'list'>, {0: ['C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\0', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\1', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\7', 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\2']})
Traceback (most recent call last):
  File "c:\users\nXXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 537, in index
    os.remove(next_temp_file)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\13'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 617, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\13'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 820, in onerror
    _os.unlink(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\13'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\Scripts\hyperreal-script.py", line 33, in <module>
    sys.exit(load_entry_point('hyperreal', 'console_scripts', 'hyperreal')())
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\nXXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\cli.py", line 159, in stackexchange_corpus_index
    doc_index.index(doc_batch_size=doc_batch_size)
  File "c:\users\nXXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 145, in wrapper_func
    return func(self, *args, **kwargs)
  File "c:\users\nXXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 380, in index
    with tempfile.TemporaryDirectory() as tempdir:
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 846, in __exit__
    self.cleanup()
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 850, in cleanup
    self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 832, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 749, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 619, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 823, in onerror
    cls._rmtree(path, ignore_errors=ignore_errors)
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\tempfile.py", line 832, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 749, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 600, in _rmtree_unsafe
    onerror(os.scandir, path, sys.exc_info())
  File "C:\Users\NXXXXXX\AppData\Local\Programs\Python\Python310\lib\shutil.py", line 597, in _rmtree_unsafe
    with os.scandir(path) as scandir_it:
NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\NXXXXX~1\\AppData\\Local\\Temp\\tmpvgyzcycl\\13'
(venv) PS C:\Users\NXXXXXX\qutscripts\PycharmProjects\hyperreal> 

Paths to corpus objects stored in the index are relative to the base directory, not the index directory.

It probably makes more sense to store the path to the corpus as a path relative to the index file, not the working directory. Currently the following won't work because of this:

# Create an index in the current directory
hyperreal plaintext-corpus index example.db example_index.db
cd ..
# The index will successfully open, but trying to access the corpus will look for example.db in the current directory!
hyperreal serve example_directory/example_index.db
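
A minimal sketch of the proposed behaviour, assuming the index stores the corpus location as a plain path string. The function names `to_stored_path` and `from_stored_path` are hypothetical.

```python
import os
from pathlib import Path


def to_stored_path(corpus_path, index_path):
    """Express the corpus path relative to the directory containing the index file."""
    index_dir = Path(index_path).resolve().parent
    return os.path.relpath(Path(corpus_path).resolve(), start=index_dir)


def from_stored_path(stored_path, index_path):
    """Resolve a stored relative path against the index file's directory."""
    index_dir = Path(index_path).resolve().parent
    return (index_dir / stored_path).resolve()


# The index and corpus live side by side, so only the file name is stored.
stored = to_stored_path(
    "example_directory/example.db", "example_directory/example_index.db"
)
# Opening the index from any working directory now finds the same corpus file.
corpus = from_stored_path(stored, "example_directory/example_index.db")
```

One caveat: on Windows, os.path.relpath raises ValueError when the two paths are on different drives, so an absolute-path fallback would be needed there.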

Allow editing of the base model

I think this will just be a temporary waypoint while we think about a more solid direction for future work, but there's no reason not to start with this low-hanging fruit.

As a starting point I'd suggest we want to do the following:

  • Delete a cluster (and remove all contained features within it)
  • Delete a feature from a cluster
  • Merge two or more selected clusters into one
  • Select a group of features within a cluster and create a new distinct cluster from them
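
The four operations above can be sketched on a toy model, assuming the model is simply a mapping from cluster id to a set of features. The real storage is in the index and will differ, and all function names here are hypothetical.

```python
def delete_cluster(clusters, cluster_id):
    """Remove a cluster and all of the features it contains."""
    clusters.pop(cluster_id)


def delete_feature(clusters, cluster_id, feature):
    """Remove a single feature from a cluster."""
    clusters[cluster_id].discard(feature)


def merge_clusters(clusters, cluster_ids):
    """Merge two or more clusters into the first listed cluster; return its id."""
    target, *rest = cluster_ids
    for cluster_id in rest:
        clusters[target] |= clusters.pop(cluster_id)
    return target


def split_features(clusters, cluster_id, features):
    """Move the selected features into a new cluster; return the new cluster id."""
    selected = clusters[cluster_id] & set(features)
    clusters[cluster_id] -= selected
    new_id = max(clusters, default=-1) + 1
    clusters[new_id] = selected
    return new_id


clusters = {0: {"cat", "dog"}, 1: {"car", "bus"}, 2: {"tram"}, 3: {"noise"}}
delete_cluster(clusters, 3)
delete_feature(clusters, 0, "dog")
merged_id = merge_clusters(clusters, [1, 2])
new_id = split_features(clusters, merged_id, ["bus"])
```

Each operation keeps every remaining feature in exactly one cluster.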

Revisit the overall architecture as well

Now that I've played around with things a bit, I have a better idea of what things need to look like. Time to start writing things down to capture the thoughts.

Show matching documents for a query

We want to see the documents associated with a particular cluster or selected features - this supports close reading of the documents in conjunction with the surface forms of the features grouped by the model.

This requires:

  • The corpus object needs to have some standardised way of rendering documents for the web layer
  • The web layer needs a way to represent choice of objects to indicate what to search for. At a minimum we should be able to represent both a single specific feature, and also a cluster of features. This will likely be significantly expanded in the future.
  • UI for representing the query results in the web layer - I'll put together some wireframes later with an emphasis on "easy to implement"
  • Pagination or random selection of a number of results matching the search query. I'd suggest that we go for random sampling as the simplest thing for the initial version and come back to ranking and/or pagination later.
  • Is this all supported by the index, or do we need some additional features?
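
The random sampling option suggested above could look something like the following sketch, assuming a query evaluates to a collection of matching document ids. The function name `sample_matching_docs` and the optional seed are assumptions for illustration.

```python
import random


def sample_matching_docs(matching_doc_ids, sample_size, seed=None):
    """Return up to sample_size doc ids chosen uniformly at random, in sorted order."""
    rng = random.Random(seed)
    # Sort first so that a fixed seed gives the same sample on every run.
    population = sorted(matching_doc_ids)
    k = min(sample_size, len(population))
    return sorted(rng.sample(population, k))


matching = set(range(100))
page = sample_matching_docs(matching, 10, seed=0)
```

Passing a seed makes a sample reproducible, which may be useful if the same set of documents needs to be shown again on reload.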

500 error when pivoting by cluster

Getting this error after pulling new updates from the main branch:

Traceback (most recent call last):
  File "C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\cherrypy\_cprequest.py", line 638, in respond
    self._do_respond(path_info)
  File "C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\cherrypy\_cprequest.py", line 697, in _do_respond
    response.body = self.handler()
  File "C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\cherrypy\lib\encoding.py", line 223, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "C:\Users\NXXXXX\qutscripts\PycharmProjects\hyperreal\venv\lib\site-packages\cherrypy\_cpdispatch.py", line 54, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\server.py", line 167, in index
    rendered_docs = cherrypy.request.index.render_docs(ranked)
  File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 56, in wrapper_func
    return func(self, *args, **kwargs)
  File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\index.py", line 551, in render_docs
    return self.corpus.render_docs_html(doc_keys)
  File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\corpus.py", line 619, in render_docs_html
    docs = [(key, self._render_doc_key(key)) for key in doc_keys]
  File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\corpus.py", line 619, in <listcomp>
    docs = [(key, self._render_doc_key(key)) for key in doc_keys]
  File "c:\users\NXXXXX\qutscripts\pycharmprojects\hyperreal\hyperreal\corpus.py", line 545, in _render_doc_key
    base_fields = list(
IndexError: list index out of range

Support more sophisticated querying

The current implementation of the index is a boolean information retrieval system - documents are represented and indexed as a set of attributes, and can be recalled based on the attributes present in the document (the features). The base feature clustering algorithm can be interpreted within this framework as creating a series of disjunctive queries (feature1 OR feature2 OR feature3) for documents that share these features. This is conceptually and computationally simple to work with, but obviously falls short of the expressive power of even early relevance ranking search engines.
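
To illustrate the disjunctive interpretation, here is a small sketch that treats the index as a mapping from feature to the set of documents containing it. The names `postings` and `docs_matching_any` are hypothetical, and plain sets stand in for whatever the index really stores.

```python
def docs_matching_any(postings, features):
    """Evaluate feature1 OR feature2 OR ... as a union of document sets."""
    matched = set()
    for feature in features:
        matched |= postings.get(feature, set())
    return matched


# Toy inverted index: feature -> ids of documents containing that feature.
postings = {
    "flight": {1, 2},
    "airport": {2, 3},
    "visa": {4},
}

# A cluster is just a group of features, so its documents are the union.
cluster = ["flight", "airport"]
cluster_docs = docs_matching_any(postings, cluster)
```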

The question is, what can we do about this? There are a few different directions we can consider:

  1. Support more boolean search operations (AND, OR, NOT and grouping), for the following purposes:
    1. Directly querying for specific documents (outside the context of the model)
    2. Creating new features from combinations of existing features
    3. Extend the cluster model, by allowing modification of the set of terms and how they influence the documents retrieved - instead of a topic being a cluster of features, a topic starts as a disjunction of features and can be incrementally modified to exclude or require certain features to be present
  2. Relevance ranking: either by identifying existing approaches that work with the current binary model (preferred), or by incorporating new indexing functionality that better accounts for relevance ranking of features. This could take the form of either general relevance ranking for arbitrary queries, or relevance ranking specifically in the construction of the clusters/topics.
  3. An interface for arbitrary querying/ranking? This might be useful if we want to explore dense representations of documents as well as, or instead of, the current sparse boolean model. Thinking out loud, this could also work in conjunction with 1.3.
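
The idea in 1.3 of a topic that starts as a disjunction and is then incrementally constrained can be sketched as follows. The `Topic` class and its field names are hypothetical, and the toy feature-to-documents mapping stands in for the real index.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A topic as a boolean query: (any_of) AND (all_of) AND NOT (none_of)."""
    any_of: set = field(default_factory=set)
    all_of: set = field(default_factory=set)
    none_of: set = field(default_factory=set)

    def evaluate(self, postings):
        docs = set()
        # Start from the disjunction of the cluster's features.
        for feature in self.any_of:
            docs |= postings.get(feature, set())
        # Required features narrow the result.
        for feature in self.all_of:
            docs &= postings.get(feature, set())
        # Excluded features remove documents.
        for feature in self.none_of:
            docs -= postings.get(feature, set())
        return docs


# Toy inverted index: feature -> ids of documents containing that feature.
postings = {
    "flight": {1, 2, 5},
    "airport": {2, 3},
    "delay": {2, 5},
    "lost": {5},
}

topic = Topic(any_of={"flight", "airport"})
as_cluster = topic.evaluate(postings)

topic.all_of.add("delay")
with_required = topic.evaluate(postings)

topic.none_of.add("lost")
with_excluded = topic.evaluate(postings)
```

An unmodified topic evaluates to the same documents as the plain feature cluster, so this extends the current model rather than replacing it.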

Add basic automated tests

Want to avoid things like #3 🙃

  • Pick an interesting openly licensed dataset
  • Write an end-to-end test case that exercises at least the model building pipeline and functionality
