
tsdataclinic / smooshr

Tool to consolidate entries and columns from multiple datasets

Home Page: https://tsdataclinic.github.io/smooshr/

License: Apache License 2.0

JavaScript 46.90% HTML 0.91% CSS 0.22% Dockerfile 0.19% Python 2.42% Shell 0.03% Jupyter Notebook 40.59% SCSS 8.75%

smooshr's People

Contributors

brahmcapoor, jonathangrant, jps327, kaushik12, rachaelwr, stuartlynn


smooshr's Issues

Tweak loading for URL and open data options to indicate loading without progress

Right now we can't give progress updates for URL or Open Data sources, because we can't make a range request against those resources, which is what Papa Parse uses to stream data. We either need to implement this streaming on the proxy server somehow or find some other solution; for now we can caution that a dataset might take a while to read, and display a spinner rather than a progress bar.
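
A minimal sketch of how the loading modal could branch (Spinner and ProgressBar are hypothetical components, and the source-type check is an assumption about how sources are tagged):

function LoadingIndicator({ source, bytesRead, totalBytes }) {
  // Spinner and ProgressBar are placeholder components (assumption).
  // URL / Open Data sources can't report progress, so fall back to a spinner.
  const supportsProgress = source.type === 'local' && totalBytes > 0;
  if (!supportsProgress) {
    return <Spinner message="This dataset might take a while to read..." />;
  }
  return <ProgressBar fraction={bytesRead / totalBytes} />;
}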

Define a schema for the different components of the smooshr data model

As we move to a different storage system and way of representing operations on a dataset, we will need a more robust schema. Currently, the very simple schema we have is

  • Project: Contains multiple datasets
  • Dataset: Represents the full dataset as a set of summary data and multiple Columns and MetaColumns
  • Column: Represents a column in the original dataset; has a name and a list of unique entries
  • MetaColumn: A simple way of treating two columns as one; this ultimately gets merged into a single column when we run the code output
  • Entry: A unique entry in a column, which has a value and the number of times it occurs in that column
  • Mapping: A collection of entries for a specific column that will be mapped to another value

We probably want to rethink this schema to make it a lot more robust to the other tasks we want to run in smooshr.
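
As plain JavaScript objects, the current schema looks roughly like the following (field names are illustrative, not the actual store shape):

// Illustrative only: approximate shapes for the objects listed above.
const entry = { value: 'NYC', count: 120 };
const column = { name: 'city', entries: [entry] };
const metaColumn = { name: 'city_combined', columns: ['city', 'town'] };
const mapping = { column: 'city', entries: ['NYC', 'nyc'], to: 'New York City' };
const dataset = {
  name: 'permits.csv',
  summary: { rows: 5000 },
  columns: [column],
  metaColumns: [metaColumn],
};
const project = { name: 'My project', datasets: [dataset], mappings: [mapping] };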

Add ability to reject suggestions to better define future ones

Currently, the word embedding system only looks for things near the mean of the current entries in the category selection. It would be great to have another button that explicitly removes a suggestion from the list and incorporates that embedding into the suggestion search in a negative way.
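
A sketch (not the current implementation) of how rejected suggestions could be folded into the scoring, by subtracting similarity to the mean of the rejected embeddings:

// Score candidates against the accepted mean, penalised by the rejected mean.
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);
const cosine = (a, b) =>
  dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)) || 1);
const mean = vectors =>
  vectors[0].map((_, i) => vectors.reduce((s, v) => s + v[i], 0) / vectors.length);

function rankSuggestions(candidates, acceptedEmbeddings, rejectedEmbeddings) {
  const posMean = mean(acceptedEmbeddings);
  const negMean = rejectedEmbeddings.length ? mean(rejectedEmbeddings) : null;
  return candidates
    .map(c => ({
      ...c,
      score: cosine(c.embedding, posMean) - (negMean ? cosine(c.embedding, negMean) : 0),
    }))
    .sort((a, b) => b.score - a.score);
}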

[New Feature]: Building crosswalks using smooshr

Given the existing mapping functionality, I imagine we could extend it to build crosswalks by mapping (one-to-one, one-to-many?) entities from a column in one dataset to entities in a column from another dataset.
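
One possible shape for such a crosswalk, building on the existing mapping idea (all names below are hypothetical):

// Hypothetical crosswalk object: entries from one column linked to one or
// more entries in a column of another dataset.
const crosswalk = {
  from: { dataset: 'permits_2019', column: 'agency' },
  to: { dataset: 'budget_2019', column: 'agency_name' },
  links: [
    { from: 'DOT', to: ['Department of Transportation'] },  // one-to-one
    { from: 'DPR', to: ['Parks', 'Parks and Recreation'] },  // one-to-many
  ],
};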

Move embeddings to a database

Currently, the word embedding dataset we are using is over 3.2 GB of data. It gets loaded into memory by gensim when the server starts up. This takes a while and is not ideal if we want to move this process to a worker, for example, as each worker would need to load the data into memory.

Instead we should look at whether the embeddings can be loaded into a database and queried, or even just a key-value store. This will reduce the overall memory usage of the app and improve load times.

This wouldn't need to do much more than look up a given word and return its vector, seeing as we are doing most of the similarity / clustering client side.

Create Project level abstraction

Create a catch-all project abstraction that can be used to reference multiple data files, with mappings scoped to that entity.

Allow loading of projects

Currently we can save a project but not load it.

This should be pretty simple and will help facilitate sharing on the community server when we have it.
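
A minimal sketch, assuming projects are saved as JSON files (loadProjectIntoStore stands in for whatever context/dispatch update the app uses):

function loadProjectFile(file, loadProjectIntoStore) {
  const reader = new FileReader();
  reader.onload = event => {
    // Parse the saved project JSON and push it into app state
    // (loadProjectIntoStore is a placeholder for the real update).
    const project = JSON.parse(event.target.result);
    loadProjectIntoStore(project);
  };
  reader.readAsText(file);
}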

Add ability to pull datasets from URLs

Currently, all datasets need to be loaded locally. It would be interesting to have datasets be referenceable by a URL instead. This would let us have projects that pull from and tidy open data specifically, and perhaps make those mappings public.
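
Papa Parse can fetch and parse a remote CSV directly with its download option; a rough sketch (routing non-CORS sources through a proxy is an assumption):

import Papa from 'papaparse';

function loadDatasetFromURL(url, onDone) {
  Papa.parse(url, {
    download: true,        // fetch the URL instead of reading a local file
    header: true,
    skipEmptyLines: true,
    complete: results => onDone(results.data, results.meta.fields),
    error: err => console.error('Failed to load dataset from URL', err),
  });
}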

Batch request embeddings from the server for performance improvement

Currently we send a request per unique word to the embedding server to get that word's embedding vector.

The server supports sending multiple words at a time and getting back the results. We should chunk up the requests to make fewer API calls which should make the embedding fetching quicker.

const get_embedings_from_server = entries => {
  let unique_words = new Set();
  entries.forEach(entry => {
    entry.name.split(' ').forEach(word => {
      unique_words.add(word);
    });
  });
  return Promise.all(
    Array.from(unique_words).map(entry =>
      fetch(
        `${
          process.env.REACT_APP_API_URL
        }/embedding/${entry.toLowerCase().replace(/[\W_]+/g, '')}`,
      )
        .then(r => r.json())
        .then(r => r[0]),
    ),
  );
};

This is the function that will need to be modified to run the queries in batches and then correctly assign the results once each batch has returned (a sketch of a batched version follows the considerations below).

Things to consider :

  1. The server might fail if one or more of the words does not have a representation in the corpus. We would need to handle that here:

    smooshr/server/server.py

    Lines 66 to 80 in 8b11ccb

    @app.route('/embedding/<words>')
    def embeding(words):
        conn = get_db()
        try:
            words = words.split(',')
            sql = "select * from embeddings where key in ({seq})".format( seq=','.join(['?']*len(words)))
            result = conn.execute(sql, words)
            result = [ [r[0], r[1].tolist()] for r in result ]
            result = [ {"key": key, "embedding": embed} for key,embed in dict(result).items() ]
            return jsonify(result)
        except:
            return jsonify([])

    if __name__=='__main__':
        print('starting up server')
        app.run(host='0.0.0.0', port=5000, debug=True)

  2. It would also be good to give some feedback on this process that can show in the classification interface, to let a user know how much of the embedding has been loaded.
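
A sketch of a batched version (the chunk size and the onProgress callback are illustrative). The server route already accepts comma-separated words, and because words missing from the corpus simply don't come back, the results are keyed by word rather than by position:

const chunk = (items, size) =>
  Array.from({ length: Math.ceil(items.length / size) }, (_, i) =>
    items.slice(i * size, (i + 1) * size),
  );

const get_embeddings_batched = (entries, onProgress) => {
  const unique_words = Array.from(
    new Set(
      entries.flatMap(entry =>
        entry.name.split(' ').map(w => w.toLowerCase().replace(/[\W_]+/g, '')),
      ),
    ),
  ).filter(w => w.length > 0);

  const batches = chunk(unique_words, 50);
  let completed = 0;
  return Promise.all(
    batches.map(words =>
      fetch(`${process.env.REACT_APP_API_URL}/embedding/${words.join(',')}`)
        .then(r => r.json())
        .then(results => {
          completed += 1;
          if (onProgress) onProgress(completed / batches.length); // consideration 2
          return results; // array of {key, embedding}; missing words are absent (consideration 1)
        }),
    ),
  ).then(batchResults =>
    Object.fromEntries(batchResults.flat().map(({ key, embedding }) => [key, embedding])),
  );
};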

Investigate different models for describing an analysis flow using a DAG or similar structure.

We currently only have two types of operation in smooshr:

  1. Combine columns together
  2. Create a taxonomy for a given column

In the future we would like to have more steps, for example:

  • Extract part of a column as a new column, for example pulling "Some City" out of an address like "23 Some Street, Some City, US, 11221"
  • Standardize a time column
  • Merge the contents of two columns together to form a new column
  • Do entity matching on a given column
  • etc

Some of these steps will have dependencies on previous steps that are hard to predict at run time. It would be great to have each individual transform be defined as a node in a graph, with dependencies linked by edges. Essentially a DAG.

This would inform both the UI and the Python code that is ultimately spit out by the tool.
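
A rough sketch of what this could look like (node and edge shapes are illustrative); a topological sort over the graph would give the order in which to emit the generated Python:

// Illustrative node/edge shapes, not a proposed final format.
const nodes = {
  load_permits: { op: 'load_dataset', params: { name: 'permits.csv' } },
  extract_city: { op: 'extract_part', params: { column: 'address', into: 'city' } },
  map_city: { op: 'apply_mapping', params: { column: 'city', mapping: 'city_map' } },
};
const edges = [
  ['load_permits', 'extract_city'],
  ['extract_city', 'map_city'],
];

// Kahn's algorithm: repeatedly take nodes with no remaining dependencies.
function topoSort(nodes, edges) {
  const inDegree = Object.fromEntries(Object.keys(nodes).map(id => [id, 0]));
  edges.forEach(([, to]) => inDegree[to]++);
  const queue = Object.keys(nodes).filter(id => inDegree[id] === 0);
  const order = [];
  while (queue.length) {
    const id = queue.shift();
    order.push(id);
    edges
      .filter(([from]) => from === id)
      .forEach(([, to]) => {
        if (--inDegree[to] === 0) queue.push(to);
      });
  }
  return order; // ['load_permits', 'extract_city', 'map_city']
}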

Some links to projects that might be worth looking at

Make drop zone bigger

Currently the drop zone for files is just the text; expand this to use the entire box.

Set up an initial clustering guess

Attempt to use the word embeddings to generate a starting point for possible categories.

Open questions here:

  • Can we prompt users to suggest category starting points, basically seeds for the clustering algorithm? (A seeded sketch follows this list.)

  • How do we define a catch-all category that picks out everything that is nothing like anything else?

  • How do we select the number of clusters? Do we try to do that automatically? If not, how do we inform the user?

  • Can clustering be done interactively on the remaining categories?
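
A minimal sketch of a seeded first guess (the similarity threshold and the catch-all bucket are assumptions): assign each entry to the nearest seed category by cosine similarity, or to "other" if nothing is close.

const cosine = (a, b) => {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  return dot / (Math.hypot(...a) * Math.hypot(...b) || 1);
};

function initialClusters(entries, seeds, threshold = 0.4) {
  const clusters = Object.fromEntries(seeds.map(s => [s.name, []]));
  clusters.other = []; // catch-all for entries unlike anything else
  entries.forEach(entry => {
    let best = { name: 'other', score: -Infinity };
    seeds.forEach(seed => {
      const score = cosine(entry.embedding, seed.embedding);
      if (score > best.score) best = { name: seed.name, score };
    });
    clusters[best.score >= threshold ? best.name : 'other'].push(entry);
  });
  return clusters;
}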

Create concept of a meta-column

Currently we don't have a way to combine columns from multiple datasets. We need to create some kind of meta-column entity that can reference columns from multiple datasets as the same column in the rest of the app, combining entries from each, for example.

Open questions here:

  • Do meta columns support concatenation of columns within a dataset? Or extraction of data from those columns?

  • Even if two columns from separate files are selected as the same, are there mappings that need to be file specific for each of them? I don't think so, but we need to check this.
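
One way a meta-column could combine entries as described above: union the entries of its member columns and sum the counts of values that appear in more than one file (a sketch, not a proposed design):

function buildMetaColumnEntries(columns) {
  const counts = new Map();
  columns.forEach(column =>
    column.entries.forEach(({ value, count }) =>
      counts.set(value, (counts.get(value) || 0) + count),
    ),
  );
  // Back to the Entry shape used elsewhere: { value, count }.
  return Array.from(counts, ([value, count]) => ({ value, count }));
}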

Explore undo functionality

Explore how to do this using the React context API. Time travel like this is doable with redux; not sure about the context API.
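
The usual past/present/future pattern should work with useReducer plus a Context provider just as it does with redux; a sketch (not tied to the current state shape):

// Wrap any reducer so UNDO/REDO walk a history of previous states.
function undoable(reducer) {
  const initial = { past: [], present: reducer(undefined, { type: '@@INIT' }), future: [] };
  return (state = initial, action) => {
    const { past, present, future } = state;
    switch (action.type) {
      case 'UNDO':
        if (!past.length) return state;
        return {
          past: past.slice(0, -1),
          present: past[past.length - 1],
          future: [present, ...future],
        };
      case 'REDO':
        if (!future.length) return state;
        return { past: [...past, present], present: future[0], future: future.slice(1) };
      default: {
        const next = reducer(present, action);
        if (next === present) return state;
        return { past: [...past, present], present: next, future: [] };
      }
    }
  };
}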

Change the server for embeddings to sqlite

Making the embeddings more portable will make deployment easier. Currently we are using PostgreSQL, which is perhaps more than we need. Moving to SQLite might make more sense here, as we could simply download the .sqlite file and run the server to get going.

Simplify docker setup for backend

Currently we have a Flask app + Redis + Celery and workers. This was because we anticipated doing much more of this work on the backend, but seeing as we aren't, we should remove these dependencies.

Investigate different ways of storing the data smooshr is using

Currently, smooshr uses in-memory storage to represent a dataset while users are working on it.

As we move to more sophisticated analysis, we might need to rethink how we do this in a more efficient way.

Some options are

  • Using IndexedDB, the browser's built-in database system, probably through a library like Dexie. Note we currently use IndexedDB as dumb offline storage, but this would move it to a more structured database (see the sketch after this list).

  • Using sql.js, a compiled version of SQLite that runs in WebAssembly and provides basically native performance in the browser. We would still need to figure out how to store the SQLite database offline, but this could give us a really nice, flexible interface (SQL) for performing operations on the datasets.

  • Something else? The File System Access API is worth keeping an eye on (https://web.dev/file-system-access/), as it would let us save and read projects in a similar way to a native app.
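
A sketch of the IndexedDB-via-Dexie option mentioned above (table and index names are illustrative, not a proposed final schema):

import Dexie from 'dexie';

const db = new Dexie('smooshr');
db.version(1).stores({
  datasets: '++id, name',
  entries: '++id, columnID, value', // indexed on columnID and value
});

// Example query: all unique entries recorded for a given column.
async function entriesForColumn(columnID) {
  return db.entries.where('columnID').equals(columnID).toArray();
}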

Explore moving to typescript

As this becomes more complicated, it might be worth investing the time to move the project to TypeScript. Some basic type checking might help as we grow the project.

Set up CI

Once we know where the server is going to run, we will need to set up CI to be able to deploy it easily.

GitHub seems to have a new Actions service. Check this out for the deploy!

Either disable multiple file uploads at the same time or fix multiple file uploads

Currently the file loader is a little stuck between working OK with one dataset while still allowing multiple to be uploaded, and properly supporting multiple files.

We should be able to support multiple uploads pretty well, given that the CSV parser spawns multiple web workers to do the parsing, so uploading two files shouldn't affect performance too much.
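
A sketch of how multiple dropped files could be parsed in parallel, each in its own Papa Parse web worker:

import Papa from 'papaparse';

function parseFiles(files) {
  return Promise.all(
    Array.from(files).map(
      file =>
        new Promise((resolve, reject) => {
          Papa.parse(file, {
            header: true,
            worker: true, // parse off the main thread
            skipEmptyLines: true,
            complete: results => resolve({ file, results }),
            error: reject,
          });
        }),
    ),
  );
}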

Cull unused code

There is a bunch of unused code that we should purge before launch.

Explore collaborative work using a P2P system.

Currently we have no way to collaborate on a taxonomy between multiple machines. We don't want to have a centralized store of the data, so if we are to implement this, we probably want to use a P2P system.

Explore distributing state using OrbitDB or something similar.

Add progress bar to file loading

We can probably add a progress bar to the file loading modal by monitoring how many bytes we have read in already. This would make for a nicer experience, as you could see how much of the file is left to load.
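
A sketch of one way to do this: Papa Parse's step callback reports the byte offset it has reached in results.meta.cursor, which can be compared against file.size (the onProgress and onDone callbacks are illustrative):

import Papa from 'papaparse';

function parseWithProgress(file, onProgress, onDone) {
  const rows = [];
  Papa.parse(file, {
    header: true,
    step: results => {
      rows.push(results.data);
      onProgress(results.meta.cursor / file.size); // fraction read so far
    },
    complete: () => onDone(rows),
  });
}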
