
tsdataclinic / smooshr

Tool to consolidate entries and columns from multiple datasets

Home Page: https://tsdataclinic.github.io/smooshr/

License: Apache License 2.0

JavaScript 46.90% HTML 0.91% CSS 0.22% Dockerfile 0.19% Python 2.42% Shell 0.03% Jupyter Notebook 40.59% SCSS 8.75%

smooshr's People

Contributors

brahmcapoor, jonathangrant, jps327, kaushik12, rachaelwr, stuartlynn


smooshr's Issues

Tweak loading for URL and open data options to indicate loading without progress

Right now we can't give progress updates for URL or Open Data sources, because we can't make a range request against those resources, which is what Papa Parse uses to stream data. We either need to implement this streaming on the proxy server somehow or find some other solution; for now we can caution that a dataset might take a while to read, and display a spinner rather than a progress bar.
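
A minimal sketch of how the loading modal could branch (Spinner and ProgressBar are hypothetical components, and the source-type check is an assumption about how sources are tagged):

function LoadingIndicator({ source, bytesRead, totalBytes }) {
  // Spinner and ProgressBar are placeholder components (assumption).
  // URL / Open Data sources can't report progress, so fall back to a spinner.
  const supportsProgress = source.type === 'local' && totalBytes > 0;
  if (!supportsProgress) {
    return <Spinner message="This dataset might take a while to read..." />;
  }
  return <ProgressBar fraction={bytesRead / totalBytes} />;
}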

Define a schema for the different components of the smooshr data model

As we move to a different storage system and way of representing operations on a dataset, we will need a more robust schema. Currently, the very simple schema we have is

  • Project: Contains multiple datasets
  • Dataset: Represents the full dataset as a set of summary data and multiple Columns and MetaColumns
  • Column: Represents a column in the original dataset; has a name and a list of unique entries
  • MetaColumn: A simple way of treating two columns as one; this ultimately gets merged into a single column when we run the code output
  • Entry: A unique entry in a column, which has a value and the number of times it occurs in that column
  • Mapping: A collection of entries for a specific column that will be mapped to another value

We probably want to rethink this schema to make it a lot more robust to the other tasks we want to run in smooshr.
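
As plain JavaScript objects, the current schema looks roughly like the following (field names are illustrative, not the actual store shape):

// Illustrative only: approximate shapes for the objects listed above.
const entry = { value: 'NYC', count: 120 };
const column = { name: 'city', entries: [entry] };
const metaColumn = { name: 'city_combined', columns: ['city', 'town'] };
const mapping = { column: 'city', entries: ['NYC', 'nyc'], to: 'New York City' };
const dataset = {
  name: 'permits.csv',
  summary: { rows: 5000 },
  columns: [column],
  metaColumns: [metaColumn],
};
const project = { name: 'My project', datasets: [dataset], mappings: [mapping] };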

Add ability to reject suggestions to better define future ones

Currently, the word embedding system only looks for things near the mean of the current entries in the category selection. It would be great to have another button that explicitly removes a suggestion from the list and incorporates that embedding into the suggestion search in a negative way.
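
A sketch (not the current implementation) of how rejected suggestions could be folded into the scoring, by subtracting similarity to the mean of the rejected embeddings:

// Score candidates against the accepted mean, penalised by the rejected mean.
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);
const cosine = (a, b) =>
  dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)) || 1);
const mean = vectors =>
  vectors[0].map((_, i) => vectors.reduce((s, v) => s + v[i], 0) / vectors.length);

function rankSuggestions(candidates, acceptedEmbeddings, rejectedEmbeddings) {
  const posMean = mean(acceptedEmbeddings);
  const negMean = rejectedEmbeddings.length ? mean(rejectedEmbeddings) : null;
  return candidates
    .map(c => ({
      ...c,
      score: cosine(c.embedding, posMean) - (negMean ? cosine(c.embedding, negMean) : 0),
    }))
    .sort((a, b) => b.score - a.score);
}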

[New Feature]: Building crosswalks using smooshr

Given the existing mapping functionality, I imagine we could extend it to build crosswalks by mapping (one-to-one, one-to-many?) entities from a column in one dataset to entities in a column from another dataset.
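
One possible shape for such a crosswalk, building on the existing mapping idea (all names below are hypothetical):

// Hypothetical crosswalk object: entries from one column linked to one or
// more entries in a column of another dataset.
const crosswalk = {
  from: { dataset: 'permits_2019', column: 'agency' },
  to: { dataset: 'budget_2019', column: 'agency_name' },
  links: [
    { from: 'DOT', to: ['Department of Transportation'] },  // one-to-one
    { from: 'DPR', to: ['Parks', 'Parks and Recreation'] },  // one-to-many
  ],
};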

Move embeddings to a database

Currently, the word embedding dataset we are using is over 3.2 GB of data. It gets loaded into memory by gensim when the server starts up. This takes a while and is not ideal if we want to move this process to a worker, for example, as each worker would need to load the data into memory.

Instead we should look at whether the embeddings can be loaded into a database and queried, or even just a key-value store. This will reduce the overall memory usage of the app and improve load times.

This wouldn't need to do much more than look up a given word and return its vector, seeing as we are doing most of the similarity / clustering client side.

Create Project level abstraction

Create a catch-all project abstraction that can be used to reference multiple data files, with mappings scoped to that entity.

Allow loading of projects

Currently we can save a project but not load it.

This should be pretty simple and will help facilitate sharing on the community server when we have it.
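
A minimal sketch, assuming projects are saved as JSON files (loadProjectIntoStore stands in for whatever context/dispatch update the app uses):

function loadProjectFile(file, loadProjectIntoStore) {
  const reader = new FileReader();
  reader.onload = event => {
    // Parse the saved project JSON and push it into app state
    // (loadProjectIntoStore is a placeholder for the real update).
    const project = JSON.parse(event.target.result);
    loadProjectIntoStore(project);
  };
  reader.readAsText(file);
}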

Add ability to pull datasets from URLs

Currently, all datasets need to be loaded locally. It would be interesting to have datasets be referenceable by a URL instead. This would let us have projects that pull from and tidy open data specifically, and perhaps make those mappings public.
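
Papa Parse can fetch and parse a remote CSV directly with its download option; a rough sketch (routing non-CORS sources through a proxy is an assumption):

import Papa from 'papaparse';

function loadDatasetFromURL(url, onDone) {
  Papa.parse(url, {
    download: true,        // fetch the URL instead of reading a local file
    header: true,
    skipEmptyLines: true,
    complete: results => onDone(results.data, results.meta.fields),
    error: err => console.error('Failed to load dataset from URL', err),
  });
}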

Batch request embeddings from the server for performance improvement

Currently we send a request per unique word to the embedding server to get that word's embedding vector.

The server supports sending multiple words at a time and getting back the results. We should chunk up the requests to make fewer API calls which should make the embedding fetching quicker.

const get_embedings_from_server = entries => {
  let unique_words = new Set();
  entries.forEach(entry => {
    entry.name.split(' ').forEach(word => {
      unique_words.add(word);
    });
  });
  return Promise.all(
    Array.from(unique_words).map(entry =>
      fetch(
        `${
          process.env.REACT_APP_API_URL
        }/embedding/${entry.toLowerCase().replace(/[\W_]+/g, '')}`,
      )
        .then(r => r.json())
        .then(r => r[0]),
    ),
  );
};

This is the function that will need to be modified to run the queries in batches and then correctly assign the results once each batch has returned (a sketch of a batched version follows the considerations below).

Things to consider :

  1. The server might fail if one or more of the words does not have a representation in the corpus. We would need to handle that here:

    smooshr/server/server.py

    Lines 66 to 80 in 8b11ccb

    @app.route('/embedding/<words>')
    def embeding(words):
        conn = get_db()
        try:
            words = words.split(',')
            sql = "select * from embeddings where key in ({seq})".format( seq=','.join(['?']*len(words)))
            result = conn.execute(sql, words)
            result = [ [r[0], r[1].tolist()] for r in result ]
            result = [ {"key": key, "embedding": embed} for key,embed in dict(result).items() ]
            return jsonify(result)
        except:
            return jsonify([])

    if __name__=='__main__':
        print('starting up server')
        app.run(host='0.0.0.0', port=5000, debug=True)

  2. It would also be good to give some feedback on this process that can show in the classification interface, to let a user know how much of the embedding has been loaded.
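
A sketch of a batched version (the chunk size and the onProgress callback are illustrative). The server route already accepts comma-separated words, and because words missing from the corpus simply don't come back, the results are keyed by word rather than by position:

const chunk = (items, size) =>
  Array.from({ length: Math.ceil(items.length / size) }, (_, i) =>
    items.slice(i * size, (i + 1) * size),
  );

const get_embeddings_batched = (entries, onProgress) => {
  const unique_words = Array.from(
    new Set(
      entries.flatMap(entry =>
        entry.name.split(' ').map(w => w.toLowerCase().replace(/[\W_]+/g, '')),
      ),
    ),
  ).filter(w => w.length > 0);

  const batches = chunk(unique_words, 50);
  let completed = 0;
  return Promise.all(
    batches.map(words =>
      fetch(`${process.env.REACT_APP_API_URL}/embedding/${words.join(',')}`)
        .then(r => r.json())
        .then(results => {
          completed += 1;
          if (onProgress) onProgress(completed / batches.length); // consideration 2
          return results; // array of {key, embedding}; missing words are absent (consideration 1)
        }),
    ),
  ).then(batchResults =>
    Object.fromEntries(batchResults.flat().map(({ key, embedding }) => [key, embedding])),
  );
};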

Investigate different models for describing an analysis flow using a DAG or similar structure.

We currently only have two types of operation in smooshr:

  1. Combine columns together
  2. Create a taxonomy for a given column

In the future we would like to have more steps, for example:

  • Extract part of a column as a new column, for example pulling "Some City" out of an address like "23 Some Street, Some City, US, 11221"
  • Standardize a time column
  • Merge the contents of two columns together to form a new column
  • Do entity matching on a given column
  • etc

Some of these steps will have dependencies on previous steps that are hard to predict at run time. It would be great to have each individual transform be defined as a node in a graph, with dependencies linked by edges. Essentially a DAG.

This would inform both the UI and the Python code that is ultimately spit out by the tool.
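
A rough sketch of what this could look like (node and edge shapes are illustrative); a topological sort over the graph would give the order in which to emit the generated Python:

// Illustrative node/edge shapes, not a proposed final format.
const nodes = {
  load_permits: { op: 'load_dataset', params: { name: 'permits.csv' } },
  extract_city: { op: 'extract_part', params: { column: 'address', into: 'city' } },
  map_city: { op: 'apply_mapping', params: { column: 'city', mapping: 'city_map' } },
};
const edges = [
  ['load_permits', 'extract_city'],
  ['extract_city', 'map_city'],
];

// Kahn's algorithm: repeatedly take nodes with no remaining dependencies.
function topoSort(nodes, edges) {
  const inDegree = Object.fromEntries(Object.keys(nodes).map(id => [id, 0]));
  edges.forEach(([, to]) => inDegree[to]++);
  const queue = Object.keys(nodes).filter(id => inDegree[id] === 0);
  const order = [];
  while (queue.length) {
    const id = queue.shift();
    order.push(id);
    edges
      .filter(([from]) => from === id)
      .forEach(([, to]) => {
        if (--inDegree[to] === 0) queue.push(to);
      });
  }
  return order; // ['load_permits', 'extract_city', 'map_city']
}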

Some links to projects that might be worth looking at

Make drop zone bigger

Currently the drop zone for files is just the text; expand this to use the entire box.

Set up an initial clustering guess

Attempt to use the word embeddings to generate a starting point for possible categories.

Open questions here:

  • Can we prompt users to suggest category starting points, basically seeds for the clustering algorithm? (A seeded sketch follows this list.)

  • How do we define a catch-all category that picks out everything that is nothing like anything else?

  • How do we select the number of clusters? Do we try to do that automatically? If not, how do we inform the user?

  • Can clustering be done interactively on the remaining categories?
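
A minimal sketch of a seeded first guess (the similarity threshold and the catch-all bucket are assumptions): assign each entry to the nearest seed category by cosine similarity, or to "other" if nothing is close.

const cosine = (a, b) => {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  return dot / (Math.hypot(...a) * Math.hypot(...b) || 1);
};

function initialClusters(entries, seeds, threshold = 0.4) {
  const clusters = Object.fromEntries(seeds.map(s => [s.name, []]));
  clusters.other = []; // catch-all for entries unlike anything else
  entries.forEach(entry => {
    let best = { name: 'other', score: -Infinity };
    seeds.forEach(seed => {
      const score = cosine(entry.embedding, seed.embedding);
      if (score > best.score) best = { name: seed.name, score };
    });
    clusters[best.score >= threshold ? best.name : 'other'].push(entry);
  });
  return clusters;
}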

Create concept of a meta-column

Currently we don't have a way to combine columns from multiple datasets. We need to create some kind of meta-column entity that can reference columns from multiple datasets as the same column in the rest of the app, combining entries from each, for example.

Open questions here:

  • Do meta columns support concatenation of columns within a dataset? Or extraction of data from those columns?

  • Even if two columns from separate files are selected as the same, are there mappings that need to be file specific for each of them? I don't think so, but we need to check this.
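
One way a meta-column could combine entries as described above: union the entries of its member columns and sum the counts of values that appear in more than one file (a sketch, not a proposed design):

function buildMetaColumnEntries(columns) {
  const counts = new Map();
  columns.forEach(column =>
    column.entries.forEach(({ value, count }) =>
      counts.set(value, (counts.get(value) || 0) + count),
    ),
  );
  // Back to the Entry shape used elsewhere: { value, count }.
  return Array.from(counts, ([value, count]) => ({ value, count }));
}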

Explore undo functionality

Explore how to do this using the React context API. Time travel like this is doable with redux; not sure about the context API.
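
The usual past/present/future pattern should work with useReducer plus a Context provider just as it does with redux; a sketch (not tied to the current state shape):

// Wrap any reducer so UNDO/REDO walk a history of previous states.
function undoable(reducer) {
  const initial = { past: [], present: reducer(undefined, { type: '@@INIT' }), future: [] };
  return (state = initial, action) => {
    const { past, present, future } = state;
    switch (action.type) {
      case 'UNDO':
        if (!past.length) return state;
        return {
          past: past.slice(0, -1),
          present: past[past.length - 1],
          future: [present, ...future],
        };
      case 'REDO':
        if (!future.length) return state;
        return { past: [...past, present], present: future[0], future: future.slice(1) };
      default: {
        const next = reducer(present, action);
        if (next === present) return state;
        return { past: [...past, present], present: next, future: [] };
      }
    }
  };
}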

Change the server for embeddings to sqlite

Making the embeddings more portable will make deployment easier. Currently we are using PostgreSQL, which is perhaps more than we need. Moving to SQLite might make more sense here, as we could simply download the .sqlite file and run the server to get going.

Simplify docker setup for backend

Currently we have a Flask app + Redis + Celery and workers. This was because we anticipated doing much more of this work on the backend, but seeing as we aren't, we should remove these dependencies.

Investigate different ways of storing the data smooshr is using

Currently, smooshr uses in-memory storage to represent a dataset while users are working on it.

As we move to more sophisticated analysis, we might need to rethink how we do this in a more efficient way.

Some options are

  • Using IndexedDB, the browser's built-in database system, probably through a library like Dexie. Note we currently use IndexedDB as dumb offline storage, but this would move it to a more structured database (see the sketch after this list).

  • Using sql.js, a compiled version of SQLite that runs in WebAssembly and provides basically native performance in the browser. We would still need to figure out how to store the SQLite database offline, but this could give us a really nice, flexible interface (SQL) for performing operations on the datasets.

  • Something else? The File System Access API is worth keeping an eye on (https://web.dev/file-system-access/), as it would let us save and read projects in a similar way to a native app.
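
A sketch of the IndexedDB-via-Dexie option mentioned above (table and index names are illustrative, not a proposed final schema):

import Dexie from 'dexie';

const db = new Dexie('smooshr');
db.version(1).stores({
  datasets: '++id, name',
  entries: '++id, columnID, value', // indexed on columnID and value
});

// Example query: all unique entries recorded for a given column.
async function entriesForColumn(columnID) {
  return db.entries.where('columnID').equals(columnID).toArray();
}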

Explore moving to typescript

As this becomes more complicated, it might be worth investing the time to move the project to TypeScript. Some basic type checking might help as we grow the project.

Set up CI

Once we know where the server is going to run, we will need to set up CI to be able to deploy it easily.

GitHub seems to have a new Actions service. Check this out for the deploy!

Either disable multiple file uploads at the same time or fix multiple file uploads

Currently the file loader is a little stuck between working OK with one dataset while still allowing multiple to be uploaded, and properly supporting multiple files.

We should be able to support multiple uploads pretty well, given that the CSV parser spawns multiple web workers to do the parsing, so uploading two files shouldn't affect performance too much.
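
A sketch of how multiple dropped files could be parsed in parallel, each in its own Papa Parse web worker:

import Papa from 'papaparse';

function parseFiles(files) {
  return Promise.all(
    Array.from(files).map(
      file =>
        new Promise((resolve, reject) => {
          Papa.parse(file, {
            header: true,
            worker: true, // parse off the main thread
            skipEmptyLines: true,
            complete: results => resolve({ file, results }),
            error: reject,
          });
        }),
    ),
  );
}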

Cull unused code

There is a bunch of unused code that we should purge before launch.

Explore collaborative work using a P2P system.

Currently we have no way to collaborate on a taxonomy between multiple machines. We don't want to have a centralized store of the data, so if we are to implement this, we probably want to use a P2P system.

Explore distributing state using OrbitDB or something similar.

Add progress bar to file loading

We can probably add a progress bar to the file loading modal by monitoring how many bytes we have read in already. This would make for a nicer experience, as you could see how much of the file is left to load.
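
A sketch of one way to do this: Papa Parse's step callback reports the byte offset it has reached in results.meta.cursor, which can be compared against file.size (the onProgress and onDone callbacks are illustrative):

import Papa from 'papaparse';

function parseWithProgress(file, onProgress, onDone) {
  const rows = [];
  Papa.parse(file, {
    header: true,
    step: results => {
      rows.push(results.data);
      onProgress(results.meta.cursor / file.size); // fraction read so far
    },
    complete: () => onDone(rows),
  });
}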
