thanatoz-1 / text-technology

Project for Text technology

License: Apache License 2.0

Python 5.13% Shell 0.01% Jupyter Notebook 93.88% HTML 0.94% JavaScript 0.04%

text-technology's Issues

Keyword augmentation

Augment the keywords with all-cap words and cluster representatives. Further post-processing is also needed, such as filtering out keywords that are too short or too long.

todo: clean and update KeywordCleaner class
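The augmentation and filtering steps described above could be sketched roughly as follows. This is not the project's KeywordCleaner class: the function names, the length thresholds, and the shape of the cluster mapping (a plain dict from keyword to representative word) are all assumptions made for illustration.

```python
import re

# Hypothetical length bounds; the issue does not specify thresholds.
MIN_LEN = 2
MAX_LEN = 40

# Tokens made of two or more capital letters, e.g. acronyms.
ALL_CAP_RE = re.compile(r"\b[A-Z]{2,}\b")


def find_all_cap_words(text):
    """Return the distinct all-cap tokens in order of first appearance."""
    seen = []
    for token in ALL_CAP_RE.findall(text):
        if token not in seen:
            seen.append(token)
    return seen


def filter_keywords(keywords, min_len=MIN_LEN, max_len=MAX_LEN):
    """Drop keywords whose stripped length falls outside [min_len, max_len]."""
    kept = []
    for kw in keywords:
        kw = kw.strip()
        if min_len <= len(kw) <= max_len:
            kept.append(kw)
    return kept


def augment_keywords(keywords, text, cluster_map):
    """Add all-cap words and cluster representatives, then filter by length."""
    augmented = list(keywords)
    augmented.extend(find_all_cap_words(text))
    augmented.extend(cluster_map[kw] for kw in keywords if kw in cluster_map)
    # Deduplicate while preserving order.
    unique = list(dict.fromkeys(augmented))
    return filter_keywords(unique)
```

For example, calling `augment_keywords(["speech recognition", "a"], "We train an LSTM for ASR.", {"speech recognition": "speech"})` keeps the original long keyword, adds the two acronyms and the representative word, and drops the one-letter keyword.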

Doc: prepare example files

Prepare the final files

collect part outputs:
- papers.xml
process part outputs:
- augmented_papers.xml and json files
- all_cap.dict
- cluster.dict
access part:
- sql data examples

Enhancement: Show both url and keywords of the paper

input validation

views.py: validate the year range
fetch_display_given_keywords: show an alert that reminds users when some of their keywords don't exist
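The two checks above might look something like the sketch below. It is framework-independent plain Python rather than the actual code in views.py; the function names and the default year bounds are hypothetical.

```python
def validate_year_range(start, end, min_year=1900, max_year=2100):
    """Parse a year range and return (start, end) as ints, or raise ValueError.

    min_year and max_year are assumed bounds, not values from the project.
    """
    try:
        start_year, end_year = int(start), int(end)
    except (TypeError, ValueError):
        raise ValueError("years must be integers")
    for year in (start_year, end_year):
        if not min_year <= year <= max_year:
            raise ValueError(f"year {year} is outside {min_year}-{max_year}")
    if start_year > end_year:
        raise ValueError("start year must not be after end year")
    return start_year, end_year


def split_known_keywords(requested, known):
    """Split requested keywords into (found, missing) against known keywords."""
    found = [kw for kw in requested if kw in known]
    missing = [kw for kw in requested if kw not in known]
    return found, missing
```

A view could pass the `missing` list to the template so that the alert names exactly the keywords that were not found.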

Keyword extraction: cluster representative words

Issues with the old methods

the most frequent word can't represent the cluster well
the cluster itself could be problematic
- e.g. very large L2 distance standard deviation, up to 60
- lacking domain knowledge: it categorizes "student teacher methods" as "human"

todo: clean and upload the following code

weighted Levenshtein distance based kmeans jupyter notebook
code for building the index-term-to-cluster-representative-word mapper
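Until the notebook is uploaded, here is a minimal sketch of a weighted Levenshtein distance, the kind of measure the clustering above is based on. The operation weights are placeholders, since the issue does not say which weights are used. The `cluster_medoid` helper is a swapped-in illustration of one possible way to pick a representative (the term closest to all others), not the method from the notebook.

```python
def weighted_levenshtein(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Edit distance between strings a and b with per-operation weights."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                             # delete ca
                curr[j - 1] + ins_cost,                         # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cluster_medoid(terms, distance=weighted_levenshtein):
    """Return the term with the smallest total distance to all terms."""
    return min(terms, key=lambda t: sum(distance(t, other) for other in terms))
```

With unit weights this reduces to the ordinary Levenshtein distance. A medoid is always an actual member of the cluster, which may or may not suit the index-term-to-representative mapper.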

Baseline api

A baseline API is needed that meets the following requirements:

a Flask REST API that simply emits data to the console for now
settings manageable from the config.ini file
modular choice of database
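The configuration and database requirements could be approached along the lines of the sketch below, which uses only the standard-library `configparser`. The Flask routes are left out so the example stays self-contained. The section names, option names, and backend classes are assumptions, not the project's actual layout.

```python
import configparser

# Hypothetical config.ini contents; section and option names are assumptions.
EXAMPLE_CONFIG = """
[api]
host = 127.0.0.1
port = 5000

[database]
backend = sqlite
name = papers.db
"""


class SqliteBackend:
    """Placeholder backend; a real one would open a connection."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"sqlite:{self.name}"


class PostgresBackend:
    """Second placeholder backend, to show that the choice is pluggable."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"postgres:{self.name}"


# Registry that maps the config value to a backend class.
BACKENDS = {"sqlite": SqliteBackend, "postgres": PostgresBackend}


def load_settings(text):
    """Parse INI-formatted text into a ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def make_backend(settings):
    """Instantiate the backend named in the [database] section."""
    section = settings["database"]
    backend_cls = BACKENDS[section["backend"]]  # KeyError for unknown backends
    return backend_cls(section["name"])
```

In the real service, `parser.read("config.ini")` would replace `read_string`, and each request handler would use the object returned by `make_backend` instead of importing a specific database module.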

refactor to_json.py

Autocomplete features in existing bar

link

Enable users to query the change curves of their given keywords

New feature
Location: Keywords page
In the "keywords" form, if the input is "LSTM;ASR", the result shows only the change curves for "LSTM" and "ASR".

A naive method: loop over each keyword in Django and query the database for how many related papers were published during the given range.
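The naive method can be sketched as follows. To keep the example runnable on its own, an in-memory list stands in for the database, and the per-keyword count stands in for whatever Django query the view would issue. The record layout and function names are hypothetical.

```python
from collections import Counter

# Hypothetical in-memory records standing in for the paper table.
PAPERS = [
    {"year": 2016, "keywords": {"LSTM", "ASR"}},
    {"year": 2017, "keywords": {"LSTM"}},
    {"year": 2017, "keywords": {"ASR", "CNN"}},
    {"year": 2019, "keywords": {"LSTM"}},
]


def parse_keywords(raw):
    """Split a ';'-separated form value such as "LSTM;ASR" into keywords."""
    return [part.strip() for part in raw.split(";") if part.strip()]


def change_curves(raw_keywords, start_year, end_year, papers=PAPERS):
    """Count papers per year for each keyword within [start_year, end_year]."""
    curves = {}
    for keyword in parse_keywords(raw_keywords):
        counts = Counter(
            p["year"]
            for p in papers
            if keyword in p["keywords"] and start_year <= p["year"] <= end_year
        )
        # One entry per year in the range, including years with no papers.
        curves[keyword] = [counts[year] for year in range(start_year, end_year + 1)]
    return curves
```

Each curve is a list with one count per year, so the front end can plot it directly against the requested range. Looping keyword by keyword costs one query per keyword, which is acceptable for short inputs.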

Document python files

Structure:

```
Inputs:
---
Argument name: a description of that argument

Outputs:
---
i.e. return

Example:
---
inputs:...
outputs:...
```
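As an illustration of the structure above, a documented function might look like this. The function itself is hypothetical and is not taken from any of the project's files.

```python
def count_keyword_papers(papers, keyword):
    """Count the papers tagged with a given keyword.

    Inputs:
    ---
    papers: a list of dicts, each with a "keywords" list
    keyword: the keyword to look for

    Outputs:
    ---
    The number of papers whose "keywords" list contains keyword

    Example:
    ---
    inputs: [{"keywords": ["LSTM"]}, {"keywords": ["ASR"]}], "LSTM"
    outputs: 1
    """
    return sum(1 for paper in papers if keyword in paper["keywords"])
```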

collect: main.py, xml_loader.py, info_extractor.py, converter.py
process: augment and to_json.py