thanatoz-1 / text-technology

Project for Text technology

License: Apache License 2.0

Python 5.13% Shell 0.01% Jupyter Notebook 93.88% HTML 0.94% JavaScript 0.04%

text-technology's People

Contributors

nwhal, thanatoz-1, xiaoyixuan

Forkers

xiaoyixuan nwhal

text-technology's Issues

Keyword augmentation

Augment the keywords with all-caps words and cluster representatives. Further post-processing is also needed, such as filtering out keywords that are too short or too long.

todo: clean and update KeywordCleaner class
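
The post-processing described in this issue could look roughly like the sketch below. It is not the project's KeywordCleaner class. The function names, the regular expression, and the length thresholds are all assumptions chosen for illustration.

```python
import re

# Hypothetical thresholds; the issue does not say what counts as too short or too long.
MIN_LEN = 2
MAX_LEN = 40

# Assumption: an "all-caps word" is a run of two or more ASCII capitals, e.g. an acronym.
ALL_CAP_RE = re.compile(r"\b[A-Z]{2,}\b")


def find_all_cap_words(text):
    """Return the distinct all-caps tokens in order of first appearance."""
    seen = []
    for word in ALL_CAP_RE.findall(text):
        if word not in seen:
            seen.append(word)
    return seen


def filter_keywords(keywords, min_len=MIN_LEN, max_len=MAX_LEN):
    """Drop keywords whose stripped length falls outside [min_len, max_len]."""
    kept = []
    for kw in keywords:
        kw = kw.strip()
        if min_len <= len(kw) <= max_len:
            kept.append(kw)
    return kept
```

Character length is only one possible measure. Counting tokens instead may work better for multi-word index terms.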

Doc: prepare example files

Prepare the final files

  • collect part outputs:
    • papers.xml
  • process part outputs:
    • augmented_papers.xml and json files
    • all_cap.dict
    • cluster.dict
  • access
    • sql data examples

input validation

  • views.py: validate the year range
  • fetch_display_given_keywords: show an alert reminding users that some of the given keywords don't exist
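
A minimal sketch of the two checks listed above, written as plain functions so they can be called from a view. The function names, the default year bounds, and the case-insensitive matching are assumptions, not taken from views.py.

```python
def validate_year_range(start, end, min_year=1900, max_year=2100):
    """Parse and check a year range; raise ValueError with a user-facing message."""
    try:
        start, end = int(start), int(end)
    except (TypeError, ValueError):
        raise ValueError("Years must be integers.")
    if not (min_year <= start <= max_year and min_year <= end <= max_year):
        raise ValueError(f"Years must lie between {min_year} and {max_year}.")
    if start > end:
        raise ValueError("The start year must not be later than the end year.")
    return start, end


def split_known_keywords(requested, known):
    """Split requested keywords into those present in `known` and those missing."""
    known_lower = {k.lower() for k in known}  # assumption: matching ignores case
    found, missing = [], []
    for kw in requested:
        (found if kw.lower() in known_lower else missing).append(kw)
    return found, missing
```

The `missing` list is what the alert would report to the user, while the query proceeds with `found` only.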

Keyword extraction: cluster representative words

Issues with the old methods

  • the most frequent word can't represent the cluster well
  • the cluster itself could be problematic
    • e.g. a very large standard deviation of the L2 distance, up to 60
    • lacking domain knowledge: it categorizes "student teacher methods" as "human"

todo: clean and upload the following code

  • weighted Levenshtein distance based kmeans jupyter notebook
  • code for building the index-term-to-cluster-representative-word mapper
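
The notebook itself is not part of this issue, so the following is only a sketch of what a weighted Levenshtein distance could look like. It assumes "weighted" means separate costs for insertion, deletion, and substitution. The medoid helper is one possible way to choose a cluster representative and is not confirmed as the method used in the project.

```python
def weighted_levenshtein(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Edit distance between strings a and b with per-operation costs."""
    # prev[j] is the cost of turning the previously processed prefix of a into b[:j].
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * del_cost]  # turning a[:i] into the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                              # delete ca
                curr[j - 1] + ins_cost,                          # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]


def pick_medoid(terms, distance=weighted_levenshtein):
    """Return the term with the smallest total distance to all other terms."""
    return min(terms, key=lambda t: sum(distance(t, other) for other in terms))
```

Unlike the most frequent word, a medoid is by construction close to every member of its cluster, which addresses the first problem listed above but not a badly formed cluster.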

Baseline api

A baseline API is needed that meets the following requirements:

  • a Flask REST API that simply prints data to the console for now
  • settings manageable from the config.ini file
  • modular choice of database
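
The second and third requirements can be sketched with the standard library alone. The registry, the `[database]` section, and the `backend` key are hypothetical names, and the Flask layer is left out so the example stays self-contained. A route handler would simply call `backend.save(...)`.

```python
import configparser

# Hypothetical registry of database backends, keyed by the name used in config.ini.
BACKENDS = {}


def register_backend(name):
    """Class decorator that makes a backend selectable by name."""
    def decorator(cls):
        BACKENDS[name] = cls
        return cls
    return decorator


@register_backend("console")
class ConsoleBackend:
    """Placeholder backend that only prints records to the console."""

    def __init__(self, **options):
        self.options = options

    def save(self, record):
        print(f"[console] {record}")
        return record


def load_backend(config_text):
    """Build the backend named in the [database] section of an INI document."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = dict(parser["database"])
    name = section.pop("backend")
    if name not in BACKENDS:
        raise KeyError(f"Unknown database backend: {name}")
    # Any remaining keys in the section are passed to the backend as options.
    return BACKENDS[name](**section)


# Assumed layout of config.ini; the real file may differ.
EXAMPLE_CONFIG = """
[database]
backend = console
"""
```

Adding a real database later would mean registering another class under a new name and changing one line in config.ini, without touching the API code.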

Enable users to query the change curves of their given keywords

New feature
Location: Keywords page
In the "keywords" form, if the input is "LSTM;ASR", the result shows only the change curves for "LSTM" and "ASR".

  • A naive method: loop over each keyword in Django and query the database for how many related papers were published in the given year range
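
The naive method can be illustrated as follows. This sketch uses sqlite3 directly rather than the Django ORM, and the `paper` and `paper_keyword` tables are a hypothetical schema invented for the example.

```python
import sqlite3


def keyword_curves(conn, keywords_field, start_year, end_year):
    """Return {keyword: {year: paper_count}} for a ';'-separated keyword string."""
    keywords = [k.strip() for k in keywords_field.split(";") if k.strip()]
    curves = {}
    for kw in keywords:  # the naive approach: one query per keyword
        rows = conn.execute(
            "SELECT p.year, COUNT(*) FROM paper p "
            "JOIN paper_keyword pk ON pk.paper_id = p.id "
            "WHERE pk.keyword = ? AND p.year BETWEEN ? AND ? "
            "GROUP BY p.year ORDER BY p.year",
            (kw, start_year, end_year),
        ).fetchall()
        curves[kw] = dict(rows)
    return curves


def demo_connection():
    """Create an in-memory database with a few made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE paper (id INTEGER PRIMARY KEY, year INTEGER);
        CREATE TABLE paper_keyword (paper_id INTEGER, keyword TEXT);
        INSERT INTO paper VALUES (1, 2015), (2, 2016), (3, 2016);
        INSERT INTO paper_keyword VALUES
            (1, 'LSTM'), (2, 'LSTM'), (3, 'LSTM'), (3, 'ASR');
    """)
    return conn
```

Calling `keyword_curves(demo_connection(), "LSTM;ASR", 2015, 2016)` returns one year-to-count mapping per keyword. Years with no matching papers are absent from the mapping, so the plotting code would need to fill them with zero.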

Document python files

Structure:

```
Inputs:
---
Argument name: a description of that argument

Outputs:
---
i.e. return

Example:
---
inputs:...
outputs:...
```
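
Applied to a function, the structure might read as follows. The function `count_papers_by_year` is a made-up example and does not come from the repository.

```python
def count_papers_by_year(years):
    """Count how many papers were published in each year.

    Inputs:
    ---
    years: an iterable of publication years (int), one entry per paper

    Outputs:
    ---
    A dict mapping each year to the number of papers published in it

    Example:
    ---
    inputs: [2015, 2016, 2016]
    outputs: {2015: 1, 2016: 2}
    """
    counts = {}
    for year in years:
        counts[year] = counts.get(year, 0) + 1
    return counts
```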

collect: main.py, xml_loader.py, info_extractor.py, converter.py
process: augment and to_json.py

review comments

  • review the views.py code comments
  • briefly comment the rest of the code
  • bug fix: the PDF parser description in "How PDF parser work.md" is incorrect

Doc: Update Readme

  • Keyword extraction: change the workflow figure
  • Access Readme: update the screenshots
  • add database schema to the main readme
  • directly show links to example outputs on the main readme
  • extension, update the table and the main readme

Other checks, tbc

a simple prototype

A prototype to test the basic features, e.g. list keywords over time, list the institute names and their papers

Refactor pdf parser

Issues with the old one

  • failed to load many PDFs (only about 7.4k out of 9989 were processed successfully)
  • incorrectly categorized some abstract content as authors or index terms

todo: clean and update code

Create a new README

Create a new README describing the project ideas. Please stick to the Markdown format.

clean all code

  • remove unused functions
  • remove useless todos
  • remove debug code
  • better default values
  • format the code

converter

convert query results to xml files
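
One way to do this with the standard library is sketched below. It assumes each query result row is available as a dict whose keys are valid XML tag names. The function name and the default tag names are hypothetical.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, root_tag="papers", row_tag="paper"):
    """Serialize a list of dict rows as an XML string, one element per row."""
    root = ET.Element(root_tag)
    for row in rows:
        item = ET.SubElement(root, row_tag)
        for column, value in row.items():
            child = ET.SubElement(item, column)
            # ElementTree escapes special characters such as "&" on output.
            child.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")
```

To write a file instead of returning a string, the root element can be wrapped in `ET.ElementTree(root)` and saved with its `write` method.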
