anlp's People

Contributors

nithincshekar, samualkrish, summerlight

anlp's Issues

Collect dataset

We need a script that generates a dataset for our experiments. Our current dataset comes from the ALTA-2010 Shared Task. If we need more languages, more annotation, shorter texts, or anything else, we must be able to generate a similar dataset ourselves.

Steps needed:

  1. Download Wikipedia dumps. Each dump is named in the form xxwiki, where xx is the language code. We only need the "current versions only" dumps.
  2. Extract only the text using wikiextractor.
  3. Apply the methodology of the paper. Interlanguage links can be obtained from the corresponding wiki page: parse it with BeautifulSoup4 and find the tags with the class "interlanguage-link".
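
The link extraction in step 3 could look roughly like the sketch below. To stay dependency-free it uses the standard library's html.parser instead of BeautifulSoup4; with BeautifulSoup4 the same selection would be something like soup.select("li.interlanguage-link a"). The sample HTML, the class and function names, and the assumption that each link carries its language code in a lang or hreflang attribute are all illustrative and should be checked against a real wiki page.

```python
from html.parser import HTMLParser


class InterlanguageLinkParser(HTMLParser):
    """Collect (language code, URL) pairs from <li class="interlanguage-link ..."> items."""

    def __init__(self):
        super().__init__()
        self.links = []            # collected (lang, href) pairs
        self._inside_item = False  # True while inside an interlanguage <li>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "interlanguage-link" in classes:
            self._inside_item = True
        elif tag == "a" and self._inside_item and attrs.get("href"):
            # Assumption: the anchor carries the language code in `lang` or `hreflang`.
            lang = attrs.get("lang") or attrs.get("hreflang") or ""
            self.links.append((lang, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside_item = False


def extract_interlanguage_links(html):
    """Return all interlanguage links found in an HTML string."""
    parser = InterlanguageLinkParser()
    parser.feed(html)
    return parser.links


# Hand-written sample that imitates the sidebar markup; not copied from a real page.
SAMPLE_HTML = """
<ul>
  <li class="interlanguage-link interwiki-de"><a href="https://de.wikipedia.org/wiki/Sprache" lang="de" hreflang="de">Deutsch</a></li>
  <li class="interlanguage-link interwiki-fr"><a href="https://fr.wikipedia.org/wiki/Langue" lang="fr" hreflang="fr">Francais</a></li>
  <li class="other"><a href="/wiki/Main_Page">Main page</a></li>
</ul>
"""

if __name__ == "__main__":
    for lang, href in extract_interlanguage_links(SAMPLE_HTML):
        print(lang, href)
```

Only anchors inside an "interlanguage-link" list item are kept, so ordinary navigation links on the same page are ignored.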

Decide a project topic, details and roles

We'll decide (or at least narrow down) a project topic tomorrow. After the meeting, we should be prepared to answer the questions on the corresponding wiki page.

Also, we need to decide each member's role in this project. Please choose the role you would like to take from the list below. (All members should be able to edit this issue; if you can't, please let me know.)

Write a proposal.

Write a proposal before Tuesday 23:59.

Currently I am writing a proposal based on multi-lingual language identification.

Write a topic evaluation.

At the last meeting, the selected project topics were assigned to members. Each evaluation should answer the questions below:

  1. What is the use of this application? Has any research already been done on it?
  2. Is there any dataset available?
  3. Briefly, what steps and procedures might be needed to build the application?

These questions are essentially the gist of the most important part of the corresponding wiki page, so it is good to keep that page's more detailed questions in mind while writing a topic evaluation.

Partitioning similar language sets.

Our research needs to detect similar languages and partition the whole language set into separate sets accordingly. At the first stage, a full-fledged LID is not needed; just make a fake detector that can "simulate" language detection results.
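
One possible shape for such a fake detector and the partitioning step is sketched below. Everything in it is an assumption made for illustration: the CONFUSABLE table, the error rate, the threshold, and the choice to merge languages with union-find whenever the simulated confusion rate in either direction exceeds the threshold. It is not a method taken from this issue or from any paper.

```python
import random
from collections import Counter, defaultdict

# Hypothetical table of languages the fake detector is allowed to mix up.
CONFUSABLE = {
    "id": ["ms"], "ms": ["id"],
    "hr": ["sr", "bs"], "sr": ["hr", "bs"], "bs": ["hr", "sr"],
}


def fake_detect(true_lang, rng, error_rate=0.3):
    """Return the true label, or a confusable one with probability error_rate."""
    candidates = CONFUSABLE.get(true_lang, [])
    if candidates and rng.random() < error_rate:
        return rng.choice(candidates)
    return true_lang


def confusion_counts(languages, rng, trials=200):
    """Run the fake detector `trials` times per language and count (true, predicted) pairs."""
    counts = Counter()
    for lang in languages:
        for _ in range(trials):
            counts[(lang, fake_detect(lang, rng))] += 1
    return counts


def partition(languages, counts, trials=200, threshold=0.05):
    """Group languages that are confused, in either direction, more often than threshold."""
    parent = {lang: lang for lang in languages}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (true_lang, predicted), n in counts.items():
        if true_lang != predicted and n / trials > threshold:
            parent[find(true_lang)] = find(predicted)

    groups = defaultdict(set)
    for lang in languages:
        groups[find(lang)].add(lang)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulation is repeatable
    langs = ["en", "id", "ms", "hr", "sr", "bs"]
    print(partition(langs, confusion_counts(langs, rng)))
```

Once a real identifier exists, only fake_detect needs to be swapped out; the counting and partitioning code works on any detector that maps a true label to a predicted one.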

Implement basic LID schemes

We want to implement (very) basic LID schemes with a CRF or a structured SVM. Then we can look at the results and find out whether they can be improved. We'll use PyStruct for this purpose. At the first stage, we don't need a full dataset. Just make a small development set by hand (about 50 examples would suffice) and develop an identifier.

Before developing identifiers, please study the topic and learn how to use the library idiomatically. Fixing bugs in legacy code is much harder than writing new code from scratch, especially for those who are not the code owners.
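
As a starting point, a hand-made development set could be turned into per-token feature sequences along the lines of the sketch below. It uses only the standard library and stops before any model is trained. The hashed character-trigram features, the feature-space size, and the two-language sample are assumptions chosen for illustration. A chain model such as PyStruct's ChainCRF is expected to take one feature matrix per sequence (one row per token) plus an integer label vector, so the lists produced here would still need converting to NumPy arrays; please verify that against the PyStruct documentation.

```python
import zlib

N_FEATURES = 64  # hypothetical size of the hashed feature space


def char_ngrams(token, n=3):
    """Return the character n-grams of a token padded with one space on each side."""
    padded = f" {token.lower()} "
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]


def token_features(token, n_features=N_FEATURES):
    """Hash a token's character trigrams into a fixed-length count vector."""
    vec = [0.0] * n_features
    for gram in char_ngrams(token):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        vec[zlib.crc32(gram.encode("utf-8")) % n_features] += 1.0
    return vec


def to_sequence(labelled_tokens, label_ids):
    """Convert [(token, language), ...] into (feature rows, integer labels)."""
    X = [token_features(token) for token, _ in labelled_tokens]
    y = [label_ids[lang] for _, lang in labelled_tokens]
    return X, y


# Tiny hand-labelled example; a real development set would hold about 50 such sequences.
DEV_SET = [
    [("the", "en"), ("cat", "en"), ("und", "de"), ("der", "de"), ("Hund", "de")],
]
LABEL_IDS = {"en": 0, "de": 1}

if __name__ == "__main__":
    X, y = to_sequence(DEV_SET[0], LABEL_IDS)
    print(len(X), len(X[0]), y)
```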
