summerlight / anlp

Applied Natural Language Processing project

License: Apache License 2.0

Python 93.52% Shell 0.52% Perl 5.95%

anlp's Issues

Choose the project topic

We need to choose the project topic. Please refer to the candidate topics page.

Collect dataset

We need a script to generate a dataset for experiments. Our current dataset is from the ALTA-2010 Shared Task. If we need more languages, different annotation, shorter texts, or anything else, we should be able to generate a similar dataset.

Steps needed:

Download Wikipedia dumps. The dumps are named in the format xxwiki, where xx is the language code. All we need here are the "current versions only" dumps.
Extract only the text using wikiextractor.
Apply the methodology of the paper. You can easily get interlanguage links from the corresponding wiki page (using the BeautifulSoup4 library, find tags with the class "interlanguage-link").
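
The interlanguage-link lookup in the last step could be sketched roughly as follows. This is only a minimal sketch: it uses Python's standard-library html.parser in place of BeautifulSoup4 so that it runs without extra dependencies, and the names InterlanguageLinkParser, extract_interlanguage_links and SAMPLE_HTML are made up for illustration. It also assumes that each element with the class "interlanguage-link" wraps one anchor carrying href and lang attributes, which should be checked against real pages.

```python
from html.parser import HTMLParser


class InterlanguageLinkParser(HTMLParser):
    """Collect (language, URL) pairs from elements marked "interlanguage-link"."""

    def __init__(self):
        super().__init__()
        self.links = []        # collected (lang, href) tuples
        self._inside = False   # True while inside an interlanguage-link element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "interlanguage-link" in classes:
            # Assumption: the link itself is the first <a> inside this element.
            self._inside = True
        elif self._inside and tag == "a" and attrs.get("href"):
            self.links.append((attrs.get("lang"), attrs["href"]))
            self._inside = False


def extract_interlanguage_links(html):
    """Return a list of (lang, href) pairs found in the given HTML string."""
    parser = InterlanguageLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical page fragment, written by hand for this example.
SAMPLE_HTML = """
<ul>
  <li class="interlanguage-link interwiki-de"><a href="https://de.wikipedia.org/wiki/Sprache" lang="de">Deutsch</a></li>
  <li class="interlanguage-link interwiki-it"><a href="https://it.wikipedia.org/wiki/Lingua" lang="it">Italiano</a></li>
  <li class="other"><a href="/wiki/Help">Help</a></li>
</ul>
"""

if __name__ == "__main__":
    for lang, href in extract_interlanguage_links(SAMPLE_HTML):
        print(lang, href)
```

With BeautifulSoup4, the same lookup would presumably be a search for tags with the class "interlanguage-link" followed by reading the anchor inside each one.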

Decide a project topic, details and roles

We'll decide (or at least narrow down) a project topic tomorrow. After the meeting, we should be prepared to answer the questions on the corresponding wiki page.

Also, we need to decide each member's role for this project. Please choose the role you're hoping to take from the list below. (All members should be able to edit this issue; if you can't, please let me know.)

@summerlight
- Theory
- Coding
- Data
- Writing
@mytony
- Theory
- Coding
- Data
- Writing
@nithincshekar
- Theory
- Coding
- Data
- Writing
@Samualkrish
- Theory
- Coding
- Data
- Writing

Write a proposal.

Write a proposal before Tuesday 23:59.

Currently I am writing a proposal based on multi-lingual language identification.

Write a topic evaluation.

At the last meeting, several selected project topics were assigned to each member. The evaluation text should answer the questions below:

What is the use of this application? Has any research already been done on it?
Is there any dataset available?
Briefly, what steps and procedures might be needed to build the application?

This set of questions is basically the gist of the most important parts of the corresponding wiki page, so it is good to keep those detailed questions in mind while writing a topic evaluation.

Team name

Let's talk about our team name.

Partitioning similar language sets.

Our research needs to detect similar languages and partition the whole language set into separate sets accordingly. At the first stage, a full-fledged LID is not needed; just make a fake detector that can "simulate" language detection results.
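
One possible shape for such a fake detector and the partitioning step is sketched below. Everything here is an assumption made for illustration: the CONFUSABLE table, the error rate, the function names, and the rule that two languages belong to the same set as soon as one is confused with the other at least once (implemented with a small union-find).

```python
import random
from collections import defaultdict

# Hypothetical table of languages a detector might confuse with each other.
CONFUSABLE = {
    "id": ["ms"],
    "ms": ["id"],
    "hr": ["sr", "bs"],
    "sr": ["hr", "bs"],
    "bs": ["hr", "sr"],
}


def fake_detect(true_lang, confusable, error_rate, rng):
    """Simulate a detector: return the true label, or a similar
    language with probability error_rate."""
    similar = confusable.get(true_lang, [])
    if similar and rng.random() < error_rate:
        return rng.choice(similar)
    return true_lang


def partition_languages(pairs):
    """Group languages into sets connected by at least one
    observed (true, predicted) confusion."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for true_lang, predicted in pairs:
        root_a, root_b = find(true_lang), find(predicted)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for lang in parent:
        groups[find(lang)].add(lang)
    return sorted(groups.values(), key=sorted)


if __name__ == "__main__":
    rng = random.Random(0)
    truth = ["id", "ms", "hr", "sr", "bs", "en"] * 50
    pairs = [(t, fake_detect(t, CONFUSABLE, 0.2, rng)) for t in truth]
    print(partition_languages(pairs))
```

A threshold on confusion counts, rather than a single observed confusion, would probably be more robust once a real detector replaces the simulated one.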

Implement basic LID schemes

We want to implement (very) basic LID schemes with a CRF or a structured SVM. Then we can look at the results and find out whether they can be improved. We'll use PyStruct for this purpose. At the first stage, we don't need a full dataset. Just make a small development set by hand (50 or so examples would suffice) and develop an identifier.
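
As a starting point, the hand-made development set and its features might look like the sketch below. This is not a PyStruct example: it only shows one way to turn tokens into character n-gram counts, which would still have to be converted into numeric feature arrays before being handed to a sequence model. The token-level labelling, the DEV_SET contents, and the function names are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical hand-labelled development examples: (tokens, per-token labels).
DEV_SET = [
    (["the", "cat", "sagt", "hallo"], ["en", "en", "de", "de"]),
    (["bonjour", "my", "friend"], ["fr", "en", "en"]),
]


def char_ngrams(token, n_max=3):
    """Count character n-grams (n = 1..n_max) of a lowercased token,
    padded with ^ and $ to mark the word boundaries."""
    padded = f"^{token.lower()}$"
    return Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )


def sentence_features(tokens, n_max=3):
    """Return one n-gram Counter per token, in sentence order."""
    return [char_ngrams(token, n_max) for token in tokens]


if __name__ == "__main__":
    for tokens, labels in DEV_SET:
        feats = sentence_features(tokens)
        # Each sentence yields parallel sequences of features and labels.
        assert len(feats) == len(labels)
        print(tokens[0], sorted(feats[0])[:5])
```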

Before developing identifiers, please study the topic and how to use the library idiomatically. Fixing bugs in legacy code is much harder than writing new code from scratch, especially for those who are not the code owners.
