2020-2-level-ctlr's Introduction

Dataset Collector Lab for second-year students of Fundamental and Computational Linguistics (2020/2021)

Corpus Collection and Annotation

About the course

"Computer Tools for Linguistic Research" at the Higher School of Economics (Nizhny Novgorod branch).

Lecturers

Motivation

The idea is to automatically obtain a dataset that has a defined structure and appropriate content, and then to perform morphological analysis on it using various NLP libraries. Dataset requirements.

Project Timeline

  1. Scrapper
    1. Short summary: Your code can automatically parse a media website of your choice and save the texts and their metadata in a proper format.
    2. Deadline: March 15th, 2021
    3. Format: each student works in their own PR
    4. Dataset volume: 5-7 articles
    5. Design document: ./docs/scrapper.md
    6. Additional resources:
      1. List of media websites to select from: link
  2. Pipeline
    1. Short summary: Your code can automatically process the raw texts from the previous step, performing part-of-speech tagging and basic morphological analysis.
    2. Deadline: April 5th, 2021
    3. Format: each student works in their own PR
    4. Dataset volume: 5-7 articles
    5. Design document: ./docs/pipeline.md
  3. Own Research
    1. Short summary: Your code can create a bigger processed dataset of a requested volume and format that you use for your linguistic research.
    2. Deadline: TBD (approx. May 30th, 2021)
    3. Format: students work in groups - one PR per group
    4. Dataset volume: 100 articles

Technical solution

| Module | Description | Component | Required to get at least (mark) |
|---|---|---|---|
| requests | module for downloading web pages | scrapper | 4 |
| BeautifulSoup | module for finding information on web pages | scrapper | 4 |
| lxml | module for parsing HTML as a structure | scrapper | 6 |
| pymystem3 | module for morphological analysis | pipeline | 6 |
| pymorphy2 | module for morphological analysis | pipeline | 8 |
| pandas | module for table data analysis | pipeline | 10 |

The software solution is built on three components:

  1. scrapper.py - a module for finding articles from the given media, extracting text and dumping it to the filesystem. Students need to implement it.
  2. pipeline.py - a module for processing text: part-of-speech tagging and basic morphological analysis. Students need to implement it.
  3. article.py - a module providing an article abstraction that encapsulates low-level manipulations with the article.
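To illustrate what an article abstraction might look like, here is a minimal sketch using only the standard library. It is not the course's actual article.py: the class fields, the `<id>_raw.txt` / `<id>_meta.json` file names, and the `tmp/articles` directory are all assumptions made for this example.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Article:
    """Hypothetical article abstraction: keeps content and hides file handling."""

    article_id: int
    url: str
    title: str = ""
    text: str = ""
    # Assumed storage location; the real project may use a different one.
    base_dir: Path = field(default_factory=lambda: Path("tmp") / "articles")

    def raw_text_path(self) -> Path:
        # Assumed naming scheme: <id>_raw.txt
        return self.base_dir / f"{self.article_id}_raw.txt"

    def meta_path(self) -> Path:
        # Assumed naming scheme: <id>_meta.json
        return self.base_dir / f"{self.article_id}_meta.json"

    def save(self) -> None:
        """Dump the raw text and the metadata to the filesystem."""
        self.base_dir.mkdir(parents=True, exist_ok=True)
        self.raw_text_path().write_text(self.text, encoding="utf-8")
        meta = {"id": self.article_id, "url": self.url, "title": self.title}
        self.meta_path().write_text(
            json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8"
        )

    @classmethod
    def load(cls, article_id: int, base_dir: Path) -> "Article":
        """Restore an article previously written by save()."""
        stub = cls(article_id=article_id, url="", base_dir=base_dir)
        meta = json.loads(stub.meta_path().read_text(encoding="utf-8"))
        text = stub.raw_text_path().read_text(encoding="utf-8")
        return cls(
            article_id=meta["id"],
            url=meta["url"],
            title=meta["title"],
            text=text,
            base_dir=base_dir,
        )
```

With a class like this, the scrapper would only call `save()` and the pipeline would only call `load()`, so neither needs to know how files are named or encoded.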

Handing over your work

Order of handing over:

  1. the lab work is accepted for oral presentation.
  2. the student has explained how the program works and shown it in action.
  3. the student has completed a mini-task from a mentor that requires some slight code modifications.
  4. the student receives a mark:
    1. that corresponds to the expected one, if all the steps above are completed and the mentor is satisfied with the answer
    2. one point higher than the expected one, if all the steps above are completed and the mentor is very satisfied with the answer
    3. one point lower than the expected one, if the lab is handed over one week later than the deadline and the criteria from 4.1 are satisfied
    4. two points lower than the expected one, if the lab is handed over more than one week later than the deadline and the criteria from 4.1 are satisfied
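The marking rules above can be sketched as a small function. This is only an illustration under stated assumptions: "one week later" is read as 1 to 7 days late, the bonus point is not combined with a lateness penalty, and the result is capped at 10. None of these three points is stated in the rules, and the function name `final_mark` is hypothetical.

```python
from datetime import date

ALLOWED_EXPECTED_MARKS = (4, 6, 8, 10)


def final_mark(
    expected: int,
    deadline: date,
    handed_over: date,
    very_satisfied: bool = False,
) -> int:
    """Return the mark for a lab where the mentor is at least satisfied."""
    if expected not in ALLOWED_EXPECTED_MARKS:
        raise ValueError(f"expected mark must be one of {ALLOWED_EXPECTED_MARKS}")

    days_late = (handed_over - deadline).days
    if days_late <= 0:
        # On time: the expected mark, plus one if the mentor is very satisfied.
        mark = expected + 1 if very_satisfied else expected
    elif days_late <= 7:
        # Assumption: up to one week late costs one point.
        mark = expected - 1
    else:
        # More than one week late costs two points.
        mark = expected - 2

    # Assumption: marks are capped at the top of the 10-point scale.
    return min(mark, 10)
```

For example, a lab with an expected mark of 8 for the March 15th deadline would get 8 if handed over on time, 7 if handed over on March 20th, and 6 if handed over on April 1st.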

NOTE: a student might improve their mark for the lab, if they complete tasks of the next level after handing over the lab.

A lab work is accepted for oral presentation if all the criteria below are satisfied:

  1. there is a Pull Request (PR) with a correctly formatted name: Laboratory work #<NUMBER>, <SURNAME> <NAME> - <UNIVERSITY GROUP NAME>. Example: Laboratory work #1, Kuznetsova Valeriya - 19FPL1.
  2. the PR contains a filled-in target_score.txt file with the expected mark. Acceptable values: 4, 6, 8, 10.
  3. the PR has green status.
  4. the PR has the label done, set by a mentor.
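Before opening a PR, the first two criteria can be checked locally. The sketch below is one possible way to do it, not the check the course's CI actually runs: the regular expression is derived only from the name format and example given above, and the helper names `parse_pr_title` and `is_valid_target_score` are hypothetical.

```python
import re
from typing import Dict, Optional

# Pattern derived from: Laboratory work #<NUMBER>, <SURNAME> <NAME> - <UNIVERSITY GROUP NAME>
# Assumption: surname, name and group contain no spaces, as in the example.
PR_TITLE_PATTERN = re.compile(
    r"Laboratory work #(?P<number>\d+), "
    r"(?P<surname>\S+) (?P<name>\S+) - (?P<group>\S+)"
)

ACCEPTABLE_SCORES = {"4", "6", "8", "10"}


def parse_pr_title(title: str) -> Optional[Dict[str, str]]:
    """Return the parts of a correctly formatted PR name, or None."""
    match = PR_TITLE_PATTERN.fullmatch(title)
    return match.groupdict() if match else None


def is_valid_target_score(content: str) -> bool:
    """Check the content of target_score.txt, ignoring surrounding whitespace."""
    return content.strip() in ACCEPTABLE_SCORES


if __name__ == "__main__":
    print(parse_pr_title("Laboratory work #1, Kuznetsova Valeriya - 19FPL1"))
    print(is_valid_target_score("6\n"))
```

`fullmatch` is used rather than `search` so that extra text before or after the expected name makes the title invalid.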

Resources

  1. Academic performance: link
  2. Media websites list: link
  3. Python programming course from previous semester: link
  4. Scraping tutorials: YouTube series (in Russian)
