Name: Taja Kuzman
Type: User
Company: Jožef Stefan Institute
Bio: PhD student in Computational Linguistics with an MA in Translation (FR, EN & SI).
Main interests: large language models, language technologies and resources
Twitter: TajaKuzman
Location: Ljubljana, Slovenia
Taja Kuzman's Projects
AI assistant, based on the GPT-3.5 model by OpenAI, designed to enhance your proficiency in writing research papers. Allows you to adapt your content to academic standards, transform bullet points into eloquent text, or enhance the quality of your writing through error detection.
A benchmark for evaluating robustness of automatic genre identification models to test their usability for the automatic enrichment of large text collections with genre information.
Genre Annotation Guidelines for GINCO corpora
Classification of hate speech and implicitness of hate speech, using Transformer language models (BERT). This repository can be used as an introduction to text classification with BERT-like models.
Open resources and community for machine translation
An evaluation of various encoder Transformer-based large language models on the named entity recognition task. The models are compared on six datasets manually annotated with named entities.
A set of HTML widgets that can be embedded into Notion.so (https://www.notion.so/) pages. For more, see https://blog.shorouk.dev/notion-widgets-gallery/
An ML web app that detects the objectivity of a text
A pipeline for machine translation (using OPUS-MT models) of parliamentary text collections in 30+ languages (ParlaMint corpora). The pipeline includes parsing TEI XML and CoNLL-U files, linguistic processing with the Stanza pipeline, machine translation, and word alignment with the Eflomal tool.
Hands-on sessions for ESSLLI course "Computational approaches to semantic change detection"
Home page to Taja Kuzman's GitHub repository.
Variety identification
Example notebooks and tutorials from Constellate, the text analysis service from ITHAKA.
Analysing different text representations for genre identification. I parse CoNLL-U files and extract various representations of a text (running text, lemmas, part-of-speech tags), then train a fastText model on each to see which representation is the most beneficial for the genre identification task.
Training and evaluating models (fastText and Transformer-based language models) for topic classification of Slovenian news texts. The repository can be used as a tutorial for learning topic classification.