stylometrix's Introduction

StyloMetrix

Zakład Inżynierii Lingwistycznej i Analizy Tekstu, NASK PIB

📌 Quick

💡 Stylometry tool in beta version for English, German, Polish, Russian and Ukrainian, distributed as a Python package

💡 Tutorial notebook

💡 Notebook with example usage for classification and clustering with StyloMetrix

💡 List of built-in metrics for Polish, English, German, Ukrainian, Russian

🔖 Citation

Please cite this article when referring to StyloMetrix:

Okulska, I., Stetsenko, D., Kołos, A., Karlińska, A., Głąbińska, K., & Nowakowski, A. (2023). StyloMetrix: An Open-Source Multilingual Tool for Representing Stylometric Vectors. arXiv preprint arXiv:2309.12810.

🔔 About

StyloMetrix is a tool for representing texts as StyloMetrix vectors. Each metric in a vector quantifies a linguistic feature of the text, so detailed information about the style of a text can be translated into numeric values and used for whatever you want!

The metrics are:

  • interpretable - each metric represents an aspect of linguistic knowledge
  • normalized - each metric expresses the number of occurrences of a given feature per number of tokens in the text, which avoids scaling effects between texts of different lengths
  • reproducible - metric values can be recalculated, or even counted manually, and always give the same output. The representation doesn't depend on any random factor or seeding
  • customizable - if your needs exceed the scope of built-in metrics, create your own! Don't forget to share your work and contribute to the community of StyloMetrix!
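
The normalization described above can be illustrated with a minimal sketch. This is not StyloMetrix's implementation: the whitespace tokenizer, the `normalized_incidence` helper and the `ends_in_ing` feature are assumptions made purely for illustration, whereas the library relies on spaCy for linguistic analysis.

```python
def normalized_incidence(tokens, predicate):
    """Share of tokens that satisfy `predicate` (0.0 for an empty text)."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if predicate(token))
    return hits / len(tokens)


# Hypothetical feature: tokens ending in "ing" (a crude stand-in for a real metric).
def ends_in_ing(token):
    return token.lower().endswith("ing")


# Naive whitespace tokenization, used only to keep the sketch self-contained.
short_text = "She was singing loudly".split()
long_text = ("She was singing loudly " * 10).split()

# The proportion of matching tokens is the same for both texts, whatever their length.
print(normalized_incidence(short_text, ends_in_ing))
print(normalized_incidence(long_text, ends_in_ing))
```

Because the value is a ratio over the token count, a text repeated ten times yields the same number as the original, which is what makes vectors of texts of different lengths comparable.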

A StyloMetrix vector can be used as:

  • stylometric signature that encodes the writing style of the author and the genre
  • input for supervised or unsupervised learning, for example a Random Forest classifier or feature selection algorithms
  • values for statistical analyses in science
  • set of linguistic data for manual reference
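
As a rough illustration of the "stylometric signature" use, two vectors can be compared with cosine similarity. This is a sketch under stated assumptions: the three vectors below are made-up values, not output of StyloMetrix, and cosine similarity is just one of many possible comparison measures.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up metric values for three texts; real vectors have many more dimensions.
text_a = [0.12, 0.30, 0.05, 0.00]
text_b = [0.11, 0.28, 0.06, 0.01]
text_c = [0.40, 0.02, 0.00, 0.25]

# text_a and text_b have similar profiles, so they score higher than text_a and text_c.
print(cosine_similarity(text_a, text_b) > cosine_similarity(text_a, text_c))
```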

The tool offers customization of vectors by selecting from built-in metrics or creating new metrics according to user's needs. We provide a user-friendly interface to support these tasks. See instructions below! ⬇

Currently StyloMetrix is available for English, German, Polish, Russian and Ukrainian.

📢 Release

Our most recent release is:

v0.1.0

  • Changing the structure of StyloMetrix
  • Works much faster!
  • New metrics and categories for Polish and English
  • German language in beta version
  • Russian language in beta version
  • Ukrainian language in beta version
  • Metrics, or categories of metrics, can be selected by passing a list of strings containing their names.
  • Intermediate steps can be saved, so even if something crashes, part of the work is preserved.
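
The idea of saving intermediate steps can be sketched generically. The code below does not use or reproduce StyloMetrix's own mechanism: `process_in_batches` and `token_count` are hypothetical names, and the CSV layout is an assumption chosen for illustration.

```python
import csv
import os
import tempfile


def process_in_batches(texts, compute, out_path, batch_size=2):
    """Apply `compute` to texts batch by batch, appending rows to a CSV file."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        rows = [[text, compute(text)] for text in batch]
        # Appending after every batch keeps finished work on disk if a later batch fails.
        with open(out_path, "a", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)


def token_count(text):
    # Placeholder computation standing in for an expensive metric.
    return len(text.split())


out_path = os.path.join(tempfile.mkdtemp(), "partial_results.csv")
process_in_batches(["one two", "three", "four five six"], token_count, out_path)

# Read the file back to confirm that every batch was written.
with open(out_path, newline="", encoding="utf-8") as handle:
    saved = list(csv.reader(handle))
print(saved)
```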

Please note that support for Russian and Ukrainian will no longer be available.

Previous releases

v0.0.6

  • Add categories Syntactic and Lexical for English

v0.0.4

  • Add English beta with built-in metrics in category Grammatical Forms

v0.0.3

  • Add StyloMetrix structure
  • Add tutorial
  • Add 6 built-in metrics categories for Polish beta: Grammatical Forms, Inflection, Lexical, Psycholinguistic, Syntactic, Word Formation
  • Specify license & citation

🔨 Installation

1. Install spaCy

Install spaCy according to the spaCy installation instructions

2. Install model

For Polish:

Download and install the pl_nask v0.0.7 model

📍 pl_nask is the new HerBERT-based model from IPI PAN; it requires spacy==3.3

python -m pip install <PATH_TO_MODEL/pl_nask-0.0.7.tar.gz> 
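
Before installing the model, it can help to check which spaCy version is present. The sketch below uses only the standard library; `matches_series` and `installed_spacy_version` are hypothetical helper names, and reading the `spacy==3.3` requirement as "any 3.3.x release" is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version


def matches_series(installed, required="3.3"):
    """True if `installed` (e.g. '3.3.1') is in the `required` major.minor series."""
    return installed.split(".")[:2] == required.split(".")[:2]


def installed_spacy_version():
    """Return the installed spaCy version string, or None if spaCy is missing."""
    try:
        return version("spacy")
    except PackageNotFoundError:
        return None


found = installed_spacy_version()
if found is None:
    print("spaCy is not installed")
elif matches_series(found):
    print(f"spaCy {found} is in the 3.3 series")
else:
    print(f"spaCy {found} found, but pl_nask expects the 3.3 series")
```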

For other languages:

3. Install StyloMetrix

pip install stylo_metrix

🪁 How to use

  1. Get your texts and import StyloMetrix:
import stylo_metrix as sm

# Three short Polish texts to analyze
texts = ['Panno święta, co Jasnej bronisz Częstochowy I w Ostrej świecisz Bramie!',
        'Ofiarowany, martwą podniosłem powiekę; I zaraz mogłem pieszo, do Twych świątyń progu...',
        'W ludziach straty nie było. Ale wszystkie ławy Miały zwichnione nogi;']
  2. Create a StyloMetrix object and apply it to these texts:
# 'pl' is the language code for Polish
stylo = sm.StyloMetrix('pl')
metrics = stylo.transform(texts)
print(metrics)
  3. Your results are now in the metrics object.

That's it! Find out about more usages and customization options in the tutorial notebook.

Learn how to use StyloMetrix for classification or clustering in the example notebook.

📈 Metrics

We have put care into creating a set of powerful built-in metrics. See the list below ⬇. However, since flexibility is strength, we provide an easy way to create new metrics.

Polish (see full list)

English (see full list)

German (see full list)

Russian (see full list)

Ukrainian (see full list)

📚 We use

📪 Contact

Zakład Inżynierii Lingwistycznej i Analizy Tekstu, Naukowa i Akademicka Sieć Komputerowa – Państwowy Instytut Badawczy

Inez Okulska [email protected] | [email protected]

Copyright (C) 2024 NASK PIB

stylometrix's People

Contributors

ziliat-nask, kingagla, annakolos123, inezok

stylometrix's Issues

Unable to import library

Hello,
I tried using your library for the first time (version 0.1.7) and ran into a problem importing the stylo_metrix module. The import fails with the following error:
ImportError: cannot import name 'log_incidence' from 'stylo_metrix.utils' (/Users/rafalwojcik/miniconda3/lib/python3.9/site-packages/stylo_metrix/utils.py)
I checked utils.py and it doesn't contain a log_incidence function, even though an import of it was added to lexical_en.py in version 0.1.7.
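
A quick way to confirm this kind of mismatch is to ask Python whether a module exposes a given name. The helper below is a hypothetical diagnostic written with the standard library only; it is not part of StyloMetrix.

```python
import importlib


def module_exposes(module_name, attribute):
    """Return True/False if the module imports, or None if the import itself fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers a missing package as well as an import error raised inside it.
        return None
    return hasattr(module, attribute)


# Names taken from the error message above.
print(module_exposes("stylo_metrix.utils", "log_incidence"))
```

A result of `None` means the import failed before the attribute could be checked, which is expected if the package raises the same ImportError while it initializes.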

Convert dictionaries.py from Python to text files

Regarding this file:
https://github.com/ZILiAT-NASK/StyloMetrix/blob/v0.1.0/src/stylo_metrix/metrics/pl/data/dictionaries.py

It can only be used from Python, so the dictionaries require manual conversion before they can be used from other languages.

I request converting these dictionaries from a Python file into separate text files, just as is done for the English dictionaries:
https://github.com/ZILiAT-NASK/StyloMetrix/tree/v0.1.0/src/stylo_metrix/metrics/en/data/hurtlex
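
The requested conversion could look roughly like the sketch below. It assumes the dictionaries are named lists of strings; the actual structure of dictionaries.py may differ, and `export_word_lists` and the sample data are hypothetical.

```python
import os
import tempfile


def export_word_lists(dictionaries, out_dir):
    """Write each word list to `<out_dir>/<name>.txt`, one entry per line."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for name, words in dictionaries.items():
        path = os.path.join(out_dir, f"{name}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(words) + "\n")
        paths.append(path)
    return paths


# Hypothetical stand-in for the contents of dictionaries.py.
sample = {"greetings": ["cześć", "witaj"], "farewells": ["żegnaj"]}

out_dir = tempfile.mkdtemp()
for path in export_word_lists(sample, out_dir):
    print(os.path.basename(path))
```

Plain UTF-8 files with one entry per line can be read from any programming language without parsing Python syntax.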
