
tecquel's Introduction

Simple example: simple_example.py

Basic usage with strings

from get_similarity import get_simil, sim_by_file

s1 = "Je mange le Lapin"

s2 = "Ie mange le lapin vert"

res = get_simil([s1, s2])

res is a Python dictionary

You can compare numerous strings

res = get_simil([s1, s2, s3, s4])
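The exact layout of the dictionary returned by get_simil is not described here. As a rough illustration of what comparing several strings pairwise can look like, the sketch below uses Python's standard difflib as a stand-in. The function name pairwise_ratio and the third string s3 are hypothetical and are not part of get_similarity.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_ratio(strings):
    """Hypothetical stand-in (NOT get_simil): score every pair of strings."""
    scores = {}
    for (i, a), (j, b) in combinations(enumerate(strings), 2):
        # ratio() lies in [0, 1]; 1.0 means the two strings are identical
        scores[(i, j)] = SequenceMatcher(None, a, b).ratio()
    return scores


s1 = "Je mange le Lapin"
s2 = "Ie mange le lapin vert"
s3 = "Je mange le Lapin"  # hypothetical third string, identical to s1

scores = pairwise_ratio([s1, s2, s3])
for pair, score in scores.items():
    print(pair, round(score, 3))
```

The keys are index pairs into the input list, so three strings give three comparisons.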

You can work with file paths

res = sim_by_file([path_ref, path_hyp])

compares path_ref with path_hyp

Example with a more complex directory structure

Useful for web scraping and OCR when one compares multiple systems

Example: try test.py

from get_similarity import process_data

path_ref = "dummy_data/reference/"  # your path to the reference data

path_hyp = "dummy_data/cleaned/"  # your path to the hypotheses (one directory for each different hypothesis)

NB: the filenames must be the same in the "reference dir" and in all the hypothesis dirs

res = process_data(path_hyp, path_ref)

Here: explain vocabulary

Expected directory structures

Option 1: Directory structure driven by tools

All files of a given tool are in the same directory.

Give a directory with the reference data and another directory with all the hypotheses, each in its own subdirectory. Each file in the reference data must have the same name in every hypothesis directory.

USAGE (see test.py):

from get_similarity import process_data

path_hyp = "dummy_data/cleaned/"

path_ref = "dummy_data/reference/"

print(f"Processing {path_ref} as reference path")

print(f"Processing {path_hyp} as hypothesis path")

res = process_data(path_hyp, path_ref)

Each file of each hypothesis will be matched with the reference file of the same name

See "dummy_data" as an example of the structure

dummy_data/ contains two subdirectories (reference and cleaned)

  • reference contains the reference files
  • cleaned contains the hypotheses obtained with different tools (BP3, GOO ...)

dummy_data/
├── cleaned
│   ├── BP3
│   ├── GOO
│   ├── HTML2TEXT
│   ├── INSCRIPTIS
│   ├── JT
│   ├── NEWSPAPER
│   ├── READABILITY
│   └── TRAF
└── reference
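Since every hypothesis file is matched to a reference file by name, it can help to check the layout before running process_data. The helper below is a minimal sketch and not part of get_similarity: the name missing_files is hypothetical, and it assumes the files sit directly inside the reference directory and directly inside each tool directory.

```python
from pathlib import Path


def missing_files(path_ref, path_hyp):
    """Hypothetical helper: map each tool directory to the reference
    filenames it lacks. Tools with no missing file are left out."""
    # Assumption: reference files sit directly inside path_ref
    ref_names = {p.name for p in Path(path_ref).iterdir() if p.is_file()}
    report = {}
    for tool_dir in sorted(Path(path_hyp).iterdir()):
        if not tool_dir.is_dir():
            continue
        # Assumption: hypothesis files sit directly inside each tool directory
        hyp_names = {p.name for p in tool_dir.iterdir() if p.is_file()}
        missing = sorted(ref_names - hyp_names)
        if missing:
            report[tool_dir.name] = missing
    return report
```

Calling missing_files("dummy_data/reference/", "dummy_data/cleaned/") would return an empty dictionary when every tool directory holds a file for each reference file.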

Option 2: Directory structure driven by source (or books)

The files are first sorted by source and then by tool. Each source then has a directory with the reference (REF) and a directory with all the hypotheses (HYP).

You can find an example in the dummy_data_by_source directory:

dummy_data_by_source/
└── goodcontents.net
    ├── HYP
    │   └── NEWSPAPER
    │       └── TXT
    │           └── 20111121_goodcontents.net_6e8a193b0d5e43883d5bcacdf...
    └── REF
        └── TXT
            └── 20111121_goodcontents.net_6e8a193b0d5e43883d5bcacdf...
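To show how this layout pairs up, the sketch below walks a by-source tree and yields each hypothesis file together with the reference file of the same name. It is an illustration under assumptions, not the library's own traversal: the function name pair_by_source is hypothetical, and the REF/TXT and HYP/&lt;tool&gt;/TXT nesting is taken from the example tree above.

```python
from pathlib import Path


def pair_by_source(root):
    """Hypothetical helper: yield (source, tool, ref_file, hyp_file) for
    every hypothesis file that has a same-named reference file."""
    for source_dir in sorted(Path(root).iterdir()):
        ref_dir = source_dir / "REF" / "TXT"
        hyp_root = source_dir / "HYP"
        if not (ref_dir.is_dir() and hyp_root.is_dir()):
            continue  # not a source directory in the assumed layout
        for tool_dir in sorted(hyp_root.iterdir()):
            txt_dir = tool_dir / "TXT"
            if not txt_dir.is_dir():
                continue
            for hyp_file in sorted(txt_dir.iterdir()):
                ref_file = ref_dir / hyp_file.name
                # Skip hypothesis files that have no reference counterpart
                if hyp_file.is_file() and ref_file.is_file():
                    yield source_dir.name, tool_dir.name, ref_file, hyp_file
```

Each yielded pair of paths could then be compared, for instance with sim_by_file([ref_file, hyp_file]), assuming that function accepts the paths in the form you pass (converting them with str() if needed).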
