oggy22 / translator

This is my attempt to create a machine translator from and to multiple natural languages. The method is based on formal (Chomsky) grammars and equivalency rules. The input sentence is parsed into a tree, an equivalent tree is constructed in the target language and traversed to yield the translated text. Currently, I am working on my native Serbian, and English. Future support may include Russian, Spanish and German.

C# 7.73% C++ 55.89% C 36.38%

translator's Issues

Better reporting in ASSERT/Debug

ASSERT(x) in Debug throws exception, but it is invisible to the user.

Find a way to have a dialog pop-up when the ASSERT is breached in Debug.

More parsing tests

Add more parsing tests including those that would exercise more than one rule, for example:
"ja idem u skolu" would exercise:

  1. ->
  2. u -> PrilOdredba

Add parsing tests for English.

Make patterns work without jokers

Currently class pattern works only with jokers.

For pronouns in Serbian e.g:
{ L"ја", L"мој", зам, прид },
{ L"ти", L"твој", зам, прид },
There are no jokers.

Assert with throw

Implement Assert as:

  • in Debug, it will throw
  • in Release, it will __assume
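
A minimal sketch of such a macro, assuming that Debug builds are the ones compiled without NDEBUG and that std::logic_error is an acceptable exception type. The name TRANSLATOR_ASSERT is hypothetical (the project's macro is called ASSERT), and since __assume is MSVC-specific, the GCC/Clang branch substitutes __builtin_unreachable():

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical name; the project's own macro is called ASSERT.
#ifndef NDEBUG
  // Debug: a failed condition throws, so tests and callers can observe it.
  #define TRANSLATOR_ASSERT(cond)                                          \
      do {                                                                 \
          if (!(cond))                                                     \
              throw std::logic_error(std::string("Assertion failed: ") +  \
                                     #cond + " at " + __FILE__ + ":" +     \
                                     std::to_string(__LINE__));            \
      } while (false)
#elif defined(_MSC_VER)
  // Release on MSVC: tell the optimizer that the condition always holds.
  #define TRANSLATOR_ASSERT(cond) __assume(cond)
#else
  // Release on GCC/Clang: closest equivalent of __assume.
  #define TRANSLATOR_ASSERT(cond) \
      do { if (!(cond)) __builtin_unreachable(); } while (false)
#endif
```

In Release the condition is never checked, so a condition that can actually be false leads to undefined behaviour there.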

Better reporting in tests

Currently, many tests only assert without a good failure message. This is due to the fact that many classes don't have string/wstring converters. Write these converters and use them in the tests.

Word rules used

In Debug, make sure that every single word rule was used. It can be done as:

  • as soon as the rule is used set a flag to true
  • check all the rules that they have the flag set (either in a test or in init)
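
The two steps above could be sketched roughly as follows. The struct word_rule and the functions mark_used and unused_rules are hypothetical stand-ins, not the project's actual types:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the project's word rule type.
struct word_rule {
    std::wstring pattern;
    std::wstring replacement;
    mutable bool used = false;  // set the first time the rule is applied

    word_rule(std::wstring p, std::wstring r)
        : pattern(std::move(p)), replacement(std::move(r)) {}

    // To be called from wherever the rule is applied.
    void mark_used() const { used = true; }
};

// Returns the patterns of all rules whose flag was never set.
std::vector<std::wstring> unused_rules(const std::vector<word_rule>& rules) {
    std::vector<std::wstring> result;
    for (const word_rule& r : rules)
        if (!r.used)
            result.push_back(r.pattern);
    return result;
}
```

A test (or a Debug-only check after initialization) would then require unused_rules to return an empty list.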

Resolve Warning on shared binary directory

The following warning is issued sporadically. Try to resolve it.

Severity Code Description Project File Line Suppression State
Warning MSB8028 The intermediate directory (Debug) contains files shared from another project (TranslatorCppTest.vcxproj, UnitTest1.vcxproj). This can lead to incorrect clean and rebuild behavior. TestTranslatorCpp C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppBuild.targets 392

attribute_manager API and tests

The class attribute_manager should have the following methods:

// True if it has all the attributes that the argument has
bool has_all(const attribute_manager&);

// True if there is no conflict, i.e. no category has mismatching arguments
bool accepts(const attribute_manager&);

This should always hold:
am.has_all(am2) => am.accepts(am2)
am.accepts(am2) <=> am2.accepts(am)
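
One possible shape for these two methods, assuming the manager stores at most one attribute per category. The reduced enums, the set method and the const qualifiers are assumptions made for this sketch:

```cpp
#include <cassert>
#include <map>

// Hypothetical, reduced enums; the project's real ones are larger.
enum class category { number, gender };
enum class attribute { singular, plural, masculine, feminine };

class attribute_manager {
    // At most one attribute per category.
    std::map<category, attribute> attrs_;

public:
    void set(category c, attribute a) { attrs_[c] = a; }

    // True if it has all the attributes that the argument has.
    bool has_all(const attribute_manager& other) const {
        for (const auto& kv : other.attrs_) {
            auto it = attrs_.find(kv.first);
            if (it == attrs_.end() || it->second != kv.second)
                return false;
        }
        return true;
    }

    // True if no category present in both holds different attributes.
    bool accepts(const attribute_manager& other) const {
        for (const auto& kv : other.attrs_) {
            auto it = attrs_.find(kv.first);
            if (it != attrs_.end() && it->second != kv.second)
                return false;
        }
        return true;
    }
};
```

With this representation both properties hold by construction: has_all is the stricter check, and accepts only compares categories that both sides define, which makes it symmetric.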

Jokers

Implement joker logic.
Currently only patterns with "%" are allowed.
There should be:
"?" any letter - for any language
"@" any vowel - language defined
"#" any consonant - language defined

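A rough sketch of how the four jokers could be matched. The function joker_match is hypothetical, and it assumes that every character of the word is a letter, so "consonant" simply means "not in the language's vowel list", which the caller passes in:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Matches word[wi..] against pattern[pi..].
// Assumption: every character of word is a letter, so a consonant is
// any character that is not in the language-defined vowel list.
bool joker_match(const std::wstring& pattern, const std::wstring& word,
                 const std::wstring& vowels,
                 std::size_t pi = 0, std::size_t wi = 0) {
    if (pi == pattern.size())
        return wi == word.size();

    const wchar_t p = pattern[pi];
    if (p == L'%') {
        // '%' may swallow zero or more characters: try every split point.
        for (std::size_t next = wi; next <= word.size(); ++next)
            if (joker_match(pattern, word, vowels, pi + 1, next))
                return true;
        return false;
    }
    if (wi == word.size())
        return false;

    const wchar_t c = word[wi];
    const bool is_vowel = vowels.find(c) != std::wstring::npos;
    bool ok;
    switch (p) {
        case L'?': ok = true;      break;  // any letter
        case L'@': ok = is_vowel;  break;  // any vowel
        case L'#': ok = !is_vowel; break;  // any consonant
        default:   ok = (p == c);  break;  // literal character
    }
    return ok && joker_match(pattern, word, vowels, pi + 1, wi + 1);
}
```

The recursion is exponential in the worst case for patterns with several '%' jokers, which should not matter for word-sized inputs.
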
Base class for languages

Introduce a base class for languages:

template <class Lang>
class Language

so that

class Serbian : public Language<Serbian>
class English : public Language<English>
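
This is the curiously recurring template pattern (CRTP). A small sketch of what it buys: shared code in the base can call language-specific statics through Lang. The members name, code and describe are made up for illustration:

```cpp
#include <cassert>
#include <string>

// CRTP base: Lang is the derived language class itself.
template <class Lang>
class Language {
public:
    // Shared code reaches language-specific statics through Lang.
    static std::string describe() {
        return std::string(Lang::name()) + " (" + Lang::code() + ")";
    }
};

class Serbian : public Language<Serbian> {
public:
    static const char* name() { return "Serbian"; }
    static const char* code() { return "SR"; }
};

class English : public Language<English> {
public:
    static const char* name() { return "English"; }
    static const char* code() { return "EN"; }
};
```

Here Serbian::describe() yields "Serbian (SR)". The body of describe is only instantiated when it is called, by which point the derived class is complete.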

Command line App with arguments

Make command line application work with various arguments e.g:

  • translator.exe -random:SR
    Generates a random proper sentence in the Serbian language.
  • translator.exe -listwords:EN
    Lists all the words in the English language.
  • translator.exe -translate:EN-SR text.txt
    Translate the given file from English to Serbian.
  • translator.exe -help
    Prints all the commands
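
Splitting such an option into its name and its argument could look like this. The struct command and the function parse_option are hypothetical names for this sketch, not existing project code:

```cpp
#include <cassert>
#include <string>

struct command {
    std::string name;      // e.g. "random", "listwords", "translate", "help"
    std::string argument;  // text after ':', e.g. "SR" or "EN-SR"; may be empty
};

// Parses one option of the form "-name" or "-name:argument".
// Returns a command with an empty name if the text is not an option
// (for example a file name such as text.txt).
command parse_option(const std::string& text) {
    command result;
    if (text.size() < 2 || text[0] != '-')
        return result;
    const std::string::size_type colon = text.find(':');
    if (colon == std::string::npos) {
        result.name = text.substr(1);
    } else {
        result.name = text.substr(1, colon - 1);
        result.argument = text.substr(colon + 1);
    }
    return result;
}
```

main would call parse_option on argv[1] and treat any remaining arguments as input files.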

Serbian adjective declensions

Make Serbian adjectives have all 3x2x7x3=126 word forms

There are:

  • 3 genders
  • 2 numbers (singular and plural)
  • 7 cases
  • 3 degrees of comparison (positive, comparative and superlative)

Test failing: each_adjective_has_3genders_2plurals_7cases

WPF C# communicating with Console App

Write a prototype of WPF C# which writes to std input and reads from std output of C++ Console App and displays it in a textbox.

The C++ Console App can simply echo the message or do some simple modifications to it.

The C++ Console App should be started only once, so there must be an end-of-message character, e.g. a null terminator. Once WPF receives it, the text is displayed.

Might experience threading issues in WPF as only UI Thread can change the textbox.
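
The C++ side of such a prototype might look like the sketch below, assuming '\0' as the end-of-message character and upper-casing as the "simple modification". The function echo_messages is a hypothetical name; it takes streams as parameters so that it can be driven by std::cin/std::cout in main or by string streams in a test:

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Reads '\0'-terminated messages from in and writes each one back,
// upper-cased, followed by the same terminator.
void echo_messages(std::istream& in, std::ostream& out) {
    std::string message;
    while (std::getline(in, message, '\0')) {
        for (char& ch : message)
            ch = static_cast<char>(
                std::toupper(static_cast<unsigned char>(ch)));
        out << message << '\0';
        out.flush();  // so the reading side is not left waiting on a buffer
    }
}
```

The loop ends when the input stream is closed, so the process keeps running for as long as the WPF side holds its end of the pipe open.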

Tests for some word forms

Write the following tests:

  • check_some_noun_forms:
    konj -> konja (akuz)
    covek -> covece
    orah -> orasi, orahe
    mis -> miseva
  • check_some_verb_forms:
    ici -> idem, idu, isao

etc.

Basically, whenever there is a special rule or rule exception, write a test that checks that the rule is applied in an appropriate case.

TranslatorWPF should work

Make TranslatorWPF actually work with TranslatorCPP.
TranslatorCPP should have command line arguments e.g. "parse:Serbian" or "parse:English" or "translate:English-Serbian"

On parse only, the app should output something like "Parse ok"/"Parse failed"

Replace joker '%' with '*'

The joker currently used to match everything is '%'. This was probably due to SQL's influence: when I started writing Translator in C#, I used an SQL DB to store languages and thought a similar syntax would be beneficial.
It never actually was, and now I think it should be '*', as in most scripting syntaxes (e.g. the DOS command prompt).

Handling of attributes and categories

Currently attributes and categories are defined as enum classes each:
enum class attributes { singular, plural, masculine, feminine, ... }
enum class categories { number, gender, .... }
and there is mapping defined as:
std::map<attributes, categories> belongs_to = ...

I am thinking of:
enum class gender { masculine, feminine, neutrum }
enum class number { singular, plural }

This should be investigated and possibly the code be switched to such.
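
One thing worth checking in that investigation: with one enum per category, the belongs_to map can become a set of overloads resolved at compile time. A sketch, with category_of as a hypothetical name:

```cpp
#include <cassert>

enum class gender { masculine, feminine, neutrum };
enum class number { singular, plural };

// Still useful when a category has to be named at run time.
enum class category { gender, number };

// The attribute's type alone decides the category, so no map is needed.
constexpr category category_of(gender) { return category::gender; }
constexpr category category_of(number) { return category::number; }

static_assert(category_of(gender::feminine) == category::gender,
              "gender attributes belong to the gender category");
static_assert(category_of(number::plural) == category::number,
              "number attributes belong to the number category");
```

The trade-off is that attributes of different categories no longer share a type, so any container that holds a mix of them needs a different representation.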

Unify set_s and map_s

Both unordered_set and set, and unordered_map and map are used.
Make them more consistent.

Consider having different flavours (i.e. non-unordered/unordered) for Debug/Release, as allegedly unordered versions are faster.
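
Switching flavours per configuration could be done with alias templates, assuming that Debug builds are the ones without NDEBUG. The names set_t and map_t are hypothetical:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <unordered_map>
#include <unordered_set>

#ifndef NDEBUG
// Debug: ordered containers give a deterministic iteration order,
// which is easier to follow in the debugger.
template <class T> using set_t = std::set<T>;
template <class K, class V> using map_t = std::map<K, V>;
#else
// Release: hashed containers.
template <class T> using set_t = std::unordered_set<T>;
template <class K, class V> using map_t = std::unordered_map<K, V>;
#endif
```

Code using these aliases must not rely on iteration order, and every key type needs both operator< and a std::hash specialization so that it compiles in both configurations.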

English words

Put some effort into generating English words:

  • nouns: singular and plural
  • verbs: present, past participle, past

Generate random proper sentences

In addition to parsing and translation, the system has the potential to generate random proper sentences/texts by generating random abstract syntax trees as follows:

  • Start with a syntax node, probably sentence
  • Randomly choose rules and apply them where possible
  • Once no rule can be applied anymore, choose random words and place them on leaves in proper word forms
  • Walk the tree (DFS) to yield a sentence/text
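
The steps above can be sketched with a toy grammar. All names here (node, grammar, lexicon, pick, expand, yield_text) are hypothetical. The sketch also assumes that the grammar is non-recursive, since otherwise expansion might not terminate, and it ignores agreement between word forms:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <random>
#include <string>
#include <vector>

struct node {
    std::string symbol;
    std::vector<std::unique_ptr<node>> children;
};

// symbol -> alternative right-hand sides
using grammar = std::map<std::string, std::vector<std::vector<std::string>>>;
// word type -> candidate words
using lexicon = std::map<std::string, std::vector<std::string>>;

template <class T>
const T& pick(const std::vector<T>& items, std::mt19937& rng) {
    assert(!items.empty());
    std::uniform_int_distribution<std::size_t> dist(0, items.size() - 1);
    return items[dist(rng)];
}

// Applies randomly chosen rules until only word types remain as leaves.
std::unique_ptr<node> expand(const std::string& symbol, const grammar& g,
                             std::mt19937& rng) {
    std::unique_ptr<node> n(new node);
    n->symbol = symbol;
    auto rule = g.find(symbol);
    if (rule != g.end())
        for (const std::string& child : pick(rule->second, rng))
            n->children.push_back(expand(child, g, rng));
    return n;
}

// Depth-first walk that puts a random word on every leaf.
void yield_text(const node& n, const lexicon& words, std::mt19937& rng,
                std::string& out) {
    if (n.children.empty()) {
        if (!out.empty())
            out += ' ';
        out += pick(words.at(n.symbol), rng);
        return;
    }
    for (const auto& child : n.children)
        yield_text(*child, words, rng, out);
}
```

With rules "sentence -> pronoun verb", "sentence -> pronoun verb adverbial" and "adverbial -> preposition noun", and one word per word type, the result is either "ja idem" or "ja idem u skolu".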

Note: This is a completely separate task, independent of the others, just for fun.

parsing_triangle visualization

There needs to be a nice visualization of parsing_triangle e.g:

Ja idem u skolu
++ - Zamenica
+----+ - glagol
+----+ - imenica
+------+ - recenica
+-----+ - priloska odredba
+----------------+ - recenica
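
A possible starting point for the drawing code. The struct labeled_span and the function draw_spans are hypothetical, and the real parsing_triangle would have to be converted into such spans first. The sketch assumes one column per character, so it suits Latin transliteration but not UTF-8 encoded Cyrillic:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct labeled_span {
    std::size_t first;  // index of the first word covered
    std::size_t last;   // index of the last word covered
    std::string label;  // e.g. "glagol"
};

// Prints the sentence, then one line per span with '+' under the first
// and last covered character and '-' in between.
std::string draw_spans(const std::vector<std::string>& words,
                       const std::vector<labeled_span>& spans) {
    std::vector<std::size_t> start;  // column where each word begins
    std::string sentence;
    for (const std::string& w : words) {
        if (!sentence.empty())
            sentence += ' ';
        start.push_back(sentence.size());
        sentence += w;
    }

    std::string out = sentence + '\n';
    for (const labeled_span& s : spans) {
        assert(s.first <= s.last && s.last < words.size());
        const std::size_t from = start[s.first];
        const std::size_t to = start[s.last] + words[s.last].size() - 1;
        std::string line(from, ' ');
        line += '+';
        if (to > from) {
            line.append(to - from - 1, '-');
            line += '+';
        }
        out += line + " - " + s.label + '\n';
    }
    return out;
}
```

For the words "Ja idem u skolu", a span covering only the first word is drawn as "++", and a span covering the whole sentence as a line exactly as wide as the sentence.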

Command line option to parse text

There should be a command line option with an input file. Input file will contain a text in Serbian (or potentially any other supported language). The system will try to parse the text and provide the following stats:

  • number of recognized/unrecognized words (potentially list them in output files)
  • the longest (successfully) parsed phrase counting characters
  • the longest (successfully) parsed phrase counting words

By adding new words these stats should gradually increase.

Extra: Create a service which would regularly download a text from http://www.politika.rs, parse it and store results.

Tests for command line

Sometimes it happens that all tests are passing but the program crashes from the command line.
Therefore, create tests which simulate common command line scenarios and which will fail iff command line fails. This may include:

  • Listing all the words and word forms for a given language
  • Parsing a sentence in the given language
  • Translating from/to languages

belongs_to_category into Language base class

I had problems moving belongs_to_category into the base Language class. Namely, attributes and attribute_categories are inaccessible from the base class.

Try solving this somehow, and having belongs_to_category in the base class.
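
One possible way out, assuming the cause is that the derived class is still incomplete while Language<Lang> is being instantiated: move the enums into a traits struct that is specialized before the language class is defined. The names language_traits and category_of are hypothetical, and the enums are reduced for the sketch:

```cpp
#include <cassert>

// Specialized for each language before the language class is defined,
// so the enums are complete when Language<Lang> is instantiated.
template <class Lang>
struct language_traits;

template <class Lang>
class Language {
public:
    using attributes = typename language_traits<Lang>::attributes;
    using categories = typename language_traits<Lang>::categories;

    static categories belongs_to_category(attributes a) {
        return language_traits<Lang>::category_of(a);
    }
};

class Serbian;  // forward declaration, enough to name the specialization

template <>
struct language_traits<Serbian> {
    enum class attributes { singular, plural, masculine, feminine };
    enum class categories { number, gender };

    static categories category_of(attributes a) {
        switch (a) {
            case attributes::singular:
            case attributes::plural:
                return categories::number;
            default:
                return categories::gender;
        }
    }
};

class Serbian : public Language<Serbian> {};
```

Callers can still write Serbian::attributes and Serbian::belongs_to_category, because the aliases and the function are inherited from the base.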

Serbian: generation of words

Make the system support these:
Човек - човече - људи (plural basis "ljud")
коњ - коњи
миш - мишеви (plural basis "mishev")
муж - мужеви (plural basis "muzhev")
слон - слонови (plural basis "slonov")
сто - стола - столови (singular basis "sto")
мушкарац - мушкарца - мушкарци
nouns in the accusative: animate vs. inanimate

Words check

    • Traverse all the dictionary words for duplicates.
    • When creating each derived dictionary word check that it doesn't exist already.

This should be only in Debug.
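
The first check could be as simple as the sketch below, where find_duplicates is a hypothetical helper that works on the plain word strings. The call site would be wrapped in a Debug-only block:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Returns every word that occurs more than once, each reported once,
// in the order in which the first repetition is encountered.
std::vector<std::wstring> find_duplicates(
        const std::vector<std::wstring>& words) {
    std::unordered_set<std::wstring> seen;
    std::unordered_set<std::wstring> reported;
    std::vector<std::wstring> duplicates;
    for (const std::wstring& w : words) {
        // insert().second is false when the word was already present.
        if (!seen.insert(w).second && reported.insert(w).second)
            duplicates.push_back(w);
    }
    return duplicates;
}
```

The same seen set can serve the second check: inserting each derived word into it as it is created and asserting that the insertion succeeded.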

x64 tests failing

All x64 tests, in both Release and Debug, are failing with the following message:
A 64-bit test cannot run in a 32-bit process. Specify platform as x64 to force test run in x64 mode on x64 machine.

Break down language.h

The file language.h has become too big and contains many classes. Break it down into multiple files, each possibly containing one class/struct.

Create natvis file

Create a natvis file for the classes and types in language.h.
A natvis file tells the Visual Studio debugger how to display types.
For reference, check cpprest.natvis.

Tests

Create the following tests for Serbian language:

  • each noun has gender
  • each noun has 7x2 word forms (7 cases x 2 numbers)
  • each verb has 3x2 forms in present

Visual Studio Git doesn't mark modified files

Sometimes, VS Git doesn't mark files which have been modified. Even when they are explicitly added with:
git add file.cpp
they eventually appear unmarked. Marked files are shown with a "red check" in VS Solution Explorer.

Find out why and fix it.

Word Count tests

Write tests which will count the number of words and compare it against explicitly given numbers:

  • Dictionary words
  • Derived dictionary words
  • Word forms

Also, write a test to check that each dictionary word produces at least one word form

Smart pointers

Consider using smart pointers:

  • unique_ptr
  • shared_ptr
  • weak_ptr

One node rules don't actually work

For example, the rule in Serbian:
IS(broj, padez) -> imen(broj, padez)
doesn't seem to work.

The problem might be the way new nodes are added into the parsing matrix for rules of length 1. The left and right sides of the rule hit the same spot in the matrix.

Creating dictionary words out of other dictionary words

For example pronouns "ja", "ti", "on" need to have derived dictionary words as "moj", "tvoj", "njegov". Every derived dictionary word may/will have word forms e.g:
nom: moj/moja/moje
gen: mog(a)/moje/mog(a)
dat: mom(e)/mojoj/mom(e)
akuz: moj/moje/moje

Basic English-Serbian translator

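Write a basic English-Serbian translator class:
// dictionary_word and rule_translation are assumed to be declared elsewhere.
template <class Lang1, class Lang2>
class word_translation
{
    dictionary_word* p_word1;  // word in Lang1
    dictionary_word* p_word2;  // its counterpart in Lang2
};

template <class Lang1, class Lang2>
class translator
{
    // Pairs of corresponding word types, attributes and categories.
    static vector<pair<typename Lang1::wordtypes, typename Lang2::wordtypes>> wordtypes;
    static vector<pair<typename Lang1::attributes, typename Lang2::attributes>> attrs;
    static vector<pair<typename Lang1::categories, typename Lang2::categories>> cats;
    static vector<word_translation<Lang1, Lang2>> words;
    static vector<rule_translation> rules;
};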

Remove C# projects

There are TestTranslator and Translator which are C# projects. Remove them.
