This repository contains the training code for our solution to Arabicthon2023 in KSA.
Arabicthon is a deep learning competition organized by The King Salman Global Academy for the Arabic Language to enrich the Riyadh Dictionary.
It is a website that facilitates understanding Arabic through semantic relations between words, such as:
- Synonyms.
- Antonyms.
- Lexical field.
- Related words - Isomorphism.
- Hypernym.
- Hyponym.
- Object to instance relationship, etc.
The main features are:
- Easy search for non-Arabic speakers: We provide both Arabic and English search, with automatic translation from English to Arabic.
- User-friendly visualization tools: Semantic relations are not just displayed as a plain list of words. There are also more appealing display modes such as WordCloud and 3D Graph.
  - WordCloud: Helps the user visualize the words most related to the input word.
  - 3D Graph: Same advantage as a WordCloud, plus clickable nodes with another feature that enables the user to ...
- Assistance in learning semantic relations for beginner and advanced Arabic learners, students, teachers, and many other types of users. We provide an OCR tool that takes a picture of text or a PDF and detects all the relations in that text.
- Vocabulary quizzes: To enrich the Riyadh Dictionary database, we created smart quizzes in which users check the relations between a given set of words. This serves two purposes at once: it makes the learner's experience more fun and it improves the performance of our app.
React app that contains the frontend of the project. You can find it here: https://github.com/mezdourcheima/arabicthon-front
Flask app that contains the backend of the project. You can find it here: arabicthon_backend
This notebook covers machine translation backed by Hugging Face models through txtai, an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows. Machine translation via cloud services has come a very long way and now produces high-quality results. You can find the training file here: en-to-ar-translation.ipynb
We used an N-Grams model ported from AraVec, a set of pre-trained distributed word representations (word embeddings) with more than 1M vocabulary entries. You can find the training files here: lexical-field-and-vizualisation-twitter.ipynb
For word similarity, we used embedding layers and compared the resulting word vectors with cosine similarity.
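As a minimal sketch of the cosine-similarity step, the snippet below ranks words by how close their vectors are to a query word. It uses only the standard library. The `toy_embeddings` dictionary and the `most_similar` helper are hypothetical and only for illustration; in the project the vectors come from the trained embedding layers.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, top_n=3):
    """Rank every other word by cosine similarity to `word` (hypothetical helper)."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy 3-dimensional vectors, invented for the example (real embeddings are much larger).
toy_embeddings = {
    "كتاب": [0.9, 0.1, 0.0],
    "مجلد": [0.8, 0.2, 0.1],
    "شمس": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for other, score in most_similar("كتاب", toy_embeddings):
        print(other, round(score, 3))
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing word embeddings.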
We extracted four relations from AraWordNet: hypernym, hyponym, has_instance, and is_instance.
An SQLite database and a CSV file with a comprehensive collection of Arabic synonyms and antonyms. You can find the database here
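The sketch below shows one way such a database could be queried with Python's built-in `sqlite3` module. The schema is an assumption: the `relations` table, its column names, and the sample rows are hypothetical and may not match the actual database file.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical `relations` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE relations (
            word         TEXT NOT NULL,
            related_word TEXT NOT NULL,
            relation     TEXT NOT NULL CHECK (relation IN ('synonym', 'antonym'))
        )
        """
    )
    # Sample rows invented for the example.
    conn.executemany(
        "INSERT INTO relations (word, related_word, relation) VALUES (?, ?, ?)",
        [
            ("كبير", "ضخم", "synonym"),
            ("كبير", "عظيم", "synonym"),
            ("كبير", "صغير", "antonym"),
        ],
    )
    conn.commit()
    return conn


def lookup(conn, word, relation):
    """Return the words linked to `word` by `relation`, using a parameterized query."""
    rows = conn.execute(
        "SELECT related_word FROM relations "
        "WHERE word = ? AND relation = ? ORDER BY related_word",
        (word, relation),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    demo = build_demo_db()
    print(lookup(demo, "كبير", "synonym"))
    print(lookup(demo, "كبير", "antonym"))
```

Passing the search word as a query parameter rather than formatting it into the SQL string keeps user input from the search box from being interpreted as SQL.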
A set of 500 synsets (extracted from the Arabic WordNet). Each synset is enriched with a list of candidate synonyms, for a total of 3K candidates. Each candidate synonym is annotated with a fuzzy value.
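A possible way to use these annotations is to keep only the candidates whose fuzzy value passes a threshold. This is a sketch under assumptions: it supposes the fuzzy value lies in [0, 1] with higher meaning a stronger synonym, and the `accept_candidates` helper, the threshold of 0.5, and the sample values are all hypothetical.

```python
def accept_candidates(candidates, threshold=0.5):
    """Keep (word, fuzzy_value) pairs at or above `threshold`, strongest first.

    Assumes fuzzy values are in [0, 1] and that higher means a better synonym.
    """
    kept = [(word, value) for word, value in candidates if value >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Invented candidates for one synset, for illustration only.
sample_candidates = [("ضخم", 0.9), ("واسع", 0.3), ("عظيم", 0.6)]

if __name__ == "__main__":
    print(accept_candidates(sample_candidates))
```

Raising the threshold trades coverage for precision: fewer candidates are accepted, but those that remain carry stronger annotations.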
We used Tesseract 4, which adds a new neural network (LSTM) based OCR engine focused on line recognition, while still supporting the legacy OCR engine of Tesseract 3, which works by recognizing character patterns.