nkowaokwu / igbo_api
An API exposing Igbo words, definitions, and more
Home Page: https://igboapi.com
License: Apache License 2.0
The Columbia Igbo Dictionary PDF has a list of abbreviations that are used within the dictionary.
To make them more readable, this project, while parsing, should be able to replace the abbreviations that are found in definitions and examples.
For example, n. as a word class should be replaced with noun. If Y. is found within a word's definition or example, it should be replaced with Yoruba.
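The replacement step described above can be sketched as a lookup map applied during parsing. The map entries beyond n. and Y. and the function name are illustrative, not the project's actual abbreviation list.

```javascript
// Illustrative abbreviation map; only 'n.' and 'Y.' come from the issue,
// the rest are assumed entries.
const ABBREVIATIONS = {
  'n.': 'noun',
  'v.': 'verb',
  'Y.': 'Yoruba',
};

// Replace whole abbreviation tokens so 'n.' doesn't match inside other words
const expandAbbreviations = (text) => Object.entries(ABBREVIATIONS)
  .reduce((result, [abbreviation, replacement]) => (
    // Escape the trailing period and require a word boundary before the token
    result.replace(
      new RegExp(`\\b${abbreviation.replace('.', '\\.')}`, 'g'),
      replacement,
    )
  ), text);
```

During parsing, each definition and example string would be passed through `expandAbbreviations` before being written into the JSON dictionary.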
Create a new folder inside /ig that will hold all the *.html files.
Make sure that all the routes pointing to these files are updated.
Currently, all the documentation on how to interact with the API is in the README.
That information should be moved out to its own doc. Maybe a USAGE.md file or create a new page in the Wiki.
Maybe we should explore automatically generating API docs with packages like apidoc.
Currently, the raw text that's found in the PDF dictionaries uses accent marks that users of this API might not want to include when searching for terms.
While building dictionaries, there should be a dictionary that's built with normalized text.
Add a small Build Status badge right below the title of the README
This repo is a good place to see how to add a badge - https://github.com/CultureHQ/github-actions-badge
This Medium guide is pretty helpful - https://medium.com/@AndreSand/adding-github-action-workflow-status-badge-to-your-repository-22ccea025af6
Scrape the Igbo terms along with their information from this site:
The data that gets scraped and parsed should be appended to the JSON dictionary objects.
Create a new branch dedicated to the source code for the front-facing site discussed in #21. Set up a basic Gatsby project in this branch.
If a user searches for a singular word that's in the phrases section of a term, that phrase should be returned with its information.
Convert the for
loop into .reduce() to clean up the code.
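A hypothetical before/after of the conversion; the loop body and function names are illustrative, not the project's actual code.

```javascript
// Before: accumulate all definitions with an index-based for loop
const collectDefinitionsWithFor = (words) => {
  const definitions = [];
  for (let i = 0; i < words.length; i += 1) {
    definitions.push(...words[i].definitions);
  }
  return definitions;
};

// After: the same accumulation expressed with .reduce()
const collectDefinitions = (words) => words.reduce(
  (definitions, word) => [...definitions, ...word.definitions],
  [],
);
```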
The main route api/v1/search/words should use MongoDB data while api/v1/test/words should use JSON data.
This is one step closer to consistently relying on the MongoDB data as a real site or service would.
Currently, the parsing script treats a new line for a definition or example as an entirely new example.
Instead, the script should determine whether a new line item is a continuation of a definition or example, or a completely new example, based on the top pixel differences.
If the script moves to a new row in the same column and the difference of tops is a clean 15px, then it's a continuation of that column's cell. If the difference between tops isn't a flat integer, it's a new cell in that column.
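The rule above can be sketched as a small predicate, assuming each parsed row exposes a numeric `top` pixel offset; the constant and names are illustrative.

```javascript
const CONTINUATION_STEP = 15; // a clean 15px difference means the same cell

// Returns true when the current row continues the previous cell: the tops
// must differ by exactly one clean integer step. Any fractional or
// different spacing starts a new cell in the column.
const isContinuation = (previousTop, currentTop) => {
  const difference = currentTop - previousTop;
  return Number.isInteger(difference) && difference === CONTINUATION_STEP;
};
```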
Create a quick site that uses the API and displays the results.
Instead of paginating through sets of ten words each time, the next page only introduces one new word while removing one term.
So instead of getting a set of ten new words, the frontend will only get one new word at a time.
Currently, definitions and examples are getting normalized. The normalization script is optimized to remove non-Igbo words, which isn't beneficial.
Now that the API has basic functionality to search for Igbo terms using either Igbo or English, it's time to release to the world.
The tools that I'm thinking about using include:
What are people's thoughts about these platforms? Is there something else we should consider using that could help down the road with scaling?
Now that this repo is able to parse the dictionary PDF and create a well-structured JSON object with terms and their information, it's time to start preparing this data to live in MongoDB.
This issue focuses on creating basic MongoDB documents and collections. Mongoose will be used to build out some basic schemas.
Word:
word - String
wordClass - String
definitions - Array[String]
phrases - Array[Phrase]
examples - Array[Example]

Phrase:
phrase - String
parentWord - Word
definition - String
examples - Array[Example]

Example:
example - String
parentPhrase - Phrase
parentWord - Word
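The three document shapes above can be sketched as plain-JavaScript factories; the real implementation would define these as Mongoose schemas, and the factory names here are illustrative.

```javascript
// Dependency-free sketch of the Word / Phrase / Example document shapes
const createWord = (word, wordClass) => ({
  word,            // String
  wordClass,       // String
  definitions: [], // Array[String]
  phrases: [],     // Array[Phrase]
  examples: [],    // Array[Example]
});

const createPhrase = (phrase, parentWord) => ({
  phrase,          // String
  parentWord,      // reference back to the owning Word
  definition: '',  // String
  examples: [],    // Array[Example]
});

const createExample = (example, parentPhrase, parentWord) => ({
  example,         // String
  parentPhrase,    // reference back to the owning Phrase, if any
  parentWord,      // reference back to the owning Word
});
```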
Write a few tests that check to see if the app is functioning properly.
After that, create a GitHub Actions workflow that creates a continuous integration pipeline that automatically runs the tests.
If there are more than 20 words that come back from the database, then the client should be able to paginate through the responses.
This will help with network response times.
Use the query param page to allow the client to specify which page of responses they want to see.
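A minimal sketch of the pagination logic, assuming `page` arrives as a query-param string and pages are zero-indexed; the page size matches the 20-word threshold above, and the names are illustrative.

```javascript
const WORDS_PER_PAGE = 20;

// Slice one page out of the full result set; invalid or missing page
// params fall back to the first page.
const paginate = (words, pageParam) => {
  const page = Math.max(parseInt(pageParam, 10) || 0, 0);
  const start = page * WORDS_PER_PAGE;
  return words.slice(start, start + WORDS_PER_PAGE);
};
```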
The current README file for this branch is the file auto-generated by the Gatsby starter. Create a custom README with project details to make the process easier for contributors.
According to react/forbid-prop-types, projects should stay away from PropTypes.object. Remove all instances of PropTypes.object and provide more detailed object structures.
Also, remove the react/forbid-prop-types rule from the 'rules' section of .eslintrc.json so that it gets enforced.
There are different ways a user can type out the same word. One user might insist on using accent marks, while another doesn't know how to use them. The API should still return the same information to the user.
For this to happen, there needs to be a hash function that knows how to map normalized and non-normalized text to the same word so the user gets the expected information.
There are numerous cases in the original PDF where words don't have any word class, definition, phrases, or examples.
Instead, it just says to see a different word.
Words in this situation shouldn't be considered standalone words in the JSON dictionary. Instead, each should be considered a variation of the word that it's telling the reader to see.
Scrape the data that's found in this site and append it to the JSON dictionary objects.
https://1000mostcommonwords.com/1000-most-common-igbo-words/
If the word doesn't exist in the JSON dictionary, create a new object where that object has definitions, word, wordClass, phrases, examples, and variations.
If the word already exists in the JSON dictionary, append that information.
A lot of words can have different spellings or variations.
When the project is parsing the HTML and building the JSON objects, a variations key should be placed on each word object. This will allow words to keep track of their variations.
In the Columbia PDF, variations are denoted by the use of commas in the far left column of the dictionary table. Each comma-separated term should be split into an array. The first word of the possible variations will be the key of the word object, and the subsequent words will be placed in the variations array.
The MongoDB Word schema should also be updated to capture this new variations key. It will be an array that contains strings.
The search functionality should also check each word's variations key to see if the searched keyword matches any of a given word's variations.
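The comma-splitting rule described above can be sketched as a small helper; the function name is illustrative.

```javascript
// Split a comma-separated headword cell into the main word plus its
// variations: the first term becomes the word, the rest go into the
// variations array.
const splitVariations = (headwordCell) => {
  const [word, ...variations] = headwordCell
    .split(',')
    .map((term) => term.trim())
    .filter((term) => term.length);
  return { word, variations };
};
```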
Build a normalization map where each key is the normalized term and the value is an array of all the non-normalized terms.
So whenever a user searches without tonal marks, the program can find the term as a key in the map and then grab all the term data for each of the words in the corresponding array.
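Building that map can be sketched as a reduce over all dictionary terms; the normalization here is a simple tone-mark strip standing in for whatever helper the project settles on, and the names are illustrative.

```javascript
// Stand-in normalizer: strip grave, acute, and macron tone marks
const normalizeTerm = (term) => term
  .normalize('NFD')
  .replace(/[\u0300\u0301\u0304]/g, '')
  .normalize('NFC');

// Map each normalized term to the array of accented forms that produce it
const buildNormalizationMap = (terms) => terms.reduce((map, term) => {
  const normalized = normalizeTerm(term);
  return {
    ...map,
    [normalized]: [...(map[normalized] || []), term],
  };
}, {});
```

A search without tonal marks would then look its keyword up as a key in this map and return the data for every term in the associated array.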
Currently, the phrases key has an object for its value. On that object, each key is a phrase that maps to an object with definitions and examples keys.
phrases should change from an object to an array of objects.
The updated shape is already represented in the MongoDB database, but it makes sense to keep the JSON up-to-date.
Before:
{
  "phrases": {
    "(agụū) -gụ": {
      "definitions": [
        "be hungry"
      ],
      "examples": []
    }
  }
}

After:
{
  "phrases": [
    {
      "phrase": "(agụū) -gụ",
      "definitions": [
        "be hungry"
      ],
      "examples": []
    }
  ]
}
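The shape change can be sketched as a one-off migration helper; the function name is illustrative.

```javascript
// Convert the old keyed-object phrases shape into the new array shape:
// each key becomes the `phrase` property of its own object.
const migratePhrases = (phrasesObject) => Object.entries(phrasesObject)
  .map(([phrase, { definitions, examples }]) => ({
    phrase,
    definitions,
    examples,
  }));
```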
Now that the API is deployed, we should mention it in the README.
Add a 'Try it Out' section near the top of the README that uses the following link:
The API demo site (see #21 ) has been set up in this branch: https://github.com/ijemmao/igbo_api/tree/gatsby-dev.
Implement error handling for the axios request to respond when a user search returns no words or an error is thrown.
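A sketch of the response-handling decision logic only; the axios call itself is omitted, and the result shape and message wording are assumptions.

```javascript
// Interpret a settled search result: an error produces an error message,
// an empty word list produces an empty-results message, and a non-empty
// list passes through unchanged.
const interpretSearchResult = ({ words, error }) => {
  if (error) {
    return { words: [], message: 'An error occurred while searching. Please try again.' };
  }
  if (!words || !words.length) {
    return { words: [], message: 'No words matched your search.' };
  }
  return { words, message: '' };
};
```

The axios `.then`/`.catch` handlers would feed into a helper like this so the UI always has either words or a user-facing message to render.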
Follow this guide to enable the database to be automatically migrated.
Database migrations will act as a form of version control, so if the database becomes invalidated or malformed, it can be rolled back.
The generated directories db/ and build/ need to be added to the .gitignore to prevent those files from accidentally entering this branch's codebase.
The Columbia paper has primary and secondary definitions for words using the prefixes A. and B.
parseAndBuild.js needs to move those phrases into a given word's definition property.
There are two normalized JSON dictionaries, ig-en_normalized.json and ig-en_normalized_expanded.json, that aren't used in the project and probably won't be used in the future.
Delete these two files along with the logic that's responsible for building them when the script yarn build:dictionaries is executed.
The project is currently using eslint:recommended, which I don't think is strong enough linting.
There are small formatting errors that fall through the cracks with this configuration.
So the .eslintrc.json file should use "extends": "airbnb".
Double-check with your text editor or IDE that it's set up correctly (it should show red error lines when a rule has been broken).
The following search features should be included to make searching easier for the user:
If kpo is provided, then the user should get kpọ.
Now that the project is able to move the JSON data into MongoDB, the API should start grabbing the data from MongoDB instead of the JSON files.
Create a new GET endpoint similar to the one that exists. When the query param keyword is provided, the word along with its resolved information should be included.
Add tests.
When an id is provided in the API route for a word or phrase, then the API should return back the object with that id.
After the initial version of the front site is complete, it would be nice to add a suggestions feature where users can request to see a change within the API.
I don't have any concrete ideas for how this feature will work. So I wanted to ask for people's opinions on what they think is the best way to capture user requests.
Here are a couple of ideas I've had so far:
On the front site, there would be a 'Suggestions' button where a user could input key information about what changes they want to see. When they submit that form, a new GitHub issue would be created
I was thinking about tracking changes in GitHub so it's easier to track, but are there concerns with mixing user-requested changes with technical implementation efforts?
Again on the front site, instead of using GitHub issues, any requests will be sent to an email
This approach would be nice because it would only house the issues found via the front site, but it's less accessible for future contributors to address.
Again on the front site, instead of issues or emails, we could make a new document in the database under a Request
model
This is my least favorite approach because it encourages human tampering with production-level data (once we've addressed the request we would have to go into the database to delete the document or update its status), but it's another thought I had
Also, a couple of more questions I had for people: how would we verify user-requested changes? What would the verification process look like? Would we want to
For example, -zu-zò doesn't include its phrase -zuzò èzuzò because it doesn't have a definition next to it.
Currently, the backend will search word definitions if they provide an English term. This search should be extended to phrase definitions that belong to a particular word.
This extension of the search functionality will help provide more search results to the client.
Words that have multiple definitions will have multiple strings inside their definitions array.
What happens often is that those strings in the definitions array are prefixed with the letters A., B., C., or any other letter to denote that there are multiple definitions.
Those letters should be removed so that the cleaned text serves as one of the definitions for a given term.
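The prefix removal above can be sketched as a single regex replace; the function name is illustrative.

```javascript
// Strip a leading capital-letter prefix like 'A. ' or 'B. ' from a
// definition string, leaving other definitions untouched.
const removeLetterPrefix = (definition) => definition.replace(/^[A-Z]\.\s*/, '');
```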
If people would want to join the Slack group, they should!
Add a section in the README that has the following link:
The following search features should be included to make searching easier for the user:
If n obe is provided, then the expected n’obe should be returned.
If bia is provided, then the expected -bia should be returned.
Currently, the Mongo objects that get returned from the backend to the frontend include the key _id, which violates the ESLint no-underscore-dangle rule.
So all the objects that will be returned to the frontend must be transformed to remove the _id key and replace it with an id key.
The __v key should be removed completely so the frontend doesn't know the version of the word, phrase, or example object.
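The transform described above can be sketched with destructuring, assuming the Mongo documents have already been converted to plain objects; the function name is illustrative.

```javascript
// Rename _id to id and drop __v before the object leaves the backend
const toClientObject = ({ _id, __v, ...rest }) => ({
  id: _id,
  ...rest,
});
```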
Create a git hook to check for any ESLint errors before attempting to push.
Currently, the plain strings left, center, and right are used throughout the buildDictionary function.
They should be placed in an enum for consistency.
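In plain JavaScript, a frozen object can stand in for an enum; the name and key casing are illustrative.

```javascript
// Column-alignment "enum": freezing prevents accidental mutation
const Alignment = Object.freeze({
  LEFT: 'left',
  CENTER: 'center',
  RIGHT: 'right',
});
```

Call sites in buildDictionary would then compare against `Alignment.LEFT` and friends instead of repeating the bare strings.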
A basic Gatsby project for this site (see #21 ) has been set up in this branch: https://github.com/ijemmao/igbo_api/tree/gatsby-dev.
Using Tailwind, add utility classes to improve appearance.
Currently, the API allows for Igbo to English search, but to further expand what this API can do, it needs English to Igbo search capability.
This issue doesn't focus on implementing a full-fledged English to Igbo search; instead, the main focus is to lay the groundwork for future related features.
The files and folders that are found in the top-level utils folder should be placed in the shared folder.
Everything inside utils/constants should be moved over to shared/constants.
The files directly inside utils should be moved over to shared/utils.
In order to make the search feature more scalable and easy to maintain, the data found in the dictionary JSON files need to be transferred into a MongoDB database.
Here are the current models that would be helpful:
- Term
  - Phrase documents
  - Example documents
- Phrase
  - Example documents
- Example
Note: Terms in bold are still a work in progress and might not be included in the document