nkowaokwu / igbo_api
An API exposing Igbo words, definitions, and more
Home Page: https://igboapi.com
License: Apache License 2.0
The Columbia Igbo Dictionary PDF has a list of abbreviations that are used within the dictionary.
To make them more readable, this project, while parsing, should be able to replace the abbreviations that are found in definitions and examples.
For example, n. as a word class should be replaced with noun. If Y. is found within a word's definition or example, it should be replaced with Yoruba.
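The replacement step described above can be sketched as a lookup map applied during parsing. The map entries beyond n. and Y. and the function name are illustrative, not the project's actual abbreviation list.

```javascript
// Illustrative abbreviation map; only 'n.' and 'Y.' come from the issue,
// the rest are assumed entries.
const ABBREVIATIONS = {
  'n.': 'noun',
  'v.': 'verb',
  'Y.': 'Yoruba',
};

// Replace whole abbreviation tokens so 'n.' doesn't match inside other words
const expandAbbreviations = (text) => Object.entries(ABBREVIATIONS)
  .reduce((result, [abbreviation, replacement]) => (
    // Escape the trailing period and require a word boundary before the token
    result.replace(
      new RegExp(`\\b${abbreviation.replace('.', '\\.')}`, 'g'),
      replacement,
    )
  ), text);
```

During parsing, each definition and example string would be passed through `expandAbbreviations` before being written into the JSON dictionary.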
Create a new folder inside /ig that will hold all the *.html files.
Make sure that all the routes pointing to these files are updated.
Currently, all the documentation on how to interact with the API is in the README.
That information should be moved out to its own doc. Maybe a USAGE.md file or create a new page in the Wiki.
Maybe we should explore automatically generating API docs with packages like apidoc.
Currently, the raw text that's found in the PDF dictionaries uses accent marks that users of this API might not want to include when searching for terms.
While building dictionaries, there should be a dictionary that's built with normalized text.
Add a small Build Status badge right below the title of the README
This repo is a good place to see how to add a badge - https://github.com/CultureHQ/github-actions-badge
This Medium guide is pretty helpful - https://medium.com/@AndreSand/adding-github-action-workflow-status-badge-to-your-repository-22ccea025af6
Scrape the Igbo terms along with their information from this site:
The data that gets scraped and parsed should be appended to the JSON dictionary objects.
Create a new branch dedicated to the source code for the front-facing site discussed in #21. Set up a basic Gatsby project in this branch.
If a user searches for a singular word that's in the phrases section of a term, that phrase should be returned with its information.
Convert the for
loop into .reduce() to clean up the code.
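A hypothetical before/after of the conversion; the loop body and function names are illustrative, not the project's actual code.

```javascript
// Before: accumulate all definitions with an index-based for loop
const collectDefinitionsWithFor = (words) => {
  const definitions = [];
  for (let i = 0; i < words.length; i += 1) {
    definitions.push(...words[i].definitions);
  }
  return definitions;
};

// After: the same accumulation expressed with .reduce()
const collectDefinitions = (words) => words.reduce(
  (definitions, word) => [...definitions, ...word.definitions],
  [],
);
```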
The main route api/v1/search/words should use MongoDB data while api/v1/test/words should use JSON data.
This is one step closer to consistently relying on the MongoDB data as a real site or service would.
Currently, the parsing script treats a new line for a definition or example as an entirely new example.
Instead, the script should determine whether a new line item is a continuation of a definition or example, or a completely new example, based on the top pixel differences.
If the script moves to a new row in the same column and the difference of tops is a clean 15px, then it's a continuation of that column's cell. If the difference between tops isn't a flat integer, it's a new cell in that column.
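The rule above can be sketched as a small predicate, assuming each parsed row exposes a numeric `top` pixel offset; the constant and names are illustrative.

```javascript
const CONTINUATION_STEP = 15; // a clean 15px difference means the same cell

// Returns true when the current row continues the previous cell: the tops
// must differ by exactly one clean integer step. Any fractional or
// different spacing starts a new cell in the column.
const isContinuation = (previousTop, currentTop) => {
  const difference = currentTop - previousTop;
  return Number.isInteger(difference) && difference === CONTINUATION_STEP;
};
```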
Create a quick site that uses the API and displays the results.
Instead of paginating through sets of ten words each time, the next page only introduces one new word while removing one term.
So instead of getting a set of ten new words, the frontend will only get one new word at a time.
Currently, definitions and examples are getting normalized. The normalization script is optimized to remove non-Igbo words, which isn't beneficial.
Now that the API has basic functionality to search for Igbo terms using either Igbo or English, it's time to release to the world.
The tools that I'm thinking about using include:
What are people's thoughts about these platforms? Is there something else we should consider using that could help down the road with scaling?
Now that this repo is able to parse the dictionary PDF and create a well-structured JSON object with terms and their information, it's time to start preparing this data to live in MongoDB.
This issue focuses on creating basic MongoDB documents and collections. Mongoose will be used to build out some basic schemas.
Word:
word - String
wordClass - String
definitions - Array[String]
phrases - Array[Phrase]
examples - Array[Example]

Phrase:
phrase - String
parentWord - Word
definition - String
examples - Array[Example]

Example:
example - String
parentPhrase - Phrase
parentWord - Word
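The three document shapes above can be sketched as plain-JavaScript factories; the real implementation would define these as Mongoose schemas, and the factory names here are illustrative.

```javascript
// Dependency-free sketch of the Word / Phrase / Example document shapes
const createWord = (word, wordClass) => ({
  word,            // String
  wordClass,       // String
  definitions: [], // Array[String]
  phrases: [],     // Array[Phrase]
  examples: [],    // Array[Example]
});

const createPhrase = (phrase, parentWord) => ({
  phrase,          // String
  parentWord,      // reference back to the owning Word
  definition: '',  // String
  examples: [],    // Array[Example]
});

const createExample = (example, parentPhrase, parentWord) => ({
  example,         // String
  parentPhrase,    // reference back to the owning Phrase, if any
  parentWord,      // reference back to the owning Word
});
```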
Write a few tests that check to see if the app is functioning properly.
After that, create a GitHub Actions workflow that creates a continuous integration pipeline that automatically runs the tests.
If there are more than 20 words that come back from the database, then the client should be able to paginate through the responses.
This will help with network response times.
Use the query param page to allow the client to specify which page of responses they want to see.
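A minimal sketch of the pagination logic, assuming `page` arrives as a query-param string and pages are zero-indexed; the page size matches the 20-word threshold above, and the names are illustrative.

```javascript
const WORDS_PER_PAGE = 20;

// Slice one page out of the full result set; invalid or missing page
// params fall back to the first page.
const paginate = (words, pageParam) => {
  const page = Math.max(parseInt(pageParam, 10) || 0, 0);
  const start = page * WORDS_PER_PAGE;
  return words.slice(start, start + WORDS_PER_PAGE);
};
```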
The current README file for this branch is the file auto-generated by the Gatsby starter. Create a custom README with project details to make the process easier for contributors.
According to react/forbid-prop-types, projects should stay away from PropTypes.object. Remove all instances of PropTypes.object and provide more detailed object structures.
Also, remove the react/forbid-prop-types rule from the 'rules' section of .eslintrc.json so that it gets enforced.
There are different ways a user can type out the same word. One user might insist on using accent marks, while another doesn't know how to use them. The API should still return the same information to the user.
For this to happen, there needs to be a hash function that knows how to map normalized and non-normalized text to the same word so the user gets the expected information.
There are numerous cases in the original PDF where words don't have any word class, definition, phrases, or examples.
Instead, it just says to see a different word.
Words in this situation shouldn't be considered standalone words in the JSON dictionary. Instead, each should be considered a variation of the word that it's telling the reader to see.
Scrape the data that's found in this site and append it to the JSON dictionary objects.
https://1000mostcommonwords.com/1000-most-common-igbo-words/
If the word doesn't exist in the JSON dictionary, create a new object where that object has definitions, word, wordClass, phrases, examples, and variations.
If the word already exists in the JSON dictionary, append that information.
A lot of words can have different spellings or variations.
When the project is parsing the HTML and building the JSON objects, a variations key should be placed on each word object. This will allow words to keep track of their variations.
In the Columbia PDF, variations are denoted by the use of commas in the far left column of the dictionary table. Each comma-separated term should be split into an array. The first word of the possible variations will be the key of the word object, and the subsequent words will be placed in the variations array.
The MongoDB Word schema should also be updated to capture this new variations key. It will be an array that contains strings.
The search functionality should also check each word's variations key to see if the searched keyword matches any of a given word's variations.
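The comma-splitting rule described above can be sketched as a small helper; the function name is illustrative.

```javascript
// Split a comma-separated headword cell into the main word plus its
// variations: the first term becomes the word, the rest go into the
// variations array.
const splitVariations = (headwordCell) => {
  const [word, ...variations] = headwordCell
    .split(',')
    .map((term) => term.trim())
    .filter((term) => term.length);
  return { word, variations };
};
```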
Build a normalization map where each key is the normalized term and the value is an array of all the non-normalized terms.
So whenever a user searches without tonal marks, the program can find the term as a key in the map and then grab all the term data for each of the words in the corresponding array.
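Building that map can be sketched as a reduce over all dictionary terms; the normalization here is a simple tone-mark strip standing in for whatever helper the project settles on, and the names are illustrative.

```javascript
// Stand-in normalizer: strip grave, acute, and macron tone marks
const normalizeTerm = (term) => term
  .normalize('NFD')
  .replace(/[\u0300\u0301\u0304]/g, '')
  .normalize('NFC');

// Map each normalized term to the array of accented forms that produce it
const buildNormalizationMap = (terms) => terms.reduce((map, term) => {
  const normalized = normalizeTerm(term);
  return {
    ...map,
    [normalized]: [...(map[normalized] || []), term],
  };
}, {});
```

A search without tonal marks would then look its keyword up as a key in this map and return the data for every term in the associated array.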
Currently, the phrases key has an object for its value. On that object, each key is a phrase that maps to an object with definitions and examples keys.
phrases should change from an object to an array of objects.
The updated shape is already represented in the MongoDB database, but it makes sense to keep the JSON up-to-date.
Before:
{
  "phrases": {
    "(agụū) -gụ": {
      "definitions": [
        "be hungry"
      ],
      "examples": []
    }
  }
}

After:
{
  "phrases": [
    {
      "phrase": "(agụū) -gụ",
      "definitions": [
        "be hungry"
      ],
      "examples": []
    }
  ]
}
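The shape change can be sketched as a one-off migration helper; the function name is illustrative.

```javascript
// Convert the old keyed-object phrases shape into the new array shape:
// each key becomes the `phrase` property of its own object.
const migratePhrases = (phrasesObject) => Object.entries(phrasesObject)
  .map(([phrase, { definitions, examples }]) => ({
    phrase,
    definitions,
    examples,
  }));
```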
Now that the API is deployed, we should mention it in the README.
Add a 'Try it Out' section near the top of the README that uses the following link:
The API demo site (see #21 ) has been set up in this branch: https://github.com/ijemmao/igbo_api/tree/gatsby-dev.
Implement error handling for the axios request to respond when a user search returns no words or an error is thrown.
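A sketch of the response-handling decision logic only; the axios call itself is omitted, and the result shape and message wording are assumptions.

```javascript
// Interpret a settled search result: an error produces an error message,
// an empty word list produces an empty-results message, and a non-empty
// list passes through unchanged.
const interpretSearchResult = ({ words, error }) => {
  if (error) {
    return { words: [], message: 'An error occurred while searching. Please try again.' };
  }
  if (!words || !words.length) {
    return { words: [], message: 'No words matched your search.' };
  }
  return { words, message: '' };
};
```

The axios `.then`/`.catch` handlers would feed into a helper like this so the UI always has either words or a user-facing message to render.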
Follow this guide to enable the database to be automatically migrated.
Database migrations will act as a form of version control, so if the database becomes invalidated or malformed, it can be rolled back.
The generated directories db/ and build/ need to be added to the .gitignore to prevent those files from accidentally entering this branch's codebase.
The Columbia paper has primary and secondary definitions for words using the prefixes A. and B.
parseAndBuild.js needs to move those phrases into a given word's definition property.
There are two normalized JSON dictionaries, ig-en_normalized.json and ig-en_normalized_expanded.json, that aren't used in the project and probably won't be used in the future.
Delete these two files along with the logic that's responsible for building them when the script yarn build:dictionaries is executed.
The project is currently using eslint:recommended, which I don't think is strong enough linting.
There are small formatting errors that fall through the cracks with this configuration.
So the .eslintrc.json file should use "extends": "airbnb".
Double-check with your text editor or IDE that it's set up correctly (it should show red error lines when a rule has been broken).
The following search features should be included to make searching easier for the user:
If kpo is provided, then the user should get kpọ.
Now that the project is able to move the JSON data into MongoDB, the API should start grabbing the data from MongoDB instead of the JSON files.
Create a new GET endpoint similar to the one that exists. When the query param keyword is provided, the word along with its resolved information should be included.
Add tests.
When an id is provided in the API route for a word or phrase, then the API should return back the object with that id.
After the initial version of the front site is complete, it would be nice to add a suggestions feature where users can request to see a change within the API.
I don't have any concrete ideas for how this feature will work. So I wanted to ask for people's opinions on what they think is the best way to capture user requests.
Here are a couple of ideas I've had so far:
On the front site, there would be a 'Suggestions' button where a user could input key information about what changes they want to see. When they submit that form, a new GitHub issue would be created
I was thinking about tracking changes in GitHub so it's easier to track, but are there concerns with mixing user-requested changes with technical implementation efforts?
Again on the front site, instead of using GitHub issues, any requests will be sent to an email
This approach would be nice because it would only house the issues found via the front site, but it's less accessible for future contributors to address.
Again on the front site, instead of issues or emails, we could make a new document in the database under a Request
model
This is my least favorite approach because it encourages human tampering with production-level data (once we've addressed the request we would have to go into the database to delete the document or update its status), but it's another thought I had
Also, a couple of more questions I had for people: how would we verify user-requested changes? What would the verification process look like? Would we want to
For example, -zu-zò doesn't include its phrase -zuzò èzuzò because it doesn't have a definition next to it.
Currently, the backend will search word definitions if they provide an English term. This search should be extended to phrase definitions that belong to a particular word.
This extension of the search functionality will help provide more search results to the client.
Words that have multiple definitions will have multiple strings inside their definitions array.
What happens often is that those strings in the definitions array are prefixed with the letters A., B., C., or any other letter to denote that there are multiple definitions.
Those letters should be removed so that the cleaned text serves as one of the definitions for a given term.
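The prefix removal above can be sketched as a single regex replace; the function name is illustrative.

```javascript
// Strip a leading capital-letter prefix like 'A. ' or 'B. ' from a
// definition string, leaving other definitions untouched.
const removeLetterPrefix = (definition) => definition.replace(/^[A-Z]\.\s*/, '');
```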
If people would want to join the Slack group, they should!
Add a section in the README that has the following link:
The following search features should be included to make searching easier for the user:
If n obe is provided, then the expected n’obe should be returned.
If bia is provided, then the expected -bia should be returned.
Currently, the Mongo objects that get returned from the backend to the frontend include the key _id, which violates the ESLint no-underscore-dangle rule.
So all the objects that will be returned to the frontend must be transformed to remove the _id key and replace it with an id key.
The __v key should be removed completely so the frontend doesn't know the version of the word, phrase, or example object.
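The transform described above can be sketched with destructuring, assuming the Mongo documents have already been converted to plain objects; the function name is illustrative.

```javascript
// Rename _id to id and drop __v before the object leaves the backend
const toClientObject = ({ _id, __v, ...rest }) => ({
  id: _id,
  ...rest,
});
```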
Create a git hook to check for any ESLint errors before attempting to push.
Currently, the plain strings left, center, and right are used throughout the buildDictionary function.
They should be placed in an enum for consistency.
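In plain JavaScript, a frozen object can stand in for an enum; the name and key casing are illustrative.

```javascript
// Column-alignment "enum": freezing prevents accidental mutation
const Alignment = Object.freeze({
  LEFT: 'left',
  CENTER: 'center',
  RIGHT: 'right',
});
```

Call sites in buildDictionary would then compare against `Alignment.LEFT` and friends instead of repeating the bare strings.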
A basic Gatsby project for this site (see #21 ) has been set up in this branch: https://github.com/ijemmao/igbo_api/tree/gatsby-dev.
Using Tailwind, add utility classes to improve appearance.
Currently, the API allows for Igbo to English search, but to further expand what this API can do, it needs English to Igbo search capability.
This issue doesn't focus on implementing a full-fledged English to Igbo search; instead, the main focus is to lay the groundwork for future related features.
The files and folders that are found in the top-level utils folder should be placed in the shared folder.
Everything inside utils/constants should be moved over to shared/constants.
The files directly inside utils should be moved over to shared/utils.
In order to make the search feature more scalable and easy to maintain, the data found in the dictionary JSON files need to be transferred into a MongoDB database.
Here are the current models that would be helpful:
- Term
  - Phrase documents
  - Example documents
- Phrase
  - Example documents
- Example
Note: Terms in bold are still a work in progress and might not be included in the document