This project creates a dataset from the NIST glossary of terms. The dataset is published to Kaggle here: TBD
- Clone the Repository:
git clone https://github.com/RonMallory/data-science-example.git nist-terminology-dataset
cd nist-terminology-dataset
- Setup with Poetry:
Ensure you have Poetry installed, then run:
poetry install
This command installs all the necessary dependencies specified in pyproject.toml.
This project uses pre-commit to maintain code quality and consistency. Install the configured hooks with:
pre-commit install
The initial dataset is published manually; subsequent updates to the dataset are published by GitHub Actions.
- Create an API token from your Kaggle account settings.
- Run the following command to generate the dataset.csv and kaggle-metadata.json files:
poetry run python src/main.py
- Run the following command to publish the dataset to Kaggle:
kaggle datasets create -p ./data
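As a rough illustration of what the generation step produces, the sketch below writes a CSV file and a metadata file into an output folder. It is not the project's src/main.py: the helper name `write_dataset`, the `term` and `definition` columns, and the metadata fields (`title`, `id`, `licenses`) are assumptions based on the general shape of Kaggle dataset metadata, and every value passed in is a placeholder.

```python
import csv
import json
from pathlib import Path


def write_dataset(rows, out_dir, owner, slug, title, license_name):
    """Write dataset.csv and kaggle-metadata.json into out_dir.

    Hypothetical helper for illustration; the real src/main.py may differ.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Column names are assumed here; the real dataset may use others.
    csv_path = out / "dataset.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["term", "definition"])
        writer.writeheader()
        writer.writerows(rows)

    # Metadata layout is assumed: a title, an "owner/slug" id, and a license list.
    metadata = {
        "title": title,
        "id": f"{owner}/{slug}",
        "licenses": [{"name": license_name}],
    }
    meta_path = out / "kaggle-metadata.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return csv_path, meta_path
```

A call such as `write_dataset(rows, "./data", "your-username", "your-dataset-slug", "Your Title", "CC0-1.0")` would leave both files in ./data, the folder that the publish command points at.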
- Using the kaggle.json file created during the initial publish, create two GitHub secrets named KAGGLE_USERNAME and KAGGLE_KEY.
- Once a pull request has been approved and merged into the main branch, the GitHub Action runs and updates the dataset.
- The ci.yml file uses the commit message to annotate the dataset with the changes made.
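The steps above can be sketched in Python as follows. This is an illustration, not the contents of ci.yml: the helper names `secrets_from_kaggle_json` and `version_command` are hypothetical. The sketch assumes that kaggle.json holds `username` and `key` fields and that the update is pushed with `kaggle datasets version`, whose `-m` option carries the version notes.

```python
import json


def secrets_from_kaggle_json(text):
    """Map the contents of kaggle.json to the two GitHub secret names."""
    creds = json.loads(text)
    return {
        "KAGGLE_USERNAME": creds["username"],
        "KAGGLE_KEY": creds["key"],
    }


def version_command(commit_message, data_dir="./data"):
    """Build a kaggle CLI call that uses the commit message as version notes."""
    # Assumption: only the first line of the commit message is used as notes.
    lines = commit_message.strip().splitlines()
    notes = lines[0] if lines else "Automated update"
    return ["kaggle", "datasets", "version", "-p", data_dir, "-m", notes]
```

The returned list can be passed to `subprocess.run` in an environment where KAGGLE_USERNAME and KAGGLE_KEY are set from the repository secrets.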
- Fork the project.
- Create a branch based on the DSLP strategy:
git checkout -b feature/new-feature
- Commit your changes:
git commit -am 'Add new feature'
- Push to the branch:
git push origin feature/new-feature
- Submit a pull request against the appropriate DSLP branch.
This project is licensed under the MIT License.