govuk-content-metadata's Introduction

GovNER: extracting Named Entities from GOV.UK

Repository for the GovNER project.

GovNER systematically extracts key metadata from the content of the GOV.UK website. GovNER is an encoder-based language model (RoBERTa) that has been fine-tuned to perform Named Entity Recognition (NER) on "govspeak", the language specific to the GOV.UK content estate.

The repository consists of 5 main stand-alone components, each contained in its own sub-directory and described in the sections below.

Tech Stack

  • Python
  • FastAPI / uvicorn
  • Docker
  • Google Cloud Platform (Compute Engine, Vertex AI, Workflows, Cloud Run, BigQuery, Cloud Storage, Scheduler)
  • GitHub Actions
  • Bash

Named Entity Recognition (NER) and Entity Schema

Named Entity Recognition (NER) is a Natural Language Processing (NLP) technique: a multi-class supervised machine-learning method that identifies and classifies 'entities' (real-world things such as people, organisations or events) in text.
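
As a quick illustration of what an NER model produces, here is a minimal sketch using an off-the-shelf spaCy pipeline (this assumes the small English model en_core_web_sm is installed; the GovNER models themselves are fine-tuned transformer pipelines):

import spacy

# Off-the-shelf English pipeline, used here purely for illustration
nlp = spacy.load("en_core_web_sm")

doc = nlp("HMRC published new guidance for employers in London on 1 April 2023.")

# Each entity span carries the detected text and its predicted entity type
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. HMRC ORG, London GPE, 1 April 2023 DATE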

The Named Entity Schema is the set of all entity types (i.e., categories) that the NER model is trained to extract, together with their definitions and annotation instructions. For GovNER, we built as much as possible on schema.org. Using an agile approach, delivery was broken down into 3 phases, corresponding to three sets of entity types, for which we fine-tuned separate NER models. We have so far completed 2 phases. Predictions from these models were combined at the inference stage (see the sketch after the entity lists below).

Phase-1 entities

  • Money (amount)
  • Form (government forms)
  • Person
  • Date
  • Postcode
  • Email
  • Phone (number)

Phase-2 entities

  • Occupation
  • Role
  • Title
  • GPE
  • Location (non-GPE)
  • Facility
  • Organisation
  • Event
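
A minimal sketch of how predictions from the two phase models can be combined at inference time; the model paths follow the layout used elsewhere in this repository, and the merge logic here is an assumption rather than the production code:

import spacy

# Load the two fine-tuned NER pipelines (paths assumed from this repository's layout)
nlp_phase1 = spacy.load("models/phase1_ner_trf_model/model-best")
nlp_phase2 = spacy.load("models/phase2_ner_trf_model/model-best")


def extract_entities(text: str) -> list:
    """Run both phase models on a text and concatenate their entity predictions."""
    entities = []
    for nlp in (nlp_phase1, nlp_phase2):
        doc = nlp(text)
        entities.extend(
            {"text": ent.text, "label": ent.label_, "start": ent.start_char, "end": ent.end_char}
            for ent in doc.ents
        )
    return entities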

Daily 'new content only' inference pipeline

Complete code, requirements and documentation in inference_pipeline_new_content.

The inference pipeline is scheduled to run daily and extracts named entities from the content items on GOV.UK that were substantially changed or newly created the day before.

Vertex AI batch predictions are requested via an HTTP POST call, as part of a scheduled Google Cloud Workflow.
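
For orientation, here is a minimal sketch of submitting an equivalent batch prediction job with the Vertex AI Python SDK; the project, region, bucket URIs and model resource name are placeholders, and in production the corresponding REST call is issued from the scheduled Workflow rather than from Python:

from google.cloud import aiplatform

# Placeholder project/region values for illustration only
aiplatform.init(project="my-gcp-project", location="europe-west2")

# Submit a batch prediction job that reads inputs from and writes outputs to Cloud Storage
job = aiplatform.BatchPredictionJob.create(
    job_display_name="govner-daily-new-content",
    model_name="projects/my-gcp-project/locations/europe-west2/models/1234567890",
    gcs_source="gs://my-bucket/new_content.jsonl",
    gcs_destination_prefix="gs://my-bucket/predictions/",
    sync=False,  # return immediately; the job runs asynchronously
)
print(job.resource_name)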

Serving the model in production via FastAPI and uvicorn

Complete code, requirements and documentation in fast_api_model_serving.

Containerised code to deploy and run an HTTP server that serves predictions via an API for our custom fine-tuned NER models.
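
A minimal sketch of such a FastAPI service; the model paths mirror those used elsewhere in this repository, while the endpoint name and response shape are assumptions rather than the exact production API:

import spacy
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Model paths assumed from this repository's layout
nlp_phase1 = spacy.load("models/phase1_ner_trf_model/model-best")
nlp_phase2 = spacy.load("models/phase2_ner_trf_model/model-best")


class TextIn(BaseModel):
    text: str


@app.post("/predict")
def predict(payload: TextIn) -> dict:
    """Return entities predicted by both phase models for a single text."""
    entities = []
    for nlp in (nlp_phase1, nlp_phase2):
        doc = nlp(payload.text)
        entities.extend({"text": ent.text, "label": ent.label_} for ent in doc.ents)
    return {"entities": entities}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)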

Bulk inference pipeline

Complete code, requirements and documentation in bulk_inference_pipeline.

Inference pipeline to extract named entities from the whole GOV.UK content estate (in "bulk"). The pipeline is deployed in a Docker container onto a Virtual Machine (VM) instance with GPU on Google Compute Engine (GCE).

The bulk pipeline is intended to be executed as a one-off whenever either the phase-1 or phase-2 entity model is retrained and re-deployed.
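
For illustration, a minimal sketch of the batched, multi-process spaCy inference the bulk pipeline performs; the model path and batching parameters are assumptions, and the real pipeline in bulk_inference_pipeline/extract_entities_cloud.py also handles reading from and writing to Google Cloud storage:

import spacy

# Model path assumed from this repository's layout
nlp = spacy.load("models/phase1_ner_trf_model/model-best")

# In practice these would be the text parts of the whole GOV.UK content estate
texts = ["First content item ...", "Second content item ..."]

# nlp.pipe streams texts in batches; n_process > 1 spreads work across CPU processes
for doc in nlp.pipe(texts, batch_size=32, n_process=2):
    print([(ent.text, ent.label_) for ent in doc.ents])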

Training pipeline

Complete code, requirements and documentation in training_pipe.

Pipeline to fine-tune the encoder-style transformer roberta-base for custom NER on Google Vertex AI, using a custom container training workflow and spaCy Projects for the training application.
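
Inside the training container, the workflow defined in the spaCy project.yml (a full example is reproduced in the issues section below) can be launched end-to-end. A minimal sketch of such an entrypoint, assuming the project files sit in the current working directory:

import subprocess

# Run the full spaCy project workflow ("all") defined in project.yml
subprocess.run(["python", "-m", "spacy", "project", "run", "all"], check=True)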

Annotation workflow

Complete code, requirements and documentation in prodigy_annotation.

Containerised code to create an annotation environment for annotators, using the proprietary software Prodigy.

GovNER web app

Complete code, requirements and documentation in src/ner_streamlit_app.

Containerised code to build the interactive web application aimed at helping prospective users understand how NER works via visualisation and user interaction.
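
A minimal sketch of how such a Streamlit app can visualise NER output with spaCy's displacy renderer; the model path and widget layout are assumptions, not the exact app in src/ner_streamlit_app:

import spacy
import streamlit as st
from spacy import displacy

# Model path assumed from this repository's layout
nlp = spacy.load("models/phase1_ner_trf_model/model-best")

st.title("GovNER demo")
text = st.text_area("Paste some GOV.UK content:", "Contact HMRC before 31 January 2024.")

if text:
    doc = nlp(text)
    # Render highlighted entities as HTML and embed the markup in the page
    html = displacy.render(doc, style="ent")
    st.markdown(html, unsafe_allow_html=True)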

Developing

Where we refer to the root directory, we mean the directory where this README.md is located.

Requirements

In addition, you will need the following:

Credentials

Access to the project on Google Cloud Platform.

Python requirements and pre-commit hooks

To install the Python requirements and pre-commit hooks, open your terminal and enter:

make requirements

or, alternatively, to only install the necessary Python packages using pip:

pip install -r requirements.txt

To add to the Python requirements file, add any new dependencies that are actually imported in your code to the requirements-original.txt file, and then run:

pip freeze -r requirements-original.txt > requirements.txt

Tests

Tests are run as part of a GitHub action.

To run the tests locally:

pytest

Licence

Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation. The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.

govuk-content-metadata's People

Contributors

exfalsoquodlibet, jakerutherford, rory-hurley-gds


govuk-content-metadata's Issues

Add a logger to fast_api_model_serving/main.py

I'm a fan of logging. It can help debug issues and understand the flow of the application. You could use Python's built-in `logging` library (others exist). Adding log statements to various parts of the code has some advantages over print statements.

e.g. here you could add logging around the spaCy model loading:

import logging

import spacy

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# ...

# instead of: print("Load the spacy models")
logger.info("Loading spacy models...")
nlp_phase1 = spacy.load("models/phase1_ner_trf_model/model-best")
nlp_phase2 = spacy.load("models/phase2_ner_trf_model/model-best")

Originally posted by @mammykins in #96 (comment)

Port NER pipeline to new input files

The Knowledge Graph bucket now gets the "new" files that replace the preprocessed content store. This means that the old preprocessed content store will soon be discontinued. This also means that work can now begin to port NER to the new files.

New input files

In S3 bucket: https://s3.console.aws.amazon.com/s3/buckets/govuk-data-infrastructure-integration/knowledge-graph

  • title.csv.gz,
  • description.csv.gz,
  • all the ones ending *_lines.csv.gz. The "lines" files have one line of content on each row of the CSV file, so there are no newline characters to deal with.

See https://github.com/alphagov/govuk-knowledge-graph/blob/main/src/data/extract_from_mongodb.sh from line 125 for info on fields.

Other new files needed because of metadata

  • document_type: document_type.csv.gz
  • publishing_app: publishing_app.csv.gz

In the future:

  • public_updated_at: public_updated_at.csv.gz
  • first_published_at: first_published_at.csv.gz

Note

There's no file for base_path, but you could strip https://www.gov.uk/ from the url file
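
A hypothetical pandas sketch of doing this (the url column name is an assumption):

import pandas as pd

# Derive base_path by stripping the domain from the url column of url.csv.gz
urls = pd.read_csv("url.csv.gz", compression="gzip")
urls["base_path"] = urls["url"].str.replace("https://www.gov.uk", "", regex=False)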

Affected NER files

  • src/make_data/infer_entities.py (minor)

  • src/make_data/infer_entities.sh (minor)

  • A couple of notebooks that use the pre-processed content stores as a shortcut when creating the training dataset. We won't update those; instead, we will ensure we keep a copy of the preprocessed content store for future auditing and add a note to the files. We won't be recreating these exact training sets anyway, so hopefully that is OK.

For future training sets, we will create new files/notebooks that use the new files as input.

Missing 'base_path' in 'meta' tag

For some of the 'earlier' annotated sentences, there was no base_path in the meta.

base_path values should be added, even if the value is 'unknown'.
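
A minimal sketch of backfilling a default base_path in Prodigy-style JSONL annotations; the file names here are hypothetical:

import json

# Add base_path = "unknown" to the meta of any annotated example that lacks it
with open("annotations.jsonl") as f_in, open("annotations_fixed.jsonl", "w") as f_out:
    for line in f_in:
        example = json.loads(line)
        meta = example.setdefault("meta", {})
        meta.setdefault("base_path", "unknown")
        f_out.write(json.dumps(example) + "\n")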

ToDo: Prodigy folder and Dockerfile

Rory has been using Prodigy locally prior to this. The files should be moved into a shared location, so they can be used by the wider project group and in Docker.

Transformer not running in bulk inference pipeline

from @exfalsoquodlibet

Note, this branch/PR contained changes to the branch update-bulk-inference-pipe-to-gcp.

At the moment:

On a GCE Virtual Machine instance (called bulk-inference-pipeline), the pipeline:

  • works (i.e., multiple CPUs are used, and entities are successfully extracted) if it loads spaCy's en-core-web-md off-the-shelf model (line 273 in bulk_inference_pipeline/extract_entities_cloud.py);
  • does not work if spaCy's off-the-shelf transformer model is used (line 272);
  • does not work if our own fine-tuned transformer model is used (line 271).

The code works (multiple CPUs used, etc.) if it is run on a local machine.

So my current thinking is that it may be an issue with the batch/multi-process code, the transformer models, and the configuration of the VM.

NOTE: currently only processing 40,000 items for each part of page (this is set via line 268 in bulk_inference_pipeline/extract_entities_cloud.py).

Empty "text" and/or "details" fields in pre-processed content store

The Issue

Some document types, e.g. contact and national statistical announcement, have no values in the pre-processed content store fields that usually contain content text (i.e., "details" and/or "details_parts"). This is because, for some of these document types, the content is contained in other metadata elements; for others, the content is generated by other rendering apps and so it is not in the Content Store.

For instance:

  • For contact document types (example), the information is contained in the phone_numbers and post_addresses fields;

  • For national statistical announcements (example), the page is a redirect, so it is expected to be empty.

Quantifying the issue

There are other cases; you can see an overview of how many base_paths per document type have missing values for the above fields by running the code below.

From the terminal, download a copy of the pre-processed content store to /tmp/preproc_store/ (change the date accordingly if you want a different copy):

gds aws govuk-integration-datascience --assume-role-ttl 480m aws s3 cp s3://govuk-data-infrastructure-integration/knowledge-graph/2022-05-25/preprocessed_content_store_250522.csv.gz /tmp/preproc_store/

Then, from within an activated Python environment:

from functools import reduce

import pandas as pd

# Load the pre-processed content store
df = pd.read_csv('/tmp/preproc_store/preprocessed_content_store_250522.csv.gz', compression='gzip', header=0, sep="\t")

# Count base_paths per document type, overall and for each kind of missing content field
all_counts = pd.DataFrame(df[['base_path', 'document_type']].document_type.value_counts())
empty_text = pd.DataFrame(df[df.text.isnull()][['base_path', 'document_type', 'text']].document_type.value_counts())
empty_details_part = pd.DataFrame(df[df.details_parts.isnull()][['base_path', 'document_type', 'details_parts']].document_type.value_counts())
empty_details = pd.DataFrame(df[[d == '{}' for d in df.details]][['base_path', 'document_type', 'details']].document_type.value_counts())

# Merge the counts into a single table and save it to CSV
output = reduce(lambda df_left, df_right: pd.merge(df_left, df_right, left_index=True, right_index=True, how='outer'), [all_counts, empty_text, empty_details, empty_details_part])
output.columns = ['tot_paths_count', 'n_paths_empty_text', 'n_paths_empty_details', 'n_paths_empty_details_parts']
output.to_csv("paths_empty_content.csv")

A copy of the output is attached to this post.
paths_empty_content.csv

What can we do?

  • For some of these document types, we do not care: we are not interested in them because we know their content is not in the Content Store.
  • For others, like Contact, we will need to modify the preprocessing steps and extract the content of interest via ad-hoc functions that query the relevant metadata elements.

DRAFT - spacy project pull is not downloading all files

Hi,

Our team is encountering problems when using spacy project pull to fetch outputs from the default remote (a bucket on Google Storage):

  • the command does not fetch all the outputs (it usually fails to download training/model-best and metrics/metrics.json, as per the project.yml file below);
  • the command does not fetch any outputs at all when they were created by another team member working on the same project;
  • the folder/asset structure is, however, recreated, albeit with folders partially or fully empty.

Checking the output hashes on the remote, these are updated as expected when the pipeline is re-run and seem to reflect the changes that have occurred.

We are not sure whether this is a problem with our project.yml file, or perhaps a misunderstanding on our part of the underlying workings of spacy project push/pull, which we hope someone could clarify.

Environment specification

The spacy project pipeline is run within a Docker container deployed on a Google Vertex AI VM with GPU:

  • Base image : nvidia/cuda:11.2.1-runtime-ubuntu20.04
  • Python version: 3.8
  • spacy[cuda112]

project.yml

The data asset is available locally and is bundled into the container when the Docker image is built.

vars:
  config: "config.cfg"
  gpu_id: 0
  files:
    train_file: "data.jsonl"
  prodigy:
    prodigy-dataset: "dataset-from-data"
  gcp_storage_remote: "gs://$PROJECT/$BUCKET"

remotes:
  default: '${vars.gcp_storage_remote}'

directories: ["assets", "training", "configs", "metrics", "corpus"]

assets:
  - dest: "assets/${vars.files.train_file}"
    description: "JSONL-formatted training data exported from Prodigy"

workflows:
  all:
    - get-assets
    - db-in
    - create-config
    - data-to-spacy
    - train_spacy
    - evaluate
    - push_remote

commands:

  - name: get-assets
    script:
      - "python3 -m spacy project assets"
    help: "Fetch project assets"

  - name: db-in
    help: "Load the annotated .jsonl file in as a prodigy database."
    script:
      - "python3 -m prodigy db-in ${vars.prodigy.prodigy-dataset} assets/${vars.files.train_file}"
    deps:
      - "assets/${vars.files.train_file}"

  - name: create-config
    help: "Initialise and save a config.cfg file using the recommended settings for your use case"
    script:
      - "python3 -m spacy init config configs/${vars.config} --lang en --pipeline transformer,ner --optimize accuracy --gpu --force"
    outputs:
      - "configs/${vars.config}"

  - name: data-to-spacy
    help: "Convert annotated data to spaCy's binary format and create a train and a dev set based on the provided split threshold"
    script:
      - "python3 -m prodigy data-to-spacy ./corpus --ner ${vars.prodigy.prodigy-dataset} --eval-split 0.2 --verbose --config configs/${vars.config}"
    outputs:
      - "corpus/train.spacy"
      - "corpus/dev.spacy"

  - name: train_spacy
    help: "Train a named entity recognition model with spaCy"
    script:
      - "python3 -m spacy train configs/${vars.config} --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --gpu-id ${vars.gpu_id}"
    deps:
      - "corpus/train.spacy"
      - "corpus/dev.spacy"
    outputs:
      - "training/model-best"

  - name: "evaluate"
    help: "Evaluate the model and export metrics"
    script:
      - "python -m spacy evaluate training/model-best corpus/dev.spacy --output metrics/metrics.json"
    deps:
      - "corpus/dev.spacy"
      - "training/model-best"
    outputs:
      - "metrics/metrics.json"

  - name: push_remote
    help: "Push outputs to remote"
    script:
      - "python3 -m spacy project push default"
    deps:
      - "training/model-best"
      - "metrics/metrics.json"

  # clean up files (not in workflow by default)
  - name: clean
    help: "Remove intermediate files"
    script:
      - "rm -rf training/*"
      - "rm -rf metrics/*"
      - "rm -rf corpus/*"
