wellcomecollection / catalogue-pipeline


:oil_drum: The data pipeline services extracting & transforming data from our museum and collections.

Home Page: https://developers.wellcomecollection.org/catalogue

License: MIT License

Scala 85.69% Python 7.45% Shell 0.83% Dockerfile 0.22% HCL 5.51% JavaScript 0.04% TypeScript 0.26% Makefile 0.01%

wellcome-digital-platform

catalogue-pipeline's People

pollecuttn weco-bot

catalogue-pipeline's Issues

Scored colour filtering (colour querying)

Create display model for new location model

Transform sierra data into location types

Ref #817

On the bib records:

Type	Label	Sierra mapping
OpenShelves	Open shelves	whml
ClosedStores	Closed stores	stax, arch
DigitalResource	Online	elro, digi
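
The mapping in the table above could be expressed as a simple lookup. This is a minimal sketch only: the function name `location_type_for` and the choice to return `None` for unmapped codes are assumptions for illustration, not the pipeline's actual transformer code.

```python
from typing import Optional

# Sierra location code -> location type, copied from the table above.
SIERRA_LOCATION_TYPES = {
    "whml": "OpenShelves",
    "stax": "ClosedStores",
    "arch": "ClosedStores",
    "elro": "DigitalResource",
    "digi": "DigitalResource",
}

# Location type -> display label, also from the table above.
LOCATION_LABELS = {
    "OpenShelves": "Open shelves",
    "ClosedStores": "Closed stores",
    "DigitalResource": "Online",
}


def location_type_for(sierra_code: str) -> Optional[str]:
    """Return the location type for a Sierra code, or None if unmapped.

    Returning None for unknown codes is an assumption; the real
    transformer might raise or fall back to a default instead.
    """
    return SIERRA_LOCATION_TYPES.get(sierra_code.strip().lower())
```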

Create dashboards showing results and effect of test

Catalogue API logging to Logstash is not working

In order to save on running costs and enable opportunities for handling logs more effectively we want to log to an ELK stack.

There is no logstash service running in the Catalogue API cluster and consequently the API cannot log to elasticsearch through it.

ID search

Currently IDs can carry weird phrases such as "External Reference from Nuffield department of anaethestics, Oxford" on https://wellcomecollection.org/works/hkqtbg3k.

The current way we index and query means that query=Oxford matches, and it shouldn't.

Indexing IDs as lowercase keywords would solve most of these cases, but might remove the ability to search for multiple IDs, or for IDs as part of a longer query. That might not be an issue.
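
To illustrate the trade-off, here is a small pure-Python simulation of the two matching behaviours. It does not call Elasticsearch; `matches_analysed` and `matches_keyword` are hypothetical helpers that only approximate an analysed text field and a lowercase keyword field.

```python
import re


def analysed_tokens(value: str) -> set:
    """Roughly what an analysed text field does: lowercase and split into words."""
    return set(re.findall(r"\w+", value.lower()))


def matches_analysed(query: str, identifier: str) -> bool:
    """True if every query token appears somewhere in the identifier."""
    return analysed_tokens(query) <= analysed_tokens(identifier)


def matches_keyword(query: str, identifier: str) -> bool:
    """Roughly a lowercase keyword field: the whole value must match."""
    return query.strip().lower() == identifier.strip().lower()


identifier = "External Reference from Nuffield department of anaethestics, Oxford"

# Analysed matching lets a single word in the identifier match the query.
print(matches_analysed("Oxford", identifier))
# Keyword matching only matches the complete identifier, ignoring case.
print(matches_keyword("Oxford", identifier))
print(matches_keyword(identifier.upper(), identifier))
```

The keyword approach also shows the cost mentioned above: a query containing an ID plus other words would no longer match that ID.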

Get new pipeline with title mapping out

Contributor / concepts fields

Add ascii folding to concepts
Boost contributors to more than title, but title above other concepts

Create a new query focussing on returning titles of known existing works

Known errors:

Yokai

Terms
Yōkai works
Yokai does not work

Result
Pandemonium and parade [electronic resource] : Japanese monsters and the culture of Yōkai

https://wellcomecollection.org/works/x2zbjrgp
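
The Yokai case is a diacritic folding problem. The sketch below approximates folding with Python's `unicodedata` module; it is not Elasticsearch's ASCII folding filter, which handles more characters, and `fold_to_ascii` is a hypothetical helper name.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Strip combining marks after decomposing, e.g. 'ō' becomes 'o'.

    This only approximates ASCII folding: characters with no
    decomposition (such as 'ø') are left unchanged here.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# With folding applied to both the indexed title and the query,
# the two spellings compare equal.
print(fold_to_ascii("Yōkai").lower() == fold_to_ascii("Yokai").lower())
```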

Two monsoons

Result is #15 which seems way too far down.

Terms
Two monsoons, Theon Wilkinson

Result
https://wellcomecollection.org/works/hgcbf8q9

(from standup) - search problems for known (or approximately known) titles

My known item was this - https://wellcomecollection.org/works/km3uczhf
My query was the transformations of insects
Only one result that looked even vaguely relevant on p1. This book appears halfway through p2.

My known item was this - https://wellcomecollection.org/works/ysmqsfhg
My query was the biological basis of medicine (which in this case is its exact title)
This result appears towards the end of all the results on page 3

There are many works that appear in both sets of results. I wondered if that's because both queries are the xxx of yyy but the result titles don't support that.

(why was I searching... to find the work IDs to use in this comment - wellcomecollection/platform#4603 (comment))

ID search

Researchers can search for IDs in a variety of different ways.

Healthchecks for post-stage deployment

Ref #898

After a deployment to the stage API, we should run some tests against the following API paths:

/works
/works?query={query}
/works/{id}
/images
/images?query={query}
/images/{id}

On the front end, the following paths should be checked:

/visit-us
/whats-on
/stories
/collections
/what-we-do

All of these should return a 200.
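
A healthcheck along these lines could be sketched as follows. The path lists come from this issue; the base URL, the values substituted for `{query}` and `{id}`, and the function names are all assumptions for illustration.

```python
from typing import Callable, Dict, Iterable, Mapping
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

# Paths listed in this issue.
WORKS_PATHS = ["/works", "/works?query={query}", "/works/{id}"]
IMAGES_PATHS = ["/images", "/images?query={query}", "/images/{id}"]
FRONTEND_PATHS = ["/visit-us", "/whats-on", "/stories", "/collections", "/what-we-do"]


def fetch_status(url: str) -> int:
    """Return the HTTP status for a GET request (connection errors propagate)."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.status
    except HTTPError as error:
        return error.code


def failing_paths(
    base_url: str,
    paths: Iterable[str],
    params: Mapping[str, str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, int]:
    """Return {url: status} for every path that did not respond with 200."""
    encoded = {key: quote(str(value), safe="") for key, value in params.items()}
    failures = {}
    for template in paths:
        url = base_url.rstrip("/") + template.format(**encoded)
        status = fetch(url)
        if status != 200:
            failures[url] = status
    return failures
```

Passing `fetch` as a parameter keeps the check testable without network access; a CI step could fail the build whenever the returned dictionary is non-empty.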

Image versioning should reflect changes to sources

add proper batching to the palette encoder

Known item searches

Do we know that known item searching is better and we can move on?
Analysis / rank_eval

Searches where we would expect the first result to be a known item, but it isn't.

query: twenty one things
expected: https://wellcomecollection.org/works/fjbxucnf

query: maurice wilkins archive
expected: https://wellcomecollection.org/works/rq5g8r9g
When searching for the phrase maurice wilkins archive, I get the item-level and series-level results as expected, but the collection-level record does not obviously appear on the first page of results; when filtering down to Archive collection, it shows me the Francis Crick collection (@jennpb)
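
The known item cases above could be turned into a repeatable check. This is a sketch under assumptions: `search` stands in for whatever function returns an ordered list of work IDs for a query, and the helper names are hypothetical. The queries and expected work IDs are the ones listed in this issue.

```python
from typing import Callable, Dict, List, Optional

# query -> expected work ID, taken from the examples above.
KNOWN_ITEMS = {
    "twenty one things": "fjbxucnf",
    "maurice wilkins archive": "rq5g8r9g",
}


def rank_of(expected_id: str, result_ids: List[str]) -> Optional[int]:
    """1-based position of the expected work, or None if it is absent."""
    try:
        return result_ids.index(expected_id) + 1
    except ValueError:
        return None


def failing_known_items(
    search: Callable[[str], List[str]],
    known_items: Dict[str, str] = KNOWN_ITEMS,
) -> Dict[str, Optional[int]]:
    """Return {query: rank} for every query whose known item is not ranked first."""
    failures = {}
    for query, expected_id in known_items.items():
        rank = rank_of(expected_id, search(query))
        if rank != 1:
            failures[query] = rank
    return failures
```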

Location type aggregation

Use search landing page tags to surface intentions and measure relevance

Could we surface example queries for the different intentions we are aware of?

Add top level archive collections in search results

As a way for users to find and explore collections, we want to add top level collections to the search results.

We can do this quite easily by just adding archive-collection to the list of current workTypes we allow through.

This would add 1173 results to the list, which is insignificant in terms of affecting the tf/idf.

We will be adding and working via the rank_eval tests to ensure they don't skew the current query.

The main issue, however, is that we also have the filter items.locations.locationType=iiif-image,iiif-presentation, which none of the archive-collections have, so they will not be returned.

Ping @harrisonpim @alicerichmond @taceybadgerbrook

FixedFields: Scoring tiers working correctly

We were querying on the wrong fields; fixing this made more of the tiers work as intended.

#338

Download images in inference manager

Create ranked documents from search intentions document

Establish minimal reproducible example for ES `array_index_out_of_bounds_exception`

First stab at location model

Look into merging multiple MIRO records into a sierra record

https://github.com/wellcomecollection/catalogue/pull/521/files#r409647521

score according to best match in contributors list

i think we can still improve the way we're scoring contributors - we should be looking for the best matches within each contributor in the list, not the overall match across the list of contributors

eg. when searching for william blake,
{id: "12345", "contributors": ["william blake", "someone else"]}
should be scored much higher than
{id: "67890", "contributors": ["william someone", "blake else"]}

i'm seeing instances of the second example being given the same score as the first
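
The difference between the two scoring strategies can be sketched in plain Python. This is not how Elasticsearch computes scores; `token_overlap` is a deliberately simple stand-in (the fraction of query tokens found), and all function names are hypothetical.

```python
from typing import List


def token_overlap(query: str, text: str) -> float:
    """Fraction of query tokens that appear in the text."""
    query_tokens = query.lower().split()
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return sum(token in text_tokens for token in query_tokens) / len(query_tokens)


def whole_list_score(query: str, contributors: List[str]) -> float:
    """Score the query against all contributors joined together."""
    return token_overlap(query, " ".join(contributors))


def best_contributor_score(query: str, contributors: List[str]) -> float:
    """Score each contributor separately and keep the best match."""
    return max((token_overlap(query, name) for name in contributors), default=0.0)


first = ["william blake", "someone else"]
second = ["william someone", "blake else"]

# Matching across the whole list cannot tell the two documents apart.
print(whole_list_score("william blake", first), whole_list_score("william blake", second))
# Taking the best match within each contributor ranks the first document higher.
print(best_contributor_score("william blake", first), best_contributor_score("william blake", second))
```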

group searches by unique anonymous_id and unique query

Attach relevance information to search results

See this PR for an example:
#419

Colour filtering frontend test

Testing from slack

Shake your body

Create filters for locations

?items.locations.type=OpenShelves,ClosedStores,DigitalResource
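
A sketch of how such a filter might be parsed and applied. The parameter name and the three types come from this issue; the shape of the `work` dictionary, the decision to reject unknown types, and the function names are assumptions for illustration.

```python
from typing import Set
from urllib.parse import parse_qs

VALID_LOCATION_TYPES = {"OpenShelves", "ClosedStores", "DigitalResource"}


def parse_location_types(query_string: str) -> Set[str]:
    """Read the comma-separated items.locations.type values from a query string."""
    values = parse_qs(query_string.lstrip("?")).get("items.locations.type", [])
    requested = {part.strip() for value in values for part in value.split(",") if part.strip()}
    unknown = requested - VALID_LOCATION_TYPES
    if unknown:
        # Rejecting unknown types is an assumption; an API could ignore them instead.
        raise ValueError(f"Unknown location types: {sorted(unknown)}")
    return requested


def work_matches(work: dict, wanted: Set[str]) -> bool:
    """True if any location on any item has one of the wanted types."""
    if not wanted:
        return True  # no filter supplied
    return any(
        location.get("type") in wanted
        for item in work.get("items", [])
        for location in item.get("locations", [])
    )
```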

Add palette inferrer to inference service

move common inferrer functionality to package in pypi

Our two prod image inferrers (features, palettes) share loads of functionality, and therefore loads of code. A quick look suggests that ~75% of their code is shared, and we're now starting to duplicate even more in the concepts store.

It makes sense to pull that shared functionality out into a common library, which could then be installed from PyPI.

Write integration tests against intention examples

test LAB colour space conversion

Color palette similarity query

Location types

We currently store Sierra location codes in the locationType of a Location.

This isn't useful, and it potentially leaks information about where and how we store rare materials in our closed stores.

We could split these into ClosedStores and OpenShelves and map those from the location codes from Sierra.

Here's an example of some before and afters.

It has been said that you can infer this from the prefix of the Sierra location code: a wg prefix means the item is on the open shelves.

We should leave the DigitalLocations as is (IIIFPresentation and IIIFImage) for backwards compatibility.

Write script to determine if deployment is successful for CI

Ref #898

Write a script that can compare the deployment:label tag of a service against the tasks running within that service to determine that a service has successfully deployed.

This can then be added to the release to stage step of the build as a "completed" check.
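
The comparison itself could look something like the sketch below. It works on plain dictionaries of tags rather than calling any AWS API, and it assumes that running tasks carry the same `deployment:label` tag as the service; both the data shapes and the function name are hypothetical.

```python
from typing import Dict, List

DEPLOYMENT_TAG = "deployment:label"


def deployment_complete(
    service_tags: Dict[str, str],
    running_task_tags: List[Dict[str, str]],
    tag_key: str = DEPLOYMENT_TAG,
) -> bool:
    """True when every running task carries the service's deployment label.

    A service with no label, or with no running tasks, is treated as
    not yet deployed (an assumption made for this sketch).
    """
    expected = service_tags.get(tag_key)
    if expected is None or not running_task_tags:
        return False
    return all(tags.get(tag_key) == expected for tags in running_task_tags)
```

A CI step could poll this check until it returns True or a timeout is reached.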

search dashboard

audit

metrics

filters

works/images
page of results
number of tokens in query

Deploy new pipeline with ID search

Search image identifiers on works

https://wellcome.slack.com/archives/C016NQB58N4/p1597935859065100

Post stage-release testing

Currently different members across the team have different smoke tests they run before deploying to prod.

One set of tests is running the experience app against the staging API. It would be good to have these Dockerised and accessible to the whole team to help with a smoother, more confident API deployment.

This will add confidence to our releases as we add and transform a whole load of new data into the API.

This will then be available to CI where we're starting to deploy the API via CI to stage 🥳 .

After we have run some post-stage tests, we will be closer to being fully Agile™ and continuously deploying to prod.

Release new image merging rules

Work type filtering

analyse filter usage and overlaps

Add glossary to relevance documentation

Try (again) to fix matcher performance

Location filters

Add location filters to the API.

Locations

Type	Label
OpenShelves	Open shelves
ClosedStores	Closed stores
DigitalResource	Online

We're removing the current locationType (iiif-presentation, iiif-image) and will add filtering on the above locations via the items.locations.type query parameter, e.g. ?items.locations.type=OpenShelves,ClosedStores.

These will be aggregated, so they are available to the frontend with the above labels and counts of how many are in each aggregation, the same way we do with workType. e.g.

Note: I've just nabbed the interface with what is live, it will follow whatever filter pattern we have at the time of implementation.
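
As a rough illustration of the aggregation, the sketch below counts works per location type and attaches the labels from the table above. The output shape is hypothetical and is not the API's actual aggregation format; counting each work once per type is also an assumption.

```python
from collections import Counter
from typing import Dict, List

# Type -> label, from the table above.
LOCATION_LABELS = {
    "OpenShelves": "Open shelves",
    "ClosedStores": "Closed stores",
    "DigitalResource": "Online",
}


def aggregate_location_types(works: List[dict]) -> List[Dict[str, object]]:
    """Count how many works have at least one location of each type."""
    counts: Counter = Counter()
    for work in works:
        types_on_work = {
            location["type"]
            for item in work.get("items", [])
            for location in item.get("locations", [])
        }
        counts.update(types_on_work)  # each work counts once per type
    return [
        {"type": type_id, "label": label, "count": counts[type_id]}
        for type_id, label in LOCATION_LABELS.items()
        if counts[type_id]
    ]
```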

Fix deployment IDs being applied to services

Ref #898

weco-deploy needs to tag the deployed service with deployment:label: deploymentID.

This will then allow us to check tasks against this ID to determine when a service has finished being deployed.

Add principles to relevance documentation

Following a conversation with @jamesgorrie about adding fuzziness to queries: while this is a clear intention, it should really be applied to all queries, rather than adding a new query to the tiers. Things like this should sit in a set of principles, e.g.

Principles

we should try to match all tokens in a query
all tokens should have fuzziness applied to them
etc
