
wellcomecollection / catalogue-pipeline


:oil_drum: The data pipeline services extracting & transforming data from our museum and collections.

Home Page: https://developers.wellcomecollection.org/catalogue

License: MIT License

Scala 85.69% Python 7.45% Shell 0.83% Dockerfile 0.22% HCL 5.51% JavaScript 0.04% TypeScript 0.26% Makefile 0.01%
wellcome-digital-platform

catalogue-pipeline's People

Contributors

agnesgaroux, alexwlchan, alicefuzier, dependabot[bot], georgiaewhitney, harrisonpim, jamesgorrie, jamieparkinson, jtweed, kenoir, melanierogan, mklander, paul-butcher, rcantin-w, stepanbrychta, taceybadgerbrook, warrd, weco-bot


catalogue-pipeline's Issues

Catalogue API logging to Logstash is not working

In order to save on running costs and enable opportunities for handling logs more effectively we want to log to an ELK stack.

There is no Logstash service running in the Catalogue API cluster, so the API cannot log to Elasticsearch through it.

ID search

Currently IDs can carry weird phrases such as External Reference from Nuffield department of anaethestics, Oxford on https://wellcomecollection.org/works/hkqtbg3k.

The current way we index and query means that query=Oxford matches, and shouldn't.

Indexing IDs as lowercase keywords would solve most of these cases. It might remove the ability to search for multiple IDs at once, or for IDs as part of a longer query, but that might not be an issue.
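A minimal sketch of the difference between the two behaviours, in plain Python. The index body shows how a keyword field with a lowercase normalizer could be declared in Elasticsearch; the field name `identifiers` and the normalizer name are assumptions for illustration, not the pipeline's real mapping. The two helper functions only simulate the matching behaviour.

```python
# Hypothetical index body: a keyword field with a lowercase normalizer, so an
# identifier is indexed as one lowercased term instead of many tokens.
ID_INDEX_BODY = {
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_normalizer": {
                    "type": "custom",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Field name is an assumption, not the real mapping.
            "identifiers": {
                "type": "keyword",
                "normalizer": "lowercase_normalizer",
            }
        }
    },
}


def token_match(query: str, identifier: str) -> bool:
    """Simulates analysed text: any shared lowercased token counts as a match."""
    id_tokens = set(identifier.lower().replace(",", " ").split())
    return bool(set(query.lower().split()) & id_tokens)


def keyword_match(query: str, identifier: str) -> bool:
    """Simulates a lowercase keyword: the whole value has to match."""
    return query.strip().lower() == identifier.strip().lower()
```

With the identifier quoted above, `token_match("Oxford", ...)` matches while `keyword_match("Oxford", ...)` does not, which is the behaviour this issue asks for. The trade-off is also visible: a query containing two IDs would no longer match either of them.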

Create a new query focussing on returning titles of known existing works

Known errors:

Yokai

Terms
Yōkai works
Yokai does not work

Result
Pandemonium and parade [electronic resource] : Japanese monsters and the culture of Yōkai

https://wellcomecollection.org/works/x2zbjrgp
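One possible approach to the Yokai case (an assumption, not something decided in this issue) is to fold diacritics on both the indexed title and the query, similar to what Elasticsearch's `asciifolding` token filter does. A small standard-library sketch of the idea:

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks so that 'Yōkai' and 'Yokai' compare equal."""
    # NFKD splits 'ō' into 'o' followed by a combining macron.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def titles_match(query: str, title_term: str) -> bool:
    """Compare two terms after folding diacritics and lowercasing."""
    return fold_diacritics(query).lower() == fold_diacritics(title_term).lower()
```

Folding both sides means a search for either spelling would find the work, at the cost of no longer distinguishing words that differ only by diacritics.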


Two monsoons

Result is #15 which seems way too far down.

Terms
Two monsoons, Theon Wilkinson


(from standup) - search problems for known (or approximately known) titles

  1. My known item was this - https://wellcomecollection.org/works/km3uczhf
    My query was the transformations of insects
    Only one result that looked even vaguely relevant on p1. This book appears halfway through p2.
  2. My known item was this - https://wellcomecollection.org/works/ysmqsfhg
    My query was the biological basis of medicine (which in this case is its exact title)
    This result appears towards the end of all the results on page 3
    There are many works that appear in both sets of results. I wondered if that's because both queries are of the form "the xxx of yyy", but the result titles don't support that.
    (Why was I searching? To find the work IDs to use in this comment: wellcomecollection/platform#4603 (comment))


Known item searches

  • Do we know that known-item searching is better, so that we can move on?
  • Analysis / rank_eval

Searches where we would expect the first result to be the known item, but it isn't.

query: twenty one things
expected: https://wellcomecollection.org/works/fjbxucnf


query: maurice wilkins archive
expected: https://wellcomecollection.org/works/rq5g8r9g
When searching for the phrase maurice wilkins archive, I get the item-level and series-level records as expected, but the collection-level record does not obviously appear on the first page of results. When filtering down to Archive collection, it shows me the Francis Crick collection (@jennpb).
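The expectations above could be captured as a request body for Elasticsearch's `_rank_eval` API. This is a sketch only: the index name `works`, the `title` field, and the assumption that a document's `_id` equals the work ID in the URL are all hypothetical and would need checking against the real index.

```python
def known_item_rank_eval(index: str, cases, field: str = "title") -> dict:
    """Build a _rank_eval body asking that each known item is the first hit."""
    return {
        "requests": [
            {
                "id": query.replace(" ", "_"),
                "request": {"query": {"match": {field: query}}},
                # Assumes the document _id is the work ID from the URL.
                "ratings": [{"_index": index, "_id": work_id, "rating": 1}],
            }
            for query, work_id in cases
        ],
        # precision at k=1: only the top result counts.
        "metric": {"precision": {"k": 1, "relevant_rating_threshold": 1}},
    }


# The two known-item cases listed in this issue.
CASES = [
    ("twenty one things", "fjbxucnf"),
    ("maurice wilkins archive", "rq5g8r9g"),
]

body = known_item_rank_eval("works", CASES)
```

Keeping the cases as plain data makes it easy to add each newly reported known-item failure as one more tuple.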

Add top level archive collections in search results

As a way for users to find and explore collections, we want to add top level collections to the search results.

We can do this quite easily by just adding archive-collection to the list of current workTypes we allow through.

This would add 1173 results to the list, which is insignificant in terms of affecting the tf/idf.

We will add to and work through the rank_eval tests to ensure the new collections don't skew the current query.

The main issue, however, is that we also filter on items.locations.locationType=iiif-image,iiif-presentation, which none of the archive-collections have, so they will not be returned.

Ping @harrisonpim @alicerichmond @taceybadgerbrook

score according to best match in contributors list

i think we can still improve the way we're scoring contributors - we should be looking for the best matches within each contributor in the list, not the overall match across the list of contributors

eg. when searching for william blake,
{id: "12345", "contributors": ["william blake", "someone else"]}
should be scored much higher than
{id: "67890", "contributors": ["william someone", "blake else"]}

i'm seeing instances of the second example being given the same score as the first
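a rough sketch of the difference in plain Python (not the real Elasticsearch scoring, just the idea): `pooled_score` matches query tokens against the whole list, `best_contributor_score` matches them within each contributor and keeps the best one. both function names are made up for illustration.

```python
def _tokens(text: str) -> set:
    return set(text.lower().split())


def pooled_score(query: str, contributors: list) -> float:
    """Fraction of query tokens found anywhere in the contributors list."""
    query_tokens = _tokens(query)
    if not query_tokens:
        return 0.0
    pooled = set()
    for name in contributors:
        pooled |= _tokens(name)
    return len(query_tokens & pooled) / len(query_tokens)


def best_contributor_score(query: str, contributors: list) -> float:
    """Fraction of query tokens found within the single best contributor."""
    query_tokens = _tokens(query)
    if not query_tokens or not contributors:
        return 0.0
    return max(
        len(query_tokens & _tokens(name)) / len(query_tokens)
        for name in contributors
    )
```

for the two documents above, `pooled_score` gives both the same score for william blake, while `best_contributor_score` ranks the first document higher.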

move common inferrer functionality to package in pypi

Our two prod image inferrers (features, palettes) share loads of functionality, and therefore loads of code. A quick look suggests that ~75% of their code is shared, and we're now starting to duplicate even more in the concepts store.

It makes sense to pull that shared functionality out into a common library, which could then be published to and installed from PyPI.

Location types

We currently store Sierra location codes in the locationType of a Location.

This isn't useful, and potentially leaks information about where and how we store rare materials in our closed stores.

We could split these into ClosedStores and OpenShelves and map those from the Sierra location codes.

Here's an example of some before and afters.

It has been said that you can infer this from the prefix of the Sierra location code: wg means the item is on the open shelves.

We should leave the DigitalLocations as is (IIIFPresentation and IIIFImage) for backwards compatibility.
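A minimal sketch of the mapping, assuming the "wg means open shelves" rule above holds and that every other physical code belongs to the closed stores. Both are assumptions that would need confirming against real Sierra data, and the function name is hypothetical.

```python
OPEN_SHELVES_PREFIX = "wg"  # assumption taken from the rule of thumb above


def physical_location_type(sierra_location_code: str) -> str:
    """Map a Sierra location code to a coarse, non-leaky location type."""
    code = sierra_location_code.strip().lower()
    if code.startswith(OPEN_SHELVES_PREFIX):
        return "OpenShelves"
    # Assumption: anything that isn't on the open shelves is in closed stores.
    return "ClosedStores"
```

The API would then expose only `OpenShelves` or `ClosedStores`, never the raw Sierra code.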

search dashboard

  • audit

metrics

  • searches
  • sessions
  • click through
  • click position
  • no results
  • unique sessions with clicks / unique sessions
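The last metric can be sketched as follows. The `Event` shape and the event names "search" and "click" are hypothetical; the real tracking data may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    session_id: str
    name: str  # hypothetical event names: "search" or "click"


def session_click_through_rate(events) -> float:
    """Unique sessions with at least one click / unique sessions."""
    sessions = {event.session_id for event in events}
    clicked = {event.session_id for event in events if event.name == "click"}
    return len(clicked) / len(sessions) if sessions else 0.0
```

Counting sessions rather than raw clicks stops one very active session from inflating the rate.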

filters

  • works/images
  • page of results
  • number of tokens in query

Post stage-release testing

Currently different members across the team have different smoke tests they run before deploying to prod.

One set of tests is running the experience app against the staging API. It would be good to have these Dockerised and accessible to the whole team to help with a smoother, more confident API deployment.

This will add confidence to our releases as we add and transform a whole load of new data into the API.

These tests will then be available to CI, where we're starting to deploy the API to stage 🥳 .

Once we are running some post-stage tests, we will be closer to being fully Agile™ and continuously deploying to prod.

Location filters

Add location filters to the API.

Locations

Type             Label
OpenShelves      Open shelves
ClosedStores     Closed stores
DigitalResource  Online

We're removing the current locationType values (iiif-presentation, iiif-image) and will add filtering on the above locations via the items.locations.type query parameter, e.g. ?items.locations.type=OpenShelves,ClosedStores.

These will be aggregated, so they are available to the frontend with the above labels and a count of how many works are in each aggregation, the same way we do with workType, e.g.
[Screenshot 2020-09-10 at 10 06 30]

Note: I've just nabbed the interface from what is live; it will follow whatever filter pattern we have at the time of implementation.
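A sketch of how the filter and aggregation could behave, in plain Python. The work shape (a dict with a `locationTypes` list) and the function names are hypothetical; only the three types, their labels and the comma-separated parameter value come from this issue.

```python
from collections import Counter

LOCATION_LABELS = {
    "OpenShelves": "Open shelves",
    "ClosedStores": "Closed stores",
    "DigitalResource": "Online",
}


def parse_location_filter(value: str) -> set:
    """Parse e.g. 'OpenShelves,ClosedStores' from items.locations.type."""
    types = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [t for t in types if t not in LOCATION_LABELS]
    if unknown:
        raise ValueError(f"Unknown location types: {unknown}")
    return set(types)


def filter_works(works, wanted: set) -> list:
    """Keep works that have at least one of the wanted location types."""
    return [w for w in works if wanted & set(w["locationTypes"])]


def aggregate_locations(works) -> list:
    """Count works per location type, with the labels the frontend shows."""
    counts = Counter(t for w in works for t in set(w["locationTypes"]))
    return [
        {"type": t, "label": LOCATION_LABELS[t], "count": counts[t]}
        for t in LOCATION_LABELS
        if counts[t]
    ]
```

Here the filter is treated as an OR across the listed types. That is an assumption; the issue does not say whether multiple values should combine as OR or AND.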

Fix deployment IDs being applied to services

Ref #898

weco-deploy needs to tag the deployed service with deployment:label: deploymentID.

This will then allow us to check tasks against this ID to determine when a service has finished being deployed.
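A sketch of the check this would enable. The task shape (a dict with `status` and `tags`) is hypothetical and is not the shape returned by any particular AWS API; only the tag key `deployment:label` comes from this issue. Treating "finished" as "every task is running and carries the new ID" is also an assumption.

```python
DEPLOYMENT_TAG = "deployment:label"


def deployment_finished(tasks, deployment_id: str) -> bool:
    """True once every task is running and carries the expected deployment ID."""
    if not tasks:
        # No tasks yet: the service hasn't started anything for this deploy.
        return False
    return all(
        task.get("status") == "RUNNING"
        and task.get("tags", {}).get(DEPLOYMENT_TAG) == deployment_id
        for task in tasks
    )
```

A deploy tool could poll this until it returns True or a timeout is reached.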

Add principles to relevance documentation

Following a conversation with @jamesgorrie about adding fuzziness to queries: while this is a clear intention, it should really be applied to all queries, rather than added as a new query in the tiers. Things like this should sit in a set of principles, e.g.

Principles

  • we should try to match all tokens in a query
  • all tokens should have fuzziness applied to them
  • etc
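As an illustration, the first two principles could be expressed in an Elasticsearch `match` query with `operator: and` and `fuzziness: AUTO`. This is a sketch, not the current query; the helper name and the `title` field are assumptions.

```python
def principled_match(field: str, text: str) -> dict:
    """Build a match query that requires all tokens and applies fuzziness."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "operator": "and",    # try to match all tokens in the query
                    "fuzziness": "AUTO",  # fuzziness applied to every token
                }
            }
        }
    }


example = principled_match("title", "two monsoons")
```

Building queries through one helper like this would keep the principles in a single place instead of repeating them in each tier.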
