princeton-cdh / geniza

version 4.x of the Princeton Geniza Project

Home Page: https://geniza.princeton.edu

License: Apache License 2.0

Python 77.18% HTML 6.90% CSS 0.59% JavaScript 4.08% SCSS 11.12% Shell 0.13%
python django digital-humanities geniza judaeo-arabic

geniza's People

Contributors

blms, rlskoeser, kmcelwee, thatbudakguy, gissoo, quadrismegistus, apjanco, allcontributors[bot], owenduffymassey

geniza's Issues

As a global admin, I want to be able to add and edit a language, script, or correlation between language and script in our ontology, in order to expand our content footprint.

testing notes

log in on an account with Content Admin group permissions

  • list view should include language, script, and optional display name
  • should be able to edit existing records or add a new one; display name should be optional
  • should be able to delete an existing record
  • if you try to enter a language + script combination that already exists, you should get an error

log in on an account with Content Editor group permissions

  • should be able to view languages and scripts but not add, edit, delete

  1. Adding a new language or script or correlation (for example, Georgian or Arabic in Greek script)
  2. Editing a language or script already in the list, or the correlation between them.
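
The duplicate check described in the testing notes can be illustrated without any framework. This is only a sketch: the `LanguageScriptRegistry` class and the case-insensitive comparison are assumptions made for illustration, not the project's code. In a Django model the same rule would more likely be expressed with a uniqueness constraint on the two fields.

```python
from dataclasses import dataclass, field


@dataclass
class LanguageScriptRegistry:
    """In-memory stand-in for the language+script table (hypothetical)."""

    # maps (language, script) -> optional display name
    entries: dict = field(default_factory=dict)

    def add(self, language, script, display_name=None):
        # assumption: combinations are compared case-insensitively
        key = (language.strip().lower(), script.strip().lower())
        if key in self.entries:
            raise ValueError(f"{language} in {script} script already exists")
        self.entries[key] = display_name
        return key


registry = LanguageScriptRegistry()
registry.add("Arabic", "Greek")  # display name is optional
registry.add("Georgian", "Georgian", display_name="Georgian")
```

Adding "Arabic" with "Greek" script a second time raises `ValueError`, which corresponds to the error an admin should see in the form.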

Revise the sitemap and site flow diagram so that the project team can more easily understand what content will exist on the site and how it is connected

Link to the revised sitemap

  • revised the map structure to make it easier to read
  • separated "content" from "features"
  • color coded page hierarchy by levels (0-4)
  • visually identified "potentially out of scope" pages and content and features
  • removed details for the two pages that are "potentially out of scope" and new in concept

Questions for you:

  1. Can you read this map?
  2. Does separating content from features make sense to you in this way?
  3. Please read through all the "content" and "features" and tell me, are there items that are categorized incorrectly? If yes, what are they?
  4. Does this map answer any questions for you? What are they? Are there questions it's not answering for you?
  5. Does this level of detail make sense to you?

As a content editor, I want to see the number of documents we have in each language/script combination, so that I can understand the relative proportions and provide information for data visualization and research.

testing notes

on the language+script admin list view, confirm that:

  • counts are present and accurate for language
  • counts are present and accurate for probable language
  • clicking on language count takes you to the correctly filtered document list
  • clicking on probable language count takes you to the correctly filtered list
  • sorting on language and probable language counts works

This is in the "language/script" interface.

Revise the sitemap diagram based on comments in #51 to make sure the proposed website content and its hierarchy are comprehensible to the project team

Here is the link to the revised sitemap.

Note:

  • I have added notes in blue where more context was needed. Please read.
  • Please do not pay attention to what the "explore fragments" and "explore words" and "citation/scholarship records" will entail – we will discuss during our meeting. (just view the levels at which they are placed).

Description:

  • this is the revised sitemap, covering the content and functionality, and levels of the following pages on the site:
  1. homepage,
  2. cluster search,
  3. browse documents by cluster (the page shown once a cluster is selected for further exploration),
  4. search results (the page shown when there is an input in the search box)
  5. document details
  6. Citation/Scholarship Records
  7. Contact Us
  8. About with its subpages: credits, how to cite, data exports, technical, and FAQ
  • Pages that may be out of scope depending on data and priorities: (will discuss at our meeting)
  1. Citation/Scholarship Records
  2. Explore Fragments
  3. Explore words

Questions:

  1. Does the sitemap make sense to you? Does the legend make sense? How about the page levels? If not, please say why.
  2. Do you have any additions to what's considered "content" vs. "functionality" on any of the pages? Is anything missing, or anything you consider unnecessary?
  3. Would you want to revise the name of any of the pages? For instance, which one makes more sense, "Citation records" or "Scholarship Records"?
  4. Is there anything that you expected to see and is missing?

As a global admin and content editor, I want to clone a document and have a record of the process, to keep track of the origins of the document record and changes in the data.

testing notes

  • Choose a document from the document list for editing, and use the 'save as new' button at the bottom. Confirm that all fields are populated from the original document record, and there is a note added that this record was cloned from the other one.
  • Do the same test, but add content to the notes field of the first document before saving it. Confirm that the note about the record being cloned is added to whatever text you added in the notes field.
  • Confirm that date created and last modified are set accurately for the new record, and not copied from the previous one.

dev notes

ideal implementation:

  • add a button to the edit view next to "save and continue", etc. that's "clone this" or similar
  • button links to a new add view with all fields prepopulated using current values from the object, but doesn't submit (so user can change it before submitting)
  • prepopulate notes field with text "cloned from str(document)"
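
A minimal sketch of how the notes field could be prepopulated so that existing notes are kept, as the testing notes require. The function name `cloned_notes` and the choice to put the clone note on its own line are assumptions; the source only specifies the text "cloned from str(document)".

```python
def cloned_notes(existing_notes: str, source_label: str) -> str:
    """Return the notes text for a record cloned from `source_label`.

    `source_label` stands in for str(document) of the original record.
    """
    clone_note = f"cloned from {source_label}"
    existing = existing_notes.strip()
    # keep whatever the editor already typed; add the clone note after it
    return f"{existing}\n{clone_note}" if existing else clone_note
```

For a document with empty notes the result is just the clone note; otherwise the note is added after the existing text rather than replacing it.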

As a user, I want to search with multiple tags and choose how they should be combined (ANY/ALL) so that I can drill down or combine search results.

  • add faceting on tags field in the solr queryset
  • add tag facets input to the search form; display tag name and count for current search (use checkboxes to allow multiple)
  • add any/all configuration option specific to tags (radio boxes; "find documents that match [ALL/ANY] selected tags")
  • in the view, filter the search based on the selected tags and specified mode
  • when generating the search page with tags selected, make sure the form reflects current status (any/all and any selected tags)
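
The ANY/ALL filtering step can be sketched as a function that turns the selected tags into a Solr filter query string. This is an illustration under assumptions: the field name `tags` and the helper name `tag_filter_query` are hypothetical, and the project may build its queries through a Solr client library rather than raw strings.

```python
def tag_filter_query(tags, mode="any", field="tags"):
    """Build a Solr filter query matching ANY or ALL of the selected tags."""
    if mode not in ("any", "all"):
        raise ValueError("mode must be 'any' or 'all'")
    if not tags:
        return None  # nothing selected, so no filter is applied
    operator = " AND " if mode == "all" else " OR "
    # quote each tag so multi-word tags are matched as phrases
    quoted = ['"%s"' % tag.replace("\\", "\\\\").replace('"', '\\"') for tag in tags]
    return "%s:(%s)" % (field, operator.join(quoted))
```

With `["tax", "legal document"]` and mode `"all"`, this produces `tags:("tax" AND "legal document")`; the same values can be passed back to the form so it reflects the current selection.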

As a content editor, I want to edit all of the documents associated with the fragment on the same screen that I use to edit the fragment, so that in the case of demerging I can make sure that the data is split correctly.

Testing notes

  • On the fragment admin display, ensure that the fields are displayed properly in the TextBlock inline.
  • Click on the document string and ensure that it leads to the proper page.

dev note

create simple inline for text block and enable on fragment edit

  • editable: side & text+extent/region
  • display document id + description

As a user, I want to read the geniza project in my native language so it's easier to understand.

testing notes

note - this issue is a companion to #36, but they address different issues and function differently. this issue covers all the text on the site that isn't actual "PGP data" - in other words, nothing that would come from the database.

you might encounter documentation (or people) that use the terms i18n and l10n. these are lazy ways of writing the long words "internationalization" and "localization", where the numbers mean how many letters you skipped in the middle of the word. the former refers to writing code that can be translated into multiple languages, while the latter refers to actually doing the translation ("localizing") that code into some particular language. for the PGP, the developers will be doing the i18n, but the l10n can be done by the project team.
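
the counting rule behind these abbreviations fits in a few lines of python (the function name `numeronym` is just for illustration):

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of skipped letters + last letter."""
    if len(word) < 3:
        return word  # nothing to skip in the middle
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("internationalization"))  # i18n
print(numeronym("localization"))  # l10n
```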

as a user

  1. visit the test site and you'll see a (very basic) homepage that tells you what language you're currently reading the PGP in.
  2. you'll notice that the default is english; there's also now a /en/ appended to the URL to indicate this.
  3. check out the list of professions in english and you'll see the transliterated profession names.
  4. choose another language to read in using the dropdown at the top right, and click "Go." (these languages are configurable and we can choose as many or as few as we want).
  5. you should now see the profession names change to reflect your choice, and the url suffix should also change to a language code (e.g. /he/ for hebrew). note though that the actual URLs will still be in english (e.g. /ar/people). let us know in a comment if this makes sense - it's possible to instead do /ar/اشخاص/ or /ar/ashas (idk if these are correct but you get the point).
  6. go back to the homepage by clicking "home" in the top left, and you should see the text after "your language is" has also changed. once you pick a language, the website will "remember" it until you make another choice - including if you refresh, close the tab, close the browser, etc. you can remove the choice by clearing your cookies to default back to english, or just choose it from the dropdown.
  7. note that the rest of the language on the homepage didn't change! to make that work, we have to do some extra work from the content editor's point of view.

as a content editor

  1. have a look at the locale folder on github. you'll see three folders: ar, en, and he, corresponding to our three language options. each has a folder inside called LC_MESSAGES (this is a standard name that's required to use).
  2. go ahead and click the folder until you see a file called django.po. this .po file (called a "message file") stores translations for each bit of text on the website that can be translated into multiple languages. if you have a look at the .po file for hebrew, you'll see starting on line 46 entries for all of the bits of text on the homepage ("Geniza multi-language testing", etc).
  3. there's a lot going on here, so let's review the type of messages that you may see in this file. notes I added are in parentheses.
#. Translators: button on language chooser in navigation  (note left by the developer for the translator)
#: templates/base.html:33  (where in the code this bit of text is located)
msgctxt "choose this option"  (extra context for the translator, since "go" could be translated many ways)
msgid "go"    (what the original (untranslated) text reads)
msgstr ""       (the place where the translation goes)

#. Translators: subheading on homepage                               
#: templates/home.html:9                                                      
#, python-format   (indicates this bit of text contains python code)
msgid "Your language is: %(lang_name)s" ("lang_name" will be filled in later; we don't know what it is right now)
msgstr "" (the translation will include a placeholder for "lang_name")
  4. time to add a translation! there are dedicated programs available for editing .po files, but the easiest way to test it right now is just to edit on github. click the "pencil" icon in the top right to edit it directly.
  5. now you can fill in some of the msgstr fields with translations in whatever language's file you're editing. they don't need to be correct or "real" translations, but doing them in the correct language would be good.
  6. when you're finished, scroll to the bottom and find "commit changes". fill in a commit message in the top box (something like "add arabic homepage translations") and if you're feeling fancy also add an extended description of what you did in the bottom box (not required).
  7. make sure you click "create a new branch for this commit and start a pull request". this way, the developers will be notified of the changes and will have a chance to review everything before adding the code. it also helps prevent situations where two people edit the file at the same time and one person's changes "win" (unlikely, but possible). the name for your branch isn't important; something like "arabic-homepage-l10n" is fine.
  8. you're done! you just localized something. comment with your thoughts/opinions about the process. if you want to try a fancier way, also check out poedit, which is one possible solution for doing lots of localization at once.
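
to see why the `%(lang_name)s` placeholder in the message file has to survive translation, here is a small python sketch. the `catalog` dict and the `translate` function are stand-ins invented for this example (django reads the compiled message files for you), and the "translation" is deliberately fake. like gettext, the sketch falls back to the original text when msgstr is empty.

```python
# msgid -> msgstr, mimicking two entries from a .po file (hypothetical data)
catalog = {
    "Your language is: %(lang_name)s": "TRANSLATED: %(lang_name)s is your language",
    "go": "",  # not translated yet
}


def translate(msgid, **params):
    """Look up a message and fill in any named placeholders."""
    # an empty or missing msgstr falls back to the untranslated msgid
    msgstr = catalog.get(msgid) or msgid
    return msgstr % params if params else msgstr


print(translate("Your language is: %(lang_name)s", lang_name="Hebrew"))
print(translate("go"))
```

note that the placeholder can move to a different position in the translated sentence, as long as its name stays exactly the same.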

dev notes

Basic django app with some public interface and menus or site content to be translated

  • add a super basic css framework/stylesheet for prototyping
  • create the base/home template
  • create a basic header or footer

As a global admin, I want to be able to add a library collection not represented on the list in order to expand our content footprint and edit those already in the list.

testing notes

log in on an account with Content Admin group permissions

  • list view should include library, collection name, library abbreviation, collection abbreviation, location
  • should be able to edit existing records or add a new one
  • should be able to save a record with collection name but no library
  • should be able to save a record with library but no collection name
  • should not be allowed to save a record with both collection name and library empty
  • should be able to delete an existing record
  • should be able to add a new record
  • if you try to enter a collection with a library + collection combination that already exists, you should get an error

log in on an account with Content Editor group permissions

  • should be able to view collections but not add, edit, delete

For example, when new collections become available or there are name changes.

As a content editor, I want to add data to the database in multiple languages so that I can fully represent existing project data.

testing notes

note - this issue is a companion to #35, but they address different issues and function differently. this issue covers all the actual PGP data that can be stored in multiple languages, rather than text on the site.

as a user

see "as a user" on #35, since this part works exactly the same way. then, pick a person from the list of people. you'll see a basic testing page that tells you what profession that person had. currently, nobody has any profession. let's add one.

as a content editor

  1. go to the admin backend for the site and click "people". click a person's name to edit them.
  2. you'll now see the fields that are available on a person: their name, and their profession. go ahead and pick any profession from the list and save the model.
  3. note also that there are three tabs here, labeled according to the language codes of the languages that you can currently browse the test site in. if you go to one of those tabs, the field will be empty, indicating there's no value in that language. we can control which languages are available at the level of each field, but the default is to allow every language that you can browse the site in.
  4. go ahead and add a (fake or real) version of the person's name in another language and save the model.
  5. if you go back to the public site and visit the page of the person you edited, you'll see a message like "X was a Y", where X is the person's name and Y is the profession you chose. if you switch language, both X and Y should now display in the translated versions you entered! if you didn't enter a translation for the language you chose, it will just display in english instead. note that the middle part of this text ("was a") isn't actually "PGP data", and thus it would be translated via the methods covered in #35.

leave comments/opinions below on how this process went for you.

dev notes

  • model editable in django admin with translation
  • simple list page
  • simple detail page

As a global admin, I want a one-time import of all documents and fragments currently in the PGP spreadsheet and the fields in the db populated accordingly, in order to work with the data in the database.

testing notes

Check a variety of documents and fragments from the PGP metadata spreadsheet and test how they have been imported.

for fragments:

  • check that fields are populated accurately from the spreadsheet:
    • shelfmark
    • historic shelfmark
    • library/collection
    • multifragment (yes/no)
    • link to image
    • iiif url for CUL documents with link to image (can verify via iiif viewer on fragment edit page)
    • test that items with Library CUL in spreadsheet are assigned to the right collection based on shelfmark (T-S, Or., Add.)
  • check that record history documents creation via import script
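
A rough sketch of how CUL items might be assigned to a collection from the shelfmark prefix. The three prefixes come from the notes above; the function name, the optional leading "CUL " handling, and the sample shelfmarks in the usage lines are assumptions for illustration and may not match the import script.

```python
CUL_COLLECTION_PREFIXES = ("T-S", "Or.", "Add.")


def infer_cul_collection(shelfmark):
    """Return the CUL collection prefix for a shelfmark, or None if unknown."""
    text = shelfmark.strip()
    # assumption: some shelfmarks carry the library abbreviation first
    if text.startswith("CUL "):
        text = text[len("CUL "):]
    for prefix in CUL_COLLECTION_PREFIXES:
        if text.startswith(prefix):
            return prefix
    return None


print(infer_cul_collection("T-S 8J22.21"))  # T-S
print(infer_cul_collection("CUL Or.1080 3.41"))  # Or.
```

Shelfmarks that match none of the prefixes return `None`, which leaves room for flagging them for manual data cleanup.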

for documents:

  • check that these fields are populated accurately from the spreadsheet:
    • PGPID
    • type
    • description
    • tags
    • languages — language+script based on list of languages (preliminary mapping)
    • probable languages — language+script based on languages listed with question mark
    • language note — should include text of language + parenthetical notes on vocalization, diacritics, etc
    • legacy input by
    • legacy input date
    • a text block for each associated fragment; should include when present:
      • side (recto/verso)
      • text block label (text block in spreadsheet -> extent label in database)
      • multifragment value
  • check that record history documents creation via import script
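
The split between languages and probable languages could look roughly like this. It is a sketch under assumptions: the spreadsheet's actual separator is not stated here (a semicolon is assumed), parenthetical notes are not handled, and `split_languages` is a hypothetical name.

```python
def split_languages(cell, separator=";"):
    """Split a spreadsheet cell into (languages, probable_languages).

    Entries ending in a question mark are treated as probable languages.
    """
    languages, probable = [], []
    for part in cell.split(separator):
        name = part.strip()
        if not name:
            continue  # skip empty entries, e.g. from a trailing separator
        if name.endswith("?"):
            probable.append(name.rstrip("?").strip())
        else:
            languages.append(name)
    return languages, probable


print(split_languages("Judaeo-Arabic; Hebrew?"))
# (['Judaeo-Arabic'], ['Hebrew'])
```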

Check a few documents with joins to confirm that the document is linked to all fragments referenced by shelfmark in the join column


I want the following fields populated from the spreadsheet: library, shelfmark (current/historical), recto or verso, language/script, description, type and tags, and, if available, link to image.

dev notes

revisions after testing:

  • fragments with a multifragment value set should get boolean multifragment set true
  • actual multifragment value should be set on the text block
  • for fragments, should populate:
    • shelfmark
    • historic shelfmark
    • library/collection
    • infer missing library based on shelfmark (data cleanup requested)
    • multifragment
    • link to image
    • infer iiif url based on linked to image where possible
  • for documents, should populate:
    • PGP ID
    • language/script
    • description
    • type
    • tags
    • associate with fragments based on shelfmark and any shelfmarks included in the join field
  • on document/fragment through model, track:
    • side (recto/verso)
    • text block

Design a generous search interface for the holistic search idea so that users can learn about the relationship between clusters of data

Here is the link to the proposed data scheme and two versions for stage 1 UI + flow for the cluster view

Note: "stage 1 UI" is the first step for designing a UI – colors, font, line, and shape weights, and alignments are not complete.

  1. have proposed a way to categorize the clusters and the documents (the data scheme) – the condition mentioned in the data scheme is later used to create a graph where the clusters can be placed. But this is just a suggestion and I want to propose this to the project team and have conversations on the data scheme proposal and the condition mentioned.
  2. have proposed two ways of representing the clusters (v1 and v2)
    • v1: using the most common tag such as "tax" and grouping all the other tags that accompany it
    • v2: using the logic in v1 but labeling the group with a unique name such as "finance" – or not labeling it at all, but that might be problematic once in the document list.
  3. have proposed 3 ways of navigating the clusters in each version:
    • to view all of the clusters
    • to view one cluster
    • to view a sub cluster
  4. have shown how a shared sub cluster between two clusters might be handled
  5. the document list shows the sub cluster with the largest number of documents first
    Note: the document list reflects the intended logic, but the number of documents shown does not match the mock-up; matching it is not necessary to convey the goal of this design.

Questions for you:

  1. Does the design and the data scheme make sense to you? If not, please say why
  2. Is there something that you need to help you understand the design which is missing here?

As a content editor, I want to create, edit, filter and search documents so that I can add/edit information on documents in the database and find pertinent documents.

Testing Notes

List display:

  • Ensure that all fields that you want visible are available
  • Test that Documents are searchable by:
    • shelfmark
    • tag
    • description
    • PGP team member who input data
  • Test that results for filtering by the following fields operate properly:
    • document type
    • language
    • extent label
    • multifragment

Edit

  • Test that changes to a document through the edit display are saved and reflected in the list view.
  • Test that read only fields are displayed and not editable
  • Test that text blocks for associated fragments are listed and can be updated
  • Test that you can't add unknown for a probable language
  • Test that you aren't allowed to set the same language+script as both language and probable language
  • Test that you aren't allowed to set unknown language+script as probable language
  • Test that multiple text blocks on the same shelfmark don't result in repeated shelfmark in document combined shelfmark
  • Test setting text block order and confirm that join shelfmark follows the specified order
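
The two language rules in this checklist can be expressed as a small validation function. This is a framework-free sketch: the function name, the error wording, and the label "Unknown" for the unknown language+script are assumptions, and in the admin the check would presumably live in form or model validation.

```python
def language_errors(languages, probable_languages, unknown="Unknown"):
    """Return a list of validation error messages (empty when valid)."""
    errors = []
    # the same language+script may not appear in both fields
    overlap = [lang for lang in probable_languages if lang in languages]
    if overlap:
        errors.append(
            "Cannot be both language and probable language: " + ", ".join(overlap)
        )
    # "unknown" makes no sense as a probable language
    if unknown in probable_languages:
        errors.append(f"{unknown} is not allowed as a probable language")
    return errors
```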

dev notes

revisions after testing:

  • switch language and probable language to autocomplete
  • fix language/probable language validation
  • add help text for language and probable language (to be supplied by the team)
  • fix document shelfmark so it includes unique shelfmarks in order
  • add multifragment text field
  • make extent filter empty/not empty
  • add empty/not empty multifragment filter
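
The shelfmark fix (unique shelfmarks, in order) can be sketched in one line of Python. The input is assumed to be the fragment shelfmarks already sorted by text block order, and the " + " separator is an assumption rather than the project's confirmed display format.

```python
def combined_shelfmark(shelfmarks, separator=" + "):
    """Join shelfmarks in the given order, dropping repeats."""
    # dict.fromkeys keeps the first occurrence of each value, in order
    return separator.join(dict.fromkeys(shelfmarks))


print(combined_shelfmark(["T-S 1", "T-S 1", "ENA 2"]))  # T-S 1 + ENA 2
```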

Fields
Provide a list of the fields and entities to be included on this model.

  • Fragments (entity)
  • description
  • side
  • text-block
  • Type (entity)
  • Tags (entity)
  • language
  • Footnotes for edition and translation (entity)
  • notes
  • input by
  • date entered

List Fields
Which fields should be included on the django admin list view?

  • shelfmark
  • description
  • type
  • tags
  • language
  • text-block (boolean)
  • edition

Edit Fields
Which fields should be editable via django admin view? Please list them in the order you want them to appear. Indicate which fields are optional, and any fields that should be displayed but not editable.

  • shelfmark (read only) at least one, possibly more
  • historical shelfmark (read only)
  • side (to indicate recto/verso)
  • Image thumbnail (optional)
  • type (optional)
  • language
  • description
  • tags (optional)
  • edition (optional)
  • translation (optional)
  • notes (optional)
  • text-block (optional)
  • Legacy input by (read only)
  • Legacy date entered (read only)


Search Fields
Which fields should be searchable in the django admin list view?

  • description
  • tags
  • shelfmark
  • edition
  • input by

Default sort

  • Shelfmark

Filters
Which fields should be used for list filters?

  • Type
  • language
  • Text-block


List Fields (optional)
Which fields (if any) should be editable on the django admin list view?
None.

Related models
List any directly related models (database tables) that would be helpful to edit when on the same page when editing this one.
  • Fragment model

Additional context
Add any other context about the model or database here. Include a link to a database diagram if one is available.


Old version
List view fields (in order):
- Shelfmark (current) (searchable)
- Shelfmark historical (boolean: does it have a historical shelfmark?)
- Library (abbreviation) (filter)
- recto/verso (filter)
- type (filter)
- tags (filter & searchable)
- language/script (filter)
- description (searchable)
- editor (searchable)
- translator (searchable)
- link to image/image (thumbnail, IIIF)
- multifragment (boolean: is it part of a multifragment?)
- text-block (boolean: is it assigned text-blocks?)
- Input by (searchable)
- date entered (searchable)

None editable in the list view.
Filter: shelfmark (historical), Library, Recto/verso, type, tags, language/script, multifragment, text-block
Searchable: Shelfmark (current), tags, description, editor, translator, input by, date entered.

As a user, I want to see existing TEI transcriptions displayed with IIIF images so I can see how using annotations for transcription might work.

testing notes

Go to the test site at https://test-geniza.cdh.princeton.edu/iiif/ — you should see the Mirador 3 IIIF viewer with a list of manuscripts to select (note that it does load quite slowly because there are so many manifests in the list).

When you choose a document to view, it should open with the annotations panel displayed and should show a transcription beside an image. When you close the document you're looking at, use the "start here" at the top right to go back to the list and open another.

The documents available for testing right now are only those that we currently have both IIIF and transcriptions for. For now, I'm attaching all annotations to the first image in the IIIF manifest and setting the annotation zone to be the entire canvas. I'm creating separate annotations for each block of text within a TEI document, as indicated by a label preceding a new section.

I'm setting language direction as rtl and left-aligning everything; I also put all transcription lines into an ordered list, but I haven't yet looked into correcting the TEI documents with un-numbered lines that wrap, so I expect there are cases where the numbers will be wrong. There is no language detection yet.
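
To make the approach concrete, here is a sketch of what one annotation per text block might look like when it targets an entire canvas. It uses the W3C Web Annotation shape with the "supplementing" motivation that IIIF Presentation 3 suggests for transcriptions; the prototype may well use a different annotation format, and the function name, IDs, and HTML layout are assumptions.

```python
import html


def transcription_annotation(canvas_id, label, lines, anno_id):
    """Build one annotation for a labeled block of transcription lines."""
    # each transcription line becomes an item in an ordered, right-to-left list
    items = "".join(f"<li>{html.escape(line)}</li>" for line in lines)
    body_html = f'<h3>{html.escape(label)}</h3><ol dir="rtl">{items}</ol>'
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "motivation": "supplementing",
        "body": {"type": "TextualBody", "format": "text/html", "value": body_html},
        # no fragment selector: the annotation zone is the whole canvas
        "target": canvas_id,
    }
```

Attaching a block to a specific region later would mean replacing the plain canvas ID in `target` with a selector for part of the canvas.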

For sign off on this story, please confirm:

  • transcriptions are associated with the correct documents (the association is based on PGP ID in the TEI and IIIF manifest in the metadata spreadsheet)
  • distinct blocks of text within a single TEI are correctly grouped

Feel free to add comments with things you notice that are missing and should be addressed when we start refining this. Here are a few things I've noticed so far:

  • Transcriptions should sometimes be associated with the second image in a manifest, but I'm not sure how to determine where they belong. It looks like if there is a "recto" or "verso" label, any sections after it are on that side.
  • Transcriptions of marginal text often seem to use / to indicate line breaks

First pass conversion of TEI transcriptions to IIIF annotation (only for subset of documents with IIIF and transcriptions, for now)

As a user, when I search by keyword I want to see a list of matching clusters with the most relevant documents so that I can get to results but also see the context of related materials.

  • add keyword search input & form
  • add new flask view & template for new search results (ideally template snippet for record display common to #23)
  • implement solr search to retrieve and display documents in the cluster result format
  • determine and show which clusters documents in the search belong to
  • link clusters to cluster browse

test eScriptorium on sample content

  • setup local instance
  • import sample content provided by the project team
  • generate exports that can be used to test matching ocr + zones against tei transcripts

Create deploy playbook for geniza i18n prototype

  • needs apache configured to allow deployment side-by-side with search prototype
  • needs separate playbook with different app_name; uses django-specific roles/localsettings setup
  • needs mysqlclient deps installed on target machine so it can talk to mysql database (via common role)
  • clean up template_path settings in group vars for all projects since default was updated to be smarter

As a content editor, I want to suppress documents, so that the records are kept updated.

testing notes

  • Change the "Status" on a few documents and ensure that the "public" field on the list view updates properly.
  • Filter by status in the list view side bar and ensure that it works properly.

dev notes

We have similar functionality in ppa, maybe a useful reference: https://github.com/Princeton-CDH/ppa-django/blob/main/ppa/archive/models.py#L285-L293

May want to implement #78 in tandem with this one

As a content editor, I want to be able to flag and annotate a document as needing examination by a global admin, so that the global admin can either resolve the issue or submit a request for changes to the CDH.

New field: needs_review text field
Ability to create a link to a filter search of things that need review on the dashboard

testing notes

  • edit a few documents, add text to the 'needs review' field, and save
  • navigate to admin main index page: the documents you edited should be listed as awaiting review
  • you should be able to click on the document to go straight to the edit page
  • you should be able to click on the document heading to go to a filtered document list view showing all documents that need review
  • navigate to the corpus section of the admin site; confirm that the awaiting review section appears and functions the same as on the admin index page
  • create a test staff user account without permissions to view or edit documents and login as that user to confirm that they do not see the list of documents awaiting review

dev notes

  • add needs_review text field to Document; help text to indicate what it is for
  • add empty/nonempty filter on needs_review to document list view
  • create new template snippet/include with a brief list of 10 most recent (based on last modified) documents with nonempty needs review; should report how many items total need review and link document list view filtered to see all items that need review
    • (Kevin): Document.objects.exclude(needs_review='').order_by('-last_modified')[:10]
  • extend corpus app_index admin template and include the document review snippet in the sidebar (see https://docs.djangoproject.com/en/3.1/ref/contrib/admin/#templates-which-may-be-overridden-per-app-or-model)
  • extend admin index and add document review in the sidebar after other sidebar content

May want to implement #77 at the same time, since the functionality is the same but for fragments
