artefactual-labs / aipscan

Crawl Archivematica's Archival Information Packages (AIP) and provide repository-wide reporting.

License: Apache License 2.0

Python 83.59% HTML 11.14% CSS 0.20% JavaScript 4.78% Dockerfile 0.11% Makefile 0.18%

aipscan's People

Contributors

ablwr, aseles13, dependabot[bot], mcantelon, melaniekung, petervg, rayzilt, replaceafill, ross-spencer, sallain, sbreker, scollazo, tw4l


aipscan's Issues

Pylint: unused variables

From pylint:

[W0612 unused-variable] Unused variable 'dateTimeObjStart' File: Aggregator/tasks.py, line 171, in package_lists_request
[W0612 unused-variable] Unused variable 'storageServices' File: Aggregator/views.py, line 118, in delete_storage_service
[W0612 unused-variable] Unused variable 'delta' File: Reporter/views.py, line 135, in report_formats_count
[W0612 unused-variable] Unused variable 'key' File: Reporter/views.py, line 292, in plot_formats_count

Catching these is always worthwhile because either, a) you meant to use the variable and for some reason you're not, or b) you genuinely don't need it, in which case there's no point keeping the value around to manipulate.

A cool little thing I learned in Golang and then realized works in Python too is the _ syntax. It is useful in loops where you're not using all of the values:

for key, value in some_dictionary.items():
    print(value)

Should become:

for _, value in some_dictionary.items():
    print(value)

Where key isn't used.

An underscore can also be used on the left-hand side of an assignment for any function that returns something. It is helpful for signalling that there is a potential result to handle, but that you're deliberately not working with it yet.
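For example, a minimal sketch (the function here is hypothetical):

def fetch_package_list():
    """Pretend to do some work; return a count we may not need."""
    return 42

_ = fetch_package_list()  # run for the side effect; deliberately discard the result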

Pylint: Replace built-in names with different choices for variable name

I'll post a couple of high-priority pylint fixes to address. The first is the use of built-in names for other variables. They are unfortunately nice-looking names, but there are other patterns we can adopt here that will help make the code more readable:

From pylint:

Pylint priorities - the potential for really subtle bugs with these:

[W0622 redefined-builtin] Redefining built-in 'list' File: Aggregator/tasks.py, line 87, in workflow_coordinator
[W0622 redefined-builtin] Redefining built-in 'next' File: Aggregator/tasks.py, line 198, in package_lists_request
[W0622 redefined-builtin] Redefining built-in 'file' File: Aggregator/tasks.py, line 268, in get_mets
[W0622 redefined-builtin] Redefining built-in 'type' File: Aggregator/tasks.py, line 348, in get_mets
[W0622 redefined-builtin] Redefining built-in 'format' File: Aggregator/tasks.py, line 317, in get_mets
[W0622 redefined-builtin] Redefining built-in 'id' File: Aggregator/views.py, line 38, in storage_service
[W0622 redefined-builtin] Redefining built-in 'id' File: Aggregator/views.py, line 55, in edit_storage_service
[W0622 redefined-builtin] Redefining built-in 'id' File: Aggregator/views.py, line 109, in delete_storage_service
[W0622 redefined-builtin] Redefining built-in 'id' File: Aggregator/views.py, line 123, in new_fetch_job
[W0622 redefined-builtin] Redefining built-in 'id' File: Aggregator/views.py, line 197, in delete_fetch_job
[W0622 redefined-builtin] Redefining built-in 'format' File: models.py, line 131, in originals.__init__
[W0622 redefined-builtin] Redefining built-in 'format' File: models.py, line 170, in copies.__init__
[W0622 redefined-builtin] Redefining built-in 'type' File: models.py, line 212, in events.__init__
[W0622 redefined-builtin] Redefining built-in 'type' File: models.py, line 231, in agents.__init__
[W0622 redefined-builtin] Redefining built-in 'id' File: Reporter/views.py, line 22, in view_aips
[W0622 redefined-builtin] Redefining built-in 'id' File: Reporter/views.py, line 56, in view_aip
[W0622 redefined-builtin] Redefining built-in 'id' File: Reporter/views.py, line 86, in view_original
[W0622 redefined-builtin] Redefining built-in 'format' File: Reporter/views.py, line 152, in report_formats_count
[W0622 redefined-builtin] Redefining built-in 'format' File: Reporter/views.py, line 276, in plot_formats_count

If we think about what each of these variables represents, a more descriptive name tends to fall out of it, which also helps code readability. For example, in the case of id, if we ask ourselves what it is an ID for, the variable name becomes <name_of_thing>_id, e.g. format_id.

Similarly for something like list: what's in the list? mets_list, or list_of_mets_files. Variants on these kinds of names are really helpful where the developer would otherwise need to keep that context in their head ("which id is this again?"), or when a function grows to include two different IDs.
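A quick before/after sketch using one of the flagged functions (bodies elided):

def view_aip(id):
    """Before: the parameter shadows the built-in id()."""
    ...


def view_aip(aip_id):
    """After: the name says exactly which ID this is."""
    ...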

Jinja2 templates should be prettified

As the templates get more complex, it will benefit both the reader and the developer to have a consistent, objective way to format them without too much to think about.

Unibeautify comes from the Atom IDE plugins and has a nice HTML formatter that, on face value, can handle the templates without too much issue. If JavaScript and CSS are kept in external files, a separate format utility can be used for those, which will also help readability.

There is a code playground here: https://playground.unibeautify.com/

Another nice feature is that, now Unibeautify has been moved out of Atom, it is standalone: it can be run locally via NPM and also integrated into CI processes.

There are few docstrings

We have a documented standard for docstrings in Archivematica: here.

It would be good to get some basic ones in as we start to move through the code-review work.

My preference is to at least get a single-line docstring specified for most functions:

def function(a, b):
    """Do X and return a list."""

And if necessary basic multi-line strings:

def function(a, b):
    """Do X and return a list.

    I needed to use more space to describe the function here, as I
    couldn't fit it into a single line.
    """

In time we can add parameter documentation, but as this code will be changing rapidly, just getting into the habit of the basics will be enough. Parameters will be changing name/type/returns for a wee bit.

NB. I'll join this work as I go through the code as well. I can't imagine all the docstrings will land in a single commit, but maybe they can?

Style

I erroneously told Tessa that docstrings can be 79 characters, but it looks like in the style guide they are:

Some teams strongly prefer a longer line length. For code maintained exclusively or primarily by a team that can reach agreement on this issue, it is okay to increase the line length limit up to 99 characters, provided that comments and docstrings are still wrapped at 72 characters.

The Python standard library is conservative and requires limiting lines to 79 characters (and docstrings/comments to 72).

So TIL.

Additional

NB. We should try to get an encoding declaration at the top of our scripts too: # -*- coding: utf-8 -*-. We do this fairly inconsistently in the AM code-base, but it is something we do. While it's technically only needed for code using non-ASCII code-points, e.g. you might see it in tests using Unicode parameters, it's also a good forward-looking habit that isn't so anglo-focused. And it gives the script a pleasing header.
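For illustration, the declaration goes on the first line of the file, or the second if there is a shebang (the module docstring here is hypothetical):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Aggregate AIP METS files from the storage service."""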

Pylint: Bare-excepts in-line

From pylint:

[W0702 bare-except] No exception type(s) specified File: Aggregator/tasks.py, line 326, in get_mets
[W0702 bare-except] No exception type(s) specified File: Aggregator/views.py, line 187, in new_fetch_job

These are pretty key to fix up early. If you take Archivematica, one of the things lost in some of our except Exception blocks is context. Dealing with primitives, it might initially be quite obvious to the developer that the exception seen was an IndexError, ValueError, etc. But as we start dealing with libraries, we lose the ability to quickly walk back and correct this, and say: okay, we're seeing a DjangoThingyException here, so let's work with that.

The primary reason for not having a bare except is that it masks other exceptions: it catches every exception object, including ones raised further down the stack that we weren't expecting. We should let the application fail or respond with an error at those points so that problems can be managed correctly.
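A minimal sketch of the difference, using an illustrative HTTP call with the requests library (the URL is made up):

import logging

import requests

logger = logging.getLogger(__name__)

try:
    response = requests.get("http://127.0.0.1:62081/api/v2/file/", timeout=5)
    response.raise_for_status()
except requests.RequestException as err:
    # Catch only the failures we anticipate; anything unexpected propagates
    # and fails loudly where it can be managed correctly.
    logger.error("Storage service request failed: %s", err)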

Generating report can take a long time

Right now reports are opened in a separate tab. The administrator needs to make sure that server timeouts do not cut short the time needed to generate the reports, and the user only sees a browser spinning wheel in the new tab to indicate that report generation is underway. Perhaps this workflow could be made more user-friendly: when a report request is initiated, run the code that feeds the report in a background job, send a notification to the user when it's ready, and only open and populate the new tab once the data is ready to be fed into the report. A rough sketch of that idea follows.
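A rough sketch, assuming Celery with a Redis broker (the helper functions are hypothetical):

from celery import Celery

celery = Celery(__name__, broker="redis://localhost:6379/0")


@celery.task
def generate_report(report_id):
    """Build report data outside the request/response cycle."""
    data = build_report_data(report_id)  # hypothetical helper
    notify_user_report_ready(report_id)  # hypothetical helper
    return data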

Model class names should be capitalized and singular

AIPscan's model classes are currently all lowercase: https://github.com/artefactual-labs/AIPscan/blob/da8fc4c939a927cf0dbd54d7ab2fedc6f625e140/AIPscan/models.py

Per PEP 8, "Class names should normally use the CapWords convention."

Flask's own style guide suggests the same, with the added suggestion that acronyms be "kept uppercase (HTTPWriter and not HttpWriter)".

By convention, the class names should also be singular, so for example storage_services should become StorageService, aips should become AIP, and so on.

This will make it easier to identify Models in the code base, e.g. when they are used in SQLAlchemy queries.
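A sketch of one rename, assuming db is the app's Flask-SQLAlchemy instance (fields abbreviated):

# Before: class storage_services(db.Model)
# After:
class StorageService(db.Model):
    __tablename__ = "storage_services"  # the table name itself can stay plural
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(255))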

Sourcemap JavaScript errors on page load

When inspecting pages through Chrome developer tools we're seeing errors like this.

Examples:

DevTools failed to load SourceMap: 
Could not load content for http://127.0.0.1:5000/static/js/bootstrap.min.js.map: 
HTTP error: status code 404, net::ERR_HTTP_RESPONSE_CODE_FAILURE
DevTools failed to load SourceMap: 
Could not load content for http://127.0.0.1:5000/static/css/bootstrap.min.css.map: 
HTTP error: status code 404, net::ERR_HTTP_RESPONSE_CODE_FAILURE

Given the code is functional, these feel more like warnings than errors. There is an interesting link that discusses them; I have left it at that for now.

Delete Fetch Job times out

Deleting a Fetch Job can be a long-running task if the Fetch Job has 10k+ files in it. It deletes the download directory quickly, but takes a while doing the database cascade delete, and the web server times out.

The solution should be to move the Delete Fetch Job to a background Celery worker and report its status in the template when done.

Date trimming doesn't account for different formats

The current AIPscan code assumes METS timestamp values will be in the format "2020-06-12T16:19:21.230362+00:00". It trims 13 characters from the end to parse the value into a more user-friendly format. However, Archivematica might be installed on operating systems that don't include the timezone in their timestamps, so strptime() stops working on date strings that have been trimmed back too far. The solution is to trim date strings from the beginning, not the end, i.e. grab the first 19 characters.
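A sketch of the beginning-anchored parse (variable names are illustrative):

from datetime import datetime

mets_date = "2020-06-12T16:19:21.230362+00:00"
# The first 19 characters are always "YYYY-MM-DDTHH:MM:SS", whether or not
# a microseconds/timezone suffix follows.
parsed = datetime.strptime(mets_date[:19], "%Y-%m-%dT%H:%M:%S")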

Shall we enable branch protection now?

Branch protection is one of the tools we use to help improve the AtoM/Archivematica/Docs workflows. The basic rule is that the main branch is protected and requires one code review to merge. Like those projects, each PR would be connected to an issue. I do think it is better in the long term to do this as early as possible: it guarantees that two developers are involved in any particular change, which increases the chances of others being involved. But I understand it's desirable in the short term to be able to merge fast. What do you think @peterVG?

AIPscan cannot report on AIPs added post-fetch job

AIPscan uses a manually triggered pull task to fetch results from the storage service. As AIPscan starts to be used more, this will become increasingly counter-intuitive. The general consensus is to leverage the Archivematica storage service callback to update the database after new AIPs are stored.

"no such table: package_tasks" error on very first Fetch Job

Error output:

File "/Users/peter/Development/AIPscan/AIPscan/Aggregator/views.py", line 181, in new_fetch_job
    sql, (task.id,),
sqlite3.OperationalError: no such table: package_tasks

This is caused by the fact that the Workflow Coordinator task hasn't yet created this table in the celerytasks.db but the Fetch Job call is already trying to do a lookup in it.

Consider use of AMClient

Ross: If we do subsume this work into our current efforts then we might consider using AMClient per the Architectural Decision Record 0009. We've started wrapping it into Archivematica and while it can be a little temperamental it's well tested and wraps a lot of what we need. Anything that isn't wrapped will give us the opportunity to fortify the client further as part of these efforts. Related issue: archivematica/Issues#1152
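For context, roughly what AMClient usage looks like; treat this as an unverified sketch (attribute and method names from memory, check them against the AMClient documentation):

from amclient import AMClient

am = AMClient(
    ss_url="http://127.0.0.1:62081",
    ss_user_name="test",
    ss_api_key="test",
)
aips = am.aips()  # ask the storage service for its AIPs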

METSRW tries to call property for a non-existent object

METSRW tries to call a property for a non-existent object (inside a function that creates a new AMDSec by parsing the root element).

See https://github.com/artefactual-labs/mets-reader-writer/blob/master/metsrw/metadata.py#L94-L103

The error that is raised:

[2020-07-15 15:57:58,502: ERROR/ForkPoolWorker-4] Task AIPscan.Aggregator.tasks.get_mets[a24ce0ce-6e25-40d5-b117-ce5435944741] raised unexpected: AttributeError("'NoneType' object has no attribute 'tag'")
Traceback (most recent call last):
  File "/Users/peter/Development/AIPscan/venv/lib/python3.7/site-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/Users/peter/Development/AIPscan/flask_celery.py", line 18, in __call__
    return TaskBase.__call__(self, *args, **kwargs)
  File "/Users/peter/Development/AIPscan/venv/lib/python3.7/site-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/Users/peter/Development/AIPscan/AIPscan/Aggregator/tasks.py", line 271, in get_mets
    mets = metsrw.METSDocument.fromfile(downloadFile)
  File "/Users/peter/Development/AIPscan/venv/lib/python3.7/site-packages/metsrw/mets.py", line 593, in fromfile
    return cls.fromtree(etree.parse(path, parser=parser))
  File "/Users/peter/Development/AIPscan/venv/lib/python3.7/site-packages/metsrw/mets.py", line 617, in fromtree
    mets._parse_tree(tree)
  File "/Users/peter/Development/AIPscan/venv/lib/python3.7/site-packages/metsrw/mets.py", line 543, in _parse_tree
    tree, structMap, normative_parent_elem=normative_struct_map
  File "/Users/peter/Development/AIPscan/venv/lib/python3.7/site-packages/metsrw/mets.py", line 367, in _parse_tree_structmap
    tree, elem, normative_parent_elem=normative_elem
  File "/Users/peter/Development/AIPscan/venv/lib/python3.7/site-packages/metsrw/mets.py", line 371, in _parse_tree_structmap
    self._add_amdsecs_to_fs_entry(elem.get("ADMID"), fs_entry, tree)
  File "/Users/peter/Development/AIPscan/venv/lib/python3.7/site-packages/metsrw/mets.py", line 515, in _add_amdsecs_to_fs_entry
    amdsec = metadata.AMDSec.parse(amdsec_elem)
  File "/Users/peter/Development/AIPscan/venv/lib/python3.7/site-packages/metsrw/metadata.py", line 94, in parse
    if root.tag != utils.lxmlns("mets") + "amdSec":
AttributeError: 'NoneType' object has no attribute 'tag'

The workaround for now is to patch metsrw/metadata.py with the following code:

    @classmethod
    def parse(cls, root):
        """
        Create a new AMDSec by parsing root.

        :param root: Element or ElementTree to be parsed into an object.
        """
        # Compare against None explicitly: lxml elements are falsy when they
        # have no children, which is not the condition we care about here.
        if root is None:
            return None
        if root.tag != utils.lxmlns("mets") + "amdSec":
            raise exceptions.ParseError(
                "AMDSec can only parse amdSec elements with METS namespace."
            )
        section_id = root.get("ID")
        subsections = []
        for child in root:
            subsection = SubSection.parse(child)
            subsections.append(subsection)
        return cls(section_id, subsections)

There isn't an easy-to-access entity relationship diagram (ERD)

We don't have an ERD for AIPscan. We can access the schema from sqlite, but something ERD-like might also help developers.

sqlite> .schema --indent *
CREATE TABLE storage_services(
  id INTEGER NOT NULL,
  name VARCHAR(255),
  url VARCHAR(255),
  user_name VARCHAR(255),
  api_key VARCHAR(255),
  download_limit INTEGER,
  download_offset INTEGER,
  "default" BOOLEAN,
  PRIMARY KEY(id),
  CHECK("default" IN(0, 1))
);
CREATE UNIQUE INDEX ix_storage_services_name ON storage_services(name);
CREATE TABLE agents(
  id INTEGER NOT NULL,
  type VARCHAR(255),
  value VARCHAR(255),
  PRIMARY KEY(id)
);
CREATE INDEX ix_agents_value ON agents(value);
CREATE INDEX ix_agents_type ON agents(type);
CREATE TABLE fetch_jobs(
  id INTEGER NOT NULL,
  total_packages INTEGER,
  total_aips INTEGER,
  total_deleted_aips INTEGER,
  download_start DATETIME,
  download_end DATETIME,
  download_directory VARCHAR(255),
  storage_service_id INTEGER NOT NULL,
  PRIMARY KEY(id),
  FOREIGN KEY(storage_service_id) REFERENCES storage_services(id)
);
CREATE TABLE aips(
  id INTEGER NOT NULL,
  uuid VARCHAR(255),
  transfer_name VARCHAR(255),
  create_date DATETIME,
  originals_count INTEGER,
  copies_count INTEGER,
  storage_service_id INTEGER NOT NULL,
  fetch_job_id INTEGER NOT NULL,
  PRIMARY KEY(id),
  FOREIGN KEY(storage_service_id) REFERENCES storage_services(id),
  FOREIGN KEY(fetch_job_id) REFERENCES fetch_jobs(id)
);
CREATE INDEX ix_aips_uuid ON aips(uuid);
CREATE TABLE originals(
  id INTEGER NOT NULL,
  name VARCHAR(255),
  uuid VARCHAR(255),
  size INTEGER,
  puid VARCHAR(255),
  format VARCHAR(255),
  format_version VARCHAR(255),
  checksum_type VARCHAR(255),
  checksum_value VARCHAR(255),
  related_uuid VARCHAR(255),
  aip_id INTEGER NOT NULL,
  PRIMARY KEY(id),
  FOREIGN KEY(aip_id) REFERENCES aips(id)
);
CREATE INDEX ix_originals_related_uuid ON originals(related_uuid);
CREATE INDEX ix_originals_puid ON originals(puid);
CREATE INDEX ix_originals_uuid ON originals(uuid);
CREATE INDEX ix_originals_name ON originals(name);
CREATE TABLE copies(
  id INTEGER NOT NULL,
  name VARCHAR(255),
  uuid VARCHAR(255),
  size INTEGER,
  format VARCHAR(255),
  checksum_type VARCHAR(255),
  checksum_value VARCHAR(255),
  related_uuid VARCHAR(255),
  normalization_date DATETIME,
  aip_id INTEGER NOT NULL,
  PRIMARY KEY(id),
  FOREIGN KEY(aip_id) REFERENCES aips(id)
);
CREATE INDEX ix_copies_name ON copies(name);
CREATE INDEX ix_copies_related_uuid ON copies(related_uuid);
CREATE INDEX ix_copies_uuid ON copies(uuid);
CREATE TABLE events(
  id INTEGER NOT NULL,
  type VARCHAR(255),
  uuid VARCHAR(255),
  date DATETIME,
  detail VARCHAR(255),
  outcome VARCHAR(255),
  outcome_detail VARCHAR(255),
  original_id INTEGER NOT NULL,
  PRIMARY KEY(id),
  FOREIGN KEY(original_id) REFERENCES originals(id)
);
CREATE INDEX ix_events_uuid ON events(uuid);
CREATE INDEX ix_events_type ON events(type);
CREATE TABLE event_agents(
  event_id INTEGER,
  agent_id INTEGER,
  FOREIGN KEY(event_id) REFERENCES events(id),
  FOREIGN KEY(agent_id) REFERENCES agents(id)
);

Alerts for Fetch METS failures and a method to retry

AIPscan is intended to be able to scan large volumes of AIPs, so we need to expect that some AIPs will not be retrieved or parsed correctly. We need to define requirements for catching and managing these types of errors (which could include manual error-handling workflows).

Originals and Copies database tables could be collapsed into single Files table

As can be seen in the ERD in #42, the AIPscan database currently has separate tables for original files (originals) and preservation derivatives (copies). The majority of fields in the two tables are identical, and it seems to me that they could be collapsed into a single files table.

This would make the development experience a bit smoother. As an example, to get information about the largest files, we currently have to query two tables:

if original_files is True:
    largest_files = originals.query.order_by(originals.size.desc()).limit(20)
else:
    largest_files = copies.query.order_by(copies.size.desc()).limit(20)

I'd prefer to be able to do something like:

largest_files = files.query.filter_by(type=file_type).order_by(files.size.desc()).limit(20)

Or, to get the largest files regardless of type, simply:

largest_files = files.query.order_by(files.size.desc()).limit(20)

We could manage the file type via a type field controlled by an enumerated list, or with an is_original boolean. The enumerated list might age better if we anticipate eventually tracking information about other types of files, such as metadata and access derivatives, in addition to original files and preservation derivatives. A sketch of the enumerated approach follows.
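A minimal sketch, assuming db is the app's Flask-SQLAlchemy instance (names and fields are illustrative):

import enum


class FileType(enum.Enum):
    original = "original"
    preservation = "preservation"
    # room to grow: metadata, access, ...


class File(db.Model):
    __tablename__ = "files"
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(255))
    size = db.Column(db.Integer)
    file_type = db.Column(db.Enum(FileType), index=True)


largest_files = File.query.filter_by(file_type=FileType.original).order_by(File.size.desc()).limit(20)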

Let's use isort

We have a proposed ADR: https://adr.archivematica.org/0004-isort-import-ordering.html. In general I do think it makes life easier: it's one less thing to make code review subjective, and one less thing to introduce inconsistency across modules.

I ran pylint before and after, and the score was nice to read:

  • Cmd: artefactual-labs/AIPscan/AIPscan$ isort $(find * | grep .py$)
  • Pylint output: Your code has been rated at 7.32/10 (previous run: 6.99/10, +0.33)

It is difficult to minimize METS for testing

Recording the thought here: it might be useful to have some way to minimize a METS file before recording it as a test fixture. Is it possible via mets-reader-writer? Optimizing the space we use for fixtures is a pain when you're trying to write the actual feature you're interested in, and devs should be motivated to write tests. We have the option of mocking responses and serializing smaller etree objects, but what if the integration-level testing is just as important?

I've turned off events in some of the METS I am submitting, but we can do more to remove sections of the PREMIS; in the example I am working on, possibly all the amdSecs, if we also remove the references to them in the fileSec. A rough sketch of that idea follows.
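A rough sketch with plain lxml rather than mets-reader-writer (file names are illustrative):

import lxml.etree as etree

METS_NS = "{http://www.loc.gov/METS/}"

tree = etree.parse("full_mets.xml")
root = tree.getroot()

# Drop every amdSec from the document.
for amdsec in root.findall(METS_NS + "amdSec"):
    root.remove(amdsec)

# Remove the now-dangling ADMID references, e.g. on fileSec file elements.
for elem in root.iter():
    if "ADMID" in elem.attrib:
        del elem.attrib["ADMID"]

tree.write("minimized_mets.xml", xml_declaration=True, encoding="UTF-8")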
