
openedx-unsupported / blockstore

Open edX Learning Object Repository

License: GNU Affero General Public License v3.0

Makefile 2.89% Python 95.68% HTML 1.43%

blockstore's People

Contributors

0x29a, adzuci, agrendalath, aht007, bradenmacdonald, bszabo, connorhaugh, doctoryes, dyudyunov, edx-requirements-bot, feanil, iamsobanjaved, jawayria, jristau1984, jvdm, kdmccormick, kelketek, kenclary, mariajgrimaldi, mavidser, nedbat, pkulkark, pomegranited, rayzhou-bit, sarina, sspj, symbolist, usamasadiq, xitij2000, zubairshakoorarbisoft


blockstore's Issues

Signing in to blockstore raises an invalid scope error

I have my Open edX devstack set up and my blockstore containers running. I'm following this guide to configure blockstore for my devstack instance: https://github.com/edx/blockstore#using-with-docker-devstack. I have configured SSO to be able to log in to blockstore; however, I get an invalid scope error. Below are screenshots of my configuration.
[Screenshot: oauth2_provider configuration]

[Screenshot: oauth_dispatch configuration]

[Screenshot: private.py]

[Screenshot: settings.py]

After visiting http://159.89.49.124:18250/login I'm redirected to http://159.89.49.124:18000/oauth2/authorize?redirect_uri=http%3A%2F%2F159.89.49.124%3A18250%2Fcomplete%2Fedx-oauth2%2F%3Fredirect_state%3DBH9V61XLz768nPnxpcfX5Piu7MRHYgBb&client_id=blockstore-sso-key&response_type=code&state=BH9V61XLz768nPnxpcfX5Piu7MRHYgBb&scope=user_id+profile+email, which throws an invalid scope error. I'm not sure why my scope is being set to user_id+profile+email when I configured it to just user_id. I tried updating my scope to user_id,profile,email in the second screenshot, but I still get the same invalid scope error. Any help will be much appreciated, thank you.

Nuances of the bundle format

Warning: brain dump.

At this time, the evolution of the bundle format used for content libraries in blockstore-relay and my draft runtime PR is described in this docstring.

One aspect of this design that has proven a bit more complex than I hoped is the following:

        Any OLX file that represents an XBlock type that can have children may
        contain <xblock-include> nodes, like this:
            <unit>
                <xblock-include definition="html/intro" />
                <xblock-include definition="html/intro" usage="alernate" />
                <xblock-include definition="linked_bundle/html/intro" usage="linked_intro" />
            </unit>
...
        Any block which exists in the main bundle (not linked in to the bundle)
        and which is not a child of other blocks (no <xblock-include /> nodes
        reference it) is called a "top-level block". Its usage ID always matches
        its definition ID.

Another way of phrasing the existing design is:

  • Every definition.xml file represents a definition
  • All definitions have one or more usages
  • If an <xblock-include ...> directive exists anywhere in the bundle referencing a particular definition, it's an explicit usage of that definition
  • If no <xblock-include ...> directives in the bundle reference a particular definition, that definition is called a "top level block" and gets an implicit usage with the same usage ID as its definition ID. Top level blocks are the blocks that one sees when listing the contents of a library.

This design has these nice properties:

  • It's impossible to get "orphaned" definitions: either a definition has one or more parents, or it's a "top level" block.
  • There is no need to maintain a separate data file / structure that lists top level blocks, because top-level-ness can be computed by scanning the bundle.
  • A usage can be moved from a top-level usage to a child of another block in the bundle without changing its usage ID, and vice versa. (This is a design requirement.)

But it has these problems:

  • A block cannot be a top-level block in the library and also be used as a child usage of some other block. (Most problematic is this situation: the library bundle contains two top-level blocks A & B. Now a user adds a new usage of block B as a child of A. Instead of now having A & B in the library with a second usage of B as a child of A, the library now contains only A as a top-level block, with a usage of B as its child. Creating a "second" usage of B has demoted B and made it no longer a top-level block. This may be very surprising behavior for users, because the user was not trying to move block B underneath A but simply to create a second usage of B.)
    • A workaround if you really understand what's going on and want to keep B as a top level block is to create another empty top level block C and add another usage of B as a child, creating a wrapper around B. But that's messy.
    • This issue doesn't affect blocks in linked bundles, because the "top level" status of blocks is not affected by usages in other bundles. If we expect (or require?) that the vast majority of multiple-usages-per-definition happens via inter-bundle links (e.g. courses linking to library content) and not intra-bundle includes (e.g. course with the same leaf html block in two places), then this is largely a non-issue.
  • Looking up a usage ID and verifying that usage IDs are unique can require scanning all the OLX files in the bundle (to check <xblock-include usage="..." ...> elements). This can be slow, although cacheable. (Also, in practice this is optimized considerably since the runtime knows not to bother reading definition files of types like html that cannot have children.)
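To make the scanning cost in that last point concrete, here is a minimal sketch (my own illustration, not code from the repo) of computing top-level blocks under this design, assuming definitions live at <block_type>/<name>/definition.xml and local definition IDs have exactly two path components:

    import pathlib
    import xml.etree.ElementTree as ET

    def top_level_definitions(bundle_root):
        """Return definition IDs that no <xblock-include> references.
        These are the bundle's top-level blocks; each gets an implicit
        usage whose ID matches its definition ID."""
        root = pathlib.Path(bundle_root)
        definition_files = list(root.rglob("definition.xml"))
        definitions = {p.parent.relative_to(root).as_posix() for p in definition_files}
        referenced = set()
        for path in definition_files:
            for include in ET.parse(path).getroot().iter("xblock-include"):
                ref = include.get("definition", "")
                # Linked-bundle refs have an alias prefix (three components)
                # and don't affect top-level status within this bundle.
                if len(ref.split("/")) == 2:
                    referenced.add(ref)
        return sorted(definitions - referenced)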

So although those may not be big problems, they don't feel like a particularly elegant or clean design. I would be open to better ideas before anything gets too entrenched.

Some ideas I considered but discarded:

  • Having a manifest file that explicitly lists the usages of top-level blocks in the bundle. (Rejected because we need to be able to publish changes to individual blocks at a time, and there's no way to atomically publish changes to only one line in a manifest file using the blockstore drafts API while leaving other changed lines in the draft unaffected. Also because this can create orphaned definitions that are neither top-level blocks nor children of any other blocks. Also because IDs in the manifest can become out of sync with the definitions if there are any bugs or write conflicts.)
  • Identifying top level blocks via the presence of an empty file like .top-level alongside the definition.xml file: this allows draft/publish of individual blocks and marking blocks as top-level or not regardless of how many usages they have (nice), and it can never have IDs that are out of sync, but it still allows orphaned blocks. It's also a bit inelegant to have these extra files.
    • An alternative is allowing the definition files to be named either definition.xml or child-definition.xml depending on whether they are "top level with optional other usages" (definition) or "not top level" (child-definition). Still allows orphans though.
    • Perhaps we don't care if orphans exist, or we still need to scan through the bundle's various <xblock-include> directives anyway, so we can detect them and display them as top-level blocks along with a warning.

Django 4.2 Upgrade

Description

As part of the Django 4.2 upgrade effort, complete all of the following steps.

  • Update tox & the GitHub Actions workflows using modernisers to add support for Django 4.2
  • Remove any versions of Python earlier than 3.8 from tox.ini and the GitHub Actions workflows.
  • Update the pinned Django version in the requirements to Django==4.2
  • Run make upgrade to update all dependencies for Django 4.2.
  • Run CI and verify that all tests pass for both Django 4.2 and Django 3.2, to maintain backward compatibility.
  • Run available code-mods to fix the failing tests.
  • Add conditional checks wherever needed to still have support for Django 3.2 (one possible shape is sketched after this list).
  • Update the repo support field in the IDA Upgrade Sheet.
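For the conditional checks mentioned above, one real shape such a gate can take (illustrative; which gates are actually needed depends on what the failing tests reveal) is a version check against django.VERSION:

    import django

    if django.VERSION >= (4, 2):
        # The STORAGES setting and the storages registry are new in Django 4.2.
        from django.core.files.storage import storages
        default_storage = storages["default"]
    else:
        # Keep the Django 3.2 code path alive until 3.2 support is dropped.
        from django.core.files.storage import default_storage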

Import/Export Mapping Logic and a Dedicated CLI Client?

@pdpinch, @JAAkana, @bradenmacdonald, @symbolist, @pomegranited:

Extracted out of issue #16: What is the ideal course import/export format, keeping in mind that we will have the ability to associate assets with individual leaf XBlocks.

We know that a number of technically advanced course teams generate OLX from existing banks of problems or do all their authoring in XML and use import/export as their primary publishing workflow. However, there are some other use cases we also need to keep in mind...

Other Known Use Cases

A list of known import/export use cases compiled by @JAAkana:

Rerunning (v1)

A course team created an empty Studio ‘shell’ for their rerun months ago, and they’re ready to finally load the content of the current run into that shell today (they’d copy over their current course automatically if they could).

Rerunning (v2)

A course team may have originally created a rerun via auto-copying their current run, but some other version of their course turned out to be better.

Backups

I’m going to do something risky to my course and I need a backup copy - the import needs to be exactly the same to preserve location-ids / access to student data

Course division / course chimera (rare)

I’ve run one giant 18-week course, and I’d like to split it into 3 small courses without copy/pasting everything. Or, I’ve run 3 small courses and I want to repackage them as 2 new courses.

Moving content between multiple instances of the platform

Most commonly, this involves moving courses from test runs on Edge to MOOC runs on edX.org. Some course teams also move content from their own instances to edX controlled ones and vice versa.

Libraries

Teams want to use their MOOC problem banks on campus or vice-versa

XML editing outside of Studio

Conditional modules, changing a course’s wiki slug, adding user-readable unit URLs, etc. These are sometimes edits that either cannot be supported by the Studio UX at all (because no appropriate handler was written for them), or else are really cumbersome to do with existing editors (e.g. course-wide search and replace).

Retrieving files (rare)

I’ve uploaded a bunch of assets to Studio months ago and now I want copies of them - it is easier to get them via export than by clicking on each.

Retrieving data (rare)

I want a bunch of info from my course that’s hard to find click-wise, so I’ll export it and use the XML: e.g. ‘What are all the YouTube IDs?’ or ‘Where did I say “week” instead of “lesson” as I update to convert to self-paced?’

Seeding a new course (rare)

I have an introductory sequence that I’d like to appear in a lot of my courses - I’ll import this content and then build the rest of my course.

Translating from Another File Format

Converting content from another format (LaTeX, Markdown) into a format edX can consume (could use a machine- or human-readable format). Open edX lives in an ecosystem, and it's not unusual for folks to want to convert to and from its format.

Blockstore can't serve SVG images correctly over S3

When one uploads an SVG image to a Blockstore server that's storing data on Amazon S3, Blockstore will serve the image from a URL like

https://our-blockstore-content.s3.amazonaws.com/ca9635e6-a43e-40df-be67-1764d4a6b311/snapshot_data/6bc0efe8f1d9722615ca1d7042b85d5af115ca6f?AWSAccessKeyId=...&Expires=1575574652

S3 in turn serves that response with Content-Type: application/octet-stream

Unfortunately, this means that one cannot use <img src="blockstore data URL" /> because browsers will not render SVG images that aren't served with the correct Content-Type header.

Possible solutions include:
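  • Forcing the Content-Type on the presigned URL. The stored objects are named by hash and have no extension, so S3 can't guess the type, but boto3 lets the URL itself dictate the response header. A minimal sketch (my illustration, not Blockstore code; original_name is an assumed parameter carrying the file's logical name from bundle metadata):

    import mimetypes

    import boto3

    s3 = boto3.client("s3")

    def data_file_url(bucket, key, original_name):
        """Presign a GET URL that forces the right Content-Type header,
        derived from the logical filename recorded in bundle metadata."""
        content_type, _ = mimetypes.guess_type(original_name)
        return s3.generate_presigned_url(
            "get_object",
            Params={
                "Bucket": bucket,
                "Key": key,
                # S3 echoes this back as the response's Content-Type header.
                "ResponseContentType": content_type or "application/octet-stream",
            },
            ExpiresIn=3600,
        )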

[DEPR]: Deployment of Blockstore as an independent service

2024 UPDATE: This was accepted, but it is also superseded by a later DEPR ticket, which will remove all of Blockstore and all references to it.


Proposal Date

2023-10-27

Target Ticket Acceptance Date

2023-11-10

Earliest Open edX Named Release Without This Functionality

Redwood

Rationale

It was decided three years ago that Blockstore, an authoring-oriented storage backend, makes more sense as an installable app than it does as an independently-deployable micro-service.

Since then, it has been possible both to deploy it as a service and to install it as an application, but this dual capability comes at a high cost: layers of wrapper APIs, mostly-dead caching code, test suites that are 80% skipped, and complicated deployment instructions leading to superfluous tooling code.

Removal

Removal Tasks
- [x] In edx-platform: [remove the use_blockstore_app_api toggle](https://docs.openedx.org/projects/edx-platform/en/latest/references/featuretoggles.html#featuretoggle-blockstore.use_blockstore_app_api), keeping the "true" codepaths
- [x] In edx-platform: Remove content libraries V2 indexing and related settings.
- [x] In edx-platform: Remove `@skips` on content_libraries tests, which currently rely on Blockstore running as a micro-service. Make sure tests all pass. If some are seriously broken, put `@skip` back in and create follow-up tickets.
- [x] In edx-platform: Remove [blockstore methods wrapper](https://github.com/openedx/edx-platform/blob/master/openedx/core/lib/blockstore_api/methods.py).
- [ ] In edx-platform: Remove the [blockstore_api tests](https://github.com/openedx/edx-platform/tree/master/openedx/core/lib/blockstore_api/tests). If any mixins there are still needed for different tests, move them elsewhere. For example, they could be moved to the content_libraries app, or maybe a test
- [ ] In edx-platform: Remove the [blockstore_api wrapper module](https://github.com/openedx/edx-platform/blob/master/openedx/core/lib/blockstore_api/__init__.py). Replace imports to it with direct imports to the underlying blockstore functions.
- [ ] In edx-platform: In the [content_libraries app](https://github.com/openedx/edx-platform/blob/master/openedx/core/djangoapps/content_libraries)'s tests, there are a few classes [like this one](https://github.com/openedx/edx-platform/blob/aaea6e5/openedx/core/djangoapps/content_libraries/tests/test_content_libraries.py#L1213-L1219) which needlessly delegate out to a mixin. This is leftover from when blockstore was both an app & service. Collapse each mixin's body into its test class.
- [ ] In edx-platform: Move [blockstore db_routers](https://github.com/openedx/edx-platform/blob/master/openedx/core/lib/blockstore_api/db_routers.py) so that the blockstore_api folder can be deleted. Suggested new location: openedx/core/lib/blockstore_db_routers.py
- [ ] ~In blockstore: Update [openedx.yaml](https://github.com/openedx/blockstore/blob/aca35cafed182b0b28197fdd9427d00bc20742a9/openedx.yaml#L12) so it isn't tagged in release.~
- [ ] ~In blockstore: Clean up the [readme](https://github.com/openedx/blockstore/tree/master#blockstore).~
- [ ] ~In blockstore: Remove all files, Makefile targets, and dotfile lines relating to devstack, docker, and docker-compose.~
- [ ] ~In blockstore: Remove [settings files](https://github.com/openedx/blockstore/tree/master/blockstore/settings).~
- [ ] ~In blockstore: Remove [wsgi config](https://github.com/openedx/blockstore/blob/master/blockstore/wsgi.py).~
- [ ] ~In blockstore: Look through [base requirements](https://github.com/openedx/blockstore/blob/master/requirements/base.in) and remove any that were there just because it was a micro-service.~
- [ ] ~In blockstore: Update [this catalog-info comment](https://github.com/openedx/blockstore/blob/aca35cafed182b0b28197fdd9427d00bc20742a9/catalog-info.yaml#L17).~
- [ ] ~In blockstore: Do a final sweep of the code and docs to make sure there's nothing left there that's specific to running Blockstore as a service.~
- [ ] In frontend-app-library-authoring: Remove the part of the readme that tells you to run blockstore ("Now set up blockstore. Blockstore...").
- [ ] In edx-platform and tutor: Set configuration defaults so that blockstore (the app) works out-of-the-box using filesystem storage.
- [ ] In openedx-tutor-plugins: remove any parts of the tutor-contrib-blockstore-minio plugin that are now redundant.
- [ ] In openedx-tutor-plugins: remove the tutor-contrib-blockstore-filesystem plugin.

Replacement

Blockstore will continue to exist and be actively maintained as an installable Django app.

Deprecation

No plans to mark deprecation.

Migration

We will not provide any migration instructions because we are not aware of anyone using Blockstore as a micro-service.

If we are wrong and you do run Blockstore as a micro-service, please reach out and we can help you figure out how to merge your Blockstore database into your LMS/CMS database.

Additional Info

No response

Blockstore: not authorized to access this page

I followed the instructions for setting up Blockstore alongside the devstack. I added the Application and application access as specified; however, I am seeing this message:

You are authenticated as edx, but are not authorized to access this page. Would you like to login to a different account?

I tried adding user "edx" in the application screen, but that did not help. I think the instructions need more updating.

Separating Policy from Content

@bradenmacdonald, @symbolist:

One of the things we've discussed has been separating content (e.g. text of a problem) from policy (e.g. due date). That distinction gets murky in a lot of places, as we have things like CCX or policies that might only apply to specific block types at the moment, like capa randomization settings.

At the Open edX 2019 conference, @Colin-Fredericks mentioned his tool for policy modifications to a course, which I think is a useful place to look for some real-world-tested solutions to this problem: https://github.com/Colin-Fredericks/hx-xml

A really interesting point to me is that the selection criteria in these scripts can be a lot more sophisticated than some of the conversations we were having in the past. We talked about setting things on a block-type/assignment-type basis, but here we have rules that introspect details about a problem such as the following logic for setting max attempts:

Your options for the number are:
  - An actual number, which sets all problems to the same # of attempts
  - "delete" or "default", which removes the value so that
    course-wide default takes over
  - "auto" (recommended), in which case attempts are set as follows:
    - Numerical and Formula problems get 10
    - Customresponse and Text problems get 5
    - Checkbox problems get a number of attempts equal to the number of choices, max 5.
    - MC problems get...
      - 3 if they have 7+ options
      - 2 if they have 4-6 options
      - 1 if they have 2-3 options
    - Multi-problems get the highest number of the set.
    - Other problem types are skipped.

I'm making this issue as a place to discuss how we can support this sort of separation on the Blockstore side. I imagine we'll get to this after Links and some of the XBlock runtime work in flight, but I wanted to pin it so we don't forget.

Why Files?

I'm re-posting a question from @regisb that he asked on the Confluence comments page. I wanted to have the discussion here because it's more likely to be found and to be useful. My apologies for not getting around to replying until after the holidays. :(

Quoting @regisb from the wiki design doc:

I'm a bit surprised by the choice of files for storing content. My 2 cents:

The data structure should be chosen by taking into consideration what kind of read/write access will be required. For instance, file systems are not good at answering the question "what is the most recent file in this folder" (they don't have an index on dates). And, it seems to me that we are frequently going to have to make such queries, for instance to get the latest version of a content element.

I think we all agree that accounting for read/write data access patterns is critical, and was much of the focus of issues #16 and #26. Accounting for the dependencies (and transitive dependencies of dependencies) is part of the motivation for having all that information stored in a single snapshot summary file, so that we can get that sort of data for any given BundleVersion with a single file request.

Also, frankly, with the amount of content that edX has, using an object store is just way cheaper than paying for the equivalent amount of database storage.

Filesystems are bad at searching: will we have to rely on a grep-like tool (i.e: slow) when searching for content?

Search capabilities will be provided by Elasticsearch, which we would still have used even if we went with a more SQL-based solution. We'll eventually want to index a bunch of things (PDF files, subtitles, etc.) and MySQL's full text search capabilities are pretty limited.

Separating data in two different storage systems (filesystem and SQL db) requires some synchronization, which is a hard problem.

Yeah. We somewhat mitigate this by having it so that the database entries are more or less pointers to immutable file system snapshots (with some deduplication on the snapshot side so we don't make useless copies of files that don't change between versions). So the synchronization is one way in that sense -- we first build a Snapshot at the file system level, and then point to it with a BundleVersion at the database level. If a failure happens halfway through the process, we might have created a Snapshot that never gets pointed to. But it's orphaned data that will be ignored, as opposed to having something two-way where conflicts can arise.
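To illustrate that ordering (a sketch of the one-way synchronization with hypothetical names, not the actual implementation):

    from django.db import transaction

    def commit_draft(draft):
        # Step 1: write the immutable snapshot to file storage.
        snapshot_id = write_snapshot_to_storage(draft.files())  # hypothetical helper
        # Step 2: only then record the pointer in SQL. A crash between the two
        # steps leaves an orphaned snapshot that is simply ignored, never a
        # database row pointing at missing files.
        with transaction.atomic():
            BundleVersion.objects.create(bundle=draft.bundle, snapshot_id=snapshot_id)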

Is it going to be a requirement for large Open edX platforms to have an object storage platform like S3 to store assets? Filesystems are bad at storing many files per folder; does it mean that object storage is the only way to go?

I think that object storage will be the preferred deployment method for most folks, but it should work with filesystems. The current design already groups data files by Bundle, and so would require that the file system support ~10K files in a given folder to yield decent performance with large courses. My understanding is that this is okay for modern filesystems. If it's not, we can further sub-divide the files by prefix, so that instead of raw data files being a "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx..." file, they can be "xxxxx/xxxxx/xxxxx/xxxxx..." etc.
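For illustration, that prefix sub-division could be as simple as the following sketch (width and depth are arbitrary choices here):

    def sharded_path(file_hash, width=5, depth=2):
        """'abcdefghij...' -> 'abcde/fghij/abcdefghij...', so that no single
        directory has to hold tens of thousands of entries."""
        parts = [file_hash[i * width:(i + 1) * width] for i in range(depth)]
        parts.append(file_hash)
        return "/".join(parts)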

I'm really not familiar with how Django implements its file-based storage API or large directory performance with more recent file systems (I have definitely been bit by that back in the ext2 days). What file systems are you thinking of, and what are the practical limits we should take into account?

Files don't have schema: one of the current major issues with xblocks is that they are extremely difficult to migrate, whenever their definition changes. Backward compatibility becomes very hard to maintain. Files have the same problem.

I completely agree, but addressing that is not a goal we have for Blockstore. I think that explicit versioning is something that's going to have to be added to the serialization format, but that's a whole separate topic. Blockstore is intended to be a dumb storage layer, with XBlock understanding happening at the caller level (so we don't get another giant Modulestore mess of coupled storage/XBlock logic).

(I am very new to the blockstore discussion, so it's quite possible my comments are completely irrelevant (smile))

You are very familiar with Open edX, have valuable perspectives on deployment issues, and are the author of what is still probably my all time favorite presentation on Open edX internals. So please continue to ask away. We did a lot of soul searching with this question in issue #26 a month ago, so I think using files as a basic approach is going to stay, but we can certainly look at what we can do to make sure it performs well outside of S3-like object stores.

Relationship to edx-val (Video Abstraction Layer)

Background of VAL Today

We created edx-val (Video Abstraction Layer) so that centralized video teams could change video content outside the normal draft/publish flow of courseware editing in Studio. The immediate use case was to add multiple resolutions of those videos and enable HLS support for the benefit of mobile devices. VAL acts as both a REST API and an XBlock runtime service (though it predates XBlock runtime services and is not formally one, it operates in the same way). Video metadata in VAL overrides all other settings.

The main problems that I'm aware of are:

  1. Inefficient querying: edx-val doesn't keep careful track of what parts of what courses reference which videos, so courseware often makes an excessive number of calls when fetching metadata needed to render Video XBlocks in a sequence.
  2. The ownership is muddled for things like subtitle uploads. If two courses use the same edx-val video and one uploads new subtitles via course import, should VAL allow the update?
  3. VAL adds to the soup of different video settings that the VideoBlock has to respect. There are now at least three different ways to specify subtitles, for instance.

Long Term Goal

I believe that it would simplify the overall system if VAL were eventually eliminated and Video became just another kind of content library. This would bring it into line with the same conventions around ownership, versioning, and metadata/tagging that we have planned for other content, and would walk us back from today's position of always treating videos as a special case. We could still keep the basic outline of VAL's REST API if that's convenient, but change the backing store to Blockstore (I think the value of this is debatable, given that VEDA is the only thing that ever used it).

If that's a reasonable end goal, then the question is how do we make progress towards that world without doing a complete rewrite of the VideoModule?

I'll post a few possible directions in this thread...

Blockstore Test Cases in edxapp

As we build out the runtime and other Blockstore-based functionality in the LMS/Studio, we'll need a way for tests to interact with Blockstore.

Is there a way to have an isolated instance of Blockstore just for tests?

Or should we add some middleware that detects a request parameter and then uses a different MySQL database and filestore for that request? Then in the LMS/Studio blockstore API client we could inject that header into each request if running in test mode.

Or other ideas?

Exported Content Reuse and Human Readable Format Conventions

This issue is branched off from https://github.com/open-craft/blockstore/issues/19#issuecomment-440393721 and is to discuss a particular use case for exporting course content.

As a course owner, I have needed to export a course and then extract certain components of it, like a sequence, a unit, or even a problem. Sometimes I extract these components to build a new course, for example, to create a survey course out of a deep, extensive course. Sometimes I extract these components because I want to import them (as a sequence, unit or problem) into another application altogether, like DART.

Publish to PyPI & install from PyPI in edx-platform

Background

This package is currently installed into edx-platform directly from GitHub. This is non-ideal because:

  • GitHub-hosted requirements take significantly longer (> 4x) to install than PyPI-hosted requirements
  • GitHub-hosted requirement pins cannot be upgraded using our standard make upgrade workflow.

Request

  • Audit this package's setup.py to make sure it looks right before publishing. You can test it out by:
    • creating a virtual environment
    • installing package requirements
    • running python setup.py sdist bdist_wheel
    • checking the contents of the generated dist/ folder. The folder should contain a .tar.gz that can be unpacked, yielding the contents of what would be pushed to PyPI. The unpacked .tar.gz should include all Python source files (other than test files) as well as a LICENSE and any assets/resources needed for the block to work.
  • Publish this package to PyPI whenever a release happens.
  • Create a release in order to trigger an initial publish to PyPI
  • In edx-platform, remove the blockstore GitHub pin from github.in, and add blockstore to base.in, which holds all PyPI-hosted base dependencies. Run make upgrade so that the change is reflected in the requirements *.txt files.

Bundle Granularity

Capturing notes from the conversation @bradenmacdonald, @symbolist, and I had with @JAAkana on the subject of course publishing granularity.

The types of publish that exist today:

  • publish unit (most commonly used)
  • publish section (on outline page)
  • publish subsection (on outline page)
  • publish entire course (only via import)

The primitives we have to play with are BundleVersions, Links, and Files. We want to model the data so that:

  1. Common use cases with existing courses are operationally simple.
  2. We are set up for a long-term vision of more dynamic course content.
  3. Import/export is still straightforward.

Making publishes happen at a level more granular than a Bundle is possible but might introduce complications -- we could make it so that you partially commit a Draft, e.g. one file that represents the Unit, but that could get complicated with dependencies.

Comparisons: Modeling Hierarchy

@bradenmacdonald, @symbolist, @pdpinch:

This is a spinoff of #16, where we can dive more into hierarchy modeling specifically. I’m trying to be even-handed in my analysis, but I obviously have both an imperfect understanding of the proposals on the table and a personal preference, so please treat this issue description like a wiki and edit it until you feel it fairly represents the tradeoffs.

The proposals on the table are:

  • Bundle Per Block (BPB)
  • Hierarchical One Bundle Course (HOBC)
  • Flat One Bundle Course (FOBC)
  • Course Bundle, Unit Bundle (CBUB)

BPB and HOBC treat hierarchy modeling as a Blockstore primitive. BPB does this by narrowing the scope of a Bundle to only cover single Blocks, and leaning on our more-or-less original formulation of BundleVersion Links to string things together. HOBC uses a combination of file system conventions and a new Folder primitive that provides a more nuanced version of Links that allows for version-relative linking. FOBC and CBUB punt hierarchy modeling responsibility entirely to systems outside Blockstore, by storing the OLX for those relationships but not having any built-in understanding of how those relationships work.

Some of the driving use cases here:

  • Re-use of non-leaf blocks like Units.
  • Granular course publishing (at course, chapter, sequence, unit, and possibly even individual leaf levels)
  • Adaptive use cases calling for a more flexible/dynamic traversal of content.
  • Remixing of larger content that lives primarily in a course (e.g. a sequence).
  • Potentially restructuring existing Course Content (CCX example)

Of those, the first one is providing the strongest push to model hierarchy explicitly.

Bundle Per Block (BPB)

BPB’s core strength is that it is both conceptually simple and highly flexible. It can represent anything the other two can, and its concept of Links is more straightforward than HOBC's Folders since it’s just an M:M relationship between BundleVersions. There is a bit more complexity in grouping commits together logically (four Bundles changing could be the result of one meaningful edit), but that can be accommodated by modeling those commits as first-class entities of their own and having a 1:M relationship with the created BundleVersions.

The main drawbacks of BPB are operational. Import operations could potentially affect thousands of separately committed Bundles that need to be updated together in a transaction. Simple updates to leaf nodes would require chained BundleVersion creation. The proliferation of BundleVersions and M:M relationships is a longer-term scaling concern, particularly in the absence of a native graph database that can efficiently store and traverse such relationships, though using such a database would present other operational concerns since it would require moving away from our more established stack. The ratio of metadata to actual data is also potentially high, since we need to create a Bundle, a BundleVersion, and any associated relationship metadata for a leaf node that might amount to just a single tag with an attribute. This is less a storage concern than it is a "how many rows do we need to write and read from the database" concern.

It’s possible that these issues can be mitigated through more efficient representations, though there’s another tradeoff there in that such representations tend to add runtime complexity and read performance penalties.

Hierarchical One Bundle Course (HOBC)

HOBC attempts to maintain some of the simplicity of BPB while greatly reducing the operational complexity through the introduction of an additional primitive. Like BPB, there is one primitive that represents most things in the system: a "Folder" can represent a course, section, subsection, unit, XBlock, pathway, program, etc. Unlike BPB however, this smallest primitive is not versioned on its own.

The primitives become:

  • Collection - holds ownership, licensing, permission, and other metadata for a set of things with a common author
  • Bundle - a versioned container of typed folders. Container for all the data for a Course, Problem Bank, Content Library, Pathway, or Program.
  • Folder - a "thing" that belongs to a Bundle. Examples: a course outline, an XBlock, a section, a subsection, or a unit. A folder is owned by one bundle but may appear in other bundles.

The key difference is that when Folders refer to other Folders within the same Bundle (for example, a Unit folder declares that it is comprised of several specific XBlock Folders), it is an unversioned reference. So for example, when importing a course with 1,000 XBlocks, only a single new BundleVersion is created (whereas BPB would create 1,000+ BundleVersions that can be linked to independently).

The advantages of this model are that the representation of a single typical course is very operationally efficient, and that content reuse can happen at any granularity - XBlock, unit, subsection, section, course, or pathway.

One significant shortcoming of HOBC is that determining the entire course hierarchy/outline requires reading many different files from many different folders (read sections from course.json in the course folder, then read each section's section.xml from each section's folder, then read each subsection's subsection.xml from each subsection's folder, and so on...)

Flat One Bundle Course (FOBC)

The main focus of FOBC is implementation and operational simplicity, and so it takes the view that the storage layer of Blockstore shouldn’t try to model hierarchical relationships at all. A student’s path through a course is an LMS/Compositor level concern that structural OLX is an input to, and it’s better to do nothing than to add complexity to build a limiting version of this at this layer.

The major drawback of this is that FOBC requires more explicit intentionality in course design in order for its BundleVersion dependency tracking to be meaningful. You could borrow a single leaf block or container block from another Course, but Blockstore itself would only know that the link between the BundleVersions existed, not the specific items that were used. For that reason, a problem in a problem bank would be modeled as its own Bundle. Getting more precision than that would require building a separate, OLX-aware system to track re-use in a richer way.

Another drawback is that the Snapshot representation for large courses would be relatively large for our big outlier courses, and we’d be creating a new one on every publish. This doesn’t hit us much in costs (our worst case is still in the neighborhood of $0.20 a month), but it can make traversals of the full set of snapshots relatively expensive.

Similarly to HOBC, finding the course outline requires multiple roundtrips and many reads on large courses.

Course Bundle, Unit Bundle (CBUB)

The Course Bundle, Unit Bundle proposal is that courses (and libraries, and pathways) are represented as a single bundle containing an outline.xml file which describes the entire course structure (sections, subsections, and Units) along with a list of all the versioned leaf node (Unit XBlock) bundles that are used in the course. (So Units and Courses get their own bundle, but intermediate hierarchical elements like section and subsection are not represented as bundles.)

Units are composable up to some small limit, so a unit may contain a unit as a child, though when rendered in the LMS the result is flattened (i.e. the users would not see that hierarchy represented in the UI). An OLX convention/mechanism would allow for courses, pathways, units, etc. to link to a unit but only use one of the XBlocks within that unit rather than the whole unit.

Some advantages of this proposal are:

  • It's extremely simple
  • Reading the full outline of a course is very fast and easy (read one file)
  • Separation of concerns: Blockstore doesn't know anything about course hierarchy
  • Reasonable operational efficiency
  • Some flexibility of granularity: sharing is done at the Unit level but only one specific XBlock within an external unit can be used, effectively allowing sharing at the granularity of individual XBlocks or at the Unit level.
  • Static assets are local to a Unit

The disadvantages are:

  • Sharing of a Subsection or Section cannot be done, because Subsections are only defined by some XML within the outline.xml file of the course bundle, and that definition cannot be shared on its own (it's not an independent file/bundle/db-row)

Post-processed/compiled Bundle Data

This combines a few different threads of thought:

  • @symbolist mentioned that borrowing in the LabX model is projected to be much more granular than what we've been thinking about for edx-platform (e.g. take a single item from a Unit and reuse it, whether the original author intended it or not).
  • @symbolist also pointed out that having an "anything works as long as it's valid OLX" model means that different clients may come up with different conventions and have difficulty interoperating.
  • MIT has different use cases where they have "precursor" files that compile out to OLX.
  • @bradenmacdonald's thread on opaque keys and the need for stable, discoverable usage keys in a Bundle.

We made the very explicit decision to give authors a file-system level flexibility to organize things how they want, and talked about entry points into specific files. But can we have our cake and eat it too? What if part of the pluggable processing for a given Bundle is basically a commit hook that will scan through the top level OLX files, execute any XIncludes, parse the OLX, and write out fragments into a normalized space? If it uses dumb heuristics (presence of a url_name, tag name as type), then it could generate a compiled list of files such that we'd have a new supplementary directory structure like:

/xblocks/unit/first_unit.olx
        /problem/survey_problem.olx
        /html/instruction.olx  # This has CDATA in it.
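As a rough sketch of that commit hook (my illustration of the heuristics above; XInclude expansion is omitted):

    import pathlib
    import xml.etree.ElementTree as ET

    def compile_fragments(olx_path, out_root):
        """Walk a top-level OLX file and write every element that carries a
        url_name out to /xblocks/<tag>/<url_name>.olx, using the tag name as
        the block type."""
        for elem in ET.parse(olx_path).getroot().iter():
            url_name = elem.get("url_name")
            if not url_name:
                continue  # not separately addressable; leave it inline
            target = pathlib.Path(out_root) / "xblocks" / elem.tag / f"{url_name}.olx"
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(ET.tostring(elem))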

Advantages:

  • Uniform read format, which simplifies clients reading for executing the student view.
  • We can always make a link from an external bundle to a particular XBlock in a stable, named way without worrying about how the author will refactor their Bundle later.
  • We don't need the runtime to extract an XBlock into a separate file to mark it as intentionally sharable. It can be a simple toggle that happens later.
  • It opens up the possibility of having other types of OLX converters (e.g. QTI, latex2edx, etc.)

Open Questions:

  • Do static assets also get some normalized treatment like this? Or maybe we can rely on convention that they have their own top level treatment that is parallel to this?
  • Does this substantially increase the cost of Snapshot creation? Is it a separate Snapshot that references the original?
  • How are errors handled? Keep it from making a BundleVersion?

Proposal: Taking Down Content

(Note: I'm really struggling with terminology here (e.g. "takedowns"), so please correct this to whatever makes sense.)

At some point, a site operator will want to take down a piece of content in a Bundle because it violates site policy in some way (e.g. copyright violation). We need to handle this in a way that's compliant with our legal responsibilities and doesn't introduce too much complexity or performance overhead.

Use Cases

There are two broad cases that need to be addressed:

  1. A small part of a Bundle needs to be taken down. A Bundle could represent an entire course, in which case we don't want to disable everything because the author uploaded a single video that they thought they had the rights to.
  2. The entire Bundle needs to be taken down. Maybe it's a course that was lifted from another site, or a pirated TV series.

In both cases, we need to figure out what implications taking down content in a Bundle will have on the various Bundles that Link to it.

High Level Approach

  • Blockstore clients are trusted servers, not students.
    • The goal of this is not to immediately remove the files from being accessed by authors and servers but to be able to clearly flag these files so that they do not show up in searches or get published to student-facing systems.
  • Content takedowns are implemented as a change in metadata rather than a change in content, meaning that the BundleVersion changes, but the underlying Snapshot does not.
    • The BundleStore/DraftStore layer is completely ignorant of content takedowns.
    • Since it's a metadata change, it doesn't create a new BundleVersion, but would trigger a lifecycle event signal like other metadata changes would.
    • We can establish a pattern for other more pluggable metadata additions to Bundle data to be made.
  • The kinds of queries (e.g. "all takedown notices filed today") required mean that this should be stored in the relational database.
  • As @bradenmacdonald suggested, we can create a blacklist of files that are forbidden for the use case where only a few files are problematic (a rough model sketch follows this list).
    • We can also cover the blanket "whole Bundle" use case with a flag as a performance optimization.
  • Blacklists have to apply to ranges of BundleVersions.
    • At some point the Bundle might be corrected to remove the offending piece from inside a file -- we don't want to permanently forbid a filename because it violated policy at one time.
    • Blacklist ranges shouldn't overlap in complex ways.
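To make the blacklist-with-version-ranges idea concrete, here is a rough Django model sketch (all names hypothetical; this issue doesn't specify a schema):

    from django.db import models

    class TakedownRecord(models.Model):
        """Hypothetical sketch: a takedown applies to a range of versions of
        one Bundle, optionally narrowed to specific files."""
        bundle_uuid = models.UUIDField(db_index=True)
        # Inclusive version range; end_version stays null while the takedown
        # also applies to all future BundleVersions.
        start_version = models.PositiveIntegerField()
        end_version = models.PositiveIntegerField(null=True, blank=True)
        # An empty list means the entire Bundle is taken down.
        blocked_paths = models.JSONField(default=list, blank=True)
        filed_at = models.DateTimeField(auto_now_add=True, db_index=True)

        def applies_to(self, version, path=None):
            if version < self.start_version:
                return False
            if self.end_version is not None and version > self.end_version:
                return False
            return not self.blocked_paths or path in self.blocked_paths

The filed_at index would support queries like "all takedown notices filed today" directly in SQL, per the relational-database point above.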

REST Representation

  1. A new app is created for takedowns (or whatever we call this).
  2. The new app has its own resource to represent takedowns/blacklist management.
  3. BundleVersions get a new "metadata" attribute that is only shown in the detail GET view and allows other apps to inject data into it. So in this case it might look like:
{
  // (all the other BundleVersion detail information above this)
  "metadata": {
    "takedowns": {
      "takedown_url": "https://.../takedowns/{uuid}",
      "files_affected": [
        "/path/to/file/1.pdf"
        // ... etc.
      ]
    }
  }
}

Open Questions

  • When do we completely delete offending data, as opposed to merely blocking access to it?
  • We punted before on creating a BundleVersionRange model, but that might be really useful for this (and others that want to build metadata functionality).

Authentication and Authorization

How will Blockstore authenticate and authorize requests? Authentication is not too hard, and we have existing solutions (signed JWTs from the LMS, etc.) but authorization gets really tricky. The main problem to solve is the complexity involved in courseware authorization decisions (authorization to view/learn content depends on a very complex intersection of user role, user affiliation, user course enrollment status, release dates, randomized content assignation / adaptive learning dynamic assignation, content groups, cohorts, etc.) and the sheer amount of corresponding data that must be consulted to make authorization decisions.

There are two solutions that I've been thinking about:

  1. The simplest solution I'm aware of is to not allow most services to access Blockstore directly, and instead proxy requests through the LMS and/or Studio which can then apply appropriate permissions checks, using its knowledge of things like enrollments, cohorts, org-level permissions, etc. Then Blockstore doesn't need to worry about authentication at all. This is probably my default option due to its simplicity and that it requires very little additional work to implement.

  2. The other interesting option I'm aware of is to have an independent authorization (micro)service, which stores most data it needs in memory and can provide blazing-fast yes/no answers to detailed questions about which users can do what. The model for this is Amazon's IAM, which supports incredibly complex policies at massive scale with minimal latency. The most interesting open source implementation of this that I've found is https://www.openpolicyagent.org/ which looks very promising, though I'm still unclear on whether or not there's any way to provide it with the required context data (enrollments, randomized content assignation, adaptive learning engine results, etc.) in real time (maintaining a single source of truth for those things) vs. having to push that data into it (duplicating data like enrollment data so that it now lives both in the LMS and in the Open Policy Agent cache) and deal with potential latency or inconsistent data.

Other software related to option 2:

  • keto is a "cloud native access control server" that uses Open Policy Agent. It was previously powered by ladon, which is like an open source implementation of IAM policies, but that was replaced with Open Policy Agent.
  • Mozilla doorman: a much simpler authorization microservice, probably not scalable to Open edX use cases

Thoughts? Other options?

Remove Collections Model?

Collections are the abstraction for a set of learning contexts or bundles.

The idea behind collections was that they can be used to apply permissions and other metadata to many learning contexts at once, with more flexibility than org-based permissions. If Bob works at Harvard and manages 30 courses with a small team of TAs and two other profs, he can use a Collection to group them together and manage the permissions, rather than having to give everyone at Harvard access or manage the permissions for each course/library individually.

The issue we are encountering is twofold:

  1. Features have not been built to enable the situations described above; therefore, the feature has very little purpose.
  2. Collections are a pain point in the present. It is very hard to determine which collection to use when writing code that creates content libraries. The MFE knows the collection uuid, but that isn’t configured anywhere on the backend. I can just grab the first collection from the django model, but that doesn’t seem like it would be correct forever?
    Another important impact of this is in the context of pre-provisioned environments like sandboxes. Putting the collection uuid in an MFE config file at build time wouldn’t work, as the collection would not exist yet.

We are unsure of the long-term plans for 1. but are certain of 2. Therefore, we think it would be prudent to simply remove the concept of collections from blockstore entirely.
In the meantime, it might be worthwhile to specify a default collection uuid and use that as the value.

PATCHing a Draft's files deletes its links

Just quickly making a note about a bug I noticed... If I have a draft with a staged file and a new link, and then I PATCH a change to the staged file, the link in the draft disappears.

Race condition when adding files to a draft via PATCH

If two simultaneous requests each add a (differently named) file to a draft, both requests will apparently succeed and return a 204 status code, but later attempts to GET one of the files will fail.

This is presumably because of a race condition updating the summary file.
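One plausible fix (a sketch with hypothetical names, not the repo's actual code) is to serialize the read-modify-write of the summary file behind a database row lock:

    from django.db import transaction

    def add_file_to_draft(draft_id, name, data):
        with transaction.atomic():
            # Lock the draft row so two concurrent PATCHes can't both read the
            # old summary and then silently drop each other's new file.
            draft = Draft.objects.select_for_update().get(pk=draft_id)  # hypothetical model
            summary = draft.load_summary()  # hypothetical summary-file helpers
            summary["files"][name] = stage_file(draft, name, data)
            draft.save_summary(summary)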
