buildinspace / peru

a generic package manager, for including other people's code in your projects

License: MIT License

Shell 1.41% Makefile 0.31% Python 98.16% Batchfile 0.11%
dependency-manager package-manager plugin-manager packaging toolchain

peru's Introduction

peru

Maybe sometimes better than copy-paste.

Peru is a tool for including other people's code in your projects. It fetches from anywhere -- git, hg, svn, tarballs -- and puts files wherever you like. Peru helps you track exact versions of your dependencies, so that your history is always reproducible. And it fits inside your scripts and Makefiles, so your build stays simple and foolproof.

snazzy gif

Why?

If you build with make, you don't have to do anything special when you switch branches or pull new commits. Build tools notice those changes without any help. But if you depend on other people's code, the tools aren't so automatic anymore. You need to remember when to git submodule update or go get -u or pip install -r. If you forget a step you can break your build, or worse, you might build something wrong without noticing.

Peru wants you to automate dependency management just like you automate the rest of your build. It doesn't interfere with your source control or install anything global, so you can just throw it in at the start of a script and forget about it. It'll run every time, and your dependencies will never be out of sync. Simple, and fast as heck.

The name "peru", along with our love for reproducible builds, was inspired by Amazon's Brazil build system. It also happens to be an anagram for "reup".

Installation

Peru supports Linux, macOS, and Windows. It requires:

  • python 3.5 or later
  • git, any version
  • optionally, if you want to fetch from these types of repos:
    • hg, any version
    • svn, any version

git is required even if you are not retrieving a git-based module because Peru uses it internally.

Using pip

Use pip to install it:

pip install peru

Note that depending on how Python is set up on your machine, you might need to use sudo with that, and Python 3's pip might be called pip3. Also, if you have to use Python 3.3 or 3.4, those were supported up to peru 1.1.4.
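
For example, on a machine where Python 3's pip is named pip3 and package installs need root, the command might look like one of these:

pip3 install peru
sudo pip3 install peru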

Don't forget to install git, too, however is appropriate for your OS.

Using OS package managers

On Arch Linux, you can install peru from the AUR.

Homebrew has a Peru formula for macOS and Linux. brew install peru installs it, using the latest Python version that Homebrew supports.

Getting Started

Here's the peru version of the first git submodules example from the Git Book. We're going to add the Rack library to our project. First, create a peru.yaml file like this:

imports:
    rack_example: rack/  # This is where we want peru to put the module.

git module rack_example:
    url: git://github.com/chneukirchen/rack.git

Now run peru sync.

What the heck just happened?

Peru cloned Rack for you, and imported a copy of it under the rack directory. It also created a magical directory called .peru to hold that clone and some other business. If you're using source control, now would be a good time to put these directories in your ignore list (like .gitignore). You usually don't want to check them in.
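
For this example, the ignore entries might look like:

# .gitignore
.peru/
rack/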

Running peru clean will make the imported directory disappear. Running peru sync again will make it come back, and it'll be a lot faster this time, because peru caches everything.

Getting Fancy

For a more involved example, let's use peru to manage some dotfiles. We're big fans of the Solarized colorscheme, and we want to get it working in both ls and vim. For ls all we need peru to do is fetch a Solarized dircolors file. (That'll get loaded somewhere like .bashrc, not included in this example.) For vim we're going to need the Solarized vim plugin, and we also want Pathogen, which makes plugin installation much cleaner. Here's the peru.yaml:

imports:
    # The dircolors file just goes at the root of our project.
    dircolors: ./
    # We're going to merge Pathogen's autoload directory into our own.
    pathogen: .vim/autoload/
    # The Solarized plugin gets its own directory, where Pathogen expects it.
    vim-solarized: .vim/bundle/solarized/

git module dircolors:
    url: https://github.com/seebi/dircolors-solarized
    # Only copy this file. Can be a list of files. Accepts * and ** globs.
    pick: dircolors.ansi-dark

curl module pathogen:
    url: https://codeload.github.com/tpope/vim-pathogen/tar.gz/v2.3
    # Untar the archive after fetching.
    unpack: tar
    # After the unpack, use this subdirectory as the root of the module.
    export: vim-pathogen-2.3/autoload/

git module vim-solarized:
    url: https://github.com/altercation/vim-colors-solarized
    # Fetch this exact commit, instead of master or main.
    rev: 7a7e5c8818d717084730133ed6b84a3ffc9d0447

The contents of the dircolors module are copied to the root of our repo. The pick field restricts this to just one file, dircolors.ansi-dark.

The pathogen module uses the curl type instead of git, and its URL points to a tarball. (This is for the sake of an example. In real life you'd probably use git here too.) The unpack field means that we get the contents of the tarball rather than the tarball file itself. Because the module specifies an export directory, it's that directory rather than the whole module that gets copied to the import path, .vim/autoload. The result is that Pathogen's autoload directory gets merged with our own, which is the standard way to install Pathogen.

The vim-solarized module gets copied into its own directory under bundle, which is where Pathogen will look for it. Note that it has an explicit rev field, which tells peru to fetch that exact revision, rather than the default branch (master or main in git). That's a Super Serious Best Practice™, because it means your dependencies will always be consistent, even when you look at commits from a long time ago.

You really want all of your dependencies to have hashes, but editing those by hand is painful. The next section is about making that easier.

Magical Updates

If you run peru reup, peru will talk to each of your upstream repos, get their latest versions, and then edit your peru.yaml file with any updates. If you don't have peru.yaml checked into some kind of source control, you should probably do that first, because the reup will modify it in place. When we reup the example above, the changes look something like this:

diff --git a/peru.yaml b/peru.yaml
index 15c758d..7f0e26b 100644
--- a/peru.yaml
+++ b/peru.yaml
@@ -6,12 +6,14 @@ imports:
 git module dircolors:
     url: https://github.com/seebi/dircolors-solarized
     pick: dircolors.ansi-dark
+    rev: a5e130c642e45323a22226f331cb60fd37ce564f

 curl module pathogen:
     url: https://codeload.github.com/tpope/vim-pathogen/tar.gz/v2.3
     unpack: tar
     export: vim-pathogen-2.3/autoload/
+    sha1: 9c3fd6d9891bfe2cd3ed3ddc9ffe5f3fccb72b6a

 git module vim-solarized:
     url: https://github.com/altercation/vim-colors-solarized
-    rev: 7a7e5c8818d717084730133ed6b84a3ffc9d0447
+    rev: 528a59f26d12278698bb946f8fb82a63711eec21

Peru made three changes:

  • The dircolors module, which didn't have a rev before, just got one. By default for git, this is the current master or main. To change that, you can set the reup field to the name of a different branch.
  • The pathogen module got a sha1 field. Unlike git, a curl module is plain old HTTP, so it's stuck downloading whatever file is at the url. But it will check this hash after the download is finished, and it will raise an error if there's a mismatch.
  • The vim-solarized module had a hash before, but it's been updated. Again, the new value comes from master or main by default.

At this point, you'll probably want to make a new commit of peru.yaml to record the version bumps. You can do this every so often to keep your plugins up to date, and you'll always be able to reach old versions in your history.

Commands

  • sync
    • Pull in your imports. sync yells at you instead of overwriting existing or modified files. Use --force/-f to tell it you're serious.
  • clean
    • Remove imported files. Same --force/-f flag as sync.
  • reup
    • Update module fields with new revision information. For git, hg, and svn, this updates the rev field. For curl, this sets the sha1 field. You can optionally give specific module names as arguments.
  • copy
    • Make a copy of all the files in a module. Either specify a directory to put them in, or peru will create a temp dir for you. You can use this to see modules you don't normally import, or to play with different module/rule combinations (see "Rules" below).
  • override
    • Replace the contents of a module with a local directory path, usually a clone you've made of the same repo. This lets you test changes to imported modules without needing to push your changes upstream or edit peru.yaml.
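
As a rough sketch of how these fit together on the command line (module names are placeholders; see peru --help for the exact syntax):

peru sync                    # fetch and copy in all imports
peru sync --force            # overwrite preexisting or modified files
peru clean                   # remove the imported files again
peru reup                    # update every module's revision fields
peru reup some-module        # only reup one module
peru copy some-module ./out  # copy that module's files into ./out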

Module Types

git, hg, svn

For cloning repos. These types all provide the same fields:

  • url: required, any protocol supported by the underlying VCS
  • rev: optional, the specific revision/branch/tag to fetch
  • reup: optional, the branch/tag to get the latest rev from when running peru reup

The git type also supports setting submodules: false to skip fetching git submodules. Otherwise they're included by default.
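
Putting those fields together, a fully pinned git module might look something like this (the name, URL, and branch are just for illustration):

git module some-library:
    url: https://example.com/some-library
    rev: 0123456789abcdef0123456789abcdef01234567
    # On `peru reup`, take the latest rev from this branch instead of the default.
    reup: develop
    # Don't fetch git submodules.
    submodules: false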

curl

For downloading a file from a URL. This type is powered by Python's standard library, rather than an external program.

  • url: required, any kind supported by urllib (HTTP, FTP, file://)
  • filename: optional, overrides the default filename
  • sha1: optional, checks that the downloaded file matches the checksum
  • unpack: optional, tar or zip
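
For example, a curl module using these fields might look like this (the URL and hash are hypothetical):

curl module color-scheme:
    url: https://example.com/downloads/dircolors.ansi-dark
    # Save the download under this name instead of the URL's basename.
    filename: dircolors
    # Normally filled in for you by `peru reup`.
    sha1: 0123456789abcdef0123456789abcdef01234567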

Peru includes a few other types mostly for testing purposes. See rsync for an example implemented in Bash.

Creating New Module Types

Module type plugins are as-dumb-as-possible scripts that only know how to sync, and optionally reup. Peru shells out to them and then handles most of the caching magic itself, though plugins can also do their own caching as appropriate. For example, the git and hg plugins keep track of repos they clone. Peru itself doesn't need to know how to do that. For all the details, see Architecture: Plugins.

Rules

Some fields (like rev and unpack) are specific to certain module types. There are also fields you can use in any module, which modify the tree of files after it's fetched. Some of these made an appearance in the fancy example above:

  • copy: A map or multimap of source and destination paths to copy. Works like cp on the command line, so if the destination is a directory, it'll preserve the source filename and copy into the destination directory.
  • move: A map or multimap of source and destination paths to move. Similar to copy above, but removes the source.
  • drop: A file or directory, or a list of files and directories, to remove from the module. Paths can contain * or ** globs.
  • pick: A file or directory, or a list of files and directories, to include in the module. Everything else is dropped. Paths can contain * or ** globs.
  • executable: A file or list of files to make executable, as if calling chmod +x. Also accepts globs.
  • export: A subdirectory that peru should treat as the root of the module tree. Everything else is dropped, including parent directories.

Note that these fields always take effect in the order listed above, regardless of the order they're given in peru.yaml. For example, a move is always performed before a pick. Also note that these fields can't be given twice. For example, instead of using two separate move fields (one of which would be ignored), use a single move field containing multiple moves. In practice, things work this way because these fields are parsed as keys in a dictionary, which don't preserve ordering and can't repeat.
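
For example, a module using several of these fields might look like the sketch below; whatever order the keys appear in the file, pick is applied before executable (the URL and paths are hypothetical):

git module some-scripts:
    url: https://example.com/some-scripts
    executable: scripts/*.sh
    pick:
        - scripts/**
        - LICENSE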

Besides using those fields in your modules, you can also use them in "named rules", which let you transform one module in multiple ways. For example, say you want the asyncio subdir from the Tulip project, but you also want the license file somewhere else. Rather than defining the same module twice, you can use one module and two named rules, like this:

imports:
    tulip|asyncio: python/asyncio/
    tulip|license: licenses/

git module tulip:
    url: https://github.com/python/asyncio

rule asyncio:
    export: asyncio/

rule license:
    pick: COPYING

As in this example, named rules are declared a lot like modules and then used in the imports list, with the syntax module|rule. The | operator there works kind of like a shell pipeline, so you can even do twisted things like module|rule1|rule2, with each rule applying to the output tree of the previous.

Recursion

If you import a module that has a peru file of its own, peru can include that module's imports along with it, similar to how git submodules behave with git clone --recursive. To enable this, add recursive: true in a module's definition.

It's also possible to directly import modules that are defined in the peru.yaml file of another module. If your project defines a module foo, and foo has a peru file in it that defines a module bar, you can use foo.bar in your own imports. This works even if you never actually import foo, and it does not require setting recursive: true.
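
A minimal sketch of both cases, using hypothetical module names and URLs:

imports:
    # Pull in foo, along with whatever foo's own peru file imports.
    foo: third_party/foo/
    # Import bar, defined in foo's peru.yaml, directly. This doesn't
    # require recursive: true.
    foo.bar: third_party/bar/

git module foo:
    url: https://example.com/foo
    recursive: true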

Configuration

There are several flags and environment variables you can set, to control where peru puts things. Flags always take precedence.

  • --file=<file>: The path to your peru YAML file. By default peru looks for peru.yaml in the current directory or one of its parents. This setting tells peru to use a specific file. If set, --sync-dir must also be set.
  • --sync-dir=<dir>: The path that all imports are interpreted relative to. That is, if you import a module to ./, the contents of that module go directly in the sync dir. By default this is the directory containing your peru.yaml file. If set, --file must also be set.
  • --state-dir=<dir>: The directory where peru stashes all of its state metadata, and also the parent of the cache dir. By default this is .peru inside the sync dir. You should not share this directory between two projects, or peru sync will get confused.
  • --cache-dir=<dir> or PERU_CACHE_DIR: The directory where peru keeps everything it's fetched. If you have many projects fetching the same dependencies, you can use a shared cache dir to speed things up.
  • --file-basename=<name>: Change the default peru file name (normally peru.yaml). As usual, peru will search the current directory and its parents for a file of that name, and it will use that file's parent dir as the sync dir. Incompatible with --file.
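
For instance, to share one cache between several projects, you might set the environment variable once, or pass the flag for a single run (the path is just an example):

export PERU_CACHE_DIR="$HOME/.cache/peru"
peru sync

# Or, for a single invocation (the flag takes precedence):
peru --cache-dir="$HOME/.cache/peru" sync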

peru's People

Contributors

colindean, edbrannin, felipefoz, jmbrads22, mfussenegger, oconnor663, oconnor663-zoom, olson-sean-k


peru's Issues

cancel subprocesses when a job fails?

We don't do anything special to cancel existing jobs when something fails. What actually happens? Presumably we should be sending a kill signal to existing jobs. What if a job fails to die?

use libgit2 via pygit2 instead of shelling out

The main blocker for this one is Cache.merge_trees(). We use the --prefix flag for git-read-tree, and libgit2 doesn't seem to support a similar feature. Tracking issue: libgit2/libgit2#154 We could use the treebuilder feature to build a prefixed tree, and then use git_merge_trees() on that, but that function isn't exposed through pygit2 anyway.

Other features we need in pygit2 that we've already implemented:
  • Setting the working dir: https://github.com/oconnor663/pygit2/commit/a063867fe0e4506e29f22c45dd403d805e3fb1b7
  • Setting a detached HEAD: https://github.com/oconnor663/pygit2/commit/b190169f5e83cbdb2346acd52cea30e14a205eb5

EDIT: These were pushed as part of pygit2 v0.21.0 libgit2/pygit2#377

stop complaining on deleted imports

We want peru sync to be careful about overwriting the user's files. Peru doesn't pave over preexisting files, or any changes that have been made since a file was created. This is to avoid accidentally deleting users' work, and also to try to catch some of the cases where users have done the Wrong Thing with peru (like checking in synced files (#dowhatisaynotwhatido)). But currently we also freak out if the user has deleted files that peru synced, and I think that might be overzealous. Consider this scenario:

  1. I want to clean junk out of my repo, but I don't want to lose my peru cache.
  2. So I run git clean -dfx --exclude .peru. Maybe I have an alias for that.
  3. Now I call peru sync. Peru absolutely refuses until I use -f.

I think it would be better if peru stopped complaining here. There's no risk of losing work, and it's not really catching any Wrong Things. If a user sees this error all the time, they're not going to be paying attention when it eventually catches a real mistake. (Especially if we get them in the habit of using -f.)

tldr: peru sync should consider deleted files "clean".

Is it really a good idea for the imports list to be separate?

This is a consistent question I get when I demo peru for people. Why not put the import path for a module in the module's declaration? There are two decent reasons and one bad reason:

  1. It's nice to be able to look in one place and see everything that peru is going to do when you sync.
  2. Imports are not necessarily one-module-one-path. A module could be imported multiple times with different rules.
  3. Bad reason: this is a holdover from when remote modules were more involved, with potentially their own imports, when some symmetry between the remote modules and the toplevel module made sense.

Allowing import paths as part of a module declaration definitely simplifies the hello world example. I could go either way on examples that are more complicated than that. I really want to avoid having two different ways to do the same thing, like allowing both an imports list and inline import paths. I think the biggest question for me right now is whether point (1) is really true, or whether I just think it's true because I'm used to it...

Windows compatibility

  • We need to make sure we're using the ProactorEventLoop on Windows. The default loop doesn't support subprocesses.
  • We need to use create_subprocess_shell instead of create_subprocess_exec to execute e.g. .py plugins.
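
A minimal sketch of the event loop selection, assuming it happens once at startup:

import asyncio
import sys

if sys.platform.startswith('win'):
    # The default selector-based loop can't run subprocesses on Windows.
    asyncio.set_event_loop(asyncio.ProactorEventLoop())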

specify the cache on a per-project basis

Something in .peru. Maybe a pseudo symlink like git uses.

The PERU_CACHE env var is a little too broad. You might want to have some projects share the cache but not others.

backup peru.yaml during a reup

There's no reason we shouldn't keep backups of peru.yaml under the .peru dir when we modify that file. It wouldn't take very much disk space, and it could be helpful for users who aren't under version control. (As, for example, our future workspace feature might not be.)

allow imports to be a list of pairs

This would allow the user to import the same target twice without hacks. It would also let them control the merge order, though I'm not sure why you'd want that.

`peru copy` should support `--all`

I should be able to build (and force a build of) any target, not just local rules. Likewise, I should be able to export any tree, including the local imports. Once export can do that, our validate_third_party.sh script can use it and be simpler/faster.

Changing the cache can cause "failed to unpack tree object" errors

Say you run peru sync and then you change the value of PERU_CACHE and run peru sync again. The lastimports file will contain a reference to a tree that's not in your new cache, and you'll get a git error. We should detect this case ("hey, it looks like your last imports tree is no longer in cache") and allow the --force flag to just pave over everything.

peru workspaces

It would be nice to be able to manage a big ecosystem of projects with peru. We'd probably build on the existing overrides feature to do it. One idea we had was generating a peru.yaml file (not version controlled) that refers to your project repository as an overridden remote module. We might use recursive peru to make this work.

refactor test code

The test harness is kind of a mess. In particular, the plugin tests have some inflexible scaffolding that doesn't work well for anything but distributed VCS plugins like git and hg. Until this is done, it may be difficult and hacky to test plugins like svn.

Should the cache directory live in $HOME by default?

As I'm writing the README, I find myself telling new users to set the $PERU_CACHE variable to avoid recloning things after they clean. When new users need to configure some random setting, that's usually a sign that the default is bad. Should we be storing the cache in a centralized spot by default?

Pros:

  • This is what Maven and Ivy do.
  • This makes the fastest setup the default for new users.
  • Different projects with the same dependencies would share their networking and disk space by default.

Cons:

  • This is what Maven and Ivy do.
  • This is not what git does by default, even though it can be configured to.
  • This default might be bad for complicated disk setups.
    • Say your build machine uses an NFS mount for /home, and uses /var/local or something for the actual local disk. You might want to do all your clones and builds in /var/local for speed, but behind the scenes peru is doing big git operations over the network in /home.
  • This can be confusing for modules without an explicit rev.
    • If two projects use the same dependency, the rev that one of them is getting will be affected by the other. A "new" dependency could be very stale because the other caller cached it in the past. We will probably also have a --skipcache flag or something in the future to force plugin fetches, and doing that would update the cache for all callers.
    • Similar problem for nondeterministic build commands.
  • This is a band-aid for our slow, serial plugin fetching. We should make it faster instead.
  • When we start actually using locks for our cache writes, this default could create more lock contention and stale lock issues.

get a cert for buildinspace.com

I've opened up http:// on port 80, but our .arcconfig and commit logs are still pointing to https://, so we should really fix this.

run fetches in parallel

We'll use asyncio for this, from the 3.3-compatible "tulip" library. Some things to remember:

  • Make sure modules and rules lock their get_tree methods. We don't want to allow multiple fetches to happen at once for the same module.
  • Use a semaphore to limit the number of parallel jobs.
  • Refactor resolver.py so that not everything needs to become a coroutine.
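
A rough sketch of the semaphore part, using modern asyncio syntax (the module API here is made up):

import asyncio

async def fetch_all(modules, max_jobs=10):
    # Limit the number of plugin fetches running at once.
    semaphore = asyncio.Semaphore(max_jobs)

    async def fetch_one(module):
        async with semaphore:
            # Hypothetical: each module fetches its own tree here.
            await module.get_tree()

    await asyncio.gather(*(fetch_one(m) for m in modules))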

write a bash_cp plugin

This will help us force the plugin interface to stay simple, and to give an example of how plugins in other languages should be written.

implement PERU_WORK_DIR

The validate_third_party.sh script has to copy peru.yaml around and then clean it up. That's annoying. We also have hacks in tests to handle peru.yaml when we're comparing contents of directories. Make all this cleaner.

recursive peru

Remote modules should be able to include their own peru.yaml files. This should allow default rules, as well as referencing rules and modules defined in the remote.

support whitespace in plugin field names?

We tend to use spaces in our field names, because it's just nicer to read (required fields vs required_fields). We should probably allow plugins to do the same with the names they define. For example, suppose the curl plugin wanted a field called "fallback url". We should probably let them call it fallback url with the space. But we'd want to pass it along as $PERU_MODULE_FALLBACK_URL rather than allowing whitespace into an env var name.
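
The name mangling described above could be as simple as:

def field_to_env_var(field_name):
    # e.g. "fallback url" -> "PERU_MODULE_FALLBACK_URL"
    return 'PERU_MODULE_' + field_name.upper().replace(' ', '_')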

`peru diff`

When doing a reup, it would be nice to see the before/after diff. One way to do this would be to support some kind of peru diff FILE [FILE2]. Note that FILE could be something like

<(git show HEAD^:peru.yaml)

Peru could prepare the imports tree and then do some kind of git diff between that and the current tree.

One way to hack around this right now is to git add -A --force and commit all your imported files in a temp branch, do the reup, make another commit, and then compare those two.

don't let builds print straight to stdout

That conflicts with the fancy display. Maybe the displays could be extended to provide a different kind of output writer, which works like the print method does now.

add support for hg, svn

These need plugins. We should probably refactor some shared logic out of the plugin main functions when we do this.

make sure our .peru dir is versioned

We want to be able to make changes to the format without needing everyone to git clean their projects. Possibly also version the plugin caches?

allow different builds for different platforms

We should probably let the build field optionally take a mapping of system names to build commands. Complicated build commands can already do their own uname testing on posix systems, so this will almost exclusively be intended to support Windows. We should probably use sys.platform and the .startswith() idiom (https://docs.python.org/3.4/library/sys.html#sys.platform), but it might be nice to also check against os.name, so that users could specify posix without needing to duplicate things for each different posix-like os. Should we match against an ordered list?
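
The sys.platform prefix-matching idiom might look like this (commands_by_platform is the hypothetical mapping of system names to build commands proposed above):

import sys

def pick_build_command(commands_by_platform):
    # Match e.g. 'win' against 'win32', 'linux' against 'linux2', etc.
    for name, command in commands_by_platform.items():
        if sys.platform.startswith(name):
            return command
    return None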

remove the build command

sync only ever syncs one thing (everything). That's a good thing. It means you don't have a lot of state that you need to worry about. You're either synced, or you're not.

build has a similar restriction, but it seems to make a lot less sense. Almost all builds need to support multiple different invocations, like make and make install. To be useful in anything but the most trivial cases, build would need to start taking parameters that it passes along to build commands, and the target syntax would need to support this too.

Rather than trying to patch up a bad model, I think we should scrap the build command. We should encourage the pattern where other build tools call peru sync.

One question this raises: Projects can have a toplevel build: field. Previously peru sync ignored this, and only peru build triggered it. With this change, the only way to invoke a toplevel build field will be to have another module depend on you as a recursive project (not implemented yet). Is that a world that makes sense?

Actually, it's no different from export: and files:, neither of which is meaningful at the top level unless someone depends on you as a recursive project. Maybe it's good that build: would be more like those.

But that raises another question: Does it really make sense for build, export, and files to be first-class, toplevel fields? Maybe we should cordon them off in a section of their own?

move .peru/cache/tmp to .peru/tmp

Users should be able to set PERU_CACHE to their home dir without causing peru to write a ton of temp files there. Honestly, there really shouldn't be a reason to set PERU_PLUGINS_CACHE instead of PERU_CACHE.

add a `filter` field to rules and modules

It's been fairly common for me to use the build field to do something like

mkdir out && cp myfile out/

when I want to export only part of a directory. That feels pretty hacky, and it will be very inconvenient in builds that need to support Windows or even just cp -r (Mac requires the -R flag instead).

It would be better to have some explicit filter field. git add supports * and ** globs natively, so it shouldn't be too much trouble to expose this through Cache.import_tree.

My guess is that it would make sense to apply the filter step after export, which means that filter paths would be relative to the export dir rather than relative to the module root. That would save the user from duplicating the export path in the filter spec. The order of application of rule fields would then be:

  1. build
  2. export
  3. filter

improve the README

Some things we should probably mention:

  • The general rule fields: build, export, and files.
  • "Treat downloaded files like generated files."
  • A little bit about named rules.

checking cache keys should not cause a build

We can compute the cache key for a rule without building it. So we should really be able to do that without building its dependencies either. The current approach has the benefit of noticing when we've run a rule on the same inputs before though. Can we get both?

allow plugin scripts to be one file

Right now we force plugins to separate their fetch and reup scripts, at least to some degree. This forces all of our plugins to use the *_shared idiom, which is pretty annoying. That layout used to make sense before we had plugins.yaml, but now maybe it doesn't. It should be easy enough for that file to tell us what to invoke for fetch and reup, and there's no reason those couldn't be the same thing. (We could use another env var like PERU_PLUGIN_COMMAND to make it possible-but-not-required to use one script for both.) @olson-sean-k what do you think?

add some logging

.peru/log seems like a reasonable place. It would be nice to record entries like

module foo cached: 5d5fb9a5c41a0bca34af6fcb1e554b79af6534ea

so that when I want to clear the cache for just one module, I can find its cache key in the log. And of course, we should be logging errors.

should remote modules even have imports?

Maybe that's more complicated than it's worth. (Especially when it comes to overridden modules, where we have to stick a .peru dir in them.) I'm not sure I can think of a good use case. We want to encourage nontrivial build commands to come out of the peru.yaml file anyway, right? Maybe only the toplevel project (and hypothetical recursive projects) should have imports.

parallel fetching can cause cache conflicts

Our parallelism uses module object locking to avoid fetching the same module twice. But there's nothing preventing two different modules from using the same URL. Those two modules could get fetched in parallel, and then you have two instances of the git plugin (or whatever) trying to write to the same directory.

We definitely don't want to shove any locking responsibilities down to the plugins. What we should do is create more granular plugin cache directories (instead of one big global one) and use a lock in peru itself to prevent two fetches from touching one cache at the same time.

I'm tempted to use the full hash of a module's fields to name this directory, but we don't want to invalidate a git clone when the user changes rev for example. We could use the name+type of a module (because a module should definitely get a clean plugin cache if it changes type), but that could still get confused if one module swaps names with another. Maybe the solution is to name/lock the cache dir with a hash of all plugin fields, but also allow plugin.yaml to restrict the list of fields that get hashed. So the git plugin for example could say, "Only use my url field for the purposes of plugin caching." Is that too complicated? It might even make sense to make this configuration semi-mandatory, so that plugins that don't specify their cacheable fields get /dev/null as their PERU_PLUGIN_CACHE.

Random upside to all this: we can get rid of the urlencoding that the plugins are doing now.

Related: You could have two modules with exactly the same fields. Ideally the second one should be a cache hit. But if they're fetched in parallel, they might both be cache misses, and then they would duplicate work. The solution to this would be to take module locks by cache key, rather than just by module object instance. (This should've been obvious from the beginning, since the read-write that we're protecting is done on that key.) Unlike the plugin issue above, this distinction is just a duplicated-work issue in a weird corner case, rather than a serious correctness issue. But since we already have to do module-level locking (to cover the case where both A and B depend on C), we might as well do it right.

All together, here's what that locking is going to look like. All of this lives in RemoteModule.get_tree, though RemoteModule.reup will probably want to do it too, so hopefully we can share it cleanly.

  1. Take a module lock keyed off of the module's cache key (the hash of all module fields). Think of this as the "don't fetch the same module twice" lock, though it will also handle identical modules with different names.
  2. Check the module cache and exit early if it's a hit.
  3. Take a plugin lock keyed off the relevant-to-plugin-caching fields specified in plugin.yaml. Think of this as the "only one job at a time using a given plugin cache directory" lock. If the plugin hasn't configured these fields, there's no lock here, and we don't provide a cache dir at all.
  4. Take the max-parallel-fetches semaphore. Think of this as the "even though we could run infinity jobs in parallel, let's be sensible and only run 10" semaphore.
  5. Actually shell out to the plugin.
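
In modern asyncio terms, that sequence might look roughly like the sketch below (every name here is a placeholder; the real thing would live in RemoteModule.get_tree):

async def get_tree(self):
    # 1. One lock per module cache key, so identical modules aren't fetched twice.
    async with module_locks[self.cache_key]:
        # 2. Exit early on a cache hit.
        cached = cache.get(self.cache_key)
        if cached is not None:
            return cached
        # 3. One lock per plugin cache dir, keyed off the plugin's cacheable fields.
        async with plugin_cache_locks[self.plugin_cache_key]:
            # 4. Global cap on the number of parallel fetches.
            async with fetch_semaphore:
                # 5. Actually shell out to the plugin.
                tree = await run_plugin_fetch(self)
        cache.put(self.cache_key, tree)
        return tree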

replace the plugin command line protocol with plugin.yaml and environment variables

Right now plugins have to do some nontrivial parsing to separate out plugin fields from command arguments. This gets duplicated in every plugin, even though some of it is shared. A fairly trivial plugin like cp, which should be one line, ends up being four or five (let alone the Bash rsync plugin), and also the sets of mandatory and optional fields get duplicated between fetch and reup scripts.

One of the reasons we didn't use more environment variables earlier is that it's difficult for the plugin to recognize invalid fields if it doesn't have its fields in a list. But it shouldn't be the plugin's responsibility to recognize invalid fields -- that's more duplicated logic that should live in peru core. We should create a plugin.yaml convention that lets the plugin declare what fields it supports. (And possibly other stuff in the future, who knows.)

Once that's done, there's no reason not to pass the url field as e.g. PERU_FIELD_URL or something. Then the plugin never needs to parse anything.
