webmozarts / console-parallelization

Enables the parallelization of Symfony Console commands.

License: MIT License

console-parallelization's Introduction

Parallelization for the Symfony Console

This library supports the parallelization of Symfony Console commands.

How it works

When you launch a command with multiprocessing enabled, a main process fetches items and distributes them across the given number of child processes over the standard input. Child processes are killed after a fixed number of items (a segment) in order to prevent them from slowing down over time.

Optionally, the work of child processes can be split further into chunks (batches). You can perform certain work before and after each of these batches (for example flushing changes to the database) in order to optimize the performance of your command.

Installation

Use Composer to install the package:

composer require webmozarts/console-parallelization

Usage

To add parallelization capabilities to your project, you can either extend the ParallelCommand class or use the Parallelization trait:

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Webmozarts\Console\Parallelization\ParallelCommand;
use Webmozarts\Console\Parallelization\Parallelization;
use Webmozarts\Console\Parallelization\Input\ParallelizationInput;

class ImportMoviesCommand extends ParallelCommand
{
    public function __construct()
    {
        parent::__construct('import:movies');
    }

    protected function configure(): void
    {
        parent::configure();
        
        // ...
    }

    protected function fetchItems(InputInterface $input, OutputInterface $output): iterable
    {
        // open up the file and read movie data...

        // return items as strings
        return [
            '{"id": 1, "name": "Star Wars"}',
            '{"id": 2, "name": "Django Unchained"}',
            // ...
        ];
    }

    protected function runSingleCommand(string $item, InputInterface $input, OutputInterface $output): void
    {
        $movieData = json_decode($item);
   
        // insert into the database
    }

    protected function getItemName(?int $count): string
    {
        if (null === $count) {
            return 'movie(s)';
        }

        return 1 === $count ? 'movie' : 'movies';
    }
}

You can run this command like a regular Symfony Console command:

$ bin/console import:movies --main-process
Processing 2768 movies in segments of 2768, batches of 50, 1 round, 56 batches in 1 process

 2768/2768 [============================] 100% 56 secs/56 secs 32.0 MiB
            
Processed 2768 movies.

Or, if you want, you can run the command using parallelization:

$ bin/console import:movies
# or with a specific number of processes instead:
$ bin/console import:movies --processes 2
Processing 2768 movies in segments of 50, batches of 50, 56 rounds, 56 batches in 2 processes

 2768/2768 [============================] 100% 31 secs/31 secs 32.0 MiB
            
Processed 2768 movies.
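
The round and batch counts in the two outputs above can be reproduced with plain arithmetic. The following is a minimal sketch, assuming that a round corresponds to one segment and that each segment is cut into batches independently; countRoundsAndBatches() is a hypothetical helper and not part of the library:

```php
<?php

/**
 * Hypothetical helper (not part of the library): counts how many rounds
 * (segments) and batches a given number of items is split into.
 *
 * @return array{rounds: int, batches: int}
 */
function countRoundsAndBatches(int $itemCount, int $segmentSize, int $batchSize): array
{
    $items = $itemCount > 0 ? range(1, $itemCount) : [];

    // Each segment is handed to one child process (one "round").
    $segments = array_chunk($items, $segmentSize);

    $batches = 0;
    foreach ($segments as $segment) {
        // Within a segment, items are processed batch by batch.
        $batches += count(array_chunk($segment, $batchSize));
    }

    return ['rounds' => count($segments), 'batches' => $batches];
}

// Parallel run: segments of 50, batches of 50.
$parallel = countRoundsAndBatches(2768, 50, 50);

// Main-process run: a single segment containing all 2768 items.
$single = countRoundsAndBatches(2768, 2768, 50);
```

With these inputs, $parallel holds 56 rounds and 56 batches, and $single holds 1 round and 56 batches, which matches the figures printed by the two runs.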

The API

The ParallelCommand and the Parallelization trait

This library offers a ParallelCommand base class and a Parallelization trait. For basic usage, the ParallelCommand should be simpler to use, as it declares the strictly required methods as abstract methods. All other hooks can be configured by overriding the ::configureParallelExecutableFactory() method.

The Parallelization trait, on the other hand, implements all hooks by default, which requires a bit less manual work. It does, however, require you to call ParallelizationInput::configureCommand() to add the parallelization-related input arguments and options.

Items

The main process fetches all the items that need to be processed and passes them to the child processes through their Standard Input (STDIN). Hence, items must fulfill two requirements:

  • Items must be strings
  • Items must not contain newlines
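
One way to satisfy both requirements for structured data is to serialize each item as single-line JSON, as the movie example above does. The sketch below is only an illustration: encodeItem() and isValidItem() are hypothetical helpers, not functions provided by the library.

```php
<?php

/**
 * Hypothetical helper: serializes structured data into a single-line string.
 * json_encode() escapes newlines inside string values, so the result never
 * contains a literal newline (as long as JSON_PRETTY_PRINT is not used).
 */
function encodeItem(array $data): string
{
    return json_encode($data, JSON_THROW_ON_ERROR);
}

/**
 * Hypothetical helper: checks the two requirements listed above.
 */
function isValidItem(mixed $item): bool
{
    return is_string($item) && !str_contains($item, "\n");
}

// The name contains a newline, yet the encoded item does not.
$item = encodeItem(['id' => 1, 'name' => "Star\nWars"]);
```

The child process can then recover the original data with json_decode($item, true).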

Typically, you want to keep items small in order to offload processing from the main process to the child process. Some typical examples for items:

  • The main process reads a file and passes the lines to the child processes
  • The main process fetches IDs of database rows that need to be updated and passes them to the child processes

Segments

When you run a command with multiprocessing enabled, the items returned by fetchItems() are split into segments of a fixed size. Each child process processes a single segment and kills itself afterwards.

By default, the segment size is the same as the batch size (see below), but you can try to tweak the performance of your command by choosing a different segment size (ideally a multiple of the batch size). You can do so by calling withSegmentSize() in ::configureParallelExecutableFactory():

protected function configureParallelExecutableFactory(
      ParallelExecutorFactory $parallelExecutorFactory,
      InputInterface $input,
      OutputInterface $output
): ParallelExecutorFactory {
    return $parallelExecutorFactory
        ->withSegmentSize(250);
}

Batches

By default, the batch size and the segment size are the same. If desired, you can however choose a smaller batch size than the segment size and run custom code before or after each batch. You will typically do so in order to flush changes to the database or free resources that you don't need anymore.

To run code before/after each batch, override the hooks runBeforeBatch() and runAfterBatch():

// When using the ParallelCommand
protected function runBeforeBatch(InputInterface $input, OutputInterface $output, array $items): void
{
    // e.g. fetch needed resources collectively
}

protected function runAfterBatch(InputInterface $input, OutputInterface $output, array $items): void
{
    // e.g. flush database changes and free resources
}

protected function configureParallelExecutableFactory(
      ParallelExecutorFactory $parallelExecutorFactory,
      InputInterface $input,
      OutputInterface $output,
): ParallelExecutorFactory {
    return $parallelExecutorFactory
        ->withRunBeforeBatch($this->runBeforeBatch(...))
        ->withRunAfterBatch($this->runAfterBatch(...));
}

// When using the Parallelization trait, this can be simplified a bit:
protected function runBeforeBatch(
    InputInterface $input,
    OutputInterface $output,
    array $items
): void {
    // ...
}
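
To make the ordering of these hooks concrete, here is a minimal sketch of a batch loop in plain PHP. It assumes the hooks simply wrap each batch of a segment; processInBatches() is a hypothetical helper that mimics this behaviour and is not the library's implementation.

```php
<?php

/**
 * Hypothetical helper (not part of the library): processes items batch by
 * batch, calling the given hooks around each batch.
 *
 * @param list<string> $items
 */
function processInBatches(
    array $items,
    int $batchSize,
    callable $runBeforeBatch,
    callable $runSingleCommand,
    callable $runAfterBatch
): void {
    foreach (array_chunk($items, $batchSize) as $batch) {
        $runBeforeBatch($batch);   // e.g. fetch needed resources collectively

        foreach ($batch as $item) {
            $runSingleCommand($item);
        }

        $runAfterBatch($batch);    // e.g. flush database changes
    }
}

// Record the order of the calls for three items and a batch size of 2.
$log = [];
processInBatches(
    ['1', '2', '3'],
    2,
    function (array $batch) use (&$log): void { $log[] = 'before('.count($batch).')'; },
    function (string $item) use (&$log): void { $log[] = 'item '.$item; },
    function (array $batch) use (&$log): void { $log[] = 'after('.count($batch).')'; },
);
```

In this sketch the last batch is smaller than the batch size, and the hooks still run once before and once after it.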

You can customize the default batch size of 50 by calling withBatchSize() in ::configureParallelExecutableFactory():

protected function configureParallelExecutableFactory(
      ParallelExecutorFactory $parallelExecutorFactory,
      InputInterface $input,
      OutputInterface $output,
): ParallelExecutorFactory {
    return $parallelExecutorFactory
        ->withBatchSize(150);
}

Configuration

The library offers a wide variety of configuration settings:

  • ::getParallelExecutableFactory() allows you to completely configure the ParallelExecutorFactory factory, from the segment and batch sizes to which PHP executable is used and any of the process handling hooks.
  • ::configureParallelExecutableFactory() is a different, lighter extension point to configure the ParallelExecutorFactory factory.
  • ::getContainer() allows you to configure which container is used. By default, it passes the application's kernel's container if there is one. This is used by the default error handler which resets the container in-between each item failure to avoid things such as a broken Doctrine entity manager. If you are not using a kernel (e.g. outside a Symfony application), no container will be returned by default.
  • ::createErrorHandler() allows you to configure the error handler you want to use.
  • ::createLogger() allows you to completely configure the logger you want.

Hooks

The library supports several process hooks which can be configured via ::configureParallelExecutableFactory():

Method*                                  | Scope         | Description
runBeforeFirstCommand($input, $output)   | Main process  | Run before any child process is spawned
runAfterLastCommand($input, $output)     | Main process  | Run after all child processes have completed
runBeforeBatch($input, $output, $items)  | Child process | Run before each batch in the child process (or main if no child process is spawned)
runAfterBatch($input, $output, $items)   | Child process | Run after each batch in the child process (or main if no child process is spawned)

*: When using the Parallelization trait, those hooks can be directly configured by overriding the corresponding method.

Subscribed Services

You should be using subscribed services or proxies. Otherwise, the service initially injected into the command may end up being different from the one used by the container. This is because, by default, the ResetServiceErrorHandler error handler is used, which resets the container when an item fails. As a result, if the service is not fetched directly from the container (to get a fresh instance after the container resets), you will end up using an obsolete service.

A common symptom of this issue is to run into a closed entity manager issue.
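
The effect can be illustrated without Symfony. The sketch below uses TinyContainer, a hypothetical stand-in for a resettable container (it is neither Symfony's container nor part of this library), to show why a reference obtained before a reset becomes stale:

```php
<?php

// Hypothetical stand-in for a resettable service container.
final class TinyContainer
{
    /** @var array<string, object> */
    private array $instances = [];

    /** @param array<string, callable(): object> $factories */
    public function __construct(private array $factories)
    {
    }

    public function get(string $id): object
    {
        // Create the service lazily and keep it until the next reset.
        return $this->instances[$id] ??= ($this->factories[$id])();
    }

    public function reset(): void
    {
        // Drop all instances; the next get() builds fresh ones.
        $this->instances = [];
    }
}

$container = new TinyContainer([
    'entity_manager' => static fn (): object => new stdClass(),
]);

// Injected once, e.g. through the command constructor.
$injected = $container->get('entity_manager');

// An item fails and the error handler resets the container.
$container->reset();

// Fetched again from the container after the reset.
$fresh = $container->get('entity_manager');
```

After the reset, $injected and $fresh are different objects: code that keeps using $injected works with the obsolete instance, whereas code that fetches the service from the container on each use gets the fresh one.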

Differences with Amphp/ReactPHP

If you came across this library and wonder what the differences are with Amphp or ReactPHP or other potential parallelization libraries, this section is to highlight a few differences.

The primary difference is the parallelization mechanism itself. Amphp and ReactPHP work by spawning a pool of workers and distributing the work to those. This library, however, spawns a pool of processes. To be more specific, the difference lies in how the spawned processes are used:

  • An Amphp/ReactPHP worker can share state; with this library however you cannot easily do so.
  • A worker may handle multiple jobs, whereas with this library the process is killed once its segment is completed. As a rough comparison, handling a segment in this library is equivalent to an Amphp/ReactPHP worker task, with the worker being killed after handling a single task.

The other difference is that this library works with a command as its central point. This offers the following advantages:

  • No additional context needs to be provided: once in your child process, you are in your command as usual. No custom bootstrap is necessary.
  • The command can be executed with and without parallelization seamlessly. It is also trivial to mimic the execution of a child process as it is a matter of using the --child option and passing the child items via the STDIN.
  • It is easier to adapt the distribution of the load and to contain memory leaks of the task by configuring the segment and batch sizes.
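
Because child items arrive over STDIN as one item per line, reading them is straightforward. The following is a minimal sketch of such a reader, assuming newline-delimited input; readItems() is a hypothetical function and does not reflect how the library itself reads its input.

```php
<?php

/**
 * Hypothetical helper (not part of the library): yields one item per
 * non-empty line read from the given stream, e.g. STDIN.
 *
 * @param resource $stream
 *
 * @return Generator<int, string>
 */
function readItems($stream): Generator
{
    while (false !== ($line = fgets($stream))) {
        // Strip the trailing line ending, whether "\n" or "\r\n".
        $item = rtrim($line, "\r\n");

        if ('' !== $item) {
            yield $item;
        }
    }
}

// Simulate STDIN with an in-memory stream.
$stream = fopen('php://memory', 'r+');
fwrite($stream, "1\n2\r\n\n3");
rewind($stream);

$items = iterator_to_array(readItems($stream), false);
fclose($stream);
```

Here $items ends up as ['1', '2', '3']; in a real child process the stream would be STDIN instead of php://memory.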

Contribute

Contributions to the package are always welcome!

To run the CS fixer and tests you can use the command make. More details available with make help.

Upgrade

See the upgrade guide.

Authors

License

All contents of this package are licensed under the MIT license.

console-parallelization's People

Contributors

andreas-gruenwald, brusch, christian-kolb, dependabot[bot], jbuechner, kocal, pitchart, robindev, stanislavgoraj, theofidry, webmozart

console-parallelization's Issues

Running in child processes even on single process

@webmozart I have a process with many items which I can't run in multiple processes. As far as I can see there is no way to spawn child processes for every item, is there? In the steps there is a lot of work done which uses a lot of RAM. With child processes, every data usage would be removed after one run.

Is there any way to trigger the single items within child processes without having multiple processes?

Better solution for the "item" - argument?

Sometimes you want to provide a custom argument in your command and combine it with several options.

With the current solution (v1.0.0) it is difficult to add custom arguments as the trait already has an internal "item" argument.

    protected function configure()
    {
        parent::configure();
        $this
            ->setDescription('Processes the preparation and/or update-index queue.')
            ->addArgument('queue', InputArgument::REQUIRED | InputArgument::IS_ARRAY)
        ;

        self::configureParallelization($this);

        $this
            ->addOption('tenant', null, InputOption::VALUE_REQUIRED | InputOption::VALUE_IS_ARRAY)
            //...

    }

Error:
Cannot add an argument after an array argument.

Do you think it is possible to replace the item argument and define it as an option instead?
We need to use the argument as it is because we don't want to break compatibility in Pimcore.

Trait:

            ->addArgument(
                'item',
                InputArgument::OPTIONAL,
                'The item to process'
            )

Plans for v2

  • Switch to fidry/console
  • Provide a nicer output
  • Review the edge cases and clarify the different behaviours: item passed over stdin, error handler, consequences on the container state, segment failure etc.
  • Make the library easy to test
  • Make implementing a parallel command easy to test
  • Make it usable outside of a SymfonyApp
  • Review with the team about the design/names & co.
  • Add documentation for an upgrade path
  • Review the upgrade path

Error when command is executed outside bin/ directory.

Currently commands that implement the Parallelization trait must be executed in bin/.

Example: cd ~/www/bin && bin/console pimcore:thumbnails:image --processes 1 -> works.
Example: cd ~www && ~/www/bin/console pimcore:thumbnails:image --processes 1 -> error message: Expected a string. Got: boolean.

Reason: $consolePath = realpath(getcwd().'/bin/console'); returns false if the script is executed from the home directory (debian).

I don't have a solution right now, but according to https://www.php.net/manual/en/function.getcwd.php:

On some Unix variants, getcwd() will return FALSE if any one of the parent directories does not have the readable or search mode set, even if the current directory does.
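
Whatever the root cause, the "Expected a string. Got: boolean." message comes from passing the false returned by realpath() further down. A minimal sketch of a guard that fails with a clearer message is shown below; resolveConsolePath() is a hypothetical helper, not the library's code, and it takes the working directory as a parameter so that it can be tested.

```php
<?php

/**
 * Hypothetical helper (not part of the library): resolves bin/console
 * relative to the given directory and fails loudly if it cannot be found.
 */
function resolveConsolePath(string $workingDirectory): string
{
    $path = realpath($workingDirectory.'/bin/console');

    // realpath() returns false when the file does not exist or is not reachable.
    if (false === $path) {
        throw new RuntimeException(sprintf(
            'Could not find "bin/console" in "%s".',
            $workingDirectory
        ));
    }

    return $path;
}
```

A caller could pass getcwd() here, after checking that getcwd() itself did not return false.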

Pipe breaks when quotes are used in input options

Bug Description

When double quotes are used in input options, the command outputs do not work properly anymore.

Example:

bin/console app:ecommerce:bootstrap --list-condition="o_id=20"

causes an error message and the console output doesn't work properly anymore.

Analysis

See

As a simple workaround we wrapped string values into quotes:

 $optionString .= ' --'.$name.'='.(is_string($arrayValue) ? sprintf('"%s"', $arrayValue) : $arrayValue);

With that approach, several combinations worked:

bin/console app:ecommerce:bootstrap --list-condition="o_id=20"
bin/console app:ecommerce:bootstrap --list-condition="o_id in('20')"
bin/console app:ecommerce:bootstrap --list-condition='o_id in("20")'

We should discuss whether to add that simple improvement, or add and test some kind of escaper solution, such as
https://github.com/symfony/symfony/blob/master/src/Symfony/Component/Yaml/Escaper.php (which is unfortunately marked as "internal").
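
As one possible alternative to hand-written quoting, PHP's built-in escapeshellarg() quotes a value for the current platform's shell and escapes any quotes it contains. The sketch below is only an illustration of that idea: buildOptionString() is a hypothetical helper, and whether this approach fits the way the library launches its child processes is an assumption that would need testing.

```php
<?php

/**
 * Hypothetical helper (not part of the library): renders a single
 * "--name=value" option, shell-quoting string values.
 */
function buildOptionString(string $name, string|int|float $value): string
{
    // The parentheses matter: "." binds more tightly than "?:" in PHP.
    return ' --'.$name.'='.(is_string($value) ? escapeshellarg($value) : $value);
}

$plain = buildOptionString('processes', 2);
$quoted = buildOptionString('list-condition', 'o_id in("20")');
```

The exact quoting in $quoted depends on the platform (single quotes on Unix-like systems, double quotes on Windows), so it is best verified against the shells you actually target.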

Allow trait to work without knowing the number of items

Right now, fetchItems() is only called once, which is fine.
However, we also want to use the ParallelizationTrait to process queue-like structures, or at least do another run if some items couldn't be processed for any reason.

Probably it wouldn't be that much refactoring to add a loop, so that fetchItems() will be called again before

$this->runAfterLastCommand($input, $output);

and the entire processing will repeat until fetchItems() returns an empty result?

Optionally, fetchItems() could have an argument which contains the number of the rounds (first round: 0, second round: 1, ...) so that the developer can implement a stop after the first round.

As an alternative solution, fetchItems() could remain the same (=default implementation), but optionally, fetchNextItems() can be implemented with support for queue processing.
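
The loop proposed here could look roughly like the following sketch. It is written in plain PHP with callables standing in for fetchItems() and the processing step; processInRounds() and its $maxRounds parameter are hypothetical and only illustrate the idea, they are not an existing API.

```php
<?php

/**
 * Hypothetical sketch of the proposal: keep fetching and processing items
 * until the fetcher returns nothing, and report how many rounds were run.
 *
 * @param callable(int): list<string> $fetchItems   receives the round number (0, 1, ...)
 * @param callable(list<string>): void $processItems
 */
function processInRounds(callable $fetchItems, callable $processItems, int $maxRounds = PHP_INT_MAX): int
{
    $round = 0;

    while ($round < $maxRounds) {
        $items = $fetchItems($round);

        if ([] === $items) {
            break; // nothing left in the queue
        }

        $processItems($items);
        ++$round;
    }

    return $round;
}

// A queue that yields two rounds of items and is then empty.
$queue = [['a', 'b'], ['c']];
$processed = [];

$rounds = processInRounds(
    static fn (int $round): array => $queue[$round] ?? [],
    function (array $items) use (&$processed): void {
        $processed = array_merge($processed, $items);
    }
);
```

Passing 1 as $maxRounds reproduces the current behaviour of a single fetch.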

I can provide a PR if you like.

Error Handling in Child Processes / Exchange Payload

Improvement Suggestions

Error Handling in Child processes

If in runSingleCommand() an exception is thrown, then currently the child process stops.

There is no out-of-the-box solution to react in the parent. Is there a way to forward specific exceptions to the parent, so that the parent can handle the exceptions?

Remove dependency on container in trait

As Symfony 4+ kinda deprecates injecting the whole container, wouldn't it make sense for the trait, rather than asking for a getContainer method, to ask for whatever it needs (a few parameters and the logger IIRC)? That way these can be injected and used from a command as a service.

ItemBatchIterator | Implement as Iterable?

I just stumbled upon your new ItemBatchIterator implementation: https://github.com/webmozarts/console-parallelization/blob/master/src/ItemBatchIterator.php

Actually we use a very similar concept to iterate over various types of files (XML, CSV, Excel, arrays, etc.) and are using PHP's Iterator and Countable interfaces.

interface IteratorInterface extends \Iterator, \Countable {
 
}

Probably it would be a good idea to also let the ItemBatchIterator implement a similar interface to make the solution very generic and allow multiple types of iterators?

Error when the server script name is an absolute path

With the current code:

$pwd = $_SERVER['PWD'];
$scriptName = $_SERVER['SCRIPT_NAME'];

return str_starts_with($scriptName, $pwd)
    ? $scriptName
    : $pwd.DIRECTORY_SEPARATOR.$scriptName;

If the script name is absolute, e.g. /usr/local/bin/box, the resolved script name will be a nonsensical path.
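
One way to avoid this is to check whether the script name is already absolute before prefixing it. The sketch below is a possible approach, not the fix that was actually applied: resolveScriptPath() is a hypothetical function that takes its inputs as parameters instead of reading $_SERVER, and it also falls back to getcwd() when no working directory is given.

```php
<?php

/**
 * Hypothetical helper (not the library's code): resolves the script path
 * from a script name and an optional working directory.
 */
function resolveScriptPath(string $scriptName, ?string $pwd): string
{
    // Absolute paths (Unix "/..." or Windows "C:\...") are returned untouched.
    if (str_starts_with($scriptName, '/')
        || 1 === preg_match('#^[A-Za-z]:[\\\\/]#', $scriptName)
    ) {
        return $scriptName;
    }

    $base = $pwd ?? getcwd();

    if (false === $base) {
        throw new RuntimeException('Could not determine the working directory.');
    }

    return rtrim($base, '/\\').DIRECTORY_SEPARATOR.$scriptName;
}

$absolute = resolveScriptPath('/usr/local/bin/box', '/home/me');
$relative = resolveScriptPath('bin/console', '/home/me');
```

With these inputs, $absolute stays /usr/local/bin/box, while $relative is prefixed with the working directory.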

Roadmap?

Hi guys,
I just want to thank you for that amazing bundle, which is very likely to be part of Pimcore soon!
🎉🎉🎉

That's why I want to ask if there already exists a (rough) roadmap for the future releases?

  • Stable version 1.0 has been out for 2 days, and Pimcore will probably start with that one.
  • Version 2.* is already in development. When will the next release approximately take place? How likely is it that there are breaking changes?

Error when $_SERVER['PWD'] is not set

Hi. When PWD is not set in the $_SERVER vars, this error comes up:

In ParallelExecutorFactory.php line 437:
  mb_strpos() expects parameter 2 to be string, null given

It is caused in this method of class ParallelExecutorFactory:

    private static function getScriptPath(): string
    {
        $pwd = $_SERVER['PWD'];
        $scriptName = $_SERVER['SCRIPT_NAME'];

        return 0 === mb_strpos($scriptName, $pwd)
            ? $scriptName
            : $pwd.DIRECTORY_SEPARATOR.$scriptName;
    }

Thanks, kind regards!
Tim

[2.x] Some resources are missing

Hello,

I've just had a look at this project, which seems useful. As a major version v2 is under development, I looked at it (branch main) and found that some resources are still missing in the GitHub repository (for beta 2).

Checks

Namespaces do not follow the Composer PSR-4 structure

Unable to test it in real conditions. Or did I miss something?

PS: I've tried with following composer.json file

{
    "require": {
        "webmozarts/console-parallelization": "^2@dev"
    }
}

Error when launching child processes

Hi,
I encountered an error using this package with a number of items greater than the segment size.
The child processes return the following message on each line :

===== Process Output=========
Could not open input file: bin/console

Using realpath() for the console path fixes the problem

Incorrect error message

Found:

OUT Failed to process the item "..." ...

This is incorrect as the item has a name and that name should be used instead.

Documentation for ::getParallelExecutableFactory

Hello,

Is it possible to provide documentation for using the ::getParallelExecutableFactory ?

It's not clear, and I don't understand how to use it. I have many commands that used the runBeforeFirstCommand function before the upgrade.

Regards,
Louis

Bug on Windows systems

On Windows systems an error occurs because PHP_EOL is not "\n" there but "\r\n".

Line 109 \Webmozarts\Console\Parallelization\ProcessLauncher
(instead of "\n" PHP_EOL should be used here)

Add a utility decorator logger

When writing a new logger just to decorate one or two methods, it is quite verbose to do so. It would make sense IMO to include a LoggerDecorator which:

  • decorates a given logger
  • forwards all calls to the decorated logger

This way the user could extend this class and override just the desired method.

Limit the exit code to 255

This is standard in bash and I thought it was specific to bash only, but apparently the Symfony console does the same too:

if ($this->autoExit) {
    if ($exitCode > 255) {
        $exitCode = 255;
    }

    exit($exitCode);
}
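
The same clamping can be expressed as a small function. This is only a sketch of the behaviour quoted above; clampExitCode() is a hypothetical name, not an existing function of Symfony or of this library.

```php
<?php

/**
 * Hypothetical helper: limits an exit code to 255, mirroring the
 * Symfony Console snippet quoted above.
 */
function clampExitCode(int $exitCode): int
{
    return min($exitCode, 255);
}
```

Exit codes up to 255 pass through unchanged, and anything larger is reported as 255.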
