Comments (3)
Hey @Tom-Newton, great to see you diving deep into Flyte's propeller again! 🚀
Thank you for bringing up this important issue related to the distribution of workflows across shards in Flyte. Your observation about the potential for a single shard to become overloaded due to the current shard-key inheritance behavior is valid, and your proposal to generate independent shard-keys for each subworkflow launched by the admin-launcher could indeed help to alleviate this issue.
To implement this, you would need to modify the shard-key generation logic in the Flyte system. Currently, the shard-key is generated based on a range-bounded hash over various components of the FlyteWorkflow metadata, with the keyspace range defined as [0,32). This deterministic approach ensures that each FlyteWorkflow evaluation is handled by a specific FlytePropeller instance, based on the project, domain, and shard-key labels.
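To make the idea of a range-bounded hash concrete, here is a minimal Go sketch. It is not Flyte's actual implementation: the function name `rangeBoundedShardKey`, the choice of FNV-1a, and the sample component values are all assumptions made for illustration. It only shows why identical inputs always land in the same slot of the [0,32) keyspace.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keyspaceSize mirrors the [0,32) keyspace range described above.
const keyspaceSize = 32

// rangeBoundedShardKey is a hypothetical helper (not a Flyte API). It hashes
// the given metadata components and folds the result into [0, keyspaceSize).
func rangeBoundedShardKey(components ...string) uint32 {
	h := fnv.New32a()
	for _, c := range components {
		h.Write([]byte(c))
		h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	}
	return h.Sum32() % keyspaceSize
}

func main() {
	// The same inputs always map to the same slot, which is why workflows
	// that share these inputs end up on the same shard.
	first := rangeBoundedShardKey("flytesnacks", "development", "parent-execution")
	second := rangeBoundedShardKey("flytesnacks", "development", "parent-execution")
	fmt.Println(first == second)
}
```

Because the mapping is deterministic, anything that reuses the parent's inputs (or inherits its key outright) is pinned to the parent's shard.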
However, as you've pointed out, this can lead to an uneven distribution of workflows across shards, especially in cases where a large workflow spawns many subworkflows. To achieve a more uniform distribution, the shard-key generation logic could be updated to include a dynamic component, such as a UUID or a timestamp, in addition to the existing metadata components. This would ensure that each new workflow gets a unique shard-key, leading to a more even distribution across shards.
Here's a conceptual approach to implementing this:
- Modify the Shard-Key Generation Logic: Update the logic that generates the shard-key for each FlyteWorkflow to include a dynamic component. For example, you could concatenate the current hash components with a UUID or the current timestamp at the moment of workflow creation.
- Update the FlytePropeller Manager Configuration: Ensure that the FlytePropeller Manager is configured to handle the updated range of shard-keys. Since the dynamic component could potentially increase the variability of shard-keys, you might need to adjust the shard-count or the keyspace range to accommodate the new distribution pattern.
- Test the Distribution: After implementing the changes, it's crucial to test the new shard-key generation mechanism to ensure that it achieves a more uniform distribution of workflows across shards. You could simulate a large number of workflow creations and analyze the resulting shard-key distribution to verify its uniformity.
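The distribution test in the last step could be sketched roughly as follows. Everything here is an assumption for illustration: `shardFor` and `countPerShard` are hypothetical helpers, the synthetic workflow IDs stand in for real executions, and SHA-256 folded into [0,32) stands in for whatever hash the real system uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// keyspaceSize mirrors the [0,32) keyspace range from the discussion.
const keyspaceSize = 32

// shardFor is a hypothetical helper that maps an ID to a keyspace slot.
func shardFor(id string) uint64 {
	sum := sha256.Sum256([]byte(id))
	return binary.BigEndian.Uint64(sum[:8]) % keyspaceSize
}

// countPerShard simulates n workflow creations with synthetic IDs and
// returns how many of them landed in each slot.
func countPerShard(n int) [keyspaceSize]int {
	var counts [keyspaceSize]int
	for i := 0; i < n; i++ {
		counts[shardFor(fmt.Sprintf("workflow-%d", i))]++
	}
	return counts
}

func main() {
	const n = 32000
	counts := countPerShard(n)
	lo, hi := counts[0], counts[0]
	for _, c := range counts {
		if c < lo {
			lo = c
		}
		if c > hi {
			hi = c
		}
	}
	// With a uniform key, lo and hi should both sit near n/keyspaceSize.
	fmt.Printf("min=%d max=%d ideal=%d\n", lo, hi, n/keyspaceSize)
}
```

A large gap between the smallest and largest bucket relative to the ideal count would suggest the key is not spreading load evenly.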
Here is a pseudo-code example illustrating how you might implement the dynamic shard-key generation:
import (
	"crypto/sha256"
	"encoding/hex"
	"fmt" // required by fmt.Sprintf below
	"time"
)

func generateShardKey(workflowMetadata string) string {
	// Generate a dynamic component, e.g., a timestamp
	timestamp := time.Now().UnixNano()
	dynamicComponent := fmt.Sprintf("%d", timestamp)

	// Concatenate the workflow metadata with the dynamic component
	combinedKey := workflowMetadata + dynamicComponent

	// Hash the combined key to generate the shard-key
	hasher := sha256.New()
	hasher.Write([]byte(combinedKey))
	shardKey := hex.EncodeToString(hasher.Sum(nil))

	// Optionally, you could further process the shardKey to fit the desired keyspace range
	// For simplicity, this step is omitted
	return shardKey
}
This example combines the workflow metadata with a dynamic component (a timestamp in this case) to generate a unique shard-key for each workflow. You would need to integrate this logic into the part of the Flyte system responsible for initializing FlyteWorkflow CRDs, ensuring that each new workflow gets a unique shard-key based on this mechanism.
Please note that this is a conceptual approach and might require adjustments based on the specific requirements and constraints of your Flyte deployment. Also, keep in mind that any changes to the shard-key generation logic should be thoroughly tested to ensure they don't introduce new issues or regressions.
As for your offer to contribute a Pull Request, that's greatly appreciated! However, as a bot, I'm unable to create or manage Pull Requests. I recommend discussing your proposal and the potential changes with the Flyte maintainers directly to ensure they align with the project's goals and guidelines. They can provide valuable feedback and guidance on how to best implement and test these changes.
I hope this information is helpful and provides a good starting point for addressing this issue. If you have any further questions or need additional clarification, please don't hesitate to ask.
from flyte.
I think the bot hasn't really understood. #5174 should resolve it.
I agree, this should use a different shard key.