
fix: enforce concurrency limits for data operations #5754

Open

jakharmonika364 wants to merge 1 commit into fluid-cloudnative:master from jakharmonika364:fix/concurrency-lock

Conversation

@jakharmonika364 (Contributor)

Ⅰ. Describe what this PR does

Ensures that data operations (specifically DataLoad) strictly respect their ParallelTaskNumber limit.

Previously, SetDataOperationInProgress would always append new operation names to a comma-separated string, effectively allowing multiple tasks to run on the same dataset simultaneously. This PR introduces a CanStartDataOperation check in the common locking logic (operation_lock.go) to prevent new tasks from registering if the parallel limit (which is 1 for DataLoad/DataMigrate) has already been reached.
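The gating described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the function signature, the map keyed by operation type, and the operation names are all assumptions; the real method lives on `Dataset` and reads `Status.OperationRef`.

```go
package main

import (
	"fmt"
	"strings"
)

// canStartDataOperation is a sketch of the gating logic: registration is
// allowed only while the number of recorded operations of the given type is
// below maxParallel, and a re-entrant sync (the operation already recorded
// under opType) is always allowed so an operation never blocks itself.
func canStartDataOperation(operationRef map[string]string, opType, opName string, maxParallel int) bool {
	ref := operationRef[opType]
	if ref == "" {
		return true // no operation of this type is running on the dataset
	}
	current := strings.Split(ref, ",") // names are stored comma-separated
	for _, name := range current {
		if name == opName {
			return true // re-entrant sync cycle: do not self-block
		}
	}
	return len(current) < maxParallel
}

func main() {
	refs := map[string]string{"DataLoad": "load-a"}
	fmt.Println(canStartDataOperation(refs, "DataLoad", "load-b", 1)) // false: limit 1 already reached
	fmt.Println(canStartDataOperation(refs, "DataLoad", "load-a", 1)) // true: re-entrant sync
	fmt.Println(canStartDataOperation(refs, "DataLoad", "load-b", 2)) // true: still under limit 2
}
```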

Ⅱ. Does this pull request fix one issue?

fixes #1138

Ⅲ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Added unit tests in api/v1alpha1/dataset_types_test.go to cover:

  • Successful lock acquisition on empty datasets.
  • Correct blocking when ParallelTaskNumber is 1 and another operation is present.
  • Support for parallel limits > 1.
  • Handling of re-entrant sync cycles (avoiding self-blocking).
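The four cases above can be expressed as a table-driven sketch. The helper `canStart` here is a hypothetical stand-in for `CanStartDataOperation` (its signature and the operation names are illustrative), shown only to make the test matrix concrete.

```go
package main

import (
	"fmt"
	"strings"
)

// canStart is a stand-in for the gating check: ref is the comma-separated
// list of running operation names, opName the candidate operation.
func canStart(ref, opName string, maxParallel int) bool {
	if ref == "" {
		return true
	}
	names := strings.Split(ref, ",")
	for _, n := range names {
		if n == opName {
			return true // already registered: re-entrant sync
		}
	}
	return len(names) < maxParallel
}

func main() {
	cases := []struct {
		desc        string
		ref, opName string
		maxParallel int
		want        bool
	}{
		{"empty dataset acquires lock", "", "load-a", 1, true},
		{"blocked when limit is 1 and one op present", "load-a", "load-b", 1, false},
		{"parallel limit > 1 admits a second op", "load-a", "load-b", 2, true},
		{"re-entrant sync cycle does not self-block", "load-a", "load-a", 1, true},
	}
	for _, c := range cases {
		got := canStart(c.ref, c.opName, c.maxParallel)
		fmt.Printf("%s: got=%v want=%v\n", c.desc, got, c.want)
	}
}
```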

Ⅳ. Describe how to verify it

  1. Create a DataLoad task for a dataset.
  2. While the first task is Executing, create a second DataLoad for the same dataset.
  3. Verify that the second task remains Pending and the controller logs indicate the dataset is already performing an operation.
  4. Once the first task finishes, the second task should automatically proceed.
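The two competing tasks in steps 1–2 could look like the manifests below. The dataset name `demo` and the DataLoad names are placeholders; field names follow the Fluid DataLoad CRD.

```yaml
# First DataLoad: starts executing against the dataset.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: dataload-first
spec:
  dataset:
    name: demo
    namespace: default
---
# Second DataLoad on the same dataset: should stay Pending until the
# first one finishes.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: dataload-second
spec:
  dataset:
    name: demo
    namespace: default
```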

Ⅴ. Special notes for reviews

The implementation is generic and applies to all operations using the OperationInterface, ensuring consistent locking behavior across the project.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a concurrency control mechanism for data operations by adding a CanStartDataOperation method to the Dataset type. This method checks if a new operation can be initiated based on a maximum parallel task limit and the current state of operations. The changes include comprehensive unit tests for this new logic and its integration into the operation locking flow. The review feedback suggests a performance optimization by moving the concurrency check before the resource-intensive DeepCopy operation and recommends enhancing the error message to include the specific concurrency limit for better observability.

Comment on lines +74 to +78

```go
datasetToUpdate := dataset.DeepCopy()

if !datasetToUpdate.CanStartDataOperation(operationTypeName, operation.GetParallelTaskNumber(), dataOpKey) {
	return fmt.Errorf("the dataset %s is already in %s, please wait", targetDataset.Name, operationTypeName)
}
```

Severity: medium

There are two improvement opportunities here:

  1. Performance Optimization: Move the CanStartDataOperation check before calling DeepCopy(). DeepCopy() is an expensive operation, and it is unnecessary to perform it if the concurrency limit has already been reached and the operation cannot start.
  2. Error Message Clarity: The error message is a bit generic. Since this logic now supports ParallelTaskNumber > 1, it would be more helpful to include the limit in the error message to assist with troubleshooting.

Note: CanStartDataOperation is a read-only check on the Status field, so it is safe to call on the original dataset object.

Suggested change

Before:

```go
datasetToUpdate := dataset.DeepCopy()
if !datasetToUpdate.CanStartDataOperation(operationTypeName, operation.GetParallelTaskNumber(), dataOpKey) {
	return fmt.Errorf("the dataset %s is already in %s, please wait", targetDataset.Name, operationTypeName)
}
```

After:

```go
if !dataset.CanStartDataOperation(operationTypeName, operation.GetParallelTaskNumber(), dataOpKey) {
	return fmt.Errorf("the dataset %s has reached the maximum number of parallel %s operations (limit: %d), please wait", targetDataset.Name, operationTypeName, operation.GetParallelTaskNumber())
}
datasetToUpdate := dataset.DeepCopy()
```

@codecov

codecov bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.22%. Comparing base (965f661) to head (7abe7f0).
⚠️ Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/ddc/base/operation_lock.go | 0.00% | 4 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #5754      +/-   ##
==========================================
- Coverage   61.22%   61.22%   -0.01%
==========================================
  Files         444      444
  Lines       30557    30561       +4
==========================================
  Hits        18710    18710
- Misses      10307    10311       +4
  Partials     1540     1540
```

☔ View full report in Codecov by Sentry.

Copilot AI left a comment

Pull request overview

This PR tightens dataset-level locking for data operations so that operations (notably DataLoad) respect a configured parallelism limit when registering themselves in Dataset.Status.OperationRef, preventing unbounded concurrent operations on the same dataset.

Changes:

  • Add a gate (CanStartDataOperation) to check whether a new operation can be registered given the current OperationRef entries and a max-parallel limit.
  • Enforce the gate inside the common lock acquisition path (SetDataOperationInTargetDataset) before updating dataset status.
  • Add unit tests covering empty state, re-entrant behavior, and max-parallel enforcement.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| pkg/ddc/base/operation_lock.go | Adds concurrency gating before setting OperationRef during lock acquisition. |
| api/v1alpha1/dataset_types.go | Introduces Dataset.CanStartDataOperation to validate whether an operation can start given current refs and a limit. |
| api/v1alpha1/dataset_types_test.go | Adds unit tests for the new CanStartDataOperation behavior. |


Comment on lines +76 to 78

```go
// set current data operation in the target dataset
datasetToUpdate := dataset.DeepCopy()
```

Copilot AI Apr 1, 2026

The lock limit is derived from operation.GetParallelTaskNumber(). For DataMigrate this returns Spec.Parallelism (parallel workers within a single DataMigrate), so a DataMigrate with Parallelism>1 would allow multiple concurrent DataMigrate CRs on the same Dataset, contradicting the PR description that DataMigrate is limited to 1. Consider separating “max concurrent operations per dataset” from “intra-operation parallelism” (e.g., a dedicated interface method or a hard-coded 1 for DataMigrate).
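One way to realize the suggested separation is a dedicated interface method, sketched below. All names here are illustrative assumptions, not existing Fluid APIs: the point is only that the per-dataset cap is independent of the intra-operation worker count.

```go
package main

import "fmt"

// OperationConcurrency separates "how many operations may run on a dataset
// at once" from "how many workers a single operation uses internally".
type OperationConcurrency interface {
	// GetMaxConcurrentOperations returns the per-dataset concurrency cap.
	GetMaxConcurrentOperations() int
}

// dataMigrate is a hypothetical stand-in for the DataMigrate operation.
type dataMigrate struct {
	parallelism int // intra-operation worker count (Spec.Parallelism)
}

// The per-dataset cap stays hard-coded at 1, regardless of how many
// workers the migration uses internally.
func (d dataMigrate) GetMaxConcurrentOperations() int { return 1 }

func main() {
	m := dataMigrate{parallelism: 4}
	fmt.Println(m.GetMaxConcurrentOperations()) // 1, even with parallelism 4
}
```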

Comment on lines +76 to 78

```go
// set current data operation in the target dataset
datasetToUpdate := dataset.DeepCopy()
```

Copilot AI Apr 1, 2026

Returning a plain fmt.Errorf here causes SetDataOperationInTargetDataset to log at error level on every retry (and the reconciler requeues periodically), even though lock contention is an expected state when parallelism is exceeded. Consider using a typed/sentinel error so callers can log at Info/Debug and/or emit a clearer event, and include the dataset namespace/current OperationRef in the message to help users diagnose what is blocking.

Signed-off-by: Monika Jhakar <jakharmonika364@gmail.com>
Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
@sonarqubecloud

sonarqubecloud bot commented Apr 1, 2026

@fluid-e2e-bot

fluid-e2e-bot bot commented Apr 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yangyuliufeng for approval by writing /assign @yangyuliufeng in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

2 similar comments

@fluid-e2e-bot

fluid-e2e-bot bot commented Apr 3, 2026

Hi @jakharmonika364. Thanks for your PR.

I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Development

Successfully merging this pull request may close these issues.

what would happen if multiple ml training task read the same fluid dataset without pre-heating?

2 participants