cloud/fs concurrency for large files #9893
Open
Labels
A: data-sync (Related to dvc get/fetch/import/pull/push)
fs: gs (Related to the Google Cloud Storage filesystem)
p3-nice-to-have (It should be done this or next sprint)
@shcheklein in that case our concurrency level will be `jobs * jobs`, which is generally going to be way too high in the default case. I also considered splitting `jobs` between the two (so `batch_size=(min(1, jobs // 2))` and the same for `max_concurrency`), but that will make us perform a lot worse in the cases where you are pushing a large number of files that are smaller than the chunk size.

I think it will be worth revisiting this to properly determine what level of concurrency we should be using at both the file and chunk level, but that is dependent on the number of files being transferred, the size of all of those files, and the chunk size for the given cloud. This is all work that we can do at some point, but in the short term I prioritized getting a fix for the worst-case scenario for Azure (pushing a single large file).
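
To make the trade-off above concrete, here is a rough sketch of the arithmetic, not DVC's actual implementation. The helper names `plan` and `in_flight` are hypothetical; `batch_size` is taken to mean the number of files transferred at once and `max_concurrency` the number of chunks transferred at once per file. The split variant uses `max(1, jobs // 2)` on the assumption that a floor of 1 was intended, since `min(1, jobs // 2)` would never exceed 1.

```python
def plan(jobs: int, split: bool = False) -> tuple[int, int]:
    """Return (batch_size, max_concurrency) for a given --jobs value.

    Hypothetical helper: with split=False both levels get the full `jobs`
    value; with split=True `jobs` is divided between the two levels.
    """
    if split:
        # Assumption: a floor of 1 is intended so neither level drops to zero.
        per_level = max(1, jobs // 2)
        return per_level, per_level
    return jobs, jobs


def in_flight(n_files: int, chunks_per_file: int, batch_size: int, max_concurrency: int) -> int:
    """Upper bound on simultaneous chunk transfers for a uniform workload."""
    return min(n_files, batch_size) * min(chunks_per_file, max_concurrency)


jobs = 16

# Many large files: both levels saturate, giving jobs * jobs transfers.
print(in_flight(100, 100, *plan(jobs)))              # 256
print(in_flight(100, 100, *plan(jobs, split=True)))  # 64

# Many small files (one chunk each): splitting halves the useful concurrency.
print(in_flight(100, 1, *plan(jobs)))                # 16
print(in_flight(100, 1, *plan(jobs, split=True)))    # 8
```

Under these assumptions, passing `jobs` to both levels only overshoots when many multi-chunk files are transferred together, while splitting penalizes the common case of many small files, which is the regression described above.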
Also, any work that we do on this right now would only work for Azure, since adlfs is currently the only underlying fsspec implementation that actually does concurrent chunked/multipart uploads and downloads. It would be better for us to contribute upstream to make the s3/gcs/etc. implementations support chunk/multipart concurrency first, before we get into trying to make DVC optimize the balance between file-level and chunk-level concurrency.
Originally posted by @pmrowla in treeverse/dvc-objects#218 (comment)
Tasks