perf(watson): prefetch relations + force async indexing#14881
Open
valentijnscholten wants to merge 6 commits into
Open
perf(watson): prefetch relations + force async indexing#14881valentijnscholten wants to merge 6 commits into
valentijnscholten wants to merge 6 commits into
Conversation
Watson's SearchAdapter resolves __-separated relation paths via per-instance getattr, triggering an N+1 query storm during async indexing. For Finding (test__engagement__product__name + jira_issue__jira_key) and Vulnerability_Id (finding__test__engagement__product__name) on a 1000-row batch this adds thousands of extra queries per task. dojo/utils_watson_prefetch.py auto-derives select_related / prefetch_related paths from each adapter's fields/store by walking model._meta, then applies them in update_watson_search_index_for_model. Toggle: DD_WATSON_INDEX_PREFETCH_ENABLED (default True). On any error we log loudly and fall back to the plain queryset so indexing still completes. Also adds force_async=True to dojo_dispatch_task / we_want_async — keeps the watson indexer in the background even when the caller is a block_execution=True user, since index updates are slow and never need to be synchronous from the user's perspective. Tests: - unittests/test_watson_index_prefetch.py (10 tests) — path classification for Product/Finding/Vulnerability_Id/Endpoint, unknown-path drop, setting toggle, derivation-raise fallback with log assertion. - unittests/test_celery_dispatch_force_async.py (4 tests) — force_async precedence over sync=True and block_execution.
3 tasks
- test_tag_inheritance_perf: update V2/V3 import baselines (-52 each) to reflect adapter-derived select_related/prefetch_related in the async watson indexer running inline under CELERY_TASK_ALWAYS_EAGER. - test_watson_async_search_index: add CELERY_TASK_ALWAYS_EAGER=True to the threshold=0 case. force_async=True now always dispatches via apply_async; without eager mode the task never runs and the index stays empty.
Wrap watson.search_context_manager.add_to_context with a size-based hook that drains the per-request context to async celery tasks as soon as it reaches WATSON_ASYNC_INDEX_UPDATE_BATCH_SIZE, instead of waiting for end-of-request. Bounds in-memory growth on long-running imports and lets celery workers start indexing batches earlier (parallel fanout). Hook installed once in dojo.apps.ready(). BATCH_SIZE doubles as threshold; set to 0/negative to disable the intermediate flush. Drop WATSON_ASYNC_INDEX_UPDATE_THRESHOLD: index dispatch is now unconditionally async. Removes the sub-threshold sync branch (which blocked the request on _bulk_save_search_entries) and the disable-async path. Consolidate _extract_tasks_for_async + _trigger_async_index_update + _dispatch_async_index_batches + _flush_search_context_intermediate into one helper `_drain_search_context_to_async` that groups, dispatches, and discards entries from the set in place. With the set drained, watson's end() bulk-saves an empty iterator — no explicit invalidate() needed. Tests: - test_watson_intermediate_flush: new — drain dispatches + clears, threshold-triggered hook, threshold=0 disables, invalid context skips. - test_watson_async_search_index: collapse three threshold-variant tests into one, class-level CELERY_TASK_ALWAYS_EAGER=True. - test_tag_inheritance_perf: reimport no-change baselines V2 69→74, V3 87→92 (always-async path adds 5 queries vs prior sub-threshold sync branch).
Lock in the N+1 elimination claim directly with CaptureQueriesContext — previously only observed indirectly via the ZAP import perf test.
CI runs the V3_FEATURE_LOCATIONS=True matrix where BaseModel.save calls full_clean — Product.description is blank=False, so the bare fixture ValidationErrors out. Local default (V3 off) skips validation, masking this in the prior run.
6431064 to
2683553
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Watson index updates were (in OS) done in the background. Still they were slow and inefficient as there is no prefetching of relations. So Waton will happily execute N x M x X x ... queries. They were also accumulating all models of a long running request in memory.
SearchAdapter._resolve_fieldwalks__-separated relation paths via per-instancegetattr, triggering one query per FK hop per object during async indexing. ForFinding(test__engagement__product__name+jira_issue__jira_key) andVulnerability_Id(finding__test__engagement__product__name) on a 1000-row batch this adds thousands of extra queries per task.dojo/utils_watson_prefetch.pyauto-derivesselect_related/prefetch_relatedpaths from each adapter'sfields/storeby walkingmodel._meta, then applies them inupdate_watson_search_index_for_model. On any error we log loudly and fall back to the plain queryset so indexing still completes.DD_WATSON_INDEX_PREFETCH_ENABLEDenv /WATSON_INDEX_PREFETCH_ENABLEDsetting, defaultTrue.force_async=Truedispatch flag.dojo_dispatch_task/we_want_asyncnow acceptforce_async=True, which keeps a task in the background even when the user hasblock_execution=Trueorsync=Trueis also passed. Wired into the async watson indexer middleware — index updates are slow and never need to be synchronous from the user's perspective.Why
AsyncSearchContextMiddleware. Even on the background path the indexing was slow because Watson re-fetched every relation per object. This change cuts indexing queries from O(objects × FK depth) to O(1) per batch on the FK chain.force_asyncdecouples "this task should run in the background" from "this user wants foreground execution" — the watson indexer is a clear case where the caller knows better than the user preference.Notes
sync=True→force_sync=Truefor symmetry withforce_async), but deferred to a dedicated PR since it touches ~84 sites