ElasticSearch: Add timeout parameter to _search#28935
Conversation
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1da2ecc3b0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| try { | ||
| return client.search(request); | ||
| SearchResponse response = client.search(request); | ||
| if (response.isTimedOut()) { |
There was a problem hiding this comment.
Clear scroll when aborting timed-out initial search
If client.search(request) returns timed_out=true, this branch throws before returning the SearchResponse. In the beginSearch path, that happens before ScanQueryPageSource.SearchHitIterator is constructed, so no caller can read the returned _scroll_id and invoke clearScroll. When timed-out scroll searches return a scroll ID, those contexts are left open until elasticsearch.scroll-timeout expires, which can accumulate under repeated timeouts and waste Elasticsearch resources.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
@codex This is a good remark, but I think it's a separate issue. Whether we want to clear scrolls and when is not specific to the timeout case.
If your human masters want me to specifically add a clearScroll call before the throw here I'll do it though.
There was a problem hiding this comment.
To use Codex here, create an environment for this repo.
|
I signed the cla and sent it to cla@trino.io. |
| SearchSourceBuilder sourceBuilder = SearchSourceBuilder.searchSource() | ||
| .query(query); | ||
|
|
||
| sourceBuilder.timeout(new TimeValue(requestTimeout.toMillis(), MILLISECONDS)); |
There was a problem hiding this comment.
sourceBuilder is a builder. There is no need to separate the call.
Also, the MILLISECONDS is redundant. We can call the constructor without the parameter, or timeValueMillis helper method instead.
SearchSourceBuilder sourceBuilder = SearchSourceBuilder.searchSource()
.query(query)
.timeout(timeValueMillis(requestTimeout.toMillis()));There was a problem hiding this comment.
Additionally do we need to configure them for count queries as well ?
There was a problem hiding this comment.
I was looking at ES documentation, setting this timeout doesn't guarantee a query to be terminated right ?
There was a problem hiding this comment.
I moved the timeout to the builder chain, and used new TimeValue(requestTimeout.toMillis()) to match how the scroll timeout is defined
Could you change the commit to "Add timeout to _search in Elasticsearch"? |
|
@lovasoa But even setting request timeout wouldn't help us much right ? It might return partial result and the queries might still be executing.
I think even setting this property might not help us. |
|
This pr is related to, but does not fix #28927. This pr prevents queries from executing after the http timeout configured in the es plugin settings when trino does not close the connection quick enough (because the node is overloaded for instance). We still need a separate fix to handle trino-side cancellations (the issue in #28927) |
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
e0bcc44 to
05c89eb
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
|
@cla-bot : no need to repost your message on every commit. I told you I sent the signed cla already. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 05c89ebc31
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (response.isTimedOut()) { | ||
| throw new TrinoException(ELASTICSEARCH_CONNECTION_ERROR, "Elasticsearch query timed out"); | ||
| } |
There was a problem hiding this comment.
Handle timed-out scroll pages without leaking contexts
Throwing immediately on response.isTimedOut() drops the new scroll ID returned by that page, because SearchHitIterator.reset(...) is only called after nextPage returns (ScanQueryPageSource.computeNext). In clusters where Elasticsearch rotates scroll IDs between pages, close() will clear only the stale prior ID, leaving the timed-out page's scroll context open until elasticsearch.scroll-timeout expires; under repeated timeouts this accumulates server-side resources.
Useful? React with 👍 / 👎.
But setting this property doesn't ensure the query is cancelled right ? I'm trying to understand what we would benefit here ? If our primary goal is support partial result then yes, if not doesn't closing the channel helps us here. |
|
Closing the channel does the job. But the person responsible for the elasticsearch cluster wants a guarantee that the query will not keep using resources on the elasticsearch side whatever happens on the client side. If the trino node is overloaded, it may take several seconds to actually close the connection (and thus for the cancellation to propagate to elasticsearch). This pr does not remove the HTTP timeout, but it propagates the timeout information to elasticsearch, so that the query is always cancelled after the timeout on the ES side, whatever happens on the trino side. |
|
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack. |
|
@cla-bot check |
Description
Elasticsearch _search requests now carry the configured request timeout, and Trino fails the query when Elasticsearch reports that the search timed out.
This makes Elasticsearch aware of the timeout, so that it cancels the query on its side even if the Trino node is too slow and does not cut the HTTP connection on time.
Additional context and related issues
Related to #28927.
Release notes
Release notes are required, with the following suggested text: