Support REACTIVE pipeline recovery with config.reload manager #18930
yaauie merged 12 commits into elastic:main
This pull request does not have a backport label. Could you fix it @yaauie? 🙏
This is odd, given that `crashy.conf` does allow the pipeline to boot up and only occasionally crash, so seeing `reload.failures: 23` vs `reload.successes: 0` doesn't seem correct.
elsif pipeline.crashed? && pipeline.configured_as_recoverable?
  actions << LogStash::PipelineAction::Recover.new(pipeline_config, @metric)
Ensure that a PipelineAction::Recover isn't initiated until the crash has fully settled.
- elsif pipeline.crashed? && pipeline.configured_as_recoverable?
-   actions << LogStash::PipelineAction::Recover.new(pipeline_config, @metric)
+ elsif pipeline.crashed? && !pipeline.running? && pipeline.configured_as_recoverable?
+   actions << LogStash::PipelineAction::Recover.new(pipeline_config, @metric)
I failed to capture this in If we also avoid kicking off the action until the pipeline has settled its crash, we will avoid seeing the failures there, too.
- do not resolve recovery action until pipeline has settled into its crash state
- capture a successful recovery as a successful reload in metrics
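To illustrate the settled-crash guard discussed above, here is a minimal Ruby sketch. `PipelineState` and `resolve_action` are hypothetical stand-ins invented for this example, not Logstash classes; only the three predicate names come from the diff in this thread.

```ruby
# Hypothetical stand-in for a pipeline's observable state (not Logstash's real class).
# The predicate names mirror those used in the suggested change.
PipelineState = Struct.new(:crashed, :running, :recoverable, keyword_init: true) do
  def crashed?; crashed; end
  def running?; running; end
  def configured_as_recoverable?; recoverable; end
end

# Hypothetical helper: returns :recover only once the crash has fully settled,
# meaning the pipeline is marked as crashed AND is no longer running.
def resolve_action(pipeline)
  if pipeline.crashed? && !pipeline.running? && pipeline.configured_as_recoverable?
    :recover
  else
    :none
  end
end

mid_crash = PipelineState.new(crashed: true, running: true,  recoverable: true)
settled   = PipelineState.new(crashed: true, running: false, recoverable: true)
opted_out = PipelineState.new(crashed: true, running: false, recoverable: false)

puts resolve_action(mid_crash) # still shutting down, so no action yet
puts resolve_action(settled)   # crash has settled, so recovery is resolved
puts resolve_action(opted_out) # not configured as recoverable
```

Under this sketch, a convergence cycle that observes a pipeline mid-crash resolves no action and tries again on the next cycle, which is what keeps premature recovery attempts from being counted as reload failures.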
@yaauie Can you move this PR out of draft please?
Health report tests against f02a99a are green here
@JRubyMethod(name = "configured_as_recoverable?")
public final IRubyObject isConfiguredAsRecoverable(final ThreadContext context) {
    final String recoverableSettingValue = getSetting(context, "pipeline.recoverable").asJavaString();
    final boolean result = switch (recoverableSettingValue) {
Should we warn/fail loading when config.reload.automatic + true or auto is set?
Handled in f7ae601:
- `queue.type: memory` + `pipeline.recoverable: auto` + `config.reload.automatic: true`: resolves to `false`, no warning

    [2026-04-07T20:21:04,173][INFO ][logstash.javapipeline ][main] Starting pipeline {pipeline_id: "main", "pipeline.workers" => 12, "pipeline.batch.size" => 125, "pipeline.batch.delay" => 50, "pipeline.batch.output_chunking.growth_threshold_factor" => 1000, "pipeline.max_inflight" => 1500, "batch_metric_sampling" => "minimal", "pipeline.sources" => ["/Users/rye/src/elastic/logstash@main/crashy.conf"], "pipeline.recoverable" => false, thread: "#<Thread:0x1dcd1110 /Users/rye/src/elastic/logstash@main/logstash-core/lib/logstash/java_pipeline.rb:147 run>"}

- `queue.type: persisted` + `pipeline.recoverable: auto` + `config.reload.automatic: true`: resolves to `true`; informational message

    [2026-04-07T20:21:58,949][INFO ][logstash.javapipeline ][main] Pipeline with `queue.type: persisted` is configured to be recoverable with `pipeline.recoverable: auto`; in the event of a crash some in-flight events may be re-processed {pipeline_id: "main", thread: "#<Thread:0x4c14afd /Users/rye/src/elastic/logstash@main/logstash-core/lib/logstash/java_pipeline.rb:147 run>"}
    [2026-04-07T20:21:58,952][INFO ][logstash.javapipeline ][main] Starting pipeline {pipeline_id: "main", "pipeline.workers" => 12, "pipeline.batch.size" => 125, "pipeline.batch.delay" => 50, "pipeline.batch.output_chunking.growth_threshold_factor" => 1000, "pipeline.max_inflight" => 1500, "batch_metric_sampling" => "minimal", "pipeline.sources" => ["/Users/rye/src/elastic/logstash@main/crashy.conf"], "pipeline.recoverable" => true, thread: "#<Thread:0x4c14afd /Users/rye/src/elastic/logstash@main/logstash-core/lib/logstash/java_pipeline.rb:147 run>"}

- `queue.type: memory` + `pipeline.recoverable: true` + `config.reload.automatic: true`: logs warning about automating data loss

    [2026-04-07T20:22:43,473][WARN ][logstash.javapipeline ][main] Pipeline with `queue.type: memory` is configured to be recoverable with `pipeline.recoverable: true`; in the event of a crash in-flight events will be lost, so enabling auto-recovery increases the risk of data loss. {pipeline_id: "main", thread: "#<Thread:0x58a27296 /Users/rye/src/elastic/logstash@main/logstash-core/lib/logstash/java_pipeline.rb:147 run>"}
    [2026-04-07T20:22:43,477][INFO ][logstash.javapipeline ][main] Starting pipeline {pipeline_id: "main", "pipeline.workers" => 12, "pipeline.batch.size" => 125, "pipeline.batch.delay" => 50, "pipeline.batch.output_chunking.growth_threshold_factor" => 1000, "pipeline.max_inflight" => 1500, "batch_metric_sampling" => "minimal", "pipeline.sources" => ["/Users/rye/src/elastic/logstash@main/crashy.conf"], "pipeline.recoverable" => true, thread: "#<Thread:0x58a27296 /Users/rye/src/elastic/logstash@main/logstash-core/lib/logstash/java_pipeline.rb:147 run>"}

- `queue.type: persisted` + `pipeline.recoverable: true` + `config.reload.automatic: true`: logs info about possible re-processing

    [2026-04-07T20:23:27,604][INFO ][logstash.javapipeline ][main] Pipeline with `queue.type: persisted` is configured to be recoverable with `pipeline.recoverable: true`; in the event of a crash some in-flight events may be re-processed {pipeline_id: "main", thread: "#<Thread:0x4f6df6d3 /Users/rye/src/elastic/logstash@main/logstash-core/lib/logstash/java_pipeline.rb:147 run>"}
    [2026-04-07T20:23:27,609][INFO ][logstash.javapipeline ][main] Starting pipeline {pipeline_id: "main", "pipeline.workers" => 12, "pipeline.batch.size" => 125, "pipeline.batch.delay" => 50, "pipeline.batch.output_chunking.growth_threshold_factor" => 1000, "pipeline.max_inflight" => 1500, "batch_metric_sampling" => "minimal", "pipeline.sources" => ["/Users/rye/src/elastic/logstash@main/crashy.conf"], "pipeline.recoverable" => true, thread: "#<Thread:0x4f6df6d3 /Users/rye/src/elastic/logstash@main/logstash-core/lib/logstash/java_pipeline.rb:147 run>"}

- `pipeline.recoverable: true` + `config.reload.automatic: false`: logs warning about not actually being recoverable

    [2026-04-07T20:26:18,539][WARN ][logstash.javapipeline ][main] Pipeline is configured to be recoverable with `pipeline.recoverable: true`, but config reloading has been disabled with `config.reload.automatic: false`; if this pipeline crashes it will NOT be recovered. {pipeline_id: "main", thread: "#<Thread:0x1007a0c /Users/rye/src/elastic/logstash@main/logstash-core/lib/logstash/java_pipeline.rb:147 run>"}
    [2026-04-07T20:26:18,544][INFO ][logstash.javapipeline ][main] Starting pipeline {pipeline_id: "main", "pipeline.workers" => 12, "pipeline.batch.size" => 125, "pipeline.batch.delay" => 50, "pipeline.batch.output_chunking.growth_threshold_factor" => 1000, "pipeline.max_inflight" => 1500, "batch_metric_sampling" => "minimal", "pipeline.sources" => ["/Users/rye/src/elastic/logstash@main/crashy.conf"], "pipeline.recoverable" => false, thread: "#<Thread:0x1007a0c /Users/rye/src/elastic/logstash@main/logstash-core/lib/logstash/java_pipeline.rb:147 run>"}
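The five cases above can be summarized as a small decision table. The following Ruby sketch only restates what the logs show; `resolve_recoverable` is a hypothetical helper written for this illustration and is not the actual `isConfiguredAsRecoverable` implementation. Two branches are assumptions not covered by the logs: that `pipeline.recoverable: false` resolves silently to `false`, and that `auto` with reloading disabled also resolves silently to `false`.

```ruby
# Hypothetical helper (not Logstash code): returns [effective_value, log_level],
# where log_level is nil when no message is expected before "Starting pipeline".
def resolve_recoverable(setting:, queue_type:, reload_automatic:)
  case setting
  when "false"
    # Assumption: the default opts out silently.
    [false, nil]
  when "auto"
    # Assumption: with reloading disabled, `auto` quietly resolves to false.
    return [false, nil] unless reload_automatic
    # Observed: only pipelines backed by the persisted queue are recovered.
    queue_type == "persisted" ? [true, :info] : [false, nil]
  when "true"
    # Observed: WARN that the pipeline will NOT be recovered without reloading.
    return [false, :warn] unless reload_automatic
    # Observed: INFO about re-processing (persisted), WARN about data loss (memory).
    queue_type == "persisted" ? [true, :info] : [true, :warn]
  else
    raise ArgumentError, "unsupported pipeline.recoverable value: #{setting.inspect}"
  end
end

p resolve_recoverable(setting: "auto", queue_type: "memory",    reload_automatic: true)
p resolve_recoverable(setting: "auto", queue_type: "persisted", reload_automatic: true)
p resolve_recoverable(setting: "true", queue_type: "memory",    reload_automatic: true)
p resolve_recoverable(setting: "true", queue_type: "persisted", reload_automatic: true)
p resolve_recoverable(setting: "true", queue_type: "memory",    reload_automatic: false)
```

The first element of each pair corresponds to the `"pipeline.recoverable" => …` value reported in the matching `Starting pipeline` log line.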
logstash-core/src/main/java/org/logstash/execution/AbstractPipelineExt.java
logstash-core/src/main/java/org/logstash/execution/AbstractPipelineExt.java
docker/data/logstash/env2yaml/src/main/java/org/logstash/env2yaml/Env2Yaml.java
donoghuc left a comment
In general, all my manual testing looked great. I think this is absolutely solid from a behavior standpoint. I have a few niche and nitpicky comments.
donoghuc left a comment
I think there may be some trivial documentation issues in the example configs, but I tested the updates and they work as expected!
Co-authored-by: Cas Donoghue <cas.donoghue@gmail.com>
donoghuc left a comment
Solid! I think this will be a very helpful feature.
💚 Build Succeeded
…c#18930)
* reload automatic: recover crashed pipelines during convergence
* recovery: add health report probes
* derp: fix invocation of failure_injector filter
* PR feedback:
  - do not resolve recovery action until pipeline has settled into its crash state
  - capture a successful recovery as a successful reload in metrics
* health tests: back out local branch changes
* update recovery test assertions to use new $match helper
* Apply suggestion from @yaauie
* add logging, examples for `pipeline.recoverable` setting
* recovery: clean up old pipeline
* recovery: keep only last 5min of recovery log
* Apply suggestions from code review

Co-authored-by: Cas Donoghue <cas.donoghue@gmail.com>

---------

Co-authored-by: Cas Donoghue <cas.donoghue@gmail.com>

#18967)
* reload automatic: recover crashed pipelines during convergence
* recovery: add health report probes
* derp: fix invocation of failure_injector filter
* PR feedback:
  - do not resolve recovery action until pipeline has settled into its crash state
  - capture a successful recovery as a successful reload in metrics
* health tests: back out local branch changes
* update recovery test assertions to use new $match helper
* Apply suggestion from @yaauie
* add logging, examples for `pipeline.recoverable` setting
* recovery: clean up old pipeline
* recovery: keep only last 5min of recovery log
* Apply suggestions from code review

---------

Co-authored-by: Cas Donoghue <cas.donoghue@gmail.com>
Release notes
Adds a `pipeline.recoverable` option that works when `config.reload.automatic` is enabled and accepts the following values:
- `auto`: recovers crashed pipelines that are backed by the persistent queue
- `false` (default): does not automate recovery of crashed pipelines
- `true`: recovers all crashed pipelines, even if backed by the ephemeral memory queue (risk: data loss)
What does this PR do?
Why is it important/What is the impact to the user?
While pipeline crashes are rare (typically caused by a plugin crashing while handling events), in some cases users running managed pipelines would prefer that the pipeline be automatically restarted.
Checklist
Author's Checklist
How to test this PR locally
With the following `crashy.conf` pipeline definition: