Fix compose-tests stability: quickstart startup race and missing Hadoop conf path#3565

Open
SethSmucker wants to merge 4 commits into integration from task/compose-tests-stability

Collaborator

SethSmucker commented May 20, 2026

Summary

  • Dockerfile: symlink /etc/hadoop/conf -> $HADOOP_CONF_DIR in the quickstart image so Accumulo TabletServer-side datawave iterators find the Hadoop config they expect.
  • docker-compose.yml: declare depends_on: quickstart (condition: service_healthy) for the 9 services (annotation, accumulo, audit, dictionary, modification, query, mapreduce-query, executor-pool1, executor-pool2) that connect to ZooKeeper or Hadoop, matching the existing pattern in metrics.

Fixes #3564.

See the issue for the full evidence (3 failing CI runs analyzed + local stack-trace captures from the TabletServer and audit containers).

Test plan

  • Cold local rebuild with both fixes applied: docker/scripts/testAll.sh reports Tests Passed: 13 / Failed Tests: 0 (was 5/13 prior to either fix).
  • Compose startup ordering verified end-to-end: docker-quickstart-1 Healthy now precedes docker-audit-1 Started. Audit's previous ZK-connect retries are gone from the audit container log.
  • /etc/hadoop/conf symlink confirmed inside a freshly-built quickstart container.
  • CI run on this PR's branch, as evidence that the workflow no longer hits the "audit exited (1)" flake.

SethSmucker and others added 4 commits May 20, 2026 16:42
Datawave's server-side query iterators load Hadoop configuration from
/etc/hadoop/conf when they run inside the Accumulo TabletServer. That
path does not exist in the quickstart container — Hadoop is installed
under /opt/datawave/contrib/datawave-quickstart/hadoop and HADOOP_CONF_DIR
points there, not at /etc/hadoop/conf. Without the path, every
event-table scan throws FileNotFoundException server-side and the
query service silently returns 0 events.

Symlink /etc/hadoop/conf to the installed Hadoop config directory so the
expected path resolves. Verified locally: docker/scripts/testAll.sh goes
from 5/13 to 13/13 passing.
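
The symlink step described in this commit can be sketched in shell as follows. This is a minimal sketch under stated assumptions, not the line the PR adds to the Dockerfile: the helper name link_hadoop_conf and its optional root prefix are hypothetical, introduced only so the step can be tried in a scratch directory instead of the real filesystem root.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): make <root>/etc/hadoop/conf resolve
# to the directory that actually holds the Hadoop configuration.
link_hadoop_conf() {
    conf_dir="$1"   # real config directory, e.g. "$HADOOP_CONF_DIR"
    root="${2:-}"   # optional prefix for dry runs; empty means the real /

    # The parent directory must exist before the link can be created.
    mkdir -p "$root/etc/hadoop" || return 1

    # -s: symbolic link; -f: replace an existing link;
    # -n: treat an existing link to a directory as the link itself.
    ln -sfn "$conf_dir" "$root/etc/hadoop/conf"
}

# Usage inside the image (assumes HADOOP_CONF_DIR is already set):
# link_hadoop_conf "$HADOOP_CONF_DIR"
```

In the image build this would presumably collapse into a single RUN step that calls ln -s with $HADOOP_CONF_DIR as the target; the exact Dockerfile line is not reproduced here.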
…adoop

Nine services in docker-compose.yml use ZOOKEEPER_HOST or HADOOP_HOST
but did not declare a depends_on for the quickstart container, which
hosts ZooKeeper, HDFS, and Accumulo. Only metrics correctly declared
the dependency.

When quickstart's startup happens to take longer than the 60s ZooKeeper
connect timeout used by Spring services like audit's AccumuloAuditor
bean, those services exit 1 during init with "Failed to connect to
zookeeper (quickstart:2181) within 2x zookeeper timeout period 30000",
and docker compose aborts with "dependency failed to start: container
docker-audit-1 exited (1)". The race is environment-dependent: on
slower runners it is frequently lost, producing intermittent CI failures.

Declare depends_on quickstart (condition: service_healthy) for
annotation, accumulo, audit, dictionary, modification, query,
mapreduce-query, executor-pool1, and executor-pool2 so the startup is
correctly serialized behind the container that provides their backing
services.
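
As a rough illustration of the stanza this commit adds, the sketch below prints a Compose fragment that gates each named service on quickstart's health. It is a sketch, not the PR's diff: the PR edits docker-compose.yml directly, the helper name emit_quickstart_gate and the output file name are hypothetical, and the fragment assumes the quickstart service already defines a healthcheck (which the existing metrics dependency implies).

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): print a Compose fragment in which
# every service passed as an argument waits for quickstart to be healthy.
emit_quickstart_gate() {
    printf 'services:\n'
    for svc in "$@"; do
        printf '  %s:\n' "$svc"
        printf '    depends_on:\n'
        printf '      quickstart:\n'
        # service_healthy waits for quickstart's healthcheck to pass,
        # not merely for the container to have started.
        printf '        condition: service_healthy\n'
    done
}

# Usage: the nine services named in the commit message.
# emit_quickstart_gate annotation accumulo audit dictionary modification \
#     query mapreduce-query executor-pool1 executor-pool2 > gate.override.yml
```

A fragment like this could be layered on top of the main file by passing both files to docker compose with repeated -f options, which is one way to try the ordering locally without editing docker-compose.yml.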
PR #3560 removed deprecated eventPerDayThreshold and shardsPerDayThreshold
placeholders from the starter source and from docker/config/application-query.yml,
but the released spring-boot-starter-datawave-query-1.0.10.jar still embeds
the old QueryLogicFactory.xml that references those placeholders. With no
fresh 1.0.11 release cut yet, every consumer of the 1.0.10 jar (query,
executor, mapreduce-query, cached-results) fails Spring init at startup
with:

  Caused by: java.lang.IllegalArgumentException: Could not resolve
  placeholder 'datawave.query.logic.logics.BaseEventQuery.eventPerDayThreshold'

This blocks the compose-tests workflow on every recent PR opened against
integration (#3559, #3562, #3563, #3565 all show the same FAILURE).

Bump <version.datawave.starter-query> from 1.0.10 to 1.0.11-SNAPSHOT in the
six consuming poms so the multi-module reactor builds the starter from
current source (which has #3560's removal) and the query/executor docker
images embed the fresh jar.

This is a stop-gap. The proper fix is running microservice-cascade-release.yml
starting from microservices/starters/query to publish a real 1.0.11; once
that lands, these SNAPSHOT pins should be replaced with the released version.
  <version.datawave>7.33.1</version.datawave>
  <version.datawave.mapreduce-layout-factory>1.0.0</version.datawave.mapreduce-layout-factory>
- <version.datawave.starter-query>1.0.10</version.datawave.starter-query>
+ <version.datawave.starter-query>1.0.11-SNAPSHOT</version.datawave.starter-query>
Collaborator

We should release starter-query and only merge this after the release, so we aren't merging in dependencies on a snapshot.

Collaborator

Are there changes in starter-query that are required for this PR?

Collaborator Author

@ivakegg - #3560 removed the eventPerDayThreshold placeholders, but the released jar wasn't updated, so it keeps looking for that property and can never resolve it. 1.0.11-SNAPSHOT is only there to pick up those changes, which were made but aren't reflected in the current jar. The changes themselves were fine; we just need to bump the jar.

@alerman Agreed, I have the snapshot in here just to make sure that all of the tests actually pass and that I diagnosed the issues accurately. We can remove the snapshot from this PR, but the other changes should probably still be made, since they address a separate set of failures we hit when running the CI tests.

Development

Successfully merging this pull request may close these issues.

compose-tests: startup race + missing Hadoop conf path in quickstart container

3 participants