Fix compose-tests stability: quickstart startup race and missing Hadoop conf path #3565
SethSmucker wants to merge 4 commits into
Datawave's server-side query iterators load Hadoop configuration from /etc/hadoop/conf when they run inside the Accumulo TabletServer. That path does not exist in the quickstart container: Hadoop is installed under /opt/datawave/contrib/datawave-quickstart/hadoop and HADOOP_CONF_DIR points there, not at /etc/hadoop/conf. Without the path, every event-table scan throws FileNotFoundException server-side and the query service silently returns 0 events.

Symlink /etc/hadoop/conf to the installed Hadoop config directory so the expected path resolves.

Verified locally: docker/scripts/testAll.sh goes from 5/13 to 13/13 passing.
…adoop

Nine services in docker-compose.yml use ZOOKEEPER_HOST or HADOOP_HOST but did not declare a depends_on for the quickstart container, which hosts ZooKeeper, HDFS, and Accumulo. Only metrics correctly declared the dependency.

When quickstart's startup happens to take longer than the 60s ZooKeeper connect timeout used by Spring services like audit's AccumuloAuditor bean, those services exit 1 during init with "Failed to connect to zookeeper (quickstart:2181) within 2x zookeeper timeout period 30000", and docker compose aborts with "dependency failed to start: container docker-audit-1 exited (1)". The race is environment-dependent: on slower runners it is frequently lost, producing intermittent CI failures.

Declare depends_on quickstart (condition: service_healthy) for annotation, accumulo, audit, dictionary, modification, query, mapreduce-query, executor-pool1, and executor-pool2 so the startup is correctly serialized behind the container that provides their backing services.
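The dependency described above can be sketched as a compose fragment. This is a minimal illustration for a single service (audit), not a copy of the actual docker/docker-compose.yml: the real service definitions carry image, environment, and port settings that are left out here, and any other depends_on entries a service already has would sit alongside this one.

```yaml
# Sketch only: the other keys of the audit service are omitted, and the
# surrounding layout is an assumption rather than the real compose file.
services:
  audit:
    depends_on:
      # Long-form depends_on: do not start audit until the quickstart
      # container (ZooKeeper, HDFS, Accumulo) reports healthy.
      quickstart:
        condition: service_healthy
```

The service_healthy condition only takes effect if the quickstart service defines a healthcheck. A plain list-style depends_on would wait only for the container to start, not for ZooKeeper to accept connections, so it would not close the race.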
PR #3560 removed deprecated eventPerDayThreshold and shardsPerDayThreshold placeholders from the starter source and from docker/config/application-query.yml, but the released spring-boot-starter-datawave-query-1.0.10.jar still embeds the old QueryLogicFactory.xml that references those placeholders. With no fresh 1.0.11 release cut yet, every consumer of the 1.0.10 jar (query, executor, mapreduce-query, cached-results) fails Spring init at startup with:

    Caused by: java.lang.IllegalArgumentException: Could not resolve placeholder 'datawave.query.logic.logics.BaseEventQuery.eventPerDayThreshold'

This blocks the compose-tests workflow on every recent PR opened against integration (#3559, #3562, #3563, #3565 all show the same FAILURE).

Bump <version.datawave.starter-query> from 1.0.10 to 1.0.11-SNAPSHOT in the six consuming poms so the multi-module reactor builds the starter from current source (which has #3560's removal) and the query/executor docker images embed the fresh jar.

This is a stop-gap. The proper fix is running microservice-cascade-release.yml starting from microservices/starters/query to publish a real 1.0.11; once that lands, these SNAPSHOT pins should be replaced with the released version.
      <version.datawave>7.33.1</version.datawave>
      <version.datawave.mapreduce-layout-factory>1.0.0</version.datawave.mapreduce-layout-factory>
    - <version.datawave.starter-query>1.0.10</version.datawave.starter-query>
    + <version.datawave.starter-query>1.0.11-SNAPSHOT</version.datawave.starter-query>
We should release starter-query and only merge this after the release, so we aren't merging in dependencies on a snapshot.
Are there changes in starter-query that are required for this PR?
@ivakegg - the changes from #3560 removed the eventPerDayThreshold placeholders, but the released jar wasn't updated, so it keeps looking for that property and can never resolve it. 1.0.11-SNAPSHOT is only there to pick up changes that were already made but are not reflected in the current jar. The changes themselves were fine; we just need to bump the jar.
@alerman Agreed. I have the snapshot in here only to confirm that all of the tests actually pass and that I diagnosed the issues accurately. We can remove the snapshot from this PR, but the other changes should probably still be made, since they address a separate set of failures we hit when running the CI tests.
Summary
- Symlink `/etc/hadoop/conf -> $HADOOP_CONF_DIR` in the quickstart image so Accumulo TabletServer-side datawave iterators find the Hadoop config they expect.
- Declare `depends_on: quickstart (condition: service_healthy)` for the 9 services (annotation, accumulo, audit, dictionary, modification, query, mapreduce-query, executor-pool1, executor-pool2) that connect to ZooKeeper or Hadoop, matching the existing pattern in `metrics`.

Fixes #3564.
See the issue for the full evidence (3 failing CI runs analyzed + local stack-trace captures from the TabletServer and audit containers).
Test plan
- `docker/scripts/testAll.sh` reports `Tests Passed: 13 / Failed Tests: 0` (was 5/13 prior to either fix).
- `docker-quickstart-1 Healthy` now precedes `docker-audit-1 Started`. Audit's previous ZK-connect retries are gone from the audit container log.
- `/etc/hadoop/conf` symlink confirmed inside a freshly-built quickstart container.
- `audit exited (1)` flake.