
Add databricks doctor command#4730

Open
simonfaltum wants to merge 15 commits into main from simonfaltum/doctor-command

@simonfaltum simonfaltum commented Mar 12, 2026

Why

Users debugging CLI setup issues (auth failures, config problems, network issues, proxy/TLS, missing toolchain) have no single command to diagnose their environment. They must manually run separate commands to check each layer.

Changes

Before: Users had to manually check auth, config, connectivity, and toolchain separately.
Now: A new experimental command databricks experimental doctor runs all diagnostic checks and reports results as a checklist:

  • CLI version (info)
  • Updates: compares the running build against the latest GitHub release (info/pass/warn, skipped for dev builds)
  • Toolchain: versions of git, python3, uv, terraform; missing binaries are flagged not found, tools that ran and errored are flagged exited N / error: <msg>
  • Proxy/TLS: which of HTTPS_PROXY, NO_PROXY, SSL_CERT_FILE, REQUESTS_CA_BUNDLE, etc. are set; userinfo in proxy URLs is replaced with *** (including token-only forms like TOKEN@proxy)
  • Log File: where CLI logs are being written (DATABRICKS_LOG_FILE)
  • Config file readability and profile count (pass/warn/fail)
  • Current profile source (flag / env / config-file)
  • Authentication validity and auth type (pass/fail)
  • Identity: workspace profiles call CurrentUser.Me; account-level profiles call Workspaces.List so invalid account creds fail instead of being skipped
  • Network connectivity to workspace host (pass/fail)

Text output uses colored status icons. JSON output (--output json) returns a structured array. Auth failures are reported as check results, not command errors.

Command placement

The command is registered under the experimental subtree, so it is accessed as:

databricks experimental doctor

It is hidden from top-level help, following the convention used by experimental aitools. Source lives under experimental/doctor/cmd/.

Example output

$ databricks experimental doctor --profile e2-dogfood
[info] CLI Version: 0.0.0-dev+78c6c02cb8d2
[info] Updates: development build (0.0.0-dev+78c6c02cb8d2)
[info] Toolchain: git 2.50.1 (Apple Git-155), python 3.14.4, uv 0.9.3, terraform not found
[info] Proxy/TLS: no proxy or TLS overrides configured
[info] Log File: not configured (set DATABRICKS_LOG_FILE or pass --log-file to enable)
[ok] Config File: ~/.databrickscfg (7 profiles)
[info] Current Profile: e2-dogfood
[ok] Authentication: OK (databricks-cli)
[ok] Identity: simon.faltum@databricks.com
[ok] Network: https://e2-dogfood.staging.cloud.databricks.com is reachable

Failure mode (no profile, missing credentials):

$ databricks experimental doctor
[info] CLI Version: 0.0.0-dev+78c6c02cb8d2
[info] Updates: development build (0.0.0-dev+78c6c02cb8d2)
[info] Toolchain: git 2.50.1, python 3.14.4, uv 0.9.3, terraform not found
[info] Proxy/TLS: no proxy or TLS overrides configured
[info] Log File: not configured (set DATABRICKS_LOG_FILE or pass --log-file to enable)
[ok] Config File: ~/.databrickscfg (7 profiles)
[info] Current Profile: none (using environment or defaults)
[FAIL] Authentication: Authentication failed (default auth: cannot configure default credentials...)
[skip] Identity: Skipped (authentication failed)
[FAIL] Network: No host configured
Error: one or more checks failed

JSON output (--output json) emits the same checks as a structured array.
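
The checklist lines in the examples above could come from a pure function along these lines. This is a minimal sketch, not the PR's code: the `result` struct and the `label`, `renderText`, and `anyFailed` helpers are hypothetical names, the status strings mirror the info/pass/warn/fail/skip values named in the description, and the `[warn]` label is a guess because the examples never show one. It returns a string rather than writing to an `io.Writer` to keep the sketch short.

```go
package main

import (
	"fmt"
	"strings"
)

// result is a hypothetical stand-in for one doctor check outcome.
type result struct {
	Name    string
	Status  string // "info", "pass", "warn", "fail", or "skip" (assumed values)
	Message string
}

// label maps a status to the bracketed prefix seen in the example output.
func label(status string) string {
	switch status {
	case "pass":
		return "[ok]"
	case "fail":
		return "[FAIL]"
	case "skip":
		return "[skip]"
	case "warn":
		return "[warn]" // assumption: not shown in the PR's examples
	default:
		return "[info]"
	}
}

// renderText formats one line per check and has no side effects.
func renderText(results []result) string {
	var b strings.Builder
	for _, r := range results {
		fmt.Fprintf(&b, "%s %s: %s\n", label(r.Status), r.Name, r.Message)
	}
	return b.String()
}

// anyFailed reports whether the command should end with an error.
func anyFailed(results []result) bool {
	for _, r := range results {
		if r.Status == "fail" {
			return true
		}
	}
	return false
}

func main() {
	results := []result{
		{"Config File", "pass", "~/.databrickscfg (7 profiles)"},
		{"Authentication", "fail", "Authentication failed"},
		{"Identity", "skip", "Skipped (authentication failed)"},
	}
	fmt.Print(renderText(results))
	if anyFailed(results) {
		fmt.Println("Error: one or more checks failed")
	}
}
```

Keeping the failure decision separate from rendering means an auth failure still shows up as an ordinary checklist line, and only the final summary turns it into a non-zero exit.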

Key design decisions

  • Pure functions + injected deps: Following go-code-structure.md, check functions take context.Context and primitives (profile string, fromFlag bool), not *cobra.Command. The Cobra RunE is a thin adapter. Rendering is a pure render(w, results, outputType) function. External dependencies (exec for toolchain, HTTP client for updates) are injected, so tests don't shell out or hit the network.
  • Account-level identity: account profiles issue a lightweight Workspaces.List call, matching the pattern auth describe uses, so invalid account PAT/OAuth tokens fail the identity check rather than being silently skipped.
  • Proxy credential masking: maskProxyValue replaces the full userinfo segment with *** whenever present, covering both user:pass@host and token-only TOKEN@host URLs.
  • Toolchain error clarity: exec.ErrNotFound renders as not found (install it), *exec.ExitError renders as exited N (it's broken), other errors render with the message.
  • SPOG/unified-host classification: uses the shared auth.ResolveConfigType helper rather than rolling its own so SPOG profiles are classified correctly (the SDK's own ConfigType() returns InvalidConfig for them in v0.127.0+).
  • Also fixes a latent bug in libs/env/loader.go: Set is replaced with SetS so that non-string env-backed config attributes (ints, bools) parse correctly.
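
To illustrate the proxy credential masking decision, here is a rough sketch of how a helper like maskProxyValue might behave. It is not the PR's implementation: it assumes scheme-qualified proxy URLs (for example `http://TOKEN@proxy`) and leaves anything it cannot parse as a URL with userinfo untouched.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// maskProxyValue replaces the whole userinfo segment of a proxy URL with "***".
// Sketch only: the name matches the PR, the body is an assumption.
func maskProxyValue(raw string) string {
	u, err := url.Parse(raw)
	if err != nil || u.User == nil {
		// No credentials present (or not a URL, e.g. a NO_PROXY host list).
		return raw
	}
	// Drop the credentials, then re-insert a literal "***@" by hand.
	// Going through url.User("***") would percent-encode the asterisks.
	u.User = nil
	return strings.Replace(u.String(), "://", "://***@", 1)
}

func main() {
	for _, v := range []string{
		"http://user:pass@proxy:8080",
		"http://TOKEN@proxy",
		"http://proxy:3128",
	} {
		fmt.Println(maskProxyValue(v))
	}
}
```

Masking on the presence of userinfo, rather than on the presence of a password, is what covers the token-only case.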

Changelog

No NEXT_CHANGELOG.md entry: experimental commands are omitted from release notes until they graduate out of experimental. The PR template makes NEXT_CHANGELOG additions opt-in.

Open item

  • Top-level command deny list: not relevant now that doctor is under experimental and so no longer collides with the auto-generated command namespace.

Test plan

  • Unit tests for each check function (toolchain, proxy, updates, log file, auth, identity, network, config, profile)
  • Unit tests for both text and JSON rendering
  • Table-driven tests for unified-host / SPOG account classification
  • Tests for credential masking (user:pass and token-only URLs)
  • Tests for toolchain error differentiation (not-found vs exited vs other)
  • Tests for account-level identity (pass via Workspaces.List, fail on 401)
  • Acceptance test at acceptance/cmd/experimental/doctor/ (toolchain line masked for machine-independent output)
  • Local dry run against a real workspace profile (see "Example output")
  • make lintfull passes
  • make checks passes
  • make fmtfull clean

eng-dev-ecosystem-bot commented Mar 12, 2026

Commit: b2e9d7a

Run: 23287322738

| Env | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|
| 💚 aws linux | | 8 | 9 | 268 | 798 | 5:48 |
| 💚 aws windows | | 8 | 9 | 270 | 796 | 4:43 |
| 🔄 aws-ucws linux | 2 | 7 | 9 | 364 | 713 | 7:32 |
| 🔄 aws-ucws windows | 2 | 7 | 9 | 366 | 711 | 6:03 |
| 💚 azure linux | | 2 | 11 | 271 | 796 | 6:05 |
| 💚 azure windows | | 2 | 11 | 273 | 794 | 3:54 |
| 🔄 azure-ucws linux | 2 | 1 | 11 | 369 | 709 | 7:48 |
| 🔄 azure-ucws windows | 2 | 1 | 11 | 371 | 707 | 6:12 |
| 💚 gcp linux | | 2 | 11 | 267 | 799 | 6:11 |
| 💚 gcp windows | | 2 | 11 | 269 | 797 | 6:42 |

18 interesting tests: 9 SKIP, 6 RECOVERED, 3 flaky

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 🔄 TestAccept | 💚R | 💚R | 🔄f | 🔄f | 💚R | 💚R | 💚R | 🔄f | 💚R | 💚R |
| 🙈 TestAccept/bundle/resources/permissions | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 💚R | 💚R | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 💚R | 💚R | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/recreate | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🔄 TestAccept/ssh/connect-serverless-gpu | 🙈s | 🙈s | 🔄f | 🔄f | 🙈s | 🙈s | 🔄f | 🔄f | 🙈s | 🙈s |
| 🔄 TestAccept/ssh/connection | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 🔄f | 💚R | 💚R | 💚R |

Top 20 slowest tests (at least 2 minutes):
duration env testname
4:10 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:45 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:16 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:09 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:05 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:00 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:55 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:54 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:53 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:47 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:45 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:41 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:38 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:13 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:13 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:12 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:11 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:10 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:10 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:09 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct

@shreyas-goenka shreyas-goenka left a comment

Note: This review was posted by Claude (AI assistant). Shreyas will do a separate, more thorough review pass.

Priority: HIGH — Config resolution diverges from real CLI auth path

MAJOR: resolveConfig diverges from real CLI auth

The resolveConfig function in databricks doctor constructs its own config resolution path instead of going through the standard SDK/CLI authentication flow. This means the doctor command could report "config is fine" while the real CLI fails (or vice versa). If the goal is to diagnose auth issues, it should use the same code path the CLI uses.

MEDIUM: Network check bypasses SDK HTTP client

The connectivity check uses http.DefaultClient directly instead of going through the SDK's configured HTTP client. In enterprise environments with proxies or custom TLS, this will give misleading results — the check might fail even though the SDK would succeed (or vice versa).
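
One way to address this, sketched under assumptions: have the network check accept whatever HTTP client the caller supplies, so the command can hand it the SDK-configured client instead of reaching for http.DefaultClient. The `doer` interface, `checkResult` struct, `stubDoer`, and `checkNetworkWithTimeout` below are hypothetical names, and the choices of a HEAD request, a 5-second timeout, and "any HTTP response means reachable" are illustrative guesses rather than the PR's behavior.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// checkResult is a hypothetical stand-in for the doctor result type.
type checkResult struct{ Name, Status, Message string }

// doer is the single method the check needs; *http.Client satisfies it.
type doer interface {
	Do(req *http.Request) (*http.Response, error)
}

// checkNetwork probes host through the injected client.
func checkNetwork(ctx context.Context, d doer, host string) checkResult {
	if host == "" {
		return checkResult{"Network", "fail", "No host configured"}
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodHead, host, nil)
	if err != nil {
		return checkResult{"Network", "fail", err.Error()}
	}
	resp, err := d.Do(req)
	if err != nil {
		return checkResult{"Network", "fail", err.Error()}
	}
	resp.Body.Close()
	// Assumption: any HTTP response, even 401 or 403, proves reachability.
	return checkResult{"Network", "pass", host + " is reachable"}
}

// checkNetworkWithTimeout shows the timeout applied at the call site.
func checkNetworkWithTimeout(d doer, host string) checkResult {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return checkNetwork(ctx, d, host)
}

// stubDoer is a test double returning a canned status or error.
type stubDoer struct {
	status int
	err    error
}

func (s stubDoer) Do(*http.Request) (*http.Response, error) {
	if s.err != nil {
		return nil, s.err
	}
	return &http.Response{StatusCode: s.status, Body: http.NoBody}, nil
}

var errRefused = errors.New("connection refused")

func main() {
	// A real caller would pass the client the SDK built from the resolved config.
	fmt.Println(checkNetworkWithTimeout(stubDoer{status: 200}, "https://example.test").Message)
	fmt.Println(checkNetworkWithTimeout(stubDoer{err: errRefused}, "https://example.test").Message)
}
```

With the client injected, proxy and custom TLS settings applied to the SDK's transport also apply to the check, and tests can substitute a stub without touching the network.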

Other Observations

  • Good idea for a diagnostic command overall
  • The step-by-step output format is user-friendly
  • Missing test coverage for the core diagnostic logic

github-actions Bot commented Apr 9, 2026

Approval status: pending

/libs/env/ - needs approval

Files: libs/env/loader.go
Eligible: @renaudhartert-db, @hectorcast-db, @parthban-db, @tanmay-db, @Divyansh-db, @tejaskochar-db, @mihaimitrea-db, @chrisst, @rauchy

General files (require maintainer)

10 files changed
Based on git history:

  • @pietern -- recent work in libs/env/

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

Adds a new top-level 'databricks doctor' command that runs diagnostic
checks against the user's CLI setup and reports a checklist of results
(text) or a JSON array.

Checks:
- CLI version
- Updates (queries the GitHub releases API, skipped for dev builds)
- Toolchain (git, python3, uv, terraform versions)
- Proxy/TLS environment variables (HTTPS_PROXY, NO_PROXY, etc., with
  credentials masked)
- Log file path
- Config file readability and profile count
- Current profile source
- Authentication validity and auth type
- Identity via CurrentUser.Me (skipped for account-level profiles)
- Network reachability

Design follows go-code-structure.md: check functions take context and
primitives, Cobra RunE is a thin adapter, rendering is a pure function,
and external dependencies (exec, HTTP client) are injected for tests.

SPOG / unified-host account profiles are correctly classified via the
existing auth.ResolveConfigType helper, so the SDK's ConfigType()
returning InvalidConfig for those hosts no longer causes a
misclassification.

Also fixes a latent bug in libs/env/loader.go: Set was replaced with
SetS so that non-string env-backed config attributes (e.g. ints, bools)
are parsed correctly.

Co-authored-by: Isaac
@simonfaltum simonfaltum force-pushed the simonfaltum/doctor-command branch from 061f67a to d598ec4 Compare April 17, 2026 14:16
- checkIdentity now does a real API call for account-level profiles
  (a.Workspaces.List), matching what auth describe does. Previously
  account-level profiles skipped identity entirely, so invalid account
  PAT/OAuth could report Authentication: OK with no failing check.
- maskProxyValue now masks the full userinfo segment whenever present,
  not only when a password is set. Protects against token-only proxy
  URLs like http://TOKEN@proxy from leaking the token in diagnostics.
- checkToolchain distinguishes 'not found' (exec.ErrNotFound) from
  non-zero exit (*exec.ExitError) from other errors (permission denied,
  etc.), so users can tell 'install this' apart from 'this is broken'.

Tests updated to cover all three cases.
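
For reference, the three-way distinction described above can be sketched with the standard library alone. `describeToolError` and `probe` are hypothetical names, not the PR's functions, and the output strings simply follow the wording in the PR description (not found, exited N, error: <msg>).

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// describeToolError turns the error from running a tool into a short label.
// It returns "" when the tool ran successfully.
func describeToolError(err error) string {
	var exitErr *exec.ExitError
	switch {
	case err == nil:
		return ""
	case errors.Is(err, exec.ErrNotFound):
		// The binary is not on PATH: the user needs to install it.
		return "not found"
	case errors.As(err, &exitErr):
		// The binary started but returned a non-zero exit code: it is broken.
		return fmt.Sprintf("exited %d", exitErr.ExitCode())
	default:
		// Anything else, such as permission denied.
		return "error: " + err.Error()
	}
}

// probe runs a tool and classifies the outcome.
func probe(name string, args ...string) string {
	return describeToolError(exec.Command(name, args...).Run())
}

func main() {
	for _, tool := range []string{"git", "python3", "uv", "terraform"} {
		if msg := probe(tool, "--version"); msg != "" {
			fmt.Printf("%s %s\n", tool, msg)
		}
	}
}
```

The order of the cases matters only in that the specific checks come before the catch-all; errors.Is and errors.As both see through the *exec.Error wrapper that exec returns for a missing binary.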

Per feedback, the command is moved behind the 'experimental' subtree:
  databricks doctor  ->  databricks experimental doctor

- Source: cmd/doctor/ -> experimental/doctor/cmd/ (matches the aitools
  convention under experimental/)
- Entry point renamed New() -> NewDoctorCmd()
- Hidden: true; no GroupID since it's no longer a top-level command
- Registration moves from cmd/cmd.go to cmd/experimental/experimental.go
- Acceptance test: acceptance/cmd/doctor -> acceptance/cmd/experimental/doctor
  and its script invokes 'experimental doctor'
- Help output no longer lists doctor under Developer Tools
- NEXT_CHANGELOG wording updated to flag the command as experimental

Experimental commands don't go into the release notes until they
graduate. No CI check requires an entry, so this file matches main
now.
- Wrap JSON output in DoctorReport{Results: ...} so we can add fields
  later without breaking callers of the array shape.
- Type the status enum (type status string) so misuse is a compile
  error, while keeping the string JSON wire contract.
- Make CheckResult.Detail any so future structured details don't
  require a breaking change.
- Add omitempty to Name/Status/Message so empty results render as {}.
- Inline runChecks into a composite literal to avoid append resizes.
- Replace if/return chain in checkCurrentProfile with a switch.

Co-authored-by: Isaac
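
A rough picture of the JSON shape this commit describes, as a sketch under assumptions: the type names DoctorReport, CheckResult, and status come from the commit message, but the JSON tag names, the status constants, the omitempty on Detail, and the `encode` helper are guesses made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// status is a distinct string type, so a plain string variable cannot be
// assigned to a Status field without an explicit conversion.
type status string

// Assumed status values, based on the labels used in the PR description.
const (
	statusInfo status = "info"
	statusPass status = "pass"
	statusWarn status = "warn"
	statusFail status = "fail"
	statusSkip status = "skip"
)

// CheckResult is one entry in the report. Tag names are assumptions.
type CheckResult struct {
	Name    string `json:"name,omitempty"`
	Status  status `json:"status,omitempty"`
	Message string `json:"message,omitempty"`
	Detail  any    `json:"detail,omitempty"` // free-form, for future structured data
}

// DoctorReport wraps the results so new top-level fields can be added later.
type DoctorReport struct {
	Results []CheckResult `json:"results"`
}

// encode marshals a report, returning "" on error to keep the sketch short.
func encode(r DoctorReport) string {
	b, err := json.Marshal(r)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	report := DoctorReport{Results: []CheckResult{
		{Name: "Network", Status: statusPass, Message: "host is reachable"},
		{}, // an empty result renders as {}
	}}
	fmt.Println(encode(report))
}
```

Wrapping the array in an object costs callers one extra field access today, but it lets the report grow (for example with a summary or schema version) without changing the type of the top-level value.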
…build

# Conflicts:
#	cmd/experimental/experimental.go
@simonfaltum simonfaltum removed the request for review from pietern April 21, 2026 11:58
- Reword isAccountLevelConfig godoc to "can target" so callers don't read
  it as "account-exclusive".
- Drop the withCheckTimeout wrapper and inline context.WithTimeout at
  call sites; the wrapper was not adding readability.
- Move config-resolution error handling out of checkAuth and into
  runChecks, so checkAuth is only responsible for authenticating a
  resolved config.

Co-authored-by: Isaac
WithProfiler was removed as dead code on main (PR #4974). Match the
rest of the doctor command by injecting the profiler directly into
checkConfigFile, the same way exec and http.Client are injected into
the other checks.