[FEATURE][UMBRELLA] Data Agent Engine — AI-Powered Autonomous Data Analysis #7379

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no similar issues.

### Describe the feature

Umbrella issue tracking the implementation of the `DATA_AGENT` engine type.

KPIP: https://github.com/apache/kyuubi/discussions/7373

The Data Agent lets users analyze data in natural language: an AI agent autonomously explores schemas, generates SQL, executes queries through Kyuubi's existing multi-engine infrastructure, and self-corrects through multi-turn ReAct reasoning.

#### Architecture

```
Client (JDBC/REST/Web UI)
        │
        ▼
  Kyuubi Server (Gateway)  ◄───  JDBC (user creds)  ───┐
        │                                               │
        ▼                                               │
  Data Agent Engine                                     │
  ┌──────────────────────────┐                          │
  │  ReAct Loop              │                          │
  │  LLM ←→ Tools  ──────────┼──────────────────────────┘
  │       └─ sql_query (via Kyuubi JDBC)
  │  Middleware Pipeline     │
  │  ├─ ApprovalMiddleware   │
  │  └─ LoggingMiddleware    │
  └──────────────────────────┘
```
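
To make the ReAct loop in the diagram concrete, here is a minimal sketch of the control flow: ask the model for the next step, run the requested tool, feed the observation (including errors) back, and stop on a final answer or an iteration cap. All names here (`ReActLoopSketch`, `Step`, `Model`, the string transcript) are hypothetical and only illustrate the idea; they are not the engine's actual classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a ReAct-style loop; not the engine's real API. */
final class ReActLoopSketch {

  /** One model step: a tool call (toolName != null) or a final answer (toolName == null). */
  static final class Step {
    final String toolName;
    final String payload; // tool argument, or the final answer text

    Step(String toolName, String payload) {
      this.toolName = toolName;
      this.payload = payload;
    }
  }

  /** Stand-in for an LLM: maps the transcript so far to the next step. */
  interface Model {
    Step next(List<String> transcript);
  }

  static String run(Model model, Map<String, Function<String, String>> tools,
                    String question, int maxIterations) {
    List<String> transcript = new ArrayList<>();
    transcript.add("user: " + question);
    for (int i = 0; i < maxIterations; i++) {
      Step step = model.next(transcript);
      if (step.toolName == null) {
        return step.payload; // a final answer ends the loop
      }
      Function<String, String> tool = tools.get(step.toolName);
      String observation;
      if (tool == null) {
        observation = "error: unknown tool " + step.toolName;
      } else {
        try {
          observation = tool.apply(step.payload);
        } catch (RuntimeException e) {
          // Errors are fed back as observations so the model can self-correct.
          observation = "error: " + e.getMessage();
        }
      }
      transcript.add("tool_call: " + step.toolName + " " + step.payload);
      transcript.add("observation: " + observation);
    }
    return "stopped after " + maxIterations + " iterations";
  }
}
```

The iteration cap is an assumption of this sketch; it keeps a model that never produces a final answer from looping forever.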

#### Sub-tasks

- [x] **PR 1: Module skeleton, configuration, and engine core** — New module `externals/kyuubi-data-agent-engine` with engine fully runnable via Echo provider. Includes Thrift frontend, session/operation management, IncrementalFetchIterator for streaming, event system, and all `kyuubi.engine.data.agent.*` configuration entries.

- [ ] **PR 2a: Tool system, data source, and prompt templates** — SqlQueryTool with maxRows enforcement and output truncation, ToolRegistry with JSON schema generation for LLM function calling, data source abstraction with dialect auto-detection (Spark/SQLite/MySQL/Trino), and composable system prompt builder with per-dialect templates.

- [ ] **PR 2b: Agent runtime, middleware, and OpenAI provider** — ReAct loop agent with streaming LLM interaction, ConversationMemory for multi-turn context, middleware pipeline (ApprovalMiddleware with STRICT/NORMAL/AUTO_APPROVE modes, LoggingMiddleware), and OpenAI-compatible provider. Integration tests with MockLlmProvider validate the complete tool-call pipeline without a real LLM.

- [ ] **PR 3: REST API and Web UI** — SSE streaming chat endpoint (`POST /api/v1/data-agent/{sessionHandle}/chat`), tool approval endpoint (`POST /api/v1/data-agent/{sessionHandle}/approve`), and complete Vue web interface with session management, real-time message streaming, tool call visualization, and approval workflow UI.
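
As an illustration of the "maxRows enforcement and output truncation" item in PR 2a, the sketch below caps the number of rendered rows and then caps the total output size, so that a large result set cannot flood the LLM context. The class name, the tab-separated rendering, and the marker strings are assumptions made for this example, not the actual `SqlQueryTool` behavior. A real tool would presumably also set a driver-side limit via JDBC's `Statement.setMaxRows`.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of row capping plus output truncation for a SQL tool. */
final class ResultTruncationSketch {

  /**
   * Renders at most maxRows rows as tab-separated lines, then cuts the
   * rendered text to maxChars characters if it is still too long.
   */
  static String render(Iterator<List<String>> rows, int maxRows, int maxChars) {
    StringBuilder out = new StringBuilder();
    int emitted = 0;
    while (rows.hasNext() && emitted < maxRows) {
      out.append(String.join("\t", rows.next())).append('\n');
      emitted++;
    }
    if (rows.hasNext()) {
      // Tell the model that more rows exist so it can refine the query.
      out.append("[rows truncated at ").append(maxRows).append("]\n");
    }
    if (out.length() > maxChars) {
      return out.substring(0, maxChars) + "\n[output truncated]";
    }
    return out.toString();
  }
}
```

Emitting an explicit marker rather than cutting silently is a design choice of this sketch: it gives the agent a signal to add a `LIMIT` or an aggregation on the next turn.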

### Motivation

See [KPIP-7373](https://github.com/apache/kyuubi/discussions/7373) for full motivation. In short: Kyuubi's existing Chat Engine is stateless with no data access. The Data Agent Engine bridges LLMs with Kyuubi's multi-engine SQL execution, enabling business users and analysts to query data warehouses through natural language without writing SQL.

### Describe the solution

See [KPIP-7373](https://github.com/apache/kyuubi/discussions/7373) for detailed design. Key decisions:

1. **SQL routes through Kyuubi Server** — The agent's `sql_query` tool connects back to Kyuubi Server via JDBC with the original user's credentials, inheriting AuthZ/Ranger policies, audit, and resource isolation.
2. **Pluggable LLM providers** — OpenAI-compatible API as default via official OpenAI Java SDK; extensible through `DataAgentProvider` interface.
3. **Java for business logic, Scala for framework wrappers** — Agent runtime, tools, providers, events, and middleware are all in Java; Scala is used only for thin integration with Kyuubi's Session/Operation/Engine infrastructure.
4. **Streaming-first** — `IncrementalFetchIterator` enables real-time event streaming to both JDBC and REST/SSE clients.
5. **Human-in-the-loop approval** — Configurable approval workflow (AUTO_APPROVE / NORMAL / STRICT) for controlling tool execution risk.
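
The approval modes in decision 5 could be read as a simple policy function. The sketch below keeps the three mode names from this issue, but their semantics are an assumption for illustration only: `STRICT` asks for approval on every tool call, `NORMAL` asks only for statements that do not look read-only, and `AUTO_APPROVE` never asks. The keyword-prefix check is deliberately naive and hypothetical; see the KPIP for the intended behavior.

```java
import java.util.Locale;

/** Hypothetical approval policy; the mode semantics are assumed, not taken from the KPIP. */
final class ApprovalPolicySketch {

  enum Mode { STRICT, NORMAL, AUTO_APPROVE }

  /** Naive assumption: a statement is read-only if it starts with one of these keywords. */
  static boolean looksReadOnly(String sql) {
    String head = sql.trim().toUpperCase(Locale.ROOT);
    return head.startsWith("SELECT")
        || head.startsWith("SHOW")
        || head.startsWith("DESCRIBE")
        || head.startsWith("EXPLAIN");
  }

  /** Returns true when a human must approve the statement before it runs. */
  static boolean needsApproval(Mode mode, String sql) {
    switch (mode) {
      case AUTO_APPROVE:
        return false;
      case STRICT:
        return true;
      case NORMAL:
        return !looksReadOnly(sql);
      default:
        throw new IllegalArgumentException("unknown mode: " + mode);
    }
  }
}
```

A production check would need a real SQL parser rather than prefix matching, since a prefix test cannot see writes hidden inside CTEs or multi-statement input.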

### Additional context

Test strategy:

| Layer | Approach | LLM Required? |
|---|---|---|
| Unit tests (Java) | JUnit 4 — tools, events, memory, middleware, prompts, data source | No |
| Integration tests (Scala) | MockLlmProvider drives full engine pipeline against SQLite | No |
| JDBC tests (Scala) | HiveJDBCTestHelper + Echo/Mock engine | No |
| Live tests (Java/Scala) | Real LLM API + SQLite test database | Yes (CI-optional) |
| Web UI tests (TypeScript) | Vitest — API client mocking | No |
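
The "LLM Required? No" rows rely on replacing the model with a deterministic stand-in. A minimal sketch of that idea is a provider that replays a fixed script of replies. The class and method names below are hypothetical and do not reflect the real `MockLlmProvider` or `DataAgentProvider` interfaces.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Hypothetical scripted provider: replays canned replies so tests need no real LLM. */
final class ScriptedProviderSketch {
  private final Deque<String> replies = new ArrayDeque<>();
  private int calls = 0;

  ScriptedProviderSketch(String... cannedReplies) {
    replies.addAll(Arrays.asList(cannedReplies));
  }

  /** Returns the next canned reply, ignoring the prompt; fails loudly if the script runs out. */
  String complete(String prompt) {
    calls++;
    String reply = replies.poll();
    if (reply == null) {
      throw new IllegalStateException("script exhausted after " + (calls - 1) + " replies");
    }
    return reply;
  }

  /** Number of completion requests made so far, including a failed one. */
  int callCount() {
    return calls;
  }
}
```

Failing loudly when the script is exhausted makes an unexpected extra model call show up as a test failure instead of a hang or a silent default.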

### Are you willing to submit PR?

- [X] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
