[FEATURE][UMBRELLA] Data Agent Engine — AI-Powered Autonomous Data Analysis #7379

@wangzhigang1999

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the feature

Umbrella issue tracking the implementation of the DATA_AGENT engine type.

KPIP: #7373

The Data Agent enables users to perform data analysis through natural language — an AI agent autonomously explores schemas, generates SQL, executes queries via Kyuubi's existing multi-engine infrastructure, and self-corrects through multi-turn ReAct reasoning.

Architecture

Client (JDBC/REST/Web UI)
        │
        ▼
  Kyuubi Server (Gateway)  ◄───  JDBC (user creds)  ───┐
        │                                               │
        ▼                                               │
  Data Agent Engine                                     │
  ┌──────────────────────────┐                          │
  │  ReAct Loop              │                          │
  │  LLM ←→ Tools  ──────────┼──────────────────────────┘
  │       └─ sql_query (via Kyuubi JDBC)
  │  Middleware Pipeline     │
  │  ├─ ApprovalMiddleware   │
  │  └─ LoggingMiddleware    │
  └──────────────────────────┘
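
The ReAct loop in the diagram can be illustrated with a minimal sketch. This is not the engine's actual code: `ReactLoopSketch`, `Llm`, `LlmReply`, and the scripted model in `main` are hypothetical names invented for illustration, and the `sql_query` tool here is a stub rather than a JDBC call. It only shows the general shape: ask the model, run the requested tool, feed the observation (including errors) back, and stop on a final answer or an iteration cap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Minimal ReAct-style loop. Every type here is hypothetical, not Kyuubi's API. */
final class ReactLoopSketch {

    /** One model turn: either a tool call (toolName != null) or a final answer. */
    static final class LlmReply {
        final String toolName;
        final String toolArgs;
        final String finalAnswer;

        private LlmReply(String toolName, String toolArgs, String finalAnswer) {
            this.toolName = toolName;
            this.toolArgs = toolArgs;
            this.finalAnswer = finalAnswer;
        }

        static LlmReply callTool(String name, String args) { return new LlmReply(name, args, null); }
        static LlmReply answer(String text) { return new LlmReply(null, null, text); }
        boolean isToolCall() { return toolName != null; }
    }

    /** Stand-in for an LLM provider: maps the conversation so far to the next reply. */
    interface Llm {
        LlmReply next(List<String> conversation);
    }

    static String run(Llm llm, Map<String, Function<String, String>> tools,
                      String question, int maxIterations) {
        List<String> conversation = new ArrayList<>();
        conversation.add("user: " + question);
        for (int i = 0; i < maxIterations; i++) {
            LlmReply reply = llm.next(conversation);
            if (!reply.isToolCall()) {
                return reply.finalAnswer;
            }
            Function<String, String> tool = tools.get(reply.toolName);
            String observation;
            if (tool == null) {
                observation = "error: unknown tool " + reply.toolName;
            } else {
                try {
                    observation = tool.apply(reply.toolArgs);
                } catch (RuntimeException e) {
                    // Feed the failure back so the model can correct itself next turn.
                    observation = "error: " + e.getMessage();
                }
            }
            conversation.add("assistant: call " + reply.toolName + "(" + reply.toolArgs + ")");
            conversation.add("tool: " + observation);
        }
        return "stopped: iteration limit reached";
    }

    public static void main(String[] args) {
        // Scripted model: a failing query first, then a corrected one, then the answer.
        Llm scripted = history -> {
            String last = history.get(history.size() - 1);
            if (last.startsWith("user:")) {
                return LlmReply.callTool("sql_query", "SELECT cnt FROM orders");
            }
            if (last.startsWith("tool: error")) {
                return LlmReply.callTool("sql_query", "SELECT COUNT(*) FROM orders");
            }
            return LlmReply.answer("There are " + last.substring("tool: ".length()) + " orders.");
        };
        // Stub tool: rejects the first query, returns a fixed count for the second.
        Map<String, Function<String, String>> tools = Map.of("sql_query", sql -> {
            if (!sql.contains("COUNT(*)")) {
                throw new IllegalArgumentException("no such column: cnt");
            }
            return "42";
        });
        System.out.println(run(scripted, tools, "How many orders are there?", 5));
    }
}
```

The iteration cap is a design choice of this sketch, added so that a model that keeps requesting tools cannot loop forever.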

Sub-tasks

  • PR 1: Module skeleton, configuration, and engine core — New module externals/kyuubi-data-agent-engine with engine fully runnable via Echo provider. Includes Thrift frontend, session/operation management, IncrementalFetchIterator for streaming, event system, and all kyuubi.engine.data.agent.* configuration entries.

  • PR 2a: Tool system, data source, and prompt templates — SqlQueryTool with maxRows enforcement and output truncation, ToolRegistry with JSON schema generation for LLM function calling, data source abstraction with dialect auto-detection (Spark/SQLite/MySQL/Trino), and composable system prompt builder with per-dialect templates.

  • PR 2b: Agent runtime, middleware, and OpenAI provider — ReAct loop agent with streaming LLM interaction, ConversationMemory for multi-turn context, middleware pipeline (ApprovalMiddleware with STRICT/NORMAL/AUTO_APPROVE modes, LoggingMiddleware), and OpenAI-compatible provider. Integration tests with MockLlmProvider validate the complete tool-call pipeline without a real LLM.

  • PR 3: REST API and Web UI — SSE streaming chat endpoint (POST /api/v1/data-agent/{sessionHandle}/chat), tool approval endpoint (POST /api/v1/data-agent/{sessionHandle}/approve), and complete Vue web interface with session management, real-time message streaming, tool call visualization, and approval workflow UI.
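
As a rough illustration of the maxRows enforcement and output truncation mentioned for SqlQueryTool in PR 2a, the sketch below caps both the number of rows and the size of the rendered text before it would be handed back to the model. The class name `ResultLimiterSketch`, the `render` signature, the tab-separated format, and the marker strings are all assumptions made for this example; the real tool reads from a JDBC result set and may format its output differently.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper showing row and size limits on tool output. */
final class ResultLimiterSketch {

    /**
     * Renders a header plus at most maxRows rows as tab-separated lines,
     * then cuts the text to maxChars characters if it is still too long.
     */
    static String render(List<String> columns, Iterator<List<String>> rows,
                         int maxRows, int maxChars) {
        StringBuilder out = new StringBuilder(String.join("\t", columns)).append('\n');
        int emitted = 0;
        while (emitted < maxRows && rows.hasNext()) {
            out.append(String.join("\t", rows.next())).append('\n');
            emitted++;
        }
        if (rows.hasNext()) {
            // Tell the model that the result was cut, so it can narrow the query.
            out.append("[more rows omitted; limit is ").append(maxRows).append("]\n");
        }
        if (out.length() > maxChars) {
            out.setLength(maxChars);
            out.append("\n[output truncated]");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
            List.of("1", "a"), List.of("2", "b"), List.of("3", "c"));
        System.out.print(render(List.of("id", "name"), rows.iterator(), 2, 1000));
    }
}
```

Making the cut visible in the output, rather than dropping rows silently, gives the agent a chance to add a filter or aggregate on its next turn.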

Motivation

See KPIP-7373 for full motivation. In short: Kyuubi's existing Chat Engine is stateless with no data access. The Data Agent Engine bridges LLMs with Kyuubi's multi-engine SQL execution, enabling business users and analysts to query data warehouses through natural language without writing SQL.

Describe the solution

See KPIP-7373 for detailed design. Key decisions:

  1. SQL routes through Kyuubi Server — The agent's sql_query tool connects back to Kyuubi Server via JDBC with the original user's credentials, inheriting AuthZ/Ranger policies, audit, and resource isolation.
  2. Pluggable LLM providers — OpenAI-compatible API as default via official OpenAI Java SDK; extensible through DataAgentProvider interface.
  3. Java for business logic, Scala for framework wrappers — Agent runtime, tools, providers, events, and middleware are all in Java; Scala is used only for thin integration with Kyuubi's Session/Operation/Engine infrastructure.
  4. Streaming-first: IncrementalFetchIterator enables real-time event streaming to both JDBC and REST/SSE clients.
  5. Human-in-the-loop approval — Configurable approval workflow (AUTO_APPROVE / NORMAL / STRICT) for controlling tool execution risk.
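
To make the approval decision concrete, here is one possible reading of the three modes. This issue does not define their exact semantics, so the sketch assumes that STRICT asks for approval on every tool call, AUTO_APPROVE never asks, and NORMAL asks only for statements that do not look read-only. The class `ApprovalSketch` and the keyword-based `isReadOnly` check are hypothetical and deliberately naive; see KPIP-7373 for the actual rules.

```java
import java.util.Locale;

/** Hypothetical approval policy. The mode semantics are assumptions, not the engine's rules. */
final class ApprovalSketch {

    enum Mode { STRICT, NORMAL, AUTO_APPROVE }

    /** Naive assumption: a statement is read-only if it starts with one of these keywords. */
    static boolean isReadOnly(String sql) {
        String head = sql.trim().toUpperCase(Locale.ROOT);
        return head.startsWith("SELECT") || head.startsWith("SHOW")
            || head.startsWith("DESCRIBE") || head.startsWith("EXPLAIN");
    }

    /** Returns true when a human should confirm the statement before it runs. */
    static boolean needsApproval(Mode mode, String sql) {
        switch (mode) {
            case STRICT:
                return true;              // assumed: every call is confirmed
            case AUTO_APPROVE:
                return false;             // assumed: nothing is confirmed
            default:
                return !isReadOnly(sql);  // assumed NORMAL: confirm non-read-only statements
        }
    }

    public static void main(String[] args) {
        System.out.println(needsApproval(Mode.NORMAL, "SELECT * FROM t"));
        System.out.println(needsApproval(Mode.NORMAL, "DROP TABLE t"));
        System.out.println(needsApproval(Mode.STRICT, "SELECT * FROM t"));
    }
}
```

A prefix check like this would misclassify some statements, so a real implementation would more likely rely on a SQL parser or on the engine's own statement classification.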

Additional context

Test strategy:

| Layer | Approach | LLM required? |
|---|---|---|
| Unit tests (Java) | JUnit 4: tools, events, memory, middleware, prompts, data source | No |
| Integration tests (Scala) | MockLlmProvider drives full engine pipeline against SQLite | No |
| JDBC tests (Scala) | HiveJDBCTestHelper + Echo/Mock engine | No |
| Live tests (Java/Scala) | Real LLM API + SQLite test database | Yes (CI-optional) |
| Web UI tests (TypeScript) | Vitest: API client mocking | No |

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
