The Data Agent enables users to perform data analysis through natural language — an AI agent autonomously explores schemas, generates SQL, executes queries via Kyuubi's existing multi-engine infrastructure, and self-corrects through multi-turn ReAct reasoning.
Sub-tasks
PR 1: Module skeleton, configuration, and engine core — New module externals/kyuubi-data-agent-engine with engine fully runnable via Echo provider. Includes Thrift frontend, session/operation management, IncrementalFetchIterator for streaming, event system, and all kyuubi.engine.data.agent.* configuration entries.
PR 2a: Tool system, data source, and prompt templates — SqlQueryTool with maxRows enforcement and output truncation, ToolRegistry with JSON schema generation for LLM function calling, data source abstraction with dialect auto-detection (Spark/SQLite/MySQL/Trino), and composable system prompt builder with per-dialect templates.
PR 2b: Agent runtime, middleware, and OpenAI provider — ReAct loop agent with streaming LLM interaction, ConversationMemory for multi-turn context, middleware pipeline (ApprovalMiddleware with STRICT/NORMAL/AUTO_APPROVE modes, LoggingMiddleware), and OpenAI-compatible provider. Integration tests with MockLlmProvider validate the complete tool-call pipeline without a real LLM.
PR 3: REST API and Web UI — SSE streaming chat endpoint (POST /api/v1/data-agent/{sessionHandle}/chat), tool approval endpoint (POST /api/v1/data-agent/{sessionHandle}/approve), and complete Vue web interface with session management, real-time message streaming, tool call visualization, and approval workflow UI.
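To make the "maxRows enforcement and output truncation" item in PR 2a concrete, here is a minimal sketch of how a tool result could be capped before it is handed back to the LLM. The class name `ToolOutputLimiter`, its two limits, and the omission messages are all hypothetical and are not taken from the actual SqlQueryTool. A real implementation would probably also cap rows at the JDBC level, for example with `java.sql.Statement.setMaxRows`.

```java
import java.util.List;

// Hypothetical helper for illustration only; not the actual SqlQueryTool code.
final class ToolOutputLimiter {
    private final int maxRows;   // assumed limit on rows returned to the LLM
    private final int maxChars;  // assumed limit on the rendered text length

    ToolOutputLimiter(int maxRows, int maxChars) {
        if (maxRows < 1 || maxChars < 1) {
            throw new IllegalArgumentException("limits must be positive");
        }
        this.maxRows = maxRows;
        this.maxChars = maxChars;
    }

    /** Renders at most maxRows rows, then cuts the text at maxChars characters. */
    String render(List<String> rows) {
        StringBuilder sb = new StringBuilder();
        int shown = Math.min(rows.size(), maxRows);
        for (int i = 0; i < shown; i++) {
            sb.append(rows.get(i)).append('\n');
        }
        if (rows.size() > shown) {
            // Tell the model that rows were dropped so it can refine the query.
            sb.append('[').append(rows.size() - shown).append(" more rows omitted]\n");
        }
        if (sb.length() > maxChars) {
            sb.setLength(maxChars);
            sb.append("\n[output truncated]");
        }
        return sb.toString();
    }
}
```

Reporting how much was omitted, rather than cutting silently, gives the agent a signal to add a filter or an aggregate on its next turn.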
Motivation
See KPIP-7373 for full motivation. In short: Kyuubi's existing Chat Engine is stateless with no data access. The Data Agent Engine bridges LLMs with Kyuubi's multi-engine SQL execution, enabling business users and analysts to query data warehouses through natural language without writing SQL.
Describe the solution
See KPIP-7373 for detailed design. Key decisions:
SQL routes through Kyuubi Server — The agent's sql_query tool connects back to Kyuubi Server via JDBC with the original user's credentials, inheriting AuthZ/Ranger policies, audit, and resource isolation.
Pluggable LLM providers — OpenAI-compatible API as default via official OpenAI Java SDK; extensible through DataAgentProvider interface.
Java for business logic, Scala for framework wrappers — Agent runtime, tools, providers, events, and middleware are all in Java; Scala is used only for thin integration with Kyuubi's Session/Operation/Engine infrastructure.
Streaming-first — IncrementalFetchIterator enables real-time event streaming to both JDBC and REST/SSE clients.
Human-in-the-loop approval — Configurable approval workflow (AUTO_APPROVE / NORMAL / STRICT) for controlling tool execution risk.
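The issue names the three approval modes but does not spell out their semantics, so the following is only a sketch under stated assumptions: AUTO_APPROVE never asks, STRICT always asks, and NORMAL asks only for statements that do not look read-only. The class `ApprovalPolicySketch` and the keyword-prefix check are hypothetical; the real ApprovalMiddleware may decide differently.

```java
import java.util.Locale;

// Hypothetical policy for illustration; the mode semantics below are assumptions.
final class ApprovalPolicySketch {
    enum Mode { AUTO_APPROVE, NORMAL, STRICT }

    // Naive assumption: a statement is read-only if it starts with one of these keywords.
    static boolean looksReadOnly(String sql) {
        String s = sql.trim().toLowerCase(Locale.ROOT);
        return s.startsWith("select") || s.startsWith("show")
                || s.startsWith("describe") || s.startsWith("explain");
    }

    /** Returns true when a human must approve the tool call before it runs. */
    static boolean requiresApproval(Mode mode, String sql) {
        switch (mode) {
            case AUTO_APPROVE:
                return false;               // assumed: never ask
            case STRICT:
                return true;                // assumed: always ask
            case NORMAL:
                return !looksReadOnly(sql); // assumed: ask only for writes and DDL
            default:
                throw new IllegalStateException("unknown mode " + mode);
        }
    }
}
```

A prefix check like this is easy to fool (for example with a CTE that ends in an INSERT), so a production policy would more likely rely on a SQL parser or on the engine's own authorization layer.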
Additional context
Test strategy:
| Layer | Approach | LLM Required? |
| --- | --- | --- |
| Unit tests (Java) | JUnit 4: tools, events, memory, middleware, prompts, data source | No |
| Integration tests (Scala) | MockLlmProvider drives full engine pipeline against SQLite | No |
| JDBC tests (Scala) | HiveJDBCTestHelper + Echo/Mock engine | No |
| Live tests (Java/Scala) | Real LLM API + SQLite test database | Yes (CI-optional) |
| Web UI tests (TypeScript) | Vitest: API client mocking | No |
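The integration-test row relies on a scripted provider standing in for the LLM. As a rough illustration of why that is enough to exercise the ReAct loop, including self-correction after a failed query, here is a small sketch. `ReActLoopSketch`, `Step`, and `Provider` are hypothetical names invented for this example; they are not the engine's DataAgentProvider or MockLlmProvider APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical, simplified ReAct loop; not the engine's actual agent runtime.
final class ReActLoopSketch {
    /** One LLM step: a tool call (sql != null) or a final answer. */
    static final class Step {
        final String sql;
        final String answer;
        private Step(String sql, String answer) { this.sql = sql; this.answer = answer; }
        static Step toolCall(String sql) { return new Step(sql, null); }
        static Step finalAnswer(String text) { return new Step(null, text); }
    }

    /** Assumed provider shape: maps the conversation so far to the next step. */
    interface Provider { Step next(List<String> history); }

    static String run(Provider llm, Function<String, String> sqlTool,
                      String question, int maxTurns) {
        List<String> history = new ArrayList<>();
        history.add("user: " + question);
        for (int turn = 0; turn < maxTurns; turn++) {
            Step step = llm.next(history);
            if (step.sql == null) {
                return step.answer;                 // model decided it is done
            }
            String observation;
            try {
                observation = sqlTool.apply(step.sql);
            } catch (RuntimeException e) {
                // Feed the error back so the next turn can correct the query.
                observation = "ERROR: " + e.getMessage();
            }
            history.add("tool_call: " + step.sql);
            history.add("observation: " + observation);
        }
        return "Stopped after " + maxTurns + " turns without a final answer.";
    }
}
```

A test can pass a lambda as the provider that returns a bad query first, a corrected query once it sees an `ERROR` observation, and a final answer after that, which covers the full tool-call path without any network access.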
Are you willing to submit PR?
Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
Describe the feature
Umbrella issue tracking the implementation of the DATA_AGENT engine type. KPIP: #7373