MCP integration guide
Model Context Protocol (MCP) is an open standard that lets AI agents and LLM-powered tools discover and invoke capabilities exposed by a server. For ClickHouse integrations, MCP is the right surface when your integration is AI-native — meaning an agent or LLM is driving the queries — rather than a traditional BI tool, ETL pipeline, or application making programmatic requests.
| Integration type | Recommended surface |
|---|---|
| BI tool / query builder | JDBC or HTTP API |
| ETL / data pipeline | HTTP API or JDBC |
| AI agent / LLM assistant | Remote MCP server |
| IDE coding assistant | Local MCP server (stdio) or remote MCP |
| Data catalog with AI features | Remote MCP server alongside existing HTTP/JDBC |
Connecting to ClickHouse via MCP
ClickHouse Cloud built-in remote MCP server
ClickHouse Cloud includes a fully managed remote MCP server. It requires no infrastructure to deploy and authenticates via OAuth 2.0.
Endpoint:
Transport: Streamable HTTP (MCP standard)
Authentication: OAuth 2.0 — the MCP client initiates a browser-based OAuth flow using ClickHouse Cloud credentials on first connect.
Capabilities: 13 read-only tools across querying, schema discovery, service management, backups, ClickPipes, and billing. All tools are annotated with readOnlyHint: true.
Enable it per service in the ClickHouse Cloud console under Connect → MCP. Once enabled, point any MCP client at the endpoint above. See the remote MCP setup guide for IDE-specific configuration steps.
If your integration targets ClickHouse Cloud users, this is the fastest path to MCP connectivity — recommend it over building your own server.
Open-source MCP server for self-hosted ClickHouse
For integrations that must support self-hosted ClickHouse instances, use the mcp-clickhouse open-source server as a starting point. It exposes three core tools (run_query, list_databases, list_tables) and supports both local stdio and remote HTTP/SSE transports.
Building a custom MCP server
Build a custom MCP server when you need tools or behavior that neither the Cloud server nor the open-source server provides — for example, domain-specific query templates, write operations under controlled conditions, or integration with your own auth system.
Choose the right transport
| Transport | Use when |
|---|---|
| stdio | Local dev tools (Claude Code, Cursor, VS Code Copilot). No network exposure. |
| Streamable HTTP | Cloud-deployed servers, multi-tenant services, load-balanced deployments. The MCP standard transport for remote servers. |
| SSE (legacy) | Legacy MCP clients that predate Streamable HTTP. Prefer Streamable HTTP for new servers. |
For remote servers, Streamable HTTP is the current standard. Design stateless request handlers — do not store session state in memory, as load balancers will distribute requests across instances.
Authentication
| Scenario | Recommended approach |
|---|---|
| User-facing (human authenticates) | OAuth 2.0 with PKCE — consistent with ClickHouse Cloud's own approach |
| Service-to-service (agent authenticates) | Static Bearer token via Authorization header; rotate regularly |
| Development / local-only | Disable auth; never in production |
Always require authentication for any remotely accessible MCP server. The /health endpoint is the only route that should remain unauthenticated (for orchestrator probes).
Consumption best practices
The following practices apply to any MCP server that exposes ClickHouse query capabilities to an AI agent.
Design read-only tools by default
Annotate every tool with readOnlyHint: true in its MCP metadata unless writes are explicitly part of your design. This signals to MCP hosts that the tool has no side effects, enabling better agent planning.
readOnlyHint is advisory only — it informs the LLM host but is not enforced at the protocol level. Apply server-side enforcement as well:
- Connect to ClickHouse with a read-only user (no INSERT, ALTER, DROP grants)
- Validate that query strings begin with
SELECTorWITHbefore execution - Use ClickHouse's
readonlysession setting as a secondary guard:
If your server must support writes (e.g., agent-driven INSERT), expose write tools under a separate, explicitly named operation, require an additional confirmation parameter, and log every invocation.
Apply resource limits to all queries
Agents can generate unbounded queries. Always enforce limits to prevent runaway execution. The most reliable approach is to set them on the ClickHouse user — they apply regardless of what the agent sends:
Or enforce per-query via HTTP parameters:
10,000 rows is a practical ceiling for agent-readable results — LLMs cannot meaningfully process millions of rows and large results inflate token usage. Design tool descriptions to guide agents toward aggregating queries rather than full table scans.
Expose focused schema discovery tools
Agents need to understand the data model before they can write useful queries. Provide dedicated tools for schema discovery rather than expecting agents to write raw system.* queries. Good tool design:
- A
list_tablestool that returns table name, engine, row count estimate, and a brief description fromsystem.tables.comment - A
describe_tabletool that returns column names, types (with Nullable/LowCardinality wrappers stripped), andis_in_sorting_keyto help the agent write efficient filtered queries - Exclude
system,information_schema, andINFORMATION_SCHEMAdatabases from all schema tools
Write tool descriptions that constrain agent behavior
Tool descriptions are read by the LLM to decide how and when to invoke a tool. Vague descriptions lead to over-use and inefficient queries. Be specific about what the tool does, what good input looks like, and what its limits are:
Include ClickHouse SQL specifics in descriptions: function name casing (count() not COUNT(*)), date literal functions (toDate(), toDateTime()), and the fact that identifiers are case-sensitive.
Size results for LLM context
LLM context windows are finite and token usage scales with result size. Design tools to return summaries, not raw data dumps:
- Return aggregates when the agent's question can be answered with counts, sums, or averages
- Cap raw row results at 100–500 rows for display tools; use the 10,000 row limit only for data export tools
- Include row count metadata alongside results so the agent knows when results were truncated
- For schema tools, return column descriptions not full table DDL
Guard against prompt injection
Query results returned by your MCP server flow back into the LLM's context. If a ClickHouse table contains user-generated text, that text could carry adversarial instructions targeting the agent.
Mitigations:
- Limit result size — small result sets reduce the attack surface
- Return structured data — parse
JSONEachRowserver-side before passing to the agent; avoid returning raw string columns that could contain markdown or instruction-like text directly into the agent's context - Sanitize schema names — when returning database, table, or column names, strip or escape characters that could be interpreted as markdown formatting or instructions
Identify your MCP server in query logs
Set User-Agent and log_comment on all queries issued by your MCP server, exactly as you would for any integration. This makes it possible to distinguish agent-driven queries from human queries in system.query_log and attribute cost and performance to specific tools:
Query performance by tool:
MCP checklist
- Using ClickHouse Cloud? Enable the built-in MCP server rather than building your own
- Remote server uses Streamable HTTP transport with stateless request handlers
- All tools annotated with
readOnlyHint: trueunless writes are explicitly required - Server-side enforcement: read-only ClickHouse user +
SELECT/WITHSQL prefix validation -
max_execution_time,max_result_rows, andmax_bytes_to_readset on the ClickHouse user - Dedicated schema discovery tools (
list_tables,describe_table) that return sorting key info - Tool descriptions specify aggregation preference, result limits, and ClickHouse SQL idioms
- Result size designed for LLM context: summaries and aggregates over raw row dumps
-
query_idandlog_commentset on every MCP-issued query - Prompt injection mitigations: structured data, result size limits, sanitized schema names