Building Your Own AI Coding Assistant Product
Stack for shipping a custom AI coding assistant — code completion, chat, code search, and agentic refactoring — as a standalone product or IDE plugin.
The Stack
Anthropic
— Primary reasoning LLM. Claude 3.5/3.7 Sonnet leads benchmarks on code generation, refactoring, and multi-file edits. Its 200k context window handles full-repository context without truncation for most codebases.
Alternatives: openai, deepseek-api, codestral, groq, zhipu-ai, baichuan-ai
LiteLLM
— LLM routing and cost optimization. LiteLLM proxies 100+ LLM APIs behind a single OpenAI-compatible interface — route completions to fast, cheap models (Groq + Llama) and chat to powerful models (Claude/GPT-4o) without changing application code.
Alternatives: openrouter, fireworks-ai, together-ai
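A minimal sketch of the routing split described above. The model identifiers follow LiteLLM's provider-prefixed naming convention ("groq/...", "anthropic/..."), but the exact model strings are assumptions — check your provider's current model list. In production, the returned string is what you would pass to `litellm.completion(model=..., messages=...)`.

```python
# Task-based model routing sketch. Model names are assumptions;
# substitute whatever your LiteLLM proxy exposes.
ROUTES = {
    # fast, cheap model for latency-sensitive inline autocomplete
    "completion": "groq/llama-3.1-70b-versatile",
    # powerful model for chat and multi-file edits
    "chat": "anthropic/claude-3-5-sonnet-20241022",
}

def pick_model(task: str) -> str:
    """Map a task type to a model identifier; default to the chat model."""
    return ROUTES.get(task, ROUTES["chat"])
```

Keeping the routing table in application code (rather than per-call model strings) means a model upgrade is a one-line change that Langfuse traces can then validate against real traffic.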
ast-grep
— AST-based code search and refactoring (optional). ast-grep performs structural code search and rewrite across repositories — essential for building refactoring features that understand syntax rather than just text patterns.
Qdrant
— Code vector search. Qdrant stores AST-chunked code embeddings with payload filters by file path, language, and symbol type — enables accurate semantic code search across millions of lines.
Alternatives: chroma, weaviate, pgvector, milvus
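A sketch of what "AST-chunked with payload filters" means in practice, using Python's stdlib `ast` module as a stand-in chunker (a production system would use Tree-sitter for multi-language support, per the gotcha below). Each chunk carries the payload fields the entry names — file path, language, symbol type — which Qdrant can filter on at query time.

```python
# Stdlib stand-in for an AST chunker: one chunk per top-level
# function/class, with a Qdrant-style filterable payload attached.
import ast

def chunk_python_source(path: str, source: str) -> list[dict]:
    """Split a Python file into per-symbol chunks with metadata payloads."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "text": ast.get_source_segment(source, node),  # embed this
                "payload": {                                   # filter on this
                    "file_path": path,
                    "language": "python",
                    "symbol_type": type(node).__name__,
                    "symbol_name": node.name,
                },
            })
    return chunks

src = "def add(a, b):\n    return a + b\n\nclass Greeter:\n    pass\n"
chunks = chunk_python_source("src/util.py", src)
```

Chunking on symbol boundaries (rather than fixed token windows) is what keeps a retrieved chunk a complete, compilable unit instead of half a function body.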
E2B
— Code execution sandbox (optional). E2B runs AI-generated code in isolated sandboxes to verify correctness before presenting it to users — the foundation for 'AI writes and tests the code' workflows.
Alternatives: modal-labs, replit
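A sketch of the verify-before-present loop. `run_in_sandbox` here is a hypothetical helper backed by a local subprocess purely for illustration — in production it would delegate to an E2B (or Modal/Replit) sandbox so untrusted AI-generated code never executes on your own host.

```python
# Verify-before-present sketch. The subprocess backend is a local
# stand-in; swap it for a real sandbox (E2B etc.) in production.
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Run code in a fresh interpreter; return (succeeded, combined output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return proc.returncode == 0, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timed out"
    finally:
        os.unlink(path)

ok, output = run_in_sandbox("print(sum(range(10)))")
```

Only suggestions whose sandbox run succeeds (and, ideally, whose tests pass) get surfaced; everything else is silently retried or dropped.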
Instructor
— Structured LLM outputs for code edits (optional). Instructor validates LLM JSON outputs against Pydantic schemas — ensures code edit payloads (file path, start/end lines, replacement text) parse correctly without custom retry logic.
Alternatives: outlines, guidance, mirascope
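A dependency-free sketch of the edit-payload schema the entry describes, using stdlib dataclasses in place of Instructor + Pydantic so it runs anywhere; Instructor would enforce the same shape (and retry on validation failure) directly from the LLM response.

```python
# Stdlib stand-in for the Pydantic schema Instructor would validate:
# a single code edit with file path, line range, and replacement text.
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class CodeEdit:
    file_path: str
    start_line: int
    end_line: int
    replacement: str

    def __post_init__(self):
        if self.start_line < 1 or self.end_line < self.start_line:
            raise ValueError("invalid line range")

def parse_edit(raw: str) -> CodeEdit:
    """Parse one LLM-emitted JSON edit; raises on missing/invalid fields."""
    return CodeEdit(**json.loads(raw))

edit = parse_edit('{"file_path": "app.py", "start_line": 3, '
                  '"end_line": 5, "replacement": "pass"}')
```

Failing fast on a malformed payload (rather than applying a half-parsed edit) is what makes the multi-file rollback logic in the gotchas below tractable.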
Langfuse
— Prompt and completion tracing (optional). Langfuse traces every completion and chat turn with prompt template version, model, token count, and user ID — lets you compare model upgrades on real user queries before rolling out.
Alternatives: langsmith, braintrust, helicone
DeepEval
— Code generation evaluation (optional). DeepEval's code-specific metrics (correctness via execution, test pass rate) let you benchmark model upgrades against a golden set of coding tasks before deployment.
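A stdlib sketch of the execution-based metric this entry describes — not DeepEval's actual API, just the underlying idea: each golden task pairs a generated snippet with an assertion, and the score is the fraction of tasks whose code passes.

```python
# Golden-set pass-rate sketch (the metric DeepEval automates).
# exec() here is for illustration only; real harnesses run candidate
# code in a sandbox, as covered under E2B above.
def pass_rate(tasks: list[tuple[str, str]]) -> float:
    """tasks: (generated_code, test_code) pairs; returns fraction passing."""
    passed = 0
    for code, test in tasks:
        ns: dict = {}
        try:
            exec(code, ns)   # run the model's code
            exec(test, ns)   # run the golden assertion against it
            passed += 1
        except Exception:
            pass
    return passed / len(tasks)

golden = [
    ("def double(x):\n    return 2 * x", "assert double(4) == 8"),
    ("def double(x):\n    return x + 1", "assert double(4) == 8"),  # fails
]
score = pass_rate(golden)  # 0.5
```

Running this same golden set against two model versions before a rollout is the cheap way to catch a regression that generic benchmarks miss.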
Semgrep
— Security scanning for generated code (optional). Semgrep scans AI-generated code for known vulnerability patterns (SQL injection, hardcoded secrets, insecure deserialization) before surfacing it to users — critical for trust in coding tools.
Gotchas
- ⚠️ Code context windows are expensive: a 2,000-file repository fully embedded and retrieved can easily inject 50k+ tokens per chat turn at $0.75+ per message with GPT-4o.
- ⚠️ Completion latency requirements (<100ms first token) are incompatible with GPT-4o — use Groq (Llama 3.1 70B) or Fireworks AI for inline autocomplete and reserve powerful models for chat.
- ⚠️ AST-chunking is language-specific — a generic text splitter will split function bodies mid-logic. Invest in Tree-sitter-based chunking early or retrieval quality will be permanently poor.
- ⚠️ AI-generated code that looks plausible but introduces security vulnerabilities is worse than no suggestion — implement at minimum a Semgrep scan before surfacing suggestions to users.
- ⚠️ DeepSeek Coder and Codestral are 10x cheaper than GPT-4o for code tasks with comparable quality on most benchmarks — benchmark your specific language/framework before defaulting to GPT-4o.
- ⚠️ Multi-file edits require transactional semantics — if the LLM generates edits to 5 files and file 3 fails validation, you need rollback logic. Most off-the-shelf frameworks don't handle this.
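The transactional-semantics gotcha can be sketched as a snapshot-then-apply pattern: validate everything up front, snapshot originals, and restore all files if any write fails, so the repository never lands in a half-edited state. This is a minimal illustration, not production rollback logic (it ignores concurrent writers and crash-during-rollback).

```python
# All-or-nothing multi-file edit sketch: snapshot first, roll back on
# any failure. edits maps file path -> full new content.
import os

def apply_edits_atomically(edits: dict[str, str]) -> None:
    """Apply every edit or none: restores snapshots if a write fails."""
    snapshots = {}
    for path in edits:                      # snapshot before touching anything
        with open(path) as f:
            snapshots[path] = f.read()
    written = []
    try:
        for path, content in edits.items():
            with open(path, "w") as f:
                f.write(content)
            written.append(path)
    except Exception:
        for path in written:                # roll back files already written
            with open(path, "w") as f:
                f.write(snapshots[path])
        raise
```

Because the snapshot pass reads every target file before any write happens, a missing or unreadable file aborts the whole transaction with zero files modified.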
Related Stacks
RAG Knowledge Base (Internal or External)
Retrieval-Augmented Generation stack for grounding an LLM in your company's docs, PDFs, and data sources with accurate citations.
Multi-Agent Autonomous Platform
Stack for building production multi-agent systems that browse the web, write and run code, use tools, and complete long-horizon tasks autonomously.
LLM Production Observability and Evaluation
Stack for monitoring LLM applications in production: tracing every call, evaluating output quality, catching model drift, and controlling costs.