Run AI entirely in your browser

Your conversations never leave your device. No servers, no API keys, no telemetry. Just you and a model running on your own hardware.

Why ThinkHere?

Privacy First

Your data never leaves your device. Every conversation, every prompt, every response stays in your browser. There is no server to breach because there is no server at all.

Zero Infrastructure

No servers to manage, no API keys to rotate, no cloud bills. Open the page, load a model, and start chatting. Everything runs on WebGPU right in your browser tab.

Open Core

The free tier is MIT licensed and fully open source. Inspect the code, fork it, contribute to it. Additional features are available with a free account, and premium features with a paid subscription.

Three paths to in-browser AI

ThinkHere supports three different inference backends, each with its own trade-offs. All three use WebGPU for GPU acceleration, and all model weights are cached in your browser after the first download.

WebLLM + MLC

HuggingFace (MLC-format weights)
  → TVM / MLC Compiler (ahead-of-time compilation)
  → WebGPU Compute Shaders (pre-optimized GPU kernels)
  → Your GPU

Models are compiled ahead-of-time using Apache TVM / MLC (Machine Learning Compilation). The compiler transforms model weights and operations into optimized WebGPU compute shaders that run directly on your GPU.
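As a minimal sketch of this path, here is how a page might load one of the MLC-compiled models with the `@mlc-ai/web-llm` package (the exact model ID string is illustrative; check WebLLM's prebuilt model list for the IDs actually shipped):

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Download (or load from browser cache) the MLC-format weights, then
// compile WebGPU shaders for this machine's GPU. The first run is slower
// because of shader compilation; subsequent loads hit the cache.
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (p) => console.log(p.text),
});

// OpenAI-style chat completion, running entirely in the browser tab.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Say hello in five words." }],
});
console.log(reply.choices[0].message.content);
```

The OpenAI-compatible `chat.completions.create` interface means existing client code can often be pointed at the in-browser engine with few changes.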

Trade-offs
  • Fast inference — kernels are pre-optimized
  • First run compiles shaders for your GPU (cached after)
  • Model must be specifically compiled for MLC

Used by

SmolLM2 360M, SmolLM2 1.7B, Qwen3 4B, Phi-3.5 Mini, Llama 3.2 1B

Transformers.js + ONNX Runtime Web

HuggingFace (ONNX-format model graph + weights)
  → ONNX Runtime Web (builds execution plan at load time)
  → WebGPU (or WASM fallback)
  → Your GPU / CPU

Models are stored in the standard ONNX (Open Neural Network Exchange) format. ONNX Runtime Web interprets the model graph at load time and executes it on your GPU via WebGPU, or falls back to WebAssembly on unsupported hardware.
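A minimal sketch of this path with the Transformers.js `pipeline` API (the model ID below is a placeholder for illustration, not necessarily one ThinkHere ships; omitting the `device` option lets the runtime pick WASM when WebGPU is unavailable):

```typescript
import { pipeline } from "@huggingface/transformers";

// Fetch the ONNX graph + weights from the Hugging Face Hub (cached by the
// browser after the first download) and build the execution plan on WebGPU.
const generator = await pipeline(
  "text-generation",
  "onnx-community/Qwen2.5-0.5B-Instruct", // placeholder model ID
  { device: "webgpu" }, // drop this line to allow the WASM fallback
);

// Interpret the graph and generate text locally.
const out = await generator("What is WebGPU?", { max_new_tokens: 64 });
console.log(out[0].generated_text);
```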

Trade-offs
  • Supports any model exportable to ONNX
  • Can fall back to WASM if WebGPU is unavailable
  • Slightly more overhead than pre-compiled kernels

Used by

Qwen3.5 0.8B, Qwen3.5 2B, Qwen3.5 4B

MediaPipe + LiteRT

HuggingFace (LiteRT model file, .litertlm)
  → MediaPipe GenAI (LLM Inference API)
  → WebGPU Compute (multimodal: text + images)
  → Your GPU

Google's MediaPipe LLM Inference API loads Gemma models in the LiteRT format (formerly TFLite). It supports multimodal input (text and images), all processed on-device via WebGPU.
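A minimal sketch of this path with `@mediapipe/tasks-genai` (the CDN URL and the model path are assumptions for illustration; in practice the `.litertlm` file would be served from wherever the app hosts its model downloads):

```typescript
import { FilesetResolver, LlmInference } from "@mediapipe/tasks-genai";

// Locate the WASM assets that host the GenAI runtime.
const genai = await FilesetResolver.forGenAiTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm",
);

// Load a single LiteRT model file; inference requires WebGPU
// (this backend has no WASM fallback).
const llm = await LlmInference.createFromOptions(genai, {
  baseOptions: { modelAssetPath: "/models/gemma.litertlm" }, // assumed path
  maxTokens: 512,
});

const answer = await llm.generateResponse("Summarize WebGPU in one sentence.");
console.log(answer);
```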

Trade-offs
  • Multimodal: text and image input
  • Single large file download (no split shards)
  • Requires WebGPU — no WASM fallback

Used by

Gemma 3n E2B, Gemma 3n E4B

All three methods use WebGPU for GPU acceleration. All model weights are cached in your browser after the first download — no server involved.

Ready to try private AI that runs entirely on your device?