When you hear "LLM," you probably think of APIs, tokens, and cloud infrastructure. But what if we could remove the server completely? What if your browser could download a model, run it on your device, and answer questions in real time, using nothing but JavaScript and GPU acceleration?
Running an LLM locally inside the browser was nearly impossible just a year ago. But thanks to new technologies like WebLLM and WebGPU, you can now load a full language model into memory, run it on your device, and hold a real-time conversation, all without a server.
In this guide, we'll build a local chatbot that runs entirely in the browser. No backend. No API keys. By the end, you should have a good understanding of WebLLM and WebGPU, and will have built an app that looks and functions like this:

You can also try out the app using this live URL.
To build this, we need to first understand two important pieces: WebLLM and WebGPU.
What is WebLLM?
WebLLM is an open-source project from the team at MLC (Machine Learning Compiler). It lets you run language models directly in your browser tab using GPU acceleration. The models are compiled into formats that your browser can understand and execute, so there's no need to send data to a server.
Why is this important?
It keeps user data private
It reduces latency
It works offline after the model download
It removes the need for API costs or rate limits
Under the hood, WebLLM handles model loading, tokenization, execution, and streaming responses. It gives you a simple interface to load and chat with a model like LLaMA or Phi.
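To give you a feel for the shape of the API before we build the full app, here's a minimal sketch. The model ID is one of the prebuilt MLC builds we'll use later, the prompt string is just an example, and the code assumes it runs inside a <script type="module"> so the import and top-level await work:
import { CreateMLCEngine } from 'https://esm.run/@mlc-ai/web-llm@0.2.79'

// Download and initialize a prebuilt model (WebLLM caches the weights,
// so later loads skip the download)
const engine = await CreateMLCEngine('SmolLM2-360M-Instruct-q4f32_1-MLC')

// Ask a question using the OpenAI-style chat completions interface
const reply = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Explain WebGPU in one sentence.' }],
})
console.log(reply.choices[0].message.content)
We'll build this out step by step below, adding progress reporting and streaming on top.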
But WebLLM can't do it alone. It needs hardware access, and that's where WebGPU comes in.
What is WebGPU?
WebGPU is a new browser API that gives JavaScript access to the system's GPU, not just for drawing graphics, but for running large-scale parallel computations like matrix operations and tensor math.
In our case, WebGPU lets the browser perform the heavy math required to generate text from an LLM.
Here's what WebGPU does for us:
Performance: GPU compute handles these workloads far faster than CPU-bound JavaScript or even WebAssembly
GPU-first: Designed from the ground up for compute, not just rendering
Accessibility: Available across different browsers, though support varies by platform. As of June 2025:
Chrome/Edge: Fully supported on Windows, Mac, and ChromeOS since version 113. On Linux, it requires enabling the chrome://flags/#enable-unsafe-webgpu flag
Firefox: Available in Nightly builds by default, with stable release tentatively planned for Firefox 141
Safari: Available in Safari Technology Preview, with support in iOS 18 and visionOS 2 betas via Feature Flags
Android: Chrome 121+ supports WebGPU on Android
For production applications, you should include proper WebGPU feature detection and provide fallbacks for unsupported browsers.
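Here's one way you might do that detection; a minimal sketch that checks both that the API is exposed and that a GPU adapter is actually available. The hasWebGPU helper name and the fallback message are just illustrative, and the top-level await assumes a <script type="module">:
// Minimal WebGPU feature detection sketch
const hasWebGPU = async () => {
  if (!('gpu' in navigator)) return false // API not exposed at all
  const adapter = await navigator.gpu.requestAdapter() // null if no usable GPU
  return adapter !== null
}

if (!(await hasWebGPU())) {
  document.body.textContent = 'This demo needs WebGPU. Try a recent version of Chrome or Edge.'
}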
Together, WebLLM and WebGPU allow us to do something powerful: load a quantized language model directly in the browser and have real-time chat without any backend server.
With this understanding of WebLLM and WebGPU, we can now start building!
Setting up the HTML for our AI chat app
Before we write any JavaScript, we need a user interface. This will be the visible part of our app, the dropdown for selecting models, the chat area, and the input box.
Here's the plan:
We'll create a container for everything
Add a select box for choosing the model
Include a progress bar for when the model is loading
Display the chat history
Create a form with a text input and a button to submit prompts
To begin, create an index.html file and paste the following code inside it:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Browser LLM Chat Demo</title>
    <style>
      /* We'll add styling here. But later! */
    </style>
  </head>
  <body>
    <div class="app-container">
      <h1>Chat with LLM (In your browser)</h1>
      <div class="controls">
        <select id="model-select">
          <option value="SmolLM2-360M-Instruct-q4f32_1-MLC">
            SmolLM2 360M (Very Small)
          </option>
          <option value="Phi-3.5-mini-instruct-q4f32_1-MLC">
            Phi 3.5 Mini (Medium)
          </option>
          <option value="Llama-3.1-8B-Instruct-q4f32_1-MLC">
            Llama 3.1 8B (Large)
          </option>
        </select>
        <button id="load-model">Load Model</button>
      </div>
      <div class="chat-container">
        <div id="output">Select a model and click "Load Model" to begin</div>
        <div
          id="progress-container"
          class="progress-container"
          style="display: none"
        >
          <div class="progress-bar">
            <div id="progress-fill" class="progress-fill"></div>
          </div>
          <div id="progress-text" class="progress-text">0%</div>
        </div>
        <form id="chat-form" class="form-group">
          <input id="prompt" placeholder="Type your question..." disabled />
          <button type="submit" disabled>Send</button>
        </form>
      </div>
    </div>
  </body>
</html>
In the HTML file, we've created a chat interface with controls for model selection and loading. The interface includes a chat output area, progress indicators, and an input form: everything users need to interact with the AI model.
Model selection
Notice that in the div with class controls, we have a select element for model selection and a button for loading the model. Here are the specifications for each model:
| Model | Parameters | Q4 file size | VRAM needed |
| --- | --- | --- | --- |
| SmolLM2-360M | 360 million | ~270 MB | ~380 MB |
| Phi-3.5-mini | 3.8 billion | ~2.4 GB | ~3.7 GB |
| Llama-3.1-8B | 8.03 billion | ~4.9 GB | ~5 GB |
When you're deciding which of these models to use in a browser environment with WebLLM, think first about what kind of work you want it to handle.
SmolLM2-360M is the smallest by a wide margin, which means it loads quickly and puts the least strain on your device. If you're writing short notes, rewriting text, or making quick coding helpers that run in a browser, this might be all you need.
Phi-3.5-mini brings more parameters and more capacity for reasoning, even though it still runs entirely in your browser. It's good for handling multi-step explanations, short document summarization, or answering questions about moderately long prompts. If you're looking for a balance between size and capability, Phi-3.5-mini occupies a comfortable middle ground.
Llama-3.1-8B is the largest of the three and carries more of the general knowledge and pattern recognition that bigger models can offer. It's more reliable if you're dealing with open-ended dialogue, creative writing, or complex coding tasks. But you'll need more memory.
Each of these models trades off size, memory use, and output quality in different ways. So choosing the right one depends on what your hardware can handle and what kind of prompts you plan to work with. All can run directly in modern browsers with WebGPU support.
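If you want the app to suggest a sensible default, one rough heuristic (a sketch, not a guarantee) is to key off the device's reported memory. The pickDefaultModel helper is just an illustrative name, and navigator.deviceMemory is Chromium-only, reports RAM in gigabytes capped at 8, and says nothing about GPU memory, so treat it purely as a hint:
// Pick a default model ID from the dropdown based on reported device memory (rough heuristic)
const pickDefaultModel = () => {
  const ram = navigator.deviceMemory || 4 // undefined outside Chromium; assume a mid-range device
  if (ram >= 8) return 'Llama-3.1-8B-Instruct-q4f32_1-MLC'
  if (ram >= 4) return 'Phi-3.5-mini-instruct-q4f32_1-MLC'
  return 'SmolLM2-360M-Instruct-q4f32_1-MLC'
}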
There are more models available at the WebLLM repository, ranging from smaller models for mobile devices to larger ones for more capable systems.
With the HTML in place, the next thing we'll do is work on the JavaScript implementation, then add some CSS to make it look nice. We're saving the CSS for the last step so we can focus on the core features.
Using WebLLM and WebGPU to build the chatbot
Our JavaScript will do four main things:
Load the selected model into memory
Track the loading progress and show feedback
Enable the chat form once the model is ready
Stream the response back as the assistant types it out
We'll build this piece by piece.
Step 1: Import WebLLM
We need to bring in WebLLM so we can access the model engine. Because import statements only work inside JavaScript modules, place the code that follows in a <script type="module"> tag (or a separate file loaded as a module):
import { CreateMLCEngine } from 'https://esm.run/@mlc-ai/web-llm@0.2.79'
This gives us the function that will initialize the model.
Step 2: Get references to the DOM elements
Let's wire up the interface. We'll grab the elements we need so we can update them later.
const output = document.getElementById('output')
const form = document.getElementById('chat-form')
const promptInput = document.getElementById('prompt')
const submitButton = document.querySelector('button[type="submit"]')
const modelSelect = document.getElementById('model-select')
const loadModelButton = document.getElementById('load-model')
const progressContainer = document.getElementById('progress-container')
const progressFill = document.getElementById('progress-fill')
const progressText = document.getElementById('progress-text')
We'll use these to display messages, show progress, and control the form state.
Step 3: Track the model engine
We need a variable to hold the model once it's loaded.
let engine = null
We'll update this later when the user loads a model.
Step 4: Show progress
When downloading and initializing the model, we want to keep the user informed.
const updateProgress = (percent) => {
  progressContainer.style.display = 'block'
  progressFill.style.width = `${percent}%`
  progressText.textContent = `${percent}%`
}
This sets the visual width of the progress bar and updates the percentage text.
Step 5: Load the model
Here's the key function. We call this when the user clicks the "Load Model" button.
const loadModel = async (modelId) => {
  try {
    output.textContent = 'Initializing...'
    promptInput.disabled = true
    submitButton.disabled = true
    loadModelButton.disabled = true
    progressContainer.style.display = 'none'
We first disable the interface to prevent interference while loading.
Next, we make sure the browser supports WebGPU:
    if (!navigator.gpu) {
      throw new Error('WebGPU not supported in this browser...')
    }
This check is crucial because WebGPU availability varies significantly across browsers and platforms. If WebGPU isn't available, the function fails gracefully: the error is caught further down and surfaced to the user, and you can swap in whatever fallback content suits your app.
Then we download and initialize the model:
    engine = await CreateMLCEngine(modelId, {
      initProgressCallback: (progress) => {
        let percent =
          typeof progress === 'number'
            ? Math.floor(progress * 100)
            : Math.floor(progress.progress * 100)
        updateProgress(percent)
        output.textContent = `Loading model... ${percent}%`
      },
      useIndexedDBCache: true,
    })
Once complete:
    output.textContent = 'Model ready! Ask me something!'
    promptInput.disabled = false
    submitButton.disabled = false
    loadModelButton.disabled = false
  } catch (error) {
    loadModelButton.disabled = false
    output.innerHTML += `<div class="error">Failed to load model: ${error.message}</div>`
  }
}
Now the model is ready to use.
Step 6: Handle chat form submission
When the user types a question and presses enter, this block sends it to the model:
form.addEventListener('submit', async (e) => {
  e.preventDefault()
  if (!engine) {
    output.innerHTML += `<div class="error">No model loaded...</div>`
    return
  }
  const prompt = promptInput.value.trim()
  if (!prompt) return
  output.textContent = `You: ${prompt}\n\nAssistant: `
  promptInput.value = ''
  promptInput.disabled = true
  submitButton.disabled = true
Then we stream the assistant's response:
  try {
    const stream = await engine.chat.completions.create({
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    })
    for await (const chunk of stream) {
      const token = chunk.choices[0].delta.content || ''
      output.textContent += token
      output.scrollTop = output.scrollHeight
    }
    promptInput.disabled = false
    submitButton.disabled = false
    promptInput.focus()
  } catch (error) {
    output.innerHTML += `<div class="error">Error during chat: ${error.message}</div>`
    promptInput.disabled = false
    submitButton.disabled = false
  }
})
Step 7: Trigger the model load on button click
Finally, we hook up the "Load Model" button:
loadModelButton.addEventListener('click', async () => {
  await loadModel(modelSelect.value)
})
Adding CSS styling
Now that we have the JavaScript in place, let's add some CSS to give the app a clean look. In the <style> tag in the <head> section of the HTML file, add the following CSS:
body {
  font-family: Inter, system-ui, -apple-system, sans-serif;
  background-color: #f9fafb;
  color: #111827;
  line-height: 1.6;
  max-width: 800px;
  margin: 0 auto;
  padding: 2rem;
}

.app-container {
  background-color: white;
  border-radius: 16px;
  box-shadow: 0 4px 20px rgba(0, 0, 0, 0.08);
  padding: 2rem;
  overflow: hidden;
  margin-top: 5rem;
}

h1 {
  font-size: 1.5rem;
  font-weight: 600;
  margin: 0 0 1.5rem;
  color: #111827;
}

#output {
  background: #f3f4f6;
  padding: 1.25rem;
  min-height: 220px;
  border-radius: 12px;
  white-space: pre-wrap;
  overflow-y: auto;
  max-height: 420px;
  font-size: 0.95rem;
}

input,
button,
select {
  font-family: inherit;
  font-size: 0.95rem;
  padding: 0.8rem 1rem;
  width: 100%;
  border-radius: 10px;
  border: 1px solid #e5e7eb;
  background-color: white;
}

button {
  background-color: #fd366e;
  color: white;
  border: none;
  font-weight: 500;
  cursor: pointer;
  transition: background-color 0.15s;
}

button:hover:not(:disabled) {
  background-color: #e62e60;
}

button:disabled {
  background-color: #ffa5c0;
  cursor: not-allowed;
}

.error {
  color: #dc2626;
  background-color: #fee2e2;
  padding: 0.8rem;
  border-radius: 10px;
  margin-top: 0.75rem;
  font-size: 0.9rem;
}

.controls {
  display: grid;
  grid-template-columns: 2fr 1fr;
  gap: 0.75rem;
  margin-bottom: 1.5rem;
}

.chat-container {
  margin-top: 1.5rem;
  display: flex;
  flex-direction: column;
  gap: 1.5rem;
}

.progress-container {
  margin-top: 1rem;
}

.progress-bar {
  width: 100%;
  height: 8px;
  background-color: #e5e7eb;
  border-radius: 999px;
  overflow: hidden;
}

.progress-fill {
  height: 100%;
  background-color: #fd366e;
  width: 0%;
  transition: width 0.3s ease;
}

.progress-text {
  font-size: 0.8rem;
  text-align: center;
  margin-top: 0.5rem;
  color: #6b7280;
}

.form-group {
  display: flex;
  flex-direction: column;
  gap: 0.75rem;
}

input {
  box-sizing: border-box;
  margin: 0rem;
}

input:focus,
select:focus {
  outline: none;
  border-color: rgba(253, 54, 110, 0.5);
  box-shadow: 0 0 0 1px rgba(253, 54, 110, 0.1);
}
Next, open the index.html file in your browser and you should see something like this:

Try loading a model and chatting with the AI. To confirm that everything really runs locally, turn off your internet connection after the initial model download; the app should keep working.
Conclusion
We've built a local chatbot that runs entirely in the browser. You can now load a model and hold a real-time conversation with it, all without a server.
This is the future of AI. Local, private, and fast. And what we've done is just the beginning. You can build on this foundation, add chat history, improve UX, support longer contexts, or experiment with your own compiled models.
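Chat history, for instance, only takes a small change: keep a running messages array and send the whole conversation on every turn. Here's a minimal sketch built on the same engine.chat.completions.create call we used above (the ask helper name is just for illustration):
// Keep the full conversation and send it on every turn so the model has context
const messages = [{ role: 'system', content: 'You are a helpful assistant.' }]

const ask = async (prompt) => {
  messages.push({ role: 'user', content: prompt })
  const stream = await engine.chat.completions.create({ messages, stream: true })
  let reply = ''
  for await (const chunk of stream) {
    reply += chunk.choices[0].delta.content || ''
  }
  messages.push({ role: 'assistant', content: reply })
  return reply
}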
You can also check out a more complex version of this app that includes chat history and a few other features here.
The source code for the app we built in this guide is available on GitHub.