Running Nvidia GPU-Powered LLMs with Ollama and OpenCode in Windows 11
Running large language models locally has become increasingly practical, especially if you have an Nvidia GPU in your machine. The appeal is straightforward: privacy and zero API costs. Ollama makes this accessible by handling model downloads, GPU routing, and inference with minimal setup. Pair it with OpenCode, a terminal-based interface, and you have a capable local coding assistant.
In this post, I’ll walk through the complete setup on Windows 11, from CUDA installation through verification. My approach uses Scoop for package management, so I’ll note alternatives along the way; any other package manager or a manual download works equally well.
Prerequisites and Expectations
Before starting, know what you’re aiming for: Ollama runs a local inference server, and OpenCode connects to it as a client. Your GPU does the heavy lifting during inference—without proper CUDA setup, the system falls back to CPU, which is dramatically slower for larger models.
This guide assumes:
- Windows 11 with a reasonably recent Nvidia GPU (for example, an RTX-series card)
- Basic familiarity with terminal commands
- ~15 minutes for the full setup
If you’re on AMD or Intel Arc, Ollama has some support but the configuration differs—check their documentation for your hardware.
Step 1: Install the Nvidia CUDA Toolkit
Ollama needs CUDA to offload work to your GPU. Download the toolkit from Nvidia’s CUDA downloads page. At the time of writing, version 13.2 is current, but newer versions work fine.
Run the installer and follow the standard steps. The installer sets necessary environment variables, so a restart afterward is usually required.
You can verify that the driver sees your GPU by opening a terminal and running:
nvidia-smi
If you see GPU information displayed, you’re ready to move on. If not, double-check that your Nvidia drivers are up to date. Note that nvidia-smi ships with the driver; to check the toolkit itself, run nvcc --version.
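If you’d rather script this check, here is a minimal Python sketch. It assumes nvidia-smi is on your PATH and relies on its --query-gpu CSV output; the helper names (parse_gpu_csv, query_gpus) are my own and not part of any Nvidia tooling.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "name,driver_version,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    gpus = []
    for line in text.strip().splitlines():
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, driver, mem_total = [f.strip() for f in line.rsplit(",", 2)]
        gpus.append(
            {"name": name, "driver": driver, "memory_total_mib": int(mem_total)}
        )
    return gpus


def query_gpus():
    """Run nvidia-smi once and return one dict per detected GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    try:
        for gpu in query_gpus():
            print(f"{gpu['name']}: driver {gpu['driver']}, "
                  f"{gpu['memory_total_mib']} MiB VRAM")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"nvidia-smi check failed: {exc}")
```

The total VRAM figure is worth noting down, since it determines which model sizes are realistic in the next step.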
Step 2: Install Ollama and Pull a Model
With CUDA ready, install Ollama. Using Scoop:
scoop install ollama-full
This installs the CLI and daemon. Alternatively, grab the installer from ollama.ai or use another package manager.
Next, pull a model. I’ve been using Qwen 2.5-7B: it’s fast, capable, and fits comfortably on most mid-range GPUs like mine. Download size varies by model (the 7B model is roughly 4-5 GB).
ollama pull qwen2.5:7b
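The 4-5 GB figure can be sanity-checked with some back-of-the-envelope arithmetic: weight size is roughly parameters times bits per weight, divided by 8. The sketch below assumes about 7.6 billion parameters for this model and somewhere between 4 and 5 bits per weight for a 4-bit quantized download; both numbers are my assumptions, not values reported by Ollama.

```python
def estimate_size_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters x bits per weight / 8 bits per byte."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Assumed values: ~7.6B parameters, 4 to 5 bits per weight after quantization.
    for bits in (4.0, 4.5, 5.0):
        size = estimate_size_gb(7.6, bits)
        print(f"7.6B parameters at {bits} bits/weight: ~{size:.1f} GB")
```

Actual VRAM use during inference will be somewhat higher than the download size, because the context cache and runtime buffers also live on the GPU.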
Not sure which model fits your needs? The Ollama Explorer is a handy reference—you can browse models by size, capability, and use case. If you choose a different model, just swap the name in later steps.
Once the pull completes, test it:
ollama run qwen2.5:7b
Type a simple prompt like “hi” and wait for a response. When done, type /bye to exit. This confirms that the model loads and responds; whether it is actually running on the GPU is checked in Step 4.
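The same smoke test can be run against Ollama’s HTTP API, which is what OpenCode will talk to later. This is a minimal sketch using only the Python standard library; it assumes Ollama is listening on its default port 11434 and that the /api/generate endpoint returns the text in a "response" field when streaming is disabled. The function names are my own.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address, assumed unchanged


def build_generate_request(model, prompt):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the prompt and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    try:
        print(generate("qwen2.5:7b", "hi"))
    except (urllib.error.URLError, OSError) as exc:
        print(f"Could not reach Ollama at {OLLAMA_URL}: {exc}")
```

If this prints a connection error while ollama run works in the terminal, the server is probably not listening where the script expects it.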
Step 3: Install and Configure OpenCode
With Ollama running smoothly, add OpenCode as your interface. Again via Scoop:
scoop install opencode
Don’t launch it yet. First, configure it to use your local Ollama instance. OpenCode reads its config from $HOME/.config/opencode/opencode.json (in PowerShell, $HOME points to your user profile folder).
Create that file and add this configuration:
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen2.5:7b": {
          "name": "Qwen2.5-7B",
          "tools": true
        }
      }
    }
  }
}
This tells OpenCode where Ollama lives (localhost:11434) and which model to use. If you picked a different model in Step 2, update both the model ID (the "qwen2.5:7b" key, which must match the name you pulled) and the display name here.
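If you switch models often, generating this file is less error-prone than editing it by hand. The sketch below rebuilds the same structure for an arbitrary model; build_opencode_config is a hypothetical helper of mine, not something OpenCode ships, and it prints the JSON rather than writing it so that nothing gets overwritten by accident.

```python
import json


def build_opencode_config(model_id, display_name,
                          base_url="http://localhost:11434/v1"):
    """Return an OpenCode config dict that points at a local Ollama model.

    model_id must match the name used with `ollama pull`; display_name is
    only the label shown in the OpenCode interface.
    """
    return {
        "$schema": "https://opencode.ai/config.json",
        "provider": {
            "ollama": {
                "npm": "@ai-sdk/openai-compatible",
                "name": "Ollama (local)",
                "options": {"baseURL": base_url},
                "models": {model_id: {"name": display_name, "tools": True}},
            }
        },
    }


if __name__ == "__main__":
    config = build_opencode_config("qwen2.5:7b", "Qwen2.5-7B")
    print(json.dumps(config, indent=2))
```

Redirect the output into opencode.json once you are happy with it.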
Now launch OpenCode:
opencode
Below the prompt box you’ll see which provider and model are active. If it’s not set to “Ollama (local)”, type /provider and select it. Similarly, /models lets you switch between available models.
Step 4: Verify GPU Usage
Here’s where many setups break down: everything runs, but on CPU instead of GPU, which defeats the purpose. Verification is quick.
With OpenCode running and a model selected, type a prompt—even something simple like “Explain Docker in one sentence.” While it’s generating, open another terminal window and check GPU usage.
The ollama ps command lists the models that are currently loaded:
ollama ps
The output includes a PROCESSOR column showing how the model is split between CPU and GPU (for example, 100% GPU).
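The same information is available over HTTP, which is handy if you want to check it from a script. This sketch assumes Ollama’s /api/ps endpoint and its size and size_vram fields (total bytes for the loaded model and bytes resident in VRAM); the helper names are mine.

```python
import json
import urllib.error
import urllib.request


def gpu_percent(model):
    """Share of a loaded model's memory that resides in VRAM, as a percentage."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return 100.0 * model.get("size_vram", 0) / size


def loaded_models(base_url="http://localhost:11434", timeout=5):
    """Return the list of currently loaded models from /api/ps."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=timeout) as resp:
        return json.loads(resp.read()).get("models", [])


if __name__ == "__main__":
    try:
        models = loaded_models()
        if not models:
            print("No models are loaded; send a prompt first.")
        for model in models:
            print(f"{model['name']}: {gpu_percent(model):.0f}% in VRAM")
    except (urllib.error.URLError, OSError) as exc:
        print(f"Could not reach Ollama: {exc}")
```

Anything well below 100% means part of the model is being served from system RAM, which usually shows up as noticeably slower generation.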
Alternatively, use Nvidia’s monitoring tool:
nvidia-smi -l 1
Run this while prompting the model, and watch for VRAM usage to spike and GPU utilization to climb. If you see neither (utilization stays at 0% and memory usage does not change), something’s wrong. Common causes:
- CUDA not properly installed: Reinstall CUDA and restart.
- Environment not reloaded: Close and reopen your terminal after CUDA install.
- Ollama serving on CPU by default: Restart the Ollama service with net stop Ollama, then net start Ollama (or the equivalent for your service manager).
- GPU memory exhaustion: A model that is too large for your GPU’s VRAM can force a partial or full CPU fallback. Check nvidia-smi to see available VRAM, and try a smaller model if it is tight.
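To take the guesswork out of watching nvidia-smi scroll by, the check above can be sketched as a short polling script. It assumes nvidia-smi is on your PATH, reads only the first GPU, and treats 5% utilization as the "GPU was used" threshold, which is an arbitrary value I picked for illustration.

```python
import subprocess
import time

FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_sample(line):
    """Parse one CSV line such as '37, 5120, 12288' (percent, MiB, MiB)."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"gpu_util_pct": util, "mem_used_mib": used, "mem_total_mib": total}


def sample_gpu():
    """Take one reading from the first GPU that nvidia-smi reports."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])


def gpu_was_used(samples, min_util_pct=5):
    """True if any sample shows utilization at or above the threshold."""
    return any(s["gpu_util_pct"] >= min_util_pct for s in samples)


if __name__ == "__main__":
    try:
        samples = []
        for _ in range(10):  # roughly ten seconds of one-second samples
            samples.append(sample_gpu())
            print(samples[-1])
            time.sleep(1)
        if gpu_was_used(samples):
            print("GPU activity detected.")
        else:
            print("GPU stayed idle; inference may be running on CPU.")
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError) as exc:
        print(f"Could not sample the GPU: {exc}")
```

Start it in a second terminal, then send a prompt in OpenCode while it is sampling.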
Reflections and Next Steps
This setup unlocks something valuable: a sandbox for experimenting with prompting, testing ideas locally, and iterating on your own terms. No API rate limits, no latency waiting for cloud requests, and complete privacy.
The trade-off is that local models are generally smaller and less capable than their cloud counterparts. Qwen 2.5-7B is solid for tasks like explaining code, brainstorming, or general Q&A. For highly specialized or nuanced work, you might still reach for GPT-4 or Claude. The real win is having the option—locally, immediately, and cheaply.
See you in the next post.