Can I Run GLM-5 on an NVIDIA RTX 4090?

Won't fit — even the smallest quant (Q4_K_M) needs 140.4GB VRAM.

Model size: 230B

GPU memory: 24.0GB

Smallest quant: Q4_K_M

Best fit: none

None of GLM-5's quantizations fit

Even the most aggressive quantization needs more memory than the NVIDIA RTX 4090 provides. Your options below: rent a bigger GPU in the cloud, or upgrade.

Run it in the cloud instead

GLM-5 doesn't fit your 24GB setup. Rent a GPU by the second — no hardware purchase needed.

RunPod (Best for big models)

Per-second GPU rental from $0.20/hr. Spin up an A100, H100, or 4090 in seconds and run any model.

Vast.ai

Marketplace of consumer + datacenter GPUs. Often the cheapest spot prices for inference.

Lambda

On-demand H100s and A100s with reserved-instance pricing for production workloads.

Together AI (No setup)

Pay-per-token serverless inference. No GPU setup — just call the API.

Affiliate links — we earn a commission at no cost to you.

Or upgrade your hardware

Hardware that could run this model locally:

Apple Mac Studio M3 Ultra, 192GB (~$7,499)

Unified memory means ~190GB of usable model RAM in a single quiet box. Runs 405B at Q4.

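NVIDIA H100 80GB (~$30,000)

Datacenter-grade. A single 80GB card cannot hold the 140.4GB Q4_K_M quant, so running GLM-5 locally takes at least two. Most users should rent rather than buy; see the cloud options above.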

Full model details

GLM-5

All quant variants, benchmark scores, and use-case tags.

Best models for this GPU

NVIDIA RTX 4090

Top-ranked open-source models that fit in 24.0GB.

FAQ

Can the NVIDIA RTX 4090 run GLM-5?

No. GLM-5 (230B) needs at least 140.4GB even at its smallest quantization, more than the 24.0GB on the NVIDIA RTX 4090.
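As a rough cross-check on that figure, weight memory can be estimated as parameter count times bits per weight. The sketch below is illustrative only: the 4.85 bits-per-weight value for Q4_K_M is an approximate assumption, the function names are made up for this example, and it ignores KV cache and runtime overhead, so it will not exactly reproduce the 140.4GB shown on this page.

```python
import math

# Assumed average bits per weight for Q4_K_M (approximate; varies by model).
Q4_K_M_BITS_PER_WEIGHT = 4.85


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate memory for the weights alone, in decimal GB.

    Billions of parameters times bytes per parameter gives billions of bytes.
    """
    return params_billion * bits_per_weight / 8


def min_gpus(required_gb: float, gpu_memory_gb: float) -> int:
    """Smallest number of identical GPUs whose combined memory covers required_gb.

    Naive count: ignores per-GPU overhead and uneven layer splits.
    """
    return math.ceil(required_gb / gpu_memory_gb)


estimate = weight_memory_gb(230, Q4_K_M_BITS_PER_WEIGHT)
print(f"Estimated Q4_K_M weights: {estimate:.1f}GB")
print(f"Fits in 24.0GB: {estimate <= 24.0}")
print(f"24GB cards needed for 140.4GB: {min_gpus(140.4, 24.0)}")
print(f"80GB cards needed for 140.4GB: {min_gpus(140.4, 80.0)}")
```

Under that assumption the weights alone come to roughly 139GB, in the same range as the 140.4GB requirement and far above what a 24.0GB card offers.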

What's the best quantization to use?

None of GLM-5's available quantizations fit in 24.0GB. You'll need a larger GPU, a smaller model, or a cloud instance.

What if I need more headroom for context length?

KV cache memory grows with context length. The numbers above assume a baseline 2K-4K context. For long-context use (32K+), add another 2-6GB depending on the model architecture.
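To see why the cache grows with context, here is a minimal sketch of the usual KV cache estimate for a transformer with standard or grouped-query attention. The layer count, KV head count, and head dimension are placeholder values, not GLM-5's actual architecture, and the function name is hypothetical. Models that compress the cache (for example with latent attention or a quantized cache) will need less.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_value: int = 2,  # 2 bytes per value assumes an fp16/bf16 cache
    batch_size: int = 1,
) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return (
        2 * n_layers * n_kv_heads * head_dim
        * context_len * bytes_per_value * batch_size
    )


# Placeholder architecture for illustration only (NOT GLM-5's real config).
HYPOTHETICAL_ARCH = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for context_len in (4096, 32768):
    gib = kv_cache_bytes(context_len=context_len, **HYPOTHETICAL_ARCH) / 2**30
    print(f"{context_len:>6} tokens: {gib:.2f} GiB of KV cache")
```

With these placeholder values the cache grows linearly from 0.75 GiB at 4K tokens to 6 GiB at 32K tokens, which is why long-context use needs extra headroom on top of the weights.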