Published 2025-11-08.
Time to read: 6 minutes.
At current hourly rental rates, it would take over 13 years of typical use before rental fees matched the purchase price of a new RTX 4090 GPU in Canada, which typically costs around $4,500 CAD. Clearly, renting GPUs for LLMs makes much more economic sense than buying.
Swapping
Video cards employ memory swapping of immutable data, such as executable program code, to free up space in VRAM for the program that needs to run next. This is the same idea as virtual memory: it allows a computer or video card to run more applications than its physical RAM alone would permit.
My NVIDIA RTX 3060 has 12 GB of GDDR6 VRAM, and the video card's bus interface is PCIe 4.0 x16. Grok says that swapping on such a system should take 200–800 ms per layer.
llama.cpp offloads individual layers as needed, releasing approximately 2–4 GB per layer. A 9B Q5_K_M model occupying 7 GB of VRAM is expected to swap 1–2 layers in ~400–600 ms.
A full model swap is not usually required. That is a good thing, because my 3060 is expected to take 600–800 ms to fully flush, and a full reload would take another 800–1,200 ms. The total time to fully flush and reload all 12 GB of VRAM is ~1.8 seconds!
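As a rough sanity check on those figures, here is a minimal back-of-the-envelope sketch in Python. The effective bandwidth constant is an assumption (PCIe 4.0 x16 is ~32 GB/s on paper, but sustained transfers achieve far less), and real swaps add paging and allocator overhead on top of the raw transfer time:

```python
# Back-of-the-envelope estimate of VRAM swap latency over PCIe 4.0 x16.
# Assumption: sustained host<->VRAM transfers reach ~16 GB/s in practice,
# well under the ~32 GB/s theoretical maximum. Paging and allocator
# overhead push measured times higher than this raw transfer estimate.

EFFECTIVE_BANDWIDTH_GBPS = 16.0  # assumed sustained transfer rate

def transfer_ms(gigabytes: float) -> float:
    """Milliseconds to move `gigabytes` across the PCIe bus once."""
    return gigabytes / EFFECTIVE_BANDWIDTH_GBPS * 1000.0

# One 2-4 GB layer, as described above:
print(f"per layer: {transfer_ms(2):.0f}-{transfer_ms(4):.0f} ms")

# Full flush plus reload of all 12 GB of VRAM (two transfers):
print(f"full swap: {2 * transfer_ms(12) / 1000:.1f} s")
```

The raw transfer estimates come in under the 200–800 ms per layer quoted above, which is consistent with swap time being dominated by more than just bus bandwidth.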
Swapping costs include VRAM offloading latency and memory paging overhead. The LLM VRAM Calculator for Self-Hosting provides good qualitative and quantitative explanations.
The rest of this article does not yet take swap time into account. Stay tuned for a future version of this article with lots more detail on this important factor.
Desktop GPU Comparison for LLMs
LLMs need at least one GPU to run at usable speeds, and GPUs are expensive.
Best value is defined as the highest tokens/second per dollar (t/s/$) when considering new retail prices as of November 2025. The best value workstation GPUs available today for the LLM models discussed in this article are:
The NVIDIA RTX 3060 is the value winner! The 3060’s 12 GB VRAM handles 7B–14B models fully at Q5/Q4 quantization (e.g., GLM-9B at ~55–75 t/s, Qwen3-14B at ~35–50 t/s) and supports 30B models like Qwen3-30B with minimal offload (~20–30 t/s).
The NVIDIA RTX 3090 (24 GB GDDR6X VRAM) provides the best tokens/second per dollar (t/s/$) among new 24 GB GPUs as of November 2025. It enables full Q5/Q4 loads for up to 30B models (~70–100 t/s averaged across sizes, 4K–8K context) without offload, outperforming pricier options like the RTX 4090 in cost-efficiency. Its Ampere architecture delivers ~70–80% of Ada Lovelace speeds but with mature CUDA support for LLMs, making it ideal for interactive coding tasks.
An RTX 3090 would approximately double the performance over the RTX 3060 on 30B models (e.g., from 20–30 t/s to 50–70 t/s) at excellent value (0.08–0.11 t/s/$). It's widely available, power-efficient (350W), and should be competitive for about 2 years.
The NVIDIA RTX 4090 is the value runner-up among 24 GB GPUs. It delivers 2 to 3 times the speed of the 3060 (a marginal increase over the 3090), but costs 6 to 8 times as much as the 3060 to purchase.
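To make the t/s/$ metric concrete, here is a small Python sketch that scores each card on the model class it handles best, per the figures above. The $CAD prices are rough assumptions (the 3090 price is back-solved from the 0.08–0.11 t/s/$ range quoted earlier), so treat the output as illustrative:

```python
# Tokens/second per purchase dollar (t/s/$), each GPU running the
# model class it handles best per the text above. Prices in $CAD are
# rough assumptions, not quotes.
gpus = {
    # name:      (t/s midpoint, assumed price $CAD)
    "RTX 3060": (65,   600),   # GLM-9B at ~55-75 t/s
    "RTX 3090": (60,   630),   # 30B models at ~50-70 t/s
    "RTX 4090": (87,  4500),   # Llama 3 34B at ~80-95 t/s
}

for name, (tps, price) in gpus.items():
    print(f"{name}: {tps / price:.3f} t/s/$")
```

With these assumptions the 3060 scores ~0.108, the 3090 ~0.095, and the 4090 ~0.019, matching the ranking above.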
Remote GPUs
On-demand remote virtualized GPUs can save a heavy user of Anthropic, OpenAI, Google, and similar services up to 90% of their costs.
Renting time on remote GPUs is a quick, simple, easy, and cost-effective way to set up LLMs, provided sufficient internet bandwidth is available.
Companies like HiveNet offer 5 hours of dedicated NVIDIA 4090 GPU time (24 GB VRAM) for 1 Euro ($1.62 CAD), which works out to about $0.32 CAD per hour. That really opens up possibilities.
A step-by-step guide explains how to deploy Llama 3.1 8B on Compute with Hivenet.
Scaleway and Hivenet both offer attractive pricing. Unfortunately, these two European vendors do not have datacenters in North America, so latency is a problem. The Canadian Vendors section below shows vendors that have Canadian datacenters.
At HiveNet's rate of $1.62 CAD for five hours (about $0.32 per hour), it would take approximately 160 months (over 13 years) of typical daily use to spend the roughly $4,500 CAD that a new RTX 4090 GPU costs in Canada. Clearly, renting GPUs for LLMs makes much more economic sense than buying.
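The arithmetic behind that payback estimate can be checked in a few lines of Python; the hours-per-day figure is an assumed usage pattern, and shifting it moves the result accordingly:

```python
# Rent-vs-buy payback: how long until cumulative rental fees reach
# the purchase price? The daily-usage figure is an assumption.

GPU_PRICE_CAD   = 4500.0      # new RTX 4090, typical Canadian retail
RATE_CAD_PER_HR = 1.62 / 5    # HiveNet: $1.62 CAD buys 5 hours
HOURS_PER_DAY   = 3.0         # assumed interactive coding usage

hours  = GPU_PRICE_CAD / RATE_CAD_PER_HR
months = hours / (HOURS_PER_DAY * 30.44)
print(f"{hours:,.0f} rented hours = {months:.0f} months to match the price")
# -> 13,889 rented hours = 152 months; the ~160 months above reflects
#    a slightly lighter usage assumption.
```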
Desktop / Datacenter GPU Comparison
GPUs used in datacenters differ from desktop GPUs primarily in their design for large-scale, continuous workloads versus gaming or general use. While desktop GPUs prioritize features for single-machine entertainment like video outputs, datacenter GPUs tend to feature more VRAM and higher bandwidth, a focus on reliability and longevity for 24/7 operation, and specialized hardware for computational tasks like AI and HPC.
The maximum VRAM possible for NVIDIA A10 and L4 is the same: 24 GB. The NVIDIA L4 is a newer, more efficient GPU for LLMs that fit into memory, while the NVIDIA A10 is better for working with large LLMs that require extensive swapping.
Note that the 4090 has a ~250 ms per-layer swap time, significantly less than the L4 at ~350 ms and the A10 at ~400 ms. This is why the L4 and A10 cost less to rent. If your LLMs and their data fit into memory, then the L4 and A10 would be more cost-effective choices for your needs than the 4090.
See nvidia.com for more details about GPUs designed for fast swapping (also known as virtualization).
Larger datacenter GPUs are available, but at a relative cost roughly 50 times higher. For example, the A100 and H100 offer 80 GB of VRAM for use on the very largest models with significant data volumes. Lots of VRAM reduces or eliminates the need for swapping.
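A quick way to tell whether a model will need swapping is to estimate its quantized footprint against the card's VRAM. The sketch below uses rule-of-thumb constants (roughly 0.6 GB per billion parameters at Q4_K_M, plus a fixed allowance for KV cache and runtime overhead); both constants are assumptions, not measurements:

```python
# Rough VRAM footprint check: does a quantized model fit without swapping?
# Rule-of-thumb constants below are assumptions, not measurements.

GB_PER_B_Q4 = 0.6    # ~0.6 GB per billion parameters at Q4_K_M
KV_CACHE_GB = 1.5    # assumed KV cache + runtime overhead at 8K context

checks = [(9, 12, "GLM-9B on an RTX 3060"),
          (30, 24, "Qwen3-30B on a 3090 / L4 / A10"),
          (70, 24, "Llama 3.1 70B on a 24 GB card")]

for params_b, vram_gb, label in checks:
    need = params_b * GB_PER_B_Q4 + KV_CACHE_GB
    verdict = "fits" if need <= vram_gb else "needs swapping"
    print(f"{label}: ~{need:.1f} GB of {vram_gb} GB -> {verdict}")
```

Reassuringly, the 9B estimate lands at ~6.9 GB, close to the 7 GB figure quoted in the Swapping section.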
The following table shows the performance of each GPU on small LLMs, sorted by cost factor. Swap latencies are relevant when running larger models, or multiple models.
The Relative Cost column uses the cost of a rented 3060 as the baseline. The cost factors for the other GPUs were computed by asking Grok to take the median of pricing from vendors that service the Canadian market (AWS, Hetzner, Lambda Labs, OVHcloud Canada, RunPod, and Vast.ai) and convert to $CAD.
| GPU Model | Relative Cost | VRAM | Gemma2 9B t/s | DeepSeek 6.7B t/s | Swap Latency ms/layer | Notes |
|---|---|---|---|---|---|---|
| H100 80GB | 55.6 | 80 GB | 140–160 | 180–220 | None | Fastest enterprise; has memory headroom. |
| A100 80GB | 50.2 | 80 GB | 135–155 | 175–210 | None | Enterprise; high bandwidth and large memory headroom. |
| RTX 4090 | 2.0 | 24 GB | 100–115 | 130–150 | ~250 | Top consumer version. |
| RTX 3090 | 2.2 | 24 GB | 80–90 | 110–130 | ~300 | Consumer version; excellent for 8K contexts. |
| L4 | 1.5 | 24 GB | 70–80 | 95–115 | ~350 | Efficient inference, datacenter GPU. |
| A10 | 1.8 | 24 GB | 65–75 | 90–110 | ~400 | Best value datacenter GPU. |
| RTX 4000 Ada | 1.6 | 20 GB | 60–70 | 85–105 | ~500 | Balanced mix of speed, power, and cost for general ML workloads in a datacenter. |
| RTX 3060 | 1.0 | 12 GB | 50–60 | 75–90 | ~800 | Consumer baseline, slow swap, limited memory. |
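For reproducibility, the Relative Cost column boils down to a median-and-normalize computation like the following sketch. The $CAD/hour quotes are placeholders chosen only to illustrate the method, not the actual figures Grok returned:

```python
# How the Relative Cost column is derived: median vendor rate per GPU,
# normalized to the RTX 3060 baseline. Rates below are placeholders.
from statistics import median

hourly_cad = {  # vendor quotes per GPU, $CAD/hour (illustrative only)
    "RTX 3060": [0.20, 0.24, 0.28],
    "L4":       [0.30, 0.36, 0.40],
    "RTX 4090": [0.40, 0.48, 0.55],
}

baseline = median(hourly_cad["RTX 3060"])
for gpu, quotes in hourly_cad.items():
    print(f"{gpu}: relative cost {median(quotes) / baseline:.1f}")
```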
The same GPUs are shown below, sorted by speed when running large LLMs optimized for each GPU’s capabilities, using an 8K Context, and quantized to Q4_K_M.
| GPU Model | Speed t/s | VRAM | Max Model | Example Model | Notes |
|---|---|---|---|---|---|
| H100 80GB | 110‑130 | 80 GB | 70B+ | Llama 3.1 70B | Largest models; no swap. |
| A100 80GB | 105‑125 | 80 GB | 70B+ | Mixtral 8x22B | Largest models; no swap, high bandwidth. |
| RTX 4090 | 80‑95 | 24 GB | 34B | Llama 3 34B | Top consumer; near-pro. |
| RTX 3090 | 70‑80 | 24 GB | 30B | DeepSeek-V2 16B | High VRAM; strong consumer. |
| RTX 4000 Ada | 60‑70 | 20 GB | 22B | Codestral 22B | Faster than L4 & A10 because the model chosen is smaller. |
| RTX 3060 | 55‑65 | 12 GB | 13B | GLM-Z1-9B | Requires the smallest LLMs. |
| L4 | 50‑60 | 24 GB | 30B | Qwen3-Coder 30B | |
| A10 | 45‑55 | 24 GB | 34B | CodeLlama 34B | Full offload; balanced ML. |
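The Max Model column follows from the same footprint rule of thumb used earlier, inverted: given the VRAM, solve for the largest parameter count that fits. The constants remain assumptions:

```python
# Invert the footprint estimate: largest Q4_K_M model a card can hold
# without swapping. Constants are the same assumptions as above.
GB_PER_B_Q4 = 0.6   # assumed GB per billion parameters at Q4_K_M
KV_CACHE_GB = 1.5   # assumed KV cache + runtime overhead at 8K context

def max_params_b(vram_gb: float) -> float:
    return (vram_gb - KV_CACHE_GB) / GB_PER_B_Q4

for name, vram in [("RTX 3060", 12), ("RTX 4090", 24), ("H100", 80)]:
    print(f"{name}: up to ~{max_params_b(vram):.0f}B parameters")
```

The table's Max Model figures are more conservative than these estimates, which leaves headroom for longer contexts.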
Why Are Consumer-Grade GPUs So Expensive?
It is much cheaper to rent GPUs, as shown in the tables in the previous section. When renting, the cost of a 4090 is only 2x the cost of a 3060, and is actually less expensive than a 3090. This makes the 4090 the best value when renting a remote desktop GPU.
There are three main reasons why GPUs cost so much more for consumers than for companies:
- Retail channel margins: manufacturers generally receive only 60–75% of the purchase price when selling through retailers; distributors usually take 3–7%, and retailers 8–15%.
- Volume pricing: datacenters buy in bulk at negotiated discounts that individual buyers never see.
- Consumer demand keeps retail prices high.
Datacenter-quality GPUs can cost less than a 4090 for similar or better performance with LLMs, as shown in the previous section.
Canadian Vendors
The following table shows vendors that provide remote GPU cloud services for AI/ML workloads to users in Canada. These are comparable to the European-based instances offered by Scaleway and Hivenet. Some of the following datacenters support compliance regimes (e.g., PIPEDA). Some vendors also charge for egress, storage, and other items. Prices are in $CAD.
The table rows for each vendor are sorted by increasing GPU capability for LLMs. We can see that the NVIDIA L4 is a more economical choice than the NVIDIA A10, and it handles LLMs better.
| Vendor Name | GPU Model | VRAM | Hourly Rate ($CAD) | Per-Minute Rate ($CAD) | Notes |
|---|---|---|---|---|---|
| AWS | RTX 4000 Ada | 20 GB | ~2.86–4.28 | ~0.048–0.071 | Workstation-grade. |
| | A10 | 24 GB | ~2.28–3.42 | ~0.038–0.057 | Balanced for ML/graphics. |
| | L4 | 24 GB | ~1.90 | ~0.032 | Entry-level inference; G5 instances. |
| Hetzner | RTX 4000 Ada | 20 GB | ~0.81 | ~0.014 | Auction-based; monthly dedicated also available. |
| | A10 | 24 GB | ~1.61 | ~0.027 | Auction for short-term; monthly dedicated available. |
| | L4 | 24 GB | ~1.21 | ~0.020 | Auction pricing; dedicated monthly also available. |
| Lambda Labs | RTX 4090 | 24 GB | 0.88 | ~0.015 | On-demand; no spot interruptions. |
| | A100 80GB | 80 GB | 2.63 | ~0.044 | On-demand; Toronto data center? |
| | H100 80GB | 80 GB | 4.39 | ~0.073 | Pay-per-minute; private cloud options. |
| OVHcloud Canada | RTX 4000 Ada | 20 GB | ~3.28–5.47 | ~0.055–0.091 | |
| | A10 | 24 GB | ~2.19–3.28 | ~0.036–0.055 | Pay-as-you-go. |
| | L4 | 24 GB | ~1.64 | ~0.027 | Per-minute after 1-hour minimum; Montreal-area data center. |
| RunPod | RTX 4000 Ada | 20 GB | 1.12 | ~0.019 | Scalable to 4x; billed per-hour, not per-minute. |
| | A10 | 24 GB | 0.72–1.16 | ~0.012–0.019 | Hourly blocks; auction for short-term. |
| | L4 | 24 GB | 1.16–1.47 | ~0.019–0.024 | Per-second billing; Canadian availability zones; secure cloud. |
| Vast.ai | RTX 4000 Ada | 20 GB | 0.29–0.74 | ~0.005–0.012 | Marketplace bidding can be 60–80% cheaper. |
| | A10 | 24 GB | 0.72–1.16 | ~0.012–0.019 | |
| | L4 | 24 GB | 0.74–1.16 | ~0.012–0.019 | |
The above table is very wide; you may only be able to see a portion of it at a time. Drag left over the table with your finger, or use the horizontal scrollbar, to see the rest.
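To compare vendors for a concrete workload, multiply an hourly rate by expected usage. Here is a sketch using the low end of each vendor's L4 range from the table above; the three-hours-per-day figure is an assumption:

```python
# Monthly cost of a part-time LLM workload on an NVIDIA L4 at several
# vendors. Rates are the low end of the $CAD/hour ranges tabulated above.
HOURS_PER_MONTH = 3 * 30  # assumed 3 hours/day of interactive use

l4_rates = {  # $CAD/hour
    "AWS":      1.90,
    "Hetzner":  1.21,
    "OVHcloud": 1.64,
    "RunPod":   1.16,
    "Vast.ai":  0.74,
}

for vendor, rate in sorted(l4_rates.items(), key=lambda kv: kv[1]):
    print(f"{vendor}: ~${rate * HOURS_PER_MONTH:,.2f} CAD/month")
```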