Published 2025-11-08.
Time to read: 6 minutes.
At current hourly rental rates, it would take over 13 years of typical use before rental fees matched the purchase price of a new RTX 4090 GPU in Canada, which typically costs around $4,500 CAD. Clearly, renting GPUs for LLMs makes much more economic sense than buying.
Swapping
Video cards employ memory swapping of immutable data, such as executable program code, to free up space in VRAM for the program that needs to run next. This is the same idea as virtual memory: it allows a computer or video card to run more applications than its physical RAM alone would permit.
My NVIDIA RTX 3060 has 12 GB of GDDR6 VRAM, and the video card's bus interface is PCIe 4.0 x16. Grok says that swapping on such a system should take 200–800 ms per layer.
llama.cpp offloads individual layers as needed, releasing approximately 2–4 GB per layer. A 9B Q5_K_M model occupying 7 GB of VRAM is expected to swap 1–2 layers in ~400–600 ms.
A full model swap is not usually required. That is a good thing, because my 3060 is expected to take 600–800 ms to fully flush, and a full reload would take another 800–1,200 ms. The total time to fully flush and reload all 12 GB of VRAM is ~1.8 seconds!
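As a rough sanity check on those figures, here is a minimal back-of-the-envelope sketch in Python. The effective bandwidth constant is an assumption (PCIe 4.0 x16 is ~32 GB/s on paper, but sustained transfers achieve far less), and real swaps add paging and allocator overhead on top of the raw transfer time:

```python
# Back-of-the-envelope estimate of VRAM swap latency over PCIe 4.0 x16.
# Assumption: sustained host<->VRAM transfers reach ~16 GB/s in practice,
# well under the ~32 GB/s theoretical maximum. Paging and allocator
# overhead push measured times higher than this raw transfer estimate.

EFFECTIVE_BANDWIDTH_GBPS = 16.0  # assumed sustained transfer rate

def transfer_ms(gigabytes: float) -> float:
    """Milliseconds to move `gigabytes` across the PCIe bus once."""
    return gigabytes / EFFECTIVE_BANDWIDTH_GBPS * 1000.0

# One 2-4 GB layer, as described above:
print(f"per layer: {transfer_ms(2):.0f}-{transfer_ms(4):.0f} ms")

# Full flush plus reload of all 12 GB of VRAM (two transfers):
print(f"full swap: {2 * transfer_ms(12) / 1000:.1f} s")
```

The raw transfer estimates come in under the 200–800 ms per layer quoted above, which is consistent with swap time being dominated by more than just bus bandwidth.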
Swapping costs include VRAM offloading latency and memory paging overhead. The LLM VRAM Calculator for Self-Hosting provides good qualitative and quantitative explanations.
The rest of this article does not yet take swap time into account. Stay tuned for a future version of this article with lots more detail on this important factor.
Desktop GPU Comparison for LLMs
LLMs need at least one GPU to run at usable speeds, and GPUs are expensive.
Best value is defined as the highest tokens/second per dollar (t/s/$) when considering new retail prices as of November 2025. The best value workstation GPUs available today for the LLM models discussed in this article are:
The NVIDIA RTX 3060 is the value winner! The 3060’s 12 GB VRAM handles 7B–14B models fully at Q5/Q4 quantization (e.g., GLM-9B at ~55–75 t/s, Qwen3-14B at ~35–50 t/s) and supports 30B models like Qwen3-30B with minimal offload (~20–30 t/s).
The NVIDIA RTX 3090 (24 GB GDDR6X VRAM) provides the best tokens/second per dollar (t/s/$) among new 24 GB GPUs as of November 2025. It enables full Q5/Q4 loads for up to 30B models (~70–100 t/s averaged across sizes, 4K–8K context) without offload, outperforming pricier options like the RTX 4090 in cost-efficiency. Its Ampere architecture delivers ~70–80% of Ada Lovelace speeds but with mature CUDA support for LLMs, making it ideal for interactive coding tasks.
An RTX 3090 would approximately double the performance over the RTX 3060 on 30B models (e.g., from 20–30 t/s to 50–70 t/s) at excellent value (0.08–0.11 t/s/$). It's widely available, power-efficient (350W), and should be competitive for about 2 years.
The NVIDIA RTX 4090 is the value runner-up among 24 GB GPUs. It delivers 2 to 3 times the speed of the 3060 (a marginal increase over the 3090), but costs 6 to 8 times as much as the 3060 to purchase.
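To make the t/s/$ metric concrete, here is a small Python sketch that scores each card on the model class it handles best, per the figures above. The $CAD prices are rough assumptions (the 3090 price is back-solved from the 0.08–0.11 t/s/$ range quoted earlier), so treat the output as illustrative:

```python
# Tokens/second per purchase dollar (t/s/$), each GPU running the
# model class it handles best per the text above. Prices in $CAD are
# rough assumptions, not quotes.
gpus = {
    # name:      (t/s midpoint, assumed price $CAD)
    "RTX 3060": (65,   600),   # GLM-9B at ~55-75 t/s
    "RTX 3090": (60,   630),   # 30B models at ~50-70 t/s
    "RTX 4090": (87,  4500),   # Llama 3 34B at ~80-95 t/s
}

for name, (tps, price) in gpus.items():
    print(f"{name}: {tps / price:.3f} t/s/$")
```

With these assumptions the 3060 scores ~0.108, the 3090 ~0.095, and the 4090 ~0.019, matching the ranking above.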
Remote GPUs
On-demand remote virtualized GPUs can save a heavy user of Anthropic, OpenAI, Google, and similar services up to 90% of their costs.
Renting time on remote GPUs is a quick, simple, easy, and cost-effective way to set up LLMs, provided sufficient internet bandwidth is available.
Companies like HiveNet offer 5 hours of dedicated NVIDIA 4090 GPU time (24 GB VRAM) for 1 Euro ($1.62 CAD), which works out to about $0.32 CAD per hour. That really opens up possibilities.
A step-by-step guide explains how to deploy Llama 3.1 8B on Compute with Hivenet.
Scaleway and Hivenet both offer attractive pricing. Unfortunately, these two European vendors do not have datacenters in North America, so latency is a problem. The Canadian Vendors section below shows vendors that have Canadian datacenters.
At HiveNet's rate of $1.62 CAD for five hours (about $0.32 per hour), it would take approximately 160 months (over 13 years) of typical daily use to spend the roughly $4,500 CAD that a new RTX 4090 GPU costs in Canada. Clearly, renting GPUs for LLMs makes much more economic sense than buying.
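The arithmetic behind that payback estimate can be checked in a few lines of Python; the hours-per-day figure is an assumed usage pattern, and shifting it moves the result accordingly:

```python
# Rent-vs-buy payback: how long until cumulative rental fees reach
# the purchase price? The daily-usage figure is an assumption.

GPU_PRICE_CAD   = 4500.0      # new RTX 4090, typical Canadian retail
RATE_CAD_PER_HR = 1.62 / 5    # HiveNet: $1.62 CAD buys 5 hours
HOURS_PER_DAY   = 3.0         # assumed interactive coding usage

hours  = GPU_PRICE_CAD / RATE_CAD_PER_HR
months = hours / (HOURS_PER_DAY * 30.44)
print(f"{hours:,.0f} rented hours = {months:.0f} months to match the price")
# -> 13,889 rented hours = 152 months; the ~160 months above reflects
#    a slightly lighter usage assumption.
```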
Desktop / Datacenter GPU Comparison
GPUs used in datacenters differ from desktop GPUs primarily in their design for large-scale, continuous workloads versus gaming or general use. While desktop GPUs prioritize features for single-machine entertainment like video outputs, datacenter GPUs tend to feature more VRAM and higher bandwidth, a focus on reliability and longevity for 24/7 operation, and specialized hardware for computational tasks like AI and HPC.
The maximum VRAM possible for NVIDIA A10 and L4 is the same: 24 GB. The NVIDIA L4 is a newer, more efficient GPU for LLMs that fit into memory, while the NVIDIA A10 is better for working with large LLMs that require extensive swapping.
Note that the 4090 has a ~250 ms per-layer swap time, significantly less than the L4 at ~350 ms and the A10 at ~400 ms. This is why the L4 and A10 cost less to rent. If your LLMs and their data fit into memory, then the L4 and A10 would be more cost-effective choices for your needs than the 4090.
See nvidia.com for more details about GPUs designed for fast swapping (also known as virtualization).
Larger datacenter GPUs are available, but at a relative cost roughly 50 times higher. For example, the A100 and H100 offer 80 GB of VRAM for use on the very largest models with significant data volumes. Lots of VRAM reduces or eliminates the need for swapping.
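A quick way to tell whether a model will need swapping is to estimate its quantized footprint against the card's VRAM. The sketch below uses rule-of-thumb constants (roughly 0.6 GB per billion parameters at Q4_K_M, plus a fixed allowance for KV cache and runtime overhead); both constants are assumptions, not measurements:

```python
# Rough VRAM footprint check: does a quantized model fit without swapping?
# Rule-of-thumb constants below are assumptions, not measurements.

GB_PER_B_Q4 = 0.6    # ~0.6 GB per billion parameters at Q4_K_M
KV_CACHE_GB = 1.5    # assumed KV cache + runtime overhead at 8K context

checks = [(9, 12, "GLM-9B on an RTX 3060"),
          (30, 24, "Qwen3-30B on a 3090 / L4 / A10"),
          (70, 24, "Llama 3.1 70B on a 24 GB card")]

for params_b, vram_gb, label in checks:
    need = params_b * GB_PER_B_Q4 + KV_CACHE_GB
    verdict = "fits" if need <= vram_gb else "needs swapping"
    print(f"{label}: ~{need:.1f} GB of {vram_gb} GB -> {verdict}")
```

Reassuringly, the 9B estimate lands at ~6.9 GB, close to the 7 GB figure quoted in the Swapping section.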
The following table shows the performance of each GPU on small LLMs, sorted by cost factor. Swap latencies are relevant when running larger models, or multiple models.
The Relative Cost column uses the cost of a rented 3060 as the baseline. The cost factors for the other GPUs were computed by asking Grok to take the median of pricing from vendors that service the Canadian market (AWS, Hetzner, Lambda Labs, OVHcloud Canada, RunPod, and Vast.ai) and convert to $CAD.
| GPU Model | Relative Cost | VRAM | Gemma2 9B t/s | DeepSeek 6.7B t/s | Swap Latency ms/layer | Notes |
|---|---|---|---|---|---|---|
| H100 80GB | 55.6 | 80 GB | 140–160 | 180–220 | None | Fastest enterprise; has memory headroom. |
| A100 80GB | 50.2 | 80 GB | 135–155 | 175–210 | None | Enterprise; high bandwidth and large memory headroom. |
| RTX 4090 | 2.0 | 24 GB | 100–115 | 130–150 | ~250 | Top consumer version. |
| RTX 3090 | 2.2 | 24 GB | 80–90 | 110–130 | ~300 | Consumer version; excellent for 8K contexts. |
| L4 | 1.5 | 24 GB | 70–80 | 95–115 | ~350 | Efficient inference, datacenter GPU. |
| A10 | 1.8 | 24 GB | 65–75 | 90–110 | ~400 | Best value datacenter GPU. |
| RTX 4000 Ada | 1.6 | 20 GB | 60–70 | 85–105 | ~500 | Balanced mix of speed, power, and cost for general ML workloads in a datacenter. |
| RTX 3060 | 1.0 | 12 GB | 50–60 | 75–90 | ~800 | Consumer baseline, slow swap, limited memory. |
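For reproducibility, the Relative Cost column boils down to a median-and-normalize computation like the following sketch. The $CAD/hour quotes are placeholders chosen only to illustrate the method, not the actual figures Grok returned:

```python
# How the Relative Cost column is derived: median vendor rate per GPU,
# normalized to the RTX 3060 baseline. Rates below are placeholders.
from statistics import median

hourly_cad = {  # vendor quotes per GPU, $CAD/hour (illustrative only)
    "RTX 3060": [0.20, 0.24, 0.28],
    "L4":       [0.30, 0.36, 0.40],
    "RTX 4090": [0.40, 0.48, 0.55],
}

baseline = median(hourly_cad["RTX 3060"])
for gpu, quotes in hourly_cad.items():
    print(f"{gpu}: relative cost {median(quotes) / baseline:.1f}")
```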
The same GPUs are shown below, sorted by speed when running large LLMs optimized for each GPU’s capabilities, using an 8K Context, and quantized to Q4_K_M.
| GPU Model | Speed t/s | VRAM | Max Model | Example Model | Notes |
|---|---|---|---|---|---|
| H100 80GB | 110‑130 | 80 GB | 70B+ | Llama 3.1 70B | Largest models; no swap. |
| A100 80GB | 105‑125 | 80 GB | 70B+ | Mixtral 8x22B | Largest models; no swap, high bandwidth. |
| RTX 4090 | 80‑95 | 24 GB | 34B | Llama 3 34B | Top consumer; near-pro. |
| RTX 3090 | 70‑80 | 24 GB | 30B | DeepSeek-V2 16B | High VRAM; strong consumer. |
| RTX 4000 Ada | 60‑70 | 20 GB | 22B | Codestral 22B | Faster than L4 & A10 because the model chosen is smaller. |
| RTX 3060 | 55‑65 | 12 GB | 13B | GLM-Z1-9B | Requires the smallest LLMs. |
| L4 | 50‑60 | 24 GB | 30B | Qwen3-Coder 30B | |
| A10 | 45‑55 | 24 GB | 34B | CodeLlama 34B | Full offload; balanced ML. |
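The Max Model column follows from the same footprint rule of thumb used earlier, inverted: given the VRAM, solve for the largest parameter count that fits. The constants remain assumptions:

```python
# Invert the footprint estimate: largest Q4_K_M model a card can hold
# without swapping. Constants are the same assumptions as above.
GB_PER_B_Q4 = 0.6   # assumed GB per billion parameters at Q4_K_M
KV_CACHE_GB = 1.5   # assumed KV cache + runtime overhead at 8K context

def max_params_b(vram_gb: float) -> float:
    return (vram_gb - KV_CACHE_GB) / GB_PER_B_Q4

for name, vram in [("RTX 3060", 12), ("RTX 4090", 24), ("H100", 80)]:
    print(f"{name}: up to ~{max_params_b(vram):.0f}B parameters")
```

The table's Max Model figures are more conservative than these estimates, which leaves headroom for longer contexts.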
Why Are Consumer-Grade GPUs So Expensive?
It is much cheaper to rent GPUs, as shown in the tables in the previous section. When renting, the cost of a 4090 is only 2x the cost of a 3060, and is actually less expensive than a 3090. This makes the 4090 the best value when renting a remote desktop GPU.
There are three main reasons why GPUs cost so much more for consumers than for companies:
- Retail channel margins: manufacturers generally receive only 60–75% of the purchase price when selling through retailers; distributors usually take 3–7%, and retailers 8–15%.
- Volume pricing: datacenters buy in bulk at negotiated discounts that individual buyers never see.
- Consumer demand keeps retail prices high.
Datacenter-quality GPUs can cost less than a 4090 for similar or better performance with LLMs, as shown in the previous section.
Canadian Vendors
The following table shows vendors that provide remote GPU cloud services for AI/ML workloads to users in Canada. These are comparable to the European-based instances offered by Scaleway and Hivenet. Some of the following datacenters support compliance regimes (e.g., PIPEDA). Some vendors also charge for egress, storage, and other items. Prices are in $CAD.
The table rows for each vendor are sorted by increasing GPU capability for LLMs. We can see that the NVIDIA L4 is a more economical choice than the NVIDIA A10, and it handles LLMs better.
| Vendor Name | GPU Model | VRAM | Hourly Rate ($CAD) | Per-Minute Rate ($CAD) | Notes |
|---|---|---|---|---|---|
| AWS | RTX 4000 Ada | 20 GB | ~2.86–4.28 | ~0.048–0.071 | Workstation-grade. |
| | A10 | 24 GB | ~2.28–3.42 | ~0.038–0.057 | Balanced for ML/graphics. |
| | L4 | 24 GB | ~1.90 | ~0.032 | Entry-level inference; G5 instances. |
| Hetzner | RTX 4000 Ada | 20 GB | ~0.81 | ~0.014 | Auction-based; monthly dedicated also available. |
| | A10 | 24 GB | ~1.61 | ~0.027 | Auction for short-term; monthly dedicated available. |
| | L4 | 24 GB | ~1.21 | ~0.020 | Auction pricing; dedicated monthly also available. |
| Lambda Labs | RTX 4090 | 24 GB | 0.88 | ~0.015 | On-demand; no spot interruptions. |
| | A100 80GB | 80 GB | 2.63 | ~0.044 | On-demand; Toronto data center? |
| | H100 80GB | 80 GB | 4.39 | ~0.073 | Pay-per-minute; private cloud options. |
| OVHcloud Canada | RTX 4000 Ada | 20 GB | ~3.28–5.47 | ~0.055–0.091 | |
| | A10 | 24 GB | ~2.19–3.28 | ~0.036–0.055 | Pay-as-you-go. |
| | L4 | 24 GB | ~1.64 | ~0.027 | Per-minute after 1-hour minimum; Montreal-area data center. |
| RunPod | RTX 4000 Ada | 20 GB | 1.12 | ~0.019 | Scalable to 4x; billed per-hour, not per-minute. |
| | A10 | 24 GB | 0.72–1.16 | ~0.012–0.019 | Hourly blocks; auction for short-term. |
| | L4 | 24 GB | 1.16–1.47 | ~0.019–0.024 | Per-second billing; Canadian availability zones; secure cloud. |
| Vast.ai | RTX 4000 Ada | 20 GB | 0.29–0.74 | ~0.005–0.012 | Marketplace bidding can be 60–80% cheaper. |
| | A10 | 24 GB | 0.72–1.16 | ~0.012–0.019 | |
| | L4 | 24 GB | 0.74–1.16 | ~0.012–0.019 | |
The above table is very wide; you may only be able to see a portion of it at a time. Drag left over the table with your finger, or use the horizontal scrollbar, to see the rest.
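To compare vendors for a concrete workload, multiply an hourly rate by expected usage. Here is a sketch using the low end of each vendor's L4 range from the table above; the three-hours-per-day figure is an assumption:

```python
# Monthly cost of a part-time LLM workload on an NVIDIA L4 at several
# vendors. Rates are the low end of the $CAD/hour ranges tabulated above.
HOURS_PER_MONTH = 3 * 30  # assumed 3 hours/day of interactive use

l4_rates = {  # $CAD/hour
    "AWS":      1.90,
    "Hetzner":  1.21,
    "OVHcloud": 1.64,
    "RunPod":   1.16,
    "Vast.ai":  0.74,
}

for vendor, rate in sorted(l4_rates.items(), key=lambda kv: kv[1]):
    print(f"{vendor}: ~${rate * HOURS_PER_MONTH:,.2f} CAD/month")
```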