
Renting vs. Purchasing GPUs for LLMs in Canada

Published 2025-11-08.
Time to read: 6 minutes.

This page is part of the llm collection.

At current hourly rental rates and a typical interactive workload of a few hours a day, it would take over 13 years of rental fees to equal the roughly $4,500 CAD price of a new RTX 4090 GPU in Canada. Clearly, renting GPUs for LLMs makes much more economic sense than buying.

Swapping

Video cards swap memory out of VRAM to free up space for the program that needs to run next, much as operating systems page out immutable files such as executable programs. This technique is also known as virtual memory, and it allows a computer or video card to run more applications than its physical RAM alone could support.

My NVIDIA RTX 3060 has 12 GB of GDDR6 VRAM, and its bus interface is PCIe 4.0 x16. Grok says that swapping on such a system should take 200–800 ms per layer.

llama.cpp offloads individual layers as needed, releasing approximately 2–4 GB per layer. A 9B Q5_K_M model occupying about 7 GB of VRAM is expected to swap 1–2 layers in ~400–600 ms.
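For readers who want to control offload themselves, here is a minimal sketch using the llama-cpp-python bindings. The model path, layer count, and context size are assumptions for a 9B Q5_K_M GGUF model on a 12 GB card, not a tested configuration:

```python
# Minimal sketch of partial GPU offload with the llama-cpp-python
# bindings. The model path, n_gpu_layers value, and context size are
# assumptions for a 9B Q5_K_M GGUF model on a 12 GB card.
from llama_cpp import Llama

llm = Llama(
    model_path="models/glm-9b.Q5_K_M.gguf",  # hypothetical local file
    n_gpu_layers=35,  # offload most layers to VRAM; -1 offloads them all
    n_ctx=8192,       # 8K context, matching the tables below
)

result = llm("Explain VRAM swapping in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])
```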

A full model swap is not usually required. That is just as well, because my 3060 is expected to take 600–800 ms to fully flush, and a full reload would take another 800–1,200 ms. The total time to fully flush and reload all 12 GB of VRAM is roughly 1.4–2 seconds!
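These figures are easy to sanity-check from the bus bandwidth. PCIe 4.0 x16 peaks at about 32 GB/s in theory, but sustained host-device copies run well below that; the sketch below reproduces the flush and reload estimates above, with the effective transfer rates being my assumptions:

```python
# Back-of-the-envelope swap-time estimate for a 12 GB card on PCIe 4.0 x16.
# The sustained transfer rates are assumptions: the bus peaks at ~32 GB/s,
# but large host<->device copies typically achieve far less.
VRAM_GB = 12.0
FLUSH_GBPS = (15.0, 20.0)   # assumed sustained device-to-host rates
RELOAD_FACTOR = 0.7         # assumed: reloads run slower than flushes

flush_s = [VRAM_GB / bw for bw in FLUSH_GBPS]
reload_s = [VRAM_GB / (bw * RELOAD_FACTOR) for bw in FLUSH_GBPS]

print(f"flush:  {flush_s[1] * 1000:.0f}-{flush_s[0] * 1000:.0f} ms")    # 600-800 ms
print(f"reload: {reload_s[1] * 1000:.0f}-{reload_s[0] * 1000:.0f} ms")  # ~860-1140 ms
```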

Swapping costs include VRAM offloading latency and memory paging overhead. The LLM VRAM Calculator for Self-Hosting provides good qualitative and quantitative explanations.

The rest of this article does not yet take swap time into account. Stay tuned for a future version of this article with lots more detail on this important factor.

Desktop GPU Comparison for LLMs

LLMs need at least one GPU to run on, and GPUs are expensive.

Best value is defined as the highest tokens/second per dollar (t/s/$) at new retail prices as of November 2025. The best-value workstation GPUs available today for the LLMs discussed in this article are:

The NVIDIA RTX 3060 is the value winner! The 3060’s 12 GB VRAM handles 7B–14B models fully at Q5/Q4 quantization (e.g., GLM-9B at ~55–75 t/s, Qwen3-14B at ~35–50 t/s) and supports 30B models like Qwen3-30B with minimal offload (~20–30 t/s).

The NVIDIA RTX 3090 (24 GB GDDR6X VRAM) provides the best tokens/second per dollar (t/s/$) among new 24 GB GPUs as of November 2025. It enables full Q5/Q4 loads for up to 30B models (~70–100 t/s averaged across sizes, 4K–8K context) without offload, outperforming pricier options like the RTX 4090 in cost-efficiency. Its Ampere architecture delivers ~70–80% of Ada Lovelace speeds but with mature CUDA support for LLMs, making it ideal for interactive coding tasks.

An RTX 3090 would approximately double the performance over the RTX 3060 on 30B models (e.g., from 20–30 t/s to 50–70 t/s) at excellent value (0.08–0.11 t/s/$). It's widely available, power-efficient (350W), and should be competitive for about 2 years.

The NVIDIA RTX 4090 is the value runner-up among 24 GB GPUs. It delivers 2 to 3 times the speed of the 3060 (a marginal increase over the 3090), but costs 6 to 8 times as much as the 3060 to purchase.
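To make the t/s/$ arithmetic concrete, here is a small sketch. The purchase prices are illustrative CAD figures I am assuming for this example, and the throughputs are midpoints of the 30B-class ranges quoted above:

```python
# Tokens/second per purchase dollar (t/s/$) on 30B-class models.
# Prices are assumed CAD figures for illustration only; throughputs
# are midpoints of the ranges quoted in this article.
gpus = {
    "RTX 3060": {"price_cad": 600, "tps": 25},   # 20-30 t/s with offload
    "RTX 3090": {"price_cad": 650, "tps": 60},   # 50-70 t/s, no offload
    "RTX 4090": {"price_cad": 4500, "tps": 87},  # roughly 80-95 t/s
}

ranked = sorted(gpus.items(), key=lambda kv: kv[1]["tps"] / kv[1]["price_cad"], reverse=True)
for name, g in ranked:
    print(f"{name}: {g['tps'] / g['price_cad']:.3f} t/s/$")
# The 3090 comes out ahead (~0.09 t/s/$), matching the 0.08-0.11 range above.
```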

Remote GPUs

On-demand remote virtualized GPUs can save a heavy user of Anthropic, OpenAI, Google, and similar services up to 90% of their costs.

Renting time on remote GPUs is a quick, simple, easy, and cost-effective way to set up LLMs, provided sufficient internet bandwidth is available.

Companies like HiveNet offer 5 hours of dedicated NVIDIA RTX 4090 GPU time (24 GB VRAM) for 1 euro ($1.62 CAD). That really opens up possibilities.

HiveNet publishes a step-by-step guide on how to deploy Llama 3.1 8B on Compute with Hivenet.

Scaleway and Hivenet both offer attractive pricing. Unfortunately, these two European vendors do not have datacenters in North America, so latency is a problem. The Canadian Vendors section below shows vendors that have Canadian datacenters.

At HiveNet's rate of about $0.32 per hour, the roughly $4,500 CAD price of a new RTX 4090 buys nearly 14,000 hours of GPU time. At a typical interactive workload of about three hours per day, that works out to approximately 160 months (over 13 years) of rental before buying breaks even. Clearly renting GPUs for LLMs makes much more economic sense than buying.
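Here is that break-even arithmetic as a script; the hours-per-day figure is an assumption chosen to match the usage pattern behind the 160-month estimate:

```python
# Rent-vs-buy break-even for an RTX 4090, using figures from this article.
PURCHASE_CAD = 4500.0           # typical new RTX 4090 price in Canada
RATE_CAD_PER_HOUR = 1.62 / 5.0  # HiveNet: 5 GPU-hours for 1 euro ($1.62 CAD)
HOURS_PER_DAY = 2.9             # assumed interactive usage, ~3 h/day
DAYS_PER_MONTH = 30.4

rental_hours = PURCHASE_CAD / RATE_CAD_PER_HOUR
months = rental_hours / (HOURS_PER_DAY * DAYS_PER_MONTH)
print(f"{rental_hours:,.0f} rental hours, or about {months:.0f} months")
# -> ~13,889 rental hours, or about 158 months (over 13 years)
```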

Desktop / Datacenter GPU Comparison

GPUs used in datacenters differ from desktop GPUs primarily in their design for large-scale, continuous workloads versus gaming or general use. While desktop GPUs prioritize features for single-machine entertainment like video outputs, datacenter GPUs tend to feature more VRAM and higher bandwidth, a focus on reliability and longevity for 24/7 operation, and specialized hardware for computational tasks like AI and HPC.

The maximum VRAM for the NVIDIA A10 and L4 is the same: 24 GB. The L4 is a newer, more efficient GPU for LLMs that fit into memory, while the A10's higher memory bandwidth makes it better for large LLMs that require extensive swapping.

Note that the 4090 has a ~250 ms per-layer swap time, significantly less than the L4 at ~350 ms and the A10 at ~400 ms. This is partly why the L4 and A10 cost less to rent. If your LLMs and their data fit into memory, the L4 and A10 are more cost-effective choices than the 4090.

See nvidia.com for more details about GPU virtualization and which GPUs swap quickly.

Larger datacenter GPUs are available, but their relative cost is roughly 50 times higher. For example, the A100 and H100 offer 80 GB of VRAM for the very largest models with significant data volumes. Ample VRAM reduces or eliminates the need for swapping.
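A rough rule of thumb for whether a model fits without swapping: the weights take parameters × bits-per-weight ÷ 8 bytes, plus headroom for the KV cache and runtime buffers. The sketch below encodes that rule; the bits-per-weight averages and the 20% overhead figure are approximations I am assuming:

```python
# Rough VRAM requirement for a quantized model (rule of thumb, not exact).
# Assumed average bits per weight: ~4.85 for Q4_K_M, ~5.7 for Q5_K_M.
# The 20% overhead for KV cache and buffers is an assumption; long
# contexts need more.
def vram_gb(params_billions: float, bits_per_weight: float, overhead: float = 0.20) -> float:
    weights_gb = params_billions * bits_per_weight / 8.0
    return weights_gb * (1.0 + overhead)

for name, params, bits in [
    ("9B Q5_K_M", 9, 5.7),     # ~7.7 GB: tight on a 12 GB card
    ("30B Q4_K_M", 30, 4.85),  # ~21.8 GB: fits a 24 GB card
    ("70B Q4_K_M", 70, 4.85),  # ~50.9 GB: needs an 80 GB card
]:
    print(f"{name}: ~{vram_gb(params, bits):.1f} GB")
```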

The following table shows the performance of each GPU on small LLMs, sorted by cost factor. Swap times are shown for running larger models or multiple models.

The Relative Cost column uses the cost of a rented 3060 as the baseline. The cost factors for the other GPUs were computed by asking Grok to take the median of pricing from vendors that service the Canadian market (AWS, Hetzner, Lambda Labs, OVHcloud Canada, RunPod, and Vast.ai) and convert to $CAD.

| GPU Model | Relative Cost | VRAM | Gemma2 9B (t/s) | DeepSeek 6.7B (t/s) | Swap Latency (ms/layer) | Notes |
|-----------|--------------:|------|-----------------|---------------------|-------------------------|-------|
| H100 80GB | 55.6 | 80 GB | 140–160 | 180–220 | None | Fastest enterprise GPU; ample memory headroom. |
| A100 80GB | 50.2 | 80 GB | 135–155 | 175–210 | None | Enterprise GPU; large memory headroom. |
| RTX 3090 | 2.2 | 24 GB | 80–90 | 110–130 | ~300 | Consumer GPU; excellent for 8K contexts. |
| RTX 4090 | 2.0 | 24 GB | 100–115 | 130–150 | ~250 | Top consumer GPU. |
| A10 | 1.8 | 24 GB | 65–75 | 90–110 | ~400 | Best-value datacenter GPU. |
| RTX 4000 Ada | 1.6 | 20 GB | 60–70 | 85–105 | ~500 | Balanced mix of speed, power, and cost for general ML workloads in a datacenter. |
| L4 | 1.5 | 24 GB | 70–80 | 95–115 | ~350 | Efficient inference; datacenter GPU. |
| RTX 3060 | 1.0 | 12 GB | 50–60 | 75–90 | ~800 | Consumer baseline; slow swap, limited memory. |

The same GPUs are shown below, sorted by speed when running large LLMs matched to each GPU's capabilities, using an 8K context and Q4_K_M quantization.

| GPU Model | Speed (t/s) | VRAM | Max Model | Example Model | Notes |
|-----------|-------------|------|-----------|---------------|-------|
| H100 80GB | 110–130 | 80 GB | 70B+ | Llama 3.1 70B | Largest models; no swap. |
| A100 80GB | 105–125 | 80 GB | 70B+ | Mixtral 8x22B | Largest models; no swap. |
| RTX 4090 | 80–95 | 24 GB | 34B | Llama 3 34B | Top consumer GPU; near-professional. |
| RTX 3090 | 70–80 | 24 GB | 30B | DeepSeek-V2 16B | High VRAM; strong consumer GPU. |
| RTX 4000 Ada | 60–70 | 20 GB | 22B | Codestral 22B | Faster than the L4 and A10 because its chosen model is smaller. |
| RTX 3060 | 55–65 | 12 GB | 13B | GLM-Z1-9B | Limited to the smallest LLMs. |
| L4 | 50–60 | 24 GB | 30B | Qwen3-Coder 30B | |
| A10 | 45–55 | 24 GB | 34B | CodeLlama 34B | Full offload; balanced ML. |

Why Are Consumer-Grade GPUs So Expensive?

It is much cheaper to rent GPUs, as the tables in the previous section show. When renting, a 4090 costs only about twice as much as a 3060, and is actually less expensive than a 3090. This makes the 4090 the best value when renting a remote desktop GPU.

There are three main reasons why GPUs cost so much more for consumers than for companies:

  1. Channel margins: manufacturers generally receive only 60–75% of the purchase price when selling through retailers, while distributors usually get 3–7% and retailers get 8–15% (worked through in the sketch below).
  2. Volume pricing: datacenter customers buy in bulk at negotiated discounts.
  3. Consumer demand keeps retail prices high.
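Worked through for a $4,500 GPU, those ranges translate into dollar figures like this (treating the remainder of the price as board partners, logistics, and taxes is my assumption):

```python
# Illustrative split of a $4,500 CAD retail price across the sales channel,
# using the percentage ranges listed above.
RETAIL_CAD = 4500.0
shares = {
    "manufacturer": (0.60, 0.75),
    "distributor": (0.03, 0.07),
    "retailer": (0.08, 0.15),
}

for party, (low, high) in shares.items():
    print(f"{party}: ${RETAIL_CAD * low:,.0f}-${RETAIL_CAD * high:,.0f}")
# manufacturer: $2,700-$3,375; distributor: $135-$315; retailer: $360-$675
```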

Datacenter-quality GPUs can cost less than a 4090 for similar or better performance with LLMs, as shown in the previous section.

Canadian Vendors

The following table shows vendors that provide remote GPU cloud services for AI/ML workloads to users in Canada. These are comparable to the European-based instances offered by Scaleway and Hivenet. Some of the following datacenters support compliance regimes such as PIPEDA. Some vendors charge extra for egress, storage, and other services. Prices are in $CAD.


Rows for each vendor list the same GPU models in a consistent order where available. We can see that the NVIDIA L4 is generally a more economical choice than the NVIDIA A10, and it handles LLMs better.

| Vendor | GPU Model | VRAM | Hourly Rate ($CAD) | Per-Minute Rate ($CAD) | Notes |
|--------|-----------|------|--------------------|------------------------|-------|
| AWS | RTX 4000 Ada | 20 GB | ~2.86–4.28 | ~0.048–0.071 | Workstation-grade. |
| AWS | NVIDIA A10 | 24 GB | ~2.28–3.42 | ~0.038–0.057 | Balanced for ML/graphics; G5 instances. |
| AWS | NVIDIA L4 | 24 GB | ~1.90 | ~0.032 | Entry-level inference; G6 instances. |
| Hetzner | RTX 4000 Ada | 20 GB | ~0.81 | ~0.014 | Auction-based; monthly dedicated also available. |
| Hetzner | A10 | 24 GB | ~1.61 | ~0.027 | Auction for short-term; monthly dedicated available. |
| Hetzner | L4 | 24 GB | ~1.21 | ~0.020 | Auction pricing; dedicated monthly also available. |
| Lambda Labs | RTX 4090 | 24 GB | 0.88 | ~0.015 | On-demand; no spot interruptions. |
| Lambda Labs | A100 80GB | 80 GB | 2.63 | ~0.044 | On-demand; Toronto data center? |
| Lambda Labs | H100 80GB | 80 GB | 4.39 | ~0.073 | Pay-per-minute; private cloud options. |
| OVHcloud Canada | RTX 4000 Ada | 20 GB | ~3.28–5.47 | ~0.055–0.091 | |
| OVHcloud Canada | A10 | 24 GB | ~2.19–3.28 | ~0.036–0.055 | Pay-as-you-go. |
| OVHcloud Canada | L4 | 24 GB | ~1.64 | ~0.027 | Per-minute after 1-hour minimum; Montreal-area data center. |
| RunPod | RTX 4000 Ada | 20 GB | 1.12 | ~0.019 | Scalable to 4×; per-hour rather than per-minute billing. |
| RunPod | A10 | 24 GB | 0.72–1.16 | ~0.012–0.019 | Hourly blocks; auction for short-term. |
| RunPod | L4 | 24 GB | 1.16–1.47 | ~0.019–0.024 | Per-second billing; Canadian availability zones; secure cloud. |
| Vast.ai | RTX 4000 Ada | 20 GB | 0.29–0.74 | ~0.005–0.012 | Marketplace bidding can be 60–80% cheaper. |
| Vast.ai | A10 | 24 GB | 0.72–1.16 | ~0.012–0.019 | |
| Vast.ai | L4 | 24 GB | 0.74–1.16 | ~0.012–0.019 | |
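To compare vendors on a realistic workload rather than raw hourly rates, multiply by your expected usage. The sketch below uses the low-end hourly rates from the table above; the 40 GPU-hours/month workload is an assumption:

```python
# Monthly cost of an assumed 40 GPU-hours/month workload, using the
# low-end $CAD hourly rates from the vendor table above.
rates_cad = {
    "Vast.ai RTX 4000 Ada": 0.29,
    "RunPod A10": 0.72,
    "Hetzner RTX 4000 Ada": 0.81,
    "Lambda Labs RTX 4090": 0.88,
    "Hetzner L4": 1.21,
    "AWS L4": 1.90,
}
HOURS_PER_MONTH = 40  # assumption: roughly two hours of use per day

for vendor, rate in sorted(rates_cad.items(), key=lambda kv: kv[1]):
    print(f"{vendor}: ${rate * HOURS_PER_MONTH:,.2f}/month")
```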
