The NVIDIA Jetson AGX Thor Developer Kit is the successor to the highly successful NVIDIA Jetson AGX Orin Developer Kit, pushing edge AI performance to an entirely new level. While the Jetson AGX Orin set the standard for robotics, autonomous machines, and edge AI with its balance of performance and power efficiency, the Jetson AGX Thor builds on that foundation with greater computational capability and expanded support for advanced workloads such as physical AI and humanoid robotics. Where Orin was built on the Ampere architecture, Thor moves to the new Blackwell architecture, which brings major performance improvements.
In this tutorial, we’ll walk step-by-step through the setup process: from unboxing the developer kit and flashing the operating system to booting into Ubuntu 24.04.3 LTS and preparing the device for generative AI projects using the vLLM and SGLang inference engines.
Note: this guide is an introduction to the NVIDIA Jetson AGX Thor Developer Kit; NVIDIA also provides a more detailed Quick Start Guide.
Here is the box as it arrived from NVIDIA. I was among the lucky first to receive Jetson Thor for review. Thank you to the NVIDIA team for sending this unit!
Opening the box, you’ll find:
- Jetson AGX Thor Developer Kit
- Power supply (with adapter and regional plugs)
- USB-C to USB-A cable
- Quick start documentation
The moment you pick it up, the Jetson AGX Thor feels solid and well-built. It packs the power of a workstation into a remarkably compact and portable device.
One of the best features is that the NVMe SSD is pre-installed, so you don't have to worry about setting up storage. You can jump straight to flashing the system image and getting to work.
The NVIDIA Jetson AGX Thor represents a significant leap forward compared to the Jetson AGX Orin developer kit. While the Orin was built on the Ampere architecture with up to 2048 CUDA cores, 64 Tensor Cores, and 12 ARM Cortex-A78AE CPU cores, Thor moves to the new Blackwell architecture, integrating 2560 CUDA cores, 96 fifth-generation Tensor Cores, and a 14-core Arm Neoverse-V3AE CPU that runs about 2.6 times faster than Orin’s CPU.
Orin could deliver around 200-275 INT8 TOPS, which serves as the rough baseline for comparison. Thor, on the other hand, reaches about 1000 FP8 TFLOPS and up to 2000 FP4 TFLOPS, with NVIDIA quoting 2070 FP4 TFLOPS in sparse mode. Much of this leap comes from the FP6 and FP4 formats introduced with Blackwell: FP4 doubles the throughput of FP8, which is itself twice as fast as FP16.
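To make that scaling concrete, here is a purely illustrative bit of arithmetic based on the figures above; the FP16 value is derived from the doubling rule rather than taken from an official spec:
# Illustrative arithmetic only: each step down the precision ladder
# (FP16 -> FP8 -> FP4) roughly doubles nominal throughput.
fp8 = 1000        # ~1000 FP8 TFLOPS quoted for Thor
fp16 = fp8 / 2    # FP8 is roughly twice FP16 -> ~500 TFLOPS implied
fp4 = fp8 * 2     # FP4 doubles FP8 again -> ~2000 TFLOPS
print(f"FP16 ~{fp16:.0f}, FP8 ~{fp8}, FP4 ~{fp4} TFLOPS "
      "(NVIDIA quotes 2070 FP4 TFLOPS with sparsity)")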
For information about the interfaces on the developer kit module and carrier board, refer to the user manual for detailed diagrams and descriptions.
Step 2: Preparing the JetPack Image with Balena Etcher
To install Ubuntu on the Jetson, we’ll create a bootable USB stick with NVIDIA’s Jetson Thor installer.
- Download the JetPack 7 ISO image from the Jetson Download Center.
- Insert a USB drive into your PC.
- Use Balena Etcher to flash the Jetson Thor installation image onto the USB stick.
- Select the downloaded image.
- Choose your USB drive.
- Click Flash.
Here’s a summary of the setup process:
Once finished, safely eject the USB stick and plug it into the Jetson AGX Thor.
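If Etcher reports a verification error, or the Jetson later fails to boot from the stick, the download itself may be corrupt. Here is a minimal sketch for checking the image's integrity; the file name is hypothetical, so substitute the image you actually downloaded and compare the digest with the checksum listed in the Jetson Download Center, if one is provided:
import hashlib

# Hypothetical path: replace with the image you actually downloaded.
image_path = "jetson-thor-jetpack7.iso"

sha256 = hashlib.sha256()
with open(image_path, "rb") as f:
    # Read in chunks so multi-gigabyte images don't exhaust RAM.
    for block in iter(lambda: f.read(1 << 20), b""):
        sha256.update(block)

print(f"SHA-256: {sha256.hexdigest()}")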
Flash Jetson Thor from the USB
Now comes the key step of flashing the Jetson Thor itself.
- Power on the Jetson AGX Thor and boot from the USB stick.
- In the installer, select Jetson Thor options.
- Choose Flash Jetson Thor AGX Developer Kit on NVMe.
- Select the version 0.2.0-r38.1 (latest supported at time of writing).
The flashing process will take some time as the image is written to the NVMe drive.
First Display: Welcome to Ubuntu 24.04.3 LTS
Once the flashing completes, remove the USB stick and reboot. From here, you can go through the standard Ubuntu setup wizard:
- Choose your language and region
- Set up your username and password
- Configure networking
You’ll be greeted with: Welcome to Ubuntu 24.04.3 LTS 🎉
After the initial setup, it's essential to update the software to ensure you have the latest features and bug fixes.
sudo apt update
sudo apt install nvidia-jetpack
Depending on the updates available, this process might take some time.
I also recommend setting the Jetson Thor power mode to maximum CPU clock speeds so that the following installation steps run faster. Change the power mode of the NVIDIA Jetson AGX Thor to maximum performance:
sudo nvpmodel -m 0
To check which mode is being used:
sudo nvpmodel -q
NV Power Mode: MAXN
0
Then run the following command:
sudo jetson_clocks
Enabled Legacy persistence mode for GPU 00000000:01:00.0.
All done.
Enabling jetson_clocks makes sure all CPU and GPU cores are clocked at their maximum frequency.
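If you want to confirm that the clocks really are pinned, a quick spot-check is to compare each core's current frequency with its maximum. A minimal sketch, assuming the standard Linux cpufreq sysfs layout present on JetPack's Ubuntu:
import glob

# Compare current vs. maximum frequency for every CPU core (values are in kHz).
for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq")):
    with open(f"{path}/scaling_cur_freq") as f:
        cur = int(f.read())
    with open(f"{path}/scaling_max_freq") as f:
        mx = int(f.read())
    core = path.split("/")[-2]
    print(f"{core}: {cur / 1e6:.2f} GHz / max {mx / 1e6:.2f} GHz")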
The NVIDIA Jetson AGX Thor Developer Kit is now ready for use. You can begin developing generative AI projects right away with the resources provided by NVIDIA.
Running vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
Run the vLLM docker command to start the container:
sudo docker run --runtime=nvidia \
--gpus all \
-it \
--rm \
--network=host \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
thor_vllm_container:25.08-py3-base
Once inside the container, you can start the vLLM server for a specific model. For example, to serve the Qwen3-4B model and allocate 50% of the GPU memory, use the following command:
vllm serve Qwen/Qwen3-4B --gpu-memory-utilization 0.5
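A quick way to confirm the server is up is to list the models it exposes through the OpenAI-compatible API (run this from another terminal; the base URL assumes vLLM's default port 8000):
import openai

# vLLM exposes an OpenAI-compatible API; no real key is needed locally.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Should print the served model id, e.g. Qwen/Qwen3-4B.
for model in client.models.list().data:
    print(model.id)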
Qwen3 is Alibaba’s open-source large language model family, featuring switchable thinking and non-thinking modes for enhanced reasoning and multilingual performance across 119+ languages.
Create a test_model.py script with the following contents:
import openai

# Point the OpenAI client at the local vLLM server; no real API key is needed.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# The model name must match the model passed to `vllm serve`.
stream = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Be concise and accurate."
        },
        {
            "role": "user",
            "content": "What can you tell me about quantum computing?"
        }
    ],
    stream=True,
    max_tokens=500
)

# Print the streamed tokens as they arrive.
for chunk in stream:
    content = chunk.choices[0].delta.content or ""
    print(content, end="", flush=True)
print()
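Qwen3's thinking mode, mentioned above, is controlled through its chat template. With recent vLLM versions you should be able to toggle it per request by passing chat template arguments in the extra body; treat the exact parameter names below as an assumption to verify against your vLLM and Qwen3 versions:
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Hypothetical sketch: disable Qwen3's thinking mode for a single request.
# `chat_template_kwargs` / `enable_thinking` depend on your vLLM and Qwen3
# versions; check their documentation if the request is rejected.
response = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[{"role": "user", "content": "Give me one fact about quantum computing."}],
    max_tokens=100,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)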
For the vLLM inference engine, the Qwen3-4B model was the fastest performer, generating tokens at a rate of up to 15.4 tokens per second. The larger Qwen3-8B model was slower, with a generation speed of about 10.3 tokens per second. The Llama-3.1-8B model performed similarly to the Qwen3-8B, generating about 10.5 tokens per second.
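If you want to reproduce rough numbers like these yourself, you can time the stream client-side. A minimal sketch that counts streamed chunks as a proxy for tokens (chunk counts only approximate true token counts, so treat the result as indicative):
import time
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
chunks = 0
stream = client.chat.completions.create(
    model="Qwen/Qwen3-4B",  # must match the model passed to `vllm serve`
    messages=[{"role": "user", "content": "Explain quantum computing briefly."}],
    stream=True,
    max_tokens=300,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.perf_counter() - start

# Each chunk usually carries roughly one token, so this is an approximation.
print(f"~{chunks / elapsed:.1f} tokens/s over {elapsed:.1f} s")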
Running SGLang
SGLang is a fast-serving framework for LLMs and VLMs, providing OpenAI-compatible APIs.
Unlike vLLM, SGLang can be launched directly with a Python command, which is a straightforward way to get a model server up and running. The following command launches the Qwen3-8B model, specifies the port, sets a large context length, and utilizes a specific reasoning parser.
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000 --context-length 131072 --reasoning-parser qwen3
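Since SGLang exposes the same OpenAI-compatible endpoints, the client pattern from the vLLM section carries over with only the base URL changed (port 30000 here, matching the launch command above):
import openai

# SGLang serves an OpenAI-compatible API on the port passed to launch_server.
client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

# The model name should match the --model-path used when launching the server.
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "What can you tell me about quantum computing?"}],
    max_tokens=500,
)
print(response.choices[0].message.content)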
Once the server is running, you can test it via the REST API or the Python client shown above. Here is an example of a model response:
Quantum computing is a type of computing that leverages **qubits** (quantum bits) instead of classical bits. Unlike classical bits, which are either 0 or 1, qubits can exist in **superposition**, representing both 0 and 1 simultaneously. This allows quantum computers to process vast amounts of information in parallel.
### Key Principles:
1. **Superposition**: Qubits can be in multiple states at once, enabling parallel computation.
2. **Entanglement**: Qubits can be linked, so the state of one instantly influences another, regardless of distance.
3. **Quantum Interference**: Manipulates probabilities to amplify correct solutions and cancel out incorrect ones.
### Advantages:
- Solves specific problems exponentially faster than classical computers
Below is an example of the metrics output from SGLang:
[2025-08-23 09:15:19] max_total_num_tokens=611765, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=2389, context_len=131072, available_gpu_mem=1.35 GB
[2025-08-23 09:15:20] INFO: Started server process [162]
[2025-08-23 09:15:20] INFO: Waiting for application startup.
[2025-08-23 09:15:20] INFO: Application startup complete.
[2025-08-23 09:15:20] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-08-23 09:15:21] INFO: 127.0.0.1:51398 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-08-23 09:15:21] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-08-23 09:15:40] INFO: 127.0.0.1:51414 - "POST /generate HTTP/1.1" 200 OK
[2025-08-23 09:15:40] The server is fired up and ready to roll!
[2025-08-23 09:15:56] INFO: 127.0.0.1:39126 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-08-23 09:15:56] Prefill batch. #new-seq: 1, #new-token: 33, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-08-23 09:15:59] Decode batch. #running-req: 1, #token: 66, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.01, #queue-req: 0
[2025-08-23 09:16:03] Decode batch. #running-req: 1, #token: 106, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.56, #queue-req: 0
[2025-08-23 09:16:07] Decode batch. #running-req: 1, #token: 146, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.56, #queue-req: 0
A key metric is the gen throughput, which measures the token generation speed.
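If you capture the server output to a file, these throughput values are easy to pull out and average. A small sketch (the log file name is an assumption; the first reading is skipped because it includes warm-up overhead, as in the output above):
import re

# Hypothetical log file captured from the SGLang server output.
with open("sglang_server.log") as f:
    log = f.read()

# Matches lines like "gen throughput (token/s): 10.56"
rates = [float(m) for m in re.findall(r"gen throughput \(token/s\): ([\d.]+)", log)]

if len(rates) > 1:
    steady = rates[1:]  # skip the warm-up reading
    print(f"avg gen throughput: {sum(steady) / len(steady):.2f} token/s")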
NVIDIA has introduced NVFP4, a new 4-bit floating-point format for its Blackwell GPUs, designed for ultra-low precision inference with minimal loss in model accuracy. Running FP4 models on Jetson Thor does not yet appear to be supported in SGLang or vLLM, but it's a feature that may be added soon. I'll cover this topic in a future update.
With the NVIDIA Jetson AGX Thor Developer Kit, you have an edge AI powerhouse for robotics, multimodal AI, and generative applications, so get ready to unleash your creativity and build impressive projects! Happy developing! 🚀