What is Layerwise (Re)loading?¶

Layerwise reloading is the system used to handle the loading of new weight data into existing weight data destinations without triggering recompilation of the cuda graph and other runtime artifacts. This system is used to enable QeRL-style post training flows, where full-precision trainer weights are quantized and loaded into a target vLLM instance for fast, high-exploration rollouts. The core implementation can be found in layerwise.py.

Layerwise Reloading for QeRL¶

In order to load new weights into existing weight data destinations, a weight must undergo the following operations:

Transfer: weights must be transferred from trainer model to target node/device
Fuse: weight partitions must be fused, for example qkv/gate_up
Process: this typically means online quantization and kernel-specific padding or striding
Shard: weights must be sharded according to the selected parallelism strategy
Copy: weights must be copied into the existing weight data destinations

Layerwise reloading achieves this using the following steps:

Weights are transferred from the trainer to the target (see weight_transfer)
Weights loaded via model.load_weights, during which they are sharded and fused
Weights are processed in an online fashion as soon as all of a layer's weights are loaded
Weights are copied into the existing weight data destinations

For more information on implementation, see Low Level layerwise API.

Layerwise Loading with Online Quantization¶

Online quantization refers to when a user provides full precision weights and those weights are quantized on-the-fly as they are loaded into the model. The layerwise reloading system handles this by treating online quantization as a processing step, which is then handled in an online way both during first-time load and during reload. A typical online quantization method implementation should look like this:

class Fp8OnlineLinearMethod(Fp8LinearMethod):
    """Online version of Fp8LinearMethod which loads a full precision checkpoint
    and quantizes weights during loading."""

    uses_meta_device: bool = True

    def create_weights(self, layer: torch.nn.Module, ...):
        # weight is materialized and processed during loading
        layer.weight = ModelWeightParameter(
            data=torch.empty(..., device="meta"),
            weight_loader=weight_loader,
        )

        # set up online processing
        initialize_online_processing(layer)

    def process_weights_after_loading(self, layer: Module) -> None:
        if getattr(layer, "_already_called_process_weights_after_loading", False):
            return

        layer.weight, layer.weight_scale = ops.scaled_fp8_quant(layer.weight)

        # Prevent duplicate processing (e.g., during weight reload)
        layer._already_called_process_weights_after_loading = True

Example Usages¶

High Level Weight Transfer API¶

The layerwise reloading system is integrated with the post-training weight transfer system. To use layerwise reloading in conjunction to the weight transfer system, follow the examples found here. Layerwise reloading is controlled by the WeightTransferUpdateInfo.is_checkpoint_format flag and is set to True by default.

Mid Level `reload_weights` API¶

Layerwise reloading is also exposed via the reload_weights API. This interface can be called using the following code:

from vllm import LLM

llm = LLM("Qwen/Qwen3-0.6B")
llm.collective_rpc("reload_weights")

This interface also allows specifying a weights_path which can be used to select a checkpoint path to load from:

from vllm import LLM

# fine tuned model checkpoints for testing
mul_path = "inference-optimization/Qwen3-0.6B-debug-multiply"
add_path = "inference-optimization/Qwen3-0.6B-debug-add"

llm = LLM("Qwen/Qwen3-0.6B")
llm.collective_rpc("reload_weights", kwargs={"weights_path": mul_path})
llm.generate("3 4 = ")  # 12

llm.collective_rpc("reload_weights", kwargs={"weights_path": add_path})
llm.generate("3 4 = ")  # 7

Finally, a weights_iterator can be provided directly. This iterator can be lazy or eagerly defined.

from vllm import LLM

weights_iterator = [("q_proj", ...), ("k_proj", ...), ...]

llm = LLM("Qwen/Qwen3-0.6B")
llm.collective_rpc("reload_weights", kwargs={"weights_iterator": weights_iterator})

Low Level `layerwise` API¶

layerwise.py Implements the following functions to execute its lifecycle:

Function	Purpose	Quantized Reload	Online Quantization
`record_metadata_for_reloading`	Record tensor metadata so that layers can be restored on the meta device	Called by `BaseModelLoader`	Called by `BaseModelLoader`
`restore_layer_on_meta`	Restore layer to model format at start of reload	Called by `initialize_layerwise_reload`	Not called. Online quantized weights already start on meta device via `...OnlineLinearMethod.create_weights`
`initialize_online_processing`	Wrap weight loaders with the `online_process_loader` wrapper, which buffers weights until all layer weights have been loaded	Called by `initialize_layerwise_reload`	Called by `...OnlineLinearMethod.create_weights`
`_layerwise_process`	Process layer once all weights are loaded	Called by `online_process_loader` during loading	Called by `online_process_loader` during loading
`_copy_and_restore_kernel_tensors`	Copy processed weights into original tensor locations to affect compiled cuda graphs, etc.	Called by `_layerwise_process` after `process_weights_after_loading`	Not called. There is no compiled cuda graph yet
`finalize_layerwise_processing`	Catch any layers which did not load all weights (for example attention weights or weights with padding)	Called by `BaseModelLoader`	Called by `BaseModelLoader`

You can plug into this lifecycle directly by calling the initialize_layerwise_reload, loading weights, then calling finalize_layerwise_processing:

from vllm import LLM
from vllm.model_executor.model_loader.reload import initialize_layerwise_reload, finalize_layerwise_processing

llm = LLM("Qwen/Qwen3-0.6B")

# this model path requires `VLLM_ENABLE_V1_MULTIPROCESSING=0` and is not stable
model = llm.llm_engine.engine_core.engine_core.model_executor.driver_worker.worker.get_model()

# layerwise reload
initialize_layerwise_reload(model)
model.load_weights(...)
finalize_layerwise_processing(model, llm.model_config)

Troubleshooting Excessive Memory Usage¶

Layerwise reloading allows users to incrementally load and process weights as they are loaded into the model. This system relies on buffering layer weights on device until all weights of a layer have been loaded. However, without offloading, this approach necessarily causes excessive buffering if weights are loaded out of order.

For this reason, users must take care as to the order of weights when they are reloading into the model. Weight should be loaded "in order", meaning that each layer's weights are fully loaded before beginning to load the next layer's weights. "Out of order" loading can cause layer weights to stay buffered while other layer weights are loading, leading to excessive memory usage. In the example below, q_proj, k_proj, v_proj, and up_proj are all buffered at the same time, using more memory than if up_proj was loaded after q_proj, k_proj and v_proj.

Correct Loading	Incorrect Loading

Users will see a warning like the one below if weights are loaded out-of-order.

WARNING [layerwise.py:198] Allocating 28.5 MB of device memory to buffers to load ["QKVParallelLinear", "MergedColumnParallelLinear"] layers. This extra memory usage can be avoided by ordering weights by their parent layer when reloading.