<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>Transformer - Tag - Naifan Li's Blog</title><link>https://blog.omagiclee.com/tags/transformer/</link><description>Transformer - Tag - Naifan Li's Blog</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Mon, 16 Mar 2026 17:05:46 +0800</lastBuildDate><atom:link href="https://blog.omagiclee.com/tags/transformer/" rel="self" type="application/rss+xml"/><item><title>Normalization: BatchNorm, LayerNorm, and RMSNorm</title><link>https://blog.omagiclee.com/posts/basics/norms/</link><pubDate>Mon, 16 Mar 2026 17:05:46 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/basics/norms/</guid><description><![CDATA[<h2 id="为什么需要归一化">Why Normalization Is Needed</h2>
<p>In deep networks, the scale of each layer's outputs drifts out of control as depth grows: some layers produce very large activations, others very small ones. This destabilizes gradients, makes the learning rate hard to tune, and often causes training to diverge.</p>
<p>The essential role of normalization is to <strong>pull intermediate representations back toward a controlled scale</strong>, which:</p>
<ul>
<li>smooths the loss landscape and stabilizes gradients</li>
<li>allows larger learning rates, accelerating convergence</li>
<li>reduces sensitivity to parameter initialization</li>
</ul>
<p>The BatchNorm paper originally explained this as &quot;reducing internal covariate shift&quot;, but later work suggests the real value of normalization lies more in <strong>improving the optimization conditions</strong> than in merely correcting distribution drift.</p>
<h2 id="batchnorm">BatchNorm</h2>
<h3 id="算法">Algorithm</h3>
<p>Given an input $x$, BatchNorm runs four steps for each feature dimension (or channel):</p>
<ol>
<li><strong>Compute the batch mean</strong></li>
</ol>
$$
\mu = \frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} x_i
$$<ol start="2">
<li><strong>Compute the batch variance</strong></li>
</ol>
$$
\sigma^2 = \frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} (x_i - \mu)^2
$$<ol start="3">
<li><strong>Standardize</strong></li>
</ol>
$$
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$<ol start="4">
<li><strong>Apply an affine transform</strong></li>
</ol>
$$
y_i = \gamma\, \hat{x}_i + \beta
$$<p>where $\gamma, \beta$ are learnable parameters and $\epsilon$ guards against division by zero.</p>
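<p>The four steps above can be written out directly; a minimal NumPy sketch (the function name is my own, for illustration):</p>
<pre><code class="language-python">import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    # x: (B, D); statistics are per feature, reduced over the batch axis
    mu = x.mean(axis=0)                    # 1. batch mean
    var = x.var(axis=0)                    # 2. batch variance (biased, matching the 1/|B| formula)
    x_hat = (x - mu) / np.sqrt(var + eps)  # 3. standardize
    return gamma * x_hat + beta            # 4. affine transform
</code></pre>
<p>With $\gamma = 1, \beta = 0$, each output feature has (approximately) zero mean and unit variance over the batch.</p>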
<p>Which dimensions the statistics are computed over depends on the input shape:</p>
<ul>
<li><strong>Fully connected layers</strong>, $x \in \mathbb{R}^{B \times D}$: for each feature dimension $d$, statistics are computed over the batch dimension $B$</li>
<li><strong>Convolutional layers</strong>, $x \in \mathbb{R}^{B \times C \times H \times W}$: for each channel $c$, statistics are computed over $(B, H, W)$</li>
</ul>
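<p>Concretely, the two cases differ only in the reduction axes; for example, in NumPy:</p>
<pre><code class="language-python">import numpy as np

x_fc = np.random.randn(32, 64)           # fully connected input: (B, D)
mu_fc = x_fc.mean(axis=0)                # one mean per feature d: shape (64,)

x_conv = np.random.randn(32, 16, 8, 8)   # convolutional input: (B, C, H, W)
mu_conv = x_conv.mean(axis=(0, 2, 3))    # one mean per channel c: shape (16,)
</code></pre>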
<p>Each feature dimension (or channel) has its own pair of $\gamma, \beta$.</p>]]></description></item><item><title>Transformer: Attention Is All You Need</title><link>https://blog.omagiclee.com/posts/llms/transformer/</link><pubDate>Mon, 12 Jun 2017 17:11:59 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/transformer/</guid><description><![CDATA[<p><i class="fas fa-award fa-fw" aria-hidden="true"></i><span style="color:gray">NeurIPS 2017</span>
<i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Google</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/1706.03762" target="_blank" rel="noopener noreferrer">arXiv 1706.03762</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/tensorflow/tensor2tensor" target="_blank" rel="noopener noreferrer">tensorflow/tensor2tensor</a></p>
<h2 id="tldr">TL;DR</h2>
<p>The Transformer dispenses with recurrence and convolution entirely, building sequence transduction from stacked self-attention and position-wise feed-forward layers in an encoder-decoder architecture.</p>
<h2 id="motivation">Motivation</h2>
<p>Recurrent models process tokens sequentially, which precludes parallelization within a sequence and makes long-range dependencies hard to learn; attention provides constant-length paths between any two positions.</p>
<h2 id="contribution">Contribution</h2>
<p>The first sequence transduction model relying entirely on attention, reaching state-of-the-art BLEU on the WMT 2014 English-to-German and English-to-French translation tasks at a fraction of the training cost of prior models.</p>
<h2 id="approach">Approach</h2>
<h3 id="model-architecture">Model Architecture</h3>
<p>The Transformer follows the standard encoder-decoder structure: the encoder maps an input sequence to a sequence of continuous representations, and the decoder generates the output one token at a time, attending both to previously generated tokens and to the encoder output.</p>
<h3 id="transformer">Transformer</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from torch import Tensor, nn

<span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Transformer</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">d_model</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">512</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">nhead</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">8</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">num_encoder_layers</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">6</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">num_decoder_layers</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">6</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">dim_feedforward</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">2048</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">dropout</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.1</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="n">encoder_layer</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">TransformerEncoderLayer</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">d_model</span><span class="o">=</span><span class="n">d_model</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">nhead</span><span class="o">=</span><span class="n">nhead</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">dim_feedforward</span><span class="o">=</span><span class="n">dim_feedforward</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">dropout</span><span class="o">=</span><span class="n">dropout</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">encoder</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">TransformerEncoder</span><span class="p">(</span><span class="n">encoder_layer</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="n">num_encoder_layers</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="n">decoder_layer</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">TransformerDecoderLayer</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">d_model</span><span class="o">=</span><span class="n">d_model</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">nhead</span><span class="o">=</span><span class="n">nhead</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">dim_feedforward</span><span class="o">=</span><span class="n">dim_feedforward</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">dropout</span><span class="o">=</span><span class="n">dropout</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">decoder</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">TransformerDecoder</span><span class="p">(</span><span class="n">decoder_layer</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="n">num_decoder_layers</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">src</span><span class="p">:</span> <span class="n">Tensor</span><span class="p">,</span> <span class="n">tgt</span><span class="p">:</span> <span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tensor</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">memory</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">encoder</span><span class="p">(</span><span class="n">src</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">output</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">decoder</span><span class="p">(</span><span class="n">tgt</span><span class="p">,</span> <span class="n">memory</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">output</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="transformerencoder">TransformerEncoder</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import copy

from torch import Tensor, nn
from torch.nn import ModuleList

<span class="line"><span class="cl"><span class="k">class</span> <span class="nc">TransformerEncoder</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">encoder_layer</span><span class="p">:</span> <span class="n">nn</span><span class="o">.</span><span class="n">TransformerEncoderLayer</span><span class="p">,</span> <span class="n">num_layers</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">layers</span> <span class="o">=</span> <span class="n">ModuleList</span><span class="p">([</span><span class="n">copy</span><span class="o">.</span><span class="n">deepcopy</span><span class="p">(</span><span class="n">encoder_layer</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_layers</span><span class="p">)])</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">src</span><span class="p">:</span> <span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tensor</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">output</span> <span class="o">=</span> <span class="n">src</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">mod</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">output</span> <span class="o">=</span> <span class="n">mod</span><span class="p">(</span><span class="n">output</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">output</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="transformerencoderlayer">TransformerEncoderLayer</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from torch import Tensor, nn
from torch.nn import Dropout, LayerNorm, Linear, MultiheadAttention, ReLU

<span class="line"><span class="cl"><span class="k">class</span> <span class="nc">TransformerEncoderLayer</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="sa">r</span><span class="s2">&#34;&#34;&#34;TransformerEncoderLayer is made up of self-attn and feedforward network.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">d_model</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">nhead</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">dim_feedforward</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">dropout</span><span class="p">:</span> <span class="nb">float</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Implementation of self-attention</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">self_attn</span> <span class="o">=</span> <span class="n">MultiheadAttention</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">nhead</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="n">dropout</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">dropout1</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Implementation of Feedforward model</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">linear1</span> <span class="o">=</span> <span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">dim_feedforward</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">activation</span> <span class="o">=</span> <span class="n">ReLU</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">dropout</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">linear2</span> <span class="o">=</span> <span class="n">Linear</span><span class="p">(</span><span class="n">dim_feedforward</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">dropout2</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">norm1</span> <span class="o">=</span> <span class="n">LayerNorm</span><span class="p">(</span><span class="n">d_model</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">norm2</span> <span class="o">=</span> <span class="n">LayerNorm</span><span class="p">(</span><span class="n">d_model</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">src</span><span class="p">:</span> <span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tensor</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="n">src</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">norm1</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">_sa_block</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">norm2</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">_ff_block</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">x</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># self-attention block</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_sa_block</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tensor</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">self_attn</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">dropout1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</span></span>
    # feedforward block (called by forward above but missing from the original excerpt)
    def _ff_block(self, x: Tensor) -&gt; Tensor:
        x = self.linear2(self.dropout(self.activation(self.linear1(x))))
        return self.dropout2(x)</code></pre></td></tr></table>
</div>
</div><h3 id="attention">Attention</h3>
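<p>The paper's scaled dot-product attention, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^\top/\sqrt{d_k}\right)V$, can be sketched as follows (a minimal single-head version without masking or dropout; the function name is my own):</p>
<pre><code class="language-python">import math

import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (..., seq_len, d_k); scaling by sqrt(d_k) keeps the logits well-conditioned
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)  # each row is a distribution over positions
    return weights @ v
</code></pre>
<p>Multi-head attention runs $h$ of these in parallel on learned projections of $Q, K, V$ and concatenates the results, which is what <code>nn.MultiheadAttention</code> implements.</p>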
]]></description></item></channel></rss>