步骤	操作	PyTorch代码	数学表达式	说明
1	输入张量	`x`	$x$	任意形状的张量
2	计算立方项	`torch.pow(x, 3)`	$x^3$	x的三次方
3	乘以系数	`0.044715 * torch.pow(x, 3)`	$0.044715x^3$	近似公式的修正项
4	加上原始值	`x + 0.044715 * torch.pow(x, 3)`	$x + 0.044715x^3$	tanh内部的多项式
5	乘以缩放因子	`sqrt_2_pi * (...)`	$\sqrt{\frac{2}{\pi}}(x + 0.044715x^3)$	高斯分布标准化
6	应用tanh函数	`torch.tanh(...)`	$\tanh(\sqrt{\frac{2}{\pi}}(x + 0.044715x^3))$	双曲正切函数
7	加1	`1 + torch.tanh(...)`	$1 + \tanh(...)$	偏移到正值区间
8	乘以x	`x * (1 + tanh(...))`	$x(1 + \tanh(...))$	门控机制
9	乘以0.5	`0.5 * x * (1 + tanh(...))`	$\frac{1}{2}x(1 + \tanh(...))$	GELU最终结果

完整代码实现

GELU激活函数.py
import torch
import torch.nn as nn

class GELU(nn.Module):
    """高斯误差线性单元（GELU）激活函数实现（基于Hendrycks等2016年原始论文的近似公式）"""
    def __init__(self):
        super().__init__()
        # 无需可训练参数，与ReLU相比具有更平滑的梯度特性

    def forward(self, x):
        """前向传播实现（使用tanh近似公式提高计算效率）
        Args:
            x (Tensor): 输入张量
        Return:
            Tensor: 激活后的输出
        """
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *  # √(2/π) 来自高斯分布特性
            (x + 0.044715 * torch.pow(x, 3))  # 三次项系数优化训练稳定性
        ))

# 激活函数可视化对比
import matplotlib.pyplot as plt
gelu = GELU()  # 实例化GELU模块
relu = nn.ReLU()  # 实例化标准ReLU模块作为对比
x = torch.linspace(-3, 3, 100)  # 生成[-3,3]范围内100个均匀分布点
y_gelu = gelu(x)  # 计算GELU输出
y_relu = relu(x)  # 计算ReLU输出

plt.figure(figsize=(8, 3))  # 设置画布尺寸（宽度8英寸，高度3英寸）

# 双图对比可视化（左GELU，右ReLU）
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1):
    plt.subplot(1, 2, i)  # 创建子图位置索引
    plt.plot(x.numpy(), y.numpy())  # 张量转numpy数组绘图
    plt.title(f"{label} activation function")
    plt.xlabel("x")  # x轴标签
    plt.ylabel(f"{label}(x)")  # y轴标签
    plt.grid(True)  # 显示网格线

plt.tight_layout()  # 自动调整子图间距
plt.show()  # 显示图像

代码详细解析

1. 类设计说明

GELU类结构

class GELU(nn.Module):
    def __init__(self):
        # 无需可训练参数
    def forward(self, x):
        # 前向传播逻辑

设计要点：

继承自nn.Module，符合PyTorch模块化设计
无需初始化参数，是纯函数式激活函数
支持任意形状的张量输入

2. 核心实现分析

近似公式的选择

return 0.5 * x * (1 + torch.tanh(
    torch.sqrt(torch.tensor(2.0 / torch.pi)) *
    (x + 0.044715 * torch.pow(x, 3))
))

关键组件解析：

组件	数学含义	作用
`0.5 * x`	$\frac{x}{2}$	基础线性项
`torch.sqrt(2.0 / torch.pi)`	$\sqrt{\frac{2}{\pi}} \approx 0.7979$	高斯分布标准化常数
`0.044715`	近似系数	三次项修正，提高近似精度
`torch.pow(x, 3)`	$x^3$	高阶修正项
`torch.tanh(...)`	$\tanh(\cdot)$	平滑的S型函数

3. 数值稳定性考虑

为什么使用tanh近似？

计算效率：避免了误差函数erf的复杂计算
数值稳定：tanh函数在深度学习框架中优化良好
精度保证：在实际应用范围内误差很小（< 0.1%）

精度对比

# 精确GELU vs 近似GELU
def gelu_exact(x):
    return 0.5 * x * (1 + torch.erf(x / torch.sqrt(torch.tensor(2.0))))

def gelu_approx(x):
    return 0.5 * x * (1 + torch.tanh(
        torch.sqrt(torch.tensor(2.0 / torch.pi)) *
        (x + 0.044715 * torch.pow(x, 3))
    ))

4. 可视化分析

图形特征解读

平滑性：GELU在整个定义域内连续可导
负值处理：负值不完全截断，保留小部分信息
正值增强：正值区域近似线性，保持梯度流
零点行为：在x=0附近平滑过渡

与其他激活函数的对比

激活函数	负值处理	平滑性	计算复杂度	梯度特性
ReLU	完全截断	不连续	极低	稀疏梯度
Leaky ReLU	线性衰减	不连续	低	稀疏梯度
Swish	平滑衰减	连续	中等	密集梯度
GELU	概率门控	连续	中等	密集梯度

性能特点与应用

1. 优势分析

理论优势

概率解释：基于高斯分布的理论基础
平滑梯度：避免梯度消失和爆炸
信息保留：负值区域保留部分信息

实验优势

收敛速度：通常比ReLU收敛更快
最终性能：在多个基准测试中表现更好
泛化能力：减少过拟合风险

2. 在Transformer中的应用

GELU在现代Transformer架构中广泛应用：

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.gelu = GELU()  # 使用GELU激活
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        return self.linear2(self.dropout(self.gelu(self.linear1(x))))

3. 使用建议

适用场景

自然语言处理：BERT、GPT等模型的标准选择
计算机视觉：Vision Transformer等架构
深层网络：需要平滑梯度的深层架构

注意事项

计算开销：比ReLU略高，但通常可接受
内存使用：需要存储中间计算结果
硬件支持：现代GPU对GELU优化良好

扩展实现

1. 其他GELU变体

class GeluVariants(nn.Module):
    def __init__(self, variant='tanh'):
        super().__init__()
        self.variant = variant

    def forward(self, x):
        if self.variant == 'tanh':
            # 标准tanh近似
            return 0.5 * x * (1 + torch.tanh(
                torch.sqrt(torch.tensor(2.0 / torch.pi)) *
                (x + 0.044715 * torch.pow(x, 3))
            ))
        elif self.variant == 'sigmoid':
            # Sigmoid近似
            return x * torch.sigmoid(1.702 * x)
        elif self.variant == 'exact':
            # 精确实现
            return 0.5 * x * (1 + torch.erf(x / torch.sqrt(torch.tensor(2.0))))

2. 性能优化版本

class FastGELU(nn.Module):
    """优化的GELU实现，预计算常数"""
    def __init__(self):
        super().__init__()
        self.sqrt_2_over_pi = torch.sqrt(torch.tensor(2.0 / torch.pi))
        self.coeff = 0.044715

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            self.sqrt_2_over_pi * (x + self.coeff * x * x * x)
        ))

更新日志

2025/8/18 00:31

查看所有更新日志

bd1d0-迁移目录于 2025/8/18
bcc6d-docs: 完善大模型架构文档 - 增加设计思路与执行流程于 2025/8/17
dfb81-update于 2025/6/26
1a489-update于 2025/6/20

版权所有

版权归属：NateHHX

许可证：署名 4.0 国际 (CC-BY-4.0)

数据准备与采样

自注意力机制

大模型架构

无标签数据上进行预训练

GELU激活函数