BERT Model Source Code Analysis (12)


Also see: you can also refer to the reference implementation on GitHub:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
Args: parameter descriptions: input tensor, hidden size, number of hidden layers, number of attention heads (an example call is sketched right after the docstring)
input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
seq_length], with 1 for positions that can be attended to and 0 in
positions that should not be.
hidden_size: int. Hidden size of the Transformer.
num_hidden_layers: int. Number of layers (blocks) in the Transformer.
num_attention_heads: int. Number of attention heads in the Transformer.
Intermediate layer size, activation function of the intermediate layer, dropout probability of the hidden layers, dropout probability of the attention probabilities
intermediate_size: int. The size of the "intermediate" (a.k.a., feed
forward) layer.
intermediate_act_fn: function. The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
hidden_dropout_prob: float. Dropout probability for the hidden layers.
attention_probs_dropout_prob: float. Dropout probability of the attention
probabilities.
Standard deviation of the truncated normal distribution,
i.e. the numeric range used when initializing the weight parameters (values outside this range are truncated)
initializer_range: float. Range of the initializer (stddev of truncated
normal).
Whether to return all layers or only the final layer
do_return_all_layers: Whether to also return all layers or just the final
layer.
Returns: the return value, a float Tensor: the final hidden layer of the Transformer model
float Tensor of shape [batch_size, seq_length, hidden_size], the final
hidden layer of the Transformer.
Raises: an exception for an invalid tensor shape or parameter value
ValueError: A Tensor shape or parameter is invalid.
"""
if hidden_size % num_attention_heads != 0:
If the hidden size is not evenly divisible by the number of attention heads, raise an exception
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, num_attention_heads))
attention_head_size = int(hidden_size / num_attention_heads)
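For example, with the BERT-Base configuration (used here purely for illustration), hidden_size = 768 and num_attention_heads = 12: the check passes because 768 % 12 == 0, and each head is assigned attention_head_size = int(768 / 12) = 64 dimensions.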
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
input_width = input_shape[2]
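
get_shape_list() is a helper defined earlier in modeling.py. Roughly speaking, it returns one entry per dimension, using a Python int when the dimension is statically known and a scalar tensor when it is only known at run time. The sketch below is a simplified assumption that omits the expected_rank validation of the real helper:

def get_shape_list_sketch(tensor):
  # Static shape, e.g. [None, 128, 768]; None marks a dynamic dimension.
  static = tensor.shape.as_list()
  # Dynamic (runtime) shape, a 1-D int32 tensor.
  dynamic = tf.shape(tensor)
  # Use the Python int where known, otherwise the runtime scalar.
  return [dim if dim is not None else dynamic[i]
          for i, dim in enumerate(static)]

In practice batch_size is often a scalar tensor (the batch dimension is dynamic), while seq_length and input_width are usually plain Python ints.
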
The Transformer sums residual connections on every layer, so the input width must be the same as the hidden size
# The Transformer performs sum residuals on all layers so the input needs
# to be the same as the hidden size.
if input_width != hidden_size:  # raise an error if the two sizes do not match
raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
(input_width, hidden_size))
We always keep the representation as a 2D tensor to avoid reshaping back and forth;
reshaping is essentially free on GPU and CPU, but can be costly on TPU,
so minimizing these unnecessary conversions saves computation and improves the model's efficiency.
# We keep the representation as a 2D tensor to avoid re-shaping it back and
# forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
# the GPU/CPU but may not be free on the TPU, so we want to minimize them to
# help the optimizer.
Reshape the input tensor into a 2D matrix
prev_output = reshape_to_matrix(input_tensor)
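
reshape_to_matrix() is another small helper from modeling.py. Conceptually it collapses the leading dimensions, so [batch_size, seq_length, hidden_size] becomes [batch_size * seq_length, hidden_size]. A simplified sketch under that assumption (the real helper also checks the tensor rank):

def reshape_to_matrix_sketch(input_tensor):
  # Keep the last (width) dimension and fold all leading dimensions into rows.
  width = input_tensor.shape[-1]
  return tf.reshape(input_tensor, [-1, width])

For a [8, 128, 768] input this yields a [1024, 768] matrix; the 3-D shape is restored only once, after the final layer.
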
all_layer_outputs = []  # list that will collect the output of every layer
tf.variable_scope() defines a variable scope and is used together with tf.get_variable().
A variable scope also acts as a context manager: everything created inside the with block is managed (namespaced) by that scope.
The main reason for using variable_scope is the need for variable sharing; a small sharing example is sketched below.
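
To illustrate the variable-sharing point, here is a standalone TensorFlow 1.x sketch (not part of modeling.py; the scope name and shapes are made up for the example):

import tensorflow as tf  # TensorFlow 1.x API

def dense_kernel(x):
  # tf.get_variable creates "kernel" the first time it is called inside a scope
  # and returns that same variable when the scope is re-entered with reuse=True.
  w = tf.get_variable("kernel", shape=[768, 768],
                      initializer=tf.truncated_normal_initializer(stddev=0.02))
  return tf.matmul(x, w)

x = tf.placeholder(tf.float32, [None, 768])
with tf.variable_scope("layer_0"):
  out_a = dense_kernel(x)                      # creates variable "layer_0/kernel"
with tf.variable_scope("layer_0", reuse=True):
  out_b = dense_kernel(x)                      # reuses the same "layer_0/kernel"
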
for layer_idx in range(num_hidden_layers):  # iterate over all hidden layers
with tf.variable_scope("layer_%d" % layer_idx):
layer_input = prev_output  # the input to this layer is the previous layer's output
