BERT Model Source Code Analysis (12)


Also see: you can also refer to the reference implementation on GitHub:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
Args: parameter descriptions: input tensor, hidden size, number of hidden layers, number of attention heads (an example call is sketched right after the docstring)
input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
seq_length], with 1 for positions that can be attended to and 0 in
positions that should not be.
hidden_size: int. Hidden size of the Transformer.
num_hidden_layers: int. Number of layers (blocks) in the Transformer.
num_attention_heads: int. Number of attention heads in the Transformer.
Intermediate layer size, activation function of the intermediate layer, dropout probability of the hidden layers, dropout probability of the attention probabilities
intermediate_size: int. The size of the "intermediate" (a.k.a., feed
forward) layer.
intermediate_act_fn: function. The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
hidden_dropout_prob: float. Dropout probability for the hidden layers.
attention_probs_dropout_prob: float. Dropout probability of the attention
probabilities.
Standard deviation of the truncated normal distribution,
i.e. the numeric range used when initializing the weight parameters (values outside this range are truncated)
initializer_range: float. Range of the initializer (stddev of truncated
normal).
Whether to return all layers or only the final layer
do_return_all_layers: Whether to also return all layers or just the final
layer.
Returns: the return value, a float Tensor: the final hidden layer of the Transformer model
float Tensor of shape [batch_size, seq_length, hidden_size], the final
hidden layer of the Transformer.
Raises: an exception for an invalid tensor shape or parameter value
ValueError: A Tensor shape or parameter is invalid.
"""
if hidden_size % num_attention_heads != 0:
If the hidden size is not evenly divisible by the number of attention heads, raise an exception
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, num_attention_heads))
attention_head_size = int(hidden_size / num_attention_heads)
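For example, with the BERT-Base configuration (used here purely for illustration), hidden_size = 768 and num_attention_heads = 12: the check passes because 768 % 12 == 0, and each head is assigned attention_head_size = int(768 / 12) = 64 dimensions.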
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
input_width = input_shape[2]
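
get_shape_list() is a helper defined earlier in modeling.py. Roughly speaking, it returns one entry per dimension, using a Python int when the dimension is statically known and a scalar tensor when it is only known at run time. The sketch below is a simplified assumption that omits the expected_rank validation of the real helper:

def get_shape_list_sketch(tensor):
  # Static shape, e.g. [None, 128, 768]; None marks a dynamic dimension.
  static = tensor.shape.as_list()
  # Dynamic (runtime) shape, a 1-D int32 tensor.
  dynamic = tf.shape(tensor)
  # Use the Python int where known, otherwise the runtime scalar.
  return [dim if dim is not None else dynamic[i]
          for i, dim in enumerate(static)]

In practice batch_size is often a scalar tensor (the batch dimension is dynamic), while seq_length and input_width are usually plain Python ints.
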
The Transformer sums residual connections on every layer, so the input width must be the same as the hidden size
# The Transformer performs sum residuals on all layers so the input needs
# to be the same as the hidden size.
if input_width != hidden_size:  # raise an error if the two sizes do not match
raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
(input_width, hidden_size))
We always keep the representation as a 2D tensor to avoid reshaping back and forth;
reshaping is essentially free on GPU and CPU, but can be costly on TPU,
so minimizing these unnecessary conversions saves computation and improves the model's efficiency.
# We keep the representation as a 2D tensor to avoid re-shaping it back and
# forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
# the GPU/CPU but may not be free on the TPU, so we want to minimize them to
# help the optimizer.
Reshape the input tensor into a 2D matrix
prev_output = reshape_to_matrix(input_tensor)
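
reshape_to_matrix() is another small helper from modeling.py. Conceptually it collapses the leading dimensions, so [batch_size, seq_length, hidden_size] becomes [batch_size * seq_length, hidden_size]. A simplified sketch under that assumption (the real helper also checks the tensor rank):

def reshape_to_matrix_sketch(input_tensor):
  # Keep the last (width) dimension and fold all leading dimensions into rows.
  width = input_tensor.shape[-1]
  return tf.reshape(input_tensor, [-1, width])

For a [8, 128, 768] input this yields a [1024, 768] matrix; the 3-D shape is restored only once, after the final layer.
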
all_layer_outputs = []  # list that will collect the output of every layer
tf.variable_scope() defines a variable scope and is used together with tf.get_variable().
A variable scope also acts as a context manager: everything created inside the with block is managed (namespaced) by that scope.
The main reason for using variable_scope is the need for variable sharing; a small sharing example is sketched below.
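
To illustrate the variable-sharing point, here is a standalone TensorFlow 1.x sketch (not part of modeling.py; the scope name and shapes are made up for the example):

import tensorflow as tf  # TensorFlow 1.x API

def dense_kernel(x):
  # tf.get_variable creates "kernel" the first time it is called inside a scope
  # and returns that same variable when the scope is re-entered with reuse=True.
  w = tf.get_variable("kernel", shape=[768, 768],
                      initializer=tf.truncated_normal_initializer(stddev=0.02))
  return tf.matmul(x, w)

x = tf.placeholder(tf.float32, [None, 768])
with tf.variable_scope("layer_0"):
  out_a = dense_kernel(x)                      # creates variable "layer_0/kernel"
with tf.variable_scope("layer_0", reuse=True):
  out_b = dense_kernel(x)                      # reuses the same "layer_0/kernel"
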
for layer_idx in range(num_hidden_layers):  # iterate over all hidden layers
with tf.variable_scope("layer_%d" % layer_idx):
layer_input = prev_output  # the input to this layer is the previous layer's output
