BERT Model Source Code Walkthrough (13)

      with tf.variable_scope("attention"):
        attention_heads = []  # outputs of the attention layers for this block
        with tf.variable_scope("self"):
          attention_head = attention_layer(  # one call computes every head of the self-attention layer
              from_tensor=layer_input,  # source and target are the same tensor, so the sequence attends to itself
              to_tensor=layer_input,
              attention_mask=attention_mask,  # attention mask
              num_attention_heads=num_attention_heads,  # number of heads
              size_per_head=attention_head_size,  # width of each head
              attention_probs_dropout_prob=attention_probs_dropout_prob,  # dropout rate on the attention probabilities
              initializer_range=initializer_range,  # standard deviation of the weight initializer
              do_return_2d_tensor=True,  # return a 2-D tensor
              batch_size=batch_size,  # batch size
              from_seq_length=seq_length,  # source sequence length
              to_seq_length=seq_length)  # target sequence length
          attention_heads.append(attention_head)  # add the resulting tensor to the list
        attention_output = None
        if len(attention_heads) == 1:  # only one entry: it is the output
          attention_output = attention_heads[0]
        else:  # more than one entry
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)
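tf.concat() is the TensorFlow function for joining tensors along one dimension. Usage:
axis=0 concatenates along dimension 0; axis=1 concatenates along dimension 1.
axis=-1 means the last dimension. For 3-D tensors, axis=-1 is equivalent to axis=2. Here each entry is a 2-D tensor (do_return_2d_tensor=True), so axis=-1 is equivalent to axis=1.
For a 2-D matrix, dimension 0 indexes the items directly inside the outermost square brackets (the rows), and dimension 1 indexes the items inside each inner pair of brackets. The higher the dimension, the deeper the brackets are nested.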
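To make the axis convention concrete, here is a small framework-free sketch that uses nested Python lists in place of tensors. The helper names `concat_axis0` and `concat_last_axis` are made up for this illustration and are not TensorFlow functions; in real code `tf.concat(values, axis=...)` does this work.

```python
def concat_axis0(a, b):
    """Concatenate two 2-D nested lists along axis 0 (append rows)."""
    return a + b


def concat_last_axis(a, b):
    """Concatenate two 2-D nested lists along axis -1 (widen each row)."""
    return [row_a + row_b for row_a, row_b in zip(a, b)]


a = [[1, 2], [3, 4]]           # shape [2, 2]
b = [[5, 6], [7, 8]]           # shape [2, 2]

rows = concat_axis0(a, b)      # shape [4, 2]
cols = concat_last_axis(a, b)  # shape [2, 4]
print(rows)  # [[1, 2], [3, 4], [5, 6], [7, 8]]
print(cols)  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```

Concatenating along the last axis keeps the number of rows (one per token) and only makes each row wider, which is why the walkthrough code uses axis=-1.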
        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(  # fully connected (dense) layer
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)  # dropout
          attention_output = layer_norm(attention_output + layer_input)  # residual add, then layer normalization
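The residual-plus-normalization step can be sketched in plain Python for a single token vector. This is a simplified illustration, not the `layer_norm` helper from the source: it leaves out the learnable scale and shift that a real layer normalization layer applies, and the `epsilon` value and the name `add_and_norm` are assumptions made for the example.

```python
import math


def layer_norm(vector, epsilon=1e-12):
    """Normalize one vector to zero mean and unit variance.

    Simplification: no learnable scale or shift is applied afterwards.
    """
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + epsilon) for x in vector]


def add_and_norm(sublayer_output, residual_input):
    """Add the residual connection, then layer-normalize the sum."""
    summed = [s + r for s, r in zip(sublayer_output, residual_input)]
    return layer_norm(summed)


layer_input = [1.0, 2.0, 3.0, 4.0]   # one token's hidden vector
projected = [0.5, -0.5, 0.5, -0.5]   # stand-in for the dense + dropout output
normalized = add_and_norm(projected, layer_input)
print(normalized)
```

Normalization runs over the hidden dimension of each token independently, so the result does not depend on the other tokens in the batch.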
      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(  # fully connected (dense) layer
            attention_output,  # the previous sub-layer's output is this layer's input
            intermediate_size,  # width of the intermediate layer
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))
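The excerpt does not say which function `intermediate_act_fn` is, since the caller passes it in. Assuming it is GELU in its tanh approximation (the default activation in BERT's modeling.py), a scalar sketch looks like this:

```python
import math


def gelu(x):
    """Tanh approximation of the Gaussian Error Linear Unit for one scalar."""
    cdf = 0.5 * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
    return x * cdf


for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(value, round(gelu(value), 4))
```

Large positive inputs pass through almost unchanged, large negative inputs are pushed toward zero, and the transition between the two is smooth rather than a hard cutoff.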
      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(  # dense layer that projects back down to hidden_size
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)  # dropout
        layer_output = layer_norm(layer_output + attention_output)  # residual add, then layer normalization
        prev_output = layer_output  # becomes the input of the next layer
        all_layer_outputs.append(layer_output)  # record this layer's output
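The shape flow of the feed-forward part (up-project to `intermediate_size`, apply the activation, down-project to `hidden_size`, add the residual) can be traced with a toy example. Everything below is a hand-made sketch: the `dense` helper, the ReLU stand-in activation, and the weights are assumptions chosen so the numbers are easy to follow, and dropout and layer normalization are left out.

```python
def dense(rows, weights, bias, activation=None):
    """Apply y = activation(x @ W + b) to every row of a 2-D nested list."""
    outputs = []
    for row in rows:
        out = [
            sum(x * w_row[j] for x, w_row in zip(row, weights)) + bias[j]
            for j in range(len(bias))
        ]
        outputs.append([activation(v) for v in out] if activation else out)
    return outputs


def relu(x):
    # Stand-in activation chosen for brevity; not necessarily what BERT uses.
    return max(0.0, x)


hidden_size, intermediate_size = 2, 4

# Toy weights, picked by hand for readability.
w_up = [[1.0, 0.0, -1.0, 0.0],
        [0.0, 1.0, 0.0, -1.0]]   # [hidden_size, intermediate_size]
b_up = [0.0] * intermediate_size
w_down = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 0.0],
          [0.0, 1.0]]            # [intermediate_size, hidden_size]
b_down = [0.0] * hidden_size

attention_output = [[1.0, -2.0], [3.0, 4.0]]   # [num_tokens, hidden_size]

intermediate_output = dense(attention_output, w_up, b_up, activation=relu)
layer_output = dense(intermediate_output, w_down, b_down)
# Residual connection; dropout and layer normalization are omitted here.
residual_sum = [[o + a for o, a in zip(out_row, att_row)]
                for out_row, att_row in zip(layer_output, attention_output)]
print(intermediate_output)
print(layer_output)
print(residual_sum)
```

Because the down-projection returns to `hidden_size`, the result has the same shape as `attention_output`, which is what makes the element-wise residual addition possible.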
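  if do_return_all_layers:  # the caller asked for every layer's output
    final_outputs = []  # list that will be returned
    for layer_output in all_layer_outputs:  # iterate over all layers
      final_output = reshape_from_matrix(layer_output, input_shape)  # reshape from 2-D back to the original input shape
      final_outputs.append(final_output)  # add to the result list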
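Inside the layer loop every tensor is kept 2-D, with one row per token, so the final step restores the batch and sequence dimensions. A rough sketch of that reshape on nested lists follows. The function `reshape_from_matrix_sketch` and its signature are hypothetical and only handle the 3-D case; the real `reshape_from_matrix` takes the original shape list instead.

```python
def reshape_from_matrix_sketch(matrix, batch_size, seq_length):
    """Split a [batch_size * seq_length, width] nested list into
    [batch_size, seq_length, width]."""
    if len(matrix) != batch_size * seq_length:
        raise ValueError("row count does not match batch_size * seq_length")
    return [matrix[b * seq_length:(b + 1) * seq_length]
            for b in range(batch_size)]


# Four token vectors of width 2: two sequences of two tokens each.
flat = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
restored = reshape_from_matrix_sketch(flat, batch_size=2, seq_length=2)
print(restored)
```

The rows are only regrouped, never reordered, so token i of sequence b is row b * seq_length + i of the flat matrix.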
