BERT Model Source Code Walkthrough (Part 9)


size_per_head=512,
query_act=None,
key_act=None,
value_act=None,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
do_return_2d_tensor=False,
batch_size=None,
from_seq_length=None,
to_seq_length=None):
"""Performs multi-headed attention from `from_tensor` to `to_tensor`.
This is an implementation of multi-headed attention
based on "Attention is all you Need".
If `from_tensor` and `to_tensor` are the same, then
this is self-attention. Each timestep in `from_tensor` attends to the
corresponding sequence in `to_tensor`, and returns a fixed-width vector.
This function first projects `from_tensor` into a "query" tensor and
`to_tensor` into "key" and "value" tensors. These are (effectively) a list
of tensors of length `num_attention_heads`, where each tensor is of shape
[batch_size, seq_length, size_per_head].
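To make those shapes concrete, here is a minimal NumPy sketch of the projection step (an illustration of the math only, not the TensorFlow code in modeling.py; the weight matrices w_q, w_k and w_v are hypothetical stand-ins for the learned "query"/"key"/"value" dense layers):

import numpy as np

batch_size, from_seq_length, to_seq_length = 2, 4, 4
from_width = to_width = 768
num_attention_heads, size_per_head = 12, 64

from_tensor = np.random.randn(batch_size, from_seq_length, from_width)
to_tensor = np.random.randn(batch_size, to_seq_length, to_width)

# Hypothetical projection weights standing in for the dense query/key/value layers.
w_q = np.random.randn(from_width, num_attention_heads * size_per_head) * 0.02
w_k = np.random.randn(to_width, num_attention_heads * size_per_head) * 0.02
w_v = np.random.randn(to_width, num_attention_heads * size_per_head) * 0.02

# Project, then split the last dimension into (num_attention_heads, size_per_head):
# effectively one [batch_size, seq_length, size_per_head] slice per head.
query = (from_tensor @ w_q).reshape(batch_size, from_seq_length,
                                    num_attention_heads, size_per_head)
key = (to_tensor @ w_k).reshape(batch_size, to_seq_length,
                                num_attention_heads, size_per_head)
value = (to_tensor @ w_v).reshape(batch_size, to_seq_length,
                                  num_attention_heads, size_per_head)

print(query.shape)  # (2, 4, 12, 64)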
Then, the query and key tensors are dot-producted and scaled. These are
softmaxed to obtain attention probabilities. The value tensors are then
interpolated by these probabilities, then concatenated back to a single
tensor and returned.
(The scaling divides the raw dot products by sqrt(size_per_head) so the
softmax stays well-behaved; "interpolated" simply means a probability-weighted
sum of the value vectors.)
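The scale/softmax/interpolate step can be written out the same way. A self-contained NumPy sketch (hypothetical sizes; the softmax helper is defined inline):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

B, N, F, T, H = 2, 12, 4, 4, 64          # batch, heads, from_seq, to_seq, size_per_head
query = np.random.randn(B, N, F, H)
key = np.random.randn(B, N, T, H)
value = np.random.randn(B, N, T, H)

# Dot-product the queries with the keys and scale by 1/sqrt(size_per_head).
scores = query @ key.transpose(0, 1, 3, 2) / np.sqrt(H)      # [B, N, F, T]
# Softmax over the "to" positions yields the attention probabilities.
probs = softmax(scores, axis=-1)
# "Interpolate" the values: a probability-weighted sum over the "to" positions.
context = probs @ value                                       # [B, N, F, H]
# Concatenate the heads back into a single tensor.
output = context.transpose(0, 2, 1, 3).reshape(B, F, N * H)   # [B, F, N*H]
print(output.shape)  # (2, 4, 768)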
In practice, the multi-headed attention is done with transposes and
reshapes rather than actual separate tensors.
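In other words, the per-head tensors are never materialized as a Python list. A single tensor is kept in a flattened 2D layout of shape [batch*seq, heads*size], reshaped to [batch, seq, heads, size] and transposed to [batch, heads, seq, size], so that one batched matrix multiply covers every head at once. A small NumPy illustration of that layout change (hypothetical sizes):

import numpy as np

B, S, N, H = 2, 4, 12, 64
query_2d = np.random.randn(B * S, N * H)        # flattened 2D layout

# Reshape to [batch, seq, heads, size], then transpose to [batch, heads, seq, size].
query_4d = query_2d.reshape(B, S, N, H).transpose(0, 2, 1, 3)
print(query_4d.shape)  # (2, 12, 4, 64)

# The equivalent "list of per-head tensors" view, never built explicitly:
per_head = [query_4d[:, h] for h in range(N)]   # N arrays of shape [batch, seq, size]
assert per_head[0].shape == (B, S, H)

This flattened [batch*seq, heads*size] layout is also what the do_return_2d_tensor flag in the argument list refers to: when it is True, the concatenated output is returned in that 2D shape instead of being reshaped back to [batch, from_seq, heads*size].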
Args: (the input arguments)
from_tensor: float Tensor of shape [batch_size, from_seq_length,
from_width].
to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
The attention mask:
attention_mask: (optional) int32 Tensor of shape [batch_size,
from_seq_length, to_seq_length]. The values should be 1 or 0. The
attention scores will effectively be set to -infinity for any positions in
the mask that are 0, and will be unchanged for positions that are 1.
(A short sketch of this masking trick appears after the argument list below.)
The number of attention heads:
num_attention_heads: int. Number of attention heads.
The size of each attention head:
size_per_head: int. Size of each attention head.
Optional activation functions for the query, key, and value transforms:
query_act: (optional) Activation function for the query transform.
key_act: (optional) Activation function for the key transform.
value_act: (optional) Activation function for the value transform.
Dropout rate applied to the attention probabilities:
attention_probs_dropout_prob: (optional) float. Dropout probability of the
attention probabilities.
Standard deviation of the (truncated normal) weight initializer:
initializer_range: float. Range of the weight initializer.
Whether to return a 2D tensor:
do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
* from_seq_length, num_attention_heads * size_per_head]. If False, the
output will be of shape [batch_size, from_seq_length, num_attention_heads
* size_per_head].
Batch size and the from/to sequence lengths (used when the inputs are passed in 2D):
batch_size: (Optional) int. If the input is 2D, this might be the batch size
of the 3D version of the `from_tensor` and `to_tensor`.
from_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `from_tensor`.
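As noted for attention_mask above, masked positions are effectively sent to -infinity before the softmax. In BERT-style code this is usually done by adding a large negative constant to the scores rather than a literal -inf; the snippet below is a minimal NumPy sketch of that idea (the -10000.0 constant and the toy sizes are illustrative):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.randn(1, 3, 3)                 # [batch, from_seq, to_seq] raw scores
attention_mask = np.array([[[1, 1, 0],
                            [1, 1, 0],
                            [1, 1, 0]]])          # 0 = this position may not be attended to

# Adding a large negative number to masked positions drives their
# post-softmax probability to (practically) zero, mimicking -infinity.
adder = (1.0 - attention_mask.astype(np.float64)) * -10000.0
probs = softmax(scores + adder, axis=-1)
print(probs[0].round(3))                           # the last column is ~0 in every row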
