BERT Model Source Code Walkthrough (Part 9)


size_per_head=512,
query_act=None,
key_act=None,
value_act=None,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
do_return_2d_tensor=False,
batch_size=None,
from_seq_length=None,
to_seq_length=None):
"""Performs multi-headed attention from `from_tensor` to `to_tensor`.
This is an implementation of multi-headed attention
based on "Attention is all you Need".
If `from_tensor` and `to_tensor` are the same, then
this is self-attention. Each timestep in `from_tensor` attends to the
corresponding sequence in `to_tensor`, and returns a fixed-width vector.
This function first projects `from_tensor` into a "query" tensor and
`to_tensor` into "key" and "value" tensors. These are (effectively) a list
of tensors of length `num_attention_heads`, where each tensor is of shape
[batch_size, seq_length, size_per_head].
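To make those shapes concrete, here is a minimal NumPy sketch of the projection step (an illustration of the math only, not the TensorFlow code in modeling.py; the weight matrices w_q, w_k and w_v are hypothetical stand-ins for the learned "query"/"key"/"value" dense layers):

import numpy as np

batch_size, from_seq_length, to_seq_length = 2, 4, 4
from_width = to_width = 768
num_attention_heads, size_per_head = 12, 64

from_tensor = np.random.randn(batch_size, from_seq_length, from_width)
to_tensor = np.random.randn(batch_size, to_seq_length, to_width)

# Hypothetical projection weights standing in for the dense query/key/value layers.
w_q = np.random.randn(from_width, num_attention_heads * size_per_head) * 0.02
w_k = np.random.randn(to_width, num_attention_heads * size_per_head) * 0.02
w_v = np.random.randn(to_width, num_attention_heads * size_per_head) * 0.02

# Project, then split the last dimension into (num_attention_heads, size_per_head):
# effectively one [batch_size, seq_length, size_per_head] slice per head.
query = (from_tensor @ w_q).reshape(batch_size, from_seq_length,
                                    num_attention_heads, size_per_head)
key = (to_tensor @ w_k).reshape(batch_size, to_seq_length,
                                num_attention_heads, size_per_head)
value = (to_tensor @ w_v).reshape(batch_size, to_seq_length,
                                  num_attention_heads, size_per_head)

print(query.shape)  # (2, 4, 12, 64)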
Then, the query and key tensors are dot-producted and scaled. These are
softmaxed to obtain attention probabilities. The value tensors are then
interpolated by these probabilities, then concatenated back to a single
tensor and returned.
(The scaling divides the raw dot products by sqrt(size_per_head) so the
softmax stays well-behaved; "interpolated" simply means a probability-weighted
sum of the value vectors.)
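The scale/softmax/interpolate step can be written out the same way. A self-contained NumPy sketch (hypothetical sizes; the softmax helper is defined inline):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

B, N, F, T, H = 2, 12, 4, 4, 64          # batch, heads, from_seq, to_seq, size_per_head
query = np.random.randn(B, N, F, H)
key = np.random.randn(B, N, T, H)
value = np.random.randn(B, N, T, H)

# Dot-product the queries with the keys and scale by 1/sqrt(size_per_head).
scores = query @ key.transpose(0, 1, 3, 2) / np.sqrt(H)      # [B, N, F, T]
# Softmax over the "to" positions yields the attention probabilities.
probs = softmax(scores, axis=-1)
# "Interpolate" the values: a probability-weighted sum over the "to" positions.
context = probs @ value                                       # [B, N, F, H]
# Concatenate the heads back into a single tensor.
output = context.transpose(0, 2, 1, 3).reshape(B, F, N * H)   # [B, F, N*H]
print(output.shape)  # (2, 4, 768)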
In practice, the multi-headed attention is done with transposes and
reshapes rather than actual separate tensors.
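In other words, the per-head tensors are never materialized as a Python list. A single tensor is kept in a flattened 2D layout of shape [batch*seq, heads*size], reshaped to [batch, seq, heads, size] and transposed to [batch, heads, seq, size], so that one batched matrix multiply covers every head at once. A small NumPy illustration of that layout change (hypothetical sizes):

import numpy as np

B, S, N, H = 2, 4, 12, 64
query_2d = np.random.randn(B * S, N * H)        # flattened 2D layout

# Reshape to [batch, seq, heads, size], then transpose to [batch, heads, seq, size].
query_4d = query_2d.reshape(B, S, N, H).transpose(0, 2, 1, 3)
print(query_4d.shape)  # (2, 12, 4, 64)

# The equivalent "list of per-head tensors" view, never built explicitly:
per_head = [query_4d[:, h] for h in range(N)]   # N arrays of shape [batch, seq, size]
assert per_head[0].shape == (B, S, H)

This flattened [batch*seq, heads*size] layout is also what the do_return_2d_tensor flag in the argument list refers to: when it is True, the concatenated output is returned in that 2D shape instead of being reshaped back to [batch, from_seq, heads*size].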
Args: (the input arguments)
from_tensor: float Tensor of shape [batch_size, from_seq_length,
from_width].
to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
The attention mask:
attention_mask: (optional) int32 Tensor of shape [batch_size,
from_seq_length, to_seq_length]. The values should be 1 or 0. The
attention scores will effectively be set to -infinity for any positions in
the mask that are 0, and will be unchanged for positions that are 1.
(A short sketch of this masking trick appears after the argument list below.)
The number of attention heads:
num_attention_heads: int. Number of attention heads.
The size of each attention head:
size_per_head: int. Size of each attention head.
Optional activation functions for the query, key, and value transforms:
query_act: (optional) Activation function for the query transform.
key_act: (optional) Activation function for the key transform.
value_act: (optional) Activation function for the value transform.
Dropout rate applied to the attention probabilities:
attention_probs_dropout_prob: (optional) float. Dropout probability of the
attention probabilities.
Standard deviation of the (truncated normal) weight initializer:
initializer_range: float. Range of the weight initializer.
Whether to return a 2D tensor:
do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
* from_seq_length, num_attention_heads * size_per_head]. If False, the
output will be of shape [batch_size, from_seq_length, num_attention_heads
* size_per_head].
Batch size and the from/to sequence lengths (used when the inputs are passed in 2D):
batch_size: (Optional) int. If the input is 2D, this might be the batch size
of the 3D version of the `from_tensor` and `to_tensor`.
from_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `from_tensor`.
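As noted for attention_mask above, masked positions are effectively sent to -infinity before the softmax. In BERT-style code this is usually done by adding a large negative constant to the scores rather than a literal -inf; the snippet below is a minimal NumPy sketch of that idea (the -10000.0 constant and the toy sizes are illustrative):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.randn(1, 3, 3)                 # [batch, from_seq, to_seq] raw scores
attention_mask = np.array([[[1, 1, 0],
                            [1, 1, 0],
                            [1, 1, 0]]])          # 0 = this position may not be attended to

# Adding a large negative number to masked positions drives their
# post-softmax probability to (practically) zero, mimicking -infinity.
adder = (1.0 - attention_mask.astype(np.float64)) * -10000.0
probs = softmax(scores + adder, axis=-1)
print(probs[0].round(3))                           # the last column is ~0 in every row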
