A review on the attention mechanism of deep learning

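After reading this review I have a basic understanding of attention. Below is a brief summary of its content and of the papers it cites that are worth following up.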

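The attention mechanism is a field with a great many branches, so it is well worth discussing by category. Start with the forms that human attention itself takes: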

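Bottom-up: can be understood as latent, unconscious attention. For example, amid noisy conversation people are most likely to hear the loudest voice. In DL this form is close to max-pooling and gating mechanisms.
Top-down: can be understood as focused attention. For example, people actively attend to a particular class of object. In DL this form is usually used for a specific task.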

By differences in model structure, attention can mainly be divided into the following categories:

Before summarizing the models of each structure, it helps to look first at the unified structure of attention models.

1. The unified structure of attention

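Vaswani et al. give a very good summary of attention: the attention mechanism is a structure that maps a query and a set of key-value pairs to an output. It computes a weight for each value by combining the query with the corresponding key, then returns the weighted sum of the values as the output.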

The attention mechanism “can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.”

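Attention can be split into two steps. The first step computes the attention weights from the keys and the query, which corresponds to the upper branch of the figure above. The second step computes the context vector from the values and their corresponding weights.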
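The two steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the formulation of any particular paper: the function names (`softmax`, `dot`, `attention`) are made up for this note, and the dot product is assumed as the score function purely for simplicity.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    # Assumed score function: plain dot product of query and key.
    return sum(a * b for a, b in zip(u, v))

def attention(query, keys, values, score=dot):
    # Step 1: attention weights from the query and the keys.
    weights = softmax([score(query, k) for k in keys])
    # Step 2: context vector as the weighted sum of the values.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

weights, context = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[2.0, 0.0], [0.0, 2.0]],
)
```

Any other score function with the same signature can be passed as `score`, which is what distinguishes most attention variants from each other.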

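Take "Neural machine translation by jointly learning to align and translate" as an example. Its authors first proposed attention and applied it to translation, with attention as one part of a Seq2Seq model. The keys and values are the same, namely the encoder outputs $h_j$ at each time step. The query is the decoder state vector from the previous step, $s_{i-1}$, and from these the weights $\alpha$ are computed.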

The decoder state vector is $s_i = f(s_{i-1}, y_{i-1}, c_i)$ (this assumes familiarity with Seq2Seq). The attention expressions used in that paper are as follows:

The context vector is $c_i = \sum\limits_{j=1}^{T_x} \alpha_{ij} h_j$

The weights are $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$, and the score function is $e_{ij} = a(s_{i-1}, h_j)$
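To make the expressions for $e_{ij}$, $\alpha_{ij}$ and $c_i$ concrete, here is a small sketch in plain Python. It assumes the additive (feed-forward) alignment model $a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$. The parameter values and the helper names (`matvec`, `additive_score`, `context_vector`) are toy choices made up for illustration, not taken from the paper.

```python
import math

def matvec(M, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def additive_score(s_prev, h_j, W_a, U_a, v_a):
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
    pre = [a + b for a, b in zip(matvec(W_a, s_prev), matvec(U_a, h_j))]
    return sum(v * math.tanh(p) for v, p in zip(v_a, pre))

def context_vector(s_prev, H, W_a, U_a, v_a):
    # Energies e_ij for every encoder output h_j.
    e = [additive_score(s_prev, h, W_a, U_a, v_a) for h in H]
    # alpha_ij = exp(e_ij) / sum_k exp(e_ik), computed stably.
    m = max(e)
    exps = [math.exp(x - m) for x in e]
    z = sum(exps)
    alpha = [x / z for x in exps]
    # c_i = sum_j alpha_ij h_j
    c = [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(len(H[0]))]
    return alpha, c

# Toy inputs (hypothetical): T_x = 3 encoder outputs of dimension 2.
s_prev = [0.5, -0.2]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_a = [[0.1, 0.2], [0.3, -0.1]]
U_a = [[0.5, 0.0], [0.0, 0.5]]
v_a = [1.0, 1.0]

alpha, c = context_vector(s_prev, H, W_a, U_a, v_a)
```

In a real model $W_a$, $U_a$ and $v_a$ would be learned jointly with the encoder and decoder; here they are fixed numbers only so that the sketch runs.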

There are many score functions that combine the keys and the query into an energy score, and each has its advantages in different settings:

2. Forms of Feature Sampling

Soft-attention

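"Neural machine translation by jointly learning to align and translate" uses soft attention: the context vector is computed as a weighted average of the values. The whole attention module is therefore differentiable with respect to its inputs, because it involves only differentiable operations on the keys, values, and query (basic arithmetic plus the softmax), and it can be trained with backpropagation.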
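The differentiability claim can be checked numerically. The sketch below uses scalar values for simplicity and compares a finite-difference gradient of the context with the closed form $\partial c / \partial e_k = \alpha_k (v_k - c)$, which follows from the softmax derivative $\partial \alpha_j / \partial e_k = \alpha_j(\delta_{jk} - \alpha_k)$. The function names are made up for this note.

```python
import math

def soft_attention(scores, values):
    # Softmax over the scores, then a weighted average of scalar values.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]
    c = sum(a * v for a, v in zip(alpha, values))
    return alpha, c

def analytic_grad(scores, values):
    # dc/de_k = alpha_k * (v_k - c)
    alpha, c = soft_attention(scores, values)
    return [a * (v - c) for a, v in zip(alpha, values)]

def numeric_grad(scores, values, eps=1e-6):
    # Central finite differences on each score e_k.
    grads = []
    for k in range(len(scores)):
        up = list(scores)
        down = list(scores)
        up[k] += eps
        down[k] -= eps
        diff = soft_attention(up, values)[1] - soft_attention(down, values)[1]
        grads.append(diff / (2 * eps))
    return grads

scores = [0.3, -1.2, 2.0]
values = [1.0, 4.0, -2.0]
max_gap = max(abs(a - n) for a, n in zip(analytic_grad(scores, values),
                                         numeric_grad(scores, values)))
```

A small `max_gap` means the context changes smoothly with the scores, so gradients can flow through the weights back to whatever produced the keys and the query.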