Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism

Softmax is widely used in neural networks for multiclass classification, gating structures, and attention mechanisms. The statistical assumption that the input is normally distributed supports the gradient stability of softmax. However, when used in attention mechanisms such as transformers, because th...
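
A minimal sketch (not taken from the paper; it assumes NumPy and toy randomly generated scores) of the saturation effect the abstract alludes to: when the scores fed to softmax stray from the assumed well-scaled normal distribution, for example by growing in magnitude, the softmax output saturates toward one-hot and the norm of its Jacobian, and hence the gradient passed back through the attention weights, collapses toward zero.

    # Illustrative only: softmax saturation and the resulting gradient shrinkage.
    import numpy as np

    def softmax(x):
        z = x - x.max()            # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def softmax_jacobian(s):
        # J_ij = s_i * (delta_ij - s_j) for a softmax output vector s
        return np.diag(s) - np.outer(s, s)

    rng = np.random.default_rng(0)
    for scale in (1.0, 5.0, 20.0):     # larger scale = scores further from the assumed distribution
        scores = scale * rng.standard_normal(8)
        s = softmax(scores)
        grad_norm = np.linalg.norm(softmax_jacobian(s))
        print(f"scale={scale:5.1f}  max prob={s.max():.3f}  ||Jacobian||={grad_norm:.2e}")

As the scale grows, the printed Jacobian norm shrinks by orders of magnitude, which is the gradient-vanishing behaviour that motivates replacing softmax with the periodic alternatives studied in the article.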


Bibliographic Details
Main Authors: Shulun Wang, Feng Liu, Bin Liu
Format: Article
Language: English
Published: IEEE, 2021-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/9662308/