Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism
Softmax is widely used in neural networks for multiclass classification, gate structure, and attention mechanisms. The statistical assumption that the input is normally distributed supports the gradient stability of softmax. However, when used in attention mechanisms such as transformers, because th...
Main Authors:
Format: Article
Language: English
Published: IEEE, 2021-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/9662308/
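The abstract points to gradient vanishing when softmax is applied to attention scores. As a rough illustration of that effect (a minimal NumPy sketch, not taken from the article; the `softmax` and `softmax_jacobian` helpers and the score scales are assumptions for demonstration), the Jacobian of softmax shrinks toward zero as the magnitude of the input scores grows, which is the saturation behaviour the paper's periodic alternatives aim to escape:

```python
# Minimal sketch (not from the article): why gradients through softmax
# attention weights can vanish when score magnitudes grow.
import numpy as np

def softmax(x):
    z = x - x.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(p):
    # d softmax_i / d x_j = p_i * (delta_ij - p_j)
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(0)
raw_scores = rng.standard_normal(8)   # stand-in for one row of attention scores

for scale in (1.0, 5.0, 20.0):
    p = softmax(scale * raw_scores)
    J = softmax_jacobian(p)
    # As the scale grows, softmax saturates toward a one-hot distribution
    # and every Jacobian entry shrinks, so gradients flowing back through
    # the attention weights vanish.
    print(f"scale={scale:5.1f}  max |dp/dx| = {np.abs(J).max():.2e}")
```

Running the sketch shows the largest Jacobian entry dropping by orders of magnitude as the scale increases, consistent with the gradient-vanishing issue the article attributes to softmax in attention.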