Swish was introduced in October 2017 as an alternative activation function to ReLU. It was found using a combination of exhaustive search and reinforcement learning. In the original paper [1], simply replacing all ReLU activations with Swish improved top-1 classification accuracy on ImageNet by 0.9%. Nonetheless, Swish is very easy to implement: a single line of code is enough in TensorFlow:
x1 = tf.nn.conv2d(X, W1, strides=[1, 1, 1, 1], padding='SAME') + B1
Y1 = x1 * tf.nn.sigmoid(beta1 * x1)  # Swish: x * sigmoid(beta * x); output is 28x28
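Here beta1 is a trainable scalar. A minimal sketch of how it could be defined, assuming one beta per layer as in the trainable-beta variant of the paper:

import tensorflow as tf

# One trainable beta per layer; the initial value of 1.0 is our assumption
beta1 = tf.Variable(1.0, dtype=tf.float32, name='beta1')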
During the initial phase of training, the loss stays, on average, the same. This suggests that Swish suffers from poor initialisation, at least when the weights are initially drawn from a normal distribution with a standard deviation of 0.1.
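For reference, an initialisation along these lines might look as follows; the [5, 5, 1, 4] kernel shape is a hypothetical example, not taken from our actual model:

# Weights drawn from a normal distribution with standard deviation 0.1
W1 = tf.Variable(tf.random_normal([5, 5, 1, 4], stddev=0.1))
B1 = tf.Variable(tf.zeros([4]))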
We were unable to replicate the results reported in the Swish paper: beta1 did not converge near 1 for us, perhaps because we did not train our model for long enough.
It seems that He initialisation doesn't really help with this problem.
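For completeness, a sketch of He initialisation for the same hypothetical kernel shape, scaling the standard deviation to sqrt(2 / fan_in):

import numpy as np

# He initialisation: stddev = sqrt(2 / fan_in), where
# fan_in = kernel_height * kernel_width * input_channels
fan_in = 5 * 5 * 1  # hypothetical kernel shape from the sketch above
W1 = tf.Variable(tf.random_normal([5, 5, 1, 4], stddev=np.sqrt(2.0 / fan_in)))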
After changing from SGD to RMSprop we immediately get better results.
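The switch itself is a one-line change; cross_entropy and learning_rate below are hypothetical names for the loss tensor and the learning rate:

# Before: plain SGD
# train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
# After: RMSprop, which adapts the step size per parameter
train_step = tf.train.RMSPropOptimizer(learning_rate).minimize(cross_entropy)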
- [1] Prajit Ramachandran, Barret Zoph, Quoc V. Le. Searching for Activation Functions. arXiv:1710.05941, 2017. https://arxiv.org/abs/1710.05941