# Using Activation Functions in Neural Networks

Activation functions play an integral role in neural networks by introducing non-linearity. This nonlinearity allows neural networks to develop complex representations and functions based on the inputs that would not be possible with a simple linear regression model.

There have been many different non-linear activation functions proposed throughout the history of neural networks. In this post, we will explore three popular ones: sigmoid, tanh, and ReLU.

After reading this article, you will learn:

* Why nonlinearity is important in a neural network
* How different activation functions can contribute to the vanishing gradient problem
* Sigmoid, tanh, and ReLU activation functions
* How to use different activation functions in your TensorFlow model

Let’s get started.

## Overview

This article is split into five sections; they are:

1. Why do we need nonlinear activation functions
2. Sigmoid function and vanishing gradient
3. Hyperbolic tangent function
4. Rectified Linear Unit (ReLU)
5. Using the activation functions in practice

## Why Do We Need Nonlinear Activation Functions

You might be wondering, why all this hype about nonlinear activation functions? Or why can’t we just use an identity function after the weighted linear combination of activations from the previous layer? The reason is that stacking multiple linear layers is equivalent to using a single linear layer. This can be seen through a simple example. Let’s say we have a neural network with one hidden layer containing two hidden neurons.

If the hidden layer is linear, we can rewrite the output layer as a linear combination of the original input variable. If we had more neurons and weights, the equation would be a lot longer, with more nesting and more multiplications between successive layer weights, but the idea remains the same: we can represent the entire network as a single linear layer. For the network to represent more complex functions, we need nonlinear activation functions. Let’s start with a popular example, the sigmoid function.
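The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch in plain Python, assuming a single scalar input, two hidden neurons with an identity activation, and one output neuron. All weights and biases are made-up values chosen only for illustration.

```python
# Hypothetical parameters for a 1-input, 2-hidden-neuron, 1-output network.
w_hidden = [0.5, -1.5]   # input -> hidden weights (one per hidden neuron)
b_hidden = [0.1, 0.2]    # hidden biases
w_out = [2.0, 3.0]       # hidden -> output weights
b_out = -0.4             # output bias


def two_layer_linear(x):
    """Forward pass with an identity (linear) activation in the hidden layer."""
    hidden = [w * x + b for w, b in zip(w_hidden, b_hidden)]
    return sum(v * h for v, h in zip(w_out, hidden)) + b_out


# Fold both layers into one weight and one bias by expanding the expression above.
w_collapsed = sum(v * w for v, w in zip(w_out, w_hidden))
b_collapsed = sum(v * b for v, b in zip(w_out, b_hidden)) + b_out


def single_layer_linear(x):
    """Equivalent single linear layer."""
    return w_collapsed * x + b_collapsed


for x in [-2.0, 0.0, 1.5]:
    print(x, two_layer_linear(x), single_layer_linear(x))
```

For every input, the two functions agree up to floating-point rounding, which is why the extra linear layer adds no representational power.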

## Sigmoid Function and Vanishing Gradient

The sigmoid activation function is a popular choice of nonlinear activation function for neural networks. One reason for its popularity is that its output values lie between 0 and 1, which mimic probability values. Hence, it is used to convert the real-valued output of a linear layer into a probability. This has also made it an important part of logistic regression, which can be used directly for binary classification.

The sigmoid function is commonly represented by $\sigma$ and has the form $\sigma(x) = \frac{1}{1 + e^{-x}}$. In TensorFlow, we can call the sigmoid function from the Keras library as follows:

import tensorflow as tf

from tensorflow.keras.activations import sigmoid

input_array = tf.constant([-1, 0, 1], dtype=tf.float32)

print (sigmoid(input_array))

This gives us the output:

tf.Tensor([0.26894143 0.5 0.7310586 ], shape=(3,), dtype=float32)

We can also plot the sigmoid function as a function of $x$,

When looking at the activation function for the neurons in a neural network, we should also be interested in its derivative due to backpropagation and the chain rule which would affect how the neural network learns from data.

Here, we can observe that the gradient of the sigmoid function is always between 0 and 0.25. As $x$ tends to positive or negative infinity, the gradient tends to zero. This can contribute to the vanishing gradient problem: when the input to the sigmoid has a large magnitude (e.g., due to the output from earlier layers), the gradient is too small to drive a meaningful correction.
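These bounds follow from the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. As a small sketch, the snippet below evaluates that derivative at a few sample points. It uses plain Python rather than TensorFlow so that it runs without extra dependencies, and the sample inputs are arbitrary.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid, using sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x = {x:6.1f}  gradient = {sigmoid_grad(x):.6f}")
```

The largest value appears at $x = 0$, where the gradient is exactly 0.25, and the values at $x = \pm 10$ are already several orders of magnitude smaller.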

Vanishing gradient is a problem because we use the chain rule in backpropagation in deep neural networks. Recall that in neural networks, the gradient (of the loss function) at each layer is the gradient at its subsequent layer multiplied by the gradient of its activation function. As there are many layers in the network, if the gradients of the activation functions are less than 1, the gradient at a layer far from the output will be close to zero. And any layer with a gradient close to zero will stop the gradient from propagating further back to the earlier layers.
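To get a feel for how quickly such a product shrinks, here is a rough sketch that multiplies the largest possible sigmoid gradient (0.25) across a hypothetical number of layers. It deliberately ignores the weight terms in the chain rule, so it illustrates the activation factor only and is not a full backpropagation computation.

```python
MAX_SIGMOID_GRAD = 0.25  # largest value the sigmoid derivative can take (at x = 0)


def activation_gradient_bound(num_layers):
    """Upper bound on the product of sigmoid derivatives across num_layers layers."""
    return MAX_SIGMOID_GRAD ** num_layers


# The layer counts below are arbitrary examples.
for depth in [1, 5, 10, 20]:
    bound = activation_gradient_bound(depth)
    print(f"{depth:2d} layers: activation factor <= {bound:.3e}")
```

Even in this best case for the sigmoid, the factor contributed by the activations alone decays exponentially with depth.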

Since the gradient of the sigmoid function is always less than 1 (at most 0.25), a network with more layers would exacerbate the vanishing gradient problem. Furthermore, there is a saturation region where the gradient of the sigmoid tends to 0, which is where the magnitude of $x$ is large. So, if the weighted sum of activations from the previous layer is large in magnitude, only a very small gradient propagates through this neuron, because the derivative of the activation $a$ with respect to the input of the activation function is small (in the saturation region).

Granted, the chain rule also includes the derivative of the linear term with respect to the previous layer’s activations, which might be greater than 1 since the weights might be large and the result is a sum of derivatives from different neurons. However, vanishing gradients can still be a concern at the start of training, as weights are usually initialized to be small.
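To make the two competing factors explicit, consider a simplified chain of single neurons. This simplification and the symbols $z_l$, $a_l$, $w_l$, and $b_l$ are assumptions introduced only for this sketch, with $z_l = w_l a_{l-1} + b_l$ and $a_l = \sigma(z_l)$. The chain rule then gives:

$$
\frac{\partial L}{\partial a_{l-1}} = \frac{\partial L}{\partial a_l} \cdot \sigma'(z_l) \cdot w_l
$$

Each layer therefore scales the gradient by $\sigma'(z_l)\, w_l$. Since $\sigma'(z_l) \le 0.25$, the magnitude of this factor stays below 1 whenever $|w_l| < 4$, which is typically the case with small initial weights.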

## Hyperbolic Tangent Function

Another activation function we can consider is the tanh activation function, otherwise known as the hyperbolic tangent function. It has a larger range of output values than the sigmoid function (between -1 and 1) and a larger maximum gradient as well. The tanh function is the hyperbolic analogue of the normal tangent function for circles that most people are familiar with.

Plotting out the tanh function,

Let’s look at the gradient as well,

Notice that the gradient now has a maximum value of 1, compared to the sigmoid function, where the largest gradient value is 0.25. This makes a network with tanh activation less susceptible to the vanishing gradient problem. However, the tanh function also has a saturation region, where the value of the gradient tends towards 0 as the magnitude of the input $x$ gets larger.
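The same kind of check used for the sigmoid can be applied here, using the identity $\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$. The following is a small plain-Python sketch with arbitrary sample inputs.

```python
import math


def tanh_grad(x):
    """Derivative of tanh, using d/dx tanh(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


# The gradient is 1 at x = 0 and falls towards 0 in the saturation regions.
for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"x = {x:5.1f}  gradient = {tanh_grad(x):.6f}")
```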

In TensorFlow, we can implement the tanh activation on a tensor using the tanh function in Keras’ activations module:

import tensorflow as tf

from tensorflow.keras.activations import tanh

input_array = tf.constant([-1, 0, 1], dtype=tf.float32)

print (tanh(input_array))

which gives the output

tf.Tensor([-0.7615942 0. 0.7615942], shape=(3,), dtype=float32)

## Rectified Linear Unit (ReLU)

The last activation function we’ll look at in detail is the Rectified Linear Unit, also popularly known as ReLU. It has become popular due to its relatively simple computation, which helps speed up neural networks, and because it seems to give empirically good performance. This makes it a good starting choice for the activation function.

The ReLU function is a simple $\max(0, x)$ function, which can also be thought of as a piecewise function with all inputs less than 0 mapping to 0 and all inputs greater than or equal to 0 mapping back to themselves (i.e., identity function). Graphically,

Next up, we can also look at the gradient of the ReLU function:

Notice that the gradient of ReLU is 1 whenever the input is positive, which helps address the vanishing gradient problem. However, whenever the input is negative, the gradient is 0. This can cause another problem, the dead neuron (or dying ReLU) problem, which arises when a neuron is **persistently inactivated**. In this case, the neuron is unable to learn: its weights are never updated because the chain rule includes a 0 gradient as one of its terms. If this happens for all data in your dataset, it can be very difficult for this neuron to learn from your dataset unless the activations in the previous layer change such that the neuron is no longer “dead”.
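A dead neuron can be illustrated with a small sketch in plain Python. The weight, bias, and inputs below are hypothetical values picked so that the pre-activation is negative for every input, and the gradient at exactly $x = 0$ is taken to be 0, which is a convention assumed here.

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU; the value at exactly x = 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0


# Hypothetical neuron whose large negative bias keeps its pre-activation below zero.
weight, bias = 0.5, -10.0
inputs = [-3.0, -1.0, 0.0, 2.0, 4.0]

pre_activations = [weight * x + bias for x in inputs]
gradients = [relu_grad(z) for z in pre_activations]

print("pre-activations:", pre_activations)
print("gradients:      ", gradients)
```

Because every gradient is 0, no update signal flows through this neuron for any of these inputs, so its weight and bias would stay where they are.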

To use the ReLU activation in TensorFlow,

import tensorflow as tf

from tensorflow.keras.activations import relu

input_array = tf.constant([-1, 0, 1], dtype=tf.float32)

print (relu(input_array))

which gives us the output:

tf.Tensor([0. 0. 1.], shape=(3,), dtype=float32)

Across the three activation functions we reviewed above, we see that they are all monotonically increasing functions. Monotonicity is not strictly required for gradient descent, but it keeps the sign of the activation’s gradient consistent, which tends to make optimization better behaved.

Now that we’ve explored some common activation functions and how to use them in TensorFlow, let’s take a look at how we can use these in practice in an actual model.

## Using Activation Functions in Practice

Before we explore the use of activation functions in a full model, let’s look at another common way to apply them: combining them with another Keras layer. Let’s say we want to add a ReLU activation on top of a Dense layer. Following the approach shown above, one way to do this is:

x = Dense(units=10)(input_layer)

x = relu(x)

However, for many Keras layers, we can also use a more compact representation to add the activation on top of the layer:

x = Dense(units=10, activation="relu")(input_layer)

Using this more compact representation, let’s build our LeNet5 model using Keras:

import tensorflow as tf

import tensorflow.keras as keras

from tensorflow.keras.layers import Dense, Input, Flatten, Conv2D, BatchNormalization, MaxPool2D

from tensorflow.keras.models import Model

(trainX, trainY), (testX, testY) = keras.datasets.cifar10.load_data()

input_layer = Input(shape=(32,32,3,))

x = Conv2D(filters=6, kernel_size=(5,5), padding="same", activation="relu")(input_layer)

x = MaxPool2D(pool_size=(2,2))(x)

x = Conv2D(filters=16, kernel_size=(5,5), padding="same", activation="relu")(x)

x = MaxPool2D(pool_size=(2, 2))(x)

x = Conv2D(filters=120, kernel_size=(5,5), padding="same", activation="relu")(x)

x = Flatten()(x)

x = Dense(units=84, activation="relu")(x)

x = Dense(units=10, activation="softmax")(x)

model = Model(inputs=input_layer, outputs=x)

print(model.summary())

model.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=["acc"])

history = model.fit(x=trainX, y=trainY, batch_size=256, epochs=10, validation_data=(testX, testY))

And running this code gives us the output

Model: "model"

_________________________________________________________________

Layer (type) Output Shape Param #

=================================================================

input_1 (InputLayer) [(None, 32, 32, 3)] 0

conv2d (Conv2D) (None, 32, 32, 6) 456

max_pooling2d (MaxPooling2D (None, 16, 16, 6) 0

)

conv2d_1 (Conv2D) (None, 16, 16, 16) 2416

max_pooling2d_1 (MaxPooling (None, 8, 8, 16) 0

2D)

conv2d_2 (Conv2D) (None, 8, 8, 120) 48120

flatten (Flatten) (None, 7680) 0

dense (Dense) (None, 84) 645204

dense_1 (Dense) (None, 10) 850

=================================================================

Total params: 697,046

Trainable params: 697,046

Non-trainable params: 0

_________________________________________________________________

None

Epoch 1/10

196/196 [==============================] - 14s 11ms/step - loss: 2.9758 - acc: 0.3390 - val_loss: 1.5530 - val_acc: 0.4513

Epoch 2/10

196/196 [==============================] - 2s 8ms/step - loss: 1.4319 - acc: 0.4927 - val_loss: 1.3814 - val_acc: 0.5106

Epoch 3/10

196/196 [==============================] - 2s 8ms/step - loss: 1.2505 - acc: 0.5583 - val_loss: 1.3595 - val_acc: 0.5170

Epoch 4/10

196/196 [==============================] - 2s 8ms/step - loss: 1.1127 - acc: 0.6094 - val_loss: 1.2892 - val_acc: 0.5534

Epoch 5/10

196/196 [==============================] - 2s 8ms/step - loss: 0.9763 - acc: 0.6594 - val_loss: 1.3228 - val_acc: 0.5513

Epoch 6/10

196/196 [==============================] - 2s 8ms/step - loss: 0.8510 - acc: 0.7017 - val_loss: 1.3953 - val_acc: 0.5494

Epoch 7/10

196/196 [==============================] - 2s 8ms/step - loss: 0.7361 - acc: 0.7426 - val_loss: 1.4123 - val_acc: 0.5488

Epoch 8/10

196/196 [==============================] - 2s 8ms/step - loss: 0.6060 - acc: 0.7894 - val_loss: 1.5356 - val_acc: 0.5435

Epoch 9/10

196/196 [==============================] - 2s 8ms/step - loss: 0.5020 - acc: 0.8265 - val_loss: 1.7801 - val_acc: 0.5333

Epoch 10/10

196/196 [==============================] - 2s 8ms/step - loss: 0.4013 - acc: 0.8605 - val_loss: 1.8308 - val_acc: 0.5417

And that’s how we can use different activation functions in our TensorFlow models!

## Further Reading

Other examples of activation functions:

* Leaky ReLU (ReLU where the negative side has a non-zero gradient): https://www.tensorflow.org/api_docs/python/tf/keras/layers/LeakyReLU
* Parametric ReLU (the non-zero gradient on the negative side is a learned parameter): https://arxiv.org/abs/1502.01852
* Maxout unit: https://arxiv.org/abs/1302.4389

## Summary

In this post, you have seen why activation functions are important to allow for the complex neural networks that we see common in deep learning today. You have also seen some popular activation functions, their derivatives, and how to integrate them into your TensorFlow models.

Specifically, you learned:

* Why nonlinearity is important in a neural network
* How different activation functions can contribute to the vanishing gradient problem
* Sigmoid, tanh, and ReLU activation functions
* How to use different activation functions in your TensorFlow model

The post Using Activation Functions in Neural Networks appeared first on Machine Learning Mastery.