
Building a model

A project log for TinyML meets dog training

Learning ML on microcontrollers and perhaps building something fun on the way!

kasikkasik • 03/29/2024 at 18:59

Now that I have my dataset prepared, it's time to create a model. Many papers, tutorials, videos, and blog posts have been written about neural networks for image recognition. For me this topic is fascinating and very broad, so I am just going to summarize the key points here - what steps I took and what I learned.

Preparing and training a model is quite a complex task - choosing the architecture, setting the hyperparameters (parameters that are not learned by the model, but configured by us), minimizing overfitting. Despite the many guidelines and rules of thumb, tuning all of the parameters properly still strikes me as something of an art.

I am going to use Python and TensorFlow to build and train the model - the script is available in my repository and is called person_detection.py.

I will first focus on the network architecture itself.
The most popular neural network architecture for image recognition is the convolutional neural network (CNN). It is based on performing a convolution of the image pixels with a set of filters. This technique preserves the spatial structure of the image and helps the network extract features.
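As a toy illustration of the idea (my own sketch, not code from person_detection.py), here is a single 3x3 filter slid over a small single-channel image with tf.nn.conv2d:

import tensorflow as tf

# One 5x5 single-channel "image", shaped [batch, height, width, channels].
image = tf.reshape(tf.range(25, dtype=tf.float32), (1, 5, 5, 1))

# One 3x3 vertical-edge filter, shaped [height, width, in_channels, out_channels].
kernel = tf.reshape(tf.constant([[-1., 0., 1.],
                                 [-1., 0., 1.],
                                 [-1., 0., 1.]]), (3, 3, 1, 1))

# Stride 1, no padding: the 5x5 input shrinks to a 3x3 activation map.
activation_map = tf.nn.conv2d(image, kernel, strides=1, padding="VALID")
print(activation_map.shape)  # (1, 3, 3, 1)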

While creating the CNN architecture, it is worth keeping in mind that we are working on a model for a microcontroller, so we need to keep the number of trainable parameters low - to ensure that the model still fits in our limited memory.

In the beginning, I use a series of convolution, activation, and pooling layers - in order to extract as many features as possible. There are various parameters worth tweaking: the kernel size (the size of the filter matrix), the number of filters, padding, and stride (step). The output of a convolution layer is called an activation map. We need to keep in mind that the more filters we use, the more features are extracted - but at the same time the number and total size of the activation maps grow, and with them the number of trainable parameters. And as already mentioned, we need to be careful about that.
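The arithmetic behind that warning is simple - for a single Conv2D layer (the numbers below are just an example):

# Trainable parameters in a Conv2D layer:
#   filters * (kernel_height * kernel_width * input_channels + 1 bias)
params_8_filters = 8 * (3 * 3 * 1 + 1)     # = 80
params_32_filters = 32 * (3 * 3 * 1 + 1)   # = 320
# Four times the filters means four times the parameters - and four times
# as many activation maps to keep in RAM.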
The output of each convolution layer is then passed through an activation function - ReLU - to introduce some non-linearity (there is an interesting lecture from Stanford describing activation functions).
Next, there is a pooling layer - to downsample and minimize the size of the activation maps. Deciding how many pooling layers to use is often a matter of trial and error - we want to make the input smaller, but at the same time we need to make sure we don't lose important information.
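To see the downsampling at work (a quick sketch with made-up shapes):

import tensorflow as tf

# A dummy stack of eight 96x96 activation maps, shaped [batch, height, width, filters].
maps = tf.random.normal((1, 96, 96, 8))

# A 2x2 max-pool halves each spatial dimension.
pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(maps)
print(pooled.shape)  # (1, 48, 48, 8)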

Next, I need to convert the data from a 2D array to a 1D array (flatten) as the input to the dense (fully connected) layer. In the convolutional layers each filter only ever looks at a small spatial neighborhood of the image at a time, while in a dense layer each neuron looks at the entire input. The number of neurons is the parameter to play with here.

After that I use a dropout layer, which is considered a type of regularization (a way of penalizing model complexity) and is one of the ways to reduce overfitting. Overfitting happens when the network becomes too specialized for the training set and is not able to properly generalize to unseen data. Here you can specify the rate - the fraction of the input units to drop. The last layer is a dense layer with softmax activation, whose output gives me a probability score for each of the labels - in my case, whether a person is detected or not. These final layers appear in the extended code example below.

I build the model using:

model = tf.keras.models.Sequential([...])

and as the input I pass a list of layers, for example:

        # Convolutional layer: 8 filters, 3x3 kernel, ReLU activation.
        tf.keras.layers.Conv2D(
            8, (3, 3), activation="relu", input_shape=(IMG_WIDTH, IMG_HEIGHT, 1)
        ),

        # Max-pooling layer, using a 2x2 pool size.
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
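
The rest of the list continues in the same pattern - flatten, dense, dropout, and the softmax output. The layer sizes below are illustrative values of mine; the actual ones are in person_detection.py:

        # Flatten the 2D activation maps into a 1D vector for the dense layer.
        tf.keras.layers.Flatten(),

        # Fully connected layer - the number of neurons is the knob to tune.
        tf.keras.layers.Dense(32, activation="relu"),

        # Dropout: randomly drop a fraction of the units during training.
        tf.keras.layers.Dropout(0.5),

        # Output layer: one probability per label (person / no person).
        tf.keras.layers.Dense(2, activation="softmax"),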

When I have all this ready, I need to configure the model for training using

model.compile()

where I can specify a number of parameters. I think the most important are the optimizer, the loss function, and the metrics to track during training.
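A minimal sketch of what that call can look like (these particular choices are my assumptions, not necessarily what person_detection.py uses):

model.compile(
    optimizer="adam",                    # how the weights get updated
    loss="categorical_crossentropy",     # fits one-hot labels with a softmax output
    metrics=["accuracy"],                # tracked during training and evaluation
)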

Easy, right?
