Detecting Humans, Part III - Look Under the Hood

In the last two articles we took a broad look at human detection, controversial as the concept is, and at the hundreds of use cases in which it can help automate workflows and business processes across industries. Immense progress in computer vision and object detection research over the past 10 years has made it possible not just to classify objects better, but also to track multiple objects at the same time, re-identify objects that vanish and reappear in the camera view, and describe the behaviour of an object. In this part of the blog series, we’ll explain in simple terms how these computer vision techniques work.

When an object detection model is used to detect and track humans and their behaviour, people tend to get cautious. Is the camera following me? Where is the camera feed going? Is someone watching me all the time, and does someone know at this very minute where I am and what I am doing?

At Smartbi, this is how we see this privacy trilemma: it is understandable that people feel cautious about new technology. New technology tends to bring new types of risks that need to be assessed, there can always be a bug in the software, and if the application is online there is always a chance of a data leak.

To demystify human detection, we try to explain the technology and the idea behind it as clearly as possible, and to show how privacy and security risks can be reduced with a few additional technologies and methods.

Training object detection models

As with any object detection application, a model for detecting humans is trained on a sample set, usually tens of thousands of source images. From these the model first learns to detect elementary patterns such as edges and corners, then learns to infer bigger patterns such as body parts or clothes, until it has learned to infer the target object, which in our case is a human. Luckily for our applications, so-called backbone models are widely available in open source communities as pre-trained models, in which the elementary but data-intensive training has already been done on vast datasets that are themselves open source and freely available. This means that only the final layers of the network have to be trained on a custom dataset, which reduces the computation drastically.
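The idea of training only the final layers can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration: the two-feature "backbone" with fixed weights stands in for a real pre-trained network, and the logistic-regression "head" stands in for the final layers. A real project would use a deep learning framework and an actual pre-trained model.

```python
import math

# Hypothetical frozen "backbone": fixed weights standing in for pre-trained layers.
# They are never updated during training.
BACKBONE_WEIGHTS = ((1.0, -1.0), (0.5, 0.5))


def backbone(x):
    """Map a raw 2-value input to 2 features using the frozen weights (ReLU)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in BACKBONE_WEIGHTS]


def predict(head, x):
    """Trainable head: logistic regression on top of the backbone features."""
    feats = backbone(x)
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], feats))
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, epochs=1000, lr=0.1):
    """Fit only the head parameters with plain stochastic gradient descent."""
    head = {"weights": [0.0, 0.0], "bias": 0.0}
    for _ in range(epochs):
        for x, label in samples:
            feats = backbone(x)
            error = predict(head, x) - label
            head["weights"] = [w - lr * error * f
                               for w, f in zip(head["weights"], feats)]
            head["bias"] -= lr * error
    return head


# Tiny made-up "custom dataset": label 1 = target object, 0 = background.
samples = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((0.0, 2.0), 0), ((1.0, 3.0), 0)]
head = train_head(samples)
```

Only the three head parameters change during training, which is why fine-tuning is so much cheaper than training the whole network from scratch.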

Once we have trained a human detection model, we can build many applications on top of it by adding extra features like action detection, human tracking and human identification.

Object tracking in general is based on detecting objects in single video frames. The detections in each frame are compared with those in the previous frame, in sequential order. If the detection areas in consecutive frames overlap by more than a certain threshold (say, the overlapping area is more than 50% of both detection areas), the two detections can be considered the same object. Over a longer video shot, detected objects can thus be associated with consistent identities without any more advanced identification.
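The overlap rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tracker: the box format `(x1, y1, x2, y2)`, the function names, and the greedy first-match strategy are all assumptions made for this example.

```python
def area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap_area(a, b):
    """Area of the intersection of two boxes (0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def same_object(a, b, threshold=0.5):
    """True if the overlap covers more than `threshold` of BOTH boxes."""
    if area(a) == 0 or area(b) == 0:
        return False
    inter = overlap_area(a, b)
    return inter / area(a) > threshold and inter / area(b) > threshold


def update_tracks(tracks, detections, next_id, threshold=0.5):
    """Carry identities from the previous frame over to the new detections.

    tracks: dict mapping track id -> box in the previous frame.
    Returns the updated tracks and the next unused id.
    """
    unmatched = dict(tracks)
    new_tracks = {}
    for det in detections:
        match = None
        for track_id, box in unmatched.items():
            if same_object(box, det, threshold):
                match = track_id
                break
        if match is None:
            # No sufficiently overlapping track: treat as a new object.
            match = next_id
            next_id += 1
        else:
            del unmatched[match]
        new_tracks[match] = det
    return new_tracks, next_id


# Frame 1: two detections get ids 0 and 1.
tracks, next_id = update_tracks({}, [(0, 0, 10, 10), (50, 50, 60, 60)], 0)
# Frame 2: the first box moved slightly, the second detection is somewhere new.
tracks, next_id = update_tracks(tracks, [(2, 0, 12, 10), (100, 100, 110, 110)], next_id)
```

In the second frame the slightly shifted box keeps id 0, while the far-away detection receives a fresh id because it overlaps no existing track.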

Action detection

The objects we’re most interested in tracking are of course people. People tend to disappear from and reappear in the video frame, or an object like a wall or a piece of furniture can hide a person for a couple of seconds. This disturbs the tracker enough that it gives the person a new identity; in other words, the algorithm counts two people instead of one. These issues can be tackled with statistical inference approaches such as Bayesian filtering, or by comparing the similarity of two detections. Nowadays the features needed for this similarity comparison can be stored dynamically in the model's memory, without resorting to computationally heavy template matching.
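The similarity comparison can be sketched as follows. This is only one possible approach, shown under stated assumptions: each detection is assumed to come with an appearance feature vector (an "embedding") produced by the model, cosine similarity is used as the similarity measure, and the 0.8 threshold and the example vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reidentify(new_embedding, lost_tracks, threshold=0.8):
    """Return the id of the most similar lost track, or None if none is close enough.

    lost_tracks: dict mapping track id -> last known embedding of that person.
    """
    best_id, best_score = None, threshold
    for track_id, embedding in lost_tracks.items():
        score = cosine_similarity(new_embedding, embedding)
        if score > best_score:
            best_id, best_score = track_id, score
    return best_id


# Hypothetical embeddings of two people who recently left the view.
lost = {7: [0.9, 0.1, 0.0], 8: [0.0, 0.2, 0.9]}

returning = reidentify([0.8, 0.2, 0.1], lost)   # looks like person 7
stranger = reidentify([0.1, 0.9, 0.1], lost)    # resembles nobody we lost
```

A reappearing person whose embedding is close to a stored one gets the old identity back, so the person is counted once rather than twice.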

In action detection, the object detection model is taught to identify different body parts and track their movement relative to one another. The final layer of the model can, for example, use clustering to assign different movements of body parts to actions such as walking, running, jumping, or rolling. This allows us to understand not only that humans are in the frame, but also that some are running while others are walking and some may be sitting down. In a hospital environment where we want to enforce safety rules, for example by detecting a fall, a model like this is often enough.
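As a rough illustration of assigning movements to actions, the sketch below picks the nearest cluster centre for a movement feature. The feature definition (average ankle speed and vertical hip displacement) and all centroid values are hypothetical numbers invented for this example; in a real model the clusters would be learned from data.

```python
import math

# Hypothetical cluster centres in a 2-D movement feature space:
# (average ankle speed in m/s, vertical hip displacement in m).
# These values are illustrative only, not measured.
ACTION_CENTROIDS = {
    "sitting": (0.0, 0.0),
    "walking": (1.4, 0.05),
    "running": (3.5, 0.10),
    "jumping": (0.5, 0.40),
}


def classify_action(feature):
    """Assign a movement feature to the action with the nearest cluster centre."""
    return min(
        ACTION_CENTROIDS,
        key=lambda name: math.dist(feature, ACTION_CENTROIDS[name]),
    )


print(classify_action((1.2, 0.04)))
print(classify_action((3.0, 0.10)))
print(classify_action((0.4, 0.35)))
```

With these made-up centroids the three example features fall closest to walking, running and jumping respectively.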

Identifying humans

Finally, the most advanced yet most controversial part is human identification. Before the recent progress in deep learning, the only way to identify a person's face was to store it in memory and then compare it in the background against existing faces. This sort of template matching is computationally heavy and can easily raise privacy issues if the cached data ends up stored somewhere. Nowadays, neural networks have advanced so much that people can be identified solely by how their body parts look and how they move relative to each other. People can also easily be classified into demographic profiles based on their appearance. For an exact match, face or fingerprint recognition is still common practice in official identification processes.

In general, the more things you add to recognize in the detection process, the more computing power you need. Especially if you want to do detection in real time, you have to process about 20-60 frames per second. Luckily, edge computing devices like Raspberry Pi and Nvidia Jetson provide a good starting point for this.
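To make the frame rate requirement concrete, the short sketch below converts a frame rate into a per-frame time budget and checks whether a pipeline fits into it. The stage timings are hypothetical placeholders, not measurements from any device.

```python
def frame_budget_ms(fps):
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


def keeps_up(stage_times_ms, fps):
    """True if the summed per-frame stage times fit within the frame budget."""
    return sum(stage_times_ms) <= frame_budget_ms(fps)


# Hypothetical per-frame timings in milliseconds, for illustration only:
# detection, tracking, action classification.
stages = [12.0, 3.0, 8.0]

print(frame_budget_ms(20))   # budget at 20 frames per second
print(keeps_up(stages, 20))  # does the 23 ms pipeline fit?
print(keeps_up(stages, 60))  # and at 60 frames per second?
```

At 20 frames per second each frame has 50 ms available, while at 60 frames per second the budget shrinks to under 17 ms, so every extra recognition step added to the pipeline eats directly into that margin.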

In the final part of this human detection blog series, we will cover how object detection systems can be taken to production level with privacy and security in mind. We will also review how edge computing can help tackle some general pitfalls of cloud computing.

Joni Karras

Machine Learning Engineer