Anyone in machine learning will tell you that the most important part of training any network architecture is the data set used to do so. It needs to be varied so that the network learns to recognize the relevant attributes in several different contexts, which means the volume of data required is considerable.
Our job is to train an architecture which will then be deployed on a CCTV network to create custom-made alarm systems that trigger only in certain circumstances, reducing the number of false alarms we are currently observing.
The final product should be able to recognize several situations and objects in the streams, but as an initial approach to our dataset we gathered footage from the CCTV network and tagged exclusively the images in which people were present. The footage collector is automated and generates tags for pictures in which it finds people; these tags are then manually checked by a member of our team using CVAT, the best open source computer vision annotation tool we found.
There are several ways of tagging pictures (bounding boxes, polygons and splines, among others), but we found bounding boxes to be the most suitable, since we could apply our pretagging method directly to the footage as it was being saved.
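To make the pretagging idea concrete, here is a minimal sketch of how bounding boxes might be represented and filtered down to people before manual review. The `BoundingBox` class, the `pretag_people` helper, the 0.5 confidence threshold and the sample detections are all hypothetical names and values chosen for illustration; they are not our actual collector code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in pixel coordinates (top-left and bottom-right corners)."""
    label: str
    confidence: float
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        # Clamp to zero so malformed boxes never report a negative area.
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def pretag_people(detections, min_confidence=0.5):
    """Keep only non-empty 'person' boxes at or above the (assumed) threshold."""
    return [
        box
        for box in detections
        if box.label == "person"
        and box.confidence >= min_confidence
        and box.area() > 0
    ]


# Hypothetical detector output for a single saved frame.
frame_detections = [
    BoundingBox("person", 0.91, 120, 40, 180, 210),
    BoundingBox("car", 0.88, 300, 150, 520, 300),
    BoundingBox("person", 0.32, 10, 10, 30, 60),
]
pretags = pretag_people(frame_detections)
```

In a setup like this, only the boxes left in `pretags` would be handed to a human reviewer, which keeps the manual check focused on confirming or correcting people rather than drawing every box from scratch.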
We brainstormed as to what variability the data would face given the nature of the job the network would have to do, and we came up with 3 major factors:
- Light: The cameras’ sensitivity changes with light, especially artificial versus natural light. The data set should include pictures taken during both the day and the night.
- Weather conditions: Water on the lenses creates large blurry areas which make the task of detection difficult. Furthermore, fog reduces visibility significantly, so the data set should include pictures taken on sunny, rainy, stormy and foggy days to account for all cases.
- Angles: Cameras angled at 180° were excluded, since the shapes in these cases are significantly warped and would be difficult to learn.
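One way to check that a collection actually covers the light and weather factors above is to count images per combination and list the combinations that are missing. The sketch below assumes each image carries hypothetical `light` and `weather` metadata fields; the field names, category values and sample records are illustrative assumptions, not a description of our real metadata.

```python
from collections import Counter
from itertools import product

# Assumed category values, taken from the factors listed above.
LIGHT = ("day", "night")
WEATHER = ("sunny", "rainy", "stormy", "foggy")


def coverage_gaps(images, min_per_cell=1):
    """Return the (light, weather) combinations with fewer than min_per_cell images."""
    counts = Counter((img["light"], img["weather"]) for img in images)
    return [
        cell for cell in product(LIGHT, WEATHER) if counts[cell] < min_per_cell
    ]


# Hypothetical metadata records for a handful of collected images.
sample = [
    {"camera": "cam01", "light": "day", "weather": "sunny"},
    {"camera": "cam02", "light": "day", "weather": "rainy"},
    {"camera": "cam03", "light": "night", "weather": "foggy"},
    {"camera": "cam01", "light": "night", "weather": "rainy"},
]
gaps = coverage_gaps(sample)
```

Here `gaps` lists the conditions that still need footage, which is a cheap sanity check to run before spending time on manual tag review.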
With this in mind, we gathered around 5000 images from over 40 different cameras over the course of a week, making sure to include rainy and foggy days as well. Checking the tags was long and time-consuming, so we had several members of our team working on it.
What comes next? Well, training a sample TLT pretrained network. Come back next week to see how that went.