It’s taken weeks of gathering, tagging, retagging, checking, resizing and renaming thousands of pictures. You’ve carefully trained various models with mixed success, you’ve cursed at the air, but you finally selected the best candidate. Now what?
Even though the evaluation phase of training provides some insight into the performance of the algorithm and the quality of its predictions, we found that it mostly falls short of the big picture: another performance metric needed to be introduced. But what are we testing for, exactly?
- False positives: the model detects an instance of a class that is not actually present in a frame. Even though this triggers unnecessary alarms that could be avoided, it does not mean a dangerous scenario goes undetected.
- False negatives: the model fails to detect an instance of a class that is present in a frame. This is the case that worries us the most, since it means a potentially dangerous situation has been overlooked. We decided the first metric to be introduced needed to measure this case specifically, since it relates directly to the reliability of the model.
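The two cases above can be sketched with a few lines of Python. This is an illustrative toy, not our actual evaluation code: the per-frame presence flags are made up, and real detection models report boxes per class rather than a single boolean.

```python
# Hypothetical sketch: counting false positives and false negatives
# from per-frame presence/absence flags. Each entry pairs the
# ground-truth flag ("is the class really in this frame?") with the
# model's prediction.
frames = [
    {"truth": True,  "predicted": True},   # true positive
    {"truth": False, "predicted": True},   # false positive: spurious alarm
    {"truth": True,  "predicted": False},  # false negative: danger overlooked
    {"truth": False, "predicted": False},  # true negative
]

# A false positive is a prediction with no matching ground truth;
# a false negative is ground truth with no matching prediction.
false_positives = sum(f["predicted"] and not f["truth"] for f in frames)
false_negatives = sum(f["truth"] and not f["predicted"] for f in frames)

print(false_positives, false_negatives)  # -> 1 1
```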
How can we measure false negatives? Well, we came up with a script that automates this testing in the following manner:
- We put together a “testing set” consisting of videos and images covering diverse scenarios: different weather, cameras, lighting and situations, all previously tagged by hand by members of our staff. We know the total number of frames in the set and the total number of labels.
- We built a program that loads the model, runs it through the testing set and counts the detections the model generates. It then outputs the net detection count, so we can begin to compare the manual vs. AI ratio.
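The steps above can be sketched roughly as follows. The `detect` interface and the stub model are assumptions for illustration; our actual script works against a real trained network and a tagged testing set.

```python
# Minimal sketch of the testing script described above, assuming a
# `detect(frame_path)` callable that returns a list of detections.
from pathlib import Path

def count_detections(detect, frame_paths):
    """Run the model over every frame and sum its detections."""
    return sum(len(detect(path)) for path in frame_paths)

def detection_ratio(detect, frame_paths, manual_label_count):
    """Ratio of AI detections to manual labels; ideally close to 1."""
    return count_detections(detect, frame_paths) / manual_label_count

# Usage with a stub model that "detects" one object per frame,
# against a set tagged with 120 manual labels over 100 frames:
frames = [Path(f"frame_{i}.jpg") for i in range(100)]
stub_model = lambda path: ["person"]
print(round(detection_ratio(stub_model, frames, manual_label_count=120), 2))  # -> 0.83
```

A ratio below one suggests missed detections (false negatives), while a ratio above one suggests spurious ones (false positives), which is why the net count alone is only a first approximation.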
Ideally this ratio would be one, but realistically that won’t be the case, given the network isn’t fault-proof. Future versions of this testing algorithm will output the detections for each frame, so a human can manually contrast the outputs and pick out any especially challenging scenarios we could add to enrich our dataset and make the model more reliable.
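One way the planned per-frame output could look is sketched below. The field names and the stub model are illustrative assumptions, not the actual design: the idea is simply to flag every frame where the model's count disagrees with the manual labels, so a reviewer only inspects the mismatches.

```python
# Hypothetical per-frame report: compare the model's detection count
# against the manual label count for each frame and flag mismatches.
def per_frame_report(detect, frames_with_labels):
    report = []
    for frame_id, manual_count in frames_with_labels:
        model_count = len(detect(frame_id))
        report.append({
            "frame": frame_id,
            "manual": manual_count,
            "model": model_count,
            "mismatch": model_count != manual_count,
        })
    return report

# Stub model: detects one "person" everywhere except frame "f2".
stub_model = lambda frame_id: [] if frame_id == "f2" else ["person"]
report = per_frame_report(stub_model, [("f1", 1), ("f2", 1), ("f3", 0)])
flagged = [r["frame"] for r in report if r["mismatch"]]
print(flagged)  # -> ['f2', 'f3']
```

Here "f2" is a missed detection (false negative) and "f3" a spurious one (false positive), exactly the two cases the reviewer would want to inspect.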