Data augmentation can have unwanted consequences. Take horizontal or vertical flipping: what could be more harmless? You can still recognize stuff when it's upside-down, can't you? It's a great data augmentation... Unless, of course, your dataset happens to involve text or symbols in any way, such as the numbers '6' and '9', or left/right arrows in traffic signs, or such obscure letters as 'd' and 'b'.
> Unless, of course, your dataset happens to involve text or symbols in any way, such as the numbers '6' and '9', or left/right arrows in traffic signs, or such obscure letters as 'd' and 'b'.
If your dataset consists of nothing but isolated 'd's and 'p's in unknown orientation, you won't be able to classify them correctly, because that is an impossible task. More commonly, though, your dataset would consist of text rather than isolated letters, and in the context of the surrounding text it's easy to determine the correct orientation, and therefore to discriminate 'd' from 'p'.
So it's not a problem, except when it is. Good to know.
Incidentally, how does that work for mirroring, when all that surrounding text gets mirrored too? (Consider the real example of the lolcats generated by Nvidia's StyleGAN, where the text captions are completely wrong, and will always be wrong, because the mirrored text looks like Cyrillic - thanks to the horizontal data flipping StyleGAN enables by default...)
I'd say you're not making the problem "harder" - rather, requiring that object detection be orientation-agnostic makes the problem exactly as hard as it actually is. Allowing the network to train only on images with a known, fixed (correct) orientation makes the object detection too _easy_, so the network will likely fail if you feed it any real-world data.
I.e., you should be training your image application with skew/stretch/shrink/rotation/color-palette-shift/contrast-adjust/noise-addition/etc. applied to all training images if you want it to be useful for anything other than getting a high top-N score on the validation set.
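Something like this, as a minimal sketch with torchvision (assuming a PyTorch pipeline; the specific ranges are arbitrary and just illustrative):

```python
import torch
from torchvision import transforms

# Illustrative training-time augmentation; tune the ranges to your data.
# Deliberately no horizontal/vertical flip, for the text/symbol reasons above.
train_transforms = transforms.Compose([
    transforms.RandomAffine(
        degrees=15,            # small rotations
        translate=(0.1, 0.1),  # shifts
        scale=(0.8, 1.2),      # shrink / stretch
        shear=10,              # skew
    ),
    transforms.ColorJitter(
        brightness=0.2, contrast=0.2,   # contrast adjustment
        saturation=0.2, hue=0.05,       # colour-palette shift
    ),
    transforms.ToTensor(),
    # Additive Gaussian noise, clamped back into the valid pixel range.
    transforms.Lambda(lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0.0, 1.0)),
])
```

Applied to the training set only, of course - the validation set should stay untouched so the score still means something.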
Yeah... this article really looks like blame-shifting to me. I'm imagining a future wherein we have bipedal murderbots, but we're still training AI with "properly oriented" images... a bot trips over a rock and starts blasting away at misidentified objects.
Well, even humans find it more difficult to process upside-down faces or objects. There's power in assumptions that are correct 99% of the time.
Regardless, the article isn't really shifting blame so much as explaining what's happening in the real world, with the real tools. The tools don't care about EXIF. Consumer software uses EXIF to abstract over reality. A lot of people playing with ML don't know about either.
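For anyone who hasn't hit this: the gap is that a plain decode gives you the pixels in sensor order, while the EXIF Orientation tag is just metadata that viewers honour and most ML loaders ignore. A rough sketch with Pillow (the file name is made up):

```python
from PIL import Image, ImageOps

img = Image.open("photo_from_phone.jpg")  # hypothetical file
# The decoder does not rotate anything; Orientation (tag 0x0112) is metadata.
print(img.size, img.getexif().get(0x0112))

# Bake the orientation into the pixel data, so downstream code sees
# the same "upright" image a consumer photo viewer would show.
upright = ImageOps.exif_transpose(img)
```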
You may not know whose face it is, but you do know that it is a face, right? It's like saying you know someone is in front of you when he's standing up in your clear view, but the moment he's lying on a couch in the centre of your vision you can't tell with certainty whether the thing on the couch is a human being.
I could tell, but I'm under the impression that my 4-month-old kid still has problems with that, or at least did when she was 2 months old.
I think this is closer to the performance we should expect of current neural models - a child a few months old, not an adult. NNs may be good at doing stuff similar to what some of our visual apparatus does, but they're missing the additional layers of "higher-level" processing humans employ. That, and a whole lot of training data.