An intro to self-supervised learning.
Deep learning has truly shaken up the machine learning field, and specifically image recognition tasks. In 2012, AlexNet kicked off a (still far from over) race towards solving, or at least significantly improving, computer vision tasks. And while the main idea is quite stable (use deep neural networks for everything), researchers have taken quite different paths:
- Try to optimize the model architectures
- Try to optimize the training schedule, e.g. the optimizers themselves
- Try to optimize the data, e.g. its order, size, diversity and so on
Each of these research paths improves training quality (speed, accuracy, sometimes generalization), but it seems that doing more of the same may yield gradual improvements, not a significant breakthrough.
On the other hand, a growing body of work shows that there are significant flaws in current methods, especially in terms of generalization, e.g. this recent one demonstrating generalization failure when objects are rotated:
So there seems to be a need for more aggressive improvements, or perhaps for expanding the research spectrum to ideas that may be a bit riskier.
Along with the aforementioned approaches, there are also directions that try to shift the learning paradigm, be it:
- N-shot learning
- Semi-supervised learning
- Domain adaptation
- Self-supervised learning
These approaches adopt different training paradigms, try to be more creative, or mimic human-like learning patterns. Although the above methods (and others) have yet to deliver a significant breakthrough, they do reach some non-trivial results, and also teach us a lot about the training process.
In this and the following posts, I will try to discuss some of the most interesting approaches, and dub the series "Different kinds of (deep) learning". I by no means try to predict future developments in deep learning, but merely to describe recent interesting works that perhaps don't get the spotlight. This may serve readers in a few ways:
- You may be interested in learning about works you didn’t know of.
- You may get new ideas for your own work.
- You may learn of relations between logical parts and tasks in deep learning that you were not aware of.
The first part of this series is about self-supervised learning, which was one of the main drivers for me to write this series.
Self-supervised learning
Imagine you have an agent that scours the web and seamlessly learns from every image it encounters. This notion is quite intriguing, since if it is realized, what is considered the greatest barrier for deep learning, the need for annotated data, will be (partially) removed.
But how can it be done? Well, it was first suggested in text: text is well structured by humans, therefore there are many concepts that can be learned from it without any annotations. Predicting the next/previous word is a prominent example, as done in word embeddings and language modeling.
In vision, such tricks are a bit more complex, since vision data (images and videos) is not explicitly structured by humans (well, a photographer may put a certain amount of thought into a photograph), and not every video, and definitely not every image, has a logical structure from which a signal can be extracted.
Isn't it just another form of unsupervised learning? Indeed, but with a special subtlety: the tasks are supervised (e.g. classification), yet no active annotation takes place. This topic is one of my favorites, and has quickly become the main topic of this article. I can't promise that this specific paradigm will bring the greatest achievements to deep learning, but it has definitely already brought some great creative ideas.
As said, the name of such tasks is self-supervised learning. Unlike "weak annotations", which mean images with tags, headers, or captions, self-supervised tasks are considered to have no annotations beyond the image itself. If you ask what can be learned from an image with no annotations, stay tuned.
Colorization
Perhaps the most intuitive signal in an image is its color. Since most digital color representations have 3 channels, 1 or 2 of them can seamlessly be used as annotation.
Since colorizing old images is an interesting task in its own right, many works address it. However, if we consider only fully automated colorization (which qualifies as self-supervised), the number dwindles to just a few.
The colorization task in this case is framed as a "cross-channel encoder", which means that one (or some) of the image's channels are used to encode the others. This concept will be discussed further in later posts.
The most notable colorization paper is this one, by Richard Zhang and Alexei Efros.
The common way to address colorization is not to use the standard RGB encoding, but the Lab color space. In Lab, the L channel stands for lightness (B&W intensity) and is used to predict the ab channels (a: green to red, b: blue to yellow).
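To make this concrete, here is a minimal sketch of how an unlabeled RGB image becomes an (input, target) training pair, assuming scikit-image is available; the function name is just illustrative:

```python
# A minimal sketch (not the paper's code): split a Lab image into the
# "free" input/target pair used for self-supervised colorization.
import numpy as np
from skimage import color

def make_colorization_pair(rgb_image: np.ndarray):
    """rgb_image: HxWx3 RGB array. Returns (L, ab) with no human labels."""
    lab = color.rgb2lab(rgb_image)   # convert to Lab color space
    L = lab[:, :, :1]                # lightness channel -> model input
    ab = lab[:, :, 1:]               # a/b channels -> prediction target
    return L, ab
```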
As we will see in all the tasks we discuss, self-supervised learning is not as straightforward as what we got used to in deep learning. There are some artifacts that prevent the model from achieving what it was designed to do. Additionally, if training is not examined carefully, the model will sometimes take "shortcuts" that hinder it from generalizing to other tasks.
Here are some challenges of the colorization task:
1. Inherent ambiguity in colorization: It is clear that for some images there is more than one plausible colorization. This issue causes multiple problems, both in training and in evaluation:
In the Donald Trump image below, the curtains may be either red or blue (among many other options). Donald's tie can match (or not). Given different examples of ties and curtains in the dataset, the model will tend to average them, coloring such items grey.
Solution: In Zhang's article, the researchers treat colorization as a classification problem instead of regression. Along with a special loss function, their model predicts a probability distribution over 313 quantized colors in Lab space instead of the actual colors, and then translates these probabilities back to colors (a sketch of this quantization follows the list of challenges below):
2. Bias: Lab is not an evenly distributed space. Most solutions tend towards the lower (desaturated) values, due to the high frequency of clouds, pavements, etc.
Solution: the loss function is reweighted to address this issue.
3. Evaluation problem: Now that the model can predict answers that are plausible but differ from the ground truth, e.g. if the ground truth is blue and the model chooses red, a standard evaluation will consider it wrong.
Solution: using different evaluation methods, among them human post-hoc classification, a "Colorization Turing test", where people were asked to tell apart the real image and the machine-colorized one. Additionally, feeding the colorized images into an image classifier and comparing the results with real images.
The model scored 35% in the Colorization Turing test, which is… not so bad.
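To make challenges 1 and 2 more concrete, here is a rough sketch of how ab values can be quantized into classification targets, and how a rebalancing weight can boost rare (saturated) colors. The bin width, grid range and smoothing value are assumptions for illustration; the actual paper uses 313 in-gamut bins and a soft encoding over the nearest bins.

```python
# A rough sketch (illustrative, not the paper's code) of ab quantization
# and class-rebalancing weights for colorization-as-classification.
import numpy as np

BIN_SIZE = 10            # assumed ab bin width
AB_RANGE = (-110, 110)   # assumed ab extent; the paper keeps 313 in-gamut bins

def ab_to_class(ab: np.ndarray) -> np.ndarray:
    """Map an HxWx2 ab image to integer bin ids (hard assignment for brevity)."""
    n_per_axis = (AB_RANGE[1] - AB_RANGE[0]) // BIN_SIZE
    a_idx = ((ab[..., 0] - AB_RANGE[0]) // BIN_SIZE).astype(int)
    b_idx = ((ab[..., 1] - AB_RANGE[0]) // BIN_SIZE).astype(int)
    return a_idx * n_per_axis + b_idx          # one class id per pixel

def rebalancing_weights(class_counts: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Per-class loss weights: rare (saturated) colors get boosted,
    smoothed towards uniform by `lam`, normalized to expectation 1."""
    prior = class_counts / class_counts.sum()
    w = 1.0 / ((1 - lam) * prior + lam / len(prior))
    return w / (prior * w).sum()
```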
In another recent paper, Larsson et al. worked concurrently with Zhang and Efros (both papers mention each other) and used spatially localized multi-layer slices (hypercolumns) and a regression loss. They tried to overcome the ambiguity issue by predicting a color histogram and sampling from it:
This work, apart from using the Lab space, also tries to predict Hue/Chroma attributes, which relate to the HSV color space.
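To illustrate the sampling idea in a hedged way: assuming the model outputs a per-pixel histogram over K color bins, one plausible colorization could be drawn by sampling a bin per pixel (the actual readout in the paper may differ, e.g. an expectation or median); `sample_from_histograms` and its arguments are hypothetical:

```python
# Illustrative only: draw one colorization from per-pixel color histograms.
import numpy as np

def sample_from_histograms(hist: np.ndarray, bin_centers: np.ndarray,
                           seed: int = 0) -> np.ndarray:
    """hist: HxWxK non-negative scores; bin_centers: K scalar color values."""
    rng = np.random.default_rng(seed)
    h, w, k = hist.shape
    probs = hist / hist.sum(axis=-1, keepdims=True)      # normalize per pixel
    flat_idx = [rng.choice(k, p=p) for p in probs.reshape(-1, k)]
    return bin_centers[np.array(flat_idx)].reshape(h, w)
```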
Context
Besides color prediction, the next most evident (but also creative) task is learning from image structure; more precisely, trying to predict something about image crops.
The inspiration for this task came directly from word2vec, and perhaps we can call it the “skip gram” of images.
However, in text the number of words is limited by the size of the vocabulary and will probably not exceed 1 million, while completing an image patch pixel by pixel resides in a much larger space. You may say that GANs do exactly that, but:
1. There is a really large number of correct solutions, so it is hard to make this generalize.
2. We'll discuss GANs in the next parts.
In this kind of paradigm, the actual task doesn't emerge naturally: researchers have to come up with "games" for the models to solve. We'll go through some prominent examples:
Jigsaw context
The easiest way to extract context from images was using jigsaw-like tasks. The first was a work by Doersch and Efros: patches were cropped from images, and a model was trained to classify their relative position. An illustration explains it best:
As in colorization, the task was not straightforward. Specifically, the model looked for "shortcuts": instead of actually learning high-level features and their relations, it may learn certain low-level features, such as edges and lighting relations, which tend to hint at the patch location.
To solve this problem, the researchers applied some jitter to the patches (as seen in the illustration).
Another issue the researchers suffered from was the model predicting patch location from a lens artifact, chromatic aberration: in some cameras the distribution of color varies across different parts of the image. Solution: this was partially handled by a color transformation, specifically shifting green and magenta towards grey.
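A rough sketch of how a training pair might be generated for this relative-position task: pick a central patch and one of its 8 neighbors (the class label), leaving a gap and adding random jitter to discourage the edge-continuity shortcut. Patch size, gap and jitter values are illustrative, not the paper's exact numbers, and the image is assumed to be large enough:

```python
# Illustrative sketch of pair generation for the relative-position task.
import random
import numpy as np

# 8 neighbor positions (row offset, col offset); the list index is the label.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    ( 0, -1),          ( 0, 1),
                    ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_relative_position_pair(image: np.ndarray, patch=96, gap=48, jitter=7):
    """Return (center_patch, neighbor_patch, label) from a large HxWxC image."""
    h, w = image.shape[:2]
    stride = patch + gap                 # the gap discourages edge matching
    cy = random.randint(stride + jitter, h - 2 * stride - jitter)
    cx = random.randint(stride + jitter, w - 2 * stride - jitter)
    label = random.randrange(8)
    dy, dx = NEIGHBOR_OFFSETS[label]
    ny = cy + dy * stride + random.randint(-jitter, jitter)   # random jitter
    nx = cx + dx * stride + random.randint(-jitter, jitter)
    center = image[cy:cy + patch, cx:cx + patch]
    neighbor = image[ny:ny + patch, nx:nx + patch]
    return center, neighbor, label
```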
The next prominent result was this paper by Noroozi and Favaro, which went all the way and used a harder problem, solving a full 9-tile jigsaw, but reached stronger performance in return:
The researchers verified that the patch shufflings were sufficiently scrambled, and used more than one shuffling per image.
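A sketch of how such a jigsaw sample could be built: cut a 3x3 grid of tiles, permute them according to one of a small fixed permutation set, and let the network classify which permutation was used. The permutation set below is random for illustration; ideally the permutations should be far apart from one another (e.g. in Hamming distance) so the task is well posed:

```python
# Illustrative sketch of a 9-tile jigsaw sample (not the paper's exact setup).
import random
import numpy as np

random.seed(0)
# Each permutation of the 9 tile positions is one class label.
PERMUTATIONS = [random.sample(range(9), 9) for _ in range(64)]

def make_jigsaw_sample(image: np.ndarray, tile: int = 64):
    """Cut a 3x3 grid of tiles from the top-left corner and shuffle them.
    Assumes the image is at least (3*tile) x (3*tile)."""
    tiles = [image[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
             for r in range(3) for c in range(3)]
    label = random.randrange(len(PERMUTATIONS))
    shuffled = [tiles[i] for i in PERMUTATIONS[label]]
    return np.stack(shuffled), label     # the network predicts `label`
```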
Context encoder
As said, word2vec in text fills in the missing word. Are there any attempts at doing this in vision? As a matter of fact, there are. In this article, Pathak et al. (and of course Efros) tried a few auto-encoder models to fill in a cropped-out region of images.
Results show it is actually possible, especially when adding an adversarial loss, which avoids averaging over multiple modes (as discussed previously), thus preventing a blurred, "averaged" result.
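A minimal sketch of the input preparation for such a context encoder: remove a central region and keep it as the reconstruction target; the hole size here is an arbitrary choice for illustration (the reconstruction loss, and optionally an adversarial loss, is then applied on the missing region):

```python
# Illustrative sketch: build (masked input, target region) for inpainting.
import numpy as np

def mask_center(image: np.ndarray, hole_frac: float = 0.25):
    """Zero out a central rectangle; return (masked_input, target, mask)."""
    h, w = image.shape[:2]
    hh, hw = int(h * hole_frac), int(w * hole_frac)
    top, left = (h - hh) // 2, (w - hw) // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + hh, left:left + hw] = True
    target = image[top:top + hh, left:left + hw].copy()  # region to reconstruct
    masked = image.copy()
    masked[mask] = 0                                     # the "hole" the model sees
    return masked, target, mask
```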
Rotation
Before we jump ahead to next-level stuff, I want to mention this tidbit: rotation prediction. This paper took the creative approach of predicting image rotation.
Rotation prediction, apart from being creative, is relatively fast, and doesn't require any kind of preprocessing, unlike the other tasks we've seen before, to overcome the learning of trivial features.
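A minimal sketch of the setup: every image yields four training examples, rotated by 0/90/180/270 degrees, and the network classifies which rotation was applied (90-degree rotations are exact pixel rearrangements, which is part of why no special preprocessing is needed):

```python
# Illustrative sketch: build the four rotated copies and their labels.
import numpy as np

def make_rotation_batch(image: np.ndarray):
    """Return the four 90-degree rotations of `image` with labels 0..3."""
    rotations = [np.rot90(image, k=k, axes=(0, 1)) for k in range(4)]
    labels = np.arange(4)            # 0 -> 0 deg, 1 -> 90, 2 -> 180, 3 -> 270
    return rotations, labels
```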
The paper also explores some "attention maps", which show that their network focuses on the important parts of images: heads, eyes, etc.
Although it reports state-of-the-art results on transfer learning to ImageNet classification (most other works report on Pascal VOC), reviewers found some flaws in the paper, so it has to be taken with a grain of salt.
Generalization
So after all this work, what do we get from it? Sure, coloring B&W images is nice, and solving jigsaws may be a fun demo app, but the greater goal is to achieve better results on the main tasks: classification, detection and segmentation.
The most common benchmark is the Pascal VOC dataset, where the current state of the art uses ImageNet pretraining:
And the current self-supervised results are:
Well, it seems we are not there yet. Although self-supervised data is practically unlimited, no work has yet challenged the "classic" ImageNet-based transfer learning results. However, there are a few nice results on specific tasks that we will discuss in later posts.
Besides the standard generalization to the above tasks, researchers exploit the specific features of this set of tasks to try to generalize to other tasks, such as image clustering (nearest neighbors, visual data mining, etc.).
Wrap up
Will the next big step come from self-supervised learning? Maybe, or maybe not, but I believe that exploring such different approaches significantly improves the deep learning field, and may indirectly contribute to the real breakthroughs. In the next post we will learn about more ideas and methods that lead to some interesting and novel results.
If you wish to read further, stay tuned (and follow me) for the next parts of this series. Additionally, here are some resources that immensely helped me in studying this topic:
1. A talk by Alexei Efros, one of the prominent researchers in this field (and co-author of most of the papers discussed here), is highly recommended.
2. A great slide deck summarizing the work done in this field.
Series links:
- An intro to self-supervised learning (this post)