Why does Batch normalization work? (2017-09-11)

<p>So, why does batch normalization work? Here’s one reason: you know how normalizing the input features, the X’s, to mean zero and variance one can speed up learning?
Rather than having some features that range from zero to one and some that range from 1 to 1,000, normalizing all the input features X to take on a similar range of values speeds up learning.
So one intuition behind batch norm is that it is doing a similar thing, but for the values in your hidden units and not just for your input features.</p>
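<p>The input normalization described above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up data, not code from any particular library:</p>

```python
import numpy as np

# Toy design matrix: one feature ranges over 0..1, the other over 0..1000
X = np.random.rand(500, 2) * np.array([1.0, 1000.0])

# Normalize each input feature to mean zero, variance one
mu = X.mean(axis=0)
sigma2 = X.var(axis=0)
X_norm = (X - mu) / np.sqrt(sigma2 + 1e-8)  # small epsilon avoids divide-by-zero

# Both features now live on a similar scale, which speeds up learning
print(X_norm.mean(axis=0))  # close to [0, 0]
print(X_norm.var(axis=0))   # close to [1, 1]
```

<p>Batch norm applies the same idea, but to the hidden unit values inside the network rather than only to X.</p>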
<p>A second reason batch norm works is that it makes the weights deeper in your network,
say the weights of layer 10, more robust to changes in the weights of earlier layers of the neural network, say layer one.</p>
<p>To explain what I mean, let’s look at a vivid example.
<img src="https://prajjwal1.github.io/images/batch normalization/1.png" alt="cat" /></p>
<p>Say you are training a network on a cat-detection problem: maybe a shallow network like logistic regression, or maybe a deep network.
But suppose you have trained on a dataset in which all the cat images are of black cats. If you now try to apply this network to data with colored cats, where the positive examples are not just black cats like on the left but also colored cats like on the right, then your network might not do very well. So in pictures: if your training set looks like the left, with positive examples here and negative examples there, and you try to generalize to a dataset where the positive and negative examples sit in different places, you might not expect a model trained on the left to do very well on the right.</p>
<p>Even though there might be a single function, the decision boundary shown in green, that works well on both, you wouldn’t expect your learning algorithm to discover it by looking only at the data on the left.</p>
<p>This idea of the data distribution changing goes by the name “covariate shift”.
The idea is that if you have learned some X-to-Y mapping and the distribution of X changes, then you might need to retrain your learning algorithm. And this is true even if the ground truth function mapping X to Y remains unchanged, which it does in this example, because the ground truth function is whether this picture is of a cat or not. The need to retrain becomes even more acute if the ground truth function shifts as well.</p>
<p>So, how does this problem of covariate shift apply to a neural network?
Consider a deep network like this,
<img src="https://prajjwal1.github.io/images/batch normalization/2.png" alt="Deep Neural net2" /></p>
<p>Let’s look at the learning process from the perspective of a certain layer, say the third layer.
This network has learned the parameters W<sup>[3]</sup> and B<sup>[3]</sup>, and from the perspective of the third hidden layer, it gets some set of values from the earlier layers and then has to do some stuff to hopefully make the output Y_hat close to the ground truth value Y.
So let me cover up the nodes on the left for a moment.
<img src="https://prajjwal1.github.io/images/batch normalization/3.png" alt="Deep Neural net3" /></p>
<p>From the perspective of this third hidden layer, it gets some values. Let’s call them A<sup>[2]</sup><sub>1</sub>, A<sup>[2]</sup><sub>2</sub>, A<sup>[2]</sup><sub>3</sub>, A<sup>[2]</sup><sub>4</sub>. But these values might as well be features X1, X2, X3, X4.
The job of the third hidden layer is to take these values and find a way to map them to Y_hat. So you could imagine doing gradient descent on the parameters W<sup>[3]</sup>, B<sup>[3]</sup> as well as W<sup>[4]</sup>, B<sup>[4]</sup> and even W<sup>[5]</sup>, B<sup>[5]</sup>, so that the network does a good job mapping from the <strong>values</strong> (shown on the left in the blue box) to the output Y_hat. But the network is also adapting the parameters W<sup>[2]</sup>, B<sup>[2]</sup> and W<sup>[1]</sup>, B<sup>[1]</sup>, and if those parameters change, these <strong>values</strong> will change too. So from the perspective of the third hidden layer, its input values are changing all the time, and it suffers from the problem of covariate shift.</p>
<p>Batch norm reduces the amount that the distribution of these hidden unit values shifts around. If we were to plot the distribution of the hidden unit values, let’s write them as Z.</p>
<p><img src="https://prajjwal1.github.io/images/batch normalization/4.png" alt="Deep Neural net4" /></p>
<p>We plot two values instead of four so we can visualize it in 2D. The values of Z<sup>[2]</sup><sub>1</sub> and Z<sup>[2]</sup><sub>2</sub> can change, and indeed they will change as the neural network updates the parameters in the earlier layers. What batch norm ensures is that however the exact values of Z<sup>[2]</sup><sub>1</sub> and Z<sup>[2]</sup><sub>2</sub> change, their mean and variance stay the same: not necessarily mean 0 and variance 1, but whatever values are governed by β<sup>[2]</sup> and γ<sup>[2]</sup>, which the network can set so as to force mean 0 and variance 1 if it wants.</p>
<p>So batch norm limits the amount to which updating the parameters in the earlier layers can affect the distribution of values that the third layer now sees and has to learn on. It reduces the problem of the input values changing; it makes these values more stable, so that the later layers of the neural network have firmer ground to stand on. Even though the input distribution still changes a bit, it changes less, and even as the earlier layers keep learning, the amount that the later layers are forced to adapt is reduced. If you will, it weakens the coupling between what the early layer parameters have to do and what the later layer parameters have to do. So it allows each layer of the network to learn by itself, a little more independently of the other layers, and this has the effect of speeding up learning in the whole network. The takeaway is that, especially from the perspective of one of the later layers of the network, the values produced by the earlier layers don’t get to shift around as much, because they are constrained to have the same mean and variance.
So this makes the job of learning in the later layers easier. It turns out that batch norm has a second effect: a slight regularization effect.</p>
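<p>The mechanics can be sketched in NumPy. This is a simplified, forward-pass-only illustration with made-up numbers; in a real network γ and β are per-unit parameters learned by gradient descent:</p>

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Batch-normalize pre-activations Z of shape (batch, units).

    Whatever the raw mean and variance of Z, the output's distribution
    is governed only by the learned parameters gamma and beta.
    """
    mu = Z.mean(axis=0)
    var = Z.var(axis=0)
    Z_norm = (Z - mu) / np.sqrt(var + eps)  # mean 0, variance 1 per unit
    return gamma * Z_norm + beta            # mean beta, std gamma per unit

# Even if earlier layers shift the raw distribution of Z...
Z_before = np.random.randn(256, 4) * 5.0 + 3.0   # mean 3, std 5
Z_after  = np.random.randn(256, 4) * 0.5 - 2.0   # mean -2, std 0.5
gamma, beta = np.ones(4), np.zeros(4)

out1 = batchnorm_forward(Z_before, gamma, beta)
out2 = batchnorm_forward(Z_after, gamma, beta)
# ...the values the next layer sees keep the same mean and variance,
# so the later layers have firmer ground to stand on.
```

<p>With γ = 1 and β = 0, both outputs have mean 0 and variance 1 regardless of how the inputs shifted, which is exactly the stability argument above.</p>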
<h2 id="batch-norm-as-a-regularization">Batch norm as a regularization</h2>
<ul>
<li>Each mini batch is scaled by the mean/variance computed on just that mini batch.</li>
<li>This adds some noise to the values Z<sup>[l]</sup> within that mini batch. So, similar to dropout, it adds some noise to each hidden layer’s activations.</li>
</ul>
<p>One non-intuitive thing about batch norm is that each mini batch, say mini batch X<sup>{t}</sup>, has its values Z<sup>[l]</sup> scaled by the mean and variance computed on just that mini batch. Because the mean and variance are computed on just that mini batch, as opposed to on the entire data set, they have a little noise in them: they are estimated from only your mini batch of, say, 64 or 128 or 256 examples. And because the mean and variance are a little bit noisy, the scaling process going from Z<sup>[l]</sup> to Z̃<sup>[l]</sup> is a little bit noisy as well, since it uses those slightly noisy estimates. So, similar to dropout, batch norm adds noise to each hidden layer’s activations. The way dropout adds noise is that it takes a hidden unit and multiplies it by zero with some probability and by one with some probability, so dropout’s noise is multiplicative. Batch norm has multiplicative noise, from scaling by the standard deviation, as well as additive noise, from subtracting the mean.</p>
<p>Here the estimates of the mean and standard deviation are noisy, so, similar to dropout, batch norm has a slight regularization effect: by adding noise to the hidden units, it forces the downstream hidden units not to rely too much on any one hidden unit.</p>
<p>Because the noise added is quite small, this is not a huge regularization effect, and you might choose to use batch norm together with dropout if you want the more powerful regularization effect of dropout. A maybe slightly less intuitive consequence is that if you use a bigger mini batch size, say 512 instead of 64, you reduce the noise and therefore reduce the regularization effect. So that’s one strange property of batch norm: by using a bigger mini batch size, you reduce the regularization effect. Sometimes it has this extra, unintended effect on your learning algorithm. Don’t use batch norm as a regularizer; use it as a way to normalize your hidden units and thereby speed up training. The regularization is an unintended side effect.</p>
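<p>The noise argument can be made concrete with a quick synthetic experiment (an illustration I am adding here, not part of the lecture): the mean estimated on a mini batch of 64 examples fluctuates far more than one estimated on 512 examples.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100_000)  # stand-in "dataset"

def minibatch_mean_noise(batch_size, n_batches=1000):
    """Spread (std) of the per-mini-batch mean estimate across many batches."""
    means = [rng.choice(data, size=batch_size).mean() for _ in range(n_batches)]
    return float(np.std(means))

noise_64 = minibatch_mean_noise(64)
noise_512 = minibatch_mean_noise(512)
# Larger mini batches give less noisy statistics, hence a weaker
# regularization effect, matching the paragraph above.
print(noise_64, noise_512)
```

<p>Since the standard error of a mean scales like 1/√(batch size), the batch-of-64 statistics are roughly √8 ≈ 2.8 times noisier than the batch-of-512 ones.</p>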
<p>Batch norm handles data one mini batch at a time: it computes the mean and variance on mini batches. But at test time, when you try to make predictions and evaluate the network, you might not have a mini batch of examples; you might be processing a single example at a time. So at test time you need to do something slightly different, typically using an exponentially weighted average of the means and variances seen during training, to make sure your predictions still make sense.</p>

Blogging with jekyll (2017-06-19)

<p>Jekyll describes itself as a tool for building “simple, blog-aware, static sites”. It was created by Github co-founder <a href="http://tom.preston-werner.com/">Tom Preston-Werner</a>.</p>
<h4 id="jekyll">Jekyll</h4>
<p>Its workflow is really simple to manage. There’s no clutter, and you don’t need to manage plugins, etc.</p>
<ul>
<li>I could try out different ideas and explore a variety of posts all from the comfort of my preferred editor and the command line.</li>
<li>Complexity would be kept to an absolute minimum, so a static site would be preferable to a dynamic site that required ongoing maintenance.</li>
<li>It takes a template directory (representing the raw form of a website), runs it through Textile and Liquid converters, and spits out a complete, static website suitable for serving with Apache or your favorite web server.</li>
<li>My blog would need to be easily customizable. I’ll always be tweaking the site’s appearance and layout.</li>
<li>Many services dynamically render their content with <strong>.php</strong>, yet few applications actually require it. Dynamic code
execution makes your blog vulnerable to exploits.</li>
<li>This blog is accessible everywhere.</li>
<li>Your posts are not stuck in a specific SQL database; rather, they are available in an open source repository.</li>
<li>You can obtain your posts in raw form at any time.</li>
<li>Everything is backed up on Github.</li>
<li>As a whole, you have full freedom to do anything you want.</li>
</ul>
<h4 id="integration-with-github">Integration with Github</h4>
<p>Jekyll is tightly integrated with Github. Creating a blog on Github is much safer than hosting it anywhere else.</p>
<ul>
<li>My posts are placed in a single folder, <code class="highlighter-rouge">_posts</code>, written in markdown files, including this one.</li>
<li>My images are placed in a single folder, <code class="highlighter-rouge">assets</code>.</li>
</ul>
<p>There’s a balance in Jekyll that’s difficult to find elsewhere:
complete transparency.
Github automatically refreshes <a href="http://prajjwal1.github.io/">prajjwal1.github.io</a> to point to the newly generated <code class="highlighter-rouge">_site</code>, so your posts go live on their own. I added a <strong>Google Analytics</strong> tracking code to all my pages by tweaking the HTML template, and comments have been enabled on all posts with javascript code.</p>
<h4 id="about-markdown">About Markdown</h4>
<p>Markdown is a plain text formatting syntax for writers. It allows you to quickly write structured content for the web, and it is seamlessly converted to clean, structured HTML.
With just a couple of extra characters, Markdown makes rich document formatting quick and beautiful. It lets you keep your fingers firmly planted on the keyboard as you apply formatting on the fly.</p>
<p>For more info:
Check out <a href="http://jekyllrb.com/">Jekyll</a> and get blogging!</p>

Hello World (2017-06-18)

<p>Hi Everyone!</p>
<p>This is my first blog post! This blog is gonna be about artificial intelligence, technology and other cool stuff! I will try to write about my experiences and the projects I’m gonna be working on related to deep learning.
I have been studying deep learning for a while! What fascinates me about it is its real life applications and how it can change the way we live in this world. It truly has made a deep impact on how we lead our lives today.
It’s a wonderful experience to explore the world of AI, and I’m waiting for the time when these systems will surpass humans in many aspects. I’m on a quest to solve intelligence.</p>