Exploring Mathematics (Posts about Neural Networks)http://www.seancarrell.com/enContents © 2017 <a href="mailto:s.r.carrell@gmail.com">Sean Carrell</a>
<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0/">
<img alt="Creative Commons License BY-NC-SA"
style="border-width:0; margin-bottom:12px;"
src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png"></a>Sat, 02 Dec 2017 01:34:15 GMTNikola (getnikola.com)http://blogs.law.harvard.edu/tech/rss- A Simple Quadratic Convolutional Layer in Tensorflowhttp://www.seancarrell.com/posts/a-simple-quadratic-convolutional-layer-in-tensorflow/Sean Carrell<div><p>Generally in computer vision research and applications the greatest successes have come from neural networks composed of various convolutional layers. More specifically, they are composed of linear convolutional layers.</p>
<p>As a refresher, in order to construct a linear convolutional layer applied to an input image <span class="math">\(X\)</span> in which the convolutional layer has kernel of size <span class="math">\(k \times k\)</span>, we first decompose <span class="math">\(X\)</span> as a sequence of <span class="math">\(k \times k\)</span> patches. Then, we vectorize each patch, i.e., we reshape the patch as a vector of length <span class="math">\(k^2\)</span>. Lastly, given each vectorized patch <span class="math">\(x\)</span>, we compute a result given by</p>
<div class="math">
\begin{equation*}
y = w^T x + b,
\end{equation*}
</div>
<p>where <span class="math">\(w\)</span> is a vector of length <span class="math">\(k^2\)</span>, the kernel, and <span class="math">\(b\)</span> is the bias term. For a beautiful, lucid description of linear convolutions I recommend <a class="reference external" href="http://colah.github.io/posts/2014-07-Understanding-Convolutions/">Christopher Olah's post</a>.</p>
<p>Of course, in the description above I have intentionally ignored some very important details, such as what to do about patches on the boarder of the input (same or valid convolutions), dealing with stride length, etc? What is import to the discussion here, however, is the type of function that is applied to each patch.</p>
<p>The reason I have been stressing the 'linear' in linear convolutions is that the function applied to each patch is a typical linear function. In principle we need not constrain ourselves to linear functions, however, there are good reasons to do so. For one, linear functions are very easy to compute. In fact, many of the recent advances in machine learning can be attributed to our ability to find increasingly more effective ways to perform linear operations (gpus and the like). Another reason is that computing the gradient of a linear function is also very easy, making things like stochastic gradient descent possible and effective when used to optimized linear convolutional networks.</p>
<p>There are also reasons to try other functions besides linear. In a recent <a class="reference external" href="https://arxiv.org/abs/1708.07038">Arxiv paper</a>, some authors tried what they called Volterra convolutional layers, or more generally quadratic layers. The idea here is that for each patch <span class="math">\(x\)</span> in the input image, we compute the value</p>
<div class="math">
\begin{equation*}
x^T Q x + w^T x + b,
\end{equation*}
</div>
<p>where <span class="math">\(w\)</span> and <span class="math">\(b\)</span> are the same as those appearing in a linear convolution and <span class="math">\(Q\)</span> is a <span class="math">\(k^2 \times k^2\)</span> matrix which contributes a quadratic non-linearity.</p>
<p>It should be noted that this isn't the first time that a quadratic convolutional layer has been proposed in the neural computing literature. <a class="reference external" href="https://aclanthology.info/pdf/N/N09/N09-2062.pdf">For example</a>, some researchers proposed a similar extension however this was done a little while ago when computational facilities were not quite as good as they are now. In addition, the reason for trying a quadratic nonlinearity comes from research into the modelling of the eye and how it responds to various stimulus.</p>
<p>The researchers involved in the Volterra convolutions paper made their code available although for my purposes it was not terribly helpful. Mostly because it was written in Lua and CUDA code implementing the convolution itself. In order to play around with this type of filter I wrote a crude implementation in Tensorflow. Note that there is a slight difference in my implementation compared to the original author's. I do not take into account the symmetric nature of the quadratic form used and so in effect perform some redundant computations. This shouldn't affect the performance of the filter in terms of accuracy and applicability however.</p>
<pre class="code python"><a name="rest_code_4e39a8ffb75f4219829b7a703c738420-1"></a><span class="k">def</span> <span class="nf">volterra_conv</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">W</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="s1">'SAME'</span><span class="p">):</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-2"></a> <span class="n">input_patches</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">extract_image_patches</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-3"></a> <span class="n">ksizes</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="mi">1</span><span class="p">],</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-4"></a> <span class="n">strides</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-5"></a> <span class="n">rates</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-6"></a> <span class="n">padding</span><span class="o">=</span><span class="n">padding</span><span class="p">)</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-7"></a> <span class="n">batch</span><span class="p">,</span> <span class="n">out_row</span><span class="p">,</span> <span class="n">out_col</span><span class="p">,</span> <span class="n">sizes</span> <span class="o">=</span> <span class="n">input_patches</span><span class="o">.</span><span class="n">get_shape</span><span class="p">()</span><span class="o">.</span><span class="n">as_list</span><span class="p">()</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-8"></a> <span class="n">input_patches</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">input_patches</span><span class="p">,</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-9"></a> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">out_row</span><span class="p">,</span> <span class="n">out_col</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">kernel_size</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">input_dim</span><span class="p">])</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-10"></a> <span class="n">V</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">einsum</span><span class="p">(</span><span class="s1">'abcid,abcjd,dijo->abcdo'</span><span class="p">,</span> <span class="n">input_patches</span><span class="p">,</span> <span class="n">input_patches</span><span class="p">,</span> <span class="n">W</span><span class="p">)</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-11"></a> <span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="n">V</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-12"></a>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-13"></a>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-14"></a><span class="k">def</span> <span class="nf">volterra_layer</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-15"></a> <span class="n">filters</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-16"></a> <span class="n">kernel_size</span><span class="o">=</span><span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-17"></a> <span class="n">padding</span><span class="o">=</span><span class="s1">'SAME'</span><span class="p">,</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-18"></a> <span class="n">activation</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">relu</span><span class="p">):</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-19"></a> <span class="n">input_dim</span> <span class="o">=</span> <span class="n">inputs</span><span class="o">.</span><span class="n">get_shape</span><span class="p">()</span><span class="o">.</span><span class="n">as_list</span><span class="p">()[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-20"></a>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-21"></a> <span class="n">W1</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">truncated_normal</span><span class="p">([</span><span class="n">input_dim</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">**</span><span class="mi">2</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">**</span><span class="mi">2</span><span class="p">,</span> <span class="n">filters</span><span class="p">]))</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-22"></a> <span class="n">W2</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">truncated_normal</span><span class="p">([</span><span class="n">kernel_size</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">input_dim</span><span class="p">,</span> <span class="n">filters</span><span class="p">]))</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-23"></a> <span class="n">b</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">constant</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="n">filters</span><span class="p">]))</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-24"></a>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-25"></a> <span class="k">return</span> <span class="n">activation</span><span class="p">(</span><span class="n">volterra_conv</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">W1</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="n">padding</span><span class="p">)</span>
<a name="rest_code_4e39a8ffb75f4219829b7a703c738420-26"></a> <span class="o">+</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">conv2d</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">W2</span><span class="p">,</span> <span class="n">strides</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">],</span> <span class="n">padding</span><span class="o">=</span><span class="n">padding</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span><span class="p">)</span>
</pre><p>Using the code above to implement a quadratic convolutional layer I compared against a standard linear convolutional network applied to the Fashion MNIST data set. The benchmark code used is composed of a 5x5 linear convolutional layer with 32 filters, a max pooling layer, another 5x5 linear convolutional layer with 64 filters, a max pooling layer and then finally a fully dense layer. In order to play fair, since a quadratic convolutional layer has many more free parameters, I tested against a model which consisted of a single 3x3 quadratic convolutional layer with 32 filters followed by max pooling down to a 7x7 image (so it wasn't cheating by using a higher resolution upper layer) followed fully dense layer.</p>
<img alt="/images/LinearvsQuadratic.png" class="align-center" src="http://www.seancarrell.com/images/LinearvsQuadratic.png">
<p>After running both models for 100 epochs the baseline model achieved an accuracy of about 90.23% on the test set and the quadratic model achieved 90.97%. As can be seen in the image above, the quadratic model seems to converge quicker than the baseline linear model. What isn't clear is if this holds for more complicated data sets or more sophisticated models. In addition, due to limited computing power on my end, I do not know if the quadratic model converges to a higher accuracy than the linear model after a larger number of epochs.</p>
<p>What is interesting about this result, however, is the much smaller number of parameters in the quadratic model compared to the linear model. The quadratic model has around 3000 parameters in the convolutional layers where as the linear model has somewhere over 50000 parameters (if my arithmetic is correct). It would be interesting to do a similar comparison on a richer data set.</p>
<p>Please note that a <a class="reference external" href="https://github.com/srcarrel/QuadraticConvolutions">Python notebook</a> is available containing the tests detailed above.</p></div>mathjaxNeural NetworksTensorflowhttp://www.seancarrell.com/posts/a-simple-quadratic-convolutional-layer-in-tensorflow/Thu, 09 Nov 2017 19:19:36 GMT