Let’s consider the single layer neural network model for predicting a single numeric \(Y\).
Let \(X = (X_1,X_2,\ldots,X_p)\) be our vector of predictor variables.
We first take \(K\) linear combinations of \(X\) (plus an intercept).
\(z_k = w_{k0} + \sum_{j=1}^p \, w_{kj} \, X_j\).
We then transform each linear combination non-linearly with an activation function.
\(A_k = g(z_k), \;\; k=1,2,\ldots, K\).
We then output a linear function of the “activations”.
\(f(X) = \beta_0 + \sum_{k=1}^K \, \beta_k \, A_k\).
All in one line:
\[ f(X) = \beta_0 + \sum_{k=1}^K \, \beta_k \, g(w_{k0} + \sum_{j=1}^p \, w_{kj} \, X_j) \]
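To make the notation concrete, here is a minimal NumPy sketch of this forward pass, assuming a sigmoid activation; the names w0, W, b0, and beta are just hypothetical labels for the intercepts and weights, not anything defined above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f_single_layer(X, w0, W, b0, beta):
    # X:    (n, p) matrix of predictors
    # w0:   (K,)   unit intercepts w_{k0}
    # W:    (K, p) unit weights   w_{kj}
    # b0:   scalar output intercept beta_0
    # beta: (K,)   output weights beta_k
    Z = w0 + X @ W.T       # z_k for each row of X, shape (n, K)
    A = sigmoid(Z)         # activations A_k = g(z_k)
    return b0 + A @ beta   # f(X), shape (n,)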
Let’s use simple squared error loss so that, given \(f = f(X)\),
\[
L(y,f) = (y-f)^2
\] Each of our linear combinations (the \(z_k\)) corresponds to a unit of our
single layer model.
So, in our notation, our layer has \(K\) units.
Typically, neural net models are fit using stochastic gradient descent with some choice of adaptive learning rate.
For example, here is keras code to fit a single layer model:
from tensorflow import keras   # keras (here imported via tensorflow)

#make model
l2pen = 0.0    # l2 penalty (zero for now)
#nunit = 500
nunit = 5
nx = Xs.shape[1] # number of x's
nn1 = keras.models.Sequential()
## add one hidden layer with nunit units and sigmoid activations
nn1.add(keras.layers.Dense(units=nunit, activation='sigmoid',
                           kernel_regularizer=keras.regularizers.l2(l2pen),
                           input_shape=(nx,)))
## one numeric output
nn1.add(keras.layers.Dense(units=1))
#compile model: squared error loss, rmsprop optimizer
nn1.compile(loss='mse', optimizer='rmsprop', metrics=['mse'])
# fit with mini-batch (stochastic) gradient steps
nepoch = 1000
nhist = nn1.fit(Xs, y, epochs=nepoch, verbose=1, batch_size=20)
Ignore the l2pen stuff (for now).
Choices made above are: the number of hidden units (nunit = 5), the sigmoid activation, the rmsprop optimizer, squared error loss, the number of epochs (1000), and the mini-batch size (20).
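After fitting, predictions and the per-epoch training loss can be pulled out in the usual keras way, for example:

yhat = nn1.predict(Xs)            # fitted values f(x_i)
trloss = nhist.history['loss']    # training loss by epoch
import matplotlib.pyplot as plt
plt.plot(trloss)                  # a quick look at how the loss drops over epochs
plt.xlabel('epoch'); plt.ylabel('training mse')
plt.show()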
To do gradient descent we need the gradient.
For the single layer model this is pretty easy.
For large, deep neural networks, computing the gradient is a major part
of the technology, but let’s not worry about that now.
\[ \frac{\partial L}{\partial \beta_k} = \frac{\partial L}{\partial f} \, \frac{\partial f}{\partial \beta_k} = -2(y-f) \, A_k = -2(y-f) \, g(z_k) \]
\[ \frac{\partial L}{\partial w_{kj}} = \frac{\partial L}{\partial f} \, \frac{\partial f}{\partial A_k} \, \frac{\partial A_k}{\partial z_{k}} \, \frac{\partial z_k}{\partial w_{kj}} = -2(y-f) \, \beta_k \, g'(z_k) \, X_j \] with simpler versions for \(\beta_0\) and \(w_{k0}\).
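As a sanity check on these formulas, here is a minimal NumPy sketch (using the same hypothetical names as the forward-pass sketch above) that evaluates the gradients for a single observation and compares one of them to a finite difference.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads_one_obs(x, y, w0, W, b0, beta):
    # gradients of L = (y - f(x))^2 for a single observation x (length-p vector)
    z = w0 + W @ x                       # z_k, shape (K,)
    A = sigmoid(z)                       # A_k = g(z_k)
    f = b0 + beta @ A                    # f(x)
    r = -2.0 * (y - f)                   # dL/df
    gp = A * (1.0 - A)                   # g'(z_k) for the sigmoid
    d_b0   = r                           # dL/dbeta_0
    d_beta = r * A                       # dL/dbeta_k = -2(y-f) A_k
    d_w0   = r * beta * gp               # dL/dw_{k0} = -2(y-f) beta_k g'(z_k)
    d_W    = np.outer(r * beta * gp, x)  # dL/dw_{kj} = -2(y-f) beta_k g'(z_k) x_j
    return d_b0, d_beta, d_w0, d_W

# finite-difference check on one beta_k (toy numbers, purely illustrative)
rng = np.random.default_rng(0)
p, K = 3, 4
x, y = rng.normal(size=p), 1.0
w0, W = rng.normal(size=K), rng.normal(size=(K, p))
b0, beta = 0.5, rng.normal(size=K)
loss = lambda b: (y - (b0 + b @ sigmoid(w0 + W @ x)))**2
_, d_beta, _, _ = grads_one_obs(x, y, w0, W, b0, beta)
eps = 1e-6
bp = beta.copy(); bp[1] += eps
print(d_beta[1], (loss(bp) - loss(beta)) / eps)   # the two numbers should agree closely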
In general, we like to estimate models by minimizing the loss on the training data.
\[ \underset{\theta}{\text{minimize}} \, \frac{1}{n} \, \sum_{i=1}^n \, L(y_i,f(x_i,\theta)) \]
As discussed at the beginning of the optimization notes, if the \(\theta\) corresponds to a rich model, then we may overfit.
To avoid overfitting we often add a complexity penalty.
For example we could minimize
\[
\underset{\theta}{\text{minimize}} \, \frac{1}{n} \, \sum_{i=1}^n \,
L(y_i,f(x_i,\theta)) + \lambda \sum_{j=1}^p \theta_j^2
\] rather than just the loss.
The penalty is called L2 regularization. Our complexity penalty is the
square of the L2 norm of \(\theta\).
The idea is that for a “good” \(\lambda\), the \(\theta\) values at the optimum will be shrunk towards zero, and this stops the model from overfitting.
Clearly computing the gradient is no problem.
\[ \text{gradient} = \, \frac{1}{n} \, \sum_{i=1}^n \, \nabla L(y_i,f(x_i,\theta)) + 2 \, \lambda \, \theta \]
For our single layer neural network model, \(\theta\) is all the \(w\)’s and all the \(\beta\)’s.
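As a small sketch of the penalized gradient in this flat-parameter view (grad_loss is a hypothetical function returning the averaged loss gradient, and theta is assumed to be a 1-d NumPy array holding all the parameters):

def penalized_grad(theta, grad_loss, lam):
    # grad_loss(theta) = (1/n) sum_i grad L(y_i, f(x_i, theta)), same shape as theta
    # the penalty lam * sum_j theta_j^2 contributes 2 * lam * theta to the gradient
    return grad_loss(theta) + 2.0 * lam * theta

# one full gradient-descent step with a (hypothetical) learning rate eta:
#   theta = theta - eta * penalized_grad(theta, grad_loss, lam)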
From the Elements of Statistical Learning:
“Generally speaking it is better to have too many hidden units than too few.
With too few hidden units, the model might not have enough flexibility
to capture the nonlinearities in the data; with too many hidden units,
the extra weights can be shrunk toward zero if appropriate regularization (decay)
is used. Typically the number of hidden units is somewhere in the range of
5 to 100, with the number increasing with the number of inputs and the
number of training cases. It is most common to put down a reasonably
large number of units and train them with regularization. Some researchers
have used cross-validation to estimate the optimal number, but this seems
unnecessary if cross-validation is used to estimate the regularization parameter.
Choice of the number of hidden layers is guided by background knowledge
and experimentation.”
Write code to implement stochastic gradient descent for the single layer model.
Play around with different versions (gradient descent, stochastic gradient descent, learning rates, activation functions …)
For activation functions you can just use sigmoid or sigmoid and tanh.
Play around with the L2 regularization: is it true that you can’t have “too many” units if you regularize?
Find simple examples to try it on: simulated data? Real data?
Don’t be afraid to keep things as simple as possible (e.g., try it with just one \(x\)).
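For example, one possible toy data set with a single x (the signal and noise level here are arbitrary choices):

import numpy as np
rng = np.random.default_rng(99)                  # arbitrary seed
n = 200
x = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(2.0 * x) + 0.2 * rng.normal(size=n)   # simple nonlinear signal plus noise
X = x.reshape(-1, 1)                             # a single predictor, as an (n, 1) matrix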
Don’t forget to have fun!!