NOTE
This is part of the 8th homework, titled Deep Learning II, for the course Machine Learning (IN2064) in the Winter Semester 2024/25 at TUM.
Problem 3:
You are trying to solve a regression task and you want to choose between two approaches:
- A simple linear regression model.
- A feed-forward neural network with $L$ hidden layers, where hidden layer $l$ (for $l = 1, \dots, L$) has a weight matrix $\mathbf{W}_l$ followed by a ReLU activation function. The output layer has a weight matrix $\mathbf{W}_{L+1}$ and no activation function.

In both models there are no bias terms. Your dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ contains $N$ data points with non-negative features $\mathbf{x}_i \in \mathbb{R}_{\geq 0}^{d}$ and continuous targets $y_i \in \mathbb{R}$.
Let $\mathbf{w}^*$ be the optimal weights for the linear regression model corresponding to a global minimum of the following least squares optimization problem:
$$\mathbf{w}^* \in \arg\min_{\mathbf{w}} \; \mathcal{L}_{\mathrm{LR}}(\mathbf{w}), \qquad \mathcal{L}_{\mathrm{LR}}(\mathbf{w}) = \sum_{i=1}^{N} \left( \mathbf{w}^\top \mathbf{x}_i - y_i \right)^2 .$$
Let $\mathbf{W}_1^*, \dots, \mathbf{W}_{L+1}^*$ be the optimal weights for the neural network corresponding to a global minimum of the following optimization problem:
$$\mathbf{W}_1^*, \dots, \mathbf{W}_{L+1}^* \in \arg\min_{\mathbf{W}_1, \dots, \mathbf{W}_{L+1}} \; \mathcal{L}_{\mathrm{NN}}(\mathbf{W}_1, \dots, \mathbf{W}_{L+1}), \qquad \mathcal{L}_{\mathrm{NN}}(\mathbf{W}_1, \dots, \mathbf{W}_{L+1}) = \sum_{i=1}^{N} \left( f_{\mathrm{NN}}(\mathbf{x}_i) - y_i \right)^2 ,$$
where $f_{\mathrm{NN}}(\mathbf{x}) = \mathbf{W}_{L+1}\, \sigma\!\left( \mathbf{W}_L \cdots \sigma\!\left( \mathbf{W}_1 \mathbf{x} \right) \right)$ and $\sigma(z) = \max(0, z)$ is the element-wise ReLU.
a) Assume that the optimal weights $\mathbf{W}_1^*, \dots, \mathbf{W}_{L+1}^*$ you obtain are non-negative. What will the relation ($\leq$, $\geq$, or $=$) between the neural network loss $\mathcal{L}_{\mathrm{NN}}(\mathbf{W}_1^*, \dots, \mathbf{W}_{L+1}^*)$ and the linear regression loss $\mathcal{L}_{\mathrm{LR}}(\mathbf{w}^*)$ be? Provide a mathematical argument to justify your answer.
b) In contrast to (a), now assume that the optimal weights $\mathbf{w}^*$ you obtain are non-negative. What will the relation ($\leq$, $\geq$, or $=$) between the linear regression loss $\mathcal{L}_{\mathrm{LR}}(\mathbf{w}^*)$ and the neural network loss $\mathcal{L}_{\mathrm{NN}}(\mathbf{W}_1^*, \dots, \mathbf{W}_{L+1}^*)$ be? Provide a mathematical argument to justify your answer.
Solution: a)
In general, the following relation holds: $\mathcal{L}_{\mathrm{NN}}(\mathbf{W}_1^*, \dots, \mathbf{W}_{L+1}^*) \leq \mathcal{L}_{\mathrm{LR}}(\mathbf{w}^*)$, since in the worst case the described neural network can simply reproduce the linear regression model. In cases where the relationship between the features $\mathbf{x}_i$ and their targets $y_i$ is nonlinear, the network can outperform linear regression thanks to the nonlinearity introduced by the ReLU activations in all but the last layer.
However, we are assuming that the optimal weights $\mathbf{W}_1^*, \dots, \mathbf{W}_{L+1}^*$ of the neural network are non-negative. Recall the ReLU activation function, $\sigma(z) = \max(0, z)$: if its input is restricted to $z \geq 0$, it becomes the identity function, so it can be removed from our considerations. From the task description we know that the dataset contains only non-negative features $\mathbf{x}_i \geq 0$, and since the optimal weights are non-negative as well, the output of every layer is non-negative. Thus every ReLU acts as the identity and, because linear maps compose into a single linear map, the network at its optimum is equivalent to a linear regression model. Its loss therefore cannot be smaller than the loss of the optimal linear regression model, i.e. $\mathcal{L}_{\mathrm{NN}}(\mathbf{W}_1^*, \dots, \mathbf{W}_{L+1}^*) \geq \mathcal{L}_{\mathrm{LR}}(\mathbf{w}^*)$. Combined with the general bound above, the relation between the losses is
$$\mathcal{L}_{\mathrm{NN}}(\mathbf{W}_1^*, \dots, \mathbf{W}_{L+1}^*) = \mathcal{L}_{\mathrm{LR}}(\mathbf{w}^*).$$
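The collapse can be written out layer by layer; the intermediate activations $\mathbf{h}^{(l)}$ and the collapsed weight vector $\tilde{\mathbf{w}}$ are notation introduced only for this sketch. Because $\mathbf{x} \geq 0$ and every $\mathbf{W}_l^* \geq 0$, each pre-activation is non-negative and the ReLU leaves it unchanged:
$$
\begin{aligned}
\mathbf{h}^{(1)} &= \sigma\!\left(\mathbf{W}_1^* \mathbf{x}\right) = \mathbf{W}_1^* \mathbf{x}, \\
\mathbf{h}^{(l)} &= \sigma\!\left(\mathbf{W}_l^* \mathbf{h}^{(l-1)}\right) = \mathbf{W}_l^* \mathbf{h}^{(l-1)}, \qquad l = 2, \dots, L, \\
f_{\mathrm{NN}}(\mathbf{x}) &= \mathbf{W}_{L+1}^* \mathbf{h}^{(L)} = \underbrace{\mathbf{W}_{L+1}^* \mathbf{W}_L^* \cdots \mathbf{W}_1^*}_{=:\ \tilde{\mathbf{w}}^\top} \, \mathbf{x} = \tilde{\mathbf{w}}^\top \mathbf{x}.
\end{aligned}
$$
Since $\tilde{\mathbf{w}}$ is just one particular choice of linear regression weights, $\mathcal{L}_{\mathrm{NN}}(\mathbf{W}_1^*, \dots, \mathbf{W}_{L+1}^*) = \mathcal{L}_{\mathrm{LR}}(\tilde{\mathbf{w}}) \geq \mathcal{L}_{\mathrm{LR}}(\mathbf{w}^*)$.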
Solution: b)
Here the situation is a bit different, as we now assume that the optimal weights $\mathbf{w}^*$ of the linear regression model are non-negative. This means the optimal linear map from feature space to target space only scales and sums the (non-negative) features. Since the relationship between $\mathbf{x}$ and $y$ can be nonlinear, the loss of the neural network at its global minimum can in principle be lower than the loss of the optimal linear regression model. However, such an improvement is not guaranteed: with no bias terms and non-negative input features, the network may in the worst case realize only a linear map of the inputs (for instance, when all of its weights are non-negative the ReLUs act as the identity, and a composition of linear maps is again a linear map). What we can guarantee is an upper bound: the loss of the optimal linear regression model bounds the neural network loss from above.
This upper bound can be reached by the neural network, since it can represent any linear regression model on non-negative inputs: set the weights of all hidden layers to identity matrices ($\mathbf{W}_l = \mathbf{I}$ for $l = 1, \dots, L$), so that every ReLU receives a non-negative input and acts as the identity, and set the output layer weights equal to $(\mathbf{w}^*)^\top$. This means that the neural network's hypothesis space strictly includes the linear regression model's hypothesis space. Therefore, the relation between the losses is
$$\mathcal{L}_{\mathrm{LR}}(\mathbf{w}^*) \geq \mathcal{L}_{\mathrm{NN}}(\mathbf{W}_1^*, \dots, \mathbf{W}_{L+1}^*).$$
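For completeness, the construction can be written out in formulas; this is just the previous argument made explicit. For any input $\mathbf{x} \geq 0$, the identity hidden layers pass $\mathbf{x}$ through every ReLU unchanged, so
$$
f_{\mathrm{NN}}(\mathbf{x}) = (\mathbf{w}^*)^\top\, \sigma\!\big(\mathbf{I}\,\sigma(\cdots\,\sigma(\mathbf{I}\,\mathbf{x})\,\cdots)\big) = (\mathbf{w}^*)^\top \mathbf{x},
$$
which is exactly the prediction of the optimal linear regression model. This choice of weights achieves the loss $\mathcal{L}_{\mathrm{LR}}(\mathbf{w}^*)$, so the global minimum $\mathcal{L}_{\mathrm{NN}}(\mathbf{W}_1^*, \dots, \mathbf{W}_{L+1}^*)$ can be at most that value.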