All it does is change the step size of the gradient-descent update. Scaling the outputs down makes the gradients smaller, so the updates are smaller. We want that so we do not jump over a good solution with steps that are too big.

Let's say the activation function is $a =\sigma(x)$ where $x$ is the input to the activation function and $a$ is the output.

The range of the true outputs $y$ is about 10 times the range of $\sigma$. We can either scale $\sigma$ up or scale $y$ down.

### Scaling activation function

Let's scale the activation output up by 10: $y' = 10a$.

The loss function $L$ compares predictions against the truth: $L(y', y)$. For backpropagation we need the gradient w.r.t. $x$, i.e. how the loss changes with respect to the input, so we can update it:

$$
\begin{aligned}
\frac{\partial L(y', y)}{\partial x}
&= \frac{\partial L(y', y)}{\partial y'} \times \frac{\partial y'}{\partial a} \times \frac{\partial a}{\partial x} \\
&= \frac{\partial L(y', y)}{\partial y'} \times 10 \times \frac{\partial a}{\partial x} \\
&= \frac{\partial L(10a, y)}{10 \cdot \partial a} \times 10 \times \frac{\partial a}{\partial x} \\
&= \frac{\partial L(10a, y)}{\partial a} \times \frac{\partial a}{\partial x} \\
&= \frac{\partial L(10 \cdot (a,\, 0.1y))}{\partial a} \times \frac{\partial a}{\partial x}
\end{aligned}
$$
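The chain rule above can be checked numerically. This is a minimal sketch that assumes a sigmoid activation and a squared-error loss $L(y', y) = (y' - y)^2$; both are assumptions for illustration, since the derivation holds for any differentiable $L$ and $\sigma$. The specific values of `x` and `y` are arbitrary.

```python
import numpy as np

def sigma(x):
    # Sigmoid activation (an assumption; any differentiable sigma works).
    return 1.0 / (1.0 + np.exp(-x))

def loss(pred, target):
    # Squared-error loss (also an assumption).
    return (pred - target) ** 2

x, y = 0.3, 7.0          # input to the activation and the true target
a = sigma(x)
y_pred = 10 * a          # scaled activation: y' = 10a

# Analytic gradient from the chain rule: dL/dy' * 10 * da/dx
dL_dypred = 2 * (y_pred - y)
da_dx = a * (1 - a)      # derivative of the sigmoid
grad_analytic = dL_dypred * 10 * da_dx

# Central finite-difference gradient of L(10 * sigma(x), y) w.r.t. x
eps = 1e-6
grad_numeric = (loss(10 * sigma(x + eps), y)
                - loss(10 * sigma(x - eps), y)) / (2 * eps)

print(grad_analytic, grad_numeric)  # the two should agree closely
```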

### Scaling output

On the other hand, let's scale the data down, so $y_s = 0.1y$. This means we do not need to scale $a$. The loss gradient is now:

$$
\frac{\partial L(a, y_s)}{\partial x} = \frac{\partial L(a, 0.1y)}{\partial a} \times \frac{\partial a}{\partial x}
$$

Now compare the final forms of the gradient for the two cases. The only difference is the arguments of $L$: when the output data is scaled down, the arguments of the loss function are 10 times smaller. That makes the gradient smaller, which makes the update to $x$ smaller. We usually want small updates so we can converge to a good solution instead of overshooting it.
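The size difference between the two final gradient forms can be made concrete. This sketch assumes a sigmoid activation and a squared-error loss (assumptions, not from the derivation above); with a quadratic loss, arguments that are 10 times smaller give a gradient that is 100 times smaller.

```python
import numpy as np

def sigma(x):
    # Sigmoid activation (an illustrative assumption).
    return 1.0 / (1.0 + np.exp(-x))

x, y = 0.3, 7.0
a = sigma(x)
da_dx = a * (1 - a)

# Case 1, activation scaled up: dL(10a, y)/da * da/dx with L = squared error
grad_scaled_activation = 2 * (10 * a - y) * 10 * da_dx

# Case 2, targets scaled down: dL(a, 0.1y)/da * da/dx
grad_scaled_output = 2 * (a - 0.1 * y) * da_dx

print(grad_scaled_activation, grad_scaled_output)
```

For this quadratic loss the second gradient is exactly 100 times smaller: the loss arguments shrink by 10, and the loss is squared in them. The exact factor depends on the loss; the direction of the effect does not.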

But also note that we can make the step size smaller anyway by reducing the learning rate.
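For the squared-error setup sketched earlier (a sigmoid activation and quadratic loss, both illustrative assumptions), the two routes produce the same step on $x$: a gradient step on the scaled-down targets equals a step on the scaled-up activation with a learning rate 100 times smaller.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

x, y, lr = 0.3, 7.0, 0.1
a = sigma(x)
da_dx = a * (1 - a)

grad_big = 2 * (10 * a - y) * 10 * da_dx    # activation scaled up by 10
grad_small = 2 * (a - 0.1 * y) * da_dx      # targets scaled down by 10

# Same update magnitude, two routes: the small gradient with lr,
# or the big gradient with lr / 100 (the factor 100 is specific to
# the quadratic loss assumed here).
step_scaled_targets = lr * grad_small
step_smaller_lr = (lr / 100) * grad_big
print(step_scaled_targets, step_smaller_lr)
```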

So scaling the output *down* instead of the activation *up* is a nice rule of thumb for better convergence. It is a heuristic, not a hard rule.