I want to understand the difference between the roles of tf.clip_by_value and tf.clip_by_global_norm when implementing gradient clipping in TensorFlow. Which one is preferred, and how do I decide the max value to clip on?
TL;DR: use tf.clip_by_global_norm for gradient clipping.
clip_by_value
tf.clip_by_value clips each value inside one tensor, regardless of the other values in the tensor. For instance:
tf.clip_by_value([-1, 2, 10], 0, 3) > [0, 2, 3] # Only the values below 0 or above 3 are changed
Consequently, it can change the direction of the tensor. It should therefore be used only if the values in the tensor are decorrelated from one another (which is not the case for gradient clipping), or to avoid zero / infinite values in a tensor that could lead to NaN / infinite values elsewhere (by clipping with a minimum of epsilon = 1e-8 and a very big max value, for instance).
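As a rough sketch of this element-wise behaviour (written in plain NumPy rather than TensorFlow so it runs standalone; the function name is mine, not a TF API):

```python
import numpy as np

def clip_by_value_sketch(t, clip_value_min, clip_value_max):
    """Clip each entry into [clip_value_min, clip_value_max] independently
    of the other entries -- the same idea as tf.clip_by_value."""
    return np.clip(np.asarray(t, dtype=float), clip_value_min, clip_value_max)

print(clip_by_value_sketch([-1, 2, 10], 0, 3))  # [0. 2. 3.]
```

Note how -1 and 10 are clipped but 2 is untouched: the ratios between the entries change, so the vector no longer points the same way.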
clip_by_norm
tf.clip_by_norm rescales one tensor, if necessary, so that its L2 norm does not exceed a given threshold. It is typically useful to avoid an exploding gradient on one tensor, because you keep the gradient direction. For instance:
tf.clip_by_norm([2., 3., 6.], 5.) > [2, 3, 6] * 5/7 # The original L2 norm is 7, which is > 5, so the result has norm 5
tf.clip_by_norm([2., 3., 6.], 9.) > [2, 3, 6] # The original L2 norm is 7, which is < 9, so the tensor is left unchanged
However, clip_by_norm works on only one gradient at a time, so if you use it on all your gradient tensors separately, you'll unbalance them (some will be rescaled, others not, and not all with the same factor).
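The rescaling can be sketched in NumPy as follows (the function name is mine; the real tf.clip_by_norm also supports an axes argument and handles edge cases differently):

```python
import numpy as np

def clip_by_norm_sketch(t, clip_norm):
    """Rescale the whole tensor if its L2 norm exceeds clip_norm, and leave
    it unchanged otherwise -- the same idea as tf.clip_by_norm. Either way,
    the direction of the tensor is preserved."""
    t = np.asarray(t, dtype=float)
    norm = np.linalg.norm(t)
    if norm > clip_norm:
        t = t * (clip_norm / norm)
    return t

print(clip_by_norm_sketch([2., 3., 6.], 5.))  # [2, 3, 6] * 5/7
print(clip_by_norm_sketch([2., 3., 6.], 9.))  # unchanged: norm 7 <= 9
```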
Note that the first two functions work on a single tensor, while the last one is used on a list of tensors.
clip_by_global_norm
tf.clip_by_global_norm rescales a list of tensors so that the total norm of the vector of all their norms does not exceed a threshold. The goal is the same as for clip_by_norm (avoid exploding gradients, keep the gradient directions), but it works on all the gradients at once rather than on each one separately: all of them are rescaled by the same factor if necessary, or none of them is. This is better, because the balance between the different gradients is maintained.
For instance:
tf.clip_by_global_norm([tf.constant([2., 3., 6.]), tf.constant([4., 6., 12.])], 14.5)
will rescale both tensors by a factor 14.5 / sqrt(7^2 + 14^2) ≈ 0.926, because the first tensor has an L2 norm of 7, the second one 14, and sqrt(7^2 + 14^2) = sqrt(245) ≈ 15.65 > 14.5.
This (tf.clip_by_global_norm) is the one that you should use for gradient clipping. See this for more information.
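The computation can be sketched like this (plain NumPy, my own function name; the real tf.clip_by_global_norm likewise returns the clipped list together with the global norm):

```python
import math
import numpy as np

def clip_by_global_norm_sketch(tensors, clip_norm):
    """Rescale every tensor in the list by one common factor so that the
    global norm (L2 norm of all values taken together) does not exceed
    clip_norm -- the same idea as tf.clip_by_global_norm."""
    tensors = [np.asarray(t, dtype=float) for t in tensors]
    global_norm = math.sqrt(sum(float(np.sum(t * t)) for t in tensors))
    scale = clip_norm / global_norm if global_norm > clip_norm else 1.0
    return [t * scale for t in tensors], global_norm

clipped, gnorm = clip_by_global_norm_sketch([[2., 3., 6.], [4., 6., 12.]], 14.5)
print(gnorm)  # sqrt(7**2 + 14**2) = sqrt(245), about 15.65
# Both tensors are rescaled by the same factor 14.5 / gnorm,
# so the balance between them (and the overall direction) is kept.
```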
Choosing the value
Choosing the max value is the hardest part. You should use the biggest value such that you don't have exploding gradients (whose symptoms can be NaNs or infinite values appearing in your tensors, or a constant loss / accuracy after a few training steps). The value should be bigger for tf.clip_by_global_norm than for the others, since the global L2 norm will be mechanically bigger than the individual ones due to the number of tensors involved.
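One common heuristic (my suggestion, not part of the answer above): train for a while without clipping, log the global gradient norm at each step (e.g. via tf.linalg.global_norm), and set the threshold around a high percentile of the observed norms, so that clipping only triggers on the rare outlier steps. A sketch with made-up numbers:

```python
import numpy as np

# Hypothetical global gradient norms logged over a few training steps;
# in real code you would record tf.linalg.global_norm(gradients).
observed_norms = [3.1, 2.8, 3.5, 4.0, 2.9, 3.3, 55.0, 3.2, 3.0, 3.4]

# A threshold around the 90th percentile lets typical steps through
# untouched and rescales only the exploding step (55.0 here).
clip_norm = float(np.percentile(observed_norms, 90))
print(clip_norm)
```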


I mean direction in the same sense as the direction of a vector: if you multiply all the values in your tensor by 2 or by 0.5, you don't change the ratio between the different values in it, so you don't change its direction (it still represents a movement in the same direction, just with a bigger distance). If you change one value more than the others, then you do change the ratio, so you do change the direction, which is bad in the case of a gradient that's supposed to point in the right direction (approximately in the direction of the minimum). – gdelab Feb 3 '19 at 0:33

@gdelab when you say "you should use the biggest value such that you don't have exploding gradients", what values are you referring to? – AleB May 12 '20 at 9:07