Sorry I just opened that file now, and browsed through it very quickly, but my eye fell on the excerpt:
```
However, we did not observe any speedup by increasing the batch size from
65536 to 131072 for the first stage, thus, we restrict the batch size to 65536 for this stage.
```
which I think is more or less my point: increasing batch size essentially always helps, but the speedup reduces the more you push the batch size. Provided that your dataset is large enough, more batch size will always make you run a bit faster without sacrificing accuracy, but the speedup will be less and less as you increase the batch size, until you are anyway maxing out the power of your GPU and you can't see any measurable speedup anymore.