A question about Keras's concept of an epoch

tfers-migration · March 31, 2020, 4:26pm

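Generally speaking, one epoch means training on every training image once.
In Keras's fit, however, the number of training images per epoch is really steps_per_epoch multiplied by the batch size, and steps_per_epoch must be an integer, so one epoch here does not pass over all the images exactly once. For example, with 13 images and batch size = 8, the first epoch has run through only 8 images by the time it finishes.

Of course this is not a big problem; it just feels a bit awkward.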

model.fit(
    train_tfdata.make_one_shot_iterator(),
    steps_per_epoch=int(train_no / _BATCH_SIZE),  # rounds down, e.g. int(13 / 8) == 1
    epochs=_EPOCHS)
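
The rounding in steps_per_epoch can be checked with plain arithmetic. The sketch below is a pure-Python illustration, not Keras code, and the helper name images_per_keras_epoch is made up for this example.

```python
def images_per_keras_epoch(train_no, batch_size):
    """Images consumed by one Keras epoch when steps_per_epoch is rounded down."""
    steps_per_epoch = int(train_no / batch_size)  # same expression as in the fit call
    return steps_per_epoch * batch_size


# 13 images with a batch size of 8: one step per epoch, so 8 images per epoch.
seen = images_per_keras_epoch(13, 8)
print(seen, "of 13 images seen per epoch;", 13 - seen, "left over")
```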

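This could be rewritten to detect the end of an epoch from the tf.errors.OutOfRangeError that tf.data raises when the data runs out, but that adds code and hurts readability.
Does anyone have a simpler solution?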

Asked by 树涛, posted 2018-10-8 17:48:49

tfers-migration · March 31, 2020, 4:26pm

How about calling the repeat() method so that the dataset loops indefinitely?
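
As a rough illustration of this suggestion, the sketch below simulates a repeated, batched dataset in plain Python, with itertools.cycle standing in for repeat(). It is not tf.data code, and the names make_batches and run_epochs are invented for the example. Because the stream never runs dry, steps_per_epoch can be rounded up so that each epoch covers all 13 samples.

```python
import itertools
import math


def make_batches(num_samples, batch_size):
    """Split sample indices 0..num_samples-1 into batches; the last may be short."""
    samples = list(range(num_samples))
    return [samples[i:i + batch_size] for i in range(0, num_samples, batch_size)]


def run_epochs(num_samples, batch_size, epochs):
    """Return, for each epoch, the sorted sample indices that epoch consumed."""
    # itertools.cycle stands in for repeat(): the batch stream never ends.
    stream = itertools.cycle(make_batches(num_samples, batch_size))
    steps_per_epoch = math.ceil(num_samples / batch_size)  # rounded up, not down
    seen_per_epoch = []
    for _ in range(epochs):
        seen = []
        for _ in range(steps_per_epoch):
            seen.extend(next(stream))
        seen_per_epoch.append(sorted(seen))
    return seen_per_epoch


for epoch, seen in enumerate(run_epochs(13, 8, epochs=3), start=1):
    print("epoch", epoch, "saw", len(seen), "samples")
```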

Posted by 舟 3332 on 2018-10-9 00:24:09

tfers-migration · March 31, 2020, 4:27pm

The awkwardness comes mainly from Keras's concept of an epoch differing from what we usually mean by an epoch.

树涛, 2018-10-10 10:18

tfers-migration · March 31, 2020, 4:27pm

Right. When num_samples = 13 but batch_size = 8, the size of one epoch is actually 8, and the remaining 5 samples will not be used in the second epoch either.

Posted by wangzhe258369 on 2018-10-9 11:09:21

tfers-migration · March 31, 2020, 4:28pm

A correction: the remaining 5 samples will in fact be used in the second epoch.
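
This follows from the iterator not being reset between Keras epochs, so the second epoch picks up where the first stopped. The pure-Python sketch below mimics that with an ordinary Python iterator. It only models the behaviour described in this thread and does not call Keras, and keras_style_epochs is a hypothetical name.

```python
def keras_style_epochs(batches, steps_per_epoch, epochs):
    """Consume one shared iterator across epochs, stopping when it runs out."""
    iterator = iter(batches)  # created once, never reset between epochs
    history = []
    for _ in range(epochs):
        sizes = []
        for _ in range(steps_per_epoch):
            try:
                sizes.append(len(next(iterator)))
            except StopIteration:  # plays the role of OutOfRangeError
                return history + [sizes]
        history.append(sizes)
    return history


# 13 samples in batches of 8 gives one full batch and one batch of 5.
batches = [list(range(0, 8)), list(range(8, 13))]
print(keras_style_epochs(batches, steps_per_epoch=1, epochs=3))
# prints [[8], [5], []]
```

The first epoch sees the batch of 8, the second sees the batch of 5, and the third finds the iterator exhausted.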

Posted by wangzhe258369 on 2018-10-9 11:09:21

tfers-migration · March 31, 2020, 4:29pm

A short piece of code can check whether the last, incomplete batch is reached by the tf.data.Iterator:

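import tensorflow as tf

# dataset_train is assumed to be an already batched tf.data.Dataset of (x, y) pairs.
x_tensor_train, y_tensor_train = dataset_train.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    while True:
        try:
            y_batch = sess.run(y_tensor_train)
            print(y_batch.shape)  # the final shape shows whether the partial batch arrives
        except tf.errors.OutOfRangeError:
            break  # iterator exhausted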

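Try it and you will find that the last, incomplete batch is reached. So why is it not reached in Keras's model.fit? The cause lies in steps_per_epoch=int(train_no / _BATCH_SIZE), which rounds down and so leaves the partial batch out of each epoch's step count.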

Reading through the code of the fit method of tf.keras.models.Model, you can gradually trace the behaviour to the following code:

  for step_index in range(steps_per_epoch):
    batch_logs = {}
    batch_logs['batch'] = step_index
    batch_logs['size'] = 1
    callbacks.on_batch_begin(step_index, batch_logs)
    try:
      outs = f(ins)
    except errors.OutOfRangeError:
      logging.warning('Your dataset iterator ran out of data; '
                      'interrupting training. Make sure that your dataset '
                      'can generate at least `steps_per_epoch * epochs` '
                      'batches (in this case, %d batches).' %
                      steps_per_epoch * epochs)
      break

    if not isinstance(outs, list):
      outs = [outs]
    for l, o in zip(out_labels, outs):
      batch_logs[l] = o

    callbacks.on_batch_end(step_index, batch_logs)
    if callback_model.stop_training:
      break

If you set steps_per_epoch=int(train_no / _BATCH_SIZE) + 1, the last batch will be reached by the code above. Note that a steps_per_epoch that is too large can make steps_per_epoch * epochs exceed the number of batches your dataset can yield, so I suggest not going above int(train_no / _BATCH_SIZE) + 1.
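
One detail worth checking is that int(train_no / _BATCH_SIZE) + 1 adds an extra step even when the sample count divides evenly by the batch size. A minimal sketch, assuming the goal is exactly one step per batch including the partial one, is to round up instead. The name steps_to_cover is invented here.

```python
import math


def steps_to_cover(train_no, batch_size):
    """Smallest number of steps that reaches every sample once."""
    return math.ceil(train_no / batch_size)


for train_no in (13, 16):
    plus_one = int(train_no / 8) + 1
    print(train_no, "samples:", "int(...) + 1 =", plus_one,
          "| ceil =", steps_to_cover(train_no, 8))
```

With 13 samples both expressions give 2 steps, but with 16 samples the + 1 form gives 3 steps while only 2 batches exist.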
Hope this helps.

Posted by livernana on 2018-10-9 16:49:22

tfers-migration · March 31, 2020, 4:31pm

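With the + 1, an error shows up later: as the warning in the errors.OutOfRangeError handler indicates, steps_per_epoch * epochs cannot exceed the total number of batches the dataset can produce.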
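
The constraint in that warning can be written down as a small check. This is a sketch of the arithmetic only, assuming a dataset that is batched but not repeated. batches_available and fits_without_repeat are hypothetical helper names, not Keras functions.

```python
import math


def batches_available(num_samples, batch_size):
    """Batches a non-repeated dataset can yield, counting the partial one."""
    return math.ceil(num_samples / batch_size)


def fits_without_repeat(num_samples, batch_size, steps_per_epoch, epochs):
    """True if training needs no more batches than the dataset can supply."""
    return steps_per_epoch * epochs <= batches_available(num_samples, batch_size)


# 13 samples, batch size 8: only 2 batches exist in total.
print(fits_without_repeat(13, 8, steps_per_epoch=1, epochs=2))  # 1 * 2 <= 2
print(fits_without_repeat(13, 8, steps_per_epoch=2, epochs=2))  # 2 * 2 > 2
```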

Really, Keras's epoch differs from the epoch we usually talk about, and that is mainly what I wanted to discuss.

Our usual understanding of an epoch is as follows:

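    import tensorflow as tf

    dataset = tf.data.Dataset.range(13)
    dataset = dataset.batch(8)
    # An initializable iterator can be rewound; a one-shot iterator cannot.
    iterator = dataset.make_initializable_iterator()
    next_element = iterator.get_next()

    with tf.Session() as sess:
        # Compute for 100 epochs.
        for _ in range(100):
            sess.run(iterator.initializer)  # restart from the first batch
            while True:
                try:
                    sess.run(next_element)
                except tf.errors.OutOfRangeError:
                    print('Epoch End.')
                    break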

树涛, 2018-10-10 10:16