Switching optimizers mid-training fails with Keras + Estimator API

tfers-migration · March 30, 2020, 4:50am

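My code defines a model with Keras, then converts it with tf.keras.estimator.model_to_estimator to the officially recommended Estimator API and trains it. The Estimator periodically saves the model as a TensorFlow checkpoint. If I change the optimizer partway through training, for example from Adam to SGD, the following exception is raised: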

tensorflow.python.framework.errors_impl.NotFoundError: Key training/SGD/Variable_10 not found in checkpoint

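The reason is that the SGD-related variables cannot be found in the checkpoint.

Does anyone know how to solve this?

I tried modifying the checkpoint file by hand, but could not find a way that works.

Asked by: winter
Posted: 2018-05-22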

tfers-migration · March 30, 2020, 4:51am

I think the optimizer is probably written into the graph as well, so after switching optimizers some of its parameters can no longer be read.

舟3332
Posted 2018-5-21 23:38:26

tfers-migration · March 30, 2020, 4:51am

That is the cause: when the Estimator writes a checkpoint, it saves all variables in the environment by default.

winter, 2018-5-25 10:57
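To see the mismatch concretely, one can compare the variable names the new graph asks for with the names stored in the checkpoint. The following is only a minimal pure-Python sketch: the helper missing_from_checkpoint and every sample name except training/SGD/Variable_10 are hypothetical. In TensorFlow 1.x the checkpoint names could be obtained from tf.train.list_variables(model_dir) and the graph names from tf.global_variables().

```python
def missing_from_checkpoint(graph_var_names, ckpt_var_names):
    """Return the graph variable names that have no entry in the checkpoint."""
    ckpt = set(ckpt_var_names)
    return sorted(name for name in graph_var_names if name not in ckpt)


# Hypothetical names, for illustration only.
# A checkpoint written while training with Adam:
ckpt_names = ["dense_1/kernel", "dense_1/bias",
              "training/Adam/Variable_10", "global_step"]
# The graph rebuilt after switching to SGD:
graph_names = ["dense_1/kernel", "dense_1/bias",
               "training/SGD/Variable_10", "global_step"]

missing = missing_from_checkpoint(graph_names, ckpt_names)
print(missing)  # these are the keys a full restore would fail on
```

Any name in the result is a variable the restore would look up and not find, which matches the NotFoundError in the question.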

tfers-migration · March 30, 2020, 4:53am

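A disclaimer first: I have not used this approach in a real application. It appears to work in the contrived example below, but if it does not match what you actually need, please bear with me and point it out so we can all learn.

Broadly speaking, what you need is to selectively restore an Estimator's parameters from a checkpoint. As far as I know, there are roughly three ways to do that: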

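1. Scaffold: this does not seem to be what the TensorFlow developers consider the best approach, so I took the lazy route and did not look into it.
2. tf.train.init_from_checkpoint: the approach used below.
3. WarmStartSettings: this seems to be the better approach, but it has to be passed in through warm_start_from when the Estimator is constructed, and tf.keras.estimator.model_to_estimator does not offer that, so it would probably be hard to apply to your problem.

All three approaches are in fact mentioned in a GitHub issue, which you may want to look at:
https://github.com/tensorflow/tensorflow/issues/14713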

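The idea behind using tf.train.init_from_checkpoint is to switch to a new model directory after changing the optimizer and train again. The Estimator then cannot restore parameters from a checkpoint on its own (the new directory contains no earlier checkpoint), so tf.train.init_from_checkpoint is used to find the checkpoint in the original model directory and restore the parameters from it. Because the Estimator here is built by tf.keras.estimator.model_to_estimator, a SessionRunHook is also needed to inject the tf.train.init_from_checkpoint call. Test code: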

import tensorflow as tf
from tensorflow.python.tools import inspect_checkpoint as chkp

tf.logging.set_verbosity(tf.logging.INFO)

# Configuration 1: the original model, using the Adam optimizer
current_model_dir = 'test_Adam'
old_model_dir = 'test_Adam'
optimizer = 'Adam'

# Configuration 2: the model switched to the SGD optimizer
# current_model_dir = 'test_SGD'
# old_model_dir = 'test_Adam'
# optimizer = 'SGD'

# For a single-input model with 2 classes (binary classification):
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(32, activation='relu', input_dim=100))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
keras_estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, model_dir=current_model_dir)

# Generate dummy data
import numpy as np
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={model.input_names[0]: np.random.random((1000, 100))},
    y=np.random.randint(2, size=(1000, 1)),
    batch_size=200,
    shuffle=False)

class InitHook(tf.train.SessionRunHook):
    def __init__(self, ckpt_dir_or_file):
        self.ckpt_dir_or_file = ckpt_dir_or_file

    def begin(self):
        print('-----------------------------------------------------------------')
        try:
            ckpt_var_list = [vt[0] for vt in tf.train.list_variables(self.ckpt_dir_or_file)]
            assignment_map = {v.op.name: v
                              for v in tf.global_variables()
                              if v.op.name in ckpt_var_list}
            tf.train.init_from_checkpoint(self.ckpt_dir_or_file, assignment_map)
        except Exception as e:  # e.g. no checkpoint exists yet in that directory
            tf.logging.info('No custom variable initialization: %s', e)
        print('-----------------------------------------------------------------')

# Train the model with all data from train_input_fn
keras_estimator.train(train_input_fn, hooks=[InitHook(old_model_dir)])

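How to test:

1. First, use the Adam optimizer with both the new and old model directories set to test_Adam. Although the Estimator can restore parameters from the checkpoint here, tf.train.init_from_checkpoint overrides that, so in effect the parameters are still restored through tf.train.init_from_checkpoint. You can of course train this stage the way you did originally. I set it up this way partly to prepare for the second stage, which is where the problem actually gets solved, and partly to keep the model as close as possible to the second stage so the two are easy to compare.
2. This stage simulates your situation: switching to the SGD optimizer. Use a new model directory, test_SGD, with test_Adam as the old model directory. In InitHook, tf.train.init_from_checkpoint restores the variables shared by the old and new models, and the SGD variables that were not initialized this way are initialized by the Estimator.
3. One more remark: if you have not modified the model, the different training runs can simply use your original approach, and there is no need for this one.
4. Finally, and this is a point I would rather play down: an Estimator converted from Keras differs slightly from a freshly built Estimator in how its parameters are initialized. That should not matter for the toy example above, and since my expertise is limited I will only mention it here without going into detail.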
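The selection step inside InitHook.begin can be looked at in isolation, without TensorFlow: keep only the names that appear in both the new graph and the old checkpoint, so the optimizer-specific variables drop out on their own. This is only a sketch under stated assumptions: shared_assignment_map and all the sample names are hypothetical, and in the hook itself the map's values are the variables from tf.global_variables() rather than strings.

```python
def shared_assignment_map(graph_var_names, ckpt_var_names):
    """Map each name present in both graph and checkpoint to itself."""
    ckpt = set(ckpt_var_names)
    return {name: name for name in graph_var_names if name in ckpt}


# Hypothetical names, for illustration only.
old_ckpt = ["dense_1/kernel", "dense_1/bias",
            "dense_2/kernel", "dense_2/bias",
            "training/Adam/Variable", "global_step"]
new_graph = ["dense_1/kernel", "dense_1/bias",
             "dense_2/kernel", "dense_2/bias",
             "training/SGD/Variable", "global_step"]

assignment_map = shared_assignment_map(new_graph, old_ckpt)

# Variables that are not in the map keep their normal initializers.
left_to_initializer = sorted(set(new_graph) - set(assignment_map))

print(sorted(assignment_map))
print(left_to_initializer)
```

Note that global_step is shared by both lists, so it ends up in the map along with the layer weights. If the new run should not reuse the old step count, it would have to be filtered out explicitly.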

yunhai_luo, posted 2018-5-22 12:01:08