TensorFlow Distributed Training

With the second strategy, the code raises an error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-4-df7add50a6f4> in <module>()
     15     'task': {'type': 'worker', 'index': 0}
     16 })
---> 17 strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
     18 batch_size = batch_size_per_replica * num_workers
     19 

6 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py in configure_collective_ops(self, collective_leader, scoped_allocator_enabled_ops, use_nccl_communication, device_filters)
    727 
    728     if self._context_handle is not None:
--> 729       raise RuntimeError("Collective ops must be configured at program startup")
    730 
    731     self._collective_leader = collective_leader

RuntimeError: Collective ops must be configured at program startup
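
The error is raised by configure_collective_ops() in tensorflow/python/eager/context.py: MultiWorkerMirroredStrategy has to configure collective ops on the eager context, and that is only allowed before the context handle has been created. In a Colab or Jupyter session the context has usually already been initialized by an earlier TensorFlow call, so constructing the strategy mid-session fails. The usual workaround is to make the strategy construction the very first TensorFlow call in the process, which in practice means running the training code as a standalone script (one process per worker) or restarting the notebook runtime so the strategy is created before anything else touches TensorFlow. The sketch below illustrates this ordering; the worker addresses, batch size, and model are placeholder assumptions, not values from the original code.

    # train_worker.py -- minimal sketch; launch one copy per worker with
    # its own 'index' in TF_CONFIG, instead of running this in a notebook
    # cell whose runtime has already initialized TensorFlow.
    import json
    import os

    # TF_CONFIG must be set before the strategy is created.
    # The two-worker cluster and addresses below are assumptions.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
        "task": {"type": "worker", "index": 0},
    })

    import tensorflow as tf

    # This must be the first TensorFlow call in the process. If any op has
    # already initialized the eager context, configure_collective_ops()
    # raises the RuntimeError shown above.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    batch_size_per_replica = 64   # assumed value
    num_workers = 2               # matches the cluster spec above
    batch_size = batch_size_per_replica * num_workers

    with strategy.scope():
        # Variables created here are mirrored and kept in sync across workers.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(10, activation="softmax", input_shape=(784,))
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

In a notebook, the equivalent fix is to restart the runtime and put the TF_CONFIG assignment and the strategy constructor in the first cell, before any other TensorFlow code runs.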