Common TensorFlow Modules

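It does seem the cause was the size of the training set: the GPU simply needs to run for a while. It works now, thanks!

One more question: with the same code, why is my number of steps per epoch half of yours (727 on my side, 1454 on yours)? There are also some warning messages; could you help me understand what they mean?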

Epoch 2/10
Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
Corrupt JPEG data: 228 extraneous bytes before marker 0xd9
Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Corrupt JPEG data: 214 extraneous bytes before marker 0xd9
Corrupt JPEG data: 162 extraneous bytes before marker 0xd9
Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
Warning: unknown JFIF revision number 0.00
Corrupt JPEG data: 396 extraneous bytes before marker 0xd9
Corrupt JPEG data: 252 extraneous bytes before marker 0xd9
Corrupt JPEG data: 2226 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
727/727 - 31s - loss: 0.6006 - sparse_categorical_accuracy: 0.6799

That is because I set batch_size to 16. The smaller the batch size, the more batches there are per epoch, and the less compute and memory each step consumes.
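The relationship between batch size and steps per epoch can be checked with a little arithmetic. The sketch below assumes a training set of 23,262 images and a batch size of 32 on the asker's side; neither number is stated in the thread, and the dataset size was picked only because it is consistent with both step counts quoted above. The helper name `steps_per_epoch` is made up for illustration.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Batches needed to cover the dataset once (the last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


num_examples = 23262  # assumed dataset size, not stated in the thread

print(steps_per_epoch(num_examples, 32))  # 727
print(steps_per_epoch(num_examples, 16))  # 1454
```

Halving the batch size doubles the step count, but each epoch still visits the same number of images.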

The warnings mean that a few images in the dataset could not be read cleanly; see @slyrx's question and the reply in "TensorFlow 模型建立与训练" (post #25).

OK, thanks!

A typo: TFRocrdDataset -> TFRecordDataset

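Hello, after running your code I start TensorBoard from the terminal with tensorboard --logdir=./tensorboard, but only the Scalars plots show up; Graphs and Profile cannot be visualized. Is there a fix?

Please post the code you wrote.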

Hello, when I run the cats_vs_dogs image classification I get UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 150: invalid continuation byte. Why is that? My code is as follows:

num_epochs = 10
batch_size = 32
learning_rate = 0.001
data_dir = 'E:\datasets\cats_vs_dogs'
train_cats_dir = data_dir + '/train/cats'
train_dogs_dir = data_dir + '/train/dogs'
test_cats_dir = data_dir + '/valid/cats'
test_dogs_dir = data_dir + '/valid/dogs'

def _decode_and_resize (filename,label):
    image_string = tf.io.read_file (filename)
    image_decoded = tf.image.decode_jpeg (image_string)
    image_resized = tf.image.resize (image_decoded,[256,256]) / 255.0
    return image_resized,label

if __name__ == '__main__':
    train_cat_filenames = tf.constant ([train_cats_dir + filename for filename in os.listdir (train_cats_dir)])
    train_dog_filenames = tf.constant ([train_dogs_dir + filename for filename in os.listdir (train_dogs_dir)])
    train_filenames = tf.concat ([train_cat_filenames,train_dog_filenames],axis=-1)
    train_labels = tf.concat ([
        tf.zeros (train_cat_filenames.shape,dtype=tf.int32),
        tf.ones (train_dog_filenames.shape,dtype=tf.int32)],
        axis=-1)
    
    train_datasets = tf.data.Dataset.from_tensor_slices ((train_filenames,train_labels))
    train_datasets = train_datasets.map (
        map_func=_decode_and_resize,
        num_parallel_calls=tf.data.experimental.AUTOTUNE)
    train_datasets = train_datasets.shuffle (buffer_size=23000)
    train_datasets = train_datasets.batch (batch_size)
    train_datasets = train_datasets.prefetch (tf.data.experimental.AUTOTUNE)
    
    model = tf.keras.Sequential ([
        tf.keras.layers.Conv2D (32, 3, activation='relu', input_shape=(256, 256, 3)),
        tf.keras.layers.MaxPooling2D (),
        tf.keras.layers.Conv2D (32, 5, activation='relu'),
        tf.keras.layers.MaxPooling2D (),
        tf.keras.layers.Flatten (),
        tf.keras.layers.Dense (64, activation='relu'),
        tf.keras.layers.Dense (2, activation='softmax')
    ])
    
    model.compile (
        optimizer=tf.keras.optimizers.Adam (learning_rate=learning_rate),
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        metrics=[tf.keras.metrics.sparse_categorical_accuracy]
    )
    
    model.fit (train_datasets, epochs=num_epochs)

Which line of your code raises the error?

The last line, model.fit(train_datasets, epochs=num_epochs). The error output looks like this:

UnicodeDecodeError Traceback (most recent call last)
in
48 )
49
—> 50 model.fit (train_datasets, epochs=num_epochs)

~\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training.py in fit (self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
817 max_queue_size=max_queue_size,
818 workers=workers,
→ 819 use_multiprocessing=use_multiprocessing)
820
821 def evaluate (self,

~\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py in fit (self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
340 mode=ModeKeys.TRAIN,
341 training_context=training_context,
→ 342 total_epochs=epochs)
343 cbks.make_logs (model, epoch_logs, training_result, ModeKeys.TRAIN)
344

~\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py in run_one_epoch (model, iterator, execution_function, dataset_size, batch_size, strategy, steps_per_epoch, num_samples, mode, training_context, total_epochs)
126 step=step, mode=mode, size=current_batch_size) as batch_logs:
127 try:
→ 128 batch_outs = execution_function (iterator)
129 except (StopIteration, errors.OutOfRangeError):
130 # TODO (kaftan): File bug about tf function and errors.OutOfRangeError?

~\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py in execution_function (input_fn)
96 # numpy translates Tensors to values in Eager mode.
97 return nest.map_structure (_non_none_constant_value,
—> 98 distributed_function (input_fn))
99
100 return execution_function

~\Anaconda3\lib\site-packages\tensorflow_core\python\eager\def_function.py in __call__(self, *args, **kwds)
566 xla_context.Exit ()
567 else:
→ 568 result = self._call (*args, **kwds)
569
570 if tracing_count == self._get_tracing_count ():

~\Anaconda3\lib\site-packages\tensorflow_core\python\eager\def_function.py in _call (self, *args, **kwds)
630 # Lifting succeeded, so variables are initialized and we can run the
631 # stateless function.
→ 632 return self._stateless_fn (*args, **kwds)
633 else:
634 canon_args, canon_kwds = \

~\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py in __call__(self, *args, **kwargs)
2361 with self._lock:
2362 graph_function, args, kwargs = self._maybe_define_function (args, kwargs)
→ 2363 return graph_function._filtered_call (args, kwargs) # pylint: disable=protected-access
2364
2365 @property

~\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py in _filtered_call (self, args, kwargs)
1609 if isinstance (t, (ops.Tensor,
1610 resource_variable_ops.BaseResourceVariable))),
→ 1611 self.captured_inputs)
1612
1613 def _call_flat (self, args, captured_inputs, cancellation_manager=None):

~\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py in _call_flat (self, args, captured_inputs, cancellation_manager)
1690 # No tape is watching; skip to running the function.
1691 return self._build_call_outputs (self._inference_function.call (
→ 1692 ctx, args, cancellation_manager=cancellation_manager))
1693 forward_backward = self._select_forward_and_backward_functions (
1694 args,

~\Anaconda3\lib\site-packages\tensorflow_core\python\eager\function.py in call (self, ctx, args, cancellation_manager)
543 inputs=args,
544 attrs=("executor_type", executor_type, "config_proto", config),
→ 545 ctx=ctx)
546 else:
547 outputs = execute.execute_with_cancellation (

~\Anaconda3\lib\site-packages\tensorflow_core\python\eager\execute.py in quick_execute (op_name, num_outputs, inputs, attrs, ctx, name)
59 tensors = pywrap_tensorflow.TFE_Py_Execute (ctx._handle, device_name,
60 op_name, inputs, attrs,
—> 61 num_outputs)
62 except core._NotOkStatusException as e:
63 if name is not None:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 149: invalid continuation byte

Another typo, I'm afraid: when you set up the data directories you left off the trailing / on each one.

data_dir = 'E:\datasets\cats_vs_dogs'
train_cats_dir = data_dir + '/train/cats'
train_dogs_dir = data_dir + '/train/dogs'
test_cats_dir = data_dir + '/valid/cats'
test_dogs_dir = data_dir + '/valid/dogs'

It should be:

data_dir = 'E:\datasets\cats_vs_dogs'
train_cats_dir = data_dir + '/train/cats/'
train_dogs_dir = data_dir + '/train/dogs/'
test_cats_dir = data_dir + '/valid/cats/'
test_dogs_dir = data_dir + '/valid/dogs/'

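When this code runs, it emits a large number of errors of the form

2020-06-14 13:27:38.031283: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at whole_file_read_ops.cc:116 : Not found: NewRandomAccessFile failed to Create/Open: C:\datasets\cats_vs_dogs/train/catscat.0.jpg : 系统找不到指定的文件。

(the trailing system message means "The system cannot find the file specified"). A path like C:\datasets\cats_vs_dogs/train/catscat.0.jpg is clearly wrong, so the problem should be easy to pin down. The UnicodeDecodeError itself is most likely a side effect: the localized Windows message appears to be encoded in GBK rather than UTF-8 (0xd5 is the first byte of 找 in GBK), so it cannot be decoded when the error is passed back to Python.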
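The missing separator described above is easy to reproduce, and os.path.join avoids it regardless of whether the directory string ends with a slash. This is a minimal sketch; the relative directory used here is hypothetical and stands in for the asker's real dataset location.

```python
import os

# Hypothetical directory; substitute the real dataset location.
train_cats_dir = os.path.join("datasets", "cats_vs_dogs", "train", "cats")

# Plain string concatenation silently fuses the directory and file names.
broken = train_cats_dir + "cat.0.jpg"

# os.path.join inserts the separator between the two components.
fixed = os.path.join(train_cats_dir, "cat.0.jpg")

print(os.path.basename(broken))  # catscat.0.jpg
print(os.path.basename(fixed))   # cat.0.jpg
```

In the training script, the list comprehension over os.listdir(train_cats_dir) could build each path with os.path.join(train_cats_dir, filename) instead of `+`.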

Got it, thank you very much!

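Hello, what final accuracy did you get on your cats_vs_dogs image classification example? I used num_epochs=10 and buffer_size=12000 in shuffle, because setting it to 23000 raises an error. With that, my result is only 73%.

A test-set result of about 73% is normal here. This example is mainly meant to show how tf.data is used, so neither the CNN model nor the hyperparameters were tuned carefully, and there is still a lot of room for improvement.

If you are worried that the shuffle is not uniform enough, see "Tensorflow 如何载入大型数据集" (post #2 by snowkylin).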

tf.config.list_physical_devices should be tf.config.experimental.list_physical_devices.

This part of the TensorFlow API is not very stable yet. As I recall, tf.config.experimental.list_physical_devices was later moved to tf.config.list_physical_devices; you can try it in TF 2.2.

Xuelin, when I run cats_vs_dogs I never get a final result. The output is below.
2020-07-06 20:49:54.398388: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-06 20:49:56.487140: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-06 20:49:57.758546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.35GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 245.91GiB/s
2020-07-06 20:49:57.758722: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-06 20:49:57.782157: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-06 20:49:57.803768: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-06 20:49:57.807765: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-06 20:49:57.836030: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-06 20:49:57.846999: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-06 20:49:57.895531: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-06 20:49:57.896272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-07-06 20:49:58.043598: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-06 20:49:58.045038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.35GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 245.91GiB/s
2020-07-06 20:49:58.045217: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-06 20:49:58.045294: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-06 20:49:58.045369: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-06 20:49:58.045449: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-06 20:49:58.045551: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-06 20:49:58.045645: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-06 20:49:58.045735: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-06 20:49:58.046006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-07-06 20:49:59.779958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-06 20:49:59.780047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-07-06 20:49:59.780097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-07-06 20:49:59.781944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4733 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Train for 719 steps
Epoch 1/10
2020-07-06 20:50:00.343809: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-06 20:50:10.828637: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:150] Filling up shuffle buffer (this may take a while): 17766 of 23000
2020-07-06 20:50:20.879926: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:150] Filling up shuffle buffer (this may take a while): 20538 of 23000

Process finished with exit code -1073740791 (0xC0000409)

Try a smaller shuffle buffer_size, or a smaller subset of the data, and see whether you are running out of memory.
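A rough estimate shows why memory is a plausible culprit. Because shuffle comes after map in this pipeline, the buffer holds decoded and resized images rather than filenames. The sketch below assumes each buffered element is a 256x256x3 float32 tensor (4 bytes per value) and ignores labels and per-tensor overhead; the helper name `shuffle_buffer_bytes` is made up for illustration.

```python
def shuffle_buffer_bytes(buffer_size, height, width, channels, bytes_per_value=4):
    """Approximate memory held by a shuffle buffer of decoded float32 images."""
    return buffer_size * height * width * channels * bytes_per_value


for size in (23000, 12000):
    gib = shuffle_buffer_bytes(size, 256, 256, 3) / 2**30
    print(f"buffer_size={size}: about {gib:.1f} GiB")
# buffer_size=23000: about 16.8 GiB
# buffer_size=12000: about 8.8 GiB
```

Under these assumptions a full buffer of 23000 images would not fit in the RAM of many desktop machines. One commonly used alternative is to shuffle before map, so that the buffer holds only filename strings and labels.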

The cats-vs-dogs image classification section contains this code:

    train_dataset = tf.data.Dataset.from_tensor_slices ((train_filenames, train_labels))
    train_dataset = train_dataset.map (
        map_func=_decode_and_resize, 
        num_parallel_calls=tf.data.experimental.AUTOTUNE)
    # Put the first buffer_size elements into a buffer and sample from it at random; each sampled element is replaced by the next element of the dataset
    train_dataset = train_dataset.shuffle (buffer_size=23000)    
    train_dataset = train_dataset.batch (batch_size)
    train_dataset = train_dataset.prefetch (tf.data.experimental.AUTOTUNE)

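I would like to understand the execution details. Since map is applied to train_dataset first, does that mean every element of train_dataset is preprocessed before shuffle and the later operations run? If not, when does the map operation actually happen? This has puzzled me for days; could you walk through how these few lines execute?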
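One way to picture the execution order is with plain Python generators. The sketch below is only an analogy under simplifying assumptions, not tf.data's actual implementation: all function names are hypothetical, and the real pipeline additionally runs map in parallel and prefetches in the background, so elements may be processed somewhat ahead of demand. The point it illustrates is that each stage pulls elements from the previous one on request, so map is applied element by element as shuffle fills and refills its buffer, not to the whole dataset up front.

```python
import random


def lazy_map(source, fn, log):
    """Apply fn to each element only when a downstream stage asks for it."""
    for item in source:
        log.append(item)  # record the moment this element is actually processed
        yield fn(item)


def shuffle_buffer(source, buffer_size, seed=None):
    """Fill a buffer, then repeatedly emit a random slot and refill it."""
    rng = random.Random(seed)
    it = iter(source)
    buffer = []
    for item in it:  # initial fill: pulls buffer_size elements through map
        buffer.append(item)
        if len(buffer) >= buffer_size:
            break
    for item in it:  # steady state: emit one element, replace it with the next
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    rng.shuffle(buffer)  # source exhausted: drain what is left in random order
    yield from buffer


def batch(source, batch_size):
    """Group consecutive elements into lists of batch_size (last may be shorter)."""
    chunk = []
    for item in source:
        chunk.append(item)
        if len(chunk) == batch_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


processed = []
pipeline = batch(
    shuffle_buffer(lazy_map(range(10), lambda x: x * x, processed),
                   buffer_size=4, seed=0),
    batch_size=2,
)

first_batch = next(pipeline)
print(processed)  # [0, 1, 2, 3, 4, 5]
```

After the first batch is requested, only six of the ten elements have gone through the map function: four to fill the buffer and two more to replace the sampled ones. Nothing runs at all until the first element is requested, which in the real code happens inside model.fit.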

Hello, when I use TensorBoard to view Graphs and Profile, the program prints:
W0807 10:45:57.553489 8936 deprecation.py:323] From D:\PYTHON\lib\site-packages\tensorflow\python\ops\summary_ops_v2.py:1259: stop (from tensorflow.python.eager.profiler) is deprecated and will be removed after 2020-07-01.
Instructions for updating:
use tf.profiler.experimental.stop instead.
2020-08-07 10:45:58.602293: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:223] GpuTracer has collected 0 callback api events and 0 activity events.
W0807 10:45:59.775066 8936 deprecation.py:323] From D:\PYTHON\lib\site-packages\tensorflow\python\ops\summary_ops_v2.py:1259: save (from tensorflow.python.eager.profiler) is deprecated and will be removed after 2020-07-01.
Instructions for updating:
tf.python.eager.profiler has deprecated, use tf.profiler instead.
W0807 10:45:59.780068 8936 deprecation.py:323] From D:\PYTHON\lib\site-packages\tensorflow\python\eager\profiler.py:151: maybe_create_event_file (from tensorflow.python.eager.profiler) is deprecated and will be removed after 2020-07-01.
Instructions for updating:
tf.python.eager.profiler has deprecated, use tf.profiler instead.

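In TensorBoard there is still no output under Graphs or Profile, while Scalars works normally. How can I fix this? I searched but found no related issue. Versions: tensorflow-gpu 2.3.0, tensorboard 2.3.0. Thanks for your help.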