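Contents
- 1. Origin of the problem
- 2. Development environment
- 3. TensorFlow's GPU memory allocation strategy
- 4. Analysis and verification
- 5. Analysis of the GPU allocation strategies
- 6. Extensions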
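1. Origin of the problem
The stack trace below shows that the BLAS library failed to initialize (the cuBLAS handle could not be created) and that the exception was raised while executing MatMul. The most likely cause is that the GPU ran out of memory, possibly because the dataset is too large.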
2021-08-10 16:38:04.917501: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.960048: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.986898: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.992366: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.992389: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
File "/home/mango/PycharmProjects/DeepLearing/minist_conv.py", line 32, in <module>
model.fit(train_images, train_labels, epochs=5, batch_size=64)
File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/keras/engine/training.py", line 1183, in fit
tmp_logs = self.train_function(iterator)
File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 950, in _call
return self._stateless_fn(*args, **kwds)
File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 3023, in __call__
return graph_function._call_flat(
File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 591, in call
outputs = execute.execute(
File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMM launch failed : a.shape=[1,64,576], b.shape=[1,576,64], m=64, n=64, k=576
[[node sequential/dense/MatMul (defined at home/mango/PycharmProjects/DeepLearing/minist_conv.py:32) ]] [Op:__inference_train_function_993]
Function call stack:
train_function
2. Development environment
mango@mango-ubuntu:~$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0
mango@mango-ubuntu:~$ tail -n 10 /usr/include/cudnn_version.h
#ifndef CUDNN_VERSION_H_
#define CUDNN_VERSION_H_
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 2
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#endif /* CUDNN_VERSION_H */
mango@mango-ubuntu:~$ python3 --version
Python 3.9.5
mango@mango-ubuntu:~$ nvidia-smi
Tue Aug 10 19:57:58 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 54C P0 N/A / N/A | 329MiB / 2002MiB | 9% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1818 G /usr/lib/xorg/Xorg 186MiB |
| 0 N/A N/A 2002 G /usr/bin/gnome-shell 45MiB |
| 0 N/A N/A 3435 G ...AAAAAAAAA= --shared-files 75MiB |
| 0 N/A N/A 6016 G python3 13MiB |
+-----------------------------------------------------------------------------+
mango@mango-ubuntu:~$ python3
Python 3.9.5 (default, May 11 2021, 08:20:37)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-08-10 18:33:05.917520: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> tf.__version__
'2.5.0'
>>>
3. TensorFlow's GPU memory allocation strategy
By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation.
In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process. TensorFlow provides two methods to control this.
The first option is to turn on memory growth by calling tf.config.experimental.set_memory_growth, which attempts to allocate only as much GPU memory as needed for the runtime allocations: it starts out allocating very little memory, and as the program gets run and more GPU memory is needed, the GPU memory region is extended for the TensorFlow process. Memory is not released since it can lead to memory fragmentation. To turn on memory growth for a specific GPU, use the following code prior to allocating any tensors or executing any ops.
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
Another way to enable this option is to set the environmental variable TF_FORCE_GPU_ALLOW_GROWTH to true. This configuration is platform specific.
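As a minimal sketch of the environment-variable route, the variable can also be set from Python itself, provided this happens before TensorFlow initializes the GPU. The try/except guard is an addition for illustration only, so that the snippet also runs on a machine without TensorFlow installed.

```python
import os

# Set the variable before TensorFlow is imported, so it is already in place
# when the GPU is initialized.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

try:
    import tensorflow as tf
except ImportError:
    tf = None  # illustration-only guard: TensorFlow is not installed here

if tf is not None:
    # Listing the devices is just a quick check that the import worked.
    print(tf.config.list_physical_devices("GPU"))
```

Exporting the variable in the shell before starting Python has the same effect and avoids any dependence on import order.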
The second method is to configure a virtual GPU device with tf.config.set_logical_device_configuration (also exposed under its older name, tf.config.experimental.set_virtual_device_configuration) and set a hard limit on the total memory to allocate on the GPU.
This is useful if you want to truly bound the amount of GPU memory available to the TensorFlow process. This is common practice for local development when the GPU is shared with other applications such as a workstation GUI.
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
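4. Analysis and verification
As the TensorFlow documentation quoted above explains, TensorFlow claims nearly all GPU memory by default, but it provides two ways to control the allocation strategy.
First, we can let GPU memory be allocated dynamically, on demand: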
import tensorflow as tf
physical_gpus = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_gpus[0], True)
Polling nvidia-smi with the following command shows that the training process's GPU memory usage peaked at 697 MiB:
mango@mango-ubuntu:~$ while true; do nvidia-smi; sleep 0.2; done;
Tue Aug 10 20:30:58 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 58C P0 N/A / N/A | 1026MiB / 2002MiB | 72% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1818 G /usr/lib/xorg/Xorg 186MiB |
| 0 N/A N/A 2002 G /usr/bin/gnome-shell 45MiB |
| 0 N/A N/A 3435 G ...AAAAAAAAA= --shared-files 73MiB |
| 0 N/A N/A 6016 G python3 13MiB |
| 0 N/A N/A 13829 C /usr/bin/python3.9 697MiB |
+-----------------------------------------------------------------------------+
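The shell loop above can also be approximated in Python. The sketch below polls nvidia-smi through its CSV query interface and keeps the highest reading per GPU. The helper names (`parse_used_mib`, `read_used_mib`, `track_peak`) are hypothetical and not part of any library. Note that `memory.used` is the device-wide figure from the Memory-Usage column, not the per-process figure quoted above.

```python
import subprocess
import time

# Ask nvidia-smi for the used memory of each GPU as a bare number in MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]


def parse_used_mib(output):
    """Parse one integer (MiB) per GPU from the CSV output."""
    return [int(line) for line in output.splitlines() if line.strip()]


def read_used_mib():
    """Run nvidia-smi once and return the used memory of every GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_used_mib(result.stdout)


def track_peak(read=read_used_mib, samples=50, interval=0.2):
    """Poll `read` `samples` times and return the per-GPU maximum seen."""
    peak = []
    for _ in range(samples):
        current = read()
        if not peak:
            peak = list(current)
        else:
            peak = [max(p, c) for p, c in zip(peak, current)]
        time.sleep(interval)
    return peak
```

Calling `track_peak()` in a second terminal while training runs would report the peak over roughly ten seconds with the default arguments. The `read` parameter exists so the polling logic can be exercised without a GPU.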
Alternatively, we can cap TensorFlow's GPU memory usage at 1024 MB:
import tensorflow as tf
physical_gpus = tf.config.list_physical_devices('GPU')
tf.config.set_logical_device_configuration(physical_gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
The same command shows that, with this cap in place, the process's GPU memory usage peaked at 1455 MiB:
mango@mango-ubuntu:~$ while true; do nvidia-smi; sleep 0.2; done;
Tue Aug 10 20:31:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 58C P0 N/A / N/A | 1784MiB / 2002MiB | 74% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1818 G /usr/lib/xorg/Xorg 186MiB |
| 0 N/A N/A 2002 G /usr/bin/gnome-shell 46MiB |
| 0 N/A N/A 3435 G ...AAAAAAAAA= --shared-files 72MiB |
| 0 N/A N/A 6016 G python3 13MiB |
| 0 N/A N/A 13570 C /usr/bin/python3.9 1455MiB |
+-----------------------------------------------------------------------------+
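5. Analysis of the GPU allocation strategies
The tests in section 4 lead to the following observations:
- The default strategy claims nearly all of the GPU memory and does not release it during execution, so a large training workload can easily run out of memory.
- With a hard memory limit, the measured usage (1455 MiB) was higher than the configured limit (1024 MB). This may be related to the model and to the complexity of the operations used during training, so the limit needs to be tuned to the workload. Setting it too high still triggers the error, whereas setting it too low appears only to slow execution down.
- On-demand memory growth is a middle-ground option and may well be the better one. It is unclear why TensorFlow does not make it the default. This is left as an open question, to be revisited when more is known.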
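The figures from the two test runs can be put side by side with a little arithmetic. The sketch below uses only numbers taken from the nvidia-smi output in this article. The "suggested limit" at the end is a rough heuristic that assumes the extra usage beyond the configured limit stays the same for other limits, which has not been verified here.

```python
# All values in MiB, copied from the nvidia-smi output of the capped run.
TOTAL = 2002                       # total memory of the card
OTHER_PROCESSES = [186, 46, 72, 13]  # the four non-training processes
CONFIGURED_LIMIT = 1024            # memory_limit passed to TensorFlow
MEASURED_CAPPED = 1455             # peak of the training process with the cap
MEASURED_GROWTH = 697              # peak of the training process with growth

# Memory the process used beyond the configured limit.
overhead = MEASURED_CAPPED - CONFIGURED_LIMIT          # 431

# Memory left on the card once the other processes are accounted for.
free_for_training = TOTAL - sum(OTHER_PROCESSES)       # 1685

# Heuristic only: leave room for the same overhead again.
suggested_limit = free_for_training - overhead         # 1254

print("overhead beyond limit:", overhead)
print("free for training:", free_for_training)
print("suggested limit (heuristic):", suggested_limit)
print("growth run fits:", MEASURED_GROWTH <= free_for_training)
```

On this machine both strategies leave headroom, which is consistent with the training run completing under either setting.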
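6. Extensions
Simulating a multi-GPU environment on a single GPU
When the local development machine has only one GPU but the program is meant to train on a multi-GPU workstation, TensorFlow offers a convenient feature: it can create several virtual GPUs on one physical device, which makes debugging multi-GPU programs much easier. The following code creates two virtual GPUs, each with 1 GB of memory, on top of the physical GPU GPU:0.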
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Create 2 virtual GPUs with 1GB memory each
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=1024),
             tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
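Before splitting a card into virtual GPUs, it may be worth checking that the limits add up to no more than the card can provide. The two 1024 MB limits above total 2048 MB, which is more than the 2002 MiB reported for the card used in this article. The helper below is a hypothetical budget check, not a TensorFlow API, and for simplicity it treats MB and MiB as the same unit. It makes no claim about how TensorFlow itself reacts to an oversubscribed configuration.

```python
def fits_on_device(limits_mib, total_mib, already_used_mib=0):
    """Return True if the virtual GPU limits fit in the remaining memory."""
    return sum(limits_mib) + already_used_mib <= total_mib


# Two 1024 MB virtual GPUs on the 2002 MiB card from this article.
print(fits_on_device([1024, 1024], 2002))                        # False

# Two 512 MB virtual GPUs, with 329 MiB already taken by the desktop.
print(fits_on_device([512, 512], 2002, already_used_mib=329))    # True
```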
Data parallelism across multiple GPUs
With tf.distribute.Strategy, the model is copied to every GPU and each batch of training data is split across the GPUs, which gives data parallelism.
import tensorflow as tf

tf.debugging.set_log_device_placement(True)
gpus = tf.config.list_logical_devices('GPU')
strategy = tf.distribute.MirroredStrategy(gpus)
with strategy.scope():
    inputs = tf.keras.layers.Input(shape=(1,))
    predictions = tf.keras.layers.Dense(1)(inputs)
    model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
    model.compile(loss='mse',
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))
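The idea behind data parallelism can be shown without TensorFlow. The sketch below is a conceptual illustration in plain Python, not a description of how MirroredStrategy is implemented. It fits y ≈ w * x with a mean-squared-error loss, splits one batch evenly across two hypothetical replicas, and checks that averaging the per-replica gradients reproduces the gradient of the full batch. The equality relies on the shards being of equal size.

```python
def gradient(w, xs, ys):
    """Gradient of the mean squared error of y ≈ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def split(items, num_replicas):
    """Split a batch into equally sized shards, one per replica."""
    size = len(items) // num_replicas
    return [items[i * size:(i + 1) * size] for i in range(num_replicas)]


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Gradient computed on the whole batch by a single device.
full = gradient(w, xs, ys)

# Each replica computes the gradient on its own shard of the batch.
per_replica = [gradient(w, sx, sy)
               for sx, sy in zip(split(xs, 2), split(ys, 2))]

# Combining the replica gradients gives the same update as the full batch.
averaged = sum(per_replica) / len(per_replica)

print(full, per_replica, averaged)  # -22.5 [-7.5, -37.5] -22.5
```

Because every replica ends up applying the same combined gradient, the model copies stay identical after each step.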