High memory usage during model training

AIStudio791260 posted 2019-11

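Version and environment information:
1) PaddlePaddle version: 1.5.0
2) CPU:
3) GPU: Tesla V100
Training information
1) Single machine, single GPU
2) GPU memory information
The model is a two-tower model: ResNet-50 on the left, BOW on the right, with two FC layers on top.
The current problem is that host memory usage during training is high, at roughly 84 GB.
The model uses io.PyReader during training. The memory strategy is as follows: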

places = fluid.cuda_places() 
place = fluid.CUDAPlace(0)
 
exe = fluid.Executor(place)

exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = fluid.core.get_cuda_device_count()
exec_strategy.num_iteration_per_drop_scope = 100

build_strategy = fluid.BuildStrategy()
build_strategy.enable_inplace = True

train_exe = fluid.ParallelExecutor(
            use_cuda=True,
            main_program=fluid.default_main_program(),
            loss_name=loss.name,
            build_strategy=build_strategy,
            exec_strategy=exec_strategy)

train_reader = fluid.io.PyReader(
            feed_list=feed_list,
            capacity=5,
            use_double_buffer=True,
            iterable=True)
train_reader.decorate_batch_generator(train_batch_gen, places=places)
# train_batch_gen reads data with multiple processes and uses linecache

export CUDA_VISIBLE_DEVICES=4
export FLAGS_sync_nccl_allreduce=1
export FLAGS_fraction_of_gpu_memory_to_use=0
export FLAGS_eager_delete_tensor_gb=0.0
export FLAGS_fast_eager_deletion_mode=1
python XXX.py

All comments (6)

AIStudio791260

#2 replied 2019-11

I also process the image data as shown below during data loading, and I suspect this processing may be what causes the memory leak.

def parse_image(self, image):
        image = self.image_dir + image
        img = Image.open(image)
        img = img.convert("RGB")
        img = img.resize((224, 224), Image.ANTIALIAS)
        img = np.array(img)
        img = img.astype("float32")
        img = np.transpose(img, (2, 0, 1))
        img = img * 1.0 / 225  # NOTE: 255 was probably intended (scales pixels to [0, 1])
        return img

AIStudio791260

#3 replied 2019-11

On closer observation, memory usage keeps rising throughout training, so I suspect a memory leak.

One possible cause is the use of PyReader:

    feed_list = [
        instance.input_src_ids,
        instance.input_txt_ids,
        instance.input_pos_ids,
        instance.input_mask,
        instance.input_image,
        instance.input_hard_label]
    train_batch_gen = data_reader.multiprocessing_wrapper(
            file_names=train_file_name,
            data_sizes=train_data_size,           
            num_workers=5,
            epochs=10)
    test_batch_gen = data_reader.batch_wrapper(
            file_name=test_file_name,
            data_size=test_data_size,          
            batch_size=128,            
            shuffle=True)
    train_reader = fluid.io.PyReader(
            feed_list=feed_list,
            capacity=5,
            use_double_buffer=True,
            iterable=True)
    train_reader.decorate_batch_generator(train_batch_gen, places=places)  
    test_reader = fluid.io.PyReader(
            feed_list=feed_list,
            capacity=5,
            use_double_buffer=True,
            iterable=True)
    test_reader.decorate_batch_generator(test_batch_gen, places=places)
    
    cnt = 0 
    for train_data in train_reader():
        _loss = train_exe.run(
                feed=train_data,
                fetch_list=[loss.name])
        print("{}\t{}".format(cnt, _loss[0]))
        cnt += 1
        if cnt % 200 == 0:
            test_cnt = 0
            auces = []
            losses = []
            for test_data in test_reader():
                _test_loss, _pred, _label = test_exe.run(
                        feed=test_data,
                        fetch_list=[loss.name, pred.name, input_label.name],
                        return_numpy=True)               
                test_cnt += 1
                if test_cnt >= 5:               
                    break

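As shown above, once the test_reader code is removed, memory usage no longer rises over time.
Could a memory leak occur when test_reader is invoked repeatedly, as it is above?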

AIStudio791421

#4 replied 2019-11

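Does the leak still appear if you do not use PyReader?
Try not breaking out of the test_reader loop and let all test batches run to completion. Does the leak still appear? (Breaking early can cause problems in version 1.5.)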
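
One way to try the second suggestion without running the whole test set is to cap the generator itself, so that the reader loop ends normally instead of being abandoned by a break. The following is a minimal sketch under assumptions: limit_batches and dummy_batch_gen are hypothetical names, test_batch_gen is assumed to be a callable that returns an iterator of batches, and this has not been verified against PaddlePaddle 1.5.

```python
import itertools


def limit_batches(batch_gen, max_batches):
    """Wrap a batch-generator callable so it yields at most max_batches batches.

    Hypothetical helper, not part of PaddlePaddle.
    """
    def limited():
        # islice stops pulling from the underlying generator after
        # max_batches items, so the consumer's for-loop ends on its own.
        return itertools.islice(batch_gen(), max_batches)
    return limited


def dummy_batch_gen():
    # Stand-in for test_batch_gen: yields 100 fake batches.
    for i in range(100):
        yield [i]


limited_gen = limit_batches(dummy_batch_gen, 5)
batches = list(limited_gen())
print(len(batches))

# Assumed usage in the training script (not executed here):
# test_reader.decorate_batch_generator(
#     limit_batches(test_batch_gen, 5), places=places)
# for test_data in test_reader():   # no break needed
#     ...
```

With the cap applied at the generator, the `if test_cnt >= 5: break` check in the test loop would no longer be needed.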

AIStudio791260

#5 replied 2019-11

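Without PyReader there is no leak. The problem appeared only after I started using PyReader.
I tested the PIL image API on its own and memory usage stayed stable, so PIL is probably not the cause.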

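I have not tested the second suggestion yet, but if I use only train_reader and leave out the test_reader from the code above, there is no memory leak either. What exactly happens in version 1.5 when the PyReader loop is exited early with break?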

AIStudio791421

#6 replied 2019-12

@JingChunzhen

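In version 1.5, breaking early can leave the previous round's asynchronous reader thread still reading data. In effect, if you run the test N times, N threads are left behind (assuming none of them has finished), which may cause the memory leak you describe. This was fixed in 1.6.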
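
The mechanism described here can be illustrated with a simplified pure-Python model. This is a sketch of the general pattern only, not PaddlePaddle's actual implementation: start_reader, batch_gen and run_test_round are hypothetical names, and a bounded queue.Queue stands in for the reader's internal buffer.

```python
import queue
import threading


def start_reader(batch_gen, capacity=5):
    """Start a background thread that pushes batches into a bounded queue."""
    q = queue.Queue(maxsize=capacity)

    def worker():
        for batch in batch_gen():
            q.put(batch)  # blocks once the queue is full and nobody reads
        q.put(None)       # end-of-data marker

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return q, t


def batch_gen():
    # Stand-in for a test set that is much larger than what gets consumed.
    for i in range(1000):
        yield i


def run_test_round(max_batches):
    """Consume at most max_batches batches, then exit early like the test loop."""
    q, t = start_reader(batch_gen)
    seen = 0
    while True:
        batch = q.get()
        if batch is None:
            break
        seen += 1
        if seen >= max_batches:
            break  # early exit: the worker is left blocked on q.put
    return t


# Three evaluation rounds, each abandoned after 5 batches.
threads = [run_test_round(5) for _ in range(3)]
alive = sum(t.is_alive() for t in threads)
print(alive)  # 3: one leftover thread per round
```

In this model each leftover thread also keeps its queue and buffered batches alive, so the footprint grows with every evaluation round, which is consistent with the rising memory usage reported in the thread.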

AIStudio791260

#7 replied 2019-12

Thanks for the explanation!