Notes on the Code Reproduction Process¶
OpenP5 (first attempt, interrupted)¶
Paper:
OpenP5: Benchmarking Foundation Models for Recommendation | Papers With Code
Code repository:
Environment (environment.txt):
python==3.9.7
transformers==4.26.0
torch==1.8.1+cu111
sklearn==1.1.2
torchvision==0.9.1+cu111
tqdm==4.64.1
time
collections
argparse
os
sys
numpy==1.23.1
Since time, collections, os, and sys appear to be part of the standard library and cannot be installed, while torch and torchvision need to be the CUDA builds, I modified the file accordingly. Also, sklearn was renamed on PyPI to scikit-learn (something I had discovered during an earlier install), so I changed that entry as well. This way the corresponding packages can be installed by running the command directly, as sketched below.
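A sketch of the adjusted install, assuming environment.txt is trimmed to just the pip-installable entries (the PyTorch wheel-index URL is the standard one for +cu111 builds):

```bash
# stdlib entries (time, collections, argparse, os, sys) removed; sklearn renamed
pip install transformers==4.26.0 scikit-learn==1.1.2 tqdm==4.64.1 numpy==1.23.1
# CUDA builds of torch/torchvision come from PyTorch's wheel index
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 \
    -f https://download.pytorch.org/whl/torch_stable.html
```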
After installing these, I opened the cloned repository and found that train.py still reported some errors. Moreover, README.md says:
## Usage
Download the data from Google Drive link, and put them into ./data folder. The training command can be found in ./command folder. Run the command such as
However, there is no ./command folder. I then discovered that the files downloaded from the main branch contain no command folder, while the old_version branch has one. The command in ./command/ML1M_random.sh runs ./src/main.py, which only exists in the old_version branch, and nearly all of the installed packages are imported in main.py, so I decided to use old_version.
Since I need to execute a .sh file, which apparently cannot be run from Windows cmd but only from a Git Bash terminal, and I also need my conda environment, I found that conda activate openp5 does not work in bash the way it does in cmd. After searching, I found an approach that works: source the Scripts/activate script under the Anaconda3 installation directory. That starts Anaconda's base environment, after which conda activate openp5 switches to the openp5 virtual environment, as sketched below.
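A minimal sketch of the Git Bash steps (the Anaconda3 path below matches the E:\Programs\Anaconda3 seen in the later tracebacks; substitute your own install location):

```bash
# Make conda usable inside Git Bash, then switch to the project env
source /e/Programs/Anaconda3/Scripts/activate   # enters the base environment
conda activate openp5
```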
Solving the .pyc file import problem¶
The old_version main.py has one error: it imports the project's own package, but the ./src/model folder in PyCharm does not seem to contain any P5 module. After a lot of fiddling, I noticed by chance that in the original repository the model folder contains a __pycache__ folder, and inside it a P5.cpython-39.pyc file, which I believe is the package main.py is trying to import. So I started looking up how to import this package in PyCharm (I recently hit a similar situation while building the Orbbec SDK: a .pyc file could be used/imported from the terminal but raised errors in PyCharm), and found the answer in
python - Cannot see pyc files in PyCharm - Stack Overflow:
in PyCharm, under Settings - Editor - File Types - Ignored Files and Folders, remove *.pyc and __pycache__.
But testing showed that even with the .pyc files visible, the error persisted.
I remembered that in the Orbbec SDK for Python manual, when testing the samples from a terminal, a PYTHONPATH environment variable has to be set:
# set PYTHONPATH environment variable to include the lib directory in the install directory
export PYTHONPATH=$PYTHONPATH:$(pwd)/install/lib/
So I looked up how to set the PYTHONPATH environment variable in PyCharm and found the method via
python - PyCharm and PYTHONPATH - Stack Overflow
Manage interpreter paths | PyCharm Documentation (jetbrains.com):
Settings - Project - Python Interpreter - click the down arrow to the right of the Python Interpreter box - Show All - among the icons at the top left, find Show Interpreter Paths - add the required path.
Given how the import is written in main.py (it imports through the model package), the path to add should be the parent directory of model, i.e., the project root. After adding it, the script runs, even though the red warnings remain.
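One note for the record: CPython normally will not import a module from __pycache__ when the matching .py source is missing (PEP 3147); a sourceless .pyc must sit in the package directory under the plain module name. Had the interpreter-path route failed, a hypothetical fallback would have been:

```bash
# Hypothetical: move the compiled module into the sourceless-.pyc layout
cp src/model/__pycache__/P5.cpython-39.pyc src/model/P5.pyc
```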
Error¶
error info
$ sh ML1M_random.sh
{'seed': 2023, 'model_dir': '../model', 'checkpoint_dir': '../checkpoint', 'model_name': 'model.pt', 'log_dir': '../log', 'distributed': 1, 'gpu': '0,1', 'master_addr': 'localhost', 'master_port': '1991', 'logging_level': 20, 'data_path': '../data', 'item_indexing': 'random', 'tasks': 'sequential,straightforward', 'datasets': 'ML1M', 'prompt_file': '../prompt.txt', 'sequential_order': 'original', 'collaborative_token_size': 200, 'collaborative_cluster': 20, 'collaborative_last_token': 'sequential', 'collaborative_float32': 0, 'max_his': 20, 'his_prefix': 1, 'his_sep': ' , ', 'skip_empty_his': 1, 'valid_prompt': 'seen:0', 'valid_prompt_sample': 1, 'valid_sample_num': '3,3', 'test_prompt': 'seen:0', 'sample_prompt': 1, 'sample_num': '3,3', 'batch_size': 128, 'eval_batch_size': 20, 'dist_sampler': 0, 'optim': 'AdamW', 'epochs': 10, 'lr': 0.001, 'clip': 1, 'logging_step': 100, 'warmup_prop': 0.05, 'gradient_accumulation_steps': 1, 'weight_decay': 0.01, 'adam_eps': 1e-06, 'dropout': 0.1, 'alpha': 2, 'train': 1, 'backbone': 't5-small', 'metrics': 'hit@5,hit@10,ndcg@5,ndcg@10', 'load': 0, 'random_initialize': 1, 'test_epoch': 0, 'valid_select': 0, 'test_before_train': 0, 'test_filtered': 0, 'test_filtered_batch': 1, 'log_name': '1_1_1_1_20_1991_ML1M_sequential,straightforward_t5-small_random_0.001_10_128_3,3_prompt', 'model_path': '../model\\ML1M\\1_1_1_1_20_1991_ML1M_sequential,straightforward_t5-small_random_0.001_10_128_3,3_prompt.pt', 'rank': 0}
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /t5-small/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000018265B83BB0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 703c57e1-700e-4497-8fdf-3cf500baea63)')' thrown while requesting HEAD https://huggingface.co/t5-small/resolve/main/tokenizer_config.json
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /t5-small/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000018265B83BB0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 703c57e1-700e-4497-8fdf-3cf500baea63)')' thrown while requesting HEAD https://huggingface.co/t5-small/resolve/main/tokenizer_config.json
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /t5-small/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001821CE4F7C0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: fece3cd4-12a5-4ab5-b3a2-7fea10c749bf)')' thrown while requesting HEAD https://huggingface.co/t5-small/resolve/main/config.json
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /t5-small/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001821CE4F7C0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: fece3cd4-12a5-4ab5-b3a2-7fea10c749bf)')' thrown while requesting HEAD https://huggingface.co/t5-small/resolve/main/config.json
Traceback (most recent call last):
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connection.py", line 203, in _new_conn
sock = connection.create_connection(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\util\connection.py", line 85, in create_connection
raise err
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\util\connection.py", line 73, in create_connection
sock.connect(sa)
socket.timeout: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connectionpool.py", line 790, in urlopen
response = self._make_request(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connectionpool.py", line 491, in _make_request
raise new_e
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connectionpool.py", line 467, in _make_request
self._validate_conn(conn)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connectionpool.py", line 1092, in _validate_conn
conn.connect()
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connection.py", line 611, in connect
self.sock = sock = self._new_conn()
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connection.py", line 212, in _new_conn
raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x000001821CE4F7C0>, 'Connection to huggingface.co timed out. (connect timeout=10)')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\requests\adapters.py", line 486, in send
resp = conn.urlopen(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connectionpool.py", line 844, in urlopen
retries = retries.increment(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\util\retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /t5-small/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001821CE4F7C0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\file_download.py", line 1232, in hf_hub_download
metadata = get_hf_file_metadata(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\utils\_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\file_download.py", line 1599, in get_hf_file_metadata
r = _request_wrapper(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\file_download.py", line 417, in _request_wrapper
response = _request_wrapper(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\file_download.py", line 452, in _request_wrapper
return http_backoff(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\utils\_http.py", line 274, in http_backoff
raise err
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\utils\_http.py", line 258, in http_backoff
response = session.request(method=method, url=url, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\requests\sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\requests\sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\utils\_http.py", line 63, in send
return super().send(request, *args, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\requests\adapters.py", line 507, in send
raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /t5-small/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001821CE4F7C0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: fece3cd4-12a5-4ab5-b3a2-7fea10c749bf)')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\transformers\utils\hub.py", line 409, in cached_file
resolved_file = hf_hub_download(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\utils\_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\file_download.py", line 1349, in hf_hub_download
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\Github\code_reproduction\OpenP5-old_version\src\main.py", line 233, in <module>
single_main()
File "E:\Github\code_reproduction\OpenP5-old_version\src\main.py", line 92, in single_main
tokenizer = AutoTokenizer.from_pretrained(args.backbone)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 613, in from_pretrained
config = AutoConfig.from_pretrained(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\transformers\models\auto\configuration_auto.py", line 852, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\transformers\configuration_utils.py", line 565, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\transformers\configuration_utils.py", line 620, in _get_config_dict
resolved_config_file = cached_file(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\transformers\utils\hub.py", line 443, in cached_file
raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like t5-small is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
GenRec (first attempt, interrupted)¶
Since I could not yet resolve the OSError: We couldn't connect to 'https://huggingface.co' to load this file ... error in the previous codebase, I moved on to this paper's code.
Paper:
GenRec: Large Language Model for Generative Recommendation | Papers With Code
Code repository:
rutgerswiselab/GenRec: Large Language Model for Generative Recommendation (github.com)
After setting up the environment, the first run failed with a missing dependency, so I installed scipy and ran again, which gave:
ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or pip install bitsandbytes`
After trying pip install accelerate and pip install -i https://test.pypi.org/simple/ bitsandbytes, both reported as already installed, so I searched online and found two recent answers that both said downgrading transformers to 4.30 fixes it (the version that had been auto-installed for me was 4.35.0.dev0). (Why is transformers such a hassle? The previous paper also had me wrestling with it for ages 😡🤬) After the change (a sketch follows), it could run.
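The downgrade is a single pip command; pinning exactly 4.30.0 is my assumption, since the answers only said 4.30:

```bash
pip install transformers==4.30.0
```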
Finally, this error appeared:
Traceback (most recent call last):
File "E:\Github\code_reproduction\GenRec\rec.py", line 289, in <module>
fire.Fire(train)
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\fire\core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\fire\core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "E:\Github\code_reproduction\GenRec\rec.py", line 104, in train
model = LlamaForCausalLM.from_pretrained(
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\transformers\modeling_utils.py", line 2819, in from_pretrained
raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
`device_map` to `from_pretrained`. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
(Oct 6)
I suspected the cause might be that this environment had no CUDA support installed, so I planned to switch to the earlier OpenP5 environment and started installing the required packages there. Since similar network problems came up (several packages in requirements.txt are meant to be installed from GitHub), I decided to simply pip install peft (transformers was already installed in the openp5 environment). During installation, it uninstalled the environment's existing torch-1.8.1+cu111:
Installing collected packages: mpmath, sympy, psutil, networkx, MarkupSafe, jinja2, torch
Attempting uninstall: torch
Found existing installation: torch 1.8.1+cu111
Uninstalling torch-1.8.1+cu111:
Successfully uninstalled torch-1.8.1+cu111
So after it finished I reinstalled it (luckily the earlier .whl file was still around). But it turned out that peft and accelerate require a newer torch:
peft 0.6.0.dev0 requires torch>=1.13.0, but you have torch 1.8.1+cu111 which is incompatible.
accelerate 0.23.0 requires torch>=1.10.0, but you have torch 1.8.1+cu111 which is incompatible.
So I planned to install lower versions of peft and accelerate, but even peft 0.4.x requires torch >= 1.13. I therefore decided to install a newer torch directly into the original genrec environment. However, the lowest CUDA that torch 1.13 supports is 11.6, so I installed CUDA 11.6 and then torch-1.13.0+cu116 (a sketch of the install follows).
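A sketch of the upgrade commands (the wheel-index URL is PyTorch's cu116 index; the matching torchvision pin is my assumption):

```bash
pip install torch==1.13.0+cu116 torchvision==0.14.0+cu116 \
    --extra-index-url https://download.pytorch.org/whl/cu116
```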
Running it after that produced a different message than before:
RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:
python -m bitsandbytes
Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
Searching online turned up a relevant discussion; Keith-Hon's answer there looked feasible, and maximus-sallam also confirmed that his method works:
Quote
I have fixed it by including the .dll and fixed the file path. It now works on windows 10.
https://github.com/Keith-Hon/bitsandbytes-windows.git
Install the bitsandbytes library by
pip install bitsandbytes-windows.
Be noted that it may not work directly with transformers library as it references the bitsandbytes package by using 'bitsandbytes' name. <= to avoid this issue, you could directly install from the git repo
pip install git+https://github.com/Keith-Hon/bitsandbytes-windows.git
(the bitsandbytes-windows GitHub repository)
So I tried pip install git+https://github.com/Keith-Hon/bitsandbytes-windows.git and found that it works: the earlier error disappeared. But right after that, I ran into the same problem/error as with the previous paper (unable to connect to huggingface.co):
requests.exceptions.ProxyError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /decapoda-research/llama-7b-hf/resolve/main/config.json (Caused by ProxyError('Unable to connect to proxy', SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1129)'))))"), '(Request ID: 3eadaf0f-dc33-4690-98b6-d2e5dc10c3ba)')
Solving the huggingface.co connection problem and successfully using offline mode¶
While searching for related information, I came across an article suggesting that the model can be downloaded locally, which referred to 如何从huggingface官网下载模型-CSDN博客 for how to download it. Then it suddenly occurred to me that the final error message from the previous paper's code had mentioned running the library in offline mode, so I went back to the transformers documentation to work out how offline mode is actually used (I had read it before without understanding how). Since neither of the two CSDN articles says explicitly which folder the downloaded files should go into, I eventually concluded that the storage path can be customized (at first I assumed the files had to go under some prescribed path), and started looking into how to load from a custom path.
I first noticed this in the official documentation:
from transformers import T5Model
model = T5Model.from_pretrained("./path/to/local/directory", local_files_only=True)
I took this to be where the model gets loaded, so I searched rec.py for the corresponding code and found the model-loading call at line 104:
model = LlamaForCausalLM.from_pretrained(
base_model,
load_in_8bit=True,
torch_dtype=torch.float16,
device_map=device_map,
)
Then I Ctrl+clicked base_model to see what exactly it is, which jumps to line 28:
def train(
# model/data params
base_model: str = "", # the only required argument
data_path: str = "yahma/alpaca-cleaned",
output_dir: str = "/common/users/jj635/llama/mycheckpoint/",
# training hyperparams
batch_size: int = 128,#used to be 128
micro_batch_size: int = 4,
num_epochs: int = 3,
...
This brought to mind the command entered on the command line (it also felt similar to the correspondence between CLI flags and code parameters I had seen when using yolov7):
python rec.py \
--base_model 'decapoda-research/llama-7b-hf' \
--data_path './moives' \
--output_dir './checkpoint'
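The correspondence works because rec.py hands train to python-fire (the tracebacks above show fire.Fire(train)): fire exposes each keyword parameter of train() as a command-line flag. A minimal illustration (demo.py is hypothetical):

```python
import fire

def train(base_model: str = "", data_path: str = "", output_dir: str = ""):
    # Each keyword parameter becomes a CLI flag under fire.
    print(base_model, data_path, output_dir)

if __name__ == "__main__":
    # `python demo.py --base_model 'decapoda-research/llama-7b-hf'`
    # ends up calling train(base_model="decapoda-research/llama-7b-hf").
    fire.Fire(train)
```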
Then I compared this against the offline-mode command from the official documentation:
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
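(An aside: the VAR=value prefix is bash syntax; on Windows cmd, the rough equivalent, to my understanding, would be to set the variables first:)

```bat
set HF_DATASETS_OFFLINE=1
set TRANSFORMERS_OFFLINE=1
python rec.py ...
```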
This pointed to a possible solution: first download the model locally, then load it from a local path. So, following 如何从huggingface官网下载模型-CSDN博客, I looked up the corresponding model repository on Models - Hugging Face (https://huggingface.co/models):
- based on the earlier error messages, or
- based on the model name in the command being used,

the corresponding page can be located:
decapoda-research/llama-7b-hf · Hugging Face
Then, clicking Files and versions leads to the download page:
decapoda-research/llama-7b-hf at main (huggingface.co)
Finally, I made the corresponding changes in rec.py and the previous error disappeared, but the same error message as at the very beginning showed up again 🙃🙄 (all that wandering just to end up back where I started):
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
I gave the method from the page linked in that error (offload between CPU and GPU) a rough try, modifying the code to:
model = LlamaForCausalLM.from_pretrained(
# base_model,
"../llama-7b-hf",
load_in_8bit=True,
torch_dtype=torch.float16,
# device_map=device_map,
local_files_only=True,
device_map={
"transformer.word_embeddings": 0,
"transformer.word_embeddings_layernorm": 0,
"lm_head": "cpu",
"transformer.h": 0,
"transformer.ln_f": 0,
},
quantization_config=BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True),
)
But in the end it showed:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
binary_path: E:\Programs\Anaconda3\envs\genrec\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll
CUDA SETUP: Loading binary E:\Programs\Anaconda3\envs\genrec\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll...
Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: ./moives
output_dir: ./checkpoint
batch_size: 128
micro_batch_size: 4
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 0
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: None
Loading checkpoint shards: 0%| | 0/33 [00:00<?, ?it/s]
Traceback (most recent call last):
File "E:\Github\code_reproduction\GenRec\rec.py", line 302, in <module>
fire.Fire(train)
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\fire\core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\fire\core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "E:\Github\code_reproduction\GenRec\rec.py", line 106, in train
model = LlamaForCausalLM.from_pretrained(
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\transformers\modeling_utils.py", line 2881, in from_pretrained
) = cls._load_pretrained_model(
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\transformers\modeling_utils.py", line 3228, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\transformers\modeling_utils.py", line 710, in _load_state_dict_into_meta_model
raise ValueError(f"{param_name} doesn't have any device set.")
ValueError: model.layers.0.self_attn.q_proj.weight doesn't have any device set.
I found a related GitHub issue, and going by what ybelkada's answer there mentions, I guess this error is caused by the CPU and GPU memory not being large enough.
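A further thought of my own (not from that issue): the device_map keys used above (transformer.word_embeddings and friends) come from the BLOOM example in the Hugging Face quantization docs, whereas LLaMA's top-level modules are named differently, so none of the keys match and every parameter is left without a device, which is exactly what the ValueError complains about. A hedged sketch of a map using LLaMA's module names, still keeping lm_head on the CPU:

```python
# Assumption: module names follow transformers' LlamaForCausalLM layout.
device_map = {
    "model.embed_tokens": 0,  # token embeddings on GPU 0
    "model.layers": 0,        # all decoder blocks on GPU 0
    "model.norm": 0,          # final norm layer on GPU 0
    "lm_head": "cpu",         # output head offloaded to CPU in 32-bit
}
```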
OpenP5 (second attempt)¶
Since I did not know for now how to solve the problem the GenRec code ran into, and since the huggingface.co offline-usage problem was solved, I decided to try running OpenP5 again. The first step was to download the model, then modify the code accordingly.
In ./src/main.py, at line 88, I added the line args.backbone = "../../" + args.backbone:
...
args.rank = 0
device = torch.device("cuda", int(args.gpu.split(',')[0]))
# offline set
args.backbone = "../../" + args.backbone
...
I also added local_files_only=True to every .from_pretrained() call; a sketch of the change follows.
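A sketch of that change, using the tokenizer call visible in the earlier traceback as the example:

```python
# before: tokenizer = AutoTokenizer.from_pretrained(args.backbone)
tokenizer = AutoTokenizer.from_pretrained(args.backbone, local_files_only=True)
```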
Running it again, the earlier error was gone, replaced by:
Traceback (most recent call last):
File "E:\Github\code_reproduction\OpenP5-old_version\src\main.py", line 241, in <module>
single_main()
File "E:\Github\code_reproduction\OpenP5-old_version\src\main.py", line 98, in single_main
TrainSet, ValidSet = get_dataset(args)
File "E:\Github\code_reproduction\OpenP5-old_version\src\main.py", line 33, in get_dataset
TrainDataset = MultiTaskDataset(args, data, 'train')
File "E:\Github\code_reproduction\OpenP5-old_version\src\data\MultiTaskDataset.py", line 110, in __init__
dist.barrier()
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\torch\distributed\distributed_c10d.py", line 2419, in barrier
default_pg = _get_default_group()
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\torch\distributed\distributed_c10d.py", line 347, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
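The failure is at the dist.barrier() call, which assumes a torch.distributed process group has already been initialized. A minimal sketch of the usual guard (my own workaround idea, not a fix taken from the repo):

```python
import torch.distributed as dist

# Only synchronize across workers when a process group actually exists,
# so a single-process run skips the barrier instead of crashing.
if dist.is_available() and dist.is_initialized():
    dist.barrier()
```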
Created: 2023-10-04