Notes on the Code Reproduction Process¶
OpenP5 (first attempt, interrupted)¶
Paper:
OpenP5: Benchmarking Foundation Models for Recommendation | Papers With Code
Code repository:
Environment (environment.txt):
python==3.9.7
transformers==4.26.0
torch==1.8.1+cu111
sklearn==1.1.2
torchvision==0.9.1+cu111
tqdm==4.64.1
time
collections
argparse
os
sys
numpy==1.23.1
Since time, collections, os, and sys appear to be part of the standard library and cannot be installed, while torch and torchvision need to be the CUDA builds, I modified the file accordingly. Also, sklearn was renamed on PyPI to scikit-learn (something I had discovered during an earlier install), so I changed that entry as well. This way the corresponding packages can be installed by running the command directly, as sketched below.
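A sketch of the adjusted install, assuming environment.txt is trimmed to just the pip-installable entries (the PyTorch wheel-index URL is the standard one for +cu111 builds):

```bash
# stdlib entries (time, collections, argparse, os, sys) removed; sklearn renamed
pip install transformers==4.26.0 scikit-learn==1.1.2 tqdm==4.64.1 numpy==1.23.1
# CUDA builds of torch/torchvision come from PyTorch's wheel index
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 \
    -f https://download.pytorch.org/whl/torch_stable.html
```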
After installing these, I opened the cloned repository and found that train.py still reported some errors. Moreover, README.md says:
## Usage
Download the data from Google Drive link, and put them into ./data folder. The training command can be found in ./command folder. Run the command such as
However, there is no ./command folder. I then discovered that the files downloaded from the main branch contain no command folder, while the old_version branch has one. The command in ./command/ML1M_random.sh runs ./src/main.py, which only exists in the old_version branch, and nearly all of the installed packages are imported in main.py, so I decided to use old_version.
Since I need to execute a .sh file, which apparently cannot be run from Windows cmd but only from a Git Bash terminal, and I also need my conda environment, I found that conda activate openp5 does not work in bash the way it does in cmd. After searching, I found an approach that works: source the Scripts/activate script under the Anaconda3 installation directory. That starts Anaconda's base environment, after which conda activate openp5 switches to the openp5 virtual environment, as sketched below.
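A minimal sketch of the Git Bash steps (the Anaconda3 path below matches the E:\Programs\Anaconda3 seen in the later tracebacks; substitute your own install location):

```bash
# Make conda usable inside Git Bash, then switch to the project env
source /e/Programs/Anaconda3/Scripts/activate   # enters the base environment
conda activate openp5
```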
Solving the .pyc file import problem¶
The old_version main.py has one error: it imports the project's own package, but the ./src/model folder in PyCharm does not seem to contain any P5 module. After a lot of fiddling, I noticed by chance that in the original repository the model folder contains a __pycache__ folder, and inside it a P5.cpython-39.pyc file, which I believe is the package main.py is trying to import. So I started looking up how to import this package in PyCharm (I recently hit a similar situation while building the Orbbec SDK: a .pyc file could be used/imported from the terminal but raised errors in PyCharm), and found the answer in
python - Cannot see pyc files in PyCharm - Stack Overflow:
in PyCharm, under Settings - Editor - File Types - Ignored Files and Folders, remove *.pyc and __pycache__.
But testing showed that even with the .pyc files visible, the error persisted.
I remembered that in the Orbbec SDK for Python manual, when testing the samples from a terminal, a PYTHONPATH environment variable has to be set:
# set PYTHONPATH environment variable to include the lib directory in the install directory
export PYTHONPATH=$PYTHONPATH:$(pwd)/install/lib/
So I looked up how to set the PYTHONPATH environment variable in PyCharm and found the method via
python - PyCharm and PYTHONPATH - Stack Overflow
Manage interpreter paths | PyCharm Documentation (jetbrains.com):
Settings - Project - Python Interpreter - click the down arrow to the right of the Python Interpreter box - Show All - among the icons at the top left, find Show Interpreter Paths - add the required path.
Given how the import is written in main.py (it imports through the model package), the path to add should be the parent directory of model, i.e., the project root. After adding it, the script runs, even though the red warnings remain.
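One note for the record: CPython normally will not import a module from __pycache__ when the matching .py source is missing (PEP 3147); a sourceless .pyc must sit in the package directory under the plain module name. Had the interpreter-path route failed, a hypothetical fallback would have been:

```bash
# Hypothetical: move the compiled module into the sourceless-.pyc layout
cp src/model/__pycache__/P5.cpython-39.pyc src/model/P5.pyc
```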
Error¶
error info
$ sh ML1M_random.sh
{'seed': 2023, 'model_dir': '../model', 'checkpoint_dir': '../checkpoint', 'model_name': 'model.pt', 'log_dir': '../log', 'distributed': 1, 'gpu': '0,1', 'master_addr': 'localhost', 'master_port': '1991', 'logging_level': 20, 'data_path': '../data', 'item_indexing': 'random', 'tasks': 'sequential,straightforward', 'datasets': 'ML1M', 'prompt_file': '../prompt.txt', 'sequential_order': 'original', 'collaborative_token_size': 200, 'collaborative_cluster': 20, 'collaborative_last_token': 'sequential', 'collaborative_float32': 0, 'max_his': 20, 'his_prefix': 1, 'his_sep': ' , ', 'skip_empty_his': 1, 'valid_prompt': 'seen:0', 'valid_prompt_sample': 1, 'valid_sample_num': '3,3', 'test_prompt': 'seen:0', 'sample_prompt': 1, 'sample_num': '3,3', 'batch_size': 128, 'eval_batch_size': 20, 'dist_sampler': 0, 'optim': 'AdamW', 'epochs': 10, 'lr': 0.001, 'clip': 1, 'logging_step': 100, 'warmup_prop': 0.05, 'gradient_accumulation_steps': 1, 'weight_decay': 0.01, 'adam_eps': 1e-06, 'dropout': 0.1, 'alpha': 2, 'train': 1, 'backbone': 't5-small', 'metrics': 'hit@5,hit@10,ndcg@5,ndcg@10', 'load': 0, 'random_initialize': 1, 'test_epoch': 0, 'valid_select': 0, 'test_before_train': 0, 'test_filtered': 0, 'test_filtered_batch': 1, 'log_name': '1_1_1_1_20_1991_ML1M_sequential,straightforward_t5-small_random_0.001_10_128_3,3_prompt', 'model_path': '../model\\ML1M\\1_1_1_1_20_1991_ML1M_sequential,straightforward_t5-small_random_0.001_10_128_3,3_prompt.pt', 'rank': 0}
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /t5-small/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000018265B83BB0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 703c57e1-700e-4497-8fdf-3cf500baea63)')' thrown while requesting HEAD https://huggingface.co/t5-small/resolve/main/tokenizer_config.json
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /t5-small/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000018265B83BB0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 703c57e1-700e-4497-8fdf-3cf500baea63)')' thrown while requesting HEAD https://huggingface.co/t5-small/resolve/main/tokenizer_config.json
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /t5-small/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001821CE4F7C0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: fece3cd4-12a5-4ab5-b3a2-7fea10c749bf)')' thrown while requesting HEAD https://huggingface.co/t5-small/resolve/main/config.json
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /t5-small/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001821CE4F7C0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: fece3cd4-12a5-4ab5-b3a2-7fea10c749bf)')' thrown while requesting HEAD https://huggingface.co/t5-small/resolve/main/config.json
Traceback (most recent call last):
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connection.py", line 203, in _new_conn
sock = connection.create_connection(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\util\connection.py", line 85, in create_connection
raise err
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\util\connection.py", line 73, in create_connection
sock.connect(sa)
socket.timeout: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connectionpool.py", line 790, in urlopen
response = self._make_request(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connectionpool.py", line 491, in _make_request
raise new_e
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connectionpool.py", line 467, in _make_request
self._validate_conn(conn)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connectionpool.py", line 1092, in _validate_conn
conn.connect()
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connection.py", line 611, in connect
self.sock = sock = self._new_conn()
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connection.py", line 212, in _new_conn
raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x000001821CE4F7C0>, 'Connection to huggingface.co timed out. (connect timeout=10)')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\requests\adapters.py", line 486, in send
resp = conn.urlopen(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\connectionpool.py", line 844, in urlopen
retries = retries.increment(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\urllib3\util\retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /t5-small/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001821CE4F7C0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\file_download.py", line 1232, in hf_hub_download
metadata = get_hf_file_metadata(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\utils\_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\file_download.py", line 1599, in get_hf_file_metadata
r = _request_wrapper(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\file_download.py", line 417, in _request_wrapper
response = _request_wrapper(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\file_download.py", line 452, in _request_wrapper
return http_backoff(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\utils\_http.py", line 274, in http_backoff
raise err
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\utils\_http.py", line 258, in http_backoff
response = session.request(method=method, url=url, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\requests\sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\requests\sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\utils\_http.py", line 63, in send
return super().send(request, *args, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\requests\adapters.py", line 507, in send
raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /t5-small/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001821CE4F7C0>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: fece3cd4-12a5-4ab5-b3a2-7fea10c749bf)')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\transformers\utils\hub.py", line 409, in cached_file
resolved_file = hf_hub_download(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\utils\_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\huggingface_hub\file_download.py", line 1349, in hf_hub_download
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\Github\code_reproduction\OpenP5-old_version\src\main.py", line 233, in <module>
single_main()
File "E:\Github\code_reproduction\OpenP5-old_version\src\main.py", line 92, in single_main
tokenizer = AutoTokenizer.from_pretrained(args.backbone)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 613, in from_pretrained
config = AutoConfig.from_pretrained(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\transformers\models\auto\configuration_auto.py", line 852, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\transformers\configuration_utils.py", line 565, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\transformers\configuration_utils.py", line 620, in _get_config_dict
resolved_config_file = cached_file(
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\transformers\utils\hub.py", line 443, in cached_file
raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like t5-small is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
GenRec (first attempt, interrupted)¶
Since I could not yet resolve the OSError: We couldn't connect to 'https://huggingface.co' to load this file ... error in the previous codebase, I moved on to this paper's code.
Paper:
GenRec: Large Language Model for Generative Recommendation | Papers With Code
Code repository:
rutgerswiselab/GenRec: Large Language Model for Generative Recommendation (github.com)
After setting up the environment, the first run failed with a missing dependency, so I installed scipy and ran again, which gave:
ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or pip install bitsandbytes`
After trying pip install accelerate and pip install -i https://test.pypi.org/simple/ bitsandbytes, both reported as already installed, so I searched online and found two recent answers that both said downgrading transformers to 4.30 fixes it (the version that had been auto-installed for me was 4.35.0.dev0). (Why is transformers such a hassle? The previous paper also had me wrestling with it for ages 😡🤬) After the change (a sketch follows), it could run.
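The downgrade is a single pip command; pinning exactly 4.30.0 is my assumption, since the answers only said 4.30:

```bash
pip install transformers==4.30.0
```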
Finally, this error appeared:
Traceback (most recent call last):
File "E:\Github\code_reproduction\GenRec\rec.py", line 289, in <module>
fire.Fire(train)
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\fire\core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\fire\core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "E:\Github\code_reproduction\GenRec\rec.py", line 104, in train
model = LlamaForCausalLM.from_pretrained(
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\transformers\modeling_utils.py", line 2819, in from_pretrained
raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
`device_map` to `from_pretrained`. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
(Oct 6)
I suspected the cause might be that this environment had no CUDA support installed, so I planned to switch to the earlier OpenP5 environment and started installing the required packages there. Since similar network problems came up (several packages in requirements.txt are meant to be installed from GitHub), I decided to simply pip install peft (transformers was already installed in the openp5 environment). During installation, it uninstalled the environment's existing torch-1.8.1+cu111:
Installing collected packages: mpmath, sympy, psutil, networkx, MarkupSafe, jinja2, torch
Attempting uninstall: torch
Found existing installation: torch 1.8.1+cu111
Uninstalling torch-1.8.1+cu111:
Successfully uninstalled torch-1.8.1+cu111
So after it finished I reinstalled it (luckily the earlier .whl file was still around). But it turned out that peft and accelerate require a newer torch:
peft 0.6.0.dev0 requires torch>=1.13.0, but you have torch 1.8.1+cu111 which is incompatible.
accelerate 0.23.0 requires torch>=1.10.0, but you have torch 1.8.1+cu111 which is incompatible.
So I planned to install lower versions of peft and accelerate, but even peft 0.4.x requires torch >= 1.13. I therefore decided to install a newer torch directly into the original genrec environment. However, the lowest CUDA that torch 1.13 supports is 11.6, so I installed CUDA 11.6 and then torch-1.13.0+cu116 (a sketch of the install follows).
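A sketch of the upgrade commands (the wheel-index URL is PyTorch's cu116 index; the matching torchvision pin is my assumption):

```bash
pip install torch==1.13.0+cu116 torchvision==0.14.0+cu116 \
    --extra-index-url https://download.pytorch.org/whl/cu116
```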
Running it after that produced a different message than before:
RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:
python -m bitsandbytes
Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
Searching online turned up a relevant discussion; Keith-Hon's answer there looked feasible, and maximus-sallam also confirmed that his method works:
Quote
I have fixed it by including the .dll and fixed the file path. It now works on windows 10.
https://github.com/Keith-Hon/bitsandbytes-windows.git
Install the bitsandbytes library by
pip install bitsandbytes-windows.
Be noted that it may not work directly with transformers library as it references the bitsandbytes package by using 'bitsandbytes' name. <= to avoid this issue, you could directly install from the git repo
pip install git+https://github.com/Keith-Hon/bitsandbytes-windows.git
(the bitsandbytes-windows GitHub repository)
So I tried pip install git+https://github.com/Keith-Hon/bitsandbytes-windows.git and found that it works: the earlier error disappeared. But right after that, I ran into the same problem/error as with the previous paper (unable to connect to huggingface.co):
requests.exceptions.ProxyError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /decapoda-research/llama-7b-hf/resolve/main/config.json (Caused by ProxyError('Unable to connect to proxy', SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1129)'))))"), '(Request ID: 3eadaf0f-dc33-4690-98b6-d2e5dc10c3ba)')
Solving the huggingface.co connection problem and successfully using offline mode¶
While searching for related information, I came across an article suggesting that the model can be downloaded locally, which referred to 如何从huggingface官网下载模型-CSDN博客 for how to download it. Then it suddenly occurred to me that the final error message from the previous paper's code had mentioned running the library in offline mode, so I went back to the transformers documentation to work out how offline mode is actually used (I had read it before without understanding how). Since neither of the two CSDN articles says explicitly which folder the downloaded files should go into, I eventually concluded that the storage path can be customized (at first I assumed the files had to go under some prescribed path), and started looking into how to load from a custom path.
I first noticed this in the official documentation:
from transformers import T5Model
model = T5Model.from_pretrained("./path/to/local/directory", local_files_only=True)
I took this to be where the model gets loaded, so I searched rec.py for the corresponding code and found the model-loading call at line 104:
model = LlamaForCausalLM.from_pretrained(
base_model,
load_in_8bit=True,
torch_dtype=torch.float16,
device_map=device_map,
)
Then I Ctrl+clicked base_model to see what exactly it is, which jumps to line 28:
def train(
# model/data params
base_model: str = "", # the only required argument
data_path: str = "yahma/alpaca-cleaned",
output_dir: str = "/common/users/jj635/llama/mycheckpoint/",
# training hyperparams
batch_size: int = 128,#used to be 128
micro_batch_size: int = 4,
num_epochs: int = 3,
...
This brought to mind the command entered on the command line (it also felt similar to the correspondence between CLI flags and code parameters I had seen when using yolov7):
python rec.py \
--base_model 'decapoda-research/llama-7b-hf' \
--data_path './moives' \
--output_dir './checkpoint'
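The correspondence works because rec.py hands train to python-fire (the tracebacks above show fire.Fire(train)): fire exposes each keyword parameter of train() as a command-line flag. A minimal illustration (demo.py is hypothetical):

```python
import fire

def train(base_model: str = "", data_path: str = "", output_dir: str = ""):
    # Each keyword parameter becomes a CLI flag under fire.
    print(base_model, data_path, output_dir)

if __name__ == "__main__":
    # `python demo.py --base_model 'decapoda-research/llama-7b-hf'`
    # ends up calling train(base_model="decapoda-research/llama-7b-hf").
    fire.Fire(train)
```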
Then I compared this against the offline-mode command from the official documentation:
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
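(An aside: the VAR=value prefix is bash syntax; on Windows cmd, the rough equivalent, to my understanding, would be to set the variables first:)

```bat
set HF_DATASETS_OFFLINE=1
set TRANSFORMERS_OFFLINE=1
python rec.py ...
```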
This pointed to a possible solution: first download the model locally, then load it from a local path. So, following 如何从huggingface官网下载模型-CSDN博客, I looked up the corresponding model repository on Models - Hugging Face (https://huggingface.co/models):
- based on the earlier error messages, or
- based on the model name in the command being used,

the corresponding page can be located:
decapoda-research/llama-7b-hf · Hugging Face
Then, clicking Files and versions leads to the download page:
decapoda-research/llama-7b-hf at main (huggingface.co)
Finally, I made the corresponding changes in rec.py and the previous error disappeared, but the same error message as at the very beginning showed up again 🙃🙄 (all that wandering just to end up back where I started):
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
I gave the method from the page linked in that error (offload between CPU and GPU) a rough try, modifying the code to:
model = LlamaForCausalLM.from_pretrained(
# base_model,
"../llama-7b-hf",
load_in_8bit=True,
torch_dtype=torch.float16,
# device_map=device_map,
local_files_only=True,
device_map={
"transformer.word_embeddings": 0,
"transformer.word_embeddings_layernorm": 0,
"lm_head": "cpu",
"transformer.h": 0,
"transformer.ln_f": 0,
},
quantization_config=BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True),
)
But in the end it showed:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
binary_path: E:\Programs\Anaconda3\envs\genrec\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll
CUDA SETUP: Loading binary E:\Programs\Anaconda3\envs\genrec\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll...
Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: ./moives
output_dir: ./checkpoint
batch_size: 128
micro_batch_size: 4
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 0
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: None
Loading checkpoint shards: 0%| | 0/33 [00:00<?, ?it/s]
Traceback (most recent call last):
File "E:\Github\code_reproduction\GenRec\rec.py", line 302, in <module>
fire.Fire(train)
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\fire\core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\fire\core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "E:\Github\code_reproduction\GenRec\rec.py", line 106, in train
model = LlamaForCausalLM.from_pretrained(
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\transformers\modeling_utils.py", line 2881, in from_pretrained
) = cls._load_pretrained_model(
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\transformers\modeling_utils.py", line 3228, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "E:\Programs\Anaconda3\envs\genrec\lib\site-packages\transformers\modeling_utils.py", line 710, in _load_state_dict_into_meta_model
raise ValueError(f"{param_name} doesn't have any device set.")
ValueError: model.layers.0.self_attn.q_proj.weight doesn't have any device set.
I found a related GitHub issue, and going by what ybelkada's answer there mentions, I guess this error is caused by the CPU and GPU memory not being large enough.
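A further thought of my own (not from that issue): the device_map keys used above (transformer.word_embeddings and friends) come from the BLOOM example in the Hugging Face quantization docs, whereas LLaMA's top-level modules are named differently, so none of the keys match and every parameter is left without a device, which is exactly what the ValueError complains about. A hedged sketch of a map using LLaMA's module names, still keeping lm_head on the CPU:

```python
# Assumption: module names follow transformers' LlamaForCausalLM layout.
device_map = {
    "model.embed_tokens": 0,  # token embeddings on GPU 0
    "model.layers": 0,        # all decoder blocks on GPU 0
    "model.norm": 0,          # final norm layer on GPU 0
    "lm_head": "cpu",         # output head offloaded to CPU in 32-bit
}
```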
OpenP5 (second attempt)¶
Since I did not know for now how to solve the problem the GenRec code ran into, and since the huggingface.co offline-usage problem was solved, I decided to try running OpenP5 again. The first step was to download the model, then modify the code accordingly.
In ./src/main.py, at line 88, I added the line args.backbone = "../../" + args.backbone:
...
args.rank = 0
device = torch.device("cuda", int(args.gpu.split(',')[0]))
# offline set
args.backbone = "../../" + args.backbone
...
I also added local_files_only=True to every .from_pretrained() call; a sketch of the change follows.
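A sketch of that change, using the tokenizer call visible in the earlier traceback as the example:

```python
# before: tokenizer = AutoTokenizer.from_pretrained(args.backbone)
tokenizer = AutoTokenizer.from_pretrained(args.backbone, local_files_only=True)
```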
Running it again, the earlier error was gone, replaced by:
Traceback (most recent call last):
File "E:\Github\code_reproduction\OpenP5-old_version\src\main.py", line 241, in <module>
single_main()
File "E:\Github\code_reproduction\OpenP5-old_version\src\main.py", line 98, in single_main
TrainSet, ValidSet = get_dataset(args)
File "E:\Github\code_reproduction\OpenP5-old_version\src\main.py", line 33, in get_dataset
TrainDataset = MultiTaskDataset(args, data, 'train')
File "E:\Github\code_reproduction\OpenP5-old_version\src\data\MultiTaskDataset.py", line 110, in __init__
dist.barrier()
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\torch\distributed\distributed_c10d.py", line 2419, in barrier
default_pg = _get_default_group()
File "E:\Programs\Anaconda3\envs\openp5\lib\site-packages\torch\distributed\distributed_c10d.py", line 347, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
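The failure is at the dist.barrier() call, which assumes a torch.distributed process group has already been initialized. A minimal sketch of the usual guard (my own workaround idea, not a fix taken from the repo):

```python
import torch.distributed as dist

# Only synchronize across workers when a process group actually exists,
# so a single-process run skips the barrier instead of crashing.
if dist.is_available() and dist.is_initialized():
    dist.barrier()
```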
Created: 2023-10-04