fix(config): make tokenizer optional and include a troubleshooting doc (#1998)

* docs: add troubleshooting

* fix: pass HF token to setup script and skip the tokenizer download when no tokenizer is configured

* fix: improve log and disable specific tokenizer by default

* chore: align the HF_TOKEN environment variable with the default config

* fix: mypy
Javier Martinez 2024-07-17 10:06:27 +02:00 committed by GitHub
parent 15f73dbc48
commit 01b7ccd064
6 changed files with 65 additions and 12 deletions

@@ -24,6 +24,7 @@ snapshot_download(
     repo_id=settings().huggingface.embedding_hf_model_name,
     cache_dir=models_cache_path,
     local_dir=embedding_path,
+    token=settings().huggingface.access_token,
 )
 print("Embedding model downloaded!")
@@ -35,15 +36,18 @@ hf_hub_download(
     cache_dir=models_cache_path,
     local_dir=models_path,
     resume_download=resume_download,
+    token=settings().huggingface.access_token,
 )
 print("LLM model downloaded!")
 # Download Tokenizer
-print(f"Downloading tokenizer {settings().llm.tokenizer}")
-AutoTokenizer.from_pretrained(
-    pretrained_model_name_or_path=settings().llm.tokenizer,
-    cache_dir=models_cache_path,
-)
-print("Tokenizer downloaded!")
+if settings().llm.tokenizer:
+    print(f"Downloading tokenizer {settings().llm.tokenizer}")
+    AutoTokenizer.from_pretrained(
+        pretrained_model_name_or_path=settings().llm.tokenizer,
+        cache_dir=models_cache_path,
+        token=settings().huggingface.access_token,
+    )
+    print("Tokenizer downloaded!")
 print("Setup done")
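
The behavior this change introduces (download the tokenizer only when one is configured, and forward the access token when it is) can be sketched in isolation roughly as follows. This is a minimal illustration, not the project's code: the helper name `download_tokenizer_if_configured`, the `loader` parameter, and `fake_loader` are hypothetical, introduced so the logic can run without network access. In the real script the loader would presumably be `AutoTokenizer.from_pretrained`, with values read from `settings()`.

```python
from typing import Callable, Optional


def download_tokenizer_if_configured(
    tokenizer_name: Optional[str],
    access_token: Optional[str],
    cache_dir: str,
    loader: Callable[..., object],
) -> bool:
    """Call `loader` only when a tokenizer name is set; return True if it ran.

    Hypothetical helper: `loader` stands in for AutoTokenizer.from_pretrained.
    """
    # Both None and the empty string are falsy, so either one skips the download.
    if not tokenizer_name:
        print("No tokenizer configured, skipping download")
        return False
    print(f"Downloading tokenizer {tokenizer_name}")
    loader(
        pretrained_model_name_or_path=tokenizer_name,
        cache_dir=cache_dir,
        token=access_token,  # None means an anonymous request
    )
    print("Tokenizer downloaded!")
    return True


# Stand-in loader (assumption): records its keyword arguments instead of
# contacting the Hugging Face Hub.
calls: list = []


def fake_loader(**kwargs: object) -> None:
    calls.append(kwargs)


skipped = download_tokenizer_if_configured("", None, "models/cache", fake_loader)
ran = download_tokenizer_if_configured(
    "some-org/some-tokenizer", "hf_example", "models/cache", fake_loader
)
```

Passing the loader in as a parameter is only a convenience for exercising the empty and non-empty cases offline; the commit itself calls `AutoTokenizer.from_pretrained` directly inside the `if` block.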