Meta Llama 3, released on April 18, 2024, is the latest generation of open-source large language models from Meta, featuring 8 billion and 70 billion parameter versions. These models are designed for a wide range of applications, offering state-of-the-art performance on industry benchmarks and improved reasoning capabilities.
Key features include a decoder-only transformer architecture with a 128K token vocabulary for better language encoding efficiency, and Grouped Query Attention (GQA) for enhanced inference performance. The models were pre-trained on over 15 trillion tokens of publicly available data, which includes more than 5% high-quality non-English text covering over 30 languages. Instruction tuning combines supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO) to align with human preferences.
Model Specifications
| Parameter Count | Context Length | Training Tokens | Knowledge Cutoff |
|---|---|---|---|
| 8B | 8k | 15T+ | March 2023 |
| 70B | 8k | 15T+ | December 2023 |
Installation and Setup To download the model weights and tokenizer, visit the Meta Llama website and accept the license agreement. After approval, use the provided URL with the download script:
./download.sh
Install dependencies in a Conda environment with PyTorch/CUDA:
pip install -e .
For Hugging Face integration, install transformers and huggingface-hub, then download the model:
import transformers
import torch
pipeline = transformers.pipeline(
"text-generation",
model="meta-llama/Meta-Llama-3-8B-Instruct",
torch_dtype=torch.bfloat16,
device="cuda"
)
Quick Inference Example Run a basic text completion using the example script:
torchrun --nproc_per_node 1 example_text_completion.py \
--ckpt_dir Meta-Llama-3-8B/ \
--tokenizer_path Meta-Llama-3-8B/tokenizer.model \
--max_seq_len 128 --max_batch_size 4
For chat completion with the instruction-tuned model:
torchrun --nproc_per_node 1 example_chat_completion.py \
--ckpt_dir Meta-Llama-3-8B-Instruct/ \
--tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
--max_seq_len 512 --max_batch_size 6
Memory Requirements
- Inference: FP16 mode requires ~16 GB VRAM; INT4 quantization reduces this to ~8 GB.
- Training: Full training with AMP needs up to 1200 GB for the 70B model, while QLoRA with 4-bit quantization can lower this to 48 GB.
Fine-Tuning with LoRA
Use Low-Rank Adaptation (LoRA) for efficient fine-tuning on custom datasets. Example using the transformers library:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
import torch
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
lora_config = LoraConfig(
r=8,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.1,
bias="none"
)
model = get_peft_model(model, lora_config)
Deployment Options
- Web Demo with Streamlit: Create an interactive chat interface.
- FastAPI Backend: Build a REST API for model serving.
- Ollama Integration: Deploy using Ollama for local inference with Docker.
- GPT4ALL GUI: Load quantized GGUF model files for desktop chat applications.
Context Window Extension The base 8k context can be expanded using LoRA-based techniques. For example, merge adapters to achieve up to 1048k context:
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
for adapter in adapter_paths:
base_model = PeftModel.from_pretrained(base_model, adapter)
base_model = base_model.merge_and_unload()
base_model.save_pretrained("merged_model")
Available Datasets for Fine-Tuning
- firefly-train-1.1M: Chinese NLP tasks with cultural data.
- ShareGPT-Chinese-English-90k: Bilingual human-machine dialogues.
- WizardLM_evol_instruct_V2_143k: English instructions for complex tasks.
- school-math-0.25M: Mathematical problem-solving exmaples.
Responsible Use Guidelines Adhere to Meta's Responsible Use Guide, which includes content filtering recommendations and safety tools like Llama Guard 2 and Code Shield for mitigating risks.