1. Implementing Image-to-Text with VisualGLM
This section demonstrates how to build a multimodal application capable of understanding visual inputs and generating descriptive text. We will utilize the VisualGLM-6B model within the PaddlePaddle ecosystem.
Environment Setup and Dependencies
First, clone the PaddleMIX repository, install it, and add the dependencies for audio processing and image handling.
git clone https://github.com/PaddlePaddle/PaddleMIX.git
pip install -e ./PaddleMIX  # install the cloned package so paddlemix can be imported (assumes a standard editable install)
pip install soundfile librosa requests pillow
Model Initialization
Import the required libraries and configure the computing environment. Load the pre-trained model and processor, ensuring the model is set to evaluation mode.
import os
import requests
from PIL import Image
from paddlemix import VisualGLMForConditionalGeneration, VisualGLMProcessor
# Specify GPU device
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# Model configuration
model_source = "aistudio/visualglm-6b"
# Initialize model and processor
vision_model = VisualGLMForConditionalGeneration.from_pretrained(
    model_source, from_aistudio=True, dtype="float32"
)
vision_model.eval()
input_processor = VisualGLMProcessor.from_pretrained(model_source, from_aistudio=True)
Image Inference and Description
Load an image from a remote URL and configure the generation parameters. The model will process the image and a text query to produce a description.
# Fetch image
img_url = "https://i02piccdn.sogoucdn.com/5dd40dedd7107cc5"
raw_image = Image.open(requests.get(img_url, stream=True).raw)
# Generation parameters
generation_config = {
    "max_length": 1024,
    "min_length": 10,
    "top_p": 1.0,
    "temperature": 0.8,
    "repetition_penalty": 1.2,
    "decode_strategy": "sampling",
}
# Initial description task
user_prompt = "Describe this scene in a poem."
interaction_history = []
# Process inputs and generate
model_inputs = input_processor(raw_image, user_prompt)
output_ids, _ = vision_model.generate(**model_inputs, **generation_config)
generated_text = input_processor.get_responses(output_ids)
interaction_history.append([user_prompt, generated_text[0]])
print(f"Response: {generated_text[0]}")
Contextual Inference
Perform follow-up reasoning based on the previous interaction history. For example, identifying the director of a movie shown in the image requires maintaining the context of the conversation.
follow_up_query = "Who is the director of this movie?"
context_inputs = input_processor(raw_image, follow_up_query, history=interaction_history)
next_output_ids, _ = vision_model.generate(**context_inputs, **generation_config)
next_response = input_processor.get_responses(next_output_ids)
print(f"Follow-up Response: {next_response[0]}")
2. Building a Knowledge Base with RAG and LLMs
This section outlines the construction of a Retrieval-Augmented Generation (RAG) system. We will use the Wenxin (Ernie) model to answer questions based on external financial documents stored in a vector database.
Document Preparation and Processing
Install the necessary libraries for text processing and vector storage, then load the documents.
# Install dependencies
pip install transformers langchain openai unstructured tiktoken faiss-cpu sentence_transformers pypdf
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load PDF documents
file_paths = ['car.pdf', 'carbon.pdf']
documents = []
for path in file_paths:
    loader = PyPDFLoader(path)
    documents.extend(loader.load())
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=30,
    separators=['\n', ' ', '']
)
text_chunks = text_splitter.split_documents(documents)
print(f"Total chunks created: {len(text_chunks)}")
Vectorization and Storage
Convert text chunks into embeddings using a pre-trained model and store them in a FAISS vector store for efficient similarity search.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
# Initialize embedding model
embed_model_name = 'moka-ai/m3e-base'
embeddings = HuggingFaceEmbeddings(model_name=embed_model_name)
# Create vector store
vector_db = FAISS.from_documents(text_chunks, embeddings)
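Rebuilding the index on every run is wasteful; a FAISS store created with LangChain can be persisted to disk and reloaded later. A minimal sketch, assuming a local directory named faiss_index:
# Persist the index so it does not have to be rebuilt on every run
vector_db.save_local("faiss_index")
# Later (or in another process), reload it with the same embedding model
reloaded_db = FAISS.load_local("faiss_index", embeddings)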
Retrieval Mechanism
Implement a search function that queries the vector database to find the most relevant text chunks based on a user's question.
def retrieve_context(question, k=5):
    results = vector_db.similarity_search_with_score(question, k=k)
    context_list = []
    for doc, score in results:
        meta_source = doc.metadata['source']
        content_preview = doc.page_content[:30]
        print(f"Source: {meta_source} | Score: {score:.4f} | Preview: {content_preview}...")
        context_list.append(doc.page_content)
    return "\n".join(context_list)
# Example retrieval
query_context = retrieve_context('What dual-carbon policies has the government released?')
LLM Integration (Wenxin/Ernie)
Encapsulate the API interaction with the Baidu Ernie model to handle authentication and chat requests.
import requests
class ErnieBotClient:
    def __init__(self, api_key, secret_key):
        self.api_key = api_key
        self.secret_key = secret_key
        self.base_url = "https://aip.baidubce.com"
        self.access_token = self._authenticate()

    def _authenticate(self):
        url = f"{self.base_url}/oauth/2.0/token"
        params = {
            "grant_type": "client_credentials",
            "client_id": self.api_key,
            "client_secret": self.secret_key
        }
        resp = requests.get(url, params=params)
        if resp.status_code == 200:
            return resp.json().get("access_token")
        raise Exception("Authentication failed")

    def generate_response(self, prompt_text, user_id):
        endpoint = f"{self.base_url}/rpc/2.0/ai_custom/v1/wenxinworkshop/chat/eb-instant"
        payload = {
            "messages": [{"role": "user", "content": prompt_text}],
            "user_id": user_id
        }
        headers = {"Content-Type": "application/json"}
        resp = requests.post(
            f"{endpoint}?access_token={self.access_token}",
            json=payload,
            headers=headers
        )
        if resp.status_code == 200:
            return resp.json().get("result")
        raise Exception("API request failed")
# Initialize client (Replace with actual credentials)
ernie_client = ErnieBotClient(
    api_key="YOUR_API_KEY",
    secret_key="YOUR_SECRET_KEY"
)
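Before wiring the client into the RAG pipeline, a quick standalone call confirms that authentication and the chat endpoint work; the user_id value here is an arbitrary placeholder.
# Simple connectivity check; any short prompt will do
print(ernie_client.generate_response("Hello, who are you?", user_id="user_123"))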
Complete RAG Pipeline
Combine the retrieval and generation steps into a single function that formulates a prompt using the retrieved context and queries the LLM.
def rag_pipeline(user_question):
    # Step 1: Retrieve relevant documents
    relevant_context = retrieve_context(user_question, k=5)

    # Step 2: Construct the prompt
    system_instruction = (
        "You are a helpful assistant. Answer the question based ONLY on the "
        "provided known information. If the information is not related, say 'I do not know'."
    )
    final_prompt = f"{system_instruction}\nQuestion: {user_question}\nContext: {relevant_context}"

    # Step 3: Generate answer via LLM
    answer = ernie_client.generate_response(final_prompt, user_id="user_123")
    return answer
# Execute
final_answer = rag_pipeline('What dual-carbon policies has the government released?')
print(f"Answer: {final_answer}")
3. Advanced Project Recommendations
For developers seeking to deepen their expertise in Large Language Models and AIGC, the following comprehensive resources are recommended:
- Multimodal Large Models: An intensive curriculum focusing on advanced vision-language models (approx. 2 weeks study time).
- Medical AI and LLMs: Application of large models in the healthcare domain, covering medical data processing and analysis.
- Digital Avatar Customization: Techniques for creating personalized digital humans, including voice synthesis, appearance modeling, and interactive capabilities for chat and translation.