Practical Large Language Model Applications: Multimodal Vision and RAG Systems

1. Implementing Image-to-Text with VisualGLM

This section demonstrates how to build a multimodal application capable of understanding visual inputs and generating descriptive text. We will utilize the VisualGLM-6B model within the PaddlePaddle ecosystem.

Environment Setup and Dependencies

First, clone the PaddleMIX repository, install it as a package, and add the supporting dependencies for audio and image handling.

git clone https://github.com/PaddlePaddle/PaddleMIX.git
cd PaddleMIX
pip install -e .
pip install soundfile librosa requests pillow

Model Initialization

Import the required libraries and configure the computing environment. Load the pre-trained model and processor, ensuring the model is set to evaluation mode.

import os
import requests
from PIL import Image
from paddlemix import VisualGLMForConditionalGeneration, VisualGLMProcessor

# Specify GPU device
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Model configuration
model_source = "aistudio/visualglm-6b"

# Initialize model and processor
vision_model = VisualGLMForConditionalGeneration.from_pretrained(
    model_source, from_aistudio=True, dtype="float32"
)
vision_model.eval()
input_processor = VisualGLMProcessor.from_pretrained(model_source, from_aistudio=True)

Image Inference and Description

Load an image from a remote URL and configure the generation parameters. The model will process the image and a text query to produce a description.

# Fetch image
img_url = "https://i02piccdn.sogoucdn.com/5dd40dedd7107cc5"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

# Generation parameters
generation_config = {
    "max_length": 1024,
    "min_length": 10,
    "top_p": 1.0,
    "temperature": 0.8,
    "repetition_penalty": 1.2,
    "decode_strategy": "sampling",
}

# Initial description task
user_prompt = "Describe this scene in a poem."
interaction_history = []

# Process inputs and generate
model_inputs = input_processor(raw_image, user_prompt)
output_ids, _ = vision_model.generate(**model_inputs, **generation_config)
generated_text = input_processor.get_responses(output_ids)

interaction_history.append([user_prompt, generated_text[0]])
print(f"Response: {generated_text[0]}")

Contextual Inference

Perform follow-up reasoning based on the previous interaction history. For example, identifying the director of a movie shown in the image requires maintaining the context of the conversation.

follow_up_query = "Who is the director of this movie?"
context_inputs = input_processor(raw_image, follow_up_query, history=interaction_history)

next_output_ids, _ = vision_model.generate(**context_inputs, **generation_config)
next_response = input_processor.get_responses(next_output_ids)

print(f"Follow-up Response: {next_response[0]}")
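The history mechanism is easiest to reason about in isolation: each turn appends a [prompt, response] pair that the processor folds back into the next input. A minimal pure-Python sketch of that loop, with a stub standing in for the model (the stub and helper names are illustrative, not part of PaddleMIX):

```python
def stub_generate(prompt, history):
    # A real model would condition on every [prompt, response] pair in history;
    # this stub just echoes the turn count so the flow is visible.
    return f"reply-{len(history)} to: {prompt}"

def chat_turn(prompt, history):
    response = stub_generate(prompt, history)
    history.append([prompt, response])  # same structure as interaction_history
    return response

history = []
chat_turn("Describe this scene in a poem.", history)
chat_turn("Who is the director of this movie?", history)
print(len(history))   # two turns recorded
print(history[1][0])  # the follow-up prompt is preserved for later turns
```

Because the follow-up question ("this movie") is meaningless without the first exchange, dropping the history list would break the contextual query above.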

2. Building a Knowledge Base with RAG and LLMs

This section outlines the construction of a Retrieval-Augmented Generation (RAG) system. We will use the Wenxin (Ernie) model to answer questions based on external financial documents stored in a vector database.

Document Preparation and Processing

Install the necessary libraries for text processing and vector storage, then load the documents.

# Install dependencies
pip install transformers langchain openai unstructured tiktoken faiss-cpu sentence_transformers pypdf

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load PDF documents
file_paths = ['car.pdf', 'carbon.pdf']
documents = []

for path in file_paths:
    loader = PyPDFLoader(path)
    documents.extend(loader.load())

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300, 
    chunk_overlap=30, 
    separators=['\n', ' ', '']
)
text_chunks = text_splitter.split_documents(documents)
print(f"Total chunks created: {len(text_chunks)}")
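The two key parameters are chunk_size and chunk_overlap: consecutive chunks share chunk_overlap characters, so text cut at one boundary still appears whole in the neighboring chunk. A toy fixed-width version of that sliding window (illustrative only; the real RecursiveCharacterTextSplitter additionally recurses over the separators list to avoid splitting mid-word):

```python
def sliding_chunks(text, chunk_size=300, chunk_overlap=30):
    """Naive fixed-width chunking with overlap; ignores separators."""
    step = chunk_size - chunk_overlap  # each chunk starts `step` chars after the last
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

doc = "".join(str(i % 10) for i in range(1000))
chunks = sliding_chunks(doc)
print(len(chunks))                          # 4 chunks for 1000 chars
print(chunks[0][-30:] == chunks[1][:30])    # True: adjacent chunks overlap
```

With chunk_size=300 and chunk_overlap=30, each new chunk advances 270 characters, which is why a 1000-character document yields four chunks rather than four exact thirds.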

Vectorization and Storage

Convert text chunks into embeddings using a pre-trained model and store them in a FAISS vector store for efficient similarity search.

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Initialize embedding model
embed_model_name = 'moka-ai/m3e-base'
embeddings = HuggingFaceEmbeddings(model_name=embed_model_name)

# Create vector store
vector_db = FAISS.from_documents(text_chunks, embeddings)
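Conceptually, the store maps each chunk to a vector and later ranks chunks by vector similarity to the embedded query. A toy pure-Python sketch of that ranking step with hand-made 3-d vectors (purely illustrative; not the FAISS or LangChain API, which operate on real embedding vectors):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Fake "embeddings" for two stored chunks (values chosen for illustration).
store = {
    "carbon policy chunk": [0.9, 0.1, 0.0],
    "car maintenance chunk": [0.1, 0.9, 0.1],
}
query_vec = [0.8, 0.2, 0.0]  # pretend this is the embedded user question

ranked = sorted(store, key=lambda k: cosine(store[k], query_vec), reverse=True)
print(ranked[0])  # "carbon policy chunk" -- the semantically closer chunk wins
```

FAISS performs this nearest-neighbor search at scale with optimized index structures, but the ranking principle is the same.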

Retrieval Mechanism

Implement a search function that queries the vector database to find the most relevant text chunks based on a user's question.

def retrieve_context(question, k=5):
    results = vector_db.similarity_search_with_score(question, k=k)
    
    context_list = []
    for doc, score in results:
        meta_source = doc.metadata['source']
        content_preview = doc.page_content[:30]
        print(f"Source: {meta_source} | Score: {score:.4f} | Preview: {content_preview}...")
        context_list.append(doc.page_content)
        
    return "\n".join(context_list)

# Example retrieval
query_context = retrieve_context('What dual-carbon policies has the government released?')

LLM Integration (Wenxin/Ernie)

Encapsulate the API interaction with the Baidu Ernie model to handle authentication and chat requests.

import requests

class ErnieBotClient:
    def __init__(self, api_key, secret_key):
        self.api_key = api_key
        self.secret_key = secret_key
        self.base_url = "https://aip.baidubce.com"
        self.access_token = self._authenticate()

    def _authenticate(self):
        url = f"{self.base_url}/oauth/2.0/token"
        params = {
            "grant_type": "client_credentials",
            "client_id": self.api_key,
            "client_secret": self.secret_key
        }
        resp = requests.get(url, params=params)
        if resp.status_code == 200:
            return resp.json().get("access_token")
        raise RuntimeError(f"Authentication failed: HTTP {resp.status_code}")

    def generate_response(self, prompt_text, user_id):
        endpoint = f"{self.base_url}/rpc/2.0/ai_custom/v1/wenxinworkshop/chat/eb-instant"
        payload = {
            "messages": [{"role": "user", "content": prompt_text}],
            "user_id": user_id
        }
        headers = {"Content-Type": "application/json"}
        resp = requests.post(
            f"{endpoint}?access_token={self.access_token}", 
            json=payload, 
            headers=headers
        )
        if resp.status_code == 200:
            return resp.json().get("result")
        raise RuntimeError(f"API request failed: HTTP {resp.status_code}")

# Initialize client (Replace with actual credentials)
ernie_client = ErnieBotClient(
    api_key="YOUR_API_KEY", 
    secret_key="YOUR_SECRET_KEY"
)

Complete RAG Pipeline

Combine the retrieval and generation steps into a single function that formulates a prompt using the retrieved context and queries the LLM.

def rag_pipeline(user_question):
    # Step 1: Retrieve relevant documents
    relevant_context = retrieve_context(user_question, k=5)
    
    # Step 2: Construct the prompt
    system_instruction = (
        "You are a helpful assistant. Answer the question based ONLY on the "
        "provided known information. If the information is not related, say 'I do not know'."
    )
    final_prompt = f"{system_instruction}\nQuestion: {user_question}\nContext: {relevant_context}"
    
    # Step 3: Generate answer via LLM
    answer = ernie_client.generate_response(final_prompt, user_id="user_123")
    return answer

# Execute
final_answer = rag_pipeline('What dual-carbon policies has the government released?')
print(f"Answer: {final_answer}")
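One practical caveat: with k=5 and 300-character chunks the assembled prompt stays small, but with a larger k or larger chunks it can exceed the model's input limit. A hedged sketch of trimming retrieved chunks to a character budget before building the prompt (the budget value and helper name are illustrative assumptions, not an Ernie limit):

```python
def fit_context(chunks, budget=2000):
    """Keep chunks in retrieval order (most relevant first) until the budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget:
            break  # dropping the least relevant chunks first preserves answer quality
        kept.append(chunk)
        used += len(chunk)
    return "\n".join(kept)

chunks = ["a" * 900, "b" * 900, "c" * 900]
ctx = fit_context(chunks, budget=2000)
print(len(ctx))  # 1801: two 900-char chunks joined by one newline; the third is dropped
```

In rag_pipeline, this would slot between the retrieval step and the prompt construction, so relevant_context never pushes the final prompt past the model's limit.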

3. Advanced Project Recommendations

For developers seeking to deepen their expertise in Large Language Models and AIGC, the following comprehensive resources are recommended:

  • Multimodal Large Models: An intensive curriculum focusing on advanced vision-language models (approx. 2 weeks study time).
  • Medical AI and LLMs: Application of large models in the healthcare domain, covering medical data processing and analysis.
  • Digital Avatar Customization: Techniques for creating personalized digital humans, including voice synthesis, appearance modeling, and interactive capabilities for chat and translation.

Posted on Thu, 14 May 2026 13:51:53 +0000 by ifm1989