Vision Templates¶
Overview¶
The Chat Template System provides comprehensive support for multi-modal templates that can process images and videos alongside text. This guide covers how to create, configure, and use vision-enabled templates.
Vision Architecture¶
Pipeline Overview¶
The vision processing follows a clear separation of concerns:
Messages → Template Processing → Vision Processor → LLM-Ready Inputs
- Template Processing: Creates human-readable prompts with vision tokens
- Vision Processing: Handles image/video processing and token expansion
- Final Output: LLM-ready inputs with proper tensor alignment
Key Components¶
- Vision Tokens: Placeholders in prompts that get expanded to actual tokens
- Vision Processors: Specialized classes that handle multi-modal input processing
- Token Expansion: Converting vision tokens to their actual token representations
- Tensor Alignment: Ensuring all tensors (input_ids, attention_mask, labels, action_mask) are properly aligned
Creating Vision Templates¶
Basic Vision Template¶
from chat_bricks import Template, register_template
vision_template = register_template(
Template(
name="vision-enabled",
system_template="You are a vision-capable AI assistant.\n",
system_message="You are a vision-capable AI assistant.",
user_template="User: {content}\n",
assistant_template="Assistant: {content}\n",
# Vision configuration
vision_start="<|vision_start|>",
vision_end="<|vision_end|>",
image_token="<|image_pad|>",
video_token="<|video_pad|>",
stop_words=["\n"]
)
)
Vision Template with Tools¶
vision_tool_template = register_template(
Template(
name="vision-tool-enabled",
system_template="You are a vision-capable AI assistant{tools}.\n",
tools_template=" with tools.\n\nTools: {tools}",
system_message="You are a vision-capable AI assistant with tools.",
user_template="User: {content}\n",
user_template_with_tools="User: {content}\n\nTools: {tools}\n",
assistant_template="Assistant: {content}\n",
observations_template="Tool: {observation}\n",
# Vision configuration
vision_start="<|vision_start|>",
vision_end="<|vision_end|>",
image_token="<|image_pad|>",
video_token="<|video_pad|>",
stop_words=["\n"]
)
)
Use the template¶
messages = [
{
"role": "system",
"content": "You are a multi-modal assistant that can answer questions about images.",
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
chat = Chat(template="vision-enabled", messages=messages)
print(chat.prompt())
# Tokenize the prompt
from transformers import AutoTokenizer, AutoProcessor
model_name = "Qwen/Qwen2.5-VL-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)
inputs = chat.tokenize(tokenizer=tokenizer, processor=processor)
print(inputs.keys())
Vision Processor Configuration¶
Automatic Registration¶
Vision processors are automatically registered when vision tokens are detected:
# This happens automatically in __post_init__
def _register_vision_processor(self):
"""Automatically register a vision processor for this template"""
if self.image_token or self.video_token:
from .vision_processor import VisionProcessorConfig, register_processor
# Determine model type based on template name
model_type = self._infer_model_type()
# Create vision config
config = VisionProcessorConfig(
model_type=model_type,
image_token=self.image_token or "",
video_token=self.video_token or "",
vision_start=self.vision_start or "",
vision_end=self.vision_end or "",
processor_class="AutoProcessor",
expansion_strategy="patch_based"
)
# Register the processor
register_processor(self.name, config)
Model Type Inference¶
The system automatically infers the appropriate vision processor based on template name:
def _infer_model_type(self) -> str:
"""Infer model type from template name"""
name_lower = self.name.lower()
if "qwen" in name_lower:
return "qwen_vl"
elif "llava" in name_lower:
return "llava"
elif "gemma" in name_lower:
return "gemma3"
elif "paligemma" in name_lower:
return "paligemma"
elif "internvl" in name_lower:
return "internvl"
elif "minicpm" in name_lower:
return "minicpm"
elif "mllama" in name_lower:
return "mllama"
elif "pixtral" in name_lower:
return "pixtral"
elif "video" in name_lower:
return "video_llava"
else:
# Default to patch-based for unknown models
return "patch_based"
Vision Processor Types¶
Patch-Based Processor¶
The default processor used by most vision models:
from chat_bricks import PatchBasedProcessor
# Automatically used for most models
# Supports multiple image input formats
# Handles token calculation and expansion
Qwen-VL Processor¶
Specialized processor for Qwen-VL models:
from chat_bricks import QwenVLProcessor
# Qwen-VL specific image preprocessing
# Custom token calculation using grid-based approach
# Optimized for Qwen-VL architecture
LLaVA Processor¶
Specialized processor for LLaVA models:
from chat_bricks import LlavaProcessor
# LLaVA specific token calculation
# Optimized for LLaVA architecture
Input Formats¶
Image Input Formats¶
The system supports multiple image input formats:
# File path
image_path = "/path/to/image.jpg"
# URL
image_url = "https://example.com/image.jpg"
# Base64 string (data URL)
image_base64 = "data:image/jpeg;base64,/9j/4AAQ..."
# Raw base64 string
raw_base64 = "iVBORw0KGgoAAAANSUhEUgAA..."
# PIL Image object
from PIL import Image
pil_image = Image.open("image.jpg")
# Bytes
with open("image.jpg", "rb") as f:
image_bytes = f.read()
# File-like object
with open("image.jpg", "rb") as f:
image_file = f
# Dict format
image_dict = {"path": "/path/to/image.jpg"}
# or
image_dict = {"bytes": b"image_data"}
Video Input Formats¶
# Video file path
video_path = "/path/to/video.mp4"
# File-like object
with open("video.mp4", "rb") as f:
video_file = f
# List of image frames
video_frames = [
"/path/to/frame1.jpg",
"/path/to/frame2.jpg",
"/path/to/frame3.jpg"
]
# List of PIL Image objects
from PIL import Image
video_frames = [
Image.open("frame1.jpg"),
Image.open("frame2.jpg"),
Image.open("frame3.jpg")
]
Message Format with Vision¶
# Message with image
message_with_image = {
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image", "image": "/path/to/image.jpg"}
]
}
# Message with video
message_with_video = {
"role": "user",
"content": [
{"type": "text", "text": "Analyze this video"},
{"type": "video", "video": "/path/to/video.mp4"}
]
}
# Message with URL image
message_with_url = {
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
}
Using Vision Templates¶
Basic Vision Chat¶
from chat_bricks import Chat
# Create chat with vision template
chat = Chat(template="qwen2.5-vl", messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Describe what you see in this image"},
{"type": "image", "image": "/path/to/image.jpg"}
]
}
])
# Generate prompt
prompt = chat.prompt()
print(prompt)
Vision Chat with Tools¶
# Vision chat with tool definitions
tools = [
{
"function": {
"name": "analyze_image",
"description": "Analyze image content",
"parameters": {
"type": "object",
"properties": {
"analysis_type": {"type": "string", "enum": ["objects", "text", "emotions"]}
}
}
}
}
]
chat = Chat(template="qwen2.5-vl", messages=messages_with_image, tools=tools)
prompt = chat.prompt(tools=tools)
Tokenization with Vision¶
# Tokenize vision-enabled conversation
inputs = chat.tokenize(
tokenizer=tokenizer,
processor=processor, # Required for vision processing
add_generation_prompt=True,
tools=tools
)
# Result includes:
# - input_ids: Token IDs with vision tokens expanded
# - attention_mask: Attention mask
# - labels: Labels for training (-100 for non-assistant tokens)
# - action_mask: Action mask for training (1 for assistant tokens)
# - pixel_values: Image/video tensors
# - image_grid_thw: Grid information (for some models)
Vision Processing Pipeline¶
Step 1: Template Processing¶
# Template creates prompt with vision tokens
prompt, elements, roles = template.render(messages, tools=tools)
# Result: "User: Describe what you see in this image <|image_pad|>"
Step 2: Vision Token Expansion¶
# Vision processor expands tokens based on actual image content
expanded_prompt = vision_processor.expand_vision_tokens(
prompt=prompt,
images=images,
videos=videos,
processor=processor
)
# Result: "User: Describe what you see in this image <|image_pad|><|image_pad|><|image_pad|>..."
Step 3: Multi-Modal Input Generation¶
# Generate vision inputs
mm_inputs = vision_processor.get_mm_inputs(images, videos, processor)
# Result: {"pixel_values": tensor, "image_grid_thw": tensor}
Step 4: Final Tokenization¶
# Tokenize expanded prompt with proper alignment
final_inputs = vision_processor.process_for_llm(
prompt=prompt,
elements=elements,
mask_flags=mask_flags,
images=images,
videos=videos,
processor=processor,
tokenizer=tokenizer
)
Token Calculation¶
Image Token Calculation¶
def calculate_image_tokens(self, image_data, processor):
"""Calculate tokens needed for an image"""
if "pixel_values" in image_data:
# Try grid-based calculation first (HuggingFace method)
if "image_grid_thw" in image_data:
grid_info = image_data["image_grid_thw"]
grid_prod = grid_info.prod().item()
# Get merge_size from processor
merge_size = getattr(processor, "merge_size", 1)
merge_length = merge_size ** 2
num_image_tokens = grid_prod // merge_length
return max(1, num_image_tokens)
# Fallback to patch-based calculation
height, width = get_image_size(image_data["pixel_values"][0])
image_seqlen = (height // processor.patch_size) * (width // processor.patch_size)
# Add additional tokens if specified
if hasattr(processor, 'num_additional_image_tokens'):
image_seqlen += processor.num_additional_image_tokens
# Adjust for feature selection strategy
if (hasattr(processor, 'vision_feature_select_strategy') and
processor.vision_feature_select_strategy == "default"):
image_seqlen -= 1
return image_seqlen
return 1
Video Token Calculation¶
def calculate_video_tokens(self, video_data, processor):
"""Calculate tokens needed for a video"""
if "pixel_values" in video_data:
video_tensor = video_data["pixel_values"][0]
if len(video_tensor.shape) > 3: # Has frame dimension
num_frames = video_tensor.shape[0]
height, width = get_image_size(video_tensor[0])
frame_seqlen = (height // processor.patch_size) * (width // processor.patch_size)
# Add additional tokens if specified
if hasattr(processor, 'num_additional_image_tokens'):
frame_seqlen += processor.num_additional_image_tokens
# Adjust for feature selection strategy
if (hasattr(processor, 'vision_feature_select_strategy') and
processor.vision_feature_select_strategy == "default"):
frame_seqlen -= 1
return frame_seqlen * num_frames
else:
# Single frame video
return self.calculate_image_tokens(video_data, processor)
return 1
Advanced Vision Features¶
Custom Vision Processors¶
from chat_bricks import VisionProcessor, VisionProcessorConfig
class CustomVisionProcessor(VisionProcessor):
"""Custom vision processor for specific needs"""
def preprocess_images(self, images, processor):
"""Custom image preprocessing"""
# Custom preprocessing logic
processed_images = []
for image in images:
# Apply custom transformations
processed_image = self._custom_transform(image)
processed_images.append(processed_image)
# Use processor's image processor
image_processor = getattr(processor, "image_processor", None)
if image_processor is None:
raise ValueError("Image processor not found")
return image_processor(processed_images, return_tensors="pt")
def calculate_image_tokens(self, image_data, processor):
"""Custom token calculation"""
# Custom token calculation logic
base_tokens = super().calculate_image_tokens(image_data, processor)
return base_tokens * 2 # Example: double the tokens
def expand_vision_tokens(self, prompt, images, videos, processor):
"""Custom token expansion"""
# Custom expansion logic
expanded = super().expand_vision_tokens(prompt, images, videos, processor)
return f"<vision_start>{expanded}<vision_end>"
# Register custom processor
config = VisionProcessorConfig(
model_type="custom",
image_token="<custom_image>",
video_token="<custom_video>",
vision_start="<custom_vision_start>",
vision_end="<custom_vision_end>"
)
from chat_bricks import register_processor
register_processor("custom-template", config, CustomVisionProcessor)
Vision Configuration Options¶
from chat_bricks import VisionProcessorConfig
config = VisionProcessorConfig(
model_type="qwen_vl",
image_token="<|image_pad|>",
video_token="<|video_pad|>",
vision_start="<|vision_start|>",
vision_end="<|vision_end|>",
processor_class="AutoProcessor",
expansion_strategy="patch_based",
image_max_pixels=16384 * 28 * 28, # Maximum image size
image_min_pixels=4 * 28 * 28, # Minimum image size
video_max_pixels=16384 * 28 * 28, # Maximum video size
video_min_pixels=4 * 28 * 28, # Minimum video size
video_fps=2.0, # Video frame rate
video_maxlen=128 # Maximum video length
)
Best Practices¶
1. Template Design¶
- Use descriptive vision token names
- Ensure vision tokens are unique and recognizable
- Consider token expansion implications
2. Image Processing¶
- Use appropriate image formats (JPEG, PNG)
- Consider image size and resolution
- Handle various input formats gracefully
3. Video Processing¶
- Use appropriate video formats (MP4, AVI)
- Consider frame rate and length
- Handle both file and frame-based inputs
4. Token Management¶
- Understand token calculation for your model
- Consider memory implications of large images/videos
- Use appropriate token limits
5. Error Handling¶
- Validate image/video inputs
- Handle processing failures gracefully
- Provide meaningful error messages
6. Performance¶
- Cache processed images when possible
- Use appropriate image sizes for your use case
- Consider batch processing for multiple images
Example: Complete Vision Template¶
Here's a complete example of creating and using a vision template:
from chat_bricks import Template, register_template, Chat
from chat_bricks import ToolPolicy, JsonFormatter
from chat_bricks import ToolPlacement
# Create a comprehensive vision template
vision_template = Template(
name="comprehensive-vision",
# Basic templates — {tools} stays empty when no tools are passed
system_template="<|im_start|>system\n{system_message}{tools}<|im_end|>\n",
system_message="You are a comprehensive vision-capable AI assistant.",
# Tool support
tools_template="\n\nAvailable Tools:\n{tools}",
user_template="<|im_start|>user\n{content}<|im_end|>\n",
user_template_with_tools="<|im_start|>user\n{content}\n\nTools: {tools}<|im_end|>\n",
assistant_template="<|im_start|>assistant\n{content}<|im_end|>\n",
observations_template="<|im_start|>tool\n{observation}<|im_end|>\n",
# Vision support
vision_start="<|vision_start|>",
vision_end="<|vision_end|>",
image_token="<|image_pad|>",
video_token="<|video_pad|>",
# Stop words
stop_words=["<|im_end|>"],
# Tool policy
tool_policy=ToolPolicy(
placement=ToolPlacement.SYSTEM,
formatter=JsonFormatter(indent=2)
)
)
# Register the template
register_template(vision_template)
# Create chat with vision content
chat = Chat(template="comprehensive-vision", messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Analyze this image and describe what you see"},
{"type": "image", "image": "/path/to/image.jpg"}
]
}
])
# Generate prompt
prompt = chat.prompt()
print(prompt)
# Tokenize with vision processing
inputs = chat.tokenize(
tokenizer=tokenizer,
processor=processor,
add_generation_prompt=True
)
print("Input shape:", inputs["input_ids"].shape)
print("Vision inputs:", list(inputs.keys()))
This comprehensive guide covers all aspects of vision templates in the Chat Template System. Use these features to create powerful multi-modal templates that can handle images, videos, and text seamlessly.