Chunking and Embedding
Chunking and embedding are critical steps in Retrieval-Augmented Generation (RAG) systems. They ensure that large documents can be effectively processed, stored, and retrieved by AI models to provide accurate and contextually relevant responses.
Chunking: The process of breaking down large text documents into smaller, meaningful segments to improve retrieval efficiency.
Embedding: Converting text chunks into vector representations that can be efficiently stored and searched within a vector database.
Chunking Strategies
Fixed-Length Chunking
A simple approach where the document is divided into fixed-size chunks (e.g., 500 tokens per chunk). This method is easy to implement, but chunk boundaries can split sentences or ideas mid-thought, leaving individual chunks without the surrounding context they need.
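Fixed-length chunking can be sketched in a few lines. The sketch below uses whitespace-split words as a stand-in for model tokens (a production pipeline would count tokens with the target model's tokenizer, such as tiktoken), and the overlap parameter is an added convention, not something prescribed above:

```python
def fixed_length_chunks(text, chunk_size=500, overlap=50):
    """Split text into chunks of roughly chunk_size tokens.

    Whitespace-split words stand in for model tokens here; a real
    pipeline would count tokens with the target model's tokenizer.
    A small overlap keeps context that straddles a chunk boundary.
    """
    tokens = text.split()
    step = max(1, chunk_size - overlap)  # guard against a zero/negative step
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), step)
    ]
```

Overlap trades a little storage for robustness: a sentence cut at one boundary is still intact in the neighboring chunk.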
Semantic Chunking
Divides the text based on semantic meaning, often using NLP techniques such as sentence segmentation or topic modeling.
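A simplified sketch of sentence-based semantic chunking follows. The regex sentence splitter and the character budget are simplifications for illustration; real systems often use an NLP library (e.g., spaCy) for segmentation, or embedding-similarity breakpoints:

```python
import re

def semantic_chunks(text, max_chars=400):
    """Group whole sentences into chunks so no sentence is split.

    Sentences are detected with a simple end-of-sentence regex; each
    chunk grows until adding the next sentence would exceed max_chars.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Because boundaries fall only between sentences, each chunk stays a coherent unit of meaning, at the cost of variable chunk sizes.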
Embedding Strategies
Using OpenAI Embeddings
OpenAI provides powerful embedding models that can transform text into vector representations.
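As a hedged sketch, assuming the OpenAI Python SDK (v1.x) and its `client.embeddings.create` endpoint; the model name below is an illustrative choice, not one prescribed by this document:

```python
def embed_chunks(client, chunks, model="text-embedding-3-small"):
    """Embed a batch of text chunks with OpenAI's Embeddings API.

    `client` is an openai.OpenAI instance; passing it in (rather than
    constructing it here) keeps the function easy to test and easy to
    swap for another provider. The model name is an assumption.
    """
    response = client.embeddings.create(model=model, input=chunks)
    return [item.embedding for item in response.data]
```

In practice you would create the client with `client = OpenAI()` (which reads `OPENAI_API_KEY` from the environment) and store the returned vectors in a vector database.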
Using Cohere Embeddings
Cohere offers alternative embedding models, with input types tuned for different NLP tasks such as search indexing and querying.
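A hedged sketch, assuming the `Client.embed` interface of Cohere's Python SDK; the model name and `input_type` value are illustrative assumptions:

```python
def embed_chunks_cohere(client, chunks, model="embed-english-v3.0"):
    """Embed a batch of text chunks with Cohere's embed endpoint.

    `input_type="search_document"` marks these texts as corpus documents
    to be indexed; queries would use "search_query". Both the model name
    and input_type here are assumptions for illustration.
    """
    response = client.embed(
        texts=chunks, model=model, input_type="search_document"
    )
    return response.embeddings
```

The client would normally be constructed with `cohere.Client(api_key)`; only the call shape matters for this sketch.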
Example: Chunking and Embedding Workflow
This example demonstrates how to apply chunking and embedding to process a large document.
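To make the workflow concrete, here is a self-contained sketch: chunk a document, embed each chunk, store the pairs in an in-memory index, and retrieve the best-matching chunk for a query. The toy character-frequency embedding and the plain list are stand-ins for a real embedding model and a vector database:

```python
import math

def fixed_chunks(text, chunk_size=20):
    """Split a document into chunks of roughly chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def toy_embed(text, dim=64):
    """Toy character-frequency embedding; a real pipeline would call an
    embedding model (e.g., OpenAI or Cohere) here instead."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def build_index(text, embed_fn, chunk_size=20):
    """Chunk the document and pair each chunk with its vector."""
    return [(c, embed_fn(c)) for c in fixed_chunks(text, chunk_size)]

def retrieve(query, index, embed_fn, top_k=1):
    """Return the top_k chunks most similar to the query."""
    qv = embed_fn(query)
    ranked = sorted(index, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

Swapping `toy_embed` for an API-backed embedding function and the list for a vector database turns this sketch into a minimal RAG retrieval layer.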