Embedding Processors

Default Embedding Processor

CDP comes with a default embedding processor that supports the following embedding functions:

Default (default) - The default ChromaDB embedding function based on OnnxRuntime and MiniLM-L6-v2 model.
OpenAI (openai) - OpenAI's text-embedding-ada-002 model.
Cohere (cohere) - Cohere's embedding models.
HuggingFace (hf) - HuggingFace's embedding models.
SentenceTransformers (st) - SentenceTransformers' embedding models.
Google Generative AI (gemini) - Google's Gemini embedding models.
Ollama (ollama) - Ollama's embedding models.

The embedding functions are based on ChromaDB's embedding functions.

.env Files

CDP supports loading environment variables from .env files. You can create a .env file in the root of your project (or in whichever directory you run CDP commands from) and add environment-specific variables, one per line, in the form NAME=VALUE.
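As a minimal sketch, a .env file could be created as follows. The variable names are the ones used in the sections below, and the values are the same placeholders used on this page, not real credentials.

```shell
# Write a .env file in the directory you run CDP from.
# The quoted 'EOF' keeps the shell from expanding anything in the body.
cat > .env <<'EOF'
OPENAI_API_KEY=sk-xxxxxx
COHERE_API_KEY=x4q
COHERE_MODEL_NAME=embed-english-light-v3.0
EOF
```

Because the file holds API keys, it is usually best kept out of version control.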

Usage

Default

The command below reads the PDF files at the specified path, chunks each document into pieces of 500 characters, and embeds each chunk using Chroma's default (MiniLM-L6-v2) model. The resulting documents with embeddings are written to the chroma-data.jsonl file.

cdp imp pdf sample-data/papers/ | cdp chunk -s 500 | cdp embed --ef default > chroma-data.jsonl
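Since a JSONL file stores one JSON document per line, a quick sanity check on the output is to count its lines. The helper below is a hypothetical convenience function, not a CDP command.

```shell
# Hypothetical helper: print the number of records in a JSONL file,
# assuming one JSON document per line.
count_records() {
  wc -l < "$1" | tr -d '[:space:]'
}
```

For example, count_records chroma-data.jsonl would print how many embedded chunks the command above wrote.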

OpenAI

To use this embedding function, you need to install the openai python package.

pip install openai

OpenAI API Key

You need an OpenAI API key to use this embedding function. You can get one by signing up for an account on the OpenAI API Keys page. The API key must be exported as the environment variable OPENAI_API_KEY=sk-xxxxxx.
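To catch a missing key before the pipeline starts, a small guard can help. This is a sketch only: require_env is a hypothetical helper, not part of CDP, and it only sees variables exported in the current shell, not values supplied through a .env file.

```shell
# Hypothetical helper (not part of CDP): succeed only if the named
# variable is exported and non-empty.
require_env() {
  if [ -z "$(printenv "$1")" ]; then
    echo "error: $1 is not set" >&2
    return 1
  fi
}
```

Calling require_env OPENAI_API_KEY before the embed command then stops early with a readable message instead of failing partway through.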

OpenAI Embedding Models

By default, if not specified, the text-embedding-ada-002 model is used. You can pass in an optional --model=text-embedding-3-small argument or set the environment variable OPENAI_MODEL_NAME=text-embedding-3-large to choose which OpenAI embeddings model to use.

The command below reads the PDF files at the specified path, filters the output for a particular PDF (grep), selects that document's first page (head -1), chunks it into pieces of 500 characters, and embeds each chunk using OpenAI's text-embedding-ada-002 model.

export OPENAI_API_KEY=sk-xxxxxx
cdp imp pdf sample-data/papers/ |grep "2401.02412.pdf" | head -1 | cdp chunk -s 500 | cdp embed --ef openai
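The --model argument can be combined with the same kind of pipeline. As a sketch, the pipeline is wrapped here in a hypothetical shell function (embed_openai is not a CDP command) so that the model name becomes a parameter.

```shell
# Hypothetical wrapper: chunk the sample PDFs and embed them with the
# OpenAI model given as the first argument (requires OPENAI_API_KEY).
embed_openai() {
  cdp imp pdf sample-data/papers/ \
    | cdp chunk -s 500 \
    | cdp embed --ef openai --model="$1"
}
```

A call such as embed_openai text-embedding-3-small > chroma-data.jsonl would then write the embedded chunks to a file.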

Cohere

To use this embedding function, you need to install the cohere python package.

pip install cohere

Cohere API Key

You need to have a Cohere API key to use this embedding function. You can get an API key by signing up for an account at Cohere. The API key must be exported as env variable COHERE_API_KEY=x4q....

Cohere Embedding Models

By default, if not specified, the embed-english-v3.0 model is used. You can pass in an optional --model=embed-english-light-v3.0 argument or set the environment variable COHERE_MODEL_NAME=embed-multilingual-v3.0 to choose which Cohere embeddings model to use. More about the available models can be found in Cohere's API docs.

The command below reads the PDF files at the specified path, selects the last document's page, chunks it into pieces of 100 characters, and embeds each chunk using Cohere's embed-english-light-v3.0 model.

export COHERE_API_KEY=x4q
export COHERE_MODEL_NAME="embed-english-light-v3.0"
cdp imp pdf sample-data/papers/ | tail -1 | cdp chunk -s 100 | cdp embed --ef cohere

HuggingFace

HF API Token

You need a HuggingFace API token to use this embedding function. Create or use one from your tokens page. The token must be exported as the environment variable HF_TOKEN=hf_xxxx.

HF Embedding Models

By default, if not specified, the sentence-transformers/all-MiniLM-L6-v2 model is used. You can pass in an optional --model=BAAI/bge-large-en-v1.5 argument or set the environment variable HF_MODEL_NAME=BAAI/bge-large-en-v1.5 to choose which HuggingFace embeddings model to use.

The command below reads the PDF files at the specified path, selects the first two pages, chunks them into pieces of 150 characters, selects the last chunk, and embeds it using the BAAI/bge-large-en-v1.5 model.

export HF_TOKEN=hf_xxxx
export HF_MODEL_NAME="BAAI/bge-large-en-v1.5"
cdp imp pdf sample-data/papers/ | head -2 | cdp chunk -s 150 | tail -1 | cdp embed --ef hf

SentenceTransformers

To use this embedding function, you need to install the sentence-transformers python package.

pip install sentence-transformers

SentenceTransformers Embedding Models

By default, if not specified, the all-MiniLM-L6-v2 model is used. You can pass in an optional --model=BAAI/bge-large-en-v1.5 argument or set the environment variable ST_MODEL_NAME=BAAI/bge-large-en-v1.5 to choose which SentenceTransformers embeddings model to use.

The command below reads the PDF files at the specified path, selects the first two pages, chunks them into pieces of 150 characters, selects the last chunk, and embeds it using the BAAI/bge-small-en-v1.5 model.

export ST_MODEL_NAME="BAAI/bge-small-en-v1.5"
cdp imp pdf sample-data/papers/ | head -2 | cdp chunk -s 150 | tail -1 | cdp embed --ef st

Google Generative AI Embedding (Gemini)

To use the Google Generative AI Embedding (Gemini) function, you need to install the google-generativeai Python package.

pip install google-generativeai

Google API Key

You need a Google API key to use this embedding function. To manage your keys, go to Maker Suite. The API key must be exported as the environment variable GEMINI_API_KEY=xxxx.

Models

By default, if not specified, the models/embedding-001 model is used. You can pass in an optional --model=models/embedding-001 argument or env variable GEMINI_MODEL_NAME=models/embedding-001, which lets you choose which Gemini embeddings model to use.

Task Type

The embedding function also supports a task type parameter. By default, RETRIEVAL_DOCUMENT is used. For more details, see the Gemini API Docs.

The command below reads the PDF files at the specified path, selects the first two pages, chunks them into pieces of 150 characters, selects the last chunk, and embeds it using the models/embedding-001 model.

export GEMINI_API_KEY=xxxx
cdp imp pdf sample-data/papers/ | head -2 | cdp chunk -s 150 | tail -1 | cdp embed --ef gemini

Ollama Embeddings

To use the Ollama embedding function, you need to run an Ollama server. See the instructions in the official Ollama GitHub repo.

Models

By default, if not specified, the chroma/all-minilm-l6-v2-f32 model is used. You can pass in an optional --model=nomic-embed-text argument or env variable OLLAMA_MODEL_NAME=nomic-embed-text, which lets you choose which Ollama embeddings model to use.

Embedding URL

By default, the embedding function tries to connect to an Ollama server at the http://localhost:11434/api/embeddings endpoint. If you wish to override that, export the environment variable OLLAMA_EMBED_URL.
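The sketch below shows one way to set the endpoint and confirm the server responds before running a pipeline. The check_ollama helper is hypothetical, not part of CDP. The request body follows the shape used by Ollama's embeddings API (a model name and a prompt), and the model is assumed to have been pulled already.

```shell
# Use the default endpoint unless OLLAMA_EMBED_URL is already set.
export OLLAMA_EMBED_URL="${OLLAMA_EMBED_URL:-http://localhost:11434/api/embeddings}"

# Hypothetical helper: send one test prompt to the endpoint and fail on
# connection or HTTP errors. The first argument is the model name.
check_ollama() {
  curl --silent --show-error --fail "$OLLAMA_EMBED_URL" \
    -d "{\"model\": \"$1\", \"prompt\": \"hello\"}"
}
```

Running check_ollama nomic-embed-text should print a JSON response containing an embedding if the server is reachable and the model is available.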

The command below reads the PDF files at the specified path, selects the first two pages, chunks them into pieces of 150 characters, selects the last chunk, and embeds it using the chroma/all-minilm-l6-v2-f32 model.

cdp imp pdf sample-data/papers/ | head -2 | cdp chunk -s 150 | tail -1 | cdp embed --ef ollama --model=chroma/all-minilm-l6-v2-f32