Embedding Processors¶
Default Embedding Processor¶
CDP comes with a default embedding processor that supports the following embedding functions:
- Default (default) - The default ChromaDB embedding function, based on OnnxRuntime and the MiniLM-L6-v2 model.
- OpenAI (openai) - OpenAI's text-embedding-ada-002 model.
- Cohere (cohere) - Cohere's embedding models.
- HuggingFace (hf) - HuggingFace's embedding models.
- SentenceTransformers (st) - SentenceTransformers' embedding models.
- Ollama (ollama) - Ollama's embedding models.
The embedding functions are based on ChromaDB's embedding functions.
.env Files
CDP supports loading environment variables from .env files. You can create a .env file in the root of your project (or wherever you run CDP commands) and add environment-specific variables on new lines in the form NAME=VALUE.
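For example, a minimal .env file might look like this (the values are placeholders):
OPENAI_API_KEY=sk-xxxxxx
HF_TOKEN=hf_xxxx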
Usage¶
Default¶
The below command will read PDF files at the specified path, filter the output for a particular PDF (grep), select the first document's page, chunk it to 500 characters, and embed each chunk using Chroma's default (MiniLM-L6-v2) model. The resulting documents with embeddings will be written to the chroma-data.jsonl file.
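A sketch of that pipeline, modeled on the OpenAI example below (the grep target is reused from that example and is illustrative):
cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500 | cdp embed --ef default > chroma-data.jsonl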
OpenAI¶
To use this embedding function, you need to install the openai python package.
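For example, assuming pip is your package manager:
pip install openai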
OpenAI API Key
You need to have an OpenAI API key to use this embedding function.
You can get an API key by signing up for an account on the OpenAI API Keys page.
The API key must be exported as the env variable OPENAI_API_KEY=sk-xxxxxx.
OpenAI Embedding Models
By default, if not specified, the text-embedding-ada-002 model is used.
You can pass an optional --model=text-embedding-3-small argument or set the env variable OPENAI_MODEL_NAME=text-embedding-3-large, which lets you choose which OpenAI embeddings model to use.
The below command will read PDF files at the specified path, filter the output for a particular PDF (grep), select the first document's page, chunk it to 500 characters, and embed each chunk using OpenAI's text-embedding-ada-002 model.
export OPENAI_API_KEY=sk-xxxxxx
cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500 | cdp embed --ef openai
Cohere¶
To use this embedding function, you need to install the cohere python package.
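For example, with pip:
pip install cohere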
Cohere API Key
You need to have a Cohere API key to use this embedding function. You can get an API key by signing up for an account at Cohere.
The API key must be exported as the env variable COHERE_API_KEY=x4q...
Cohere Embedding Models
By default, if not specified, the embed-english-v3.0 model is used.
You can pass an optional --model=embed-english-light-v3.0 argument or set the env variable COHERE_MODEL_NAME=embed-multilingual-v3.0, which lets you choose which Cohere embeddings model to use.
More about available models can be found in Cohere's API docs.
The below command will read PDF files at the specified path, select the last document's page, chunk it to 100 characters, and embed each chunk using Cohere's embed-english-light-v3.0 model.
export COHERE_API_KEY=x4q
export COHERE_MODEL_NAME="embed-english-light-v3.0"
cdp imp pdf sample-data/papers/ | tail -1 | cdp chunk -s 100 | cdp embed --ef cohere
HuggingFace¶
HF API Token
You need to have a HuggingFace API token to use this embedding function.
Create or use one from your tokens page.
The token must be exported as the env variable HF_TOKEN=hf_xxxx.
HF Embedding Models
By default, if not specified, the sentence-transformers/all-MiniLM-L6-v2 model is used.
You can pass an optional --model=BAAI/bge-large-en-v1.5 argument or set the env variable HF_MODEL_NAME=BAAI/bge-large-en-v1.5, which lets you choose which HuggingFace embeddings model to use.
The below command will read PDF files at the specified path, select the first two pages, chunk them to 150 characters, select the last chunk, and embed it using the BAAI/bge-large-en-v1.5 model.
export HF_TOKEN=hf_xxxx
export HF_MODEL_NAME="BAAI/bge-large-en-v1.5"
cdp imp pdf sample-data/papers/ | head -2 | cdp chunk -s 150 | tail -1 | cdp embed --ef hf
SentenceTransformers¶
To use this embedding function, you need to install the sentence-transformers python package.
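Using pip, for example:
pip install sentence-transformers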
SentenceTransformers Embedding Models
By default, if not specified, the all-MiniLM-L6-v2 model is used.
You can pass an optional --model=BAAI/bge-large-en-v1.5 argument or set the env variable ST_MODEL_NAME=BAAI/bge-large-en-v1.5, which lets you choose which SentenceTransformers embeddings model to use.
The below command will read PDF files at the specified path, select the first two pages, chunk them to 150 characters, select the last chunk, and embed it using the BAAI/bge-small-en-v1.5 model.
export ST_MODEL_NAME="BAAI/bge-small-en-v1.5"
cdp imp pdf sample-data/papers/ | head -2 | cdp chunk -s 150 | tail -1 | cdp embed --ef st
Google Generative AI Embedding (Gemini)¶
To use the Google Generative AI Embedding (Gemini) function, you need to install the google-generativeai python package.
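For example, via pip:
pip install google-generativeai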
Google API Key
You need to have a Google API key to use this embedding function.
To manage your keys go to Maker Suite.
The API key must be exported as the env variable GEMINI_API_KEY=xxxx.
Models
By default, if not specified, the models/embedding-001 model is used.
You can pass an optional --model=models/embedding-001 argument or set the env variable GEMINI_MODEL_NAME=models/embedding-001, which lets you choose which Gemini embeddings model to use.
Task Type
The embedding function also supports a task type parameter. By default we use RETRIEVAL_DOCUMENT. For more details, visit the Gemini API Docs.
The below command will read PDF files at the specified path, select the first two pages, chunk them to 150 characters, select the last chunk, and embed it using the models/embedding-001 model.
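A sketch of such a command, assuming the function is selected with --ef gemini (the flag value is an assumption, mirroring the other embedding functions in this section):
export GEMINI_API_KEY=xxxx
cdp imp pdf sample-data/papers/ | head -2 | cdp chunk -s 150 | tail -1 | cdp embed --ef gemini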
Ollama Embeddings¶
To use the Ollama embedding function, you need to run an Ollama server; see the instructions in the official Ollama GH repo.
Models
By default, if not specified, the chroma/all-minilm-l6-v2-f32 model is used.
You can pass an optional --model=nomic-embed-text argument or set the env variable OLLAMA_MODEL_NAME=nomic-embed-text, which lets you choose which Ollama embeddings model to use.
Embedding URL
By default, the embedding function will try to connect to an Ollama server running at the http://localhost:11434/api/embeddings endpoint. If you wish to override that, you can export the env var OLLAMA_EMBED_URL.
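For example, to point the embedding function at a remote Ollama server (the host name is a placeholder):
export OLLAMA_EMBED_URL=http://remote-host:11434/api/embeddings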
The below command will read PDF files at the specified path, select the first two pages, chunk them to 150 characters, select the last chunk, and embed it using the nomic-embed-text model.
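A sketch of such a command, using the ollama function name from the list above:
export OLLAMA_MODEL_NAME="nomic-embed-text"
cdp imp pdf sample-data/papers/ | head -2 | cdp chunk -s 150 | tail -1 | cdp embed --ef ollama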