Home¶
ChromaDB Data Pipes is a collection of tools to build data pipelines for Chroma DB, inspired by the Unix philosophy of "do one thing and do it well".
Roadmap:
- ✅ Integration with LangChain 🦜🔗
- 🚫 Integration with LlamaIndex 🦙
- ✅ Support more than all-MiniLM-L6-v2 as embedding functions (head over to Embedding Processors for more info)
- 🚫 Multimodal support
- ♾️ Much more!
Installation¶
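Install from PyPI (a minimal sketch, assuming the package is published under the same name as the project, chromadb-data-pipes):
pip install chromadb-data-pipes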
Usage¶
Get help:
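Assuming the standard --help flag exposed by the cdp CLI:
cdp --help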
.env Files
CDP supports loading environment variables from .env files. You can create a .env file in the root of your project (or wherever you run CDP commands) and add environment-specific variables on new lines in the form of NAME=VALUE.
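For example, a .env file providing the HuggingFace token referenced in the export note below (the value is a placeholder):
HF_TOKEN=hf_....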
Example Use Cases¶
This is a short list of use cases to evaluate whether this is the right tool for your needs:
- Importing large datasets from local documents (PDF, TXT, etc.), from HuggingFace, from a local persisted Chroma DB, or even from another remote Chroma DB.
- Exporting large datasets to HuggingFace or to any other data format supported by the library (if your format is not supported, either implement it in a small function or open an issue).
- Creating a dataset from your data that you can share with others (including the embeddings).
- Cloning a collection with a different embedding function, distance function, and other HNSW fine-tuning parameters.
- Re-embedding documents in a collection with a different embedding function.
- Backing up your data to a .jsonl file.
- Using other existing Unix tools to transform your data after exporting from, or before importing into, Chroma DB.
Importing¶
Import data from HuggingFace Datasets to a .jsonl file:
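A minimal sketch, reusing the tazarov/chroma-qna dataset from the example below and redirecting the ds-get output stream to a file:
cdp ds-get "hf://tazarov/chroma-qna?split=train" > chroma-qna.jsonl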
Import data from HuggingFace Datasets to Chroma DB Server:
The below command will import the train split of the given dataset into the chroma-qna collection. The collection will be created if it does not exist, and documents will be upserted.
cdp ds-get "hf://tazarov/chroma-qna?split=train" | cdp import "http://localhost:8000/chroma-qna" --upsert --create
Importing from a directory with PDF files into Local Persisted Chroma DB:
cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500 | cdp embed --ef default | cdp import "file://chroma-data/my-pdfs" --upsert --create
Note
The above command will import the first PDF file from the sample-data/papers/ directory, chunk it into 500-word chunks, embed each chunk, and import the chunks into the my-pdfs collection in Chroma DB.
Exporting¶
Export data from Local Persisted Chroma DB to a .jsonl file:
The below command will export the first 10 documents from the chroma-qna collection to a chroma-qna.jsonl file.
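A minimal sketch, assuming the collection lives in the chroma-data directory (the file:// address format shown in the import examples also applies to export):
cdp export "file://chroma-data/chroma-qna" --limit 10 > chroma-qna.jsonl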
Export data from Local Persisted Chroma DB to a .jsonl file with a filter:
The below command will export data from a local persisted Chroma DB to a .jsonl file, using a where filter to select the documents to export.
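A sketch of what this can look like; the --where flag name and its JSON metadata-filter value are assumptions not confirmed by this page, so check cdp export --help for the exact option:
cdp export "file://chroma-data/chroma-qna" --where '{"category": "qna"}' > chroma-qna.jsonl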
Export data from Chroma DB Server to HuggingFace Datasets:
The below command will export the first 10 documents, with an offset of 10, from the chroma-qna collection to the HuggingFace Datasets tazarov/chroma-qna-modified dataset. The dataset will be uploaded to HF.
Note
Make sure you have the HF_TOKEN=hf_.... environment variable set. If you want your dataset to be private, add the --private flag to the cdp ds-put command.
cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "hf://tazarov/chroma-qna-modified"
To export a dataset to a file, use --uri with the file:// prefix:
cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "file://chroma-qna"
Note
The file path is relative to the current working directory.
Processing¶
Copy documents from one Chroma collection to another and re-embed them:
cdp export "http://localhost:8000/chroma-qna" | cdp embed --ef default | cdp import "http://localhost:8000/chroma-qna-def-emb" --upsert --create
Embeddings Processor
See Embedding Processors for more info about supported embedding functions.
Import a dataset from HF to Local Persisted Chroma and embed the documents:
cdp ds-get "hf://tazarov/ds2?split=train" | cdp embed --ef default | cdp import "file://chroma-data/chroma-qna-def-emb-hf" --upsert --create
Chunk Large Documents:
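A minimal sketch, assuming an existing my-pdfs collection and reusing the cdp chunk and cdp embed steps from the PDF import example to split oversized documents (and re-embed the resulting chunks) before importing them into a new collection:
cdp export "file://chroma-data/my-pdfs" | cdp chunk -s 500 | cdp embed --ef default | cdp import "file://chroma-data/my-pdfs-chunked" --upsert --create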
Misc¶
Count the number of documents in a collection:
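One way to do this with only the commands shown on this page, assuming each exported record occupies a single jsonl line: stream the collection and count lines with standard Unix tools.
cdp export "http://localhost:8000/chroma-qna" | wc -l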