Astra DB
This page provides a quickstart for using Astra DB and Apache Cassandra® as a Vector Store.
Note: in addition to access to the database, an OpenAI API Key is required to run the full example.
Setup and general dependencies
Use of the integration requires the following Python package.
pip install --quiet "astrapy>=0.5.3"
Note: depending on your LangChain setup, you may need to install or upgrade other dependencies needed for this demo (specifically, recent versions of datasets, openai, pypdf and tiktoken are required).
import os
from getpass import getpass
from datasets import load_dataset
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter
os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY = ")
embe = OpenAIEmbeddings()
Keep reading to connect with Astra DB. For usage with Apache Cassandra and Astra DB through CQL, scroll to the section below.
Astra DB
DataStax Astra DB is a serverless vector-capable database built on Cassandra and made conveniently available through an easy-to-use JSON API.
from langchain.vectorstores import AstraDB
Astra DB connection parameters
- the API Endpoint looks like https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com
- the Token looks like AstraCS:6gBhNmsk135....
ASTRA_DB_API_ENDPOINT = input("ASTRA_DB_API_ENDPOINT = ")
ASTRA_DB_APPLICATION_TOKEN = getpass("ASTRA_DB_APPLICATION_TOKEN = ")
vstore = AstraDB(
    embedding=embe,
    collection_name="astra_vector_demo",
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
)
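Since both connection parameters are typed in by hand, a quick format check before constructing the store can catch copy-paste mistakes early. The following is a minimal sketch: the regular expression is an assumption inferred only from the example endpoint shown above, and the helper names `looks_like_astra_endpoint` and `looks_like_astra_token` are hypothetical (they are not part of astrapy or LangChain).

```python
import re

# Assumed shape, inferred from the example endpoint above: a UUID-style
# database ID, a hyphen, a region name, then the Astra application domain.
ENDPOINT_PATTERN = re.compile(
    r"^https://[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"
    r"-[a-z0-9-]+\.apps\.astra\.datastax\.com/?$"
)


def looks_like_astra_endpoint(endpoint: str) -> bool:
    """Return True if the string matches the assumed endpoint shape."""
    return bool(ENDPOINT_PATTERN.match(endpoint.strip()))


def looks_like_astra_token(token: str) -> bool:
    """Return True if the token starts with the 'AstraCS:' prefix."""
    return token.strip().startswith("AstraCS:")
```

A failed check only means the value does not resemble the examples on this page; the database itself remains the authority on whether the credentials are valid.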
Load a dataset
Convert each entry in the source dataset into a Document, then write them into the vector store:
philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]
docs = []
for entry in philo_dataset:
    metadata = {"author": entry["author"]}
    doc = Document(page_content=entry["quote"], metadata=metadata)
    docs.append(doc)
inserted_ids = vstore.add_documents(docs)
print(f"\nInserted {len(inserted_ids)} documents.")
In the above, metadata dictionaries are created from the source data and are part of the Document.
Note: check the Astra DB API Docs for the valid metadata field names: some characters are reserved and cannot be used.
Add some more entries, this time with add_texts:
texts = ["I think, therefore I am.", "To the things themselves!"]
metadatas = [{"author": "descartes"}, {"author": "husserl"}]
ids = ["desc_01", "huss_xy"]
inserted_ids_2 = vstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)
print(f"\nInserted {len(inserted_ids_2)} documents.")
Note: you may want to speed up the execution of add_texts and add_documents by increasing the concurrency level for these bulk operations. Check out the *_concurrency parameters in the class constructor and the add_texts docstrings for more details. Depending on the network and the client machine specifications, your best-performing choice of parameters may vary.
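To make the effect of a concurrency level concrete, here is a minimal standard-library sketch of batched, concurrent insertion. It does not reproduce the vector store's internals: `chunked`, `insert_concurrently` and `fake_insert_batch` are hypothetical names, and the insert function is a stand-in for a network call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split a sequence into consecutive batches of at most batch_size items."""
    return [list(items[i : i + batch_size]) for i in range(0, len(items), batch_size)]


def insert_concurrently(
    items: Sequence[str],
    insert_batch: Callable[[List[str]], List[str]],
    batch_size: int = 20,
    concurrency: int = 4,
) -> List[str]:
    """Send batches to insert_batch from a thread pool; return all IDs in order."""
    batches = chunked(items, batch_size)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves the order of the input batches.
        results = list(pool.map(insert_batch, batches))
    return [doc_id for batch_ids in results for doc_id in batch_ids]


def fake_insert_batch(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for a network call: derive one ID per text."""
    return [f"id_{text}" for text in batch]


ids = insert_concurrently([str(n) for n in range(50)], fake_insert_batch)
print(len(ids))
```

With real network calls, a higher `concurrency` lets several batches be in flight at once, which is why the best value depends on the network and on the client machine.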
Run simple searches
This section demonstrates metadata filtering and getting the similarity scores back:
results = vstore.similarity_search("Our life is what we make of it", k=3)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
results_filtered = vstore.similarity_search(
    "Our life is what we make of it",
    k=3,
    filter={"author": "plato"},
)
for res in results_filtered:
    print(f"* {res.page_content} [{res.metadata}]")
results = vstore.similarity_search_with_score("Our life is what we make of it", k=3)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]")
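As a conceptual aid, the following toy sketch shows what a filtered similarity search with scores amounts to: keep the entries whose metadata match the filter, rank them by similarity to the query vector, and return the top k. It assumes cosine similarity and in-memory data purely for illustration; the actual search runs inside the database, and `search_with_score` is a hypothetical name.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Entry = Tuple[str, Dict[str, str], Vector]  # (text, metadata, embedding)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_with_score(
    query_vec: Vector,
    entries: List[Entry],
    k: int = 3,
    metadata_filter: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, Dict[str, str], float]]:
    """Filter by exact metadata match, then return the k most similar entries."""
    wanted = metadata_filter or {}
    candidates = [
        entry
        for entry in entries
        if all(entry[1].get(key) == value for key, value in wanted.items())
    ]
    scored = [
        (text, meta, cosine_similarity(query_vec, vec))
        for text, meta, vec in candidates
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:k]


# Toy two-dimensional "embeddings", made up for this example.
entries = [
    ("quote a", {"author": "plato"}, [1.0, 0.0]),
    ("quote b", {"author": "aristotle"}, [0.8, 0.6]),
    ("quote c", {"author": "plato"}, [0.0, 1.0]),
]
unfiltered = search_with_score([1.0, 0.0], entries, k=2)
only_plato = search_with_score([1.0, 0.0], entries, k=2, metadata_filter={"author": "plato"})
```

Here `unfiltered` ranks "quote a" and "quote b" first, while `only_plato` skips "quote b" even though it is closer to the query than "quote c".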
MMR (Maximal-marginal-relevance) search
results = vstore.max_marginal_relevance_search(
    "Our life is what we make of it",
    k=3,
    filter={"author": "aristotle"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
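For readers unfamiliar with MMR, the sketch below illustrates the general idea under simple assumptions: results are picked one at a time, and each pick trades off relevance to the query against similarity to the results already chosen. It is a textbook-style greedy selection over toy vectors, not the code the vector store runs; `mmr_select`, `unit` and the `lambda_mult` weight as used here belong to this sketch only.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(
    query: Vector,
    candidates: Sequence[Vector],
    k: int = 3,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick k candidate indices, balancing relevance and diversity."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def unit(degrees: float) -> List[float]:
    """Unit vector in the plane at the given angle."""
    radians = math.radians(degrees)
    return [math.cos(radians), math.sin(radians)]


query = unit(0)
candidates = [unit(10), unit(12), unit(-40)]  # the first two are near-duplicates
print(mmr_select(query, candidates, k=2))  # [0, 2]
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]
```

With equal weights, the second pick skips the near-duplicate at 12 degrees in favor of the more distinct vector; with `lambda_mult=1.0` the selection reduces to plain similarity ranking.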
Deleting stored documents
delete_1 = vstore.delete(inserted_ids[:3])
print(f"all_succeed={delete_1}") # True, all documents deleted
delete_2 = vstore.delete(inserted_ids[2:5])
print(f"some_succeeds={delete_2}") # True, though some IDs were gone already
A minimal RAG chain
The next cells will implement a simple RAG pipeline:
- download a sample PDF file and load it onto the store;
- create a RAG chain with LCEL (LangChain Expression Language), with the vector store at its heart;
- run the question-answering chain.
curl -L \
"https://github.com/awesome-astra/datasets/blob/main/demo-resources/what-is-philosophy/what-is-philosophy.pdf?raw=true" \
-o "what-is-philosophy.pdf"
pdf_loader = PyPDFLoader("what-is-philosophy.pdf")
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
docs_from_pdf = pdf_loader.load_and_split(text_splitter=splitter)
print(f"Documents from PDF: {len(docs_from_pdf)}.")
inserted_ids_from_pdf = vstore.add_documents(docs_from_pdf)
print(f"Inserted {len(inserted_ids_from_pdf)} documents.")
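The chunk_size and chunk_overlap settings used above can be pictured with a much simpler splitter. The sketch below cuts fixed-size character windows that share an overlap with their predecessor. This is an assumption-laden simplification: RecursiveCharacterTextSplitter also tries to break on natural separators, which this hypothetical `split_with_overlap` helper does not attempt.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> List[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary is still partly present in both neighbors, which tends to help retrieval at the cost of storing some text twice.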
retriever = vstore.as_retriever(search_kwargs={"k": 3})
philo_template = """
You are a philosopher that draws inspiration from great thinkers of the past
to craft well-thought answers to user questions. Use the provided context as the basis
for your answers and do not make up new reasoning paths - just mix-and-match what you are given.
Your answers must be concise and to the point, and refrain from answering about other topics than philosophy.
CONTEXT:
{context}
QUESTION: {question}
YOUR ANSWER:"""
philo_prompt = ChatPromptTemplate.from_template(philo_template)
llm = ChatOpenAI()
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | philo_prompt
    | llm
    | StrOutputParser()
)
chain.invoke("How does Russell elaborate on Peirce's idea of the security blanket?")
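The data flow of the chain above can be sketched in plain Python, with no LangChain and no network access. This is only an illustration of the wiring: the same question goes to the retriever and straight through to the prompt, and the filled prompt goes to the model. All names (`make_chain`, `fake_retrieve`, `fake_render`, `fake_llm`) are hypothetical stand-ins, and the retrieved items are simplified to plain strings rather than Document objects.

```python
from typing import Callable, List


def make_chain(
    retrieve: Callable[[str], List[str]],
    render_prompt: Callable[[str, str], str],
    call_llm: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose retrieval, prompt rendering and the model call into one function."""

    def chain(question: str) -> str:
        # Mirrors {"context": retriever, "question": RunnablePassthrough()}:
        # the question feeds the retriever and is also passed through unchanged.
        context = "\n".join(retrieve(question))
        prompt = render_prompt(context, question)
        return call_llm(prompt)

    return chain


def fake_retrieve(question: str) -> List[str]:
    """Stand-in retriever returning fixed snippets."""
    return ["snippet one", "snippet two"]


def fake_render(context: str, question: str) -> str:
    """Stand-in prompt template using the same placeholders as above."""
    return f"CONTEXT:\n{context}\nQUESTION: {question}\nYOUR ANSWER:"


def fake_llm(prompt: str) -> str:
    """Stand-in model that echoes its prompt."""
    return "ECHO: " + prompt


toy_chain = make_chain(fake_retrieve, fake_render, fake_llm)
answer = toy_chain("What is virtue?")
print(answer)
```

Swapping any of the three stand-ins for a real component leaves the shape of the pipeline unchanged, which is the main point of composing the steps this way.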
For more, check out a complete RAG template using Astra DB here.
Cleanup
If you want to completely delete the collection from your Astra DB instance, run this.
(You will lose the data you stored in it.)
vstore.delete_collection()
Apache Cassandra and Astra DB through CQL
Cassandra is a NoSQL, row-oriented, highly scalable and highly available database. Starting with version 5.0, the database ships with vector search capabilities.
DataStax Astra DB through CQL is a managed serverless database built on Cassandra, offering the same interface and strengths.
What sets this case apart from "Astra DB" above?
Thanks to LangChain having a standardized VectorStore interface, most of the "Astra DB" section above applies to this case as well. However, this time the database uses the CQL protocol, which means you'll use a different class and instantiate it in another way.
The cells below show how you should get your vstore object in this case and how you can clean up the database resources at the end. For the rest, i.e. the actual usage of the vector store, you will be able to run the very code that was shown above.
In other words, running this demo in full with Cassandra or Astra DB through CQL means:
- initialization as shown below
- "Load a dataset", see above section
- "Run simple searches", see above section
- "MMR search", see above section
- "Deleting stored documents", see above section
- "A minimal RAG chain", see above section
- cleanup as shown below
Initialization
The class to use is the following:
from langchain.vectorstores import Cassandra
Now, depending on whether you connect to a Cassandra cluster or to Astra DB through CQL, you will provide different parameters when creating the vector store object.
Initialization (Cassandra cluster)
In this case, you first need to create a cassandra.cluster.Session object, as described in the Cassandra driver documentation. The details vary (e.g. with network settings and authentication), but this might be something like:
from cassandra.cluster import Cluster
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
You can now set the session, along with your desired keyspace name, as a global CassIO parameter:
import cassio
CASSANDRA_KEYSPACE = input("CASSANDRA_KEYSPACE = ")
cassio.init(session=session, keyspace=CASSANDRA_KEYSPACE)
Now you can create the vector store:
vstore = Cassandra(
    embedding=embe,
    table_name="cassandra_vector_demo",
    # session=None, keyspace=None  # Uncomment on older versions of LangChain
)
Initialization (Astra DB through CQL)
In this case you initialize CassIO with the following connection parameters:
- the Database ID, e.g. 01234567-89ab-cdef-0123-456789abcdef
- the Token, e.g. AstraCS:6gBhNmsk135.... (it must be a "Database Administrator" token)
- optionally, a Keyspace name (if omitted, the default one for the database will be used)
ASTRA_DB_ID = input("ASTRA_DB_ID = ")
ASTRA_DB_APPLICATION_TOKEN = getpass("ASTRA_DB_APPLICATION_TOKEN = ")
desired_keyspace = input("ASTRA_DB_KEYSPACE (optional, can be left empty) = ")
if desired_keyspace:
    ASTRA_DB_KEYSPACE = desired_keyspace
else:
    ASTRA_DB_KEYSPACE = None
import cassio
cassio.init(
    database_id=ASTRA_DB_ID,
    token=ASTRA_DB_APPLICATION_TOKEN,
    keyspace=ASTRA_DB_KEYSPACE,
)
Now you can create the vector store:
vstore = Cassandra(
    embedding=embe,
    table_name="cassandra_vector_demo",
    # session=None, keyspace=None  # Uncomment on older versions of LangChain
)
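When the demo runs non-interactively (for instance in a script), the three connection parameters could be read from the environment instead of being typed in. The sketch below is one possible way to do that. Using `ASTRA_DB_ID`, `ASTRA_DB_APPLICATION_TOKEN` and `ASTRA_DB_KEYSPACE` as environment variable names is an assumption of this example, and `astra_cql_settings` is a hypothetical helper, not part of CassIO.

```python
import os
from typing import Dict, Mapping, Optional


def astra_cql_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Collect the connection parameters, treating a blank keyspace as 'not set'."""
    source = os.environ if env is None else env
    # An empty or whitespace-only keyspace becomes None, as in the prompt-based code.
    keyspace = source.get("ASTRA_DB_KEYSPACE", "").strip() or None
    return {
        "database_id": source["ASTRA_DB_ID"],  # raises KeyError if missing
        "token": source["ASTRA_DB_APPLICATION_TOKEN"],  # raises KeyError if missing
        "keyspace": keyspace,
    }
```

The returned keys match the keyword arguments passed to cassio.init above, so the result could be unpacked as `cassio.init(**astra_cql_settings())`.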
Usage of the vector store
See the sections "Load a dataset" through "A minimal RAG chain" above.
Speaking of the latter, you can check out a full RAG template for Astra DB through CQL here.
Cleanup
The following essentially retrieves the Session object from CassIO and runs a CQL DROP TABLE statement with it:
cassio.config.resolve_session().execute(
    f"DROP TABLE {cassio.config.resolve_keyspace()}.cassandra_vector_demo;"
)
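Because the statement above is assembled with string formatting, it can be worth checking the keyspace and table names before executing it. The following is a minimal sketch of such a check: `drop_table_statement` is a hypothetical helper, it only accepts unquoted identifiers (a letter followed by letters, digits or underscores), and the optional IF EXISTS clause makes the statement safe to repeat.

```python
import re

# Unquoted CQL identifiers: a letter followed by letters, digits or underscores.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def drop_table_statement(keyspace: str, table: str, if_exists: bool = True) -> str:
    """Build a DROP TABLE statement after validating both identifiers."""
    for name in (keyspace, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain CQL identifier: {name!r}")
    clause = "IF EXISTS " if if_exists else ""
    return f"DROP TABLE {clause}{keyspace}.{table};"


print(drop_table_statement("my_keyspace", "cassandra_vector_demo"))
# DROP TABLE IF EXISTS my_keyspace.cassandra_vector_demo;
```

The resulting string could then be passed to the execute call shown above in place of the inline f-string.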
Learn more
For more information, extended quickstarts and additional usage examples, please visit the CassIO documentation on using the LangChain Cassandra vector store.