Ollama
Ollama allows you to run open-source large language models, such as Llama 2, locally.
Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile.
It optimizes setup and configuration details, including GPU usage.
For a complete list of supported models and model variants, see the Ollama model library.
Setup
First, follow these instructions to set up and run a local Ollama instance:
- Download and install Ollama.
- Fetch a model via ollama pull <model family>
  - e.g., for Llama-7b: ollama pull llama2 (see full list here)
  - This typically downloads the most basic version of the model (e.g., the smallest number of parameters and q4_0 quantization).
  - On Mac, it will download to ~/.ollama/models/manifests/registry.ollama.ai/library/<model family>/latest
- You can also specify a particular version, e.g., ollama pull vicuna:13b-v1.5-16k-q4_0
  - The file is then at the same path, with the model version in place of latest: ~/.ollama/models/manifests/registry.ollama.ai/library/vicuna/13b-v1.5-16k-q4_0
You can easily access models in a few ways:

1/ If the app is running:
- All of your local models are automatically served on localhost:11434.
- Select your model when setting llm = Ollama(..., model="<model family>:<version>").
- If you set llm = Ollama(..., model="<model family>") without a version, it will simply look for latest.

2/ If building from source or just running the binary:
- Then you must run ollama serve.
- All of your local models are then automatically served on localhost:11434.
- Then, select the model as shown above (a quick connectivity check is sketched below).
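As a quick sanity check, you can confirm the local server is reachable before wiring it into LangChain. This is a minimal sketch, not part of the Ollama or LangChain docs; it assumes the default endpoint http://localhost:11434, that the llama2 model has already been pulled, and that the requests package is installed.

import requests
from langchain.llms import Ollama

# The Ollama server responds to a plain GET on its root with a short status message.
resp = requests.get("http://localhost:11434")
print(resp.status_code, resp.text)

# Point the LangChain wrapper at a specific, already-pulled model tag.
llm = Ollama(base_url="http://localhost:11434", model="llama2")
print(llm("Respond with one short sentence."))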
Usage
You can see a full list of supported parameters on the API reference page.
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import Ollama
llm = Ollama(
model="llama2", callback_manager=CallbackManager([StreamingStdOutCallbackHandler()])
)
With StreamingStdOutCallbackHandler, you will see tokens streamed.
llm("Tell me about the history of AI")
Ollama supports embeddings via OllamaEmbeddings:
from langchain.embeddings import OllamaEmbeddings
oembed = OllamaEmbeddings(base_url="http://localhost:11434", model="llama2")
oembed.embed_query("Llamas are social animals and live with others as a herd.")
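For batches of text, OllamaEmbeddings also exposes embed_documents, which returns one vector per input. A brief sketch along the same lines as above:

from langchain.embeddings import OllamaEmbeddings

oembed = OllamaEmbeddings(base_url="http://localhost:11434", model="llama2")

# embed_documents takes a list of texts and returns a list of embedding vectors.
texts = [
    "Llamas are social animals and live with others as a herd.",
    "Alpacas are closely related to llamas.",
]
vectors = oembed.embed_documents(texts)
print(len(vectors), len(vectors[0]))  # number of texts, embedding dimension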
RAG
We can use Ollama with RAG, just as shown here.
Let's use the 13b model:
ollama pull llama2:13b
Let's also use local embeddings from OllamaEmbeddings and Chroma.
pip install chromadb
# Load web page
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()
# Split into chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
all_splits = text_splitter.split_documents(data)
# Embed and store
from langchain.embeddings import (
GPT4AllEmbeddings,
OllamaEmbeddings, # We can also try Ollama embeddings
)
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())
Found model file at /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin
# Retrieve
question = "How can Task Decomposition be done?"
docs = vectorstore.similarity_search(question)
len(docs)
4
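Each hit is a Document with page_content and metadata, so you can inspect what was retrieved before passing it to the LLM. A small sketch:

# Peek at the top retrieved chunk and where it came from.
print(docs[0].page_content[:200])
print(docs[0].metadata)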
# RAG prompt
from langchain import hub
QA_CHAIN_PROMPT = hub.pull("rlm/rag-prompt-llama")
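To see exactly what will be sent to the model, you can print the pulled prompt before using it in the chain. A quick sketch:

# The hub object is a prompt template; printing it shows the Llama-style
# instruction format and its input variables.
print(QA_CHAIN_PROMPT)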
# LLM
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import Ollama
llm = Ollama(
model="llama2",
verbose=True,
callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
)
# QA chain
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
llm,
retriever=vectorstore.as_retriever(),
chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)
question = "What are the various approaches to Task Decomposition for AI Agents?"
result = qa_chain({"query": question})
There are several approaches to task decomposition for AI agents, including:
1. Chain of thought (CoT): This involves instructing the model to "think step by step" and use more test-time computation to decompose hard tasks into smaller and simpler steps.
2. Tree of thoughts (ToT): This extends CoT by exploring multiple reasoning possibilities at each step, creating a tree structure. The search process can be BFS or DFS with each state evaluated by a classifier or majority vote.
3. Using task-specific instructions: For example, "Write a story outline." for writing a novel.
4. Human inputs: The agent can receive input from a human operator to perform tasks that require creativity and domain expertise.
These approaches allow the agent to break down complex tasks into manageable subgoals, enabling efficient handling of tasks and improving the quality of final results through self-reflection and refinement.
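The streamed text above comes from the callback; the chain also returns the answer in its output dictionary. A minimal sketch, assuming RetrievalQA's default output key:

# With default settings, RetrievalQA puts the answer under the "result" key.
print(result["result"])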
You can also log generation statistics, such as token counts and timings, with a custom callback.
from langchain.callbacks.base import BaseCallbackHandler
from langchain.schema import LLMResult
class GenerationStatisticsCallback(BaseCallbackHandler):
def on_llm_end(self, response: LLMResult, **kwargs) -> None:
print(response.generations[0][0].generation_info)
callback_manager = CallbackManager(
[StreamingStdOutCallbackHandler(), GenerationStatisticsCallback()]
)
llm = Ollama(
base_url="http://localhost:11434",
model="llama2",
verbose=True,
callback_manager=callback_manager,
)
qa_chain = RetrievalQA.from_chain_type(
llm,
retriever=vectorstore.as_retriever(),
chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)
question = "What are the approaches to Task Decomposition?"
result = qa_chain({"query": question})
eval_count / (eval_duration / 1e9) gives tokens per second (eval_duration is reported in nanoseconds):
62 / (1313002000 / 1000 / 1000 / 1000)
47.22003469910937
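To compute this automatically, the callback above can derive tokens per second from the fields it already prints. A small sketch, assuming the generation_info dict contains Ollama's eval_count and eval_duration (nanoseconds) fields shown above:

from langchain.callbacks.base import BaseCallbackHandler
from langchain.schema import LLMResult

class TokensPerSecondCallback(BaseCallbackHandler):
    def on_llm_end(self, response: LLMResult, **kwargs) -> None:
        info = response.generations[0][0].generation_info or {}
        # eval_count = tokens generated; eval_duration = generation time in ns.
        if "eval_count" in info and "eval_duration" in info:
            print(f'{info["eval_count"] / (info["eval_duration"] / 1e9):.2f} tokens/sec')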
Using the Hub for prompt management
Open-source models often benefit from specific prompts.
For example, Mistral 7B was fine-tuned for chat using the prompt format shown here.
Get the model: ollama pull mistral:7b-instruct
# LLM
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import Ollama
llm = Ollama(
model="mistral:7b-instruct",
verbose=True,
callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
)
from langchain import hub
QA_CHAIN_PROMPT = hub.pull("rlm/rag-prompt-mistral")
# QA chain
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
llm,
retriever=vectorstore.as_retriever(),
chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)
question = "What are the various approaches to Task Decomposition for AI Agents?"
result = qa_chain({"query": question})
There are different approaches to Task Decomposition for AI Agents such as Chain of thought (CoT) and Tree of Thoughts (ToT). CoT breaks down big tasks into multiple manageable tasks and generates multiple thoughts per step, while ToT explores multiple reasoning possibilities at each step. Task decomposition can be done by LLM with simple prompting or using task-specific instructions or human inputs.