Integration: Neo4j
Use the Neo4j database with Haystack
Overview
An integration of the Neo4j graph database with Haystack v2.0 by deepset. A Neo4j vector search index is used to index document embeddings and to perform dense retrieval.
The library allows using Neo4j as a DocumentStore and implements the required Protocol methods. You can start working with the implementation by importing it from the neo4j_haystack package:
from neo4j_haystack import Neo4jDocumentStore
In addition to the Neo4jDocumentStore, the library includes the following Haystack components which can be used in a pipeline:
- Neo4jEmbeddingRetriever - a typical retriever component which can be used to query the vector store index and find related Documents. The component uses Neo4jDocumentStore to query embeddings.
- Neo4jDynamicDocumentRetriever - also a retriever component, in the sense that it can be used to query Documents in Neo4j. However, it is decoupled from Neo4jDocumentStore and can run an arbitrary Cypher query to extract documents. In practice it can query Neo4j the same way Neo4jDocumentStore does, including vector search.
The neo4j-haystack library uses the Python Driver and Cypher Queries to interact with the Neo4j database and hides that complexity under the hood.
Neo4jDocumentStore stores Documents as graph nodes in Neo4j. Embeddings are stored as part of the node, but indexing and querying of vector embeddings using ANN is managed by a dedicated Vector Index.
                                   +-----------------------------+
                                   |       Neo4j Database        |
                                   +-----------------------------+
                                   |                             |
                                   |      +----------------+     |
                                   |      |    Document    |     |
                write_documents    |      +----------------+     |
          +------------------------+----->|   properties   |     |
          |                        |      |                |     |
+---------+----------+             |      |   embedding    |     |
|                    |             |      +--------+-------+     |
| Neo4jDocumentStore |             |               |             |
|                    |             |               |index/query  |
+---------+----------+             |               |             |
          |                        |      +--------+--------+    |
          |                        |      |  Vector Index   |    |
          +----------------------->|      |                 |    |
                query_embeddings   |      | (for embedding) |    |
                                   |      +-----------------+    |
                                   |                             |
                                   +-----------------------------+
In the above diagram:
- Document is a Neo4j node (with the "Document" label).
- properties are Document attributes stored as part of the node.
- embedding is also a property of the Document node (shown separately in the diagram for clarity); it is a vector of type LIST[FLOAT].
- Vector Index is where Neo4j indexes embeddings as soon as they are updated in Document nodes.
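To make the role of the Vector Index more concrete, the sketch below ranks stored embeddings against a query embedding by cosine similarity in plain Python. It is a brute-force illustration only: Neo4j's index relies on approximate nearest neighbour search and may normalise scores differently. The function names `cosine_similarity` and `top_k_by_cosine` and the toy 3-dimensional vectors are assumptions made for this example, not part of neo4j-haystack.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query: List[float], embeddings: Dict[str, List[float]], top_k: int
) -> List[Tuple[str, float]]:
    """Score every stored embedding against the query and keep the best top_k."""
    scored = [(doc_id, cosine_similarity(query, emb)) for doc_id, emb in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data: document id -> embedding (real embeddings would have e.g. 384 dimensions).
stored = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.7, 0.7, 0.0],
}
results = top_k_by_cosine([1.0, 0.1, 0.0], stored, top_k=2)
print([doc_id for doc_id, _ in results])
```

An exhaustive scan like this grows linearly with the number of documents, which is the cost a dedicated index is meant to avoid.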
Installation
neo4j-haystack can be installed like any other Python library, using pip:
pip install --upgrade pip # optional
pip install sentence-transformers # required in order to run pipeline examples given below
pip install neo4j-haystack
Usage
Once installed, you can start using Neo4jDocumentStore like any other document store that supports embeddings.
from neo4j_haystack import Neo4jDocumentStore
document_store = Neo4jDocumentStore(
    url="bolt://localhost:7687",
    username="neo4j",
    password="passw0rd",
    database="neo4j",
    embedding_dim=384,
    embedding_field="embedding",
    index="document-embeddings",  # The name of the Vector Index in Neo4j
    node_label="Document",  # Providing a label to Neo4j nodes which store Documents
)
Assuming a list of documents is available, you can write (index) them in Neo4j, e.g.:
from typing import List
from haystack import Document
documents: List[Document] = ...
document_store.write_documents(documents)
The full list of parameters accepted by Neo4jDocumentStore can be found in the API documentation.
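Connection settings such as the password are often kept out of source code. The sketch below collects the constructor arguments shown above from environment variables using only the standard library. The `NEO4J_*` variable names and the helper `document_store_kwargs` are hypothetical and not defined by neo4j-haystack; the defaults simply mirror the example values used in this document.

```python
import os
from typing import Any, Dict, Mapping


def document_store_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build keyword arguments for Neo4jDocumentStore from environment variables.

    The NEO4J_* names are hypothetical; adjust them to your deployment.
    """
    return {
        "url": env.get("NEO4J_URL", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", "passw0rd"),
        "database": env.get("NEO4J_DATABASE", "neo4j"),
        "embedding_dim": int(env.get("NEO4J_EMBEDDING_DIM", "384")),
        "index": env.get("NEO4J_INDEX", "document-embeddings"),
    }


# A plain dict stands in for os.environ here so the example is reproducible.
kwargs = document_store_kwargs({"NEO4J_PASSWORD": "s3cret", "NEO4J_EMBEDDING_DIM": "768"})
print(kwargs["url"], kwargs["embedding_dim"])
# With neo4j-haystack installed: document_store = Neo4jDocumentStore(**kwargs)
```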
Note that you need a running instance of the Neo4j database (the in-memory version of Neo4j is not supported). There are several options available:
- Docker (other options are described in the same Operations Manual)
- AuraDB - a fully managed Cloud Instance of Neo4j
- Neo4j Desktop client application
The simplest way to start the database locally is with a Docker container:
docker run \
--restart always \
--publish=7474:7474 --publish=7687:7687 \
--env NEO4J_AUTH=neo4j/passw0rd \
neo4j:5.15.0
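A freshly started container needs some time before it accepts connections. The sketch below polls a TCP port with the standard library until it becomes reachable. The helper `wait_for_port` is hypothetical, and an open port only shows that something is listening, not that Neo4j has finished starting. The demo uses a throwaway local listener so that it runs without a database.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Demo against a temporary local listener instead of a real Neo4j instance.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
free_port = server.getsockname()[1]
reachable = wait_for_port("127.0.0.1", free_port, timeout=2.0)
server.close()
print(reachable)
```

For the container started above, the equivalent call would be `wait_for_port("localhost", 7687)`, since 7687 is the published Bolt port.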
Retrieving documents
The Neo4jEmbeddingRetriever component can be used to retrieve documents from Neo4j by querying the vector index with an embedded query. Below is a pipeline which finds documents using a query embedding as well as metadata filtering:
from typing import List
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from neo4j_haystack import Neo4jEmbeddingRetriever, Neo4jDocumentStore
document_store = Neo4jDocumentStore(
    url="bolt://localhost:7687",
    username="neo4j",
    password="passw0rd",
    database="neo4j",
    embedding_dim=384,
    index="document-embeddings",
)
documents = [
    Document(content="My name is Morgan and I live in Paris.", meta={"release_date": "2018-12-09"})
]
# Embed the documents and write them to Neo4j
document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)
document_store.write_documents(documents_with_embeddings.get("documents"))
print(document_store.count_documents())
# Query pipeline: embed the query text, then search the vector index
pipeline = Pipeline()
pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
pipeline.add_component("retriever", Neo4jEmbeddingRetriever(document_store=document_store))
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
result = pipeline.run(
    data={
        "text_embedder": {"text": "What cities do people live in?"},
        "retriever": {
            "top_k": 5,
            "filters": {"field": "release_date", "operator": "==", "value": "2018-12-09"},
        },
    }
)
documents: List[Document] = result["retriever"]["documents"]
>>> output:
`[Document(id=e765764ab700b231db1eeae208d6a59047b4b93712d1a9e379ae9599128ffdbd, content: 'My name is Morgan and I live in Paris.', meta: {'release_date': '2018-12-09'}, score: 0.8416308164596558)]`
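The `filters` argument in the pipeline above is a comparison dictionary with `field`, `operator` and `value` keys; Haystack 2.x filters can also nest comparisons under a logical `AND`/`OR` operator with a `conditions` list. The sketch below evaluates such dictionaries against a document's metadata in plain Python to illustrate the intended semantics. It is not how neo4j-haystack implements filtering; the `matches` function, the operator subset and the handling of missing fields are assumptions made for this example.

```python
import operator
from typing import Any, Callable, Dict

# Subset of comparison operators mapped to Python equivalents (illustrative only).
COMPARATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda field_value, allowed: field_value in allowed,
}


def matches(meta: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True if the metadata satisfies a comparison or AND/OR filter dict."""
    op = filters["operator"]
    if op in ("AND", "OR"):
        results = [matches(meta, cond) for cond in filters["conditions"]]
        return all(results) if op == "AND" else any(results)
    if filters["field"] not in meta:
        return False  # assumption of this sketch: a missing field never matches
    return COMPARATORS[op](meta[filters["field"]], filters["value"])


meta = {"release_date": "2018-12-09"}
exact = {"field": "release_date", "operator": "==", "value": "2018-12-09"}
first_half_of_2018 = {
    "operator": "AND",
    "conditions": [
        {"field": "release_date", "operator": ">=", "value": "2018-01-01"},
        {"field": "release_date", "operator": "<", "value": "2018-06-01"},
    ],
}
print(matches(meta, exact))               # True
print(matches(meta, first_half_of_2018))  # False: December is outside the range
```

The range comparison works here because ISO-formatted date strings sort lexicographically in date order.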
Retrieving documents using Cypher
Neo4jDynamicDocumentRetriever is a flexible retriever component which can run a Cypher query to obtain documents. The Neo4jEmbeddingRetriever example above could be rewritten without using Neo4jDocumentStore:
from typing import List
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from neo4j_haystack import Neo4jClientConfig, Neo4jDynamicDocumentRetriever
# Connection settings used directly by the retriever (no Neo4jDocumentStore involved)
client_config = Neo4jClientConfig(
    url="bolt://localhost:7687",
    username="neo4j",
    password="passw0rd",
    database="neo4j",
)
cypher_query = """
    CALL db.index.vector.queryNodes($index, $top_k, $query_embedding)
    YIELD node as doc, score
    MATCH (doc) WHERE doc.release_date = $release_date
    RETURN doc{.*, score}, score
    ORDER BY score DESC LIMIT $top_k
"""
embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
# query_embedding is declared as a runtime parameter, so it becomes an input slot of the component
retriever = Neo4jDynamicDocumentRetriever(
    client_config=client_config, runtime_parameters=["query_embedding"], doc_node_name="doc"
)
pipeline = Pipeline()
pipeline.add_component("text_embedder", embedder)
pipeline.add_component("retriever", retriever)
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
result = pipeline.run(
    data={
        "text_embedder": {"text": "What cities do people live in?"},
        "retriever": {
            "query": cypher_query,
            "parameters": {"index": "document-embeddings", "top_k": 5, "release_date": "2018-12-09"},
        },
    }
)
documents: List[Document] = result["retriever"]["documents"]
Note how query parameters are used in the cypher_query:
- runtime_parameters is a list of parameter names which become input slots when connecting components in a pipeline. In our case the query_embedding input is connected to the text_embedder.embedding output.
- pipeline.run specifies additional parameters to the retriever component which can be referenced in the cypher_query, e.g. top_k.
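Because the query mixes values supplied in `pipeline.run` with values delivered through input slots, it can help to check that every `$placeholder` has a source before running the pipeline. The sketch below does this with a regular expression. The helper `missing_parameters` is hypothetical, the zero vector stands in for a real query embedding, and the assumption is that both kinds of values end up as parameters of the same Cypher query.

```python
import re
from typing import Any, Dict, Set


def missing_parameters(cypher_query: str, parameters: Dict[str, Any]) -> Set[str]:
    """Return the $placeholders used in the query that have no value supplied."""
    placeholders = set(re.findall(r"\$([A-Za-z_][A-Za-z0-9_]*)", cypher_query))
    return placeholders - set(parameters)


cypher_query = """
    CALL db.index.vector.queryNodes($index, $top_k, $query_embedding)
    YIELD node as doc, score
    MATCH (doc) WHERE doc.release_date = $release_date
    RETURN doc{.*, score}, score
    ORDER BY score DESC LIMIT $top_k
"""

# Values passed explicitly via pipeline.run(...).
static_parameters = {"index": "document-embeddings", "top_k": 5, "release_date": "2018-12-09"}
# Value delivered through the connected input slot at run time (placeholder vector).
runtime_values = {"query_embedding": [0.0] * 384}

print(missing_parameters(cypher_query, static_parameters))
print(missing_parameters(cypher_query, {**static_parameters, **runtime_values}))
```

The first call reports `query_embedding` as missing, which is exactly the name listed in `runtime_parameters`; once the runtime value is merged in, nothing is missing. The regular expression does not understand Cypher string literals, so a `$` inside a quoted string would be reported as a placeholder.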
More examples
You can find more examples in the implementation repository:
- indexing_pipeline.py - Indexing text files (documents) from a remote HTTP location.
- rag_pipeline.py - Generative question answering RAG pipeline using Neo4jEmbeddingRetriever to fetch documents from the Neo4j document store and answer questions using HuggingFaceTGIGenerator.
- rag_pipeline_cypher.py - Same as rag_pipeline.py but using Neo4jDynamicDocumentRetriever.
You can find more technical details in the Code Reference documentation. For example, real-world scenarios may require tuning the connection settings for the Neo4j database (e.g. request timeout). Neo4jDocumentStore accepts an extended client configuration via the Neo4jClientConfig class.
License
neo4j-haystack is distributed under the terms of the MIT license.