Maintained by deepset

Integration: spaCy

Annotate named entities in your Haystack pipelines with spaCy models

Authors

deepset

Overview

spaCy is a popular open-source library for Natural Language Processing in Python. The spacy-haystack integration provides the SpacyNamedEntityExtractor, which uses spaCy models to recognize named entities — such as people, organizations, and locations — and attach them to your documents.

Installation

Install the spacy-haystack package:

pip install spacy-haystack

Usage

Components

This integration provides one component:

SpacyNamedEntityExtractor: annotates named entities in documents using a spaCy model.

When initializing the component, you must specify the spaCy model to load. Optionally, you can pass pipeline_kwargs (forwarded to the spaCy pipeline) and a device to run the model on.

Standalone

The component works with any spaCy model that contains an NER component. SpacyNamedEntityExtractor accepts a list of Documents, annotates their text, and stores the result in each document’s meta under the named_entities key. Use the get_stored_annotations helper to read the annotations back. Each annotation carries an entity label and start and end character offsets, which you can use to slice the entity text out of the document content:

from haystack import Document
from haystack_integrations.components.extractors.spacy import SpacyNamedEntityExtractor

extractor = SpacyNamedEntityExtractor(model="en_core_web_sm")

documents = [
    Document(content="My name is Clara and I live in Berkeley, California."),
    Document(content="New York State is home to the Empire State Building."),
]

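# run() returns a dictionary; the annotated documents are under the "documents" key.
results = extractor.run(documents=documents)["documents"]

for doc in results:
    print(doc.content)
    # Each annotation holds an entity label and character offsets into doc.content.
    for ann in SpacyNamedEntityExtractor.get_stored_annotations(doc):
        print(f"  {ann.entity}: {doc.content[ann.start:ann.end]}")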

# My name is Clara and I live in Berkeley, California.
#   PERSON: Clara
#   GPE: Berkeley
#   GPE: California
# New York State is home to the Empire State Building.
#   GPE: New York State
#   ORG: the Empire State Building
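
Because each annotation carries a label and character offsets, grouping a document's entities by label takes only a few lines. The following is a minimal, dependency-free sketch rather than part of spacy-haystack: Annotation is a hypothetical stand-in for the objects returned by get_stored_annotations, assumed to expose the same entity, start, and end fields used in the example above, and group_entities is an illustrative helper name.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical stand-in for a stored annotation (assumed fields: entity, start, end)."""
    entity: str
    start: int
    end: int


def group_entities(content, annotations):
    """Map each entity label to the list of text spans carrying that label."""
    grouped = defaultdict(list)
    for ann in annotations:
        # Slice the original text with the character offsets to recover the span.
        grouped[ann.entity].append(content[ann.start:ann.end])
    return dict(grouped)


content = "My name is Clara and I live in Berkeley, California."
# Offsets written by hand to mirror the output shown above.
annotations = [
    Annotation("PERSON", 11, 16),
    Annotation("GPE", 31, 39),
    Annotation("GPE", 41, 51),
]

print(group_entities(content, annotations))
# {'PERSON': ['Clara'], 'GPE': ['Berkeley', 'California']}
```

With the real component, the same helper could presumably be called as group_entities(doc.content, SpacyNamedEntityExtractor.get_stored_annotations(doc)), provided the returned annotations expose those three fields.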

Pipeline

The most common place for the extractor is right after the preprocessing step of an indexing pipeline, so that the entities are stored alongside the documents you write to a Document Store:

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.extractors.spacy import SpacyNamedEntityExtractor

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", TextFileToDocument())
pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))
pipeline.add_component("extractor", SpacyNamedEntityExtractor(model="en_core_web_sm"))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "extractor")
pipeline.connect("extractor", "writer")

# "document.txt" is a placeholder; replace it with the path to your own text file.
pipeline.run({"converter": {"sources": ["document.txt"]}})

# Each stored document now carries its named entities in meta["named_entities"].
print(document_store.filter_documents()[0].meta["named_entities"])
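
Once entities are stored alongside the documents, one possible use is a lookup from an entity to the documents that mention it. The sketch below is an illustration under stated assumptions, not an API of the integration: build_entity_index is a hypothetical helper, the documents are simple stand-in objects with id, content, and meta attributes, and the annotations are read through a caller-supplied function so that no particular storage format inside meta is assumed.

```python
from collections import defaultdict
from types import SimpleNamespace


def build_entity_index(documents, get_annotations):
    """Map (label, entity text) pairs to the ids of the documents that mention them."""
    index = defaultdict(set)
    for doc in documents:
        for ann in get_annotations(doc):
            # Recover the entity text from the character offsets.
            text = doc.content[ann.start:ann.end]
            index[(ann.entity, text)].add(doc.id)
    return index


def ann(entity, start, end):
    """Hypothetical stand-in for a stored annotation; offsets are written by hand."""
    return SimpleNamespace(entity=entity, start=start, end=end)


# Stand-in documents, not real Haystack Document objects.
docs = [
    SimpleNamespace(id="a", content="Clara lives in Berkeley.",
                    meta={"named_entities": [ann("PERSON", 0, 5), ann("GPE", 15, 23)]}),
    SimpleNamespace(id="b", content="Berkeley is in California.",
                    meta={"named_entities": [ann("GPE", 0, 8), ann("GPE", 15, 25)]}),
]

index = build_entity_index(docs, lambda d: d.meta["named_entities"])
print(sorted(index[("GPE", "Berkeley")]))  # ['a', 'b']
```

With documents from the pipeline above, passing SpacyNamedEntityExtractor.get_stored_annotations as the second argument would be the natural choice, since that helper is the documented way to read the annotations back.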

License

spacy-haystack is distributed under the terms of the Apache-2.0 license.