Building Efficient Knowledge Graphs with Relik and LlamaIndex: Entity Linking & Relationship Extraction

Introduction to Knowledge Graph Construction

Constructing knowledge graphs from unstructured text is a cornerstone of modern data analysis and artificial intelligence, enabling deeper insights and more sophisticated applications. Traditionally, this process involved a complex pipeline of specialized models, each dedicated to a specific task like coreference resolution, named entity recognition (NER), entity linking, and relationship extraction. While this modular approach offered control and potential cost savings through fine-tuning smaller models, it often led to integration challenges and increased development overhead.

The advent of Large Language Models (LLMs) has revolutionized information extraction, offering more streamlined and powerful capabilities. However, the computational cost and complexity associated with deploying large LLMs can be prohibitive for many applications. This has spurred a renewed interest in efficient, specialized models that can deliver high performance without the substantial resource requirements of general-purpose LLMs. The Relik framework, combined with LlamaIndex and Neo4j, presents a compelling solution for building robust knowledge graphs in a cost-effective and efficient manner.

This tutorial will guide you through the process of leveraging Relik for entity linking and relationship extraction within the LlamaIndex framework, storing the extracted information in a Neo4j graph database. We will cover the entire workflow, from setting up your environment and preparing your data to constructing the knowledge graph and querying it for insights. This approach is particularly beneficial for Retrieval-Augmented Generation (RAG) applications, where a well-structured knowledge graph can significantly enhance the quality and relevance of generated responses.

The Traditional Information Extraction Pipeline

Before diving into the Relik-based approach, it is instructive to understand the components of a typical information extraction pipeline:

  • Coreference Resolution: This step identifies all expressions in a text that refer to the same entity. For instance, recognizing that "He" refers back to "Tomaz" when both mentions describe the same person.
  • Named Entity Recognition (NER): NER identifies and categorizes named entities in the text, such as persons, organizations, locations, and dates.
  • Entity Linking: Following NER, entity linking maps the recognized entities to unique identifiers in a knowledge base or database. This disambiguates entities and enriches them with external information. For example, linking "Tomaz" to a specific entry in a database.
  • Relationship Extraction: This crucial step identifies and classifies the semantic relationships between entities. For example, understanding that "Tomaz WRITES Blog Posts" or "Tomaz IS INTERESTED IN Diagrams."

Traditionally, each of these steps would be handled by a separate, specialized model. While this allowed for fine-tuning and cost optimization, the integration and orchestration of these models presented significant engineering challenges. The Relik framework, integrated with LlamaIndex, offers a more unified and efficient solution.
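To make the four stages concrete, here is a minimal sketch in plain Python. The values are hand-written illustrations of what each stage might produce for a short text, not the output of any real model, and the knowledge-base identifier is hypothetical:

```python
# Hand-written illustration of the four pipeline stages for one short text.
text = "Tomaz published a blog post. He is interested in diagrams."

# 1. Coreference resolution: "He" is replaced by the entity it refers to.
resolved = "Tomaz published a blog post. Tomaz is interested in diagrams."

# 2. Named entity recognition: spans in the text and their types.
entities = [{"span": "Tomaz", "label": "PERSON"}]

# 3. Entity linking: map each span to a knowledge-base identifier
#    (the "kb_id" value here is a made-up placeholder).
linked = [{"span": "Tomaz", "label": "PERSON", "kb_id": "person/tomaz"}]

# 4. Relationship extraction: (head, relation, tail) triples between entities.
triples = [
    ("Tomaz", "WRITES", "blog post"),
    ("Tomaz", "IS_INTERESTED_IN", "diagrams"),
]

for head, rel, tail in triples:
    print(f"({head})-[:{rel}]->({tail})")
```

The triples at the end are exactly the shape of data a graph database such as Neo4j stores natively, which is why relationship extraction is the step that turns raw text into a graph.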

Environment Setup

To begin, ensure you have the necessary libraries installed. You will need llama-index, neo4j, pandas, spacy, and coreferee. The following code snippets assume you have a Neo4j database instance running and accessible.
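As a setup sketch, the dependencies can be installed with pip. The Relik extractor ships as a separate LlamaIndex integration package; the package names below follow the LlamaIndex integration naming scheme and may need adjusting to your installed version:

```shell
# Core libraries used in this tutorial
pip install llama-index llama-index-graph-stores-neo4j pandas spacy coreferee

# Relik integration for LlamaIndex (name assumed from the
# llama-index integration naming scheme; adjust if your version differs)
pip install llama-index-extractors-relik

# spaCy English model used later for coreference resolution
python -m spacy download en_core_web_lg
```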

Connecting to Neo4j with LlamaIndex

First, establish a connection to your Neo4j database using LlamaIndex:

from llama_index.graph_stores.neo4j import Neo4jPGStore

username = "neo4j"
password = "rubber-cuffs-radiator"
url = "bolt://54.89.19.156:7687"

graph_store = Neo4jPGStore(
    username=username,
    password=password,
    url=url,
    refresh_schema=False
)

Dataset Preparation

Next, load your dataset. For this tutorial, we will use a CSV file containing news articles.

import pandas as pd

NUMBER_OF_ARTICLES = 100
news = pd.read_csv(
    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
)
news = news.head(NUMBER_OF_ARTICLES)
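The downloaded frame is a plain pandas DataFrame. Assuming the CSV carries "title" and "text" columns (an assumption about this dataset's layout), the two can be combined so that each article is processed as a single passage downstream. A self-contained sketch with a stand-in row:

```python
import pandas as pd

# Stand-in for the downloaded dataset; the real CSV is assumed to have
# "title" and "text" columns.
news = pd.DataFrame(
    {
        "title": ["Graphs in practice"],
        "text": ["Knowledge graphs can power RAG applications."],
    }
)

# Combine title and body so each article becomes one passage.
news["combined"] = news["title"] + " " + news["text"]
print(news["combined"].iloc[0])
```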

Coreference Resolution

The first step in our pipeline is coreference resolution. This task is crucial for identifying all expressions in a text that refer to the same entity. We will use spaCy and the Coreferee model for this purpose.

import spacy
import coreferee

coref_nlp = spacy.load("en_core_web_lg")  # assumes the en_core_web_lg model is installed
coref_nlp.add_pipe("coreferee")  # registers Coreferee's coreference component in the pipeline
