[Paper Reproduction] HippoRAG & HippoRAG2

Model architecture:

Reference project: OSU-NLP-Group/HippoRAG

Installation

conda create -n hipporag python=3.10
conda activate hipporag
pip install hipporag

Download the dataset: huggingface-cli download --repo-type dataset osunlp/HippoRAG_2 --local-dir dataset

Download an embedding model (NV-Embed, GritLM, or Contriever): huggingface-cli download nvidia/NV-Embed-v2 --local-dir model/NV-Embed-v2

Download the LLM: huggingface-cli download nreHieW/Llama-3.1-8B-Instruct --local-dir model/Llama-3.1-8B-Instruct
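The same three downloads can also be scripted from Python. The sketch below assumes the huggingface_hub package is installed (it is the package that provides huggingface-cli) and relies on its snapshot_download function. DOWNLOADS and download_all are illustrative names, not part of HippoRAG.

```python
# Repositories and target folders, copied from the CLI commands above.
DOWNLOADS = [
    {"repo_id": "osunlp/HippoRAG_2", "repo_type": "dataset", "local_dir": "dataset"},
    {"repo_id": "nvidia/NV-Embed-v2", "repo_type": "model", "local_dir": "model/NV-Embed-v2"},
    {
        "repo_id": "nreHieW/Llama-3.1-8B-Instruct",
        "repo_type": "model",
        "local_dir": "model/Llama-3.1-8B-Instruct",
    },
]


def download_all(downloads=DOWNLOADS):
    """Fetch every repository into its local directory (hypothetical helper)."""
    # Imported lazily so the list above can be inspected without the package.
    from huggingface_hub import snapshot_download

    for spec in downloads:
        snapshot_download(**spec)
```

Calling download_all() starts the downloads. The model weights are large, so it is probably worth checking free disk space first.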

start.py

from hipporag import HippoRAG
import argparse

# Toy corpus: each string is indexed as one passage.
docs = [
    "Oliver Badman is a politician.",
    "George Rankin is a politician.",
    "Thomas Marwick is a politician.",
    "Cinderella attended the royal ball.",
    "The prince used the lost glass slipper to search the kingdom.",
    "When the slipper fit perfectly, Cinderella was reunited with the prince.",
    "Erik Hort's birthplace is Montebello.",
    "Marina is bom in Minsk.",
    "Montebello is a part of Rockland County."
]

parser = argparse.ArgumentParser(description="HippoRAG retrieval and QA")
parser.add_argument('mode', type=str, choices=['online', 'offline'], help='Mode')
args = parser.parse_args()

# online: hosted OpenAI-compatible API; offline: local vLLM server.
if args.mode == 'online':
    llm_model_name, llm_base_url = 'deepseek-r1', 'https://dashscope.aliyuncs.com/compatible-mode/v1'
else:
    llm_model_name, llm_base_url = 'model/Llama-3.1-8B-Instruct', 'http://localhost:8000/v1'

save_dir = f'outputs/{args.mode}'
embedding_model_name = 'model/NV-Embed-v2'

hipporag = HippoRAG(save_dir=save_dir,
                    llm_model_name=llm_model_name,
                    llm_base_url=llm_base_url,
                    embedding_model_name=embedding_model_name)

# Build the index (OpenIE, embeddings, graph) for the corpus.
hipporag.index(docs=docs)

queries = [
    "What is George Rankin's occupation?",
    "How did Cinderella reach her happy ending?",
    "What county is Erik Hort's birthplace a part of?"
]

# Retrieval only.
# QuerySolution: question, docs(2), doc_scores(2)
retrieval_results = hipporag.retrieve(queries=queries, num_to_retrieve=2)
print(f'Retrieval results: {retrieval_results}')

# QA on top of the retrieval results obtained above.
# QuerySolution: question, docs(2), doc_scores(2), answer; Predict_Answer; Tokens: prompt_tokens, completion_tokens, finish_reason
qa_results = hipporag.rag_qa(retrieval_results)
print(f'QA results: {qa_results}')

# Retrieval and QA in a single call.
# QuerySolution: question, docs, doc_scores, answer; Predict_Answer; Tokens: prompt_tokens, completion_tokens, finish_reason
rag_results = hipporag.rag_qa(queries=queries)
print(f'RAG QA results: {rag_results}')

answers = [
    ["Politician"],
    ["By going to the ball."],
    ["Rockland County"]
]

gold_docs = [
    ["George Rankin is a politician."],
    ["Cinderella attended the royal ball.",
     "The prince used the lost glass slipper to search the kingdom.",
     "When the slipper fit perfectly, Cinderella was reunited with the prince."],
    ["Erik Hort's birthplace is Montebello.",
     "Montebello is a part of Rockland County."]
]

# Retrieval and QA with evaluation against gold passages and gold answers.
# QuerySolution: question, docs, doc_scores, answer; Predict_Answer; Tokens: prompt_tokens, completion_tokens, finish_reason; Recall: Recall@1, Recall@2, Recall@5, Recall@10, Recall@20, Recall@30, Recall@50, Recall@100, Recall@150, Recall@200; Evaluation: ExactMatch, F1
rag_results = hipporag.rag_qa(queries=queries, gold_docs=gold_docs, gold_answers=answers)
print(f'RAG QA results: {rag_results}')

Execution flow:

graph TD
    A["Load the LLM (Loading checkpoint shards)"] --> B["Named entity recognition (NER)"]
    B --> C["Extract triples (Extracting triples)"]
    C --> D["(Batch Encoding) KNN for Queries"]
    D --> E[Retrieving]
    E --> F[Collecting QA prompts]
    F --> G[QA Reading]
    G --> H[Extraction Answers from LLM Response]

Generated files:

${llm_model_name}_${embedding_model_name}/
    chunk_embeddings/vdb_chunk.parquet
    entity_embeddings/vdb_chunk.parquet
    fact_embeddings/vdb_chunk.parquet
    graph.graphml

llm_cache/
    ${llm_model_name}_cache.sqlite
    ${llm_model_name}_cache.sqlite.lock

openie_results_ner_${llm_model_name}.json

Structure of openie_results_ner_${llm_model_name}.json:

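{
    "docs": [
        {
            "idx": "passage identifier",
            "passage": "passage text",
            "extracted_entities": ["entity", ...],
            "extracted_triples": [["triple"], ...]
        },
        ...
    ],
    "avg_ent_chars": average number of characters across all extracted entities,
    "avg_ent_words": average number of words across all extracted entities
}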
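To make the two summary fields concrete, here is a minimal sketch that recomputes them from the per-passage entity lists. It assumes both values are plain means over every extracted entity and that words are split on whitespace. The entity_stats helper and the sample record are hand-written for illustration, not real HippoRAG output.

```python
import json


def entity_stats(openie):
    """Recompute the two summary fields from the per-passage entity lists."""
    entities = [e for doc in openie["docs"] for e in doc["extracted_entities"]]
    if not entities:
        return {"avg_ent_chars": 0.0, "avg_ent_words": 0.0}
    return {
        "avg_ent_chars": sum(len(e) for e in entities) / len(entities),
        "avg_ent_words": sum(len(e.split()) for e in entities) / len(entities),
    }


# Hand-written sample in the layout described above (assumed values).
sample = json.loads("""
{
  "docs": [
    {"idx": "chunk-0",
     "passage": "Montebello is a part of Rockland County.",
     "extracted_entities": ["Montebello", "Rockland County"],
     "extracted_triples": [["Montebello", "part of", "Rockland County"]]}
  ],
  "avg_ent_chars": 12.5,
  "avg_ent_words": 1.5
}
""")

stats = entity_stats(sample)
print(stats)
```

To check a real run, load the generated file with json.load and pass the resulting dictionary to entity_stats.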

Calling the API

Call the DeepSeek-R1 API through Alibaba Cloud Bailian (阿里云百炼)

export OPENAI_API_KEY="Your API Key"
conda activate hipporag
python start.py online

Output:

{
    'num_phrase_nodes': 17,
    'num_passage_nodes': 9,
    'num_total_nodes': 26,
    'num_extracted_triples': 13,
    'num_triples_with_passage_node': 22,
    'num_synonymy_triples': 15,
    'num_total_triples': 50
}
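The two totals in these statistics look like plain sums of the other counts: 17 + 9 = 26 nodes and 13 + 22 + 15 = 50 triples. The sketch below checks that reading against this run and against the numbers printed by the local vLLM run later in this post. It is an observation about the printed values, not something verified in the library source, and totals_are_consistent is a hypothetical helper.

```python
online = {
    "num_phrase_nodes": 17, "num_passage_nodes": 9, "num_total_nodes": 26,
    "num_extracted_triples": 13, "num_triples_with_passage_node": 22,
    "num_synonymy_triples": 15, "num_total_triples": 50,
}
offline = {
    "num_phrase_nodes": 16, "num_passage_nodes": 9, "num_total_nodes": 25,
    "num_extracted_triples": 13, "num_triples_with_passage_node": 20,
    "num_synonymy_triples": 13, "num_total_triples": 46,
}


def totals_are_consistent(stats):
    """True if both totals equal the sum of their component counts."""
    nodes = stats["num_phrase_nodes"] + stats["num_passage_nodes"]
    triples = (
        stats["num_extracted_triples"]
        + stats["num_triples_with_passage_node"]
        + stats["num_synonymy_triples"]
    )
    return nodes == stats["num_total_nodes"] and triples == stats["num_total_triples"]


print(totals_are_consistent(online), totals_are_consistent(offline))
```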

Retrieval results: [
    QuerySolution(
        question="What is George Rankin's occupation?",
        docs=['George Rankin is a politician.', 'Thomas Marwick is a politician.'],
        doc_scores=array([0.10445492, 0.02884537]),
        answer=None,
        gold_answers=None,
        gold_docs=None
    ),
    QuerySolution(
        question='How did Cinderella reach her happy ending?',
        docs=['When the slipper fit perfectly, Cinderella was reunited with the prince.', 'Cinderella attended the royal ball.'],
        doc_scores=array([0.04911817, 0.04404111]),
        answer=None,
        gold_answers=None,
        gold_docs=None
    ),
    QuerySolution(
        question="What county is Erik Hort's birthplace a part of?",
        docs=["Erik Hort's birthplace is Montebello.", 'Montebello is a part of Rockland County.'],
        doc_scores=array([0.09849677, 0.05840253]),
        answer=None,
        gold_answers=None,
        gold_docs=None
    )
]

QA results: (
[
QuerySolution(question="What is George Rankin's occupation?", docs=['George Rankin is a politician.', 'Thomas Marwick is a politician.'], doc_scores=array([0.10445492, 0.02884537]), answer='politician.', gold_answers=None, gold_docs=None),
QuerySolution(question='How did Cinderella reach her happy ending?', docs=['When the slipper fit perfectly, Cinderella was reunited with the prince.', 'Cinderella attended the royal ball.'], doc_scores=array([0.04911817, 0.04404111]), answer='Cinderella reached her happy ending when her glass slipper fit perfectly, leading the prince to reunite with and marry her.', gold_answers=None, gold_docs=None),
QuerySolution(question="What county is Erik Hort's birthplace a part of?", docs=["Erik Hort's birthplace is Montebello.", 'Montebello is a part of Rockland County.'], doc_scores=array([0.09849677, 0.05840253]), answer='Rockland County.', gold_answers=None, gold_docs=None)
],
[
'Answer: politician.',
'Answer: Cinderella reached her happy ending when her glass slipper fit perfectly, leading the prince to reunite with and marry her.',
'Answer: Rockland County.'
],
[
{'prompt_tokens': 703, 'completion_tokens': 135, 'finish_reason': 'stop'},
{'prompt_tokens': 712, 'completion_tokens': 284, 'finish_reason': 'stop'},
{'prompt_tokens': 712, 'completion_tokens': 166, 'finish_reason': 'stop'}
]
)

RAG QA results: (
[
QuerySolution(question="What is George Rankin's occupation?", docs=['George Rankin is a politician.', 'Thomas Marwick is a politician.', 'Oliver Badman is a politician.', 'Montebello is a part of Rockland County.', 'Marina is bom in Minsk.', "Erik Hort's birthplace is Montebello.", 'The prince used the lost glass slipper to search the kingdom.', 'When the slipper fit perfectly, Cinderella was reunited with the prince.', 'Cinderella attended the royal ball.'], doc_scores=array([1.04454916e-01, 2.88453687e-02, 2.86224239e-02, 2.37302870e-03, 1.60382181e-03, 1.54294019e-03, 1.37682343e-03, 1.12725136e-03, 2.00820154e-05]), answer='politician.', gold_answers=None, gold_docs=None),
QuerySolution(question='How did Cinderella reach her happy ending?', docs=['When the slipper fit perfectly, Cinderella was reunited with the prince.', 'Cinderella attended the royal ball.', 'The prince used the lost glass slipper to search the kingdom.', 'Marina is bom in Minsk.', 'Montebello is a part of Rockland County.', 'Thomas Marwick is a politician.', 'George Rankin is a politician.', "Erik Hort's birthplace is Montebello.", 'Oliver Badman is a politician.'], doc_scores=array([4.91181658e-02, 4.40411111e-02, 3.13502299e-02, 9.84442137e-04, 4.19267052e-04, 4.06979324e-04, 2.93263127e-04, 4.93737401e-05, 2.09027597e-05]), answer='The prince found her using the glass slipper that fit her perfectly.', gold_answers=None, gold_docs=None),
QuerySolution(question="What county is Erik Hort's birthplace a part of?", docs=["Erik Hort's birthplace is Montebello.", 'Montebello is a part of Rockland County.', 'Marina is bom in Minsk.', 'George Rankin is a politician.', 'The prince used the lost glass slipper to search the kingdom.', 'Thomas Marwick is a politician.', 'Oliver Badman is a politician.', 'When the slipper fit perfectly, Cinderella was reunited with the prince.', 'Cinderella attended the royal ball.'], doc_scores=array([9.84967740e-02, 5.84025250e-02, 2.67207424e-03, 1.66319251e-03, 1.45981989e-03, 1.28462349e-03, 3.96527292e-04, 1.85392503e-04, 7.33911131e-06]), answer='Rockland County.', gold_answers=None, gold_docs=None)
],
[
'Answer: politician.',
'Thought: Cinderella reached her happy ending by having the prince search for her using the glass slipper she lost at the royal ball. When the slipper fit her perfectly, they were reunited. \nAnswer: The prince found her using the glass slipper that fit her perfectly.',
'Answer: Rockland County.'
],
[
{'prompt_tokens': 737, 'completion_tokens': 99, 'finish_reason': 'stop'},
{'prompt_tokens': 752, 'completion_tokens': 240, 'finish_reason': 'stop'},
{'prompt_tokens': 748, 'completion_tokens': 203, 'finish_reason': 'stop'}
]
)
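Judging from the printed output, rag_qa returns a tuple of three lists: the QuerySolution objects, the raw LLM responses, and per-query token metadata. The answer field appears to be the text that follows the "Answer:" marker in the raw response. The sketch below reproduces that relationship on the responses printed above. extract_answer and total_tokens are hypothetical helpers that mirror what the output suggests, not HippoRAG's own functions.

```python
def extract_answer(response):
    """Return the text after the last 'Answer:' marker, or the whole response."""
    marker = "Answer:"
    if marker in response:
        return response.split(marker)[-1].strip()
    return response.strip()


def total_tokens(metadata):
    """Sum prompt and completion tokens over all queries."""
    return {
        "prompt_tokens": sum(m["prompt_tokens"] for m in metadata),
        "completion_tokens": sum(m["completion_tokens"] for m in metadata),
    }


# Raw responses and token metadata copied from the output above.
raw_responses = [
    "Answer: politician.",
    "Thought: Cinderella reached her happy ending by having the prince search "
    "for her using the glass slipper she lost at the royal ball. When the "
    "slipper fit her perfectly, they were reunited. \nAnswer: The prince found "
    "her using the glass slipper that fit her perfectly.",
    "Answer: Rockland County.",
]
metadata = [
    {"prompt_tokens": 737, "completion_tokens": 99, "finish_reason": "stop"},
    {"prompt_tokens": 752, "completion_tokens": 240, "finish_reason": "stop"},
    {"prompt_tokens": 748, "completion_tokens": 203, "finish_reason": "stop"},
]

answers = [extract_answer(r) for r in raw_responses]
usage = total_tokens(metadata)
print(answers)
print(usage)
```

With a real run, the same helpers could be applied after unpacking the result, for example `solutions, responses, metadata = hipporag.rag_qa(queries=queries)`, assuming the three-element tuple shown above.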

RAG QA results: (
[
QuerySolution(question="What is George Rankin's occupation?", docs=['George Rankin is a politician.', 'Thomas Marwick is a politician.', 'Oliver Badman is a politician.', 'Montebello is a part of Rockland County.', 'Marina is bom in Minsk.', "Erik Hort's birthplace is Montebello.", 'The prince used the lost glass slipper to search the kingdom.', 'When the slipper fit perfectly, Cinderella was reunited with the prince.', 'Cinderella attended the royal ball.'], doc_scores=array([1.04454916e-01, 2.88453687e-02, 2.86224239e-02, 2.37302870e-03, 1.60382181e-03, 1.54294019e-03, 1.37682343e-03, 1.12725136e-03, 2.00820154e-05]), answer='politician.', gold_answers=['Politician'], gold_docs=['George Rankin is a politician.']),
QuerySolution(question='How did Cinderella reach her happy ending?', docs=['When the slipper fit perfectly, Cinderella was reunited with the prince.', 'Cinderella attended the royal ball.', 'The prince used the lost glass slipper to search the kingdom.', 'Marina is bom in Minsk.', 'Montebello is a part of Rockland County.', 'Thomas Marwick is a politician.', 'George Rankin is a politician.', "Erik Hort's birthplace is Montebello.", 'Oliver Badman is a politician.'], doc_scores=array([4.91181658e-02, 4.40411111e-02, 3.13502299e-02, 9.84442137e-04, 4.19267052e-04, 4.06979324e-04, 2.93263127e-04, 4.93737401e-05, 2.09027597e-05]), answer='The prince found her using the glass slipper that fit her perfectly.', gold_answers=['By going to the ball.'], gold_docs=['Cinderella attended the royal ball.', 'The prince used the lost glass slipper to search the kingdom.', 'When the slipper fit perfectly, Cinderella was reunited with the prince.']),
QuerySolution(question="What county is Erik Hort's birthplace a part of?", docs=["Erik Hort's birthplace is Montebello.", 'Montebello is a part of Rockland County.', 'Marina is bom in Minsk.', 'George Rankin is a politician.', 'The prince used the lost glass slipper to search the kingdom.', 'Thomas Marwick is a politician.', 'Oliver Badman is a politician.', 'When the slipper fit perfectly, Cinderella was reunited with the prince.', 'Cinderella attended the royal ball.'], doc_scores=array([9.84967740e-02, 5.84025250e-02, 2.67207424e-03, 1.66319251e-03, 1.45981989e-03, 1.28462349e-03, 3.96527292e-04, 1.85392503e-04, 7.33911131e-06]), answer='Rockland County.', gold_answers=['Rockland County'], gold_docs=["Erik Hort's birthplace is Montebello.", 'Montebello is a part of Rockland County.'])
],
[
'Answer: politician.',
'Thought: Cinderella reached her happy ending by having the prince search for her using the glass slipper she lost at the royal ball. When the slipper fit her perfectly, they were reunited. \nAnswer: The prince found her using the glass slipper that fit her perfectly.',
'Answer: Rockland County.'
],
[
{'prompt_tokens': 737, 'completion_tokens': 99, 'finish_reason': 'stop'},
{'prompt_tokens': 752, 'completion_tokens': 240, 'finish_reason': 'stop'},
{'prompt_tokens': 748, 'completion_tokens': 203, 'finish_reason': 'stop'}
],
{'Recall@1': 0.6111, 'Recall@2': 0.8889, 'Recall@5': 1.0, 'Recall@10': 1.0, 'Recall@20': 1.0, 'Recall@30': 1.0, 'Recall@50': 1.0, 'Recall@100': 1.0, 'Recall@150': 1.0, 'Recall@200': 1.0},
{'ExactMatch': 0.6667, 'F1': 0.6667}
)
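The Recall@k values can be reproduced by hand. The sketch below assumes that recall for one query is the fraction of its gold passages found among the top-k retrieved passages, and that the reported number is the mean over queries rounded to four decimals. Under that assumption it yields the Recall@1, Recall@2 and Recall@5 values printed above. recall_at_k and mean_recall are illustrative helpers, not the library's evaluation code.

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of gold passages that appear among the top-k retrieved ones."""
    top_k = set(retrieved[:k])
    return sum(doc in top_k for doc in gold) / len(gold)


def mean_recall(all_retrieved, all_gold, k):
    """Average recall@k over all queries, rounded like the printed metrics."""
    scores = [recall_at_k(r, g, k) for r, g in zip(all_retrieved, all_gold)]
    return round(sum(scores) / len(scores), 4)


# Top-3 ranked passages per query, copied from the docs fields printed above.
retrieved = [
    ["George Rankin is a politician.",
     "Thomas Marwick is a politician.",
     "Oliver Badman is a politician."],
    ["When the slipper fit perfectly, Cinderella was reunited with the prince.",
     "Cinderella attended the royal ball.",
     "The prince used the lost glass slipper to search the kingdom."],
    ["Erik Hort's birthplace is Montebello.",
     "Montebello is a part of Rockland County.",
     "Marina is bom in Minsk."],
]
gold_docs = [
    ["George Rankin is a politician."],
    ["Cinderella attended the royal ball.",
     "The prince used the lost glass slipper to search the kingdom.",
     "When the slipper fit perfectly, Cinderella was reunited with the prince."],
    ["Erik Hort's birthplace is Montebello.",
     "Montebello is a part of Rockland County."],
]

for k in (1, 2, 5):
    print(f"Recall@{k}: {mean_recall(retrieved, gold_docs, k)}")
```

Recall@1 stays below 1 because the second and third queries each need more than one gold passage, so a single retrieved passage can cover at most 1/3 and 1/2 of them respectively.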

Local deployment with vLLM

If you hit an OOM (out of memory) error, adjust gpu-memory-utilization and max_model_len to fit your GPU memory: vllm serve model/Llama-3.1-8B-Instruct --tensor-parallel-size 2 --max_model_len 4096 --gpu-memory-utilization 0.95 --dtype half

Run: python start.py offline

Output:

{'num_phrase_nodes': 16, 'num_passage_nodes': 9, 'num_total_nodes': 25, 'num_extracted_triples': 13, 'num_triples_with_passage_node': 20, 'num_synonymy_triples': 13, 'num_total_triples': 46}

The printed structure is similar to that of the online run, so only two of the results are shown:

QA results: (
[
QuerySolution(question="What is George Rankin's occupation?", docs=['George Rankin is a politician.', 'Thomas Marwick is a politician.'], doc_scores=array([0.10445492, 0.02884537]), answer='Politician.', gold_answers=None, gold_docs=None),
QuerySolution(question='How did Cinderella reach her happy ending?', docs=['When the slipper fit perfectly, Cinderella was reunited with the prince.', 'Cinderella attended the royal ball.'], doc_scores=array([0.04447086, 0.04025739]), answer='She attended the royal ball and was reunited with the prince after the slipper fit perfectly.', gold_answers=None, gold_docs=None),
QuerySolution(question="What county is Erik Hort's birthplace a part of?", docs=["Erik Hort's birthplace is Montebello.", 'Montebello is a part of Rockland County.'], doc_scores=array([0.09898717, 0.05803498]), answer='Rockland County.', gold_answers=None, gold_docs=None)
],
[
"The text does not provide information about George Rankin's occupation. However, it is mentioned that Thomas Marwick is a politician, and George Rankin is also mentioned as a politician in the Wikipedia title. \nAnswer: Politician.",
"The provided text snippets do not contain information about Cinderella's journey to her happy ending. However, based on general knowledge of the Cinderella fairy tale, it is likely that Cinderella reached her happy ending by attending the royal ball, where she met the prince, and then being reunited with him after the slipper fit perfectly.\n\nAnswer: She attended the royal ball and was reunited with the prince after the slipper fit perfectly.",
"To determine the county Erik Hort's birthplace is a part of, we need to identify the birthplace as Montebello, and then find the county that Montebello is a part of. According to the text, Montebello is a part of Rockland County. \nAnswer: Rockland County."
],
[
{'prompt_tokens': 733, 'completion_tokens': 48, 'finish_reason': 'stop'},
{'prompt_tokens': 742, 'completion_tokens': 87, 'finish_reason': 'stop'},
{'prompt_tokens': 744, 'completion_tokens': 64, 'finish_reason': 'stop'}
]
)

RAG QA results: (
[
QuerySolution(question="What is George Rankin's occupation?", docs=['George Rankin is a politician.', 'Thomas Marwick is a politician.', 'Oliver Badman is a politician.', 'Montebello is a part of Rockland County.', 'Marina is bom in Minsk.', "Erik Hort's birthplace is Montebello.", 'The prince used the lost glass slipper to search the kingdom.', 'When the slipper fit perfectly, Cinderella was reunited with the prince.', 'Cinderella attended the royal ball.'], doc_scores=array([1.04454916e-01, 2.88453687e-02, 2.86224239e-02, 2.37302870e-03, 1.60382181e-03, 1.54294019e-03, 1.40683217e-03, 1.07797284e-03, 2.29945501e-05]), answer='Politician.', gold_answers=['Politician'], gold_docs=['George Rankin is a politician.']),
QuerySolution(question='How did Cinderella reach her happy ending?', docs=['When the slipper fit perfectly, Cinderella was reunited with the prince.', 'Cinderella attended the royal ball.', 'The prince used the lost glass slipper to search the kingdom.', 'Marina is bom in Minsk.', 'Montebello is a part of Rockland County.', 'Thomas Marwick is a politician.', 'George Rankin is a politician.', "Erik Hort's birthplace is Montebello.", 'Oliver Badman is a politician.'], doc_scores=array([4.44708555e-02, 4.02573902e-02, 2.10223824e-02, 9.48902014e-04, 4.04130761e-04, 3.92286642e-04, 2.82675803e-04, 4.75912597e-05, 2.01481327e-05]), answer='By attending the royal ball and the prince searching for her using the lost glass slipper.', gold_answers=['By going to the ball.'], gold_docs=['Cinderella attended the royal ball.', 'The prince used the lost glass slipper to search the kingdom.', 'When the slipper fit perfectly, Cinderella was reunited with the prince.']),
QuerySolution(question="What county is Erik Hort's birthplace a part of?", docs=["Erik Hort's birthplace is Montebello.", 'Montebello is a part of Rockland County.', 'Marina is bom in Minsk.', 'George Rankin is a politician.', 'The prince used the lost glass slipper to search the kingdom.', 'Thomas Marwick is a politician.', 'Oliver Badman is a politician.', 'When the slipper fit perfectly, Cinderella was reunited with the prince.', 'Cinderella attended the royal ball.'], doc_scores=array([9.89871677e-02, 5.80349811e-02, 2.69545125e-03, 1.67774318e-03, 1.51392444e-03, 1.29586220e-03, 3.99996366e-04, 1.57814418e-04, 4.49168290e-06]), answer='Rockland County.', gold_answers=['Rockland County'], gold_docs=["Erik Hort's birthplace is Montebello.", 'Montebello is a part of Rockland County.'])
],
[
"To determine George Rankin's occupation, we need to analyze the given information. The Wikipedia titles provided do not directly mention George Rankin's occupation. However, the fact that there are Wikipedia titles for George Rankin, Thomas Marwick, and Oliver Badman, all of which are politicians, suggests that George Rankin is also a politician.\n\nAnswer: Politician.",
"To answer this question, we need to analyze the given information. However, the provided Wikipedia titles do not directly mention Cinderella's journey to her happy ending. They only provide brief summaries of Cinderella's story.\n\nThe first title mentions Cinderella being reunited with the prince when the slipper fit perfectly, but it doesn't explain how she reached that point. The second title mentions Cinderella attending the royal ball, but it doesn't provide any context. The third title mentions the prince using the lost glass slipper to search the kingdom, which is a crucial part of the Cinderella story.\n\nSince the provided information is incomplete, we can't directly answer the question. However, based on the general knowledge of the Cinderella story, we can infer that Cinderella reached her happy ending by attending the royal ball, losing a glass slipper, and the prince searching for her using the slipper.\n\nAnswer: By attending the royal ball and the prince searching for her using the lost glass slipper.",
"To answer this question, we need to find the connection between Erik Hort's birthplace and the county. We know that Erik Hort's birthplace is Montebello, and Montebello is a part of Rockland County.\n\nTherefore, we can conclude that Erik Hort's birthplace is a part of Rockland County.\nAnswer: Rockland County."
],
[
{'prompt_tokens': 770, 'completion_tokens': 75, 'finish_reason': 'stop'},
{'prompt_tokens': 785, 'completion_tokens': 201, 'finish_reason': 'stop'},
{'prompt_tokens': 783, 'completion_tokens': 72, 'finish_reason': 'stop'}
],
{'Recall@1': 0.6111, 'Recall@2': 0.8889, 'Recall@5': 1.0, 'Recall@10': 1.0, 'Recall@20': 1.0, 'Recall@30': 1.0, 'Recall@50': 1.0, 'Recall@100': 1.0, 'Recall@150': 1.0, 'Recall@200': 1.0},
{'ExactMatch': 0.6667, 'F1': 0.7451}
)
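The ExactMatch and F1 numbers are consistent with SQuAD-style answer scoring: lowercase the text, strip punctuation and the articles a/an/the, then compare token bags. That normalization is an assumption inferred from the numbers, not confirmed in the library source, but under it the sketch below reproduces both 0.6667 and 0.7451 for this run. All function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, gold_list):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(pred) == normalize(g) for g in gold_list))


def f1(pred, gold_list):
    """Best token-level F1 between the prediction and any gold answer."""
    best = 0.0
    pred_tokens = normalize(pred).split()
    for gold in gold_list:
        gold_tokens = normalize(gold).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Predicted and gold answers copied from the output above.
preds = [
    "Politician.",
    "By attending the royal ball and the prince searching for her using the lost glass slipper.",
    "Rockland County.",
]
golds = [["Politician"], ["By going to the ball."], ["Rockland County"]]

em_score = round(sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(preds), 4)
f1_score = round(sum(f1(p, g) for p, g in zip(preds, golds)) / len(preds), 4)
print({"ExactMatch": em_score, "F1": f1_score})
```

The Cinderella answer shares only the tokens "by" and "ball" with the gold answer, giving a precision of 2/13, a recall of 2/4 and an F1 of 4/17, which is what lifts the mean F1 above the exact-match score.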

Plotting the GraphML file

import networkx as nx
import matplotlib.pyplot as plt

# Read the GraphML file
G = nx.read_graphml('outputs/graph.graphml')

# Draw the graph
nx.draw(G, with_labels=True, node_size=500, node_color='lightblue', font_size=10, font_weight='bold')

# Show the figure
plt.show()

Plot result:


http://xuan-van.github.io/代码复现/【论文复现】hipporag-hipporag2/
Author: 文晋
Published: March 27, 2025