HiRAG: Retrieval-Augmented Generation with Hierarchical Knowledge

This is the repository for the paper HiRAG: Retrieval-Augmented Generation with Hierarchical Knowledge.
Accepted to EMNLP 2025 Findings! 🎉
Is re-indexing your knowledge base too expensive 🤯? Want to refine your knowledge base at test time? Check out our new work, DeepRefine!

# Clone this repository first
cd HiRAG
pip install -e .

You can use the following code to run a query with HiRAG.

graph_func = HiRAG(
working_dir="./your_work_dir",
enable_llm_cache=True,
...

To use HiRAG with DeepSeek, ChatGLM, or other third-party retrieval APIs, see the examples in ./hi_Search_deepseek.py, ./hi_Search_glm.py, and ./hi_Search_openai.py.

API keys and LLM settings are configured in ./config.yaml.

We use the procedure for the Mix dataset as an example.

cd ./HiRAG/eval

Extract the contexts from the original QA datasets.

python extract_context.py -i ./datasets/mix -o ./datasets/mix
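
The extraction step can be pictured roughly as follows. This is a minimal sketch, not the code in `extract_context.py`: it assumes each dataset line is a JSON object with a `context` field, and both the field name and the helper `unique_contexts` are hypothetical.

```python
import json
from collections.abc import Iterable


def unique_contexts(lines: Iterable[str], field: str = "context") -> list[str]:
    """Return the distinct context strings found in JSONL lines, in first-seen order.

    The `context` field name is an assumption about the dataset schema.
    """
    seen: dict[str, None] = {}  # a dict keeps insertion order and drops duplicates
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        text = record.get(field)
        if isinstance(text, str) and text:
            seen.setdefault(text, None)
    return list(seen)


if __name__ == "__main__":
    sample = [
        '{"input": "q1", "context": "doc A"}',
        '{"input": "q2", "context": "doc B"}',
        '{"input": "q3", "context": "doc A"}',
    ]
    print(unique_contexts(sample))
```

Deduplicating before insertion matters because several questions may share one source document, and inserting the same context repeatedly would only add indexing cost.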

Insert the contexts into the graph database.

python insert_context_deepseek.py

Note: the script insert_context_deepseek.py is set up for generation with the DeepSeek-v3 API. You can replace it with insert_context_openai.py or insert_context_glm.py.

Test with different versions of HiRAG.

# Several retrieval options are available
# To use the HiRAG approach, run:
python test_deepseek.py -d mix -m hi
...

Note: the dataset mix can be replaced with any other dataset from the Hugging Face link. The script test_deepseek.py is set up for generation with the DeepSeek-v3 API. You can replace it with test_openai.py or test_glm.py.
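
Since the three test scripts differ only in their provider suffix, the substitution described in the note can be sketched as a small command builder. The helper `build_test_command` and the `PROVIDERS` tuple are illustrative names introduced here. Only the script names and the `-d` / `-m` flags come from the commands shown in this README.

```python
import shlex

# Provider suffixes of the scripts named in this README: test_<provider>.py
PROVIDERS = ("deepseek", "openai", "glm")


def build_test_command(provider: str, dataset: str = "mix", mode: str = "hi") -> list[str]:
    """Build the argv list for one test run without executing it."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return ["python", f"test_{provider}.py", "-d", dataset, "-m", mode]


if __name__ == "__main__":
    # Print one shell-quoted command per provider for the mix dataset.
    for provider in PROVIDERS:
        print(shlex.join(build_test_command(provider)))
```

Returning an argv list rather than a string keeps the command safe to pass to `subprocess.run` from the eval directory, if you want to script several runs.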

Evaluate the generated answers.

First, submit the evaluation requests.

python batch_eval.py -m request -api openai
python batch_eval.py -m request -api deepseek

Second, fetch the results.

python batch_eval.py -m result -api openai
python batch_eval.py -m result -api deepseek

With `output_file` set to `f"./datasets/{DATASET}/{DATASET}_eval_hi_naive.jsonl"`, run the following command:

```
python batch_eval.py -m result -api openai
```
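
The `output_file` value is an ordinary Python f-string, so the path it produces depends only on the dataset name and the baseline suffix. A minimal sketch of the expansion follows. The helper `eval_output_path` is a hypothetical name, and the suffixes are the ones that appear in the `output_file` settings in this README.

```python
# Baseline suffixes that appear in the output_file settings of this README.
BASELINES = ("naive", "graphrag", "lightrag", "kag")


def eval_output_path(dataset: str, baseline: str) -> str:
    """Expand the output_file pattern for one dataset and one baseline."""
    return f"./datasets/{dataset}/{dataset}_eval_hi_{baseline}.jsonl"


if __name__ == "__main__":
    for baseline in BASELINES:
        print(eval_output_path("mix", baseline))
```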

| Dataset | Dimension | NaiveRAG % | HiRAG % |
|---|---|---|---|
| Mix | Comprehensiveness | 16.6 | 83.4 |
| | Empowerment | 11.6 | 88.4 |
| | Diversity | 12.7 | 87.3 |
| | Overall | 12.4 | 87.6 |
| CS | Comprehensiveness | 30.0 | 70.0 |
| | Empowerment | 29.0 | 71.0 |
| | Diversity | 14.5 | 85.5 |
| | Overall | 26.5 | 73.5 |
| Legal | Comprehensiveness | 32.5 | 67.5 |
| | Empowerment | 25.0 | 75.0 |
| | Diversity | 22.0 | 78.0 |
| | Overall | 22.5 | 74.5 |
| Agriculture | Comprehensiveness | 34.0 | 66.0 |
| | Empowerment | 31.0 | 69.0 |
| | Diversity | 21.0 | 79.0 |
| | Overall | 28.5 | 71.5 |
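
The percentages in each row appear to be pairwise win rates: for each dimension, the two systems' shares sum to roughly 100. A minimal sketch of that arithmetic follows, assuming each judgment reduces to the name of the preferred system. The record format written by `batch_eval.py` is not shown in this README, so the input list and the helper `win_rates` are hypothetical.

```python
from collections import Counter


def win_rates(winners: list[str], systems: tuple[str, str]) -> dict[str, float]:
    """Convert a list of per-question winners into percentage win rates.

    `winners` holds one system name per judged question (assumed format).
    """
    counts = Counter(winners)
    total = sum(counts[name] for name in systems)
    if total == 0:
        raise ValueError("no judgments for the given systems")
    return {name: round(100 * counts[name] / total, 1) for name in systems}


if __name__ == "__main__":
    # Ten made-up judgments, for illustration only.
    judgments = ["HiRAG"] * 7 + ["NaiveRAG"] * 3
    print(win_rates(judgments, ("NaiveRAG", "HiRAG")))
```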

With `output_file` set to `f"./datasets/{DATASET}/{DATASET}_eval_hi_graphrag.jsonl"`, run the following command:

```
python batch_eval.py -m result -api openai
```

| Dataset | Dimension | GraphRAG % | HiRAG % |
|---|---|---|---|
| Mix | Comprehensiveness | 42.1 | 57.9 |
| | Empowerment | 35.1 | 64.9 |
| | Diversity | 40.5 | 59.5 |
| | Overall | 35.9 | 64.1 |
| CS | Comprehensiveness | 40.5 | 59.5 |
| | Empowerment | 38.5 | 61.5 |
| | Diversity | 30.5 | 69.5 |
| | Overall | 36.0 | 64.0 |
| Legal | Comprehensiveness | 48.5 | 51.5 |
| | Empowerment | 43.5 | 56.5 |
| | Diversity | 47.0 | 53.0 |
| | Overall | 45.5 | 54.5 |
| Agriculture | Comprehensiveness | 49.0 | 51.0 |
| | Empowerment | 48.5 | 51.5 |
| | Diversity | 45.5 | 54.5 |
| | Overall | 46.0 | 54.0 |

With `output_file` set to `f"./datasets/{DATASET}/{DATASET}_eval_hi_lightrag.jsonl"`, run the following command:

```
python batch_eval.py -m result -api openai
```

| Dataset | Dimension | FastGraphRAG % | HiRAG % |
|---|---|---|---|
| Mix | Comprehensiveness | 0.8 | 99.2 |
| | Empowerment | 0.8 | 99.2 |
| | Diversity | 0.8 | 99.2 |
| | Overall | 0.8 | 99.2 |
| CS | Comprehensiveness | 0.0 | 100.0 |
| | Empowerment | 0.0 | 100.0 |
| | Diversity | 0.5 | 99.5 |
| | Overall | 0.0 | 100.0 |
| Legal | Comprehensiveness | 1.0 | 99.0 |
| | Empowerment | 0.0 | 100.0 |
| | Diversity | 1.5 | 98.5 |
| | Overall | 0.0 | 100.0 |
| Agriculture | Comprehensiveness | 0.0 | 100.0 |
| | Empowerment | 0.0 | 100.0 |
| | Diversity | 0.0 | 100.0 |
| | Overall | 0.0 | 100.0 |

With `output_file` set to `f"./datasets/{DATASET}/{DATASET}_eval_hi_kag.jsonl"`, run the following command:

```
python batch_eval.py -m result -api openai
```

| Dataset | Dimension | KAG % | HiRAG % |
|---|---|---|---|
| Mix | Comprehensiveness | 2.3 | 97.7 |
| | Empowerment | 3.5 | 96.5 |
| | Diversity | 3.8 | 96.2 |
| | Overall | 2.3 | 97.7 |
| CS | Comprehensiveness | 1.0 | 99.0 |
| | Empowerment | 4.5 | 95.5 |
| | Diversity | 5.0 | 95.0 |
| | Overall | 1.5 | 98.5 |
| Legal | Comprehensiveness | 16.5 | 83.5 |
| | Empowerment | 9.0 | 91.0 |
| | Diversity | 11.0 | 89.0 |
| | Overall | 8.5 | 91.5 |
| Agriculture | Comprehensiveness | 5.0 | 95.0 |
| | Empowerment | 5.0 | 95.0 |
| | Diversity | 3.5 | 96.5 |
| | Overall | 0.0 | 100.0 |

We gratefully acknowledge the following open-source projects, which we used in this work:

- nano-graphrag: a simple, easy-to-modify GraphRAG implementation

- RAPTOR: a novel approach to retrieval-augmented language models that builds a recursive tree structure from documents.

```
@article{huang2025retrieval,
title={Retrieval-Augmented Generation with Hierarchical Knowledge},
author={Huang, Haoyu and Huang, Yongfeng and Yang, Junjie and Pan, Zhenyu and Chen, Yongqiang and Ma, Kaili and Chen, Hongzhi and Cheng, James},
... 
```