
© 2026 Molayo

HN Key Summary · 2026. 04. 24. 14:52

Open-source release of Danswer, an AI search and chat system over private data

Summary

Danswer is a self-hostable AI search and chat platform that answers questions using a company's internal documents. It connects to some 25 workplace tools such as Slack, Google Drive, and Jira, surfacing team-specific knowledge through natural-language queries. At its core is a custom RAG (Retrieval Augmented Generation) pipeline that improves accuracy with hybrid search combining a keyword (BM25) index and a vector index. Released as open source, it is well suited to enterprise environments handling sensitive data, where security and cost are key concerns.

Key Points

  • Danswer connects to 25 common workplace tools such as Slack, Google Drive, and Jira to provide a ChatGPT-style system built on in-house knowledge.
  • The system uses a custom RAG pipeline with a hybrid search approach that combines a vector index and a BM25 keyword index.
  • Its open-source, self-hostable design keeps all data in-house, and the LLM can also be deployed on-premise.
  • To improve retrieval accuracy, it applies advanced preprocessing techniques such as query augmentation and contextual rephrasing.

Launch HN: Danswer (YC W24) – Open-source AI search and chat over private data

Hey HN! Chris and Yuhong here from Danswer (
https://github.com/danswer-ai/danswer). We’re building an open source and self-hostable ChatGPT-style system that can access your team’s unique knowledge by connecting to 25 of the most common workplace tools (Slack, Google Drive, Jira, etc.). You ask questions in natural language and get back answers based on your team’s documents. Where relevant, answers are backed by citations and links to the exact documents used to generate them.

Quick Demo: https://youtu.be/hqSouur2FXw

Originally Danswer was a side project motivated by a challenge we experienced at work. We noticed that as teams scale, finding the right information becomes more and more challenging. I recall being on call and helping a customer recover from a mission-critical failure, but the error was related to some obscure legacy feature I had never used. For most projects, a simple question to ChatGPT would have solved it; but in this moment, ChatGPT was completely clueless without additional context (which I also couldn’t find).

We believe that within a few years, every org will be using team-specific knowledge assistants. We also understand that teams don’t want to tell us their secrets and not every team has the budget for yet another SaaS solution, so we open-sourced the project. It is just a set of containers that can be deployed on any cloud or on-premise. All of the data is processed and persisted on that same instance. Some teams have even opted to self-host open-source LLMs to truly airgap the system.

I also want to share a bit about the actual design of the system (https://docs.danswer.dev/system_overview). If you have questions about any parts of the flow such as the model choice, hyperparameters, prompting, etc. we’re happy to go into more depth in the comments.

The system revolves around a custom Retrieval Augmented Generation (RAG) pipeline we’ve built. During indexing time (we pull documents from connected sources every 10 minutes), documents are chunked and indexed into hybrid keyword+vector indices (https://github.com/danswer-ai/danswer/blob/main/backend/dans...).
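The chunking step above can be sketched as follows. This is a minimal illustration, not Danswer's actual code: the word-based splitting, chunk size, and overlap values are assumptions chosen for clarity (real pipelines typically chunk by tokens and respect document structure).

```python
def chunk_document(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping word-based chunks.

    Overlap between consecutive chunks preserves context that would
    otherwise be cut at chunk boundaries before indexing.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk would then be embedded into the vector index and tokenized into the keyword index.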

For the vector index (which gives the system the flexibility to understand natural language queries), we use state-of-the-art prefix-aware embedding models trained with contrastive loss. Optionally the system can be configured to go over each doc with multiple passes of different granularity to capture wide context vs fine details. We also supplement the vector search with a keyword based BM25 index + N-Grams so that the system performs well even in low data domains. Additionally we’ve added in learning from feedback and time based decay—see our custom ranking function (https://github.com/danswer-ai/danswer/blob/main/backend/dans... – this flexibility is why we love Vespa as a Vector DB).
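The hybrid scoring with time-based decay could look something like the sketch below. The blending weight, BM25 normalization, and half-life are all illustrative assumptions; Danswer's real ranking function lives in its Vespa rank profile, which this does not reproduce.

```python
def hybrid_score(bm25_score: float, cosine_sim: float,
                 doc_age_days: float, alpha: float = 0.5,
                 half_life_days: float = 180.0) -> float:
    """Blend a keyword score and a vector score, then apply time decay.

    BM25 scores are unbounded, so we squash them into [0, 1) to make
    them comparable to cosine similarity before linearly blending.
    """
    keyword = bm25_score / (bm25_score + 1.0)
    blended = alpha * keyword + (1.0 - alpha) * cosine_sim
    # Exponential decay: a document loses half its weight per half-life.
    decay = 0.5 ** (doc_age_days / half_life_days)
    return blended * decay
```

Feedback signals could be folded in as an additional multiplicative or additive term on top of `blended`.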

At query time, we preprocess the query with query-augmentation, contextual-rephrasing, as well as standard techniques like removing stopwords and lemmatization. Once the top documents are retrieved, we ask a smaller LLM to decide which of the chunks are “useful for answering the query” (this is something we haven’t seen much of elsewhere, but our tests have shown it to be one of the biggest drivers for both precision and recall). Finally the most relevant passages are passed to the LLM along with the user query and chat history to produce the final answer. We post-process by checking guardrails and extracting citations to link the user to relevant documents. (https://github.com/danswer-ai/danswer/blob/main/backend/dans...)
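The chunk-usefulness filter described above can be sketched as a per-chunk yes/no call to a small model. Here `judge` stands in for that LLM call (any callable that returns a textual answer), and the prompt wording is an assumption for illustration, not Danswer's actual prompt.

```python
from typing import Callable

def filter_useful_chunks(query: str, chunks: list[str],
                         judge: Callable[[str], str]) -> list[str]:
    """Keep only chunks the judge model deems useful for the query."""
    useful = []
    for chunk in chunks:
        prompt = (
            f"Query: {query}\n\nChunk: {chunk}\n\n"
            "Is this chunk useful for answering the query? Answer yes or no."
        )
        if judge(prompt).strip().lower().startswith("yes"):
            useful.append(chunk)
    return useful
```

Only the surviving chunks are then packed into the final answer-generation prompt, which keeps the context window focused on relevant passages.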

The Vector and Keyword indices are both stored locally and the NLP models run on the same instance (we’ve chosen ones that can run without GPU). The only exception is that the default Generative model is OpenAI’s GPT; however, this can also be swapped out (https://docs.danswer.dev/gen_ai_configs/overview).

We’ve seen teams use Danswer on problems like:

  • Improving turnaround times for support by reducing time taken to find relevant documentation;
  • Helping sales teams get customer context instantly by combing through calls and notes;
  • Reducing lost engineering time from answering cross-team questions, building duplicate features due to inability to surface old tickets or code merges, and helping on-calls resolve critical issues faster by providing the complete history on an error in one place;
  • Self-serve onboarding for new members who don’t know where to find information.

If you’d like to play around with things locally, check out the quickstart guide here: https://docs.danswer.dev/quickstart. If you already have Docker, you should be able to get things up and running in <15 minutes. And for folks who want a zero-effort way of trying it out or don’t want to self-host, please visit our Cloud: https://www.danswer.ai/

AI-Generated Content

This content was automatically summarized, translated, and analyzed by AI from the original HN AI Engineering post. Copyright remains with the original author; please refer to the original for the authoritative text.
