ML-Optimized OCR Pipeline: Structuring Complex Academic Materials

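Summary

This system is an AI pipeline that extracts structured data (text, tables, mathematical formulas, diagrams) from complex, multilingual academic materials such as exam papers and textbooks. Beyond plain OCR, it provides natural-language descriptions of visual content (semantic annotation) together with contextual information, which makes it well suited to building machine learning (ML) training datasets. It reports accuracy of 90–95% or higher and supports output in AI-friendly formats such as JSON and Markdown.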

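Key Points

A pipeline that extracts text, tables, mathematical formulas, and diagrams from study materials (such as exam papers) and converts them into a structure optimized for ML training.
Beyond plain OCR, it automatically generates natural-language descriptions of visual information (semantic annotation) to increase the educational value of the dataset.
It supports multiple languages (Japanese, Korean, English, and others), and every extracted element is delivered as JSON or Markdown that AI systems can use directly.
It combines DocLayout-YOLO, Google Vision API, MathPix OCR, and other tools to handle complex layouts with high accuracy (90–95% or higher).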

Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

This OCR project is just the beginning.
In less than 1 month, a powerful new system will be released: A customizable AI pipeline with memory — tailored to your field.

Whether you're a student, researcher, or developer, you’ll be able to build your own smart, memory-enhanced AI — without needing deep AI knowledge.

First of all, thank you so much for your interest in this project. I had originally planned to release the first version of the AI pipeline before June. But to be honest, I've been juggling a major academic commitment (a critical exam on June 15) and development at the same time — and it's been tougher than I expected.

Rather than rushing out something incomplete, I’ve decided to take a bit more time to ensure the release is genuinely useful, stable, and worth your time. This whole system — including the multi-modal OCR — actually started as a tool to help with my own studies. I didn't expect it to get this much attention, so thanks.

Since I'm the first user, I want to make sure it's something I’d actually want to use before releasing it. Development will resume after the exam, and the public release will follow once the system is truly ready. Thanks again for your patience — I really appreciate it.

System Overview

This OCR system is specifically designed to extract structured data from complex educational materials—such as exam papers—in a format optimized for machine learning (ML) training. It supports multilingual text, mathematical formulas, tables, diagrams, and charts, making it ideal for creating high-quality training datasets.

Optimized for ML Training: Extracted elements such as diagrams, tables, and figures are semantically annotated with contextual explanations. This includes automatic generation of natural language descriptions for visual content (e.g., “This figure shows the process of mitosis in four stages”) to enhance downstream model training.
Multilingual Support: Works with Japanese, Korean, and English, and can be easily customized for additional languages.
Structured Output: Generates AI-ready outputs in JSON or Markdown, including human-readable descriptions of mathematical expressions, table summaries, and figure captions.
High Accuracy: Achieves 90–95% or higher accuracy on real-world academic datasets such as EJU Biology and UTokyo Math.
Complex Layout Support: Accurately processes exam-style PDFs with dense scientific content, formula-heavy paragraphs, and rich visual elements.
Built With: DocLayout-YOLO, Google Vision API, Gemini Pro Vision, MathPix OCR, OpenAI API, OpenCV, and more.
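
As a rough illustration of the structured output described above, the following sketch models one annotated figure and renders it as JSON and as Markdown. The `AnnotatedElement` class and its field names (`element_type`, `description`, `educational_value`, `related_topics`) are assumptions chosen for illustration, not the project's actual schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class AnnotatedElement:
    """Hypothetical record for one extracted element plus its semantic annotation."""
    element_type: str                 # e.g. "figure", "table", "formula"
    description: str                  # natural-language description of the content
    educational_value: str = ""
    related_topics: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # ensure_ascii=False keeps Japanese/Korean text readable in the output
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)

    def to_markdown(self) -> str:
        lines = [
            f"### {self.element_type.capitalize()}",
            f"**Description:** {self.description}",
        ]
        # Optional fields are emitted only when they carry content
        if self.educational_value:
            lines.append(f"**Educational value:** {self.educational_value}")
        if self.related_topics:
            lines.append("**Related topics:** " + ", ".join(self.related_topics))
        return "\n\n".join(lines)


figure = AnnotatedElement(
    element_type="figure",
    description="This figure shows the process of mitosis in four stages",
    related_topics=["Mitosis", "Cell cycle"],
)
print(figure.to_json())
print(figure.to_markdown())
```

Keeping one record type and two renderers means the JSON and Markdown outputs cannot drift apart, which matters when the same data feeds both a training set and human review.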

Examples of Outputs

Below are actual examples of outputs generated by this system using real-world materials (2017 EJU Biology & 2014 University of Tokyo Math), including English-translated semantic context and extracted data.

Example 1: Geometry Problem (English)

Question: Consider the rectangular prism OABC–DEFG with a square base of side length 1. Points P, Q, R are on the segments AE, BF, and CG, respectively, and four points O, P, Q, and R lie on the same plane. Let S be the area of quadrilateral OPQR. Also, let $\angle AOP$ be $\alpha$ and $\angle COR$ be $\beta$. (2) If $\alpha + \beta = 1$ and $S = S$, find the value of $\tan \alpha + \tan \beta$. Also, if $\alpha \le \beta$, find the value of $\tan \alpha$.

Image description: This image shows the rectangular prism OAB–CDEFGQ. Each vertex is labeled with alphabets. The angle $\alpha$ is marked on face OAB. The plane ORPQ intersects the prism and is highlighted. Line RC lies on face ODCG, and line PB lies on face ABFQ.
Educational value: This image enhances spatial reasoning by visualizing 3D geometry and cross-sections. It helps learners understand concepts such as plane geometry, solid shapes, spatial visualization, and angles.
Related topics: Solid geometry, cross-sections, prism faces, triangle, spatial reasoning
Exam relevance: This type of question appears in entrance exams like:

Calculate the area of ORPQ using angle $\alpha$
Find the lengths of OR, RP, PQ, QO
Determine the angle between ORPQ and the prism's face
Locate points P, Q, R in coordinate space
Calculate volume/area of the prism parts
Predict shapes based on constraints
Sketch the shape of the prism

Example 2: Biology Process (English)

Question: The photo shows the mitotic cell division process (somatic cell division) of an onion root tip. Cells A–D are in different stages of division. Match the stages (prophase, metaphase, anaphase, telophase) to each cell and select the correct combination from options ①–⑧.

Image description: This image shows the process of plant cell division observed under a microscope. Various cells are in different mitotic phases, including chromosomes aligned at the center (metaphase), separating to poles (anaphase), or forming daughter nuclei (telophase).

A – appears to be in anaphase
B – possibly telophase
C – prophase or prometaphase
D – metaphase
Educational value: This helps students visually understand the process of mitosis, reinforcing knowledge of cell division phases and their characteristics. It connects to biology concepts like DNA replication, cancer biology, and genetics.
Related topics: Mitosis, Cell cycle, Prophase, Metaphase, Anaphase, Telophase, DNA replication
Exam relevance: This image is used in questions such as:
Match A, B, C, D to appropriate mitotic phases
Describe characteristics of each phase
Explain the significance of mitosis
Discuss how errors in mitosis lead to genetic diseases

Example 3: Table Data (Korean/Chinese)

Table:

前期	中期	後期
A	C	D
A	D	B
B	C	C
B	D	C
C	A	D
C	D	A
D	A	B
D	C	A
Summary: Each option (①–⑧) corresponds to a specific mapping of A, B, C, D to prophase, metaphase, and anaphase.
Educational value: Understanding time-based transition in mitosis and data organization in tables. Enhances data interpretation, pattern recognition, and analysis skills.
Related topics: Data analysis, table interpretation, biological data classification

Technical Implementation Details

Step 1 – Initial OCR Extraction: Run ocr_stage1.py to extract raw elements (text, tables, figures, etc.) from input PDFs. This step performs layout detection and stores intermediate results (e.g., coordinates, cropped images, raw content).
Step 2 – Semantic Interpretation & Final Output: Run ocr_stage2.py to process the intermediate data and convert it into structured, human-readable output. This includes generating natural-language explanations, summaries, and organizing content into AI-ready formats (JSON/Markdown).
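
The hand-off between the two stages can be pictured as follows. This is a minimal sketch, not the contents of ocr_stage1.py or ocr_stage2.py: the function names `stage1_extract` and `stage2_annotate`, the record fields, and the placeholder description are all hypothetical, and the layout detection and description steps are stubbed out.

```python
import json
import tempfile
from pathlib import Path


def stage1_extract(regions, out_path):
    """Stand-in for stage 1: store raw elements with coordinates as intermediate JSON.

    `regions` is assumed to be a list of (type, bbox, raw_content) tuples that a
    layout detector and OCR engine would normally produce.
    """
    records = [
        {"id": i, "type": kind, "bbox": list(bbox), "raw": raw}
        for i, (kind, bbox, raw) in enumerate(regions)
    ]
    Path(out_path).write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")
    return records


def stage2_annotate(in_path, describe):
    """Stand-in for stage 2: read intermediate data and attach a description.

    `describe` is a caller-supplied function (in a real pipeline, probably a
    vision or language model call) that maps a record to explanatory text.
    """
    records = json.loads(Path(in_path).read_text(encoding="utf-8"))
    for record in records:
        record["description"] = describe(record)
    return records


# Toy run with a stub in place of a real model.
with tempfile.TemporaryDirectory() as tmp:
    intermediate = Path(tmp) / "stage1.json"
    stage1_extract(
        [("text", (50, 40, 500, 80), "Question 1"),
         ("figure", (60, 100, 400, 380), "")],
        intermediate,
    )
    final = stage2_annotate(intermediate, lambda r: f"{r['type']} region #{r['id']}")

print(json.dumps(final, ensure_ascii=False, indent=2))
```

Writing the intermediate results to disk lets the second stage be re-run (for example with a different description model) without repeating layout detection and OCR.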

Optimization Details

Table Processing Optimization:
- Table regions are detected using DocLayout-YOLO
- Google Vision OCR is used for table processing instead of MathPix for better accuracy with Japanese text
- Table structures are preserved in structured JSON format (maintaining row/column structure)
- Y-coordinate information is maintained to ensure contextual continuity
- Original layout information is preserved alongside structured data for ML training
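
One way to keep both the row/column structure and the vertical position of a table is sketched below. The `table_to_record` and `reading_order` helpers, their field names, and the bounding-box values are illustrative assumptions; the cell values are the first rows of the Example 3 table.

```python
import json


def table_to_record(rows, bbox, page):
    """Build a JSON-serializable table record (hypothetical schema).

    rows: list of rows, where the first row is the header
    bbox: (x_min, y_min, x_max, y_max) of the table region in page pixels
    """
    header, *body = rows
    return {
        "type": "table",
        "page": page,
        "bbox": list(bbox),
        "y_top": bbox[1],   # kept so regions can be re-ordered top to bottom
        "header": header,
        "rows": [dict(zip(header, row)) for row in body],
        "raw_grid": rows,   # original layout preserved alongside the structured form
    }


def reading_order(regions):
    """Sort regions by page number, then by top edge."""
    return sorted(regions, key=lambda r: (r["page"], r["y_top"]))


table = table_to_record(
    rows=[["前期", "中期", "後期"], ["A", "C", "D"], ["A", "D", "B"]],
    bbox=(80, 620, 520, 760),
    page=1,
)
question = {"type": "text", "page": 1, "bbox": [80, 100, 520, 180], "y_top": 100}

ordered = reading_order([table, question])
print([r["type"] for r in ordered])
print(json.dumps(table["rows"], ensure_ascii=False))
```

Sorting on the stored y-coordinate places the question text before the table it refers to, regardless of the order in which the detector returned the regions.
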
Image and Special Region Processing:
- Image regions are processed using Google Vision API's image analysis features (imageProperties, labelDetection, textDetection)
- Image descriptions are generated using Google Cloud Vision API
- Graphs/charts are processed using Google Cloud Vision API's document analysis features with data point extraction
- Special region processing results are stored in structured JSON format for ML training
- Original coordinate information and region type metadata are added to maintain contextual continuity
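
For reference, the three Vision features named above correspond to the `IMAGE_PROPERTIES`, `LABEL_DETECTION`, and `TEXT_DETECTION` feature types in the Cloud Vision REST `images:annotate` request. The sketch below only builds such a request body and a metadata wrapper; it makes no network call, and the `build_annotate_request` and `wrap_result` helpers and their field names are hypothetical, not taken from this repository.

```python
import base64
import json

FEATURE_TYPES = ["IMAGE_PROPERTIES", "LABEL_DETECTION", "TEXT_DETECTION"]


def build_annotate_request(image_bytes):
    """Build the JSON body for a Cloud Vision images:annotate call."""
    return {
        "requests": [
            {
                # The REST API expects the image as a base64-encoded string
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": t} for t in FEATURE_TYPES],
            }
        ]
    }


def wrap_result(region_type, bbox, page, analysis):
    """Attach region type and original coordinates to an analysis result."""
    return {
        "region_type": region_type,
        "page": page,
        "bbox": list(bbox),
        "analysis": analysis,
    }


body = build_annotate_request(b"\x89PNG placeholder bytes")
record = wrap_result("figure", (60, 100, 400, 380), page=1, analysis={"labels": []})
print(json.dumps(body["requests"][0]["features"]))
print(json.dumps(record))
```

Storing the region type and bounding box next to each analysis result is what allows a figure's description to be matched back to the question text around it.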

This OCR system is an open project, and I’d love to see others improve or build upon it. Continuous updates and community-driven enhancements are the goal.

If you’re interested in custom AI tools or would like to collaborate on an AI-related project, feel free to reach out via email: ses425500000@gmail.com

This project is now licensed under the GNU Affero General Public License v3.0 (AGPL-3.0), in compliance with the original license of the DocLayout-YOLO model used in this repository.
Please note that any derivative or deployed version (including as a web service) must also publicly share its complete source code.

More details: https://www.gnu.org/licenses/agpl-3.0.html
See the LICENSE file for full terms.
