ses425500000's comments

ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

Yeah, the hallucination part was one thing I was worried about too. So I make the LLM run only after the OCR step, and I added a simple check so it doesn't change text that's already correct. I'll try to show real examples and a hallucination rate too. Thanks for the feedback!
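The "don't change correct text" check could be as simple as a similarity guard between the OCR output and the LLM's rewrite. A minimal sketch, assuming a plain edit-similarity threshold (the function name and the 0.2 cutoff are illustrative, not the project's actual code):

```python
import difflib

def guarded_correction(ocr_text: str, llm_text: str, max_change: float = 0.2) -> str:
    """Accept the LLM's corrected text only if it stays close to the OCR output.

    If the change exceeds `max_change`, fall back to the raw OCR text, on the
    assumption that a large rewrite is more likely hallucination than correction.
    """
    similarity = difflib.SequenceMatcher(None, ocr_text, llm_text).ratio()
    if 1.0 - similarity > max_change:
        return ocr_text  # too different from the OCR output: reject the LLM edit
    return llm_text

# A small typo fix is accepted; a wholesale rewrite is rejected.
print(guarded_correction("Tbe quick brown fox", "The quick brown fox"))
print(guarded_correction("The quick brown fox", "Completely unrelated sentence"))
```

Rejected edits could also be logged, which gives a crude hallucination-rate estimate for free.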

This project was just a hobby and this was my first time posting something. I didn't imagine people would care this much… Next time I'll prepare better before sharing.

ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

Yep — this project uses a pre-trained DocLayout-YOLO model released under an open license by the original authors. No additional datasets were used for training. All sample data in the repo is either synthetic, publicly available, or user-generated specifically for testing purposes. If there are any concerns about specific models or datasets, I’m happy to review them and make adjustments as needed.

ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

Thanks for the insightful comment! You’re absolutely right — organizing extracted data into a coherent, semantically meaningful structure is critical for high-quality ML training.

Right now, the pipeline focuses on generating OCR outputs optimized for ML models by cleaning, deduplicating, and segmenting content across modalities (text, tables, figures, formulas). For diagrams and tables, we add semantic tags and preserve layout relationships to aid downstream modeling.
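The deduplication step could, for instance, hash normalized segments to drop repeats such as running headers OCR'd on every page. A small sketch (normalization rules here are illustrative, not the pipeline's actual ones):

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so near-identical OCR fragments compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()

def dedupe_segments(segments: list[str]) -> list[str]:
    """Keep the first occurrence of each segment, dropping later near-duplicates."""
    seen: set[str] = set()
    out = []
    for seg in segments:
        key = hashlib.sha256(normalize(seg).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(seg)
    return out

# "chapter  1" normalizes to the same key as "Chapter 1", so it is dropped.
print(dedupe_segments(["Chapter 1", "Some body text.", "chapter  1"]))
```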

I’m planning to add a semantic structuring module that goes beyond basic layout analysis — something that builds hierarchical, MECE-style representations and identifies entity relationships across sections. That’s absolutely the next frontier, and I really appreciate you pointing it out.

Thanks again for the thoughtful feedback!

ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

Great question — I’m using traditional OCR engines for the initial text extraction (e.g., MathPix, Google Vision), but then I apply generative AI models in a second stage to refine the output. This includes removing noisy or irrelevant elements, normalizing format inconsistencies, and improving alignment across multi-modal inputs.
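The "normalizing format inconsistencies" part can cover things like ligatures and soft hyphens that OCR engines often emit. A minimal sketch of that kind of cleanup, assuming Unicode NFKC plus a small hand-written fix table (the table entries are examples, not the project's actual rules):

```python
import unicodedata

# Hypothetical cleanup table for common OCR artifacts.
OCR_FIXES = {
    "\ufb01": "fi",   # ligature fi (also handled by NFKC; kept for clarity)
    "\ufb02": "fl",   # ligature fl
    "\u00ad": "",     # soft hyphen left over from line-break hyphenation
}

def normalize_ocr(text: str) -> str:
    """Apply NFKC normalization plus OCR-specific character fixes."""
    text = unicodedata.normalize("NFKC", text)
    for bad, good in OCR_FIXES.items():
        text = text.replace(bad, good)
    return text

print(normalize_ocr("\ufb01le\u00adname"))  # ligature + soft hyphen cleaned up
```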

In addition, for figures and diagrams, I use Gemini Pro Vision not just to extract the content, but to generate context-aware, structured descriptions that are better suited as ML training input — rather than just dumping raw image text.

So in short, generative AI is used here more as a smart post-processing layer to enhance the usability and semantic clarity of the OCR outputs.

ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

Thanks! Yes — I’m definitely planning to update and refine the project over time.

This initial release is mostly a working prototype to demonstrate the full pipeline logic, and I’ll continue improving stability, modularity, and usability. A lot more updates are in the pipeline, so stay tuned! Feel free to open issues or suggestions anytime — feedback is always welcome!

ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

Thanks for sharing — Marker is a great tool, especially for human-readable formatting!

In contrast, this project focuses less on preserving the visual layout for human readers, and more on extracting structured semantic data for machine learning training.

So instead of optimizing for clean Markdown or HTML, it extracts context-aware elements like:

• table data as JSON,

• math expressions in LaTeX,

• diagrams with image descriptions,

• multilingual text segments,

• and semantic roles (e.g. “question”, “explanation”, etc.)

In short: Marker is great for reading, while this is built for feeding into ML pipelines, especially for tasks like question answering, diagram reasoning, or multimodal pretraining.
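To make that concrete, one extracted element might look something like this (the field names are illustrative, not the project's actual schema):

```python
import json

# Hypothetical shape of one extracted element from a document page.
record = {
    "type": "table",                # or "math", "figure", "text"
    "semantic_role": "explanation", # e.g. "question", "explanation"
    "language": "en",
    "content": {                    # tables land here as structured JSON
        "headers": ["year", "accuracy"],
        "rows": [["2023", "91.2"], ["2024", "93.7"]],
    },
    "latex": None,              # filled in for math elements instead of `content`
    "image_description": None,  # filled in for diagrams/figures
}

print(json.dumps(record, indent=2))
```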

ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

Yep — some components currently rely on external APIs (e.g. OpenAI, MathPix), primarily for stability and ease of deployment during early release. But I’m planning to support fully local inference in the future to eliminate API key dependency.

The local pipeline would include:

• Tesseract or TrOCR for general OCR

• Pix2Struct, Donut, or DocTR for document structure understanding

• OpenAI CLIP for image-text semantic alignment

• Gemma / Phi / LLaMA / Mistral for downstream reasoning tasks

The goal is to make the system fully self-hostable for offline and private use.
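One way to support both modes is a small backend interface, so the hosted APIs and the local engines are interchangeable at startup. A sketch (the class and function names are hypothetical; only `pytesseract.image_to_string` is a real call):

```python
from typing import Protocol

class OCRBackend(Protocol):
    def extract_text(self, image_path: str) -> str: ...

class TesseractBackend:
    """Local engine: no API key needed, requires tesseract installed."""
    def extract_text(self, image_path: str) -> str:
        import pytesseract
        from PIL import Image
        return pytesseract.image_to_string(Image.open(image_path))

class EchoBackend:
    """Stand-in backend for testing the pipeline without any OCR engine."""
    def extract_text(self, image_path: str) -> str:
        return f"[text from {image_path}]"

def run_pipeline(backend: OCRBackend, pages: list[str]) -> list[str]:
    """Run whichever backend was chosen over a list of page images."""
    return [backend.extract_text(p) for p in pages]

print(run_pipeline(EchoBackend(), ["page1.png"]))
```

An API-based backend would implement the same `extract_text` method, so swapping to fully local inference is a one-line configuration change.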
