ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)
ses425500000's comments
ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)
This project was just a hobby, and it was my first time posting something. I didn’t imagine people would care this much… Next time I will prepare better before sharing.
ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)
Right now, the pipeline focuses on generating OCR outputs optimized for ML models by cleaning, deduplicating, and segmenting content across modalities (text, tables, figures, formulas). For diagrams and tables, we add semantic tags and preserve layout relationships to aid downstream modeling.
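To make the cleaning, deduplicating, and segmenting step concrete, here is a minimal sketch in Python. It is not the project's actual code: the block format (dicts with "modality" and "content" keys) and the function names are assumptions made for illustration, and it only catches exact duplicates after whitespace and case normalization.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe_and_segment(blocks):
    """Drop empty and duplicate blocks, then group the rest by modality.

    Each block is a dict with the hypothetical keys "modality"
    (e.g. "text", "table", "figure", "formula") and "content".
    """
    seen = set()
    segments = defaultdict(list)
    for block in blocks:
        # Cleaning: collapse runs of whitespace left over from OCR.
        cleaned = re.sub(r"\s+", " ", block["content"]).strip()
        if not cleaned:
            continue
        # Deduplicating: hash the normalized content, scoped per modality.
        digest = hashlib.sha256(normalize(cleaned).encode("utf-8")).hexdigest()
        key = (block["modality"], digest)
        if key in seen:
            continue
        seen.add(key)
        # Segmenting: keep one list of cleaned blocks per modality.
        segments[block["modality"]].append(cleaned)
    return dict(segments)
```

Near-duplicate detection (for example, OCR variants of the same paragraph) would need something fuzzier than an exact hash, which this sketch leaves out.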
I’m planning to add a semantic structuring module that goes beyond basic layout analysis — something that builds hierarchical, MECE-style representations and identifies entity relationships across sections. That’s absolutely the next frontier, and I really appreciate you pointing it out.
Thanks again for the thoughtful feedback!
ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)
In addition, for figures and diagrams, I use Gemini Pro Vision not just to extract the content, but to generate context-aware, structured descriptions that are better suited as ML training input — rather than just dumping raw image text.
So in short, generative AI is used here more as a smart post-processing layer to enhance the usability and semantic clarity of the OCR outputs.
ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)
This initial release is mostly a working prototype to demonstrate the full pipeline logic, and I’ll continue improving stability, modularity, and usability. A lot more updates are in the pipeline, so stay tuned! Feel free to open issues or suggestions anytime — feedback is always welcome!
ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)
In contrast, this project focuses less on preserving the visual layout for human readers, and more on extracting structured semantic data for machine learning training.
So instead of optimizing for clean Markdown or HTML, it extracts context-aware elements like:
• table data as JSON,
• math expressions in LaTeX,
• diagrams with image descriptions,
• multilingual text segments,
• and semantic roles (e.g. “question”, “explanation”)
In short: Marker is great for reading, while this is built for feeding into ML pipelines, especially for tasks like question-answering, diagram reasoning, or multimodal pretraining.
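As a rough illustration of what such structured output could look like, here is a small Python sketch that serializes extracted elements as JSON Lines. The `Element` class, its field names, and `to_jsonl` are hypothetical and chosen for this example; the project's real schema may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class Element:
    """One extracted element (hypothetical schema, not the project's own)."""
    kind: str                       # "table", "math", "diagram", or "text"
    content: Any                    # rows for a table, LaTeX for math, description for a diagram
    language: Optional[str] = None  # language code for multilingual text segments
    role: Optional[str] = None      # semantic role such as "question" or "explanation"


def to_jsonl(elements):
    """Serialize elements as JSON Lines, one record per line, keeping non-ASCII text intact."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in elements)


sample = [
    Element(kind="table", content=[{"x": 1, "y": 2}], role="explanation"),
    Element(kind="math", content=r"\frac{a}{b}", role="question"),
    Element(kind="text", content="안녕하세요", language="ko"),
]
print(to_jsonl(sample))
```

One record per line keeps the output easy to stream into a training data loader, and the `role` field lets downstream code filter, say, questions from explanations.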
ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)
The local pipeline would include:
• Tesseract or TrOCR for general OCR
• Pix2Struct, Donut, or DocTR for document structure understanding
• OpenAI CLIP for image-text semantic alignment
• Gemma / Phi / LLaMA / Mistral for downstream reasoning tasks
The goal is to make the system fully self-hostable for offline and private use.
ses425500000 | 11 months ago | on: Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)
If it still misbehaves in any edge cases, feel free to open an issue on GitHub — happy to patch it up.