top | item 31582264

vivekweb2013 | 3 years ago

Noted! I'll see if this feature can be added.


westurner|3 years ago

https://www.elastic.co/guide/en/elasticsearch/plugins/curren... :

> [The Elasticsearch Core Ingest Attachment Processor Plugin]: The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.

> The source field must be a base64 encoded binary. If you do not want to incur the overhead of converting back and forth between base64, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor will skip the base64 decoding then.
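
A rough sketch of what the base64 requirement means in practice, assuming a source field named `data` (the field name and the sample bytes are placeholders of mine, not from the quoted docs). It only builds the two JSON bodies, one for defining a pipeline with a single attachment processor and one for the document to index through it; sending them to a cluster is left out.

```python
import base64
import json


def build_pipeline_body(field: str = "data") -> dict:
    """Request body for an ingest pipeline with one attachment processor.

    `field` names the document field holding the base64 text (assumed name).
    """
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }


def build_document_body(raw: bytes, field: str = "data") -> dict:
    """Request body for a document: the file's bytes as a base64 string."""
    return {field: base64.b64encode(raw).decode("ascii")}


if __name__ == "__main__":
    # Placeholder bytes standing in for a real file read from disk.
    sample = b"placeholder bytes standing in for a PDF"
    print(json.dumps(build_pipeline_body(), indent=2))
    print(json.dumps(build_document_body(sample), indent=2))
```

With CBOR instead of JSON, the same field would carry the raw bytes and the encode step above would not be needed.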

Apache Tika supported formats > Images > TesseractOCR: https://tika.apache.org/2.4.0/formats.html https://tika.apache.org/2.4.0/formats.html#Image_formats :

> When extracting from images, it is also possible to chain in Tesseract, via the TesseractOCRParser, to have OCR performed on the contents of the image.
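
For a feel of the OCR step on its own, here is a minimal sketch that calls the `tesseract` command-line tool directly rather than going through Tika's TesseractOCRParser. It assumes the `tesseract` binary and the `eng` language data are installed; the image path is a placeholder. If the binary is missing, the helper returns `None` instead of failing.

```python
import shutil
import subprocess
from typing import List, Optional


def tesseract_command(image_path: str, lang: str = "eng") -> List[str]:
    """Build the CLI call; 'stdout' as the output base prints the text."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path: str, lang: str = "eng") -> Optional[str]:
    """Return recognized text, or None when tesseract is not installed."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # "scan.png" is a placeholder path, not a file from this thread.
    print(tesseract_command("scan.png"))
```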

/? Meilisearch "ocr" GitHub;

Looks like e.g. paperbase (AGPL) also implements OCR with Tesseract OCR: https://docs.paperbase.app/

tesseract-ocr/tesseract https://github.com/tesseract-ocr/tesseract

/? https://github.com/awesome-selfhosted/awesome-selfhosted#sea... ctrl-f "ocr"

westurner|3 years ago

Would be good to have:

- Search results on a timeline indicating search match occurrence frequency; ability to "zoom in" or "drill down"

- "Find more like these" that prepopulates a search query form

- "Find more like these" that mutates the query pattern and displays the count for each original and mutated query along with the results; with optional optimization
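
To make the first and third ideas concrete, a small sketch under my own assumptions: matches are bucketed by year, "zooming in" re-buckets one year by month, and query "mutation" here just means dropping one term at a time and counting matches by naive substring search. All function names and the in-memory document list are hypothetical; a real search engine would supply the counts.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, List, Optional

FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}


def timeline(match_dates: Iterable[date], granularity: str = "year",
             within: Optional[str] = None) -> Dict[str, int]:
    """Count matches per bucket; `within` (e.g. "2022") drills into one bucket."""
    fmt = FORMATS[granularity]
    selected = [d for d in match_dates
                if within is None or d.isoformat().startswith(within)]
    return dict(sorted(Counter(d.strftime(fmt) for d in selected).items()))


def mutate(query: str) -> List[str]:
    """Variants of the query with one term removed (one simple mutation)."""
    terms = query.split()
    if len(terms) < 2:
        return []
    return [" ".join(terms[:i] + terms[i + 1:]) for i in range(len(terms))]


def count_matches(query: str, docs: Iterable[str]) -> int:
    """Number of documents containing every query term (substring match)."""
    terms = query.lower().split()
    return sum(all(t in doc.lower() for t in terms) for doc in docs)


def counts_for_mutations(query: str, docs: List[str]) -> Dict[str, int]:
    """Match count for the original query and each mutated query."""
    return {q: count_matches(q, docs) for q in [query] + mutate(query)}


if __name__ == "__main__":
    dates = [date(2021, 3, 1), date(2022, 5, 1), date(2022, 5, 9), date(2022, 6, 2)]
    print(timeline(dates))                           # per-year overview
    print(timeline(dates, "month", within="2022"))   # zoom into one year
    print(counts_for_mutations("pdf ocr", ["pdf ocr search", "ocr engine", "pdf viewer"]))
```

The "optional optimization" could then be as simple as ranking the mutated queries by count, though what counts as better is left open here.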