Found some interesting tidbits in their FAQ [0]:
"Q: What type of text can Amazon Textract detect and extract?
A: Amazon Textract can detect Latin-script characters from the standard English alphabet and ASCII symbols."
So, English only. But more worrying is that they're going to keep your company's documents:
"Q. Are document and image inputs processed by Amazon Textract stored, and how are they used by AWS?
A: Amazon Textract may store and use document and image inputs processed by the service solely to provide and maintain the service and to improve and develop the quality of Amazon Textract..."
"Q. Can I delete images and documents stored by Amazon Textract?
A: Yes. You can request deletion of document and image inputs associated with your account by contacting AWS Support. Deleting image and document inputs may degrade your Amazon Textract experience."
That said, I'm still baffled as to what value-add they're providing. From the name alone, I'd have expected it to generate other documents of common types: .txt (without images), .doc, .html (zip). That is, a large part of extracting text is the ability to reflow the text across page boundaries & columns. However, this product states that:
"All extracted data is returned with bounding box coordinates" [1]
...which is how PDF documents lay things out in the first place... Have I missed something?
The point of this service is to train their own OCR models for use in other products like Kindle / their e-book store. There doesn't really need to be a value add - if people use it, it's a win for them... if people don't, it's not really a big loss.
Think less about books, and more about automating input from forms filled out by hand. In working with this tech, I can say that none of it is great and it would be very nice to be able to ditch what's available for stuff that would work better.
For my employer's use case, the data storage and privacy implications are a non-starter.
As tracker1 mentioned, don't think of this as a tool for reflowing text for different devices, but as a data capture and document processing solution.
Example: You are dealing with a lot of PDF documents that contain unstructured information (e.g. a filled form) and you need to extract bits of information (e.g. name, address) and output it in a structured format (e.g. JSON/XLS).
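A rough sketch of what that kind of extraction could look like against the Textract API. The `analyze_document` call is the real boto3 API; the helper names (`block_text`, `kv_pairs`, `analyze_form`) and the response-walking logic are my own simplification of how the `KEY_VALUE_SET` blocks link together:

```python
# Sketch: turn a Textract AnalyzeDocument response into a flat {field: value}
# dict. KEY blocks point at their VALUE blocks via a "VALUE" relationship,
# and both point at their WORD children via "CHILD" relationships.
import json

def block_text(block, blocks_by_id):
    """Concatenate the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for cid in rel["Ids"]:
                child = blocks_by_id[cid]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)

def kv_pairs(blocks):
    """Map each KEY block of a KEY_VALUE_SET to the text of its VALUE block."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for b in blocks:
        if b["BlockType"] == "KEY_VALUE_SET" and "KEY" in b.get("EntityTypes", []):
            for rel in b.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for vid in rel["Ids"]:
                        pairs[block_text(b, by_id)] = block_text(by_id[vid], by_id)
    return pairs

def analyze_form(path):
    """Send one image to Textract and print the extracted fields as JSON."""
    import boto3  # requires AWS credentials
    client = boto3.client("textract")
    with open(path, "rb") as f:
        resp = client.analyze_document(
            Document={"Bytes": f.read()}, FeatureTypes=["FORMS"]
        )
    print(json.dumps(kv_pairs(resp["Blocks"]), indent=2))
```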
Keeping documents and analyzing your business is not new and will not keep people from using it in their companies, I'm afraid. At least it doesn't stop people from using Windows and other M$ products.
Given the high and continuing popularity of "simple" conversion of regular PDF forms/tables -- even among the technically sophisticated HN audience [0] -- if Amazon can deliver on OCR-to-data, that feels like a huge achievement. Not as sexy (or creepy) as Rekognition, perhaps, but almost certainly more day-to-day useful to the many, many professionals who work with documents and legacy data entry systems.
Google Cloud Vision and Microsoft Cognitive Services act as competitors to Amazon Rekognition, but AFAIK there's no offering from a FAANG that competes with AWS Textract.
It looks like it's competing with ABBYY (FlexiCapture) and Kofax.
This plays so well with the theory of AWS taking a slice of all web activity. They are commoditising more and more complex tasks and enabling a huge number of engineers to bootstrap their ideas with amazing tech from day 1. A huge jump from S3/EC2 to this. Commendable.
I sort of agree. But I think the reality is a little closer to Apple's style of innovation. Few of the things that AWS offers are things that didn't exist before. For example, data extraction and image recognition APIs have been around for a long while now from several different providers.
AWS is just aggregating it all into one place and giving it a really good final polish.
Not sure if this is bad news for the Robotic Process Automation (RPA) sector or an opportunity to offload the "Robotic" part while focusing on business process...
Is off-the-shelf open source OCR not reliable for an image of reasonable fidelity, like a smartphone camera picture of a B&W text document?
I ask because it feels like I should have an app that lets me scan with my phone, process the text with OCR, then let me plain text search every scanned document I have.
The first part only natively made it into iOS Notes a year or two ago, but that whole experience above should be out of the box, IMHO…
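For what it's worth, that flow is roughly scriptable today. A minimal sketch, assuming the (real) pytesseract and Pillow packages are installed and the scans live as PNG files in one folder; `build_index` and `search` are hypothetical helper names:

```python
# Sketch: OCR every scanned image into an in-memory {filename: text} index,
# then do plain-text search over the recognized text.
from pathlib import Path

def build_index(scan_dir):
    """Run OCR over every PNG in scan_dir and collect the text per file."""
    import pytesseract          # pip install pytesseract (needs tesseract)
    from PIL import Image       # pip install Pillow
    index = {}
    for path in Path(scan_dir).glob("*.png"):
        index[path.name] = pytesseract.image_to_string(Image.open(path))
    return index

def search(index, query):
    """Case-insensitive substring search; returns matching filenames."""
    q = query.lower()
    return sorted(name for name, text in index.items() if q in text.lower())
```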
No, open source OCR doesn't work that great. I work for a telecom company and we process millions of documents a month; we built everything in house and are now able to process them at almost 40 cents per 1,000 documents.
It's a long process to process documents like payslips, which require text boundary detection, word identification, spatial clustering, and writing parsers (depending on word, segment, and clustering probabilities) that can extract the required fields from the documents.
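To illustrate the spatial clustering step mentioned above: a toy version that groups OCR word boxes into visual lines by vertical proximity, then orders each line left to right. Real pipelines use far more robust methods; this is just to make the idea concrete:

```python
# Sketch: cluster OCR word boxes into lines. Each word is (text, x, y),
# where (x, y) is the box origin. Words whose y is within y_tol of a
# line's anchor are considered part of the same visual line.

def cluster_lines(words, y_tol=10):
    """Return reading-order line strings from a bag of word boxes."""
    lines = []  # each entry: [y_anchor, [(x, text), ...]]
    for text, x, y in sorted(words, key=lambda w: w[2]):
        for line in lines:
            if abs(line[0] - y) <= y_tol:     # same visual line
                line[1].append((x, text))
                break
        else:                                  # no nearby line: start one
            lines.append([y, [(x, text)]])
    return [" ".join(t for _, t in sorted(line[1])) for line in lines]
```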
If this can get me tables out of PDFs generated by Crystal Reports, it would be a godsend for testing. This has been a nightmare to try to solve; the best option so far has been Adobe's cloud offering, but they don't offer an API for that. I'm excited to try it out.
I have a personal flow using tesseract to scan docs into searchable PDFs, but it's not that accurate. One of the main problems is that some (now most?) of the documents are in German, since I live in Germany, but some are in English. There's a way to choose the language, but nothing to auto-detect as far as I'm aware. I was hoping for some cloud AI service with superior OCR and simple integration or CLI (push a PDF and download one with OCR embedded). Google seems to be too complicated, unfortunately... Any tips?
If you're running tesseract locally (i.e. not paying per invocation), run it once with EN and count occurrences of the/this/a/any etc, run it again with DE and count occurrences of der/die/das/um/ab/wie, and go from there?
Edit: Hell, even average word length is probably going to be a good indicator since German is so agglutinative. Collect some factors like this and I think you'll be able to build a pretty good classifier.
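A minimal sketch of that stopword-counting idea. The word lists here are illustrative, not tuned, and `guess_language` is a made-up helper name:

```python
# Sketch: classify OCR output as English or German by counting hits
# against a handful of very common function words in each language.

EN = {"the", "this", "a", "an", "and", "is", "of", "to", "any"}
DE = {"der", "die", "das", "und", "ist", "um", "ab", "wie", "nicht"}

def guess_language(text):
    """Return a tesseract language code based on stopword frequency."""
    tokens = text.lower().split()
    en_hits = sum(t in EN for t in tokens)
    de_hits = sum(t in DE for t in tokens)
    return "eng" if en_hits >= de_hits else "deu"
```

Average word length (or any of the other factors mentioned) could be folded in as extra features if stopword counts alone prove too noisy on short documents.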
In tesseract, if you want to recognize both English and German you can use option -l deu+eng.
If you want to perform language detection you can do the following:
a. Invoke tesseract with "-l eng".
b. Pass the output text to langdetect [1]. It is a port of Google's language detection library to Python which will give you the probabilities of the languages for a given text.
c. Invoke tesseract with "-l langdetect_output"
Note that langdetect generates 2 character codes (ISO 639-1) whereas tesseract expects 3 character codes (ISO 639-2).
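Steps a-c could be wired together roughly like this, assuming tesseract is on the PATH and the langdetect package is installed. The ISO 639-1 to 639-2 mapping below covers only a few languages and the function names are my own:

```python
# Sketch: two-pass OCR with language detection in between.
import subprocess

# langdetect emits ISO 639-1 codes; tesseract wants ISO 639-2 codes.
ISO_639_1_TO_2 = {"en": "eng", "de": "deu", "fr": "fra", "es": "spa"}

def ocr(image_path, lang="eng"):
    """Run tesseract on one image and return the recognized text."""
    return subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True, text=True, check=True,
    ).stdout

def ocr_with_detected_language(image_path):
    from langdetect import detect           # pip install langdetect
    first_pass = ocr(image_path, "eng")     # step a: OCR with English
    code = ISO_639_1_TO_2[detect(first_pass)]  # step b: detect language
    return ocr(image_path, code)            # step c: re-OCR with that language
```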
If you don't absolutely need the integration/CLI, I recommend FineReader (Standard edition). You can specify that the document can contain text from a set of languages (e.g., German and English) and it will auto-detect appropriately. If you need automation (of import, processing, export), this can be done with FineReader Server (formerly known as Recognition Server), but the pricing is quite high for personal use. FineReader Corporate edition has limited automation -- if sufficient for your needs, the pricing might be much more reasonable. I have used the Standard edition and Recognition Server extensively, but have not used the Corporate edition. If you really want a cloud service, you can make your own with their Cloud SDK or use their FineReader Online, but I also have no experience with these.
As for accuracy, the details of your documents and scanning can matter, but, for normal personal usage, it should be very high.
This looks a lot like what I've seen from companies such as InstaBase[1]. Given how hard it is to do well (largely due to poor initial images), I'm curious how Amazon's product offering will work.
A team I'm working with had a lot of success doing this; curious what method(s) they are using.
A little late to the comment party, but I was wondering the same. I'm working on a web scraping workflow that currently uses Tika. I'm very interested to see how well this does in comparison.
I still prefer the Dropbox solution for that, but I'm waiting for them to turn it into an API.
[0] https://aws.amazon.com/textract/faqs/
[1] https://aws.amazon.com/textract/features/
[0] https://hn.algolia.com/?query=pdf%20convert&sort=byPopularit...
- https://news.ycombinator.com/item?id=18199708
- https://news.ycombinator.com/item?id=5487530
just_myles|7 years ago
I do maintain some level of skepticism though. It is OCR :D
Holybeds|7 years ago
For normal text, OCR works well. But automatically understanding what is what is more complex.
RandomBookmarks|7 years ago
It returns table data line by line.
ocrcustomserver|7 years ago
Announcing Amazon Textract, https://www.youtube.com/watch?v=PHX7q4pMGbo
Introducing Amazon Textract: Now in Preview, https://www.youtube.com/watch?v=hagvdqofRU4
Introducing Amazon Hieroglyph: Now in Preview (AIM363), https://www.youtube.com/watch?v=FnZFK_2oqKk
[1]: https://github.com/Mimino666/langdetect
ocrcustomserver|7 years ago
1. How will it deal with multiple templates that the system hasn't seen before, especially when there is significant difference between the templates?
2. UI/UX, e.g. how will it trace the extracted data back to the original source, and how will it show the confidence scores of each entity?
3. The verification process: what will the workflow look like when the confidence score is low and the document has to be checked by human operators?
[1] https://en.wikipedia.org/wiki/Instabase
sbarre|7 years ago
https://tika.apache.org/
jgalt212|7 years ago
https://www.pdfdata.io/
brad0|7 years ago
- Printed text detection
- Handwritten text detection
- Key-Value detection
- Table detection
- Checkbox detection
- Other optical marks (e.g. barcode, QR code)
There's a decent possibility it has handwriting recognition. Not sure about the non-English languages though.
jijji|7 years ago
1. make a "strings" API
2. hook it to a web server
3. profit!