1.5M PDFs in 25 Minutes

63 points | vishnumohandas | 2 years ago | zerodha.tech

19 comments

[+] dzhiurgis|2 years ago|reply
> taking some 8 hours daily

> By converting our HTML templates into TEX formats and utilizing pdflatex for PDF generation, we observed a 10X increase in speed compared to the original Puppeteer way of generating

8*60/10 = 48 minutes. So this speedup alone would've been enough, wouldn't it?

[+] rbdixon|2 years ago|reply
I have a complex document generation tool based on LaTeX. I’ve been working on replacing it with Typst. Typst is simply 10x faster.

Crazy example of how fast Typst is: a 30”x30” document with 2600 tiny images generates a 40MB PDF in under half a second.

Check Typst out. It’s amazing. Not quite equivalent to LaTeX in some ways, but moving fast on an amazing foundation.

[+] tmnvix|2 years ago|reply
That would just be for the PDF generation. The 8 hours would also include the signing and emailing.
[+] Shakahs|2 years ago|reply
This looks pretty reasonable overall, but it seems like it could have been implemented far more simply.

It appears they are emailing PDF copies of stock market transactions to fulfill a regulatory requirement, so the PDF generation should be deterministic and email delivery of a receipt is an idempotent operation.

Instead of all that scheduled orchestration complexity, they could just queue their PDF jobs to SQS and do the work in a Lambda function. This would also let them queue the PDFs as the transactions are happening instead of continuing to run this as a nightly batch job.

It's also unclear why render + sign + add to mail queue need to be 3 separate queued tasks. They have to execute serially, so a single worker could do the whole process and checkpoint to S3 after each step, or skip the checkpoints entirely, because the output is deterministic anyway.

It would also eliminate their issues with S3 request limits by keeping all the data relevant to a PDF local to 1 worker for the duration of processing.
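
A minimal Go sketch of the single-worker idea above, under stated assumptions: `Store` is a hypothetical in-memory stand-in for S3, `render` and `sign` are placeholders rather than real PDF or signing code, and the outbox map stands in for the mail queue. It only illustrates that deterministic steps plus keyed checkpoints make a retry harmless; it is not how the article's system is built.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Store is a hypothetical stand-in for a checkpoint store such as S3.
type Store map[string][]byte

// step runs fn only if its output is not already checkpointed under key.
func step(s Store, key string, fn func() []byte) []byte {
	if out, ok := s[key]; ok {
		return out // reuse the earlier result on retry
	}
	out := fn()
	s[key] = out
	return out
}

// render is a placeholder for deterministic PDF generation.
func render(txn string) []byte { return []byte("PDF:" + txn) }

// sign is a placeholder for digital signing; it only appends a SHA-256 digest.
func sign(pdf []byte) []byte {
	sum := sha256.Sum256(pdf)
	return append(append([]byte{}, pdf...), sum[:]...)
}

// process runs render -> sign -> enqueue serially inside one worker.
// The outbox is keyed by transaction ID, so enqueueing twice is a no-op.
func process(s Store, outbox map[string][]byte, txn string) {
	pdf := step(s, txn+"/rendered", func() []byte { return render(txn) })
	signed := step(s, txn+"/signed", func() []byte { return sign(pdf) })
	outbox[txn] = signed
}

func main() {
	store, outbox := Store{}, map[string][]byte{}
	process(store, outbox, "txn-1")
	process(store, outbox, "txn-1") // retry: checkpoints are reused, no duplicate mail
	fmt.Println(len(store), len(outbox))
}
```

Because every intermediate result stays with one worker, the only shared-storage traffic is the optional checkpoint writes.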

[+] bionhoward|2 years ago|reply
I’m somebody who never really used Go, but one thing that struck me when parsing Go was how much simpler it seems than almost any other language. GC issues aside, I can definitely see why Go is a solid option for projects like this, where you have some wrangling task and need something faster than Python but Rust is too finicky.
[+] shreyansh_k|2 years ago|reply
This seems to be the standard "serial processing to parallel distributed processing" story (job generator, broker, processor, etc.) that every grandma used to tell.
[+] ByQuyzzy|2 years ago|reply
We need a standard platform for benchmarking; they don't even tell us their hardware. I vote Raspberry Pi 5. It's easy to get, it's the same everywhere, and you can't just blast out whatever numbers you pull from your ass.
[+] mr-karan|2 years ago|reply
Co-author of the blog post here. Quoting from the article:

> For the contract note generation job, we spawn about 40 instances in total currently, a mix of c6a.8xlarge, c6a.2xlarge, and c6a.4xlarge.

[+] nesarkvechnep|2 years ago|reply
Bringing in libraries for problems already solved on the BEAM. I don't know why people keep using Go for distributed work.
[+] sss111|2 years ago|reply
Is 1.5M PDFs in 25 minutes supposed to be crazy? I can't wrap my head around it (it seems incredibly slow).
[+] koliber|2 years ago|reply
That’s generating 60,000 PDFs per minute, or 1,000 per second. That is not slow.

As a very rough benchmark, print this page to a PDF file to see how long it takes.
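
The arithmetic above can be checked with a short Go sketch. The helper names are hypothetical, and the per-instance figure assumes the "about 40 instances" quoted elsewhere in this thread are equally loaded, which is a simplification since they are of mixed sizes.

```go
package main

import "fmt"

// ratePerMinute returns documents per minute for a batch (hypothetical helper).
func ratePerMinute(total, minutes int) int { return total / minutes }

// ratePerSecond returns documents per second for the same batch.
func ratePerSecond(total, minutes int) int { return total / (minutes * 60) }

// perInstance spreads a per-second rate evenly over n instances.
// Even spreading is an assumption, not something the article states.
func perInstance(perSecond, n int) int { return perSecond / n }

func main() {
	perMin := ratePerMinute(1_500_000, 25)
	perSec := ratePerSecond(1_500_000, 25)
	fmt.Println(perMin, perSec, perInstance(perSec, 40))
}
```

Under those assumptions each instance averages roughly 25 PDFs per second, including signing.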

[+] d_sc|2 years ago|reply
Does anyone have experience using Haraka vs. a service like SendGrid? I'm curious whether there are any challenges with mail deliverability, such as the Haraka instance being flagged for spam.
[+] Avamander|2 years ago|reply
There's definitely more work required to set up, warm up and to deal with blocklists when self-hosting. Basic stuff like DKIM/SPF/DMARC is a must in both cases.
[+] giovannibonetti|2 years ago|reply
As a Postmark customer, I strongly recommend switching to their service if you want better deliverability than SendGrid, for example. However, if you have a poorly sanitized email audience, you'll suffer a lot of pain initially to reduce the bounce ratio.

It's like switching from a gym where you're on your own and no one cares to one where you're yelled at if you miss your training. Naturally, the latter delivers much better results, but there is an adaptation period which makes a lot of people give up.