> By converting our HTML templates into TEX formats and utilizing pdflatex for PDF generation, we observed a 10X increase in speed compared to the original Puppeteer way of generating
8*60/10 = 48 minutes. So this speedup alone would've been enough, wouldn't it?
This looks pretty reasonable overall, but it seems like it could have been implemented far more simply.
It appears they are emailing PDF copies of stock market transactions to fulfill a regulatory requirement, so the PDF generation should be deterministic and email delivery of a receipt is an idempotent operation.
Instead of all that scheduled orchestration complexity, they could just queue their PDF jobs to SQS and do the work in a Lambda function. This would also let them queue the PDFs as the transactions are happening instead of continuing to run this as a nightly batch job.
It's also unclear why render + sign + add to mail queue need to be three separate queued tasks. They have to execute serially, so a single worker could do the whole process and checkpoint to S3 after each step, or skip checkpointing entirely, since the output is deterministic anyway.
It would also eliminate their issues with S3 request limits by keeping all the data relevant to a PDF local to 1 worker for the duration of processing.
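A sketch of what that collapsed single worker could look like. Everything AWS-specific (SQS receive, the real signing key, the outbound mailer) is stubbed out with hypothetical placeholders, so this only illustrates the serial shape, not anyone's actual pipeline:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// One job = one contract note. These names are hypothetical.
type Job struct{ TradeID, Body string }

// render, sign, and enqueueMail stand in for the three separately
// queued tasks. Since they must run serially anyway, one worker can
// do all three in sequence.
func render(j Job) []byte { return []byte("%PDF " + j.Body) }

func sign(pdf []byte) string {
	h := sha256.Sum256(pdf) // placeholder for a real digital signature
	return hex.EncodeToString(h[:])
}

var mailQueue []string // stands in for the real outbound mail queue

func enqueueMail(tradeID, sig string) {
	mailQueue = append(mailQueue, tradeID+":"+sig)
}

// ProcessJob runs the whole chain in one worker. Rendering and signing
// are deterministic, so re-running a job after a crash produces the
// same PDF and signature: the work is safely retryable without
// intermediate S3 checkpoints.
func ProcessJob(j Job) string {
	pdf := render(j)
	sig := sign(pdf)
	enqueueMail(j.TradeID, sig)
	return sig
}

func main() {
	fmt.Println(ProcessJob(Job{TradeID: "T-1", Body: "buy 10 XYZ"}))
}
```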
I’m somebody who never really used Go, but one thing that struck me when reading it was how much simpler it seems than almost any other language. GC issues aside, I can definitely see why Go is a solid option for projects like this, where you have some wrangling task and need something faster than Python but Rust is too finicky.
This seems to be a standard "serial processing to parallel distributed processing" story, with a job generator, broker, processors, etc., of the kind every grandma used to share.
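To be fair, that grandma story is only a few dozen lines in Go: a channel plays the broker, goroutines play the processors. A minimal sketch, not anyone's production code:

```go
package main

import (
	"fmt"
	"sync"
)

// RunPipeline wires a job generator to a pool of worker goroutines
// through a channel acting as the broker.
func RunPipeline(jobs []int, workers int) []int {
	in := make(chan int)            // the "broker"
	out := make(chan int, len(jobs)) // buffered so workers never block
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range in {
				out <- j * j // stand-in for real processing
			}
		}()
	}

	for _, j := range jobs { // the "job generator"
		in <- j
	}
	close(in)
	wg.Wait()
	close(out)

	var results []int
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	// Result order is nondeterministic with more than one worker.
	fmt.Println(RunPipeline([]int{1, 2, 3, 4}, 2))
}
```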
We need a standard platform for benchmarking, they don't even tell us their hardware. I vote rPi 5. It's easy to get, it's the same everywhere, and you can't just blast out whatever numbers you pull from your ass.
Does anyone have experience using Haraka vs a service like SendGrid? I'm curious whether there are any challenges with mail deliverability, such as the Haraka instance being flagged for spam.
There's definitely more work required to set up, warm up your sending IPs, and deal with blocklists when self-hosting. Basic stuff like DKIM/SPF/DMARC is a must in both cases.
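For reference, those three records are just DNS TXT entries, roughly like this (`example.com`, the `mail` DKIM selector, and the relay host are placeholders; the DKIM public key is whatever your MTA generates):

```
example.com.                  TXT  "v=spf1 include:_spf.example-relay.com ip4:203.0.113.10 -all"
mail._domainkey.example.com.  TXT  "v=DKIM1; k=rsa; p=<base64 public key>"
_dmarc.example.com.           TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"
```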
As a Postmark customer, I strongly recommend switching to their service if you want better deliverability than SendGrid, for example. However, if you have a poorly sanitized email audience, you'll suffer a lot of pain initially to get the bounce ratio down.
It's like switching from a gym where you're on your own and no one cares to one where you're yelled at if you miss your training. Naturally, the latter delivers much better results, but there is an adaptation period that makes a lot of people give up.
[+] [-] rbdixon|2 years ago|reply
Crazy example of how fast Typst is: a 30”x30” document with 2600 tiny images generates a 40MB PDF in under half a second.
Check Typst out. It’s amazing. Not quite a LaTeX equivalent in some ways, but it's moving fast on an amazing foundation.
[+] [-] mr-karan|2 years ago|reply
> For the contract note generation job, we spawn about 40 instances in total currently, a mix of c6a.8xlarge, c6a.2xlarge, and c6a.4xlarge.
[+] [-] mmh0000|2 years ago|reply
https://weasyprint.org/
https://github.com/Kozea/WeasyPrint
[+] [-] koliber|2 years ago|reply
As a very rough benchmark, print this page to a PDF file to see how long it takes.