top | item 42350865


zkid18 | 1 year ago

Great job! It seems you have around 200k companies listed. How do you handle scraping at that scale, given that every website is different? What happens when the schema and markup change? Interested to hear what the DevOps side of this looks like.


Jabbs | 1 year ago

Thank you so much. In some cases I was able to standardize where the title and location appear on the page (Greenhouse, Lever, etc.). But mostly this relies on a validated dataset of job descriptions: the scraper matches lists of phrases against the plain text of a page (with markup removed). The scraper also remembers which companies and career pages have job listings, and it prioritizes those, visiting them more often than companies that don't. Currently there are 4 worker services, each visiting about 10k company websites per day.
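A minimal sketch of the phrase-matching idea described above, using only the Python standard library. The phrase list, the threshold, and the helper names (`TextExtractor`, `looks_like_job_listing`) are hypothetical illustrations, not the author's actual implementation:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the plain text of an HTML page, dropping tags, scripts, and styles."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # inside <script>/<style> when > 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


# Hypothetical phrase list; in practice this would come from a validated
# dataset of real job descriptions.
JOB_PHRASES = ["apply now", "full-time", "remote", "senior software engineer"]


def looks_like_job_listing(html: str, threshold: int = 2) -> bool:
    """Strip markup, then count how many known job phrases appear in the text."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    hits = sum(1 for phrase in JOB_PHRASES if phrase in text)
    return hits >= threshold


page = """<html><body><h1>Senior Software Engineer</h1>
<p>Remote. Apply now!</p><script>var tracking = 1;</script></body></html>"""
print(looks_like_job_listing(page))  # → True (3 phrases match)
```

A real crawler would add per-page weighting and dedup, but the core signal is just "does the markup-stripped text contain enough phrases from the reference set."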

nunez | 1 year ago

I didn't build this, but here's my guess: most companies use a handful of ATSs (applicant tracking systems), like Greenhouse, Lever, and Workday. Almost all of the jobs posted on these platforms are public, and their top-level pages are indexable.

If I built something like this, I would start by searching for pages that contain HTML fragments indicative of those systems, a few times per week (since job listings don't change that often).
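The fragment-search idea above might look something like this sketch. The hostnames are the real public domains those ATS vendors use for hosted job boards, but the fingerprint table and `detect_ats` helper are assumptions for illustration:

```python
import re

# Fingerprints that commonly appear in pages backed by a hosted ATS.
# boards.greenhouse.io, jobs.lever.co, and myworkdayjobs.com are the
# public hostnames those vendors use for hosted job boards.
ATS_SIGNATURES = {
    "greenhouse": re.compile(r"boards\.greenhouse\.io"),
    "lever": re.compile(r"jobs\.lever\.co"),
    "workday": re.compile(r"myworkdayjobs\.com"),
}


def detect_ats(html: str):
    """Return the name of the first ATS whose fingerprint appears in the page, or None."""
    for name, pattern in ATS_SIGNATURES.items():
        if pattern.search(html):
            return name
    return None


page = '<a href="https://boards.greenhouse.io/acme">See our open roles</a>'
print(detect_ats(page))  # → greenhouse
```

Once a page matches a fingerprint, the crawler can fall back to that vendor's known page structure (or public API) instead of scraping arbitrary HTML.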

While this won't do anything to reveal "real" ghost jobs (job reqs that are hidden, or generic enough to be reused for referrals), it's probably a minor edge over LinkedIn Jobs (the home of stale jobs). Many of these companies cross-post to those platforms anyway.