Great idea, simple and effective.
Tiny bit of feedback: it seems some listings use "unit count" for the number of balls; look at the most expensive listing for an example. Annoyingly, the second most expensive listing has the number of dozens in the unit count instead.
rockdiesel|12 days ago
Any thoughts? Should I default to what's in the product title instead of the unit count? I'm not sure of the best way to handle this.
Propelloni|12 days ago
hluska|12 days ago
Consider the top four most expensive golf balls on your current list:
TaylorMade 2021 TP5x (3+1 Box) 4DZ Golf Ball Pack, White: uses 4DZ in the title, 48.0 in unit count in product specs.
Bridgestone Golf Tour B RXS Quadfecta: nothing in the title, unit count in product specs is 4.0. This one shows 4 dozen in a different spot than other balls.
TaylorMade Golf 2024 TP5 Golf Balls 3+1 Box Four Dozen: "Four Dozen" in the title, unit count in product specs is 1.0, but it has 4.0 dozen in the same div as the Bridgestone balls.
Srixon Z Star Yellow Golf Balls - Buy 2 DZ Get 1 DZ Free: the title shows buy 2 DZ get 1 free. That’s represented as 2+1 or 3+1 in other data. In product specs it shows a unit count of 1.0.
In that extremely limited sample, the product weight is a pretty good signal that the unit count is flawed, though that only works in comparison to other listings. I wonder if you could do a multi-pass approach, where you sort the data first, then do a unit count versus weight check to find outliers, and then start rocking through the titles of those outliers? You’ll still end up digging through a lot of edge cases, and that won’t be much fun, but a multi-pass approach would at least give you some insight into those weird edge cases.
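A rough sketch of what that multi-pass idea could look like, in Python. Everything here is an assumption for illustration: the `LISTINGS` rows are made-up data (not scraped values), the field names `title`, `unit_count`, and `weight_g` are hypothetical, and the `tolerance` of 2.0 is an arbitrary starting point. It uses the median weight per claimed unit instead of an explicit sort, and the title patterns only cover the styles seen in the four examples above.

```python
import re
from statistics import median

# Hypothetical word-to-number map for titles such as "Four Dozen".
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}


def dozens_from_title(title):
    """Best-effort guess at the number of dozens named in a title, or None."""
    t = title.lower()
    # "Buy 2 DZ Get 1 DZ Free" -> 2 + 1
    m = re.search(r"buy\s+(\d+)\s*(?:dz|dozen)\s+get\s+(\d+)", t)
    if m:
        return int(m.group(1)) + int(m.group(2))
    # "4DZ", "4 dz", "4 dozen"
    m = re.search(r"(\d+)\s*(?:dz|dozen)\b", t)
    if m:
        return int(m.group(1))
    # "four dozen"
    m = re.search(r"\b(" + "|".join(WORDS) + r")\s+dozen\b", t)
    if m:
        return WORDS[m.group(1)]
    return None


def weight_outliers(listings, tolerance=2.0):
    """Pass 1: flag listings whose weight per claimed unit is far from the median."""
    per_unit = [item["weight_g"] / item["unit_count"] for item in listings]
    typical = median(per_unit)
    return [
        item
        for item, w in zip(listings, per_unit)
        if w > typical * tolerance or w < typical / tolerance
    ]


def resolve(listings):
    """Pass 2: for flagged listings only, fall back to the title.

    Maps title -> estimated ball count; None means "needs a manual look".
    """
    result = {}
    for item in weight_outliers(listings):
        dozens = dozens_from_title(item["title"])
        result[item["title"]] = dozens * 12 if dozens is not None else None
    return result


# Made-up sample rows, loosely shaped like the cases discussed above.
LISTINGS = [
    {"title": "Brand A Golf Balls, 12 Pack", "unit_count": 12, "weight_g": 600},
    {"title": "Brand B Golf Balls, 12 Pack", "unit_count": 12, "weight_g": 610},
    {"title": "Brand C Golf Balls 2 Dozen", "unit_count": 24, "weight_g": 1200},
    {"title": "Brand D 4DZ Golf Ball Pack", "unit_count": 48, "weight_g": 2400},
    {"title": "Brand E Golf Balls Four Dozen", "unit_count": 1, "weight_g": 2400},
    {"title": "Brand F Buy 2 DZ Get 1 DZ Free", "unit_count": 1, "weight_g": 1800},
    {"title": "Brand G Quadfecta", "unit_count": 4, "weight_g": 2400},
]

if __name__ == "__main__":
    for title, balls in resolve(LISTINGS).items():
        print(title, "->", balls if balls is not None else "manual review")
```

The weight check only works if most listings have a sane unit count, since the median is the baseline. A title like the Quadfecta one, with no dozen count in it at all, still falls through to manual review, which is probably the honest outcome for those edge cases.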
datsci_est_2015|12 days ago
See also: toilet paper sheet count comparisons.
fultonn|12 days ago
tonygrue|12 days ago