chrash|1 year ago

this might be my first comment here heh.

i've worked on a similar product before.

there's no way they were turning a profit. they definitely missed stuff all the time even with a ton of sensors. and sensors aren't the only cost: annotation is by far the most costly operational expense. new product? it needs several annotated photos and recalibrated weight sensors. merchant decides to put Christmas branding on the same UPC? now all your vision models are poisoned for that product. it needs to be re-annotated for the month and a half it exists, and the models need to be swapped out once inventory changes over again. as long as merchants keep redesigning products (always), your datasets will be in a constant state of decay. even if your vision sensors are stationary and know the modular design up front, you still need to generalize somewhat in case things get misplaced (a big problem for weight sensors) or a camera gets bumped.

between dataset management, technology costs, research costs, rote operational costs, etc., this is a very expensive problem to solve. and large models with a ton of parameters are little help; they may lower annotation costs a bit, but they increase the cost of compute.

once i really dug into this problem i saw Amazon Go's Just Walk Out for what it really was: a marketing stunt

hackernewds|1 year ago

the biggest cost is not annotators at the scale you're imagining. it is labor costs.

Amazon bet that the federal govt would raise the minimum wage to $20/hr and that all their competitors (everyone except Amazon, with this tech) would get wiped out by the labor costs. They even publicly campaigned and lobbied for it. That didn't come to fruition: the election promises turned to fluff, and the populists chose to empower unions instead.

chrash|1 year ago

i mean, labor cost (as in in-store labor) is the target of this cost optimization. unfortunately, for the time being, labor costs aren't as significant as the other costs associated with annotation and dataset curation. technology costs aren't really significant if this can be pulled off at scale.

in-store employees know where things are supposed to be and whether (and why) items are "misplaced" according to the modular design

beefnugs|1 year ago

I think this failure, and Tesla's failure to actually ship "self driving", makes it clear: machine learning definitely has complexity limits, and we are a long way from overcoming them or even getting past some reasonable threshold.

whoitwas|1 year ago

If you read the article, it was powered by 1000 cashiers in India, no sensors.

wk_end|1 year ago

> The company’s senior vice president of grocery stores says they’re moving away from Just Walk Out, which relied on *cameras and sensors* to track what people were leaving the store with. [emphasis mine]

> Though it seemed completely automated, Just Walk Out relied on more than 1,000 people in India watching and labeling videos to ensure accurate checkouts.

> According to The Information, 700 out of 1,000 Just Walk Out sales required human reviewers as of 2022. This widely missed Amazon’s internal goals of reaching less than 50 reviews per 1,000 sales. Amazon called this characterization inaccurate, and disputes how many purchases require reviews.

> “The primary role of our Machine Learning data associates is to annotate video images, which is necessary for continuously improving the underlying machine learning model powering [Just Walk Out],” said an Amazon spokesperson to Gizmodo. However, the spokesperson acknowledged these associates validate “a small minority” of shopping visits when AI can’t determine a purchase.

The article is kind of all over the place, but it sounds like there were lots of sensors and also lots of human intervention.

wiricon|1 year ago

How well does simulated data work in this space? My first stab at doing this scalably: given a new product, physically obtain a single instance of it (or ideally a 3D model, though that seems like a big ask from manufacturers at this stage), capture images from every conceivable angle under a variety of lighting conditions (you could automate this capture pretty well with a robotic arm to rotate the object and some kind of lighting rig), get an instance mask for each image (via a human annotator, 3D reconstruction, or a FG/BG segmentation model), paste those instances onto random background images (e.g. from any large image dataset), add distractor objects and other augmentations, and finally train a model on the resulting dataset. It helps that many grocery items are rigid (boxes, bottles, etc.) and always look the same; you'd need a lot more captured variety for non-rigid things like fruit and veg, and you'd need to account for changing packaging as well.
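To make the copy-paste step concrete, here's a minimal sketch (the directory layout and dataset size are assumptions, and a real pipeline would add blending, distractors, and much heavier augmentation):

```python
# minimal copy-paste compositing sketch: masked product cutouts (RGBA PNGs)
# pasted onto random backgrounds. paths and counts are hypothetical.
import random
from pathlib import Path

from PIL import Image  # pip install pillow

def composite(instance_rgba: Image.Image, background: Image.Image) -> Image.Image:
    """Paste one masked product instance onto a background at a random pose."""
    bg = background.convert("RGB").copy()
    inst = instance_rgba.convert("RGBA")

    # random scale and in-plane rotation stand in for the captured viewing angles
    scale = random.uniform(0.3, 0.9)
    inst = inst.resize((max(1, int(inst.width * scale)),
                        max(1, int(inst.height * scale))))
    inst = inst.rotate(random.uniform(0, 360), expand=True)

    # random placement; the alpha channel (the instance mask) does the cutout
    x = random.randint(0, max(0, bg.width - inst.width))
    y = random.randint(0, max(0, bg.height - inst.height))
    bg.paste(inst, (x, y), mask=inst)
    return bg

instances = list(Path("instances").glob("*.png"))      # RGBA product cutouts
backgrounds = list(Path("backgrounds").glob("*.jpg"))  # arbitrary scene photos
Path("synthetic").mkdir(exist_ok=True)

for i in range(10_000):  # size of the synthetic training set
    img = composite(Image.open(random.choice(instances)),
                    Image.open(random.choice(backgrounds)))
    img.save(f"synthetic/{i:06d}.jpg")
```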

chrash|1 year ago

as mentioned in another comment, "scale" is not just horizontal, it's vertical as well. with millions of products (UPCs) across different visual tolerances it's hard to generalize. your annotation method is indeed more efficient than a multistep "go take a bunch of pictures and upload them to our servers for annotators" flow, but it's still costly in terms of stakeholder buy-in, R&D, hardware, and indeed labor. if you can scope your vertical such that you only have, say, 1000 products, the problem becomes feasible, but once you scale to an actual grocery store or bodega with ever-shifting visual data requirements, it stops scaling well. add in the detail that every store moves merchandise at different rates or carries localized merchandise, and the problem becomes even more complex.

the simulated data also becomes a cost issue. we have to produce a digital twin realistic enough (at least according to the model) that it doesn't interfere too much with real data, and quantifying that gap matters when the distinction you care about is between Lay's and Lay's Low Sodium.

i'm not saying it's unsolvable. it's just a difficult problem

londons_explore|1 year ago

I want to know why these places don't simply dramatically drop the accuracy requirement.

Rather than giving itemized receipts, give just a total dollar value, and make sure that most customers are charged within 10% of the correct amount.

Only if the customer requests an itemized receipt do you go watch the video and generate one. After a while most customers won't bother, which means you can just guess at a dollar amount, and as long as you're close-ish (which should be easy based on weight and past purchase history), that's fine.
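As a toy sketch of how that guess might work (every name and number here is made up, just to show blending weight data with purchase history):

```python
# estimate a basket total from shelf weight sensors plus the customer's
# historical price-per-gram; hypothetical numbers throughout.
def estimate_charge(weight_removed_g: float,
                    hist_price_per_g: float,
                    hist_confidence: float = 0.8) -> float:
    """Blend a history-based $/gram with a store-wide fallback average."""
    fallback_per_g = 0.01  # assumed store-wide average: $0.01/gram
    per_g = hist_confidence * hist_price_per_g + (1 - hist_confidence) * fallback_per_g
    return round(weight_removed_g * per_g, 2)

# a shopper removes 850 g of goods; their history averages $0.012/gram
print(estimate_charge(850, 0.012))  # 9.86
```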

leroy-is-here|1 year ago

The problem is taxes. Different items can be taxed differently, so itemized receipts are a must from the get-go, unless you want to explain to the government that you underpaid them because you charge within a 10% error margin.
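To illustrate with made-up rates: two baskets with the same $10 total can owe different tax, so the register has to know the items, not just the sum:

```python
# toy per-category tax rates (hypothetical); a flat guess at the total
# can't tell you how much tax to remit.
TAX_RATE = {"grocery": 0.00, "prepared_food": 0.08, "general": 0.06}

def tax_owed(basket: list[tuple[str, float]]) -> float:
    return round(sum(price * TAX_RATE[category] for category, price in basket), 2)

print(tax_owed([("grocery", 10.00)]))                          # 0.0
print(tax_owed([("prepared_food", 5.00), ("general", 5.00)]))  # 0.7
```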

jrpt|1 year ago

Come on, it doesn’t have anything to do with being a “marketing stunt.” Products like this are often expected to lose money at first, with the hope that enough R&D and scale will eventually make them successful.

For example, you point out that annotation is costly, but that’s an expense that scales independently of the number of stores, so with enough stores it wouldn’t be as big a deal. Or they might figure out some R&D that improves it.

chrash|1 year ago

right, that's how it starts. but the improvements in methodology simply aren't there, as the ML sector has been laser-focused on generality in modeling (GenAI, as it's affectionately known). "at scale" doesn't just mean more stores; it means more products and thus more annotation. how many UPCs do you figure there are in a given Target or Whole Foods? i assure you it's in the millions.

one advantage of the Amazon Go initiative is its smaller scope of products.

shay_ker|1 year ago

are larger image/video models unable to catch things like Christmas branding?

chrash|1 year ago

a big problem in this space is that products that look very similar are clustered in the same section. large models are very good at generalizing, so they may be more attuned to "this is a Christmas thing", but they won't know it should be classified as the same UPC as the thing that was in that spot yesterday unless you specifically tell them. how would a model know it's not a misplaced product or a random piece of trash? (you wouldn't believe the things you find on store shelves once you start looking.) you can definitely speed up annotation with something like SAM[1], but without training or context it will never know that this is the same product, just in Christmas packaging (i.e. it resolves to the same UPC).

[1]: https://github.com/facebookresearch/segment-anything
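for illustration, a minimal sketch of what SAM-assisted pre-annotation might look like (checkpoint and image paths are assumptions; SAM proposes masks, but a human still has to attach the UPCs):

```python
# SAM proposes class-agnostic instance masks; mapping each mask to a UPC
# still needs a human or a separately trained classifier.
# pip install torch opencv-python git+https://github.com/facebookresearch/segment-anything
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("shelf.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # dicts with 'segmentation', 'bbox', 'area', ...

# the annotation queue: a human assigns a UPC to each candidate region
for m in sorted(masks, key=lambda m: m["area"], reverse=True):
    x, y, w, h = m["bbox"]  # XYWH
    print(f"candidate region at ({x},{y}) size {w}x{h} -> needs a UPC label")
```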

hotpockets|1 year ago

but what if everything cost the same, like a dollar store?