(no title)
kuzee|4 years ago
I created Cogmint.com ("cognition minting") to solve this problem for myself.
You can submit known-correct answers for questions, and those questions then serve as ground truth for scoring worker accuracy. Workers are scored on their similarity to the known-correct answers and to other workers who have answered accurately. It works surprisingly well for how simple it is. It's been a fun challenge to create simple methods of scoring similarity across different task types.
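As a rough illustration of the gold-standard idea for classification tasks, here is a minimal sketch. It is not Cogmint's actual implementation: the function name `score_workers`, the data shapes, and the use of plain accuracy as the score are all assumptions made for the example.

```python
from collections import defaultdict

def score_workers(gold, responses):
    """Return each worker's accuracy on gold-standard questions.

    gold: dict mapping question_id -> known-correct label (hypothetical shape)
    responses: iterable of (worker_id, question_id, label) tuples
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker, question, label in responses:
        if question not in gold:
            continue  # only gold questions count toward the score
        seen[worker] += 1
        if label == gold[question]:
            correct[worker] += 1
    # Workers who never answered a gold question get no score at all.
    return {worker: correct[worker] / seen[worker] for worker in seen}

gold = {"q1": "cat", "q2": "dog"}
responses = [
    ("alice", "q1", "cat"), ("alice", "q2", "dog"),
    ("bob", "q1", "cat"), ("bob", "q2", "cat"),
    ("bob", "q3", "dog"),  # q3 has no gold answer, so it is ignored
]
scores = score_workers(gold, responses)
```

A worker's score could then be used to weight their answers on questions that have no known answer.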
It's a side project, so don't rely on it for mission-critical things, but I rely on it for some production tasks, so it's stable.
It currently supports classification (choose from a set of possible answers) and has beta support for bounding box task types. String input task types are coming very soon.
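For bounding box tasks, one common way to score similarity between a worker's box and a known-correct box is intersection over union (IoU). The sketch below is only an assumption about what such a measure could look like, not a description of how Cogmint scores boxes; the `(x_min, y_min, x_max, y_max)` box layout is also assumed.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this layout is an assumption.
    """
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

worker_box = (1, 1, 3, 3)
gold_box = (0, 0, 2, 2)
similarity = iou(worker_box, gold_box)
```

A score of 1.0 means the boxes match exactly and 0.0 means they do not overlap.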
I'd love to see if it can help you out, and I'll waive the fees. I'm not in it for the money; I just like making things useful and reliable. Reach out and say hi!
georgeutsin|4 years ago
My undergrad thesis was to build https://tagbull.com, where we tried to have turkers validate the work of other turkers by breaking up a label into subtasks and getting multi-turker consensus on those before moving forward.
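The consensus step on a single subtask can be sketched roughly as follows. This is a hypothetical illustration, not TagBull's code: the `consensus` function, the minimum vote count, and the agreement threshold are all assumed values.

```python
from collections import Counter

def consensus(labels, min_votes=3, min_agreement=0.6):
    """Return the agreed label for one subtask, or None if there is no consensus.

    min_votes and min_agreement are illustrative thresholds, not known values.
    """
    if len(labels) < min_votes:
        return None  # not enough workers have answered yet
    label, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= min_agreement:
        return label
    return None  # workers disagree too much; collect more answers

agreed = consensus(["car", "car", "truck"])
split = consensus(["car", "truck", "bus"])
too_few = consensus(["car", "car"])
```

Only subtasks that return a label would move on to the next step; the rest would be sent to more workers.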
The main issue we ran into is that the incentive system is incredibly misaligned with the responsibility that the turkers have. It’s very difficult to build trust, especially with a crowd of people who haven’t signed contracts, and who face virtually no repercussions for doing bad work, whether intentionally or unintentionally.
webmaven|4 years ago
Have you noticed problems that show up with questions whose answers have a bimodal distribution (i.e., the gold-standard question actually has two or more correct answers)?
In one sense, this is just a labeling quality problem with the 'gold standard' data, but to a lesser extent these same issues may crop up in the data being labeled when using similarity or clustering to rate or classify the workers and transitively apply that to the other results they produce.
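One simple way to look for such questions is to flag any question where the second most common answer still gets a large share of the votes. The sketch below assumes classification answers; the function name `ambiguous_questions` and the 30% threshold are made up for illustration.

```python
from collections import Counter

def ambiguous_questions(answers_by_question, min_share=0.3):
    """Flag questions whose runner-up answer has at least min_share of the votes.

    answers_by_question: dict mapping question_id -> list of worker labels.
    Returns a dict mapping question_id -> (top label, runner-up label).
    """
    flagged = {}
    for question, answers in answers_by_question.items():
        top_two = Counter(answers).most_common(2)
        if len(top_two) < 2:
            continue  # everyone gave the same answer
        runner_up_label, runner_up_count = top_two[1]
        if runner_up_count / len(answers) >= min_share:
            flagged[question] = (top_two[0][0], runner_up_label)
    return flagged

answers = {
    "q1": ["a"] * 5 + ["b"] * 4 + ["c"],  # split between "a" and "b"
    "q2": ["a"] * 9 + ["b"],              # clear majority
}
flagged = ambiguous_questions(answers)
```

Flagged gold questions could then be reviewed by hand before they are used to score workers.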
kuzee|4 years ago
I haven't dug into the data on this across the platform, but you've given me the idea to look for evidence of it and see if I can improve things somehow. The number of projects is only in the low hundreds, so I might be able to find some that have this problem.
sokoloff|4 years ago