top | item 28268189

Orou | 4 years ago

Most ML work requires collecting, cleaning, and transforming datasets into something a model can train on for a specific domain. Codex and Copilot aren't good examples of this because they train on terabytes of public code repos, meaning there is no code-cleaning step: they rely on sheer data volume to dilute the 'unclean' data (think buggy code written by a human) out of the model.

These are really the exception rather than the rule when it comes to collecting data for ML/AI applications.
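As a rough illustration of the collect/clean/transform pipeline described above, here is a minimal sketch on a toy tabular dataset. All names and data here are hypothetical, not from any real project:

```python
# Hypothetical sketch of the clean -> transform steps most ML work needs.

def clean(rows):
    """Drop records with missing fields (the 'cleaning' step)."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows, key):
    """Min-max scale one numeric column into [0, 1] (the 'transforming' step)."""
    vals = [r[key] for r in rows]
    lo, hi = min(vals), max(vals)
    span = (hi - lo) or 1.0  # avoid division by zero on constant columns
    return [{**r, key: (r[key] - lo) / span} for r in rows]

# Raw "collected" data: one record has a missing label and gets filtered out.
raw = [
    {"age": 20, "label": 0},
    {"age": 40, "label": None},  # unclean record
    {"age": 60, "label": 1},
]

dataset = transform(clean(raw), "age")
print(dataset)
```

The contrast with Codex/Copilot is that this explicit `clean` step is skipped there; the buggy records stay in and the model is expected to average them out.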
