lignuist | 11 years ago

I used that strategy for parsing gigabytes of CSVs containing arbitrary natural language from the web - try to get these files fixed, or figure out a grammar for gigabytes of fuzzy data...

My approach has never failed me, so telling me that my strategy does not work is a strong claim when it has reliably done the job for me.

Your examples are all valid, but what you are describing are theoretical attacks on the method, while the method works in almost all cases in practice. We are talking about two different viewpoints: dealing with large amounts of messy data on one hand and parser theory in an ideal cosmos on the other hand.

zAy0LfpBZLC8mAC | 11 years ago

How do you know that the strategy worked reliably if you never compared the results to the results obtained using a reliable method (which you presumably didn't, because then you could just have used the reliable method)? The larger the data you have to deal with, the more likely it is that corner cases will occur in it, and the less likely that you will notice anomalies, thus the more important that you are very strict in your logic if you want to derive any meaningful results.

As such, the two viewpoints really are: not really caring about the soundness of your results and solving the actual problem.

Now, maybe you really can show that the bugs in the methods you use only cause negligible noise in your results, in which case it might be perfectly fine to use those methods. But just ignoring errors in your deduction process because you don't feel like doing the work of actually solving the problem at hand is not pragmatism. You'll have to at least demonstrate that your approach does not invalidate the result.
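One way to make that demonstration concrete is to measure how often a quick method disagrees with a stricter parser on a sample of the data. The sketch below is only an illustration: `naive_split` and `disagreement_rate` are hypothetical names, the "quick method" is assumed to be a plain split on commas, and Python's `csv` module stands in for the reliable reference.

```python
import csv

def naive_split(line):
    # Hypothetical stand-in for the quick method: split on every comma.
    return line.split(",")

def disagreement_rate(text):
    """Fraction of non-empty lines where the naive split and csv.reader differ.

    Assumption: one record per line. Quoted fields that contain newlines
    are not handled by this line-by-line comparison.
    """
    lines = [line for line in text.splitlines() if line]
    if not lines:
        return 0.0
    mismatches = 0
    for line in lines:
        reference = next(csv.reader([line]))
        if naive_split(line) != reference:
            mismatches += 1
    return mismatches / len(lines)

sample = 'a,b,c\n"x,y",z\n1,2,3\n'
print(disagreement_rate(sample))
```

If the rate on a representative sample is small enough for the analysis at hand, that is an argument for the quick method; if it is not measured at all, the reliability claim rests on nothing.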

lignuist | 11 years ago

Nitpicking much?

As I wrote above, by making sure that I use a placeholder that does not appear in the data, I make sure that it does not cause the issues you describe. And even if I were wrong about that assumption, I could at least minimize the effect by choosing a very unlikely sequence as the placeholder.
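A minimal sketch of what "a placeholder that does not appear in the data" could look like in practice, assuming the strategy is to protect an escaped delimiter (here `\,`) before splitting and restore it afterwards. The function names and the choice of Unicode private-use characters are assumptions for illustration, not something stated in this thread.

```python
def find_unused_placeholder(data):
    """Return a single character that is verified to be absent from data.

    Assumption: candidates come from the Unicode private-use area
    (U+E000 to U+F8FF). A one-character placeholder cannot overlap itself.
    """
    for code_point in range(0xE000, 0xF900):
        candidate = chr(code_point)
        if candidate not in data:
            return candidate
    raise ValueError("no unused private-use character available")

def split_with_placeholder(line, escaped="\\,", delimiter=","):
    # Check the placeholder against the data instead of assuming it is absent.
    placeholder = find_unused_placeholder(line)
    # Protect escaped delimiters, split, then restore the literal delimiter.
    protected = line.replace(escaped, placeholder)
    fields = protected.split(delimiter)
    return [field.replace(placeholder, delimiter) for field in fields]

print(split_with_placeholder("a\\,b,c"))
```

The absence check costs a full scan of whatever is passed in, and it only rules out placeholder collisions. Other ambiguities, such as an escaped backslash directly before a delimiter, are not addressed by this sketch.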

I really see no issue here. How do you find valid grammars for fuzzy data in practice?