F-0X | 6 years ago
Awk would chew through that no problem.
> Some of which contain "quoted records", others, same column, are unquoted.
In which case, there is the FPAT variable (a gawk extension), which can be used to define what a field is. FPAT="\"[^\"]*\"|[^,]*", which means "stuff between quotes, or things that are not commas", would probably have worked for you.
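For readers without gawk at hand, here is a rough Python sketch of what such a field pattern treats as a field. It only approximates the idea and is not gawk itself: the name `split_fields` is made up for illustration, and the unquoted alternative uses `+` so that `re.findall` does not emit empty matches, which means empty fields are dropped in this sketch.

```python
import re

# Same idea as the field pattern: a quoted run, or a run of non-commas.
FIELD = re.compile(r'"[^"]*"|[^,]+')

def split_fields(line):
    """Return the fields of one line, keeping any surrounding quotes."""
    return FIELD.findall(line)

print(split_fields('Robbins,"1234 A Pretty Street, NE",MyTown'))
```

The comma inside the quoted address stays within a single field, which is the case the quoted complaint was about.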
> Some contain commas in the fields, most don't. CSV is like that: more like a guideline than actual sense.
Well, I would say that's absolutely false. You can't just put the delimiter wherever you fancy and call it a well-formed file. Quoting exists for the unfortunate cases where your data includes the delimiting character (ideally the author would have the sense to use a more suitable character, like a tab).
This is just a retort to prevent your post from dissuading readers from awk, which is a fantastic tool. If you actually sit for half an hour and learn it rather than googling to cobble together code that works, it is wonderful. I also don't think it is valid to base your judgement of a tool on what was apparently garbage data.
ajanuary | 6 years ago
But if you want to be in a world where people only deal with well-specified files like RFC 4180 (for some definition of well-specified), your quick field pattern doesn’t conform. It doesn’t handle escaped double quotes or quoted line breaks. If you’re using your quick awk command to transform an RFC 4180 file into another RFC 4180 file, you’ve just puked out the sort of garbage you were railing against.
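A small sketch of those two failure cases, in Python rather than awk. The `FIELD` regex is only a rough stand-in for the field pattern under discussion, and the sample `data` is invented for illustration; the comparison is against the standard library's `csv` module, which understands doubled quotes and line breaks inside quoted fields.

```python
import csv
import io
import re

# Rough stand-in for the quick field pattern (assumption, not gawk itself).
FIELD = re.compile(r'"[^"]*"|[^,]+')

# Two records: one with escaped double quotes, one with a quoted line break.
data = '1,"say ""hi"", ok",3\r\n4,"two\r\nlines",6\r\n'

# Line-at-a-time pattern matching, as a naive one-liner would do it.
naive = [FIELD.findall(line) for line in data.splitlines()]

# Quote-aware parsing; newline="" leaves line endings for csv to interpret.
parsed = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive), naive[0])
print(len(parsed), parsed)
```

The naive version sees three records and breaks the first one into five pieces, while `csv.reader` returns two rows of three fields each.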
While awk is a great tool if you’re dealing with a CSV format with a predictable specification, and probably could be made to bend to the GP’s will with a little more knowledge, it gets trickier if you’re handling some of the garbage that comes up in the real world. What’s worse is the programming model leads you down the path of never validating your assumptions and silently failing.
I love awk for interactive sessions when I can manually sanity check the output. But if I’m writing something mildly complex that has to work in a batch on input I’ve never seen, I too would reach for Ruby.
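As one illustration of validating assumptions instead of failing silently, here is a minimal Python sketch. The function name `read_strict` and the choice to check only the field count are assumptions made for the example; a real batch job would likely check more.

```python
import csv
import io

def read_strict(text, expected_fields):
    """Yield rows, raising instead of passing along a malformed one."""
    # strict=True makes the csv module raise csv.Error on bad quoting.
    reader = csv.reader(io.StringIO(text, newline=""), strict=True)
    for row in reader:
        if len(row) != expected_fields:
            raise ValueError(
                f"line {reader.line_num}: expected {expected_fields} "
                f"fields, got {len(row)}"
            )
        yield row

for row in read_strict("a,b,c\r\n1,2,3\r\n", 3):
    print(row)
```

A row with the wrong number of fields stops the run with a line number rather than flowing into the output unnoticed.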