Sentdiff: Diff for Writing

41 points | jackowayed | 17 years ago | github.com

21 comments

[+] gourneau|17 years ago|reply
Howdy, thanks for posting this.

It would be awesome to visualize Wikipedia edits over time. I don't really care what the text says, just how blocks of it change over time. I am after the aesthetics present in the ever-flowing change of data. I think your script might be a good starting point. Think of the videos of flowers growing that compress months into a few seconds, and do something similar with Wikipedia edits.

[+] keenerd|17 years ago|reply
I've been doing something similar with my blog, except it is currently for people who do care what the text says. The diff algorithm is tricky; I should have built up a larger corpus of material before designing it.

It looks at paragraphs, sentences, sub-sentence structures, words. It even draws little sparkgraph-ish diagrams. It is not really that long (250 lines by wc) but it has been a huge time sink for tweaking.

For an example of some heavy editing: http://kmkeen.com/inabow/2009-01-07-11-22-00.html

[+] ashr|17 years ago|reply
I was hoping to see a compact implementation of a diff algorithm. However, the script seems to rely on the existing 'diff' utility. Not a bad thing, but I was expecting to see something else.
[+] jackowayed|17 years ago|reply
I went with a don't-reinvent-the-wheel approach. Anything I did would have at least doubled the time to write the script and probably yielded a diff half as good.
[+] jackowayed|17 years ago|reply
I wrote this as well as submitting it.

Any suggestions, thoughts, etc. would be greatly appreciated.

[+] ntoshev|17 years ago|reply
I would approach this problem differently: tokenize the text and diff the stream of tokens (as opposed to a stream of characters).
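A rough sketch of what that could look like in Ruby, assuming a simple word-and-punctuation tokenizer and a plain longest-common-subsequence table. The names `tokenize` and `token_diff` are made up for illustration and are not part of sentdiff; the table is quadratic in the number of tokens, so this is only meant for short texts.

```ruby
# Hypothetical helper: split text into words and single punctuation marks.
def tokenize(text)
  text.scan(/\w+|[^\w\s]/)
end

# Hypothetical helper: diff two token arrays using an LCS table.
# Returns an array of [:same | :del | :add, token] pairs.
def token_diff(old_tokens, new_tokens)
  rows = old_tokens.size
  cols = new_tokens.size

  # lcs[i][j] = length of the LCS of old_tokens[i..-1] and new_tokens[j..-1]
  lcs = Array.new(rows + 1) { Array.new(cols + 1, 0) }
  (rows - 1).downto(0) do |i|
    (cols - 1).downto(0) do |j|
      lcs[i][j] = if old_tokens[i] == new_tokens[j]
                    lcs[i + 1][j + 1] + 1
                  else
                    [lcs[i + 1][j], lcs[i][j + 1]].max
                  end
    end
  end

  # Walk the table from the start, emitting one operation per token.
  ops = []
  i = j = 0
  while i < rows && j < cols
    if old_tokens[i] == new_tokens[j]
      ops << [:same, old_tokens[i]]
      i += 1
      j += 1
    elsif lcs[i + 1][j] >= lcs[i][j + 1]
      ops << [:del, old_tokens[i]]
      i += 1
    else
      ops << [:add, new_tokens[j]]
      j += 1
    end
  end

  # Whatever is left on either side is a pure deletion or addition.
  old_tokens[i..-1].each { |t| ops << [:del, t] }
  new_tokens[j..-1].each { |t| ops << [:add, t] }
  ops
end
```

Calling `token_diff(tokenize(old_text), tokenize(new_text))` gives a list you can render however you like, for example by wrapping `:del` and `:add` tokens in different colors.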
[+] akkartik|17 years ago|reply
Oh my eyes.. :) I suggest replacing main with

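    # Write a copy of the file with one sentence per line and return its name.
    # split_keep_after and random_chars are assumed to be defined elsewhere in the script.
    def file_with_split_sentences(old)
      contents = File.read(old).split_keep_after(/[\.!\?] *[A-Z\n]/, 2)
      begin newname = "#{old}-sdiff-#{random_chars 5}"; end while File.exist?(newname)
      File.open(newname, "w") { |file| file.write contents.join("\n") }
      newname
    end

    # The last two arguments are the files; everything before them is passed to diff.
    fnames = ARGV[-2..-1].collect { |f| file_with_split_sentences f.chomp }
    system "diff #{ARGV[0...-2].join " "} #{fnames.join " "}"
    # Clean up the temporary files.
    system "rm #{fnames.join " "}"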
[+] socmoth|17 years ago|reply
I like this and needed it about a week ago. I eventually had to write my own script.

I also found dwdiff, which you may like because it is very similar and very unixy: http://www.linux.com/feature/114176

[+] boucher|17 years ago|reply
Why sentences and not words?
[+] gcv|17 years ago|reply
For Emacs users: ediff highlights changed words. I've used it to track changes in text.
[+] albertcardona|17 years ago|reply
I could use this as part of git itself for comparing LaTeX document revisions. The current line-oriented diff has all the problems that sentdiff tries to solve.
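To make the line-oriented problem concrete, here is a rough Ruby sketch of the one-sentence-per-line idea: once each sentence sits on its own line, an ordinary line diff ends up comparing sentences instead of reflowed paragraphs. `split_sentences` is a made-up name and the regex is a naive assumption, not sentdiff's actual splitting rule; it will wrongly split after abbreviations such as "Dr. Smith".

```ruby
# Hypothetical helper: break text after ., ! or ? when the next
# non-space character is a capital letter.
def split_sentences(text)
  text.split(/(?<=[.!?])\s+(?=[A-Z])/)
end

# One sentence per line, ready to hand to a line-oriented diff.
def one_sentence_per_line(text)
  split_sentences(text).join("\n")
end
```

Writing the result of `one_sentence_per_line` for both revisions to temporary files and diffing those files is roughly the workflow the script follows.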
[+] jackowayed|17 years ago|reply
yeah, I was thinking it would be cool as an option for git diffs or something for when you have text files under git.

Never really looked at the git source code, so I'm not sure how easy it would be to do.

[+] bbb|17 years ago|reply
Git has something very much like that already built in. Try

    git diff --color-words