top | item 41383985

(no title)

zachmu | 1 year ago

when I was learning Go, I read a guide that told you to fire off a goroutine to walk a tree and send the values back to the main goroutine via a channel. I think about that "just an example" guide a lot when I see bad channel code.

For me the biggest red flag is somebody using a channel as part of an exported library function signature, either as a param or a return value. Almost never the right call.

discuss

lanstin|1 year ago

I've used that pattern to write tools to e.g. re-encrypt all whatever millions of objects in an S3 bucket, and examine 400m files for jars that are or contain the log4j vulnerable code. I had a large machine near the bucket/NFS filer in question, and wanted to use all the CPUs. It worked well for that purpose. The API is you provide callbacks for each depth of the tree, and that callback was given an array of channels and some current object to examine; your CB would figure out if that object (could be S3 path, object, version, directory, file, jar inside a jar, whatever) met the criteria for whatever action at hand, or if it generated more objects for the tree. I was able to do stuff in like 8 hours when AWS support was promising 10 days. And deleted the bad log4j jar few times a day while we tracked down the repos/code still putting it back on the NFS filer.

The library is called "go-treewalk" :) The data of course never ends back in main, it's for doing things or maybe printing out data, not doing more calcualation across the tree.

lelanthran|1 year ago

> when I was learning Go, I read a guide that told you to fire off a goroutine to walk a tree and send the values back to the main goroutine via a channel.

Okay, I gotta ask - what exactly is wrong with this approach? Unless you're starting only a single goroutine[1], this seems to me like a reasonable approach.

Think about recursively finding all files in a directory that match a particular filter, and then performing some action on the matches. It's better to start a goroutine that sends each match to the caller via a channel so that as each file is found the caller can process them while the searcher is still finding more matches.

The alternatives are:

1. No async searching, the tree-walker simply collects all the results into a list, and returns the one big list when it is done, at which point the caller will start processing the list.

2. Depending on which language you are using, maybe have actual coroutines, so that the caller can re-call the callee continuously until it gets no result, while the callee can call `yield(result)` for each result.

Both of those seem like poor choices in Go.

[1] And even then, there are some cases where you'd actually want the tree-walking to be asynchronous, so starting a single goroutine so you can do other stuff while talking the tree is a reasonable approach.

fnord123|1 year ago

> what exactly is wrong with this approach?

Before Go had iterators, you either had callbacks or channels to decompose work.

If you have a lot of files on a local ssd, and you're doing nothing interesting with the tree entries, it's a lot of work for no payoff. You're better off just passing a callback function.

If you're walking an NFS directory hierarchy and the computation on each entry is substantial then there's value in it because you can run computations while waiting on the potentially slow network to return results.

In the case of the callback, it is a janky interface because you would need to partially apply the function you want to do the work or pass a method on a custom struct that holds state you're trying to accumulate.

Now that iterators are becoming a part of the language ecosystem, one can use an iterator to decompose the walking and the computation without the jank of a partially applied callback and without the overhead of a channel.

kjksf|1 year ago

Assuming latest Go 1.13 I would write an iterator and used goroutines internally.

The caller would do:

    for f := range asyncDirIter(dir) {
    }

Better than exposing channel.

But my first question would be: is it really necessary? Are you really scanning such large directories to make async dir traversal beneficial?

I actually did that once but that that was for a program that scanned the whole drive. I wouldn't do it for scanning a local drive with 100 files.

Finally, you re-defined "traversing a tree" into "traversing a filesystem".

I assume that the post you're responding to was talking about traversing a tree structure in memory. In that context using goroutines is an overkill. Harder to implement, harder to use and slower.

Kamq|1 year ago

The only time I've seen it work with channels in the API is when it's something you'd realistically want to be async (say, some sort of heavy computation, network request, etc). The kind of thing that would probably already be a future/promise/etc in other languages.

And it doesn't really color the function because you can trivially make it sync again.

beautron|1 year ago

> And it doesn't really color the function because you can trivially make it sync again.

Yes, but this goes both ways: You can trivially make the sync function async (assuming it's documented as safe for concurrent use).

So I would argue that the sync API design is simpler and more natural. Callers can easily set up their own goroutine and channels around the function call if they need or want that. But if they don't need or want that, everything is simpler and they don't even need to think about channels.

lelanthran|1 year ago

> The kind of thing that would probably already be a future/promise/etc in other languages.

Or a coroutine (caller calls `yield(item)` for each match, and `return` when done).

Filligree|1 year ago

At the time, that was one of the only ways to write decent-looking generic code.

unknown|1 year ago

[deleted]