
Exactly-Once Initialization in Asynchronous Python

104 points | ingve | 5 years ago | nullprogram.com

58 comments

[+] alexchamberlain|5 years ago|reply
I'd just call the function once by avoiding the global; construct your database access object at the start of your asynchronous main method and dependency inject it to other tasks.
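A minimal sketch of that shape, with a stand-in `Database` class in place of a real asyncpg pool (all names here are illustrative):

```python
import asyncio

class Database:
    """Stand-in for a real connection pool (e.g. asyncpg)."""
    async def fetch_user(self, user_id):
        await asyncio.sleep(0)  # pretend to do a query
        return {"id": user_id}

async def handle_request(db, user_id):
    # The task receives its dependency explicitly rather than
    # reaching for a module-level global.
    return await db.fetch_user(user_id)

async def main():
    db = Database()  # constructed exactly once, up front
    return await asyncio.gather(
        handle_request(db, 1),
        handle_request(db, 2),
    )

print(asyncio.run(main()))  # [{'id': 1}, {'id': 2}]
```

Because `db` is created before any task runs, there's no initialization race to guard against in the first place.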
[+] dilatedmind|5 years ago|reply
His asyncpg example doesn't make much sense to me. What if there was a config change with a bad password? I would like to know this immediately on startup, else my rolling deploy is going to bring down all the previously well configured instances, and by the time we lazily try to connect to postgres it's too late.

I'm not a big python user, but I do find it kind of surprising there isn't an awaitable and thread safe mutex in the stdlib.

[+] orf|5 years ago|reply
This. Your code often becomes easier to read and test as well.
[+] np_tedious|5 years ago|reply
Can you clarify what you mean by dependency injection in python? Did you mean a DI framework or something more informal?

I've seen DI frameworks in python but not really used them. At a glance they don't strike me as pythonic. Rolling your own kind of inversion of control can result in unruly "config" or "context" objects that bring difficulties as well.

[+] danielscrubs|5 years ago|reply
So you’d block the thread and call it a day?
[+] cheez|5 years ago|reply
Been coming across a lot of these issues. Asyncio requires slightly different thought processes.

As soon as you have an `await` anywhere in the code, you've got to assume that your code will be re-entered. Lots of asyncio.Locks all over the place for me.

Glad people are bringing this up. I had to learn this on my own.
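A sketch of the pattern, guarding a cache fill that spans an `await` (the cache and the slow load are stand-ins):

```python
import asyncio

cache = {}

async def get_config(lock, key):
    # Without the lock, two tasks could both see the cache miss,
    # both await the slow load, and clobber each other's work.
    async with lock:
        if key not in cache:
            await asyncio.sleep(0)  # stands in for a slow I/O call
            cache[key] = f"value-for-{key}"
    return cache[key]

async def main():
    # Create the lock inside the running loop, which sidesteps the
    # loop-binding problem the article describes on older Pythons.
    lock = asyncio.Lock()
    return await asyncio.gather(get_config(lock, "db"),
                                get_config(lock, "db"))

print(asyncio.run(main()))  # ['value-for-db', 'value-for-db']
```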

[+] pansa2|5 years ago|reply
> As soon as you have an `await` anywhere in the code, you've got to assume that your code will be re-entered.

At least the re-entry points are explicitly marked with `await`. IMO that's the main benefit of async-await (stackless coroutines) over stackful coroutines or threads, which allow your code to be suspended and re-entered almost anywhere.

Of course the drawback of async-await is the "function color" issue [0], in which it's difficult for functions that don't suspend to call those which do.

[0] http://journal.stuffwithstuff.com/2015/02/01/what-color-is-y...

[+] nurettin|5 years ago|reply
I've wrestled with this. A cleaner solution seems to be using an asyncio.Queue.

So when your function is not reentrant, `params = await command.get()` runs in a loop inside a task (`command.put_nowait(params)` is called elsewhere).

You can also use this to distribute tasks to different class methods.
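A sketch of that queue pattern, with a doubling step standing in for the non-reentrant work and `None` as a hypothetical shutdown sentinel:

```python
import asyncio

async def worker(commands, results):
    # Single consumer task: the non-reentrant body runs strictly
    # one call at a time, in arrival order.
    while True:
        params = await commands.get()
        if params is None:           # sentinel: shut down
            break
        results.append(params * 2)   # the non-reentrant work
        commands.task_done()

async def main():
    commands = asyncio.Queue()
    results = []
    task = asyncio.create_task(worker(commands, results))
    for n in (1, 2, 3):
        commands.put_nowait(n)       # callers enqueue from elsewhere
    commands.put_nowait(None)
    await task
    return results

print(asyncio.run(main()))  # [2, 4, 6]
```

Serializing through one consumer replaces the lock: no two invocations of the body can ever interleave.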

[+] alpineidyll3|5 years ago|reply
Every time I find something that seems unnecessarily awkward in asyncio, I eventually find out there's a good reason. But plenty of things that are written with it aren't using it exactly right.
[+] OrangeTux|5 years ago|reply
> Unfortunately this has a serious downside: asyncio locks are associated with the loop where they were created. Since the lock variable is global, maybe_initialize() can only be called from the same loop that loaded the module. asyncio.run() creates a new loop so it’s incompatible.

I work on several async projects, but I never had to use multiple event loops. What are use cases for using multiple event loops?

[+] itayperl|5 years ago|reply
There may be other use cases, but it can be a useful pattern for mixing async code into a non-async project. In the specific places where using async for some task makes sense, you would just spawn a thread with an event loop, then push work into the new loop from non-async code using run_coroutine_threadsafe.
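A minimal sketch of that pattern, assuming a background loop on a daemon thread (the `fetch` coroutine is just a placeholder):

```python
import asyncio
import threading

def start_background_loop():
    # Run an event loop on a dedicated thread so synchronous code
    # can submit coroutines to it.
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop

async def fetch(n):
    await asyncio.sleep(0)  # stands in for real async I/O
    return n * 10

# Non-async code pushes work into the loop and blocks on the result.
loop = start_background_loop()
future = asyncio.run_coroutine_threadsafe(fetch(4), loop)
print(future.result(timeout=5))  # 40
loop.call_soon_threadsafe(loop.stop)
```

`run_coroutine_threadsafe` returns a `concurrent.futures.Future`, so the calling thread can block, poll, or attach callbacks without touching the loop directly.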
[+] lmeyerov|5 years ago|reply
There is more than one way to make awaitables in asyncio -- at the core, this is about sharing a single future, for which there's a joyfully boring native standard constructor.

For example, when working w/ immutable GPU dataframes to represent our user's datasets, we often get into variants where loading a dataset may take a bit and thus get multiple services requesting it before ETL is done. So, we want to only trigger the parser once per file and have any subsequent calls wait on the first one:

  datasets = {}
  async def load_once(name):
    if name not in datasets:                            # sync,  many
      fut = asyncio.get_running_loop().create_future()  # sync,  once
      datasets[name] = fut                              # sync,  once
      fut.set_result(await load(name))                  # async, once
    return await datasets[name]                         # async, many

And then throw in an async lru.. :)
[+] jaen|5 years ago|reply
Unfortunately, this naive method is buggy, I have had to debug and fix this exact code in production :)

The issue is with exception safety - first, this does not handle exceptions in load() properly, but that is a trivial fix.

The more insidious problem is due to the fact that Python futures are cancellable - and exceptions cancel futures.

What this means is that if two callers call load_once() in parallel, and the first caller encounters an exception (eg. from calling something else in parallel), the load() future will be cancelled for _all_ callers (eg. the second one), and will remain in a permanently wedged state.

Fixing that is, well, quite a bit more code...
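One possible shape of the fix (a sketch, not the code from that production incident): let a Task own the shared work, `shield()` it from each caller's cancellation, and evict failed entries so a later call can retry. `load` is a placeholder.

```python
import asyncio

datasets = {}

async def load(name):
    await asyncio.sleep(0)  # stands in for the real parser/ETL
    return f"data:{name}"

async def load_once(name):
    if name not in datasets:
        # A Task (not a bare future) owns the work, and shield()
        # keeps one caller's cancellation from cancelling the load
        # out from under every other caller.
        datasets[name] = asyncio.create_task(load(name))
    try:
        return await asyncio.shield(datasets[name])
    except Exception:
        # Drop the failed entry so a later caller can retry
        # instead of awaiting a permanently failed future.
        datasets.pop(name, None)
        raise
```

Even this leaves edge cases (e.g. cancellation of the shared task itself), which is exactly the "quite a bit more code" being described.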

[+] smabie|5 years ago|reply
How about we just use actors instead? Preemptable actors are the only good concurrency model I've ever come across. Everything else has massive problems
[+] CraigJPerry|5 years ago|reply
Actors aren’t a panacea either - your logic ends up more spread out. You’re still able to shoot yourself in the foot quite easily too, e.g. when deciding whether to use a “pull” or “push” model for concurrency.

I found async testing in Python to be annoying, although I found a couple of libraries that make it nicer (pytest-async, and I forget the name of the other).

[+] odiroot|5 years ago|reply
I never understood this whole "actor" thing until I had to write an extension for Mopidy. Then it really clicked with me.

It's very boilerplate-y (Mopidy uses Pykka) though and takes some time getting used to coming from other frameworks.

[+] mgraczyk|5 years ago|reply
Async await scales well to codebases with millions of lines and thousands of developers. As a result, large companies and ecosystems have mostly adopted async/await and the tooling and runtimes in those languages is now much more mature.
[+] mgraczyk|5 years ago|reply
If you're using CPython since Python 3.2, you don't need to lock. You can use `dict.setdefault` or another similar method that is guaranteed to be atomic.

    initialized = D.setdefault('initialized', True)
    ...
[+] sicromoft|5 years ago|reply
dict.setdefault doesn’t solve the problem that he’s using the lock for (atomicity is not the problem).
[+] nhumrich|5 years ago|reply
This can be a lot simpler. Just set "one_time_setup" to a single instance of the method, and all calls await the exact same run.

If that doesn't work, then set it to an `asyncio.Event`, and run the one_time_setup "in the background" (create_task); when it's done, it marks the event as complete.
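A sketch of the first suggestion, with one caveat: awaiting the same bare coroutine object twice raises RuntimeError, so the shared instance needs to be a Task (which can be awaited any number of times). Names here are illustrative.

```python
import asyncio

_setup_task = None

async def one_time_setup():
    await asyncio.sleep(0)  # stands in for the real setup work
    return "ready"

async def ensure_setup():
    global _setup_task
    if _setup_task is None:
        # Every caller awaits the same Task, so the body runs once;
        # awaiting an already-finished Task just returns its result.
        _setup_task = asyncio.create_task(one_time_setup())
    return await _setup_task

async def main():
    return await asyncio.gather(ensure_setup(), ensure_setup())

print(asyncio.run(main()))  # ['ready', 'ready']
```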

[+] waterside81|5 years ago|reply
Go offers this out of the box via the sync.Once type. Do other languages? Kind of surprised Python doesn't, as this sort of pattern is common in applications dealing with concurrency.
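For the synchronous (threaded) case, a rough Python analogue of sync.Once is only a few lines; this sketch always takes the lock rather than attempting Go's lock-free fast path:

```python
import threading

class Once:
    """Rough, illustrative Python analogue of Go's sync.Once."""
    def __init__(self):
        self._lock = threading.Lock()
        self._done = False

    def do(self, fn):
        # Go's version has a lock-free fast path guarded by an
        # atomic flag; always taking the lock is simpler and safe.
        with self._lock:
            if not self._done:
                fn()
                self._done = True

calls = []
once = Once()
for _ in range(3):
    once.do(lambda: calls.append("init"))
print(calls)  # ['init']
```

The asyncio case is harder precisely because, as the article notes, the lock itself must be created on the right event loop.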
[+] dnautics|5 years ago|reply
Erlang has features for this baked in. What's more, if initialization of any subcomponent fails (say one of its dependencies hadn't completed booting yet due to a race condition), if the author made it throw, the dependent subcomponent will automatically restart itself and try again. There are also one-line strategies for trying again later, etc., so you don't even have to worry about blocking to prevent those race conditions.

> Kind of surprised python doesn’t as this sort of pattern is common in applications dealing with concurrency

Well yeah, python was not designed for that.

[+] fishywang|5 years ago|reply
Lazy init in Kotlin and Scala is essentially the same thing.

The good thing with Go's sync.Once is that it's implemented as a library instead of something in the language itself, so it's easy for curious users to see how it's actually implemented. They even have comments there pointing out wrong implementations; I have seen people make the exact same mistake during code reviews (in other languages).

[+] reedwolf|5 years ago|reply
>"global"

Please write classes, people!

[+] zbentley|5 years ago|reply
Why? To hide the fact that something is global behind mutation by reference?

"global" is a fine way to do that when you need it. Simple and says what it means.

[+] rburhum|5 years ago|reply
I would add a note that if you are running in a cluster environment like Kubernetes this won’t work because your containers would be running potentially in different machines. In those scenarios you would need another service just for the locks.
[+] jordic|5 years ago|reply
On k8s, for example when running multiple parallel jobs that need to initialize only once, Redis redlock worked for me (it's around in multiple implementations). The first job takes the lock while initializing; the rest just wait for the release, then start working on the items prepared by the first. On asyncio caches, we used a lock to prevent dogpiling on cache initialization, i.e. to prevent multiple tasks caching the same thing in parallel.