lairv | 1 month ago

I would agree if not for the fact that polars is not compatible with Python multiprocessing when using the default "fork" start method. The following script hangs forever (the pandas equivalent runs):

    import polars as pl
    from concurrent.futures import ProcessPoolExecutor

    pl.DataFrame({"a": [1,2,3], "b": [4,5,6]}).write_parquet("test.parquet")

    def read_parquet():
        x = pl.read_parquet("test.parquet")
        print(x.shape)

    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(read_parquet) for _ in range(100)]
        r = [f.result() for f in futures]

Using a thread pool or the "spawn" start method works, but it makes polars a pain to use inside, e.g., a PyTorch DataLoader.
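
For reference, a minimal sketch of the "spawn" workaround mentioned above, using only the standard library. The function names `square` and `run_with_spawn` are made up for illustration, and `square` merely stands in for the real per-task work (such as the parquet read in the script above). With "spawn", the worker function must be importable at module level and the entry point needs the `__main__` guard.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def square(n: int) -> int:
    # Hypothetical stand-in for the real per-task work (e.g. reading a file).
    return n * n


def run_with_spawn(values, max_workers: int = 2):
    # Request the "spawn" start method for this pool only, without
    # changing the process-wide default.
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx) as executor:
        # executor.map returns results in input order.
        return list(executor.map(square, values))


if __name__ == "__main__":
    # The guard matters: spawned workers re-import this module, and
    # unguarded code would run again in every worker.
    print(run_with_spawn(range(5)))
```

Passing `mp_context` keeps the choice local to one executor, so other code in the same process that relies on the platform default is not affected.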

skylurk|1 month ago

You are not wrong, but for this example you can do something like this to run in threads:

  import polars as pl
  
  pl.DataFrame({"a": [1, 2, 3]}).write_parquet("test.parquet")
  
  
  def print_shape(df: pl.DataFrame) -> pl.DataFrame:
      print(df.shape)
      return df
  
  
  lazy_frames = [
      pl.scan_parquet("test.parquet")
      .map_batches(print_shape)
      for _ in range(100)
  ]
  pl.collect_all(lazy_frames, comm_subplan_elim=False)

(comm_subplan_elim=False is important; otherwise Polars can deduplicate the 100 identical plans instead of running each one.)

ritchie46|1 month ago

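Python 3.14 no longer uses "fork" by default on Linux (the new default there is "forkserver", which, like "spawn", starts workers from a clean process).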

However, this is not a Polars issue. Using "fork" can leave ANY MUTEX in the child process invalid (a multi-threaded query engine has plenty of mutexes). It is highly unsafe and relies on the assumption that none of the libraries in your process holds a lock at that time. That's not an assumption for PyTorch dataloaders to make.
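
To illustrate the point about locks, here is a minimal POSIX-only sketch using just the standard library (it does not involve Polars). A helper thread holds a plain `threading.Lock` while the process forks; the child inherits the locked state but not the thread that would release it. The function name `child_can_acquire_after_fork` is made up for this example, and recent Python versions may emit a DeprecationWarning about forking a multi-threaded process.

```python
import os
import threading


def child_can_acquire_after_fork() -> bool:
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def hold_lock():
        # Take the lock and keep it until the parent says to let go.
        with lock:
            held.set()
            release.wait()

    worker = threading.Thread(target=hold_lock)
    worker.start()
    held.wait()  # the lock is now held by the worker thread

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was copied. The worker thread
        # does not exist here, so nobody will ever release the lock.
        acquired = lock.acquire(timeout=0.5)
        os._exit(0 if acquired else 1)

    # Parent: collect the child's verdict, then clean up the worker.
    _, status = os.waitpid(pid, 0)
    release.set()
    worker.join()
    return os.waitstatus_to_exitcode(status) == 0


if __name__ == "__main__":
    print("child acquired the lock:", child_can_acquire_after_fork())
```

Without the timeout, the child would block forever on `acquire()`, which is the same shape of hang as in the script at the top of the thread.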

lairv|1 month ago

Defaulting to "spawn" is definitely the right thing; it avoids many footguns.

That said, for PyTorch DataLoader specifically, switching from fork to spawn removes copy-on-write, which can significantly increase startup time and, more importantly, memory usage. It often requires non-trivial refactors: many training codebases aren't designed for this and will simply OOM. So in practice, for this use case, I've found it more practical to just use pandas rather than doing a full refactor.

schmidtleonard|1 month ago

I can't believe parallel processing is still this big of a dumpster fire in Python 20 years after multi-core became the rule rather than the exception.

Do they really still not have a good mechanism to toss a flag on a for loop to capture embarrassing parallelism easily?

ritchie46|1 month ago

Polars does that for you.

skylurk|1 month ago

This is one of the reasons I use polars.

lairv|1 month ago

Well, I think ProcessPoolExecutor/ThreadPoolExecutor from concurrent.futures were supposed to be that.
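
As a rough sketch of what that looks like in practice, `executor.map` is close to a drop-in replacement for a loop over independent items. The names `work`, `serial` and `parallel` are hypothetical, and `work` stands in for whatever the loop body does. Threads are used here for simplicity; they mainly help when the work is I/O-bound or releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def work(n: int) -> int:
    # Hypothetical stand-in for one independent loop iteration.
    return n * n


def serial(values):
    # The plain for-loop version.
    return [work(v) for v in values]


def parallel(values, max_workers: int = 4):
    # Same computation, with the iterations handed to a thread pool.
    # executor.map yields results in input order, so the output lines
    # up with the serial version.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(work, values))


if __name__ == "__main__":
    print(parallel(range(5)))
```

Swapping in `ProcessPoolExecutor` gives process-based parallelism with the same interface, subject to the start-method caveats discussed earlier in the thread.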