top | item 44620618


marcianx | 7 months ago

To make sure I understand correctly: did you want to read a `String` and have lots of references to slices within the same string without having to deal with lifetimes? If so, would another variant of `Rc<str>` which supports substrings that also update the same reference count have worked for you? Looking through crates.io, I see multiple libraries that seem to offer this functionality:

[1]: https://crates.io/crates/arcstr
[2]: https://crates.io/crates/imstr
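The core idea (shown here as a minimal hand-rolled sketch, not the API of either crate) is a cheaply clonable substring handle that shares one reference-counted buffer instead of borrowing with a lifetime. The `SharedStr` name and its methods are hypothetical:

```rust
use std::ops::Range;
use std::rc::Rc;

// Hypothetical sketch: a substring that bumps a shared refcount instead of
// borrowing, so it can be stored anywhere without lifetime annotations.
#[derive(Clone, Debug)]
struct SharedStr {
    buf: Rc<str>,
    range: Range<usize>,
}

impl SharedStr {
    fn new(s: &str) -> Self {
        let buf: Rc<str> = Rc::from(s);
        let range = 0..buf.len();
        SharedStr { buf, range }
    }

    // Take a sub-slice relative to this one; copies no text, only clones the Rc.
    // (Assumes the byte range lands on UTF-8 character boundaries.)
    fn slice(&self, range: Range<usize>) -> Self {
        assert!(range.end <= self.range.len());
        SharedStr {
            buf: Rc::clone(&self.buf),
            range: self.range.start + range.start..self.range.start + range.end,
        }
    }

    fn as_str(&self) -> &str {
        &self.buf[self.range.clone()]
    }
}

fn main() {
    let doc = SharedStr::new(r#"{"key": "value"}"#);
    let key = doc.slice(2..5); // points into the shared buffer, no copy
    assert_eq!(key.as_str(), "key");
    println!("{}", key.as_str());
}
```

The crates above presumably add ergonomics (thread-safe `Arc`, pattern-based slicing) on top of this basic shape.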


forrestthewoods | 7 months ago

It’s deeper than that.

Let’s pretend I was in C. I would allocate one big flat segment of memory. I’d read the “JSON” text file into this block. Then I’d build an AST of nodes. Each node would be appended into the arena. Object nodes would contain a list of pointers to child nodes.

Once I built the AST of nested nodes of varying type I would treat it as constant. I’d use it for a few purposes. And then at some point I would free the chunk of memory in one go.

In C this is trivial. No string copies. No duplicated data. Just a bunch of dirty unsafe pointers. Writing this “safely” is very easy.

In Rust this is… maybe possible. But brutally difficult. I’m pretty good at Rust. I gave up. I don’t recall exactly what wall I hit.

I’m not saying it can’t be done. But I am saying it’s really hard and really gross. It’s radically easier to allocate lots of little Strings and Vecs and Box each nested value. And then free them all one-by-one.
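That "easy" version might look like the following sketch (the `Value` type and index positions are illustrative, not from any real parser): every node owns its data, so there are no lifetimes to fight, at the cost of copying strings out of the input and freeing each allocation individually on drop.

```rust
// Sketch of the lifetime-free approach: each node owns heap-allocated copies
// of its data rather than borrowing slices of the source buffer.
#[derive(Debug)]
enum Value {
    Str(String),           // copied out of the source text with to_owned()
    List(Vec<Box<Value>>), // each child is its own heap allocation
}

fn main() {
    let src = r#"["a", "b"]"#;
    // Pretend a parser produced this: every string duplicates part of `src`.
    let ast = Value::List(vec![
        Box::new(Value::Str(src[2..3].to_owned())),
        Box::new(Value::Str(src[7..8].to_owned())),
    ]);
    if let Value::List(items) = &ast {
        assert_eq!(items.len(), 2);
    }
    println!("{ast:?}");
    // Dropping `ast` here frees every String, Box, and Vec one by one.
}
```

This trades the single bulk free of the C arena for many small allocations and deallocations, which is exactly the overhead the comment is describing.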

lokeg | 7 months ago

Something like the following? I am trying and failing to reproduce the issue, even with mutable AST nodes.

  use bumpalo::Bump;
  use std::io::Read;
  fn main() {
      let mut arena = Bump::new();
      loop {
          read_and_process_lines(&mut arena);
          arena.reset();
      }
  }
  #[derive(Debug)]
  enum AstNode<'a> {
      Leaf(&'a str),
      Branch {
          line: &'a str,
          meta: usize,
          cons: &'a mut AstNode<'a>
      },
  }
  fn read_and_process_lines(arena: &Bump) {
      let cap = 40;
      let buf: &mut [u8] = arena.alloc_slice_fill_default(cap);
      let l = std::io::stdin().lock().read(buf).expect("reading stdin");
      let content: &str = std::str::from_utf8(&buf[..l]).unwrap();
      dbg!(content);

      let mut lines = content.lines();
      let mut latest: &mut AstNode<'_> = arena.alloc(AstNode::Leaf(lines.next().unwrap()));
      for line in lines {
          latest = arena.alloc(AstNode::Branch{line, meta:0, cons: latest});
      }
      println!("{latest:?}");
  }

rstuart4133 | 7 months ago

I've just built a parser in Rust that seems to function in exactly the way you say. It reads the text into a single buffer, assembles the AST nodes in an arena (indexes are NonZeroU8's to save space), and all token nodes refer to the original text rather than duplicating it. The things I'm parsing are smallish (much less than 64k, 256 tokens), but there are around 100M of them, so it was worth the effort to build it in a way that used zero allocations (the arena is re-used).
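A minimal sketch of that index-based layout (the `Node`/`Ast` names and fields are hypothetical, not from the actual parser): nodes live in a reusable `Vec`, child links are `NonZeroU8` indexes so `Option<NodeId>` stays one byte, and tokens are byte ranges into the original input buffer.

```rust
use std::num::NonZeroU8;

// NonZeroU8 has a niche, so Option<NodeId> is still a single byte.
type NodeId = NonZeroU8;

#[derive(Debug)]
struct Node {
    start: usize,        // token text = &input[start..end], never copied
    end: usize,
    next: Option<NodeId>,
}

struct Ast {
    nodes: Vec<Node>,    // slot 0 is a sentinel so real ids start at 1
}

impl Ast {
    fn new() -> Self {
        Ast { nodes: vec![Node { start: 0, end: 0, next: None }] }
    }

    fn push(&mut self, start: usize, end: usize, next: Option<NodeId>) -> NodeId {
        self.nodes.push(Node { start, end, next });
        NodeId::new((self.nodes.len() - 1) as u8).expect("more than 255 nodes")
    }

    // Resolve a node back to its slice of the input; no stored lifetimes.
    fn text<'a>(&self, input: &'a str, id: NodeId) -> &'a str {
        let n = &self.nodes[id.get() as usize];
        &input[n.start..n.end]
    }
}

fn main() {
    let input = "alpha beta";
    let mut ast = Ast::new();
    let a = ast.push(0, 5, None);
    let b = ast.push(6, 10, Some(a));
    assert_eq!(ast.text(input, a), "alpha");
    assert_eq!(ast.text(input, b), "beta");
    // Reuse the arena for the next document: drop the nodes, keep the capacity.
    ast.nodes.truncate(1);
    println!("ok");
}
```

Because nodes hold indexes rather than references, the borrow checker never sees a self-referential structure, which is likely why this shape worked where `&'a`-laden nodes did not.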

It was a bit of effort, but it can't be that difficult given it's my first non-toy Rust program and it wasn't that hard to get going. I'd written maybe 50 training exercises in Rust before that, over the years. Yes, the same thing in C would have been faster and easier to write (I've written many hundreds of thousands of lines of C over the years), but I'm not sure it would have worked the day after it compiled.

I also had to build a 2nd parser that looked at maybe 100 lines. It was clone() all the way down for that one because I could afford the cost. I think that was easier to write in Rust than C, mostly because of the very good standard library Rust comes with.