Was thinking of InChI[0] but on Googling SMILES and SELFIES I found this[1] talk, this[2] paper and my goodness I've been down a few rabbit holes since...
Note: There are two standardized formats for this called SMILES and SELFIES. SMILES is much better supported, but SELFIES is more robust. I'm integrating them into some bio and chem software I'm working on.
You can do things like use PubChem's API to look up molecules similar to a given SMILES string.
I believe most molecule editors can load and save SMILES.
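For anyone curious what that lookup looks like, here is a rough sketch. It assumes PubChem's PUG REST `fastsimilarity_2d` operation with `Threshold` and `MaxRecords` options, and a JSON response shaped like `{"IdentifierList": {"CID": [...]}}`. That is how I read their docs, but check them before relying on it. The function names are my own. For SMILES containing `/` or `\`, PubChem suggests passing the string as a query argument or via POST rather than in the path.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def similarity_url(smiles: str, threshold: int = 90, max_records: int = 10) -> str:
    """Build a PUG REST URL for a 2D similarity search seeded by a SMILES string."""
    # SMILES uses characters such as '#' and '=' that are unsafe in a URL path,
    # so percent-encode everything.
    encoded = quote(smiles, safe="")
    query = urlencode({"Threshold": threshold, "MaxRecords": max_records})
    return f"{PUG_REST}/compound/fastsimilarity_2d/smiles/{encoded}/cids/JSON?{query}"


def fetch_similar_cids(smiles: str, threshold: int = 90, max_records: int = 10) -> list:
    """Fetch PubChem compound IDs similar to `smiles` (needs network access)."""
    url = similarity_url(smiles, threshold, max_records)
    with urllib.request.urlopen(url, timeout=30) as response:
        data = json.load(response)
    # Assumed response shape; adjust if the service returns something else.
    return data["IdentifierList"]["CID"]
```

Calling `fetch_similar_cids("CC(=O)O")` should give a list of integer CIDs if the assumptions above hold.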
SMILES and SELFIES are molecular graph representations and aren't meant to solve the "parse this sum formula" problem.
SELFIES are for genAI. If you ask a VAE to generate SMILES, some of the strings it spits out will be invalid. That can't happen with SELFIES, and that is the one application where they are robust.
I wrote a very simple SMILES parser using pyparsing:
https://github.com/dakoner/smilesparser/tree/master
I wouldn't say it's intended for production work, but it has been useful in situations where I didn't want to pull in rdkit.
This code is gibberish to me, but it appears the target is just parsing how many atoms are in a molecule string of some representation. That's cool, but to do just about anything useful in chemistry we need the bond graph (and often more: bond orders, stereochemistry, plus much more for biopolymers).
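To make "bond graph" concrete, here is a minimal sketch that turns a small SMILES subset into atoms plus bonds with orders. It is purely illustrative and not taken from any of the parsers linked in this thread. It assumes unbracketed, non-aromatic atoms only (B, C, N, O, P, S, F, I, Cl, Br), single-digit ring closures, and no stereochemistry, and it ignores implicit hydrogens.

```python
import re

TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]|[()]|\d")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def smiles_to_graph(smiles):
    """Return (atoms, bonds): element symbols and (i, j, order) tuples."""
    atoms, bonds = [], []
    branch_stack = []  # atom indices to return to when a branch closes
    open_rings = {}    # ring digit -> (atom index, bond order given at opening)
    prev = None        # index of the atom the next atom bonds to
    pending = None     # explicit bond order waiting for its second atom
    pos = 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            raise ValueError(f"unsupported character {smiles[pos]!r} at {pos}")
        tok, pos = m.group(), m.end()
        if tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
        elif tok == "(":
            if prev is None:
                raise ValueError("branch before any atom")
            branch_stack.append(prev)
        elif tok == ")":
            if not branch_stack:
                raise ValueError("unbalanced ')'")
            prev = branch_stack.pop()
        elif tok.isdigit():
            if prev is None:
                raise ValueError("ring closure before any atom")
            if tok in open_rings:
                start, order = open_rings.pop(tok)
                bonds.append((start, prev, pending or order or 1))
            else:
                open_rings[tok] = (prev, pending)
            pending = None
        else:  # an atom symbol
            atoms.append(tok)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, pending or 1))
            prev, pending = idx, None
    if branch_stack or open_rings:
        raise ValueError("unclosed branch or ring")
    return atoms, bonds
```

For acetic acid, `smiles_to_graph("CC(=O)O")` yields four atoms and three bonds, one of them a double bond to the carbonyl oxygen. A bare atom count cannot carry that connectivity.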
That was my initial reaction too, but I suspect this has utility in applications other than what you and I are looking for. From context, I gather this may be for thermodynamic arithmetic, or reaction product arithmetic.
Does this handle, e.g., water of hydration, as in CaSO4 . 2H2O? States of matter, as in H2O(g)? Does it preserve subunit information, as in (C6H5)CH2COOH? Writing a parser for basic formulae is such a tiny, tiny part of the actual problem... Deciding the scope of what you want to handle, and how, is the real problem.
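As an illustration of how quickly those scope decisions show up, here is a small sketch of a sum-formula counter covering the three cases above. It is not the code under discussion. The choices in it are my assumptions: `.` or `·` separates hydrate parts, a leading integer multiplies a part, and a trailing `(g)`, `(l)`, `(s)` or `(aq)` is a state label.

```python
import re
from collections import Counter

STATE = re.compile(r"\((g|l|s|aq)\)$")
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")


def count_fragment(text):
    """Count atoms in one dot-free fragment, honoring nested parentheses."""
    stack = [Counter()]
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        elem, n, lpar, rpar, mult = m.groups()
        pos = m.end()
        if elem:
            stack[-1][elem] += int(n or 1)
        elif lpar:
            stack.append(Counter())
        else:  # closing parenthesis: fold the group into its parent
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            group = stack.pop()
            for symbol, count in group.items():
                stack[-1][symbol] += count * int(mult or 1)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def parse_formula(formula):
    """Return (atom counts, state label or None) for a sum formula."""
    text = formula.replace(" ", "")
    state = None
    m = STATE.search(text)
    if m:
        state, text = m.group(1), text[:m.start()]
    total = Counter()
    for part in re.split(r"[.·]", text):
        lead = re.match(r"\d+", part)
        factor = int(lead.group()) if lead else 1
        body = part[lead.end():] if lead else part
        for symbol, count in count_fragment(body).items():
            total[symbol] += count * factor
    return dict(total), state
```

Note that `parse_formula("(C6H5)CH2COOH")` flattens everything to C8H8O2, so the subunit information is lost. Keeping it would need a tree rather than a counter, which is exactly the kind of scope question raised above.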
logifail|3 months ago
[0] https://en.wikipedia.org/wiki/International_Chemical_Identif...
[1] https://www.inchi-trust.org/wp/wp-content/uploads/2019/12/18...
[2] https://pubs.rsc.org/en/content/articlehtml/2022/dd/d1dd0001...
jugoetz|3 months ago
the__alchemist|3 months ago
dachrillz|3 months ago
jugoetz|3 months ago
whitten|3 months ago
dalke|3 months ago
Yes. Here is the yacc grammar for the SMILES parser in the RDKit. https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Smi...
There's also one from OpenSMILES at http://opensmiles.org/opensmiles.html#_grammar . It has a shift/reduce error (as I recall) that I was not competent enough to fix.
I prefer to parse almost completely in the lexer, with a small amount of lexer state to handle balanced parens, bracket atoms, and matching ring closures. See https://hg.sr.ht/~dalke/opensmiles-ragel and more specifically https://hg.sr.ht/~dalke/opensmiles-ragel/browse/opensmiles.r... .
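The lexer-plus-a-little-state idea can be sketched in Python with a regex tokenizer. To be clear, this is a loose analogy and not the Ragel code linked above. The token classes are a simplified guess at the OpenSMILES grammar, and bracket atoms are matched as opaque tokens without checking their contents.

```python
import re

TOKEN = re.compile(r"""
    (?P<bracket>\[[^\[\]]+\])              # bracket atom, contents not inspected
  | (?P<atom>Cl|Br|[BCNOPSFI]|[bcnops])    # organic-subset and aromatic atoms
  | (?P<bond>[-=\#$:/\\])
  | (?P<ring>%\d\d|\d)
  | (?P<open>\()
  | (?P<close>\))
  | (?P<dot>\.)
""", re.VERBOSE)


def lex_smiles(smiles):
    """Tokenize a SMILES string, tracking just enough state to reject
    unbalanced parentheses and unmatched ring closures."""
    tokens = []
    depth = 0            # current branch nesting depth
    open_rings = set()   # ring-closure numbers opened but not yet closed
    seen_atom = False
    pos = 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            raise ValueError(f"bad character {smiles[pos]!r} at {pos}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("bracket", "atom"):
            seen_atom = True
        elif kind == "ring":
            if not seen_atom:
                raise ValueError("ring closure before any atom")
            # The first sighting opens the ring, the second closes it.
            open_rings ^= {int(text.lstrip("%"))}
        elif kind == "open":
            if not seen_atom:
                raise ValueError("branch before any atom")
            depth += 1
        elif kind == "close":
            if depth == 0:
                raise ValueError(f"unbalanced ')' at {pos - 1}")
            depth -= 1
        tokens.append((kind, text))
    if depth:
        raise ValueError("unclosed '('")
    if open_rings:
        raise ValueError(f"unclosed ring closures: {sorted(open_rings)}")
    return tokens
```

The only state kept is a depth counter, a set of open ring numbers, and a flag. That is why no separate grammar stage is needed for this level of checking, though real SMILES has more rules than this sketch enforces.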
dekhn|3 months ago
mwt|3 months ago
the__alchemist|3 months ago
brilee|3 months ago
toast_x|3 months ago
Jaxan|3 months ago