Here is a problem I've been noodling with. If you are a decent programmer, how does your LLM help you solve this problem?
Given a cheminformatics fingerprint definition based on SMARTS substructure patterns, come up with a screening filter, likely using a decision tree, which uses intermediate feature tests to prune search space faster than simply testing each pattern one-by-one.
Some patterns could be screened with an element count test: count the number of fluorines, and only run the pattern test if the molecule to fingerprint contains enough of them.
So one stage might be to construct a list of element counts;
  # index = atomic number, value = number of atoms of that element in mol
  ele_counts = [0] * 200
  seen = set()
  for atom in mol.GetAtoms():
      eleno = atom.GetAtomicNum()
      ele_counts[eleno] += 1
      seen.add(eleno)
then have a lookup table for each element, listing the patterns which require no more than a given count of that element;
  ele_patterns = [
      # (max known count, list of pattern sets indexed by the molecule's count)
      (0, [set()]),  # element 0
      (0, [set()]),  # hydrogen
      ...
      (20, [{all patterns which contain no carbon},
            {all patterns which require at most 1 carbon}, ...
            {all patterns which require at most 19 carbons}]),  # carbon
      ...
      (10, [{all patterns which contain no fluorine}, ...
            {all patterns which require at most 9 fluorines}]),  # fluorine
      ...]
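To make the lookup idea concrete, here is a minimal sketch of how such tables might be built and used. Everything in it is an assumption for illustration: `PATTERN_MIN_COUNTS` is hand-written (a real version would derive the minimum element counts from the SMARTS), `build_element_tables` and `candidate_patterns` are hypothetical names, and "C(F)(F)F" is just a made-up example pattern. Unlike the table above, each list here has one extra final entry holding all patterns, so a molecule with more atoms than any pattern needs simply uses the last entry.

```python
from collections import Counter

# Hypothetical, hand-written requirements: pattern -> {atomic number: min count}.
PATTERN_MIN_COUNTS = {
    "S(=O)(=O)": {16: 1, 8: 2},
    "C(F)(F)F": {6: 1, 9: 3},
    "CC(=NNC=O)C": {6: 4, 7: 2, 8: 1},
}

def build_element_tables(pattern_min_counts):
    """For each element, tables[e][c] is the set of patterns that need
    at most c atoms of element e."""
    all_patterns = set(pattern_min_counts)
    elements = {e for needs in pattern_min_counts.values() for e in needs}
    tables = {}
    for e in elements:
        max_needed = max(needs.get(e, 0) for needs in pattern_min_counts.values())
        tables[e] = [
            {p for p, needs in pattern_min_counts.items() if needs.get(e, 0) <= c}
            for c in range(max_needed + 1)
        ]
    return all_patterns, tables

def candidate_patterns(atomic_nums, all_patterns, tables):
    """Intersect the per-element sets selected by the molecule's counts."""
    counts = Counter(atomic_nums)
    candidates = set(all_patterns)
    for e, table in tables.items():
        # Counts beyond the largest requirement reuse the last (full) entry.
        c = min(counts.get(e, 0), len(table) - 1)
        candidates &= table[c]
    return candidates

ALL_PATTERNS, TABLES = build_element_tables(PATTERN_MIN_COUNTS)

# Heavy atoms of dimethyl sulfone, CS(=O)(=O)C, as atomic numbers.
survivors = candidate_patterns([6, 16, 8, 8, 6], ALL_PATTERNS, TABLES)
```

Only the surviving patterns would then go on to the real substructure test. Note that the loop has to visit every element in the tables, including ones absent from the molecule, otherwise a pattern needing fluorine would survive for a molecule with none.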
However, this is not sophisticated enough to identify other tests, like the "CC(=NNC=O)C" example I gave before, or "S(=O)(=O)", which might be good tests at a higher level than the element counts.
And clearly if the molecule lacks a sulphur, or two oxygens, or two double bonds, then there's no need to test "S(=O)(=O)", suggesting a tree structure would be useful.
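As a rough illustration of what such a tree could look like, here is a hand-built sketch for the "S(=O)(=O)" case. It is not a way to derive the tree automatically, which is the hard part of the problem. The `Node` class, `run_tree`, and the dict-based molecule are all hypothetical; in particular the "matches" entry is only a stand-in for the expensive substructure search.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    name: str
    test: Callable[[dict], bool]
    children: List["Node"] = field(default_factory=list)
    is_pattern: bool = False  # True when `test` stands for a real pattern match

def run_tree(node, mol, hits, tested):
    """Run the node's test; descend to the children only if it passes."""
    tested.append(node.name)
    if not node.test(mol):
        return  # everything below this node is pruned
    if node.is_pattern:
        hits.append(node.name)
    for child in node.children:
        run_tree(child, mol, hits, tested)

# Leaf: the pattern itself. The lambda is a placeholder for a SMARTS match.
sulfonyl = Node("S(=O)(=O)", lambda m: "S(=O)(=O)" in m["matches"], is_pattern=True)

# Cheap feature tests stacked above the pattern.
tree = Node("S >= 1", lambda m: m["S"] >= 1, [
    Node("O >= 2", lambda m: m["O"] >= 2, [
        Node("double bonds >= 2", lambda m: m["double_bonds"] >= 2, [sulfonyl]),
    ]),
])

# A molecule with no sulphur stops at the first test.
no_sulphur = {"S": 0, "O": 3, "double_bonds": 1, "matches": set()}
hits_a, tested_a = [], []
run_tree(tree, no_sulphur, hits_a, tested_a)

# A molecule that passes every gate reaches the pattern test.
sulfone = {"S": 1, "O": 2, "double_bonds": 2, "matches": {"S(=O)(=O)"}}
hits_b, tested_b = [], []
run_tree(tree, sulfone, hits_b, tested_b)
```

Other sulphur-containing patterns would hang off the same "S >= 1" node, so one failed cheap test prunes all of them at once.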
nradov|1 year ago
LLMs are pretty good at giving you what you ask for. Not so good at telling you that you're asking for the wrong thing.
drewcoo|1 year ago
So they're comparable to rubber ducks. I would like to see data from a comparative study with rubber ducks, LLMs, and a control group.
dalke|1 year ago
Given a cheminformatics fingerprint definition based on SMARTS substructure patterns, come up with a screening filter, likely using a decision tree, which uses intermediate feature tests to prune search space faster than simply testing each pattern one-by-one.
For example, the Klekota-Roth patterns defined in their supplemental data (and also available from CDK at https://github.com/cdk/cdk/blob/main/descriptor/fingerprint/...) contain patterns like:
Clearly if 'CC(=NNC=O)C' does not exist in the molecule to fingerprint then there is no reason to test for the subsequent three patterns.
Similarly, there are patterns like:
which could be improved by an element count test: count the number of fluorines, and only run the pattern test if the molecule to fingerprint contains enough of them.
So one stage might be to construct a list of element counts;
then have a lookup table for each element, listing the patterns which require no more than a given count of that element; so one reduction can be to combine the sets selected by the molecule's element counts, and only test that subset of patterns.
However, this is not sophisticated enough to identify other tests, like the "CC(=NNC=O)C" example I gave before, or "S(=O)(=O)", which might be good tests at a higher level than the element counts.
And clearly if the molecule lacks a sulphur, or two oxygens, or two double bonds, then there's no need to test "S(=O)(=O)", suggesting a tree structure would be useful.
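The point about skipping the patterns that follow 'CC(=NNC=O)C' can be sketched with a prerequisite map. This is a toy under loud assumptions: the patterns are abstract strings ("AB", "ABC", ...) rather than SMARTS, plain substring containment stands in for substructure matching, and the `requires` map is written by hand, where a real version would need to work out which pattern is a substructure of which.

```python
def fingerprint(mol, patterns, requires, match):
    """Return (matched, skipped). A pattern is skipped, without calling
    `match`, when its prerequisite pattern did not match.
    `patterns` must list each prerequisite before its dependents."""
    matched = set()
    skipped = []
    for pat in patterns:
        parent = requires.get(pat)
        if parent is not None and parent not in matched:
            skipped.append(pat)
            continue
        if match(mol, pat):
            matched.add(pat)
    return matched, skipped

# Toy stand-in for a substructure search, counting how often it is called.
calls = []
def toy_match(mol, pat):
    calls.append(pat)
    return pat in mol

PATTERNS = ["AB", "ABC", "ABD", "ABE"]           # "AB" plays the parent
REQUIRES = {"ABC": "AB", "ABD": "AB", "ABE": "AB"}

# Parent absent: the three dependents are never tested.
matched_1, skipped_1 = fingerprint("XYZ", PATTERNS, REQUIRES, toy_match)
calls_1 = list(calls)

# Parent present: every dependent has to be tested.
calls.clear()
matched_2, skipped_2 = fingerprint("XABDX", PATTERNS, REQUIRES, toy_match)
calls_2 = list(calls)
```

In the first case only one match call is made instead of four, which is the saving the comment describes.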