Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster MCS Calculation #6

Open
kexul opened this issue Feb 14, 2023 · 7 comments
Open

Faster MCS Calculation #6

kexul opened this issue Feb 14, 2023 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@kexul
Copy link
Contributor

kexul commented Feb 14, 2023

The current matchAtoms() uses the entire molecule to get MCS, a simple idea to accelerate it is to get MCS for heavy atoms and then match up hydrogens hanging off the MCS.
See this implementation for example:
https://github.com/OpenFreeEnergy/Lomap/blob/84e1bb0d20ceeed835166fe7218dd21f04248030/lomap/mcs.py#L1170

I didn't quantitatively test the speedup(no benchmark found) but it performed very well for large molecules in my project ( the current matchAtoms() often times out).

@jmichel80
Copy link
Contributor

Hi @kexul - have you tried matchAtoms on molecules without hydrogens, and then using the resulting MCS to build a prematch to pass as keyword argument for another call to matchAtoms using the complete molecules ? I think that could work. Note that this approach may not give you the same MCS that you get calling matchAtoms() on the full molecules. If you can share examples of molecules that are slow to matchI may be able to advise on ways to speed things up.

@kexul
Copy link
Contributor Author

kexul commented Feb 14, 2023

Hi @kexul - have you tried matchAtoms on molecules without hydrogens, and then using the resulting MCS to build a prematch to pass as keyword argument for another call to matchAtoms using the complete molecules ?

Hi @jmichel80 - Yes! I've used this approach for a long time. But it often leads to prematch not found in rdkit search results and then Sire is used to find the full MCS (add hydrogen to MCS). I'm not sure what algorithm Sire is used, but it's painfully slow.

If you can share examples of molecules that are slow to match may be able to advise on ways to speed things up.

I'll try to provide some examples but it may take some time. 😢

@lohedges
Copy link
Contributor

Thanks, @kexul, I'll look into this. Given that we use the same tools behind the scenes it should be easy enough to implement (or test) the approach that you suggest. Before we switched to using RDKit for MCS (we used to just use the Sire functionality) we had custom code to do similar to what you suggest. The issue we found was, as @jmichel80 says, that the MCS might differ to the full molecule search. In order to reliably get the mappings that we wanted we ended up doing both approaches and taking the larger mapping, which is obviously very wasteful.

@lohedges
Copy link
Contributor

@jmichel80: The prematch approach won't work since there's no way to use it as a seed for the MCS. Unfortunately we just use it retrospectively, i.e. checking that the MCS generated by RDKit does contain the prematch that the user wants. (If there are multiple MCS with the same size, then we return the one that matches the prematch, if it exists.) It could well be that the algorithm passed through a state containing the pre-match during its search, but we wont know this, since we only get the final result. If nothing matches the pre-match, then we fall back on Sire, which does check the pre-match at each stage of the MCS algorithm, i.e. as it grows in size.

@jmichel80
Copy link
Contributor

@kexul by how much do you have to increase the timeout argument to reach completion for your examples?
It may be preferable to have a more robust and time consuming MCS search rather than a faster one that is more error prone. I depends on the computing cost of the operations downstream of the MCS search.

@kexul
Copy link
Contributor Author

kexul commented Feb 14, 2023

Sorry, it takes much longer than I think to find a proper example. I'll need some time to dig my old files...
@jmichel80 In my previous experiments, I tried increasing the timeout to 5 min but matchAtoms() still fails to find a reasonable mapping.

@lohedges lohedges added the enhancement New feature or request label Feb 24, 2023
@lohedges
Copy link
Contributor

lohedges commented Mar 3, 2023

I see the RDKit MCS code has an InitialSeed option. Perhaps we could try matching with heavy atoms only, then using the MCS as a seed for a full atom search?

@lohedges lohedges self-assigned this Apr 21, 2023
lohedges pushed a commit that referenced this issue Jun 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants