Dear Uni-Mol Tools development team,
Thank you for your amazing work on molecular representation learning!
I raised an issue in the Uni-Mol repository a month ago but haven't received a response yet. I am not sure whether the problem is within the scope of your team, and I apologize for the duplicate, but I would kindly ask for your help.
I am interested in computing pocket representations with Uni-Mol for some experimental structures from the PDB.
As I understand from the paper (Appendix A), raw PDB data is first preprocessed: missing heavy atoms, hydrogen atoms, and water molecules are added.
While going through the Uni-Mol repository, specifically the example for computing pocket representations, I could not find the part where such preprocessing is performed. I also went through the Uni-Mol Tools examples and documentation (https://unimol-tools.readthedocs.io/en/latest/), but did not manage to find any code relevant to such preprocessing.
As far as I understand, this needs to be done as a prerequisite.
Could you please provide a script with an example of preprocessing raw protein data for the Uni-Mol pocket encoder?
I also have a few other related questions about preprocessing:
- In Appendix C, it is stated that hydrogen atoms were removed from the pocket input structures during pretraining. However, in the pretraining example, the remove-hydrogen flag is not used. It also seems that the pocket pretraining dataset transformations retain hydrogens in the structure. Could you clarify this discrepancy?
- Does one need to remove heterogens (ions, cofactors) during raw data preprocessing?
- How should hydrogen and water positions be added: with a force field or by using templates?