Preprocessing of raw .pdb protein files for Uni-Mol pocket encoder

### Details

Dear Uni-Mol Tools development team,
Thank you for your amazing work on molecular representation learning!

I raised an [issue](https://github.com/deepmodeling/Uni-Mol/issues/370) in the Uni-Mol repository a month ago but have not received a response yet. I am not sure whether the problem is within the scope of your team, and I apologize for the duplicate, but I would appreciate your help.

I am interested in computing pocket representations with Uni-Mol for some experimental structures from the PDB.
As I understand from the paper (Appendix A), raw PDB data is first preprocessed: missing heavy atoms, hydrogen atoms, and water molecules are added.

While going through the Uni-Mol repository, specifically the [example](https://github.com/deepmodeling/Uni-Mol/blob/main/unimol/notebooks/unimol_pocket_repr_demo.ipynb) for computing pocket representations, I could not find the part where this preprocessing is performed. I also went through the Uni-Mol Tools [examples](https://github.com/deepmodeling/unimol_tools?tab=readme-ov-file#examples) and documentation (https://unimol-tools.readthedocs.io/en/latest/), but did not find any code relevant to this preprocessing.
As far as I understand, this needs to be done as a prerequisite.

Could you please provide a script with an example of preprocessing raw protein data for the Uni-Mol pocket encoder?

I also have a few other related questions about preprocessing:

1) In Appendix C, it is stated that hydrogen atoms were removed from the pocket input structures during pretraining. However, in the pretraining [example](https://github.com/deepmodeling/Uni-Mol/tree/main/unimol#pocket-pretraining), the remove-hydrogen flag is not used. It also seems that the pocket pretraining dataset [transformations](https://github.com/deepmodeling/Uni-Mol/blob/main/unimol/unimol/tasks/unimol_pocket.py) retain hydrogens in the structure. Could you clarify this discrepancy?
2) Does one need to remove heterogens (ions, cofactors) during raw data preprocessing?
3) How should hydrogen and water positions be added: with a force field or by using templates?
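
To make questions 1 and 2 concrete, here is a minimal sketch of the kind of filtering step I have in mind. It is not taken from the Uni-Mol code base, and I do not know whether it matches the actual pipeline. The names `is_hydrogen` and `filter_pdb_lines` and all three flags are hypothetical, and the sketch assumes fixed-column PDB records (residue name in columns 18-20, element symbol in columns 77-78). It only removes atoms; it does not add missing heavy atoms, hydrogens, or waters.

```python
def is_hydrogen(line: str) -> bool:
    """Return True if a PDB ATOM/HETATM line describes a hydrogen (or deuterium)."""
    element = line[76:78].strip().upper()  # element symbol, columns 77-78
    if element:
        return element in ("H", "D")
    # Heuristic fallback when the element column is empty:
    # atom name (columns 13-16) with any leading digits removed.
    name = line[12:16].strip().lstrip("0123456789")
    return name.startswith("H")


def filter_pdb_lines(lines, drop_hydrogens=True, drop_heterogens=True,
                     drop_waters=True):
    """Filter ATOM/HETATM records; all other records pass through unchanged."""
    kept = []
    for line in lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            kept.append(line)
            continue
        resname = line[17:20].strip()  # residue name, columns 18-20
        is_water = resname in ("HOH", "WAT")
        if is_water and drop_waters:
            continue
        # Non-water HETATM records: ions, cofactors, ligands.
        if record == "HETATM" and not is_water and drop_heterogens:
            continue
        if drop_hydrogens and is_hydrogen(line):
            continue
        kept.append(line)
    return kept
```

With this sketch, question 1 amounts to whether `drop_hydrogens` should be on for inference inputs, and question 2 to whether `drop_heterogens` should be on.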
