Preprocessing of raw .pdb protein files for Uni-Mol pocket encoder #23

@alexander-telepov

Dear Uni-Mol Tools development team,
Thank you for your amazing work on molecular representation learning!

I raised an issue in the Uni-Mol repository a month ago but haven't received a response yet. I am not sure whether the problem is within the scope of your team, and I apologize for the duplicate, but I would kindly ask for your help.

I am interested in computing pocket representations with Uni-Mol for some experimental structures from the PDB.
As I understand from the paper (Appendix A), raw PDB data is first preprocessed: missing heavy atoms, hydrogen atoms, and water molecules are added.

While going through the Uni-Mol repository, specifically the example for computing pocket representations, I could not find the part where this preprocessing is performed. I also went through the Uni-Mol Tools examples and documentation (https://unimol-tools.readthedocs.io/en/latest/), but did not succeed in finding code relevant to this preprocessing.
As far as I understand, this needs to be done as a prerequisite.

Could you please provide a script with an example of preprocessing raw protein data for the Uni-Mol pocket encoder?

I also have a few other related questions about preprocessing:

  1. In Appendix C, it is stated that hydrogen atoms were removed from the pocket input structures during pretraining. However, in the pretraining example, the remove-hydrogen flag is not used. It also seems that the pocket pretraining dataset transformations retain hydrogens in the structure. Could you clarify this discrepancy?
  2. Does one need to remove heterogens (ions, cofactors) during raw data preprocessing?
  3. How should hydrogen and water positions be added: with a force field or by using templates?
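
To make questions 1 and 2 concrete, the filtering they refer to could look like the sketch below. It is not taken from Uni-Mol or Uni-Mol Tools: `filter_pdb_lines` and its flags are hypothetical names, it relies only on the fixed-column layout of PDB `ATOM`/`HETATM` records, and it only removes records. It does not add missing heavy atoms, hydrogens, or waters, which is the part of the preprocessing this issue asks about.

```python
WATER_NAMES = ("HOH", "WAT", "DOD")


def filter_pdb_lines(pdb_text, drop_hydrogens=True,
                     drop_heterogens=True, drop_waters=True):
    """Return PDB text with selected ATOM/HETATM records removed.

    Hypothetical helper for illustration only; other record types
    (e.g. CONECT, ANISOU) are passed through unchanged.
    """
    kept = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            kept.append(line)
            continue

        res_name = line[17:20].strip()           # columns 18-20
        element = line[76:78].strip().upper()    # columns 77-78
        if not element:
            # Heuristic fallback: first letter of the atom name
            # (columns 13-16) after stripping leading digits.
            name = line[12:16].strip().lstrip("0123456789")
            element = name[:1].upper()

        if drop_hydrogens and element in ("H", "D"):
            continue
        if res_name in WATER_NAMES:
            if drop_waters:
                continue
        elif record == "HETATM" and drop_heterogens:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"
```

For example, `filter_pdb_lines(text, drop_heterogens=False)` would keep ions and cofactors while still stripping hydrogens and waters. Which of these flags matches the pretraining setup is exactly what questions 1 and 2 are asking.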
