23 changes: 19 additions & 4 deletions .claude/claude-docs/bingo-elastic-python.md
@@ -4,10 +4,10 @@ The standalone Python library at `bingo/bingo-elastic/python/` that indexes Indi

## Module map

-- `bingo_elastic/elastic.py` — `ElasticRepository` and `AsyncElasticRepository` (parallel sync/async classes, both take a `tau_search: bool` flag), `IndexName` enum (`BINGO_MOLECULE`, `BINGO_REACTION`, `BINGO_CUSTOM`), `build_index_body(tau_search)` builder for the index mapping, and `compile_query` (dispatches a query subject + kwargs to the right query class; reroutes substructure to the tautomer path when `options` contains `TAU`).
+- `bingo_elastic/elastic.py` — `ElasticRepository` and `AsyncElasticRepository` (parallel sync/async classes, both take `tau_search: bool` and `custom_properties: CustomPropertiesMapping` flags), `IndexName` enum (`BINGO_MOLECULE`, `BINGO_REACTION`, `BINGO_CUSTOM`), `build_index_body(tau_search, custom_properties)` builder for the index mapping (merges `custom_properties` into `mappings.properties` and rejects collisions with `RESERVED_FIELDS`), and `compile_query` (dispatches a query subject + kwargs to the right query class; reroutes substructure to the tautomer path when `options` contains `TAU`). `CustomPropertiesMapping = Dict[str, Dict[str, Any]]`: keys are field names, values are ES property-mapping fragments (e.g. `{"n": {"type": "integer"}}`).
- `bingo_elastic/queries.py` — `CompilableQuery` hierarchy: `SubstructureQuery`, `TautomerSubstructureQuery` (subclass swapping in the `sub-tau` fingerprint and `tau_fingerprint` field), `ExactMatch`, similarity matches (`TanimotoSimilarityMatch`, `EuclidSimilarityMatch`, `TverskySimilarityMatch`), plus `KeywordQuery`, `RangeQuery`, `WildcardQuery` for non-chemical fields. `query_factory` maps kwarg keys (`"substructure"`, `"tautomer"`, `"exact"`, …) to a class.
-- `bingo_elastic/model/record.py` — `IndigoRecord` (abstract), `IndigoRecordMolecule`, `IndigoRecordReaction`, and the `WithIndigoObject` descriptor that extracts fingerprints + `cmf` + `hash` from an `IndigoObject` at construction time. The descriptor also computes the `sub-tau` fingerprint when the record was built with `tau_search=True`.
-- `bingo_elastic/model/helpers.py` — file iterators (`iterate_file`, `load_reaction`).
+- `bingo_elastic/model/record.py` — `IndigoRecord` (abstract), `IndigoRecordMolecule`, `IndigoRecordReaction`, and the `WithIndigoObject` descriptor that extracts fingerprints + `cmf` + `hash` from an `IndigoObject` at construction time. The descriptor also computes the `sub-tau` fingerprint when the record was built with `tau_search=True`, and copies non-reserved properties from `iterateProperties()` (SDF tags) onto the record. `IndigoRecord(custom_properties=…)` accepts an iterable of property names used as a per-record allowlist; `RESERVED_FIELDS` lists names the extractor never overwrites (`cmf`, `hash`, fingerprints, etc.).
+- `bingo_elastic/model/helpers.py` — file iterators (`iterate_file` generic dispatcher plus format-specific wrappers `iterate_sdf` / `iterate_smiles` / `iterate_cml`) and single-file loaders (`load_molecule`, `load_reaction`). All iterators accept `custom_properties=` (an iterable of allowed property names) and forward it to the records they yield — pass the keys of the repo's `custom_properties` mapping so extraction and the ES mapping stay aligned.
- `tests/` — its own pytest suite with `conftest.py` fixtures that connect to `localhost:9200`.
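
The dispatch described for `query_factory` and `compile_query` can be pictured with a toy table. This is only a sketch: the classes below are bare stand-ins that share names with the real ones, `QUERY_CLASSES` and `pick_query` are hypothetical, and matching `TAU` as a whitespace-separated token is an assumption about how `options` is parsed.

```python
from typing import Dict, Type


class SubstructureQuery:
    """Toy stand-in: only records which fingerprint field it would search."""

    field = "sub_fingerprint"

    def __init__(self, value: str) -> None:
        self.value = value


class TautomerSubstructureQuery(SubstructureQuery):
    """Same query shape, but pointed at the tautomer fingerprint field."""

    field = "tau_fingerprint"


class ExactMatch:
    """Toy stand-in for the exact-match query."""

    def __init__(self, value: str) -> None:
        self.value = value


# Hypothetical dispatch table keyed by the kwarg name the caller used.
QUERY_CLASSES: Dict[str, Type] = {
    "substructure": SubstructureQuery,
    "tautomer": TautomerSubstructureQuery,
    "exact": ExactMatch,
}


def pick_query(key: str, value: str, options: str = ""):
    """Choose a query class; reroute substructure when options contain TAU."""
    # Assumption: TAU appears as its own whitespace-separated token.
    if key == "substructure" and "TAU" in options.split():
        key = "tautomer"
    return QUERY_CLASSES[key](value)
```

With this sketch, `pick_query("substructure", "c1ccccc1", options="TAU")` yields the tautomer variant, while the same call without `options` stays on the plain substructure path.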

## Core flow (the non-obvious bit)
@@ -29,6 +29,21 @@ The tautomer path is opt-in on **both sides**: `tau_search=True` on the record g

Side effect to remember: fingerprints are computed at **record construction time** in the `WithIndigoObject` descriptor, not at `index_records()` time. By the time records reach the repo, the fingerprint is already frozen on the instance.
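
A minimal sketch of why the value is frozen at construction: a descriptor's `__set__` runs when the attribute is assigned, so anything it derives is computed once, at that moment. `ComputedOnAssign` and `ToyRecord` are hypothetical stand-ins, and the "fingerprint" here is just a sorted set of characters, not a real Indigo fingerprint.

```python
class ComputedOnAssign:
    """Descriptor that derives values the moment the attribute is set."""

    def __set__(self, instance, value: str) -> None:
        # Stand-in for fingerprint extraction: done once, at assignment time.
        instance.__dict__["fingerprint"] = sorted(set(value))
        instance.__dict__["_raw"] = value


class ToyRecord:
    structure = ComputedOnAssign()

    def __init__(self, structure: str) -> None:
        # Assigning here triggers ComputedOnAssign.__set__ immediately.
        self.structure = structure


rec = ToyRecord("CCO")
frozen = rec.fingerprint  # already present before any indexing step runs
```

Nothing later in the pipeline recomputes `fingerprint`; a new value only appears if `structure` is assigned again.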

## Custom SDF properties

SDF `> <TAG>` data items become record attributes when `WithIndigoObject` calls `iterateProperties()`; extra kwargs passed to `IndigoRecord(**kwargs)` are set as attributes directly. Without configuration this relies on ES dynamic mapping, so every tag becomes a `text`/`keyword` field. To get a *typed* field (e.g. an integer you can query with `RangeQuery`), pass a `custom_properties` mapping to the repo **and** the matching keys to the iterator:

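```python
from bingo_elastic.elastic import ElasticRepository, IndexName
from bingo_elastic.model.helpers import iterate_sdf

mapping = {"n": {"type": "integer"}, "CAS": {"type": "keyword"}}
repo = ElasticRepository(IndexName.BINGO_MOLECULE, custom_properties=mapping)
# Iterating a dict yields its keys; an iterable of just the keys also works.
for rec in iterate_sdf("file.sdf", custom_properties=mapping):
    repo.index_record(rec)
```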

The same dict drives both consumers: keys are the extraction allowlist, values are the ES `properties` fragments. `build_index_body` raises `ValueError` if a key clashes with `RESERVED_FIELDS`. Leaving `custom_properties=None` (default on both sides) preserves the legacy "extract every tag, let ES dynamic-map them" behavior.
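
How one dict can serve both consumers is sketched below in plain Python, without Elasticsearch or Indigo. `merge_mapping` and `extract` are hypothetical helpers that mirror the described behavior, `RESERVED` is an assumed subset of `RESERVED_FIELDS`, and the base mapping is invented for the example.

```python
from typing import Any, Dict, Iterable, Optional

# Assumed subset of the reserved names, for illustration only.
RESERVED = frozenset({"cmf", "hash", "name"})


def merge_mapping(
    base: Dict[str, Any],
    custom: Optional[Dict[str, Dict[str, Any]]],
) -> Dict[str, Any]:
    """Return base plus the custom fields, refusing reserved names."""
    merged = dict(base)
    if custom:
        clashes = set(custom) & RESERVED
        if clashes:
            raise ValueError(f"reserved field name(s): {sorted(clashes)}")
        merged.update(custom)
    return merged


def extract(
    tags: Dict[str, str],
    allowed: Optional[Iterable[str]],
) -> Dict[str, str]:
    """Keep non-reserved tags; with an allowlist, keep only listed names."""
    allow = None if allowed is None else frozenset(allowed)
    return {
        name: raw
        for name, raw in tags.items()
        if name not in RESERVED and (allow is None or name in allow)
    }


mapping = {"n": {"type": "integer"}}
props = merge_mapping({"existing": {"type": "keyword"}}, mapping)
doc = extract({"n": "7", "comment": "x", "hash": "h"}, mapping)
legacy = extract({"n": "7", "comment": "x", "hash": "h"}, None)
```

Here `doc` keeps only `n`, while `legacy` keeps every non-reserved tag, which matches the `custom_properties=None` behavior described above.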

Caveat: Elasticsearch does not allow changing the type of an existing field once the index has been created. Changing `custom_properties` later requires `ElasticRepository.delete_all_records()` first; otherwise `create_index` swallows `resource_already_exists_exception` and keeps the old mapping.
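
The failure mode is easy to miss because nothing raises. The toy below reproduces it with an in-memory stand-in; `FakeIndexStore` and this `create_index` are hypothetical and only imitate the swallow-and-continue behavior, they are not the library's code.

```python
class FakeIndexStore:
    """In-memory stand-in for a server that refuses to recreate an index."""

    def __init__(self) -> None:
        self.mappings = {}

    def create(self, name: str, mapping: dict) -> None:
        if name in self.mappings:
            raise RuntimeError("resource_already_exists_exception")
        self.mappings[name] = mapping

    def delete(self, name: str) -> None:
        self.mappings.pop(name, None)


def create_index(store: FakeIndexStore, name: str, mapping: dict) -> None:
    """Create the index, ignoring the already-exists error."""
    try:
        store.create(name, mapping)
    except RuntimeError as err:
        if "resource_already_exists_exception" not in str(err):
            raise


store = FakeIndexStore()
create_index(store, "idx", {"n": {"type": "keyword"}})
create_index(store, "idx", {"n": {"type": "integer"}})  # silently ignored
stale = store.mappings["idx"]["n"]["type"]

store.delete("idx")  # drop the index first, then recreate it
create_index(store, "idx", {"n": {"type": "integer"}})
fresh = store.mappings["idx"]["n"]["type"]
```

The second `create_index` call leaves the old `keyword` type in place; only after the delete does the new mapping take effect.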

## Tests

Run from `bingo/bingo-elastic/python/`:
@@ -45,7 +60,7 @@ For spinning up Elasticsearch see [claude-docs/testing.md](testing.md).

## Sync/Async parity

-`ElasticRepository` and `AsyncElasticRepository` are independent classes — there is no shared base. Any signature or behavior change on one must be mirrored on the other (constructors, `filter`, `index_records`, `index_record`). Tests pair every sync `test_*` with an async `test_a_*`; follow that pattern.
+`ElasticRepository` and `AsyncElasticRepository` are independent classes — there is no shared base. Any signature or behavior change on one must be mirrored on the other (constructors, including `tau_search` and `custom_properties`; `filter`; `index_records`; `index_record`). Tests pair every sync `test_*` with an async `test_a_*`; follow that pattern. Note that `delete_all_records` currently exists only on the sync class; the autouse `clear_index` fixture in `tests/conftest.py` uses it to wipe both indices before every test.
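
Because nothing enforces the parity rule, a signature comparison is one way to catch drift early. The sketch below uses the standard `inspect.signature`; `SyncRepo`, `AsyncRepo`, and `signature_mismatches` are hypothetical stand-ins, not part of the test suite.

```python
import inspect


class SyncRepo:
    def __init__(self, index_name, *, tau_search=False, custom_properties=None):
        pass

    def index_record(self, record):
        pass


class AsyncRepo:
    def __init__(self, index_name, *, tau_search=False, custom_properties=None):
        pass

    async def index_record(self, record):
        pass


def signature_mismatches(sync_cls, async_cls, names):
    """Return the method names whose signatures differ between the classes."""
    return [
        name
        for name in names
        if inspect.signature(getattr(sync_cls, name))
        != inspect.signature(getattr(async_cls, name))
    ]


drift = signature_mismatches(SyncRepo, AsyncRepo, ["__init__", "index_record"])
```

An `async def` method has the same signature as its sync counterpart, so only real parameter differences are reported.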

## Java sibling

48 changes: 43 additions & 5 deletions bingo/bingo-elastic/python/bingo_elastic/elastic.py
@@ -25,6 +25,7 @@
from indigo import Indigo, IndigoObject # type: ignore

from bingo_elastic.model.record import (
RESERVED_FIELDS,
IndigoRecord,
IndigoRecordMolecule,
IndigoRecordReaction,
@@ -34,6 +35,13 @@

ElasticRepositoryT = TypeVar("ElasticRepositoryT")

# Mapping of custom (e.g. SDF tag) field name -> ES property mapping fragment,
# e.g. {"MolecularWeight": {"type": "float"}, "CAS": {"type": "keyword"}}.
# The same dict drives (1) the index mapping passed to Elasticsearch and
# (2) the allowlist used by record extraction (see IndigoRecord and
# iterate_file).
CustomPropertiesMapping = Dict[str, Dict[str, Any]]

MAX_ALLOWED_SIZE = 1000


@@ -110,7 +118,10 @@ def get_client(
return client_type(**arguments) # type: ignore


-def build_index_body(tau_search: bool = False) -> Dict:
+def build_index_body(
+    tau_search: bool = False,
+    custom_properties: Optional[CustomPropertiesMapping] = None,
+) -> Dict:
index_body = {
"mappings": {
"properties": {
@@ -142,6 +153,15 @@ def build_index_body(tau_search: bool = False) -> Dict:
}
)

    if custom_properties:
        collisions = set(custom_properties).intersection(RESERVED_FIELDS)
        if collisions:
            raise ValueError(
                "custom_properties uses reserved field name(s): "
                f"{sorted(collisions)}"
            )
        index_body["mappings"]["properties"].update(custom_properties)

return index_body


@@ -216,7 +236,7 @@ def response_to_records(


class AsyncElasticRepository:
-    def __init__(
+    def __init__(  # pylint: disable=too-many-arguments
self,
index_name: IndexName,
*,
@@ -228,6 +248,7 @@ def __init__(
request_timeout: int = 60,
retry_on_timeout: bool = True,
tau_search: bool = False,
custom_properties: Optional[CustomPropertiesMapping] = None,
) -> None:
"""
:param index_name: use function get_index_name for setting this argument
@@ -241,10 +262,18 @@ def __init__(
:param tau_search: declare tau_fingerprint in the index mapping so
tautomer-aware substructure search is available via
filter(..., options="TAU ...")
:param custom_properties: ES mapping fragments for caller-defined
fields (SDF tags or kwargs passed to IndigoRecord). Keys are field
names; values are ES property mappings, e.g.
{"MolecularWeight": {"type": "float"}, "CAS": {"type": "keyword"}}.
The keys should also be passed as ``custom_properties=`` to
iterate_sdf/iterate_file so extraction and the index mapping
agree on which fields exist.
"""
self.index_name = index_name.value
self.tau_search = tau_search
-        self.index_body = build_index_body(tau_search)
+        self.custom_properties = custom_properties
+        self.index_body = build_index_body(tau_search, custom_properties)

self.el_client = get_client(
client_type=AsyncElasticsearch,
@@ -394,7 +423,7 @@ async def __aexit__(self, *args, **kwargs) -> None:


class ElasticRepository:
-    def __init__(
+    def __init__(  # pylint: disable=too-many-arguments
self,
index_name: IndexName,
*,
@@ -406,6 +435,7 @@ def __init__(
request_timeout: int = 60,
retry_on_timeout: bool = True,
tau_search: bool = False,
custom_properties: Optional[CustomPropertiesMapping] = None,
) -> None:
"""
:param index_name: use function get_index_name for setting this argument
@@ -419,10 +449,18 @@ def __init__(
:param tau_search: declare tau_fingerprint in the index mapping so
tautomer-aware substructure search is available via
filter(..., options="TAU ...")
:param custom_properties: ES mapping fragments for caller-defined
fields (SDF tags or kwargs passed to IndigoRecord). Keys are field
names; values are ES property mappings, e.g.
{"MolecularWeight": {"type": "float"}, "CAS": {"type": "keyword"}}.
The keys should also be passed as ``custom_properties=`` to
iterate_sdf/iterate_file so extraction and the index mapping
agree on which fields exist.
"""
self.index_name = index_name.value
self.tau_search = tau_search
-        self.index_body = build_index_body(tau_search)
+        self.custom_properties = custom_properties
+        self.index_body = build_index_body(tau_search, custom_properties)

self.el_client = get_client(
client_type=Elasticsearch,
18 changes: 16 additions & 2 deletions bingo/bingo-elastic/python/bingo_elastic/model/helpers.py
@@ -1,5 +1,5 @@
from pathlib import Path
-from typing import Callable, Generator, Optional, Union
+from typing import Callable, Generator, Iterable, Optional, Union

from indigo import Indigo, IndigoObject # type: ignore

@@ -14,6 +14,7 @@ def iterate_file(
iterator: Optional[str] = None,
error_handler: Optional[Callable[[object, BaseException], None]] = None,
session: Optional[Indigo] = None,
custom_properties: Optional[Iterable[str]] = None,
) -> Generator[IndigoRecordMolecule, None, None]:
"""
:param file:
@@ -24,6 +25,11 @@ def iterate_file(
:param error_handler: lambda for catching exceptions
:type error_handler: Optional[Callable[[object, BaseException], None]]
:type session: Optional[Indigo]
:param custom_properties: SDF tag names to extract; pass the keys of the
ElasticRepository's custom_properties mapping so extracted attributes
match what the index mapping declares. None (default) keeps the
legacy behaviour of extracting every non-reserved property.
:type custom_properties: Optional[Iterable[str]]
:return:
"""
iterators = {
@@ -43,46 +49,54 @@ def iterate_file(
indigo_object: IndigoObject
for indigo_object in getattr(session, iterator_fn)(str(file)):
yield IndigoRecordMolecule(
-            indigo_object=indigo_object, error_handler=error_handler
+            indigo_object=indigo_object,
+            error_handler=error_handler,
+            custom_properties=custom_properties,
)


def iterate_sdf(
file: Union[Path, str],
error_handler: Optional[Callable[[object, BaseException], None]] = None,
session: Optional[Indigo] = None,
custom_properties: Optional[Iterable[str]] = None,
) -> Generator:
yield from iterate_file(
Path(file) if isinstance(file, str) else file,
"sdf",
error_handler=error_handler,
session=session,
custom_properties=custom_properties,
)


def iterate_smiles(
file: Union[Path, str],
error_handler: Optional[Callable[[object, BaseException], None]] = None,
session: Optional[Indigo] = None,
custom_properties: Optional[Iterable[str]] = None,
) -> Generator:
yield from iterate_file(
Path(file) if isinstance(file, str) else file,
"smiles",
error_handler=error_handler,
session=session,
custom_properties=custom_properties,
)


def iterate_cml(
file: Union[Path, str],
error_handler: Optional[Callable[[object, BaseException], None]] = None,
session: Optional[Indigo] = None,
custom_properties: Optional[Iterable[str]] = None,
) -> Generator:
yield from iterate_file(
Path(file) if isinstance(file, str) else file,
"cml",
error_handler=error_handler,
session=session,
custom_properties=custom_properties,
)


61 changes: 58 additions & 3 deletions bingo/bingo-elastic/python/bingo_elastic/model/record.py
@@ -1,12 +1,33 @@
from __future__ import annotations

-from typing import Callable, Dict, List, Optional
+from typing import Callable, Dict, FrozenSet, Iterable, List, Optional
from uuid import uuid4

from indigo import Indigo, IndigoException, IndigoObject # type: ignore

MOL_TYPES = ["#02: <molecule>", "#03: <query reaction>", "#12: <RDFMolecule>"]
REAC_TYPES = ["#04: <reaction>", "#05: <query reaction>"]
RESERVED_FIELDS = frozenset(
    {
        "cmf",
        "name",
        "hash",
        "has_error",
        "rawData",
        "sim_fingerprint",
        "sim_fingerprint_len",
        "sub_fingerprint",
        "sub_fingerprint_len",
        "tau_fingerprint",
        "tau_fingerprint_len",
        "record_id",
        "error_handler",
        "skip_errors",
        "tau_search",
        "indigo_object",
        "elastic_response",
    }
)


# pylint: disable=unused-argument
@@ -31,7 +52,7 @@ def __set__(self, instance: IndigoRecord, value: Dict):


class WithIndigoObject:
-    def __set__(  # pylint: disable=too-many-branches
+    def __set__(  # pylint: disable=too-many-statements, too-many-branches, too-many-locals
self, instance: IndigoRecord, value: IndigoObject
) -> None:
try:
@@ -92,6 +113,18 @@ def __set__( # pylint: disable=too-many-branches
except IndigoException as err_:
check_error(instance, err_)

        allowed = getattr(instance, "_custom_properties", None)
        try:
            for prop in value_dup.iterateProperties():
                prop_name = prop.name()
                if prop_name in RESERVED_FIELDS:
                    continue
                if allowed is not None and prop_name not in allowed:
                    continue
                setattr(instance, prop_name, prop.rawData())
        except IndigoException as err_:
            check_error(instance, err_)


class IndigoRecord:
"""
@@ -114,6 +147,7 @@ class IndigoRecord:
elastic_response = WithElasticResponse()
record_id: Optional[str] = None
error_handler: Optional[Callable[[object, BaseException], None]] = None
_custom_properties: Optional[FrozenSet[str]] = None

def __new__(cls, *args, **kwargs):
if cls is IndigoRecord:
@@ -143,6 +177,13 @@ def __init__(self, **kwargs) -> None:
:param skip_errors: if True, all errors will be skipped,
no error_handler is required
:type skip_errors: bool
:param custom_properties: iterable of SDF tag names to extract from the
IndigoObject. If None (default), every non-reserved
property is extracted (backwards-compatible). Pass
the keys of the ElasticRepository's custom mapping
here so the indexed schema matches what the index
mapping declares.
:type custom_properties: Optional[Iterable[str]]
"""

# First check if skip_errors flag passed
@@ -155,12 +196,26 @@ def __init__(self, **kwargs) -> None:
self.record_id = uuid4().hex

self.tau_search = kwargs.pop("tau_search", False)
        # Must be set before indigo_object assignment so the descriptor sees it
        custom_properties: Optional[Iterable[str]] = kwargs.pop(
            "custom_properties", None
        )
        self._custom_properties = (
            frozenset(custom_properties)
            if custom_properties is not None
            else None
        )
for arg, val in kwargs.items():
setattr(self, arg, val)

def as_dict(self) -> Dict:
# Add system fields here to exclude from indexing
-        filtered_fields = {"error_handler", "skip_errors", "tau_search"}
+        filtered_fields = {
+            "error_handler",
+            "skip_errors",
+            "tau_search",
+            "_custom_properties",
+        }
return {
key: value
for key, value in self.__dict__.items()