feat: support Template Strings, eg t.select(doubled=t"{_.age} * 2")#11599
NickCrews wants to merge 2 commits into
|
All of the failing tests look to be unrelated, some issue with pins, possibly due to GCS access issues? |
a8687db to a62f475
t.select(doubled=t"{_.age} * 2")
Before, if you used this function when duckdb wasn't installed, you got an error. Now you can use it without duckdb installed.
|
I'm realizing I want this to support multiple dialects better. eg if you just have
I'm not sure exactly how this should look:
duck = ibis.sql_value(t"...", type=int, dialect="duckdb")
bq = ibis.sql_value(t"...", type=int, dialect="bigquery")
ibis.sql_value([duck, bq])
I don't love this though, it takes too much typing.
My desire here is that this would be the key that enables a library author to expose a single function that is portable between backends. Currently, there is no way to do this without actually writing an Op and then monkeypatching compile rules into the compiler. Of course, a SQL interface for writing the different compilations completely prevents you from writing a compilation rule for non-SQL backends, eg polars. So perhaps the Op/compile rule method should be the official way to get true portability (and we should make this an official API), and these sql_value strings are just a quick and dirty way to get something done on a particular backend or a small subset of backends. |
|
@cpcloud can I schedule a video call with you to talk through this? I would love to get confirmation this is the right direction before I sink more time into it. |
|
Yeah, let's do it. |
|
@cpcloud can you book 60 minutes with me here? https://calendar.app.google/Ab3PFW7SFXHdb3b27 |
|
I have implemented a nice API of this for duckdb-python here, I am thinking of porting some of those learnings back to this PR. |
Description of changes
I am very excited about this PR, I think this really brings ibis into the future of python and SQL engines. I don't know of any other python dataframe library that has an API like this at all. It removes a huge barrier for folks who want to be able to write closer to raw SQL, and also allows more escape hatches to get down into the SQL guts when folks need it.
In summary, this allows for
my_table.select(doubled=t"{ibis._.my_col} * 2") in python 3.14+, and my_table.select(doubled=ibis.t("{ibis._.my_col} * 2")) in versions below that.
Fixes #11525. Inspired by https://orm.drizzle.team/docs/sql.
I still need to add more tests to be thorough, but at a high level I am quite happy with the API.
I think I managed to implement this without really opening any cans of worms. The semantics seem pretty consistent with what we already have, this isn't breaking at all, and I don't think I'm adding an ill-conceived data model that we will regret later.
ibis.t() as a backport of python 3.14's t"hello {name}!" syntax
I vendored in the implementation from https://github.com/abilian/tstrings-backport, which has tests. I've been working on this and it seems worth depending on. I chose to vendor it in to avoid a pypi dependency. I didn't include tests, since the upstream package is tested. There are a few limitations, like it can't handle t("nested braces: {{1,2,3}}"), but it is otherwise pretty robust.
I added a few other features on top, such as PTemplate, to help with typing to represent either our implementation's Template, or the builtin string.templatelib.Template, or any other implementation that ducktypes as one.
I chose to expose this so that users could actually use it as ibis.t(). I think they will go and import it from our private modules anyway, so we might as well get the API right from the beginning and then actually be useful to people on python <3.14.
I publicly exported ibis.t at the top level, but the rest you have to go through a submodule, eg
I'm open to adjusting this though.
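To make the idea of a backported t() concrete, here is a minimal sketch of how such a function could be built with only the standard library. It is not the vendored tstrings-backport code and not this PR's implementation: the names Template, Interpolation, TemplateLike and t below are hypothetical stand-ins, and the caller-frame eval is just one possible approach.

```python
import string
import sys
from dataclasses import dataclass
from typing import Any, Protocol, Tuple, runtime_checkable


@dataclass(frozen=True)
class Interpolation:
    value: Any       # result of evaluating the expression
    expression: str  # source text found between the braces


@dataclass(frozen=True)
class Template:
    strings: Tuple[str, ...]                  # literal parts, len(interpolations) + 1
    interpolations: Tuple[Interpolation, ...]


@runtime_checkable
class TemplateLike(Protocol):
    """Hypothetical duck-typed protocol: anything exposing these two attributes."""

    strings: Tuple[str, ...]
    interpolations: Tuple[Any, ...]


def t(source: str) -> Template:
    """Sketch of a t-string backport: evaluate {expr} parts in the caller's scope.

    Assumption: expressions contain no '!' or ':' characters, because
    string.Formatter.parse treats those as conversion/format-spec markers.
    """
    caller = sys._getframe(1)
    strings, interpolations = [], []
    pending = ""
    for literal, field, _spec, _conversion in string.Formatter().parse(source):
        pending += literal
        if field is not None:
            value = eval(field, caller.f_globals, caller.f_locals)
            strings.append(pending)
            interpolations.append(Interpolation(value, field))
            pending = ""
    strings.append(pending)
    return Template(tuple(strings), tuple(interpolations))


name = "world"
greeting = t("hello {name}!")
print(greeting.strings)  # ('hello ', '!')
```

Keeping the consumer side typed against a protocol rather than a concrete class is what would let the same code accept either a backported template or the builtin one on python 3.14+.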
ibis.sql_value(<PTemplate>) as a way of creating an ir.Value or an ibis.Deferred from a Template-ish
The exact signature is
I considered allowing ibis.sql_value("{my_table.column} + 5"), but decided it would be better to be explicit that a template is getting created.
If any Interpolation is a Deferred, this returns a Deferred, because we can't infer a specific datatype without the concrete values.
You can also pass in a specific dialect. If you pass in one, that is used. If you don't pass in a dialect, then we look through all the Interpolation values and infer the dialect from them (if there are >1 backends, that's an error). If none of the Interpolations have any bound backends, then we fall back to the dialect of ibis.get_backend() at op creation time.
You don't need to pass in a datatype: if you skip it, we infer it by compiling the ibis expression to a sqlglot expression and then using sqlglot's datatype inference utils.
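The dialect precedence described above (explicit argument, then inference from the interpolated values, then the default backend) can be sketched in plain Python. This is only an illustration of the rule as stated in this description: resolve_dialect and its parameters are hypothetical names, and dialects are represented as plain strings rather than real backend objects.

```python
from typing import Iterable, Optional


def resolve_dialect(
    explicit: Optional[str],
    interpolation_dialects: Iterable[Optional[str]],
    default_dialect: str,
) -> str:
    """Pick a dialect: explicit > inferred from interpolations > default.

    `interpolation_dialects` holds one entry per interpolated value,
    with None for values that are not bound to any backend.
    """
    if explicit is not None:
        return explicit
    bound = {d for d in interpolation_dialects if d is not None}
    if len(bound) > 1:
        raise ValueError(f"interpolations span multiple backends: {sorted(bound)}")
    if bound:
        return bound.pop()
    # Nothing is bound to a backend, so fall back to the default.
    return default_dialect


print(resolve_dialect(None, ["duckdb", None], "sqlite"))  # duckdb
```

Raising on more than one bound backend, instead of silently picking one, matches the ">1 backends is an error" rule stated above.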
Sometimes this type inference doesn't work though, for example the original motivating example from the linked issue:
because sqlglot doesn't have complete coverage for introspecting the datatypes of this weird syntax.
To accommodate this, you can pass an explicit datatype, eg ibis.sql_value(template, type="timestamp"). Or, it is also possible to do ibis.sql_value(template).cast("timestamp"), to just cast away from the Unknown type.
The name is ibis.sql_value(). I went with this, instead of plain ibis.sql(), because I wanted it to be more explicit that this doesn't accept SELECT statements, and results in a single Value, not a relation. It also COULD be misleading because it can return a Deferred, which isn't technically a Value. So I'm open to other names.
Direct use of templates are inferred as values
You don't need to go through ibis.sql_value() in many cases! For example:
In general, you need to use ibis.sql_value if:
.upper().name("uppercased")
It is important to understand that this is locking us into SQL as a first class citizen here. By doing this, it prevents us from interpreting Templates with any other DSL, for example PRQL. If we didn't want to lock ourselves in, then we could remove this feature, and require the extra wrapping in ibis.sql_value, eg my_table.select(doubled=ibis.sql_value(ibis.t("{ibis._.my_col} * 2"))), because then that would allow for an eg ibis.prql_value() function. But, I think we are committed enough to SQL at this point that the beauty/simplicity of the reduced syntax is worth the lock in.
Relation-valued template
Currently this PR only supports single-Value SQL expressions, such as {table.column} + 5. But, it wouldn't be hard to extend this to accept relation-valued SQL expressions, such as SELECT {table.column} + 5 as a, {ibis._.b - 3} as b. This would probably be implemented re-using a lot of the same core types, but with a different op class, and a slightly different method for resolving deferreds. I would imagine an API something like
Note how this interpolation syntax allows us to avoid the messiness that is the current method of table.alias("a_name_i_hope_is_unique").sql("SELECT * from a_name_i_hope_is_unique")
Open question: type inference is a little inconsistent
sqlglot interprets any bare integers as int32. This is a little different from us, where we interpret eg ibis.literal(4) as int8. For example, this is a current test (slightly simplified)
Eg we need the explicit .cast()s in the expected = ...
I think this is acceptable, but I haven't thought super deeply about whether there is a better way. Explicitly setting the dtype is a nice escape hatch.
Open question: overly loose parsing
Currently, if you pass t"CAST({x} AS REAL)" (eg using sqlite syntax) to sql_value() with dialect="duckdb", sqlglot is generous and still parses it. But perhaps we should be more strict and error?