Crate-level API changes for tree-sitter-lint#63
Conversation
| ) | ||
| } | ||
|
|
||
| pub struct CaptureInfo<'node> { |
| let found_capture_index = found_capture_index as u32; | ||
| if found_capture_index != capture_index { | ||
| .filter_map(|(match_, index_into_query_match_captures)| { | ||
| let this_capture = &match_.captures[index_into_query_match_captures]; |
There was a problem hiding this comment.
These changes (up through let single_captured_node = ...) are "their own thing"
What I ran into was that my understanding of this (what was being called) found_capture_index usize being returned by the .captures() API was incorrect
Based on a little sanity-checking, it appears to be an index into the .captures on the other "match" (tree_sitter::QueryMatch) that gets returned
(not a "capture index" for the query as a whole) (maybe the usize -> u32 casting should've been a tip-off?)
But so this should get merged into master as its own little PR presumably (along with some test cases of queries with multiple "top-level patterns" (which I guess is the only time this issue was happening?))
| @@ -1,5 +1,5 @@ | |||
| [package] | |||
| name = "tree-sitter-grep" | |||
| name = "tree_sitter_lint_tree-sitter-grep" | |||
There was a problem hiding this comment.
Per helixbass/tree-sitter-lint#4, updating this to allow publishing that crate while this hasn't "landed"
| const ALL_NODES_QUERY: &str = "(_) @node"; | ||
|
|
||
| #[derive(Parser)] | ||
| #[derive(Clone, Parser)] |
There was a problem hiding this comment.
I don't know if the need to do args.clone() is a "smell" that the crate-level API entry points should be taking an &Args instead?
| }) | ||
| } | ||
|
|
||
| pub fn run_for_slice_with_callback( |
There was a problem hiding this comment.
This entry point is needed for the use case where tree-sitter-lint itself has an in-memory "slice" that it wants to lint (specifically it's using that for its rule-testing helpers)
| pub fn run_for_slice_with_callback( | ||
| slice: &[u8], | ||
| args: Args, | ||
| mut callback: impl FnMut(CaptureInfo) + Sync, |
There was a problem hiding this comment.
(Vs the things that get passed to the callback in the run_with_callback() entry point) having the callback get passed the slice and a path didn't seem to make sense for this use case so it just gets passed the "capture info"
| args: Args, | ||
| mut callback: impl FnMut(CaptureInfo) + Sync, | ||
| ) -> Result<RunStatus, Error> { | ||
| let language = args.language.ok_or(Error::LanguageMissingForSlice)?; |
There was a problem hiding this comment.
Basically Args has different "validation" for this entry point (specifically we need to know which language you want this slice to be treated as)
I started thinking that maybe the slice itself could/should be a field on Args and then that would be something that Args could "always validate" (ie that if its eg .slice_to_search field is present then its .language field must also be present)?
But that might be weird because then you'd have to sort of assert that "for this entry point we expect args.slice_to_search to be set" and vice-versa?
| mut callback: impl FnMut(CaptureInfo) + Sync, | ||
| ) -> Result<RunStatus, Error> { | ||
| let language = args.language.ok_or(Error::LanguageMissingForSlice)?; | ||
| let query_text = args.get_loaded_query_text()?; |
There was a problem hiding this comment.
Don't know if there are great ways to DRY this up more wrt other entry points, didn't worry about that too much at the moment
|
|
||
| let non_fatal_errors = non_fatal_errors.lock().unwrap().clone(); | ||
| if non_fatal_errors.is_empty() { | ||
| cached_queries.error_if_no_successful_query_parsing()?; |
There was a problem hiding this comment.
This may not make any sense/be redundant since above we already checked that "the query for the definitely-specified language is parseable"
| Ok(()) | ||
| } | ||
|
|
||
| pub fn search_slice_callback_no_path( |
There was a problem hiding this comment.
The "organization" of these searcher methods is probably weird but not like I 100% understand them in the first place
Here there was already a .search_slice_callback() method (used internally by the .search_path_callback() "entry point") but it expected a Path to be passed and per above comment its callback argument took the additional file contents/path arguments
So just made a separate .search_slice_callback_no_path() "entry point"
|
|
||
| thread_local! { | ||
| static SEARCHER: OnceCell<(Rc<RefCell<Searcher>>, *const Args)> = Default::default(); | ||
| static SEARCHER_PER_ARGS_INSTANCE: RefCell<HashMap<*const Args, Rc<RefCell<Searcher>>>> = Default::default(); |
There was a problem hiding this comment.
I'd forgotten that we have now invalidated this invariant (but glad I was asserting it!) (that there's only ever once instance of Args used during a program execution)
So just made this a per-Args-instance map instead (yes in theory then you should probably be more "legitimately cache-like" and eg evict once you get past a certain size but not worrying about that for now)
This same change should be made to the printer which is using the same pattern
| }) | ||
| } | ||
|
|
||
| pub fn run_with_per_file_callback( |
There was a problem hiding this comment.
I was running into wanting callbacks in tree-sitter-lint that are happening in the context of a single file (eg rule-listener callbacks) to be able to have mutable access to things (it seemed silly to wrap all of the "per-file-run rule state" in eg Mutex's when we know that once we're in the context of a single file, things are happening "serially")
So by adding an extra layer of "callback indirection" we can accomplish that and have the "per-match" callback be an FnMut
|
|
||
| pub fn run_with_per_file_callback( | ||
| args: Args, | ||
| per_file_callback: impl Fn(&DirEntry, Box<dyn FnMut(Box<dyn FnMut(CaptureInfo, &[u8], &Path) + '_>) + '_>) |
There was a problem hiding this comment.
I don't know if there's a way to avoid Box'ing these, not sure otherwise how you'd specify the types of these closures?
| } | ||
|
|
||
| pub fn run_for_slice_with_callback<'a>( | ||
| slice: impl Into<RopeOrSlice<'a>>, |
There was a problem hiding this comment.
So I think the idea here will be that the slice argument can be generic as something like impl TextProvider<'a> + Parseable (so not need to have any "specific" awareness of ropey at the tree_sitter_grep level)
But that is blocked by running into lifetime issues at the point of calling into tree_sitter (per tree-sitter/tree-sitter#2432)
(although as discussed maybe we could reasonably temporarily depend on a forked version of tree_sitter if there is in fact a way to tell all of the individual language-grammar crates "hey use this version of that dependency instead"?)
So for now made it take the "concrete" type of RopeOrSlice
(the point of this is to try and be more efficient where in tree-sitter-lint-lsp and maybe will also add to the tree-sitter-lint "fixing loop" we're storing file text in a ropey::Rope and so it's presumably expensive to convert that to "one big" &[u8] vs letting it be parsed/queried "in chunks")
|
|
||
| pub fn run_for_slice_with_callback<'a>( | ||
| slice: impl Into<RopeOrSlice<'a>>, | ||
| tree: Option<&Tree>, |
There was a problem hiding this comment.
Rather than eg a separate "entry point" for when you have the pre-parsed tree_sitter::Tree, just added it as an optional param to this existing one
| let query = &query_context.query; | ||
| let capture_index = query_context.capture_index; | ||
| let filter = &query_context.filter; | ||
| let tree: Cow<'_, Tree> = tree.map_or_else( |
There was a problem hiding this comment.
This still feels a bit "ceremonial" but I think this is actually a sensible "idea" for how to unify the types of "we got passed an Option<&T> (where we don't want to .clone() the inner &T) and we either unwrap that or have the ability to create a new (owned) T"
(I know that in the past I have done clunkier things in such situations)
| || { | ||
| Cow::Owned( | ||
| slice | ||
| .parse(&mut get_parser(query_context.language), None) |
There was a problem hiding this comment.
This is the Parseable trait that I added for abstracting away the ropey::Rope-specific-ness (as far as "do we parse this as one big string vs in chunks")
| parser.parse_with( | ||
| &mut |byte_offset, _| { | ||
| let (chunk, chunk_start_byte_index, _, _) = self.chunk_at_byte(byte_offset); | ||
| &chunk[byte_offset - chunk_start_byte_index..] |
There was a problem hiding this comment.
No idea if this is actually right or not
| } | ||
|
|
||
| impl<'a> TextProvider<'a> for RopeOrSlice<'a> { | ||
| type I = RopeOrSliceTextProviderIterator<'a>; |
There was a problem hiding this comment.
Kind of...unidiomatic that they named their associated type I?
| let rope_slice = rope.byte_slice(node.byte_range()); | ||
| RopeOrSliceTextProviderIterator::Rope(RopeOrSliceRopeTextProviderIterator::new( | ||
| rope_slice, | ||
| |rope_slice| rope_slice.chunks(), |
There was a problem hiding this comment.
Ok so here I convinced myself that I needed a self-referential data structure
Why? Because we have to create the rope_slice RopeSlice "locally" (because we have to derive it "dynamically" from the rope) and then really all we need to store in our Iterator type is its Chunks iterator, but there's no eg .into_chunks() (that would take self ownership of that locally-created value), just .chunks() which takes &self by reference (and its (ie the Chunks's) lifetime depends on that &self lifetime (because it's an iterator over references to pieces of that &self)), but that &self lifetime would go out of scope at the end of the enclosing text() method call because it's a local variable to that method call
So therefore the only way I could picture achieving it was to say "ok rope_slice you can tag along for the Iterator too even though we don't really "need" you, we just need to "keep you alive long enough""
Aka a self-referential struct because the Chunks's lifetime refers to that RopeSlice which at that point is another field in the same struct
So I tried out I think self_cell but it seemed a little clunkier so then tried ouroboros and that seemed smoother
Here this is how they have you initialize the dependent field
| fn text(&mut self, node: Node) -> Self::I { | ||
| match self { | ||
| Self::Slice(slice) => { | ||
| RopeOrSliceTextProviderIterator::Slice(iter::once(&slice[node.byte_range()])) |
There was a problem hiding this comment.
I actually wasn't clever enough to think to express this as an iter::once() myself, I initially wrote it with a manual "have we returned the one slice yet" state field but then I saw this is how they had it written somewhere (I guess in the baked-in impl TextProvider for &[u8]?)
| pub struct RopeOrSliceRopeTextProviderIterator<'a> { | ||
| rope_slice: RopeSlice<'a>, | ||
|
|
||
| #[borrows(rope_slice)] |
There was a problem hiding this comment.
...and this is how you declare the dependency between fields on the #[self_referencing] struct
| type Item = &'a str; | ||
|
|
||
| fn next(&mut self) -> Option<Self::Item> { | ||
| self.with_chunks_iterator_mut(|chunks_iterator| chunks_iterator.next()) |
There was a problem hiding this comment.
...and this is how you get a mutable reference to one of the (non-"dependee", I believe) fields on it
eb4abe0 to
3a039da
Compare
| tree-sitter-typescript = "0.20.2" | ||
|
|
||
| [patch.crates-io] | ||
| tree-sitter = { git = "https://github.com/helixbass/tree-sitter", rev = "57e98fb0" } |
There was a problem hiding this comment.
Read about how to do this "force dependencies to use this version"
The thing that I guess I wasn't anticipating is that I also had to do this in any "outer" crates (tree-sitter-lint and its project-local crates) even though they don't have their own explicit dependency on tree-sitter (they use tree-sitter-grep's re-export)?
| imports_granularity = "Crate" | ||
| wrap_comments = true | ||
| edition = "2021" | ||
| # wrap_comments = true |
There was a problem hiding this comment.
The thing of rustfmt often deciding to "wrap" code that I had commented out made me decide to turn this off
| let query_context = self.core.query_context(); | ||
| let query = query_context.query.clone(); | ||
| let filter = query_context.filter.clone(); | ||
| let mut matches = get_captures( |
There was a problem hiding this comment.
I did in fact DRY up the boilerplate of "running a tree-sitter query" and exposed those as get_captures()/get_captures_for_enclosing_node()
| fn sink<'tree>( | ||
| &mut self, | ||
| matches: &mut impl Iterator<Item = Node<'tree>>, | ||
| matches: &mut impl StreamingIterator<Item = CaptureInfo<'tree>>, |
There was a problem hiding this comment.
That DRY'ed up get_captures() returns a StreamingIterator which has nothing to do with async "streams", it's a crate for doing a "lending" iterator (ie where the items returned are dependent on the iterator's lifetime)
| pub fn search_slice_callback_no_path<'a, 'text, 'tree>( | ||
| &mut self, | ||
| query_context: QueryContext, | ||
| // slice: impl TextProvider<'a> + Parseable + 'a, |
There was a problem hiding this comment.
After having gotten get_captures() working with a RopeOrSlice argument I took a quick stab at trying to have it take the more generic impl TextProvider + Parseable but ran into some (more) lifetime stuff so left it more "concrete" for now at least
| } | ||
|
|
||
| #[self_referencing] | ||
| pub struct Captures<'a, 'text: 'a, 'tree: 'a> { |
There was a problem hiding this comment.
Ok so for the get_captures() iterator type again we're in self-referencing struct world
Why? Because the items returned by a tree_sitter::QueryCaptures are lifetime-dependent on things like the tree, text and query that passed to QueryCursor::captures() (and I think that implies that?)
And the query cursor itself wants to be created inside the shared abstraction but the captures iterator is also dependent on its lifetime I believe
So basically "toss all the stuff that the iterator depends on as fields and then use ouroboros to make the one iterator field self-referentially dependent on those"
But then it didn't want to implement Iterator, I think basically because items returned by the iterator would be lifetime-dependent on the struct (aka Iterator) itself and Iterator does not support that (your returned items cannot depend on the lifetime of the Iterator itself)
So then looked like streaming_iterator was a reasonable option for trying to implement a "lending" iterator on this struct
I was able to get that to basically work but see comments below re a little unsafe
| capture_index: u32, | ||
| #[borrows(text, mut query_cursor, query, tree)] | ||
| #[covariant] | ||
| captures_iterator: QueryCaptures<'this, 'this, 'this, RopeOrSlice<'this>>, |
There was a problem hiding this comment.
I initially tried to "go ahead" and call the .filter_map() (that in the previous versions of this pattern was always getting called to narrow the returned captures) on the QueryCursor::captures() iterator (converted to a StreamingIterator) but ran into "lifetime stuff" including some stuff around covariant/invariant "subtyping"
But so keeping this field "simpler" and then basically "manually" implementing StreamingIterator (including "inlining" the .filter_map() logic) made it more straightforward
| impl<'a, 'text, 'tree> StreamingIterator for Captures<'a, 'text, 'tree> { | ||
| type Item = CaptureInfo<'tree>; | ||
|
|
||
| fn advance(&mut self) { |
There was a problem hiding this comment.
StreamingIterator "decouples" "loading" the next item from "returning" it
|
|
||
| fn get<'this>(&'this self) -> Option<&'this Self::Item> { | ||
| let next_capture = self.borrow_next_capture(); | ||
| // SAFETY: I think this is ok as long as CaptureInfo isn't |
There was a problem hiding this comment.
Ok this is where I resorted to unsafe
I think I sort of understand why it was flagging this as an error without it
Basically I've declared the associated Item type here to be CaptureInfo<'tree> but it thinks that what I'd be trying to return is a CaptureInfo<'this> (ie its lifetime is dependent on the struct itself aka self-referential), not a CaptureInfo<'tree> (which would be allowed to outlive the struct)
And I think it might be right about that but not totally clear on that
But there's no way for me to specify 'this in the declaration of the Item associated type afaik
But it is by definition making the & in the return type here a "self-referential lifetime" so what I convinced myself of is that the only way you could "break out of that" and treat the CaptureInfo contents as though they're 'tree when in fact they're 'this is if you could copy/clone it, and CaptureInfo doesn't implement Copy/Clone
(I'm not even sure that would be a way to achieve anything bad - CaptureInfo just has a usize and a Node<'tree> (which is "actually" 'tree, I don't see how it couldn't be) on it so I don't think there are like any fields that eg when copied could incorrectly outlive the &'this reference - you can safely copy the Node<'tree> field)
| } | ||
|
|
||
| #[self_referencing] | ||
| pub struct CapturesForEnclosingNode<'a, 'text: 'a, 'tree: 'a> { |
There was a problem hiding this comment.
Same thing, just for the variation (as used by tree-sitter-lint) where you want to provide an explicit "enclosing node" for the query vs it defaulting to querying against the entire file contents associated with the Tree
| const ALL_NODES_QUERY: &str = "(_) @node"; | ||
|
|
||
| #[derive(Parser)] | ||
| #[derive(Builder, Clone, Default, Parser)] |
There was a problem hiding this comment.
Exposed an ArgsBuilder for tree-sitter-lint to use (vs building Args clunkily from passing "command-line args")
Presumably there should be some additional validation eg that some of the invariants that clap checks when parsing command-line args are not violated when using the builder, but I haven't done that yet
| query_text: Option<String>, | ||
|
|
||
| #[clap(skip)] | ||
| query_per_language: Option<HashMap<SupportedLanguage, Arc<Query>>>, |
There was a problem hiding this comment.
This is "tackling a couple birds with one stone" (?), first this is allowing passing in pre-parsed queries, second this is supporting "query-per-language" (vs "just one query")
Query-per-language allows tree-sitter-lint to let a single "run" of tree-sitter-grep "drive" everything, which seems to make sense (so that we're only doing one project-file-walking)
| pub(crate) fn get_project_file_walker_types(&self) -> Types { | ||
| get_project_file_walker_types(self.language) | ||
| get_project_file_walker_types(self.language.map(|language| vec![language]).or_else(|| { | ||
| self.query_per_language |
There was a problem hiding this comment.
This is basically asserting that if you pass query_per_language those are "all" of the languages that you're interested in
Some of this is probably a little bit conceptually unresolved but that seems more or less like the right idea
| pub(crate) fn get_loaded_query_text(&self) -> Result<String, Error> { | ||
| pub(crate) fn get_loaded_query_text_per_language( | ||
| &self, | ||
| ) -> Result<QueryOrQueryTextPerLanguage, Error> { |
There was a problem hiding this comment.
This may seem like a weird "half-step" (toward Args just going ahead and actually parsing the query text), but I think it makes sense
The reason is that we do want to go ahead and fail fast if eg we got passed a path to a query file that we can't read
But eg for the case of auto-language (and not having passed query_per_language) we want to let CachedQueries lazily resolve the query text to a parsed query
| } | ||
| } | ||
|
|
||
| #[derive(Default)] |
There was a problem hiding this comment.
I got to track down the fact that sometimes I wasn't seeing a newly added lint-rule violation get printed out
The reason ended up being that I forgot that in adding query_per_language (to Args) then the capture index could no longer be a "singleton" (because now each language's query could have a different "target" capture index)
So got rid of this data structure for it entirely and moved the per-language-query capture index storage/calculation into CachedQueries
| "Should've tried to parse in at least one language or else should've already failed on no candidate files" | ||
| ); | ||
| return Err(Error::NoSuccessfulQueryParsing(attempted_parsings)); | ||
| if let Some((_, capture_index_error)) = |
There was a problem hiding this comment.
I think the storage of "either a query-parsing error or a (subsequent) capture-index-resolution error" in CachedQueries makes sense, but here this gets a little clunky probably as far as deciding how to "resolve"/"surface" those
(eg should it be "fatal" if for one of the per-language queries there was a capture index error (eg if they passed a capture-name via Args and it wasn't valid for all of them? Speaking of which, probably query_per_language should be modeled as taking a capture name per language too, it just happens that for the tree-sitter-lint use case there's one capture name for all of the different language queries so it works to pass the single capture-name via Args)?)
| fn main() { | ||
| let args = Args::parse_from(["tree_sitter_grep", "-q", "(function_item) @f"]); | ||
| run_with_callback(args, |node, file_contents, path| { | ||
| run_with_callback(args, |query_match, file_contents, path| { |
There was a problem hiding this comment.
As of now I updated all of the crate-level "callback"-oriented entry points to use the match-oriented (vs single-capture-oriented) querying/callback style
Not sure where it should actually "land", for the moment mostly just driving based on the tree-sitter-lint use case
| &file_contents[query_match | ||
| .nodes_for_capture_index(0) | ||
| .next() | ||
| .unwrap() |
There was a problem hiding this comment.
While this is a little boilerplate-y I'm guessing the argument for in fact only exposing match-oriented crate-level callback API's is that I believe you can then basically "wrap" those (with a little match -> individual query captures boilerplate) to achieve the same idea/behavior as a single-capture-oriented style
| self.check_config()?; | ||
|
|
||
| log::trace!("slice reader: searching via multiline strategy"); | ||
| get_matches(query_context.language, slice, &query_context.query, tree).for_each( |
There was a problem hiding this comment.
I added a get_matches() corresponding to the get_captures() "iterate over query captures" abstraction
| let next_match = self.borrow_next_match(); | ||
| // SAFETY: Not as sure on this one? | ||
| let next_match: &'this Option<QueryMatch<'a, 'tree>> = | ||
| unsafe { mem::transmute(next_match) }; |
There was a problem hiding this comment.
Same "technique" here (as with Captures above) for avoiding lifetime errors
No idea but I'd be surprised if this isn't "sound"
| static SUPPORTED_LANGUAGE_COMMENT_KINDS: Lazy<BySupportedLanguage<HashSet<&'static str>>> = | ||
| Lazy::new(|| { | ||
| by_supported_language!( | ||
| Rust => ["line_comment", "block_comment"].into(), |
There was a problem hiding this comment.
Some of the ESLint helper methods are "comment-aware" and so "baking in" some comment-awareness at the tree-sitter-lint level seems to make sense (and it seemed most appropriate to push that down to the tree-sitter-grep level where the SupportedLanguage definitions live)
I have no idea if these are right but can go through and add the correct "kinds" here for the different languages at some point
| }) | ||
| } | ||
|
|
||
| pub fn run_with_single_per_file_callback( |
There was a problem hiding this comment.
This is the version that isn't doing any actual querying on the tree-sitter-grep side (this was related to getting there to not be lifetime issues with storing eg tree_sitter::Node's in a tree-sitter-lint rule's "local state")
It just basically drives running a single callback for each found project file that exposes things like the detected language, file contents, tree-sitter tree and parsed query
| self.search_file_maybe_path(query_context, Some(path), &file, write_to) | ||
| } | ||
|
|
||
| pub fn load_file_contents<P, TError: SinkError>( |
There was a problem hiding this comment.
This is sort of "extracting" the existing logic related to loading file contents from disk (but not doing any querying)
…ree-sitter-grep into run-context-tree-sitter-lint
In this PR:
run-contextthat accommodate thetree-sitter-lintuse case(since the base branch is not "ready to merge", transitively neither is this really, though I suppose it could get merged down into the base branch?)
To test:
Per the updated example, the crate-level API now gives you visibility of the query "pattern index" that matched in addition to the
tree_sitter::Nodethat matchedBased on
run-context