Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 9 additions & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -6,3 +6,12 @@ edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
clap = "4.3.17"
derive_builder = "0.12.0"
regex = "1.9.1"
tree-sitter = "0.20.10"
tree-sitter-grep = { git = "https://github.com/helixbass/tree-sitter-grep", rev = "3d4682c" }
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dependency here is on the proof-of-concept run-context branch plus a couple of tweaks

tree-sitter-rust = "0.20.3"
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it might be a little silly to have this crate depend on all of these also (along with tree-sitter-grep depending on them)?

Not sure what that would entail to eg "delegate" that dependency, maybe some exposed API from tree-sitter-grep to get the tree_sitter::Language for a given SupportedLanguage or something like that?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, or just a straight re-export of tree-sitter-rust from somewhere in tree-sitter-grep. (Maybe feature flagged.)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya actually I ended up doing some crate re-exporting in a recent PR (for rule-author "consumers" I guess)

Here what seems to have been the pattern is that I exposed SupportedLanguage on tree_sitter_grep and tree-sitter-lint is just re-using that as its own "language" type, and a SupportedLanguage can give you its corresponding tree_sitter::Language


[[bin]]
name = "tree-sitter-lint"
4 changes: 4 additions & 0 deletions src/args.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
use clap::Parser;

#[derive(Parser)]
pub struct Args {}
7 changes: 7 additions & 0 deletions src/bin/tree-sitter-lint.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
use clap::Parser;
use tree_sitter_lint::{run, Args};

fn main() {
let args = Args::parse();
run(args);
}
57 changes: 57 additions & 0 deletions src/context.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
use std::{
path::Path,
sync::atomic::{AtomicBool, Ordering},
};

use tree_sitter::Language;

use crate::{rule::ResolvedRule, violation::Violation};

pub struct Context {
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was somewhat mimicking ESLint's rule-authoring API

But the concept of "context" being used here is a little fuzzy at the moment

In ESLint's API, they are calling the rule's create() callback once for each file that the rule is being run against, so it supplies a context that is specific to that file

(which I guess would enable eg tailoring that rule/its internal state to the specific file that it's being run against?)

The big things that seems questionable about mimicking that here is that we need to parse tree-sitter queries for the rule "listeners", and so doing that parsing n times (where n = the # of files we're checking) seems like not the way to do things

It's possible that that query-parsing could be eg memoized (for the common case that for each file you're registering the same exact set of tree-sitter queries for the listeners) if we did in fact want to have something like the ESLint API's "scope" (where you can basically configure the rule once per file)?

But no idea what actually makes sense

So for the moment there's this "global to the linter run"-level Context, and then a QueryMatchContext that gets supplied to your listener callback

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For linting more-so than tree-sitter-grep, it seems very reasonable for the rules to have to say which language(s?) they operate on, so the queries could be parsed up front once, maybe?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup I haven't really "wired up" any "rule-driven language" stuff yet but that definitely seems right to me

pub language: Language,
}

impl Context {
pub fn new(language: Language) -> Self {
Self { language }
}
}

pub struct QueryMatchContext<'a> {
pub path: &'a Path,
pub file_contents: &'a [u8],
pub rule: &'a ResolvedRule<'a>,
reported_any_violations: &'a AtomicBool,
}

impl<'a> QueryMatchContext<'a> {
pub fn new(
path: &'a Path,
file_contents: &'a [u8],
rule: &'a ResolvedRule,
reported_any_violations: &'a AtomicBool,
) -> Self {
Self {
path,
file_contents,
rule,
reported_any_violations,
}
}

pub fn report(&self, violation: Violation) {
self.reported_any_violations.store(true, Ordering::Relaxed);
print_violation(&violation, self);
}
}

fn print_violation(violation: &Violation, query_match_context: &QueryMatchContext) {
println!(
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is presumably not as good as using something ripgrep-Printer-ish (eg might suffer from interleaving)?

"{:?}:{}:{} {} {}",
query_match_context.path,
violation.node.range().start_point.row + 1,
violation.node.range().start_point.column + 1,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth eventually making some PositionForPrinting struct that can take the 0-indexed row and column, and implements Display in some reasonable way? (Maybe not if it would need to support multiple output styles, like tree-sitter-grep does.)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're saying leverage Display for "how to formatted-output a single violation"?

Right that makes sense that "if there's one way to do it" that would be a nice thing to reach for

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

violation.message,
query_match_context.rule.name,
);
}
184 changes: 175 additions & 9 deletions src/lib.rs
Original file line number Diff line number Diff line change
@@ -1,14 +1,180 @@
pub fn add(left: usize, right: usize) -> usize {
left + right
mod args;
mod context;
mod rule;
mod violation;

use std::{
borrow::Cow,
process,
sync::atomic::{AtomicBool, Ordering},
};

pub use args::Args;
use clap::Parser;
use context::QueryMatchContext;
use rule::{ResolvedRule, Rule, RuleBuilder, RuleListenerBuilder};
use tree_sitter::Query;
use violation::ViolationBuilder;

use crate::context::Context;

#[macro_export]
macro_rules! regex {
($re:expr $(,)?) => {{
static RE: std::sync::OnceLock<regex::Regex> = std::sync::OnceLock::new();
RE.get_or_init(|| regex::Regex::new($re).unwrap())
}};
}

const CAPTURE_NAME_FOR_TREE_SITTER_GREP: &str = "_tree_sitter_lint_capture";
const CAPTURE_NAME_FOR_TREE_SITTER_GREP_WITH_LEADING_AT: &str = "@_tree_sitter_lint_capture";

pub fn run(_args: Args) {
let language = tree_sitter_rust::language();
let context = Context::new(language);
let resolved_rules = get_rules()
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rules/listeners get "resolved" by eg calling the create() callback and then actually parsing the tree-sitter queries etc

.into_iter()
.map(|rule| rule.resolve(&context))
.collect::<Vec<_>>();
let aggregated_queries = AggregatedQueries::new(&resolved_rules, &context);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool that this is possible.

let tree_sitter_grep_args = tree_sitter_grep::Args::parse_from([
"tree_sitter_grep",
"-q",
&aggregated_queries.query_text,
"-l",
"rust",
"--capture",
CAPTURE_NAME_FOR_TREE_SITTER_GREP,
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd thought we would need to make tree-sitter-grep be able to handle the concept of "multiple queries/captures" but nope tree-sitter queries by their nature "multiplex"/"de-multiplex" (?) (just by concatenating several top-level "patterns")

So just had to make the one change to the crate-level API to additionally expose the "pattern index" of the matching capture (which is what lets us do the "de-multiplexing")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, fun that this silly command-line-style API is getting some use already.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya this is definitely a good "driver" for kicking the tires on the tree_sitter_grep crate-level API

]);
let reported_any_violations = AtomicBool::new(false);
tree_sitter_grep::run_with_callback(
tree_sitter_grep_args,
|capture_info, file_contents, path| {
let (rule_index, rule_listener_index) =
aggregated_queries.pattern_index_lookup[capture_info.pattern_index];
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we de-multiplex to resolve the tree-sitter-grep-reported match to one of our rule listeners

This general approach (of just running a single aggregated tree-sitter query (of all of the individual listener queries) once and letting tree-sitter-grep do its normal "single-capture-oriented" matching) relies on the underlying tree-sitter API reporting the same tree-sitter AST node multiple times if it matched multiple of the top-level patterns (because multiple listeners might be "interested" in the same tree-sitter AST node)

And it does appear to behave that way (I did a little separate sanity-check of that)

At that point this seems pretty efficient, we just crunch through the tree-sitter AST once per file

let rule = &resolved_rules[rule_index];
let listener = &rule.listeners[rule_listener_index];
(listener.on_query_match)(
&capture_info.node,
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks silly now that I saw that tree_sitter::Node is Copy

&QueryMatchContext::new(path, file_contents, rule, &reported_any_violations),
);
},
)
.unwrap();
if reported_any_violations.load(Ordering::Relaxed) {
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory this should be reporting success or failure exit status code depending on whether it found any violations (which is what I assume you want for eg CI?) but I didn't actually check that it was working

process::exit(1);
} else {
process::exit(0);
}
}

#[cfg(test)]
mod tests {
use super::*;
type RuleIndex = usize;
type RuleListenerIndex = usize;

#[test]
fn it_works() {
let result = add(2, 2);
assert_eq!(result, 4);
struct AggregatedQueries {
pattern_index_lookup: Vec<(RuleIndex, RuleListenerIndex)>,
#[allow(dead_code)]
query: Query,
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we at least want to parse this query "on our end" in order to eg be able to do the sanity-check assertion below

But then it seems like tree-sitter-grep's crate-level API could expose a version of Args or whatever that lets you pass in a pre-parsed tree_sitter::Query (vs query text/file path) maybe?

query_text: String,
}

impl AggregatedQueries {
pub fn new(resolved_rules: &[ResolvedRule], context: &Context) -> Self {
let mut pattern_index_lookup: Vec<(RuleIndex, RuleListenerIndex)> = Default::default();
let mut aggregated_query_text = String::new();
for (rule_index, resolved_rule) in resolved_rules.into_iter().enumerate() {
for (rule_listener_index, rule_listener) in resolved_rule.listeners.iter().enumerate() {
for _ in 0..rule_listener.query.pattern_count() {
pattern_index_lookup.push((rule_index, rule_listener_index));
}
let use_capture_name =
&rule_listener.query.capture_names()[rule_listener.capture_index as usize];
let query_text_with_unified_capture_name =
regex!(&format!(r#"@{use_capture_name}\b"#)).replace_all(
&rule_listener.query_text,
CAPTURE_NAME_FOR_TREE_SITTER_GREP_WITH_LEADING_AT,
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing regex-based replacing of the "target capture name" so that the "one big query" has a single capture name that we're interested in

);
assert!(
matches!(query_text_with_unified_capture_name, Cow::Owned(_),),
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is relying on the presumed invariant that regex.replace_all() will return an "owned" (ie String) Cow if and only if it actually did some replacing

I suppose could also just string-compare that the replaced version is different than the original?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose could also just string-compare that the replaced version is different than the original?

That would be much more clear to read, I think.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But more expensive? How expensive is string-equality comparison?

(As always might be being "naive about optimization guy" but) if this version is "reliable" and more performant then I'd be inclined to solve the readability issue by just eg hiding it behind a well-named helper (eg were_any_captures_replaced())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, yeah, that seems fine, and maybe worth a comment about why the Cow match works.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok added were_any_captures_replaced() + comment on the current branch

"Didn't find any instances of the capture name to replace"
);
aggregated_query_text.push_str(&query_text_with_unified_capture_name);
aggregated_query_text.push_str("\n\n");
}
}
let query = Query::new(context.language, &aggregated_query_text).unwrap();
assert!(query.pattern_count() == pattern_index_lookup.len());
Self {
pattern_index_lookup,
query,
query_text: aggregated_query_text,
}
}
}

fn get_rules() -> Vec<Rule> {
vec![no_default_default_rule(), no_lazy_static_rule()]
}

fn no_default_default_rule() -> Rule {
RuleBuilder::default()
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The builder pattern seemed appropriate for these but I could also see adding an (optional) layer on top of that for people who wanted a slightly "slimmer" syntax eg:

rule! {
    name => no_default_default,
    create => |context| {
        vec![
            rule_listener! {
                query => r#"..."#,
                capture_name => "c",
                on_query_match => |node, query_match_context| {
                    ...
                }
            },
            ...
        ]
    }
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main issue with macros like that is that you have to (I think in this case) manage indentation/whitespace yourself.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This theme keeps coming up of tooling "pulling up short" of really trying to parse the contents of macro invocations and consumers suffering for it (the thing I had to do for tree-sitter-grep of forking the Rust tree-sitter grammar to try and parse "typical" macro-invocation arguments; a similar thing came up here with doing syn-based traversal of parsed AST's where it "wasn't working as expected" without some extra "convincing" to traverse into macro contents; and like you're saying rustfmt not formatting macro invocation contents that it kind of seems like it should be able to format)

Ya I'm sort of into the more "magical" macro syntax for this use case of trying to make rule authorship feel "smooth" (and ESLint-like) but I'm assuming it won't be everyone's style

In which case (in terms of how the rule-authoring API's are currently organized) (a) you can always write the more boilerplate-y but "just Rust" version and/or (b) there's really no reason there couldn't be alternate macro syntaxes (whether "official" or not) (since it's just a way of generating "publicly-writable Rust code") that are maybe a bit less "magical"?

.name("no_default_default")
.create(|_context| {
vec![RuleListenerBuilder::default()
.query(
r#"(
(call_expression
function:
(scoped_identifier
path:
(identifier) @first (#eq? @first "Default")
name:
(identifier) @second (#eq? @second "default")
)
) @c
)"#,
)
.capture_name("c")
.on_query_match(|node, query_match_context| {
query_match_context.report(
ViolationBuilder::default()
.message(r#"Use '_d()' instead of 'Default::default()'"#)
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are actual rules I thought of for the Typescript compiler in Rust code-base

Specifically here I've been playing with using a "shorthand" _d() helper vs the more verbose Default::default()

Neither of them are "flexing" the tree-sitter side of it very hard but whatever

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These read pretty nicely

.node(node)
.build()
.unwrap(),
);
})
.build()
.unwrap()]
})
.build()
.unwrap()
}

fn no_lazy_static_rule() -> Rule {
RuleBuilder::default()
.name("no_lazy_static")
.create(|_context| {
vec![RuleListenerBuilder::default()
.query(
r#"(
(macro_invocation
macro: (identifier) @c (#eq? @c "lazy_static")
)
)"#,
)
.on_query_match(|node, query_match_context| {
query_match_context.report(
ViolationBuilder::default()
.message(r#"Prefer 'OnceCell::*::Lazy' to 'lazy_static!()'"#)
.node(node)
.build()
.unwrap(),
);
})
.build()
.unwrap()]
})
.build()
.unwrap()
}
100 changes: 100 additions & 0 deletions src/rule.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
use std::sync::Arc;

use derive_builder::Builder;
use tree_sitter::{Node, Query};

use crate::context::{Context, QueryMatchContext};

#[derive(Builder)]
#[builder(setter(into))]
pub struct Rule {
pub name: String,
#[builder(setter(custom))]
pub create: Arc<dyn Fn(&Context) -> Vec<RuleListener>>,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, not sure I realized you could directly contain a dyn trait in an Arc, but that makes sense, since it's essentially a fancy Box?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup and Box<dyn Trait>/Rc<dyn Trait>/Arc<dyn Trait> are all "special" as far as being able to be the "self receiver type", I take advantage of that in a subsequent PR

}

impl Rule {
pub fn resolve(self, context: &Context) -> ResolvedRule<'_> {
let Rule { name, create } = self;

ResolvedRule::new(
name,
create(context)
.into_iter()
.map(|rule_listener| rule_listener.resolve(context))
.collect(),
)
}
}

impl RuleBuilder {
pub fn create(
&mut self,
callback: impl Fn(&Context) -> Vec<RuleListener> + 'static,
) -> &mut Self {
self.create = Some(Arc::new(callback));
self
}
}

pub struct ResolvedRule<'context> {
pub name: String,
pub listeners: Vec<ResolvedRuleListener<'context>>,
}

impl<'context> ResolvedRule<'context> {
pub fn new(name: String, listeners: Vec<ResolvedRuleListener<'context>>) -> Self {
Self { name, listeners }
}
}

#[derive(Builder)]
#[builder(setter(into, strip_option))]
pub struct RuleListener<'on_query_match> {
pub query: String,
#[builder(default)]
pub capture_name: Option<String>,
#[builder(setter(custom))]
pub on_query_match: Arc<dyn Fn(&Node, &QueryMatchContext) + 'on_query_match + Send + Sync>,
}

impl<'on_query_match> RuleListener<'on_query_match> {
pub fn resolve(self, context: &Context) -> ResolvedRuleListener<'on_query_match> {
let RuleListener {
query: query_text,
capture_name,
on_query_match,
} = self;
let query = Query::new(context.language, &query_text).unwrap();
let capture_index = match capture_name {
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sort of mimics the capture-index-resolution logic from tree-sitter-grep

None => match query.capture_names().len() {
0 => panic!("Expected capture"),
_ => 0,
},
Some(capture_name) => query.capture_index_for_name(&capture_name).unwrap(),
};
ResolvedRuleListener {
query,
query_text,
capture_index,
on_query_match,
}
}
}

impl<'on_query_match> RuleListenerBuilder<'on_query_match> {
pub fn on_query_match(
&mut self,
callback: impl Fn(&Node, &QueryMatchContext) + 'on_query_match + Send + Sync,
) -> &mut Self {
self.on_query_match = Some(Arc::new(callback));
self
}
}

pub struct ResolvedRuleListener<'on_query_match> {
pub query: Query,
pub query_text: String,
pub capture_index: u32,
pub on_query_match: Arc<dyn Fn(&Node, &QueryMatchContext) + 'on_query_match + Send + Sync>,
}
9 changes: 9 additions & 0 deletions src/violation.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
use derive_builder::Builder;
use tree_sitter::Node;

#[derive(Builder)]
#[builder(setter(into))]
pub struct Violation<'a> {
pub message: String,
pub node: &'a Node<'a>,
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right per other comment if tree_sitter::Node is Copy then it seems un-idiomatic to store a reference to it

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and looks like it'd save you a little lifetime pain to not do it? (How is Node Copy, though? I'd have thought it held references to other things, which would require cloning.)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's either a type alias of or a newtype wrapper around a raw pointer to the underlying tree-sitter C API Node type (and pointers are Copy)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, and that works because the Node has an associated lifetime.

}