Document and test use of char::is_whitespace in collapsible_if#16840
Document and test use of char::is_whitespace in collapsible_if#16840iAmChidiebereh wants to merge 1 commit intorust-lang:masterfrom
Conversation
|
r? @dswij rustbot has assigned @dswij. Use Why was this reviewer chosen?The reviewer was selected based on:
|
There was a problem hiding this comment.
Can you please make the PR description shorter?
Just 3-4 dot points saying what the tests and code does it great.
If you need help summarising it, please let me know.
Edit: you can just link to the Outreachy ticket, you don't need to explain that ticket in detail.
I also want to check you're looking at the right is_whitespace method?
https://doc.rust-lang.org/std/primitive.char.html#method.is_whitespace
The snippet function returns a str, so we are working with Unicode chars here:
https://github.com/rust-lang/rust/blob/ad4b9354009cb6bd5a9ff1b5f5a63a13ec98ebc9/src/tools/clippy/clippy_utils/src/source.rs#L535
| #[rustfmt::skip] | ||
| fn spacing_zero_width_ws() { | ||
| // This test shows that the Char::is_whitespace gives a more desirable result than rust_lexer::is_whitespace for require spaacing logic | ||
| // We test out 2 zero-wisth characters recognized as whitespaces by the lexer but not by Char::is_whitespace |
There was a problem hiding this comment.
We can see we have an undesirable result collapse suggestion using elseif instead of else if
It looks like you're testing whitespace characters that are included in Rust language whitespace (Unicode Pattern_White_Space or rust_lexer::is_whitespace). This won't find whitespace bugs, but it is still a good test to have.
Please add another test with a whitespace character that is Unicode White_Space, but not Pattern_White_Space / rust_lexer::is_whitespace. Here are those lists:
- Pattern_White_Space: https://doc.rust-lang.org/reference/whitespace.html
- White_Space: https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
For example, some characters to test are:
- 00A0 ; White_Space # Zs NO-BREAK SPACE
- 202F ; White_Space # Zs NARROW NO-BREAK SPACE
There was a problem hiding this comment.
It looks like you're testing whitespace characters that are included in Rust language whitespace (Unicode Pattern_White_Space or rust_lexer::is_whitespace). This won't find whitespace bugs, but it is still a good test to have
I found out that the lexer panics on non-Pattern_White_Space characters, meaning that adding a test sample involving if else blocks that will generate snippets ending with whitespace characters in White_Space, but not in Pattern_White_Space is not possible.. We can only test within the Pattern_White_Space set. Given this, rustc_lexer::is_whitespace seemed logical to use. However, it counts the zero-width characters \u{200E} and \u{200F} as whitespace (and rightfully so), which causes Clippy to output what visibly looks like elseif as suggestion to users. While syntactically valid to the lexer, this invisible spacing is confusing to users. On the other hand, char::is_whitespace doesn't classify those two as whitespace, forcing the insertion of a visible space (else if) for readability. Therefore, this test only acts as a safeguard to prevent future developers from mistakenly "fixing" the check by switching to the lexer's method, which would degrade the visual output...
I could still use rustc_lexer::is_whitespace but I will also have to add a check that will let us treat \u{200E} and \u{200F} as non-whitespace characters for this particular problem. That will give us the same desirable behaviour. This is actually more intuitive as anyone reading the code will know we just treating them differently but they are still whitespaces.
Please add another test with a whitespace character that is Unicode White_Space, but not Pattern_White_Space / rust_lexer::is_whitespace. Here are those lists:
The image here was the 'unknown start of token' error when I tried testing fo a snippet that ends with 202F. It is the same for others.
Though it seems like I am missing out out on some detail and also, my assumption that we wouldn't want to print a suggestion that appears to look like 'elseif' even when there is actually a whitespace in between might very well be wrong.
I think you are trying to draw my attention to something that I am currently not figuring out. Please, I would appreciate a bit more of a pointer. Thank you.
There was a problem hiding this comment.
I found out that the lexer panics on non-Pattern_White_Space characters,
If adding a non-Pattern_White_Space character to a Rust source file causes a panic, then this exactly the kind of bug in clippy (or rustc) that we're trying to find.
meaning that adding a test sample involving if else blocks that will generate snippets ending with whitespace characters in White_Space, but not in Pattern_White_Space is not possible.
You can add a test and mark it with should_panic in one commit, then fix the bug in clippy/rustc, and remove should_panic from the test:
https://doc.rust-lang.org/reference/attributes/testing.html#the-should_panic-attribute
If it is a UI test, you can use //@ should-ice instead:
https://rustc-dev-guide.rust-lang.org/tests/directives.html#controlling-outcome-expectations
If using those directives isn't possible, it's ok to just make one final commit with the test.
There was a problem hiding this comment.
Okay.. Thank you
It is a ui test, so i tried out the should-ice directive and got "should-ice is not a command known to ui_test".
In trying to research on what could be done, I also realized @should-ice is not a supported directive in ui_test: https://docs.rs/ui_test/0.30.4/ui_test/#supported-comment-annotations
If using those directives isn't possible, it's ok to just make one final commit with the test.
Alright...I will just push the panicking test as it is. Thank you!
I have shortened it. Please help confirm it is okay.
Yes, this was what I was looking at
Yes, I get this part... but my investigation showed that the lexer fails to scan the file before the lint even has a chance to run... so we don't even get to the part where we can extract the snippet to test on (right?). |
|
Hi @teor2345 |
|
Hi @teor2345 |
|
Also, I have been thinking... We are currently not testing any whitespace specific behaviour. We are just suppressing panic. |
There was a problem hiding this comment.
one of the tests … is failing due to our generated .stderr file not meeting established convention for lint messages: rust-lang/rust-clippy/actions/runs/24382289363/job/71208280778?pr=16840
this is because a .stderr file now contains [lexer] message. The clippy team probably ensured all lint messages are lowercase
This is a question for a clippy maintainer. They might decide to exempt the test, move it to another test category, or silence the whitespace errors.
But the test is important, because it shows that clippy doesn't crash, or break valid code.
(i've tried to just quote the essential parts of your question, to make it easier to read.)
We are currently not testing any whitespace specific behaviour. We are just suppressing panic. And I am trying to wrap my head around how this will play out because I currently don't see a way we can fix the main code (probably by switching to
rustc_lexer::is_whitespace), take lines like//~ ERROR: unknown start of token: \u{202f}off and bring back//~ collapsible_else_ifto finally have things running without suppressing the error. I think the error is a constant.
I don't see any clippy panics here. I see errors, but those errors are a correct response to invalid Rust source code.
Let's take a step back, and think about the purpose of the lint, this test, and any code changes:
- the lint finds
else { ifthat can be collapsed toelse if, and suggests a fix if possible - for whitespace accepted by the Rust lexer, if the suggestion/fix changes that whitespace to other whitespace, the meaning of the program doesn't change (
rustfmtalso makes similar changes) - for whitespace not accepted by the Rust lexer, the Rust code has an error that will show up in the compiler. So it's ok for clippy to show that error, fix or not fix it, or ignore it. Clippy just can't panic or crash.
One test shows that clippy fixes code accepted by the lexer, and doesn't break the code when there's unusual white space. Another test shows that clippy correctly prints lexer errors when the code isn't valid Rust code.
So the only improvement we could make here is fixing non-lexer whitespace in the suggestion. But that is optional, and might not be possible, because the lint comes after the lexer.
Oh... Okay.. I will just wait for a maintainer to come around when they are free so that they can address this. Is it okay to tag someone? I'm cautious of appearing like I am pressing them for response despite their schedule... but I honestly wish I could tag them though.
Oh... I keep mixing things up... Pardon my use of the word 'panic'... They are errors as you mentioned. I meant 'suppressing test failure due to errors'.
Thank you so much! I fully get it now. Thanks for the explanation...
This cleared things up honestly because I was confused on how we were going to fix this when the lint actually comes after the lexer. I have been researching on how I could go about fixing this, but then, like you said, it might not be possible. I think I found an issue in the course of investigating this problem and that was like few days ago. Will tag you to it |
|
the test |
Can you point out where it is agreed/explicitly mentioned that it is intentional? |
Alright thank you! |
Looking back now, I think my choice of words didn't really tell what I meant. I am sorry about that. There was no place where it was explicitly mentioned that it was intentional. What I meant to say was "My investigation revealed that the current char::is_whitespace check 'might' actually be intentional". The statement came from my investigation after testing out all lexer-recognized whitespaces... Using char::is_whitespace for the whitespace check, every whitespace recognized by the lexer is also classified as whitspace, except for 2: \u{200E} and \u{200F}. Co-incidentally, these 2 characters turns out to be zero-width. Using char::is_whitespace, thereby, ensures a space is added betweeen else and if when a snippet ends with any of those 2 characters... so we get to see the desirable "else if" suggestion.... If we had used the rustc_lexer::is_whitespace function, those 2 will be classified as whitespaces (and rightfully so), therefore, no space will be added in between, making us see suggestions like "elseif" (no space in between on print, even though there is an invisible whitespace). I don't think that is desirable... So it felt deliberate and intentional from the person that implemented this to use a function that classified zero-width whitespce characters as non-whitespace to enable addition of a space between else and if for them. So it was just an assumption from my end backed by the thought I presented here. |
If you are feeling time pressure because final Outreachy applications are due in a few hours' time, you don't need to worry. You can put links to incomplete PRs in your final application. What happens to your PR is up to the maintainers of the tool you're modifying or testing. Sometimes PRs (or parts of PRs) don't get accepted for good reasons, and that's ok. Sometimes there are delays because reviewers are busy, and that's also ok. |
Thank you so much! |
|
@iAmChidiebereh can you please help to squash the commits? |
9149b3e to
575c612
Compare
|
Sure... I just did @dswij |
…_else_if lint docs: document intentional char::is_whitespace usage chore: make test function name clearer test: add panicking test for snipets ending with non-Pattern_White_Space whitespaces fix: test panic chore: fix typos Co-authored-by: teor <teor@riseup.net> docs: add comment to non-pattern_white_space test chore: take out redundant tests docs: fix wrong func name
|
The merge failed.. from the logs, it failed to access 'https://github.com/rust-lang/cargo/'... maybe github server was bad as it returned 500 |
575c612 to
777bdce
Compare
| #[rustfmt::skip] | ||
| fn ends_with_zero_width_whitespace() { | ||
| // Test out snippets ending with the 2 zero-width characters recognized as whitespaces by the lexer, | ||
| // but not by char::is_whitespace |
There was a problem hiding this comment.
Also, my eyes caught a mistake here.
I made an edit on this line (line 188).
We had u8::is_ascii_whitespace instead of the correct char::is_whitespace, hence the update
There was a problem hiding this comment.
I don't think your comment change is correct, please check the documentation for the lexer:
https://doc.rust-lang.org/reference/whitespace.html
And for char::is_whitespace (which uses Unicode):
https://doc.rust-lang.org/std/primitive.char.html#method.is_whitespace
If you mean ascii::Char::is_whitespace, then that method is unstable:
https://doc.rust-lang.org/std/ascii/enum.Char.html#method.is_whitespace
So it is better to refer to the underlying method, which is u8::is_ascii_whitespace:
https://doc.rust-lang.org/src/core/ascii/ascii_char.rs.html#1078
There was a problem hiding this comment.
Oh okay.. Thank you so much
I will go through the links and get back to you 🙏
There was a problem hiding this comment.
I have gone through the links.
And for char::is_whitespace (which uses Unicode):
https://doc.rust-lang.org/std/primitive.char.html#method.is_whitespace
This is what the code for checking snippet-ending characters in the collapsible_else_if lint is using currently, hence my documenting it. Clicking on the is_whitespace call in the codebase and opting to go to its implementation leads us to it. Image shown below:

Please can you help re-confirm? Maybe I am still looking at the wrong place?
Patiently awaiting your response. Thank you.
There was a problem hiding this comment.
You are looking in the right place, but "the 2 zero-width characters" are recognised as whitespace by unicode::White_Space.
There was a problem hiding this comment.
I think your code suggestions might be over-complicating things, and introducing a special case?
As long as clippy provides a useful suggestion with visible space, and rustfmt introduces visible space, I think the current code and test are fine.
There was a problem hiding this comment.
I think your code suggestions might be over-complicating things, and introducing a special case?
I don't particularly get what you meant when you said 'introducing a special case'. Could it be that it is any different from when we are using the current char::is_whitespace? Do you mind explaining, please?
As long as clippy provides a useful suggestion with visible space, and rustfmt introduces visible space, I think the current code and test are fine.
Oh, Okay thank you... Will just leave it as it is then.
There was a problem hiding this comment.
I think your code suggestions might be over-complicating things, and introducing a special case?
I don't particularly get what you meant when you said 'introducing a special case'. Could it be that it is any different from when we are using the current char::is_whitespace? Do you mind explaining, please?
I kinda asked this because the only special case I could see was when I singled out \u{200E} and \u{200F} when using the lexer's whitespace check... which is not different from what char::is_whitespace is doing since for any valid snippet, any ending character identified as NOT whitespace by char::is_whitespace is also identified as non-whitespace by the lexer... with the sole exception of \u{200E} and \u{200F} (the lexer recognizes these as whitespace)... so char::is_whitespace already implicitly creates a special case for those 2 characters which was why clippy still provided useful suggestions with visible space in those cases... and was the basis of my argument at the beginning of this PR that its (char::is_whitespace) use might be intentional.. The code i suggested just makes the special case explicit.
That said, I agree with you that we should leave it as it is since they both do same thing. Thank you.
I would appreciate your response though 🙏
There was a problem hiding this comment.
This is a single lint in the clippy tool.
It's ok to add tests and fix any bugs caused by unusual whitespace when we find them.
But any larger changes in the compiler or tools to handle directional markers would be better made in a principled manner across the entire codebase. I'm not sure what those principles are yet, so I think there would need to be some discussion first.
There was a problem hiding this comment.
Hi Teor...
I spent the last 2 days with my dentist so I couldn't respond earlier.
To respond to your message, I think I need to clarify something because I see we have a lot of mixups. Pardon my lengthy text.
You spoke about how "any larger changes in the compiler or tools to handle directional markers would be better made in a principled manner across the entire codebase"... But there is no where in the PR I suggested any such kind of change... I am assuming you mistook the suggestion I gave for such kind of change?
If so, then that's not the case.
An analogy would be like having a + b * c but then, changing it to a + (b*c). They both do same thing, the second one just shows the operator precedence more clearly, thereby improving readability.
In the case of the pr, what happens in the current code, that is, when we apply the negation of char::is_whitespace on a char (!c.is_whitespace)?
For a regular character that is not whitespace by any definition, it returns true because they are not whitespace. Now, for whitespaces, remember that if the whitespace is not recognized by the lexer, the lint wouldn't run because it is not valid rust... So we are not concerned with those cases because the responsibility is on the lexer to stop such program from running by throwing an error. This is basically why @dswij asked that we remove the test we added before for whitespaces not recognized by the lexer.
So we are only left with whitespaces that are recognized by the lexer. Passing any of those whitespaces to (!c.is_whitespace) returns false... With the sole exception of \u200E and \200F... Which returns true because they don't have the Unicode White_Space property.
We can see that there is an implicit 'special case' already for the directional markers.
Now let's check the suggested code, specifically the whitespace checker. What happens when we pass a char to
!rustc_lexer::is_whitespace(c) || matches!(c, '\u{200E}' | '\u{200F}')
If it is not a whitespace, it returns true. But if it is a whitespace, the first expression returns false and we then have to check the right hand of OR (||)... Which checks if the char matches the directional markers. If it doesn't, it returns false... Otherwise, if it does match any of the directional marks, it returns true.
This is exact same behaviour... Just improving the readability and showing intent behind code. So I am not suggesting any kind of change that alters the behavior and because they are same behavior, we could leave the code at what it is currently. I have already documented that we intentionally used char::is_whitespace and written test for the directional marks... So we are good. Just wanted to clarify things.
Thank you.
|
@dswij please when time permits, can you take a look at this PR? Thank you 🙏🏾 |


View all comments
changelog: none
This PR addresses the whitespace check in collapsible_if.
My investigation revealed that the current char::is_whitespace check is actually intentional. If we switched to rustc_lexer::is_whitespace, zero-width characters (like \u{200E}) would be treated as valid spacing, which would make Clippy to output suggestions that visibly look like elseif. To prevent future regressions, this PR adds a UI test with a zero-width space and an inline comment explaining why we must keep the current check.
The issue is for outreachy applicants and is being tracked here: rustfoundation/interop-initiative#53