fix: ignore whitespace-only Disallow paths in extractUrlsFromRobotsTxt #1973
Closed
juliosuas wants to merge 1 commit into smicallef:master
The regex r'disallow:\s*(.[^ #]*)' used '.' as the first character of the capture group, which matches any character including a space. This caused 'Disallow: ' (a whitespace-only path) to be returned as ' ', adding an invalid disallowed path to the list.

Per the robots.txt specification, 'Disallow: ' with no non-whitespace content means 'allow all' and should be treated as an empty/no-op rule.

Fix: replace the leading '.' with '\S' so only paths that start with a non-whitespace character are captured. This resolves the TODO comment that had been in the docstring since the original implementation.

Fixes smicallef#701
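To make the difference concrete, here is a minimal sketch comparing the two patterns on single lines. It assumes the pattern is applied one line at a time with `re.match` and `re.IGNORECASE`; the names `OLD_PATTERN`, `NEW_PATTERN`, and `captured_path` are illustrative only and are not taken from the SpiderFoot source.

```python
import re

# Pattern before the fix: '.' lets the capture group start with a space.
OLD_PATTERN = re.compile(r'disallow:\s*(.[^ #]*)', re.IGNORECASE)
# Pattern after the fix: '\S' requires a non-whitespace first character.
NEW_PATTERN = re.compile(r'disallow:\s*(\S[^ #]*)', re.IGNORECASE)


def captured_path(pattern, line):
    """Return the captured path for one robots.txt line, or None if no match.

    Hypothetical helper for illustration; assumes per-line matching.
    """
    match = pattern.match(line)
    return match.group(1) if match else None


for line in ('Disallow: ', 'Disallow: /admin', 'Disallow: /tmp # scratch'):
    old = captured_path(OLD_PATTERN, line)
    new = captured_path(NEW_PATTERN, line)
    print(f'{line!r}: old={old!r} new={new!r}')
```

With the old pattern, `\s*` first consumes the trailing space, then backtracks so that `.` can match it, which is how the single-space capture arises. With `\S` no backtracking option succeeds, so the line simply does not match.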
Author
A bit more detail: per the robots.txt spec,
Author
Closing this stale small fix to keep my open-source queue focused. Happy to revisit fresh if maintainers want this robots.txt behavior changed.
Fixes #701 (also the TODO in the docstring).

The regex is `r'disallow:\s*(.[^ #]*)'`. The leading `.` matches anything, including a space, so a bare `Disallow: ` (nothing but whitespace after the colon) ends up returning `[' ']` and a single-space "path" lands in the exclusion list.

Changing the first character of the group to `\S` makes it require a non-whitespace start.

The spec treats an empty `Disallow` as "allow everything", so returning `[]` there is the correct behaviour.
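A regression check for this behaviour could look roughly like the sketch below. `extract_disallowed_paths` is a hypothetical stand-in written for this example, not the actual `extractUrlsFromRobotsTxt` implementation; it assumes the input is split into lines and each line is matched with the fixed pattern.

```python
import re

# Fixed pattern from this PR: the capture group must start with non-whitespace.
DISALLOW_RE = re.compile(r'disallow:\s*(\S[^ #]*)', re.IGNORECASE)


def extract_disallowed_paths(robots_txt):
    """Collect Disallow paths from robots.txt content.

    Hypothetical stand-in for illustration only. Non-string input
    yields an empty list (an assumption, not confirmed by the PR).
    """
    paths = []
    if not isinstance(robots_txt, str):
        return paths
    for line in robots_txt.splitlines():
        match = DISALLOW_RE.match(line)
        if match:
            paths.append(match.group(1))
    return paths


sample = (
    "User-agent: *\n"
    "Disallow: \n"                # whitespace-only: should be ignored
    "Disallow: /private\n"
    "Disallow: /tmp # scratch\n"  # trailing comment is not captured
)
print(extract_disallowed_paths(sample))
```

The whitespace-only rule contributes nothing to the result, while ordinary paths are still captured up to the first space or `#`.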