fix: ignore whitespace-only Disallow paths in extractUrlsFromRobotsTxt #1973

Closed
juliosuas wants to merge 1 commit into smicallef:master from juliosuas:fix/robots-txt-whitespace-disallow
Conversation

@juliosuas juliosuas commented Mar 30, 2026

Fixes #701 (and the TODO in the docstring).

The regex is r'disallow:\s*(.[^ #]*)'. The leading . in the capture group matches any character, including a space, so a Disallow: line with only trailing whitespace ends up returning [' '], and a single-space "path" lands in the exclusion list.

Changing the first character of the group to \S makes it require a non-whitespace start:

r'disallow:\s*(\S[^ #]*)'

After:

'Disallow: '              → []
'Disallow: /admin'        → ['/admin']
'Disallow: /p#comment'    → ['/p']

The spec treats an empty Disallow as "allow everything", so returning [] there is the correct behaviour.
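The before/after behaviour can be sketched with a minimal, self-contained helper. Note that `disallowed_paths` is a hypothetical stand-in for the relevant part of `extractUrlsFromRobotsTxt`, not the actual SpiderFoot code:

```python
import re

# Old pattern: the leading '.' can match a space, producing ' ' as a "path".
OLD_RE = re.compile(r'disallow:\s*(.[^ #]*)', re.IGNORECASE)
# Fixed pattern: '\S' requires the path to start with a non-whitespace char.
NEW_RE = re.compile(r'disallow:\s*(\S[^ #]*)', re.IGNORECASE)

def disallowed_paths(robots_txt, pattern=NEW_RE):
    """Extract Disallow paths from robots.txt content, one line at a time."""
    paths = []
    for line in robots_txt.splitlines():
        match = pattern.search(line)
        if match:
            paths.append(match.group(1))
    return paths

# With the fixed pattern, a whitespace-only Disallow yields no entry:
print(disallowed_paths('Disallow: '))            # []
print(disallowed_paths('Disallow: /admin'))      # ['/admin']
print(disallowed_paths('Disallow: /p#comment'))  # ['/p']
# With the old pattern, the same line yields an invalid single-space path:
print(disallowed_paths('Disallow: ', OLD_RE))    # [' ']
```

The `#` in `[^ #]*` stops the capture at an inline comment, which is why `/p#comment` is trimmed to `/p` in both the old and new patterns.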

The regex r'disallow:\s*(.[^ #]*)' used '.' as the first character of the
capture group, which matches any character including a space.  This caused
'Disallow: ' (a whitespace-only path) to be returned as ' ', adding an
invalid disallowed path to the list.

Per the robots.txt specification, 'Disallow: ' with no non-whitespace
content means 'allow all' and should be treated as an empty/no-op rule.

Fix: replace the leading '.' with '\S' so only paths that start with a
non-whitespace character are captured.  This resolves the TODO comment
that had been in the docstring since the original implementation.

Fixes smicallef#701
@juliosuas
Author

A bit more detail: per the robots.txt spec, Disallow: with no path (or only whitespace) means "allow all" and should not add any entry to the disallowed list. The old regex r'disallow:\s*(.[^ #]*)' used ., which matches any character including a space, so a space-only 'Disallow: ' was returned as ' ', an invalid path. The \S fix ensures only paths starting with a real non-whitespace character are captured. Also removed the TODO comment from the docstring since it's now addressed.

@juliosuas
Author

Closing this stale small fix to keep my open-source queue focused. Happy to revisit fresh if maintainers want this robots.txt behavior changed.

@juliosuas juliosuas closed this May 4, 2026

Development

Successfully merging this pull request may close these issues.

TODO: sflib.py: fix whitespace parsing; ie, " " is not a valid disallowed path
