Exclude & Allowed Switches Not Behaving as Expected #91

Open
03k64serenity opened this issue Apr 20, 2022 · 7 comments

Comments

@03k64serenity

CeWL/cewl.rb

Line 814 in 280bfe6

if allowed_pattern && !a_url_parsed.path.match(allowed_pattern)

When regex patterns are provided in a file for --exclude, or on the command line for --allowed, cewl does not exclude or allow offsite URLs according to those rules.
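One possible reading of the line referenced above is that the pattern is tested only against the URL's path, so a pattern written for a domain never sees the hostname. The following is a minimal sketch of that behaviour using Ruby's standard `uri` library. The URL and the pattern are hypothetical examples, and only the `path.match` call mirrors the quoted line:

```ruby
require 'uri'

# Hypothetical pattern meant to allow a single domain.
allowed_pattern = Regexp.new('example\.com')

# Hypothetical link found while spidering.
a_url_parsed = URI.parse('https://example.com/blog/post')

# The path component carries no host information,
# so a domain-style pattern cannot match it.
path_match = !a_url_parsed.path.match(allowed_pattern).nil?
host_match = !a_url_parsed.host.match(allowed_pattern).nil?

puts "path: #{a_url_parsed.path.inspect} matched=#{path_match}"
puts "host: #{a_url_parsed.host.inspect} matched=#{host_match}"
```

If this reading is right, a domain rule would need to be checked against `host` (or the full URL) rather than `path`.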

@digininja
Owner

digininja commented Apr 20, 2022 via email

@03k64serenity
Author

Right. I'd like to be able to limit the spider from crawling certain domains and allow it to crawl others based on a regex.

@digininja
Owner

digininja commented Apr 20, 2022 via email

@03k64serenity
Author

Sounds good. Will do. Hey, by the way...I had no idea you were the author of CeWL all these years seeing you on the interwebs, so I'm even more impressed and grateful for your contributions to the community.

@digininja
Owner

digininja commented Apr 20, 2022 via email

@spencer-dollahite

https://github.com/spencer-dollahite/CeWL/blob/master/cewl.rb

This is the sort of approach/feature I'd like to see: both an allowed and an exclude pattern switch, for the domain as well as the path. I know the code here isn't perfect, but I think it is close enough for demo purposes. Thoughts?
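For discussion, here is a rough sketch of how separate allow and exclude patterns for the domain and the path could be combined. It is not the code from the linked fork or from CeWL itself. The helper `url_permitted?` and its keyword names are hypothetical, and letting exclude rules take priority over allow rules is an assumption:

```ruby
require 'uri'

# Hypothetical helper, not part of CeWL: decides whether a URL should be spidered.
# Each keyword takes an optional Regexp; nil means "no rule for this component".
def url_permitted?(url, allowed_host: nil, excluded_host: nil,
                   allowed_path: nil, excluded_path: nil)
  parsed = URI.parse(url)
  host = parsed.host.to_s
  path = parsed.path.to_s

  # Assumption: exclude rules take priority over allow rules.
  return false if excluded_host && host.match?(excluded_host)
  return false if excluded_path && path.match?(excluded_path)

  # When an allow rule is set, the component must match it.
  return false if allowed_host && !host.match?(allowed_host)
  return false if allowed_path && !path.match?(allowed_path)

  true
rescue URI::InvalidURIError
  false # Links that cannot be parsed are skipped.
end

# Hypothetical rules: stay on example.com and its subdomains, skip /private/.
rules = { allowed_host: /(\A|\.)example\.com\z/, excluded_path: %r{\A/private/} }

puts url_permitted?('https://www.example.com/docs', **rules)
puts url_permitted?('https://www.example.com/private/x', **rules)
puts url_permitted?('https://other.test/docs', **rules)
```

Keeping the host and path checks separate means a domain rule cannot accidentally match text in the path, and the reverse.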

@digininja
Owner

digininja commented Apr 28, 2022 via email

3 participants