Allow filtering sitemap urls #9

Open
paceaux opened this issue Sep 2, 2022 · 4 comments
Labels: enhancement (New feature or request)

Comments

paceaux (Owner) commented Sep 2, 2022

Allow the ability to pass a wildcard that filters URLs (useful if there are a LOT of URLs).

paceaux added the enhancement label on Sep 2, 2022
paceaux (Owner, Author) commented Sep 20, 2023

This may not be necessary because URLs are first output to a .sitemap.json file and someone could manually adjust that file.

But it could be possible to allow passing in some sort of regex that lets the user decide which pages they want.
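
A minimal sketch of what that could look like, assuming the exported .sitemap.json is just an array of URL strings (the file shape and the function name here are assumptions, not the current implementation):

```js
// filter-sitemap.js — hypothetical sketch: filter an exported .sitemap.json
// with a user-supplied regex before crawling. Assumes the file is an array
// of URL strings.
import { readFile, writeFile } from 'node:fs/promises';

async function filterSitemap(sitemapPath, pattern) {
  const urls = JSON.parse(await readFile(sitemapPath, 'utf8'));
  const regex = new RegExp(pattern);
  const filtered = urls.filter((url) => regex.test(url));
  await writeFile(sitemapPath, JSON.stringify(filtered, null, 2));
  return filtered;
}

// e.g. keep only product pages:
// await filterSitemap('example.com.sitemap.json', '/products/');
```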

paceaux (Owner, Author) commented Oct 25, 2023

If we offered this, there's a question: should this modify the existing exported sitemap, or should it create a new one?

There are a few ways this could go:

Option 1:

  1. Produce a full sitemap, as usual.
  2. Then filter the sitemap down to a smaller one in memory.
  3. Pass the new, smaller sitemap into the main config (it's set on the linkSet property).

Option 2:

  1. As we're producing the sitemap, check whether each URL meets the parameters.
  2. If it doesn't, don't add the URL.
  3. The sitemap is produced with fewer URLs.

Option 3:

  1. Get the sitemap.
  2. Filter the sitemap and produce a smaller sitemap file.
  3. Pass the smaller sitemap file into ... whatever.

Regardless, it seems like the SiteCrawler class probably needs a filter function in there (see the sketch after the questions below).

Other questions worth asking:

  1. What should the argument we send in be named?
  2. Are we only going to allow a regex?
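
For illustration, a rough sketch of option 2 as a filter hook on SiteCrawler. The method names, the urlFilter option, and the linkSet shape here are assumptions, not the existing implementation:

```js
// Hypothetical sketch of option 2: check each URL against a user-supplied
// pattern as the sitemap is being produced, so filtered-out URLs are never added.
// SiteCrawler's real internals may differ; urlFilter is an assumed option name.
class SiteCrawler {
  constructor(config = {}) {
    this.config = config;
    this.linkSet = new Set();
  }

  // Returns true when the URL should be kept in the sitemap.
  matchesFilter(url) {
    const { urlFilter } = this.config;
    if (!urlFilter) return true; // no filter: keep everything
    return new RegExp(urlFilter).test(url);
  }

  addUrl(url) {
    if (this.matchesFilter(url)) {
      this.linkSet.add(url);
    }
  }
}
```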

paceaux (Owner, Author) commented Sep 16, 2024

Note:

There's a URL Pattern API now: https://developer.mozilla.org/en-US/docs/Web/API/URL_Pattern_API
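
For example, a pattern-based filter could lean on that API instead of a raw regex. URLPattern ships in modern browsers; in Node it may require a recent version or a polyfill such as the urlpattern-polyfill package:

```js
// Hypothetical use of the URL Pattern API to decide which sitemap URLs to keep.
const pattern = new URLPattern({ pathname: '/blog/*' });

const urls = [
  'https://example.com/blog/first-post',
  'https://example.com/about',
];

const kept = urls.filter((url) => pattern.test(url));
// kept === ['https://example.com/blog/first-post']
```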

paceaux (Owner, Author) commented Oct 2, 2024

Some of this work was done under #4.

The option to honor robots involved creating a Robots class that can generate a disallowed.json file, which contains all of the URL patterns disallowed for the * agent in the robots.txt file.

But the -b parameter will only involve honoring robots.txt when crawling.

The thinking here is that we're asking SelectorHound to behave just like any other crawling bot.

What we still want is an option where a user can simply say, "don't do these URLs".

That could mean we follow the same pattern we use for the sitemap:

  1. Go to the sitemap
  2. Convert it to JSON
  3. Download it
  4. Read from the downloaded file

But doing this with Robots might be overkill for users who may just want to give a disallow/allow list.
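
A lighter-weight alternative could look something like the sketch below: accept plain allow/disallow lists and skip robots.txt parsing entirely. The option names (allow, disallow) and matching-by-substring behavior are assumptions:

```js
// Hypothetical sketch: filter URLs against simple allow/disallow lists of
// substrings, without generating or reading a disallowed.json file.
function filterUrls(urls, { allow = [], disallow = [] } = {}) {
  return urls.filter((url) => {
    const allowed = allow.length === 0 || allow.some((p) => url.includes(p));
    const blocked = disallow.some((p) => url.includes(p));
    return allowed && !blocked;
  });
}

// e.g. skip admin and cart pages:
// filterUrls(urls, { disallow: ['/admin/', '/cart'] });
```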
