## Features

- Crawl through the entire website or the provided URLs, based on the URL pattern (crawledURLs.json)
- Check broken links (brokenLinks.json, testedPages.json)
- Retrieve page information (crawledPages.json):
  - page title
  - meta description
  - latest updated date
- Detect pages with external-domain resources:
- iframe (pagesWithExternalIframes.json)
- image (pagesWithExternalImages.json)
- video (pagesWithExternalVideos.json)
- all external domain requests (externalDomains.json)
- Detect pages with:
- iframe (pagesWithIframes.json)
- image (pagesWithImages.json)
- video (pagesWithVideos.json)
- Check if oversized / over-compressed images are being used:
  - oversized: the intrinsic size (width * height) of the image is more than 10% larger than its rendered size
  - over-compressed: the intrinsic size (width * height) of the image is more than 10% smaller than its rendered size
- Detect all non-HTML document links (pagesWithFiles.json)
  - formats: .pdf, .jp(e)g, .png, .xls(x), .doc(x), .mp3, .mp4
- Scan for WCAG issues and generate a report using Pa11y
- Validate HTML using the W3C validator
- URL redirection verification
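The oversize / over-compression rule above can be sketched as a pair of small helpers. This is an illustrative sketch only, not the tool's actual implementation; the function names are assumptions, and the intrinsic/rendered sizes are taken to be `{ width, height }` objects (e.g. an `<img>` element's `naturalWidth`/`naturalHeight` versus its rendered bounding box).

```javascript
// Compute the pixel area of a { width, height } size object.
const area = ({ width, height }) => width * height;

// Oversized: intrinsic area is more than 10% larger than the rendered area,
// i.e. the source image wastes bandwidth relative to how it is displayed.
function isOversized(intrinsic, rendered) {
  return area(intrinsic) > area(rendered) * 1.1;
}

// Over-compressed: intrinsic area is more than 10% smaller than the rendered
// area, i.e. the image is scaled up and will look blurry.
function isOvercompressed(intrinsic, rendered) {
  return area(intrinsic) < area(rendered) * 0.9;
}
```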
## Requirements

- Microsoft Windows / macOS
- Node.js v10 and above
## Getting Started

- Run

  ```
  npm install
  ```

- Clone the config-sample.js file and name it config.js.
- In config.js, update the configuration:

  ```js
  module.exports = {
    entryUrl: 'https://www.example.com/xxx',
    urlPattern: null,
    urlsSource: null
    // ...
  }
  ```

- Run

  ```
  node index.js
  ```
## Configuration

- `urlPattern` (string || null): Only URLs that start with `urlPattern` will be crawled. E.g.: `/press-release` will only crawl www.example.com/press-release/*
- `urlsSource` (string || null): File path to a JSON file with an array of URLs for the crawler to test. E.g.: `./src/urlsSource.json`.

  ```js
  // Example: urlsSource.json
  [
    "https://www.example.com",
    "https://www.example.com/xxx",
    // ...
  ]
  ```

- `pageWaitTime` (integer): Time (in milliseconds) to wait after the page has loaded. (An intentional delay to prevent being blocked.) E.g.: `5000`.
- `debug` (boolean): `true` to turn on debug mode, which stops the crawler after 15 URLs.
- `checkBrokenLink` (boolean): `true` to check for broken links.
- `detectFileLink` (boolean): `true` to find links that open non-HTML documents (pdf, jpg, jpeg, png, xls, xlsx, doc, docx).
- `checkImageExist` (boolean): `true` to generate a list of pages that contain images.
- `checkVideoExist` (boolean): `true` to generate a list of pages that contain videos.
- `checkIframeExist` (boolean): `true` to generate a list of pages that contain iframes.
- `disableCrawl` (boolean): `true` to prevent the crawler from crawling the page.
- `detectExternalResource` (boolean): `true` to generate lists of pages that contain external iframes, images, and videos, plus a list of all external domains.
- `savePageInfo` (boolean): `true` to collect information about crawled pages, including:
  - Page title (`<title>`)
  - Meta description (`<meta name="description">`)
  - Last updated date (depends on `lastUpdatedTextSelector`)
- `scanWCAG` (boolean): `true` to scan the page for WCAG issues.
- `validateHTML` (boolean): `true` to check the page with the W3C validator.
- `takeScreenshot` (boolean): `true` to take screenshots of the page (mobile and desktop).
- `outputFolderName` (string): Name of the output folder. E.g.: `reports`.
- `lastUpdatedTextSelector` (string): DOM selector of the last-updated text. E.g.: `.copyright > p`.
## URL Redirection Verification

To verify that all URL redirections are correctly done:

- Prepare a URL mapping JSON file.
- The JSON file should be an array of objects with the properties `origin` and `destination`.
- Note: Ensure every URL starts with `http://` or `https://` and ends with a trailing slash, e.g.: https://www.example.com/xxx/.

  ```js
  [
    {
      "origin": "URL",
      "destination": "URL"
    },
    // ...
  ]
  ```
- Update the `configuration` in the redirection-check.js file:
  - `urlsMapSource` (string): File path to the source of the URL map, e.g.: ./src/exampleRedirectionMap.json.
- Run

  ```
  node redirection-check.js
  ```

- The test report will be saved in reports/redirectionTestResults.json.
- Incorrect redirections will be marked as `matched: false`.
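The comparison behind `matched: false` can be sketched as below. This is an assumption about how the check works, not the tool's actual code: the trailing-slash normalization mirrors the note above that mapped URLs should end with a trailing slash, and all field names other than `matched` are illustrative.

```javascript
// Append a trailing slash if missing, so that URLs that differ only in the
// trailing slash still compare as equal (an assumed normalization rule).
function normalize(url) {
  return url.endsWith('/') ? url : url + '/';
}

// Compare the URL the browser actually landed on (`finalUrl`) against the
// mapped `destination`, producing a result entry with a `matched` flag.
function checkRedirection(origin, destination, finalUrl) {
  return {
    origin,
    destination,
    actual: finalUrl,
    matched: normalize(finalUrl) === normalize(destination)
  };
}
```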
## Known Issues & Roadmap

- Currently, the non-HTML document detection only checks whether the `href` contains a file extension. Hence, it will not detect links that redirect to a file, or links whose `href` does not contain the file extension at all.
- Planned feature: crawl and download all the assets and save them as static files.
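The extension-only detection described above can be illustrated with a small sketch. The regex is an assumption built from the documented extension list, not the tool's actual pattern; it shows why a redirect URL with no extension slips through.

```javascript
// Match the documented file extensions at the end of the path, optionally
// followed by a query string or fragment. (Illustrative regex, not the
// crawler's actual implementation.)
const FILE_EXT_RE = /\.(pdf|jpe?g|png|xlsx?|docx?|mp3|mp4)(\?|#|$)/i;

// Returns true only when the href string itself carries a file extension --
// a link like /download?id=42 that redirects to a PDF is not detected.
function isFileLink(href) {
  return FILE_EXT_RE.test(href);
}
```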