Webdext

Webdext is a JavaScript library for web data extraction (web scraping). Currently, it only supports extracting data records from a list page (a web page containing two or more data records).

In order to use it, you must run Webdext inside the web page context. There are 2 ways to do that:

- Use it as a browser extension (currently, only the Chrome extension is implemented).
- Inject the script into the web page context using a headless browser such as PhantomJS or Splash (currently, only the runner script for PhantomJS is implemented).
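
As a rough illustration of the second option, the sketch below loads a page in PhantomJS, injects a script file into it, and then evaluates a function inside the page context. It is not the runner script shipped with this repository. The helper name `runInPage`, the bundle file name `webdext.js`, and the example URL are assumptions made for illustration, and this README does not name Webdext's extraction entry point, so the in-page function only returns the page title as a placeholder. The only PhantomJS calls relied on are `page.open`, `page.injectJs`, and `page.evaluate`.

```javascript
// Minimal sketch, ES5 syntax on purpose: PhantomJS does not support most
// ES2015 features. `runInPage` is a hypothetical helper, not part of Webdext.
function runInPage(page, url, scriptPath, inPageFn, done) {
  page.open(url, function (status) {
    // PhantomJS reports 'success' or 'fail' as the load status.
    if (status !== 'success') {
      done(new Error('Failed to load ' + url));
      return;
    }
    // injectJs returns true when the file was found and injected.
    if (!page.injectJs(scriptPath)) {
      done(new Error('Failed to inject ' + scriptPath));
      return;
    }
    // inPageFn runs inside the web page context, where the injected
    // script's globals are visible. Its return value must be serializable.
    done(null, page.evaluate(inPageFn));
  });
}

// Usage under PhantomJS (assumed file name and URL):
//
// var page = require('webpage').create();
// runInPage(page, 'http://example.com/list', 'webdext.js', function () {
//   // Placeholder: call Webdext's extraction entry point here.
//   return document.title;
// }, function (err, result) {
//   console.log(err ? err.message : JSON.stringify(result));
//   phantom.exit(err ? 1 : 0);
// });
```

Passing the `page` object in, rather than creating it inside the helper, keeps the function independent of the PhantomJS `webpage` module, so the control flow can be exercised with a stub object.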

Installation and usage

Internals

The intelligent extraction algorithm is heavily based on AutoRM [1] and DAG-MTM [2], although it is not an exact implementation of either.
The XPath wrapper induction algorithm is based on [3].
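
To give a feel for what XPath wrapper induction means, the sketch below generalizes a few sample XPaths into one expression that matches all of them. It is a deliberately simplified illustration and is neither Webdext's implementation nor the algorithm from [3]. It assumes the sample XPaths are absolute, already aligned (same number of steps and same tag names), and use only positional predicates such as `li[2]`. The function names are hypothetical.

```javascript
// Simplified illustration only: keep a positional predicate when every
// sample agrees on it, and drop it when the samples differ.
function parseStep(step) {
  var m = /^([^\[\]]+)(?:\[(\d+)\])?$/.exec(step);
  if (!m) {
    throw new Error('Unsupported XPath step: ' + step);
  }
  return { name: m[1], index: m[2] === undefined ? null : Number(m[2]) };
}

function parseXPath(xpath) {
  return xpath.split('/')
    .filter(function (s) { return s.length > 0; })
    .map(parseStep);
}

function induceXPath(xpaths) {
  if (xpaths.length === 0) {
    throw new Error('At least one sample XPath is required');
  }
  var parsed = xpaths.map(parseXPath);
  var first = parsed[0];
  var steps = first.map(function (step, i) {
    var sameIndex = true;
    parsed.forEach(function (other) {
      // This sketch does no alignment of its own; it rejects unaligned input.
      if (other.length !== first.length || other[i].name !== step.name) {
        throw new Error('XPaths are not aligned');
      }
      if (other[i].index !== step.index) {
        sameIndex = false;
      }
    });
    return sameIndex && step.index !== null
      ? step.name + '[' + step.index + ']'
      : step.name;
  });
  return '/' + steps.join('/');
}

var wrapper = induceXPath([
  '/html/body/div[2]/ul/li[1]/a',
  '/html/body/div[2]/ul/li[3]/a'
]);
// wrapper === '/html/body/div[2]/ul/li/a'
```

The two samples differ only in the `li` position, so that predicate is dropped while `div[2]` is kept, and the resulting XPath selects the link in every list item.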

[1] Shengsheng Shi, Chengfei Liu, Yi Shen, Chunfeng Yuan, Yihua Huang. 2015. AutoRM: An effective approach for automatic Web data record mining. Knowledge-Based Systems, 89, 314–331. doi:10.1016/j.knosys.2015.07.012

[2] Shengsheng Shi, Chengfei Liu, Chunfeng Yuan, Yihua Huang. 2014. Multi-feature and DAG-based multi-tree matching algorithm for automatic web data mining. Proceedings of International Joint Conferences on Web Intelligence and Intelligent Agent Technology, 739–755. doi:10.1109/WI-IAT.2014.24

[3] Joachim Nielandt, Antoon Bronselaer, Guy de Tré. 2016. Predicate enrichment of aligned XPaths for wrapper induction. Expert Systems With Applications, 51, 259–275. doi:10.1016/j.eswa.2015.12.040

Author

Sigit Dewanto, sigitdewanto11[at]yahoo[dot]co[dot]uk

