This repository contains a software that extracts links on DSpace bitstream documents (references to external links).
Given a file with links to sitemap DSpace.
- For each sitemap, parse DSpace site map.
- For each site map entry, download it and on its HTML extract relevant URLs using a regex that matches bitstream links.
- For each bitstream URL, download it, extract its links using tikalinkextract software and save each links extracted to a file with same file structure.
Like:
- From URL http://repositorio-aberto.up.pt/bitstream/10216/63886/2/90220.pdf
- To file: output/repositorio-aberto.up.pt/bitstream/10216/63886/2/90220.pdf_seeds.txt
It uses the tikalinkextract software and tika server.
On other shell run:
java -mx1000m -jar tools/tika-server-1.20.jar --port=9998
mvn clean package
Run the dspace link extractor on background and redirect to a file:
java -jar target/dspace-link-extractor-0.1-SNAPSHOT.jar dspace-urls.txt output >> dspace.log 2>&1
If you only want the entries that have been changed from a specific date add a date on argument like using format yyyy-MM-dd like:
java -jar target/dspace-link-extractor-0.1-SNAPSHOT.jar dspace-urls.txt output 2019-01-01 >> dspace.log 2>&1
When thw crawl has finished you could remove all the 'handle' folders. Because the seeds are on bitsteam folder.
find output -maxdepth 2 -name handle -exec rm -rf {} \;
Concatenate all the seeds on a single file:
find output/ -type f -name "*_seeds.txt" -exec cat {} \; >> seeds.txt
Remove mails and filter duplicates:
cat seeds.txt | egrep -v "^mail.*" | sort | uniq > seeds_uniq.txt