Python Script that extracts info from pdftohtml xml output of PDF of Bundestag Lobby List

stefanw/verbaendeliste-bundestag

Verbaendeliste-Bundestag Extractor

Use pdftohtml to convert the PDF into an XML file:

pdftohtml -xml input.pdf output.xml

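Then run the extractor, passing the first and last relevant page numbers, to convert the XML (read from standard input; lobbylist.xml here is the file produced in the previous step) into parsed JSON: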

python extract_lobby.py 4 690 < lobbylist.xml > lobbylist.json
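
To illustrate what the two page-number arguments select, here is a minimal sketch of reading pdftohtml XML and keeping only the text of an inclusive page range. It is not the code of extract_lobby.py. The function name iter_page_texts and the sample data are hypothetical, and the sketch assumes the usual pdftohtml layout of page elements (with a number attribute) containing text elements.

```python
import xml.etree.ElementTree as ET


def iter_page_texts(xml_source, first_page, last_page):
    """Yield (page_number, [text, ...]) for pages in the inclusive range.

    Hypothetical helper for illustration; assumes <page number="N"> elements
    that contain <text> elements, as pdftohtml -xml typically emits.
    """
    root = ET.fromstring(xml_source)
    for page in root.iter("page"):
        number = int(page.get("number"))
        if first_page <= number <= last_page:
            # itertext() flattens nested <b>/<i> markup inside <text> elements.
            lines = ["".join(t.itertext()).strip() for t in page.iter("text")]
            yield number, [line for line in lines if line]


# Made-up sample data standing in for real pdftohtml output.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="200" height="16" font="0">Cover page</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="200" height="16" font="0"><b>Example Association</b></text>
<text top="120" left="50" width="200" height="16" font="0">Example Street 1</text>
</page>
</pdf2xml>"""

pages = list(iter_page_texts(SAMPLE, 2, 2))
```

For a real file, the XML could be read with sys.stdin.buffer.read() and passed to the same function. Turning the per-page text into structured records is the part that depends on the layout of the lobby list and is left out here.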

Here is the extracted JSON (as of 15 June 2012).

License: MIT License
