A set of scripts that is a functional approach to creating a domain specific LOD name directory

linkedjazz/linked-jazz-name-directory

Linked Jazz

Name Directory Creation

This set of scripts is a functional approach to creating a domain-specific LOD (Linked Open Data) name directory. It works with extract files that are processed sequentially; no database interface is needed, just the extracts and the scripts. The scripts use keywords to build our jazz directory, but the keywords could easily be replaced to create a name directory for another domain. Much of the process is designed to work on a VPS, but some parts (filterLOCskos.py) needed to be run locally.

Installing:

Requires OS X/Linux command-line tools such as grep and wget.

Extracts Needed:

The process requires a number of extract files from DBpedia and the Library of Congress.

DBpedia:

(When a new version of the DBpedia extract is released, you will need to update the URLs below.)

Library of Congress:

Extract these files into the data directory (you will need a lot of disk space).

Running:

Building the directory is just running the scripts in order.

python filterDBpediaJazzFile.py

This takes an article-category approach: it starts from everything related to jazz and filters it down to people. The process is diagrammed in filterDBpediaJazzFile.pdf.
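As a rough illustration of keyword-based category filtering, the sketch below keeps only URIs that have a category containing a keyword and that also appear in a set of known people. The function name, the `(uri, category)` pair format, and the `person_uris` set are assumptions made for this example; the actual logic of filterDBpediaJazzFile.py is the one shown in its PDF diagram and may differ.

```python
def filter_jazz_people(article_categories, person_uris, keyword="jazz"):
    """Return URIs with at least one category containing `keyword`
    that are also present in `person_uris`.

    article_categories: iterable of (uri, category) pairs (assumed format).
    person_uris: set of URIs already known to be people (assumed input).
    """
    keyword = keyword.lower()
    matches = set()
    for uri, category in article_categories:
        # Case-insensitive substring test on the category label.
        if keyword in category.lower() and uri in person_uris:
            matches.add(uri)
    return matches
```

Swapping the `keyword` argument is the kind of change that would retarget the directory to another domain.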

python filterLOCskos.py

Takes the enormous LC data file and creates a new, more manageable LC lookup. The first step creates personURIs.nt; this could be done locally and the result added to the extract data on a server to reduce the space needed. Making this file takes a long time because it greps a 30GB extract. The process is documented in filter_LOC_filterLOCskos.pdf.
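A grep-style pass over a very large extract can be sketched in Python by streaming the file line by line, so memory use stays flat regardless of file size. The function names and the `needle` value are assumptions for illustration; the pattern the real script greps for is not stated here.

```python
def filter_lines(lines, needle):
    """Yield only the lines that contain `needle`, one at a time."""
    for line in lines:
        if needle in line:
            yield line


def write_person_uris(src, dst, needle):
    """Copy matching lines from `src` to `dst` and return how many were kept.

    `src` and `dst` are open text file objects; `needle` is a hypothetical
    marker string that identifies person records in the extract.
    """
    kept = 0
    for line in filter_lines(src, needle):
        dst.write(line)
        kept += 1
    return kept
```

Because both functions work on file objects, the same code can run against the full extract opened with `open(...)` or against a small in-memory sample while testing.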

python addDatesToJazzPeople.py

This adds birth and death dates to the name directory for people whose dates are not available as structured data but do appear in their abstract. Only the year is used.
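One way to pull years out of an abstract is to look at the first parenthetical that contains a four-digit year, which is where encyclopedia-style abstracts usually put life dates. This is a minimal sketch under that assumption; the regular expressions and the `extract_years` name are illustrative and not taken from addDatesToJazzPeople.py.

```python
import re

PAREN = re.compile(r"\(([^()]*)\)")              # text inside one pair of parentheses
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")  # plausible years, 1500-2099


def extract_years(abstract):
    """Return (birth_year, death_year) from the first parenthetical with a year.

    Either value may be None when it cannot be found.
    """
    for paren in PAREN.finditer(abstract):
        years = YEAR.findall(paren.group(1))
        if years:
            birth = int(years[0])
            death = int(years[1]) if len(years) > 1 else None
            return birth, death
    return None, None
```

The heuristic treats a single year as a birth year, so a parenthetical such as "(died 1991)" would be misread; a production version would need to handle that case.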

python mergeLOCandDBpedia.py

This attempts to merge the two authorities based on name and dates. It produces a number of final name directory sameAs_*.nt files, grouped by the confidence of the match. Documented in mergeLOCandDBpedia.pdf.
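Matching on name and dates with graded confidence could look roughly like the sketch below. The tier names (`high`, `medium`, `low`), the record format, and the normalization rules are all assumptions for illustration; the actual rules are the ones documented in mergeLOCandDBpedia.pdf.

```python
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and sort the name tokens so
    that "Davis, Miles" and "Miles Davis" compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = plain.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(sorted(tokens))


def match_confidence(a, b):
    """Compare two records (dicts with 'name', 'birth', 'death'; years or None).

    Returns 'high', 'medium', 'low', or None when the records do not match.
    """
    if normalize_name(a["name"]) != normalize_name(b["name"]):
        return None
    # Dates that are present on both sides but disagree rule the match out.
    for key in ("birth", "death"):
        if a[key] is not None and b[key] is not None and a[key] != b[key]:
            return None
    agreeing = sum(
        1 for key in ("birth", "death")
        if a[key] is not None and a[key] == b[key]
    )
    if agreeing == 2:
        return "high"
    if agreeing == 1:
        return "medium"
    return "low"  # name matches, but no dates to confirm it
```

Each tier would then map to its own output file, in the spirit of the sameAs_*.nt files described above.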

python filterToJazzData.py

Optional. This script creates an auxiliary file for the sameAs files containing each person's image (if available in Wikipedia) and short abstract.
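The run order above can also be captured in a small driver that stops at the first failing step. This is a convenience sketch, not part of the repository: `run_pipeline` and its parameters are hypothetical, and it assumes the scripts run under the same Python interpreter as the driver and are in the current directory.

```python
import subprocess
import sys

# Script order as listed in this README; the last entry is the optional step.
PIPELINE = [
    "filterDBpediaJazzFile.py",
    "filterLOCskos.py",
    "addDatesToJazzPeople.py",
    "mergeLOCandDBpedia.py",
    "filterToJazzData.py",
]


def run_pipeline(scripts=PIPELINE, include_optional=True,
                 runner=subprocess.check_call):
    """Run each script in order and return the list of scripts that were run.

    `runner` defaults to subprocess.check_call, which raises
    CalledProcessError on a non-zero exit and so halts the pipeline.
    """
    selected = list(scripts) if include_optional else list(scripts)[:-1]
    for script in selected:
        runner([sys.executable, script])
    return selected
```

Calling `run_pipeline(include_optional=False)` would skip filterToJazzData.py. Since filterLOCskos.py is slow and space-hungry, it may still be preferable to run that step separately on a local machine.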

License

The MIT License (MIT)

Copyright (c) 2015 Matt Miller

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
