AVResearcherXL

AVResearcherXL is a tool based on AVResearcher, a prototype aimed at allowing media researchers to explore metadata associated with large numbers of audiovisual broadcasts. AVResearcher allows them to compare and contrast the characteristics of search results for two topics, across time and in terms of content. Broadcasts can be searched and compared not only on the basis of traditional catalog descriptions, but also in terms of spoken content (subtitles) and social chatter (tweets associated with broadcasts). AVResearcher is an ongoing valorisation project at the Netherlands Institute for Sound and Vision.

In addition to exploring audiovisual broadcasts, AVResearcherXL allows users to search and compare different document collections. It also introduces a new design, an option to show relative counts on its timeline visualisation, and multiple views on result sets.

AVResearcherXL is developed by Dispectu B.V.

Requirements

  • Python 2.7
    • pip
    • virtualenv
  • Elasticsearch > 1.1
  • Relational database (e.g. SQLite, MySQL or PostgreSQL)
  • A webserver with WSGI or proxy capabilities

Installing AVResearcherXL

  1. Clone the repository:
$ git clone git@github.com:beeldengeluid/AVResearcherXL.git
$ cd AVResearcherXL
  2. Create a virtualenv, activate it and install the required Python packages:
$ virtualenv ~/my_pyenvs/avresearcherxl
$ source ~/my_pyenvs/avresearcherxl/bin/activate
$ pip install -r requirements.txt
  3. Create a local settings file to override the default settings specified in settings.py. The next steps describe the minimal settings that need to be changed to get the application up and running; have a look at the comments in settings.py for an overview of all possible settings. A consolidated example local_settings.py is shown after this list.
$ vim local_settings.py
  4. When running the application in a production environment, set DEBUG to False.
  5. Set the SECRET_KEY for the installation (this key is used to sign cookies). A good random key can be generated as follows:
>>> import os
>>> os.urandom(24)
'\x86\xb8f\xcc\xbf\xd6f\x96\xf0\x08v\x90\xed\xad\x07\xfa\x01\xd0\\L#\x95\xf6\xdd'
  6. Set the URLs and names of the Elasticsearch indexes:
ES_SEARCH_HOST = 'localhost'
ES_SEARCH_PORT = 9200
ES_LOG_HOST = ES_SEARCH_HOST
ES_LOG_PORT = ES_SEARCH_PORT
ES_LOG_INDEX = 'avresearcher_logs'
  7. Set the options of the indexed collections (COLLECTIONS_CONFIG).
  8. Provide the settings of the SMTP server that should be used to send notification emails during registration:
MAIL_SERVER = 'localhost'
MAIL_PORT = 25
MAIL_USE_TLS = False
MAIL_USE_SSL = False
MAIL_USERNAME = None
MAIL_PASSWORD = None
  9. Provide the URI of the database. The SQLAlchemy documentation provides information on how to structure the URI for different databases. To use an SQLite database named avresearcher.db, set DATABASE_URI to sqlite:///avresearcher.db.
  10. Load the schema into the database configured in the previous step:
$ ./manage.py init_db
  11. Use a WSGI application server (such as uWSGI or Gunicorn) to run the Flask application. Make sure to serve static assets directly through the webserver. A minimal sketch of the WSGI entry point is also shown after this list.
$ pip install gunicorn
$ gunicorn --bind 0.0.0.0 -w 4 wsgi:app
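
To summarise the configuration steps above, a minimal local_settings.py for a single-machine setup might look as follows. The values are the placeholders used in the steps above; COLLECTIONS_CONFIG is omitted because its structure is collection-specific (see the comments in settings.py).

# local_settings.py -- minimal example; adjust all values to your setup
DEBUG = False

# Paste a key generated with os.urandom(24) here
SECRET_KEY = 'replace-with-a-randomly-generated-key'

# Elasticsearch
ES_SEARCH_HOST = 'localhost'
ES_SEARCH_PORT = 9200
ES_LOG_HOST = ES_SEARCH_HOST
ES_LOG_PORT = ES_SEARCH_PORT
ES_LOG_INDEX = 'avresearcher_logs'

# SMTP server used for registration emails
MAIL_SERVER = 'localhost'
MAIL_PORT = 25
MAIL_USE_TLS = False
MAIL_USE_SSL = False
MAIL_USERNAME = None
MAIL_PASSWORD = None

# Database
DATABASE_URI = 'sqlite:///avresearcher.db'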
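
The wsgi:app argument tells Gunicorn to look for an object named app in the repository's wsgi module. As a rough sketch of what such a Flask entry point looks like (the import path below is hypothetical; the actual module in the repository may be organised differently):

# wsgi.py -- minimal sketch of a Flask WSGI entry point
from avresearcher import create_app  # hypothetical import; check the repository layout

app = create_app()  # Gunicorn's 'wsgi:app' resolves to this object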

Running the text analysis tasks

The package contains several text analysis tasks that generate the terms used in the 'descriptive terms' facet. Make sure that the collection you wish to use is fully indexed in Elasticsearch before running the analysis tasks. (A sketch of the equivalent Gensim operations follows the steps below.)

  1. Install the required packages:
$ pip install -r requirements-text-analysis.txt
  2. Tokenize the source text by starting a producer that grabs the text and one or more consumers that perform the actual tokenization and lemmatization:
$ ./manage.py analyze_text tokenize producer "immix_source/*.json" immix_summaries
$ ./manage.py analyze_text tokenize consumer "immix_analyzed/summaries" immix_summaries
  3. Create a (Gensim) dictionary of the tokenized text:
$ ./manage.py analyze_text create_dictionary "immix_analyzed/summaries/*/*.txt" "gensim_data/immix_summaries.dict"
  4. Optionally, prune the dictionary:
$ ./manage.py analyze_text prune_dictionary gensim_data/immix_summaries.dict gensim_data/immix_summaries_pruned.dict --no_below 10 --no_above .10
  5. Construct the corpus in the Matrix Market format:
$ ./manage.py analyze_text construct_corpus "immix_analyzed/summaries/*.tar.gz" gensim_data/immix_summaries_pruned.dict gensim_data/immix_summaries.mm
  6. Construct the TF-IDF model:
$ ./manage.py analyze_text construct_tfidf_model gensim_data/immix_summaries.mm gensim_data/immix_summaries.tfidf_model
  7. Add the top N 'most descriptive' terms to each indexed document:
$ ./manage.py analyze_text index_descriptive_terms "immix_analyzed/summaries/*.tar.gz" gensim_data/immix_summaries_pruned.dict gensim_data/immix_summaries.tfidf_model gensim_data/immix_summaries.tfidf_model 'quamerdes_immix_20140920' 'text_descriptive_terms' 10
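
For intuition, the manage.py tasks above wrap standard Gensim operations. The following self-contained sketch shows the equivalent dictionary, corpus and TF-IDF steps on toy data; it is an illustration, not the project's actual code, and the pruning thresholds are relaxed so the toy corpus survives (the real run above used --no_below 10 --no_above .10).

from gensim import corpora, models

# Toy tokenized documents standing in for the lemmatized broadcast texts
texts = [
    ['radio', 'broadcast', 'news'],
    ['television', 'news', 'weather'],
    ['radio', 'weather', 'report'],
]

# create_dictionary: map each term to an integer id
dictionary = corpora.Dictionary(texts)

# prune_dictionary: drop terms that are too rare or too common
dictionary.filter_extremes(no_below=1, no_above=0.9)

# construct_corpus: bag-of-words vectors (serialized in Matrix Market format above)
corpus = [dictionary.doc2bow(text) for text in texts]

# construct_tfidf_model: weigh terms by how document-specific they are
tfidf = models.TfidfModel(corpus)

# index_descriptive_terms: the top N terms of a document by TF-IDF weight
top_terms = sorted(tfidf[corpus[0]], key=lambda pair: pair[1], reverse=True)[:10]
print([dictionary[term_id] for term_id, weight in top_terms])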

License

Copyright 2014 Dispectu B.V., distributed under the terms of the Apache 2.0 License (see the file LICENSE).