Skip to content

Elasticsearch, MongoDB, Tornado Server, RESTful API, Python, Information Retrieval, Machine Learning, Web Crawler

Notifications You must be signed in to change notification settings

chenxi-shi/Information-Retrieval

Repository files navigation

Information-Retrieval

Keywords

Elasticsearch, MongoDB, Tornado Server, RESTful API, Python, Information Retrieval, Machine Learning, Web Crawler

Screenshots

  • Search web page page_showing.png
  • Elasticsearch result search_result.png
  • Search Interface search_interface.png
  • Search Results search_results.png

Introduction

Homework of my course "Information Retrieval", by Python 3.

  • Instructor: Virgil Pavlu
  • University: Northeastern University
  • Course: CS6200
  1. Elasticsearch Index
  • index more than 80000 documents into elasticsearch
  • optimized index speed to around 15min
  1. Documents Index
  • making my own "elasticsearch"
  • index data in both doc dimension, and term dimension
  • two kinds of dimension index increase the index efficiency.
  1. Web Crawler
  • topic: maritime accident
  • Breadth-first search to iterate all pages in early waves.
  • topic module application for accurately checking the relevance of pages
  • in total 36000 pages, more than 50% is relevant to topic "maritime accident"
  • distinguish wanted pages by header content type before downloading it.
  • applied network session to restore cookies for fast and low-duty re-access.
  • sort domains according to last accessing time, so that multi threads can access different domains to speed up crawling
  • normalize href links in good method, to reduce page drop rate
  1. Web Graph Computation
  • applied pagerank and HITS to evaluate the page in whole page set
  • regard in & out links of pages as directed network graph
  • web graph computation is a kind of admitting of idea “Cream rises to the top”:
  • good authority page can be referenced more and more,
  • good hub page digs more and more good authority pages.
  1. Web Interface Relevance Assessments
  • applied Tornado Server as a web server, which can be accessed remotely
  • server communicates with elasticsearch database for searching and extracting data
  • MongoDB restores page info to speed up web server
  • made python based html template to create search result page automatically and flexibility.
  • set log in permit to filter users
  • applied application layer info to transfer parameter between pages.
  • after getting manual evaluation, apply query compute R-precision, Average Precision, nDCG, precision and recall and F1 to evaluate search result coming from page set.
  • drew precision & recall graphics for the visualized cooperation between search results distribution and page relevant true values.
  1. Machine Learning for IR
  • with better understanding of elasticsearch, re-index the dataset, which set new analyzer with standard tokenizer, lowercase, and porter2 stemmer.
  • set nested mapping to restore features details
  • distinguish documents by different elasticsearch types
  • for a dataset with labeled data in it, split it by 80% for training, 20% for testing
  • tried different combination of feature to increase the performance of machine learning module
  • applied different machine learning modules including: Liner Regression, LogisticRegression, svm, svm rank