µoogle

The µoogle project for GCED-AP2 (2018 edition)

Documentation

This page describes the µoogle project for the AP2 course at GCED. µoogle stands for "micro Google": your task is to implement the core of a simple system that crawls web pages and answers queries from users by reporting which pages contain all the words they search for.

Demo

This screenshot shows the home page of the web server:

Home page

Looks familiar?

This screenshot shows the results of querying for "the star":

Search page

Architecture

The architecture of the system is as follows:

  • The server.py module is the web server that users use to get answers to their queries.

    To start the web server, simply run ./server.py. Use the --help flag to get more options.

    You can interact with the web server by opening the http://localhost:5000 URL in your browser.

    Additionally, the web server also offers some sample files under the static directory. Point your browser to http://localhost:5000/static/index.html to browse them starting from their root page. This figure shows the relations between these pages:

    Graph

    The server.py module is already implemented; do not modify it.

  • The answer.py module is a command line interface tool that developers can use to get answers to their queries. For instance, ./answer.py -q "the star" will deliver the indexed web pages that match the query:

    [{'score': 100,
      'title': 'sample site',
      'url': 'http://localhost:5000/static/index.html'},
     {'score': 100,
      'title': 'twinkle twinkle little star',
      'url': 'http://localhost:5000/static/twinkle.html'}]
    

    Use ./answer.py --help to get more options.

    This module is already implemented; do not modify it.

  • The crawler.py module is the command line interface tool used to crawl the web and index its pages. To execute it, use a command such as ./crawler.py --url http://localhost:5000/static/index.html --maxdist 4, which specifies a starting page and a maximum crawling distance. Use ./crawler.py --help to get more options.

    This module is already implemented; do not modify it.

  • The util.py module contains several utility functions that are used in the other modules.

    This module is already implemented; do not modify it.

  • Finally, the moogle.py module contains the core of the application and is used by server.py, answer.py and crawler.py, which are simple wrappers around it. This is the only module you have to modify.

System overview

The system works in two phases:

  1. In the first phase, the crawler visits some web pages and saves some information about them. This information is a Python object referred to as the database, and it is called db throughout the project. By default, the database is saved in the moogle.dat file, but this file can be changed using the --database flag of the server.py, answer.py and crawler.py modules.

  2. In the second phase, the web server loads the database and processes queries from users. Alternatively, the queries can be processed by the answer.py module, which is more convenient for debugging.
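
For example, a full run of the two phases, using the commands described above, looks like this:

./crawler.py --url http://localhost:5000/static/index.html --maxdist 4
./answer.py -q "the star"

The first command writes the moogle.dat database; the second loads it and prints the matching pages.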

Your task

Your task is to implement the moogle.py module so that the whole project works as expected. You may modify this module at will, but you must implement a few functions with the interfaces described below. The moogle.py module has three parts:

Common part

This part is meant to define all the types and functions you need.

It must also define an authors() function that returns a string with the name(s) of the author(s) of the assignment. This information is displayed on the home page. Simply modify this function to include your name(s)!
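
For example (the names below are placeholders):

def authors():
    """Returns a string with the name(s) of the author(s) of this assignment."""
    return "Jane Doe, John Smith"  # placeholder names; replace with your own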

Crawler part

This part is meant to define all the types and functions you need in order to perform the crawling phase.

It must define a function crawler(url, maxdist) that crawls the web starting from url, following links up to a distance of maxdist from the starting page, and returns a database (it is up to you to define the type and value of this database).

This part already defines a store(db, filename) function that writes a database db in file filename using pickle.
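
As an illustration only, here is a minimal breadth-first sketch of crawler. It assumes one hypothetical database shape (a dict mapping each crawled URL to its title and set of cleaned words; see the sketch at the end of the "In short" section) and, like the snippets in the "Useful information" section, omits error handling:

import urllib.request
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup

from util import clean_words


def crawler(url, maxdist):
    """Visits the pages reachable from url in at most maxdist link steps
    (breadth-first) and returns the database: here, a dict mapping each
    URL to its title and the set of cleaned words it contains."""
    db = {}
    queue = deque([(url, 0)])
    visited = {url}
    while queue:
        current, dist = queue.popleft()
        page = urllib.request.urlopen(current).read()
        soup = BeautifulSoup(page, "html.parser")
        db[current] = {
            "title": soup.title.text,
            "words": set(clean_words(soup.get_text()).split()),
        }
        if dist < maxdist:
            for link in soup.find_all("a"):
                href = link.get("href")
                if href:  # skip anchors without an href attribute
                    target = urljoin(current, href)
                    if target not in visited:
                        visited.add(target)
                        queue.append((target, dist + 1))
    return db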

Answer part

This part is meant to define all the types and functions you need in order to perform the answer phase.

You should implement the answer(db, query) function which, given a database and a query (that is, a string of cleaned words), returns a list of pages for the given query. In this list, each page is a map with three fields:

  • title: its title
  • url: its URL
  • score: its score

The list must be sorted by score in descending order. The score of a page is a real number that describes how relevant that page is to the query; higher scores imply more relevance and are shown first. Scoring pages is not mandatory (it is an extra): you can simply return all the pages with the same score.

The answer.py module just prettifies this result and outputs it.

This part already defines a load(filename) function that reads a database stored in file filename using pickle and returns it.
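
A minimal sketch of answer, under the same hypothetical database shape as in the crawler sketch above, giving every matching page the same score (since scoring is optional):

def answer(db, query):
    """Returns the pages that contain all the words of the query
    (a string of cleaned words), sorted by score in descending order."""
    words = set(query.split())
    pages = [
        {"title": info["title"], "url": url, "score": 100}
        for url, info in db.items()
        if words <= info["words"]  # every query word appears in the page
    ]
    return sorted(pages, key=lambda page: page["score"], reverse=True)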

In short:

You have to:

  • Implement the authors() function (that's easy!!!)
  • Decide which data structure will represent your database
  • Implement the crawler() function
  • Implement the answer() function
  • (Also, write a README.md file with the documentation of your work)

To do this, you will probably want to define your own types and auxiliary functions. For instance, one possible database shape is sketched below.
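
One possible (hypothetical) representation, also assumed by the crawler and answer sketches above, is a dictionary that maps each crawled URL to its title and the set of cleaned words on the page (the word sets below are illustrative):

db = {
    "http://localhost:5000/static/index.html": {
        "title": "sample site",
        "words": {"sample", "site", "twinkle", "star"},
    },
    "http://localhost:5000/static/twinkle.html": {
        "title": "twinkle twinkle little star",
        "words": {"twinkle", "little", "star"},
    },
}

With this shape, crawling fills the dictionary page by page, and answering a query reduces to set containment checks on words.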

The utility module

The util.py module contains two functions that will help you process the words in web pages.

The clean_word function returns (in lowercase) the longest prefix of a word made of Latin Unicode characters. This function gives you a way to "normalize" words written with Latin letters. Here are some examples:

clean_word("Jordi") -> "jordi"
clean_word("Jordi!") -> "jordi"
clean_word("LlorençGabarróNúñezÅrmstrong<>lala") -> "llorençgabarrónúñezårmstrong"
clean_word("AlfaBetaAlefChinaαßℵ中xyz") -> "alfabetaalefchina"

The clean_words function works the same way, but applies to a string that may contain many words. Here are some examples:

clean_words("Jordi") -> "jordi"
clean_words("Jordi Jordi!    Llorenç#Martí") -> "jordi jordi llorenç"

Useful information

The following code snippet reads a page through its URL and prints its title, its text and all its links. Error handling has been omitted.

import urllib.request
from bs4 import BeautifulSoup

url = "http://localhost:5000/static/index.html"
response = urllib.request.urlopen(url)     # fetch the page
page = response.read()                     # raw HTML bytes
soup = BeautifulSoup(page, "html.parser")  # parse the HTML
print(soup.title.text)                     # the page title
print(soup.get_text())                     # all the visible text
for link in soup.find_all("a"):            # every anchor in the page
    print(link.get("href"), link.text)

Warning! An older version of this document used .string rather than .text in the previous code. Update your code to use .text.

The following code snippet shows how to save some data (whatever its type) into a file so that it can be retrieved later:

import pickle

data = {1: 2, 2: 3}
with open("somefile.dat", "wb") as f:  # binary mode is required by pickle
    pickle.dump(data, f)

And this code shows how to retrieve it back:

import pickle

with open("somefile.dat", "rb") as f:
    data = pickle.load(f)

You can consider using urllib.parse.urljoin to combine absolute and relative URLs. Here are some examples:

>>> import urllib.parse
>>> urllib.parse.urljoin("https://jutge.org/problems", "P12345.html")
'https://jutge.org/P12345.html'
>>> urllib.parse.urljoin("https://jutge.org/problems/", "P12345.html")
'https://jutge.org/problems/P12345.html'
>>> urllib.parse.urljoin("https://jutge.org/problems", "../dashboard.html")
'https://jutge.org/dashboard.html'
>>> urllib.parse.urljoin("https://jutge.org/problems", "https://google.cat/index.html")
'https://google.cat/index.html'

Install dependencies

In order to install the Python libraries you need, please run this command:

pip3 install -r requirements.txt

Instructions

You may do this project alone or in teams of two. If you work as a team, the workload of both members must be similar, and the final result is the responsibility of both. Each team member must know what the other member has done.

Submission

You must submit the assignment through the "Pràctiques" section of the Racó. Only one student per team has to submit, and the submission must contain a ZIP file with all the files needed to run the assignment, but without moogle.dat.

The ZIP must include a README.md file with all your documentation (design, extras, instructions, examples...) in Markdown format.

The use of tab characters in the code is forbidden. In addition, your moogle.py file will be graded partly on following the PEP8 style rules. You can use the pep8 package or http://pep8online.com/ to make sure you follow these rules.

The submission deadline is Monday, June 11 at 8:00 in the morning (2018-06-11 08:00 CEST).

Tips

  • If you want to implement optional extras to raise your grade, you may do so, but we recommend discussing them with your instructors first.

  • You probably do not need to define your own classes. Tuples, lists, dictionaries and sets should be enough.

  • If you want to use an "exotic" module, check with your instructors first.