
This project builds an ad-hoc information retrieval system on the DPR Wiki100 dataset with PyTerrier. NLTK handles query processing, including tokenization and stemming. Ranking uses BM25, with optimizations to improve its performance. The system includes a minimal tkinter-based user interface.


mb-emektar/wiki-search



wiki-search

As the volume of digital information grows, finding the relevant part of it efficiently becomes harder. Whether for academic research, professional work, or personal curiosity, the ability to quickly locate relevant information in a very large body of content is crucial. This problem is the primary motivation for this information retrieval project.

This course project addresses one specific aspect of that challenge: optimizing ad-hoc information retrieval. The system is built on the DPR Wiki100 dataset with the PyTerrier framework. Queries are processed with the NLTK library, which provides core functionality such as tokenization and stemming. A minimal user interface built with the tkinter library makes the system straightforward to use.
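As an illustration of the query processing step, here is a minimal sketch of tokenization followed by stemming with NLTK. The README does not say which tokenizer or stemmer the project uses, so `wordpunct_tokenize` and `PorterStemmer` are assumptions chosen because they need no extra data downloads. The helper name `preprocess_query` is hypothetical.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Assumption: Porter stemming; the project may use a different stemmer.
_stemmer = PorterStemmer()


def preprocess_query(query):
    """Lowercase, tokenize, drop punctuation tokens, and stem a query string."""
    tokens = wordpunct_tokenize(query.lower())
    # Keep only alphanumeric tokens so punctuation never reaches the ranker.
    return [_stemmer.stem(token) for token in tokens if token.isalnum()]


if __name__ == "__main__":
    print(preprocess_query("Running, systems!"))
```

The resulting list of stems could then be joined with spaces and passed to the retrieval component as the query text.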

At its core, the project targets a narrow but important problem: improving the performance of the baseline BM25 ranking scheme. Existing retrieval methods provide a baseline, but they leave considerable room for improvement. The goal is to improve both the effectiveness and the efficiency of a simple information retrieval system through a series of enhancements and optimizations.
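For readers unfamiliar with BM25, the sketch below shows how a textbook BM25 score can be computed over a small in-memory collection. It is not PyTerrier's implementation and not this project's code. The function name `bm25_scores`, the defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    `docs` is a non-empty list of token lists. Returns one score per document.
    """
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Non-negative IDF variant (assumption; other variants exist).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        ["wiki", "search", "engine"],
        ["cooking", "recipes"],
        ["search", "search", "wiki"],
    ]
    print(bm25_scores(["search"], corpus))
```

The parameters `k1` (term-frequency saturation) and `b` (length normalization) are the usual tuning knobs when trying to improve on a BM25 baseline.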

What makes this project interesting is its practical importance. Fast, accurate access to relevant information is valuable in many fields. The system processes a large dataset (almost 5 GB) and still returns results quickly and accurately, which adds to its practical value.

The rest of this report describes the methodology used to build the information retrieval system, detailing its components and techniques. It then analyzes the performance gains achieved through the optimizations, supported by empirical results and comparative evaluations. Finally, it discusses the implications of the findings and outlines directions for future work.


Languages

  • Python 100.0%