{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 5: Text Retrieval\n", "\n", "\n", "To complete the exercise, follow the instructions and complete the missing code and write the answers where required. All points, except the ones marked with **(N points)** are mandatory. The optional tasks require more independet work and some extra effort. Without completing them you can get at most 75 points for the exercise (the total number of points is 100 and results in grade 10). Sometimes there are more optional exercises and you do not have to complete all of them, you can get at most 100 points.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "In this exercise you will implement some indexing operations used to retrieve information from a corpus of text documents. As the size of readily available text documents grows beyond all measures, methods for fast and user-friendly querying of information from text files are needed." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# General import\n", "import re, math, sys\n", "import numpy as np\n", "\n", "# NLTK imports\n", "import nltk\n", "from nltk.corpus import stopwords, gutenberg\n", "from nltk.tokenize import word_tokenize, sent_tokenize\n", "\n", "PYTHONIOENCODING=\"UTF-8\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assignment 1: Using nltk library and corpuses\n", "\n", "In this assignment, you will learn how to use nltk library and how to load and preprocess a corpus of documents." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "a) The nltk library includes mechanisms for loading a number of different text corpuses. Load the Gutenberg corpus by using the function ``nltk.download()``. Then, with the help of the NLTK Book, familiarize yourself with the contents and structure of the corpus. If at any point of the exercise you find the Gutenberg corpus too small or otherwise unsuitable for your needs you are welcome to try other corpuses available on the internet." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Download Gutenberg corpus\n", "nltk.download('gutenberg')\n", "# Download Punkt Tokenizer Models\n", "nltk.download('punkt')\n", "# Download Stop Words\n", "nltk.download('stopwords')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import gutenberg\n", "\n", "# TODO: Show structure of Gutenberg corpus\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# TODO: familiarize/experiment with the content of Gutenberg corpus\n", "# e.g. show the number of words, display first few words, ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "b) Preprocess the raw data of each document in the Gutenberg corpus using the nltk inbuilt tokenizer. You can use the function ``nltk.word_tokenize()``. Remove stop words and punctuation to further reduce the amount of data you will need to process later on." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# TODO\n", "# Write a function that recives an array of words obtained from tokenized file\n", "# as an input parameter and returns an array of filtered words" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# TODO\n", "# Preprocess raw data of documents in Gutenberg corpus (use build in tokenizer 'word_tokenize')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "c) **(10 points)** Write your own tokenizer for preprocessing. You can use regular expressions. Show the difference in the results from the tokenizer you used in the previous task. What kind of tokens did you keep/ignore relative to the in-built tokenizer?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# TODO: (1) Write your own tokenizer (2) Compore results with the inbuilt tokenizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assignment 2: Indexing\n", "\n", "In order to simplify and speed up operations on sets of documents, they need to be indexed. This allows a fast look up if and where a query word or phrase appears in our corpus." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "a) Build an inverted index. For each unique token appearing in your corpus, make a list of the documents it appears in." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# TODO: Write an inverted index structure\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "b) Index Gutenberg corpus documents and make some queries using Boolean logic (operation AND is an intersection of lists, operation OR is an intersection). Building the entire index takes a lot of time, test your structure on a smaller set (either just a few 100 characters from each document or just three documents), use the entire set when you know that everything works.\n", "\n", "For query \"encompass AND furies\" the system should return documents: 'whitman-leaves.txt', 'milton-paradise.txt'\n", "\n", "For query \"highbury OR behoves\" the system should return documents: 'austen-emma.txt', 'milton-paradise.txt'\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# TODO: Build an index and make some queries\n", "#for k, v in index.index.items():\n", "# if len(v) < 10: \n", "# print(k)\n", "# encompass, behoves, furies\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "c) **(5 points)** Extend your index with a stopword list and a stemmer for faster and more informative retrieval. Since we are assuming that documents are written in English, you can use `nltk.corpus.stopwords.words('english')` to get a list of stopwords. You can use Porter2, aslo known as Snowball stemmer `from nltk.stem.snowball.EnglishStemmer`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Build index with stopword support and a stemmer, demonstrate its use with some examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "d) **(10 points)** To allow querying of tokens that occur together (i.e. common phrases), a positional index can be used. Build a positional index on your corpus. This is an extension of the inverted index where each of the list elements containing the document index also stores a list of positions in the document where the token appears. 
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# TODO: Build positional index, demonstrate its use with some examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assignment 3: Text statistics\n", "\n", "a) The relevance of the documents in your corpus to the user's query can be measured by the term frequency, i.e. how many times the query term appears in each document. Extend your index structure so that it stores the number of appearances of each token in each document as well as the number of documents in which a certain term appears." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# TODO: Count the number of appearances of each token in each document\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The absolute number of appearances is a biased measure of relevance (common terms and long documents dominate the raw counts), therefore a different metric called **TF-IDF** (short for term frequency\u2013inverse document frequency) is commonly used to rank the relevance of documents containing a query. TF-IDF is computed as follows:\n", "\n", "\\begin{equation}\n", "\\mathrm{tfidf}_{t,d} = \\mathrm{tf}_{t,d} \\cdot \\mathrm{idf}_t,\n", "\\end{equation}\n", "\n", "where $\\mathrm{tf}_{t,d}$ is the frequency of the term $t$ in document $d$ and $\\mathrm{idf}_{t}$ is defined as\n", "\n", "\\begin{equation}\n", "\\log_{10}{\\frac{N}{\\mathrm{df}_t}},\n", "\\end{equation}\n", "\n", "where $N$ is the number of documents in the corpus and $\\mathrm{df}_t$ is the number of documents of the corpus in which the term $t$ appears.\n", "\n", "Implement a system that returns the $5$ most relevant documents from the corpus given a query. Note that your queries can contain more than one word. The score in that case is calculated as\n", "\n", "\\begin{equation}\n", "s(q,d) = \\sum_{t \\in q}{\\mathrm{tfidf}_{t,d}},\n", "\\end{equation}\n", "\n", "where $q$ is your query." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "b) Write a function that returns the most relevant documents for a given query.\n", "\n", "For query \"lie AND reason AND book\" the five most relevant documents should be: bible-kjv, melville-moby_dick, whitman-leaves, edgeworth-parents, chesterton-brown." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# TODO: Return the 5 most relevant documents from the corpus for a given query\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "c) **(10 points)** Implement a system for handling typographical errors in the user's queries. The choice of the method is up to you. You need to show that your system returns relevant results for misspelled queries. Usually such systems use similarity to the words available in the dictionary.\n", "\n", "Demonstrate your spelling correction on a few examples; you can use the Gutenberg dataset and assume that the language is English." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# TODO: Handle typographical errors in the user's query and return relevant results for misspelled queries."
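] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a minimal sketch of one possible correction strategy (an assumption for illustration, not the required solution): map each query word to the vocabulary word with the smallest edit distance, then run the usual retrieval on the corrected query. The toy vocabulary and the helper name `correct_word` are made up for this example; in practice the vocabulary would be the set of tokens in your index." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch (assumed approach, not the required solution):\n", "# replace each unknown query word with the closest word in the vocabulary.\n", "import nltk\n", "\n", "toy_vocabulary = {'lie', 'reason', 'book', 'whale', 'furies'}\n", "\n", "def correct_word(word, vocab):\n", "    # Keep known words as they are; otherwise pick the vocabulary word\n", "    # with the smallest edit distance to the (misspelled) query word.\n", "    if word in vocab:\n", "        return word\n", "    return min(vocab, key=lambda v: nltk.edit_distance(word, v))\n", "\n", "print([correct_word(w, toy_vocabulary) for w in ['reson', 'bok', 'whalle']])"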
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 2 }