wikiRank

quickly parse any complicated document for easy answers

Why wikiRank?

Can't understand a journal article? Feeling lost in a textbook chapter? Wikipedia is your friend — but it's so big! wikiRank analyzes your text to determine the most relevant Wikipedia pages using the vector space model. Happy learning!


Instructions

  1. Paste the text you want to study into the top text box
  2. Enter the relevant keywords separated by commas
  3. Press the button and enjoy the Wikipedia article rankings

Example

Here's an example use case! Suppose I want to read Gerard Salton's celebrated paper on the vector space model, but I first need some background knowledge from Wikipedia. I can do this by pasting a query text into the top text box. In this case, my query text could be the paper's abstract:

In a document retrieval, or other pattern matching environment where stored entities (documents) are compared with each other or with incoming patterns (search requests), it appears that the best indexing (property) space is one where each entity lies as far away from the others as possible; in these circumstances the value of an indexing system may be expressible as a function of the density of the object space; in particular, retrieval performance may correlate inversely with space density. An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents. Typical evaluation results are shown, demonstrating the usefulness of the model.


— "A vector space model for automatic indexing" (Salton, Wong, & Yang 1975)

I can then paste into the keywords box the following keywords from the paper:

automatic information retrieval, automatic indexing, content analysis, document space

By gathering the top Wikipedia search results for each of these keywords, our model builds the corpus of articles over which it ranks the pages most relevant to my query (a sketch of this step follows the rankings below). After hitting the rank button, wikiRank will recommend these top pages (and more!) for perusal:

  1. Information retrieval
  2. Vector space model
  3. Subject indexing
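
Under the hood, this corpus-gathering step queries the Wikipedia search API once per keyword and pools the resulting page titles. Here's a minimal sketch in Python using the public MediaWiki search endpoint; the function name and the per-keyword result limit are illustrative assumptions, not wikiRank's actual code.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def gather_corpus(keywords, results_per_keyword=5):
    """Pool the top Wikipedia search results for each keyword into one corpus."""
    titles = []
    for keyword in keywords:
        response = requests.get(API_URL, params={
            "action": "query",     # standard MediaWiki search query
            "list": "search",
            "srsearch": keyword,
            "srlimit": results_per_keyword,
            "format": "json",
        })
        for hit in response.json()["query"]["search"]:
            if hit["title"] not in titles:  # keep the corpus duplicate-free
                titles.append(hit["title"])
    return titles

keywords = ["automatic information retrieval", "automatic indexing",
            "content analysis", "document space"]
print(gather_corpus(keywords))
```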

Vector space model

In the vector space model, each document is represented by a vector with one component per term, and each component's value is that term's tf-idf weight.

What is a term?

A term is a fundamental unit of text analysis. In practice, the terms of a document are its words (e.g. the terms of "information retrieval" are "information" and "retrieval"). However, some parsing should still occur: we do not want to distinguish between letter cases, nor do we want to include spaces, punctuation, or numbers in our terms. Our model also pre-processes your text by filtering out stop words, a fixed list of words picked by experts as insignificant for ranking a document's relevance (e.g. "a", "the", "it").
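
To make that pre-processing concrete, here's a minimal sketch in Python; the stop-word list below is a tiny illustrative placeholder, not the actual list our model uses.

```python
import re

# Tiny illustrative stop-word list; the real list is much longer.
STOP_WORDS = {"a", "an", "the", "it", "of", "and", "or", "in", "to", "is", "for"}

def tokenize(text):
    """Split text into terms: lowercase, letters only, stop words removed."""
    words = re.findall(r"[a-z]+", text.lower())  # drops punctuation and numbers
    return [word for word in words if word not in STOP_WORDS]

print(tokenize("A Vector Space Model for Automatic Indexing (1975)"))
# ['vector', 'space', 'model', 'automatic', 'indexing']
```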

What is a tf-idf vector?

The term frequency or tf of a term is the number of times it appears in a given document. For example, the tf of the term "frequency" in the document "term frequency inverse document frequency" is 2.

Similarly, the document frequency or df of a term is the number of documents it appears in. For example, the df of "frequency" given 3 documents "term frequency", "inverse document frequency", and "bazinga!" is 2.
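
Both counts are easy to compute. A minimal sketch, assuming documents have already been tokenized into term lists as above:

```python
def term_frequency(term, document):
    """tf: how many times `term` appears in `document` (a list of terms)."""
    return document.count(term)

def document_frequency(term, documents):
    """df: the number of documents containing `term` at least once."""
    return sum(1 for document in documents if term in document)

docs = [["term", "frequency"],
        ["inverse", "document", "frequency"],
        ["bazinga"]]
print(term_frequency("frequency", ["term", "frequency", "inverse", "document", "frequency"]))  # 2
print(document_frequency("frequency", docs))  # 2
```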

The inverse document frequency or idf of a term is defined as follows: $$\log_{10}\left(\frac{N}{1 + df}\right)$$ where N is the total number of documents in the corpus. The log is a heuristic scaling factor, and we add 1 to the denominator to avoid dividing by 0 when a term appears in no documents.

The tf-idf score of a term in a document is the product of its tf and its idf. To get the components of a document's vector, we calculate this score for every term in the vocabulary of our document set.
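
Putting the two together, here's a sketch that builds one tf-idf vector per document over a shared vocabulary. It recomputes tf and df inline so the snippet stands alone; note that with the 1 + df smoothing, a term appearing in every document gets an idf at or below zero, which down-weights uninformative terms.

```python
import math

def tfidf_vector(document, documents, vocabulary):
    """One component per vocabulary term: tf * idf, with idf = log10(N / (1 + df))."""
    n_docs = len(documents)
    vector = []
    for term in vocabulary:
        tf = document.count(term)                    # term frequency
        df = sum(1 for d in documents if term in d)  # document frequency
        vector.append(tf * math.log10(n_docs / (1 + df)))
    return vector

docs = [["term", "frequency"],
        ["inverse", "document", "frequency"],
        ["bazinga"]]
vocabulary = sorted({term for doc in docs for term in doc})
vectors = [tfidf_vector(doc, docs, vocabulary) for doc in docs]
print(vocabulary)  # ['bazinga', 'document', 'frequency', 'inverse', 'term']
print(vectors[0])  # tf-idf components of ["term", "frequency"]
```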

What about similarity?

We measure the similarity between two documents as the cosine of the angle between their tf-idf vectors: $$\text{sim}(\vec{u},\vec{v})=\frac{\vec{u}\cdot\vec{v}}{\lVert\vec{u}\rVert\,\lVert\vec{v}\rVert}$$ This is aptly called the cosine similarity, and these are the scores displayed in the page rankings.
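
A minimal sketch of that score, usable on the tf-idf vectors built above or any two equal-length vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # assumes neither vector is all zeros

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0 (parallel)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
```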

made with 💜 by Anthony Ge, Charles Wang, and Jason Liu. source code is available on GitHub!