- Why don’t we use grep for information retrieval?
- Why don’t we use a relational database for information retrieval?
- Google does not always interpret the query as a boolean conjunction of its terms. Give examples.
- What is a term-document incidence matrix?
- In constructing the index, which step is most expensive/complex?
- Complex Boolean retrieval systems like Westlaw use many operations that go beyond strictly Boolean operators. Name some of them.
- Define the number of types/tokens in a sentence.
- An IR system can normalize terms by defining equivalence classes. E.g., “suit” and “suits” could be in an equivalence class. What is the limitation of this model in IR?
- What is tokenization?
- Give an example in English where tokenization is nontrivial.
- What is a stop list?
- What is lemmatization? Give an example.
- What is stemming? Give an example that is not also a lemmatization example.
- Name a particular stemmer.
- Give an example of a pair of words that a typical stemmer would put in one equivalence class and for which we would expect improved performance of the IR system.
- Give an example of a pair of words that a typical stemmer would put in one equivalence class and for which we would expect decreased performance of the IR system.
- Name two data structures that support phrase queries.
- Name a data structure that supports proximity queries.
- Which data structures are typically used for locating the entry for a term in the dictionary?
- Which data structure is best used for locating the entry for a term in the dictionary if the collection is static?
- Which data structure is best used for locating the entry for a term in the dictionary if prefix search must be supported?
- Which special strings are stored in the permuterm index for the word “car”?
- What sequence of letters is looked up in the permuterm index for each of the following wildcard queries: X, X*, *X, *X*, X*Y?
- What is the difference between the regular inverted index used in IR and the k-gram index?
- Give an example of a query that cannot be corrected using isolated-word spelling correction.
- Define Levenshtein edit distance.
- Define Damerau-Levenshtein edit distance.
- Give the formula for Zipf’s law.
- Give the formula for Heaps’ law.
- What is the feast or famine problem?
- Define the Jaccard coefficient.
- What is the bag of words model?
- What is the advantage of idf weighting compared to inverse-collection-frequency weighting?
- What is the tf-idf weight of term t in document d?
- What is the relationship between term frequency and collection frequency?
- Why don’t we use Euclidean distance of tf-idf vectors to rank documents with respect to a query?
- Write down the formula for cosine similarity between query q and document d.
- Explain the notation ddd.qqq
- What is the advantage of pivot normalization compared to regular cosine normalization?
- What is document-at-a-time processing?
- What index organization does document-at-a-time processing require?
- What is term-at-a-time processing?
- What data structure does term-at-a-time processing require that document-at-a-time processing does not require?
- What is a tiered inverted index?
- Name two criteria that can be used for deciding whether to put a document d in tier 1 of a tiered index.
- Name three criteria for evaluating a search engine.
- What are the components of an information retrieval benchmark?
- What is the difference between the concepts of query and information need?
- Define precision.
- Define recall.
- Define F1.
- What is the harmonic mean of two numbers?
- Why is F1 defined as the harmonic mean?
- What is an easy way of maximizing the recall of an IR engine?
- What is an easy way of maximizing the precision of an IR engine?
- What is a precision-recall curve?
- An evaluation benchmark ideally should tell us for any document-query pair whether the document is relevant to the query. Why is Cranfield the only collection that actually satisfies this desideratum?
- Define the kappa measure.
- What are the minimum and maximum values of the kappa measure?
- What is the significance of kappa being less than / greater than 0?
- What is A/B testing?
- What does marginal relevance measure?
- What distinguishes a dynamic from a static summary?
- What is a simple heuristic for computing a dynamic summary if you can display n characters?
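
Several of the definitions asked for above (Jaccard coefficient, Levenshtein edit distance, precision, recall, F1) can be checked against a small Python sketch. The function names and the toy inputs are illustrative assumptions and not part of the question list; the empty-set conventions are one possible choice among several.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 (harmonic mean of the two)."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # relevant documents that were retrieved
    p = tp / len(retrieved) if retrieved else 0.0  # assumption: 0 when nothing is retrieved
    r = tp / len(relevant) if relevant else 0.0    # assumption: 0 when nothing is relevant
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


if __name__ == "__main__":
    print(jaccard({"a", "b"}, {"b", "c"}))
    print(levenshtein("kitten", "sitting"))
    print(precision_recall_f1([1, 2, 3, 4], [2, 4, 6]))
```

With the toy inputs, two of the four retrieved documents are relevant and two of the three relevant documents are retrieved, so precision is 1/2, recall is 2/3, and F1 is 4/7.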