cocaine in my rockit brain

Inverse Document Frequency: that’s what we are discussing intensively in the context of SEO.

Inverse Document Frequency Weighting-Stanford NLP-Professor Dan Jurafsky and Chris Manning

That is what we are working on: finding a more scientific way to explore it.

What is the information? What is in it? Listen and/or read:

In this segment I’m going to introduce another score that’s used for ranking the matches of documents to a query, and that is to make use of this notion of document frequency. In particular, we always use it in reverse, so it’s normally referred to as inverse document frequency weighting.

The idea behind making use of document frequency is that rare terms are more informative than frequent terms. If you remember, earlier on we talked about stop words, which were words like “the”, “and”, “to” and “of”, and the idea was that these words were so common, so semantically empty, that we didn’t have to include them in our information retrieval system at all. They had no effect on how good a match a document was to a query.

Well, that’s maybe not quite true, but there’s some truth in it, and in particular it seems that in general very common words aren’t very determinative of the match between a document and the query, whereas rare words are more important. So consider a term in the query that is very rare in the collection, perhaps something like “arachnocentric”. Well, if someone has typed that word into the query and we can find a document that contains the word “arachnocentric”, it’s very likely to be a document that the user would be interested in seeing. So we want to give a high weight in our match score to rare terms like “arachnocentric”.

On the other hand, frequent terms are less informative than rare terms. So, consider a term that is frequent in the collection, like “high”, “increase” or “line”, which might occur in lots of documents. Well, a document containing such a term is more likely to be relevant than a document that doesn’t, if the query contains one of those terms, but it’s not such a sure indicator of relevance. So for frequent terms we want to give positive weights for a document matching a term in the query, but lower weights than for rare terms.

The way we’re going to go about doing that is by making use of this notion of document frequency scores. So what exactly is that? Well, the document frequency of a term is the number of documents that contain the term. So what this means is that we are looking at the entire collection, so maybe the collection is 1,000,000 documents, and if 10 documents have this word we’re saying that the document frequency is 10. That’s just counting the number of documents in which the term occurs, regardless of the number of times it occurs in each; that’s something I will come back to.
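As a rough illustration of that definition, here is a minimal Python sketch. The function name `document_frequency`, the whitespace tokenization and the three-document toy collection are all assumptions made up for this example; they are not from the lecture.

```python
def document_frequency(term, documents):
    """Count how many documents contain `term` at least once.

    Repeated occurrences inside one document are deliberately ignored.
    Tokenization is a naive lowercase whitespace split (an assumption).
    """
    return sum(1 for doc in documents if term in doc.lower().split())


# Hypothetical toy collection, invented for illustration only.
docs = [
    "the spider spun the web",
    "the web is the largest web of documents",
    "the cat sat",
]

df_web = document_frequency("web", docs)  # "web" appears twice in one document, counted once
df_the = document_frequency("the", docs)
print(df_web, df_the)
```

Note that the second document mentions “web” twice but still contributes only 1 to the document frequency.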

So, document frequency is an inverse measure of the informativeness of the term, and we also note that the document frequency can be no larger than the number of documents in the collection. So putting that together, this gives us the measure of inverse document frequency, where we start with the document frequency and use it as the denominator, and the numerator, N here, is the number of documents. So for a word that appears in just 1 document this ratio will be N, and for a word that appears in every document its value will be 1. So it is some value between 1 and N. And so then what we do after that is we take the log of it, and the log is used to dampen the effect of inverse document frequency. The idea again is that if you just used the absolute score, that would be too strong a factor.
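Written out, the measure described in words above looks like this. The symbols idf_t and df_t are notation chosen here for the inverse document frequency and document frequency of a term t; base 10 is an illustrative choice of logarithm.

```latex
\mathrm{idf}_t = \log_{10}\frac{N}{\mathrm{df}_t},
\qquad 1 \le \mathrm{df}_t \le N
```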

Now in this computation, as you can see, I’ve used log to the base 10, and that’s very commonly used, but actually it turns out that what we use as the base of the log isn’t really important. Okay, let’s go through a concrete example, where again we are going to suppose that the size of our document collection is 1,000,000 documents. So if we take an extremely rare word like “Calpurnia”, which let’s say occurs in just 1 document, well then what we’re going to be doing is taking 1,000,000, the number of documents, dividing it by 1 and then taking the log of that, which with log to the base 10 will be 6.

If we take a somewhat more common word that occurs in maybe 100 documents, then the inverse document frequency of that is 4, and so then we can work on down through progressively more common words and the inverse document frequency will count down. In particular, for the final case, if we assume the word “the” occurred in every one of our documents, well then we’ve got 1,000,000 divided by 1,000,000, which is 1, and if we take the log of that we get the answer 0.
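The arithmetic of this example can be checked with a short Python sketch. The helper name `idf` is an assumption, and the intermediate document frequencies (1,000, 10,000 and 100,000) are filled in here only to show the countdown; the transcript itself names just the cases 1, 100 and 1,000,000.

```python
import math


def idf(df, n_docs):
    """Inverse document frequency with a base-10 log, as in the example."""
    return math.log10(n_docs / df)


N = 1_000_000  # collection size used in the example

# Document frequencies from rare to ubiquitous; the middle three are illustrative.
table = {df: idf(df, N) for df in (1, 100, 1_000, 10_000, 100_000, 1_000_000)}

for df, weight in table.items():
    print(f"df = {df:>9,}  idf = {weight:.1f}")
```

Each tenfold increase in document frequency lowers the weight by exactly 1, which is the dampening effect of the logarithm.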

So the result we actually get is that a word that occurs in every document does have a weight of 0 according to an IDF score and has no effect on the ordering of documents in retrieval, and that makes sense, because if it occurs in every document it has no discriminatory value between documents. And so what you can see with these numbers overall is that this inverse document frequency weighting will give a small multiplier that pays more attention to rarer words than to very common words.

Another thing to note here is that IDF values aren’t things that change for each query: there’s precisely one IDF value for each term in the collection, and that’s going to be the same regardless of what query you’re issuing against the collection. Okay, here’s a yes/no question for you guys. Does the IDF have an effect on ranking for one-term queries like this one?

The answer is no, it doesn’t. IDF has no effect on one-term queries. So for a one-term query you’re going to have one of these terms of N over the document frequency, and it will be worked out, but it’s just going to be a scaling factor which, since there’s only one IDF value for each term, will be applied to every document, and therefore it won’t affect the ranking in any way. You only get an effect from IDF when you have multiple terms in the query. So for example, if we have the query “capricious person”, well now we’re in the situation where “capricious” is a much rarer word, and so IDF will say “Pay much more attention to documents that contain the word capricious than to documents that contain just the word person in ranking your retrieval results.”
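To make the point tangible, here is a small Python sketch that ranks a toy collection with and without IDF. Everything in it is an assumption for illustration: the function `rank`, the four invented documents, and in particular the scoring rule (raw count of each query term in the document, multiplied by the term’s IDF), which is a stand-in and not something this segment of the lecture defines.

```python
import math


def rank(query, documents, use_idf):
    """Return document ids ordered by score, best first (ties broken by id).

    Score = sum over query terms of (occurrences in the document) * weight,
    where weight is log10(N / df) if use_idf is set, otherwise 1.
    This scoring rule is an assumption made for the illustration.
    """
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    scored = []
    for doc_id, tokens in enumerate(tokenized):
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)
            if df == 0:
                continue  # term is absent from the collection, so it contributes nothing
            weight = math.log10(n_docs / df) if use_idf else 1.0
            score += tokens.count(term) * weight
        scored.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical toy collection, invented for illustration only.
docs = [
    "a person met a person and another person",
    "a capricious person",
    "the weather was fine",
    "one person waited",
]

# One-term query: IDF rescales every score by the same factor, so the order is unchanged.
print(rank("person", docs, use_idf=False), rank("person", docs, use_idf=True))

# Two-term query: IDF boosts the rare word and the order changes.
print(rank("capricious person", docs, use_idf=False),
      rank("capricious person", docs, use_idf=True))
```

In this toy setup the document that repeats “person” three times wins the two-term query without IDF, while with IDF the document containing “capricious” moves to the top.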

There’s another measure that reflects the frequency of a term, and indeed you might have been wondering why we’re not using it. That other measure is what information retrieval people refer to as the collection frequency of a term. So the collection frequency of a term is just the total number of times it appears in the collection, counting multiple occurrences. So that’s the measure that we’ve been using in other places; it’s the measure we were using to build unigram language models, or when we were working out spam classifiers or something like that.

It’s not what’s usually used in information retrieval ranking systems, and this next example can maybe help explain why. So here we have two words, “insurance” and “try”, and I picked those two words because they have virtually identical collection frequency. Overall they both occur somewhat more than 10,000 times in the collection, so let’s then look at their document frequency. The word “try” occurs in 8,700-odd documents, and that stands in contrast to “insurance”, which occurs in slightly under 4,000 documents. What does that mean?

What that means is that when “try” occurs in a document it tends to occur only once, but that “try” is widely distributed across documents. On the other hand, when “insurance” occurs in a document it tends to occur several times, 2-3 times, and so what does that reflect? It reflects the fact that there tend to be documents about insurance, which then mention insurance several times, whereas there don’t really tend to be documents about trying. And so what does that mean in terms of coming up with a score for retrieval systems with words matching? What it seems to suggest is that we should be giving higher weighting to instances of the word “insurance” appearing. So imagine some kind of query like “try to buy insurance”.

The most important word to make sure we’re finding in our documents to match the query is “insurance”, and probably the second most important word is “buy”, and “try” should be coming in third place, ahead of the near stop word “to”. And so that’s an idea that’s being correctly captured by looking at the document frequency, but as you can see it’s not captured by the collection frequency, which would score “try” and “insurance” equally. Okay, so I hope now you know what document frequency weighting is and why people usually use that as a retrieval ranking score rather than collection frequency.
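The contrast between the two measures can be reproduced on a tiny scale with the following Python sketch. The function name `collection_and_document_frequency` and the four toy documents are assumptions invented for this example; the real figures quoted in the lecture come from a much larger collection.

```python
from collections import Counter


def collection_and_document_frequency(documents):
    """Return (cf, df) Counters over a list of document strings.

    cf counts every occurrence of a term across the collection;
    df counts each term at most once per document.
    """
    cf, df = Counter(), Counter()
    for doc in documents:
        tokens = doc.lower().split()  # naive whitespace tokenization (an assumption)
        cf.update(tokens)
        df.update(set(tokens))
    return cf, df


# Hypothetical toy collection: one document "about" insurance,
# three documents that merely mention "try".
docs = [
    "insurance claims and insurance premiums and insurance brokers",
    "try the soup",
    "try again later",
    "we try harder",
]

cf, df = collection_and_document_frequency(docs)
print("collection frequency:", cf["insurance"], cf["try"])
print("document frequency:  ", df["insurance"], df["try"])
```

Both words have the same collection frequency here, but “insurance” has the lower document frequency and would therefore receive the higher IDF weight.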

That’s what we are discussing and working on, in order to do more for linkless outranking ;o) Don’t be shy about asking us!