Friday, October 7, 2011

*Mandatory* Blog questions on Complexity and Search Engines [Due by 10/21]

Folks:

 Here are a few questions on complexity and search engines--mostly to check what got through. You are required to
enter your answers to the questions as a comment on the blog:


1. My friend was asked by his boss to come up with an efficient algorithm for a time-tabling problem for the company.
My friend was unable to come up with anything that runs in less than exponential time. He wanted to convince his boss that
no one else could have done any better. He went and told the boss this: "Look, this time-tabling problem can be easily (i.e., in polynomial time)
converted into a boolean satisfiability problem. Now, it is well known that boolean satisfiability is an NP-Complete problem (i.e., it is one of 'em
monster problems in class NP). So it is no bloody surprise that I am unable to come up with anything other than an exponential algorithm".

What do you think of my friend's reasoning?
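As an aside, the "easy direction" the friend mentions really is mechanical: here is a minimal Python sketch that encodes a toy time-tabling instance as a boolean satisfiability check. The meetings, slots, and conflict pair are invented purely for illustration, and the brute-force search over truth assignments is itself exponential--which is exactly the friend's predicament.

    # Toy conversion: time-tabling -> SAT. The boolean variable x[(m, s)]
    # means "meeting m is held in slot s". (Illustrative instance only.)
    from itertools import product

    meetings = ["standup", "review"]     # hypothetical meetings
    slots = [0, 1]                       # hypothetical time slots
    conflicts = [("standup", "review")]  # pairs that share a participant

    def satisfied(x):
        # Each meeting must be held in exactly one slot...
        if any(sum(x[(m, s)] for s in slots) != 1 for m in meetings):
            return False
        # ...and conflicting meetings must never share a slot.
        return not any(x[(m1, s)] and x[(m2, s)]
                       for (m1, m2) in conflicts for s in slots)

    # Brute-force satisfiability search: 2^(number of variables) assignments.
    variables = [(m, s) for m in meetings for s in slots]
    for bits in product([False, True], repeat=len(variables)):
        x = dict(zip(variables, bits))
        if satisfied(x):
            print({m: s for (m, s), v in x.items() if v})
            break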


2. Explain why arguments based on reductions to/from NP-complete problems are needed. Can't we just prove directly that a problem can't have anything other than an exponential (or polynomial) solution? Do you know of any famous problems that were thought to be exponential for a long time until someone found a polynomial algorithm?


3. One of the CSE faculty members used to have a bunch of magnetic words stuck to his (then metallic) doors. All the students passing by would try to arrange the magnetic words to make interesting English messages. If we assume that each student made it a point to use *all* the words in his/her message, then what would the bag-of-words similarity measures say about the pair-wise difference among those messages? Considering that I told you that search engines use these as the default similarity metrics, what does that make you think about the intelligence of search engines?
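A hint, in the form of a minimal sketch (the sentences below are mine, not the actual door messages): under a bag-of-words representation, word order is discarded, so any two messages built out of the same words are indistinguishable.

    # Bag-of-words: a message is reduced to its multiset of word counts.
    from collections import Counter

    def bag(text):
        return Counter(text.lower().split())

    # Same words, opposite meanings -- identical bags.
    print(bag("man bites dog") == bag("dog bites man"))  # True

Any similarity measure computed from the bags alone must therefore report zero pair-wise difference between such messages.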



4. Suppose I have a document D1: "Rao is a happy chap". I make another document D2 by copying all the text from D1 and pasting it 2 times into D2 (so D2 is "Rao is a happy chap Rao is a happy chap"). What will the bag similarity metric say about the similarity between D1 and D2? What will the cosine-theta (or vector) similarity metric say about the similarity between D1 and D2? Which do you think is better for this case?
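A sketch for this one as well. Below, "bag similarity" is formalized as the multiset Jaccard coefficient (sum of minimum counts over sum of maximum counts)--one common choice, though not necessarily the exact definition from the lectures--and cosine-theta is computed on raw term-count vectors.

    from collections import Counter
    from math import sqrt

    def bag_similarity(a, b):
        # Multiset Jaccard: min counts over max counts (assumed definition).
        ca, cb = Counter(a.split()), Counter(b.split())
        words = set(ca) | set(cb)
        return (sum(min(ca[w], cb[w]) for w in words) /
                sum(max(ca[w], cb[w]) for w in words))

    def cosine_theta(a, b):
        # Cosine of the angle between the two term-count vectors.
        ca, cb = Counter(a.split()), Counter(b.split())
        dot = sum(ca[w] * cb[w] for w in set(ca) | set(cb))
        na = sqrt(sum(v * v for v in ca.values()))
        nb = sqrt(sum(v * v for v in cb.values()))
        return dot / (na * nb)

    d1 = "Rao is a happy chap"
    d2 = "Rao is a happy chap Rao is a happy chap"
    print(bag_similarity(d1, d2))  # 0.5 -- D2 doubles every word count
    print(cosine_theta(d1, d2))    # 1.0 -- the count vectors are parallel

Whether 0.5 or 1.0 is the more sensible verdict here is, of course, the question.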

8 comments:

  1. 1. I agree with your friend. P ?= NP is a Millennium Prize problem, and it would be ridiculous to expect him to solve it for an assignment at work when people all over the world have attempted to solve this problem.

    2. Arguments based on reductions to/from NP-complete problems are needed because it hasn't been proven whether a problem whose solution can be checked in polynomial time can or cannot be solved in polynomial time.

    I don't remember any famous exponential problems that turned out to be polynomial.

    3. If every student used all of the words available, a search engine using “bag-of-words similarity measures” would say that all of the messages are equal. This makes me think “bag-of-words similarity” search engines aren't very intelligent.

    4. The bag similarity metric will say that the first document is only half as similar to the second. The cosine-theta similarity metric will say that they are exactly the same. I think that means the cosine-theta similarity metric is the better metric, because both documents contain the exact same phrases.

  2. 1. Your friend is correct because any problem that can be converted to a monster problem in polynomial time can only be solved exponentially, never polynomially.
    2. It is necessary to reduce a problem to/from NP-complete problems because it is much easier to say "this problem is equivalent to that problem, which has been tried by many, many people" than to attempt the original problem yourself. Also, I am not aware of any famous problems that were thought to be exponential for a long time until someone found a polynomial algorithm.
    3. If you see all the magnetic words as a bag of words, then the order in which they are placed is not accounted for in the algorithm, and hence all comparisons would be equal. Since this is how search engines compute their answers, they aren't very smart, since even the dumbest human knows the significance of the order of words.
    4. The similarity would be 1. According to Wikipedia, the cosine similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 usually indicating independence and in-between values indicating intermediate similarity or dissimilarity. Since both of the strings have the same words, the similarity would be 1.
    --------------Michael W. Gutierrez--------------

  3. 1. I am going to agree with your friend. The P versus NP problem is still unresolved, so all known NP-complete problems are believed to require exponential time. It would be unfair for your friend's boss to think that your friend can achieve something that no one else has.

    2. I think it makes more sense to think that if an NP problem cannot be reduced to a P problem, then it can't be anything other than exponential.
    I do not know of any exponential problems proven to instead be polynomial.

    3. Bags don't care about the order of words, only about which words appear. The bag-of-words similarity measures would say that there is a 0% difference among these messages.
    Search engines had better not rely solely on bag-of-words similarity measures, or else they risk saying that two very different sentences are actually identical. Ex: "Dog bites man." vs. "Man bites dog."

    4. The bag similarity metric would say that the similarity between D1 and D2 is 5/10 or 50% similarity.
    The Cosine-Theta metric would say that the similarity between D1 and D2 is (5*10)/(|5|*|10|) = 1 or 100% similarity.
    I think that the bag similarity metric would be the better way to go because D1 is not identical to D2, and the bag similarity metric represents that.

  4. 1. While there may exist a polynomial-time solution, your friend's reasoning is right because, as of yet, no one in the world has been able to come up with one, so it is unreasonable to expect someone to be able to.

    2. Reduction to/from NP-complete problems is necessary since, while it hasn't been proved that polynomial-time solutions exist for NP-complete problems, that does not rule out the possibility that they do exist.

    3. Since all messages will consist of the same words, the bag similarity between all of them will be the same, so a search engine using that method will treat them all as the same even though the messages' meanings might all be different.

    4. The bag similarity metric would say the two files have 50% similarity, while the cosine-theta metric would say they are the same. In this particular case I would argue that the bag similarity metric is better, since the first file really is only 50% of the second one; they aren't really the same. I guess it depends on exactly what one means by similarity, and on the type of case, whether one method is better than the other.

  5. [I sent this post to you by email before the class because I had a problem with cookies; now I have found the solution for that, so I'm posting my answers here.]

    1- I agree with your friend because we can solve a problem if we can convert it to a monster problem in polynomial time.

    2- Converting a problem to/from NP-complete problems is better because many people have used this approach to solve hard problems by making them easier. In addition, I don't know any famous problems like that (problems that were thought to be exponential for a long time until someone found a polynomial algorithm).

    3- The difference between these messages would be 0% according to the bag-of-words similarity. Since search engines use these as the default similarity metrics, I think they are not very intelligent.

    4- The similarity between D1 and D2 is 50% according to the bag similarity metric. In this case, I think both of them are OK because they have the same words, but the second document is larger than the first, so it may affect the result.

    Abdulla AlBraiki

  6. 1) Your friend has valid reasoning. You cannot expect someone to come up with a solution to a problem that has been attempted so many times but never solved.

    2) Converting to an NP problem is better because it is easier to show how the problems relate.

    I do not know any famous problems that were thought to be exponential for a long time until someone found a polynomial algorithm.

    3) It makes me think that search engines are very inefficient and aren't at the level that they should be.

    4) If you use the cosine similarity, it would come out as 1, meaning the same, while the bag similarity would read it as half the same, or 50%.

    ------Ryan Palach------

  7. 1.) Your friend's boss didn't seem to specify what time bound this algorithm was supposed to run in, only that it should be efficient. So I'm siding with your friend. He came up with a solution, and it doesn't matter that it runs in exponential time. It's simpler and efficient.

    2.) I think of these arguments as work shown on a math problem. For one, they show how you got to your solution (in this case, reductions to/from NP-complete problems). Also, they allow others to compare their solutions with yours.

    3.) If we used the "bag-of-words" similarity measures to compare the different sentences the students formed, our result would be that all of the sentences are the same. The "bag-of-words" approach disregards the order of the words in the sentences. It only cares about the words matching. Kinda lessens my faith in search engines, knowing this...

    4.) The bag similarity metric would say that D1 is HALF the same as D2. The cosine-theta similarity metric, however, will say that D1 and D2 are the same. Knowing this, I definitely think that cosine-theta is the more reliable similarity metric.

  8. 1) The only way to solve this is to convert it into an NP-complete problem, which would take an impossible amount of time to calculate; therefore I agree with your friend, as it would be unfair for his boss to expect that of him.

    2) Reductions to/from NP-complete problems are needed because, although a polynomial-time solution hasn't been found, that doesn't rule out the possibility that one exists.

    3) They cannot rely on bag-of-words similarities, or they would be at risk of letting different phrases made of similar words mean the same thing. Bag-of-words cares about similarity, not order.

    4) The bag similarity metric would say they have 50% similarity, while the cosine-theta metric would say they are the same.

