Tag Archives: algorithms

Longstanding problem put to rest

Proof that a 40-year-old algorithm is the best possible will come as a relief to computer scientists.

By Larry Hardesty

CAMBRIDGE, Mass. – Comparing the genomes of different species — or different members of the same species — is the basis of a great deal of modern biology. DNA sequences that are conserved across species are likely to be functionally important, while variations between members of the same species can indicate different susceptibilities to disease.

The basic algorithm for determining how much two sequences of symbols have in common — the “edit distance” between them — is now more than 40 years old. And for more than 40 years, computer science researchers have been trying to improve upon it, without much success.

At the ACM Symposium on Theory of Computing (STOC) next week, MIT researchers will report that, in all likelihood, that’s because the algorithm is as good as it gets. If a widely held assumption about computational complexity is correct, then the problem of measuring the difference between two genomes — or texts, or speech samples, or anything else that can be represented as a string of symbols — can’t be solved more efficiently.

In a sense, that’s disappointing, since a computer running the existing algorithm would take 1,000 years to exhaustively compare two human genomes. But it also means that computer scientists can stop agonizing about whether they can do better.

“This edit distance is something that I’ve been trying to get better algorithms for since I was a graduate student, in the mid-’90s,” says Piotr Indyk, a professor of computer science and engineering at MIT and a co-author of the STOC paper. “I certainly spent lots of late nights on that — without any progress whatsoever. So at least now there’s a feeling of closure. The problem can be put to sleep.”

Moreover, Indyk says, even though the paper hasn’t officially been presented yet, it’s already spawned two follow-up papers, which apply its approach to related problems. “There is a technical aspect of this paper, a certain gadget construction, that turns out to be very useful for other purposes as well,” Indyk says.

Squaring off

Edit distance is the minimum number of edits — deletions, insertions, and substitutions — required to turn one string into another. The standard algorithm for determining edit distance, known as the Wagner-Fischer algorithm, assigns each symbol of one string to a column in a giant grid and each symbol of the other string to a row. Then, starting in the upper left-hand corner and flooding diagonally across the grid, it fills in each square with the number of edits required to turn the string ending with the corresponding column into the string ending with the corresponding row.
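
To make the grid-filling concrete, here is a minimal Python sketch of the Wagner-Fischer dynamic program; the variable names and the sample strings are illustrative, not drawn from the paper.

def edit_distance(a, b):
    # Wagner-Fischer: fill a (len(a)+1) x (len(b)+1) grid of edit counts,
    # where grid[i][j] is the number of edits needed to turn the first i
    # symbols of a into the first j symbols of b.
    rows, cols = len(a) + 1, len(b) + 1
    grid = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        grid[i][0] = i  # delete all i symbols
    for j in range(cols):
        grid[0][j] = j  # insert all j symbols
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            grid[i][j] = min(
                grid[i - 1][j] + 1,                 # deletion
                grid[i][j - 1] + 1,                 # insertion
                grid[i - 1][j - 1] + substitution,  # substitution (or a free match)
            )
    return grid[-1][-1]

print(edit_distance("kitten", "sitting"))  # 3

The two nested loops that fill the grid are also what make the running time quadratic, as described next.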

Computer scientists measure algorithmic efficiency as computation time relative to the number of elements the algorithm manipulates. Since the Wagner-Fischer algorithm has to fill in every square of its grid, its running time is proportional to the product of the lengths of the two strings it’s considering. Double the lengths of the strings, and the running time quadruples. In computer parlance, the algorithm runs in quadratic time.

That may not sound terribly efficient, but quadratic time is much better than exponential time, which means that running time is proportional to 2^N, where N is the number of elements the algorithm manipulates. If on some machine a quadratic-time algorithm took, say, a hundredth of a second to process 100 elements, an exponential-time algorithm would take about 100 quintillion years.

Theoretical computer science is particularly concerned with a class of problems known as NP-complete. Most researchers believe that NP-complete problems take exponential time to solve, but no one’s been able to prove it. In their STOC paper, Indyk and his student Artūrs Bačkurs demonstrate that if it’s possible to solve the edit-distance problem in less-than-quadratic time, then it’s possible to solve an NP-complete problem in less-than-exponential time. Most researchers in the computational-complexity community will take that as strong evidence that no subquadratic solution to the edit-distance problem exists.

Can’t get no satisfaction

The core NP-complete problem is known as the “satisfiability problem”: Given a host of logical constraints, is it possible to satisfy them all? For instance, say you’re throwing a dinner party, and you’re trying to decide whom to invite. You may face a number of constraints: Either Alice or Bob will have to stay home with the kids, so they can’t both come; if you invite Cindy and Dave, you’ll have to invite the rest of the book club, or they’ll know they were excluded; Ellen will bring either her husband, Fred, or her lover, George, but not both; and so on. Is there an invitation list that meets all those constraints?
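
Purely as an illustration (the guest names echo the example above, and the constraints are simplified stand-ins rather than anything from the paper), a brute-force satisfiability check fits in a few lines of Python:

from itertools import product

# Hypothetical guests; "BookClub" is a stand-in for the rest of the book club.
guests = ["Alice", "Bob", "Cindy", "Dave", "Ellen", "Fred", "George", "BookClub"]

def satisfies(invite):
    # invite maps each guest to True (invited) or False (not invited)
    return (
        not (invite["Alice"] and invite["Bob"])                               # one stays home with the kids
        and (not (invite["Cindy"] and invite["Dave"]) or invite["BookClub"])  # Cindy and Dave imply the book club
        and (not invite["Ellen"] or (invite["Fred"] != invite["George"]))     # Ellen brings Fred or George, not both
    )

# Brute force: try every one of the 2^n possible invitation lists.
solutions = [combo for combo in product([False, True], repeat=len(guests))
             if satisfies(dict(zip(guests, combo)))]
print(len(solutions), "invitation lists satisfy every constraint")

That exhaustive loop over 2^n possible invitation lists is the exponential-time behavior that researchers believe cannot be avoided for NP-complete problems.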

In Indyk and Bačkurs’ proof, they propose that, faced with a satisfiability problem, you split the variables into two groups of roughly equivalent size: Alice, Bob, and Cindy go into one, while Walt, Yvonne, and Zack go into the other. Then, for each group, you solve for all the pertinent constraints. This could be a massively complex calculation, but not nearly as complex as solving for the group as a whole. If, for instance, Alice has a restraining order out on Zack, it doesn’t matter, because they fall in separate subgroups: It’s a constraint that doesn’t have to be met.
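
A rough sketch of that splitting step, with invented groups and constraints for illustration (the paper's actual gadget construction is considerably more involved):

from itertools import product

def assignments_satisfying(variables, local_constraints):
    # Enumerate every True/False assignment of this half of the variables
    # that meets the constraints involving only those variables.
    for combo in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, combo))
        if all(constraint(assignment) for constraint in local_constraints):
            yield assignment

first_half  = ["Alice", "Bob", "Cindy"]
second_half = ["Walt", "Yvonne", "Zack"]

# Constraints that fall entirely inside one half can be checked locally.
first_local  = [lambda a: not (a["Alice"] and a["Bob"])]
second_local = [lambda a: a["Walt"] or a["Yvonne"]]  # invented for illustration

left  = list(assignments_satisfying(first_half, first_local))
right = list(assignments_satisfying(second_half, second_local))

# Each list has at most 2^(n/2) entries, far fewer than the 2^n combinations
# of the full problem. Constraints that straddle the two halves still have to
# be reconciled between the lists; that reconciliation step is described next.
print(len(left), "partial guest lists on one side,", len(right), "on the other")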

At this point, the problem of reconciling the solutions for the two subgroups — factoring in constraints like Alice’s restraining order — becomes a version of the edit-distance problem. And if it were possible to solve the edit-distance problem in subquadratic time, it would be possible to solve the satisfiability problem in subexponential time.

Source: MIT News Office

Recommendation theory

Model for evaluating product-recommendation algorithms suggests that trial and error gets it right.

By Larry Hardesty

Devavrat Shah’s group at MIT’s Laboratory for Information and Decision Systems (LIDS) specializes in analyzing how social networks process information. In 2012, the group demonstrated algorithms that could predict what topics would trend on Twitter up to five hours in advance; this year, they used the same framework to predict fluctuations in the prices of the online currency known as Bitcoin.

Next month, at the Conference on Neural Information Processing Systems, they’ll present a paper that applies their model to the recommendation engines that are familiar from websites like Amazon and Netflix — with surprising results.

“Our interest was, we have a nice model for understanding data-processing from social data,” says Shah, the Jamieson Associate Professor of Electrical Engineering and Computer Science. “It makes sense in terms of how people make decisions, exhibit preferences, or take actions. So let’s go and exploit it and design a better, simple, basic recommendation algorithm, and it will be something very different. But it turns out that under that model, the standard recommendation algorithm is the right thing to do.”

The standard algorithm is known as “collaborative filtering.” To get a sense of how it works, imagine a movie-streaming service that lets users rate movies they’ve seen. To generate recommendations specific to you, the algorithm would first assign the other users similarity scores based on the degree to which their ratings overlap with yours. Then, to predict your response to a particular movie, it would aggregate the ratings the movie received from other users, weighted according to similarity scores.
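
A minimal Python sketch of that two-step procedure; the tiny ratings table and the agreement-based similarity score are invented for illustration, standing in for whatever a real service would use.

# ratings[user][movie] = +1 (thumbs-up) or -1 (thumbs-down); missing means unrated
ratings = {
    "you":   {"Die Hard": +1, "The Godfather": +1},
    "user1": {"Die Hard": +1, "The Godfather": +1, "The Notebook": -1},
    "user2": {"Die Hard": -1, "The Godfather": +1, "The Notebook": +1},
}

def similarity(a, b):
    # Fraction of commonly rated movies on which the two users agree.
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    return sum(ratings[a][m] == ratings[b][m] for m in shared) / len(shared)

def predict(user, movie):
    # Aggregate the other users' ratings of the movie, weighted by similarity.
    total = weight = 0.0
    for other in ratings:
        if other != user and movie in ratings[other]:
            w = similarity(user, other)
            total += w * ratings[other][movie]
            weight += w
    return total / weight if weight else 0.0

print(predict("you", "The Notebook"))  # about -0.33, i.e. a predicted thumbs-down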

To simplify their analysis, Shah and his collaborators — Guy Bresler, a postdoc in LIDS, and George Chen, a graduate student in MIT’s Department of Electrical Engineering and Computer Science (EECS) who is co-advised by Shah and EECS associate professor Polina Golland — assumed that the ratings system had two values, thumbs-up or thumbs-down. The taste of every user could thus be described, with perfect accuracy, by a string of ones and zeroes, where the position in the string corresponds to a particular movie and the number at that location indicates the rating.

Birds of a feather

The MIT researchers’ model assumes that large groups of such strings can be clustered together, and that those clusters can be described probabilistically. Rather than ones and zeroes at each location in the string, a probabilistic cluster model would feature probabilities: an 80 percent chance that the members of the cluster will like movie “A,” a 20 percent chance that they’ll like movie “B,” and so on.
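
As a toy version of such a cluster (the probabilities below are made up), drawing one user's ratings string from a cluster's probability vector might look like this:

import random

# A cluster is just a thumbs-up probability for each movie.
action_fans = {"Movie A": 0.80, "Movie B": 0.20, "Movie C": 0.50}

def sample_user(cluster):
    # Draw one user's string of ones and zeroes: 1 = thumbs-up, 0 = thumbs-down.
    return {movie: int(random.random() < p) for movie, p in cluster.items()}

print(sample_user(action_fans))  # e.g. {'Movie A': 1, 'Movie B': 0, 'Movie C': 1}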

The question is how many such clusters are required to characterize a population. If half the people who like “Die Hard” also like “Shakespeare in Love,” but the other half hate it, then ideally, you’d like to split “Die Hard” fans into two clusters. Otherwise, you’d lose correlations between their preferences that could be predictively useful. On the other hand, the more clusters you have, the more ratings you need to determine which of them a given user belongs to. Reliable prediction from limited data becomes impossible.

In their new paper, the MIT researchers show that so long as the number of clusters required to describe the variation in a population is low, collaborative filtering yields nearly optimal predictions. But in practice, how low is that number?

To answer that question, the researchers examined data on 10 million users of a movie-streaming site and identified 200 who had rated the same 500 movies. They found that, in fact, just five clusters — five probabilistic models — were enough to account for most of the variation in the population.

Missing links

While the researchers’ model corroborates the effectiveness of collaborative filtering, it also suggests ways to improve it. In general, the more information a collaborative-filtering algorithm has about users’ preferences, the more accurate its predictions will be. But not all additional information is created equal. If a user likes “The Godfather,” the information that he also likes “The Godfather: Part II” will probably have less predictive power than the information that he also likes “The Notebook.”

Using their analytic framework, the LIDS researchers show how to select a small number of products that carry a disproportionate amount of information about users’ tastes. If the service provider recommended those products to all its customers, then, based on the resulting ratings, it could much more efficiently sort them into probability clusters, which should improve the quality of its recommendations.
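
The paper's actual selection rule isn't spelled out here, but one plausible proxy for a disproportionately informative product is one whose thumbs-up probability differs sharply across clusters; a hedged sketch with invented numbers:

def informativeness(movie, clusters):
    # Spread of a movie's thumbs-up probability across clusters: a movie that
    # everyone likes (or dislikes) regardless of cluster scores near zero,
    # while a movie that splits the clusters scores high.
    probabilities = [cluster[movie] for cluster in clusters]
    return max(probabilities) - min(probabilities)

clusters = [
    {"The Godfather": 0.90, "The Godfather: Part II": 0.85, "The Notebook": 0.10},
    {"The Godfather": 0.80, "The Godfather: Part II": 0.75, "The Notebook": 0.90},
]

# Rank movies by how much a single rating would reveal about a user's cluster.
ranked = sorted(clusters[0], key=lambda m: informativeness(m, clusters), reverse=True)
print(ranked[0])  # 'The Notebook': the movie that best separates the two clusters

Ratings collected on the highest-scoring products would do the most to pin down which cluster a new user belongs to.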

Sujay Sanghavi, an associate professor of electrical and computer engineering at the University of Texas at Austin, considers this the most interesting aspect of the research. “If you do some kind of collaborative filtering, two things are happening,” he says. “I’m getting value from it as a user, but other people are getting value, too. Potentially, there is a trade-off between these things. If there’s a popular movie, you can easily show that I’ll like it, but it won’t improve the recommendations for other people.”

That trade-off, Sanghavi says, “has been looked at in an empirical context, but there’s been nothing that’s principled. To me, what is appealing about this paper is that they have a principled look at this issue, which no other work has done. They’ve found a new kind of problem. They are looking at a new issue.”

Source: MIT News