The project has investigated syntactic analysis of natural languages, with a focus on semi-supervised approaches that require very limited amounts of training data. One focus of the project has been on highly efficient methods for learning lexical representations from unlabeled data; these representations can then be used in a variety of natural language processing problems. We have derived a new algorithm for word clustering that is significantly more efficient than previous approaches and has strong theoretical guarantees. In other work, we have investigated methods for part-of-speech tagging (the problem of assigning a part of speech to each word in a sentence) using minimal amounts of training data. Our results show that a few hundred words of labeled data are sufficient for high accuracy. A final piece of work has focused on efficient dependency parsing for multiple languages, with example applications in machine translation and information extraction.
Michael Collins is the Vikram S. Pandit Professor of Computer Science at Columbia University. He completed a PhD in computer science at the University of Pennsylvania in December 1998. From January 1999 to November 2002 and from January 2003 until December 2010 he was a researcher at AT&T Labs-Research. Dr.