Coupled Semi-Supervised Learning
CARNEGIE-MELLON UNIV PITTSBURGH PA MACHINE LEARNING DEPT
This thesis argues that successful semi-supervised learning is improved by learning many functions at once in a coupled manner. Given knowledge about constraints between the functions to be learned (e.g., f1(x) → f2(x)), forcing the learned models to obey these constraints can yield a more constrained, and therefore easier, set of learning problems. We apply these ideas to bootstrap learning methods as well as to semi-supervised logistic regression models, and show that considerable improvements are achieved in both settings. In experimental work, we focus on the problem of extracting factual knowledge from the web. This problem is an ideal case study for the general problems that we study because there is an abundance of unlabeled web page data available, and because thousands or millions of functions are discussed on the web. Chapter 3 focuses on coupling the semi-supervised learning of information extractors that extract information from free text using textual extraction patterns (e.g., "mayor of X" and "Y star quarterback X"). We present an approach in which the input to the learner is an ontology defining a set of target categories and relations to be learned, a handful of seed examples for each, and a set of constraints that couple the various categories and relations (e.g., Person and Sport are mutually exclusive). We show that, given this input and millions of unlabeled documents, a semi-supervised learning procedure can achieve very significant accuracy improvements over semi-supervised methods that do not avoid constraint violations, simply by avoiding such violations in how its learned extractors label unlabeled data. In Chapter 4, we apply the ideas from Chapter 3 to a different type of extraction method: wrapper induction for semi-structured web pages.
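As an illustration only, the role of a mutual-exclusion constraint in one bootstrapping round might be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the thesis: the function name coupled_filter, the data layout, and the example instances are all hypothetical, and only the Person/Sport constraint comes from the abstract.

```python
from collections import defaultdict

def coupled_filter(candidates, promoted, mutex):
    """Keep candidate (instance, category) pairs that violate no
    mutual-exclusion constraint. Hypothetical helper for illustration."""
    # Categories already assigned to each instance (seeds or earlier rounds).
    labels = defaultdict(set)
    for inst, cat in promoted:
        labels[inst].add(cat)

    accepted = []
    for inst, cat in candidates:
        # A candidate conflicts if the instance already holds a category
        # declared mutually exclusive with the proposed one.
        conflict = any(frozenset((cat, other)) in mutex for other in labels[inst])
        if not conflict:
            accepted.append((inst, cat))
            labels[inst].add(cat)
    return accepted

# Constraint taken from the abstract: Person and Sport are mutually exclusive.
mutex = {frozenset(("Person", "Sport"))}
# Made-up seed labels and extractor proposals.
promoted = [("Peyton Manning", "Person"), ("football", "Sport")]
candidates = [
    ("football", "Person"),
    ("Tom Brady", "Person"),
    ("Peyton Manning", "Sport"),
    ("hockey", "Sport"),
]
result = coupled_filter(candidates, promoted, mutex)
```

Here the two proposals that contradict an existing label are dropped, while the two consistent ones are kept. An uncoupled learner would accept all four, and the erroneous labels could then steer later rounds toward further errors.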
- Information Science