Accession Number:



Scalable, Flexible and Active Learning on Distributions

Descriptive Note:

Technical Report

Corporate Author:


Personal Author(s):

Report Date:


Pagination or Media Count:



A wide range of machine learning problems, including astronomical inference about galaxy clusters, natural image scene classification, parametric statistical inference, and detection of potentially harmful sources of radiation, can be well-modeled as learning a function on samples from distributions. This thesis explores problems in learning such functions via kernel methods, and applies the framework to yield state-of-the-art results in several novel settings. One major challenge with this approach is one of computational efficiency when learning from large numbers of distributions the computation of typical methods scales between quadratically and cubically, and so they are not amenable to large datasets. As a solution, we investigate approximate embeddings into Euclidean spaces such that inner products in the embedding space approximate kernel values between the source distributions. We provide a greater understanding of the standard existing tool for doing so on Euclidean inputs, random Fourier features. We also present a new embedding for a class of information-theoretic distribution distances, and evaluate it and existing embeddings on several real-world applications. The next challenge is that the choice of distance is important for getting good practical performance, but how to choose a good distance for a given problem is not obvious. We study this problem in the setting of two-sample testing, where we attempt to distinguish two distributions via the maximum mean divergence, and provide a new technique for kernel choice in these settings, including the use of kernels defined by deep learning-type models. In a related problem setting, common to physical observations, autonomous sensing, and electoral polling, we have the following challenge when observing samples is expensive, but we can choose where we would like to do so, how do we pick where to observe

Subject Categories:

  • Statistics and Probability
  • Operations Research
  • Cybernetics

Distribution Statement: