Department of Statistical Science
216 Old Chemistry
P.O. Box 90251
Durham, NC 27708-0251
I am an Assistant Professor in the Department of Statistical Science at Duke University and affiliated faculty in Computer Science, Biostatistics and Bioinformatics, the Information Initiative at Duke (iiD), and the Social Science Research Institute. I also hold a Schedule A appointment at the U.S. Census Bureau.
My main research focus is entity resolution (also called record linkage or de-duplication), where the goal is to remove duplicated information from large, noisy databases in the absence of unique identifiers. I develop flexible methods for entity resolution that handle the uncertainty of the record linkage process and can be easily integrated with post-linkage statistical analyses, such as logistic regression or capture-recapture. A strength of the methods I propose is that they maintain low error rates (as measured by precision and recall) and outperform state-of-the-art methods in the literature on these measures. Furthermore, I have developed the first performance bounds for a general class of entity resolution models and illustrated when the bounds hold in practice. I also proposed a new methodology for entity resolution, based on the observation that the size of the clusters grows sub-linearly with the number of records, which contrasts with many other clustering processes. In turn, this has led to a general class of scalable models for clustering tasks with sub-linear growth, and to illustrations of their success for entity resolution.
In addition to approaching entity resolution from a Bayesian perspective, I also approach it using statistical machine learning. Specifically, I have leveraged locality sensitive hashing (LSH) as a dimension reduction technique for entity resolution and developed fast ways of estimating the number of unique clusters in very large databases. My collaborators and I have also shown that these methods have desirable theoretical properties and are highly scalable.
If you're interested in working with me as an undergraduate, MS, or PhD student, please email me to set up a time to talk after looking over my research page and some of the recent work that my group and I have done.
I’m currently looking for a postdoctoral researcher in my group, so please send me a CV if you’re interested and have a background in both statistics and machine learning.
Duke Machine Learning
I am heavily involved in integrating computation into both the graduate and undergraduate statistics curricula, emphasizing reproducible research and real, complex data sets. All of the courses I teach at Duke can be found on GitHub. In addition, I taught the first undergraduate machine learning course in Statistical Science, and I'm working with students and the undergraduate ML board to give machine learning a greater presence on campus through the Duke Undergraduate Machine Learning (ML) Program (http://dukeml.org/). We have a student-run seminar series (MLBytes), bootcamps, a machine learning day, and a datathon. We have been fortunate to have many sponsors, both academic and industrial, which are listed on our webpage. If you’re interested in becoming involved in this group, please contact us!