Semi-Supervised Methods in Population Genetics

About this project

Every year, invasive species destroy billions of dollars' worth of crops, so identifying and tracking these species effectively is crucial. Currently popular methods identify a specimen by comparing it against a reference database. For DNA barcoding, which compares organisms using their DNA, projects like Biocode act as public archives of genetic information about discovered species. However, only a small fraction of this data corresponds to known and studied species. Most of it is unlabelled, so conventional statistical techniques are limited to a tiny subset of the real data. In this paper, we apply semi-supervised learning methods to this dataset, which lets the unlabelled records contribute to training and effectively expands the usable data. We then apply this approach to the problem of classifying species as native or invasive. Finally, we explore the usefulness of phenotypic information as features for this classifier. Depending on the class of algorithms used, our method can be generalized to many other tasks that benefit from DNA barcoding.
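To make the idea of expanding a small labelled set with unlabelled barcodes concrete, here is a minimal sketch of self-training, one common family of semi-supervised methods. It is an illustration only: the abstract does not say which algorithm the project uses, and the k-mer features, the nearest-centroid classifier, the `margin` threshold, and the toy sequences are all assumptions introduced here.

```python
from collections import Counter
from itertools import product
import math

K = 2  # k-mer length; an assumed, illustrative choice
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def kmer_profile(seq):
    """Normalised k-mer frequency vector of a DNA sequence."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[k] for k in KMERS) or 1
    return [counts[k] / total for k in KMERS]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def self_train(labelled, unlabelled, margin=0.05, max_rounds=10):
    """Self-training with a nearest-centroid classifier.

    labelled: list of (sequence, label); unlabelled: list of sequences.
    Each round, an unlabelled sequence is pseudo-labelled only if its
    nearest class centroid beats the runner-up by at least `margin`.
    Returns (expanded labelled list, sequences left unlabelled).
    """
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        by_label = {}
        for seq, lab in labelled:
            by_label.setdefault(lab, []).append(kmer_profile(seq))
        centroids = {lab: centroid(vs) for lab, vs in by_label.items()}
        newly, remaining = [], []
        for seq in pool:
            v = kmer_profile(seq)
            ranked = sorted((distance(v, c), lab) for lab, c in centroids.items())
            if len(ranked) > 1 and ranked[1][0] - ranked[0][0] >= margin:
                newly.append((seq, ranked[0][1]))  # confident pseudo-label
            else:
                remaining.append(seq)  # ambiguous; leave unlabelled
        if not newly:
            break
        labelled.extend(newly)
        pool = remaining
    return labelled, pool

# Toy data (made up for illustration, not real barcodes).
seed = [("ATATATATATAT", "native"), ("GCGCGCGCGCGC", "invasive")]
unknown = ["ATATATATTTAT", "GCGCGGGCGCGC", "ATATATGCGCGC"]
expanded, left_over = self_train(seed, unknown)
print(expanded)
print(left_over)
```

The confidence margin is the key design choice in this sketch: sequences that sit roughly between the two classes stay unlabelled rather than adding noisy pseudo-labels to the training set. Phenotypic features, as mentioned in the abstract, could be appended to the k-mer vector in the same framework.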

Project Members