Chris Jones Ancestry Mapping Project

DOWNLOAD THE PHASED HAPMAP DATA IN MY FILES SECTION!

  • Download the file *NEW_HAPMAP_ENCODE_PHASED.tar.gz*. Do not use the old one.
  • The files can be unzipped using WinRAR and read using Notepad or WordPad.
  • The naming convention is ENCODE_{encode region #}_{population: JPT=Japanese CHB=Chinese CEU=European YRI=African JPTCHB=Chinese and Japanese}_{poly=no monomorphic SNPs, otherwise poly+mono are included}.hap

Project: Ancestry Mapping

UPDATES

Slides from my presentation are now available for download in the files section. Updated charts are included.

After my presentation of this project, Dr. Eskin suggested that I decrease my sliding window size from 10 to 8 and remove the mystery individual from my HapMap training populations. The rationale behind these changes was that a window size of 10 allows 2^10 = 1024 possible haplotypes in my lookup tables, but only 120 or 180 haplotypes exist for each HapMap population at a given window. Therefore, it was possible that the only copy of a haplotype in a lookup table was the one that had been copied out to build my mystery individual. This issue, coupled with low transition probabilities, could produce artificially accurate lineage estimations. I have made these changes to my program, and the charts shown below reflect this modification.
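The arithmetic behind this suggestion can be checked with a few lines of Python. This is only an illustrative sketch, not part of the project's program: the function name table_coverage is hypothetical, and it assumes biallelic SNPs, so that a window of w markers has 2^w possible haplotypes.

```python
def table_coverage(window_size, n_haplotypes):
    """Upper bound on the fraction of possible haplotypes a sample can cover.

    Assumes biallelic SNPs, so there are 2 ** window_size possible haplotypes.
    """
    possible = 2 ** window_size
    return min(n_haplotypes, possible) / possible


if __name__ == "__main__":
    for window_size in (10, 8):
        for n_haplotypes in (120, 180):
            print(
                f"window={window_size} possible={2 ** window_size} "
                f"sampled={n_haplotypes} "
                f"max coverage={table_coverage(window_size, n_haplotypes):.3f}"
            )
```

With a window of 10, at most about 12% to 18% of the possible haplotypes can appear in a table, so many observed haplotypes are likely to be seen only once. A window of 8 raises that upper bound to roughly 47% to 70%.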

Also, I made a minor update to how unseen haplotypes are dealt with. I had some problems in my decision blocks that caused my program to assign missing haplotypes the value from the last window position. I fixed this problem by setting the value for a missing haplotype to (1.0 / number of haplotypes in the population) * 0.99. This number ensures that the value for a missing haplotype will be lower than the frequency of a haplotype that occurs exactly once in a population. If a value of zero is used, then the Viterbi algorithm will crash, and if a very low number (1e-22) is used, it unfairly biases the weights in the algorithm, causing errors. These changes have greatly increased the accuracy of the program.
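A minimal sketch of this fallback rule, assuming the lookup table for one window is a plain dictionary from haplotype string to frequency. The names unseen_haplotype_frequency and emission_frequency, and the toy table, are hypothetical and not taken from the project's code.

```python
def unseen_haplotype_frequency(n_population_haplotypes, scale=0.99):
    """Pseudo-frequency for a haplotype never seen in the population.

    Slightly below 1 / n, the frequency of a haplotype seen exactly once.
    """
    if n_population_haplotypes <= 0:
        raise ValueError("population must contain at least one haplotype")
    return (1.0 / n_population_haplotypes) * scale


def emission_frequency(window_table, haplotype, n_population_haplotypes):
    """Look up a haplotype's frequency, falling back to the pseudo-frequency."""
    fallback = unseen_haplotype_frequency(n_population_haplotypes)
    return window_table.get(haplotype, fallback)


if __name__ == "__main__":
    # Toy table for one window in a population with 120 haplotypes.
    toy_table = {"AAAAGGTA": 40 / 120, "CCCCGTTA": 1 / 120}
    print(emission_frequency(toy_table, "CCCCGTTA", 120))  # seen once
    print(emission_frequency(toy_table, "GGGGGGGG", 120))  # never seen
```

The unseen value stays strictly positive, and it is always smaller than the frequency of any haplotype that was actually observed.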

Problem description: Given an individual with unknown lineage (either mixed or ethnically homogeneous), estimate their ancestry based on the haplotype structure of their genome.

Approach: For this problem, I created a program that computes the frequencies of a series of markers occurring in each of the HapMap populations (CEU, JPTCHB, YRI). Specifically, I use a sliding window of 8 markers to build a database of every marker combination that occurs in this space for each HapMap population. For example:

in CEU at one position of the sliding window: [AAAAGGTAGAT: 33% of individuals || CCCCGTTAGTAG: 15% of individuals || etc.]
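The table-building step can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: each phased haplotype is assumed to be a string with one character per marker, all haplotypes have the same length, and build_window_tables is a hypothetical name.

```python
from collections import Counter


def build_window_tables(haplotypes, window_size=8):
    """Return one {window_string: frequency} dict per window start position.

    haplotypes: list of equal-length strings, one character per marker.
    """
    if not haplotypes:
        return []
    length = len(haplotypes[0])
    tables = []
    for start in range(length - window_size + 1):
        # Count every marker combination that occurs at this window position.
        counts = Counter(h[start:start + window_size] for h in haplotypes)
        total = sum(counts.values())
        tables.append({hap: n / total for hap, n in counts.items()})
    return tables


if __name__ == "__main__":
    # Four toy haplotypes of four markers each, with a window of 3 markers.
    toy_population = ["ACGT", "ACGT", "ACGA", "TCGA"]
    for position, table in enumerate(build_window_tables(toy_population, 3)):
        print(position, table)
```

One table is built per population, so the full database would be a mapping from population name (CEU, JPTCHB, YRI) to such a list of per-position tables.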

Then, I slide the window over the genome of the mystery individual. At each position of the window, I observe the likelihood of the individual's marker combination occurring in each population. Next, I choose the population in which this marker combination occurs the most as the probable lineage of the individual for that position of the window.
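A rough sketch of this per-window decision, again with hypothetical names and toy numbers. It assumes the tables are stored as {population: [table for window 0, table for window 1, ...]}, where each table maps a marker combination to its frequency. Haplotypes missing from a table are scored as 0.0 here purely to keep the example short, and ties go to whichever population is listed first.

```python
def classify_windows(sequence, tables_by_population, window_size):
    """Pick, for each window, the population where the window is most frequent."""
    calls = []
    for start in range(len(sequence) - window_size + 1):
        window = sequence[start:start + window_size]
        scores = {
            population: tables[start].get(window, 0.0)
            for population, tables in tables_by_population.items()
        }
        # max() returns the first population with the highest score.
        calls.append(max(scores, key=scores.get))
    return calls


if __name__ == "__main__":
    # Toy tables for two populations, two window positions, window size 2.
    toy_tables = {
        "CEU": [{"AC": 0.7, "GC": 0.3}, {"CG": 0.9, "CT": 0.1}],
        "YRI": [{"AC": 0.2, "GC": 0.8}, {"CG": 0.4, "CT": 0.6}],
    }
    print(classify_windows("ACT", toy_tables, window_size=2))
```

Each window is decided independently of its neighbours, so a single unusual window can flip the call from one population to another.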

Unfortunately, this maximum likelihood estimate of the ancestral population is not very accurate, because a transition between populations is treated as just as likely as a long stretch of the same population. In reality, ancestry tends to persist over long stretches; therefore, we need a model that penalizes transitions more heavily and takes into account past trends. For these reasons, I implemented a Hidden Markov Model with one state for each HapMap population and decoded it with the Viterbi algorithm. The emission probabilities of this model change at each position of the sliding window to reflect the frequency of that window's marker sequence in each population. The Viterbi algorithm significantly improved the accuracy of my program's ancestry predictions.
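For readers unfamiliar with Viterbi decoding, here is a minimal log-space sketch with one state per population and a separate emission table per window position. It is not the project's code. The stay_prob value of 0.99, the uniform starting distribution, and the equal split of the remaining probability among the other states are all assumptions made for illustration; the write-up does not state the actual transition probabilities. All emission values must be strictly positive.

```python
import math


def viterbi(emissions, states, stay_prob=0.99):
    """Most likely state path given per-position emission probabilities.

    emissions: list of {state: probability > 0}, one dict per window position.
    states: list of at least two state names.
    stay_prob: assumed probability of remaining in the same state.
    """
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (len(states) - 1))

    def transition(prev, cur):
        return log_stay if prev == cur else log_switch

    # Initialise with a uniform prior over states.
    score = {s: math.log(1.0 / len(states)) + math.log(emissions[0][s])
             for s in states}
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in states:
            best_prev = max(states, key=lambda p: score[p] + transition(p, cur))
            new_score[cur] = (score[best_prev] + transition(best_prev, cur)
                              + math.log(emit[cur]))
            pointers[cur] = best_prev
        backpointers.append(pointers)
        score = new_score

    # Trace back from the best final state.
    path = [max(states, key=score.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    populations = ["CEU", "YRI", "JPTCHB"]
    clear = {"CEU": 0.5, "YRI": 0.1, "JPTCHB": 0.1}
    noisy = {"CEU": 0.2, "YRI": 0.3, "JPTCHB": 0.1}  # one window favours YRI
    print(viterbi([clear, clear, noisy, clear, clear], populations))
```

In the toy run, the third window on its own would be called YRI, but the cost of switching away from CEU and back outweighs the small emission advantage, so the decoded path stays in CEU throughout.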

Diagram of the Viterbi algorithm state machine:
viterbi.jpg

Latest results: These plots reflect which population the program estimates the mystery individual originating from for each sliding window position. A dot on line 3 indicates that at the window position given by the X axis, the individual is estimated to be Japanese/Chinese. A dot on line 2 indicates that the program classifies this position as Yoruban (African), and a dot on line 1 indicates that the program predicts this position to originate in a European population.

This plot shows the expected origin of a mystery individual who is composed of 1000 SNPs from the CEU population. Note that the origin of the individual is predicted at near 100% accuracy.

Ceu1k.jpg

Here is the ancestry estimation of a 3kb sequence from an individual who is 33% YRI and 66% CEU. The middle third of the sequence is YRI (line 2), which is flanked by CEU (line 1) regions. The program again manages to predict the ancestry of the individual with near-perfect accuracy.

ceuyriceuNew

Here is the ancestry estimation of an individual who is 33% JPTCHB, 33% YRI, and 33% CEU. The individual should have the first third on line 3, the second third on line 1, and the last third on line 2. The program also predicts the ancestry of this individual with very high accuracy.

jptchbyrinew

Finally, let's give the program a bigger challenge. The mystery genome sequence in this plot does not originate in any of the HapMap training populations. It belongs to a North Finnish person, whose lineage is considered a population isolate. This individual may have some haplotypes that do not occur in any HapMap population.

Fin1k.jpg

We see a misclassification at the start of this chromosomal segment where the Finnish person is classified as Japanese/Chinese. This error is due to the different haplotypes in the Finnish isolate. Thankfully, though, the transition probabilities of the Viterbi algorithm save us, so that the differences between Finnish and CEU people cannot accumulate enough probability mass to induce a spurious transition.

Week 8:
Goal for the week: actually do something
Did I complete the goal?: Yes, I wrote this entire program in one and a half days
Grade earned this week: A for the work, A+ in procrastination
Future plans: Run this program on a few other individuals and see what to do about the noisy data (noisy data fixed!)

Week 9:
Goal for this week: present the project and implement Dr. Eskin's critiques of my project
Did I complete this goal?: Yes, I gave a fairly decent presentation about the project, and I made the necessary alterations to my program.
Grade earned this week: A, I completed my goals
Future plans: Submit a write up and conduct a literature survey to see if I can turn this project into a student workshop publication

Week 10:
Goal for this week: Turn in program and writeup to Dr. Eskin

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License