Project Leader

Steven Snyder

Project Description

The goal of this project is to develop software for predicting unobserved genotype data using correlations found in the HapMap data. The software takes as input the reference data (or, if none is supplied, the HapMap project data) and an incomplete set of SNPs. It produces as output a complete predicted genotype based on SNP correlations from the reference data. The program is written in C++, and the code has been released under the GNU General Public License.
An overview of the project is in this paper.
The current version is: 0.62. Get it here.
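
To make the input/output description above concrete, here is a minimal sketch of correlation-based imputation. It is not the algorithm used in the released code. It assumes haplotypes are stored as vectors of 0/1 alleles, uses a hypothetical MISSING marker (-1) for unobserved SNPs, and fills each gap with the majority allele among reference haplotypes that agree with the target at the most correlated observed SNP. The names rSquared, majorityGiven, and imputeSketch are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Haplotype = std::vector<int>;  // alleles coded 0 or 1 (assumed layout)
const int MISSING = -1;              // hypothetical marker for an unobserved SNP

// r^2 between SNP columns i and j of the reference panel; 0 if undefined.
double rSquared(const std::vector<Haplotype>& ref, std::size_t i, std::size_t j) {
    double n = static_cast<double>(ref.size());
    double ci = 0, cj = 0, cij = 0;
    for (const Haplotype& h : ref) {
        ci += h[i];
        cj += h[j];
        cij += h[i] * h[j];
    }
    double pA = ci / n, pB = cj / n;
    double d = cij / n - pA * pB;
    double denom = pA * (1 - pA) * pB * (1 - pB);
    return denom > 0 ? d * d / denom : 0.0;
}

// Majority allele at SNP j among reference haplotypes whose allele at SNP
// `cond` equals `allele`. Ties go to allele 0. Returns -1 if none qualify.
int majorityGiven(const std::vector<Haplotype>& ref, std::size_t j,
                  std::size_t cond, int allele) {
    int ones = 0, total = 0;
    for (const Haplotype& h : ref) {
        if (h[cond] != allele) continue;
        ones += h[j];
        ++total;
    }
    if (total == 0) return -1;
    return 2 * ones > total ? 1 : 0;
}

// Fill every MISSING entry of `target` using the reference panel `ref`.
Haplotype imputeSketch(const std::vector<Haplotype>& ref, const Haplotype& target) {
    assert(!ref.empty());
    for (const Haplotype& h : ref) assert(h.size() == target.size());

    Haplotype out = target;
    for (std::size_t j = 0; j < target.size(); ++j) {
        if (target[j] != MISSING) continue;

        // Find the observed SNP with the highest r^2 to SNP j.
        bool found = false;
        std::size_t best = 0;
        double bestR2 = 0.0;
        for (std::size_t i = 0; i < target.size(); ++i) {
            if (i == j || target[i] == MISSING) continue;
            double r2 = rSquared(ref, i, j);
            if (!found || r2 > bestR2) {
                best = i;
                bestR2 = r2;
                found = true;
            }
        }

        int guess = found ? majorityGiven(ref, j, best, target[best]) : -1;
        if (guess < 0) {
            // Fallback: overall majority allele at SNP j in the reference.
            int ones = 0;
            for (const Haplotype& h : ref) ones += h[j];
            guess = 2 * ones > static_cast<int>(ref.size()) ? 1 : 0;
        }
        out[j] = guess;
    }
    return out;
}
```

Calling imputeSketch(ref, target) returns a copy of target with no MISSING entries left. Conditioning on a single SNP is a deliberate simplification to keep the sketch short.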

Related Papers


March 30 2009

The permanent homepage for this project is []

June 5 2008

Finished overview paper. Get it here.

June 4 2008

Final draft of presentation on this project here.
Current version: 0.62. Get it here.

  • Improved error rate to 6%.

May 24 - June 1 2008 - Week Summary/Updates

Current version: 0.61. Get it here.

The program is feature-complete as a first draft. In current testing the error rate is 6 to 7%, and there is still room for improvement.

  • What I did this week:
      • Wrote the first draft of my presentation slides (see them here)
      • Revised some functions to improve performance and accuracy
      • Cleaned up most of the code
      • Changed the imputation algorithm slightly
  • Plans for next week: Finish cleaning up the code. Present the project.
  • How what I did compared to what I planned to do: Everything went as planned.
  • Self-determined grade for this week: A

May 25 2008

  • Improved error rate to 7%.

May 24 2008

Current version: 0.20. Get it here.

  • Imputation was broken. Got it working again.
  • Fixed a few other bugs.
  • Test scripts and binary now included in archive.
  • Current error rate on a single chromosome haplotype with 5% randomly missing SNPs is 11.5%.

May 16 - May 23 2008 - Week Summary/Updates

Current version: 0.18. Get it here.

I have decided to not make a graphical representation of the data. Since imputation is a very general procedure, I have decided to keep my application in a general purpose form for now. I don't think it would be worth the large effort to generate pretty graphs when there is nothing wrong with using the text file to view the results. In most cases the results will be interpreted by a program since there is such a large number of SNPs anyway.

  • What I did this week:
      • Implemented basic imputation.
      • Revised some algorithms to improve performance and accuracy.
      • Added progress bar for imputation computation.
      • Fixed a large number of bugs.
  • Plans for next week: Obtain reasonable accuracy for imputation. The current algorithm is only slightly better than random choice.
  • How what I did compared to what I planned to do: Everything went as planned.
  • Self-determined grade for this week: A

May 21 2008

Current version: 0.16. Get it here.

  • Implemented basic (crude!) imputation on missing SNPs in input file.
  • Generated output in raw haplotype data format as well as readable format with imputation information.
  • Created test scripts to randomly delete SNPs from known haplotypes, then impute on the perturbed haplotypes to test the error rate of the imputation. I will post some results when I have them available.
  • Fixed a few bugs.
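
The delete-then-impute test described above can be sketched roughly as follows. This is an illustration only, not the project's actual test scripts. It assumes a haplotype is a vector of 0/1 alleles and uses a hypothetical MISSING marker (-1); maskRandom and errorRate are made-up names. The error rate is counted only over the SNPs that were hidden.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

using Haplotype = std::vector<int>;  // alleles coded 0 or 1 (assumed layout)
const int MISSING = -1;              // hypothetical marker for a deleted SNP

// Return a copy of `truth` with floor(fraction * size) randomly chosen SNPs
// replaced by MISSING. The seed makes a test run repeatable.
Haplotype maskRandom(const Haplotype& truth, double fraction, unsigned seed) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    std::vector<std::size_t> idx(truth.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::mt19937 rng(seed);
    std::shuffle(idx.begin(), idx.end(), rng);

    std::size_t k = static_cast<std::size_t>(fraction * truth.size());
    Haplotype masked = truth;
    for (std::size_t n = 0; n < k; ++n) masked[idx[n]] = MISSING;
    return masked;
}

// Fraction of hidden SNPs that were imputed incorrectly.
// Returns 0 when nothing was hidden.
double errorRate(const Haplotype& truth, const Haplotype& masked,
                 const Haplotype& imputed) {
    assert(truth.size() == masked.size() && truth.size() == imputed.size());
    std::size_t hidden = 0, wrong = 0;
    for (std::size_t s = 0; s < truth.size(); ++s) {
        if (masked[s] != MISSING) continue;
        ++hidden;
        if (imputed[s] != truth[s]) ++wrong;
    }
    return hidden ? static_cast<double>(wrong) / hidden : 0.0;
}
```

A test run would mask a known haplotype, pass the masked copy to the imputation routine, and then call errorRate with the original, the masked copy, and the imputed result.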

May 10 - May 15 2008 - Week Summary/Updates

Current version: 0.05. Get it here.

  • What I did this week: Fixed a fatal buffer overrun in the interface. Under the older Linux kernel I was using, the overrun did not cause a segmentation fault, but after I upgraded my distribution the program no longer ran, so I tracked the problem down and fixed it. Started implementing imputation functionality and working out the statistical formulas for it. Also started looking into GUI options. For the GUI I think I will make a web page front end with a Java-based viewer for the imputation results, but I'm still looking into this, so plans may change.
  • Plans for next week: Finish implementing basic imputation.
  • How what I did compared to what I planned to do: Didn't get to fully implement imputation, but I'm making good progress.
  • Self-determined grade for this week: B+

May 13 2008

Revised the code for the legend and haplotype file parsers:

  • File input is now much more robust and should catch most malformed files instead of producing erroneous data structures and/or crashing.
  • The parsers now work properly on the HapMap phased haplotype files in the format they are available in from [].
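
As an illustration of the kind of checking described above, here is a minimal sketch of a validating parser for one haplotype line. It is not the project's parser. It assumes each line holds whitespace-separated alleles coded 0 or 1, and that the expected SNP count comes from the legend file. parseHaplotypeLine is a hypothetical name.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

using Haplotype = std::vector<int>;  // alleles coded 0 or 1 (assumed layout)

// Parse one line of a phased haplotype file. Throws std::runtime_error on
// any token other than "0" or "1", or if the allele count does not match
// the number of SNPs listed in the legend.
Haplotype parseHaplotypeLine(const std::string& line, std::size_t expectedSnps) {
    std::istringstream in(line);
    Haplotype h;
    std::string tok;
    while (in >> tok) {
        if (tok == "0") {
            h.push_back(0);
        } else if (tok == "1") {
            h.push_back(1);
        } else {
            throw std::runtime_error("unexpected allele token: " + tok);
        }
    }
    if (h.size() != expectedSnps) {
        throw std::runtime_error("expected " + std::to_string(expectedSnps) +
                                 " alleles, found " + std::to_string(h.size()));
    }
    return h;
}
```

Rejecting a bad line with an exception, rather than storing a partial haplotype, keeps malformed input from turning into a wrong data structure further down the pipeline.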

Current version: 0.04. Get it here.

May 2 - May 9 2008 - Week Summary/Updates

The current version of the source code is available here.

  • What I did this week: Greatly improved robustness and error-checking in file parsing code. Added correlation functionality. The software will now calculate r^2 values for all the SNPs in the reference haplotype data.
  • Plans for next week: Implement basic imputation and start working on graphical representation of results.
  • How what I did compared to what I planned to do: Everything went as planned.
  • Self-determined grade for this week: A
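
For reference, the r^2 value between two biallelic SNPs can be computed from phased haplotypes as D^2 / (pA(1-pA)pB(1-pB)), where D = pAB - pA*pB. The sketch below shows that standard formula under the assumption that haplotypes are stored as vectors of 0/1 alleles. The function name and data layout are illustrative, not taken from the project's source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Haplotype = std::vector<int>;  // alleles coded 0 or 1 (assumed layout)

// r^2 between SNP columns i and j across all reference haplotypes.
// Returns 0 when either SNP is monomorphic, where r^2 is undefined.
double rSquared(const std::vector<Haplotype>& ref, std::size_t i, std::size_t j) {
    assert(!ref.empty());
    double n = static_cast<double>(ref.size());
    double ci = 0, cj = 0, cij = 0;
    for (const Haplotype& h : ref) {
        assert(i < h.size() && j < h.size());
        ci += h[i];           // count of allele 1 at SNP i
        cj += h[j];           // count of allele 1 at SNP j
        cij += h[i] * h[j];   // count of haplotypes carrying allele 1 at both
    }
    double pA = ci / n, pB = cj / n, pAB = cij / n;
    double d = pAB - pA * pB;  // linkage disequilibrium coefficient D
    double denom = pA * (1 - pA) * pB * (1 - pB);
    return denom > 0 ? (d * d) / denom : 0.0;
}
```

Computing this for all SNP pairs is quadratic in the number of SNPs, so in practice one might restrict the calculation to pairs within some window of each other.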

April 26-May 1 2008 - Week Summary/Updates

I have finished the parsing functions for HapMap style haplotype data. The current version of the source code is available here.
The software reads a legend file, then the phased haplotype data from the HapMap project, and finally a file containing a single haplotype on which to perform imputation.

  • What I did this week: Parsed HapMap haplotype data. Found IMPUTE, an existing SNP imputation utility.
  • Plans for next week: Clean up code, improve robustness of file parsing, start working on calculating correlation data from the haplotypes.
  • How what I did compared to what I planned to do: Everything went as planned.
  • Self-determined grade for this week: A

April 30 2008

I have downloaded the HapMap haplotype data for the YRI and CEU groups. The JPT+CHB haplotypes were removed from the HapMap site on March 8, 2008, so I was unable to download them. Apparently they are suspected to be incorrect.
I started writing the parser for the HapMap data. So far I have implemented the basic classes and structures to store the information in the program, and have completed the parser for the haplotype legend files. The next step will be to use the legend to parse the haplotypes themselves.
I decided to write all of the code in C++ for a few reasons. One is to keep the performance and memory overhead of the program reasonable, as the data set and the number of operations are extremely large. Another major reason is to make it feasible to integrate the functionality of this program into GenLib.
I considered using Java as it would offer generous portability, but the modular nature of my C++ code should make it fairly simple to cross-compile with a small number of minor changes.

April 20-25 2008 - Week Summary

  • What I did this week: Created the project's wiki page and made a schedule for the quarter.
  • Plans for next week: Continue as planned according to the project schedule.
  • How what I did compared to what I planned to do: Everything went as planned.
  • Self-determined grade for this week: A

The goal of this quarter

The goal of this project is to develop imputation software for genetic data. See Project Description for more information.

The Schedule for the quarter

  • Week 4: Set up the project wiki page
  • Week 5: Look at existing imputation software and start working on parsing HapMap data
  • Week 6: Finish parsing, and get correlations from HapMap data
  • Week 7: Implement rough draft of imputation functionality
  • Week 8: Improve imputation functionality, work on graphical representation of data
  • Week 9: Fix bugs, continue to improve imputation accuracy and graphical representation
  • Week 10: Present working copy of software to the class
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License