
PGP18: A 23andme exome

May 3, 2012

We’re happy to announce that the Personal Genome Project has received its first donated 23andme exome from a participant! As with the genotyping data acquired from direct-to-consumer testing companies, the PGP also welcomes donations of larger data sets like genomes and exomes. When assigning PGP nicknames (like “PGP1”), we have decided they should go to individuals who have exome or genome data hosted in the PGP, whether sequenced by us or donated to us. Thus, hu97DB4A now has the nickname “PGP18”!

What is exome data?

In my earlier post “The Whole Two Yards” I explained that the PGP is interested in whole genomes rather than genotyping. Exome sequencing is a third category of DNA analysis. So what is an exome?

“Exome sequencing” refers to something much like genome sequencing, but limited to the regions in the genome that code for proteins (the “exons” of genes). Proteins perform most of the functions in the cell: from structures to enzymes to relaying signals, proteins are the workhorses of biology. Thus, most known genetic variations that have significant effects are the result of changes within exons — changes that disrupt the resulting protein coded by a gene. Surprisingly, these regions only account for 1% of the genome! Because of this, there has been some focus on targeted sequencing of only exons — hopefully getting almost as much useful sequence for a fraction of the cost.

Genes contain protein coding regions (“exons”) interspersed with large non-coding gaps (“introns”). To save money, exome sequencing targets and sequences only the exons in the genome — thereby focusing on the regions most likely to have variations that affect traits. [Image by User:Daycd on, shared as CC-BY-SA]

In the end, isolating exons is difficult, so “exomes” aren’t that much cheaper than whole genomes (maybe 2-3x cheaper, not 100x)… but it’s still useful to have the cheaper option.1 23andme recently started a pilot exome sequencing service (notably, they provide no interpretation of the data), and some PGP participants have signed up for it.

Addition of VCF interpretation in GET-Evidence

23andme provided the participant with both a VCF file and individual read data (in the form of a “BAM” file). Personally I’m not a fan of the VCF format for personal genomes, mainly because it fails to report which regions are confidently called as “matching reference”. (What this means is that, if a variant isn’t listed in the file, you can’t tell whether (a) you don’t have it, or (b) that region simply wasn’t well covered.)
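To make that ambiguity concrete, here is a minimal Python sketch. It assumes a hypothetical companion list of confidently called regions (the sort of thing a BED file could carry). A plain variant-only VCF has nothing like it, and the function and variable names are invented for illustration.

```python
def classify_position(chrom, pos, variant_positions, callable_regions=None):
    """Say what a variant-only call set tells us about one position.

    variant_positions: set of (chrom, pos) pairs listed in the VCF.
    callable_regions:  optional list of (chrom, start, end) tuples, 1-based
                       and inclusive, marking confidently called regions.
                       Hypothetical: a plain VCF does not include this.
    """
    if (chrom, pos) in variant_positions:
        return "variant"
    if callable_regions is None:
        # Absence from the VCF is ambiguous without coverage information.
        return "unknown"
    for region_chrom, start, end in callable_regions:
        if region_chrom == chrom and start <= pos <= end:
            return "matches reference"
    return "no-call"


variants = {("chr1", 1500)}          # made-up example data
regions = [("chr1", 1000, 2000)]     # made-up example data

print(classify_position("chr1", 1500, variants))           # variant
print(classify_position("chr1", 1200, variants))           # unknown
print(classify_position("chr1", 1200, variants, regions))  # matches reference
print(classify_position("chr1", 5000, variants, regions))  # no-call
```

With the VCF alone, the second lookup can only answer “unknown”; the region list is what separates case (a) from case (b).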

That said, VCF is a very common format, so I’ve finally added the ability to interpret VCF to GET-Evidence. I ran the exome data through GET-Evidence and did a little bit of additional interpretation (as with other whole genome reports, this interpretation is far from complete). You can visit the report on GET-Evidence, and if you’d like a copy of the VCF file itself, it is linked at the top of the report as “source data”. We’re hoping to reprocess the BAM files to produce higher quality reports and publicly host these larger files as well. For now, though, we’re able to immediately accept and interpret VCF files.
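For readers who haven’t opened a VCF file: each data line is tab-separated, with fixed columns for chromosome, position, reference allele and alternate alleles, followed by per-sample genotype fields. A rough Python sketch of reading one genotype might look like the following. This is not GET-Evidence’s actual code; the helper name is made up and the example line is invented.

```python
import re


def parse_genotype(vcf_line):
    """Return (chrom, pos, [alleles]) for the first sample on a VCF data line.

    Assumes the standard column order (CHROM POS ID REF ALT QUAL FILTER INFO
    FORMAT SAMPLE...) and that GT is listed in the FORMAT column.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    alleles = [ref] + alt.split(",")  # index 0 = REF, 1.. = ALT alleles
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    gt = sample_values[format_keys.index("GT")]
    # GT looks like "0/1" (unphased) or "0|1" (phased); "." means no call.
    called = [None if i == "." else alleles[int(i)]
              for i in re.split(r"[/|]", gt)]
    return chrom, pos, called


# Invented example line: a heterozygous A/G call at chr1:1500.
line = "chr1\t1500\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30"
print(parse_genotype(line))  # ('chr1', 1500, ['A', 'G'])
```

Note that nothing in this parsing step can recover positions that are simply absent from the file, which is the limitation described above.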

Donation of genetic data is very valuable to our project. Hopefully we’ll see other 23andme exomes donated by participants in the future!

1 Exomes have other issues that make them less desirable: coverage varies extremely widely, and they are difficult to use for detecting larger structural variations (like large deletions or duplications of regions). The PGP does whole genome sequencing because we wish to collect the best data possible, and we feel that a full genome’s data is worth the 2-3x higher cost.

  1. Manuel Corpas
    May 8, 2012 9:19 am

    Hi Madeleine

    I would be happy to donate my personal exome, which I had sequenced by the BGI, but I do not know if this is pertinent for the PGP. I have already written extensively about my experiences on my blog and have made its FASTQ, VCF and BAM files available for public download. You can find them here:

  2. yankeelaker
    May 4, 2012 1:56 pm

    Came across a pretty interesting (free) iPad app called MyGenome, by Illumina. It is ultimately intended to be a tool for browsing the results of Illumina’s complete genome sequencing, but for now, to let you get a feel for the product, it just lets you explore the genome of Illumina’s president. Be sure to download the reference genome the first time you open the app. That takes 10 or 15 minutes and is necessary. I found it to be pretty cool! (I am not associated with the company or its affiliates.)


  3. May 3, 2012 9:33 pm

    Actually Madeleine, I’m totally with you on generally just creating ranges to represent those spots that were adequately sequenced, without reporting the actual base calls at those positions, because they’re not really needed.

    However, there can be useful information in the gory details of a ref/ref call. It all depends on whether that particular location has any interpretable annotations you could associate with it.

    A simple example would be a polymorphic site in a repetitive region associated with a certain disorder that runs in your family. It could be useful to look at the more extensive base-calling information even for a ref/ref call to make sure the position was called accurately given that you have information that makes that site more interesting than the average reference-called base.

  4. May 3, 2012 6:38 pm

    Thanks for posting this. Here at 23andMe we’re excited to hear how our exome pilot participants are using their data.

    I’d like to point out that the VCF that you analyzed is a preliminary release. As the project progresses and we receive more data, we will improve upon this initial analysis (e.g. cross-sample variant calling, better variant filtering, etc.). We do generate ref/ref calls as @mjcgenetics described, but the file is way too bulky to be useful. For the final release we will probably take a hybrid approach of providing a BED describing the regions confidently called along with a VCF with ref/ref calls at specific positions of interest. If you have specific recommendations, we’d love to hear them.

    @The Kick-Off
    The data you downloaded is from our genotyping chip, whereas the exome pilot uses next-generation sequencing to give a high-resolution view of a small (but important) part of your genome. The intake for our exome pilot is closed and we currently don’t have a date for a sequencing product launch, but you should stay tuned to @23andMe for news.

  5. May 3, 2012 4:40 pm

    Ah, I oversimplified a bit: ideally I would have a file that returns reference calls in a *compact format*. Complete Genomics does this in their files by reporting “start” and “end” for a region annotated as “matches reference”. Other regions have a “start” and “end” and are annotated as “no-call”. We also include such reference-matching-region annotation, when available, in the GFF files outputted in GET-Evidence (although I don’t recommend GFF as a personal genome format either). The result is only a 2-3x larger file.

    In short: there’s no reason to require gigabytes to return this information.

    Why even bother with base-by-base? I can answer that myself: reporting the region on the whole — rather than base-by-base — leaves out some potentially useful detailed information. In particular, each base will have individual stats like “number of reads covering each position”.

    That said, I think it would be possible to make a decision on whether a region is confidently matching reference, include some “average coverage” and “minimum coverage” stats for the region, and call it a day. To be honest, I’m not planning to look at these coverage stats: errors can happen for a lot of reasons, and high coverage isn’t proof that something isn’t an error (in fact, very high coverage is a bad sign). What I really want to do is simply trust whoever ran the variant calling software (e.g. GATK) to use good settings and make good calls, and then take those calls to my interpretation pipeline. 🙂

    I think VCF *could* be adapted to report reference regions in this manner (the INFO field seems flexible, it could be used to define an “end”?). It might be a bit of a hack and I haven’t seen anyone doing it yet — so I’d be interested to hear of an example, if someone knows of one.

  6. May 3, 2012 3:45 pm

    I saw this posted at 23andMe and wanted to add a small correction.

    VCF can certainly be used to report reference calls. It can be used to report calls at any position, including “no-calls”, in fact, if you wanted that for the entire genome.

    23andMe’s VCF files don’t include the reference calls for every position no doubt because the resulting file would be enormous (think 50 million+ lines with a hundred+ characters each–you’re talking gigabytes).

    However, if you want the reference calls, you can take the BAM file and run it through the GATK pipeline using the “--output_mode EMIT_ALL_SITES” parameter. As you suggest, these reference called sites can be valuable for interpretation, so I encourage people sequencing their own exomes or genomes to do so.

    More detailed info on how to do that is here:

  7. May 3, 2012 3:28 pm

    Is the 23andMe Exome data a new and unique program they’re providing? I signed up for the complete 23andMe package back in 2010 and have access to a raw data file in a .TXT format (which I’ve uploaded to my PGP profile). Is exome data something distinctly different? Would I need to purchase a separate package from 23andMe in order to get it, or can the VCF file be generated using the data they’ve already collected?

Comments are closed.
