PGP Harvard data in Google Cloud Storage

May 30, 2014

At PGP Harvard our participants are, by and large, very enthusiastic about understanding genetics and their own genomes. Many are programmers or researchers, and often both! It should come as no surprise that our staff are often asked “can I see more of the raw data?”

Some drives our genomes arrived on. Porsche design! That’s how you know it’s quality. © 2012 Alexander Wait Zaranek, released as CC-BY.

We’ve always wanted the entire “raw data” to be public, for participants and researchers alike. One issue that stymied us was the sheer size of the data: this sort of data is typically shipped on terabyte disks. I’m happy to share that we now have an answer and a place to find the data, although accessing it requires some familiarity with a command line interface and maybe a smidge of programming.

The full data sets PGP Harvard received from Complete Genomics are now shared on a public bucket on Google Cloud Storage, using credits generously donated by Google. Data is organized by huID.

The bucket: gs://pgp-harvard-data-public

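To access the bucket, you will need gsutil, Google’s command line tool for Cloud Storage. You should read about installing and using gsutil before trying the commands below.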

Some example commands

List contents of bucket top level:
gsutil ls gs://pgp-harvard-data-public

Recursively list contents of hu011C57 directory, with date and file size details:
gsutil ls -Rl gs://pgp-harvard-data-public/hu011C57

Download/copy the var file from hu011C57 Complete Genomics data to your current directory (234 MB):
gsutil cp gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/var-GS000015172-ASM.tsv.bz2 .

Copy the whole hu011C57 directory to your current directory, recursively and with multi-threading (40.8 GB):
gsutil -m cp -R gs://pgp-harvard-data-public/hu011C57 .
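
Since a single participant directory can run to tens of gigabytes, it may be worth checking that it fits on your disk before starting a recursive copy. Here is a minimal sketch of one way to do that. The helper names `remote_bytes` and `fits_on_disk` are made up for this example; the sketch assumes gsutil is installed and relies on `gsutil du -s` printing the total size in bytes as its first field.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; only gsutil, df and awk are real tools.

# Print the total size in bytes of everything under a gs:// URL.
remote_bytes() {
  gsutil du -s "$1" | awk '{print $1}'
}

# Succeed if at least <bytes> are free in <dir> (default: current directory).
# Usage: fits_on_disk <bytes> [dir]
fits_on_disk() {
  fod_needed=$1
  fod_dir=${2:-.}
  # df -Pk reports available space in 1024-byte blocks in the 4th column.
  fod_free_kb=$(df -Pk "$fod_dir" | awk 'NR==2 {print $4}')
  [ "$fod_needed" -le $((fod_free_kb * 1024)) ]
}

# Example (needs network access and gsutil):
#   url=gs://pgp-harvard-data-public/hu011C57
#   fits_on_disk "$(remote_bytes "$url")" . && gsutil -m cp -R "$url" .
```

The copy only starts if the size check succeeds, so a download that cannot fit is never begun.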

Use a Google Compute Engine VM to analyze the data

You can also access this data from virtual machines on Google Compute Engine, which could save you a lot of local disk space! Once you have a virtual machine you can, for example, use the Python Client Library to access the data automatically.
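
Whether on a virtual machine or your own computer, you can also look inside a file without keeping a copy of it, by streaming it through `gsutil cat`. The following is a rough sketch rather than an official recipe. The helper names `pgp_url` and `pgp_peek` are invented for this example, and it assumes gsutil and bzcat are both installed.

```shell
#!/usr/bin/env bash
# Bucket name is taken from this post; the helper names are hypothetical.
PGP_BUCKET="gs://pgp-harvard-data-public"

# Build the gs:// URL of a participant's directory from a huID.
pgp_url() {
  printf '%s/%s\n' "$PGP_BUCKET" "$1"
}

# Stream a bzip2-compressed file from the bucket, decompress it on the fly,
# and print its first lines without saving the file to disk.
# Usage: pgp_peek <gs-url> [line-count]   (default: 20 lines)
pgp_peek() {
  gsutil cat "$1" | bzcat | head -n "${2:-20}"
}

# Example (needs network access and gsutil):
#   pgp_peek "$(pgp_url hu011C57)/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/var-GS000015172-ASM.tsv.bz2"
```

Replacing `head` with your own filter (grep, awk, a script) lets you process the decompressed var file as a stream instead of storing it.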
