About Color Data

Color Data includes aggregated genetic and phenotypic information related to hereditary cancer from 50,000 affected and unaffected individuals who took a Color test. Here we describe the scientific methodology and design of Color Data and outline the steps taken to protect client privacy.

Data collection

A healthcare provider ordered a Color test for each individual. All phenotypic information was self-reported by the individual through an interactive online health history tool in their Color account. Individuals who reported more than one ancestry were counted as Multiple ethnicities, with the following exceptions:

any individuals who reported Ashkenazi Jewish in addition to any other ancestry were counted as Ashkenazi Jewish
any individuals who reported Hawaiian were counted as Pacific Islander
any individuals who reported any combination of Chinese, Japanese, Indian, Filipino, Hawaiian, Other Pacific Islander, or Other Asian and no other ancestry were counted as Asian, not specified
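The counting rules above can be sketched as a small function. This is an illustrative sketch only, not Color's implementation: the function name, the "Unknown" fallback for an empty report, and the choice to apply the exceptions in the order listed are assumptions. The order matters because Hawaiian appears in both the second and third exceptions; applied in listed order, the second exception takes precedence.

```python
# Ancestries named in the third exception (labels taken from the list above).
ASIAN_PACIFIC_GROUP = {
    "Chinese", "Japanese", "Indian", "Filipino",
    "Hawaiian", "Other Pacific Islander", "Other Asian",
}


def assign_ancestry(reported):
    """Collapse self-reported ancestries into one reporting category.

    Assumption: the exceptions are evaluated in the order they are listed.
    """
    reported = set(reported)
    if "Ashkenazi Jewish" in reported:
        return "Ashkenazi Jewish"
    if "Hawaiian" in reported:
        return "Pacific Islander"
    if len(reported) > 1 and reported <= ASIAN_PACIFIC_GROUP:
        return "Asian, not specified"
    if len(reported) > 1:
        return "Multiple ethnicities"
    # Single ancestry is kept as reported; "Unknown" is a hypothetical fallback.
    return next(iter(reported), "Unknown")
```

For example, under these assumptions `assign_ancestry({"Chinese", "Japanese"})` returns "Asian, not specified", while `assign_ancestry({"Ashkenazi Jewish", "Chinese"})` returns "Ashkenazi Jewish".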

Bioinformatics pipeline

Laboratory procedures were performed at the Color laboratory (Burlingame, CA) under CLIA (Clinical Laboratory Improvement Amendments, #05D2081492) and CAP (College of American Pathologists, #8975161) compliance as previously described¹. Briefly, genomic DNA was extracted from blood or saliva, enriched for select regions using SureSelect XT probes, and then sequenced using NextSeq 500/550 or NovaSeq 6000 instruments. Sequence reads were aligned against human genome reference GRCh37.p12, and variants were identified using a suite of bioinformatic tools designed to detect single nucleotide variants, small insertions and deletions, and large structural variants. Variants were classified according to the standards and guidelines for sequence variant interpretation of the American College of Medical Genetics and Genomics (ACMG)², and all variant classifications were signed out by a board-certified medical geneticist or pathologist. Variant classification categories are pathogenic (P), likely pathogenic (LP), variant of uncertain significance (VUS), likely benign (LB), and benign (B).

The genes in Color Data were selected based on 1) published evidence of association with hereditary cancer risk and 2) technical feasibility using the methods described above. These genes are:

APC, ATM, BAP1, BARD1, BMPR1A, BRCA1, BRCA2, BRIP1, CDH1, CDK4, CDKN2A (p14ARF and p16INK4a), CHEK2, EPCAM, GREM1, MITF, MLH1, MSH2, MSH6, MUTYH, NBN, PALB2, PMS2, POLD1, POLE, PTEN, RAD51C, RAD51D, SMAD4, STK11, and TP53

Analysis, variant calling, and reporting focused on the complete coding sequence and adjacent intronic sequence of the primary transcript(s) (CSV) unless otherwise indicated. In PMS2, exons 12-15 were not analyzed. In several genes, only specific positions known to impact cancer risk were analyzed (genomic coordinates in GRCh37):

CDK4 - only chr12:g.58145429-58145431 (codon 24)
MITF - only chr3:g.70014091 (including c.952G>A)
POLD1 - only chr19:g.50909713 (including c.1433G>A)
POLE - only chr12:g.133250250 (including c.1270C>G)
EPCAM - only large deletions and duplications including 3’ end of the gene
GREM1 - only duplications in the upstream regulatory region
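For users who want to check whether a position of interest falls inside the restricted analysis scope, the coordinate-based entries above can be written as a lookup table. This is a minimal sketch: the table and function names are hypothetical, and EPCAM and GREM1 are omitted because their restrictions are defined by event type rather than by coordinates.

```python
# GRCh37 coordinates copied from the list above: gene -> (chromosome, start, end).
RESTRICTED_POSITIONS = {
    "CDK4": ("chr12", 58145429, 58145431),     # codon 24
    "MITF": ("chr3", 70014091, 70014091),      # includes c.952G>A
    "POLD1": ("chr19", 50909713, 50909713),    # includes c.1433G>A
    "POLE": ("chr12", 133250250, 133250250),   # includes c.1270C>G
}


def in_restricted_scope(gene, chrom, pos):
    """Return True/False for position-restricted genes, None for other genes."""
    region = RESTRICTED_POSITIONS.get(gene)
    if region is None:
        # Gene is not position-restricted in this table; scope is not decided here.
        return None
    region_chrom, start, end = region
    return chrom == region_chrom and start <= pos <= end
```

A call such as `in_restricted_scope("CDK4", "chr12", 58145430)` returns True, whereas a CDK4 position outside codon 24 returns False.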

Architecture and implementation

Color Data is powered by Metabase, an open source data analysis tool developed by Metabase Inc. and licensed under the AGPL v3. It runs on a dedicated site and accesses Google BigQuery via its REST API over HTTPS.

The database URL includes a version (v) identifier that is assigned in increasing order and corresponds to new developments in the database. A new version will be assigned when there are significant changes to the data (in quantity or composition), inputs and outputs, filters, or other functionality. Users who cite the database should include the version identifier from which they derived their results, as queries may change between versions. Importantly, the data and functionality within a version will remain fixed so that queries may be reproduced and replicated regardless of the current version.

v1: released October 18, 2018; data collection April 2015 through September 2018

Web interface

Filter categories are combined with AND logic, and filter values within a category are combined with OR logic. Users can select filter values from the dropdown list or by typing with autocomplete, with the exception of Variant filter values, which can only be selected by typing with autocomplete using HGVS nomenclature. Furthermore, any query whose results would yield information about fewer than 5 individuals will generate the following error message: Too few individuals in the Color Data population match this query to return results.
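The filter semantics and the minimum-count rule can be illustrated with a short sketch. It is not the Metabase or Color implementation: the record fields ("gene", "classification"), the function names, and the use of an exception to carry the error message are assumptions, and the real check may cover more than the simple match count used here.

```python
MIN_INDIVIDUALS = 5
ERROR_MESSAGE = (
    "Too few individuals in the Color Data population "
    "match this query to return results."
)


def matches(record, filters):
    """AND across categories, OR across the values within each category."""
    return all(
        record.get(category) in values
        for category, values in filters.items()
    )


def run_query(records, filters):
    """Return the number of matching individuals, or refuse small results."""
    count = sum(1 for record in records if matches(record, filters))
    if count < MIN_INDIVIDUALS:
        raise ValueError(ERROR_MESSAGE)
    return count
```

With `filters = {"gene": {"BRCA1", "BRCA2"}, "classification": {"P", "LP"}}`, a record matches only if its gene is BRCA1 or BRCA2 and its classification is P or LP.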

Full results can be downloaded in tabular form as csv, xlsx, or json files directly from the query/results page, so that users can store them permanently on their own computers. Queries and results can be shared via email or social media, including Facebook and Twitter, through integrated share buttons.

Privacy

To help protect the privacy of individuals whose information is included in Color Data, all information in the database is de-identified in compliance with the HIPAA Privacy Rule and is returned in aggregate. We took additional steps to limit re-identification of a single individual while still maintaining the power of aggregate and statistical database queries. These precautions were largely inspired by the literature on statistical databases³,⁴, differential privacy⁵,⁶, and Hippocratic databases⁷. Query filters such as age are quantized into five-year buckets, and every query must match ≥ 5 individuals; otherwise, no results are returned and an error message is generated. Taken together, these restrictions can help to stymie some common techniques used to re-identify individuals in de-identified, aggregate data sets. Finally, all queries in the database and their source IP addresses are logged to detect, and potentially block, users who are making many suspiciously overlapping queries.
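As an illustration of the quantization step, the sketch below maps an exact age to a five-year bucket. The bucket boundaries (starting at 0, labeled as inclusive ranges) and the function name are assumptions; the text above specifies only the five-year width.

```python
def age_bucket(age, width=5):
    """Map an exact age in years to a coarse bucket label such as '40-44'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    low = (age // width) * width  # lower bound of the bucket containing age
    return f"{low}-{low + width - 1}"
```

Under these assumptions, ages 40 through 44 all map to the same label, so a filter on age cannot single out one exact year.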

References

1. Crawford B, Adams SB, Sittler T, et al. Multi-gene panel testing for hereditary cancer predisposition in unsolved high-risk breast and ovarian cancer patients. Breast Cancer Res Treat. 2017;163(2):383-390.
2. Richards S, Aziz N, Bale S, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015;17(5):405-424.
3. Adam NR. Security-Control Methods for Statistical Databases: A Comparative Study. ACM Computing Surveys. 1989;21(4).
4. Denning DE. Secure statistical databases with random sample queries. ACM Transactions on Database Systems (TODS). 1980;5(3):291-315.
5. Dinur I, Nissim K. Revealing Information While Preserving Privacy. In: Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. PODS ’03. New York, NY, USA: ACM; 2003:202-210.
6. Dwork C, McSherry F, Nissim K, Smith A. Calibrating Noise to Sensitivity in Private Data Analysis. In: Theory of Cryptography. Springer Berlin Heidelberg; 2006:265-284.
7. Agrawal R, Kiernan J, Srikant R, Xu Y. Chapter 14 - Hippocratic Databases. In: VLDB ’02: Proceedings of the 28th International Conference on Very Large Databases. San Francisco: Morgan Kaufmann; 2002:143-154.