Color Data includes aggregated genetic and phenotypic information related to hereditary cancer from 50,000 affected and unaffected individuals who took a Color test. Here we describe the scientific methodology and design of Color Data and outline the steps taken to protect client privacy.
A healthcare provider ordered a Color test for each individual. All phenotypic information was self-reported by the individual through an interactive online health history tool in her or his Color account. Individuals who reported more than one ancestry were counted as Multiple ethnicities with the following exceptions:
Laboratory procedures were performed at the Color laboratory (Burlingame, CA) under CLIA (Clinical Laboratory Improvement Amendments: #05D2081492) and CAP (College of American Pathologists #8975161) compliance as previously described1. Briefly, genomic DNA was extracted from blood or saliva, enriched for select regions using SureSelect XT probes, and then sequenced using NextSeq 500/550 or NovaSeq 6000 instruments. Sequence reads were aligned against human genome reference GRCh37.p12, and variants were identified using a suite of bioinformatic tools designed to detect single nucleotide variants, small insertions and deletions, and large structural variants. Variants were classified according to the standards and guidelines for sequence variant interpretation of the American College of Medical Genetics and Genomics (ACMG)2, and all variant classifications were signed out by a board-certified medical geneticist or pathologist. Variant classification categories are pathogenic (P), likely pathogenic (LP), variant of uncertain significance (VUS), likely benign (LB), and benign (B).
The genes in Color Data were selected based on 1) published evidence of association with hereditary cancer risk and 2) technical feasibility using the methods described above. These genes are:
APC, ATM, BAP1, BARD1, BMPR1A, BRCA1, BRCA2, BRIP1, CDH1, CDK4, CDKN2A (p14ARF and p16INK4a), CHEK2, EPCAM, GREM1, MITF, MLH1, MSH2, MSH6, MUTYH, NBN, PALB2, PMS2, POLD1, POLE, PTEN, RAD51C, RAD51D, SMAD4, STK11, and TP53
Analysis, variant calling, and reporting focused on the complete coding sequence and adjacent intronic sequence of the primary transcript(s) (CSV) unless otherwise indicated. In PMS2, exons 12-15 were not analyzed. In several genes, only specific positions known to impact cancer risk were analyzed (genomic coordinates in GRCh37):
Color Data is powered by Metabase, an open source data analysis tool developed by Metabase Inc. and licensed under the AGPL v3. It runs on a dedicated site and accesses Google BigQuery via its REST API over HTTPS.
The database URL includes a version (v) identifier that is assigned in increasing order and corresponds to new developments in the database. A new version will be assigned when there are significant changes to the data (in quantity or composition), inputs and outputs, filters, or other functionality. Users who cite the database should include the version identifier from which they derived their results, because query results may differ between versions. Importantly, the data and functionality within a version will remain fixed so that queries may be reproduced and replicated regardless of the current version.
Filter categories use AND logic, and filter values within a category use OR logic. Users can select filter values from the dropdown list or by typing text with autocomplete, with the exception of the Variant filter, whose values can only be selected by typing text with autocomplete using HGVS nomenclature. Furthermore, any query whose results would yield information about < 5 individuals will generate the following error message: Too few individuals in the Color Data population match this query to return results.
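The filter semantics described above can be illustrated with a minimal Python sketch. This is not Color's implementation: the record layout, the function names `matches` and `run_query`, and the use of an exception to carry the error message are all assumptions made for illustration. Only the AND/OR logic, the five-individual threshold, and the message text come from the description above.

```python
from typing import Dict, List, Set

# Threshold and message taken from the text; everything else is hypothetical.
MIN_INDIVIDUALS = 5
TOO_FEW_MESSAGE = (
    "Too few individuals in the Color Data population "
    "match this query to return results."
)


def matches(record: Dict[str, str], filters: Dict[str, Set[str]]) -> bool:
    """Return True if the record satisfies every filter category (AND)
    by holding any one of the values selected in that category (OR)."""
    return all(
        record.get(category) in allowed
        for category, allowed in filters.items()
    )


def run_query(records: List[Dict[str, str]],
              filters: Dict[str, Set[str]]) -> int:
    """Count matching individuals, refusing to answer below the threshold."""
    matched = sum(1 for record in records if matches(record, filters))
    if matched < MIN_INDIVIDUALS:
        raise ValueError(TOO_FEW_MESSAGE)
    return matched
```

For example, a query with `{"gene": {"BRCA1", "BRCA2"}, "sex": {"Female"}}` counts individuals whose gene is BRCA1 or BRCA2 and whose sex is Female. If fewer than five individuals match, the sketch raises the error instead of returning a count.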
Full results can be downloaded in csv, xlsx, and json format directly from the query/results page, so that users can store them permanently on their own computer in tabular format. Queries and results can be shared via email or social media, including Facebook and Twitter, through integrated share buttons.
To help protect the privacy of individuals whose information is included in Color Data, all information in the database is de-identified in compliance with the HIPAA Privacy Rule and is returned in aggregate. We took additional steps to limit re-identification of a single individual, while still maintaining the power of aggregate and statistical database queries. These precautions were largely inspired by the literature on statistical databases3,4, differential privacy5,6, and Hippocratic databases7. Query filters such as age are quantized into five-year buckets, and every query is required to match ≥ 5 individuals; otherwise, no results are returned and an error message is generated. Taken together, these restrictions can help to stymie some common techniques used to re-identify individuals in de-identified, aggregate data sets. Finally, all queries in the database and their source IP addresses are logged to detect, and potentially block, users who are making many suspiciously overlapping queries.
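As a rough illustration of the age quantization mentioned above, the following Python sketch maps an exact age to a five-year bucket label. The bucket boundaries (starting at multiples of five, such as 40-44), the label format, and the function name `age_bucket` are assumptions. The text specifies only that buckets span five years.

```python
def age_bucket(age: int, width: int = 5) -> str:
    """Map an exact age to a coarse bucket label, e.g. 42 -> "40-44".

    Assumes buckets start at multiples of `width`; the real boundaries
    used by Color Data are not stated in the text.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    low = (age // width) * width
    return f"{low}-{low + width - 1}"
```

Because a query can only select a whole bucket, it cannot isolate an individual by exact age. Combined with the minimum of five matching individuals, this coarsening makes it harder to narrow a query down to one person.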