Color Data includes aggregated genetic and self-reported phenotypic information related to hereditary cancer and hereditary cardiovascular conditions from 54,000 affected and unaffected individuals who took a Color test. Here we briefly describe the scientific methodology and design of Color Data. We also outline the steps taken to protect participant privacy. For a detailed description of the methods and protections, please refer to our publications:
Individuals’ demographic and health history information was self-reported using an interactive, online tool. All individuals consented to have their genetic and phenotypic information appear in Color's research database. Individuals were not recruited for this database and can opt out of participating in the database.
Laboratory procedures, bioinformatics analysis, and variant interpretation for the multi-gene panel test were performed at Color Genomics, Inc. (‘Color’, Burlingame, CA) under CLIA (Clinical Laboratory Improvement Amendments, #05D2081492) and CAP (College of American Pathologists, #8975161) compliance as previously described.1 Bioinformatics analysis included 30 genes associated with hereditary cancer and 30 genes associated with hereditary cardiovascular conditions. Variants were classified according to the standards and guidelines for sequence variant interpretation of the American College of Medical Genetics and Genomics (ACMG), and all variant classifications were signed out by a board-certified medical geneticist or pathologist. Variant classification categories are pathogenic (P), likely pathogenic (LP), variant of uncertain significance (VUS), likely benign (LB), and benign (B). Variants can be reclassified when new information becomes available, so the database snapshot may not reflect a variant’s current classification.
Hereditary cancer: APC, ATM, BAP1, BARD1, BMPR1A, BRCA1, BRCA2, BRIP1, CDH1, CDK4, CDKN2A (p14ARF and p16INK4a), CHEK2, EPCAM, GREM1, MITF, MLH1, MSH2, MSH6, MUTYH, NBN, PALB2, PMS2, POLD1, POLE, PTEN, RAD51C, RAD51D, SMAD4, STK11, and TP53. Analysis, variant calling, and reporting focused on the complete coding sequence and adjacent intronic sequence of the primary transcript(s), unless otherwise indicated. In PMS2, exons 12-15 were not analyzed. In several genes, only specific positions known to impact cancer risk were analyzed (genomic coordinates in GRCh37): CDK4 - only chr12:g.58145429-58145431 (codon 24), MITF - only chr3:g.70014091 (including c.952G>A), POLD1 - only chr19:g.50909713 (including c.1433G>A), POLE - only chr12:g.133250250 (including c.1270C>G), EPCAM - only large deletions and duplications including the 3’ end of the gene, and GREM1 - only duplications in the upstream regulatory region.
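The position-restricted genes listed above can be expressed as a small lookup table. The following is a minimal sketch in Python, not Color's pipeline: the names `RESTRICTED_REGIONS` and `is_position_analyzed` are hypothetical, and only the GRCh37 coordinates are taken from the text. EPCAM and GREM1 are omitted because their restrictions are defined by variant type rather than by position.

```python
from typing import Dict, Optional, Tuple

# GRCh37 coordinates copied from the text: gene -> (chromosome, start, end), inclusive.
# The table and function names are illustrative, not part of any Color software.
RESTRICTED_REGIONS: Dict[str, Tuple[str, int, int]] = {
    "CDK4": ("chr12", 58145429, 58145431),    # codon 24
    "MITF": ("chr3", 70014091, 70014091),     # including c.952G>A
    "POLD1": ("chr19", 50909713, 50909713),   # including c.1433G>A
    "POLE": ("chr12", 133250250, 133250250),  # including c.1270C>G
}


def is_position_analyzed(gene: str, chrom: str, pos: int) -> Optional[bool]:
    """Return True/False for a position-restricted gene, or None for any other gene."""
    region = RESTRICTED_REGIONS.get(gene)
    if region is None:
        # Gene is not in this table, so no position restriction is encoded here.
        return None
    region_chrom, start, end = region
    return chrom == region_chrom and start <= pos <= end
```

For example, `is_position_analyzed("CDK4", "chr12", 58145430)` falls inside codon 24 and returns `True`, while a position one base past the end of the range returns `False`.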
Hereditary cardiovascular conditions: ACTA2, ACTC1, APOB, COL3A1, DSC2, DSG2, DSP, FBN1, GLA, KCNH2, KCNQ1, LDLR, LMNA, MYBPC3, MYH7, MYH11, MYL2, MYL3, PCSK9, PKP2, PRKAG2, RYR2, SCN5A, SMAD3, TGFBR1, TGFBR2, TMEM43, TNNI3, TNNT2, and TPM1. Analysis, variant calling, and reporting focused on the complete coding sequence and adjacent intronic sequence of the primary transcript(s), unless otherwise indicated. In APOB, analysis was limited to chr2:g.21229159_21229161 (codon 3527). In MYH7, variants of uncertain significance (VUS) were not reported for exon 27. In several genes, certain exons were not analyzed: exons 4 and 14 of KCNH2, exon 1 of KCNQ1, exon 11 of MYBPC3, exon 5 of PRKAG2, and exon 1 of TGFBR1.
Laboratory procedures and imputation for low coverage whole genome sequencing (lcWGS) were performed at Color.2 Data from lcWGS were used to calculate previously published polygenic scores for three common, complex diseases: breast cancer,3 coronary artery disease,4 and atrial fibrillation.4 Only a subset of the individuals in the database have a calculated polygenic risk score, so users who want to view polygenic risk score results for a given query must select 'Calculated' in the polygenic risk score filter. Individuals who do not have a calculated polygenic risk score are captured under the filter value ‘Unknown’. Unless a filter value is selected, self-reported phenotypic and genotypic information from both ‘Calculated’ and ‘Unknown’ individuals is included in the other query results by default.
Detailed information about the clinical risk models can be found in our preprint. Briefly, genotypic and self-reported phenotypic information were used in the following clinical risk models: the Gail Model for five-year risk of breast cancer,5 the Claus Model for lifetime risk of breast cancer,6 the simple office-based Framingham Coronary Heart Disease Risk Score for ten-year risk of coronary heart disease,7 and the CHARGE-AF Simple Score for five-year risk of atrial fibrillation.8 Only a subset of individuals have a risk score calculated. Individuals who do not have a risk score calculated are labeled as ‘Unknown’ if not enough information was provided to calculate a risk score or ‘Ineligible’ if they did not meet the model criteria.
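The labeling rule for clinical risk scores can be sketched as a short function. This is only an illustration under stated assumptions: the function name and its two boolean inputs are hypothetical, the order of the checks (missing information is tested before eligibility) is an assumption, and the label 'Calculated' for individuals who do receive a score is borrowed from the polygenic risk score filter rather than stated for the clinical models.

```python
def risk_score_status(has_required_inputs: bool, meets_model_criteria: bool) -> str:
    """Label an individual for one clinical risk model (hypothetical helper)."""
    if not has_required_inputs:
        # Not enough self-reported information to run the model.
        return "Unknown"
    if not meets_model_criteria:
        # Enough information was provided, but the model does not apply.
        return "Ineligible"
    # Assumed label for individuals who do receive a score.
    return "Calculated"
```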
Color Data is powered by Metabase, an open source data analysis tool developed by Metabase Inc. and licensed under the AGPL v3. It runs on a dedicated site and accesses Google BigQuery via its REST API over HTTPS.
The database URL includes a version (v) identifier that is assigned in increasing order and corresponds to new developments in the database. A new version is assigned when there are significant changes to the data (in quantity or composition), inputs and outputs, filters, or other functionality. Users who cite the database should include the version identifier from which they derived their results, as queries may change between versions. Importantly, the data and functionality within a version remain fixed so that queries can be reproduced and replicated regardless of the current version.
Filter categories use AND logic, and filter values within categories use OR logic. Users can select filter values from the dropdown list or by typing text with autocomplete, with the exception of the Variant filter values, which can only be selected by typing text with autocomplete using HGVS nomenclature. Any query whose results would yield information about fewer than 5 individuals generates the following error message: "Too few individuals in the Color Data population match this query to return results."
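The filter semantics and the minimum-count rule can be illustrated with a short Python sketch. This is not the database's implementation (which runs on Metabase and BigQuery); the record layout, field names, and the functions `matches` and `run_query` are hypothetical, and only the AND/OR logic, the threshold of 5, and the error message come from the text.

```python
from typing import Dict, List, Set

MIN_INDIVIDUALS = 5  # threshold stated in the text
ERROR_MESSAGE = (
    "Too few individuals in the Color Data population match this query "
    "to return results."
)


def matches(record: Dict[str, str], filters: Dict[str, Set[str]]) -> bool:
    """AND across filter categories, OR across the values within each category."""
    return all(record.get(category) in allowed for category, allowed in filters.items())


def run_query(records: List[Dict[str, str]], filters: Dict[str, Set[str]]) -> int:
    """Return the aggregate count of matching individuals, or refuse if it is too small."""
    count = sum(1 for record in records if matches(record, filters))
    if count < MIN_INDIVIDUALS:
        raise ValueError(ERROR_MESSAGE)
    return count
```

With filters such as `{"sex": {"Female"}, "classification": {"P", "LP"}}`, a record matches only if it is female and its classification is either P or LP.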
Full results can be downloaded in CSV, XLSX, or JSON format directly from the query/results page, so that users can permanently store them on their own computer in tabular format. Queries and results can be shared via email or social media, including Facebook and Twitter, through integrated share buttons.
To help protect the privacy of individuals whose information is included in Color Data, all information in the database is de-identified in compliance with the HIPAA Privacy Rule and is returned in aggregate. We took additional steps to limit re-identification of a single individual, while still maintaining the power of aggregate and statistical database queries. These precautions were largely inspired by the literature on statistical databases,9,10 differential privacy,11,12 and Hippocratic databases.13 Query filters such as age are quantized into five-year buckets, and every query must match ≥ 5 individuals; otherwise no results are returned and an error message is generated. Taken together, these restrictions can help to stymie some common techniques used to re-identify individuals in de-identified, aggregate data sets. Finally, all queries in the database and their source IP addresses are logged to detect, and potentially block, users who are making many suspiciously overlapping queries.
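Quantizing age into five-year buckets can be sketched as follows. This is a minimal illustration, not the database's code: the function name `age_bucket` and the bucket boundaries (buckets starting at multiples of five, such as 40-44) are assumptions, since the text states only the bucket width.

```python
from collections import Counter
from typing import Dict, Iterable


def age_bucket(age: int, width: int = 5) -> str:
    """Map an exact age to an inclusive bucket label, e.g. 42 -> '40-44'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    low = (age // width) * width  # assumed alignment: buckets start at multiples of width
    return f"{low}-{low + width - 1}"


def bucket_counts(ages: Iterable[int]) -> Dict[str, int]:
    """Aggregate exact ages into per-bucket counts so exact ages are never exposed."""
    return dict(Counter(age_bucket(age) for age in ages))
```

Because only the bucket label is exposed, two individuals aged 40 and 44 are indistinguishable in query results.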