About Color Data


Color Data includes aggregated genetic and self-reported phenotypic information related to hereditary cancer and hereditary cardiovascular conditions from 54,000 affected and unaffected individuals who took a Color test. Here we briefly describe the scientific methodology and design of Color Data. We also outline the steps taken to protect participant privacy. For a detailed description of the methods and protections, please refer to the publications listed in the References.

Data collection

Individuals’ demographic and health history information was self-reported using an interactive, online tool. All individuals consented to have their genetic and phenotypic information appear in Color's research database. Individuals were not recruited for this database and can opt out of participating in the database.

Multi-gene panel testing

Laboratory procedures, bioinformatics analysis, and variant interpretation for the multi-gene panel test were performed at Color Genomics, Inc. (‘Color’, Burlingame, CA) under CLIA (Clinical Laboratory Improvement Amendments, #05D2081492) and CAP (College of American Pathologists, #8975161) compliance as previously described.1 Bioinformatics analysis included 30 genes associated with hereditary cancer and 30 genes associated with hereditary cardiovascular conditions.

Hereditary cancer: APC, ATM, BAP1, BARD1, BMPR1A, BRCA1, BRCA2, BRIP1, CDH1, CDK4, CDKN2A (p14ARF and p16INK4a), CHEK2, EPCAM, GREM1, MITF, MLH1, MSH2, MSH6, MUTYH, NBN, PALB2, PMS2, POLD1, POLE, PTEN, RAD51C, RAD51D, SMAD4, STK11, and TP53. Analysis, variant calling, and reporting focused on the complete coding sequence and adjacent intronic sequence of the primary transcript(s), unless otherwise indicated. In PMS2, exons 12-15 were not analyzed. In several genes, only specific positions known to impact cancer risk were analyzed (genomic coordinates in GRCh37): CDK4 - only chr12:g.58145429_58145431 (codon 24), MITF - only chr3:g.70014091 (including c.952G>A), POLD1 - only chr19:g.50909713 (including c.1433G>A), POLE - only chr12:g.133250250 (including c.1270C>G), EPCAM - only large deletions and duplications including the 3’ end of the gene, and GREM1 - only duplications in the upstream regulatory region.

Hereditary cardiovascular conditions: ACTA2, ACTC1, APOB, COL3A1, DSC2, DSG2, DSP, FBN1, GLA, KCNH2, KCNQ1, LDLR, LMNA, MYBPC3, MYH7, MYH11, MYL2, MYL3, PCSK9, PKP2, PRKAG2, RYR2, SCN5A, SMAD3, TGFBR1, TGFBR2, TMEM43, TNNI3, TNNT2, and TPM1. Analysis, variant calling, and reporting focused on the complete coding sequence and adjacent intronic sequence of the primary transcript(s), unless otherwise indicated. In APOB, analysis was limited to chr2:g.21229159_21229161 (codon 3527). In MYH7, variants of uncertain significance (VUS) were not reported for exon 27. In several genes, certain exons were not analyzed: exons 4 and 14 of KCNH2, exon 1 of KCNQ1, exon 11 of MYBPC3, exon 5 of PRKAG2, and exon 1 of TGFBR1.
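The position restrictions listed above for CDK4, MITF, POLD1, POLE, and APOB can be pictured as a small lookup table. The following is only an illustrative sketch: the coordinates are copied from the text (GRCh37), while the names RESTRICTED_POSITIONS and in_restricted_window are hypothetical and do not describe Color's actual pipeline. Genes restricted by variant type (EPCAM, GREM1) or by exon are deliberately left out.

```python
from typing import Dict, Tuple

# Gene -> (chromosome, first position, last position), GRCh37, inclusive.
# Coordinates are taken from the text; the table itself is a hypothetical
# illustration, not Color's implementation.
RESTRICTED_POSITIONS: Dict[str, Tuple[str, int, int]] = {
    "CDK4": ("chr12", 58145429, 58145431),    # codon 24
    "MITF": ("chr3", 70014091, 70014091),
    "POLD1": ("chr19", 50909713, 50909713),
    "POLE": ("chr12", 133250250, 133250250),
    "APOB": ("chr2", 21229159, 21229161),     # codon 3527
}


def in_restricted_window(gene: str, chrom: str, pos: int) -> bool:
    """Return True if pos lies in the analyzed window of a position-restricted gene.

    Raises KeyError for genes that are not in the table, because this sketch
    makes no claim about how other genes are handled.
    """
    window_chrom, start, end = RESTRICTED_POSITIONS[gene]
    return chrom == window_chrom and start <= pos <= end
```

For example, in_restricted_window("CDK4", "chr12", 58145430) is True, while a CDK4 position one base past the codon is not covered.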

Low coverage whole genome sequencing for polygenic risk scores

Laboratory procedures and imputation for low coverage whole genome sequencing (lcWGS) were performed at Color.2 Data from lcWGS were used to calculate previously published polygenic scores for three common, complex diseases: breast cancer,3 coronary artery disease,4 and atrial fibrillation.4 Only a subset of the individuals in the database have a calculated polygenic risk score, so users who want to view polygenic risk score results for a given query must select ‘Calculated’ in the polygenic risk score filter. Individuals without a calculated polygenic risk score are captured under the filter value ‘Unknown’. Unless a filter value is selected, self-reported phenotypic and genotypic information from both ‘Calculated’ and ‘Unknown’ individuals is included in the other query results by default.

Clinical risk models

Detailed information about the clinical risk models can be found in our preprint. Briefly, genotypic and self-reported phenotypic information were used in the following clinical risk models: Gail Model for five-year risk of breast cancer,5 Claus Model for lifetime risk of breast cancer,6 simple office-based Framingham Coronary Heart Disease Risk Score for ten-year risk of coronary heart disease,7 and CHARGE-AF Simple Score for five-year risk of atrial fibrillation.8 Note that only a subset of individuals have a risk score calculated. Individuals who do not have a risk score calculated are labeled as ‘Unknown’ if not enough information was provided to calculate a risk score or ‘Ineligible’ if they did not meet the model criteria.
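The ‘Unknown’ and ‘Ineligible’ labels can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function name score_with_label, the ‘Calculated’ label for scored individuals (borrowed from the polygenic risk score filter), and the choice to check for missing inputs before eligibility. The model and eligibility rule are passed in as callables and are not the published Gail, Claus, Framingham, or CHARGE-AF definitions.

```python
from typing import Callable, Mapping, Optional, Sequence, Tuple

Record = Mapping[str, Optional[float]]


def score_with_label(
    record: Record,
    required: Sequence[str],
    is_eligible: Callable[[Record], bool],
    model: Callable[[Record], float],
) -> Tuple[str, Optional[float]]:
    """Return (label, score) for one individual.

    Assumed order of checks: missing inputs first, then model eligibility.
    """
    # Not enough information was provided to calculate a score.
    if any(record.get(field) is None for field in required):
        return "Unknown", None
    # All inputs are present, but the individual is outside the model criteria.
    if not is_eligible(record):
        return "Ineligible", None
    return "Calculated", model(record)
```

Keeping the two non-scored outcomes distinct lets a query separate individuals who could be scored with more information from those the model does not cover.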

Architecture and implementation

Color Data is powered by Metabase, an open source data analysis tool developed by Metabase Inc. and licensed under the AGPL v3. It runs on a dedicated site and accesses Google BigQuery via its REST API over HTTPS.

The database URL includes a version (v) identifier that is assigned in increasing order and corresponds to new developments in the database. A new version is assigned when there are significant changes to the data (in quantity or composition), inputs and outputs, filters, or other functionality. Users who cite the database should include the version identifier from which they derived their results because query results may change between versions. Importantly, the data and functionality within a version remain fixed so that queries can be reproduced and replicated regardless of the current version.

  • Version 2 released January 12, 2020; data collection April 2015 to December 2019
  • Version 1 released October 18, 2018; data collection April 2015 to September 2018

Web interface

Filter categories use AND logic, and filter values within categories use OR logic. Users can select filter values from the dropdown list or by typing with autocomplete, with the exception of Variant filter values, which can only be selected by typing with autocomplete using HGVS nomenclature. Furthermore, any query whose results would yield information about fewer than 5 individuals will generate the following error message: "Too few individuals in the Color Data population match this query to return results."
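The filter semantics described above can be sketched in a few lines. This is a minimal illustration, not the Metabase or BigQuery implementation: the record layout, the function names matches and run_query, and the treatment of an empty category as "no filter" are assumptions. Only the AND/OR rule, the threshold of 5, and the error message come from the text.

```python
from typing import Dict, Iterable, Mapping, Set

MIN_INDIVIDUALS = 5  # threshold stated in the text
TOO_FEW_MESSAGE = (
    "Too few individuals in the Color Data population match this query "
    "to return results."
)


def matches(person: Mapping[str, str], filters: Dict[str, Set[str]]) -> bool:
    """AND across filter categories, OR across values within one category.

    A category with no selected values is treated as unfiltered (assumption).
    """
    return all(
        person.get(category) in values
        for category, values in filters.items()
        if values
    )


def run_query(people: Iterable[Mapping[str, str]], filters: Dict[str, Set[str]]) -> int:
    """Return the number of matching individuals, or refuse if it is below 5."""
    count = sum(1 for person in people if matches(person, filters))
    if count < MIN_INDIVIDUALS:
        raise ValueError(TOO_FEW_MESSAGE)
    return count
```

With filters {"gene": {"BRCA1", "BRCA2"}, "sex": {"F"}}, a person matches if their gene is BRCA1 or BRCA2 and their sex is F.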

Full results can be downloaded in csv, xlsx, and json format directly from the query/results page, so users can permanently store them on their own computers in tabular format. Queries and results can be shared via email or social media, including Facebook and Twitter, through integrated share buttons.

Privacy

To help protect the privacy of individuals whose information is included in Color Data, all information in the database is de-identified in compliance with the HIPAA Privacy Rule and is returned in aggregate. We took additional steps to limit re-identification of a single individual, while still maintaining the power of aggregate and statistical database queries. These precautions were largely inspired by the literature on statistical databases,9,10 differential privacy,11,12 and Hippocratic databases.13 Query filters such as age are quantized into five-year buckets, and every query must match ≥ 5 individuals; otherwise, no results are returned and an error message is generated. Taken together, these restrictions can help to stymie some common techniques used to re-identify individuals in de-identified, aggregate data sets. Finally, all queries in the database and their source IP addresses are logged to detect, and potentially block, users who are making many suspiciously overlapping queries.
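As a rough illustration of the age quantization described above, the sketch below maps exact ages to five-year buckets and counts individuals per bucket. The bucket boundaries (multiples of five, such as 40-44) and the names age_bucket and bucket_counts are assumptions; the text states only that buckets span five years.

```python
from collections import Counter
from typing import Iterable


def age_bucket(age: int, width: int = 5) -> str:
    """Map an exact age to a bucket label such as '40-44'.

    Buckets are assumed to start at multiples of `width`.
    """
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def bucket_counts(ages: Iterable[int]) -> Counter:
    """Count individuals per age bucket, discarding the exact ages."""
    return Counter(age_bucket(age) for age in ages)
```

Because only bucket labels are exposed, two queries cannot differ by a single year of age, which makes it harder to isolate one individual by subtracting the results of nearly identical queries.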

References

  • 1. Neben CL, Zimmer AD, Stedden W, et al. Multi-Gene Panel Testing of 23,179 Individuals for Hereditary Cancer Risk Identifies Pathogenic Variant Carriers Missed by Current Genetic Testing Guidelines. J Mol Diagn. 2019;0(0). doi:10.1016/j.jmoldx.2019.03.001
  • 2. Homburger JR, Neben CL, Mishne G, Zhou AY, Kathiresan S, Khera AV. Low coverage whole genome sequencing enables accurate assessment of common variants and calculation of genome-wide polygenic scores. Genome Med. 2019;11(1):74.
  • 3. Mavaddat N, Michailidou K, Dennis J, et al. Polygenic Risk Scores for Prediction of Breast Cancer and Breast Cancer Subtypes. Am J Hum Genet. 2019;104(1):21-34.
  • 4. Khera AV, Chaffin M, Aragam KG, et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet. August 2018. doi:10.1038/s41588-018-0183-z
  • 5. Gail MH, Brinton LA, Byar DP, et al. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J Natl Cancer Inst. 1989;81(24):1879-1886.
  • 6. Claus EB, Risch N, Thompson WD. Autosomal dominant inheritance of early-onset breast cancer. Implications for risk prediction. Cancer. 1994;73(3):643-651.
  • 7. D’Agostino RB Sr, Vasan RS, Pencina MJ, et al. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117(6):743-753.
  • 8. Alonso A, Krijthe BP, Aspelund T, et al. Simple risk model predicts incidence of atrial fibrillation in a racially and geographically diverse population: the CHARGE-AF consortium. J Am Heart Assoc. 2013;2(2):e000102.
  • 9. Denning DE. Secure statistical databases with random sample queries. ACM Transactions on Database Systems (TODS). 1980;5(3):291-315.
  • 10. Adam NR. Security-Control Methods for Statistical Databases: A Comparative Study. ACM Computing Surveys. 1989;21(4). http://www.utdallas.edu/~muratk/courses/privacy08f_files/stat_database_sec.pdf.
  • 11. Dinur I, Nissim K. Revealing Information While Preserving Privacy. In: Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. PODS ’03. New York, NY, USA: ACM; 2003:202-210.
  • 12. Dwork C, McSherry F, Nissim K, Smith A. Calibrating Noise to Sensitivity in Private Data Analysis. In: Theory of Cryptography. Springer Berlin Heidelberg; 2006:265-284.
  • 13. Agrawal R, Kiernan J, Srikant R, Xu Y. Hippocratic Databases. In: Proceedings of the 28th International Conference on Very Large Data Bases. VLDB ’02. 2002:143-154.