D3G Release 26.03
===

D3G provides a set of genomic data to support the development of oligonucleotide therapeutics. Genome sequences and gene models are compiled for seven species: human, crab-eating macaque, rhesus macaque, common marmoset, mouse, rat, and rabbit. Expression data are additionally provided for two of these species, crab-eating macaque and common marmoset.


What's new
-----------
* RefSeq and Gencode genes were updated for human
* Gencode genes were updated for mouse
* Genome assemblies and gene annotations were updated for crab-eating macaque, rhesus macaque, rat, and rabbit
* Promoter-level gene expression data for crab-eating macaque were reprocessed with the updated baseline
* Promoter-level gene expression data for common marmoset were updated to fix a bug in data processing. Please refer to the latest data set for the correct data.


Data use / embargo
-------------------
We release our original data sets under [CC-BY](https://creativecommons.org/licenses/by/4.0/) to support the development of oligonucleotide therapeutics. We encourage anyone (for example, at academic institutes, commercial companies, regulatory agencies, or elsewhere) to use the data sets freely for the development or assessment of drugs.

Meanwhile, we ask users to respect the embargo on the publication of genome-wide analyses based on this data set, as we are currently preparing manuscripts that provide genome-scale analyses of the non-human primates, with a complete description of the experimental and computational details and the raw data. Analyses limited to a small number of loci, gene families, or oligonucleotide sequences, rather than comprehensive large-scale analyses, are exempt from this policy.


Disclaimer
-------------------
The data are provided “as is” without warranties of any kind. While reasonable efforts have been made to ensure accuracy, the developers and maintainers of D3G assume no responsibility for any errors or omissions, or for any consequences arising from the use of this information.


Data source
------------
* Human (Homo sapiens)
    - Genome assembly: GRCh38/hg38.p14 (GCA_000001405.29) [^1]
    - Gene models
        - RefSeq: Annotation Release NCBI RefSeq GCF_000001405.40-RS_2025_08 (2025-08-06) [^2]
        - Gencode: V49 [^3]
* Crab-eating macaque (Macaca fascicularis, also known as cynomolgus macaque)
    - Genome assembly: T2T-MFA8v1.1 (GCF_037993035.2) 
    - Gene models
        - RefSeq: NCBI RefSeq GCF_037993035.2-RS_2025_03 (2025-05-16)
    - Expression: CAGE profiles obtained from PRJDB9546
* Rhesus macaque (Macaca mulatta)
    - Genome assembly: T2T-MMU8v2.0 (GCF_049350105.2)
    - Gene models
        - RefSeq: GCF_049350105.2-RS_2025_08 (2025-12-11)
* Common marmoset (Callithrix jacchus)
    - Genome assembly: calJac4 (GCF_009663435.1)
    - Gene models
        - RefSeq: NCBI Callithrix jacchus Annotation Release 105 (2020-07-11)
        - TRaC: 22.02
            - built from CAGE profiles and RNA-seq profiles (data from PRJDB9547 and others), and MinION sequencing of full-length cDNAs prepared by cap-trapping.
    - Expression: CAGE profiles obtained from the above
* Mouse (Mus musculus)
    - Genome assembly: GRCm39/mm39 (GCA_000001635.9)
    - Gene models
        - RefSeq: Annotation Release NCBI RefSeq GCF_000001635.27-RS_2024_02 (2024-02-08) 
        - Gencode: VM38

* Rat (Rattus norvegicus)
    - Genome assembly: GRCr8 (GCF_036323735.1)
    - Gene models
        - RefSeq: GCF_036323735.1-RS_2024_02 (2024-02-24)
* Rabbit (Oryctolagus cuniculus)
    - Genome assembly: mOryCun1.1 (GCF_964237555.1)
    - Gene models
        - RefSeq: GCF_964237555.1-RS_2024_11 (2024-11-29)

The genome assemblies and the gene sets of crab-eating macaque, rhesus macaque, rat, and rabbit were compiled based on the UCSC Genome Repository, GenArk [^4].


[^1]: Church DM, Schneider VA, Graves T, et al. Modernizing reference genome assemblies. PLoS Biol. 2011;9(7):e1001091. doi:10.1371/journal.pbio.1001091
[^2]: O'Leary NA, Wright MW, Brister JR, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733-D745. doi:10.1093/nar/gkv1189
[^3]: Harrow J, Frankish A, Gonzalez JM, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22(9):1760-1774. doi:10.1101/gr.135350.111
[^4]: Clawson H, Lee BT, Raney BJ, et al. GenArk: towards a million UCSC genome browsers. Genome Biol. 2023;24(1):217. doi:10.1186/s13059-023-03057-x


Data files
-----------
```
.
|-- GRCr8_GCF_036323735.1
|   |-- gene
|   |   |-- GRCr8_ncbiRefSeqCurated.bed12.gz
|   |   |-- GRCr8_ncbiRefSeqCuratedProteinCoding.bed12.gz
|   |   |-- GRCr8_ncbiRefSeqCuratedProteinCoding_prespliced.fa.gz
|   |   |-- GRCr8_ncbiRefSeqCuratedProteinCoding_spliced.fa.gz
|   |   |-- GRCr8_ncbiRefSeqCurated_prespliced.fa.gz
|   |   |-- GRCr8_ncbiRefSeqCurated_spliced.fa.gz
|   |   `-- GRCr8_ncbiRefSeqPredicted.bed12.gz
|   `-- genome
|       `-- GRCr8.fa.gz
|-- T2T-MFA8v1.1_GCF_037993035.2
|   |-- expression
|   |   |-- T2T-MFA8v1.1_GCF_037993035.2_ncbiRefseqPromoterAdultExpCounts.txt.gz
|   |   |-- T2T-MFA8v1.1_GCF_037993035.2_ncbiRefseqPromoterAdultExpCpm.txt.gz
|   |   |-- T2T-MFA8v1.1_GCF_037993035.2_ncbiRefseqPromoterNewbornExpCounts.txt.gz
|   |   `-- T2T-MFA8v1.1_GCF_037993035.2_ncbiRefseqPromoterNewbornExpCpm.txt.gz
|   |-- gene
|   |   |-- T2T-MFA8v1.1_ncbiRefSeq.bed12.gz
|   |   |-- T2T-MFA8v1.1_ncbiRefSeq_prespliced.fa.gz
|   |   `-- T2T-MFA8v1.1_ncbiRefSeq_spliced.fa.gz
|   `-- genome
|       `-- T2T-MFA8v1.1.fa.gz
|-- T2T-MMU8v2.0_GCF_049350105.2
|   |-- gene
|   |   |-- T2T-MMU8v2.0_ncbiRefSeq.bed12.gz
|   |   |-- T2T-MMU8v2.0_ncbiRefSeq_prespliced.fa.gz
|   |   `-- T2T-MMU8v2.0_ncbiRefSeq_spliced.fa.gz
|   `-- genome
|       `-- T2T-MMU8v2.0.fa.gz
|-- calJac4
|   |-- expression
|   |   |-- calJac4_ncbiRefseqPromoterExpCounts.txt.gz
|   |   `-- calJac4_ncbiRefseqPromoterExpCpm.txt.gz
|   |-- gene
|   |   |-- calJac4_Trac2202.bed12.gz
|   |   |-- calJac4_ncbiRefSeq.bed12.gz
|   |   |-- calJac4_ncbiRefSeq_prespliced.fa.gz
|   |   `-- calJac4_ncbiRefSeq_spliced.fa.gz
|   `-- genome
|       `-- calJac4.fa.gz
|-- hg38
|   |-- gene
|   |   |-- hg38_ncbiRefSeqCurated.bed12.gz
|   |   |-- hg38_ncbiRefSeqCuratedProteinCoding.bed12.gz
|   |   |-- hg38_ncbiRefSeqCuratedProteinCoding_prespliced.fa.gz
|   |   |-- hg38_ncbiRefSeqCuratedProteinCoding_spliced.fa.gz
|   |   |-- hg38_ncbiRefSeqCurated_prespliced.fa.gz
|   |   |-- hg38_ncbiRefSeqCurated_spliced.fa.gz
|   |   |-- hg38_ncbiRefSeqPredicted.bed12.gz
|   |   `-- hg38_wgEncodeGencodeCompV49.bed12.gz
|   `-- genome
|       `-- hg38.p14.fa.gz
|-- mOryCun1.1_GCF_964237555.1
|   |-- gene
|   |   |-- mOryCun1.1_ncbiRefSeq.bed12.gz
|   |   |-- mOryCun1.1_ncbiRefSeq_prespliced.fa.gz
|   |   `-- mOryCun1.1_ncbiRefSeq_spliced.fa.gz
|   `-- genome
|       `-- mOryCun1.1.fa.gz
`-- mm39
    |-- gene
    |   |-- mm39_ncbiRefSeqCurated.bed12.gz
    |   |-- mm39_ncbiRefSeqCuratedProteinCoding.bed12.gz
    |   |-- mm39_ncbiRefSeqCuratedProteinCoding_prespliced.fa.gz
    |   |-- mm39_ncbiRefSeqCuratedProteinCoding_spliced.fa.gz
    |   |-- mm39_ncbiRefSeqCurated_prespliced.fa.gz
    |   |-- mm39_ncbiRefSeqCurated_spliced.fa.gz
    |   |-- mm39_ncbiRefSeqPredicted.bed12.gz
    |   `-- mm39_wgEncodeGencodeCompVM38.bed12.gz
    `-- genome
        `-- mm39.fa.gz
```

Files with the `.fa.gz` suffix contain nucleotide sequences in FASTA format, those with the `.bed12.gz` suffix contain the exon-intron structure of gene models in [BED12 format (BED format with 12 columns)](https://genome.ucsc.edu/FAQ/FAQformat.html#format1), and those with the `.txt.gz` suffix (in the `expression` directories) contain tab-delimited text of expression intensities.
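As an illustration of the BED12 layout, the following is a minimal Python sketch that reads a gzipped BED12 file into simple records. The names `Bed12`, `parse_bed12`, and `read_bed12_gz` are hypothetical helpers introduced here, not part of D3G, and the sketch assumes the standard 12-column UCSC layout with tab-separated fields.

```python
import gzip
from typing import Iterable, Iterator, List, NamedTuple


class Bed12(NamedTuple):
    """Subset of BED12 fields needed to describe exon-intron structure."""
    chrom: str
    start: int               # 0-based, inclusive
    end: int                 # 0-based, exclusive
    name: str
    strand: str
    block_sizes: List[int]   # exon lengths
    block_starts: List[int]  # exon starts, relative to `start`


def parse_bed12(lines: Iterable[str]) -> Iterator[Bed12]:
    """Yield one record per BED12 line, skipping blank and header lines."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        f = line.rstrip("\n").split("\t")
        # Columns 11 and 12 are comma-separated lists, often with a trailing comma.
        sizes = [int(x) for x in f[10].rstrip(",").split(",")]
        starts = [int(x) for x in f[11].rstrip(",").split(",")]
        yield Bed12(f[0], int(f[1]), int(f[2]), f[3], f[5], sizes, starts)


def read_bed12_gz(path: str) -> List[Bed12]:
    """Read all records from a gzip-compressed BED12 file."""
    with gzip.open(path, "rt") as handle:
        return list(parse_bed12(handle))
```

For example, `read_bed12_gz("hg38/gene/hg38_ncbiRefSeqCurated.bed12.gz")` would load the curated human RefSeq models, assuming the data files were downloaded into the current directory with the layout shown above.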

Nucleotide sequences of gene models are compiled from their genomic coordinates and the genome sequences, for a selected subset of the gene models only. Files containing `prespliced` in their names hold immature transcripts before splicing, equivalent to pre-mRNA for protein-coding transcripts (the term "pre-spliced" is used so that it also applies to long noncoding RNAs). Note that the protein-coding transcripts here include those encoded on both nuclear and mitochondrial DNA. As a result, the protein-coding transcripts compiled from RefSeq [^2] include `YP_` (for human) or `NP_` (for mouse) entries in addition to `NM_` ones.
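The difference between pre-spliced and spliced sequences can be sketched in Python as follows. This is an illustration of the general idea under BED12 conventions (0-based start, exon blocks relative to the start), not the actual D3G pipeline; the function names are hypothetical.

```python
from typing import List, Tuple

# Translation table for complementing DNA, preserving case and N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def transcript_sequences(
    chrom_seq: str,
    start: int,
    end: int,
    strand: str,
    block_sizes: List[int],
    block_starts: List[int],
) -> Tuple[str, str]:
    """Return (prespliced, spliced) sequences for one gene model.

    prespliced: the whole genomic span, introns included.
    spliced:    exon blocks only, concatenated in genomic order.
    Both are reverse-complemented for models on the minus strand.
    """
    prespliced = chrom_seq[start:end]
    spliced = "".join(
        chrom_seq[start + offset : start + offset + size]
        for offset, size in zip(block_starts, block_sizes)
    )
    if strand == "-":
        prespliced = reverse_complement(prespliced)
        spliced = reverse_complement(spliced)
    return prespliced, spliced
```

Exons are joined in genomic order before reverse-complementing, which yields the transcript in its 5'-to-3' orientation on either strand.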

The identifier (ID) of a RefSeq gene model has the form `REFSEQ_TRANSCRIPT_ID|GENE_SYMBOL;ENTREZ_GENE_ID`, as in `NM_000454.4|SOD1;6647` from the `hg38_ncbiRefSeqCurated.bed12.gz` file. In FASTA files of nucleotide sequences, additional information such as genomic coordinates is appended after a further `|`.
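A minimal Python sketch for splitting such an identifier into its parts is given below. The function name `parse_d3g_id` and the returned dictionary keys are hypothetical; anything after a second `|` (as in FASTA headers) is returned unparsed, since its exact layout is not specified here.

```python
from typing import Dict, Optional


def parse_d3g_id(identifier: str) -> Dict[str, Optional[str]]:
    """Split 'TRANSCRIPT_ID|GENE_SYMBOL;ENTREZ_GENE_ID[|EXTRA]' into fields."""
    transcript_id, _, rest = identifier.partition("|")
    gene_part, _, extra = rest.partition("|")
    # Split on the last ';' so the Entrez ID is always the final token.
    symbol, sep, entrez = gene_part.rpartition(";")
    if not sep:
        # No ';' present: treat the whole part as the symbol.
        symbol, entrez = gene_part, ""
    return {
        "transcript_id": transcript_id,
        "gene_symbol": symbol or None,
        "entrez_gene_id": entrez or None,
        "extra": extra or None,  # e.g. coordinates in FASTA headers, left as-is
    }
```

The Entrez gene ID is kept as a string so that unexpected non-numeric values do not raise an error.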

As for expression data, files containing `ExpCounts` in their names hold counts of the 5'-ends of CAGE reads aligned to the genome sequences, and those containing `ExpCpm` hold expression intensities in which the read counts were normalized to CPM (counts per million) with the RLE (relative log expression) method [^5]. Files whose names include `ncbiRefseqPromoter` contain promoter-level expression data reported for the promoters of RefSeq gene models (determined according to CAGE peaks), not for the CAGE peaks themselves.
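To make the relation between counts and CPM values concrete, the following Python sketch computes RLE-normalized CPM from a small count matrix. It follows one common formulation (median-of-ratios size factors, rescaled relative to library size so that the factors have a geometric mean of 1); this is an assumption for illustration and not necessarily the exact procedure used to produce the D3G files. The function names are hypothetical.

```python
import math
from statistics import median
from typing import List, Tuple


def rle_norm_factors(counts: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Return (normalization factors, library sizes), one value per sample.

    counts: rows are features (e.g. promoters), columns are samples.
    Requires at least one feature with non-zero counts in every sample.
    """
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]

    # Log ratio of each count to the feature's geometric mean across samples,
    # using only features that are non-zero in all samples.
    log_ratios: List[List[float]] = [[] for _ in range(n_samples)]
    for row in counts:
        if all(c > 0 for c in row):
            log_geo_mean = sum(math.log(c) for c in row) / n_samples
            for j, c in enumerate(row):
                log_ratios[j].append(math.log(c) - log_geo_mean)
    size_factors = [math.exp(median(r)) for r in log_ratios]

    # Express factors relative to library size; scale to geometric mean 1.
    raw = [sf / ls for sf, ls in zip(size_factors, lib_sizes)]
    log_mean = sum(math.log(f) for f in raw) / n_samples
    return [f / math.exp(log_mean) for f in raw], lib_sizes


def rle_cpm(counts: List[List[float]]) -> List[List[float]]:
    """Counts per million, using RLE-adjusted effective library sizes."""
    factors, lib_sizes = rle_norm_factors(counts)
    effective = [f * ls for f, ls in zip(factors, lib_sizes)]
    return [[c / effective[j] * 1e6 for j, c in enumerate(row)] for row in counts]
```

When one sample is simply sequenced twice as deeply as another, the normalization factors are both 1 and the result reduces to plain CPM; the factors differ from 1 only when the composition of the libraries differs.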


Experimental data and their computational processing
-----------------------------------------------------
We generated the following data to construct 5'-end-complete gene models with a novel approach (TRaC, Transcript models based on RNA-seq, CAGE, and long-read sequencing with Oxford Nanopore).

* RNA-seq
* CAGE, based on the ssCAGE protocol [^6]
* Long-read (Oxford Nanopore) sequencing of full-length cDNA prepared by cap-trapping

[^5]: Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. doi:10.1186/gb-2010-11-10-r106

[^6]: Morioka MS, Kawaji H, Nishiyori-Sueki H, et al. Cap Analysis of Gene Expression (CAGE): A Quantitative and Genome-Wide Assay of Transcription Start Sites. Methods Mol Biol. 2020;2120:277-301. doi:10.1007/978-1-0716-0327-7_20


Leadership and contact
----------------------
The original data were produced in a collaboration among [RIKEN PMI](https://www.riken.jp/research/labs/pmi/), [RIKEN IMS](https://www.riken.jp/research/labs/ims/), [Shiga University of Medical Science](https://www.shiga-med.ac.jp/), [CIEA (Central Institute for Experimental Animals)](https://www.ciea.or.jp/), [Keio University](https://www.keio.ac.jp/), [RIKEN BDR](https://www.riken.jp/research/labs/bdr/), [DBCLS (Database Center for Life Science)](http://dbcls.rois.ac.jp/), [NIBIOHN (National Institutes of Biomedical Innovation, Health and Nutrition)](https://www.nibiohn.go.jp/), [Kyoto University](https://www.kyoto-u.ac.jp/), and [TMIMS (Tokyo Metropolitan Institute of Medical Science)](http://www.igakuken.or.jp/). Please contact us at the e-mail address below with any questions, comments, suggestions, or proposals for collaboration:

    contact@d3g.io


How to cite
------------
Please cite our database as follows:

    D3G: Database for Drug Development based on Genome and RNA sequences, https://d3g.io, 2026


Acknowledgement
----------------
We thank [AMED (Japan Agency for Medical Research and Development)](https://www.amed.go.jp/), [NIHS (National Institute of Health Sciences)](http://www.nihs.go.jp/), [JPMA (Japan Pharmaceutical Manufacturers Association)](http://www.jpma.or.jp/), and [the FANTOM consortium](https://fantom.gsc.riken.jp/) for helpful advice and fruitful discussions. The experiments and the database are financially supported by AMED under Grant Numbers JP17kk0305008, JP20kk0305013, and 23kk0305024.