What's New
- Version 1.0.9 (Nov-2014): minor improvements and support for the hg18, hg19
and hg38 genome assemblies. ALWAYS check the coordinate system used by
your fusion detection software.
- Version 1.0.8 (Oct-2014) input format for FusionCatcher changed to
support v0.99.3b. Support for spanning/encompassing read filtering
added (see README).
- Version 1.0.7 (24-08-2014) includes several minor improvements, e.g.
reporting number of spanning/encompassing reads in output.
- Version 1.0.6 (24-09-2013) fixes some issues with mapping reads to genes that have several isoforms. We strongly recommend updating to this version.
- As of version 1.0.5 (released on 16-09-2013),
the classifier also takes into account broken domains in
fusion proteins. This gives better precision and recall rates than
originally published.
- Version 1.0.4 has extended the output format and supports
tophat-fusion-post and RNASTAR input.
- As of version 1.0.3, installation of Groovy is not necessary.
The input file types and the content of the output file have also been
improved.
Installation
The pipeline is platform independent, but users are required to install Java.
Download the latest version of Oncofuse and unpack it to a directory of your choice. Open a terminal window and go to the Oncofuse root directory.
TEST Oncofuse with the following command:
java -jar Oncofuse.jar example_coord.txt coord - outcoord.txt
If the JVM fails with an "out of memory" exception, or the execution time appears to be very long, we recommend using the -Xmx argument to raise the maximum Java heap size:
java -Xmx1G -jar Oncofuse.jar example_coord.txt coord - outcoord.txt
If you get no error messages, and a file named outcoord.txt (containing two lines) is written to your working directory, Oncofuse is ready to use.
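The check described above can also be scripted. The following is a minimal sketch, assuming Python 3 is available; the helper name output_looks_ok is hypothetical and is not part of Oncofuse.

```python
from pathlib import Path


def output_looks_ok(path="outcoord.txt", expected_lines=2):
    """Return True if the test output exists and has the expected line count."""
    out = Path(path)
    if not out.is_file():
        return False
    # Count non-empty lines only, so a trailing newline does not matter.
    lines = [ln for ln in out.read_text().splitlines() if ln.strip()]
    return len(lines) == expected_lines


if __name__ == "__main__":
    if output_looks_ok():
        print("outcoord.txt found with two lines")
    else:
        print("outcoord.txt is missing or has an unexpected number of lines")
```

Run it from the Oncofuse root directory after the test command has finished.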
Running Oncofuse
A typical command to run Oncofuse will be:
java -Xmx1G -jar Oncofuse.jar input_file input_type tissue_type output_file
Where:
input_file is the path to your input file.
input_type indicates the type of input file (see below).
tissue_type is the library argument, which tells Oncofuse to use its own pre-built gene expression libraries. There are four pre-built libraries, corresponding to the four supported tissue types: EPI (epithelial origin), HEM (hematological origin), MES (mesenchymal origin) and AVG (average expression, if tissue source is unknown).
output_file is the path to your output file.
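When Oncofuse is called from a script, the four positional arguments can be assembled programmatically. The sketch below is only an illustration: the function names build_command and run_oncofuse are hypothetical, the jar is assumed to be in the working directory, and the check on tissue_type covers only the four pre-built libraries plus "-" (it would need extending for user-supplied libraries). Options such as -a and -p are deliberately left out.

```python
import subprocess

# The four pre-built libraries listed above.
TISSUE_TYPES = {"EPI", "HEM", "MES", "AVG"}


def build_command(input_file, input_type, tissue_type, output_file,
                  jar="Oncofuse.jar", max_heap="1G"):
    """Assemble the argument list for one Oncofuse run."""
    # "-" is accepted because coord input names the library inside the file.
    if tissue_type != "-" and tissue_type not in TISSUE_TYPES:
        raise ValueError(f"unknown tissue_type: {tissue_type!r}")
    return ["java", f"-Xmx{max_heap}", "-jar", jar,
            input_file, input_type, tissue_type, output_file]


def run_oncofuse(cmd):
    """Launch Oncofuse; requires Java and Oncofuse.jar to be present."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_command("example_coord.txt", "coord", "-", "outcoord.txt")
    print(" ".join(cmd))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file paths that contain spaces.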
Options
-a option specifies genome assembly version. Allowed values: hg18, hg19 and hg38. Default value: hg19.
-p option specifies the number of threads Oncofuse will use.
Input types
Coordinates-based
If input_type argument is set to "coord", Oncofuse will take as input a list of genomic positions of breakpoints in a pair of fusion partner genes. A tab-delimited file in which each line represents a fusion gene, with the following structure, should be provided:
5'-chrom 5'-coord 3'-chrom 3'-coord library
For instance, the file example_coord.txt contains the line:
chr22 23632742 chr9 133607147 HEM
In this file, "5'-chrom" and "3'-chrom" indicate the chromosomes for the 5' and the 3' fusion partner genes, respectively. Likewise, "5'-coord" and "3'-coord" indicate the genomic coordinates for the breakpoints in each partner gene (those should be the first or the last nucleotide lost upon fusion, NOT the ones retained).
For this type of input, the library to be used is specified within the input file, so there is no need to pass the tissue_type argument (which should be set to "-").
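A coordinates-based input file can be generated from a list of breakpoints. The following is a minimal sketch, assuming one fusion per line and tab characters between the five fields as described above; the function names format_coord_line and write_coord_file are hypothetical.

```python
def format_coord_line(chrom5, coord5, chrom3, coord3, library):
    """Format one fusion as a tab-delimited line for 'coord' input."""
    fields = [chrom5, str(int(coord5)), chrom3, str(int(coord3)), library]
    return "\t".join(fields)


def write_coord_file(path, fusions):
    """Write an iterable of (chrom5, coord5, chrom3, coord3, library) tuples."""
    with open(path, "w") as handle:
        for fusion in fusions:
            handle.write(format_coord_line(*fusion) + "\n")


if __name__ == "__main__":
    # The fusion quoted above from example_coord.txt.
    print(format_coord_line("chr22", 23632742, "chr9", 133607147, "HEM"))
```

The resulting file is then passed as input_file with input_type set to coord and tissue_type set to "-".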
IMPORTANT:
ALWAYS check the coordinate system you are using and use the -a option
as required.
Post-processing output of fusion detection software
Oncofuse also processes files generated by Tophat, Tophat-Fusion-Post, RNASTAR and FusionCatcher. In this case, input_type should be set to tophat, tophat-post, rnastar or fcatcher, respectively, and the tissue_type argument is mandatory ("-" is not allowed; one of the four pre-built libraries MUST be specified).
For instance, a command to run Oncofuse using the library EPI on file "fusions.out", generated by TopHat-Fusion, should look like this:
java -Xmx1G -jar Oncofuse.jar fusions.out tophat EPI out.txt
The data import step in Oncofuse will filter out fusions from Tophat and RNASTAR that have fewer than N=1 spanning reads and fewer than M=2 supporting (spanning plus encompassing) reads. These values can be changed by substituting input_type with tophat-N-M or rnastar-N-M. No filtering is performed for Tophat-Fusion-Post and FusionCatcher.
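The tophat-N-M and rnastar-N-M forms can be built from the two thresholds. The helper below is a sketch with a hypothetical name (input_type_with_thresholds); it only composes the string and does not reproduce the filtering itself.

```python
# Input types for which read-count filtering is described above.
FILTERABLE = {"tophat", "rnastar"}


def input_type_with_thresholds(tool, min_spanning, min_supporting):
    """Build an input_type string of the form tool-N-M."""
    if tool not in FILTERABLE:
        raise ValueError(f"read filtering is only described for {sorted(FILTERABLE)}")
    if min_spanning < 0 or min_supporting < 0:
        raise ValueError("thresholds must be non-negative")
    return f"{tool}-{int(min_spanning)}-{int(min_supporting)}"


if __name__ == "__main__":
    print(input_type_with_thresholds("rnastar", 2, 5))  # rnastar-2-5
```

The returned string is used in place of the plain input_type on the command line.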
NOTE: An internal filtering step in Oncofuse will discard all fusions for which a) at least one breakpoint does not map to any coding RefSeq canonical transcript, or b) both breakpoints map to the same gene. This removes quite a few of the raw fusions included in the output of RNA-Seq fusion detection software, so beware.
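To anticipate which fusions survive this step, the two rules can be expressed as a small predicate. This is an illustration only, not Oncofuse's own code: it assumes the breakpoints have already been annotated with gene symbols by some other means, uses None for a breakpoint that maps to no coding canonical transcript, and uses placeholder gene names.

```python
def passes_internal_filter(gene5, gene3):
    """Apply rules a) and b) to a pair of already-annotated breakpoints."""
    if gene5 is None or gene3 is None:
        return False  # rule a: at least one breakpoint is unmapped
    if gene5 == gene3:
        return False  # rule b: both breakpoints map to the same gene
    return True


if __name__ == "__main__":
    # Placeholder gene symbols, not real annotations.
    candidates = [("GENE_A", "GENE_B"), ("GENE_A", None), ("GENE_B", "GENE_B")]
    kept = [pair for pair in candidates if passes_internal_filter(*pair)]
    print(kept)  # [('GENE_A', 'GENE_B')]
```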
Data library structure (advanced)
prom.txt:
Column | Gene Symbol | Expression value | Replication timing value |
Content | Official gene symbol | Log2 (expression, R.F.U.) | Normalized to 0 mean 1 Std; >1.5 ~ very early and <-1.5 ~ very late |
expr.txt:
Column | Gene Symbol | Expression value |
Content | Official gene symbol | Log2 (expression, R.F.U.) |
utr.txt:
Column | Gene Symbol | UTR size |
Content | Official gene symbol | Log2 (length, bp) |
Pre-compiled libraries contain averaged expression data from four distinct normal samples of the given tissue type (EPI, MES, HEM or global average, AVG). For more information on replication timing see ReplicationDomain. UTR length was computed from RefSeq data; it is used because it correlates well with the number of conserved elements and with miRNA binding sites, and because it differs significantly between fusion partner genes (FPGs) and normal genes. Users can add their own libraries to the ./libs folder of the distribution, using the same structure as the libraries already included in that folder.
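When preparing or inspecting a custom library, it can help to convert the stored values back to more familiar units. The sketch below follows the column descriptions above (log2 of UTR length in bp, replication timing normalized to mean 0 and standard deviation 1). The function names and the "intermediate" label are assumptions made for illustration.

```python
def utr_length_bp(log2_length):
    """Convert a utr.txt value (log2 of length in bp) back to base pairs."""
    return 2.0 ** log2_length


def timing_label(value):
    """Label a normalized replication timing value from prom.txt."""
    if value > 1.5:
        return "very early"
    if value < -1.5:
        return "very late"
    return "intermediate"  # label chosen here; not defined by Oncofuse


if __name__ == "__main__":
    print(utr_length_bp(10.0))   # 1024.0
    print(timing_label(2.0))     # very early
    print(timing_label(-1.8))    # very late
```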
Structure of output file
The output file contains information about fusions and classification results with the following structure:
SAMPLE_ID | The ID of sample for tophat-fusion-post, input file name otherwise |
FUSION_ID | The original line number in input file (except for RNASTAR input). |
TISSUE | As specified by library argument or in 'coord' input file. |
GENOMIC | Chromosomal coordinates for both breakpoints (as in input file). |
5_FPG_GENE_NAME | The HGNC symbol of 5' fusion partner gene. |
5_IN_CDS? | Indicates whether breakpoint is within the CDS of this gene. |
5_SEGMENT_TYPE | Indicates whether breakpoint is located within either exon or intron. |
5_SEGMENT_ID | Indicates number of exon or intron where breakpoint is located. |
5_COORD_IN_SEGMENT | Indicates coordinates for breakpoint within that exon/intron. |
5_FULL_AA | Length of translated 5' FPG in full amino acids |
5_FRAME | Frame of translated 5' FPG |
(Same as 7 lines above for the 3' fusion partner gene). | |
FPG_FRAME_DIFFERENCE | Difference in 5' and 3' FPG frames |
P_VAL_CORR | The Bayesian probability of fusion being a passenger (class 0), given as Bonferroni-corrected P-value. |
DRIVER_PROB | The Bayesian probability of fusion being a driver (class 1). |
EXPRESSION_GAIN | Expression gain of fusion calculated as: [(expression of 5' gene)/(expression of 3' gene)]-1 |
5_DOMAINS_RETAINED | List of protein domains retained in 5' fusion partner gene. |
3_DOMAINS_RETAINED | List of protein domains retained in 3' fusion partner gene. |
5_DOMAINS_BROKEN | List of protein domains that overlap breakpoint in 5' fusion partner gene |
3_DOMAINS_BROKEN | List of protein domains that overlap breakpoint in 3' fusion partner gene |
5_PII_RETAINED | List of protein interaction interfaces retained in 5' fusion partner gene. |
3_PII_RETAINED | List of protein interaction interfaces retained in 3' fusion partner gene. |
CTF, G, H, K, P and TF | Corresponding functional family association scores (log-transformed, scaled to the largest score obtained from classifier training set). See manuscript for details. |
NOTE: The pipeline operates with HGNC Gene Symbols rather than transcripts. It uses exon data from the major RefSeq transcript (the one with the largest CDS, as defined by UCSC). The rationale for this is to avoid complex output, as well as to facilitate mapping of protein domains (done as RefSeq -> UniProt -> InterPro).
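The output file can be post-processed with a short script, for example to keep only fusions with a high driver probability. The sketch below assumes that the output is tab-delimited and begins with a header row carrying the column names listed above; neither point is stated here, so check your own output first. The function name likely_drivers and the 0.5 cutoff are arbitrary choices for illustration, and the sample rows are made up.

```python
import csv
import io


def likely_drivers(handle, min_driver_prob=0.5):
    """Yield (FUSION_ID, 5' gene, DRIVER_PROB) for rows at or above the cutoff."""
    # Assumption: tab-delimited text with a header row of column names.
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        prob = float(row["DRIVER_PROB"])
        if prob >= min_driver_prob:
            yield row["FUSION_ID"], row["5_FPG_GENE_NAME"], prob


if __name__ == "__main__":
    # Made-up rows for illustration only; not real Oncofuse output.
    sample = io.StringIO(
        "FUSION_ID\t5_FPG_GENE_NAME\tDRIVER_PROB\n"
        "1\tGENE_A\t0.93\n"
        "2\tGENE_B\t0.12\n"
    )
    print(list(likely_drivers(sample)))  # [('1', 'GENE_A', 0.93)]
```

For a real file, replace the StringIO object with open("out.txt", newline="").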