UNAV

Department of Biochemistry and Genetics

General Information
This tool is designed to predict the oncogenic potential of fusion genes found by Next-Generation Sequencing in cancer cells. It is a post-processing step that tries to validate in silico the predictions made by fusion detection software. Oncofuse is NOT fusion detection software: its goal is NOT to identify fusion sequences, but to assign a functional prediction score (oncogenic potential, i.e. the probability of being a 'driver' event) to fusion sequences identified by other software such as Tophat-fusion, fusioncatcher or STAR.

Oncofuse is a naive Bayes classifier built using information from Shugay et al. 2012 and is described in Shugay et al. 2013.

As an example (see figure), running Oncofuse on the initial set of 165 fusions found by Banerji et al. (2012) in breast cancer (see input file here) assigned the highest probability of 'driver' status to MAGI3-AKT3 (see complete output here).

Please cite Oncofuse as:

Mikhail Shugay, Inigo Ortiz de Mendíbil, Jose L. Vizmanos and Francisco J. Novo. Oncofuse: a computational framework for the prediction of the oncogenic potential of gene fusions. Bioinformatics. 16 Aug 2013. doi:10.1093/bioinformatics/btt445. [Epub ahead of print].

Address any queries to bioinfoun@unav.es

What's New

- Version 1.0.9 (Nov-2014) minor improvements, support for hg18, hg19 and hg38 genome assemblies. ALWAYS check the coordinate system used by your fusion detection software.

- Version 1.0.8 (Oct-2014) input format for FusionCatcher changed to support v0.99.3b. Support for spanning/encompassing read filtering added (see README).

- Version 1.0.7 (24-08-2014) includes several minor improvements, e.g. reporting number of spanning/encompassing reads in output.

- Version 1.0.6 (24-09-2013) fixes some issues with mapping reads to genes with several isoforms. We strongly recommend updating to this version.

- As of version 1.0.5 (released on 16-09-2013) the classifier also takes into account broken domains in fusion proteins. This affords better precision and recall rates than originally published.

- Version 1.0.4 has extended the output format and supports tophat-fusion-post and RNASTAR input.

- As of version 1.0.3, installation of Groovy is not necessary. Input file types and the content of the output file have also been improved.

Installation

The pipeline is platform independent, but users are required to install Java.

Download the latest version of Oncofuse and unpack it to a directory of your choice. Open a terminal window and go to the Oncofuse root directory.

TEST Oncofuse with the following command:

java -jar Oncofuse.jar example_coord.txt coord - outcoord.txt

If the JVM fails with an "out of memory" exception or the execution time appears to be very long, we recommend using the -Xmx argument, which raises the maximum Java heap size:

java -Xmx1G -jar Oncofuse.jar example_coord.txt coord - outcoord.txt

If you get no error messages, and a file named outcoord.txt (containing two lines) is written to your working directory, Oncofuse is ready to use.
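The check described above can also be scripted. The following is a minimal sketch, not part of Oncofuse: the helper name output_looks_ok is hypothetical, and it assumes that "two lines" means two non-empty lines in the output file.

```python
from pathlib import Path


def output_looks_ok(output_path, expected_lines=2):
    """Return True if the file exists and holds the expected number of lines.

    Hypothetical helper: it only mirrors the manual check described on this
    page (outcoord.txt should exist and contain two lines). Blank lines are
    ignored, which is an assumption.
    """
    path = Path(output_path)
    if not path.is_file():
        return False
    with path.open(encoding="utf-8") as handle:
        non_empty = [line for line in handle if line.strip()]
    return len(non_empty) == expected_lines
```

After running the test command, calling output_looks_ok("outcoord.txt") from the Oncofuse root directory should return True if the installation works as described.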

Running Oncofuse

A typical command to run Oncofuse will be:

java -Xmx1G -jar Oncofuse.jar input_file input_type tissue_type output_file

Where:
input_file is the path to your input file.

input_type indicates the type of input file (see below).

tissue_type is the library argument, which tells Oncofuse to use its own pre-built gene expression libraries. There are four pre-built libraries, corresponding to the four supported tissue types: EPI (epithelial origin), HEM (hematological origin), MES (mesenchymal origin) and AVG (average expression, if tissue source is unknown).

output_file is the path to your output file.
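For users who call Oncofuse from a script, the argument order above can be assembled programmatically. This is a minimal sketch under stated assumptions: build_oncofuse_command is a hypothetical helper, the jar is assumed to be named Oncofuse.jar in the working directory, and the only validation applied is the rule given on this page that a pre-built library is mandatory for every input type other than "coord".

```python
# The four pre-built libraries named on this page.
TISSUE_TYPES = {"EPI", "HEM", "MES", "AVG"}


def build_oncofuse_command(input_file, input_type, tissue_type, output_file,
                           jar="Oncofuse.jar", max_heap="1G"):
    """Return the Oncofuse command line as a list of arguments.

    Hypothetical helper. For input types other than "coord" the tissue_type
    must be one of the pre-built libraries; for "coord" the value is passed
    through unchanged (this page recommends "-").
    """
    if input_type != "coord" and tissue_type not in TISSUE_TYPES:
        raise ValueError(
            f"tissue_type must be one of {sorted(TISSUE_TYPES)} "
            f"for input_type {input_type!r}"
        )
    return ["java", f"-Xmx{max_heap}", "-jar", jar,
            input_file, input_type, tissue_type, output_file]


# Usage (not executed here, requires Java and the Oncofuse jar):
#   import subprocess
#   cmd = build_oncofuse_command("fusions.out", "tophat", "EPI", "out.txt")
#   subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when file paths contain spaces.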

Options

-a option specifies genome assembly version. Allowed values: hg18, hg19 and hg38. Default value: hg19.

-p option specifies the number of threads Oncofuse will use.

Input types

Coordinates-based

If the input_type argument is set to "coord", Oncofuse will take as input a list of genomic positions of breakpoints in a pair of fusion partner genes. Provide a tab-delimited file in which each line represents a fusion gene, with the following structure:

5'-chrom    5'-coord    3'-chrom    3'-coord    library

For instance, the file example_coord.txt contains the line:

chr22    23632742    chr9    133607147    HEM

In this file, "5'-chrom" and "3'-chrom" indicate the chromosomes for the 5' and the 3' fusion partner genes, respectively. Likewise, "5'-coord" and "3'-coord" indicate the genomic coordinates for the breakpoints in each partner gene (those should be the first or the last nucleotide lost upon fusion, NOT the ones retained).

For this type of input, the library to be used is specified within the input file, so there is no need to pass the tissue_type argument (which should be set to "-").

IMPORTANT: ALWAYS check the coordinate system you are using and use the -a option as required.
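A coordinates-based input file can be generated from a list of breakpoints. The sketch below is illustrative only: the function names are hypothetical, and the file is assumed to have no header line, as in the example line from example_coord.txt shown above.

```python
def format_coord_line(chrom5, coord5, chrom3, coord3, library):
    """Build one tab-delimited line: 5'-chrom, 5'-coord, 3'-chrom, 3'-coord, library."""
    coords = (int(coord5), int(coord3))
    if any(value <= 0 for value in coords):
        raise ValueError("breakpoint coordinates must be positive integers")
    return "\t".join([chrom5, str(coords[0]), chrom3, str(coords[1]), library])


def write_coord_file(path, fusions):
    """Write one line per fusion; each fusion is a 5-tuple in the order above."""
    with open(path, "w", encoding="utf-8") as handle:
        for fusion in fusions:
            handle.write(format_coord_line(*fusion) + "\n")


# The line from example_coord.txt, rebuilt from its fields:
example = format_coord_line("chr22", 23632742, "chr9", 133607147, "HEM")
```

The coordinates themselves must still follow the convention described above (first or last nucleotide lost upon fusion) and match the genome assembly selected with the -a option; the sketch does not check either.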


Post-processing output of fusion detection software 

Oncofuse also processes files generated by Tophat, Tophat-Fusion-Post, RNASTAR and FusionCatcher. In this case, input_type should be set to tophat, tophat-post, rnastar and fcatcher respectively, and the tissue_type argument is mandatory ("-" is not allowed; one of the four pre-built libraries MUST be specified).

For instance, a command to run Oncofuse using the library EPI on file "fusions.out", generated by TopHat-Fusion, should look like this:

java -Xmx1G -jar Oncofuse.jar fusions.out tophat EPI out.txt

The data import step in Oncofuse will filter out fusions from Tophat and RNASTAR that have fewer than N=1 spanning reads and fewer than M=2 supporting (spanning plus encompassing) reads. These thresholds can be changed by replacing input_type with tophat-N-M or rnastar-N-M. No filtering is performed for Tophat-Fusion-Post and FusionCatcher.
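The tophat-N-M and rnastar-N-M forms of input_type can be built from the two thresholds. This is a small sketch with a hypothetical helper name; it only formats the argument string described above and does not reproduce the filtering that Oncofuse performs internally.

```python
def filtered_input_type(tool, min_spanning=1, min_supporting=2):
    """Return an input_type such as "tophat-2-5".

    Hypothetical helper. N (min_spanning) and M (min_supporting) default to
    the values stated on this page; only tophat and rnastar accept them.
    """
    if tool not in ("tophat", "rnastar"):
        raise ValueError("read-count thresholds apply only to tophat and rnastar")
    n, m = int(min_spanning), int(min_supporting)
    if n < 0 or m < 0:
        raise ValueError("thresholds must be non-negative integers")
    return f"{tool}-{n}-{m}"


# Require at least 2 spanning reads and 5 supporting reads for RNASTAR input:
stricter = filtered_input_type("rnastar", 2, 5)
```

The resulting string takes the place of input_type in the command line, for example: java -Xmx1G -jar Oncofuse.jar fusions.out tophat-2-5 EPI out.txt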

NOTE: An internal filtering step in Oncofuse will discard all fusions in which a) at least one breakpoint does not map to any coding RefSeq canonical transcript, or b) both breakpoints map to the same gene. This removes quite a few of the raw fusions included in the output of RNA-Seq fusion detection software, so beware.

 

Data library structure (advanced)

prom.txt:

Column    Gene Symbol            Expression value            Replication timing value
Content   Official gene symbol   Log2 (expression, R.F.U.)   Normalized to mean 0, SD 1; >1.5 ~ very early and <-1.5 ~ very late

 

expr.txt:

Column    Gene Symbol            Expression value
Content   Official gene symbol   Log2 (expression, R.F.U.)

 

utr.txt:

Column    Gene Symbol            UTR size
Content   Official gene symbol   Log2 (length, bp)

 

Pre-compiled libraries contain averaged expression data from four distinct normal samples of the given tissue type (EPI, MES, HEM or global average, AVG). For more information on replication timing see ReplicationDomain. UTR length was computed from RefSeq data. It is used because it correlates well with the number of conserved elements and with miRNA binding sites, and because it differs significantly between fusion partner genes (FPGs) and normal genes. Users can add their own libraries to the ./libs folder of the distribution, using the same structure as the libraries already included in that folder.

Structure of output file

The output file contains information about fusions and classification results with the following structure:

SAMPLE_ID
The sample ID for tophat-fusion-post input; the input file name otherwise.
FUSION_ID
The original line number in input file (except for RNASTAR input).
TISSUE
As specified by library argument or in 'coord' input file.
GENOMIC
Chromosomal coordinates for both breakpoints (as in input file).
5_FPG_GENE_NAME
The HGNC symbol of 5' fusion partner gene.
5_IN_CDS?
Indicates whether breakpoint is within the CDS of this gene.
5_SEGMENT_TYPE
Indicates whether breakpoint is located within either exon or intron.
5_SEGMENT_ID
Indicates number of exon or intron where breakpoint is located.
5_COORD_IN_SEGMENT
Indicates coordinates for breakpoint within that exon/intron.
5_FULL_AA
Length of translated 5' FPG in full amino acids
5_FRAME
Frame of translated 5' FPG
(The same seven fields, from gene name to frame, follow for the 3' fusion partner gene.)
FPG_FRAME_DIFFERENCE
Difference in 5' and 3' FPG frames
P_VAL_CORR
The Bayesian probability of fusion being a passenger (class 0), given as Bonferroni-corrected P-value.
DRIVER_PROB
The Bayesian probability of fusion being a driver (class 1).
EXPRESSION_GAIN
Expression gain of fusion calculated as:

[(expression of 5' gene)/(expression of 3' gene)]-1

5_DOMAINS_RETAINED
List of protein domains retained in 5' fusion partner gene.
3_DOMAINS_RETAINED
List of protein domains retained in 3' fusion partner gene.
5_DOMAINS_BROKEN
List of protein domains that overlap breakpoint in 5' fusion partner gene
3_DOMAINS_BROKEN
List of protein domains that overlap breakpoint in 3' fusion partner gene
5_PII_RETAINED
List of protein interaction interfaces retained in 5' fusion partner gene.
3_PII_RETAINED
List of protein interaction interfaces retained in 3' fusion partner gene.
CTF, G, H, K, P and TF
Corresponding functional family association scores (log-transformed, scaled to the largest score obtained from classifier training set). See manuscript for details.
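The EXPRESSION_GAIN formula listed above can be written out as a short worked example. This is a sketch only: the function name is hypothetical, and the input values are arbitrary numbers chosen for illustration, not values from any Oncofuse library. The page does not state whether the formula is applied to log2 or linear expression values, so the sketch simply applies the formula as printed.

```python
def expression_gain(expr_5prime, expr_3prime):
    """Apply the formula from this page: (5' expression / 3' expression) - 1."""
    if expr_3prime == 0:
        raise ValueError("expression of the 3' gene must be non-zero")
    return expr_5prime / expr_3prime - 1


# Arbitrary illustrative values: the 5' gene is expressed twice as highly
# as the 3' gene, so the gain is (8 / 4) - 1.
gain = expression_gain(8.0, 4.0)
```

Under this formula a value of 0 means both partner genes have the same expression value, a positive value means the 5' gene has the higher value, and a negative value means it has the lower one.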

NOTE: The pipeline operates with HGNC Gene Symbols rather than transcripts. It uses exon data from the major RefSeq transcript (the one with the largest CDS, as defined by UCSC). The rationale for this is to avoid complex output, as well as to facilitate mapping of protein domains (done as RefSeq -> UniProt -> InterPro).
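Once the output file has been written, fusions can be ranked by their DRIVER_PROB value. The sketch below makes two assumptions that are not stated on this page: that the output is tab-delimited, and that its first line is a header containing the field names listed above. The function name is hypothetical.

```python
import csv


def rank_by_driver_prob(handle):
    """Read Oncofuse output rows and sort them by DRIVER_PROB, highest first.

    Assumes a tab-delimited file whose first line holds the column names;
    verify both against a real output file before relying on this.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    return sorted(rows, key=lambda row: float(row["DRIVER_PROB"]), reverse=True)


# Usage (not executed here):
#   with open("out.txt", newline="", encoding="utf-8") as handle:
#       for row in rank_by_driver_prob(handle):
#           print(row["5_FPG_GENE_NAME"], row["3_FPG_GENE_NAME"], row["DRIVER_PROB"])
```

Keep in mind that fusions discarded by the internal filtering step never reach the output file, so the ranking covers only the fusions that Oncofuse was able to classify.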