Manipulating tracks

Using only file paths

Here is a short example where a new track is created containing the mean_score_by_feature computed on two other tracks:

from track.manipulate import mean_score_by_feature
virtual_track = mean_score_by_feature('/tracks/pol2.sql', '/tracks/rib_prot.sql')
virtual_track.export('/tmp/result.sql')

Using track objects

You can also do the same thing by passing track objects directly:

import track
from track.manipulate import mean_score_by_feature
with track.load('/scratch/tracks/pol2.sql') as pol2:
    with track.load('/scratch/tracks/ribosome_proteins.sql') as rpgenes:
        virtual_track = mean_score_by_feature(pol2,rpgenes)
        virtual_track.export('/tmp/result.sql')

Chaining manipulations

A convenient property of this design is that operations can be chained one after another without computing intermediate tracks. The following also works:

import track
from track.manipulate import overlap, complement
with track.load('/scratch/genomic/tracks/pol2.sql') as pol2:
    with track.load('/scratch/genomic/tracks/rap1.sql') as rap1:
        virtual_track = complement(overlap(pol2,rap1))
        virtual_track.export('/tmp/result.sql')
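
Chaining without intermediate tracks is possible when each manipulation consumes and yields features lazily. The following is a minimal sketch of that idea with plain Python generators. The names overlap_sketch and complement_sketch are hypothetical stand-ins working on bare (start, end) tuples; they are not the library's implementation.

```python
def overlap_sketch(xs, ys):
    """Yield the (start, end) pieces covered by both inputs.

    Assumes each input is sorted and free of internal overlaps.
    """
    xs, ys = iter(xs), iter(ys)
    x, y = next(xs, None), next(ys, None)
    while x is not None and y is not None:
        start, end = max(x[0], y[0]), min(x[1], y[1])
        if start < end:
            yield (start, end)
        # Advance whichever feature ends first.
        if x[1] <= y[1]:
            x = next(xs, None)
        else:
            y = next(ys, None)


def complement_sketch(xs, length):
    """Yield the gaps between features, up to the chromosome length."""
    position = 0
    for start, end in xs:
        if start > position:
            yield (position, start)
        position = max(position, end)
    if position < length:
        yield (position, length)


pol2 = [(0, 20), (50, 80)]
rap1 = [(10, 30), (60, 70)]
# Nothing is computed until list() pulls features through both generators.
result = list(complement_sketch(overlap_sketch(pol2, rap1), length=100))
print(result)
```

Because both functions are generators, the inner overlap is never stored as a whole; each piece is handed to the complement as soon as it is produced.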

All manipulations

track.manipulate.closest_features()

Find the closest features from one track in another track.

For instance, the closest_features manipulation can identify the most relevant gene(s) associated with each peak.

  • In cases where the peak is isolated, min_length indicates the maximal allowed distance between the peak and the closest gene. Beyond this distance, the peak will not be attributed to any gene. This defaults to 100000.
  • In cases where the peak is near two promoters (one on each strand), utr_cutoff indicates the threshold above which the peak isn’t attributed to any gene. Below this threshold, the peak will be attributed to both genes. This defaults to 2000.
  • In cases where the peak is between two genes on the same strand, prom_cutoff indicates the percentage of the distance between the two genes below which the peak is attributed to the 3’UTR of the preceding gene rather than to the promoter of the following gene. This defaults to 10.
Parameters:
  • X – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end', 'name']; extra fields will not be used.
  • Y – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end', 'name', 'strand']; extra fields will not be used.
  • min_length (int) – Maximal distance between a peak and its closest gene for attribution. By default 100000.
  • utr_cutoff (int) – Distance below which the peak is attributed to both genes. By default 2000.
  • prom_cutoff (int) – Percentage of the distance between two genes below which the peak is attributed to the preceding gene. By default 10.
Returns:

A virtual track with the following fields: ['start', 'end', 'name', 'id', 'type', 'location'].

A visual example:

                             peak
      ______  min_length    ------       min_length  ______
-----|______|------...------------------...---------|______|-----
      gene 1                                         gene 2

          -->                          -->
         |______  10%        90%      |______
---------|______|-----|---------------|______|-------------------
          gene 1      prom_cutoff      gene 2

If the lists of chromosomes contained in the various tracks differ, the conflict will be resolved by applying the ‘first’ principle.

track.manipulate.complement()

Complement (boolean NOT).

The complement manipulation takes only one track for input. The output consists of all intervals that were not covered by a feature in the input track. This corresponds to the boolean NOT operation.

Parameters:
  • X – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end']; extra fields will not be used.
  • l (int) – The length of the current chromosome (only necessary when calling the manipulation with generators).
Returns:

A virtual track with the following fields: ['start', 'end'].

A numerical example:

X [start,end]: (10,20) (30,40)
R [start,end]: ( 0,10) (20,30) (40,1000)

A visual example:

X: ──────▤▤▤▤▤─────────▤▤▤▤▤──────
R: ▤▤▤▤▤▤─────▤▤▤▤▤▤▤▤▤─────▤▤▤▤▤▤

track.manipulate.concatenate()

Concatenate N tracks together.

The concatenate manipulation takes any number of tracks as input. The output consists of all the features of every input track; overlapping features are kept as they are and are not merged.

Parameters:
  • n_tracks (list) – An arbitrary number of tracks or paths to tracks. Alternatively, generators yielding features. The fields read from these tracks will be: ['start', 'end', '...']; extra fields will be used.
Returns:

A virtual track with the following fields: ['start', 'end', '...'].

A numerical example:

X [start,end]: (0,20)
Y [start,end]: (10,30)
R [start,end]: (0,20) (10,30)

A visual example:

X: ───▤▤▤▤▤▤▤▤▤────────────▤▤▤▤▤▤▤──────
Y: ──────▤▤▤▤▤────▤▤▤▤▤─────────────────
R: ───▤▤▤▤▤▤▤▤▤───▤▤▤▤▤────▤▤▤▤▤▤▤──────
         ▤▤▤▤▤

If the lists of chromosomes contained in the various tracks differ, the conflict will be resolved by applying the ‘union’ principle.
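
As an illustration, concatenation of already sorted feature lists can be sketched with the standard library's heapq.merge. This assumes the output should stay sorted by start position, which the text above does not state, and concatenate_sketch is a hypothetical name, not the library function.

```python
import heapq


def concatenate_sketch(*tracks):
    """Interleave sorted feature lists into one sorted list, keeping every feature."""
    return list(heapq.merge(*tracks))


x = [(0, 20)]
y = [(10, 30)]
result = concatenate_sketch(x, y)
print(result)
```

Overlapping features such as (0,20) and (10,30) both survive; merging them is the job of the fusion manipulation.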

track.manipulate.custom_boolean()

Execute a custom boolean.

The custom_boolean manipulation takes several tracks t_1,..,t_n and a boolean function fn as input, and returns one track that is fn(t_1,..,t_n). The window size win_size determines how many base pairs of each track are loaded in memory at a time (default: 1000).

Parameters:
  • n_tracks (list) – An arbitrary number of tracks or paths to tracks. Alternatively, generators yielding features. The fields read from these tracks will be: ['start', 'end', 'name', 'score', 'strand']; extra fields will not be used.
  • fn (function) – A function that takes a vector and returns a boolean.
  • win_size (int) – The number of base pairs loaded in memory at each iteration. By default 1000.
Returns:

A virtual track with the following fields: ['start', 'end'].

A numerical example:

X1 [start,end]: [(4,5),(7,9),(10,12)]
X2 [start,end]: [(1,3),(4,5),(11,14)]
X3 [start,end]: [(9,13)]
fn: not(X1) and (X2 or X3)
R               [(1,3),(9,10),(12,14)]

A visual example:

X1: ───▤▤▤▤▤▤▤▤─────────────────────
X2: ─▤▤▤▤▤▤─────────────────────────
X3: ─────▤▤▤▤▤▤▤▤▤──────────────────
fn: X1 and X2 and X3
R:  ─────▤▤─────────────────────────

If the lists of chromosomes contained in the various tracks differ, the conflict will be resolved by applying the ‘union’ principle.
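
The numerical example can be reproduced with a naive per-base evaluation. This sketch ignores the win_size windowing entirely and assumes half-open [start, end) intervals; custom_boolean_sketch and its length argument are hypothetical and only meant to show what fn receives.

```python
def custom_boolean_sketch(tracks, fn, length):
    """Evaluate fn on the per-base coverage vector and collect the True runs."""
    result, run_start = [], None
    for pos in range(length):
        # One boolean per track: is this base covered by a feature?
        covered = [any(s <= pos < e for s, e in track) for track in tracks]
        if fn(covered):
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            result.append((run_start, pos))
            run_start = None
    if run_start is not None:
        result.append((run_start, length))
    return result


x1 = [(4, 5), (7, 9), (10, 12)]
x2 = [(1, 3), (4, 5), (11, 14)]
x3 = [(9, 13)]
# not(X1) and (X2 or X3)
fn = lambda v: (not v[0]) and (v[1] or v[2])
result = custom_boolean_sketch([x1, x2, x3], fn, length=20)
print(result)
```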

track.manipulate.difference()

Difference (boolean XOR).

Concatenates two tracks together and subsequently performs the fusion manipulation on the result. From this, any regions that were present in both of the original tracks are removed. This is equivalent to the boolean XOR operation.

Parameters:
  • X – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end']; extra fields will not be used.
  • Y – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end']; extra fields will not be used.
  • l (int) – The length of the current chromosome (only necessary when calling the manipulation with generators).
Returns:

A virtual track with the following fields: ['start', 'end'].

A numerical example:

X [start,end]: (0,40) (50,60)
Y [start,end]: (10,20)
R [start,end]: (0,10) (20,40) (50,60)

A visual example:

X: ───▤▤▤▤▤▤▤▤▤────────────▤▤▤▤▤▤▤──────
Y: ──────▤▤▤▤▤────▤▤▤▤▤─────────────────
R: ───▤▤──────▤───▤▤▤▤▤────▤▤▤▤▤▤▤──────

If the lists of chromosomes contained in the various tracks differ, the conflict will be resolved by applying the ‘union’ principle.
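
One way to picture the XOR is a sweep over all feature boundaries, keeping the stretches where exactly one track is covered. The sketch below takes that route instead of the concatenate-then-fuse procedure described above, and assumes neither input overlaps itself; difference_sketch is a hypothetical name.

```python
def difference_sketch(xs, ys):
    """Return the regions covered by exactly one of two feature lists."""
    # Every boundary toggles the coverage state of the track it belongs to.
    events = sorted([(p, 0) for iv in xs for p in iv] +
                    [(p, 1) for iv in ys for p in iv])
    inside = [False, False]
    result, prev = [], None
    for pos, which in events:
        if prev is not None and inside[0] != inside[1] and pos > prev:
            if result and result[-1][1] == prev:
                # Extend the previous region when the new piece is adjacent.
                result[-1] = (result[-1][0], pos)
            else:
                result.append((prev, pos))
        inside[which] = not inside[which]
        prev = pos
    return result


x = [(0, 40), (50, 60)]
y = [(10, 20)]
result = difference_sketch(x, y)
print(result)
```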

track.manipulate.filter()

Filter features in a track by using a second track.

The filter manipulation computes the overlap of the first track against the second and returns every feature of the first track that overlaps the second. Features are returned in full rather than clipped to the overlapping zone.

Parameters:
  • X – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end', '...']; extra fields will be used.
  • Y – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end']; extra fields will not be used.
Returns:

A virtual track with the following fields: ['start', 'end', '...'].

A numerical example:

X [start,end]: (10,20) (30,40)
Y [start,end]: (10,12) (17,22)
R [start,end]: (10,20)

A visual example:

X: ───▤▤▤▤▤▤▤▤▤────────────▤▤▤▤▤▤▤──────
Y: ──────▤▤▤▤▤────▤▤▤▤▤─────────────────
R: ───▤▤▤▤▤▤▤▤▤─────────────────────────

If the lists of chromosomes contained in the various tracks differ, the conflict will be resolved by applying the ‘first’ principle.
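
A minimal sketch of this selection, assuming half-open intervals and that a feature is kept as soon as it overlaps any feature of the second track. filter_sketch is a hypothetical name and the quadratic scan is for readability only.

```python
def filter_sketch(xs, ys):
    """Keep each feature of xs, with its extra fields, if it overlaps ys."""
    return [x for x in xs
            if any(x[0] < y[1] and y[0] < x[1] for y in ys)]


# Features carry an extra 'name' field that is passed through untouched.
x = [(0, 10, 'a'), (20, 30, 'b'), (40, 50, 'c')]
y = [(5, 8), (45, 60)]
result = filter_sketch(x, y)
print(result)
```

Feature 'b' is dropped because nothing in the second track touches it, while 'a' and 'c' come back whole.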

track.manipulate.fusion()

Fuses features that are adjacent or overlapping in a track.

The fusion manipulation will combine any features in a track that are adjacent to one another or overlapping each other into a single feature. The strand attribute will be conserved only if all features that are being merged have the same strand. The score attribute will be the sum of all the scores.

Parameters:
  • X – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end', 'name', 'score', 'strand']; extra fields will not be used.
Returns:

A virtual track with the following fields: ['start', 'end', 'name', 'score', 'strand'].

A numerical example:

X1 [start,end]: (10,20) (20,30)
R  [start,end]: (10,30)
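
The merging rule, including the score sum and the strand handling, can be sketched as follows. The tuple layout (start, end, score, strand), the omission of the name field, and the use of 0 for an unknown strand are assumptions made for this illustration; fusion_sketch is a hypothetical name.

```python
def fusion_sketch(features):
    """Merge adjacent or overlapping (start, end, score, strand) features."""
    merged = []
    for start, end, score, strand in sorted(features):
        if merged and start <= merged[-1][1]:
            p_start, p_end, p_score, p_strand = merged[-1]
            # The strand survives only if all merged features agree on it.
            strand = p_strand if p_strand == strand else 0
            merged[-1] = (p_start, max(p_end, end), p_score + score, strand)
        else:
            merged.append((start, end, score, strand))
    return merged


x = [(10, 20, 1.0, 1), (20, 30, 2.0, 1), (50, 60, 5.0, -1)]
result = fusion_sketch(x)
print(result)
```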

A visual example:

X: ──────▤▤▤▤▤─────────▤▤▤▤▤──────
           ▤▤▤▤▤▤▤▤▤
R: ──────▤▤▤▤▤▤▤▤▤▤▤───▤▤▤▤▤──────

track.manipulate.mean_score_by_feature()

Mean score of the signal in every feature.

Given a signal track X and a feature track Y, the mean_score_by_feature manipulation computes, for each feature of Y, the mean of the scores of X over that feature. The output is a feature track similar to Y but with a new score value for every feature.

Parameters:
  • X – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end', 'score']; extra fields will not be used. The track is assumed to have no overlapping features.
  • Y – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end', 'score', '...']; extra fields will be used.
Returns:

A virtual track with the following fields: ['start', 'end', 'score', '...'].

A numerical example:

X [start,end,score]: (10,12,5)     (17,22,500)
Y [start,end,score]: (10,20,999)   (30,40,9999)
R [start,end,score]: (10,20,151)   (30,40,0)

A visual example:

X: ▁▁▁▁▁▁▁▁▁▁▁█████████▁▁▁▁▁▁▁▁▁▁██████████▁▁▁▁▁▁
Y: ──────▤▤▤▤▤▤▤▤▤▤──────────────▤▤▤▤▤▤▤▤▤▤──────
R: ▁▁▁▁▁▁▅▅▅▅▅▅▅▅▅▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁██████████▁▁▁▁▁▁

If the lists of chromosomes contained in the various tracks differ, the conflict will be resolved by applying the ‘last’ principle.
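
The arithmetic behind the numerical example can be checked with a short sketch: a signal scoring 5 on (10,12) and 500 on (17,22), averaged over the features (10,20) and (30,40). It assumes that bases without signal count as zero, which is what the value 151 implies; mean_score_by_feature_sketch is a hypothetical name.

```python
def mean_score_by_feature_sketch(signal, features):
    """Return (start, end, mean) for each feature, averaging the signal over it."""
    result = []
    for f_start, f_end in features:
        total = 0.0
        for s_start, s_end, score in signal:
            # Number of bases shared by the feature and this signal piece.
            shared = min(f_end, s_end) - max(f_start, s_start)
            if shared > 0:
                total += shared * score
        result.append((f_start, f_end, total / (f_end - f_start)))
    return result


signal = [(10, 12, 5), (17, 22, 500)]
features = [(10, 20), (30, 40)]
result = mean_score_by_feature_sketch(signal, features)
print(result)
```

For the first feature, (2 × 5 + 3 × 500) / 10 = 151; the second feature sees no signal and scores 0.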

track.manipulate.merge_scores()

Merge scores of all signals together.

The merge_scores manipulation merges N signals using some average function. If the boolean value geometric is true, the geometric mean is used, otherwise the arithmetic mean is used.

Parameters:
  • n_tracks (list) – An arbitrary number of tracks or paths to tracks. Alternatively, generators yielding features. The fields read from these tracks will be: ['start', 'end', 'score']; extra fields will not be used. Each track is assumed to have no overlapping features.
  • geometric (bool) – Use the geometric mean instead of the arithmetic mean. By default False.
Returns:

A virtual track with the following fields: ['start', 'end', 'score'].

A numerical example:

X1 [start,end,score]: (10,20,50) (30,40,100)
X2 [start,end,score]: (10,12,20)
R  [start,end,score]: (10,12,35) (12,20,25) (30,40,50)

A visual example:

X1: ▁▁▁▁▁▁▁▁▁▁█████████▁▁▁▁▁▁
X2: ▁▁▁▁▁▅▅▅▅▅▅▅▅▅▅▁▁▁▁▁▁▁▁▁▁
R:  ▁▁▁▁▁▂▂▂▂▂▇▇▇▇▇▅▅▅▅▁▁▁▁▁▁

If the lists of chromosomes contained in the various tracks differ, the conflict will be resolved by applying the ‘union’ principle.
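
A sketch of the averaging, assuming the mean is always taken over all N tracks, with a track that has no signal at a position contributing zero. merge_scores_sketch is a hypothetical name, and math.prod requires Python 3.8 or later.

```python
import math


def merge_scores_sketch(tracks, geometric=False):
    """Average N signals segment by segment between consecutive boundaries."""
    bounds = sorted({p for t in tracks for s, e, _ in t for p in (s, e)})
    n = len(tracks)
    result = []
    for start, end in zip(bounds, bounds[1:]):
        # Score of each track on this segment, 0.0 where the track is silent.
        scores = [next((sc for s, e, sc in t if s <= start and end <= e), 0.0)
                  for t in tracks]
        if not any(scores):
            continue  # no track has signal here
        if geometric:
            value = math.prod(scores) ** (1.0 / n)
        else:
            value = sum(scores) / n
        result.append((start, end, value))
    return result


a = [(0, 10, 4.0)]
b = [(5, 15, 16.0)]
result = merge_scores_sketch([a, b])
geometric_result = merge_scores_sketch([a, b], geometric=True)
print(result)
```

With the arithmetic mean, the shared segment (5,10) scores (4 + 16) / 2 = 10; with the geometric mean it scores the square root of 4 × 16, that is 8.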

track.manipulate.neighborhood()

Compute neighborhood regions upstream and downstream of features.

Given a stream of features and four integers before_start, after_end, after_start and before_end, this manipulation will output, for every feature in the input stream, one or two features in the neighborhood of the original feature.

  • Only before_start and after_end are given:

    (start, end, ...) -> (start+before_start, end+after_end, ...)
    
  • Only before_start and after_start are given:

    (start, end, ...) -> (start+before_start, start+after_start, ...)
    
  • Only after_end and before_end are given:

    (start, end, ...) -> (end+before_end, end+after_end, ...)
    
  • If all four parameters are given, a pair of features is output:

    (start, end, ...) -> (start+before_start, start+after_start, ...)
                         (end+before_end, end+after_end, ...)
    
  • If the boolean parameter on_strand is set to True, features on the negative strand are inverted as such:

    (start, end, ...) -> (start-after_end, start-before_end, ...)
                         (end-after_start, end-before_start, ...)
    
Parameters:
  • X – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end', '...']; extra fields will be used.
  • before_start (int) – Base pairs before the feature start. By default 0.
  • after_end (int) – Base pairs after the feature end. By default 0.
  • after_start (int) – Base pairs after the feature start. By default 0.
  • before_end (int) – Base pairs before the feature end. By default 0.
  • on_strand (bool) – Features on the negative strand can be inverted. By default False.
  • l (int) – The length of the current chromosome (only necessary when calling the manipulation with generators).
Returns:

A virtual track with the following fields: ['start', 'end', '...'].

A numerical example:

X [start,end]: (10,20)
R [start,end]: (5,8) (22,25)
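
The numerical example does not state its offsets. One choice that reproduces it is before_start=-5, after_start=-2, before_end=2, after_end=5; this is an inference, not something the text confirms. The sketch below covers only the four-parameter case, and neighborhood_sketch with its negative_strand flag is a hypothetical helper.

```python
def neighborhood_sketch(start, end, before_start, after_start,
                        before_end, after_end, negative_strand=False):
    """Return the pair of neighborhood features when all four offsets are given."""
    if negative_strand:
        # Mirror the offsets, following the on_strand rule.
        return [(start - after_end, start - before_end),
                (end - after_start, end - before_start)]
    return [(start + before_start, start + after_start),
            (end + before_end, end + after_end)]


result = neighborhood_sketch(10, 20, before_start=-5, after_start=-2,
                             before_end=2, after_end=5)
mirrored = neighborhood_sketch(10, 20, before_start=-5, after_start=-2,
                               before_end=1, after_end=3,
                               negative_strand=True)
print(result, mirrored)
```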

A visual example:

X: ──────────▤▤▤▤▤▤▤▤▤──────────────
R: ────▤▤▤▤─────────────▤▤▤▤────────

track.manipulate.overlap()

Pieces of overlap between two tracks (boolean AND).

The overlap manipulation computes the overlap between two tracks returning new features that exactly match the overlapping zones. This is equivalent to the boolean AND operation.

Parameters:
  • X – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end', 'name', 'score', 'strand', '...']; extra fields will be used.
  • Y – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end', 'name', 'score', 'strand', '...']; extra fields will be used.
  • l (int) – The length of the current chromosome (only necessary when calling the manipulation with generators).
Returns:

A virtual track with the following fields: ['start', 'end', 'name', 'score', 'strand', '...'].

A numerical example:

X [start,end]: (0,20)
Y [start,end]: (10,30)
R [start,end]: (10,20)

A visual example:

X: ───▤▤▤▤▤▤▤▤▤─────────────────▤▤▤▤▤▤▤──────
Y: ─────────▤▤▤▤▤▤▤────▤▤▤▤▤─────────────────
R: ─────────▤▤▤──────────────────────────────

If the lists of chromosomes contained in the various tracks differ, the conflict will be resolved by applying the ‘intersection’ principle.

track.manipulate.threshold()

Apply a score threshold.

Given a track X and a real number s, the threshold manipulation will remove any features with a score below the s value.

Parameters:
  • X – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end', 'score', '...']; extra fields will be used.
  • s (int) – Score threshold.
Returns:

A virtual track with the following fields: ['start', 'end', 'score', '...'].

A numerical example:

X [start,end,score]: (10,20,10) (30,40,5)
s: 8
R [start,end,score]: (10,20,10)
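
As a sketch, the same filtering on plain (start, end, score) tuples is a one-condition generator. It keeps features whose score equals s, since only scores below s are removed; threshold_sketch is a hypothetical name.

```python
def threshold_sketch(features, s):
    """Yield only the features whose score is at least s."""
    for feature in features:
        if feature[2] >= s:
            yield feature


x = [(10, 20, 10), (30, 40, 5)]
result = list(threshold_sketch(x, 8))
print(result)
```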

A visual example:

X: ▁▁▁▁▅▅▅▅▅▅▅▅▅▅▁▁▁▂▂▂▂▂▂▂▂▂▂▁▁▁▁█████████▁▁
R: ▁▁▁▁▅▅▅▅▅▅▅▅▅▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█████████▁▁

track.manipulate.window_smoothing()

Smooth scores with a moving window.

Given a signal track and a window radius L in base pairs, the window_smoothing manipulation will output a new signal track with, at each position p, the mean of the scores in the window [p-L, p+L]. Border cases are handled by zero padding and the signal’s support is invariant.

Parameters:
  • X – A track or the path to a track. Alternatively, a generator yielding features. The fields read from this track will be: ['start', 'end', 'score']; extra fields will not be used. The track is assumed to have no overlapping features.
  • L (int) – The window radius. By default 200.
  • l (int) – The length of the current chromosome (only necessary when calling the manipulation with generators).
Returns:

A virtual track with the following fields: ['start', 'end', 'score'].

A visual example:

X: ▁▁▁▁▁▁▁▁▁▁████████████▁▁▁▁▁▁▁▁▁▁▁▁
R: ▁▁▁▁▁▁▂▄▅▇████████████▇▅▄▂▁▁▁▁▁▁▁▁
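
The moving average with zero padding can be sketched on a dense per-base array. This illustration returns one value per base rather than a feature track and makes no attempt to reproduce how the library treats the signal's support; window_smoothing_sketch is a hypothetical name.

```python
def window_smoothing_sketch(signal, L, length):
    """Return the per-base mean over [p-L, p+L], padding with zeros at the borders."""
    dense = [0.0] * length
    for start, end, score in signal:
        for p in range(start, end):
            dense[p] = score
    smoothed = []
    for p in range(length):
        window = dense[max(0, p - L):p + L + 1]
        # Dividing by the full window width is what zero padding amounts to.
        smoothed.append(sum(window) / (2 * L + 1))
    return smoothed


x = [(2, 5, 9.0)]
result = window_smoothing_sketch(x, L=1, length=7)
print(result)
```

The flat block of 9s becomes a ramp up and a ramp down, the same shape as the visual example.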
