ISMB/ECCB 2009 Workshop: Bioinformatic Cores

2 days ago

Bioinformatics Core Facilities Workshop, Fran Lewitter, Michael Rebhan, Brent Richter & David Sexton

Missed the first five minutes due to another meeting.

Project types:

  • Long term: Pipeline development
  • Medium: Software / method development
  • Short: Ad hoc projects, quick analysis, preliminary data generation

Prioritization: just long term misses new customers, grant opportunities. Focus on money or project size misses pilot projects, does not allow diversification. Suggestion: based on merit, what allows the core to grow in new directions, expand, supports the institutional community in general

Hiring: 50% FTE available for a new person, enough to get someone started. Identify new technologies, hire to get an early start. Consultants can fill gaps.

Time: maintain an overview of timeframes (putative project starts). Wrap up projects to avoid task switching overhead as much as possible. Make time for the planning stages, use management tools (Basecamp, Trac etc). Work is periodic, plan accordingly

Expectations: be open and transparent with regards to availability, feasibility, stick to realistic time estimates and turn down work if the resources are not available. Collaborate between Cores

Core Cancer Research UK, Cambridge: does not charge, only tracks usage by project and group. Analysis heavy (at least 50% based on description), with partial workflow building, software development (Bioconductor). LIMS for sample tracking and pipelines, re-usage of public tools (Ensembl, Galaxy — mostly to ‘offload’ a number of tasks back to the biologists).

Long term/researcher projects tend to bog down a core. Focus on short term, genomic projects that can help multiple groups with rapid turnover. Main difference to us: generate their own array, short-seq data. Scaling issues: usually 6-10 projects per person at any given time.

Manage workload: define scope early, manage using collaboration software, deliver data in stages to keep everyone in the loop (and happy!). Standardize and automate as early as possible. Train as much as possible to offload work.

Fran: Reward the group (chocolate and ice-cream seem to be enough!). Small group problem — someone on vacation stalls a large number of projects. Group develops own long term projects (TargetScan website in this case). Her take on priorities: publication, grant, ongoing experiments, exploratory research.

Tracking: Work hours between labs, departments, short/long projects. Report once a year to each lab to improve communication. Play fair with whom to support next. Seek co-authorship for collaborative projects (but does not seem to require this)

Hot topics of the month: those best received are hands-on tutorials that benefit the most people with basic tools (Ensembl, statistics packages). Unlikely to spend more time than this, but can be spread out over time. Conscious effort to prioritize them, otherwise they never get set up. Huge success with basic Perl courses just to do very basic data processing

Chargeback model the standard approach from a quick poll, but 50% of the Cores represented do not charge their users at all (cost covered by their institutes). Fair number support industrial collaborations. Importance of time tracking, cost, part of every project discussion. Provide detailed reports (what was done and how long did it take) to the collaborator / customer at regular intervals.

Large-scale projects: try to avoid those, or outsource / hire specific support. Buy commercial solutions and try (!) to customize the tools, which can be risky depending on the level of commercial support.

Authorship: mostly just acknowledgement, many cores with focus on master’s level students; core authorship depends heavily on the scientific contribution. Alternative benefits – higher salary, ability to get involved in a large number of different interesting projects

Collaboration: central place to share knowledge of methods, tools, evaluation. One place to deposit this information could be the BioinfoCoreWiki or the mailing list.

Post-analysis of short reads

What kind of questions are being asked, can they handle the data themselves, and is there any way to build re-useable workflows? Experience seems to be that so far no question (beyond the assembly step) has come up twice.

Simon Andrews, Babraham Institute. Growth of data, but no additional people to handle short-seq. Huge, diverse range of data types (chip-seq, SNPs, mRNA, bisulfite, …) that all require different downstream methods. Not involved in the initial QC, but take over for mapping and additional QC to save time on subsequent analysis and to set expectations on what can be done with the data (and to build expertise on what went wrong and why during a run).

Additional visualization, quantitative analysis and biological analysis. An iterative process that kills efficiency (please change the cutoff, different colours, just a slightly different view; sounds very familiar). Division of labour required, biologists need to work with and delve into their own data. Support with scripts, keep track of data being used across experiments, results by a junior person. Core facility to handle the management of primary data, develop software/glue, and only get involved in more complex problems that cannot be handled by the biologists.

One example for this is SeqMonk, to be used by scientists (and thus needs to be useable on desktop machines, hence handles mapped data only). A tool to once again offload tasks back to the requester. Scan/visualize their data to understand it, generate reports. Experiment agnostic, as generic as possible. A number of other tools are also available from their website

Core facility jobs:

  • Mapping (probably with base calling as well?)
  • Post-mapping QC
  • Filtering (remove biases, PCR artifacts)
  • Construct standard workflows (no consensus yet)
  • Customized analysis (when unavoidable)

Get involved in the design stage, provide realistic quotes. Lead times to evaluate the correct algorithm and methods.

Take from Partners Core: 4 Solexas deployed across Partners’ institutions, about 3TB/week/instrument. Pre-analysis pipeline deployed by ERIS within an HPC environment, regular test of new base callers and assemblers by the Core. In 2009: 283 alignments and 145 raw data sets handled by a single bioinformaticist; all that can be done is to care for and feed the pipeline and return alignments. All downstream analysis by investigators.
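A quick back-of-the-envelope check of what those numbers add up to. The 52-week year and the assumption that all four instruments run continuously are mine, not from the talk, so treat the yearly figure as an upper bound.

```python
# Rough data-volume estimate from the figures quoted in the talk.
# Assumption (not from the talk): every instrument runs every week of the year.
INSTRUMENTS = 4
TB_PER_WEEK_PER_INSTRUMENT = 3
WEEKS_PER_YEAR = 52

weekly_tb = INSTRUMENTS * TB_PER_WEEK_PER_INSTRUMENT
yearly_tb = weekly_tb * WEEKS_PER_YEAR

print(f"{weekly_tb} TB/week, up to {yearly_tb} TB/year")
```

Even if the instruments only run half the time, this makes it obvious why one person can do little more than keep the pipeline fed.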

Most users only just getting started (likely further increase in data). Huge number of applications (see Nat Biotech paper from 2008), focus on genome re-sequencing, small RNA, SAGE, Chip-seq. About 20 different analysis tools being tested (Eland, MQ, MOSAIK, RMAP, SHRIMP, SOAP, SSA, velvet, PyroBayes etc). Usually come back to basics:

  • Eland, Cross-Match (alignment)
  • MAQ and Vaal (alignment and variant detection)

Post-analysis further down the line: GenomeQuest, Genomatix, CLC bio tools as pay services, mostly in the testing phase. Most important aspects: development of the study with investigators, QC and high quality alignments. Looking into Galaxy as a framework for standard workflow development. Can be a technical challenge when it comes to scaling.

Additional challenges include:

  • alignment process (speed vs quality), need to evaluate each tool for a specific purpose. (That is going to be our main problem as well, does not scale at all)
  • lack of standards (comparison across platforms, versions, alignments). Best current way is to convert all to phred-like scores
  • the right algorithm (no benchmarks, comparison methods. Best practice is experimentation)
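On the "convert all to phred-like scores" point, a minimal sketch of the two standard definitions involved: the Phred score, Q = -10·log10(p), and the older Solexa-style score, Q = -10·log10(p / (1 - p)). The function names are my own; which scale a given pipeline version emits is something to check per data set.

```python
import math

def phred_from_error_prob(p: float) -> float:
    """Phred quality: Q = -10 * log10(p), with p the base-call error probability."""
    return -10.0 * math.log10(p)

def phred_from_solexa(q_solexa: float) -> float:
    """Convert a Solexa-style score, Q = -10 * log10(p / (1 - p)), to Phred."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)

# The two scales differ most for low-quality bases and converge for high ones.
for q in (0, 10, 40):
    print(q, round(phred_from_solexa(q), 2))
```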

Added from the audience: additional formats (not converging yet). Cooperate with sequencing center that ensures a bioinformatics consult has taken place prior to the samples being run (going to be difficult at Harvard, too de-centralized. Still worth trying to make contacts).

Try to compile information and discussions from the SEQanswers forum. Usually pass on only coordinates and processed data on to collaborators, handling of data standards is a problem of the Core.

Data storage: keep processed data, discard raw data (Partners: about 300TB of storage). Can be a problem as base callers are improving. Very few projects warrant going back, though, difficult enough to keep up with new data. In general no charge for storage though.

Oliver Hofmann


---

ISMB/ECCB 2009 Day 3

3 days ago

Yang Huang (NIH) – Graph Theoretical Approach To Study eQTL: A Case Study of Plasmodium Falciparum

P.falciparum is the most deadly human malaria pathogen. Little information about gene regulation so far, eQTL might be able to shed some light on this regulation and drug resistance. Reference to Daphne’s talk, difficult to deploy her methods due to lack of information in this pathogen.

Hypothesis: SNPs might affect gene expression. Consider expression as a quantitative trait like height, weight. Identify the associated locus by statistical methods.

For each progeny strain measure all gene expression, genotypes for predetermined loci. Result is a set of vectors, one for each strain including the genotype and expression values. Detect locus/gene pairs with statistical association between gene expression, identified genotype at one locus.
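To make the "statistical association" for a single locus/gene pair concrete, here is a minimal sketch using a Welch t-statistic on toy data. The talk did not say which test the traditional approach uses, so the choice of test, the function name and the numbers are all assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(expr_by_strain, genotype_by_strain):
    """Welch t-statistic comparing expression between the two alleles at one locus."""
    a = [e for e, g in zip(expr_by_strain, genotype_by_strain) if g == 0]
    b = [e for e, g in zip(expr_by_strain, genotype_by_strain) if g == 1]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Toy data: six progeny strains, one locus (0/1 = parental allele), one gene.
genotype = [0, 0, 0, 1, 1, 1]
expression = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]

t = welch_t(expression, genotype)
print(round(t, 2))
```

Repeating this for every locus against every gene is what makes the traditional approach expensive and the multiple testing burden heavy.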

Traditional tests between multiple loci, all expression. Comprehensive and without bias, but does not use the inherent data structure, computationally expensive and a problem of statistical power. Alternative approach GeD, Graph-based eQTL decomposition. Include strain data in the association graph to identify hidden structure (eQTL association cliques) that help reduce the complexity of the data.

Construct graph

Three types of vertices: gene linked to strain linked to locus. One node for each distinct locus, two nodes per gene (up/down-regulated).

eQTL association cliques

Each clique has 3 vertices (G/S/L) that are fully connected, in addition each clique is a maximal subgraph that cannot be extended further; enumerate all maximal bipartite cliques in the graph (Farach-Colton, CPM 2008). Merge cliques by common vertices for strains.

eQTL detection

Heuristic approach on eQTL cliques to look for (Locus,gene) pairs with certain patterns; refer to graph/diagram in paper. Support is the number of strains in the data set that agree with an identified pattern.
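The notion of support can be sketched as a simple count. This is not the GeD heuristic itself (see the paper for that); it only shows how many strains agree with a given (locus allele, gene direction) pattern. The data layout and all names are hypothetical.

```python
from collections import Counter

# Toy data: each strain has an allele per locus and an up/down call per gene.
strain_alleles = {
    "s1": {"L1": "A"}, "s2": {"L1": "A"}, "s3": {"L1": "B"}, "s4": {"L1": "A"},
}
strain_calls = {
    "s1": {"g1": "up"}, "s2": {"g1": "up"}, "s3": {"g1": "down"}, "s4": {"g1": "down"},
}

def pattern_support(strain_alleles, strain_calls):
    """Count, for every (locus, allele, gene, direction) pattern, the strains showing it."""
    support = Counter()
    for strain, alleles in strain_alleles.items():
        for locus, allele in alleles.items():
            for gene, direction in strain_calls[strain].items():
                support[(locus, allele, gene, direction)] += 1
    return support

support = pattern_support(strain_alleles, strain_calls)
MIN_SUPPORT = 2  # threshold lowered for the toy data
kept = {pattern: n for pattern, n in support.items() if n >= MIN_SUPPORT}
print(kept)
```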

Results

34 progeny strains, eQTLs need to be supported by at least six strains. Significant difference to random background cliques, 1327 eQTLs for 513 probes are significant (p-value adjusted by background model). About 25% overlap with classical eQTL results, but with similar genomic distribution. Enrichment in chr3 subtelomeric regions, genes in the region enriched for host interaction.

Cliques help to detect eQTLs, avoiding a large number of tests; integration of strain information provides a new framework for eQTL studies. Improve heuristics, identify one-to-many loci to gene interactions as future work.


Mark Clement (Brigham Young) – GNUMAP: Unbiased Probabilistic Mapping of Next-Generation Sequencing Reads

Hash target genome into k-mers (indexing) for constant time lookup. Alignment step: identify possible match locations based on seed location followed by probabilistic Needleman-Wunsch, taking base call quality into account. Includes PWM for each sequence, allows inclusion of insertions / deletions in addition to quality information. If a read (PWM) matches multiple locations uses a probabilistic assignment, assigning proportions of a read to all possible match locations.
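A minimal sketch of the two ideas that are easy to show in a few lines: the k-mer hash for seed lookup and the proportional assignment of a multi-mapping read. This is not GNUMAP's code. The probabilistic Needleman-Wunsch step is left out entirely, and the scores passed to `split_read` are placeholders for whatever alignment likelihood that step would produce.

```python
from collections import defaultdict

def build_kmer_index(genome: str, k: int) -> dict:
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def candidate_locations(read: str, index: dict, k: int) -> set:
    """Seed lookup: project each k-mer hit back to the implied read start position."""
    starts = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            if pos - offset >= 0:
                starts.add(pos - offset)
    return starts

def split_read(scores: dict) -> dict:
    """Assign fractions of one read to its candidate locations, proportional to score."""
    total = sum(scores.values())
    return {loc: s / total for loc, s in scores.items()}

genome = "ACGTACGTTTGACGTA"
index = build_kmer_index(genome, k=4)
hits = candidate_locations("ACGTA", index, k=4)

# Placeholder alignment scores for two of the candidate locations.
weights = split_read({0: 3.0, 11: 1.0})
print(sorted(hits), weights)
```

Note that the seed step returns position 4 as a candidate even though the full read does not match there; weeding that out is the job of the alignment step.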

Relation to other tools via simulation studies as well as actual data. Seems to be doing better recovering the original location, but not quite clear what would be the ideal problem / data set to play to GNUMAP’s advantages. No support for paired-end reads currently, but planned along with SOLiD support.

Slower but more precise than other mappers. No limit on sequence size and would work with 454, but needs all four base call probabilities (FASTQ file is not enough).


Younghoon Kim (KAIST) – MONET: A Cytoscape plugin for genome-scale network inference from expression profiles using modularization and parallel processing techniques with supercomputing resources

A need to add information beyond large scale expression data. Improving the sample to gene ratio by modularization. Incorporates pre-existing functional annotation (GO). Identify functional modules based on annotation, incorporate in global network in a divide-and-conquer approach. Calculates a Bayesian network analysis for each module. Identifies seed genes for each condition and expands based on functional annotation and expression data (details are a bit difficult to follow, recommend looking at the paper)


Shai Lubliner (Weizmann) – Modeling Interactions between Adjacent Nucleosomes Improves Genome-wide Predictions of Nucleosome Occupancy

75-90% of the DNA is associated with nucleosomes, play an important regulatory role. Determining occupancy of DNA is being done with a number of different methods, results in an affinity landscape. Here: a thermodynamic model. Reasonable correlation of 0.65 between model and data despite only using the DNA information.

Additional interactions are important for chromatin organization. DNA bending proteins, TF, histone modifications, etc. Trying to capture interactions between adjacent nucleosomes (cooperative effects) with the linker length preferences being encoded by a function. A ‘no cooperation’ function as reference / background model. Linker preferences obtained from in vivo data (right shifted peak, exponential decay at a certain length), tested five different representative functions.

Sample 5000 random configurations from a model instance and compare them to the in vivo data; the functions can represent the underlying cooperative data. Repeat with data samples, add noise and try to fit different models to the sampled occupancy landscape. Model can be fitted if cooperativity is taken into account, no success without cooperative interactions.

Do interactions play a role in vitro and in vivo (aka real biological systems)? For an in vitro validation the Exp, Step functions work better than the no cooperativity model, they exhibit a strong preference for short linker lengths.

For in vivo examples similar results with almost all functions doing better than the background model. Repeat for C.elegans with the Exp function only, again a strong preference for short linker lengths

Biological basis for this preference: shorter lengths allow for interaction of nucleosomes, energetically favoring their shift from otherwise better binding positions


Karim Chine – Computational Biology in the cloud, towards a federative and collaborative R-based platform (Biocep-R)

(Talk actually given by Eamonn Maguire)

Java app built on top of R and Scilab. With RESTful API, improved graphics, extensibility (plugins, server-side), distributed resources. Components such as R, Bioconductor; GUIs with collaborative views; Scripting (R/Python/Ruby); stateless web services; NFS/FTP/S3 storage; cluster/grid support.

  • Server-side, grid-enabled collaborative spreadsheet, multiple clients connected to the same server sharing the same view.
  • GUI builder using Netbeans.
  • Full cloud support and node worker support. Biocep can automatically add workers on the cloud on demand given a certain load
  • Scripting to wrap additional services

Actually.. coverage over at FF is much better ;-)


Other notes

  • BoF meeting on microblogging on ff
  • Lengauer Keynote coverage on ff
Oliver Hofmann


---

ISMB/ECCB 2009 Day 2

4 days ago

Keynote: Daphne Koller – Individual Genetic Variation: From Networks to Mechanisms

Understanding gene regulation: from networks to mechanisms — some new results caused a slight change of topics. RNA degradation mechanism, DNA modifications, endogenous and exogenous perturbations all important in gene regulation. Aims:

  • Inferring regulatory networks
  • Identify effect of perturbations
  • Result on phenotype

Regulatory networks for expression

mRNA level of regulator an (imprecise) indicator of regulator activity. Target expression partially predicted by expression of regulators. Broad view of a regulator gene: TFs, signal transduction proteins, RNA processing factors, anything that might play a direct or indirect role in gene regulation. Second assumption: co-regulated genes have similar regulatory mechanisms, group genes into modules and predict expression profile for the entire module

Structure of the regulatory program (Segal 2003 Nat Genetics): notion of a regression tree for regulatory programs. Boolean-logic style description that is easy to understand, but has disadvantages such as poor regulator selection lower in the tree, misses a lot of regulators due to lack of statistical power, and the choice between correlated regulators can be arbitrary.

Move to regulation in a linear regression model to identify activating / repressing effect of regulators on a module. Usually hundreds of regulators with an impact on the module. Lasso (L1) regression, constant ‘drive’ towards zero, induces sparsity in the solution. Also computationally more efficient — but does not get around the arbitrary regulator choice.
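The "constant drive towards zero" is easiest to see for a single standardized regulator, where the Lasso solution is the soft-thresholded least-squares weight. A minimal sketch follows, with a ridge-style shrinkage next to it for contrast; the weights and the penalty value are made up, and this is the textbook one-dimensional case rather than the talk's actual model.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Lasso update for one standardized regulator: shrink by lam, clip at zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def ridge_shrink(z: float, lam: float) -> float:
    """L2 counterpart: scales the weight but never sets it exactly to zero."""
    return z / (1.0 + lam)

# Hypothetical least-squares weights for five regulators of one module.
ols_weights = [2.0, -1.5, 0.3, -0.2, 0.05]
lasso = [soft_threshold(w, 0.5) for w in ols_weights]
ridge = [ridge_shrink(w, 0.5) for w in ols_weights]

print(lasso)
print(ridge)
```

The three weak regulators drop out completely under L1, while L2 keeps all five with smaller weights.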

Elastic net regression to avoid arbitrary regulator / feature choice. Cluster genes to modules, learn regulatory program for module, repeat for all modules, iterate after re-assignment of genes to modules based on how well a program predicts the expression of a gene in the module (in essence a Bayesian network, multiple genes share same program).

Genetic variation and regulation

Application of this approach to genotype to phenotype analysis. Data set eQTL (Brem 2002 Science), two different yeast strains, 112 individuals with array and SNP data. Adapt the regulatory network approach by including the genotype / markers. How do markers affect the expression level of a given module?

Sample results:

  • Telomere module (40 out of 42 in telomere region): most dominant regulator a region on chromosome XII which includes Rif2 regulator
  • Chromatin modules (4 out of 5 genes are consecutive): Sir1-containing region is the strongest controlling factor; additional modules controlled by regions with previously unknown Sir1 homologs

Evolutionary strategy to group target genes, identified 16 chromatin regulator regions

  • Puf3 module (147/153 targets of a Puf3 pulldown), sequence specific mRNA binding protein regulating degradation.

But Puf3 is not correlated with the module — P-bodies are (translational repression of mRNA stored in the P-body). Dhh1 regulates mRNA de-capping, Puf3 predicted (and confirmed) to be involved in localization to the P-body. Primary regulator of Puf3/the regulators is a large region on chr14 with 30 genes. Trying to identify the ‘regulatory potential’:

  • not all SNPs equally likely to be causal. Can create a set of features to select (coding, conservation, non-synonymous). How do you weight these features?
  • use Bayesian L1-regularization, prior a laplacian distribution, each regulator with its own prior determined by regulatory features of the regulator
  • for each module, the properties of the regulators allow regulators to be selected in a more informed (biased) way

Metaprior method: bootstrap to learn the regulatory program, learn regulatory weights (e.g., stop codons get a higher weight), compute the potential for each SNP in the genome to bias the regulatory potential of each regulator, iterate until convergence. (Empirical hierarchical Bayes). Regulatory potentials do not change the selection of strong regulators, but help to disambiguate between multiple weak regulators. Strong regulators teach us what to look for in the putative weak regulators. In this case feature ranking / significance (learned)

  • strongest feature is the stop codon
  • cis-regulation (SNP and adjacent gene)
  • conservation
  • different combination of gene functions (protein binding, glucose process, RNA modification)

Statistical evaluation using PGV, the percent of genetic variation explained by the predicted regulatory program for each gene. Explained about 50% variation for half of the genes.

Used the approach to find causal regulators in 13 chromosomal hotspots, including the P-body region, with good results. Learning regulatory priors that are specific to a data set and organism, can handle any kind of regulator type.

Regulation in the context of cell differentiation

Understanding the process underlying differentiation with the ImmGen consortium, 200 arrays from 63 immune cell types. Can identify shared regulatory programs for all 60 samples. Does not have the G-regulators from the eQTL study. One network for each cell type overfits the data, but can bias towards shared regulation. Use the ontogeny to guide conserved regulation. Penalize changes in the regulatory program.

Test data prediction on six cell types, predict on one array (leave one out), the soft, lineage-aware model provides better accuracy. Identified novel member (JARID1B) as a candidate regulator.

Phenotype

How does expression change cause changes in phenotype, and what regulatory programs cause them? FL to DLBCL (Diffuse large B cell lymphoma) transformation in patients. Represent each module as a metagene, use ML technique to learn classifiers to distinguish FL-t (pre-transformed) from DLBCL (to appear in Blood). Module of interest in embryonic development (ESC1/2/TGF-beta), also a good predictor of survival. Potentially therapeutic implications based on a connectivity analysis, identify drugs likely to interfere with key genes in the module.

Metabolic syndrome in mouse experiment (300+ animals) using a high-fat diet, samples from four tissues. Phenotype network can use modules from different tissues. Interesting module includes liver biosynthesis, in-between insulin and cardiovascular disease traits.


Gregory Kryukov (Brigham) – Learning from Resequencing Data: What To Do When the $1000 Genome Arrives?

Mendelian diseases can be characterized by linkage analysis (classical association studies). New sequencing technologies, ways to collect clinical populations and exon capturing approaches can revolutionize the search for genes underlying human phenotypes. All genes have rare coding variants, and while sequencing will uncover low frequency variants the power to detect their associations is reduced and the multiple testing correction becomes very stringent. One option: combine all non-synonymous variants into a single test
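A minimal sketch of what "combine all non-synonymous variants into a single test" can look like: collapse the variants of one gene into a carrier indicator per individual and compare phenotypes between carriers and non-carriers. The data are made up, and a real analysis would follow this with a proper significance test.

```python
from statistics import mean

# Toy genotype matrix: rows = individuals, columns = rare non-synonymous
# variants in one gene (1 = variant present).
genotypes = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
phenotype = [0.1, 1.9, -0.2, 2.1, 2.0, 0.1]

def collapse(genotypes):
    """Carrier indicator: 1 if the individual carries any of the variants."""
    return [int(any(row)) for row in genotypes]

carrier = collapse(genotypes)
carriers = [p for p, c in zip(phenotype, carrier) if c]
others = [p for p, c in zip(phenotype, carrier) if not c]
shift = mean(carriers) - mean(others)

print(carrier, round(shift, 2))
```

Each variant here is seen only once, so no single-variant test would have any power; the collapsed test uses all three carriers at once.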

Sequence all exons from a population, characterize variations at the extreme. Theory:

  • most new mis-sense mutations are functional
  • are only weakly deleterious
  • are likely to influence phenotype in the same direction

58 genes from 768 individuals (existing population sequencing data), estimate parameters of demographic history (from non-coding variation data), simulate genotypes, simulate phenotypes (quantitative traits), simulate a sequencing study to estimate the power

Demographic history of 370 generations agrees with experimental data, add natural selection (quarter of mutations neutral, majority at around 10^3 selection co-efficient).

For 20,000 genes we need a p-value of 2×10^-6 to associate mutations with phenotype. A small sample of 1000 individuals has 75% power to detect genes which shift the phenotype deviation by 2 std devs. With sample sizes approaching 10,000 individuals, re-sequencing based association studies are feasible (with phenotype information from 100k individuals).
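The threshold is consistent with a plain Bonferroni correction, assuming a genome-wide significance level of 0.05 (the talk did not state the level, so that value is my assumption):

```python
ALPHA = 0.05       # assumed genome-wide significance level (not from the talk)
N_GENES = 20_000   # one collapsed test per gene

threshold = ALPHA / N_GENES
print(f"per-gene p-value threshold: {threshold:.1e}")
```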

Computational predictions of damaging mutations will help, as does multistage design.


Mikhail Zaslavskiy (Ecole des Mines de Paris) – Global Alignment of Protein-Protein Interaction Networks by Graph Matching Methods

Motivation: automatic identification of protein functional orthologs. Standard approaches like reciprocal best BLAST hits have problems when several top hits have similar scores: which pair to choose? Additional information helps to resolve ambiguity.
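A minimal sketch of reciprocal best hits that keeps ties, to show where the ambiguity comes from. The scores and protein names are made up, and the nested-dict layout is just a convenient assumption, not a BLAST output format.

```python
def best_hits(scores: dict) -> dict:
    """For each query, return the set of top-scoring targets (ties kept)."""
    best = {}
    for query, targets in scores.items():
        top = max(targets.values())
        best[query] = {t for t, s in targets.items() if s == top}
    return best

def reciprocal_best_hits(a_vs_b: dict, b_vs_a: dict) -> set:
    """Pairs (a, b) where b is a best hit of a and a is a best hit of b."""
    best_ab, best_ba = best_hits(a_vs_b), best_hits(b_vs_a)
    return {(a, b) for a, hits in best_ab.items() for b in hits
            if a in best_ba.get(b, set())}

# Toy bit scores between species A and species B.
a_vs_b = {"a1": {"b1": 250, "b2": 250}, "a2": {"b3": 180}}
b_vs_a = {"b1": {"a1": 250}, "b2": {"a1": 250}, "b3": {"a2": 180}}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(sorted(pairs))
```

Protein a1 ends up with two equally good partners, which is exactly the case sequence similarity alone cannot resolve.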

Protein clustering based on blast similarity clusters (InParanoid): only proteins in the same cluster can be annotated as functional orthologs.

PPI networks (the usual hairball) can also be used to resolve this. Ortholog assignments that conserve PPI interactions are ranked higher. Identify the mapping that maximizes the number of overlapping PPI cases.
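Counting conserved interactions for a candidate mapping is straightforward; the hard part is searching over mappings. A minimal sketch of the counting step only, on toy networks with made-up names:

```python
def conserved_interactions(ppi_a: set, ppi_b: set, mapping: dict) -> int:
    """Count edges of network A whose mapped endpoints also interact in network B."""
    edges_b = {frozenset(edge) for edge in ppi_b}
    count = 0
    for u, v in ppi_a:
        if u in mapping and v in mapping:
            if frozenset((mapping[u], mapping[v])) in edges_b:
                count += 1
    return count

ppi_a = {("a1", "a2"), ("a2", "a3")}
ppi_b = {("b1", "b2"), ("b2", "b3"), ("b2x", "b3")}

# Two candidate ortholog assignments that differ only in the partner of a2.
option_1 = {"a1": "b1", "a2": "b2", "a3": "b3"}
option_2 = {"a1": "b1", "a2": "b2x", "a3": "b3"}

print(conserved_interactions(ppi_a, ppi_b, option_1))
print(conserved_interactions(ppi_a, ppi_b, option_2))
```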

  • Constrained alignment: InParanoid approach (Ideker 2006, MRF model). New approach here is the Message Passing (MP) Algorithm. MP is based on a forward-backward recursion, each node represents a protein cluster. Provides the maximum number of conserved interactions if the MP graph is a tree.
  • Balanced alignment: find the alignment that maximizes the number of conserved interactions and the sum of all BLAST similarity scores of the pairs (Singh 2008), here with graph matching algorithms. Balanced alignment using the gradient ascent and path algorithm to solve the graph matching.

PPI networks and InParanoid clusters from the Ideker paper, 2244 clusters (1552 with only 2 proteins / orthologs, 692 ambiguous clusters that need to be resolved). No cycles in this graph means constrained alignment approach can be used, results in 238 conserved interactions (run time of 1-2 seconds). Balanced alignment does not use the InParanoid clusters, recovers the highest number of conserved interactions when compared to PATH, IsoRank.

Constrained algorithm with message passing is an exact solution, graph matching algorithms deliver good performance for balanced alignment problems.


Jose Caldas (CIT, Helsinki) – Probabilistic retrieval and Visualization of Biologically Relevant Microarray Experiments

Trying to find a method to relate results in large array databases based on expression information rather than annotation. A need to retrieve all ‘related’ experiments from a database using the data, not the text. Standard approaches like Spearman correlation coefficients, but it would be interesting to use sets of experiments as a query rather than a single array.
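For reference, the standard single-array baseline looks roughly like this: rank database experiments by Spearman correlation with the query profile. A minimal pure-Python sketch on toy numbers; the experiment names are hypothetical.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

query = [5.1, 2.0, 8.3, 1.2]
database = {"exp_A": [4.8, 2.5, 9.0, 0.9], "exp_B": [1.0, 7.5, 0.5, 6.0]}
ranked = sorted(database, key=lambda name: spearman(query, database[name]), reverse=True)
print(ranked)
```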

Query with a binary phenotype comparison and try to get back other, similar comparisons. Requires encoding of the phenotype comparison such as a vector of t-tests (0/1 vector for differentially expressed genes). This can be noisy, so use a vector of differential GSEA results instead. Fewer features to compare, additional biological information available for the sets.

Uses standard GSEA, number of genes in leading edge as vector value, ignoring the directionality. Uses LDA (Latent Dirichlet Allocation, used in bag-of-words approaches) rather than standard vector comparisons (combines gene sets into ‘topics’).

750+ binary phenotype comparisons from 288 experiments, focus on 105 comparisons for this analysis. Gene sets assigned to topics are coherent across a wide range of biological processes. Graph visualization allows exploration starting from phenotype pairs, gene sets or topics.

Retrieval performance better than random, but seems to have low-to-moderate recall for reasonable precision numbers (?)


Sven Nelander – Models from experiments: combinatorial perturbations of cancer cells

What happens when you change more than one factor in a cell? E.g., paired perturbation screens. Here: cancer patients, based on Nelander, Mol Syst Biol 2008

Defining feature of cancer cells: breakdown of regulatory systems. Pharmacology frequently targets the function of proteins in these pathways. Hope is to selectively counteract specific mutations (particularly gain of function mutations that can be inhibited). In many cases a tumour contains a whole set of mutations, computational methods to predict how it will react to a given drug.

Approaches to combinatorial perturbations:

  • Gene-gene interaction / epistasis
  • Drug-drug interaction
  • Genomic interaction screens
  • Flux models
  • Now: “systems biology”

Drugs as input to the system, molecular / phenotype response as output, search for models recapitulating the experiment. Use the models to predict new potential interventions. Sample experiment from MCF7 cells. Used a continuous Hopfield network (neural network which allows feedback loops), trade-off between model fit and simplicity to construct model from the data. MC simulation results in probabilities of functional interaction based on the frequency with which interactions were observed during the simulations.

Functional interpretation of the top pathways / network recapitulates MAPK cascade of the EGF receptor, PI3K-dependent AKT control, m-TOR signaling. Experimental verification in the Sander lab. Cross-validation with leave-one-out experiments does “rather well”.

Current work: characterize human tumour samples. Input includes point mutations, CNV, gene fusions, DNA methylation (analogous to drug input), phenotype output are changes in expression measurements. Adapted CoPIA to cancer genomics data, the perturbation is gene dosage (gain/loss of DNA) with direct effect on transcription levels. Model also captures indirect regulatory effects. Reduced the problem to a linear summary model, derived from the S-system (1969).

  • each tumour at a transcriptional steady state
  • same model applies to all patients
  • only try to explain differences between patients

Bootstrapped Lasso (Francis Bach 2008) and CoPIA approach (only effective up to 20 genes). Glioblastoma CoPIA model based on profiles from 200 tumours shows EGFR as a pleiotropic regulator along with new testable predictions (GPCR, Necdin, others with strong neural expression). NDN over-expression in U343 glioma cells slows growth of cells in a dose-dependent manner. In four out of six cases the experimental validation matches the CoPIA model predictions


Anton Nekrutenko (Penn State) – Galaxy Library System for Management of Next Generation Sequencing Data

How to handle large data sets in tools with complex interfaces? Galaxy as a software framework. Instance is a piece of hardware running the framework, talk is focused on the software part.

NGS data must be served in an (immediately) useful form, ideally with a proliferation of ‘best practice’ workflows and practices. Showcase using the Penn State (modified) instance. Sequencing center provides sequencing run information as Galaxy libraries, pre-loaded on the server. Data can either be downloaded for offline / individual processing, or alternatively processed within the Galaxy framework. Includes:

  • statistical analysis of data quality
  • convert to graphical representation for easier evaluation
  • align to target genome (usually also pre-loaded on the server)
  • SNP calls, along with visualization of SNPs in a genome-browser view

[Brief demonstration of the mobile-optimized version of the site that shows job histories and status.. just in case you want to check during lunch.]

Switch of speakers, emphasis on tool integration. Quick (if unreadable from the back of the room) walkthrough of a tool integration process; takes about 2-3 minutes to write up an XML binding for a command line program.


Igor Ulitsky (Tel Aviv) – Regulatory networks define phenotypic classes of human stem cell lines

How to identify differences between pluripotent and multipotent cell types? Should allow a rapid analysis of new cell lines and distinguish pluripotency from self-renewal and survival, increase the safety of cell-replacement therapies and direct differentiation in better ways.

mRNA, miRNA, DNA methylation, genomic and chromatin structure. Focus here on mRNA and miRNA in 200 stem cell-related samples, followed by unbiased clustering and identification of mechanisms using a protein interaction network

mRNA

Non-negative matrix factorization to reduce dimensionality (metagenes vs experiments), repeat from multiple starting points and observe which samples are frequently co-clustered (i.e., robust). Ended up with k=12 clusters, most stem cell types group as expected (hESC and IPSCs, NSC split into coherent but distinct clusters). Used MATISSE which identifies sets of genes / modules that form connected subgraphs in the PPI network and have highly correlated expression patterns (and are up/down-regulated in specific NMF clusters with respect to other clusters, a new extension). Yielded the PluriNet network described in the original Nature paper.
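The robustness check (which samples are frequently co-clustered across restarts) boils down to a co-clustering frequency per sample pair. A minimal sketch of that step only; the NMF itself is not shown, and the cluster labels and sample names are hypothetical.

```python
from itertools import combinations

def coclustering_frequency(runs):
    """Fraction of runs in which each pair of samples lands in the same cluster."""
    samples = sorted(runs[0])
    freq = {}
    for a, b in combinations(samples, 2):
        together = sum(1 for labels in runs if labels[a] == labels[b])
        freq[(a, b)] = together / len(runs)
    return freq

# Hypothetical cluster labels from three restarts. Label numbers are arbitrary
# per run, which is why only "same cluster or not" is compared.
runs = [
    {"hESC_1": 0, "hESC_2": 0, "NSC_1": 1},
    {"hESC_1": 1, "hESC_2": 1, "NSC_1": 0},
    {"hESC_1": 0, "hESC_2": 0, "NSC_1": 0},
]

freq = coclustering_frequency(runs)
for pair, value in freq.items():
    print(pair, round(value, 2))
```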

Double-checked the plurinet expression patterns across stem cell and differentiated cells with good results; PluriNet can be used as a cell type classifiers. See http://www.openstemcellwiki.org/ for more details.

miRNA

Quick overview of miRNAs (mechanism, specificity, seed sequences). Profiled miRNA expression in 26 cell lines including pluripotent and differentiated cell lines. This time a separation into six different groups with all differentiated cell lines being present in one robust cluster despite tissue heterogeneity. The relevant ESC miRNAs are clustered (and upregulated), with a large cluster of primate specific miRNAs on chr19.

miRNA sequences upregulated in ESCs frequently carry an AAGUGC seed sequence (p-value 10^-14), with similar families seen in early embryonic development in zebrafish and Xenopus. The reverse complement of the seed sequence is over-represented in miRNAs downregulated in ESCs.

Detecting pathways that are co-regulated by miRNA (group of miRNAs and targets, with targets forming a connected component that have similar expression in similar cell types and are regulated by the same miRNA). Found 57 such modules, e.g. the mir-16/17 module, which seems to regulate cell cycle progression. Additional example for miRNA associated with neurodegenerative disorders.
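
A seed enrichment of this kind is typically tested with a hypergeometric tail probability. The sketch below assumes the seed is nucleotides 2-7 of the mature miRNA and uses made-up toy sequences; the talk did not say which test produced the reported p-value.

```python
from math import comb

def seed(mirna, start=1, length=6):
    """Seed region, taken here as nucleotides 2-7 (an assumption)."""
    return mirna[start:start + length]

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n miRNAs from N, of which K carry the motif."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)

def seed_enrichment(all_mirnas, upregulated, motif="AAGUGC"):
    """Count motif-carrying miRNAs in the upregulated set and test enrichment."""
    N = len(all_mirnas)
    K = sum(seed(s) == motif for s in all_mirnas.values())
    n = len(upregulated)
    k = sum(seed(all_mirnas[name]) == motif for name in upregulated)
    return k, hypergeom_sf(k, N, K, n)

# Toy data: synthetic sequences, not real miRNAs.
toy = {f"toy-seed-{i}": "U" + "AAGUGC" + "ACGU" * 3 for i in range(3)}
toy.update({f"toy-other-{i}": "U" + "CCCCCC" + "ACGU" * 3 for i in range(7)})
k, p = seed_enrichment(toy, ["toy-seed-0", "toy-seed-1", "toy-seed-2"])
```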


Rui Jiang (MOE Key Lab) – Network modeling of human interactome and phenome

Prioritization of disease genes: Guilt by association learning methods — gene more likely to be causal if it shares properties with disease-associated genes. E.g., use PPI networks and examine distance or other graph properties for candidate genes with regards to known disease genes. Requires annotated and relevant disease (seed) genes, limiting the power of these methods. Scope limited to diseases for which we know causal genes.
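
As a minimal sketch of the guilt-by-association idea: rank candidates by their mean shortest-path distance to the seed genes in the PPI network. The gene names, the toy network and the choice of mean distance as the score are all illustrative assumptions.

```python
from collections import deque

def make_graph(edges):
    """Undirected adjacency sets from a list of (a, b) interactions."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def distances_from(graph, source):
    """Breadth-first shortest-path lengths from one gene to all reachable genes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def rank_candidates(graph, seeds, candidates):
    """Candidates closest (on average) to the known disease genes come first."""
    seed_dist = {s: distances_from(graph, s) for s in seeds}

    def score(gene):
        ds = [seed_dist[s].get(gene, float("inf")) for s in seeds]
        return sum(ds) / len(ds)

    return sorted(candidates, key=score)

# Toy chain G1 - G2 - G3 - G4 - G5 with G1 as the only known disease gene.
ppi = make_graph([("G1", "G2"), ("G2", "G3"), ("G3", "G4"), ("G4", "G5")])
ranking = rank_candidates(ppi, seeds=["G1"], candidates=["G5", "G3", "G2"])
```

The limitation noted above is visible here: without any seed genes there is nothing to measure distance to.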

Multi-layered network model that includes diseases, clinical traits, molecules and gene variation. Text mining of the human ‘phenome’ provides a similarity measure of all human diseases.

Linear regression model uses the gene proximity to define the disease similarity. CIPHER: calculate concordance score of gene to a phenotype of interest, used to rank multiple candidate genes
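
One plausible reading of the concordance score, sketched under my own assumptions rather than taken from the talk: correlate the query phenotype's similarity to every disease with the candidate gene's network closeness to each disease's known genes. The exp(-d^2) closeness kernel, the data layout and all names below are illustrative.

```python
from math import exp, sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def closeness(gene, known_genes, dist):
    """Network closeness of a gene to one disease (assumed Gaussian kernel)."""
    return sum(exp(-dist[gene][k] ** 2) for k in known_genes if k in dist[gene])

def concordance(gene, similarity_to_query, disease_genes, dist):
    """Correlation of phenotype similarity with gene closeness across diseases."""
    diseases = sorted(disease_genes)
    sims = [similarity_to_query[d] for d in diseases]
    clos = [closeness(gene, disease_genes[d], dist) for d in diseases]
    return pearson(sims, clos)

# Toy setup: three diseases, one known gene each, two candidates X and Y.
similarity = {"D1": 0.9, "D2": 0.5, "D3": 0.1}     # similarity to the query phenotype
known = {"D1": ["a"], "D2": ["b"], "D3": ["c"]}
path_len = {"X": {"a": 1, "b": 2, "c": 3},          # X sits near genes of similar diseases
            "Y": {"a": 3, "b": 2, "c": 1}}          # Y sits near genes of dissimilar ones
scores = {g: concordance(g, similarity, known, path_len) for g in ("X", "Y")}
ranked = sorted(scores, key=scores.get, reverse=True)
```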

Evaluated using three different screens, method robust to noise in the input data (?). Case study using breast cancer data set to evaluate 22 OMIM candidate genes associated with breast cancer, 16 of which are in HPRD. Ranked high in a genome-wide scan (details got skipped too quickly to capture).

Used to generate a genetic landscape of human diseases; similar diseases tend to cluster together. Standard network alignment approaches (NetBlast) can be used to align the phenome and interactome to identify ‘bi-modules’, all of them enriched in a given disease category.


Keynote: Trey Ideker (UCSD) – New Challenges and Opportunities in Network Biology

All coverage over at Friendfeed

Oliver Hofmann

---

ISMB/ECCB 2009 Day 1

5 days ago

Rod Nibbe (Case Western) – Proteomics first approach for discovering sub-network targets in cancer

Move beyond single targets and aim to identify concerted effects; combinations or subnetworks of proteins characterizing the cancer. Understanding the pathophysiology of a late-stage colon cancer phenotype, not so much as a classifier but as a means to identify novel therapeutic targets.

Quantitative subnetworks based on (legacy) PPI data with additional microarray data. Took tissue biopsies from a large patient cohort, seeded the search with identified proteomics targets in large curated PPI databases and scored the resulting subnetworks.

Top-down proteomics approach: paired normal / tumour biopsies from 12 patients. Differential image analysis to identify significantly changing proteins, MS/MS of excised spots results in lists of significant proteins associated with the phenotype. (As few as 6 patients seem to be enough to capture the majority of proteins). Highly significant both due to stringency in image analysis and database search.

Question then is: do the targets reside in interaction networks (using MetaCore). Scored significant networks, pruned to core network members, expanded by one step for functional inference.

[Looks like a fairly standard MetaCore Meta-analysis to me. Not quite sure how they scored the subnetworks with the array data, need to check the paper.]

Automated network scoring: initial small set of target proteins got expanded drastically due to MetaCore network analysis, additional information is required. Binned expression information added to the graphs and scored using mutual information. Null hypothesis is that random network activity does not discriminate between the normal and disease state.
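
A rough sketch of what such a mutual information score could look like. The details are my assumptions, not the authors' pipeline: subnetwork activity is taken as the mean expression of its genes, binned into equal-width bins, and the null distribution comes from random gene sets of the same size.

```python
import random
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete label sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def binned(values, n_bins=3):
    """Equal-width binning of continuous activity values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    return [min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1) for v in values]

def activity(expr, genes):
    """Mean expression of the subnetwork genes in each sample."""
    n_samples = len(next(iter(expr.values())))
    return [sum(expr[g][s] for g in genes) / len(genes) for s in range(n_samples)]

def subnetwork_score(expr, genes, labels, n_bins=3):
    return mutual_information(binned(activity(expr, genes), n_bins), labels)

def permutation_p(expr, genes, labels, n_perm=200, seed=0):
    """Empirical p-value against random gene sets of the same size."""
    rng = random.Random(seed)
    observed = subnetwork_score(expr, genes, labels)
    pool = sorted(expr)
    hits = sum(subnetwork_score(expr, rng.sample(pool, len(genes)), labels) >= observed
               for _ in range(n_perm))
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: three normal and three tumour samples, hypothetical genes g1..g4.
labels = [0, 0, 0, 1, 1, 1]
expr = {"g1": [0, 0, 0, 5, 5, 5], "g2": [0, 0, 0, 5, 5, 5],
        "g3": [1, 2, 1, 2, 1, 2], "g4": [2, 1, 2, 1, 2, 1]}
score, p_value = permutation_p(expr, ["g1", "g2"], labels)
```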

Significant signatures (four out of 13 subnetworks): none of the original large networks was significant using MI. Exhaustive search of subnetworks results in four subnetworks significant with regards to two different background models. Subnetworks seem to be making biological / clinical sense, exploring the role of some proteins with perturbation experiments in cell lines.


Adam Smith (Wisconsin) – Clustered Alignments of Gene-Expression Time Series Data

Gene expression levels over time in different ways for different treatments. Idea is to perform similarity searches (BLAST for time series profiles). Usually restricted to discrete measurements of a continuous attribute, requires reconstruction or interpolation. Time warping/alignment to maximize the similarity of the compared points helps.

Warpspace diagrams are an alternative way of showing alignment between two time series. Global alignments mean beginning and end correspond, local or shorted alignments are the focus here and are required if one time series is more ‘evolved’ than the other.

  • SCOW: efficient method for sparse time series
  • Clustered alignments: subsets of genes sharing time series alignments

Known algorithm (1978) for dynamic time warping (DTW), minimize sum of Euclidean distances. Parametric time warping (2005, Eilers, PTW) fits alignment function from given family. More limited in expressiveness than DTW.
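
For reference, a minimal sketch of classic DTW: a dynamic program over all monotone alignments that minimizes the summed Euclidean distance between matched time points. This is the textbook global version, not SCOW or the shorted alignments from the talk; each time point is a tuple of expression values.

```python
from math import dist, inf

def dtw(a, b):
    """Return (total cost, alignment path) for two series of expression vectors."""
    n, m = len(a), len(b)
    # D[i][j] = best cost of aligning the first i points of a with the first j of b.
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])

    # Backtrack from the end to recover which time points were matched.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min(((i - 1, j - 1), (i - 1, j), (i, j - 1)),
                   key=lambda p: D[p[0]][p[1]])
    return D[n][m], path[::-1]

# Toy example: the second series lingers at the first value before catching up.
fast = [(0.0,), (1.0,), (2.0,)]
slow = [(0.0,), (0.0,), (1.0,), (2.0,)]
cost, path = dtw(fast, slow)
```

The path lists index pairs (time point in the first series, time point in the second), which is the information a warpspace diagram plots.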

Segment-based warping splits warps into individually scored segments, sits in a happy intermediate between DTW, PTW, but very slow (n^5 complexity). Correlation-optimized warping (COW, Nielsen 1998) looks for good ‘knots’ (one dimensional). SCOW searches in each dimension independently until convergence.

Evaluation with EDGE toxicology database, 11 treatments, 6-96 hours, 3 observed times for each data point, well-defined zero time point, 216 genes. Take a gene and time series subset (removing time points), distort time series, then try to find the original entry in the test data set.

Cluster gene time series to get a regularization effect; algorithm based on k-means. Each cluster defined by an alignment, initial clusters calculated by a greedy algorithm, assign genes to cluster, re-calculate alignment for each cluster, iterate until convergence.

Does clustering help? Test with Mop3 knockout (circadian cycle gene); five clusters with exemplar genes identifies genes with a strong phase shift. Gene activity ‘sped up’ in the knockout.


Duygu Ucar (Ohio State) – Predicting Functionality of Protein-DNA Interactions by Integrating Diverse Evidence

Detecting interaction events between TFs and targets is crucial. Binding can be captured, but does not need to be functional. Semantics of interactions is important, but difficult to characterize (context dependency of TF binding event); half of the bindings without any effect on the gene expression. Binding changes also depending on external conditions / stimuli.

Prediction by integrating complementary datasets. Estimating TF binding and gene expression response based on three data sets (ChIP-chip, PSSM and nucleosome occupancy). Binding considered functional if it changes gene expression (determined from microarray data)

  1. Binding estimate: yeast ChIP-chip data, p-values indicate binding strength. -800/+200 bp PSSM binding scores, nucleosome depletion around active promoters combined into a probabilistic Bayesian model. Integrated model outperforms the three individual data sets in a five-fold cross validation


  2. Used two data sets to correlate changes in binding events with changes in gene expression (YPD regular growth condition and stress). Distinguish between putatively functional and non-functional binding events
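
The evidence integration in step 1 could look roughly like the following. This is a naive Bayes sketch that treats the three sources as conditionally independent, which is my simplifying assumption; the talk did not describe the actual model structure, and the likelihood ratios below are invented numbers.

```python
from math import exp, log

def posterior_binding(prior, likelihood_ratios):
    """Combine independent evidence sources into P(bound | evidence).

    Each likelihood ratio is P(evidence | bound) / P(evidence | not bound),
    e.g. one each for ChIP-chip, PSSM score and nucleosome occupancy.
    """
    log_odds = log(prior / (1.0 - prior)) + sum(log(lr) for lr in likelihood_ratios)
    return 1.0 / (1.0 + exp(-log_odds))

def is_functional(p_bound, expression_changed, threshold=0.5):
    """Call a binding event functional if it is likely real and expression responds."""
    return p_bound >= threshold and expression_changed

# Hypothetical promoter: strong ChIP signal, decent motif, nucleosome-depleted.
p = posterior_binding(prior=0.1, likelihood_ratios=[8.0, 3.0, 2.0])
functional = is_functional(p, expression_changed=True)
```

Weak evidence from any single source can still add up to a confident call, which is one way the integrated model can beat each data set alone.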

Functional binding rate can change under different stress conditions (GCN4 as an example). Function changes based on distance to promoter, orientation, presence/absence of co-factors. Multi-variate random forest based feature selection to identify important factors for each condition-TF pair as well as to identify TF-TF pairs. Looks very interesting!


Saharon Rosset (Tel Aviv) – Grouped Graphical Granger Modeling for Gene Expression Regulatory Networks Discovery

Temporal causal modeling to uncover temporal relationships (time lags) between gene expression events and identify the causal relationships (directionality). Granger causality from Clive Granger (economics, Nobel prize). X ‘Granger causes’ Y if time delay in X is important to explain Y, usually via a linear regression model

Combine with graphical models to provide a methodology for causal modeling of temporal data; select the variables significantly affecting Y given time lag d; does one time series cause another as a whole.

Followed by an overview of LASSO, Adaptive LASSO and their contribution (Group LASSO). Got to go back to the paper to really understand the underlying difference; if I got this right it is about how to handle the lag parameter. Different lag settings give significantly different results, but consistency increases with the number of time points.
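
The basic pairwise test can be sketched as two nested regressions: predict Y from its own past, then add the past of X and check with an F statistic whether the fit improves. This is the plain textbook version, not the grouped LASSO variant from the talk; the simulated series are made up.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def rss(X, y):
    """Residual sum of squares of the least-squares fit y ~ X."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def granger_f(x, y, lag=1):
    """F statistic for 'x Granger-causes y' using lags 1..lag."""
    restricted, full, target = [], [], []
    for t in range(lag, len(y)):
        past_y = [y[t - k] for k in range(1, lag + 1)]
        past_x = [x[t - k] for k in range(1, lag + 1)]
        restricted.append([1.0] + past_y)       # y explained by its own past
        full.append([1.0] + past_y + past_x)    # ... plus the past of x
        target.append(y[t])
    rss_r, rss_f = rss(restricted, target), rss(full, target)
    dof = len(target) - (2 * lag + 1)
    return ((rss_r - rss_f) / lag) / (rss_f / dof)

# Simulated series: y follows x with a delay of one time step.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]
f_xy, f_yx = granger_f(x, y), granger_f(y, x)
```

A large F for one direction and a small one for the reverse is what gives the relationship its directionality.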

Bootstrap sampling for small networks (nine genes) results in an average confidence of 80%, five of the top six links are either in BioGrid or in the literature.

Keynote: Tomaso Poggio – Computational Neuroscience: Models of the Visual System

Computational neuroscience with respect to (machine) learning and (computer) vision is starting to provide new ideas and approaches. Better connections between the biology / data and modeling. CBCL face detection research started about 15 years ago, now in most modern cameras. Ten years from person / motion detection system to deployment in expensive cars.. but all based on pure engineering, no input from neuroscience.

Animals (and humans) can learn from small number of examples. Supervised learning algorithms (regularization techniques) from classical learning theory along with kernel machines (SVMs, radial basis functions) require higher number of samples. Kernel machines correspond to shallow networks with small number of layers and not ideal for the high sample complexity. Hierarchical organization might be able to address this (along with the poverty of stimulus problem).

Visual recognition is a difficult learning problem, e.g., is there an animal in a photo? The human brain has roughly as many neurons as 1 million fly brains. Model the ventral stream with quantitative models with feedforward connections (no backprojections as they won’t help with the immediate object detection), including millions of model neurons. Tuning of more and more complex layers learned in an unsupervised way from thousands of images. Additional training with supervised classifiers (themselves trained on small subsets of labeled images, animal / no animal present). These hierarchical feedforward models are consistent with neural data. High correlation in recognition efficiency between model and human results. Model superior to the purely engineered solution, but it is unclear why the model works, which seems to be a rather general problem of these models.

Deep learning / deep belief networks to preprocess signals to reduce sampling complexity for classifiers trained with labeled examples. Layers reduce the number of samples required for a given accuracy.

Additional notes (pretty much a full transcript) in the friendfeed thread.


Other notes

  • Excellent keynote, well covered by the bloggers already
  • ISA-Tab/Tools/Etc tech demo. Looks fairly complete by now, see the friendfeed discussion
Oliver Hofmann

---

Short-seq SIG: Metagenomics, assemblies and statistics

6 days ago

Bas Dutilh (UMC St Radboud) Increasing the coverage of a metapopulation consensus genome by iterative reading and mapping

Wild type as the average of the species. Aim is to obtain a consensus genome sequence of the community; as a related reference genome is available, chose 32bp Solexa reads for cost reasons. 65% of genome covered, lots of gaps after Maq mapping. Only 9% of the reads map at all to the reference genome.

First goal: lower the mapping stringency by mapping every read to its optimal position, then filter out the ones that are highly unlikely to be from this species. BlastN with high eValue cutoff, alignment length cutoff at 20nt. Assembly as a better representation of the community, so a re-mapping of reads against the assembly should provide a better wild-type representation. Iterative mapping approach increases the coverage and reduces the zero-coverage regions.

Confirm convergence to wild type by contamination with 50% of reads from a different species; hardly any of the contaminants get incorporated during the iterative assembly. Proteomic validation using MS/MS to identify which iteration explains most of the peptide peaks.

Reference based assembly cannot incorporate large genomic changes. Hybrid approach (de novo assembly, mapping de novo contigs to reference, remap to chimera assembly) might address this.


Juliane Klein (Tuebingen) on LOCAS: a new low coverage assembler for short reads

Used in the 1001 genomes project, re-sequencing effort of A. thaliana. Define subregions of overlapping blocks after initial mapping to reference genome to allow local improvements, handling of small InDels and integrate unmapped reads (‘left overs’ and unmapped mate reads).

LOCAS assigns unmapped mate reads to the same block as their mapped paired read prior to subregion merging; left overs mapped after subregion arrangement (?), potentially bridging polymorphic regions.

More details on the graph-based alignment approach in the paper.


Adam Kowalczyk (National ICT Australia): Poisson model significance for short-read concentrations

Exploit the digital nature of the data to improve statistical models. Skipped this talk, sorry to say.


Su Yeon Kim (Berkeley) on the design of association studies with pooled next-gen sequencing data

Identify major/minor allele (here: minor for the rarer allele) associated with complex diseases using large-scale genotyping platforms in a large number of case/controls (Wellcome Trust: total of 17,000 samples); still known variants only explain at most 5% of the heritable variation of the disease (due to interaction, epigenetics, structural variation, other factors). Focus here is on rare SNPs.

Aim to identify rare alleles with a minor allele frequency (MAF) less than 5%, which requires a lot of individuals. Cost-effective strategies help to reduce the expenses involved. Examples include a focused approach on target regions of interest (all human exons). Alternative approach is to use pooled samples as we only need the MAF across the population.

Adds another source of variation (pooling variance) on top of sequencing errors, mapping problems. Examined a pool of five DNA samples in an empirical study. Based on unique mutations in the five samples pooling variance can be estimated and turns out to be relatively low.
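
To make the pooled setting concrete, a small sketch of estimating the MAF from pooled read counts. The symmetric error model (a read shows the wrong allele with probability e, so the observed fraction is f = p(1-e) + (1-p)e) is a simplifying assumption of mine, not the model from the talk, and it ignores pooling variance.

```python
def expected_fraction(minor_allele_counts):
    """Expected minor-allele fraction in an equal-contribution pool of diploids.

    minor_allele_counts holds 0, 1 or 2 minor alleles per individual.
    """
    return sum(minor_allele_counts) / (2 * len(minor_allele_counts))

def pooled_maf(minor_reads, total_reads, error_rate=0.0):
    """Invert f = p(1 - e) + (1 - p)e to estimate p, clamped to [0, 1]."""
    observed = minor_reads / total_reads
    p = (observed - error_rate) / (1.0 - 2.0 * error_rate)
    return min(max(p, 0.0), 1.0)

# Pool of five individuals with a single heterozygote: 1 of 10 chromosomes.
truth = expected_fraction([0, 0, 1, 0, 0])
# Hypothetical read counts at 1000x depth with a 1% sequencing error rate.
estimate = pooled_maf(minor_reads=108, total_reads=1000, error_rate=0.01)
```

Unequal DNA contributions from the pooled individuals would shift the observed fraction away from `truth`, which is the pooling variance estimated in the empirical study.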

Also focused on just capturing exons, which exhibits additional variation. Optimal design using a two-stage design with re-sequencing in the first stage. Only select top SNPs out of the initial sequence data. Developed a likelihood ratio test (LRT) statistic taking into account uncertainty in genotypes, the pooled structure and sequencing errors (a chi-square statistic is not appropriate for this). Details of the formula in the publication.

Extensive simulation studies with different case/control numbers, pool size and sequencing error rates. Other variables include sequencing depth, pooling and exon variance. Not accounting for population structure at this stage. More individuals with lower depth more cost effective than fewer samples at high coverage.

At fixed sample size balance coverage and pool size; high depth at larger pool sizes beats low depth with smaller pools.

Oliver Hofmann

---
