--FAQ--


Q: How does this denoising procedure differ from PyroNoise?

Q: What is the expected run-time?

Q: Can I denoise Titanium data

Q: How can I speed up the computation?

Q: Why are there so few sequences in my output file after denoising? Did something went wrong with my sequencing run?

Q: So where are all the sequences then?

Q: Can I cluster at different sequence/flowgram similarity thresholds?

Q: Denoising on the clusters "hangs" after a while. What is going on?

Q: How and why can I run the preprocessing step separately?

Q: How can I use Qiime's workflow scripts with denoised data?

Q: What about different next-gen sequencing platforms?




Q: How does this denoising procedure differ from PyroNoise?

A: PyroNoise uses an expectation maximization (EM) algorithm to figure out the most likely sequence for every read. We, instead, use a greedy scheme that can be seen as an approximation to PyroNoise. According to several test data sets, our approximation gives very similar results in a fraction of the time.


Q: What is the expected run-time?

A: The whole heuristic for our method depends on the actual species distribution in your samples.
An ideal data set has few species and a very skewed abundance distribution with a few, very abundant species.
With more species and a flatter abundance distribution run time increases. You can get a rough estimate of the run time after the preprocessing step by looking at the number of reads printed in the log file in verbose mode. Very, very roughly, compute time increases quadratically with the number of reads after preprocessing:

...
Prefix matching: removed 242038 out of 339647 seqs
Remaining number of sequences: 97609
...

If the number of remaining sequences is smaller than 50.000, you can expect <24 hours on 20 cpus.
With 100k seqs you would need 80 cpus to expect it to finish within a day.

Here are some guidelines from runs with actual data:

- partial GSFLX run with 50.000 reads: ~ 1 hour on a single CPU

- Full GSFLX run (~400.000 reads):   6-24 hours on 24 CPUs

- 1/2 Titanium run (450k reads):   35 hours on 200 CPUs

Titanium data takes longer for two reasons:
 a) Reads are longer, meaning longer alignment times
 b) We observed a higher variability in the Titanium reads, leading to a less efficient greedy clustering. We currently investigate this issue.


Q: Can I denoise Titanium data

A: Yes. The algorithm can process Titanium data and we have done it several times. As of version 0.9 we ship an error profile for the titanium platform with tha package. Use the switch --titanium to  enable the new profile. Be aware that Titanium still takes considerably longer than FLX.

Q: How can I speed up the computation?

A:
1. Use more CPUs if available.

2. Stop clustering early.
   Clustering phase II processes clusters in decreasing order of their size after cluster phase I. As default, the procedure stops with the first singleton cluster being considered as cluster centroid. Setting -b 3 would stop the clustering with clusters of size 3. Note that setting the -b parameter does not hinder these cluster to be dragged in by another, larger cluster either in pahse II or phase III. It just limits their role as cluster centroid.

3. Split your data in smaller pieces. 
   For very large data sets(>1 FLX plate), this is the recommended way to go. While we have observed that splitting into too small pieces (e.g. per sample with 5k sequences/sample) might render the denoising less effective, we expect very little difference when denoising is performed on larger chunks of data (100k+ reads). We recommend pooling similar samples, e.g. time series samples from the same person, but encourage to separate samples from different habitats with expected very different communities.

4. As a rather desperate measure for people who have to limit the compute time we provide a new flag in version 0.9 that controls the maximum number of rounds that the greedy clustering should run for. Note that the lower this number is, the worse the final clustering result can be. 

Q: Why are there so few sequences in my output file after denoising? Did something went wrong with my sequencing run?

A: No, this is expected. The denoising procedure (and this also holds for Chris Quince's Pyronoise) technically do not remove any reads from the input set. This is the task of the initial quality filtering, which we suggest to do using Qiime's split_libraries.py. The denoising is basically a clustering approach on the flowgram level, i.e. all reads that look similar enough on the flowgram level are clustered and only the centroid of each cluster is reported in the output file (either in centroids.fasta if the cluster has more than one member or otherwise in singletons.fasta). You can think of the centroids as OTUs on the flowgram level. Since flowgram similarity does not correlate perfectly with sequence-similarity, we usually don't call them OTUs, but only after an extra OTU picking step with, say, cd-hit or uclust on the denoised sequences.


Q: So where are all the sequences then?

A: If you look at the file denoiser_mapping.txt, e.g. like this:
$ wc denoiser_mapping.txt
you should see that the number in the middle of the output (i.e. the number of words) is about the number of sequences in your input set. (Sometimes, the denoiser discards a few additional reads due to quality issues that were not captured by split_libraries.py). All reads that are in this mapping file can and will be used e.g. in the downstream Qiime analysis. The first number in the wc output gives the number of lines on the files, which corresponds to the number of clusters after denoising.



Q:Can I cluster at different sequence/flowgram similarity thresholds?

A: Basically, Yes. The default clustering parameters are set and tested to work well at 0.97% sequence similarity. If you want to cluster at, say, 0.95% you have to increase both cut-offs and decrease the percent_ID:

  --low_cut-off=LOW_CUTOFF
                        low clustering threshold for phase II [default: 3.75]
  --high_cut-off=HIGH_CUTOFF
                        high clustering threshold for phase III [default: 4.5]
  --percent_id=PERCENT_ID
                        sequence similarity clustering threshold [default: 0.97]


The low_cut_off and the percent_id are used for clustering in the second, greedy clustering step.
The high_cut_off is used in the third clustering step, where unclustered reads are mapped according to their best match to any of the clusters of phase II. For good values for the thresholds, we refer to the plot S2 in the supplementary material of the denoiser paper (Reeder and Knight, Nature Methods 2010).



Q: Denoising on the clusters "hangs" after a while. What is going on?

If not provided with already preprocessed data via the -p option, the denoiser.py script automatically starts the preprocessing phase (cluster phase I in the paper) on one CPU on the cluster. This preprocessing takes from a few minutes for partial GS FLX runs to an hour or more for large Titanium runs. After this step, the parallel cluster phase II starts. First, all requested workers are started one-by-one. Depending on your queueing system and the number of jobs this might take from few seconds to several minutes. If one or more of the jobs are not started by the queueing system, all submitted jobs will block and wait. This is most likely the state your process is in if nothing seems to happen. We know this is not optimally and already thinking about a better solution for the future. In the meantime, make sure you only request as many jobs as you can safely run in your queue and monitor (qstat) the startup phase to see if all jobs are properly scheduled. If you found out that you requested to many CPUs and need to restart, simply kill the master process (denoiser.py) and it should bring down all but the last submitted jobs. The last job might need to be killed by hand.
Once all workers are succesfully started, you can monitor the progress by following the log file in verbose mode (toggeld by the -v option):
	
	$ tail -f denoiser.log



Q: How and why can I run the preprocessing step separately?

A:If you call denoiser.py without the -p option (or via its wrapper denoise_wrapper.py in QIIME) the preprocessing step (cluster phase I) is implicitly called. You can explicitly run the preprocessing step via the script preprocess.py and provide the output directory to denoiser.py using the -p option. Reasons for running the steps separately could be:

	- run preprocess on a very fast single CPU machine, then transfer the data to a slower multi-cpu cluster

	- You want to check the cluster statistics of phase I first, before deciding of wether the data needs to be split or how many CPUs

	- something went wrong with the compute cluster in phase II and the program aborted. The results of preprocessing will be in the output dir and can be re-used if you restart the process.



Q: How can I use Qiime's workflow scripts with denoised data?

Small data sets should be processed automatically with QIIME's pick_otus_through_otu_table.py

If there are several data sets that need to be combined, follow the guidelines at:
http://qiime.sourceforge.net/tutorials/denoising_454_data.html

If you are comfortable editing shell scripts and moving files around you can try the following steps to use Qiime's workflow script after manual denoising. (Warning: the following commands are just guidelines, you need to adapt them)


1. Call pick_otus_through_otu_table using your custom options and the -w option and generate a shell script:

	$ pick_otus_through_otu_table.py  -w -s seqs.fna -f  seqs.sff.txt ADD_OPTIONS_HERE  > pick_otus_commands.sh

The file seqs.sff.txt is irrelevant here, because it won't be used. However, seqs.fna needs to contain all reads you want to use. If you combine several data sets, cat their respectives fasta files into one file.


2. Comment out (put a # symbols in front of) the line that does the denoising in the script, should be the second line.


3. move/copy the (possibly concatenated)  denoiser output fasta file and the (possibly concatenated) denoiser_mapping.txt file to the directory mentioned in the shell script you just created. There should be a denoised_seqs folder in the output folder specified in the pick_otus_through_otu_table.py script. If not just create it. After copying the file it should look like this

$ ls denoised_seqs/
denoised_seqs.fasta
denoiser_mapping.txt

4. Then, you can simply run the shell script and it will do all the steps as if you would have called the pick_otus... script.

$ bash pick_otus_commands.sh
 
