change to have hts wide depth filtering #2016
Conversation
|
It's working on some real world data, but often significantly over-stepping the depth. Eg overshoot.sam.gz from #1084 I then produced some synthetic data and can verify this. This basically lets all reads through the cache, and it's the same depth as unfiltered data. Compare to the mpileup equivalent: |
|
Working on the assumption (mostly valid) that overshoot.sam.gz alignments are all just 75M, then we can produce a table of start to end regions spanned by alignments and uniq -c then to get counts of each. It's quite small: That's frequency, start, end. Let us assume we want to selection region 1:104524850-104524850 with a maximum depth of 100, then we get this: So the depth is 3508 (of maximum 4592). The problem is bar the first dozen or so alignments, all of these span position 104524850. The first few lines being: By this time we're already over 300 fold depth for position 104524850, and as we know the end coordinate we could in theory stop here, but it wouldn't be a representative sub-selection as it's only records ending very close to our requested location. That's not the preferred output. I'm guessing it's continuing to add data until start coord is outside of the region requested, and each new alignment is extending the right-most base a little bit more and we don't yet know whether by not adding it we'd get a "depth crash". However part of the reason for using a history / cache of records is that we can "undo" the additions and smooth out such peaks. This doesn't appear to have happened. There are 59 unique sets of start..end coordinates (assuming 75M) that span region 1:104524850-104524850, so obviously just picking 2 from each would be slightly over the requested 100 records. This demonstrates that, manually at least, there clearly are reads which can be selected to give a better profile. Arguably for larger regions, we may wish to reflect the ups and downs that appear to a certain degree rather than be completely uniform, but we don't want the huge spikes. I think we need to have a better selection process than a greedy left-to-right scan. |
|
Another easier to test example of the problem. We have 10 reads starting at every reference position, of length 100. So ignoring the first 100 and last 100 bps of the contig, everything is at depth 1000. Asking it to filter to depth 100 does nothing at all, because every new contig coordinate hasn't yet been covered so we add another 10 records to it. We need to delay this decision. See the md5sums in these two commands, which only differ by The cause of this is due to a strict left-to-right greedy algorithm. As every read is the same length, we always need to add new records because they always extend the right-most base covered by a little bit. As we never have more than 100 reads ending at a single base position, we never filter out with a An alternative is to not add anything covering position X until we have a record starting at X+1, and then look back through our history to see what records to add. However that may lead to reads starting 1-10 and ending 101-110, and then starting 101-110 and ending 201-210, etc. It's a bit "blocky" and many positions (eg 105) are covered solely by the extreme starts or extreme ends of reads. Ideally we'd have every coordinate covered a broad mix of locations with the alignments, like a Perhaps some sort of reproducible stochastic approach where we add records over the current maximum depth with a probability inversely based on how over-depth we are, would give a smoother profile that still partially reflects ups and down trends. but with the ability to back fill from our history cache when this leaves us at lower depth than we asked for. It would overshoot, but only by a fixed amount (depending on our probabilities), and still give us data covering all coordinates from bases in the middle of reads rather than some only having extreme start/end bases. |
jkbonfield
left a comment
There was a problem hiding this comment.
This is significantly longer than I was expecting for the functionality, but I think the complexity probably comes from the desire to do read pairs properly. The old depth filtering code didn't care about read-pairs at all, but that was one of many problems that needed fixing. This is a hard nut to crack, but worth trying!
I've only done basic scans through the code so far and looked at the header files. I then got distracted by doing usability checks and that took up my main time. The core algorithm needs fixing first to filter better. I'm playing with some ideas.
Anyway, here is a fast pass of issues. The review is still in progress however.
| bam1_t *r; | ||
| struct ce_t *next, *prev; | ||
| uint64_t len; | ||
| kstring_t log; |
There was a problem hiding this comment.
kstring_t log isn't needed unless CACHE_DBG_LOG is enabled, and likewise the ks_initialise/ks_free calls in sam_cache.c.
| uint64_t ord; //ordinal | ||
| bam1_t *r; | ||
| struct ce_t *next, *prev; | ||
| uint64_t len; |
There was a problem hiding this comment.
Len comes from bam_cigar2rlen which returns hts_pos_t. It's basically the same type, but we probably ought to use that type here instead of uint64_t.
| hfile_libcurl.o hfile_libcurl.pico: hfile_libcurl.c config.h $(hfile_internal_h) $(htslib_hts_h) $(htslib_hts_alloc_h) $(htslib_kstring_h) $(htslib_khash_h) | ||
| hfile_s3.o hfile_s3.pico: hfile_s3.c config.h $(hfile_internal_h) $(htslib_hts_h) $(htslib_hts_alloc_h) $(htslib_kstring_h) $(hts_time_funcs_h) | ||
| hts.o hts.pico: hts.c config.h os/lzma_stub.h $(htslib_hts_h) $(htslib_bgzf_h) $(cram_h) $(htslib_hfile_h) $(htslib_hts_endian_h) version.h config_vars.h $(hts_internal_h) $(hfile_internal_h) $(sam_internal_h) $(htslib_hts_alloc_h) $(htslib_hts_expr_h) $(htslib_hts_os_h) $(htslib_khash_h) $(htslib_kseq_h) $(htslib_ksort_h) $(htslib_tbx_h) $(htscodecs_htscodecs_h) | ||
| hts.o hts.pico: hts.c config.h os/lzma_stub.h $(htslib_hts_h) $(htslib_bgzf_h) $(cram_h) $(htslib_hfile_h) $(htslib_hts_endian_h) version.h config_vars.h $(hts_internal_h) $(hfile_internal_h) $(sam_internal_h) $(htslib_hts_alloc_h) $(htslib_hts_expr_h) $(htslib_hts_os_h) $(htslib_khash_h) $(htslib_kseq_h) $(htslib_ksort_h) $(htslib_tbx_h) $(htscodecs_htscodecs_h) $(sam_cache_h) |
There was a problem hiding this comment.
We also need a makefile rule for sam_cache.o
| ce_t *head_ins, *tail_ins; //inserted alignments | ||
| uint64_t selcnt, nselcnt, inscnt,rcnt; | ||
| uint64_t ord; //last ordinal | ||
| int trgr; //sts: 0 not ready 1 caching 2 wnd full 3 ready 4 end |
There was a problem hiding this comment.
This should be an enum, which will help to make all the assignments and checks of trgr in sam_cache.c more self-documenting
| const char *fnidx; | ||
| struct sam_hdr_t *bam_header; | ||
| struct hts_filter_t *filter; | ||
| void *c; //for cache |
There was a problem hiding this comment.
Please pick a better name than c. Maybe cache?
It also shouldn't be void *. It should be an opaque type name, declared in a public header somewhere (or here), with the actual non-opaque version in the internal sam_cache.h. This then avoids casts elsewhere and improves code documentation / readability.
There was a problem hiding this comment.
I note this is also changing the size of the struct and is therefore technically ABI breaking, however the filter element appeared in 1.12 and we didn't bump LIBHTS_SOVERSION` for that. I think the justification was probably that we have an explicit comment saying "This structure should be considered opaque by end users", which gives us some more leeway. (Albeit the size being known means we could declare an array of htsFile, but that wouldn't be possible if it was anonymous so should also come under the "opaque" declaration.)
| // wndsz = 350; | ||
| // dpth = dpth & 0x00FF; //upto 65535 | ||
| //todo check whether sorted by pos and setup only if so | ||
| if (dpth > 0) |
There was a problem hiding this comment.
A stylistic comment, but generally if we have inner braces used we prefer outer braces too to aid readability.
| } | ||
| if (c && ret >= -1) { | ||
| processcache_iter(c); | ||
| if (!getfromreadcache_iter(c, s, &iter->tid, &iter->curr_beg, &iter->curr_end)) { |
There was a problem hiding this comment.
Is this correct? iter claims:
* tid - Reference id of the target region
* beg - Start position of the target region
* end - End position of the target region
* curr_tid - Reference id of the current alignment
* curr_beg - Start position of the current alignment
* curr_end - End position of the current alignment
Here we're mixing current alignment coordinates with region coordinates by using tid from one and beg/end from the other. It may be correct and I haven't followed the logic, but it seems different to most of the other getfromreadcache_iter calls.
| if (!c) { //create cache | ||
| if (!(c = calloc(1, sizeof(rc_t)))) | ||
| return NULL; | ||
| fp->c = c; | ||
| } |
There was a problem hiding this comment.
Should this call setupcache instead? I'm not sure what the benefit is of allocating the structure memory without setting all the other parameters up.
| c->maxdpth = maxdpth; | ||
| c->w_st = c->w_en = -1; | ||
| c->tid = -2; //start | ||
| c->selpair = kh_init(kh_pair); |
There was a problem hiding this comment.
Missing errr check on kh_init, which does a malloc and can therefore fail.
| const hts_cram_idx_t *cidx = (const hts_cram_idx_t *) idx; | ||
| if (idx == NULL) | ||
| return hts_itr_query(NULL, tid, beg, end, sam_readrec_rest); | ||
| return hts_itr_usecache(hts_itr_query(NULL, tid, beg, end, sam_readrec_rest)); |
There was a problem hiding this comment.
I don't understand this logic.
hts_itr_usecache is in the hts.h:
static inline hts_itr_t* hts_itr_usecache(hts_itr_t *itr) { if (itr) itr->usecache = 1; return itr; }
It's static inline, so we don't have an extra function overhead. That's fine.
However it's basically saying if hts_itr_query worked and didn't return NULL, then set itr->usecache = 1. In that case the iter query function may as well just set it to 1 itself as it's permanently enabled.
Although why is it always on? Shouldn't it only be enabled if the user specified the command line option? Is it because we're passing in NULL so we don't have an index, let alone a file descriptor to check for the CLI option being enabled? (I confess the overloading in the indexing code scares and confuses me as to what type is what and when it exists! Hence why CRAM simply has the index attached to the cram_fd rather than as an additional type floating around.)
It seems to be used in getsamcache and if set (which it will be) then it enables calls to getfromreadcache_iter in hts_itr_next. This seems to imply caching is permanently enabled.

This implements a caching mechanism for hts wide depth filtering.
Reads are added to a cache and on tid change/eof/a window size, the
cached reads are processed to find which reads should be used.
Any read for which pair is present in the window will also be selected.
At end of processing, selected reads are sent to user and read continues
until next process event.
The data is expected to be sorted by position.
The cache is allocated first and increased as and when required. Reads have their
order of read set in them. They are kept in order of position and based on length
when position are same. Internally it is a linked list where reads are cached.
Selected reads are moved to a selected list and others to non-selected list.
Any read selected out of turn from non-selected list, due to pair processing, are
moved to an insert list. Post processing, reads are retrieved from selected and insert
list based on their ordinal. Once all selected are transferred, the read and cache cycle
continues.
The same works with iterators as well, with new hts_itr_usecache method.