hamsterdb Embedded Database 1.1.15
ham_runtime_statistics_globdata_t Struct Reference

#include <hamsterdb_stats.h>

Data Fields

ham_u32_t scan_count [HAM_FREELIST_SLOT_SPREAD]
ham_u32_t ok_scan_count [HAM_FREELIST_SLOT_SPREAD]
ham_u32_t scan_cost [HAM_FREELIST_SLOT_SPREAD]
ham_u32_t ok_scan_cost [HAM_FREELIST_SLOT_SPREAD]
ham_u32_t insert_count
ham_u32_t delete_count
ham_u32_t extend_count
ham_u32_t fail_count
ham_u32_t search_count
ham_u32_t insert_query_count
ham_u32_t erase_query_count
ham_u32_t query_count
ham_u32_t first_page_with_free_space [HAM_FREELIST_SLOT_SPREAD]
ham_u32_t rescale_monitor

Detailed Description

global freelist algorithm-specific run-time info, kept per cache

Definition at line 165 of file hamsterdb_stats.h.
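To make the role of the per-size-range arrays concrete, here is a minimal, self-contained sketch of how such counters could be updated when a freelist scan completes. It is an illustration only, not hamsterdb's implementation: the typedef, the HAM_FREELIST_SLOT_SPREAD value of 16, and record_scan() are assumptions made for the example.

    typedef unsigned int ham_u32_t;         /* 32-bit counter, as in hamsterdb     */
    #define HAM_FREELIST_SLOT_SPREAD 16     /* bucket count assumed for the sketch */

    typedef struct {
        ham_u32_t scan_count[HAM_FREELIST_SLOT_SPREAD];     /* scans per size range     */
        ham_u32_t ok_scan_count[HAM_FREELIST_SLOT_SPREAD];  /* successful scans         */
        ham_u32_t scan_cost[HAM_FREELIST_SLOT_SPREAD];      /* summed scan cost         */
        ham_u32_t ok_scan_cost[HAM_FREELIST_SLOT_SPREAD];   /* cost of successful scans */
        /* ... the remaining scalar counters elided ... */
        ham_u32_t rescale_monitor;                          /* overflow guard, see below */
    } globdata_sketch_t;

    /* Record one freelist scan: 'bucket' selects the size range,
     * 'cost' is the measured scan duration, 'success' flags a hit. */
    static void record_scan(globdata_sketch_t *g, int bucket,
                            ham_u32_t cost, int success)
    {
        g->scan_count[bucket]++;
        g->scan_cost[bucket] += cost;
        if (success) {
            g->ok_scan_count[bucket]++;
            g->ok_scan_cost[bucket] += cost;
        }
        g->rescale_monitor += cost;   /* every cost also feeds the overflow monitor */
    }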


Field Documentation

ham_u32_t ham_runtime_statistics_globdata_t::delete_count

Definition at line 177 of file hamsterdb_stats.h.

ham_u32_t ham_runtime_statistics_globdata_t::erase_query_count

Definition at line 183 of file hamsterdb_stats.h.

ham_u32_t ham_runtime_statistics_globdata_t::extend_count

Definition at line 178 of file hamsterdb_stats.h.

ham_u32_t ham_runtime_statistics_globdata_t::fail_count

Definition at line 179 of file hamsterdb_stats.h.

ham_u32_t ham_runtime_statistics_globdata_t::first_page_with_free_space[HAM_FREELIST_SLOT_SPREAD]

Definition at line 186 of file hamsterdb_stats.h.

ham_u32_t ham_runtime_statistics_globdata_t::insert_count

count the number of insert operations for this DB

Definition at line 176 of file hamsterdb_stats.h.

ham_u32_t ham_runtime_statistics_globdata_t::insert_query_count

Definition at line 182 of file hamsterdb_stats.h.

ham_u32_t ham_runtime_statistics_globdata_t::ok_scan_cost[HAM_FREELIST_SLOT_SPREAD]

Definition at line 173 of file hamsterdb_stats.h.

ham_u32_t ham_runtime_statistics_globdata_t::ok_scan_count[HAM_FREELIST_SLOT_SPREAD]

Definition at line 169 of file hamsterdb_stats.h.

ham_u32_t ham_runtime_statistics_globdata_t::query_count

Definition at line 184 of file hamsterdb_stats.h.

ham_u32_t ham_runtime_statistics_globdata_t::rescale_monitor

Note: counter/statistics value overflow management:

As the 'cost' numbers will be the fastest-growing numbers of them all, it is sufficient to check the cost against a suitable high water mark and, once it reaches that mark, to rescale all statistics.
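In code, that check is a single compare on every statistics update. The following is a minimal sketch, not hamsterdb's actual code: HAM_HIGH_WATER_MARK and rescale_stats() are hypothetical names, the types are those of the sketch above, and the divide-by-256 rescale is motivated below.

    #define HAM_HIGH_WATER_MARK (1u << 31)   /* near the top of the 32-bit range */

    static void rescale_stats(globdata_sketch_t *g);   /* sketched further below */

    /* Accumulate one cost sample and trigger a uniform rescale of all
     * statistics once the monitor crosses the high water mark. */
    static void add_cost(globdata_sketch_t *g, ham_u32_t cost)
    {
        g->rescale_monitor += cost;
        if (g->rescale_monitor >= HAM_HIGH_WATER_MARK)
            rescale_stats(g);
    }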

Of course, we could have done without the rescaling by using 64-bit integers for all statistics elements, but 64-bit integers are not native to all platforms and incur a (minor) run-time penalty when used. It is felt that slower machines, which are often 32-bit only, benefit from a compare plus a once-in-a-while rescale, as this overhead can be amortized over a large number of statistics updates.

How does the rescaling work?

The statistics all are meant to represent relative numbers, so uniformly scaling these numbers will not produce worse results from the hinters -- as long as the scaling does not produce edge values (0 or 1) which destroy the significance of the numbers gathered thus far.

I believe a rescale by a factor of 256 (2^8) is quite safe when the high water mark is near the MAXINT (2^32) edge, even when the cost number can be 100 times as large as the other numbers in some regular use cases. Meanwhile, a division by 256 reduces the collected numeric values so much that there is ample headroom again for the next 100K+ operations. At an average monitored cost increase of 10-20 per insert/delete trial and, for very large databases using an overly conservative freelist management setting, ~50-200 trials per insert/delete API invocation (which should be a hint to the user that another DAM mode is preferable; after all, 'classical' is only there for backwards compatibility, and in the old days hamsterdb was a snail when you stored 1M+ records in a single DB table), the resulting statistics additive step is a nominal worst case of 20 * 200 = 4000 cost points per insert/delete.

Assuming a high water mark for signed int, i.e. 2^31 ~ 2.14 billion, dividing ('rescaling') that number down to 2^(31-8) ~ 8M produces a headroom of ~ 2.13 billion points, which, assuming the nominal worst case of a cost addition of 4000 points per insert/delete, implies new headroom for ~ 500K insert/delete API operations.
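That arithmetic can be checked directly; the small stand-alone program below reproduces the numbers (the variable names are ours, introduced for the calculation only).

    #include <stdio.h>

    int main(void)
    {
        unsigned long mark     = 1ul << 31;         /* high water mark: 2^31            */
        unsigned long rescaled = 1ul << (31 - 8);   /* after dividing by 256: 2^23 ~ 8M */
        unsigned long headroom = mark - rescaled;   /* ~2.13 billion points             */
        unsigned long per_op   = 20ul * 200ul;      /* nominal worst case: 4000 points  */

        printf("headroom: %lu points\n", headroom);                     /* 2139095040 */
        printf("operations until rescale: ~%lu\n", headroom / per_op);  /* ~534773    */
        return 0;
    }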

Which, in my book, is ample space. This also means that the costs incurred by the rescaling can be amortized over 500K+ operations, resulting in, on average, negligible overhead.

So we can use 32 bits for all statistics counters quite safely, assuming our 'cost is the fastest riser' position holds for all use cases.

A quick analysis shows this to be probably true, even for fringe cases (a mathematical proof would be nicer here, but alas). Assume the worst case, where we have a lot of trials (testing each freelist page entry in a very long freelist, i.e. a huge database table) which all fail. 'Cost' is calculated EVERY TIME the innermost freelist search method is invoked, i.e. when the freelist bitarray is inspected, and both fail and success costs are immediately fed into the statistics. So our worst case for the 'cost-is-fastest' lemma would be a long trace of fail trials which do NOT test the freelist bitarrays, i.e. fails which are discarded in the outer layers thanks to the hinters (global and per-entry) kicking in and preventing those freelist bitarray scans. Assume then that all counters start at the same value; for the lemma to break, the number of fails would have to exceed the final cost, over and over again.

This _can_ happen when the number of fail trials at the per-entry outer level is higher than the cost of the final (and only) freelist bitarray scan, which clocks in at a nominal 4-10 points for success cases. However, those _outer_ fail trials are NOT counted and fed to the statistics, so this case will only register a single trial, successful or failing, together with its cost.

As long as the code is not changed to count those hinter-induced fast rounds in the outer layers when searching for a slot in the freelist, the lemma 'cost grows fastest' holds: any other possible 'worst case' will either succeed quite quickly or fail through a bitarray scan, which gives such fail rounds a non-zero (1+) cost.

To be on the safe side, we accumulate all costs in a special statistics counter which is specifically designed for the high water mark monitoring and the subsequent decision to rescale: rescale_monitor.

Definition at line 282 of file hamsterdb_stats.h.
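For completeness, a minimal sketch of the rescale step itself, assuming the divide-by-256 factor argued above and the sketch types from the first example. Every counter is shifted right by 8 bits so the relative proportions the hinters rely on are preserved; a real implementation would also have to watch for the 0/1 edge values mentioned earlier.

    /* Uniformly rescale all statistics by 2^8 = 256 (hypothetical sketch). */
    static void rescale_stats(globdata_sketch_t *g)
    {
        int i;
        for (i = 0; i < HAM_FREELIST_SLOT_SPREAD; i++) {
            g->scan_count[i]    >>= 8;
            g->ok_scan_count[i] >>= 8;
            g->scan_cost[i]     >>= 8;
            g->ok_scan_cost[i]  >>= 8;
        }
        /* ... likewise for insert_count, delete_count, the query counters, etc. ... */
        g->rescale_monitor >>= 8;
    }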

ham_u32_t ham_runtime_statistics_globdata_t::scan_cost[HAM_FREELIST_SLOT_SPREAD]

summed cost ('duration') of all scans per size range

Definition at line 172 of file hamsterdb_stats.h.

ham_u32_t ham_runtime_statistics_globdata_t::scan_count[HAM_FREELIST_SLOT_SPREAD]

number of scans per size range

Definition at line 168 of file hamsterdb_stats.h.

ham_u32_t ham_runtime_statistics_globdata_t::search_count

Definition at line 180 of file hamsterdb_stats.h.


The documentation for this struct was generated from the following file:

hamsterdb_stats.h