How to use the compression with Hawkular Metrics
Michael Burman edited this page Aug 22, 2016
version.gorilla.compression
- Distribute inserts to nodes with the same hashing using Infinispan
- Each node processes the insert and inserts it into the memory-mapped file of the compressed series
- This is to reduce memory usage and try to maximize the amount of stuff kept in the memory for next compaction
- But needs persistence if there's a lot of timestamps sent during that time or memory is tight
- Investigate Netty or Chronicle's ByteBuffer enhancements and some sort of lookup table
- JGroups is used by the EhCache replication (or was; Hazelcast is probably the same company these days); this is also one option
- One more option could be to use AtomicMap, which allows safe reading (a snapshot)
- However, we accept rewriting the same timestamp again, replacing the old one; AtomicMap is based on immutable objects..
- Can we somehow verify that the map did not change during reading before we delete it?
- Performance? Usually state machines are not performant enough
- Each node processes the insert and inserts it into the memory-mapped file of the compressed series
- After a time series is full (the 2 hour block time has passed), write the block to Cassandra (or eventually ditch Cassandra..)
- This is technically what Facebook's Gorilla uses also, although they use just different memory boundaries
- Can we force this in Infinispan to happen on the key owning node?
- When compacting, look for out-of-order writes and combine them to the compressed series (this can be done later also)
- Also, we could write the series to a final place later if using a method where we don't compress right away
- One option is to write directly to Cassandra's data table (like these days) and read the rows afterwards.
- Data locality must be paid a lot of attention to, otherwise performance will suffer and there will be a huge amount of data transfers
- We need to store the tags somewhere also (like currently supported per datapoint)
- Create a job that is run every two hours
- Timestamp blocks start on even hours (00, 02, 04, ..)
- The job runs on odd hours (01, 03, 05, ..)
- It reads all the rows of the previous timestamp block (for example, the 03 job reads everything written between 00-02)
- Creates a compressed block
- Writes it to the compressed table
- Deletes the originals from the data table
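The 2-hour boundary math the job relies on can be sketched as below; `BlockBoundaries`, `blockStart` and `previousBlockStart` are hypothetical names for illustration, not existing Hawkular Metrics code:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the block-boundary arithmetic the compaction job needs.
public class BlockBoundaries {
    static final long BLOCK_MILLIS = TimeUnit.HOURS.toMillis(2);

    // Start of the 2-hour block containing the timestamp (00, 02, 04, .. UTC).
    static long blockStart(long timestampMillis) {
        return timestampMillis - (timestampMillis % BLOCK_MILLIS);
    }

    // The block the job should compress: the one before the current block.
    static long previousBlockStart(long nowMillis) {
        return blockStart(nowMillis) - BLOCK_MILLIS;
    }

    public static void main(String[] args) {
        long at0315 = TimeUnit.HOURS.toMillis(3) + TimeUnit.MINUTES.toMillis(15);
        System.out.println(previousBlockStart(at0315)); // prints 0, i.e. the 00:00-02:00 block
    }
}
```

This matches the example above: a job running at 03 compresses everything written between 00 and 02.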
- Timestamp precision changes for the write path; these can be done in the compression phase
- Such as second precision instead of millisecond precision
- Depending on the implementation, this needs to be undone in the read path, or we just set the milliseconds to zero and be happy with that (in the case of second precision)
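A minimal sketch of what that second-precision truncation could look like; the helper names are hypothetical:

```java
// Hypothetical helpers for dropping and restoring sub-second precision.
public class PrecisionTruncation {
    // Write path: store second precision instead of millisecond precision.
    static long toSeconds(long timestampMillis) {
        return timestampMillis / 1000L;
    }

    // Read path: restore a millisecond timestamp with the milliseconds zeroed.
    static long toMillis(long timestampSeconds) {
        return timestampSeconds * 1000L;
    }
}
```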
- Create wrapper for the gorilla-tsc library to add compression headers and to read them:
- Type of compression that is used (1 byte) (bit mask?), allow different timestamp + value compression bundles
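One possible layout for that header byte, assuming a bit mask with the timestamp compression type in the low nibble and the value compression type in the high nibble; the type codes here are invented for illustration:

```java
// Hypothetical 1-byte compression header: low nibble = timestamp codec,
// high nibble = value codec, allowing different timestamp + value bundles.
public class CompressionHeader {
    static final int TS_DELTA_DELTA = 0x01; // Gorilla delta-of-delta timestamps
    static final int VALUE_XOR      = 0x10; // Gorilla XOR for doubles
    static final int VALUE_LONG     = 0x20; // long compression (e.g. availability)

    static byte encode(int timestampType, int valueType) {
        return (byte) (timestampType | valueType);
    }

    static int timestampType(byte header) { return header & 0x0F; }
    static int valueType(byte header)     { return header & 0xF0; }
}
```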
- Create modifications to DataAccessImpl to write to the data_compressed table and delete the selected keys from data
- Since the writes and deletes always go to the same partition, the delete and insert can be done in a single batch -> performance improvement and no data loss
- Out-of-order writes happen as before
- Add c_value column to the data table queries to the DataAccessImpl
- Truncate the dates to the previous 2-hour block when reading from a known compressed time range
- Out-of-order writes are correctly processed when doing the sorting in the read path
- And even if the compression job dies, the parsing will still be correct
- Create modifications to findDataPoints in the MetricsServiceImpl
- Sort each time if descending order is requested for compressed datapoints, or if there were out-of-order writes which need to be included
- Create new Row -> List<DataPoint> (and the same for others) functions with enough filtering information to get only the requested datapoints; or an Observable, but we need to read all the points in any case
- Add ability to add matching tags also to the Datapoint
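The merge-and-sort step above could look roughly like this; the class and method names are hypothetical, and the real logic would live in findDataPoints in the MetricsServiceImpl:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: combine decompressed block points with any
// out-of-order raw points, then sort in the requested direction.
public class ReadPathMerge {
    static class DataPoint {
        final long timestamp;
        final double value;
        DataPoint(long timestamp, double value) { this.timestamp = timestamp; this.value = value; }
    }

    static List<DataPoint> merge(List<DataPoint> fromCompressed, List<DataPoint> outOfOrder,
                                 boolean descending) {
        Comparator<DataPoint> byTime = Comparator.comparingLong(p -> p.timestamp);
        return Stream.concat(fromCompressed.stream(), outOfOrder.stream())
                     .sorted(descending ? byTime.reversed() : byTime)
                     .collect(Collectors.toList());
    }
}
```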
- Create modifications to gorilla compressor to allow long compression also, not just double
- This needs to be known when decompressing.. set a bit in the header? EnumSet?
- Create Availability -> compression path (enum literal number is the value) -> long compression
- Although efficient without changes, XOR is unnecessary here as compressing the leadingZeros and trailingZeros takes more space than the whole value would
- Recommendation: store 0 bit if unchanged, otherwise set bit (1) + 4 bits for the potential values?
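The recommended bit layout (a 0 bit for an unchanged value, otherwise a 1 bit followed by 4 bits for the value) can be sketched as below, assuming the availability enum has fewer than 16 literals so the ordinal fits in 4 bits; all names are hypothetical:

```java
import java.util.*;

// Hypothetical availability codec: 0 bit = value repeats,
// 1 bit + 4-bit enum ordinal = value changed.
public class AvailabilityCodec {
    static List<Integer> encodeBits(int[] ordinals) {
        List<Integer> bits = new ArrayList<>();
        int previous = -1;
        for (int ordinal : ordinals) {
            if (ordinal == previous) {
                bits.add(0); // unchanged: single 0 bit
            } else {
                bits.add(1); // changed: 1 bit + 4-bit ordinal, MSB first
                for (int i = 3; i >= 0; i--) {
                    bits.add((ordinal >> i) & 1);
                }
                previous = ordinal;
            }
        }
        return bits;
    }

    static int[] decodeBits(List<Integer> bits, int count) {
        int[] out = new int[count];
        int previous = -1, pos = 0;
        for (int i = 0; i < count; i++) {
            if (bits.get(pos++) == 0) {
                out[i] = previous;
            } else {
                int v = 0;
                for (int b = 0; b < 4; b++) v = (v << 1) | bits.get(pos++);
                previous = out[i] = v;
            }
        }
        return out;
    }
}
```

A long run of the same availability value then costs one bit per datapoint, which is why plain XOR compression is wasteful here.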
- Need something in gorilla-tsc to allow reading only the timestamp or only the value. And the same for compressing.