    Author Topic: New blockchain management in Armory  (Read 2610 times)
    etotheipi (OP)
    Legendary

    Activity: 1428
    Merit: 1103


    Core Armory Developer


    February 14, 2013, 01:53:30 PM
    Last edit: February 14, 2013, 02:48:42 PM by etotheipi
     #1

    This is just a brain-dump to get a sanity check on my logic before going and basing future versions of Armory on it.  The idea here is to have Armory start managing its own blockchain data, but in a manner that is optimized for all foreseeable use-cases.  Keep in mind, this is LevelDB, and part of my design is based on how LevelDB works best (which is probably common to a lot of DB engines).  LevelDB is a very simple but highly-optimized DB engine which simply stores (key,value) pairs, where both keys and values are arbitrary strings.  LevelDB handles some degree of "ACID" operations, and it should behave very much like a disk-based version of a C++ std::map<string,string>.
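    For anyone not familiar with it, the interface is about as simple as it sounds.  Here's a minimal C++ sketch against the stock LevelDB API (the database path and keys are just placeholders):
    Code:
    #include <cassert>
    #include <string>
    #include <leveldb/db.h>

    int main()
    {
        // Open (or create) a database directory -- the path is a placeholder.
        leveldb::DB* db;
        leveldb::Options opts;
        opts.create_if_missing = true;
        leveldb::Status s = leveldb::DB::Open(opts, "./test_leveldb", &db);
        assert(s.ok());

        // Keys and values are arbitrary byte strings, so it really does act
        // like a disk-backed std::map<string,string>.
        s = db->Put(leveldb::WriteOptions(), "someKey", "someValue");
        assert(s.ok());

        std::string value;
        s = db->Get(leveldb::ReadOptions(), "someKey", &value);
        assert(s.ok() && value == "someValue");

        delete db;
        return 0;
    }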

    So here's my list of considerations in designing the new database:

    • (1) LevelDB has key-order-optimized access:  if you access lots of data in key-sort order, it's super fast.  If you do lots of random access... well, it's not optimized for that.  Here's how this matters:  if you have (TxID, TxSerialized) pairs in a database and you want to scan all transactions in a given block, you are going to be accessing data all over the DB.  Each tx is retrieved from a different part of the disk, and LevelDB doesn't know what data to prefetch for your next request.  On the other hand, if you were to prefix each TxID with the 4-byte block height, then a full blockchain scan would be very fast, because the next tx you request is always the next key in sorted order.  LevelDB will be loading the subsequent blocks as you process the previous ones.  However, this particular example isn't useful by itself, since it doesn't let you do random lookup of a tx without knowing the block number -- it's just an example of how we might shape our data to leverage LevelDB optimizations (see the key-construction sketch right after this list).
    • (2) The structure should accommodate a clean transition to lite client mode (i.e. not have to redesign it)
    • (3) The structure should accommodate a clean transition to pruned-blockchain operations
    • (4) The structure should accommodate a clean transition to maintaining an address-indexed view of the blockchain
    • (5) Minimal duplication of tx data -- ideally we'd only ever have TxIDs in the database once.  Duplicating header data is not so bad, since that's strictly linear as a function of time, but tx data isn't (consider that there are 10 million tx, meaning that each extra copy of txids in the DB is another 320 MB).
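    To make the key-ordering point in (1) concrete, here's a sketch (not Armory code) of how a BlockHeight||TxIndex key could be built.  The integers have to be serialized big-endian so that LevelDB's default byte-wise key comparison matches numeric order -- that detail is an assumption of the sketch, not something fixed above:
    Code:
    #include <cstdint>
    #include <string>

    // Serialize a 32-bit integer big-endian so that lexicographic (byte-wise)
    // comparison of keys matches numeric comparison of heights/indices.
    static std::string uint32ToBE(uint32_t v)
    {
        std::string out(4, '\0');
        out[0] = static_cast<char>((v >> 24) & 0xff);
        out[1] = static_cast<char>((v >> 16) & 0xff);
        out[2] = static_cast<char>((v >>  8) & 0xff);
        out[3] = static_cast<char>( v        & 0xff);
        return out;
    }

    // Key for the BlockDataDB described below: BlockHeight(4) || TxIndex(4).
    // Iterating these keys in sorted order walks the chain in tx order.
    static std::string makeBlockTxKey(uint32_t height, uint32_t txIndex)
    {
        return uint32ToBE(height) + uint32ToBE(txIndex);
    }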

    With all this in mind, I have decided on the following set of databases (use || to represent concatenation of data):

    Code:
    HeaderHashDB:  key=HeaderHash(32),              value=BlockHeight(4)
    TxHashDB:      key=TxID(32),                    value=BlockHeight(4)||TxIndex(4)
    BlockDataDB:   key=BlockHeight(4)||TxIndex(4)   value=SpecialSerializedTx(VAR)
                   key=BlockHeight(4)||FFFFFFFF     value=RawHeader(80)
    MerkleDB:      key=MerkleRoot(32)               value=TxID0(32)||TxID1(32)||...||TxIDN(32)  (optional, not sure if it's really necessary)
    OrphansDB:     key=HeaderOrTxHash(32)           value=RawHeaderOrTx(80orVAR)
    EDIT:  Should probably just have a different DB altogether for RawHeaders, so that it is easy to download them separately as a way to seed a new node.
    A few things about this:

    (1) BlockDataDB can store a variable amount of data -- full nodes will have one entry for every tx, but you don't have to have every tx:  you can store just the ones you need.  The TxHashDB for a lite node only needs to store the transactions that are relevant to that node, and the BlockDataDB only needs to have those particular transactions available (the BlockHeight||TxIndex keys simply won't exist for the other tx).  You will still have the 0xFFFFFFFF entry for each block height, so the headers are maintained.
    (1a) The structure of BlockDataDB achieves goal #1 above -- the sort order for all the keys in the BlockDataDB is the tx order in the blockchain.  If you're going to do a full scan, you only need to do a simple iteration over the entire database, and LevelDB will pre-fetch subsequent data for you.  I have tested this with some experimental LevelDB code and it's stupid fast... much faster than my own low-level, highly-optimized blockchain scanning code in C++.
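    For reference, that full scan is just one iterator walk over the BlockDataDB -- something like this sketch (function and variable names are made up):
    Code:
    #include <cassert>
    #include <leveldb/db.h>

    // Walk the entire BlockDataDB in key order.  Because keys are
    // BlockHeight(4)||TxIndex(4), this visits every stored tx (plus the
    // FFFFFFFF header entries) in blockchain order, and LevelDB prefetches
    // the next chunk of data while the current one is being processed.
    void scanWholeChain(leveldb::DB* blockDataDB)
    {
        leveldb::Iterator* it = blockDataDB->NewIterator(leveldb::ReadOptions());
        for (it->SeekToFirst(); it->Valid(); it->Next())
        {
            leveldb::Slice key   = it->key();    // BlockHeight(4)||TxIndex(4)
            leveldb::Slice value = it->value();  // SpecialSerializedTx or RawHeader
            // ... hand (key, value) to the scanning logic ...
        }
        assert(it->status().ok());   // make sure iteration didn't hit an error
        delete it;
    }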

    (2) TxIDs and HeaderHashes are only stored exactly once (unless you also store the MerkleDB, which I'm not sure is needed).  If you need to verify a TxID, look it up in the TxHashDB to find where to fetch the full Tx from the BlockDataDB, then hash it to make sure it matches.
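    In code, that two-step lookup is something like this sketch (names are illustrative, and the final double-SHA256 check is left as a comment):
    Code:
    #include <string>
    #include <leveldb/db.h>

    // Resolve a TxID to its serialized tx: TxHashDB gives BlockHeight||TxIndex,
    // which is exactly the BlockDataDB key for that tx.
    bool fetchTxByHash(leveldb::DB* txHashDB,
                       leveldb::DB* blockDataDB,
                       const std::string& txid32,        // 32-byte TxID
                       std::string& serializedTx)
    {
        std::string heightAndIndex;                      // BlockHeight(4)||TxIndex(4)
        leveldb::Status s = txHashDB->Get(leveldb::ReadOptions(), txid32, &heightAndIndex);
        if (!s.ok())
            return false;                                // TxID not in the DB

        s = blockDataDB->Get(leveldb::ReadOptions(), heightAndIndex, &serializedTx);
        if (!s.ok())
            return false;                                // tx not stored locally (pruned/lite)

        // Caller re-serializes the normal form, double-SHA256s it, and
        // compares against txid32 to verify (hashing omitted here).
        return true;
    }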

    (3) SpecialSerializedTx:  this is a special way of serializing the transactions to allow for a variable amount of data depending on the node type.  This satisfies #3 above while still allowing for full reconstruction of the tx if you happen to have all the pieces of it.
    Code:
    NormalSerializedTx:   Version||  N||TxIn0||TxIn1||...||TxInN||  M||TxOut0||TxOut1||...||TxOutM||LockTime
    SpecialSerializedTx:  FFFFFFFF||(NormalSerializedTx-with-blank-TxOuts)||  TxOutIndex0(4)||TxOut0  ||  TxOutIndex1(4)||TxOut1  ||...||  TxOutIndexM(4)||TxOutM
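    A sketch of how building that value might look (purely illustrative -- the names and the byte order of TxOutIndex inside the value aren't pinned down above):
    Code:
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // 4-byte serialization of an integer (byte order here is arbitrary, since
    // this lives inside a value rather than a key).
    static std::string putUint32(uint32_t v)
    {
        std::string out(4, '\0');
        out[0] = static_cast<char>((v >> 24) & 0xff);
        out[1] = static_cast<char>((v >> 16) & 0xff);
        out[2] = static_cast<char>((v >>  8) & 0xff);
        out[3] = static_cast<char>( v        & 0xff);
        return out;
    }

    // txStub is the NormalSerializedTx with its TxOuts blanked.  storedOuts
    // holds (TxOutIndex, raw TxOut) pairs for whichever outputs this node
    // keeps.  A super-pruned node passes includeStub=false and only the
    // TxOuts it owns.
    static std::string serializeSpecialTx(
        const std::string& txStub,
        const std::vector<std::pair<uint32_t, std::string> >& storedOuts,
        bool includeStub)
    {
        std::string out;
        if (includeStub)
        {
            out += std::string("\xff\xff\xff\xff", 4);   // FFFFFFFF marker
            out += txStub;                               // tx with blank TxOuts
        }
        for (size_t i = 0; i < storedOuts.size(); i++)
        {
            out += putUint32(storedOuts[i].first);       // TxOutIndexN(4)
            out += storedOuts[i].second;                 // raw TxOutN
        }
        return out;
    }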

    (3a) Just like (1a) above, this serialization accommodates any storage level:  full nodes will have the full-tx-with-blank-TxOuts and every TxOut.  Super-pruned and lite nodes can skip the FFFFFFFF entry and exclude the TxOuts they don't care about.  So if you know that you own a TxOut in Tx X, you can look up Tx X and there will only be one entry:  the TxOut that you own.

    (4) The OrphansDB is simply for juggling data that does not yet have a home -- i.e. if you just re-org'd, you want to keep the old block/tx data around, but you don't have a block height to attach it to, so you put it in the OrphansDB until you need it again.

    The downside of this is that to get from a header hash to header data you have to do two DB lookups (HeaderHashDB to get blockheight, then go look up blockheight||FFFFFFFF in the BlockDataDB).  Same for looking up Tx hashes.  But I don't think it will be so bad -- at least the header map will probably be fully cached because it is small.

    Haven't tackled #4 (address-indexed database view), but I think it can be just a separate database that piggybacks on the structures of the DBs described above.  I know it won't be simple, but it's also out of scope for now.  I just want to make sure I don't end up having to redesign this structure when I finally do decide to tackle #4.

    So how can I further improve this?  Am I missing use cases?  Am I crazy?

    Founder and CEO of Armory Technologies, Inc.
    Armory Bitcoin Wallet: Bringing cold storage to the average user!
    Only use Armory software signed by the Armory Offline Signing Key (0x98832223)

    Please donate to the Armory project by clicking here!    (or donate directly via 1QBDLYTDFHHZAABYSKGKPWKLSXZWCCJQBX -- yes, it's a real address!)