Thursday, July 26, 2018

Teradata Database Hashing Algorithm



How data distribution works in Teradata? It depends on the primary Index.  The primary index is a set of table column/s which is/are used as input for hashing algorithm. 

When a row assigned to particular AMP based on the primary index, Teradata uses the hashing algorithm to determine which AMP gets the row. 


How Hashing algorithm works?
When parser (PI) or AMP received the data/row, it passes that records to the hashing algorithm. The hashing algorithm hashes the primary index value of each record and returns a 32-bit number (Row Hash), which later used to store data into AMP. 


Rows that are inserted into NoPI tables using single-row inserts or Teradata Parallel Data Pump ARRAY INSERTs are allocated to the AMPs based on the hashed value of the Query ID for their request

Rows that are inserted into NoPI tables using INSERT … SELECT operations are, by default, appended AMP-locally to their target table, so there is no need to hash them to a destination AMP.

The rows that are FastLoaded into nonpartitioned NoPI tables (see NoPI Tables, Column-Partitioned NoPI Tables, and Column-Partitioned NoPI Join Indexes) are assigned to their destination AMPs based on a round-robin protocol.


Hash Bucket Number
The first 16 or 20 bits of the rowhash is a hash bucket number, which is used to define the number of the hash bucket to which the hashed row belongs. 


A 32-bit rowhash provides 4.2 billion possible rowhash values, which reduces hash collisions to a level that eliminates their impact on retrieval performance for all intents and purposes.
Because the number of possible values for a given domain may exceed 4.2 billion, further information, called the, is required to make each row uniquely identifiable. This situation occurs for tables hashed on a nonunique primary index or primary AMP index, where different rows can have the same index value, and so the same rowhash value, and also for tables hashed on a unique primary index if there is a hash collision.
If there is any record with same row hash, then it increments the uniqueness ID which is a 32 bit number. For new row hash, uniqueness ID is assigned as 1 and incremented whenever a record with same row hash is inserted.
The combination of Row hash and Uniqueness ID is called as Row ID.

Row ID prefixes each record in the disk.

Each table row in the AMP is logically sorted by their Row IDs.



Query to check data distribution in Teradata -

SELECT HASHAMP(HASHBUCKET(HASHROW(<PRIMARY INDEX>))) AS "AMP"
,COUNT(*) as "#row"
FROM <your_table>
GROUP BY HASHAMP(HASHBUCKET(HASHROW(<PRIMARY INDEX>)))
ORDER BY 2 DESC;


Teradata uses HASH values to store data in various AMPs (Access Module Processor)  in the Teradata System. 

No comments:

Post a Comment