Thursday, July 26, 2018

Teradata Database Hashing Algorithm



How does data distribution work in Teradata? It depends on the primary index. The primary index is the set of table columns that is used as input to the hashing algorithm.

When a row is assigned to a particular AMP, Teradata uses the hashing algorithm on the primary index to determine which AMP gets the row.


How does the hashing algorithm work?
When the parsing engine (PE) or an AMP receives a row, it passes that record to the hashing algorithm. The hashing algorithm hashes the primary index value of each record and returns a 32-bit number (the row hash), which is later used to store the row on an AMP.
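The idea can be sketched in a few lines of Python. This is illustrative only: Teradata's actual hash function is proprietary, and `NUM_AMPS`, `row_hash`, and `target_amp` are made-up names, with CRC-32 standing in for the real hash.

```python
import zlib

NUM_AMPS = 4  # hypothetical system size, not a Teradata default

def row_hash(pi_value: str) -> int:
    # Stand-in for Teradata's proprietary hash: any function that maps
    # a primary index value to a 32-bit number illustrates the point.
    return zlib.crc32(pi_value.encode()) & 0xFFFFFFFF

def target_amp(pi_value: str) -> int:
    # Real systems map the hash bucket to an AMP via a hash map table;
    # a simple modulo is used here for illustration only.
    return row_hash(pi_value) % NUM_AMPS

# The same primary index value always hashes to the same AMP.
print(target_amp("emp_1001"))
```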


Rows that are inserted into NoPI tables using single-row inserts or Teradata Parallel Data Pump ARRAY INSERTs are allocated to the AMPs based on the hashed value of the Query ID for their request.

Rows that are inserted into NoPI tables using INSERT … SELECT operations are, by default, appended AMP-locally to their target table, so there is no need to hash them to a destination AMP.

The rows that are FastLoaded into nonpartitioned NoPI tables (see NoPI Tables, Column-Partitioned NoPI Tables, and Column-Partitioned NoPI Join Indexes) are assigned to their destination AMPs based on a round-robin protocol.
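The round-robin protocol mentioned above simply deals rows out to the AMPs in turn, ignoring row content entirely. A minimal sketch in Python (`round_robin_assign` is a made-up name, not a Teradata API):

```python
from itertools import cycle

def round_robin_assign(rows, num_amps):
    # Assign each row to the next AMP in turn; no hashing is involved.
    amps = cycle(range(num_amps))
    return [(row, next(amps)) for row in rows]

# Five rows over three AMPs land on AMPs 0, 1, 2, 0, 1.
print(round_robin_assign(["r1", "r2", "r3", "r4", "r5"], 3))
```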


Hash Bucket Number
The first 16 or 20 bits of the row hash form the hash bucket number, which identifies the hash bucket to which the hashed row belongs.
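Extracting the bucket number amounts to taking the high-order bits of the 32-bit row hash. A minimal sketch (illustrative Python; `hash_bucket` is a made-up name):

```python
def hash_bucket(row_hash: int, bucket_bits: int = 20) -> int:
    # Keep the top bucket_bits bits of the 32-bit row hash.
    return row_hash >> (32 - bucket_bits)

# 20-bit buckets give 2**20 possible bucket numbers,
# 16-bit buckets give 2**16.
assert hash_bucket(0xFFFFFFFF, 20) == 2**20 - 1
assert hash_bucket(0xFFFFFFFF, 16) == 2**16 - 1
```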


A 32-bit row hash provides about 4.2 billion possible values, which reduces hash collisions to a level that, for all practical purposes, eliminates their impact on retrieval performance.
Because the number of possible values for a given domain may exceed 4.2 billion, further information, called the uniqueness ID, is required to make each row uniquely identifiable. This situation occurs for tables hashed on a nonunique primary index or primary AMP index, where different rows can have the same index value and therefore the same row hash, and also for tables hashed on a unique primary index when there is a hash collision.
If a record has the same row hash as an existing row, the uniqueness ID, a 32-bit number, is incremented. For a new row hash, the uniqueness ID starts at 1 and is incremented each time another record with the same row hash is inserted.
The combination of the row hash and the uniqueness ID is called the row ID.

The row ID prefixes each record on disk.

Within each AMP, table rows are logically sorted by their row IDs.
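The uniqueness-ID logic and the row-ID sort order described above can be sketched together (illustrative Python only; `assign_row_ids` is a made-up name, and the row hashes are arbitrary example values):

```python
from collections import defaultdict

def assign_row_ids(row_hashes):
    # The first row with a given row hash gets uniqueness ID 1;
    # each later row with the same hash gets the next ID.
    counts = defaultdict(int)
    row_ids = []
    for h in row_hashes:
        counts[h] += 1
        row_ids.append((h, counts[h]))
    # On the AMP, rows are logically stored in row ID order:
    # sorted first by row hash, then by uniqueness ID.
    return sorted(row_ids)

# Two rows collide on hash 0xAB; the second gets uniqueness ID 2.
print(assign_row_ids([0xAB, 0x10, 0xAB]))
```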



Query to check data distribution in Teradata -

SELECT HASHAMP(HASHBUCKET(HASHROW(<PRIMARY INDEX>))) AS "AMP"
,COUNT(*) as "#row"
FROM <your_table>
GROUP BY HASHAMP(HASHBUCKET(HASHROW(<PRIMARY INDEX>)))
ORDER BY 2 DESC;


Teradata uses hash values to store data across the various AMPs (Access Module Processors) in the Teradata system.

Wednesday, July 25, 2018

Teradata - FastLoad - DataLoad Utility


FastLoad is a Teradata utility for loading large amounts of data (mostly flat files) into a Teradata table. FastLoad is a command-driven utility.

1) It loads data into an empty table only.

2) It doesn't allow duplicate rows, even if the table is defined as a MULTISET table.

3) It doesn't use TRANSIENT JOURNALS, so data loads quickly.

4) Teradata FastLoad uses multiple sessions and block-level operations to load data.

5) One FastLoad job can load data into only one table.

6) Data can be loaded from disk, tape, network, or any other device that can provide properly formatted source data.

7) The target table should not have secondary indexes, join indexes, foreign key references, or triggers. A primary index is permissible.

8) No AMP should be down during a FastLoad.

9) Only one data type conversion is allowed per column.



Command to run the utility: type "fastload" at the command prompt. If Teradata "fastload" is installed on the system, it will return the FASTLOAD UTILITY version number and other information.



Components of FastLoad:

1) Log Table: The log table stores processing-record information during the load. This table contains one row for each FastLoad job.

2) Empty Target Table: This is a mandatory requirement.

3) Error Tables (2): Two tables are required to log error messages. These are created automatically at runtime (you can define the error table names). The first error table records data type (conversion) errors, unavailable AMPs, and constraint violations (such as NOT NULL). The second table records unique primary index violations.

Both error tables have the same structure. Say, for example, your error table names are errorlog1 and errorlog2. To see the error messages, run the query:

SELECT DataParcel, ErrorCode, ErrorFieldName FROM errorlog1 ORDER BY ErrorCode;
DataParcel acts as a kind of primary key for this table. It stores the entire data record (up to 64 KB).

How FastLoad Works: It loads data in two phases.

Phase I (Acquisition Phase): The primary purpose of this phase is to move data from the host computer into the Teradata system. The parsing engine (PE) reads records from the input file and sends 64 KB blocks to the AMPs in round-robin fashion, without hashing.


Phase II (Application Phase): This phase starts when FastLoad receives the END LOADING statement. Each AMP hashes its records and sends each row to the correct destination AMP, where the rows are sorted and written to disk. Locks on the target table are released and the error tables are dropped.
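The two phases can be sketched as a toy simulation. This is a simplification, not the real protocol: CRC-32 stands in for Teradata's proprietary hash, modulo stands in for the hash-map lookup, and `acquisition_phase` / `application_phase` are made-up names.

```python
import zlib

def acquisition_phase(rows, num_amps):
    # Phase 1: deal raw rows out to the AMPs round-robin, with no hashing.
    holding = [[] for _ in range(num_amps)]
    for i, row in enumerate(rows):
        holding[i % num_amps].append(row)
    return holding

def application_phase(holding, num_amps):
    # Phase 2: each AMP hashes its rows and redistributes them to the
    # hash-owning AMP, which sorts the rows before writing them to disk.
    final = [[] for _ in range(num_amps)]
    for amp_rows in holding:
        for row in amp_rows:
            h = zlib.crc32(row.encode()) & 0xFFFFFFFF  # stand-in hash
            final[h % num_amps].append((h, row))
    return [sorted(amp) for amp in final]  # each AMP sorts by row hash

rows = ["row%d" % i for i in range(10)]
print(application_phase(acquisition_phase(rows, 4), 4))
```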


Steps for a FastLoad script:

1) Log on to Teradata.
2) Define the target table.
3) Declare the input file delimiter (the default delimiter is pipe).
4) Map the columns to the input file column sequence.
5) Define the input data file.
6) Instruct the system to start loading.
7) Instruct the system to insert data into the target table.
8) End the session.


Does hashing happen at the PE or at the AMP?
It can be either. In particular, FastLoad sends data blocks to the AMPs round-robin, and the AMPs hash the rows and send them to the correct target AMP. But for a single-row fetch or insert, the PE computes the hash to determine which AMP should handle the request.

Fastload Script : 

>fastload < D:\fastloadtest.txt

<<Script Starts Here >>
/***** Section 1 *****/
/* In this section we give the LOGON credentials required to connect to the TD system.
The SESSIONS command is used to restrict the number of sessions FastLoad makes to connect to TD.
The default is one session per AMP. */

SESSIONS 10;
errlimit 25;
.LOGON  HostName or IP Address/UserID,Password;

/***** Section 2 *****/
/* In this section we define the table that we want to load with FastLoad. The DROP commands are optional.
There is no need to define the structure of the ERROR tables; they are created automatically by FastLoad. */

DROP TABLE  DBNAME.Errortable1;
DROP TABLE  DBNAME.Errortable2;
DROP TABLE  DBNAME.fastloadtest;

CREATE MULTISET TABLE DBNAME.fastloadtest
(
  att1 VARCHAR(50)
, att2 VARCHAR(100)
, col3 VARCHAR(90)
, col4 VARCHAR(200)
, col5 VARCHAR(300)
, col6 VARCHAR(200)
)no primary index
;

/***** Section 3 *****/
/* The RECORD command is used to skip rows at the start of the data file. The RECORD THRU command is used to skip rows at the end of the data file.
The SET RECORD command is used to define the record layout, and "," is the delimiter we are using in our data file to separate columns. */

SET RECORD VARTEXT ",";
RECORD 1;


/***** Section 4 *****/
/* DEFINE statement is used to define the structure of the data file. This should be in accordance with the actual target table structure.
Fastload DEFINE statement allows only VARCHAR format. */
DEFINE
col1(VARCHAR(50)) ,
col2(VARCHAR(100)) ,
col3(VARCHAR(90)) ,
col4(VARCHAR(200)) ,
col5(VARCHAR(300)) ,
col6(VARCHAR(200))


/***** Section 5 *****/
/* FILE command defines the data file path and name. */

FILE=D:\Amrendra\FastLoadData.csv;
SHOW;

/***** Section 6 *****/
/* In this section we give the BEGIN loading statement. As soon as Fastload receives this statement it starts PHASE 1. */

BEGIN LOADING DBNAME.fastloadtest
   ERRORFILES DBNAME.Errortable1,
              DBNAME.Errortable2;


/***** Section 7 *****/
/* INSERT command is used to load the data file into actual target table. NOTE – For DATE columns we can use the data conversion by the syntax given below. */
INSERT INTO DBNAME.fastloadtest
(
att1,
att2,
col3,
col4,
col5,
col6
)
VALUES (
:col1,
:col2,
:col3,
:col4,
:col5,
:col6

);

/***** Section 8 *****/
/* END LOADING ends PHASE 1 and starts the execution of PHASE 2. LOGOFF is required to close all the sessions created by Fastload. */



END LOADING;
LOGOFF;


<<Script Ends Here>>


<<You could also do some customization in the script>>

INSERT INTO retail.emp_test
VALUES (
:emp_id,
:emp_name,
:dept_id,
:salary,
:dob (FORMAT 'yyyy-mm-dd')
);
In the above code, dob, which is VARCHAR in the input, is cast to the DATE data type.

.RECORD 1;
.RECORD THRU 3;
SET RECORD VARTEXT ",";
The RECORD command is used to skip rows at the start of the data file. The RECORD THRU command is used to skip rows at the end of the data file. The SET RECORD command is used to define the record layout, and "," is the delimiter we are using in our data file to separate columns.


You can also define a checkpoint:
BEGIN LOADING EMPDB.EMP_TABLE
ERRORFILES EMPDB.EMP_ERR1, EMPDB.EMP_ERR2
CHECKPOINT 10000000;