Hive Query Optimization Infinity

Uploaded by

shashwat2010

How to optimize hive queries for better performance and execution

Copyright:

Attribution Non-Commercial (BY-NC)

dwivedishashwat@gmail.com http://helpmetocode.blogspot.com

Well-designed tables (partitioning, bucketing) and well-written queries can improve query speed and reduce processing cost.

Optimization on Table side

Partitioning Hive Tables:
Partitioning is a kind of horizontal slicing of the data. The slicing can be based on a range, a single value or a set of values. Imagine log files where each record includes a timestamp. If we partition by date, then records for the same date are stored in the same partition. E.g.: partition on date, on geographic location, or on a number range.

Defining a table partition

Let's take an Apache log file as an example, where the web server generates a log entry for each client visit. These logs contain date and time information along with details about the browser and location (IP). So we can create a table in Hive, partition the log data by date, and create a sub-partition by location. This looks like:

CREATE TABLE logs (ts BIGINT, detail STRING) PARTITIONED BY (dt STRING, country STRING);

Directory structure of the log table:

/user/hive/warehouse/logs
    /dt=2010-01-01
        /country=GB
            /file1
            /file2
        /country=US
            /file3
    /dt=2010-01-02
        /country=GB
            /file4
        /country=US
            /file5
            /file6
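The layout above pays off when a query filters on the partition columns, because Hive then reads only the matching directories. The following is a minimal sketch, assuming a table named logs partitioned by dt and country as in the directory structure; the column names ts and detail and the local file path are assumptions made for illustration.

```sql
-- Sketch only: table and column names follow the directory layout above;
-- the input path is hypothetical.
CREATE TABLE IF NOT EXISTS logs (ts BIGINT, detail STRING)
PARTITIONED BY (dt STRING, country STRING);

-- Load one file into a single partition; Hive places it under
-- .../logs/dt=2010-01-01/country=GB/
LOAD DATA LOCAL INPATH '/tmp/access_log_gb.txt'
INTO TABLE logs
PARTITION (dt='2010-01-01', country='GB');

-- List the partitions Hive knows about.
SHOW PARTITIONS logs;

-- Filtering on partition columns restricts the scan to one directory
-- instead of the whole table.
SELECT ts, detail
FROM logs
WHERE dt = '2010-01-01' AND country = 'GB';
```

Partition columns do not appear in the data files themselves; their values come from the directory names, which is why they are listed separately from the regular columns.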

Hive Buckets
Bucketing Hive Tables:
Bucketing a Hive table results in more efficient queries.

Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. If two tables are bucketed in the same way on the join column, a mapper processing a bucket of the left table knows that the matching rows in the right table are in the corresponding bucket, so it needs to retrieve only that bucket. A bucket may additionally be sorted by one or more columns. This allows even more efficient map-side joins, since the join of each bucket becomes an efficient merge-sort.

Bucketing also makes sampling more efficient.
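As a rough illustration of the points above, the sketch below defines two tables bucketed and sorted on the join key, and then samples one bucket. The table and column names (users_bucketed, orders_bucketed, and so on) are hypothetical, and the settings shown are the ones commonly used on older Hive releases; their defaults and availability vary by version.

```sql
-- Sketch only: hypothetical tables, both bucketed on the join key
-- with the same number of buckets.
CREATE TABLE users_bucketed (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

CREATE TABLE orders_bucketed (order_id INT, user_id INT, amount DOUBLE)
CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 4 BUCKETS;

-- On older Hive versions this makes inserts honor the bucket definition
-- (later versions always enforce it and no longer have this setting).
SET hive.enforce.bucketing=true;

-- Let Hive use the bucket structure for map-side joins.
SET hive.optimize.bucketmapjoin=true;

-- Each mapper only needs the matching bucket of the other table.
SELECT u.name, o.amount
FROM orders_bucketed o
JOIN users_bucketed u ON o.user_id = u.id;

-- Sampling: read one bucket out of four instead of the whole table.
SELECT *
FROM users_bucketed TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
```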

Parallel execution of queries

Hadoop can execute MapReduce jobs in parallel, and queries executed on Hive can make use of this parallelism. Queries or sub queries that do not depend on each other, as in some join queries, can be executed in parallel mode.

The following setting turns this mode on:

SET hive.exec.parallel=true; -- turns parallel execution on

[Figure: query flow with steps numbered 1 to 5. Boxes: Sub query 1, Sub query 2, Sub query 3, Join of sub queries (1 & 2), Join of query (1 & 2) with 3, Main Query, Final Result.]

Misc
So in the above flow, 1, 2 and 4 can run in parallel as sub queries; they are then joined to 3, then to 5, to give the final query result.
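A query of this shape might look like the sketch below: the two sub queries do not depend on each other, so with parallel execution enabled Hive is free to run their jobs at the same time before the join. The tables page_views and error_log are hypothetical, and the thread count shown is only an example value.

```sql
-- Sketch only: page_views and error_log are hypothetical tables.
SET hive.exec.parallel=true;
-- Upper bound on how many jobs may run at the same time.
SET hive.exec.parallel.thread.number=8;

SELECT v.dt, v.visits, e.errors
FROM (
  -- Independent sub query 1
  SELECT dt, COUNT(*) AS visits FROM page_views GROUP BY dt
) v
JOIN (
  -- Independent sub query 2
  SELECT dt, COUNT(*) AS errors FROM error_log GROUP BY dt
) e
ON v.dt = e.dt;
```

Parallel execution only helps when the cluster has spare capacity; on a busy cluster the jobs still compete for the same slots.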

Since a map join is faster than a common join, it is better to use a map join whenever possible. Previously, Hive users needed to give a hint in the query to specify the small table (here x). For example: select /*+ mapjoin(x) */ * from src1 x join src2 y on x.key=y.key; Newer Hive versions can convert a common join to a map join automatically.
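The automatic conversion is controlled by configuration. The sketch below shows the settings usually involved; the threshold value is only an example, defaults differ between Hive versions, and the columns of src1 and src2 (key, value) are assumed for illustration.

```sql
-- Sketch only: let Hive decide when a common join can become a map join.
SET hive.auto.convert.join=true;
-- Size in bytes below which a table is treated as the "small" table
-- (example value; check the default for your version).
SET hive.mapjoin.smalltable.filesize=25000000;

-- No hint needed: if one side is small enough, Hive loads it into memory
-- on each mapper and skips the reduce-side join.
SELECT x.key, x.value, y.value
FROM src1 x
JOIN src2 y ON x.key = y.key;
```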

Some examples

Which query is faster?

Select count(distinct column) from table;

Or

Select count(*) from (select distinct column from table) t;

Answer

The 2nd one is faster.

In the first case:
The maps send each value to the reducer.
A single reducer counts them all (overhead).

In the second case:
The maps split the values across many reducers.
Each reducer generates a list of distinct values.
The final job counts the size of each list.

Note: a singleton reducer is not always good.
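One detail to watch when rewriting a query this way: COUNT(DISTINCT col) ignores NULLs, while a distinct sub query keeps NULL as one of its rows. The sketch below, using the hypothetical names some_table and col, filters NULLs so that both forms return the same number; GROUP BY is used here as an equivalent way to write the distinct step.

```sql
-- Sketch only: some_table and col are hypothetical names.

-- Single-stage form: all distinct values are counted by one reducer.
SELECT COUNT(DISTINCT col) FROM some_table;

-- Two-stage form: many reducers de-duplicate, then a final job counts.
SELECT COUNT(*)
FROM (
  SELECT col
  FROM some_table
  WHERE col IS NOT NULL   -- match COUNT(DISTINCT), which skips NULLs
  GROUP BY col
) d;
```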

Tips
Hive does not know whether a query is bad.

So use EXPLAIN on queries you suspect to be bad, or even on those you do not. EXPLAIN tells you the following:
Number of jobs
Number of map and reduce stages
What each job is sorting by
Which directories it will read

So EXPLAIN helps to see the difference between two or more queries written for the same purpose. The job configuration and history can also be studied to understand query performance.
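For instance, the two count queries from the earlier example can be compared as in the sketch below, where some_table and col are hypothetical names. The output lists the stage dependencies and the plan of each stage, so the number of jobs for each form can be read off directly.

```sql
-- Sketch only: some_table and col are hypothetical names.

-- Plan for the single-stage form.
EXPLAIN
SELECT COUNT(DISTINCT col) FROM some_table;

-- Plan for the two-stage form; expect an additional job.
EXPLAIN
SELECT COUNT(*) FROM (SELECT DISTINCT col FROM some_table) t;

-- EXPLAIN EXTENDED adds more detail, such as the input paths.
EXPLAIN EXTENDED
SELECT COUNT(*) FROM (SELECT DISTINCT col FROM some_table) t;
```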
