Heap Size and Garbage Collection for Hive Components

HiveServer2 and the Hive Metastore need a sufficient amount of memory to run properly, and the default heap size of 256 MB is not suitable for production environments. The values below are Cloudera's recommendations by workload.
For a HiveServer2 JVM heap size of 12 GB or more, Cloudera recommends running HiveServer2 as multiple instances.
Number of Concurrent Connections	HiveServer2 Heap Size Minimum Recommendation	Hive Metastore Heap Size Minimum Recommendation
Up to 40 concurrent connections (*)	12 GB	12 GB
Up to 20 concurrent connections	6 GB	10 GB
Up to 10 concurrent connections	4 GB	8 GB
Single connection	2 GB	4 GB
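Important: These values are general guidelines. Depending on factors such as the number of columns, partitions, and complex joins, you may need to tune them to different values.
In addition, the Beeline CLI should use a heap size of at least 2 GB, and setting permGenSize to 512M is recommended for all components.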
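
The sizing table above can be expressed as a small lookup. This is only a sketch: the function name recommend_heap_gb is hypothetical, and the thresholds simply mirror the table's minimum recommendations, which remain starting points rather than tuned values.

```shell
#!/bin/sh
# Hypothetical helper: print the minimum heap sizes (in GB) from the sizing
# table for a given number of concurrent connections.
# Output format: "<HiveServer2 GB> <Hive Metastore GB>"
recommend_heap_gb() {
  conns=$1
  if [ "$conns" -le 1 ]; then
    echo "2 4"
  elif [ "$conns" -le 10 ]; then
    echo "4 8"
  elif [ "$conns" -le 20 ]; then
    echo "6 10"
  elif [ "$conns" -le 40 ]; then
    echo "12 12"
  else
    # The table stops at 40 connections, so no figure is given here.
    echo "unknown unknown"
    return 1
  fi
}

recommend_heap_gb 15   # prints "6 10"
```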

Configuring Heap Size and Garbage Collection

To configure the heap size for HiveServer2 and the Hive Metastore: set the -Xmx parameter of the HADOOP_OPTS variable in the hive-env.sh Advanced Configuration Snippet in Cloudera Manager (or edit /etc/hive/hive-env.sh directly).

To configure the heap size for the Beeline CLI: set the HADOOP_HEAPSIZE environment variable in the hive-env.sh Advanced Configuration Snippet in Cloudera Manager.

The following example configuration sets:

12 GB of heap for HiveServer2
12 GB of heap for the Hive Metastore
2 GB of heap for Hive clients

if [ "$SERVICE" = "cli" ]; then
  if [ -z "$DEBUG" ]; then
    # Normal run: 12 GB max heap with the ParNew collector enabled.
    export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"
  else
    # DEBUG is set: same heap settings, without the ParNew collector.
    export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
  fi
fi

# Heap size in MB (2 GB).
export HADOOP_HEAPSIZE=2048
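
To confirm which maximum heap an options string would actually request, the -Xmx flag can be pulled out with a few lines of shell. This is a minimal sketch: last_xmx is a hypothetical helper name, the sample options string is made up for illustration, and the remark about repeated flags reflects typical HotSpot behavior rather than anything stated in the text above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the last -Xmx flag in an options
# string. HADOOP_OPTS is built by appending, so the same flag can appear more
# than once; HotSpot generally honors the last occurrence.
last_xmx() {
  last=""
  for opt in $1; do
    case "$opt" in
      -Xmx*) last=${opt#-Xmx} ;;
    esac
  done
  echo "$last"
}

sample_opts="-Xmx1024m -XX:NewRatio=12 -Xmx12288m -Xms10m"
last_xmx "$sample_opts"   # prints "12288m"
```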

Table Partitions

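Tip: For optimal performance, Cloudera recommends keeping the number of partitions per table at or below 2,000 to 3,000.
Performance can degrade when a single Hive query references more than a few thousand partitions. Reading or updating that much partition metadata requires a large number of queries against the Hive Metastore database, and HDFS must move the files that belong to those partitions.
For best performance, partition Hive tables on fewer columns or on coarser time frames (for example, by date rather than by hour). Also, refine your queries so that they use only a subset of a table's partitions.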
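
A quick calculation shows why the time granularity matters. The sketch below assumes three years of retained data (1,095 days), a figure chosen purely for illustration; partition_count is a hypothetical helper.

```shell
#!/bin/sh
# Partitions produced by a time-based scheme:
# days of data multiplied by partitions created per day.
partition_count() {
  days=$1
  per_day=$2
  echo $((days * per_day))
}

partition_count 1095 24   # hourly partitions over 3 years: prints 26280
partition_count 1095 1    # daily partitions over 3 years:  prints 1095
```

Under these assumptions, hourly partitioning lands far above the 2,000 to 3,000 guideline, while daily partitioning stays within it.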

Configuration for WebHCat

If you use WebHCat, set the PYTHON_CMD variable in /etc/default/hive-webhcat-server after installing Hive. For example:

export PYTHON_CMD=/usr/bin/python

Table Lock Manager (Required)

Hive's Table Lock Manager must be configured properly, and it requires a ZooKeeper ensemble.

Configure the lock manager by setting the following properties in /etc/hive/conf/hive-site.xml:

<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>

<property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>zk1.myco.com,zk2.myco.com,zk3.myco.com</value>
</property>

`hive.zookeeper.client.port`

If ZooKeeper does not use the default value for its ClientPort property, set hive.zookeeper.client.port in /etc/hive/conf/hive-site.xml to the value ZooKeeper actually uses. You can find ZooKeeper's ClientPort in /etc/zookeeper/conf/zoo.cfg; the default is 2181. The following example uses a ClientPort of 2222:

<property>
  <name>hive.zookeeper.client.port</name>
  <value>2222</value>
  <description>
  The port at which the clients will connect.
  </description>
</property>
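
One way to check whether the override is needed is to read the port straight from zoo.cfg. This is a sketch under assumptions: zk_client_port is a hypothetical helper, ZOO_CFG is a hypothetical environment variable for pointing at a different file, and the file is assumed to use plain key=value lines.

```shell
#!/bin/sh
# Hypothetical helper: print the clientPort value from a zoo.cfg-style file.
# If the key appears more than once, the last occurrence is printed.
zk_client_port() {
  sed -n 's/^[[:space:]]*clientPort[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1
}

# ZOO_CFG is a hypothetical override; the default path comes from the text above.
cfg=${ZOO_CFG:-/etc/zookeeper/conf/zoo.cfg}
if [ -r "$cfg" ]; then
  port=$(zk_client_port "$cfg")
  if [ -n "$port" ] && [ "$port" != "2181" ]; then
    echo "Set hive.zookeeper.client.port to $port in hive-site.xml"
  fi
fi
```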

JDBC driver

Use the following connection URL formats for HiveServer2 and HiveServer1:

HiveServer version	Connection URL	Driver Class
HiveServer2	jdbc:hive2://<host>:<port>	org.apache.hive.jdbc.HiveDriver
HiveServer1	jdbc:hive://<host>:<port>	org.apache.hadoop.hive.jdbc.HiveDriver
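
As a small illustration of the HiveServer2 URL format in the table, the sketch below assembles a URL from a host and port. The helper name hs2_jdbc_url and the host hs2.example.com are hypothetical, and the fallback port 10000 is HiveServer2's usual default, which may differ on your cluster.

```shell
#!/bin/sh
# Hypothetical helper: build a HiveServer2 JDBC URL (jdbc:hive2://<host>:<port>).
# The port falls back to 10000 when omitted.
hs2_jdbc_url() {
  host=$1
  port=${2:-10000}
  echo "jdbc:hive2://${host}:${port}"
}

url=$(hs2_jdbc_url hs2.example.com)
echo "$url"
# With Beeline installed, the URL can be passed with -u, for example:
#   beeline -u "$url" -n <username>
```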

//--------------------------------------------------------------------------------------------------------------

HiveServer2 High Availability

To run HiveServer2 redundantly across multiple servers, you need a load balancer. For security and reliability, configure the load balancer on a proxy server.

Enabling HiveServer2 High Availability Using Cloudera Manager

Go to the Hive service.
Click the Configuration tab.
Select Scope > HiveServer2.
Select Category > Main.
Locate the HiveServer2 Load Balancer property or search for it by typing its name in the Search box.
Enter the value as hostname:port number.
Note: When the HiveServer2 Load Balancer property is set, Cloudera Manager generates keytabs for the HiveServer2 roles. The principals in these keytabs include the load balancer hostname. If a Hue service uses this Hive service, Hue must also use the load balancer to communicate with Hive.
Click Save Changes to apply the changes.
Restart the Hive service.

Configuring HiveServer2 to Load Balance Behind a Proxy

For clusters with multiple users and availability requirements, you can configure a proxy server to relay requests to and from each HiveServer2 host. Applications connect to a single well-known host and port, and connection requests to the proxy succeed even when hosts running HiveServer2 become unavailable.

Download load-balancing proxy software to install on a single host.
Configure the load balancer, typically by editing its configuration file by hand:
1. Set the port for the load balancer to listen on and relay HiveServer2 requests back and forth.
2. Set the port and hostname for each HiveServer2 host—that is, the hosts from which the load balancer chooses when relaying each query.
Run the load-balancing proxy server and point it at the configuration file.
In the Cloudera Manager console, configure the HiveServer2 Load Balancer property for the proxy server you set up:
1. Enter the value as hostname:port number.
2. Click Save Changes to apply the changes.
3. Restart the Hive service.
  Note: Cloudera Manager automatically generates keytabs for the newly added proxy server.
Update every script and application configuration to point at the new load balancer instead of a specific HiveServer2 node.
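
The text does not name a particular proxy, so the following sketch assumes HAProxy purely as an example. It prints a listen section that covers the two settings described above: the port the balancer listens on and the host and port of each HiveServer2. The function name, section name, and example hosts are hypothetical, and a complete haproxy.cfg would also need its global and defaults sections (timeouts and so on).

```shell
#!/bin/sh
# Hypothetical generator: print an HAProxy "listen" section that relays TCP
# connections on the given port to the listed HiveServer2 host:port backends.
# Usage: render_hs2_proxy_cfg <listen_port> <host:port> [<host:port> ...]
render_hs2_proxy_cfg() {
  listen_port=$1
  shift
  echo "listen hiveserver2"
  echo "    bind :${listen_port}"
  echo "    mode tcp"
  # "balance source" hashes the client address, so a given client keeps
  # reaching the same HiveServer2 while that server stays up.
  echo "    balance source"
  i=1
  for backend in "$@"; do
    echo "    server hs2_${i} ${backend} check"
    i=$((i + 1))
  done
}

render_hs2_proxy_cfg 10001 hs2-a.example.com:10000 hs2-b.example.com:10000
```

The generated text can be reviewed and pasted into the proxy's configuration file before starting the proxy.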

Hive Metastore High Availability

To keep the service available when a Metastore fails, you can configure the Hive Metastore component redundantly. In HA mode, one Metastore is designated the master and the other the slave. If the master Metastore fails, the slave Metastore takes over the master's role.

Prerequisites

Cloudera recommends running each Metastore instance on a separate host.
Hive Metastore HA requires a highly available database, for example MySQL with replication in active-active mode.

Limitations

Sentry HDFS Synchronization is not supported when Hive Metastore HA is configured.

Enabling Hive Metastore High Availability Using Cloudera Manager

Go to the Hive service.
On a secure cluster, enable the Hive token store (on a non-secure cluster, skip to the next step):
1. Click the Configuration tab.
2. Select Scope > Hive Metastore Server.
3. Select Category > Advanced.
4. Use the configuration search to locate the Hive Metastore Delegation Token Store property.
5. Select org.apache.hadoop.hive.thrift.DBTokenStore.
6. Click Save Changes to apply the changes.
Click the Instances tab.
Click Add Role Instances.
Click the text field under Hive Metastore Server.
Check the box for the host on which the new Metastore will run, then click OK.
Click Continue, then click Finish.
Check the box by the new Hive Metastore Server role.
Select Actions for Selected > Start, then click Start.
Click Close, then click the "Stale Configuration" icon that appears.
Click Restart Stale Services, then click Restart Now.
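
Cloudera Manager distributes the client configuration for you, so nothing further is required on a managed cluster. For reference, clients locate Metastore instances through the hive.metastore.uris property, which accepts a comma-separated list of thrift URIs. The sketch below builds such a value; the helper name and hosts are hypothetical, and it assumes every Metastore listens on the same port (9083 is the usual default).

```shell
#!/bin/sh
# Hypothetical helper: join Metastore hosts into a hive.metastore.uris value.
# Usage: metastore_uris <port> <host> [<host> ...]
metastore_uris() {
  port=$1
  shift
  uris=""
  for host in "$@"; do
    # Prefix a comma only when a URI has already been added.
    uris="${uris:+${uris},}thrift://${host}:${port}"
  done
  echo "$uris"
}

metastore_uris 9083 ms1.example.com ms2.example.com
# prints thrift://ms1.example.com:9083,thrift://ms2.example.com:9083
```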

Note: When HiveServer2 is deployed redundantly behind an L4 load balancer on a Kerberos-enabled cluster, running a Hive job from a Hue Workflow fails with the following error:

//---------------------------------------------------------------------------------------

2017-01-13 16:46:36,937 INFO org.apache.hadoop.hive.metastore.ObjectStore: [HiveServer2-Handler-Pool: Thread-72]: Initialized ObjectStore

2017-01-13 16:46:36,945 ERROR org.apache.thrift.transport.TSaslTransport: [HiveServer2-Handler-Pool: Thread-72]: SASL negotiation failure

javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: token expired or does not exist: HIVE_DELEGATION_TOKEN owner=hive, renewer=hive, realUser=, issueDate=1484293582043, maxDate=1484898382043, sequenceNumber=7, masterKeyId=2]

at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:594)

at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)

at org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)

at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283)

.....

Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: token expired or does not exist: HIVE_DELEGATION_TOKEN owner=hive, renewer=hive, realUser=, issueDate=1484293582043, maxDate=1484898382043, sequenceNumber=7, masterKeyId=2

at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager.retrievePassword(TokenStoreDelegationTokenSecretManager.java:114)

.....

2017-01-13 16:46:36,946 ERROR org.apache.thrift.server.TThreadPoolServer: [HiveServer2-Handler-Pool: Thread-72]: Error occurred during processing of message.

java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: DIGEST-MD5: IO error acquiring password

at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)

....

Caused by: org.apache.thrift.transport.TTransportException: DIGEST-MD5: IO error acquiring password

at org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)

at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316)

.....

//---------------------------------------------------------------------------------------

Resolution:

To fix the issue in a load-balanced environment, switch to org.apache.hadoop.hive.thrift.ZooKeeperTokenStore with the following steps.

1) Cloudera Manager -> Hive -> Configuration

2) Filter for "HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml" and add the following property.

<property>
  <name>hive.cluster.delegation.token.store.class</name>
  <value>org.apache.hadoop.hive.thrift.ZooKeeperTokenStore</value>
</property>

3) Restart the HiveServer2 (HS2) server.
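
After the restart, one way to confirm the fix is to check that the InvalidToken failure no longer shows up in the HiveServer2 log. This is a minimal sketch: count_token_errors is a hypothetical helper, HS2_LOG is a hypothetical environment variable, and the log location depends on your installation, so no path is assumed here.

```shell
#!/bin/sh
# Hypothetical check: count InvalidToken failures in a HiveServer2 log file.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the helper
# from aborting scripts that run with "set -e".
count_token_errors() {
  grep -c 'InvalidToken: token expired or does not exist' "$1" || true
}

# HS2_LOG is a hypothetical variable; point it at your HiveServer2 log file.
log=${HS2_LOG:-}
if [ -n "$log" ] && [ -r "$log" ]; then
  echo "InvalidToken failures in $log: $(count_token_errors "$log")"
fi
```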
