
[aws] emr.sh

승가비 2022. 9. 10. 10:38
# emr.sh

#!/bin/sh
# Launch a transient EMR cluster that runs a single Spark step and
# terminates itself when the step finishes (--auto-terminate).

TAG=$1     # value for the Name tag
ENV=$2     # environment, e.g. prd
CLASS=$3   # Spark main class; reused as the cluster and step name
ARGS=$4    # comma-separated arguments forwarded to the Spark application

SRC=s3://src/${ENV}/jar/batch/batch.jar
LOG=s3://log/${ENV}/batch/

SUBNET_ID=subnet-0e3653577617c98a3

# Print the create-cluster response (JSON containing the new ClusterId)
# so the caller can capture it.
aws emr create-cluster \
  --auto-scaling-role EMR_AutoScaling_DefaultRole \
  --instance-groups file://./batch/static/json/instance.json \
  --name "${CLASS}" \
  --release-label emr-6.7.0 \
  --auto-terminate \
  --applications Name=Spark \
  --use-default-roles \
  --ec2-attributes SubnetId=${SUBNET_ID} \
  --tags Env=${ENV} Name=${TAG} \
  --log-uri "${LOG}" \
  --configurations file://./batch/static/json/spark.json \
  --steps "Type=SPARK,Name=${CLASS},ActionOnFailure=TERMINATE_CLUSTER,Args=[--class,${CLASS},${SRC},${ARGS}]"
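For reference, the `--steps` value is a single comma-separated string, and EMR splits the `Args=[...]` part on commas. A quick sketch with made-up values (ENV/CLASS/ARGS here are illustrative, not the real ones) showing the literal string `create-cluster` receives:

```shell
# Assumed example values; in emr.sh these arrive as $1..$4.
ENV=prd
CLASS=com.example.Batch
ARGS="arg1,arg2"
SRC=s3://src/${ENV}/jar/batch/batch.jar

# EMR splits Args=[...] on commas, so each comma in ARGS becomes a
# separate argument to the Spark application.
STEP="Type=SPARK,Name=${CLASS},ActionOnFailure=TERMINATE_CLUSTER,Args=[--class,${CLASS},${SRC},${ARGS}]"
echo "${STEP}"
```

This is why a value that must survive as one argument (such as a SQL query) needs its own quoting, as in the runner script further down.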
# spark.json

[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "JAVA_HOME": "/usr/lib/jvm/java-11-amazon-corretto.x86_64"
        }
      }
    ]
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executorEnv.JAVA_HOME": "/usr/lib/jvm/java-11-amazon-corretto.x86_64",
      "spark.sql.broadcastTimeout": "3600",
      "spark.default.parallelism": "200",
      "spark.yarn.am.memory": "2g",
      "spark.executor.extraJavaOptions": "-XX:+IgnoreUnrecognizedVMOptions",
      "spark.rpc.askTimeout": "600s",
      "spark.sql.shuffle.partitions": "360",
      "spark.sql.cbo.enabled": "true",
      "spark.sql.adaptive.enabled": "true",
      "spark.sql.adaptive.coalescePartitions.enabled": "true",
      "spark.sql.adaptive.advisoryPartitionSizeInBytes": "128m",
      "spark.dynamicAllocation.enabled": "true",
      "spark.dynamicAllocation.initialExecutors": "1",
      "spark.dynamicAllocation.minExecutors": "1",
      "spark.dynamicAllocation.maxExecutors": "300",
      "spark.dynamicAllocation.executorAllocationRatio": "1",
      "spark.sql.catalogImplementation": "hive",
      "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.kryoserializer.buffer.max": "1024m",
      "spark.sql.autoBroadcastJoinThreshold": "60mb"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
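Since `create-cluster` only reads these files at launch via `file://` URIs, a malformed config surfaces late. A small pre-flight check using python3's built-in `json.tool` (assuming python3 is on PATH; the throwaway file below stands in for `batch/static/json/spark.json`):

```shell
# Write a throwaway config to validate; in practice point at the real
# batch/static/json/*.json files before calling create-cluster.
tmp=$(mktemp)
printf '%s' '[{"Classification":"spark","Properties":{"maximizeResourceAllocation":"true"}}]' > "${tmp}"

# json.tool exits non-zero on invalid JSON.
if python3 -m json.tool "${tmp}" > /dev/null 2>&1; then
  result=valid
else
  result=invalid
fi
echo "${result}"
rm -f "${tmp}"
```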
# instance.json

[
  {
    "InstanceCount": 1,
    "Name": "MASTER",
    "InstanceGroupType": "MASTER",
    "InstanceType": "r4.xlarge",
    "BidPrice": "0.1"
  },
  {
    "InstanceCount": 1,
    "Name": "CORE",
    "InstanceGroupType": "CORE",
    "InstanceType": "r4.xlarge",
    "BidPrice": "0.1",
    "AutoScalingPolicy": {
      "Constraints": {
        "MinCapacity": 1,
        "MaxCapacity": 100
      },
      "Rules": [
        {
          "Name": "Default-scale-out",
          "Description": "Replicates the default scale-out rule in the console for YARN memory.",
          "Action": {
            "SimpleScalingPolicyConfiguration": {
              "AdjustmentType": "CHANGE_IN_CAPACITY",
              "ScalingAdjustment": 1,
              "CoolDown": 300
            }
          },
          "Trigger": {
            "CloudWatchAlarmDefinition": {
              "ComparisonOperator": "LESS_THAN",
              "EvaluationPeriods": 1,
              "MetricName": "YARNMemoryAvailablePercentage",
              "Namespace": "AWS/ElasticMapReduce",
              "Period": 300,
              "Threshold": 15,
              "Statistic": "AVERAGE",
              "Unit": "PERCENT",
              "Dimensions": [
                {
                  "Key": "JobFlowId",
                  "Value": "${emr.clusterId}"
                }
              ]
            }
          }
        }
      ]
    }
  }
]
# Dockerfile

FROM python:3.10.6 as app

ENV TZ=Asia/Seoul

RUN pip install awscli

WORKDIR /app
COPY static/ /app/static/

# run.sh

export AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
export AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
export AWS_DEFAULT_REGION="ap-northeast-2"

TAG="test-emr"

ENV=prd
CLASS=test

output=$( \
  ./batch/static/sh/emr.sh \
  "${TAG}" \
  "${ENV}" \
  "${CLASS}" \
  "${TAG},${AWS_ACCESS_KEY_ID},${AWS_SECRET_ACCESS_KEY},${AWS_DEFAULT_REGION},${BUCKET_NAME},${S3_PATH},${MAIL},${TSV_FIELDS},\"${QUERY}\"" \
)
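The comma-joined string on the last line travels to emr.sh as a single positional parameter (`$4`), which is why `QUERY` is wrapped in escaped quotes. A minimal demonstration with made-up values (the `demo` function is a hypothetical stand-in for emr.sh):

```shell
# Hypothetical stand-in for emr.sh: record how many positional
# parameters arrived and what the fourth one contains.
demo() {
  argc=$#
  fourth=$4
}

QUERY="select 1"
# One quoted string -> one argument, even though it contains
# commas and (inside QUERY) spaces.
demo "test-emr" "prd" "test" "test-emr,key,secret,ap-northeast-2,bucket,path,mail,fields,\"${QUERY}\""

echo "argc=${argc}"
echo "fourth=${fourth}"
```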


# Download a static jq binary (the image has no jq) and extract the
# ClusterId from the create-cluster response; -r emits the raw string,
# so no sed pass is needed to strip the surrounding quotes.
curl -L https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64 -o ./jq
chmod a+x ./jq
cluster=$(echo "${output}" | ./jq -r '.ClusterId')

echo ${cluster}
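If downloading jq is undesirable, the ClusterId can also be pulled out with sed. A sketch against a hand-written sample response (the JSON shape mirrors what `aws emr create-cluster` returns; the id and ARN values are fabricated):

```shell
# Fabricated sample of the create-cluster response shape.
response='{ "ClusterId": "j-TESTCLUSTERID", "ClusterArn": "arn:aws:..." }'

# Capture the quoted value that follows the "ClusterId" key.
cluster=$(echo "${response}" | sed -n 's/.*"ClusterId": *"\([^"]*\)".*/\1/p')
echo "${cluster}"
```

From there, `aws emr wait cluster-running --cluster-id "${cluster}"` can block until the cluster is actually up.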

References:

- https://huzz.tistory.com/19 (Launching an EMR Spark cluster with the AWS CLI)
- https://docs.aws.amazon.com/ko_kr/emr/latest/ReleaseGuide/emr-spark-configure.html#spark-change-defaults (Configure Spark, Amazon EMR)
- https://stackoverflow.com/questions/70886684/how-to-use-java-runtime-11-in-emr-cluster-aws (How to use Java runtime 11 in an EMR cluster)
- https://stackoverflow.com/questions/62928662/facing-error-while-trying-to-create-transient-cluster-on-aws-emr-to-run-python-s (Error creating a transient cluster on AWS EMR to run a Python script)
- https://docs.aws.amazon.com/ko_kr/emr/latest/ReleaseGuide/emr-configure-apps.html (Configure applications, Amazon EMR)
- https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html (Configure Spark, Amazon EMR)
- https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-automatic-scaling.html (Using automatic scaling with a custom policy for instance groups, Amazon EMR)
- https://github.com/WorksApplications/ansible_aws_emr/blob/0b58e7223de36ed35b89b4b92c9391431315b27d/emr/examples/roles/emr/init-create-fleet/files/emr_config.json (example EMR config from WorksApplications/ansible_aws_emr)
- https://gist.github.com/tmusabbir/34fdab6bd30fd87bcdd69cf03f54090c (AWS CLI command to create an EMR cluster with a default auto-scaling task group)

 
