Running Spark on EC2 - Spark 1.6.2 Documentation (2024)

The spark-ec2 script, located in Spark’s ec2 directory, allows you to launch, manage and shut down Spark clusters on Amazon EC2. It automatically sets up Spark and HDFS on the cluster for you. This guide describes how to use spark-ec2 to launch clusters, how to run jobs on them, and how to shut them down. It assumes you’ve already signed up for an EC2 account on the Amazon Web Services site.

spark-ec2 is designed to manage multiple named clusters. You can launch a new cluster (telling the script its size and giving it a name), shut down an existing cluster, or log into a cluster. Each cluster is identified by placing its machines into EC2 security groups whose names are derived from the name of the cluster. For example, a cluster named test will contain a master node in a security group called test-master, and a number of slave nodes in a security group called test-slaves. The spark-ec2 script will create these security groups for you based on the cluster name you request. You can also use them to identify machines belonging to each cluster in the Amazon EC2 Console.

  • Create an Amazon EC2 key pair for yourself. This can be done by logging into your Amazon Web Services account through the AWS console, clicking Key Pairs on the left sidebar, and creating and downloading a key. Make sure that you set the permissions for the private key file to 600 (i.e. only you can read and write it) so that ssh will work.
  • Whenever you want to use the spark-ec2 script, set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your Amazon EC2 access key ID and secret access key. These can be obtained from the AWS homepage by clicking Account > Security Credentials > Access Credentials.
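
For example, a minimal sketch of those two preparation steps (the key file name and the credential values are placeholders taken from the launch example further below):

    # restrict the private key so ssh will accept it
    chmod 600 awskey.pem
    # export the credentials for the current shell session
    export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123
    export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU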

You can also run ./spark-ec2 --help to see more usage options. The following options are worth pointing out:

  • --instance-type=<instance-type> can be used to specify an EC2 instance type to use. For now, the script only supports 64-bit instance types, and the default type is m1.large (which has 2 cores and 7.5 GB RAM). Refer to the Amazon pages about EC2 instance types and EC2 pricing for information about other instance types.
  • --region=<ec2-region> specifies an EC2 region in which to launch instances. The default region is us-east-1.
  • --zone=<ec2-zone> can be used to specify an EC2 availability zone to launch instances in. Sometimes, you will get an error because there is not enough capacity in one zone, and you should try to launch in another.
  • --ebs-vol-size=<GB> will attach an EBS volume with a given amount of space to each node so that you can have a persistent HDFS cluster on your nodes across cluster restarts (see below).
  • --spot-price=<price> will launch the worker nodes as Spot Instances, bidding for the given maximum price (in dollars).
  • --spark-version=<version> will pre-load the cluster with the specified version of Spark. The <version> can be a version number (e.g. “0.7.3”) or a specific git hash. By default, a recent version will be used.
  • --spark-git-repo=<repository url> will let you run a custom version of Spark that is built from the given git repository. By default, the Apache Github mirror will be used. When using a custom Spark version, --spark-version must be set to a git commit hash, such as 317e114, instead of a version number.
  • If one of your launches fails due to e.g. not having the right permissions on your private key file, you can run launch with the --resume option to restart the setup process on an existing cluster.
  • Run ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --vpc-id=<vpc-id> --subnet-id=<subnet-id> launch <cluster-name>, where <keypair> is the name of your EC2 key pair (that you gave it when you created it), <key-file> is the private key file for your key pair, <num-slaves> is the number of slave nodes to launch (try 1 at first), <vpc-id> is the name of your VPC, <subnet-id> is the name of your subnet, and <cluster-name> is the name to give to your cluster.

    For example:

    export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU
    export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123
    ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 --zone=us-west-1a --vpc-id=vpc-a28d24c7 --subnet-id=subnet-4eb27b39 --spark-version=1.1.0 launch my-spark-cluster
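
    The options above can be combined in a single launch. As a rough sketch, the following adds persistent EBS storage and spot-priced slaves to the example; the instance type, volume size, spot bid, and slave count are only illustrative values, not recommendations:

    ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 \
      --instance-type=m3.xlarge --ebs-vol-size=100 --spot-price=0.05 \
      --slaves=5 launch my-spark-cluster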

  • Go into the ec2 directory in the release of Spark you downloaded.
  • Run ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name> to SSH into the cluster, where <keypair> and <key-file> are as above. (This is just for convenience; you could also use the EC2 console.)
  • To deploy code or data within your cluster, you can log in and use the provided script ~/spark-ec2/copy-dir, which, given a directory path, RSYNCs it to the same location on all the slaves (see the sketch after this list).
  • If your application needs to access large datasets, the fastest way to do that is to load them from Amazon S3 or an Amazon EBS device into an instance of the Hadoop Distributed File System (HDFS) on your nodes. The spark-ec2 script already sets up an HDFS instance for you. It’s installed in /root/ephemeral-hdfs, and can be accessed using the bin/hadoop script in that directory. Note that the data in this HDFS goes away when you stop and restart a machine.
  • There is also a persistent HDFS instance in /root/persistent-hdfs that will keep data across cluster restarts. Typically each node has relatively little space for persistent data (about 3 GB), but you can use the --ebs-vol-size option to spark-ec2 to attach a persistent EBS volume to each node for storing the persistent HDFS.
  • Finally, if you get errors while running your application, look at the slave’s logs for that application inside of the scheduler work directory (/root/spark/work). You can also view the status of the cluster using the web UI: http://<master-hostname>:8080.
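
A sketch of that workflow, run on the master node; the application directory and dataset names here are placeholders, not paths created by spark-ec2:

    # push application code to the same path on every slave
    ~/spark-ec2/copy-dir /root/my-app
    # load a dataset into the ephemeral HDFS instance
    /root/ephemeral-hdfs/bin/hadoop fs -mkdir /data
    /root/ephemeral-hdfs/bin/hadoop fs -put /root/my-app/input.txt /data/
    # if a job misbehaves, check the scheduler work directory for its logs
    ls /root/spark/work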

You can edit /root/spark/conf/spark-env.sh on each machine to set Spark configuration options, such as JVM options. This file needs to be copied to every machine to reflect the change. The easiest way to do this is to use a script we provide called copy-dir. First edit your spark-env.sh file on the master, then run ~/spark-ec2/copy-dir /root/spark/conf to RSYNC it to all the workers.
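
A minimal sketch of that sequence (the particular variables mentioned are only illustrations of settings spark-env.sh supports):

    # on the master: adjust a setting, e.g. SPARK_WORKER_MEMORY or SPARK_DAEMON_JAVA_OPTS
    vi /root/spark/conf/spark-env.sh
    # push the whole conf directory to every worker
    ~/spark-ec2/copy-dir /root/spark/conf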

The configuration guide describes the available configuration options.

Note that there is no way to recover data on EC2 nodes after shutting them down! Make sure you have copied everything important off the nodes before stopping them.

To permanently terminate a cluster:
  • Go into the ec2 directory in the release of Spark you downloaded.
  • Run ./spark-ec2 destroy <cluster-name>.

The spark-ec2 script also supports pausing a cluster. In this case, the VMs are stopped but not terminated, so they lose all data on ephemeral disks but keep the data in their root partitions and their persistent-hdfs. Stopped machines will not cost you any EC2 cycles, but will continue to cost money for EBS storage.

  • To stop one of your clusters, go into the ec2 directory and run ./spark-ec2 --region=<ec2-region> stop <cluster-name>.
  • To restart it later, run ./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>.
  • To ultimately destroy the cluster and stop consuming EBS space, run ./spark-ec2 --region=<ec2-region> destroy <cluster-name> as described in the previous section.
  • Support for “cluster compute” nodes is limited: there’s no way to specify a locality group. However, you can launch slave nodes in your <clusterName>-slaves group manually and then use spark-ec2 launch --resume to start a cluster with them.

If you have a patch or suggestion for one of these limitations, feel free to contribute it!

Spark’s file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form s3n://<bucket>/path. To provide AWS credentials for S3 access, launch the Spark cluster with the option --copy-aws-credentials. Full instructions on S3 access using the Hadoop input libraries can be found on the Hadoop S3 page.

In addition to using a single input file, you can also use a directory of files as input by simply giving the path to the directory.
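
For example, once a cluster has been launched with --copy-aws-credentials, an application can be pointed at an S3 path or directory directly. A hedged sketch, run on the master node, assuming the credentials were propagated at launch; the bucket name and the job script are placeholders:

    # browse the bucket with the bundled Hadoop client
    /root/ephemeral-hdfs/bin/hadoop fs -ls s3n://my-bucket/input/
    # pass a whole S3 directory to an application as its input path
    /root/spark/bin/spark-submit my_job.py s3n://my-bucket/input/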


FAQs

How do I run Spark on an EC2 instance?

At a high level, the process involves five steps:
  1. Install Apache Spark on your instances.
  2. Configure your master node.
  3. Configure your slave nodes.
  4. Add dependencies to connect Spark and Cassandra.
  5. Launch your master and your slave nodes.

How many cores per executor in Spark?

--executor-cores: with 5 cores per executor and 75 usable cores in the cluster, we can have a maximum of 75/5 = 15 executors, which works out to 3 executors per node. --executor-memory: with roughly 126 GB of usable RAM per node and 3 executors per node, we can allocate about 126/3 = 42 GB of RAM per executor.
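
Expressed as spark-submit flags on a YARN cluster with the node sizes assumed above, that sizing might look like this sketch (the application class and JAR are placeholders):

    ./bin/spark-submit \
      --master yarn \
      --num-executors 15 \
      --executor-cores 5 \
      --executor-memory 42g \
      --class com.example.MyApp my-app.jar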

Can I run Spark without Hadoop?

No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.
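
As a minimal sketch of standalone mode without Hadoop, using the scripts that ship in Spark's sbin directory (the hostname and the shared mount point are assumptions):

    # on the machine chosen as master; its web UI on port 8080 shows the master URL
    ./sbin/start-master.sh
    # on each worker machine, pointing it at the master
    ./sbin/start-slave.sh spark://master-host:7077
    # applications can then read input from a path shared by all nodes, e.g. /mnt/nfs/data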

What happens when you run spark-submit?

It is used to launch applications on a standalone Spark cluster, a Hadoop YARN cluster, or a Mesos cluster. The spark-submit tool takes a JAR file or a Python file as input along with the application's configuration options and submits the application to the cluster.
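
For instance, both of the following are valid invocations; the master URL, class name, and file names are placeholders:

    # submit a packaged Scala/Java application to a standalone cluster
    ./bin/spark-submit --master spark://master-host:7077 --class com.example.MyApp my-app.jar
    # submit a Python application to YARN in cluster deploy mode
    ./bin/spark-submit --master yarn --deploy-mode cluster my_job.py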

How do I run Spark from the command line?

To run a Spark job with the Cloudera Data Engineering (CDE) CLI, run the following command: cde job run --name <job name> [Spark flags...] [--wait] [--variable name=value...]

How do I run Spark on my server?

Installing spark (this answer refers to the spark server profiler for Minecraft, not Apache Spark):
  1. Download spark from CurseForge, or download the Spigot version from SpigotMC.
  2. Navigate to your control panel and stop your server.
  3. Access your server files via FTP; we recommend using FileZilla.
  4. Navigate to the /mods directory.
  5. Upload the spark.jar file.
  6. Start your server.

Why do Spark executors fail?

Spark executors are worker processes that run computations and store data in memory or on disk. When an executor fails, it can cause the entire job to fail or result in degraded performance. This type of incident can occur for a variety of reasons, such as hardware or network issues, memory errors, or software bugs.

What is the difference between cores and executors in Spark?

An executor is a single JVM process launched for a Spark application on a node, while a core is a basic CPU computation unit that determines how many concurrent tasks an executor can run. A node can have multiple executors, and each executor can use multiple cores. Each executor also requires a certain amount of memory to accomplish its tasks.

Do I need Scala to run Spark?

No. Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively, and it is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python.
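
Both shells ship with the Spark distribution, so no separate Scala installation is needed to try them out:

    ./bin/spark-shell    # Scala REPL with a SparkContext pre-created as sc
    ./bin/pyspark        # Python shell with the same pre-created entry point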

Is Spark faster than Hadoop?

Reading from and writing to RAM is far faster than doing the same with an external drive. Moreover, Spark reuses the retrieved data for numerous operations. Therefore, Spark performs better than Hadoop MapReduce, to varying degrees, for both simple and complex data processing.

Can Spark run without Hive?

Yes, you can run Spark SQL queries without installing Hive. (Separately, Hive uses MapReduce as its default execution engine, but it can be configured to use Spark or Tez to execute queries much faster; Hive on Spark still uses the Hive metastore to run Hive queries.)

How do I check whether Spark is running?

You can view all Apache Spark applications from the Spark job definition or notebook item context menu, which shows the Recent runs option.

How do I get the Spark master URL?

When running on Spark Standalone, the Spark Master URL is the same as the host name in the Spark REST URL field plus the Spark standalone master console port field: http://<Spark REST URL>:<Spark standalone master console port>. You can obtain the Spark Master URL from the System Settings page.
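
On a standalone cluster like the one spark-ec2 launches, the master URL itself uses the spark:// scheme on port 7077 by default, and it is also displayed at the top of the master web UI on port 8080. A sketch with a placeholder hostname:

    # connect a shell to the standalone master
    ./bin/spark-shell --master spark://ec2-master-hostname:7077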

How do I run a Spark script?

The most common way to launch Spark applications on a cluster is to use the shell command spark-submit. When using spark-submit, the application does not need to be configured separately for each cluster, because spark-submit works with all of the supported cluster managers through a single interface.

Can Spark run on EC2?

The spark-ec2 script, located in Spark's ec2 directory, allows you to launch, manage and shut down Spark clusters on Amazon EC2. It automatically sets up Spark and HDFS on the cluster for you.

Can Spark run on AWS?

Amazon EMR is the best place to run Apache Spark. You can quickly and easily create managed Spark clusters from the AWS Management Console, AWS CLI, or the Amazon EMR API.

How do I connect PySpark to AWS?

Here are the steps to set up Amazon EMR with Spark:
  1. Sign in to the AWS Management Console.
  2. Choose the EMR service.
  3. Create an EMR cluster with Spark:
     a. Click "Create cluster."
     b. Configure the cluster.
     c. Launch the cluster.
  4. Submit Spark jobs.
  5. Monitor and manage the cluster.
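
If you prefer the AWS CLI to the console, a cluster with Spark installed can be created along these lines; the cluster name, release label, key pair name, instance type, and instance count are assumptions to adapt:

    aws emr create-cluster \
      --name "spark-cluster" \
      --release-label emr-6.15.0 \
      --applications Name=Spark \
      --ec2-attributes KeyName=my-key-pair \
      --instance-type m5.xlarge \
      --instance-count 3 \
      --use-default-roles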
