Running Spark on EC2 - Spark 1.6.2 Documentation (2024)

The spark-ec2 script, located in Spark’s ec2 directory, allows you to launch, manage and shut down Spark clusters on Amazon EC2. It automatically sets up Spark and HDFS on the cluster for you. This guide describes how to use spark-ec2 to launch clusters, how to run jobs on them, and how to shut them down. It assumes you’ve already signed up for an EC2 account on the Amazon Web Services site.

spark-ec2 is designed to manage multiple named clusters. You can launch a new cluster (telling the script its size and giving it a name), shut down an existing cluster, or log into a cluster. Each cluster is identified by placing its machines into EC2 security groups whose names are derived from the name of the cluster. For example, a cluster named test will contain a master node in a security group called test-master, and a number of slave nodes in a security group called test-slaves. The spark-ec2 script will create these security groups for you based on the cluster name you request. You can also use them to identify machines belonging to each cluster in the Amazon EC2 Console.

Create an Amazon EC2 key pair for yourself. This can be done by logging into your Amazon Web Services account through the AWS console, clicking Key Pairs on the left sidebar, and creating and downloading a key. Make sure that you set the permissions for the private key file to 600 (i.e. only you can read and write it) so that ssh will work.
Whenever you want to use the spark-ec2 script, set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your Amazon EC2 access key ID and secret access key. These can be obtained from the AWS homepage by clicking Account > Security Credentials > Access Credentials.
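
The two preparation steps above can be sketched as a pair of small shell helpers. This is only an illustration: the function names secure_key_file and missing_aws_vars are made up for this example and are not part of spark-ec2.

```shell
#!/bin/sh
# Restrict a private key file to owner read/write (mode 600) so that ssh accepts it.
# $1 is the path to the key file downloaded from the AWS console.
secure_key_file() {
  if [ ! -f "$1" ]; then
    echo "key file not found: $1" >&2
    return 1
  fi
  chmod 600 "$1"
}

# Print the name of each credential variable that is still unset or empty.
# Prints nothing when both variables are set.
missing_aws_vars() {
  [ -n "${AWS_ACCESS_KEY_ID:-}" ] || echo "AWS_ACCESS_KEY_ID"
  [ -n "${AWS_SECRET_ACCESS_KEY:-}" ] || echo "AWS_SECRET_ACCESS_KEY"
}
```

For instance, running secure_key_file awskey.pem followed by missing_aws_vars before calling spark-ec2 would show whether either credential variable still needs to be exported.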

Go into the ec2 directory in the release of Spark you downloaded.
Run ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>, where <keypair> is the name of your EC2 key pair (the name you gave it when you created it), <key-file> is the private key file for your key pair, <num-slaves> is the number of slave nodes to launch (try 1 at first), and <cluster-name> is the name to give to your cluster.

For example:

    export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU
    export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123
    ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 --zone=us-west-1a launch my-spark-cluster

After everything launches, check that the cluster scheduler is up and sees all the slaves by going to its web UI, which will be printed at the end of the script (typically http://<master-hostname>:8080).

You can also run ./spark-ec2 --help to see more usage options. The following options are worth pointing out:

--instance-type=<instance-type> can be used to specify an EC2 instance type to use. For now, the script only supports 64-bit instance types, and the default type is m1.large (which has 2 cores and 7.5 GB RAM). Refer to the Amazon pages about EC2 instance types and EC2 pricing for information about other instance types.
--region=<ec2-region> specifies an EC2 region in which to launch instances. The default region is us-east-1.
--zone=<ec2-zone> can be used to specify an EC2 availability zone to launch instances in. Sometimes you will get an error because there is not enough capacity in one zone, and you should try to launch in another.
--ebs-vol-size=<GB> will attach an EBS volume with a given amount of space to each node so that you can have a persistent HDFS cluster on your nodes across cluster restarts (see below).
--spot-price=<price> will launch the worker nodes as Spot Instances, bidding for the given maximum price (in dollars).
--spark-version=<version> will pre-load the cluster with the specified version of Spark. The <version> can be a version number (e.g. “0.7.3”) or a specific git hash. By default, a recent version will be used.
--spark-git-repo=<repository url> will let you run a custom version of Spark that is built from the given git repository. By default, the Apache GitHub mirror will be used. When using a custom Spark version, --spark-version must be set to a git commit hash, such as 317e114, instead of a version number.
If one of your launches fails due to e.g. not having the right permissions on your private key file, you can run launch with the --resume option to restart the setup process on an existing cluster.
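
As a rough sketch of how these options combine with the basic launch command, the helper below only prints the command line so it can be reviewed before being run by hand. The function name launch_cmd is hypothetical. The flags it is shown with are the ones listed above.

```shell
#!/bin/sh
# Print a spark-ec2 launch command instead of running it, so it can be reviewed.
# Usage: launch_cmd <keypair> <key-file> <num-slaves> <cluster-name> [extra options...]
launch_cmd() {
  if [ "$#" -lt 4 ]; then
    echo "usage: launch_cmd <keypair> <key-file> <num-slaves> <cluster-name> [options...]" >&2
    return 1
  fi
  keypair="$1"; key_file="$2"; num_slaves="$3"; cluster="$4"
  shift 4
  # Any remaining arguments are passed through as options, placed before the action.
  extra=""
  if [ "$#" -gt 0 ]; then
    extra=" $*"
  fi
  echo "./spark-ec2 -k $keypair -i $key_file -s $num_slaves$extra launch $cluster"
}
```

For example, launch_cmd awskey awskey.pem 1 test --resume prints the command that would retry setup on an existing cluster named test.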

Run ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --vpc-id=<vpc-id> --subnet-id=<subnet-id> launch <cluster-name>, where <keypair> is the name of your EC2 key pair (the name you gave it when you created it), <key-file> is the private key file for your key pair, <num-slaves> is the number of slave nodes to launch (try 1 at first), <vpc-id> is the name of your VPC, <subnet-id> is the name of your subnet, and <cluster-name> is the name to give to your cluster.

For example:

    export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU
    export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123
    ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 --zone=us-west-1a --vpc-id=vpc-a28d24c7 --subnet-id=subnet-4eb27b39 --spark-version=1.1.0 launch my-spark-cluster

Go into the ec2 directory in the release of Spark you downloaded.
Run ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name> to SSH into the cluster, where <keypair> and <key-file> are as above. (This is just for convenience; you could also use the EC2 console.)
To deploy code or data within your cluster, you can log in and use the provided script ~/spark-ec2/copy-dir, which, given a directory path, RSYNCs it to the same location on all the slaves.
If your application needs to access large datasets, the fastest way to do that is to load them from Amazon S3 or an Amazon EBS device into an instance of the Hadoop Distributed File System (HDFS) on your nodes. The spark-ec2 script already sets up an HDFS instance for you. It’s installed in /root/ephemeral-hdfs, and can be accessed using the bin/hadoop script in that directory. Note that the data in this HDFS goes away when you stop and restart a machine.
There is also a persistent HDFS instance in /root/persistent-hdfs that will keep data across cluster restarts. Typically each node has relatively little space for persistent data (about 3 GB), but you can use the --ebs-vol-size option to spark-ec2 to attach a persistent EBS volume to each node for storing the persistent HDFS.
Finally, if you get errors while running your application, look at the slave’s logs for that application inside of the scheduler work directory (/root/spark/work). You can also view the status of the cluster using the web UI: http://<master-hostname>:8080.
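
Loading a file into the ephemeral HDFS described above could look roughly like the following wrapper around the bin/hadoop script. The function name hdfs_put and the HADOOP_BIN variable are assumptions made for this sketch. HADOOP_BIN exists only so the helper can be tried away from a cluster. The fs -put subcommand is the standard Hadoop file system shell command for copying a local file into HDFS.

```shell
#!/bin/sh
# Copy a local file on the master into the ephemeral HDFS that spark-ec2 sets up.
# Usage: hdfs_put <local-path> <hdfs-path>
# HADOOP_BIN is a hypothetical override of the hadoop script location.
hdfs_put() {
  hadoop_bin="${HADOOP_BIN:-/root/ephemeral-hdfs/bin/hadoop}"
  if [ ! -x "$hadoop_bin" ]; then
    echo "hadoop script not found at $hadoop_bin" >&2
    return 1
  fi
  "$hadoop_bin" fs -put "$1" "$2"
}
```

On the master, hdfs_put data.txt /data.txt would then place the file in the ephemeral HDFS, where it stays only until the machines are stopped.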

You can edit /root/spark/conf/spark-env.sh on each machine to set Spark configuration options, such as JVM options. This file needs to be copied to every machine to reflect the change. The easiest way to do this is to use a script we provide called copy-dir. First edit your spark-env.sh file on the master, then run ~/spark-ec2/copy-dir /root/spark/conf to RSYNC it to all the workers.
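
The edit-then-copy workflow above might be scripted along these lines. The function names add_spark_env and push_spark_conf, and the CONF_DIR and COPY_DIR_SCRIPT variables, are hypothetical. The two variables default to the paths named in the text and exist only so the sketch can be exercised elsewhere.

```shell
#!/bin/sh
# Append one export line to spark-env.sh on the master.
# Usage: add_spark_env <NAME> <value>
# CONF_DIR is a hypothetical override of the configuration directory.
add_spark_env() {
  conf_dir="${CONF_DIR:-/root/spark/conf}"
  printf 'export %s="%s"\n' "$1" "$2" >> "$conf_dir/spark-env.sh"
}

# Copy the configuration directory to every worker using the provided copy-dir script.
# COPY_DIR_SCRIPT is a hypothetical override of the script location.
push_spark_conf() {
  conf_dir="${CONF_DIR:-/root/spark/conf}"
  copy_dir="${COPY_DIR_SCRIPT:-$HOME/spark-ec2/copy-dir}"
  if [ ! -x "$copy_dir" ]; then
    echo "copy-dir not found at $copy_dir" >&2
    return 1
  fi
  "$copy_dir" "$conf_dir"
}
```

Appending keeps the existing contents of spark-env.sh intact. A later export of the same variable takes effect because the file is sourced from top to bottom.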

The configuration guide describes the available configuration options.

Note that there is no way to recover data on EC2 nodes after shutting them down! Make sure you have copied everything important off the nodes before stopping them.

Go into the ec2 directory in the release of Spark you downloaded.
Run ./spark-ec2 destroy <cluster-name>.

The spark-ec2 script also supports pausing a cluster. In this case, the VMs are stopped but not terminated, so they lose all data on ephemeral disks but keep the data in their root partitions and their persistent-hdfs. Stopped machines will not cost you any EC2 cycles, but will continue to cost money for EBS storage.

To stop one of your clusters, go into the ec2 directory and run ./spark-ec2 --region=<ec2-region> stop <cluster-name>.
To restart it later, run ./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>.
To ultimately destroy the cluster and stop consuming EBS space, run ./spark-ec2 --region=<ec2-region> destroy <cluster-name> as described in the previous section.
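
The three lifecycle commands above differ only in the action name and in whether a key file is needed, which the following sketch makes explicit. The function name lifecycle_cmd is hypothetical, and it prints each command rather than running it. The fallback region us-east-1 is the script's documented default.

```shell
#!/bin/sh
# Print the spark-ec2 command for a lifecycle action instead of running it.
# Usage: lifecycle_cmd <stop|start|destroy> <cluster-name> [region] [key-file]
lifecycle_cmd() {
  action="${1:-}"; cluster="${2:-}"; region="${3:-us-east-1}"; key_file="${4:-}"
  case "$action" in
    stop|destroy)
      echo "./spark-ec2 --region=$region $action $cluster"
      ;;
    start)
      # Restarting a stopped cluster needs the private key file.
      if [ -z "$key_file" ]; then
        echo "start requires a key file" >&2
        return 1
      fi
      echo "./spark-ec2 -i $key_file --region=$region start $cluster"
      ;;
    *)
      echo "unknown action: $action" >&2
      return 1
      ;;
  esac
}
```

Printing first is a deliberate choice here, since destroy cannot be undone and it is worth reading the command before running it.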

Support for “cluster compute” nodes is limited: there’s no way to specify a locality group. However, you can launch slave nodes in your <clusterName>-slaves group manually and then use spark-ec2 launch --resume to start a cluster with them.

If you have a patch or suggestion for one of these limitations, feel free to contribute it!

Spark’s file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form s3n://<bucket>/path. To provide AWS credentials for S3 access, launch the Spark cluster with the option --copy-aws-credentials. Full instructions on S3 access using the Hadoop input libraries can be found on the Hadoop S3 page.

In addition to using a single input file, you can also use a directory of files as input by simply giving the path to the directory.
