
I have used Spark in Scala for a long time. Now I am using pyspark for the first time. This is on a Mac.

  1. First I installed pyspark using conda install pyspark, and it installed pyspark 2.2.0
  2. I installed spark itself using brew install apache-spark, and it seems to have installed apache-spark 2.2.0


but when I run pyspark, it dumps out an error that points to a Spark 1.6.2 installation.

Why is it pointing to the 1.6.2 installation, which no longer seems to be there? brew search apache-spark does indicate the presence of both 1.5 and 1.6. Shouldn't pyspark 2.2.0 automatically point to the apache-spark 2.2.0 installation?


3 Answers

There are a number of issues with your question:

To start with, PySpark is not an add-on package, but an essential component of Spark itself; in other words, when installing Spark you also get PySpark by default (you cannot avoid it, even if you wanted to). So, step 2 should be enough (and even before that, PySpark should already be available on your machine, since you have been using Spark already).

Step 1 is unnecessary: PySpark from PyPI (i.e. installed with pip or conda) does not contain the full PySpark functionality; it is only intended for use with a Spark installation in an already existing cluster. From the docs:

The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.

NOTE: If you are using this with a Spark standalone cluster you must ensure that the version (including minor version) matches or you may experience odd errors

Based on the fact that, as you say, you have already been using Spark (via Scala), your issue seems rather to be about upgrading. Now, if you use pre-built Spark distributions, you actually have nothing to install - you just download, unzip, and set the relevant environment variables (SPARK_HOME etc.) - see my answer on 'upgrading' Spark, which is also applicable to first-time 'installations'.


The easiest way to install pyspark right now is to do a pip install of version 2.2 or later.
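
For example (a minimal sketch; this assumes you only need the PyPI client package, as described in the answer above):

    pip install 'pyspark>=2.2'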

If you want to use a distribution instead (and want to use jupyter along with it), another way would be: https://blog.sicara.com/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f


Step 1: If you don't have brew, first install brew using the following command in the terminal:
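
For reference, the installer command published on https://brew.sh (check that page for the current version) is:

    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"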

Step 2: Once you have brew, run the command below to install Java on your Mac:
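
For example (on newer Homebrew versions the spelling is brew install --cask, and the cask name may differ):

    brew cask install java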

Step 3: Once Java is installed, run the command below to install Spark on your Mac:
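
For example:

    brew install apache-spark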

Step 4: Type pyspark --version to verify the installation.



Before doing anything [Requirements]

Step 1: AWS Account Setup. Before installing Spark on your computer, be sure to set up an Amazon Web Services account. If you already have an AWS account, make sure that you can log into the AWS Console with your username and password.

Step 2: Software Installation. Before you dive into these installation instructions, you need to have some software installed. Here's a table of all the software you need to install, plus the online tutorials to do so.

Requirements for Mac

  • Brew: the package installation software for macOS. Very helpful for this installation and in life in general. (Installation guide: brew install)
  • Anaconda: a distribution of Python, with packaged modules and libraries. Note: we recommend installing Anaconda 2 (for Python 2.7). (Installation guide: anaconda install)
  • JDK 8: Java Development Kit, used in both Hadoop and Spark. (Installation guide: just use brew cask install java)

Requirements for Linux

  • Anaconda: a distribution of Python, with packaged modules and libraries. Note: we recommend installing Anaconda 2 (for Python 2.7). (Installation guide: anaconda install)
  • JDK 8: Java Development Kit, used in both Hadoop and Spark. (Installation guide: install for linux)

We are going to install Spark+Hadoop. Use the part that corresponds to your configuration:


  • 1.1. Installing Spark+Hadoop on Mac with no prior installation
  • 1.2. Installing Spark+Hadoop on Linux with no prior installation
  • 1.3. Use Spark+Hadoop from a prior installation

We'll do most of these steps from the command line. So, open a terminal and jump in!

NOTE: If you would prefer to jump right into using Spark, you can use the spark-install.sh script provided in this repo, which will automatically perform the installation and set any necessary environment variables for you. This script will install spark-2.1.0-bin-hadoop2.7.

1.1. Installing Spark+Hadoop on Mac with no prior installation (using brew)

Be sure you have brew updated before starting: use brew update to update brew and brew packages to their latest versions.

1. Use brew install hadoop to install Hadoop (version 2.7.3 as of Jan 2017)

2. Check the hadoop installation directory by using the command:
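
One way to do this (an assumption; the exact command from the original guide was not preserved) is to ask brew where the formula lives:

    brew info hadoop
    # the install path is typically /usr/local/Cellar/hadoop/<version>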

3. Use brew install apache-spark to install Spark (version 2.1.0 as of Jan 2017)

4. Check the installation directory by using the command:
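
Again, one way to do this (an assumption) is:

    brew info apache-spark
    # typically installed under /usr/local/Cellar/apache-spark/<version>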

5. You're done! You can now go to section 2 to set up your environment and run your Spark scripts.

1.2. Installing Spark+Hadoop on Linux with no prior installation

1. Go to the Apache Spark download page. Choose the latest Spark release (2.1.0) and the package type 'Pre-built for Hadoop 2.7 and later'. Click on the link 'Download Spark' to get the tgz package of the latest Spark release. As of Jan 2017 this file was spark-2.1.0-bin-hadoop2.7.tgz, so we will be using that in the rest of these guidelines, but feel free to adapt to your version.

2. Uncompress that file into /usr/local by typing:
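
A plausible sketch (the exact command was not preserved; adjust the filename to the version you downloaded):

    sudo tar -xzf spark-2.1.0-bin-hadoop2.7.tgz -C /usr/local/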


3. Create a shorter symlink of the directory that was just created using:
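
For example (assuming the directory name created in the previous step):

    sudo ln -s /usr/local/spark-2.1.0-bin-hadoop2.7 /usr/local/spark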

4. Go to the Apache Hadoop download page. In the table, click on the latest version below 3 (2.7.3 as of Nov 2016). Click to download the binary tar.gz archive, choose a mirror, and download the file onto your computer.

5. Uncompress that file into /usr/local by typing:
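
Analogous to the Spark step above (adjust the filename to your download):

    sudo tar -xzf hadoop-2.7.3.tar.gz -C /usr/local/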

6. Create a shorter symlink of this directory using:
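
For example:

    sudo ln -s /usr/local/hadoop-2.7.3 /usr/local/hadoop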

1.3. Using a prior installation of Spark+Hadoop

We strongly recommend you update your installation to the most recent version of Spark. As of Jan 2017 we used Spark 2.1.0 and Hadoop 2.7.3.

If you want to use another version, all you have to do is locate your installation directories for Spark and Hadoop, and use them in section 2.1 below when setting up your environment.

2.1. Environment variables

To run Spark scripts you have to properly set up your shell environment: setting environment variables, verifying your AWS credentials, etc.


1. Edit your ~/.bash_profile to add/edit the following lines depending on your configuration. This addition will set the environment variables SPARK_HOME and HADOOP_HOME to point to the directories where Spark and Hadoop are installed.

For a Mac/Brew installation, copy/paste the following lines into your ~/.bash_profile:
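
A sketch of what these lines look like for a Homebrew install (the Cellar paths and version numbers are assumptions; adjust them to what brew info reports on your machine):

    export SPARK_HOME=/usr/local/Cellar/apache-spark/2.1.0/libexec
    export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.3/libexec
    export PATH=$SPARK_HOME/bin:$PATH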

For the Linux installation described above, copy/paste the following lines into your ~/.bash_profile:
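
A sketch using the symlinks created in section 1.2 above:

    export SPARK_HOME=/usr/local/spark
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$SPARK_HOME/bin:$PATH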

For any other installation, find the directories where your Spark and Hadoop installations live, adapt the following lines to your configuration, and put them into your ~/.bash_profile:
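
For example (the paths are placeholders):

    export SPARK_HOME=/path/to/your/spark
    export HADOOP_HOME=/path/to/your/hadoop
    export PATH=$SPARK_HOME/bin:$PATH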

While you're in ~/.bash_profile, be sure to have two environment variables for your AWS keys. We'll use them in the assignments. Be sure you have the following lines set up (with the actual values of your AWS credentials):
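
A sketch (these are the standard AWS variable names; substitute your real credentials):

    export AWS_ACCESS_KEY_ID='your-access-key-id'
    export AWS_SECRET_ACCESS_KEY='your-secret-access-key'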

Note: After any modification to your .bash_profile, for your terminal to take these changes into account, you need to run source ~/.bash_profile from the command line. They will be automatically taken into account next time you open a new terminal.

2.2. Python environment

1. Back to the command line, install py4j using pip install py4j.

2. To check if everything's OK, start an ipython console and type import pyspark. This will do nothing visible in practice, and that's fine: if it did not throw any error, you are good to go.

3.1. How to run Spark/Python from a Jupyter Notebook

Running Spark from a jupyter notebook can require you to launch jupyter with a specific setup so that it connects seamlessly with the Spark Driver. We recommend you create a shell script jupyspark.sh designed specifically for doing that.

1. Create a file called jupyspark.sh somewhere under your $PATH, or in a directory of your liking (I usually use a scripts/ directory under my home directory). In this file, you'll copy/paste the following lines:
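
The original script is not reproduced here, so what follows is a sketch of what such a jupyspark.sh typically contains, reconstructed from the walkthrough below; the notebook options, memory sizes, and package versions are assumptions to adapt to your setup:

    #!/bin/bash
    # Make pyspark launch jupyter notebook as its driver program
    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=True --NotebookApp.ip='localhost' --NotebookApp.port=8888"

    # Run pyspark locally on 4 cores with 1G for the driver and the executor,
    # a local Spark SQL warehouse directory, and packages for S3 access and csv reading
    # (spark-csv is only needed on Spark < 2.0; Spark 2+ reads csv natively)
    ${SPARK_HOME}/bin/pyspark \
        --master local[4] \
        --executor-memory 1G \
        --driver-memory 1G \
        --conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
        --packages com.databricks:spark-csv_2.11:1.5.0,com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.3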

Save the file. Make it executable by doing chmod 711 jupyspark.sh. Now, whenever you want to launch a Spark jupyter notebook, run this script by typing jupyspark.sh in your terminal.

Here's how to read that script. Basically, we are going to use pyspark (an executable from your Spark installation) to run jupyter with a proper Spark context.

The first two commands (the export lines) set up two environment variables that make pyspark execute jupyter.
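
In the sketch above, these are:

    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=True --NotebookApp.ip='localhost' --NotebookApp.port=8888"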

Note: If you are installing Spark on a Virtual Machine and would like to access jupyter from your host browser, you should set the NotebookApp.ip flag to --NotebookApp.ip='0.0.0.0' so that your VM's jupyter server will accept external connections. You can then access jupyter notebook from the host machine on port 8888.


The next line is part of a long multiline command to run pyspark with all the necessary packages and options.
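
In the sketch above, that is the call itself:

    ${SPARK_HOME}/bin/pyspark \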

The next three lines set up the options for pyspark to execute locally, using all 4 cores of your computer, and configure the memory usage for the Spark driver and executor.
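
In the sketch above:

    --master local[4] \
    --executor-memory 1G \
    --driver-memory 1G \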

The next line sets up a directory to store Spark SQL dataframes. This is not strictly necessary, but it works around a common error when loading and processing S3 data into Spark DataFrames.
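
In the sketch above:

    --conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \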

The final --packages option adds specific packages for pyspark to load. These packages are necessary to access AWS S3 repositories from Spark/Python and to read csv files.
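
In the sketch above:

    --packages com.databricks:spark-csv_2.11:1.5.0,com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.3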

Note: You can adapt these parameters to your own liking. See Spark page on Submitting applications to tune these parameters.

2. Now run this script. It will open a notebook home page in your browser. From there, create a new notebook and copy/paste the following commands in your notebook:
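
The original snippet is not reproduced here; a minimal sketch of what it does (the app name is an arbitrary choice) is:

    from pyspark.sql import SparkSession

    # Create a new SparkSession: the notebook's entry point to the Spark driver
    spark = (SparkSession.builder
             .master("local[4]")
             .appName("notebook")
             .getOrCreate())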

What these lines do is connect to the Spark driver by creating a new SparkSession instance.

After this point, you can use spark as a single entry point for reading files and doing Spark operations.

3.2. How to run Spark/Python from the command line via spark-submit

Instead of using jupyter notebook, if you want to run your python script (using Spark) from the command line, you will need to use an executable from the Spark suite called spark-submit. Again, this executable requires some options that we propose to put into a script to use whenever you need to launch a Spark-based python script.

1. Create a script called localsparksubmit.sh and put it somewhere handy. Copy/paste the following content in this file:
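
A sketch of what such a script contains, mirroring the jupyspark.sh options above (the memory sizes and package versions remain assumptions):

    #!/bin/bash
    # Submit the python script passed as argument with a local Spark configuration
    ${SPARK_HOME}/bin/spark-submit \
        --master local[4] \
        --executor-memory 1G \
        --driver-memory 1G \
        --conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
        --packages com.databricks:spark-csv_2.11:1.5.0,com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.3 \
        "$@"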

See section 3.1 above for an explanation of these values. The final line here, "$@", means that whatever you gave as an argument to this localsparksubmit.sh script will be passed as the last argument of this command.

2. Whenever you want to run your script (called, for instance, script.py), you would do it by typing localsparksubmit.sh script.py from the command line. Make sure you put localsparksubmit.sh somewhere under your $PATH, or in a directory of your liking.

Note: You can adapt these parameters to your own setup. See Spark page on Submitting applications to tune these parameters.

1. Open a new jupyter notebook (from the jupyspark.sh script provided above) and paste the following code:
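
The original test snippet was not preserved, so here is a hypothetical minimal check along the same lines; it just creates a SparkSession, builds a tiny DataFrame, and displays it:

    from pyspark.sql import SparkSession

    # Create (or reuse) a local SparkSession
    spark = (SparkSession.builder
             .master("local[4]")
             .appName("testspark")
             .getOrCreate())

    # Build a small DataFrame and display it
    df = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'letter'])
    df.show()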


It should output the following result:
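
For the hypothetical snippet above, the output would look something like:

    +---+------+
    | id|letter|
    +---+------+
    |  1|     a|
    |  2|     b|
    |  3|     c|
    +---+------+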


2. Create a python script called testspark.py and paste the same lines above in it. Run this script from the command line using localsparksubmit.sh testspark.py. It should output the same result as above.