Saturday, February 6, 2016

Getting Started with Spark on Windows 7 (64 bit)


Let's get started with Apache Spark 1.6 on Windows 7 (64-bit). [ Steps for Mac, Ubuntu and other operating systems are similar, except the winutils step, which applies only to Windows. ]

-  Download and install Java  (needs Java 1.7 or 1.8; skip if already installed)
-  Download and install Anaconda Python 3.5+  (install to C:\Anaconda3 or any folder)
-  Download Spark  (download 7-Zip to extract the .gz file) : Extract to C:\BigData\Spark, making sure all 15 folders land directly under C:\BigData\Spark and not under a long version-numbered folder
-  Download winutils.exe  (put it in C:\BigData\Hadoop\bin)  -- this is the 64-bit build
-  Download the sample data  (extract to C:\BigData\Data)

1. Create the environment variables:
    SPARK_HOME : C:\BigData\Spark
    HADOOP_HOME : C:\BigData\Hadoop
    JAVA_HOME : make sure the JAVA_HOME environment variable is defined and points to the Java home directory.

2. Append the following to the end of the PATH environment variable:
   ;%SPARK_HOME%\bin;%HADOOP_HOME%\bin;%JAVA_HOME%\bin;
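
Before launching Spark, you can optionally double-check these variables from any Python prompt (for example the Anaconda prompt); a minimal sketch:

# Optional sanity check: print the variables Spark relies on
import os
for var in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
    print(var, "=", os.environ.get(var, "<NOT SET>"))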

3. Create the folder:
    c:\tmp\hive
 
4. On a command prompt in Administrator mode, run (one time only):
winutils.exe chmod -R 777 \tmp\hive

5. On a command prompt in Administrator mode, start Spark using:
pyspark --packages com.databricks:spark-csv_2.11:1.3.0

[ If everything is set up correctly, you should see the Welcome to Spark version 1.6, using Python 3.5.1 message ]
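
Before moving on, a couple of quick sanity checks inside the shell can confirm the setup (sc and sqlContext are created automatically when pyspark starts):

print(sc.version)        # should print 1.6.x
print(type(sqlContext))  # HiveContext if the winutils/Hive setup is correct, plain SQLContext otherwise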

6. Type the following commands in the PySpark shell:
sc.setLogLevel("WARN")

movies=sqlContext.read.format("com.databricks.spark.csv").options(delimiter="|").options(header="false").load("file:///BigData/Data/ml-100k/u.item")

movies.registerTempTable("movies")

movies.cache()

movies.show()

ratings=sqlContext.read.format("com.databricks.spark.csv").options(delimiter="\t").options(header="false").load("file:///BigData/Data/ml-100k/u.data")

ratings.registerTempTable("ratings")

ratings.cache()

ratings.show()
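
A note on column names: with header="false", spark-csv assigns default names C0, C1, C2, ... and the queries below rely on those defaults. If you prefer meaningful names you can optionally rename the columns; the names used here follow the MovieLens u.data layout and this step is purely illustrative, so skip it if you want the queries below to work exactly as written:

# Optional: give the ratings columns friendlier names (queries below still use C0..C3)
named = ratings.toDF("userId", "movieId", "rating", "timestamp")
named.show(5)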


# To view the ratings distribution:
ratingsDistribution=sqlContext.sql("Select c2 as ratings, count(*) as cnt from ratings group by c2")

ratingsDistribution.show()

# Most-watched movies:
TopMovies=sqlContext.sql("Select c1 as movieId, Count(*) as Cnt from ratings group by c1 Order by Cnt Desc")

TopMovies.show()
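
The same result can also be produced with the DataFrame API instead of SQL; a rough equivalent (C1 is the movieId column under spark-csv's default names):

ratings.groupBy("C1").count().orderBy("count", ascending=False).show()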



# Most-watched movies by name:
TopMovieNames=sqlContext.sql("Select movies.c1 as MovieName, count(*) as Cnt from ratings, movies where ratings.c1=movies.c0 group by movies.c1 Order by Cnt desc")

TopMovieNames.show()
---------------------------------------------------------------
Some more troubleshooting commands/tips :-
---------------------------------------------------------------

sqlContext._get_hive_ctx()      - If this runs cleanly with no errors, then the winutils.exe version, HADOOP_HOME path, etc. are correct.

pyspark --name myFirstApp   - sets the application name (it appears under Jobs in the Spark UI, which makes runs easy to differentiate); note that sc.appName is read-only once the shell is running, so the name has to be given at launch time.

winutils.exe ls \tmp\hive  : Running this on a Windows command prompt displays the access level of the \tmp\hive folder. If you see any errors, you probably don't have proper access to the C:\ drive (common on work laptops where C:\ is restricted). In that case, try running winutils and the pyspark command from a D:\ prompt. If the \tmp\hive permissions are not set properly, you may receive an error like this: ( java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- )

Typing sc. and hitting the Tab key lists all available methods and options. The same works for sqlContext.

ratings.unpersist()   - removes the DataFrame from the cache ( ratings.cache() puts it in memory )
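
To check whether a DataFrame is currently cached, you can look at its is_cached attribute; a small sketch:

ratings.cache()
print(ratings.is_cached)    # True
ratings.unpersist()
print(ratings.is_cached)    # False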

Create C:\BigData\Spark\bin\myspark.bat containing pyspark --packages com.databricks:spark-csv_2.11:1.3.0. Next time, just type myspark on the command line to open PySpark with the CSV package.

Download the larger movie/ratings data sets to slice and dice the data in different ways and evaluate the performance (memory/CPU) implications of cached vs. uncached data. Explore other related data sets at this Link
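
For example, a minimal sketch for timing the same action with and without caching (timings vary by machine; this only illustrates the approach):

import time

ratings.unpersist()
start = time.time()
ratings.count()
print("Uncached: %.2f sec" % (time.time() - start))

ratings.cache()
ratings.count()                     # first action materializes the cache
start = time.time()
ratings.count()
print("Cached:   %.2f sec" % (time.time() - start))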




18 comments:

Khaja Asmath said...

Hi Sangeet, I have followed your steps and I always get the error:
java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-

How did you resolve this on Windows when trying to open the Spark shell?

Sangeet Chourey said...

Hi Khaja,

Firstly make sure that you are running commands using "Run As Administrator"

Next, try the following command to see the actual error:

winutils.exe ls \tmp\hive

This may indicate the actual issue. There could be various reasons; for example, if it reports a domain-trust-related error, you may be on a work laptop. If that's the case, try logging in to the VPN and connecting to the domain server.

Alternatively, you can also try running the pyspark command from a D:\ or E:\ drive. If you don't have a D:\ or E:\ drive, you can put everything on a USB drive and run pyspark from there.

Basically, it is a Windows security issue, especially on work laptops where the C:\ drive is more restricted.

Hope that helps...

Thanks
Sangeet

Anonymous said...

Sangeet, I'm running Spark 1.6.0 pre-built for Hadoop 2.6 and later, standalone (without Hadoop), on Windows 7 64-bit and get the following error. I have admin access but still no luck. Are you able to help? Thanks.

16/02/24 12:06:02 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/IT_CodeRepo/BigData/spark/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/IT_CodeRepo/BigData/spark/bin/../lib/datanucleus-rdbms-3.2.9.jar."
16/02/24 12:06:02 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/IT_CodeRepo/BigData/spark/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/IT_CodeRepo/BigData/spark/bin/../lib/datanucleus-api-jdo-3.2.6.jar."
16/02/24 12:06:02 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/IT_CodeRepo/BigData/spark/bin/../lib/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/IT_CodeRepo/BigData/spark/lib/datanucleus-core-3.2.10.jar."
16/02/24 12:06:02 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/02/24 12:06:02 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/02/24 12:06:06 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/02/24 12:06:06 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/02/24 12:06:07 WARN : Your hostname, TORL2413 resolves to a loopback/non-reachable address: fe80:0:0:0:45b5:a994:a3fa:1fc6%eth9, but we couldn't find any external IP address!
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
...
...
:16: error: not found: value sqlContext
import sqlContext.implicits._

Sangeet Chourey said...

Hi,

You can try following :-

1. Make sure that HADOOP_HOME, SPARK_HOME, JAVA_HOME are set properly.
2. Ensure Java version 1.7+ is installed and on the PATH. Type java -version on the command prompt and make sure it displays the correct version.
3. Ensure that winutils.exe ls \tmp\hive command doesn't throw any error.
4. On the command prompt, set the environment variables manually (overriding the system ones) and then run the pyspark command:

SET HADOOP_HOME=C:\{HADOOP HOME Directory}

SET JAVA_HOME=C:\{Java Home directory}

SET SPARK_HOME={Spark Home directory}

PATH=%SPARK_HOME%\bin;%HADOOP_HOME%\bin;%JAVA_HOME%\bin;

Hope this helps

Thanks
Sangeet

rajeshd said...

Hi Sangeet,

While running JUnit test cases related to the Spark APIs, I am getting the error below:

java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-

Do I have to change any settings in Eclipse?

There is no Hadoop or Spark installed on Windows 7; the Eclipse workspace has the Spark libraries.

Did I miss anything setup-related?

Sangeet Chourey said...

Hi Rajesh,

1. Please make sure you have winutils.exe in the PATH and under HADOOP_HOME.
2. Secondly, run winutils.exe ls \tmp\hive on the command prompt to check whether you see any error.
3. Please refer to my comment on this same issue in response to comment posted in this blog.

Thanks
Sangeet

Unknown said...

Hi Sangeet,

When I run the movies command, I get the error:
scala> movies=sqlContext.read.format("com.databricks.spark.csv").options(delimiter="|").options(header="false").load("file:///BigData/Data/ml-100k/u.item")
:22: error: not found: value movies
val $ires9 = movies
^
:19: error: not found: value movies
movies=sqlContext.read.format("com.databricks.spark.csv").options(delimiter="|").options(header="false").load("file:///BigData/Data/ml-100k/u.item")
^

scala> val movies=sqlContext.read.format("com.databricks.spark.csv").options(delimiter="|").options(header="false").load("file:///BigData/Data/ml-100k/u.item")

Then when I assign movies to a val, it says sqlContext is not found:

:19: error: not found: value sqlContext
val movies=sqlContext.read.format("com.databricks.spark.csv").options(delimiter="|").options(header="false").load("file:///BigData/Data/ml-100k/u.item")

Sangeet Chourey said...

Hi,

The code I posted was meant to be run in the PySpark (Python) shell. I have updated the blog with Scala code as well. Please follow the instructions in the blog post and let me know if you face any issues.

Thanks
Sangeet

Unknown said...

Sangeet - Awesome buddy...it worked (winutils.exe ls \tmp\hive )

Anonymous said...

Hi
This is a very nice post. Thanks for this. I am getting this error while running spark-shell. I am running Windows 10 with Java 1.8.

'cmd' is not recognized as an internal or external command, operable program or batch file.

Unknown said...

hi,
it works for me, but the chmod command line is
winutils.exe chmod -R 777 /tmp/hive, not \tmp\hive

Unknown said...

Hi Sangeet,
thanks for your tip; it works for me, but with one difference.
In my case the chmod command line is winutils.exe chmod -R 777 /tmp/hive, not \tmp\hive.

Anonymous said...

Hey Thank You So much , This worked perfect :-)

Unknown said...

Very helpful, Thank You So much! Especially for winutils:)

Unknown said...

Very helpful, Thanks You a lot! Especially for the winutils:)

Balázs said...

Thanks a lot, it was really helpful

Balázs said...

Thanks a lot, it was really helpful

Unknown said...

Sangeet- You are a genius