Hadoop Setup Instructions on Windows OS
What is Big Data?
Big Data is a platform for storing and processing data sets too large to be handled with traditional computing techniques. Big Data is not merely data; rather, it is a distributed-computing technology involving various tools, techniques, and frameworks.
Scope of this Tutorial
This is not a Big Data tutorial as such; rather, it walks you through the first step of setting up an environment for working on Big Data prototypes from your local Windows machine. Windows 7 was used for this tutorial.
Minimum requirements: CPU: 2+ cores, RAM: 8 GB+, Disk: 5 GB+
If you have more than 4 CPU cores and more than 12 GB of RAM, we recommend following http://microdebug.com/2017/05/20/cloudera-vm-setup/ to set up a complete Big Data environment on your machine. The tutorial on this page covers only a minimal Hadoop environment on Windows.
Packages covered: Hadoop, Spark, Scala, Hive, Scala IDE
Extract and Install Packages
Below are all the packages required for the Hadoop setup on Windows.
Java JDK Setup
Install the Java JDK using the installer downloaded from the Oracle page. Install the JDK in the default path; we will then move it directly under C:\ to make Hadoop work.
After installation, Java will be under the path “C:\Program Files\Java”.
Move “C:\Program Files\Java” to “C:\Java” to avoid a space in the path.
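The move and a quick sanity check can also be done from an elevated command prompt. This is a sketch; the JDK folder name (jdk1.8.0_131 here) depends on the exact version you installed:

```shell
:: Run from an elevated (Administrator) command prompt.
:: jdk1.8.0_131 is an example; match the version you installed.
move "C:\Program Files\Java" "C:\Java"

:: Confirm the JDK is reachable at the new location.
"C:\Java\jdk1.8.0_131\bin\java" -version
```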
Set Environment variables
Open the Environment Variables dialog in either of two ways: go to My Computer (Start -> Computer), right-click and select Properties -> Advanced System Settings -> Advanced tab -> Environment Variables; or open the Run window (shortcut: Windows + R), type “sysdm.cpl” and press Enter to open the System Properties window, then select the Advanced tab -> Environment Variables. Under System Variables, click New and enter the values below:
Variable Name: JAVA_HOME Variable Value: C:\Java\jdk1.8.0_131
Variable Name: Path Variable Value: ;%JAVA_HOME%\bin;
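As an alternative to clicking through the dialog, the same variables can be set from an elevated command prompt with setx. A sketch; note that setx changes take effect only in newly opened command prompt windows:

```shell
:: /M writes to the system (machine) environment; requires an elevated prompt.
setx /M JAVA_HOME "C:\Java\jdk1.8.0_131"

:: Caution: %Path% expands the current process's combined user+system Path,
:: and setx truncates values longer than 1024 characters.
setx /M Path "%Path%;C:\Java\jdk1.8.0_131\bin"
```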
Extract the downloaded Hadoop tar archive “hadoop-2.8.0.tar”: right-click the file -> Extract Here. We used WinRAR to extract the tar to a folder.
Download winutils.exe from GitHub (https://github.com/steveloughran/winutils/raw/master/hadoop-2.6.0/bin/winutils.exe), which is required to run Hadoop on a Windows machine.
Then move this winutils.exe file to the folder below: C:\Users\tutor\InstalledApps\hadoop-2.8.0\bin\
Set the environment variable for Hadoop as we did for the Java setup above:
Variable Name: HADOOP_HOME Variable Value: C:\Users\tutor\InstalledApps\hadoop-2.8.0
We have now set the Hadoop home environment variable; next, add it to the Path variable so that the hadoop commands are picked up.
Select the “Path” variable and edit it, appending the value below:
Variable Name: Path Variable Value: ;%HADOOP_HOME%\bin;
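Before running any Hadoop commands, it is worth confirming the variables took effect. A quick check from a newly opened command prompt (values shown assume the paths used above):

```shell
:: Both lines should print real paths, not the literal %...% text.
echo %JAVA_HOME%
echo %HADOOP_HOME%

:: winutils.exe must be present for Hadoop to work on Windows.
dir "%HADOOP_HOME%\bin\winutils.exe"

:: Should print the Hadoop version details without errors.
hadoop version
```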
Now open a command prompt and type the command below; it should list HDFS directories without printing any errors:
hadoop fs -ls
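Beyond listing directories, a short smoke test can confirm that the filesystem commands work end to end. The file and directory names below are illustrative; with no cluster configured, hadoop fs operates on the local filesystem by default (fs.defaultFS is file:///):

```shell
:: Illustrative smoke test; names are arbitrary.
echo hello > smoke.txt
hadoop fs -mkdir -p /tmp/hadoop-smoke
hadoop fs -put smoke.txt /tmp/hadoop-smoke/
hadoop fs -ls /tmp/hadoop-smoke
hadoop fs -cat /tmp/hadoop-smoke/smoke.txt
```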
Scala ships as a simple installer, “scala-2.12.2.exe”, which can be installed by clicking through the Next buttons.
The Hadoop environment is now set up; the last step is to set up an IDE (Integrated Development Environment) for Scala.
Extract the Eclipse package “scala-SDK-4.6.1-vfinal-2.12-win32.win32.x86_64”, then go to the eclipse folder and click the Eclipse application icon.
Extract the Hive package “apache-hive-2.1.1-bin.tar” as we did for the Hadoop package, then set the HIVE_HOME and Path environment variables:
Variable Name: HIVE_HOME Variable Value: C:\Users\tutor\InstalledApps\apache-hive-2.1.1-bin
Variable Name: Path Variable Value: ;%HIVE_HOME%\bin;
Edit the beeline script to add the library dependencies.
Go to the bottom of the script and update the existing call line so it looks like the line below, appending the lib path “C:\Users\tutor\InstalledApps\apache-hive-2.1.1-bin\lib\*”:
call %JAVA_HOME%\bin\java %JAVA_HEAP_MAX% %HADOOP_OPTS% -classpath %CLASSPATH%;C:\Users\tutor\InstalledApps\apache-hive-2.1.1-bin\lib\* org.apache.hive.beeline.BeeLine %*
After this, open a new command prompt window and enter the command below.
Extract the Spark package “spark-2.1.1-bin-hadoop2.7” as we did for the Hadoop package, then set the SPARK_HOME and Path environment variables:
Variable Name: SPARK_HOME Variable Value: C:\Users\tutor\InstalledApps\spark-2.1.1-bin-hadoop2.7
Variable Name: Path Variable Value: ;%SPARK_HOME%\bin;
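Before launching spark-shell, you can sanity-check in a new command prompt window that all the home variables are set and Spark resolves from the Path:

```shell
:: Each echo should print a real path, not the literal %...% text.
echo %JAVA_HOME%
echo %HADOOP_HOME%
echo %HIVE_HOME%
echo %SPARK_HOME%

:: Prints the Spark version banner if SPARK_HOME\bin is on the Path.
spark-submit --version
```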
Open a command prompt window and enter the command below to check whether Scala, Spark, and Hive connect to Hadoop.
If you encounter errors, give read and write permissions to the Hive scratch directory like below:
winutils.exe chmod 777 \tmp\hive
“\tmp\hive” is a Unix-style path, but it works on Windows; you can find the directory under C:\tmp.
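If the directory does not exist yet, create it first; winutils can also list the permissions to verify the change. A sketch, assuming HADOOP_HOME is set as above:

```shell
:: Create the Hive scratch directory if it is missing, then open it up.
mkdir C:\tmp\hive
"%HADOOP_HOME%\bin\winutils.exe" chmod 777 \tmp\hive

:: Verify: the listing should now show drwxrwxrwx for \tmp\hive.
"%HADOOP_HOME%\bin\winutils.exe" ls \tmp\hive
```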
C:\Users\tutor>spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/06/10 23:36:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/10 23:36:55 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Users/tutor/InstalledApps/spark-2.1.1-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Users/tutor/InstalledApps/spark-2.1.1-bin-hadoop2.7/bin/../jars/datanucleus-core-3.2.10.jar."
17/06/10 23:36:55 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Users/tutor/InstalledApps/spark-2.1.1-bin-hadoop2.7/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Users/tutor/InstalledApps/spark-2.1.1-bin-hadoop2.7/bin/../jars/datanucleus-rdbms-3.2.9.jar."
17/06/10 23:36:55 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Users/tutor/InstalledApps/spark-2.1.1-bin-hadoop2.7/bin/../jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Users/tutor/InstalledApps/spark-2.1.1-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar."
17/06/10 23:36:59 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = local-1497163014195).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala>