Hadoop on Windows OS

Hadoop Setup Instructions on Windows OS

What is BigData?

BigData refers to storing and processing very large volumes of data that cannot be handled using traditional computing techniques. Big data is not merely the data itself; rather, it is a distributed computing ecosystem involving various tools, techniques, and frameworks.

Scope of this Tutorial

We are not covering a full BigData tutorial here; rather, we are walking you through the first step of setting up an environment for working on BigData prototypes from your local Windows machine. For this tutorial we used Windows 7.

System Specification

CPU Cores: 2+
RAM: 8 GB+
Disk: 5 GB+

If you have more than 4 CPU cores and more than 12 GB of RAM, we recommend following http://microdebug.com/2017/05/20/cloudera-vm-setup/ to set up a complete BigData environment on your machine. The tutorial on this page only covers a minimal Hadoop environment on Windows OS.

Tools Covered
Spark
Scala
Hive
Hadoop
Scala-IDE

Download Packages

Hadoop: http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz

Scala: https://downloads.lightbend.com/scala/2.12.2/scala-2.12.2.msi

Hive: http://mirror.fibergrid.in/apache/hive/hive-2.1.1/

Spark: https://d3kbcqa49mib13.cloudfront.net/spark-2.1.1-bin-hadoop2.7.tgz

Scala-IDE: http://downloads.typesafe.com/scalaide-pack/4.6.1-vfinal-neon-212-20170609/scala-SDK-4.6.1-vfinal-2.12-win32.win32.x86_64.zip

Java SDK: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

 

Extract and Install Packages

At this point, all of the packages required for the Hadoop setup on Windows should be downloaded.

Java JDK Setup

Install the Java JDK using the installer downloaded from the Oracle page. Install the JDK in the default path; we will then move it directly under C:\ to make Hadoop work, because Hadoop's scripts do not handle spaces in paths well.

After installation, Java will be under the path “C:\Program Files\Java”.

Move “C:\Program Files\Java” to “C:\Java” to avoid the space in the path.
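
If you prefer the command line, this move can be done from an administrator command prompt. A minimal sketch, assuming the JDK was installed to the default location:

rem Run from an elevated prompt; moves the JDK out of "Program Files"
move "C:\Program Files\Java" C:\Java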

Set Environment variables

Go to My Computer (Start -> Computer), right-click and select Properties -> Advanced System Settings -> Advanced tab -> Environment Variables -> System Variables -> New, and enter the values below.

(Or)

Press the shortcut “Windows + R” to open the Run dialog, type “sysdm.cpl” and press Enter; this opens the System Properties window. Then select the Advanced tab -> Environment Variables -> System Variables -> New and enter the values below.

Variable Name: JAVA_HOME
Variable Value: C:\Java\jdk1.8.0_131

Update PATH

Append the value below to the existing Path variable (do not replace its current contents):

Variable Name: Path
Variable Value: ;%JAVA_HOME%\bin;
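
To verify, open a new command prompt (so the updated variables are picked up) and run the commands below; the first should print the JDK path and the second the Java version:

echo %JAVA_HOME%
java -version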


Hadoop Setup

Extract the downloaded Hadoop tar archive “hadoop-2.8.0.tar.gz”: right-click the file -> Extract Here. We used WinRAR to extract the archive to a folder.

Download winutils.exe from GitHub (https://github.com/steveloughran/winutils/raw/master/hadoop-2.6.0/bin/winutils.exe); it is required to run Hadoop on a Windows machine.

Then move this winutils.exe file to the folder below:

C:\Users\tutor\InstalledApps\hadoop-2.8.0\bin\

Set the environment variable for Hadoop as we did for the Java setup above:

Variable Name: HADOOP_HOME
Variable Value: C:\Users\tutor\InstalledApps\hadoop-2.8.0

 

Now that the HADOOP_HOME environment variable is set, we need to add it to the Path variable so that Hadoop commands are picked up.

Select the “Path” variable, click Edit, and append the value below:

Update PATH

Variable Name: Path
Variable Value: ;%HADOOP_HOME%\bin;

 

Now open a command prompt and type the command below; it should list HDFS directories without printing any errors:

hadoop fs -ls
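
You can also confirm the Hadoop installation itself. A quick check, assuming a freshly opened prompt has picked up the PATH change:

hadoop version
echo %HADOOP_HOME%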

Scala Setup

This is a simple installer (“scala-2.12.2.msi”) which can be installed by simply clicking through the Next buttons.
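
To confirm Scala is available, open a new command prompt and check the version (this assumes the installer added Scala to the PATH):

scala -version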

Scala-IDE

Now we are done with the Hadoop environment setup, and the next step is to set up an IDE (Integrated Development Environment) for Scala.

Extract the Eclipse package “scala-SDK-4.6.1-vfinal-2.12-win32.win32.x86_64”, then go to the eclipse folder and double-click the eclipse application icon.

Hive Setup

Extract the Hive package “apache-hive-2.1.1-bin.tar” as we did for the Hadoop package, and then set the HIVE_HOME and Path environment variables.

Variable Name: HIVE_HOME
Variable Value: C:\Users\tutor\InstalledApps\apache-hive-2.1.1-bin

Update PATH

Variable Name: Path
Variable Value: ;%HIVE_HOME%\bin;

Edit the beeline script below to add the Hive library dependencies:

C:\Users\tutor\InstalledApps\apache-hive-2.1.1-bin\bin\beeline.cmd

Go to the bottom of the script and update the existing call line so that it looks like the one below, appending the lib path “C:\Users\tutor\InstalledApps\apache-hive-2.1.1-bin\lib\*”:

call %JAVA_HOME%\bin\java %JAVA_HEAP_MAX% %HADOOP_OPTS% -classpath %CLASSPATH%;C:\Users\tutor\InstalledApps\apache-hive-2.1.1-bin\lib\* org.apache.hive.beeline.BeeLine %*

After this, open a new command prompt window and enter the command below:

beeline
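
To actually connect beeline to a HiveServer2 instance, you can pass it a JDBC URL. A hypothetical example, assuming HiveServer2 is running locally on its default port 10000:

beeline -u jdbc:hive2://localhost:10000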

Spark Setup

Extract the Spark package “spark-2.1.1-bin-hadoop2.7” as we did for the Hadoop package, and then set the SPARK_HOME and Path environment variables.

Variable Name: SPARK_HOME
Variable Value: C:\Users\tutor\InstalledApps\spark-2.1.1-bin-hadoop2.7

Update PATH

Variable Name: Path
Variable Value: ;%SPARK_HOME%\bin;

Open a command prompt window and enter the command below to check whether Scala, Spark, and Hive are able to connect with Hadoop:

spark-shell

If you encounter errors, grant read and write permissions on the Hive scratch directory as shown below:

winutils.exe chmod 777 \tmp\hive

\tmp\hive is a Unix-style path, but it works on Windows; you can find the actual directory under C:\tmp.
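
You can confirm the new permissions with winutils itself (assuming %HADOOP_HOME%\bin is on the PATH):

winutils.exe ls \tmp\hive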

 

C:\Users\tutor>spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/06/10 23:36:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/10 23:36:55 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Users/tutor/InstalledApps/spark-2.1.1-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Users/tutor/InstalledApps/spark-2.1.1-bin-hadoop2.7/bin/../jars/datanucleus-core-3.2.10.jar."
17/06/10 23:36:55 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Users/tutor/InstalledApps/spark-2.1.1-bin-hadoop2.7/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Users/tutor/InstalledApps/spark-2.1.1-bin-hadoop2.7/bin/../jars/datanucleus-rdbms-3.2.9.jar."
17/06/10 23:36:55 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Users/tutor/InstalledApps/spark-2.1.1-bin-hadoop2.7/bin/../jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Users/tutor/InstalledApps/spark-2.1.1-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar."
17/06/10 23:36:59 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = local-1497163014195).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
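
Once you reach the scala> prompt, a quick smoke test confirms the session works. A minimal sketch using the sc (SparkContext) and spark (SparkSession) objects the shell creates for you; the expected results are shown:

scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0

scala> spark.version
res1: String = 2.1.1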
