First of all, you must have a local network with two or more VMs connected to each other.
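As a quick connectivity check you can, for example, ping one VM from another; the address below is simply the master address used later in this guide, so adjust it to your own network:
ping 192.168.0.2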
To install Hadoop and Spark, follow the steps below:
First, install Hadoop by following the link -> Hadoop
Then, set up the YARN cluster by following the link -> Yarn
Finally, install Spark using the PDF above or by following the link -> Spark
1. Connect to the master VM:
ssh (master vm connection string)
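For example, assuming the master VM is reachable at 192.168.0.2 (the address used for the Spark master below) and has a user named ubuntu (a hypothetical username), the connection string could look like:
ssh ubuntu@192.168.0.2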
2. Start HDFS and the Spark master in the master VM:
start-dfs.sh
start-master.sh
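To check that the daemons came up, you can list the running Java processes with jps; on the master VM you should normally see at least a NameNode (HDFS) and a Master (Spark) process:
jps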
3. Upload the data to the Hadoop Distributed File System (HDFS), as in the example below:
hadoop fs -put ./yellow_tripdata_2022-01.parquet hdfs://master:9000/par/yellow_tripdata_2022-01.parquet
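You can verify the upload by listing the target directory in HDFS:
hadoop fs -ls hdfs://master:9000/par/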
4. Start a worker on the master VM:
start-worker.sh spark://192.168.0.2:7077
5. Start a worker on the slave VM by typing the following commands in the master VM:
ssh (slave vm connection string)
start-worker.sh spark://192.168.0.2:7077
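To confirm that both workers have registered with the master, open the Spark master web UI in a browser (by default it listens on port 8080 of the master VM):
http://192.168.0.2:8080
Both workers should appear in the Workers table with state ALIVE.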
6. Submit the job to the Spark cluster (from the master VM and in the directory of the file):
spark-submit (filename)
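For example, assuming the application is a Python script named query.py (a hypothetical filename) and that spark.master is not already set in spark-defaults.conf, the master URL can be passed explicitly so the job runs on the standalone cluster started above:
spark-submit --master spark://192.168.0.2:7077 query.py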
7. Results
See the results in the terminal.
Dimitris Kalathas, Dimitris Bakalis