Check the Hadoop installation and version
hadoop
hadoop version
Start the Hadoop daemons
start-all.cmd
Create the directory
hdfs dfs -mkdir /mca54
Create one input file (sample-data.txt) with the following content
This is Demo
This is Sample
Dear Bear Car Van
Car Van Car Van
This is Dear
Copy the file from the local file system to HDFS
hdfs dfs -copyFromLocal "C:\Users\labuser\Desktop\MCA54\WordCountApp\assets\sample-data.txt" /mca54
Check whether the file was copied to HDFS
hdfs dfs -ls /mca54
Check the Hadoop web user interface (Application Status)
Check the Hadoop web user interface (Namenode Status)
Now create a Java project using VS Code
Open it in File Explorer
Copy the required Hadoop jar files and paste them into the lib folder of the Java project
Java file WordCount.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class WordCount {

    // Mapper class that splits the input text into words and outputs them with a count of 1
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Split each line of input into words on whitespace
            String line = value.toString();
            String[] words = line.split("\\s+");
            // For each word in the line, emit (word, 1)
            for (String w : words) {
                if (w.isEmpty()) {
                    continue; // skip the empty token produced by leading whitespace
                }
                word.set(w);
                context.write(word, one);
            }
        }
    }

    // Reducer class that sums the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            // Sum all the counts for this word
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Main method to set up and run the job
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf("Usage: WordCount <input dir> <output dir>%n");
            System.exit(-1);
        }
        // Set up the configuration for the Hadoop job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Word Count");
        job.setJarByClass(WordCount.class);
        // Set the input and output file paths (first argument: input, second: output)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Set the Mapper, Combiner, Reducer, and output key/value types
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
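The mapper's split-then-emit step and the reducer's summing step can be sanity-checked locally before submitting the job. The sketch below (plain Java, no Hadoop dependencies; WordCountLocalCheck is a hypothetical helper name, not part of the assignment) replays the same logic on the sample lines:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Local sanity check: replays the mapper's whitespace split and the
// reducer's per-word sum on the sample input, with no Hadoop dependencies.
public class WordCountLocalCheck {

    // Mirrors TokenizerMapper + IntSumReducer: split each line on whitespace,
    // then sum the counts for each word.
    static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String w : line.split("\\s+")) {
                if (w.isEmpty()) continue; // same guard as the mapper
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] sample = {
            "This is Demo",
            "This is Sample",
            "Dear Bear Car Van",
            "Car Van Car Van",
            "This is Dear"
        };
        // Prints one "word<TAB>count" line per distinct word,
        // e.g. This -> 3, Car -> 3, Bear -> 1
        count(sample).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

If the counts printed here match the job's output on the cluster, the MapReduce logic is wired correctly.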
Create an output folder inside the Java project
Command to compile the WordCount.java file
javac --release 8 -cp "lib/*" -d output/WordCount "src/WordCount.java"
Command to create a jar file for the WordCount program
jar -cvf src/WordCount.jar -C output/WordCount .
Command to run hadoop jar
hadoop jar C:\Users\labuser\Desktop\MCA54\WordCountApp\src\WordCount.jar WordCount /mca54/sample-data.txt /mca54/output/WordCount
"C:\Users\labuser\Desktop\MCA54\WordCountApp\src\WordCount.jar"
is the path of the jar file on the local system.
"/mca54/sample-data.txt"
is the path of the input file on the Hadoop server.
"/mca54/output/WordCount"
is the path of the output folder on the Hadoop server where the job writes its results.
Output (Without Combiner):
Output (With Combiner):
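The two runs produce the same final counts; what the combiner changes is the amount of intermediate data shuffled between the map and reduce phases. A rough illustration of that difference (plain Java, single map task assumed; CombinerEffect is a hypothetical name): without a combiner, every (word, 1) pair from the mapper crosses the network, while with a combiner each map task pre-sums its own pairs, so only one record per distinct word is shuffled.

```java
import java.util.HashMap;
import java.util.Map;

// Rough illustration of what the combiner saves: counts the intermediate
// records one map task would shuffle with and without local combining.
public class CombinerEffect {

    // Without a combiner: one (word, 1) record per word occurrence.
    static int recordsWithoutCombiner(String[] lines) {
        int records = 0;
        for (String line : lines) {
            for (String w : line.split("\\s+")) {
                if (!w.isEmpty()) records++;
            }
        }
        return records;
    }

    // With a combiner: the map task pre-sums, so one record per distinct word.
    static int recordsWithCombiner(String[] lines) {
        Map<String, Integer> partial = new HashMap<>();
        for (String line : lines) {
            for (String w : line.split("\\s+")) {
                if (!w.isEmpty()) partial.merge(w, 1, Integer::sum);
            }
        }
        return partial.size();
    }

    public static void main(String[] args) {
        String[] sample = {
            "This is Demo",
            "This is Sample",
            "Dear Bear Car Van",
            "Car Van Car Van",
            "This is Dear"
        };
        // 17 word occurrences in the sample, but only 8 distinct words
        System.out.println("Records shuffled without combiner: " + recordsWithoutCombiner(sample));
        System.out.println("Records shuffled with combiner:    " + recordsWithCombiner(sample));
    }
}
```

This is why IntSumReducer can double as the combiner here: summing counts is associative and commutative, so partial sums computed on the map side do not change the final result.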
Command to run hadoop jar for multiple input files
Create the directory
hdfs dfs -mkdir /mca54/sample-files
Copy the files from the local file system to HDFS
hdfs dfs -copyFromLocal C:\Users\labuser\Desktop\MCA54\WordCountApp\assets\sample-files\input-file-1.txt /mca54/sample-files
hdfs dfs -copyFromLocal C:\Users\labuser\Desktop\MCA54\WordCountApp\assets\sample-files\input-file-2.txt /mca54/sample-files
hdfs dfs -copyFromLocal C:\Users\labuser\Desktop\MCA54\WordCountApp\assets\sample-files\input-file-3.txt /mca54/sample-files
Running hadoop jar
hadoop jar C:\Users\labuser\Desktop\MCA54\WordCountApp\src\WordCount.jar WordCount /mca54/sample-files /mca54/output/multipleFilesOutput
"C:\Users\labuser\Desktop\MCA54\WordCountApp\src\WordCount.jar"
is the path of the jar file on the local system.
"/mca54/sample-files"
is the path of the input directory on the Hadoop server; every file in it is processed as input.
"/mca54/output/multipleFilesOutput"
is the path of the output folder on the Hadoop server where the job writes its results.