  1. Check the Hadoop installation and version

    hadoop
    
    hadoop version
    
    start-all.cmd
    

    image.png

  2. Create the directory

    hdfs dfs -mkdir /mca54
    

    image.png

  3. Create one input file (sample-data.txt)

    This is Demo
    This is Sample
    Dear Bear Car Van
    Car Van Car Van
    This is Dear
    

    image.png

  4. Move the file from the local system to HDFS

    hdfs dfs -copyFromLocal "C:\Users\labuser\Desktop\MCA54\WordCountApp\assets\sample-data.txt" /mca54
    

    image.png

  5. Check whether the file was copied to HDFS

    hdfs dfs -ls /mca54
    

    image.png

  6. Check the Hadoop web UI (Application Status)

    http://localhost:8088/cluster

    image.png

  7. Check the Hadoop web UI (Namenode Status)

    image.png

    image.png

  8. Create a Java project using VS Code

    image.png

  9. Open the project in File Explorer

    image.png

  10. Copy the required Hadoop jar files and paste them into the lib folder of the Java project

    image.png

    image.png

  11. Java file WordCount.java

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import java.io.IOException;
    
    public class WordCount {
    
        // Mapper class that splits the input text into words and outputs them with a count of 1
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
    
            public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
                // Split each line of input into words
                String line = value.toString();
                String[] words = line.split("\\s+");
    
                // For each word in the line, emit (word, 1)
                for (String w : words) {
                    word.set(w);
                    context.write(word, one);
                }
            }
        }
    
        // Reducer class that sums the counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();
    
            public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                int sum = 0;
    
                // Sum all the counts for this word
                for (IntWritable val : values) {
                    sum += val.get();
                }
    
                result.set(sum);
                context.write(key, result);
            }
        }
    
        // Main method to set up and run the job
        public static void main(String[] args) throws Exception {
            
            if (args.length != 2) {
                System.out.printf("Usage: WordCount <input dir> <output dir>\n");
                System.exit(-1);
            }
    
            // Set up the configuration for the Hadoop job
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "Word Count");
            job.setJarByClass(WordCount.class);
    
            // Set the input and output file paths (first argument: input, second: output)
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
            // Set the Mapper, Reducer, and Output key/value types
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
    
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
    
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
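
    A quick way to sanity-check the mapper's tokenization without a cluster: the standalone class below (TokenizeCheck is a hypothetical helper, not part of the job) applies the same split("\\s+") call the mapper uses to one line of the sample input and prints the (word, 1) pairs the mapper would emit.

```java
// Standalone check of the mapper's tokenization; no Hadoop required.
public class TokenizeCheck {

    // Same splitting rule as TokenizerMapper: one or more whitespace characters.
    // Note: a line with leading whitespace would yield an empty first token,
    // which the mapper above would count as a "word".
    static String[] tokenize(String line) {
        return line.split("\\s+");
    }

    public static void main(String[] args) {
        // Emit (word, 1) pairs exactly as the mapper would for this line.
        for (String w : tokenize("Car Van Car Van")) {
            System.out.println(w + "\t1");
        }
    }
}
```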
    
  12. Create output folder inside java project

    1. command to compile WordCount.java file

      javac --release 8 -cp "lib/*" -d output/WordCount "src/WordCount.java"
      
    2. command to create a jar file for the WordCount program

      jar -cvf src/WordCount.jar -C output/WordCount .
      
  13. command to run hadoop jar

    hadoop jar C:\Users\labuser\Desktop\MCA54\WordCountApp\src\WordCount.jar WordCount /mca54/sample-data.txt /mca54/output/WordCount
    

    "C:\\Users\\labuser\\Desktop\\MCA54\\WordCountApp\\src\\WordCount.jar" is the path of the jar file present in the local system

    "/mca54/sample-data.txt" is the path of the input files present on the hadoop server.

    "/mca54/output" is the path of the output folder where i wish to upload all the output files on the hadoop server.

  14. Output (Without Combiner):

    image.png

    image.png

    image.png

  15. Output (With Combiner):

    image.png

    image.png

    image.png
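
    The difference between the two runs above is only in how much intermediate data is shuffled: because IntSumReducer is also set as the combiner (job.setCombinerClass), each mapper pre-sums its own (word, 1) pairs before they cross the network, and the final counts come out identical either way. The sketch below (CombinerEffect is an illustrative helper, not Hadoop code) counts both cases for the sample input: 17 raw pairs without the combiner versus 8 pre-aggregated pairs with it.

```java
import java.util.*;

// Illustrates why the combiner reduces shuffle volume without changing results.
public class CombinerEffect {

    // Returns { pairs emitted without a combiner, pairs left after a combiner
    // pre-aggregates within the same map split }.
    static int[] pairCounts(String[] lines) {
        int rawPairs = 0;
        Map<String, Integer> combined = new HashMap<>();
        for (String line : lines) {
            for (String w : line.split("\\s+")) {
                rawPairs++;                          // without combiner: one pair per token
                combined.merge(w, 1, Integer::sum);  // with combiner: local pre-sum
            }
        }
        return new int[] { rawPairs, combined.size() };
    }

    public static void main(String[] args) {
        String[] sample = { "This is Demo", "This is Sample", "Dear Bear Car Van",
                            "Car Van Car Van", "This is Dear" };
        int[] p = pairCounts(sample);
        System.out.println("pairs shuffled without combiner: " + p[0]); // 17
        System.out.println("pairs shuffled with combiner:    " + p[1]); // 8
    }
}
```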

  16. command to run hadoop jar for multiple input files

    1. Create the directory

      hdfs dfs -mkdir /mca54/sample-files
      
    2. Move the files from the local system to HDFS

      hdfs dfs -copyFromLocal C:\Users\labuser\Desktop\MCA54\WordCountApp\assets\sample-files\input-file-1.txt /mca54/sample-files
      
      hdfs dfs -copyFromLocal C:\Users\labuser\Desktop\MCA54\WordCountApp\assets\sample-files\input-file-2.txt /mca54/sample-files
      
      hdfs dfs -copyFromLocal C:\Users\labuser\Desktop\MCA54\WordCountApp\assets\sample-files\input-file-3.txt /mca54/sample-files
      
    3. Run hadoop jar

      hadoop jar C:\Users\labuser\Desktop\MCA54\WordCountApp\src\WordCount.jar WordCount /mca54/sample-files /mca54/output/multipleFilesOutput
      

      "C:\\Users\\labuser\\Desktop\\MCA54\\WordCountApp\\src\\WordCount.jar" is the path of the jar file present in the local system

      "/mca54/sample-files" is the path of the input files present on the hadoop server directory.

      "/mca54/output" is the path of the output folder where i wish to upload all the output files on the hadoop server.

      image.png

      image.png