  1. Check the Hadoop installation and version

    hadoop
    
    hadoop version
    
    start-all.cmd
    

    image.png

  2. Create the directory

    hdfs dfs -mkdir /mca54
    

    image.png

  3. Create one input file (sample-data.txt)

    This is Demo
    This is Sample
    Dear Bear Car Van
    Car Van Car Van
    This is Dear
    

    image.png

  4. Move the file from the local system to HDFS

    hdfs dfs -copyFromLocal "C:\Users\labuser\Desktop\MCA54\WordCountApp\assets\sample-data.txt" /mca54
    

    image.png

  5. Check whether the file was copied to HDFS

    hdfs dfs -ls /mca54
    

    image.png

  6. Check the Hadoop web UI (Application Status)

    http://localhost:8088/cluster

    image.png

  7. Check the Hadoop web UI (Namenode Status)

    image.png

    image.png

  8. Create a Java project using VS Code

    image.png

  9. Open the project in File Explorer

    image.png

  10. Copy the required Hadoop jar files and paste them into the lib folder of the Java project

    image.png

    image.png

  11. Java file WordCount.java

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import java.io.IOException;
    
    public class WordCount {
    
        // Mapper class that splits the input text into words and outputs them with a count of 1
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
    
            public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
                // Split each line of input into words
                String line = value.toString();
                String[] words = line.split("\\s+");
    
                // For each word in the line, emit (word, 1)
                for (String w : words) {
                    word.set(w);
                    context.write(word, one);
                }
            }
        }
    
        // Reducer class that sums the counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();
    
            public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                int sum = 0;
    
                // Sum all the counts for this word
                for (IntWritable val : values) {
                    sum += val.get();
                }
    
                result.set(sum);
                context.write(key, result);
            }
        }
    
        // Main method to set up and run the job
        public static void main(String[] args) throws Exception {
            
            if (args.length != 2) {
                System.out.printf("Usage: WordCount <input dir> <output dir>\n");
                System.exit(-1);
            }
    
            // Set up the configuration for the Hadoop job
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "Word Count");
            job.setJarByClass(WordCount.class);
    
            // Set the input and output file paths (first argument: input, second: output)
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
            // Set the Mapper, Reducer, and Output key/value types
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
    
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
    
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
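
    A quick way to sanity-check the mapper's tokenization without a cluster: the standalone class below (TokenizeCheck is a hypothetical helper, not part of the job) applies the same split("\\s+") call the mapper uses to one line of the sample input and prints the (word, 1) pairs the mapper would emit.

```java
// Standalone check of the mapper's tokenization; no Hadoop required.
public class TokenizeCheck {

    // Same splitting rule as TokenizerMapper: one or more whitespace characters.
    // Note: a line with leading whitespace would yield an empty first token,
    // which the mapper above would count as a "word".
    static String[] tokenize(String line) {
        return line.split("\\s+");
    }

    public static void main(String[] args) {
        // Emit (word, 1) pairs exactly as the mapper would for this line.
        for (String w : tokenize("Car Van Car Van")) {
            System.out.println(w + "\t1");
        }
    }
}
```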
    
  12. Create output folder inside java project

    1. command to compile WordCount.java file

      javac --release 8 -cp "lib/*" -d output/WordCount "src/WordCount.java"
      
    2. command to create a jar file for the WordCount program

      jar -cvf src/WordCount.jar -C output/WordCount .
      
  13. command to run hadoop jar

    hadoop jar C:\Users\labuser\Desktop\MCA54\WordCountApp\src\WordCount.jar WordCount /mca54/sample-data.txt /mca54/output/WordCount
    

    "C:\\Users\\labuser\\Desktop\\MCA54\\WordCountApp\\src\\WordCount.jar" is the path of the jar file present in the local system

    "/mca54/sample-data.txt" is the path of the input files present on the hadoop server.

    "/mca54/output" is the path of the output folder where i wish to upload all the output files on the hadoop server.

  14. Output (Without Combiner):

    image.png

    image.png

    image.png

  15. Output (With Combiner):

    image.png

    image.png

    image.png
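
    The difference between the two runs above is only in how much intermediate data is shuffled: because IntSumReducer is also set as the combiner (job.setCombinerClass), each mapper pre-sums its own (word, 1) pairs before they cross the network, and the final counts come out identical either way. The sketch below (CombinerEffect is an illustrative helper, not Hadoop code) counts both cases for the sample input: 17 raw pairs without the combiner versus 8 pre-aggregated pairs with it.

```java
import java.util.*;

// Illustrates why the combiner reduces shuffle volume without changing results.
public class CombinerEffect {

    // Returns { pairs emitted without a combiner, pairs left after a combiner
    // pre-aggregates within the same map split }.
    static int[] pairCounts(String[] lines) {
        int rawPairs = 0;
        Map<String, Integer> combined = new HashMap<>();
        for (String line : lines) {
            for (String w : line.split("\\s+")) {
                rawPairs++;                          // without combiner: one pair per token
                combined.merge(w, 1, Integer::sum);  // with combiner: local pre-sum
            }
        }
        return new int[] { rawPairs, combined.size() };
    }

    public static void main(String[] args) {
        String[] sample = { "This is Demo", "This is Sample", "Dear Bear Car Van",
                            "Car Van Car Van", "This is Dear" };
        int[] p = pairCounts(sample);
        System.out.println("pairs shuffled without combiner: " + p[0]); // 17
        System.out.println("pairs shuffled with combiner:    " + p[1]); // 8
    }
}
```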

  16. command to run hadoop jar for multiple input files

    1. Create the directory

      hdfs dfs -mkdir /mca54/sample-files
      
    2. Move the files from the local system to HDFS

      hdfs dfs -copyFromLocal C:\Users\labuser\Desktop\MCA54\WordCountApp\assets\sample-files\input-file-1.txt /mca54/sample-files
      
      hdfs dfs -copyFromLocal C:\Users\labuser\Desktop\MCA54\WordCountApp\assets\sample-files\input-file-2.txt /mca54/sample-files
      
      hdfs dfs -copyFromLocal C:\Users\labuser\Desktop\MCA54\WordCountApp\assets\sample-files\input-file-3.txt /mca54/sample-files
      
    3. Run hadoop jar

      hadoop jar C:\Users\labuser\Desktop\MCA54\WordCountApp\src\WordCount.jar WordCount /mca54/sample-files /mca54/output/multipleFilesOutput
      

      "C:\\Users\\labuser\\Desktop\\MCA54\\WordCountApp\\src\\WordCount.jar" is the path of the jar file present in the local system

      "/mca54/sample-files" is the path of the input files present on the hadoop server directory.

      "/mca54/output" is the path of the output folder where i wish to upload all the output files on the hadoop server.

      image.png

      image.png