1. Copy the file from the local filesystem to HDFS (note: -copyFromLocal copies the file; the local copy is kept)

    input-data-pipes.txt

    hdfs dfs -copyFromLocal "C:\Users\labuser\Desktop\MCA54\GroupLineProject\assets\input-data-pipes.txt" /mca54
    


  2. Check whether the file has been copied to HDFS

    hdfs dfs -ls /mca54
    
  3. Open the Hadoop web UI (NameNode status page) and verify that the file we just copied is listed.

  4. Now create a Java project (GroupLineProject) using VS Code

  5. Copy the JAR files from the Hadoop installation (the common and mapreduce directories under share/hadoop) and paste them into the lib folder of the Java project

  6. Java file GroupLine.java

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import java.io.IOException;
    
    public class GroupLine {
    
        // Mapper class that replaces "||" with "," in each input line
        public static class FormatMapper extends Mapper<Object, Text, Text, NullWritable> {
            private Text formattedLine = new Text();
    
            @Override
            public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
                String line = value.toString();
    
                // Replace || with ,
                String csvLine = line.replace("||", ",");
    
                formattedLine.set(csvLine);
                context.write(formattedLine, NullWritable.get());
            }
        }
    
        // Main method to set up and run the job
        public static void main(String[] args) throws Exception {
    
            if (args.length != 2) {
                System.out.printf("Usage: GroupLine <input dir> <output dir>\n");
                System.exit(-1);
            }
    
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "Group Line Format");
            job.setJarByClass(GroupLine.class);
    
            // Set input and output paths
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
            // Set Mapper class and output types
            job.setMapperClass(FormatMapper.class);
            job.setNumReduceTasks(0); // No reducer needed
    
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
    
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
    

    Within the Apache Hadoop framework, NullWritable is a singleton implementation of the Writable interface. It serves as a placeholder when a key or value is not needed in the MapReduce paradigm; here it lets the mapper emit each formatted line as the key with no accompanying value.
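The mapper's core transformation (replacing "||" with ",") can be sanity-checked on the command line without running Hadoop at all. The sample record below is made up for illustration:

```shell
# Simulate the mapper's substitution on one pipe-delimited record
# (sed treats "||" as two literal pipe characters here)
echo "John||25||Engineer" | sed 's/||/,/g'
# prints: John,25,Engineer
```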

  7. Create an output folder inside the Java project

    1. command to compile GroupLine.java file

      javac --release 8 -cp "lib/*" -d output "src/GroupLine.java"
      


    2. command to create a jar file for the GroupLine program

      jar -cvf src/GroupLine.jar -C output/ .
      


  8. command to run hadoop jar

    hadoop jar C:\Users\labuser\Desktop\MCA54\GroupLineProject\src\GroupLine.jar GroupLine /mca54/input-data-pipes.txt /mca54/output/GroupLine
    

    "C:....jar" is the path of the jar file present in the local system

    "/mca54/input-data-pipes.txt" is the path of the input files present on the hadoop server.

    "/mca54/output/GroupLine" is the path of the output folder where i wish to upload all the output files on the hadoop server.
