Group Line by || MapReduce Assignment

Copy the input file from the local file system to HDFS (this assumes the /mca54 directory already exists in HDFS)
```
hdfs dfs -copyFromLocal "C:\Users\labuser\Desktop\MCA54\GroupLineProject\assets\input-data-pipes.txt" /mca54
```

Check that the file was copied to HDFS
```
hdfs dfs -ls /mca54
```
Open the Hadoop web interface (NameNode status page) and confirm that the file we just copied is listed there.

Now create a Java project in VS Code (GroupLineProject).

Copy the jar files from the Hadoop installation (the common and mapreduce directories, typically under share/hadoop) and paste them into the lib folder of the Java project.

Java file GroupLine.java

```
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class GroupLine {

    // Mapper class that replaces "||" with "," in each input line
    public static class FormatMapper extends Mapper<Object, Text, Text, NullWritable> {
        // Reused for every record to avoid allocating a new Text per line
        private Text formattedLine = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();

            // Replace || with , (String.replace matches literal text, not a regex)
            String csvLine = line.replace("||", ",");

            // The whole formatted line is the key; no value is needed
            formattedLine.set(csvLine);
            context.write(formattedLine, NullWritable.get());
        }
    }

    // Main method to set up and run the job
    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.out.printf("Usage: GroupLine <input dir> <output dir>%n");
            System.exit(-1);
        }

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Group Line Format");
        job.setJarByClass(GroupLine.class);

        // Set input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Set Mapper class and output types
        job.setMapperClass(FormatMapper.class);
        job.setNumReduceTasks(0); // No reducer needed: mapper output is written directly

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Exit with 0 on success, 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
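
The per-line transformation done by FormatMapper can be tried without a Hadoop cluster. The following is a minimal sketch in plain Java; the class name PipeToCsvSketch and the sample records are made up for illustration and are not part of the assignment or of input-data-pipes.txt.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone class; not part of GroupLine.java.
final class PipeToCsvSketch {

    // Same call the mapper makes: String.replace treats "||" as literal
    // text, so no regex escaping of the pipe character is needed.
    static String toCsv(String line) {
        return line.replace("||", ",");
    }

    public static void main(String[] args) {
        // Made-up sample records, assumed to resemble the real input.
        List<String> sample = Arrays.asList("1||Asha||Pune", "2||Ravi||Delhi");
        for (String line : sample) {
            System.out.println(toCsv(line));
        }
    }
}
```

Running it prints 1,Asha,Pune and 2,Ravi,Delhi. A single | is left untouched, because only the two-character sequence || is replaced.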

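Explanation

In Apache Hadoop, NullWritable is a special implementation of the Writable interface. It is a singleton (obtained with NullWritable.get()) that serializes to zero bytes, and it serves as a placeholder when a key or value is not required. Here the mapper emits the whole formatted line as the key and NullWritable as the value, so with the default text output only the line itself is written.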
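
To see why a NullWritable value leaves the output lines clean, here is a simplified plain-Java imitation of how Hadoop's default text output lays out one record. It is a sketch, not Hadoop's own code: the class TextRecordSketch and the method formatRecord are hypothetical, and a Java null stands in for NullWritable.

```java
// Hypothetical helper imitating, in simplified form, the default text output:
// key, then a tab separator, then value, where missing parts are skipped.
final class TextRecordSketch {

    static String formatRecord(String key, String value) {
        StringBuilder out = new StringBuilder();
        if (key != null) {
            out.append(key);
        }
        // The separator is written only when both key and value are present.
        if (key != null && value != null) {
            out.append('\t');
        }
        if (value != null) {
            out.append(value);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // null plays the role of NullWritable: only the key is printed.
        System.out.println(formatRecord("1,Asha,Pune", null));
        // With a real value, a tab and the value would follow the key.
        System.out.println(formatRecord("1,Asha,Pune", "x"));
    }
}
```

If the mapper emitted an empty Text value instead of NullWritable, each output line would end with a trailing tab, which is why NullWritable is the better fit for this job.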

Create an output folder inside the Java project, then run the following commands from the project root.
1. Command to compile the GroupLine.java file
```
javac --release 8 -cp "lib/*" -d output "src/GroupLine.java"
```
2. Command to create a jar file for the GroupLine program
```
jar -cvf src/GroupLine.jar -C output/ .
```
3. Command to run the jar with hadoop
```
hadoop jar C:\Users\labuser\Desktop\MCA54\GroupLineProject\src\GroupLine.jar GroupLine /mca54/input-data-pipes.txt /mca54/output/GroupLine
```
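"C:....jar" is the path of the jar file on the local system.

"GroupLine" is the name of the class whose main method is run.

"/mca54/input-data-pipes.txt" is the path of the input file in HDFS.

"/mca54/output/GroupLine" is the HDFS path of the output folder where the job writes its output files. This folder must not already exist, otherwise the job fails at startup. Because the job has no reducer, the output files are named part-m-00000, part-m-00001, and so on.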