Copy the file from the local file system to HDFS
hdfs dfs -copyFromLocal "C:\Users\labuser\Desktop\MCA54\GroupLineProject\assets\input-data-pipes.txt" /mca54
Check whether the file is copied to hdfs
hdfs dfs -ls /mca54
Open the Hadoop web user interface (NameNode status page) and check that the file we just copied is listed.
Now create a Java project in VS Code (GroupLineProject).
Copy the jar files from the Hadoop installation (the hdfs, common, and mapreduce directories) and paste them into the lib folder of the Java project.
Java file GroupLine.java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupLine {

    // Mapper class that replaces "||" with "," in each input line
    public static class FormatMapper extends Mapper<Object, Text, Text, NullWritable> {

        // Reused for every record to avoid allocating a new Text per line
        private final Text formattedLine = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // Replace the literal "||" delimiter with ","
            String csvLine = line.replace("||", ",");
            formattedLine.set(csvLine);
            // The whole CSV line is the key; no value is needed
            context.write(formattedLine, NullWritable.get());
        }
    }

    // Main method to set up and run the job
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: GroupLine <input dir> <output dir>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Group Line Format");
        job.setJarByClass(GroupLine.class);
        // Set input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Set Mapper class and output types
        job.setMapperClass(FormatMapper.class);
        job.setNumReduceTasks(0); // No reducer needed (map-only job)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
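The transformation done by the mapper can be tried outside Hadoop before building the jar. The following is only a minimal sketch in plain Java with no Hadoop dependencies; the class name PipeToCsvSketch, the helper names, and the sample records are hypothetical and not part of the GroupLine project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone class for experimenting; not part of GroupLineProject.
final class PipeToCsvSketch {

    // Same transformation the mapper applies to one input line.
    // String.replace treats "||" as literal text, not as a regular expression.
    static String toCsv(String line) {
        return line.replace("||", ",");
    }

    // Applies toCsv to every line, the way a map-only job emits
    // one output record per input line.
    static List<String> convertAll(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(toCsv(line));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample records, for illustration only.
        List<String> sample = Arrays.asList("1||Alice||Pune", "2||Bob||Delhi");
        for (String csv : convertAll(sample)) {
            System.out.println(csv);
        }
    }
}
```

replace is used rather than replaceAll because replaceAll would interpret "||" as a regular expression that matches the empty string, which is not the intended behaviour here.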
In the Apache Hadoop framework, NullWritable is a special implementation of the Writable interface. It is a singleton that carries no data, and it serves as a placeholder when a key or value is not required in a MapReduce job. Here the whole formatted line is written as the key, so the value is NullWritable.
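Why this gives clean CSV lines in the output can be sketched as follows. The driver above does not set an output format, so the default text output applies; as far as we understand it, that format writes the key, then a tab and the value only when both are present. The class below is a simplified plain-Java model of that rule, not Hadoop code: TextRecordModel and formatRecord are hypothetical names, and null stands in for NullWritable.

```java
// Hypothetical simplified model; the real logic lives inside Hadoop's text output format.
final class TextRecordModel {

    // Assumed default key/value separator of the text output (a tab).
    static final String SEPARATOR = "\t";

    // Returns the text written for one record, or null if nothing is written.
    // A null argument stands in for NullWritable (no key or no value).
    static String formatRecord(String key, String value) {
        if (key == null && value == null) {
            return null; // nothing to write for this record
        }
        StringBuilder sb = new StringBuilder();
        if (key != null) {
            sb.append(key);
        }
        if (key != null && value != null) {
            sb.append(SEPARATOR); // separator only when both parts exist
        }
        if (value != null) {
            sb.append(value);
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        // Key only, as in the GroupLine job: the line has no trailing tab.
        System.out.print(formatRecord("1,Alice,Pune", null));
        // Key and value: the two are joined by a tab.
        System.out.print(formatRecord("key", "value"));
    }
}
```

With a NullWritable value, each output line is just the converted record, which is what a CSV file needs.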
Create an output folder inside the Java project
Command to compile the GroupLine.java file
javac --release 8 -cp "lib/*" -d output "src/GroupLine.java"
Command to create a jar file for the GroupLine program
jar -cvf src/GroupLine.jar -C output/ .
Command to run the jar with Hadoop
hadoop jar C:\Users\labuser\Desktop\MCA54\GroupLineProject\src\GroupLine.jar GroupLine /mca54/input-data-pipes.txt /mca54/output/GroupLine
"C:....jar" is the path of the jar file on the local system.
"/mca54/input-data-pipes.txt" is the path of the input file in HDFS.
"/mca54/output/GroupLine" is the path of the HDFS output folder where the job writes all of its output files. This folder must not already exist when the job starts, otherwise the job fails.