Top 60 Hadoop & MapReduce Interview Questions & Answers


1) What is Hadoop MapReduce?

The Hadoop MapReduce framework is used to process large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step map and reduce process.

2) How does Hadoop MapReduce work?

During the map phase, the input data is divided into splits, which are analyzed by map tasks running in parallel across the Hadoop cluster. The reduce phase then aggregates the map output. Taking word count as an example: the map phase counts the words in each document, while the reduce phase aggregates the counts across the entire document collection.
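The flow described above can be sketched as a small word-count simulation in plain Python (this only mimics the map, shuffle, and reduce phases; it is not Hadoop API code, and the sample documents are made up):

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word in a document
    return [(word, 1) for word in text.split()]

def shuffle(mapped_pairs):
    # Shuffle: group all values for the same key together,
    # so each reducer sees one key with all of its values
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Reduce: aggregate the per-document counts into a total
    return word, sum(counts)

docs = {"d1": "big data big cluster", "d2": "big data"}
mapped = [pair for text in docs.values() for pair in map_phase(text)]
result = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(result)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In real Hadoop the map tasks run in parallel on different nodes and the shuffle moves data across the network, but the data flow is the same.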

3) Explain what is shuffling in MapReduce?

The process by which the system sorts the map outputs and transfers them to the reducers as input is known as the shuffle.

4) Explain what is distributed Cache in MapReduce Framework?

Distributed Cache is an important feature provided by the MapReduce framework. When you want to share some files across all nodes in a Hadoop cluster, Distributed Cache is used. The files could be executable JAR files or simple properties files.

5) Explain what is NameNode in Hadoop?

NameNode in Hadoop is the master node where Hadoop stores all the file location metadata for HDFS (Hadoop Distributed File System). In other words, NameNode is the centerpiece of an HDFS file system. It keeps a record of all the files in the file system and tracks where the file data is kept across the cluster or multiple machines.

6) Explain what is JobTracker in Hadoop? What are the actions performed by JobTracker?

In Hadoop, JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.

JobTracker performs the following actions in Hadoop:

  • Client applications submit jobs to the JobTracker
  • The JobTracker communicates with the NameNode to determine the data location
  • The JobTracker locates TaskTracker nodes near the data or with available slots
  • It submits the work to the chosen TaskTracker nodes
  • When a task fails, the JobTracker is notified and decides what to do next
  • The TaskTracker nodes are monitored by the JobTracker

7) Explain what is heartbeat in HDFS?

A heartbeat is a signal used between a DataNode and the NameNode, and between a TaskTracker and the JobTracker. If the NameNode or JobTracker does not receive the signal, it is assumed that there is some issue with the DataNode or TaskTracker.

8) Explain what combiners are and when you should use a combiner in a MapReduce job?

Combiners are used to increase the efficiency of a MapReduce program. With the help of combiners, the amount of data that must be transferred across the network to the reducers can be reduced. If the operation performed is commutative and associative, you can use your reducer code as a combiner. Note that the execution of the combiner is not guaranteed in Hadoop.
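A minimal sketch of the idea in plain Python (not the Hadoop API; word counts are used because sums are commutative and associative, so the reducer logic can double as the combiner):

```python
from collections import defaultdict

def combine(mapped_pairs):
    # Combiner: pre-aggregate (word, count) pairs locally on the map node
    # before anything is sent across the network to the reducers
    local = defaultdict(int)
    for word, count in mapped_pairs:
        local[word] += count
    return list(local.items())

# Output of one map task before combining: 5 pairs would cross the network
mapped = [("big", 1), ("data", 1), ("big", 1), ("big", 1), ("data", 1)]

combined = combine(mapped)
print(sorted(combined))  # [('big', 3), ('data', 2)] -- only 2 pairs to transfer
```

The reducers still produce the same final totals whether or not the combiner ran, which is why Hadoop is free to skip it.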

9) What happens when a DataNode fails?

When a DataNode fails:

  • The JobTracker and NameNode detect the failure
  • All tasks on the failed node are re-scheduled
  • The NameNode replicates the user's data to another node

10) Explain what is Speculative Execution?

In Hadoop, during speculative execution, a certain number of duplicate tasks are launched. Multiple copies of the same map or reduce task can be executed on different slave nodes using speculative execution. In simple words, if a particular node is taking a long time to complete a task, Hadoop will create a duplicate of that task on another node. The task that finishes first is retained, and the duplicates that did not finish first are killed.

11) Explain what are the essential parameters of a Mapper?

The basic parameters of a Mapper are:

  • LongWritable and Text (the input key and value types)
  • Text and IntWritable (the output key and value types)

12) Explain what is the function of the MapReduce partitioner?

The function of the MapReduce partitioner is to make sure that all the values of a single key go to the same reducer, which eventually helps to evenly distribute the map output over the reducers.
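The partitioner's contract can be illustrated with a plain-Python stand-in (Hadoop's actual default, HashPartitioner, computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks in Java; zlib.crc32 here is just a deterministic substitute hash):

```python
import zlib

def partition(key, num_reducers):
    # Every occurrence of the same key produces the same partition number,
    # so all values for that key are routed to the same reducer
    return zlib.crc32(key.encode()) % num_reducers

keys = ["big", "data", "big", "cluster", "big"]
assignments = {k: partition(k, 3) for k in keys}
print(assignments)  # each key maps to exactly one of the 3 reducers
```

Because the mapping from key to reducer is a pure function of the key, no coordination between map tasks is needed to keep a key's values together.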

13) Explain what is a difference between an Input Split and HDFS Block?

The logical division of data is known as an Input Split, while the physical division of data is known as an HDFS Block.

14) Explain what happens in text input format?

In text input format, each line in the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For example, Key: LongWritable, Value: Text.
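A short plain-Python sketch (not Hadoop code; the sample lines are invented) shows how the byte-offset keys are derived:

```python
# Simulate TextInputFormat's (key, value) records: the key is the byte
# offset where the line starts, the value is the line's content.
data = b"hadoop stores big data\nmapreduce processes it\n"

records = []
offset = 0
for line in data.splitlines(keepends=True):
    records.append((offset, line.rstrip(b"\n").decode()))
    offset += len(line)  # next key = current offset + line length in bytes

print(records)  # [(0, 'hadoop stores big data'), (23, 'mapreduce processes it')]
```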

15) Mention what are the main configuration parameters that the user needs to specify to run a MapReduce job?

The user of the MapReduce framework needs to specify:

  • The job's input locations in the distributed file system
  • The job's output location in the distributed file system
  • The input format
  • The output format
  • The class containing the map function
  • The class containing the reduce function
  • The JAR file containing the mapper, reducer, and driver classes

16) Explain what is WebDAV in Hadoop?

WebDAV is a set of extensions to HTTP that supports editing and updating files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.

17) Explain what is Sqoop in Hadoop?

Sqoop is a tool used to transfer data between a relational database management system (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, and data can be exported from HDFS back into an RDBMS.

18) Explain how JobTracker schedules a task?

The TaskTracker sends heartbeat messages to the JobTracker, usually every few minutes, to assure the JobTracker that it is still active and functioning. The message also informs the JobTracker about the number of available slots, so the JobTracker can stay up to date on where in the cluster work can be delegated.

19) Explain what is Sequencefileinputformat?

SequenceFileInputFormat is used for reading files in sequence. It is a specific compressed binary file format optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.

20) Explain what does conf.setMapperClass do?

conf.setMapperClass sets the mapper class and everything related to the map job, such as reading data and generating a key-value pair out of the mapper.

21) Explain what is Hadoop?

It is an open-source software framework for storing data and running applications on clusters of commodity hardware.  It provides enormous processing power and massive storage for any type of data.

22) Mention what is the difference between an RDBMS and Hadoop?

  • RDBMS is a relational database management system; Hadoop is a node-based flat structure
  • RDBMS is used for OLTP processing, whereas Hadoop is currently used for analytical and big data processing
  • In an RDBMS, the database cluster uses the same data files stored in shared storage; in Hadoop, the data can be stored independently on each processing node
  • With an RDBMS, you need to preprocess data before storing it; with Hadoop, you don't need to preprocess data before storing it

23) Mention Hadoop core components?

Hadoop core components include,

  • HDFS
  • MapReduce

24) What is NameNode in Hadoop?

NameNode in Hadoop is where Hadoop stores all the file location information in HDFS. It is the master node on which the JobTracker runs, and it holds the metadata.

25) Mention what are the data components used by Hadoop?

Data components used by Hadoop are

  • Pig
  • Hive

26) Mention what is the data storage component used by Hadoop?

The data storage component used by Hadoop is HBase.

27) Mention what are the most common input formats defined in Hadoop?

The most common input formats defined in Hadoop are:

  • TextInputFormat
  • KeyValueInputFormat
  • SequenceFileInputFormat

28) In Hadoop what is InputSplit?

It splits input files into chunks and assigns each split to a mapper for processing.

29) For a Hadoop job, how would you write a custom partitioner?

To write a custom partitioner for a Hadoop job, follow this path:

  • Create a new class that extends the Partitioner class
  • Override the getPartition method
  • In the wrapper that runs the MapReduce job, add the custom partitioner to the job by using the setPartitionerClass method, or add the custom partitioner to the job as a config file
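The shape of such a class can be sketched in plain Python (Hadoop's real API is in Java, where you extend org.apache.hadoop.mapreduce.Partitioner and override getPartition; the first-letter routing rule below is invented purely for illustration):

```python
class Partitioner:
    # Minimal stand-in for Hadoop's Partitioner base class
    def get_partition(self, key, value, num_reducers):
        raise NotImplementedError

class FirstLetterPartitioner(Partitioner):
    # Custom rule (invented for illustration): route keys by first letter,
    # so 'a'-'m' keys go to one bucket and 'n'-'z' keys to another
    def get_partition(self, key, value, num_reducers):
        bucket = 0 if key[0].lower() <= "m" else 1
        return bucket % num_reducers

p = FirstLetterPartitioner()
print(p.get_partition("hadoop", 1, 2))  # 0 ('h' falls in 'a'-'m')
print(p.get_partition("spark", 1, 2))   # 1 ('s' falls after 'm')
```

A custom rule like this is useful when the default hash would put related keys on different reducers, for example when a sorted or range-based output is wanted.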

30) For a job in Hadoop, is it possible to change the number of mappers to be created?

No, it is not possible to change the number of mappers directly. The number of mappers is determined by the number of input splits.

31) Explain what is a sequence file in Hadoop?

Sequence files are used to store binary key/value pairs. Unlike a regular compressed file, a sequence file supports splitting even when the data inside the file is compressed.

32) When the NameNode is down, what happens to the JobTracker?

The NameNode is the single point of failure in HDFS, so when the NameNode is down, your cluster will be unavailable.

33) Explain how indexing in HDFS is done?

Hadoop has its own way of indexing. Once the data is stored as per the block size, HDFS keeps storing the last part of the data, which indicates where the next part of the data will be.

34) Explain whether it is possible to search for files using wildcards?

Yes, it is possible to search for files using wildcards.
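For instance, HDFS shell commands such as hadoop fs -ls accept glob patterns like /logs/app-*.log. The matching semantics can be demonstrated with Python's standard fnmatch module (the paths below are made up):

```python
from fnmatch import fnmatch

# Hypothetical HDFS paths; '*' matches any run of characters
paths = ["/logs/app-2023-01.log", "/logs/app-2023-02.log", "/logs/readme.txt"]
matches = [p for p in paths if fnmatch(p, "/logs/app-*.log")]
print(matches)  # the two .log files match; readme.txt does not
```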

35) List out Hadoop's three configuration files?

The three configuration files are:

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml

36) Explain how you can check whether the NameNode is working, besides using the jps command?

Besides using the jps command, to check whether the NameNode is working you can also use:

/etc/init.d/hadoop-0.20-namenode status

37) Explain what is “map” and what is “reducer” in Hadoop?

In Hadoop, a map is a phase in HDFS query solving. A map reads data from an input location and outputs a key-value pair according to the input type.

In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.

38) In Hadoop, which file controls reporting in Hadoop?

In Hadoop, the hadoop-metrics.properties file controls reporting.

39) For using Hadoop, list the network requirements?

For using Hadoop, the list of network requirements is:

  • Password-less SSH connection
  • Secure Shell (SSH) for launching server processes

40) Mention what is rack awareness?

Rack awareness is the way the NameNode determines how to place blocks based on the rack definitions.

41) Explain what is a Task Tracker in Hadoop?

A TaskTracker in Hadoop is a slave node daemon in the cluster that accepts tasks from a JobTracker. It also sends heartbeat messages to the JobTracker, every few minutes, to confirm that the JobTracker is still alive.

42) Mention what daemons run on the master node and slave nodes?

The daemon that runs on the master node is the NameNode.

The daemons that run on each slave node are the TaskTracker and DataNode.

43) Explain how you can debug Hadoop code?

The popular methods for debugging Hadoop code are:

  • By using the web interface provided by the Hadoop framework
  • By using counters

44) Explain what are storage and compute nodes?

The storage node is the machine or computer where your file system resides to store the processing data.

The compute node is the computer or machine where your actual business logic is executed.

45) Mention what is the utilization of Context Object?

The Context object enables the mapper to interact with the rest of the Hadoop framework. It includes configuration data for the job, as well as interfaces which allow it to emit output.

46) Mention what is the subsequent stage after Mapper or MapTask?

The next step after Mapper or MapTask is that the output of the Mapper is sorted, and partitions are created for that output.

47) Mention what is the default partitioner in Hadoop?

In Hadoop, the default partitioner is a “Hash” Partitioner.

48) Explain what is the purpose of RecordReader in Hadoop?

In Hadoop, the RecordReader loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.

49) Explain how data is partitioned before it is sent to the reducer if no custom partitioner is defined in Hadoop?

If no custom partitioner is defined in Hadoop, then the default partitioner computes a hash value for the key and assigns the partition based on the result.

50) Explain what happens when Hadoop spawned 50 tasks for a job and one of the tasks failed?

Hadoop will restart the task on some other TaskTracker; if the task still fails more than the defined limit, the whole job is killed.

51) Mention what is the best way to copy files between HDFS clusters?

The best way to copy files between HDFS clusters is by using multiple nodes and the distcp command, so the workload is shared.

52) Mention what is the difference between HDFS and NAS?

HDFS data blocks are distributed across the local drives of all machines in a cluster, while NAS data is stored on dedicated hardware.

53) Mention how Hadoop is different from other data processing tools?

In Hadoop, you can increase or decrease the number of mappers without worrying about the volume of data to be processed.

54) Mention what the JobConf class does?

The JobConf class separates different jobs running on the same cluster. It handles the job-level settings, such as declaring a job in a real environment.

55) Mention what is the Hadoop MapReduce API contract for a key and value class?

For a key and value class, there are two Hadoop MapReduce API contracts:

The value class must implement the Writable interface

The key class must implement the WritableComparable interface

56) Mention what are the three modes in which Hadoop can be run?

The three modes in which Hadoop can be run are:

  • Pseudo-distributed mode
  • Standalone (local) mode
  • Fully distributed mode

57) Mention what does the text input format do?

The text input format creates a line object for each line of input. The value is the text of the whole line, while the key is the byte offset of the line. The mapper receives the value as a 'Text' parameter and the key as a 'LongWritable' parameter.

58) Mention how many InputSplits are made by a Hadoop framework?

For example, for input files of 64KB, 65MB, and 127MB with a 64MB split size, Hadoop will make 5 splits:

  • 1 split for the 64KB file
  • 2 splits for the 65MB file
  • 2 splits for the 127MB file
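The arithmetic behind that classic answer can be checked with a short sketch (assuming a 64MB split size; note that the real FileInputFormat also allows roughly 10% slack on the last split, so this is the simplified interview arithmetic):

```python
import math

SPLIT_SIZE = 64 * 1024 * 1024  # assume a 64MB split size

def num_splits(file_size):
    # Each file gets ceil(file_size / split_size) input splits, at least 1
    return max(1, math.ceil(file_size / SPLIT_SIZE))

files = {"64KB": 64 * 1024, "65MB": 65 * 1024 * 1024, "127MB": 127 * 1024 * 1024}
splits = {name: num_splits(size) for name, size in files.items()}
print(splits, "total:", sum(splits.values()))
# {'64KB': 1, '65MB': 2, '127MB': 2} total: 5
```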

59) Mention what is distributed cache in Hadoop?

Distributed cache in Hadoop is a facility provided by the MapReduce framework. It is used to cache files at the time of execution of the job. The framework copies the necessary files to the slave node before the execution of any task at that node.

60) Explain how the Hadoop CLASSPATH plays a vital role in starting or stopping Hadoop daemons?

The CLASSPATH consists of a list of directories containing the JAR files needed to start or stop the daemons.