Big Data Interview Questions


The era of big data has just begun. With more companies inclined towards big data to run their operations, the demand for talent at an all-time high. What does it mean for you? It only translates into better opportunities if you want to get employed in any of the big data positions. You can choose to become a Data Analyst, Data Scientist, Database administrator, Big Data Engineer, Hadoop Big Data Engineer and so on.

Basic Big Data Interview Questions

Whenever you go for a Big Data interview, the interviewer may ask some basic level questions. Whether you are a fresher or experienced in the big data field, the basic knowledge is required. So, let’s cover some frequently asked basic big data interview questions and answers to crack big data interview.

1. What do you know about the term “Big Data”?

Answer: Big Data is a term associated with complex and large datasets. A relational database cannot handle big data, and that’s why special tools and methods are used to perform operations on a vast collection of data. Big data enables companies to understand their business better and helps them derive meaningful information from the unstructured and raw data collected on a regular basis. Big data also allows the companies to take better business decisions backed by data.

2. What are the five V’s of Big Data?

Answer: The five V’s of Big data is as follows:

  • Volume – Volume represents the volume i.e. amount of data that is growing at a high rate i.e. data volume in Petabytes
  • Velocity – Velocity is the rate at which data grows. Social media contributes a major role in the velocity of growing data.
  • Variety – Variety refers to the different data types i.e. various data formats like text, audios, videos, etc.
  • Veracity – Veracity refers to the uncertainty of available data. Veracity arises due to the high volume of data that brings incompleteness and inconsistency.
  • Value –Value refers to turning data into value. By turning accessed big data into values, businesses may generate revenue.

3. Tell us how big data and Hadoop are related to each other.

Answer: Big data and Hadoop are almost synonyms terms. With the rise of big data, Hadoop, a framework that specializes in big data operations also became popular. The framework can be used by professionals to analyze big data and help businesses to make decisions.

Note: This question is commonly asked in a big data interview. You can go further to answer this question and try to explain the main components of Hadoop.

4. How is big data analysis helpful in increasing business revenue?

Answer: Big data analysis has become very important for the businesses. It helps businesses to differentiate themselves from others and increase the revenue. Through predictive analytics, big data analytics provides businesses customized recommendations and suggestions. Also, big data analytics enables businesses to launch new products depending on customer needs and preferences. These factors make businesses earn more revenue, and thus companies are using big data analytics. Companies may encounter a significant increase of 5-20% in revenue by implementing big data analytics. Some popular companies those are using big data analytics to increase their revenue is – Walmart, LinkedIn, Facebook, Twitter, Bank of America etc.

5. Explain the steps to be followed to deploy a Big Data solution.

Answer: Followings are the three steps that are followed to deploy a Big Data Solution –

i. Data Ingestion

The first step for deploying a big data solution is the data ingestion i.e. extraction of data from various sources. The data source may be a CRM like Salesforce, Enterprise Resource Planning System like SAP, RDBMS like MySQL or any other log files, documents, social media feeds etc. The data can be ingested either through batch jobs or real-time streaming. The extracted data is then stored in HDFS.

6. Define respective components of HDFS and YARN

Answer: The two main components of HDFS are-

  • NameNode – This is the master node for processing metadata information for data blocks within the HDFS
  • DataNode/Slave node – This is the node which acts as slave node to store the data, for processing and use by the NameNode

In addition to serving the client requests, the NameNode executes either of two following roles –

  • CheckpointNode – It runs on a different host from the NameNode
  • BackupNode- It is a read-only NameNode which contains file system metadata information excluding the block locations
  • The two main components of YARN are

    The two main components of YARN are–

    • ResourceManager– This component receives processing requests and accordingly allocates to respective NodeManagers depending on processing needs.
    • NodeManager– It executes tasks on each single Data Node

7. Why is Hadoop used for Big Data Analytics?

Answer: Since data analysis has become one of the key parameters of business, hence, enterprises are dealing with massive amount of structured, unstructured and semi-structured data. Analyzing unstructured data is quite difficult where Hadoop takes major part with its capabilities of

  • Storage
  • Processing
  • Data collection

Moreover, Hadoop is open source and runs on commodity hardware. Hence it is a cost-benefit solution for businesses.

8. What is fsck?

Answer: fsck stands for File System Check. It is a command used by HDFS. This command is used to check inconsistencies and if there is any problem in the file. For example, if there are any missing blocks for a file, HDFS gets notified through this command.

9. What are the main differences between NAS (Network-attached storage) and HDFS?

Answer: The main differences between NAS (Network-attached storage) and HDFS –

  • HDFS runs on a cluster of machines while NAS runs on an individual machine. Hence, data redundancy is a common issue in HDFS. On the contrary, the replication protocol is different in case of NAS. Thus the chances of data redundancy are much less.
  • Data is stored as data blocks in local drives in case of HDFS. In case of NAS, it is stored in dedicated hardware.

10. What is the Command to format the NameNode?

Answer: $ hdfs namenode -format

Experience-based Big Data Interview Questions

If you have some considerable experience of working in Big Data world, you will be asked a number of questions in your big data interview based on your previous experience. These questions may be simply related to your experience or scenario based. So, get prepared with these best Big data interview questions and answers –

11. Do you have any Big Data experience? If so, please share it with us.

How to Approach: There is no specific answer to the question as it is a subjective question and the answer depends on your previous experience. Asking this question during a big data interview, the interviewer wants to understand your previous experience and is also trying to evaluate if you are fit for the project requirement.

So, how will you approach the question? If you have previous experience, start with your duties in your past position and slowly add details to the conversation. Tell them about your contributions that made the project successful. This question is generally, the 2nd or 3rd question asked in an interview. The later questions are based on this question, so answer it carefully. You should also take care not to go overboard with a single aspect of your previous job. Keep it simple and to the point.

12. Do you prefer good data or good models? Why?

How to Approach: This is a tricky question but generally asked in the big data interview. It asks you to choose between good data or good models. As a candidate, you should try to answer it from your experience. Many companies want to follow a strict process of evaluating data, means they have already selected data models. In this case, having good data can be game-changing. The other way around also works as a model is chosen based on good data.

As we already mentioned, answer it from your experience. However, don’t say that having both good data and good models is important as it is hard to have both in real life projects.

13. Will you optimize algorithms or code to make them run faster?

How to Approach: The answer to this question should always be “Yes.” Real world performance matters and it doesn’t depend on the data or model you are using in your project.

The interviewer might also be interested to know if you have had any previous experience in code or algorithm optimization. For a beginner, it obviously depends on which projects he worked on in the past. Experienced candidates can share their experience accordingly as well. However, be honest about your work, and it is fine if you haven’t optimized code in the past. Just let the interviewer know your real experience and you will be able to crack the big data interview.

14. How do you approach data preparation?

How to Approach: Data preparation is one of the crucial steps in big data projects. A big data interview may involve at least one question based on data preparation. When the interviewer asks you this question, he wants to know what steps or precautions you take during data preparation.

As you already know, data preparation is required to get necessary data which can then further be used for modeling purposes. You should convey this message to the interviewer. You should also emphasize the type of model you are going to use and reasons behind choosing that particular model. Last, but not the least, you should also discuss important data preparation terms such as transforming variables, outlier values, unstructured data, identifying gaps, and others.

15. How would you transform unstructured data into structured data?

How to Approach: Unstructured data is very common in big data. The unstructured data should be transformed into structured data to ensure proper data analysis. You can start answering the question by briefly differentiating between the two. Once done, you can now discuss the methods you use to transform one form to another. You might also share the real-world situation where you did it. If you have recently been graduated, then you can share information related to your academic projects.

By answering this question correctly, you are signaling that you understand the types of data, both structured and unstructured, and also have the practical experience to work with these. If you give an answer to this question specifically, you will definitely be able to crack the big data interview.

16. Which hardware configuration is most beneficial for Hadoop jobs?

Dual processors or core machines with a configuration of  4 / 8 GB RAM and ECC memory is ideal for running Hadoop operations. However, the hardware configuration varies based on the project-specific workflow and process flow and need customization accordingly.

17. What happens when two users try to access the same file in the HDFS?

HDFS NameNode supports exclusive write only. Hence, only the first user will receive the grant for file access and the second user will be rejected.

18. How to recover a NameNode when it is down?

The following steps need to execute to make the Hadoop cluster up and running:

  1. Use the FsImage which is file system metadata replica to start a new NameNode.
  2. Configure the DataNodes and also the clients to make them acknowledge the newly started NameNode.
  3. Once the new NameNode completes loading the last checkpoint FsImage which has received enough block reports from the DataNodes, it will start to serve the client.

In case of large Hadoop clusters, the NameNode recovery process consumes a lot of time which turns out to be a more significant challenge in case of routine maintenance.

19. What do you understand by Rack Awareness in Hadoop?

It is an algorithm applied to the NameNode to decide how blocks and its replicas are placed. Depending on rack definitions network traffic is minimized between DataNodes within the same rack. For example, if we consider replication factor as 3, two copies will be placed on one rack whereas the third copy in a separate rack.

20. What is the difference between “HDFS Block” and “Input Split”?

The HDFS divides the input data physically into blocks for processing which is known as HDFS Block.

Input Split is a logical division of data by mapper for mapping operation.