Big Data - The Ultimate Q&A Guide for Beginners
- vsowmiya28
- Nov 19, 2022
1. What is Big Data? Big Data refers to very large volumes of data. As per IBM's definition, any data characterised by the 3V's is called Big Data.
2. What are the 3V's of Big Data?
Volume - The volume of data is huge (e.g., in terabytes and petabytes), too large for traditional single-machine systems to handle.
Variety - Data is present in different varieties:
→ Structured data - RDBMS databases (MySQL, Oracle).
→ Semi-structured data - CSV, XML, JSON formats.
→ Unstructured data - audio, video, image, and log files.
Velocity - The speed at which data is arriving.
Veracity - (often added as a fourth V) refers to the quality of the data; real-world data is often unclean or of poor quality (e.g., data having NULL values).
3. Why Big Data? What problem are you trying to solve by learning Big Data? To store and process huge amounts of data (e.g., terabytes, petabytes), which our traditional systems are incapable of doing.
4. What are the requirements to design a good Big Data system? A good Big Data system should:
Store massive amounts of data.
Process the data in a timely manner.
Scale easily as the data grows.
5. How would you build a scalable system? There are two ways to build a scalable system. 1. Monolithic System → a single system (computer/machine) with a lot of resources.
2. Distributed System → many systems (computers/machines), each with fewer resources, working together.
* Cluster → Many computers/systems/machines joined together form a cluster.
* Node → Each individual computer/system/machine in a cluster is called a node.
* Resources →
→ RAM — Memory — e.g., 8 GB
→ Hard Disk — Storage — e.g., 1 TB
→ CPU — Compute — e.g., Quad Core
6. Which is preferred for scalability, Monolithic or Distributed?
1. Monolithic systems (Vertical / not true scaling) - do not scale well, because a single machine holds all the resources: even if we double the resources of that one machine, performance will not double (2x resources ≠ 2x performance).
2. Distributed systems (Horizontal/Linear/True scaling) - scale well, because we can keep adding machines, each with the same amount of resources, and performance grows in proportion (2x resources = 2x performance).
So, distributed systems are preferred for scalability.
7. What is Hadoop? Hadoop is a framework (a bunch of tools and technologies) used to solve Big Data problems.
8. What are the Core Components of Hadoop? Hadoop 1.0: HDFS and MapReduce. Hadoop 2.0: HDFS, MapReduce and YARN.
9. What is HDFS? HDFS (Hadoop Distributed File System) is a file system for distributed data storage - data is stored across different machines/nodes.
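For a feel of how HDFS is used in practice, here is a minimal sketch that shells out to the standard `hdfs dfs` CLI from Python (the file and directory names are hypothetical; a configured Hadoop client is assumed):

```python
import subprocess

# Create a directory in HDFS, copy a local file into it, then list it.
# Paths and file names are hypothetical; assumes `hdfs` is on the PATH.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local_data.txt", "/user/demo/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
```

Behind the scenes, HDFS splits the file into blocks and spreads (and replicates) those blocks across the nodes of the cluster.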
10. What is MapReduce? It is a processing engine for distributed data processing - data is processed in parallel by many machines/nodes.
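As a concrete illustration, here is the classic word-count example written as a Python mapper and reducer in the style of Hadoop Streaming (a minimal sketch; the input path and job-submission details are omitted):

```python
# mapper.py - reads raw text lines from stdin, emits "word<TAB>1" pairs
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py - receives pairs sorted by word, sums the counts per word
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(n)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(n)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

Each node runs the mapper on its local blocks of data, Hadoop sorts and shuffles the pairs by word, and the reducers aggregate them - that is the sense in which the processing is distributed.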
11. What is YARN?
YARN stands for Yet Another Resource Negotiator; it is used for resource management.
Just as an operating system like Windows manages the resources of a single computer efficiently, YARN acts like an operating system for the cluster: it is not running on a single machine, it runs on top of thousands of machines and manages the resources of all of them.
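A small, hedged example of talking to YARN (assumes a configured Hadoop/YARN client on the machine):

```python
import subprocess

# Ask YARN's ResourceManager which applications are currently running
# anywhere on the cluster, via the standard `yarn application -list` CLI.
subprocess.run(["yarn", "application", "-list", "-appStates", "RUNNING"], check=True)
```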
12. What are Hadoop ecosystem tools?

PIG:
Pig is used to clean data using the Pig Latin scripting language, e.g., removing NULL values from the data.
It is also used to convert unstructured data into structured form, e.g., web server logs have no structure; Pig can convert them into a structured (tabular) form.
Web server log line: ip_address, kind_of_request, date → tabular form: ip_address | kind_of_request | date, laid out as rows and columns.
Pig is an abstraction on top of MapReduce.
Note:
Pig is rarely used in industry these days because Spark does all the work that Pig is capable of doing.
HIVE:
It is a Data Warehouse tool built on top of Hadoop for data querying and analysis.
It is an abstraction on top of MapReduce, i.e., we write SQL-like code and Hive internally converts it to MapReduce code and executes it on the cluster.
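For illustration, a minimal sketch of querying Hive from Python using the third-party pyhive package (the host, table, and column names are hypothetical; a running HiveServer2 is assumed):

```python
from pyhive import hive

# Connect to HiveServer2 and run a SQL-like (HiveQL) query.
# Hive compiles this query into jobs that execute on the cluster.
conn = hive.Connection(host="hive-server.example.com", port=10000, username="demo")
cursor = conn.cursor()
cursor.execute("SELECT ip_address, COUNT(*) AS hits FROM web_logs GROUP BY ip_address")
for row in cursor.fetchall():
    print(row)
```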
SQOOP:
A command-line interface tool to transfer data between relational databases and Hadoop.
Sqoop also internally uses MapReduce.
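A hedged sketch of a typical Sqoop import invoked from Python (the JDBC URL, credentials, table, and paths are all hypothetical):

```python
import subprocess

# Import the `orders` table from MySQL into HDFS. Sqoop launches
# map-only MapReduce jobs (4 here) to copy the rows in parallel.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/shop",
    "--username", "demo",
    "--password-file", "/user/demo/.db-password",
    "--table", "orders",
    "--target-dir", "/data/orders",
    "--num-mappers", "4",
], check=True)
```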
HBASE:
It is a column-oriented NoSQL database which runs on top of Hadoop.
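A minimal sketch of reading and writing HBase from Python with the third-party happybase package (the table and column-family names are hypothetical; a running HBase Thrift server is assumed):

```python
import happybase

# Write one row into the `users` table under the `info` column family,
# then read it back. Column families are how HBase organises columns.
connection = happybase.Connection("hbase-host.example.com")
table = connection.table("users")
table.put(b"user1", {b"info:name": b"Alice", b"info:city": b"Chennai"})
print(table.row(b"user1"))
```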
OOZIE:
It is a workflow scheduler used to schedule Hadoop jobs (MapReduce, Hive, Sqoop, HBase, etc.).
SPARK:
Spark is a distributed, general-purpose, in-memory compute engine.
It is a replacement/alternative for MapReduce; it is not a replacement for Hadoop.

Spark is a plug-and-play engine (see the sketch after this list) which can:
1. Plug into any storage system → LOCAL STORAGE / HDFS / AMAZON S3 / AZURE BLOB STORAGE.
2. Plug into any resource manager → YARN / MESOS / KUBERNETES.
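A minimal sketch of what plug-and-play looks like in code (the master URL, file paths, and bucket names are hypothetical):

```python
from pyspark.sql import SparkSession

# The same Spark program can target different resource managers and
# storage systems just by changing the master URL and the path scheme.
spark = (SparkSession.builder
         .appName("plug-and-play-demo")
         .master("yarn")  # or "local[*]", "mesos://...", "k8s://https://..."
         .getOrCreate())

# Swap the URI scheme to change the storage layer:
df = spark.read.csv("hdfs:///data/sales.csv", header=True)
# df = spark.read.csv("s3a://my-bucket/sales.csv", header=True)
```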

Spark code can be written in Scala, Python, R, or Java. Spark code written in Python is called PySpark.
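As a taste of PySpark, here is a hedged word-count sketch using the DataFrame API (the input path is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("pyspark-wordcount").getOrCreate()

# Read text, split each line into words, and count occurrences of each.
lines = spark.read.text("hdfs:///data/input.txt")
words = lines.select(explode(split(col("value"), r"\s+")).alias("word"))
words.where(col("word") != "").groupBy("word").count().show()
```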
Note:
In the Big Data industry, Spark with Scala and PySpark are widely used.