Question Bank

What is Big Data? Why is it important? What are its characteristics?
1. Big Data refers to data that is extremely large in volume, rapidly generated, and diverse in format, such that traditional data processing systems (like RDBMS) are unable to store, manage, or analyze it efficiently. Big Data typically comes from social media, sensors, mobile devices, online transactions, and various digital interactions.
2. Definition
  1. Big Data is defined as data whose size, speed, and complexity exceed the capabilities of traditional data-processing tools, and which therefore requires advanced technologies such as Hadoop, Spark, and NoSQL databases.
3. Importance of Big Data
  1. Better Decision Making
    
    Organizations use big data analytics for real-time insights to improve strategic decisions.
  2. Enhanced Customer Experience
    
    Companies analyze customer behaviors from mobile apps, browsing patterns, and social media.
  3. Operational Efficiency
    
    Automated data-driven systems optimize supply chains, inventory, and logistics.
  4. Fraud Detection and Security
    
    Banks and financial institutions analyze transactional patterns to detect anomalies.
  5. Innovation and New Business Models
    
    Big Data enables new domains like personalized medicine, smart cities, IoT analytics, etc.
  6. Competitive Advantage
    
    Companies that use Big Data effectively outperform competitors in market understanding and speed.
4. Characteristics of Big Data (The 5 Vs)
  1. Volume
    1. Refers to the huge amount of data generated (TBs, PBs, ZBs).
    2. Example: Facebook has been reported to generate more than 4 PB of data per day.
  2. Velocity
    1. Data arrives at high speed, requiring real-time or near-real-time processing.
    2. Example: stock market transactions, sensor data from IoT.
  3. Variety
    1. Data is diverse: structured, semi-structured, and unstructured.
    2. Example: text, images, videos, logs, machine data.
  4. Veracity
    1. Refers to the quality, accuracy, and trustworthiness of data.
    2. Example: detecting fake news, removing noisy data.
  5. Value
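    1. Often regarded as the most important characteristic, because the other four Vs matter only if the data leads to something useful.
    2. Refers to extracting useful insights from raw data that benefit the organization.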
Explain different types of Big Data with examples.
1. Big Data can be classified into three major types based on the structure and nature of the data:
2. Structured Data
  1. Definition
    1. Structured data follows a fixed schema and is organized in rows and columns, like a traditional database.
  2. Characteristics
    1. Easy to store, retrieve, and query
    2. Uses SQL and relational databases
    3. High data quality
  3. Examples
    1. Employee records (ID, Name, Salary)
    2. Banking transactions
    3. Inventory databases
    4. Student records in an RDBMS
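To make the fixed-schema idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, column names, and sample rows are made up for illustration and simply mirror the employee-record example above.

```python
import sqlite3

# In-memory database; the table, columns, and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary REAL NOT NULL)"
)

rows = [(1, "Asha", 52000.0), (2, "Ravi", 61000.0), (3, "Meera", 48000.0)]
conn.executemany("INSERT INTO employee (id, name, salary) VALUES (?, ?, ?)", rows)

# Every row has the same columns, so a declarative SQL query is enough.
result = conn.execute(
    "SELECT name FROM employee WHERE salary > ? ORDER BY name", (50000,)
).fetchall()
names = [name for (name,) in result]
print(names)  # ['Asha', 'Ravi']

conn.close()
```

The schema is declared once up front, which is what makes storage, retrieval, and querying straightforward for this kind of data.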
3. Semi-Structured Data
  1. Definition
    1. Semi-structured data does not follow a rigid structure like tables, but still contains tags or markers to separate elements.
  2. Characteristics
    1. Flexible format
    2. Can be parsed with standard parsers for formats such as XML and JSON
    3. Often handled poorly by a traditional RDBMS, which expects a fixed schema
  3. Examples
    1. XML files
    2. JSON documents
    3. Log files
    4. HTML pages
    5. Emails (with To, From, Body fields)
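As a rough illustration of semi-structured data, the sketch below parses a few JSON documents with Python's built-in `json` module. The records and field names are hypothetical; the point is that each record carries its own keys (the "tags or markers"), and records need not share the same set of fields.

```python
import json

# Hypothetical records: same general shape, but different fields per record.
raw_records = [
    '{"id": 1, "name": "Asha", "email": "asha@example.com"}',
    '{"id": 2, "name": "Ravi", "phones": ["555-0100", "555-0101"]}',
    '{"id": 3, "name": "Meera", "address": {"city": "Pune"}}',
]

docs = [json.loads(record) for record in raw_records]

# The keys act as markers: each document describes its own structure.
all_fields = sorted({key for doc in docs for key in doc})
print(all_fields)  # ['address', 'email', 'id', 'name', 'phones']

# Missing or nested fields must be handled explicitly when reading.
cities = [doc.get("address", {}).get("city") for doc in docs]
print(cities)  # [None, None, 'Pune']
```

Forcing these records into a single table would need either many nullable columns or several joined tables, which is why flexible document formats are often preferred for this kind of data.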
4. Unstructured Data
  1. Definition
    1. Unstructured data has no predefined structure or schema, so it does not fit naturally into rows and columns.
  2. Characteristics
    1. Most difficult to process
    2. Typically requires Big Data tools (Hadoop, Spark) and ML algorithms to process
    3. Very high volume in modern systems
  3. Examples
    1. Images, videos, audio files
    2. Social media posts (tweets, comments)
    3. Scanned documents
    4. CCTV footage
    5. Medical images (X-rays, MRI scans)
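For unstructured text, any structure has to be derived by processing the content itself. The sketch below is a deliberately simple illustration using only the Python standard library: it tokenizes a few invented social media posts and counts word frequencies. The posts and the stopword list are assumptions for the example; real systems would use far more capable text-processing or ML tooling.

```python
import re
from collections import Counter

# Invented free-text posts: no fields, no schema, just raw text.
posts = [
    "Loving the new phone, the camera is amazing!",
    "Battery life on this phone is terrible...",
    "Camera quality is great, battery could be better.",
]

# A tiny, hand-picked stopword list (an assumption for this sketch).
stopwords = {"the", "is", "on", "this", "be", "could", "new"}

# Lowercase each post and pull out alphabetic tokens.
tokens = [word for post in posts for word in re.findall(r"[a-z]+", post.lower())]

# Word counts are the structure we derive from the raw text.
counts = Counter(token for token in tokens if token not in stopwords)
print(counts["phone"], counts["camera"], counts["battery"])  # 2 2 2
```

Even this small step shows why unstructured data is harder to work with: nothing can be queried until some processing has imposed a structure on it.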
5. Other Classifications (include if required)
  1. Machine-Generated Data
    1. IoT sensor data
    2. Server logs
    3. GPS signals
  2. Human-Generated Data
Enlist and explain different technologies used for handling Big Data.
What is Big Data analytics? Explain its different types.
Explain different Big Data Business Models.
Write a short note on Hadoop. Explain the history and ecosystem.
Explain the different HDFS commands.
Write in detail about the evolution of data management or big data.
Explain the storing and querying (reading and writing) of big data in HDFS.
Draw and explain Hadoop architecture.
What are InputFormat and OutputFormat in MapReduce? Explain its different types.
Explain the map-reduce execution flow with an example.
Explain the concept of a partitioner in MapReduce including sample code, and discuss its advantages and disadvantages.