Skip to content

Data Storage Formats & Database Categories

OLTP vs OLAP

Aspect OLTP OLAP
Purpose Process transactions Analyze data
Operations INSERT, UPDATE, DELETE SELECT, aggregate, join
Query pattern Simple, by primary key Complex, full-table scans
Data scope Current state, single rows Historical, millions of rows
Response time Milliseconds Seconds to minutes
Users App users, services Analysts, dashboards
Schema Normalized (3NF) Denormalized (star/snowflake)
Concurrency Thousands of short txns Few long-running queries
Optimized for Write throughput Read throughput

Database Categories

Category How it stores Best for Examples
Row-oriented Records together OLTP, single-record access MySQL, PostgreSQL, MongoDB (BSON)
Column-oriented Columns together OLAP, aggregations Parquet, ORC, ClickHouse, Redshift
Document Nested JSON/BSON trees Flexible schema, hierarchical data MongoDB, CouchDB, Firestore
Key-Value Hash map (key → blob) Cache, session, config Redis, DynamoDB, etcd
Graph Nodes + edges Relationships, traversals Neo4j, Amazon Neptune, JanusGraph
Time-series Timestamp-indexed Metrics, IoT, logs Prometheus, InfluxDB, TimescaleDB
Wide-column Column families Sparse data, high write throughput Cassandra, HBase, ScyllaDB
Vector Embeddings (float arrays) AI/ML similarity search Pinecone, Milvus, pgvector
Hybrid (HTAP) Row + column combined Both OLTP + OLAP TiDB, SingleStore, AlloyDB

File Formats

Row-Oriented File Formats

  • CSV — Plain text, delimiter-separated. No types, no compression, no indexes. Universal but inefficient.
  • JSON Lines — One JSON object per line. Self-describing schema. Good for streaming.
  • Avro — Binary, schema-embedded, row-oriented. Compact, schema evolution support. Kafka's default.

Columnar File Formats

Parquet

Industry standard columnar format (Apache). Stores data in row groups, each containing column chunks with per-column compression and min/max statistics.

File = [Row Group 1] [Row Group 2] ... [Footer (schema + stats)]
Row Group = [Column Chunk A] [Column Chunk B] ...
Column Chunk = [Page 1 (compressed)] [Page 2] ...
  • Predicate pushdown: skip row groups where min/max doesn't match filter
  • Projection pushdown: read only requested columns
  • Nested data: Dremel encoding (repetition + definition levels)
  • Compression: Snappy (fast), ZSTD (balanced), Gzip (small)

ORC (Optimized Row Columnar)

Hadoop/Hive ecosystem columnar format. Similar to Parquet but with built-in indexes and ACID support.

File = [Stripe 1] [Stripe 2] ... [Footer]
Stripe (~250MB) = [Index] [Data (columnar)] [Stripe Footer]
  • Built-in bloom filters and min/max indexes per stripe
  • Native ACID transactions (Hive)
  • Better for: Hive workloads, UPDATE/DELETE on data lake

Parquet vs ORC

Aspect Parquet ORC
Ecosystem Universal Hive-centric
Nested data Better (Dremel) Struct/List/Map
ACID No Yes (Hive)
Indexes Min/max, optional bloom Built-in bloom + min/max
Adoption Industry standard Hadoop shops

Serialization Formats

  • Protocol Buffers — Google's binary format. Schema-defined (.proto), compact, fast. gRPC's wire format.
  • Thrift — Facebook's equivalent to Protobuf. Less common now.
  • MessagePack — Binary JSON. Smaller than JSON, faster to parse.
  • BSON — Binary JSON (MongoDB's native format). Supports types JSON doesn't (Date, Binary, ObjectId).
  • Arrow — In-memory columnar format. Zero-copy reads, cross-language (Python ↔ Java ↔ Rust). Not a file format but a memory layout.

Storage Engine vs File Format

Layer Purpose Examples
File format How bytes are laid out on disk CSV, Parquet, ORC, Avro
Storage engine How a database manages reads/writes InnoDB, WiredTiger, RocksDB
Database Query interface + storage engine MySQL (InnoDB), MongoDB (WiredTiger)

A database can use multiple file formats (Spark reads CSV, Parquet, ORC, JSON). A file format has no query engine — it's just bytes on disk.

Data Pipeline Pattern

Source (OLTP)  →  ETL/ELT  →  Warehouse (OLAP)  →  BI/Analytics
  MongoDB          Spark        Snowflake            Grafana
  row-oriented     transform    column-oriented      dashboards
  CSV/JSON         validate     Parquet/ORC          SQL queries

When to Use What

Scenario Format Why
API response JSON Human-readable, universal
Kafka events Avro Schema registry, compact, evolution
Data lake storage Parquet Columnar, compressed, predicate pushdown
ML training data Parquet/Arrow Fast column reads, zero-copy
Config files YAML/JSON Human-editable
Log shipping JSON Lines Streamable, one record per line
Database backup BSON (mongodump) Native format, preserves types
Inter-service (gRPC) Protobuf Smallest, fastest, typed

Features & Feature Engineering

A feature is a measurable property of data used as input to a model or analysis — a column in your dataset that carries signal.

Example

Feature Value Type
square_meters 120 Numeric
num_bedrooms 3 Numeric
has_garage true Boolean
neighborhood "downtown" Categorical
price 450000 Target (what you predict)

Everything except the target is a feature.

Feature Engineering Techniques

Technique Raw → Feature Example
Extraction Raw data → usable column Timestamp → hour_of_day, is_weekend
Transformation Scale/normalize pricelog(price), min-max scaling
Encoding Category → number "downtown" → one-hot [1,0,0]
Aggregation Multiple rows → one Orders → avg_order_value_30d
Interaction Combine features price / square_meters = price_per_sqm
Selection Drop irrelevant Remove customer_id (no predictive value)

Feature Store

In production, a feature store manages features at scale:

Raw Data → Feature Pipeline → Feature Store → Model Training / Serving
  (Kafka)    (Spark/Flink)     (Feast/Tecton)   (batch / real-time)

Solves:

  • Consistency — same feature definition for training and serving
  • Reuse — teams share features instead of rebuilding
  • Time travel — get feature values as of a past timestamp

Examples: Feast, Tecton, Hopsworks, AWS SageMaker Feature Store.

Data Preparation Process

Data preparation transforms raw data into clean, structured data ready for analysis or ML. Typically 60-80% of a data engineer's work.

Pipeline

Raw Data → Collect → Clean → Transform → Validate → Store → Serve

1. Data Collection (Ingestion)

Method Pattern Example
Batch Scheduled bulk load Daily CSV dump from ERP
Streaming Real-time events Kafka topic from IoT sensors
API pull Request-response REST API polling every 5 min
CDC Change Data Capture Debezium watching DB binlog

2. Data Profiling

Understand what you have before cleaning: row count, column types, null percentages, value distributions, min/max, cardinality, anomalies.

3. Data Cleaning

Problem Technique
Missing values Drop, impute (mean/median/mode), forward-fill
Duplicates Dedup by key, keep latest
Inconsistent formats "2026-04-21" vs "04/21/2026" → standardize
Outliers Z-score filter, IQR bounds, domain rules
Invalid values Age = -5, email without @ → reject or fix
Encoding issues UTF-8 normalization, strip BOM

4. Data Transformation

Operation What Example
Normalization Scale to range Min-max → [0,1]
Standardization Zero mean, unit variance Z-score
Encoding Category → numeric One-hot, label, target encoding
Aggregation Summarize Daily sales → monthly totals
Joining Combine sources Orders + Customers on customer_id
Pivoting Reshape Rows → columns (wide format)
Binning Continuous → discrete Age → [0-18, 19-35, 36-60, 60+]
Derived features Create new columns revenue = price × quantity

5. Data Validation

Check Rule Action on fail
Schema Column types match expected Reject batch
Completeness No nulls in required fields Alert + impute
Freshness Data arrived within SLA Alert ops team
Volume Row count within ±20% of expected Investigate
Referential integrity Foreign keys exist Quarantine rows
Business rules price > 0, end_date > start_date Reject record

6. Data Storage — Medallion Architecture

Bronze (raw)  →  Silver (cleaned)  →  Gold (business-ready)
  as-is            deduped, typed       aggregated, joined
  Parquet           Parquet/Delta        Parquet/Delta
  "data lake"       "data warehouse"     "data mart"

Tools by Stage

Stage Tools
Ingestion Kafka, Airbyte, Fivetran, Debezium
Profiling Great Expectations, pandas-profiling, dbt
Cleaning Spark, pandas, dbt
Transformation Spark, dbt, Flink, SQL
Validation Great Expectations, dbt tests, Soda
Orchestration Airflow, Dagster, Prefect
Storage S3/GCS + Parquet, Delta Lake, Iceberg

Data Serving

Data serving is the final mile — delivering prepared data to consumers (apps, dashboards, ML models, APIs).

Serving Patterns

Pattern Latency Consumer Example
Batch query Seconds–minutes Analysts, BI SQL on warehouse (Snowflake, BigQuery)
Interactive query Sub-second Dashboards Presto/Trino, Druid, ClickHouse
API serving Milliseconds Applications REST/GraphQL over cache or DB
Embedded analytics Milliseconds End users Pre-computed aggregates in app DB
ML model serving Milliseconds Inference Feature store → model endpoint
Pub/Sub Real-time Downstream systems Kafka topic, Redis Streams
Materialized views Near real-time Mixed Pre-joined, auto-refreshed tables

Serving Layer Architecture

Gold Layer (warehouse)
    ├── BI / Dashboards  ← SQL queries (Tableau, Grafana, Looker)
    ├── API Layer         ← REST/GraphQL (FastAPI, Express)
    │     └── Cache       ← Redis, CDN (reduce DB load)
    ├── ML Serving        ← Feature store → model (SageMaker, MLflow)
    ├── Search            ← Elasticsearch, OpenSearch
    └── Reverse ETL       ← Push data back to SaaS (Salesforce, HubSpot)

Key Concepts

Concept What Why
Caching Store frequent results in memory Reduce latency and DB load
Pre-aggregation Compute summaries ahead of time Dashboard loads in ms, not minutes
Materialized views Auto-refreshed query results Fresh data without full recompute
Data API Expose data via REST/GraphQL Decouple consumers from storage
Reverse ETL Push warehouse data to operational tools Sync analytics back to CRM, marketing
Data contracts Schema + SLA between producer and consumer Prevent breaking changes

Serving Trade-offs

Dimension Batch Real-time
Freshness Minutes–hours Seconds
Cost Low (scheduled) High (always-on)
Complexity Simple Complex (streaming infra)
Consistency Strong (full recompute) Eventual
Use case Reports, training data Alerts, live dashboards, fraud detection

Serving Tools

Category Tools
SQL warehouse Snowflake, BigQuery, Redshift, Databricks SQL
Interactive query Trino, Presto, Druid, ClickHouse
API gateway Kong, AWS API Gateway, GraphQL mesh
Caching Redis, Memcached, Varnish
ML serving SageMaker, Vertex AI, MLflow, BentoML
Search Elasticsearch, OpenSearch, Meilisearch
Reverse ETL Census, Hightouch, Polytomic

Data Sampling

Selecting a representative subset from a large dataset to reduce cost/time while preserving statistical properties.

Sampling Methods

Method How When to use
Random Each row has equal probability General purpose, balanced data
Stratified Sample proportionally from each group Imbalanced classes (95% normal, 5% fraud)
Systematic Every k-th row Ordered data, quick approximation
Cluster Randomly select entire groups Data naturally grouped (by region, store)
Reservoir Fixed-size sample from stream Streaming data, unknown total size
Weighted Higher probability for important rows Rare events, cost-sensitive

Python Examples

import pandas as pd

df = pd.read_parquet("large_dataset.parquet")

# Random: 10% of rows
sample = df.sample(frac=0.10, random_state=42)

# Fixed count: exactly 1000 rows
sample = df.sample(n=1000, random_state=42)

# Stratified: preserve class distribution
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, stratify=df["label"])

# Systematic: every 10th row
sample = df.iloc[::10]

# Weighted: oversample rare class
weights = df["label"].map({0: 1, 1: 10})
sample = df.sample(n=1000, weights=weights, random_state=42)

Reservoir Sampling (streaming)

import random

def reservoir_sample(stream, k):
    """Keep k items from unknown-length stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

Sampling Pitfalls

Pitfall Problem Fix
Selection bias Sample doesn't represent population Use stratified sampling
Too small High variance, unreliable stats Calculate minimum sample size
Time bias Only recent data sampled Include historical range
Survivorship bias Only successful cases in sample Include failures/dropouts

Data Processing

The compute-heavy middle stage — transforming cleaned data into analysis-ready outputs.

Prepared Data → Filter → Join → Aggregate → Enrich → Compute → Output

Core Steps

Step What Example
Filtering Remove irrelevant rows/columns WHERE date >= '2026-01-01'
Joining Combine datasets by key Orders + Products on product_id
Aggregation Group + summarize SUM(revenue) GROUP BY region, month
Enrichment Add external data Append weather data to sales by date+location
Deduplication Remove duplicate records Keep latest by updated_at per id
Sorting Order for downstream use Sort by timestamp for time-series
Partitioning Split by key for parallel processing Partition by date or region
Windowing Compute over sliding ranges 7-day rolling average, rank within group
Pivoting Reshape rows ↔ columns Long → wide format for reporting
Derived columns Compute new fields profit = revenue - cost

Processing Paradigms

Paradigm When Tools
Batch Process all data at once (hourly/daily) Spark, Hive, dbt, SQL
Micro-batch Small frequent batches (seconds) Spark Structured Streaming
Stream Process each event as it arrives Flink, Kafka Streams, Beam
Interactive Ad-hoc queries by humans Trino, Presto, BigQuery

ETL vs ELT

ETL ELT
Transform where Before loading (pipeline) After loading (in warehouse)
Tools Spark, Informatica dbt, SQL in Snowflake/BigQuery
Trend Legacy Modern (warehouse is powerful enough)

Processing Frameworks

Framework Language Batch Stream Scale
Spark Python/Scala/SQL Massive
Flink Java/Python/SQL ✅ (best) Massive
dbt SQL Warehouse-bound
Pandas Python Single machine
Polars Python/Rust Single machine (fast)
Beam Java/Python Portable (Spark/Flink/Dataflow)