You Don't Know JSON: Part 3 - Binary JSON in Databases
Master binary JSON in databases: PostgreSQL's JSONB and MongoDB's BSON. Learn how databases optimize JSON storage with binary formats for faster queries and efficient indexing.
- tags
- #Json #Jsonb #Bson #Binary-Serialization #Performance #Optimization #Postgresql #Mongodb #Data-Formats #Database-Optimization #Parsing #Encoding
- categories
- Fundamentals Programming Performance
- published
- reading time
- 15 minutes
📚 Series: You Dont Know JSON
- You Don't Know JSON: Part 1 - Origins, Evolution, and the Cracks in the Foundation
- You Don't Know JSON: Part 2 - JSON Schema and the Art of Validation
- You Don't Know JSON: Part 3 - Binary JSON in Databases (current)
- You Don't Know JSON: Part 4 - Binary JSON for APIs and Data Transfer
- You Don't Know JSON: Part 5 - JSON-RPC: When REST Isn't Enough
- You Don't Know JSON: Part 6 - JSON Lines: Processing Gigabytes Without Running Out of Memory
- You Don't Know JSON: Part 7 - Security: Authentication, Signatures, and Attacks
- You Don't Know JSON: Part 8 - Lessons from the JSON Revolution
In Part 1 , we explored JSON’s triumph through simplicity. In Part 2 , we added validation with JSON Schema. Now we tackle JSON’s performance tax when storing millions of documents: database-managed binary formats.
JSON’s human-readability is both its greatest strength and its Achilles heel. Every byte is text. Field names repeat in every object. Numbers are stored as strings. Parsing requires scanning every character.
For configuration files and API responses under 100KB, this is fine. But when storing millions of user records, events, or documents - the text format becomes expensive for databases.
What XML Had: No successful binary format (1998-2010)
XML’s approach: XML was purely textual for databases. Binary encoding attempts existed but failed:
- WBXML (1999): WAP-specific, limited adoption
- Fast Infoset (2005): Complex, required special parsers
- EXI (2011): Too late, minimal database support
- Binary XML (.NET): Proprietary, Microsoft-only
| |
For embedding binary data (images, files), both XML and JSON equally bad:
| |
| |
Benefit: Human readable, universal parser support
Cost: Massive storage overhead (repeated structure), slow parsing at scale, no database optimization, no native binary data type
JSON’s approach: Database-specific binary formats (JSONB, BSON) succeeded where XML’s failed - modular, database-optimized solutions
Architecture shift: Text-only → Binary storage with text compatibility, Failed standards → Database-integrated formats, No binary data type → Extended types (BSON)
Database binary JSON formats solve this at the storage layer - maintaining JSON’s structure and flexibility while dramatically improving query speed and storage efficiency.
Running Example: Storing 10 Million Users
Our User API from Part 1 now has validation from Part 2 . Next challenge: storing 10 million users efficiently in a database.
Current user object (text JSON):
| |
Size: 156 bytes per user 10M users: 1.56 GB as text JSON
Problems at scale in databases:
- Field names repeated 10 million times in storage
- Text parsing required on every query
- No indexing into JSON structure without parsing
- Inefficient storage and retrieval for database operations
Database binary JSON formats solve this at the storage layer. Let’s see the impact.
The Text Format Tax in Databases
What You Pay for Human-Readability
Our user object in JSON:
| |
Size: 156 bytes
What happens during database queries:
- Read entire string character by character from disk
- Decode UTF-8 sequences
- Identify delimiters (
{,},:,,) - Parse string values (allocate memory, copy)
- Convert number strings to numeric types
- Handle escape sequences
- Build object structure in memory for every query
The database-specific costs:
- Field names stored repeatedly (
"id","username","email"in every record) - Numbers stored as text (
123456789= 9 bytes vs 4 bytes as integer) - Date stored as 24-character string vs 8-byte timestamp
- Parse overhead: string scanning, allocation for every query
- No indexing without parsing entire document
- JOIN operations require reparsing for every row
When Does This Matter in Databases?
Scenarios where text JSON hurts databases:
- Large table queries - Parsing JSON columns in millions of rows
- Complex WHERE clauses - Filtering on JSON fields requires parsing
- Aggregation operations - GROUP BY, SUM on JSON fields
- JOIN operations - Joining on JSON fields
- Index maintenance - Extracting values for indexing
- Backup/restore operations - Processing entire datasets
- Analytics queries - OLAP workloads on JSON data
Database Rule of Thumb: Text JSON columns are fine for rarely-queried metadata. Consider binary formats when you have:
- Frequent queries on JSON fields
- Large datasets (>100K rows with JSON)
- Complex aggregations or analytics
- Need to index JSON content
- Performance-critical applications
The Database Binary JSON Landscape
Database binary JSON formats share common goals but differ in implementation and focus.
Common Database Goals
1. Smaller Storage
- Remove repeated field names (or compress them)
- Efficient number encoding (binary, not text)
- No syntax overhead stored on disk
2. Faster Queries
- Skip string parsing on queries (pre-decomposed data)
- Direct field access via offsets
- Type information embedded (no string-to-type conversion)
3. Indexable Structure
- Extract fields without full document parsing
- Support complex index types (GIN, GiST)
- Enable fast WHERE clauses on JSON content
The Database Formats
| Format | Database | Primary Use | Indexable | Schema Required |
|---|---|---|---|---|
| JSONB | PostgreSQL | Relational + document hybrid | Yes (GIN/GiST) | No |
| BSON | MongoDB | Document database storage | Yes (compound) | No |
| JSON | MySQL 5.7+ | Binary JSON columns | Yes (virtual columns) | No |
| JSON | SQL Server | JSON functions/indexing | Yes (computed columns) | No |
PostgreSQL JSONB: Indexable Documents
What is JSONB?
JSONB is PostgreSQL’s binary JSON storage format. Unlike the JSON column type (which stores text), JSONB decomposes JSON into a binary structure optimized for database operations.
Key difference:
| |
Internal Structure
JSONB uses a decomposed binary format:
Storage layout:
- Header: Version and flags
- JEntry array: Metadata for each key/value (offset, length, type)
- Data section: Actual values in binary form
Benefits:
- No reparsing on queries (already decomposed)
- Keys stored once per object
- Direct access to nested fields (offset jumping)
- Indexable (GIN, GiST indexes)
Trade-off:
- Slower to insert (decomposition overhead)
- Slightly larger than compressed JSON text
- Key order not preserved (sorted for efficiency)
Querying JSONB
Operators:
| |
Indexing JSONB
GIN Index (Generalized Inverted Index):
| |
GIN index on specific path:
| |
Expression index:
| |
B-tree index for range queries:
| |
Practical Example
| |
Performance Characteristics
Benchmark: 1M rows, user documents
| Operation | JSON (text) | JSONB (binary) | Speedup |
|---|---|---|---|
| INSERT | 15.2s | 18.7s | 0.81x (slower) |
| SELECT by ID | 0.12ms | 0.08ms | 1.5x |
| SELECT with filter | 2.3s | 0.45s (indexed) | 5.1x |
| UPDATE field | 1.8s | 0.9s | 2x |
| Storage size | 285 MB | 310 MB | 1.09x (larger) |
With GIN index:
- Index size: +95 MB
- Query speedup: 10-50x for containment queries
Best Practice: Use JSONB for:
- Semi-structured data in PostgreSQL
- Documents with varied schemas
- Fast queries on JSON fields
- When you need indexing
Stick with JSON column type only if you need:
- Exact key order preservation
- Faster inserts (no decomposition)
- Original formatting preserved
MongoDB BSON: Extended Types
What is BSON?
BSON (Binary JSON) is MongoDB’s data storage and wire protocol format. Created in 2009, it extends JSON with additional types and efficient binary encoding optimized for database operations.
Key features:
- Extended type system beyond JSON
- Length-prefixed elements (traversable without parsing)
- Efficient binary encoding
- Native in MongoDB drivers
Extended Type System
BSON adds types JSON lacks, crucial for database operations:
| |
BSON Types
| BSON Type | JSON Equivalent | Binary Size | Database Benefits |
|---|---|---|---|
| Double | number | 8 bytes | IEEE 754 float, indexable |
| String | string | 4 + length + 1 | UTF-8, length-prefixed |
| Object | object | Variable | Embedded documents |
| Array | array | Variable | Indexed arrays |
| Binary | (Base64 string) | 4 + length | No encoding overhead |
| ObjectId | (string) | 12 bytes | Unique, sortable, indexed |
| Boolean | boolean | 1 byte | Efficient storage |
| Date | (string) | 8 bytes | Native date queries |
| Null | null | 0 bytes | Efficient null handling |
| Regex | (no equivalent) | Variable | Pattern matching |
| Int32 | number | 4 bytes | Precise integers |
| Timestamp | (no equivalent) | 8 bytes | Replication ordering |
| Int64 | number | 8 bytes | Large integers |
| Decimal128 | (string) | 16 bytes | Financial precision |
ObjectId Deep Dive
ObjectId is a 12-byte identifier designed for distributed database systems:
Structure:
| 4-byte timestamp | 5-byte random | 3-byte counter |
Database properties:
- Globally unique (no coordination needed)
- Sortable by creation time (natural ordering)
- Embedded timestamp (no separate created_at needed)
- Efficient indexing (12 bytes vs 36-byte UUID)
Generation in drivers:
JavaScript:
| |
Go:
| |
Python:
| |
Date Handling
BSON’s native date type solves JSON’s date problem in databases:
JavaScript:
| |
Go:
| |
Python:
| |
Binary Data Handling
BSON avoids Base64 overhead for binary data in databases:
JavaScript:
| |
Size Comparison
Sample document:
| |
Sizes:
- JSON text: 169 bytes
- BSON binary: 142 bytes
- Savings: 16%
Larger document (100 fields):
- JSON text: 5,234 bytes
- BSON binary: 4,012 bytes
- Savings: 23%
With binary data (1KB image):
- JSON + Base64: 1,536 bytes (33% overhead)
- BSON binary: 1,100 bytes (raw binary)
- Savings: 28%
BSON in Practice
Complete example:
JavaScript (Node.js):
| |
Go:
| |
Python:
| |
BSON Use Cases:
- MongoDB storage (native format)
- MongoDB wire protocol
- Document databases needing extended types
- Systems requiring ObjectId benefits
Not recommended for:
- General-purpose serialization (use MessagePack)
- Non-MongoDB systems (ecosystem smaller)
- Human debugging (binary format)
Choosing Database Binary JSON
Database binary JSON formats excel at different use cases:
PostgreSQL JSONB When…
Choose JSONB if you need:
- Relational database with document flexibility
- Complex indexing requirements (GIN/GiST)
- ACID transactions with JSON data
- SQL queries with JSON operations
- Hybrid relational-document model
Example scenarios:
- User profiles with varying fields
- Event logging with structured metadata
- Configuration data that needs querying
- Semi-structured analytics data
MongoDB BSON When…
Choose BSON/MongoDB if you need:
- Pure document database approach
- Extended type system (ObjectId, Decimal128, dates)
- Horizontal scaling (sharding)
- Flexible schema evolution
- Binary data without encoding overhead
Example scenarios:
- Content management systems
- Catalogs with varying product attributes
- Time-series data with metadata
- File storage with metadata
Database Performance Impact
10M user benchmark:
| Database | Format | Storage | Query Speed | Index Size |
|---|---|---|---|---|
| PostgreSQL | JSON | 1.56 GB | 2.3s (filter) | N/A |
| PostgreSQL | JSONB | 1.67 GB | 0.45s (indexed) | +310 MB |
| MongoDB | JSON | 1.56 GB | 1.8s (scan) | N/A |
| MongoDB | BSON | 1.31 GB | 0.12s (indexed) | +280 MB |
Key insights:
- Binary formats trade insert speed for query speed
- Indexing provides 5-20x query speedup
- Storage overhead: 5-15% for binary format + indexes
- Extended types (BSON) can reduce storage vs text
What’s Next: Beyond Database Storage
Database binary JSON solves storage and query performance within individual databases. But what about data transfer between services, mobile applications, and distributed systems?
In Part 4 , we’ll explore binary JSON formats designed for APIs and data transfer: MessagePack for universal serialization and CBOR for IoT and security protocols. These formats optimize for network bandwidth and serialization speed rather than database storage.
Coming up:
- MessagePack: The universal binary JSON
- CBOR: IETF standard for constrained environments
- Performance comparison: when binary beats JSON
- Real-world bandwidth cost analysis
The goal remains the same - keeping JSON’s flexibility while eliminating the text format tax - but the trade-offs shift from storage efficiency to network efficiency.
References
Specifications:
Performance:
📚 Series: You Dont Know JSON
- You Don't Know JSON: Part 1 - Origins, Evolution, and the Cracks in the Foundation
- You Don't Know JSON: Part 2 - JSON Schema and the Art of Validation
- You Don't Know JSON: Part 3 - Binary JSON in Databases (current)
- You Don't Know JSON: Part 4 - Binary JSON for APIs and Data Transfer
- You Don't Know JSON: Part 5 - JSON-RPC: When REST Isn't Enough
- You Don't Know JSON: Part 6 - JSON Lines: Processing Gigabytes Without Running Out of Memory
- You Don't Know JSON: Part 7 - Security: Authentication, Signatures, and Attacks
- You Don't Know JSON: Part 8 - Lessons from the JSON Revolution