Identify the workload
- Usability Scenarios
- Production Logs & Stats
- Business Domain Experts
- Data Modeling Expert
- Assumptions on workloads
- frequencies
- relationship cardinalities (minimum 0? maximum 24? maximum 2,500?)
Example:
- an IoT EDB
- 100 million weather sensors sending data
- collect the data
- make it available to 10 data scientists
- MOST trends can be deduced hourly
- no logs or stats to leverage
- data can be collected and sent 1x per minute
- need to keep data for more than 10 years
- ops team needs to validate faulty devices
- ops team needs to be able to aggregate data for the data scientists
- data scientists need to explore and find trends
list the CRUD
Item/Person | Use-Case | Data | CRUD |
---|---|---|---|
Devices | Sending data every minute | device_id, metrics | WRITE |
Ops Team | ID busted devices | device_id, timestamp | READ |
Ops Team | aggregate hourly data sets | device_id, metrics | READ / WRITE |
Data Scientist | run 10 analytical queries hourly | metrics, aggregated metrics | READ |
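To make the CRUD table concrete, here is a sketch of what the underlying documents might look like; any field names beyond device_id, timestamp, and metrics (e.g. temperature_c, sample_count) are illustrative assumptions, not part of the stated requirements.

```python
from datetime import datetime, timezone

# One raw reading, written by a device every minute.
raw_reading = {
    "device_id": "sensor-000042",
    "timestamp": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    "metrics": {"temperature_c": 21.4, "humidity_pct": 48.0},
}

# One hourly aggregate, produced by the ops team for the data scientists.
hourly_aggregate = {
    "device_id": "sensor-000042",
    "hour": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    "metrics": {"avg_temperature_c": 21.1, "max_temperature_c": 23.0},
    "sample_count": 60,  # one reading per minute
}
```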
Understand the write operations
- sent by sensors
- sent to the server
- WRITE / INSERT
- data has device ID, timestamp and metrics
- roughly 1.6-1.7M writes per second (100M sensors sending once per minute ≈ 100M ÷ 60 s); a database partitioned into 10-20 shards can handle this
- Data size ≈ 1,000 bytes per write
- Life of data is 10 years
- Does not need extensive durability; multi-node majority confirmation on each write is not required: even though data arrives once per minute, it will most often be consumed as hourly aggregates
- consider grouping (batching) the writes because there are so many (see the sketch below)
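A minimal sketch of that write path, assuming pymongo, an illustrative iot.readings collection, and the relaxed (w=1, non-majority) write concern described above; batching groups many per-minute readings into a single insert.

```python
from datetime import datetime, timezone
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")

# w=1 skips multi-node majority confirmation, per the durability note above.
readings = client["iot"].get_collection(
    "readings", write_concern=WriteConcern(w=1)
)

def flush(batch):
    """Write a whole batch of readings in one round trip."""
    if batch:
        readings.insert_many(batch, ordered=False)

# Group incoming sensor payloads instead of inserting them one by one.
batch = [
    {
        "device_id": f"sensor-{i:06d}",
        "timestamp": datetime.now(timezone.utc),
        "metrics": {"temperature_c": 20.0 + (i % 5)},
    }
    for i in range(1_000)
]
flush(batch)
```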
Understand the read operations
- most queries will be on temperature data
- read queries
- 100 queries per hour: 10 scientists, 10 reqs per person
- will require collection scans
- mostly use the last hour's worth of data (see the query sketch below)
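A sketch of one such analytical read over the last hour of data, reusing the assumed iot.readings collection from the write example; the pipeline averages the temperature metric per device.

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
readings = client["iot"]["readings"]

one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)

# Average temperature per device over the last hour's worth of data.
pipeline = [
    {"$match": {"timestamp": {"$gte": one_hour_ago}}},
    {"$group": {
        "_id": "$device_id",
        "avg_temperature_c": {"$avg": "$metrics.temperature_c"},
    }},
]
for row in readings.aggregate(pipeline):
    print(row["_id"], row["avg_temperature_c"])
```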
Understand Relationships
- What are the relationships?
- How many of each relationship are there (cardinality)?
- Should these relationships be embedded or linked? (see the sketch below)
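For the IoT example, the main relationship is device-to-readings. Below is a sketch of the two options; device fields such as model are illustrative assumptions. At roughly 1,440 readings per device per day, embedding grows without bound, which points toward linking.

```python
# Linked: each reading references its device by device_id.
device = {"_id": "sensor-000042", "model": "wx-2"}
reading = {
    "device_id": "sensor-000042",
    "timestamp": "2024-01-01T12:00:00Z",
    "metrics": {"temperature_c": 21.4},
}

# Embedded: readings live inside the device document. Fine for small,
# bounded relationships, but unbounded for once-per-minute sensor data.
device_with_embedded_readings = {
    "_id": "sensor-000042",
    "model": "wx-2",
    "readings": [
        {"timestamp": "2024-01-01T12:00:00Z", "metrics": {"temperature_c": 21.4}},
        {"timestamp": "2024-01-01T12:01:00Z", "metrics": {"temperature_c": 21.5}},
    ],
}
```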
Apply patterns
- recognize patterns
- apply patterns
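One pattern that fits the hourly-aggregation need above is pre-computing hourly summaries into their own collection (often called the computed pattern). A sketch, assuming MongoDB 5.0+ for $dateTrunc and the same illustrative collection names as before:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["iot"]

pipeline = [
    # Summarize raw readings per device per hour.
    {"$group": {
        "_id": {
            "device_id": "$device_id",
            "hour": {"$dateTrunc": {"date": "$timestamp", "unit": "hour"}},
        },
        "avg_temperature_c": {"$avg": "$metrics.temperature_c"},
        "sample_count": {"$sum": 1},
    }},
    # Upsert the summaries so the hourly job can be re-run safely.
    {"$merge": {"into": "hourly_readings", "whenMatched": "replace"}},
]
db["readings"].aggregate(pipeline)
```

The data scientists can then query hourly_readings instead of scanning the raw collection.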
A takeaway
Consider leveraging a dedicated node for analytics.
Primary for writes, secondary for reads.
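A sketch of that split in pymongo, assuming a replica set in which one secondary is tagged for analytics; the tag name and connection string are assumptions, not a prescribed setup.

```python
from pymongo import MongoClient
from pymongo.read_preferences import Secondary

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

# Writes keep going to the primary by default; analytics reads are routed
# to a secondary tagged for that workload.
analytics_readings = client["iot"]["readings"].with_options(
    read_preference=Secondary(tag_sets=[{"workload": "analytics"}])
)
latest = analytics_readings.find_one(sort=[("timestamp", -1)])
```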
A Flexible Methodology For Modeling Data
Goal | Shooting For Simplicity | Between Simplicity & Performance | Shooting For Optimal Performance |
---|---|---|---|
Describe the Workload | ID most-frequent operations | | ID ALL operations, quantify ALL of them |
ID & Model Relationships | Embed a lot of content: larger objects, less querying | | Embed AND link |
Apply Patterns | Few patterns (may include data duplication) | | Many patterns for many details |
An Example, Data For A Coffee Shop
Business Needs
- 1K stores
- make killer coffee
- stick to a strict coffee recipe
- use smart && automated coffee hardware (shelving stock systems, coffee makers, etc.)
Describe Data Workload
Query | Operation Type | Business Description | Quantify Freq. | Qualify Importance |
---|---|---|---|---|
Get/Set the weight of coffee beans on a shelf | Write, when a person takes coffee off the shelf or a stocker adds to it | Shelf sends an event when coffee bags are removed or added | 1 write per second per shelf; 10 shelves per store; 1K stores | Critical: this is the most granular detail about inventory management. Ensuring the write succeeded matters, so leverage a majority writeConcern (see the sketch after this table) |
Read how much coffee was consumed by the store & customers | Read as analytics | Show how much coffee was consumed and forecast how much should be ordered the next day | | |
Find anomalies in the inventory | Read as analytics | Gain insights into unexpected inventory details | Read 1x per hour, runs against the whole dataset | Stale data is OK; full-collection scans should run on a redundant node (secondary) in a replica set |
Capture coffee-making details | Writes from coffee machines (temp, weight, speed, water, etc.) | Coffee machine reports these details for each cup of coffee made | Lots of writes: many per cup of coffee made | Non-critical |
Analyze the details of coffee-making | Read as business analytics | Help the org through insights | | |
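A sketch of the "Critical" shelf-weight write from the table, using a majority writeConcern; the coffee.shelf_events collection and field names are illustrative assumptions.

```python
from datetime import datetime, timezone
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

# Majority write concern: the event is acknowledged only after a majority
# of replica-set members have it, matching the "Critical" qualification.
shelf_events = client["coffee"].get_collection(
    "shelf_events", write_concern=WriteConcern(w="majority")
)

shelf_events.insert_one({
    "store_id": "store-0042",
    "shelf_id": 7,
    "weight_g": 10_450,  # current weight of coffee bags on the shelf
    "timestamp": datetime.now(timezone.utc),
})
```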
Describe Storage Needs
About Coffee Cups | About Inventory |
---|---|
One Year of Data | One Year Of Data |
10,000 × 1,000 writes per day (≈10M/day), every day | 10,000 × 10 writes per day (≈100K/day), every day |
≈365 billion bytes / yr | ≈3.7 billion bytes / yr |
≈370 GB / yr | ≈3.7 GB / yr |
NOTE: sharding is best with at least ~1TB of data; based on data size alone, this workload does not need sharding
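A back-of-the-envelope check of the table above, assuming roughly 100 bytes per write, which is the size that makes the stated daily write counts and yearly totals line up.

```python
BYTES_PER_WRITE = 100  # assumption; not stated explicitly in the notes
DAYS_PER_YEAR = 365

cup_writes_per_day = 10_000 * 1_000        # ≈ 10M coffee-cup writes/day
inventory_writes_per_day = 10_000 * 10     # ≈ 100K shelf writes/day

cup_bytes_per_year = cup_writes_per_day * DAYS_PER_YEAR * BYTES_PER_WRITE
inv_bytes_per_year = inventory_writes_per_day * DAYS_PER_YEAR * BYTES_PER_WRITE

print(f"coffee cups: {cup_bytes_per_year / 1e9:,.0f} GB/yr")  # ~365 GB/yr
print(f"inventory:   {inv_bytes_per_year / 1e9:,.2f} GB/yr")  # ~3.65 GB/yr
```

Both totals stay well under the ~1TB threshold mentioned in the note, which is why sharding is not needed here on data size alone.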
Describe Entities and Data Relationships
- Coffee Cups
- Stores
- Shelves
- Machines
- Bags of Coffee
- Coffee Scales
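One possible starting shape for these entities (illustrative only, not a prescribed model): small, bounded relationships such as a store's shelves, scales, and machines are embedded, while unbounded coffee-cup events link back by ID.

```python
# A store embeds its ~10 shelves (with their scales and bag counts) and its machines.
store = {
    "_id": "store-0042",
    "name": "Example Coffee #42",
    "shelves": [
        {"shelf_id": 1, "scale_id": "scale-0042-01", "weight_g": 10_450, "bag_count": 12},
    ],
    "machines": [
        {"machine_id": "machine-0042-01", "model": "auto-brew"},
    ],
}

# Coffee-cup events are unbounded, so they reference the store and machine.
coffee_cup = {
    "store_id": "store-0042",
    "machine_id": "machine-0042-01",
    "timestamp": "2024-01-01T08:30:00Z",
    "brew": {"temp_c": 92.0, "water_ml": 240, "grind_g": 18.0},
}
```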