Identify the workload

  • Usability Scenarios
  • Production Logs & Stats
  • Business Domain Experts
  • Data Modeling Expert
  • Assumptions about the workload
    • operation frequencies
    • relationship cardinalities (minimum 0? maximum 24? maximum 2,500?)

Example:

  • an IoT database
  • 100 million weather sensors sending data
  • collect the data
  • make it available to 10 data scientists
  • MOST trends can be deduced hourly
  • no logs or stats to leverage
  • data can be collected and sent 1x per minute
  • need to keep data for more than 10 years
  • ops team needs to validate faulty devices
  • ops team needs to be able to aggregate data for the data scientists
  • data scientists need to explore and find trends

List the CRUD operations

| Item/Person | Use Case | Data | CRUD |
|---|---|---|---|
| Devices | Sending data every minute | device_id, metrics | WRITE |
| Ops Team | ID busted devices | device_id, timestamp | READ |
| Ops Team | Aggregate hourly data sets | device_id, metrics | READ / WRITE |
| Data Scientist | Run ~10 analytical queries hourly | metrics, aggregated metrics | READ |

Understanding write operations

  • sent by sensors
  • sent to the server
  • WRITE / INSERT
  • data has device ID, timestamp and metrics
  • ~1.67M writes per second (100M sensors × 1 write per minute ÷ 60 s); a DB partitioned into 10-20 shards can handle this
  • Data size ≈ 1,000 bytes per write
  • Life of data is 10 years
  • Does not need extensive durability, so no multi-node majority confirmation on writes: even though we collect 1x-per-minute data, it is mostly consumed as hourly aggregates
  • Consider grouping (batching) the writes because there are so many; see the sketch below
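
A minimal sketch (not from the notes; the collection name and the `incoming_readings()` feed are assumptions) of batching sensor writes with a relaxed write concern in pymongo, matching the two bullets above:

```python
from datetime import datetime, timezone

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")
# w=1: acknowledgement from the primary only -- no multi-node majority
# confirmation, matching the relaxed durability requirement above.
readings = client.iot.get_collection("readings", write_concern=WriteConcern(w=1))

def incoming_readings():
    # Stand-in for the real sensor feed (assumption, for illustration).
    yield "sensor-42", {"temperature": 21.5}

def flush(batch):
    # insert_many sends a whole group of per-minute readings in one round trip.
    if batch:
        readings.insert_many(batch, ordered=False)

batch = []
for device_id, metrics in incoming_readings():
    batch.append({
        "device_id": device_id,
        "ts": datetime.now(timezone.utc),
        "metrics": metrics,
    })
    if len(batch) >= 1000:  # flush in groups of 1,000 documents
        flush(batch)
        batch = []
flush(batch)
```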

Understand the read operations

  • most queries will be on temperature data
  • read queries
  • 100 queries per hour: 10 scientists × 10 requests each
  • will require collection scans
  • mostly use the last hour's worth of data; see the sketch below
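
A minimal sketch (collection and field names are assumptions) of the kind of analytical read described above: average temperature per device over the last hour.

```python
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
readings = client.iot.readings

one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)
pipeline = [
    {"$match": {"ts": {"$gte": one_hour_ago}}},  # last hour's worth of data
    {"$group": {                                 # average temperature per device
        "_id": "$device_id",
        "avg_temp": {"$avg": "$metrics.temperature"},
    }},
]
for row in readings.aggregate(pipeline):
    print(row["_id"], row["avg_temp"])
```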

Understand Relationships

  • What are the relationships?
  • How many of each relationship are there (cardinality)?
  • Should these relationships be embedded or linked? (see the sketch below)
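
A hedged sketch of the two options, using assumed field names from the IoT example:

```python
# Embedded: device metadata lives inside the document that uses it --
# one read returns everything, at the cost of larger documents.
device_embedded = {
    "_id": "sensor-42",
    "location": {"lat": 48.85, "lon": 2.35, "site": "rooftop-7"},  # assumed fields
}

# Linked: each reading references the device by _id -- documents stay small,
# but resolving the relationship takes a second query (or a $lookup).
reading_linked = {
    "device_id": "sensor-42",  # reference into a separate devices collection
    "ts": "2024-01-01T00:00:00Z",
    "metrics": {"temperature": 21.5},
}
```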

Apply patterns

  • recognize which patterns fit the workload
  • apply them; one candidate is sketched below
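
The notes don't name specific patterns; as one hypothetical example for this IoT workload, here is a sketch of the well-known Bucket pattern (collection and field names assumed), which groups an hour of per-minute readings into one document:

```python
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
buckets = client.iot.reading_buckets

now = datetime.now(timezone.utc)
hour = now.replace(minute=0, second=0, microsecond=0)

# One document per device per hour; each per-minute reading is appended
# to the bucket instead of creating its own document (~60x fewer documents).
buckets.update_one(
    {"device_id": "sensor-42", "hour": hour},
    {
        "$push": {"readings": {"ts": now, "temperature": 21.5}},
        "$inc": {"count": 1},
    },
    upsert=True,  # create the bucket on the first reading of the hour
)
```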

A takeaway

Consider leveraging a dedicated node for analytics:
the primary handles the writes, a secondary serves the analytical reads.
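
A minimal sketch of that split in pymongo (the connection string and collection names are assumptions):

```python
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

# Writes go to the primary (the default).
client.iot.readings.insert_one(
    {"device_id": "sensor-42", "metrics": {"temperature": 21.5}}
)

# Analytical reads are routed to a secondary, keeping load off the primary.
analytics = client.iot.get_collection(
    "readings", read_preference=ReadPreference.SECONDARY_PREFERRED
)
print(analytics.count_documents({}))
```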

A Flexible Methodology For Modeling Data

| Goal | Shooting For Simplicity | Between Simple & Performance | Shooting For Optimal Performance |
|---|---|---|---|
| Describe the Workload | ID most-frequent operations | | ID ALL operations, quantify ALL of them |
| ID & Model Relationships | Embed a lot of content: large objects, less querying | | Embed AND link |
| Apply Patterns | Few patterns; may include data duplication | | Many patterns for many details |

An Example, Data For A Coffee Shop

Business Needs

  • 1K stores
  • make killer coffee
  • stick to a strict coffee recipe
  • use smart && automated coffee hardware (shelving stock systems, coffee makers, etc.)

Describe Data Workload

| Query | Operation Type | Business Description | Quantify Freq. | Qualify Importance |
|---|---|---|---|---|
| Get/set the weight of coffee beans on a shelf | Write, when a person takes coffee off the shelf or a stocker adds to it | Shelf sends an event when coffee bags are removed && added | 1 write per second, 10 shelves per store | Critical: this is the most granular detail of inventory management. Ensuring the write was successful is important; leverage a majority writeConcern (see the sketch below) |
| Read how much coffee was consumed by the store & customers | Read, as analytics | Show how much coffee was consumed and forecast how much to order the next day | | |
| Find anomalies in the inventory | Read, as analytics | Gain insight into unexpected inventory details | Read 1x per hour, runs against the whole dataset | Stale data is OK; full-collection scans should run on a redundant node in a replica set |
| Capture coffee-making details | Writes from coffee machines (temp, weight, speed, water, etc.) | Coffee machine reports these details for each cup of coffee made | LOTS of writes: many per cup of coffee | Non-critical |
| Analyze the details of coffee-making | Read, as business analytics | Help the org through insights | | |
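
A minimal sketch of the majority writeConcern called out in the first row (collection and field names are assumptions):

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
# Shelf weight events must be acknowledged by a majority of replica-set
# members before the write counts as successful.
shelf_events = client.coffee.get_collection(
    "shelf_events", write_concern=WriteConcern(w="majority")
)

shelf_events.insert_one({
    "store_id": 17,
    "shelf_id": 3,
    "weight_g": 8200,  # weight of coffee bags currently on the shelf
    "event": "bag_removed",
})
```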

Describe Storage Needs

| | About Coffee Cups | About Inventory |
|---|---|---|
| Retention | One year of data | One year of data |
| Writes per day | 10,000 × 1,000 = 10M | 10,000 × 10 = 100K |
| Writes per year | ~3.65 billion | ~36.5 million |
| Storage per year (implied record size ≈ 100 bytes) | ~370 GB | ~3.7 GB |
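
A quick sanity check of the arithmetic above (the ~100-byte record size is an assumption inferred from the GB figures):

```python
cups_per_day = 10_000 * 1_000                 # 10,000,000 writes/day
cups_per_year = cups_per_day * 365            # 3,650,000,000 (~3.65 billion)
print(cups_per_year * 100 / 1e9)              # ~365 GB at ~100 bytes per record

inventory_per_day = 10_000 * 10               # 100,000 writes/day
inventory_per_year = inventory_per_day * 365  # 36,500,000 (~36.5 million)
print(inventory_per_year * 100 / 1e9)         # ~3.7 GB
```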

NOTE: Sharding is best with at least 1 TB of data; based on these storage needs, this workload does not need sharding.

Describe Entities and Data Relationships

  • Coffee Cups
  • Stores
  • Shelves
  • Machines
  • Bags of Coffee
  • Coffee Scales
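
As an illustration only (every field name below is an assumption), one way these entities could relate: a store embeds its few shelves and their scales, while coffee cups, numbering in the billions per year, live in their own collection and link back by ID.

```python
# A store embeds its shelves (bounded at ~10, queried together) and their scales.
store = {
    "_id": "store-0042",
    "shelves": [
        {"shelf_id": 1, "scale_id": "scale-A1", "weight_g": 9100},
        {"shelf_id": 2, "scale_id": "scale-A2", "weight_g": 4400},
    ],
    "machines": ["machine-7", "machine-8"],  # references to a machines collection
}

# Coffee cups (~3.65 billion/yr) get their own collection and link back by ID.
coffee_cup = {
    "store_id": "store-0042",
    "machine_id": "machine-7",
    "ts": "2024-01-01T08:15:00Z",
    "details": {"temp_c": 92.0, "weight_g": 18.0, "water_ml": 40},
}
```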