Data Basics
By Brandon Rose, MD, MPH
Updated August 30th, 2023
The foundation of modern medicine is data. However, most doctors do not know and are not taught basic definitions related to information technology. Here we will cover some basics about data and how it's stored, shared, analyzed, and interacted with.
The foundation of modern medicine is data. However, most doctors do not know and are not taught basic definitions related to information technology. Here we will cover some basics about data and how it's stored, shared, analyzed, and interacted with. We will also cover the tools used in each of these spaces.
Web 1.0 (1989 to 2005) - Servers join around the world to share their static content with others.
Web 2.0 (2005 to now) - Servers accept user input to provide dynamic content - social media, online banking etc.
Web 3.0 (soon? not soon?) - emerging ideas of a decentralized internet - blockchain, cryptocurrencies, Etherium, NFTs etc.
Data - information that is electronically collected and stored in a database to be analyzed.
Transactional Data (Databases)
Database - centralized electronic collection of data, usually as a relational normalized database, hosted on a local or cloud server. Handled by Data Scientists. Databases have rules, which is why your excel table is not a database.
Database Management System (DBMS) - the software used to interact with a database.
Transactional Data - data (usually normalized) that is collected, validated, and stored. Bank accounts, electronic medical records, Amazon, and most of the internet in Web 2.0. Usually done with an online transactional process that is ACID compliant
Online Transactional Process (OLTP) - an electronic process for reliable adding, deleting, and updating data. Preferred to be ACID compliant. Atomic (changes succeed or fail, no half data), Consistent (somewhat unchanging structure), Isolated (only a complete state of the database can be accessed at any given time), and Durable (can recover previous states if there is hardware/software failure).
Relational Database (such as SQL) - the traditional industry standard for storing data in its smallest normal form. Consists of a series of related tables that have a unique primary key and references to related tables with foreign keys. Uses OLTP.
Structured Query Language (SQL) - a relational database that scales vertically (limited by local or cloud computing power as the data grows). Requires a schema (structure) but can change over time.
SQL RDBMS - Microsoft SQL Server, MySQL, Oracle, PostgreSQL, SQLite, MariaDB, SQLite, OnBase
NewSQL - modern attempts at more scalable SQL (DBMS Amazon Aurora, Couchbase, CockroachDB, Google Spanner)
Non-relational Database (NoSQL AKA Not Only SQL) - great horizonal scaling (add more storage space), eventually consistent (valid/trusted), no schema needed.
Key-Value - In-memory cache for making traditional databases faster to query. For high-traffic web apps, e-commerce systems, gaming applications.
Example DBMS - Redis (expanding to platform) – used by Twitter, AWS DynamoDB
Wide Column – no schema, unstructured data, CQL (no joins), decentralized, time-series data (weather sensors, Netflix watch)
Example DBMS - Cassandra, Hbase, Amazon Keyspaces, possibly your excel table (informally)
Document-oriented database - unstructured no schema, collections and subcollections, apps and games, not ideal for graphs (highly connected data)
Example DBMS- MongoDB – amazon compatible (DocumentDB), Cloud Firestore, Couchbase, CouchDB - ACID
Graph - Fraud detection, social networking, recommendation engines
Example DBMS- Neo4j - ACID, dgraph, Amazon Neptune, Azure Cosmos DB
Search – search engines, Elastisearch (what Google was built on), indexes files
Multi-model - example DBMS - FaunaDB, Redis Expansion, Many platform services.
Analytical Data (Data Warehouses/Lakes)
Data warehouse - a semi-structured electronic denormalized collection of data that highly structured and imports data from multiple databases. Handled by data engineers/arcitechts and used by data analysts.
Analytical Data - data that has been extracted, transformed, and loaded for analysis. This is research!
Online analytical process (OLAP) - an electronic process for extracting, loading, and transforming (ETL/ELT)
Data Lake - a repository of raw data stored in it's native format. Can be structured or unstructered. Used in machine learning and AI.
Software Development - local or cloud server (hosting) + framework (such as Model-View-Contoller). Handled by developers.
Cloud Services (hosting) - Google, Amazon (AWS), Microsoft Azure, more!
Application Program Interface (API) - the documented way of how different applications communicate requests and responses
Model - a database or several databases that handle the transactions and analytics of your application
Viewer - the user interface (the front end), the part that your user interacts with.
Controller - the server side (the backend), the programming language that computes/handles the requests.
Examples
See what people are using! - https://insights.stackoverflow.com/survey/2021
Operating systems - Linux, macOS, Windows
Development environments - Android Studio, IPython/Jupyter, RStudio, RubyMine, Visual Studio Code, Xcode
Cloud Platforms - AWS, DigitalOcean, Firebase, Google Cloud Platform, IBM Cloud or Watson, Microsoft Azure, Oracle Cloud Infrastructure
Web frameworks (with language) - Angular, ASP.NET, Blazor (.NET), Django (python), Laravel (PHP), Ruby on Rails, more!
Programming, scripting, and markup language: Bash/Shell, HTML/CSS, JavaScript, Julia, MATLAB, Objective-C, Perl, PHP, Python, R, Ruby, SQL, TypeScript
Other frameworks and libraries - .NET Framework, Apache Spark, Electron, Flutter, jQuery, NumPy, Pandas, Tidyverse
Other Tools - Docker (wrapping apps), Git (version control)
Some Data Collection Tools - RedCAP (open-source research platform hosted by Vanderbilt University, written in PHP), Qualtics (survey collection tool)