Data Basics

By Brandon Rose, MD, MPH
Updated August 30th, 2023

The foundation of modern medicine is data. However, most doctors do not know and are not taught basic definitions related to information technology. Here we will cover some basics about data and how it's stored, shared, analyzed, and interacted with. 

The foundation of modern medicine is data. However, most doctors do not know and are not taught basic definitions related to information technology. Here we will cover some basics about data and how it's stored, shared, analyzed, and interacted with. We will also cover the tools used in each of these spaces.

Web 1.0 (1989 to 2005) - Servers join around the world to share their static content with others.

Web 2.0 (2005 to now) - Servers accept user input to provide dynamic content - social media, online banking etc.

Web 3.0 (soon? not soon?) - emerging ideas of a decentralized internet - blockchain, cryptocurrencies, Etherium, NFTs etc.

Data - information that is electronically collected and stored in a database to be analyzed.

Transactional Data (Databases)

Database - centralized electronic collection of data, usually as a relational normalized database, hosted on a local or cloud server. Handled by Data Scientists. Databases have rules, which is why your excel table is not a database.

Database Management System (DBMS) - the software used to interact with a database.

Transactional Data - data (usually normalized) that is collected, validated, and stored. Bank accounts, electronic medical records, Amazon, and most of the internet in Web 2.0. Usually done with an online transactional process that is ACID compliant

Online Transactional Process (OLTP) - an electronic process for reliable adding, deleting, and updating data. Preferred to be ACID compliant. Atomic (changes succeed or fail, no half data), Consistent (somewhat unchanging structure), Isolated (only a complete state of the database can be accessed at any given time), and Durable (can recover previous states if there is hardware/software failure).

Relational Database (such as SQL) - the traditional industry standard for storing data in its smallest normal form. Consists of a series of related tables that have a unique primary key and references to related tables with foreign keys. Uses OLTP.

Structured Query Language (SQL) - a relational database that scales vertically (limited by local or cloud computing power as the data grows). Requires a schema (structure) but can change over time.

SQL RDBMS - Microsoft SQL Server, MySQL, Oracle, PostgreSQL, SQLite, MariaDB, SQLite, OnBase

NewSQL - modern attempts at more scalable SQL (DBMS Amazon Aurora, Couchbase, CockroachDB, Google Spanner) 

Non-relational Database (NoSQL AKA Not Only SQL) - great horizonal scaling (add more storage space), eventually consistent (valid/trusted), no schema needed.

Key-Value - In-memory cache for making traditional databases faster to query. For high-traffic web apps, e-commerce systems, gaming applications.

Example DBMS - Redis (expanding to platform) – used by Twitter, AWS DynamoDB

Wide Column – no schema, unstructured data, CQL (no joins), decentralized, time-series data (weather sensors, Netflix watch)

Example DBMS - Cassandra, Hbase, Amazon Keyspaces, possibly your excel table (informally)

Document-oriented database - unstructured no schema, collections and subcollections, apps and games, not ideal for graphs (highly connected data)

Example DBMS- MongoDB – amazon compatible (DocumentDB), Cloud Firestore, Couchbase, CouchDB - ACID

Graph - Fraud detection, social networking, recommendation engines

Example DBMS- Neo4j - ACID, dgraph, Amazon Neptune, Azure Cosmos DB 

Search – search engines, Elastisearch (what Google was built on), indexes files

Multi-model - example DBMS - FaunaDB, Redis Expansion, Many platform services.

Analytical Data (Data Warehouses/Lakes)

Data warehouse - a semi-structured electronic denormalized collection of data that highly structured and imports data from multiple databases. Handled by data engineers/arcitechts  and used by data analysts.

Analytical Data - data that has been extracted, transformed, and loaded for analysis. This is research!

Online analytical process (OLAP) - an electronic process for extracting, loading, and transforming (ETL/ELT)

Data Lake - a repository of raw data stored in it's native format. Can be structured or unstructered. Used in machine learning and AI.

Software Development - local or cloud server (hosting) + framework (such as Model-View-Contoller). Handled by developers.

Cloud Services (hosting) - Google, Amazon (AWS), Microsoft Azure, more!

Application Program Interface (API) - the documented way of how different applications communicate requests and responses

Model - a database or several databases that handle the transactions and analytics of your application

Viewer - the user interface (the front end), the part that your user interacts with.

Controller - the server side (the backend), the programming language that computes/handles the requests.

Examples 

See what people are using! - https://insights.stackoverflow.com/survey/2021

Operating systems - Linux, macOS, Windows

Development environments - Android Studio, IPython/Jupyter, RStudio, RubyMine, Visual Studio Code, Xcode

Cloud Platforms - AWS, DigitalOcean, Firebase, Google Cloud Platform, IBM Cloud or Watson, Microsoft Azure, Oracle Cloud Infrastructure

Web frameworks (with language) - Angular, ASP.NET, Blazor (.NET), Django (python), Laravel (PHP), Ruby on Rails, more!

Programming, scripting, and markup language: Bash/Shell, HTML/CSS, JavaScript, Julia, MATLAB, Objective-C, Perl, PHP, Python, R, Ruby, SQL, TypeScript

Other frameworks and libraries - .NET Framework, Apache Spark, Electron, Flutter, jQuery, NumPy, Pandas, Tidyverse

Other Tools - Docker (wrapping apps), Git (version control)

Some Data Collection Tools - RedCAP (open-source research platform hosted by Vanderbilt University, written in PHP), Qualtics (survey collection tool)

Links