Dataset & Distribution
A dataset is a collection of data points, and its distribution describes how those values are spread out and how frequently each value occurs.
Data Distribution
A distribution is the representation of the frequency of each value or interval within a dataset. A distribution can be used to better understand eventual patterns of the data, such as the central tendency or spread and can be useful to make prediction.
Exists different types of distribution, based for example on the number of variables considered. A distribution over one variable is named Univariate Distribution, however exists also cross distribution over two or more variable (Multivariate).
A variable is defined when we decide how to measure a certain attribute. A variable can be:
- qualitative
- quantitative
depending on operational version (scale of measurement).
Schema
CREATE TABLE users_data (
user_id SERIAL PRIMARY KEY,
age INT CHECK (age >= 18 AND age <= 100),
gender VARCHAR(10) CHECK (gender IN ('Male', 'Female', 'Other')),
country VARCHAR(50) NOT NULL,
subscription_type VARCHAR(10) CHECK (subscription_type IN ('Free', 'Standard', 'Premium')),
session_count INT CHECK (session_count >= 0),
avg_session_duration NUMERIC(5,2) CHECK (avg_session_duration >= 0),
purchases INT CHECK (purchases >= 0),
total_spent NUMERIC(10,2) DEFAULT 0 CHECK (total_spent >= 0),
satisfaction_score NUMERIC(3,1) CHECK (satisfaction_score >= 1 AND satisfaction_score <= 10)
);
INSERT INTO users_data
(age, gender, country, subscription_type, session_count, avg_session_duration, purchases, total_spent, satisfaction_score)
VALUES
(25, 'Female', 'Italy', 'Free', 8, 12.50, 0, 0.00, 6.8),
(32, 'Male', 'Germany', 'Premium', 25, 35.20, 5, 120.50, 9.1),
(21, 'Other', 'France', 'Free', 5, 8.75, 0, 0.00, 7.2),
(45, 'Female', 'Spain', 'Standard', 18, 22.40, 2, 35.99, 8.0),
(29, 'Male', 'Italy', 'Premium', 30, 40.10, 8, 210.00, 9.5),
(54, 'Female', 'UK', 'Standard', 12, 18.30, 1, 15.50, 7.4),
(38, 'Male', 'USA', 'Premium', 22, 33.60, 3, 95.75, 8.8),
(26, 'Female', 'Canada', 'Free', 10, 14.10, 0, 0.00, 6.5),
(41, 'Male', 'Italy', 'Standard', 16, 25.75, 1, 20.99, 7.7),
(23, 'Female', 'France', 'Premium', 28, 31.20, 6, 145.25, 9.0),
(36, 'Male', 'Spain', 'Standard', 20, 26.50, 2, 40.00, 8.1),
(48, 'Female', 'Germany', 'Premium', 27, 38.90, 4, 110.00, 8.9),
(22, 'Male', 'UK', 'Free', 6, 10.20, 0, 0.00, 6.2),
(60, 'Female', 'USA', 'Standard', 15, 19.85, 1, 25.00, 7.3),
(34, 'Other', 'Italy', 'Premium', 24, 36.10, 3, 82.50, 8.6);
Subscription type Distribution (Univariate Distribution)
SELECT subscription_type, COUNT(*) AS num_utenti
FROM users_data
GROUP BY subscription_type
ORDER BY num_utenti DESC;
| subscription_type | num_utenti |
|---|---|
| Premium | 6 |
| Standard | 5 |
| Free | 4 |
Purchases vs Total Spent (Bivariate Distribution)
SELECT purchases,
COUNT(*) AS num_utenti,
ROUND(AVG(total_spent), 2) AS spesa_media
FROM users_data
GROUP BY purchases
ORDER BY purchases;
| purchases | num_utenti | spesa_media |
|---|---|---|
| 0 | 4 | 0.00 |
| 1 | 3 | 20.50 |
| 2 | 2 | 38.00 |
| 3 | 2 | 89.13 |
| 4 | 1 | 110.00 |
| 5 | 1 | 120.50 |
| 6 | 1 | 145.25 |
| 8 | 1 | 210.00 |
Now let's consider this famous Samuel L. Jackson citation from film "Pulp Fiction":
«Ezekiel 25:17. The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is He who in the name of charity and good will shepherds the weak through the valley of darkness, for He is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.»
and compute the distribution of letters.
CREATE TABLE quote_text (
id SERIAL PRIMARY KEY,
quote TEXT NOT NULL
);
INSERT INTO quote_text (quote)
VALUES ('[...]');
CREATE TABLE quote_letters (
id SERIAL PRIMARY KEY,
position INT NOT NULL,
letter CHAR(1) NOT NULL
);
INSERT INTO quote_letters (position, letter)
SELECT
gs AS position,
SUBSTRING(q.quote FROM gs FOR 1) AS letter
FROM
quote_text q,
generate_series(1, LENGTH(q.quote)) AS gs;
Letter Distribution
SELECT
LOWER(letter) AS letter,
COUNT(*) AS frequency
FROM
quote_letters
WHERE
letter ~ '[A-Za-z]'
GROUP BY
LOWER(letter)
ORDER BY
frequency DESC;
| Letter | Frequency |
|---|---|
| e | 56 |
| t | 32 |
| h | 31 |
| n | 30 |
| o | 29 |
| i | 27 |
| s | 26 |
| a | 23 |
| r | 21 |
| l | 19 |
| d | 17 |
| y | 12 |
| Letter | Frequency |
|---|---|
| w | 10 |
| f | 10 |
| u | 9 |
| m | 8 |
| p | 7 |
| g | 7 |
| k | 6 |
| b | 5 |
| v | 4 |
| c | 4 |
| z | 1 |
| q | 1 |
Caesar Cipher Encoding/Decoding
Now let's take a small text and encode it using a Caesar Cipher encryption. This kind of encryption is the easiest one and encrypt words using a shift on each letters, maintaining the pattern unchanged.
Know yourself and you will win all the battles. ~ Sun Tzu
The encryption will look like this:
Caesar Cipher Encoder
Encoded result:
Now let's first decrypt the text assuming we don't know the original plaintext neither the shift applied. We will brute force it!
Caesar Cipher Decoder — Brute Force
Candidates:
Obviously, the correct one is the one that is easily understood. (Shift 10)
Now let's decrypt the proposed text using the natural language distribution.
First of all compute the distribution of the encrypted text.
| Letter | Frequency |
|---|---|
| v | 6 |
| o | 3 |
| x | 3 |
| y | 3 |
| g | 3 |
| d | 3 |
| k | 3 |
| i | 2 |
| Letter | Frequency |
|---|---|
| e | 2 |
| c | 2 |
| s | 2 |
| p | 1 |
| n | 1 |
| b | 1 |
| l | 1 |
| u | 1 |
| r | 1 |
Now we compare the distribution side by side.
English Letter Distribution
Encrypted Text Letter Frequency
We will notice that the most common letters in the second graph are VOXYGDK.
Because the word is short, we can assume that these peaks are relative to the most common letter in English.
Then, analyzing this peak's relative shift we can check for a sentence that makes sense.
The result is that shifting using the letter O distance from letter E we get the right plaintext. The resulting shift is 10 as we have seen in the previous example.