Dataset & Distribution

Dataset & Distribution

A dataset is a collection of data points, and its distribution describes how those values are spread out and how frequently each value occurs.

Data Distribution

A distribution is the representation of the frequency of each value or interval within a dataset. A distribution can be used to better understand eventual patterns of the data, such as the central tendency or spread and can be useful to make prediction.

Exists different types of distribution, based for example on the number of variables considered. A distribution over one variable is named Univariate Distribution, however exists also cross distribution over two or more variable (Multivariate).

A variable is defined when we decide how to measure a certain attribute. A variable can be:

  • qualitative
  • quantitative

depending on operational version (scale of measurement).

Schema

CREATE TABLE users_data (
    user_id SERIAL PRIMARY KEY,
    age INT CHECK (age >= 18 AND age <= 100),
    gender VARCHAR(10) CHECK (gender IN ('Male', 'Female', 'Other')),
    country VARCHAR(50) NOT NULL,
    subscription_type VARCHAR(10) CHECK (subscription_type IN ('Free', 'Standard', 'Premium')),
    session_count INT CHECK (session_count >= 0),
    avg_session_duration NUMERIC(5,2) CHECK (avg_session_duration >= 0),
    purchases INT CHECK (purchases >= 0),
    total_spent NUMERIC(10,2) DEFAULT 0 CHECK (total_spent >= 0),
    satisfaction_score NUMERIC(3,1) CHECK (satisfaction_score >= 1 AND satisfaction_score <= 10)
);

INSERT INTO users_data
(age, gender, country, subscription_type, session_count, avg_session_duration, purchases, total_spent, satisfaction_score)
VALUES
(25, 'Female', 'Italy', 'Free', 8, 12.50, 0, 0.00, 6.8),
(32, 'Male', 'Germany', 'Premium', 25, 35.20, 5, 120.50, 9.1),
(21, 'Other', 'France', 'Free', 5, 8.75, 0, 0.00, 7.2),
(45, 'Female', 'Spain', 'Standard', 18, 22.40, 2, 35.99, 8.0),
(29, 'Male', 'Italy', 'Premium', 30, 40.10, 8, 210.00, 9.5),
(54, 'Female', 'UK', 'Standard', 12, 18.30, 1, 15.50, 7.4),
(38, 'Male', 'USA', 'Premium', 22, 33.60, 3, 95.75, 8.8),
(26, 'Female', 'Canada', 'Free', 10, 14.10, 0, 0.00, 6.5),
(41, 'Male', 'Italy', 'Standard', 16, 25.75, 1, 20.99, 7.7),
(23, 'Female', 'France', 'Premium', 28, 31.20, 6, 145.25, 9.0),
(36, 'Male', 'Spain', 'Standard', 20, 26.50, 2, 40.00, 8.1),
(48, 'Female', 'Germany', 'Premium', 27, 38.90, 4, 110.00, 8.9),
(22, 'Male', 'UK', 'Free', 6, 10.20, 0, 0.00, 6.2),
(60, 'Female', 'USA', 'Standard', 15, 19.85, 1, 25.00, 7.3),
(34, 'Other', 'Italy', 'Premium', 24, 36.10, 3, 82.50, 8.6);

Subscription type Distribution (Univariate Distribution)

SELECT subscription_type, COUNT(*) AS num_utenti
FROM users_data
GROUP BY subscription_type
ORDER BY num_utenti DESC;
subscription_type num_utenti
Premium 6
Standard 5
Free 4

Purchases vs Total Spent (Bivariate Distribution)

SELECT purchases,
       COUNT(*) AS num_utenti,
       ROUND(AVG(total_spent), 2) AS spesa_media
FROM users_data
GROUP BY purchases
ORDER BY purchases;
purchases num_utenti spesa_media
0 4 0.00
1 3 20.50
2 2 38.00
3 2 89.13
4 1 110.00
5 1 120.50
6 1 145.25
8 1 210.00

Now let's consider this famous Samuel L. Jackson citation from film "Pulp Fiction":

«Ezekiel 25:17. The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is He who in the name of charity and good will shepherds the weak through the valley of darkness, for He is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.»

and compute the distribution of letters.

CREATE TABLE quote_text (
    id SERIAL PRIMARY KEY,
    quote TEXT NOT NULL
);


INSERT INTO quote_text (quote)
VALUES ('[...]');


CREATE TABLE quote_letters (
    id SERIAL PRIMARY KEY,
    position INT NOT NULL,
    letter CHAR(1) NOT NULL
);

INSERT INTO quote_letters (position, letter)
SELECT 
    gs AS position,
    SUBSTRING(q.quote FROM gs FOR 1) AS letter
FROM 
    quote_text q,
    generate_series(1, LENGTH(q.quote)) AS gs;

Letter Distribution

SELECT 
    LOWER(letter) AS letter,
    COUNT(*) AS frequency
FROM 
    quote_letters
WHERE 
    letter ~ '[A-Za-z]'  
GROUP BY 
    LOWER(letter)
ORDER BY 
    frequency DESC;
LetterFrequency
e56
t32
h31
n30
o29
i27
s26
a23
r21
l19
d17
y12
LetterFrequency
w10
f10
u9
m8
p7
g7
k6
b5
v4
c4
z1
q1

Caesar Cipher Encoding/Decoding

Now let's take a small text and encode it using a Caesar Cipher encryption. This kind of encryption is the easiest one and encrypt words using a shift on each letters, maintaining the pattern unchanged.

Know yourself and you will win all the battles. ~ Sun Tzu

The encryption will look like this:

Caesar Cipher Encoder



Encoded result:

Now let's first decrypt the text assuming we don't know the original plaintext neither the shift applied. We will brute force it!

Caesar Cipher Decoder — Brute Force

Candidates:


Obviously, the correct one is the one that is easily understood. (Shift 10)


Now let's decrypt the proposed text using the natural language distribution.

First of all compute the distribution of the encrypted text.

LetterFrequency
v6
o3
x3
y3
g3
d3
k3
i2
LetterFrequency
e2
c2
s2
p1
n1
b1
l1
u1
r1

Now we compare the distribution side by side.

English Letter Distribution

Encrypted Text Letter Frequency

We will notice that the most common letters in the second graph are VOXYGDK.

Because the word is short, we can assume that these peaks are relative to the most common letter in English.

Then, analyzing this peak's relative shift we can check for a sentence that makes sense.

The result is that shifting using the letter O distance from letter E we get the right plaintext. The resulting shift is 10 as we have seen in the previous example.