The data science and ML ecosystem has converged around a fairly stable set of tools. You don't need to master all of them — but knowing what they are, and why each exists, gives you a clear map of the landscape.

Python — the language of data science

Python has become the dominant language for data science and machine learning. It's not the fastest language, but its readable syntax and vast ecosystem of libraries make it the default choice across academia and industry.

If you learn one thing for AI work, make it Python. A few weeks of basics gives you access to everything else in this module.
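The readability claim is easy to see in a few lines. This toy snippet (the text and names are purely illustrative) counts word frequencies using only the standard library:

```python
# Toy example of Python's readable style: count word frequencies.
# All names here are illustrative, not tied to any dataset.
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"
counts = Counter(text.split())
print(counts.most_common(2))  # two most frequent words, highest first
```

Three lines of logic, and each reads almost like the English description of the task.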

SQL — the language of data

SQL (Structured Query Language) is how you query databases — extract, filter, join, and aggregate data. Every data professional uses SQL constantly. No matter how advanced your Python and ML skills become, you'll always need SQL to get data in the first place.

-- SQL: find top customers by spend in the last 90 days
-- (date arithmetic shown is MySQL syntax; other databases differ slightly)
SELECT
  customer_id,
  SUM(order_total) AS total_spend
FROM orders
WHERE order_date >= DATE_SUB(NOW(), INTERVAL 90 DAY)
GROUP BY customer_id
ORDER BY total_spend DESC
LIMIT 10;
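The same query translates almost clause-for-clause into pandas, which is one reason the two skills reinforce each other. A sketch, assuming the orders table has been loaded as a DataFrame with the same column names (the sample data here is made up):

```python
import pandas as pd

# Hypothetical orders data; columns mirror the SQL example
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 2],
    "order_total": [50.0, 30.0, 120.0, 10.0, 5.0],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-02-01", "2024-01-20", "2023-01-01", "2024-02-10"]
    ),
})

cutoff = pd.Timestamp("2024-02-15") - pd.Timedelta(days=90)
top = (
    orders[orders["order_date"] >= cutoff]         # WHERE
    .groupby("customer_id")["order_total"].sum()   # GROUP BY + SUM
    .sort_values(ascending=False)                  # ORDER BY ... DESC
    .head(10)                                      # LIMIT 10
)
print(top)
```

In practice you would often run the SQL on the database and only pull the aggregated result into pandas — pushing the heavy lifting to the database is usually faster.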

Key Python libraries

pandas
The Swiss Army knife of data manipulation. Load CSVs, clean data, filter rows, join tables, aggregate values — all with readable Python syntax. The first thing any data scientist imports.
NumPy
Fast numerical computing. The foundation of scientific Python — arrays, maths operations, linear algebra. Most other libraries build on NumPy under the hood.
Matplotlib / Seaborn
Data visualisation. Matplotlib is the low-level foundation; Seaborn builds on it with prettier statistical charts. For interactive dashboards, Plotly is increasingly popular.
scikit-learn
The standard ML library for classical algorithms. Decision trees, random forests, SVMs, regression, clustering, preprocessing — all with a consistent, beginner-friendly API.
PyTorch / TensorFlow
The two dominant deep learning frameworks. PyTorch leads in research; TensorFlow (and its Keras interface) is widely used in production. Both are used for training neural networks.
Jupyter Notebooks
An interactive environment where you write code in cells, run them, and see results immediately — with visualisations, text, and code all in one document. The standard tool for data exploration and analysis.
# A typical data science workflow in Python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data.csv")                                     # 1. load
df = df.dropna()                                                 # 2. clean
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 100])   # 3. engineer features
df["age_group"].value_counts().sort_index().plot(kind="bar")     # 4. visualise
plt.show()

# Random forests need numeric inputs, so drop the categorical helper column
X = df.drop(columns=["target", "age_group"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier().fit(X_train, y_train)           # 5. model
print(model.score(X_test, y_test))                               # 6. evaluate
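NumPy never appears explicitly in the workflow above, yet it is working under the hood of every step — pandas columns and scikit-learn feature matrices are NumPy arrays. A minimal direct example of what it provides (the prices here are made-up sample values):

```python
import numpy as np

# Vectorised arithmetic: operations apply element-wise, no explicit loops
prices = np.array([9.99, 4.50, 12.00])
with_tax = prices * 1.2           # broadcast a scalar across the whole array
print(with_tax.round(2))
print(prices.mean(), prices.sum())  # built-in aggregations
```

Vectorised operations like these run in optimised C rather than in a Python loop, which is why the "slow language" can still power fast numerical code.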

Cloud platforms

For large-scale work, cloud platforms provide managed services for storage, compute, and ML: Google Cloud (BigQuery, Vertex AI), AWS (S3, SageMaker), Azure (ML Studio). You don't need to start here — local Python is fine for learning — but cloud skills are increasingly expected in industry roles.

Key takeaways

  • Python is the dominant language for data science and ML — learn this first
  • SQL is essential for every data professional — always needed to access data
  • Core Python stack: pandas (data), NumPy (maths), Matplotlib (visualisation), scikit-learn (ML)
  • For deep learning: PyTorch (research) or TensorFlow/Keras (production)
  • Jupyter Notebooks are the standard environment for exploration and analysis