
# Chapter 6: The Data Analysis Trio

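# Contents

6.1 Motivating problem: a sales dataset nobody can read
6.2 NumPy: the "acceleration engine" of numerical computing
6.3 Pandas: the "Swiss Army knife" of data wrangling
6.4 Matplotlib: the "canvas" for data visualization
6.5 Hands-on project: an end-to-end e-commerce sales analysis
6.6 Top 5 beginner pitfalls
6.7 Review questions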

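# 6.1 The Scenario

💬 Scenario: you are a data analyst at an e-commerce company. Your boss hands you a CSV file: "These are the sales records for the first half of the year. I need three answers: which category earns the most, how are sales trending in each region, and what should we stock up on for the second half?"

You open the file: 100,000 rows, 20 columns, with missing values and anomalous records mixed in. Excel already stutters just opening it, let alone analyzing it.

What you need is Python's data analysis trio: NumPy (low-level computation) + Pandas (tabular processing) + Matplotlib (visualization).

This chapter proceeds in four steps, "acceleration engine → Swiss Army knife → canvas → end-to-end project":

Stage	Tool	Capabilities
① Engine	NumPy	Numerical computing, array broadcasting, performance comparison
② Knife	Pandas	I/O, filtering, grouping, pivot tables, merging
③ Canvas	Matplotlib	Line, bar, scatter, and pie charts, subplots, seaborn
④ Project	All three together	100,000 rows of e-commerce data: clean → analyze → visualize → report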

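# 6.2 Getting Started with NumPy

NumPy gives Python array operations that run at close to C speed. Core libraries such as Pandas, Matplotlib, and Scikit-learn are all built on top of it.

# 6.2.1 Creating ndarrays and Inspecting Their Attributes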

import numpy as np

# ===== Creating ndarrays =====
a = np.array([1, 2, 3, 4, 5])                # from a list
b = np.zeros((3, 4))                          # 3×4 matrix of zeros
c = np.ones((2, 3))                           # 2×3 matrix of ones
d = np.arange(0, 10, 2)                       # [0, 2, 4, 6, 8] (like range)
e = np.linspace(0, 1, 5)                      # [0, 0.25, 0.5, 0.75, 1] (evenly spaced)
f = np.random.randn(3, 3)                     # 3×3 samples from the standard normal distribution
g = np.eye(3)                                 # 3×3 identity matrix

# ===== Attributes =====
print(f"{a.ndim=}")       # 1 (number of dimensions)
print(f"{a.shape=}")      # (5,) (shape)
print(f"{a.dtype=}")      # int64 on most platforms (data type)
print(f"{a.size=}")       # 5 (total number of elements)

# Reshaping
m = np.arange(12).reshape(3, 4)
print(m)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# 6.2.2 Array Arithmetic and Broadcasting

NumPy's vectorized operations are the performance foundation of the whole data science ecosystem:

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Element-wise arithmetic: no loop needed
print(a + b)      # [11 22 33 44]
print(a * b)      # [ 10  40  90 160]
print(a ** 2)     # [ 1  4  9 16]
print(np.sqrt(a)) # [1.         1.41421356 1.73205081 2.        ]

# Comparison: returns a boolean array
print(a > 2)      # [False False  True  True]

# Aggregation
print(a.sum())    # 10
print(a.mean())   # 2.5
print(a.max())    # 4
print(f"{a.std():.2f}")  # 1.12

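🔑 Broadcasting: understand this and you understand the core of NumPy.

# Arrays of different shapes can still be combined: NumPy automatically "broadcasts" the smaller array to match the larger one
a = np.array([[1, 2, 3],
              [4, 5, 6]])            # shape (2, 3)
b = np.array([10, 20, 30])           # shape (3,)

print(a + b)
# [[11 22 33]
#  [14 25 36]]                       # b is treated as if repeated to shape (2, 3), without an actual copy

🖼️ Broadcasting illustrated:

a (2,3):              b (3,)  broadcast to (2,3):     result:
[[1 2 3]    +   [10 20 30]  →  [[10 20 30]  =  [[11 22 33]
 [4 5 6]]                       [10 20 30]]      [14 25 36]]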
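
The rule behind the diagram can be sketched as follows: NumPy compares shapes starting from the trailing dimension, and two dimensions are compatible when they are equal or one of them is 1. The array names below (`row`, `col`, `col_2d`) are illustrative only and do not appear elsewhere in this chapter.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,): trailing dimension 3 matches
col = np.array([100, 200])          # shape (2,): trailing dimensions 3 vs 2 do not match

print((a + row).shape)              # (2, 3)

try:
    a + col
except ValueError as exc:
    print("incompatible shapes:", exc)

# Adding an axis turns (2,) into (2, 1), which is compatible with (2, 3)
col_2d = col[:, np.newaxis]
result = a + col_2d
print(result)
# [[100 101 102]
#  [203 204 205]]
```

When shapes cannot be reconciled, NumPy raises `ValueError` rather than guessing. Reshaping with `np.newaxis` is the usual way to say which axis a one-dimensional array should line up with.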

# 6.2.3 Slicing and Fancy Indexing

a = np.arange(12).reshape(3, 4)

# Basic indexing
print(a[0, 2])         # 2 (row 0, column 2)

# Slicing
print(a[:2, 1:3])
# [[1 2]
#  [5 6]]

# Boolean indexing: a handy way to filter data
print(a[a > 5])        # [ 6  7  8  9 10 11] (all elements > 5)

# Fancy indexing: select positions with an array
print(a[[0, 2]])       # rows 0 and 2
# [[ 0  1  2  3]
#  [ 8  9 10 11]]

print(a[:, [0, -1]])   # first and last columns
# [[ 0  3]
#  [ 4  7]
#  [ 8 11]]

# 6.2.4 Quick Reference: Common Functions

a = np.array([[1, 2, 3], [4, 5, 6]])

# Statistics
print(a.sum())              # 21 (grand total)
print(a.sum(axis=0))        # [5 7 9] (column sums: axis=0 collapses the rows)
print(a.sum(axis=1))        # [ 6 15] (row sums: axis=1 collapses the columns)

# Linear algebra
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
print(np.dot(x, y))         # matrix product [[19 22] [43 50]]
print(x @ y)                # equivalent operator form (Python 3.5+)

# Conditional replacement
print(np.where(a > 3, a, 0))   # keep values > 3, otherwise 0 → [[0 0 0] [4 5 6]]

# Concatenation (a has shape (2, 3))
print(np.concatenate([a, a], axis=0))   # stack vertically → shape (4, 3)
print(np.concatenate([a, a], axis=1))   # stack horizontally → shape (2, 6)

# Unique values
vals, counts = np.unique([1, 2, 2, 3, 3, 3], return_counts=True)
print(vals, counts)          # [1 2 3] [1 2 3]

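# 6.2.5 Why Is NumPy So Fast?

import time

size = 10_000_000

# Plain Python loop
py_list = list(range(size))
start = time.perf_counter()
result_py = sum(x * 2 for x in py_list)
print(f"Python 循环：{time.perf_counter() - start:.3f}s")

# NumPy vectorized
np_arr = np.arange(size)
start = time.perf_counter()
result_np = np.sum(np_arr * 2)
print(f"NumPy 向量化：{time.perf_counter() - start:.3f}s")

# Typical result: Python loop ~1.2s vs NumPy ~0.02s, roughly 60× faster (exact numbers depend on the machine)

🔑 Three reasons NumPy is fast:

Implemented in C: the inner loops run in compiled C code rather than in the Python interpreter
Contiguous memory: an ndarray is a single contiguous block of memory, which is CPU-cache friendly
SIMD instructions: the vector instruction sets of modern CPUs process several values at once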
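
The contiguous-memory point can be made concrete with a rough size comparison. This is only a sketch: `sys.getsizeof` gives approximate figures for CPython, and the variable names are illustrative.

```python
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects; the ndarray stores raw 8-byte values
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
print(f"list   : ~{list_bytes / 1e6:.0f} MB")
print(f"ndarray: {np_arr.nbytes / 1e6:.0f} MB")    # 8 MB
print(np_arr.flags["C_CONTIGUOUS"])                # True
```

Because every element has the same fixed size and sits next to its neighbour, a compiled loop can walk the buffer without following a pointer per element.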

# 6.3 Getting Started with Pandas

# 6.3.1 Series and DataFrame

Pandas has two core data structures:

import pandas as pd
import numpy as np

# ===== Series: a one-dimensional array with an index =====
s = pd.Series([85, 92, 78, 90], index=["张三", "李四", "王五", "赵六"])
print(s)
# 张三    85
# 李四    92
# 王五    78
# 赵六    90
print(s["李四"])               # 92: access by index label
print(s[s > 80].index.tolist())  # ['张三', '李四', '赵六']: boolean indexing

# ===== DataFrame: a two-dimensional table with row and column indexes =====
# Columns: 姓名 = name, 语文 = Chinese, 数学 = math, 英语 = English
# Option 1: from a dict
df = pd.DataFrame({
    "姓名": ["张三", "李四", "王五", "赵六"],
    "语文": [85, 92, 78, 90],
    "数学": [88, 95, 82, 87],
    "英语": [79, 91, 85, 93],
})
print(df)
#    姓名  语文  数学  英语
# 0  张三  85   88   79
# 1  李四  92   95   91
# 2  王五  78   82   85
# 3  赵六  90   87   93

# Option 2: from a CSV file
# df = pd.read_csv("data.csv", encoding="utf-8")

# Option 3: from a NumPy array
# df = pd.DataFrame(np.random.randn(100, 4), columns=list("ABCD"))

# ===== Basic attributes =====
print(df.shape)            # (4, 4): rows, columns
print(df.columns.tolist()) # ['姓名', '语文', '数学', '英语']
print(df.dtypes)           # dtype of each column
print(df.describe())       # summary statistics for numeric columns (count/mean/std/min/quartiles/max)
print(df.head(2))          # first 2 rows
print(df["语文"])          # select one column → Series
print(df[["姓名", "语文"]]) # select several columns → DataFrame

# Add columns with vectorized arithmetic (总分 = total, 平均分 = average)
df["总分"] = df["语文"] + df["数学"] + df["英语"]
df["平均分"] = df["总分"] / 3
print(df)

# 6.3.2 Filtering and Querying

# ===== loc: label-based indexing =====
# df.loc[row labels, column labels]
print(df.loc[0])                       # the row whose index label is 0
print(df.loc[0:2, ["姓名", "总分"]])   # labels 0 through 2 (inclusive), columns 姓名 and 总分
print(df.loc[df["总分"] > 260])        # all rows with 总分 > 260

# ===== iloc: position-based indexing =====
print(df.iloc[0])                      # the row at position 0
print(df.iloc[0:2, 0:3])               # first 2 rows, first 3 columns (end exclusive)

# ===== query(): SQL-style filtering =====
print(df.query("总分 >= 260 and 语文 >= 85"))

# ===== Combining conditions =====
cond1 = df["语文"] >= 85
cond2 = df["数学"] >= 85
print(df[cond1 & cond2])               # & is and, | is or, ~ is not

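# 6.3.3 Grouped Aggregation and Pivot Tables

This is one of the most powerful features of Pandas. A single line of code can do what takes about ten minutes with an Excel pivot table:

# Simulate more realistic data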
data = {
    "日期": pd.date_range("2025-01-01", periods=100, freq="D"),
    "城市": np.random.choice(["北京", "上海", "深圳", "杭州"], 100),
    "品类": np.random.choice(["服装", "电子", "食品", "家居"], 100),
    "销售额": np.random.randint(500, 5000, 100),
    "数量": np.random.randint(1, 20, 100),
}
df = pd.DataFrame(data)
df["月份"] = df["日期"].dt.month

# ===== groupby(): grouped aggregation =====
# Group by city to see which city sells the most
print(df.groupby("城市")["销售额"].agg(["count", "sum", "mean", "max"]))

# Group by city and category: two-level grouping
print(df.groupby(["城市", "品类"])["销售额"].sum().unstack())

# Several columns, several aggregations
result = df.groupby("城市").agg({
    "销售额": ["sum", "mean"],
    "数量": "sum",
})
print(result)

# ===== pivot_table(): pivot tables =====
pivot = pd.pivot_table(
    df,
    values="销售额",
    index="城市",
    columns="品类",
    aggfunc="sum",
    margins=True,          # add a totals row and column
)
print(pivot)

🖼️ How groupby executes:

Original data            groupby("城市")         Aggregated result
┌────┬────┬──────┐      split             combine
│城市│品类│销售额│  →  [北京: rows...]  →  sum/mean
│北京│电子│ 3000│      [上海: rows...]
│北京│服装│ 1500│      [深圳: rows...]
│上海│食品│ 2000│      [杭州: rows...]
│深圳│电子│ 4500│
│上海│家居│ 1800│
└────┴────┴──────┘
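
The split → apply → combine flow in the diagram can also be written out by hand. The sketch below uses the five rows from the diagram in a throwaway frame named `df_demo` (an illustrative name) and checks that the manual loop gives the same result as the built-in aggregation.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "城市": ["北京", "北京", "上海", "深圳", "上海"],
    "品类": ["电子", "服装", "食品", "电子", "家居"],
    "销售额": [3000, 1500, 2000, 4500, 1800],
})

# Split: one sub-DataFrame per city; apply: sum each one; combine: collect into a Series
manual = {}
for city, group in df_demo.groupby("城市"):
    manual[city] = group["销售额"].sum()
manual = pd.Series(manual, name="销售额").sort_index()

# The built-in form does the same three steps in optimized code
builtin = df_demo.groupby("城市")["销售额"].sum().sort_index()
print(builtin)
```

The explicit loop is only for understanding. In real code the built-in aggregation is shorter and much faster.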

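# 6.3.4 Handling Missing and Duplicate Values

Real-world CSV files are almost never clean, and dealing with missing values and dirty data is often said to take up most of an analyst's time:

# Create dirty data
df2 = df.copy()
df2["销售额"] = df2["销售额"].astype(float)   # NaN needs a float column
df2.loc[5:8, "销售额"] = np.nan       # missing values
df2.loc[10, "数量"] = 999              # outlier
df2 = pd.concat([df2, df2.iloc[:3]], ignore_index=True)   # duplicates

# ① Detect missing values
print(df2.isnull().sum())              # number of missing values per column

# ② Handle missing values: pick ONE strategy (once the gaps are filled, the others have nothing left to do)
df2["销售额"] = df2["销售额"].fillna(0)                       # fill with a constant
# df2["销售额"] = df2["销售额"].fillna(df2["销售额"].mean())  # or: fill with the mean
# df2["销售额"] = df2["销售额"].ffill()                       # or: forward fill (common for time series)
# df2 = df2.dropna(subset=["销售额"])                         # or: drop rows with missing values

# ③ Outliers: filter with a condition
df2 = df2[df2["数量"].between(1, 50)]   # keep only 1 to 50

# ④ Deduplicate
df2 = df2.drop_duplicates()
print(f"去重后：{len(df2)} 行")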
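
To see how the missing-value strategies differ, it helps to apply each one to its own copy of the same small Series. This is a minimal sketch with made-up numbers, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, np.nan, 300.0, np.nan, 500.0], name="销售额")

filled_zero = s.fillna(0)            # constant
filled_mean = s.fillna(s.mean())     # mean of the non-missing values (300.0)
filled_prev = s.ffill()              # carry the previous value forward
dropped = s.dropna()                 # discard the missing entries

print(pd.DataFrame({"zero": filled_zero, "mean": filled_mean, "ffill": filled_prev}))
print(len(dropped))                  # 3
```

Which strategy fits depends on the data: a constant is simple but can distort totals, forward fill assumes ordered observations, and dropping rows loses the other columns too.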

# 6.3.5 Merging and Joining

# Simulate an orders table and a customers table
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": ["C001", "C002", "C001", "C003"],
    "amount": [150, 200, 120, 300],
})
customers = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003"],
    "name": ["张三", "李四", "王五"],
    "level": ["VIP", "普通", "普通"],
})

# merge(): the equivalent of a SQL JOIN
merged = pd.merge(orders, customers, on="customer_id", how="left")
print(merged)
#    order_id customer_id  amount name level
# 0         1        C001     150   张三   VIP
# 1         2        C002     200   李四   普通
# 2         3        C001     120   张三   VIP
# 3         4        C003     300   王五   普通

# concat(): stack vertically or side by side
combined = pd.concat([orders, orders])  # vertical stack (two copies)

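🔑 The four values of merge's how parameter:

how	Meaning	SQL analogue
`"inner"`	Intersection of keys	INNER JOIN
`"left"`	Keep every row of the left table	LEFT JOIN
`"right"`	Keep every row of the right table	RIGHT JOIN
`"outer"`	Union of keys	FULL OUTER JOIN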
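
The four join types are easiest to compare when each table has one key the other lacks. In the sketch below the frames `left` and `right` are invented for illustration: C004 appears only on the left and C003 only on the right.

```python
import pandas as pd

left = pd.DataFrame({"customer_id": ["C001", "C002", "C004"], "amount": [150, 200, 90]})
right = pd.DataFrame({"customer_id": ["C001", "C002", "C003"], "name": ["张三", "李四", "王五"]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    # indicator=True adds a _merge column saying where each row came from
    merged = pd.merge(left, right, on="customer_id", how=how, indicator=True)
    row_counts[how] = len(merged)
    print(how, merged["_merge"].value_counts().to_dict())

print(row_counts)   # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```

The `_merge` column is a convenient way to audit a join, for example to find orders whose customer record is missing.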

# 6.4 Data Visualization

# 6.4.1 Line and Bar Charts

import matplotlib.pyplot as plt
import numpy as np

# Configure fonts that can render Chinese text
plt.rcParams["font.sans-serif"] = ["Arial Unicode MS", "SimHei", "Heiti SC"]
plt.rcParams["axes.unicode_minus"] = False

# ===== Line chart =====
months = ["1月", "2月", "3月", "4月", "5月", "6月"]
sales = [12000, 15000, 13500, 18000, 16000, 21000]

plt.figure(figsize=(10, 5))
plt.plot(months, sales, marker="o", color="#2196F3",
         linewidth=2, markersize=8, label="销售额")
plt.title("上半年销售趋势", fontsize=16, fontweight="bold")
plt.xlabel("月份")
plt.ylabel("销售额（元）")
plt.grid(True, alpha=0.3)
plt.legend()

# Annotate each data point with its value
for x, y in zip(months, sales):
    plt.text(x, y + 300, f"{y:,}", ha="center")

plt.savefig("line_chart.png", dpi=150, bbox_inches="tight")

# ===== Bar chart =====
categories = ["服装", "电子", "食品", "家居", "运动"]
revenue = [45000, 82000, 38000, 29000, 21000]
colors = ["#4CAF50", "#2196F3", "#FF9800", "#9C27B0", "#F44336"]

plt.figure(figsize=(10, 5))
bars = plt.bar(categories, revenue, color=colors, edgecolor="white")
plt.title("各品类销售额", fontsize=16, fontweight="bold")
plt.ylabel("销售额（元）")

# Label the top of each bar
for bar, val in zip(bars, revenue):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 500,
             f"¥{val:,}", ha="center", fontweight="bold")

plt.savefig("bar_chart.png", dpi=150, bbox_inches="tight")

# 6.4.2 Scatter and Pie Charts

# ===== Scatter plot: inspect the correlation between two variables =====
np.random.seed(42)
n = 200
x = np.random.randn(n) * 15 + 100        # ad spend (yuan)
y = x * 2.5 + np.random.randn(n) * 30    # sales (yuan), positively correlated with x

plt.figure(figsize=(10, 6))
scatter = plt.scatter(x, y, c=y, cmap="viridis",
                      s=x/2, alpha=0.6, edgecolors="white")
plt.colorbar(scatter, label="销售额")
plt.title("广告投放 vs 销售额", fontsize=16, fontweight="bold")
plt.xlabel("广告投放（元）")
plt.ylabel("销售额（元）")

# Trend line (least-squares fit of a degree-1 polynomial)
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
x_sorted = np.sort(x)
plt.plot(x_sorted, p(x_sorted), "r--", linewidth=2, label="趋势线")
plt.legend()
plt.savefig("scatter_chart.png", dpi=150, bbox_inches="tight")

# ===== Pie chart =====
categories = ["服装", "电子", "食品", "家居", "其他"]
shares = [28, 35, 18, 12, 7]
explode = (0, 0.05, 0, 0, 0)   # pull out the 电子 (electronics) slice

plt.figure(figsize=(8, 8))
wedges, texts, autotexts = plt.pie(
    shares, labels=categories, autopct="%1.1f%%",
    explode=explode, colors=plt.cm.Paired.colors,
    startangle=90, textprops={"fontsize": 12},
)
plt.title("各品类销售占比", fontsize=16, fontweight="bold")
plt.savefig("pie_chart.png", dpi=150, bbox_inches="tight")

# 6.4.3 Subplot Layout and Styling

# 2×2 subplots: four views of the data on one screen
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("销售数据多维分析面板", fontsize=18, fontweight="bold")

# Subplot 1: monthly trend
ax1 = axes[0, 0]
ax1.plot(months, sales, "o-", color="#2196F3", linewidth=2)
ax1.set_title("月度销售趋势")
ax1.grid(alpha=0.3)

# Subplot 2: comparison across cities
ax2 = axes[0, 1]
cities = ["北京", "上海", "深圳", "杭州"]
city_sales = [35000, 42000, 28000, 22000]
ax2.barh(cities, city_sales, color=["#4CAF50", "#2196F3", "#FF9800", "#9C27B0"])
ax2.set_title("各城市销售额")
for i, v in enumerate(city_sales):
    ax2.text(v + 500, i, f"¥{v:,}", va="center")

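# Subplot 3: category share (pie chart); a 2×2 grid only has indices 0 and 1
ax3 = axes[1, 0]
ax3.pie([28, 35, 18, 12, 7], labels=["服装", "电子", "食品", "家居", "其他"],
        autopct="%1.1f%%", startangle=90)
ax3.set_title("各品类销售占比")

# Subplot 4: distribution histogram (simulated order amounts, for illustration only)
ax4 = axes[1, 1]
ax4.hist(np.random.lognormal(mean=6, sigma=0.8, size=1000), bins=30, color="#FF9800")
ax4.set_title("订单金额分布")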

plt.tight_layout()
plt.savefig("dashboard.png", dpi=150, bbox_inches="tight")

# 6.4.4 Quick Polish with seaborn

seaborn is built on matplotlib and produces polished-looking charts with very little code:

import seaborn as sns
sns.set_style("whitegrid")
sns.set_palette("husl")

# Load a sample dataset (downloaded on first use, so this needs network access)
tips = sns.load_dataset("tips")
# Or use your own data: df = pd.read_csv("sales.csv")

# ① Box plot: inspect the distribution
plt.figure(figsize=(10, 5))
sns.boxplot(data=tips, x="day", y="total_bill", hue="smoker")
plt.title("不同日期的账单分布（按是否吸烟）", fontsize=14)

# ② Violin plot: distribution + density
plt.figure(figsize=(10, 5))
sns.violinplot(data=tips, x="day", y="total_bill", inner="quartile")

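# ③ Heatmap: correlation matrix
plt.figure(figsize=(8, 6))
# sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
# Used heavily in real projects; see §6.5

# ④ pairplot: scatter-plot matrix (pairwise relationships between all numeric columns)
# sns.pairplot(df, hue="品类")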

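🔑 Where seaborn and matplotlib each fit:

Matplotlib	Seaborn
Low-level engine: full control	High-level wrapper: good-looking defaults
More code, more freedom	One-line plots, limited customization
`plt.plot()` `plt.bar()`	`sns.lineplot()` `sns.barplot()`
Suited to customized reports	Suited to exploratory data analysis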

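# 6.5 Hands-on Project

Now let's chain the trio together: a 100,000-row e-commerce CSV → cleaning → multi-dimensional analysis → a visual report.

# 6.5.1 Loading and Cleaning the Data

"""
Module 1: data loading + cleaning
"""

import pandas as pd
import numpy as np

# 1. Generate simulated e-commerce data (in real use, replace this with pd.read_csv("sales.csv"))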
np.random.seed(42)

n = 100_000
dates = pd.date_range("2025-01-01", "2025-06-30 23:59:59", periods=n)   # evenly spaced over the first half of 2025 (hourly steps would span 11+ years)
df = pd.DataFrame({
    "order_id": range(1, n + 1),
    "date": dates,
    "category": np.random.choice(["服装", "电子", "食品", "家居", "运动", "美妆"], n,
                                  p=[0.20, 0.25, 0.20, 0.15, 0.10, 0.10]),
    "city": np.random.choice(["北京", "上海", "深圳", "杭州", "广州", "成都"], n),
    "amount": np.random.lognormal(mean=6, sigma=0.8, size=n).astype(int),
    "quantity": np.random.randint(1, 20, n),
    "member_level": np.random.choice(["普通", "银卡", "金卡", "钻石"], n,
                                      p=[0.50, 0.30, 0.15, 0.05]),
})

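# Inject some dirty data (to mimic real-world files)
df["amount"] = df["amount"].astype(float)                              # NaN requires a float column
df.loc[np.random.choice(n, 200, replace=False), "amount"] = np.nan     # 200 missing amounts
df.loc[np.random.choice(n, 100, replace=False), "quantity"] = 999      # 100 abnormal quantities
df = pd.concat([df, df.iloc[:50]], ignore_index=True)                  # 50 duplicated rows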

print(f"原始数据：{len(df)} 行")
print(f"缺失值数：{df.isnull().sum().sum()}")
print(f"数据类型：\n{df.dtypes}")
print(df.head())

# 2. Cleaning
print("\n=== 开始清洗 ===")

# ① Fill missing values with the per-category median
df["amount"] = df.groupby("category")["amount"].transform(
    lambda x: x.fillna(x.median())
)

# ② Outliers: treat quantity > 50 as abnormal
before = len(df)
df = df[df["quantity"].between(1, 50)]
print(f"异常值过滤：{before} → {len(df)} 行")

# ③ Deduplicate
df = df.drop_duplicates(subset=["order_id"])
print(f"去重后：{len(df)} 行")

# ④ Split the date into parts for later analysis
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.weekday     # 0 = Monday
df["hour"] = df["date"].dt.hour

# ⑤ Unit price per item (order amount / quantity)
df["unit_price"] = (df["amount"] / df["quantity"]).round(0)

print(f"\n清洗完成：{len(df)} 行 × {len(df.columns)} 列")
print(df.describe())
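
The cleaning step above uses a fixed threshold (quantity ≤ 50). When no obvious business limit exists, a common alternative is the interquartile-range (IQR) rule. The sketch below is not part of the chapter's pipeline: it uses its own simulated `qty` Series and the conventional 1.5 × IQR multiplier, both chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
qty = pd.Series(rng.integers(1, 20, size=1000))   # quantities 1..19
qty.iloc[:5] = 999                                # inject obvious outliers

q1, q3 = qty.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

mask = qty.between(lower, upper)
clean = qty[mask]
print(f"bounds: [{lower}, {upper}], removed {len(qty) - len(clean)} rows")
```

The IQR rule adapts to the data, but it can be too aggressive for heavily skewed columns such as order amounts, so it is worth inspecting what it removes before dropping rows.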

# 6.5.2 Multi-dimensional Analysis

"""
Module 2: multi-dimensional analysis
"""

print("\n" + "=" * 60)
print("一、各品类销售概览")
print("=" * 60)
cat_stats = df.groupby("category").agg(
    总销售额=("amount", "sum"),
    订单数=("order_id", "count"),
    平均客单价=("amount", "mean"),
    平均数量=("quantity", "mean"),
).round(0)
cat_stats["销售额占比"] = (cat_stats["总销售额"] / cat_stats["总销售额"].sum() * 100).round(1)
print(cat_stats.sort_values("总销售额", ascending=False))

print("\n" + "=" * 60)
print("二、月度销售趋势")
print("=" * 60)
monthly = df.groupby("month").agg(
    销售额=("amount", "sum"),
    订单数=("order_id", "count"),
).round(0)
monthly["月环比"] = (monthly["销售额"].pct_change() * 100).round(1)
print(monthly)

print("\n" + "=" * 60)
print("三、城市 × 品类交叉分析")
print("=" * 60)
pivot = pd.pivot_table(
    df, values="amount", index="city", columns="category",
    aggfunc="sum", margins=True, margins_name="合计"
).round(0)
print(pivot)

print("\n" + "=" * 60)
print("四、会员等级分析")
print("=" * 60)
member_stats = df.groupby("member_level").agg(
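    人数=("order_id", "count"),        # counts orders: the data has no customer ID, so this is not a head count
    总消费=("amount", "sum"),
    人均消费=("amount", "mean"),        # likewise an average per order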
).round(0)
print(member_stats)

print("\n" + "=" * 60)
print("五、时段分析（按小时）")
print("=" * 60)
hourly = df.groupby("hour").agg(
    订单数=("order_id", "count"),
    销售额=("amount", "sum"),
).round(0)
# Find the peak hour
peak_hour = hourly["订单数"].idxmax()
print(f"下单高峰：{peak_hour}:00（{hourly.loc[peak_hour, '订单数']} 单）")
print(hourly.sort_values("订单数", ascending=False).head(6))

print("\n" + "=" * 60)
print("六、品类备货建议")
print("=" * 60)
# Assumption: weekly stock = average daily units sold over the last 30 days × 7 (no category weights are applied here)
recent_30 = df[df["date"] >= df["date"].max() - pd.Timedelta(days=30)]
stock_advice = recent_30.groupby("category").agg(
    日均销量=("quantity", "sum"),
).reset_index()
stock_advice["日均销量"] = (stock_advice["日均销量"] / 30).round(0)
stock_advice["建议备货量"] = (stock_advice["日均销量"] * 7).astype(int)  # 周备货
print(stock_advice.sort_values("日均销量", ascending=False))
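
If category importance should influence stocking, one simple option is to multiply the weekly quantity by a per-category weight. The sketch below is purely illustrative: the frame `stock_demo`, the daily figures, and the `weights` values are all hypothetical and would need to come from business input.

```python
import pandas as pd

# Hypothetical daily sales and importance weights, for illustration only
stock_demo = pd.DataFrame({
    "category": ["电子", "服装", "食品"],
    "日均销量": [120.0, 80.0, 60.0],
})
weights = {"电子": 1.2, "服装": 1.0, "食品": 0.9}

stock_demo["权重"] = stock_demo["category"].map(weights)
stock_demo["建议备货量"] = (stock_demo["日均销量"] * 7 * stock_demo["权重"]).round().astype(int)
print(stock_demo)
```

A category missing from `weights` would get NaN after `map`, so in practice a default such as `.fillna(1.0)` may be needed before the multiplication.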

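# 6.5.3 Visualization Report

"""
Module 3: visualization + report export
"""
import matplotlib
matplotlib.use("Agg")                      # non-interactive backend; conventionally selected before importing pyplot
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"] = ["Arial Unicode MS", "SimHei"]
plt.rcParams["axes.unicode_minus"] = False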

fig = plt.figure(figsize=(16, 12))
fig.suptitle("电商销售数据分析报告", fontsize=20, fontweight="bold", y=0.98)   # no emoji: the CJK fonts above usually lack emoji glyphs

# ============ Subplot 1: sales by category (bar chart + share) ============
ax1 = plt.subplot(2, 3, 1)
cats = cat_stats.sort_values("总销售额", ascending=True)
ax1.barh(cats.index, cats["总销售额"] / 10000, color=plt.cm.viridis(np.linspace(0, 1, len(cats))))
ax1.set_title("各品类销售额（万元）")
for i, (idx, row) in enumerate(cats.iterrows()):
    ax1.text(row["总销售额"] / 10000 + 10, i - 0.1,
             f"{row['总销售额']/10000:.0f}万 ({row['销售额占比']}%)",
             fontsize=8)

# ============ Subplot 2: monthly trend ============
ax2 = plt.subplot(2, 3, 2)
ax2.plot(monthly.index, monthly["销售额"] / 10000, "o-", color="#E91E63", linewidth=2, markersize=8)
ax2.set_title("月度销售趋势（万元）")
ax2.set_xticks(monthly.index)
ax2.grid(alpha=0.3)
for m, v in zip(monthly.index, monthly["销售额"] / 10000):
    ax2.text(m, v + 3, f"{v:.0f}万", ha="center", fontsize=8)

# ============ Subplot 3: sales by city (pie chart) ============
ax3 = plt.subplot(2, 3, 3)
city_data = df.groupby("city")["amount"].sum().sort_values(ascending=False)
ax3.pie(city_data.values, labels=city_data.index, autopct="%1.1f%%",
        startangle=90, colors=plt.cm.Set3.colors)
ax3.set_title("销售额城市分布")

# ============ Subplot 4: hourly line chart (in place of an hourly heatmap) ============
ax4 = plt.subplot(2, 3, 4)
ax4.fill_between(hourly.index, hourly["订单数"] / 1000, alpha=0.4, color="#4CAF50")
ax4.plot(hourly.index, hourly["订单数"] / 1000, color="#2E7D32", linewidth=2)
ax4.set_title("24 小时下单分布（千单）")
ax4.set_xlabel("小时")
ax4.axvline(peak_hour, color="red", linestyle="--", alpha=0.7)
ax4.text(peak_hour + 0.5, ax4.get_ylim()[1] * 0.9, f"高峰 {peak_hour}:00", color="red")

# ============ Subplot 5: spending by membership level (bar chart) ============
ax5 = plt.subplot(2, 3, 5)
level_order = ["普通", "银卡", "金卡", "钻石"]
member_data = member_stats.reindex(level_order)
colors_bar = ["#BDBDBD", "#9E9E9E", "#FFC107", "#FF5722"]
ax5.bar(level_order, member_data["人均消费"] / 1000, color=colors_bar, edgecolor="white")
ax5.set_title("各等级会员人均消费（千元）")
for i, (idx, row) in enumerate(member_data.iterrows()):
    # Bars are only ~0.5 (thousand yuan) tall, so keep the label offset small
    ax5.text(i, row["人均消费"] / 1000 + 0.01, f"¥{row['人均消费']/1000:.1f}K",
             ha="center", fontweight="bold")

# ============ Subplot 6: average unit price by category (means instead of a box plot) ============
ax6 = plt.subplot(2, 3, 6)
unit_data = df.groupby("category")["unit_price"].mean().sort_values()
ax6.bar(unit_data.index, unit_data.values, color=plt.cm.plasma(np.linspace(0, 1, len(unit_data))))
ax6.set_title("各品类平均件单价（元）")
ax6.set_ylabel("元/件")
for i, (name, val) in enumerate(unit_data.items()):
    ax6.text(i, val + 2, f"¥{val:.0f}", ha="center", fontsize=8)

plt.tight_layout(pad=3)
plt.savefig("ecommerce_report.png", dpi=150, bbox_inches="tight")
print("\n✅ 报告已保存：ecommerce_report.png")


# ============ Export an Excel report (pandas' built-in ExcelWriter; requires the openpyxl package) ============
with pd.ExcelWriter("ecommerce_report.xlsx", engine="openpyxl") as writer:
    cat_stats.to_excel(writer, sheet_name="品类概览")
    monthly.to_excel(writer, sheet_name="月度趋势")
    pivot.to_excel(writer, sheet_name="城市×品类")
    member_stats.to_excel(writer, sheet_name="会员分析")
    hourly.to_excel(writer, sheet_name="时段分析")
    stock_advice.to_excel(writer, sheet_name="备货建议", index=False)
print("✅ Excel 报告已保存：ecommerce_report.xlsx")

After running, you get:

ecommerce_report.png: a 6-in-1 visual dashboard
ecommerce_report.xlsx: a workbook with 6 sheets of the underlying tables, ready to send to your boss

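# 6.6 Beginner Pitfalls

#	Pitfall	Explanation
1	Forgetting to keep the result of a DataFrame operation	Write `df = df.dropna()`, not a bare `df.dropna()`: most Pandas operations return a new object and leave the original unchanged (with `inplace=True` they return `None` instead)
2	`SettingWithCopyWarning`	`df[df["a"]>0]["b"] = 0` modifies a temporary copy; use `df.loc[df["a"]>0, "b"] = 0`
3	Chinese text rendered as boxes	`plt.rcParams["font.sans-serif"]` is not set; use "Arial Unicode MS" on Mac and "SimHei" on Windows
4	Forgetting `reset_index` after groupby	`df.groupby(...).sum()` moves the group keys into the index (a MultiIndex when grouping by several columns); append `.reset_index()` to get a flat table back
5	Comparisons with NaN are always False	`np.nan == np.nan` → `False`; use `pd.isna()` to detect NaN, or `df.dropna()` to remove it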
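
Pitfalls 1 and 5 are quick to verify on a three-row example. The names `df_demo` and `df_clean` below are illustrative.

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)              # False

s = pd.Series([1.0, np.nan, 3.0])
print((s == np.nan).sum())           # 0: equality never matches NaN
print(s.isna().sum())                # 1

df_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0]})
df_demo.dropna()                     # result discarded, df_demo is unchanged
print(len(df_demo))                  # 3
df_clean = df_demo.dropna()          # keep the returned object
print(len(df_clean))                 # 2
```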

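Pitfall 2 in detail (chained assignment in Pandas):

# ❌ Wrong: triggers SettingWithCopyWarning
df[df["销售额"] > 1000]["品类"] = "高价值"
# Problem: is df[...] a view or a copy? In general that is not guaranteed (with a boolean mask it is a copy), so the assignment may never reach df

# ✅ Correct: use loc
df.loc[df["销售额"] > 1000, "品类"] = "高价值"   # one explicit indexing step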
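
The difference is easy to observe on a tiny frame. In the sketch below (`df_demo` is an illustrative name) the chained form leaves the data untouched, while the `loc` form updates it. The warning is silenced only to keep the demo output clean.

```python
import warnings
import pandas as pd

df_demo = pd.DataFrame({"销售额": [800, 1500, 2200], "品类": ["A", "B", "C"]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")          # silence the chained-assignment warning for the demo
    df_demo[df_demo["销售额"] > 1000]["品类"] = "高价值"
after_chained = df_demo["品类"].tolist()
print(after_chained)                         # ['A', 'B', 'C']: nothing changed

df_demo.loc[df_demo["销售额"] > 1000, "品类"] = "高价值"
after_loc = df_demo["品类"].tolist()
print(after_loc)                             # ['A', '高价值', '高价值']
```

The exact warning class differs between Pandas versions, but with a boolean mask the chained form never updates the original frame.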

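# 6.7 Review Questions

NumPy broadcasting vs. Python loops: (a[:, np.newaxis] + b) and [[x+y for y in b] for x in a] produce the same values, but the former handles tens of millions of elements per second while the latter bogs down at around a million. In which situations can broadcasting not be used? What happens when the shapes are incompatible?
Performance of groupby().agg(): grouping 10 million rows by city and summing sales takes a noticeable amount of time in Pandas (the exact figure depends on the key dtype and the hardware). If this query had to run every day, what would you do? What options are there for speeding it up? (Hint: Dask, Polars, materialized views in a database)
Matplotlib's object-oriented API vs. pyplot: both plt.plot() and fig, ax = plt.subplots(); ax.plot() draw a chart. What is the essential difference between them? Why is the latter recommended for production code?
Pandas method chaining vs. SQL: take df.query("category == '电子'").groupby("city")["amount"].mean(). If the data lived in PostgreSQL and you wrote this in SQL, which would be faster? When should you use SQL, and when Pandas?
DataFrame memory optimization: an int64 column with 1 million rows takes about 8 MB. If the column only ever holds the four values 0/1/2/3, astype("int8") shrinks it to about 1 MB. A string column (object dtype) is different: every string is a separate Python object, so memory can balloon to 100 MB. How do Pandas' StringDtype (optionally backed by PyArrow) and CategoricalDtype address this?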
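
As a starting point for the last question, the memory figures can be measured directly. This is a sketch with simulated data. The object-dtype number depends on the Python build, so it is printed rather than stated.

```python
import numpy as np
import pandas as pd

n = 1_000_000
rng = np.random.default_rng(0)

codes = pd.Series(rng.integers(0, 4, size=n), dtype="int64")
print(codes.memory_usage(index=False))                   # 8000000 bytes
print(codes.astype("int8").memory_usage(index=False))    # 1000000 bytes

cities = pd.Series(rng.choice(["北京", "上海", "深圳", "杭州"], size=n), dtype=object)
as_object = cities.memory_usage(index=False, deep=True)
as_category = cities.astype("category").memory_usage(index=False, deep=True)
print(as_object, as_category)
```

A categorical column stores each distinct string once plus a small integer code per row, which is why it shrinks so dramatically when the number of distinct values is small.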


Last updated: 2026/06/28, 17:55:19
