How to Read Excel File Data Without Passing a Raw Byte Stream Directly

心靈之曲 2026-01-09 00:00:00

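This article explains how to read Excel data with pandas safely and in line with the current API, how to avoid the FutureWarning triggered by passing bytes directly, and presents the standard `BytesIO`-based solution along with best practices.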

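When you use pandas.read_excel() to read an Excel file from memory (for example from Azure Blob Storage, a Flask request body, or any other in-memory payload), passing a bytes object (such as the return value of blob_data.readall()) directly to read_excel() triggers the following deprecation warning: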

FutureWarning: Passing bytes to 'read_excel' is deprecated and will be removed in a future version. 
To read from a byte string, wrap it in a `BytesIO` object.

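The warning is explicit: a future version of read_excel will stop accepting raw bytes, and it should receive a file-like object instead. BytesIO, from the io module of the Python standard library, is an in-memory binary buffer designed for exactly this case. It implements read(), seek(), and the other methods that pandas' internal IO handling relies on.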
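
To make the "file-like object" idea concrete, here is a minimal standard-library sketch. It does not touch pandas at all. It builds a tiny ZIP archive in memory (an .xlsx file is a ZIP container of XML parts), wraps the resulting bytes in BytesIO, and shows the read/seek behavior a parser depends on. The member name `xl/workbook.xml` and the variable names are illustrative only.

```python
from io import BytesIO
import zipfile

# Build a tiny ZIP archive in memory; an .xlsx file is a ZIP container of XML parts.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
raw = buffer.getvalue()  # plain bytes, like the payload returned by a download

# Wrapping the bytes yields a seekable, file-like object.
stream = BytesIO(raw)
first_four = stream.read(4)  # read() consumes from the current position
position = stream.tell()     # the cursor has advanced by four bytes
stream.seek(0)               # rewind so a parser can start from the beginning

# zipfile needs random access (seek) to locate the central directory at the end.
with zipfile.ZipFile(stream) as archive:
    members = archive.namelist()
```

Raw bytes have no cursor and no seek(), which is why a reader that jumps around inside the archive wants the BytesIO wrapper rather than the byte string itself.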

✅ Correct approach (recommended, and it also works on older pandas versions):

from io import BytesIO
import pandas as pd

# Assume blob_data is a download response object, e.g. one obtained via azure.storage.blob's BlobClient
excel_bytes = blob_data.readall()  # type: bytes
df = pd.read_excel(BytesIO(excel_bytes), engine='openpyxl')
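
If some callers hand you bytes and others already hand you an open stream, a small normalizing helper keeps the wrapping in one place. The sketch below is one possible shape for it; `as_file_like` is a hypothetical name, not part of pandas or the standard library.

```python
from io import BytesIO
from typing import BinaryIO, Union


def as_file_like(source: Union[bytes, bytearray, memoryview, BinaryIO]) -> BinaryIO:
    """Return a file-like object for `source` (hypothetical helper).

    Bytes-like inputs are wrapped in BytesIO; objects that already expose
    read() are returned unchanged; anything else is rejected.
    """
    if isinstance(source, (bytes, bytearray, memoryview)):
        return BytesIO(source)
    if hasattr(source, "read"):
        return source
    raise TypeError(
        f"expected bytes or a file-like object, got {type(source).__name__}"
    )
```

With this in place, a call such as pd.read_excel(as_file_like(payload), engine='openpyxl') behaves the same whether payload is a byte string or an already opened stream.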

Additional notes and caveats:

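Engine choice: engine='openpyxl' handles .xlsx/.xlsm files. For legacy .xls files, use engine='xlrd' instead. Note that xlrd>=2.0 reads only .xls and no longer supports .xlsx, so for .xlsx stick with openpyxl, or use calamine (requires pandas>=2.2.0 and the python-calamine package);
Performance: for large files from which you need several sheets, wrap the bytes once with BytesIO(excel_bytes) and reuse that object across multiple read_excel(..., sheet_name=...) calls, which avoids copying the bytes again. Each call still re-parses the workbook, though. Opening it once with pd.ExcelFile, or passing sheet_name=None to load every sheet in a single call, avoids the repeated parsing;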
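
As a rough illustration of the performance note, the sketch below parses a workbook once with pd.ExcelFile and then pulls individual sheets from it. It assumes pandas and openpyxl are installed. The two-sheet workbook, the sheet names `orders` and `users`, and the column names are made up so that the example is self-contained.

```python
from io import BytesIO

import pandas as pd

# Build a small two-sheet workbook in memory so the sketch is self-contained.
source = BytesIO()
with pd.ExcelWriter(source, engine="openpyxl") as writer:
    pd.DataFrame({"id": [1, 2]}).to_excel(writer, sheet_name="orders", index=False)
    pd.DataFrame({"name": ["a", "b", "c"]}).to_excel(writer, sheet_name="users", index=False)
excel_bytes = source.getvalue()

# Open the workbook once, then read individual sheets from the same ExcelFile.
with pd.ExcelFile(BytesIO(excel_bytes), engine="openpyxl") as xls:
    sheet_names = xls.sheet_names
    orders = xls.parse("orders")
    users = xls.parse("users")

# Alternatively, sheet_name=None returns a dict of all sheets in one call.
all_sheets = pd.read_excel(BytesIO(excel_bytes), sheet_name=None, engine="openpyxl")
```

Which variant is faster for a given file depends on the workbook and the engine, so it is worth measuring on real data before settling on one.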

Exception handling suggestion:

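import zipfile

try:
    df = pd.read_excel(BytesIO(blob_data.readall()), engine='openpyxl')
except (ValueError, zipfile.BadZipFile) as e:
    # openpyxl raises zipfile.BadZipFile when the payload is not a ZIP-based .xlsx
    raise ValueError(f"Failed to parse Excel data; check that the file is a valid .xlsx: {e}") from e
except Exception as e:
    raise RuntimeError(f"Unexpected error while reading Excel data: {e}") from e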
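
Another way to fail early with a clear message is to look at the first bytes of the payload before handing it to pandas. The following is a sketch under stated assumptions, not a full validator: `guess_excel_engine` is a hypothetical helper, and the check only distinguishes a ZIP container (.xlsx/.xlsm) from an OLE2 compound file (legacy .xls). A matching signature does not prove the file is a valid workbook.

```python
from io import BytesIO
import zipfile

ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx/.xlsm are ZIP archives
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # legacy .xls compound file


def guess_excel_engine(data: bytes) -> str:
    """Pick a pandas engine name from the file signature (hypothetical helper)."""
    if data.startswith(ZIP_MAGIC):
        return "openpyxl"
    if data.startswith(OLE2_MAGIC):
        return "xlrd"
    raise ValueError("payload does not look like an .xlsx or .xls file")


# Demonstration with an in-memory ZIP archive standing in for an .xlsx payload.
_buffer = BytesIO()
with zipfile.ZipFile(_buffer, "w") as _archive:
    _archive.writestr("xl/workbook.xml", "<workbook/>")
sample_xlsx_like = _buffer.getvalue()
engine_for_sample = guess_excel_engine(sample_xlsx_like)
```

The returned name can then be passed along, for example pd.read_excel(BytesIO(data), engine=guess_excel_engine(data)), so a CSV or HTML error page uploaded by mistake is rejected before any parsing starts.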

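Alternative (advanced): for high-performance scenarios, or ones where parsing should not depend on pandas, consider python-calamine (Python bindings for the Rust calamine library, imported as python_calamine):

from python_calamine import CalamineWorkbook

workbook = CalamineWorkbook.from_filelike(BytesIO(blob_data.readall()))
rows = workbook.get_sheet_by_index(0).to_python()  # list of rows, each a list of cell values
df = pd.DataFrame(rows[1:], columns=rows[0])  # assumes the first row holds the column headers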