Miscellaneous Notes on Third-Party Python Libraries

Mostly commonly used third-party libraries.

Other libraries

unyt

2022/7/7

Equality checks between quantities with units can be unreliable when floating-point values are involved; see issue #238.
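The note does not describe what issue #238 looks like, so the following is not a reproduction of it and does not use unyt at all. It is only a plain-float sketch, under the assumption that the trouble comes from ordinary floating-point rounding during unit conversion; CM_PER_M and the variable names are made up for illustration.

```python
import math

CM_PER_M = 100  # hypothetical conversion factor, metres to centimetres

length_m = 1.1
length_cm = length_m * CM_PER_M  # rounding makes this slightly more than 110

# Exact equality fails even though the two lengths are "the same".
exactly_equal = length_cm == 110.0

# A tolerance-based comparison is the usual way around this.
approximately_equal = math.isclose(length_cm, 110.0, rel_tol=1e-9)

print(exactly_equal, approximately_equal)
```

When comparing quantities that went through a conversion, a tolerance-based check is usually the safer default.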

Pandas

Cookbook

2020/5/30

Pipelines (df.pipe), temporary columns (df.assign), and so on.

groupby sorts by default
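A minimal sketch of how df.pipe and df.assign fit together in one chain. The helper add_total and the column names are hypothetical, chosen only for illustration.

```python
import pandas as pd


def add_total(df: pd.DataFrame, price_col: str, qty_col: str) -> pd.DataFrame:
    """Hypothetical helper: return a new frame with an extra 'total' column."""
    return df.assign(total=df[price_col] * df[qty_col])


orders = pd.DataFrame({"price": [2.0, 3.5, 1.0], "qty": [3, 2, 10]})

result = (
    orders
    .pipe(add_total, "price", "qty")                    # plug a custom function into the chain
    .assign(is_large=lambda d: d["total"] > 6.5)        # temporary column derived from 'total'
    .loc[lambda d: d["is_large"]]                       # filter on the temporary column
    .drop(columns="is_large")                           # drop it once it has served its purpose
)

print(result)
```

Neither pipe nor assign touches orders; each step returns a new DataFrame.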

2022/4/25

sort : bool, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

This behavior is documented. I ran into a subtle bug because I overlooked it. Turning sorting off also improves performance.
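A small sketch of the difference, using made-up data: with the default sort=True the group keys come back sorted, while sort=False keeps them in order of first appearance.

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "b", "c", "a"], "val": [1, 2, 3, 4, 5]})

# Default: group keys are sorted.
sorted_keys = df.groupby("key")["val"].sum().index.tolist()

# sort=False: group keys keep the order in which they first appear.
first_seen_keys = df.groupby("key", sort=False)["val"].sum().index.tolist()

print(sorted_keys)      # ['a', 'b', 'c']
print(first_seen_keys)  # ['b', 'a', 'c']
```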

Pandas 1.1 isin bug

2022/3/24

import pandas as pd

a = pd.Series([42])
b = pd.Series(['42'])
a.isin(b)  # True in 1.1
b.isin(a)  # False in 1.1
# Both are False since 1.2

Strings and integers are distinct and are therefore not comparable

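This sentence only appears in the documentation starting with version 1.2. For the related issue, see "Pandas doesn’t always cast strings to int consistently when using .isin()".

Assigning to columns inside apply may modify the DataFrame in place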
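One way to avoid depending on version-specific behavior is to align the dtypes explicitly before calling isin. This is only a sketch reusing the two Series from the example above; casting with astype is my suggestion, not something taken from the linked issue.

```python
import pandas as pd

a = pd.Series([42])
b = pd.Series(["42"])

# Cast the strings to the integer dtype of `a` before comparing.
int_match = a.isin(b.astype(a.dtype))

# Or cast the integers to strings, depending on which side is authoritative.
str_match = b.isin(a.astype(str))

print(int_match.tolist(), str_match.tolist())
```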

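import pandas as pd

df = pd.DataFrame([1, 2], columns=['a'])
df['b'] = 0

def f(row):
    # Mutating the row passed in by apply can write through to df itself.
    if row['a'] == 1:
        row['a'] = 123
    row['b'] = False
    return row

# The return value is discarded, yet df['a'] has changed in the output below.
df.apply(f, axis=1, result_type='expand')
print(df)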
"""
     a  b
0  123  0
1    2  0
"""

For another example, see python - pandas df.apply unexpectedly changes dataframe inplace - Stack Overflow.
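A defensive sketch, assuming the goal is to compute new rows without touching the original: copy the row inside the function before assigning to it. The data here is illustrative and uses integer values throughout to keep the dtypes simple.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0, 0]})


def f(row: pd.Series) -> pd.Series:
    row = row.copy()  # work on a private copy, never on the object apply hands in
    if row["a"] == 1:
        row["a"] = 123
    row["b"] = -1
    return row


result = df.apply(f, axis=1)

print(df)      # the original frame, left as it was
print(result)  # the modified rows
```

With the copy in place, the outcome no longer depends on whether apply passes a view or a copy of each row.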

Pandas 1.3 inconsistent where

2022/3/24

Pandas 1.2

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([0.5, np.nan])
>>> df.where(pd.notnull(df), None)
     0
0  0.5
1  None

Pandas 1.3

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([0.5, np.nan])
>>> df.where(pd.notnull(df), None)
     0
0  0.5
1  NaN

See the issue https://github.com/pandas-dev/pandas/issues/42423

Inplace is harmful

It offers no advantage in either time or space, and it prevents method chaining.
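A short sketch of the chaining point, with made-up data: methods called with inplace=True return None, so nothing can be chained after them, whereas the default form returns a new object at every step.

```python
import pandas as pd

df = pd.DataFrame({"name": ["b", "a", None], "score": [2, 1, 3]})

# Without inplace, every step returns a DataFrame, so the calls chain naturally.
clean = (
    df.dropna(subset=["name"])
      .sort_values("score")
      .reset_index(drop=True)
      .rename(columns={"score": "points"})
)

# With inplace=True the call returns None, which ends any chain immediately.
scratch = df.copy()
returned = scratch.sort_values("score", inplace=True)

print(clean)
print(returned)  # None
```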

Ref: python - In pandas, is inplace = True considered harmful, or not? - Stack Overflow

datetime, Timestamp, and datetime64

It really is a mess.

Ref: python - Converting between datetime, Timestamp and datetime64 - Stack Overflow
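As a quick orientation, here is a sketch of the conversions I believe are the most common ones between the three types. It deliberately makes no claim about the resolution (ns, us, ...) of the intermediate values, since that differs between library versions.

```python
from datetime import datetime

import numpy as np
import pandas as pd

dt = datetime(2022, 3, 24, 12, 30)

# datetime -> Timestamp / datetime64
ts = pd.Timestamp(dt)
dt64 = np.datetime64(dt)

# Timestamp -> datetime / datetime64
dt_from_ts = ts.to_pydatetime()
dt64_from_ts = ts.to_datetime64()

# datetime64 -> Timestamp -> datetime (going through Timestamp avoids
# having to think about the datetime64 unit)
ts_from_dt64 = pd.Timestamp(dt64)
dt_from_dt64 = pd.Timestamp(dt64).to_pydatetime()

print(type(ts), type(dt64), type(dt_from_ts), type(dt64_from_ts))
```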

Numpy

np.nan

  • np.nan == np.nan returns False.
  • np.nan is a floating point constant.

Ref: Python NumPy For Your Grandma - 3.5 nan
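The two points above can be checked directly. The sketch also shows np.isnan, which is the usual way to test for NaN since == cannot be used.

```python
import math

import numpy as np

# NaN never compares equal to anything, including itself.
nan_equals_itself = np.nan == np.nan

# np.nan is an ordinary Python float.
nan_is_float = isinstance(np.nan, float)

# Use isnan instead of == to detect NaN.
arr = np.array([1.0, np.nan, 3.0])
nan_mask = np.isnan(arr)

# NaN-aware reductions skip the missing values.
mean_ignoring_nan = np.nanmean(arr)

print(nan_equals_itself, nan_is_float, nan_mask.tolist(), mean_ignoring_nan)
```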

SQLAlchemy

Close connection with pandas

Once the connection is wrapped in a with block, it seems that dispose is not needed?
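A small sketch of what I understand the situation to be, using an in-memory SQLite database purely for illustration. The with block closes the connection (returning it to the engine's pool) on exit; engine.dispose() is a separate step that closes the pool itself, which matters mainly when the engine is being thrown away.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory SQLite, used here only so the sketch is self-contained.
engine = create_engine("sqlite://")

with engine.connect() as conn:
    df = pd.read_sql_query(text("SELECT 42 AS answer"), conn)

# After the with block the connection object reports itself as closed.
connection_closed = conn.closed

# Optional: release the pooled connections once the engine is no longer needed.
engine.dispose()

print(df)
print(connection_closed)
```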

Streaming

Reading the results of large queries.

schedule

schedule, a package for scheduled jobs
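One possible approach, sketched under assumptions: ask SQLAlchemy for a streaming result with the stream_results execution option and let pandas read it in chunks via chunksize. The table name numbers and the chunk size are made up, and in-memory SQLite is used only to keep the example self-contained; whether rows are truly streamed from the server depends on the database driver.

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # illustrative in-memory database

chunk_sizes = []
total = 0

with engine.connect() as conn:
    # Set up a small table so there is something to read.
    conn.execute(text("CREATE TABLE numbers (x INTEGER)"))
    conn.execute(
        text("INSERT INTO numbers (x) VALUES (:x)"),
        [{"x": i} for i in range(10)],
    )

    # Request a streaming cursor where the driver supports it.
    streaming_conn = conn.execution_options(stream_results=True)

    # chunksize makes pandas yield DataFrames of at most 4 rows each.
    for chunk in pd.read_sql_query(
        text("SELECT x FROM numbers"), streaming_conn, chunksize=4
    ):
        chunk_sizes.append(len(chunk))
        total += int(chunk["x"].sum())

engine.dispose()

print(chunk_sizes, total)
```

Processing each chunk inside the loop keeps only one chunk in memory at a time on the pandas side.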

How it works

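Once a job is scheduled, each call to run_pending checks whether the current time has reached the job's next run time (which is determined by its previous run) and, if so, runs the job. If two calls to run_pending are too far apart, a job that should have run several times in between runs exactly once. This is important, yet the documentation does not mention it; it appears only in a docstring: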

# schedule/__init__.py#L88-L100
def run_pending(self) -> None:
    """
    Run all jobs that are scheduled to run.
    Please note that it is *intended behavior that run_pending()
    does not run missed jobs*. For example, if you've registered a job
    that should run every minute and you only call run_pending()
    in one hour increments then your job won't be run 60 times in
    between but only once.
    """
    runnable_jobs = (job for job in self.jobs if job.should_run)
    for job in sorted(runnable_jobs):
        self._run_job(job)
# schedule/__init__.py#L637-L642
def should_run(self) -> bool:
    """
    :return: ``True`` if the job should be run now.
    """
    assert self.next_run is not None, "must run _schedule_next_run before"
    return datetime.datetime.now() >= self.next_run
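To make the "missed runs collapse into one" behavior concrete, here is a toy model of the logic described above. It is not the library's code: ToyJob, its attributes, and the explicit now argument are all hypothetical, introduced so the example can run with a simulated clock instead of waiting.

```python
import datetime as dt


class ToyJob:
    """Hypothetical stand-in for a scheduled job with a fixed interval."""

    def __init__(self, interval: dt.timedelta, start: dt.datetime) -> None:
        self.interval = interval
        self.next_run = start + interval
        self.run_count = 0

    def should_run(self, now: dt.datetime) -> bool:
        return now >= self.next_run

    def run(self, now: dt.datetime) -> None:
        self.run_count += 1
        # The next run is measured from this run, not from the missed slots.
        self.next_run = now + self.interval


def run_pending(jobs: list, now: dt.datetime) -> None:
    for job in jobs:
        if job.should_run(now):
            job.run(now)


start = dt.datetime(2022, 1, 1, 0, 0)
job = ToyJob(dt.timedelta(minutes=1), start)

# One hour passes before the first check: 60 slots were missed, one run happens.
run_pending([job], start + dt.timedelta(hours=1))
runs_after_gap = job.run_count

# Thirty seconds later the job is not yet due again.
run_pending([job], start + dt.timedelta(hours=1, seconds=30))
runs_not_due = job.run_count

# A full interval after the last run it fires again.
run_pending([job], start + dt.timedelta(hours=1, minutes=1))
runs_after_interval = job.run_count

print(runs_after_gap, runs_not_due, runs_after_interval)
```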

statsmodels

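Really hard to use.

An interesting practice: checking arguments leniently

While reading the source code some time ago, I noticed something interesting: the function statsmodels.tsa.seasonal.seasonal_decompose documents its parameter as model: {"additive", "multiplicative"}, optional, but the source checks it with if model.startswith("m"):.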
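A sketch of what this lenient check implies. The function resolve_model is hypothetical and only mimics the startswith("m") test quoted above; it is not taken from statsmodels.

```python
def resolve_model(model: str) -> str:
    """Hypothetical helper: anything starting with 'm' counts as multiplicative."""
    if model.startswith("m"):
        return "multiplicative"
    return "additive"


# Abbreviations are accepted...
print(resolve_model("multiplicative"))  # multiplicative
print(resolve_model("mul"))             # multiplicative
print(resolve_model("additive"))        # additive

# ...but so is anything else: typos and unrelated words are silently
# mapped to one of the two choices instead of raising an error.
print(resolve_model("median"))          # multiplicative
print(resolve_model("xyz"))             # additive
```

The leniency is convenient for abbreviations, at the cost of never rejecting an invalid value.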