Measure Zero

Notes on Python Third-Party Libraries

2022-01-21 | ~ 2022-07-08 | Tech

Mostly commonly used third-party libraries.

Other libraries

unyt

2022/7/7

Equality checks between quantities with units can go wrong when floating-point numbers are involved; see issue#238.

Pandas

Cookbook

2020/5/30

Pipelines (df.pipe), temporary columns (df.assign), and so on.

peter. (2022). Cookbook
data school. (2018). What’s the future of the pandas library?
Sin-Yi Chou. (2019). Pandas - Pipe Method. Uses decorators to help with debugging.
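A minimal sketch of how df.pipe, df.assign, and a debugging decorator can fit together. The function names (log_shape, drop_missing, add_total) and the sample data are made up for illustration and are not taken from the referenced articles.

```python
import functools

import pandas as pd


def log_shape(func):
    """Decorator that prints the DataFrame shape before and after a pipeline step."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        out = func(df, *args, **kwargs)
        print(f"{func.__name__}: {df.shape} -> {out.shape}")
        return out
    return wrapper


@log_shape
def drop_missing(df):
    # Hypothetical cleaning step: drop rows that contain any missing value
    return df.dropna()


@log_shape
def add_total(df, price_col, qty_col):
    # assign returns a new DataFrame with the extra column; the input is not modified
    return df.assign(total=lambda d: d[price_col] * d[qty_col])


raw = pd.DataFrame({"price": [2.0, None, 5.0], "qty": [3, 1, 2]})

# Each pipe call passes the DataFrame as the first argument of the step
result = raw.pipe(drop_missing).pipe(add_total, "price", "qty")
```

Because every step takes and returns a DataFrame, steps can be reordered or removed without touching the others, and the decorator shows where rows are lost.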

groupby sorts by default

2022/4/25

sort : bool, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

See this question and the documentation. I ran into a subtle bug because I overlooked this. Turning it off also improves performance.
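A small illustration of the difference, using a made-up DataFrame: with the default sort=True the group keys come back in sorted order, while sort=False keeps them in order of first appearance.

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "b", "c", "a"], "val": [1, 2, 3, 4, 5]})

# Default behaviour: group keys are sorted
sorted_keys = list(df.groupby("key")["val"].sum().index)

# sort=False: group keys keep the order in which they first appear
appearance_keys = list(df.groupby("key", sort=False)["val"].sum().index)

print(sorted_keys)       # ['a', 'b', 'c']
print(appearance_keys)   # ['b', 'a', 'c']
```

Code that relies on the output order matching the input order needs sort=False.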

Two Notes on Weak Supervision: Snorkel and Skweak

2022-01-08 | ~ 2022-05-06 | Machine Learning

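Weak supervision aims to avoid expensive large-scale manual annotation by generating labelled data programmatically. It generally takes two steps: first, label the unlabelled data with noisy labelling rules from multiple sources (called labelling functions), which yields a label model; then feed the labelled data generated by the label model to a downstream model (the end model) for training. Ideally, the label model generalizes the labelling functions (resolving conflicts, smoothing labels), and the end model generalizes further. (Some labelled data is still needed as validation and test sets to evaluate the result.)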
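The first step can be sketched in plain Python. This is only an illustration: the labelling functions and example texts are invented, and a simple majority vote stands in for the label model, which is a strong simplification of what Snorkel or Skweak actually fit.

```python
from collections import Counter

ABSTAIN, NEG, POS = -1, 0, 1


# Hypothetical noisy labelling functions for a toy sentiment task
def lf_contains_great(text):
    return POS if "great" in text.lower() else ABSTAIN


def lf_contains_terrible(text):
    return NEG if "terrible" in text.lower() else ABSTAIN


def lf_exclamation(text):
    return POS if text.endswith("!") else ABSTAIN


LABELLING_FUNCTIONS = [lf_contains_great, lf_contains_terrible, lf_exclamation]


def majority_vote(text):
    """Stand-in for a label model: combine the non-abstaining votes."""
    votes = [lf(text) for lf in LABELLING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # unresolved conflict between sources
    return counts[0][0]


unlabelled = [
    "Great movie!",
    "Terrible plot.",
    "Terrible, but great acting!",
    "It was fine.",
]
weak_labels = [majority_vote(t) for t in unlabelled]

# Only examples with a label are passed on to train the end model
training_set = [(t, y) for t, y in zip(unlabelled, weak_labels) if y != ABSTAIN]
```

The third example shows conflict resolution: two sources vote positive and one negative, so the combined label is positive. The last example gets no label and is left out of the end model's training set.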

Notes on Mamoru Oshii's Critique of Ghibli

2022-01-03 | ~ | Anime

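Reference

Oshii, M. (2021). 并不想说坏话! 无人敢评的吉卜力功过 (李思园, Trans.). 四川文艺出版社.

Why is it so hard for criticism of Ghibli to make it into print? Because criticizing Ghibli gains you nothing, while praising Ghibli pays off. In other words, an "inner circle" of shared interests has formed: a community whose members know their common interests well, are versed in the rules of silence, and permit criticism only from within. A world made up of "a handful of professionals plus a majority of amateur enthusiasts" tends to form this kind of inner circle.

Although the expressive power of Ghibli's works is outstanding, as films most of them can only be described as full of flaws. Whether the audience was sufficiently entertained and how the work itself should be judged are two entirely different matters.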

A Brief Introduction to Bayesian Optimization

2022-01-01 | ~ 2022-02-05 | Machine Learning

First, I strongly recommend

Frazier, P. I. (2018). A tutorial on Bayesian optimization. arXiv Preprint arXiv:1807.02811.

It is very clearly written. This post only briefly introduces the most basic material.

Consider the optimization problem

\[\max_{x\in A} f(x),\]

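Bayesian optimization is designed for black-box derivative-free global optimization. "Black-box" means that the form and properties (convexity, etc.) of the function $f$ are unknown, and that we can only obtain the output $f(x)$ for a given input $x$. Derivative information is unavailable as well, and the goal is to find the global optimum.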
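One way such a loop can look, as a rough sketch under stated assumptions: a zero-mean Gaussian process with a fixed RBF kernel serves as the surrogate, and an upper confidence bound serves as the acquisition function, chosen here only because it is short to write. The objective black_box, the grid standing in for $A$, and all constants are made up for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    # Squared-exponential kernel between two 1-D arrays of points
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    # Posterior mean and standard deviation of a zero-mean GP surrogate
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oq = rbf_kernel(x_obs, x_query)
    solved = np.linalg.solve(k_oo, k_oq)
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_oq * solved, axis=0)  # prior variance k(x, x) is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))


def black_box(x):
    # Hypothetical objective: the loop only ever sees its outputs
    return -(x - 0.7) ** 2


grid = np.linspace(0.0, 1.0, 201)  # finite candidate set standing in for A
x_obs = np.array([0.0, 1.0])       # initial evaluations
y_obs = black_box(x_obs)

for _ in range(10):
    mean, std = gp_posterior(x_obs, y_obs, grid)
    ucb = mean + 2.0 * std             # optimistic estimate of f on the grid
    x_next = grid[np.argmax(ucb)]      # evaluate where the estimate is highest
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, black_box(x_next))

best_x = x_obs[np.argmax(y_obs)]
print(best_x, y_obs.max())
```

No gradient of black_box is used anywhere; each iteration spends one function evaluation at the point the surrogate considers most promising.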

A Brief Overview of Semi-Supervised Learning

2021-12-31 | ~ | Machine Learning

Reference

Zhu, X., & Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning, 3(1), 1-130.

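The book is fairly old; it introduces semi-supervised versions of machine learning algorithms for classification from the SVM era. Its distinguishing features are an emphasis on the assumptions that semi-supervised learning needs in order to work, and visualizations of synthetic data examples that violate those assumptions. This post covers only the general-purpose algorithms.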

SQL: A Quick Review and Exercise Set

2021-12-16 | ~ | Tech

Covers query syntax only, not the underlying mechanisms.

BERT Review

2021-12-15 | ~ | Machine Learning

A review of

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv Preprint arXiv:1810.04805.

Table-of-Contents Extraction from Documents

2021-12-14 | ~ 2022-10-19 | Machine Learning

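"Document structuring" is an ambiguous term with many possible meanings, but this post considers only table-of-contents extraction.

A structured document is composed of logical elements such as section headings at various levels and paragraphs. In HTML, for example, the logical structure includes tags such as <body>, <h1>, and <p>. The document structuring task is essentially equivalent to table-of-contents extraction, because once the headings are identified, what remains is paragraphs. Searchable keywords for this field include document structure recognition and document layout analysis. Why it matters: it makes information extraction easier and allows highly customized presentation, among other things.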
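For HTML, where the logical structure is already marked up, reading off the table of contents is straightforward. The sketch below uses only the standard library; the class name HeadingCollector and the sample page are invented for illustration. It shows the target output of the task, not a method for documents that lack such tags.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, title) pairs from <h1>..<h6> tags."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.toc = []
        self._level = None   # level of the heading currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while inside a heading
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._level is not None:
            self.toc.append((self._level, "".join(self._buffer).strip()))
            self._level = None


page = (
    "<body><h1>Intro</h1><p>Text</p>"
    "<h2>Scope</h2><p>More</p><h1>Method</h1></body>"
)
collector = HeadingCollector()
collector.feed(page)
toc = collector.toc

for level, title in toc:
    print("  " * (level - 1) + title)
```

Everything not captured as a heading is treated as paragraph content, which matches the observation that identifying the headings settles the rest of the structure.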

Transformer Review

2021-12-10 | ~ | Machine Learning

A review of

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 5998–6008.

[Semi-Supervised] Unsupervised Data Augmentation (UDA)

2021-12-06 | ~ | Machine Learning

From Google:

Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., & Le, Q. V. (2019). Unsupervised data augmentation for consistency training. arXiv Preprint arXiv:1904.12848. [Code] [Code for PyTorch]

Consistency training methods simply regularize model predictions to be invariant to small noise applied to either input examples or hidden states. This framework makes sense intuitively because a good model should be robust to any small change in an input example or hidden states.
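The idea in the quote can be sketched as a loss term that penalizes the divergence between predictions on an input and on a noisy copy of it. This is a toy illustration only: the linear softmax classifier, the Gaussian input noise, and all names below are assumptions, not the paper's model or augmentation.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over the last axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def consistency_loss(predict, x, x_noisy):
    """Mean KL divergence between predictions on x and on its noisy version."""
    p = predict(x)
    q = predict(x_noisy)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return kl.mean()


rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3))  # hypothetical toy linear classifier


def predict(x):
    return softmax(x @ weights)


x = rng.normal(size=(8, 4))                    # unlabelled inputs
x_noisy = x + 0.01 * rng.normal(size=x.shape)  # small noise on the inputs

loss_same = consistency_loss(predict, x, x)
loss_noisy = consistency_loss(predict, x, x_noisy)
print(loss_same, loss_noisy)
```

No labels appear in the computation, so the term can be evaluated on unlabelled data and added to an ordinary supervised loss on the labelled examples.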