Measure Zero

Spark in Brief

2022-04-12 | ~ | Tech

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast queries against data of any size. Simply put, Spark is a fast and general engine for large-scale data processing. "Fast" is relative to MapReduce: Spark runs in memory, which avoids the frequent disk I/O that MapReduce incurs when chaining jobs (one round of disk I/O after every map-reduce pair). "General" means it also supports SQL, stream processing, machine learning, graph processing, and other applications, which together form an ecosystem.
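To make the disk I/O point concrete, here is a toy sketch in plain Python. It does not use Spark or Hadoop; the function names and the JSON intermediate file are assumptions made purely for illustration. Two map-reduce stages are chained once through a file on disk (the MapReduce pattern) and once entirely in memory (the pattern Spark favors).

```python
import json
import os
import tempfile
from collections import Counter


def word_count(lines):
    """Stage 1 (one map + reduce pair): count occurrences of each word."""
    return Counter(word for line in lines for word in line.split())


def count_histogram(counts):
    """Stage 2 (another map + reduce pair): how many words occur n times."""
    return Counter(counts.values())


def chained_via_disk(lines, workdir):
    """MapReduce-like chaining: stage 1 output is written to disk, then read back."""
    path = os.path.join(workdir, "stage1.json")
    with open(path, "w") as f:
        json.dump(word_count(lines), f)
    with open(path) as f:
        stage1 = json.load(f)
    return count_histogram(stage1)


def chained_in_memory(lines):
    """Spark-like chaining: the intermediate result never leaves memory."""
    return count_histogram(word_count(lines))


lines = ["a b a", "b c", "a"]
with tempfile.TemporaryDirectory() as workdir:
    via_disk = chained_via_disk(lines, workdir)
in_memory = chained_in_memory(lines)
```

Both paths compute the same result; the only difference is the extra write and read between stages, which is the overhead that grows with every additional chained job.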

gRPC in Brief

2022-04-09 | ~ | Tech

Covers only the very basic general idea. gRPC is an RPC framework from Google that uses Protocol Buffers.

Game Commentary by Epifanov Dmitry, Online Renju Champion of the 1st "Mind Master" (智圣) Mind Sports Games, 2020

2022-04-03 | ~ | Games

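Epifanov Dmitry is a top player from Moscow, Russia. He took up renju in 1995 and began competing in 1998. Hao Tianyi (郝天一) once played him in a correspondence tournament. This is his own commentary on winning the overseas professional group 7-0 in the 1st gomoku competition of the "Sunyard Cup" (信雅达杯) 1st "Mind Master" (智圣) Mind Sports Games, and it is quite accessible. In translating, I cut some unimportant remarks and added a few notes in parentheses. See the tournament results here.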

Epifanov Dmitry. (2020, Oct 13). The 1st Mind Master Tournament comments. RenjuNews.

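On October 10-11 (2020), as part of the 8th China International Chess Expo (中国国际棋博会), vint.ee hosted an international tournament organized by China. It was the first experiment of its kind, and the result was good overall. There were about eighty participants. Although many top European renju players did not take part, the competition was still tense, because each round had only 30 minutes and there were only 7 rounds.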

I will discuss only one important game. Click here to view the animated game record.

The Assumptions Behind Summing Term Weights

2022-03-29 | ~ | Machine Learning

From the BM25 monograph

Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.

Its first part introduces the assumptions behind summing term weights. A "term" here can be a unit of any granularity.
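As a rough illustration of what summing term weights looks like in practice, the sketch below scores a document as the sum of one weight per query term, using a common form of the BM25 weight. The toy corpus, the function names, and the parameter values k1 = 1.2 and b = 0.75 are assumptions for illustration and are not taken from the post.

```python
import math
from collections import Counter


def bm25_term_weight(term, doc, corpus, k1=1.2, b=0.75):
    """Weight of a single term for one document (a list of tokens)."""
    n_docs = len(corpus)
    doc_freq = sum(1 for d in corpus if term in d)
    # Robertson/Sparck Jones style IDF (can be negative for very common terms).
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    tf = Counter(doc)[term]
    avg_len = sum(len(d) for d in corpus) / n_docs
    # Document-length normalization of the term frequency.
    norm = k1 * (1 - b + b * len(doc) / avg_len)
    return idf * tf * (k1 + 1) / (tf + norm)


def bm25_score(query, doc, corpus, **params):
    """Document score = sum of per-term weights over the query terms."""
    return sum(bm25_term_weight(t, doc, corpus, **params) for t in query)


# Toy corpus (made up): each document is a list of tokens.
corpus = [
    ["spark", "runs", "in", "memory"],
    ["bm25", "sums", "term", "weights"],
    ["term", "weights", "are", "additive"],
    ["renju", "game", "commentary"],
]

score_both = bm25_score(["bm25", "sums"], corpus[1], corpus)
score_parts = (bm25_score(["bm25"], corpus[1], corpus)
               + bm25_score(["sums"], corpus[1], corpus))
```

Because the score is a plain sum, each query term contributes independently of the others, and a term absent from the document contributes nothing.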

The Bangladesh Mask Trial: Effect Size Is Significantly More Important Than Statistical Significance

2022-03-27 | ~ | Statistics

The topic is simple, so the translation is kept simple too. This is a series of posts; I rearranged them somewhat and added subheadings.

First, where the story begins

Ben Recht. (2021, Sep 13). Effect size is significantly more important than statistical significance. argmin blog.

Paper overview

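A large cluster-randomized controlled trial in Bangladesh, designed to test the efficacy of mask wearing in reducing the spread of the coronavirus, has released its preliminary results, and covid experts are already excitedly abuzz. There has been plenty of eye-catching commentary on the report, and most people treat the study as evidence that masks should be worn. But after reading the 94-page report, I came to a different conclusion. I worry that, because of statistical ambiguity, nothing at all can be inferred from this report.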

Miscellaneous Gripes About Deep Learning

2022-03-25 | ~ 2022-08-06 | Machine Learning

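These are all well-worn issues. Beyond the gripes, the post also covers which things are worth doing. For the main content, see the links to the original sources.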

Basic Concepts of Probabilistic Graphical Models

2022-03-18 | ~ | Statistics

Covers only concepts and notation, not algorithms.

Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are better suited to expressing soft constraints between random variables. For the purposes of solving inference problems, it is often convenient to convert both directed and undirected graphs into a different representation called a factor graph.
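A minimal sketch of the conversion mentioned above, assuming a three-variable directed chain a -> b -> c with binary variables. The probability tables are made-up numbers and the names are hypothetical. Each conditional distribution of the directed graph becomes one factor, and the joint distribution is the product of all factors.

```python
from itertools import product

# Conditional probability tables for binary variables (made-up numbers).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key: (b, a)
p_c_given_b = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key: (c, b)

# Factor graph view: each factor lists the variables it touches
# together with a function of those variables.
factors = [
    (("a",), lambda a: p_a[a]),
    (("b", "a"), lambda b, a: p_b_given_a[(b, a)]),
    (("c", "b"), lambda c, b: p_c_given_b[(c, b)]),
]


def joint(assignment):
    """Product of all factors for a full assignment such as {'a': 0, 'b': 1, 'c': 0}."""
    value = 1.0
    for variables, fn in factors:
        value *= fn(*(assignment[v] for v in variables))
    return value


# Sum the joint over all 2^3 assignments.
total = sum(joint(dict(zip("abc", values))) for values in product([0, 1], repeat=3))
```

Since every factor here is a conditional distribution, the product is already normalized and the sum over all assignments is 1. Factors derived from an undirected graph would in general need an explicit normalization constant.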

Two Takes on Arrival Time Prediction: Uber and Meituan

2022-03-03 | ~ | Machine Learning

Uber: DeepETA (Transformer)

References

Xinyu Hu, Olcay Cirit, Tanmay Binaykiya, and Ramit Hora. (2022, Feb 10). DeepETA: How Uber Predicts Arrival Times Using Deep Learning. Uber Engineering.

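Predicting the ETA (estimated time of arrival) is obviously important to Uber (it is the service with the highest QPS). The traditional approach divides the road network into small segments, represents each segment as a weighted edge in a graph, and obtains the ETA by finding the shortest path in that graph. To account for other factors, machine learning is used to predict the residual, that is, the difference between the actual arrival time and the ETA produced by the traditional approach.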
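The two-step idea can be sketched as follows: a shortest-path search over a weighted road graph gives the routing ETA, and a learned model adds a correction on top. This is only an illustration under stated assumptions. The tiny graph, the function names, and the constant-valued `toy_residual_model` are hypothetical stand-ins, not Uber's routing engine or DeepETA.

```python
import heapq


def shortest_travel_time(graph, source, target):
    """Dijkstra over {node: [(neighbor, seconds), ...]}; returns total seconds."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        elapsed, node = heapq.heappop(queue)
        if node == target:
            return elapsed
        if elapsed > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, seconds in graph.get(node, []):
            candidate = elapsed + seconds
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")  # target unreachable


def predict_eta(graph, source, target, residual_model):
    """Final ETA = routing ETA + predicted residual."""
    routing_eta = shortest_travel_time(graph, source, target)
    return routing_eta + residual_model(routing_eta)


# Toy road network: edge weights are travel times in seconds (made up).
road_graph = {
    "A": [("B", 60.0), ("C", 200.0)],
    "B": [("C", 90.0)],
    "C": [],
}


def toy_residual_model(routing_eta):
    # Placeholder for a trained model: always predicts 12 extra seconds.
    return 12.0


routing_eta = shortest_travel_time(road_graph, "A", "C")
final_eta = predict_eta(road_graph, "A", "C", toy_residual_model)
```

A real residual model would take many more features than the routing ETA alone; the single-argument signature here is only meant to keep the sketch short.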

Previously they used an XGB ensemble. Eventually, we reached a point where increasing the dataset and model size using XGBoost became untenable. (Why not?) We decided to explore deep learning because of the relative ease of scaling to large datasets using data-parallel SGD. (Surely tree models have parallel algorithms too?)

There are three goals

Latency: on the order of milliseconds
Accuracy: MAE must beat XGB
Generality: usable across all of Uber's lines of business
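For the accuracy goal, MAE (mean absolute error) is the average absolute gap between predicted and actual arrival times. A minimal sketch of the comparison, with entirely made-up numbers and hypothetical variable names:

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired observations."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up arrival times in minutes, for illustration only.
actual_minutes = [10.0, 20.0, 30.0]
baseline_eta = [12.0, 17.0, 34.0]   # stands in for the XGB baseline
candidate_eta = [11.0, 19.0, 32.0]  # stands in for the new model

baseline_mae = mean_absolute_error(actual_minutes, baseline_eta)
candidate_mae = mean_absolute_error(actual_minutes, candidate_eta)
```

The new model meets the goal only if `candidate_mae` is lower than `baseline_mae` on the same held-out trips.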

SimCSE: simple contrastive sentence embedding

2022-02-25 | ~ 2022-07-12 | Machine Learning

References

Gao, T., Yao, X., & Chen, D. (2021). SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
Wang, T., & Isola, P. (2020, November). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning (pp. 9929-9939). PMLR.

Notes on 《下一代书店》 (The Next-Generation Bookstore)

2022-02-16 | ~ | Reading

赵慧. (2018). 下一代书店. 东方出版社.

What did the previous generation of bookstores look like?

Take Barnes & Noble as an example:

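Location: it occupies prime spots.
Rich inventory and interior displays: categorized shelves, dedicated spots for bestsellers and recommended books, merchandise for popular IPs, and so on.
Plenty of non-book products: its own e-book line, cultural and creative goods, and so on.
A big name (the largest bookstore chain in the US), an exclusive partnership with Starbucks, and frequent events.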

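The only problem is that fewer and fewer customers buy books here. Few people stand at the shelves or the checkout counter; the customers are all near the children's educational books and the Starbucks coffee.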