Measure Zero

Tensmeyer, C., Morariu, V. I., Price, B., Cohen, S., & Martinez, T. (2019, September). Deep splitting and merging for table structure decomposition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) (pp. 114-121). IEEE.

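From the Inference section on the Split Model in the paper above. The paper detects table structure in two steps: first it draws a number of vertical and horizontal lines that split the image into many cells, then it merges cells to obtain the table.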
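
To make the splitting step concrete, here is a minimal sketch of how a set of separator positions turns an image into a grid of cells. It is not the paper's model (which predicts the separators with a neural network); the function name `grid_cells` and the sample coordinates are hypothetical and only illustrate the geometry.

```python
def grid_cells(width, height, col_seps, row_seps):
    """Return cell boxes (left, top, right, bottom) in row-major order."""
    # Image borders act as the outermost separators.
    xs = [0, *sorted(col_seps), width]
    ys = [0, *sorted(row_seps), height]
    return [
        (left, top, right, bottom)
        for top, bottom in zip(ys, ys[1:])
        for left, right in zip(xs, xs[1:])
    ]


# One vertical line and two horizontal lines give a 3 x 2 grid.
cells = grid_cells(width=100, height=60, col_seps=[40], row_seps=[20, 45])
```

A merge step would then join neighbouring boxes that belong to the same spanning cell.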

Introduction to MLflow

2023-01-06 | ~ | Machine Learning

Managing deep learning experiments

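See the answers under this question. For every experiment, the main things to save are:

Code (Git commit history)
Data (paths), model
Hyperparameters, metrics
Logs

This keeps experiment results easy to find and makes experiments easy to reproduce. Many tools cover some of these needs, TensorBoard for example; writing your own is also an option.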
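
As a rough idea of what "writing your own" could look like, the sketch below stores one JSON record per run using only the standard library. The names `current_commit`, `save_run`, and the `run.json` file name are hypothetical, and log files are not handled here.

```python
import json
import subprocess
import time
from pathlib import Path


def current_commit():
    """Return the current Git commit hash, or None outside a Git repository."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def save_run(run_dir, data_path, hparams, metrics):
    """Write one experiment record as JSON and return its path."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "commit": current_commit(),    # code version
        "data_path": str(data_path),   # where the data lives
        "hparams": hparams,
        "metrics": metrics,
    }
    path = run_dir / "run.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

Calling `save_run("runs/exp1", "data/train.csv", {"lr": 1e-3}, {"acc": 0.9})` at the end of training would leave a small, searchable record next to the model checkpoint.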

Introducing MLflow

MLflow is an open-source platform for managing machine learning workflows, with Python, R, Java, and REST API interfaces. It was created in 2018 by the team behind Spark (who also founded Databricks) and has now reached version 2.1.

With origins in academia and the open source community, Databricks was founded in 2013 by the original creators of Apache Spark™, Delta Lake and MLflow.

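If the goal is only to manage experiments, each person can do that locally. MLflow adds centralized management, which helps with collaboration and with managing the model lifecycle. It consists of the following four components (the most important are tracking and model registry).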

Huggingface Transformers Trainer as a general PyTorch trainer

2023-01-04 | ~ | Machine Learning

Inspired by this post, we customize the Huggingface Transformers Trainer to serve as a general-purpose trainer.

Define the model as usual.

import torch.nn as nn

class Model(nn.Module):
    def forward(self, inputs):
        ...
        return logits

Custom loss function. Either compute the loss inside the model's forward (the Huggingface way), or subclass Trainer and override compute_loss.

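import torch.nn as nn
import transformers

class MyTrainer(transformers.Trainer):
    # **kwargs absorbs extra keyword arguments that newer transformers
    # versions may pass to compute_loss.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Remove labels so that only model inputs are passed to forward().
        labels = inputs.pop('labels')
        logits = model(**inputs)
        # loss_fct = nn.CrossEntropyLoss()
        # BCEWithLogitsLoss expects float labels with the same shape as logits.
        loss_fct = nn.BCEWithLogitsLoss()
        loss = loss_fct(logits, labels)
        # TODO: tested only with `return_outputs=False`
        return (loss, {'logits': logits}) if return_outputs else loss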
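
For reference, the quantity that `nn.BCEWithLogitsLoss` returns with its default mean reduction can be written out in plain Python. This is only an illustrative sketch of the formula, not PyTorch's implementation; the function name `bce_with_logits` is hypothetical.

```python
import math


def bce_with_logits(logits, labels):
    """Mean binary cross-entropy computed directly from raw logits."""
    total = 0.0
    for x, y in zip(logits, labels):
        # Numerically stable form of
        # -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


loss = bce_with_logits([0.0, 2.0], [1.0, 0.0])
```

Because the sigmoid is folded into the loss, the model should return raw logits rather than probabilities.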

Notes on Distributed Data

2023-01-01 | ~ | Tech

Kleppmann, M. (2017). Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems. O’Reilly Media, Inc.

There are various reasons why you might want to distribute a database across multiple machines:

Scalability
Fault tolerance/high availability
Latency

Replication

Replication means keeping a copy of the same data on multiple machines that are connected via a network. All of the difficulty in replication lies in handling changes to replicated data.

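Keep data geographically close to users (lower latency)
Keep the system working even when parts of it fail (higher availability)
Scale out the machines that serve read requests (higher read throughput)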

In this chapter we will assume that your dataset is so small that each machine can hold a copy of the entire dataset.
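
A toy sketch of why changes are the hard part: every write has to reach every copy. The single-leader layout and the class names `Replica` and `Leader` below are assumptions chosen for illustration; the excerpt above does not fix a particular replication scheme.

```python
class Replica:
    """One machine holding a full copy of the dataset."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Leader(Replica):
    """Accepts writes and forwards every change to its followers."""

    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def write(self, key, value):
        self.apply(key, value)
        for follower in self.followers:
            # Forwarding is instantaneous here; a real system must cope with
            # network delay, failed followers, and the ordering of changes.
            follower.apply(key, value)


followers = [Replica(), Replica()]
leader = Leader(followers)
leader.write("user:1", "Alice")
```

After the write, a read of `"user:1"` on any replica returns the same value, which is what lets reads be spread across machines.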

From GPT-1 to ChatGPT: an introduction

2022-12-24 | ~ | Machine Learning

For the overall timeline, see here.

GPT-1 to GPT-3

GPT-1

Our system works in two stages; first we train a transformer model on a very large amount of data in an unsupervised manner — using language modeling as a training signal — then we fine-tune this model on much smaller supervised datasets to help it solve specific tasks.

We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads).
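
"Masked self-attention" in a decoder-only model means that each position may attend only to itself and to earlier positions. The sketch below builds such a mask in plain Python; the helper name `causal_mask` is hypothetical, and the per-head width assumes the usual even split of the state dimension across heads.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


d_model, n_heads = 768, 12     # values from the quote above
head_dim = d_model // n_heads  # width of each head, assuming an even split

mask = causal_mask(4)
```

The first row of the mask allows only position 0, while the last row allows every position, which is what lets the model be trained to predict the next token.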

GPT stands for generative pre-training, that is, pre-training followed by fine-tuning. In chronological order, the models are GPT-1, BERT, GPT-2.

[Walkthrough] Install and set up Redis

2022-12-19 | ~ 2023-01-04 | Tech

Check system info.

cat /etc/os-release

Follow a walkthrough listed here. The steps below work on CentOS 7.