<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Posts on RenovZ&#39;s Notes</title>
		<link>/posts/</link>
		<description>Recent content in Posts on RenovZ&#39;s Notes</description>
		<generator>Hugo -- gohugo.io</generator>
		<language>en-us</language>
		<lastBuildDate>Mon, 30 Mar 2026 19:02:07 +0800</lastBuildDate>
		<atom:link href="/posts/index.xml" rel="self" type="application/rss+xml" />
		
		<item>
			<title>SQLite WAL Design</title>
			<link>/posts/sqlite-wal-design/</link>
			<pubDate>Mon, 30 Mar 2026 19:02:07 +0800</pubDate>
			
			<guid>/posts/sqlite-wal-design/</guid>
			<description></description>
			<content type="html"><![CDATA[<h1 id="sqlite-wal-write-ahead-logging-设计原理详解">SQLite WAL (Write-Ahead Logging): Design Explained</h1>
<h2 id="一源码文件位置">1. Source File Locations</h2>
<h3 id="核心实现文件">Core Implementation Files</h3>
<ul>
<li><strong><code>src/wal.c</code></strong> - main WAL implementation (roughly 4,700 lines)</li>
<li><strong><code>src/pager.c</code></strong> - WAL-related logic in the pager layer</li>
</ul>
<h3 id="配套文件">Supporting Files</h3>
<ul>
<li><strong><code>src/wal.h</code></strong> - WAL interface definitions</li>
<li><strong><code>src/os_unix.c</code></strong> / <strong><code>src/os_win.c</code></strong> - platform-specific WAL support (shared memory and locking)</li>
</ul>
<h3 id="源码注释位置">In-Source Documentation</h3>
<p>The opening comment of <code>src/wal.c</code> (roughly lines 1-250) documents the WAL file format in detail.</p>
<hr>
<h2 id="二wal-核心数据结构">2. Core WAL Data Structures</h2>
<h3 id="21-wal-header-32-字节">2.1 WAL Header (32 Bytes)</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">struct {
  u32 magic;           /* 0: 0x377f0682 (little-endian) or 0x377f0683 (big-endian) */
  u32 version;         /* 4: file format version, currently 3007000 */
  u32 pageSize;        /* 8: database page size, e.g. 1024 or 4096 */
  u32 ckptSeq;         /* 12: checkpoint sequence number */
  u32 salt1;           /* 16: salt-1, a random value incremented at each checkpoint */
  u32 salt2;           /* 20: salt-2, re-randomized at each checkpoint */
  u32 checksum1;       /* 24: header checksum, first half */
  u32 checksum2;       /* 28: header checksum, second half */
} WalHdr;              /* the on-disk WAL file header (not the in-shm WalIndexHdr) */
</code></pre></div><p><strong>Field notes:</strong></p>
<ul>
<li><strong>Magic number</strong>: <code>0x377f0682</code> (little-endian) or <code>0x377f0683</code> (big-endian)</li>
<li><strong>Frame count</strong>: the WAL may hold any number of frames, including zero</li>
<li><strong>Salt mechanism</strong>: prevents leftover frames from an earlier WAL generation from being mistaken for valid ones</li>
<li><strong>Checksum</strong>: a Fibonacci reverse-weighted checksum</li>
</ul>
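As a concrete illustration, the 32-byte header can be decoded field by field; the multi-byte header fields are stored big-endian on disk. This is a minimal sketch, not SQLite's own code: the `WalFileHdr` struct and `parse_wal_header` function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the eight 32-bit header fields. */
typedef struct {
  uint32_t magic, version, pageSize, ckptSeq;
  uint32_t salt1, salt2, checksum1, checksum2;
} WalFileHdr;

/* WAL header fields are stored big-endian on disk. */
static uint32_t be32(const uint8_t *p) {
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode a 32-byte WAL header; returns 0 on success, -1 on a bad magic. */
static int parse_wal_header(const uint8_t buf[32], WalFileHdr *h) {
  h->magic     = be32(buf + 0);
  h->version   = be32(buf + 4);
  h->pageSize  = be32(buf + 8);
  h->ckptSeq   = be32(buf + 12);
  h->salt1     = be32(buf + 16);
  h->salt2     = be32(buf + 20);
  h->checksum1 = be32(buf + 24);
  h->checksum2 = be32(buf + 28);
  if (h->magic != 0x377f0682u && h->magic != 0x377f0683u) return -1;
  return 0;
}
```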
<h3 id="22-wal-frame-24-字节-header--pagedata">2.2 WAL Frame (24-Byte Header + Page Data)</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">struct {
  u32 pageNum;         /* 0: page number */
  u32 dbSize;          /* 4: for commit frames, the database size in pages; 0 otherwise */
  u32 salt1;           /* 8: copied from the WAL header */
  u32 salt2;           /* 12: copied from the WAL header */
  u32 checksum1;       /* 16: frame checksum, first half */
  u32 checksum2;       /* 20: frame checksum, second half */
  u8 pageData[];       /* 24: page content, pageSize bytes */
} WalFrame;
</code></pre></div><p><strong>A frame is valid only if:</strong></p>
<ol>
<li>salt1 and salt2 match the values in the WAL header</li>
<li>checksum1 and checksum2 agree with the recomputed checksum</li>
</ol>
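Distilled into code, the two acceptance rules might look like the sketch below. The names are hypothetical, not the `wal.c` implementation; the caller is assumed to have recomputed the cumulative checksum (`calc1`, `calc2`) over the frame.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  uint32_t salt1, salt2;         /* copied from the WAL header when written */
  uint32_t checksum1, checksum2; /* cumulative checksum through this frame */
} FrameHdr;

/* A frame is accepted only if both rules hold: matching salts and a
   checksum that agrees with the recomputed value. */
static int frame_is_valid(const FrameHdr *f,
                          uint32_t hdrSalt1, uint32_t hdrSalt2,
                          uint32_t calc1, uint32_t calc2) {
  if (f->salt1 != hdrSalt1 || f->salt2 != hdrSalt2) return 0;  /* stale frame */
  if (f->checksum1 != calc1 || f->checksum2 != calc2) return 0; /* corrupt */
  return 1;
}
```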
<h3 id="23-wal-内部状态结构">2.3 The Wal Object (Internal State)</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">struct Wal {
  sqlite3_vfs *pVfs;           /* VFS module */
  sqlite3_file *pWalFd;        /* WAL file handle */
  sqlite3_file *pDbFd;         /* database file handle */
  int readLock;                /* which reader lock slot is held */
  int writeLock;               /* true if the write lock is held */
  int ckptLock;                /* true if the checkpoint lock is held */
  int exclusiveMode;           /* locking mode */
  WalIndexHdr hdr;             /* local copy of the wal-index header */
  u32 **apWiData;              /* mapped wal-index (shared-memory) regions */
  int syncHeader;              /* fsync the WAL header on write? */
  int padToSectorBoundary;     /* pad transactions to sector boundaries? */
  i64 mxWalSize;               /* size limit for the WAL file */
  u32 iCallback;               /* frame count reported to the commit callback */
  u32 iReCksum;                /* frame from which checksums must be recomputed */
};
</code></pre></div><h3 id="24-walckptinfo-检查点信息">2.4 WalCkptInfo (Checkpoint Info)</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">struct WalCkptInfo {
  u32 nBackfill;                  /* frames already copied into the database */
  u32 aReadMark[WAL_NREADER];     /* reader marks */
  u8 aLock[SQLITE_SHM_NLOCK];     /* lock bytes */
  u32 nBackfillAttempted;         /* frames a checkpoint has attempted to copy */
  u32 notUsed0;                   /* reserved */
};
</code></pre></div><hr>
<h2 id="三wal-工作机制详解">3. How WAL Works</h2>
<h3 id="31-写入流程">3.1 Write Path</h3>
<pre tabindex="0"><code>1. Open the WAL connection → sqlite3WalOpen()
2. Begin the write transaction → sqlite3WalBeginWriteTransaction()
   - Acquire WAL_WRITE_LOCK (single-writer lock)
   - Check that the cached WAL header is still current
3. Write page frames → sqlite3WalFrames()
   - Compute checksums
   - Append frames to the end of the WAL file
   - Update the frame mapping in the wal-index
4. Commit → fsync + write the commit frame
5. Release the write lock → sqlite3WalEndWriteTransaction()
</code></pre><p><strong>Core code sketch:</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">int sqlite3WalBeginWriteTransaction(Wal *pWal) {
  /* Single-writer lock: other writers wait or get SQLITE_BUSY */
  rc = walLockExclusive(pWal, WAL_WRITE_LOCK, 1);

  /* If another connection appended frames since our snapshot was
  ** taken, we must not write on top of a stale snapshot */
  if (memcmp(&amp;pWal-&gt;hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr)) != 0) {
    return SQLITE_BUSY_SNAPSHOT;
  }
  return SQLITE_OK;
}
</code></pre></div><h3 id="32-读取流程">3.2 Read Path</h3>
<pre tabindex="0"><code>1. Begin the read transaction → sqlite3WalBeginReadTransaction()
   - Acquire WAL_READ_LOCK(k)
   - Record the current mxFrame as the snapshot
2. Locate a page → sqlite3WalFindFrame()
   - Look up the frame via the wal-index
   - If a valid frame exists, read it from the WAL
   - Otherwise read the page from the database file
3. Read the frame content → sqlite3WalReadFrame()
4. End the transaction → release WAL_READ_LOCK(k)
</code></pre><p><strong>Snapshot reads:</strong></p>
<ul>
<li>A reader records mxFrame (the last valid frame in the WAL at that moment)</li>
<li>Subsequent reads ignore frames appended after mxFrame</li>
<li>This yields <strong>snapshot isolation</strong></li>
</ul>
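The snapshot rule can be modeled in a few lines. This is a toy model, not SQLite's implementation: the WAL is just an append-only array of (page, value) frames, and a reader's snapshot is the frame count captured when its transaction began.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of snapshot reads; names and layout invented for the demo. */
#define MAX_FRAMES 16
typedef struct { uint32_t pgno; int value; } Frame;

static Frame wal[MAX_FRAMES];
static int mxFrame = 0;                 /* frames currently in the WAL */

static void wal_append(uint32_t pgno, int value) {
  wal[mxFrame].pgno = pgno;
  wal[mxFrame].value = value;
  mxFrame++;
}

/* Read pgno as of a snapshot: scan newest-to-oldest but ignore every
   frame appended after the snapshot was taken. */
static int wal_read(uint32_t pgno, int snapshot, int dbValue) {
  for (int i = snapshot - 1; i >= 0; i--)
    if (wal[i].pgno == pgno) return wal[i].value;
  return dbValue;                       /* not in the WAL: use the db file */
}
```

A reader that captured its snapshot before a later commit keeps seeing the old value for the page, while a fresh snapshot sees the new one.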
<h3 id="33-checkpoint-流程">3.3 Checkpoint Flow</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">PRAGMA wal_checkpoint;            -- PASSIVE mode (the default)
PRAGMA wal_checkpoint(FULL);      -- FULL mode
PRAGMA wal_checkpoint(RESTART);   -- RESTART mode
PRAGMA wal_checkpoint(TRUNCATE);  -- TRUNCATE mode
</code></pre></div><p><strong>The four checkpoint modes:</strong></p>
<table>
<thead>
<tr>
<th>Mode</th>
<th>Behavior</th>
<th>Blocks readers?</th>
<th>Blocks writers?</th>
<th>WAL handling</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>PASSIVE</code></td>
<td>Passive checkpoint; never waits</td>
<td>❌</td>
<td>❌</td>
<td>Copies committed frames only, as far as readers allow</td>
</tr>
<tr>
<td><code>FULL</code></td>
<td>Complete checkpoint</td>
<td>❌</td>
<td>✅</td>
<td>Copies all frames; does not restart the WAL</td>
</tr>
<tr>
<td><code>RESTART</code></td>
<td>Restart the WAL</td>
<td>✅</td>
<td>✅</td>
<td>Copies all frames; the next writer restarts the WAL from the beginning</td>
</tr>
<tr>
<td><code>TRUNCATE</code></td>
<td>Truncate the WAL</td>
<td>✅</td>
<td>✅</td>
<td>Like RESTART, plus truncates the WAL file to zero bytes</td>
</tr>
</tbody>
</table>
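Why PASSIVE can avoid blocking falls out of the read marks: a checkpoint may only backfill frames up to the oldest snapshot still in use, since readers beyond that point still expect those pages to come from the WAL. A simplified model (invented helper; the real `wal.c` uses a `READMARK_NOT_USED` sentinel rather than 0 for an unused slot):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified: a checkpoint may copy frames into the database only up
   to the oldest read mark still in use.  Slot 0 means "the reader
   ignores the WAL entirely", so it never constrains backfill.  Here a
   mark of 0 in slots 1..n-1 stands for "slot unused" (an assumption
   of this sketch, not SQLite's actual encoding). */
static uint32_t backfill_limit(uint32_t mxFrame,
                               const uint32_t *aReadMark, int nReader) {
  uint32_t limit = mxFrame;
  for (int i = 1; i < nReader; i++)
    if (aReadMark[i] != 0 && aReadMark[i] < limit)
      limit = aReadMark[i];
  return limit;
}
```

With no active readers the whole WAL can be backfilled; one reader on an old snapshot caps the checkpoint at that reader's mark.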
<p><strong>Checkpoint execution steps:</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">int sqlite3WalCheckpoint(Wal *pWal, int eMode) {
  // 1. Acquire the checkpoint lock
  walLockExclusive(pWal, WAL_CKPT_LOCK, 1);

  // 2. FULL/RESTART/TRUNCATE modes also take the writer lock
  walBusyLock(pWal, ..., WAL_WRITE_LOCK, 1);

  // 3. Read the wal-index header
  walIndexReadHdr(pWal, &amp;isChanged);

  // 4. Copy WAL frames into the database file
  walCheckpoint(pWal, db, eMode2, ...);

  // 5. Release all locks
  walUnlockExclusive(pWal, WAL_CKPT_LOCK, 1);
}
</code></pre></div><hr>
<h2 id="四文件系统布局">4. Filesystem Layout</h2>
<h3 id="41-三文件设计">4.1 The Three-File Design</h3>
<pre tabindex="0"><code>Database directory layout:
.
├── mydb                     # main database file (header + pages)
├── mydb-wal                 # write-ahead log file
└── mydb-shm                 # shared-memory index file (wal-index)
</code></pre><p><strong>Roles:</strong></p>
<ul>
<li><code>mydb</code>: the database proper, holding the schema and user data</li>
<li><code>mydb-wal</code>: the write-ahead log, recording modified page frames</li>
<li><code>mydb-shm</code>: shared memory used to locate WAL frames quickly</li>
</ul>
<h3 id="42-共享内存-wal-index-结构">4.2 Shared-Memory (wal-index) Structure</h3>
<pre tabindex="0"><code>/* Part 1: header (136 bytes) */
- WalIndexHdr (x2) : stored twice for robustness
- WalCkptInfo      : checkpoint bookkeeping

/* Part 2: index blocks (4096 entries per block) */
- Page mapping: frame number → page number
- Hash table:  fast frame lookup
</code></pre><p><strong>Lookup algorithm:</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">int sqlite3WalFindFrame(Wal *pWal, Pgno pgNo, u32 *piFrame) {
  /* 1. Hash the page number to a slot */
  iKey = walHash(pgNo);

  /* 2. Scan the index blocks from newest to oldest; the most
  ** recently written frame for pgNo wins */
  for (each index block, newest first) {
    if (pgNo == pageMapping[i]) {
      *piFrame = i;
      return SQLITE_OK;
    }
  }
  return SQLITE_NOTFOUND;  /* not in the WAL: read the database file */
}
</code></pre></div><hr>
<h2 id="五并发控制机制">5. Concurrency Control</h2>
<h3 id="51-lock-类型-共享内存区域">5.1 Lock Types (in the Shared-Memory Region)</h3>
<pre tabindex="0"><code>Lock bytes begin at offset 120 of the wal-index (shm) region:

120: WAL_WRITE_LOCK      -- write lock (single writer)
121: WAL_CKPT_LOCK       -- checkpoint lock
122: WAL_RECOVER_LOCK    -- crash-recovery lock
123-127: WAL_READ_LOCK[0-4] -- reader locks (up to 5 concurrent read marks)
</code></pre><h3 id="52-锁协议">5.2 Lock Protocols</h3>
<p><strong>Reader protocol:</strong></p>
<pre tabindex="0"><code>1. Acquire WAL_READ_LOCK(k) (shared lock, non-blocking attempt)
2. Record mxFrame in aReadMark[k]
3. Read data (ignore frames appended after aReadMark[k])
4. Release WAL_READ_LOCK(k)
</code></pre><p><strong>Writer protocol:</strong></p>
<pre tabindex="0"><code>1. Acquire WAL_WRITE_LOCK (exclusive; only one writer at a time)
2. Verify the cached header still matches the shared wal-index header
   (otherwise return SQLITE_BUSY_SNAPSHOT); readers may keep running
3. Append frames to the WAL
4. Commit: fsync + write the commit frame, then release WAL_WRITE_LOCK
</code></pre><p><strong>Checkpoint protocol:</strong></p>
<pre tabindex="0"><code>1. Acquire WAL_CKPT_LOCK (exclusive)
2. FULL/RESTART/TRUNCATE: additionally acquire WAL_WRITE_LOCK
3. Backfill no further than the oldest active read mark
   (RESTART/TRUNCATE also wait for readers on old snapshots)
4. Copy WAL frames into the database file
5. When the WAL is reset, salt1 is incremented and salt2 re-randomized;
   release all locks
</code></pre><h3 id="53-reader-锁槽管理">5.3 Reader Lock-Slot Management</h3>
<pre tabindex="0"><code>#define WAL_NREADER  5  /* at most 5 reader lock slots */

/* Slot allocation */
1. A new reader prefers the usable slot with the largest aReadMark
2. If every slot is taken, it reuses or waits on an existing mark
3. aReadMark[0] is special: always 0, meaning "ignore the WAL and
   read the database file directly"
4. aReadMark[1-4] are ordinary slots holding an mxFrame snapshot
</code></pre><hr>
<h2 id="六checksum-算法-亮点">6. The Checksum Algorithm (a Highlight)</h2>
<h3 id="61-算法特点">6.1 Algorithm Characteristics</h3>
<p><strong>A Fibonacci reverse-weighted checksum.</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">/*
** Paraphrase of walChecksumBytes(): two accumulators, two 32-bit words
** per iteration, each feeding the other.  The magic number determines
** whether the words are read big- or little-endian.
*/
void walChecksum(u32 aCksum[2], const u8 *a, int nByte) {
  u32 s1 = aCksum[0], s2 = aCksum[1];   /* running checksum, seeded by caller */
  const u32 *aData = (const u32 *)a;
  const u32 *aEnd = (const u32 *)&amp;a[nByte];

  do {
    s1 += *aData++ + s2;
    s2 += *aData++ + s1;
  } while (aData &lt; aEnd);

  aCksum[0] = s1;
  aCksum[1] = s2;
}
</code></pre></div><p><strong>Weighting (reverse Fibonacci):</strong></p>
<ul>
<li>x[0] 权重：F(n-1)</li>
<li>x[1] 权重：F(n-2)</li>
<li>x[n-1] 权重：F(1) = 1</li>
</ul>
<p><strong>字节序选择：</strong></p>
<ul>
<li><code>0x377f0683</code>：校验和按大端序计算</li>
<li><code>0x377f0682</code>：校验和按小端序计算</li>
</ul>
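<p>上述 "s0 += x + s1; s1 += s0" 的累加方式可以用下面的 C 片段独立验证（<code>wal_checksum</code> 为示意用名；真实的 <code>walChecksumBytes()</code> 按两个 32 位字一组处理，这里按单字简化，仅演示权重累积）：</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 简化的 WAL 累加校验：每个 32 位字按 Fibonacci 式权重累入 s0/s1 */
static void wal_checksum(const uint32_t *x, size_t n, uint32_t cksum[2]) {
  uint32_t s0 = 0, s1 = 0;
  for (size_t i = 0; i < n; i++) {
    s0 += x[i] + s1;   /* 越早出现的字，被后续迭代放大的次数越多 */
    s1 += s0;
  }
  cksum[0] = s0;
  cksum[1] = s1;
}
```

<p>例如输入 {1, 2} 时，两步迭代后得到 s0 = 4、s1 = 5，可见越晚写入的字权重越低。</p>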
<h3 id="62-验证流程">6.2 验证流程</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// 帧有效验证
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">if</span> <span class="p">(</span><span class="n">frame</span><span class="p">.</span><span class="n">salt1</span> <span class="o">!=</span> <span class="n">header</span><span class="p">.</span><span class="n">salt1</span> <span class="o">||</span> <span class="n">frame</span><span class="p">.</span><span class="n">salt2</span> <span class="o">!=</span> <span class="n">header</span><span class="p">.</span><span class="n">salt2</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="k">return</span> <span class="n">INVALID_FRAME</span><span class="p">;</span>  <span class="cm">/* 盐值不匹配，废弃 */</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="p">(</span><span class="n">calculatedChecksum</span> <span class="o">!=</span> <span class="n">frame</span><span class="p">.</span><span class="n">checksum</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="k">return</span> <span class="n">CORRUPT_FRAME</span><span class="p">;</span>  <span class="cm">/* 校验和失败，数据损坏 */</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">return</span> <span class="n">VALID_FRAME</span><span class="p">;</span>
</span></span></code></pre></div><hr>
<h2 id="七wal-设计优势与劣势">七、WAL 设计优势与劣势</h2>
<h3 id="71-优势">7.1 优势</h3>
<table>
<thead>
<tr>
<th>优势</th>
<th>说明</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>高并发读取</strong></td>
<td>Reader 不阻塞 Writer，通过快照实现隔离</td>
</tr>
<tr>
<td><strong>写入高效</strong></td>
<td>提交只需顺序追加写 WAL；写者之间仍然串行，但通过 shm 锁轻量协调，开销低</td>
</tr>
<tr>
<td><strong>Atomic Commit</strong></td>
<td>Commit Marker + FSync 保证原子性</td>
</tr>
<tr>
<td><strong>Crash Recovery 快速</strong></td>
<td>重放有效帧，无需回滚</td>
</tr>
<tr>
<td><strong>空间效率</strong></td>
<td>WAL 复用，不线性增长</td>
</tr>
<tr>
<td><strong>Cost-effective</strong></td>
<td>无需复杂锁定，仅需 shm 锁</td>
</tr>
</tbody>
</table>
<h3 id="72-劣势">7.2 劣势</h3>
<table>
<thead>
<tr>
<th>劣势</th>
<th>说明</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>需要共享内存</strong></td>
<td>不支持网络文件系统</td>
</tr>
<tr>
<td><strong>Checkpoint 开销大</strong></td>
<td>大量帧刷回需要 FSync</td>
</tr>
<tr>
<td><strong>内存占用</strong></td>
<td>需要维护 wal-index 映射表</td>
</tr>
<tr>
<td><strong>复杂性增加</strong></td>
<td>双文件 + 共享内存，调试困难</td>
</tr>
</tbody>
</table>
<h3 id="73-适用场景">7.3 适用场景</h3>
<p><strong>推荐 WAL 模式：</strong></p>
<ul>
<li>高并发读写场景</li>
<li>读多写少场景</li>
<li>需要快照隔离 (Snapshot Isolation)</li>
<li>需要频繁查询</li>
</ul>
<p><strong>推荐使用传统 journal 模式：</strong></p>
<ul>
<li>只读数据库 (无并发写入)</li>
<li>单用户环境</li>
<li>极端内存受限</li>
</ul>
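<p>两种模式之间可以用标准 PRAGMA 切换：</p>

```sql
-- 切换到 WAL 模式 (设置持久化在数据库文件中，对后续所有连接生效)
PRAGMA journal_mode = WAL;

-- 切回默认的 rollback journal 模式
PRAGMA journal_mode = DELETE;
```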
<hr>
<h2 id="八常见问题-qa">八、常见问题 (Q&amp;A)</h2>
<h3 id="q1-wal-与传统-rollback-journal-的区别">Q1: WAL 与传统 rollback journal 的区别？</h3>
<p><strong>A:</strong></p>
<table>
<thead>
<tr>
<th>维度</th>
<th>WAL 模式</th>
<th>Rollback Journal 模式</th>
</tr>
</thead>
<tbody>
<tr>
<td>写入方式</td>
<td>追加写 (Append-only)</td>
<td>先写 journal 再写主文件</td>
</tr>
<tr>
<td>并发能力</td>
<td>读写可并行</td>
<td>串行化</td>
</tr>
<tr>
<td>崩溃恢复</td>
<td>重放有效帧</td>
<td>回滚 journal 中的修改</td>
</tr>
<tr>
<td>文件大小</td>
<td>随写入增长，checkpoint 后空间可重用</td>
<td>每个事务创建 journal，提交后删除或截断</td>
</tr>
<tr>
<td>复杂度</td>
<td>需要 wal-index 索引</td>
<td>相对简单</td>
</tr>
</tbody>
</table>
<h3 id="q2-wal-如何保证-crash-recovery-后的数据一致性">Q2: WAL 如何保证 crash recovery 后的数据一致性？</h3>
<p><strong>A:</strong> 三层保证机制：</p>
<ol>
<li>
<p><strong>Salt-1/2 匹配验证</strong></p>
<ul>
<li>每次 checkpoint 重置 WAL 时更新 salt 值</li>
<li>防止旧帧被误认为是新帧</li>
</ul>
</li>
<li>
<p><strong>Checksum 验证</strong></p>
<ul>
<li>使用 Fibonacci 反向权重算法</li>
<li>验证帧的完整性和顺序</li>
</ul>
</li>
<li>
<p><strong>Commit Marker 标识</strong></p>
<ul>
<li>提交帧的帧头中记录提交后的数据库页数 (该字段非零即为 Commit Marker)</li>
<li>标识事务边界，确保原子性</li>
</ul>
</li>
</ol>
<p>恢复流程：</p>
<ol>
<li>读取 WAL Header 和 Frame</li>
<li>验证 checksum + salt 一致性</li>
<li>重放有效帧到数据库</li>
<li>丢弃最后一个提交帧之后的帧 (未完成事务自然失效，无需回滚)</li>
</ol>
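<p>恢复时对帧的三重判定可以用下面的简化 C 代码示意（<code>Frame</code> 结构与 <code>recover_mx_frame</code> 均为自拟的演示代码，非 <code>wal.c</code> 实际接口）：</p>

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  uint32_t salt1, salt2; /* 帧头中复制的 salt 值 */
  int cksum_ok;          /* 累积校验和是否匹配 (示意用布尔) */
  uint32_t nTruncate;    /* 非 0 表示提交帧，记录提交后的数据库页数 */
} Frame;

/* 顺序扫描帧，返回最后一个有效提交帧的序号 (1-based)；
   0 表示 WAL 中没有可重放的已提交事务 */
static uint32_t recover_mx_frame(const Frame *a, uint32_t n,
                                 uint32_t salt1, uint32_t salt2) {
  uint32_t mxFrame = 0;
  for (uint32_t i = 0; i < n; i++) {
    if (a[i].salt1 != salt1 || a[i].salt2 != salt2) break; /* 旧帧，停止 */
    if (!a[i].cksum_ok) break;                /* 校验链断裂，停止 */
    if (a[i].nTruncate != 0) mxFrame = i + 1; /* 记录提交点 */
  }
  return mxFrame; /* 提交点之后的帧直接忽略，无需回滚 */
}
```

<p>崩溃时写了一半的残帧会因校验失败或缺少提交标记而被自然丢弃。</p>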
<h3 id="q3-checkpoint-何时触发">Q3: Checkpoint 何时触发？</h3>
<p><strong>A:</strong> 3 种触发方式：</p>
<ol>
<li>
<p><strong>显式调用</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="n">PRAGMA</span><span class="w"> </span><span class="n">wal_checkpoint</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div></li>
<li>
<p><strong>自动触发</strong> (每 1000 帧)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="n">PRAGMA</span><span class="w"> </span><span class="n">wal_autocheckpoint</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 默认值
</span></span></span></code></pre></div></li>
<li>
<p><strong>客户端断开时触发</strong></p>
</li>
</ol>
<p><strong>触发条件：</strong></p>
<ol>
<li>显式 <code>PRAGMA wal_checkpoint</code> 调用</li>
<li>WAL 帧数达到 <code>wal_autocheckpoint</code> 阈值 (默认 1000 帧)</li>
<li>最后一个客户端断开连接时</li>
</ol>
<h3 id="q4-wal-的内存锁机制如何工作">Q4: WAL 的内存锁机制如何工作？</h3>
<p><strong>A:</strong></p>
<p><strong>共享内存文件 (.db-shm) 布局</strong></p>
<pre tabindex="0"><code>Offset 0-119: 未使用
Offset 120:   WAL_WRITE_LOCK (写入锁)
Offset 121:   WAL_CKPT_LOCK    (Checkpoint 锁)
Offset 122:   WAL_RECOVER_LOCK (崩溃恢复锁)
Offset 123-127: WAL_READ_LOCK[5] (读锁槽)
</code></pre><p><strong>锁实现方式 (平台相关)：</strong></p>
<ul>
<li><strong>Unix</strong>: 通过 <code>fcntl</code> 字节范围锁 (POSIX advisory lock) 施加在 shm 文件上实现</li>
<li><strong>Windows</strong>: 通过 <code>LockFileEx</code> / <code>UnlockFileEx</code> 实现</li>
<li><strong>跨平台</strong>: SQLite 在 <code>os_unix.c</code> / <code>os_win.c</code> 中实现</li>
</ul>
<p><strong>关键点：</strong></p>
<ul>
<li>使用共享内存实现进程间协调</li>
<li>每个锁槽 1 字节，简化锁获取逻辑</li>
<li>读锁槽共 5 个，但每个槽可被多个 reader 共享，并发 reader 数并不受 5 的限制</li>
</ul>
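<p>读锁槽的分配可以简化成如下单进程模拟（<code>aReadMark</code> 数组与槽位复用是真实机制，代码为自拟示意，省略了 shm 锁的获取细节）：</p>

```c
#include <assert.h>
#include <stdint.h>

#define NREADER 5
#define READMARK_NOT_USED 0xffffffffu

/* 为 reader 选槽：优先共享值恰为快照 mxFrame 的槽，
   否则占用一个空闲槽；返回槽号，-1 表示 5 个槽都被其他快照占用 */
static int acquire_read_slot(uint32_t aReadMark[NREADER], uint32_t mxFrame) {
  for (int i = 0; i < NREADER; i++)
    if (aReadMark[i] == mxFrame) return i; /* 与同快照的 reader 共享 */
  for (int i = 0; i < NREADER; i++)
    if (aReadMark[i] == READMARK_NOT_USED) {
      aReadMark[i] = mxFrame; /* 声明新快照 */
      return i;
    }
  return -1;
}
```

<p>同一槽可被任意多个 reader 共享，因此 5 个槽限制的是并存的<em>快照</em>数量，而不是 reader 数量。</p>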
<h3 id="q5-什么是-wal-的-snapshot-isolation">Q5: 什么是 WAL 的 Snapshot Isolation?</h3>
<p><strong>A:</strong> 通过 <code>aReadMark</code> 实现快照隔离：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Reader 获取快照
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">WalIndexHdr</span> <span class="n">hdr</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="nf">memcpy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">hdr</span><span class="p">,</span> <span class="nf">walIndexHdr</span><span class="p">(</span><span class="n">pWal</span><span class="p">),</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">WalIndexHdr</span><span class="p">));</span>
</span></span><span class="line"><span class="cl"><span class="n">mxFrame</span> <span class="o">=</span> <span class="n">hdr</span><span class="p">.</span><span class="n">mxFrame</span><span class="p">;</span>  <span class="cm">/* 记录当前最大帧 */</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Reader 读取页面时
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">int</span> <span class="n">frameNum</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="p">(</span><span class="nf">walFindFrame</span><span class="p">(</span><span class="n">pWal</span><span class="p">,</span> <span class="n">pgNo</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">frameNum</span><span class="p">,</span> <span class="n">mxFrame</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="cm">/* 从 WAL 读取帧 (不超过 mxFrame) */</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="cm">/* 从数据库文件读取 */</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// 保证一致性
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="cm">/* 只要 reader 坚持使用原始 mxFrame，就不会看到新提交的数据 */</span>
</span></span></code></pre></div><p><strong>Snapshot 隔离保证：</strong></p>
<ul>
<li>每个 reader 看到一致的数据库视图</li>
<li>不会看到其他事务的未提交数据</li>
<li>同一时间点的所有读取看到相同数据</li>
</ul>
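<p>"只在不超过 mxFrame 的范围内查找最新帧"这一规则，可以用一个最小模拟验证（数组布局为自拟示意）：</p>

```c
#include <assert.h>
#include <stdint.h>

/* aPgno[i] 是第 i+1 帧写入的页号；在快照 mxFrame 范围内
   反向查找页 pgno 的最新版本，返回帧号 (1-based)，0 表示去读数据库文件 */
static uint32_t find_frame(const uint32_t *aPgno, uint32_t mxFrame,
                           uint32_t pgno) {
  for (uint32_t i = mxFrame; i > 0; i--)
    if (aPgno[i - 1] == pgno) return i;
  return 0;
}
```

<p>例如帧 1、3 都写了页 5：持有 mxFrame = 2 快照的 reader 始终读到帧 1 的旧版本，即使写者随后追加了帧 3。</p>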
<h3 id="q6-为什么需要-salt1salt2-机制">Q6: 为什么需要 salt1/salt2 机制？</h3>
<p><strong>A:</strong> 3 个原因：</p>
<ol>
<li><strong>防止 WAL 复用时的混淆</strong> - 防止旧帧被误认为是新帧</li>
<li><strong>防止 Checkpoint 中的竞争</strong> - 确保 new 和 old 帧不会混合</li>
<li><strong>增强数据完整性</strong> - 配合 checksum 确保数据一致性</li>
</ol>
<p><strong>Salt 变化规则：</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="n">salt1</span><span class="o">++</span>   <span class="cm">/* 递增，防止旧帧被复用 */</span>
</span></span><span class="line"><span class="cl"><span class="n">salt2</span> <span class="o">=</span> <span class="nf">random</span><span class="p">()</span>  <span class="cm">/* 随机化，防止旧帧被误认 */</span>
</span></span></code></pre></div><p><strong>验证检查：</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">if</span> <span class="p">(</span><span class="n">frame</span><span class="p">.</span><span class="n">salt1</span> <span class="o">!=</span> <span class="n">hdr</span><span class="p">.</span><span class="n">salt1</span><span class="p">)</span> <span class="k">return</span> <span class="n">INVALID</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="p">(</span><span class="n">frame</span><span class="p">.</span><span class="n">salt2</span> <span class="o">!=</span> <span class="n">hdr</span><span class="p">.</span><span class="n">salt2</span><span class="p">)</span> <span class="k">return</span> <span class="n">INVALID</span><span class="p">;</span>
</span></span></code></pre></div><h3 id="q7-为什么-wal-不会无限增长">Q7: 为什么 WAL 不会无限增长？</h3>
<p><strong>A:</strong> 通过 <strong>Checkpoint</strong> 机制：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cm">/* WAL 生命周期 */</span>
</span></span><span class="line"><span class="cl"><span class="cm">/* 1. 事务提交时把帧追加写入 WAL (文件增长) */</span>
</span></span><span class="line"><span class="cl"><span class="cm">/* 2. Checkpoint 把帧刷回数据库文件 */</span>
</span></span><span class="line"><span class="cl"><span class="cm">/* 3. 重置 mxFrame = 0, 后续写入复用 WAL 空间 */</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cm">/* 触发条件 */</span>
</span></span><span class="line"><span class="cl"><span class="cm">/* - 自动: 帧数达到 wal_autocheckpoint 阈值 (默认 1000) */</span>
</span></span><span class="line"><span class="cl"><span class="cm">/* - 手动: PRAGMA wal_checkpoint */</span>
</span></span><span class="line"><span class="cl"><span class="cm">/* - 清理: 最后一个数据库连接关闭时 */</span>
</span></span></code></pre></div><h3 id="q8-wal-中的-hash-表如何工作">Q8: WAL 中的 Hash 表如何工作？</h3>
<p><strong>A:</strong> 快速定位帧的 <strong>Page Mapping</strong> 机制：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cm">/* WalIndexHashLoc (哈希位置) */</span>
</span></span><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">WalHashLoc</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="n">u32</span> <span class="n">iBucket</span><span class="p">;</span>            <span class="cm">/* 哈希桶 */</span>
</span></span><span class="line"><span class="cl">  <span class="n">u32</span> <span class="n">iPage</span><span class="p">;</span>              <span class="cm">/* 页面号 */</span>
</span></span><span class="line"><span class="cl">  <span class="n">u32</span> <span class="n">iFrame</span><span class="p">;</span>             <span class="cm">/* 帧位置 */</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cm">/* 查找算法 */</span>
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">sqlite3WalFindFrame</span><span class="p">(</span><span class="n">Wal</span> <span class="o">*</span><span class="n">pWal</span><span class="p">,</span> <span class="n">Pgno</span> <span class="n">pgNo</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">piFrame</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="cm">/* 1. 计算哈希桶 */</span>
</span></span><span class="line"><span class="cl">  <span class="n">hash</span> <span class="o">=</span> <span class="nf">walHash</span><span class="p">(</span><span class="n">pgNo</span><span class="p">,</span> <span class="n">iChange</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">  
</span></span><span class="line"><span class="cl">  <span class="cm">/* 2. 查找 frame 位置 */</span>
</span></span><span class="line"><span class="cl">  <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">HASH_NPAGE</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">pWal</span><span class="o">-&gt;</span><span class="n">aPageMapping</span><span class="p">[</span><span class="n">hash</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">pgNo</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="o">*</span><span class="n">piFrame</span> <span class="o">=</span> <span class="n">hash</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">      <span class="k">return</span> <span class="n">SQLITE_OK</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="k">return</span> <span class="n">SQLITE_NOTFOUND</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><strong>关键特性：</strong></p>
<ul>
<li>每个哈希表有 <strong>8192 个槽</strong> (HASHTABLE_NSLOT)，覆盖最多 <strong>4096 帧</strong> (HASHTABLE_NPAGE)</li>
<li>WAL 超过 4096 帧后追加新的哈希表，多个哈希表依次串联</li>
<li><strong>线性探测</strong>解决碰撞，装载因子不超过 50%，碰撞率较低</li>
</ul>
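<p>按上述参数，哈希表的插入与查找可以示意如下（HASHTABLE_NPAGE、NSLOT 与乘数 383 取自 <code>wal.c</code>，其余为简化的自拟实现）：</p>

```c
#include <assert.h>
#include <stdint.h>

#define HASHTABLE_NPAGE 4096                  /* 每个哈希表覆盖的帧数 */
#define HASHTABLE_NSLOT (HASHTABLE_NPAGE * 2) /* 8192 槽，装载因子 <= 50% */

typedef struct {
  uint16_t aHash[HASHTABLE_NSLOT]; /* 槽 -> 帧号 (1-based)，0 为空槽 */
  uint32_t aPgno[HASHTABLE_NPAGE]; /* 帧号-1 -> 页号 */
} HashTable;

static uint32_t wal_hash(uint32_t pgno) {
  return (pgno * 383u) & (HASHTABLE_NSLOT - 1);
}

/* 插入：线性探测找第一个空槽 */
static void ht_insert(HashTable *h, uint32_t pgno, uint16_t frame) {
  uint32_t i = wal_hash(pgno);
  while (h->aHash[i] != 0) i = (i + 1) & (HASHTABLE_NSLOT - 1);
  h->aHash[i] = frame;
  h->aPgno[frame - 1] = pgno;
}

/* 查找：沿探测链扫描，返回该页帧号最大 (最新) 的版本，0 表示未命中 */
static uint16_t ht_find(const HashTable *h, uint32_t pgno) {
  uint16_t best = 0;
  uint32_t i = wal_hash(pgno);
  while (h->aHash[i] != 0) {
    uint16_t f = h->aHash[i];
    if (h->aPgno[f - 1] == pgno && f > best) best = f;
    i = (i + 1) & (HASHTABLE_NSLOT - 1);
  }
  return best;
}
```

<p>同一页的多个版本都落在同一探测链上，取帧号最大者即得到最新版本。</p>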
<hr>
<h2 id="九源码关键函数列表">九、源码关键函数列表</h2>
<h3 id="91-生命周期管理">9.1 生命周期管理</h3>
<table>
<thead>
<tr>
<th>函数</th>
<th>说明</th>
<th>行号</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>sqlite3WalOpen()</code></td>
<td>打开 WAL 文件</td>
<td>1641</td>
</tr>
<tr>
<td><code>sqlite3WalClose()</code></td>
<td>关闭 WAL 连接</td>
<td>2501</td>
</tr>
<tr>
<td><code>sqlite3WalLimit()</code></td>
<td>设置 WAL 最大大小</td>
<td>1710</td>
</tr>
</tbody>
</table>
<h3 id="92-事务处理">9.2 事务处理</h3>
<table>
<thead>
<tr>
<th>函数</th>
<th>说明</th>
<th>行号</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>sqlite3WalBeginWriteTransaction()</code></td>
<td>开始写入事务</td>
<td>3690</td>
</tr>
<tr>
<td><code>sqlite3WalEndWriteTransaction()</code></td>
<td>结束写入事务</td>
<td>3743</td>
</tr>
<tr>
<td><code>sqlite3WalUndo()</code></td>
<td>事务回滚</td>
<td>3765</td>
</tr>
<tr>
<td><code>sqlite3WalSavepoint()</code></td>
<td>保存点</td>
<td>3830</td>
</tr>
<tr>
<td><code>sqlite3WalSavepointUndo()</code></td>
<td>回滚到保存点</td>
<td>3863</td>
</tr>
</tbody>
</table>
<h3 id="93-读写操作">9.3 读写操作</h3>
<table>
<thead>
<tr>
<th>函数</th>
<th>说明</th>
<th>行号</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>sqlite3WalBeginReadTransaction()</code></td>
<td>开始读事务</td>
<td>3487</td>
</tr>
<tr>
<td><code>sqlite3WalFindFrame()</code></td>
<td>查找帧位置</td>
<td>3649</td>
</tr>
<tr>
<td><code>sqlite3WalReadFrame()</code></td>
<td>读取帧数据</td>
<td>3667</td>
</tr>
<tr>
<td><code>sqlite3WalFrames()</code></td>
<td>写入帧到 WAL</td>
<td>4266</td>
</tr>
</tbody>
</table>
<h3 id="94-checkpoint-操作">9.4 Checkpoint 操作</h3>
<table>
<thead>
<tr>
<th>函数</th>
<th>说明</th>
<th>行号</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>sqlite3WalCheckpoint()</code></td>
<td>Checkpoint 处理</td>
<td>4292</td>
</tr>
<tr>
<td><code>sqlite3WalCallback()</code></td>
<td>获取回调帧数</td>
<td>4430</td>
</tr>
<tr>
<td><code>sqlite3WalExclusiveMode()</code></td>
<td>设置独占模式</td>
<td>4463</td>
</tr>
</tbody>
</table>
<hr>
<h2 id="十性能优化建议">十、性能优化建议</h2>
<h3 id="101-参数调优">10.1 参数调优</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- 1. 调整自动 checkpoint 阈值
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">wal_autocheckpoint</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">10000</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 默认 1000，可根据场景调整
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c1">-- 2. 设置 checkpoint 后保留的 WAL 文件大小上限 (单位: 字节)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">journal_size_limit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1048576</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 1MB
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c1">-- 3. 调整 checkpoint 模式
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">wal_checkpoint</span><span class="p">(</span><span class="k">TRUNCATE</span><span class="p">);</span><span class="w">  </span><span class="c1">-- 最积极，立即截断
</span></span></span></code></pre></div><h3 id="102-设计考量">10.2 设计考量</h3>
<p><strong>高频写入场景：</strong></p>
<ul>
<li>调大 wal_autocheckpoint 减少 Checkpoint 频率</li>
<li>使用 RESTART 模式 checkpoint</li>
<li>调大 journal_size_limit 避免频繁截断</li>
</ul>
<p><strong>高并发读场景：</strong></p>
<ul>
<li>保持默认 autocheckpoint (1000 帧)</li>
<li>共享内存 wal-index 天然支持多 reader 并发，无需额外配置</li>
<li>避免使用 TRUNCATE 模式</li>
</ul>
<p><strong>内存受限场景：</strong></p>
<ul>
<li>使用较小的 page_size (如 2KB)</li>
<li>调低 wal_autocheckpoint 阈值，控制 WAL 与 wal-index 映射的大小</li>
<li>启用 exclusive locking mode，此时可用堆内存代替共享内存文件</li>
</ul>
<hr>
<h2 id="十一wal-核心设计思想">十一、WAL 核心设计思想</h2>
<h3 id="111-设计哲学">11.1 设计哲学</h3>
<ol>
<li><strong>Write-Ahead</strong>: 事务提交前先写日志 (原子性保障)</li>
<li><strong>Copy-on-Write</strong>: 修改页以新帧追加到 WAL，原数据库页保持不变 (多版本实现)</li>
<li><strong>Lightweight</strong>: 无需复杂锁定，通过 shm 文件协调</li>
<li><strong>Efficient Checkpoint</strong>: 批量刷回，避免随机写开销</li>
</ol>
<h3 id="112-关键创新">11.2 关键创新</h3>
<table>
<thead>
<tr>
<th>创新点</th>
<th>作用</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Shadow Page</strong></td>
<td>WAL 中不覆盖原数据库，仅追加新帧</td>
</tr>
<tr>
<td><strong>Salt Mechanism</strong></td>
<td>防止旧帧被误认，确保数据一致性</td>
</tr>
<tr>
<td><strong>Fibonacci Checksum</strong></td>
<td>快速检测数据损坏，权重递减算法</td>
</tr>
<tr>
<td><strong>Shared Memory Lock</strong></td>
<td>轻量级锁机制，支持多 reader 并发</td>
</tr>
<tr>
<td><strong>Page Mapping Index</strong></td>
<td>快速定位帧位置，O(1) 平均时间复杂度</td>
</tr>
</tbody>
</table>
<hr>
<h2 id="十二调试与监控命令">十二、调试与监控命令</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- 1. 查看当前 WAL 模式
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">journal_mode</span><span class="p">;</span><span class="w">           </span><span class="c1">-- 返回 &#34;wal&#34;
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c1">-- 2. 查看 WAL 文件状态
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">wal_checkpoint</span><span class="p">;</span><span class="w">         </span><span class="c1">-- PASSIVE 模式，输出 checkpoint 结果
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c1">-- 3. 强制 Checkpoint 并截断 WAL
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">wal_checkpoint</span><span class="p">(</span><span class="k">TRUNCATE</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c1">-- 4. 查看 WAL 配置
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">wal_autocheckpoint</span><span class="p">;</span><span class="w">     </span><span class="c1">-- 显示当前 autocheckpoint 阈值
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">journal_size_limit</span><span class="p">;</span><span class="w">     </span><span class="c1">-- 显示当前 WAL 大小限制
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c1">-- 5. 生成 WAL 文件信息
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">integrity_check</span><span class="p">;</span><span class="w">        </span><span class="c1">-- 检查数据库完整性
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">wal_checkpoint</span><span class="p">;</span><span class="w">         </span><span class="c1">-- 检查 WAL 状态
</span></span></span></code></pre></div><h3 id="监控指标">监控指标</h3>
<ul>
<li>WAL 文件大小 (总帧数)</li>
<li>Checkpoint 频率</li>
<li>锁等待时间 (Writer 等待 Reader)</li>
<li>内存使用 (共享内存大小)</li>
<li>页面命中率 (从 WAL vs 从数据库)</li>
</ul>
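<p>其中"总帧数"可以直接由 WAL 文件字节数换算：文件头固定 32 字节，之后每帧是 24 字节帧头加一个完整页面。示意实现：</p>

```c
#include <assert.h>
#include <stdint.h>

#define WAL_HDRSIZE 32       /* WAL 文件头大小 */
#define WAL_FRAME_HDRSIZE 24 /* 每帧的帧头大小 */

/* 由文件大小与页面大小推算帧数，不满一帧的尾部忽略 */
static int64_t wal_frame_count(int64_t fileSize, int64_t pageSize) {
  if (fileSize < WAL_HDRSIZE) return 0;
  return (fileSize - WAL_HDRSIZE) / (WAL_FRAME_HDRSIZE + pageSize);
}
```

<p>例如页面 4096 字节、WAL 文件 41232 字节时，恰为 10 帧。</p>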
<hr>
<h2 id="十三总结">十三、总结</h2>
<h3 id="wal-的核心价值">WAL 的核心价值</h3>
<ol>
<li><strong>高并发</strong>: Reader/Writer 可并行执行，极大提升并发能力</li>
<li><strong>数据一致性</strong>: Salt + Checksum 双重保障，确保原子性</li>
<li><strong>灵活 checkpoint</strong>: 支持 4 种模式，适应不同场景需求</li>
<li><strong>轻量级锁机制</strong>: 通过 shm 文件实现，无需复杂内核锁</li>
<li><strong>Crash Recovery 快速</strong>: 重放有效帧，无需复杂回滚</li>
</ol>
<hr>
<h2 id="附录常用命令参考">附录：常用命令参考</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># 1. 编译 SQLite 源码</span>
</span></span><span class="line"><span class="cl">./configure  <span class="c1"># WAL 自 SQLite 3.7.0 起内置，无需额外编译选项</span>
</span></span><span class="line"><span class="cl">make -j<span class="k">$(</span>nproc<span class="k">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 2. 运行测试</span>
</span></span><span class="line"><span class="cl">make <span class="nb">test</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 3. 查看 WAL 源码结构</span>
</span></span><span class="line"><span class="cl">find src/wal*.c -type f <span class="p">|</span> head -10
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 4. 生成 WAL 状态报告</span>
</span></span><span class="line"><span class="cl">./sqlite3 test.db <span class="s2">&#34;PRAGMA journal_mode=wal; PRAGMA wal_checkpoint; PRAGMA integrity_check;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 5. 分析 WAL 文件二进制格式</span>
</span></span><span class="line"><span class="cl">hexdump -C test.db-wal <span class="p">|</span> head -50
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 6. 查看共享内存结构</span>
</span></span><span class="line"><span class="cl">ls -lh test.db-wal test.db-shm
</span></span></code></pre></div><hr>
<p><strong>文档生成时间</strong>: 2026-03-30<br>
<strong>SQLite 源码版本</strong>: 查看 <code>VERSION</code> 文件<br>
<strong>源码目录</strong>: <code>/Volumes/790/Codes/sqlite/src/wal.c</code></p>
<hr>
<h2 id="快速查阅表">快速查阅表</h2>
<table>
<thead>
<tr>
<th>概念</th>
<th>行号范围</th>
<th>说明</th>
</tr>
</thead>
<tbody>
<tr>
<td>WAL Header 结构</td>
<td>1-250</td>
<td>文件格式说明</td>
</tr>
<tr>
<td>Wal 结构体</td>
<td>511-574</td>
<td>核心状态结构</td>
</tr>
<tr>
<td>Checkpoint 算法</td>
<td>4292-4500</td>
<td>检查点处理逻辑</td>
</tr>
<tr>
<td>FindFrame 算法</td>
<td>3631-3700</td>
<td>页面帧查找</td>
</tr>
<tr>
<td>Hash 表维护</td>
<td>1146-1400</td>
<td>快速定位机制</td>
</tr>
</tbody>
</table>
<p><strong>建议阅读顺序</strong>: WAL Header 设计 → 核心结构体 → Checkpoint 算法 → 查找算法 → 源码注释详解</p>
]]></content>
		</item>
		
		<item>
			<title>How to Alter Table in Sqlite</title>
			<link>/posts/how-to-alter-table-in-sqlite/</link>
			<pubDate>Tue, 26 Nov 2024 15:49:35 +0800</pubDate>
			
			<guid>/posts/how-to-alter-table-in-sqlite/</guid>
			<description><![CDATA[%!s(<nil>)]]></description>
			<content type="html"><![CDATA[<h1 id="create-temp-table-then-do-migration">Create temp table, then do migration</h1>
<h2 id="the-old-table">The old table</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="s2">&#34;users&#34;</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="s2">&#34;id&#34;</span><span class="w">	</span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="s2">&#34;created_at&#34;</span><span class="w">	</span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="w"> </span><span class="k">DEFAULT</span><span class="w"> </span><span class="k">CURRENT_TIMESTAMP</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="s2">&#34;password&#34;</span><span class="w">	</span><span class="nb">TEXT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">(</span><span class="s2">&#34;id&#34;</span><span class="w"> </span><span class="n">AUTOINCREMENT</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span></code></pre></div><h2 id="rename-table">Rename table</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="s2">&#34;users&#34;</span><span class="w"> </span><span class="k">RENAME</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="s2">&#34;users_old&#34;</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div><h2 id="the-new-table">The new table</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="s2">&#34;users&#34;</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="s2">&#34;id&#34;</span><span class="w">	</span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="s2">&#34;created_at&#34;</span><span class="w">	</span><span class="k">TIMESTAMP</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="w"> </span><span class="k">DEFAULT</span><span class="w"> </span><span class="k">CURRENT_TIMESTAMP</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="s2">&#34;password&#34;</span><span class="w">	</span><span class="nb">TEXT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">(</span><span class="s2">&#34;id&#34;</span><span class="w"> </span><span class="n">AUTOINCREMENT</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span></code></pre></div><h2 id="migrate-data-from-old-table">Migrate data from old table</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="s2">&#34;users&#34;</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">created_at</span><span class="p">,</span><span class="w"> </span><span class="n">password</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">	</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">created_at</span><span class="p">,</span><span class="w"> </span><span class="n">password</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="s2">&#34;users_old&#34;</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div><h2 id="delete-old-table">Delete old table</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">DROP</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="s2">&#34;users_old&#34;</span><span class="p">;</span><span class="w">
</span></span></span></code></pre></div><h2 id="confirm-data">Confirm data</h2>
<p>Double-check that the <code>created_at</code> column still has type <code>TIMESTAMP</code> after the migration.</p>
<h1 id="notes">Notes</h1>
<ol>
<li>Back up the data first.</li>
<li>Verify the migrated data (e.g. row counts) after the copy.</li>
<li>If existing <code>created_at</code> values are not in timestamp format, transform them during the migration.</li>
</ol>
]]></content>
		</item>
		
		<item>
			<title>How to Use Build.rs</title>
			<link>/posts/how-to-use-build.rs/</link>
			<pubDate>Sun, 24 Nov 2024 11:14:21 +0800</pubDate>
			
			<guid>/posts/how-to-use-build.rs/</guid>
			<description></description>
			<content type="html"><![CDATA[<ol>
<li>cargo:rustc-link-lib
Specifies a library to link against.</li>
</ol>
<p>Format:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="c1">// kind (optional): the library kind, e.g. static (static library) or dylib (dynamic library, the default).
</span></span></span><span class="line"><span class="cl"><span class="c1">// name: the library name, e.g. virt.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;cargo:rustc-link-lib=</span><span class="si">{kind}</span><span class="s">=</span><span class="si">{name}</span><span class="s">&#34;</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></div><p>Example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;cargo:rustc-link-lib=virt&#34;</span><span class="p">);</span><span class="w"> </span><span class="c1">// link the dynamic library libvirt
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;cargo:rustc-link-lib=static=foo&#34;</span><span class="p">);</span><span class="w"> </span><span class="c1">// link the static library libfoo.a
</span></span></span></code></pre></div><ol start="2">
<li>cargo:rustc-link-search
Specifies a path for Cargo to search for library files.</li>
</ol>
<p>Format:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="c1">// kind (optional): the path kind, e.g. native (the default).
</span></span></span><span class="line"><span class="cl"><span class="c1">// path: the directory containing the library.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;cargo:rustc-link-search=</span><span class="si">{kind}</span><span class="s">=</span><span class="si">{path}</span><span class="s">&#34;</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></div><p>Example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;cargo:rustc-link-search=/opt/homebrew/lib&#34;</span><span class="p">);</span><span class="w"> </span><span class="c1">// search this path for libraries
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;cargo:rustc-link-search=native=/custom/path&#34;</span><span class="p">);</span><span class="w"> </span><span class="c1">// declare the path as a native search path
</span></span></span></code></pre></div><ol start="3">
<li>cargo:rerun-if-changed
Tells Cargo to rerun build.rs when the specified file changes.</li>
</ol>
<p>Format:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;cargo:rerun-if-changed=</span><span class="si">{file}</span><span class="s">&#34;</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></div><p>Example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;cargo:rerun-if-changed=build.rs&#34;</span><span class="p">);</span><span class="w"> </span><span class="c1">// rerun when build.rs itself changes
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;cargo:rerun-if-changed=src/lib.rs&#34;</span><span class="p">);</span><span class="w"> </span><span class="c1">// watch another file
</span></span></span></code></pre></div><ol start="4">
<li>cargo:rerun-if-env-changed
Reruns build.rs when the specified environment variable changes.</li>
</ol>
<p>Example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;cargo:rerun-if-env-changed=MY_ENV_VAR&#34;</span><span class="p">);</span><span class="w">
</span></span></span></code></pre></div><ol start="5">
<li>Directives emitted by pkg-config or other tools
If you use pkg-config, it automatically emits the cargo:rustc-link-lib and cargo:rustc-link-search directives for you. For example:</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">pkg_config</span>::<span class="n">Config</span>::<span class="n">new</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">.</span><span class="n">atleast_version</span><span class="p">(</span><span class="s">&#34;1.0&#34;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">.</span><span class="n">probe</span><span class="p">(</span><span class="s">&#34;libvirt&#34;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">.</span><span class="n">unwrap</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>This call checks whether libvirt is present and generates the directives automatically.</p>
<p>How do you determine which directives you need?</p>
<ol>
<li>Read the library documentation: find the install path and library name of the C/C++ library you need to link.</li>
<li>Locate the library files: check the libraries installed on the system (such as libvirt) and find their paths with the following commands:</li>
</ol>
<h3 id="macoslinux">macOS/Linux:</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pkg-config --libs --cflags libvirt
</span></span></code></pre></div><p>The output contains the search path and library name:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-console" data-lang="console"><span class="line"><span class="cl"><span class="go">-L/opt/homebrew/lib -lvirt
</span></span></span></code></pre></div><p>These correspond to cargo:rustc-link-search and cargo:rustc-link-lib, respectively.</p>
<p>Double-check:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">find /usr /opt/homebrew -name <span class="s2">&#34;libvirt*&#34;</span>
</span></span></code></pre></div><ol start="3">
<li>Debug the build script: run cargo build and read the error messages. For example, if libvirt is reported as not found, check whether the search path or library name needs adjusting.</li>
</ol>
<p>In short, refine build.rs step by step by debugging builds and confirming the dependency information.</p>
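Putting the directives together, a minimal build.rs might look like the sketch below. The library name (virt), the fallback search path, and the `LIBVIRT_LIB_DIR` override variable are illustrative assumptions; substitute the values pkg-config reports on your system.

```rust
// A minimal build.rs combining the directives above. The library name
// (virt), the fallback path, and LIBVIRT_LIB_DIR are illustrative.

/// Collect the cargo directives so they can be inspected before printing.
fn directives(lib_dir: Option<&str>) -> Vec<String> {
    let dir = lib_dir.unwrap_or("/opt/homebrew/lib");
    vec![
        // Rerun only when the build script itself changes.
        "cargo:rerun-if-changed=build.rs".to_string(),
        // Rerun when the override variable changes.
        "cargo:rerun-if-env-changed=LIBVIRT_LIB_DIR".to_string(),
        // Search path for the native library.
        format!("cargo:rustc-link-search=native={dir}"),
        // Link the dynamic library libvirt.
        "cargo:rustc-link-lib=dylib=virt".to_string(),
    ]
}

fn main() {
    // Allow overriding the search path via an environment variable.
    let dir = std::env::var("LIBVIRT_LIB_DIR").ok();
    for d in directives(dir.as_deref()) {
        println!("{d}");
    }
}
```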
]]></content>
		</item>
		
		<item>
			<title>What Is CIDR</title>
			<link>/posts/what-is-cidr/</link>
			<pubDate>Fri, 19 Apr 2024 16:23:59 +0800</pubDate>
			
			<guid>/posts/what-is-cidr/</guid>
			<description></description>
			<content type="html"><![CDATA[<p><img src="/imgs/ipcalc-cidr.png" alt="ipcalc cidr"></p>
<p>CIDR (Classless Inter-Domain Routing) is a method of allocating IP addresses without the constraints of address classes.</p>
<h2 id="ip地址格式">IP address formats</h2>
<h3 id="有类地址">Classful addresses</h3>
<ol>
<li>Class A: an IPv4 address with an 8-bit network prefix, giving 16,777,214 assignable addresses. Example: 10.0.0.1, where 10 is the network address and 0.0.1 is the host address.</li>
<li>Class B: an IPv4 address with a 16-bit network prefix, giving 65,534 assignable addresses. Example: 172.16.0.1, where 172.16 is the network address and 0.1 is the host address.</li>
<li>Class C: an IPv4 address with a 24-bit network prefix, giving 254 assignable addresses. Example: 192.168.1.100, where 192.168.1 is the network address and 100 is the host address.</li>
</ol>
<h3 id="无类地址">Classless addresses</h3>
<p>This scheme uses variable-length subnet masks (VLSM) to change the ratio of network bits to host bits in an IP address. A subnet mask is a bit pattern that recovers the network address from an IP address by zeroing out the host bits.</p>
<p>VLSM lets a network administrator carve the IP address space into subnets of different sizes, each with a flexible number of hosts and a bounded number of IP addresses. A CIDR address appends a suffix to an ordinary IP address giving the number of network-prefix bits.</p>
<p>For example, 192.168.0.0/24 is an IPv4 CIDR address whose first 24 bits (192.168.0) are the network address.</p>
<h2 id="cidr-优势">Advantages of CIDR</h2>
<ol>
<li>Less wasted address space. If a user needs a block of 300 IP addresses, classful allocation would require a Class B network (65,534 = 256*256-2 addresses), wasting most of it. With CIDR, a block such as 192.168.0.0/23 fits the need; use <code>ipcalc</code> to inspect the exact address range.</li>
<li>Faster data transfer. CIDR organizes IP addresses into multiple subnets (smaller networks within the network), which reduces the number of routing hops and therefore the number of times packets are copied, improving efficiency.</li>
<li>Virtual private clouds (VPCs). CIDR lets users place workloads in an isolated, secure environment, although classful addressing could also satisfy this need.</li>
<li>Flexible supernetting. A supernet is a group of subnets sharing a similar network prefix. CIDR allows supernets to be created flexibly, which is impossible under the traditional classful masking scheme. For example, <code>192.168.0.0/24</code> and <code>192.168.1.0/24</code> can be combined into <code>192.168.0.0/23</code>: applying the <code>255.255.254.0</code> mask returns the first 23 bits as the network address, so a router needs only a single routing-table entry to manage packets between the devices of both subnets.</li>
</ol>
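The prefix arithmetic behind the /23 example above is easy to check by hand; a small sketch (plain integer math, no external crates) computing the network address, broadcast address, and usable host count of a CIDR block:

```rust
/// Parse a dotted-quad IPv4 address into a u32.
fn ip_to_u32(ip: &str) -> u32 {
    ip.split('.')
        .map(|o| o.parse::<u32>().unwrap())
        .fold(0, |acc, o| (acc << 8) | o)
}

/// Format a u32 back to dotted-quad notation.
fn u32_to_ip(n: u32) -> String {
    format!("{}.{}.{}.{}", n >> 24, (n >> 16) & 255, (n >> 8) & 255, n & 255)
}

/// Network address, broadcast address, and usable host count of `addr/prefix`.
fn cidr_info(addr: &str, prefix: u32) -> (String, String, u32) {
    let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
    let network = ip_to_u32(addr) & mask;
    let broadcast = network | !mask;
    // Subtract the network and broadcast addresses themselves.
    let hosts = if prefix >= 31 { 0 } else { broadcast - network - 1 };
    (u32_to_ip(network), u32_to_ip(broadcast), hosts)
}

fn main() {
    // 192.168.0.0/23 comfortably holds the 300 hosts from the example above.
    let (net, bcast, hosts) = cidr_info("192.168.1.77", 23);
    println!("network {net}, broadcast {bcast}, {hosts} usable hosts");
}
```

The same numbers can be cross-checked with <code>ipcalc</code>.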
<h2 id="cidr-工作原理">How CIDR works</h2>
<p>CIDR lets network routers forward packets to the right device based on the indicated subnet. Instead of classifying an IP address by class, the router extracts the network and host addresses specified by the CIDR suffix.</p>
<h3 id="cidr块">CIDR blocks</h3>
<p>A CIDR block is a set of IP addresses sharing the same network prefix and prefix length. A large block consists of more IP addresses and a smaller suffix. For example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-console" data-lang="console"><span class="line"><span class="cl"><span class="go">master CIDR Block: 10.10.0.0/16
</span></span></span><span class="line"><span class="cl"><span class="go"></span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="go">subnet 1: 10.10.1.0/24
</span></span></span><span class="line"><span class="cl"><span class="go">subnet 2: 10.10.2.0/24
</span></span></span><span class="line"><span class="cl"><span class="go">subnet 3: 10.10.3.0/24
</span></span></span><span class="line"><span class="cl"><span class="go">subnet 4: 10.10.4.0/24
</span></span></span><span class="line"><span class="cl"><span class="go">subnet 5: 10.10.5.0/24
</span></span></span><span class="line"><span class="cl"><span class="go">...
</span></span></span></code></pre></div>]]></content>
		</item>
		
		<item>
			<title>Change the CIDR for Running Kubernetes</title>
			<link>/posts/change-the-cidr-for-running-kubernetes/</link>
			<pubDate>Tue, 16 Apr 2024 18:31:17 +0800</pubDate>
			
			<guid>/posts/change-the-cidr-for-running-kubernetes/</guid>
			<description></description>
			<content type="html"><![CDATA[<h1 id="first-of-first">First things first</h1>
<p>This operation will make the cluster unavailable for several minutes. Proceed with care.</p>
<h2 id="releated-files-and-kube-objects">Related files and kube objects</h2>
<ol>
<li><code>/etc/kubernetes/manifests/kube-apiserver.yaml</code></li>
<li><code>kubectl -n kube-system edit svc/kube-dns</code></li>
<li><code>/var/lib/kubelet/config.yaml</code></li>
<li><code>kubectl -n kube-system edit cm kubelet-config</code></li>
</ol>
<h2 id="update-kube-apiserver-manifest">Update kube-apiserver manifest</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># vim /etc/kubernetes/manifests/kube-apiserver.yaml</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">containers</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">command</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">kube-apiserver</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- --<span class="l">service-account-signing-key-file=/etc/kubernetes/pki/sa.key</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- --<span class="l">service-cluster-ip-range=100.96.0.0/12  </span><span class="w"> </span><span class="c"># Change</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- --<span class="l">tls-cert-file=/etc/kubernetes/pki/apiserver.crt</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">...</span><span class="w">
</span></span></span></code></pre></div><h2 id="edit-kube-dns-service">Edit kube-dns service</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># kubectl -n kube-system edit svc kube-dns</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c"># in the service YAML, modify the &#39;strategy&#39;. save and quit to apply the changes!</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">clusterIP</span><span class="p">:</span><span class="w"> </span><span class="m">100.96.0.10</span><span class="w"> </span><span class="c"># Change</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">clusterIPs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="m">100.96.0.10</span><span class="w"> </span><span class="c"># Change</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">internalTrafficPolicy</span><span class="p">:</span><span class="w"> </span><span class="l">Cluster</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">...</span><span class="w">
</span></span></span></code></pre></div><h2 id="replace-kube-dns">Replace kube-dns</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-console" data-lang="console"><span class="line"><span class="cl"><span class="go">kubectl replace -f /tmp/kubectl-edit-3485293250.yaml --force 
</span></span></span><span class="line"><span class="cl"><span class="go"></span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="gp">#</span> see the new IP address given to the service
</span></span><span class="line"><span class="cl"><span class="go">kubectl -n kube-system get svc
</span></span></span><span class="line"><span class="cl"><span class="go"></span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="go">controlplane $ kubectl -n kube-system get svc
</span></span></span><span class="line"><span class="cl"><span class="go">NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                  AGE
</span></span></span><span class="line"><span class="cl"><span class="go">kube-dns   ClusterIP   100.96.0.10   &lt;none&gt;        53/UDP,53/TCP,9153/TCP   6s
</span></span></span></code></pre></div><h2 id="update-kubelet-config">Update kubelet config</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># vim /var/lib/kubelet/config.yaml </span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c"># within the config.yaml file, change the clusterDNS value to 100.96.0.10</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">cgroupDriver</span><span class="p">:</span><span class="w"> </span><span class="l">systemd</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">clusterDNS</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span>- <span class="m">100.96.0.10</span><span class="w"> </span><span class="c"># Change</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">clusterDomain</span><span class="p">:</span><span class="w"> </span><span class="l">cluster.local</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">...</span><span class="w">
</span></span></span></code></pre></div><h2 id="update-kubelet-config-configmap">Update kubelet-config configmap</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># kubectl -n kube-system edit cm kubelet-config</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c"># in the kubelet configMap, change the value for clusterDNS to 100.96.0.10</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">data</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">kubelet</span><span class="p">:</span><span class="w"> </span><span class="l">|</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">cgroupDriver</span><span class="p">:</span><span class="w"> </span><span class="l">systemd</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">clusterDNS</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="m">100.96.0.10</span><span class="w"> </span><span class="c"># Change</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">clusterDomain</span><span class="p">:</span><span class="w"> </span><span class="l">cluster.local</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">...</span><span class="w">
</span></span></span></code></pre></div><h2 id="update-configmap-kubelet-service">Update configmap, kubelet service</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># apply the update to the kubelet configuration immediately on the node</span>
</span></span><span class="line"><span class="cl">kubeadm upgrade node phase kubelet-config
</span></span><span class="line"><span class="cl">systemctl daemon-reload
</span></span><span class="line"><span class="cl">systemctl restart kubelet
</span></span></code></pre></div><h2 id="verifing">Verifying</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># start a pod named &#39;netshoot&#39; using the image &#39;nicolaka/netshoot&#39; ensuring that the pod stays in a running state.</span>
</span></span><span class="line"><span class="cl">kubectl run netshoot --image<span class="o">=</span>nicolaka/netshoot --command -- sleep <span class="s2">&#34;3600&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># login the checking pod container</span>
</span></span><span class="line"><span class="cl">kubectl <span class="nb">exec</span> netshoot -it -- /bin/bash
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># cat the /etc/resolv.conf</span>
</span></span><span class="line"><span class="cl">cat /etc/resolv.conf
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">nslookup kubernetes.default
</span></span></code></pre></div>]]></content>
		</item>
		
		<item>
			<title>Qemu Usage</title>
			<link>/posts/qemu-usage/</link>
			<pubDate>Wed, 10 Apr 2024 20:25:41 +0800</pubDate>
			
			<guid>/posts/qemu-usage/</guid>
			<description></description>
			<content type="html"><![CDATA[<h2 id="qemu-img">qemu-img</h2>
<h3 id="snapshot">snapshot</h3>
<p><img src="/imgs/qemu-img-snapshot.png" alt="qemu-img-snapshot"></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># list the snapshots</span>
</span></span><span class="line"><span class="cl">qemu-img snapshot -l disk_image.qcow2
</span></span><span class="line"><span class="cl"><span class="c1"># create a snapshot</span>
</span></span><span class="line"><span class="cl">qemu-img snapshot -c snapshot_name disk_image.qcow2
</span></span><span class="line"><span class="cl"><span class="c1"># apply (revert to) a snapshot</span>
</span></span><span class="line"><span class="cl">qemu-img snapshot -a snapshot_name disk_image.qcow2
</span></span></code></pre></div>]]></content>
		</item>
		
		<item>
			<title>Policies</title>
			<link>/posts/policies/</link>
			<pubDate>Tue, 09 Apr 2024 09:48:38 +0800</pubDate>
			
			<guid>/posts/policies/</guid>
			<description></description>
			<content type="html"><![CDATA[<p>Kubernetes policies are configurations that govern other configurations or runtime behavior. Kubernetes offers several forms of policy:</p>
<ol>
<li>Applying policies via API objects
<ul>
<li>NetworkPolicy restricts ingress and egress traffic for workloads</li>
<li>LimitRange manages resource-allocation constraints across several object categories</li>
<li>ResourceQuota limits resource consumption per namespace</li>
</ul>
</li>
<li>Applying policies via admission controllers
Admission controllers run in the API server and can validate or mutate API requests; some of them apply policies. For example, the AlwaysPullImages admission controller sets the image pull policy of every container in a Pod to Always.</li>
<li>Applying policies via ValidatingAdmissionPolicy
Allows configurable validation checks to run in the API server using the CEL expression language. For example, a ValidatingAdmissionPolicy can forbid use of the latest image tag.</li>
<li>Dynamic admission control
A way for users to define their own admission control and load it dynamically into the kube-apiserver's validating or mutating admission flow at runtime.
<ul>
<li><a href="https://github.com/kubewarden">kubewarden</a> Wasm-based admission control that lets users check create, update, and delete requests for Kubernetes objects such as Pods, Deployments, and Services against custom policies, and enforce those policies when a request is accepted or rejected</li>
<li><a href="https://kyverno.io/">kyverno</a> declarative policy management that lets Kubernetes users define and enforce resource-based policies</li>
<li><a href="https://github.com/open-policy-agent/gatekeeper">OPA Gatekeeper</a> enforces and manages policies</li>
<li><a href="https://polaris.docs.fairwinds.com/admission-controller/">Polaris</a> automated checks and recommendations on cluster health and best practices, helping users identify and resolve potential problems</li>
</ul>
</li>
<li>Applying policies via kubelet configuration. Kubernetes allows the kubelet to be configured on each node, and some kubelet settings act as policies.
<ul>
<li>Process ID limits and reservations: cap or reserve the PIDs that may be allocated</li>
<li>Node resource managers: manage CPU, memory, and other device resources for latency-sensitive and high-throughput workloads</li>
</ul>
</li>
</ol>
<h2 id="limitrange">LimitRange</h2>
<p>By default, containers run on a Kubernetes cluster with no limit on the compute resources they may use. A namespace ResourceQuota caps the total resource usage of a namespace; a LimitRange constrains the upper and lower bounds of the resources a single Pod may use. LimitRange can:</p>
<ol>
<li>Enforce minimum and maximum resource usage per Pod or Container in a namespace</li>
<li>Enforce minimum and maximum storage requests per PersistentVolumeClaim in a namespace</li>
<li>Enforce a ratio between the request and limit of a resource in a namespace</li>
<li>Set default requests and limits for compute resources in a namespace, injecting them into Containers automatically at runtime</li>
</ol>
<h3 id="资源限制和请求的约束">Constraints on resource limits and requests</h3>
<p>Caveats:</p>
<ol>
<li>LimitRange validation happens only at Pod admission time; it does not validate Pods that are already running. If a namespace ResourceQuota is added while Pods with resource settings are already running, the namespace quota's used amounts are updated to account for them.</li>
<li>If multiple LimitRange objects exist in a namespace, which default value gets applied is non-deterministic.</li>
</ol>
<h3 id="pod-的-limitrange-和准入检查">LimitRange and admission checks for Pods</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">LimitRange</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">cpu-resource-constraint</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">limits</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">default</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">500m</span><span class="w"> </span><span class="c"># used if containers[*].resources.limits.cpu is not defined</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">defaultRequest</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">500m</span><span class="w"> </span><span class="c"># used if containers[*].resources.requests.cpu is not defined</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">maxLimitRequestRatio</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;10&#34;</span><span class="w"> </span><span class="c"># for non-zero containers[*].resources.[limits/requests].cpu, each container&#39;s limits.cpu may be at most 10x its requests.cpu</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">max</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;1&#34;</span><span class="w"> </span><span class="c"># caps each container&#39;s CPU limit at 1 (1000m)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">min</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">100m</span><span class="w"> </span><span class="c"># floors each container&#39;s CPU limit at 100m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">Container</span><span class="w"> </span><span class="c"># applies to containers, including initContainers</span><span class="w">
</span></span></span></code></pre></div><h2 id="资源配额">Resource quotas</h2>
<p>Defined via the <code>ResourceQuota</code> object, which caps the total resource consumption of a namespace. It can limit both the total number of objects of a given type in the namespace and the total compute resources its Pods may use. It works as follows:</p>
<ol>
<li>A cluster administrator creates one or more ResourceQuota objects per namespace</li>
<li>When a user creates resources (Pods, Services, etc.) in the namespace, the Kubernetes quota system tracks usage and ensures it does not exceed the hard limits defined in the ResourceQuota</li>
<li>If creating or updating a resource would violate a quota constraint, the request fails with HTTP 403 FORBIDDEN and a message explaining the constraint that would be violated</li>
<li>If quota is enabled for compute resources in a namespace, Pod creation must specify requests and limits for those resources, otherwise the Pod cannot be created. (With the LimitRanger admission controller enabled, Pods without resource settings get defaults filled in.)</li>
</ol>
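The admission-time bookkeeping in steps 2 and 3 can be sketched as follows. This is a simplified model, not the actual kube-apiserver code; the resource name, integer quantities, and 403-style error string are illustrative assumptions.

```rust
use std::collections::HashMap;

/// A simplified hard-limit quota for one namespace: resource name -> cap.
/// Quantities are plain integers (e.g. millicores) for illustration only.
struct ResourceQuota {
    hard: HashMap<String, u64>,
    used: HashMap<String, u64>,
}

impl ResourceQuota {
    /// Admit a request if, for every tracked resource, used + requested <= hard.
    /// On violation, return a 403-style error naming the constraint.
    fn admit(&mut self, requested: &HashMap<String, u64>) -> Result<(), String> {
        for (name, amount) in requested {
            if let Some(cap) = self.hard.get(name) {
                let used = self.used.get(name).copied().unwrap_or(0);
                if used + amount > *cap {
                    return Err(format!(
                        "403 FORBIDDEN: exceeded quota for {name}: used {used}, requested {amount}, hard limit {cap}"
                    ));
                }
            }
        }
        // All constraints satisfied: record the new usage.
        for (name, amount) in requested {
            if self.hard.contains_key(name) {
                *self.used.entry(name.clone()).or_insert(0) += amount;
            }
        }
        Ok(())
    }
}

fn main() {
    let mut quota = ResourceQuota {
        hard: HashMap::from([("requests.cpu".to_string(), 1000)]),
        used: HashMap::new(),
    };
    let pod = HashMap::from([("requests.cpu".to_string(), 600)]);
    println!("{:?}", quota.admit(&pod)); // first Pod fits under the 1000m cap
    println!("{:?}", quota.admit(&pod)); // second Pod would exceed it
}
```

Note that admission either fully accepts or fully rejects a request; usage is only recorded once every constraint passes.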
<blockquote>
<p>NOTES:
For other resources, a ResourceQuota can be applied without requiring every Pod in the namespace to set a limit/request for that resource; such Pods are simply ignored for that resource. This means that even if the quota limits ephemeral storage in the namespace, a new Pod without ephemeral-storage limits/requests can still be created. A LimitRange can be used to automatically set defaults for these resources.</p>
</blockquote>
<h3 id="启用资源配额">Enabling resource quotas</h3>
<p><code>kube-apiserver --enable-admission-plugins=ResourceQuota</code>. Quota enforcement is active in a namespace whenever a ResourceQuota object exists in it.</p>
<h3 id="计算资源配额">Compute resource quota</h3>
<table>
<thead>
<tr>
<th>Resource name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>limits.cpu</td>
<td>Across all Pods in a non-terminal state, the sum of CPU limits cannot exceed this value</td>
</tr>
<tr>
<td>limits.memory</td>
<td>Likewise, for memory limits</td>
</tr>
<tr>
<td>requests.cpu</td>
<td>Across all Pods in a non-terminal state, the sum of CPU requests cannot exceed this value (think of it as the guaranteed baseline)</td>
</tr>
<tr>
<td>requests.memory</td>
<td>Likewise, for memory requests</td>
</tr>
<tr>
<td>hugepages-&lt;size&gt;</td>
<td>Across all Pods in a non-terminal state, the total huge-page requests of the given size cannot exceed this value</td>
</tr>
</tbody>
</table>
<h3 id="扩展资源的配额">Quota for <a href="https://kubernetes.io/zh-cn/docs/concepts/configuration/manage-resources-containers/#extended-resources">extended resources</a></h3>
<p>Because extended resources cannot be overcommitted, it makes no sense to specify both requests and limits for the same extended resource in a quota; only quota items with the <code>requests.</code> prefix are currently allowed for extended resources. For example, <code>requests.nvidia.com/gpu: 4</code> limits the total requested GPU count to 4.</p>
<h3 id="存储资源配额">Storage resource quota</h3>
<p>Users can cap the total storage consumed in a namespace, and can further limit storage consumption per StorageClass.</p>
<table>
<thead>
<tr>
<th>资源名称</th>
<th>描述</th>
</tr>
</thead>
<tbody>
<tr>
<td>requests.storage</td>
<td>所有 PVC 存储资源的需求总量不能超过该值</td>
</tr>
<tr>
<td>persistentvolumeclaims</td>
<td>在该命名空间下所允许的 PVC 总量</td>
</tr>
<tr>
<td><storage-class-name>.storageclass.storage.k8s.io/requests.storage</td>
<td>在所有与<storage-class-name>相关的 PVC 中，存储请求的总和不能超过该值</td>
</tr>
<tr>
<td><storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims</td>
<td>在所有与<storage-class-name>相关的 PVC 中，命名空间中可以存在的 PVC 总数</td>
</tr>
</tbody>
</table>
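<p>A storage quota combining a namespace-wide cap with a per-StorageClass cap might be sketched as follows (the namespace and the StorageClass name <code>fast</code> are hypothetical):</p>

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: demo-ns          # illustrative namespace
spec:
  hard:
    requests.storage: 100Gi             # total storage requested by all PVCs
    persistentvolumeclaims: "10"        # total number of PVCs
    # per-StorageClass limits; "fast" is a hypothetical StorageClass name
    fast.storageclass.storage.k8s.io/requests.storage: 50Gi
    fast.storageclass.storage.k8s.io/persistentvolumeclaims: "5"
```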
<p>As of v1.8, quota support for local ephemeral storage is available as an alpha feature:</p>
<table>
<thead>
<tr>
<th>Resource name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>requests.ephemeral-storage</td>
<td>Across all Pods in the namespace, the sum of local ephemeral storage requests cannot exceed this value</td>
</tr>
<tr>
<td>limits.ephemeral-storage</td>
<td>Across all Pods in the namespace, the sum of local ephemeral storage limits cannot exceed this value</td>
</tr>
<tr>
<td>ephemeral-storage</td>
<td>Same as requests.ephemeral-storage</td>
</tr>
</tbody>
</table>
<blockquote>
<p>NOTES:
When a CRI container runtime is used, container logs count against the ephemeral storage quota. This can cause a Pod that has exhausted its storage quota to be unexpectedly evicted from the node.</p>
</blockquote>
<h3 id="对象数量的配额">Object Count Quota</h3>
<p>The following syntax puts a quota on the total count of any standard, namespaced resource type:</p>
<ul>
<li><code>count/&lt;resource&gt;.&lt;group&gt;</code> for resources in non-core groups</li>
<li><code>count/&lt;resource&gt;</code> for resources in the core group</li>
</ul>
<p>This also works for custom resources; for example, to put a quota on the custom resource <code>widgets</code> in the <code>example.com</code> API group, use <code>count/widgets.example.com</code>.</p>
<p>With <code>count/*</code> quotas, an object counts against the quota as long as it exists in the server's storage. Such quotas protect against storage exhaustion: for example, too many Secrets in a cluster can actually prevent the server from starting, and a misconfigured CronJob that creates too many Jobs in a namespace can cause a denial of service.</p>
<table>
<thead>
<tr>
<th>Resource name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>pods</td>
<td>The maximum number of Pods in a non-terminal state (<code>.status.phase</code> not in [Failed, Succeeded]) allowed in the namespace; mainly guards against Pod proliferation exhausting the Pod IP addresses the cluster can provide</td>
</tr>
</tbody>
</table>
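<p>A sketch combining the object-count syntax above; the resource types and counts are illustrative:</p>

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
  namespace: demo-ns               # illustrative namespace
spec:
  hard:
    count/deployments.apps: "10"   # resource in a non-core group
    count/secrets: "20"            # resource in the core group
    count/widgets.example.com: "3" # a custom resource
    pods: "50"                     # non-terminal Pods
```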
<h3 id="配额作用域">Quota Scopes</h3>
<p><code>resourcequota.spec.scopeSelector</code>, <code>resourcequota.spec.scopes</code>: each quota applies only to resources within its scopes; the quota mechanism counts only the usage of resources in the intersection of the listed scopes.</p>
<p>When a scope is added to a quota, it restricts the quota to the resources relevant to that scope. Specifying a resource in the quota that is outside the allowed (scope) set results in a validation error.</p>
<table>
<thead>
<tr>
<th>Scope</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Terminating</td>
<td>Matches all Pods whose <code>spec.activeDeadlineSeconds</code> is &gt;= 0</td>
</tr>
<tr>
<td>NotTerminating</td>
<td>Matches all Pods whose <code>spec.activeDeadlineSeconds</code> is nil</td>
</tr>
<tr>
<td>BestEffort</td>
<td>Matches all Pods with BestEffort QoS</td>
</tr>
<tr>
<td>NotBestEffort</td>
<td>Matches all Pods whose QoS is not BestEffort</td>
</tr>
<tr>
<td>PriorityClass</td>
<td>Matches all Pods that reference the specified priority class</td>
</tr>
<tr>
<td>CrossNamespacePodAffinity</td>
<td>Matches Pods that set cross-namespace podAffinity or podAntiAffinity terms</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">scopeSelector</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">matchExpressions</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">scopeName</span><span class="p">:</span><span class="w"> </span><span class="l">PriorityClass</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="l">In</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">values</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="l">middle</span><span class="w">
</span></span></span></code></pre></div><h4 id="基于-priorityclass-来设置资源配额">Setting resource quota based on PriorityClass</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">scheduling.k8s.io/v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">PriorityClass</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">high</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">value</span><span class="p">:</span><span class="w"> </span><span class="m">1000</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="l">A demo high priority.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">---</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Pod</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">high-priority</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">containers</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">high-priority</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">ubuntu</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;/bin/sh&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">args</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;-c&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;while true; do echo hello; sleep 10;done&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">requests</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;10Gi&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;500m&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">limits</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;10Gi&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;500m&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">priorityClassName</span><span class="p">:</span><span class="w"> </span><span class="l">high</span><span class="w">
</span></span></span></code></pre></div><p><img src="/imgs/priority-quota.png" alt="priority-quota"></p>
<h4 id="跨名字空间的-podaffinity-配额">Cross-namespace PodAffinity quota</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">ResourceQuota</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">disable-cross-namespace-affinity</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">foo-ns</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">hard</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">pods</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;0&#34;</span><span class="w"> </span><span class="c"># disables Pod creation in this namespace</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">scopeSelector</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">matchExpressions</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">scopeName</span><span class="p">:</span><span class="w"> </span><span class="l">CrossNamespacePodAffinity</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="l">Exists</span><span class="w"> </span><span class="c"># apply this quota when cross-namespace Pod affinity is used</span><span class="w">
</span></span></span></code></pre></div><p>To forbid the use of <code>CrossNamespacePodAffinity</code> or <code>CrossNamespacePodAntiAffinity</code> cluster-wide, configure <code>kube-apiserver --admission-control-config-file=&lt;the yaml below&gt;</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">apiserver.config.k8s.io/v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">AdmissionConfiguration</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">plugins</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;ResourceQuota&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">configuration</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">apiserver.config.k8s.io/v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">ResourceQuotaConfiguration</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">limitedResources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">resource</span><span class="p">:</span><span class="w"> </span><span class="l">pods</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">matchScopes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span>- <span class="nt">scopeName</span><span class="p">:</span><span class="w"> </span><span class="l">CrossNamespacePodAffinity</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="l">Exists</span><span class="w"> </span><span class="c"># enforce the quota limit when cross-namespace Pod affinity is used</span><span class="w">
</span></span></span></code></pre></div><h2 id="进程-id-约束与预留">Process ID Limits and Reservations</h2>
<p>Kubernetes allows you to limit the number of process IDs (PIDs) a Pod may use. You can also reserve a number of allocatable PIDs per node for the operating system and daemons (rather than Pods).</p>
<p>PIDs are a fundamental resource on a node. It is easy to hit the task limit before hitting any other resource limit, which can then destabilize the host machine.</p>
<p>Cluster administrators need mechanisms to ensure that Pods running in the cluster cannot exhaust PIDs, which would prevent host daemons (such as the kubelet or kube-proxy, and possibly the container runtime itself) from running. Capping the number of PIDs per Pod also ensures that one Pod cannot impact other workloads on the same node.</p>
<blockquote>
<p>NOTES:
The default PID limit may be 32768; it can be raised by modifying <code>/proc/sys/kernel/pid_max</code>.</p>
</blockquote>
<h3 id="节点级别-pid-限制">Node-level PID limits</h3>
<p>The kubelet's <code>--system-reserved</code> and <code>--kube-reserved</code> command-line options accept a <code>pid=&lt;number&gt;</code> parameter, reserving the given number of PIDs for the operating system as a whole and for Kubernetes system daemons, respectively.</p>
<h3 id="pod-级别的-pid-限制">Pod-level PID limits</h3>
<p>You can also limit the number of PIDs in a Pod. This limit is configured at the node level rather than as a resource limit on particular Pods, so each node may use a different setting. Use <code>kubelet --pod-max-pids</code> or set <code>PodPidsLimit</code> in the kubelet configuration file.</p>
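<p>A minimal <code>KubeletConfiguration</code> sketch setting the per-Pod PID limit (the value is illustrative):</p>

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 4096   # each Pod on this node may use at most 4096 PIDs
```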
<h3 id="基于-pid-的-eviction">PID-based eviction</h3>
<p>The kubelet can be configured to terminate Pods that misbehave or consume an abnormal amount of resources. Out-of-resource handling can be configured per eviction signal; use the <code>pid.available</code> eviction signal to set a threshold on the number of PIDs used by Pods, with either hard or soft eviction policies. Even with a hard eviction policy, if the PID count grows fast enough the node can still become unstable by hitting the node-level PID limit, because eviction signal values are computed periodically rather than enforcing the threshold continuously.</p>
<p>Pod-level and node-level PID limits are hard limits; once a limit is hit, a workload will encounter failures when it tries to obtain a new PID. This may or may not cause the Pod to be rescheduled, depending on how the workload reacts to such failures and on how the Pod's liveness and readiness probes are configured. With correctly set limits, however, you can guarantee that other Pod workloads and system processes will not run out of PIDs because one Pod misbehaves.</p>
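<p>A sketch of hard and soft PID-based eviction thresholds in the kubelet configuration (the percentages and grace period are illustrative):</p>

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  pid.available: "10%"   # hard-evict Pods when fewer than 10% of PIDs remain
evictionSoft:
  pid.available: "15%"   # soft threshold, enforced after the grace period
evictionSoftGracePeriod:
  pid.available: "2m"
```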
<h2 id="节点资源管理器">Node Resource Managers</h2>
<p>Kubernetes provides a set of resource managers to support latency-sensitive, high-throughput workloads. The goal of the resource managers is to coordinate and optimize node resources for Pods with special requirements for CPU, devices, and memory (hugepages).</p>
<p>The main manager, the Topology Manager, is a kubelet component that coordinates the overall resource management process through its policies.</p>
<p>The configuration of each manager is described in its own dedicated document.</p>
<h3 id="cpu-管理器">CPU Manager</h3>
<p>By default, the kubelet uses CFS quota to enforce Pod CPU limits. When a node runs many CPU-bound Pods, a workload may be migrated between CPU cores depending on whether the Pod is throttled and which cores are available at scheduling time. Many workloads are insensitive to such migration and work fine without intervention.</p>
<p>Some workloads, however, are measurably sensitive to CPU cache affinity and scheduling latency. For these, the kubelet offers optional CPU management policies that determine placement preferences on the node.</p>
<p>Configure the CPU management policy with <code>kubelet --cpu-manager-policy</code> or the <code>cpuManagerPolicy</code> field of <code>KubeletConfiguration</code>. Two policies are supported:</p>
<ol>
<li>none: the default policy</li>
<li>static: grants Pods on the node with certain resource characteristics enhanced CPU affinity and exclusivity.</li>
</ol>
<p>The CPU manager periodically writes resource updates through the CRI to keep the in-memory CPU assignments consistent with cgroupfs. The reconcile frequency is set with the kubelet option <code>--cpu-manager-reconcile-period</code>; if unspecified, it defaults to the same period as <code>--node-status-update-frequency</code>.</p>
<p>The behavior of the static policy can be fine-tuned with <code>--cpu-manager-policy-options=key1=value1,key2=value2</code>.</p>
<p>Besides the top-level <code>CPUManagerPolicyOptions</code> feature gate, policy options are split into two groups: alpha (hidden by default) and beta (visible by default), governed by the <code>CPUManagerPolicyAlphaOptions</code> and <code>CPUManagerPolicyBetaOptions</code> feature gates respectively.</p>
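<p>A <code>KubeletConfiguration</code> sketch enabling the static policy with one policy option (<code>full-pcpus-only</code> is a real beta option, but the combination shown here is illustrative):</p>

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static          # enable exclusive CPU assignment
cpuManagerReconcilePeriod: 10s    # sync in-memory state with cgroupfs
cpuManagerPolicyOptions:
  full-pcpus-only: "true"         # allocate whole physical cores only
```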
<h3 id="内存管理器">Memory Manager</h3>
<p>Uses NUMA (Non-Uniform Memory Access) awareness to provide guaranteed memory and hugepage allocation for Pods in the Guaranteed QoS class.</p>
<p>The memory manager uses the hint generation protocol to produce the most suitable NUMA affinity for a Pod, and feeds these affinity hints to the central manager (the Topology Manager). Based on the hints and the Topology Manager's policy, the Pod is either admitted to or rejected by the node.</p>
<p>In addition, the memory manager ensures that the memory a Pod requests is allocated from the smallest possible number of NUMA nodes.</p>
<p>The memory manager applies only to Linux hosts.</p>
<h3 id="设备管理器-tbd"><a href="https://kubernetes.io/zh-cn/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/">Device Manager</a> TBD</h3>
]]></content>
		</item>
		
		<item>
			<title>Authorization</title>
			<link>/posts/authorization/</link>
			<pubDate>Sun, 07 Apr 2024 20:03:24 +0800</pubDate>
			
			<guid>/posts/authorization/</guid>
			<description><![CDATA[%!s(<nil>)]]></description>
<content type="html"><![CDATA[<p>Kubernetes has built-in authorization modes such as RBAC, ABAC, and Node authorization; enable them with <code>kube-apiserver --authorization-mode=&lt;AUTH MODE&gt;,RBAC</code>, where RBAC is required.</p>
<h2 id="rbac">RBAC</h2>
<p><img src="/imgs/rbac.png" alt="rbac"></p>
<p>Role-based access control: the cluster administrator specifies the permissions a user holds, and the authorization module grants them when requests are made.</p>
<p>Privilege narrowing is supported but escalation is not: a RoleBinding (namespace-scoped) may bind to a ClusterRole, but a ClusterRoleBinding (cluster-scoped) cannot bind to a Role.</p>
<p>For ordinary user and group names, the <code>system:</code> prefix is reserved for the Kubernetes system; user or group names with this prefix are forbidden.</p>
<p>Service account and service account group names must include the <code>system:serviceaccount:</code> and <code>system:serviceaccounts:</code> prefixes, respectively.</p>
<p>The user or group of an existing RoleBinding or ClusterRoleBinding cannot be modified; delete the binding and recreate it instead.</p>
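<p>A minimal Role plus RoleBinding sketch tying the rules above together (the namespace and user name are hypothetical):</p>

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: demo-ns          # illustrative namespace
  name: pod-reader
rules:
  - apiGroups: [""]           # "" means the core API group
    resources: ["pods"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: demo-ns
subjects:
  - kind: User
    name: jane                # hypothetical user; must not use the system: prefix
    apiGroup: rbac.authorization.k8s.io
roleRef:                      # to change the subjects' role, delete and recreate
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```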
<h3 id="默认的-roles-和-rolebindings">Default Roles and RoleBindings</h3>
<p>Modifying the default cluster roles, their names, or their bindings may leave the cluster unable to function; proceed with great care.</p>
<ol>
<li>Names prefixed with <code>system:</code> indicate that the resource is managed directly by the cluster control plane.</li>
<li>All default ClusterRoles and ClusterRoleBindings carry the <code>kubernetes.io/bootstrapping=rbac-defaults</code> label</li>
<li>An auto-reconciliation mechanism ensures that at every cluster start, the kube-apiserver updates the default ClusterRoles with any missing permissions and the default ClusterRoleBindings with any missing subjects. This repairs accidental modifications and keeps roles and bindings up to date in new releases.</li>
</ol>
<h3 id="kube-apiserver-发现角色">kube-apiserver discovery roles</h3>
<p>Default role bindings grant everyone (including anonymous users) read access to the API endpoints that the cluster considers safe to expose publicly. To disable anonymous access, configure <code>kube-apiserver --anonymous-auth=false</code>.</p>
<!-- 有 3 个集群默认的配置：`system:base-user`与`system:authenticated`组，`system:discovery`与`system:authenticated`组，`system:public-info-viewer`与`system:authenticated`和`system:unauthenticated`组 -->
<table>
<thead>
<tr>
<th>Default ClusterRole</th>
<th>Default ClusterRoleBinding</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>system:basic-user</td>
<td>system:authenticated group</td>
<td>Allows users read-only access to basic information about themselves. Before v1.14, also bound to system:unauthenticated by default</td>
</tr>
<tr>
<td>system:discovery</td>
<td>system:authenticated group</td>
<td>Allows read-only access to the API discovery endpoints used to discover and negotiate API levels. Before v1.14, also bound to system:unauthenticated by default</td>
</tr>
<tr>
<td>system:public-info-viewer</td>
<td>system:authenticated and system:unauthenticated groups</td>
<td>Allows read-only access to non-sensitive cluster information; introduced in v1.14</td>
</tr>
</tbody>
</table>
<h3 id="默认的面向用户的角色">Default user-facing roles</h3>
<p><code>cluster-admin</code> is the cluster superuser role, intended to be granted cluster-wide via a ClusterRoleBinding, while <code>admin</code>, <code>edit</code>, and <code>view</code> are intended to be granted within a particular namespace via a RoleBinding</p>
<table>
<thead>
<tr>
<th>ClusterRole</th>
<th>ClusterRoleBinding</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>cluster-admin</td>
<td>system:masters</td>
<td>The cluster superuser role; allows any action on any resource in the cluster</td>
</tr>
<tr>
<td>admin</td>
<td>None</td>
<td>Grants admin access, intended to be granted within a namespace via a RoleBinding. Does not allow write access to resource quotas or to the namespace itself. Does not allow write access to EndpointSlices (or Endpoints) created in v1.22+; see <a href="https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/rbac/#write-access-for-endpoints">EndpointSlices and Endpoints write access</a></td>
</tr>
<tr>
<td>edit</td>
<td>None</td>
<td>Allows read/write access to most objects in a namespace. Can read Secrets and can run Pods as any ServiceAccount in the namespace. Does not allow write access to EndpointSlices (or Endpoints) created in v1.22+</td>
</tr>
<tr>
<td>view</td>
<td>None</td>
<td>Allows read-only access to most objects in a namespace. Does not allow viewing Roles or RoleBindings. Does not allow viewing Secrets, since reading Secret contents grants access to ServiceAccount tokens in the namespace, which would allow API access as any ServiceAccount there (a form of privilege escalation)</td>
</tr>
</tbody>
</table>
<h3 id="core-component-roles">Core component roles</h3>
<table>
<thead>
<tr>
<th>ClusterRole</th>
<th>ClusterRoleBinding</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>system:kube-scheduler</td>
<td>system:kube-scheduler</td>
<td>Allows access to the resources required by the scheduler component</td>
</tr>
<tr>
<td>system:volume-scheduler</td>
<td>system:kube-scheduler</td>
<td>Allows access to the volume resources required by the kube-scheduler component</td>
</tr>
<tr>
<td>system:kube-controller-manager</td>
<td>system:kube-controller-manager</td>
<td>Allows access to the resources required by the controller manager. <a href="https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/rbac/#controller-roles">Controller roles</a></td>
</tr>
<tr>
<td>system:node</td>
<td>None</td>
<td>Allows access to the resources required by the kubelet, including read access to all Secrets and write access to all Pod status objects. You should use the <a href="https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/node/">Node authorizer</a> and the <a href="https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/admission-controllers/#noderestriction">NodeRestriction admission plugin</a> instead of the system:node role, and allow kubelet API access based on the Pods scheduled to run on it. This role exists only for compatibility with clusters older than v1.8.</td>
</tr>
</tbody>
</table>
<h3 id="服务账户权限">Service account permissions</h3>
<p>The default RBAC policies grant permissions to control-plane components, nodes, and controllers, but grant no permissions to service accounts outside kube-system (beyond the discovery permissions granted to all authenticated users).</p>
<p>From most secure to least secure, the recommended order for configuring service account permissions is:</p>
<ol>
<li>Grant a role to an application-specific service account</li>
<li>Grant a role to the default service account in a namespace; Pods that do not specify a service account use the default service account</li>
<li>Grant a role to all service accounts in a namespace</li>
<li>Grant a limited role to all service accounts cluster-wide (discouraged)</li>
<li>Grant superuser access to all service accounts cluster-wide (strongly discouraged)</li>
</ol>
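<p>Option 1 above, granting a role to an application-specific service account, can be sketched as follows (the names are illustrative):</p>

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-viewer
  namespace: demo-ns        # illustrative namespace
subjects:
  - kind: ServiceAccount
    name: my-app            # application-specific service account
    namespace: demo-ns
roleRef:
  kind: ClusterRole
  name: view                # built-in user-facing role, granted namespace-wide here
  apiGroup: rbac.authorization.k8s.io
```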
<h2 id="abac">ABAC</h2>
<p>Attribute-based access control</p>
<h2 id="使用-node-授权">Using Node authorization</h2>
<p>This mode authorizes API requests made specifically by kubelets</p>
]]></content>
		</item>
		
		<item>
			<title>Admission Controllers</title>
			<link>/posts/admission-controllers/</link>
			<pubDate>Sun, 07 Apr 2024 12:22:39 +0800</pubDate>
			
			<guid>/posts/admission-controllers/</guid>
			<description><![CDATA[%!s(<nil>)]]></description>
<content type="html"><![CDATA[<p>An admission controller is a piece of code that intercepts requests to the kube-apiserver after the request has been authenticated and authorized, but before the object is persisted.</p>
<p>Admission controllers can perform validating and/or mutating operations. They limit requests that create, delete, or modify objects, and can also block custom verbs, such as a request to connect to a Pod via the API server proxy. They cannot block get, watch, or list requests.</p>
<p>Two special admission controllers, MutatingAdmissionWebhook and ValidatingAdmissionWebhook, execute the mutating and validating admission-control webhooks, respectively.</p>
<h2 id="准入控制阶段">Admission control phases</h2>
<ol>
<li>Run mutating admission controllers</li>
<li>Run validating admission controllers</li>
</ol>
<p>Some admission controllers are both mutating and validating.</p>
<p>If a request is rejected in either phase, the whole request is rejected immediately and an error is returned.</p>
<p>Admission controllers can have side effects, mutating related resources as part of request processing, such as incrementing quota usage. No admission controller can know the requirements of the others; a request is handled by the kube-apiserver only if it satisfies every admission controller.</p>
<h2 id="启用准入控制器">Enabling admission controllers</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">kube-apiserver --enable-admission-plugins<span class="o">=</span>NamespaceLifecycle,LimitRanger ...
</span></span></code></pre></div><h2 id="关闭准入控制器">Disabling admission controllers</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">kube-apiserver --disable-admission-plugins<span class="o">=</span>PodNodeSelector,AlwaysDeny ...
</span></span></code></pre></div><h2 id="默认启用的准入控制器">Admission controllers enabled by default</h2>
<ul>
<li>CertificateApproval</li>
<li>CertificateSigning</li>
<li>CertificateSubjectRestriction</li>
<li>DefaultIngressClass</li>
<li>DefaultTolerationSeconds</li>
<li>LimitRanger</li>
<li>MutatingAdmissionWebhook</li>
<li>NamespaceLifecycle</li>
<li>PersistentVolumeClaimResize</li>
<li>PodSecurity</li>
<li>Priority</li>
<li>ResourceQuota</li>
<li>ServiceAccount</li>
<li>StorageObjectInUseProtection</li>
<li>TaintNodesByCondition</li>
<li>ValidatingAdmissionPolicy</li>
<li>ValidatingAdmissionWebhook</li>
</ul>
<h2 id="每个准入控制器的作用">What each admission controller does</h2>
<ul>
<li>LimitRanger: observes incoming requests and ensures they do not violate any of the constraints set in the namespace's LimitRange objects; can also be used to apply default resource requests and limits to Pods that set none</li>
<li>NamespaceLifecycle: rejects creation of new objects in a namespace that is being deleted, ensures requests against nonexistent namespaces are rejected, and prevents deletion of the reserved namespaces default, kube-system, and kube-public</li>
</ul>
<p><a href="https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/admission-controllers/#what-does-each-admission-controller-do">List of admission controllers</a> <a href="https://kubernetes.io/zh-cn/docs/reference/command-line-tools-reference/kube-apiserver/#options">Recommended admission controllers</a></p>
<h2 id="动态准入控制器">Dynamic admission controllers</h2>
<p>Besides the admission controllers compiled into the kube-apiserver, admission controllers can also be added dynamically at runtime.</p>
<h3 id="准入-webhook">Admission webhooks</h3>
<p>An admission webhook is an HTTP callback that receives admission requests and acts on them. Two kinds can be defined: ValidatingAdmissionWebhook and MutatingAdmissionWebhook. Mutating webhooks are invoked first, followed by validating webhooks.</p>
<p>You can implement your own admission webhooks and plug them into the kube-apiserver's admission control dynamically. <a href="https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/extensible-admission-controllers/#experimenting-with-admission-webhooks">Custom admission webhooks</a></p>
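<p>A minimal <code>ValidatingWebhookConfiguration</code> sketch registering a hypothetical webhook service; all names, the namespace, the path, and the CA bundle are placeholders:</p>

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: demo-validator
webhooks:
  - name: pods.demo.example.com   # hypothetical webhook name
    matchPolicy: Equivalent       # intercept all versions of the resource
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        namespace: demo-ns        # where the webhook server runs
        name: demo-webhook-svc
        path: /validate
      caBundle: BASE64_CA_HERE    # placeholder
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail           # reject requests if the webhook is unreachable
```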
<h3 id="监控准入-webhook">Monitoring admission webhooks</h3>
<ol>
<li>Which mutating webhook mutated the object in an API request</li>
<li>What changes a mutating webhook made to the object</li>
<li>Which webhooks frequently reject API requests, and for what reasons</li>
</ol>
<h4 id="指标信息示例">Example metrics</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-console" data-lang="console"><span class="line"><span class="cl"><span class="gp">#</span> HELP apiserver_admission_webhook_rejection_count <span class="o">[</span>ALPHA<span class="o">]</span> Admission webhook rejection count, identified by name and broken out <span class="k">for</span> each admission <span class="nb">type</span> <span class="o">(</span>validating or admit<span class="o">)</span> and operation. Additional labels specify an error <span class="nb">type</span> <span class="o">(</span>calling_webhook_error or apiserver_internal_error <span class="k">if</span> an error occurred<span class="p">;</span> no_error otherwise<span class="o">)</span> and optionally a non-zero rejection code <span class="k">if</span> the webhook rejects the request with an HTTP status code <span class="o">(</span>honored by the apiserver when the code is greater or equal to 400<span class="o">)</span>. Codes greater than <span class="m">600</span> are truncated to 600, to keep the metrics cardinality bounded.
</span></span><span class="line"><span class="cl"><span class="gp">#</span> TYPE apiserver_admission_webhook_rejection_count counter
</span></span><span class="line"><span class="cl"><span class="go">apiserver_admission_webhook_rejection_count{error_type=&#34;calling_webhook_error&#34;,name=&#34;always-timeout-webhook.example.com&#34;,operation=&#34;CREATE&#34;,rejection_code=&#34;0&#34;,type=&#34;validating&#34;} 1
</span></span></span><span class="line"><span class="cl"><span class="go">apiserver_admission_webhook_rejection_count{error_type=&#34;calling_webhook_error&#34;,name=&#34;invalid-admission-response-webhook.example.com&#34;,operation=&#34;CREATE&#34;,rejection_code=&#34;0&#34;,type=&#34;validating&#34;} 1
</span></span></span><span class="line"><span class="cl"><span class="go">apiserver_admission_webhook_rejection_count{error_type=&#34;no_error&#34;,name=&#34;deny-unwanted-configmap-data.example.com&#34;,operation=&#34;CREATE&#34;,rejection_code=&#34;400&#34;,type=&#34;validating&#34;} 13
</span></span></span></code></pre></div><h3 id="自定义准入-webhook-最佳实践与警告">Best practices and caveats for custom admission webhooks</h3>
<ol>
<li>Idempotence: a mutating admission webhook must produce the same result when executed multiple times as when executed once.</li>
<li>Intercept all versions of an object: <code>webhooks[].matchPolicy = Equivalent</code> <a href="https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/extensible-admission-controllers/#matching-requests-matchpolicy">Matching requests: matchPolicy</a></li>
<li>Availability</li>
<li>Guarantee seeing the final state of the object</li>
<li>Avoid deadlocks in self-hosted webhooks</li>
<li>Avoid side effects wherever possible</li>
<li>Avoid operating on the kube-system namespace</li>
</ol>
]]></content>
		</item>
		
		<item>
			<title>证书与证书签名请求</title>
			<link>/posts/%E8%AF%81%E4%B9%A6%E4%B8%8E%E8%AF%81%E4%B9%A6%E7%AD%BE%E5%90%8D%E8%AF%B7%E6%B1%82/</link>
			<pubDate>Sun, 07 Apr 2024 10:10:46 +0800</pubDate>
			
			<guid>/posts/%E8%AF%81%E4%B9%A6%E4%B8%8E%E8%AF%81%E4%B9%A6%E7%AD%BE%E5%90%8D%E8%AF%B7%E6%B1%82/</guid>
			<description><![CDATA[%!s(<nil>)]]></description>
<content type="html"><![CDATA[<p>The Kubernetes certificate and trust bundle APIs enable automated provisioning of <a href="https://www.itu.int/rec/T-REC-X.509">X.509</a> credentials by providing a programmatic interface for clients of the kube-apiserver to request certificates and obtain them from a certificate authority.</p>
<h3 id="证书签名">Certificate signing requests</h3>
<p>Signing requests are not retained in etcd indefinitely; they are garbage-collected according to their state:</p>
<ol>
<li>Approved and issued requests: automatically deleted by the garbage collector after 1 hour</li>
<li>Denied requests: automatically deleted after 1 hour</li>
<li>Failed requests: automatically deleted after 1 hour</li>
<li>All requests: automatically deleted once the issued certificate has expired</li>
</ol>
<h2 id="signer">Signer</h2>
<p>Any signer made available outside a particular cluster should document how it works, covering:</p>
<ol>
<li>Trust distribution: the trust anchors (CA certificate or certificate bundle)</li>
<li>Permitted subjects</li>
<li>Permitted x509 extensions: including IP subjectAltNames, DNS subjectAltNames, Email subjectAltNames, URL subjectAltNames, etc.</li>
<li>Permitted and extended key usages</li>
<li>Expiration/certificate lifetime: determined by the signer, configured by the administrator, or specified by the CSR's <code>spec.expirationSeconds</code> field</li>
<li>CA bit allowed/disallowed: how the signer reacts when a CSR requests a CA certificate the signer does not allow</li>
</ol>
<h3 id="内置的-signer">Built-in Signers</h3>
<ol>
<li><code>kubernetes.io/kube-apiserver-client</code>: certificates it signs are treated by the kube-apiserver as client certificates; never auto-approved by kube-controller-manager</li>
<li><code>kubernetes.io/kube-apiserver-client-kubelet</code>: certificates it signs are treated by the kube-apiserver as kubelet client certificates; may be auto-approved by kube-controller-manager</li>
<li><code>kubernetes.io/kubelet-serving</code>: certificates it signs are honored as valid kubelet serving certificates; never auto-approved by kube-controller-manager</li>
<li><code>kubernetes.io/legacy-unknown</code>: carries no guarantee of trust; third-party distributions of Kubernetes may use the client certificates it signs. The stable CSR API (<code>certificates.k8s.io/v1</code> and later) does not allow <code>signerName</code> to be set to <code>kubernetes.io/legacy-unknown</code>; never auto-approved by kube-controller-manager</li>
</ol>
<p>The kube-controller-manager implements control-plane signing for each of the built-in signers; all such failures are reported only in the kube-controller-manager logs.</p>
<p>Beyond certificates issued by the trusted signers above, a Kubernetes cluster guarantees no other trust relationships. Although some distributions use <code>kubernetes.io/legacy-unknown</code> to issue client certificates for the kube-apiserver, this is not a standard practice (i.e. it is distribution-specific configuration unrelated to upstream Kubernetes). Such usage has no relationship to the <code>.data[ca.crt]</code> of ServiceAccount token Secrets: that CA bundle is only guaranteed to verify connections to the kube-apiserver, and only via the default service <code>kubernetes.default.svc</code> (i.e. the certificates only verify connections to the default kube-apiserver service, not other services or scenarios).</p>
<h3 id="自定义签名者">Custom signers</h3>
<p>Custom signers can also be used to integrate external third-party components, e.g. <code>issuer.open-fictional.example/service-mesh</code>.</p>
<h2 id="签名">Signing</h2>
<h3 id="control-plane-签名者">Control plane signers</h3>
<p>The control plane implements all the built-in signers as part of kube-controller-manager.</p>
<blockquote>
<p>NOTES: Before v1.18, kube-controller-manager signed every CSR marked as approved.
<code>spec.expirationSeconds</code> was added in v1.22; versions before v1.22 ignore this field.</p>
</blockquote>
<h3 id="api-based-签名者">API-based signers</h3>
<p>After signing, the signer places the issued certificate chain, as one or more base64-encoded PEM certificates, in the CSR's <code>status.certificate</code> field. All PEM blocks must carry the &quot;CERTIFICATE&quot; label, and the encoded data must comply with <a href="https://tools.ietf.org/html/rfc5280#section-4.1">RFC 5280 section 4.1</a></p>
<h2 id="clustertrustbundles">ClusterTrustBundles</h2>
<p>Two modes are available: signer-linked and signer-unlinked</p>
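<p>A signer-linked ClusterTrustBundle sketch, assuming the alpha <code>certificates.k8s.io/v1alpha1</code> API and its feature gate are enabled; the signer and bundle names are illustrative:</p>

```yaml
apiVersion: certificates.k8s.io/v1alpha1
kind: ClusterTrustBundle
metadata:
  # signer-linked bundles are conventionally named <signer-path>:<name>
  name: example.com:mysigner:foo
spec:
  signerName: example.com/mysigner   # hypothetical signer
  trustBundle: |
    -----BEGIN CERTIFICATE-----
    ... PEM-encoded CA certificates ...
    -----END CERTIFICATE-----
```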
<h2 id="如何签发证书">How to issue a certificate</h2>
<h3 id="创建私钥">Create a private key</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># generate a private key for the certificate</span>
</span></span><span class="line"><span class="cl">openssl genrsa -out myuser.key <span class="m">2048</span>
</span></span><span class="line"><span class="cl"><span class="c1"># generate a certificate signing request file from the private key</span>
</span></span><span class="line"><span class="cl">openssl req -new -key myuser.key -out myuser.csr -subj <span class="s2">&#34;/CN=myuser&#34;</span>
</span></span></code></pre></div><h3 id="创建-csr">Create the CSR</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">cat <span class="s">&lt;&lt;EOF | kubectl apply -f -
</span></span></span><span class="line"><span class="cl"><span class="s">apiVersion: certificates.k8s.io/v1
</span></span></span><span class="line"><span class="cl"><span class="s">kind: CertificateSigningRequest
</span></span></span><span class="line"><span class="cl"><span class="s">metadata:
</span></span></span><span class="line"><span class="cl"><span class="s">  name: myuser
</span></span></span><span class="line"><span class="cl"><span class="s">spec:
</span></span></span><span class="line"><span class="cl"><span class="s">  request: &lt;cat myuser.csr | base64 | tr -d &#34;\n&#34;&gt;
</span></span></span><span class="line"><span class="cl"><span class="s">  signerName: kubernetes.io/kube-apiserver-client
</span></span></span><span class="line"><span class="cl"><span class="s">  expirationSeconds: 86400  # 一天，单位是秒
</span></span></span><span class="line"><span class="cl"><span class="s">  usages:
</span></span></span><span class="line"><span class="cl"><span class="s">  - client auth # 必须是如此
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span></code></pre></div><h3 id="批准-csr">批准 CSR</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># kubectl get csr</span>
</span></span><span class="line"><span class="cl">kubectl certificate approve myuser
</span></span></code></pre></div><h3 id="获取证书">获取证书</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># kubectl get csr/myuser -o yaml</span>
</span></span><span class="line"><span class="cl">kubectl get csr myuser -o <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">&#39;{.status.certificate}&#39;</span> <span class="p">|</span> base64 -d &gt; myuser.crt
</span></span></code></pre></div><h3 id="创建角色和绑定角色">创建角色和绑定角色</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">kubectl create role developer --verb<span class="o">=</span>create --verb<span class="o">=</span>get --resource<span class="o">=</span>pods
</span></span><span class="line"><span class="cl">kubectl create rolebinding developer-binding-myuser --role<span class="o">=</span>developer --user<span class="o">=</span>myuser
</span></span></code></pre></div><h3 id="添加到-kubeconfig">添加到 kubeconfig</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">kubectl config set-credentials myuser --client-key<span class="o">=</span>myuser.key --client-certificate<span class="o">=</span>myuser.crt --embed-certs<span class="o">=</span><span class="nb">true</span>
</span></span><span class="line"><span class="cl">kubectl config set-context myuser --cluster<span class="o">=</span>kubernetes --user<span class="o">=</span>myuser
</span></span></code></pre></div><p>至此用户可以使用该 kubeconfig 完成对集群的访问了。</p>
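<p>拿到证书后，可以用 openssl 核对证书的主体与有效期是否符合预期。下面的示例完全在本地运行：先生成一个自签名证书来模拟签发结果（文件名 demo.crt/demo.key 为演示用的假设，实际场景应检查上文得到的 myuser.crt）：</p>

```shell
# 本地演示：生成一个自签名证书来模拟签发结果（仅用于演示检查命令，
# 实际场景中应检查上文获取的 myuser.crt）
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout demo.key -out demo.crt -subj "/CN=myuser"

# 核对证书的主体（CN 应为 myuser）与有效期
openssl x509 -in demo.crt -noout -subject -dates
```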
]]></content>
		</item>
		
		<item>
			<title>Kube-apiserver Bypass Risks</title>
			<link>/posts/kube-apiserver-bypass-risks/</link>
			<pubDate>Sat, 06 Apr 2024 17:26:20 +0800</pubDate>
			
			<guid>/posts/kube-apiserver-bypass-risks/</guid>
			<description><![CDATA[%!s(<nil>)]]></description>
			<content type="html"><![CDATA[<p>kube-apiserver 是外部与集群交互的主要入口，提供了几种关键的内置安全控制，如审计日志和准入控制器。但还有一些方式可以绕过这些安全控制从而修改集群的配置或内容。</p>
<p>以下这些方式需要适当的被限制。</p>
<h2 id="静态-pod">静态 Pod</h2>
<p>每个节点上的 kubelet 会加载并直接管理集群中存储在指定目录中或从指定 URL 获取的静态 Pod 清单。kube-apiserver 不管理这些静态 Pod。对该位置具有写入权限的攻击者可以修改从该位置加载的静态 Pod 的配置，或引入新的静态 Pod。</p>
<p>静态 Pod 被限制访问 kube-apiserver 中的其他对象。如不能将静态 Pod 配置为从集群挂载 Secret。但是这些 Pod 可以执行其他安全敏感的操作，如使用<code>hostPath</code>直接挂载宿主机节点的文件目录。</p>
<p>默认情况下，kubelet 会创建一个 Mirror Pod，以便静态 Pod 在 kube-apiserver 中可见。但是如果攻击者在创建 Pod 时使用了无效的名字空间名称，则该 Pod 将在 kube-apiserver 中不可见，只能通过对受影响主机有访问权限的工具发现。</p>
<p>如果静态 Pod 无法通过准入控制，kubelet 不会将 Pod 注册到 kube-apiserver 中，但该 Pod 仍然在节点上运行。</p>
<h3 id="mitigations">Mitigations</h3>
<ol>
<li>仅在节点需要时启用 kubelet 静态 Pod manifest 功能</li>
<li>如果一个节点使用静态 Pod 功能，限制它的文件系统访问权限，即仅配置需要访问的目录或 URL</li>
<li>限制对 kubelet 配置参数和文件的访问，以防止攻击者通过设置静态 Pod 访问的文件系统路径或 URL</li>
<li>定期审计并集中报告所有对托管静态 Pod manifest 文件和 kubelet 配置文件的目录或 Web 存储位置的访问</li>
</ol>
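<p>上面第 1、2 条可以落实为 kubelet 配置中的显式设置。下面是一个示意性的 KubeletConfiguration 片段（非完整配置，路径为常见约定）：仅在需要静态 Pod 的节点上配置 staticPodPath 并限定到单一受控目录，不需要时将其留空即可关闭该功能：</p>

```yaml
# KubeletConfiguration 片段示意（非完整配置）
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# 仅在确实需要静态 Pod 的节点上设置；限定为单一受控目录
staticPodPath: "/etc/kubernetes/manifests"
# 不需要基于 URL 的静态 Pod 清单来源时，不要配置相关的 URL 选项
```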
<h2 id="kubelet-api">kubelet API</h2>
<p>kubelet 提供了一个 HTTP API，一般监听在 TCP:10250；在某些 kubernetes 发行版中，该 API 也可能暴露在 control plane 节点上。对该 API 的直接访问可以暴露运行在节点上的 Pod 的信息、这些 Pod 的日志，并允许在节点上运行的每个容器中执行命令。</p>
<p>具有对 Node 对象及其子资源 RBAC 访问权限的集群用户，也可以访问 kubelet API，实际的访问权限取决于授予了哪些子资源的访问权限。[kubelet 鉴权]</p>
<p>对 kubelet API 的直接访问不受准入控制影响，也不会被审计日志记录。能直接访问此 API 的攻击者可能会绕过能检测或防止某些操作的控制机制。</p>
<p>kubelet API 可以配置多种身份认证方式。默认情况下，kubelet 的配置允许匿名访问。大多数 kubernetes 提供商会将默认值改为使用 webhook 和证书身份认证，这使得 control plane 能够确保调用者在访问节点的 API 资源或子资源时是经过授权的；而默认的匿名访问则没有这种保证。</p>
<h3 id="mitigations-1">Mitigations</h3>
<ol>
<li>使用 RBAC 等机制限制对节点 API 对象的资源及子资源的访问。只有在需要时才授予此访问权限，如监控服务。</li>
<li>限制对 kubelet 端口的访问，只允许指定和受信任的 IP 地址段访问该端口</li>
<li>确保将[kubelet 身份认证]设置为 webhook 或证书模式</li>
<li>确保集群上未启用不作身份认证的&quot;只读&quot;kubelet 端口</li>
</ol>
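<p>第 3、4 条对应的 kubelet 配置大致如下（KubeletConfiguration 片段示意，文件路径为假设值）：</p>

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  anonymous:
    enabled: false                 # 禁止匿名访问
  webhook:
    enabled: true                  # 令牌交由 kube-apiserver 校验
  x509:
    clientCAFile: "/etc/kubernetes/pki/ca.crt"  # 证书身份认证所用 CA（路径为假设）
authorization:
  mode: Webhook                    # 鉴权委托给 kube-apiserver
readOnlyPort: 0                    # 关闭不作身份认证的只读端口
```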
<h2 id="etcd-api">etcd API</h2>
<p>etcd 监听 TCP:2379，通常只有 kube-apiserver 和所使用的备份工具需要访问它。对该 API 的直接访问允许公开或修改集群中保存的数据。</p>
<p>对 etcd API 的访问通常通过客户端证书身份认证来管理。由 etcd 信任的证书颁发机构颁发的任何证书都可以完全访问 etcd 中存储的数据。</p>
<p>对 etcd 的直接访问不受 kubernetes 准入控制的影响，也不会被 kubernetes 审计日志记录。具有 kube-apiserver 的 etcd 客户端证书私钥读取权限（或可以创建一个新的受信任客户端证书）的攻击者，可以通过访问集群 Secret 或修改访问规则来获得集群管理员权限。即使不提升 kubernetes RBAC 权限，可以修改 etcd 的攻击者也能检索集群内的所有 API 对象或创建新的工作负载。</p>
<p>许多 kubernetes 提供商将 etcd 配置为使用双向 TLS（客户端和服务器端需要互相验证对方的证书）。但 etcd API 目前尚未实现鉴权：任何由 etcd 所信任的 CA 签发的客户端证书都可以用于获得对 etcd 的完全访问权限。例如，仅用于健康检查的 etcd 客户端证书同样会授予完全的读写访问权限。</p>
<h3 id="mitigations-2">Mitigations</h3>
<ol>
<li>确保 etcd 所信任的证书颁发机构颁发的证书仅用于该服务的身份认证</li>
<li>控制对 etcd 服务器证书的私钥以及 kube-apiserver 的客户端证书和密钥的访问</li>
<li>限制对 etcd 的访问，仅限于受信任的 IP 地址段</li>
</ol>
<h2 id="容器运行时-socket">容器运行时 socket</h2>
<p>kubernetes 集群的每个节点上，与容器的交互都由容器运行时通过一个 unix socket 控制。具有此 socket 访问权限的攻击者可以启动新容器，或与正在运行的容器进行交互。</p>
<p>在集群层面，这种访问造成的影响取决于在受威胁节点上运行的容器是否可以访问 Secret 或其他机密信息，攻击者可以利用这些机密数据将权限提升到其他工作节点或 control plane 组件（如 kube-proxy，kube-scheduler 等）。</p>
<h3 id="mitigations-3">Mitigations</h3>
<ol>
<li>确保严格控制对容器运行时 socket 所在文件系统的访问，尽可能仅限制为 root 用户</li>
<li>使用 linux 内核的命名空间等机制将 kubelet 与节点上运行的其他组件隔离</li>
<li>确保限制或禁止使用包含容器运行时 socket 的<code>hostPath</code>挂载，同时<code>hostPath</code>挂载必须设置为只读，以降低攻击者绕过文件系统目录限制的风险</li>
<li>限制用户对节点的访问，特别是限制超级用户对节点的访问</li>
</ol>
]]></content>
		</item>
		
		<item>
			<title>Authentication Mechanisms</title>
			<link>/posts/authentication-mechanisms/</link>
			<pubDate>Sat, 06 Apr 2024 11:17:29 +0800</pubDate>
			
			<guid>/posts/authentication-mechanisms/</guid>
<description></description>
			<content type="html"><![CDATA[<p>关于以下几种身份验证机制的生产环境中使用的利弊分析与官方建议</p>
<h2 id="x509-客户端证书身份认证-不推荐">X.509 客户端证书身份认证 (不推荐)</h2>
<ol>
<li>客户端证书不能被单独撤销，直到证书过期</li>
<li>如果证书需要被设置为无效，需要重新生成证书颁发机构的密钥，会导致集群不可用</li>
<li>集群中没有客户端证书的永久记录，因此无法跟踪监控使用证书的用户</li>
<li>用于客户端证书认证的私钥无法受到密码保护，任何能够读取该密钥文件的人都可以使用它</li>
<li>使用客户端证书身份认证需要直接连接 kube-apiserver；没有中间跳板节点的话，可能会使网络结构复杂化：网络拓扑可能需要重新规划，安全性需要重新评估，维护和管理负担加重，扩展性也会变差</li>
<li>客户端证书的 O（Organization）字段中嵌入了组数据，用户的组成员资格在证书的生命周期内无法更改。也就是说，一旦用户获得了 X.509 客户端证书，其组信息在证书有效期内就是固定的，无法通过更改证书来调整。这缺乏灵活性，会提升权限管理的复杂度，带来安全隐患，也难以追踪和审计</li>
</ol>
<h2 id="静态-token-文件-不推荐">静态 Token 文件 (不推荐)</h2>
<ol>
<li>凭据信息被以明文形式保存在磁盘上，增加了安全性风险</li>
<li>改变任何凭据都需要重启 kube-apiserver，引发可用性风险</li>
<li>没有可用的机制能让用户轮换他们的凭据。如果要轮换凭据，集群管理员需要先修改磁盘上的 Token，然后再下发给用户</li>
<li>没有锁定机制去阻止暴力攻击</li>
</ol>
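<p>作为背景，静态 Token 文件是一个 CSV 文件，每行至少包含令牌、用户名、用户 UID 三列，第四列是可选的用户组（多个组需要用双引号括起来）。正因为它是磁盘上的明文文件，才有上述风险：</p>

```csv
token-1,user-1,uid-1
token-2,user-2,uid-2,"group1,group2"
```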
<h2 id="引导-tokens-不推荐">引导 tokens (不推荐)</h2>
<ol>
<li>它们固定了不适合一般使用的硬编码组成员资格，因此不适用于用户认证目的</li>
<li>手动生成引导令牌可能导致弱令牌，攻击者可以猜测到，会带来安全风险</li>
<li>没有锁定机制来防止暴力攻击，这使得攻击者容易猜测或破解令牌</li>
</ol>
<h2 id="serviceaccount-secret-tokens-不推荐">ServiceAccount secret tokens (不推荐)</h2>
<p>在 kubernetes &lt; 1.23 版本中，是默认选项，供集群中的工作负载进行身份认证。但是这种方式正在被<code>TokenRequest API</code>令牌替换，官方也不建议在生产环境中使用</p>
<ol>
<li>不能设置过期时间，而是会在相关的服务账户被删除之前一直保持有效</li>
<li>身份验证令牌对于在定义它们的命名空间中读取密钥的任何集群用户都是可见的</li>
<li>服务账户无法被添加到任意组中，导致在使用时复杂化了 RBAC 管理</li>
</ol>
<h2 id="tokenrequest-api-tokens-不推荐">TokenRequest API tokens (不推荐)</h2>
<p>TokenRequest API 可生成短期凭据，是让服务对 kube-apiserver 或第三方系统进行身份认证的有用工具。但官方不建议将其用于生产环境的用户身份认证，因为没有可用的吊销方法，并且以安全的方式将凭据分发给用户也有挑战。</p>
<p>当使用 TokenRequest 令牌进行身份认证时，建议实现短期生命周期，以减少受到损害的令牌带来的影响。</p>
<h2 id="oidc-令牌身份认证-推荐">OIDC 令牌身份认证 (推荐)</h2>
<p>官方支持使用 OpenID Connect 令牌身份认证，将外部身份验证服务与 kube-apiserver 集成。使用 OIDC 时，需要考虑以下加固措施：</p>
<ol>
<li>安装在集群中以支持 OIDC 身份认证的软件应该与一般工作负载隔离，因为它将以高权限运行</li>
<li>一些 kubernetes 托管服务对可以使用的 OIDC 提供者有限制</li>
<li>与 TokenRequest 令牌一样，OIDC 令牌的生命周期应该较短，以减少受到损害的令牌产生的影响</li>
</ol>
<h2 id="webhook-token-身份认证-中立">Webhook token 身份认证 (中立)</h2>
<p>这是将外部身份认证服务提供者集成到 kubernetes 中的另一种选项。该服务可以在集群内部或外部运行，用于进行身份认证。该机制的适用性取决于用于身份认证服务的软件，并且有一些 kubernetes 特定的考虑因素。</p>
<p>要配置 webhook 身份验证，需要访问控制平面服务器的文件系统。这意味着除非供应商专门提供此功能，否则在托管 kubernetes 的供应商那里将无法使用。此外，为了支持这种访问，集群中安装的任何软件都应与一般的工作负载隔离。</p>
<h2 id="认证代理服务-推荐">认证代理服务 (推荐)</h2>
<p>使用认证代理是将外部身份认证系统集成到 kubernetes 集群中的另一种选项。通过这种机制，kubernetes 期望从代理接收请求，并要求代理在请求中设置特定的头部值，用于传递供授权使用的用户名和组成员资格。使用这种机制时也有一些特定的考虑因素。</p>
<ol>
<li>必须在代理和 kube-apiserver 之间使用 TLS 连接，以减轻流量拦截或嗅探攻击的风险，这确保了代理和 kube-apiserver 之间的通信是安全的</li>
<li>需要注意能够修改请求头的攻击者可能会未经授权地访问 kubernetes 资源，因此重要的是确保请求头被正确的保护，并且不能被篡改。</li>
</ol>
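<p>配置认证代理需要在 kube-apiserver 上设置一组请求头相关的参数。下面是一个示意性的参数片段（文件路径与头部名称均为常见约定，实际取决于部署环境）：</p>

```shell
kube-apiserver \
  --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt \
  --requestheader-allowed-names=front-proxy-client \
  --requestheader-username-headers=X-Remote-User \
  --requestheader-group-headers=X-Remote-Group \
  --requestheader-extra-headers-prefix=X-Remote-Extra-
```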
<h2 id="其他的认证授权">其他的认证授权</h2>
<p>Webhook 并不是一种特定的身份认证机制，而是一种用于集成外部身份认证提供者到 kubernetes 集群中的机制。通过 webhook，kubernetes 可以与外部的身份认证服务通信，以获取身份认证决策。这个外部的身份验证服务可以是在集群内部或外部运行的任何类型的服务，例如 OAuth 2.0 提供者，LDAP 等。</p>
<p>OAuth2.0 是一种授权框架，不是一种身份认证协议，主要用于用户授权。虽然 OAuth2.0 可以用于身份认证，但它通常与 OIDC 结合使用，以提供完整的身份认证和授权解决方案。OAuth2.0 与 Webhook 结合使用，以便将 OAuth2.0 或 OIDC 身份认证服务集成到 kubernetes 集群中。</p>
<p>LDAP（轻量级目录访问协议）是一种用于访问和维护分布式目录信息服务的开放标准协议。它通常用在网络中分布式存储组织的信息，例如用户信息、组织结构和网络资源等。</p>
]]></content>
		</item>
		
		<item>
			<title>Kube API 访问控制</title>
			<link>/posts/kube-api-%E8%AE%BF%E9%97%AE%E6%8E%A7%E5%88%B6/</link>
			<pubDate>Fri, 05 Apr 2024 19:20:16 +0800</pubDate>
			
			<guid>/posts/kube-api-%E8%AE%BF%E9%97%AE%E6%8E%A7%E5%88%B6/</guid>
<description></description>
			<content type="html"><![CDATA[<p><img src="/imgs/kube-authenticating.png" alt="kube-authenticating"></p>
<h2 id="身份认证策略">身份认证策略</h2>
<p>kubernetes 通过身份认证插件利用客户端证书、持有者令牌或身份代理来认证 API 请求的身份。HTTP 请求发给 kube-apiserver 时，插件会将以下属性关联到请求本身：</p>
<ol>
<li>用户名</li>
<li>用户 ID</li>
<li>用户组：一组字符串，用来标明用户是哪些命名的用户组集合的成员，如系统组 system:masters（这个组的使用有一定的风险），这里只是为了说明</li>
<li>附加字段：一组额外的 KV 映射，用来保存一些鉴权组件可能觉得有用的额外信息</li>
</ol>
<p>所有属性值对于身份认证系统都是不透明的，只有被鉴权组件解释过之后才有意义，才会被 kubernetes 集群所识别。可以同时启用多种身份认证方法，并且通常至少会有 2 种：</p>
<ol>
<li>针对服务账号使用的令牌</li>
<li>针对普通人类用户的身份认证</li>
</ol>
<p>认证组件的执行顺序是不确定的，对于完成身份认证的用户，system:authenticated 组都会被添加到用户的组属性中。</p>
<h3 id="x509-客户端证书">X.509 客户端证书</h3>
<p>通过 TLS 的非对称加密完成的双向身份认证，使用<code>--client-ca-file=&lt;client-ca-file&gt;</code>，参考<a href="https://kubernetes.io/zh-cn/docs/tasks/administer-cluster/certificates/">证书管理</a></p>
<h3 id="静态令牌">静态令牌</h3>
<p>这种方式就跟我们不同服务之间内部交互，直接使用一个随机字符串去互相认证类似。目前令牌会长期有效，且 kube-apiserver 不重启的情况下无法更新令牌，使用<code>--token-auth-file=&lt;token-file&gt;</code>。当使用 HTTP 客户端执行身份认证时，http 请求需要携带一个<code>Authorization: Bearer &lt;token string&gt;</code>的头信息。</p>
<h3 id="启动引导令牌">启动引导令牌</h3>
<p>像执行 kubeadm join 中使用的 token 就是这个，这个主要用于平滑启动引导集群，这些令牌以 Secret 的形式保存在 kube-system 命名空间中，可以被动态管理和创建。<code>TokenCleaner</code>控制器能够在启动引导令牌过期时将其删除。</p>
<p>它也被设计成可通过 RBAC 策略，结合 kubelet TLS 启动引导系统进行工作。这类令牌被定义为一种特定的 Secret 类型<code>bootstrap.kubernetes.io/token</code>，并存在于 kube-system 命名空间中。这些 Secret 会被 kube-apiserver 的启动引导认证组件（Bootstrap Authenticator）读取；<code>TokenCleaner</code>控制器能够删除过期的启动引导令牌；启动引导令牌还被用于对节点发现过程中使用的一个特殊 ConfigMap 对象进行签名，<code>BootstrapSigner</code>控制器会使用这个 ConfigMap。参考<a href="https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/bootstrap-tokens/">Bootstrap Tokens</a></p>
<h3 id="服务账号令牌">服务账号令牌</h3>
<p>可以被用在集群内部或者集群外部，token 数据可以被保存在 Secret 中，也可以不保存；如果挂载到 Pod 中的容器内，可使用 volume 或者环境变量等方式（这种就需要用到 Secret 了）。这是一种自动启用的用户认证机制。该组件有 2 个参数可以设置：</p>
<ol>
<li><code>--service-account-key-file=&lt;X509 rsa 或 ecdsa私钥或公钥&gt;</code>，可以指定多个文件，若未指定默认使用<code>--tls-private-key-file</code>参数</li>
<li><code>--service-account-lookup</code>如果启用，则从 kube-apiserver 删除的令牌会被回收</li>
</ol>
<p>服务账号被认证通过后，所确定的用户名为<code>system:serviceaccount:&lt;namespace&gt;:&lt;服务账号名&gt;</code>，并被分配到用户组<code>system:serviceaccounts</code>和<code>system:serviceaccounts:&lt;namespace&gt;</code></p>
<h3 id="oidc-令牌">OIDC 令牌</h3>
<p>这是一种比较广泛接受的用于各种不同服务之间共享账号信息的认证的方式，也被 kubernetes 官方推荐为一种作为集群外部使用的方式。一般要配合 OAuth2 的授权，完成用户身份认证。<a href="https://openid.net/connect/">OpenID Connect</a>。第三方组件有：</p>
<ol>
<li><a href="https://dexidp.io/">dex</a></li>
<li><a href="https://github.com/keycloak/keycloak">keycloak</a></li>
<li><a href="https://github.com/cloudfoundry/uaa">UAA</a></li>
<li><a href="https://openunison.github.io/">OpenUnison</a></li>
</ol>
<p>也可以使用官方已经集成的 OIDC 支持，直接使用 kubectl。对于用户而言，可以自己将 kubectl 的 OIDC 各项配置参数封装成一个 shell 脚本</p>
<h3 id="webhook-令牌">webhook 令牌</h3>
<p>这是一种回调机制，可以在启动 kube-apiserver 时添加参数，有<code>--authentication-token-webhook-config-file</code>,<code>--authentication-token-webhook-cache-ttl</code>,<code>--authentication-token-webhook-version</code>，也可以配置在 kubeconfig 中</p>
<h3 id="身份认证代理">身份认证代理</h3>
<p>这是一种代理的方式，主要用于 HTTP 请求，需要设置一些请求头信息，如<code>--requestheader-username-headers</code>,<code>--requestheader-group-headers</code>,<code>--requestheader-extra-headers-prefix</code>等。要使用这种方式，需要使用 TLS 加密并配置合适的 CA 证书，但不应该在不同的上下文中复用 CA 证书。</p>
<h2 id="匿名请求">匿名请求</h2>
<p>如果用户没有经过身份认证，或者类似集群中的 kubelet 绕过了准入控制，这类请求会被标记为<code>system:anonymous</code>和对应的用户组<code>system:unauthenticated</code></p>
<h2 id="用户伪装">用户伪装</h2>
<p>是一个可以用来伪装成集群中其他用户的手段，以完成一些该用户具有权限的操作。以下 HTTP 请求头用来执行伪装请求：</p>
<ol>
<li><code>Impersonate-User</code> 要伪装的用户名</li>
<li><code>Impersonate-Group</code> 要伪装的用户组</li>
<li><code>Impersonate-Extra-&lt;附加名称&gt;</code></li>
<li><code>Impersonate-UID</code></li>
</ol>
<p>有对应伪装权限的用户也可以使用 kubectl 加参数<code>--as</code>，<code>--as-group</code>来伪装目标用户执行该用户具有权限的操作。</p>
<h2 id="client-go-credentials-插件">client-go credentials 插件</h2>
<p>这一特性的目的是便于客户端与 k8s.io/client-go 并不原生支持的身份认证协议（LDAP、Kerberos、OAuth2、SAML 等）集成。该插件实现特定于协议的逻辑，之后返回不透明的凭据供 webhook 组件进行身份认证，再由其解析出用户数据提供给集群。</p>
<h2 id="为客户端提供的对身份验证信息的-api-访问">为客户端提供的对身份验证信息的 API 访问</h2>
<p>如果启用了此 API，可以使用<code>SelfSubjectReview</code>API 对象来了解 kubernetes 集群如何映射用户身份验证信息从而识别为某客户端。可用于身份认证调试等目的。默认情况下，所有经过身份验证的用户都可以在 APISelfSubjectReview 特性被启用时创建 SelfSubjectReview 对象，这是由 system:basic-user 集群角色允许的操作。</p>
<h2 id="基于-rbac-的良好实践">基于 RBAC 的良好实践</h2>
<ol>
<li>最小特权
<ul>
<li>尽可能在命名空间级别分配权限，尽可能使用<code>RoleBinding</code>，而不是<code>ClusterRoleBinding</code></li>
<li>尽可能避免通过通配符设置权限</li>
<li>尽可能不使用<code>cluster-admin</code>账号</li>
<li>避免添加用户到<code>system:masters</code>组</li>
</ul>
</li>
<li>最大限度减少特权令牌的分发
<ul>
<li>限制具有特权令牌 Pod 的节点数量</li>
<li>避免将拥有特权令牌的 Pod 与其他不可信或公开的 Pod 在一个节点运行，可通过使用 <code>Taint</code>，<code>Toleration</code>，<code>PodAffinity</code>，<code>PodAntiAffinity</code> 来实现</li>
</ul>
</li>
<li>加固
<ul>
<li>审查 <code>system:unauthenticated</code> 组的绑定，如果可能，将其删掉</li>
<li>设置 <code>automountServiceAccountToken: false</code> 来避免服务账号令牌的默认自动挂载</li>
</ul>
</li>
<li>定期检查 RBAC 配置中是否存在冗余条目和提权可能性至关重要。如果攻击者能够创建与已删除用户同名的服务账户，它可以自动继承被删除用户的所有权限，尤其是分配给该用户的权限</li>
<li>警惕 RBAC 权限提升的风险，如 RoleBinding 转换为 ClusterRoleBinding 等行为</li>
<li>限制 Secret 的访问与使用</li>
<li>通过 Pod 安全性准入来尽可能规避工作负载类的控制器的创建产生的风险</li>
<li>持久卷的创建，慎用 hostPath 卷，受信任的人可以用 PV，受约束的人应使用 PVC
<ul>
<li>只允许需要此权限才能工作的用户以及信任的人员</li>
<li>kubernetes 控制平面组件会基于已配置为自动制备的 PVC 创建 PV；kubernetes 提供商或某些 CSI 插件的运行通常也需要此权限</li>
</ul>
</li>
<li>小心具有访问 Node 的 Proxy 子资源权限的 Pod 或用户</li>
<li>尽量避免使用 escalate，拥有此权限的用户可提升自身的权限</li>
<li>尽量避免使用 bind，拥有此权限的用户可以绑定自身并不具备其权限的角色</li>
<li>尽量避免使用 impersonate，此操作允许用户伪装并获得其他用户的权限</li>
<li>小心使用 CSR 和证书颁发：拥有 create <code>CSR</code> 和 update <code>certificatesigningrequests/approval</code> 权限的用户有可能借此提升特权</li>
<li>令牌请求：拥有 <code>serviceaccount/token</code> 的 create 权限的用户可以创建 TokenRequest 来发布现有服务账户的令牌</li>
<li>控制准入 Webhook：可以控制 <code>validatingwebhookconfigurations</code> 或 <code>mutatingwebhookconfigurations</code> 的用户可以控制能读取任何允许进入集群的对象的 webhook，并且在有变更 webhook 的情况下，还可以变更准入的对象。</li>
<li>为所有 Pod 配置资源配额，以限制可以创建的 Pod 数量，来避免 RBAC 拒绝服务攻击的风险</li>
</ol>
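<p>列表最后提到的资源配额可以用 ResourceQuota 对象实现。下面是一个最小示意（名字空间与数值均为假设）：</p>

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-count-quota
  namespace: team-a      # 假设的名字空间
spec:
  hard:
    pods: "20"           # 该名字空间最多允许 20 个 Pod
```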
]]></content>
		</item>
		
		<item>
			<title>Pod 安全性</title>
			<link>/posts/pod-%E5%AE%89%E5%85%A8%E6%80%A7/</link>
			<pubDate>Fri, 05 Apr 2024 11:53:18 +0800</pubDate>
			
			<guid>/posts/pod-%E5%AE%89%E5%85%A8%E6%80%A7/</guid>
<description></description>
			<content type="html"><![CDATA[<h2 id="安全标准">安全标准</h2>
<ol>
<li>
<p><code>Privileged</code> 不受限制的策略，此类 Pod 权限较高，通常为一些系统级别或基础设施级别的工作负载</p>
</li>
<li>
<p><code>Baseline</code> 限制性最弱的策略，禁止已知的特权提升</p>
<ol>
<li>
<p>HostProcess, windows related (v1.26 stable)</p>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.securityContext.windowsOptions.hostProcess</code></li>
<li><code>spec.containers[*].securityContext.windowsOptions.hostProcess</code></li>
<li><code>spec.initContainers[*].securityContext.windowsOptions.hostProcess</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.windowsOptions.hostProcess</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>undefined/nil</li>
<li>false</li>
</ol>
</li>
</ul>
</li>
<li>
<p>Host Namespaces: Sharing the host namespaces must be disallowed.</p>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.hostNetwork</code></li>
<li><code>spec.hostPID</code></li>
<li><code>spec.hostIPC</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>undefined/nil</li>
<li>false</li>
</ol>
</li>
</ul>
</li>
<li>
<p>Privileged Containers: Privileged Pods disable most security mechanisms and must be disallowed.</p>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.containers[*].securityContext.privileged</code></li>
<li><code>spec.initContainers[*].securityContext.privileged</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.privileged</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>undefined/nil</li>
<li>false</li>
</ol>
</li>
</ul>
</li>
<li>
<p>Capabilities: Adding additional capabilities beyond those listed below must be disallowed.</p>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.containers[*].securityContext.capabilities.add</code></li>
<li><code>spec.initContainers[*].securityContext.capabilities.add</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.capabilities.add</code></li>
</ol>
</li>
<li>Allowed Values
<ul>
<li>undefined/nil</li>
<li><code>AUDIT_WRITE</code> 允许写入审计日志</li>
<li><code>CHOWN</code> 允许容器更改文件所有者</li>
<li><code>DAC_OVERRIDE</code> 允许容器忽略文件的 DAC 权限（Discretionary Access Controls）即，读、写、执行、特殊权限</li>
<li><code>FOWNER</code> 允许容器绕过通常要求进程 UID 与文件所有者一致的操作的权限检查</li>
<li><code>FSETID</code> 允许在修改文件时不清除文件的 Setuid 位或 Setgid 位</li>
<li><code>KILL</code> 允许容器向其他进程发送信号</li>
<li><code>MKNOD</code> 允许容器创建特殊文件节点</li>
<li><code>NET_BIND_SERVICE</code> 允许容器绑定到低于 1024 的端口号</li>
<li><code>SETFCAP</code> 允许容器设置文件的能力，如某个文件需要一些特权操作，但又不想以 root 用户身份执行</li>
<li><code>SETGID</code> 允许容器设置有效的组 ID（宿主机）</li>
<li><code>SETPCAP</code> 允许容器进程修改其进程的能力，如 Docker，Sandbox</li>
<li><code>SETUID</code> 允许容器设置有效的用户 ID（宿主机）</li>
<li><code>SYS_CHROOT</code> 允许容器使用 chroot 系统调用，即通过系统调用限制用户只能使用某个文件目录的能力</li>
</ul>
</li>
</ul>
</li>
<li>
<p>HostPath Volumes: HostPath volumes must be forbidden.</p>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.volumes[*].hostPath</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>undefined/nil</li>
</ol>
</li>
</ul>
</li>
<li>
<p>Hosts Ports: HostPorts should be disallowed entirely (recommended) or restricted to a known list</p>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.containers[*].ports[*].hostPort</code></li>
<li><code>spec.initContainers[*].ports[*].hostPort</code></li>
<li><code>spec.ephemeralContainers[*].ports[*].hostPort</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>undefined/nil</li>
<li>已知的端口列表</li>
<li>0</li>
</ol>
</li>
</ul>
</li>
<li>
<p>AppArmor: On supported hosts, the runtime/default AppArmor profile is applied by default. The baseline policy should prevent overriding or disabling the default AppArmor profile, or restrict overrides to an allowed set of profiles.</p>
<ul>
<li>Restricted Fields
<ol>
<li><code>metadata.annotations[&quot;container.apparmor.security.beta.kubernetes.io/*&quot;]</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>undefined/nil</li>
<li>runtime/default</li>
<li>localhost/*</li>
</ol>
</li>
</ul>
</li>
<li>
<p>SELinux: Setting the SELinux type is restricted, and setting a custom SELinux user or role option is forbidden.</p>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.securityContext.seLinuxOptions.type</code></li>
<li><code>spec.containers[*].securityContext.seLinuxOptions.type</code></li>
<li><code>spec.initContainers[*].securityContext.seLinuxOptions.type</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.seLinuxOptions.type</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>undefined/&quot;&quot;</li>
<li><code>container_t</code> 容器的默认 SELinux 类型，允许主容器中的进程访问主容器中的资源，并受到 SELinux 策略的保护</li>
<li><code>container_init_t</code> 允许初始化容器中的进程访问初始化容器中的资源，并受到 SELinux 策略的保护</li>
<li><code>container_kvm_t</code> 允许容器内运行的虚拟机（如 KVM）的 SELinux 类型，并受到 SELinux 策略的保护</li>
</ol>
</li>
<li>Restricted Fields
<ol>
<li><code>spec.securityContext.seLinuxOptions.[user/role]</code></li>
<li><code>spec.containers[*].securityContext.seLinuxOptions.[user/role]</code></li>
<li><code>spec.initContainers[*].securityContext.seLinuxOptions.[user/role]</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.seLinuxOptions.[user/role]</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>undefined/&quot;&quot;</li>
</ol>
</li>
</ul>
</li>
<li>
<p>/proc Mount Type: The default /proc masks are set up to reduce attack surface, and should be required.</p>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.containers[*].securityContext.procMount</code></li>
<li><code>spec.initContainers[*].securityContext.procMount</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.procMount</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>undefined/nil</li>
<li>Default</li>
</ol>
</li>
</ul>
</li>
<li>
<p>Seccomp: Seccomp profile must not be explicitly set to Unconfined</p>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.securityContext.seccompProfile.type</code></li>
<li><code>spec.containers[*].securityContext.seccompProfile.type</code></li>
<li><code>spec.initContainers[*].securityContext.seccompProfile.type</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.seccompProfile.type</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>undefined/nil</li>
<li>RuntimeDefault</li>
<li>Localhost</li>
</ol>
</li>
</ul>
</li>
<li>
<p>Sysctls: Sysctls can disable security mechanisms or affect all containers on a host, and should be disallowed except for an allowed &ldquo;safe&rdquo; subset. A sysctl is considered safe if it is namespaced in the container or the Pod, and it is isolated from other Pods or processes on the same Node.</p>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.securityContext.sysctls[*].name</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>undefined/nil</li>
<li><code>kernel.shm_rmid_forced</code> 控制是否强制删除共享内存标识符</li>
<li><code>net.ipv4.ip_local_port_range</code> 控制本地端口范围</li>
<li><code>net.ipv4.ip_unprivileged_port_start</code> 指定非特权用户可用的本地端口起始范围</li>
<li><code>net.ipv4.tcp_syncookies</code> 控制是否启用 SYN cookie 机制来防范 SYN 攻击</li>
<li><code>net.ipv4.ping_group_range</code> 指定允许创建 ICMP Echo 套接字（即使用 ping）的组 ID 范围</li>
<li><code>net.ipv4.ip_local_reserved_ports</code> 指定保留的本地端口范围</li>
<li><code>net.ipv4.tcp_keepalive_time</code> 指定 TCP 连接空闲多长时间后开始发送 keepalive 探测，以秒为单位</li>
<li><code>net.ipv4.tcp_fin_timeout</code> 指定 TCP 连接的 FIN 超时时间，以秒为单位</li>
<li><code>net.ipv4.tcp_keepalive_intvl</code> 指定 TCP 连接的 keepalive 控制消息之间的间隔时间，以秒为单位</li>
<li><code>net.ipv4.tcp_keepalive_probes</code> 指定 TCP 连接在进行 keepalive 检测之前尝试的次数</li>
</ol>
</li>
</ul>
</li>
</ol>
</li>
<li>
<p><code>Restricted</code> 限制性最强的策略</p>
<ol>
<li>Volume Types: The restricted policy only permits the following volume types.
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.volumes[*]</code></li>
</ol>
</li>
<li>Allowed Values: Non-Null value
<ol>
<li><code>spec.volumes[*].configMap</code></li>
<li><code>spec.volumes[*].csi</code></li>
<li><code>spec.volumes[*].downwardAPI</code></li>
<li><code>spec.volumes[*].emptyDir</code></li>
<li><code>spec.volumes[*].ephemeral</code></li>
<li><code>spec.volumes[*].persistentVolumeClaim</code></li>
<li><code>spec.volumes[*].projected</code></li>
<li><code>spec.volumes[*].secret</code></li>
</ol>
</li>
</ul>
</li>
<li>Privilege Escalation: Privilege escalation (such as via set-user-ID or set-group-ID file mode) should not be allowed. linux only policy (<code>spec.os.name != windows)</code>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.containers[*].securityContext.allowPrivilegeEscalation</code></li>
<li><code>spec.initContainers[*].securityContext.allowPrivilegeEscalation</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.allowPrivilegeEscalation</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>false</li>
</ol>
</li>
</ul>
</li>
<li>Running as Non-root
<ul>
<li>Restricted Fields: Containers must be required to run as non-root users.
<ol>
<li><code>spec.securityContext.runAsNonRoot</code></li>
<li><code>spec.containers[*].securityContext.runAsNonRoot</code></li>
<li><code>spec.initContainers[*].securityContext.runAsNonRoot</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.runAsNonRoot</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>true</li>
</ol>
</li>
<li>Restricted Fields: Containers must not set runAsUser to 0
<ol>
<li><code>spec.securityContext.runAsUser</code></li>
<li><code>spec.containers[*].securityContext.runAsUser</code></li>
<li><code>spec.initContainers[*].securityContext.runAsUser</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.runAsUser</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>any non-zero value</li>
<li>undefined/null</li>
</ol>
</li>
</ul>
</li>
<li>Seccomp: Seccomp profile must be explicitly set to one of the allowed values. Both the Unconfined profile and the absence of a profile are prohibited. linux only (<code>spec.os.name != windows)</code>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.securityContext.seccompProfile.type</code></li>
<li><code>spec.containers[*].securityContext.seccompProfile.type</code></li>
<li><code>spec.initContainers[*].securityContext.seccompProfile.type</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.seccompProfile.type</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>RuntimeDefault</li>
<li>Localhost</li>
</ol>
</li>
</ul>
</li>
<li>Capabilities: Containers must drop ALL capabilities, and are only permitted to add back the NET_BIND_SERVICE capability. linux only (<code>spec.os.name != windows)</code>
<ul>
<li>Restricted Fields
<ol>
<li><code>spec.containers[*].securityContext.capabilities.drop</code></li>
<li><code>spec.initContainers[*].securityContext.capabilities.drop</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.capabilities.drop</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>Any list of capabilities that includes ALL</li>
</ol>
</li>
<li>Restricted Fields
<ol>
<li><code>spec.containers[*].securityContext.capabilities.add</code></li>
<li><code>spec.initContainers[*].securityContext.capabilities.add</code></li>
<li><code>spec.ephemeralContainers[*].securityContext.capabilities.add</code></li>
</ol>
</li>
<li>Allowed Values
<ol>
<li>undefined/nil</li>
<li>NET_BIND_SERVICE</li>
</ol>
</li>
</ul>
</li>
</ol>
</li>
</ol>
<h2 id="为名字空间设置-pod-安全性准入控制标签">为名字空间设置 Pod 安全性准入控制标签</h2>
<ol>
<li>enforce: 策略违例会导致 Pod 被拒绝，应用到 Pod 对象上</li>
<li>audit：策略违例会触发在审计日志中记录新事件时添加注解；但是 Pod 仍然是被接受的，应用到 Deployment，ReplicaSet 等控制器对象上</li>
<li>warn：策略违例会触发用户可见的警告信息，但是 Pod 仍是被接受的，应用到 Deployment，ReplicaSet 等控制器对象上</li>
</ol>
<h3 id="对应的标签">对应的标签</h3>
<ol>
<li><code>pod-security.kubernetes.io/&lt;MODE&gt;: &lt;LEVEL&gt;</code>
MODE: <code>enforce</code>,<code>audit</code>,<code>warn</code>
LEVEL: <code>privileged</code>,<code>baseline</code>,<code>restricted</code></li>
<li><code>pod-security.kubernetes.io/&lt;MODE&gt;-version: &lt;VERSION&gt;</code>
MODE: <code>enforce</code>,<code>audit</code>,<code>warn</code>
VERSION: 合法的 kubernetes 小版本号或者<code>latest</code></li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Namespace</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">my-baseline-namespace</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">pod-security.kubernetes.io/enforce</span><span class="p">:</span><span class="w"> </span><span class="l">baseline</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">pod-security.kubernetes.io/enforce-version</span><span class="p">:</span><span class="w"> </span><span class="l">v1.29</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c"># 我们将这些标签设置为我们所 _期望_ 的 `enforce` 级别</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">pod-security.kubernetes.io/audit</span><span class="p">:</span><span class="w"> </span><span class="l">restricted</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">pod-security.kubernetes.io/audit-version</span><span class="p">:</span><span class="w"> </span><span class="l">v1.29</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">pod-security.kubernetes.io/warn</span><span class="p">:</span><span class="w"> </span><span class="l">restricted</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">pod-security.kubernetes.io/warn-version</span><span class="p">:</span><span class="w"> </span><span class="l">v1.29</span><span class="w">
</span></span></span></code></pre></div><h2 id="准入豁免">准入豁免</h2>
<ol>
<li>Username：来自被豁免用户名的、已认证的（或伪装的）用户请求会被忽略</li>
<li>RuntimeClassName：指定了已豁免的 CRI 类名称的 Pod 和负载资源（Deployment，ReplicaSet 等）会被忽略</li>
<li>Namespace：位于被豁免的名字空间中的 Pod 和负载资源会被忽略</li>
</ol>
<blockquote>
<p>NOTE: 为用户提供豁免时，只会当该用户直接创建的 Pod 时对其实施安全策略的豁免。用户所创建的工作负载资源（控制器）不会被豁免。控制器服务账号（如：system:serviceaccount:kube-system:replicaset-controller）通常不应该被豁免，因为这类服务账号隐含着对所有能够创建对应工作负载资源的用户豁免。</p>
</blockquote>
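<p>豁免是在 kube-apiserver 的准入配置中静态声明的。下面是一个 AdmissionConfiguration 的示意片段（其中的用户名、RuntimeClass 名均为假设值）：</p>

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1
    kind: PodSecurityConfiguration
    defaults:
      enforce: "baseline"
      enforce-version: "latest"
    exemptions:
      usernames: ["my-admin-user"]        # 假设的用户名
      runtimeClasses: ["trusted-runtime"] # 假设的 RuntimeClass
      namespaces: ["kube-system"]
```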
<p>策略检查时会对以下 Pod 字段的更新操作予以豁免。这意味着如果 Pod 更新请求仅改变这些字段，即使 Pod 违反了当前的策略级别，请求也不会被拒绝。</p>
<ul>
<li>除了对 seccomp 或 AppArmor 注解之外的所有 Metadata 更新操作：
<ul>
<li><code>container.apparmor.security.beta.kubernetes.io/*</code></li>
</ul>
</li>
<li>对 <code>.spec.activeDeadlineSeconds</code>的合法更新</li>
<li>对 <code>.spec.tolerations</code>的合法更新</li>
</ul>
<h2 id="pod-安全级别的指标监控">Pod 安全级别的指标监控</h2>
<ul>
<li>pod_security_evaluations_total: 表示已发生的策略评估的数量，不包括被忽略或豁免的请求</li>
<li>pod_security_exemptions_total: 表示豁免请求的数量，不包括被忽略或超出范围的请求</li>
</ul>
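<p>基于这两个指标可以在 Prometheus 中观察各策略级别的评估情况，例如（查询语句仅为示意）：</p>
<pre tabindex="0"><code class="language-promql" data-lang="promql"># 按策略级别与决策结果统计最近 5 分钟的评估速率
sum by (policy_level, decision) (rate(pod_security_evaluations_total[5m]))
</code></pre>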
]]></content>
		</item>
		
		<item>
			<title>My Tmux Config</title>
			<link>/posts/my-tmux-config/</link>
			<pubDate>Wed, 03 Apr 2024 17:57:14 +0800</pubDate>
			
			<guid>/posts/my-tmux-config/</guid>
			<description><![CDATA[%!s(<nil>)]]></description>
			<content type="html"><![CDATA[<p><img src="/imgs/tmux-screenshot.png" alt="tmux-screenshot"></p>
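<p>该配置依赖 tpm（tmux plugin manager）。如果尚未安装，可先将其克隆到 tpm 约定的默认路径，之后在 tmux 中按 <code>prefix + I</code> 安装下列插件：</p>
<pre tabindex="0"><code class="language-shell" data-lang="shell">git clone https://github.com/tmux-plugins/tpm ~/.tmux/plugins/tpm
</code></pre>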
<pre tabindex="0"><code class="language-conf" data-lang="conf">set -g default-terminal &#34;screen-256color&#34;

set -g prefix C-z
unbind C-b
bind-key C-z send-prefix

unbind %
bind _ split-window -h

unbind &#39;&#34;&#39;
bind - split-window -v

unbind r
bind r source-file ~/.tmux.conf \; display &#39;~/.tmux.conf sourced&#39;

bind -r Tab last-window         # move to last active window
bind-key -n M-h previous-window # select previous window
bind-key -n M-l next-window     # select next window

bind -r j resize-pane -D 5
bind -r k resize-pane -U 5
bind -r l resize-pane -R 5
bind -r h resize-pane -L 5

bind -r m resize-pane -Z

set -g mouse on

set-window-option -g mode-keys vi

bind-key -T copy-mode-vi &#39;v&#39; send -X begin-selection # start selecting text with &#34;v&#34;
bind-key -T copy-mode-vi &#39;y&#39; send -X copy-selection # copy text with &#34;y&#34;

unbind -T copy-mode-vi MouseDragEnd1Pane # don&#39;t exit copy mode when dragging with mouse

# remove delay for exiting insert mode with ESC in Neovim
set -sg escape-time 10

# tpm plugin
set -g @plugin &#39;tmux-plugins/tpm&#39;

# list of tmux plugins
set -g @plugin &#39;christoomey/vim-tmux-navigator&#39;
set -g @plugin &#39;tmux-plugins/tmux-resurrect&#39; # persist tmux sessions after computer restart
set -g @plugin &#39;tmux-plugins/tmux-continuum&#39; # automatically saves sessions for you every 15 minutes

set -g @plugin &#39;catppuccin/tmux&#39;
set -g @plugin &#39;tmux-plugins/tmux-cpu&#39;
set -g @plugin &#39;xamut/tmux-weather&#39;
# set -g @plugin &#39;vascomfnunes/tmux-clima&#39;
# set -g @plugin &#39;jamesoff/tmux-loadavg&#39;

# ========= catppuccin/tmux configuration start =============
set -g @catppuccin_window_left_separator &#34;&#34;
set -g @catppuccin_window_right_separator &#34; &#34;
set -g @catppuccin_window_middle_separator &#34; █&#34;
set -g @catppuccin_window_number_position &#34;right&#34;

set -g @catppuccin_window_default_fill &#34;number&#34;
set -g @catppuccin_window_default_text &#34;#W&#34;

set -g @catppuccin_window_current_fill &#34;number&#34;
set -g @catppuccin_window_current_text &#34;#W&#34;

set -g @catppuccin_status_modules_left &#34;&#34;
set -g @catppuccin_status_modules_right &#34;directory application weather date_time cpu session&#34;
set -g @catppuccin_status_left_separator  &#34; &#34;
set -g @catppuccin_status_right_separator &#34;&#34;
set -g @catppuccin_status_fill &#34;icon&#34;
set -g @catppuccin_status_connect_separator &#34;no&#34;

set -g @catppuccin_directory_text &#34;#{b:pane_current_path}&#34;
# ========= catppuccin/tmux configuration end =============

set -g @resurrect-capture-pane-contents &#39;on&#39;
set -g @continuum-restore &#39;on&#39;

# Initialize TMUX plugin manager (keep this line at the very bottom of tmux.conf)
run &#39;~/.tmux/plugins/tpm/tpm&#39;
</code></pre>]]></content>
		</item>
		
		<item>
			<title>API优先级与公平性</title>
			<link>/posts/api%E4%BC%98%E5%85%88%E7%BA%A7%E4%B8%8E%E5%85%AC%E5%B9%B3%E6%80%A7/</link>
			<pubDate>Wed, 03 Apr 2024 08:53:56 +0800</pubDate>
			
			<guid>/posts/api%E4%BC%98%E5%85%88%E7%BA%A7%E4%B8%8E%E5%85%AC%E5%B9%B3%E6%80%A7/</guid>
			<description><![CDATA[%!s(<nil>)]]></description>
			<content type="html"><![CDATA[<p>API Priority and Fairness（APF，v1.29 stable）的核心在于：当 API 服务器过载时，仍保持处理请求与响应的能力。kube-apiserver 提供 <code>--max-requests-inflight</code> 和 <code>--max-mutating-requests-inflight</code> 等选项，限制可接受的未完成请求数量，防止过量请求涌入导致 API 服务崩溃；但这些选项不足以保证在高流量期间，最重要的请求仍能被服务器接受。</p>
<p>APF 改进了 apiserver 过载情况下的并发控制：以更细粒度的方式对请求进行分类（按优先级配置）和隔离（每个优先级分配自定义并发限制），并引入空间有限的排队机制（公平排队分发请求），使 apiserver 在非常短暂的突发情况下不会拒绝任何请求，保证所有请求都能得到响应。</p>
<p>APF 在设计上期望与标准控制器（如 Deployment 控制器等）协同工作，这类控制器在请求失败时会执行指数退避（exponential backoff）重试。</p>
<p>APF 稳定版 v1 默认启用，可通过 kube-apiserver 的 <code>--enable-priority-and-fairness</code> 参数开启或禁用。</p>
<p>请求通过 FlowSchema 按其属性分类，并被分配优先级。每个优先级有各自的并发限制（隔离）。公平排队算法可以防止来自不同 Flow 的请求相互饿死；该算法对请求进行排队，避免在平均负载可接受的情况下，因流量突增而导致请求失败。</p>
<h2 id="基本概念">基本概念</h2>
<h3 id="优先级">优先级</h3>
<p>如果未启用 APF，apiserver 的整体并发量将受 kube-apiserver 参数 <code>--max-requests-inflight</code> 和 <code>--max-mutating-requests-inflight</code> 的限制。启用 APF 后，这两个参数定义的并发限制会被求和，然后将总和分配到一组可配置的优先级中。每个请求都会被分配一个优先级，且每个优先级都有各自的并发限制，因此即使异常的 Pod 向 apiserver 发送大量请求，也无法阻塞诸如领导者选举或内置控制器操作等高优先级请求。优先级的并发限制会被定期调整，允许利用率低的优先级将并发 seat 临时借给利用率高的优先级。</p>
<h3 id="请求占用的-seat">请求占用的 seat</h3>
<p>有些请求只占用一个 seat，有些请求占用多个 seat。例如，对于一个返回大量对象的 List 请求，apiserver 会按与返回对象数量成正比的方式估算其占用的 seat 数。</p>
<h3 id="watch-请求">watch 请求</h3>
<p>APF 对 watch 请求的管理超出了“一个请求占用一个 seat”的基本模型。当一个对象被创建/更新/删除时，apiserver 通常会并发地向所有相关的 watch 响应流发送通知；为此，APF 会估算要发送的通知数量，并相应调整写入请求占用的 seat 数以及这些 seat 被额外占用的时长。</p>
<h3 id="排队">排队</h3>
<p>第一步：每个请求都被分配到某个 Flow（由匹配的 FlowSchema 及其 FlowDistinguisher 决定）。FlowDistinguisher 可以是发出请求的用户、目标资源的名字空间，或者为空。apiserver 尝试为不同 Flow 中具有相同优先级的请求赋予近似相等的权重。第二步：apiserver 将请求分配到队列中，使用 shuffle-sharding 技术，以较低的开销将 low-intensity flows 与 high-intensity flows 相互隔离。</p>
<p>排队算法的细节可针对每个优先级进行调整，并允许管理员在内存占用、公平性、突发流量容忍度以及排队引发的额外延迟之间进行权衡。</p>
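<p>下面用一个简单的 shell 计算说明这些排队参数之间的关系（参数值为假设的示例配置，并非默认值）：</p>
<pre tabindex="0"><code class="language-shell" data-lang="shell"># 假设某优先级配置了 handSize=6、queueLengthLimit=50
handSize=6
queueLengthLimit=50

# 单个 flow 中可能排队的请求的最大数量为 handSize * queueLengthLimit
echo $(( handSize * queueLengthLimit ))  # 输出 300
</code></pre>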
<h3 id="豁免请求">豁免请求</h3>
<p>某些特别重要的请求不受制于此特性施加的任何限制。这些豁免可防止不当的流控配置完全禁用 apiserver。</p>
<h2 id="资源">资源</h2>
<h3 id="prioritylevelconfiguration"><code>PriorityLevelConfiguration</code></h3>
<p>该对象定义可用的优先级，并按比例分配 kube-apiserver 的 <code>--max-requests-inflight</code> 和 <code>--max-mutating-requests-inflight</code> 所定义的总并发量限制。因此对管理员来说，只需设定这两个参数的数值，集群中定义的所有 <code>PriorityLevelConfiguration</code> 的并发限额都会按比例相应伸缩。</p>
<p>一个优先级可以借出的并发数（seat）界限以及可以借用的并发数（seat）界限，在 <code>PriorityLevelConfiguration</code> 中表达为该优先级额定限制的百分比：界限值 × 额定限制 / 100.0 并取整，即为绝对 seat 数量。因此一个优先级动态调整后的并发限制范围，下限为“其额定限制减去其可借出的 seat 数”，上限为“其额定限制加上其可借用的 seat 数”。每次调整时，apiserver 会回收最近未出现需求的所有借出的 seat，然后在上述界限内公平地满足各优先级最近的 seat 需求，由此推导出每个优先级的动态限制。如果想对给定资源的请求分别处理，需要自行创建对应的 FlowSchema 对象，分别匹配 mutating 与 non-mutating 请求。</p>
<p>当入站请求数量大于分配的<code>PriorityLevelConfiguration</code>中允许的并发量（seat）时，<code>type=Reject</code>表示多余的请求将立即以 HTTP 429 的错误拒绝，<code>type=Queue</code>表示对允许并发量的请求进行排队处理（应用不同优先级的并发数量和公平排队的技术来平衡请求的处理）。</p>
<p>公平排队算法支持通过排队配置对优先级的微调，<a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1040-priority-and-fairness">增强建议</a></p>
<ul>
<li><code>queues</code> 数值增大，能降低 flow 冲突概率，但是内存会增大，值为 1，则禁用公平排队逻辑</li>
<li><code>queueLengthLimit</code> 数值增大，能提高并发，但是增加了等待时间和内存用量</li>
<li><code>handSize</code> 允许调整过载情况下不同 flow 之间的冲突概率以及单个 flow 可用的整体并发性。较大的值会降低两个 flow 发生碰撞的概率（从而降低一个 flow 因长期占用 seat 而饿死另一个的可能性），但也更可能让少数几个 flow 占满 apiserver；较大的值还可能增加单个高并发 flow 的延迟时间。单个 flow 中可能排队的请求的最大数量为<code>handSize</code>×<code>queueLengthLimit</code>。</li>
</ul>
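<p>这些排队参数在 <code>PriorityLevelConfiguration</code> 的 <code>limitResponse.queuing</code> 字段中设置。下面是一个示例草稿（名称与数值均为示意）：</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: sample-priority-level   # 示意名称
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 30
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        queueLengthLimit: 50
        handSize: 6
</code></pre>
<p>按上文的公式，此配置下单个 flow 最多可排队 6 × 50 = 300 个请求。</p>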
<h3 id="flowschema"><code>FlowSchema</code></h3>
<p>用于对每个入站请求进行分类，并与一个<code>PriorityLevelConfiguration</code>相匹配。每个入站请求都会对<code>FlowSchema</code>测试是否匹配，首先从<code>matchingPrecedence</code>数值最低的匹配开始，然后依次进行，直到首个匹配出现。</p>
<p>确定请求与某个<code>FlowSchema</code>的<code>rules</code>的其中一条匹配的规则是：</p>
<ol>
<li>要求该条规则的<code>subjects</code>字段至少存在一个与该请求相匹配</li>
<li>该条规则的<code>resourceRules</code>或<code>nonResourceRules</code>（取决于请求传入的是资源 URL 还是非资源 URL）字段至少存在一个与该请求匹配</li>
</ol>
<p>对于<code>subjects</code>中的<code>name</code>字段和资源和非资源规则的<code>verbs</code>(如 GET,POST 等)、<code>apiGroups</code>、<code>resources</code>、<code>namespaces</code>和<code>nonResourceURLS</code>字段，可以指定通配符<code>*</code>来匹配任意值。</p>
<p><code>FlowSchema</code>的<code>distinguisherMethod.type</code>字段决定了如何把与该模式匹配的请求分发到不同的 flow 中，如：</p>
<ol>
<li><code>ByUser</code> 一个用户的请求将无法饿死其他用户的请求</li>
<li><code>ByNamespace</code> 一个命名空间中的对象资源请求将无法饿死其他命名空间中的对象资源请求</li>
<li>如果省略<code>distinguisherMethod</code>，这种情况将被视为与此<code>FlowSchema</code>相匹配的请求是单个 flow 的一部分。</li>
<li>&hellip;</li>
</ol>
<h2 id="默认值">默认值</h2>
<p>kube-apiserver 会维护 2 种类型的 APF 配置对象：Mandatory 和 Suggested</p>
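<p>可以直接查看集群中这些默认维护的对象（需要对 flowcontrol API 组的读权限）：</p>
<pre tabindex="0"><code class="language-shell" data-lang="shell">kubectl get prioritylevelconfigurations
kubectl get flowschemas
</code></pre>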
<h2 id="健康检查豁免示例">健康检查豁免示例</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">flowcontrol.apiserver.k8s.io/v1beta3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">FlowSchema</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">health-for-strangers</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">matchingPrecedence</span><span class="p">:</span><span class="w"> </span><span class="m">1000</span><span class="w"> </span><span class="c"># 最低匹配数值</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">priorityLevelConfiguration</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">exempt</span><span class="w"> </span><span class="c"># 优先级用于完全不受流控限制的请求</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">rules</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">nonResourceRules</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">nonResourceURLs</span><span class="p">:</span><span class="w"> </span><span class="c"># 针对URL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span>- <span class="s2">&#34;/healthz&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span>- <span class="s2">&#34;/livez&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span>- <span class="s2">&#34;/readyz&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">verbs</span><span class="p">:</span><span class="w"> </span><span class="c"># 对所有的HTTP（GET,POST,PUT...）的请求</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span>- <span class="s2">&#34;*&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">subjects</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Group</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">group</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;system:unauthenticated&#34;</span><span class="w">
</span></span></span></code></pre></div><h2 id="稳定性提升">稳定性提升</h2>
<p>根据提供的<a href="https://kubernetes.io/zh-cn/docs/concepts/cluster-administration/flow-control/#metrics">指标</a>配置 APF 监控</p>
]]></content>
		</item>
		
		<item>
			<title>集群管理</title>
			<link>/posts/%E9%9B%86%E7%BE%A4%E7%AE%A1%E7%90%86/</link>
			<pubDate>Tue, 02 Apr 2024 08:59:23 +0800</pubDate>
			
			<guid>/posts/%E9%9B%86%E7%BE%A4%E7%AE%A1%E7%90%86/</guid>
			<description><![CDATA[%!s(<nil>)]]></description>
			<content type="html"><![CDATA[<h2 id="日志架构">日志架构</h2>
<h3 id="pod-与容器日志">Pod 与容器日志</h3>
<ol>
<li>所有容器的运行日志，都以 stdout 和 stderr 的方式输出，并由 kubelet 进行日志轮转（rotate）。默认情况下，容器重启后 kubelet 会保留一个已终止容器的日志；如果 Pod 被逐出节点，其对应的容器和日志也会一并被逐出。轮转行为由 kubelet 配置文件中的 <code>containerLogMaxSize</code>（默认 10Mi）与 <code>containerLogMaxFiles</code>（默认 5）控制。</li>
<li>系统组件的日志
<ol>
<li>kubelet：linux 默认写入<code>journald</code>，使用<code>journalctl -u kubelet</code>查看</li>
<li>其他组件如果是以容器的方式运行，如 kube-proxy 等一些系统 Pod，日志在<code>/var/log/pods</code>中，日志目录命名方式为 <code>[namespace]_[podname]_[poduid]</code></li>
</ol>
</li>
</ol>
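<p>上述日志轮转参数在 kubelet 配置文件中设置，例如（数值即为默认值）：</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi   # 单个日志文件的最大大小
containerLogMaxFiles: 5     # 每个容器保留的日志文件数上限
</code></pre>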
<h2 id="集群级日志架构">集群级日志架构</h2>
<p>kubernetes 没有提供原生日志解决方案，但用户可以选择一些方式：</p>
<ol>
<li>使用 DaemonSet 在每个节点部署日志代理</li>
<li>使用 Sidecar 专门收集容器日志</li>
<li>使用第三方后端记录日志</li>
</ol>
<p>总之，无论采用哪种方式收集日志，都会带来一定的资源损耗。</p>
<h3 id="系统日志">系统日志</h3>
<p>Klog 示例</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-console" data-lang="console"><span class="line"><span class="cl"><span class="go">I1025 00:15:15.525108       1 httplog.go:79] GET /api/v1/namespaces/kube-system/pods/metrics-server-v0.3.1-57c75779f-9p8wg: (1.512ms) 200 [pod_nanny/v0.0.0 (linux/amd64) kubernetes/$Format 10.56.1.19:51756]
</span></span></span></code></pre></div><ol>
<li>结构化日志
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-console" data-lang="console"><span class="line"><span class="cl"><span class="go">&lt;klog header&gt; &#34;&lt;message&gt;&#34; &lt;key1&gt;=&#34;&lt;value1&gt;&#34; &lt;key2&gt;=&#34;&lt;value2&gt;&#34; &lt;%+v&gt;
</span></span></span></code></pre></div></li>
<li><a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/3077-contextual-logging">上下文日志</a>，<a href="https://github.com/kubernetes/kubernetes/blob/v1.24.0-beta.0/staging/src/k8s.io/component-base/logs/example/cmd/logger.go">example</a></li>
<li>JSON 日志格式，<code>--logging-format=json</code></li>
<li>日志精细度级别，如<code>-v=5</code>,0 为 critical 事件</li>
<li>日志位置，在容器中的日志：<code>/var/log/&lt;pods/containers/servicename&gt;/**/*.log</code>（logrotate）；不在容器中：写入 journald（系统日志工具）</li>
<li>日志查询，需开启<code>NodeLogQuery</code>(feature gate)</li>
</ol>
<h2 id="系统组件指标">系统组件指标</h2>
<p>http 访问路径，<code>/metrics/*</code>；</p>
<p>启用隐藏指标：<code>show-hidden-metrics-for-version</code></p>
<h3 id="kube-scheduler">kube-scheduler</h3>
<p>包含以下 label:</p>
<ol>
<li>namespace</li>
<li>podname</li>
<li>pod.spec.nodeName</li>
<li>pod.spec.priority</li>
<li>pod.spec.schedulerName</li>
<li>pod.spec.resources.[*].[cpu|memory|volume]</li>
</ol>
<h3 id="指标顺序指定">指标顺序指定</h3>
<p><code>--allow-label-value</code>, <code>--allow-metric-labels-manifest</code></p>
<h2 id="tracing-kubernetes-系统组件">Tracing Kubernetes 系统组件</h2>
<ol>
<li>OpenTelemetry 协议</li>
<li><a href="https://www.w3.org/TR/trace-context/">w3c trace-context</a></li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">kubelet.config.k8s.io/v1beta1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">KubeletConfiguration</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">featureGates</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">KubeletTracing</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">tracing</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="c"># 默认值</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="c">#endpoint: localhost:4317</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">samplingRatePerMillion</span><span class="p">:</span><span class="w"> </span><span class="m">100</span><span class="w"> </span><span class="c"># 如引发性能问题，可适当调整数值或关闭该项</span><span class="w">
</span></span></span></code></pre></div><h2 id="kubernetes-中的-proxy">kubernetes 中的 proxy</h2>
<ol>
<li>kubectl proxy</li>
<li>apiserver proxy</li>
<li>kube proxy
<ul>
<li>在每个节点上运行</li>
<li>支持 UDP,TCP,SCTP</li>
<li>提供负载均衡能力</li>
<li>只用来访问 Service</li>
</ul>
</li>
<li>A Proxy/Load-balancer in front of apiserver(s)</li>
<li>Cloud Load Balancers on external services</li>
</ol>
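<p>以 kubectl proxy 为例，它会在本地与 apiserver 之间建立一个经过认证的代理通道（端口为示意值）：</p>
<pre tabindex="0"><code class="language-shell" data-lang="shell">kubectl proxy --port=8001 &amp;
# 之后即可通过本地端口访问 apiserver 的 REST API
curl http://localhost:8001/api/v1/namespaces/default/pods
</code></pre>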
]]></content>
		</item>
		
		<item>
			<title>Play Hyprland in ArchLinux</title>
			<link>/posts/play-hyprland-in-archlinux/</link>
			<pubDate>Tue, 02 Apr 2024 01:45:33 +0800</pubDate>
			
			<guid>/posts/play-hyprland-in-archlinux/</guid>
			<description><![CDATA[%!s(<nil>)]]></description>
			<content type="html"><![CDATA[<p><img src="/media/fancy-hyprland.png" alt="fancy-hyprland"></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># Verify the boot mode</span>
</span></span><span class="line"><span class="cl">cat /sys/firmware/efi/fw_platform_size
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">fdisk /dev/vda
</span></span><span class="line"><span class="cl"><span class="c1"># Pick GPT partition</span>
</span></span><span class="line"><span class="cl"><span class="c1"># /dev/vda1 512MB EFI system type: uefi</span>
</span></span><span class="line"><span class="cl"><span class="c1"># /dev/vda2 512MB swap       type: swap</span>
</span></span><span class="line"><span class="cl"><span class="c1"># /dev/vda3 63G Linux Filesystem</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">mkfs.ext4 /dev/vda3
</span></span><span class="line"><span class="cl">mkswap /dev/vda2
</span></span><span class="line"><span class="cl">mkfs.fat -F <span class="m">32</span> /dev/vda1
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">mount /dev/vda3 /mnt
</span></span><span class="line"><span class="cl">mount --mkdir /dev/vda1 /mnt/boot
</span></span><span class="line"><span class="cl">swapon /dev/vda2
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">mv /etc/pacman.d/mirrorlist /etc/pacman.d/mirrorlist.bak
</span></span><span class="line"><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;Server = https://mirrors.tuna.tsinghua.edu.cn/archlinux/</span><span class="nv">$repo</span><span class="s2">/os/</span><span class="nv">$arch</span><span class="s2">&#34;</span> &gt; /etc/pacman.d/mirrorlist
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">pacstrap -K /mnt base base-devel linux linux-firmware e2fsprogs dhcpcd networkmanager vim neovim man-db man-pages texinfo openssh git
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">genfstab -U /mnt &gt;&gt; /mnt/etc/fstab
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">arch-chroot /mnt
</span></span><span class="line"><span class="cl">ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
</span></span><span class="line"><span class="cl">hwclock --systohc
</span></span><span class="line"><span class="cl">vim /etc/locale.gen
</span></span><span class="line"><span class="cl">locale-gen
</span></span><span class="line"><span class="cl">cat &gt; /etc/locale.conf <span class="s">&lt;&lt;EOF
</span></span></span><span class="line"><span class="cl"><span class="s">LANG=en_US.UTF-8
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">cat &gt; /etc/vconsole.conf <span class="s">&lt;&lt;EOF
</span></span></span><span class="line"><span class="cl"><span class="s">KEYMAP=en
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="cl">cat &gt;&gt; /etc/hostname <span class="s">&lt;&lt;EOF
</span></span></span><span class="line"><span class="cl"><span class="s">archlinux
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="cl">cat &gt;&gt; /etc/hosts <span class="s">&lt;&lt;EOF
</span></span></span><span class="line"><span class="cl"><span class="s">127.0.0.1 localhost
</span></span></span><span class="line"><span class="cl"><span class="s">::1         localhost
</span></span></span><span class="line"><span class="cl"><span class="s">127.0.0.1   archlinux
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Ignore the warning or error, I did not dig into it.</span>
</span></span><span class="line"><span class="cl">mkinitcpio -P
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Set root passwd</span>
</span></span><span class="line"><span class="cl">passwd
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Install bootloader</span>
</span></span><span class="line"><span class="cl">pacman -S grub efibootmgr
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">mkdir /boot/grub
</span></span><span class="line"><span class="cl">grub-install --efi-directory<span class="o">=</span>/boot --bootloader-id<span class="o">=</span>GRUB
</span></span><span class="line"><span class="cl">grub-mkconfig -o /boot/grub/grub.cfg
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">exit</span>
</span></span><span class="line"><span class="cl">umount -R /mnt
</span></span><span class="line"><span class="cl">reboot now
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">systemctl <span class="nb">enable</span> NetworkManager
</span></span><span class="line"><span class="cl">systemctl start NetworkManager
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">systemctl <span class="nb">enable</span> sshd
</span></span><span class="line"><span class="cl">systemctl start sshd
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">useradd -m -G wheel -s /bin/bash username
</span></span><span class="line"><span class="cl">passwd username
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;Server = https://mirrors.tuna.tsinghua.edu.cn/archlinuxarm/</span><span class="nv">$arch</span><span class="s2">/</span><span class="nv">$repo</span><span class="s2">&#34;</span> &gt;&gt; /etc/pacman.d/mirrorlist
</span></span><span class="line"><span class="cl">pacman -Syyu
</span></span></code></pre></div><h2 id="some-useful-references">Some Useful References</h2>
<ul>
<li><a href="https://wiki.archlinux.org/title/Installation_guide">https://wiki.archlinux.org/title/Installation_guide</a></li>
<li><a href="https://github.com/prasanthrangan/hyprdots">https://github.com/prasanthrangan/hyprdots</a></li>
<li><a href="https://github.com/JaKooLit/Fedora-Hyprland">https://github.com/JaKooLit/Fedora-Hyprland</a></li>
<li><a href="https://github.com/JackMyers001/archiso-aarch64">https://github.com/JackMyers001/archiso-aarch64</a></li>
<li><a href="https://github.com/romkatv/powerlevel10k">https://github.com/romkatv/powerlevel10k</a></li>
<li><a href="https://raspberry-hosting.com/en/faq/how-expand-arch-linux-root-partition">https://raspberry-hosting.com/en/faq/how-expand-arch-linux-root-partition</a></li>
</ul>
]]></content>
		</item>
		
		<item>
			<title>Schedule 总结2</title>
			<link>/posts/schedule-%E6%80%BB%E7%BB%932/</link>
			<pubDate>Mon, 01 Apr 2024 08:50:47 +0800</pubDate>
			
			<guid>/posts/schedule-%E6%80%BB%E7%BB%932/</guid>
			<description><![CDATA[%!s(<nil>)]]></description>
			<content type="html"><![CDATA[<h2 id="taint--toleration">Taint &amp; Toleration</h2>
<p><code>podAffinity</code>是一类属性，使节点拥有吸引某一类 Pod 的能力，Taint 则相反，使节点能够排斥一类特定的 Pod，Toleration 则允许调度器将 Pod 调度到带有 Taint 的节点上。</p>
<p>Taint 与 Toleration 相互配合，可以有效避免将一类 Pod 调度到不适合的节点上。如：可以防止将不需要 GPU 能力的 Pod 调度到可提供 GPU 能力的节点上；集群默认的 master 节点有<code>node-role.kubernetes.io/control-plane:NoSchedule</code>这样的配置，才可以避免将 Pod 调度到 master 节点等等。</p>
<p>一些基本的操作指令：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># 给节点node1增加一个Taint，不允许被调度</span>
</span></span><span class="line"><span class="cl">kubectl taint nodes node1 <span class="nv">key1</span><span class="o">=</span>value1:NoSchedule
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 删除这个Taint</span>
</span></span><span class="line"><span class="cl">kubectl taint nodes node1 <span class="nv">key1</span><span class="o">=</span>value1:NoSchedule-
</span></span></code></pre></div><p>在 Pod 中设置 Toleration，使其可以被调度到对应的节点上：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">tolerations</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;key1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Equal&#34;</span><span class="w"> </span><span class="c"># operator 默认值</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">value</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;value1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">effect</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;NoSchedule&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;key1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Exists&#34;</span><span class="w"> </span><span class="c"># 此时不能指定对应的value，如无特别说明，官方指定不能的情况，加上一般会报错，无法继续进行对应的操作</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">effect</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;NoSchedule&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;&#34;</span><span class="w"> </span><span class="c"># 此时可以被调度到任何具有`NoSchedule`效果的Taint所在的节点上</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;Exists&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">effect</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;NoSchedule&#34;</span><span class="w"> </span><span class="c"># effect也可以使用其他值，如PreferNoSchedule等</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;key1&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">effect</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;&#34;</span><span class="w"> </span><span class="c"># 可以与所有key1的效果匹配</span><span class="w">
</span></span></span></code></pre></div><h3 id="effect-允许设置的值">effect 允许设置的值</h3>
<table>
<thead>
<tr>
<th>value</th>
<th>影响</th>
</tr>
</thead>
<tbody>
<tr>
<td>NoExecute</td>
<td>1. 如果 Pod 不能容忍这类污点，会马上被驱逐；2. 如果 Pod 能够容忍这类污点，没有指定<code>tolerationSeconds</code>，则 Pod 还会一直在这个节点上运行；3. 如果 Pod 能够容忍这类污点，而且指定了<code>tolerationSeconds</code>，则经过这段时间之后，Pod 会逐渐被驱离这些节点</td>
</tr>
<tr>
<td>NoSchedule</td>
<td>除非具有匹配的容忍度，否则 Pod 不会被调度到该污点的节点，且当前正在运行的 Pod 不会被驱逐</td>
</tr>
<tr>
<td>PreferNoSchedule</td>
<td>偏好设置，调度器不能完全保证不会被调度到具有该污点的节点上</td>
</tr>
</tbody>
</table>
<h3 id="高级用法">高级用法</h3>
<ol>
<li>可以通过自己实现一些<a href="https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/admission-controllers/">准入控制器(AdmissionControllers)</a>，为 Pod 自动添加所需的 Toleration</li>
<li>可以启用<code>ExtendedResourceToleration</code>准入控制器，实现对带有特殊硬件（扩展资源）节点的集群管理</li>
</ol>
<h3 id="基于-taint-的驱逐">基于 Taint 的驱逐</h3>
<p>当前内置的 Taint:</p>
<ol>
<li><code>node.kubernetes.io/not-ready</code></li>
<li><code>node.kubernetes.io/unreachable</code></li>
<li><code>node.kubernetes.io/memory-pressure</code></li>
<li><code>node.kubernetes.io/disk-pressure</code></li>
<li><code>node.kubernetes.io/pid-pressure</code></li>
<li><code>node.kubernetes.io/network-unavailable</code></li>
<li><code>node.kubernetes.io/unschedulable</code></li>
<li><code>node.cloudprovider.kubernetes.io/uninitialized</code></li>
</ol>
<h4 id="daemonset-的特别说明">DaemonSet 的特别说明</h4>
<p>DaemonSet 控制器会自动为所有的 Pod 添加如下<code>NoSchedule</code>的 Toleration:</p>
<ol>
<li><code>node.kubernetes.io/memory-pressure</code></li>
<li><code>node.kubernetes.io/disk-pressure</code></li>
<li><code>node.kubernetes.io/pid-pressure</code> (v1.14+)</li>
<li><code>node.kubernetes.io/unschedulable</code> (v1.10+)</li>
<li><code>node.kubernetes.io/network-unavailable</code>（仅适用于使用主机网络的 Pod）</li>
</ol>
<p>针对以下 Taint<code>NoExecute</code>的 Toleration 将不会指定<code>tolerationSeconds</code>：</p>
<ol>
<li><code>node.kubernetes.io/unreachable</code></li>
<li><code>node.kubernetes.io/not-ready</code></li>
</ol>
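<p>对于非 DaemonSet 的 Pod，默认准入控制器 DefaultTolerationSeconds 会为这两个 Taint 自动添加 <code>tolerationSeconds=300</code> 的 Toleration；也可以在 Pod 中自行指定，例如：</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # 节点失联 300 秒后 Pod 才会被驱逐
</code></pre>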
<h2 id="调度框架-tbd">调度框架 [TBD]</h2>
<p><img src="/externals/scheduling-framework-extensions.png" alt="scheduling-framework-extensions"></p>
<h2 id="动态资源分配-alpha-特性">动态资源分配 (alpha 特性)</h2>
<p>该特性的 feature gate 需要被显式启用，资源驱动可通过自己开发或者使用第三方插件来实现，目前(v1.29.3-)<code>resource.k8s.io/v1alpha2</code> API 组提供四种类型：</p>
<ol>
<li><code>ResourceClass</code> 定义资源驱动程序</li>
<li><code>ResourceClaim</code> 定义资源申请</li>
<li><code>ResourceClaimTemplate</code> 定义用于创建 ResourceClaim 的模板</li>
<li><code>PodSchedulingContext</code> 供调度器与资源驱动程序协调 Pod 的调度</li>
</ol>
<h2 id="调度器性能调优">调度器性能调优</h2>
<p>对于大规模的集群（数百节点以上），有必要设置一个合适的<code>percentageOfNodesToScore</code>，让调度器快速做出响应：调度器当前通过轮询遍历的方式评估节点的可调度性，限定参与打分的节点比例可以显著缩短调度耗时。更复杂的调度需求可能需要根据自身架构特性实现自定义调度器。</p>
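该参数可以在调度器配置中设置，下面是一个最小化的示意配置（数值仅作假设）：

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
# 只对 50% 的可行节点打分，找到足够候选后提前结束遍历
percentageOfNodesToScore: 50
```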
<h2 id="调度插件----noderesourcesfit">调度插件 &ndash; NodeResourcesFit</h2>
<p>支持两种“装箱（bin packing）”计分策略：</p>
<ol>
<li><code>MostAllocated</code> 基于资源的利用率来为节点计分，优选分配比率较高的节点，可以设置<code>weight</code>来影响调度结果</li>
<li><code>RequestedToCapacityRatio</code> 允许用户基于请求值与容量的比率，针对参与节点计分的每类资源设置权重</li>
</ol>
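这两种策略可以在调度器配置中任选其一。下面是一个基于 <code>MostAllocated</code> 的示意配置（权重值为假设）：

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated      # 优选资源分配比率较高的节点
            resources:
              - name: cpu
                weight: 1            # 示例权重
              - name: memory
                weight: 1
```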
<h3 id="计分函数">计分函数</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">shape</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">utilization</span><span class="p">:</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="c"># 0%：节点评分为0</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">score</span><span class="p">:</span><span class="w"> </span><span class="m">0</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">utilization</span><span class="p">:</span><span class="w"> </span><span class="m">100</span><span class="w"> </span><span class="c"># 100%：节点评分为10</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">score</span><span class="p">:</span><span class="w"> </span><span class="m">10</span><span class="w">
</span></span></span></code></pre></div><h3 id="节点容量分配的评分-tbd">节点容量分配的评分 [TBD]</h3>
<h2 id="pod-优先级与抢占">Pod 优先级与抢占</h2>
<h3 id="priorityclass">PriorityClass</h3>
<p><code>PriorityClass</code>允许 Pod 之间有优先级差别：优先级越高的 Pod 越容易被调度，越容易进入运行状态。<code>PriorityClass</code>对象的取值范围是从 -2,147,483,648 到 1,000,000,000（含）；更大的数字保留给标识集群关键 Pod 的内置<code>PriorityClass</code>。使用<code>PriorityClass</code>需要了解的点：</p>
<ol>
<li><code>PriorityClass</code>提供<code>globalDefault</code>(表示这个值应用于没有<code>priorityClassName</code>的 Pod，且系统中只能有一个设置了该字段的<code>PriorityClass</code>，如果不存在设置了<code>globalDefault</code>的<code>PriorityClass</code>，则没有<code>priorityClassName</code>的 Pod 优先级为 0)与<code>description</code>(任意字符串)字段。</li>
<li><code>PriorityClass</code>仅对新增的 Pod 有效，已存在的 Pod 不会因为<code>PriorityClass</code>的创建而提升优先级，即这类 Pod 优先级还是 0；如果删除了<code>PriorityClass</code>，则使用被删除的<code>PriorityClass</code>的 Pod 优先级保持不变，但是无法再创建已删除的<code>PriorityClass</code>的 Pod。</li>
</ol>
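一个示意性的 <code>PriorityClass</code> 及引用它的 Pod 如下（名称与数值均为假设）：

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority          # 示例名称
value: 1000000                 # 取值范围内的示例优先级
globalDefault: false
description: "仅用于演示的高优先级类"
---
apiVersion: v1
kind: Pod
metadata:
  name: priority-demo
spec:
  priorityClassName: high-priority   # Pod 通过该字段引用优先级类
  containers:
    - name: app
      image: nginx             # 示例镜像
```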
<h3 id="non-preempting-priorityclass">Non-preempting PriorityClass</h3>
<p>配置了<code>preemptionPolicy: Never</code>的 Pod 在调度队列中仍排在较低优先级 Pod 之前，但它们不能抢占其他 Pod 的资源：等待调度的非抢占式 Pod 会留在队列中，直到有资源满足才被调度。非抢占式 Pod 调度失败后会以更低的频率被重试，从而允许其他较低优先级的 Pod 先被调度。<code>preemptionPolicy</code>默认为<code>PreemptLowerPriority</code>，允许该优先级类的 Pod 抢占较低优先级的 Pod（默认行为）；若设置为<code>Never</code>，则该<code>PriorityClass</code>中的 Pod 将是非抢占式的。</p>
<p>总之，Pod 在调度队列中的排序取决于<code>spec.priorityClassName</code>解析出的优先级值；<code>preemptionPolicy</code>（<code>Never</code> 或默认的<code>PreemptLowerPriority</code>，未设置也是后者）只决定该 Pod 能否抢占低优先级 Pod，不影响排队顺序。</p>
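非抢占式优先级类只比普通 <code>PriorityClass</code> 多一个字段，示意如下（名称与数值为假设）：

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # 示例名称
value: 1000000
preemptionPolicy: Never               # 高优先级排队，但不抢占已运行的 Pod
globalDefault: false
description: "非抢占式优先级类示例"
```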
<h3 id="preemption">Preemption</h3>
<p>Pod 被创建后会进入 Pending 状态，如果找不到满足其需求的节点，调度器会为该 Pending Pod 触发抢占逻辑，即驱逐某节点上一个或多个低优先级 Pod，为高优先级 Pod 腾出资源。获得抢占资格的 Pod 会被设置<code>nominatedNodeName</code>字段，用于跟踪为其保留的资源，用户可以据此了解集群中的抢占信息。被标记<code>nominatedNodeName</code>的 Pod 将来不一定会被调度到该字段指定的节点：调度程序可能在任何其他节点上重试，因此<code>nominatedNodeName</code>与<code>nodeName</code>并不总是相同，而且该 Pod 最终也有机会抢占另一个节点上 Pod 所对应的资源。</p>
<h4 id="因被抢占而被-kubelet-杀死的-pod">因被抢占而被 kubelet 杀死的 Pod</h4>
<p>被抢占的 Pod 会有一个体面终止期：从触发抢占到抢占者真正被调度之间存在时间差，因此用户可以根据实际需求适当调小低优先级 Pod 的体面终止时间，具体字段为<code>spec.terminationGracePeriodSeconds</code>（默认 30s）；超过这个时间后，Pod 会被 kubelet 强制杀死。</p>
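按需调小体面终止时间只需在 Pod 规约中设置一个字段，数值仅作示意：

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-priority-demo            # 示例名称
spec:
  terminationGracePeriodSeconds: 10  # 默认 30s，这里示意性地调小
  containers:
    - name: app
      image: nginx                   # 示例镜像
```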
<h4 id="支持-poddisruptionbudget但不保证">支持 PodDisruptionBudget，但不保证</h4>
<p><code>PodDisruptionBudget</code>(PDB)允许多副本应用的所有者限制因自愿性质的干扰而同时终止的 Pod 数量。调度器在抢占时对 PDB 的支持是尽力而为的：它会优先寻找不违反 PDB 约束的牺牲者，但如果找不到，抢占仍然会进行，即便违反 PDB 约束，优先级较低的 Pod 也会被删除。</p>
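一个示意性的 PDB 定义如下（名称与标签为假设），要求自愿性干扰期间至少保留 2 个副本：

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb            # 示例名称
spec:
  minAvailable: 2          # 自愿性干扰期间至少保留 2 个可用副本
  selector:
    matchLabels:
      app: demo            # 假设的应用标签
```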
<h4 id="与低优先级-pod-之间的亲和性-tbt">与低优先级 Pod 之间的亲和性 [TBT]</h4>
<p>如果处于 Pending 的 Pod 与某节点上一个或多个较低优先级的 Pod 之间存在 Pod 间亲和性，抢占这些低优先级 Pod 会破坏该亲和性规则；若该节点上没有除这些亲和性 Pod 之外可供抢占的其他低优先级 Pod，则抢占不会发生。同时，也无法保证在其他节点上能找到合适的位置让该 Pending Pod 被调度。</p>
<p>也就是说这种情况下，处于 Pending 状态的 Pod 可能永远处于 Pending，尽管它具备较高的优先级。
官方给出的解决方案是仅针对同等或更高优先级的 Pod 设置 Pod 间亲和性。</p>
<h4 id="跨节点抢占">跨节点抢占</h4>
<p>假设 Pod P 等待调度到节点 N，而同一可用区另一节点上的 Pod Q 与 P 之间存在反亲和性规则，导致 P 无法调度到 N。当且仅当抢占（删除）Pod Q 才能让 P 被调度时，这就是跨节点抢占，但目前的版本尚未实现这一能力。</p>
<p>截至 v1.29.3，官方还没有给出跨节点抢占的合理算法，未来可能会有。</p>
<h4 id="总结">总结</h4>
<p>抢占如果要发生，必须要满足优先级要求，即被抢占资源的永远是那些具有低优先级的 Pod。</p>
<h3 id="priorityclass-与-qos">PriorityClass 与 QoS</h3>
<p>Pod 的优先级与 QoS 类二者并无直接关联，也没有基于 QoS 类对 PriorityClass 取值的默认限制。QoS 类可以被用来估计 Pod 最有可能被驱逐的顺序。kubelet 根据以下因素对 Pod 的驱逐进行排序：</p>
<ol>
<li>对紧俏资源的使用量是否超过其请求值</li>
<li>Pod 优先级</li>
<li>相对于请求值的资源使用量：资源使用量未超过其请求值的 Pod 不会因节点压力被驱逐</li>
</ol>
<h2 id="节点压力驱逐">节点压力驱逐</h2>
<p>节点压力驱逐是 kubelet 主动终止 Pod 以回收节点资源（如内存、磁盘空间、文件系统 inode、进程 ID 等）的过程。压力驱逐期间 Pod 的状态会被 kubelet 标记为 Failed。压力驱逐不理会<code>PodDisruptionBudget</code>和 Pod 的<code>terminationGracePeriodSeconds</code>：对于软驱逐条件，kubelet 会使用<code>eviction-max-pod-grace-period</code>；硬驱逐则立即杀死 Pod。</p>
<h3 id="pod-自我修复行为">Pod 自我修复行为</h3>
<ol>
<li>对于受 Deployment、ReplicaSet 等控制器管理的 Pod，控制器会自动创建新 Pod 来替换被驱逐的 Pod</li>
<li>对于不受控制器管理的裸 Pod，kubelet 会尝试用新的 Pod 替换旧的，且这类裸 Pod 的优先级也会被 kubelet 纳入驱逐决策</li>
</ol>
<h3 id="kubelet-的驱逐策略">kubelet 的驱逐策略</h3>
<ol>
<li>驱逐信号：在 Linux 系统上，kubelet 使用以下驱逐信号：</li>
</ol>
<table>
<thead>
<tr>
<th>eviction-signal</th>
<th>描述</th>
</tr>
</thead>
<tbody>
<tr>
<td>memory.available</td>
<td>node.status.capacity[memory] - node.stats.memory.workingSet</td>
</tr>
<tr>
<td>nodefs.available</td>
<td>node.stats.fs.available</td>
</tr>
<tr>
<td>nodefs.inodesFree</td>
<td>node.stats.fs.inodesFree</td>
</tr>
<tr>
<td>imagefs.available</td>
<td>node.stats.runtime.imagefs.available</td>
</tr>
<tr>
<td>imagefs.inodesFree</td>
<td>node.stats.runtime.imagefs.inodesFree</td>
</tr>
<tr>
<td>pid.available</td>
<td>node.stats.rlimit.maxpid - node.stats.rlimit.curproc</td>
</tr>
</tbody>
</table>
<p>每个信号的阈值支持百分比或具体数值两种形式；使用百分比时，kubelet 基于节点资源总容量计算驱逐条件。上述具体数值由 cgroup 等系统统计接口提供。</p>
<ol start="2">
<li>
<p>驱逐条件：节点上应该可用资源的最小值，格式：[eviction-signal][operator][quantity]，如：memory.available &lt; 10%，memory.available &lt; 1G</p>
<ul>
<li>硬驱逐条件，没有宽限期，立即驱逐，使用 <code>eviction-hard</code>，如：memory.available&lt;100Mi、nodefs.available&lt;10%</li>
<li>软驱逐条件，条件持续超过宽限期才驱逐，使用 <code>eviction-soft</code>、<code>eviction-soft-grace-period</code>、<code>eviction-max-pod-grace-period</code>，如：memory.available&lt;1.5Gi 且宽限期为 1m30s</li>
</ul>
</li>
<li>
<p>驱逐检测间隔：kubelet 按<code>housekeeping-interval</code>（默认 10s）的间隔评估驱逐条件</p>
</li>
</ol>
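软驱逐条件及其宽限期可以在 KubeletConfiguration 中配置，下面是一个示意草案（阈值均为假设值）：

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "1.5Gi"        # 可用内存低于 1.5Gi 即满足软驱逐条件
evictionSoftGracePeriod:
  memory.available: "1m30s"        # 条件需持续 1m30s 才触发驱逐
evictionMaxPodGracePeriod: 60      # 软驱逐时给 Pod 的最长体面终止时间（秒）
```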
<h3 id="节点状态">节点状态</h3>
<p>kubelet 根据下表将驱逐信号映射为节点状态</p>
<table>
<thead>
<tr>
<th>节点条件</th>
<th>驱逐信号（eviction-signal）</th>
</tr>
</thead>
<tbody>
<tr>
<td>MemoryPressure</td>
<td>memory.available</td>
</tr>
<tr>
<td>DiskPressure</td>
<td>nodefs.available、nodefs.inodesFree、imagefs.available 或 imagefs.inodesFree</td>
</tr>
<tr>
<td>PIDPressure</td>
<td>pid.available</td>
</tr>
</tbody>
</table>
<p>此外，控制平面（control plane）还会将这些节点状态映射为对应的 Taint。kubelet 根据配置的 <code>--node-status-update-frequency</code> 更新节点状态，默认为 10s。</p>
<h4 id="节点状态波动">节点状态波动</h4>
<p>要尽量避免节点状态波动，可以设置<code>eviction-pressure-transition-period</code>，该值指定了节点从一种压力状态转换为另一种状态前必须等待的时间，默认为 5m</p>
<h4 id="主动回收节点资源">主动回收节点资源</h4>
<ol>
<li>有专用 imagefs：如果 nodefs 满足驱逐条件，kubelet 会对死亡的 Pod 和容器进行垃圾回收；如果 imagefs 满足驱逐条件，kubelet 将删除所有未使用的镜像</li>
<li>没有专用 imagefs：如果节点只有一个满足驱逐条件的 nodefs，kubelet 将首先对死亡的 Pod 和容器进行垃圾回收，然后删除未使用的镜像</li>
</ol>
<h4 id="pod-的驱逐顺序">Pod 的驱逐顺序</h4>
<p>如果以上的步骤，无法降低对应节点资源的压力，kubelet 将开始驱逐 Pod，根据 Pod 资源使用量与优先级等情况，按以下顺序来驱逐 Pod：</p>
<ol>
<li>资源使用量超过其请求的 Pod，如<code>BestEffort</code>或<code>Burstable</code>，会根据各自的优先级，以及资源使用超过其请求的程度被驱逐</li>
<li>资源使用量少于请求量的<code>Guaranteed</code>和<code>Burstable</code>根据其优先级最后被驱逐</li>
</ol>
<blockquote>
<p>Note: QoS 不适用于<code>EphemeralVolume</code>，如果节点在 DiskPressure 下，上述顺序不适用。</p>
</blockquote>
<p>仅当 <code>Guaranteed</code> Pod 中所有容器都被指定了请求和限制并且二者相等时，才保证 Pod 不被驱逐。如果不足以缓解系统资源压力，尽管满足不被驱逐的条件，但是还是会驱逐优先级低的 Pod。</p>
<p>对于裸 Pod，如果希望避免在资源压力下被驱逐，需要直接设置<code>spec.priority</code>字段，裸 Pod 不支持<code>spec.priorityClassName</code>。</p>
<p>当 kubelet 因 inode 或 进程 ID 不足而驱逐 Pod 时， 它使用 Pod 的相对优先级来确定驱逐顺序。</p>
<p>kubelet 根据节点是否具有专用的 imagefs 文件系统对 Pod 进行不同的排序：</p>
<ul>
<li>有 imagefs，如果 nodefs 触发驱逐， kubelet 会根据 nodefs 使用情况（本地卷 + 所有容器的日志）对 Pod 进行排序。</li>
<li>没有 imagefs，如果 nodefs 触发驱逐， kubelet 会根据磁盘总用量（本地卷 + 日志和所有容器的可写层）对 Pod 进行排序。</li>
</ul>
<h4 id="最小驱逐回收">最小驱逐回收</h4>
<p>为 kubelet 配置<code>--eviction-minimum-reclaim</code>参数可免于在资源紧俏的情况下，kubelet 反复回收资源，多次驱逐</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">kubelet.config.k8s.io/v1beta1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">KubeletConfiguration</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">evictionHard</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">memory.available</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;500Mi&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">nodefs.available</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;1Gi&#34;</span><span class="w"> </span><span class="c">#达到 1Gi 的条件</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">imagefs.available</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;100Gi&#34;</span><span class="w"> </span><span class="c">#达到 100Gi 的条件</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">evictionMinimumReclaim</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">memory.available</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;0Mi&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">nodefs.available</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;500Mi&#34;</span><span class="w"> </span><span class="c">#继续回收至少 500MiB,直到条件达到 1.5GiB</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">imagefs.available</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;2Gi&#34;</span><span class="w"> </span><span class="c">#继续回收至少 2GiB，直到条件达到 102GiB</span><span class="w">
</span></span></span></code></pre></div><h3 id="节点内存不足时的行为">节点内存不足时的行为</h3>
<p>如果节点在 kubelet 能够回收内存之前遇到内存不足（OOM）事件，则节点依赖 <a href="https://lwn.net/Articles/391222/">oom_killer</a> 来响应。</p>
<table>
<thead>
<tr>
<th>QoS</th>
<th>oom_score_adj</th>
<th>说明</th>
</tr>
</thead>
<tbody>
<tr>
<td>Guaranteed</td>
<td>-997</td>
<td>当资源非常紧俏时，最不容易被杀死</td>
</tr>
<tr>
<td>BestEffort</td>
<td>1000</td>
<td>oom_score_adj 最高，当资源非常紧俏时最先被杀死</td>
</tr>
<tr>
<td>Burstable</td>
<td>min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)</td>
<td>介于两者之间：内存请求占节点容量的比例越小，oom_score_adj 越高，越容易被杀死</td>
</tr>
</tbody>
</table>
<blockquote>
<p>kubelet 还将具有 system-node-critical 优先级 的任何 Pod 中的容器 oom_score_adj 值设为 -997。如：kube-scheduler，kube-apiserver，etcd 等组件。</p>
</blockquote>
<p>这意味着低 QoS Pod 中相对于其调度请求消耗内存较多的容器，将首先被杀死。</p>
<h3 id="最佳实践">最佳实践</h3>
<ol>
<li>配置 kubelet 时预留一定内存给系统，比如 10%，然后再加上驱逐阈值：比如 16G 的系统内存，如果驱逐阈值是 500M，则预留约 2.1G 给系统，避免系统崩溃</li>
<li>为一些节点性质的 Pod，比如 DaemonSet 控制器相关的 Pod 设置比较高的优先级，避免被 kubelet 杀死</li>
</ol>
<h3 id="已知问题">已知问题</h3>
<ol>
<li>kubelet 可能不会立即观察到内存压力，如果追求极端的利用率，可以使用<code>--kernel-memcg-notification</code>以便在超过条件时立即执行 Pod 驱逐；如果不追求极端的利用率，可以设置<code>--kube-reserved</code>与<code>--system-reserved</code>为系统预留一定量的内存</li>
<li>active_file 内存未被视为可用内存，可能导致 Pod 驱逐发生，截止(v1.29.3)这是一个已知的未解决的问题，可以通过为可能执行 I/O 密集型活动的容器设置相同的内存限制和内存请求来应对该行为，为此，将需要估计或测量该容器的最佳内存限制值</li>
</ol>
<h2 id="eviction-api-发起的驱逐">Eviction API 发起的驱逐</h2>
<p>用户可以调用<code>Eviction API</code>发起驱逐，也可以通过<code>kubectl drain</code>发起驱逐。API 发起的驱逐遵循<code>PodDisruptionBudget</code>和<code>terminationGracePeriodSeconds</code>配置。</p>
<h3 id="相关的-http-状态码">相关的 http 状态码</h3>
<ol>
<li>200：允许驱逐</li>
<li>429：Too Many Requests，当前不允许驱逐，可能因为不满足<code>PodDisruptionBudget</code>配置，可稍后尝试</li>
<li>500：服务器配置错误，如多个<code>PodDisruptionBudget</code>引用同一个 Pod</li>
</ol>
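调用 Eviction API 时，需要向目标 Pod 的 <code>eviction</code> 子资源 POST 一个 Eviction 对象，示意如下（Pod 名称与命名空间为假设）：

```yaml
apiVersion: policy/v1
kind: Eviction
metadata:
  name: demo-pod        # 要驱逐的 Pod 名称（假设）
  namespace: default    # Pod 所在的命名空间（假设）
```

对应的请求路径形如 <code>POST /api/v1/namespaces/default/pods/demo-pod/eviction</code>；若不满足 PDB 约束，该请求会返回 429。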
<h3 id="解决驱逐被卡住">解决驱逐被卡住</h3>
<ol>
<li>暂停 Deployment 等控制器的自动化操作，可以设置如<code>suspend</code>字段</li>
<li>不使用<code>Eviction API</code>，直接删除对应的 Pod，如调用<code>kubectl delete</code></li>
</ol>
<p><a href="2">插件</a>: <a href="https://kubernetes.io/zh-cn/docs/concepts/configuration/manage-resources-containers/#extended-resources">https://kubernetes.io/zh-cn/docs/concepts/configuration/manage-resources-containers/#extended-resources</a>
[oom_killer]: <a href="https://lwn.net/Articles/391222/">https://lwn.net/Articles/391222/</a></p>
]]></content>
		</item>
		
		<item>
			<title>Schedule 总结</title>
			<link>/posts/schedule-%E6%80%BB%E7%BB%93/</link>
			<pubDate>Sun, 31 Mar 2024 18:20:31 +0800</pubDate>
			
			<guid>/posts/schedule-%E6%80%BB%E7%BB%93/</guid>
			<description><![CDATA[%!s(<nil>)]]></description>
			<content type="html"><![CDATA[<p>kube-scheduler 是 kubernetes 集群的默认调度器，官方允许用户根据自己的需求编写特定的调度器来替换 kube-scheduler。kube-scheduler 通过以下 2 个步骤来实现对 Pod 的调度：</p>
<ol>
<li>过滤：根据 Pod 定义的条件筛选节点，如&quot;required&quot;类强制条件要求节点 label 匹配；对于<code>podAffinity</code>可以有多个<code>labelSelector</code>条件，对于<code>nodeAffinity</code>只能有一个<code>nodeSelectorTerms</code>，但其中可包含多个<code>matchExpressions</code>条件</li>
<li>打分：根据 Pod 定义的<code>nodeAffinity</code>、<code>podAffinity</code>、<code>podAntiAffinity</code>等&quot;preferred&quot;偏好条件及其<code>weight</code>打分（weight 数值越高，Pod 越有可能被调度到满足对应条件的节点上）</li>
</ol>
<h2 id="可用的过滤条件">可用的过滤条件</h2>
<ol>
<li>节点标签：如 <code>kubernetes.io/hostname</code>, <code>kubernetes.io/os</code></li>
<li>节点隔离/限制：如<code>NodeRestriction</code>插件使用的带有<code>node-restriction.kubernetes.io</code>前缀的标签</li>
<li><code>.spec.affinity.nodeAffinity</code></li>
<li><code>.spec.affinity.podAffinity</code>与<code>.spec.affinity.podAntiAffinity</code>，以及<code>.spec.affinity.podAffinity.[].topologyKey</code>（集群用来标识域的节点标签键）</li>
<li><code>namespaceSelector</code> (v1.29 alpha)</li>
<li><code>.spec.nodeName</code> 有一些局限性，这个条件不稳定
<ol>
<li>如节点不存在，Pod 无法运行，在某些情况下可能被自动删除</li>
<li>云环境中的节点名称并不总是可预测与稳定的</li>
<li>当节点没有足够资源运行该 Pod 时，Pod 会直接失败；而未指定<code>.spec.nodeName</code>的 Pod 在同样情况下通常只是处于<code>pending</code>，不一定失败</li>
</ol>
</li>
</ol>
<h2 id="可选的方式">可选的方式</h2>
<ol>
<li><code>requiredDuringSchedulingIgnoredDuringExecution</code> 硬性条件，强制限制</li>
<li><code>preferredDuringSchedulingIgnoredDuringExecution</code> 软性条件，偏好限制，有更好，没有也可以，该方式是一个数组，可以提供<code>weight</code>(越大越优先)来加强或减弱某个条件</li>
</ol>
<h2 id="可用的选项">可用的选项</h2>
<ol>
<li><code>nodeAffinity</code>
<ol>
<li><code>nodeSelectorTerms</code></li>
</ol>
</li>
<li><code>podAffinity</code>与<code>podAntiAffinity</code>
<ol>
<li><code>matchLabelKeys</code></li>
<li><code>mismatchLabelKeys</code></li>
<li><code>topologyKey</code></li>
<li><code>labelSelector</code></li>
</ol>
</li>
<li>&hellip;</li>
</ol>
<h2 id="可用的操作符">可用的操作符</h2>
<table>
<thead>
<tr>
<th>操作符</th>
<th>行为</th>
<th>适用对象</th>
</tr>
</thead>
<tbody>
<tr>
<td>In</td>
<td>标签值匹配提供的字符串集</td>
<td><code>nodeAffinity</code>, <code>podAffinity</code>, <code>podAntiAffinity</code></td>
</tr>
<tr>
<td>NotIn</td>
<td>标签值不匹配提供的字符串集</td>
<td><code>nodeAffinity</code>, <code>podAffinity</code>, <code>podAntiAffinity</code></td>
</tr>
<tr>
<td>Exists</td>
<td>存在具有此键的标签</td>
<td><code>nodeAffinity</code>, <code>podAffinity</code>, <code>podAntiAffinity</code></td>
</tr>
<tr>
<td>DoesNotExist</td>
<td>不存在具有此键的标签</td>
<td><code>nodeAffinity</code>, <code>podAffinity</code>, <code>podAntiAffinity</code></td>
</tr>
<tr>
<td>Gt</td>
<td>标签的值 &gt; 该数</td>
<td><code>nodeAffinity</code></td>
</tr>
<tr>
<td>Lt</td>
<td>标签的值 &lt; 该数</td>
<td><code>nodeAffinity</code></td>
</tr>
</tbody>
</table>
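结合上表操作符，下面是一个示意性的 Pod 亲和性定义（<code>cpu-cores</code> 等标签键值均为假设），同时展示硬性与软性两种方式：

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # 硬性条件
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values: ["linux"]
      preferredDuringSchedulingIgnoredDuringExecution:  # 软性条件
        - weight: 50
          preference:
            matchExpressions:
              - key: cpu-cores          # 假设节点打了数值型标签
                operator: Gt
                values: ["8"]
  containers:
    - name: app
      image: nginx                      # 示例镜像
```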
<h2 id="调度方案">调度方案</h2>
<ol>
<li>
<p>逐个调度方案中设置<code>nodeAffinity</code>，需要在<a href="https://kubernetes.io/zh-cn/docs/reference/scheduling/config/">调度适配器</a>启用<a href="https://kubernetes.io/zh-cn/docs/reference/scheduling/config/#scheduling-plugins">NodeAffinity 插件</a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">kubescheduler.config.k8s.io/v1beta3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">KubeSchedulerConfiguration</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">profiles</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">schedulerName</span><span class="p">:</span><span class="w"> </span><span class="l">default-scheduler</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">schedulerName</span><span class="p">:</span><span class="w"> </span><span class="l">foo-scheduler</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">pluginConfig</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">NodeAffinity</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">args</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">addedAffinity</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">requiredDuringSchedulingIgnoredDuringExecution</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="nt">nodeSelectorTerms</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span>- <span class="nt">matchExpressions</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span>- <span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="l">scheduler-profile</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                      </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="l">In</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                      </span><span class="nt">values</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span>- <span class="l">foo</span><span class="w">
</span></span></span></code></pre></div></li>
</ol>
<h2 id="pod-调度就绪状态-v127-beta">Pod 调度就绪状态 (v1.27 beta)</h2>
<p>通过指定<code>.spec.schedulingGates</code>可以控制 Pod 何时可以被调度器考虑调度。当 Pod 就绪后，可以将该字段清空，然后对 Pod 执行更新操作。对于设置了调度门控的 Pod，允许在以下约束下更新其调度指令：</p>
<ol>
<li>对于<code>.spec.nodeSelector</code>，只允许添加条目；如果之前未设置，则允许设置</li>
<li>如果<code>spec.affinity.nodeAffinity</code>为 nil，则允许设置任意值</li>
<li>如果<code>NodeSelectorTerms</code>之前没有设置，则现在允许设置；如果之前已经设置了，那么仅允许添加对应的条目，不允许更新之前已经存在的条目</li>
<li>对于<code>preferredDuringSchedulingIgnoredDuringExecution</code>，所有更新现在都允许</li>
</ol>
<p>可以使用<code>scheduler_pending_pods{queue=&quot;gated&quot;}</code>指标查看被门控 Pod 的数量，需结合指标监控组件（如 Prometheus）使用。</p>
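一个带调度门控的 Pod 示例如下（门控名称为假设）；在所有 gate 被移除之前，该 Pod 会处于 SchedulingGated 状态：

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gated-demo
spec:
  schedulingGates:
    - name: example.com/infra-ready   # 假设的门控名，由外部控制器负责移除
  containers:
    - name: app
      image: nginx                    # 示例镜像
```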
<h2 id="pod-拓扑分布域">Pod 拓扑分布域</h2>
<p>拓扑分布域是一个类似分组的概念，按区域对集群中的节点进行分组，使整个大集群具备一定的抗风险与自愈能力。<code>.spec.topologySpreadConstraints</code>定义了 Pod 在集群可用区之间分布的调度规则。使用该功能最重要的前提是：如果集群自动为节点打的标签无法满足需求而需要自定义标签时，必须保证集群中所有节点使用一致的标签键，否则调度结果可能会匪夷所思，排查也会很困难。</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nn">---</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Pod</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">example-pod</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="c"># 配置一个拓扑分布约束</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">topologySpreadConstraints</span><span class="p">:</span><span class="w"> </span><span class="c"># 多个条目之间是逻辑与关系，默认无特别说明都是逻辑与</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">maxSkew</span><span class="p">:</span><span class="w"> </span><span class="l">&lt;integer&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">minDomains</span><span class="p">:</span><span class="w"> </span><span class="l">&lt;integer&gt;</span><span class="w"> </span><span class="c"># 可选；自从 v1.25 开始成为 Beta</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">topologyKey</span><span class="p">:</span><span class="w"> </span><span class="l">&lt;string&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">whenUnsatisfiable</span><span class="p">:</span><span class="w"> </span><span class="l">&lt;string&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">labelSelector</span><span class="p">:</span><span class="w"> </span><span class="l">&lt;object&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">matchLabelKeys</span><span class="p">:</span><span class="w"> </span><span class="l">&lt;list&gt;</span><span class="w"> </span><span class="c"># 可选；自从 v1.27 开始成为 Beta</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">nodeAffinityPolicy</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">Honor|Ignore]</span><span class="w"> </span><span class="c"># 可选；自从 v1.26 开始成为 Beta</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">nodeTaintsPolicy</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">Honor|Ignore]</span><span class="w"> </span><span class="c"># 可选；自从 v1.26 开始成为 Beta</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="l">...</span><span class="w">
</span></span></span></code></pre></div><ol>
<li><code>maxSkew</code> 描述 Pod 分布不均的最大允许程度，其含义取决于<code>whenUnsatisfiable</code>
<ol>
<li><code>whenUnsatisfiable: DoNotSchedule</code> 时，它是目标拓扑域与 Pod 数量最少的拓扑域之间允许的最大 Pod 数量差值</li>
<li><code>whenUnsatisfiable: ScheduleAnyway</code> 该调度器会更为偏向能够降低偏差值的拓扑域</li>
</ol>
</li>
<li><code>minDomains</code> 表示符合条件的域的最小数量</li>
<li><code>topologyKey</code> 节点标签 key</li>
<li><code>whenUnsatisfiable</code>: <code>DoNotSchedule</code>(default), <code>ScheduleAnyway</code></li>
<li><code>labelSelector</code> 用于查找匹配的 Pod</li>
<li><code>matchLabelKeys</code> 用于选择需要计算分布方式的 Pod 集合，<code>matchLabelKeys</code>和<code>labelSelector</code>不允许有相同的 key</li>
<li><code>nodeAffinityPolicy</code>: <code>Honor</code>, <code>Ignore</code></li>
<li><code>nodeTaintsPolicy</code>: <code>Honor</code>, <code>Ignore</code></li>
</ol>
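上述字段组合起来的一个具体示例如下（标签与域键为常见取值，名称为假设），要求匹配 <code>app: demo</code> 的 Pod 在各可用区之间的数量差不超过 1：

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-demo
  labels:
    app: demo                                   # 假设的应用标签
spec:
  topologySpreadConstraints:
    - maxSkew: 1                                # 任意两个可用区间的 Pod 数差值不超过 1
      topologyKey: topology.kubernetes.io/zone  # 按可用区标签划分拓扑域
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: demo
  containers:
    - name: app
      image: nginx                              # 示例镜像
```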
<h3 id="隐式约定">隐式约定</h3>
<ol>
<li>只有与新来的 Pod 具有相同命名空间的 Pod 才能作为匹配候选者</li>
<li>调度器会忽略没有任何 <code>topologySpreadConstraints[*].topologyKey</code> 的节点，意味着这些节点上的 Pod 不会影响<code>maxSkew</code>的计算</li>
<li><code>topologySpreadConstraints[*].labelSelector</code>与 Pod 自身的标签不匹配，将不会改善集群的不平衡程度</li>
</ol>
<h3 id="集群级别的默认约束">集群级别的默认约束</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">kubescheduler.config.k8s.io/v1beta3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">KubeSchedulerConfiguration</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">profiles</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">schedulerName</span><span class="p">:</span><span class="w"> </span><span class="l">default-scheduler</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">pluginConfig</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">PodTopologySpread</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">args</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">defaultConstraints</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span>- <span class="nt">maxSkew</span><span class="p">:</span><span class="w"> </span><span class="m">1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="nt">topologyKey</span><span class="p">:</span><span class="w"> </span><span class="l">topology.kubernetes.io/zone</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="nt">whenUnsatisfiable</span><span class="p">:</span><span class="w"> </span><span class="l">ScheduleAnyway</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">defaultingType</span><span class="p">:</span><span class="w"> </span><span class="l">List</span><span class="w">
</span></span></span></code></pre></div><p>为集群设置默认的拓扑约束，可以在以下条件满足时被应用到 Pod 上：</p>
<ol>
<li>Pod 没有在其 <code>.spec.topologySpreadConstraints</code> 中定义任何约束</li>
<li>Pod 隶属于某个 Service、ReplicaSet、StatefulSet 或 ReplicationController</li>
</ol>
<h4 id="内置的默认约束-v124-stable">内置的默认约束 (v1.24 stable)</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">defaultConstraints</span><span class="p">:</span><span class="w"> </span><span class="c"># 可以被用户覆写</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">maxSkew</span><span class="p">:</span><span class="w"> </span><span class="m">3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">topologyKey</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;kubernetes.io/hostname&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">whenUnsatisfiable</span><span class="p">:</span><span class="w"> </span><span class="l">ScheduleAnyway</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">maxSkew</span><span class="p">:</span><span class="w"> </span><span class="m">5</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">topologyKey</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;topology.kubernetes.io/zone&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">whenUnsatisfiable</span><span class="p">:</span><span class="w"> </span><span class="l">ScheduleAnyway</span><span class="w">
</span></span></span></code></pre></div><h3 id="已知的局限-截止-v1293">已知的局限 (截止 v1.29.3)</h3>
<ol>
<li>当缩减某个控制器控制的 Pod 时，现阶段的方案无法保证 Pod 分布均衡</li>
<li>具有污点的节点上匹配的 Pod 也会被统计</li>
<li>需要保证每个拓扑域（节点组）中至少有一个可用节点，否则调度可能会异常</li>
</ol>
]]></content>
		</item>
		
		<item>
			<title>Git Usages</title>
			<link>/posts/git-usages/</link>
			<pubDate>Sun, 31 Mar 2024 11:40:02 +0800</pubDate>
			
			<guid>/posts/git-usages/</guid>
			<description><![CDATA[%!s(<nil>)]]></description>
			<content type="html"><![CDATA[<p>This file will list some git non-common cases.</p>
<h2 id="ignore-the-committed-files">Ignore the committed files</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># Add the target files into .gitignore</span>
</span></span><span class="line"><span class="cl"><span class="nb">echo</span> &lt;somefile or folder&gt; &gt;&gt; .gitignore
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Remove the files from the git index (not the actual files in the working dir)</span>
</span></span><span class="line"><span class="cl">git rm -r --cached .
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Add these removals to the staging area</span>
</span></span><span class="line"><span class="cl">git add .
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Commit them</span>
</span></span><span class="line"><span class="cl">git commit -m <span class="s2">&#34;Clean up ignored files&#34;</span>
</span></span></code></pre></div><h2 id="rewrite-email-of-the-history">Rewrite email of the history</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">brew install git-filter-repo
</span></span><span class="line"><span class="cl">git filter-repo --email-callback <span class="s1">&#39;
</span></span></span><span class="line"><span class="cl"><span class="s1">      if b&#34;old_email@example.com&#34; in email:
</span></span></span><span class="line"><span class="cl"><span class="s1">          return email.replace(b&#34;old_email@example.com&#34;, b&#34;new_email@example.com&#34;)
</span></span></span><span class="line"><span class="cl"><span class="s1">      else:
</span></span></span><span class="line"><span class="cl"><span class="s1">          return email
</span></span></span><span class="line"><span class="cl"><span class="s1">  &#39;</span> --force
</span></span></code></pre></div>]]></content>
		</item>
		
		<item>
			<title>Volume Storage Summary</title>
			<link>/posts/volume-%E5%AD%98%E5%82%A8%E6%80%BB%E7%BB%93/</link>
			<pubDate>Sat, 30 Mar 2024 15:20:31 +0800</pubDate>
			
			<guid>/posts/volume-%E5%AD%98%E5%82%A8%E6%80%BB%E7%BB%93/</guid>
			<description><![CDATA[]]></description>
			<content type="html"><![CDATA[<p><img src="/imgs/the-volume-data.png" alt="the-volume-data"></p>
<p>Volumes are the class of objects used to support applications with storage requirements deployed in a Kubernetes cluster.</p>
<p><img src="/imgs/pod-pvc-pv.jpg" alt="pod-pvc-pv"></p>
<h2 id="存储分类">Storage categories</h2>
<ol>
<li>Persistent storage
<ol>
<li>Static provisioning: a batch of PVs can be provisioned in advance according to volume mode, capacity, class, reclaim policy, access modes, mount options, and some alpha features (throughput, IOPS, etc.)</li>
<li>Dynamic provisioning: requires a provisioner you develop yourself, or third-party storage</li>
</ol>
</li>
<li>Ephemeral storage</li>
</ol>
<h2 id="卷类型">Volume types</h2>
<ol>
<li>emptyDir: created when a Pod is assigned to a node; initially empty and not persistent</li>
<li>fc: Fibre Channel</li>
<li>hostPath: maps a path on the node&rsquo;s host filesystem into the container; this carries a security risk, because any location on the host can be mounted</li>
<li>iscsi: deleting the volume preserves the data; the volume is merely unmounted. Drawback: simultaneous writers are not allowed, only a single Pod may mount it read-write</li>
<li>local: can only be used as a statically created PersistentVolume (dynamic provisioning is not supported); represents a mounted local storage device such as a disk, partition, or directory</li>
<li>nfs: a network file storage service provided by Linux</li>
<li>&hellip;</li>
</ol>
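<p>As a small sketch of the first type above, an emptyDir volume mounted into a Pod might look like this (names and image are illustrative):</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: v1
kind: Pod
metadata:
  name: cache-demo
spec:
  containers:
    - name: app
      image: nginx:1.25
      volumeMounts:
        - name: scratch
          mountPath: /cache
  volumes:
    - name: scratch
      emptyDir: {}   # created empty when the Pod lands on a node; deleted with the Pod
</code></pre>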
<h2 id="卷模式-volumemode">Volume mode (volumeMode)</h2>
<ol>
<li>Filesystem (default)</li>
<li>Block: a raw block device</li>
</ol>
<h2 id="存储容量">Storage capacity</h2>
<p>For a PVC that is already bound to a PV, capacity changes are not permitted (other PVC fields cannot be changed either)</p>
<h2 id="存储类-storageclass">StorageClass</h2>
<p>Allows custom storage classes to be defined. When a PVC does not specify a storage class, the cluster&rsquo;s default StorageClass (if one exists) is used directly; if the cluster has no default, such PVCs stay unbound until a default StorageClass is added, at which point it can be used</p>
<h2 id="回收策略">Reclaim policies</h2>
<ol>
<li>Retain (default): when the PVC is deleted, the PV enters the Released state and waits for manual intervention by an administrator</li>
<li>Delete: when the PVC is deleted, the underlying storage is deleted automatically; the deletion must be implemented by the storage backend, otherwise it fails and the PV becomes Failed</li>
<li>Recycle (rm -rf /thevolume/*): a basic scrub; as of v1.29 only <code>nfs</code> and <code>hostPath</code> support it</li>
</ol>
<h2 id="访问方式">Access modes</h2>
<ol>
<li>ReadWriteOnce (RWO): the volume can be mounted read-write by a single node</li>
<li>ReadOnlyMany (ROX): the volume can be mounted read-only by many nodes</li>
<li>ReadWriteMany (RWX): the volume can be mounted read-write by many nodes</li>
<li>ReadWriteOncePod (RWOP) (v1.29 stable): the volume can be mounted read-write by a single Pod</li>
</ol>
<blockquote>
<p>NOTE:
The access mode does not actually enforce the corresponding read/write restrictions; the real limits depend on the storage type and storage service in use</p>
</blockquote>
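<p>A minimal PVC requesting RWO access (the storage class name is illustrative):</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce      # mountable read-write by one node
  volumeMode: Filesystem
  storageClassName: standard
  resources:
    requests:
      storage: 5Gi
</code></pre>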
<h2 id="挂载选项-mountoptions">Mount options (mountOptions)</h2>
<p><code>volume.beta.kubernetes.io/mount-options</code>: this annotation may be fully deprecated in the future</p>
<ol>
<li>nfs: a network file storage service provided by Linux</li>
<li>iscsi: same as above</li>
<li>vsphereVolume</li>
<li>azureFile</li>
</ol>
<h2 id="节点亲和性">Node affinity</h2>
<p><code>spec.nodeAffinity</code> must be set explicitly for <code>local</code> volumes</p>
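<p>For example, a local PV pinned to a single node via node affinity might be sketched like this (node name and path are illustrative):</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1
  nodeAffinity:              # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1
</code></pre>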
<h2 id="持久卷状态">Persistent volume phases</h2>
<ol>
<li>Available: the volume is a free resource, not yet bound to any PVC</li>
<li>Bound: the volume is bound to a PVC</li>
<li>Released: the bound PVC has been deleted, but the associated storage resource has not yet been reclaimed</li>
<li>Failed: automatic reclamation of the volume failed</li>
</ol>
<h2 id="卷快照-volumesnapshots">Volume snapshots (VolumeSnapshots)</h2>
<p>Volume snapshots give cluster users a standard way to copy the contents of a volume at a specified point in time without creating an entirely new volume, e.g. taking a snapshot of the volume before running updates or deletes against a database.</p>
<ul>
<li>Provisioning: can be pre-provisioned or dynamically provisioned</li>
<li>Deletion: deleting a VolumeSnapshot also triggers deletion of its VolumeSnapshotContent</li>
<li>VolumeSnapshotClass: snapshot class objects can be defined from parameters such as reclaim policy, access mode, and driver</li>
</ul>
<h2 id="csi-卷克隆">CSI volume cloning</h2>
<p>Volume cloning copies the source data into the new volume; use <code>dataSource</code> or <code>dataSourceRef</code> to reference an existing PVC in the same namespace</p>
<h2 id="跨命名空间使用卷">Using volumes across namespaces</h2>
<p>Create a ReferenceGrant object declaring <code>from</code> and <code>to</code></p>
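<p>As a sketch (the API version and namespaces are assumptions; ReferenceGrant comes from the Gateway API project), a grant allowing PVCs in one namespace to reference snapshots in another might look like:</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-pvc-datasource
  namespace: prod            # namespace that owns the referenced objects
spec:
  from:
    - group: &#34;&#34;
      kind: PersistentVolumeClaim
      namespace: dev         # namespace allowed to make the reference
  to:
    - group: snapshot.storage.k8s.io
      kind: VolumeSnapshot
</code></pre>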
<h2 id="projected-volumes">Projected Volumes</h2>
<ol>
<li>secret</li>
<li>configMap</li>
<li>downwardAPI: used to mount Pod metadata inside the container</li>
<li>serviceAccountToken: the service account token volume</li>
<li>clusterTrustBundle (v1.29 alpha): a cluster-scoped object that distributes X.509 root certificates to objects within the cluster</li>
</ol>
<h2 id="volume-attributes-classes-v129-alpha">Volume Attributes Classes (v1.29 alpha)</h2>
<p>Allows storage device attributes such as IOPS and throughput to be configured</p>
<h2 id="临时卷">Ephemeral volumes</h2>
<ol>
<li>Types
<ol>
<li>emptyDir</li>
<li>configMap, downwardAPI, secret</li>
<li>CSI ephemeral volumes</li>
<li>Generic ephemeral volumes</li>
</ol>
</li>
<li>Their lifecycle follows the Pod: deleting the Pod also deletes the volume, unlike volumes in Docker</li>
</ol>
<h2 id="卷健康监控">Volume health monitoring</h2>
<p>Requires enabling <code>CSIVolumeHealth</code>; supports reporting node failure events and node-side volume health checks</p>
]]></content>
		</item>
		
		<item>
			<title>Understanding StatefulSet</title>
			<link>/posts/statefulset-%E7%90%86%E8%A7%A3/</link>
			<pubDate>Fri, 29 Mar 2024 22:20:31 +0800</pubDate>
			
			<guid>/posts/statefulset-%E7%90%86%E8%A7%A3/</guid>
			<description><![CDATA[]]></description>
			<content type="html"><![CDATA[<h2 id="核心要点">Core points</h2>
<p>To use the StatefulSet controller, you need a headless Service (<code>clusterIP: None</code>); it provides four properties for stateful applications</p>
<ol>
<li>Stable network identity</li>
<li>Stable persistent storage</li>
<li>Ordered scale-up: Pod replicas are deployed in order from 0 to N-1; Pod i is deployed only after Pods 0 through i-1 are Running</li>
<li>Ordered scale-down: Pod replicas are reclaimed in order from N-1 down to 0; Pod i is reclaimed only after Pods i+1 through N-1 have been reclaimed</li>
</ol>
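<p>A minimal sketch of the headless Service plus StatefulSet pairing described above (names and image are illustrative):</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  clusterIP: None        # headless: gives Pods stable DNS names like web-0.nginx
  selector:
    app: nginx
  ports:
    - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: nginx     # must reference the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
</code></pre>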
<h2 id="有序的扩缩容">Ordered scale-up and scale-down</h2>
<p><img src="/imgs/sts-ordered-pod.jpg" alt="sts-ordered-pod"></p>
<p>All Pods created by a StatefulSet controller are scaled up from the lowest ordinal to the highest and scaled down in reverse. If a previously created or reclaimed Pod fails during the process, the scale-up or scale-down does not continue until that problematic Pod reaches the Running state or is successfully reclaimed</p>
<h2 id="稳定的网络标识">Stable network identity</h2>
<p><img src="/imgs/sts-stable-domain.jpg" alt="sts-stable-domain"></p>
<p>Because Pod names ${podname}-{N} are fixed, the cluster&rsquo;s DNS plugin can provide these Pods with stable name resolution, even after a Pod is reclaimed and rebuilt, e.g.:</p>
<ol>
<li><code>web-0.nginx.default.svc.cluster.local</code></li>
<li><code>web-0.nginx</code></li>
</ol>
<h2 id="稳定的网络存储">Stable persistent storage</h2>
<p><img src="/imgs/sts-stable-volume.jpg" alt="sts-stable-volume"></p>
<p>Likewise, because Pod names are fixed, a PV bound through a PVC automatically re-attaches to the Pod with the same name when that Pod is deleted and rebuilt</p>
<h2 id="其他要点">Other points</h2>
<ol>
<li><code>spec.updateStrategy.rollingUpdate.maxUnavailable</code> [1.24 alpha] (default 1): controls the maximum number of unavailable Pods during an update; it can be an absolute number or a percentage such as <code>10%</code>, but not 0</li>
<li><code>spec.minReadySeconds</code> (unit: s, default: 0): the minimum ready time; tune it for your application, otherwise Pods may end up waiting indefinitely, since <code>*Probe</code>s and container startup hooks also gate container readiness</li>
<li>If the Pod template cannot reach the <code>Running</code> or <code>Ready</code> state and an administrator must force-delete, the safe procedure is to first set <code>replicas=0</code>, then perform the force deletion, and only update <code>replicas</code> again after the Pod template has been fixed</li>
</ol>
]]></content>
		</item>
		
		<item>
			<title>ServiceAccount, Secret, ConfigMap</title>
			<link>/posts/serviceaccount-secret-configmap/</link>
			<pubDate>Thu, 28 Mar 2024 20:20:31 +0800</pubDate>
			
			<guid>/posts/serviceaccount-secret-configmap/</guid>
			<description><![CDATA[]]></description>
			<content type="html"><![CDATA[<h2 id="serviceaccount">ServiceAccount</h2>
<h3 id="应用场景">Use cases</h3>
<ol>
<li>Provide read-only access to sensitive information stored in a <code>Secret</code></li>
<li>Grant cross-namespace access permissions</li>
<li>A Pod needs to communicate with an external server</li>
<li>Authenticate against a private image registry via <code>imagePullSecret</code></li>
<li>An external service needs to communicate with the kube-apiserver, e.g. CI/CD</li>
<li>Third-party software on the cluster that relies on the ServiceAccounts of different Pods</li>
</ol>
<h3 id="手动获取-serviceaccount-token">Obtaining a ServiceAccount token manually</h3>
<ol>
<li><code>TokenRequest API</code> (recommended)</li>
<li>Mounting as a volume (recommended)</li>
<li>ServiceAccount token Secret (not recommended): these tokens never expire, which is a security risk; <code>LegacyServiceAccountTokenNoAutoGeneration</code> is enabled by default</li>
</ol>
<h2 id="secret-的应用">Using Secrets</h2>
<p>A Secret separates small amounts of sensitive data (no more than 1 MiB) from the application. It is the Kubernetes object for storing sensitive information such as passwords, tokens, or keys, and it can be mounted into containers as files or injected as environment variables. A ConfigMap can be used to store configuration files (also no more than 1 MiB).</p>
<ol>
<li>Injecting environment variables at container startup</li>
<li>Providing SSH keys, passwords, and similar data to Pods</li>
<li>Pulling images from a private registry</li>
</ol>
<h2 id="secret-的内置类型">Built-in Secret types</h2>
<p>Data of the following types is base64-encoded and automatically decoded on use; no user intervention is needed</p>
<ol>
<li><code>Opaque</code>: arbitrary user-defined data</li>
<li><code>kubernetes.io/service-account-token</code>: a ServiceAccount token</li>
<li><code>kubernetes.io/dockercfg</code>: ~/.docker/config.json (legacy format)</li>
<li><code>kubernetes.io/dockerconfigjson</code>: ~/.docker/config.json (current format)</li>
<li><code>kubernetes.io/basic-auth</code>: credentials for basic authentication</li>
<li><code>kubernetes.io/ssh-auth</code>: private-key credentials for SSH authentication</li>
<li><code>kubernetes.io/tls</code>: data for a TLS client or server</li>
<li><code>bootstrap.kubernetes.io/token</code>: bootstrap token data</li>
</ol>
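<p>A minimal Opaque Secret, with values base64-encoded as described above (the credentials are of course illustrative):</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
data:
  username: YWRtaW4=     # base64 of admin
  password: czNjcjN0     # base64 of s3cr3t
</code></pre>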
<h2 id="secret-的其他要点">Other Secret notes</h2>
<ol>
<li>Follow the principle of least-privilege access: <code>kubernetes.io/enforce-mountable-secrets: &quot;true&quot;</code></li>
<li>Mark the Secret as immutable: <code>immutable: true</code></li>
</ol>
<h2 id="secret-的替代方案">Alternatives to Secrets</h2>
<ol>
<li>A ServiceAccount can be used instead</li>
<li>A <a href="https://kubernetes.io/zh-cn/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#examples">device plugin</a> can be used instead</li>
<li>For authentication, a customized <code>CertificateSigningRequest</code> signing flow can issue certificates to Pods</li>
<li>Third-party tools can be used to supply confidential data</li>
</ol>
<h2 id="secret-使用的良好实践">Good practices for using Secrets</h2>
<h3 id="集群管理员">Cluster administrators</h3>
<ol>
<li>Configure encryption at rest: configure the kube-apiserver with an <code>EncryptionConfiguration</code> object to symmetrically encrypt and decrypt Secrets</li>
<li>Configure least-privilege access to Secret resources
<ul>
<li>Components: restrict access to only the most privileged system-level components, and grant permissions only where the component&rsquo;s normal behavior requires them</li>
<li>Humans: restrict access to Secrets; allow only cluster administrators to access etcd</li>
<li>Audit the users and Pods that access Secrets (users can reach Secrets through Pods)</li>
</ul>
</li>
<li>Use the <code>kubernetes.io/enforce-mountable-secrets</code> annotation on a ServiceAccount to enforce specific rules about how Secrets are used in Pods</li>
<li>Improve etcd management policies: when etcd&rsquo;s persistent storage is no longer in use, consider formatting the corresponding device; use SSL/TLS between etcd instances to protect Secrets in transit</li>
<li>Configure access to external Secrets: a third-party secrets solution can be used, with Pods configured to access that data</li>
</ol>
<h3 id="开发者">Developers</h3>
<ol>
<li>Restrict Secret access to the specific containers that need it: if only one container in a Pod needs a Secret, mount it as a volume or expose it as environment variables in that container only, so the other containers cannot access it</li>
<li>Protect Secret data after reading it: applications must avoid logging Secret data in clear text and must avoid transmitting it to untrusted parties</li>
<li>Avoid sharing Secret manifests; beware that a Secret manifest contains only base64-encoded, not encrypted, data</li>
</ol>
<h2 id="configmap-的应用">Using ConfigMaps</h2>
<p>A ConfigMap can provide configuration file data to applications, e.g.</p>
<ol>
<li>An <code>nginx</code> configuration file</li>
<li>Configuration files an application needs at startup</li>
</ol>
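<p>For instance, an nginx configuration could be carried in a ConfigMap roughly like this (the snippet and names are illustrative):</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  default.conf: |        # mount this key as a file under /etc/nginx/conf.d
    server {
      listen 80;
      location / {
        root /usr/share/nginx/html;
      }
    }
</code></pre>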
<h2 id="secret-与-configmap-的对比">Secret vs. ConfigMap</h2>
<p>Similarities:</p>
<ol>
<li>Both can be injected into containers as environment variables</li>
<li>Both can be mounted into containers as files</li>
<li>Both are limited to 1 MiB in size</li>
<li>Both can be set to <code>immutable: true</code></li>
<li>Access to both can be governed by separately configured permissions</li>
<li>When mounted as files, both support hot reloading, but a container using <code>subPath</code> will not receive updates; and with <code>immutable: true</code> set, updates are disabled entirely</li>
</ol>
<p>Differences:</p>
<ol>
<li>A Secret stores data that has been base64-encoded or processed by some other encryption mechanism</li>
<li>A ConfigMap stores raw data, without any processing</li>
<li>A ConfigMap must hold key-value data in order to be injected into containers as environment variables</li>
<li>For a Secret, you choose the environment variable names used to inject its values into the container</li>
<li>A Secret can use a customized mechanism to encrypt and decrypt confidential data</li>
<li>Secrets store sensitive information; ConfigMaps store non-sensitive information</li>
</ol>
]]></content>
		</item>
		
		<item>
			<title>CronJob Summary</title>
			<link>/posts/cronjob-%E6%80%BB%E7%BB%93/</link>
			<pubDate>Wed, 27 Mar 2024 18:30:31 +0800</pubDate>
			
			<guid>/posts/cronjob-%E6%80%BB%E7%BB%93/</guid>
			<description><![CDATA[]]></description>
			<content type="html"><![CDATA[<p>v1.21 stable</p>
<p>A CronJob is used for performing regularly scheduled actions such as backups, report generation, and so on.</p>
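<p>As a sketch of the fields discussed below (the schedule, image, and names are illustrative):</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: &#34;0 2 * * *&#34;          # 02:00 every day
  startingDeadlineSeconds: 120    # skip the run if it cannot start within 2 minutes
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: busybox:1.36
              command: [&#34;sh&#34;, &#34;-c&#34;, &#34;echo backing up&#34;]
</code></pre>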
<h2 id="schedule-syntax">Schedule syntax</h2>
<pre tabindex="0"><code class="language-cronjob" data-lang="cronjob"># ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday)
# │ │ │ │ │                                   OR sun, mon, tue, wed, thu, fri, sat
# │ │ │ │ │
# │ │ │ │ │
# * * * * *
</code></pre><h2 id="deadline-for-delayed-job-start">Deadline for delayed Job start</h2>
<p><code>spec.startingDeadlineSeconds</code></p>
<ul>
<li>After missing the deadline, the CronJob skips that instance of the Job (future occurrences are still scheduled)</li>
<li>Kubernetes treats Jobs that miss their configured deadline as failed Jobs. Without <code>startingDeadlineSeconds</code>, there is no deadline for the CronJob.</li>
<li>If the <code>spec.startingDeadlineSeconds</code> field is set, the CronJob controller measures the time between when a Job was expected to be created and now. If the difference exceeds that limit, it skips this execution.</li>
</ul>
<blockquote>
<p>Note: if startingDeadlineSeconds is set to a value lower than 10 seconds, the CronJob may not be scheduled, because the CronJob controller runs its checks only once every 10 seconds.</p>
</blockquote>
<h2 id="concurrency-policy">Concurrency policy</h2>
<p><code>spec.concurrencyPolicy</code> specifies how to treat concurrent executions of a Job that is created.</p>
<ul>
<li>Allow (default): allow concurrent runs.</li>
<li>Forbid: do not allow concurrent runs; if the previous run has not finished, the new run is skipped.</li>
<li>Replace: if the previous Job run has not finished, cancel it and replace it with the new one.</li>
</ul>
<h2 id="schedule-suspension">Schedule suspension</h2>
<p><code>spec.suspend = true</code> for this case. When <code>spec.suspend</code> changes from <code>true</code> to <code>false</code> on an existing CronJob without a <code>spec.startingDeadlineSeconds</code>, the missed Jobs are scheduled immediately.</p>
<h2 id="jobs-history-limits">Jobs history limits</h2>
<p><code>spec.successfulJobsHistoryLimit = 3</code>, <code>spec.failedJobsHistoryLimit = 1</code>. These fields specify how many completed and failed jobs should be kept.</p>
<h2 id="time-zones-v127-stable">Time zones (v1.27 stable)</h2>
<p><code>spec.timeZone</code> names a valid time zone and instructs Kubernetes to interpret the schedule relative to that zone instead of the controller&rsquo;s local time.</p>
<h2 id="cronjob-limitations">CronJob limitations</h2>
<ul>
<li>
<p>Unsupported TimeZone specification</p>
<p><code>CRON_TZ</code> or <code>TZ</code> is not officially supported, and starting with v1.29, Kubernetes will fail to create the resource with a validation error.</p>
</li>
<li>
<p>Modify a CronJob</p>
<p>Modifying an existing CronJob has no effect on Jobs that are already running; only Jobs created afterwards pick up the changes.</p>
</li>
<li>
<p>Job creation</p>
<p>The Jobs a user defines should be idempotent. If more than 100 scheduled runs have been missed between the last scheduled time and now, no new Job is created.</p>
</li>
</ul>
]]></content>
		</item>
		
		<item>
			<title>Job Summary</title>
			<link>/posts/job-%E6%80%BB%E7%BB%93/</link>
			<pubDate>Tue, 26 Mar 2024 17:25:31 +0800</pubDate>
			
			<guid>/posts/job-%E6%80%BB%E7%BB%93/</guid>
			<description><![CDATA[]]></description>
			<content type="html"><![CDATA[<h2 id="默认启动的-job非并发">Jobs started with defaults (non-parallel)</h2>
<ol>
<li>Normally only one Pod is started, unless that Pod fails</li>
<li>The Job is considered complete as soon as the Pod terminates successfully</li>
</ol>
<p>Key fields:</p>
<ul>
<li><code>spec.completions = 1</code> (default 1)</li>
<li><code>spec.parallelism = 1</code> (default 1)</li>
</ul>
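<p>A minimal non-parallel Job under these defaults (image and command are illustrative):</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: one-shot
spec:
  backoffLimit: 6            # default retry budget
  template:
    spec:
      restartPolicy: Never   # let the Job controller handle retries
      containers:
        - name: task
          image: busybox:1.36
          command: [&#34;sh&#34;, &#34;-c&#34;, &#34;echo done&#34;]
</code></pre>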
<h2 id="并行的-job">Parallel Jobs</h2>
<ol>
<li>Parallel Jobs with a fixed completion count
<ul>
<li><code>spec.completions</code> &gt; 0; <code>spec.parallelism</code> may be set or left at its default of 1</li>
<li>The Job is complete when the number of successful Pods (exitCode = 0) reaches <code>spec.completions</code></li>
<li>With <code>spec.completionMode = Indexed</code>, each Pod gets a Job index in the range 0 to <code>spec.completions</code>-1</li>
</ul>
</li>
<li>Parallel Jobs with a work queue
<ul>
<li>Leave <code>spec.completions</code> unset and set <code>spec.parallelism</code> (&gt;= 0)</li>
<li>Pods must coordinate among themselves, or an external service must decide which item(s) each Pod works on</li>
<li>Each Pod can determine whether its peers are done, and therefore whether the Job is complete</li>
<li>Once any Pod terminates successfully, no new Pods are created</li>
<li>The Job is complete once at least one Pod has terminated successfully and all other Pods have terminated</li>
<li>Once one Pod has exited successfully, no other Pod should keep working on the task; they should all be in the process of exiting</li>
</ul>
</li>
</ol>
<p>Controlling parallelism:</p>
<p><code>spec.parallelism &gt;= 0</code>; if it is 0, the Job is effectively paused as soon as it starts. The actual number of Pods running at any moment may be slightly higher or lower than <code>spec.parallelism</code>, for the following reasons:</p>
<ul>
<li>For fixed-completion-count Jobs, the number of Pods running in parallel does not exceed the number of remaining completions</li>
<li>For work-queue Jobs, no new Pods are started after any Pod has succeeded; the Pods already running are allowed to finish</li>
<li>The Job controller may not have had time to react, or it may be unable to create Pods for some reason (insufficient resources, missing ResourceQuota, lack of permissions), leaving fewer Pods than requested</li>
<li>The Job controller may throttle new Pod creation because of excessive previous Pod failures in the same Job</li>
<li>When a Pod is in the process of graceful termination, it takes time to actually stop</li>
</ul>
<h2 id="job-completion-mode">Job Completion Mode</h2>
<p><code>spec.completions</code> &gt; 0 &amp;&amp; <code>spec.completionMode</code> in (<code>NonIndexed</code>, <code>Indexed</code>)</p>
<ul>
<li>NonIndexed (default)</li>
<li>Indexed, get this index value through four mechanisms
<ol>
<li>Pod annotation <code>batch.kubernetes.io/job-completion-index</code></li>
<li>When <code>PodIndexLabel</code> (default enabled) feature gate enabled, Pod label <code>batch.kubernetes.io/job-completion-index</code> (&gt;= v1.28).</li>
<li>Pod hostname, $(job-name)-$(index). When using an Indexed Job in combination with a Service, Pods within the Job can use the deterministic hostnames to address each other via DNS. <a href="http://kubernetes.io/docs/tasks/job/job-with-pod-to-pod-communication/">Job with Pod-to-Pod Communication</a></li>
<li>For the containerized task, use environment variable <code>JOB_COMPLETION_INDEX</code></li>
</ol>
</li>
</ul>
<p>Reference: <a href="http://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode">http://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode</a></p>
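<p>As a sketch of the Indexed mode above, with each container reading its index from the environment (image is illustrative):</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-demo
spec:
  completions: 5
  parallelism: 3
  completionMode: Indexed      # Pods receive indexes 0..4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36
          command: [&#34;sh&#34;, &#34;-c&#34;, &#34;echo processing item $JOB_COMPLETION_INDEX&#34;]
</code></pre>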
<h2 id="handing-pod-and-container-failures">Handling Pod and container failures</h2>
<ul>
<li>Set <code>spec.template.spec.restartPolicy = OnFailure</code> and handle restarts in your application, or set <code>spec.template.spec.restartPolicy = Never</code></li>
<li>Your program must handle temporary files, locks, and incomplete output itself</li>
<li>Each Pod failure is counted against <code>spec.backoffLimit</code>.</li>
<li>Pod failures can also be counted per index via <code>spec.backoffLimitPerIndex</code></li>
<li>Even setting <code>spec.parallelism = 1</code> &amp;&amp; <code>spec.completions = 1</code> &amp;&amp; <code>spec.template.spec.restartPolicy = Never</code> cannot guarantee that the program runs exactly once.</li>
<li>You must handle concurrency yourself when <code>spec.parallelism &gt; 1</code> &amp;&amp; <code>spec.completions &gt; 1</code></li>
<li>With the feature gates <code>PodDisruptionCondition</code> and <code>JobPodFailurePolicy</code> enabled and <code>spec.podFailurePolicy</code> set, the Job controller does not treat a Pod with a <code>metadata.deletionTimestamp</code> as failed until the Pod terminates (<code>.status.phase</code> is <code>Failed</code> or <code>Succeeded</code>). Once the Pod terminates, the Job controller evaluates <code>.backoffLimit</code> and <code>.podFailurePolicy</code> for the relevant Job, and decides whether the now-terminated Pod counts as failed.</li>
<li>If none of the above applies, the Job controller counts a terminating Pod as an immediate failure, even if that Pod later terminates with <code>phase = Succeeded</code>.</li>
</ul>
<h2 id="pod-backoff-failure-policy">Pod backoff failure policy</h2>
<p>Set <code>spec.backoffLimit = X</code>, when Job retries count is <code>X</code>, the Job is marked failure.
Default <code>spec.backoffLimit = 6</code>，backoff time (10s, 20s, 40s) until 6m.</p>
<ul>
<li>Retries are counted as Pods with <code>status.phase = Failed</code></li>
<li>and, with <code>restartPolicy = OnFailure</code>, as container retries in Pods whose <code>status.phase</code> is <code>Pending</code> or <code>Running</code></li>
</ul>
<blockquote>
<p>Official suggestion: <code>restartPolicy = &quot;Never&quot;</code> and using a logging system to record logs.</p>
</blockquote>
<h2 id="backoff-limit-per-index">Backoff limit per index</h2>
<p>Feature gate <code>JobBackoffLimitPerIndex</code> should be enabled.</p>
<ul>
<li>Set <code>spec.backoffLimitPerIndex</code> to control retries for Pod failures per index</li>
<li>A failed index is added to <code>status.failedIndexes</code>; a completed index is added to <code>status.completedIndexes</code>, regardless of the <code>backoffLimitPerIndex</code> field</li>
<li>A failing index does not interrupt execution of the other indexes, but once all indexes have finished, if any index failed the overall Indexed Job is marked failed.</li>
<li>By setting <code>spec.maxFailedIndexes</code>, the Job controller terminates the entire Job as failed once that many indexes have failed, including the still-running Pods of that Job.</li>
</ul>
<h2 id="pod-failure-policy">Pod failure policy</h2>
<p>Requires the <code>JobPodFailurePolicy</code> feature gate; enabling <code>PodDisruptionConditions</code> is also recommended. Supported in v1.29.</p>
<p><code>spec.podFailurePolicy</code> enables k8s cluster to handle Pod failures based on the container exit codes and the Pod conditions.</p>
<p>It provides better control over Pod failures than the <a href="#pod-backoff-failure-policy">Pod backoff failure policy</a>, which is based on <code>spec.backoffLimit</code>.</p>
<ul>
<li>For avoiding unnecessary Pod restarts</li>
<li>Guarantee Job and ignore Pod failures caused by disruptions (eg. preemption, API-initiated eviction or taint-based eviction) so that don&rsquo;t count <code>spec.backoffLimit</code></li>
</ul>
<blockquote>
<p>Note: Because the Pod template specifies a restartPolicy: Never, the kubelet does not restart the main container in that particular Pod.</p>
</blockquote>
<p>Ignore action for failed Pods with condition <code>DisruptionTarget</code> excludes Pod disruptions from being counted towards <code>spec.backoffLimit</code></p>
<blockquote>
<p>Note: If the Job failed, either by the Pod failure policy or Pod backoff failure policy, and the Job is running multiple Pods, Kubernetes terminates all the Pods in that Job that are still Pending or Running.</p>
</blockquote>
<p>API requirements and semantics:</p>
<ul>
<li>Must define <code>spec.template.spec.restartPolicy = Never</code> for <code>spec.podFailurePolicy</code></li>
<li><code>spec.podFailurePolicy.rules</code> are evaluated in order. Once a rule matches a Pod failure, the remaining rules are ignored.</li>
<li><code>spec.podFailurePolicy.rules[*].onExitCodes.containerName</code> available for both containers and initContainers</li>
<li><code>spec.podFailurePolicy.rules[*].action</code>
<ul>
<li>FailJob</li>
<li>Ignore: relevant with <code>spec.backoffLimit</code></li>
<li>Count: relevant with <code>spec.backoffLimit</code></li>
<li>FailIndex: relevant with <a href="#backoff-limit-per-index">backoff limit per index</a></li>
</ul>
</li>
</ul>
<p>Reference: <a href="http://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy">http://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy</a></p>
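<p>A sketch of a Job carrying such a policy (the exit code and names are illustrative):</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy
spec:
  completions: 8
  parallelism: 2
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never       # required when podFailurePolicy is set
      containers:
        - name: main
          image: busybox:1.36
          command: [&#34;sh&#34;, &#34;-c&#34;, &#34;exit 0&#34;]
  podFailurePolicy:
    rules:
      - action: FailJob          # a specific exit code fails the whole Job
        onExitCodes:
          containerName: main
          operator: In
          values: [42]
      - action: Ignore           # disruptions are not counted against backoffLimit
        onPodConditions:
          - type: DisruptionTarget
</code></pre>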
<h2 id="job-termination-and-cleanup">Job termination and cleanup</h2>
<ul>
<li>A Job is interrupted when a Pod fails with <code>restartPolicy = Never</code> or a container exits in error with <code>restartPolicy = OnFailure</code>. Once <code>spec.backoffLimit</code> is reached, the entire Job is marked failed and any running Pods are terminated.</li>
<li>When <code>spec.activeDeadlineSeconds</code> is reached, all of the Job&rsquo;s running Pods are terminated and the Job status becomes <code>type: Failed</code> with <code>reason: DeadlineExceeded</code>.</li>
<li><code>spec.activeDeadlineSeconds</code> takes precedence over <code>spec.backoffLimit</code>: once the Job reaches the time limit (<code>activeDeadlineSeconds</code>), it is terminated even if <code>backoffLimit</code> has not yet been reached.</li>
</ul>
<h2 id="cleanup-finished-jobs-automatically-v123-stable">Cleanup finished jobs automatically (v1.23 stable)</h2>
<ul>
<li>TTL mechanism: <code>spec.ttlSecondsAfterFinished</code> cleans up finished Jobs (<code>Complete</code>, <code>Failed</code>), including cascading objects such as Pods.</li>
</ul>
<blockquote>
<p>Note: if finished Jobs are not cleaned up, cluster performance may degrade, or in the worst case the cluster can go offline due to this degradation. Using <code>LimitRanges</code> and <code>ResourceQuotas</code> is a better way to avoid this.</p>
</blockquote>
<p>Reference: <a href="http://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs">http://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs</a></p>
<p>Some examples:</p>
<ul>
<li>Specify this field in Job manifest</li>
<li>Manually set this field on existing, already finished Jobs</li>
<li>Use a mutating admission webhook to set this field dynamically at Job creation time (a cluster-administrator use case).</li>
<li>Use a mutating admission webhook to set this field dynamically after the Job has finished, which requires detecting the Job&rsquo;s <code>status</code>.</li>
<li>Write your own controller to manage the cleanup TTL for Jobs.</li>
</ul>
<p>Caveats:</p>
<ul>
<li>Updating TTL for finished Jobs: Kubernetes does not guarantee the outcome when the TTL is extended after it has already expired.</li>
<li>Time skew: since clocks aren&rsquo;t always correct and the TTL controller uses timestamps to decide when to clean up, skew may cause cleanup at the wrong time.</li>
</ul>
<h2 id="job-patterns">Job patterns</h2>
<p>Usage cases: like emails to be sent, or notification to be pushed, or frames to be rendered, or files to be transcoded, ranges of keys in a NoSQL to scan&hellip;</p>
<p>Different patterns for parallel computation, each with strengths and weaknesses. The tradeoffs are:</p>
<ul>
<li>A single job for all work items is better for large numbers of items.</li>
<li>Having each Pod process multiple work items is better for large numbers of items.</li>
<li>Several approaches use a work queue.</li>
<li>Some approaches associate the Job with a headless Service, so that worker Pods can address each other.</li>
</ul>
<p>Reference: <a href="http://kubernetes.io/docs/concepts/workloads/controllers/job/#job-patterns">http://kubernetes.io/docs/concepts/workloads/controllers/job/#job-patterns</a></p>
<h2 id="advanced-usage">Advanced Usage</h2>
<ul>
<li>Suspending a job: <code>spec.suspend = true</code> (v1.24 stable)</li>
<li>Mutable Scheduling Directives (v1.27 stable)</li>
<li>Specifying your own Pod selector: <code>spec.selector</code></li>
<li>Job tracking with finalizers: <code>batch.kubernetes.io/job-tracking</code> (v1.26 stable)</li>
<li>Elastic Indexed Jobs (v1.27 beta)
<ul>
<li>When the <code>ElasticIndexedJob</code> feature gate is disabled, <code>spec.completions</code> is immutable</li>
<li><code>spec.parallelism</code></li>
<li><code>spec.completions</code></li>
</ul>
</li>
<li>Delayed creation of replacement pods (v1.29 beta)
<ul>
<li>Feature gate <code>JobPodReplacementPolicy</code> enabled by default</li>
<li>To delay creation of replacement Pods until a failed Pod is fully terminated (<code>status.phase = Failed</code>), set <code>spec.podReplacementPolicy = Failed</code></li>
<li>Without <code>podFailurePolicy</code> set, <code>podReplacementPolicy</code> selects the <code>TerminatingOrFailed</code> replacement policy: the control plane creates replacement Pods immediately upon Pod deletion (as soon as the control plane sees that a Pod for this Job has <code>deletionTimestamp</code> set).</li>
</ul>
</li>
</ul>
<h2 id="alternatives">Alternatives</h2>
<ul>
<li>Bare Pods</li>
<li>Replication Controller</li>
<li>Single Job starts controller Pod</li>
</ul>
<h2 id="job-usage-conclusion-personally">Job usage conclusion (personally)</h2>
<p>You must handle business logic such as locks, retries, markers, and validation yourself to use Jobs safely. The Job controller does not reduce the development workload, because its Pods can restart or fail for many reasons, e.g. node eviction or a livenessProbe.</p>
<p>The strength of the Job controller is that it lets you scale the number of parallel tasks in a controlled way</p>
]]></content>
		</item>
		
		<item>
			<title>DaemonSet Applications</title>
			<link>/posts/daemonset-%E5%BA%94%E7%94%A8/</link>
			<pubDate>Mon, 25 Mar 2024 15:20:31 +0800</pubDate>
			
			<guid>/posts/daemonset-%E5%BA%94%E7%94%A8/</guid>
			<description><![CDATA[]]></description>
			<content type="html"><![CDATA[<p>A DaemonSet is a controller that ensures every node (or a subset of nodes) runs exactly one copy of a Pod. When a node joins the cluster, the corresponding Pod is created automatically; when a node is removed from the cluster, the corresponding Pod is reclaimed</p>
<h2 id="典型应用">Typical applications</h2>
<ol>
<li>A cluster daemon running on every node</li>
<li>A log-collection daemon running on every node</li>
<li>A monitoring daemon on every node, e.g. prometheus</li>
</ol>
<p>That is, start one DaemonSet per kind of daemon across all nodes; or deploy multiple DaemonSets for the same kind of daemon, each with a different identity to distinguish daemons for different resource types</p>
<h2 id="如何调度-daemonset-控制的-pods">How DaemonSet-managed Pods are scheduled</h2>
<p>Target nodes are matched via <code>spec.affinity.nodeAffinity</code>, in one of two ways</p>
<ol>
<li><code>requiredDuringSchedulingIgnoredDuringExecution</code>: the condition must be satisfied</li>
<li><code>preferredDuringSchedulingIgnoredDuringExecution</code>: a preference; even if it is unsatisfied, the Pod may still be scheduled onto the node</li>
</ol>
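<p>A minimal log-collector DaemonSet sketch (names, image, and toleration are illustrative):</p>
<pre tabindex="0"><code class="language-yaml" data-lang="yaml">apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      tolerations:               # also run on tainted control-plane nodes
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: agent
          image: fluent/fluent-bit:2.2
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
</code></pre>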
<h3 id="taints-and-tolerations-effect-说明">Taints and tolerations: Effect values</h3>
<ol>
<li><code>NoExecute</code></li>
<li><code>NoSchedule</code></li>
</ol>
<h2 id="daemon-pod-通信">Communicating with Daemon Pods</h2>
<p>Patterns for communicating with Pods in a DaemonSet:</p>
<ol>
<li><code>Push</code></li>
<li>NodeIP and a known port, e.g. hostPort</li>
<li>DNS</li>
<li><code>Service</code></li>
</ol>
<h2 id="更新-daemonset">Updating a DaemonSet</h2>
<p>If a node&rsquo;s labels change, the DaemonSet immediately adds Pods to newly matching nodes and deletes the Pods running on nodes that no longer match. You can modify the Pods a DaemonSet created, although not all Pod fields can be updated; also, the next time a new node joins the cluster, its Pod is created from the template the DaemonSet was initialized with.</p>
<h2 id="替代方案">Alternatives</h2>
<ol>
<li>init scripts, i.e. Linux system daemons</li>
<li>Bare Pods: creating Pods directly, which loses the high-availability properties</li>
<li>Static Pods: created by placing manifest files in a designated directory watched by the kubelet; may be deprecated in the future</li>
<li>Deployment: the most commonly used approach</li>
</ol>
]]></content>
		</item>
		
		<item>
			<title>Understanding Service</title>
			<link>/posts/service-%E7%90%86%E8%A7%A3/</link>
			<pubDate>Sun, 24 Mar 2024 21:25:31 +0800</pubDate>
			
			<guid>/posts/service-%E7%90%86%E8%A7%A3/</guid>
			<description><![CDATA[]]></description>
			<content type="html"><![CDATA[<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Service</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">my-service</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">selector</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">app.kubernetes.io/name</span><span class="p">:</span><span class="w"> </span><span class="l">MyApp</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">protocol</span><span class="p">:</span><span class="w"> </span><span class="l">TCP</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="m">80</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">targetPort</span><span class="p">:</span><span class="w"> </span><span class="m">9376</span><span class="w">
</span></span></span></code></pre></div><p>port: the Service&rsquo;s inbound port; targetPort: the target port to access on the backing Pods</p>
<h2 id="endpointslice">EndpointSlice</h2>
<ul>
<li>Target endpoints are linked to the Service via the label <code>kubernetes.io/service-name: my-service</code></li>
<li>If a slice reaches 100 endpoints, a new EndpointSlice is created and linked to the Service in the same way</li>
<li>When using Endpoints, anything beyond 1000 endpoints is truncated and the annotation <code>endpoints.kubernetes.io/over-capacity: truncated</code> is set</li>
</ul>
<h2 id="应用协议">Application protocols</h2>
<ul>
<li>SCTP</li>
<li>TCP</li>
<li>UDP</li>
</ul>
<!-- - HTTP(S)
- PROXY
- TLS -->
<h2 id="服务类型">Service types</h2>
<ul>
<li>ClusterIP (default): available in cluster</li>
<li>NodePort: available out of cluster, ClusterIP mode + Virtual IP mapping</li>
<li>LoadBalancer: cloud provider</li>
<li>ExternalName: for cluster used a dynamic external resource by ip or domain</li>
</ul>
<h2 id="type-clusterip">type: ClusterIP</h2>
<p><code>spec.clusterIP = None</code> makes a Headless Service; this field can also be used to specify your own IP</p>
<blockquote>
<p>Avoiding address conflicts: 1. IP address allocation is tracked; 2. Services get a dedicated virtual IP range. Even so, if the <code>clusterIP</code> you pick yourself falls inside the cluster's Service IP range it can still conflict with an allocated address; otherwise such conflicts are avoided</p>
</blockquote>
<p><img src="/imgs/svc-clusterip.jpg" alt="ClusterIP"></p>
<h2 id="type-nodeport">type: NodePort</h2>
<ul>
<li><code>--service-node-port-range</code> sets the port allocation range (default: 30000-32767)</li>
<li><code>nodePort</code> values are allocated from the high end of the range by default; users can pick specific ports from the low end</li>
</ul>
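<p>A NodePort Service sketch; the explicit <code>nodePort</code> is optional and must fall inside the configured range (names and values are illustrative):</p>
<pre tabindex="0"><code>apiVersion: v1
kind: Service
metadata:
  name: my-nodeport
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 9376
      nodePort: 30080   # pick a low port yourself, or omit to auto-allocate
</code></pre>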
<p><img src="/imgs/svc-nodeport.jpg" alt="NodePort"></p>
<h2 id="type-loadbalancer">type: LoadBalancer</h2>
<p><img src="/imgs/unlimited-ipvs.jpg" alt="UnlimitedIPVS"></p>
<p><img src="/imgs/svc-loadbalancer.jpg" alt="LoadBalancer"></p>
<h2 id="type-externalname">type: ExternalName</h2>
<p><img src="/imgs/svc-externalname.jpg" alt="ExternalName"></p>
<h2 id="headless-service">Headless Service</h2>
<p><code>clusterIP: None</code>; <code>port</code> should match <code>targetPort</code></p>
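<p>A minimal headless Service sketch (name and selector are illustrative):</p>
<pre tabindex="0"><code>apiVersion: v1
kind: Service
metadata:
  name: my-headless
spec:
  clusterIP: None        # headless: DNS returns the Pod IPs directly
  selector:
    app: my-app
  ports:
    - port: 9376
      targetPort: 9376
</code></pre>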
<h2 id="服务发现">Service discovery</h2>
<ul>
<li>Environment variables: ${SVCNAME}_SERVICE_HOST:${SVCNAME}_SERVICE_PORT</li>
<li>DNS
<ul>
<li>my-service.default.svc.cluster.local</li>
<li>_http._tcp.my-service.default.svc.cluster.local: SRV lookup for the named port "http" over protocol TCP</li>
</ul>
</li>
</ul>
<h2 id="虚拟-ip-寻址机制">Virtual IP addressing</h2>
<ul>
<li>
<p>Traffic policy</p>
<p><code>spec.externalTrafficPolicy</code> controls how Kubernetes routes traffic to healthy (&ldquo;Ready&rdquo;) backends</p>
</li>
</ul>
<h2 id="session-affinity">Session affinity</h2>
<ul>
<li><code>spec.sessionAffinity</code> makes a given client connect to the same Pod every time</li>
<li><code>spec.sessionAffinityConfig.clientIP.timeoutSeconds</code> (default 10800, i.e. 3 hours) sets the maximum session time for which a client sticks to the same Pod</li>
</ul>
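<p>The two fields combine as sketched below (timeout shown at its default):</p>
<pre tabindex="0"><code>spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800   # 3 hours; max time a client sticks to one Pod
</code></pre>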
<h2 id="外部-ip">External IPs</h2>
<p><code>spec.externalIPs</code> specifies a list of external IPs on which the Service is reachable</p>
<h2 id="ingress">Ingress</h2>
<p>ingress-nginx loads and updates the nginx configuration dynamically, avoiding the cost of frequent full nginx reloads on every configuration change</p>
<p><code>ingress-nginx-controller</code> (v1.10.0) requires setting <code>ingressClassName: nginx</code></p>
<h2 id="原理">How it works</h2>
<p>CoreDNS maps the Service name to its cluster IP, and kube-proxy programs that IP into the IPVS kernel module, giving the Service a relatively stable network identity</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">ipvsadm -Ln <span class="c1"># list IPVS ip/port forwarding rules</span>
</span></span><span class="line"><span class="cl">iptables-save <span class="c1"># dump the iptables ip/port forwarding rules</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># dig comes from the bind-utils package</span>
</span></span><span class="line"><span class="cl"><span class="c1"># @10.96.0.10 is the kube-dns Service IP</span>
</span></span><span class="line"><span class="cl"><span class="c1"># -t tcp, -u udp</span>
</span></span><span class="line"><span class="cl"><span class="c1"># -t A queries the A record for the name</span>
</span></span><span class="line"><span class="cl">dig -t A my-service.default.svc.cluster.local. @10.96.0.10
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># probe the HTTP service</span>
</span></span><span class="line"><span class="cl">wget --spider --timeout<span class="o">=</span><span class="m">1</span> nginx
</span></span></code></pre></div>]]></content>
		</item>
		
		<item>
			<title>Pod Summary</title>
			<link>/posts/pod-%E6%80%BB%E7%BB%93/</link>
			<pubDate>Sat, 23 Mar 2024 19:25:31 +0800</pubDate>
			
			<guid>/posts/pod-%E6%80%BB%E7%BB%93/</guid>
			<description></description>
			<content type="html"><![CDATA[<h2 id="生命周期">Lifecycle</h2>
<p><img src="/imgs/pod-lifecycle.jpg" alt="pod-lifecycle"></p>
<h3 id="pause-容器">Pause container</h3>
<p>Every pod has a pause container that enables resource sharing among the pod's containers. It has two main jobs:</p>
<ol>
<li>Initialize the network stack</li>
<li>Share storage volumes</li>
</ol>
<h3 id="initc-初始化容器">initC (init containers)</h3>
<p>initC takes the same fields as mainC, except that initC has no lifecycle, readinessProbe, livenessProbe, or startupProbe.
A Pod may have zero init containers, or in theory arbitrarily many. They run to completion one after another, in order, so you can use them to split setup work into steps, e.g.:</p>
<ol>
<li>initC1 can run git clone</li>
<li>initC2 can install software that mainC does not need, including build tools</li>
<li>initC3 can build the software, e.g. a webpack frontend bundle or a compiled backend binary</li>
<li>initC4 can implement delay/until preconditions, since mainC cannot start until every initC has completed</li>
<li>initC5 can run with a filesystem view different from mainC's, and thereby access Secrets</li>
<li>initC6 can safely run tools that improve the security of the running mainC to some degree</li>
<li>&hellip; and so on</li>
</ol>
<h3 id="mainc-主容器">mainC (main containers)</h3>
<p>Every pod must have at least one container or it cannot exist. In theory a pod can also hold any number of containers, and all mainCs run concurrently. mainC supports both hooks and probes</p>
<ul>
<li>
<p>hook handlers: exec, httpGet, tcpSocket</p>
<ul>
<li>lifecycle.postStart</li>
<li>lifecycle.preStop</li>
</ul>
</li>
<li>
<p><a href="https://kubernetes.io/zh-cn/docs/concepts/workloads/pods/pod-lifecycle/#types-of-probe">Probes</a></p>
<ul>
<li>startupProbe: startup probe</li>
<li>livenessProbe: liveness probe</li>
<li>readinessProbe: readiness probe</li>
</ul>
</li>
</ul>
<h3 id="sidecar-container">Sidecar Container</h3>
<p>Unlike ordinary initCs, which start in sequence and exit, a Sidecar Container stays alive for the pod's entire lifetime, from the moment the pause container finishes initialization until the pod dies. Setting restartPolicy=Always on an entry in initContainers turns it into a Sidecar Container. It supports <code>*Probe</code> checks and <code>lifecycle.*</code> hooks, and shares the pod's CPU, memory, network, and storage without affecting mainC. Note that because a Sidecar is still an initC, mainC will not start until the Sidecar is running normally. Typical uses:</p>
<ul>
<li>Log collection</li>
<li>Monitoring</li>
<li>Data synchronization</li>
</ul>
<h3 id="ephemeral-container">Ephemeral Container</h3>
<p>Mainly used for debugging containers. It takes the same fields as Container, but ports, lifecycle, *Probe, and resources are not allowed; apart from that it can run arbitrary commands</p>
<h3 id="quality-of-service-class">Quality of Service Class</h3>
<p>Used for pod eviction: when a node's resources are overloaded, pods are evicted in the following order</p>
<ul>
<li>BestEffort: evicted first; the default QoS class</li>
<li>Burstable: evicted after BestEffort</li>
<li>Guaranteed: least likely to be evicted</li>
</ul>
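<p>The QoS class is derived from the containers' requests and limits; for example, a Guaranteed Pod sets requests equal to limits for every container (names and values are illustrative):</p>
<pre tabindex="0"><code>spec:
  containers:
    - name: app
      resources:
        requests:
          cpu: 500m
          memory: 128Mi
        limits:            # equal to requests for every container => Guaranteed
          cpu: 500m
          memory: 128Mi
</code></pre>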
<h2 id="downward-api">Downward API</h2>
<p>Maps some of the pod's <code>metadata.*</code> fields into its containers via a volume (environment variables work as well)</p>
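<p>Sketch of a downwardAPI volume exposing pod metadata (names and paths are illustrative):</p>
<pre tabindex="0"><code>spec:
  containers:
    - name: app
      volumeMounts:
        - name: podinfo
          mountPath: /etc/podinfo
  volumes:
    - name: podinfo
      downwardAPI:
        items:
          - path: "labels"
            fieldRef:
              fieldPath: metadata.labels
</code></pre>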
<h2 id="disruptions">Disruptions</h2>
<h3 id="非自愿干扰involuntary-disruptions">Involuntary disruptions</h3>
<ol>
<li>Hardware failure of the node's physical machine</li>
<li>A cluster administrator deletes the VM instance by mistake</li>
<li>A cloud-provider or hypervisor fault makes the VM disappear</li>
<li>A kernel panic</li>
<li>The node disappears from the cluster due to a network partition</li>
<li>The pod is evicted because the node runs out of resources</li>
</ol>
<h3 id="自愿干扰voluntary-disruptions">Voluntary disruptions</h3>
<ol>
<li>Deleting the controller (e.g. a Deployment) that manages the pods</li>
<li>Updating a controller's pod template, causing the pods to restart</li>
<li>Deleting a pod directly</li>
<li>Draining a node for repair or upgrade</li>
<li>Draining a node to shrink the cluster</li>
<li>Removing a pod from a node so other pods can use that node</li>
</ol>
<p>You can create <code>PodDisruptionBudgets</code> to mitigate voluntary disruptions to some extent, but they are not a hard guarantee; the system only makes a best effort to honor them.</p>
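<p>A PodDisruptionBudget sketch that keeps at least 2 replicas running during voluntary disruptions (name and selector are illustrative):</p>
<pre tabindex="0"><code>apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  minAvailable: 2          # evictions are refused if they would drop below this
  selector:
    matchLabels:
      app: my-app
</code></pre>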
<h3 id="处理干扰-tbd">Handling disruptions [TBD]</h3>
<h3 id="干扰预算-poddisruptionbudgets-tbd">Disruption budgets (PodDisruptionBudgets) [TBD]</h3>
<h3 id="干扰状况-tbd">Disruption conditions [TBD]</h3>
<h2 id="pod-开销">Pod overhead</h2>
<p>A Pod's resource request includes the overhead of the container runtime. Administrators can preconfigure this overhead in the <a href="https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/extensible-admission-controllers/#what-are-admission-webhooks">admission control layer</a>, or it can be specified per Pod.</p>
<p>A Pod can also define it itself via <code>spec.overhead</code>, which takes higher precedence.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">node.k8s.io/v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">RuntimeClass</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">kata-fc</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">handler</span><span class="p">:</span><span class="w"> </span><span class="l">runc</span><span class="w"> </span><span class="c"># the container runtime configured when the cluster was deployed</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">overhead</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">podFixed</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;120Mi&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;250m&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">---</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Pod</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nn">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">runtimeClassName</span><span class="p">:</span><span class="w"> </span><span class="l">kata-fc</span><span class="w"> </span><span class="c"># if unset, the cluster applies a default based on its container runtime</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">containers</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">limits</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">500m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">100Mi</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="l">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">limits</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">1500m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">100Mi</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="l">...</span><span class="w">
</span></span></span></code></pre></div><p>If <code>.spec.runtimeClassName</code> references the <code>kata-fc</code> <code>RuntimeClass</code> defined above, the Pod's total resource request includes the <code>overhead</code> above (covering the container runtime's share of the cost), as shown below:</p>
<pre tabindex="0"><code>  Namespace    Name       CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------    ----       ------------  ----------   ---------------  -------------  ---
  default      test-pod   2250m (56%)   2250m (56%)  320Mi (1%)       320Mi (1%)     36m
</code></pre><!-- ```shell -->
<!-- docker ps --no-trunc -->
<!-- ``` -->
]]></content>
		</item>
		
		<item>
			<title>Rebuilding Kubernetes on an M1 Mac</title>
			<link>/posts/kubernetes-m1-mac-%E9%87%8D%E6%96%B0%E9%85%8D%E7%BD%AE/</link>
			<pubDate>Fri, 22 Mar 2024 18:18:31 +0800</pubDate>
			
			<guid>/posts/kubernetes-m1-mac-%E9%87%8D%E6%96%B0%E9%85%8D%E7%BD%AE/</guid>
			<description></description>
			<content type="html"><![CDATA[<h2 id="etchosts">/etc/hosts</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">192.168.104.100 k8s-master01 m1
</span></span><span class="line"><span class="cl">192.168.104.101 k8s-node01 n1
</span></span><span class="line"><span class="cl">192.168.104.102 k8s-node02 n2
</span></span></code></pre></div><h2 id="禁用网卡">Disable NIC management</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># permanently unmanage the interface</span>
</span></span><span class="line"><span class="cl">nmcli device <span class="nb">set</span> enp0s2 managed no
</span></span><span class="line"><span class="cl"><span class="c1"># permanently re-enable management</span>
</span></span><span class="line"><span class="cl">nmcli device <span class="nb">set</span> enp0s2 managed yes
</span></span><span class="line"><span class="cl">nmcli connection modify enp0s2 connection.autoconnect yes
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># set the hostname; /etc/hosts provides mutual name resolution</span>
</span></span><span class="line"><span class="cl">hostnamectl set-hostname k8s-master01
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">yum install -y conntrack ipvsadm ipset iptables curl sysstat libseccomp wget vim net-tools git
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># edit /etc/chrony.conf</span>
</span></span><span class="line"><span class="cl"><span class="c1"># comment out the existing server/pool entries</span>
</span></span><span class="line"><span class="cl">pool ntp1.aliyun.com iburst
</span></span><span class="line"><span class="cl">pool ntp2.aliyun.com iburst
</span></span><span class="line"><span class="cl">pool ntp3.aliyun.com iburst
</span></span><span class="line"><span class="cl">allow 192.168.104.0/24
</span></span><span class="line"><span class="cl"><span class="nb">local</span> stratum <span class="m">10</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">systemctl restart chronyd
</span></span><span class="line"><span class="cl">systemctl <span class="nb">enable</span> chronyd
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">timedatectl set-timezone Asia/Shanghai
</span></span><span class="line"><span class="cl"><span class="c1"># keep the hardware clock in UTC</span>
</span></span><span class="line"><span class="cl">timedatectl set-local-rtc <span class="m">0</span>
</span></span><span class="line"><span class="cl">systemctl restart rsyslog
</span></span><span class="line"><span class="cl">systemctl restart crond
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># switch the firewall to iptables and flush all rules</span>
</span></span><span class="line"><span class="cl">systemctl stop firewalld <span class="o">&amp;&amp;</span> systemctl disable firewalld
</span></span><span class="line"><span class="cl">yum install -y iptables-services <span class="o">&amp;&amp;</span> systemctl start iptables <span class="o">&amp;&amp;</span> systemctl <span class="nb">enable</span> iptables <span class="o">&amp;&amp;</span> iptables -F <span class="o">&amp;&amp;</span> service iptables save
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># double check</span>
</span></span><span class="line"><span class="cl">systemctl status iptables
</span></span><span class="line"><span class="cl">iptables -L <span class="c1"># empty</span>
</span></span><span class="line"><span class="cl">cat /etc/sysconfig/iptables <span class="c1"># empty</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># disable swap and SELinux</span>
</span></span><span class="line"><span class="cl">swapoff -a <span class="o">&amp;&amp;</span> sed -i <span class="s1">&#39;/ swap / s/^\(.*\)$/#\1/g&#39;</span> /etc/fstab
</span></span><span class="line"><span class="cl">setenforce <span class="m">0</span> <span class="o">&amp;&amp;</span> sed -i <span class="s1">&#39;s/^SELINUX=.*/SELINUX=disabled/&#39;</span> /etc/selinux/config
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># tune kernel parameters for Kubernetes</span>
</span></span><span class="line"><span class="cl">cat <span class="s">&lt;&lt; EOF &gt; /etc/sysctl.d/99-kubernetes-cri.conf
</span></span></span><span class="line"><span class="cl"><span class="s">net.bridge.bridge-nf-call-ip6tables = 1
</span></span></span><span class="line"><span class="cl"><span class="s">net.bridge.bridge-nf-call-iptables = 1
</span></span></span><span class="line"><span class="cl"><span class="s">net.ipv4.ip_forward = 1
</span></span></span><span class="line"><span class="cl"><span class="s">user.max_user_namespaces=28633
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="cl">sysctl -p /etc/sysctl.d/99-kubernetes-cri.conf
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># CentOS 9 ships without postfix; check for other mail services</span>
</span></span><span class="line"><span class="cl"><span class="c1"># stop unneeded services such as postfix, which wastes resources</span>
</span></span><span class="line"><span class="cl"><span class="c1"># systemctl stop postfix &amp;&amp; systemctl disable postfix</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># configure journald logging</span>
</span></span><span class="line"><span class="cl">mkdir /var/log/journal <span class="c1"># directory for persistent logs</span>
</span></span><span class="line"><span class="cl">mkdir /etc/systemd/journald.conf.d
</span></span><span class="line"><span class="cl">cat &gt; /etc/systemd/journald.conf.d/99-prophet.conf <span class="s">&lt;&lt;EOF
</span></span></span><span class="line"><span class="cl"><span class="s">[journal]
</span></span></span><span class="line"><span class="cl"><span class="s"># persist logs to disk
</span></span></span><span class="line"><span class="cl"><span class="s">Storage=persistent
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s"># compress archived logs
</span></span></span><span class="line"><span class="cl"><span class="s">Compress=yes
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s">SyncIntervalSec=5m
</span></span></span><span class="line"><span class="cl"><span class="s">RateLimitInterval=30s
</span></span></span><span class="line"><span class="cl"><span class="s">RateLimitBurst=1000
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s"># maximum disk usage
</span></span></span><span class="line"><span class="cl"><span class="s">SystemMaxUse=10G
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s"># cap each journal file at 200M
</span></span></span><span class="line"><span class="cl"><span class="s">SystemMaxFileSize=200M
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s"># retain logs for 2 weeks
</span></span></span><span class="line"><span class="cl"><span class="s">MaxRetentionSec=2week
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s"># do not forward logs to syslog
</span></span></span><span class="line"><span class="cl"><span class="s">ForwardToSyslog=no
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="cl">systemctl restart systemd-journald
</span></span></code></pre></div><h2 id="kube-proxy-开启-ipvs-的前置条件">Prerequisites for enabling IPVS in kube-proxy</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">cat <span class="s">&lt;&lt; EOF &gt; /etc/modules-load.d/containerd.conf
</span></span></span><span class="line"><span class="cl"><span class="s">overlay
</span></span></span><span class="line"><span class="cl"><span class="s">br_netfilter
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="cl">modprobe overlay
</span></span><span class="line"><span class="cl"><span class="c1"># the 5.4 kernel no longer has nf_conntrack_ipv4</span>
</span></span><span class="line"><span class="cl"><span class="c1"># https://github.com/kubernetes-sigs/kubespray/issues/7176</span>
</span></span><span class="line"><span class="cl"><span class="c1"># load the bridge netfilter module</span>
</span></span><span class="line"><span class="cl">modprobe br_netfilter
</span></span><span class="line"><span class="cl">mkdir -p  /etc/sysconfig/modules
</span></span><span class="line"><span class="cl">cat &gt; /etc/sysconfig/modules/ipvs.modules <span class="s">&lt;&lt;EOF
</span></span></span><span class="line"><span class="cl"><span class="s">#!/bin/bash
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s">modprobe -- ip_vs
</span></span></span><span class="line"><span class="cl"><span class="s">modprobe -- ip_vs_rr
</span></span></span><span class="line"><span class="cl"><span class="s">modprobe -- ip_vs_wrr
</span></span></span><span class="line"><span class="cl"><span class="s">modprobe -- ip_vs_sh
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="cl">chmod <span class="m">755</span> /etc/sysconfig/modules/ipvs.modules <span class="o">&amp;&amp;</span> bash /etc/sysconfig/modules/ipvs.modules <span class="o">&amp;&amp;</span> lsmod <span class="p">|</span> grep -e ip_vs -e nf_conntrack
</span></span></code></pre></div><h2 id="初始化-master">Initialize the master</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">curl -OL https://github.com/containerd/containerd/releases/download/v1.7.14/containerd-1.7.14-linux-arm64.tar.gz
</span></span><span class="line"><span class="cl">tar Cxzvf /usr/local containerd-1.7.14-linux-arm64.tar.gz
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">curl -OL https://github.com/opencontainers/runc/releases/download/v1.1.12/runc.arm64
</span></span><span class="line"><span class="cl">install -m <span class="m">755</span> runc.arm64 /usr/local/sbin/runc
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">mkdir -p /etc/containerd
</span></span><span class="line"><span class="cl">containerd config default &gt; /etc/containerd/config.toml
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Edit the generated /etc/containerd/config.toml:</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="o">[</span>plugins.<span class="s2">&#34;io.containerd.grpc.v1.cri&#34;</span>.containerd.runtimes.runc<span class="o">]</span>
</span></span><span class="line"><span class="cl">  ...
</span></span><span class="line"><span class="cl">  <span class="o">[</span>plugins.<span class="s2">&#34;io.containerd.grpc.v1.cri&#34;</span>.containerd.runtimes.runc.options<span class="o">]</span>
</span></span><span class="line"><span class="cl">    <span class="nv">SystemdCgroup</span> <span class="o">=</span> <span class="nb">true</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Then, also in /etc/containerd/config.toml, set:</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="o">[</span>plugins.<span class="s2">&#34;io.containerd.grpc.v1.cri&#34;</span><span class="o">]</span>
</span></span><span class="line"><span class="cl">  ...
</span></span><span class="line"><span class="cl">  <span class="c1"># sandbox_image = &#34;registry.k8s.io/pause:3.8&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="nv">sandbox_image</span> <span class="o">=</span> <span class="s2">&#34;registry.aliyuncs.com/google_containers/pause:3.9&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># To run containerd under systemd, a containerd.service unit is also needed at /etc/systemd/system/containerd.service (upstream copy: https://raw.githubusercontent.com/containerd/containerd/main/containerd.service).</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">cat <span class="s">&lt;&lt; EOF &gt; /etc/systemd/system/containerd.service
</span></span></span><span class="line"><span class="cl"><span class="s">[Unit]
</span></span></span><span class="line"><span class="cl"><span class="s">Description=containerd container runtime
</span></span></span><span class="line"><span class="cl"><span class="s">Documentation=https://containerd.io
</span></span></span><span class="line"><span class="cl"><span class="s">After=network.target local-fs.target
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s">[Service]
</span></span></span><span class="line"><span class="cl"><span class="s">ExecStartPre=-/sbin/modprobe overlay
</span></span></span><span class="line"><span class="cl"><span class="s">ExecStart=/usr/local/bin/containerd
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s">Type=notify
</span></span></span><span class="line"><span class="cl"><span class="s">Delegate=yes
</span></span></span><span class="line"><span class="cl"><span class="s">KillMode=process
</span></span></span><span class="line"><span class="cl"><span class="s">Restart=always
</span></span></span><span class="line"><span class="cl"><span class="s">RestartSec=5
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s"># Having non-zero Limit*s causes performance problems due to accounting overhead
</span></span></span><span class="line"><span class="cl"><span class="s"># in the kernel. We recommend using cgroups to do container-local accounting.
</span></span></span><span class="line"><span class="cl"><span class="s">LimitNPROC=infinity
</span></span></span><span class="line"><span class="cl"><span class="s">LimitCORE=infinity
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s"># Comment TasksMax if your systemd version does not supports it.
</span></span></span><span class="line"><span class="cl"><span class="s"># Only systemd 226 and above support this version.
</span></span></span><span class="line"><span class="cl"><span class="s">TasksMax=infinity
</span></span></span><span class="line"><span class="cl"><span class="s">OOMScoreAdjust=-999
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s">[Install]
</span></span></span><span class="line"><span class="cl"><span class="s">WantedBy=multi-user.target
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># enable containerd at boot and start it now:</span>
</span></span><span class="line"><span class="cl">systemctl daemon-reload
</span></span><span class="line"><span class="cl">systemctl <span class="nb">enable</span> containerd --now
</span></span><span class="line"><span class="cl">systemctl status containerd
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">curl -OL https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-arm64.tar.gz
</span></span><span class="line"><span class="cl">tar -zxvf crictl-v1.29.0-linux-arm64.tar.gz
</span></span><span class="line"><span class="cl">install -m <span class="m">755</span> crictl /usr/local/bin/crictl
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># verify with crictl: version info should print with no errors:</span>
</span></span><span class="line"><span class="cl">crictl --runtime-endpoint<span class="o">=</span>unix:///run/containerd/containerd.sock  version
</span></span></code></pre></div><h2 id="安装-kuberadm-主从配置">Install kubeadm (master and worker configuration)</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">cat <span class="s">&lt;&lt;EOF &gt; /etc/yum.repos.d/kubernetes.repo
</span></span></span><span class="line"><span class="cl"><span class="s">[kubernetes]
</span></span></span><span class="line"><span class="cl"><span class="s">name=Kubernetes
</span></span></span><span class="line"><span class="cl"><span class="s">baseurl=https://pkgs.k8s.io/core:/stable:/v1.29/rpm/
</span></span></span><span class="line"><span class="cl"><span class="s">enabled=1
</span></span></span><span class="line"><span class="cl"><span class="s">gpgcheck=1
</span></span></span><span class="line"><span class="cl"><span class="s">gpgkey=https://pkgs.k8s.io/core:/stable:/v1.29/rpm/repodata/repomd.xml.key
</span></span></span><span class="line"><span class="cl"><span class="s">exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="cl">yum install -y kubelet kubeadm kubectl --disableexcludes<span class="o">=</span>kubernetes
</span></span><span class="line"><span class="cl">systemctl <span class="nb">enable</span> kubelet
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># check which cgroup versions the system supports:</span>
</span></span><span class="line"><span class="cl">grep cgroup /proc/filesystems
</span></span></code></pre></div><h2 id="配置-kubernetes">Configure Kubernetes</h2>
<p><code>kubeadm config print init-defaults &gt; kubeadm.yaml</code></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">apiVersion: kubeadm.k8s.io/v1beta3
</span></span><span class="line"><span class="cl">kind: InitConfiguration
</span></span><span class="line"><span class="cl">localAPIEndpoint:
</span></span><span class="line"><span class="cl">  advertiseAddress: 192.168.104.100
</span></span><span class="line"><span class="cl">  bindPort: <span class="m">6443</span>
</span></span><span class="line"><span class="cl">nodeRegistration:
</span></span><span class="line"><span class="cl">  criSocket: unix:///run/containerd/containerd.sock
</span></span><span class="line"><span class="cl">  taints:
</span></span><span class="line"><span class="cl">  - effect: PreferNoSchedule
</span></span><span class="line"><span class="cl">    key: node-role.kubernetes.io/master
</span></span><span class="line"><span class="cl">---
</span></span><span class="line"><span class="cl">apiVersion: kubeadm.k8s.io/v1beta3
</span></span><span class="line"><span class="cl">kind: ClusterConfiguration
</span></span><span class="line"><span class="cl">kubernetesVersion: 1.29.0
</span></span><span class="line"><span class="cl">imageRepository: registry.aliyuncs.com/google_containers
</span></span><span class="line"><span class="cl">networking:
</span></span><span class="line"><span class="cl">  dnsDomain: cluster.local
</span></span><span class="line"><span class="cl">  serviceSubnet: 10.96.0.0/12
</span></span><span class="line"><span class="cl">  podSubnet: 10.244.0.0/16
</span></span><span class="line"><span class="cl">---
</span></span><span class="line"><span class="cl">apiVersion: kubelet.config.k8s.io/v1beta1
</span></span><span class="line"><span class="cl">kind: KubeletConfiguration
</span></span><span class="line"><span class="cl">cgroupDriver: systemd
</span></span><span class="line"><span class="cl">failSwapOn: <span class="nb">false</span>
</span></span><span class="line"><span class="cl">---
</span></span><span class="line"><span class="cl">apiVersion: kubeproxy.config.k8s.io/v1alpha1
</span></span><span class="line"><span class="cl">kind: KubeProxyConfiguration
</span></span><span class="line"><span class="cl">mode: ipvs
</span></span></code></pre></div><h3 id="启动-kubernetes">Start Kubernetes</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">kubeadm config images list --config kubeadm.yaml
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">kubeadm config images pull --config kubeadm.yaml
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">kubeadm init --config kubeadm.yaml <span class="p">|</span> tee kubeadm-init.log
</span></span></code></pre></div><h3 id="配置-worker-节点">Configure the worker nodes</h3>
<p>Make a clone of the master virtual machine for each worker.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">hostnamectl set-hostname k8s-node01
</span></span><span class="line"><span class="cl">nmcli connection modify enp0s1 ipv4.addresses 192.168.104.101/24 ipv4.gateway 192.168.104.2 ipv4.dns <span class="s2">&#34;192.168.104.2 8.8.8.8&#34;</span> ipv4.method manual
</span></span><span class="line"><span class="cl">nmcli conn up enp0s1
</span></span><span class="line"><span class="cl">kubeadm reset
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">hostnamectl set-hostname k8s-node02
</span></span><span class="line"><span class="cl">nmcli connection modify enp0s1 ipv4.addresses 192.168.104.102/24 ipv4.gateway 192.168.104.2 ipv4.dns <span class="s2">&#34;192.168.104.2 8.8.8.8&#34;</span> ipv4.method manual
</span></span><span class="line"><span class="cl">nmcli conn up enp0s1
</span></span><span class="line"><span class="cl">kubeadm reset
</span></span></code></pre></div><h2 id="将所有的-worker-节点加入集群">Join all worker nodes to the cluster</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">kubeadm join 192.168.104.100:6443 --token abcdef.0123456789abcdef <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>	--discovery-token-ca-cert-hash sha256:5c81348991e838a0471c78115d26431f996d4c9006a6abb1449929538a269401
</span></span></code></pre></div><h2 id="在-master-节点配置集群">Configure the cluster on the master node</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">mkdir -p <span class="nv">$HOME</span>/.kube
</span></span><span class="line"><span class="cl">sudo cp -i /etc/kubernetes/admin.conf <span class="nv">$HOME</span>/.kube/config
</span></span><span class="line"><span class="cl">sudo chown <span class="k">$(</span>id -u<span class="k">)</span>:<span class="k">$(</span>id -g<span class="k">)</span> <span class="nv">$HOME</span>/.kube/config
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">curl -OL https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
</span></span><span class="line"><span class="cl">kubectl apply -f kube-flannel.yml
</span></span></code></pre></div>]]></content>
		</item>
		
		<item>
			<title>Kubernetes Cluster Setup</title>
			<link>/posts/kubernetes-%E9%9B%86%E7%BE%A4%E6%90%AD%E5%BB%BA/</link>
			<pubDate>Thu, 21 Mar 2024 19:20:31 +0800</pubDate>
			
			<guid>/posts/kubernetes-%E9%9B%86%E7%BE%A4%E6%90%AD%E5%BB%BA/</guid>
			<description></description>
			<content type="html"><![CDATA[<h2 id="kubeadm-启动流程">Kubeadm startup flow</h2>
<ul>
<li>master: systemd &gt; kubelet &gt; container components &gt; Kubernetes</li>
</ul>
<h2 id="安装-kuberadm-主从配置">安装 Kuberadm （主从配置）</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># 此操作会覆盖 /etc/yum.repos.d/kubernetes.repo 中现存的所有配置</span>
</span></span><span class="line"><span class="cl">cat <span class="s">&lt;&lt;EOF &gt; /etc/yum.repos.d/kubernetes.repo
</span></span></span><span class="line"><span class="cl"><span class="s">[kubernetes]
</span></span></span><span class="line"><span class="cl"><span class="s">name=Kubernetes
</span></span></span><span class="line"><span class="cl"><span class="s">baseurl=https://pkgs.k8s.io/core:/stable:/v1.29/rpm/
</span></span></span><span class="line"><span class="cl"><span class="s">enabled=1
</span></span></span><span class="line"><span class="cl"><span class="s">gpgcheck=1
</span></span></span><span class="line"><span class="cl"><span class="s">gpgkey=https://pkgs.k8s.io/core:/stable:/v1.29/rpm/repodata/repomd.xml.key
</span></span></span><span class="line"><span class="cl"><span class="s">exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
</span></span></span><span class="line"><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="cl">yum install -y kubelet kubeadm kubectl --disableexcludes<span class="o">=</span>kubernetes
</span></span><span class="line"><span class="cl">systemctl <span class="nb">enable</span> kubelet
</span></span></code></pre></div><h2 id="初始化-master">Initialize the master</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">registry.k8s.io/kube-apiserver:v1.29.3
</span></span><span class="line"><span class="cl">registry.k8s.io/kube-controller-manager:v1.29.3
</span></span><span class="line"><span class="cl">registry.k8s.io/kube-scheduler:v1.29.3
</span></span><span class="line"><span class="cl">registry.k8s.io/kube-proxy:v1.29.3
</span></span><span class="line"><span class="cl">registry.k8s.io/coredns/coredns:v1.11.1
</span></span><span class="line"><span class="cl">registry.k8s.io/pause:3.9
</span></span><span class="line"><span class="cl">registry.k8s.io/etcd:3.5.12-0
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="cp">#!/bin/bash
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="nv">imgs</span><span class="o">=</span>/root/gcr.io
</span></span><span class="line"><span class="cl"><span class="k">for</span> i in <span class="k">$(</span> ls <span class="si">${</span><span class="nv">imgs</span><span class="si">}</span> <span class="k">)</span><span class="p">;</span> <span class="k">do</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># docker load -i ${imgs}/$i</span>
</span></span><span class="line"><span class="cl">    ctr -n<span class="o">=</span>k8s.io images import <span class="si">${</span><span class="nv">imgs</span><span class="si">}</span>/<span class="nv">$i</span>
</span></span><span class="line"><span class="cl"><span class="k">done</span>
</span></span></code></pre></div><p>/etc/containerd/config.toml</p>
<pre tabindex="0"><code class="language-conf" data-lang="conf"># comment
disabled_plugins = [&#34;cri&#34;]
</code></pre><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">sudo rm /etc/kubernetes/manifests/kube-apiserver.yaml <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    /etc/kubernetes/manifests/kube-controller-manager.yaml <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    /etc/kubernetes/manifests/kube-scheduler.yaml <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    /etc/kubernetes/manifests/etcd.yaml
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">kubeadm reset
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># kubeadm config images pull --image-repository registry.aliyuncs.com/google_containers</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># kubeadm init --pod-network-cidr=10.244.0.0/16  --image-repository registry.aliyuncs.com/google_containers</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">kubeadm config print init-defaults &gt; kubeadm-config.yaml
</span></span><span class="line"><span class="cl"><span class="c1"># kubeadm config images pull --image-repository registry.cn-hangzhou.aliyuncs.com/google_containers</span>
</span></span><span class="line"><span class="cl">kubeadm init --config kubeadm-config.yaml <span class="p">|</span> tee kubeadm-init.log
</span></span></code></pre></div><h2 id="加入-master-节点">Join the master node</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># 加入master节点报错执行以下命令</span>
</span></span><span class="line"><span class="cl">kubeadm reset
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Join the master node</span>
</span></span><span class="line"><span class="cl">kubeadm join 172.16.1.100:6443 --token abcdef.0123456789abcdef <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>        --discovery-token-ca-cert-hash sha256:d7bcc5181f1c5b25f9b3d91fcfdf9ae85859cd54b860a29e0947fe39c4f3af82
</span></span></code></pre></div><h2 id="安装-flannel">Install Flannel</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># 在master节点</span>
</span></span><span class="line"><span class="cl">kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
</span></span></code></pre></div><h2 id="参考文档">References</h2>
<ul>
<li><a href="https://developer.aliyun.com/article/1147479">https://developer.aliyun.com/article/1147479</a></li>
<li><a href="https://blog.frognew.com/2023/12/kubeadm-install-kubernetes-1.29.html">https://blog.frognew.com/2023/12/kubeadm-install-kubernetes-1.29.html</a></li>
</ul>
]]></content>
		</item>
		
		<item>
			<title>Kubernetes Components Overview</title>
			<link>/posts/kubernetes-%E7%BB%84%E4%BB%B6%E4%BB%8B%E7%BB%8D/</link>
			<pubDate>Mon, 18 Mar 2024 16:30:31 +0800</pubDate>
			
			<guid>/posts/kubernetes-%E7%BB%84%E4%BB%B6%E4%BB%8B%E7%BB%8D/</guid>
			<description></description>
			<content type="html"><![CDATA[<h2 id="components-glance">Components Glance</h2>
<ul>
<li>
<p>Control Plane Components</p>
<ul>
<li>kube-apiserver</li>
<li>etcd</li>
<li>kube-scheduler: assigns newly created pods that have no node yet to an available node</li>
<li>kube-controller-manager
<ul>
<li>NodeController</li>
<li>ReplicationController</li>
<li>ReplicaSet</li>
<li>Deployment</li>
<li>DaemonSet</li>
<li>StatefulSet</li>
<li>JobController</li>
<li>CronJobController</li>
<li>EndpointSliceController</li>
<li>ServiceAccountController</li>
</ul>
</li>
<li>cloud-controller-manager: integrates with cloud provider APIs
<ul>
<li>NodeController: manages nodes</li>
<li>ServiceController: manages load balancers</li>
<li>RouteController: configures routes on the cloud infrastructure</li>
</ul>
</li>
</ul>
</li>
<li>
<p>Node Components</p>
<ul>
<li>kubelet</li>
<li>kube-proxy</li>
<li>container-runtime</li>
</ul>
</li>
<li>
<p>Addons</p>
<ul>
<li>DNS</li>
<li>Web Ui Dashboard</li>
<li>Container Resource Monitoring</li>
<li>Cluster Level Logging</li>
<li>Network Plugins</li>
</ul>
</li>
</ul>
<h2 id="控制器">控制器</h2>
<h3 id="内置控制器">内置控制器</h3>
<ol>
<li>
<p>Stateless applications</p>
<ul>
<li>Common cases
<ul>
<li>RC</li>
<li>RS</li>
<li>Deployment</li>
</ul>
</li>
<li>Special cases
<ul>
<li>Batch tasks: a task has succeeded once it exits with status code 0
<ul>
<li>Job</li>
<li>CronJob</li>
</ul>
</li>
<li>Exactly one pod per node
<ul>
<li>DaemonSet</li>
</ul>
</li>
<li>Autoscaling
<ul>
<li>HPA: dynamically scales the replica count according to the given configuration, working together with RC, RS, or Deployment</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>
<p>Stateful applications</p>
<ul>
<li>StatefulSet</li>
</ul>
</li>
</ol>
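<p>Of the stateless controllers above, Deployment is the everyday choice. A minimal sketch, using the same heredoc pattern as the install steps (the name <code>nginx-deployment</code> and the image tag are illustrative, not from the original notes):</p>

```shell
# Generate a minimal Deployment: three interchangeable nginx replicas.
cat <<EOF > nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
EOF
# kubectl apply -f nginx-deployment.yaml   # requires a running cluster
```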
<h3 id="自定义控制器">自定义控制器</h3>
<blockquote>
<p>Kubernetes development</p>
</blockquote>
<ul>
<li>VPA: dynamically adjusts the CPU, memory, and other resources of the containers in a pod according to the given configuration (still experimental)</li>
</ul>
<h2 id="内置控制器-1">内置控制器</h2>
<h3 id="statefulset">StatefulSet</h3>
<blockquote>
<p>Designed to handle stateful services (whereas Deployments and ReplicaSets are designed for stateless services)</p>
</blockquote>
<ol>
<li>Stable persistent storage: after a pod is rescheduled it can still access the same persisted data; implemented with PVCs</li>
<li>Stable network identity: after a pod is rescheduled its pod name and hostname stay the same; implemented with a headless Service (a Service without a cluster IP)</li>
<li>Ordered deployment and scaling: pods are ordered, and deployment or scale-up proceeds in the defined order (from 0 to N-1; every previously created pod must be Running and Ready before the next one starts), implemented with <span style="color: orange; font-weight: bold;">init containers</span><sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>.</li>
<li>Ordered scale-down and deletion: from N-1 down to 0, the reverse of ordered deployment and scale-up</li>
</ol>
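<p>The four properties above map directly onto StatefulSet fields: <code>serviceName</code> points at a headless Service for stable DNS names, and <code>volumeClaimTemplates</code> gives each pod its own PVC. A minimal sketch in the same heredoc style as the rest of these notes (all names and the image are illustrative):</p>

```shell
# Headless Service + StatefulSet: stable storage, identity, and ordering.
cat <<EOF > web-statefulset.yaml
apiVersion: v1
kind: Service
metadata:
  name: web-headless
spec:
  clusterIP: None          # headless: pods get DNS names web-0.web-headless, ...
  selector:
    app: web
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web-headless
  replicas: 3              # created in order web-0 -> web-1 -> web-2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
  volumeClaimTemplates:    # one PVC per pod, re-attached after rescheduling
  - metadata:
      name: www
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
EOF
# kubectl apply -f web-statefulset.yaml   # requires a running cluster
```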
<h3 id="daemonset">DaemonSet</h3>
<blockquote>
<p>Ensures that all (or some) nodes run one copy of a pod. When a node joins the cluster, a pod is added for it; when a node is removed from the cluster, that pod is reclaimed. Deleting the DaemonSet deletes every pod it created. Typical uses:</p>
</blockquote>
<ol>
<li>Cluster storage daemons, e.g. glusterd or ceph on every node</li>
<li>Log collection daemons, e.g. fluentd or logstash on every node</li>
<li>Monitoring daemons, e.g. Prometheus Node Exporter on every node</li>
</ol>
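<p>Case 3 above can be sketched as a DaemonSet (names and image tag are illustrative): every node, including ones that join later, gets exactly one exporter pod.</p>

```shell
# DaemonSet: one node-exporter pod per node.
cat <<EOF > node-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true            # expose node metrics on the node's own IP
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.7.0
        ports:
        - containerPort: 9100
EOF
# kubectl apply -f node-exporter-daemonset.yaml   # requires a running cluster
```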
<h3 id="job">Job</h3>
<blockquote>
<p>For tasks that run only once or a fixed number of times, such as batch processing or data backup: once the backup completes, the pod is reclaimed. A Job can also guarantee that one or more of its pods finish successfully, e.g. for redundant backups.</p>
</blockquote>
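<p>A run-to-completion backup could be sketched like this (names are illustrative; busybox stands in for a real backup image):</p>

```shell
# A Job finishes when its pod exits with status code 0.
cat <<EOF > backup-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: one-off-backup
spec:
  completions: 1           # done after one successful pod
  backoffLimit: 3          # retry a failing pod at most 3 times
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: backup
        image: busybox:1.36
        command: ["sh", "-c", "echo backing up && exit 0"]
EOF
# kubectl apply -f backup-job.yaml   # requires a running cluster
```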
<h3 id="cronjob">CronJob</h3>
<blockquote>
<p>For time-based Jobs, like Linux cron jobs, e.g.:</p>
</blockquote>
<ol>
<li>Run only once at a given time</li>
<li>Run periodically at given points in time</li>
</ol>
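<p>The <code>schedule</code> field uses the same syntax as a Linux crontab line. A sketch of a nightly batch run (names and image are illustrative):</p>

```shell
# CronJob: creates a Job from jobTemplate on each schedule tick.
cat <<EOF > backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"    # 02:00 every day, same syntax as Linux crontab
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: busybox:1.36
            command: ["sh", "-c", "echo backing up"]
EOF
# kubectl apply -f backup-cronjob.yaml   # requires a running cluster
```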
<h2 id="网络通信模式">网络通信模式</h2>
<h3 id="分类">分类</h3>
<ul>
<li>Containers in the same pod: via the loopback interface (lo)</li>
<li>Between different pods
<ul>
<li>Same host: via the docker0 bridge, second in efficiency only to lo</li>
<li>Different hosts: e.g. Flannel, which re-encapsulates packets over UDP; the least efficient path</li>
</ul>
</li>
<li>Between the Service (SVC) network and pods</li>
</ul>
<h3 id="网络隔离">网络隔离</h3>
<blockquote>
<p>Isolation is implemented with network namespaces</p>
</blockquote>
<h3 id="具体应用案例">具体应用案例</h3>
<ol>
<li>
<p>Flannel
uses UDP to re-encapsulate network packets (layer-3 communication: broadcast domains are isolated, so collisions are close to zero, but efficiency may suffer), combined with a network overlay, to build a flat network</p>
</li>
<li>
<p>A component developed by Cisco</p>
</li>
<li>
<p>Other approaches,
e.g. forwarding via routing tables, or letting processes communicate across a broadcast domain (layer-2 communication: potentially more efficient, but with a larger collision domain)</p>
</li>
</ol>
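<p>Flannel's encapsulation mode is chosen in its <code>net-conf.json</code> (delivered to the daemon as a ConfigMap): <code>"udp"</code> is the userspace double-encapsulation described above, while <code>"vxlan"</code>, the default, does the same encapsulation in the kernel and is usually faster. A sketch of the relevant fragment; the <code>Network</code> value must match the cluster's pod subnet:</p>

```shell
# Flannel backend selection (config fragment, illustrative values).
cat <<EOF > net-conf.json
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan"
  }
}
EOF
```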
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>init containers&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content>
		</item>
		
		<item>
			<title>Understanding the Kubernetes Architecture</title>
			<link>/posts/kubernetes-%E6%9E%B6%E6%9E%84%E7%90%86%E8%A7%A3/</link>
			<pubDate>Sun, 17 Mar 2024 17:20:31 +0800</pubDate>
			
			<guid>/posts/kubernetes-%E6%9E%B6%E6%9E%84%E7%90%86%E8%A7%A3/</guid>
			<description></description>
			<content type="html"><![CDATA[<h2 id="整体架构">Overall architecture</h2>
<p><img src="/imgs/arch-glance.jpg" alt="Architecture overview"></p>
<ul>
<li>Etcd stores the data for every object in the cluster</li>
<li>ApiServer is the entry point for accessing the cluster. It talks to etcd, wraps create/read/update/delete operations on the core objects, and exposes them as a RESTful API for external and internal components. The core object data it maintains is persisted in Etcd.</li>
<li>KubernetesScheduler schedules across the cluster's nodes: it picks a node for each newly created pod (i.e. assigns it a machine) and schedules cluster resources. Being a separate component, it can be swapped for a different scheduler</li>
<li>Replication Controller watches over the liveness of cluster workloads, e.g. failure detection, automatic scaling, rolling updates</li>
<li>Add-on support: add-ons can be added or removed and are not part of Kubernetes itself
<ul>
<li>CoreDNS provides DNS for the whole cluster, giving network endpoints a fixed way to be addressed</li>
<li>IngressController is the public entry point for services in K8S, exposing a layer-7 interface (the official scheduling mechanism uses IPVS, a layer-4 model with no hostname or domain support); e.g. nginx-ingress, traefik-ingress</li>
<li>Prometheus monitors the resources of the whole cluster; a time-series database</li>
<li>Federation provides clusters spanning availability zones and the ability to manage K8S clusters across data centers</li>
<li>Flannel</li>
<li>EFK</li>
<li>ELK</li>
</ul>
</li>
<li>A Node can be thought of as a physical machine
<ul>
<li>Kubelet maintains container lifecycles and also manages volumes (CSI) and networking (CNI). Upward it talks to the api server; downward it goes through the CRI to the container runtime (which implements the OCI spec) and on to Docker, i.e. ApiServer &lt;&ndash; <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> Kubelet -&gt; CRI -&gt; Docker (OCI) -&gt; Container</li>
<li>KubeProxy provides in-cluster service discovery and load balancing for Services. Upward it talks to the api server; downward it uses the netlink interface (a kernel interface that lets it manage firewall and IPVS rules), i.e. ApiServer &lt;&ndash; KubeProxy &ndash;&gt; netlink -&gt; firewall/ipvs</li>
<li>The container runtime, reached through the CRI (Container Runtime Interface), manages images and actually runs the pods and containers</li>
</ul>
</li>
</ul>
<h2 id="pod">Pod</h2>
<p>A Pod must contain at least one container (&gt;=1). When a pod starts, Kubernetes first creates a pause container; the other containers in the pod then share the network namespace stack that the pause container initialized, as well as its storage volumes</p>
<blockquote>
<p>What the pause container does: 1. initialize the network namespace stack; 2. mount the volumes</p>
</blockquote>
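<p>On a worker node the pause containers can be observed directly. A sketch, assuming containerd as the runtime and that <code>crictl</code> is installed; both commands need a running cluster node, so no output is claimed here:</p>

```shell
# Each pod sandbox listed here is backed by a pause container:
crictl pods

# The pause containers themselves, in containerd's Kubernetes namespace:
ctr -n k8s.io containers ls | grep pause
```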
<h3 id="自举自主方式">自举/自主方式</h3>
<h3 id="控制器管理方式">控制器管理方式</h3>
<ul>
<li>HPA (HorizontalPodAutoscaler) works on top of RC, RS, and Deployment, automatically scaling the number of pods
<ul>
<li>RC keeps the replica count at the desired number, creating new pods when there are too few</li>
<li>RS has everything RC has, plus label-based pod selection, i.e. set operations over labels to pick out a group of pods</li>
<li>Deployment manages rolling out pods, e.g. rolling updates and rollbacks</li>
</ul>
</li>
</ul>
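<p>An HPA that targets the CPU utilization of a Deployment can be sketched as below (names are illustrative; <code>autoscaling/v2</code> assumes a reasonably recent cluster with a metrics server):</p>

```shell
# HPA: keep average CPU at or below 80% by adding/removing replicas.
cat <<EOF > nginx-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
EOF
# kubectl apply -f nginx-hpa.yaml   # requires a running cluster
```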
<h2 id="其他知识">其他知识</h2>
<ol>
<li>Catching signals with trap in a shell script</li>
</ol>
<p>Take an nginx container as an example:</p>
<ol>
<li>Started directly with /usr/local/nginx/sbin/nginx as the entrypoint, the entrypoint becomes PID 1. On <code>docker stop</code>, docker tries to kill the container with signals 15, 9, etc.; the nginx child processes (PIDs 3, 4, ...) forked by PID 1 (/usr/local/nginx/sbin/nginx) are stopped by PID 1, and the container exits gracefully</li>
<li>Started through a startup.sh entrypoint, startup.sh becomes PID 1. On <code>docker stop</code>, docker again tries signals 15, 9, etc.; PID 1 (<code>startup.sh</code>) must catch the signal with <code>trap</code> and pass it on to the nginx master process, which in turn stops the nginx child processes; only then can the container exit gracefully. The chain is DockerDaemon &gt; PID 1 <code>startup.sh</code> &gt; application process</li>
</ol>
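<p>Case 2 can be sketched in a few lines of shell. The entrypoint below traps SIGTERM and forwards it to its child; <code>sleep</code> stands in for the real nginx master process, and the <code>/tmp</code> path is only for the demo:</p>

```shell
# A minimal signal-forwarding entrypoint, written out for the demo.
cat > /tmp/entrypoint.sh <<'EOF'
#!/bin/sh
sleep 300 &                  # stand-in for the real service process
child=$!
trap 'kill -TERM "$child"; wait "$child"; exit 0' TERM INT
wait "$child"                # block until the child exits or a signal arrives
EOF
chmod +x /tmp/entrypoint.sh

# Simulate `docker stop`: start the entrypoint, then deliver SIGTERM.
/tmp/entrypoint.sh &
pid=$!
sleep 1
kill -TERM "$pid"
if wait "$pid"; then echo "clean shutdown"; fi
```

<p>Inside a container this forwarding matters more than it does in an ordinary shell: PID 1 gets no default signal handlers, so without the <code>trap</code> the TERM from <code>docker stop</code> is simply dropped until the follow-up KILL.</p>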
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>How does the Kubelet interact with the ApiServer?&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content>
		</item>
		
		<item>
			<title>Vimium Shortcut Cheat Sheet</title>
			<link>/posts/vimium-%E5%BF%AB%E6%8D%B7%E9%94%AE%E6%89%8B%E5%86%8C/</link>
			<pubDate>Mon, 15 May 2017 21:14:51 +0800</pubDate>
			
			<guid>/posts/vimium-%E5%BF%AB%E6%8D%B7%E9%94%AE%E6%89%8B%E5%86%8C/</guid>
			<description></description>
			<content type="html"><![CDATA[<h2 id="navigating-the-current-page">Navigating the current page:</h2>
<pre tabindex="0"><code>?       show the help dialog for a list of all available keys
h       scroll left
j       scroll down
k       scroll up
l       scroll right
gg      scroll to top of the page
G       scroll to bottom of the page
d       scroll down half a page
u       scroll up half a page
f       open a link in the current tab
F       open a link in a new tab
r       reload
gs      view source
i       enter insert mode -- all commands will be ignored until you hit Esc to exit
yy      copy the current url to the clipboard
yf      copy a link url to the clipboard
gf      cycle forward to the next frame
gF      focus the main/top frame
</code></pre><h2 id="navigating-to-new-pages">Navigating to new pages:</h2>
<pre tabindex="0"><code>o       Open URL, bookmark, or history entry
O       Open URL, bookmark, history entry in a new tab
b       Open bookmark
B       Open bookmark in a new tab
</code></pre><h2 id="using-find-支持正则-regular-expressionshttpsgithubcomphilcvimiumwikifind-mode">Using find: (支持正则 <a href="https://github.com/philc/vimium/wiki/Find-Mode">regular expressions</a>)</h2>
<pre tabindex="0"><code>/       enter find mode
          -- type your search query and hit enter to search, or Esc to cancel
n       cycle forward to the next find match
N       cycle backward to the previous find match
</code></pre><h2 id="navigating-your-history">Navigating your history:</h2>
<pre tabindex="0"><code>H       go back in history
L       go forward in history
</code></pre><h2 id="manipulating-tabs">Manipulating tabs:</h2>
<pre tabindex="0"><code>J, gT   go one tab left
K, gt   go one tab right
g0      go to the first tab
g$      go to the last tab
^       visit the previously-visited tab
t       create tab
yt      duplicate current tab
x       close current tab
X       restore closed tab (i.e. unwind the &#39;x&#39; command)
T       search through your open tabs
&lt;a-p&gt;   pin/unpin current tab
</code></pre><h2 id="using-marks">Using marks:</h2>
<pre tabindex="0"><code>ma, mA  set local mark &#34;a&#34; (global mark &#34;A&#34;)
`a, `A  jump to local mark &#34;a&#34; (global mark &#34;A&#34;)
``      jump back to the position before the previous jump
          -- that is, before the previous gg, G, n, N, / or `a
</code></pre><h2 id="additional-advanced-browsing-commands">Additional advanced browsing commands:</h2>
<pre tabindex="0"><code>]], [[  Follow the link labeled &#39;next&#39; or &#39;&gt;&#39; (&#39;previous&#39; or &#39;&lt;&#39;)
          - helpful for browsing paginated sites
&lt;a-f&gt;   open multiple links in a new tab
gi      focus the first (or n-th) text input box on the page
gu      go up one level in the URL hierarchy
gU      go up to root of the URL hierarchy
ge      edit the current URL
gE      edit the current URL and open in a new tab
zH      scroll all the way left
zL      scroll all the way right
v       enter visual mode; use p/P to paste-and-go, use y to yank
V       enter visual line mode
</code></pre>]]></content>
		</item>
		
	</channel>
</rss>
