Add post-11

ClovertaTheTrilobita 2026-06-28 18:49:14 +08:00
parent 8d1e8fa0cb
commit b1580757ae
2 changed files with 612 additions and 0 deletions

305
src/blog/en/post-11.md Normal file

@ -0,0 +1,305 @@
---
title: "[Study Notes] The First Boss outside the Tutorial Village—The KMP Algorithm"
pubDate: 2026-06-28
description: 'Great, now that you know how to print Hello World in Cpp, hurry up and deal with this adorable little thing!'
author: "Cloverta"
image:
url: "https://files.seeusercontent.com/2026/06/28/oA7c/pasted-image-1782643695107.webp"
alt: "meme"
tags: ["Study Notes", "Algorithms"]
---
## Preface
A mysterious alumnus who had already graduated shoved a hard drive into your hands. Stored inside were nearly ten years' worth of original final exam papers! Unfortunately, the files were encrypted, and the study materials could only be viewed by entering the correct password.
There were only two other `.txt` files on the hard drive, each containing a string. One of them was a whopping 2 MB in size! The other was an adorable little thing containing only around a hundred characters.
“Find every occurrence of the substring!” The alumnus's words echoed in your ears. “Concatenate all the resulting positions, and you'll get the password for extracting the study materials!”
What are we supposed to do?! The final exam starts in only eight hours. How can we get our hands on those study materials?!
## Brute-Force Matching
When we think about performing pattern matching between two strings, the first idea that pops into our heads is obviously...
Compare them one by one! 🤓☝️
Take every substring of length $n$ (the length of the pattern string) from the text string and compare each of them with the pattern string in turn, until a perfectly matching substring is found or every substring has failed to match.
This is way too easy. Suppose we have two strings, `abacccaaccba` and `ccb`, and we want to find every occurrence of `ccb` in the text string. We can simply compare the substring against the text string one position at a time, starting from the beginning.
First, we compare the first three characters of the text string, namely `aba`. Unfortunately, it is not `ccb`;
Next, we compare the second through fourth characters of the text string, namely `bac`. Unfortunately, that is not `ccb` either;
But that is fine. As long as we continue matching them one by one like this, we will eventually find it.
Let's write the code in one go!
```c
#define MaxLen 255           // Maximum number of characters a string can hold

typedef struct {             // Define the structure used to represent a string
    char ch[MaxLen + 1];     // Characters are stored in ch[1..length]; ch[0] is left unused
    int length;              // Length of the string
} SString;

int Index(SString S, SString T) {
    int i = 1, j = 1;        // i marks the current character in the text string, j the current character in the substring
    while (i <= S.length && j <= T.length) { // Continue looping while i and j remain within their respective strings
        if (S.ch[i] == T.ch[j]) { // The characters currently pointed to by i and j are identical
            i++; j++;        // Move both i and j one position forward and compare the next pair of characters
        } else {             // The two characters are different
            i = i - j + 2;   // Move i to the character after the start of the current attempt; e.g. an attempt that began at the first character resumes at the second
            j = 1;           // Move j back to the first character of the substring
        }
    }
    if (j > T.length) return i - T.length; // The entire substring matched, so return its position in the text string
    return 0;                // Otherwise, return 0 to indicate that the match failed
}
```
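The story asks for every occurrence, while `Index` stops at the first one. Here is a minimal sketch of one way to collect them all with the same brute-force idea. `IndexAll` and `MakeSString` are hypothetical helpers that are not part of the original notes, and `MaxLen` is set to 255 purely for illustration:

```c
#include <string.h>

#define MaxLen 255                 /* assumed capacity, for illustration only */

typedef struct {
    char ch[MaxLen + 1];           /* characters live in ch[1..length] */
    int length;
} SString;

/* Hypothetical helper: copy a C string into a 1-based SString. */
SString MakeSString(const char *s) {
    SString r = { {0}, 0 };
    r.length = (int)strlen(s);
    if (r.length > MaxLen) r.length = MaxLen;   /* truncate rather than overflow */
    memcpy(r.ch + 1, s, (size_t)r.length);
    return r;
}

/* Hypothetical helper: write the 1-based position of every match of T in S
   into pos[] (at most cap entries are stored) and return the match count. */
int IndexAll(SString S, SString T, int pos[], int cap) {
    int count = 0;
    if (T.length == 0) return 0;
    for (int start = 1; start + T.length - 1 <= S.length; ++start) {
        int j = 1;
        while (j <= T.length && S.ch[start + j - 1] == T.ch[j]) ++j;
        if (j > T.length) {        /* the whole pattern matched at start */
            if (count < cap) pos[count] = start;
            ++count;
        }
    }
    return count;
}
```

For the strings used above, `abacccaaccba` and `ccb`, this reports a single match at position 9.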
So Easy! Just as you are about to happily press “Compile and Run,” your intuition stops your fingertip in its tracks.
Something is wrong.
Oh no. No, no, no. This is not right!
**When matching the substring and the text string using this method, suppose the text string has a length of $m$ and the substring has a length of $n$. In the worst-case scenario, matching the entire text string requires on the order of $m \times n$ character comparisons! The time complexity is $O(mn)$!** But you only have eight hours left, and reading through all the exam papers will take at least seven and a half hours. This method obviously will not work.
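To make the worst case concrete, here is a small sketch that counts character comparisons. `CountComparisons` is a hypothetical instrumented copy of the brute-force loop; it takes plain C strings for brevity but keeps the 1-based positions used in these notes:

```c
#include <string.h>

/* Hypothetical instrumented version of the brute-force loop: returns how many
   character comparisons are made while searching text for pattern. */
long CountComparisons(const char *text, const char *pattern) {
    int m = (int)strlen(text), n = (int)strlen(pattern);
    int i = 1, j = 1;              /* 1-based positions, as in Index */
    long comparisons = 0;
    while (i <= m && j <= n) {
        ++comparisons;
        if (text[i - 1] == pattern[j - 1]) {   /* k-th character is at index k - 1 */
            ++i; ++j;
        } else {
            i = i - j + 2;         /* restart one position after the last start */
            j = 1;
        }
    }
    return comparisons;
}
```

With a text of ten `a`s and the pattern `aaab`, every starting position from 1 to 7 costs four comparisons before the mismatch at `b`, and the last attempt costs three more, for 31 comparisons in total. That is the $m \times n$ behaviour in miniature.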
So what are we supposed to do?
This is where the famous KMP algorithm makes its grand entrance.
## The KMP Algorithm
### 1. The Idea
Could we come up with a method that does not move the `j` pointer all the way back to the beginning of the substring whenever a mismatch occurs?
Hmm... That seems possible. After all, the substring may contain repeated parts. Let's try an example first.
Suppose we have a partially matched text string `ababab??????` and the pattern string `abababcdef`.
When we reach the seventh character:
```txt
      i
ababab??????
abababcdef
      j
```
Suppose the match fails at this point. According to the brute-force matching approach, we would need to move `i` back to the second position of the text string and move `j` back to the first position of the substring.
But that is far too slow.
Let us activate our astonishing powers of observation. We notice that the first six characters of the substring are `ababab`, which consists of `ab` repeated three times.
Intuitively, could we not simply do this instead?
```txt
      i
ababab??????
  abababcdef
      j
```
The position of `i` remains unchanged, while `j` moves back only to the fifth character. Clearly, `i` no longer needs to backtrack, and `j` does not need to return to the beginning every single time. The matching efficiency immediately improves by a huge amount!
But how do we know where `j` should return to in each situation? We cannot exactly take a glance at the strings every time and personally tell the computer where to go, can we?
### 2. The `next` Array
We need to store the character position to which `j` should backtrack in the `next` array.
In other words, we calculate such an array in advance. Suppose a mismatch occurs when `j` reaches the seventh character of the substring. We only need to check the `next` array, obtain the value stored in `next[7]`, and directly change `j` to that value.
That is to say, for the string formed by the characters before the current one, we only need to find the length of the longest prefix that is identical to the suffix of the same length.
<details>
<summary>》〉What is the prefix & suffix of a string?〈《</summary>
- A prefix of a string is a substring that begins with the first character of the string, but it cannot be as long as the original string itself. For example, the prefixes of `abab` may be `a`, `ab`, and `aba`, but not `abab`.
- A suffix of a string is a substring that ends with the final character of the string, but it cannot be as long as the original string itself. For example, the suffixes of `abab` may be `b`, `ab`, and `bab`.
</details>
For the substring `abababcdef` in our example, when we reach the seventh character, `c`, the characters before `c` form the string `ababab`. It is easy to see that the **prefix** formed by the first four characters and the **suffix** formed by the final four characters are identical: both are `abab`. No longer prefix and suffix of `ababab` match.
Therefore, when the seventh character fails to match, we only need to move `j` back to position $4+1$. In other words, we continue matching from the fifth character.
```txt
abababcdef
    j
```
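The "longest identical prefix and suffix" can be checked straight from the definition. The sketch below is deliberately slow and only meant for verifying small examples by machine. `LongestBorder` is a hypothetical helper name, and it works on an ordinary 0-based C string rather than `SString`:

```c
#include <string.h>

/* Hypothetical helper: length of the longest proper prefix of s[0..len-1]
   that is also a suffix of it. Candidates are tried from longest to shortest,
   so the first hit is the answer. */
int LongestBorder(const char *s, int len) {
    for (int k = len - 1; k >= 1; --k) {
        /* compare the first k characters with the last k characters */
        if (memcmp(s, s + len - k, (size_t)k) == 0) return k;
    }
    return 0;                      /* no proper prefix equals a suffix */
}
```

`LongestBorder("ababab", 6)` gives 4, which matches the value used above: `j` falls back to position 4+1=5.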
Now let's try calculating the entire `next` array for this substring~
For the first character, we find that there are no characters before it. We therefore define its corresponding `next[1]` as 0. This makes it convenient to directly perform `++j` during the next round of the algorithm, which causes matching to restart from the first character of the substring.
For the second character, there is only one character before it: `a`. Therefore, it has no prefix or suffix. During the next comparison, we must begin matching from the first character, so we define `next[2]` as `0+1=1`.
For the third character, the preceding string is `ab`. There is still no identical prefix and suffix when counting from the beginning and the end. During the next comparison, we must still start from the beginning, so we define `next[3]=1`.
For the fourth character, the preceding string is `aba`. We find that a prefix of length `1` and a suffix of length `1` are identical: both are `a`. Therefore, during the next comparison, we can directly begin matching from the second character, so we define `next[4]` as `1+1=2`.
For the fifth character, the preceding string is `abab`. It has an identical prefix and suffix of length `2`, both of which are `ab`. Therefore, we define `next[5]` as `2+1=3`.
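For the sixth character, the preceding string is `ababa`. Its longest identical prefix and suffix is `aba`, of length `3`, so we define `next[6]=3+1=4`.
For the seventh character, as we saw above, the preceding string is `ababab`, so `next[7]=4+1=5`.
When we reach the eighth character, the preceding string is `abababc`. At this point, there is no identical prefix and suffix, so we define `next[8]=0+1=1`.
Got it? Then let's move on to the code.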
```c
void GetNext(SString T, int next[]) {
    int i = 1, j = 0;          // At the very beginning, initialize i and j
    next[1] = 0;               // Define the first element of the next array as 0
    while (i < T.length) {     // Continue looping while i has not reached the end of the string
        if (j == 0 || T.ch[i] == T.ch[j]) { // If j is 0, or if the two characters pointed to by i and j are identical
            ++i; ++j;          // Increment i and j by 1
            next[i] = j;       // Set the i-th element of the next array to j
        } else {
            j = next[j];       // If they are different, move j back to the backtracking position stored for it
        }
    }
}
```
You can understand the code above as matching the substring against itself, except that `j` begins one position behind `i`.
When the first iteration begins, `j` is `0`, so the condition is satisfied. Both `i` and `j` are incremented by 1, and `next[i]`, which is `next[2]`, is set to 1.
```txt
 i
ababcdef
 ababcdef
 j
```
The second iteration then begins. At this point, `i=2` and `j=1`, and the characters they point to are different, so the condition is not satisfied. Therefore, `j` returns to `next[1]`, which is position 0.
```txt
 i
ababcdef
  ababcdef
 j
```
During the third iteration, `j` is `0`, so `i++; j++;` is performed. At this point, `j=1`, `i=3`, and `next[3]=1`.
```txt
  i
ababcdef
  ababcdef
  j
```
During the fourth iteration, we find that the characters pointed to by the two pointers are identical, so `i++; j++;` is performed. At this point, `j=2`, `i=4`, and `next[4]=2`.
```txt
   i
ababcdef
  ababcdef
   j
```
Continuing in the same way, we can obtain the entire `next` array for the substring.
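If you would rather let the computer narrate the remaining iterations, the sketch below prints each step. `GetNextTraced` is a hypothetical variant of `GetNext` written for this illustration; it takes a plain C string, keeps the 1-based positions, and expects `next[]` to have room for `strlen(t) + 1` integers:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical traced variant of GetNext working on a plain C string.
   next[] must have room for strlen(t) + 1 ints; index 0 is unused. */
void GetNextTraced(const char *t, int next[]) {
    int n = (int)strlen(t);
    int i = 1, j = 0;
    if (n == 0) return;            /* nothing to compute for an empty pattern */
    next[1] = 0;
    while (i < n) {
        if (j == 0 || t[i - 1] == t[j - 1]) {   /* t[k - 1] is the k-th character */
            ++i; ++j;
            next[i] = j;
            printf("advance:   i=%d j=%d next[%d]=%d\n", i, j, i, j);
        } else {
            printf("fall back: i=%d j=%d -> j=%d\n", i, j, next[j]);
            j = next[j];
        }
    }
}
```

Called on `abababcdef`, the first four lines of output correspond to the four iterations walked through above, and the finished array holds 0, 1, 1, 2, 3, 4, 5, 1, 1, 1 in positions 1 to 10.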
### 3. The Code
After putting our ideas in order, we obtain the complete code:
```c
void GetNext(SString T, int next[]); // Forward declaration; the definition follows below

int IndexKMP(SString S, SString T) { // Assumes T.length >= 1
    int i = 1, j = 1;          // At the beginning, both i and j point to the first character
    int next[T.length + 1];    // Following the convention used by most textbooks, index 0 of the next array is left unused so that each array index matches the ordinal position of its corresponding element (a C99 variable-length array)
    GetNext(T, next);          // Calculate the next array
    while (i <= S.length && j <= T.length) { // Continue looping while i and j have not exceeded the lengths of their respective strings
        if (j == 0 || S.ch[i] == T.ch[j]) {  // If j is 0, or if the characters pointed to by the two pointers are identical
            ++i; ++j;          // Move both i and j one position forward
        } else {
            j = next[j];       // Otherwise, move j back to the backtracking position stored in the next array
        }
    }
    if (j > T.length) return i - T.length; // If j has traversed the entire substring, the match has succeeded, so return the position of the substring in the text string
    return 0;                  // Otherwise, return 0 to indicate failure
}

void GetNext(SString T, int next[]) {
    int i = 1, j = 0;          // Initialize i and j
    next[1] = 0;               // Define the first element of the next array as 0
    while (i < T.length) {     // Continue looping while i has not reached the end of the string
        if (j == 0 || T.ch[i] == T.ch[j]) { // If j is 0, or if the two characters pointed to by i and j are identical
            ++i; ++j;          // Increment i and j by 1
            next[i] = j;       // Set the i-th element of the next array to j
        } else {
            j = next[j];       // If they are different, move j back to the backtracking position stored for it
        }
    }
}
```
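`IndexKMP` returns only the first match, but the alumnus wants every position. One possible extension, sketched under a few assumptions: `KMPAll` is a hypothetical function that is not from the original notes, it takes plain C strings while still reporting 1-based positions, and it computes one extra table entry, `next[n + 1]`, so that matching can resume after a full match. It also uses `malloc` rather than a variable-length array, which is just one design choice for long patterns:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: write the 1-based position of every occurrence of
   pattern in text into pos[] (at most cap entries are stored) and return the
   number of matches, or -1 if the table cannot be allocated. */
int KMPAll(const char *text, const char *pattern, int pos[], int cap) {
    int m = (int)strlen(text), n = (int)strlen(pattern);
    if (n == 0 || n > m) return 0;

    /* next[1..n] as in GetNext, plus next[n + 1] for resuming after a match. */
    int *next = malloc((size_t)(n + 2) * sizeof *next);
    if (next == NULL) return -1;
    int i = 1, j = 0;
    next[1] = 0;
    while (i <= n) {               /* one extra round fills next[n + 1] */
        if (j == 0 || pattern[i - 1] == pattern[j - 1]) {
            ++i; ++j;
            next[i] = j;
        } else {
            j = next[j];
        }
    }

    int count = 0;
    i = 1; j = 1;
    while (i <= m) {
        if (j == 0 || text[i - 1] == pattern[j - 1]) {
            ++i; ++j;
            if (j > n) {           /* full match ending at text position i - 1 */
                if (count < cap) pos[count] = i - n;
                ++count;
                j = next[n + 1];   /* keep the longest reusable prefix and carry on */
            }
        } else {
            j = next[j];
        }
    }
    free(next);
    return count;
}
```

Because `j` falls back to `next[n + 1]` instead of 1 after a match, overlapping occurrences are found as well: searching `aaaa` for `aa` reports positions 1, 2 and 3.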
### 4. Optimizing the `next` Array
Being the clever person you are, you must already have noticed that the current algorithm is not optimal...
Why are you looking at me in such horror? `(゚д゚≡゚д゚)` The optimization is actually very simple. It requires only one tiny trick.
Look at `abababcdef`. Its `next` array is:
$$
\boxed{\text{next}=[0,1,1,2,3,4,5,1,1,1]}
$$
Now think about what happens if we reach the third character and discover a mismatch. We look up `next[3]=1`, meaning that we need to return to the first character and try matching again.
However, we already know that the first character is identical to the third character. Since the third character failed to match, the first character must also fail to match. We would then have to consult the array once again and jump to `next[1]=0`.
So why can we not simply jump directly to 0?
Exactly. After a mismatch, the original KMP algorithm may sometimes jump to a position containing the same character, meaning that the next comparison is guaranteed to fail again. `nextval` directly skips over this useless comparison.
Simple, right? Let's go straight to the code.
```c
void GetNextVal(SString T, int next[], int nextval[]) {
    nextval[1] = 0;                    // Similarly, define the first element as 0
    for (int i = 2; i <= T.length; ++i) { // Check each element in sequence, beginning with the second element
        int j = next[i];               // j is the backtracking position recorded for character i
        if (T.ch[i] != T.ch[j]) {      // If the character at the backtracking position differs from the current character
            nextval[i] = j;            // Keep the original next value
        } else {
            nextval[i] = nextval[j];   // Otherwise, character j would fail too, so reuse the nextval already computed for position j
        }
    }
}
```
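As a sanity check, applying this rule by hand to `abababcdef` under the same 1-based convention gives the array below. It was worked out here for illustration and is not taken from the original notes:

$$
\boxed{\text{nextval}=[0,1,0,1,0,1,5,1,1,1]}
$$

Positions 3 to 6 now fall back further than before, because each of them holds the same character as its `next` target. Position 7 keeps `next[7]=5`, since `c` differs from the fifth character, `a`.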
You have now mastered the KMP algorithm. At last, you can proudly retrieve the password!
Your phone rings.
The alumnus, who had left more than 99 of your messages on read, suddenly sends you a message.
“Um, I forgot to tell you. My password string is actually indexed starting from 0, so you'll need to rewrite the algorithm as a zero-based version.”
“And you absolutely must not just subtract 1 before `return`, or a big burly top will come pounding on your dorm-room door tonight 😠”
“I'm sure you can figure it out.”
<img src="https://files.seeusercontent.com/2026/06/28/K0sd/pasted-image-1782642667317.webp" alt="pasted-image-1782642667317.webp" title="pasted-image-1782642667317.webp" style="display: block; margin: 0px auto; zoom: 33%;">

307
src/blog/zh/post-11.md Normal file

@ -0,0 +1,307 @@
---
title: "[学习笔记]出新手村的第一个boss——KMP算法"
pubDate: 2026-06-28
description: '很好现在你已经会用Cpp打印Hello World了那么快点去解决这个可爱的小东西吧'
author: "三叶"
image:
url: "https://files.seeusercontent.com/2026/06/28/oA7c/pasted-image-1782643695107.webp"
alt: "meme"
tags: ["学习笔记", "算法"]
---
## 前言
已经毕业的神秘老学长将一块硬盘塞给了你里面存储着近10年的期末考试原卷但是很可惜文件被加密了只有输入正确的密码才能查看学习资料。硬盘中只有另外两个`.txt`文件分别存有两个字符串其中一个有足足2Mb大小而另一个则是一个非常短的只有百来个字符的可爱小不点。
“找到所有的子串位置!”老学长的话语回荡在你的耳边,“将得到的结果连在一起便是学习资料的解压密码!”
怎么办还有8个小时就要期末考试了我们该如何拿到学习资料
## 朴素匹配
一想到两个字符串进行模式匹配,那么我们脑海中蹦出来的第一个想法肯定是……
一个一个匹配!🤓☝️
将主串中所有长度为$n$(即模式串长度)的子串依次与模式串对比,直到找到一个完全匹配的子串或所有的子串都不匹配为止。
这太简单了。假设我们有两个字符串`abacccaaccba`和`ccb`,我们想要找出主串中的所有`ccb`,我们可以把这个子串和主串从头开始一个一个匹配。
首先我们比较主串的前三个字符,即`aba`,很可惜,不是`ccb`
接下来比较主串的第2个到第4个字符是`bac`,很可惜,也不是`ccb`
不过没关系,只要这样子一个一个匹配下去,我们总能匹配到。
一口气把代码写出来吧!
```c
#define MaxLen 255           // 串的最大长度

typedef struct {             // 定义串的结构体
    char ch[MaxLen + 1];     // 存储串中字符的数组,字符存放在ch[1..length],ch[0]不使用
    int length;              // 串的长度
} SString;
int Index(SString S, SString T) {
int i=1,j=1; // 初始化两个计数指针i用于标注我们在比较主串的哪一个字符j用于标注子串
while(i<=S.length && j<=T.length) { // 在i和j没有超出各自串的范围时循环
if (S.ch[i]==T.ch[j]){ // 如果发现i和j各自指向的字符相同
i++; j++; // 那么i和j同时向后移动一个元素开始比较各自串的下一个字符
}
else { // 如果两个字符不同
i=i-j+2; // i回到本次比较开始时的下一个字符例如刚刚我们从主串的第1个元素开始匹配i此时回到第二个元素
j=1; // j回到子串的第一个元素
}
}
if (j>T.length) return i-T.length; // 如果j大于子串的长度即整个子串匹配成功那么返回子串在主串中的位置
return 0; // 否则返回0代表匹配失败
}
```
So Easy! 就在你准备快活地按下“编译并运行”时,你的直觉让你的指尖停了下来。
不对劲。
哦不对不对不对。不对!
<b>用这种方法匹配子串和主串,假设主串长度为$m$,子串长度为$n$,那么在最坏的情况下,匹配完整个主串需要的时间是$m \times n$!时间复杂度是$O(mn)$</b>但是你只有8个小时的时间了而你看完所有的卷子也需要至少7个半小时这种方法显然不行。
那怎么办呢。
这时候就得请大名鼎鼎的KMP算法出场了。
## KMP算法
### 1. 思路
我们可不可以想出一种方法,在匹配失败的时候不要直接将`j`指针归位到子串的最开始?
嘶……好像行得通,毕竟子串中会有重复的地方嘛,先拿个例子试试吧。
对于一个我们匹配到一半的主串`ababab??????`和模式串`abababcdef`。
当我们匹配到第7个元素时
```txt
      i
ababab??????
abababcdef
      j
```
假如说此时匹配失败了,按照朴素匹配的思想,我们需要把`i`回退到主串第2个位置,`j`回退到子串第1个位置。
但是这样太慢了。
让我们发动惊人的注意力。我们注意到子串的前6个字符是`ababab`它是由`ab`重复3次组成。
那么直觉上来看我们是不是可以直接这样:
```txt
      i
ababab??????
  abababcdef
      j
```
`i`保持不变,而`j`仅回退到第5个元素。显然`i`不必再回溯,而`j`也不用每次都回到头,那匹配效率一下子高了一大截!
但是,我们怎么知道什么时候回到哪个地方?总不能每次瞅一眼然后告诉计算机该怎么走吧?
### 2. next数组
我们要把`j`应该回溯到的元素序号存在`next数组`内。
换句话说,我们事先计算好这样一个数组,假设在`j`匹配到子串第7个元素时匹配失败我们只需要查看一下`next数组`,找到`next[7]`中的值,直接把`j`的值改为该值。
也就是说,我们只需要计算出,在当前元素之前的字符组成的串中。前多少个字符(前缀)和后多少个字符(后缀)相同。
<details>
<summary>》〉什么是前缀和后缀?〈《</summary>
- 串的<b>前缀</b>:从串的第一个元素开始的子串,但是不能和主串一样长。例如`abab`的前缀可以是`a`、`ab`、`aba`,不能是`abab`。
- 串的<b>后缀</b>:从串的最后一个元素开始的子串,但是不能和主串一样长。例如`abab`的后缀可以是`b`、`ab`、`bab`。
</details>
对于例子中的子串`abababcdef`在我们匹配到第7个字符`c`时,在`c`之前的元素可以组成串`ababab`并且易知前4个字符组成的**前缀**和后四个字符组成的**后缀**相同,都是`abab`那么在第7个字符匹配失败时我们仅须将`j`回溯到第$4+1$个位置。也就是从第5个字符继续匹配。
```txt
abababcdef
    j
```
现在我们试着求这个子串的整个next数组吧
对于第一个元素,我们发现它前面没有字符。那我们规定其对应的`next[1]`为0。这样方便在算法进行下一轮计算时直接进行`++j`操作,恰好从子串的第一个字符重新匹配。
对于第二个元素,它前面只有一个元素`a`,那么它没有前缀和后缀,下一次匹配时我们要从第一个字符开始匹配,我们记`next[2]`为`0+1=1`。
对于第三个元素,它前面的串为`ab`,从前数和从后数也都没有相同的前后缀,下一次匹配我们仍然要从头开始,记`next[3]=1`。
对于第四个元素,前串为`aba`,我们发现有一个长度为`1`的前缀和长度为`1`的后缀相同,都是`a`那么下一次我们可以直接从第2个字符开始匹配记`next[4]`为`1+1=2`。
对于第五个元素,前串为`abab`,有长度为`2`的前缀后缀相同,都是`ab`,那么可以记`next[5]`为`2+1=3`。
对于第六个元素,前串为`ababa`,记`next[6]=3+1=4`。
到第八个元素,前串是`abababc`,此时没有前缀和后缀相同,那么记`next[8]=0+1=1`。
理解了吗?那我们上代码。
```c
void GetNext(SString T, int next[]){
int i = 1, j = 0; // 一切的最开始我们将i和j初始化
next[1] = 0; // 规定next数组的第一个元素为0
while (i < T.length) { // 当i不超过串的总长时循环
if (j == 0 || T.ch[i] == T.ch[j]){ // 如果j为0或者i和j所指向的两个元素相同时
++i; ++j; // i和j分别加1
next[i] = j; // 将next数组中的第i个元素记为j
}
else j = next[j]; // 如果不相同那么j回到当前元素记录的回溯位置
}
}
```
你可以理解上述代码为,将子串和自己匹配,但开始时`j`比`i`小一位。
进入第一轮循环时,由于`j`为`0`,满足判断条件,`i`和`j`加1并将`next[i]`(即`next[2]`记为1。
```txt
 i
ababcdef
 ababcdef
 j
```
紧接着进行第二次循环,此时`i=2``j=1`,且它们所指的元素不相等,不满足条件,那么`j`返回到`next[1]`即第0位。
```
 i
ababcdef
  ababcdef
 j
```
进行第三次循环,`j`为`0`,那么`i++; j++;`,此时`j=1``i=3``next[3]=1`。
```
  i
ababcdef
  ababcdef
  j
```
进行第四次循环,我们发现两个指针指向的元素相同,那么`i++; j++;`,此时`j=2` `i=4``next[4]=2`。
```
   i
ababcdef
  ababcdef
   j
```
以此类推我们可以得到整个子串的next数组。
### 3. 代码
整理整理我们的思路,可以得到完整的代码
```c
void GetNext(SString T, int next[]); // 前置声明,函数定义见下方

int IndexKMP(SString S, SString T) { // 假定 T.length >= 1
    int i = 1, j = 1;          // 开始时,i和j均指向第一个字符
    int next[T.length + 1];    // 我们遵循大部分教材,next数组第一个元素为空,保证元素下标和位序相同(C99变长数组)
    GetNext(T, next);          // 计算next数组
    while (i <= S.length && j <= T.length) { // 当i和j没有超过各自的串长时循环
        if (j == 0 || S.ch[i] == T.ch[j]) {  // 如果j为0,或者两个指针指向的字符相同
            ++i; ++j;          // i和j都往后移一位
        } else {
            j = next[j];       // 否则j回到next数组中记录的回溯位置
        }
    }
    if (j > T.length) return i - T.length; // 如果j遍历完了子串,说明匹配成功,返回子串在主串中的位置
    return 0;                  // 否则返回0,表示失败
}

void GetNext(SString T, int next[]) {
    int i = 1, j = 0;          // 将i和j初始化
    next[1] = 0;               // 规定next数组的第一个元素为0
    while (i < T.length) {     // 当i不超过串的总长时循环
        if (j == 0 || T.ch[i] == T.ch[j]) { // 如果j为0,或者i和j所指向的两个元素相同时
            ++i; ++j;          // i和j分别加1
            next[i] = j;       // 将next数组中的第i个元素记为j
        } else {
            j = next[j];       // 如果不相同,那么j回到当前元素记录的回溯位置
        }
    }
}
```
### 4. KMP数组的优化
聪明的你一定发现了,目前这个算法并不是最优的……
干嘛这么惊恐地看着我(゚д゚≡゚д゚),它的优化其实很简单,仅仅需要一个小巧思。
你看,对于`abababcdef`它的next数组为
$$
\boxed{\text{next}=[0,1,1,2,3,4,5,1,1,1]}
$$
那么想想,如果我们匹配到第三个字符,发现不匹配,此时查询`next[3]=1`,我们需要回到第一个字符重新匹配。但是我们知道第1个字符和第3个字符一样,第3个字符不匹配,那么第1个字符也必然不匹配,届时我们又要查表,跳转到`next[1]=0`。
那么我们为什么不能直接跳转到0呢
没错,原始 KMP 在失配后,有时会跳到一个字符相同的位置,导致下一次比较必然再次失配。`nextval` 会直接跳过这种无效比较。
很简单吧?我们直接上代码。
```c
void GetNextVal(SString T, int next[], int nextval[]) {
nextval[1] = 0; // 同样规定第一个元素为0
for (int i = 2; i <= T.length; ++i) { // 从第二个元素依次检查
int j = next[i];
if (T.ch[i] != T.ch[j]) { // 如果next数组中回溯的字符和该字符不相同
nextval[i] = j; // 则保持不变
} else {
nextval[i] = nextval[j]; // 否则第j个字符也必然失配,直接沿用第j个元素已算好的nextval值
}
}
}
```
现在你已经熟练掌握KMP算法了!你可以得意洋洋地拿到密码了!
手机响了。已读不回了99+消息的老学长突然给你发了条信息。
“那个忘了告诉你了其实我的密码串是从序列0开始的你需要把算法改成从0开始的版本哦。”
“千万不可以在return前直接减1否则晚上会有一个大猛1来框框敲你宿舍门😠”
“我相信你一定能写出来。”
<img src="https://files.seeusercontent.com/2026/06/28/K0sd/pasted-image-1782642667317.webp" alt="pasted-image-1782642667317.webp" title="pasted-image-1782642667317.webp" style="display: block; margin: 0px auto; zoom: 33%;">