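# 11. String Handling

Strings are among the most frequently used data types in JavaScript. From simple concatenation to template engines and regular-expression matching, string handling shows up throughout everyday development. This chapter covers how strings are stored, the full set of string methods, advanced uses of template literals, and regular-expression syntax in depth.

# 11.1 String Basics

# 11.1.1 Internal Representation of Strings

JavaScript strings are sequences of UTF-16 code units, each 16 bits (2 bytes) wide. A character in the Basic Multilingual Plane (BMP) takes one code unit. A character outside the BMP (code point above U+FFFF) is encoded as a **surrogate pair** of two code units, 4 bytes in total.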

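// BMP character: one code unit
const a = '你';
console.log(a.length);        // 1
console.log(a.charCodeAt(0)); // 20320

// Supplementary-plane character: a surrogate pair, two code units
const emoji = '😀';
console.log(emoji.length);        // 2 (two UTF-16 code units)
console.log(emoji.charCodeAt(0)); // 55357 (high surrogate 0xD83D)
console.log(emoji.charCodeAt(1)); // 56832 (low surrogate 0xDE00)

// Getting the full code point
console.log(emoji.codePointAt(0)); // 128512 (0x1F600)

This means that string.length returns the number of UTF-16 code units, not the number of visible characters. ES6 introduced code-point-aware APIs (such as codePointAt, String.fromCodePoint, and string iteration) to handle this correctly.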
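To make the difference between code units and code points concrete, here is a minimal sketch. The helper name `countCodePoints` is hypothetical (it is not a built-in); it relies only on the fact that `for...of` iterates a string by code point.

```javascript
// Hypothetical helper: counts Unicode code points instead of UTF-16 code units.
function countCodePoints(str) {
    let count = 0;
    // for...of walks the string by code point, so a surrogate pair counts once.
    for (const _ch of str) {
        count++;
    }
    return count;
}

const sample = 'a你😀';
console.log(sample.length);           // 4 (1 + 1 + 2 code units)
console.log(countCodePoints(sample)); // 3
```

Counting code points is still not the same as counting user-perceived characters: combining marks and joined emoji sequences span several code points.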

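# 11.1.2 String Immutability

JavaScript strings are immutable. Every string method returns a new string and leaves the original unchanged.

let str = 'hello';
str.toUpperCase(); // returns 'HELLO'
console.log(str);  // still 'hello'

// A string cannot be modified through an index
str[0] = 'H';      // silently ignored in sloppy mode; throws a TypeError in strict mode
console.log(str);  // still 'hello'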
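The sketch below illustrates two consequences of immutability: an index write fails loudly in strict mode, and "changing" a character really means building a new string. Both `tryIndexWrite` and `replaceAt` are hypothetical helper names introduced for this example.

```javascript
// Hypothetical helper: attempts an index write inside a strict-mode function.
function tryIndexWrite(text) {
    'use strict';
    try {
        text[0] = 'H'; // strings are read-only, so strict mode throws here
        return 'no error';
    } catch (err) {
        return err.name;
    }
}
console.log(tryIndexWrite('hello')); // 'TypeError'

// Hypothetical helper: returns a NEW string with one code unit replaced.
function replaceAt(text, index, ch) {
    return text.slice(0, index) + ch + text.slice(index + 1);
}

const word = 'hello';
console.log(replaceAt(word, 0, 'H')); // 'Hello'
console.log(word);                    // 'hello' (the original is untouched)
```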

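V8 internals: as an implementation detail that can change between versions, V8 uses several internal string representations:

| Type | Description |
| --- | --- |
| SeqString | String stored in contiguous memory |
| ConsString | "Rope" structure joining two strings; flattening is deferred |
| SlicedString | Substring view of a parent string that shares its underlying storage |
| ThinString | Wrapper that points to an already flattened string |
| ExternalString | String whose data lives in external memory |

// ConsString: concatenation does not copy immediately
let s = 'hello' + ' ' + 'world'; 
// V8 may build a ConsString tree: (('hello', ' '), 'world')
// It is flattened only when contiguous memory is needed

// SlicedString: a substring can share the parent's storage
let long = 'a'.repeat(10000);
let sub = long.substring(100, 200);
// sub may reference long's underlying storage with offset 100 and length 100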

# 11.1.3 Ways to Create Strings

// 1. Literals (most common)
const s1 = 'hello';
const s2 = "hello";
const s3 = `hello`;  // template literal

// 2. String() called as a function (type conversion)
const s4 = String(123);    // '123'
const s5 = String(true);   // 'true'
const s6 = String(null);   // 'null'

// 3. new String() (not recommended; creates a String wrapper object)
const s7 = new String('hello');
typeof s7; // 'object', not 'string'

// 4. From UTF-16 code units or code points
const s8 = String.fromCharCode(72, 101, 108); // 'Hel'
const s9 = String.fromCodePoint(128512);       // '😀'

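# 11.1.4 Escape Sequences

// Common escapes
'\n'   // newline
'\t'   // tab
'\\'   // backslash
'\''   // single quote
'\"'   // double quote
'\0'   // null character (U+0000)

// Unicode escapes
'\u0041'       // 'A' (4 hex digits, BMP only)
'\u{1F600}'    // '😀' (ES6 brace syntax, supports supplementary planes)
'\x41'         // 'A' (2 hex digits, Latin-1 range)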

# 11.2 String Search Methods

# 11.2.1 indexOf / lastIndexOf

const str = 'hello world hello';

str.indexOf('hello');      // 0 (index of the first occurrence)
str.indexOf('hello', 1);   // 12 (search starts at index 1)
str.indexOf('xyz');         // -1 (not found)

str.lastIndexOf('hello');   // 12 (index of the last occurrence)
str.lastIndexOf('hello', 11); // 0 (search backward from index 11)
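Because indexOf accepts a start position, it can be called in a loop to collect every occurrence. The following is a minimal sketch; `findAllIndexes` is a hypothetical helper name, and advancing by one index (rather than by the needle length) is a deliberate choice so that overlapping matches are reported.

```javascript
// Hypothetical helper: returns the index of every occurrence of needle in text.
function findAllIndexes(text, needle) {
    const indexes = [];
    if (needle === '') return indexes; // an empty needle would match at every index
    let pos = text.indexOf(needle);
    while (pos !== -1) {
        indexes.push(pos);
        pos = text.indexOf(needle, pos + 1); // pos + 1 allows overlapping matches
    }
    return indexes;
}

console.log(findAllIndexes('hello world hello', 'hello')); // [0, 12]
console.log(findAllIndexes('aaa', 'aa'));                  // [0, 1]
```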

# 11.2.2 includes / startsWith / endsWith

These semantic search methods were introduced in ES6. They return a boolean, which reads more clearly than indexOf(...) !== -1:

const str = 'Hello, World!';

str.includes('World');      // true
str.includes('world');      // false (case-sensitive)
str.includes('World', 8);   // false (search starts at index 8)

str.startsWith('Hello');    // true
str.startsWith('World', 7); // true (check starts at index 7)

str.endsWith('!');          // true
str.endsWith('World', 12);  // true (only the first 12 characters are considered)

# 11.2.3 search

Searches with a regular expression and returns the index of the first match:

const str = 'Hello 123 World';
str.search(/\d+/);     // 6
str.search(/xyz/);     // -1
str.search(/hello/i);  // 0 (case-insensitive)

# 11.2.4 match / matchAll

const str = 'cat bat sat';

// match without the g flag returns details of the first match
str.match(/(\w)at/);
// ['cat', 'c', index: 0, input: 'cat bat sat', groups: undefined]

// match with the g flag returns every matched substring
str.match(/\wat/g);
// ['cat', 'bat', 'sat']

// matchAll (ES2020) returns an iterator; each item carries full details
for (const m of str.matchAll(/(\w)at/g)) {
    console.log(`${m[0]} at ${m.index}, captured: ${m[1]}`);
}
// cat at 0, captured: c
// bat at 4, captured: b
// sat at 8, captured: s

The advantage of matchAll: capture groups and match indexes remain available even with the g flag.
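As a slightly larger illustration, the sketch below turns matchAll results into plain objects and shows what happens when the g flag is missing. The log line and the variable names (`accessLog`, `requestRe`, `requests`) are invented for this example.

```javascript
// Invented sample data: two requests in one line of text.
const accessLog = 'GET /home 200, POST /login 401';
const requestRe = /([A-Z]+) (\/\S*) (\d{3})/g;

// Array.from accepts any iterable plus a mapping function.
const requests = Array.from(accessLog.matchAll(requestRe), (m) => ({
    method: m[1],
    path: m[2],
    status: Number(m[3]),
    index: m.index,
}));
console.log(requests);
// [ { method: 'GET', path: '/home', status: 200, index: 0 },
//   { method: 'POST', path: '/login', status: 401, index: 15 } ]

// matchAll requires the g flag and throws a TypeError without it.
let errorName = 'no error';
try {
    accessLog.matchAll(/\d{3}/);
} catch (err) {
    errorName = err.name;
}
console.log(errorName); // 'TypeError'
```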

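# 11.3 String Extraction Methods

# 11.3.1 slice / substring / substr

The differences between these three extraction methods are a common interview topic:

const str = 'Hello, World!';

// slice(start, end) - supports negative indexes
str.slice(7, 12);    // 'World'
str.slice(-6);       // 'World!'
str.slice(-6, -1);   // 'World'

// substring(start, end) - no negative indexes; swaps the arguments when start > end
str.substring(7, 12); // 'World'
str.substring(12, 7); // 'World' (arguments swapped)
str.substring(-3);    // 'Hello, World!' (negative values are treated as 0)

// substr(start, length) - deprecated; avoid in new code
str.substr(7, 5);     // 'World'

| Method | Parameters | Negative indexes | Argument swapping | Recommendation |
| --- | --- | --- | --- | --- |
| `slice` | (start, end) | Supported | No | ★★★ Recommended |
| `substring` | (start, end) | Treated as 0 | Yes | ★★ Acceptable |
| `substr` | (start, length) | Supported for start | No | ★ Deprecated |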

# 11.3.2 at

Added in ES2022; returns a single character and accepts negative indexes:

const str = 'Hello';
str.at(0);   // 'H'
str.at(-1);  // 'o'
str.at(-2);  // 'l'

// Compare with the traditional approaches
str[str.length - 1]; // 'o'
str.charAt(str.length - 1); // 'o'

# 11.4 String Transformation Methods

# 11.4.1 Case Conversion

'Hello'.toUpperCase();      // 'HELLO'
'Hello'.toLowerCase();      // 'hello'

// Locale-aware conversion (handles language-specific rules)
'İstanbul'.toLocaleLowerCase('tr'); // 'istanbul' (Turkish İ → i)

# 11.4.2 Trimming Whitespace

const str = '  hello world  ';

str.trim();       // 'hello world'
str.trimStart();  // 'hello world  ' (ES2019)
str.trimEnd();    // '  hello world' (ES2019)

// Legacy aliases (prefer trimStart / trimEnd)
str.trimLeft();   // same as trimStart
str.trimRight();  // same as trimEnd

# 11.4.3 Padding

// padStart(targetLength, padString)
'5'.padStart(3, '0');      // '005'
'42'.padStart(5, '*');     // '***42'
'hello'.padStart(3);       // 'hello' (already longer than the target length, unchanged)

// padEnd(targetLength, padString)
'hi'.padEnd(5, '!');       // 'hi!!!'
'1.5'.padEnd(6, '0');      // '1.5000'

Practical uses:

// Formatting a time
const h = String(9).padStart(2, '0');  // '09'
const m = String(5).padStart(2, '0');  // '05'
console.log(`${h}:${m}`);             // '09:05'

// Formatting an ID
const id = String(42).padStart(6, '0'); // '000042'
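Padding is also a simple way to align columns in plain-text output. The sketch below combines padEnd and padStart; the data in `rows` and the column widths are made up for illustration.

```javascript
// Made-up sample data: [name, quantity] pairs.
const rows = [
    ['apple', 3],
    ['kiwi', 12],
    ['banana', 150],
];

// Left-align names in 8 columns, right-align quantities in 4 columns.
const lines = rows.map(([name, qty]) =>
    name.padEnd(8, '.') + String(qty).padStart(4, ' '));

console.log(lines.join('\n'));
// apple...   3
// kiwi....  12
// banana.. 150
```

Both methods count UTF-16 code units, so alignment can drift when the text contains emoji or wide East Asian characters.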

# 11.4.4 Repeating and Replacing

// repeat
'ha'.repeat(3);    // 'hahaha'
'*'.repeat(0);     // ''
'x'.repeat(2.9);   // 'xx' (the count is truncated to an integer)

// replace - replaces only the first occurrence
'hello hello'.replace('hello', 'hi'); // 'hi hello'

// replace with a global regex replaces every match
'hello hello'.replace(/hello/g, 'hi'); // 'hi hi'

// replaceAll (ES2021) replaces every occurrence
'hello hello'.replaceAll('hello', 'hi'); // 'hi hi'

// replace with a callback function
'hello world'.replace(/\b\w/g, c => c.toUpperCase());
// 'Hello World'
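Replacement strings are not inserted verbatim: sequences such as `$1` and `$&` have special meaning. The sketch below shows both the useful side and a common pitfall; the variable names are chosen for this example only.

```javascript
// $1, $2 refer to capture groups; $& refers to the whole match.
const swapped = 'John Smith'.replace(/(\w+) (\w+)/, '$2, $1');
const bracketed = 'price: 5'.replace(/\d+/, '[$&]');
console.log(swapped);   // 'Smith, John'
console.log(bracketed); // 'price: [5]'

// Pitfall: a replacement taken from user input may itself contain "$&".
const userText = 'cost $& tax';
const expanded = 'total: X'.replace('X', userText);
console.log(expanded);  // 'total: cost X tax' (the "$&" was expanded)

// A replacer function's return value is inserted literally.
const literal = 'total: X'.replace('X', () => userText);
console.log(literal);   // 'total: cost $& tax'
```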

# 11.4.5 Splitting and Concatenating

// split
'a,b,c'.split(',');       // ['a', 'b', 'c']
'hello'.split('');         // ['h', 'e', 'l', 'l', 'o']
'a,,b'.split(',');         // ['a', '', 'b']
'a,b,c'.split(',', 2);    // ['a', 'b'] (limits the number of pieces)

// Splitting with a regular expression
'a1b2c3'.split(/\d/);     // ['a', 'b', 'c', '']

// concat
'hello'.concat(' ', 'world'); // 'hello world'
// In practice, + or a template literal is more common
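A few split behaviors are easy to overlook. The following sketch, with variable names chosen for illustration, shows how a capturing group keeps separators, how to drop empty pieces, and why split('') differs from spreading.

```javascript
// A capturing group in the separator keeps the separators in the result.
const withSeparators = 'a1b2c3'.split(/(\d)/);
console.log(withSeparators); // ['a', '1', 'b', '2', 'c', '3', '']

// Trailing empty pieces can be filtered out explicitly.
const nonEmpty = 'a1b2c3'.split(/\d/).filter((piece) => piece !== '');
console.log(nonEmpty); // ['a', 'b', 'c']

// split('') cuts between UTF-16 code units; spreading iterates by code point.
console.log('A😀'.split('').length); // 3
console.log([...'A😀'].length);      // 2
```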

# 11.5 Template Literals

# 11.5.1 Basic Usage

Template literals, introduced in ES6, are delimited by backticks and support multi-line text and embedded expressions:

const name = 'World';
const age = 25;

// Embedding expressions
const greeting = `Hello, ${name}! You are ${age} years old.`;

// Multi-line text (line breaks and indentation are preserved)
const html = `
<div>
    <h1>${name}</h1>
    <p>Age: ${age}</p>
</div>
`;

// Any expression can be embedded
`${1 + 2}`;                    // '3'
`${age >= 18 ? 'adult' : 'minor'}`; // 'adult'
`${[1,2,3].join('-')}`;        // '1-2-3'
`${(() => 'IIFE result')()}`;  // 'IIFE result'

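# 11.5.2 Tagged Templates

A tagged template is the advanced form of a template literal: a function (the tag) receives the literal's parts and decides how to process them:

function tag(strings, ...values) {
    console.log(strings); // array of static string parts
    console.log(values);  // values of the interpolated expressions
    return strings.reduce((result, str, i) => 
        result + str + (values[i] !== undefined ? values[i] : ''), '');
}

const name = 'World';
const age = 25;
tag`Hello ${name}, age ${age}!`;
// strings: ['Hello ', ', age ', '!']
// values: ['World', 25]

The strings array has a special raw property that holds the source text of each part with escape sequences left unprocessed:

function showRaw(strings) {
    console.log(strings[0]);     // 'hello', a real line break, then 'world'
    console.log(strings.raw[0]); // hello\nworld (the backslash and n stay as two characters)
}
showRaw`hello\nworld`;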

# 11.5.3 Practical Uses of Tagged Templates

1. Escaping HTML to prevent XSS

function safeHtml(strings, ...values) {
    const escape = (str) => String(str)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
    
    return strings.reduce((result, str, i) => 
        result + str + (i < values.length ? escape(values[i]) : ''), '');
}

const userInput = '<script>alert("xss")</script>';
const html = safeHtml`<div>${userInput}</div>`;
// '<div>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</div>'

2. CSS-in-JS (the idea behind styled-components)

function css(strings, ...values) {
    const rawCSS = strings.reduce((result, str, i) =>
        result + str + (i < values.length ? values[i] : ''), '');
    
    const style = document.createElement('style');
    style.textContent = rawCSS;
    document.head.appendChild(style);
    return rawCSS;
}

const primaryColor = '#007bff';
css`
    .button {
        background-color: ${primaryColor};
        color: white;
        padding: 8px 16px;
    }
`;

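3. Internationalization (i18n)

const i18n = (strings, ...values) => {
    const template = strings.join('{}');
    const translation = translations[template] || template;
    // A replacer function inserts each value literally, so "$&" inside a value is not expanded
    return values.reduce((result, val) => 
        result.replace('{}', () => String(val)), translation);
};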

const translations = {
    'Hello {}, you have {} messages': '你好 {}，你有 {} 条消息'
};

const name = 'Alice';
const count = 5;
i18n`Hello ${name}, you have ${count} messages`;
// '你好 Alice，你有 5 条消息'

4. Building SQL queries (parameterized to prevent injection)

function sql(strings, ...values) {
    const params = [];
    const query = strings.reduce((result, str, i) => {
        if (i < values.length) {
            params.push(values[i]);
            return result + str + `$${params.length}`;
        }
        return result + str;
    }, '');
    return { query, params };
}

const name = "O'Brien"; // contains a special character
const age = 25;
const result = sql`SELECT * FROM users WHERE name = ${name} AND age > ${age}`;
// { query: "SELECT * FROM users WHERE name = $1 AND age > $2", params: ["O'Brien", 25] }

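# 11.5.4 String.raw

String.raw is a built-in tag function that returns the raw text of a template literal, leaving escape sequences unprocessed:

String.raw`Hello\nWorld`;  // 'Hello\\nWorld' (\n is not interpreted)
String.raw`C:\Users\name`; // 'C:\\Users\\name'

// Equivalent to doubling the backslashes by hand
'C:\\Users\\name';

// Handy for regular expressions
const re = new RegExp(String.raw`\d+\.\d+`);
// no need to write '\\d+\\.\\d+'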

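# 11.6 Regular Expressions in Depth

# 11.6.1 Creating Regular Expressions

// Literal syntax (recommended for static patterns)
const re1 = /pattern/gi;

// Constructor (for dynamic patterns)
const re2 = new RegExp('pattern', 'gi');
const variable = 'start';  // stands in for a value known only at runtime
const re3 = new RegExp(variable + '.*end', 'gi');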
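When the dynamic part of a pattern comes from user input, regex metacharacters in it need to be escaped first. The sketch below uses a widely circulated escaping idiom; `escapeRegExp` is a hypothetical helper name, not a built-in used by this chapter.

```javascript
// Hypothetical helper: backslash-escapes every regex metacharacter in text.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const needle = '1+1';

// Unescaped, + is a quantifier: the pattern means "one or more 1s, then 1".
const unescaped = new RegExp(needle);
console.log(unescaped.test('1+1')); // false

// Escaped, + is matched literally.
const escaped = new RegExp(escapeRegExp(needle));
console.log(escaped.source);      // '1\+1'
console.log(escaped.test('1+1')); // true
```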

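# 11.6.2 Flags

| Flag | Name | Description |
| --- | --- | --- |
| `g` | global | Find all matches instead of stopping after the first |
| `i` | ignoreCase | Case-insensitive matching |
| `m` | multiline | Multi-line mode: `^` and `$` match at the start and end of each line |
| `s` | dotAll | `.` matches every character, including line terminators |
| `u` | unicode | Unicode mode (treats a surrogate pair as one code point) |
| `y` | sticky | Sticky matching: match only at the lastIndex position |
| `d` | hasIndices | Record start and end indexes of matches (ES2022) |
| `v` | unicodeSets | Unicode sets in character classes (ES2024) |

// s flag - dotAll
/hello.world/.test('hello\nworld');   // false
/hello.world/s.test('hello\nworld');  // true

// u flag - correct Unicode handling
/^.$/u.test('😀');  // true
/^.$/.test('😀');   // false (without u, . matches only one half of the surrogate pair)

// y flag - sticky matching
const re = /\d+/y;
re.lastIndex = 4;
re.exec('abc 123 def'); // ['123'], matched exactly at index 4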
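The sticky flag is a natural fit for tokenizers, because each match must begin exactly where the previous one ended. The following is a minimal sketch for arithmetic expressions; `tokenize` and its token grammar are assumptions made for this example.

```javascript
// Hypothetical tokenizer: splits an arithmetic expression into number and operator tokens.
function tokenize(input) {
    // Optional whitespace, then one token; y anchors the match at lastIndex.
    const tokenRe = /\s*(\d+|[-+*\/()])/y;
    const source = input.trimEnd();
    const tokens = [];
    let pos = 0;
    while (pos < source.length) {
        tokenRe.lastIndex = pos;
        const m = tokenRe.exec(source);
        if (m === null) {
            throw new SyntaxError(`Unexpected input at or after index ${pos}`);
        }
        tokens.push(m[1]);
        pos = tokenRe.lastIndex; // continue right after the matched token
    }
    return tokens;
}

console.log(tokenize('12 + (3*45)')); // ['12', '+', '(', '3', '*', '45', ')']
```

With the g flag instead of y, the regex would skip over invalid characters and silently drop them; the sticky flag turns them into an error.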

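# 11.6.3 Quantifiers and Greedy vs. Lazy Matching

// Greedy quantifiers (the default): match as much as possible
'aabab'.match(/a.*b/);   // ['aabab']

// Lazy quantifiers (append ?): match as little as possible
'aabab'.match(/a.*?b/);  // ['aab']

// Quantifier reference
/a*/     // zero or more times
/a+/     // one or more times
/a?/     // zero or one time
/a{3}/   // exactly 3 times
/a{2,5}/ // 2 to 5 times
/a{2,}/  // at least 2 times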
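# 11.6.4 Assertions

Assertions are zero-width: they test a condition at the current position without consuming characters:

// 1. Lookahead
/\d+(?=px)/.exec('12px 14em');   // ['12'] (followed by px)
/\d+(?!px)/.exec('12px 14em');   // ['1'] (after backtracking, '1' is followed by '2', not px)

// More precise: also forbid a following digit so the match cannot stop mid-number
/\d+(?!\d|px)/.exec('12px 14em'); // ['14']

// 2. Lookbehind (ES2018)
/(?<=\$)\d+/.exec('$100 €200');  // ['100'] (preceded by $)
/(?<!\$)\d+/.exec('$100 €200');  // ['00'] (the first position not preceded by $ is inside '100')

// Practical use: match the whole number
/(?<=\$)\d+/.exec('Price: $100 €200');  // ['100']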
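A classic use of lookahead is inserting thousands separators: the pattern matches positions rather than characters. The sketch below handles plain digit strings only; `addThousandsSeparators` is a hypothetical helper name, and for locale-aware output `Intl.NumberFormat` is usually the better choice.

```javascript
// Hypothetical helper: inserts commas into a string of digits.
function addThousandsSeparators(digits) {
    // \B          : not at the start of the number
    // (?=(\d{3})+ : followed by one or more complete groups of three digits
    // (?!\d))     : and then no further digit
    return digits.replace(/\B(?=(\d{3})+(?!\d))/g, ',');
}

console.log(addThousandsSeparators('1234567')); // '1,234,567'
console.log(addThousandsSeparators('999'));     // '999'
```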

# 11.6.5 Named Capture Groups

ES2018 introduced named capture groups, which make regular expressions easier to read:

// Traditional approach: refer to groups by number
const dateRe = /(\d{4})-(\d{2})-(\d{2})/;
const match = dateRe.exec('2024-03-15');
const yearByIndex = match[1];  // '2024'

// Named capture groups
const namedRe = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const result = namedRe.exec('2024-03-15');
const { year, month, day } = result.groups;
// year: '2024', month: '03', day: '15'

// Named references in replace
'2024-03-15'.replace(
    /(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})/,
    '$<d>/$<m>/$<y>'
);
// '15/03/2024'

// Backreference inside the pattern
/(?<word>\w+)\s+\k<word>/.test('hello hello'); // true

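# 11.6.6 Unicode Property Escapes

ES2018 introduced \p{...} and \P{...} (the u flag is required):

// Match Han (Chinese) characters
/\p{Script=Han}/u.test('中');  // true
/\p{Script=Han}/u.test('a');   // false

// Match numeric characters from any script
/\p{Number}/u.test('①');  // true
/\p{Number}/u.test('٣');  // true (Arabic-Indic digit three)

// Match punctuation
/\p{Punctuation}/u.test('，'); // true

// Match emoji (note: \p{Emoji} also matches ASCII digits, '#', and '*')
/\p{Emoji}/u.test('😀'); // true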
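The sketch below applies property escapes to a mixed-script string and illustrates why `\p{Emoji}` alone can be surprising. The sample text is invented; `\p{Emoji_Presentation}` is shown as one stricter alternative, not as the only option.

```javascript
// Invented sample text mixing Latin letters, Han characters, and digits.
const mixed = 'JavaScript 字符串 handling 处理 101';

// Extract runs of Han characters.
const hanRuns = mixed.match(/\p{Script=Han}+/gu);
console.log(hanRuns); // ['字符串', '处理']

// \p{Emoji} includes characters that can take an emoji form, such as digits.
console.log(/\p{Emoji}/u.test('7'));              // true
// \p{Emoji_Presentation} covers characters shown as emoji by default.
console.log(/\p{Emoji_Presentation}/u.test('7')); // false
console.log(/\p{Emoji_Presentation}/u.test('😀')); // true
```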

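# 11.6.7 Catastrophic Backtracking

JavaScript regex engines are backtracking (NFA-style) matchers. When a pattern is ambiguous, meaning the same input can be matched in many different ways, a failed match can require an exponential number of backtracking steps:

// Dangerous pattern: nested quantifiers
const evil = /^(a+)+$/;
// Matching 'aaaaaaaaaaaaaaaaaab' is extremely slow (exponential backtracking)

// Mitigations
// 1. Atomic groups would help, but JS does not support them yet; rewrite the pattern instead
const safe = /^a+$/;  // nesting removed

// 2. Limit the input length
function safeMatch(input, pattern, maxLen = 1000) {
    if (input.length > maxLen) return null;
    return input.match(pattern);
}

// 3. Check patterns with a ReDoS detection tool

Common dangerous patterns:

(a+)+: nested quantifiers
(a|a)+: overlapping alternatives
(a+b?)+: an optional element combined with quantifiers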
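Rewriting a dangerous pattern is only safe if the new pattern accepts the same inputs. The sketch below spot-checks that on a few short samples (kept short on purpose so the nested pattern cannot blow up); the sample list is arbitrary, and a spot check is not a proof of equivalence.

```javascript
const nested = /^(a+)+$/; // vulnerable to catastrophic backtracking
const flat = /^a+$/;      // rewritten without nesting

// Arbitrary short samples: both patterns should agree on every one.
const samples = ['', 'a', 'aaaa', 'aab', 'baa', 'aaaaaaaaab'];
const sameResults = samples.every((s) => nested.test(s) === flat.test(s));
console.log(sameResults); // true

// The flat pattern rejects a long non-matching input in linear time.
const longInput = 'a'.repeat(100000) + 'b';
console.log(flat.test(longInput)); // false
```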

# 11.7 Strings and Encodings

# 11.7.1 Working with Code Points

// Getting a code point
'A'.codePointAt(0);          // 65
'😀'.codePointAt(0);         // 128512

// Creating a string from a code point
String.fromCodePoint(65);    // 'A'
String.fromCodePoint(128512); // '😀'

// Iterating over real characters (code-point aware)
const text = 'A😀B';
console.log([...text]);       // ['A', '😀', 'B']
console.log([...text].length); // 3 (number of code points)

// For comparison
console.log(text.length);     // 4 (number of UTF-16 code units)

# 11.7.2 Encoding Conversion

// TextEncoder / TextDecoder (UTF-8 encoding and decoding)
const encoder = new TextEncoder();
const encoded = encoder.encode('Hello 你好');
console.log(encoded); // Uint8Array [72, 101, 108, 108, 111, 32, 228, 189, 160, 229, 165, 189]

const decoder = new TextDecoder('utf-8');
const decoded = decoder.decode(encoded);
console.log(decoded); // 'Hello 你好'

// Base64 encoding and decoding (btoa accepts only Latin-1 characters)
btoa('Hello');        // 'SGVsbG8='
atob('SGVsbG8=');     // 'Hello'

// Chinese text must be converted to UTF-8 bytes first (legacy idiom; escape and unescape are deprecated)
btoa(unescape(encodeURIComponent('你好'))); // '5L2g5aW9'
decodeURIComponent(escape(atob('5L2g5aW9'))); // '你好'
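Since escape and unescape are deprecated, the same round trip can be written with TextEncoder and TextDecoder. This is a minimal sketch; `toBase64Utf8` and `fromBase64Utf8` are hypothetical helper names, and the byte-by-byte loop favors clarity over speed.

```javascript
// Hypothetical helper: UTF-8 encodes text, then Base64 encodes the bytes.
function toBase64Utf8(text) {
    const bytes = new TextEncoder().encode(text);
    let binary = '';
    for (const byte of bytes) {
        binary += String.fromCharCode(byte); // one Latin-1 character per byte
    }
    return btoa(binary);
}

// Hypothetical helper: reverses toBase64Utf8.
function fromBase64Utf8(base64) {
    const binary = atob(base64);
    const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
    return new TextDecoder('utf-8').decode(bytes);
}

console.log(toBase64Utf8('你好'));       // '5L2g5aW9'
console.log(fromBase64Utf8('5L2g5aW9')); // '你好'
```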

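# 11.7.3 String Normalization

In Unicode, the same character can have more than one representation; normalize() converts strings to a single form:

// NFC (the default): composed form
const nfc = '\u00e9';        // 'é' (precomposed character)
const nfd = '\u0065\u0301';  // 'é' (e + combining acute accent)

nfc === nfd;                  // false
nfc.normalize('NFC') === nfd.normalize('NFC'); // true

// Practical case: comparing user input
function normalizedEquals(a, b) {
    return a.normalize('NFC') === b.normalize('NFC');
}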
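Normalization also enables accent-insensitive matching: decompose with NFD, then remove the combining marks. The sketch below is lossy by design and is not linguistically appropriate for every language; `stripDiacritics` is a hypothetical helper name.

```javascript
// Hypothetical helper: removes combining marks after canonical decomposition.
function stripDiacritics(text) {
    // NFD splits 'é' into 'e' + U+0301; \p{Mn} matches nonspacing marks.
    return text.normalize('NFD').replace(/\p{Mn}/gu, '');
}

console.log(stripDiacritics('caf\u00e9'));                 // 'cafe'
console.log(stripDiacritics('Cr\u00e8me Br\u00fbl\u00e9e')); // 'Creme Brulee'

// The two spellings of 'é' differ in length until normalized.
console.log('\u00e9'.length, '\u0065\u0301'.length); // 1 2
```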

# 11.8 Practical String Techniques

# 11.8.1 Reversing a String

// Basic approach
function reverse(str) {
    return [...str].reverse().join('');
}
reverse('hello'); // 'olleh'
reverse('A😀B');  // 'B😀A' (surrogate pairs stay intact)

// Note: avoid str.split('').reverse().join('')
// because split('') splits by UTF-16 code unit and breaks surrogate pairs
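Spreading keeps surrogate pairs together, but it still separates combining marks from their base characters. Where `Intl.Segmenter` is available, reversing by grapheme cluster avoids that. The following is a sketch under that assumption; `reverseGraphemes` is a hypothetical helper name.

```javascript
// Hypothetical helper: reverses a string by user-perceived characters.
// Assumes the runtime provides Intl.Segmenter.
function reverseGraphemes(str) {
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    const clusters = Array.from(segmenter.segment(str), (part) => part.segment);
    return clusters.reverse().join('');
}

// 'é' written as e + combining acute accent, followed by 'x'.
const accented = 'e\u0301x';

// Code-point reversal moves the accent onto the 'x'.
console.log([...accented].reverse().join('') === 'x\u0301e'); // true

// Grapheme reversal keeps the accent attached to the 'e'.
console.log(reverseGraphemes(accented) === 'xe\u0301'); // true
```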

# 11.8.2 camelCase Conversion

// kebab-case → camelCase
function toCamelCase(str) {
    return str.replace(/-([a-z])/g, (_, c) => c.toUpperCase());
}
toCamelCase('background-color'); // 'backgroundColor'

// camelCase → kebab-case
function toKebabCase(str) {
    return str.replace(/[A-Z]/g, c => '-' + c.toLowerCase());
}
toKebabCase('backgroundColor'); // 'background-color'

// snake_case → camelCase
function snakeToCamel(str) {
    return str.replace(/_([a-z])/g, (_, c) => c.toUpperCase());
}
snakeToCamel('user_first_name'); // 'userFirstName'

# 11.8.3 Template Engine

A simple template engine that brings several string-handling techniques together:

function template(tmpl, data) {
    return tmpl.replace(/\{\{(\w+(?:\.\w+)*)\}\}/g, (match, path) => {
        const value = path.split('.').reduce((obj, key) => 
            obj != null ? obj[key] : undefined, data);
        return value !== undefined ? String(value) : match;
    });
}

const tmpl = 'Hello, {{user.name}}! You have {{count}} messages.';
template(tmpl, { user: { name: 'Alice' }, count: 5 });
// 'Hello, Alice! You have 5 messages.'

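# 11.8.4 High-Performance String Concatenation

// A few pieces: a template literal is the clearest
const a = 1;
const b = 2;
const result = `${a} + ${b} = ${a + b}`; // '1 + 2 = 3'

// Many pieces: collecting parts in an array and joining them can perform better
const parts = [];
for (let i = 0; i < 10000; i++) {
    parts.push(`item-${i}`);
}
const result2 = parts.join(',');

// Background: in V8, the + operator may build a ConsString tree
// A very deep tree makes later access slower because it must be flattened first
// join produces the result in one step; the faster option depends on the engine and workload, so measure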
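Performance claims like the one above are easiest to settle by measuring in the target environment. The sketch below is a rough micro-benchmark, not a rigorous one: `buildWithPlus`, `buildWithJoin`, and `timeIt` are hypothetical helpers, the iteration count is arbitrary, and the timings vary between engines and runs.

```javascript
// Hypothetical helper: builds the string with repeated +=.
function buildWithPlus(n) {
    let out = '';
    for (let i = 0; i < n; i++) out += `item-${i},`;
    return out;
}

// Hypothetical helper: builds the same string with an array and join.
function buildWithJoin(n) {
    const pieces = [];
    for (let i = 0; i < n; i++) pieces.push(`item-${i},`);
    return pieces.join('');
}

// Hypothetical helper: runs fn once and logs the elapsed time.
function timeIt(label, fn) {
    const start = performance.now();
    const value = fn();
    console.log(`${label}: ${(performance.now() - start).toFixed(2)} ms`);
    return value;
}

const viaPlus = timeIt('+=', () => buildWithPlus(100000));
const viaJoin = timeIt('join', () => buildWithJoin(100000));
console.log(viaPlus === viaJoin); // true (both strategies build the same string)
```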

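# 11.9 Method Quick Reference

| Method | Description | Returns |
| --- | --- | --- |
| `charAt(i)` | Character at the given index | string |
| `charCodeAt(i)` | UTF-16 code unit value at the index | number |
| `codePointAt(i)` | Full code point starting at the index | number |
| `at(i)` | Character at the index; negative indexes allowed | string |
| `concat(...strs)` | Concatenate strings | string |
| `includes(s, pos)` | Whether the substring is present | boolean |
| `startsWith(s, pos)` | Whether the string starts with the substring | boolean |
| `endsWith(s, len)` | Whether the string ends with the substring | boolean |
| `indexOf(s, pos)` | Index of the first occurrence | number |
| `lastIndexOf(s, pos)` | Index of the last occurrence, searching backward | number |
| `search(re)` | Regex search | number |
| `match(re)` | Regex match | array/null |
| `matchAll(re)` | Iterator over all matches of a global regex | iterator |
| `replace(s/re, new)` | Replace the first occurrence or the regex matches | string |
| `replaceAll(s, new)` | Replace all occurrences | string |
| `slice(start, end)` | Extract a substring | string |
| `substring(start, end)` | Extract a substring | string |
| `split(sep, limit)` | Split into an array | array |
| `trim()` | Remove whitespace from both ends | string |
| `trimStart()` | Remove leading whitespace | string |
| `trimEnd()` | Remove trailing whitespace | string |
| `padStart(len, s)` | Pad at the start | string |
| `padEnd(len, s)` | Pad at the end | string |
| `repeat(n)` | Repeat n times | string |
| `toUpperCase()` | Convert to uppercase | string |
| `toLowerCase()` | Convert to lowercase | string |
| `normalize(form)` | Unicode normalization | string |
| `localeCompare(s)` | Locale-aware comparison | number |
| `isWellFormed()` | Whether the string is well formed, with no lone surrogates (ES2024) | boolean |
| `toWellFormed()` | Well-formed copy of the string (ES2024) | string |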

Last updated: 2026/06/28, 17:55:19
