Text Data Processing with TensorFlow

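TensorFlow provides a solid set of tools for processing text data, widely used in natural language processing (NLP) tasks such as text classification, sentiment analysis, and machine translation. With the tf.data API, tf.keras.preprocessing, and related modules, you can load, preprocess, tokenize, and vectorize text efficiently and assemble it into an input pipeline. This tutorial covers the core methods for text processing in TensorFlow, the most common operations, and a practical example. It is aimed at beginners and at readers who need a quick reference.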

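1. Core Tools

tf.data: builds efficient input pipelines for text data.
tf.keras.preprocessing.text.Tokenizer: a tokenization and vectorization utility that converts text into integer sequences. It is a pure-Python utility and is marked deprecated in recent TensorFlow releases in favor of TextVectorization.
tf.keras.layers.TextVectorization: a built-in Keras layer that handles text preprocessing and vectorization in one step.
tf.strings: string operations such as splitting and encoding conversion.
TFRecord: a binary record format suited to storing large text datasets.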

2. Loading Text Data

TensorFlow can load text data from several kinds of sources.

2.1 Loading from memory (lists or NumPy arrays)

If the text is already held in a Python list or NumPy array:

import tensorflow as tf
import numpy as np

texts = np.array(["I love TensorFlow", "Keras is great", "NLP is fun"])
labels = np.array([1, 1, 0])  # example labels
dataset = tf.data.Dataset.from_tensor_slices((texts, labels))
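
To check what such a dataset actually yields, the following minimal sketch iterates it with as_numpy_iterator(). It assumes TensorFlow 2.x running eagerly; the variable name examples is only illustrative. Note that string elements come back as bytes and need decoding.

```python
import numpy as np
import tensorflow as tf

texts = np.array(["I love TensorFlow", "Keras is great", "NLP is fun"])
labels = np.array([1, 1, 0])
dataset = tf.data.Dataset.from_tensor_slices((texts, labels))

# Each element is a (text, label) pair; the text arrives as a bytes object.
examples = [(text.decode("utf-8"), int(label))
            for text, label in dataset.as_numpy_iterator()]
print(examples)
```

Slicing along the first axis is what turns the two arrays of length 3 into three (text, label) pairs.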

2.2 Loading from a text file

Use tf.data.TextLineDataset to read a text file line by line:

dataset = tf.data.TextLineDataset('text.txt')  # each line becomes one string element

Example file (text.txt):

I love TensorFlow
Keras is great
NLP is fun
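
Real files often contain blank lines or stray whitespace. As a hedged sketch of one way to handle that, the snippet below writes a throwaway file (the temporary path is an assumption made so the example is self-contained) and filters it with tf.strings ops.

```python
import os
import tempfile
import tensorflow as tf

# Write a small file to a temporary, hypothetical location.
path = os.path.join(tempfile.mkdtemp(), "text.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("I love TensorFlow\n\n  Keras is great  \nNLP is fun\n")

lines = (tf.data.TextLineDataset(path)
         .map(tf.strings.strip)  # trim leading/trailing whitespace
         .filter(lambda line: tf.strings.length(line) > 0))  # drop blank lines

result = [line.decode("utf-8") for line in lines.as_numpy_iterator()]
print(result)  # ['I love TensorFlow', 'Keras is great', 'NLP is fun']
```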

2.3 Loading from a TFRecord file

For large text datasets, TFRecord is the recommended format:

def parse_tfrecord(example):
    feature_description = {
        'text': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64)
    }
    example = tf.io.parse_single_example(example, feature_description)
    return example['text'], example['label']

dataset = tf.data.TFRecordDataset('text.tfrecord').map(parse_tfrecord)

3. Text Preprocessing

Text usually has to be tokenized and vectorized before a model can consume it.

3.1 Using Tokenizer

Tokenizer converts text into sequences of word indices:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Example texts
texts = ["I love TensorFlow", "Keras is great", "NLP is fun"]

# Initialize the Tokenizer
tokenizer = Tokenizer(num_words=1000, oov_token='<OOV>')  # num_words caps the vocabulary used for encoding
tokenizer.fit_on_texts(texts)  # build the vocabulary

# Convert to sequences. Index 1 is '<OOV>'; "is" occurs twice, so it gets the lowest word index, 2.
sequences = tokenizer.texts_to_sequences(texts)
print(sequences)  # Output: [[3, 4, 5], [6, 2, 7], [8, 2, 9]]

# Pad the sequences to a uniform length
padded_sequences = pad_sequences(sequences, maxlen=5, padding='post', truncating='post')
print(padded_sequences)  # Output rows: [3 4 5 0 0], [6 2 7 0 0], [8 2 9 0 0]

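Notes:

num_words: caps the vocabulary used when encoding; only the most frequent words (those whose index is below num_words) are kept.
oov_token: the placeholder for out-of-vocabulary words; it is assigned index 1.
pad_sequences: makes all sequences the same length by padding with 0 or truncating.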
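
To make the indexing rules concrete, here is a simplified pure-Python sketch of the same idea: rank words by frequency, reserve 0 for padding and 1 for the OOV token, then pad at the end. It is an illustration, not the Keras implementation; the helper names build_word_index and texts_to_padded are hypothetical, and punctuation filtering and num_words are left out.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; 0 is left for padding, 1 goes to the OOV token."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() keeps first-seen order among words with equal counts.
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}

def texts_to_padded(texts, word_index, maxlen, oov_token="<OOV>"):
    oov = word_index[oov_token]
    rows = []
    for text in texts:
        ids = [word_index.get(w, oov) for w in text.lower().split()]
        ids = ids[:maxlen]                            # like truncating='post'
        rows.append(ids + [0] * (maxlen - len(ids)))  # like padding='post'
    return rows

texts = ["I love TensorFlow", "Keras is great", "NLP is fun"]
word_index = build_word_index(texts)
padded = texts_to_padded(texts + ["NLP is hard"], word_index, maxlen=5)
print(word_index["is"])  # 2
print(padded[-1])        # [8, 2, 1, 0, 0]  ("hard" was never seen, so it maps to OOV)
```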

3.2 Using the TextVectorization layer

TextVectorization is a built-in layer that combines tokenization and vectorization:

from tensorflow.keras.layers import TextVectorization

# Define the TextVectorization layer
vectorize_layer = TextVectorization(
    max_tokens=1000,  # maximum vocabulary size
    output_mode='int',  # output integer sequences
    output_sequence_length=5  # fixed output length
)

# Learn the vocabulary from the data
vectorize_layer.adapt(texts)

# Vectorize
vectorized_texts = vectorize_layer(texts)
print(vectorized_texts)  # Output: an integer tensor of shape (3, 5); 0 is padding, 1 is '[UNK]', and "is" (the most frequent token) maps to 2

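Advantage: the layer can be embedded directly in a Keras model, so the model accepts raw strings and the preprocessing travels with it.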
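
A quick way to see what the layer has learned is to inspect its vocabulary. The sketch below assumes a TensorFlow 2.x install where TextVectorization is available under tf.keras.layers; it adapts on a small batched dataset and then encodes a sentence containing an unseen word.

```python
import tensorflow as tf

texts = ["I love TensorFlow", "Keras is great", "NLP is fun"]

layer = tf.keras.layers.TextVectorization(
    max_tokens=1000, output_mode="int", output_sequence_length=5)
layer.adapt(tf.data.Dataset.from_tensor_slices(texts).batch(2))

vocab = layer.get_vocabulary()
print(vocab[:3])  # ['', '[UNK]', 'is']: padding, unknown token, most frequent word

# "new" was not seen during adapt(), so it is encoded as 1 ('[UNK]').
ids = layer(tf.constant(["Keras is new"])).numpy()
print(ids.shape)  # (1, 5)
```

The order of the remaining vocabulary entries depends on how the layer breaks frequency ties, so it is safer to look indices up in get_vocabulary() than to assume them.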

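4. Data Augmentation

Text augmentation is usually done through synonym replacement, random word deletion, and similar techniques, but TensorFlow itself does not ship built-in text augmentation tools. You can use a third-party library (such as nlpaug) or write a custom function: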

def augment_text(text, label):
    text = tf.strings.regex_replace(text, 'love', 'like')  # simple substitution
    return text, label

dataset = dataset.map(augment_text)
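
Random word deletion, mentioned above, can also be written with TensorFlow ops only, so it runs inside dataset.map. The following is a minimal sketch under that assumption; the function name random_word_dropout and the drop_prob parameter are hypothetical, not part of any TensorFlow API.

```python
import tensorflow as tf

def random_word_dropout(text, label, drop_prob=0.2):
    """Drop each whitespace-separated word with probability drop_prob."""
    words = tf.strings.split(text)  # 1-D tensor of tokens for a scalar input
    keep = tf.random.uniform(tf.shape(words)) >= drop_prob
    kept = tf.boolean_mask(words, keep)
    # If every word was dropped, fall back to the original sentence.
    kept = tf.cond(tf.size(kept) > 0, lambda: kept, lambda: words)
    return tf.strings.reduce_join(kept, separator=" "), label

sources = ["I love TensorFlow", "Keras is great"]
dataset = tf.data.Dataset.from_tensor_slices((sources, [1, 1]))
augmented = dataset.map(random_word_dropout)

outputs = [text.numpy().decode("utf-8") for text, _ in augmented]
print(outputs)  # varies from run to run
```

Because the mask is sampled each time the dataset is iterated, every epoch sees a different variant of each sentence.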

5. Building an Efficient Data Pipeline

Use tf.data to build the text pipeline:

dataset = (dataset
           .map(lambda text, label: (vectorize_layer(text), label), num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(buffer_size=1000)
           .batch(batch_size=32)
           .prefetch(tf.data.AUTOTUNE))

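Key operations:

map: applies vectorization or other preprocessing.
shuffle: randomizes element order; buffer_size sets how many elements are sampled from at a time, so a buffer smaller than the dataset shuffles only locally.
batch: groups elements into batches for training.
prefetch: prepares upcoming batches while the current one is being consumed, reducing I/O stalls.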
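
The effect of these operations is easiest to see on a tiny dataset. This sketch, with made-up sentences, chains the four calls and records the batch sizes; it assumes TensorFlow 2.4 or later for tf.data.AUTOTUNE.

```python
import tensorflow as tf

items = [f"sentence {i}" for i in range(10)]
dataset = tf.data.Dataset.from_tensor_slices(items)

pipeline = (dataset
            .map(tf.strings.upper, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(buffer_size=len(items))  # buffer covers the whole dataset: full shuffle
            .batch(4)                         # the last batch holds the 2 leftover elements
            .prefetch(tf.data.AUTOTUNE))

batches = [batch.numpy() for batch in pipeline]
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [4, 4, 2]
```

Passing drop_remainder=True to batch would discard the final short batch, which some models with fixed batch shapes require.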

6. Complete Example: A Text Classification Pipeline

The following example performs sentiment analysis on the IMDB dataset:

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# 1. Load the IMDB data
vocab_size = 10000  # vocabulary size
max_len = 200  # maximum sequence length
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

# 2. Pad the sequences
x_train = pad_sequences(x_train, maxlen=max_len, padding='post')
x_test = pad_sequences(x_test, maxlen=max_len, padding='post')

# 3. Create the data pipelines
train_dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                 .shuffle(1000)
                 .batch(32)
                 .prefetch(tf.data.AUTOTUNE))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(32)

# 4. Build the Keras model
model = models.Sequential([
    # Embedding layer; mask_zero=True lets the LSTM skip the trailing 0 padding.
    # input_length is omitted because recent Keras releases deprecate it.
    layers.Embedding(vocab_size, 128, mask_zero=True),
    layers.LSTM(64),  # LSTM layer
    layers.Dense(1, activation='sigmoid')  # output layer: sentiment (0 or 1)
])

# 5. Compile and train
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_dataset, epochs=5, validation_data=test_dataset)

# 6. Evaluate
test_loss, test_acc = model.evaluate(test_dataset)
print(f'Test accuracy: {test_acc:.4f}')

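Output:

After 5 epochs, test accuracy is typically in the 85%-90% range, though the exact figure varies from run to run.
The pipeline feeds the text sequences efficiently and works with an LSTM or other sequence models.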

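Notes:

IMDB dataset: 25,000 training and 25,000 test movie reviews, labeled 0 (negative) or 1 (positive).
Preprocessing: the IMDB data is already tokenized into integer sequences, so pad_sequences is only needed to unify the length.
Model: the Embedding layer turns word indices into dense vectors, and the LSTM captures sequential features.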
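
Because the reviews arrive as integers, it helps to know how to read them back. With the default arguments of imdb.load_data, id 0 is padding, 1 marks the start of a review, 2 stands for out-of-vocabulary words, and real word ranks are shifted by 3. The sketch below decodes a sequence under those defaults; toy_word_index is a made-up stand-in for the much larger mapping returned by imdb.get_word_index(), so no download is needed.

```python
# Reserved ids under the imdb.load_data defaults.
RESERVED = {0: "<PAD>", 1: "<START>", 2: "<OOV>"}
INDEX_FROM = 3  # word ranks are offset by this amount

def decode_review(ids, word_index):
    """Map integer ids back to words, given a word -> rank mapping."""
    rank_to_word = {rank: word for word, rank in word_index.items()}
    words = []
    for i in ids:
        if i in RESERVED:
            words.append(RESERVED[i])
        else:
            words.append(rank_to_word.get(i - INDEX_FROM, "<OOV>"))
    return " ".join(words)

# Hypothetical miniature word index for illustration only.
toy_word_index = {"the": 1, "movie": 2, "great": 3}
decoded = decode_review([1, 4, 5, 2, 6, 0, 0], toy_word_index)
print(decoded)  # <START> the movie <OOV> great <PAD> <PAD>
```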

7. Advanced Usage

Handling large-scale text: store the text as TFRecord:

  def create_tfrecord(text, label):
      feature = {
          'text': tf.train.Feature(bytes_list=tf.train.BytesList(value=[text.encode()])),
          'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
      }
      return tf.train.Example(features=tf.train.Features(feature=feature))
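
Building an Example is only half of the job; it still has to be serialized to a file and parsed back. The round trip can be sketched as follows, assuming a temporary file path and the same text/label feature names used in this tutorial (the helper names make_example and parse are illustrative).

```python
import os
import tempfile
import tensorflow as tf

def make_example(text, label):
    feature = {
        "text": tf.train.Feature(bytes_list=tf.train.BytesList(value=[text.encode("utf-8")])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Write two records to a temporary, hypothetical location.
path = os.path.join(tempfile.mkdtemp(), "text.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for text, label in [("I love TensorFlow", 1), ("NLP is fun", 0)]:
        writer.write(make_example(text, label).SerializeToString())

feature_description = {
    "text": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    parsed = tf.io.parse_single_example(record, feature_description)
    return parsed["text"], parsed["label"]

# Read the file back and decode each record.
records = [(t.decode("utf-8"), int(l))
           for t, l in tf.data.TFRecordDataset(path).map(parse).as_numpy_iterator()]
print(records)  # [('I love TensorFlow', 1), ('NLP is fun', 0)]
```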

Custom tokenization:

  def custom_tokenize(text, label):
      tokens = tf.strings.split(text)
      return tokens, label
  dataset = dataset.map(custom_tokenize)
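
One consequence of splitting with tf.strings.split is that each element now has a different length, so a plain batch() call fails on sentences of unequal size. A possible approach, sketched here with made-up sentences and a simple lowercase-and-strip-punctuation rule, is padded_batch, which pads string tokens with empty strings up to the longest row in each batch.

```python
import tensorflow as tf

texts = ["I love TensorFlow, very much!", "Keras is great", "NLP is fun"]
labels = [1, 1, 0]

def tokenize(text, label):
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, r"[^a-z0-9 ]", "")  # strip punctuation
    return tf.strings.split(text), label

batches = (tf.data.Dataset.from_tensor_slices((texts, labels))
           .map(tokenize)
           .padded_batch(2))  # pad each batch to its longest token list

token_batches = [tokens.numpy() for tokens, _ in batches]
shapes = [b.shape for b in token_batches]
print(shapes)  # [(2, 5), (1, 3)]
```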

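Pretrained embeddings: use a pretrained text embedding from TensorFlow Hub (the example uses NNLM; BERT models are also published there):

  import tensorflow as tf
  import tensorflow_hub as hub
  from tensorflow.keras import layers, models

  # The layer takes raw strings, so declare a scalar string input.
  embed = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2", input_shape=[], dtype=tf.string)
  model = models.Sequential([embed, layers.Dense(1, activation='sigmoid')])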

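8. Performance Optimization

Cache small datasets:

  dataset = dataset.cache()

Parallel processing: pass num_parallel_calls=tf.data.AUTOTUNE to map.
Vocabulary size: limit vocab_size to reduce memory use.
Graph compatibility: avoid arbitrary Python functions (for example via tf.py_function) inside the pipeline; tf.strings ops and TextVectorization run inside the TensorFlow graph and can be parallelized.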
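
Where cache() sits in the chain matters. A commonly used ordering, sketched below with a placeholder preprocessing function (expensive_preprocess is hypothetical), is to cache after the costly map and before shuffle, so the preprocessing runs once while the element order still changes every epoch.

```python
import tensorflow as tf

def expensive_preprocess(text):
    # Stand-in for costly work, written with graph-friendly tf.strings ops.
    return tf.strings.lower(tf.strings.strip(text))

raw = ["  I love TensorFlow ", " Keras is great  "]
dataset = (tf.data.Dataset.from_tensor_slices(raw)
           .map(expensive_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .cache()                 # keep preprocessed elements after the first pass
           .shuffle(buffer_size=2)  # shuffle after cache so the order is redrawn each epoch
           .batch(2)
           .prefetch(tf.data.AUTOTUNE))

# Two passes over the data: the second is served from the cache.
epochs = [sorted(s.decode("utf-8") for s in batch.numpy())
          for _ in range(2) for batch in dataset]
print(epochs)
```

Placing cache() after shuffle instead would freeze the first epoch's order for all later epochs.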

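9. Summary

Text processing in TensorFlow combines tf.data, Tokenizer, and TextVectorization, and covers everything from small in-memory data to large file-based streams. A data pipeline lets you load, preprocess, and vectorize text efficiently and feed it to NLP models, where layers such as Embedding and LSTM handle the modeling.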
