Issue | JNWPU Volume 43, Number 2, April 2025
---|---
Page(s) | 388 - 397
DOI | https://doi.org/10.1051/jnwpu/20254320388
Published online | 04 June 2025
A framework of variable-length sequence data preprocessing based on semantic perception
基于语义感知的变长序列数据预处理框架
1 School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China
2 School of Software, Northwestern Polytechnical University, Xi'an 710072, China
3 School of Cybersecurity, Northwestern Polytechnical University, Xi'an 710072, China
Received: 19 April 2024
Deep learning frameworks generally pad or truncate variable-length sequences so that they can be trained efficiently in batches. However, padding increases memory consumption, and truncation inevitably discards part of the original semantic information. To address this dilemma, a variable-length sequence preprocessing framework based on semantic perception is proposed. It leverages a typical unsupervised learning method to reduce representations of differing dimensionality to the same size while minimizing information loss. Under the theoretical umbrella of minimizing information loss, information entropy is adopted to measure semantic richness, weights are assigned to the variable-length representations, and the representations are fused according to their semantic richness. Extensive experiments show that the proposed strategy loses less information than truncated embeddings, and that the method is clearly superior in the amount of information it captures while achieving promising performance on several text classification datasets.
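The idea of weighting representations by information entropy and fusing them into a fixed size can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the algorithm, and the unsupervised dimensionality-reduction step is not modelled here. The function names `shannon_entropy` and `entropy_weighted_fuse`, the windowing scheme, and the use of token-frequency entropy as the "semantic richness" score are all assumptions made for illustration.

```python
import math
from collections import Counter

import numpy as np


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_weighted_fuse(token_ids, embeddings, target_len):
    """Fuse a (seq_len, dim) array into a (target_len, dim) array.

    Hypothetical scheme (an assumption, not the paper's method): cut the
    sequence into windows of target_len rows, zero-pad the last window,
    weight each window by the entropy of its tokens, and return the
    weighted average of the windows.
    """
    seq_len, dim = embeddings.shape
    n_windows = max(1, math.ceil(seq_len / target_len))
    padded = np.zeros((n_windows * target_len, dim), dtype=embeddings.dtype)
    padded[:seq_len] = embeddings
    windows = padded.reshape(n_windows, target_len, dim)

    # One entropy score per window, used as its fusion weight.
    weights = np.array(
        [
            shannon_entropy(token_ids[i * target_len:(i + 1) * target_len])
            for i in range(n_windows)
        ],
        dtype=float,
    )
    if weights.sum() == 0.0:
        # All windows are degenerate: fall back to a plain average.
        weights = np.ones(n_windows)
    weights = weights / weights.sum()
    return np.tensordot(weights, windows, axes=1)
```

Under plain truncation only the first `target_len` rows would survive. In this sketch every window can contribute, in proportion to how varied its tokens are, and the output still has a fixed shape suitable for batching.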
摘要
深度学习框架处理变长序列时, 通常采用填充(padding)或截断(truncation)的方式, 以方便模型批量训练与处理。然而, 填充会加剧内存占用, 而截断则会使序列丧失原本的语义信息。因此, 提出了一种基于语义感知的变长序列预处理框架, 该框架利用典型的无监督学习方法, 压缩多维度数据并减小信息损失。同时, 基于最小化信息损失理论, 采用信息熵度量语义丰富度, 为变长表示分配权重, 并通过语义丰富度进行融合。此外, 实验表明该框架的信息损失相较传统的截断嵌入有所降低, 所提方法在信息获取方面具有显著优势, 在多个文本分类数据集上表现良好。
Key words: variable-length sequence / data preprocessing / padding / truncation / semantic information / maximizing information
关键字 : 变长序列 / 数据预处理 / 填充 / 截断 / 语义信息 / 最大化信息
© 2025 Journal of Northwestern Polytechnical University. All rights reserved.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.