[1] LUO Cong, ZHOU Cheng. Research on URL Classification Based on Improved n-gram Model[J]. Computer Technology and Development, 2018, 28(09): 38-41. [doi:10.3969/j.issn.1673-629X.2018.09.009]

Research on URL Classification Based on Improved n-gram Model

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
28
Issue:
2018, No. 09
Pages:
38-41
Column:
Intelligence, Algorithms, and Systems Engineering
Publication date:
2018-09-10

Article Info

Title:
Research on URL Classification Based on Improved n-gram Model
Article ID:
1673-629X(2018)09-0038-04
Author(s):
LUO Cong, ZHOU Cheng
Jiangnan Institute of Computing Technology, Wuxi 214083, Jiangsu, China
Keywords:
big data; web page classification; URL; n-gram model; URL classification
CLC number:
TP391
DOI:
10.3969/j.issn.1673-629X.2018.09.009
Document code:
A
Abstract:
In the era of big data, the amount of information on the Internet has grown explosively, and accurate web page classification helps users quickly locate the information they are interested in among massive numbers of web pages. Web page classification plays a crucial role in many applications and can be broadly divided into content-based and URL-based approaches. To address the shortcomings of content-based classification in some scenarios, this paper proposes classifying web pages using only their URLs. Drawing on the idea of the n-gram model and taking the character as the basic unit, features are extracted from each URL. Because the fields of a URL differ in their ability to discriminate between page classes, some fields are discarded while the important path field is given a higher weight, and the n-gram model is improved on this basis. Experimental results show that applying the improved n-gram model to URL classification improves both efficiency and accuracy: training time is reduced by 9.34%, and the F1 score of the classification results increases by 12.63%.
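
The feature extraction outlined in the abstract can be illustrated with a minimal sketch using only Python's standard library. The abstract does not give the paper's actual settings, so everything specific here is an assumption made for illustration: n = 3, a path weight of 2.0, the choice to drop the scheme, query string, and fragment while keeping the host, and the function names `char_ngrams` and `url_features`.

```python
from collections import Counter
from urllib.parse import urlsplit


def char_ngrams(text, n):
    """Return all contiguous character n-grams of text (empty if too short)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def url_features(url, n=3, path_weight=2.0):
    """Weighted character n-gram counts for one URL.

    Assumptions (not taken from the paper): the scheme, query, and
    fragment are dropped; host and path are kept; each path n-gram
    counts path_weight times, each host n-gram counts once.
    """
    parts = urlsplit(url.lower())
    features = Counter()
    for gram in char_ngrams(parts.netloc, n):
        features[gram] += 1.0
    for gram in char_ngrams(parts.path, n):
        features[gram] += path_weight
    return features


if __name__ == "__main__":
    # Print the weighted n-gram counts for a sample URL.
    for gram, weight in sorted(url_features("http://a.io/news?id=1").items()):
        print(repr(gram), weight)
```

The resulting dictionary of weighted counts could then be passed to any classifier that accepts sparse features. Which classifier the paper uses is not stated in the abstract.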

Last Update: 2018-09-10