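Python Standard Library: xml.etree.ElementTree
Note: this article is a set of study notes on xml.etree.ElementTree, kept for my own learning. Links to referenced material are included in the text.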
Contents
- Preface
- I. ElementTree and Element
- II. XML Parsing
- 1. We will use the following XML document as sample data
- 2. Loading the data
- 3. Reading the data
- Summary
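Preface
I. ElementTree and Element
XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. The xml.etree.ElementTree module provides two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in that tree. The module is usually imported as import xml.etree.ElementTree as ET (ET for short).
1. Interactions with the whole document (reading from and writing to files) are usually done at the ElementTree level.
2. Interactions with a single XML element and its sub-elements are done at the Element level.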
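The split between the two levels can be sketched as follows. This is only a minimal illustration: the inline XML string and the temporary output path are made up for the example and are not part of the corpus used later.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Element level: fromstring() parses a string and returns the root Element.
root = ET.fromstring('<doc id="d0"><s>hello</s></doc>')
root.set('checked', 'yes')  # modify a single element

# ElementTree level: wrap the root so the whole document can be written to a file.
tree = ET.ElementTree(root)
path = os.path.join(tempfile.mkdtemp(), 'demo.xml')  # hypothetical output path
tree.write(path, encoding='utf-8', xml_declaration=True)

# Reading the file back also goes through ElementTree.
reloaded = ET.parse(path).getroot()
print(reloaded.tag, reloaded.attrib)
```

Editing happens on Element objects, while ET.parse() and ElementTree.write() handle the file as a whole.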
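Element is a flexible container object designed to store hierarchical data structures in memory. It can be described as a cross between a list and a dictionary. Each Element has a number of attributes associated with it, as listed below:
| Attribute | Type | Meaning | Access |
|---|---|---|---|
| tag | str | Name of the element | Element.tag |
| attrib | dict | The element's XML attributes | Element.attrib |
| text | str | Text between the element's start tag and its first child element (or its end tag if it has no children); None if there is no text. | Element.text |
| tail | str | Text after the element's end tag and before the next tag; None if there is no text. | Element.tail |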
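To make the four attributes concrete, here is a small sketch that parses one sentence of the corpus used later in this post from an inline string (an assumption made only to keep the example self-contained) and prints each attribute.

```python
import xml.etree.ElementTree as ET

# One sentence of the sample corpus, inlined for a self-contained example.
sent = ET.fromstring(
    '<sentence id="ES0.0">China issues stern '
    '<event id="EE0.0" sentence_level_value="CT+">rebuke</event>'
    ' over flight -EOP- .</sentence>'
)
event = sent[0]  # list-like: index into the child elements

print(sent.tag)          # sentence
print(sent.attrib)       # {'id': 'ES0.0'}
print(repr(sent.text))   # 'China issues stern '
print(repr(event.text))  # 'rebuke'
print(repr(event.tail))  # ' over flight -EOP- .'
print(event.get('id'))   # dict-like: attribute lookup, EE0.0
print(repr(sent.tail))   # None, nothing follows the root element
```

Note that the text after the event belongs to event.tail, not to the sentence itself.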
II. XML Parsing
1. We will use the following XML document as sample data:
<Annotation created="16/05/2018" creator="XMLconverter">
  <DocumentSet>
    <document id="ED0" document_level_value="CT+">
      <sentence id="ES0.0">China issues stern <event id="EE0.0" sentence_level_value="CT+">rebuke</event> over flight -EOP- .</sentence>
      <sentence id="ES0.3">Surveillance aircraft intercepted by 2 J_10 fighter jets over East China Sea -EOP- .</sentence>
      <sentence id="ES0.4">China urged the United States to immediately <event id="EE0.1" sentence_level_value="CT+">stop</event> its `` unsafe , unprofessional and unfriendly dangerous military activity '' after a US Navy surveillance plane flew in airspace over the East China Sea on Sunday .</sentence>
    </document>
    <document id="ED1" document_level_value="CT+">
      <sentence id="ES1.0">Philippine President Duterte <event id="EE1.0" sentence_level_value="CT+">vows</event> for closer relations with China -EOP- .</sentence>
      <sentence id="ES1.6">Visiting Chinese Foreign Minister Wang Yi LRB Front L RRB meets with Philippine President Rodrigo Duterte LRB Front R RRB in Manila , the Philippines , on July 25 , 2017 .</sentence>
      <sentence id="ES1.8">MANILA _ Philippine President Rodrigo Duterte <event id="EE1.1" sentence_level_value="CT+">pledged</event> on Tuesday that his country is <event id="EE1.2" sentence_level_value="CT+">pledged1</event> to build stronger bilateral relations with China .</sentence>
    </document>
  </DocumentSet>
</Annotation>
The above is a sample from our lab's corpus.
2. Loading the data
The code is as follows (example):
import xml.etree.ElementTree as ET
tree = ET.parse('F:/code/final/en_fin/data/english.xml')
root = tree.getroot()
As an Element, root has a tag and a dictionary of attributes:
>>> root.tag
'Annotation'
>>> root.attrib
{'created': '16/05/2018', 'creator': 'XMLconverter'}
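Loading fails with ET.ParseError when the file is not well-formed, for example when closing tags are missing. The sketch below shows one way to guard against that. The helper name load_root is hypothetical, and in-memory byte streams stand in for real files so the example runs anywhere.

```python
import io
import xml.etree.ElementTree as ET

def load_root(source):
    """Parse a path or file object; return the root Element, or None on bad XML."""
    try:
        return ET.parse(source).getroot()
    except ET.ParseError as err:
        print("malformed XML:", err)
        return None

# Stand-ins for files on disk (assumption: real code would pass a file path).
good = io.BytesIO(b'<Annotation creator="XMLconverter"><DocumentSet/></Annotation>')
bad = io.BytesIO(b'<Annotation><DocumentSet>Annotation>')  # closing tags missing

root = load_root(good)
print(root.tag, root.attrib)
broken = load_root(bad)
print(broken)
```

ET.parse() accepts either a file name or a file object, so the same helper works with a path such as the one used above.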
3. Reading the data
Read all of the text inside a sentence element:
for doc in root[0]:
    for sent in doc:
        print("sent :", sent.text)  # only the text before the first child element
        s = ''
        for t in sent.itertext():
            s += t
        print(s)  # the full text of the sentence, including nested elements
        break
    break
Output:
sent : China issues stern
China issues stern rebuke over flight -EOP- .
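The difference between text and itertext() can be seen piece by piece in the following sketch, which again inlines one sentence of the corpus so that it runs on its own. Joining the pieces with str.join is an equivalent, more compact alternative to the accumulation loop.

```python
import xml.etree.ElementTree as ET

sent = ET.fromstring(
    '<sentence id="ES0.0">China issues stern '
    '<event id="EE0.0" sentence_level_value="CT+">rebuke</event>'
    ' over flight -EOP- .</sentence>'
)

# text alone stops at the first child element.
print(repr(sent.text))

# itertext() yields every text and tail piece in document order,
# including the text of nested elements.
pieces = list(sent.itertext())
full = ''.join(pieces)
print(pieces)
print(full)
```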
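Length: len() of an element is the number of its direct child elements, so only elements that have children have a non-zero length.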
len(root[0][0])
Output:
3
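As a small sketch of how len() behaves at different levels, the following uses a shortened, made-up document with two sentences (not the full corpus) to show that only direct children are counted.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical document: one sentence with an event, one without.
doc = ET.fromstring(
    '<document id="ED0">'
    '<sentence id="ES0.0">A <event id="EE0.0">b</event> c</sentence>'
    '<sentence id="ES0.3">no events here</sentence>'
    '</document>'
)

n_sentences = len(doc)                        # direct <sentence> children only
events_per_sentence = [len(s) for s in doc]   # direct <event> children of each
no_events = [s.get('id') for s in doc if len(s) == 0]

print(n_sentences)
print(events_per_sentence)
print(no_events)
```

The nested event is not counted in len(doc). A sentence with no events has length 0, which is how the complete example recognises sentences without events.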
A complete example
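import re

# label2idx is assumed to be a dict, defined elsewhere, that maps
# document_level_value labels such as 'CT+' to integer ids.
for doc in root[0]:
    id = doc.attrib['id']
    label = label2idx[doc.attrib['document_level_value']]
    sentence_list = []
    trigger_word_list = []
    flag = False
    for sent in doc:
        # skip sentences that consist only of an end-of-paragraph marker or a period
        if sent.text == '-EOP-.' or sent.text == '.':
            continue
        # concatenate all text pieces of the sentence, including event text
        s = ''
        for t in sent.itertext():
            s += t
        s = s.replace('-EOP-.', '。').lower()
        print("sent.itertext: ", s)
        # skip timestamp-only lines, and the event-free sentence right after one
        if re.match(r'\d{4}\D\d{2}\D\d{2}\D\d{2}:\d{2}\D$', s) is not None:
            flag = True
            continue
        elif flag:
            flag = False
            if len(sent) == 0:
                continue
        # skip very short sentences
        if len(s) <= 4:
            continue
        # for sentences that contain events, print the text around each event
        if len(sent) > 0:
            tmp = sent.text.lower() if sent.text is not None else ''
            for event in sent:
                print("sent.text: ", tmp)
                print("event.text: ", event.text.lower())
                print("event.tail: ", event.tail.lower())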
Result:
sent.itertext: china issues stern rebuke over flight -eop- .
sent.text: china issues stern
event.text: rebuke
event.tail: over flight -eop- .
sent.itertext: surveillance aircraft intercepted by 2 j_10 fighter jets over east china sea -eop- .
sent.itertext: china urged the united states to immediately stop its `` unsafe , unprofessional and unfriendly dangerous military activity '' after a us navy surveillance plane flew in airspace over the east china sea on sunday .
sent.text: china urged the united states to immediately
event.text: stop
event.tail: its `` unsafe , unprofessional and unfriendly dangerous military activity '' after a us navy surveillance plane flew in airspace over the east china sea on sunday .
sent.itertext: philippine president duterte vows for closer relations with china -eop- .
sent.text: philippine president duterte
event.text: vows
event.tail: for closer relations with china -eop- .
sent.itertext: visiting chinese foreign minister wang yi lrb front l rrb meets with philippine president rodrigo duterte lrb front r rrb in manila , the philippines , on july 25 , 2017 .
sent.itertext: manila _ philippine president rodrigo duterte pledged on tuesday that his country is pledged1 to build stronger bilateral relations with china .
sent.text: manila _ philippine president rodrigo duterte
event.text: pledged
event.tail: on tuesday that his country is
sent.text: manila _ philippine president rodrigo duterte
event.text: pledged1
event.tail: to build stronger bilateral relations with china .
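The text / event.text / event.tail pattern in the result above can be wrapped into a small helper that splits a sentence into ordered segments. This is only a sketch: the function name split_sentence and the (text, event_id) output format are assumptions made for illustration. Unlike the loop above, it tolerates a missing text or tail (None) instead of calling .lower() on it.

```python
import xml.etree.ElementTree as ET

def split_sentence(sent):
    """Return (text, event_id) pairs in reading order; event_id is None for plain text."""
    segments = []
    if sent.text:
        segments.append((sent.text.strip(), None))
    for event in sent:
        segments.append(((event.text or '').strip(), event.get('id')))
        if event.tail:  # tail is None when nothing follows the event
            segments.append((event.tail.strip(), None))
    return segments

sent = ET.fromstring(
    '<sentence id="ES1.8">MANILA _ Philippine President Rodrigo Duterte '
    '<event id="EE1.1" sentence_level_value="CT+">pledged</event>'
    ' on Tuesday that his country is '
    '<event id="EE1.2" sentence_level_value="CT+">pledged1</event>'
    ' to build stronger bilateral relations with China .</sentence>'
)

segments = split_sentence(sent)
for text, event_id in segments:
    print(event_id, '|', text)
```

Each event's tail appears exactly once, directly after that event, so joining the segments with spaces reproduces the original sentence.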
Summary
Code is easiest to understand when it is read alongside a concrete example.
