* [room] Perl & parsers & nested identical tags
@ 2010-01-23 23:28 Di
From: Di @ 2010-01-23 23:28 UTC
To: Культурный офтопик (Cultural Off-topic)
Good day!
I am trying to extract the text from a page of this kind: <div class=view_text><div
class=no_need>NOT NEEDED</div>NEED THIS</div>. The result should be
NOT NEEDEDNEED THIS. With every approach below I get only NOT NEEDED.
What am I doing wrong?
I searched for "perl parse nested tags" but found nothing useful...
Script 1:
---------------------------------------------------------------------
#!/usr/bin/perl -w
use HTML::Parser ();

sub start_handler
{
    my $tagname = shift; my $self = shift; my $att = shift;
    if ($att->{'class'}){
        if ($tagname eq "div" && $att->{'class'} eq "view_text") {
            $self->handler(text => sub { print shift }, "dtext");
            $self->handler(end => \&stop_handler,"tagname,self,attr");
        }
    }
}

sub stop_handler
{
    my $tagname = shift;
    my $self = shift;
    my $att = shift;
    if ($tagname eq "div"){
        $self->eof;
    }
}

my $p = HTML::Parser->new();
$p->handler( start => \&start_handler, "tagname,self,attr");
$p->parse_file(shift || die) || die $!;
print "\n";
---------------------------------------------------------------------
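A note on Script 1: stop_handler calls eof on the first </div> it sees, and
that is the closing tag of the inner div, so parsing ends before NEED THIS is
reached. One possible way around this is to count how deeply the parser is
nested inside the target div and stop only when that count returns to zero.
The following is a minimal sketch under that assumption, not a definitive
fix; the helper name extract_view_text and the inline sample string are
illustrative and do not come from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns the text inside the first
# <div class="view_text">, including text of any <div> nested in it.
sub extract_view_text {
    my ($html) = @_;
    my $depth = 0;    # 0 = outside the target div, >0 = nesting level inside it
    my $text  = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'div';
            if ($depth > 0) {
                $depth++;                 # a div nested inside the target
            }
            elsif (($attr->{class} || '') eq 'view_text') {
                $depth = 1;               # entered the target div
            }
        }, 'tagname,attr' ],
        end_h => [ sub {
            my ($tagname, $self) = @_;
            return unless $tagname eq 'div' && $depth > 0;
            # Abort parsing only when the target div itself is closed.
            $self->eof if --$depth == 0;
        }, 'tagname,self' ],
        text_h => [ sub { $text .= shift if $depth > 0 }, 'dtext' ],
    );

    $p->parse($html);
    $p->eof;
    return $text;
}

# Sample input taken from the message above.
my $sample = '<div class=view_text><div class=no_need>NOT NEEDED</div>NEED THIS</div>';
print extract_view_text($sample), "\n";
```

With the sample from the message this should print NOT NEEDEDNEED THIS. To
read from a file instead, the parse/eof pair could be replaced with
parse_file, as in the original script.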
Script 2:
---------------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

my $p = HTML::TokeParser -> new(shift || die) || die $!;
while (my $token = $p -> get_tag('div')) {
    my $class = $token -> [1]{class} || '';
    if ($class eq 'view_text') {
        my $text = $p -> get_trimmed_text("/div");
        print $text;
    }
}
print "\n";
---------------------------------------------------------------------
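A note on Script 2: get_trimmed_text("/div") collects text only up to the
first following </div>, which here is the closing tag of the inner div. A
possible alternative is to walk the tokens one by one with get_token and
keep a nesting counter. The sketch below assumes that approach; the helper
name extract_view_text is illustrative, and text tokens are appended as-is,
so HTML entities are not decoded in this version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns the text inside the first
# <div class="view_text">, or undef if there is no such div.
sub extract_view_text {
    my ($html) = @_;
    # A reference to a scalar is treated as the literal document.
    my $p = HTML::TokeParser->new(\$html) or die "cannot create parser: $!";

    while (my $tag = $p->get_tag('div')) {
        next unless ($tag->[1]{class} || '') eq 'view_text';

        my $depth = 1;    # already inside the target div
        my $text  = '';
        while (my $token = $p->get_token) {
            my $type = $token->[0];
            if ($type eq 'S' && $token->[1] eq 'div') {
                $depth++;                     # nested div opened
            }
            elsif ($type eq 'E' && $token->[1] eq 'div') {
                last if --$depth == 0;        # target div closed
            }
            elsif ($type eq 'T') {
                $text .= $token->[1];         # raw text, entities not decoded
            }
        }
        return $text;
    }
    return;
}

# Sample input taken from the message above.
my $sample = '<div class=view_text><div class=no_need>NOT NEEDED</div>NEED THIS</div>';
my $found  = extract_view_text($sample);
print defined $found ? $found : '(not found)', "\n";
```

With the sample from the message this should print NOT NEEDEDNEED THIS.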
Script 3 (here ->dump seems to show roughly what is needed, but the text itself is not printed)
---------------------------------------------------------------------
#!/usr/bin/perl -w
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new; # empty tree
my $test = HTML::TreeBuilder->new; # empty tree
$tree->parse_file(shift || die);
$test = $tree;
$test = $test->find_by_attribute('class','view_text');
$test->dump;
$test->as_HTML;
$test->as_text;
$test->as_XML;
$tree->dump;
$tree->as_HTML;
---------------------------------------------------------------------
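A note on Script 3: dump prints the tree itself, but as_text, as_HTML and
as_XML only return strings, so nothing appears unless the return value is
printed. The sketch below assumes that printing the result of as_text on the
matched element is all that is missing; the helper name extract_view_text is
illustrative and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns the text of the first element whose
# class attribute is "view_text", or undef if none is found.
sub extract_view_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;

    # In scalar context find_by_attribute returns the first match.
    my $div  = $tree->find_by_attribute('class', 'view_text');
    my $text = defined $div ? $div->as_text : undef;

    $tree->delete;    # release the tree explicitly
    return $text;
}

# Sample input taken from the message above.
my $sample = '<div class=view_text><div class=no_need>NOT NEEDED</div>NEED THIS</div>';
my $found  = extract_view_text($sample);
print defined $found ? $found : '(not found)', "\n";
```

With the sample from the message this should print NOT NEEDEDNEED THIS,
since as_text concatenates the text of the element and all its descendants.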