Культурный офтопик
* [room] Perl & parsers & nested identical tags
@ 2010-01-23 23:28 Di
From: Di @ 2010-01-23 23:28 UTC (permalink / raw)
  To: культурный офтопик

Good day!

I am trying to extract the text from a page like this: <div class=view_text><div class=no_need>NOT NEEDED</div>NEED THIS</div>. The result should be NOT NEEDEDNEED THIS. With every script below I get only NOT NEEDED.

What am I doing wrong?

I searched for "perl parse nested tags" but found nothing useful...

Script 1:
---------------------------------------------------------------------
#!/usr/bin/perl -w
use HTML::Parser ();

sub start_handler
{
    my $tagname = shift;
    my $self = shift;
    my $att = shift;
    if ($att->{'class'}) {
        if ($tagname eq "div" && $att->{'class'} eq "view_text") {
            $self->handler(text => sub { print shift }, "dtext");
            $self->handler(end  => \&stop_handler, "tagname,self,attr");
        }
    }
}

sub stop_handler
{
    my $tagname = shift;
    my $self = shift;
    my $att = shift;
    if ($tagname eq "div") {
        $self->eof;
    }
}

my $p = HTML::Parser->new();
$p->handler(start => \&start_handler, "tagname,self,attr");
$p->parse_file(shift || die) || die $!;
print "\n";
---------------------------------------------------------------------
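
A note on the first script: its end handler calls eof on the first </div> it sees, and that is the tag closing the nested no_need div, so parsing stops before NEED THIS is reached. One possible way around this, sketched below, is to count how deep we are inside nested div elements and stop only when the counter returns to zero. The helper name extract_view_text and the string-based interface are assumptions made for illustration; they are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not from the original post): returns the text inside
# the first <div class="view_text">, including text of nested <div>s.
sub extract_view_text {
    my ($html) = @_;
    my $depth = 0;     # 0 = outside the target div, 1 = directly inside it
    my $done  = 0;     # set once the target div has been closed
    my $text  = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $attr) = @_;
            return if $done || $tagname ne 'div';
            if ($depth > 0) {
                $depth++;                      # a nested div opens
            }
            elsif (($attr->{class} // '') eq 'view_text') {
                $depth = 1;                    # the target div opens
            }
        }, 'tagname,attr' ],
        end_h => [ sub {
            my ($tagname) = @_;
            return if $done || $tagname ne 'div' || $depth == 0;
            $done = 1 if --$depth == 0;        # only the matching </div> ends it
        }, 'tagname' ],
        text_h => [ sub {
            my ($dtext) = @_;
            $text .= $dtext if $depth > 0;     # collect text at any nesting level
        }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;
    return $text;
}

my $sample = '<div class=view_text><div class=no_need>NOT NEEDED</div>NEED THIS</div>';
print extract_view_text($sample), "\n";
```

The sketch parses a string rather than a file to stay self-contained; reading the file into a scalar first, or calling parse_file on the same parser object, should work the same way.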
Script 2:
---------------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;
my $p = HTML::TokeParser -> new(shift || die) || die $!;
while (my $token = $p -> get_tag('div')) {
     my $class = $token -> [1]{class} || '';
     if ($class eq 'view_text') {
     	my $text = $p -> get_trimmed_text("/div");
     	print $text;
     }
}
print "\n";
---------------------------------------------------------------------
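
A note on the second script: get_trimmed_text("/div") returns the text that occurs before the first </div>, which again is the tag closing the nested div. A sketch of one alternative is to read tokens one at a time with get_token and keep a depth counter for div start and end tags. The helper name view_text_tokens is an assumption made for illustration. Text tokens returned by get_token are not entity-decoded, so the sketch decodes them with HTML::Entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper (not from the original post): collects all text inside
# the first <div class="view_text">, following nested <div>s.
sub view_text_tokens {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot create parser: $!";
    my $text = '';

    while (my $tag = $p->get_tag('div')) {
        next unless ($tag->[1]{class} // '') eq 'view_text';

        my $depth = 1;                         # we are inside the target div
        while ($depth > 0) {
            my $token = $p->get_token or last; # stop at end of document
            my ($type, $name) = @$token;
            if    ($type eq 'S' && $name eq 'div') { $depth++ }
            elsif ($type eq 'E' && $name eq 'div') { $depth-- }
            elsif ($type eq 'T') {
                # For "T" tokens the second element is the raw text.
                $text .= decode_entities($token->[1]);
            }
        }
        last;                                  # only the first match is wanted
    }
    return $text;
}

my $sample = '<div class=view_text><div class=no_need>NOT NEEDED</div>NEED THIS</div>';
print view_text_tokens($sample), "\n";
```

Unlike get_trimmed_text, this sketch does not trim or collapse whitespace; that can be added on the returned string if needed.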
Script 3 (here ->dump seems to show the right thing, but the text itself is not printed)
---------------------------------------------------------------------
#!/usr/bin/perl -w
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new; # empty tree
my $test = HTML::TreeBuilder->new; # empty tree
$tree->parse_file(shift || die);
$test = $tree;
$test = $test->find_by_attribute('class','view_text');
$test->dump;
$test->as_HTML;
$test->as_text;
$test->as_XML;
$tree->dump;
$tree->as_HTML;
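
A note on the third script: dump prints by itself, but as_text, as_HTML and as_XML only return a string, so calling them without print discards the result. A minimal sketch of the tree-based approach with the return value actually used is shown below. The helper name view_text_tree is an assumption made for illustration, and look_down is used here in place of find_by_attribute so that the tag name is checked as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the original post): returns the text of the
# first <div class="view_text">, or undef if there is no such element.
sub view_text_tree {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;                                # tell the parser the input is complete

    my $node = $tree->look_down(_tag => 'div', class => 'view_text');
    my $text = defined $node ? $node->as_text : undef;   # as_text returns, it does not print

    $tree->delete;                             # release the tree explicitly
    return $text;
}

my $sample = '<div class=view_text><div class=no_need>NOT NEEDED</div>NEED THIS</div>';
my $text = view_text_tree($sample);
print defined $text ? $text : '(not found)', "\n";
```

Because as_text concatenates the text of all descendants, the nested div needs no special handling here.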
