use strict;
use warnings;
use v5.10;
use XML::LibXML;
my $html = q{
<html>
<body>
<a href='/channels/folder1'>Alpha-Seeking</a>
<a href='/channels/folder2'>No Underlying Index</a>
</body>
</html>
};
# Parse the HTML
my $dom = XML::LibXML->load_html(string => $html);
# Find all links.
for my $node ($dom->findnodes('//a')) {
# Print their text.
say $node->textContent;
}
Solution 2 :
I must start by reiterating that it’s incredibly unwise to parse HTML or XML with regexes. Please consider using a proper HTML parser.
Having said that, your problem here is pretty simple to fix. What you call the “standard intuitive approach” works fine with a simple tweak.
Here’s what you have:
if ($string1=~ /'>(.*?)/) {print "got $1";}
And your regex is '>(.*?). That means “find a literal quote mark, followed by a greater than sign and then capture the minimum amount of anything following that”. It’s “the minimum amount” that’s the problem. The simplest thing that .*? can capture is nothing – the empty string.
Regexes are greedy by default; they match as much as possible. You add the ? to remove that greediness and make them match as little as possible. But you don’t want that here. Here, you want their greediness. So just remove that ?.
use warnings;
use strict;
my @strings = (
"<a href='/channels/folder1'>Alpha-Seeking",
"<a href='/channels/folder2'>No Underlying Index ,"
);
for my $string (@strings) {
if ($string =~ /'>(.*)/) { # Note: No "?" here
print "got $1\n";
}
}
This displays:
got Alpha-Seeking
got No Underlying Index ,
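To see the three quantifier choices side by side, here is a minimal sketch. The helper name capture_after is made up for illustration and is not part of any answer here; it simply returns whatever the pattern captures.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answers): returns the first capture
# group of $regex applied to $string, or undef when there is no match.
sub capture_after {
    my ($string, $regex) = @_;
    my ($captured) = $string =~ $regex;
    return $captured;
}

my $string = "<a href='/channels/folder1'>Alpha-Seeking";

# Non-greedy with nothing after it: matches, but captures the empty string.
printf "lazy:     [%s]\n", capture_after($string, qr/'>(.*?)/);

# Greedy: runs to the end of the line.
printf "greedy:   [%s]\n", capture_after($string, qr/'>(.*)/);

# Non-greedy but anchored: the $ forces the match to stretch to the end.
printf "anchored: [%s]\n", capture_after($string, qr/'>(.*?)$/);
```

The first line prints empty brackets, while the other two print Alpha-Seeking. That is why both dropping the ? and adding a $ anchor fix the original pattern.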
Solution 3 :
This works for me
use warnings;
use strict;
my @strings = (
"<a href='/channels/folder1'>Alpha-Seeking",
"<a href='/channels/folder2'>No Underlying Index ,"
);
for my $string (@strings)
{
if ($string =~ /'>(.*?)$/)
{
print "got $1\n";
}
}
running it gives
$ perl /tmp/abc.pl
got Alpha-Seeking
got No Underlying Index ,
Solution 4 :
While exploring various options, I managed to get this working with the following:
Replace the greater than sign with some other generic symbol (like a pipe)
$string=~ s/>/|/g; #Interestingly, '>' matches here without any issues
After that, split on the pipe char, and print/parse the second part:
($o1,$o2) = split(/\|/, $string); # the pipe must be escaped; a bare | is regex alternation
print "$o2|";
Works perfectly as a work-around.
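The substitution step may not be needed at all: split can work on the greater-than sign directly, and a limit of 2 stops it after the first separator, so any later '>' stays in the text. A minimal sketch, assuming the text of interest always follows the first '>' (the helper name text_after_tag is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: returns everything after the first '>' in the string,
# or undef if the string contains no '>'.
sub text_after_tag {
    my ($string) = @_;
    # Limit of 2: split only at the first '>' and keep the rest intact.
    my (undef, $text) = split />/, $string, 2;
    return $text;
}

for my $string (
    "<a href='/channels/folder1'>Alpha-Seeking",
    "<a href='/channels/folder2'>No Underlying Index ,",
) {
    print text_after_tag($string), "\n";
}
```

Like the regex answers, this only looks at a single string and does not understand HTML structure.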
Problem :
$string1="<a href='/channels/folder1'>Alpha-Seeking";
$string2="<a href='/channels/folder2'>No Underlying Index ,";
I need to extract “Alpha-Seeking” and “No Underlying Index ,” from the above 2 strings.
Basically, I need everything from ('>) to the last character of the string.
Thanks @schwern, this works, though it needs some parsing for using HTML Parser.
Comment posted by Aquaholic
Thanks @davecross, this works, but HTML can be multi-line where this fails. +1 for single-line working
Comment posted by Dave Cross
@Aquaholic: If you have more complicated specifications, then it’s best to mention them in your question, otherwise you’ll get answers that aren’t very helpful. If you want to deal with multi-line data, then you’ll need to specify what defines the end of the text.
Comment posted by Aquaholic
Agreed. Just that in this case it turned out to be additional need as more data got exposed, after I posted this q. Will be mindful in future.
Comment posted by Aquaholic
Thanks @pmqs, this works, but HTML can be multi-line where this fails. +1 for single-line working.
Comment posted by pmqs
@Aquaholic Agree, but your question suggested you were dealing with a single-line use-case 🙂
Comment posted by Dave Cross
Interestingly, ‘>’ matches here without any issues