Solution 1 :

Do not parse HTML with a regex. Regexes are very bad at parsing complex, balanced text like HTML.

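For example, a regex cannot reliably pair each opening tag with its matching closing tag in nested markup like this: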

<tag>
  outer
  <tag>
    middle
    <tag>inner</tag>
    middle
  </tag>
  outer
</tag>

Instead, use an HTML parser and search tools such as XPath.

Here is a demonstration using XML::LibXML.

use strict;
use warnings;
use v5.10;

use XML::LibXML;

my $html = q{
<html>
<body>
    <a href='/channels/folder1'>Alpha-Seeking</a>
    <a href='/channels/folder2'>No Underlying Index</a>
</body>
</html>
};

# Parse the HTML
my $dom = XML::LibXML->load_html(string => $html);

# Find all links.
for my $node ($dom->findnodes('//a')) {
    # Print their text.
    say $node->textContent;
}

Solution 2 :

I must start by reiterating that it’s incredibly unwise to parse HTML or XML with regexes. Please consider using a proper HTML parser.

Having said that, your problem here is pretty simple to fix. What you call the “standard intuitive approach” works fine with a simple tweak.

Here’s what you have:

if ($string1=~ /'>(.*?)/) {print "got $1";} 

And your regex is '>(.*?). That means “find a literal quote mark, followed by a greater than sign and then capture the minimum amount of anything following that”. It’s “the minimum amount” that’s the problem. The simplest thing that .*? can capture is nothing – the empty string.

Regex quantifiers are greedy by default; they match as much as possible. You add the ? to remove that greediness and make them match as little as possible. But you don’t want that here. Here, you want their greediness. So just remove that ?.

use warnings;
use strict;

my @strings = (
 "<a href='/channels/folder1'>Alpha-Seeking",
 "<a href='/channels/folder2'>No Underlying Index ,"
);

for my $string (@strings) {
  if ($string =~ /'>(.*)/) { # Note: No "?" here
    print "got $1\n";
  }
}

This displays:

got Alpha-Seeking
got No Underlying Index ,
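
To make the difference concrete, here is a minimal sketch using one of the sample strings. The variable names ($lazy, $greedy, $anchored) are only for illustration and are not part of the original code.

```perl
use strict;
use warnings;

my $string = "<a href='/channels/folder1'>Alpha-Seeking";

# Non-greedy with nothing after it: the match succeeds,
# but the capture is the empty string.
my ($lazy) = $string =~ /'>(.*?)/;

# Greedy: captures everything up to the end of the line.
my ($greedy) = $string =~ /'>(.*)/;

# Non-greedy but anchored: "$" forces the capture to reach the end.
my ($anchored) = $string =~ /'>(.*?)$/;

printf "lazy     = [%s]\n", $lazy;
printf "greedy   = [%s]\n", $greedy;
printf "anchored = [%s]\n", $anchored;
```

The first capture comes back empty, while the other two should both hold the link text.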

Solution 3 :

This works for me

use warnings;
use strict;

my @strings = (
 "<a href='/channels/folder1'>Alpha-Seeking",
 "<a href='/channels/folder2'>No Underlying Index ,"
);

for my $string (@strings)
{
    if ($string =~ /'>(.*?)$/) 
    {
        print "got $1\n";
    } 
} 

running it gives

$ perl /tmp/abc.pl
got Alpha-Seeking
got No Underlying Index ,
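
If the text can span several lines, the anchored pattern needs the /s modifier, because "." does not match a newline by default. This is only a sketch: the multi-line input is made up, and it assumes the wanted text runs to the very end of the string.

```perl
use strict;
use warnings;

# Hypothetical multi-line input; assumes the wanted text
# continues to the end of the string.
my $string = "<a href='/channels/folder1'>Alpha-\nSeeking";

# Without /s, "." cannot cross the newline, so the anchored match fails.
my ($plain) = $string =~ /'>(.*?)$/;

# With /s, "." also matches newlines, so the capture spans both lines.
my ($multi) = $string =~ /'>(.*?)$/s;

print defined $plain ? "plain: $plain\n" : "plain: no match\n";
print "multi: $multi\n";
```

If the text should stop somewhere before the end of the string, the pattern needs an explicit terminator instead of "$".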

Solution 4 :

While exploring various options, I managed to get this working with the following:

Replace the greater than sign with some other generic symbol (like a pipe)

$string=~ s/>/|/g;                 #Interestingly, '>' matches here without any issues

After that, split on the pipe char, and print/parse the second part:

    ($o1,$o2) = split(/\|/, $string);
    print "$o2|";

Works perfectly as a work-around.
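
One thing to watch with this approach: the first argument to split is a regex, and an unescaped pipe means alternation. A rough sketch of what that does, along with a variant that splits on ">" directly and skips the substitution (the names @chars and $text are only illustrative):

```perl
use strict;
use warnings;

my $string = "<a href='/channels/folder1'>Alpha-Seeking";

# An unescaped "|" is alternation between two empty patterns,
# so split breaks the string into single characters.
my @chars = split /|/, $string;

# Splitting directly on ">" avoids the substitution step; the limit
# of 2 keeps any later ">" inside the second field.
my (undef, $text) = split />/, $string, 2;

print "first pieces: $chars[0] $chars[1]\n";
print "text: $text\n";
```

The limit argument only matters when the text itself may contain a ">" character.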

Problem :

$string1="<a href='/channels/folder1'>Alpha-Seeking";
$string2="<a href='/channels/folder2'>No Underlying Index ,";

I need to extract “Alpha-Seeking” and “No Underlying Index ,” from the above two strings.
Basically, I need everything after the '> up to the last character of the string.

I tried two ways:

1) The standard intuitive

if ($string1=~ /'>(.*?)/) {print "got $1";}

but this does not seem to work with the ‘>’ symbol.

2) Also tried

if ($string1=~ /(?=>)(.*?)/) {print "got $1";} 

based on the answers to “Greater than and less than symbol in regular expressions”, but it is not working either.

Any inputs will be useful.

PS: Also, if the answer can include matching the “less than” symbol (“<”), that will be great!

Thanks

Comments

Comment posted by Shawn

What happens with your first try if you drop the

Comment posted by Aquaholic

@stevesliva, those quotes are clear. I modified them for posting this question, and have now edited the original question to use double quotes.

Comment posted by pmqs

What exactly do you mean about matching “<”? Can you give an example please?

Comment posted by stackoverflow.com/questions/1732348/…

stackoverflow.com/questions/1732348/…

Comment posted by Aquaholic

Thanks @schwern, This works, though it needs some parsing for using HTML Parser.

Comment posted by Aquaholic

Thanks @davecross, this works, but HTML can be multi-line where this fails. +1 for single-line working

Comment posted by Dave Cross

@Aquaholic: If you have more complicated specifications, then it’s best to mention them in your question, otherwise you’ll get answers that aren’t very helpful. If you want to deal with multi-line data, then you’ll need to specify what defines the end of the text.

Comment posted by Aquaholic

Agreed. Just that in this case it turned out to be additional need as more data got exposed, after I posted this q. Will be mindful in future.

Comment posted by Aquaholic

Thanks @pmqs, this works, but HTML can be multi-line where this fails. +1 for single-line working.

Comment posted by pmqs

@Aquaholic Agree, but your question suggested you were dealing with a single-line use-case 🙂

Comment posted by Dave Cross

Interestingly, ‘>’ matches here without any issues
