UBot Underground

Need Help with Extracting HTML Tags



Hello,

 

I am using the Page Scrape command to extract the text below: 

<td colspan="3" class="heading small"><strong>Product charges</strong></td>
</tr>

<tr>
<td class="title small">Tuneband for iPhone 4 & iPhone 4S, Black, Grantwood Technology's Armband, Silicone Skin, and Front/Back Screen Protector</td>
<td class="space"> </td>
<td class="quantity small">Qty:  1</td>
<td class="small"></td>
<td class="amount small">$21.99</td>
</tr>

<tr>
<td class="title small">Tuneband for iPhone 5 (NOT FOR IPHONE 5C OR IPHONE 5S), Black, Grantwood Technology's Armband, Silicone Skin, and Front Screen Protector</td>
<td class="space"> </td>
<td class="quantity small">Qty:  1</td>
<td class="small"></td>
<td class="amount small">$22.99</td>
</tr>
<tr>
<td colspan="5" height="25px"><hr></td>
</tr>
    

I want to use the Find Regular Expression function to extract all occurrences of the tag <td class="title small">, which should be (2) occurrences in this example. The class attribute always starts with "title", and sometimes it includes more class names, like "title small".

 

However, when I use the following regex, only (1) occurrence is returned. Any ideas?

add list to list(%temp_list, $find regular expression(#temp, "<td class=\"title.*\">"), "Delete", "Global")
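A possible explanation, sketched in Python only to show the regex behavior: if the scraped text reaches the regex without line breaks, the greedy `.*` runs from the first `title` cell to the last `">` in the string, so everything collapses into one match. The shortened `html` string below is a hypothetical stand-in for the scraped rows, and the `[^"]*` variant is one way to stop at the closing quote, not necessarily the fix that was posted in the thread.

```python
import re

# Hypothetical one-line version of the scraped rows (no line breaks).
html = (
    '<tr><td class="title small">Tuneband for iPhone 4</td>'
    '<td class="quantity small">Qty:  1</td></tr>'
    '<tr><td class="title small">Tuneband for iPhone 5</td>'
    '<td class="quantity small">Qty:  1</td></tr>'
)

# Greedy: ".*" extends to the LAST '">' in the string, giving one long match.
greedy = re.findall(r'<td class="title.*">', html)

# Bounded: '[^"]*' cannot cross the closing quote, so each tag matches separately.
bounded = re.findall(r'<td class="title[^"]*">', html)

print(len(greedy))   # number of greedy matches
print(bounded)       # one entry per title cell
```

The bounded pattern also matches a plain `class="title"`, because `[^"]*` is allowed to match nothing.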

Wow! That works perfectly. If I want to extract the quantities and amounts, would I use:

add list to list(%temp_list, $find regular expression(#temp, "(?<=quantity\\ssmall\\\"\\>).*?(?=\\<)"), "Delete", "Global")

add list to list(%temp_list, $find regular expression(#temp, "(?<=amount\\ssmall\\\"\\>).*?(?=\\<)"), "Delete", "Global")
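As a rough check of those two patterns outside UBot, here is a small Python sketch (Python is used only because the lookbehind and lookahead syntax behaves the same way for these fixed-width patterns). The `html` sample is a trimmed, hypothetical copy of the rows from the first post. One thing to watch: if the "Delete" argument removes duplicates, as its name suggests, two identical "Qty:  1" values would collapse into one, and putting quantities and amounts into the same %temp_list mixes them together, so separate lists with "Don't Delete" may be safer.

```python
import re

# Trimmed, hypothetical copy of the rows from the first post.
html = '''<tr>
<td class="title small">Tuneband for iPhone 4</td>
<td class="quantity small">Qty:  1</td>
<td class="amount small">$21.99</td>
</tr>
<tr>
<td class="title small">Tuneband for iPhone 5</td>
<td class="quantity small">Qty:  1</td>
<td class="amount small">$22.99</td>
</tr>'''

# Lookbehind anchors just after the opening tag; lazy ".*?" stops at the next "<".
quantities = re.findall(r'(?<=quantity\ssmall">).*?(?=<)', html)
amounts = re.findall(r'(?<=amount\ssmall">).*?(?=<)', html)

# Keeping two lists lets each quantity be paired with its amount by position.
for qty, amount in zip(quantities, amounts):
    print(qty, amount)
```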
  • 1 year later...

If there are returns embedded in the text, then the current regex does not produce any matches. For example:

<td class="amount small">
      $21.99
</td>

How would you modify the regex to extract the text (including any returns, tabs, spaces, etc.)? Also, how would you strip all of these characters (UBot's $trim command only strips spaces), leaving just the amount?

add list to list(%temp_list, $find regular expression(#temp, "(?<=amount\\ssmall\\\"\\>).*?(?=\\<)"), "Don\'t Delete", "Global")

You can start with this:

set(#temp, "<td class=\"amount small\">
      $21.99
</td>", "Global")
add list to list(%temp_list, $find regular expression($trim($replace(#temp, $new line, $nothing)), "(?<=amount\\ssmall\\\"\\>).*?(?=\\<)"), "Don\'t Delete", "Global")

And then when you use each list item you can call $trim to get rid of any extra spaces. That should be able to do it all for you.
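For comparison, here is a Python sketch of why the original pattern fails on the multi-line cell and of two ways around it. It assumes UBot's regex engine treats `.` the way Python does by default (no match on line breaks), which is an assumption rather than something confirmed in this thread. The first workaround mirrors the approach above (remove line breaks, then match); the second swaps `.*?` for `[^<]*`, which can cross line breaks, and then strips the surrounding whitespace.

```python
import re

html = '<td class="amount small">\n      $21.99\n</td>'
pattern = r'(?<=amount\ssmall">).*?(?=<)'

# "." does not match a line break by default, so nothing is found.
no_match = re.findall(pattern, html)

# Workaround 1: drop line breaks first, then match and strip each result.
flattened = html.replace('\r', '').replace('\n', '')
from_flattened = [m.strip() for m in re.findall(pattern, flattened)]

# Workaround 2: "[^<]*" matches any character except "<", including line breaks.
spanning = re.findall(r'(?<=amount\ssmall">)[^<]*(?=<)', html)
from_spanning = [m.strip() for m in spanning]

print(no_match, from_flattened, from_spanning)
```

In Python, `str.strip()` removes tabs and line breaks as well as spaces; whether $trim does the same in UBot is the open question raised above, so removing the line breaks before matching is the more cautious route there.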
