UBot Underground

How To Scrape Data From Page Without Unique Ids



I am trying to scrape some real estate data from this page:

 

http://v3.torontomls.net/Live/Pages/Public/Link.aspx?Key=6106f46dc223411685c459310be3c8c0&App=TREB

 

This page lists 29 separate properties, each identified by a unique MLS#. I have no problem scraping the table data at the top of the page, but I am having difficulty scraping the more detailed information that appears below the table. For example, the first piece of data I am trying to scrape is the "Sold:" value, which appears in the top right-hand corner of each property record. This is the HTML for the "Sold:" value:

 

<span class="formitem formfield">
     <label>Sold:</label>
     <span class="value" style="font-weight:bold">$182,000</span>
</span>

When I use the selector <class="formitem formfield"> it returns too many matches, because every field on the page shares that class, so the selector is incomplete. Looking at the above code, I think what I need to grab is the inner HTML of the span with class "value" that follows the "Sold:" label, both of which sit inside the span with class "formitem formfield". But how do I translate this into UBot code?

 

Does anyone have any suggestions on how I can scrape the "Sold:" prices from this page? Thanks.

 


Thanks stanf. That sounds like a good idea. Time for me to watch a couple of regex tutorial videos to figure out how to do that. Do you have any ideas or suggestions to get me started in the right direction?


This should give you a starting point:

 

clear list(%OutPut)
navigate("http://v3.torontomls.net/Live/Pages/Public/Link.aspx?Key=6106f46dc223411685c459310be3c8c0&App=TREB","Wait")
wait for browser event("Everything Loaded",30)
loop(2) {
    run javascript("javascript:window.scrollTo(0, document.body.scrollHeight);")
    wait for browser event("Everything Loaded",30)
    add list to list(%OutPut,$list from text($trim($find regular expression($replace regular expression($document text,"\\r?\\n",$nothing),"(?<=<div class=\"report-container\">).*?(?=<div class=\"footer\">)")),$new line),"Delete","Global")
}
remove from list(%OutPut,0)
set(#Sold,$find regular expression($list item(%OutPut,0),"(?<=Sold:</label><span class=\"value\" style=\"font-weight:bold\">).*?(?=</span>)"),"Global")
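
For anyone who wants to test the pattern outside UBot, here is a rough Python sketch of the same lookbehind/lookahead idea. It runs against the HTML snippet quoted in the question, not the live page, and the function name `extract_sold` is just something I made up for illustration. It also assumes the label and the value span are separated only by whitespace, which may not hold for every record.

```python
import re

# Sample markup copied from the question; the live page may differ.
SAMPLE_HTML = '''<span class="formitem formfield">
     <label>Sold:</label>
     <span class="value" style="font-weight:bold">$182,000</span>
</span>'''

# Lookbehind anchors on the "Sold:" label plus the opening value span;
# lookahead stops at the first closing </span>.
SOLD_PATTERN = re.compile(
    r'(?<=<label>Sold:</label><span class="value" style="font-weight:bold">)'
    r'.*?(?=</span>)'
)


def extract_sold(html: str) -> list:
    """Return every "Sold:" value found in the given HTML text."""
    # Remove line breaks and indentation between tags so the label and
    # the value span sit directly next to each other.
    flat = re.sub(r">\s+<", "><", html)
    return SOLD_PATTERN.findall(flat)


if __name__ == "__main__":
    print(extract_sold(SAMPLE_HTML))
```

The key point is that the label text is part of the lookbehind, so only the value that follows "Sold:" matches, even though every field shares the "formitem formfield" class.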


This is *&%^$! awesome!

 

Thanks pal, this has just made my day. You have given me the necessary "hook" to be able to jump in and figure the rest of it out from here by myself. I'm enjoying learning all about regex through watching the online tutorials. Thanks for the head start.

 

One last question:  Why do you loop 2 times?

Edited by APTS

It seems you need to scroll the page and add a long delay for all the data to load.
So either the site has a very slow server, or the page only loads what's on screen.
Play with it: take parts of the code out (the loop or the delays), then recheck the results.
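
To show why a second pass can help, here is a small Python sketch of the "scrape, scroll, scrape again, drop duplicates" idea. It is only an illustration: `collect_records` and `fake_snapshot` are hypothetical names, and the fake snapshot just simulates a page that exposes more records on the second look. In a real bot the snapshot step would be the scroll, the wait, and the scrape.

```python
from typing import Callable, Iterable, List


def collect_records(snapshot: Callable[[], Iterable[str]], passes: int = 2) -> List[str]:
    """Call snapshot() `passes` times and merge the results in order,
    skipping any record that was already collected."""
    collected: List[str] = []
    for _ in range(passes):
        for record in snapshot():
            if record not in collected:
                collected.append(record)
    return collected


# Hypothetical stand-in for a lazily loading page: the first look sees
# two records, the second look (after scrolling) sees a third.
_views = iter([["A", "B"], ["A", "B", "C"]])


def fake_snapshot() -> List[str]:
    return next(_views)


records = collect_records(fake_snapshot, passes=2)
print(records)
```

If the second pass never adds anything new on your machine, the loop is probably not needed and a single longer wait may be enough.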
 

