How To Scrape Data From Page Without Unique Ids

APTS · August 1, 2015

I am trying to scrape some real estate data from this page:

http://v3.torontomls.net/Live/Pages/Public/Link.aspx?Key=6106f46dc223411685c459310be3c8c0&App=TREB

This page lists 29 separate properties, each identified by a unique MLS#. There is no problem scraping all of the table data at the top of the page, but I am having difficulty scraping the more detailed information that appears below the table. For example, the first piece of data that I am trying to scrape is the "Sold:" value which appears in the top-right hand corner of each property record. This is the HTML for the "Sold:" value:

<span class="formitem formfield">
     <label>Sold:</label>
     <span class="value" style="font-weight:bold">$182,000</span>
</span>

When I use the selector <class="formitem formfield"> it gives me too many matches, so it is obviously incomplete. Looking at the above code, I think what I need to grab is the innerHTML of the span with class "value" that follows the "Sold:" label, inside the span with class "formitem formfield". But how do I translate this into UBot code?

Does anyone have any suggestions on how I can scrape the "Sold:" prices from this page? Thanks.

stanf · August 1, 2015

I played with this for a while. The easiest way is to navigate to the page, grab the document text, and regex the data you need.
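
To make the regex suggestion concrete, here is a minimal sketch written in Python rather than UBot, run against the HTML fragment from the first post. It assumes the "Sold:" label is always followed directly by the value span, with only optional whitespace in between; the pattern itself should carry over to other regex engines.

```python
import re

# HTML fragment copied from the original question.
html = '''<span class="formitem formfield">
     <label>Sold:</label>
     <span class="value" style="font-weight:bold">$182,000</span>
</span>'''

# Anchor on the "Sold:" label, then capture the text of the value span after it.
# \s* tolerates whitespace between the tags; [^>]* skips the style attribute.
SOLD_RE = re.compile(
    r'<label>Sold:</label>\s*<span class="value"[^>]*>(.*?)</span>',
    re.DOTALL,
)

sold_prices = SOLD_RE.findall(html)
print(sold_prices)  # ['$182,000']
```

Anchoring on the label text is what makes up for the missing unique IDs: the class names repeat for every field, but the label does not.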

APTS · August 1, 2015

Thanks stanf. That sounds like a good idea. Time for me to watch a couple of regex tutorial videos to figure out how to do that. Do you have any ideas or suggestions to get me started in the right direction?

stanf · August 1, 2015

regex course

http://www.ubotstudio.com/forum/index.php?/topic/15905-sell-learn-regular-expressions-video-course-2-hours-of-content/

regex tool

http://www.ubotstudio.com/forum/index.php?/topic/13979-sell-regex-builder-build-regular-expressions-for-ubot-with-ease/

I didn't take the course, but I swear by the software. $37 saves a lot of trouble.

Pete · August 1, 2015

This should give you a starting point:

clear list(%OutPut)
navigate("http://v3.torontomls.net/Live/Pages/Public/Link.aspx?Key=6106f46dc223411685c459310be3c8c0&App=TREB","Wait")
wait for browser event("Everything Loaded",30)
loop(2) {
    run javascript("javascript:window.scrollTo(0, document.body.scrollHeight);")
    wait for browser event("Everything Loaded",30)
    add list to list(%OutPut,$list from text($trim($find regular expression($replace regular expression($document text,"\\r?\\n",$nothing),"(?<=<div class=\"report-container\">).*?(?=<div class=\"footer\">)")),$new line),"Delete","Global")
}
remove from list(%OutPut,0)
set(#Sold,$find regular expression($list item(%OutPut,0),"(?<=Sold:</label><span class=\"value\" style=\"font-weight:bold\">).*?(?=</span>)"),"Global")
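
For anyone who wants to try the regex logic outside UBot, the same flow can be sketched in Python: strip line breaks, cut out the region between the report container and the footer, then pull every value that follows a given label. The sample markup and the `values_for_label` helper are invented for illustration; only the class names come from the script above, and the real page's markup may differ.

```python
import re

# Hypothetical sample markup; only the class names come from the thread.
page = """<div class="report-container">
<span class="formitem formfield"><label>List:</label><span class="value" style="font-weight:bold">$189,900</span></span>
<span class="formitem formfield"><label>Sold:</label><span class="value" style="font-weight:bold">$182,000</span></span>
<span class="formitem formfield"><label>Sold:</label><span class="value" style="font-weight:bold">$455,000</span></span>
</div><div class="footer">footer text</div>"""


def values_for_label(document_text, label):
    """Return every value that directly follows `label` inside the report area."""
    # Step 1: remove line breaks so a pattern can span former line boundaries.
    flat = re.sub(r"\r?\n", "", document_text)
    # Step 2: keep only the text between the report container and the footer.
    region = re.search(
        r'(?<=<div class="report-container">).*?(?=<div class="footer">)', flat
    )
    if region is None:
        return []
    # Step 3: lookbehind on the label, lookahead on the closing tag.
    pattern = (
        r'(?<=' + re.escape(label)
        + r'</label><span class="value" style="font-weight:bold">)'
        + r'.*?(?=</span>)'
    )
    return re.findall(pattern, region.group(0))


print(values_for_label(page, "Sold:"))  # ['$182,000', '$455,000']
print(values_for_label(page, "List:"))  # ['$189,900']
```

Passing the label as a parameter makes it easy to reuse the same pattern for the other fields in each property record.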

APTS · August 1, 2015

This is *&%^$! awesome!

Thanks pal, this has just made my day. You have given me the necessary "hook" to be able to jump in and figure the rest of it out from here by myself. I'm enjoying learning all about regex through watching the online tutorials. Thanks for the headstart.

One last question: Why do you loop 2 times?

Edited August 2, 2015 by APTS

Pete · August 2, 2015

It seems you need to scroll the page and add a large delay for all the data to load.
So either the site has a very slow server, or the page only loads what's on screen.
Play with it: take parts of the code out (the loop or the delays), then recheck the results.
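
One way to test the "only loads what's on screen" theory is to repeat the scroll-and-scrape step until the number of scraped records stops growing, instead of looping a fixed number of times. The following is a rough Python sketch of that stopping rule only; `scrape_once` is a hypothetical callback standing in for "scroll, wait, and return the records found so far", and the snapshot data is made up.

```python
import time


def scrape_until_stable(scrape_once, max_rounds=10, delay_seconds=0.0):
    """Call scrape_once() repeatedly until the record count stops growing."""
    records = []
    for _ in range(max_rounds):
        latest = scrape_once()
        if len(latest) <= len(records):
            break  # nothing new appeared, so assume the page has finished loading
        records = latest
        time.sleep(delay_seconds)  # give the page time before the next attempt
    return records


# Simulated page that reveals one more record after each scroll (made-up data).
snapshots = iter([["A"], ["A", "B"], ["A", "B", "C"], ["A", "B", "C"]])
result = scrape_until_stable(lambda: next(snapshots))
print(result)  # ['A', 'B', 'C']
```

If the record count is already complete after the first pass, the slow-server explanation becomes more likely than lazy loading.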
