APTS 3 Posted August 1, 2015

I am trying to scrape some real estate data from this page: http://v3.torontomls.net/Live/Pages/Public/Link.aspx?Key=6106f46dc223411685c459310be3c8c0&App=TREB

This page lists 29 separate properties, each identified by a unique MLS#. There is no problem scraping all of the table data at the top of the page, but I am having difficulty scraping the more detailed information that appears below the table. For example, the first piece of data I am trying to scrape is the "Sold:" value, which appears in the top-right corner of each property record. This is the HTML for the "Sold:" value:

<span class="formitem formfield">
<label>Sold:</label>
<span class="value" style="font-weight:bold">$182,000</span>
</span>

When I use the selector class="formitem formfield" it gives me too many matches, so obviously this is an incomplete selector reference. Looking at the above code, I think what I need to grab is the inner HTML of the span with class "value", which follows the "Sold:" label inside the span with class "formitem formfield". But how do I translate this into uBot code? Does anyone have any suggestions on how I can scrape the "Sold:" prices from this page? Thanks.
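For readers following along outside uBot, the idea of anchoring on the "Sold:" label instead of the non-unique "formitem formfield" class can be sketched in plain Python regex. The sample HTML below is a hypothetical stand-in repeating the fragment quoted above; the real page has one such block per property record.

```python
import re

# Assumed sample of the repeated HTML fragment (two records shown).
html = """
<span class="formitem formfield">
<label>Sold:</label>
<span class="value" style="font-weight:bold">$182,000</span>
</span>
<span class="formitem formfield">
<label>Sold:</label>
<span class="value" style="font-weight:bold">$210,500</span>
</span>
"""

# Anchor on the "Sold:" label, then capture the contents of the
# <span class="value" ...> element that follows it.
pattern = r'<label>Sold:</label>\s*<span class="value"[^>]*>(.*?)</span>'
sold_prices = re.findall(pattern, html)
print(sold_prices)  # ['$182,000', '$210,500']
```

This sidesteps the "too many matches" problem because the label text, not the shared CSS class, does the selecting.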
stanf 43 Posted August 1, 2015

I played with this for a while. The easiest way is to navigate to the page, grab the document text, and regex out the data you need.
APTS 3 Posted August 1, 2015 Author

Thanks stanf. That sounds like a good idea. Time for me to watch a couple of regex tutorial videos to figure out how to do that. Do you have any ideas or suggestions to get me started in the right direction?
stanf 43 Posted August 1, 2015 Report Share Posted August 1, 2015 regex coursehttp://www.ubotstudio.com/forum/index.php?/topic/15905-sell-learn-regular-expressions-video-course-2-hours-of-content/ regex toolhttp://www.ubotstudio.com/forum/index.php?/topic/13979-sell-regex-builder-build-regular-expressions-for-ubot-with-ease/ didnt take the course, but i swear by the software,$37 bucks saves a lot of trouble Quote Link to post Share on other sites
Pete 121 Posted August 1, 2015

This should give you a starting point:

clear list(%OutPut)
navigate("http://v3.torontomls.net/Live/Pages/Public/Link.aspx?Key=6106f46dc223411685c459310be3c8c0&App=TREB","Wait")
wait for browser event("Everything Loaded",30)
loop(2) {
    run javascript("javascript:window.scrollTo(0, document.body.scrollHeight);")
    wait for browser event("Everything Loaded",30)
    add list to list(%OutPut,$list from text($trim($find regular expression($replace regular expression($document text,"\\r?\\n",$nothing),"(?<=<div class=\"report-container\">).*?(?=<div class=\"footer\">)")),$new line),"Delete","Global")
}
remove from list(%OutPut,0)
set(#Sold,$find regular expression($list item(%OutPut,0),"(?<=Sold:</label><span class=\"value\" style=\"font-weight:bold\">).*?(?=</span>)"),"Global")
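The two-step structure of Pete's script (isolate the report section, then pull each price with a lookbehind/lookahead pair) can be illustrated in Python. The markup below is a minimal hypothetical stand-in following the structure the uBot regexes assume, not the real page source.

```python
import re

# Hypothetical stand-in for the page source, already stripped of newlines
# (the uBot script does this with $replace regular expression).
page = (
    '<div class="report-container">'
    '<label>Sold:</label><span class="value" style="font-weight:bold">$182,000</span>'
    '<label>Sold:</label><span class="value" style="font-weight:bold">$210,500</span>'
    '<div class="footer">'
)

# Step 1: isolate everything between the report container and the footer,
# as the script's first regex does with its lookbehind/lookahead pair.
body = re.search(
    r'(?<=<div class="report-container">).*?(?=<div class="footer">)', page
).group()

# Step 2: match only the text sitting between each value span's opening
# tag and its closing </span>; the assertions themselves consume nothing,
# so re.findall returns just the prices.
sold = re.findall(
    r'(?<=Sold:</label><span class="value" style="font-weight:bold">).*?(?=</span>)',
    body,
)
print(sold)  # ['$182,000', '$210,500']
```

Because lookbehind and lookahead are zero-width assertions, the pattern matches the price text alone without capturing the surrounding tags.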
APTS 3 Posted August 1, 2015 Author Report Share Posted August 1, 2015 (edited) This is *&%^$! awesome! Thanks pal, this has just made my day. You have given me the necessary "hook" to be able to jump in and figure the rest of it out from here by myself. I'm enjoying learning all about regex through watching the online tutorials. Thanks for the headstart. One last question: Why do you loop 2 times? Edited August 2, 2015 by APTS Quote Link to post Share on other sites
Pete 121 Posted August 2, 2015

It seems you need to scroll the page and add a large delay for all the data to load, so either the site has a very slow server or the page is only loading what's on screen. Play with it: take parts of the code out (the loop or the delays), then recheck the results.