Bot-Factory 602 Posted April 24, 2014
Hello. I'm trying to extract some text from http://zitate.net/zitate/top.html. Unfortunately, the page wraps it in a lot of extra HTML, which makes exporting harder:
<body style="min-width:570px;"><div><div class="menu-header" style="font-size:10pt;line-height:15px;"><div style="position:relative;width:550px;margin:auto;"><br><div style="position:absolute;left:0px;right:0px;top:0px;text-align:center;"></div><div style="position:absolute;left:0px;top:0px;"><a href="/"><img alt="Zitate" height="16" src="/x/p.gif" style="background:url('/x/i.png') 0px -64px; vertical-align:-3px;" width="16"></a> <a href="/">zitate.net</a></div><div style="position:absolute;right:0px;top:0px;text-align:right;"><form action="/zitate/suche.html" method="get" name="searchform"><div style="margin-top:-1px;"><input class="x-xInputField" maxlength="100" name="query" style="border-width:0px;width:100px;" type="text"> <a href="#" onclick="document.searchform.submit();return false;" title="Suchen"><img alt="Suchen" height="16" src="/x/p.gif" style="background:url('/x/i.png') 0px -96px; vertical-align:-3px;" width="16"></a><script type="text/javascript">document.searchform.query.focus();</script></div></form></div></div></div><div style="width:550px;margin:auto;"><br><h1 class="menu-contentHeaderTitle">Die besten Zitate</h1><br><div class="a-fs11Lh19"><table class="a-auto"><tr><td><div><br><span class="quote-quote" onmouseout="rToggleLn(this, 0);" onmouseover="rToggleLn(this, 1);">Wir <a href="/leben.html" title="516 Zitate">leben</a> alle unter dem gleichen <a href="/himmel.html" title="14 Zitate">Himmel</a>, aber wir haben nicht alle den gleichen <a href="/horizonte.html" title="2 Zitate">Horizont</a>.</span><br><br><a href="/konrad%20adenauer.html" title="1.
Bundeskanzler von Deutschland, 1876 - 1967, 19 Zitate">Konrad Adenauer</a><div class="quote-quoteIcons"><a href="/zitat_830.html"><img alt="/x/details.png" height="16" onmouseout="mStop();" onmouseover="rShowDetails('830');" src="/x/p.gif" style="background:url('/x/i.png') 0px -128px;" width="14"></a><img alt="/x/bookmark.png" height="16" id="quote__id_quote_830_i" src="/x/p.gif" style="background:url('/x/i.png') 0px -112px;display:none;" title="Lesezeichen" width="16"><div class="quote-quoteDetails" id="quote__id_quote_830_p" onmouseout="mStop();" onmouseover="mWait();"></div></div><br><br></div></td><td style="padding-left:20px;text-align:right;"><a href="/konrad%20adenauer.html"><img alt="Konrad Adenauer" height="92" onmouseout="mStop();" onmouseover="rShowCreditsD('pp-187', this);" src="/konrad%20adenauer.lr.jpg" style="vertical-align:middle;padding:15px 0px;" width="69"></a><div id="pp-187" onmouseout="mStop();" onmouseover="mWait();" style="background:#ffffe1;border-color:#000000;border-width:1px;font-size:8pt;display:none;padding:2px;position:absolute;text-align:left;z-index:100;"><b>Bild</b> CC-BY-SA<br>by <a href="http://www.bild.bundesarchiv.de/archives/barchpic/search/?search%5Bform%5D%5BSIGNATUR%5D=B+145+Bild-F078072-0004" target="_blank">Katherine Young</a><br>via <a href="http://commons.wikimedia.org/wiki/File:Bundesarchiv_B_145_Bild-F078072-0004,_Konrad_Adenauer.jpg" target="_blank">Wikimedia</a></div></td></tr></table><hr class="menu-line"><table class="a-auto"><tr><td><div><br><span class="quote-quote" onmouseout="rToggleLn(this, 0);" onmouseover="rToggleLn(this, 1);">Ich bin nicht sicher, mit welchen <a href="/waffen.html" title="14 Zitate">Waffen</a> der dritte <a href="/weltkriege.html" title="2 Zitate">Weltkrieg</a> ausgetragen wird, aber im vierten Weltkrieg werden sie mit Stöcken und Steinen kämpfen.</span><br><br><a href="/albert%20einstein.html" title="Deutscher Physiker und Nobelpreisträger, 1879 - 1955, 41 Zitate">Albert Einstein</a><div 
class="quote-quoteIcons"><a href="/zitat_40.html"><img alt="/x/details.png" height="16" onmouseout="mStop();" onmouseover="rShowDetails('40');" src="/x/p.gif" style="background:url('/x/i.png') 0px -128px;" width="14"></a><img alt="/x/bookmark.png" height="16" id="quote__id_quote_40_i" src="/x/p.gif" style="background:url('/x/i.png') 0px -112px;display:none;" title="Lesezeichen" width="16"><div class="quote-quoteDetails" id="quote__id_quote_40_p" onmouseout="mStop();" onmouseover="mWait();"></div></div><br><br></div></td><td style="padding-left:20px;text-align:right;"><a href="/albert%20einstein.html"><img alt="Albert Einstein" height="92" src="/albert%20einstein.lr.jpg" style="vertical-align:middle;padding:15px 0px;" title="Albert Einstein, 41 Zitate" width="69"></a></td></tr></table><hr class="menu-line"><table class="a-auto"><tr><td><div><br><span class="quote-quote" onmouseout="rToggleLn(this, 0);" onmouseover="rToggleLn(this, 1);"><a href="/sagen.html" title="20 Zitate">Sage</a> nicht alles, was du <a href="/wissen.html" title="131 Zitate">weißt</a>, aber wisse immer, was du sagst.</span><br><br><a href="/matthias%20claudius.html" title="Deutscher Dichter, Journalist und Lyriker, 1740 - 1815, 5 Zitate">Matthias Claudius</a><div class="quote-quoteIcons"><a href="/zitat_3087.html"><img alt="/x/details.png" height="16" onmouseout="mStop();" onmouseover="rShowDetails('3087');" src="/x/p.gif" style="background:url('/x/i.png') 0px -128px;" width="14"></a><img alt="/x/bookmark.png" height="16" id="quote__id_quote_3087_i" src="/x/p.gif" style="background:url('/x/i.png') 0px -112px;display:none;" title="Lesezeichen" width="16"><div class="quote-quoteDetails" id="quote__id_quote_3087_p" onmouseout="mStop();" onmouseover="mWait();"></div></div><br><br></div></td><td style="padding-left:20px;text-align:right;"><a href="/matthias%20claudius.html"><img alt="Matthias Claudius" height="92" src="/matthias%20claudius.lr.jpg" style="vertical-align:middle;padding:15px 0px;" 
title="Matthias Claudius, 5 Zitate" width="69"></a></td></tr></table><hr class="menu-line"><br><span class="quote-quote" onmouseout="rToggleLn(this, 0);" onmouseover="rToggleLn(this, 1);">Es ist nicht zu <a href="/wenig.html" title="13 Zitate">wenig</a> <a href="/zeit.html" title="156 Zitate">Zeit</a>, die wir haben, sondern es ist zu <a href="/viel.html" title="14 Zitate">viel</a> Zeit, die wir nicht <a href="/nutzen.html" title="15 Zitate">nutzen</a>.</span><br><br><a href="/lucius%20annaeus%20seneca.html" title="Römischer Philosoph, Dramatiker und Staatsmann, 12 Zitate">Lucius Annaeus Seneca</a><div class="quote-quoteIcons"><a href="/zitat_1615.html"><img alt="/x/details.png" height="16" onmouseout="mStop();" onmouseover="rShowDetails('1615');" src="/x/p.gif" style="background:url('/x/i.png') 0px -128px;" width="14"></a><img alt="/x/bookmark.png" height="16" id="quote__id_quote_1615_i" src="/x/p.gif" style="background:url('/x/i.png') 0px -112px;display:none;" title="Lesezeichen" width="16"><div class="quote-quoteDetails" id="quote__id_quote_1615_p" onmouseout="mStop();" onmouseover="mWait();"></div></div><br><br><hr class="menu-line"><table class="a-auto"><tr><td><div><br><span class="quote-quote" onmouseout="rToggleLn(this, 0);" onmouseover="rToggleLn(this, 1);">Zwei Dinge sind unendlich, das <a href="/universen.html" title="10 Zitate">Universum</a> und die menschliche <a href="/dummheit.html" title="102 Zitate">Dummheit</a>, aber bei dem Universum bin ich mir noch nicht ganz sicher.</span><br><br><a href="/albert%20einstein.html" title="Deutscher Physiker und Nobelpreisträger, 1879 - 1955, 41 Zitate">Albert Einstein</a><div class="quote-quoteIcons"><a href="/zitat_805.html"><img alt="/x/details.png" height="16" onmouseout="mStop();" onmouseover="rShowDetails('805');" src="/x/p.gif" style="background:url('/x/i.png') 0px -128px;" width="14"></a><img alt="/x/bookmark.png" height="16" id="quote__id_quote_805_i" src="/x/p.gif" style="background:url('/x/i.png') 0px 
-112px;display:none;" title="Lesezeichen" width="16"><div class="quote-quoteDetails" id="quote__id_quote_805_p" onmouseout="mStop();" onmouseover="mWait();"></div></div><br><br></div></td><td style="padding-left:20px;text-align:right;"><a href=
If someone has a smart idea for extracting the text (the quotes) from this site, I would love to hear it.
Kind regards,
Dan
kev123 132 Posted April 24, 2014
Browser or HTTP?
Bot-Factory 602 Posted April 24, 2014 Author
I forgot to add: I need this via HTTP. In the browser it works fine:
navigate("http://zitate.net/zitate/top.html", "Wait")
add list to list(%quotes, $scrape attribute(<class="quote-quote">, "innertext"), "Delete", "Global")
But normally I should be able to translate that directly to XPath, right?
Dan
Code Docta (Nick C.) 638 Posted April 24, 2014
Here are both the HTML parser and the XPath versions:
set(#get, $plugin function("HTTP post.dll", "$http get", "http://zitate.net/zitate/top.html", "", "", "", ""), "Global")
set(#ahtml, $plugin function("HTTP post.dll", "$html parser", #get, "span", "class", "quote-quote", "InnerText"), "Global")
set(#axpath, $plugin function("HTTP post.dll", "$xpath parser", #get, "//span[@class=\'quote-quote\']", "InnerText", "HTML"), "Global")
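For anyone reading along outside UBot, the same selection (every span whose class is quote-quote, reduced to its inner text) can be sketched in plain Python. This is only an illustration and not part of the HTTP post plugin: the names QuoteExtractor and extract_quotes are made up for this example, it uses just the standard library html.parser module, and it runs on a short excerpt of the markup posted above rather than fetching the live page.

```python
from html.parser import HTMLParser


class QuoteExtractor(HTMLParser):
    """Collect the inner text of every <span class="quote-quote"> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.quotes = []     # finished quotes, in document order
        self._parts = None   # text pieces of the quote being read, or None
        self._depth = 0      # <span> nesting depth inside the current quote

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._parts is not None:
            # A nested <span> inside a quote: remember it so its
            # closing tag does not end the quote too early.
            self._depth += 1
        elif "quote-quote" in (dict(attrs).get("class") or "").split():
            self._parts = []
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._parts is not None:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and collapse runs of whitespace.
                self.quotes.append(" ".join("".join(self._parts).split()))
                self._parts = None

    def handle_data(self, data):
        # Text inside links (<a>...</a>) within the span is kept as well.
        if self._parts is not None:
            self._parts.append(data)


def extract_quotes(html):
    """Return the text of all quote-quote spans found in an HTML string."""
    parser = QuoteExtractor()
    parser.feed(html)
    parser.close()
    return parser.quotes


# Excerpt of the markup from the first post (attributes shortened).
sample = (
    '<span class="quote-quote" onmouseover="rToggleLn(this, 1);">'
    '<a href="/sagen.html" title="20 Zitate">Sage</a> nicht alles, was du '
    '<a href="/wissen.html" title="131 Zitate">weißt</a>, aber wisse immer, '
    'was du sagst.</span><br><br>'
    '<a href="/matthias%20claudius.html">Matthias Claudius</a>'
)

for quote in extract_quotes(sample):
    print(quote)
```

The author link sits outside the span, so only the quote text is collected. The links inside the span contribute their text, which matches what an inner-text scrape would be expected to return.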
Bot-Factory 602 Posted April 24, 2014 Author
Thanks TC, sometimes I wonder why I even asked the question :-) After I looked at the browser example, it was obvious that I could use the same HTML tags for the XPath parser. When I first looked at the HTML code, it confused me a lot, and I thought it was far more complicated than it actually is :-)
Dan