UBot Underground

How could I extract this?



Hello.

 

I'm trying to extract some text from:

http://zitate.net/zitate/top.html

 

Unfortunately, they wrap the text in a lot of HTML markup, which makes exporting harder:

<body style="min-width:570px;"><div><div class="menu-header" style="font-size:10pt;line-height:15px;"><div style="position:relative;width:550px;margin:auto;"><br><div style="position:absolute;left:0px;right:0px;top:0px;text-align:center;"></div><div style="position:absolute;left:0px;top:0px;"><a href="/"><img alt="Zitate" height="16" src="/x/p.gif" style="background:url('/x/i.png') 0px -64px; vertical-align:-3px;" width="16"></a> <a href="/">zitate.net</a></div><div style="position:absolute;right:0px;top:0px;text-align:right;"><form action="/zitate/suche.html" method="get" name="searchform"><div style="margin-top:-1px;"><input class="x-xInputField" maxlength="100" name="query" style="border-width:0px;width:100px;" type="text"> <a href="#" onclick="document.searchform.submit();return false;" title="Suchen"><img alt="Suchen" height="16" src="/x/p.gif" style="background:url('/x/i.png') 0px -96px; vertical-align:-3px;" width="16"></a><script type="text/javascript">document.searchform.query.focus();</script></div></form></div></div></div><div style="width:550px;margin:auto;"><br><h1 class="menu-contentHeaderTitle">Die besten Zitate</h1><br><div class="a-fs11Lh19"><table class="a-auto"><tr><td><div><br><span class="quote-quote" onmouseout="rToggleLn(this, 0);" onmouseover="rToggleLn(this, 1);">Wir <a href="/leben.html" title="516 Zitate">leben</a> alle unter dem gleichen <a href="/himmel.html" title="14 Zitate">Himmel</a>, aber wir haben nicht alle den gleichen <a href="/horizonte.html" title="2 Zitate">Horizont</a>.</span><br><br><a href="/konrad%20adenauer.html" title="1. 
Bundeskanzler von Deutschland, 1876 - 1967, 19 Zitate">Konrad Adenauer</a><div class="quote-quoteIcons"><a href="/zitat_830.html"><img alt="/x/details.png" height="16" onmouseout="mStop();" onmouseover="rShowDetails('830');" src="/x/p.gif" style="background:url('/x/i.png') 0px -128px;" width="14"></a><img alt="/x/bookmark.png" height="16" id="quote__id_quote_830_i" src="/x/p.gif" style="background:url('/x/i.png') 0px -112px;display:none;" title="Lesezeichen" width="16"><div class="quote-quoteDetails" id="quote__id_quote_830_p" onmouseout="mStop();" onmouseover="mWait();"></div></div><br><br></div></td><td style="padding-left:20px;text-align:right;"><a href="/konrad%20adenauer.html"><img alt="Konrad Adenauer" height="92" onmouseout="mStop();" onmouseover="rShowCreditsD('pp-187', this);" src="/konrad%20adenauer.lr.jpg" style="vertical-align:middle;padding:15px 0px;" width="69"></a><div id="pp-187" onmouseout="mStop();" onmouseover="mWait();" style="background:#ffffe1;border-color:#000000;border-width:1px;font-size:8pt;display:none;padding:2px;position:absolute;text-align:left;z-index:100;"><b>Bild</b> CC-BY-SA<br>by <a href="http://www.bild.bundesarchiv.de/archives/barchpic/search/?search%5Bform%5D%5BSIGNATUR%5D=B+145+Bild-F078072-0004" target="_blank">Katherine Young</a><br>via <a href="http://commons.wikimedia.org/wiki/File:Bundesarchiv_B_145_Bild-F078072-0004,_Konrad_Adenauer.jpg" target="_blank">Wikimedia</a></div></td></tr></table><hr class="menu-line"><table class="a-auto"><tr><td><div><br><span class="quote-quote" onmouseout="rToggleLn(this, 0);" onmouseover="rToggleLn(this, 1);">Ich bin nicht sicher, mit welchen <a href="/waffen.html" title="14 Zitate">Waffen</a> der dritte <a href="/weltkriege.html" title="2 Zitate">Weltkrieg</a> ausgetragen wird, aber im vierten Weltkrieg werden sie mit Stöcken und Steinen kämpfen.</span><br><br><a href="/albert%20einstein.html" title="Deutscher Physiker und Nobelpreisträger, 1879 - 1955, 41 Zitate">Albert Einstein</a><div 
class="quote-quoteIcons"><a href="/zitat_40.html"><img alt="/x/details.png" height="16" onmouseout="mStop();" onmouseover="rShowDetails('40');" src="/x/p.gif" style="background:url('/x/i.png') 0px -128px;" width="14"></a><img alt="/x/bookmark.png" height="16" id="quote__id_quote_40_i" src="/x/p.gif" style="background:url('/x/i.png') 0px -112px;display:none;" title="Lesezeichen" width="16"><div class="quote-quoteDetails" id="quote__id_quote_40_p" onmouseout="mStop();" onmouseover="mWait();"></div></div><br><br></div></td><td style="padding-left:20px;text-align:right;"><a href="/albert%20einstein.html"><img alt="Albert Einstein" height="92" src="/albert%20einstein.lr.jpg" style="vertical-align:middle;padding:15px 0px;" title="Albert Einstein, 41 Zitate" width="69"></a></td></tr></table><hr class="menu-line"><table class="a-auto"><tr><td><div><br><span class="quote-quote" onmouseout="rToggleLn(this, 0);" onmouseover="rToggleLn(this, 1);"><a href="/sagen.html" title="20 Zitate">Sage</a> nicht alles, was du <a href="/wissen.html" title="131 Zitate">weißt</a>, aber wisse immer, was du sagst.</span><br><br><a href="/matthias%20claudius.html" title="Deutscher Dichter, Journalist und Lyriker, 1740 - 1815, 5 Zitate">Matthias Claudius</a><div class="quote-quoteIcons"><a href="/zitat_3087.html"><img alt="/x/details.png" height="16" onmouseout="mStop();" onmouseover="rShowDetails('3087');" src="/x/p.gif" style="background:url('/x/i.png') 0px -128px;" width="14"></a><img alt="/x/bookmark.png" height="16" id="quote__id_quote_3087_i" src="/x/p.gif" style="background:url('/x/i.png') 0px -112px;display:none;" title="Lesezeichen" width="16"><div class="quote-quoteDetails" id="quote__id_quote_3087_p" onmouseout="mStop();" onmouseover="mWait();"></div></div><br><br></div></td><td style="padding-left:20px;text-align:right;"><a href="/matthias%20claudius.html"><img alt="Matthias Claudius" height="92" src="/matthias%20claudius.lr.jpg" style="vertical-align:middle;padding:15px 0px;" 
title="Matthias Claudius, 5 Zitate" width="69"></a></td></tr></table><hr class="menu-line"><br><span class="quote-quote" onmouseout="rToggleLn(this, 0);" onmouseover="rToggleLn(this, 1);">Es ist nicht zu <a href="/wenig.html" title="13 Zitate">wenig</a> <a href="/zeit.html" title="156 Zitate">Zeit</a>, die wir haben, sondern es ist zu <a href="/viel.html" title="14 Zitate">viel</a> Zeit, die wir nicht <a href="/nutzen.html" title="15 Zitate">nutzen</a>.</span><br><br><a href="/lucius%20annaeus%20seneca.html" title="Römischer Philosoph, Dramatiker und Staatsmann, 12 Zitate">Lucius Annaeus Seneca</a><div class="quote-quoteIcons"><a href="/zitat_1615.html"><img alt="/x/details.png" height="16" onmouseout="mStop();" onmouseover="rShowDetails('1615');" src="/x/p.gif" style="background:url('/x/i.png') 0px -128px;" width="14"></a><img alt="/x/bookmark.png" height="16" id="quote__id_quote_1615_i" src="/x/p.gif" style="background:url('/x/i.png') 0px -112px;display:none;" title="Lesezeichen" width="16"><div class="quote-quoteDetails" id="quote__id_quote_1615_p" onmouseout="mStop();" onmouseover="mWait();"></div></div><br><br><hr class="menu-line"><table class="a-auto"><tr><td><div><br><span class="quote-quote" onmouseout="rToggleLn(this, 0);" onmouseover="rToggleLn(this, 1);">Zwei Dinge sind unendlich, das <a href="/universen.html" title="10 Zitate">Universum</a> und die menschliche <a href="/dummheit.html" title="102 Zitate">Dummheit</a>, aber bei dem Universum bin ich mir noch nicht ganz sicher.</span><br><br><a href="/albert%20einstein.html" title="Deutscher Physiker und Nobelpreisträger, 1879 - 1955, 41 Zitate">Albert Einstein</a><div class="quote-quoteIcons"><a href="/zitat_805.html"><img alt="/x/details.png" height="16" onmouseout="mStop();" onmouseover="rShowDetails('805');" src="/x/p.gif" style="background:url('/x/i.png') 0px -128px;" width="14"></a><img alt="/x/bookmark.png" height="16" id="quote__id_quote_805_i" src="/x/p.gif" style="background:url('/x/i.png') 0px 
-112px;display:none;" title="Lesezeichen" width="16"><div class="quote-quoteDetails" id="quote__id_quote_805_p" onmouseout="mStop();" onmouseover="mWait();"></div></div><br><br></div></td><td style="padding-left:20px;text-align:right;"><a href=

If anyone has a smart idea for extracting the text (the quotes) from this site, I would love to hear it.

 

Kindest regards

Dan

 


I forgot to add: I want to do this via HTTP.

 

Via the browser, this works fine:

 

navigate("http://zitate.net/zitate/top.html","Wait")
add list to list(%quotes,$scrape attribute(<class="quote-quote">,"innertext"),"Delete","Global")

 

But normally I should be able to translate that directly to XPath, right?

 

Dan


Here it is with both the HTML parser and the XPath parser:

 

set(#get,$plugin function("HTTP post.dll", "$http get", "http://zitate.net/zitate/top.html", "", "", "", ""),"Global")
set(#ahtml,$plugin function("HTTP post.dll", "$html parser", #get, "span", "class", "quote-quote", "InnerText"),"Global")
set(#axpath,$plugin function("HTTP post.dll", "$xpath parser", #get, "//span[@class=\'quote-quote\']", "InnerText", "HTML"),"Global")
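
For anyone who wants to check the XPath outside UBot, here is a minimal Python sketch of the same idea. It is not the plugin's implementation, only an illustration under assumptions: `SAMPLE` is a shortened, hand-cleaned (well-formed) excerpt of the page source quoted above, and `extract_quotes` is a hypothetical helper name. The standard-library `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed input, so the real page would need an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed excerpt modeled on the page source quoted above.
SAMPLE = (
    '<div>'
    '<span class="quote-quote">Wir <a href="/leben.html">leben</a> alle unter dem '
    'gleichen <a href="/himmel.html">Himmel</a>, aber wir haben nicht alle den '
    'gleichen <a href="/horizonte.html">Horizont</a>.</span>'
    '<br/>'
    '<a href="/konrad%20adenauer.html">Konrad Adenauer</a>'
    '</div>'
)


def extract_quotes(markup: str) -> list[str]:
    """Return the inner text of every <span class="quote-quote"> element."""
    root = ET.fromstring(markup)
    # Same expression as in the UBot call, written relative to the root element.
    spans = root.findall(".//span[@class='quote-quote']")
    # itertext() walks the span and its nested <a> links, yielding only text,
    # which corresponds to "InnerText" in the UBot call.
    return ["".join(span.itertext()) for span in spans]


quotes = extract_quotes(SAMPLE)
print(quotes)
```

The author link that follows the span is not part of the result, because the expression selects only the `quote-quote` spans.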


Thanks TC,

 

Sometimes I wonder why I even asked the question :-)

After I looked at the browser example, it was obvious that I could use the same HTML tag and class for the XPath parser.

 

When I first looked at the HTML code, it confused me a lot, and I thought it was far more complicated than it actually is :-)
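
To illustrate why the markup is less scary than it looks, here is a small Python sketch (an assumption-laden illustration, not what the UBot plugin does internally). The class name `QuoteExtractor` and the `SAMPLE` excerpt are made up for the example; the excerpt keeps the unclosed `<br>` tags and nested links of the real page. Only the text between the opening and closing `quote-quote` span is collected, so the surrounding attributes and links do not matter.

```python
from html.parser import HTMLParser


class QuoteExtractor(HTMLParser):
    """Collects the inner text of every <span class="quote-quote">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.quotes = []
        self._depth = 0   # greater than 0 while inside a quote span
        self._parts = []  # text fragments of the quote being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Track nested spans so the matching close tag is found.
            if tag == "span":
                self._depth += 1
        elif tag == "span" and dict(attrs).get("class") == "quote-quote":
            self._depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.quotes.append("".join(self._parts))

    def handle_data(self, data):
        # Text inside nested <a> links arrives here too; the tags are skipped.
        if self._depth:
            self._parts.append(data)


# Hypothetical excerpt modeled on the page source quoted earlier in the thread.
SAMPLE = (
    '<div><br><span class="quote-quote" onmouseout="rToggleLn(this, 0);">'
    '<a href="/sagen.html" title="20 Zitate">Sage</a> nicht alles, was du '
    '<a href="/wissen.html" title="131 Zitate">weißt</a>, aber wisse immer, '
    'was du sagst.</span><br><br>'
    '<a href="/matthias%20claudius.html">Matthias Claudius</a></div>'
)

parser = QuoteExtractor()
parser.feed(SAMPLE)
parser.close()
print(parser.quotes)
```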

 

Dan

