UBot Underground


Hi guys, I need help scraping email addresses from various websites. Normally I scrape email addresses using this regex: "(\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3})", but it fails when I encounter an email address with spaces, like "contact @ gmail . com". What is the right regex to scrape email addresses, including ones with spaces?

 

Thanks


Here ya go...it's universal:

 

[a-zA-Z0-9\._\-]{3,}(@|AT|\s(at|AT)\s|\s*[\[\(\{]\s*(at|AT)\s*[\]\}\)]\s*)[a-zA-Z]{3,}(\.|DOT|\s(dot|DOT)\s|\s*[\[\(\{]\s*(dot|DOT)\s*[\]\}\)]\s*)[a-zA-Z]{2,}((\.|DOT|\s(dot|DOT)\s|\s*[\[\(\{]\s*(dot|DOT)\s*[\]\}\)]\s*)[a-zA-Z]{2,})?
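
For anyone who wants to try the pattern above outside UBot, here is a minimal Python sketch. The helper name `first_email` and the sample strings are made up for illustration; the pattern itself is copied unchanged, only split across lines. As far as I can tell it covers "at"/"dot" spellings and bracketed forms like "[at]", but it does not allow plain spaces around a literal @.

```python
import re

# Pattern copied verbatim from the post above, split across lines for readability.
OBFUSCATED_EMAIL = re.compile(
    r"[a-zA-Z0-9\._\-]{3,}"
    r"(@|AT|\s(at|AT)\s|\s*[\[\(\{]\s*(at|AT)\s*[\]\}\)]\s*)"
    r"[a-zA-Z]{3,}"
    r"(\.|DOT|\s(dot|DOT)\s|\s*[\[\(\{]\s*(dot|DOT)\s*[\]\}\)]\s*)"
    r"[a-zA-Z]{2,}"
    r"((\.|DOT|\s(dot|DOT)\s|\s*[\[\(\{]\s*(dot|DOT)\s*[\]\}\)]\s*)[a-zA-Z]{2,})?"
)

def first_email(text):
    """Return the first substring matched by the pattern, or None (hypothetical helper)."""
    match = OBFUSCATED_EMAIL.search(text)
    return match.group(0) if match else None

# Illustrative sample strings, not taken from the thread's test page.
print(first_email("write to support@gmail.com today"))
print(first_email("john [at] example [dot] com"))
print(first_email("jane at example dot co dot uk"))
print(first_email("contact @ gmail . com"))  # spaces around a literal @ are not covered
```

The last call returns None, which is why the spaced form from the original question still needs extra handling.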


OK, I modified your regex. This is NOT an optimal solution, but the regex works. (I say it's not optimal because I ultimately had to grab the match by position, but you can modify that.) The regex now matches addresses with or without the spaces. If you need me to explain exactly what I added to the regex, let me know.

 

 

email_bich.ubot

 

 

John


I'll explain anyhow for anyone else reading the thread...

 

I added this:

 

(\s|)

 

before AND after the @. What it says is:

 

Look for a space or nothing at all. Whenever one side of a pipe is empty, that group becomes optional, meaning the space does not need to be there. I hope that helps someone else as well!

 

John
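
A quick way to see the (\s|) trick in action is the Python sketch below. It is a reconstruction under assumptions: the attached .ubot file is not shown in the thread, so the pattern here is simply the original poster's regex with (\s|) placed before and after the @, as described. The helper `first_match` and the sample text are hypothetical. The second pattern uses `\s?`, which is the more common way to write "optional whitespace" and should behave the same here.

```python
import re

# Assumed reconstruction: the original poster's pattern with (\s|) on both sides of the @.
WITH_PIPE = re.compile(r"\w+(\s|)@(\s|)[a-zA-Z_]+?\.[a-zA-Z]{2,3}")

# Same idea written with the ? quantifier instead of an empty alternative.
WITH_QMARK = re.compile(r"\w+\s?@\s?[a-zA-Z_]+?\.[a-zA-Z]{2,3}")

def first_match(pattern, text):
    """Return the first full match of pattern in text, or None (hypothetical helper)."""
    match = pattern.search(text)
    return match.group(0) if match else None

for sample in ("mail contact @ gmail.com now", "mail contact@gmail.com now"):
    print(first_match(WITH_PIPE, sample), "|", first_match(WITH_QMARK, sample))
```

Note that this only makes the whitespace around the @ optional; spaces around the dot, as in "gmail . com", would need the same treatment there.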


Let me talk to the "Threadmaster" (Buddy, of course; he runs a tight ship!) :)

 

John

 

PS: I say this because he already started a thread with helpful tips, so we don't have to keep looking up the little things we often need.

  • 3 months later...

Hi John

 

I had a play with the regex. Kreatus' tutorial pointers in the following thread helped loads: http://ubotstudio.com/forum/index.php?/topic/6489-regex-101-and-beyond/

 

So I downloaded the regex cheatsheet as suggested by Frank, and also found the following site, which helped me test the code in real time (similar to the tool that Frank uses): http://regex.larsolavtorvik.com/

 

Playing around a little, I came up with the following code:

 

(\w+(\s|)@(\s|)[a-zA-Z_]+?\.[a-zA-Z_]+(\.|)[a-zA-Z]{1,3})

 

I added the following to your original code:

 

+(\.|)[a-zA-Z]{1,3})

 

This code should now also work with addresses ending in .co.uk, and with addresses on subdomains.

I think I have understood the code correctly, though I was a little unsure of the escape character \

I got the idea from the (\s|) you placed to look for spaces.

It seems to be working, but I need to test it properly.

 

thanks

 

abbs
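
For anyone who wants to check the pattern from this post quickly, here is a small Python sketch. The pattern is copied from the post; the helper `first_match` and the sample addresses are made up for illustration and are not from the thread's test page.

```python
import re

# Pattern copied from the post above.
EMAIL = re.compile(r"(\w+(\s|)@(\s|)[a-zA-Z_]+?\.[a-zA-Z_]+(\.|)[a-zA-Z]{1,3})")

def first_match(text):
    """Return the first full match in text, or None (hypothetical helper)."""
    match = EMAIL.search(text)
    return match.group(0) if match else None

# Illustrative samples only.
for sample in (
    "support@gmail.com",
    "sales@example.co.uk",
    "bob@mail.example.com",
    "contact @ gmail.com",
    "ann@mail.example.info",  # the final piece is capped at three letters
):
    print(sample, "->", first_match(sample))
```

One thing the last sample shows: because the final piece is limited to `{1,3}` letters, a three-part domain with a longer ending gets cut short ("ann@mail.example.inf"), so the upper bound may need raising depending on what you scrape.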


Hi, just wondering if anyone can help.

I've set up a quick test page here

I've set the regular expression to return a list; however, it only scrapes the support@gmail.com email address and no more.

Any idea why this is?

 

thanks


Excellent, it's working perfectly here too.

I think it was because I was using a set command to find the regular expression instead of add to list.

 

thanks a million
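
The difference between storing one result and collecting all of them can be sketched in Python as a rough analogy. This is not UBot code: `search` stands in for assigning a single value, and `finditer` stands in for adding every match to a list. The pattern is the one posted earlier in the thread; the sample text and function names are hypothetical.

```python
import re

# Pattern from earlier in the thread.
EMAIL = re.compile(r"(\w+(\s|)@(\s|)[a-zA-Z_]+?\.[a-zA-Z_]+(\.|)[a-zA-Z]{1,3})")

# Made-up page text with three addresses.
TEXT = "support@gmail.com, sales@example.co.uk, info @ site.org"

def first_only(text):
    """Analogous to storing a single value: only the first match survives."""
    match = EMAIL.search(text)
    return match.group(0) if match else None

def all_matches(text):
    """Analogous to adding results to a list: every match is collected."""
    # group(0) is used because the pattern has capture groups,
    # which would make re.findall return tuples instead of strings.
    return [m.group(0) for m in EMAIL.finditer(text)]

print(first_only(TEXT))
print(all_matches(TEXT))
```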
