Brutal 164 · Posted June 9, 2015

I have a file of about 150,000 lines. UBot has no issue pulling the text file data into a list and removing duplicate line entries. However, I then have to run a loop to remove some text from every single remaining line, and the text to be removed is different on each line. Doing the actual process is easy enough; the problem is that it takes hours to go through each line and remove the target text. Does anyone know of a faster way to do search/replace/remove on files of this size?
HelloInsomnia 1103 · Posted June 9, 2015

Maybe regex; it depends on how different the text is, and where it is. Can you post an example?
Brutal 164 · Posted June 9, 2015 (Author)

Thanks for trying to help. OK, initially each list item looks like this:

1 | http://...
2 | http://...
etc.

Then I split the list so that it turns out like this:

http://...
http://...

So the end result is a nice, clean list of URLs. But because of the volume, it takes way too long to process.
HelloInsomnia 1103 · Posted June 9, 2015

Try this; replace FILENAME.txt with your file:

set(#biglist,$list from file("FILENAME.txt"),"Global")
clear list(%urls)
add list to list(%urls,$list from text($trim($replace regular expression(#biglist,"\\d+\\s\\|\\s",$nothing)),$new line),"Delete","Global")
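For readers outside UBot, the same idea can be sketched in Python: one pass with a precompiled regex that strips the leading "N | " prefix and deduplicates as it goes. This is a minimal sketch of the technique, not the forum poster's code; the sample URLs are placeholders.

```python
import re

# Matches the leading "N | " prefix the thread describes.
PREFIX = re.compile(r"^\d+\s\|\s")

def clean_urls(lines):
    """Strip the numeric prefix from each line and drop duplicates,
    preserving first-seen order."""
    seen = set()
    urls = []
    for line in lines:
        url = PREFIX.sub("", line).strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

sample = ["1 | http://a.com", "2 | http://b.com", "3 | http://a.com"]
print(clean_urls(sample))  # ['http://a.com', 'http://b.com']
```

A single regex pass like this handles 150k lines in well under a second, which is why replacing the per-line loop pays off so dramatically.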
Brutal 164 · Posted June 9, 2015 (Author)

Absolutely brilliant, man! Thanks so much for this. I had been pulling my hair out for hours.
Brutal 164 · Posted June 9, 2015 (Author)

No... you know what... dude, you are a freaking rock star! Thanks so much for this.
Code Docta (Nick C.) 638 · Posted June 12, 2015

Here is an alternative way of looking at it with find regex:

add list to list(%list, $list from text($find regular expression($read file("{$special folder("Desktop")}\\reg-test.txt"), "http.*"), $new line), "Delete", "Global")

I can't imagine it being fast with 150k lines, so I did it in the Large Data plugin too:

plugin command("Bigtable.dll", "large List from Regex", "a", , "http.*", "replace")
alert($plugin function("Bigtable.dll", "Large list return", "a"))
plugin command("Bigtable.dll", "large List from Regex", "b", $read file("{$special folder("Desktop")}\\reg-test.txt"), "http.*", "replace")
alert($plugin function("Bigtable.dll", "Large list return", "b"))
plugin command("Bigtable.dll", "Large list Remove duplicates", "b")

http://www.ubotstudio.com/forum/index.php?/topic/16308-free-plugin-large-data/

CD
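The find-regex approach above extracts the URLs rather than deleting the prefixes. A hedged Python sketch of the same technique (the input text is a made-up sample):

```python
import re

def extract_urls(text):
    """Find every "http.*" match (i.e. from "http" to the end of
    each line, since "." does not cross newlines), then deduplicate
    while keeping first-seen order."""
    matches = re.findall(r"http.*", text)
    return list(dict.fromkeys(m.strip() for m in matches))

text = "1 | http://a.com\n2 | http://b.com\n3 | http://a.com"
print(extract_urls(text))  # ['http://a.com', 'http://b.com']
```

Extracting what you want is often simpler than describing everything you want removed, which is the appeal of this variant.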
Brutal 164 · Posted June 18, 2015 (Author)

Thank you, CD. I appreciate it!
kev123 132 · Posted June 18, 2015

You should be able to get this into the sub-10-second range. In one of my apps, for each link I:

1. strip http, www., etc.
2. get the hostname
3. work out the TLD
4. work out if it's a subdomain (this involves creating sub-lists and checking containment against a TLD list/hashset)

Per 100,000 links it takes 2 seconds. Granted, it's in C#, but if you use the Large Data plugin the overhead of UBot shouldn't be very large, and what I'm doing involves a lot more. If you can't get it below ten seconds using the Large Data plugin, post your code here; maybe it's time for me to spend an afternoon optimizing the plugin, if possible.

Thanks,
kev123
Seth Turin 223 · Posted June 19, 2015

You guys are awesome. Great answers.
Brutal 164 · Posted June 20, 2015 (Author)

I seem to be all squared away now, Kev. Your biglist plugin is working perfectly for my needs. Thanks, man!