So, I thought it would be a good idea to take a copy of all the interesting threads on various forums, just in case they shut down. Doing it manually is a waste of time, so I went coding and made a crawler. After spending a couple of hours, I now have a crawler that reads lines from a txt file (a sample layout is sketched below the code) and downloads each thread's pages into its own folder.
The code is here: Forum crawler.py
or here:
import urllib2
import os

def isodd(num):
    return num % 2 == 1

#open the file that lists what to rip
file0 = open("Forum threads/threads.txt", 'r')
#read all the lines into a list
data0 = file0.readlines()
file0.close()

#line counter: even lines hold the output folder, odd lines hold the thread url
number = 0
for line in data0:
    #even line: remember the output folder (strip the trailing linebreak) and continue
    if not isodd(number):
        outputfolder = line[:-1]
        number = number + 1
        continue
    #odd line: get the thread url and remove the last two chars (linebreak + page number),
    #so the page number can be appended below
    if isodd(number):
        threadurl = line[:-2]
        number = number + 1
        print "starting to crawl thread " + threadurl
        #create the output folder if it does not exist yet
        if not os.path.isdir("Forum threads/" + outputfolder):
            os.makedirs("Forum threads/" + outputfolder)
        #placeholders used to detect when two pages in a row are identical
        lastdata2 = "ijdsfijkds"
        lastdata = "kjdsfsa"
        #loop over all the pages of the thread
        for page in range(999):
            #range starts at 0, so +1 is needed for the page number
            response = urllib2.urlopen(threadurl + str(page + 1))
            page_source = response.read()
            #shift the previous page into var2 and load the new page into var1
            lastdata2 = lastdata
            lastdata = page_source
            #if two pages in a row are identical, we have run past the last page
            if page > 0:
                if lastdata == lastdata2:
                    print "data identical, stopping loop"
                    break
            #alternative check: identical length also counts as the end
            if page > 0:
                if len(lastdata) == len(lastdata2):
                    print "length identical, stopping loop"
                    break
            #create a file for each page and save the data in it
            output = open("Forum threads/" + outputfolder + "/" + str(page + 1) + ".html", 'w')
            output.write(page_source)
            output.close()
            #progress
            print "wrote page " + str(page + 1) + " in " + outputfolder + "/"
            print "length of file is " + str(len(page_source))
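For reference, this is the layout the crawler expects in threads.txt: the output folder name on even lines, the thread URL on odd lines, written so that the page number is the last character before the linebreak (the script strips it and appends its own page counter). The folder names and URLs below are just made-up examples, not real forums:

My favourite thread
http://forum.example.com/viewtopic.php?t=1234&page=1
Another thread worth keeping
http://forum.example.com/viewtopic.php?t=5678&page=1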