Ripping books from UMDL Text: Leta S. Hollingworth’s Gifted children, their nature and nurture

quod.lib.umich.edu/g/genpub/AGE2118.0001.001?view=toc

Because this book keeps coming up in conversations about extremely smart people, it seems worth reading. It is really old and should obviously be out of copyright by now (thanks, Disney!), but it possibly isn't, and in any case I couldn't find a useful PDF.

I did, however, find the above. Now, it seems to lack a download-all function, and it's too much of a hassle to download all 398 pictures manually. They also lack OCR. So I set out to write a Python script to download them.

First I had to find a function to actually download files. So far I had only read the page source of pages and worked with that; this time I needed to save a file (a picture) repeatedly.

Googling gave me: urllib.urlretrieve
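
For reference, a minimal sketch of how it is used; the URL here is just a placeholder, not the site's real image path:

import urllib

#placeholder URL, not the real image location
url = "http://example.com/picture.gif"
urllib.urlretrieve(url, "picture.gif")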

So far so good, right? It must be easy to just do a for loop now and get it over with.

Not entirely. It turns out that the pictures are only stored temporarily in the website's cache after one visits the page associated with the picture. So I had to make the script load the page before getting the picture. That slows it down a bit, but it's not too much trouble.
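
In sketch form, with made-up URL patterns (the real ones follow the site's own scheme):

import urllib

#made-up URL patterns; the real ones follow the site's scheme
pageurl = "http://example.com/book/page/" + str(1)
imgurl = "http://example.com/book/image/" + str(1) + ".gif"

#load the page first so the server puts the picture in its cache
urllib.urlopen(pageurl).read()

#now the picture can actually be fetched
urllib.urlretrieve(imgurl, str(1) + ".gif")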

Next problem: sometimes a picture wasn't downloaded correctly for some reason. The file size, however, was a useful proxy for detecting this, so I had to find a way to get the file size. Google gave me os.stat (import os). Problem solved.
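
Roughly like this; the size threshold here is a guess, since a truncated download just comes out suspiciously small:

import os
import urllib

#made-up URL and filename, as in the sketch above
imgurl = "http://example.com/book/image/1.gif"
filename = "1.gif"

#retry a few times until the file size looks sane
for attempt in range(5):
    urllib.urlretrieve(imgurl, filename)
    if os.stat(filename).st_size > 5000:
        break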

Even after doing that, some pictures were still not being downloaded correctly. Weird. After debugging, it turned out that some of the pictures were not .gif but .jpg files, located in a slightly different place. So I had to modify the code to handle that as well.
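
Something along these lines, again with made-up URLs for the two locations:

import os
import urllib

#made-up URLs for the two locations
gifurl = "http://example.com/book/image/1.gif"
jpgurl = "http://example.com/book/altimage/1.jpg"

#try the .gif location first; if the result is too small,
#this page is probably one of the .jpg ones instead
urllib.urlretrieve(gifurl, "1.gif")
if os.stat("1.gif").st_size < 5000:
    urllib.urlretrieve(jpgurl, "1.jpg")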

Finally it worked for all the pictures.

I OCR'd it with ABBYY FineReader (the best OCR software on the market, AFAIK).

Enjoy:

GIFTED CHILDREN THEIR NATURE AND NURTURE – Leta S. Hollingworth

The Python code is here: downloader.py

Ripping threads from able2know forums

So, I thought it would be a good idea to keep a copy of all the interesting threads on various forums, just in case they shut down. Doing it manually is a waste of time, so I went coding and made a crawler. After a couple of hours, I now have a crawler that reads lines from a txt file and downloads pages into folders accordingly.

The code is here: Forum crawler.py

or here:

import urllib2
import re
import os

def isodd(num):
    return num % 2 == 1

#open data about what to rip
file0 = open("Forum threads/threads.txt",'r')

#assign data to var
data0 = file0.readlines()
file0.close()

#hard var
number = 0

#skip odd numbers
for line in data0:

    #if it is even, then set first var, and continue
    if not isodd(number):
        outputfolder = line
        outputfolder = outputfolder[:-1]
        number = number+1
        continue

    #get thread url and remove the last two chars (linebreak plus one trailing character)
    if isodd(number):
        threadurl = line
        threadurl = threadurl[:-2]
        number = number+1
        print "starting to crawl thread "+threadurl

    #create folder
    if not os.path.isdir("Forum threads/"+outputfolder):
        os.makedirs("Forum threads/"+outputfolder)

    #dummy starting values for the duplicate-page check below
    lastdata2 = "ijdsfijkds"
    lastdata = "kjdsfsa"

    #looping over all the pages
    for page in range(999):
        #range starts at 0, so +1 is needed
        response = urllib2.urlopen(threadurl+str(page+1))

        #assign the data to a var
        page_source = response.read()

        #used for detection of identical output
        #replace data in var2
        lastdata2 = lastdata

        #load new data into var1
        lastdata = page_source

        #check if they are identical
        if page>0:
            if lastdata == lastdata2:
                print "data identical, stopping loop"
                break

        #alternative check: compare lengths
        #(note: this could stop too early if two different pages happen to have the same length)
        if page>0:
            if len(lastdata) == len(lastdata2):
                print "length identical, stopping loop"
                break

        #create a file for each page, and save the data in it
        output = open("Forum threads/"+outputfolder+"/"+str(page+1)+".html",'w')
        output.write(page_source)
        output.close()

        #progress
        print "wrote page "+str(page+1)+" in "+outputfolder+"/"
        print "length of file is "+str(len(page_source))


Review of a Python book and some other thoughts

The book was mentioned by TechDirt in their reporting on an absurd copyright case (so, pretty normal).

The author makes the strange choice of using indentation to mark the borders between paragraphs, even though indentation is semantically important in Python. He could just have used newlines for that.

The code is not easily copyable. If one tries, one gets spaces between every other character, like this: >>>cho i c e = ’ham’. This seems to be due to the font used.

Sometimes the examples are not explained clearly enough. For instance, elif is explained as an “optional condition”, which is not all that clear. Fortunately, this is not much of a problem if one has an IDE ready to test things in. For the record, an elif is an alternative condition that is tested only if the preceding ones weren't true. Ex.:

a = 1
b = 2
c = 3
if a == 1:
    print "a holds"
elif b == 2:
    print "b holds and a doesn't"
elif c == c:
    print "neither a nor b holds, but c does"

>>a holds

and:

a = 0
b = 2
c = 3
if a == 1:
    print "a holds"
elif b == 2:
    print "b holds and a doesn't"
elif c == c:
    print "neither a nor b holds, but c does"

>>b holds and a doesn't

lastly:

a = 0
b = 4
c = 3
if a == 1:
    print "a holds"
elif b == 2:
    print "b holds and a doesn't"
elif c == c:
    print "neither a nor b holds, but c does"

>>neither a nor b holds, but c does

Note how the order of the elifs matters. An elif only activates when all the previous if and elif conditions have failed.

Python apparently does not understand how to add numbers and strings. So things like:

a = "string"
b = 1
print a + b

gives an error. It seems to me that Python should just autoconvert numbers to strings (using the str function), just as it converts integers to floats when adding those two types together. Perhaps this has changed in Python 3 and later; I'm running 2.7.
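
A minimal illustration (in 2.7) of the error and the explicit workaround:

a = "string"
b = 1

#print a + b   would raise: TypeError: cannot concatenate 'str' and 'int' objects
print a + str(b)   #prints: string1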