I found some interesting data in a PDF. I want to get it into R for statistical analysis. However, the author has removed the decimals. So, when I try to paste the numbers, spreadsheet software removes the leading zeros which it considers redundant. Googling this issue reveals that a number of other people have a similar issue and that they haven’t found a good solution. For handling just a few cells, one can insert ‘ in front which solves the issue. This doesn’t work for pasting data.
Here’s what I did:
- Copy the table to a simple text editor, I use notepad++. Your editor must support regex.
- Use regex to insert the decimal separator (, or . depending on your preference/language).
- Copy the table to the spreadsheet.
In my case, I had a table that looks like this:
First, I OCR’d the table with ABBYY FineReader. Then I fixed any possible OCR errors (there was 1). Then I copied it to notepad++. Then I used search replace with the following regex parameters: search=”(\d\d\d)”, replace=”.\1″. In normal language this means: find any sequence of 3 digits in the text. Replace this by a dot followed by whatever you found. So this finds e.g. “011” and replaces with “.011”.