Clear Language, Clear Mind

September 20, 2017

The end of anonymity in the crowd is near

Filed under: Computer science,Science — Tags: , , , — Emil O. W. Kirkegaard @ 22:44

Not an endorsement of this technology or the use of it, just stating that it will happen.

Given sufficient measurement precision, all humans have unique genomes and fingerprints, but also unique faces and voices. The first two are well known and somewhat difficult to measure. The last two, however, are very easy to measure, even at a distance. In the next few years, massive datasets will be built from public, semi-public and leaked private data, linking people across all services with available data, for all available time periods. This first and foremost includes social media like Facebook, Instagram and LinkedIn, but also YouTube and every dating and porn site. There are a host of voice-only services like SoundCloud that currently host anonymous users, which is also true of YouTube and porn sites. All these people will be automatically identified and linked in the near future. It will not be possible to take part in a public demonstration without a mask (illegal in many places) in a way that cannot later be matched to you. It will not be possible to take part in amateur or paid porn without a mask and maybe without being silent (even moans can be matched, in all likelihood).

Because of the above, services will open whose business is based on this. These may be completely legal or illegal, depending on the local jurisdiction. However, they will inevitably be created. There will be sites like ismylovecheatingonme.com and findthepast.com, which have obvious and enormous markets. The less moral services (depending on your view) will send you automatic blackmail, saying that if you don’t pay a monthly fee, your family, spouse and friends will be informed of your misdeeds, whatever they are. Don’t think they can’t find your relations: you put them on Facebook and LinkedIn, and in any case, family can be inferred from name and facial similarity anyway.

What shall we do about this technology? One can distort the voice to make it (probably) unmatchable, but one probably cannot do anything about the face except for masks, which are inconvenient.

June 30, 2017

Getting Logitech G502 to work on Linux Mint 18.1

Filed under: Computer science — Tags: — Emil O. W. Kirkegaard @ 04:18

So, Logitech’s software only works on Windows. Unfortunately, the sensitivity is extremely high by default, making the mouse much less useful. The extra buttons also have no function, which is annoying. So, I wanted to fix this. The mouse actually saves the settings on itself, so they are plug-and-play: the G502 I had already configured on Windows before installing Linux works fine. But I bought a second one for use with my laptop (too much work to move the one mouse between computers).

So, naturally, I first sought a clean Linux fix. Wasted a number of hours trying to config it directly in Linux:

  • https://forums.linuxmint.com/viewtopic.php?t=223957
  • https://forums.linuxmint.com/viewtopic.php?t=173540
  • https://patrickmn.com/aside/lowering-gaming-mouse-sensitivity-in-ubuntu-9-10/
  • https://www.reddit.com/r/G502MasterRace/comments/5ie33x/how_does_the_g502_work_software_wise_on_linux/
  • http://www.linuxquestions.org/questions/linux-hardware-18/%5Bhowto%5D-set-up-gaming-mouse-logitech-g502-4175553446/
  • https://linustechtips.com/main/topic/256134-g502-in-linux/
  • and many others

Finally had enough. Installed VirtualBox and Windows 10, then followed this guide to get the host system to forward the USB device to the guest.

Now, on to the next pesky Linux issue…

May 3, 2017

The FOSS fragmentation problem

Filed under: Computer science — Tags: , , , , , — Emil O. W. Kirkegaard @ 07:03

I could not find anyone who had briefly described this problem, so I’ll do a very quick job of it myself. I take it that this is a common observation and that I was just unable to find the right term for it. And yes, I appreciate the metaness of this. :)

The FOSS fragmentation problem: everybody writes their own solutions (standards), but because it takes a lot of work to get any major project done right, and humans consistently underestimate how long it takes to do something right (the planning fallacy), there’s a plethora of half-done solutions. It’s a kind of coordination problem: if people would ‘just’ work together, we would have fewer but better solutions. There are two reasons why they generally don’t. First, people almost always work on FOSS projects in their spare time, so there’s no boss who can order them to focus. Second, FOSS attracts independently minded people who are prone to lone wolfery. Larger projects with such people tend to split.

Never claim general phenomena without listing many examples. Okay.

Example: Linux distributions

There are at least 290 Linux distributions on DistroWatch. The familial relationship between distros has been visually summarized like this (via Wikipedia):

Think this is complicated? You can find similar figures for single families too.

Example: Wiki systems

Given that the above example image comes from Wikipedia, we might wonder… how many wiki systems are there? Well, WikiMatrix is a searchable list and has 141. It’s far from comprehensive. Searching just a little bit more, I was able to find an out-of-date list of 20 Python-based systems. An exhaustive search of GitHub (and BitBucket, and SourceForge, and GitLab, and …) would turn up many more.

(I could not find any family chart. Use your imagination.)

Example: forum software

There is a sister site to the above, ForumMatrix, which lists 67 forums. And there’s a list of 19 Django-based forums, and there are even other Python-based solutions too (e.g. FlashBB).

More examples?

You get the idea, but:

The problem extends to other domains

The problem is most commonly seen with FOSS projects, but occurs in any domain where it’s easy to set up your own version by copying from others or recreating from scratch. The primary other example I’m thinking of is, not coincidentally, licenses for FOSS projects. Wikipedia lists 54 licenses.

Rejoinder: but some of these areas are actually dominated by a few players of good quality

They are, e.g. browsers have just a few top players (but due to mobile units, this seems to be fragmenting further). However, the point remains in weaker form: if the people who work or worked on all the competitors that approximately no one uses had instead focused on, say, 5 or 10 projects, those projects would have been a lot better.

November 1, 2016

Syncthing: some notes and recommendations

Filed under: Computer science — Tags: , — Emil O. W. Kirkegaard @ 04:51

Syncthing is an open source synchronization application. It’s a replacement for the closed source BitTorrent Sync (BTS). If you are currently using BTS, I recommend switching because 1) it’s closed source, 2) it’s owned by an American company, and thus 3) it’s likely to implement NSA-style backdoors. This is made more likely by the fact that newer versions default to using a website to share the private key needed to set up a shared folder. Very suspicious. It makes it very easy to snoop the private key. If you must use it, use the old 1.4.110 version, which does not have this suspicious ‘feature’.

However, really, use Syncthing. It’s free, unlike Dropbox etc. The trade-off is that there is no central server. You only have your own network to synchronize between. I recommend setting up a network of devices with your trustworthy friends. Simply synchronize important folders to each other, and either pick people who won’t snoop on your data or encrypt it to be safe (your friends may be trustworthy, but perhaps their wife/husband/kids aren’t).

I have used the software for more than a year and here are some experiences and comments:

  • Syncthing works well on mobile devices (only tested Android). I use it to automatically synchronize photos from my camera and to synchronize a subset of my elibrary to the mobile devices.
  • Sometimes there are version conflicts. It’s usually easy to figure out which file is newer. One can open both of them to verify the choice.
  • Most of the version conflicts involve temporary files or cache files. It is a bit annoying having to keep telling it which to use.
  • If you’re on Windows, use the SyncTrayzor GUI. It’s a small stand-alone browser that automatically runs, restarts, updates etc. Syncthing for you.
  • One can use versioning for files that frequently get edited, so as to avoid data loss. There are multiple types of versioning available. It’s not as good as git, but it’s reasonably good and suitable for larger files.
  • There do not seem to be any practical limits on the size of shared folders. I have multiple with >400 GB and they run fine.
  • There do not seem to be any practical limits on the number of interconnected devices. I have clusters with 2-10 devices without any problems.
  • Connections can be poor when both devices are on the same LAN. It seems mostly related to the initial ability to connect. If in a hurry, make a wifi hotspot for one device to use to get around the problem.
  • Sometimes files will fail to synchronize. This is almost always because of illegal names. Windows and Unix (Linux/Mac) have different rules for which characters can be in names and how long names can be. So Windows devices will fail to synchronize names like “Henry Ford: My Favorite Car.pdf” because of the colon. One can click the “failed items” in the GUI to see which items are failing and get an idea about how to fix it.
  • In one case, I experienced the index getting corrupted. It can be deleted manually and then it rebuilds itself on the next run. It can take a while, but I had no data loss. They should add a “Rebuild index” button to the GUI.
  • Syncthing does not handle very large files that are in use well. In my case I sync a virtual computer image which is a 40 GB file. If this file is open (=virtual computer running), it results in high CPU use because it keeps scanning the file while it gets changed. Note: pausing only pauses the syncing, not the indexing. One has to close it entirely. Still have not found a good solution to this problem. Can be a little alleviated by increasing the scanning intervals.
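The illegal-names issue above can be checked before syncing. Here is a minimal sketch in R, assuming the commonly documented Windows restrictions (the reserved characters < > : " / \ | ? * and names ending in a dot or a space). The helper name find_windows_unsafe is hypothetical, and this is not Syncthing’s own check.

```r
# Hypothetical helper: return the file names that Windows would likely reject.
# Assumption: reserved characters are < > : " / \ | ? * and names must not
# end in a dot or a space. This is not taken from Syncthing itself.
find_windows_unsafe = function(file_names) {
  has_reserved_char = grepl('[<>:"/\\\\|?*]', file_names)
  ends_badly = grepl("[. ]$", file_names)
  file_names[has_reserved_char | ends_badly]
}

files = c("Henry Ford: My Favorite Car.pdf", "notes.txt", "report.")
find_windows_unsafe(files)
# flags the name with the colon and the name ending in a dot
```

To scan a real folder, one could pass in basename(list.files("some/folder", recursive = TRUE)), where the folder path is a placeholder.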

August 10, 2016

Installing R packages on Mint 17.1: cluster, KernSmooth, Matrix, nlme, mgcv

Filed under: Computer science — Tags: , , — Emil O. W. Kirkegaard @ 20:24

Installing R packages on Windows is easy: you run the install code and it always works. Not so on Linux! Here one sometimes has to install them through apt-get (or whatever package manager) or install some missing system-level dependencies. Finding out what to do can take a lot of time googling and trial and error. So, in the spirit of contributing to the internet’s knowledge about how to get stuff to work, here are the warnings I got:

1: In install.packages(update[instlib == l, "Package"], l, contriburl = contriburl,  :
  installation of package ‘cluster’ had non-zero exit status
2: In install.packages(update[instlib == l, "Package"], l, contriburl = contriburl,  :
  installation of package ‘KernSmooth’ had non-zero exit status
3: In install.packages(update[instlib == l, "Package"], l, contriburl = contriburl,  :
  installation of package ‘Matrix’ had non-zero exit status
4: In install.packages(update[instlib == l, "Package"], l, contriburl = contriburl,  :
  installation of package ‘nlme’ had non-zero exit status
5: In install.packages(update[instlib == l, "Package"], l, contriburl = contriburl,  :
  installation of package ‘mgcv’ had non-zero exit status

These all installed for me after installing:

sudo apt-get install r-base-dev
sudo apt-get install gfortran

August 4, 2016

Deleting files with empty extensions on Windows 8.1

Filed under: Computer science — Tags: , , , , — Emil O. W. Kirkegaard @ 22:15

I’m posting this here so I can find it more easily because this is at least the third time I had to google the issue to retrace my own steps on StackOverflow to find the solution I found last time! I even found some useful comments I wanted to upvote, only to realize they were my own from months ago! Argh!

So here’s the situation:

  • You somehow managed to create a file with an empty extension. One cannot normally do this, but one can do it thru e.g. the browser if one types a dot at the end of the filename. This creates two files, one without an extension and one with an empty extension:

pesky file 0

pesky file 1

  • If one tries to delete them the normal way, one gets “file not found” error for the one with a dot at the end:

pesky file 2

  • And even using del thru an elevated cmd does not work.

pesky file 3

  • Being sneaky with del *. does not work either.

pesky file 4

  • One has to use a special syntax that can deal with ill-formatted file names:

pesky file 5

AND DEAD!

August 30, 2015

Converting a data.frame to a numerical matrix in R, not so easy!

Filed under: Computer science — Tags: , , , , — Emil O. W. Kirkegaard @ 06:01

Edit 2019

You don’t really need the below. Look at the `model.matrix()` function, which converts data frames into matrices for glmnet and similar functions.
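A minimal sketch of that suggestion, using a small made-up data frame with one integer and one factor column (my choice, to keep the example short). Note that model.matrix() expands a factor into dummy columns rather than integer codes. Dropping the intercept column is my assumption about typical use, since glmnet fits its own intercept by default.

```r
# Example data: one integer column and one factor column
DF = data.frame(a = 1:3, b = letters[10:12], stringsAsFactors = TRUE)

# The formula ~ . uses every column; the factor b becomes dummy columns bk and bl
x = model.matrix(~ ., data = DF)

# Drop the "(Intercept)" column
x = x[, -1, drop = FALSE]

colnames(x)   # "a" "bk" "bl"
is.numeric(x) # TRUE
```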

Original post

Sometimes you need to use a function that wants a numeric matrix as input. One such function is cv.glmnet(), which performs lasso regression with cross-validation, which is very cool. Unfortunately, it is picky about how it wants the input data. Here are some lines of my code:

fit_cv = cv.glmnet(x = as.matrix(temp_df[predictors]), #predictor vars matrix
                   y = as.matrix(temp_df[dependent]), #dep var matrix
                   weights = weights_, #weights
                   alpha = alpha_) #type of shrinkage

We see that x must be a matrix of the predictors, y must be a matrix with the dependent (usually just one), the weights and alpha are optional, but since I am working with aggregate data I am almost always using weights. Alpha controls the kind of shrinkage used.

All well and good, until it isn’t. In my case, the predictor data.frame contains some factor variables. R actually uses numeric values as its internal representation of these, but displays them with strings. For instance:

DF = data.frame(a = 1:3, b = letters[10:12],
                c = seq(as.Date("2004-01-01"), by = "week", len = 3),
                stringsAsFactors = TRUE)

Which prints out like this:

> DF
  a b          c
1 1 j 2004-01-01
2 2 k 2004-01-08
3 3 l 2004-01-15

However, suppose we use my as.matrix solution above, then we get:

> as.matrix(DF)
     a   b   c           
[1,] "1" "j" "2004-01-01"
[2,] "2" "k" "2004-01-08"
[3,] "3" "l" "2004-01-15"

Which is not what we wanted. It gave us a character matrix, which cv.glmnet() will then throw a nonsensical error about. Their bad error message made me spend some time finding the actual problem. Save yourself and others time. Always write good error messages for functions that will be used more than a couple of times!

Is there some easy built in way to solve the problem?

> as.numeric(DF)
Error: (list) object cannot be coerced to type 'double'

The easiest solution did not work.

> as.numeric(DF$b)
[1] 1 2 3

However, it does work for a single column. So maybe we can just try using it on all the columns:

> apply(DF, 2, as.numeric)
     a  b  c
[1,] 1 NA NA
[2,] 2 NA NA
[3,] 3 NA NA
Warning messages:
1: In apply(DF, 2, as.numeric) : NAs introduced by coercion
2: In apply(DF, 2, as.numeric) : NAs introduced by coercion

What? What is going on?

> apply(as.matrix(DF), 2, as.numeric)
     a  b  c
[1,] 1 NA NA
[2,] 2 NA NA
[3,] 3 NA NA
Warning messages:
1: In apply(as.matrix(DF), 2, as.numeric) : NAs introduced by coercion
2: In apply(as.matrix(DF), 2, as.numeric) : NAs introduced by coercion

It looks like apply() does a silent as.matrix(), which then causes the NAs. OK. How do we convert just the factor columns then? Maybe try some of the fancier built-in conversion calls:

> as.matrix.data.frame(DF)
     a   b   c           
[1,] "1" "j" "2004-01-01"
[2,] "2" "k" "2004-01-08"
[3,] "3" "l" "2004-01-15"

Nope.

> as.data.frame.matrix(DF)
  a b     c
1 1 j 12418
2 2 k 12425
3 3 l 12432

Closer, this time the date got converted, but the factor got converted to character, not integers. We could do a loop:

> for (col_idx in seq_along(DF)) {
+   DF[col_idx] = as.numeric(DF[[col_idx]])
+ }
> DF = as.matrix(DF)
> str(DF)
 num [1:3, 1:3] 1 2 3 1 2 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:3] "a" "b" "c"

Which works, but now it is getting silly. Maybe some implicit loops:

> lapply(DF, as.numeric)
$a
[1] 1 2 3
$b
[1] 1 2 3
$c
[1] 12418 12425 12432

Closer, but it returns a list, not a matrix. Maybe just try converting:

> as.matrix(lapply(DF, as.numeric))
  [,1]     
a Numeric,3
b Numeric,3
c Numeric,3

But no no, life isn’t that easy. What about as.data.frame?

> as.data.frame(lapply(DF, as.numeric))
  a b     c
1 1 1 12418
2 2 2 12425
3 3 3 12432

Huh, that works, but as.matrix didn’t. Oh well, just one final step:

> DF = as.matrix(as.data.frame(lapply(DF, as.numeric)))
> str(DF)
 num [1:3, 1:3] 1 2 3 1 2 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:3] "a" "b" "c"

We got what we wanted!

Sometimes, R does not make your life easy.
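For comparison, base R also has data.matrix(), which is documented to replace factors with their internal codes and convert the remaining columns to numeric. Here is a short sketch on the same example data. I have not checked how it behaves on every column type, so treat it as a starting point.

```r
# Same example data as above
DF = data.frame(a = 1:3, b = letters[10:12],
                c = seq(as.Date("2004-01-01"), by = "week", length.out = 3),
                stringsAsFactors = TRUE)

# One call: factor -> integer codes, Date -> days since 1970-01-01
num_mat = data.matrix(DF)

is.numeric(num_mat) # TRUE
num_mat[, "b"]      # 1 2 3
num_mat[, "c"]      # 12418 12425 12432
```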

July 25, 2015

Approximate string matching in R

Filed under: Computer science,Linguistics/language — Tags: , , , , , — Emil O. W. Kirkegaard @ 07:21

In trying to merge some data, I was confronted with a problem of matching up strings where the author had mutilated them. He had done so in two ways: cutting them off at the 8th character or using personal abbreviations. The first one is relatively easy to deal with. The second one is not.

So I looked around a bit and found that there are others who had similar problems:

  • http://stats.stackexchange.com/questions/3425/how-to-quasi-match-two-vectors-of-strings-in-r
  • http://stackoverflow.com/questions/2231993/merging-two-data-frames-using-fuzzy-approximate-string-matching-in-r

The large table below shows the matching results. The leftmost column has the strings I need to match. The list to be matched against is the list of country and regional names and their ISO 3 abbreviations here: https://osf.io/59dr7/

It looks like the shortened version worked fine in most cases: 165 of 193 matches were found, all of which were correct.

agrep (with max.distance = .1, the default) found a match in 175 cases, so only a little improvement there. But it gets worse: in many cases, it disagrees with the stricter matching method and gets it wrong. There is no case where it is correct over the simpler method. Strangely, in all 10 cases where the simple method failed, agrep got it right. But it got it wrong in the easier cases. In some cases, it is truly bizarre: given the string “United S”, it goes for “United Republic of Tanzania” instead of the much easier “United States”. Strangely, a common error is preferring a longer version over an exact match. No human would make this error. E.g. given “Moldova”, it prefers “Moldova, Republic of” over just “Moldova”.

There are a number of different errors it makes. In the comments below I have noted the type of error (my judgment).

For the moment, I would caution against the use of this algorithm.

Country Genetic_distance_SA to_short_result agrep_result best_match agreement filled_in comments
Norway 1455.52 Norway Norway Norway TRUE Norway
Netherla 1453.28 Netherlands Netherlands Netherlands TRUE Netherlands
Ireland 1940.31 Ireland Iceland Ireland FALSE Ireland prefers substitution over exact
Liechste 1511.83 Liechtenstein Liechtenstein FALSE Liechtenstein correct
Germany 1484.92 Germany Germany Germany TRUE Germany
Sweeden 1453.79 Sweden Sweden FALSE Sweden correct
Switzerl 1557.96 Switzerland Switzerland Switzerland TRUE Switzerland
Iceland 1932.36 Iceland Iceland Iceland TRUE Iceland
Denmark 1472.52 Denmark Denmark Denmark TRUE Denmark
Belgium 1940.31 Belgium Belgium Belgium TRUE Belgium
Austria 1465.7 Austria Australia Austria FALSE Austria prefers part deleted
France 1896.22 France France France TRUE France
Slovenia 1292.32 Slovenia Slovenia Slovenia TRUE Slovenia
Finland 2420.3 Finland Finland Finland TRUE Finland
Spain 1929.44 Spain Saint Barthélemy Spain FALSE Spain no idea
Italy 1961.9 Italy Italy Italy TRUE Italy
Luxembur 1929.98 TRUE Luxembourg
Czech Re 1524.73 Czech Republic Czech Republic Czech Republic TRUE Czech Republic
U. K. 1916.91 TRUE UK
Greece 1283.05 Greece Greece Greece TRUE Greece
Cyprus 1288.53 Cyprus Cyprus Cyprus TRUE Cyprus
Estonia 2302.86 Estonia Estonia Estonia TRUE Estonia
Slovakia 1573.38 Slovakia Slovakia Slovakia TRUE Slovakia
Malta 1912.52 Malta Gibraltar Malta FALSE Malta prefers subset + substitution over exact
Poland 1905.67 Poland Poland Poland TRUE Poland
Lithuani 2389.28 Lithuania Lithuania Lithuania TRUE Lithuania
Portugal 1949.34 Portugal Portugal Portugal TRUE Portugal
Latvia 2256.69 Latvia Latvia Latvia TRUE Latvia
Croatia 1289.64 Croatia Croatia Croatia TRUE Croatia
Romania 1928.4 Romania Romania Romania TRUE Romania
Bulgaria 1399.01 Bulgaria Bulgaria Bulgaria TRUE Bulgaria
Serbia 1421.01 Serbia Serbia Serbia TRUE Serbia
Russia 1975.49 Russia Russia Russia TRUE Russia
Albania 1301.47 Albania Albania Albania TRUE Albania
Macedoni 1334.51 Macedonia Macedonia Macedonia TRUE Macedonia
Armenia 1558.32 Armenia Armenia Armenia TRUE Armenia
Moldova 1527.95 Moldova Moldova, Republic of Moldova FALSE Moldova prefers longer
Botswana 347.18 Botswana Botswana Botswana TRUE Botswana
South Af 0 South Africa South Africa South Africa TRUE South Africa
Ghana 395.9 Ghana Ghana Ghana TRUE Ghana
Eq Guine 373.2 TRUE Equatorial Guinea
Congo 452.9 Congo Congo Congo TRUE Congo
Kenya 366.78 Kenya Kenya Kenya TRUE Kenya
Cameroon 319.62 Cameroon Cameroon Cameroon TRUE Cameroon
Tanzania 352.54 Tanzania Tanzania Tanzania TRUE Tanzania
Nigeria 342.24 Nigeria Nigeria Nigeria TRUE Nigeria
Uganda 358.75 Uganda Uganda Uganda TRUE Uganda
Zambia 352.54 Zambia Gambia Zambia FALSE Zambia prefers substitution over exact
Sudan 316.95 Sudan Sudan Sudan TRUE Sudan
Zimbabwe 352.54 Zimbabwe Zimbabwe Zimbabwe TRUE Zimbabwe
Ethiopia 705.3 Ethiopia Ethiopia Ethiopia TRUE Ethiopia
Guinea 395.9 Guinea Guinea Guinea TRUE Guinea
CentAfrR 469.7 TRUE Central African Republic
SierraLe 395.9 TRUE Sierra Leone
Mozambiq 355.75 Mozambique Mozambique Mozambique TRUE Mozambique
CongoDR 410.17 Congo Republic of Congo Republic of FALSE Congo Republic of wrong but excuseable; Congo Democratic Republic
Andorra 1912.83 Andorra Andorra Andorra TRUE Andorra
Angola 353.49 Angola Angola Angola TRUE Angola
Belarus 1949.85 Belarus Belarus Belarus TRUE Belarus
Benin 394.54 Benin Benin Benin TRUE Benin
Bosnia 1337.37 Bosnia Bosnia and Herzegovina Bosnia FALSE Bosnia prefers longer
BurkinaF 378.36 Burkina Faso Burkina Faso FALSE Burkina Faso correct
Burundi 362.98 Burundi Burundi Burundi TRUE Burundi
Cape Ver 963.09 Cape Verde Cape Verde Cape Verde TRUE Cape Verde
Chad 537.81 Chad Chad Chad TRUE Chad
Comoros 352.54 Comoros Comoros Comoros TRUE Comoros
IvoryCoa 468.01 TRUE Ivory Coast
Djibouti 750.88 Djibouti Djibouti Djibouti TRUE Djibouti
Eritrea 665.96 Eritrea Eritrea Eritrea TRUE Eritrea
Gabon 360.07 Gabon Gabon Gabon TRUE Gabon
Gambia 395.9 Gambia Gambia Gambia TRUE Gambia
Georgia 1613.7 Georgia Georgia Georgia TRUE Georgia
Guinea-B 395.9 Guinea-Bissau Guinea-Bissau Guinea-Bissau TRUE Guinea-Bissau
Lesotho 352.54 Lesotho Lesotho Lesotho TRUE Lesotho
Liberia 395.9 Liberia Liberia Liberia TRUE Liberia
Malawi 352.54 Malawi Malawi Malawi TRUE Malawi
Mali 430.58 Mali Australia Mali FALSE Mali prefers subset + substitution over exact
Mauritan 681.23 Mauritania Mauritania Mauritania TRUE Mauritania
Namibia 419.32 Namibia Namibia Namibia TRUE Namibia
Niger 315.05 Niger Niger Niger TRUE Niger
Rwanda 364.26 Rwanda Rwanda Rwanda TRUE Rwanda
SaoTomeP 339.89 TRUE Sao Tome and Principe
Senegal 395.9 Senegal Senegal Senegal TRUE Senegal
Seychell 1709.06 Seychelles Seychelles Seychelles TRUE Seychelles
Somalia 500.74 Somalia Somalia Somalia TRUE Somalia
Swazilan 400.17 Swaziland Swaziland Swaziland TRUE Swaziland
Togo 395.9 Togo Togo Togo TRUE Togo
Ukraine 1947.94 Ukraine Ukraine Ukraine TRUE Ukraine
Australi 1971.39 Australia Australia Australia TRUE Australia
United S 1792.33 United States United Republic of Tanzania United States FALSE United States bizarre
New Zeal 2061.12 New Zealand New Zealand New Zealand TRUE New Zealand
Canada 1958.73 Canada Canada Canada TRUE Canada
Japan 2176.04 Japan Japan Japan TRUE Japan
Hong Kon 2674.63 Hong Kong Hong Kong Hong Kong TRUE Hong Kong
Korea 2399.11 Korea Korea Democratic People’s Republic of Korea FALSE Korea prefers subset + substitution over exact
Israel 1539.63 Israel Israel Israel TRUE Israel
Singapor 2459.04 Singapore Singapore Singapore TRUE Singapore
Qatar 1733.83 Qatar Qatar Qatar TRUE Qatar
Hungary 2432.96 Hungary Hungary Hungary TRUE Hungary
Bahrain 971.99 Bahrain Bahrain Bahrain TRUE Bahrain
Chile 2279.52 Chile Chile Chile TRUE Chile
Argentin 1994.59 Argentina Argentina Argentina TRUE Argentina
Barbados 468.02 Barbados Barbados Barbados TRUE Barbados
Uruguay 1918.61 Uruguay Uruguay Uruguay TRUE Uruguay
Cuba 1370.91 Cuba Aruba Cuba FALSE Cuba prefers subset + substitution over exact
Saudi Ar 1468.3 Saudi Arabia Saudi Arabia Saudi Arabia TRUE Saudi Arabia
Mexico 2024.64 Mexico Mexico Mexico TRUE Mexico
Malaysia 1922.77 Malaysia Malaysia Malaysia TRUE Malaysia
Trinidad 1024.1 Trinidad and Tobago Trinidad and Tobago Trinidad and Tobago TRUE Trinidad and Tobago
Kuwait 1081.15 Kuwait Kuwait Kuwait TRUE Kuwait
Lebanon 1543.46 Lebanon Lebanon Lebanon TRUE Lebanon
Venezuel 1280.81 Venezuela, Bolivarian Republic of Venezuela, Bolivarian Republic of Venezuela, Bolivarian Republic of TRUE Venezuela, Bolivarian Republic of
Mauritiu 1792.48 Mauritius Mauritius Mauritius TRUE Mauritius
Jamaica 595.5 Jamaica Jamaica Jamaica TRUE Jamaica
Peru 2096.08 Peru Hviderusland Peru FALSE Peru prefers subset + substitution over exact
Dominica 521.08 Dominica Dominica Dominica TRUE Dominica
SaintLuc 497.7 TRUE Saint Lucia
Ecuador 2228.58 Ecuador Ecuador Ecuador TRUE Ecuador
Brazil 1875.81 Brazil Brazil Brazil TRUE Brazil
SaintVin 395.9 TRUE Saint Vincent
Colombia 1973.6 Colombia Colombia Colombia TRUE Colombia
Iran 1945.07 Iran France Iran FALSE Iran prefers subset + substitution over exact
Tonga 2390.38 Tonga Tonga Tonga TRUE Tonga
Turkey 2167.95 Turkey Turkey Turkey TRUE Turkey
Belize 1481.26 Belize Belize Belize TRUE Belize
Tunisia 203.38 Tunisia Tunisia Tunisia TRUE Tunisia
Jordan 1539.63 Jordan Jordan Jordan TRUE Jordan
SriLanka 1783.84 TRUE Sri Lanka
DomRep 1206.72 TRUE Dominican Republic
W. Samoa 2388.58 W. Samoa W. Samoa W. Samoa TRUE W. Samoa
Fiji 2534.15 Fiji Fiji Fiji TRUE Fiji
China 2646.26 China China China TRUE China
Thailand 2068.81 Thailand Thailand Thailand TRUE Thailand
Surinam 1562.55 Suriname Suriname Suriname TRUE Suriname
Paraguay 2243.61 Paraguay Paraguay Paraguay TRUE Paraguay
Bolivia 2410.22 Bolivia Bolivia, Plurinational State of Bolivia FALSE Bolivia prefers longer
Philipin 2628.84 Philipines Philipines Philipines TRUE Philipines
Egypt 1401.52 Egypt Egypt Egypt TRUE Egypt
Syria 1590.05 Syria Syria Syria TRUE Syria
Honduras 1979.74 Honduras Honduras Honduras TRUE Honduras
Indonesi 2602.63 Indonesia Indonesia Indonesia TRUE Indonesia
VietNam 2264.3 Viet Nam Viet Nam FALSE Viet Nam correct, but odd, vs. Vietnam
Morocco 191.55 Morocco Morocco Morocco TRUE Morocco
Guatemal 2040.41 Guatemala Guatemala Guatemala TRUE Guatemala
Irak 1625.4 Irak Iran Irak FALSE Irak prefers substitution over exact
India 1888.5 India India India TRUE India
Laos 3012.42 Laos Lao People’s Democratic Republic Laos FALSE Laos prefers longer
Pakistan 1901.47 Pakistan Pakistan Pakistan TRUE Pakistan
Madagasc 1678.96 Madagascar Madagascar Madagascar TRUE Madagascar
Papua 3115.88 Papua New Guinea Papua New Guinea FALSE Papua New Guinea correct
Yemen 1190.13 Yemen Yemen Yemen TRUE Yemen
Nepal 2030.11 Nepal Nepal Nepal TRUE Nepal
CookIsla 2437.7 TRUE Cook Islands
Macau 2660.44 Macau Macao Macau FALSE Macau prefers substitution over exact
Marianas 2437.7 Mariana Isl. Mariana Isl. FALSE Mariana Isl. correct
Marshall 2437.7 Marshall Islands Marshall Islands Marshall Islands TRUE Marshall Islands
NCaledon 2437.7 TRUE New Caledonia
Taiwan 2673.71 Taiwan Taiwan, Province of China Taiwan FALSE Taiwan prefers longer
PuertoRi 1654.5 TRUE Puerto Rico
Afghanis 1962.74 Afghanistan Afghanistan Afghanistan TRUE Afghanistan
Algeria 185.65 Algeria Algeria Algeria TRUE Algeria
Antigua/ 491.84 Antigua and Barbuda Antigua and Barbuda FALSE Antigua and Barbuda correct
Azerbaij 2190.35 Azerbaijan Azerbaijan Azerbaijan TRUE Azerbaijan
Bahamas 594.17 Bahamas Bahamas Bahamas TRUE Bahamas
Banglade 1897.24 Bangladesh Bangladesh Bangladesh TRUE Bangladesh
Bhutan 2082.28 Bhutan Bhutan Bhutan TRUE Bhutan
Brunei 1904.48 Brunei Brunei Darussalam Brunei FALSE Brunei prefers longer
Burma 2138.54 Burma Burma Burma TRUE Burma
Cambodia 2254.37 Cambodia Cambodia Cambodia TRUE Cambodia
Costa Ri 1938.1 Costa Rica Costa Rica Costa Rica TRUE Costa Rica
El Salva 1016.14 El Salvador El Salvador El Salvador TRUE El Salvador
Grenada 537.25 Grenada Grenada Grenada TRUE Grenada
Guyana 1379.76 Guyana Guyana Guyana TRUE Guyana
Haiti 434.51 Haiti Haiti Haiti TRUE Haiti
Kazakhst 2122.18 Kazakhstan Kazakhstan Kazakhstan TRUE Kazakhstan
Kiribati 2281.44 Kiribati Kiribati Kiribati TRUE Kiribati
Korea (N 2399.11 Korea North Korea North FALSE Korea North correct
Kyrgysta 2143.13 TRUE Kyrgyzstan
Libya 185.65 Libya Libano Libya FALSE Libya prefers subset, deletion, insertion over exact
Maldives 1836.17 Maldives Maldives Maldives TRUE Maldives
Micrones 2437.7 Micronesia, Federated States of Micronesia, Federated States of Micronesia, Federated States of TRUE Micronesia, Federated States of
Mongolia 2542.15 Mongolia Mongolia Mongolia TRUE Mongolia
Nicaragu 1856.28 Nicaragua Nicaragua Nicaragua TRUE Nicaragua
Oman 1594.25 Oman Cayman Islands Oman FALSE Oman prefers subset + substitution over exact
Panama 1809.12 Panama Panama Panama TRUE Panama
SKittsNe 469.61 TRUE Saint Kitts and Nevis
Solomon 3050.76 Solomon Islands Solomon Islands FALSE Solomon Islands
Tajikist 2000.53 Tajikistan Tajikistan Tajikistan TRUE Tajikistan
TimorLes 2602.63 TRUE Timor–Leste
Turkmeni 2212.49 Turkmenistan Turkmenistan Turkmenistan TRUE Turkmenistan
UArabEm 1286.35 TRUE United Arab Emirates
Uzbekist 2193.47 Uzbekistan Uzbekistan Uzbekistan TRUE Uzbekistan
Vanuatu 2385.83 Vanuatu Vanuatu Vanuatu TRUE Vanuatu

The R code:

library(stringr) #for str_sub()

#genetic distance data; Country holds the mutilated names
gn = read.csv("genetic_distance.csv", encoding = "UTF-8", stringsAsFactors = FALSE)
gn$abbrev = as_abbrev(gn$Country) #as_abbrev() is not defined in this post

#full names and ISO codes; shorter = names cut off at the 8th character
trans = read.csv("countrycodes.csv", sep = ";", encoding = "UTF-8", stringsAsFactors = FALSE)
trans$shorter = str_sub(trans$Names, 1, 8)

#which names already agree exactly?
intersect(trans$shorter, gn$Country)

#strict (partial) matching
matches = data.frame(source_names = gn$Country,
                     to_short = pmatch(gn$Country, trans$shorter))

#try agrep on a single case
agrep(gn$Country[4], trans$shorter)

#agrep for every name
best_matches = rep(NA_integer_, nrow(gn))
for (idx in seq_along(gn$Country)) {
  match_idx = agrep(gn$Country[idx], trans$shorter, max.distance = .1, useBytes = TRUE)
  #skip on no match
  if (length(match_idx) == 0) next
  #insert match; agrep can return several indices, so keep the first
  best_matches[idx] = match_idx[1]
}

matches$agrep = best_matches

#look up the full names for both methods
matches$to_short_result = trans[matches$to_short, "Names"]
matches$agrep_result = trans[matches$agrep, "Names"]

#prefer the strict match, fall back to agrep
matches$best_match = ifelse(is.na(matches$to_short_result),
                            matches$agrep_result,
                            matches$to_short_result)

#output
write.table(matches, "clipboard", na = "", sep = "\t")
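Given the agrep errors described above, one cautious alternative is to try the strict methods first and use agrep only as a last resort. The following is a sketch, not the code used for the table. The helper match_name and the toy candidate list are hypothetical, and it matches against full names rather than the truncated ones.

```r
# Hypothetical helper: exact match, then unambiguous prefix match, then agrep.
match_name = function(query, candidates, max_dist = 0.1) {
  # 1) an exact match wins outright
  exact = which(candidates == query)
  if (length(exact) > 0) return(candidates[exact[1]])

  # 2) truncated names: accept a prefix match only if it is unambiguous
  prefix = which(startsWith(candidates, query))
  if (length(prefix) == 1) return(candidates[prefix])

  # 3) fall back to agrep and keep the candidate with the smallest edit distance
  fuzzy = agrep(query, candidates, max.distance = max_dist)
  if (length(fuzzy) == 0) return(NA_character_)
  dists = utils::adist(query, candidates[fuzzy])
  candidates[fuzzy][which.min(dists)]
}

# Toy candidate list, made up for illustration
countries = c("Iceland", "Ireland", "Moldova", "Moldova, Republic of",
              "United States", "United Republic of Tanzania", "Sweden")

match_name("Ireland", countries)  # exact match, so "Ireland"
match_name("Moldova", countries)  # exact match, so "Moldova"
match_name("United S", countries) # unique prefix, so "United States"
```

The ordering encodes the observation from the table: the strict method was never wrong when it found something, so the fuzzy step only gets a say when the strict steps come up empty.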

July 7, 2015

Getting free wifi forever in airports on Mint 17.1

Filed under: Computer science — Tags: , , , , , , , — Emil O. W. Kirkegaard @ 13:44

Many airports have free wifi services. The problem with these is that they are time-limited, usually to 1 to 3 hours. This can be very annoying if one is stuck in an airport for an extended period, as I am right now.

Non-technical solution

If you have used up your time on one device, you can simply switch to another. If you have brought a smartphone, a tablet and a laptop, you can use the time on each of these.

This solution may be sufficient in some situations.

Technical solution

The wifi services rely on your computer’s MAC address to identify you. They keep track of these, so when you have used all the time on a given MAC address, it will be temporarily blocked from using the internet again.

The solution is simple: we switch to a new MAC address every time one has expired. How do we do this? The built-in network controller can change the MAC address, but this did not work for me. Instead I installed macchanger using:

sudo apt-get install macchanger

This is a small program that lets you easily change MAC addresses. I found a ton of guides, but they did not fully work.

Here’s my current routine.

  1. Disable the wifi by clicking “turn off” in the dock menu.
  2. Delete the previous connection to the network.
  3. Open a terminal as root.
  4. Type:
    macchanger -s wlan0

    to show the current MAC address.

  5. Type:
    macchanger -a wlan0

    to get a new random MAC address of the same kind.

  6. Re-do step (4) to see that it worked.
  7. Turn on wifi.
  8. Connect to the network.
  9. Enjoy internet for as long as it lasts, then start over from step (1).

I’m not sure if everything here is strictly necessary, but this works for me.
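
Steps (4) to (6) can be put into a small script. The following is only a minimal sketch, not something from my original routine: the names rotate_mac and run, the DRY_RUN switch, and the use of ip link to take the interface down and up (in place of the dock-menu toggle) are all assumptions. Deleting the old connection and reconnecting are still done by hand. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of steps (4)-(6). Assumed, not from the post: rotate_mac, run, DRY_RUN,
# and using `ip link` instead of the dock-menu wifi toggle.

# Run a command, or only print it when DRY_RUN is 1 (the default).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# rotate_mac [INTERFACE]: change the MAC address of INTERFACE (default wlan0).
rotate_mac() {
  iface="${1:-wlan0}"
  run ip link set dev "$iface" down   # many drivers refuse a change while up
  run macchanger -s "$iface"          # step 4: show the current MAC address
  run macchanger -a "$iface"          # step 5: new MAC address of the same kind
  run macchanger -s "$iface"          # step 6: check that it worked
  run ip link set dev "$iface" up
}
```

Calling rotate_mac wlan0 prints the five commands. To actually run them, call it as root with DRY_RUN=0.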

January 26, 2015

As AI improves, what is the long-term solution to spam?

Filed under: Computer science — Emil O. W. Kirkegaard @ 03:54

I’ve thought about this question a few times but haven’t found a good solution yet. The problem needs more attention.

AI and computer recognition of images are getting better. I’ll go ahead and say that the probability of computers becoming as good as humans at any kind of CAPTCHA/verify-I’m-not-a-bot test within the next 20 years is very, very high. CAPTCHAs and other newer anti-bot measures cannot continue to work.

We don’t want our comment sections, discussion forums and social media full of spamming bots, so how do we keep them out? I have three types of proposal for the online world.

Make sending cost

Generally speaking, spam does not work well. The click rate is very, very low, but it pays because sending it in bulk is very cheap. In the analog world, we also see spam in our mailboxes (curiously, this is allowed but online spam is not!), but not quite as much as online. The cost of sending spam in the analog world (printing and postage) keeps the amount down.

Participation in the online world is generally free after one has paid one’s internet bill. Generally, stuff that isn’t free on the internet is not used much. Free is the new normal. Free also means the barrier for participation is very low which is good for poor people.

The idea is that we add a cost to writing comments, but keep it very low. Since spam only works when sending large amounts (e.g. 10,000,000 emails per year), and normal human usage does not require sending comparably large amounts (<1,000 emails in most cases), we could add a cost to this which is prohibitively large for bots, but (almost) negligible for humans. E.g. 0.01 USD per email sent. Thus, human usage would cost <10 USD per year, but botting would cost 100,000 USD per year.
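
The arithmetic can be checked with a few lines of shell. This is just a sketch using the example figures above. The function name yearly_cost_usd is made up for illustration, and prices are given in whole cents so that plain integer arithmetic is enough.

```shell
#!/bin/sh
# yearly_cost_usd MESSAGES PRICE_IN_CENTS: print the yearly cost in USD.
# The function name is hypothetical; the figures are the examples from the text.
yearly_cost_usd() {
  messages="$1"
  price_cents="$2"
  total_cents=$((messages * price_cents))
  printf '%d.%02d\n' $((total_cents / 100)) $((total_cents % 100))
}

yearly_cost_usd 1000 1       # human user, 1,000 emails: prints 10.00
yearly_cost_usd 10000000 1   # spam bot, 10,000,000 emails: prints 100000.00
```

The same per-message price thus differs by a factor of 10,000 in yearly cost between the two kinds of sender.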

Who gets the money? One could make it decentralized, so that the blog/newspaper/media owner gets the money. In that way, discussing on a service also supports them, altho very little.

This could maybe work for email spam, but for highly read comment sections (e.g. on major newspapers or official forums for large computer games), the rather small price to pay for writing 1000 (or even 10) posts would not be a deterrent. Hence the pay-for-use proposal fails to deal with some situations.

Making the microtransactions work should not be a problem with cryptocurrencies, which can also be sent anonymously.

Verified users by initial cost

Another idea based on payment is that one can set up a service where users can pay a small fee (e.g. 10 USD) to be registered. The account from this service can then be used to log into the comment section (forum etc.) of other sites and comment for free. The payment can be with cryptocurrency as before so anonymity can be preserved. It is also possible to create multiple outward profiles from one registered account so that a user cannot be tracked from site to site.
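
One way the outward profiles could work is to derive a separate identifier for each site from a secret tied to the registered account. The following is a rough sketch under my own assumptions, not a worked-out design: the function name site_profile_id is made up, it needs sha256sum from GNU coreutils, and a real service would more likely use a proper HMAC than a plain hash over secret and site name.

```shell
#!/bin/sh
# site_profile_id ACCOUNT_SECRET SITE: print a site-specific pseudonym.
# Hypothetical helper; plain SHA-256 over "secret|site" stands in for an HMAC.
site_profile_id() {
  printf '%s|%s' "$1" "$2" | sha256sum | cut -c1-16
}

# The same account gets a stable but different identifier on each site.
site_profile_id "account-secret" "blog.example"
site_profile_id "account-secret" "forum.example"
```

A site only sees its own identifier and cannot link it to the one used elsewhere without the secret, while the registration service, which holds the secret, can still recompute any of them.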

If an account is found to send spam, it can be disabled, and the payment is then wasted. The payment does not have to be large, but it needs to be sufficient to run the service. Perhaps one can outsource the spammer-or-not decision-making to a subset of users who wish to work for free (many services rely upon a subset of users to provide their time, e.g. OKCupid).

The proposal has the same problem as the one above in that it requires payment to participate.

Verified users without payment

A third proposal is to set up a service where one can make a profile for free, but which requires one to somehow prove that one is a real person. This could be done with confidential information about a person, e.g. a passport plus access to a database of passports. This would probably require cooperation with officials in each country. They will probably keep the information about who is who if they can, so it is difficult to see how privacy could be preserved with proposals of this type.

As before, accounts will still need to be deactivated if they are found to be spamming. If the government is involved, they will surely push for other grounds for deactivation: intellectual monopoly infringement, sex work related stuff, foul language, national security matters and so on. This makes it the least preferred solution type to me.

Other solutions?

Generally, the goal is to preserve privacy, keep participation free and stay spam-free. How can it be done? Is online discussion doomed to be overspammed?
