#21 Anonymize datasets
Opened 13 years ago by nigelj. Modified 2 years ago

Apologies if this is how the system currently works.

I'm posting this on behalf of the Fedora Board. We would like for our election voting tool to work like this:

FAS2 user goes to vote, they authenticate. The voting system assigns the user a unique ID number. The user then votes, and confirms their vote. The "ballot" is stored but is tied only to this unique ID number (not to the FAS2 user/account). The FAS2 system records that the FAS2 user has voted, but does not store a correlation between the FAS2 account and the unique ID number attached to the ballot.

Thus, we have a dataset showing the results, but it is anonymized automatically, and we still know who has (and has not) voted, just not who (or what) they voted for.
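To make the request concrete, here is a minimal sketch of that flow. It is an illustration only, not the elections app's actual schema: the table names `voted` and `ballots`, the functions `init_db`, `cast_vote` and `has_voted`, and the use of SQLite are all assumptions made for the example.

```python
import secrets
import sqlite3


def init_db(conn):
    # Hypothetical schema: one table records WHO voted, the other WHAT was voted.
    # Nothing links a row in one table to a row in the other.
    conn.execute(
        "CREATE TABLE voted (username TEXT, election TEXT, "
        "PRIMARY KEY (username, election))"
    )
    conn.execute(
        "CREATE TABLE ballots (ballot_id TEXT PRIMARY KEY, election TEXT, choice TEXT)"
    )


def cast_vote(conn, username, election, choice):
    # Random, unguessable ID that is attached to the ballot only.
    ballot_id = secrets.token_hex(16)
    with conn:  # single transaction: both rows are written, or neither is
        conn.execute("INSERT INTO voted VALUES (?, ?)", (username, election))
        conn.execute(
            "INSERT INTO ballots VALUES (?, ?, ?)", (ballot_id, election, choice)
        )
    return ballot_id


def has_voted(conn, username, election):
    row = conn.execute(
        "SELECT 1 FROM voted WHERE username = ? AND election = ?",
        (username, election),
    ).fetchone()
    return row is not None
```

A second `cast_vote` call for the same user and election fails on the primary key of `voted`, so the "has voted" record is what prevents double voting, while the `ballots` table alone carries no account information.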

Thanks in advance.

2008-07-29 19:26:00 changed by mmcgrath

Just so we're clear: the idea is to ensure that the results of a ballot cannot in any way be tied to a user. We would also not want this data in logs on the elections app servers, correct?

This might make fraud more likely, but then I'm not entirely sure how it's any better or worse from that perspective than what we have now.

G, can you think of any technical hurdles in implementing this?

Spot/Board: Do you want us to anonymize the data that's already in the database?

2008-07-29 19:35:42 changed by spot

Yes, that is the idea. We would not want a correlation between a ballot and a user anywhere, not even logs. If we're careful with ensuring that once a ballot is submitted, the FAS2 user is marked as having voted in an election, I don't think we open the door to any additional fraud possibilities.

As to anonymizing the existing data, I'm not sure it is worth the effort.

2008-07-29 21:39:01 changed by jspaleta

If we are going to be making changes to address the voting system... are we gonna make any effort to contact someone doing active research into electronic voting methods to get an idea as to existing workable implementation concepts that accomplish the desired anonymizing of the voting but also protect against fraud?


Okay, so in the process of changing to MySQL I'm also taking the opportunity to do just this, but there are issues with the 'nothing in logs' wish.

a. Apache by default logs authenticated usernames in the access log (I can't remember off the top of my head at which point this happens: proxy, app server, or application level).
b. For some people, it'd be possible to reconstruct a reasonably accurate idea of who voted for whom by examining both the logs and the database, for a couple of reasons:
1. Anyone with access to the HTTP logs on proxyx/log1/lockbox1 could correlate the IPs of requests to the voting app with usernames from calls to applications such as pkgdb (or, in some cases, with a geographical locality)
2. The order in which votes were cast could still be reconstructed from the order in which the ballots are stored
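As a rough illustration of the ordering point only, and assuming it refers to ballots sitting in the database in the order they were cast: if each ballot gets a random ID, any published list can be sorted by that ID so that its row order says nothing about cast order. The function name `export_ballots` and the sample data are hypothetical, and this does nothing about timestamps in the HTTP logs.

```python
import secrets


def export_ballots(ballots):
    """Return (ballot_id, choice) pairs sorted by random ID, not by cast order."""
    return sorted(ballots, key=lambda ballot: ballot[0])


# Hypothetical ballots, listed here in the order they were cast.
cast_order = [(secrets.token_hex(16), choice) for choice in ["a", "b", "a", "c"]]

# The published order depends only on the random IDs.
published = export_ballots(cast_order)
```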

I'd also pose the question:

Is this just to please the board so that a nosey person can't snoop in the DB and find out the values of someone's vote (which I agree with), or is it aimed so we can just dump the DB and share it with the world (my opinions on this are already out there)?

I'd like to move this application to fas1/fas2, but I don't actually have the privileges to do this. I'd also like to restrict access to the database to the hosts that the application runs on as an extra measure, which would help keep snoopers out of the database. The problem with this is that, in the current state, there are still a few tasks that need interaction with the database to flip bits/make changes/fix mistakes.

As for satisfying people that just want to mine our data, I'm still very much against it (especially providing the whole data). We can still do this somewhat crudely, in particular by providing random samples so the full dataset cannot be reconstructed, but I think the time spent on this could be put to significantly better use on other features for the application.
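For what it's worth, the crude random-sample approach could look something like the sketch below. The function name `sample_ballots` and the `fraction` parameter are made up for illustration; whether a given sample size actually prevents reconstruction is not something this sketch demonstrates.

```python
import random


def sample_ballots(choices, fraction, rng=None):
    """Return a random subset of ballot choices, without replacement."""
    # SystemRandom draws from the OS entropy source rather than a seeded PRNG.
    rng = rng or random.SystemRandom()
    k = int(len(choices) * fraction)
    return rng.sample(choices, k)


# Hypothetical ballot choices with all identifying columns already dropped.
all_choices = ["a"] * 6 + ["b"] * 4
half = sample_ballots(all_choices, 0.5)
```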

I don't think we want the voting app to live on fas1/fas2 as fas is a whole 'nother level of privacy. Voting records are something we don't really want to give out but they don't have the same risk as someone's home address.

I'm with G on the difficulty of actually keeping this from the logs in such a way that it's 100% not traceable.

I'd also note that this does open the door to different types of voter fraud than before. Having votes tied to userids means that we could do things like allow a person to change their vote once cast (a requested feature... but not a heavily requested feature), allow a person to review their own personal voting record (did I fat finger that? I don't remember voting in the last Board Election, is someone stuffing the ballot box?), etc.

Keeping a table that records that a user voted in an election but not what their ballot was means that if people come forward and say, "the system is saying I already voted but I haven't," it is much harder to investigate the possibility that something fraudulent is going on.

Since I'm not the main coder of this, I'm not in a position to veto it but I'm not altogether convinced that it is a good idea.

I don't think we can promise 100% anonymity, and I don't think there's sufficient support (or justification) to just deliver a dump of the DB data to anyone. By disconnecting the user IDs from the votes, we at least make it possible for any community member to potentially validate the vote totals.

As for changing votes after the fact -- no electoral system I know of supports changing your mind later. That opens the door to all sorts of havoc. We already provide a confirmation screen for voting that should serve to allay concerns about voter error, and IIRC a screen that confirms the vote once cast, which could be printed. We could include language on the screen that encourages printing it, too -- and since it's an HTTPS connection, couldn't we let the voter know their Vote ID, so the voter's the only person who knows the userID <-> voteID connection? (This is seriously treading into separate ticket waters, I know.)

Replying to [comment:3 toshio]:

We could include language on the screen that encourages it, too -- and since it's an HTTPS connection, couldn't we let the voter know their Vote ID, so the voter's the only person who knows the userID <-> voteID connection? (This is seriously treading into separate ticket waters, I know.)

So the problem is: if you come to me next election and say, "I'm userid:100001. I haven't voted yet but the system is saying I did," there's no way for me to tell if there's something wrong with the system, who benefitted from the votes cast using your userid, or if you're just trying to cause a commotion and derail the process.

If the purpose of this feature is to allow anyone to validate the votes without access to the db here's what I'd do:

  1. Have a table that maps userid::election => one-use-voteid.
  2. Do not give out information from this table to anyone.

This way, db administrators have the ability to find out what's going on if something happens. At the same time, validation of voting outside the db will only lead to anonymous ids, not to userids used anywhere else. This still allows administrators to peek at who is voting for which candidates if they want to track the information through enough tables in both databases but, as G says, an administrator who is sufficiently motivated can get that information by checking http logs as well as the information in the db.
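A rough sketch of that two-step proposal, to show what each side would see. Everything here is assumed for illustration: the `vote_map` and `ballots` table names, the function names, and SQLite itself. The idea is simply that the mapping table stays private while the export contains only the one-use vote IDs.

```python
import secrets
import sqlite3


def init_db(conn):
    # Private table: maps userid + election to a one-use vote ID. Never published.
    conn.execute(
        "CREATE TABLE vote_map (userid INTEGER, election TEXT, "
        "voteid TEXT UNIQUE, PRIMARY KEY (userid, election))"
    )
    # Publishable table: contains only the anonymous vote ID and the choice.
    conn.execute("CREATE TABLE ballots (voteid TEXT PRIMARY KEY, choice TEXT)")


def cast_vote(conn, userid, election, choice):
    voteid = secrets.token_hex(16)
    with conn:  # both inserts succeed or neither does
        conn.execute("INSERT INTO vote_map VALUES (?, ?, ?)", (userid, election, voteid))
        conn.execute("INSERT INTO ballots VALUES (?, ?)", (voteid, choice))
    return voteid


def public_export(conn):
    # What outside validators get: anonymous IDs only, ordered by ID.
    return conn.execute(
        "SELECT voteid, choice FROM ballots ORDER BY voteid"
    ).fetchall()


def admin_lookup(conn, userid, election):
    # What a db administrator can do when investigating a fraud complaint.
    return conn.execute(
        "SELECT b.voteid, b.choice FROM vote_map m "
        "JOIN ballots b ON b.voteid = m.voteid "
        "WHERE m.userid = ? AND m.election = ?",
        (userid, election),
    ).fetchone()
```

The trade-off is the one already described: `admin_lookup` is exactly the capability that lets an administrator see who voted for what, so this design protects against outside snooping and supports investigations, but not against the administrators themselves.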

Another way to phrase this is: which problem are we more concerned about:

  1. Voting administrators finding out who voted for what (Note: simultaneously, you have to trust that your admins haven't compromised the actual numbers, just compromised the anonymity of the data.)
  2. Voters attempting to cause fraud in the system or add distrust to the numbers returned from the process.

Is this ticket still open?

Is this likely to be worked on? It would be nice to have the same features in Nuancier. It may be helpful to examine the Estonian e-voting system here.

There is still extensive research work in this area with many people being sceptical that electronic voting can be secure, anonymous and auditable, though for this application, maybe trusting the system administrator is reasonable.
