The drama in trying to convert election PDFs to Spreadsheets

716 points| markessien | 3 years ago |markessien.com | reply

141 comments

[+] OoTheNigerian|3 years ago|reply
Nice read. It's important to note

1. The 2020 protesters did not begin vandalizing property; rather, the government infiltrated the protests, burning cars and maiming people.

2. The Obidient movement encompassed multiple sub-movements, of which part of #EndSARS was one. A vast majority of Peter Obi's supporters were not #EndSARS activists.

3. Elections in Nigeria are fraught with treacherous behavior so everyone suspects everything. It's important to be very careful with your communication. There is a lot of desperation in the land and so if in a position of information leverage, the responsible thing is to handle the privilege with care and transparency.

[+] crazygringo|3 years ago|reply
First of all, what a fantastic and inspiring read.

But, I'm left greatly confused -- the article never states whether this changed the result.

It says that halfway through counting Obi was in the lead, but nothing about when finished counting.

And when I look at the spreadsheet, the last row (#3380) appears to be the totals, which lists:

  APC     LP     PDP     NNPP
  149014  85748  329030  8305
Which shows LP (Obi) in third place, just like the official results.

So what point is the article trying to make at the end of the day? Or have I misunderstood the numbers?

[+] sd9|3 years ago|reply
Those are the results for just one state, Adamawa.

However, like you I don't know what the overall results are; I agree that the article could make this clearer.

[+] karagenit|3 years ago|reply
I totaled up the results from only the "crosschecked" CSV files, here's what I saw:

  APC:  5928825
  LP:   4731127
  PDP:  4555334
  NNPP: 1019045
I tried to manually verify about a dozen rows myself, half were so blurry/low res they were illegible but the ones that were legible were all correct.

And for the "unsure" CSVs:

  APC:  1308067
  LP:    578482
  PDP:   736183
  NNPP:  513245
Also checked about a dozen, and all but one of them were wildly inaccurate so I wouldn't trust these much.
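
For anyone wanting to reproduce totals like these, here is a minimal sketch of the summing step. It is not the script used above: the column names (`APC`, `LP`, `PDP`, `NNPP`) and the `*_crosschecked.csv` file pattern are assumptions about how the exports are laid out, and blank or non-numeric cells are simply counted as 0.

```python
import csv
import glob
from collections import Counter

PARTIES = ("APC", "LP", "PDP", "NNPP")  # assumed column names


def to_int(cell):
    """Return the cell as an int, or 0 if it is blank or not a plain number."""
    cell = (cell or "").strip().replace(",", "")
    return int(cell) if cell.isdigit() else 0


def total_votes(paths, parties=PARTIES):
    """Sum each party column over every row of every CSV in `paths`."""
    totals = Counter({party: 0 for party in parties})
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                for party in parties:
                    totals[party] += to_int(row.get(party))
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical file pattern; adjust to wherever the exports live.
    print(total_votes(glob.glob("*_crosschecked.csv")))
```

If a sheet ends with its own totals row, as noted upthread for Adamawa, that row would have to be dropped first or it gets counted twice.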
[+] error503|3 years ago|reply
I collected all the _crosschecked CSVs and got:

  LP       PDP      APC      NNPP
  4731127  4555334  5928825  1019045
Obi seems to make second place here, but far from first.

https://i.imgur.com/UaZbXz6.png

[+] jxramos|3 years ago|reply
probably because it was submitted as evidence and not yet accepted by the courts or wherever things wound up next.
[+] djoldman|3 years ago|reply
Checking one at random:

https://docs.google.com/spreadsheets/d/1HhV9iJxXTU9liAZPIDoM...

...shows 0s in the first row for all candidate parties. But the corresponding photo shows votes for all three:

https://inec-cvr-cache.s3.eu-west-1.amazonaws.com/cached/res...

I hope it's not a mistake and that there's some arcane law/technicality to explain it.

edit: another mistake on row 21, LP should get 25 but it was credited to NNPP:

https://docs.inecelectionresults.net/elections_prod/1292/sta...

[+] dan-robertson|3 years ago|reply
Yeah looks weird. When I scrolled to a random part, the numbers seemed to line up. They didn’t say things were entirely correct though. Perhaps the data quality is sufficient for a challenge. Odd that the first rows seem more wrong though.
[+] MontagFTB|3 years ago|reply
So the bug where the first voting sheet shown to a user was from the same 10% of the photos turned out to be a feature, serving as a CAPTCHA of sorts to weed out the bad actors from the good.

If memory serves, some CAPTCHA techniques include showing two numbers to transcribe, where one’s value is already known. If that number is transcribed incorrectly, then the other number’s result isn’t used, and the CAPTCHA fails. Perhaps a similar technique may have also helped here?

[+] Spare_account|3 years ago|reply
This approach was part of their strategy:

>Then we started showing some results we knew to the bots - if they entered wrong numbers, we would stop accepting the results.
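
A rough sketch of that known-answer check, for illustration only. The article does not show its implementation, so `KNOWN_SHEETS`, `Contributor` and `submit` are hypothetical names, and the rule that one wrong control answer blocks the contributor for good is an assumption.

```python
from dataclasses import dataclass, field

# Hypothetical control sheets whose correct tallies are already known.
KNOWN_SHEETS = {"sheet-001": {"APC": 120, "LP": 95, "PDP": 80, "NNPP": 4}}


@dataclass
class Contributor:
    trusted: bool = True
    accepted: list = field(default_factory=list)


def submit(contributor, sheet_id, tally):
    """Record a transcription unless the contributor has failed a control sheet.

    Returns True if the contributor is still trusted after this submission.
    """
    if sheet_id in KNOWN_SHEETS:
        if tally != KNOWN_SHEETS[sheet_id]:
            contributor.trusted = False  # stop accepting results from now on
        return contributor.trusted  # control answers are checked, never stored
    if contributor.trusted:
        contributor.accepted.append((sheet_id, tally))
    return contributor.trusted
```

The contributor never learns which sheets are controls, so a bot entering junk fails sooner or later, and everything it submits afterwards is dropped.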

[+] dan-robertson|3 years ago|reply
I think the bug was that your first sheet came from a small set, and the people entering bad data would refresh instead of moving on to the next, genuinely random sheet. So entries for most of the sheets came only from people with long sessions, who were apparently more likely to enter good data.
[+] malborodog|3 years ago|reply
Can you explain that again differently? I didn’t understand that captcha point. It feels important though.
[+] churchill|3 years ago|reply
Oh, and Mark didn't mention that Bola Ahmed Tinubu was indicted for heroin charges in the US in 2003, forfeited $460k & is just too old to run a democracy this size.

Atiku Abubakar (second candidate) was a former VP and the president he served under (Obasanjo) still insists the dude remains a monument to corruption.

There's been a coordinated campaign at all levels to rig this election massively and we saw voter intimidation, manipulation in broad daylight, and the acquiescence of foreign governments to it all.

[+] churchill|3 years ago|reply
Proofs:

To explain the $460k he forfeited to the feds for his heroin trafficking indictment [0][1], Tinubu claims to have worked at Deloitte as a consultant & made $850k in pre-tax bonuses a year. Problem is, Deloitte claims he's never worked for them [2] and a director at Deloitte earns $340k, according to Glassdoor [3].

[0]: https://www.bbc.com/news/world-africa-61732548
[1]: https://www.scribd.com/document/345742027/Bola-Tinubu-Heroin
[2]: https://pbs.twimg.com/media/FhhgxX2WQAAWOVo?format=jpg
[3]: https://www.glassdoor.com/Salary/Deloitte-Director-Salaries-...

[+] bschne|3 years ago|reply
> is just too old to run a democracy this size

Ahem, somebody tell the U.S. that

[+] churchill|3 years ago|reply
I meant heroin trafficking
[+] charles_f|3 years ago|reply
> run a democracy this size.

From the looks of it, if he runs it, it won't be a democracy

[+] lostlogin|3 years ago|reply
> is just too old to run a democracy this size.

Bola Ahmed Tinubu was born 29 March 1952. He is 70.

Joe Biden was born November 20, 1942. He is 80.

There are plenty of world leaders that are old and I completely agree with you. Why aren’t there upper age limits? The UK House of Lords, US Congress and US Supreme Court have this problem too.

[+] mtrovo|3 years ago|reply
Is the access to the original photos open? It might be fit for a good Kaggle competition, although maybe a little too late for this current election.
[+] jasonjayr|3 years ago|reply
From the article, it seems like the rush was to collect enough evidence to file a challenge within the legal timeframe. With a challenge filed, it seems like there is a bit more time to verify claims + other evidence. (I know nothing of the system of government there, but) -- it seems like the prudent thing to do would be for the courts to mandate a neutral verification of each of those paper sheets. (ie, 10 trusted representatives from each party re-key the figures manually).
[+] olabyne|3 years ago|reply
If you want, you have exactly the same issue to solve with Kenya last year.

The pictures from all of the voting sites are available, but the country descended into chaos trying to pick a winner. It is crazy, because at the lower level (in the voting offices) the voting process was respected and the numbers are trustworthy, but the higher you go, the more corruption happens, as each aggregation of data removes trust from the system.

[+] dec0dedab0de|3 years ago|reply
This would have been a good use for HN-style shadow banning. Especially if they didn't publish the current tally, the original easy-to-detect bots might never have realized you were on to them.
[+] rqtwteye|3 years ago|reply
I still don't understand how we ended up with PDF as a sort of standard for archiving data. PDF is already pretty bad for things like manuals, but for things like spreadsheets we basically collect the data, then destroy all the structure by putting it into PDF, and later on painstakingly try to restore the data from the PDF, which is often almost impossible to do accurately.

It just shows that bad solutions often win.

[+] spacebanana7|3 years ago|reply
I've thought about this and come round to think that the flaws of PDF are actually essential to the success of the document format.

- Non-responsive (compared to HTML). Allows PDFs to serve as a common standard between other document formats with different resizing logic, like LaTeX and Word.

- Difficulty of network access from code running inside the document. Allows PDFs to generally operate offline. Nobody's brave enough to try to write a single-page application in a PDF.

- Destroying data structure. Allows forward compatibility with anything that can be displayed statically on a screen. New applications can have different ideas about how tables, text or charts should work, but if there's static visual output then it'll convert to PDF. Awareness of, say, the structure of tables is precisely what makes it so difficult for Google Sheets and Excel to stay compatible with each other's new table features. If somebody develops a new language with new characters not even in Unicode, it'll still work in a PDF.

It's also worth noting that most PDF limitations have the characteristic of making things hard but not absolutely impossible. These escape hatches prevent people with hard requirements from actually moving to a new format.

If it were truly impossible to get invoice data from PDFs people might've shifted to a different format for business transactions. But if it's merely difficult some company will come up with an API that works as a good enough extraction solution whose cost is justified by the other compatibility benefits of PDFs, so the ecosystem stays with PDFs.

[+] varenc|3 years ago|reply
For this particular case, the use of PDFs seems irrelevant. Photos were just taken of each polling unit’s results. These photos happened to then be embedded into PDFs for distribution, but the core underlying data is just an image embedded into that PDF. No important data was destroyed when these photos were placed into PDFs.
[+] layer8|3 years ago|reply
> how we ended up with PDF as sort of standard to archive data.

I don’t think we really did. They are a standard for archiving typeset page-based documents.

Of course, paper documents used to be standard for archiving data, and some continue to do so in the form of PDF.

In principle, it is possible to integrate all the structure you want in a PDF (using Marked Content, Structure Attributes and User Properties), but for data (as opposed to document structure) you’d need custom software to generate and interpret that.

[+] anigbrowl|3 years ago|reply
Because PDF shows you a page on screen that will look the same if you print it out, and print layouts have been optimized for reading convenience over centuries. And if you give someone with no technical expertise a pdf file, it's virtually certain that they're going to be able to open it because some kind of viewer is built into most operating systems.

You're totally right about PDF being a massive pain in the butt for any other purpose, but unless you have an alternative that handles the basic use case at least as well and other use cases way better, PDF is here to stay.

[+] chrisfinazzo|3 years ago|reply
It's old, and sometimes things don't come out right, but this is one way out of that hornet's nest.

https://tabula.technology

There's also a CLI if that is more to your liking. If that doesn't do it, there's always the brute-force option of scripting in your language of choice to pull the data out.

[+] codeulike|3 years ago|reply
these are just photos embedded in a PDF, which actually isn't that bad an idea, because it lets you scan multiple pages and join them together as a 'document'

(not sure if the documents in OP had several pages, but if you've scanned/photographed a multi-page document, PDF is not that bad of a solution)

[+] manv1|3 years ago|reply
Back in the day there were at least two programs competing for the role that PDF fills today that I remember: diskpaper and PDF. Apple also had one for its developer docs, but it was never released commercially, I believe.

PDF provided more fidelity for printing, had better tooling (it was by Adobe after all), was cross-platform, and could be displayed on the desktop, so it won. The reader was cross-platform, so end-users didn't have to mess with installing plugins for various image types. And because everyone in the document creation division(1) used PostScript to print, printing to PDF was super-easy. And at some point everyone had a PostScript printer driver on their machine, so printing to PDF became super-easy as well.

It's not an archiving tool, but people use it for archiving...just like the way a spreadsheet isn't a project management tool, but millions of people use it for project management.

At this point the network effects for the PDF file format would make it difficult to replace. With PDF you can practically guarantee(2) that the file will look the same on any device.

(1) This was more true back then than today, probably.

(2) Assuming that you embedded the fonts, and that the reader doesn't suck.

What's funny is I don't think Adobe really makes any money off of PDF; it's an accidental de-facto standard.

[+] davedx|3 years ago|reply
It depends. There are PDFs with rasterized images of text (like in the article, when it’s a scan or photo of a document), then there are PDFs with vector positioned text runs (when it’s usually a result of some digital process). The latter are way easier to process than the former.
[+] andrewio|3 years ago|reply
Try https://parsio.io.

It converts PDFs into a structured JSON format that you can export anywhere using a Zapier or Make automation.

[+] redman25|3 years ago|reply
This might be a sensitive question but I wonder if something like this would work in the United States? With all of the fears of election interference why not trust but verify?
[+] charles_f|3 years ago|reply
Would you trust the recount? I mean, the only way to engage the number of people you need for that kind of recount is by having them very pissed, so most likely they feel their party was wronged, and the whole thing is therefore partisan in essence. If you're on the winning side you wouldn't trust the numbers the others give you anyhow, so what's the point?
[+] pjc50|3 years ago|reply
Genuinely the US would do better if it had paper elections with a handcount with observers. The system works in the UK just fine. Unfortunately, there's a category of people in both the US and Nigeria who use "election interference" to mean "accurately counting the votes".
[+] bunabhucan|3 years ago|reply
The distrust is not based on evidence. Actual election fraud is incredibly rare in the US, typically things like someone owning property in two states and voting the state ballot in one and federal in the other. Getting two ballots is legal but using both is jail time. Typical solo offender is a conservative white male.
[+] harvey9|3 years ago|reply
This is some compelling writing. I know this has real life implications for real people so I hope it's not in poor taste to say it would make a good movie.
[+] cwkoss|3 years ago|reply
I agree, but still needs an ending! Will this be a story of triumph or tragedy?
[+] tr33house|3 years ago|reply
I'd tried something like this with the Kenyan election but our setup was to use OCR (google cloud) -> text -> parse -> sqlite

We started late so the results were out when we finished but I think it'll be a good idea to develop software that can parse the PDF results and display them faster than the electoral bodies can. In Kenya, and Nigeria, the delays cause a lot of anxiety
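
A minimal sketch of the parse -> sqlite end of a pipeline like that, leaving the OCR call out. This is not the setup described above: the line format the regex expects (party abbreviation followed by a count), the `results` table and all function names are assumptions for illustration, and real OCR output would need far more cleanup.

```python
import re
import sqlite3

# Assumed shape of an OCR'd line: party abbreviation, then a vote count.
LINE = re.compile(r"^\s*(APC|LP|PDP|NNPP)\b\D*?(\d[\d,]*)\s*$")


def parse_sheet(text):
    """Extract {party: votes} from OCR text, skipping lines that do not match."""
    votes = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            votes[m.group(1)] = int(m.group(2).replace(",", ""))
    return votes


def store(conn, unit_id, votes):
    """Upsert one polling unit's parsed votes into a hypothetical results table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "unit_id TEXT, party TEXT, votes INTEGER, "
        "PRIMARY KEY (unit_id, party))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
        [(unit_id, party, count) for party, count in votes.items()],
    )
    conn.commit()


def totals(conn):
    """Return {party: summed votes} across all stored polling units."""
    return dict(conn.execute("SELECT party, SUM(votes) FROM results GROUP BY party"))
```

Keying rows on (unit_id, party) means re-running OCR on the same sheet overwrites the old figures instead of double counting them.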

[+] hoseja|3 years ago|reply
Silly, you don't malcount the actual votes, you brainwash the population and pervert the process until they vote the way you want them to, like in the advanced first world democracies.
[+] avodonosov|3 years ago|reply
That's not the worst case, if wise elite brainwashes (manufactures consent of) the population.

Worse is when the elite is not so wise (sometimes plainly crazy), or the elite loses control to crazy people, adversaries. Or self-induced mass hysteria of the population.

The direct "democracy" that technology will inevitably enable very soon poses great dangers in a situation where the masses are so easily manipulated and their collective intelligence seems not to rise above the individual level but, for some reason, to degrade below it. Violent chaos, lynch courts, etc.

[+] mattlutze|3 years ago|reply
This was thrilling.

Sometimes, one person's bug is another person's feature :)

[+] londons_explore|3 years ago|reply
Aren't things like this the reason that the UN provides election observers?

By spot-checking that just a random 100 votes are correctly tallied, you can be pretty sure the outcome of the election is legit in a > 10M voter country.
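
How much a spot check of that size tells you depends on how widespread the errors are. A back-of-the-envelope sketch, assuming each checked record is drawn independently at random and that a fraction `bad_fraction` of all records is wrong (both simplifications, and the function name is made up for illustration):

```python
def detection_probability(bad_fraction, sample_size):
    """Chance that a random sample contains at least one bad record.

    Assumes independent draws from a population large enough that
    removing the sampled records does not change the bad fraction.
    """
    return 1.0 - (1.0 - bad_fraction) ** sample_size


if __name__ == "__main__":
    for bad_fraction in (0.001, 0.01, 0.05):
        p = detection_probability(bad_fraction, 100)
        print(f"bad fraction {bad_fraction:.3f}: detection chance {p:.3f}")
```

Under these assumptions 100 checks would almost certainly hit a problem if 5% of records were wrong, but would usually miss it if only 0.1% were.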

[+] throwaway81523|3 years ago|reply
I've done stuff like this semi manually. Use pdftotext to get the text tables out of the pdf, eyeball it and massage with emacs keyboard macros, and in some cases python scripts. It's not that big a deal but it is somewhat ad hoc.

I know that OCR software is able to read stuff like magazine articles and figure out column layout, embedded charts, etc. It's weird if there is nothing that does that with a PDF. Maybe I'll look around or see if I can hack up something.
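
A sketch of what the scripted part of that workflow could look like, assuming poppler's `pdftotext` is installed and on the PATH. `pdf_to_text` and `split_columns` are made-up helper names, and splitting on runs of two or more spaces is a heuristic that will need per-document tuning.

```python
import re
import subprocess


def pdf_to_text(path):
    """Run pdftotext with -layout and return the extracted text from stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def split_columns(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows
```

This only helps when the PDF has a real text layer. For photos embedded in a PDF, like the election sheets here, pdftotext finds no text to extract and OCR is needed first.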

[+] hardlianotion|3 years ago|reply
That is a great job - well done from a grateful Nigerian.
[+] mmmuhd|3 years ago|reply
Elupee 75, To be frank, you did a great job and I am proud of someone from my country pulling this off, but the bitter truth is that President-elect Bola Ahmed Tinubu won this election. Peter Obi's youth support is predominantly in the south and in the Christian-majority parts of the country; he clearly lacks support in the Muslim north, where I am from. I voted for Kwankwaso though.