Some thought after Pycon 2018
Python has become, if not the de-facto standard for data science, then at least one of the biggest contenders. As we wrote about in a previous entry, we sent a group of our top data engineers and developers to learn about the latest news in data science, and Python development in general. We share below some of the notes and impressions from this year’s Pycon conference for those of you that didn’t have the chance, or time to attend.
Ethics in data science
One very interesting and thought-provoking keynote talk was about the ethics of data science, and was held by Lorena Mesa from GitHub. She is a former member of the Obama campaign, as well as member of the Python Software Foundation board. In this talk, she presented experiences from the 2008 US-presidential campaign, and the role of data science in the rise of social media as a political platform. She also discussed the dangers that we have seen emerge from that in the years to follow. Data science have emerged as a powerful tool for spreading well intended information, not so well intended (dis-)information, or monitoring people for their political view, or even attempts at preemptive policing.
One of the most scary examples was a decidedly minority-report style scenario, in which police used an automated opaque system to give scores from 0 – 500 to estimate how likely individuals were to commit crime, and used this information to affect policing actions (this was done in Chicago, and there has been a strong backlash in media). An extra worrisome part of this is the black-box approach in which we cannot quite know what factors the system takes into consideration, or the biases that are inherent due to the data with which it has been built. Another example on this note was an investigation made by the American Civil Liberties Union (ACLU) in which they used a facial recognition tool (and recommended settings) that had been sold to the police, and used it to match members of the U.S. Congress versus a database of 25000 mugshots. The system falsely matched 28 Congress members to mugshots, with a disproportionate number of these matches being against Congress members of colour. This is a tricky problem where the inherent socioeconomic issues laying behind the source material (the mugshots) are carried through to the predictions done by the system in non-obvious ways. Something that surely would need to be addressed, and being taken into consideration before we can allow ourselves to trust the results of such a system.
Finally, perhaps it is time for us data engineers to consider, and at least start the discussion about the larger ramifications of the type of data we collect, the algorithms we train, and how our results affect our society. Perhaps it is time for a hippocratic oath for data scientists?
Quantum Computing with Python
Over the last decade, Quantum computing has advanced from the realm of science fiction to actual machines in research labs, and now even to be available as a cloud computing resource. IBM Q is one of the frontier companies in research in Quantum computer, and they provide the open-source library qiskit, which allows anyone to experiment with Quantum computation algorithms. You can use qiskit to either run a simulator for your quantum computing programs, or you can use it connect over the cloud with an actual quantum computing machine housed at IBM’s facilities to test run the algorithms.
The size of the machines, counted as number of quantum bits, has been quite limited for a long time, but it is now fast approaching sizes that cannot conveniently be simulated with normal machines.
Contrary to popular belief, a quantum computer cannot solve NP hard problems in polynomial time, unless we also have P = NP. Instead, the class of problems that can be solved by a quantum computer is called BQP, and it is known that BQP extends beyond P, and contains some problems in NP, but not NP-hard problems. We also know that BQP is a subset of PSPACE.
This has the consequence that we can easily solve important cryptographical problems such as prime-factorization quickly with a sufficiently large quantum computer, but we cannot necessarily solve eg. NP-complete problems (such as 3-SAT), or planning, or many of the other problems important for artificial intelligence. Nonetheless, the future of quantum computing is indeed exciting, and will completely change not just encryption, but also touch on almost all other parts of computer science. An exciting future made more accessible through the python library qiskit.
A developer amongst (data) journalists
Eléonore Mayola shared her insights from her involvement in an organization called J++, which stands for Journalism++, as a software developer who aids journalists in sifting through vasts troves of data to uncover newsworthy facts, and also teaches them basic programming skills. She showcased a number of examples of data-driven journalistic projects, ranging from interactive maps of Sweden displaying the statistics of moose hunts, or insurance prices, through the Panama Papers revelations, to The Migrants’ Files, a project tallying up the cost of the migrant crisis in terms of money, and lost human lives.
When it comes to her experience teaching journalists to code, some of the main takeaways presented were that even the most basic concepts, which many professional software developers would find trivial, can already have a big impact in this environment. Another point was that it is important to keep a reasonable pace, and avoid overwhelming students with too much information at once, and last, but not least, that the skills of software developers are sorely needed even in fields that many of us probably wouldn’t even consider working in.