naftaliharris | 6 years ago | on: Tauthon: Fork of Python 2.7 with new syntax, builtins, libraries from Python 3
naftaliharris's comments
naftaliharris | 6 years ago | on: Ask HN: Who is hiring? (July 2019)
SentiLink prevents synthetic fraud, an emerging fraud vector in which fraudsters open accounts using name/DOB/SSN combinations that don't correspond to real people. Our partners include top ten US banks, fintechs, and alternative lenders. We're backed by investors including Andreessen Horowitz, Max Levchin (Affirm CEO/PayPal co-founder), and former presidents/CEOs of Visa, TransUnion, HSBC, and Citi.
We recently closed a $14M Series A [1] and are hiring software engineers to help us build our identity platform. Our tech stack uses Go (for the API part) and Python (for the ML part) on k8s and the work involves a lot of complex and sensitive data.
Please apply at https://jobs.lever.co/sentilink.
[1] https://businessinsider.com/synthetic-fraud-detection-startu...
naftaliharris | 7 years ago | on: Ask HN: Who is hiring? (May 2019)
SentiLink prevents synthetic fraud, an emerging fraud vector in which fraudsters open accounts using name/DOB/SSN combinations that don't correspond to real people. Our partners include top ten US banks, fintechs, and alternative lenders. We're backed by investors including Andreessen Horowitz, Max Levchin (Affirm CEO/PayPal co-founder), and former presidents/CEOs of Visa, TransUnion, HSBC, and Citi.
We recently closed our Series A [1] and are hiring software engineers to help us build our identity platform. Our tech stack uses Go (for the API part) and Python (for the ML part) on k8s and the work involves a lot of complex and sensitive data.
Please apply at https://angel.co/sentilink/jobs or shoot a resume/github/linkedin to me, (my first name at a domain I'm sure you can guess).
[1] https://businessinsider.com/synthetic-fraud-detection-startu...
naftaliharris | 8 years ago | on: The New ID Theft: Millions of Credit Applicants Who Don't Exist
1. Unlike with ID theft, there's no consumer victim. With ID theft, eventually the victim will find out about it (by getting a call from a collections agent or seeing the trade on their credit report). They'll then contact the lender or the bureau and contest the validity of the loan. The end result is that the lender gets a stream of loans that are labeled as identity theft losses. Since there's no consumer victim with synthetic fraud, though, lenders don't get this stream of labeled data and have a hard time knowing which of their losses are synthetic fraud (and which are just ordinary credit losses).
2. Synthetic fraud cuts right through typical ID theft prevention systems. ID theft prevention is about checking whether the applicant is the same as the identity they're using to apply for credit. So you check if the email the applicant uses matches the identity (e.g., you don't want [email protected] used as the email for Jane Smith), you check the phone number, you check if the applicant can complete KBA (knowledge-based authentication, e.g. questions about previous addresses), you check the billing address, and so forth. But synthetic identities have their own aged phone numbers, emails, addresses, and credit histories, so all of these verifications go through without any flags raised. Essentially, the ID theft prevention system checks whether the applicant is the same as the identity they're using, but with synthetic fraud the applicant created the identity.
Source: my startup focuses heavily on preventing synthetic fraud for lenders (PM me for details).
naftaliharris | 8 years ago | on: Ask HN: Who is hiring? (March 2018)
SentiLink is reinventing identity, beginning with financial services in the United States. The current system is broken: SSNs are used as both a username and a password, but after repeated data breaches are also effectively semi-public. Identity-verification data isn't shared, so the same fraudsters target every company and consumers have to continually reverify themselves with different institutions. Billions of dollars are lost every year to criminals who are very rarely caught or punished. SentiLink is building the arbiter of identity to bring identity into the 21st century.
Our investors include former co-founders and C-level execs at PayPal, Palantir, Affirm, Visa, and Citibank, including Max Levchin (SciFi) and Hans Morris (Nyca Partners).
Apply here: https://angel.co/sentilink/jobs or email me (first name at sentilink.com).
naftaliharris | 8 years ago | on: As Computer Coding Classes Swell, So Does Cheating
That said, I've got to imagine that claims that "as many as 20 percent of the students in one 2015 computer science course were flagged for possible cheating" are a misrepresentation or a misunderstanding on the part of the journalist. I mean, sure, if you set the threshold for the plagiarism detector at a low level, you can flag 20%, 50%, or however many students you want for "possible cheating", but that doesn't mean the cheating is real.
naftaliharris | 8 years ago | on: Show HN: Velo.com – a marketplace for used bicycles
naftaliharris | 9 years ago | on: Paradoxes of probability and other statistical strangeness
naftaliharris | 9 years ago | on: Paradoxes of probability and other statistical strangeness
Even crazier, the James-Stein Estimator [1], which does this, actually uses data about the football player and soccer player to make predictions about the baseball player (and vice versa). This is deeply unintuitive to most people, since the players aren't related to each other at all. The phenomenon only holds with at least three players; it doesn't work for two.
(More generally, Stein's Paradox is the fact that if you have p >= 3 independent Gaussians with a known variance, you can do better in estimating their p-dimensional mean than just using their sample means).
I've spent a bunch of time trying to understand why this actually works [2]; to be honest, I still don't deeply understand it. But nonetheless the consensus is that the same shrinkage phenomenon is what causes the improved performance of a variety of high-dimensional estimators (e.g., lasso or ridge regression), making the paradox very influential.
[1] https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator
[2] https://www.naftaliharris.com/blog/steinviz/
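To make the shrinkage concrete, here is a minimal sketch of the James-Stein estimator in plain Python. It assumes a known common variance and shrinks toward the origin; the function name and the omitted positive-part correction are my own simplifications:

```python
def james_stein(x, sigma2=1.0):
    """Shrink a vector of observed means toward the origin.

    x: a list of p >= 3 observations, one per independent Gaussian mean;
    sigma2: the known common variance.
    """
    p = len(x)
    assert p >= 3, "Stein's paradox only holds for p >= 3"
    norm_sq = sum(v * v for v in x)
    # Each component is pulled toward 0 by the same data-dependent factor.
    shrinkage = 1.0 - (p - 2) * sigma2 / norm_sq
    return [shrinkage * v for v in x]

# The baseball, football, and soccer observations jointly determine
# every estimate, even though the players are unrelated:
print(james_stein([2.0, 2.0, 2.0]))  # each 2.0 shrinks by a factor of 11/12
```

Note how the shrinkage factor depends on all three observations at once, which is exactly the unintuitive coupling described above.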
naftaliharris | 9 years ago | on: Software Engineer Starts Unlikely Business: A Weekly Newspaper
So I expect that a big factor in whether quality local newspapers can survive is the strength of the local housing market (measured through, e.g., median house price and yearly volume). As a practical matter, this means that local news is financially feasible only in relatively affluent places (although the housing market isn't the only reason why that's the case). It also means that more people searching for property online may present a challenge for local news.
[1] http://www.paloaltoonline.com/morguepdf/2017/2017_03_24.paw....
naftaliharris | 9 years ago | on: H&R Block and Intuit Are Lobbying Against Making Tax Filing Free and Easy
naftaliharris | 9 years ago | on: Ask HN: Have you created a programming language and why?
It lets people with Python 2 code start to use new features from Python 3. (It's a backwards-compatible fork of Python 2.7 with features like async/await, function annotations, and keyword-only arguments backported from Python 3).
naftaliharris | 9 years ago | on: Python 2.8?
> Backporting features will be hard
It's actually pretty straightforward: I just find the relevant changes in the Python 3 history, and apply them to Python 2. Usually a handful of things have changed between 2 and 3, so I typically can't just pipe the diff to "git apply -3", but frankly backporting features is more tedious than difficult. If you're interested, for example, here's my recent implementation of the "nonlocal" keyword; you can see the commit messages reference the Python 3 commits: https://github.com/naftaliharris/placeholder/pull/60
> Why try to create an inferior python 3?
The ultimate goal is to build an interpreter that can run both Python 2 and 3 code. Unfortunately, there is some code that runs and has different behavior under Python 2 and Python 3 (e.g., 'print("a", "b")' ), so anyone who wants to write an interpreter that can run both kinds of code will need to decide what to do there. I decided to defer to Python 2 behavior in those cases, since most of my code is in Python 2 and I don't want to change it. :-)
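The 'print("a", "b")' ambiguity mentioned above can be shown concretely. This sketch runs under Python 3, with the Python 2 behavior reproduced via `repr` for comparison:

```python
# The same source line parses under both interpreters but means different
# things: in Python 2, print ("a", "b") is the print *statement* applied
# to a tuple; in Python 3, it is a call to the print *function*.
py3_output = "{} {}".format("a", "b")  # what Python 3's print("a", "b") emits
py2_output = repr(("a", "b"))          # what Python 2's print ("a", "b") emits
print(py3_output)
print(py2_output)
```

An interpreter accepting both dialects has to pick one of these two behaviors for that line, which is the design decision described above.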
naftaliharris | 9 years ago | on: Library-managed 'arXiv' spreads scientific advances rapidly and worldwide
naftaliharris | 9 years ago | on: Why I'm Making Python 2.8
When Python 3 was released, it offered Python users a trade: In exchange for a productivity loss (porting your Python 2 code), you'd get a productivity gain (new features in Python 3 and removed cruft). Some projects and companies thought this was a good trade and have upgraded over the years; many did not, and still haven't. The interpreter I've been working on tries to improve on the terms of that deal for people who have not switched to Python 3.
> What a terrible, terrible situation. Now you'll have "python" code that will neither run on 2.7 nor run compliantly on 3.x.
That's the point, yes. Obviously any interpreter that's backwards compatible with 2.7 but includes new features from 3.x is going to let people write code that doesn't run under 2.7 or 3.x. But what does it matter if your code doesn't run under interpreters that you aren't using and don't intend to use?
> Just call it anything else
I'll change the name.
naftaliharris | 9 years ago | on: Why I'm Making Python 2.8
It is possible actually, that's kind of the point! The interpreter I've been working on passes the 2.7 unit tests (i.e., those in Lib/test/), as well as unit tests for the new features that have been backported from Python 3.
Even if you don't believe me, it's interesting to note that, e.g., while Python 3.0 was being developed, function annotations and keyword-only arguments coexisted with tuple unpacking. I built the code and ran it myself, in fact: https://twitter.com/naftaliharris/status/784421498291310592. Tuple unpacking was actually removed later, introducing the backwards incompatibility after the new functionality had been added. Timeline:
Oct 2006, keyword-only arguments.
Dec 2006, function annotations.
Mar 2007, removing tuple unpacking.
There was also a promising backport of keyword-only arguments to CPython 2.6 (!) that was never merged (http://bugs.python.org/issue1745) due to lack of follow-through.
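For readers who haven't seen them, here is what the two backported features from that timeline look like in use. This is a Python 3 sketch with names of my own choosing; the feature that was removed, tuple unpacking in parameters, appears only as a comment because it is a syntax error in Python 3:

```python
# Keyword-only arguments: everything after the bare `*` must be passed by
# keyword. Function annotations: the metadata after `:` and `->`.
def scale(values: "list of numbers", *, factor: float = 2.0) -> list:
    return [v * factor for v in values]

print(scale([1, 2], factor=3.0))  # [3.0, 6.0]
# scale([1, 2], 3.0) would raise TypeError: factor is keyword-only.

# Tuple unpacking in parameters, removed in Mar 2007 for Python 3.0:
#     def add((a, b)): return a + b   # valid Python 2, SyntaxError in Python 3
```

As the timeline shows, all three constructs briefly coexisted in the 3.0 development tree before tuple unpacking was removed.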
naftaliharris | 9 years ago | on: Why I'm Making Python 2.8
A lot of people here have strong opinions about the name "Python 2.8". I don't mind changing it and intend to do so (https://github.com/naftaliharris/python2.8/issues/47). I picked it initially because, when talking with friends about this project, it conveyed pretty darn immediately what the project is and does. I'd be very keen to hear people's suggestions for alternate names!
For those of you with 2.7 codebases or projects, I'd be extremely interested in hearing about whether you were able to get this interpreter to run your code. Personally, the biggest challenges I've had so far are with dependencies that check for `sys.version_info[:2] == (2, 7)` as opposed to something like `sys.version_info[0] < 3`. But I'd be very interested in other people's experiences, particularly with larger codebases.
[1] A minor and somewhat pedantic point: The interpreter I've been working on includes PEP 515 (underscores in numeric literals), which is new in 3.6. I didn't think it was right for me to "take credit" for this new feature before it was even out in Python 3.6. Obviously, the real credit for this feature existing (in 3.6 or in any interpreter) goes to the CPython core devs, and especially Georg Brandl.
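The difference between the two version checks mentioned above can be seen with a small sketch (the helper name is mine):

```python
def accepts(version_info):
    """Return (exact_check, major_check) for a given sys.version_info tuple."""
    exact = version_info[:2] == (2, 7)  # brittle: rejects a 2.8-style release
    major = version_info[0] < 3         # robust: accepts every 2.x interpreter
    return exact, major

print(accepts((2, 7, 15)))  # (True, True)
print(accepts((2, 8, 0)))   # (False, True): the exact pin wrongly rejects it
```

Dependencies that pin the minor version exactly will refuse to run on a 2.7-compatible interpreter that reports a different minor version, even though nothing they rely on has changed.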
naftaliharris | 9 years ago | on: Ask HN: What are the best resources to learn Python for Data Analysis
naftaliharris | 10 years ago | on: Visualizing K-Means equilibria
naftaliharris | 10 years ago | on: Ask HN: What alternative to find and xargs do you use?
It was a fun project; I learned a lot about how the CPython implementation works and have a lot of respect for the people who built it. It was surprisingly easy to implement Tauthon based on the work the core dev team did on Python 3: https://www.naftaliharris.com/blog/nonlocal/
For what it's worth, I do believe that Python 3 is a better language than Python 2. We use Python 3.7 at my work (SentiLink) and we've had a good experience with it. (If you're starting a new project or can migrate, I'd recommend it.) But I do think that the ~10-year saga of upgrading from Python 2 to Python 3 wasn't necessary when the main benefit was really the Unicode refactoring.
I no longer maintain Tauthon personally but there are others who are excited about the project who occasionally add new features or bugfixes.