Maybe “TF 3” or whatever they call it will be ergonomic and a pleasure to use, but that was the promise of TF 2, and unless you wanted to use Keras it was anything but. I’m glad I have been able to work only in PyTorch and Jax. Maybe TF will get better but I’m not holding my breath. On the other hand, XLA is very nice, and I hope they continue to develop it.
I hope torch has a convincing distributed tensor API coming soon. Their development on ShardedTensor seems to have slowed or stopped recently, so TF’s DTensor is definitely ahead, which is a shame. And of course TF’s ecosystem for the whole lifecycle is more mature with TFX, TF.js, etc, but torch is slowly closing those gaps, and hopefully that will continue.
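For readers unfamiliar with the term, the core idea behind a distributed/sharded tensor (whether torch's ShardedTensor or TF's DTensor) can be sketched in plain Python, with lists standing in for device-local shards. The class and method names here are made up for illustration; neither library is used:

```python
# Toy sketch of a sharded tensor: one logical 1-D array whose shards
# live on different "devices", gathered on demand.
class ShardedVector:
    def __init__(self, data, num_shards):
        # Split the logical vector into roughly equal contiguous shards,
        # one per (pretend) device.
        n = len(data)
        step = -(-n // num_shards)  # ceiling division
        self.shards = [data[i:i + step] for i in range(0, n, step)]

    def gather(self):
        # All-gather: reassemble the full logical vector.
        return [x for shard in self.shards for x in shard]

    def map(self, fn):
        # Elementwise ops run independently per shard (no communication).
        out = ShardedVector.__new__(ShardedVector)
        out.shards = [[fn(x) for x in shard] for shard in self.shards]
        return out

v = ShardedVector([1, 2, 3, 4, 5, 6], num_shards=3)
doubled = v.map(lambda x: 2 * x)
print(v.shards)          # [[1, 2], [3, 4], [5, 6]]
print(doubled.gather())  # [2, 4, 6, 8, 10, 12]
```

The hard parts the real APIs have to solve (device placement, resharding, collectives for ops that cross shard boundaries) are exactly what's missing from this toy.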
The ergonomics of TensorFlow are probably always going to be behind PyTorch's (or Keras', for that matter). The API has not been stable for the past 6 years, and it has burned me one too many times; I now flinch at using it. It is basically an internal Google tool that has been made available to the public, and like most internal tools at Google the deprecated/developmental dichotomy applies (https://goomics.net/50/).
That said, the deployment of TensorFlow models onto mobile devices or the browser is really good, so sometimes the pain is necessary.
PyTorch does have issues with both distributed tensor (which is easier to solve) and deployment (which is harder to solve, but solvable).
I also think Keras' Functional API is superior in composability to PyTorch's OOP model, but I am biased as a software engineer. It does feel like the community thinks the OOP model is much more hackable and thus easier to use.
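The contrast can be sketched in plain Python, with simple callables standing in for real layers (neither Keras nor PyTorch is used here; all names are illustrative):

```python
# Functional style (Keras-like): a layer is a value, and models are
# built by composing values, so any sub-graph is itself reusable.
def scale(k):
    return lambda x: k * x

def shift(b):
    return lambda x: x + b

def compose(*layers):
    def model(x):
        for layer in layers:
            x = layer(x)
        return x
    return model

backbone = compose(scale(2), shift(1))   # reusable sub-model
head = shift(10)
model_fn = compose(backbone, head)

# OOP style (PyTorch-like): the model is a class holding sub-modules,
# and the forward pass is imperative code you can freely hack on.
class Model:
    def __init__(self):
        self.k, self.b = 2, 1

    def forward(self, x):
        x = self.k * x          # easy to insert prints/branches here
        x = x + self.b
        return x + 10

print(model_fn(3))          # 17
print(Model().forward(3))   # 17
```

Both compute the same thing; the functional version composes like data, while the imperative forward() is trivially steppable in a debugger, which is the hackability the community values.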
All in all, it is still early days. We didn't have a competent all-in-one OSS SQL database until the late 2000s, which is 20-ish years after the theory was ready and taught extensively in school. And even after that, there was plenty of innovation around databases in the 2010s for new use cases. Frameworks for differentiable programming have a long way to go.
Merging Keras into TF and trying to copy PyTorch by adopting their model was the downfall of TF. The initial TF releases were great. It was simple, easy to reason about, and solved a clear problem. But then Google wanted TF to appeal to everyone and solve all problems for industry and research and beginners and experts at the same time because people need to get promoted internally. And that never goes well. Nobody I know wants to use TF these days (but some are forced to).
I object to this statement. The earlier releases of TF (ca. 2015) were impossible to debug and the documentation was always broken; if you tried to follow the Seq2Seq tutorial you know what I'm talking about. I'd argue that these releases were great for the 50 people who already knew how to use it, but they were aggressively unhelpful for beginners.
PyTorch won my lab (and certainly others) because you could add prints to check your dimensions while TF forced you to build a correct computation graph all at once. Performance? Sure, TF is probably faster. But I'd argue that TF's big mistake was not taking their new users into consideration.
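The debugging difference can be sketched in plain Python, with closures standing in for a deferred TF1-style graph (no actual TF or PyTorch here; all names are illustrative):

```python
# Eager (PyTorch-style): each op runs immediately, so you can print
# intermediate values and shapes mid-computation.
def eager_model(x):
    h = [v * 2 for v in x]
    print("hidden len:", len(h))   # works: h is a real value right now
    return sum(h)

# Graph (TF1-style): you first build a deferred graph of thunks;
# nothing has a value until the whole graph is run with a feed at the
# end, so a mid-graph print can only show a node object, not data.
def graph_model(x_placeholder):
    h = lambda feed: [v * 2 for v in feed[x_placeholder]]
    out = lambda feed: sum(h(feed))
    # print(h) here would show "<function ...>", not your numbers
    return out

session_run = graph_model("x")
print(eager_model([1, 2, 3]))         # 12
print(session_run({"x": [1, 2, 3]}))  # 12
```

Same result either way; the difference is that in the eager version a shape bug surfaces at the line that caused it, while in the deferred version it surfaces only at "session run" time, far from the code that built the bad node.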
I agree. I truly miss the graph-mode API, especially coming from Theano before TF, but it wasn’t as beginner friendly and Google wanted to capture market share for their cloud.
At least with JAX, the core library isn't adopting any of the framework-level stuff, so those can evolve independently.
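That separation can be sketched in plain Python: core transforms operate on ordinary functions, so framework-level layer libraries can be built on top independently. The `grad` below is a crude finite-difference stand-in for jax.grad and `jit` a caching stand-in for jax.jit; both are illustrative only:

```python
# Transforms take a function and return a new function; no knowledge
# of layers, modules, or any NN framework is needed.
def grad(f, eps=1e-6):
    # Central-difference approximation of df/dx (stand-in for jax.grad).
    return lambda x: (f(x + eps) - f(x - eps)) / (2 * eps)

def jit(f):
    # Caching stand-in for jax.jit: "compile" (here: compute) once per input.
    cache = {}
    def wrapped(x):
        if x not in cache:
            cache[x] = f(x)
        return cache[x]
    return wrapped

def loss(x):
    return x * x  # d/dx = 2x

dloss = grad(loss)            # transforms compose with plain functions...
fast_dloss = jit(dloss)       # ...and with each other
print(round(dloss(3.0), 4))       # 6.0
print(round(fast_dloss(3.0), 4))  # 6.0
```

Because the transforms only ever see plain functions, a Flax- or Haiku-style layer library is just more functions layered on top, which is why those can evolve without touching the core.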
The original TF was truly horrendous :) It was extremely unintuitive and slow to program; that's why so many people switched to PyTorch. The merge was a good idea, it just came too late. TF is as popular as it is just because it was first to market and because of Google's megaphone. Even inside Google, researchers don't want to use it, and a large fraction have switched to JAX.
The downfall was fragmentation. They merged Keras after TF started having multiple competing application libraries, each managed by different people (and each fighting for promotion, as you say). Keras just happened to be the most popular. Even after Keras, various teams (e.g., deepmind) have decided to make their own libraries.
On Hugging Face, almost 85% of models are exclusive to PyTorch, and even those that are not exclusive have about a 50% chance of being available in PyTorch as well. In contrast, only about 16% of all models are available for TensorFlow, with only about 8% being TensorFlow-exclusive.
That matches my experience reading a lot of ML research papers. Most papers that include code are PyTorch-only. Some have both PyTorch and TF. Few are TF-only. Then there is JAX and it's been growing.
I feel like the comments on backwards compatibility are due to the absolute shitshow of TF2 compatibility for TF1 code and models.
Also, the threat of PyTorch can be seen when reading between the lines, especially since it's now run by a foundation and is the darling of the diffusion model developments.
PyTorch has been a darling of almost every noteworthy open source model for the past 3-4 years (BLOOM, GPT-J, StyleGAN3, detectron, etc). Personally, I've only seen people use TensorFlow/XLA if they got free TPU credits from Google (gpt-neo), or if it was released by Google (t5).
I just wish they would put some focus on making TensorFlow reliable in a production environment.
It can easily be wildly non-deterministic across different CPUs or GPUs, or even in the same session with the same input.
Performance seems to get worse with new releases, and there are frequent subtle breaking changes when using models built with old versions on newer releases.
TensorFlow Serving is barely controllable and requires insane tuning to perform the same as PyTorch, but provides little to no docs.
The vast majority of models people build just don't work in TensorFlow Serving either, as you can't reach in with hacky Python to mess with internal state.
If you use a custom host instead, then you have to deal with literal gigabytes of Python dependencies, making your Docker images huge.
Memory usage is uncontrollable and causes terrible performance or instant host death. Results vary depending on CPU count, and automatic parallelism can reduce performance.
I just don't understand how Google uses TensorFlow internally for real-world services.
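One root cause of the run-to-run nondeterminism described above can be shown in a few lines of plain Python: floating-point addition is not associative, so a parallel reduction that happens to sum in a different order on different CPUs/GPUs returns different results for identical input:

```python
# Same four numbers, two summation orders, two different answers.
vals = [1e16, 1.0, -1e16, 1.0]

# Sequential left-to-right reduction (one order a kernel might use):
left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]

# Pairwise reduction (another order, as in a parallel tree-sum):
pairwise = (vals[0] + vals[2]) + (vals[1] + vals[3])

print(left_to_right)  # 1.0  (the first 1.0 was absorbed by 1e16)
print(pairwise)       # 2.0
```

This is an extreme example, but the same effect at small scale is why identical inputs can produce slightly different outputs whenever thread counts or reduction strategies vary between runs.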
I had my last-straw moment with TensorFlow some time ago. JAX has been a pleasure and I'm never looking back. It got to the point on my team where it was just easier to rewrite our entire distributed training infrastructure in JAX with pmap than to coerce TF2 into doing what I wanted.
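The data-parallel pattern that pmap makes convenient can be sketched in plain Python (made-up function names, with list slices standing in for devices; no actual JAX here):

```python
# Data parallelism: split the batch across devices, compute per-shard
# gradients with the SAME function, then average them (the all-reduce).
def grad_on_shard(w, shard):
    # Gradient of mean squared error for the toy model y = w * x,
    # averaged over this shard's (x, y) examples.
    g = [2 * x * (w * x - y) for x, y in shard]
    return sum(g) / len(g)

def pmap_style_step(w, batch, num_devices, lr=0.1):
    step = len(batch) // num_devices
    shards = [batch[i * step:(i + 1) * step] for i in range(num_devices)]
    # "pmap": run the same function on every shard (in parallel on real HW).
    local_grads = [grad_on_shard(w, s) for s in shards]
    # "pmean": all-reduce the local gradients to a global mean.
    global_grad = sum(local_grads) / num_devices
    return w - lr * global_grad

batch = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # y = 3x
w = 0.0
for _ in range(50):
    w = pmap_style_step(w, batch, num_devices=2)
print(round(w, 3))  # converges to 3.0
```

In real JAX the per-shard function is decorated with pmap and the averaging is a collective inside it; the appeal is that the whole distributed step stays this small and readable.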
Fix the damn abstraction bugs! Models of models only work in a limited fashion. It's a damn graph; I can't understand why they can't get this to work. For me, the TF ship has sailed.
I believe it is, whether they realize it or not. JAX/PyTorch + XLA is just so much better in so many ways. Development on XLA will (thankfully) continue, and JAX and PyTorch will continue to cannibalize the TF userbase.
Yes. If you see a blog post from Google that follows this title pattern and the text says what this one does, it means there's an explicit acknowledgement that something serious wasn't working and leadership decided to course-correct.
TF will continue to have a place at Google for prod work but its application base is going to continue to shrink. I'm just blown away they're rejiggering the distributed model again.
It's just that TensorFlow is big and bloated, has quite a few quirks, and isn't the cool thing to use anymore, so they're doing some work to address these issues. TensorFlow is still very popular, and you can get work done with it.
Hey, it's using Bazel, that makes things great! Well, Bazel won't work on your system, because protobuf needs a patch for a new libc. Why didn't you use Ubuntu? Now it builds! Oh wait, they are vendoring protobuf in TensorFlow...
(all that, while depending on a single-file, 3-release python library...)
p1esk | 3 years ago
What makes you say that?
mudrockbestgirl | 3 years ago
Let's hope JAX won't suffer the same fate.
claudppl | 3 years ago
(source: https://www.assemblyai.com/blog/pytorch-vs-tensorflow-in-202...)
learndeeply | 3 years ago
What are you using for serving PyTorch models?
cavisne | 3 years ago
Not because it's Google but because JAX has so much momentum lately.
Jabbles | 3 years ago
https://www.deepmind.com/blog/using-jax-to-accelerate-our-re...
jxy | 3 years ago
I love to hear that. But does Python even guarantee 100% backwards compatibility?
DSingularity | 3 years ago
For domain-specific languages like TF I guess they were motivated to commit to it to ensure adoption of the new versions.