Apache NiFi | WingNews

[+] cs02rm0|5 years ago|reply

I've used it a fair bit, though not for a couple of years. Few points, some of which may be out of date:

* I've seen customers fall into the trap of thinking they don't need expensive developers because you can drag and drop, just people who can use a mouse can crack on with NiFi.

* It persisted its config to an XML file, including the positions of boxes on the UI. Trying to keep this config in source control with multiple devs working on it was impossible.

* Some people take the view that you should use 'native' NiFi processors and not custom code. This results in huge graphs of processors with 1000s of boxes with lines between you have to follow. Made both better and worse by being able to descend and ascend levels in the graph. The complexity that way quickly becomes insane.

* You're essentially programming with it. I've no doubt you could use it to write, say, an XMPP server if so inclined. Which means you can do a great many things of huge complexity. Programming tools have developed models for inheritance and composition, abstraction, static analysis, etc. which NiFi just didn't have. The amount of repeated logic I've seen it's configuration accumulate is beyond anything I've seen from any novice programmer.

I ended up feeling like it could be an OK choice in a very small number of places, but I never got to work on one of those. The NSA linking together multiple systems with a light touch is possibly one such use case. For most everyone else, I couldn't recommend it.

[+] zokier|5 years ago|reply

> I've seen customers fall into the trap of thinking they don't need expensive developers because you can drag and drop, just people who can use a mouse can crack on with NiFi.

That is sort of the key problem I see with NiFi (and equivalents). The heavy emphasis on graphical UI and visual paradigm sort of implies that its oriented towards non-developers, but problem is that it doesn't make non-developers suddenly expert system architects or developers even if they manage to click through the UI. And many developers probably prefer just defining stuff in code instead of having fancy UIs. So it sort of falls between these to categories.

Of course there is huge spectrum of skill in people, and there are probably plenty of "semi-technical" persons to whom this is perfect match, especially if supported by some more techy people.

[+] ekianjo|5 years ago|reply

Its indeed drag and drop but as soon as you want to do something that's slightly a little more complex than the regular processors in place, you need to use regex filters, branching, AVRO converters, and non-tech users will be lost very quickly.

I see it very useful to automate certain operations (watch S3 storage, take action as soon as object comes in and store in into a DB), as for such use cases it's pretty much drag and drop.

[+] tiew9Vii|5 years ago|reply

Another NiFi article was in Hacker News the other week with a fair few positive comments.

I left a comment with my thoughts and we pretty much agree with the exact same issues. Good to see I’m not the only one as seeing some organisations use it more and more to the extent insisting any data ingested in to systems need to go via nifi. Granted some of these are extremely large and disfunctional companies.

[+] closeparen|5 years ago|reply

I have not found that argument persuasive to the managers who believe that coding is inherently wasteful if what you’re trying to do is technically possible in a workflow builder GUI.

[+] ashtonkem|5 years ago|reply

The idea that some sufficiently smart compiler/language/tool will replace expensive developers is a siren song that’s as old as COBOL. Most of these tools have historically failed to deliver much (if any) value. In the few cases where they genuinely made programming easier, the bar for what computers were expected to accomplish was raised to the point where specialized labor was once again required.

[+] leoh|5 years ago|reply

Curious if anyone from the NiFi team cares to comment / has thoughts on how to work around these issues (e.g. source control)

[+] _57jb|5 years ago|reply

We used NiFi...one of the worst experiences.

It installs like an appliance and feels like you are grappling with a legacy tool weighed down by a classic view on architecture and maintenance.

We had built a data pipeline and it was for very high-scale data. The theory of it was very much like a TIBCO type approach around data-pipelines.

Sadly the reality was also like a TIBCO type approach around data-pipelines.

One persons experience and opinion and I am super jaded by it due to some vendor cramming it down one of our directors throats who subsequently crammed it down ours when we warned how it would turn out. It ended up being a very leaky and obtuse abstraction that didn't belong in our data-pipeline when you planned how it was maintained longer-term.

I ultimately left that company. It had to do with as much of their leadership and tooling dictation as anything else, NiFi was one of many pains. I am sure there are places that are using NiFi who will never outgrow the tool so take it with a grain of salt.

Said company ultimately struggled for the very reasons those of us who left were predicting (the tooling pipeline was a mess and was thrashing on trying to get it right, constantly breaking by forcing this solution, along with others, into the flow. Lots of finger-pointing).

Sucks to have that: "I told you so..." moment when you never wanted that outcome for them....I just couldn't be a part of their spiral anymore.

[+] contravariant|5 years ago|reply

Nifi is a very powerful tool, but also a very specific one, and a self described 'necessary evil'. It does one heck of a job at getting data from A to B though.

[+] taftster|5 years ago|reply

> It installs like an appliance and feels like you are grappling with a legacy tool weighed down by a classic view on architecture and maintenance.

This is actually a fair and well-articulated point of view. NiFi is currently an "appliance" like you said. Worse, it's a Pet and not a Cattle.

I believe there is active work in the community to address some of that pain. For example, there was a recent addition to NiFi called "stateless NiFi" which enables NiFi to better run in Kubernetes and other "cloud" architectures.

It's not there yet, it's still what would be described as a "fat" application. But I believe that eventually NiFi will evolve to more like a command-and-control tool for the cloud and less like something you have to install directly to your hardware. We hopefully see the day where "NiFi-as-a-Service" exists, which would really be an improvement over the current model.

[+] edmundsauto|5 years ago|reply

Can you elaborate by what you mean on a TIBCO like approach? I haven't used their tools, but would like to know more about the issues you ran into. What were examples of the leaky abstraction>?

[+] sixdimensional|5 years ago|reply

Curious, did you have a preferred alternative?

I get the feeling you described, Nifi has a.. heavy and highly structured feel to it, but lighter alternatives are not as integrated, say... Airflow, Streamsets, AWS Glue, Kafka (different beast) etc.

That said, Nifi is incredibly powerful and complete considering it's open source and free.

[+] MrBuddyCasino|5 years ago|reply

> Sadly the reality was also like a TIBCO type approach around data-pipelines.

Thats exactly how it looks like, thanks for confirming. Will avoid.

[+] gopalv|5 years ago|reply

NiFi's biggest strength is that it is a 2-way system - it is not Storm, it is not Flink, it is not Kafka, it is not SQS+Lambda.

I like to think of it like Scribe from FB, but with an extremely dynamic configuration protocol.

The places where it really shines is where you can't get away with those 3 and the problem is actually something that needs a system which can back-pressure and modify flows all the way to the source - it is a spiderweb data collection tool.

So someone trying to Complex Event Processing workflows or time-range join operations with it, will probably succeed at the small scale, but start pulling their hair out at the 5-10GB/s rate.

So its real utility is in that it deploys outside your DC, not inside it.

This is the Site-to-Site functionality and MiniFI is the smallest chunk of it, which can be shrunk into a simple C++ something you can deploy it in every physical location (say warehouse or grocery store).

The actually useful part of that is the SDLC cycle for NiFi, which lets you push updates to a flow. So you might want to start with a low granularity parsing of your payment logs on the remote side as you start, but you can switch your attention over it to & remove sampling on the fly if you want.

If you're an airline flying over the arctic, you might have an airline rated MiniFI box on board which is sending low traffic until a central controller pushes a "give me more info on fuel rates".

Or a cold chain warehouse which is monitoring temperature on average, until you notice spikes and ask for granular data to compare to power fluctuations.

It is a data extraction & collection tool, not a processing and reporting tool (though it can do that, it is still a tool for bringing data after extraction/sampling, not enrichment).

[+] monstrado|5 years ago|reply

Incredible piece of software. I've used it in production at my last two jobs. You can build almost anything in NiFi once you get into the mindset of how it works.

A good way to get started with NiFi is to use it as a highly available quartz-cron scheduler. For example, running "some process" every 5 seconds.

Disclaimer: I'm an Apache NiFi committer.

An article you might find interesting about it's ability to scale.

https://blog.cloudera.com/benchmarking-nifi-performance-and-...

Disclaimer v2: I used to work at Cloudera

[+] nightowl_games|5 years ago|reply

This page is classic Apache project in that I read it and have no idea what it does. Can you high level explain what this thing is really for?

[+] chirau|5 years ago|reply

Quick question, what does the role of Data Engineer at Epic Games entail and what technologies are you working with?

[+] taftster|5 years ago|reply

NiFi at first glance sometimes just looks like a glorified GUI for building out a data-delivery application. But NiFi doesn't just compile an application to be deployed on your network. Instead, the "power" of NiFi is that it allows an operations staff to perform the regular day-in-day-out task of monitoring, regulating and if needed modifying the delivery of data to an enterprise.

NiFi gives insight to your enterprise data streams in a way that allows "active" dataflow management. If a system is down, NiFi allows dataflow operations to make changes and deal with problems directly, right at tier 1 support.

It's often the case that an enterprise software developer has an ongoing role of ensuring the healthy state of the applications from their team. They don't just develop, they are frequently on call and must ensure that data is flowing properly. NiFi helps decouple those roles, so that the operations of dataflow can be actively managed by a dedicated support team that is more tightly integrated with the "mission" of their dataflow.

NiFi additionally offers some features that most programmers skip to help with the resiliency of the application. For example:

- the concept of "back pressure" is baked into NiFi. This helps ensure that downstreams systems don't get overrun by data, allowing NiFi to send upstream signals to slow or buffer the stream.

- data provenance, the ability to see where every piece of data in the system originated and was delivered (the pedigree of the data). Includes the ability to "replay" data as needed.

- dynamic routing, allowing a dataflow operator to actively manage a stream, splicing it, or stopping delivery to one source and delivering to another. Sources and Sinks can be temporarily stopped and queued data placed into another route. Representational forms can be changed (csv -> xml -> json, avro), and even schemas can be changed based on stream.

Anyone can write a shell script that uses curl to connect with a data source, piping to grep/sed/awk and sending to a database. NiFi is more about visualizing that dataflow, seeing it in real-time, and making adjustments to it as needed. It also helps answer the "what happens when things go wrong" question, the ability to back-off if under contention, or replay in case of failure.

(disclaimer: affiliated with NiFi)

[+] banjoriver|5 years ago|reply

NiFi is vey good at reliably moving data at very high volumes, low latency, with a large number of mature integrations, in a way that allows for fine grained tuning, and i've seen first hand that it is very scalable. It's internal architecture is very principled: https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.ht...

Out of the box it is incredibly powerful and easy to use; in particular it's data provenance, monitoring, queueing, and back pressure capabilities are hard to match; custom solution would take extensive dev to even come close to the features.

It is not code, and that means it is resistant to code based tooling. For years it's critical weakness was related to migrating flows between environments, but this has been mostly resolved. If you are in a place with dev teams and separate ops teams, and lots of process required to make prod changes, then this was problematic.

However, the GUI flow programming is insanely powerful and is ideal when you need to do rapid prototyping, or quickly adapt existing pipelines; this same power and flexibility means that you can shoot yourself in the foot. As others have said, this is not a tool for non technical people; you need to understand systems, resource management, and the principles of scaling high volume distributed workloads.

This flow based visual approach makes understanding what is happening easier for someone coming later. I've seen a solution that required a dozen containers of redis, two multiple programming languages, zookeeper, a custom gui, and and mediocre operational visibility, be migrated to a simple nifi flow that was 10 connected squares in a row. The complexity of the custom solution, even though it was very stable and had nice code quality, meant that that solution became a legacy debt quickly after it was deployed. Now that same data flow is much easier to understand, and has great operational monitoring.

Some suggestions: - limit NiFi's scope to data routing and movement, and avoid data transformations or ETL in the flow. This ensures you can scale to your network limits, and aren't cpu/memory bound by the transformation of content. - constrain the scope of each instance of nifi, and not deploy 100s of flows onto a single cluster. - you can do alot with a single node, only go to a cluster for HA and when you know you need the scale.

[+] unixhero|5 years ago|reply

Phew! Happy to have read the comments here. They say a lot. I will go with Apache Airflow for all my workflow needs from now on. I wasn't entirely sure if this was the best bet, but after seeing all of this I am now.

I know a massive installation [0] which is about to be open sourced, where Apache NIFI is used in the middle of the stack as a key component. No dismissal of the capabilities this package offers intended.

[0] https://sikkerhetsfestivalen.no/bidrag2019/138

slides [slide #32]: https://static1.squarespace.com/static/5c2f61585b409bfa28a47...

[+] pacofvf|5 years ago|reply

For the love of god, don't use NiFi to trigger an Airflow DAG.

[+] recov|5 years ago|reply

Can you expand? We just set this workflow up and it seems to be working fine.

[+] yawz|5 years ago|reply

If you're considering Apache NiFi, you should also look at Apache Airflow and Uber Cadence to decide what model would work best for you.

[+] ekianjo|5 years ago|reply

They do have totally different use cases, so it should be fairly quick to decide which one is for you.

[+] corndoge|5 years ago|reply

Can someone explain what this is? I can't find anything on the website that explains it

[+] gregw134|5 years ago|reply

Sure, I've worked with it. It let's you visually build data pipelines. It's extremely useful for getting work done quickly--you just drag and drop prebuilt connectors to things like elasticsearch, s3, or twitter and you have a data pipeline, including automatic backoff and the ability to inspect the data at each step. It's visual so it's easy to tell what's going on. Biggest downside is it's not automatically distributed. You can set it up to be distributed, but you have to do the plumbing yourself on the nifi graph by dropping nodes for routing tasks between nifi servers. Overall, perfect tool for quickly building a pipeline that can be easily shown to the business and in which you can visually see all data and errors at each step.

[+] zokier|5 years ago|reply

Disclaimer: I haven't worked with NiFi specifically, but with another similar product

As far as I know, it at least partially fulfills the role of "enterprise application integration" pattern; the idea is sort of proto-servicemesh where you have bunch of weird enterprise applications all around and you need to get them talking to each other, so you plop in a fancy EAI middleware thingy in the middle which then will talk to each of the applications individually, converting and transforming data from one format and protocol to another. But its not just a dumb hub, in addition to doing arbitrarily complex transformations, it can also make "routing" decisions based on the data itself. And this pattern being product of its time, instead of defining the network in code in some nice DSL everything is just "configured" and lots of things is achieved through clicketyclicking through GUI. I suspect this is at least partially due "configuring a turn-key product" sounding less scary to managers than "developing a system on framework".

If you look at the docs page of NiFi[1] and scroll quickly through the long list in the left sidebar with all the "processors", you already can get an idea what sort of things it does.

While this sort of solutions can be very useful in some situations, I wouldn't necessarily start designing a greenfield architecture around NiFi. But if you end up running it, then you might find that piece by piece it will accumulate all sorts of bits and bops of things because it's "convenient" to throw it in there, while the logic flows become more and more eldritch in nature.

[1] http://nifi.apache.org/docs.html

[+] paullth|5 years ago|reply

Been that way for at least 5 years https://news.ycombinator.com/item?id=10190846

There's a few articles in the old comments that explains it's use case a little

[+] contravariant|5 years ago|reply

I quite like the description they give in the docs[1], a brief summary:

> Over the years dataflow has been one of those necessary evils in an architecture. [...] NiFi is built to help tackle these modern dataflow challenges.

[1]: http://nifi.apache.org/docs.html

[+] _57jb|5 years ago|reply

Think TIBCO data pipelines.

You create sources of data and can have non-technical (or maybe non-developer) roles wire together these data pipelines with transformations, aggregations etc.

[+] rfsliva|5 years ago|reply

We are using NiFi as our dataflow engine for real time data ingest. We are using a current version, 1.11.4, and have several instances running including a development instance. The interface provides our team the ability to do quick iterative development and testing. An example of one of our use cases is we have 2 dataflows that ingest data from 2 different vehicle location/status systems and pump them into SQL Server. At the same time another dataflow merges the data from SQL Server and sends the data to Azure Event Hub. These dataflows were easy to setup, test and extend. This replaced a process that was written in Go.

[+] endlessmike89|5 years ago|reply

Nifi is a good (not great) tool, mostly because of all of the functionality you get out of the box. It comes with almost any kind of connector you would need for moving data. There's a pretty steep learning curve, but once you push through that, creating a new data flow from scratch is quick and easy. It sucks that other people in this thread have had bad experiences with Nifi, and I can't say that I haven't. However, it has generally been a positive addition to my team's stack.

[+] haddr|5 years ago|reply

Had some second hand opinions on running NiFi in prod and all of them were rather negative, some saying it was a mistake. That was around one year ago. I wonder if things have changed since then.

[+] sixhobbits|5 years ago|reply

I have never heard of this before, and I'm sad that profit-driven, marketing speak has taken over even non-profit product pages.

> An easy to use, powerful, and reliable system.

This is the title. That's the most important sentence, and it's absolutely meaningless.

It's bad enough that everything has to "sell" - just describe to me what your product does and I'll decide if I need or not. Don't try to convince me.

If you have to sell, do it by differentiating yourself from your competitors. No one is calling themselves "Difficult to use, weak, and unreliable", so saying the opposite is not differentiation.

When did we accept that marketing-speak was default communication. Can't we have some landing pages that are essays? Or even a few paragraphs instead of trying-to-be-catchy bullet point phrases in large font?

[+] taywrobel|5 years ago|reply

> This is the title. That's the most important sentence, and it's absolutely meaningless.

Well, yeah, it's meaningless if you cut off the second half...

> ...to process and distribute data.

That's what it does. The adjectives before it aren't the meat of the sentence.

[+] pazo|5 years ago|reply

I have experience from multiple projects with NiFi and it was the main reason for me and others quitting the company. Somehow management were convinced by some salesmen that this would be the golden bullet, however, all of their deliveries were delayed. We experienced issues debugging flows with performance problems, and even basic version control was problematic due to ids being replaced every time.

[+] josephmosby|5 years ago|reply

NiFi is a fantastic tool for a certain set of organizational constraints.

* It doesn't need much in the way of dependencies to run. If you can get Java onto a machine, you can probably get NiFi to run on that machine. - That is HUGE if you are operating in an environment where getting any new dependencies installed on a machine is an operational nightmare.

* It doesn't require a lot of overhead. Specifically, no database.

* You can write components for it that don't require a whole lot of tweaking for small changes to the incoming data. So, if I have a machine processing a JSON file that looks like XXYX and another machine processing a nearly identical JSON file that looks like XYXX, the tweaks can be made pretty easily.

So, if you're looking for a lightweight, low overhead, easily configurable tool that may be running in an environment where you've got to run lots of little instances that are mostly similar but not quite, NiFi is great.

If you are running a centralized data pipeline where you have a dedicated team of data engineers to keep the data flowing, there are better options out there.

[+] tspann|5 years ago|reply

No more XML. Check out NiFi 1.11.4, it does everything you need for easy ingest. If you are reading some files putting them into Kafka or S3 or a database or MongoDB or Hbase or Hive or Impala or Oracle or Kudu or ..., it's genius.

https://www.datainmotion.dev/

[+] advaita|5 years ago|reply

As far as I see, your link has nothing to do with 1.11.4 release. May be you wanted to link to a specific page?

[+] Sodman|5 years ago|reply

Having used NiFi in production, my biggest issue with it is handling source control and multiple environments. As the "IDE" is effectively also the runtime, the lines between "local", "stage", and "prod" are easy to blur.

They have a built-in source control product called "NiFi Registry", which can even be backed by git. The workflow for promoting flows between environments feels clunky though, especially as so much environment-specific configuration is required once your number of components gets high enough.

Moving our Java, Ruby or Go code between environments or handling versioning and releases was a piece of cake, in comparison.

[+] tomrod|5 years ago|reply

Do I understand what this is: general purposes SSIS-type data integration, pipeline, and workflow tool?

If so, how does it compare to SSIS, dbt, and other projects (please name!)?

Otherwise, what is an analogous toolset?

[+] benjaminwootton|5 years ago|reply

I have been working on a new product which competes with NiFi, providing streaming data transformations.

Think, if order value > 100 and the customer has ordered 3 times in the last hour and the product will be in stock tomorrow.

Kafka streams, Flink and Dataflow are super powerful and I think there is room for a GUI tool.

Would be great to hear experiences of NiFi in this domain or discuss the space with any experienced users. Will add contact details in my profile.

[+] joshz404|5 years ago|reply

Have you heard of BPEL? :D

135 comments