top | item 15596049

(no title)

npolet | 8 years ago

I was part of a team developing a travel product which needed huge amounts of data from flights to car rental... My god the travel industry is a mess in terms of accessible data. It was the biggest pain point. APIs would randomly change structure, endpoints would stop working with unknown errors, the same data would differ significantly from one company to the next, etc. It was a shambles. Never again will I develop in the travel industry. Unless you own the data, working with it is terrible.

We ended up building our own little system that merged our own flight data with data from suppliers. The schemas some of the suppliers provided were nonsensical, and it ended up being a really finely tuned system that would break regularly whenever one of the data suppliers decided to change its schema without notice.
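A minimal sketch of how that kind of silent schema change can at least be caught early, assuming supplier payloads arrive as plain dicts. The field names in `EXPECTED_FIELDS` and the function `find_schema_drift` are made up for illustration and do not come from any real supplier feed:

```python
from typing import Any

# Hypothetical expected shape of one supplier's flight record (assumption).
EXPECTED_FIELDS: dict[str, type] = {
    "flight_number": str,
    "origin": str,
    "destination": str,
    "price": float,
}


def find_schema_drift(record: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the record matches."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    # Fields the supplier added without telling anyone.
    for name in sorted(record.keys() - EXPECTED_FIELDS.keys()):
        problems.append(f"unexpected field: {name}")
    return problems
```

Running a check like this on every incoming record (or a sample) turns "the merge broke somewhere" into "supplier X dropped `destination` today", which is about the best you can hope for when you don't own the data.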

I spent my days dreaming of a standardized data pool, and while some of the data was standardized, not enough of it was.

Thank goodness I'm now working in an industry where we can totally own our data and are not reliant on anyone else for it.

lotyrin|8 years ago

Reminds me of my time spent being a customer to a real estate MLS. When industries traditionally rented access to information asymmetry, even if those rents have since diminished, there's a lingering cultural momentum not to "show your hand" by playing nice with information (in addition to these not being organizations with the insight and incentives aligned to be attractive or conducive to IT competence).

i_cant_speel|8 years ago

I deal with MLSs regularly as part of my current job and this was the first thing that popped into my head when I read the parent comment. They are hell to deal with and we would pay a good amount of money to anyone who could standardize all of them and feed them to us in a consistent format.

Half of the time I spend debugging is on figuring out what is going wrong with different MLSs.

paulie_a|8 years ago

I've worked with MLS data before. It was terrible. I am currently working with ACH, DDF and I am astounded that our financial system even operates. Because if there is one thing you want in payment processing, it's wildly inaccurate and inconsistent data that is lacking documentation.

pishpash|8 years ago

Not coincidentally these are also industries with the most middlemen.

tyingq|8 years ago

Didn't like OTA[1], eh?

I don't know that other industries are much better though. Ask a healthcare IT person about HL7.

[1]https://en.m.wikipedia.org/wiki/OpenTravel_Alliance

snuxoll|8 years ago

FUCK HL7! Holy shit, I’ll never get the time back I spent implementing our standard HL7 data extractor and all the other software needed to get that into our billing system. Our poor EDI team has the worst of it though; they’re the ones responsible for mapping all the edge cases of various EMRs (and their per-hospital customizations) into our standard formats.

HL7 claims to be a standard, but it’s a hell of a lot closer to JSON in terms of standardization (that is, a serialization format) than say ANSI X12 (an ugly but generally standard data interchange system).

SmirkingRevenge|8 years ago

> Ask a healthcare IT person about HL7.

Been there, done that. Never again!

cwilkes|8 years ago

I liked to say “HL7, standard?” As more of a question.

Bluestrike2|8 years ago

The car industry is another industry whose data is a series of nightmares. There are a number of services that aggregate inventory data for dealers using them. Sounds great, right? Maybe, if they had halfway decent APIs. Multiple providers--with zero connection to one another, I checked quite carefully after becoming exasperated--somehow managed to independently come to the decision that the best way to make inventory data available was to push large CSV files to an FTP server you set up.

Whatever. People can come up with crazy solutions and you're stuck working with them. At least you've got the CSV files by that point, right? Well, sort of. Turns out they needed a good bit of work. And by a bit of work, I mean that some of them were quite possibly the worst CSV files ever generated. And it was...weird. Normally, if there's a problem with a CSV file it's at least consistent across the entire file. Quotation marks not escaped? Ok, no big deal. They're all like that. Usually, you can normalize the data and move forward.

If only. Some lines had escaped quotes. Some didn't. Some lines were actually multiple records because apparently the magic linebreak decided to go on strike. In one case, I kid you not, the file switched from comma-separated to tab-separated. Huh? How'd that even happen? Some values were perfectly valid, just handled in a way that's guaranteed to annoy the hell out of you. But fine. You do what you can, reject the bad (and log to a file for manual review in the hopes that you'll figure...something...out about it), and move on with your sort-of-normalized data. But that's just referring to the data itself, and not the entries.
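A rough sketch of that "keep what parses, reject and log the rest" approach, using only Python's standard `csv` and `logging` modules. The function names, the logger name, and the idea of checking against a fixed `expected_columns` count are assumptions for illustration, not what was actually run:

```python
import csv
import logging
from typing import Optional

logger = logging.getLogger("inventory_import")  # hypothetical logger name


def parse_line(line: str, expected_columns: int) -> Optional[list[str]]:
    """Try each candidate delimiter; return the fields of the first clean parse."""
    for delimiter in (",", "\t"):
        try:
            fields = next(csv.reader([line], delimiter=delimiter))
        except (csv.Error, StopIteration):
            continue
        if len(fields) == expected_columns:
            return fields
    return None  # wrong column count under every delimiter we know about


def split_feed(lines, expected_columns):
    """Separate parseable rows from rows that need manual review."""
    accepted, rejected = [], []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue
        fields = parse_line(line, expected_columns)
        if fields is None:
            logger.warning("rejected line %d: %r", line_number, line)
            rejected.append((line_number, line))
        else:
            accepted.append(fields)
    return accepted, rejected
```

Parsing line by line is what makes the mid-file switch from commas to tabs survivable. The trade-off is that a field with a legitimate embedded newline would be rejected rather than stitched back together, and two records fused onto one line land in the reject pile instead of being split.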

When you order a new car, the build sheet for every manufacturer is simple enough. Every option has a code. Every option has a name. New car dealers get their inventory data back from the factory and it's plugged into their inventory management systems. Through whatever accidental acts of magic and chicanery, that data eventually makes its way to the data sources you're busy importing. Unfortunately, at some point in the process all of those beautiful factory codes and names--standardized, constant, etc.--disappear. That beautiful "Cobalt Blue" is somehow transmogrified into a very unhelpful "Blue." And don't even get me started on factory options. At times, you're lucky if you somehow accidentally get the basics like, oh, "has two front seats and possibly four-ish wheels." It's even worse with used cars, because some unlucky salesperson/clerk/receptionist/car washer kid had to sit down and manually enter the car data.

Instead of thinking of that giant CSV file as a wonderful list of thousands of cars just waiting to be discovered and purchased, you start to see it as more of a starting point. It's an incomplete list of cars for sale, with some of the information about each car. You need to use other sources to fill in some of the blanks, fix some of the most obvious errors, etc. Luckily, dealers upload their data to as many services as they can. Unluckily, it's often...different across those same sources, and it's up to you to figure out what data to keep the same or change. Depending on the manufacturer, you can decode the VIN and pull up all sorts of useful information about the car's build options. Maybe. You can then use the manufacturer's pricing guide for that model year to fill in the blanks. Assuming, of course, that you've got a copy of the pricing guide in question. Which isn't guaranteed, since they're generally not publicly available (though they do leak...often).
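One way to picture the "fill in the blanks from other sources" step is a merge keyed on VIN with an explicit trust order. This is only a sketch: the source names, the field names, and the rule that a more trusted source always wins are assumptions for illustration, and the VINs in any example are placeholders rather than real ones:

```python
def merge_listings(listings_by_source, source_priority):
    """Merge per-source listings into one record per VIN.

    listings_by_source: {source_name: [{"vin": ..., "color": ..., ...}, ...]}
    source_priority: source names ordered from most to least trusted.
    """
    merged = {}
    # Walk from least to most trusted so trusted values overwrite the rest.
    for source in reversed(source_priority):
        for listing in listings_by_source.get(source, []):
            vin = listing["vin"].strip().upper()
            record = merged.setdefault(vin, {"vin": vin})
            for field, value in listing.items():
                if field == "vin" or value in (None, ""):
                    continue  # a blank never overwrites a real value
                record[field] = value
    return merged
```

With a priority of `["factory_site", "aggregator"]`, an aggregator's "Blue" gets replaced by the factory site's "Cobalt Blue", while a price that only the aggregator knows is kept. Real feeds would need per-field rules rather than one global order, since the most trusted source for paint codes is not necessarily the most trusted source for price.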

I'm a huge Porsche fan, and I know more Porsche fans. We're all nuts. Details matter. Do you want to know how many [shades of blue](http://paintref.com/cgi-bin/colorcodedisplay.cgi?manuf=Porsc...) Porsche has used over the years? Many of the paint colors are available across different models and different years, so the number looks a lot scarier than it actually is, but there are a total of 641 entries in the linked paint database. Someone searching for a used Porsche wants to make certain that they're looking for a specific color or option. They don't want to just use "blue" for their search. They're looking for the gorgeous [Oslo blue](https://gearpatrol.com/2016/08/18/definitive-ranking-blue-po...) imprinted in their minds during a magical childhood moment, dammit. Which was only used in 1961, except for custom paint-to-sample (PTS) orders in later years. There was one 993 Turbo S in Turquoise Blue. Jerry Seinfeld bought it. A PTS color will affect the relative value of the car (new or used), so it's one of those fiddly bits that matters a bit.

You'll get another multi-gigabyte CSV file delivered to you by FTP on the morrow. And joy, it's not an incremental update with just the new cars. It's all the cars they have data for. If a car has been sold, it'll be omitted from the file. It's up to you to figure out which. Hopefully, the car won't be "un-sold" in a day or two after the buyer backs out. That can get weird, especially when you're dealing with multiple providers. Finally, your simple diff is further complicated by the inevitable likelihood that random data across random listings has been changed as well. Perhaps they caught a typo in the VIN, or the annoying fact that the paint color originally listed never actually existed.
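The daily full-dump diff described above can be sketched roughly like this, assuming each snapshot has already been loaded into a dict keyed by VIN. Both function names and the "count consecutive absences before calling it sold" idea are illustrative assumptions, not a description of what any provider or search site actually does:

```python
def diff_snapshots(previous, current):
    """Compare two full dumps, each a {vin: listing_dict} mapping."""
    prev_vins, curr_vins = set(previous), set(current)
    return {
        "added": sorted(curr_vins - prev_vins),
        "removed": sorted(prev_vins - curr_vins),
        "changed": sorted(
            vin for vin in prev_vins & curr_vins if previous[vin] != current[vin]
        ),
    }


def track_absences(absent_days, known_vins, current_vins):
    """Return {vin: consecutive snapshots absent}; a VIN that reappears drops out."""
    return {
        vin: absent_days.get(vin, 0) + 1
        for vin in known_vins
        if vin not in current_vins
    }
```

Treating a car as sold only once its absence count passes some grace period is one way to absorb the "un-sold" case, at the cost of showing a few stale listings for a day or two. Note that a corrected VIN shows up here as one removal plus one addition, which is exactly the kind of thing that makes the simple diff less simple.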

Needless to say, I sound a bit crazy at this point. Your only option is to accept that, much as it might annoy the hell out of you, there are going to be serious compromises involved with the data you have access to. Apologies to those with severe OCD and to perfectionists. Handily, some manufacturers have sites for their dealers (i.e., http://porsche-dealer.com). Usually, these sources are pristine with everything in order. You just need to scrape it where it's allowed, which is its own set of fun.

And don't even get me started on the photos included with listings. It's rare that you'll see decent photos taken by a dealer. Mostly, you'll find those on eBay Motors with certain sellers. Everyone else does their best to make them as terrible as possible. The data providers then take that as a challenge, and crush the ever-living hell out of the image with another round of JPEG compression and give the resulting monstrosity to you at a nice, small resolution. Want something bigger? Forget it.

Maybe I'm blowing all of this out of proportion. Perhaps the industry has changed in the past few years since my experience. Personally, I doubt it. In any case, dealing with this was annoying as hell. This rambling post was oddly cathartic.

Anyhow, if you ask me, I'm pretty sure all of the car search sites just scrape each other, like the Ouroboros, the serpent eating its own tail.

closeparen|8 years ago

>somehow managed to independently come to the decision that the best way to make inventory data available was to push large CSV files to an FTP server you set up.

Somehow this is indeed a standard for data transfers between non-tech companies. (The ones that are sophisticated enough not to use fax or email attachments).

rokhayakebe|8 years ago

You know I was thinking as I was reading you guys are crazy for having 641 blues. However I definitely see your point after clicking the links. These are entirely different colors, indeed.

vpribish|8 years ago

that was an awesome writeup. I've wrangled finance data and know the pain. Those Porsche blue links are wonderful!

SmirkingRevenge|8 years ago

I think this is probably the general state of data exchange systems in most industries/sectors where an IT company/startup hasn't come along to successfully disrupt it or none of the major players have become tech-oriented enough to see - much less sort out - these kinds of problems well.

That's been my experience anyways.

sidlls|8 years ago

Hardly. Most of these things are because the companies in these industries have an interest in having it the way it is. It has little to do with (not) being sophisticated with technology or because some startup hasn't come to save the day. Startups tend to actually have a very small impact on anything generally speaking. It's rare that one blossoms into a Google or Facebook.

benjarrell|8 years ago

IT companies are no better from my experience.

elmalto|8 years ago

I work in the flight industry and still dream of that standardized data structure...

sbfeibish|8 years ago

Seems to me what you describe is a business opportunity. Your business takes on the burden of obtaining the data and putting it in an easy-to-use form for others. There's a company in Pensacola, FL that's been doing this for years with things like lottery numbers and gas prices.

madeofpalk|8 years ago

> Unless you own the data, working with it is terrible.

Heh. Having done work for an airline, I wouldn't even say this.

dorchadas|8 years ago

Ain't that the truth! I tried just messing around with it a while ago to make my own airfare searcher to check a few things for me and it was a mess. Ended up doing the Amadeus sandbox, but the cheapest flights aren't necessarily there and it's a real pita.

new_user224|8 years ago

How did you integrate all the different schemas? RDF, some kind of rule engine, or plain Java, Python, ... ?

hchasestevens|8 years ago

Having worked previously at a company that sounds very similar (maybe even the same one?), our approach was primarily using individually written scrapers and API integrations (when available) in Python, utilizing an underlying scraping framework that predated requests. As you might imagine, these integrations required constant maintenance and were often bug-prone, so much so that the company eventually found itself in the position of outsourcing the work... There were attempts to reduce maintenance through the use of an in-house DSL/rules engine, but ultimately, the range of integrations it was able to support was very limited, and the project was scrapped.
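To give a rough idea of what a rules-driven integration can look like (this is not the in-house DSL mentioned above, which isn't described here), a table of per-supplier mapping rules might be sketched like this. The supplier names, key paths, and converters are all hypothetical:

```python
# Hypothetical rules: canonical field -> (dotted path in the payload, converter).
SUPPLIER_RULES = {
    "supplier_a": {
        "origin": ("dep.airport", str.upper),
        "price": ("fare.total", float),
    },
    "supplier_b": {
        "origin": ("from", str.upper),
        "price": ("price_cents", lambda cents: int(cents) / 100),
    },
}


def get_path(payload, dotted_path):
    """Follow a dotted key path such as 'fare.total' through nested dicts."""
    value = payload
    for key in dotted_path.split("."):
        value = value[key]
    return value


def normalize(supplier, payload):
    """Apply one supplier's rules to produce a record in the canonical shape."""
    rules = SUPPLIER_RULES[supplier]
    return {
        field: convert(get_path(payload, path))
        for field, (path, convert) in rules.items()
    }
```

A table like this handles renamed keys and unit conversions well enough, but it falls over as soon as a supplier needs conditional logic, pagination, or scraping, which fits the point above about the limited range of integrations such a rules engine can support.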