
Ask HN: How to incorporate machine learning into day job?

169 points| s_c_r | 7 years ago | reply

I work for a small regional shipping company, mostly building CRUD apps and doing EDI integrations. I'd like to find a practical side project using machine learning and/or data science that could add value at work, but for the life of me I can't come up with any problems that I couldn't solve with a relational database (postgres) and a data transformation step. I've spent some time learning pytorch, numpy, and pandas but I know that if I don't use it, especially at work, I'll just forget everything I've learned. My boss is a dev and is generally supportive of learning new things and finding ways to innovate independently, so if I can come up with a good idea I'm sure he'll let me pursue it in my spare time. Has anyone tried to do this before? Any suggestions would be great.

53 comments

[+] randcraw|7 years ago|reply

1) You could use ML to assess the on-time reliability of each of your delivery agents (contractors or individuals). Providing more accurate estimates of delivery time to customers (maybe via text message) might be very desirable. Or you could notify only when expected delivery time has changed (arriving earlier or later). For Just In Time-based shops, this could be a big win.

2) You could assess different delivery routes/regions to determine whether they are more or less on-time than other routes/regions. Is the number of delivery vehicles adequate? When should you adjust the number of vehicles or change the routes themselves (like moving some peripheral regions to another route, or adjusting the cost charged when delivery is delayed)?

3) When do external factors (like weather, esp rain or snow) introduce delays? Can you predict these delays, and ideally, compensate by changing routes or adding more delivery vehicles?

4) Should you more dynamically adjust your shipping fees to reflect faster/slower delivery time targets? This way you can tune your routes and manpower to save money for those who aren't as time sensitive, and improve the response time for those who are.

A lot of this is basic operations research. But you can call it AI, or use AI techniques just as well as traditional OR methods. Nobody will care what math/methods you use if you can add value.
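A rough sketch of idea 1, assuming you can pull (agent, on-time yes/no) pairs out of your delivery records. The function name, the prior of 0.9, and the sample data are all made up for illustration. The point is only that raw percentages are unfair to agents with few deliveries, so the estimate is shrunk toward a company-wide prior:

```python
from collections import defaultdict

def smoothed_on_time_rates(deliveries, prior_rate=0.9, prior_weight=10):
    """deliveries: iterable of (agent_id, was_on_time) pairs.

    Returns {agent_id: estimated on-time rate}, shrunk toward prior_rate so
    an agent with 3 deliveries is not judged as confidently as one with 300.
    """
    on_time = defaultdict(int)
    total = defaultdict(int)
    for agent, ok in deliveries:
        total[agent] += 1
        on_time[agent] += 1 if ok else 0
    return {
        agent: (on_time[agent] + prior_rate * prior_weight) / (total[agent] + prior_weight)
        for agent in total
    }

# Hypothetical history: agent "A" is 2 for 2, agent "B" is 90 for 100.
history = [("A", True)] * 2 + [("B", True)] * 90 + [("B", False)] * 10
rates = smoothed_on_time_rates(history)
```

With this prior, agent A's perfect record over two deliveries lands just above agent B's 90% over a hundred, instead of at a misleading 100%.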

[+] bicubic|7 years ago|reply

> 1) You could use ML to assess the on-time reliability of each of your delivery agents (contractors or individuals)

Keep in mind that there may be flow-on effects and ethical considerations. Once you assign a metric to individuals, someone is going to start attaching it to KPIs, ranking individuals by it, and ultimately firing individuals by it.

Ethics is an increasingly prominent aspect of ML.

[+] KurtMueller|7 years ago|reply

This is so cool. What are some of the traditional methods used to analyze this kind of data?

[+] johngalt|7 years ago|reply

This is a common problem in tech:

"I've just learned a neat new tool but I never apply it because I can solve all the problems in front of me with the tools I have."

In effect, you've found a local maximum where every direction seems like a step backwards, or an investment of time without any reasonable payoff.

Here are two general strategies to deal with this:

1. Take a well understood, and well documented existing need, and replicate the solution with the new toolkit. Acknowledge from the outset that this will be a step backwards, but go through the details anyway to better understand the technology. The goal isn't to make the system better, but to improve your understanding of ML and its real world application. By choosing a well understood system, you are only learning applied ML rather than trying to simultaneously learn ML and the problem. Work toward parity with your existing methods. This part is rarely a big step forward, but I guarantee that this process will generate 100 good ideas about where to go next.

2. Find problems that were previously ignored, because they couldn't be solved. Something no one is even thinking to ask for, because none of the prior tools could do the job. This is the ideal situation because you are in a greenfield space where anything is an improvement. For ML specifically look at anywhere a lot of data is being generated but no one has the time to read it all unless something goes wrong.

When learning any new technology there is always a gap between learning it in the lab, and trying to execute with it IRL. The best way to maximize your own ability is to simply start applying it and building experience. Don't wait for a perfect halo project.

[+] cVwEq|7 years ago|reply

Using a broad definition of ML/data science, here are a few ideas:

First, coding toy problems (related to shipping or not) that implement linear regression, genetic algorithms, neural networks, etc. will be a useful start.

Analyze shipping and tracking EDI data to predict whether a shipment will be late (0.0 to 1.0 output, 1.0 being it will be late for certain)

Predict the likelihood a customer will churn (stop using your services) based on changes in volume, billing amounts, and other characteristics

Predicting this year's peak season shipping volume based on past years' data. See if you can beat the marketing/sales folks' predictions

Identify factors correlated with the most profitable shippers

Predict the likelihood a package is damaged

Use a genetic algorithm to improve driver routing

Reconfigure pickup times / drop off times to improve profitability

Use EDI shipping data to build a network graph of who is shipping to whom, segmented by type of some sort. Say you find that many A-type firms are shipping to B-type firms; any B-type firms that are not already customers could be interesting targets.

Score prospects to estimate their profitability by comparing their characteristics to existing customers' profitabilities

Use a neural network (or something else) to analyze EDI shipping data, damage data, and make packaging recommendations to customers

Analyze tracking EDI data, segmented by delivery area (zip+4?) and see if there are areas where drivers are more efficient at delivering faster. Maybe start an initiative to look at what separates the most efficient drivers from the least.

Reporting: not sexy, but really useful in this space

Bona fides: I used to work in the supply chain consulting space and consulted at firms like yours. Things are surprisingly basic in the shipping space - less meaty data science than one might think.

Edit: Formatting
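For the genetic-algorithm routing idea in the list above, here is a toy sketch using only the standard library. It treats routing as a plain travelling-salesman tour over made-up (x, y) stops, which ignores time windows, vehicle capacity, and real road distances, so take it as a way to learn the technique rather than a routing engine. All names and parameter values are my own choices:

```python
import math
import random

def route_length(order, stops):
    """Total length of the closed tour visiting stops in the given order."""
    return sum(
        math.dist(stops[order[i]], stops[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )

def order_crossover(a, b, rng):
    """Copy a random slice from parent a, fill the rest in parent b's order."""
    i, j = sorted(rng.sample(range(len(a)), 2))
    middle = a[i:j]
    rest = [s for s in b if s not in middle]
    return rest[:i] + middle + rest[i:]

def evolve_route(stops, generations=200, pop_size=50, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    n = len(stops)
    # Start from the stops as listed, plus random shuffles of them.
    population = [list(range(n))] + [rng.sample(range(n), n) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=lambda o: route_length(o, stops))
        survivors = population[: pop_size // 2]  # best half survives unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            child = order_crossover(*rng.sample(survivors, 2), rng)
            if rng.random() < mutation_rate:
                # Mutation: swap two stops in the tour.
                x, y = rng.sample(range(n), 2)
                child[x], child[y] = child[y], child[x]
            children.append(child)
        population = survivors + children
    return min(population, key=lambda o: route_length(o, stops))

# Hypothetical stop coordinates.
stops = [(0, 0), (5, 1), (1, 4), (6, 5), (2, 2), (7, 0), (3, 6), (4, 3)]
best = evolve_route(stops)
```

Because the best half of each generation is carried over unchanged, the result can never be worse than the original stop order, which makes it easy to sanity-check against whatever order drivers use today.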

[+] Raffers|7 years ago|reply

Thank you. Reading your post has given me some ideas to play around with, applying some of these to incident data.

[+] kmax12|7 years ago|reply

To get started, I'd pick the most important business problem you have and then solve it using the simplest machine learning approach.

You mentioned using PyTorch. Instead, I recommend classical machine learning using a library like scikit-learn (https://scikit-learn.org/). Use a random forest classifier and you'll get pretty good results out of the box.

If your data is in a postgres database across multiple tables, you will likely have to perform feature engineering in order to get it machine learning ready. For that, I recommend a library for automated feature engineering called Featuretools (http://github.com/featuretools/featuretools/). Here's a good article to get started with it (https://towardsdatascience.com/automated-feature-engineering...)

Finally, you will need to define a prediction problem and extract labeled training examples. I see people in this thread have suggested ideas of problems to work on. The key here is to make sure that you pick a problem that you can both predict and take an action based off the prediction. For example, you could predict that there will be an influx of shipments to fulfill tomorrow, but that might not be enough time to hire more people to help you fulfill them.
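To make "extract labeled training examples" concrete, here is a minimal sketch for a "will this shipment be late?" problem. The field names are hypothetical, not a real EDI schema, and the 12-hour cutoff is an arbitrary assumption. The important part is that features only use what was known at prediction time, while the label comes from the final outcome:

```python
from datetime import datetime, timedelta

def make_training_example(shipment, events, hours_after_pickup=12):
    """Build one (features, label) pair from a shipment and its tracking events."""
    cutoff = shipment["picked_up_at"] + timedelta(hours=hours_after_pickup)
    # Only events before the cutoff are allowed to become features.
    seen = [e for e in events if e["at"] <= cutoff]
    promised_hours = (shipment["promised_at"] - shipment["picked_up_at"]).total_seconds() / 3600
    features = {
        "scans_before_cutoff": len(seen),
        "exceptions_before_cutoff": sum(e["type"] == "exception" for e in seen),
        "promised_transit_hours": promised_hours,
    }
    # Label: was the shipment delivered after the promised time?
    label = shipment["delivered_at"] > shipment["promised_at"]
    return features, label

# Hypothetical shipment delivered seven hours late.
shipment = {
    "picked_up_at": datetime(2018, 12, 1, 8),
    "promised_at": datetime(2018, 12, 3, 8),
    "delivered_at": datetime(2018, 12, 3, 15),
}
events = [
    {"at": datetime(2018, 12, 1, 10), "type": "scan"},
    {"at": datetime(2018, 12, 1, 18), "type": "exception"},
    {"at": datetime(2018, 12, 2, 9), "type": "scan"},  # after the cutoff, ignored
]
features, label = make_training_example(shipment, events)
```

Picking the cutoff is where the "can you act on it" question shows up: a prediction made 12 hours after pickup is only useful if someone can still reroute or notify the customer at that point.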

If you're curious what the process looks like end-to-end check out this blog series on a generalized framework for solving machine learning problems that was applied to customer churn prediction: https://blog.featurelabs.com/how-to-create-value-with-machin...

Full disclosure: I work for Feature Labs and develop Featuretools.

[+] SatvikBeri|7 years ago|reply

My general tip is to look for things that are almost but not quite automatable, where a human needs to do a quick look-over.

One big example is fraud: it's next-to-impossible to define a 100% accurate set of rules to filter fraud, but it's often easy to train an algorithm to catch the worst offenders, or flag suspicious cases to significantly narrow the amount a human needs to review.
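As a starting point before training anything, a simple outlier score can already narrow the review pile. This sketch is not a trained model: it uses a robust z-score (median and median absolute deviation) with the conventional 3.5 threshold, and the claim amounts are invented:

```python
import statistics

def flag_for_review(amounts, threshold=3.5):
    """Return indexes of values whose robust z-score exceeds the threshold."""
    median = statistics.median(amounts)
    mad = statistics.median(abs(a - median) for a in amounts)
    if mad == 0:
        return []  # no spread at all, nothing stands out
    return [
        i for i, a in enumerate(amounts)
        if 0.6745 * abs(a - median) / mad > threshold
    ]

# Hypothetical damage-claim amounts in dollars.
claims = [120, 135, 110, 128, 142, 2500, 131, 125]
suspicious = flag_for_review(claims)
```

Once reviewers start marking flagged cases as fraud or not, those decisions become labels for a proper supervised classifier.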

[+] edraferi|7 years ago|reply

Find a system or process that uses a series of rules to categorize, label, or action things, especially one that is occasionally incorrect. Model these rules with an ML algorithm using the rule outputs & user corrections as labels. See if you can build an ML system that out-performs the rules. (If you can, it'll probably be by looking at data that the rules didn't consider.)

Note: your ML system will likely be less explainable than the existing rules. This won't matter as much if the current rule collection is already more complex than a human can deal with. It will matter a LOT if your decisions are subject to regulation.

[+] dmitrygr|7 years ago|reply

  > I can't come up with any problems
  > that I couldn't solve with a
  > relational database (postgres) and
  > a data transformation step.

Congratulations, you have seen through the hype! Most "machine learning" claims you see are solvable just with linear regressions on slightly cleaned up data.

[+] edraferi|7 years ago|reply

Agree. However, Linear Regression IS machine learning (it's a supervised learning method). Most AI/ML is 80% data integration, 15% problem scoping and 5% algorithm selection.
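For reference, single-variable linear regression fits in a dozen lines with no libraries. This is ordinary least squares on invented distance and transit-time numbers; the variable names are just for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: lane distance in km vs. transit time in hours.
distance_km = [100, 200, 300, 400]
transit_hours = [5, 8, 11, 14]
slope, intercept = fit_line(distance_km, transit_hours)
predicted = slope * 250 + intercept  # estimated hours for a 250 km lane
```

Getting clean `distance_km` and `transit_hours` columns out of the source systems is the 80% part.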

[+] screye|7 years ago|reply

> pytorch, numpy

If your problems aren't audio / image based, then consider using traditional ML instead.

If you are just starting out, check out SKLearn, SciPy and graphical models like CRFs. They are tried and tested methods that also require less specialized skills.

As someone else said, a lot of AI, ML tools are simply repackaged old school OR methods. The older methods get 95% there, with <50% of the effort.

Cutting edge ML isn't required for most problems. Especially non visual or time series problems.

[+] garysieling|7 years ago|reply

If you explore the word2vec family of algorithms, you can improve text search by pulling in external datasets. E.g. use a model trained on Wikipedia to find synonyms, or build a neural network that maps user search terms to documents in your database.
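The synonym lookup boils down to cosine similarity between word vectors. A sketch of that step, with made-up 3-dimensional vectors standing in for ones you would load from a model trained on a large corpus (the words and numbers here are purely illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def expand_query(term, vectors, top_n=2):
    """Return the top_n other terms whose vectors are closest to the query term's."""
    scored = [
        (cosine(vectors[term], vec), other)
        for other, vec in vectors.items()
        if other != term
    ]
    return [other for _, other in sorted(scored, reverse=True)[:top_n]]

# Invented vectors; real ones would have hundreds of dimensions.
vectors = {
    "truck": [0.9, 0.1, 0.0],
    "lorry": [0.85, 0.15, 0.05],
    "trailer": [0.7, 0.3, 0.1],
    "invoice": [0.0, 0.2, 0.9],
}
related = expand_query("truck", vectors)
```

The expanded terms can then be OR-ed into whatever full-text query your search already runs.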

[+] badgers|7 years ago|reply

If you're looking for problems to solve in the transportation / shipping context, one that comes to mind is estimated delivery. Try predicting the day (maybe even down to the hour) of when something will arrive at its destination once it enters your company's network. It may require feeding it the origin and destination, the product mix in the trailer, customer priority, weather conditions, time of year (peak season?), etc.
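Before feeding in weather and product mix, it helps to have a dumb baseline that any model has to beat. A sketch, assuming you have historical (origin, destination, transit days) rows; the terminal codes and numbers are invented:

```python
import statistics
from collections import defaultdict

def fit_lane_baseline(history):
    """history: list of (origin, destination, transit_days) tuples."""
    by_lane = defaultdict(list)
    for origin, dest, days in history:
        by_lane[(origin, dest)].append(days)
    overall = statistics.median(days for _, _, days in history)
    lane_median = {lane: statistics.median(v) for lane, v in by_lane.items()}

    def predict(origin, dest):
        # Fall back to the network-wide median for lanes never seen before.
        return lane_median.get((origin, dest), overall)

    return predict

# Hypothetical history.
history = [
    ("MEM", "ATL", 1), ("MEM", "ATL", 2), ("MEM", "ATL", 2),
    ("MEM", "DFW", 3), ("MEM", "DFW", 5),
]
predict = fit_lane_baseline(history)
```

If a fancier model with weather and seasonality can't beat the per-lane median on held-out shipments, the extra features aren't earning their keep yet.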

[+] s_c_r|7 years ago|reply

Oh yeah, I can see how this one could be doable. Has practical application too, if it can be made accurate enough.

[+] alimw|7 years ago|reply

There must be someone in your company doing some form of business analysis? Probably in spreadsheets. Talk to them, especially about aspects of their work that involve probability. Don't expect to find a problem that looks exactly like one in the Tensorflow tutorial; you may have to get creative and you may have to do some maths with a more old-school flavour.

[+] philip1209|7 years ago|reply

Start somewhere simple - like using ML to make search easier. Try using a collaborative filter to improve search for your company - even by doing something as simple as updating the placeholder text.
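The simplest form of collaborative filtering is counting what gets used together. A sketch, assuming you can log which items each user searched for or opened in a session; the item names are hypothetical:

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_cooccurrence(sessions):
    """sessions: list of sets of items one user searched for or opened together."""
    co = defaultdict(Counter)
    for items in sessions:
        for a, b in permutations(items, 2):
            co[a][b] += 1
    return co

def suggest(co, item, top_n=3):
    """Items most often seen in the same session as `item`."""
    return [other for other, _ in co[item].most_common(top_n)]

# Hypothetical session logs.
sessions = [
    {"bol", "rate quote"},
    {"bol", "rate quote", "tracking"},
    {"bol", "tracking"},
    {"bol", "rate quote"},
]
co = build_cooccurrence(sessions)
suggestions = suggest(co, "bol")
```

The top suggestion could drive the placeholder text or a "people also looked at" hint without touching the search backend itself.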

[+] s_c_r|7 years ago|reply

I like this idea, thanks.

[+] choward|7 years ago|reply

> for the life of me I can't come up with any problems that I couldn't solve with a relational database

It sounds like you have the right tool for the job now, so great. Keep using it. Dependencies should be added to projects as conservatively as possible. The best dependency is no dependency. You shouldn't go seeking out dependencies. Your app doesn't depend on machine learning, so why would you make it depend on something it doesn't actually depend on? Future maintainers (including yourself) would hate you for it.

[+] nothrows|7 years ago|reply

Try implementing a genetic learning algorithm to send and reply to your work emails. For the fitness score try using your yearly salary in dollars. I haven't yet implemented this, but theoretically there's no upper bounds to how much money this will earn you.

[+] NicoJuicy|7 years ago|reply

Then everyone will act like management and the company will go broke? :P

[+] awa|7 years ago|reply

Some CRUD applications have scope for adding recommendations: based on other things the client has created, predict and auto-populate fields in advance, or recommend changes to fields in existing records that could help them. Again, it will depend on the type of CRUD application.

There's also scope for using ML on the analytics and monitoring side, apart from the main application, which is generally better tolerated by the product team.

[+] davemp|7 years ago|reply

ML is not a glue gun. CRUD apps need glue guns. You need to find a place that needs a sorting (classifying) machine.

[+] mds|7 years ago|reply

Here is some marketing material that may spark some ideas:

https://www.ibm.com/watson/supply-chain/resources/csc/deskto...

One concrete example IBM likes to talk up is predicting shipping delays due to weather events and automatically recommending alternate suppliers.

[+] arandr0x|7 years ago|reply

Pick a report with lots of numerical columns that your company is currently having accounting or the CFO "interpret" into a go-no go decision. Implement logistic regression.

There may be a way you can do some computer vision tasks for quality control in some parts of your business -- most businesses that deal in physical goods have quality control by visual inspection and in most of those you can end up with a CNN that provides a quick enough, good enough solution. However, sometimes for regulatory reasons it's not practical, or it's something that is not a critical part of the chain, and so on. But you could ask operations staff about whether they sometimes do that kind of task, and whether it takes up a lot of their day. It's not like you have to find the good idea alone.
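For the logistic regression suggestion, here is a from-scratch sketch so the mechanics are visible. It assumes the report columns have already been standardized (mean 0, roughly unit scale) and that past go/no-go decisions are available as 1/0 labels. The data, learning rate, and epoch count are invented:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Batch gradient descent on log loss; returns (weights, bias)."""
    n_features = len(rows[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for j, xj in enumerate(x):
                grad_w[j] += (p - y) * xj
            grad_b += p - y
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b

def predict_proba(w, b, x):
    """Probability that row x is a 'go'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical standardized columns: [margin, volume growth]; 1 = go, 0 = no go.
rows = [[1.0, 0.8], [0.9, 1.2], [1.2, 0.7], [-1.0, -0.9], [-0.8, -1.1], [-1.1, -0.7]]
labels = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(rows, labels)
```

The learned weights are readable, which matters here: you can show accounting which columns push a decision toward "go" and compare that with how they say they decide.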

[+] RoadieRoller|7 years ago|reply

I am a beginner too. Below is a problem I have recently been working on. You can work on it too if you think you have a lot of PDFs in any part of your shipping business. I started attacking this mainly to learn Python and use Python ML libs.

My problem statement includes classifying hundreds of thousands of PDFs to different categories based on the content/first few pages. That is, if you have a PDF of a novel by Jeffrey Archer, it should be categorized as Entertainment or Novel etc. If you have an e-book of say Python for Dummies, it should be categorized as Engineering or Technology or Education or Programming and the like.
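One classical way to attack this kind of problem is a multinomial Naive Bayes classifier over word counts. The sketch below assumes the text of the first few pages has already been extracted from each PDF by some other tool, and the tiny training set is invented. It is not necessarily how the project above does it:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (text, category). Returns per-category word and document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best

# Invented training snippets standing in for extracted PDF text.
docs = [
    ("the detective chased the thief through london", "Novel"),
    ("she fell in love during the war", "Novel"),
    ("python functions and loops for beginners", "Programming"),
    ("install python and write your first program", "Programming"),
]
word_counts, doc_counts = train_naive_bayes(docs)
category = classify("learn python loops", word_counts, doc_counts)
```

With real documents you would want stop-word removal and far more training text per category, but the same structure scales to hundreds of thousands of files.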

[+] elliekelly|7 years ago|reply

What's the risk of loss/insurance coverage situation for the items you ship? ML is good for digesting several input variables and assessing the risk of X happening. You might be able to pair that with an up-sale of insurance coverage to the client to generate additional income or use it for cases where the company self-insuring certain shipments might be an effective way to save money for little risk. Depending on how your company typically deals with insurance of course.