Machine Learning since 1997: what’s new?

Tom Mitchell is creator and chair of, to my knowledge, the only “Machine Learning Department” in academia.

In doing so he authored a document in 2006 which strives to define Machine Learning as a separate intellectual discipline from statistics, computer science, or other related fields. His is one of several definitions.

So I was happy to see that he’s announced he’s updating his classic textbook on the subject from 1997.

He’s put up a page with one chapter on generative modeling and a request for comments. It’s been a dynamic 16 years for machine learning — what do you think is missing and should be included in the 2nd edition?

Save the dates! 5th World Science Festival coming to NYC May 30-June 3 2012

It’s hard to believe that it’s already been 5 years since the first World Science Festival here in NYC. The festival is a series of science-related events, including plays, outdoor events, and of course lectures from some of the best-known scientists and science expositors.

Save the dates! May 30-June 3 in NYC. Agenda available here.

New Microsoft Research lab announced in NYC

I was happy to see the news this morning that Microsoft is opening a new research lab in NYC, with 15 of the former members of the Yahoo R+D NYC lab as its founding members.

The Yahoo group is one of the most multidisciplinary research teams I’ve ever seen, with great research collaborations among physicists, machine learning experts, applied mathematicians, and social scientists, all learning about human behavior by analyzing web-scale datasets.
They have also managed to show how a research lab can make great contributions to both the local and international research communities in their field. For example, Jake Hofman at Yahoo has been teaching a great ‘Data Driven Modeling’ class at Columbia for years; John Langford has been a co-organizer of the New York Academy of Sciences’ one-day Machine Learning Symposium since it was founded (while also organizing international machine learning conferences, running a great machine learning blog, etc…)

Some particularly exciting aspects from the announcement include:

– Mathematical Physicist Jennifer Chayes seems to be implying she’ll be spending at least part of her time here in NYC rather than her current home of MSR-Cambridge

– Multiple people in the story said they’re interested in ties with startups and universities, which should be good for the intellectual landscape of NYC dataphiles.

Congrats to all and to NYC!

Theory Day! Friday at Columbia

This Friday will be the 2012 Theory Day, co-organized by Columbia, NYU, and IBM. I’ve been to several and, as the blurb says, “in particular, students are encouraged to attend.”

I was happy to see that Michael Kearns will be presenting his work on “Experiments in Social Computation”. I had the pleasure of seeing Kearns present this a few years ago at one of the “Simons Science Series” lectures (vimeo here). Kearns is a rare academic who has made fundamental contributions in theory, yet also understands how to apply this theory in the real world. Rajeev Motwani comes to mind as another professor I’ve heard described this way — someone whose work advanced the field in theoretical contributions, yet was also a great example to students as to how their academic work could be applied. (A Stanford PhD I know told me recently that, when he was starting a tech startup, the best business advisor he ever had was a winner of the Gödel Prize — Motwani.)

In Kearns’ case the work is in machine learning theory, where he made some of the early fundamental contributions and wrote a great book on computational learning theory. He has also been a consultant or advisor to some great startups and VC firms, started the Penn-Lehman automated trading competition, and more recently, I noticed, helped the energetic students from Penn’s Dining Philosophers Club by judging at one of their hackathons.

Also speaking will be researchers from Microsoft Research New England (in Cambridge, MA), a lab which has really shown how a big tech company can still, even in 2012, provide a home for diverse and groundbreaking pure research, ranging from social computation to mathematical physics. When I visited there in February, I talked to Jennifer Chayes and Adam Kalai about how they not only have the freedom to do great research, but also interact with the local student and startup communities via hackathons, tech talks, and other activities that improve the local nerdscape. It would be great if NYC had a similar lab, particularly now that Yahoo’s NYC R+D lab is rumored to be disbanding.

Engineering Careers, Consulting, and Startups

Last night I attended the senior dinner for graduating Columbia engineers and sat next to a student who was going into consulting. The company she’s going to work for, she said, recruited heavily, and offered a diversity of experiences, which appealed to her because she said she wasn’t quite sure yet what she wanted to do with her life. She admitted that she would probably not use any of her undergraduate STEM education.

I wonder how many students don’t pursue advanced studies in science or engineering because they feel like they didn’t find anything they want to specialize in? I wonder what we as faculty could do better to help them?

Also, I wonder whether there’s some way she could apply her engineering talents without being forced to specialize. For example, she could go work at a small startup with engineering or technical problems where her talents would be put to use (I noticed that someone else at the table said she was his ‘go to’ person whenever he got stuck on programming). Because everyone at a small company needs to do every task, she would also get to see multiple facets of creating and scaling a company. Certainly such engineering skills are in high demand here in NYC.

Perhaps we as STEM faculty here in NYC can do a better job helping our students see all the different ways they can apply the technical skills they learn in our courses.

Your thoughts appreciated.

Data Science Hackathon & Data Science in NYC

On Saturday, I was a judge for the Data Viz Competition at the NYC Data Hackathon, part of the world’s first global data hackathon. My fellow judges Cathy O’Neil and Jake Porway and I gave an award to the team that best found a nontrivial insight in the data provided for the competition and managed to render that insight visually.

Unlike a hackNY hackathon, where the energy is pretty high and the crowd much younger (hackNY hackathons are for full time students only; this crowd all were out of school — in fact at least one person was a professor), here everyone was really heads down. There was plenty of conversation and smiles but people were working quite hard, even 12 hours into the hackathon.

I noticed two things that were unusual about the participants, both of which I think speak well of the state of ‘data science’ in NYC:

  • I’ve never been in a room with such a healthy mix of Wall Street quants and startup data scientists. Many of the teams included a mix of people from different sectors working together. The winning team was typical in this way: 1 person from Wall Street; 1 freelancer; and 1 data scientist from an established NYC startup.
  • I met multiple people visiting from the Bay Area contemplating moving to NYC. In 2004-2007 many of my students from Columbia moved out to SF under the historical notion that that was ‘the place’ where they could work at a small company that would demand their technical mastery and give them sufficient autonomy to see their work come to light under their own direction.

I was glad to meet people from the Bay Area who were sufficiently impressed with NYC’s data scene to consider moving here. Of course I told them it was exactly the right thing to do and I looked forward to seeing them again soon once they’d become naturalized citizens of NYC.

Huge thanks to Shivon Zilis and Matt Turck for hosting us in such a nice location, to Jeremy Howard for his suggestion a few weeks ago to throw the event, and to Max Shron for encouraging everyone to include a visualization prize as part of this event.

how to write a paper (one possible answer)

how to write a paper

a student recently asked me how to write a paper. here’s an algorithm i’d suggest, with plenty of room for an individual to deviate.

  1. punchline(s)
  2. nickname
  3. *figures
  4. *references
  5. outline
  6. abstract
  7. (w) intro and outtro
  8. (w) middle
  9. show definite coauthors
  10. show possible coauthors
  11. acknowledgements
  12. title
  13. code
  14. submit and post

  • punchline(s)

readers, reviewers, and you in 5 years are going to want to have some pithy way of remembering that paper. what is the “main result”? what did you learn? if answering this takes a long time, maybe you don’t understand the subject well yet, or maybe it’s really 2 papers.

  • nickname

most of the projects i work on have a nickname for the project. sometimes it’s just the name of the cvs/svn/github repository. it helps you and your collaborators define a bite-size quantum of research.

  • figures

decide what figures are necessary to illustrate the punchline. decide which are going in the manuscript and which in supplementary material. the * indicates that this is how people read the paper: they’ll skim the figures and references first

  • references

decide who you should cite to support the argument and set the background/context. see above for *. someone i know once said to me that the first thing she or he reads in a paper is the references to see if she or he is cited. i’m still not sure if she or he was serious.

  • outline

next write an outline. seriously you need to do this. don’t just sit down and start writing stuff.

  • abstract

now you are allowed to write the abstract

  • (w) intro

now write the beginning of the paper. the (w) indicates that this is the part of the algorithm most people think of when they think of “writing a paper”

  • (w) outtro

now write the conclusions, what you showed, what you’d like to do in subsequent papers, where to find the source code.

  • (w) middle

now write the rest

  • show definite coauthors

if you haven’t already, make sure you show it to the people who are going to be coauthors

  • show possible coauthors

if you haven’t already, show it to people who may or may not want to be coauthors. be generous

  • acknowledgements

think about everyone who helped you and funded you. be generous. also people from the above section who elected not to be coauthors should be acknowledged

  • title
  • code

if your work is computational (including modeling and statistical work), upload all code + data to a neutral, 3rd party site (e.g., code.google.com, github.com, sourceforge.net ). For more on this see this blog post or this talk or this roundtable.

  • submit & post to arxiv.org

I’m not sure which is supposed to come first, but it seems reasonable to me that one submit to the journal and, as soon as possible thereafter, post to arxiv.org, with a note saying “submitted for publication” in the comment field.

edit: added two reproducibility steps; these two are really more about how to submit than how to write, but good for the species, so i added

Tomorrow! Machine Learning Symposium & Startup-Student Afterparty

Folks, there’s still time for NYC’s biggest annual machine learning event:

The 5th annual NYAS Machine Learning Symposium is tomorrow!

This year there will also be an afterparty with NYC startups presenting machine learning problems “in the wild”.

Monica Kerr of NYAS’s Science Alliance sent along this blurb:

Machine Learning Careers in NYC Startups

Students, postdocs, and professionals: Thinking about machine learning careers? Intrigued by New York City’s emerging startup ecosystem?

After the close of the 5th Annual Machine Learning Symposium there will be a series of short talks on machine learning problems encountered in the NYC startup community. Local startups foursquare.com, drop.io, etsy.com, and bit.ly will discuss examples of real-world, data-intensive challenges that they currently face, as well as job opportunities at NYC startups. There will be food, beverages, and an opportunity to discuss technical or career matters with researchers in startups informally.

This event is co-organized by Science Alliance and hackNY.org with the support of friends including IA Ventures and AOL Ventures.

In order to attend this special “Machine Learning Careers in NYC Startups” event you must be registered for the 5th Annual Machine Learning Symposium (www.nyas.org/ml2010) AND RSVP by emailing Melanie Koundourou at mkoundourou@nyas.org.

5th Annual Machine Learning Symposium | The New York Academy of Sciences

Registration now open for the 5th Annual Machine Learning Symposium @ The New York Academy of Sciences!

I’ve been to all 4 of these previously, I think, and they’re a great gathering of all the local friends of data+ML goodness. Highly recommended. Industry, comp bio, theory, stats, robotics, whatever your bag, there’ll be some good stuff. Familiar faces, confusion, complexity, projections, generalization, loss. However, your regret may be unbounded if you miss it. I’ll stop there.

Good times, good times. Kudos to the organizing committee and to NYAS for making it happen and for gathering together NYC’s ML peeps. (and with a great view, too!)

Also: submit poster abstracts before the Sept 17 deadline.

Afghanistan / Wikileaks visualized (Data Viz + Open Source FTW)

NYC’s own Drew Conway (grad student @ NYU) and Mike Dewar (postdoc @ Columbia) put together a visualization of the recent trove of data released by wikileaks.org about Afghanistan. The work was picked up by Wired.com this week in a post titled Open Source Tools Turn WikiLeaks Into Illustrated Afghan Meltdown and also featured in the Atlantic.

In a victory for transparency & reproducibility, they also distribute their source code (in R, via github.com) so you can try this at home.