IBM Community Day Data Science

The Jupyter Notebook stack has become the de facto platform data scientists use to build interactive applications, and with the growing popularity of deep learning there is an increasing need for compute resources to make deep learning effective. In this session, we will discuss how to scale a Jupyter Notebook deployment by enabling kernels to run in distributed mode across multiple compute nodes, on an analytics engine such as Apache Spark, or even in containers managed by Kubernetes.

The Jupyter notebook has quickly become one of data scientists’ favorite tools. When you use notebooks in IBM's Watson Studio, you get a complete platform for building an application, from data preparation and analytics to building and deploying machine learning models. Jupyter notebooks are a big step up from executing code at the command line, but the basic notebook environment doesn’t do much to automate repetitive tasks. This is where Pixiedust comes in. It puts some of the most common visualization tasks behind a convenient GUI so you don’t have to remember all those obscure arguments that go into the creation of a simple bar chart. Even better, Pixiedust is extensible, so if the function you want to automate isn’t available, you can write a “PixieApp” (a Python class that extends Pixiedust) to do the job. Come learn how to use Pixiedust and build PixieApps.

Deriving deep insights from Fast Data has historically been challenging. System designers have typically had to choose between fast insights on a window of the data (the streaming analytics solution) or analyzing all of the data, but after a considerable delay (the data lake/warehouse approach). Unfortunately, this tradeoff is no longer acceptable. We’ll discuss IBM Db2 Event Store, a new offering capable of ingesting millions of events per second while at the same time making that data immediately available for analytics. The solution is built on an open source stack (Spark, Parquet) and is integrated into IBM’s Data Science Experience for first-class data analysis. We will show examples of Db2 Event Store and how it can be used to quickly ingest data and analyze it through Scala and Python notebooks. Come learn about IBM's newest offering for Fast Data Management.

Successful deployments in machine learning require that models generalize well to new data. When building predictive workflows, it is easy to overlook where bias may enter during feature creation, leading to an over-optimistic expectation of a successful deployment. The correct placement and use of the Partition Node when creating targeted features can help to avoid this problem. When trees are built using sampling methods to address imbalanced data, the coverage and accuracies reported for the resulting rules are based on the balanced training data. We show a method, starting with the Rule Trace Node, that finds the coverage and accuracy of individual tree rules on the test data, which indicates how each rule generalizes to new data.
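To make the idea of per-rule coverage and accuracy on held-out data concrete, here is a minimal Python sketch. It does not use SPSS Modeler or its Partition and Rule Trace nodes; the rule format, the `rule_stats` helper, and the toy records are all hypothetical and serve only to illustrate the two metrics.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, float]
# A rule is (name, condition on a record, class predicted when the condition holds).
Rule = Tuple[str, Callable[[Record], bool], int]


def rule_stats(
    rules: List[Rule], records: List[Record], labels: List[int]
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return {rule name: (coverage, accuracy)} measured on the given records.

    coverage = share of records the rule fires on
    accuracy = share of those records whose label matches the rule's prediction
    """
    stats = {}
    n = len(records)
    for name, condition, predicted in rules:
        hits = [y for rec, y in zip(records, labels) if condition(rec)]
        coverage = len(hits) / n if n else 0.0
        accuracy = sum(1 for y in hits if y == predicted) / len(hits) if hits else None
        stats[name] = (coverage, accuracy)
    return stats


# Hypothetical held-out test partition (never used for training or balancing).
test_records = [
    {"balance": 100, "age": 25},
    {"balance": 900, "age": 40},
    {"balance": 50, "age": 60},
    {"balance": 700, "age": 30},
]
test_labels = [1, 0, 1, 1]

# Hypothetical rules, as they might be read off a trained tree.
rules = [
    ("low_balance", lambda r: r["balance"] < 500, 1),
    ("high_balance", lambda r: r["balance"] >= 500, 0),
]

stats = rule_stats(rules, test_records, test_labels)
for name, (coverage, accuracy) in stats.items():
    print(name, coverage, accuracy)
```

Comparing these test-partition figures with the ones reported on the balanced training data shows which rules hold up on new data.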

What if you could reduce your planning process from 1 week to 1 hour, or from 1 hour to 1 second? What if you could, at the click of a button, improve your bottom line by double digits? In this session, you will learn to do just that by leveraging IBM's powerful Machine Learning (ML) and Decision Optimization (DO) technologies together. You will learn the differences and complementary strengths of ML and DO, learn best practices, and see examples of combining these technologies to achieve financial gains and efficiencies based on some IBM Data Science Elite client projects. The session also includes a demo of combining ML & DO in IBM Data Science Experience (DSX).

Optimization is about making plans and schedules: in short, decisions. However, optimization has to build on good data, which most of the time comes from forecasts and predictions. This talk presents a specific use case showing the integration between prediction and optimization. Imagine that a bank introduces new products and wants to target the clients who are most likely to buy them. The bank has to predict what each client or client group is most likely to buy and then build a campaign to advertise the products to the best candidates. The bank has a fixed budget for the campaign and several channels for approaching clients, each with a different cost, and it wants to maximize expected revenue minus cost. SPSS Modeler will be used for the prediction part, and solving an OPL model in CPLEX Optimization Studio will yield the decisions: which clients should be approached with which offer through which channel.
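The campaign problem described above can be sketched as a tiny brute-force search in Python. This is not the OPL model or CPLEX solve used in the talk; every client name, probability, revenue, cost, and the `channel_factor` response multiplier is a made-up assumption, and exhaustive enumeration only works at toy scale.

```python
from itertools import product

# Hypothetical purchase probabilities, standing in for predictive-model output.
purchase_prob = {
    "alice": {"loan": 0.30, "card": 0.10},
    "bob": {"loan": 0.05, "card": 0.40},
    "carol": {"loan": 0.20, "card": 0.20},
}
offer_revenue = {"loan": 100.0, "card": 50.0}  # revenue if the client buys
channel_cost = {"email": 1.0, "call": 5.0}     # cost per contact
channel_factor = {"email": 1.0, "call": 1.5}   # assumed response multiplier
budget = 6.0


def net_value(client, offer, channel):
    """Expected revenue minus contact cost for one client/offer/channel choice."""
    expected_revenue = (
        purchase_prob[client][offer] * channel_factor[channel] * offer_revenue[offer]
    )
    return expected_revenue - channel_cost[channel]


clients = list(purchase_prob)
# Each client receives one (offer, channel) pair, or None if not contacted.
options = [None] + list(product(offer_revenue, channel_cost))

best_plan, best_value = None, float("-inf")
for assignment in product(options, repeat=len(clients)):
    cost = sum(channel_cost[a[1]] for a in assignment if a is not None)
    if cost > budget:
        continue  # violates the campaign budget
    value = sum(net_value(c, *a) for c, a in zip(clients, assignment) if a is not None)
    if value > best_value:
        best_plan, best_value = dict(zip(clients, assignment)), value

print(best_plan, round(best_value, 2))
```

With these numbers the budget rules out combining a call with more than one email, so the search settles on emailing all three clients. A mathematical programming solver expresses the same objective and budget constraint declaratively and scales far beyond what enumeration can handle.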

The application of natural language classification continues to grow. Over a period of 18 months, my team looked at how to effectively classify, map and assess financial services industry regulations to gain insights into the critical requirements and “map” them onto available existing controls.

This presentation provides an overview of the process used to train Watson Natural Language Classifier to interpret such financial regulations. What were the critical steps, and what did we learn along the way that you might apply to your own projects? We’ll share those insights during this session.

The Enterprise Data Warehouse (EDW) has traditionally been the foundation for data storage. So how do you leverage current investments while remaining relevant and competitive? It is important for your organization to continue to evolve, accelerate development and deployment times, and provide a high-performance, cloud-ready platform to drive the future of advanced analytics. In this webinar we will talk about current data challenges and how Data Science and Machine Learning are driving better, faster data-driven decisions. We will also discuss the need for a hybrid data strategy, and how the IBM Integrated Analytics System, as the mainstay of the Enterprise Data Warehouse of the future, remains an integral part of that strategy. We will talk about IBM’s future vision and how current and future innovative, scalable, and flexible technology is harnessing growing data for even your most advanced workloads, while providing your data scientists with a platform for advanced analytics.

The traditional approach to model serving is to treat the model as code, which means that the machine learning implementation has to be somehow adapted for model serving. As the number of machine learning tools and techniques grows, this approach is becoming more questionable. Additionally, machine learning and model serving are driven by very different quality of service requirements: while machine learning is typically batch, mostly concerned with scalability and processing power, model serving is mostly concerned with performance and stability.

An alternative approach to model serving, proposed in the presentation, is to treat the model as data. Such an approach allows complete decoupling between the model implementation for machine learning and model serving, and allows for easier standardization of the model serving implementation. Additionally, it allows the served model to be updated dynamically without restarting the system.

The presentation explores the use of TensorFlow and PMML as model representations and their use in building a “real time updatable” model serving architecture. Boris will also present options for implementing such an architecture leveraging Akka Streams and Flink.
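As a rough illustration of the "model as data" idea, the Python sketch below keeps the served model as a plain data payload that can be swapped while the server keeps running. It does not use TensorFlow, PMML, Akka Streams, or Flink; the `ModelServer` class, the JSON layout, and the linear scoring rule are hypothetical stand-ins chosen to keep the example short.

```python
import json
import threading
from typing import Dict, Optional


class ModelServer:
    """Scores records with whatever model payload was most recently received."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._model: Optional[dict] = None

    def update_model(self, payload: str) -> None:
        """Parse a serialized model and swap it in; no restart is needed."""
        raw = json.loads(payload)
        model = {
            "version": raw["version"],
            "intercept": float(raw["intercept"]),
            "weights": {name: float(w) for name, w in raw["weights"].items()},
        }
        with self._lock:
            self._model = model  # atomic swap of the served model

    def score(self, record: Dict[str, float]) -> Optional[float]:
        """Apply the current model to a record; None if no model has arrived yet."""
        with self._lock:
            model = self._model
        if model is None:
            return None
        return model["intercept"] + sum(
            w * record.get(name, 0.0) for name, w in model["weights"].items()
        )


server = ModelServer()
record = {"x": 3.0}
before = server.score(record)  # no model yet

# Models arrive as data, for example on a stream alongside the records to score.
server.update_model(json.dumps({"version": 1, "intercept": 1.0, "weights": {"x": 2.0}}))
v1_score = server.score(record)

server.update_model(json.dumps({"version": 2, "intercept": 0.0, "weights": {"x": 0.5}}))
v2_score = server.score(record)

print(before, v1_score, v2_score)
```

Because the serving code only interprets a payload format, the training side is free to use any tool that can export that format.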

Driverless AI speeds up data science workflows by automating feature engineering, model tuning, ensembling and model deployment. Hemen Kapadia will give a quick overview of Driverless AI and its features: Automatic Feature Engineering, Machine Learning Interpretability, and Automatic Visualization.

Driverless AI turns Kaggle-winning recipes into production-ready code and is specifically designed to avoid common mistakes such as under- or overfitting, data leakage, or improper model validation. Avoiding these pitfalls alone can save weeks or more for each model, and is necessary to achieve high modeling accuracy.

With Driverless AI, everyone can now train and deploy modeling pipelines with just a few clicks from the GUI. Advanced users can use the client/server API through a variety of languages such as Python, Java, C++, Go, C# and many more. To speed up training, Driverless AI uses highly optimized C++/CUDA algorithms to take full advantage of the latest compute hardware.

For example, Driverless AI runs orders of magnitude faster on the latest Nvidia GPU supercomputers on Intel and IBM platforms, either in the cloud or on premises. There are two more product innovations in Driverless AI: statistically rigorous automatic data visualization and interactive model interpretation with reason codes and explanations in plain English. Both help data scientists and analysts to quickly validate the data and models.

After reviewing popular techniques used in supervised, unsupervised and semi-supervised machine learning, this presentation will cover feature selection methods in these different contexts, especially the metrics used to assess the value of a feature or set of features, be they binary, continuous or categorical variables. In this session we will go into deeper detail and review modern feature selection techniques for unsupervised learning, which typically rely on entropy-like criteria. While these criteria are usually model-dependent or scale-dependent, we introduce a new model-free, data-driven methodology in this context, with an application to an interesting number theory problem (simulated data set) in which each feature has a known theoretical entropy.

We will also briefly discuss high precision computing as it is relevant to this peculiar data set, as well as units of information smaller than the bit. Despite the apparent advanced level of this presentation, it is made accessible to a large audience of data scientists, ranging from practitioners to executives.
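For readers unfamiliar with entropy-like criteria, the following Python sketch scores discrete features by their empirical Shannon entropy in bits. It illustrates only the general kind of criterion mentioned above, not the model-free methodology introduced in the talk; the feature names and values are invented.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Empirical Shannon entropy, in bits, of a discrete feature."""
    counts = Counter(values)
    n = len(values)
    # Sum of p * log2(1 / p) over the observed categories, with p = c / n.
    return sum((c / n) * log2(n / c) for c in counts.values())


# Hypothetical unlabeled data set with three discrete features.
features = {
    "coin": [0, 1, 0, 1, 1, 0, 0, 1],                   # balanced binary feature
    "constant": [7] * 8,                                # carries no information
    "skewed": ["a", "a", "a", "a", "a", "a", "a", "b"], # mostly one category
}

entropies = {name: shannon_entropy(column) for name, column in features.items()}
ranked = sorted(entropies, key=entropies.get, reverse=True)

for name in ranked:
    print(f"{name}: {entropies[name]:.3f} bits")
```

A balanced binary feature reaches one bit, a constant feature scores zero, and a skewed binary feature falls in between, so each observation of it carries less than one bit.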

Data science isn’t just creeping into areas of modern business; it’s being targeted in every department. Gone are the days when data science (DS) was a one-off project meant to improve a single area of the company. Organizations now look to take advantage of DS advancements in every aspect of the business. The challenge arises when you want to operationalize your DS practice. How do you make a DS project repeatable from discovery to feature selection, through training, implementation and deployment? What measures can you use to state a return on investment (ROI) when traditional SDLC won’t satisfy customer demands for faster time to market and data volumes are growing too large for single-console solutions? Come see how Hortonworks (HWX) & IBM provide a connected platform for data science. The combination of HWX and IBM provides the best of both open-source standards and business solutions. I will walk through building a machine learning model, pushing processing to a massive amount of data, running in a secure environment and deploying the model, all on Hortonworks Data Platform & DSX. This talk will cover notebook development, Spark distributed execution, containerizing custom libraries and model deployment. Join me for the discussion and see why driving your company to be data science ready is a platform decision.

Self-driving cars are here, and how they are deployed will change the world. In this session, you will see how a miniature race car can be powered by open source to achieve faster time to innovation. Learn how data is captured, how Deep Learning models are created and trained in TensorFlow using pooled GPUs, and how they are deployed back to the car to improve functionality as new data is fed in. Witness how new innovations from Apache Hadoop 3.1 support leading technologies from Hortonworks and IBM to transform all industries.

FREE | ONLINE | CONFERENCE

Powered by

an IBM Community Virtual Event

July 24th, 2018

9am-6pm EDT/6am-3pm PDT

TALKS

IBM Code

SPSS Models in Watson Studio

Accelerate Training of Deep Learning Tasks with PowerAI

Scaling Jupyter Notebooks with Jupyter Enterprise Gateway

Automate Data Science Drudgery with Pixiedust

Getting Started with Decision Optimization for DSX

Statistics for Data Science: What You Should Know and Why

Level Up your Open SourceRy with Data Science + R

Machine Learning Competitions: Engaging, Learning, and Winning

IBM Sessions

Mega Trends in Data Science

Data Science Best and Worst Practices

How Decision Optimization Can Complement ML in Decision Making

IBM Db2 Event Store - Deriving Deep Insights from Fast Data

Extend SPSS Modeler Capabilities with Open Source

Tips on Generalizing Feature Creation and Assessing Imbalanced-Data Trees

Make Better, Faster, Smarter Decisions by Combining Machine Learning and Decision Optimization

Machine Learning Competitions: Engaging, Learning, and Winning

Integrating Predictive (SPSS Modeler) and Prescriptive Analytics (CPLEX Optimization Studio) - Campaign Optimization

Natural Language Classification in Analyzing Regulation

The Hybrid Enterprise Data Warehouse of the Future – Do More with your Data

Partner

What You Need to Know About Fast Data

Operationalizing Model Serving

A Look Under the Hood of Driverless AI

Scalable Automatic Machine Learning with H2O

Feature Selection for Unsupervised Learning

Data Science at Scale – A Platform Decision

Can Cars Think Like Humans?

More to be announced soon. Talk titles and abstracts are subject to change.

PARTNERS

MEDIA

This virtual event is brought to you by IBM Community.

Share. Solve. Do More.

community.ibm.com

Get in Touch.
