Getting started with Apache Beam and Kotlin

Are you looking for a simple getting started guide for working with Kotlin and Apache Beam? Well congratulations, you are in the right place. I’ve been doing some live coding on the subject, and I made a teeny tiny project to help me get started on new Beam projects quicker.

There is an official guide for getting started, so you should probably start there. However, that guide begins with a fully functional project and I prefer to build a minimum viable product from scratch that is easy to run in an IDE. So if you are like me, and prefer to type your own code then keep reading!

Note: If you don’t like typing either, you can just clone the repo and you are ready to go.

Wait, what is Beam?

Beam is an open-source and unified model for interacting with streaming and batch data. Have you heard of projects or methodologies like Spark, Flink, MapReduce or batch processing? If not, then this post will not be useful for you. If you have, though, and you haven’t heard of Beam – then please let me introduce you!

Beam is a project that lets you choose a supported language (currently Java, Python, and Go), and write code can run in either batch or streaming mode, and can run in a variety of engines (Spark, Flink, Apex, Dataflow, etc). That means you get to focus on your business logic, and let Beam worry about the implementation. Check out the compatibility matrices for more information.

Is this the future of batch and streaming data?
Maybe! Time will tell but I think it’s worth investigating if these areas interest you.

Create the project

First up, “File” > “New Project” and choose “Maven”. Check the “Create from archetype” checkbox and select the org.jetbrains.kotlin:kotlin-archetype-jvm, because Kotlin is awesome.

Choose the Kotlin JVM archetype, but you can use the default settings for everything else.

Add the minimal dependencies

I’m aiming for a bare-bones implementation here, so I’m just going to add the dependency for the direct runner directly to my project for now. Later we can break this out to a “profile” so you can optionally compile in the bits for the execution engines of your choice.


<!-- Add these to your dependencies in pom.xml -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.19.0</version>
</dependency>

<!-- This dependency is for the local runner. -->
<!-- Later we can move this one out to a profile -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-direct-java</artifactId>
  <version>2.19.0</version>
  <scope>runtime</scope>
</dependency>

<groupId>org.apache.beam</groupId>

</dependency>

<groupId>org.apache.beam</groupId>

<artifactId>beam-runners-direct-java</artifactId>

<scope>runtime</scope>

</dependency>

Create a simple pipeline

Now let’s setup a simple pipeline that is essentially a stripped down version of the official example.

I’ll paste the code below, but here is the outline of what is happening:

Create a pipeline with default options
Add a step to read from pom.xml
Add a step to split the text into text characters
Add a step to count the characters
Add a step to format the output to a collection of strings
Add a step to write the output to files, prefixed with “counts:
Execute the pipeline

fun main(args: Array<String>) {
    val p = Pipeline.create()
    p
            .apply<PCollection<String>>(TextIO.read().from("pom.xml"))
            .apply(
                    FlatMapElements.into(TypeDescriptors.strings())
                            .via(ProcessFunction<String, List<String>> { input -> input.toList().map { it.toString() } })
            )
            .apply(Count.perElement<String>())
            .apply(
                    MapElements.into(TypeDescriptors.strings())
                            .via(ProcessFunction { input -> "${input.key} : ${input.value}" })
            )
            .apply(TextIO.write().to("counts"))

    p.run()
}

fun main(args: Array<String>) {

val p = Pipeline.create()

.apply<PCollection<String>>(TextIO.read().from("pom.xml"))

.apply(

FlatMapElements.into(TypeDescriptors.strings())

.via(ProcessFunction<String, List<String>> { input -> input.toList().map { it.toString() } })

)

.apply(Count.perElement<String>())

.apply(

MapElements.into(TypeDescriptors.strings())

.via(ProcessFunction { input -> "${input.key} : ${input.value}" })

)

.apply(TextIO.write().to("counts"))

p.run()

}

Now you can run the code, and you’ll end up with several files in your project directory of the format counts-0000*!

(note: if your main method does not show up as runnable, you may need to right-click pom.xml and “Maven” > “Reimport”)

What is next?

I plan to use this project as a jumping off point for experimentation, but I also have some small plans for this project on it’s own.

Here is what I’m looking at doing to this repo next:

Adding tests
Adding profiles for multiple runners
Reading from a streaming source of data

Let me know if you’re interested in hearing more about Beam and I will be happy to write more about it.

Check out our YouTube channel for more information and live streaming with Beam!

Share the joy

Navigation

Wait, what is Beam?

Create the project

Add the minimal dependencies

Create a simple pipeline

What is next?

About Joe Zack